supercat

Content Count: 7,259
Joined
Last visited

Blog Comments posted by supercat
-
It claims to have a free input but looking above in the pinout, it's marked "Reserved" for some reason. Not sure why.

Most 16V8 chips have three modes of operation, which vary in terms of whether they allow latched or unlatched outputs, or individual control of chip-selects. The only mode that allows latched outputs requires the use of pin 11 as a common chip-select for all the latched outputs. An 18CV8 would not have that limitation but would otherwise be pinout compatible (the "16" in "16V8" refers to there being 16 inputs to the array). In the mode that allows latches, there are eight inputs and eight feedback inputs. The other two input signals are used for clock and output enable, but do not feed the array.
If you don't mind requiring that any bank-switch instructions be run from ROM rather than RIOT RAM, you may be able to add another banking bit. Use the same PLD output for the EPROM /OE and the RC feedback circuit. Since /OE goes high during any banking address, that should let you save an output pin provided that your latching terms will keep their current value if A12/A11 aren't "0/1". Not sure what one extra output pin would be good for in the absence of an extra input pin to go with it.
-
Thought about that one too. Unfortunately the dots are one row above the score bottom line and the copies therefore would show... ("but they'll be covered up by score digits"). So move them down a row. Or else set the playfield black and priority, and arrange things so the extra missile copies are off to the left.
-
Though, the two dots might help here. Only one load instead of two, saving 5 cycles. I'll have a look at it tonight.

How about setting one sprite to 'two copies close' and one to 'two copies far'? Use the missiles for the periods (you'll get a second "copy" of each, but they'll be covered up by score digits). Since you'll only have four sprite shapes to worry about you should have no trouble hitting COLUPx.
-
John, does this mean that I can just throw a different oscillator in and the cart will be PAL compatible, or are there other differences I need to take into account?

Frequency should be the only issue. And if you include the right sort of logic in the CPLD you may be able to get by with only one frequency for both (4A50 doesn't support that sort of thing because I was very cramped in the XC9536XL).
BTW, I don't recall your having answered: at what point in the CPU cycle do you switch from reading the SRAM to floating the bus and writing to it? I was thinking that if you're a tiny bit late switching, the system would still work but you would have brief spikes of excessive current draw.
-
The CPLD is clocked at 30MHz, not 50. That oscillator you suggested sounds great, except the programming device costs $450, ouch. Thanks for the tip though. I will look into programmable ones and the 24X one. I see it's only available at 5V; I could still use it though.

The prices quoted included having Digi-Key program the device.
-
The budgeting for this cart is really precarious in order to make its target retail price. There are some hard decisions to be made regarding components like this which would only be used for Chimera Native, but not required for the multicart portion. It would be an easier call to make if someone were to write a quick but impressive title in Chimera Native to bundle into the package.
If the 50MHz oscillator is accurate, dividing it by 42 will yield a 6507 clock rate within about 0.25%; that should be adequate accuracy for any normal purpose if the oscillator itself is precise. If the ARM's oscillator is tunable, that could be even more precise.
BTW, I'm curious whether some of the video noise might be a "beat" frequency between (ARM clk/42) and (chroma clk/3). And how long into each cycle do you drive data (when using "lazy writes")?
-
-1- Is the CPLD latching the data, or are you assuming that the ARM will find out later what data was written by looking at the SRAM?

The VCS triggers an ARM interrupt by just accessing the specific address. If the VCS is reading it will trigger; if the VCS is writing it will trigger. Doesn't matter. When that happens, the CPLD latches the lowest 8 bits in a register and interrupts the ARM. The ARM then jumps to its interrupt service routine, which reads out the value the CPLD latched and uses it as an index into a jump table. If that specific hotspot function needs to read SRAM to gather more information, then it does. It's entirely function dependent. So to answer your question directly, it can do both.
-2- If I recall, you're using a Xilinx 9500-series CPLD? How are you doing for product terms? Are you really tight, or do you have oodles to spare, or somewhere in-between?

The basic CPLD template without any bank schemes in it has tons of room. The empty template has the clocking code, the SRAM arbitration code, and some miscellaneous other small things. Plenty of room to add any bank scheme anyone can think of. The ARM can swap CPLD configuration dynamically as needed for different bank schemes. So the direct answer is, there is almost always room in the CPLD. You just add another potential configuration file to the mix.
Having $3x and $7x act as shortcuts for $103x and $107x might be a simple way to handle things then. One extra product term, which there should be plenty of space for.
No problem, the CPLD can read and write the SRAM triggered by any VCS address. A12 being set is a function of the specific bank scheme. I am not entirely sure what you mean here with the contention stuff. There is never any bus contention, ever, unless you make the design have contention. The SRAM has its own data and address bus. The VCS has its own data and address bus. The ARM has its own data and address bus. You can tie them together any way you choose with the CPLD.

The cartridge can supply six bits of data to the VCS on TIA addresses; the TIA will want to supply the top two bits. Using conventional read techniques, an attempt by the cartridge to supply 8 bits of data simultaneous with a TIA access would cause bus contention (TIA tries to drive one value while the cart drives another). Using magic writes should avoid physically-damaging bus contention, though when a read of those addresses is performed, it will look as though the top two bits are being written with whatever data the TIA is putting out.
I am clocking the CPLD with a 30MHz clock so the error shouldn't be hard to calculate. But my current clocking circuit resyncs anytime there is a change in any of the 13 VCS address bits. I can go a few cycles without an address change and still be in sync with an accurate data clock. I am certain 76 cycles of no address change will have sync problems if you are trying to grab something from the VCS data bus.

Do you have any logic to disable magic reads after four cycles with no address change? Also, have you looked into programmable oscillators? Something like CPPX7-A7BR-ND from Digi-Key is available in any desired frequency; $4.38 for one, $3.75 in tens, or $2.50 in hundreds.
On the other hand, I just looked at standard frequencies and you should do fine even with those. 28.63636MHz are available from Digi-Key as a standard frequency (that's CPUCLK*24), and the 50MHz crystal you're using actually isn't bad (it's within 0.25% of CPUCLK*42).
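As a quick sanity check on those two frequency claims, the arithmetic can be done in a few lines (a throwaway Python calculation; nothing here is cart-specific):

```python
# How closely do candidate oscillators approximate the NTSC 6507 clock?
# The 6507 runs at the NTSC color clock divided by 3: 3.579545 MHz / 3.
CPUCLK = 3_579_545 / 3          # ~1.19318 MHz

def error_pct(osc_hz, divisor):
    """Percent error of (osc_hz / divisor) relative to the 6507 clock."""
    return abs(osc_hz / divisor - CPUCLK) / CPUCLK * 100

# 28.63636 MHz is CPUCLK*24 exactly (it's 8x the color clock).
print(error_pct(28_636_360, 24))   # essentially zero
# 50 MHz / 42 lands just inside the quoted 0.25% tolerance (about 0.23%).
print(error_pct(50_000_000, 42))
```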
-
To draw the walls in SuperBug, do something like:
; X and Y hold current X and Y positions

; Part 1 (moving upward)
AllDone_Up:
         rts
Up_1x:
         asl shifter
         bcs GoUp
GoUp2:
         dey
         jsr putpixel
GoUp:
         dey
         jsr putpixel
         asl shifter
         bne LoopUp
FetchUp:
         jsr fetch_shifter   ; Fetch byte from appropriate bank, set carry,
                             ; rotate left, and store in shifter
         beq AllDone_Up
LoopUp:
         bcs Up_1x
Up_0x:
         asl shifter
         bcs GoUpLeft
         bcc GoUpRight

; Part 2 (moving up and to the left)
AllDone_UpLeft:
         rts
UpLeft_1x:
         asl shifter
         bcs GoUpLeft
GoLeftUp2:
         dex
         dey
         jsr putpixel
GoUpLeft:
         dex
         dey
         jsr putpixel
         asl shifter
         bne LoopUpLeft
FetchUpLeft:
         jsr fetch_shifter   ; Fetch byte from appropriate bank, set carry,
                             ; rotate left, and store in shifter
                             ; Return zero if I've fetched enough
         beq AllDone_UpLeft
LoopUpLeft:
         bcs UpLeft_1x
UpLeft_0x:
         asl shifter
         bcs GoLeft
         bcc GoUp

; Ignoring parts 3-8 (other directions) but they're similar.

putpixel:
         bit INTIM           ; Is our time almost expired?
         bmi pp_nokernel
         stx draw_x
         sty draw_y
         jsr KERNEL
         ldy draw_y
         ldx draw_x
pp_nokernel:
         cpy #12             ; Number of rows
         bcs kernel_exit
         cpx #32             ; Number of columns
         bcs kernel_exit
         lda ptrs_list,x
         sta write_ptr
         lda (write_ptr),y
         ora mask_list,x
         sta (write_ptr),y
kernel_exit:
         rts
There's one part of the code for each of eight directions (only two parts shown). If two bits in the shape table are "11", it should move straight one pixel. If "10", two pixels. If "01", turn left by one direction and move straight one pixel. If "00", turn right by one direction and go straight one pixel.
If the "KERNEL" call decides to scroll the screen, it should adjust "draw_x" and "draw_y" appropriately.
-
I would think that if you could have a column or two on each side, and a row or two on the top and bottom, that weren't displayed, you might be able to manage some much better compression using a 'deltas' approach. Store the outlines of the track as a sequence of "steps" (e.g. straight, straight, right, straight, left, straight, straight, etc.) One approach would be to use eight directions but assume that there would never be any 90 degree bends; every two bits would choose from: "one forward", "one forward then left", "one forward then right", or "two forward". You would need a drawing routine which would ignore any points that were outside of the screen bitmap.
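To make the scheme concrete, here is a rough Python model of such a decoder (purely illustrative; the function name and the 12x32 bitmap bounds are made up for this sketch, and the bit meanings assumed are: 11 = straight one pixel, 10 = straight two, 01 = turn left then one, 00 = turn right then one):

```python
# Eight directions, clockwise: 0=up, 1=up-right, 2=right, ... 7=up-left.
STEPS = [(0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1)]

def draw_track(data, x, y, direction, rows=12, cols=32):
    """Decode packed 2-bit steps; return the in-bounds pixels plotted."""
    pixels = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            code = (byte >> shift) & 3
            if code == 0b01:                 # turn left one direction
                direction = (direction - 1) % 8
            elif code == 0b00:               # turn right one direction
                direction = (direction + 1) % 8
            repeat = 2 if code == 0b10 else 1   # "two forward" case
            for _ in range(repeat):
                dx, dy = STEPS[direction]
                x += dx
                y += dy
                if 0 <= x < cols and 0 <= y < rows:  # ignore off-bitmap points
                    pixels.append((x, y))
    return pixels
```

For example, a byte of all-ones steps straight four pixels: `draw_track([0b11111111], 5, 10, 0)` plots (5,9) through (5,6).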
Even with an optimized drawing routine, you wouldn't be able to draw the entire track in a reasonable amount of time. To allow for that, you would probably need to divide the track into squares; you'd have a table which listed, for each square, which portions of the track should be drawn while the vehicle was within that square. For example, if the squares were 16x16, your table might say to draw track data starting at $1523, at coordinate 12,56, facing upward, and draw 23 pixels; use data starting at $1915, at coordinate 15,83, facing upward, and draw 28 pixels. Using larger squares would decrease the storage required for the index, but would increase the amount of drawing that one had to do for each refresh.
The amount of storage required for the track would be quite reasonable--especially if the track was fairly sparse. The question would be whether one could draw stuff fast enough that one could get by without an overly dense index.
Incidentally, one advantage of this approach would be that one could have tracks that overlap themselves (e.g. figure-eights) provided that the overlapping portion scrolled completely off-screen between visits. There's still a lot of trickiness required, though.
-
The only limitation to selecting hotspots is with addresses with the same lower 8 bits. Like if you wanted to trigger at 0x0040 and 0x1040. At the time the CPLD raises the ARM interrupt line it latches the lower 8 bits of the address. The ARM then reads the latched value to know which interrupt was triggered. The value is actually used as an index into a giant function jump table. So for both of those spots the ARM would jump to the same function with the same 8-bit value as a parameter to that function.

A few questions, if I may:
-1- Is the CPLD latching the data, or are you assuming that the ARM will find out later what data was written by looking at the SRAM?
-2- If I recall, you're using a Xilinx 9500-series CPLD? How are you doing for product terms? Are you really tight, or do you have oodles to spare, or somewhere in-between?
-3- Would there be any difficulty making accesses to $0030-$003F and $0070-$007F access a small area of SRAM in addition to triggering the hotspots? I think the TIA doesn't drive D6-D7 until the latter half of an instruction cycle, so if you do a byte-wide read and then get off the bus, you should be safe from bus contention. A 6507 read operation would copy the TIA data to bits 6-7 of the SRAM, but a 6507 write operation would store data to the SRAM nicely.
-4- Since you're not using a multiple of 3.579545MHz to drive the CPLD, is your timing such that you can handle four consecutive addresses where A0 remains the same and the fourth is a write, or 76+ consecutive addresses without an A0 change where all are reads?
-
So the patch offsets can be relative to that hotspot location. That way the entire kernel chunk is relocatable.

I'm not sure I follow you here. The address of the kernel should be known at assembly time, so there should be no problem including a list of absolute-address patch points. Unless you want people to be limited to a certain exact kernel, there has to be a list of patch points. So I don't see how relative addresses are better than absolute.
In your example you have a couple NOPs so I'm sure you can replace one of the load operations with a traditional queue variety to establish the trigger-pattern for the patching.

I was just throwing that together quickly. I don't know how much slack time exists at which parts, and what one might want to do with it (e.g. modify ENABL, HMBL, CTRLPF, etc.)
If the CPLD can trigger the ARM at addresses $0040-$007F, that should probably eliminate the need for fancy timing on the ARM.
-
For normal queue usage, the ARM is reacting in realtime to absolute mode reads by the VCS... This is something you'd only need for really bleeding edge kernels. Most would be just fine with regular queues.
If one were trying to eliminate hardware, the "normal mode" queues would have the advantage that they could be implemented without dual-porting RAM. If the dual-ported RAM exists, though, I don't see a whole lot of disadvantage to using immediate mode queues.
If I were displaying something like a 13-column flicker-blinds text demo with some other stuff thrown in (playfield, etc.) it would look something like (but with different timing and better formatted):
lp:
; Line 0
PP0_PF1  lda #0
         sta PF1
PP0_PF2  lda #0
         sta PF2
PP0_PL0  lda #0
         sta GRP0
PP0_PL2  lda #0
         sta GRP1
PP0_PL4  lda #0
         sta GRP0
PP0_PL6  lda #0        ; Figure sprite starts just before here
         sta GRP1
         sta RESP0
PP0_PF3  lda #0
         sta PF2
PP0_PL8  lda #0
         sta GRP0
PP0_PF4  lda #0
         sta PF1
PP0_PL10 lda #0
         sta GRP1
PP0_PL12 lda #0
         sta GRP0
         sta GRP1
; Line 1
         nop 0
PP1_PF1  lda #0
         sta PF1
PP1_PF2  lda #0
         sta PF2
PP1_PL1  lda #0
         sta GRP0
PP1_PL3  lda #0
         sta GRP1
PP1_PL5  lda #0
         sta GRP0
PP1_PL7  lda #0        ; Figure sprite starts just before here
         sta GRP1
PP1_PF3  lda #0
         sta PF2
         nop 0
PP1_PL9  lda #0
         sta GRP0
PP1_PF4  lda #0
         sta PF1
PP1_PL11 lda #0
         sta GRP1
         sta GRP0
; Line 2
PP2_PF1  lda #0
         sta PF1
PP2_PF2  lda #0
         sta PF2
PP2_PL0  lda #0
         sta GRP1
PP2_PL2  lda #0
         sta GRP0
PP2_PL4  lda #0
         sta GRP1
PP2_PL6  lda #0        ; Figure sprite starts just before here
         sta GRP0
         sta RESP1
PP2_PF3  lda #0
         sta PF2
PP2_PL8  lda #0
         sta GRP1
PP2_PF4  lda #0
         sta PF1
PP2_PL10 lda #0
         sta GRP0
PP2_PL12 lda #0
         sta GRP1
         sta GRP0
; Line 3
         nop 0
PP3_PF1  lda #0
         sta PF1
PP3_PF2  lda #0
         sta PF2
PP3_PL1  lda #0
         sta GRP1
PP3_PL3  lda #0
         sta GRP0
PP3_PL5  lda #0
         sta GRP1
PP3_PL7  lda #0        ; Figure sprite starts just before here
         sta GRP0
PP3_PF3  lda #0
         sta PF2
         nop 0
PP3_PL9  lda #0
         sta GRP1
PP3_PF4  lda #0
         sta PF1
PP3_PL11 lda #0
         sta GRP0
         sta GRP1

; Patch table
PG_LP:
         QUEUE_GROUP_START
         PATCH_ITEM PP0_PF1, QUEUE_PF1,1
         PATCH_ITEM PP0_PF2, QUEUE_PF2,1
         PATCH_ITEM PP0_PF3, QUEUE_PF3,1
         PATCH_ITEM PP0_PF4, QUEUE_PF4,1
         PATCH_ITEM PP0_PL0, QUEUE_P0,2
         PATCH_ITEM PP0_PL2, QUEUE_P2,2
         PATCH_ITEM PP0_PL4, QUEUE_P4,2
         PATCH_ITEM PP0_PL6, QUEUE_P6,2
         PATCH_ITEM PP0_PL8, QUEUE_P8,2
         PATCH_ITEM PP0_PL10,QUEUE_P10,2
         PATCH_ITEM PP0_PL12,QUEUE_P12,2
         QUEUE_GROUP_START
         PATCH_ITEM PP1_PF1, QUEUE_PF1,1
         PATCH_ITEM PP1_PF2, QUEUE_PF2,1
         PATCH_ITEM PP1_PF3, QUEUE_PF3,1
         PATCH_ITEM PP1_PF4, QUEUE_PF4,1
         PATCH_ITEM PP1_PL1, QUEUE_P1,2
         PATCH_ITEM PP1_PL3, QUEUE_P3,2
         PATCH_ITEM PP1_PL5, QUEUE_P5,2
         PATCH_ITEM PP1_PL7, QUEUE_P7,2
         PATCH_ITEM PP1_PL9, QUEUE_P9,2
         PATCH_ITEM PP1_PL11,QUEUE_P11,2
         QUEUE_GROUP_START
         PATCH_ITEM PP2_PF1, QUEUE_PF1,1
         PATCH_ITEM PP2_PF2, QUEUE_PF2,1
         PATCH_ITEM PP2_PF3, QUEUE_PF3,1
         PATCH_ITEM PP2_PF4, QUEUE_PF4,1
         PATCH_ITEM PP2_PL0, QUEUE_P0,2
         PATCH_ITEM PP2_PL2, QUEUE_P2,2
         PATCH_ITEM PP2_PL4, QUEUE_P4,2
         PATCH_ITEM PP2_PL6, QUEUE_P6,2
         PATCH_ITEM PP2_PL8, QUEUE_P8,2
         PATCH_ITEM PP2_PL10,QUEUE_P10,2
         PATCH_ITEM PP2_PL12,QUEUE_P12,2
         QUEUE_GROUP_START
         PATCH_ITEM PP3_PF1, QUEUE_PF1,1
         PATCH_ITEM PP3_PF2, QUEUE_PF2,1
         PATCH_ITEM PP3_PF3, QUEUE_PF3,1
         PATCH_ITEM PP3_PF4, QUEUE_PF4,1
         PATCH_ITEM PP3_PL1, QUEUE_P1,2
         PATCH_ITEM PP3_PL3, QUEUE_P3,2
         PATCH_ITEM PP3_PL5, QUEUE_P5,2
         PATCH_ITEM PP3_PL7, QUEUE_P7,2
         PATCH_ITEM PP3_PL9, QUEUE_P9,2
         PATCH_ITEM PP3_PL11,QUEUE_P11,2
         QUEUE_LOOP LOOPY_QUEUE,PG_LP
A bit verbose, but not terribly complicated. You have a bunch of macros listing out all the patches for each line and which queue should control them.
-
More fun. This time - switchbacks!

The former would compress to about 512 bytes total. The latter would be over 4K.
-
If you are suggesting that the ARM get interrupted at the top of the screen and stay in this routine that handles immediate mode patching straight through to the start of VBLANK, a routine consisting mostly of waiting around, forget it. Even if it were possible to get the timing right, we can't allow it to do that. It has to share CPU time between servicing CPLD interrupts and servicing peripherals.

What peripherals are you needing to service during a game?
With an alternating kernel, the ARM would be far enough ahead of the VCS that it could finish patching the next block's bytes and have plenty of time to spare for peripherals before the next interrupt crops up.

Could the ARM have an interrupt every scan line (timing referenced off some spot the VCS hits once per frame), or how accurate is the ARM's clock? Even being three scan lines fast in the course of a frame wouldn't hurt anything if the loop was unrolled for at least four repetitions.
There is no way to use queues of any sort unless you "read from the ARM", because the ARM is what makes the queue behavior happen. The CPLD just alerts the ARM that an address was hit and little else. A SRAM queue read is effectively the ARM copying from one area of SRAM to another. In fact, all your SRAM queues can be accessed in the traditional manner if you know where they are and bank them in properly. ARM RAM queues copy from ARM RAM to SRAM. So trying to get the VCS to work with ARM RAM in a flat memory model can only be simulated in a clunky way.

Yes, the ARM has to be triggered to make the queues happen, but my thought is that the ARM would fetch data from queues and put it into SRAM from which the 2600 would later retrieve it. The ARM would be involved with putting the data in place for the 2600, but the ARM would not be involved in the read cycle the 6507 would eventually perform to retrieve it.
Some programmers may prefer to write to the low-level SRAM queue strips directly by banking them in rather than through the hotspots. We can't do this right now because the current Chimera CPLD doesn't support bankswitching.

Bankswitching would certainly be helpful.
I've noticed when there are two paths to the same goal, you favor doing things the harder, more machine-language/cycle-counting way. The facilities that are built into the cart are intended to make it easier to code. If you don't like our solutions, you'll always have the freedom to write your own ARM subroutines and load them in with your game, albeit constrained by the size of ARM RAM.

Pushing things to the limit is what the 2600 is about. Not necessarily trying to make them complicated, but rather trying to optimize them for the task at hand.
Since you have the dual-port RAM, the optimal behavior for the ARM queues would seem to be to throw the data in RAM in front of where the CPU is going to 'naturally' fetch it (typically immediate mode operands). So have a command to fetch a byte from a queue and shove it to a specified address. Since some data structures don't show every byte on every frame, and bytes may need to be fetched more than once, provide a means to bump the queue pointer by a value other than one. So as to avoid a really huge ARM "script", support counter loops. To minimize overhead on the 6507 end of things, use timing to figure out roughly when things should be patched (recognizing that precise timing isn't required).
There are a few other things that would be helpful when reading out data within a script (e.g. shift data left/right one bit before storage) but I would expect the code should be fairly straightforward.
-
RLE might work reasonably well for tracks up to 256 wide if you keep separate pointers for the top and bottom of the visible window. Each RLE 'item' would be two bytes (left and right side of the 'extent'). The first item for each row would have the two bytes reversed. If you know where the top row is stored and where the bottom row is stored, you could search through memory reasonably well to find the next/previous row. But using two bytes per RLE item would probably limit your savings. Further, making horizontal motion fast might be a challenge, though if you had a few pixels of buffer outside the screen you could probably manage. Of course, memory is very tight so that may be easier said than done.
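The byte-swapped row marker idea can be sketched in a few lines of Python (hypothetical names; the trick relies on a normal extent always having left <= right, so a swapped pair with right > left is unambiguous as a row boundary):

```python
def encode_rows(rows):
    """Pack rows of (left, right) extents into a flat byte list.
    The first item of each row is stored swapped to mark the row start."""
    out = []
    for extents in rows:
        for i, (left, right) in enumerate(extents):
            out += [right, left] if i == 0 else [left, right]
    return out

def next_row_start(data, pos):
    """From a row start at `pos`, scan forward to the next row start.
    Normal items satisfy data[pos] <= data[pos+1]; the swapped marker
    breaks that ordering (assumes every extent has left < right)."""
    pos += 2                                   # skip the swapped marker
    while pos < len(data) and data[pos] <= data[pos + 1]:
        pos += 2
    return pos
```

The same ordering test works walking backward, which is what makes separate top- and bottom-of-window pointers practical.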
Also mentioned was adding an EEPROM to a board, which is also a good idea for a number of reasons, but I'm not sure would be fast enough to handle the 4-way scrolling in Superbug.

If you use a 24FC-series EEPROM with a suitable PLD, I would expect that you could read data at about 50 cycles/byte. That would certainly be fast enough if you had two copies of the maze stored in EEPROM--one in 'row-major' format and the other in 'column-major'. If there's just one copy of the maze data, things could be trickier, especially if you need to work in limited RAM.
-
Does the ARM have a low-precision oscillator? Even a crummy crystal would be accurate to 100ppm or so, which would be about 1.6us/frame.
It might be possible to multiplex this "sync" signal with some other useful purpose. For instance, it could be both a queue AND the trigger for the byte patching. That way all you are sacrificing is altering one load on the scanline from immediate to absolute addressing mode. I'm sure you'll have enough time somewhere in the kernel for that.

Eh, maybe. As an alternative, you could have a function on the CPLD to trigger on writes to $0040-$007F. Those could be done with zero cycle cost.
BTW, why are you attached to the idea of using only $1F00-$1FFF for Chimera control? If your CPLD is designed to give a few bits of address and 8 bits of data to the ARM whenever there's an "interesting" access, I really would suggest that for 'write' accesses you use something like:
- $0030-$003F : Latch lower 4 bits of address into 'address' input, and data bus into 'data' input
- $0040-$007F : Latch 0 into 'address' input and data bus into 'data' input (used purely for triggering)
- $04xx-$0FFF : Latch address bits 8-11 into 'address' input and address bits 0-7 into 'data' input
The only 'read' address logic I can see being helpful would be to decode addresses $19-$1A (AUDvx). Those should return the next audio output value in bits 1-4, with status in bits 0 and 5. One status bit should indicate that the ARM is able to accept commands into its input buffer; the other should indicate that the ARM has finished processing all commands and is idle. Bits 6-7 should be left floating. Note that you need to make sure your game code is running before enabling reads to those addresses, to avoid bus conflicts on a 7800.
I'm not sure I see much need to 'read' from the ARM other than for Pitfall-style audio (use an LSR AUDVx). Anything the 6507 would want to read from the ARM could simply be stored into some convenient area of RAM (the area to be used would vary based on the application).
-
You're not going to script out any manual cycle counting thing on the ARM. It's just not going to be predictable.

The ARM has some sort of timers, doesn't it? If you have a counter that runs at a particular speed and you know what that timer read when the 'go' hotspot was triggered, you can tell pretty accurately what that timer should read after 76 cycles of the 6507, and what it should read 76 cycles after that, etc.
The idea would be that if my 'program' looks something like:
start:
         load_queue_item blahblah   ; Group A
         load_queue_item blahblah
         load_queue_item blahblah
reloop:
         wait 10
         load_queue_item blahblah   ; Group B
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         load_queue_item blahblah
         wait 36
         load_queue_item blahblah   ; Group C
         wait 30
         decbra 5,1,reloop
the first set of loads (Group A) would happen as soon as possible after the hotspot trigger. Group B would happen 10 cycles after the hotspot trigger, regardless of how long the load_queue operations took (provided they didn't take too long). Group C would start 46 cycles after the trigger; group B would start again 76 cycles after the trigger, etc.
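The timing rule being proposed can be modeled in a few lines of Python (a sketch only; the `schedule` function, the flat 2-cycle cost per load, and the group labels are all invented for illustration):

```python
def schedule(script, cycles_per_item=2):
    """Return the 6507-cycle time at which each labeled group begins.
    `script` is a list of ('load', label) and ('wait', n) entries.
    Waits advance an absolute deadline measured from the trigger, so
    group start times don't depend on how long earlier loads took."""
    deadline = 0   # absolute time, in 6507 cycles since the hotspot trigger
    now = 0        # when the ARM actually reaches each instruction
    starts = {}
    for op, arg in script:
        if op == 'wait':
            deadline += arg
            now = max(now, deadline)   # stall only if we arrived early
        else:
            starts.setdefault(arg, now)
            now += cycles_per_item
    return starts

# Groups A/B/C from the example: B starts at 10, C at 10+36 = 46.
script = ([('load', 'A')] * 3 + [('wait', 10)] +
          [('load', 'B')] * 9 + [('wait', 36)] +
          [('load', 'C')])
print(schedule(script))
```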
What you'd do is just try to find pockets of time for the ARM to go in and finish its work (worst case) before the VCS needs it.

The ARM would be specified as taking somewhere between 0 and (some number) of 6507 cycles to run each instruction. Code that wants to ensure that a memory address doesn't get loaded too early would have to put in a wait before the load. It would not be possible to predict precisely when any particular load would occur, but that shouldn't matter. Using a structure like the above, you could easily ensure that addresses that will be used in the second half of a scan line get written in the first and vice versa.
Then you send a block of configuration data to the ARM that says "when I hit this hotspot, update these addresses with these queues. When I hit this other hotspot, update these alternate addresses with these queues." So all you need is a single absolute access on each kernel line for the CPLD to detect the hotspot access and trigger the ARM. Depending on the hotspot hit, the ARM will know which section of the kernel is being run, top or bottom, even or odd, however you want to refer to it.

That might be possible, but if the ARM has a timer (as I'm sure it must) I see no particular reason not to use it to boost available 6507 time by over 5%.
When you are ready, just send it our way with some very good documentation on how the queue data structures should be.

Well, basically what you'd have would be a piece of code that draws sprites in the pattern specified by immediate-mode operands. How those operands get loaded from the queue would be a separate issue from designing the kernel to use them.
BTW, I don't know about you but I am partial to using full 8-bit wide fonts. There is little creativity possible with 4-bit fonts. It's nice being able to cram more words on a line, but DOS filenames are limited to 8+3 anyway. So the regular 12-char kernel that I have which is centered should be okay for the menu. Having more bits to work with is more of a benefit for text adventures, text manuals, or generic graphics displays.

The limit for full 8-wide characters is probably 14/line. It might be possible to push things to 15/line if you use the Ball to fill in the gaps. Otherwise, my target is to allow 16 or 18 7-dot-wide characters.
-
All queues can seek to any position.

I'm still working to see what I can push out using immediate-mode timing. I think I can probably manage a 32-column text kernel, and maybe even 36. That would require a somewhat different style of queues from what you'd been looking at, but it could be extremely powerful and I don't see many disadvantages compared with the other style. I'll see if I can get a demo working tonight to really show off the technique.
I'm not familiar enough with the ARM to know how best to set things up, but I would think it might be helpful to have a 'program' of 32-bit 'instructions' something like the following:
- 00000001 nnnnnnnn aaaaaaaa aaaaaaaa
-- Set base pointer for queue 'n' to 'a'
- 00000010 nnnnnnnn aaaaaaaa aaaaaaaa
-- Set current offset of queue 'n' to 'a'
- 00000011 nnnnnnnn aaaaaaaa aaaaaaaa
-- Set length of queue 'n' to 'a'
- 0001cccc nnnnnnnn aaaaaaaa aaaaaaaa
-- Read a byte of data from queue 'n' and store it to address 'a' of RAM, bumping pointer by 'c'
- 0010cccc nnnnnnnn aaaaaaaa aaaaaaaa
-- Subtract 'c' from offset of queue 'n'; if result <0, goto 'a'
- 0011cccc nnnnnnnn aaaaaaaa aaaaaaaa
-- Subtract 'c' from offset of queue 'n'; if result >=0, goto 'a'
- 01000000 -------- aaaaaaaa aaaaaaaa
-- Goto address 'a'
- 01010001 -------- aaaaaaaa aaaaaaaa
-- "Call" address 'a'
- 01100000 -------- -------- --------
-- "Return"
- 01110000 -------- nnnnnnnn nnnnnnnn
-- Wait 'n' cycles
The "wait" operations should be measured relative to the last 'wait' operation, or else the hotspot access that started execution of the queue program. So if the first 'wait' operation is a 'wait 76', then provided the ARM took less than ~64us (76 CPU cycles) to process commands up to that point, it should wait until a time ~64us after the trigger hotspot access. For example, if processing commands up to that point took 20us, then it should wait ~44us.
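A decoder for these 32-bit words is almost mechanical; here is a hypothetical Python sketch (the tuple tags are made up, and the word layout assumed is: top byte = opcode with 'c' in its low nibble where applicable, next byte = queue number 'n', low 16 bits = 'a' or the wait count):

```python
def decode(word):
    """Decode one 32-bit queue-program word into a labeled tuple."""
    op = (word >> 24) & 0xFF
    n = (word >> 16) & 0xFF
    a = word & 0xFFFF
    if op == 0x01: return ('set_base', n, a)
    if op == 0x02: return ('set_offset', n, a)
    if op == 0x03: return ('set_length', n, a)
    if 0x10 <= op <= 0x1F: return ('read_store', n, a, op & 0x0F)  # bump by c
    if 0x20 <= op <= 0x2F: return ('sub_blt', n, a, op & 0x0F)     # goto if <0
    if 0x30 <= op <= 0x3F: return ('sub_bge', n, a, op & 0x0F)     # goto if >=0
    if op == 0x40: return ('goto', a)
    if op == 0x51: return ('call', a)
    if op == 0x60: return ('return',)
    if op == 0x70: return ('wait', a)                              # low 16 bits
    return ('unknown', word)
```

For instance, 0x13020040 would decode as "read a byte from queue 2, store it to address $0040, bump the pointer by 3".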
-
Both of these issues could be solved if I had a boot ROM. The ARM has time to load 4K into the SRAM before the VCS even requests its first valid address.

What would a boot ROM buy you? If you have hardware to allow the 2600 to operate in an "unbanked" mode without requiring the CPLD to be programmed, I would think it could run out of SRAM. And if you don't have such hardware, having a ROM wouldn't help.
I'm working now to see what sort of insane multi-sprite kernel I can manage. Timings are tough, and I may have to find some new tricks.
-
I see no reason why it couldn't work and there are probably various ways it could be done, but I have no kernels on file that would only be possible with these techniques.

Naturally, since one would be hard-pressed to do much with such a kernel other than show a static screen, there hasn't been a whole lot of work put into them. But I'll see if I can cobble together a demo of 112-pixel mode.
-
Initially I was just expecting these kernels to use absolute mode queues, not immediate mode patching.

That's what I would have expected too, but I've been playing around with some stuff and found myself a few cycles short.
Getting 104 pixels using even (zp),y addressing is not a problem, but pushing it to a "clean" 112 pixels wide even with absolute mode may not be possible. The former requires, for each pair of lines, two HMOVEs and one RESPx. The latter requires, if I remember right, two HMOVEs, four RESPx, two HMCLR, and two loads of HMPx with proper values. That's 34 cycles out of 152 to handle sprite motion. At least 98 would be required for GRPx data if everything was handled with a load abs,x and store. That's 132 out of 152, leaving 20. But the horizontal motion stores have to be timed precisely, leaving many 'gaps' which are not of usable size. If there are five cycles between two RESPx operations (the operations themselves are 24 pixels apart) it's possible to nestle a "LDA #xx/STA GRP0" in there, but if one is using abs,y addressing there's going to be at least one dead cycle if not two.
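For what it's worth, here is one accounting (a hypothetical Python check using standard 6507 cycle costs) that reproduces those numbers, assuming 14 eight-pixel sprite bytes per pair of lines:

```python
# Standard 6507 cycle costs: sta zp = 3, lda #imm = 2, lda abs,x = 4
# (no page crossing).
STA_ZP, LDA_IMM, LDA_ABX = 3, 2, 4
PAIR_LINES = 2 * 76                    # two scan lines of 76 cycles

# Sprite motion per pair of lines: 2 HMOVE, 4 RESPx, 2 HMCLR (plain
# stores), plus 2 immediate loads and stores of HMPx.
motion = (2 + 4 + 2) * STA_ZP + 2 * (LDA_IMM + STA_ZP)

# 14 GRPx updates, each a lda abs,x / sta zp pair.
grp = 14 * (LDA_ABX + STA_ZP)

print(motion, grp, PAIR_LINES - motion - grp)   # 34, 98, and 20 left over
```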
-
The raw cost of the fully loaded cart is already about twice as much as what I was originally hoping for. Stripping out the nonessential components actually doesn't save a heck of a lot. The big ticket items are the beefier ARM, the CPLD, and the 4-layer board (which was required to enable the super wide bus).

My personal way of thinking is that whatever amount of hardware you want to throw at a cart, you should try to exploit it to the utmost. If anything in your cartridge isn't a bare minimum (e.g. the 32K RAM and 128K flash are the cheapest RAM and parallel flash chips available) and you're not exploiting it fully, you should try for something cheaper.
Adding a generous amount of RAM to the 2600 significantly extends its capabilities, especially when combined with a modest amount of bank-switching logic. What exactly is the purpose of the ARM?
Trying to push a 2600 to the limits by dynamically patching immediate-mode operands seems a little absurd, but going through the trouble of putting on a powerful micro and super-wide bus and then not seeking to push things to the limit also seems absurd.
What exactly is the purpose of the ARM and the super-wide bus, if not to push the 6507 to its absolute limits of performance? Since I hadn't realized you'd actually set up a dual-port memory arrangement for yourself, my suggestion of counting cycles for bus insertion may not have been the best. A better approach may simply be to have a list of patch points that get loaded from queues at particular times. If you did that, the ARM's timing wouldn't have to be so precise as to jam things into a particular cycle.
If the goal of the ARM is simply to have a means of loading a multi-cart, and you're throwing on an overkill processor simply because expediency of completion is more important than optimizing unit cost, that may not be a bad idea. If I were doing things, I'd use a processor I was familiar with, but the ARM may be a good choice.
One of the difficulties with using a CPLD for a multi-cart is that unless you can cram all desired banking formats into one CPLD, you'll need to reprogram the CPLD in circuit and have a way of keeping the 6507 alive when the CPLD is being reprogrammed. One approach would be to have one CPLD that is never reprogrammed, which has enough intelligence to boot the system and has I/O to program the other; once the other is ready, the first CPLD can get out of the way. Alternatively, one could put a microcontroller in charge of that task; that could be tricky, but cost-effective--especially if it can eliminate the need for a parallel flash chip.
As I see it, there are three logical courses this project could take:
- Decide that the real goal of the ARM is to boot the system into different banking configurations. Overkill, perhaps, but if it means the thing gets done don't worry about it.
- Figure out how to push the dual-port memory scheme to its limits, so as to take maximal advantage of it.
- Figure on a cheaper and/or easier way of getting the functionality you really want.
If you're not seeking a hyper-accelerated native mode, there are almost certainly cheaper ways of making a multi-cart. As to whether they're easier, though, I can't say.
Even something like a 14-bit PIC running at 14.18MHz (three instructions per 6502 cycle) would probably suffice to boot a 2600, though the first parts would be a little awkward. Assume A12 and A0 are wired to inputs.
- Float all pins on CPLD (using JTAG control)
- Load PORTD (data bus) with $00 but tri-state it; enable pull-ups
- Wait for Addr12 low
- Wait for Addr12 high (should be fetching reset vector)
- Set TRISD to $1A (a type of NOP). That will put this value on the bus, but avoid outputting a high while the 6507 tries to output low.
- Now the 6507 should be fetching NOPs starting at address $1A1A (or $1Axx at any rate). Watch A0 for a little while, so we can know within 1/3 cycle where each cycle starts.
- Once we've seen enough cycles that we're comfortable (doesn't matter exactly how many), feed the 6507 instructions (one for every three PIC cycles) to put a simple boot-loader into RIOT RAM.
- Once that boot loader is running, reprogram the CPLD, and then use the boot loader to help put data there.
A little crude, and it may not be worth the effort required to get it working, but it would be a cheap way of doing things.
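The reset-vector trick above can be sanity-checked in a few lines. This is a minimal sketch of the idea, not the PIC firmware: with $1A forced onto the data bus, both reset-vector reads at $FFFC/$FFFD return $1A, so the 6507 starts at $1A1A, and since opcode $1A is a one-byte undocumented NOP on the NMOS 6502, every subsequent fetch just advances PC by one.

```python
# Why presenting $1A on the bus parks the 6507 in a NOP stream:
# the reset vector is fetched from $FFFC (low byte) and $FFFD (high
# byte); both reads see the forced bus value.
BUS_VALUE = 0x1A

lo = BUS_VALUE            # read of $FFFC
hi = BUS_VALUE            # read of $FFFD
pc = (hi << 8) | lo       # CPU begins execution at $1A1A

# Simulate a few fetches: opcode $1A is a one-byte, two-cycle
# undocumented NOP, so each fetch consumes exactly one byte.
for _ in range(5):
    opcode = BUS_VALUE
    assert opcode == 0x1A
    pc = (pc + 1) & 0xFFFF
print(hex(pc))  # 0x1a1f
```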
-
John, what is the faster oscillator you have found that is a multiple of the VCS? I can only find 12 times. I would like a little more resolution into the VCS cycle if I can get it. I would love around 30 times. ARM response time is critical, so the sooner I know an address is valid the better.
Digi-Key has oscillators available in just about any frequency you want. It's been awhile since I've looked, but at least one type of custom modules cost only a tiny bit more than standard crystal units if you buy at least 10 or so of them.
-
Even if the CPLD could pull this trick off, there is a certain minimum latency for the ARM to respond to a hotspot interrupt. And there is some time involved for the ARM to write back the proper value through the super wide bus of the CPLD into the SRAM. So even if the ARM could run the necessary calculations to index the queues forward, there might not be a wide enough window for immediate mode. Delicon would know for sure.
I don't fully know the details of how the ARM is wired in with the CPLD, and I forget exactly how fast the ARM is. Nonetheless, if the ARM and CPLD can manage to count 6507 cycles I don't think there should be any more problem with patching immediate operands than performing any other type of queue fetch. Indeed, the immediate-operand patching could be easier since the auto-fetch queue could specify in advance what needed to be done.
What means do you have for counting 6507 cycles? Are you running the ARM at a multiple of the 6507 clock, and are its timings easily deterministic?
Incidentally, I was thinking before the queues would be in the ARM RAM, but actually it would probably work to have them in external SRAM. Assuming you can easily switch between having the RAM address generated by the 6507+CPLD or by the ARM, the super-turbo mode loop would be something like:
; Assume R0 is the action queue pointer
; Assume R1 points to 64 other queue pointers (in a page-aligned group of 256 bytes of internal memory)
; R2 is a temp used to count down cycles
; R3 is a temp used to hold a queue-fetch address
lp:
  Load the LSB of R1 from local address R0 using post-increment
  Load R3 from local address R1
  Count down cycles in R2 until we're supposed to patch an immediate operand or something (*)
  Force address R3 on the SRAM address bus
  Increment R3
  Store R3 back to local address R1
  Wait for next cycle
  Return address bus to normal
  Load the LSB of R2 from local address R0 using post-increment
  If R2 is not zero, go back to lp
To improve robustness, you may want to test at the (*) to ensure that the 6507 is trying to fetch a byte from normal SRAM. Not quite sure what you would do if it wasn't, though. You may also want to check for an "exit super-turbo mode" hot spot while counting down cycles and provide some extra logic once you fetch a zero byte into R2.
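The data flow of that loop can be modeled apart from the ARM itself. This is a toy sketch of the queue bookkeeping only (the cycle count-down and bus takeover are abstracted away, and the layout is illustrative, not Delicon's hardware): the action queue holds alternating (queue index, cycles-until-next-event) bytes, and each step fetches one byte from the selected data queue and bumps that queue's pointer, with a zero delay byte terminating the loop.

```python
# Toy model of the dispatch loop above: action_queue alternates
# (queue index, delay) bytes; each iteration services one data queue,
# advances its pointer, and a delay of zero ends the run.
def run_dispatch(action_queue, data_queues):
    ptrs = [0] * len(data_queues)    # per-queue pointers (the R1 table)
    fetched = []
    i = 0
    while True:
        q = action_queue[i]; i += 1          # which queue to service
        fetched.append(data_queues[q][ptrs[q]])
        ptrs[q] += 1                         # increment and store back
        delay = action_queue[i]; i += 1      # cycles until next event
        if delay == 0:                       # zero byte terminates
            return fetched

# Two hypothetical data queues and an action list interleaving them.
queues = [[0xA9, 0x00], [0x85, 0x02]]
actions = [0, 3, 1, 5, 0, 7, 1, 0]
print([hex(b) for b in run_dispatch(actions, queues)])
```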

Supersize me!
in Great Exploitations
A blog by batari
If you have a resistor between CAPOUT and CAPIN, and a cap between CAPIN and VSS, you should not need an RC circuit feeding A11. If you do not have the RC circuit feeding CAPIN, I would expect an RC circuit on A11 would cause the banking chip to work correctly with code running in the 1000-17FF range, but cause accidental bank-switches when accessing zero page from code running in the 1800-1FFF range.
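For a rough feel of the timing involved, the RC node's threshold-crossing delay follows the usual exponential charge formula. The component values and threshold below are purely illustrative assumptions (use whatever is actually on the board), not the values batari's circuit uses:

```python
# Back-of-the-envelope RC delay: node charges toward VCC through R
# into C and crosses a TTL-ish input threshold VTH after
# t = -R*C*ln(1 - VTH/VCC).  All values here are assumed.
import math

R, C = 10e3, 100e-12          # assumed 10k resistor, 100 pF cap
VCC, VTH = 5.0, 1.4           # supply and assumed input threshold
t = -R * C * math.log(1 - VTH / VCC)
print(f"{t * 1e9:.1f} ns")    # vs. one 6507 cycle at roughly 840 ns
```

With these assumed values the delay comes out well under one 6507 cycle, which is the kind of margin you'd want so a single zero-page access from ROM can't trip the latch.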