Blog Comments posted by supercat


  1. I'm having a hard time understanding why VCS->ARM communication has to be that much faster when the only time-critical stuff is going to be simple queue reads. For those I don't see how Chimera could enable anything faster than a 4-cycle read (load absolute).

     

    I don't know the ARM instruction set, and don't remember exactly how fast it is, but since it's a 32-bit processor I would expect that it could accomplish the equivalent of at least eight 68000-style instructions per 6507 cycle.

     

    With the 68000 instruction set and a little optimization:

    ; Assume:
    ; d0 holds the number of cycles remaining for the current operation
    ; d3 holds the prefetched data value
    ; a0 holds the main queue pointer
    
    ; Note that on most 68000 instructions the SECOND operand is the destination
    
    loop1:
     wait_for_start_of_cycle
    
     dbra.w d0,loop1
     mov.b d3,DATA_BUS           ; Output the last pre-fetched value
     mov.b #255,DATA_BUS_ENABLE  ; Enable the bus
     mov.b @a0++,d1              ; Fetch number of next queue to use (upper part of d1 holds rest of address)
     mov.w @d1,d2                ; Get the fetch pointer for that queue (upper part of d2 holds rest of address)
    
     wait_for_start_of_cycle
    
     mov.b #0,DATA_BUS_ENABLE    ; Disable the bus
     mov.b @d2++,d3              ; Fetch the actual data byte and bump the pointer
     mov.w d2,@d1                ; Save the new pointer
     mov.b @a0++,d0              ; Fetch the number of cycles for the next operation
     dbra.w d0,loop1             ; Loop until the count expires

     

    Only five instructions between invocations of wait_for_start_of_cycle, and it should be able to handle fetches as quickly as every other cycle.

     

    nb: I know the real code would probably be quite a bit different; my general point is that the ARM should be able to handle things pretty quickly if it knows exactly what to expect.

     

    Also, btw, even if the ARM could only handle one 'patch' every four cycles, that would still be a huge speed boost since it would often be possible to put some sort of useful instruction (e.g. stores) between the loads.

     

    From my perspective, although queues seem like a decent concept, they don't really buy much, compared with simply having RAM, unless they reduce the seven cycles required for

      lda abs,y
     sta wherever

    If the Chimera could trap immediate-mode accesses, that could drop that seven cycles down to five, and that would be extremely useful.


  2. Chimera is flexible, but it is also important that games don't fall into version-hell. It's not the same problem with the banking schemes themselves, because those can be swapped out at runtime, but there is only one fixed 512KB firmware area. So to whatever extent the ARM firmware is dependent on a certain way of servicing hotspot interrupts, I think standardization is going to be necessary to avoid backwards-compatibility problems with games that expect one flavor of firmware or another.

    Worst case scenario, you COULD put on the jumper and overwrite the firmware. It's just inconvenient to keep doing that just to switch games; it's like the equivalent of dual-booting two OSs. Overall it would be better for the end user, and better for the sake of a combined developer base, to settle on standardized firmware.

    That doesn't mean version 1.0 of the OS has to have all of the helper functions already built, but there should be some priorities set and a roadmap in place. This will be very important when the cart gets shaken down during beta testing. Threads like this are just to get the ball rolling on that line of thinking.

     

    By the sound of it, auto-patching into immediate-mode accesses by watching for the addresses to be patched probably wouldn't be workable, since the CPLD would have no means of quickly being informed by the Chimera which addresses to patch.

     

    Would it be possible, though, to have a 'master auto-fetch queue' which would allow the Chimera to intercept such fetches by counting particular numbers of cycles? Something equivalent to:

    AFQ_ITEM *afq_ptr;
    char afq_cycles_left;
    char afq_prefetch;
    
    void handle_queues_loop(void)  /* Assumes we're already set up to start right away */
    {
        while (1)  /* Until break */
        {
            if (afq_cycles_left < 0)  /* We've reached the end of processing (best case) or we misfigured something. */
                break;
            while (afq_cycles_left)  /* Count down cycles for the immediate operand */
            {
                afq_cycles_left--;
                wait_for_cycle_start();
            }
            if (!(ADDRESS & 0x1000)) break;  /* Not a cartridge-space fetch */
            DATA_BUS = afq_prefetch;
            afq_cycles_left = afq_ptr->cycle_count - 2;  /* How many cycles until the next one? */
            wait_for_cycle_start();
            RELEASE_DATA_BUS();
            afq_prefetch = fetch_queue(afq_ptr->the_queue);
            afq_ptr++;
        }
    }

    There should probably be a little more to it than that, to allow for some looping constructs, but if the ARM can accurately count cycles, it should be able to use something like the above to patch immediate-mode operands assuming the code knows in advance how many cycles there should be between each patched operand and the next one.

     

    Yes, I know that sort of thing is very ugly, but it could provide over a 20% CPU speed boost in some types of kernels. On the 2600, that's HUGE.

     

    The basic idea would be that code sets up the auto-fetch queue and then hits a trigger spot to activate it. The code for the trigger spot (not shown) should load afq_ptr to point to the start of the master queue, load afq_cycles_left and afq_prefetch, bump afq_ptr, and start the above loop. The first entry of afq_ptr->cycle_count must be set up so that it counts down to zero on the first immediate operand. The second afq_ptr->cycle_count value should be the number of cycles between that operand and the next one that will be fetched.

     

    Thus, the six-digit score kernel would be reduced to:

     lda #$FF ; Value will be patched by Chimera
     ldx #$FF ; Value will be patched by Chimera
     ldy #$FF ; Value will be patched by Chimera
     sta GRP0
     stx GRP1
     sty GRP0
     lda #$FF ; Value will be patched by Chimera
     ldx #$FF ; Value will be patched by Chimera
     ldy #$FF ; Value will be patched by Chimera
     sta GRP1
     stx GRP0
     sty GRP1
     sty GRP0

    A total of 33 cycles--a savings of 12 cycles compared to using absolute-mode queues, more than 12 compared to using ABS,y addressing, or more than 18 compared to using (ZP),y addressing.


  3. I am a hair slower than most; I can't see what's going on. Please explain.

     

    If you require Chimera commands and operands to be written to an address in cartridge space (e.g. $1005), then setting up a command or operand with an immediate value will take six cycles (e.g. LDA #value; STA $1005) and tie up a register; loading a command or operand with a value already in a register will take four cycles.

     

    Using a zero-page address for Chimera commands would be an easy way to shave a cycle, and shouldn't pose any particular difficulty. Let's suppose you go with $3E for commands and $3F for operands.

     

    Another cycle may be shaved if a range of 256 non-cartridge-space addresses is dedicated to each of those zero-page locations. Let's assume $0800-$08FF for $3E and $0900-$09FF for $3F.

     

    Any access to address $08xx should be treated the same as writing xx to $3E, and any access to address $09xx should be treated the same as writing xx to $3F. Another way of looking at it is to say that the $0C opcode (NOP abs) is a way of putting two specified bytes on the low and high address bus for a cycle. If the high-byte value is one the Chimera recognizes, the low-byte value can be used to specify the actual value.


  4. Our design philosophy for Chimera Native is to keep the CPLD functions generic. The CPLD will support bankswitching, magic writes, and will mediate between the ARM and the VCS, but nothing more exotic than that. If all you had on the cart were a CPLD obviously you'd start to press it into service to enable very specific 2600 effects. But we're not pursuing that. The amount of time the ARM saves the VCS will be more than worth the amount of time the VCS requires to trigger those functions.

     

    I suppose that if the ARM can handle things like pixel-plotting without delay, it probably shouldn't be any less capable than a CPLD-based approach. Your sample implementation, though, seems a bit slower than necessary. Taking advantage of address space will allow things to go much faster. For example, you could use:

     Address $003E -- Write data to ARM operand stack
     Address $1Exx -- Write xx to ARM operand stack
     Address $003F -- Write data to ARM command register and execute it
     Address $1Fxx -- Write xx to ARM command register and execute it

    Thus, to plot a pixel (command 5) whose X coordinate is in the X register and whose Y coordinate, $43, isn't in a register, one could use:

     stx $3E
     nop $1E43
     nop $1F05

    If the Y coordinate was in the Y register, one could save a cycle via "STY $3E" instead of "nop $1E43"; the "nop" form, however, is faster than a load-immediate plus store, and it also doesn't tie up a register.

     

    What do you think of the idea of having the Chimera tap into immediate-mode operands? A little tricky from a timing perspective, but it would make practical many things that would otherwise be impossible. It might even be possible to push a Ruby-Runner-style kernel out to 13 columns (I'm struggling with 12--I think I'll manage it, barely). It would probably also be possible to push a flicker-blinds text kernel out to 15 or 16 7-pixel-wide characters (30 or 32 small characters).


  5. The Atari 2600 has 3K of somewhat useful address space from $0400-$0FFF. Though most of that may only be used as address-triggered hotspots, you could safely use data writes to addresses $2D-$3F and $6D-$7F in each page.

     

    One thing that may be very useful would be to have the Chimera do something like this on each cycle (not sure if this would be easily doable in your processor):

     

     if (address == magic_queue[magic_ptr].magic_address)
     {
         data = *(magic_queue[magic_ptr].fetch_ptr++); /* Ignore the actual data at that address */
         magic_ptr = magic_queue[magic_ptr].next_magic_ptr;
     }

    The idea would be to have the first magic_address point to the operand of an immediate-mode instruction. The next magic_address would point to the operand of another immediate-mode instruction. As long as the code stayed within a loop, you wouldn't need any absolute-mode instructions; when the loop exits, though, it would be necessary to reset magic_ptr. If you had a range of hotspots (e.g. from $0F00-$0FFF) that load magic_ptr directly with a value of 0-255, reloading the pointer would take four cycles.

     

    If you could manage that in the ARM, then you could save two cycles per queue fetch. It may be necessary to use a four-cycle absolute-mode access when the code slips out of a loop, but the savings most of the time would generally be pretty good. Adding a little more sophistication to the queues could eliminate that extra instruction, but I don't know how far the ARM can be pushed.

     

    BTW, I know your pixel plotting was just an example, but I wouldn't think it would be worth involving the ARM on anything less sophisticated than line drawing. After all, even 4A50 can plot pixels using only three instructions (13 cycles):

     lda $1F00,x
     ora $1E00,y
     sta $1E00,y

    The CPLD might be a useful adjunct for pixel-plotting if it could latch and mask data bits in conjunction with magic writes. For example, if there were a mode that would cause the CPLD to, on all $1Exx accesses, set all data bits which were set on the last read of $1Fxx, then the pixel plot operation could be reduced to nine cycles and it would leave the accumulator untouched:

     nop $1F00,x
     cmp $1E00,y


  6. The only 5V trace is the one coming in from the VCS; it goes into a big electrolytic cap and then into a 3.3V regulator.

     

    FWIW, the 4A50 design is mostly at +5 except for the CPLD which is 3.3.

     

    If your RAM is running off 3.3 volts, switching from read to write mode shouldn't put any noise (or cause any change whatsoever) on the data lines, but may cause a momentary increase in current draw. Perhaps that's what you're seeing? I might suggest trying an inductor between the VCS supply and the cap feeding the regulator. A cap alone can only do so much; adding an inductor can help a lot, especially in cases where you can afford some instantaneous voltage drop.


  7. Just normal defender, with writes active across the entire address range. The noise is heaviest in a vertical column around the ship, but still heavy all over the screen.

     

    So you're running all of the code from RAM in either case, but in one case you're holding /OE for the whole cycle and in the other case you're holding /OE for only part of the cycle and then /WE for the other part, while the address does not change? Do you have good bypass caps everywhere?

     

    What parts of the cart are at five volts and what parts are at 3.3?


  8. Logically there is no difference, but when I run defender with magic writes on, I see a huge increase in noise.

     

    Curious. If you have magic writes off, but the bus-keeper enabled, what happens? Perhaps the bus keeper is adding noise by increasing the current flow and sharpening the transitions on the data bus (for all cycles, including TIA and RIOT ones), but I wouldn't think the effects would be too bad.

     

    BTW, just to clarify, are you saying that you're using Defender II/Stargate with magic writes at $1000-$107F? Is the noise concentrated on any particular part of the display? What parts of the system are running at five volts vs. 3.3?


  9. BTW, I forgot to mention: magic writes shouldn't have any particular effect on data-line RF emissions except when data is written to the RAM, since data will be put on the lines during one part of the cycle and remain there during the other part; to someone watching the data lines with a scope, each cycle would "look" like one cycle. To be sure, if a RAM location is written with a different value, the old value will appear on the bus during half the cycle and the new value during the other half, but most code does a lot more reads than writes.


  10. Incidentally, 4A50 does support use of an on-board serial EEPROM. I was really tight on chip resources when I added that feature, so it's a little goofy. Nonetheless it should offer pretty good performance. Not quite as good as what I could get if I used a little more address space and used a high-speed EEPROM, but decent nonetheless. Note that A14 is tied to the EEPROM's SCK line, so code using the EEPROM must bear this in mind (most likely by avoiding accessing upper areas of flash or RAM during EEPROM operations).

     

    Key addresses for EEPROM use:

    • $006C : Set bit 8 of the upper-bank address to the state of the SDA pin.
    • $006D : Drive A14 high for the duration of the cycle, but do nothing else.
    • $006E : Copy the state of D7 (0=asserted; 1=floating) to the SDA pin in the middle of the cycle.
    • $006F : Drive A14 high for the duration of the cycle, and toggle SDA's output state (asserted vs. floating) in the middle of the cycle.

    The key to using these effectively is the zero-page indexed addressing mode, which performs a read of the "given" address followed by a read or write of the correct address. For example, if X=$FE, then "LDA $6F,X" will access addresses $6F and $6D in consecutive cycles. Code should be run from the $1E00 banking area; there should be two copies of the code on consecutive pages of RAM or flash, with a few instructions changed between them. For example, one part of the read-byte routine would look like:

    I2C_READ:
    	lda	 #$FF   ; To make things handy with ROL
    	nop	 $6E	; Assert data
    	ldx	 #$01
    	sta	 $6D,X  ; Hit CLK and release data
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	nop	 $6C,X  ; Test data and hit CLK
    	ASL
    	rts

    in the lower page and

    I2C_READ:
    	lda	 #$FF   ; To make things handy with ROL
    	nop	 $6E	; Assert data
    	ldx	 #$01
    	sta	 $6D,X  ; Hit CLK and release data
    	nop	 $6C,X  ; Test data and hit CLK
    	SEC
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	nop	 $6C,X  ; Test data and hit CLK
    	ROL
    	rts

    in the other. Not quite as fast as the technique I came up with for use in a more simply-banked cart (where I could use much more address space) but 6 cycles/bit is nothing to sneeze at.


  11. EDIT: SED ($F8) would work as well if you were sure to have a CLD in your code. Actually, this one might be best because it would only go 8 bytes backward if it were in a branch opcode, which would only have potential for problems if execution stopped on a branch in the first 8 bytes of ROM (unlikely).

     

    I mistakenly thought JMP (ind) was $7C rather than $6C. My bad. As for the optimal capture sequence, perhaps fill $1000-$107F with something nop-ish of the form 0xx1xxxx ($18 CLC would work fine). The latter part of RAM could hold the sequence D0 D0 F0 F0 (BNE -$xx BEQ -$xx). Code would get stuck in the range $1050-$1083. Once capture occurs, switch in a bank with NOPs from $1000-$107F and the sequence $D0 00 $F0 00 at $1080. Real code would start at $1084. I think that capture approach would work in any scenario other than (1) the CPU is jammed or (2) the CPU is running entirely out of zero-page.


  12. I am actually loading a 4K section of SRAM with BRK (0x00), and setting 0x1FFB = 0x4C, 0x1FFC = 0xFB, 0x1FFD = 0xFF, 0x1FFE = 0xFB, 0x1FFF = 0xFF.

    Then I divert the cart space to that region. It will capture the VCS in a tight jump loop, which lets me load a new game in other SRAM space. Then I redirect the VCS to the new region when 0x1FFC is requested. It works quite well.

     

    That would be the first method that would come to mind, but I would worry that if you switch during the third cycle of a JMP instruction, code could end up in page zero and get stuck or hit a 'JAM' instruction. I think $7C should be a safe 'capture' value in any scenario other than a branch instruction that takes place within the last 128 bytes. Of course, if you watched the address/data buses you could probably identify safe times to jump in.


  13. Does the CPLD output any data of its own, or is its only "output" function the bus hold?

    Yes, the CPLD data lines feed right to the VCS.

     

    Is it only when you're using queues or other such goodies that that happens? Or does the CPLD insert itself between the memory data bus and the 2600? In 4A50, the CPLD itself never puts anything on the data bus--anything that gets there is put there via the RAM or flash (or by the 2600 itself). If your CPLD has a full set of address and control lines, you could probably avoid having it output anything on the data bus itself if it simply read from a memory chip containing a 256-byte table of values 0-255.

     

    If you really wanted to make things nice from a noise perspective, you would have two sets of data pins on the CPLD--one connected to the VCS and one connected to the memory, with 10K-47K resistors connecting them.

    This is exactly how the previous version worked. But we moved from an 8 bit bus to a 17 bit bus between the MCU and the CPLD to speed up throughput. That used up those extra pins. I called that 'active bus hold'. I am calling this current way 'passive bus hold'. I could add an external octal register outside the CPLD, that would solve all the issues. But I am trying to avoid that. I will wait until I hear back from beta testers with RF systems to see if anything more needs to be done.

     

    If the Cuttle Cart has some noise and people have been fine with it, this shouldn't be a problem at all.

     

    Sometime I'll have to find out how 4A50 is in that regard. My only RF-related measure was a resistor between the 14.31818MHz oscillator module and the CPLD.

     

    BTW, are you running a nice multiple of 3.579545MHz? I think you'll need to be accurate within 1% for reliable operation if a STA WSYNC is run from cartridge RAM.


  14. The problem is that with magic writes, it's now double the amount of accesses to the SRAM, a read and a write every VCS cycle, so that means a large increase in noise. The bus hold uses a very large resistance to keep its weak hold on the bus, 50K. If I add much more resistance on top of that to squelch the newly created noise, I start to violate the electrical characteristics of the 6507.

     

    Does the CPLD output any data of its own, or is its only "output" function the bus hold?

     

    If the CPLD doesn't output any data of its own, I would suggest that you have the CPLD data pins tied directly to the 2600 side, and have resistors between the CPLD and the memory chips that will be putting data on the bus. The memory chips should output very strong highs/lows, so the balance will be purely between the CPLD hold circuit and the resistors. Since the VCS will be tied directly to the CPLD, its attempt to "pull" the bus won't have to battle the resistors. If there were a way to enable the bus hold function only part of the time, you could disable it during the portion of each cycle when you're just starting to read data from memory; in that case you could get by with fairly large resistors. The 9500XL series doesn't offer such a function, though.

     

    If you really wanted to make things nice from a noise perspective, you would have two sets of data pins on the CPLD--one connected to the VCS and one connected to the memory, with 10K-47K resistors connecting them. If you did that, you wouldn't need the CPLD's bus-hold function; instead, for the part of the cycle where you would use it you would program the CPLD's outputs (on the memory side) to follow its inputs (on the VCS side). Alternatively, you could build a bus-hold circuit from a 74HC373. As before it would only output during the appropriate part of the cycle.


  15. Would you like to get a beta cartridge so you can add 4A50 support? At this rate, we may only be a few weeks away from sending spare 10.0 boards to interested parties.

     

    I don't know how I'd bash your board, but I could let you look at the essential equations defining 4A50 behavior in my CPLD. One thing I forgot to ask, though: can Chimera 10.0 support magic writes? If not, 4A50 isn't going to be very useful.


  16. One often-ignored factor in the music and game markets is that the amount of extra effort people are willing to exert to get stuff without paying for it tends to be inversely proportional to the amount of money they have to buy stuff in the first place. Further, the ability of people to share games among themselves may allow indirectly for a form of differential-pricing model. Neither of these market factors works 100% toward the interest of content producers, of course, but they somewhat mitigate the effects of piracy.

     

    On the other hand, when companies start to put in protection schemes that become annoying, the dynamics shift. Given a choice between fumbling through the manual to find the magic word on line 2 of paragraph 4 of page 12, or looking up the word on a handy cheat-sheet, who (legitimate user or not) wouldn't prefer the latter? And who wouldn't prefer, better yet, not bothering with that nonsense at all? When a cracked copy of a game becomes more desirable than a legal one, even people who would have found it more convenient simply to buy the product at a store, rather than hunt down a reliable hacked version, may no longer do so. Some might buy it anyway, but there's much less incentive.


  17. Sounds cool. I'm inching along with my 4A50 burner, which should help expedite that project. The two could actually end up fitting in with each other nicely, if the Chimera could provide a cheap platform for testing 4A50 games and the 4A50 cart could provide a solution for reasonably-priced production.

     

    The notion of "capturing execution" is an interesting one. In most cases it would be fairly easy (on any access above $1000 return $7C, trapping execution in a 5-cycle loop) but on something like Batari's "dumper" cart it wouldn't work at all.


  18. I wonder how much longer until we see unbeatable checkers on home computers?

     

    What would be the point? One of the reasons I think 4x4x4 tic-tac-toe was a popular computer game is that, while it is winnable by the first player (and the computer versions I've seen let the human go first), eking out a win isn't easy. By contrast, in 3x3x3 tic-tac-toe, moving to the center makes the game an easy win for the first player; disallowing the first player from moving to the center but letting the second player go there makes it an easy win for the second player.

     

    I suspect the decisions to release both Stellar Track and 3D Tic Tac Toe were based upon the popularity of the underlying mainframe computer games.


  19. Even some older movies managed some pretty amazing effects. Méliès was the master of camera-trick effects (1890s to 1910s), but movies of the 1920s had some pretty amazing ones (see Metropolis or Ben-Hur, for example). When I saw the naval battle scene in Ben-Hur, with all the little specks jumping off the giant ships, I wondered how they animated those on the model. Well, actually, those weren't models (unlike in the 1959 remake). Some of the stadium shots combined miniatures with full-sized sets, though. And Metropolis--wow. That film has got to be seen to be believed (though there are a number of versions out there, some of which are dubiously edited and scored).


  20. That style of sprite relocation will avoid having to blow any scan lines on RESxx timing loops, but repositioning a sprite that way will take eleven scan lines. It may be useful to have a few variations of one of your sub-kernels, each of which has a RESxx in a different place. You could use one of those to get the positioning close to where it should be (say within 30 pixels), and then use HMOVEs to get it to its final place within five scan lines (rather than eleven).

     

    Depending upon your exact register and RAM usage, the simplest way to handle this might be to use two bytes in RAM for each major line of display as a code pointer; after displaying each major line, perform an RTS to go to the routine for the next line. This will only cost six cycles, and eliminates the need for other looping constructs.


  21. I wonder whether it would be practical to encode three twelve-bit pointers per four bytes? That might be a workable approach. I don't think it could be done with just three blank scan lines between text rows, but maybe with four, or almost certainly with five.

     

     lda #7;  Once only
    
     ldx msbdataA,y
     stx ijmp
     jmp (ijmp)
     jmp real_spot
    
     ldx lsbdataA1,y
     stx ptrA1
     sbx #$10; or #$00
     stx ptrA1+1
     ldx lsbdataA2,y
     stx ptrA2
     sbx #$10; or #$00
     stx ptrA2+1
     ldx lsbdataA3,y
     stx ptrA3
     sbx #$10; or #$00
     stx ptrA3+1
    
     ldx msbdataB,y; Next one
     stx ijmp
     jmp (ijmp)

    The cost of this would be an extra 15 cycles per three characters, but the code bloat would be substantial since there would need to be 32 (8 jump targets, and four groups of characters) copies of the above routine to handle 12 characters (I'd figure the rightmost column could fit within a 249-character set). It would probably be possible to improve things quite a bit, but I don't know exactly how much.


  22. The two RESPx writes are marked, and are mid-line, all other repositioning occurs with HMOVEs (left 6 at the end of lines 1 and 3 and left 10 (!) at the end of lines 2 and 4) and occurs at the end of the scanline.

     

    On Z26 and actual hardware, all the HMOVEs move both players eight pixels left. I'm not sure what's with those numbers you gave.

     

    The simplest way to describe the sprite handling of this kernel is to observe that each sprite gets hit every four lines, always in the exact same spot, and moves left 8 pixels on each of those lines. Consequently, the sprite will have moved 32 pixels left before it gets hit, and so the second copy of the sprite before the hit will be in the same place as the first copy will be afterward.

     

    There are a few spare cycles in the kernel, since it's not necessary to write any of the motion registers. I don't have anything particularly in mind for them in this application, but the approach might allow me to expand the Ruby Runner kernel from ten columns to eleven or twelve. That could be pretty sweet if it works.

     

    The code checks for loop exit after each of the four scan line routines. Since each line of text is seven lines rather than eight, each line of text will start with the sprites in a different place. Each time I want to start displaying text I use an indirect jump to go to one of four 'start showing text' routines; each of the four loop exits sets up the pointer for the next one.

     

    Since there are an odd number of lines displayed (21 in this case) the screen naturally alternates between the two visible frames. Indeed, it cycles among all four, as may be confirmed by disabling a player sprite. If an even number of lines were displayed, it would be necessary to add explicit code to 'nudge' things between frames. The way fonts are stored, I could make the first or last line of text be 8 scan lines instead of 7 (with a blank line at the top or bottom as needed); that would reduce my allowable character set size from 250 characters to 249.

     

    BTW, I would expect the kernel could use a different character set for each line of text without too much difficulty; I wonder what the best way would be to partition character pairs into character sets. Supporting more than 250 (or 249) characters within a line would be irksome, though it might be possible to support up to about 500 if I stored text using two bytes per character (i.e. pair of 4x7 characters) rather than one. Line spacing would probably have to increase in that case, though, since I'd no longer be able to use my 12-cycle-per-character pointer setup:

     lda #7; Just do this once
     ldx chardata,y
     stx pointer_low
     sbx #$10
     stx pointer_high

    Pretty sweet use of sbx, no?
