6502 assembly code optimisation

RevEng · August 3, 2020

I'm working on a driving wheel driver for the 7800, and have streamlined it as much as I can. Unlike the normal 2600 driving wheel functionality, this driver allows the wheel to act more like a paddle (or mouse) so multiple reads per frame need to be performed to catch quicker movement. The more compact the main driver loop gets, the better the driver will perform under heavy graphics DMA, so losing even 2 more cycles would be welcome, though I can't do anything that would be a big waste of ROM, like a page sized table.

driving0updateloop
   lda SWCHA
   ASR  #%00110000 ; Undocumented. A = A & #IMM, then LSR A.
   lsr

   ora mousecodex0
   tay
   lda rotationalcompare,y
   adc mousex0 ; carry still clear via previous mask+LSR
   sta mousex0
   lda rotationaldivideby4,y 
   sta mousecodex0

   ldy INTIM
   cpy #TIMEOFFSET
   bcs driving0updateloop

Note: I can't drop the CPY. If RIOT reaches 0, the switch to single-cycle resolution combined with DMA interruptions makes catching Z unreliable.

With all of the 6502 code bumming threads scattered around AA, it might be nice to concentrate some of them here in the DASM club. If anybody has code they want to optimise, feel free to post it here, or even post a cross-link to an optimisation request somewhere else in the forums.

Thomas Jentzsch · August 3, 2020

Looks like this does use Gray Codes like a track ball or the driving controller, correct?

BTW: Can you explain your code a bit and post the tables used too, please?

Edited August 3, 2020 by Thomas Jentzsch

RevEng · August 3, 2020

Yep. In fact it is the 2600 driving wheel. I'm just reading it on the 7800, in 7800 mode.

Thomas Jentzsch · August 3, 2020

Have you seen OmegaMatrix and my discussion about optimal trackball code? It started around here.

Your code is 30 cycles; we can surely do better.

Edited August 3, 2020 by Thomas Jentzsch

RevEng · August 3, 2020

Just caught the edit. Here's the table...

rotationalcompare
     ; old =   00     01      10     11
     .byte     $00,   $01,   $ff,   $00  ; new=00
     .byte     $ff,   $00,   $00,   $01  ; new=01
     .byte     $01,   $00,   $00,   $ff  ; new=10
     .byte     $00,   $ff,   $01,   $00  ; new=11

Basically the driver reads the 2-bit position code, and ORs it with the previous 2-bit position code, for a 4-bit index into the compare table.

The "lda rotationaldivideby4,y" is just a table division replacement for what was previously "TYA, LSR, LSR", so I can save 2 cycles at a cost of 16 bytes of ROM.
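As a rough cross-check of the indexing described above, here is a small Python model of one loop iteration. It is only a sketch: the function name `update` and passing the state in as arguments are illustrative assumptions, and `rotationaldivideby4` is modelled as `y >> 2` because its contents were not posted.

```python
# Python model of the table-driven update (illustrative, not the driver itself).
ROTATIONAL_COMPARE = [
    # old = 00   01    10    11
    0x00, 0x01, 0xFF, 0x00,  # new = 00
    0xFF, 0x00, 0x00, 0x01,  # new = 01
    0x01, 0x00, 0x00, 0xFF,  # new = 10
    0x00, 0xFF, 0x01, 0x00,  # new = 11
]

def update(swcha, mousecodex0, mousex0):
    """Return (mousecodex0, mousex0) after one read of SWCHA."""
    a = (swcha & 0b00110000) >> 1   # ASR #%00110000: AND, then LSR
    a >>= 1                         # LSR: new code now sits in bits 3-2
    y = a | mousecodex0             # ORA mousecodex0 (old code in bits 1-0)
    mousex0 = (mousex0 + ROTATIONAL_COMPARE[y]) & 0xFF  # ADC, 8-bit wrap
    mousecodex0 = y >> 2            # stands in for rotationaldivideby4,y
    return mousecodex0, mousex0
```

Feeding it the codes 10, 11, 10 from a starting code of 00 moves the position up twice and then back once, which is the signed behaviour the table is meant to give.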

RevEng · August 3, 2020

I'm going through the thread now.

I'm not entirely able to use all of the optimisations - I don't believe I have spare ZP memory available for the stack trick (need to double-check that), nor is an extra page of ROM feasible. But I'll look at what I can incorporate.

RevEng · August 3, 2020

Going over the approach in that thread in more detail, the lack of +ve or -ve change in kernel code is a non-starter for my purpose.

Having played around quite a bit with the mouse driver, the direction being skipped isn't important there - an errant direction change or even jammed gray code change isn't a big deal, because there are so many remaining samples to overcome the problem. Due to the driving controller being *way* more coarse than a mouse (2.6 DPI for the driving control vs worst case of 75 DPI for the mouse) getting the direction of every single change right is critical.

Thomas Jentzsch · August 3, 2020

8 minutes ago, RevEng said:

Going over the approach in that thread in more detail, the lack of +ve or -ve change in kernel code is a non-starter for my purpose.

Probably you are looking at the wrong code (or I don't get you).

8 minutes ago, RevEng said:

Having played around quite a bit with the mouse driver, the direction being skipped isn't important there - an errant direction change or even jammed gray code change isn't a big deal, because there are so many remaining samples to overcome the problem. Due to the driving controller being *way* more coarse than a mouse (2.6 DPI for the driving control vs worst case of 75 DPI for the mouse) getting the direction of every single change right is critical.

Agreed. I found that you need at least 10 samples/frame. And more than 20 is usually overkill.

RevEng · August 3, 2020

Everything in that thread that follows this post uses in-kernel code that measures absolute distance change, rather than tracking -ve and +ve movement in-kernel. (And presumably updates the direction of that distance post-kernel.)

On 9/4/2015 at 2:11 AM, Thomas Jentzsch said:

Yup, that's about the fastest you can get using my algorithm (28 cycles). The table becomes a bit large, but that's no major concern anymore.

But how about alternative approaches?

Or how about assuming that the direction will not change during one frame? You detect direction once per frame (best outside the kernel) and then you would not have to make a difference between increasing or decreasing the variable.

And then probably you don't have to check if two bits have changed. So you just check if SWCHA changes its value.

The 28-cycle routine of Omega's from before this post uses a large shifting LUT, which is a price I can't pay.

Thomas Jentzsch · August 3, 2020

1 hour ago, RevEng said:

Everything in that thread that follows this post uses in-kernel code that measures absolute distance change, rather than tracking -ve and +ve movement in-kernel. (And presumably updates the direction of that distance post-kernel.)

Sorry, I don't get this. What do you mean with -ve and +ve, what do you want to track instead?

RevEng · August 3, 2020

No worries. I'll try to be clearer, and please let me know if I don't manage it.

Hypothetical example: if the driving wheel is turned clockwise for half of the frame and moves a distance of "2", and the player then turns it counter-clockwise for the rest of the frame for another distance of "2", the distance update for the frame should be "0". (2 plus -2 equals 0.)

For the example above, the code posted in the thread after you said "Or how about assuming that the direction will not change during one frame?", would calculate the distance travelled to be "-4". (2 plus 2 equals 4, and then adjusted to be negative post-kernel) Your suggestion was an optimisation that traded off some sub-kernel accuracy for cycle time.

The low resolution of driving controls makes incorrect samples worse than they are for mice. You can actually see the same problem with mice if you do very fine back-and-forth movements, or very slow movement overall. But driving controls can't easily move at the speed required to overwhelm the driver with more good samples than bad, like mice can.
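The hypothetical example can be put into numbers with a few lines of Python. This is only an illustration: the sample list and the rule of taking the frame's direction from the last sample are assumptions made for the sketch, not anything from the linked thread's code.

```python
# Per-sample movement within one frame: +1 = clockwise, -1 = counter-clockwise.
samples = [+1, +1, -1, -1]

# Tracking direction on every sample: opposite steps cancel out.
signed_total = sum(samples)

# Counting only "something changed", then applying a single direction
# for the whole frame afterwards (assumed here: the last direction seen).
steps = sum(abs(s) for s in samples)
frame_direction = -1 if samples[-1] < 0 else 1
unsigned_total = frame_direction * steps
```

With these samples `signed_total` comes out as 0 while `unsigned_total` comes out as -4, which is the difference being described.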

Thomas Jentzsch · August 3, 2020

Maybe you are looking at the wrong code. The code we used in our hacks doesn't use these assumptions.

RevEng · August 3, 2020

No doubt. I'm just following the thread you linked, and I don't know what you implemented in the end. If it's your directly linked post, or the 28 cycle code by Omegamatrix that follows after your linked post, then fair enough. I've read the thread too far, without realising your intended cut-off point. Unfortunately I can't manage the trade-offs for those 2 cycles.

Later in that same thread, suggestions that followed your supposition of "Or how about assuming that the direction will not change during one frame?" don't have in-kernel code that could adjust the values in either direction. e.g. in this post, diffY is only ever INCed, and the stack pointer (standing in as a diffX) is only ever PLAed.

I do truly appreciate the comments and suggestions. Sorry that we seem to have gotten stuck on this point.

Thomas Jentzsch · August 3, 2020

I think your limitations leave little room for improvements. And our code seems not to fit into the limitations. So probably we have to come up with something new.

Here is an idea (completely untested):

   lda SWCHA                ; 4
   ASR #%00110000           ; 2		Undocumented. A = A & #IMM, then LSR A.
   lsr                      ; 2
   tay                      ; 2		Y = 0, 4, 8, 12
   eor mousecodex0          ; 3
   sty mousecodex0          ; 3
   tay                      ; 2		Y = 0, 4, 8, 12
   lda rotationalcompare,y  ; 4
   adc mousex0              ; 3		carry still clear via previous mask+LSR
   sta mousex0              ; 3 = 28

This would save 2 cycles. The first 'tay' could be optimized away too if you invest another byte of RAM (mousecodex1) and alternate code.

Edited August 3, 2020 by Thomas Jentzsch

+Karl G · August 3, 2020

5 hours ago, RevEng said:
Note: I can't drop the CPY. If RIOT reaches 0, the switch to single-cycle resolution combined with DMA interruptions makes catching Z unreliable.

Is there any way to setup your timer to end at zero, or when you reach a positive value to eliminate the cpy at the end of the loop?

Thomas Jentzsch · August 3, 2020

You could load X with TIMEOFFSET-1 and then 'cpx INTIM, bcc driving0updateloop'

Or you load the timer with +$80 and do 'bit INTIM, bmi driving0updateloop'. That way you don't have to catch 0.
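To sanity-check that these exit tests behave as intended, here is a hedged Python model of the three loop conditions. `TIMEOFFSET = 10` is a placeholder, since the real value isn't given in the thread, and the sketch assumes X is otherwise unused inside the loop (the posted loop only touches A and Y).

```python
TIMEOFFSET = 10  # placeholder; the real constant is not shown in the thread

def loops_cpy(intim):
    # ldy INTIM / cpy #TIMEOFFSET / bcs: carry set when INTIM >= TIMEOFFSET
    return intim >= TIMEOFFSET

def loops_cpx(intim):
    # ldx #TIMEOFFSET-1 (once, outside the loop) / cpx INTIM / bcc:
    # carry clear when X < INTIM
    x = TIMEOFFSET - 1
    return x < intim

def loops_bmi(intim):
    # bit INTIM / bmi: keeps looping while bit 7 of the timer is set
    return (intim & 0x80) != 0
```

The first two agree for every 8-bit timer value, so the 'cpx INTIM' form is a drop-in replacement that is 2 cycles shorter (4+3 instead of 4+2+3). The 'bmi' form only keeps looping for timer values $80-$FF.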

Edited August 3, 2020 by Thomas Jentzsch

Thomas Jentzsch · August 3, 2020

BTW: I don't know anything about the 7800, but I wonder how the whole loop is integrated. Is this the main loop, which gets frequently interrupted by the DMA? How much CPU time does this cost? So how many loops will happen?

RevEng · August 4, 2020

8 hours ago, Karl G said:

Is there any way to setup your timer to end at zero, or when you reach a positive value to eliminate the cpy at the end of the loop?

Unfortunately not. Zero won't work because RIOT switches to single-cycle mode when it reaches zero, and between that and DMA, catching it isn't reliable. BPL or BMI won't work because I'd like to support more time than 127*64 cycles.
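For reference, the arithmetic behind that limit, as a tiny Python sketch. It assumes the 64-cycle timer interval implied by the "127*64" figure; the 255 case is an upper bound only, since a real TIMEOFFSET threshold takes a little off it.

```python
TICK = 64                 # cycles per timer tick, per the "127*64" figure
bmi_limit = 127 * TICK    # longest span testable with bit 7 alone
cmp_limit = 255 * TICK    # upper bound when comparing the full 8-bit value
```

That works out to 8128 cycles for the sign-bit test versus up to 16320 cycles when the whole timer byte is compared.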

7 hours ago, Thomas Jentzsch said:

You could load X with TIMEOFFSET-1 and then 'cpx INTIM, bcc driving0updateloop'

That's doable. Very nice, and thank you!

7 hours ago, Thomas Jentzsch said:

BTW: I don't know anything about the 7800, but I wonder how the whole loop is integrated. Is this the main loop, which gets frequently interrupted by the DMA? How much CPU time does this cost? So how many loops will happen?

On the 7800, there's no kernel, so the 6502 is free to process stuff during most of the visible screen, except for DMA time. DMA time happens near the beginning of each scanline. The 6502 is halted while Maria renders sprites and characters to a scanline buffer, so it's a periodic interruption that happens throughout the screen.

In my case, the long-running controller code runs via a display interrupt at the top of the visible screen. I probably didn't mention it, but this driver is for 7800basic, so the use-case is more variable than a single game. So as to how many iterations will happen... it depends on how complex the game's display is, since that will drive how much time Maria spends to render each line. A simple display will barely steal any 6502 time, and a very complex one can steal all of the 6502 time. It's one of the reasons why the wheel-driver time will be compile-time modifiable.

It's also worth noting I can't do a simple number of iterations here (hence RIOT) because if DMA starves the 6502, a new interrupt may cut into my current interrupt, and so on, which would be catastrophic. A more graceful failure mode is preferred, like the driving wheel not working correctly, and even here there are levels of "not working correctly" that may be acceptable.

8 hours ago, Thomas Jentzsch said:

Here is an idea (completely untested):

The table lookup line "lda rotationalcompare,y" needs Y = %0000xxXX, where xx=new RIOT 2-bit value and XX=previous RIOT 2-bit value. Unless I'm missing something, I think your code would need a couple shifts to stick the new RIOT value in the old RIOT position, for next time around.

Thomas Jentzsch · August 4, 2020

2 hours ago, RevEng said:

The table lookup line "lda rotationalcompare,y" needs Y = %0000xxXX, where xx=new RIOT 2-bit value and XX=previous RIOT 2-bit value. Unless I'm missing something, I think your code would need a couple shifts to stick the new RIOT value in the old RIOT position, for next time around.

In my idea Y would be %0000xx00 ^ %0000XX00. You would still need a 16 byte table, but only 4 bytes are used. But I think my code has a bug anyway. I guess I have to code and test.
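One way to see why the XOR of old and new codes may not be enough on its own to index the table (which could be the bug suspected above, though that isn't confirmed in the thread) is to group the valid transitions by their XOR value. The Python below is only a sketch; `STEP` is transcribed from the rotationalcompare table with $01 as +1 and $ff as -1.

```python
# Direction of each valid Gray-code step, keyed by (old, new).
STEP = {
    (0b01, 0b00): +1, (0b10, 0b00): -1,
    (0b00, 0b01): -1, (0b11, 0b01): +1,
    (0b00, 0b10): +1, (0b11, 0b10): -1,
    (0b01, 0b11): -1, (0b10, 0b11): +1,
}

# Collect the directions that share the same old XOR new value.
by_xor = {}
for (old, new), delta in STEP.items():
    by_xor.setdefault(old ^ new, set()).add(delta)
```

Both XOR values (%01 and %10) end up mapped to both directions, so the XOR says which bit changed but not which way the wheel turned; the old (or new) code is still needed to resolve that.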

Thomas Jentzsch · August 4, 2020

@RevEng Couldn't you use TIMINT ($285) which triggers when the timer has run out? Or would that create varying scanlines?

Thomas Jentzsch · August 4, 2020

This is the best for now (and tested with Stella):

.left                       ; 3
    dex                     ; 2
    bcc     .cont           ; 3 =  8    unconditional

.loop
;    sta     WSYNC           ;           required for Stella!?
    lda     SWCHA           ; 4
    asr     #%00110000      ; 2
    lsr                     ; 2
    ldy     lastTrack       ; 3         Y = 0, 4, 8, 12
    sta     lastTrack       ; 3
    eor     NextTbl,y       ; 4 = 18
    beq     .left           ; 2/3
;.right
    eor     #%00001100      ; 2
    bne     .cont           ; 2/3
    inx                     ; 2 =  8
.cont
; check timer:
    lda     INTIM           ; 4
    cmp     #TIMEOFFSET     ; 2
    bcs     .loop           ; 3/2= 9/8
;total: 35 cycles

NextTbl ; only every 4th byte used
    .byte   %0100, 0, 0, 0   ; 00->01 = left, 00->10 = right
    .byte   %1100, 0, 0, 0   ; 01->11 = left, 01->00 = right
    .byte   %0000, 0, 0, 0   ; 10->00 = left, 10->11 = right
    .byte   %1000            ; 11->10 = left, 11->01 = right

So 4 cycles saved compared to the original code. And only one table used, so 16+3 bytes saved there.
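As a cross-check, the branch logic above can be modelled in Python and compared against the rotationalcompare table posted earlier in the thread. This is a sketch of the logic only (no timing), and the helper names `step` and `signed` are made up for the illustration.

```python
# NextTbl, keyed by lastTrack (code in bits 3-2); only every 4th byte is used.
NEXT_TBL = {0: 0b0100, 4: 0b1100, 8: 0b0000, 12: 0b1000}

ROTATIONAL_COMPARE = [
    # old = 00   01    10    11
    0x00, 0x01, 0xFF, 0x00,  # new = 00
    0xFF, 0x00, 0x00, 0x01,  # new = 01
    0x01, 0x00, 0x00, 0xFF,  # new = 10
    0x00, 0xFF, 0x01, 0x00,  # new = 11
]

def signed(byte):
    """Interpret an 8-bit value as two's complement."""
    return byte - 256 if byte >= 0x80 else byte

def step(new, last_track):
    """new and last_track are 2-bit codes already shifted into bits 3-2."""
    a = new ^ NEXT_TBL[last_track]   # eor NextTbl,y
    if a == 0:                       # beq .left
        return -1                    # dex
    a ^= 0b1100                      # eor #%00001100
    if a == 0:                       # bne .cont not taken
        return +1                    # inx
    return 0                         # no change, or an invalid two-bit jump
```

For all 16 old/new combinations `step(new << 2, old << 2)` gives the same value as the signed entry `ROTATIONAL_COMPARE[new * 4 + old]`, so the two approaches agree on direction.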

Unrolling would gain 1 more cycle. Then X and Y (alternating) would be used instead of 'lastTrack' (saves 4 cycles and 1 byte RAM). Also 'inx/dex' would be replaced by 'inc/dec xPos' (costs 3 cycles). Maybe one could use 'pla/pha' instead (saves 1/2 cycles). You would have to find an area where some 'pha' wouldn't do any harm. Not sure if that works on the 7800.

One more question: Your code seems to indicate that you may want to check more than one controller. Is that correct? In that case, instead of repeating the code, a combined code checking both simultaneously would be more efficient.

Edited August 4, 2020 by Thomas Jentzsch

RevEng · August 4, 2020

Ah, brilliant!

Having two driving controllers would be an option to the 7800basic game, as would something else in the first port and driving wheel in the second. (I see where your routine is easily modified for the second case, and a four byte LUT would serve there.)

TIMINT (which was a gap in my knowledge) may or may not be an option on the 7800. I'll need to see if it works reliably on real hardware... I suspect it might work, but there are reliability issues with non-port based RIOT access on the 7800. Not quite as reliable here as on the 2600. (e.g. I confirm timer writes, and repeat if necessary.) I think maybe the DMA interruption is the culprit, if it happens in the middle of a RIOT write, as RIOT can't be halted like the 6502 is.

No luck with safe stack space; the RAM is already in use, and there's a bunch of writable TIA and Maria registers in the lower part of the stack.

Thomas Jentzsch · August 4, 2020

3 hours ago, RevEng said:

No luck with safe stack space; the RAM is already in use, and there's a bunch of writable TIA and Maria registers in the lower part of the stack.

Yup, I looked up the memory map and it doesn't look good.

Another weird idea I just had: Only use PLA! PLA each loop, except when going right. And PLA twice when going left. But that wouldn't save cycles (9..11 instead of the current 8). And then you could almost use 'inc/dec xPos' instead (11 cycles).

Edited August 4, 2020 by Thomas Jentzsch

Thomas Jentzsch · August 4, 2020

13 hours ago, RevEng said:

A simple display will barely steal any 6502 time, and a very complex one can steal all of the 6502 time.

If a display steals a lot of CPU time, then the code may even detect the wrong direction. Maybe it would be better to also check the timer and dismiss all changes where the timer has run for too long between two reads?
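A minimal sketch of that idea in Python, purely hypothetical: `MAX_GAP` and `filtered_step` are invented for the illustration, and the extra cycles such a check would cost in the real loop are not accounted for.

```python
MAX_GAP = 1  # assumed threshold: timer ticks allowed between two reads

def filtered_step(delta, prev_timer, timer):
    """Return delta, or 0 if too long passed since the previous read.

    The timer counts down, so elapsed ticks = prev_timer - timer
    (taken modulo 256 to stay within 8 bits).
    """
    elapsed = (prev_timer - timer) & 0xFF
    return delta if elapsed <= MAX_GAP else 0
```

A change seen one tick after the previous read would be counted, while one seen four ticks later would be dropped and the code simply resynchronised.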

Do games for the 7800 and this controller exist? How do they handle this?

Edited August 4, 2020 by Thomas Jentzsch

RevEng · August 4, 2020

Agreed. A game that pushes the DMA boundaries and loses all or most of the CPU time will have trouble with any controller that needs a longer timeframe to read, in addition to paying the other prices associated with having much less CPU time. This DMA/display trade-off is at the heart of 7800 programming, so this isn't anything new for 7800 programmers, even to lightly experienced 7800basic programmers.

With the tuning in this thread, I'm just hoping to move where that DMA trade-off point is for long-read controllers. The Arkanoid WIP is showing promising initial results with the driver code I posted originally, which for me means the technique is usable in a more than reasonable case. The cycle tuning that was done in this thread is the gravy on top of that.

There's one other 7800 game that uses driving controls like a paddle, which is Super Circus AA. It uses a series of interrupts for regular polling and keeps the DMA pretty light, which works great for the game as designed, but is less ideal for a general case approach.
