Jump to content
tschak909

Cycle Counting techniques?

Recommended Posts

When you guys do your cycle counting, do you guys literally just keep erasing and re-writing the cycle counts after recalculating, as you add instructions? Am wondering if any of you have found a trick to save sanity as you're figuring things out in a kernel?

 

-Thom

Share this post


Link to post
Share on other sites

Yes, I am updating the cycle counts very frequently, especially when I create a kernel without WSYNC or a reflected, asymmetrical playfield.

Share this post


Link to post
Share on other sites

When you guys do your cycle counting, do you guys literally just keep erasing and re-writing the cycle counts after recalculating, as you add instructions? Am wondering if any of you have found a trick to save sanity as you're figuring things out in a kernel?

 

-Thom

 

"sanity"?! Ha!

Large amounts of Boulder Dash code were cycle-counted. By hand. Over and over. I dream opcode cycle-times.

Here's an example; I don't think Thomas will mind releasing this - it's part of our engine draw code, over which we hold copyright...

   DEFINE_SUBROUTINE DrawIntoStack

                tsx                             ;2
                stx save_SP                     ;3
                ldx DrawStackPointer            ;3
                txs                             ;2 = 10

                ldy DSL                         ;3

    ; worst-case DrawStackOne loop = 61 cycles per character (+11 for first one)
    ; + exit cost which is +10 cycles
    ; TJ: I count 63
    ;   + 14 for the wtf2 exit
    ;   + 22 for the end of loop exit

    ; This loops 80 times/frame and is called 1-2 times/frame, so any saving inside the loop will make up for a lot of overhead outside
    ; For 80 iterations that is 640 cycles, just for checking INTIM
    ; Worst case we would have ~5000. Though that will most likely never happen, we have to optimize for it, since
    ; it also will require a lot of CPU time for updating the screen data.

    ; TODO: optimize!
    ;
    ; idea #4: the mixed draw idea (two different characters have to be redrawn),
    ; first do a QuickDraw and then a SlowDraw, faster than two SlowDraws
    ; costs some detection time here, but saves ~240 cylces for drawing the two

.loopDrawStack  lda INTIM                       ;4
                cmp #SEGTIME_DSL                ;2
                bcc .exitDrawStack              ;2/3= 8/9
                STRESS_TIME SEGTIME_DSL

                lda DrawFlag,y                  ;4
                cmp ScreenBuffer,y              ;4              Is the character already there the same as the new one?
                beq .next0                      ;2/3=10/11      yes, so we don't draw anything

    ; Character is NOT the same. Figure out how it should be drawn.
    ; If it is in column 0 or 5 then it can be DirectDrawn (indirectly found by a A:A compare)
    ; If it is the same as its paired character (sharing same PF byte) then it can be DirectDrawn
    ; The top bit of the ScreenBuffer character indicates the DirectDrawn hint

                ldx PairedCharacter,y           ;4              the "paired" character for this one
                cmp DrawFlag,x                  ;4              same as partner character in new drawn screen?
                bne .notPaired0                 ;2/3=10/11

    ; Consider two 'paired' characters. Either A:A or A:B
    ; When we're scanning, and we check the first, if they are NOT paired, then the second character
    ; can be considered in isolation -- its check comes later, will determine itself if the pair can be written
    ; If, however, the first character IS paired, then the write below will cause the second check to FAIL
    ; on the comparison, so the character will not be added to the draw stack. So our first character will
    ; do the job of drawing BOTH characters to the screen.

                sta ScreenBuffer+RAM_WRITE,x    ;5              mark paired character as drawn already (!!)
                ora #$80                        ;2 =  7         DirectDraw this character 'pair'

    ; In the case of columns 0 and 5, the X and Y registers will be the same -- no problemo, because
    ; the last write(below) marks the character as to be direct-drawn.

.notPaired0     sta ScreenBuffer+RAM_WRITE,y    ;5              NEW character to draw + DirectDraw flag (128)

    ; The following 'pla' really just increments the draw-stack pointer.  Value is unimportant. Unusual!

                pla                             ;4              ASSUMPTION IS WE DON'T OvERFLOW DRAW STACK
                tya                             ;2
                tsx                             ;2              << now X holds drawstackpointer
                sta DrawStack+RAM_WRITE,x       ;5 = 18         index of character to draw

.next0          dey                             ;2
                bmi .finishedDrawStack          ;2/3= 4/5

    ; unrolled 2nd loop:
                lda DrawFlag,y                  ;4
                cmp ScreenBuffer,y              ;4              Is the character already there the same as the new one?
                beq .next1                      ;2/3=10/11      yes, so we don't draw anything

                ldx PairedCharacter,y           ;4              the "paired" character for this one
                cmp DrawFlag,x                  ;4              same as partner character in new drawn screen?
                bne .notPaired1                 ;2/3=10/11

                sta ScreenBuffer+RAM_WRITE,x    ;5              mark paired character as drawn already (!!)
                ora #$80                        ;2 =  7         DirectDraw this character 'pair'

.notPaired1     sta ScreenBuffer+RAM_WRITE,y    ;5              NEW character to draw + DirectDraw flag (128)

                pla                             ;4              ASSUMPTION IS WE DON'T OvERFLOW DRAW STACK
                tya                             ;2
                tsx                             ;2
                sta DrawStack+RAM_WRITE,x       ;5 = 18         index of character to draw

.next1          dey                             ;2
                bpl .loopDrawStack              ;2/3= 4/5
;worst case: 111-4
;40 loops(-4), max. 2 calls(+20) -> -160+40=-120, +8 bytes

   ; THE FOLLOWING OPTIMISATION IS STUFFED IF PROCESSOBJSTACK is not first in the vector processor!

.finishedDrawStack
                inc ScreenDrawPhase             ;5 =  5
                tsx                             ;2
                stx DrawStackPointer            ;3
                ldx save_SP                     ;3
                txs                             ;2 = 10

                jmp SwitchObjects

                ;rts                             ;6 =  6

.exitDrawStack
                sty DSL                         ;3 =  3
                tsx                             ;2
                stx DrawStackPointer            ;3
                ldx save_SP                     ;3
                txs                             ;2 = 10
NoBlanks        rts                             ;6 =  6

  • Like 2

Share this post


Link to post
Share on other sites

I only did it for screen drawing code and only when I got close to working. After counting and a little tweaking I'd know I was maxed out or N cycles under, Any changes from then would be small so just add/subtract the changes. Kind of like counting cards. Pretty quick the cycle counts are memorized so pretty easy. I never put cycle counts in the source file. Too easy to get out of sync and then be misleading for me.

  • Like 1

Share this post


Link to post
Share on other sites

I still want a magic ide that auto-inserts/tracks them for me. Why are we doing a computer's job by hand? (and yes, DASM can assemble so quickly and computers are so powerful, that it should even know if it's crossing a page boundary, and highlight that fact for me as I'm going)

Share this post


Link to post
Share on other sites

Yep. I wrapped up my latest round of cycle counting last night:

MainLoop:               ;   11
        stx NUSIZ0      ; 3 14 - now 3 close
        stx NUSIZ1      ; 3 17 - now 3 close
        sta GRP0        ; 3 20 - blank out Stella copy of P0 (not seen on hardware)
        ldx #L8_3_MEDIUM; 2 22 - 3 Copies medium & left 8 (early HMOVE)
        sty GRP1        ; 3 25 @0-41  datastream 5        
        sty GRP0        ; 3 28 @??-30 datastream 1 - ??=0 for real hardware
        SLEEP 2         ; 2 30
        sty GRP0        ; 3 33 @33-35 datastream 3 
        SLEEP 3         ; 3 36
        sty GRP0        ; 3 39 @39-54 datastream A
        SLEEP 2         ; 2 41        
        sty GRP1        ; 3 44 @44-46 datastream 7
        sta RESP0       ; 3 47 @47
        sty GRP1        ; 3 50 @49-51 datastream 9
        sta RESP0       ; 3 53 @53
        stx NUSIZ1      ; 3 56 - now 3 medium
        sta GRP0        ; 3 59 @57-59 - blank out
        sty GRP1        ; 3 62 @55-62 datastream D
        sty GRP0        ; 3 65 @63-65 datastream E
        
EntryEven:              ;   65 
        ldx #L8_3_CLOSE ; 2 67
        stx HMP1        ; 3 70
        sta HMOVE       ; 3 73 @73-74 - move left 8
        sta RESP0       ; 3 76/0
;__________________________________________________  
        dec LoopCounter ; 5  5
        beq KernelDone  ; 2  7 (3 
        stx NUSIZ0      ; 3 10 - now 3 close
        stx NUSIZ1      ; 3 13 - now 3 close
        ldx #R8_3_MEDIUM; 2 15
        sta GRP0        ; 3 18 - blank out Stella copy of P0 (not seen on hardware)
        sty GRP1        ; 3 21 @0-38 datastream 4
        SLEEP 2         ; 2 23
        sty GRP0        ; 3 26 @25-27 datastream 0
        SLEEP 2         ; 2 28
        sty GRP0        ; 3 31 @31-33 datastream 2
        SLEEP 2         ; 2 33
        sta GRP0        ; 3 36 @??-46 blank out       
        sta RESP0       ; 3 39 @39
        sty GRP1        ; 3 42 @41-43 datastream 6
        sta RESP0       ; 3 45 @45
        sty GRP1        ; 3 48 @47-49 datastream 8   
        stx.w NUSIZ1    ; 4 52 - now 3 medium
        sty GRP0        ; 3 55 @49-57 datastream B
        sty GRP1        ; 3 58 @??-59 datastream C
        stx NUSIZ0      ; 3 61 - now 3 medium
        sty GRP0        ; 3 64 @61-67 datastream F
        stx HMP1        ; 3 67
        ldx #R8_3_CLOSE ; 2 69

EntryOdd:               ;   69
        SLEEP 4         ; 4 73
        sta RESP0       ; 3 76/0 @0
;__________________________________________________        
        sta HMOVE       ; 3  3
        dec LoopCounter ; 5  8
        bne MainLoop    ; 2 10 (3 11)

For the bus stuffing 128 pixel display:

post-3056-0-12853100-1472667130_thumb.png post-3056-0-69681000-1472667136_thumb.jpg

 

post-3056-0-41616800-1472667145_thumb.png post-3056-0-33765000-1472667151_thumb.jpg

  • Like 3

Share this post


Link to post
Share on other sites

DASM can assemble so quickly and computers are so powerful, that it should even know if it's crossing a page boundary

macro.h in Collect has a number of macros which take care of that. Search for SAME PAGE BRANCH CHECK as well as DIFFERENT PAGE BRANCH CHECK.

  • Like 1

Share this post


Link to post
Share on other sites

macro.h in Collect has a number of macros which take care of that. Search for SAME PAGE BRANCH CHECK as well as DIFFERENT PAGE BRANCH CHECK.

 

That's slick! But I want it integrated with my editor/IDE. Alongside automatically managed cycle counts.

 

Oh well, I know the drill. If I want it badly enough, I should write it. :)

Share this post


Link to post
Share on other sites

I calculate and write cycle counts in the source whenever timing is crucial.
Example: this is a subroutine that receives a byte of data at 19200 baud from a PC RS232 connected to the 2600 right controller port (using a serial-to-TTL level converter or a USB to TTL-serial adapter).

;   baudrate    mark/space           cpu cycles 
;               duration    2600P   2600N   7800P   7800N
;   
;   19200        52.083 us  61.6    62.1    92.4    93.2
;    9600       104.167 us  123.2   124.3   184.7   186.4
;    4800       208.333 us  246.3   248.6   369,5   372.9
;    2400       416.667 us  492.6   497.2   738.9   745.7
;    1200       833.333 us  985.2   994.3   1477.9  1491.5
;
;   max error allowed 2%
;
;   best values for 19200 baud:  
;   62 cycles in 2600 mode  
;   93 cycles in 7800 mode  
;   
;   PAL:  52.44 us (error +0.69%)
;   NTSC: 51.96 us (error -0.23%)
;
;
;   TTL serial data
;   start bit   = 0
;   stop bits   = 1   
;   idle state  = 1
;
; 
;   right port: pin 1 = TX (SWCHA D0)
;               pin 2 = RX (SWCHA D1)
;



BITIN   =   %00000001
BITOUT  =   %00000010


    SUBROUTINE

rxbyte_26
; end of previous frame stop bit...

    lda #BITIN
; wait for START bit
.loop2
    bit SWCHA       ;4
    bne .loop2      ;2*

    ;2 - 9 cycles after start of start bit ---> average 5

    ; wait about half of the bit duration (31 cycles in 2600 mode)
                    ;       5  (average!)
    ldy #2          ;2      7
.wait               ;           5*y+4=14
    dey             ;       
    bpl .wait       ;
                    ;       21
    nop             ;2      23

; main loop
    ldx #8          ;2      25

.nextbit            ;   56  
    lsr SWCHA       ;6  62  31  carry = received bit. PORT A must be configured as INPUT !!

    ror buffer      ;5  5   
    ldy #8          ;2  7   

.wait2              ;           5*y+4=44
    dey             ;2      
    bpl .wait2      ;2*
                    ;   51  

    dex             ;2  53  
    bpl .nextbit    ;2* 55  


    nop 0           ;3  58
    bit SWCHA       ;4  62  first stop bit
    rts             ;6      returns Z=0 if framing error (A=#BITIN)


Share this post


Link to post
Share on other sites

When counting, do you first do the pass from the top starting from 0, and then when you reach the bottom of your main kernel loop, do you go back up to the top, and continue counting until 76, or the next WSYNC?

 

Am trying to find the best technique to deal with cycle counting overlap.

 

-Thom

Share this post


Link to post
Share on other sites

I am starting from 0 after each WSYNC. In kernels without WSYNC I start with 0 (or more) when the count exceeds 76. So the counts are always in parallel with the real scanline count (see Stella debugger: "Scan Cycle").

Share this post


Link to post
Share on other sites

 

KERNloop:
        LDX #ENABL              ; 3     27
        TXS                     ; 2     29
        LDA SCANLINE            ; 3     32
        EOR BALLY2              ; 3     35
        AND #$FE                ; 2     37
        PHP                     ; 3     40
        LDA SCANLINE            ; 3     43
        EOR BALLY1              ; 3     46
        AND #$FE                ; 2     48
        PHP                     ; 3     51
        INC SCANLINE            ; 5     56
        STA WSYNC               ; 3     59 (---)
        LDA SCANLINE            ; 3     3
        LSR                     ; 2     5
        LSR                     ; 2     7
        LSR                     ; 2     9
        SEC                     ; 2     11
        SBC #$06                ; 2     13
        TAY                     ; 2     15
        STA WSYNC               ; 3     18
        LDA PF0_0,Y             ; 4     22
        STA PF0                 ; 3     25
        LDA PF1_0,Y             ; 4     29
        STA PF1                 ; 3     32
        LDA PF2_0,Y             ; 4     36
        STA PF2                 ; 3     39
        INC SCANLINE            ; 5     44
        STA WSYNC               ; 3     47 (---)
        LDA SCANLINE            ; 3     3
        EOR BALLY0              ; 3     6
        AND #$FC                ; 2     8
        PHP                     ; 3     11
        INC SCANLINE            ; 5     16
        LDA SCANLINE            ; 3     19
        CMP #232                ; 2     21
        BCC KERNloop            ; 2     24

dodgeball.bin

Share this post


Link to post
Share on other sites

Yes, about like that.

 

But I use lower case mnemonics, and camel case variables (to distinguish them from constants).

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

  • Recently Browsing   0 members

    No registered users viewing this page.

×
×
  • Create New...