
cycles used by bB operations


RevEng


I was curious to see how many cycles certain operations took in bB, so I ran a bunch of loops for each operation and subtracted the result from the cycle count (also subtracting the loop overhead).

 

The result was fairly educational. I anticipated many of the results, but perhaps not to the magnitude I discovered. It's certainly helped me in my game tuning, so hopefully someone else finds use for it.

 

No doubt some assembly coders are going to suggest that I could have just added up cycles in the assembly. That gets pretty hairy when you're going through long branching code, and this method IMO was good enough to get a good idea of what the piggie operations are.

 

The results...

 

a=b+c : 11.5 cycles (+-1.3 cycles)

a=b/3 : 460.8 cycles (+-12.8 cycles)

a=b/2 : 7.68 cycles (+-1.3 cycles) *

 

a.a=b.b+c.c : 11.52 cycles (+-1.3 cycles)

a.d=b.b+c.c : 11.52 cycles (+-1.3 cycles)

a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)

a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **

a.d=b.e+c.f : 20.48 cycles (+-1.3 cycles)

 

gosub+return : 25.6 cycles (+-1.3 cycles)

gosub_with_bankswitch+return : 123.73 cycles (+-4.2 cycles)

goto: 2.56 cycles (+-1.28) ***

goto_with_bankswitch : 49.06 cycles (+-4.2)

 

--------------------------------------------------

 

* = x/2 is a lot less than x/3 because bB uses right shifts for divisions by powers of 2.

** = this one is weird. If you add an 8.8 type and a 4.4 type, it's *way* slow if the 8.8 is first.

*** = bB implements this with a jmp, which is 3 cycles. This one was a check of my methods.

 

Note 1:

a,b,c=byte

a.a,b.b,c.c=bB 4.4 fp type

a.d,b.e,c.f=bB 8.8 fp type

 

Note 2:

An empty loop in the bB standard kernel has about 2432 spare cycles (about 3432 if you count the vblank area).

An empty loop in the bB multi-sprite kernel has about 2176 spare cycles (not much more in the vblank).


It's true in all languages.

BTW goto should always use a minimum of 3 cycles. It's possible to skip over an instruction in 2 cycles using .asm, but I doubt bB is that sophisticated to recognise where that can be implemented.

That would require statements compiled as one byte instructions and bB doesn't use too many in general code. There are a few cases of two byte instructions that could use a BIT skip method in an if-then-else situation but the savings would be minimal and cases too rare to bother with.

Goto always takes 3 cycles (non-bankswitch.) Bankswitching always adds overhead even in assembly. Some assembly coders use a jump table and can do inter-bank jumps in 6-9 cycles but those can grow very large and require code to be aligned in different banks. From a compilation standpoint, the method bB used made more sense, and it allows an arbitrary entry point into any bank, but at the expense of cycles (it feeds the stack with the destination address, does the bankswitch, and issues an RTS.)
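The "feed the stack, then RTS" trick relies on the 6502's RTS adding 1 to the address it pulls, so the bytes pushed must encode the destination minus one. Here is a minimal Python sketch of that arithmetic; the function name and the address `0xF4D7` are illustrative assumptions, not part of bB.

```python
def rts_push_bytes(target):
    """Return the (high, low) bytes to push so that RTS lands on `target`.

    RTS pulls a 16-bit address from the stack and resumes at address + 1,
    so the value pushed is target - 1. The high byte is pushed first.
    """
    adjusted = (target - 1) & 0xFFFF
    return adjusted >> 8, adjusted & 0xFF


hi, lo = rts_push_bytes(0xF4D7)  # illustrative destination address
print(f"${hi:02X} ${lo:02X}")    # $F4 $D6
```

This is why the generated code uses expressions of the form `#>(label-1)` and `#<(label-1)` before each PHA.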

 

gosub+return should only take 9 cycles in a 4k game. In a bankswitched game, inter-bank gosub/return is expensive, because you need to do everything for a gosub above but also do a JSR to push the return address, and the return needs to pull the bank to return to from the return address (it is in the upper 3 bits of the high byte) and perform a bankswitch. It's a lot to consider so I recommend using the special "thisbank" and "otherbank" tokens after return statements wherever possible as it will reduce the number of cycles.
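Because the bank is encoded in the upper 3 bits of the return address's high byte, a plain return has to compare those bits against the current bank before deciding whether to switch. The Python sketch below shows one way such a comparison could work; the mask constant, function name, and byte values are assumptions for illustration, not bB's actual routine.

```python
BANK_MASK = 0xE0  # upper 3 bits of the high address byte


def same_bank(return_hi, current_hi):
    """True when both high bytes carry the same upper-3-bit bank tag."""
    return ((return_hi ^ current_hi) & BANK_MASK) == 0


print(same_bank(0xF4, 0xF5))  # True: tags match, a plain RTS is enough
print(same_bank(0xD0, 0xF4))  # False: tags differ, a bankswitch is needed
```

Using "return thisbank" or "return otherbank" tells the compiler the answer ahead of time, so this run-time test can be skipped.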

 

Division is painfully slow for anything but a power-of-two denominator. The division routine is a long-division method using bits in the bytes, which was better overall than a subtraction loop. I recommend avoiding division whenever possible except by a power of two.

 

The reason the 8.8+4.4 takes so long is because it's in a library and requires a bankswitch to get there. Try it in a 4k game.


gosub+return should only take 9 cycles in a 4k game

 

Shouldn't that be 12 cycles?

 

BTW cycle-hungry division routines can be cleaned up by using lookup tables. If bBasic doesn't use some kind of

ON {var1} {var2} = {num1},{num2},{num3}, (etc.)

a workaround isn't that difficult. It's just using var1 as a pointer in a data table to set the new value of var2. Only a few cycles needed if var1 is already an integer.


Thanks for the clarification! I didn't realize some of that behind-the-scenes bankswitching was going on.

 

As you surmised, I did do this in a bankswitched bin. I think that was serendipitous, as I haven't seen some of these penalties highlighted before. (though likely it's somewhere buried deep in this forum's archives!)

 

Unfortunately, I'm not able to fit my current project into 4k, and it's fairly fixed-point heavy. I just tried a test program that stuck some fixed-point math in the last bank, and it seems to hit the same penalty. Is bB doing the bankswitch even when the fixed-point-using-code is in the last bank?


gosub+return should only take 9 cycles in a 4k game

 

Shouldn't that be 12 cycles?

 

BTW cycle-hungry division routines can be cleaned up by using lookup tables. If bBasic doesn't use some kind of

ON {var1} {var2} = {num1},{num2},{num3}, (etc.)

a workaround isn't that difficult. It's just using var1 as a pointer in a data table to set the new value of var2. Only a few cycles needed if var1 is already an integer.

Yeah, 12 cycles :dunce:

 

Anyway, bB's division routine was a compromise between space and cycles. Is there a general routine for division involving tables, or are you talking about specific cases? If the latter, I encourage use of tables in the programmer's reference as they are fast (bB's "data" statement is a lookup table) and something like division could work well there if the programmer has the space.
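For the specific-case approach, the table contents can be generated offline and pasted into a bB "data" statement. The Python sketch below builds a divide-by-3 table for every byte value; the function names are hypothetical, and wrapping the printed rows in bB's data syntax is left to the reader.

```python
def build_div_table(divisor, size=256):
    """Precompute value // divisor for every index from 0 to size - 1."""
    return [value // divisor for value in range(size)]


def format_rows(table, per_row=16):
    """Split the table into comma-separated text rows for pasting."""
    rows = []
    for start in range(0, len(table), per_row):
        chunk = table[start:start + per_row]
        rows.append(", ".join(str(v) for v in chunk))
    return rows


div3 = build_div_table(3)
print(div3[128])  # 42
for row in format_rows(div3)[:2]:
    print(row)
```

A full 256-entry table costs 256 bytes of ROM, so this trades space for cycles, which is the same compromise described above in the other direction.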

 

Thanks for the clarification! I didn't realize some of that behind-the-scenes bankswitching was going on.

 

As you surmised, I did do this in a bankswitched bin. I think that was serendipitous, as I haven't seen some of these penalties highlighted before. (though likely it's somewhere buried deep in this forum's archives!)

 

Unfortunately, I'm not able to fit my current project into 4k, and it's fairly fixed-point heavy. I just tried a test program that stuck some fixed-point math in the last bank, and it seems to hit the same penalty. Is bB doing the bankswitch even when the fixed-point-using-code is in the last bank?

The math routines are placed in the first bank, so I would say yes :) Place your code in the first bank and it should be faster.


The math routines are placed in the first bank, so I would say yes Place your code in the first bank and it should be faster.

Unfortunately this isn't the case. Looking at the generated assembly, it does a bankswitch to the first bank even in the first bank.

 

My original "benchmark" timing of 160 cycles was with the addition happening in the first bank, and it drops to 106 cycles if I change to a 4k bin. (This is all running on Stella, BTW. Not sure if a bankswitch to an existing bank would be any faster on real hardware.)

 

The other odd thing is that 8.8=4.4+8.8 doesn't call the library routines and 8.8=8.8+4.4 does, which is odd because the operands are just swapped. It seems in the former case bB treats the 4.4 as a regular byte instead of a 4.4...

.L09;  dim a_fp44 = a.a
.L010;  dim b_fp44 = b.b
.L011;  dim c_fp44 = c.c
.
.L012;  dim a_fp88 = a.d
.L013;  dim b_fp88 = b.e
.L014;  dim c_fp88 = c.f

.L015;  a_fp88 = 123.123
	LDX #31
	STX d
	LDA #123
	STA a_fp88
.L016;  b_fp88 = 123.123
	LDX #31
	STX e
	LDA #123
	STA b_fp88
.L017;  c_fp88 = 123.123
	LDX #31
	STX f
	LDA #123
	STA c_fp88

.main
; main

.L018;  c_fp88 =  b_fp44  +  a_fp88

	LDA b_fp44
	CLC
	ADC a_fp88
	STA c_fp88
.L019;  c_fp88 =  a_fp88  +  b_fp44

	LDY b_fp44
	LDX d
	LDA a_fp88
sta temp7
lda #>(ret_point1-1)
pha
lda #<(ret_point1-1)
pha
lda #>(Add44to88-1)
pha
lda #<(Add44to88-1)
; ...etc


The math routines are placed in the first bank, so I would say yes Place your code in the first bank and it should be faster.

Unfortunately this isn't the case. Looking at the generated assembly, it does a bankswitch to the first bank even in the first bank.

The code is there to check for bank 1, but apparently it is not working. I see the problem. Fixing...

The other odd thing is that 8.8=4.4+8.8 doesn't call the library routines and 8.8=8.8+4.4 does, which is odd because the operands are just swapped. It seems in the former case bB treats the 4.4 as a regular byte instead of a 4.4...

That is a bug.


  • 7 months later...
I was curious to see how many cycles certain operations took in bB, so I ran a bunch of loops for each operation and subtracted the result from the cycle count (also subtracting the loop overhead).

Can you post a sample program and explain how people with tiny brains can do their own tests like this?

 

Thanks.


Overview...

Create a program that loops the code you want to measure N times. Then...

 

  • Adjust N, recompile, and rerun. Repeat this step with different N values until N gives a cyclescore fairly close to 0.
  • Record the score as TEST_CYCLES.
  • Edit the program and comment out the code in the loop. Recompile and rerun.
  • Record the score as OVERHEAD_CYCLES.
  • CYCLES_PER_COMMAND=(OVERHEAD_CYCLES-TEST_CYCLES)/N
  • MARGIN_OF_ERROR=64/N (bB's cyclescore is accurate to +-64 cycles)

 

Example...

Here's an example for a=b+c...

set debug cycles
set debug cyclescore

dim loop=x

scorecolor=$0f

testloop
rem *** adjust the "to" value in the loop until most of the cycles are used up
for loop = 1 to 100
rem do what you want to measure here 
a=b+c
next
drawscreen
COLUBK=0
goto testloop

So for this example, running with N=100 provides a score of 128.

 

Running the same code without the a=b+c results in a score of 1216

 

CYCLES_PER_COMMAND = (1216-128)/100 = 10.88

 

MARGIN_OF_ERROR = 64/100 = +-0.64
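If you run a lot of these tests, the two formulas are easy to wrap in a small helper. This is a Python sketch, not anything bB provides; the function names are made up for illustration, and the 64-cycle resolution is the cyclescore accuracy quoted in the steps above.

```python
def cycles_per_command(overhead_score, test_score, n):
    """Average cycles used by one pass of the code under test.

    overhead_score: cyclescore with the loop body commented out
    test_score:     cyclescore with the loop body present
    n:              number of loop iterations
    """
    return (overhead_score - test_score) / n


def margin_of_error(n, resolution=64):
    """Worst-case error per pass, given the cyclescore resolution."""
    return resolution / n


print(cycles_per_command(1216, 128, 100))  # 10.88
print(margin_of_error(100))                # 0.64
```

A larger N shrinks the margin of error, which is why it pays to push N up until the score is close to 0.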

 

Extra Credit Work - verifying the method

This example is pretty easy to verify with cycle counting. Looking at the generated assembly code, the code in the loop breaks down to...

       LDA b ; 3
       CLC   ; 2
       ADC c ; 3
       STA a ; 3
             ;=11

10.88 isn't 11, but it's definitely within our margin of error.
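The same check can be written out as a tiny script. This Python sketch sums the cycle counts from the listing above and tests whether the measured value falls inside the margin; the dictionary covers only the four zero-page instructions used here, and the helper names are hypothetical.

```python
# Cycle counts for the instruction forms in the listing (zero-page operands).
CYCLES = {"LDA zp": 3, "CLC": 2, "ADC zp": 3, "STA zp": 3}


def total_cycles(instructions):
    """Add up the cycle cost of a straight-line instruction sequence."""
    return sum(CYCLES[name] for name in instructions)


def within_margin(measured, counted, margin):
    """True when the measured value is within +-margin of the counted one."""
    return abs(measured - counted) <= margin


counted = total_cycles(["LDA zp", "CLC", "ADC zp", "STA zp"])
print(counted)                              # 11
print(within_margin(10.88, counted, 0.64))  # True
```

This only works for straight-line code; once branches or subroutine calls are involved, the loop-timing method is much less effort.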


When I run that program, I see 192. When I remove a=b+c, I see 1280. Does that mean something is wrong with my computer or maybe my version of Stella is different?

 

Thanks.


I doubt there's anything wrong on your end. I suspect we just have 2 different versions of bB, so we have slightly different amounts of overhead.

 

Since the calculations subtract out the overhead for the loop, it doesn't factor into things anyway.
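To see the cancellation with the two sets of scores reported in this thread, here is a short Python sketch. The function name is made up for illustration; the four scores are the ones quoted in the posts.

```python
def cycles_per_command(overhead_score, test_score, n):
    """Average cycles per pass: the score difference divided by loop count."""
    return (overhead_score - test_score) / n


mine = cycles_per_command(1216, 128, 100)   # scores from my build
yours = cycles_per_command(1280, 192, 100)  # scores from the other build
print(mine, yours)    # 10.88 10.88
print(mine == yours)  # True
```

Both scores in the second pair are 64 higher, so the constant offset drops out of the subtraction.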

 

Here are my bins. I think you'll find Stella gives you the same results as I had with these.

 

cyclecount.withcode.bin

cyclecount.nocode.bin

Thanks. Good to know there's nothing wrong. As you said, we both end up with 10.88, so it doesn't matter. Now I can get on with the testing.


  • 6 months later...

I'm trying to figure something out. When I test a = a & %11101111 and a{4}=0 separately, they both give me 512. That makes sense. When I try the following, I get 512 (7.68) too:

 

a = a & %00001111

 

That makes sense.

 

 

But when I try the following, I get 1856 and that's a crazy number:

 

a{4}=0 : a{5}=0 : a{6}=0 : a{7}=0

 

 

Does anyone know why that would be much slower than a = a & %00001111?

 

 

Thanks.

Yes, as bB does each bit assignment separately. I think the equivalent would be:

 

a = a & %11101111 & %11011111 & %10111111 & %01111111

Thanks. That's good to know. Unless I'm dealing with only one bit, I should use AND and OR to speed things up.
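To work out the single mask that replaces several bit assignments, AND the per-bit masks together. This Python sketch (helper names are made up for illustration) confirms that clearing bits 4 to 7 one at a time gives the same result as one AND with %00001111 for every byte value.

```python
def clear_bits_one_at_a_time(value, bits):
    """Clear each listed bit with its own AND, like a{4}=0 : a{5}=0 ..."""
    for bit in bits:
        value &= ~(1 << bit) & 0xFF
    return value


def combined_mask(bits):
    """Fold the per-bit masks into the single mask for one AND."""
    mask = 0xFF
    for bit in bits:
        mask &= ~(1 << bit) & 0xFF
    return mask


bits = [4, 5, 6, 7]
mask = combined_mask(bits)
print(format(mask, "08b"))  # 00001111
print(all(clear_bits_one_at_a_time(v, bits) == v & mask for v in range(256)))  # True
```

The same folding works for setting bits with OR: OR the single-bit values together and apply the result once.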


a=b+c : 11.5 cycles (+-1.3 cycles)

a=b/3 : 460.8 cycles (+-12.8 cycles)

a=b/2 : 7.68 cycles (+-1.3 cycles) *

 

a.a=b.b+c.c : 11.52 cycles (+-1.3 cycles)

a.d=b.b+c.c : 11.52 cycles (+-1.3 cycles)

a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)

a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **

a.d=b.e+c.f : 20.48 cycles (+-1.3 cycles)

 

gosub+return : 25.6 cycles (+-1.3 cycles)

gosub_with_bankswitch+return : 123.73 cycles (+-4.2 cycles)

goto: 2.56 cycles (+-1.28) ***

goto_with_bankswitch : 49.06 cycles (+-4.2)

These numbers are wrong, because of the way you tested them. The easiest way to test them is as follows:

 

loop
  rem * code to be tested goes here
  goto loop

For example:

 

loop
  a = b + c
  goto loop

When you compile and run this, you just get a blank screen, but that's okay, since all you really want to do is see how long the instruction takes. Once the program is up and running in Stella (you do need to be using Stella for this), hit the `/~ key (backwards apostrophe, with a tilde on top, just left of the 1/! key) to switch to the debugger screen. If necessary, scroll the disassembly window until you can see all the code in your little loop. Then add up the cycle counts, but don't add the one for the JMP at the end of the loop. This is what you should get:

 

  ; a = b + c
  LDA $D7 ; 3
  CLC     ; 2
  ADC $D8 ; 3
  STA $D6 ; 3
  ; total = 11

  ; a = b / 3
  ; This one is trickier, because you need to include div_mul.asm,
  ; and must include the cycles used by the division loop,
  ; plus the JSR and RTS. You should also step through the
  ; entire process to see which branches are taken, so you can
  ; add up how many cycles are actually used.
  ; The simplest way to do this is to step until you get to the
  ; start of the loop, note the frame cycle, then step through
  ; until you get to the JMP that loops back, note the new
  ; frame cycles, and subtract to get the cycles used.
  ; When you compile, your addresses may come out different
  ; than the ones shown here:
LF4A1 CPY #$02  ; 2
     BCC LF4AF ; 2
     STY $9C   ; 3
     LDY #$FF  ; 2
LF4A9 SBC $9C   ; 3
     INY       ; 2
     BCS LF4A9 ; 2
     TYA       ; 2
LF4AF RTS       ; 6
LF4B0 LDA $D7   ; 3 <-- start here, F. Cyc = 01451
     LDY #$03  ; 2
     JSR LF4A1 ; 6
     STA $D6   ; 3
     JMP LF4B0 ; 3 <-- stop here, F. Cyc = 01489
  ; total = 1489 - 1451 = 38

  ; a = b / 2
  LDA $D7 ; 3
  LSR A   ; 2
  STA $D6 ; 3
  ; total = 8

  ; a.a = b.b + c.c
  ; dim a1 = a.a
  ; dim b2 = b.b
  ; dim c3 = c.c
  ; a1 = b2 + c3
  LDA $D7 ; 3
  CLC     ; 2
  ADC $D8 ; 3
  STA $D6 ; 3
  ; total = 11

  ; a.d = b.b + c.c
  ; dim a1 = a.d
  ; dim b2 = b.b
  ; dim c3 = c.c
  ; a1 = b2 + c3
  LDA $D7 ; 3
  CLC     ; 2
  ADC $D8 ; 3
  STA $D6 ; 3
  ; total = 11

  ; a.d = b.b + c.f
  ; dim a1 = a.d
  ; dim b2 = b.b
  ; dim c3 = c.f
  ; a1 = b2 + c3
LF49E STA $9C   ; 3
     LDA #$00  ; 2
     ASL $9C   ; 5
     SBC #$00  ; 2
     EOR #$FF  ; 2
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     LDX $9C   ; 3
     RTS       ; 6
LF4C7 STA $9D   ; 3
     STX $9E   ; 3
     TYA       ; 2
     JSR LF49E ; 6
     CLC       ; 2
     STA $9C   ; 3
     TXA       ; 2
     ADC $9E   ; 3
     TAX       ; 2
     LDA $9C   ; 3
     ADC $9D   ; 3
     RTS       ; 6
LF4F2 LDY $D7   ; 3 <-- Start here, F. Cyc = 90192
     LDX $D8   ; 3
     LDA $D8   ; 3
     JSR LF4C7 ; 6
     STX $D9   ; 3
     STA $D6   ; 3
     JMP LF4F2 ; 3 <-- stop here, F. Cyc = 90297
  ; total = 90297 - 90192 = 105

  ; a.d = b.e + c.c
  ; dim a1 = a.d
  ; dim b2 = b.e
  ; dim c3 = c.c
  ; a1 = b2 + c3
LF49E STA $9C   ; 3
     LDA #$00  ; 2
     ASL $9C   ; 5
     SBC #$00  ; 2
     EOR #$FF  ; 2
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     ASL $9C   ; 5
     ROL A     ; 2
     LDX $9C   ; 3
     RTS       ; 6
LF4C7 STA $9D   ; 3
     STX $9E   ; 3
     TYA       ; 2
     JSR LF49E ; 6
     CLC       ; 2
     STA $9C   ; 3
     TXA       ; 2
     ADC $9E   ; 3
     TAX       ; 2
     LDA $9C   ; 3
     ADC $9D   ; 3
     RTS       ; 6
LF4F2 LDY $D8   ; 3 <-- Start here, F. Cyc = 99332
     LDX $DA   ; 3
     LDA $D7   ; 3
     JSR LF4C7 ; 6
     STX $D9   ; 3
     STA $D6   ; 3
     JMP LF4F2 ; 3 <-- stop here, F. Cyc = 99437
  ; total = 99437 - 99332 = 105

  ; a.d = b.e + c.f
  ; dim a1 = a.d
  ; dim b2 = b.e
  ; dim c3 = c.f
  ; a1 = b2 + c3
  LDA $DA ; 3
  CLC     ; 2
  ADC $DB ; 3
  STA $D9 ; 3
  LDA $D7 ; 3
  ADC $D8 ; 3
  STA $D6 ; 3
  ; total = 20

  ; gosub+return
; loop
;    gosub routine
;    goto loop
; routine
;    return
LF487 JSR LF495 ; 6
     JMP LF487 ; 3 <-- don't count this
LF495 RTS       ; 6
  ; total = 12

  ; gosub_with_bankswitch+return
; loop
;    gosub routine bank2
;    goto loop
;    bank 2
; routine
;    return
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 91290
     LDA #$D0     ; 2
     PHA          ; 3
     LDA #$17     ; 2
     PHA          ; 3
     LDA #$F4     ; 2
     PHA          ; 3
     LDA #$D6     ; 2
     PHA          ; 3
     LDA $D4      ; 3
     PHA          ; 3
     TXA          ; 2
     PHA          ; 3
     LDX #$02     ; 2
     JMP LDFED    ; 3
     JMP LD000    ; 3 <-- stop here, F. Cyc = 91412
LFFED LDA LFFF7,X  ; 4
     PLA          ; 4
     TAX          ; 2
     PLA          ; 4
     RTS          ; 6
LF4D7 TSX          ; 2
     LDA CXP0FB,X ; 4
     EOR #$F4     ; 2
     AND #$E0     ; 2
     BEQ LF4E3    ; 2
     JMP LFFDF    ; 3
LF4E3 RTS          ; 6
LFFDF PHA          ; 3
     TXA          ; 2
     PHA          ; 3
     TSX          ; 2
     LDA CXM0FB,X ; 4
     ROL A        ; 2
     ROL A        ; 2
     ROL A        ; 2
     ROL A        ; 2
     AND #$01     ; 2
     TAX          ; 2
     INX          ; 2
LFFED LDA LFFF7,X  ; 4
     PLA          ; 4
     TAX          ; 2
     PLA          ; 4
     RTS          ; 6
  ; total = 91412 - 91290 = 122

(Note: This one jumps around and switches banks, so it's impossible to figure without just stepping through it and noting the frame cycles.)

 

  ; gosub_with_bankswitch+return_with_bankswitch
; (you didn't try this one, it's quicker)
; loop
;    gosub routine bank2
;    goto loop
;    bank 2
; routine
;    return otherbank
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 38801
     LDA #$D0     ; 2
     PHA          ; 3
     LDA #$17     ; 2
     PHA          ; 3
     LDA #$F4     ; 2
     PHA          ; 3
     LDA #$D6     ; 2
     PHA          ; 3
     LDA $D4      ; 3
     PHA          ; 3
     TXA          ; 2
     PHA          ; 3
     LDX #$02     ; 2
     JMP LDFED    ; 3
     JMP LD000    ; 3 <-- stop here, F. Cyc = 38911
  ; total = 38911 - 38801 = 110

(I omitted the bankswitching code to keep the listing simpler.)

 

  ; goto
; loop
;    goto loop
LF48F JMP LF48F ; 3
  ; total = 3

  ; goto_with_bankswitch
; loop
;    goto routine bank2
;    bank 2
; routine
;    goto loop bank1
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 43306
     LDA #$F4     ; 2
     PHA          ; 3
     LDA #$D6     ; 2
     PHA          ; 3
     LDA $D4      ; 3
     PHA          ; 3
     TXA          ; 2
     PHA          ; 3
     LDX #$02     ; 2
     JMP LDFED    ; 3
LFFED LDA LFFF7,X  ; 4
     PLA          ; 4
     TAX          ; 2
     PLA          ; 4
     RTS          ; 6
LF4D7 STA $D4      ; 3 <-- stop here, F. Cyc = 43355
  ; total = 43355 - 43306 = 49

So the correct numbers are as follows:

 

a=b+c : 11 cycles

a=b/3 : 38 cycles

a=b/2 : 8 cycles

 

a.a=b.b+c.c : 11 cycles

a.d=b.b+c.c : 11 cycles

a.d=b.b+c.f : 105 cycles

a.d=b.e+c.c : 105 cycles

a.d=b.e+c.f : 20 cycles

 

gosub+return : 12 cycles

gosub_with_bankswitch+return : 122 cycles

gosub_with_bankswitch+return_with_bankswitch : 110 cycles

goto: 3 cycles

goto_with_bankswitch : 49 cycles

 

I'm not sure why your a=b/3 was so far off.

 

Michael


These numbers are wrong, because of the way you tested them. The easiest way to test them is as follows

There's nothing wrong with the way I tested them, if you don't mind working with a margin of error.

 

The stella method you outline is good for simple short code, but often it's more complex code you're interested in measuring. With my method, even when there are subroutines and bankswitching involved, I can get a "good enough" answer literally in a few seconds. Then I can tweak and re-time the code without having to step into subroutines and add up dozens or hundreds of opcode timings.

 

These are the cases where our answers disagree... (the others were within my stated margin of error)

 

RevEng: a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)

SeaGtGruff: a.d=b.b+c.f : 105 cycles

Comment: The discrepancy is due to a bB bug, reported in this thread and acknowledged by batari. An 8.8=4.4+8.8 should have called the library (with a bankswitch in a bankswitched binary) but it didn't.

 

RevEng: a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **

SeaGtGruff: a.d=b.e+c.c : 105 cycles

Comment: Mine was for a bankswitched binary, so there's bank switching overhead. It was mentioned later in the thread that changing to a 4k binary changed the time to "106.6 cycles (+-4.2 cycles)".

 

RevEng: a=b/3 : 460.8 cycles (+-12.8 cycles)

SeaGtGruff: a=b/3 : 38 cycles

Comment: I used a non-zero value for b to get more of a typical case, since dividing in bB is iterative, and dividing 0 by 3 is an unusually fast corner case. I was also bankswitching. If I use b=0 in a non-banked binary, the answer is 38.66 +-1.33 cycles, which agrees with your result. What does b=128 (the value I used, IIRC) with bankswitching add up to with your method?
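To show how strongly the result depends on b, here is a rough Python model of the subtraction loop in the a=b/3 disassembly posted earlier in the thread. It is only a sketch under stated assumptions: it covers the non-bankswitched listing, assumes no branch crosses a page boundary, leaves out the trailing JMP as that count did, and the function name is made up.

```python
def div_cycles(b, divisor=3):
    """Estimated cycles for a = b / divisor using the listed subtraction loop.

    Only valid for divisor >= 2, the path where the BCC is not taken.
    """
    setup = 3 + 2 + 6          # LDA b, LDY #divisor, JSR
    prologue = 2 + 2 + 3 + 2   # CPY #2, BCC (not taken), STY, LDY #$FF
    quotient = b // divisor
    # Each pass is SBC + INY + BCS: 8 cycles while the branch is taken,
    # 7 on the final pass when it falls through.
    loop = 8 * quotient + 7
    epilogue = 2 + 6 + 3       # TYA, RTS, STA a
    return setup + prologue + loop + epilogue


print(div_cycles(0))    # 38
print(div_cycles(128))  # 374
```

Under this model each extra unit of quotient costs 8 cycles, so b=0 is the best case by a wide margin. Bankswitching overhead would come on top of the estimate.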
