cycles used by bB operations


49 replies to this topic

#1 RevEng OFFLINE  

RevEng

    River Patroller

  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Jun 27, 2009 2:31 PM

I was curious to see how many cycles certain operations took in bB, so I ran a bunch of loops, one for each operation, and subtracted from the cyclecount. (also subtracting the loop overhead)

The result was fairly educational. I anticipated many of the results, but perhaps not to the magnitude I discovered. It's certainly helped me in my game tuning, so hopefully someone else finds use for it.

No doubt some assembly coders are going to suggest that I could have just added up cycles in the assembly. That gets pretty hairy when you're going through long branching code, and this method IMO was good enough to get a good idea of what the piggie operations are.

The results...

a=b+c : 11.5 cycles (+-1.3 cycles)
a=b/3 : 460.8 cycles (+-12.8 cycles)
a=b/2 : 7.68 cycles (+-1.3 cycles) *

a.a=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)
a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **
a.d=b.e+c.f : 20.48 cycles (+-1.3 cycles)

gosub+return : 25.6 cycles (+-1.3 cycles)
gosub_with_bankswitch+return : 123.73 cycles (+-4.2 cycles)
goto: 2.56 cycles (+-1.28) ***
goto_with_bankswitch : 49.06 cycles (+-4.2)

--------------------------------------------------

* = x/2 is a lot less than x/3 because bB uses a right shift (LSR) for divisions by powers of 2.
** = this one is weird. If you add an 8.8 type and a 4.4 type, it's *way* slow if the 8.8 is first.
*** = bB implements this with a jmp, which is 3 cycles. This one was a check of my methods.

Note 1:
a,b,c=byte
a.a,b.b,c.c=bB 4.4 fp type
a.d,b.e,c.f=bB 8.8 fp type

Note 2:
an empty loop in the bB standard kernel has about 2432 spare cycles. (about 3432 if you count the vblank area)
an empty loop in the bB multi-sprite kernel has about 2176 spare cycles. (not much more in the vblank)
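
To put the spare-cycle figures in perspective, a rough per-frame budget can be worked out from the numbers above. This is only a back-of-the-envelope sketch in Python; `ops_per_frame` is a made-up helper name, and the inputs are the approximate measurements from this post, taken as-is.

```python
# Rough per-frame budget using the measured figures from this post.
SPARE_CYCLES_STANDARD = 2432   # empty loop, standard kernel (approximate)

def ops_per_frame(cycles_per_op, spare=SPARE_CYCLES_STANDARD):
    """How many times an operation fits into the spare cycles of one frame."""
    return int(spare // cycles_per_op)

print(ops_per_frame(11.5))     # a=b+c                          -> 211
print(ops_per_frame(123.73))   # gosub with bankswitch + return -> 19
print(ops_per_frame(460.8))    # a=b/3 as measured here         -> 5
```

Under these assumptions, about five of the slow divisions would use up most of a frame's spare time in the standard kernel.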

Edited by RevEng, Sat Jun 27, 2009 2:33 PM.


#2 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sat Jun 27, 2009 2:53 PM

So without a doubt, goto is faster than gosub. Good to know.

#3 Nukey Shay ONLINE  

Nukey Shay

    Sheik Yerbouti

  • 20,921 posts
  • Location:The land of Gorch

Posted Sat Jun 27, 2009 3:07 PM

It's true in all languages.
BTW goto should always use a minimum of 3 cycles. It's possible to skip over an instruction in 2 cycles using .asm, but I doubt bB is sophisticated enough to recognise where that can be implemented.

#4 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,454 posts
  • begin 644 contest

Posted Sat Jun 27, 2009 3:44 PM

It's true in all languages.
BTW goto should always use a minimum of 3 cycles. It's possible to skip over an instruction in 2 cycles using .asm, but I doubt bB is that sophisticated to recognise where that can be implemented.

That would require statements compiled as one byte instructions and bB doesn't use too many in general code. There are a few cases of two byte instructions that could use a BIT skip method in an if-then-else situation but the savings would be minimal and cases too rare to bother with.

#5 MausGames OFFLINE  

MausGames

    Dragonstomper

  • 871 posts
  • Location:MO, USA

Posted Sat Jun 27, 2009 3:53 PM

Thanks for posting this, I'm still a beginner when it comes to efficient code.

#6 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Jun 27, 2009 4:57 PM

You're welcome. Glad it was useful!

#7 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,454 posts
  • begin 644 contest

Posted Sat Jun 27, 2009 4:58 PM

a=b+c : 11.5 cycles (+-1.3 cycles)
a=b/3 : 460.8 cycles (+-12.8 cycles)
a=b/2 : 7.68 cycles (+-1.3 cycles) *

a.a=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)
a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **
a.d=b.e+c.f : 20.48 cycles (+-1.3 cycles)

gosub+return : 25.6 cycles (+-1.3 cycles)
gosub_with_bankswitch+return : 123.73 cycles (+-4.2 cycles)
goto: 2.56 cycles (+-1.28) ***
goto_with_bankswitch : 49.06 cycles (+-4.2)

Goto always takes 3 cycles (non-bankswitch.) Bankswitching always adds overhead even in assembly. Some assembly coders use a jump table and can do inter-bank jumps in 6-9 cycles but those can grow very large and require code to be aligned in different banks. From a compilation standpoint, the method bB used made more sense, and it allows an arbitrary entry point into any bank, but at the expense of cycles (it feeds the stack with the destination address, does the bankswitch, and issues an RTS.)
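
The "feed the stack, bankswitch, RTS" approach relies on how RTS behaves on the 6502: it pulls the low byte, then the high byte, and resumes one byte past that address, so the value pushed has to be the destination minus one. The following is a toy Python model of just that mechanism, not bB's actual code; `push_target`, `rts` and the address `0xD017` are illustrative names and values.

```python
# Toy model of the 6502 "push an address, then RTS" jump.

def push_target(stack, dest):
    """Push (dest - 1), high byte first and then low byte, as two PHAs would."""
    value = (dest - 1) & 0xFFFF
    stack.append(value >> 8)      # high byte
    stack.append(value & 0xFF)    # low byte

def rts(stack):
    """Pull low then high byte; execution resumes at that address + 1."""
    lo = stack.pop()
    hi = stack.pop()
    return (((hi << 8) | lo) + 1) & 0xFFFF

stack = []
push_target(stack, 0xD017)   # hypothetical entry point in another bank
# ... the bankswitch hotspot access would happen here ...
print(hex(rts(stack)))       # 0xd017
```

Because the destination is just two bytes on the stack, any address in any bank can be an entry point, which is the flexibility described above; the pushes and pulls are where the extra cycles go.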

gosub+return should only take 9 cycles in a 4k game. In a bankswitched game, inter-bank gosub/return is expensive, because you need to do everything for a gosub above but also do a JSR to push the return address, and the return needs to pull the bank to return to from the return address (it is in the upper 3 bits of the high byte) and perform a bankswitch. It's a lot to consider so I recommend using the special "thisbank" and "otherbank" tokens after return statements wherever possible as it will reduce the number of cycles.
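
As a rough illustration of the "upper 3 bits of the high byte" check, here is a minimal Python sketch. It assumes the test is an XOR of the return address's high byte with the high byte of the currently running code, masked to the top three bits; the function name and the sample values are only for illustration.

```python
BANK_MASK = 0xE0   # upper 3 bits of the address high byte

def needs_bankswitch(return_hi, current_hi):
    """True when the return address is in a different bank than the code
    executing the return (only the top 3 bits are compared)."""
    return ((return_hi ^ current_hi) & BANK_MASK) != 0

print(needs_bankswitch(0xD0, 0xF4))  # True
print(needs_bankswitch(0xF5, 0xF4))  # False
```

The "thisbank" and "otherbank" tokens presumably save cycles because the answer to this test is then known at compile time instead of being worked out at every return.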

Division is painfully slow for anything but a power-of-two denominator. The division routine is a long-division method using bits in the bytes, which was better overall than a subtraction loop. I recommend avoiding division whenever possible except by a power of two.

The reason the 8.8+4.4 takes so long is because it's in a library and requires a bankswitch to get there. Try it in a 4k game.
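
For readers wondering why a mixed 8.8+4.4 add needs a library routine at all: the 4.4 operand has to be widened to 8.8 before the two can be added. The Python sketch below shows the idea, assuming the 4.4 type is a signed byte with four fraction bits; the function names are made up and this is not the library's code.

```python
def fp44_to_fp88(b):
    """Widen a signed 4.4 byte to a 16-bit 8.8 value (sign-extend, shift left 4)."""
    signed = b - 0x100 if b & 0x80 else b
    return (signed << 4) & 0xFFFF

def fp88_add(x, y):
    """16-bit add; on the 6502 this is two ADCs with the carry passed between them."""
    return (x + y) & 0xFFFF

one_and_half = 0x18        # 1.5 in 4.4
two_and_quarter = 0x0240   # 2.25 in 8.8
print(hex(fp88_add(two_and_quarter, fp44_to_fp88(one_and_half))))  # 0x3c0, i.e. 3.75
```

When both operands are already 8.8 (or both 4.4), no widening is needed, which fits with those cases being so much cheaper.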

#8 Nukey Shay ONLINE  

Nukey Shay

    Sheik Yerbouti

  • 20,921 posts
  • Location:The land of Gorch

Posted Sat Jun 27, 2009 5:31 PM

gosub+return should only take 9 cycles in a 4k game


Shouldn't that be 12 cycles?

BTW cycle-hungry division routines can be cleaned up by using lookup tables. If bBasic doesn't use some kind of
ON {var1} {var2} = {num1},{num2},{num3}, (etc.)
a workaround isn't that difficult. It's just using var1 as a pointer in a data table to set the new value of var2. Only a few cycles needed if var1 is already an integer.
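
The lookup-table idea can be sketched as follows. It is written in Python purely to show the trade-off, with `DIV3` and `div3` as made-up names; on the 2600 the table would live in ROM and the lookup would be an indexed read.

```python
# Precompute b // 3 for every possible byte value; a lookup then replaces
# the division at the cost of 256 bytes of table.
DIV3 = [b // 3 for b in range(256)]

def div3(b):
    """Divide a byte by 3 with a single table read."""
    return DIV3[b]

print(div3(128))  # 42
```

If the value being divided only ever covers a small range, the table can be shortened to match, which keeps the ROM cost down.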

#9 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Jun 27, 2009 5:41 PM

Thanks for the clarification! I didn't realize some of that behind-the-scenes bankswitching was going on.

As you surmised, I did do this in a bankswitched bin. I think that was serendipitous, as I haven't seen some of these penalties highlighted before. (though likely it's somewhere buried deep in this forum's archives!)

Unfortunately, I'm not able to fit my current project into 4k, and it's fairly fixed-point heavy. I just tried a test program that stuck some fixed-point math in the last bank, and it seems to hit the same penalty. Is bB doing the bankswitch even when the fixed-point-using-code is in the last bank?

Edited by RevEng, Sat Jun 27, 2009 6:06 PM.


#10 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Jun 27, 2009 6:12 PM

Ok, I tried 8.8=8.8+4.4 in a 4k bin, and it takes 106.6 cycles (+-4.2 cycles).

Unless I'm doing something horribly wrong, I think I'll keep my FPs in 8.8 to avoid it entirely.

#11 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,454 posts
  • begin 644 contest

Posted Sun Jun 28, 2009 12:26 AM

gosub+return should only take 9 cycles in a 4k game


Shouldn't that be 12 cycles?

BTW cycle-hungry division routines can be cleaned up by using lookup tables. If bBasic doesn't use some kind of
ON {var1} {var2} = {num1},{num2},{num3}, (etc.)
a workaround isn't that difficult. It's just using var1 as a pointer in a data table to set the new value of var2. Only a few cycles needed if var1 is already an integer.

Yeah, 12 cycles :dunce:

Anyway, bB's division routine was a compromise between space and cycles. Is there a general routine for division involving tables, or are you talking about specific cases? If the latter, I encourage use of tables in the programmer's reference as they are fast (bB's "data" statement is a lookup table) and something like division could work well there if the programmer has the space.

Thanks for the clarification! I didn't realize some of that behind-the-scenes bankswitching was going on.

As you surmised, I did do this in a bankswitched bin. I think that was serendipitous, as I haven't seen some of these penalties highlighted before. (though likely it's somewhere buried deep in this forum's archives!)

Unfortunately, I'm not able to fit my current project into 4k, and it's fairly fixed-point heavy. I just tried a test program that stuck some fixed-point math in the last bank, and it seems to hit the same penalty. Is bB doing the bankswitch even when the fixed-point-using-code is in the last bank?

The math routines are placed in the first bank, so I would say yes :) Place your code in the first bank and it should be faster.

#12 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sun Jun 28, 2009 7:59 AM

The math routines are placed in the first bank, so I would say yes. Place your code in the first bank and it should be faster.

Unfortunately this isn't the case. Looking at the generated assembly, it does a bankswitch to the first bank even in the first bank.

My original "benchmark" timing of 160 cycles was with the addition happening in the first bank, and it drops to 106 cycles if I change to a 4k bin. (This is all running on Stella, BTW. Not sure if a bankswitch to an existing bank would be any faster on real hardware.)

The other odd thing is that 8.8=4.4+8.8 doesn't call the library routines and 8.8=8.8+4.4 does, which is odd because the operands are just swapped. It seems in the former case bB treats the 4.4 as a regular byte instead of a 4.4...
.L09;  dim a_fp44 = a.a
.L010;  dim b_fp44 = b.b
.L011;  dim c_fp44 = c.c
.
.L012;  dim a_fp88 = a.d
.L013;  dim b_fp88 = b.e
.L014;  dim c_fp88 = c.f

.L015;  a_fp88 = 123.123
		LDX #31
		STX d
		LDA #123
		STA a_fp88
.L016;  b_fp88 = 123.123
		LDX #31
		STX e
		LDA #123
		STA b_fp88
.L017;  c_fp88 = 123.123
		LDX #31
		STX f
		LDA #123
		STA c_fp88

.main
; main

.L018;  c_fp88 =  b_fp44  +  a_fp88

		LDA b_fp44
		CLC
		ADC a_fp88
		STA c_fp88
.L019;  c_fp88 =  a_fp88  +  b_fp44

		LDY b_fp44
		LDX d
		LDA a_fp88
 sta temp7
 lda #>(ret_point1-1)
 pha
 lda #<(ret_point1-1)
 pha
 lda #>(Add44to88-1)
 pha
 lda #<(Add44to88-1)
; ...etc


#13 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,454 posts
  • begin 644 contest

Posted Sun Jun 28, 2009 10:19 AM

The math routines are placed in the first bank, so I would say yes. Place your code in the first bank and it should be faster.

Unfortunately this isn't the case. Looking at the generated assembly, it does a bankswitch to the first bank even in the first bank.

The code is there to check for bank 1, but apparently it is not working. I see the problem. Fixing...

The other odd thing is that 8.8=4.4+8.8 doesn't call the library routines and 8.8=8.8+4.4 does, which is odd because the operands are just swapped. It seems in the former case bB treats the 4.4 as a regular byte instead of a 4.4...

That is a bug.

#14 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sat Feb 27, 2010 12:09 AM

I was curious to see how many cycles certain operations took in bB, so I ran a bunch of loops, one for each operation, and subtracted from the cyclecount. (also subtracting the loop overhead)

Can you post a sample program and explain how people with tiny brains can do their own tests like this?

Thanks.

#15 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Feb 27, 2010 10:19 AM

Overview...
Create the program that loops the code you want to measure N number of times. Then...

  • Adjust N, recompile, and rerun. Repeat this step with different N values, until N provides a cyclescore fairly close to 0.
  • record score as TEST_CYCLES
  • edit the program and comment out the code in the loop. recompile, and rerun.
  • record score as OVERHEAD_CYCLES
  • CYCLES_PER_COMMAND=(OVERHEAD_CYCLES-TEST_CYCLES)/N
  • MARGIN_OF_ERROR=64/N (bB's cyclescore is accurate to +-64 cycles)

Example...
Here's an example for a=b+c...
 set debug cycles
 set debug cyclescore

 dim loop=x

 scorecolor=$0f

testloop
 rem *** adjust the "to" value in the loop until most of the cycles are used up
 for loop = 1 to 100
	rem do what you want to measure here 
	a=b+c
 next
 drawscreen
 COLUBK=0
 goto testloop
So for this example, running with N=100 provides a score of 128.

Running the same code without the a=b+c results in a score of 1216

CYCLES_PER_COMMAND = (1216-128)/100 = 10.88

MARGIN_OF_ERROR = 64/100 = +-0.64

Extra Credit Work - verifying the method
This example is pretty easy to verify with cycle counting. Looking at the generated assembly code, the code in the loop breaks down to...
        LDA b ; 3
        CLC   ; 2
        ADC c ; 3
        STA a ; 3
              ;=11
10.88 isn't 11, but it's definitely within our margin of error.
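
The arithmetic in the steps above is easy to script if you are testing many operations. Here is a small Python sketch; `cycles_per_command` is just an illustrative name, and the 64-cycle resolution is the cyclescore accuracy quoted above.

```python
def cycles_per_command(overhead_score, test_score, n, resolution=64):
    """Return (cycles per command, margin of error) from two cyclescore readings."""
    cycles = (overhead_score - test_score) / n
    margin = resolution / n
    return cycles, margin

cycles, margin = cycles_per_command(1216, 128, 100)
print(round(cycles, 2), round(margin, 2))   # 10.88 0.64

# The hand count of 11 cycles should fall inside the measured interval.
print(abs(cycles - 11) <= margin)           # True
```

A larger N shrinks the margin of error, but N is limited by how many iterations fit in the spare cycles of one frame.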

#16 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sat Feb 27, 2010 4:31 PM

Thanks. I'm going to go do some testing.

#17 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sat Feb 27, 2010 5:07 PM

So for this example, running with N=100 provides a score of 128.

Running the same code without the a=b+c results in a score of 1216

CYCLES_PER_COMMAND = (1216-128)/100 = 10.88

MARGIN_OF_ERROR = 64/100 = +-0.64

When I run that program, I see 192. When I remove a=b+c, I see 1280. Does that mean something is wrong with my computer or maybe my version of Stella is different?

Thanks.

#18 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Feb 27, 2010 6:20 PM

I doubt there's anything wrong on your end. I suspect we just have 2 different versions of bB, so we have slightly different amounts of overhead.

Since the calculations subtract out the overhead for the loop, it doesn't factor into things anyway.

Here are my bins. I think you'll find Stella gives you the same results as I had with these.

Attached File  cyclecount.withcode.bin   4KB   141 downloads
Attached File  cyclecount.nocode.bin   4KB   141 downloads

#19 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sat Feb 27, 2010 7:18 PM

I doubt there's anything wrong on your end. I suspect we just have 2 different versions of bB, so we have slightly different amounts of overhead.

Since the calculations subtract out the overhead for the loop, it doesn't factor into things anyway.

Here are my bins. I think you'll find Stella gives you the same results as I had with these.

Attached File  cyclecount.withcode.bin   4KB   141 downloads
Attached File  cyclecount.nocode.bin   4KB   141 downloads

Thanks. Good to know there's nothing wrong. As you said, we both end up with 10.88, so it doesn't matter. Now I can get on with the testing.

#20 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Feb 27, 2010 7:38 PM

No problem. I look forward to hearing any discoveries you make! :thumbsup:

#21 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Sun Sep 5, 2010 11:05 PM

I'm trying to figure something out. When I test a = a & %11101111 and a{4}=0 separately, they both give me 512. That makes sense. When I try the following, I get 512 (7.68) too:

a = a & %00001111

That makes sense.


But when I try the following, I get 1856 and that's a crazy number:

a{4}=0 : a{5}=0 : a{6}=0 : a{7}=0


Does anyone know why that would be much slower than a = a & %00001111?


Thanks.

#22 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,454 posts
  • begin 644 contest

Posted Sun Sep 5, 2010 11:55 PM

I'm trying to figure something out. When I test a = a & %11101111 and a{4}=0 separately, they both give me 512. That makes sense. When I try the following, I get 512 (7.68) too:

a = a & %00001111

That makes sense.


But when I try the following, I get 1856 and that's a crazy number:

a{4}=0 : a{5}=0 : a{6}=0 : a{7}=0


Does anyone know why that would be much slower than a = a & %00001111?


Thanks.

Yes, as bB does each bit assignment separately. I think the equivalent would be:

a = a & %11101111 & %11011111 & %10111111 & %01111111

#23 Random Terrain ONLINE  

Random Terrain

    Visual batari Basic User

  • 24,319 posts
  • Controlled Randomness
    Replay Value
    Nonlinear
  • Location:North Carolina (USA)

Posted Mon Sep 6, 2010 12:17 AM

Yes, as bB does each bit assignment separately. I think the equivalent would be:

a = a & %11101111 & %11011111 & %10111111 & %01111111

Thanks. That's good to know. Unless I'm dealing with only one bit, I should use AND and OR to speed things up.
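
A quick way to convince yourself that the combined mask gives the same result as clearing the bits one at a time is to check every byte value. This is a Python sketch of the logic only, with made-up function names; it says nothing about the exact code bB generates.

```python
def clear_bits_one_by_one(a, bits):
    """Clear each bit with its own AND, like a{4}=0 : a{5}=0 : ..."""
    for bit in bits:
        a &= ~(1 << bit) & 0xFF   # one read-modify-write per bit
    return a

def clear_bits_combined(a, bits):
    """Fold all the bits into one constant mask and apply a single AND."""
    mask = 0xFF
    for bit in bits:
        mask &= ~(1 << bit)
    return a & mask

# Both give the same result for every byte value.
print(all(clear_bits_one_by_one(a, [4, 5, 6, 7]) == clear_bits_combined(a, [4, 5, 6, 7])
          for a in range(256)))   # True
```

The results match, so the only difference is cost: four separate load/AND/store sequences versus one.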

#24 SeaGtGruff OFFLINE  

SeaGtGruff

    Quadrunner

  • 5,359 posts
  • Location:Georgia, USA

Posted Fri Sep 10, 2010 10:12 PM

a=b+c : 11.5 cycles (+-1.3 cycles)
a=b/3 : 460.8 cycles (+-12.8 cycles)
a=b/2 : 7.68 cycles (+-1.3 cycles) *

a.a=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.c : 11.52 cycles (+-1.3 cycles)
a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)
a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **
a.d=b.e+c.f : 20.48 cycles (+-1.3 cycles)

gosub+return : 25.6 cycles (+-1.3 cycles)
gosub_with_bankswitch+return : 123.73 cycles (+-4.2 cycles)
goto: 2.56 cycles (+-1.28) ***
goto_with_bankswitch : 49.06 cycles (+-4.2)

These numbers are wrong, because of the way you tested them. The easiest way to test them is as follows:

loop
   rem * code to be tested goes here
   goto loop
For example:

loop
   a = b + c
   goto loop
When you compile and run this, you just get a blank screen, but that's okay, since all you really want to do is see how long the instruction takes. Once the program is up and running in Stella (you do need to be using Stella for this), hit the `/~ key (backwards apostrophe, with a tilde on top, just left of the 1/! key) to switch to the debugger screen. If necessary, scroll the disassembly window until you can see all the code in your little loop. Then add up the cycle counts, but don't add the one for the JMP at the end of the loop. This is what you should get:

   ; a = b + c
   LDA $D7 ; 3
   CLC     ; 2
   ADC $D8 ; 3
   STA $D6 ; 3
   ; total = 11
   ; a = b / 3
   ; This one is trickier, because you need to include div_mul.asm,
   ; and must include the cycles used by the division loop,
   ; plus the JSR and RTS. You should also step through the
   ; entire process to see which branches are taken, so you can
   ; add up how many cycles are actually used.
   ; The simplest way to do this is to step until you get to the
   ; start of the loop, note the frame cycle, then step through
   ; until you get to the JMP that loops back, note the new
   ; frame cycles, and subtract to get the cycles used.
   ; When you compile, your addresses may come out different
   ; than the ones shown here:
LF4A1 CPY #$02  ; 2
      BCC LF4AF ; 2
      STY $9C   ; 3
      LDY #$FF  ; 2
LF4A9 SBC $9C   ; 3
      INY       ; 2
      BCS LF4A9 ; 2
      TYA       ; 2
LF4AF RTS       ; 6
LF4B0 LDA $D7   ; 3 <-- start here, F. Cyc = 01451
      LDY #$03  ; 2
      JSR LF4A1 ; 6
      STA $D6   ; 3
      JMP LF4B0 ; 3 <-- stop here, F. Cyc = 01489
   ; total = 1489 - 1451 = 38
   ; a = b / 2
   LDA $D7 ; 3
   LSR A   ; 2
   STA $D6 ; 3
   ; total = 8
   ; a.a = b.b + c.c
   ; dim a1 = a.a
   ; dim b2 = b.b
   ; dim c3 = c.c
   ; a1 = b2 + c3
   LDA $D7 ; 3
   CLC     ; 2
   ADC $D8 ; 3
   STA $D6 ; 3
   ; total = 11
   ; a.d = b.b + c.c
   ; dim a1 = a.d
   ; dim b2 = b.b
   ; dim c3 = c.c
   ; a1 = b2 + c3
   LDA $D7 ; 3
   CLC     ; 2
   ADC $D8 ; 3
   STA $D6 ; 3
   ; total = 11
   ; a.d = b.b + c.f
   ; dim a1 = a.d
   ; dim b2 = b.b
   ; dim c3 = c.f
   ; a1 = b2 + c3
LF49E STA $9C   ; 3
      LDA #$00  ; 2
      ASL $9C   ; 5
      SBC #$00  ; 2
      EOR #$FF  ; 2
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      LDX $9C   ; 3
      RTS       ; 6
LF4C7 STA $9D   ; 3
      STX $9E   ; 3
      TYA       ; 2
      JSR LF49E ; 6
      CLC       ; 2
      STA $9C   ; 3
      TXA       ; 2
      ADC $9E   ; 3
      TAX       ; 2
      LDA $9C   ; 3
      ADC $9D   ; 3
      RTS       ; 6
LF4F2 LDY $D7   ; 3 <-- Start here, F. Cyc = 90192
      LDX $D8   ; 3
      LDA $D8   ; 3
      JSR LF4C7 ; 6
      STX $D9   ; 3
      STA $D6   ; 3
      JMP LF4F2 ; 3 <-- stop here, F. Cyc = 90297
   ; total = 90297 - 90192 = 105
   ; a.d = b.e + c.c
   ; dim a1 = a.d
   ; dim b2 = b.e
   ; dim c3 = c.c
   ; a1 = b2 + c3
LF49E STA $9C   ; 3
      LDA #$00  ; 2
      ASL $9C   ; 5
      SBC #$00  ; 2
      EOR #$FF  ; 2
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      ASL $9C   ; 5
      ROL A     ; 2
      LDX $9C   ; 3
      RTS       ; 6
LF4C7 STA $9D   ; 3
      STX $9E   ; 3
      TYA       ; 2
      JSR LF49E ; 6
      CLC       ; 2
      STA $9C   ; 3
      TXA       ; 2
      ADC $9E   ; 3
      TAX       ; 2
      LDA $9C   ; 3
      ADC $9D   ; 3
      RTS       ; 6
LF4F2 LDY $D8   ; 3 <-- Start here, F. Cyc = 99332
      LDX $DA   ; 3
      LDA $D7   ; 3
      JSR LF4C7 ; 6
      STX $D9   ; 3
      STA $D6   ; 3
      JMP LF4F2 ; 3 <-- stop here, F. Cyc = 99437
   ; total = 99437 - 99332 = 105
   ; a.d = b.e + c.f
   ; dim a1 = a.d
   ; dim b2 = b.e
   ; dim c3 = c.f
   ; a1 = b2 + c3
   LDA $DA ; 3
   CLC     ; 2
   ADC $DB ; 3
   STA $D9 ; 3
   LDA $D7 ; 3
   ADC $D8 ; 3
   STA $D6 ; 3
   ; total = 20
   ; gosub+return
; loop
;    gosub routine
;    goto loop
; routine
;    return
LF487 JSR LF495 ; 6
      JMP LF487 ; 3 <-- don't count this
LF495 RTS       ; 6
   ; total = 12
   ; gosub_with_bankswitch+return
; loop
;    gosub routine bank2
;    goto loop
;    bank 2
; routine
;    return
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 91290
      LDA #$D0     ; 2
      PHA          ; 3
      LDA #$17     ; 2
      PHA          ; 3
      LDA #$F4     ; 2
      PHA          ; 3
      LDA #$D6     ; 2
      PHA          ; 3
      LDA $D4      ; 3
      PHA          ; 3
      TXA          ; 2
      PHA          ; 3
      LDX #$02     ; 2
      JMP LDFED    ; 3
      JMP LD000    ; 3 <-- stop here, F. Cyc = 91412
LFFED LDA LFFF7,X  ; 4
      PLA          ; 4
      TAX          ; 2
      PLA          ; 4
      RTS          ; 6
LF4D7 TSX          ; 2
      LDA CXP0FB,X ; 4
      EOR #$F4     ; 2
      AND #$E0     ; 2
      BEQ LF4E3    ; 2
      JMP LFFDF    ; 3
LF4E3 RTS          ; 6
LFFDF PHA          ; 3
      TXA          ; 2
      PHA          ; 3
      TSX          ; 2
      LDA CXM0FB,X ; 4
      ROL A        ; 2
      ROL A        ; 2
      ROL A        ; 2
      ROL A        ; 2
      AND #$01     ; 2
      TAX          ; 2
      INX          ; 2
LFFED LDA LFFF7,X  ; 4
      PLA          ; 4
      TAX          ; 2
      PLA          ; 4
      RTS          ; 6
   ; total = 91412 - 91290 = 122
(Note: This one jumps around and switches banks, so it's impossible to figure without just stepping through it and noting the frame cycles.)

   ; gosub_with_bankswitch+return_with_bankswitch
; (you didn't try this one, it's quicker)
; loop
;    gosub routine bank2
;    goto loop
;    bank 2
; routine
;    return otherbank
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 38801
      LDA #$D0     ; 2
      PHA          ; 3
      LDA #$17     ; 2
      PHA          ; 3
      LDA #$F4     ; 2
      PHA          ; 3
      LDA #$D6     ; 2
      PHA          ; 3
      LDA $D4      ; 3
      PHA          ; 3
      TXA          ; 2
      PHA          ; 3
      LDX #$02     ; 2
      JMP LDFED    ; 3
      JMP LD000    ; 3 <-- stop here, F. Cyc = 38911
   ; total = 38911 - 38801 = 110
(I omitted the bankswitching code to keep the listing simpler.)

   ; goto
; loop
;    goto loop
LF48F JMP LF48F ; 3
   ; total = 3
   ; goto_with_bankswitch
; loop
;    goto routine bank2
;    bank 2
; routine
;    goto loop bank1
LD000 STA $D4      ; 3 <-- start here, F. Cyc = 43306
      LDA #$F4     ; 2
      PHA          ; 3
      LDA #$D6     ; 2
      PHA          ; 3
      LDA $D4      ; 3
      PHA          ; 3
      TXA          ; 2
      PHA          ; 3
      LDX #$02     ; 2
      JMP LDFED    ; 3
LFFED LDA LFFF7,X  ; 4
      PLA          ; 4
      TAX          ; 2
      PLA          ; 4
      RTS          ; 6
LF4D7 STA $D4      ; 3 <-- stop here, F. Cyc = 43355
   ; total = 43355 - 43306 = 49
So the correct numbers are as follows:

a=b+c : 11 cycles
a=b/3 : 38 cycles
a=b/2 : 8 cycles

a.a=b.b+c.c : 11 cycles
a.d=b.b+c.c : 11 cycles
a.d=b.b+c.f : 105 cycles
a.d=b.e+c.c : 105 cycles
a.d=b.e+c.f : 20 cycles

gosub+return : 12 cycles
gosub_with_bankswitch+return : 122 cycles
gosub_with_bankswitch+return_with_bankswitch : 110 cycles
goto: 3 cycles
goto_with_bankswitch : 49 cycles
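
To see which of the loop-timed figures actually fall outside their stated margins, the two sets of numbers can be compared mechanically. The Python sketch below simply copies the measured values and margins from the first post and the hand counts from the list above; the dictionary and function names are made up.

```python
# Measured (cycles, margin) from the first post, and the hand counts listed above.
measured = {
    "a=b+c": (11.5, 1.3),
    "a=b/3": (460.8, 12.8),
    "a=b/2": (7.68, 1.3),
    "a.a=b.b+c.c": (11.52, 1.3),
    "a.d=b.b+c.c": (11.52, 1.3),
    "a.d=b.b+c.f": (11.52, 1.3),
    "a.d=b.e+c.c": (160.0, 6.4),
    "a.d=b.e+c.f": (20.48, 1.3),
    "gosub+return": (25.6, 1.3),
    "gosub_with_bankswitch+return": (123.73, 4.2),
    "goto": (2.56, 1.28),
    "goto_with_bankswitch": (49.06, 4.2),
}
counted = {
    "a=b+c": 11, "a=b/3": 38, "a=b/2": 8,
    "a.a=b.b+c.c": 11, "a.d=b.b+c.c": 11, "a.d=b.b+c.f": 105,
    "a.d=b.e+c.c": 105, "a.d=b.e+c.f": 20,
    "gosub+return": 12, "gosub_with_bankswitch+return": 122,
    "goto": 3, "goto_with_bankswitch": 49,
}

def disagreements(measured, counted):
    """Operations whose hand count lies outside measured +- margin."""
    return [op for op, (cycles, margin) in measured.items()
            if abs(cycles - counted[op]) > margin]

print(disagreements(measured, counted))
# ['a=b/3', 'a.d=b.b+c.f', 'a.d=b.e+c.c', 'gosub+return']
```

Eight of the twelve agree within the stated margin. The plain gosub+return figure also lands outside it; since the original measurement was taken in a bankswitched binary, the extra cycles may come from the bank check that return performs there, though that is a guess.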

I'm not sure why your a=b/3 was so far off.

Michael

#25 RevEng OFFLINE  

RevEng

    River Patroller

  • Topic Starter
  • 3,150 posts
  • bit player
  • Location:Canada

Posted Sat Sep 11, 2010 1:28 AM

These numbers are wrong, because of the way you tested them. The easiest way to test them is as follows

There's nothing wrong with the way I tested them, if you don't mind working with a margin of error.

The stella method you outline is good for simple short code, but often it's more complex code you're interested in measuring. With my method, even when there are subroutines and bankswitching involved, I can get a "good enough" answer literally in a few seconds. Then I can tweak and re-time the code without having to step into subroutines and add up dozens or hundreds of opcode timings.

These are the cases where our answers disagree... (the others were within my stated margin of error)

RevEng: a.d=b.b+c.f : 11.52 cycles (+-1.3 cycles)
SeaGtGruff: a.d=b.b+c.f : 105 cycles
Comment: The discrepancy is due to a bB bug, reported in this thread and acknowledged by batari. An 8.8=4.4+8.8 should have called the library (with a bankswitch in a bankswitched binary) but it didn't.

RevEng: a.d=b.e+c.c : 160.0 cycles (+-6.4 cycles) **
SeaGtGruff: a.d=b.e+c.c : 105 cycles
Comment: Mine was for a bankswitched binary, so there's bank switching overhead. It was mentioned later in the thread that changing to a 4k binary changed the time to "106.6 cycles (+-4.2 cycles)".

RevEng: a=b/3 : 460.8 cycles (+-12.8 cycles)
SeaGtGruff: a=b/3 : 38 cycles
Comment: I used a non-zero value for b to get more of a typical case, since dividing in bB is iterative, and dividing 0 by 3 is an unusually fast corner case. I was also bankswitching. If I use b=0 in a non-bankswitched binary, the answer is 38.66 +-1.33 cycles, which agrees with your result. What does using b=128 (the value I used, IIRC) with bankswitching add up to with your method?
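
The dependence on b can be estimated from the subtraction loop in the disassembly quoted in the previous post. The Python sketch below adds up those opcode timings, assuming a taken branch costs 3 cycles with no page crossing, a divisor of 2 or more, and no bankswitching; `div_cycles` is a made-up name and the result is an estimate, not a measurement.

```python
def div_cycles(b, divisor):
    """Estimated cycles for a = b / divisor with the subtraction-loop routine,
    non-bankswitched, counted from the LDA through the STA as in the listing."""
    # LDA, LDY, JSR, CPY, BCC (not taken), STY, LDY, TYA, RTS, STA
    fixed = 3 + 2 + 6 + 2 + 2 + 3 + 2 + 2 + 6 + 3
    taken = 3 + 2 + 3    # SBC, INY, BCS taken (loops once per whole divisor)
    final = 3 + 2 + 2    # SBC, INY, BCS not taken (the pass that borrows)
    return fixed + (b // divisor) * taken + final

print(div_cycles(0, 3))    # 38
print(div_cycles(128, 3))  # 374
```

With b=0 this reproduces the 38 cycles from the stepping method, and it grows by 8 cycles for every time the divisor fits into b. Bankswitching overhead is not modelled here, so the b=128 estimate cannot be compared directly with the 460.8-cycle figure.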

Edited by RevEng, Sat Sep 11, 2010 11:45 AM.




