CPU Speed Test

flashjazzcat · April 4, 2015

My software needs to not only detect the presence of a 65C816 CPU (taken care of), but also compute and report the CPU speed. The way I've gone about this (assuming a 6502 for the moment) is to run a block of code which updates a counter while waiting for a full frame to elapse (with DMA off, of course). We know the cycle count of the code block, so armed with the final value of the counter, we can compute the number of cycles per frame.

The first thing I ran into (Antic DMA being disabled) is RAM refresh, which I understand takes 9 cycles per scan line. Factoring this into the calculation seemed to yield a result closer to what's expected: i.e. ($9B * 2) * 9 added to the overall cycle count in PAL mode. The figures soon drift wildly off when the CPU is a 65C816, however. I'm not sure if this is down to my misunderstanding the cycle counts of 6502 instructions in emulation mode, or not properly accounting for RAM refresh cycles.

I notice that SysInfo computes the CPU speed quite accurately, so are there any special compensations required when trying to work out the speed of a 65C816 CPU running in emulation mode on an A8? Or is there a better way of measuring speed entirely?

Rybags · April 5, 2015

You should be able to get a pretty accurate reading using an entire frame.

Jitter might cause a little variation, but a few cycles out of ~27,000 shouldn't mean much. You could also lock the start of the process with the help of WSYNC.

Another consideration - does the '816 actually lose "all" of the cycles that are occupied by the ANTIC refresh?

It might lose none, only one, or an equivalent number of fast cycles, e.g. if the CPU multiplier factor is 8, you lose 8 fast cycles for each refresh cycle.

Other consideration - variance in speed depending on whether the program is running in or accessing system RAM, expansion RAM, system ROM or expansion ROM.

Other consideration - access to hardware registers. If e.g. you're monitoring VCOUNT then those accesses would likely occur at 1.79 MHz.

Realistically though, if you have a count/wait loop that allows 40 cycles of inaccuracy, in an NTSC frame of 27,510 usable cycles that should give an error factor of ±0.145%.
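
As a rough check of that figure, here is a minimal Python sketch. It assumes the usual Atari 8-bit values of 114 machine cycles per scan line and 262 lines per NTSC frame, with 9 refresh cycles removed from every line; the constant names are illustrative and not from the thread.

```python
# Assumed NTSC frame geometry; the 9 refresh cycles per line and the
# 40-cycle slack are the figures quoted in the thread.
CYCLES_PER_LINE = 114
NTSC_LINES = 262
REFRESH_CYCLES_PER_LINE = 9
LOOP_SLACK_CYCLES = 40  # worst-case inaccuracy of the count/wait loop

# Total machine cycles in one frame, before anything is stolen.
frame_cycles = CYCLES_PER_LINE * NTSC_LINES

# Cycles left for the CPU once refresh has taken its share of every line.
usable_cycles = frame_cycles - REFRESH_CYCLES_PER_LINE * NTSC_LINES

# Relative error if the loop is off by the full slack.
error_percent = 100.0 * LOOP_SLACK_CYCLES / usable_cycles

print(frame_cycles, usable_cycles, round(error_percent, 3))
```

Under these assumptions the 27,510 figure is the frame length less the refresh cycles, which is why it is lower than the full cycle count of the frame.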

+JAC! · April 5, 2015

I'd just go for a fast inner loop with a fixed cycle count, adding 1 to a 24-bit value, and then use the VBI to check how far it got after, let's say, 50 frames.

Rybags · April 5, 2015

I'd go for a bit less - a whole second might look unprofessional.

A known loop quantity that lasted 2 frames would probably be sufficient; then calculate CPU speed based on how much time actually elapsed.

flashjazzcat · April 5, 2015

I'd have thought a single frame would be adequate too. Here's the (rough) code:

	.local GetCPUInfo
	lda #0
	sta nmien
	sta KernelPtr1
	sta KernelPtr1+1
	sta dmactl
	sei
@
	lda vcount ; wait for vcount = 0
	bne @-
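Loop1
	lda KernelPtr1 ; 3 (zero page assumed); c. 25 cycles on 6502 for the whole pass
	clc ; 2
	adc #1 ; 2
	sta KernelPtr1 ; 3
	lda KernelPtr1+1 ; 3
	adc #0 ; 2
	sta KernelPtr1+1 ; 3

	lda vcount ; 4, loop while vcount = 0
	beq Loop1 ; 3 when taken, no page crossing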
	
Loop2
	lda KernelPtr1
	clc
	adc #1
	sta KernelPtr1
	lda KernelPtr1+1
	adc #0
	sta KernelPtr1+1
	
	lda vcount ; loop till vcount = 0
	bne Loop2
	
	ldax KernelPtr1 ; return # iterations in a,x
	rts
	.endl

On the 6502, the counter is bumped 1,255 times. If we assume 25 cycles for the whole block (branch always taken in all except two cases, no page crossing), then that's 1,255 * 25 = 31,375. Nine refresh cycles per scan line (PAL) = 310 * 9 = 2,790.

31,375 + 2,790 = 34,165.

34,165 * 50 = 1,708,250.

Not quite 1.79MHz, so already we have some unaccounted for cycles (assuming I added things up right).
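
The arithmetic above can be reproduced with a short Python sketch. The function name and parameters are illustrative only; the inputs (1,255 iterations, 25 cycles per pass, 310 lines with 9 refresh cycles each, 50 frames per second) are the ones used in this post.

```python
def estimate_clock_hz(iterations, cycles_per_iteration,
                      refresh_cycles, frames_per_second):
    """Estimate clock speed from one frame's worth of loop iterations."""
    # Cycles the CPU spent executing the counting loop during the frame.
    loop_cycles = iterations * cycles_per_iteration
    # Add back the cycles assumed to be stolen by RAM refresh, then scale
    # the per-frame total up to one second.
    return (loop_cycles + refresh_cycles) * frames_per_second


refresh = 310 * 9  # refresh cycles per frame as counted in the post
estimate = estimate_clock_hz(1255, 25, refresh, 50)
print(estimate)
```

The sketch only restates the calculation; it does not explain where the remaining cycles go.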

We get the same number of iterations on a 65C816 @ 1.79MHz.

65C816 @ 7.14MHz:

4,443 iterations. 4,443 * 25 = 111,075. Add 2,790 (RAM refresh) = 113,865. Multiply by fifty frames and we get 5,693,250, which is just way, way off what we're aiming at. It's possible the inaccuracy at 1.79MHz is simply being amplified as the CPU speed increases, but I'm a bit stuck as to how to fix it up.
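
A quick Python sketch of the same calculation shows how sensitive the result is to the refresh assumption. The product 4,443 * 25 comes to 111,075. The `refresh_cost_multiplier` parameter is hypothetical: 1 treats each refresh cycle as one CPU cycle, as in the post, while 4 models the possibility raised earlier in the thread that each refresh cycle costs a full slow-clock period, i.e. about four fast cycles at 7.14MHz.

```python
ITERATIONS = 4443
CYCLES_PER_ITERATION = 25
REFRESH_SLOW_CYCLES = 310 * 9   # refresh cycles per frame as counted in the post
FRAMES_PER_SECOND = 50

loop_cycles = ITERATIONS * CYCLES_PER_ITERATION


def estimate_hz(refresh_cost_multiplier):
    """Estimate clock speed, charging each refresh cycle at the given cost."""
    stolen = REFRESH_SLOW_CYCLES * refresh_cost_multiplier
    return (loop_cycles + stolen) * FRAMES_PER_SECOND


as_posted = estimate_hz(1)       # refresh counted in CPU cycles one for one
refresh_scaled = estimate_hz(4)  # assumption: one refresh cycle = 4 fast cycles

print(loop_cycles, as_posted, refresh_scaled)
```

Even with refresh scaled up, the estimate stays well short of 7.14MHz, so refresh accounting alone does not appear to close the gap.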

Rybags · April 5, 2015

1.79 MHz is the NTSC speed. Also note the frame rates for both are slightly less than 50/60 Hz. The actual rates are selectable in Altirra's Speed Options.

Shouldn't that be 312*9 for the refresh cycles?

Though even counting that, the speed figure still seems too low for PAL (~1.77 MHz).
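
To see the size of the gap without depending on the exact frame rate, the measured cycles can be compared with the length of a PAL frame. This is a minimal Python sketch assuming 114 machine cycles per scan line and 312 lines per PAL frame; the variable names are illustrative.

```python
CYCLES_PER_LINE = 114
PAL_LINES = 312
REFRESH_CYCLES_PER_LINE = 9

ITERATIONS = 1255          # counter value reported for the stock 6502
CYCLES_PER_ITERATION = 25  # cycle count assumed for one pass of the loop

# Machine cycles in a full PAL frame.
expected_cycles = CYCLES_PER_LINE * PAL_LINES

# Cycles accounted for: loop execution plus refresh on all 312 lines.
accounted_cycles = (ITERATIONS * CYCLES_PER_ITERATION
                    + PAL_LINES * REFRESH_CYCLES_PER_LINE)

shortfall = expected_cycles - accounted_cycles
shortfall_percent = 100.0 * shortfall / expected_cycles

print(expected_cycles, accounted_cycles, shortfall, round(shortfall_percent, 2))
```

Under these assumptions roughly 4% of the frame is still unaccounted for after the 312-line correction.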

flashjazzcat · April 5, 2015

Good corrections, but yeah: we're still missing a bunch of cycles somewhere, and it becomes wildly inaccurate at 65C816 speeds. Konrad obviously figured it out.

Edited April 5, 2015 by flashjazzcat

Rybags · April 5, 2015

A shorter loop would make for better precision... maybe just using the registers for counters rather than z-page?

  ldx #0
  ldy #0
wait1  lda vcount
  bne wait1
count1 inx
  bne wait1_1
  iny
wait1_1 lda vcount
  beq count1

... and repeat for VCOUNT <> 0 case.

That's 12 cycles per iteration by the standard 6502 counts (INX 2, BNE taken 3, LDA absolute 4, BEQ taken 3), but it requires count correction since there will be 13-cycle cases when Y is incremented.

Should be OK on really fast machines: a 14 MHz machine would have 284,544 cycles per frame, so overflow won't occur with 16 bits' worth of counter.
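
The overflow claim can be checked with a small Python sketch. It assumes a PAL frame of 312 lines at 114 machine cycles each, a clock multiplier of 8 for the 14 MHz case, and a deliberately low bound of 12 fast cycles per iteration (INX 2, BNE taken 3, LDA absolute 4, BEQ taken 3 by standard 6502 counts). All names are illustrative.

```python
PAL_FRAME_CYCLES = 114 * 312      # machine cycles per frame at the base clock
CLOCK_MULTIPLIER = 8              # roughly 14 MHz
FASTEST_ITERATION_CYCLES = 12     # assumed lower bound for one loop pass
COUNTER_LIMIT = 1 << 16           # a 16-bit counter wraps at 65,536

# Fast-clock cycles available in one frame.
fast_cycles_per_frame = PAL_FRAME_CYCLES * CLOCK_MULTIPLIER

# Upper bound on iterations: every pass takes the minimum time and nothing
# is lost to refresh or to syncing with the slow bus.
max_iterations = fast_cycles_per_frame // FASTEST_ITERATION_CYCLES

fits_in_16_bits = max_iterations < COUNTER_LIMIT
print(fast_cycles_per_frame, max_iterations, fits_in_16_bits)
```

Since hardware register reads on an accelerated machine are likely to take more than four fast cycles, the real count should sit comfortably below this bound.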

flashjazzcat · April 5, 2015

Good stuff - thanks: I'll give this a try and see how it goes.

+bob1200xl · April 5, 2015

What are you looking for - CPU clock speed or execution speed? Regardless of the clock speed, some cycles are not available for executing instructions. So, even at 1.79 MHz, where all clock cycles are the same, you have a significant reduction in possible execution speed.

In a 65816 that is running higher-speed clocks, you have to halt the CPU and align the clocks to the 1.79 MHz hardware before you do things like access ANTIC, GTIA, REFRESH, and such. How many cycles are 'wasted' syncing to the system clock is not deterministic - the sync may occur anywhere in the sequence. (Well, that's a strong statement that may not be correct, but I certainly wouldn't want to try and figure it out.)

I would say that you should leave all the processes running and count how many times you can execute a simple routine, if you want execution speed. The results will vary all over the place, depending on the graphics mode and such, but that's the reality of it.

For clock speed, I would turn everything off and execute a timing routine after REFRESH is finished for the frame. This will be tricky since you cannot access things like VCOUNT hardware or set interrupts for anything in the middle of the routine.

Bob

ClausB · April 5, 2015

At 7 MHz the 65C816 takes more than 4 cycles to LDA VCOUNT because it's reading from the slower ANTIC, no? Is the same true for RAM access or is there faster local RAM?

flashjazzcat · April 5, 2015

I might drop Konrad an email and see if he fancies sharing his approach to the problem. In any case, on-paper clock speed is what I'm really looking for, but if that can't be reliably attained, we'd have to settle for effective speed. What's reported by Konrad's SysInfo seems acceptably accurate.

Obviously DMA's off and all interrupts are disabled, so hopefully the only variance we're left with is refresh and any clock syncing for 65C816. The test code is run in RAM, so on the face of it, basic conditions are the same as those under which SysInfo runs.

+bob1200xl · April 5, 2015

ClausB said: "At 7 MHz the 65C816 takes more than 4 cycles to LDA VCOUNT because it's reading from the slower ANTIC, no? Is the same true for RAM access or is there faster local RAM?"

I can only answer for the XL14 hardware. LDA VCOUNT will take 70 ns + 70 ns + 70 ns + SYNC + 560 ns, SYNC being the time required to align the fast clock with the slow clock - up to 490 ns (at 14.32 MHz). The typical mode is to fetch the opcode and operands from RAM at 14.32 MHz and the data from legacy hardware at 1.79 MHz.

One strategy may be to address VCOUNT at the beginning of the routine and then cycle-count from that point onward. This may work since executing an LDA VCOUNT (or any other hardware register access) always leaves you in sync. The second LDA VCOUNT will always take 210 ns + 350 ns + 560 ns.
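
The timings quoted above can be tabulated with a short Python sketch. The 70 ns and 560 ns periods and the 490 ns maximum SYNC are taken from this post; the function and constant names are illustrative, and the model is only a restatement of the quoted figures, not a description of the XL14 logic.

```python
FAST_NS = 70       # approximate period of the 14.32 MHz clock
SLOW_NS = 560      # approximate period of the 1.79 MHz clock
MAX_SYNC_NS = 490  # longest wait to align the fast clock with the slow one


def first_access_ns(sync_ns):
    """Time for LDA VCOUNT when the clocks are not yet aligned."""
    if not 0 <= sync_ns <= MAX_SYNC_NS:
        raise ValueError("sync_ns outside the range quoted in the post")
    # Three fast fetches (opcode and two operand bytes), the sync wait,
    # then one slow-clock data read from the legacy hardware.
    return 3 * FAST_NS + sync_ns + SLOW_NS


best_case = first_access_ns(0)
worst_case = first_access_ns(MAX_SYNC_NS)

# Back-to-back access: three fast fetches, a fixed 350 ns wait, slow read.
second_access = 3 * FAST_NS + 350 + SLOW_NS

print(best_case, worst_case, second_access)
print(best_case // FAST_NS, worst_case // FAST_NS, second_access // FAST_NS)
```

Expressed in fast-clock periods, a single LDA VCOUNT spans a range of several cycles, which is the per-iteration uncertainty a polling loop has to absorb.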

Bob

Rybags · April 6, 2015

I sort of forgot about the access to VCOUNT... that alone in every iteration will probably make for some huge amount of jitter.

E.g. on a 7.2 MHz system you'd probably get a mix of LDA VCOUNT instructions that take anything from 7 to 10 cycles. A 14 MHz system would be even worse.

That would be sufficient to throw any measurements out significantly.

My suggestion would be - use a z-page based flag instead, and use a VBlank routine that sets it.

There will still be a little indeterminacy in the whole thing, but more on the order of several cycles for the whole frame rather than a few cycles every loop iteration.

So, as it stands, the shorter loop case I presented earlier would likely produce a less accurate result.

flashjazzcat · April 6, 2015

Thanks for the extra info. I got a reply from Konrad as well so I have some information to process this evening.
