laoo

Exact CPU cycle timings


Has anyone observed variance in CPU cycle timing? I'm referring to the fact that one CPU cycle can take 4 system ticks if the memory access is within the same page, or 5 system ticks otherwise. As I understand the documentation, a cycle should take 4 ticks (more or less, i.e. as long as we do not cross pages) when reading an opcode or operand, and 5 ticks when reading the data pointed to by the operands. I've worked out that each instruction in a sequence of repeated instructions should take:

nop				4+4
lda #$80			4+4
lda $80				5+4+5
lda $8080			5+4+4+5
lda $8080,y			5+4+4+(5)+5
lda ($80)			5+4+5+4+5
lda ($80),y			5+4+5+4+(5)+5
lda ($80,x)			5+4+4+5+4+5
dec $80				5+4+5+4+4
dec $80,x			5+4+4+5+4+4
asl $8080,x			5+4+4+(5)+4+4+4
dec $8080,x			5+4+4+5+4+4+4

A number in parentheses is an optional cycle taken on a page crossing.
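To make the predictions easier to compare, here is a small sketch that sums the tick expressions from the list above. The helper name `tick_range` is made up for illustration; the expressions are copied verbatim from the list, and a term in parentheses is counted only in the maximum.

```python
import re

# Per-cycle tick expressions copied from the list above.
# A term in parentheses is the optional page-crossing cycle.
TIMINGS = {
    "nop": "4+4",
    "lda #$80": "4+4",
    "lda $80": "5+4+5",
    "lda $8080": "5+4+4+5",
    "lda $8080,y": "5+4+4+(5)+5",
    "lda ($80)": "5+4+5+4+5",
    "lda ($80),y": "5+4+5+4+(5)+5",
    "lda ($80,x)": "5+4+4+5+4+5",
    "dec $80": "5+4+5+4+4",
    "dec $80,x": "5+4+4+5+4+4",
    "asl $8080,x": "5+4+4+(5)+4+4+4",
    "dec $8080,x": "5+4+4+5+4+4+4",
}


def tick_range(expr):
    """Return (min_ticks, max_ticks) for one tick expression."""
    low = high = 0
    for optional, ticks in re.findall(r"(\()?(\d+)\)?", expr):
        high += int(ticks)
        if not optional:  # parenthesised terms only count towards the maximum
            low += int(ticks)
    return low, high


if __name__ == "__main__":
    for instruction, expr in TIMINGS.items():
        low, high = tick_range(expr)
        # Each test row executes 32 copies of the instruction.
        print(f"{instruction:12s} {low:2d}..{high:2d} ticks, x32 = {32 * low}..{32 * high}")
```

If these predictions held, `lda $80` (14 ticks) would not be an exact multiple of a 4-tick reference cycle, which is what the test rows are meant to reveal.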

 

I tried to check whether I'm right by writing a program with an HBI handler that burns some cycles to position the CPU on the screen, changes the background color, executes a sequence of 32 identical instructions and clears the background color. The program actually runs all of these instruction sequences in different display rows, and as a reference I've added sequences of the 1-cycle NOP $x3 repeated 32*1, 32*2, 32*3 etc. times. The results are in the image below. The white pattern on the right is the binary-encoded row number (the last row is number 0, and the last eight rows are the reference cycle-counting NOPs). I've annotated the screen grab with the instructions executed in each row. As a bonus, I set the index registers to $00 on even rows and $ff on odd rows to observe page crossing (hence the jagged patterns for the indexed addressing modes).

 

So as you can see... there are NO differences whatsoever. The timing of every instruction sequence is an exact multiple of the reference timing of a sequence of 1-cycle NOPs. I've tried the standard $EA NOP as well (with halved length) and the result was the same.

 

Am I missing something? Because it seems that each cycle of each and every tested instruction took the same amount of system ticks. Obviously I can't tell whether it's 4 or 5.

I'm attaching the executable with source code, to be assembled with the mads assembler.

 

PS. As a side note, you can see that DEC $8080,x and ASL $8080,x take the same number of cycles, contrary to the documentation, which says that DEC and INC always take 7 cycles.

 

 

 

 

CycleTiming.png

test.zip

Edited by laoo

Does not work on my Lynx. Just returns to BLL loader.

 

See also:

 

Edited by 42bs


The doc says (http://www.monlynx.de/lynx/lynx4.html#TOP)

Quote
Cycle                              Min       Max
---------------------------------------------------
Page Mode RAM(read)                4          4
Normal RAM(r/w)                    5          5

And:

Quote

The requirement for using a page mode cycle is that the current access is in the same 256 address page of memory as the previous access.

So you won't be able to see the 5-tick cycles in the benchmark. Only the instruction right after crossing the page boundary is one tick longer.
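A minimal sketch of the quoted page-mode rule, assuming (and this is only an assumption, not something the documentation excerpt confirms) that every access in the same 256-byte page as the previous access takes 4 ticks and any other access takes 5. It ignores the read-only restriction in the table, internal CPU cycles, refresh and video DMA. The function name `access_ticks` and the code address $0200 are hypothetical.

```python
def access_ticks(addresses, previous=None):
    """Ticks per memory access under the assumed rule:
    4 if in the same 256-byte page as the previous access, else 5."""
    ticks = []
    for address in addresses:
        same_page = previous is not None and (address >> 8) == (previous >> 8)
        ticks.append(4 if same_page else 5)
        previous = address
    return ticks


if __name__ == "__main__":
    # Opcode fetches of four one-byte instructions placed back to back:
    # only the very first access is outside page mode.
    print(access_ticks([0x0200, 0x0201, 0x0202, 0x0203]))

    # Two copies of `lda $8080` at a hypothetical address $0200:
    # three code bytes, then the data read, for each copy.
    lda_abs = [0x0200, 0x0201, 0x0202, 0x8080,
               0x0203, 0x0204, 0x0205, 0x8080]
    print(access_ticks(lda_abs))
```

Under this reading, straight-line code that touches no data stays in page mode, while every data access in a different page costs a 5-tick cycle for the access itself and another for the opcode fetch that follows it.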

 

On 2/16/2020 at 11:55 AM, 42bs said:

Does not work on my Lynx. Just returns to BLL loader.

I've put some effort into it and managed to prepare source code with an embedded loader, so that a simple, straightforward program can be assembled directly to an LNX file. So try the attached file Timing.zip.

On 2/16/2020 at 7:31 PM, Cyprian_K said:

did you test it on Lynx 1, Lynx 2 or emu?

Lynx II Hayato, so with additional bit instructions. I would love it if someone could run it on Lynx I or first version of Lynx II. No emulator can do such stuff :)

On 2/17/2020 at 6:33 AM, 42bs said:

The doc says (http://www.monlynx.de/lynx/lynx4.html#TOP)

And:

So you won't be able to see the 5-tick cycles in the benchmark. Only the instruction right after crossing the page boundary is one tick longer.

 

I'm not convinced. The table you've pasted says that only reads can take 4 ticks, and my test covers both read and RMW instructions. Furthermore, if the instruction isn't immediate, it obviously reads from or writes to another page. So there must be some 5-tick cycles. It's hard to do a test with code crossing a page boundary several times, but I could try.

 

As a bonus I've attached another test, with sequences of 453 CPU cycles each, that alters the background color across (almost) the whole line.

Timing.zip Line.zip

Edited by laoo


An lnx file needs to be flashed, so if you have a single .o file it is easier to test.

 

Edit: The .lnx file crashes

Edited by 42bs


Out of curiosity, I did a small test doing this in HBL (aligned on a page boundary):

	REPT 16
	dec $FDA0
	stz $fda0
	ENDR

And I get this (McWill LCD, but the original looks the same). As you can see, the number of cycles differs.

IMG_0453-1.thumb.JPG.d8c6df88d48bf8deac8ed36a8d4d2141.JPG

 

The minimum width is 1 pixel (at 75 Hz it is 0.794 µs).
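For reference, the 0.794 µs figure can be reproduced with a short calculation. This is only a sketch: the 105 lines per frame and 160 pixels per line are assumptions that are not stated in this thread, and the function name `pixel_time_us` is made up.

```python
def pixel_time_us(frame_rate_hz, lines_per_frame=105, pixels_per_line=160):
    """Time of one pixel in microseconds.

    lines_per_frame and pixels_per_line are assumed values,
    not figures taken from this thread.
    """
    return 1e6 / (frame_rate_hz * lines_per_frame * pixels_per_line)


if __name__ == "__main__":
    print(round(pixel_time_us(75), 3))
```

With these assumptions the result at 75 Hz rounds to the 0.794 µs quoted above.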

 


And this is 32 NOP and 32 STA $FF

 

 

IMG_0457.thumb.JPG.35c526551fad56f31d189454aa4ecdea.JPG

 

With this code:
 

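HBL:
	txa			; A = X, which serves as a line counter
	inx			; count this line
	lsr
	lsr
	lsr			; carry = bit 2 of the old counter value
	bcc	a		; bit 2 clear: run test a
	bcs	b		; bit 2 set: run test b

a
	dec $FDA0		; start marker: change background color
	REPT 32
	nop
	ENDR
	stz $fda0		; end marker: clear background color
	END_IRQ

b
	dec $fda0		; start marker: change background color
	rept 32
	inc $ff
	endr
	stz $fda0		; end marker: clear background color
	END_IRQ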

So the offset at the beginning is due to the test selection.
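The test selection in the handler can be sketched as follows, assuming X simply counts display lines (its starting value is not shown in the post). The three LSRs leave bit 2 of the old counter in the carry flag, so the two tests alternate in blocks of four lines. The function name `selected_test` is made up for illustration.

```python
def selected_test(x):
    """Mirror txa / lsr / lsr / lsr / bcc a / bcs b for an 8-bit counter x."""
    carry = (x >> 2) & 1  # the third LSR shifts bit 2 into the carry flag
    return "b" if carry else "a"


if __name__ == "__main__":
    print("".join(selected_test(x) for x in range(16)))
```

Because the taken branch differs between the two paths (`bcc a` versus falling through to `bcs b`), the two tests start at slightly different positions on the line.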

 

Edited by 42bs


This is weird.

I've run your example without problems. I've changed my code to be similar to yours and I get different results. I've switched to 75 Hz; the only obvious difference is that I don't use VBL but dispatch the code through VCOUNT. Furthermore, I don't return from the interrupt but jump to a long sequence of NOPs, which may be why my output is more stable on the screen. Below is a collage of the results of your code and mine; my sequences seem to take longer:

 

collage.thumb.png.96071855f14ecc24c514c8efcb39b701.png

 

Furthermore, counting pixels in the output of your code, $1b takes 84 pixels, NOP takes 168 and ADC $ff takes a bit more: 284. That seems almost consistent with 21 pixels per tick ($1b: 4 ticks, NOP: 4+4, ADC $ff: 5+4+5; the rest is the same). Of course this is only approximate, as we should take into account the few ticks burned by dec $FDA0 and stz $fda0 and by fetching video data.
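The consistency check above is just a division of the measured widths by the predicted ticks per instruction. A quick sketch, using only the numbers from this post (the dictionary and function names are made up):

```python
# (measured width in pixels, predicted ticks per instruction), from the post above
MEASURED = {
    "$1b": (84, 4),
    "nop": (168, 4 + 4),
    "adc $ff": (284, 5 + 4 + 5),
}


def pixels_per_tick(pixels, ticks):
    """Measured width divided by the predicted ticks of one instruction."""
    return pixels / ticks


if __name__ == "__main__":
    for name, (pixels, ticks) in MEASURED.items():
        print(f"{name:8s} {pixels_per_tick(pixels, ticks):.2f} pixels per tick")
```

The first two rows come out at exactly 21, while ADC $ff lands slightly below it, which is the "almost consistent" part.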

 

On the other hand, my cycles seem to be taking more time, approximately proportionally for each tested instruction.

 

Could you add a reference pattern of 32*1, 32*2, 32*3 etc. cycles of $1b?

 

I'm running my code from LynxSD, so that may be the reason it works on my Lynx. I'll try to prepare a 128 kB LYX image. Maybe this will be more compatible.

 

I've got currently two machines: PAG-0400 and PAG-401. The output is identical.

Edited by laoo

4 hours ago, 42bs said:

Out of curiosity, I did a small test doing this in HBL (aligned on a page boundary):

	REPT 16
	dec $FDA0
	stz $fda0
	ENDR

And I get this (McWill LCD, but the original looks the same). As you can see, the number of cycles differs.

IMG_0453-1.thumb.JPG.d8c6df88d48bf8deac8ed36a8d4d2141.JPG

 

The minimum width is 1 pixel (at 75 Hz it is 0.794 µs).

 

Regarding this pattern, I believe the irregularities are (mostly?) due to the video data fetches that take place every few columns. There are 10 such places where 8 bytes are fetched in sequence. It can be seen clearly in my example from another thread:

c0000.png

25 minutes ago, 42bs said:

Updated test:
IMG_0459.JPG.b7343bc1449fa3bb02b613d6b2afec9f.JPG

Clearly ADC $FF takes more than 3 times and less than 4 times as long as $1b.

 

The question is why I have different results. I didn't make everything up :)

 

I'm now suspecting that I'm not initializing hardware properly.

 

EDIT: I've logged your initialization in HandyBug and made my initialization the same... no change.

Edited by laoo


@laoo something must be wrong, as the .o file crashes on a real lynx. I suspect the "not returning" from interrupt may be a problem.


Interesting: the 3-byte NOP $5c takes 9 cycles. So it is the ideal opcode for tight CPU time-burning loops ;-)

59 minutes ago, 42bs said:

@laoo something must be wrong, as the .o file crashes on a real lynx. I suspect the "not returning" from interrupt may be a problem.

Indeed. Something must be wrong. But my lynx is real too :)

 

https://drive.google.com/file/d/1LqKpAEaH5zzq7eYd6SzuAikPKvFBoJR0/view

 

I suspect that in this form it runs only on LynxSD. Could someone with LynxSD try to run, please, the code from my attachments?

4 minutes ago, laoo said:

Indeed. Something must be wrong. But my lynx is real too :)

 

https://drive.google.com/file/d/1LqKpAEaH5zzq7eYd6SzuAikPKvFBoJR0/view

 

I suspect that in this form it runs only on LynxSD. Could someone with LynxSD try to run, please, the code from my attachments?

Strange. So maybe LynxSD does some init which your code is missing when I load the .o file.

 

 

1 minute ago, 42bs said:

Strange. So maybe LynxSD does some init which your code is missing when I load the .o file.

I've replicated the initializations from your .o file in mine, and both are currently (roughly) the same. Maybe LynxSD performs some initializations that your runtime does not. I don't know what BLL does before launching an .o file.

I think an ideal comparison would involve making a 256k LYX image and running it on AgaCart, as it doesn't initialize anything on its own.

1 hour ago, 42bs said:

I suspect the "not returning" from interrupt may be a problem.

I've changed the test so that it does return from HBI to a loop of 

loop
	stz GREEN0
	jmp loop

Two bugs emerged in this scenario. I've fixed them, but the output is essentially the same:

 

crti.thumb.png.7361d876c014a9417d3de9e2aeb87b93.png

 

Could you try this fixed version?

 

PS. The $5c NOP seems to be taking 8 cycles.

 

 

crti.o crti.lnx

Edited by laoo


It returns directly to BLL after launch.

 

Ok, it was the load address of $200 which does not work with the "standard" BLL.

Edited by 42bs

35 minutes ago, laoo said:

 

 

PS. The $5c NOP seems to be taking 8 cycles.

Yes, that is the official value from the VLSI TECHNOLOGY, INC. data book:
nop.jpg.a0a83328a2e9bd8897522704a5fb6cf0.jpg

 

Now the question is why some of your values are different from mine.

 

I have a PAG-401 Lynx.

Edited by 42bs

1 hour ago, 42bs said:

It returns directly to BLL after launch.

 

Ok, it was the load address of $200 which does not work with the "standard" BLL.

So did you manage to run it, or should I reassemble it to a different address that does work with BLL?

I'll try to rewrite the test as a LYX image to be run in a more controllable environment.

36 minutes ago, laoo said:

So did you manage to run it, or should I reassemble it to a different address that does work with BLL?

I'll try to rewrite the test as a LYX image to be run in a more controllable environment.

I could load it. In BLL I can add the downloader to the program and this one is placed at the end of the RAM, so no clash.


Now to something completely different (RIP Terry Jones):

I wanted to get rid of the HBL interrupt, and did this:

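w0:
	lda	$fd00		; Timer 0 backup (reload) value
w1:	cmp	$fd02		; wait until the Timer 0 count equals it (just reloaded)
	bne	w1
w2:	cmp	$fd02		; then wait for the first tick after the reload
	beq	w2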

That is, I wanted to wait until Timer 0 (HBL) gets reloaded and then start after the first tick.

What I see is that every second line is skipped?!

 

Any idea?


Polling timers is tricky. It seems that Timer 0 is faster than your w2 loop and gets decremented before you are able to check it. Then w2 loops until the next reload.

It's not stated in the docs, but I suspect that the timer registers are implemented in the same manner as the audio registers: as a DPRAM processed in a cyclic manner. You could then observe a 1.25 μs latency when reading the counter.

image.png.64acbb03839083574e59051096b81bba.png

 

 

