
What do we know about actual bus usage?


ThomH


In a classic simple 6502 system the setup is usually: the 6502 receives a symmetrical clock signal. It unfailingly tristates the bus during the first half of each cycle, then accesses memory during the second. Therefore RAM that is fast enough to keep up with the 6502 is used at only 50% of its bandwidth. So the other 50% is used for video fetching — the basic pattern is a video fetch during phase 1 (i.e. the first half of a cycle) and then a CPU access during phase 2 (i.e. the second half).

 

If machines require more bandwidth for video than that implies, the CPU has limited stoppage periods. Of those I'm familiar with, the C64 is a famous example, the Acorn Electron a less-famous example.

 

So... how much do we know about the Lynx's bus handling?

 

These things I think are uncontroversial:

  • Mikey both contains the CPU and is responsible for the video DMA;
  • from the processor's point of view, a page mode memory access costs 4 cycles and a completely random one costs 5;
  • Suzy has no bus access unless Mikey cedes it;
  • Mikey will then interrupt Suzy as required for video fetch, with a maximum latency from initiating a request to getting access of 40 cycles; and
  • a compromise was required "between bigger FIFOs in the Mikey video DMA and reduced performance in Suzy".

From which I can conclude, or guess anyway:

  • Mikey grabs data in bursts, implicitly filling itself with a quantity of memory much larger than that which can be fetched in 40 cycles;
  • it is unlikely that Mikey interleaves video and processor accesses even when it has full access to the bus, as otherwise there'd be almost no point in spending time on a page mode optimisation for CPU access — it'd rarely be helpful;
  • therefore memory bandwidth is probably something like an optimum of 3 cycles for a page mode read, 4 cycles for a random one, as it's hard to imagine the processor core eliminated the 6502's think-then-act rhythm;
  • ... which, if true, would also imply that the CPU sees stoppages just like Suzy does, albeit without a 10-cycle request/grant delay.

That being said, there is a mention of "the other pertinent states of the system" in deciding whether to permit something which otherwise definitely could be a paged mode processor access.

 

Has anybody ever done something simple like run a fixed counting loop against a timer to get a simple estimate of actual processor speed when Suzy isn't involved? Or anything as extreme as hooking up an oscilloscope and trying to map the whole pattern out?


Somewhere in the docs they write that you should not rely on the cycle counts of specific instructions. I guess that's because the bus handling is kind of weird.

 

I never did measurements of actual execution times, apart from the overall time for a frame.


Cool; work to do then. When I next get a flash card and Lynx together it might be smart to run some tests to try to poke at some of this stuff. Potentially interesting to measure versus a timer:

 

Total time for a bunch of iterations of a tight loop with harmless LDAs of different addressing modes that is entirely within a page. To give a broad sense of general processor bandwidth without Suzy, and because how this scales with different addressing modes will provide some information for making guesses about the paged mode logic.

 

Total time for a bunch of iterations of the same loop with the LDA straddling a page boundary to different extents, to help to confirm or deny paged mode guesses.

 

A variation on one or the other of those that records successive timer values, to look for anything that might imply video DMA interruptions. Refine as necessary to get a sense of their regularity; try different screen refresh rates.

 

Then there's a bunch of stuff one could rattle off for Suzy: looking for write versus read-modify-write optimisations, checking for page mode optimisations, comparing compressed and uncompressed inputs to get a sense of Suzy's bandwidth, comparing lots of small draws for more evidence on video DMA interrupts, etc.

 

I wonder whether I'll still be curious enough when I'm next simultaneously in possession of a Lynx and a flash card.


I was putzing around with test cases to run on a real Lynx when I next have access to one (which should be soon) and I think I discovered the following about Handy and Mednafen:

  • refresh is not emulated;
  • page mode is not emulated — Mednafen seems to give you a constant 4Mhz CPU, which is already faster than I'd expect, and Handy runs even faster than that (EDIT: specifically, it seems to act as though the CPU were close to 4.25Mhz);
  • there is an attempt at emulating video DMA pauses, but it raises questions. At 60Hz it seemed like approximately a 20us pause every 610us. Those numbers don't really make sense but it feels imprudent to speculate further without having built up more confidence in the test code.

Re: 4Mhz as an unrealistic number; if only paged-mode fetches are 4 cycles, the majority will be 5 cycles. If all were five cycles that'd imply 3.2Mhz. In real life they won't all be, and there'll be refresh and video interceding.
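To put numbers on that, here is a minimal Python sketch of the arithmetic. It assumes every CPU cycle is one memory access costing either 4 or 5 ticks of the 16Mhz master clock, and it ignores refresh and video DMA entirely. The function name `effective_mhz` is just for illustration.

```python
MASTER_CLOCK_MHZ = 16.0  # ticks per microsecond
RANDOM_TICKS = 5         # documented cost of a random access
PAGE_TICKS = 4           # documented cost of a page mode access


def effective_mhz(page_fraction):
    """Accesses per microsecond if page_fraction of them hit page mode.

    Simplification: no refresh, no video DMA, no Suzy.
    """
    if not 0.0 <= page_fraction <= 1.0:
        raise ValueError("page_fraction must be between 0 and 1")
    mean_ticks = page_fraction * PAGE_TICKS + (1.0 - page_fraction) * RANDOM_TICKS
    return MASTER_CLOCK_MHZ / mean_ticks


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(f"{fraction:.2f} page mode -> {effective_mhz(fraction):.2f} MHz")
```

So 4Mhz is only reachable if every single access is page mode, which the documented behaviour rules out.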

 

I hope to have access to real hardware next week so, subject to the many usual time pressures of being an employed grown-up, the plan is to run the same tests there, and follow whatever path the results dictate.

Edited by ThomH

Honestly, I'm fairly impressed that Handy is making any effort whatsoever towards accounting for video DMA given its era and therefore its processing budget and the difficulty then of testing on real Lynx hardware. And, for the record, I'm also now tending towards thinking that maybe the processor is implemented at 4Mhz in pre-Mednafen Handy but timers are of questionable accuracy. Just a guess, based on how some of the commercial titles run.

 

Anyway, the plan, such as there is one, is to use the test I currently have (which is just a tight "read a timer, store a value" loop, runnable in any of 50 Hz, 60 Hz, 75 Hz or screen-off configurations) to try to come up with a model of refresh and video DMA timings plus some hints about CPU page mode. Then I can try to put some effort into figuring out exactly what the rules are around CPU page mode.

 

Poking around Suzy can follow after that. The only real hint the documentation seems to offer via the palette bug is that SCB reading uses page mode to some extent.

 

I'll make sure I keep everything in a Wiki somewhere, tests and results.


Actually, no, cancel any comments about necessarily being able to perform these tests anytime soon; my situation is that in my possession I have a cased SainT multicard but a Lynx 1. I ordered a McWill Lynx 2 from eBay, which just arrived, but it seems to be a lemon. Obviously I'm going to try a little harder with it, then see what I can do about a remedy if necessary, but regardless of my consumer rights it's something of a spanner in the works. I'm also queued up for the uncased SainT multicard that would fit my Lynx 1. So it'll happen at some point. Frustratingly, I also own two Lynx 2s and a Lynxman flash card, but all are in storage back in my native country.

 

That was my venting. Thank you for putting up with it.

 

In the meantime, attached is the first benchmark I was hoping to run. It's not in the slightest bit user friendly as it was just for me. But here's what should happen:

 

It'll start up to a screen that says '00' in the top left. That doesn't mean anything, it's just letting you know that it loaded correctly.

 

You can then press:

  • A to test with the display off;
  • B to test with the display on at 60Hz;
  • Option 2 to test with the display on at 50Hz;
  • Option 1 to test with the display on at 75Hz.

It'll then do 222 iterations of a tight loop, which just grabs and stores a timer value, before printing 221 results to the display, in hexadecimal. You can run each test as many times as you want.

 

What I want to know is the full sequence of 221 results.

 

The loop being run is 13 cycles, of which at most seven could potentially be performed in page mode. You'll see in Mednafen that with the display off the results are three '03's followed by an '04'. Each count is in microseconds, the maximum timer precision, so from the whole pattern you can conclude that each 13 cycle iteration takes 3.25us. That is how I conclude that Mednafen gives you a full 4Mhz of processing power.
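As a sanity check on that reasoning, here is a small Python sketch of what a 1us timer would report if sampled by a loop of fixed length. It assumes a flat 4 ticks per CPU cycle (i.e. the emulator's behaviour, not necessarily a real Lynx) and an arbitrary starting phase; `timer_deltas` is a made-up helper name.

```python
def timer_deltas(loop_ticks, iterations, ticks_per_us=16):
    """Differences between successive whole-microsecond timer readings.

    Each iteration is assumed to take exactly loop_ticks master clock ticks,
    and the timer is assumed to report elapsed time truncated to 1 us.
    """
    readings = [(i * loop_ticks) // ticks_per_us for i in range(iterations)]
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# 13 CPU cycles at an assumed 4 ticks each = 52 ticks = 3.25 us per iteration
deltas = timer_deltas(loop_ticks=13 * 4, iterations=9)
print(deltas)                     # [3, 3, 3, 4, 3, 3, 3, 4]
print(sum(deltas) / len(deltas))  # 3.25
```

Where the '04' lands within each group of four depends on the starting phase, but the average is what matters.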

 

A real Lynx should have bigger numbers and probably be less predictable. From those numbers we'll try to come up with a model of refresh and page mode timing.

 

With the various display-on options on Handy or Mednafen you'll see periodic 17s and 18s in there. By looking at the spacing of those numbers and their sizes, plus a model of the rules about refresh and page mode, it should be possible to derive video DMA interruption periods and lengths.

 

Or, possibly the tests will be inconclusive. E.g. one thing to keep in mind is that the numbers with video DMA interruption are modulo 256us. There'll be additional clues as to buffer size in frequency, but tests with a less precise timer may be necessary. It's unlikely though, as 256us is 4096 ticks, which even at the processor's probably slightly languid 5 ticks per non-page mode access is enough time for 819 bytes to be fetched.
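To illustrate the modulo-256 caveat, here is a hedged Python sketch of taking differences between successive 8-bit timer readings. It assumes readings that count upwards; for a timer that counts down, swap the two arguments. `wrapped_delta` is an illustrative name, not anything from the benchmark itself.

```python
TICKS_PER_US = 16


def wrapped_delta(previous, current, modulus=256):
    """Elapsed counts between two readings of a counter that wraps at modulus.

    Assumes a count-up reading; swap the arguments for a count-down timer.
    """
    return (current - previous) % modulus


print(wrapped_delta(250, 4))  # 10: the wrap is handled
print(wrapped_delta(4, 4))    # 0: indistinguishable from a 256 us gap

ambiguity_ticks = 256 * TICKS_PER_US       # ticks before the reading aliases
bytes_at_random_rate = ambiguity_ticks // 5  # whole bytes at 5 ticks each
print(ambiguity_ticks, bytes_at_random_rate)  # 4096 819
```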

 

Anyway, fingers crossed I'll be able to start on some of this stuff for myself soon but until then, here's step one.

 

EDIT: for the record, I took the documentation's warning that the first tick after you set a timer value may not be full length to imply that there's a common master divider to get to 1us and your first tick depends on phase with that. So probably you wouldn't get any more from trying to set off two out-of-phase timers; they probably can't actually be out-of-phase at 1us precision.

benchmark1.lnx.zip

Edited by ThomH

Fantastic, thanks!

 

Knee-jerk reactions: it looks like an 8-byte buffer for video, and that enabling video disables the normal refresh mechanism. I'm not sure though, I'll play around with the numbers and try to make some more-educated guesses.

 

This is the first step towards figuring out timings in total; I thought I'd worry about Suzy once I have a concept of ordinary video activity and RAM access times as distinct from the speed at which the 65C02 can run.

 

One other knee-jerk reaction though: at 64us for 15 iterations with the screen off, a real Lynx is only about 70% as fast as Handy. Or Handy's CPU is about 40% faster than a real Lynx.
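For the record, the arithmetic behind those percentages, as a rough Python sketch. It takes the 13-cycle loop length and the roughly 4.25Mhz Handy estimate from the earlier posts at face value; `relative_speed` is just an illustrative helper.

```python
LOOP_CYCLES = 13   # per iteration, from the benchmark description
TICKS_PER_US = 16

real_us = 64 / 15                # measured: 64 us for 15 iterations, screen off
mednafen_us = LOOP_CYCLES / 4.0  # constant 4 MHz
handy_us = LOOP_CYCLES / 4.25    # roughly 4.25 MHz, per the earlier estimate


def relative_speed(reference_us, other_us):
    """How fast 'other' runs relative to 'reference' (1.0 means equal)."""
    return reference_us / other_us


print(f"real vs Handy: {relative_speed(handy_us, real_us):.2f}")
print(f"Handy vs real: {relative_speed(real_us, handy_us):.2f}")
print(f"mean ticks per CPU cycle on hardware: {real_us * TICKS_PER_US / LOOP_CYCLES:.2f}")
```

The last figure comes out a little over 5 ticks per cycle, which would fit with few page mode hits plus some refresh overhead, though that is a guess rather than a measurement.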


  • 3 weeks later...

I'm also very curious to know about this. However if you are saying the Emulators don't do anything to handle this.. why are they STILL so slow? I see a 4x speed, better ISA, smaller screen and yet I'm not seeing that level of power in the games.

 

If the CPU is the 65CS02 (I've seen some things say 65C02 and some say 65CS02, and the 65CS02 is actually the 65C02S) then it should have single clock DMA halting ability, unlike the N6502 which needs 3 clocks before you can take over.

 

Is there any doc or explorations that have looked at how much data it can get during the Phi1 phase, and how much data it needs to put on the screen per line to work out how many cycles it would need to eat?


 

None that I'm aware of, believe it or not. I'm still optimistic I'll be able to try some things soon, but whether I'm even a good candidate for the job time has yet to tell.

 

That said: the programmer selects the refresh rate, and the numbers above show that the number of CPU stoppages increases with greater refresh rates.

 

Each set of data also suggests that there are 10 stoppages per line. Each line being 80 bytes wide leads to my 8-byte buffer guess. So it's a little odd to me that the video output buffer need be only quad-word aligned. But given that it is, a reasonable guess might be that each interruption is:

  • 5 ticks to fetch the first byte; then
  • 4 ticks, 4 ticks, 4 ticks, to fetch the next three in page mode; then
  • 5 to fetch the fifth byte; then
  • 4 ticks, 4 ticks, 4 ticks, to fetch the remaining three.

i.e. each time the CPU is paused to grab some more video, it is stopped for 34 ticks. And that happens as often as is necessary to get enough data to draw the display.

 

Further evidence that those numbers are likely to be similar to true is the difference in loop lengths. It looks like approximately 4us for a loop iteration under normal circumstances, increasing to 6 every time there is a video-related stoppage. 34 ticks would be 2.125us.
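The sums behind that guess, as a minimal Python sketch. The 5/4 tick costs are the documented CPU figures; applying them to Mikey's video fetches is my assumption, and `burst_cost` is an illustrative name.

```python
TICKS_PER_US = 16
BYTES_PER_LINE = 80


def burst_cost(pattern):
    """Return (bytes fetched, ticks, microseconds) for one video DMA burst."""
    ticks = sum(pattern)
    return len(pattern), ticks, ticks / TICKS_PER_US


# Guessed access pattern: two groups of one random plus three page mode fetches
guess = (5, 4, 4, 4, 5, 4, 4, 4)
size, ticks, micros = burst_cost(guess)
print(size, ticks, micros)     # 8 34 2.125
print(BYTES_PER_LINE // size)  # 10 stoppages per line
```

A 2.125us stoppage is close to the observed jump from about 4us to about 6us per iteration, given the timer only resolves whole microseconds.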

 

Otherwise, it's clear from the test above that video fetch occurs instead of refresh, so we can rule out RAS-only refresh. Which might be a helpful observation at some point.


It seems as if the Lynx has a concept similar to the "Master Cycles" on a SNES.

So the Master clock is 16Mhz, and it seems some of the custom chips run off it from the "marketing" but then you could also say the C64's VIC-II runs at 8Mhz in the same way.

Is a tick 1 "clock/cycle" in normal CPU terms i.e a full hi + lo phase of the Phi2 clock, at 4Mhz.. or are they slightly faster ticks?

The RAM data sheet says it's 120ns RAM so that is 220ns to do a read which gives a max CPU speed of 4.5Mhz but that takes 5 ticks, a Page look up then takes 120ns but that still takes 4 ticks.. However does Mikey/Suzy still need 4 ticks, or do they take advantage that the data grabs will mostly be on the same page, and grab data faster using the page timings? (this is going to make self-modifying code even faster and more critical for speed on a Lynx)

 

Having the Page mode makes it impossible to do shared bus.. as you can't have the graphics chip access RAM which may need a completely different address and keep the Page mode. So I theorize that the blitter etc always eats cycles for every byte it has to read, no shared bus.. I think having a shared bus would be faster overall than this 'special' page logic..

 

I think we need a "Howard" and a Logic Analyzer ;)

Is there a way to modify a "background colour" on the Lynx in real time, "work out raster time" inc dec some colour register, pen value? then you could make code that just inc dec inc dec and then see where the bars stretch. I don't have any hardware so I can't test such things.

yes


It seems as if the Lynx has a concept similar to the "Master Cycles" on a SNES.

So the Master clock is 16Mhz, and it seems some of the custom chips run off it from the "marketing" but then you could also say the C64's VIC-II runs at 8Mhz in the same way.

Is a tick 1 "clock/cycle" in normal CPU terms i.e a full hi + lo phase of the Phi2 clock, at 4Mhz.. or are they slightly faster ticks?

 

Ticks are relative to the 16Mhz clock. It's jargon from the official hardware documentation, though it's not exactly pervasive. Maybe I was wrong to assume.

 

Anyway, the CPU doesn't actually run at 4Mhz. It opens each new instruction with a five-tick random access fetch. Some sort of external lookup table observes the embedded version of the SYNC line and determines some sort of estimate of the number of page mode accesses that can be generated as a function of the opcode, subject to other machine state.

 

To my mind the documentation is a little ambiguous whether that table can indicate that only the next byte may be page mode, or if it can potentially enable it for the next two, or possibly even provide a more complicated pattern than that (e.g. LDX (zero),x offers a pretty obvious predictable second location for a page mode access).

 

So the speed at which the CPU executes while it is running is somewhere in the range 3.2Mhz to 4Mhz depending on instructions used, exactly what that lookup table can specify, and when the video interruptions fall. And, of course, if you're using Suzy for blitting then for those periods the CPU gives up the bus, subtracting some more in terms of how many processing cycles a real game actually expends per second.

 

The RAM data sheet says it's 120ns RAM so that is 220ns to do a read which gives a max CPU speed of 4.5Mhz but that takes 5 ticks, a Page look up then takes 120ns but that still takes 4 ticks.. However does Mikey/Suzy still need 4 ticks, or do they take advantage that the data grabs will mostly be on the same page, and grab data faster using the page timings? (this is going to make self-modifying code even faster and more critical for speed on a Lynx)

 

Having the Page mode makes it impossible to do shared bus.. as you can't have the graphics chip access RAM which may need a completely different address and keep the Page mode. So I theorize that the blitter etc always eats cycles for every byte it has to read, no shared bus.. I think having a shared bus would be faster overall than this 'special' page logic..

 

The processor is documented to be 5 ticks per random access, 4 per page mode. Mikey and Suzy aren't documented. Mikey activity is as implied above:

  • it will pause the CPU;
  • to fetch 8 bytes at a time;
  • taking approximately 2us to do so;
  • the amount of time it uses to fetch and output a single line is fixed, the programmatic frame rate is achieved by adjusting the amount of time between lines.

I've done no investigation yet into Suzy, and I don't think anybody else has either, but it is definitely a page mode user. Besides anything else, there is an SCB palette-reading bug if you run the last section over a page boundary.

 

I think we need a "Howard" and a Logic Analyzer ;)

Is there a way to modify a "background colour" on the Lynx in real time, "work out raster time" inc dec some colour register, pen value? then you could make code that just inc dec inc dec and then see where the bars stretch. I don't have any hardware so I can't test such things.

 

Somebody with a logic analyser would be able to figure all this stuff out a lot more quickly, but flash carts are much more readily available!

 

The hardware palette is indeed manipulable in real time, but it would help you further with Mikey timings only. The CPU does not run at the same time as Suzy (i.e. the blitter). For that I am leaning towards using an interrupt to interrupt Suzy after a fixed amount of time and seeing how far she got.


I've done no investigation yet into Suzy, and I don't think anybody else has either, but it is definitely a page mode user. Besides anything else, there is an SCB palette-reading bug if you run the last section over a page boundary.

 

 

This sounds familiar. While debugging random display problems in Championship Rally there was a bug related to palette and page boundaries that hit Lynx I, but not Lynx II. The time relative to the VBL triggered it somehow and rendered the screen black for the rest of the game. By only allowing the palette to be touched at certain times relative to the drawing process the problem went away.


I think we can at least bucket maths this to start with.

 

So a 16 colour screen at 160x102 = 80 bytes x 102 = 8160 bytes per frame.

SCB = 27 bytes

Then it's byte + width per line

For each byte of data Suzy draws she needs to Grab Byte from RAM, grab Sprite Byte, Store Sprite byte back at minimum. so each byte of sprite has a 3 to 6 byte fetch penalty. The question is does she only grab 1 byte at a time, or 2..3...4 regardless

 

From

"60Hz: 159 us x 105 lines 16.695 ms (59.90 Hz), 3 lines of Vertical Blank"

so 159us = 159,000ns where the 16mhz clock has a full phase of 62.5ns so 2,544 per line, for 105 lines = 267,120 ticks per frame. ~66,780 clocks per frame.
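The same conversion as a small Python sketch, in case anyone wants to plug in the 50Hz or 75Hz line timings. The 159us and 105 lines come from the quoted documentation; treating a CPU clock as a flat 4 ticks is a nominal assumption rather than how the real CPU behaves.

```python
TICKS_PER_US = 16        # 62.5 ns per tick
LINE_US = 159            # from the quoted 60Hz timing
LINES_PER_FRAME = 105    # including vertical blank
TICKS_PER_CPU_CLOCK = 4  # nominal 4Mhz; real accesses cost 4 or 5 ticks

ticks_per_line = LINE_US * TICKS_PER_US
ticks_per_frame = ticks_per_line * LINES_PER_FRAME
cpu_clocks_per_frame = ticks_per_frame // TICKS_PER_CPU_CLOCK
frame_ms = LINE_US * LINES_PER_FRAME / 1000
refresh_hz = 1000 / frame_ms

print(ticks_per_line, ticks_per_frame, cpu_clocks_per_frame)  # 2544 267120 66780
print(f"{frame_ms:.3f} ms, {refresh_hz:.2f} Hz")
```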

 

If Mikey grabs 8 bytes at a go, let's assume 5 4 4 4 4 4 4 4 is the timing. That is 33 ticks per 8 bytes and it needs to do it 1020 times so that eats 33,660 ticks per frame.

Refresh will also eat more time and the docs mention that refresh should be done when Mikey does a fetch to make it faster.. not sure if that is the documentation being wishful or how it is actually done...

Let's assume that SCB is also 5 4 4 4 4... so that is 109 ticks to read the SCB

Optimal case: you have a single SCB that holds the entire frame's data as pure literal, and you redraw it each frame. So that is 8160 * 3 accesses; however, I would think that the accesses are going to be 5, then 5 4 4 5 4 4 clocks as dest, src, dest, dest, as the two dest accesses (writing the previous value, then loading the next) are most probably on the same page. They won't always be, but it's a rough cut for now.

So each byte needs 13 ticks. 13 * 8160 + 1 (for the initial 5 read of Dest) = 106,081 ticks. This assumes Suzy is fast enough to paint a "background" mode sprite with 0 dead ticks. I feel this is "not true", but for now.

so a 60hz frame - draw DMA - time to draw one sprite to fill the whole background bestish case =

267,120 - 33,660 - 109 - 106,081 = 127,270 left
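Here is that bucket maths as a Python sketch so the assumptions are easy to swap out. Every constant is one of the guesses above (33-tick video bursts, a 27-byte SCB read as 5 4 4 4..., 13 ticks per drawn byte), not a measured value.

```python
FRAME_TICKS = 267_120  # 159 us x 105 lines x 16 ticks per us
SCREEN_BYTES = 8_160   # 80 bytes x 102 lines
SCB_BYTES = 27

# Guess: one 5-tick access followed by seven 4-tick accesses per 8-byte burst
video_ticks = (SCREEN_BYTES // 8) * (5 + 7 * 4)

# Guess: SCB read as one random access then page mode for the rest
scb_ticks = 5 + (SCB_BYTES - 1) * 4

# Guess: 13 ticks per drawn byte, plus 1 for the initial 5-tick dest read
suzy_ticks = 13 * SCREEN_BYTES + 1

left = FRAME_TICKS - video_ticks - scb_ticks - suzy_ticks
print(video_ticks, scb_ticks, suzy_ticks)  # 33660 109 106081
print(left, f"{left / FRAME_TICKS:.1%}")   # 127270 47.6%
```

So under these best-case guesses a bit under half the frame is left for the CPU after one full-screen redraw.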

Discuss ;)

 

Seems all of these details ARE documented, I just noted a tiny tidbit in the docs we have that mentions Appendix 7 holds the gory details.

Edited by oziphantom

I am not sure about the timing regarding page mode.

If video DMA alone accesses RAM, a 5-clock access is needed only once every 256 bytes, to set up the next page.

So given we start at a page boundary, it is 8160/256 ~= 32 pages.

So 32*5+8128*4 = 32672 clocks.

 

Right?
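That calculation can be checked with a short Python sketch. It assumes 256-byte pages as in the post, a frame buffer starting on a page boundary, and 5/4 tick costs for random and page mode accesses. The optional `burst` parameter is my addition for comparing against the burst-fetch guesses earlier in the thread: it forces a fresh 5-tick access at the start of every burst.

```python
def fetch_ticks(start, length, burst=None, page_size=256,
                random_ticks=5, page_ticks=4):
    """Ticks to read length consecutive bytes beginning at address start.

    An access costs page_ticks if it stays in the page of the previous
    access, otherwise random_ticks. If burst is given, the page state is
    assumed lost between bursts of that many bytes.
    """
    ticks = 0
    current_page = None
    for offset in range(length):
        if burst is not None and offset % burst == 0:
            current_page = None  # bus was handed back, page must be reopened
        page = (start + offset) // page_size
        ticks += page_ticks if page == current_page else random_ticks
        current_page = page
    return ticks


print(fetch_ticks(0, 8160))           # 32672: uninterrupted linear fetch
print(fetch_ticks(0, 8160, burst=8))  # 33660: 5 4 4 4 4 4 4 4 per burst
print(fetch_ticks(0, 8160, burst=4))  # 34680: 5 4 4 4 5 4 4 4 per burst
```

So 32672 is right for an uninterrupted fetch; the bursty patterns only cost about 3% to 6% more.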


 

I was going to argue it from the other direction: given that the requirement is a four-byte boundary rather than an 8-byte, it doesn't necessarily have the logic to distinguish 5 4 4 4 4 4 4 4 from 5 4 4 4 5 4 4 4.

 

But, either way, since Mikey seems to access in 8-byte spurts it's bound to be one of those two patterns.

 

Re: refresh, don't forget that — 8-bit or otherwise — the Lynx is a 1989 machine. So it supports CAS-before-RAS and hidden refresh; indeed it almost certainly isn't using classic RAS-only refresh because the advantage of those two is that the row counter is inside the RAM, so that's one less thing for Mikey to keep track of. If and when more is known about Suzy, a test might be to see what the penalty is when drawing with Suzy with the display off, since Mikey won't actually strictly need to do any memory accesses.
