
VDP overrun revisited...


marc.hull


To revisit the "Can The TI Outpace the VDP" question.....

 

The answer is a tentative yes, with a caveat. When the computer is running from 16-bit memory it can definitely (I think) outrun the VDP. See the video link below. Not that this is earth-shattering news by any means; I just thought it was interesting.

 

The code reads a byte from the VDP, shifts that byte left once, then writes the address and the data back. I used Matt's example of fast VSBR/VSBW, using register addressing and symbolic addressing to avoid a SWPB. Unless I am mistaken (what are the odds of that ;-) the program is writing data faster than the VDP can handle it......

 

Here is the link to the video.

 

 

Theories??

 

 

 

Sorry for the audio quality of the narrator. He has a voice just made for the printed word ;-)

Edited by marc.hull

We need to see the code to answer that question.

 

Bitmap mode is the most memory intensive VDP mode, so it is of course the worst case scenario. But with a posting of the code we can run through the cycle counting and see if it makes sense, or this is new information.


The theory of the 99/4A not being able to outrun the VDP is partly based on information from Thierry's website where he states that the memory addresses mapped to the VDP trigger the wait states. That means it does not matter if the source or destination is the 16-bit no wait state scratch pad RAM, the read or write to the VDP will incur the wait state, and thus never be fast enough to outpace the VDP even in the worst case.

 

I too would like to see the code you used.

 

Edit: If you have a console where you put 32K on the 16-bit bus and/or disabled the wait-state generator as per Thierry's tech pages, then yes, you can overrun the VDP. I suppose if you overclock the console you could also overrun.

 

Matthew

Edited by matthew180

The effect looks good.

 

Looks like the read and write addresses come through all right. You're hitting the right spots in the bitmap. I guess you're demoing exactly the same code, only running from 16-bit memory instead of 8-bit. I don't suspect a missing clear of the LSB; the effect would have been different anyway (from what I can see in the video).

 

So if the address is okay, then the read, write or both of data are failing (if the problem does not lie elsewhere). Firstly try to put in a single NOP before reading the data. Then one NOP before the write of data. Then more NOPs. The results are interesting, since, if you get it working correctly with one or a few NOPs, then we might generally take care when running from ScratchPad.


Velly intelesting.

 

TF uses VDPWA in a register to read and write to/from VDP during VMBR and VMBW and VSBWM. It doesn't bother on single byte stuff. I have two consoles, one UK spec, one USA spec. Neither of which are set up as I don't have a PEB. I guess I should get one hacked to run 16-bit fast ram.


This seems to be the offending piece of code.

 

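CLR R1
MOVB @R0LB,@VDPWA     ; read address, low byte first
MOVB R0,@VDPWA        ; read address, high byte
MOVB @VDPRD,R1        ; read the byte from VDP

SLA R1,1              ; shift it left once

ORI R0,>4000          ; set the write bit in the address
MOVB @R0LB,@VDPWA     ; write address, low byte first
MOVB R0,@VDPWA        ; write address, high byte
MOVB R1,@VDPWD        ; write the shifted byte back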

 

When in 16 bit zero wait memory it causes the issue. Any one out there with real gear want to confirm?


I am new to assembly and actually very interested in this topic. I kept up with the whole Y! thread and I've been reading here as well. It seems as though we are in need of a super-fast game to test this out. I propose "Crack-Mario" without any wait states or delay loops. He smokes a bunch of crack and flies through the levels at 7000 miles an hour. All 9 worlds should take about 3.9 seconds to complete. :)


Try this:

 

CLR R1
MOVB @R0LB,@VDPWA
MOVB R0,@VDPWA
NOP     ; chill
MOVB @VDPRD,R1

SLA R1,1

ORI R0,>4000
MOVB @R0LB,@VDPWA
MOVB R0,@VDPWA
NOP     ; just chillin'
MOVB R1,@VDPWD

 

I reckon it's the period between the address being written and the data being read/written.

 

OR... You could run your code on the VDP interrupt. You get a VDP interrupt when the VDP is entering the vertical refresh period (VRAM isn't being accessed, and thus the CPU window is open for 4.3 milliseconds), so you could get a fair bit of VRAM thrashing done in that window. You might have to sit down with a pen and paper and work out how many cycles your code needs and how long it takes. Then you can work out how many times you can loop in the 4.3 millisecs. One advantage of this method would be that it should work on both modified and stock consoles. By the way, don't bother with that LIMI 0 crap:

 

LIMI 0

[ do some work ]

LIMI 2

 

LIMI is a slow instruction, and eats into your interrupt window. Just work out how much code (or how many loops) you can run in the window. Classic99 has a feature that will tell you how many cycles a section of code takes: T(address-address). That will help you to work out how long a section of code takes.
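To put rough numbers on that pen-and-paper exercise, here is a small Python sketch. It assumes a 3 MHz CPU clock (the 0.333uS cycle quoted elsewhere in this thread) and takes the 4.3 millisecond window at face value. The names `cycles_in_window` and `loops_that_fit` are made up for this example, and the loop cost you feed in is whatever you measured or counted yourself.

```python
CLOCK_MHZ = 3       # assumed CPU clock: 3 cycles per microsecond
WINDOW_US = 4300    # the 4.3 ms vertical refresh window, in microseconds


def cycles_in_window(window_us=WINDOW_US, clock_mhz=CLOCK_MHZ):
    """CPU cycles available while the VDP is in vertical refresh."""
    return window_us * clock_mhz


def loops_that_fit(cycles_per_loop, window_us=WINDOW_US, clock_mhz=CLOCK_MHZ):
    """Whole passes of a loop that fit inside the window."""
    if cycles_per_loop <= 0:
        raise ValueError("cycles_per_loop must be positive")
    return cycles_in_window(window_us, clock_mhz) // cycles_per_loop


if __name__ == "__main__":
    print("cycles in window:", cycles_in_window())
    print("passes of a 200-cycle loop:", loops_that_fit(200))
```

With these assumptions the window is 12900 cycles, so a hypothetical loop costing 200 cycles per pass would fit 64 times. The time spent entering and leaving the interrupt handler is not accounted for here, so treat the result as an upper bound.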

 

Lastly: It strikes me that you may be able to use your discovery to 'discover' if the host machine has 16-bit RAM or not. Thrash some data into VRAM like a chipmunk on speed, then read it back. If you don't get what you wrote, you're on 16-bit RAM, or you've done too much shit. :P


I am new to assembly and actually very interested in this topic. I kept up with the whole Y! thread and I've been reading here as well. It seems as though we are in need of a super-fast game to test this out. I propose "Crack-Mario" without any wait states or delay loops. He smokes a bunch of crack and flies through the levels at 7000 miles an hour. All 9 worlds should take about 3.9 seconds to complete. :)

 

HA HA HA! LMAO! I just peed in my pants. It's not a pretty sight! :P


Hey K...

 

This is not really a problem I am encountering as much as a curiosity. The actual code runs OK in 8-bit RAM. I happened to leave my machine in 16-bit mode when I started the program the other night and noticed the anomaly. I believe that the 32K16 mod is equivalent to running in the scratchpad, and since I championed the claim that you couldn't overrun the VDP on a stock TI, I thought I should revise it with some new evidence, because evidently you can (provided it bears up under scrutiny ;-)


The theory of the 99/4A not being able to outrun the VDP is partly based on information from Thierry's website where he states that the memory addresses mapped to the VDP trigger the wait states.

 

I thought you guys (Marc and Matthew) were both there on whichever list it was where we hammered this into the ground, getting the /actual/ timing from the console during VDP access (i.e. there's no need to "rely on information from Thierry's website"). When we left it, all that was left was verifying the theories, which none of us did. ;) Unless you guys don't trust my results. I know I get chewed out every few months from someone for "pretending" to know what I'm talking about. :D

 

Oops, does ScratchPad issue wait-states ?

 

No, it doesn't, so it should be the same as Marc's test.

 

So, assuming registers are also in 0-wait-state RAM:

 

CLR R1                   10 cycles
MOVB @R0LB,@VDPWA        14 + 8 symbolic + 8 symbolic + 4 read vdp + 4 write vdp
MOVB R0,@VDPWA           14 + 8 symbolic + 4 read vdp + 4 write vdp
MOVB @VDPRD,R1           14 + 8 symbolic + 4 read vdp

SLA R1,1                 12 + 2

ORI R0,>4000             14
MOVB @R0LB,@VDPWA        14 + 8 + 8 + 4 + 4
MOVB R0,@VDPWA           14 + 8 + 4 + 4
MOVB R1,@VDPWD           14 + 8 + 4 + 4

 

The 4 clocks are the wait states imposed, and every write has a read-before-write cycle.
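Totalling the per-instruction counts gives a rough cost for one pass through the snippet. The Python sketch below only adds up the figures from the listing above; the `LISTING` layout and the function names are invented for illustration, and any loop overhead (address increment, branch) is not in the listing and so is not counted.

```python
# Cycle terms copied from the listing: base cost, addressing, wait states.
LISTING = [
    ("CLR R1",            [10]),
    ("MOVB @R0LB,@VDPWA", [14, 8, 8, 4, 4]),
    ("MOVB R0,@VDPWA",    [14, 8, 4, 4]),
    ("MOVB @VDPRD,R1",    [14, 8, 4]),
    ("SLA R1,1",          [12, 2]),
    ("ORI R0,>4000",      [14]),
    ("MOVB @R0LB,@VDPWA", [14, 8, 8, 4, 4]),
    ("MOVB R0,@VDPWA",    [14, 8, 4, 4]),
    ("MOVB R1,@VDPWD",    [14, 8, 4, 4]),
]


def per_instruction(listing):
    """Return (instruction text, total cycles) for every line."""
    return [(text, sum(parts)) for text, parts in listing]


def total_cycles(listing):
    """Sum of all cycle terms across the whole listing."""
    return sum(sum(parts) for _, parts in listing)


if __name__ == "__main__":
    for text, cycles in per_instruction(LISTING):
        print(f"{text:<20} {cycles:>3}")
    print(f"{'total':<20} {total_cycles(LISTING):>3}")
```

On these figures one pass costs 230 cycles, and the data read (`MOVB @VDPRD,R1`) is the cheapest of the VDP accesses at 26.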

 

A cycle is 0.333uS. According to the datasheet, the VDP needs 2uS after setting an address before it is ready to make the data transfer. It then takes anywhere from 2-8uS for the transfer to actually occur. Bitmap mode is the worst case, so the full 8uS is more likely to occur than in other modes - that's a total of 10uS, which takes 30 CPU cycles. This information is all out of the respective datasheets. Note that the 2uS delay after setting the address doesn't apply to subsequent reads or writes wrt the VDP, only the 2-8uS window applies to those.
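The arithmetic in that paragraph can be checked with a few lines of Python. This is only a sketch of the figures quoted above (3 MHz clock, 2uS after the address write, up to 8uS for an access window); the constant and function names are invented for the example.

```python
import math

CLOCK_MHZ = 3          # 0.333 uS per cycle, i.e. 3 cycles per microsecond
ADDRESS_SETUP_US = 2   # delay after setting the VDP address
WORST_WINDOW_US = 8    # worst-case wait for an access window


def us_to_cycles(microseconds, clock_mhz=CLOCK_MHZ):
    """Whole CPU cycles needed to cover a delay, rounded up."""
    return math.ceil(microseconds * clock_mhz)


def worst_case_after_address():
    """Cycles to wait between an address write and the first data access."""
    return us_to_cycles(ADDRESS_SETUP_US + WORST_WINDOW_US)


if __name__ == "__main__":
    print("address setup:", us_to_cycles(ADDRESS_SETUP_US), "cycles")
    print("worst window: ", us_to_cycles(WORST_WINDOW_US), "cycles")
    print("total:        ", worst_case_after_address(), "cycles")
```

`worst_case_after_address()` comes out at 30 cycles, matching the figure above. For subsequent accesses only the window term applies, which is 24 cycles in the worst case.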

 

Most likely, it's the read that is failing. But it's not going to be 100% reliable about failing, which makes this sample code a little tough. Sometimes it will work. It depends exactly when the CPU request happens compared to what the VDP is currently doing on the screen. Anyway, the reason it's likely the read, is it's the shortest time between writing the address and accessing the data register. Since the address write is the last thing the previous MOVB does, and the data read is the first thing that the next MOVB does, you've only got about 20 cycles or so between the write of the address and the read.

 

Even so, I did prove in that thread that READS could outstrip the VDP, only writes appeared to be safe (and based on this information, maybe not after setting the address). To quote:

 

VDP Writes - our fastest write is 8.65uS. This means that writes to the VDP at any speed should always be safe, as this is greater than the worst case access speed to the VDP. Confirmed theoretically, just need some proof.

 

VDP Reads - our fastest read is 7.32uS. This is potentially close enough to the edge for tight loops to occasionally miss during graphics I or II. It's easy to add 4 cycles to make it safe by using the symbolic addressing mode instead of register indirect. They are safe in text (3.1uS max) and multicolor (3.5uS max) mode, however.

 

But this was talking about sequential accesses -- you need a little extra time after setting the address. The write might just be safe, as the actual write will occur right around the 30 clock mark, but the read is definitely too early to be reliable. (Also note the above treated the fastest instruction as a MOVB R0,*R1, where R1 contains the VDP address, or vice versa, anyway, using register indirect to access the VDP and Register for the data. We talked about other hacks like LIMI later.)
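The quoted conclusions for back-to-back accesses boil down to a comparison of two small tables. A minimal Python sketch, using only the numbers quoted in this thread (8.65uS fastest write, 7.32uS fastest read, 3.1uS and 3.5uS maximum windows for text and multicolor, 8uS worst case for graphics I and II); the dictionary and function names are made up for the example, and the extra 2uS after an address write is deliberately left out.

```python
# Shortest time between two sequential accesses, in microseconds.
FASTEST_US = {"write": 8.65, "read": 7.32}

# Longest time the VDP may need to service one access, per display mode.
WORST_WINDOW_US = {"text": 3.1, "multicolor": 3.5, "graphics I/II": 8.0}


def is_safe(access, mode):
    """True if the CPU cannot issue this access faster than the VDP serves it."""
    return FASTEST_US[access] > WORST_WINDOW_US[mode]


def unsafe_combinations():
    """All (access, mode) pairs where the CPU can get ahead of the VDP."""
    return [
        (access, mode)
        for access in FASTEST_US
        for mode in WORST_WINDOW_US
        if not is_safe(access, mode)
    ]


if __name__ == "__main__":
    for access, mode in unsafe_combinations():
        print(f"{access} in {mode}: can outrun the VDP")
```

On those figures the only pair flagged is a read in graphics I/II, which is the case the quote calls "close enough to the edge".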

 

Of course, with the program in wait-state RAM, you gain 4 extra clock cycles for every word of program, which pushes your read up to 30 cycles between the address set and the read.


Yeah, I was part of that, I did the real hardware testing and you smacked it down with the data analyzer. We came to a resolution, I wrote my routines, and promptly forgot everything we figured out. ;-)

 

I'm still not clear if Marc's computer is modified and how, and does the 32K on the 16-bit bus remove the wait states altogether?

 

It won't matter in a few months anyway... oops, shhhh. ;-)

 

Matthew


Tursi wrote:

I know I get chewed out every few months from someone for "pretending" to know what I'm talking about. :D

 

 

 

Hmmmmmm... Erik© Brent. Kinda catchy ;-) Just kidding bro.....

 

The test I ran earlier WAS entirely sequential and did not mix reads and writes. Sometimes it just takes a while for us less than guru status peeps to catch on ;-)

 

Marcus...


My console is modded to run in either normal 8-bit wide mode with wait states or 16-bit wide mode without wait states. I did NOT put this code into the scratch pad and run it in normal mode. I only switched the console between the two. Normally when I program I assemble in "fast mode" to cut the time in half and switch over to slow mode to run the executable. A couple of beers got in the way the other night and I forgot to return to the slower operation and saw the anomaly. BTW this does not occur when the console is running @ 3.58 MHz (to answer an earlier query.)

 

I'll leave it to you to do the scratch pad test......


  • 2 weeks later...

Scratchpad and RAM with no wait states run at the same speed, there's no need to retest that (unless you really want to). The mod just adds the normal 32k RAM areas to the circuit that disables the wait state generator.

 

The 3.58MHz tweak is interesting - are you saying it does not have any problems running at 3.58MHz with the wait states disabled (ie: running full speed)?


32K16 enabled / 3.00 MHz speed = fouled video
32K8 enabled / 3.58 MHz speed = good video
32K16 enabled / 3.58 MHz speed = fouled video (obviously)
32K8 enabled / 3.00 MHz speed = good video (again obviously, just stating for the record...;-)


You're in bitmap mode, right Marc? Bitmap mode being the worst case in terms of bus activity (i.e. less cpu access windows). :?:


Actually for Graphics Mode I and II it is the same, the CPU access window during the active display is between 2uS and 8uS. The problem is, you never know how long the access is going to take and the VDP does not have a READY or HOLD pin... Why not, I have no idea.

 

During the vertical retrace, VRAM can be accessed every 2uS, which is faster than the CPU. But even the worst case 8uS during the active display is faster than the CPU, except in very specific circumstances (like 32K on the 16-bit bus perhaps.)

 

Matthew


Cool, thanks for the break down. So what I'm seeing is that a stock TI can't over run the VDP. Am I missing anything?

 

Matthew

 

I believe it can overrun the VDP if you shoe horn the code into the scratchpad which I think mike stated runs @ zero wait states.


Yeah, the scratch pad runs at 0-wait state, but according to Thierry's site, the VDP triggers the wait-states. So you still have a wait-state for half of the MOVB instruction. I'll have to go back and read it all again and dig out the schematics to make sure though. Tursi put his logic analyzer on it, but I don't remember all the details and if he saw a wait-state on half of the accesses and all that.

 

I'm pretty sure my VDP over run tests were done via code in the scratch pad. I'll go dig them up.

 

Matthew
