Kchula-Rrit

VDP - word access faster than byte?


Since the low-byte data lines are not connected to the VDP port, could a word access (MOV R1,@VDPWD) be used instead of a byte access (MOVB R1,@VDPWD)?

Just curious, since I thought a word access was faster than a byte access, especially a write operation: the CPU has to read the word, plug the new byte into it, then write the word back out.

 

K-R.

 


Just checked the 9900 data sheet for MOV/MOVB execution times; time is essentially the same.

 

Thanks,

 

K-R.

 


That's the big advantage of the TMS9995 in the Geneve; it does not need any read-before-write because of the 8-bit data bus.

1 hour ago, Kchula-Rrit said:

Just checked the 9900 data sheet for MOV/MOVB execution times; time is essentially the same.

 

Thanks,     K-R.

 

...except for  indirect, auto-incrementing register writes:

       MOVB *R1+,@VDPWD      <<<-----Faster by 2 clock cycles
*           --vs--
       MOV  *R1+,@VDPWD

Of course, writing every other byte in a loop with the latter would rarely make sense, so it would not likely be chosen by a programmer anyway.

 

...lee


You're right; I had not thought it through.

 

This is off the topic, but is the read-before-write the reason the VDP (among other peripherals) read and write addresses are not the same?

 

If, for example, you are writing an address to the VDP, it seems to me that, while writing the second byte, the read-before-write would reset the sequencer inside the VDP and never update the address.
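For anyone following along, this is roughly the standard two-byte address setup the VDP sequencer expects; the port addresses are the usual 99/4A ones and are an assumption here, not taken from this thread:

       VDPWA  EQU  >8C02        * VDP write-address port (assumed)
       VDPWD  EQU  >8C00        * VDP write-data port (assumed)

              LI   R0,>1000     * target VRAM address
              SWPB R0
              MOVB R0,@VDPWA    * low byte of the address first
              SWPB R0
              ORI  R0,>4000     * set the write bit
              MOVB R0,@VDPWA    * then the high byte; sequencer now primed
              MOVB R1,@VDPWD    * data bytes can follow

A spurious extra byte arriving at VDPWA mid-sequence would indeed leave the sequencer half-set, which is the scenario described above.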

 

Sometimes I still have trouble wrapping my head around the way the TI does things.

 

K-R.

 


Yes, and therefore, in the native mode of the Geneve, the video ports are the same for reading and writing because it was not necessary to pick two different addresses.

 

These are the things you simply accepted at first, only to learn some decades later why they had to be done that way.

8 hours ago, Kchula-Rrit said:

This is off the topic, but is the read-before-write the reason the VDP (among other peripherals) read and write addresses are not the same?

That's correct. Another downside is that the VDP access triggers the multiplexer, so you suffer the extra wait states on both read and write, even though you don't need them. ;)

 


It's been a while since I looked at the schematics, but I thought the multiplexer was only used for the side port.  Well, that should help with the "time-waste" required between VDP writes.

 

K-R.

 


There is one way I've found that skips the read-before-write, but it has some downsides: one byte can be written in 24 cycles by setting the workspace pointer to the VDP write-data address, then using a chain of "LI R0,>XX00" instructions, where XX is the byte to write. Each LI instruction is 4 bytes, so the code size will be 4 times the number of bytes written.
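A minimal sketch of that trick; the VDPWD address (>8C00) and the workspace-restore label MYWS are assumptions for illustration:

              LWPI >8C00        * R0 now overlays the VDP write-data port
              LI   R0,>4800     * writes >48 to the VDP, no read-before-write
              LI   R0,>4900     * writes >49
              LI   R0,>4A00     * writes >4A
              LWPI MYWS         * restore the normal workspace (assumed label)

Note that with this workspace, R1 would overlay the write-address port at >8C02, so nothing but R0 should be touched while the trick is active.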

 

PS. Don't forget that you do need a NOP time-waste when reading a byte from the VDP after setting the address to read from. It may work fine in emulation or on an F18A, but a real 9918A will sometimes return wrong values if you read too soon.
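In code, the read side with the time-waste looks roughly like this (port addresses are the usual 99/4A ones, assumed here):

              SWPB R0
              MOVB R0,@>8C02    * low byte of the read address
              SWPB R0
              MOVB R0,@>8C02    * high byte, >4000 bit clear = read setup
              NOP               * the time-waste mentioned above
              MOVB @>8800,R1    * VDPRD: fetch the prefetched byte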


The reason for the read-before-write approach is obvious when handling bytes: since the CPU can only write a full word, it has to read the word in first, modify the byte in question, then write the whole thing back. That it does this for word accesses too was to save space in the CPU design, since the same logic can then handle both word and byte accesses. Performance suffers, but that's the reason.

 

Since I have a console with 64 Kbytes 16-bit wide RAM inside, I can see when programmers have relied on the 16 to 8-bit multiplexing to slow things down enough for the VDP to be able to cope without the NOP TI recommended you put in your code. These programs fail if I don't switch back to the standard memory expansion (I designed my 64 Kbyte RAM expansion so that I can disable it with CRU bits). I also have the ability to turn on a piece of hardware which will detect VDP access and insert a wait state in these cases. Then the programs will run correctly, just faster than from 8-bit RAM.

10 hours ago, PeteE said:

PS. Don't forget that you do need a NOP time-waste when reading a byte from the VDP after setting the address to read from. It may work fine in emulation or on an F18A, but a real 9918A will sometimes return wrong values if you read too soon.

IIRC it was proven that on the 99/4A you don't need a time-waste, simply because the 9900's access time to fetch the next instruction (which might then be the VDP read) is longer than the delay the VDP needs to pre-fetch the first byte after the read address is set.  Somewhere in the forum is a thread about this with testing.  I guess I should probably try to find the reference. ;)

 

However, unless you are using unrolled loops, there will probably already be instructions between setting the VDP read-address and reading VDP data, so the traditional NOPs are not necessary; and if they are required you can typically find actual useful instructions to use instead.  Most VDP access is probably going to use some sort of function, like the VWTR, VSBR, VMBR, etc. so there will already be lots of instructions between setting the address and reading data.
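As a sketch of that point: if the address set lives in a subroutine, the return and the caller's instruction fetches already fill the gap, so no explicit NOP is needed (labels and port addresses here are assumptions for illustration):

      SETRD  SWPB R0
             MOVB R0,@>8C02     * low byte of the read address
             SWPB R0
             MOVB R0,@>8C02     * high byte, read mode
             B    *R11          * the return itself is part of the delay
      * caller:
             BL   @SETRD
             MOVB @>8800,R1     * instruction fetches since the address set
                                * already cover the VDP's prefetch time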

12 hours ago, Kchula-Rrit said:

It's been a while since I looked at the schematics, but I thought the multiplexer was only used for the side port.  Well, that should help with the "time-waste" required between VDP writes.

It does, we've gone in detail about that in other threads. :)

 

 

1 hour ago, matthew180 said:

IIRC it was proven that on the 99/4A you don't need a time-waste, simply because the 9900's access time to fetch the next instruction (which might then be the VDP read) is longer than the delay the VDP needs to pre-fetch the first byte after the read address is set.  Somewhere in the forum is a thread about this with testing.  I guess I should probably try to find the reference. ;)

 

However, unless you are using unrolled loops, there will probably already be instructions between setting the VDP read-address and reading VDP data, so the traditional NOPs are not necessary; and if they are required you can typically find actual useful instructions to use instead.  Most VDP access is probably going to use some sort of function, like the VWTR, VSBR, VMBR, etc. so there will already be lots of instructions between setting the address and reading data.

The case PeteE listed is the one case where it's easily possible, at least in scratchpad RAM and using registers. The turnaround time between writing the second byte of the address and reading back the cache register from the VDP can be shorter than the maximum 8 µs the VDP needs to fetch the data. It's /extremely/ common for programmers to put the address set and the read inline, so you do need to make sure something is in there. As you note, it's easy to find something useful to do, and if you're using subroutines, yeah, the RT counts!

 

No other case is an issue unless you're doing something deliberately tricky. LI's count as tricky. ;)

 

Of course, if you create your own memory expansion with 64k of 16-bit wide memory that can be controlled via CRU and turned on and off, then yes, you may have compatibility issues, and you probably should install a VDP that can keep up with your accelerated system. But even there, it's usually the address set/read data combination that has trouble... except now ALL your memory is essentially scratchpad. ;) Even without wait states, the usual instruction sequence is slow enough on the 9900 to not overrun the VDP for other accesses.

 

We can get into an accelerated 9900 as well, but it seems strange to tell everyone they need to slow down their software so your faster system works correctly... ;)

 

 

 

Posted (edited)

I didn't bother trying to modify the VDP just because I have fast memory all over the place. I designed it so that if I disable the 32 K part, which corresponds to the normal memory expansion, then whatever is "below" will be visible. So my internal memory expansion can co-exist with the standard 32 K RAM expansion. Just set the appropriate CRU bits and my internal expansion disappears.

I designed it so that the 64 K of RAM covers the entire addressable range. When the 8-bit latch that holds the memory-enable bits is reset, it pages out internal RAM wherever there shouldn't be any, but pages it in where the normal 32 K RAM expansion sits. Thus the default is to use 32 of the 64 K where RAM should be, but not where ROM and other things should be. But I can page in 8 K chunks over the monitor ROM, over DSR space, etc. if I want to, just as I can page out memory from the 8 K and/or 24 K RAM banks if I want to. Assuming there is a standard memory expansion in the machine, setting the correct bits is all it takes to go into compatibility mode.

This scheme makes it possible to copy monitor ROM to RAM, then change things like interrupt vectors, for example.

 

As you say, very few programs actually fail with fast memory everywhere. The only one I've encountered is the game Tennis. Running in my fast RAM, the players split at the hip: the upper body runs in one direction while the legs go in another.

Enabling only the hardware wait state generation on VDP access will make the game run, but it looks more like table tennis than lawn tennis... You can just watch it run the demo, since beating it is virtually impossible in that case.

Edited by apersson850
On 3/22/2021 at 12:56 AM, apersson850 said:

The reason for the read-before-write approach is obvious when handling bytes: since the CPU can only write a full word, it has to read the word in first, modify the byte in question, then write the whole thing back. That it does this for word accesses too was to save space in the CPU design, since the same logic can then handle both word and byte accesses. Performance suffers, but that's the reason.

 

...

My response is a bit late, but that makes sense.  I'd wondered about the read-before-write.

 

Thanks,

 

K-R.

