V9938 benchmarks

+mizapf · June 12, 2019

I spent some time measuring the performance of the V9938 commands. Funnily enough, I already had this in mind last year and had the test programs prepared (as their timestamps show), but I obviously got distracted by other things.

The results are shown on Ninerpedia at

https://www.ninerpedia.org/wiki/V9938_Video_Data_Processor

I will add some more information about the V9938 from time to time, but I thought the benchmarks would be the most interesting part. What I did was to run these programs on my (real) Geneve, using the real-time clock to take the time (with a resolution of tenths of a second) of some thousand or even some tens of thousands of iterations. From these results I calculated the approximate time for a single operation.

Obviously, the processing time depends essentially linearly on the number of affected pixels. There are some more interesting effects: In Graphic 6 mode, the high speed operations perform at double speed compared to Graphic 7 because 2 pixels are processed in one go in G6. Also, there is a fixed amount of time in each command, maybe for some preparations. We should also not neglect the time for setting up the operation by the user program (loading video registers).

The bottom line is that the results are somewhat disappointing. Line commands are quite fast (thinking about Qix or vector graphics), but mass transfers are not fast enough for serious operations. Thinking about scrolling the screen, tile scrolling is much faster than any video command could be.

+FarmerPotato · June 12, 2019

This is fascinating.

I'm wondering if you are calling MDOS LBlockMove etc. or rolling your own code? Maybe there is a bit of optimization possible on setup or inner loop time.

Drawing lines requires the VDP to do some internal counting, as well as external memory operations. Each of these has different opportunities for optimisation in the hardware of the VDP.

Horizontal lines can be optimised because 4464 DRAMs (aka 41464) have "page write" where pixels of the same row don't require another row address setup. This is also why wide rectangles are faster than tall ones.

You measured 6.8 microseconds / pixel for LINE. I calculate that just 10% of this time is memory access.

(Math: for an n pixel line, optimal memory time is: vertical 300n and horizontal 150*(n+1) in nanoseconds. So a 100 pixel line in horizontal would take 15 us, while the vertical would take 30 us. This is the optimal memory bound for line drawing.)
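The arithmetic above can be written out as a small sketch. The 300 ns and 150 ns figures are the ones quoted in this post; the function names are made up for illustration.

```python
# Optimal DRAM-bound time for drawing a line, using the figures quoted above.
VERTICAL_NS_PER_PIXEL = 300    # every pixel needs a new row address
HORIZONTAL_NS_PER_PIXEL = 150  # page-mode write within the same row


def vertical_line_ns(n):
    """Memory-bound time in nanoseconds for an n-pixel vertical line."""
    return VERTICAL_NS_PER_PIXEL * n


def horizontal_line_ns(n):
    """Memory-bound time in nanoseconds for an n-pixel horizontal line."""
    return HORIZONTAL_NS_PER_PIXEL * (n + 1)


print(horizontal_line_ns(100) / 1000, "us horizontal")
print(vertical_line_ns(100) / 1000, "us vertical")
```

For 100 pixels this gives 15.15 us horizontally and 30 us vertically, which matches the rounded numbers in the post.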

Screen retracing steals 55% of memory bandwidth in G6 or G7. DRAM refresh, not sure, maybe 1%. So VDP operations just have to pause 55% of the time. All told, that means VDP operations might be waiting on DRAM 65-70% of the time. If they run on a 3.58 MHz internal clock, there's additional slippage. I think that accounts for your actual measurements being 10x slower than my optimal memory bound.

HMMV at 25k pixels/second must be comically wasting many chances to send the next byte. By comparison, the VDP scans about 15750 (262.5*60) horizontal lines per second, of which 11520 (192*60) are read from memory, in that 55% of memory bandwidth. It's like you're able to send only 1 pixel on average before the VDP pauses again.

Actually that's an average; during horizontal retrace, G6 or G7 consume 85% of memory bandwidth. (48 microsecond period per line; 40 microseconds of memory read for 256 bytes.) Followed by blanking interval (vertical retrace), 60 times/second.

If code were very optimised to read the TR bit and respond, it might squeak out an extra pixel per line retrace! Though maybe HMMV is totally unable to accept any pixels until the vertical blanking interval comes around.

The 9938 manual says commands are faster when sprites are disabled, or when screen is blanked. I wonder if all your timing would be 4x better with the screen blanked out? Not that that helps for a real program!

+FarmerPotato · June 12, 2019

For those who haven't programmed the 9938, this is the flowchart of an HMMC command (high speed CPU to VRAM block transfer):

Setup Registers including first byte in R#44
Start Command
Read VDP status SR#2
If CE=0, DONE
If TR=0, go back to Read SR#2
Put next byte in R#44
Go back to Read SR#2

Writing to R#44 can be optimised using Register Indirect Addressing.

(Based on page 59 of V9938 MSX-VIDEO USER'S MANUAL)
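As a rough illustration of that polling loop, here is a sketch that runs against a mock VDP. It assumes CE (bit 0 of S#2) reads 1 while the command is executing and TR (bit 7) reads 1 when the next byte can be written. `MockVdp`, its `busy_polls` parameter and `hmmc_transfer` are hypothetical names, and the mock does not reproduce real timing.

```python
CE = 0x01  # S#2 bit 0: command still executing
TR = 0x80  # S#2 bit 7: ready for the next data byte


class MockVdp:
    """Hypothetical stand-in for the command engine; not real hardware timing."""

    def __init__(self, total_bytes, busy_polls=2):
        self.remaining = total_bytes   # bytes the command still expects
        self.busy_polls = busy_polls   # status reads with TR=0 after each byte
        self._countdown = 0
        self.received = []

    def _accept(self, byte):
        self.received.append(byte)
        self.remaining -= 1
        self._countdown = self.busy_polls

    def start_hmmc(self, first_byte):
        # The registers are set up with the first byte already in R#44.
        self._accept(first_byte)

    def write_r44(self, byte):
        self._accept(byte)

    def read_sr2(self):
        if self._countdown > 0:
            self._countdown -= 1
            return CE                  # executing, not ready for data
        if self.remaining == 0:
            return 0                   # CE cleared: command finished
        return CE | TR                 # executing and ready for a byte


def hmmc_transfer(vdp, data):
    """Send data following the flowchart; returns the number of status polls."""
    vdp.start_hmmc(data[0])
    index = 1
    polls = 0
    while True:
        status = vdp.read_sr2()
        polls += 1
        if not status & CE:
            return polls               # command done
        if not status & TR:
            continue                   # not ready yet, poll again
        vdp.write_r44(data[index])
        index += 1
```

The poll count returned by `hmmc_transfer` shows how much of the loop is spent waiting for TR rather than moving data.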

Asmusr · June 13, 2019

The bottom line is that the results are somewhat disappointing. Line commands are quite fast (thinking about Qix or vector graphics), but mass transfers are not fast enough for serious operations. Thinking about scrolling the screen, tile scrolling is much faster than any video command could be.

I'm sure MSX folks like @artrag could explain how to make smooth scrolling on the v9938 work.

https://www.youtube.com/watch?v=Za9tPEgOAgk

artrag · June 13, 2019

Sure, smooth scrolling on the v9938 is not easy, but it can also be done in screen 8 (8 bits of color per pixel).

Later I will try to post a short description of the algorithm.

Basically the screen moves 16 steps by setting up the screen offset register.

At each step a vertical strip 16 pixels wide is copied from the visible page to the hidden page.

The CPU is in charge of blanking one border and plotting new pixels on the other border.

The real problem is avoiding a glitch of the vdp where byte copies are affected by the screen offset register.

In the meantime there are some vdp benchmarks also from the msx scene

http://map.grauw.nl/articles/vdp_commands_speed.php#copyresults

Much more extensive measurements are in the pages of the openMSX emulator, which tries to match the vdp timing exactly.

Edited June 13, 2019 by artrag

+mizapf · June 13, 2019

I'm wondering if you are calling MDOS LBlockMove etc. or rolling your own code? Maybe there is a bit of optimization possible on setup or inner loop time.

No, this is all my own code (i.e. setting the video registers directly) because I wanted to avoid measuring MDOS system call overheads. The thing about sprite inhibition is interesting; I could re-run the tests with it.

These benchmarks are also interesting for me in order to correct the timings of the V9938 emulation in MAME.

+FarmerPotato · June 13, 2019

No, this is all my own code (i.e. setting the video registers directly) because I wanted to avoid measuring MDOS system call overheads. The thing about sprite inhibition is interesting; I could re-run the tests with it.

These benchmarks are also interesting for me in order to correct the timings of the V9938 emulation in MAME.

Hi,

I see, you are interested in matching the timing in any case. That could be some interesting code.

I'm interested in optimising (because I spent most of Saturday studying the V9938/58)

Do you have these optimisations:

1. inner loop running from PAD (I forget whether this really helps on the 9995)

2. When sending data in HMMC, do you load the byte into R#44 the usual way, or do you use Register Indirect Addressing?

i.e.

set R#44 by writing 2 bytes to VDPWA

vs

set R#17 to 44 once

set R#44 by writing 1 byte to VDPWA+4
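To make the difference concrete, here is a small sketch that only counts port writes for the two approaches. It uses the V9938's logical port numbers (port #1 for register setup, port #3 for indirect register data). How these map to Geneve addresses is not shown, and `PortLog` and the function names are invented for the example. Bit 7 of R#17 is set so that the register pointer does not auto-increment away from R#44.

```python
CONTROL_PORT = 1   # V9938 port #1: register setup (data byte, then 0x80|reg)
INDIRECT_PORT = 3  # V9938 port #3: data for the register selected by R#17


class PortLog:
    """Hypothetical bus that just records every port write."""

    def __init__(self):
        self.writes = []

    def out(self, port, value):
        self.writes.append((port, value & 0xFF))


def write_register(bus, reg, value):
    """Direct register write: two bytes on the control port."""
    bus.out(CONTROL_PORT, value)
    bus.out(CONTROL_PORT, 0x80 | reg)


def send_direct(bus, data):
    """Load every data byte into R#44 the usual way."""
    for byte in data:
        write_register(bus, 44, byte)


def send_indirect(bus, data):
    """Point R#17 at R#44 once (auto-increment inhibited), then stream bytes."""
    write_register(bus, 17, 0x80 | 44)
    for byte in data:
        bus.out(INDIRECT_PORT, byte)


direct, indirect = PortLog(), PortLog()
send_direct(direct, bytes(256))
send_indirect(indirect, bytes(256))
print(len(direct.writes), "writes direct,", len(indirect.writes), "writes indirect")
```

For 256 data bytes this is 512 port writes against 258, so the indirect route roughly halves the CPU-side traffic. Whether that matters depends on how often the loop is actually waiting on TR.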

Asmusr · June 13, 2019

Sure, smooth scrolling on the v9938 is not easy, but it can also be done in screen 8 (8 bits of color per pixel).

Later I will try to post a short description of the algorithm.

Basically the screen moves 16 steps by setting up the screen offset register.

At each step a vertical strip 16 pixels wide is copied from the visible page to the hidden page.

The CPU is in charge of blanking one border and plotting new pixels on the other border.

The real problem is avoiding a glitch of the vdp where byte copies are affected by the screen offset register.

In the meantime there are some vdp benchmarks also from the msx scene

http://map.grauw.nl/articles/vdp_commands_speed.php#copyresults

Much more extensive measurements are in the pages of the openMSX emulator, which tries to match the vdp timing exactly.

You're describing vertical scrolling, right? But how is the horizontal scrolling in Uridium 2 possible?

artrag · June 13, 2019

You're describing vertical scrolling, right? But how is the horizontal scrolling in Uridium 2 possible?

No, I'm describing horizontal scrolling in Uridium 2 (screen 8, scroll speed up to 1 pixel/60Hz in both directions).

artrag · June 13, 2019

This is an explanation of how the scrolling in Uridium 2 works:

;----------------------------------------------------------------------------
; Horizontal scrolling in screen 8 on msx2
; by artrag
;
; Abstract: Horizontal scrolling on msx 2 is a sort of holy grail.
; Here follows an overview of the approach followed in Uridium 2 beta
; and a simplified version of its scrolling engine in screen 8.
; We focus on side scrolling right to left; the extension to the other
; direction is straightforward.
; The image to scroll is composed of tiles of 16x16 pixels; the screen
; window is 240x160 (taller windows are possible only in PAL mode).
;
; Memory layout
;
; As we are in screen 8, we have only two VRAM pages that have to be
; used to show the screen window using  double buffering (i.e. we plot on one page 
; while we show the other page).
; This implies that the tiles have to stay in RAM or ROM.
; Moreover, as we will see later, access to the tiles is delegated to the z80,
; leaving to the vdp the task of moving the screen.
;
; The engine relies on a mega rom mapper with 8K pages (here Konami SCC) as a  
; full tile set of 256 tiles of 16x16 pixels takes 64KB of space.
; Naturally, using a ram mapper or fewer tiles, other solutions are possible.
; Each tile is stored column-wise (i.e. successive bytes belong to the same column), 
; this in order to simplify the task of the z80 which is in charge of plotting the 
; new columns of pixels that enter the screen.
; 
; The map is stored in RAM row-wise and, to simplify line changes, the map size is 256x10.
; Each byte in the map is a tile number.
; As a tile is 16x16=256 bytes, a page in the rom mapper can host exactly 8*1024/256  = 32 tiles.
; The full tile set of 256 tiles is spread across 8 pages of 8K each.
;
; The algorithm 
;   
; The engine works on the ISR and is based on a strong parallelism between VDP and z80.
; The visible window is 240x160 pixels, plus an extra border area of 16x160 pixels.  
; The screen window is moved in 16 steps corresponding to the 16 scroll positions of R#18.
; The z80 is in charge of plotting, at each ISR, a new column of pixels in the right border
; and a column of blank pixels in the left border of the visible screen.
; In 16 steps, the z80 builds a full new column of tiles on the right and deletes
; a corresponding area of 16x160 pixels on the left.
; In the meantime, at each step, the VDP is in charge of moving 15 slices of 16x160 pixels
; from the visible screen to the hidden page (displacing each slice by 16 pixels) and of
; building the black border on the right that will appear at page swap (also in this case 16x160 pixels).
; Once the screen offset has reached its maximum, the hidden page is ready and can be 
; swapped with the visible page and the process can restart.
; 
; Devil in the details (1)
; 
; When the z80 starts filling the 16th column of the right border on the visible screen  
; (column n. 255) the vdp has to copy the slice of pixels 16x160 that includes the column
; of pixels being plotted by the z80 from the visible page to the hidden page.
; In order to maximize the parallelism, the vdp starts working before the z80 has
; started plotting its column, so 1-2 pixels at the top of column 255 get copied before the
; z80 has updated them.
; Without a workaround, the black pixels would propagate in the screen slice by slice.
; The engine solves the concurrency problem by filling the two wrong pixels with valid
; data once they have been copied to column 239 on the hidden page.
; This causes a small loss in performance as the Z80 will copy those two bytes twice:
; once to column 255 on the visible page, once to column 239 on the hidden page.
; Anyway, the patch occurs only when the screen offset is equal to 15 and is overall acceptable
; as we set just 2 bytes in VRAM.
;
; Devil in the details (2)
;
; Why do we split the screen in slices instead of copying the whole 240x160 area to be moved?
; The problem is that the command engine of the VDP is affected by changes in R#18.
; At each change in R#18 occurring while the VDP is copying, there is the possibility that
; a black or white pixel appears.
; The only solution is to set R#18 only after the VDP command has been executed.
; This implies that the copy of the screen has to be segmented in tasks that the VDP can 
; complete within a single frame, before R#18 has to be set again.
; The height of the screen (160 pixels) has been chosen so that the VDP ends
; its copy at about the end of the visible area (around line 163) also in NTSC mode
; (in PAL the VDP command is completed much earlier).
; This means that, also in NTSC, a line interrupt at line 160 could safely disable the screen 
; and the sprites, swap page to show a score bar, wait hblank, enable the screen and reset R#18
; without affecting the last VDP command.
; Moreover, the time between line 164 and 192 can be used to execute other VDP commands.
; In the test rom, the RED color bar indicates the z80 usage, the GREEN color bar the vdp usage. 
; Press space to swap between NTSC and PAL and see how things change accordingly.
; 
;
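The memory layout described in the article can be sketched as address arithmetic. This is only an illustration of the layout as stated (256 tiles of 256 bytes, 32 tiles per 8K mapper page, column-wise tile storage, map rows 256 bytes apart). The function names are invented, page numbers are relative to the first tile page, and the real engine is Z80 code that is not reproduced here.

```python
TILE_SIZE = 16                             # tiles are 16x16 pixels, 1 byte each
TILE_BYTES = TILE_SIZE * TILE_SIZE         # 256 bytes per tile
PAGE_BYTES = 8 * 1024                      # one mapper page
TILES_PER_PAGE = PAGE_BYTES // TILE_BYTES  # 32 tiles per page
MAP_WIDTH = 256                            # map is 256x10 tiles, stored row-wise


def tile_location(tile_number):
    """Return (relative mapper page, byte offset of the tile in that page)."""
    page, index = divmod(tile_number, TILES_PER_PAGE)
    return page, index * TILE_BYTES


def map_index(map_row, map_col):
    """Index of a tile number in the row-wise map (row stride of 256)."""
    return map_row * MAP_WIDTH + map_col


def column_source(tile_map, map_row, map_col, pixel_col):
    """Where the 16 bytes of one pixel column of a map cell are stored."""
    tile = tile_map[map_index(map_row, map_col)]
    page, offset = tile_location(tile)
    # Column-wise storage: the 16 bytes of a pixel column are consecutive.
    return page, offset + pixel_col * TILE_SIZE
```

Because each pixel column of a tile is 16 consecutive bytes, the CPU can plot a new screen column with a plain sequential read per tile, which is the point of storing tiles column-wise.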

Sources and binary ROM for the above algorithm are here

https://github.com/Maneldemo/Uridium-2-for-msx/blob/master/Horizontal_scroll_article.rar?raw=true

Edited June 13, 2019 by artrag

Asmusr · June 13, 2019

This is an explanation of how the scrolling in Uridium 2 works

Thanks, I think I understand most of it. I have looked at the programmer's guide several times but missed the importance of R#18.

Edited June 13, 2019 by Asmusr

+mizapf · June 13, 2019

Screen 8 = Graphic 7, as I found out. The screen numbers seem to be MSX terminology.

artrag · June 13, 2019

Screen 8 = Graphic 7, as I found out. The screen numbers seem to be MSX terminology.

Right, sorry for not mentioning this (this msx terminology is derived from MSX BASIC, where the command to change mode is SCREEN x and x=8 corresponds to Graphic 7 mode in the v9938 manuals).

Edited June 13, 2019 by artrag

+mizapf · June 13, 2019

Hi,

I see, you are interested in matching the timing in any case. That could be some interesting code.

I'm interested in optimising (because I spent most of Saturday studying the V9938/58)

Do you have these optimisations:

1. inner loop running from PAD (I forget whether this really helps on the 9995)

2. When sending data in HMMC, do you load the byte into R#44 the usual way, or do you use Register Indirect Addressing?

i.e.

set R#44 by writing 2 bytes to VDPWA

vs

set R#17 to 44 once

set R#44 by writing 1 byte to VDPWA+4

I did not do too much optimization; I assumed that the VDP execution time is significantly longer than any possible wait states in the CPU execution.

I'm using on-chip RAM for the workspace (which plays the role of PAD on the TI-99). Here is one of the benchmark programs that draws lines to fill the screen. PRINT, GETTIM, and ITOA are used for outputting the time interval. I originally used this kind of program for multiple timing tests where the routines are referenced in the TESTS list. The null loop just sets a RAM address instead of the video port in order to identify the run time of the loop as such (without VDP execution).

vcmdspeed5.txt
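The way a per-operation time falls out of such a measurement can be sketched as follows. This is not taken from the attached program; the function names and the sample numbers are made up, and only the method (clock readings in tenths of a second, null-loop time subtracted, divided by the iteration count) follows the description above.

```python
def time_per_operation_us(total_tenths, null_tenths, iterations):
    """Approximate time of one VDP operation in microseconds.

    total_tenths: run time of the test loop, in tenths of a second
    null_tenths:  run time of the null loop (same loop, no VDP access)
    """
    net_seconds = (total_tenths - null_tenths) / 10
    return net_seconds / iterations * 1e6


def clock_resolution_us(iterations):
    """Uncertainty per operation caused by the 0.1 s clock resolution."""
    return 0.1 / iterations * 1e6


# Hypothetical readings: 10.0 s for the test, 2.0 s for the null loop.
print(time_per_operation_us(100, 20, 10000), "us per operation")
print(clock_resolution_us(10000), "us resolution")
```

With tens of thousands of iterations the 0.1 s clock resolution shrinks to a few microseconds per operation, which is why the long runs are needed.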

+FarmerPotato · June 14, 2019

I did not do too much optimization; I assumed that the VDP execution time is significantly longer than any possible wait states in the CPU execution.

I'm using on-chip RAM for the workspace (which plays the role of PAD on the TI-99). Here is one of the benchmark programs that draws lines to fill the screen. PRINT, GETTIM, and ITOA are used for outputting the time interval. I originally used this kind of program for multiple timing tests where the routines are referenced in the TESTS list. The null loop just sets a RAM address instead of the video port in order to identify the run time of the loop as such (without VDP execution).

Got it.

I found the analysis of V9938 memory timing by Wouter Vermaelen.

Using a logic analyzer, he finds the number of potential slots (per display line) for the CPU or the command engine to access DRAM. By mode, these are:

Sprites on: 31

Sprites off: 88

Display off: 154 (or during vertical interval)

He goes on to state that with sprites on, the command engine is limited by available memory bandwidth, but with sprites off, it is not 2x faster and is therefore limited by the command engine itself.
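For reference, the slot counts above translate into the following upper bounds on speed-up, assuming (purely for the sake of the arithmetic) that the command engine could use every available slot:

```python
# Access slots per display line, as quoted above.
SLOTS = {"sprites on": 31, "sprites off": 88, "display off": 154}


def max_speedup(mode, baseline="sprites on"):
    """Ratio of available slots; an upper bound, not a measured speed-up."""
    return SLOTS[mode] / SLOTS[baseline]


for mode in SLOTS:
    print(f"{mode}: {max_speedup(mode):.2f}x")
```

That is roughly 2.8x with sprites off and 5x with the display off, so a measured gain well below those ratios points at the command engine rather than the memory as the limit.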

artrag · June 14, 2019

Yes, Wouter is the main coder behind openMSX.

His work on modeling the v9938 and its timing is amazing.

I would refer to the sources of openMSX to improve the v9938 emulation in MAME.

About the vdp copy engine, an important trick to reduce the CPU overhead is to take into account the final values that the previous command leaves in the command registers.

If you know what values are in those registers, you can often skip loading them.

It can be a huge saving, but the conditions have to be right to exploit the trick.

This demo builds the screen column by column. Within the column I use the final state of the registers to set the next copy fill command.
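One way to organise that trick is a shadow copy of the command registers that skips writes when the value is already known. The sketch below is a generic illustration, not the code of the demo. The class and method names are invented, and the post-command register contents passed to `command_finished` are placeholders that would have to come from the V9938 documentation or from measurement.

```python
class CommandRegisterCache:
    """Hypothetical shadow of R#32..R#46 that avoids redundant register loads."""

    def __init__(self, write_register):
        self._write = write_register   # callable(reg, value) doing the real write
        self._known = {}               # reg -> value believed to be in the VDP

    def load(self, values):
        """Write only registers whose content differs; return the write count."""
        written = 0
        for reg, value in values.items():
            if self._known.get(reg) != value:
                self._write(reg, value)
                self._known[reg] = value
                written += 1
        return written

    def command_finished(self, final_values):
        """Record what the finished command left behind (None means unknown)."""
        for reg, value in final_values.items():
            if value is None:
                self._known.pop(reg, None)
            else:
                self._known[reg] = value


log = []
cache = CommandRegisterCache(lambda reg, value: log.append((reg, value)))
# R#36 = DX low, R#38 = DY low, R#40 = NX low, R#42 = NY low
first = cache.load({36: 0, 38: 0, 40: 16, 42: 160})
# Placeholder final state: DY advanced to 160, NY no longer known.
cache.command_finished({38: 160, 42: None})
second = cache.load({36: 0, 38: 160, 40: 16, 42: 160})
print(first, "writes for the first command,", second, "for the second")
```

In this made-up sequence the second command needs one register load instead of four, which is the kind of saving described above when commands are chained column by column.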

Edited June 14, 2019 by artrag

artrag · June 15, 2019

Another remarkable use of the v9938 capabilities (not mine this time)

Edited June 15, 2019 by artrag

+mizapf · June 23, 2019

Another point I've been investigating is the rate at which the mouse at the V9938 color bus is polled for movement. Typical mice are polled at something like 500-800 Hz; some, sold as "high-precision" mice, go higher, but the rate is typically less than 1000 polls per second.

The V9938 specification documents do not reveal any details about the mouse handling, only that the axis lines XA/XB and YA/YB are connected via the color bus, as are the two mouse buttons. The third button on the Geneve mouse is directly routed to the 9901.

Further research, particularly in the Amiga/Atari world, showed me that the XA/XB (or YA/YB) lines deliver single impulses, where these impulses are created by triggering two photoelectric sensors*. If the mouse is moved in one direction, the pulse appears on XA, then on XB, and in the other direction, it is first XB, then XA (and in the other dimension it is the Y pair).
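The XA/XB behaviour described here is what is usually called a quadrature signal. The sketch below shows a generic sampled quadrature decoder. It is not a description of what the V9938 does internally, which is unknown, and the names are chosen for the example.

```python
# (A, B) line states in the order they occur for movement in one direction.
_SEQUENCE = [(0, 0), (1, 0), (1, 1), (0, 1)]
_INDEX = {state: i for i, state in enumerate(_SEQUENCE)}


def decode(samples):
    """Return the net step count for a list of sampled (A, B) states."""
    position = 0
    previous = samples[0]
    for current in samples[1:]:
        step = (_INDEX[current] - _INDEX[previous]) % 4
        if step == 1:
            position += 1      # A changed before B: one direction
        elif step == 3:
            position -= 1      # B changed before A: the other direction
        # step == 0: no movement; step == 2: a transition was missed, so the
        # direction is ambiguous and the sample is ignored.
        previous = current
    return position


forward = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(decode(forward), decode(forward[::-1]))
```

A sampled decoder like this loses counts as soon as both lines change between two samples (the `step == 2` case), which is exactly the kind of upper limit on the impulse rate that a fixed sampling rate would produce.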

Still, I expected some rate at which these inputs are sampled. This would mean that there is a maximum rate at which impulses can be sensed; if the mouse would be moved at a higher speed, this would not result in a higher impulse count.

I wrote a test program for native mode and moved the mouse around as fast as I could. During a fixed time (about 5 seconds) the impulses are added inside the V9938 to be queried from status register 3, after which the counter is reset to 0 automatically. This way I was able to count up to 7000 impulses, about 1400 impulses per second, which is much higher than expected.

Then I got an idea: Moving the mouse over the table is not only risky (shoving the mouse over the table edge) but also stressful to the mechanics because of the heavy ball inside. When I was cleaning up my storage I found my good old "fischertechnik" box that I played a lot with in my childhood. So I built a test stand, as you can see below. Motor and power supply were still working.

The results were even more surprising: As you see on the attached second picture, the status register inside the V9938 advanced by 15793 steps within 49 tenths of a second (4.9 seconds). That is more than 3200 impulses per second! I could not get the motor running faster, but I guess the rate could be even higher than that, and I am somewhat concerned that I might damage the mouse after all.
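The two rates quoted in this post follow directly from the counts and the clock readings in tenths of a second (the function name is just for illustration):

```python
def impulses_per_second(count, tenths_of_second):
    """Impulse rate from a counter reading and a duration in tenths of a second."""
    return count / (tenths_of_second / 10)


print(impulses_per_second(7000, 50))    # by hand, about 5 seconds
print(impulses_per_second(15793, 49))   # motorised test stand, 4.9 seconds
```

That is 1400 impulses per second by hand and about 3223 on the test stand, both well above the polling rates of typical mice mentioned earlier.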

So my guess is now that the sample rate is even higher, or that the impulses are counted asynchronously. In the latter case I would imagine that the input at the color bus is an edge-triggered counter circuit. Unfortunately, nobody can really help with that, including the MSX experts, because they never used a mouse at the color bus. On the other hand, it is not a big issue; it simply means that I need not emulate a regular sampling in MAME but can just take the mouse input as delivered by the input core.

* I could not find a good translation of German "Lichtschranke" that we typically use to describe this construct. "Light barrier" seems too uncommon.
