Atarisoft -- Missing in Action

+OLD CS1 · July 29, 2014

A monochrome buffer would take 6k of CPU RAM for the full screen. It does take a couple of frames to get that much data to the VDP (you could of course use a window instead of all of it!) Alternately, it would be interesting to see how well the dirty rectangle approach would work!

~~Do you mean leveraging CPU RAM (32k expansion or whatever) as screen buffers to actually draw the screen and then dumping it into the VDP?~~

Duh, that is exactly what you meant. I was typing while thinking and missed the bit about taking a couple of frames to move that data (worse, I was wondering about the delay time to dump data into VDP RAM.)

sometimes99er · July 29, 2014

Again, looking at the VIC-20 and C64 implementations of Battlezone, Omega Race, Lunar Lander and Star Wars in the early eighties, none of them use vector or wireframe graphics in the sense of lines drawn pixel by pixel by the CPU on a high-resolution bitmap. Am I wrong?

am1933 · July 29, 2014

http://www.youtube.com/watch?v=ioZv5-mLDmY

http://www.youtube.com/watch?v=gIod3Kz33p0

The first link is for the ZX Spectrum game "3D Tank Duel".

The second link is for "Rommel's Revenge" on the Dragon 32.

Sometimes makes you wonder if it is actually worthwhile having a dedicated VDP with hardware sprites etc.

Edited July 29, 2014 by am1933

Tursi · July 29, 2014

We have a version of Star Wars, now. I have a ROM that might be a prototype?

Wait.. what? We do??

There were a few homebrew single-screen shooters that called themselves Star Wars, but I never saw any attempts to actually reproduce the arcade game -- what do you have there??

~~Do you mean leveraging CPU RAM (32k expansion or whatever) as screen buffers to actually draw the screen and then dumping it into the VDP?~~

Duh, that is exactly what you meant. I was typing while thinking and missed the bit about taking a couple of frames to move that data (worse, I was wondering about the delay time to dump data into VDP RAM.)

Oh, yeah, but I was just responding to JamesD's suggestion.

My first instinct was it was "probably not", but in thinking it through as I wrote, the overhead on every pixel might well be made up even with the extra copy, especially if there are lots of lines. I went looking for dirty rectangle coalesce algorithms to see if there was anything clever out there that might run at speed on the TI, but all the threads I found on it were full of "why do you want to do that?" and "OpenGL is so fast who cares?" The future of computing is so bright.

Omega-TI · July 29, 2014

The F18A GPU should be plenty fast for line drawing, of course, particularly on the overlay screen where there's a single instruction to plot a pixel.

It might make a really nice showcase program for the F18A's capabilities too, that is, if someone wanted to go that route. Of course, that might actually be a form of torture to a non-upgraded TI retro gamer. Imagine having a NEW program for your box that you cannot run. Matthew's site would probably get a workout from the few remaining hold-outs.

JamesD · July 29, 2014

Oh, yeah, but I was just responding to JamesD's suggestion.

My first instinct was it was "probably not", but in thinking it through as I wrote, the overhead on every pixel might well be made up even with the extra copy, especially if there are lots of lines. I went looking for dirty rectangle coalesce algorithms to see if there was anything clever out there that might run at speed on the TI, but all the threads I found on it were full of "why do you want to do that?" and "OpenGL is so fast who cares?" The future of computing is so bright.

When you are drawing lines, you have to OR (or XOR) new pixels with existing screen contents and then write it back.

This is really slow through the VDP.

On top of that, you are drawing to characters for the full 256x192 display rather than a bitmap which requires additional processing.

If however, you can draw to a monochrome buffer, you can draw with much fewer reads and writes than characters would require.

Just reducing VDP access to writes alone cuts VDP access in half let alone what would be required to deal with characters through the VDP.

This may be a possible solution if the 9900 is fast enough and VDP access is slow enough.

You could keep track of dirty characters as you write to the local buffer, as well as using a dirty rectangle, and then update only the characters within the rectangle.

dirtychar(x / charwidth, y / charheight) = 1

The divisions are fast since they are just bit shifts, and with proper integration into the Bresenham line routine you wouldn't have to do every operation for each pixel.

Convert the dirty rectangles to character locations in the same way and just scan that part of the dirty character array for modified bytes and update the VDP RAM.
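For anyone who wants to poke at the bookkeeping, here is a minimal sketch of the dirty-character idea, written in Python rather than 9900 assembly so it is easy to run. The names (`mark_dirty`, `dirty_cells`, `CELL_SHIFT`) are made up for illustration; only the 256x192 screen, the 8x8 characters and the shift-instead-of-divide trick come from the post.

```python
SCREEN_W, SCREEN_H = 256, 192
CELL_SHIFT = 3                      # 8x8 characters: dividing by 8 is a shift by 3
COLS = SCREEN_W >> CELL_SHIFT       # 32 character columns
ROWS = SCREEN_H >> CELL_SHIFT       # 24 character rows

def new_dirty_map():
    """One flag per character cell, all clean."""
    return bytearray(COLS * ROWS)

def mark_dirty(dirty, x, y):
    """Flag the character cell that contains pixel (x, y)."""
    dirty[(y >> CELL_SHIFT) * COLS + (x >> CELL_SHIFT)] = 1

def dirty_cells(dirty, x0, y0, x1, y1):
    """Scan only the cells covered by the dirty rectangle (pixel coordinates,
    inclusive) and return the (col, row) pairs that were actually touched."""
    hits = []
    for row in range(y0 >> CELL_SHIFT, (y1 >> CELL_SHIFT) + 1):
        for col in range(x0 >> CELL_SHIFT, (x1 >> CELL_SHIFT) + 1):
            if dirty[row * COLS + col]:
                hits.append((col, row))
    return hits

dirty = new_dirty_map()
for i in range(50, 60):             # a short diagonal from (50,50) to (59,59)
    mark_dirty(dirty, i, i)
print(dirty_cells(dirty, 50, 50, 59, 59))   # [(6, 6), (7, 7)]
```

The rectangle covers four cells, but only the two the diagonal passes through get reported, which is the saving over blindly copying the whole rectangle.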

Maybe?

At some point the amount of processing time will exceed the time of going through the VDP so you have to be careful.

*edit*

With some special cases for horizontal, vertical, 45 degree, etc... you can also speed up access to your buffer a lot.

Horizontal lines for example can be drawn an entire byte at a time for the ends and middle bytes.
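A rough sketch of the horizontal special case, assuming a plain linear monochrome buffer (32 bytes per row, most significant bit on the left). `hline` and `STRIDE` are illustrative names, not anything from an existing routine. In the VDP's character-ordered layout the neighbouring byte would be 8 bytes away instead of 1, but the masking is the same.

```python
STRIDE = 32  # bytes per row in a linear 256-pixel-wide monochrome buffer

def hline(buf, x0, x1, y):
    """Set pixels x0..x1 (inclusive) on row y, writing whole bytes where possible."""
    if x0 > x1:
        x0, x1 = x1, x0
    first = y * STRIDE + (x0 >> 3)
    last = y * STRIDE + (x1 >> 3)
    left_mask = 0xFF >> (x0 & 7)                    # x0 to the end of its byte
    right_mask = (0xFF << (7 - (x1 & 7))) & 0xFF    # start of the byte through x1
    if first == last:
        buf[first] |= left_mask & right_mask
        return
    buf[first] |= left_mask
    for addr in range(first + 1, last):
        buf[addr] = 0xFF                            # middle bytes: no read needed
    buf[last] |= right_mask

buf = bytearray(STRIDE * 192)
hline(buf, 5, 18, 0)
print(buf[0:3].hex())   # 07ffe0
```

Fourteen pixels cost three byte operations here instead of fourteen read-modify-write cycles.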

Edited July 29, 2014 by JamesD

+OLD CS1 · July 29, 2014

Wait.. what? We do??

There were a few homebrew single-screen shooters that called themselves Star Wars, but I never saw any attempts to actually reproduce the arcade game -- what do you have there??

Got it from Bob...

I went looking for dirty rectangle coalesce algorithms to see if there was anything clever out there that might run at speed on the TI, but all the threads I found on it were full of "why do you want to do that?" and "OpenGL is so fast who cares?" The future of computing is so bright.

THIS RIGHT HERE. THIS bullshit attitude (my emphasis added) kills innovation. :mad:

Tursi · July 29, 2014

When you are drawing lines, you have to OR (or XOR) new pixels with existing screen contents and then write it back.

This is really slow through the VDP.

This is true - and I said that already. Each pixel takes 5 writes, and a read. To a CPU buffer, you can directly change the pixel in a single instruction.

On top of that, you are drawing to characters for the full 256x192 display rather than a bitmap which requires additional processing.

This is either not true or I don't understand what you mean. You can have a fully addressable bitmap (and the line draw code I posted assumes this.) The layout is based on characters but you can safely ignore that fact once it is set up.

If you mean calculating the address is more complex than Y*Stride+X/8, then, true. But not much more so. You can get the address in a half dozen instructions. For a full line, you only need to do so once. To keep the CPU copy loop fast, you would need to lay out the CPU buffer in the same way as VDP, so, you have to do that either way.
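To make the comparison concrete, here is a small sketch of both address calculations. It assumes the usual bitmap-mode arrangement (8 bytes per character cell, 32 cells per character row, pattern table based at 0); the function names are invented for the example.

```python
def vdp_bitmap_offset(x, y):
    """Byte offset and bit mask for pixel (x, y) in the character-ordered layout:
    8 bytes per cell, 32 cells (256 bytes) per character row."""
    offset = ((y >> 3) << 8) + ((x >> 3) << 3) + (y & 7)
    mask = 0x80 >> (x & 7)
    return offset, mask

def linear_offset(x, y, stride=32):
    """The flat framebuffer equivalent: Y*Stride + X/8."""
    return y * stride + (x >> 3), 0x80 >> (x & 7)

print(vdp_bitmap_offset(50, 50))    # (1586, 32)
print(linear_offset(50, 50))        # (1606, 32)
```

Both are a handful of shifts, masks and adds, which is the point being made: the character layout costs a little more per address, not a different order of work.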

If however, you can draw to a monochrome buffer, you can draw with much fewer reads and writes than characters would require.

Just reducing VDP access to writes alone cuts VDP access in half let alone what would be required to deal with characters through the VDP.

The character argument notwithstanding, drawing to the buffer eliminates the redundant address writes - but you add an extra read/write loop to copy the data back to the VDP.

For a contrived example, two pixels looks like this (assuming address already calculated):

To VDP:

Address write 1, address write 2, pixel read, operation, address write 1, address write 2, pixel write

To CPU:

operation, operation, address write 1, address write 2, pixel write, pixel write

And that looks great!

But, the savings is lost if you have to write a lot of unchanged data to the VDP (ie: a full 6k or even 2k buffer for a single line). Figure a 100 pixel diagonal line from 50,50 to 150,150 would need 700 steps direct to VDP, but if you blindly write the whole rectangle, that's 2502 operations (more than 3 times the work!) On the other hand, if you had five lines drawn inside that space of comparable length, suddenly the same copy is a big win!
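One way to reproduce those figures, as a sketch only: the post does not spell out how 2502 was reached, so the breakdown below (two address writes, then a CPU read and a VDP write for each of the 1250 bytes in a 100x100 pixel rectangle) is a guess that happens to match. It also pretends the rectangle is one contiguous run, which it would not be in VDP memory.

```python
ADDRESS_WRITES = 2   # two writes to set the VDP address

def direct_steps(pixels):
    """Plot straight to the VDP: set address, read, modify, set address, write."""
    per_pixel = ADDRESS_WRITES + 1 + 1 + ADDRESS_WRITES + 1   # 7 steps
    return pixels * per_pixel

def blind_copy_steps(width_px, height_px):
    """Copy a whole rectangle from the CPU buffer: one address setup, then a
    CPU read and a VDP write per byte (assumes one contiguous run)."""
    nbytes = (width_px * height_px) // 8
    return ADDRESS_WRITES + 2 * nbytes

def buffered_steps(lines, pixels_per_line, width_px, height_px):
    """Draw every line into CPU RAM (one operation per pixel), then copy once."""
    return lines * pixels_per_line + blind_copy_steps(width_px, height_px)

print(direct_steps(100))                    # 700
print(blind_copy_steps(100, 100))           # 2502
print(5 * direct_steps(100))                # 3500
print(buffered_steps(5, 100, 100, 100))     # 3002
```

With one line the copy loses badly; with five lines in the same rectangle the buffered route comes out ahead, and the gap widens with every extra line.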

Another consideration is if the data you write is too discontinuous, then you also lose some overhead to changing the VDP address all the time. So a smart dirty rectangle algorithm really makes or breaks it. It has the potential to be much faster but also has the potential to be slower.

This may be a possible solution if the 9900 is fast enough and VDP access is slow enough.

There's a misconception that writing to the VDP is slow. It's not. A write to the VDP data port takes the exact same amount of time as a write to 8-bit CPU memory. The issue with VDP speed is that there are two "gotchas". The first is that it's not considered safe to write back to back to the chip (although we generally believe now that it is safe on the stock TI-99/4A, except for reading immediately after setting the address). The second is that if you're not writing consecutive bytes, you need two extra write cycles to change the address. This is what really hurts a single pixel update to the VDP, because the auto-increment means you need to do it twice per pixel, and this is where the CPU buffer has the biggest chance to shine.
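A toy model of that cost, counting port accesses rather than cycles. `FakeVDP` is purely illustrative: it only captures the two properties described above, namely that setting the address takes two writes and that every data access auto-increments the address.

```python
class FakeVDP:
    """16K of VRAM behind an address register that auto-increments after
    every data-port access. Counts port accesses instead of timing them."""
    def __init__(self):
        self.vram = bytearray(0x4000)
        self.addr = 0
        self.accesses = 0

    def set_address(self, addr):
        self.addr = addr & 0x3FFF
        self.accesses += 2          # two writes to the control port

    def write(self, value):
        self.vram[self.addr] = value & 0xFF
        self.addr = (self.addr + 1) & 0x3FFF
        self.accesses += 1

    def read(self):
        value = self.vram[self.addr]
        self.addr = (self.addr + 1) & 0x3FFF
        self.accesses += 1
        return value

def plot_direct(vdp, addr, mask):
    """Read-modify-write one pixel: the read moves the address on, so set it twice."""
    vdp.set_address(addr)
    old = vdp.read()
    vdp.set_address(addr)
    vdp.write(old | mask)

def copy_run(vdp, addr, data):
    """Consecutive bytes need the address only once."""
    vdp.set_address(addr)
    for b in data:
        vdp.write(b)

one_pixel = FakeVDP()
plot_direct(one_pixel, 1586, 0x20)
one_cell = FakeVDP()
copy_run(one_cell, 1584, bytes(8))
print(one_pixel.accesses, one_cell.accesses)   # 6 10
```

Six accesses buy a single pixel on the direct route, while ten accesses move a whole 8-byte cell from a buffer.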

Convert the dirty rectangles to character locations in the same way and just scan that part of the dirty character array for modified bytes and update the VDP RAM.

I /do/ like this idea of doing the dirty mask by character, that gives you a fixed 8 bytes to write for each 'dirty' flag, which you can unroll, and they are easy to coalesce. Neat idea! I wonder if we can test and reset in one operation -- can we use SZC or something like that to force a value to zero and also determine if it was zero before? (Just to save an operation, not necessary I guess...)
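A sketch of what the per-character flush with coalescing could look like, assuming the CPU buffer uses the same character-ordered layout as the VDP so that cell n lives at byte n*8. `flush_dirty` and its callback parameters are hypothetical; the callbacks stand in for the real port writes.

```python
CELL_BYTES = 8
CELLS = 32 * 24

def flush_dirty(buf, dirty, set_address, write_byte):
    """Copy dirty cells to the VDP, setting the address once per run of
    neighbouring cells and clearing each flag on the way.
    Returns the number of address setups that were needed."""
    setups = 0
    cell = 0
    while cell < CELLS:
        if not dirty[cell]:
            cell += 1
            continue
        set_address(cell * CELL_BYTES)          # start of a run of dirty cells
        setups += 1
        while cell < CELLS and dirty[cell]:
            dirty[cell] = 0
            base = cell * CELL_BYTES
            for b in buf[base:base + CELL_BYTES]:   # fixed 8 bytes: unrollable
                write_byte(b)
            cell += 1
    return setups

buf = bytearray(range(256)) * 24    # 6K of recognisable test data
dirty = bytearray(CELLS)
for c in (3, 4, 5, 40):
    dirty[c] = 1
addresses, written = [], []
setups = flush_dirty(buf, dirty, addresses.append, written.append)
print(setups, addresses, len(written))   # 2 [24, 320] 32
```

Four dirty cells cost two address setups because three of them are neighbours. In Python the test and the reset are separate statements; whether they can be fused on the 9900 is the open question in the post.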

JamesD · July 30, 2014

...

I /do/ like this idea of doing the dirty mask by character, that gives you a fixed 8 bytes to write for each 'dirty' flag, which you can unroll, and they are easy to coalesce. Neat idea! I wonder if we can test and reset in one operation -- can we use SZC or something like that to force a value to zero and also determine if it was zero before? (Just to save an operation, not necessary I guess...)

I figured you could convert from a normal bitmap to characters during the copy from the buffer to VDP RAM. That should be faster than converting while drawing: the copy is better suited to unrolling than the line draw itself, and you only do the conversion once per byte, whereas during the line draw you may have to do it multiple times per byte.

Edited July 30, 2014 by JamesD

Tursi · July 30, 2014

I figured you could convert from a normal bitmap to characters during the copy from the buffer to the VDP RAM and it would be faster than doing it during the drawing because the copy is better suited to unrolling than the line draw itself and you only do it one time per byte where during the line draw you may have to do it multiple times per byte.

Nah, it's not necessary - in fact you're adding a layer of complexity by trying to convert a linear frame buffer to the character-ish layout of the VDP. In a flat 2D frame buffer, when you draw a line you normally add a fixed value to X, then add a fixed value to Y, then get the address with Y*Stride+X/depth - for each pixel. The steps are usually fractional and then only the integer part used. (By all means, try it, but, I think the complexity of unrolling it won't make up for the simplification in the line function).

I did similar in my line draw function here (and I admit this was something I'd never tried before, but I'm pleased it worked). http://atariage.com/forums/topic/210660-fbforthti-forth-with-file-based-block-io/?p=3017529

Once calculating the slope of the line, I determined which axis would always increment (or decrement) by one for each pixel. The other axis I calculated the fractional slope. Then for each pixel I updated the fractional slope, and when it rolled over, I added one. The other axis I always added one.
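For readers who have not seen this style of line draw, here is a minimal sketch of the fractional-accumulator approach as described in that paragraph. It is not the code from the linked post: `line_points`, the 16-bit fraction and the half-way starting value are assumptions chosen for the example. The one division happens during setup, so the loop itself is only adds and compares.

```python
def line_points(x0, y0, x1, y1):
    """Return the pixels of a line using a 16-bit fractional accumulator."""
    dx, dy = x1 - x0, y1 - y0
    sx = 1 if dx >= 0 else -1
    sy = 1 if dy >= 0 else -1
    adx, ady = abs(dx), abs(dy)
    x_major = adx >= ady
    major, minor = (adx, ady) if x_major else (ady, adx)
    # Slope as a 0.16 fixed-point fraction; the only division, done once per line.
    frac_step = (minor << 16) // major if major else 0
    acc = 0x8000                      # start half-way so the line is centred
    x, y = x0, y0
    points = [(x, y)]
    for _ in range(major):
        acc += frac_step
        rolled = acc >= 0x10000       # the carry out of the 16-bit fraction
        acc &= 0xFFFF
        if x_major:
            x += sx                   # major axis always moves by one
            if rolled:
                y += sy               # minor axis moves only on rollover
        else:
            y += sy
            if rolled:
                x += sx
        points.append((x, y))
    return points

print(line_points(0, 0, 10, 3))
```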

I threw in a trick that the address calculation is in there as well - I calculate only the starting address. After that, incrementing on the 'X' address is just rotating a bitmask. When the bitmask wraps around, then I add 8 instead of 1, and that takes me over to the next cell. Incrementing on the 'Y' address is just adding 1 to the address - a mask test checks for every 8 rows, at which point adding 256 takes me to the next cell row. It's only slightly more work than adding 1 (and in most cases almost equivalent), but doesn't require any kind of space conversion.
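The address walk can be sketched like this, again as an illustration rather than the original routine. The exact constant for the cell-row hop depends on whether the test happens before or after the increment; here it is done after, so the extra hop is 256 - 8. `offset_of` is the full calculation, kept only to check the incremental version against.

```python
def step_right(addr, mask):
    """Move one pixel right: rotate the mask; on wrap, hop to the next cell."""
    mask >>= 1
    if mask == 0:
        mask = 0x80
        addr += 8
    return addr, mask

def step_down(addr):
    """Move one pixel down: next byte; after 8 rows, hop to the next cell row."""
    addr += 1
    if addr & 7 == 0:           # walked off the bottom of the cell
        addr += 256 - 8         # land on the top row of the cell below
    return addr

def offset_of(x, y):
    """Full address calculation for the character-ordered bitmap layout."""
    return ((y >> 3) << 8) + ((x >> 3) << 3) + (y & 7), 0x80 >> (x & 7)

# Walk across the top row and down one column, checking against the formula.
addr, mask = offset_of(0, 0)
for x in range(1, 256):
    addr, mask = step_right(addr, mask)
    assert (addr, mask) == offset_of(x, 0)

addr = offset_of(10, 0)[0]
for y in range(1, 192):
    addr = step_down(addr)
    assert addr == offset_of(10, y)[0]
print("incremental stepping matches the full calculation")
```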

I'll try this CPU RAM buffer idea sometime, 'cause now I'm curious how well it can actually perform. A second benefit is that you could potentially defer the copy until the vertical blank, and only do it up to 60 times per second (of course, a full screen will take longer than one frame, but I'd be curious to see how well it does...). Even batching VDP accesses in CPU memory can help -- in my "680Rock" visualizer I update the character screen at 60fps by copying each VDP row into a buffer, processing the buffer, then writing it back. Even though that's three copy loops, it runs faster than doing one character at a time because I eliminate the address overhead per cell. (Doesn't hurt that the 32-byte buffer is in scratchpad.)

+OLD CS1 · July 30, 2014

I am not too certain how I feel about this thread devolving into a useful, coherent, and cogent discussion on programming.

JamesD · July 30, 2014

FWIW, I did implement some line drawing code for the MC-10 so I do have some understanding of the subject.

After giving it some thought, you have to mark the dirtychar flag when you draw a pixel so you have to do the conversion math during the line draw anyway.

For Battlezone you can reduce the visible area that requires updating due to the radar, score and compass. If you drop the mountains like the CoCo game does, you can also reduce the number of updates to just on screen objects like tanks, houses, UFOs and projectiles.

Gary from OPA · July 30, 2014

My fastest line draw code so far can only manage about 50 lines per second - sacrificing color you might get up to about 75 - let's generously say 80 for math reasons. I'd love to be beaten but I was doing the measurements to see if vector games were feasible.

What type of 'math' are you using in your line draw code?

Someplace in my archives I have code I wrote that can generate lines very fast; it uses only add and sub, no MPY or DIV.

If it might help, I will search later and release the vector line drawing demo I wrote, and you are more than welcome to test my code to see if it is faster or not.

Of course, line drawing on the Geneve is much better speed-wise, and it also helps if you have 32K on the 16-bit bus, but that does not help those with stock TI-99/4A systems.

Tursi · July 30, 2014

What type of 'math' are you using in your line draw code?

Someplace in my archives I have code I wrote that can generate lines very fast; it uses only add and sub, no MPY or DIV.

If it might help, I will search later and release the vector line drawing demo I wrote, and you are more than welcome to test my code to see if it is faster or not.

Go have a look at the link, or even view the video at that link if you don't want to look at the source. But it's only using shift and add in the loop.

You can test if yours is faster or not, I'm confident in my results.

sometimes99er · July 30, 2014

:party:

Edited July 30, 2014 by sometimes99er

Asmusr · July 30, 2014

This is just my first demo for the TI tweaked a little to demonstrate some of the points that have been discussed. Don't expect anything amazing this time. :_(

The disk contains two object files:

LINES1 is clearing a 6K CPU RAM buffer, drawing 68 lines into it, and copying the result to VDP RAM. The frame rate is about 2 FPS.
LINES2 is limited to a 2K buffer corresponding to the top 1/3 of the screen, and there are only 40 lines. The frame rate is about 8 FPS (a plus is rotating a full turn in 64 frames).

If you could pre-calculate all the vertices in 3D (as I have done in 2D) and only had to do the scaling on the fly, I guess this would allow you to make a tiny 3D game on the TI, but I doubt it would be very enjoyable to play.

Lines.zip

Gary from OPA · July 30, 2014

You can get a bit more speed by copying the 6K or 2K buffer in one shot with 6,144 MOVB statements in a row. It's really silly to use up 12K of opcodes to move 6K of data, but amazingly enough it helps a lot compared with looping. I see you already went partway by doing 8 MOVBs in a row, giving you only 1/8 the number of loops. If only we had enough RAM: doing that leaves you with only 14K for your code.

Tursi · July 31, 2014

If you could pre-calculate all the vertices in 3D (as I have done in 2D) and only had to do the scaling on the fly, I guess this would allow you to make a tiny 3D game on the TI, but I doubt it would be very enjoyable to play.

I still love those demos - the background checkerboard makes it look rather unique to anything on the TI.

I spent some time last night on the dirty flag idea and merging it into my line demo - it's not working 100%, but early results are not good. My line demo with all color code removed took about 24 seconds to complete. Drawing to RAM alone was all the way down to about 9 seconds, but, adding the copying of the dirty characters to the bitmap display brought it up to over 40 seconds. I still need to debug a few last things with the dirty flags and double-check my copy loop for speed.

One thing I did play with, though it's incredibly wasteful and doesn't save enough time to be worth it, was I found a way to test and reset in a single instruction. ABS seems to be the only instruction that can reset a bit and at the same time set the CPU flags to tell you whether it used to be set (that being the msb). If you set your initial value to >0001, and use >8000 as your dirty flag, then it works - you can call ABS on the value, and it will reset the >8000 bit while setting the CPU flags based on the source. You can then use the A> flag to see whether it was originally set or not. (The value itself toggles between >0001 and >7FFF, but it still works).

One other downside was the loss of XOR mode -- XOR can only modify a register, it can't modify a memory location. You could still XOR with a read/XOR/write, but part of the point was to eliminate that sequence.

Anyway, maybe tonight I will finish it off and see.

You can get a bit more speed by copying the 6K or 2K buffer in one shot with 6,144 MOVB statements in a row. It's really silly to use up 12K of opcodes to move 6K of data, but amazingly enough it helps a lot compared with looping.

Not as much as you might think. I benchmarked unrolled loops in scratchpad a few years ago, and after about 16 MOVBs the benefit becomes negligible. The dec/jmp sequence only takes about 26 cycles (IIRC?), and MOVB is 14 IIRC (sorry, don't have my datasheet handy!). But if you look at the cost:

1 MOVB - 14 cycles work to 26 cycles overhead - 65% overhead (or 26 cycles per byte looping)

2 MOVBs- 28 cycles to 26 cycles - 48% overhead (13 cycles per byte looping)

4 MOVBs- 56 cycles to 26 cycles - 31% overhead (6.5 cycles per byte looping (<1 instruction now))

8 MOVBs- 112 cycles to 26 cycles - 18% overhead (3.25 cycles per byte looping)

16 MOVBs- 224 cycles to 26 cycles - 10% overhead (1.625 cycles per byte looping)

If you plot a curve, it just starts to level off.. you still see benefit going further of course, but you'd reach a point where you wouldn't see the total runtime be any longer with a loop versus fully unrolled long before 6144 movb statements. (I did the actual testing on hardware moving blocks of memory around, rather than the theory above - it'll be buried in the old 99er.net forum somewhere!).
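The theoretical figures are easy to extend past 16 to see the curve flatten. This sketch just reuses the two cycle counts quoted from memory above (26 for the loop overhead, 14 per MOVB), so it inherits any error in them; `unroll_cost` is an illustrative name.

```python
LOOP_CYCLES = 26   # dec/jump overhead per pass, as quoted (from memory) above
MOVB_CYCLES = 14   # per MOVB, likewise

def unroll_cost(n):
    """Return (overhead fraction, loop cycles per byte) for n MOVBs per pass."""
    work = n * MOVB_CYCLES
    return LOOP_CYCLES / (work + LOOP_CYCLES), LOOP_CYCLES / n

for n in (1, 2, 4, 8, 16, 32, 64):
    frac, per_byte = unroll_cost(n)
    print(f"{n:3d} MOVBs: {frac:5.1%} overhead, {per_byte:.3f} loop cycles per byte")
```

Each doubling halves the loop cost per byte, so going from 16 to 32 MOVBs saves well under one cycle per byte, against 14 cycles for the move itself.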

I settled on 16 as optimal, but 8 as the best tradeoff in scratchpad, for my own stuff.

Tursi · July 31, 2014

I finished my testing here, the results were interesting, but, I still think too slow.

First, I modified my old line draw demo (the one that sweeps lines from the center around the frame) to work with a CPU buffer and a per-character (yeah, I'm accepting that terminology now) dirty buffer. If I tried to copy the buffer every vblank (which was about every line), then it was actually a little slower than the direct-to-VDP approach. But, if I was willing to do every other frame, it was slightly faster. These were my results:

fastlinedraw color: 28.0 s, 48 lps
fastlinedraw mono: 21.6 s, 62 lps

bufline (x1-60fps): 26.7 s, 50 lps
bufline (x2-30fps): 19.2 s, 70 lps
bufline (x3-20fps): 16.7 s, 80 lps
bufline (x6-10fps): 14.3 s, 94 lps

But, even 94 lines per second at 10 frames per second isn't going to be much of a game - 4.5 lines per frame (if you also erase). Although... either approach is probably fast enough for a game like QIX which only needs a couple of lines per frame.

buflinedraw.zip

Anyway, I dove into RasmusM's demo to treat it as a 'real world' test case. I replaced his line draw function with my dirty buffer version, and ran it head-to-head. The net result was "my" version was ever-so-slightly ~~faster~~ SLOWER than the original, despite the heatmap showing far less access to VDP and the frame buffer. (See the video below) Since this is more realistic than my test code (which bunched all the lines up, resulting in essentially writing 'many' lines for the cost of 'one'), it seems fair.

My conclusion - the concept is certainly viable, especially if you have a lot of work to do on the buffer. Odds are it's equivalent (and a lot easier) to just dump the whole buffer at the VDP rather than trying to be selective, unless you have access patterns that tend to bunch the data together or update very little. Of course, it's always possible to do better!

I think I'm done with this. I also like Rasmus's line draw code a lot more than my own.

(And had to come back to edit this cause it was bugging me how hard it was to see which one was faster... a minute in overdrive showed the original code was!)

Edited July 31, 2014 by Tursi

Omega-TI · August 11, 2014

Whatever is done on the TI, I'm sure it'll be much better than this TRS-80 Model 1 black and white version.

Interestingly enough, even though it was a minimalist game, the play was enjoyable.

https://www.youtube.com/watch?v=t2wV7-u6HHE

am1933 · August 11, 2014

It's the polyphonic sound that I love on this version. Can't be too critical - at least the old warhorse has a version.

Tempest · August 11, 2014

I'm amazed at what people were able to do on the TRS-80. I really need to get mine working!

Omega-TI · August 11, 2014

I'm amazed at what people were able to do on the TRS-80. I really need to get mine working!

I agree, in the past couple of months there have been at least two games on that machine I've noticed that have never made it over to the TI.

1) Berzerk

2) Battlezone

Actually, if one typed "TRS-80 GAMES" into YouTube, they could find quite a few concepts that could be ported over to the TI, and improved on in the process.

Tempest · August 11, 2014

Zaxxon had an official TRS-80 port if you can believe it (and looks really good). I'm not sure if Arcturus counts as a Zaxxon port since it's kind of different.

Omega-TI · August 11, 2014

Zaxxon had an official TRS-80 port if you can believe it (and looks really good). I'm not sure if Arcturus counts as a Zaxxon port since it's kind of different.

Well I'll be, I missed that back in the day!

http://www.youtube.com/watch?v=2QtX5RPb8Tc
