
Any 3D game with flatshading on A800 ?


VladR


I was going over my Jaguar GPU RISC ASM code for the flatshading (see http://www.youtube.com/watch?v=nTaLf9MUap4 ) and toyed with the idea of adapting it for A800, now that ABBUC allows for usage of HW extensions (most importantly more RAM).

 

Are there any A800 games that feature flatshaded 3D triangles ?

 

I'm asking as I realized these few things:

- At 160x192x4, each byte holds 4 pixels

- Thus a single write via STA fills 4 pixels

The following scanline fill loop takes 10 cycles per iteration (i.e. per 4 pixels):

ScanlineLoop:
  STA (FrameBufferPtr),Y  : 6 cycles
  DEY                     : 2 cycles
  BNE ScanlineLoop        : 2 cycles

Thus, filling even a large polygon, with 64+ pixels between the edges and, say, 32 scanlines, would take:

(64/4) * 10 (cycles) * 32 (scanlines) = 5,120 cycles for 64*32 = 2,048 pixels

 

Thus, during 1 NTSC frame / vbl (about 24,100 cycles after ANTIC steals its portion, if I'm not mistaken) it should be possible to fill roughly 24,000 / 5,120 = ~4.7 such blocks, i.e. ~4.7 x 2,048 = ~9,600 pixels (not counting the outer loop overhead of about 8 cycles per scanline)

That's, like, a third of full screen.
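The estimate above can be re-run as a small Python model. This is only a sketch: `fill_cycles` and `FRAME_BUDGET` are made-up names, the 24,000-cycle budget is the figure assumed in the post, and the `bne` parameter is there so the loop can also be costed with a 3-cycle taken branch instead of the 2 cycles quoted.

```python
# Cycle budget for the byte-wide scanline fill (a model of the estimate above).
# Assumed instruction timings for the loop body: STA (zp),Y = 6, DEY = 2,
# BNE = 2 as quoted in the post; pass bne=3 to model a taken branch.

def fill_cycles(width_px, scanlines, px_per_byte=4, sta=6, dey=2, bne=2):
    """Cycles to fill a width_px x scanlines block, ignoring outer-loop overhead."""
    bytes_per_line = width_px // px_per_byte
    return bytes_per_line * (sta + dey + bne) * scanlines

FRAME_BUDGET = 24_000  # cycles per NTSC frame after ANTIC DMA, as assumed in the post

block = fill_cycles(64, 32)                          # cycles for 64 x 32 = 2,048 pixels
pixels_per_frame = FRAME_BUDGET / block * (64 * 32)
print(block, round(pixels_per_frame))                # 5120 9600
print(fill_cycles(64, 32, bne=3))                    # 5632
```

With a 3-cycle branch the same block costs about 10% more, which still leaves the "roughly a third of the screen per frame" conclusion in the same ballpark.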

 

The scanlines traversal with edge drawing (gotta pad to byte boundary on both sides) is slightly different from just drawing the lines, but that's a minor additional overhead that shouldn't take more than another frame / vbl.

 

Even in the example above, 64x32, we would still have 24,100 - 5,120 = ~19,000 cycles for those 32 scanlines, which yields ~600 cycles per scanline's 2 edges, which is 300 per inner loop of one edge (about ~70 ops, which should be enough), so it looks like it's possible to rasterize about 2,048 pixels of a 3D polygon within 1 frame time (assuming it was already transformed and we got screen coordinates of its points). Of course, it's just one polygon, but quite fillrate-heavy.

 

I'm thinking of the spaceship application - e.g. flatshaded 3D mesh of the enemy spaceship - any real-time examples on Atari, that could serve as a working reference ?

 

This could also be interesting in GTIA modes 80x192x16, though 1 Byte there fills only 2 pixels (although they are twice as wide as in 160x192, so it costs the same amount of time to fill same screen area), but we would get the option of 16 shades - e.g. smooth shading without DLIs.

 

In 320x192x2, this loop would fill double the number of pixels, i.e. 8 pixels in 10 cycles. Still the same amount of screen real estate as in 160x192 or 80x192, but with very sharp edges.
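To make the "same screen area per byte" point concrete, here is a short Python sketch. The mode table is an assumption written out for illustration: pixel widths are given in color clocks, with a 160x192 pixel taken as one color clock wide.

```python
# Pixels per byte and screen width covered per byte for the three modes mentioned.
# Pixel widths are in color clocks; the table values are illustrative assumptions.
MODES = {
    "160x192x4": {"bits_per_px": 2, "px_width_cc": 1.0},
    "80x192x16": {"bits_per_px": 4, "px_width_cc": 2.0},
    "320x192x2": {"bits_per_px": 1, "px_width_cc": 0.5},
}

def per_byte(mode):
    """Return (pixels per byte, color clocks covered by one byte)."""
    px = 8 // mode["bits_per_px"]
    return px, px * mode["px_width_cc"]

for name, mode in MODES.items():
    px, width = per_byte(mode)
    print(f"{name}: {px} px/byte, {width} color clocks/byte")
```

All three come out at 4 color clocks per byte, so one STA covers the same strip of screen in each mode; only the number of pixels inside that strip changes.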


Don't know any games, but demos are the place to see the limits of what's possible with massively unrolled code and xor fill. EG Numen has a nice 3d engine part

Thanks for reminding me of Numen! I saw it in the past, but an earlier quick search didn't yield this particular demo.

Now I can examine it in more detail on YT.

This will be difficult to make; watch Project-M (Wolfenstein for Atari) on YouTube.

Project-M is a raycaster. That's very different from generic rotating 3D objects.

As for difficulty, I disagree. The A800 has emulators with integrated debuggers, where you can set breakpoints and check the contents of registers and memory, from what I'm reading on forums.

 

Trying to debug the ASM code on Jaguar, without actual debugger, or ability to print numbers on screen (there's just 4 KB of GPU code area, and flatshader is pushing it to its limit, so no space for printing numbers) - now THAT was bloody difficult - coding blindly while "enjoying" compiler bugs...

 

Rainbow Walker is more 3D than some may expect

Can you explain more ? Looks to me like it's just using simple LUT to get the curve XP offsets...


Can you explain more ? Looks to me like it's just using simple LUT to get the curve XP offsets...

Sometimes I'm curious about what people think a small 8 bit computer can do. The Game's presentation has depth, and the tiles were filled by gameplay "calculations" , not just some "Raster-Tricks" ...


This will be difficult to make; watch Project-M (Wolfenstein for Atari) on YouTube.

Project-M is wasting a lot of CPU cycles on raster changes to get the 256 color look.

Such a "tunnel" could be done in "Gr.7" mode, where the CPU could fill the whole screen two times per frame . Not to forget the better CPU usage . The Tunnel could be done in 4 color mode , and the moving objects were still free using PMg. Actually Yoomp! is a direct indicator for that.


Have you written anything in assembler on an 8-bit computer (maybe an Atari)?

Your calculations are wrong; changing a single pixel takes more cycles:

A = color (00,01,10,11), X = X position, Y = Y position

pha
clc
txa
adc YtabL,y
sta pom
lda YtabH,y
adc #0
sta pom+1

ldy #0
pla
tax
lda colorMASK,x
and (pom),y
ora colorTAB,x
sta (pom),y

Of course you can use some tricks to speed up procedure.

 

The best way is to write a simple program and test the speed of drawing polygons.
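For readers who want to see what a per-pixel read-modify-write plot has to do at 4 pixels per byte, here is a simplified Python model. It is not a transcription of the routine above: the table names, `BYTES_PER_LINE` and the framebuffer layout are assumptions, and the sketch explicitly splits X into a byte offset (x / 4) and a slot inside the byte (x mod 4).

```python
# Model of a 2-bits-per-pixel read-modify-write pixel plot (4 pixels per byte).
# BYTES_PER_LINE, the table names and the framebuffer layout are assumptions.
BYTES_PER_LINE = 40                       # standard-width 160-pixel line
HEIGHT = 192

# Per-line start offsets, playing the role of the YtabL/YtabH tables.
LINE_START = [y * BYTES_PER_LINE for y in range(HEIGHT)]
# AND-masks that clear one pixel slot; the leftmost pixel is in the top two bits.
CLEAR_MASK = [0b00111111, 0b11001111, 0b11110011, 0b11111100]

def plot(fb, x, y, color):
    """Set pixel (x, y) to a 2-bit color without touching its 3 neighbours."""
    addr = LINE_START[y] + (x >> 2)       # byte that holds the pixel
    slot = x & 3                          # position of the pixel inside the byte
    shifted = (color & 3) << (6 - 2 * slot)
    fb[addr] = (fb[addr] & CLEAR_MASK[slot]) | shifted

fb = bytearray(BYTES_PER_LINE * HEIGHT)
plot(fb, 5, 2, 0b11)
print(hex(fb[2 * BYTES_PER_LINE + 1]))    # 0x30
```

Every pixel costs an address calculation, a read, a mask, an OR and a write, which is why a whole-byte store is so much cheaper for the interior of a span.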


Sometimes I'm curious about what people think a small 8 bit computer can do. The Game's presentation has depth, and the tiles were filled by gameplay "calculations" , not just some "Raster-Tricks" ...

So, what exactly did you mean by "more 3D" then ?

 

From where I'm sitting, it'd be more 3D, if for example, camera would strafe left/right or up/down, at least upon start-up.

 

It still wouldn't have to be full 3D transformations, as another offset LUT (for fixed-point scaling of YP) could be used. But it would feel more 3D, despite not being 3D.


Have you written anything in assembler on an 8-bit computer (maybe an Atari)?

Your calculations are wrong; changing a single pixel takes more cycles:


Wait, are you actually implying doing the scanline fill via brute-force drawpixel ? I wouldn't qualify that even for a reference rasterizer version :)

 

There's a reason why I mentioned scanline traversal.

It's a process that largely resembles drawing the lines, but rather finds the leftmost xp for the left edge and the rightmost xp for the right edge. This is complicated by the fact that there are 4 combinations of steep/non-steep lines, and the algorithm must handle all 4 of them, including combinations with corner cases (vertical/horizontal).

 

Which means, that at that point, you have both addresses for left/right xp, and need to do 2 things:

1. Pad the leftmost and rightmost pixels (so that you are only left with groups of 4 pixels, i.e. bytes)

2. Do a scanline fill (see the code in the first post) on those groups of 4 pixels.
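The two steps can be sketched in Python for a single scanline. This is only an illustration under stated assumptions: `fill_span` is a hypothetical name, the line is a plain byte buffer at 2 bits per pixel with the leftmost pixel in the top bits, and the edge positions are assumed to be already known from the scanline traversal.

```python
# Sketch of a span fill for one scanline at 2 bits per pixel (4 pixels per byte).
# Function and variable names are hypothetical; color is a 2-bit value.

def fill_span(line, x_left, x_right, color):
    """Fill pixels x_left..x_right (inclusive) of one scanline buffer."""
    pattern = (color & 3) * 0b01010101          # color replicated into all 4 slots
    first, last = x_left >> 2, x_right >> 2     # bytes holding the two edges
    left_mask = 0xFF >> (2 * (x_left & 3))      # selects pixels from x_left rightward
    right_mask = (0xFF << (2 * (3 - (x_right & 3)))) & 0xFF  # pixels up to x_right

    if first == last:                           # both edges share one byte
        mask = left_mask & right_mask
        line[first] = (line[first] & ~mask & 0xFF) | (pattern & mask)
        return
    # 1. Pad the partial bytes at both edges with read-modify-write.
    line[first] = (line[first] & ~left_mask & 0xFF) | (pattern & left_mask)
    line[last] = (line[last] & ~right_mask & 0xFF) | (pattern & right_mask)
    # 2. Whole bytes in between need only a plain store each.
    for i in range(first + 1, last):
        line[i] = pattern

line = bytearray(8)
fill_span(line, 3, 17, 0b10)
print(line.hex())                               # 02aaaaaaa0000000
```

Only the two edge bytes need masking; the inner loop is the plain store that the STA / DEY / BNE loop performs.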

 

 

EDIT: If there's a way to make the scanline fill (once edges are padded) faster on Atari than those 3 ops, I'd love to hear it - but a loop needs a store (STA), a register (Y, holding the count of groups of 4) and a branch instruction (BNE). But I'm no ASM expert, so hard to say, really...

It's doing backwards fill, as that's faster than forward fill (and as a bonus, saves X register for something else). As we need both addresses anyway, we might as well use it somewhere else, eh ?

 

.

Hypothetically, one could save 1 cycle of STA by using zero page, but not much of a framebuffer fits there. I mean, if A800 had a Blitter, that could work in parallel, but without stealing cycles like Antic does, it might be worth it, but this is no VBXE...


Let's say you implement a game in graphic mode 7, with a window of 128 x 64 pixels (just throwing numbers).

Every line uses 32 bytes of memory (narrow mode playfield), and let's say your video memory starts at $8000.

Using unrolled code you could have 8 methods like this:

 

sta $8000,x

sta $8001,x

sta $8002,x

...

sta $801f,x

rts

 

(97 bytes x8 = 776 bytes for all methods)

 

This one covers the case for the first 8 lines, if you want to write in line 0 you set x = 0, for line 1 you set x = $20, for line 8 you call the next method with x = 0, and so on.

For filling one line then you basically need to set the accumulator with your "color", set x with the line offset, jump to the correct "sta", and write a "rts" at the exit point (and recover the original "sta" at the end).

Also this is for filling 4 pixels at a time, you still need to set the "borders" correctly, so probably this is better used for "lines" larger than a fixed value (if the setup is too costly).

But your throughput would be something like 1.25 machine cycle per pixel.

 

.. or you could remove the use of "x", write the methods for all lines (using 776x8 = 6208 bytes), and get 1 machine cycle per pixel..
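The sizes and throughput figures above can be checked with a few lines of Python. The timings used are assumptions stated here rather than taken from the post: STA abs,X at 5 cycles, STA abs at 4 cycles, both 3 bytes long, and RTS 1 byte. `line_offset` is a hypothetical helper showing how a line number maps to a routine and an X value.

```python
# Size and throughput of the unrolled fill routines described above.
# Timings assumed: STA abs,X = 5 cycles, STA abs = 4 cycles; both are 3 bytes.
BYTES_PER_LINE = 32          # narrow playfield, 128 pixels
PX_PER_BYTE = 4
STA_SIZE, RTS_SIZE = 3, 1

routine_size = BYTES_PER_LINE * STA_SIZE + RTS_SIZE   # one unrolled routine
indexed_total = routine_size * 8                      # 8 routines, X selects 1 of 8 lines
absolute_total = routine_size * 64                    # one routine per line, 64 lines

print(routine_size, indexed_total, absolute_total)    # 97 776 6208
print(5 / PX_PER_BYTE, 4 / PX_PER_BYTE)               # 1.25 1.0

def line_offset(line):
    """Routine index and X value for a line when each routine covers 8 lines."""
    return line // 8, (line % 8) * BYTES_PER_LINE

print(line_offset(1), line_offset(8))                 # (0, 32) (1, 0)
```

These per-pixel figures cover only the stores; the JSR/RTS, the patching of the exit point and the border handling come on top, which is why the technique pays off mainly for longer spans.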


Annoying commentary, but you're talking like this?

 

 

 

I don't see why not for us, but yes you'd need to do the maths on the graphics DMA and fill routine. At least with 3D type games the expectation is there that you probably won't get 50 FPS so a double-buffered fast slideshow is acceptable in many cases.


Write some test program, use Altirra and test your calculations.

 

Registers in the Atari are 8 bits, not 9 (16*32).

BNE takes 3 cycles if the branch is taken, 2 if it is not, and 4 if it jumps to another page.

I would swear my A800 table shows only number 2, but from experience with other assemblers, it should have occurred to me that there's no such thing as fixed-cost branching.

This just saved my hair pulling in future (when suddenly a benchmark takes much longer for an identical code and data, due to the page boundary) ! Thanks!!!
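The branch timing rule can be turned into a small Python model to see what it does to the fill loop. This is a sketch with made-up addresses; the page-crossing check compares the branch target with the address of the instruction following the branch, and the loop layout (2-byte STA (zp),Y, 1-byte DEY, then BNE) is assumed from the first post.

```python
# Branch timing model for the 6502: 2 cycles if not taken, 3 if taken,
# 4 if taken and the target lies in a different 256-byte page than the
# instruction that follows the branch. Addresses below are made-up examples.

def branch_cycles(branch_addr, target_addr, taken):
    if not taken:
        return 2
    next_instr = branch_addr + 2            # a relative branch is 2 bytes long
    crosses = (next_instr & 0xFF00) != (target_addr & 0xFF00)
    return 4 if crosses else 3

def fill_loop_cycles(count, loop_addr, sta=6, dey=2):
    """Total cycles for STA (zp),Y / DEY / BNE repeated `count` times."""
    bne_addr = loop_addr + 3                # STA (zp),Y is 2 bytes, DEY is 1
    total = 0
    for i in range(count):
        taken = i < count - 1               # the last BNE falls through
        total += sta + dey + branch_cycles(bne_addr, loop_addr, taken)
    return total

print(fill_loop_cycles(16, 0x6000))         # 175
print(fill_loop_cycles(16, 0x60FC))         # 190
```

The second call places the loop so that it straddles a page boundary, which is the "identical code, slower benchmark" situation mentioned above.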

 

Annoying commentary, but you're talking like this?

 

Thanks! I haven't seen this one !

 

That's some crazy stuff for 8-bit ! The Draw distance ! The precision of math ! Dithered texturing ! Proper clipping ! Wow...

 

I don't see why not for us, but yes you'd need to do the maths on the graphics DMA and fill routine.

I've been reading a few great books from AtariMania (namely DeRe Atari and Mapping the Atari are fricking awesome - if I had those 3 decades ago ...!), and my current understanding (it's readjusted every day) is that DMA occurs when Antic has to access RAM, and as there probably weren't separate read/write strobes on the chip, 6502 has to halt. STA WSYNC looks like absolute evil (halting 6502 for full scanline time - WTF ?!?), so no DLIs whatsoever during gameplay as I'll need every cycle I can get.

 

Perhaps you're talking about writing a separate kernel and bypassing Antic via DMACTL and writing your own handlers (e.g. to bypass Attract mode and such) ?

 

At least with 3D type games the expectation is there that you probably won't get 50 FPS so a double-buffered fast slideshow is acceptable in many cases.

Well, 50 fps is not in the picture at all :) I'm targeting 3-4 vblanks (e.g. 15-20 fps) on NTSC, but honestly till I have benchmarks for all codepaths, and written all alternative implementations (like I did on Jaguar), I won't know for sure.

From jag coding I have benchmarks on how much time is spent on various stages of the pipeline as the objects get close to the camera (e.g. spaceship in our case), and the fillrate was the absolute killer on jag.

Which is why I was very pleasantly surprised that you can do the scanline fill in just few cycles on A800 ! Because when drawing wireframe, you have paid enormous price for transformations, visibility and line drawing. So, why not try to fill the scanlines if it's so cheap ? At least that's the idea.

 

Also, this is not FPS game with terrain, buildings and stuff (like Driller). We're talking about drawing a single 3D object (the spaceship), which most of the time should be in the distance, covering just a small screen area.

So, hypothetically, for one enemy, in the distance, framerate could reach 30 fps, and as it gets closer, and the cost of scanline traversal shoots up, framerate goes down.

 

Having a generic 3D object rasterizer would however prove very beneficial for the game from code reuse standpoint, as I could create loading screens or in-game FMVs (or just plain static shots) with high-poly (for Atari) spaceships, as for those screens it wouldn't matter if the shot takes 60 or 120 vblanks to draw, as 3D mesh takes way less RAM than bitmap and it's easy to create a different screen just by adjusting camera parameters (position, angle, ...).

As a concrete example - if you watched Battlestar Galactica, they always had those amazing camera shots of the main ship (from different angles) and when Cylons launched attack, they showed their approach.

This could be used for a Kill Cam for the last enemy destroyed (so as not to get annoying and disruptive for player) too.

 

Scripting is easy, once the base code is running, but provides tremendous visual benefits, for a fraction of effort (of the rasterizer).


I made some "impressions" demo video , how Driller would work on the Atari...

 

 

It would be still slow, but more playable..

 

 

I'm also sure there was a game called "Virus" or similar, on C64, and even Amiga/ST that used filled polys. Not sure if solid or textured though. Simple textures as such don't add too much extra overhead.

Virus on C64 ? How would that end up ? 10 seconds per frame?


First of all - nice idea !

Imho go for it. Flat shaded polygons can be done fast enough on A8.

 

For inspiration take a look at these:

Arsantica demos:

 

Here (part of Arsantica 3) you can see something like your spaceship with rotation and moving across screen:

https://youtu.be/PM9K7jSjCu4?t=96

 

Subpix accuracy with large gtia mode pixels:

 

And yes, fastest way to fill polygons on A8 is one STA for four multicolor pixels. Combine it with 2x2 resolution (double scanline mode) and you get some nice fillrate speeds.

Feel free to ask Heaven about 3d on A8, he has done 3d polygon calc-draw routine so many times by now on 6502 that he probably knows to write it from scratch ;)

 

One more example from C64 is Space rogue (space flight part):

https://youtu.be/gEyx6evq360?t=170

 

As A8 has faster CPU and less DMA demanding gfx modes, you can safely assume to achieve faster speeds than Space Rogue does.


I'm also sure there was a game called "Virus" or similar, on C64, and even Amiga/ST that used filled polys. Not sure if solid or textured though. Simple textures as such don't add too much extra overhead.

I'm familiar with it. I even spent a day on Jag toying with the idea with my flatshader, but to get it to run on Jag at 60 fps would require major changes, so I dropped it. But that was a 26.6 MHz RISC GPU capable of executing 443,166 instructions per frame (in the worst case of interleaving where each op would take 3 cycles).

On A800 we have 5% (24k / 443k) instruction throughput (so we're down to 3.25 fps) and plenty ops require 4-6 cycles.

 

But, in a lower resolution - like 80x96 (to save a lot of bandwidth), and a halved terrain heightmap dimensions (to save a lot on transformations), and reduced distance (at which point it would look nothing like original :) ), it might be possible to get it to run on A800 around 10 vblanks - e.g. 5-6 fps.

 

But it would be a loooot of work, unrolling, etc.. A simple brute-force version would not run at more than 1 fps, for sure...

 

The ability of having simple textures (just alternate colors within the byte) for scanline fill is a nice side effect in Atari, for sure :)



1. Narrow Mode

- my current understanding from reading the books is that it reduces the resolution, in our case, to 128x192 (24k pixels: 6,144 Bytes) from 160x192 (30k pixels: 7,680 Bytes)

- thus the normal mode takes 25% more of memory and fillrate (though fillrate savings are smaller as there's 4 pixels/byte).

- if I read / understand it right, Antic steals 1 cycle per 1 byte read, so it follows that just for reading the framebuffer, it should steal 7,680 - 6,144 = 1,536 cycles less (but is it per frame or per second - I suspect per frame, as Antic has to read framebuffer each frame, in which case we're talking about 1,536*60 = 92,160 cycles per second). Any additional cycle savings ?

- having 32 bytes/line might make it faster to just do ASL 5 times, than access LUT with addresses to each line in Framebuffer (but that's just a guess at this point, before I implement it)

- it definitely will make for a better framerate (less pixels to write), so I'm all for it

- am I right this is not invoked via display list changes, but simply via bit 1 of DMACTL $D400 ?

 

2. STA $8000,X

- I'm using this system for drawing the player's bitmap (a single frame of which will be 32x16 (128 Bytes) )

- I think this addressing is useful only if the object is very close to camera and spans whole scanline, right ?

- Oh, now I think I understand how you mean it - at run-time, write an RTS during scanline traversal at each line's proper end address, and just do JSR - very clever :)

- we would also need to revert it afterwards to an STA again, so not sure right now where's the performance threshold till it still makes sense - gotta enter it all into a spreadsheet and count all cycles properly :)

- but it is a very interesting self-modifying unrolled code technique ;)
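The narrow-mode figures from point 1 can be reproduced with a short Python sketch. The one-cycle-per-byte DMA cost and 60 frames per second are the assumptions made in the post, not verified numbers; `framebuffer_bytes` and `line_address` are hypothetical helper names.

```python
# Framebuffer size and DMA arithmetic for narrow vs. normal playfield at
# 4 pixels per byte, following the figures in the post. The one-cycle-per-byte
# DMA cost and 60 frames per second are the post's assumptions.

def framebuffer_bytes(width_px, height, px_per_byte=4):
    return width_px * height // px_per_byte

narrow = framebuffer_bytes(128, 192)        # 6144
normal = framebuffer_bytes(160, 192)        # 7680
saved_per_frame = normal - narrow           # 1536 cycles at 1 cycle per byte
print(narrow, normal, saved_per_frame, saved_per_frame * 60)

def line_address(base, line):
    """32 bytes per line: the offset is line << 5, i.e. five left shifts."""
    return base + (line << 5)

print(hex(line_address(0x8000, 3)))         # 0x8060
```

On the 6502 itself the five shifts have to carry into a high byte once the offset exceeds 255, so whether shifting beats a lookup table is something only a cycle count of both versions can settle.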


You can set the start address of each scan line as you wish, with LMS instructions in the display list. So you can make each line start at $NN00, and your step in the vertical direction is just "INC adr+1".

Lots of cool stuff like that...

The 3-byte LMS does not halt 6502, correct ? Just eats 3 cycles per scanline, right ?

Interesting idea! I have initially dismissed it, while reading up on Display Lists, but for scanline fills it might save quite a few ops per scanline.
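As an illustration of the LMS idea, the Python sketch below generates the bytes of such a display list. Everything concrete in it is an assumption chosen for the example: mode $0D (the 4-color bitmap mode behind Graphics 7), a base address of $8000, a display list at $0600 and only 4 lines.

```python
# Builds a display list in which every mode line carries its own LMS address
# on a page boundary ($NN00), so stepping one line down only changes the high
# byte. Mode $0D, the base address and the list address are illustrative
# assumptions.
LMS = 0x40          # "load memory scan" bit: instruction is followed by lo, hi
MODE = 0x0D
BLANK_8 = 0x70      # 8 blank scanlines
JVB = 0x41          # jump and wait for vertical blank

def build_display_list(base, lines, dlist_addr):
    dl = [BLANK_8] * 3
    for line in range(lines):
        addr = base + line * 0x100          # each line starts at $NN00
        dl += [MODE | LMS, addr & 0xFF, addr >> 8]
    dl += [JVB, dlist_addr & 0xFF, dlist_addr >> 8]
    return bytes(dl)

dl = build_display_list(0x8000, 4, 0x0600)
print(dl.hex(" "))
```

The price is address space: with one page per line, a 64-line window spans 16 KB even though only 32 or 40 bytes of each page are displayed, so the gaps would have to hold other data.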

 


That sub-pixel rotating cube is pretty neat. Lot of inspiration to be found in Arsantica too !

 

Thanks for the Space Rogue tip. I had no idea a game of this magnitude is on 8 bits. That's Elite on steroids. Too much content to realistically recreate something like that in one's free time, though...


That's Elite on steroids. Too much content to realistically recreate something like that in one's free time, though...

That's why people should focus on projects that could be finished. Graphics 7 plus PMg with some multiplexing is the most efficient start.

 

On the other hand, using some "Graphics 1 or 2 " could open other interesting stuff.

Building a "frame buffer" , using the low spec char modes, offers additional fx ....

 

Some peeks and pokes there in Basic... as it uses basic, it's rather slow. Some animations at the beginning hide from the needed time to convert the ROM Charset to RAM, for doing manipulations on that.

It's also a miracle, why such modes never have been used for transitions in Demos. Using such low spec modes, could allow to play a small game while loading from disk.

 


The 3-byte LMS does not halt 6502, correct ? Just eats 3 cycles per scanline, right ?

 

Yes. Only couple cycles are taken. You can see cpu timing diagrams in this text file:

Antic_Timings.txt

 

"AA" is for lms address change so I guess it only fetches lo+hi byte of new address.

P.S. I don't remember whether there were some corrections to this file, but 99% of it is correct and you can safely use it.

 

Thanks for the Space Rogue tip. I had no idea a game of this magnitude is on 8 bits. That's Elite on steroids. Too much content to realistically recreate something like that in one's free time, though...

I didn't propose making such a complex game on A8 ;)

Maybe start with something simpler like version of Star Raiders with 3d ships ?

 

Source code is available and is full of nice info about 3d structures and math involved in making it all work on 6502:

https://github.com/lwiest/StarRaiders

 

Enjoy :)

 
