Any 3D game with flatshading on A800 ?

VladR · September 8, 2017

Sure, but that still won't save you from degeneracy post-transform due to snapping to the rasterization grid:

bowtie.png

With a triangle, you're guaranteed to still have a triangle. With a convex quad, you can get... this.
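For what it's worth, the bowtie case can be caught after snapping by checking that all four corners still turn the same way. This is only a minimal sketch of that check: `Pt`, `turn` and `quad_is_convex` are made-up names, and the four multiplications per quad would need a cheaper formulation on a real 6502.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical screen-space vertex, already snapped to the pixel grid. */
typedef struct { int16_t x, y; } Pt;

/* z component of the cross product (b - a) x (c - b):
   positive and negative values are opposite turn directions. */
static int32_t turn(Pt a, Pt b, Pt c)
{
    return (int32_t)(b.x - a.x) * (c.y - b.y)
         - (int32_t)(b.y - a.y) * (c.x - b.x);
}

/* Returns 1 when the snapped quad is still a simple convex quad,
   0 when the corners turn in mixed directions (bowtie or concave)
   or when some corners became collinear. */
int quad_is_convex(const Pt q[4])
{
    assert(q != 0);
    int pos = 0, neg = 0;
    for (int i = 0; i < 4; i++) {
        int32_t t = turn(q[i], q[(i + 1) & 3], q[(i + 2) & 3]);
        if (t > 0) pos++;
        else if (t < 0) neg++;
    }
    return (pos == 4) || (neg == 4);
}
```

A quad that fails the check could then be split into two triangles and sent down the slower path, so the common case only pays for the test itself.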

Yes, that one is a common case, but luckily cheap to handle. There are one or two much uglier cases when the quad is clipped, though (and they are extremely expensive to process), and I still haven't figured out how to catch them without slowing down all the other quads that aren't affected (as you gotta run those conditions for each poly, and that quickly adds up).

For a tunnel, I might be able to avoid those expensive checks, as the camera is sort of fixed, but for a generic rasterizer, probably not...

VladR · September 8, 2017

Interesting Thread.

Just to come up with another strange idea to speed-up the fill-rate of tunnel segments:

Draw to screen memory only in the edge areas; if you need a wide solid run, abuse PMG graphics (without DMA) and write the "pattern" to GRAFPMx. This way you can set a run 32 colour clocks wide (in quad-width PM mode) with a single "STA GRAFPMx", and you get the same benefit when clearing it.

Not that this principle is simple to implement, but for some engine types it could be beneficial...

Thanks. It's an interesting idea, indeed.

Unfortunately, it does not save much. Let's recap my current numbers:

- 6 quads (12 triangles), half for ceiling other half for floor

- 48,927 cycles: Bresenham line computation + fill costs

- of those 48,927, only 6,450 are spent on filling the scanlines; the other 42,477 go towards computing the edges, filling the left/right edge byte boundary (via ORA), and handling all the special cases and corner conditions

Also, those 6,450 cycles could be further brought down to half (~3,300), if I was willing to spend more RAM on the unrolled filling code (which I will soon, anyway - it's just not a priority at this particular moment)

We would need to somehow merge 4 PMGs without creating gaps (between PMG and scanline fill). That's going to be very costly, as it will have to happen for each and every scanline. And we have 96 scanlines right now, which really only gives us 6,450 / 96 = 67 cycles per scanline to adjust PMG's values for current scanline.

I seriously doubt it's remotely possible to do in 67 cycles. That's what - 15-20 instructions ?

Now, if one were to design the whole tunnel around the PMGs, and we could drop the scanline fill (and the whole related expensive scanline traversal), that would be a very fast renderer, indeed

VladR · September 8, 2017

Small update:

- I've implemented lowest-quality line algo (just a simple integer step)

- the total cycle count dropped from 48,927 down to 40,844

If I could now do the 3D transform in under 8,000 cycles, this would fit within 2 frames (i.e. 30 fps)...
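One possible reading of "just a simple integer step" is sketched below: the per-scanline step is a truncated whole number of pixels, with no Bresenham error term, and the end point is pinned so the edge at least finishes in the right place. The function name and the 96-entry edge array are assumptions for illustration, not the actual code behind the numbers above.

```c
#include <assert.h>
#include <stdint.h>

/* Fills edge_x[y0..y1] with x positions along an edge, using a
   truncated whole-pixel step per scanline instead of Bresenham's
   error term. Assumes y0 < y1, both inside the 96-line screen,
   and x positions that fit in a byte. */
void edge_integer_step(uint8_t edge_x[96], int x0, int y0, int x1, int y1)
{
    assert(y0 >= 0 && y0 < y1 && y1 < 96);
    int step = (x1 - x0) / (y1 - y0);   /* truncates toward zero */
    int x = x0;
    for (int y = y0; y < y1; y++) {
        edge_x[y] = (uint8_t)x;
        x += step;
    }
    edge_x[y1] = (uint8_t)x1;           /* pin the end point */
}
```

The quality loss is easy to see: for an edge from (0,0) to (25,10) the step truncates to 2, so the line drifts to x = 18 on scanline 9 and then jumps to 25 on the last one.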

emkay · September 8, 2017

Interesting Thread.

Just to come up with another strange idea to speed-up the fill-rate of tunnel segments:

Draw to screen memory only in the edge areas; if you need a wide solid run, abuse PMG graphics (without DMA) and write the "pattern" to GRAFPMx. This way you can set a run 32 colour clocks wide (in quad-width PM mode) with a single "STA GRAFPMx", and you get the same benefit when clearing it.

That could be interesting in a 2-color mode. In "Gr.7", it costs the same to have a color or the background in the calculations. And with the "edge areas" approach I mentioned above, since full-sync fps seems to be possible, it could be done with almost "one" line of Gr.7 repeated for the whole screen height...

VladR · September 8, 2017

That could be interesting in a 2-color mode. In "Gr.7", it costs the same to have a color or the background in the calculations. And with the "edge areas" approach I mentioned above, since full-sync fps seems to be possible, it could be done with almost "one" line of Gr.7 repeated for the whole screen height...

How's one line repeated via Antic going to provide a perspective ? It's basically copy&paste. What am I missing ?

emkay · September 8, 2017

How's one line repeated via Antic going to provide a perspective ? It's basically copy&paste. What am I missing ?

Manipulating a Gr. 7 line allows over 410 cycles of manipulation. While the 2nd scanline is shown, the changes in RAM can be made without interfering with the display.

Fills always cost a byte to move/change/restore. The border of a color change is the most work, as it needs the correct bit calculation. The question is how many changes per "Gr.7" line can be done to keep it in one frame.

Irgendwer · September 8, 2017

How's one line repeated via Antic going to provide a perspective ? It's basically copy&paste.

Which brings up another interesting question. While inspecting your Jag-Wipeout-Vid I asked myself if you could also enormously speed up the calculations by just preparing all possible scan-lines and their individual fillings in memory and "just" constructing a frame by putting the individual parts together with LMS-Antic instructions for each mode line? You wouldn't even need left and right shifts as these can be performed by HSCROLs...

Thanks to the mainly large areas and symmetric layout this could be feasible.
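To make the LMS idea a bit more concrete, here is a rough sketch of building such a display list on the host side. The instruction byte values (mode D, the LMS and HSCROL bits, JVB, 8 blank lines) are the standard ANTIC encodings; `build_display_list`, `line_addr` and the addresses are hypothetical. Note that HSCROL is a single register, so per-line shift values would still have to be written from a DLI or kernel, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DL_BLANK8 = 0x70,   /* 8 blank scanlines */
    DL_MODE_D = 0x0D,   /* ANTIC mode D, the "Gr.7" bitmap mode */
    DL_LMS    = 0x40,   /* reload the memory scan address */
    DL_HSCROL = 0x10,   /* enable horizontal fine scrolling */
    DL_JVB    = 0x41    /* jump and wait for vertical blank */
};

/* Writes a display list for `lines` mode lines into `dl`.
   line_addr[i] is the (hypothetical) address of the pre-rendered
   scanline chosen for mode line i; dl_addr is where the list
   itself will live. Returns the number of bytes written. */
size_t build_display_list(uint8_t *dl, const uint16_t *line_addr,
                          int lines, uint16_t dl_addr)
{
    assert(dl != 0 && line_addr != 0 && lines > 0);
    size_t n = 0;
    for (int i = 0; i < 3; i++) dl[n++] = DL_BLANK8;  /* top border */
    for (int i = 0; i < lines; i++) {
        dl[n++] = DL_MODE_D | DL_LMS | DL_HSCROL;
        dl[n++] = (uint8_t)(line_addr[i] & 0xFF);     /* low byte  */
        dl[n++] = (uint8_t)(line_addr[i] >> 8);       /* high byte */
    }
    dl[n++] = DL_JVB;
    dl[n++] = (uint8_t)(dl_addr & 0xFF);
    dl[n++] = (uint8_t)(dl_addr >> 8);
    return n;
}
```

For 96 mode lines the list comes to 294 bytes, and picking a different pre-rendered scanline for a line means rewriting just two address bytes.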

VladR · September 8, 2017

Manipulating a Gr. 7 line allows over 410 cycles of manipulation. While the 2nd scanline is shown, the changes in RAM can be made without interfering with the display.

Fills always cost a byte to move/change/restore. The border of a color change is the most work, as it needs the correct bit calculation. The question is how many changes per "Gr.7" line can be done to keep it in one frame.

OK, I'm getting a bit closer to understanding what you're saying (I think), but let me rephrase:

- by "keep it in one frame", you mean full 60 fps ? That's really hard. We're talking just maybe 2-3 smaller quads (and I still don't know how expensive transform stage will be on 6502, let alone clipping). Can't imagine what kind of game/scenario could be done with that - but perhaps some designers could conjure a working game environment like that...

- on second thought, if we're talking just about a few smaller quads, then they're in the middle of the screen and don't have to be clipped against screen edges (so that stage of the pipeline is gone = free)

- if we limit the object only to about 25-33% of screen height, this will directly drive the scanline-processing numbers down (quite significantly, as the per-scanline overhead (both line computation and traversal) is about ~200 cycles) - so yes - we could find a screen-coverage threshold where we fit under 24,000 cycles, hence it would run at 60 fps

- at this moment, the only thing I can think of (that would be useable in a game scenario) is something like rotation (but with pre-transformed vertices in advance) of a weapon or spaceship (like in a hangar or inventory screen) - I mean - what can you do with ~8-10 triangles (4-5 quads) ?

- I believe it would be better to triple the polycount (30 triangles already start to resemble some simple objects) and run at 1/3 of the framerate, i.e. 20 fps (still more than enough for something like that)

- 410 cycles - how did you come up with this number ? In our resolution we have about 24,000 cycles (or so - don't recall exact number at this moment) and 96 scanlines, hence we can spend 24,000 / 96 = ~250 cycles per scanline (on average)

emkay · September 8, 2017

- 410 cycles - how did you come up with this number ? In our resolution we have about 24,000 cycles (or so - don't recall exact number at this moment) and 96 scanlines, hence we can spend 24,000 / 96 = ~250 cycles per scanline (on average)

228 Cycles per scanline

A Gr. 7 line is 2 scanlines.

Only the 1st is using DMA for the graphics. So the second line allows almost "full" access to the RAM by the CPU.

456 cycles minus 40 DMA Cycles... plus "what you can get"

emkay · September 8, 2017

Could be interesting to see if that works... somehow...

Doing the needed 3D calculations and preparations in the VBI, and always drawing the graphics into the "repeating line" set up by repeated LMS instructions.

In most cases, only some changes at the edges would be needed. The gain comes from not using a wireframe grid, but always solid fills. So most of the time, the "filling is already there". It may be very fast when drawing only vertical borders, but also get very slow when drawing horizontal borders. With some limitations on the movement, those could be reduced, I guess. So diagonal lines will be the biggest challenge.

Edited September 8, 2017 by emkay

VladR · September 8, 2017

Which brings up another interesting question. While inspecting your Jag-Wipeout-Vid I asked myself if you could also enormously speed up the calculations by just preparing all possible scan-lines and their individual fillings in memory and "just" constructing a frame by putting the individual parts together with LMS-Antic instructions for each mode line? You wouldn't even need left and right shifts as these can be performed by HSCROLs...

Thanks to the mainly large areas and symmetric layout this could be feasible.

Holy Shit. This is actually a phenomenal idea. I can't believe I didn't come up with it :- )))))

For a tunnel the number of permutations might not be so horrendous. If we drop the rotation, and allow only strafing (e.g. instead of a scanline 160 pixels long, we'd have some buffer on both ends - say a 240-pixel-long scanline), we'll still get some camera control.

Assuming each 3D segment is identical to the next one, we really only need to pre-store a few (well, dozens of KB) versions. Still less than storing the whole screen (e.g. as an animation frame) for all frames. So this could work (but would be quite hard!)

However, I just got another idea while thinking of this one. There's not that many combinations of 2 colors within same byte, as there's just 4 pixels. What if we "drew" the edge between two polygons directly - that would require us to know the lengths of each polygon's subscanline - it's same thing as merging two tris into a quad, except this time I'd merge two quads.

Given how expensive the edge is, who's to say that another preprocessing stage (that would give the lengths of each polygon's segment within the current scanline) wouldn't bring a speed-up ? Maybe not, but we won't know till we try...
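The "not that many combinations" point can be sketched as follows, assuming the usual Gr.7 packing of four 2-bit pixels per byte with the leftmost pixel in the two highest bits. The helper names are made up; the idea is only that the shared byte between two polygons depends on the two colours and on where inside the byte the edge falls.

```c
#include <assert.h>
#include <stdint.h>

/* Replicates a 2-bit colour index into all four pixels of a byte. */
static uint8_t solid(uint8_t colour)
{
    return (uint8_t)((colour & 3) * 0x55);
}

/* Mask covering the leftmost k pixels of a byte (k = 0..4);
   the leftmost pixel sits in the two highest bits. */
static const uint8_t left_mask[5] = { 0x00, 0xC0, 0xF0, 0xFC, 0xFF };

/* Byte whose first k pixels use colour a and the remaining 4 - k
   pixels use colour b. */
uint8_t edge_byte(uint8_t a, uint8_t b, int k)
{
    assert(k >= 0 && k <= 4);
    return (uint8_t)((solid(a) & left_mask[k]) | (solid(b) & ~left_mask[k]));
}
```

Tabulated for all 4 x 4 colour pairs and 5 split positions this is only 80 bytes, so the shared edge byte could be a single lookup and store instead of a read, mask and ORA.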

VladR · September 8, 2017

228 Cycles per scanline

A Gr. 7 line is 2 scanlines.

Only the 1st is using DMA for the graphics. So the second line allows almost "full" access to the RAM by the CPU.

456 cycles minus 40 DMA Cycles... plus "what you can get"

Wait, you're talking about chasing the ray ? I don't think that's even possible with 3d graphics, given how irregular the CPU load is per each scanline.

It's hard already just with the 24,000 cycle budget (per whole screen). Maybe if someone was to write a full kernel (but that'd be just insane )

Could be interesting to see if that works... somehow...

Doing the needed 3D calculations and preparations in the VBI, and always drawing the graphics into the "repeating line" set up by repeated LMS instructions.

In most cases, only some changes at the edges would be needed. The gain comes from not using a wireframe grid, but always solid fills. So most of the time, the "filling is already there". It may be very fast when drawing only vertical borders, but also get very slow when drawing horizontal borders. With some limitations on the movement, those could be reduced, I guess. So diagonal lines will be the biggest challenge.

I actually did something like that in the past with small cubes on the Jag. It was a two-pass technique:

1. First you draw whole screen (all objects)

2. Then each frame, you only draw deltas - changes from one frame to another (which are like 10% of whole scene changes, so it's extremely fast)

In our case, I was already thinking of this, instead of doing full-screen redraw of all polygons, I would just draw the edges that changed and wouldn't touch the filled area (other than OR'ing in the new pixel). The end result would, in theory, look the same as if you redraw all polygons from scratch.

Basically, in second pass, we would only redraw all 4 edges of each polygon, and wouldn't touch/process the inner area (which is unfortunately cheap as it's all unrolled code - but we'd still save 6,450 cycles). This would work as long as there are no new polygons in the scene, at which point we'd have to do a full redraw of all polygons again.

But, this technique has the greatest potential to run at 30-60 fps, with some framedrops (when we switch to full redraw).
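A minimal sketch of the delta idea for a single scanline, assuming the previous and new left/right extents of a filled span are both known. It uses one byte per pixel for clarity (a real Gr.7 row packs four pixels per byte), and `update_span` and `paint` are hypothetical names rather than anything from the actual rasterizer.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_W 160

/* Paints [from, to) with colour c and returns how many pixels were written. */
static int paint(uint8_t *row, int from, int to, uint8_t c)
{
    int n = 0;
    for (int x = from; x < to; x++) { row[x] = c; n++; }
    return n;
}

/* Moves a filled span from [old_l, old_r) to [new_l, new_r) by touching
   only the pixels that differ. Returns the number of pixels written. */
int update_span(uint8_t row[ROW_W], int old_l, int old_r,
                int new_l, int new_r, uint8_t fg, uint8_t bg)
{
    assert(0 <= old_l && old_l <= old_r && old_r <= ROW_W);
    assert(0 <= new_l && new_l <= new_r && new_r <= ROW_W);
    int n = 0;
    if (new_r <= old_l || old_r <= new_l) {   /* spans do not overlap */
        n += paint(row, old_l, old_r, bg);
        n += paint(row, new_l, new_r, fg);
        return n;
    }
    n += paint(row, old_l, new_l, bg);   /* left edge moved right  */
    n += paint(row, new_l, old_l, fg);   /* left edge moved left   */
    n += paint(row, new_r, old_r, bg);   /* right edge moved left  */
    n += paint(row, old_r, new_r, fg);   /* right edge moved right */
    return n;
}
```

When a span moves from [40, 120) to [38, 123), only 5 pixels are written instead of 83, which is where the saving would come from as long as the edges move a little per frame.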

phaeron · September 9, 2017

228 Cycles per scanline

114 cycles per scanline, not 228. Also, 9 cycles lost to memory refresh.
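Taking the figures in this exchange at face value (114 cycles per scanline, 9 of them lost to refresh, and 40 bytes of bitmap DMA fetched once per two-scanline Gr.7 line), the budget can be redone as a quick sketch. It ignores display-list fetches, DLIs and any PMG DMA, so treat the result as an optimistic upper bound; the function name is made up.

```c
#include <assert.h>

enum {
    CYCLES_PER_SCANLINE  = 114,  /* machine cycles per scanline      */
    REFRESH_PER_SCANLINE = 9,    /* memory refresh, per the post     */
    SCANLINES_PER_LINE   = 2     /* a Gr.7 mode line is 2 scanlines  */
};

/* CPU cycles left during one Gr.7 mode line when `bitmap_bytes` of
   screen data are fetched on its first scanline only. */
int cpu_cycles_per_mode_line(int bitmap_bytes)
{
    assert(bitmap_bytes >= 0 && bitmap_bytes <= 48);
    int total  = SCANLINES_PER_LINE * CYCLES_PER_SCANLINE;
    int stolen = SCANLINES_PER_LINE * REFRESH_PER_SCANLINE + bitmap_bytes;
    return total - stolen;
}
```

With 40 bytes this gives 228 - 18 - 40 = 170 cycles per Gr.7 line, well short of the 410 quoted earlier.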

Heaven/TQA · September 9, 2017

Phaeron... by "scanline" I guess emkay refers to a visible ANTIC D mode line, so he assumes that all DMA data is read only once and comes out of the ANTIC cache on the next scanline.

emkay · September 9, 2017

Phaeron... by "scanline" I guess emkay refers to a visible ANTIC D mode line, so he assumes that all DMA data is read only once and comes out of the ANTIC cache on the next scanline.

Besides something getting mixed up, the goal wasn't to reach full fps...

I wonder how it may end up.

Sheddy · September 9, 2017

Holy Shit. This is actually a phenomenal idea.

It is a great idea, but you usually end up realising you can only make a non interactive movie once the quickly ballooning amount of permutations are factored in. Maybe it can work in a special case for your project though. All very interesting stuff!

Edited September 9, 2017 by Sheddy

emkay · September 9, 2017

It is a great idea, but you usually end up realising you can only make a non interactive movie once the quickly ballooning amount of permutations are factored in. Maybe it can work in a special case for your project though. All very interesting stuff!

On the Atari it was always shown that a lower resolution with more precise movement at full screen is the most impressive stuff one can get.

If 2x2 wasn't fast enough, there still is 4x4 in gr. 10....

VladR · September 10, 2017

On the Atari it was always shown that a lower resolution with more precise movement at full screen is the most impressive stuff one can get.

If 2x2 wasn't fast enough, there still is 4x4 in gr. 10....

Ugh, 4x4. That's what - 80x48 resolution ? I cannot imagine anything so lowres could possibly remotely resemble a 3D scene, when viewed up-close in fullscreen. Maybe a tunnel...

Plus, once we're at that resolution, we can have Antic modes with 16 shades (or 9 colors), yet 192 scanlines.

Unfortunately, the scanline approach still wastes the same number of cycles (regardless of resolution) on the scanline traversal and line computation (well, 50% of line computation - as we have half the number of scanlines). So, mostly the smallest and fastest part of the whole pipeline (the unrolled byte fill) would be halved. This would help somewhat, but not very drastically. It'd still be very easy to fall under 20 fps just by using more polygons (which would look really funny in that resolution).

Now, I'm trying to write the engine generic, like on Jaguar (where I can simply change the horizontal resolution from 256 up to 1536, just by changing one constant), so we may eventually see how it all looks in lower resolutions. But for now, I don't want to go under 160 (128) x 96.

It is a great idea, but you usually end up realising you can only make a non interactive movie once the quickly ballooning amount of permutations are factored in. Maybe it can work in a special case for your project though. All very interesting stuff!

Yes, the generic case would have, like, infinite amount of permutations.

But for a tunnel, the lower/higher 25% portion of screenspace would display the center polygon with 2 side polygons, which for a scanline means a small amount of variations (about 300, really).

The problem is the center portion that would have 4-6 polygons.

For now, I'll leave it for later as a very interesting and funny optimization exercise, and keep the focus on the generic rasterizer.

Heaven/TQA · September 10, 2017

Vladr

VladR · September 10, 2017

I have plenty of experience with 32-bit and 16-bit integer precision from both Jaguar and PC, but never really attempted the 8-bit one, as even on the Jaguar, the GPU is 32-bit so there was no reason to go down to 8 bits.

I spent this morning playing with the 8-bit transformation (both integer and fixed-point). As I generally hate fixed-point with a passion, I tend to spend a great deal of time converting everything to integers - basically my whole pipeline on the Jag is purely integer (no fixed point whatsoever) - thanks to the 32-bit architecture providing lots of unneeded bits of precision there.

Excel proved to be of great help here again, as I was able to create dozens of tables with various values of FrontPlaneDistance, XPOS range, ZPOS range and checking precision of integer-only results without having to implement lots of code.

Eventually, I worked my way to a combination of all parameters, where I could keep everything in 8 bits, yet retain enough precision for XPOS and YPOS. The problem is obviously integer values of ZPOS. When a vertex is close to the camera, just a simple change of zpos equal to 1 results in huge jumps in final screen position (up to 24 pixels, which is obviously unacceptable). So, we were back to a combination of integer and fixed-point, which I hate so much.

So, I figured I might skip a few initial 6502 brute-force prototypes, and just go straight with big tables, that will bypass all multiplications and divisions.

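// Perspective table for one depth: divLut[i] = i * FrontPlaneDist / zp
zp = zpos - zCam;    // depth relative to the camera, must be > 0
for (int i = 0; i < 128; i++) divLut [i] = (i * FrontPlaneDist) / zp;

// The table is indexed by the unsigned distance from the camera,
// so the sign is handled by choosing addition or subtraction.
if (xpos > xCam) xp = HalfScrXL + divLut [xpos - xCam];
else xp = HalfScrXL - divLut [xCam - xpos];

if (ypos > yCam) yp = HalfScrYL + divLut [ypos - yCam];
else yp = HalfScrYL - divLut [yCam - ypos];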

This does not handle rotations, only simple translations - so it's good only for world-space based transformations (where you can adjust camera's position, not angle).

The divLut must be computed for all values of Z that are needed, so for my case (64 layers), it takes 8 KB (a lot for 6502, but later can be adjusted if needed, as there are many ways how to use mirroring there, plus the precision falls off logarithmically anyway).

The current world-space around camera that has full precision covers 256x256x64 points. Should be good enough for some rooms and simple 3D objects later.

So, the final amount of operations needed per each 3D vertex are:

- subtract (zp)

- condition (xp)

- subtract

- look-up

- addition (or subtract)

- condition (yp)

- subtract

- look-up

- addition (or subtract)

This should be possible to write in ASM so it consumes only around 100-150 cycles per each vertex, so for our current tunnel, which has 6 polygons (14 vertices), the transformation stage should consume only around ~2,000 cycles (estimate). And we still have an option to adjust camera's position at run-time.
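Putting the pieces above together, the table-driven transform might look roughly like the sketch below: 64 depth layers of 128 bytes each, which is where the 8 KB comes from. The constant values, the clamp to 255 and the function names are assumptions for illustration, not the actual parameters found in Excel.

```c
#include <assert.h>
#include <stdint.h>

enum {
    Z_LAYERS         = 64,   /* depth layers, as in the post          */
    LUT_SIZE         = 128,  /* offsets 0..127 from the camera        */
    FRONT_PLANE_DIST = 32    /* hypothetical projection distance      */
};

/* 64 * 128 = 8192 bytes: one perspective table per depth layer. */
static uint8_t div_lut[Z_LAYERS][LUT_SIZE];

void build_div_lut(void)
{
    for (int z = 0; z < Z_LAYERS; z++) {
        int zp = z + 1;                       /* depth 1..64, never zero */
        for (int i = 0; i < LUT_SIZE; i++) {
            int v = (i * FRONT_PLANE_DIST) / zp;
            div_lut[z][i] = (uint8_t)(v > 255 ? 255 : v);  /* assumed clamp */
        }
    }
}

/* Projects one world coordinate at depth zp (1..64); `half` is the
   screen centre on that axis. The result can still fall outside the
   screen and needs clipping afterwards. */
int project(uint8_t pos, uint8_t cam, int zp, int half)
{
    assert(zp >= 1 && zp <= Z_LAYERS);
    int d = (int)pos - (int)cam;
    int ad = d < 0 ? -d : d;                  /* unsigned distance */
    assert(ad < LUT_SIZE);
    return d >= 0 ? half + div_lut[zp - 1][ad]
                  : half - div_lut[zp - 1][ad];
}
```

Per vertex that is one subtraction for the depth, then a compare, a subtract, a lookup and an add or subtract for each of x and y, matching the operation list above.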

VladR · September 10, 2017

@Heaven: I thought Numen ran in higher resolution, but clearly my memory has blank spots

Good to know this resolution is still a useable option and I don't have to worry about it ! Thanks!

Heaven/TQA · September 11, 2017

Yeah, the z-table (basically for perspective) is still trial and error for me every time I do something in 3D... next is value overflow... in my current prototype I even discovered for the first time the beauty of BVS and BVC for dealing with signed overflow...
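For readers who haven't used them: BVS and BVC branch on the 6502's V flag, which ADC sets when a signed 8-bit addition overflows. A small model of that rule, as a sketch only (the function name is made up, and the final cast back to `int8_t` relies on the usual two's-complement wraparound):

```c
#include <assert.h>
#include <stdint.h>

/* Adds two signed 8-bit values the way the 6502's ADC would (carry
   clear) and reports the V flag that BVS/BVC branch on. */
int8_t add8_with_overflow(int8_t a, int8_t b, int *overflow)
{
    assert(overflow != 0);
    uint8_t ua = (uint8_t)a, ub = (uint8_t)b;
    uint8_t r = (uint8_t)(ua + ub);
    /* V is set when both operands share a sign and the result does not. */
    *overflow = ((ua ^ r) & (ub ^ r) & 0x80) != 0;
    return (int8_t)r;
}
```

So 100 + 100 wraps to -56 with V set, while 50 + 20 leaves V clear.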

Heaven/TQA · September 11, 2017

Dealing with 32 bit yeah nice on 68k but fixed point is no big deal either with SWAP opcode? but it will be interesting how you will handle through pipe and the end result precision.

I went 8bit to 16bit to 16bit subpixel back to 8bit in my current project.
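One way to read "16bit subpixel" is 8.8 fixed point, where the high byte is the pixel and the low byte is the fraction. The sketch below walks an edge that way; it is a guess at the general technique, not Heaven's actual code, and it is restricted to left-to-right edges to keep the arithmetic unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Walks an edge with 8.8 fixed point: the high byte of x is the pixel,
   the low byte is the sub-pixel fraction. Assumes x0 <= x1 <= 255 and
   0 <= y0 < y1 < 96. */
void edge_fixed_8_8(uint8_t edge_x[96], int x0, int y0, int x1, int y1)
{
    assert(0 <= x0 && x0 <= x1 && x1 <= 255);
    assert(0 <= y0 && y0 < y1 && y1 < 96);
    uint16_t x = (uint16_t)(x0 << 8);
    uint16_t step = (uint16_t)(((x1 - x0) << 8) / (y1 - y0));
    for (int y = y0; y <= y1; y++) {
        edge_x[y] = (uint8_t)(x >> 8);   /* keep only the pixel part */
        x = (uint16_t)(x + step);        /* 16-bit add per scanline  */
    }
}
```

For an edge from (0,0) to (25,10) the step is 2.5 pixels, so scanline 9 lands on x = 22 and the last one on 25, at the price of a 16-bit addition per scanline instead of an 8-bit one.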

Heaven/TQA · September 11, 2017

Attached the interactive "numen" duke nukem engine...

game_custom.xex

game_orig.xex

VladR · September 11, 2017

Yeah, the z-table (basically for perspective) is still trial and error for me every time I do something in 3D... next is value overflow... in my current prototype I even discovered for the first time the beauty of BVS and BVC for dealing with signed overflow...

Well, I decided to sacrifice the sign bit, as it basically doubled the precision

Might hurt me later down the road, but am curious to see how far I'll make it without using the signed math

Besides, although initially I was using signed math on jag, later on I switched to unsigned, so the whole rasterizer is done using unsigned integers. But, that was 32-bit...

Dealing with 32 bit yeah nice on 68k but fixed point is no big deal either with SWAP opcode? but it will be interesting how you will handle through pipe and the end result precision.

"through pipe" ? If you mean the rotation around look vector (that happens later in the flythrough), that will be a separate codepath (accordingly slower, of course).

The new opcodes - I'm looking forward to them very much, but need to first get the vanilla 6502 version running

What's the cycle count of the new opcodes ? Some of those operations seem quite expensive...

I went 8bit to 16bit to 16bit subpixel back to 8bit in my current project.

I'll try to resist the subpixel as long as possible, but it's generally best to have as many codepaths as possible so you can choose at runtime whether you need the higher precision or higher framerate (for given situation).

It's not like we need to fit it all into an 8 KB cart or something

BTW, from your experience, roughly how much slower is it using the subpixel precision ?
