
Any 3D game with flatshading on the A800?


VladR


Haven't seen that one raycaster before.

 

But the A800 scene is already oversaturated with raycasters. There's really nothing radically new left to come up with in that arena.

 

 

Not so much flatshading, though, and especially not non-raycasted 3D-poly FPS engines. Let alone hi-res flatshading. That looks like mostly untouched territory...


Clamping polys with dirty buffers

 

YT vid

Well, but you didn't really allow the camera to get too close to the object :)

Thus, it didn't really need more than, say, 24-32 pixels on each edge :)

 

Unfortunately, you don't have that luxury in a slightly more generic engine, which has to handle the scenario where only 1 scanline of the object is visible, and up to 200 (or even 1,000 - depending on how big the object is) are invisible.

 

Given how expensive edge computation and scanline traversal are, and how cheap & easy it is to do scanline Y-clipping, there's no way rendering those 200 invisible scanlines to a dirty buffer would be cheaper than culling them.
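To make the comparison concrete, here's a rough C sketch of scanline Y-clipping - my own illustration, not code from any engine discussed here; SCREEN_H, edge_x and the 8.8 fixed-point step are all made up:

#include <stdint.h>

#define SCREEN_H 200
static int16_t edge_x[SCREEN_H];          /* per-scanline x of one polygon edge */

/* Walk one polygon edge, but only over the visible scanline range. On a 6502
   the skip-ahead multiply would itself be replaced (table or repeated add),
   but the point stands: invisible scanlines are never traversed or drawn. */
void walk_edge_clipped(int y0, int y1, int x_fix, int dx_fix)
{
    int ytop = y0, ybot = y1;
    if (ytop < 0) {                       /* edge starts above the screen */
        x_fix += dx_fix * (-ytop);        /* jump the interpolant forward in one go */
        ytop = 0;
    }
    if (ybot > SCREEN_H - 1)              /* edge ends below the screen */
        ybot = SCREEN_H - 1;
    for (int y = ytop; y <= ybot; y++) {  /* only visible scanlines are walked */
        edge_x[y] = (int16_t)(x_fix >> 8);
        x_fix += dx_fix;
    }
}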

 

But in your specific demo case, for those few dozen pixels, it's not worth the coding effort, I get that - it's simply better to spend that coding time on another effect. I doubt the performance difference would actually be visually noticeable in this case (unless we did cycle counting).



Just wanted to bring up the raycaster for people to see the difference from CTF and Wayout.


 

RoF seems to clip in 3D, but not perfectly.

What do you specifically mean by "not perfectly"? If you refer to the polygons in the distance just popping into view, that's unavoidable without incurring the heavy performance hit of always rendering them, and even then, it only pushes the cut-off one row further into the distance (which is not much with RoF's perspective anyway).

That's a problem that plagues even current games on the PS4 (while the PS4 can surely render objects to quite a distance, there are still issues with LOD popping and fog even these days).

 

I haven't noticed errors on the screen edges - RoF seems to clip them just fine (save for the occasional glitch that I attribute to deadlines or compromises).

 

It was my first 3D object moving and rotating around ;) so apologies :D and yes, 2D clipping of faces is a bitch.

 

In theory you would clip at the beginning of the rendering pipe, after the culling test? But in 3D?

I think you are maybe confusing 2 different concepts here?

1. Frustum culling (of 3D polygons) - this is just reducing the level's 3D dataset into a smaller dataset that is processed each frame (e.g. transform -> render (and clip)) - e.g. your level is 16 square kilometers, but at any time, in your view, you can see only 4 square kilometers.

2. 2D Clipping (of 2D polygons) - some of the polygons from stage 1 cross screen edges and have to be clipped against those screen edges.

 

Or maybe I interpret it wrong.
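To make the distinction concrete, here's a rough C sketch of the two stages - all names, the crude 90-degree cone test and the span clipper are my own illustration, not anyone's engine:

#include <stdlib.h>

#define SCREEN_W 160
#define SCREEN_H 200

typedef struct { int x, y, z, radius; } Object;   /* camera-space bounding sphere */

/* Stage 1 - frustum culling: reject whole objects that cannot be on screen,
   so the per-frame dataset shrinks before any transform/clip/draw work. */
int in_frustum(const Object *o)
{
    if (o->z + o->radius <= 0)        return 0;   /* entirely behind the camera */
    if (abs(o->x) - o->radius > o->z) return 0;   /* outside a crude 90-degree cone */
    if (abs(o->y) - o->radius > o->z) return 0;
    return 1;
}

/* Stage 2 - 2D clipping: a projected span that survived stage 1 may still
   cross the screen edges and gets cut against them before drawing. */
void draw_span_clipped(int y, int x0, int x1, void (*plot)(int, int))
{
    if (y < 0 || y >= SCREEN_H) return;
    if (x0 < 0)            x0 = 0;
    if (x1 > SCREEN_W - 1) x1 = SCREEN_W - 1;
    for (int x = x0; x <= x1; x++) plot(x, y);
}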



nah... look at the edges more closely... the polys/lines should be cut (2D clipping, not 2D clamping), which RoF somehow "does not" do? Or does in a "non-mathematical" way ;=)

 

3D clipping is when a poly facing the viewer gets "cut" in 3D space...

 

I am not talking about culling...

 

 

This is a test for a mesh on the Lynx... thanks to the Lynx hardware it clips primitives against the screen edges in hardware, but you still see glitches... and popping of far and near polys is natural, but that's because I had only "z>max and z<0" "clipping" - which is not what I would call "clipping". If you know the book Black Art of 3D, there is a section on clipping polys where it calculates the intersection with the left/right/top/bottom planes and cuts the polys.

post-528-0-14728900-1505828095.png
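For reference, the kind of per-edge polygon clipping that book describes, reduced to a single screen edge in a rough C sketch - the P2 type and the integer intersection are my own simplifications; the other three edges work the same way with the test and intersection swapped around:

typedef struct { int x, y; } P2;

/* Clip a 2D polygon against the left screen edge (keep x >= 0).
   Output may have up to n+1 vertices. */
int clip_left(const P2 *in, int n, P2 *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        P2 a = in[i], b = in[(i + 1) % n];
        int a_in = (a.x >= 0), b_in = (b.x >= 0);
        if (a_in)
            out[m++] = a;                 /* keep vertices that are inside */
        if (a_in != b_in) {               /* edge crosses x = 0: emit the intersection */
            out[m].x = 0;
            out[m].y = a.y + (int)((long)(b.y - a.y) * (0 - a.x) / (b.x - a.x));
            m++;
        }
    }
    return m;                             /* new vertex count */
}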


Well, there are about a dozen common combinations of polygon clipping, depending on how you structure your pipeline and culling. One can do it in world space, screen space, clip space or any of various in-between stages.

 

There's no sure way of knowing how RoF does it without examining the source code in detail. Once I have a triangle rasterizer, I may try coding a terrain on the 6502 myself (getting there, slowly).

 

The way RoF does it may very well be just a compromise to keep things faster (e.g. they cut off certain polys sooner than they should) or - and this is more probable, I think - it's a precision issue and the whole dataset was designed around that.

 

I certainly wouldn't compute the screen edge intersections the way the "Black Book" recommends. Not on a 1.79 MHz CPU, for sure :)

 

 

Remember, that book was written for the Pentium, which clocks at 60+ MHz, has a nearly free parallel division, and can use floats. None of those features are remotely available on our Atari. Hell, if I used those techniques on the Jaguar (and that was a 26.6 MHz RISC beast), I wouldn't get more than 5 fps out of my flatshader :)

 

If we want fast 3D, it's easy - we simply must be smarter than that book - which is 90% of the fun :D


RoF is a nice example... fuck maths, just do it some way ;)

Exactly :)

 

Also, there's a reason why RoF put the cockpit lines so close to the edge - if they weren't there, those glitches would be very prominent. But since the cockpit line breaks up the terrain, the eye is naturally drawn more to the bigger uninterrupted section of terrain in the middle of the screen and does not "see" the glitches on the edges so easily :)

 

So, it's a clever GUI design that covers the 3D pipeline glitches :D



 

yes... that's why I send many kudos to Lucasfilm and the team who coded RoF... fractal lines hide the "low grid mesh", the small window (192x48) hides that the rotation in the x-axis is not correct, artwork hides the glitches... the low frame rate hides inaccuracy in the general maths... flight dynamics increase immersion (they were tweaked on a mainframe from the ILM department) and run in the VBL, so they update 60 times per second... then perfect sound to get your attention... then flashing when shot (natural visual blending) etc...

 

RoF is really well done...


Small update:

- I've finally finished porting all the Bresenham-based codepaths to ASM, thus making the quad rasterizer generic (as far as line direction and steepness go)

- Previously, I only had a few working combinations (and only with steep lines); I've just added support for non-steep lines and made sure any combination of Leftwards/Rightwards edge direction works

- The run-time cost remains unchanged, as the check and jump to the non-steep paths was always there (the codepath [that was jumped to] just didn't do anything)

- This is the way I structured it:

1.1 Left Edge Leftwards Steep

1.2 Left Edge Rightwards Steep

1.3 Right Edge Leftwards Steep

1.4 Right Edge Rightwards Steep

 

1.5 Left Edge Leftwards Non-Steep

1.6 Left Edge Rightwards Non-Steep

1.7 Right Edge Leftwards Non-Steep

1.8 Right Edge Rightwards Non-Steep

 

- The primary reason for having 8 separate codepaths is that I want the "nice line" version (e.g. Bresenham) to still be reasonably fast, and a single generic version (or even two versions) would raise the per-pixel cost too much for my liking: instead of using hardcoded values, I'd have to load the direction value from RAM or indirectly index it (which is much slower), or use a condition (per pixel processed)

 

Out of curiosity - even though, for a tunnel, I'm going to use a different line algo - does anybody have benchmark numbers for Bresenham showing the performance difference of the simple classic modification where the line computation is halved (i.e. each computed pixel is mirrored from the other end of the line)? While I could relatively easily estimate it, I suspect this is something that has been done extensively by others.

Also, since we only have 3 registers on the 6502, it doesn't look like a lot of computation will really be saved: we need another index variable, then another position variable, and it all has to be loaded/stored - so I don't really think it's going to save much (but it just might add up to meaningful total savings across multiple polygons).
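For reference, the mirrored idea in a rough C sketch (shallow case only, plot() assumed; the two halves can disagree by one pixel at the seam for some slopes, and the middle pixel gets plotted twice when dx is even):

#include <stdlib.h>

/* Both-ends Bresenham: every error-term decision is applied to the pixel walked
   from the start AND to its mirror walked back from the end, so the loop runs
   for only about half the pixels. Assumes |dx| >= |dy| (shallow line). */
void line_both_ends(int x0, int y0, int x1, int y1, void (*plot)(int, int))
{
    int dx = abs(x1 - x0), dy = abs(y1 - y0);
    int sx = (x0 < x1) ? 1 : -1;
    int sy = (y0 < y1) ? 1 : -1;
    int err = dx / 2;
    for (int i = 0; i <= dx / 2; i++) {
        plot(x0, y0);                     /* pixel from the front...         */
        plot(x1, y1);                     /* ...and its mirror from the back */
        x0 += sx;  x1 -= sx;
        err -= dy;
        if (err < 0) { err += dx; y0 += sy; y1 -= sy; }
    }
}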

 

Now, if this were an Atari Jaguar, with 32 registers, that'd be a significant speedup, as all these variables would be in registers. But on the Jag, I don't need to use Bresenham - I just use fixed-point calcs anyway...

 

Any experience with Bresenham modifications on 6502?


 

YT vid

 

Self-modifying code helps a lot to compensate for the very small number of registers...

Thanks!

I suspect by self-modifying you mean you modify the instructions for direction (xp/yp)? How much unrolling (as in quadrants/octants) do you have there? Do you mirror the computed value? Do you keep the steep and non-steep codepaths separate?



 

The goal of this routine was a little bit different (code size and execution speed), so no mirroring, no unrolling, no separate code paths for the steep and non-steep cases, but detection of these cases on entry and modification of a central routine according to quadrant and steepness. Since there are no "big" branches while drawing the line, the code is also quick.

 

The inner loop looks like this:

; "el" is max(dx,dy) and serves as the loop counter; err/err+1 is a 16-bit error term
nextPoint:
        lda err
        sec
SMC    SubEs, { sbc #SMC_Value }        ; err -= es
        sta err
        bcs testErr
        dec err+1
testErr:                                ; if (err < 0)
        lda err+1
        bpl quickStep

SMC    AddEl, { lda #SMC_Value }        ; err += el
        clc
        adc err
        sta err
        bcc noHighInc
        inc err+1
noHighInc:
SMC ChangeSlowPosition, { inc SMC_ZpAdr }       ; All (inc/dec and x0/y0) is subject to change!

quickStep:
SMC ChangeQuickPosition, { inc SMC_ZpAdr }      ; All (inc/dec and x0/y0) is subject to change!

PlotPixel:      ## PIXEL IS PLOTTED HERE ##

        dec el
        bne nextPoint

Edited by Irgendwer

Check here: http://atariage.com/forums/topic/231690-known-fast-line-algorithms/

 

Drawing from both ends certainly helps a lot, while keeping the code quite flexible and clean. At the moment it's my favorite method.

In my vector demo I use what I call 'tree Bresenham' .. it's only for the shallow variants (0-45 degrees), and it works like this:

 

Bresenham basically decides if you go right only, or right and up. If you have 4 pixels per byte, this gives 16 variants per one byte's worth of width. You can look at it as a binary tree, 4 levels deep, with a Bresenham decision at every node and with 16 leaves, where the code outputs the bytes directly. For example, if the decisions in a leaf were right-right-right-right .. the code in that leaf just writes $ff and it's done. The thing is you don't need a register for the mask, it's all in the code itself. Very fast. But also very long, and you can't easily change colors, or the drawing operation (OR, XOR), by modifying the code.

For the steep variants I use the both-ends approach.
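A loop-based C sketch of what one leaf of that tree effectively computes, under an assumed screen layout (40 bytes per row, 4 pixels per byte, line moving upward as x advances) - the real ASM unrolls all 16 decision combinations into straight-line leaves instead of looping:

#include <stdint.h>

#define PITCH    40                  /* bytes per scanline, 4 pixels per byte */
#define SCREEN_H 200
static uint8_t screen[SCREEN_H * PITCH];

/* Take the next 4 Bresenham decisions of a shallow line at once, build the
   pixel masks for however many rows those 4 pixels touch, and write whole
   bytes. With no up-steps and color 3 the single mask is $FF - the "write
   $ff and done" leaf described above. */
void step_one_byte(int xbyte, int *y, int *err, int dx, int dy, uint8_t color)
{
    uint8_t mask[5] = { 0, 0, 0, 0, 0 }; /* current row plus up to 4 rows above */
    int up = 0;
    for (int i = 0; i < 4; i++) {
        mask[up] |= (uint8_t)(color << ((3 - i) * 2));   /* 2-bit pixel i of the byte */
        *err -= dy;
        if (*err < 0) { *err += dx; up++; }              /* Bresenham up-step */
    }
    for (int r = 0; r <= up; r++)
        if (mask[r])
            screen[(*y - r) * PITCH + xbyte] |= mask[r];
    *y -= up;
}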

 

In the thread above there is an even faster method mentioned .. it uses decimal deltas, and at every step it draws 4 pixels. It basically has 4 screen pointers, for 0, 25, 50 and 75% of the line, and every decision is used 4 times. But the lines are not so pretty, and they often have jaggies at the 25, 50 and 75% marks, as the ends don't align properly .. this algorithm is also not very exact; it would not hit the target point if it were used for the whole screen width.

But 160-pixel resolution is quite low, and I wanted red-magenta 3D, so I wanted the lines as correct as possible - that's why I didn't go this way.
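My reading of that 4-pointer method, as a rough C sketch (plot() assumed; the integer division of the start points is exactly where the misalignment at the 25/50/75% marks comes from):

/* One Bresenham decision sequence drives four cursors parked at the
   0/25/50/75% points of the line, so every decision plots 4 pixels.
   Shallow case, dx >= dy >= 0 assumed. */
void quarter_line(int x0, int y0, int x1, int y1, void (*plot)(int, int))
{
    int dx = x1 - x0, dy = y1 - y0;
    int qx[4], qy[4];
    for (int k = 0; k < 4; k++) {
        qx[k] = x0 + k * dx / 4;
        qy[k] = y0 + k * dy / 4;
    }
    int err = dx / 2;
    for (int i = 0; i < dx / 4; i++) {
        for (int k = 0; k < 4; k++) plot(qx[k]++, qy[k]);
        err -= dy;
        if (err < 0) {
            err += dx;
            for (int k = 0; k < 4; k++) qy[k]++;
        }
    }
    plot(x1, y1);                         /* crude endpoint fix-up */
}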

Edited by R0ger

Btw. there is still one idea I keep thinking about .. if you take the segments of the line, you are always alternating between only 2 segment lengths .. and the length varies by 1. For example, you alternate between 2-pixel segments and 3-pixel segments. Bresenham can easily be modified to decide between these 2 variants, instead of the simple right or right-and-up.

It could be very fast, especially if you could prepare specific variants of the routine for the most common slopes. Or maybe have a routine with unrolled code for, let's say, 20 pixels, and for shorter segments modify the addresses of the jumps.

 

It doesn't take byte alignment into account, and you would need a register for the mask, but then you can take that as an advantage .. it would work the same with any pixels-per-byte mode.
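A rough C sketch of the alternating-run idea (shallow case, dx >= dy >= 1, drawn left to right and downward; the proper algorithm also halves the first and last runs, which is skipped here, so the ends look lopsided):

/* Each row gets a run of `whole` or `whole+1` pixels, picked by a
   Bresenham-style error term - one decision per row instead of per pixel. */
void run_slice_line(int x0, int y0, int x1, int y1, void (*plot)(int, int))
{
    int dx = x1 - x0, dy = y1 - y0;       /* assumes dx >= dy >= 1 */
    int whole = dx / dy, frac = dx % dy;
    int err = dy / 2;
    int x = x0;
    for (int y = y0; y < y0 + dy; y++) {
        int len = whole;
        err += frac;
        if (err >= dy) { err -= dy; len++; }   /* this row gets the longer run */
        for (int i = 0; i < len; i++) plot(x++, y);
    }
    plot(x1, y1);                         /* last row reduced to the endpoint */
}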

Edited by R0ger


Interesting... can you share especially the both-ends version?



 

Here is my routine .. but clearly I remembered it wrong .. I use the tree even for the steep variants. The steep variant still does draw from both ends, but you will have to get through it all .. good luck .. I think it's the most cryptic code I've produced in my life.

 

Rough orientation in the code .. line.asm is normal Bresenham, nothing fancy there. You want line2.asm. Line3.asm is just the UP variant of Line2; you can ignore that. First there are some macros. DrawByte is the output macro for the shallow variants - it draws byte-aligned patterns. DrawPatternDown and DrawPatternUp are used in the steep variants .. the Bresenham tree is basically the same for the steep and shallow variants, but the final macro is what's different. DrawPatternDown and DrawPatternUp draw pixels the usual way, and it's in these macros where you can see drawing using 2 pointers and two masks, i.e. from both ends.

 

The line itself has the usual coord swapping and variant decision .. and then come the trees .. labels starting with L1 are the shallow version, L2 is the steep version. Labels like L1_100x are levels of the decision tree .. this example means the decisions were 1-0-0 and the 4th decision is not yet known. After the decision tree there is the drawing block .. for example L1_5 is for decisions 0-1-0-1 .. it's just the decimal representation .. please note that each drawing block is used twice .. the first decision only says whether you should move down before the whole block .. so the block for 0-0-0-0 overlaps with 1-0-0-0, and so on.

 

There is also some code to handle special cases - the ends in the shallow variant and the middle of the line in the steep variant .. but honestly I don't understand it at all at the moment. Still, it works :-D

LineTest.zip



The alternating-segment-length idea looks like the same approach I came up with when experimenting with the step version. The line is divided into 3 parts - start, end and the middle.

The middle part is where 90% of the computation is. I'm using a pair of values - from 1,1 up to 4,4. The loop processes up to 8 pixels (the 4,4 pair) in one go, so it's much faster than Bresenham: especially for steep lines, the xpos is identical for those 4 pixels, so you just do Sta xpLeft,X / Inx four times, then Sbc/Adc #1 and Sta/Inx four times. It doesn't really get much faster than that, I believe.

 

The non-steep version just does one subtract per Sta, so it's also reasonably fast, but never faster than steep lines, obviously, as steep lines are favored by the scanline filling approach.

 

Note that the two values can be the same. Even Bresenham can come up with equal values if the dx,dy warrant it.

 

This does look much uglier than Bresenham, though. I haven't tried it in motion yet, but hope the motion will somewhat hide the ugliness...


The goal of this routine was a little bit different (code size and execution speed), so no mirroring, no unrolling, no separate code paths for the steep and non-steep cases, but detection of these cases on entry and modification of a central routine according to quadrant and steepness. Since there are no "big" branches while drawing the line, the code is also quick.
I had not noticed it the first time, as I first read it while driving, but now I see you are using a 16-bit error value? I always thought that if I was willing to pay the price of 16-bit values, I might as well bypass Bresenham and just go for fixed-point calculations instead, as bit shifting is cheap on the 6502.

 

I guess there is just one way to find out - implement fixed-point...

 

Was there any specific reason why you used 16-bit? I understand your use case is quite different from mine, as I don't care about code size currently.


Btw. a kinda relevant and very good video on Elite, by big B. himself .. http://www.gdcvault.com/play/1014628/Classic-Game-Postmortem

Together with this thread, I again can't sleep, and I transform and draw lines in my head all the time. Braben mentions several interesting tricks:

 

1) Using logarithms for everything. I've tried, but IMHO it's only useful for 8-bit camera space. I use 16-bit, and that kinda throws it out of the window. I'm still thinking about some interpolation which would convert log and exp into one 8x8 multiplication, but so far I haven't been successful. Also, Braben doesn't mention the fast mul we use today; compared with that, the logs might not be that good. (Rough sketch below, after this list.)

 

2) Symmetry - now this is a no-brainer; it will work in most engines. Simply transform every vertex together with its mirror image. It's very cheap, and most of the mirrored vertices will be used in the model, as models tend to be symmetric. I think it could be optimized by having the symmetric vertices first in the vertex array, and then doing the symmetry only for the first N vertices. Or even have some vertices mirrored side to side, and some mirrored side to side AND up and down. Gonna implement it in my engine for sure. (Second sketch below, after this list.)

 

3) Recreating the transformation matrices - sin tables or normalization, both are somewhat costly. Elite doesn't do it every frame. It uses static matrix additions to do small predefined rotations, and the loss of precision is accepted. Only after several such operations is the matrix re-normalized. Might be useful in many cases. Maybe not for the camera matrix, but surely for object matrices.
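Sketch for 1): the log/antilog-table multiply in rough C. The table sizes here are chosen for clarity and are far too big for a 64K machine - a real 6502 version would use coarser 256-entry tables and accept more error:

#include <math.h>
#include <stdint.h>

#define LSCALE 256.0                 /* log2 values scaled by 256 */

static uint16_t ltab[256];           /* scaled log2(i) */
static uint16_t etab[1 << 12];       /* antilog, indexed by summed logs */

void init_logmul(void)
{
    for (int i = 1; i < 256; i++)
        ltab[i] = (uint16_t)(log2((double)i) * LSCALE + 0.5);
    for (int s = 0; s < (1 << 12); s++)
        etab[s] = (uint16_t)(pow(2.0, s / LSCALE) + 0.5);
}

uint16_t logmul(uint8_t a, uint8_t b)    /* approximate a*b via adds and lookups */
{
    if (a == 0 || b == 0) return 0;
    return etab[ltab[a] + ltab[b]];
}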
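Sketch for 2): the arithmetic behind the mirror trick in rough C (not Elite's actual code; matrix entries in an assumed 8.8 fixed point, vertices as plain integers):

#include <stdint.h>

typedef int16_t V3[3];

/* Rotate a vertex and its X-mirror together: the y and z products are shared,
   so the mirrored vertex costs one extra add/sub per row instead of a full
   3x3 transform. */
void xform_with_mirror(const int16_t m[3][3], const V3 in, V3 out, V3 mirror_out)
{
    for (int r = 0; r < 3; r++) {
        int32_t px  = (int32_t)m[r][0] * in[0];                             /* x part      */
        int32_t pyz = (int32_t)m[r][1] * in[1] + (int32_t)m[r][2] * in[2];  /* shared part */
        out[r]        = (int16_t)((pyz + px) >> 8);   /* vertex ( x, y, z) */
        mirror_out[r] = (int16_t)((pyz - px) >> 8);   /* vertex (-x, y, z) */
    }
}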

 

He also talks about the line routine a bit, but not in much detail .. still it seems a lot like the articles on codebase64, which I never understood .. but now I might actually be onto how it works. But I will implement it first. I think it's gonna be faster and shorter than my current tree method.

 

Also, I came up with a new division method for my engine. I use 16/16 division, but I only need 8 bits of result. Also, in the engine the result is always less than 1. At the moment I shift the 16-bit operands up until all the bits are used - in other words I'm scratching the leading zeros .. and then I do an 8/8 division the classical way.

 

I tried to make some kind of table to avoid the classical division .. and now I've finally been successful! Let's say we divide A/B. As I said, the result is always <1 .. so A<B. So when I'm scratching the leading zeros, B always ends up with a 1 in the top bit. Before the last 8-bit division, B is always in the range 128-256.

That's a significant simplification, as its inverse (in this case 256/x) can only go from 2 to 1. Still not very good for a table. But if I subtract 1, I get the range 1..0 .. or 255 to 0. Let's call this number Q. Now to get the division, I have to do A*(1+Q) .. which is the same as A+A*Q .. which is simple. It gets my division from an average of 251 cycles down to 130 cycles, and I think there is still a lot of room for improvement through low-level optimization (for example, at the moment I call the mul simply by jsr). Globally it lowers the time spent on division from 5% to under 3%, at the price of a 128-byte table (well, that and huge scroll tables, but I have them anyway). There are some bugs in the clipping I still have to solve, but so far it looks good!
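If I follow that correctly, it amounts to the reciprocal-table trick below - a rough C sketch of my understanding (the rounding and the clamp at B=128 are my own additions):

#include <stdint.h>

/* After normalisation B is 128..255, so 256/B lies between 1 and 2. Store
   Q = 256*(256/B - 1) in a 128-entry table and approximate the 8-bit result
   of A/B (with A < B) as A + high byte of A*Q, i.e. roughly floor(A*256/B). */
static uint8_t qtab[128];

void init_qtab(void)
{
    for (int b = 128; b < 256; b++) {
        int q = (65536 + b / 2) / b - 256;              /* rounded 256*(256/b) - 256 */
        qtab[b - 128] = (uint8_t)(q > 255 ? 255 : q);   /* b = 128 would need 256    */
    }
}

uint8_t div_approx(uint8_t a, uint8_t b)    /* assumes 128 <= b and a < b */
{
    return (uint8_t)(a + ((a * qtab[b - 128]) >> 8));
}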

Edited by R0ger
