
Any 3D game with flat shading on A800?


VladR


My guess is that it was enough?

 

Limits like this made it so much more productive. At 256x256 you need a good artist to make a likable game. At a smaller resolution you can draw ten pixels and it looks like a small human ;)

I don't know. Drawing anything good-looking in 10x10 pixels is much harder than in, say, 20x20, as there you have 4x as many pixels, so even programmer's art can look somewhat OK at 20x20. And I've tried many times, but then again, I'm a coder, not an artist.

 

Looks a lot inspired by the Gameboy specs.

I thought the same thing exactly.

I still like it as an alternative to Flash/HTML5, though. Just don't want to get distracted by it, as my Atari coding would suffer :)


These are the options that I can see at this point:

 

Option 1: Use a huge table where, for each X (by X, I mean the deltaY of the line), we store X-1 values (the fixed-point step values)

- This, however, takes up 12.6 KB (range <0, 159>), which is just too much for something like this.

 

Option 2: Use a full-precision table for smaller lines, and progressively lower precision the longer the line gets

- 100% coverage: the range <0,47> takes up 1,128 Bytes, and that's a line that's half the screen height

- 50% coverage: the range <48,95> takes up 1,740 Bytes (instead of 3,480), at the cost of 1 bitshift, and very minor error (around 0.7%)

- 25% coverage: the range <96,159> takes up 2,056 Bytes (instead of 8,224), at the cost of 2 bitshifts, and still small error (around 2.3%)

 

Option 3: Store just 1 fixed-point value, and multiply it by the remainder

- This takes only 160 Bytes, as we store just the 1/X value for each X

- I played it out in Excel and the error starts out very small, but reaches 9% around line length 32, so I didn't continue with the table, as ~10% feels like too much too soon

- this also means doing the multiplication via repeated addition, which is a problem for lines with a big discrepancy between dx and dy

- thus the greatest downside of this is unpredictability, where it might suddenly take way more cycles than regular Bresenham

- I don't know exactly where that performance threshold is, but it's clear you don't need very long lines for this to happen, and I don't like unpredictable algorithms

 

Option 4: Split the line into 2 halves

- this halves the memory cost, but adds code complexity and cycles to handle the transition

- While I haven't implemented it yet, I've seen it somewhere, and it just looked too ugly (around the midpoint)

- I think I'd rather use my current step codepath, than do this

- this seems like too much work for too little visual benefit

I played a bit with the numbers in Excel and came up with what I think is the best compromise solution, as I keep the precision exactly the same but reduce the table size by 50%, at a very small performance cost.

- I will keep only the first half of the data for each X (e.g. for X=16, I'll have just the range <0,7> of fixed-point values)

- Initially, I came up with the following formula to get the fixed-point value for the second half:

Lut [idx] = Lut [X_Half] + Lut [idx - X_Half - 1]

- However, since it's a fixed point value spread uniformly across range <0,255>, the middle value is always 128

- thus the above equation drops the second lookup and replaces it with a constant addition (which is, like, 4 cycles: CLC, ADC #128):

Lut [idx] = 128 + Lut [idx - X_Half - 1]

- thus the performance cost is a simple check (only once per edge), and about ~20 cycles if the requested value is > X_Half

- and I still keep full 100% precision

- for 128x96, this will take only 4,170 Bytes, which I will gladly spare, considering it keeps full precision.

- it's also possible to reduce it by a further 50% (to a mere 2 KB) at the cost of 2 more conditions, but that's not needed right now


Generally, I intend to have 2 versions of scanline traversal:

1. Nice (currently Bresenham, but I'd like to replace it with fixed point) - for cutscenes

 

 

Sorry, although I browsed over the whole thread as the postings popped in, I didn't read it completely again for this post, and maybe I forgot/overlooked something.

 

Where in Bresenham do you need floating point?



No worries. I don't need floats there. Bresenham is integer-only.

 

However, I'll be using Bresenham as a reference rasterizer for high-quality scenarios (e.g. cut-scenes, inventory, or ship selection), where you don't need a super high framerate (as you do in, say, a gameplay/combat scenario).

 

My fixed-point experiment, however, showed that I can get a high-quality line at a much lower performance cost in the inner loop - which is offset by a higher initialization cost.

 

It is my hope that the overall cycle cost will be lower for the fixed-point version, and I'll be able to replace Bresenham completely with the fixed-point solution (and even be faster) - but until it's completely implemented, it's hard to say whether the higher initialization cost will outweigh the much smaller number of inner-loop iterations. But that's why I have auto-benchmarking built in.

 

And even if it's not faster in the end, I learnt something, so this experience is still valuable :)

 

My current gut feeling is that the following combination will be the fastest for the high-quality scenario:

1. Steep Lines - Bresenham

2. Non-Steep lines - Fixed-point (because the inner loop only needs to run (dy-2) times, which is in huge contrast with Bresenham)

 

I'll post the benchmark numbers, once I get there (next few days, hopefully).


Hi!

 


Note that if you use a DDA to rasterize, you can pre-calculate the coefficients and simply project them, always keeping dy==1 (or min(dx,dy)==1). But in my (limited) experiments, I have found that on the A800 the faster algorithm still seems to be Bresenham.


Note that if you use a DDA to rasterize, you can pre-calculate the coefficients and simply project them, always keeping dy==1 (or min(dx,dy)==1). But in my (limited) experiments, I have found that on the A800 the faster algorithm still seems to be Bresenham.

Could you please elaborate on that projection a bit more? I don't think I follow...

 

Also, I'm in no hurry. As long as the final approach is as fast as possible on the 6502, I'm willing to go through several more iterations, as it's also building up my 6502 skills (which I still find very lacking).

 

For example, an hour ago, when I was doing the indexing into the big LUT, I found out that on the 6502 it's actually counterproductive to store a 16-bit value array in the classic interleaved lo/hi format: it's faster to separate it into two arrays (one for the Lo bytes, one for the Hi bytes), since you have to process the two bytes separately anyway, and that way you can use the faster LDA ptr,Y indexing (to be reimplemented later, as this basically touches all other stages of the scanliner, spread across dozens of pages of code).

 

Only god knows how many more WTFs like that are still out there for me to learn. Man, I got spoiled on the Jaguar's RISC and its 32 registers :)


Hi!

 

 

Note that if you use a DDA to rasterize, you can pre-calculate the coefficients and simply project them, always keeping dy==1 (or min(dx,dy)==1). But in my (limited) experiments, I have found that on the A800 the faster algorithm still seems to be Bresenham.

Sounds like when using the EOR filler? Like I'm doing?

 

dy==1 while x += dx/dy

 

 

local xstep=deltax/deltay
xx1=xx1+prestep*xstep

local sy1=flr(yy1)
local sy2=flr(yy2)
if sy1<miny then
  miny=sy1
end
if sy2>maxy then
  maxy=sy2
end

if buffid==0 then
  for y=sy1,sy2-1 do
    local sx=xx1
    redge[y]=sx
    xx1=xx1+xstep
  end
else
  for y=sy1,sy2-1 do
    local sx=xx1
    ledge[y]=sx
    xx1=xx1+xstep
  end
end

end

Edited by Heaven/TQA

Looks a lot inspired by the Gameboy specs.

Yeah, probably. But you have to go back to the Fairchild Channel F for only 128x sprite resolution, and the Game Boy has far fewer sprites.

Maybe it helps avoid copyright issues with sprite ripping from other systems.

Edited by Sheddy

Yeah, probably. But you have to go back to the Fairchild Channel F for only 128x sprite resolution, and the Game Boy has far fewer sprites.

Maybe it helps avoid copyright issues with sprite ripping from other systems.

Of course they wanted to have some technical advancements?

On the other hand, it points to what that means on the A8. As you might know best :)

That square-pixel mode allows "something going on" on the screen.

People sometimes see colorful hires graphics and assume the A8 could do games like Turrican... well, in a small window of 100x100 pixels it might work, but the fullscreen thing is much more impressive, IF the colors were solid...


Hi!

 

Could you please elaborate that projection a bit more ? I don't think I follow...

I thought that you follow the line pair (vertical segment) by using an inner loop like:

 

  for(x=x0; x<x1; x++)
  {
   y0 += y0_step;
   y1 += y1_step;
   for(y=y0; y<y1; y++)
    plot(x,y);
  }
This can be made fast, as you advance in X one pixel at a time, so you can simply rotate a mask (for the pixel color) each iteration, incrementing the pointer when the mask rolls over.

 

As you need to pre-compute "y0_step" and "y1_step" before calling the code, my idea is that you can pre-compute the steps in object space and then simply project them, storing the steps instead of the end points of the segments.

 

But this won't be a gain, because on the 6502, projecting the pre-computed steps is as expensive as computing the steps afterwards (one division, y0_step = (y1-y0)/(x1-x0)); to do the projection you need 4 multiplications instead.

 

Modern GPUs do this kind of optimization, projecting many coefficients and then doing a hit-test for each pixel to determine whether it is inside or outside the triangle being rasterized.


I see now, thanks. But I am not doing actual line drawing, rather scanline drawing, which for the case of a steep line corresponds to the line algorithm (but definitely not for non-steep ones).

 

So, basically, I need to compute and store the xpos for the start and end of each scanline, which correspond to the left edge's line and the right edge's line.

 

At the moment, I got 3 very different line algorithms implemented:

1. Bresenham

2. Fixed Point

3. Multi Store (sorry, can't really come up with a better short name)

 

The first two have all combinations of steep/non-steep and leftward/rightward implemented. The last one currently has only the steep codepath.

 

Also, I am not discounting division. Hell, both the second and third algorithms use it. I am going to worry about division later, once I have final benchmark numbers. There are tables and other approaches.

 

Most importantly, however, it's a problem that just might turn out to be nonexistent for certain applications, as I found out today. By accident, sure - but if I hadn't kept an open mind, I would never have found that shortcut. I am still in the middle of experimenting with it, but according to my calculations in Excel, I actually might be able to use a realtime 3D mesh for the player instead of a precalced bitmap, at the cost of half a frame...


Division is not that big of a problem imho... true, for your 1 pixel per line you will need an 8.8 division (only 0.8 for a normal line), but you can look at such a division as a 16-pixel-long Bresenham line, not even that. That puts things into perspective.


Division is not that big of a problem imho

Yes, I was really surprised to see that the division loop rarely runs more than 4-5 times in the dataset I was using, and plenty of times just twice. That doesn't even warrant a table lookup. The inner loop of the division is - what - 10 cycles? It's INX, SEC, SBC Divisor, BCS DivLoop.

Table lookups are not a fast thing on the A800. I was actually very unpleasantly surprised, when I implemented one for the fixed-point version, by how many cycles it took.

 

Of course, a corner case might run the inner loop 80-100 times (e.g. an edge 2 pixels tall and 160 pixels wide), but there's another advantage of using Visual Studio and C++, besides debugging & benchmarking [one that I'd been thinking of while coding on the Jaguar]: statistical analysis.

I will collect information on all polygons: edges, steepness, length, pixel count, and how often they appear during runtime.

 

This will provide a basis for choosing the best algorithm for the most common scenario. I already understand that I need different methods - one for a tunnel, another for a generic triangle, yet another for other scenarios - to get maximum performance.

 

but you can look at such a division as a 16-pixel-long Bresenham line, not even that. That puts things into perspective.

Yeah, I noticed that if you could guarantee lengths under 32px, significant performance gains can be had. Not a generic solution for all types of 3D, not at all - but generic enough for a particular engine/game style.

Which, from a player's perspective, is all that matters anyway...

 

It's surprising that there isn't a ton of information on this already for the A800 - I was expecting this landscape to have been fully mapped and benchmarked at least a decade ago. Looks like everybody moved on to modern commercial platforms...

But, it's all good anyway, as at least I'm getting my R&D fix :)


God, I LOVE Visual Studio's Edit&Continue :)

 

Over the weekend, when I was refactoring my 8-way Bresenham codepath, I only tested the output rendering on the first frame (didn't let it run its full course).

Of course, the second frame, now that I ran it today, is partially broken - the right edge is somehow partially shorter (but only on the second frame).

 

Figuring out what the hell is going on, in pure assembler, would easily take an hour at least (and very probably half a day).

 

But I put in the following C++ debugging commands, and it took just 2 minutes to find out where the problem was (the right-edge data) - see the attachment:

dbgArr ("_edgeL: ", _edgeL, RAM [_dy]);
dbgArr ("_edgeR: ", _edgeR, RAM [_dy]);

This puts the memories of debugging my own assembler editor/compiler [which I wrote in 1990 on a real Atari in hexadecimal machine code, initially starting from within Atmas II, though I very quickly ran out of RAM - hence the hex coding] into a whooooole new perspective :)

post-19882-0-53155900-1507559016_thumb.gif


Print statement is the only debug feature one ever needs ;)

 

ps. I like your choice of music for coding :)

I only discovered AnjunaDeep's full albums, like, 2 weeks ago. I'd heard plenty of those tunes on other playlists, but that was on the PS4 (which was on 16 hrs a day just for Spotify :) ), where you don't have the same album choices and options in the app. Now that I've fixed the issues with the libraries and can run it on the PC, I instantly searched for it and was absolutely floored.

 

Now, I've heard some other DJ remixes where the transitions are also smooth, but 08 and 06 have virtually nonexistent transitions between the tunes. It's a new kind of experience for me. I swear that music makes my coding - at the very least - 25-33% more productive. I usually only notice the end of the album, as it's suddenly quiet (after 2 and a half hrs) :)

 

In terms of legal, healthy, and long-term-safe brain stimulants, this is probably it :)

 

 

If you can think of some similar albums, I'm all ears for recommendations :)

