Jump to content
IGNORED

Any 3D game with flatshading on A800 ?


VladR

Recommended Posts

I tried to make some kind of table to avoid the classical division .. and now I've been finally successful ! Let's say we divide A/B. As I said, result is always <1 .. so A<B. So when I'm scratching the leading zeros, B always ends up with 1 in the top bit. Before last 8 bit division, B is always in range 128-256.

That's significant simplification, as it's inversion (in this case 256/x) can go from only from 2 to 1. Still not very good for table. But if I subtract 1, I get range 1..0 .. or 255 to 0. Let's call this number Q. Now to get the division, I have to do A*(1+Q) .. which is same as A+A*Q .. which is simple. It gets my division from average of 251 cycles to 130 cycles, and I think there is still lot of room of improvement by low level optimization (for example at the moment I call the mul simply by jsr). On the global it lowers time consumption of division from 5% to under 3%, at the price of 128 bytes table (well that and huge scroll tables, but I have them anyway).

Could you share code for this division ?

 

Your explanation sounds logical but I can't wrap my head around it ;)

  • Like 1
Link to comment
Share on other sites

Was there any specific reason why you used 16-bit? I understand your use case is quite different from mine, as I don't care for code size currently.

 

I have coordinates up to 256 (this is not Antic 13), so I need at least 9 bits to cover the +- range:

	Draw_SetMode(Draw_Set);
	line.startPoint.nX = 0;
	line.startPoint.nY = 0;
	line.endPoint.nX = 255;
	line.endPoint.nY = 1;
	Draw_Line(&line);
	line.startPoint.nX = 255;
	line.startPoint.nY = 0;
	line.endPoint.nX = 254;
	line.endPoint.nY = 231;
	Draw_Line(&line);
	line.startPoint.nX = 255;
	line.startPoint.nY = 231;
	line.endPoint.nX = 0;
	line.endPoint.nY = 230;
	Draw_Line(&line);
	line.startPoint.nX = 0;
	line.startPoint.nY = 231;
	line.endPoint.nX = 1;
	line.endPoint.nY = 0;
	Draw_Line(&line);

16 bit error (correct):

post-7778-0-26441600-1506629369.png

 

8 bit error (no comment):

post-7778-0-01821100-1506629434.png

 

Ideas for improvements welcome!

  • Like 1
Link to comment
Share on other sites

Well typically you always draw from left to right, you just swap the coordinates when they come in the wrong order. And then you have 2 variants of the life for up and down .. so you never have one code handling 2 directions.

 

Yes, but I have only one variant (the central patched routine) and so no swapping. Maybe I should give the fixed point variant a try... On the other hand the code belongs to a frozen project and would be fast enough for the use case...

 

Edit: I'm unsure how swapping the coordinates would help to reduce the needed bit-depth for the error estimation?!?

Edited by Irgendwer
Link to comment
Share on other sites

I believe he means that the Bresenham error value needs more than 8 bits once the X coordinates approach value of 256.

 

At some older project, I was experimenting with different error formulas, which gives you option of keeping just 8 bits for the error value.

 

Also, he mentioned he needs to have just one central code path, so the optimizations we use are unavailable to him. On the other hand, he doesn't have to tweak and test 8+ code paths , which is something I would appreciate :)

  • Like 2
Link to comment
Share on other sites

I believe he means that the Bresenham error value needs more than 8 bits once the X coordinates approach value of 256.

 

Since the error value is signed, it's enough when 127 is crossed and that is the case in the sample. It's just this border case - in all others (shorter or more steep lines) a signed 7 bit error would be sufficient.

Link to comment
Share on other sites

Btw. kinda relevant and very good video on Elite, by big B. himself .. http://www.gdcvault.com/play/1014628/Classic-Game-Postmortem

Damn, how did I miss that!

But, I don't think I would have appreciated this talk the same before I embarked on the 8-bit flatshader journey :)

Most excellent timing of you posting this link ;)

 

 

Together with this thread I again can't sleep and I transform and draw lines in my head all the time.

Yeah, I keep thinking on those algorithms all the time myself too :)

It's funny that all those fastest graphics algorithms have converged in the PC/console arena long time ago, but Atari doesn't get that research love. While A800 sure gets much more dev exposure than, say, Atari Jaguar, it's still not fully explored!

 

I just returned from the ~5,000 km trip driving through pathetically slow U.S. interstate highways (daily avg speed across 12+hrs is unbelievably [when compared to driving in Europe] low 64 km/h), and thus had shitload of time when my brain had to do something (hell, anything, really) , as the brutally slow driving doesn't really take more than few % of your capacity anyway.

 

I did, thus, come up with few ideas, that I'll be experimenting with during weekend:

 

1. Scanline traversal is NOT line computation

- this is a no brainer, really

- why bother computing pixels, intermediate error values, and other bullshit, when we really only need 1 pixel per each scanline ?

- this gets us to idea 2 ->

 

2. Use the fact that Steep line computation matches scanline traversal (e.g. each new point computed is guaranteed to be on a new scanline)

- this is a prime example for trying out fixed-point calculations on 6502 (I'll go try 8.8)

- I did a lot of them on jaguar (where they were almost free & very simple & very precise (32-bit registers)), so I'm curious to see how they'll work out on 6502

- even if they don't turn out to be fast, the implementation still might spark some other ideas (and exchange on this forum), so I think it's worth it

- this gets us to idea 3 ->

 

3. Use a separate algorithm for Non-Steep lines

- why ? Because we need only 1 point of the edge per scanline

- why is is a problem ? Because a non-steep line that is 41 pixels long, across 4 scanlines, only needs 4 pixels computed, yet we compute remaining 37 that we don't need !

- 35/40 = 0.902 -> 90.2% of computations are thrown away

- 90% ? Hmm, if *that* does not qualify for triple-facepalm.gif, I don't know what does...

 

4. I swear I saw a short glimpse of the look-up-table that just might completely bypass line computation

- this is a variation of what I came up with for Klax demo on jaguar (which was in C)

- I'll play this one out in excel first, then C, then ASM

  • Like 2
Link to comment
Share on other sites

so I need at least 9 bits to cover the +- range

 

Actually I'm busy with other stuff, but this kept bothering me.

The "solution" is so simple: The carry flag is already the ninth bit. Lots of lines can be removed now:

                "el" is max(dx,dy)
nextPoint:
		lda err
		sec
SMC    SubEs, { sbc #SMC_Value }
		sta err
		bcs quickStep                  ; was: bcs testErr
; not needed:		  dec err+1
; not needed:   testErr:		       ; if (err<0)	
; not needed:             lda err+1
; not needed:    	  bpl quickStep

SMC    AddEl, { lda #SMC_Value }
; not needed:   clc
		adc err
		sta err
; not needed:	bcc noHighInc
; not needed:	inc err+1
noHighInc:
SMC ChangeSlowPosition, { inc SMC_ZpAdr }	; All (inc/dec and x0/y0) is subject of change!
	
quickStep:		
SMC ChangeQuickPosition, { inc SMC_ZpAdr }	; All (inc/dec and x0/y0) is subject of change!

PlotPixel:      ## PIXEL IS PLOTTED HERE ##

		dec el
		bne nextPoint

Thank you all for let this rolling again in my "think-bowl"!

  • Like 2
Link to comment
Share on other sites

I did a quick proof-of-concept with 8.8 fixed-point. The inner loop is generic, I'm currently only using manually-computed fractional step value (to be implemented later, probably via some 256 Byte table).

 

Example line:

dx = 8

dy = 5

step = 8/5 = 1.6 -> Step.Hi = INT (1.6) = 1; Step.Lo = 153 (0.6*256)

 

Point [0] = 1:153 (no need to compute, as it is equal to Step)

Point [1] = 3:50

Point [2] = 4:203

Point [3] = 6:100

Point [4] = EndPoint (no need to compute,just copy dx2)

		LDA	(	#1			)
		STA	(	_StepHi		)
		CLC
		ADC	(	_xp			)
		STA	(	_ValueHi	)
		LDA	(	#153		)
		STA	(	_StepLo		)
		STA	(	_ValueLo	)

		LDY	(	_dy			)		//	Y = dy (Loop Counter)
		DEY							//	We don't compute last point
		DEY							//	We only do inner loop (dy-2) times
		LDX	(	#0			)		//	X = scanline (0-based)
			//	42c per point
		R_LE_NonSteepScanlineLoop03:
			LDA	(	_ValueHi	)
			STA	(	_edgeL,X	)	//	edgeL [scanline] = ValueHi

				//	ValueHi += StepHi
			CLC
			ADC	(	_StepHi		)
			STA	(	_ValueHi	)

				//	ValueLo += StepLo
			CLC
			LDA	(	_ValueLo	)
			ADC	(	_StepLo		)
			BCC	(	R_LE_NSSLBypass03	)
			INC	(	_ValueHi	)
			R_LE_NSSLBypass03:
			STA	(	_ValueLo	)

				//	Loop Next
			INX					//	ToDo: Merge both indices(and reverse the order of writing)
			DEY
		BNE	(	R_LE_NonSteepScanlineLoop03	)
			//	Store the last value that only got computed
		LDA	(	_ValueHi	)
		STA	(	_edgeL,X	)	//	edgeL [scanline] = ValueHi
		INX
			//	Store Endpoint to avoid holes
		LDA	(	_xp2		)
		STA	(	_edgeL,X	)	//	edgeL [scanline] = xp2

- So, the inner loop gets executed only dy-2 times (as we don't have to compute first and last point).

- This is in huge contrast with Bresenham, where if the line was 256 pixels long (yet, just across 5 scanlines), its inner loop would get executed way too many useless times (despite us only needing 5 xpos values for each scanline's xpos).

- Yet, I retain the full precision,as 8 bits give me precision of 256, which is totally enough for my resolution.

- To my surprise, I didn't need to do any bitshifting (as I was initially expecting), it's really just a straightforward adding.

- the line is really nice - not exactly 100% identical to Bresenham, but way nicer than my older double-step approach and I actually consider this full quality and don't believe one could spot the visual difference in motion at all

- this also answers the question for sub-pix precision implementation costs I had earlier

 

I should be able to slightly optimize it further by:

- using just 1 index register for the loop, though this needs reversed order of writing and computation

- reusing the other register for hi or lo value, thus saving couple cycles compared to reading from page-0

  • Like 3
Link to comment
Share on other sites

did I not mention scanline traversal (dda)? ;)

 

my first poly fillers worked in 8bit integer so no 8.8 etc... I only went to more precise approach when dealing with gouraud and subpix rendering.

I know, but in all honesty, I don't really know the DDA algorithm from the top of my head, so it's entirely possible I just reinvented the wheel here :)

I'm just fumbling my way through this, but discovering this stuff is 90% of fun!

 

 

Actually I'm busy with other stuff, but this kept bothering me.

The "solution" is so simple: The carry flag is already the ninth bit. Lots of lines can be removed now:


Thank you all for let this rolling again in my "think-bowl"!

Yes, using the Carry flag as a 9-th bit doubled a precision for me, when I was doing the 8-bit world-space transforms, so it was way more than worth doubling the codepath size.

 

I think I should start counting the number of bytes too (not just the number of cycles), as I just keep doubling each codepath left and right, like there's no tomorrow (but, it still should all fit within 128-512 KB) :)

  • Like 3
Link to comment
Share on other sites

I think I should start counting the number of bytes too (not just the number of cycles), as I just keep doubling each codepath left and right, like there's no tomorrow (but, it still should all fit within 128-512 KB) :)

 

Just reduced some more:

                "el" is max(dx,dy)
nextPoint:
		lda err
		sec
SMC    SubEs, { sbc #SMC_Value }
		bcs quickStep

SMC    AddEl, { adc #SMC_Value }

noHighInc:
SMC ChangeSlowPosition, { inc SMC_ZpAdr }	; All (inc/dec and x0/y0) is subject of change!
	
quickStep:
                sta err
SMC ChangeQuickPosition, { inc SMC_ZpAdr }	; All (inc/dec and x0/y0) is subject of change!

PlotPixel:      ## PIXEL IS PLOTTED HERE ##

		dec el
		bne nextPoint

Line routine is now ~25% faster. (Lots of cycles wasted due to size restriction by JSRing to the central, somewhat slow Plot routine)

  • Like 1
Link to comment
Share on other sites

just to give you an idea here is the lua/pico-8 of my polydrawer:

function dopoly(facecol)
maxy=0
miny=128
 
if tflag==0 then
 -- scanedge
xx1=x1
yy1=y1
xx2=x2
yy2=y2
doline()
 
xx1=x2
yy1=y2
xx2=x3
yy2=y3
doline()
 
xx1=x3
yy1=y3
xx2=x1
yy2=y1
doline()
 
for y=miny,maxy-1 do
   line(ledge[y],y,redge[y],y,facecol)
end
end
 
 

function doline()
local buffid=0 -- right buffer
if yy2<yy1 then
 -- swap points
 temp=xx2
 xx2=xx1
 xx1=temp
 temp=yy2
 yy2=yy1
 yy1=temp
 buffid=1 --left buffer
end
 
local deltax=(xx2-xx1)
local deltay=(yy2-yy1)
if deltay<=0 then return end
local prestep=flr(yy1+0x0.ffff)-yy1
 
local xstep=deltax/deltay
 xx1=xx1+prestep*xstep
 
local sy1=flr(yy1)
local sy2=flr(yy2)
if sy1<miny  then
miny=sy1
end
if sy2>maxy then
maxy=sy2
end
 
if buffid==0 then
 
for y=sy1,sy2-1 do
local sx=xx1
redge[y]=sx 
  xx1=xx1+xstep
end
 else
for y=sy1,sy2-1 do
local sx=xx1
ledge[y]=sx 
   xx1=xx1+xstep
  end
end 
 
 
end
 
 
 

function dopyramid()
 
 zoomstep=(zoomstep+1)%180
 
 if tflag==0 then
    zoom=sin((zoomstep+180)/360)/3.0 --1.7 --1.85
end
if tflag==1 then
    zoom=sin((zoomstep+180)/360)/4.0 --1.7 --1.85
 end    
 
-- local zoom=0.50
local ssinx=sin(rotanglex)
 local ccosx=cos(rotanglex)
 local ssiny=sin(rotangley)
 local ccosy=cos(rotangley)
 local ssinz=sin(rotanglez)
 local ccosz=cos(rotanglez)
 
for v=1,max_vertices3 do
 
local px=pxtab3[v]*zoom
local py=pytab3[v]*zoom
local pz=pztab3[v]*zoom
 
   --z axis
   local stxx=ccosz*px-ssinz*py
   local styy=ssinz*px+ccosz*py
   
   --y axis
   local stxxx=ccosy*stxx+ssiny*pz
   local stzz=-ssiny*stxx+ccosy*pz
 
   --x axis
   local styyy=ccosx*styy-ssinx*stzz
   local stzzz=ssinx*styy+ccosx*stzz
    
xxtab[v]=256*stxxx/(stzzz+5)+mx
yytab[v]=256*styyy/(stzzz+5)+my
zztab[v]=stzzz
end
 
for f=1,max_faces3 do
 
   x1=xxtab[facetab3[f]]
   y1=yytab[facetab3[f]]
   x2=xxtab[facetab31[f]]
   y2=yytab[facetab31[f]]
   x3=xxtab[facetab32[f]]
   y3=yytab[facetab32[f]]
-- 2d cull test
 doculltest()
 
if culltest>0 then
  c1=zztab[facetab3[f]]*15+16
 c2=zztab[facetab31[f]]*15+16
 c3=zztab[facetab32[f]]*15+16
 dopoly(sget(f,7)) 
end
 end
   
end 
 


so should be clear. it works in subpix. and this was backported to pico-8 from my 6502 engine :D.
Edited by Heaven/TQA
Link to comment
Share on other sites

here is just a part from Lynx "Elements" demo:

 

 

 
CalcEdge_new
;  linebufid.a=0 ;left buffer
;  If y2<y1
;    Swap x1,x2
;    Swap y1,y2
;    linebufid.a=1 ;right buffer
;  EndIf
;  deltax=(x2-x1)
;  deltay=(y2-y1) ;>>5 ;subpix fraction
;sort already done so y2>y1
lda x2Edge
sec
sbc x1Edge
sta dx
lda x2Edge+1
sbc x1Edge+1
sta dx+1
bpl signcheck2
jmp CalcEdge_new_signed
signcheck2 
lda y2Edge
sec
sbc y1Edge
sta dy
lda y2Edge+1
sbc y1Edge+1
sta dy+1
 
;  If deltay<1
;    ProcedureReturn   
;  EndIf
cmp #1
bcc calcexit ;dynegative
;  xstep.w=deltax<<8/deltay ;32bit div 16 bit = 16bit
lda dx ;dx/dy
sta T1
lda dx+1
sta T1+1
lda dy
sta T2
lda dy+1
sta T2+1
jsr do_math_copo_div
lda product
sta xstep
lda product+1
sta xstep+1
;   subpix.a=((y1!$ff)&$ff)
lda y1Edge
eor #$ff
sta subpix
;  x1=x1+((subpix*xstep)>> ;prestep;
lda subpix ;subpix*xstep
sta T1
stz T1+1
lda xstep
sta T2
lda xstep+1
sta T2+1
jsr do_math_copo_mul
lda x1Edge
clc
adc product+1
sta x1Edge
lda x1Edge+1
adc product+2
sta x1Edge+1
;  screeny1.a=y1>>8 ;not necessary hi byte oly
;  screeny2.a=y2>>8 ;not necesshi byte only
 
ldy y1Edge+1
ldx x1Edge
nnewedgeloop
lda x1Edge+1
nbp sta ledge,y
txa
nbplo sta ledge_lo,y
;sta $fda0
clc
adc xstep
tax
lda x1Edge+1
adc xstep+1
sta x1Edge+1
iny
cpy y2Edge+1
bne nnewedgeloop
calcexit rts
 
Link to comment
Share on other sites

Generally, I intend to have 2 versions of scanline traversal:

1. Nice (currently Bresenham, but would like to replace it with fixed point) - for cutscenes

2. Fast (currently, unrolled Step algo) - for combat

 

I am absolutely hooked on the simplicity, elegance, speed and quality of the 8.8 fixed point line calculation in the inner loop. This does seem like a bull's eye hit, indeed, as it just checks all the boxes. There's just one remaining problem - the initialization - obtaining the fractional part (e.g. Step.Lo) of the 8.8 in a fast and memory-reasonable way.

 

 

I'll use same example from before:

 

Example line:

dx = 8

dy = 5

step = 8/5 = 1.6 -> Step.Hi = INT (1.6) = 1; Step.Lo = 153 (0.6*256)

 

Point [0] = 1:153 (no need to compute, as it is equal to Step)

Point [1] = 3:50

Point [2] = 4:203

Point [3] = 6:100

Point [4] = EndPoint (no need to compute,just copy xp2)

 

So, how do we get to the Step.Lo = 153 ?

dy = 5

1 / dy = 0.20 (floating point)

0.20 * 256 = 51 (8-bit fixed point)

dy5LUT [] = {51, 102, 153, 204}

 

The index to to dy5LUT is the remainder from the division of 8/5 minus 1 -> e.g. (3-1) = 2

dy5LUT [2] = 153

 

These are the options that I can see at this point:

 

Option 1: Use huge table, where for each X (by X, I mean deltaY of the line), we store X-1 values (the fixed point step value)

- This however takes up 12.6 KB (range <0, 159>), which is just too much for something like this.

 

Option 2: Use full-precision table for smaller lines, and progressively lower precision, the longer the line is

- 100% coverage: the range <0,47> takes up 1,128 Bytes, and that's a line that's half the screen height

- 50% coverage: the range <48,95> takes up 1,740 Bytes (instead of 3,480), at the cost of 1 bitshift, and very minor error (around 0.7%)

- 25% coverage: the range <96,159> takes up 2,056 Bytes (instead of 8,224), at the cost of 2 bitshifts, and still small error (around 2.3%)

 

Option 3: Store just 1 fixed point value, and multiply it by the remainder

- This however takes only 160 Bytes, as we store just the 1/X value per each X

- I played it out in excel and the error begins very small, but gets to 9% around line length 32, so I didn't really continue with the table, as ~10% feels like too much too soon

- this also means doing multiplication via adding, which is a problem for lines that have big discrepancy between dx and dy

- thus, the greatest downside of this is unpredictability, where it suddenly might take up way more cycles than regular Bresenham

- I don't know exactly where that performance threshold is, but it's clear you don't need very long lines for this to happen and I don't like unpredictable algorithms

 

Option 4: Split the line into 2 halves

- this halves the memory cost, but adds code complexity and cycles to handle the transition

- While I haven't implemented that yet, I've seen it somewhere,and it just looked too ugly (around the midpoint)

- I think I'd rather use my current step codepath, than do this

- this seems like too much work for too little visual benefit

 

I'm currently heavily leaning towards Option 2, where I'd have full precision for range <0,47> and double-bitshift precision for range <48,159> for less than 4 KB of RAM (which is really about max I'm willing to devote for this - but I get a feeling that 4 KB of LUT of the 4th option I mentioned few posts above will give me better results - but won't really know till I work it out in greater detail).

 

I think I should go and examine the decimal mode of 6502 - it just might provide some additional features that might render the fixed-point useless (but that's just a guess at this point).

 

 

I welcome any ideas on the matter above.

  • Like 1
Link to comment
Share on other sites

I just googled PICO-8 and found out what it is. Now I understand why your code is in LUA (which was quite a major WTF for me :) ).

 

Out of curiosity - why do you prefer doing stuff for virtual console, rather than , say, physical Atari ? What exactly hooked you on ?

 

EDIT: I just browsed through the forums and noticed the compos and such - I can now see the retro appeal - just can't get over the lua thing...

Edited by VladR
  • Like 2
Link to comment
Share on other sites

 

Released yesterday....

 

For me it was just fun... and no overhead in terms of directx etc and as I dont know c++ etc I can not code PCs ;)

 

So I stick to retro consoles and Popmilo mentioned pico some years ago and now suddenly it was time to play around and I really like it... built in basic 128x128 16 colors 32k... well ;)

 

And lua sux big but kept pico going ;) once you get around its specialities....

 

As you see the complete code was coded in several weeks while all the 3d shit was done in days. Even the fractalus part... no comparison to 6502 :D

Edited by Heaven/TQA
  • Like 2
Link to comment
Share on other sites

Sorry for the offtopic, but I don't get the pico8 choice of 128x128. Surely 256x256 would be much closer to early 8bit and console and you start to get some nice sprite definition to screen size about that point. (We all know that even 160x is not really that great for sprite games)...

Edited by Sheddy
Link to comment
Share on other sites

 

Released yesterday....

 

For me it was just fun... and no overhead in terms of directx etc and as I dont know c++ etc I can not code PCs ;)

 

So I stick to retro consoles and Popmilo mentioned pico some years ago and now suddenly it was time to play around and I really like it... built in basic 128x128 16 colors 32k... well ;)

 

And lua sux big but kept pico going ;) once you get around its specialities....

 

As you see the complete code was coded in several weeks while all the 3d shit was done in days. Even the fractalus part... no comparison to 6502 :D

I do really enjoy seeing the Fractalus in a clean, smooth motion !

 

What kind of performance / instruction throughput do you have available on PICO ? I didn't find any performance specifics.

 

Sorry for the offtopic, but I don't get the pico8 choice of 128x128. Surely 256x256 would be much closer to early 8bit and console and you start to get some nice sprite definition to screen size about that point. (We all know that even 160x is not really that great for sprite games)...

I guess they just wanted something closer to A2600, or perhaps a handheld type of thing.

128x128 is basically close to a narrow mode of 160x192 on A800...

Link to comment
Share on other sites

Sorry for the offtopic, but I don't get the pico8 choice of 128x128. Surely 256x256 would be much closer to early 8bit and console and you start to get some nice sprite definition to screen size about that point. (We all know that even 160x is not really that great for sprite games)...

My guess because it was enough ?

 

Limits like this made it so much more productive. For 256x256 you need a good artist to make a likable game. In smaller resolution you can draw ten pixels and it looks like a small human ;)

  • Like 1
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...