Math Functions in Suzy - Speed

Heaven/TQA · December 13, 2016

Can someone explain to me whether it is worth using the maths functions compared to the standard "6502" table-based fastmul approach?

We cannot get a clue from the official docs... they talk about how many "ticks" a mul or div takes... but the operations run in parallel with the CPU etc... and is Suzy running them at 16 MHz?

etc... etc... my experience so far is that for standard 8-bit "fastmul", software tables are faster because of the 4 MHz CPU, but for higher precision the maths unit is faster?

But I only have Handy for comparison, and it does not emulate the maths speed 100% accurately compared to real HW....

+karri · December 13, 2016

The point in using Suzy math is to pipeline things. You can prepare the next step in calculations while waiting for Suzy to do the math of the previous operation. Unfortunately I have no benchmark results. The code is from the 3D engine in Stardreamer.

.macro WAITSUZY
.local notready
notready:
    bit SPRSYS
    bmi notready
.endmacro;

  ; y / z
  ;

   ldy Temp1
   iny
   lda (_VertexSource), y
   bmi DivYNeg

   dey
   lda (_VertexSource), y
   sta MATHG
   iny
   lda (_VertexSource), y
   sta MATHF
   stz MATHE   ; NOTE: Suzy starts a division operation now

               ; The CPU can prepare for the next steps in parallel with Suzy
   iny
   iny
   iny
   sty Temp1

  ; prepare for addition

   clc
   lda #4
   ldy Temp2
   iny

  ; wait for end of divide

  WAITSUZY

  ; read result, shift once, store
   adc MATHC
   lsr
   sta (_VertexTarget), y
   dey
   lda MATHD
   ror
   sta (_VertexTarget), y
   iny
   iny
   sty Temp2

Edited December 13, 2016 by karri

Heaven/TQA · December 13, 2016

Karri, then maybe you can answer my next question, about right-side clipping.

I have used the BSS poly routines so far... and tried another poly routine which was posted here some time ago...

Both divide a poly into two and "blit" them once the slopes etc. have been calculated.

I did 2D line clipping in software (which is unnecessary if using the hardware clipping), but for filled polys, especially on the right side of the screen, when the starting pixel is off screen the complete poly gets skipped by the hardware. I was under the impression that superclipping should handle that, as it does for the left/top/bottom borders, which work... drawing a straight line gets clipped by the hardware perfectly.

So what is your experience with clipping filled polys drawn by the sprite hardware? Might the issue be that I am using a 1x1 sprite which gets stretched and tilted?

+karri · December 14, 2016

One pixel can be stretched and tilted to produce almost any kind of polygon. Here is an example of a road segment that is stretched and moved around.

The right edge of the sprite is always a bit dirty. I don't know why.

My polygon routine for drawing the road or the markings is just:

void polygon(int x1, int y1, int w1, int x2, int y2, int w2, unsigned char color)
{
    Spixel.hpos = x1;
    Spixel.vpos = y1;
    Spixel.hsize = w1 << 8;
    Spixel.vsize = (y2 - y1 + 1) << 8;
    Spixel.tilt = (x2 - x1) * 256 / (y2 - y1);
    Spixel.stretch = (w2 - w1) * 256 / (y2 - y1);
    Spixel.penpal[0] = color;
    tgi_sprite(&Spixel);
}

x1,y1 is the middle of the top of the road. w1 is the road width at the top in pixels.

x2,y2 is the middle of the bottom of the road. w2 is the road width at the bottom in pixels.
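
A small worked sketch of the arithmetic in polygon(): the sizes, tilt and stretch are 8.8 fixed-point values, so a change of one pixel per scanline is 256. The names RoadParams and road_params are made up for illustration and are not part of cc65 or tgi. The guard for y2 == y1 is also an addition, since the routine above divides by (y2 - y1) directly.

```c
#include <stdint.h>

/* Hypothetical container for the values polygon() writes into the sprite. */
typedef struct {
    int16_t hsize;   /* width of the first line, 8.8 fixed point */
    int16_t vsize;   /* number of lines, 8.8 fixed point */
    int16_t tilt;    /* horizontal shift per line, 8.8 fixed point */
    int16_t stretch; /* width change per line, 8.8 fixed point */
} RoadParams;

static RoadParams road_params(int x1, int y1, int w1,
                              int x2, int y2, int w2)
{
    RoadParams p;
    int dy = y2 - y1;

    p.hsize = (int16_t)(w1 * 256);
    p.vsize = (int16_t)((dy + 1) * 256);
    if (dy == 0) {
        /* Added guard: a one-line segment has no slope to compute. */
        p.tilt = 0;
        p.stretch = 0;
    } else {
        p.tilt = (int16_t)((x2 - x1) * 256 / dy);
        p.stretch = (int16_t)((w2 - w1) * 256 / dy);
    }
    return p;
}
```

For a segment from (80,40) with width 10 to (60,60) with width 50 this gives tilt = -256 (one pixel to the left per line) and stretch = 512 (two pixels wider per line).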

To avoid having the "hot" pixel outside the screen, I draw the road from two pixels.

    // Draw road
    polygon(x1 - w1, y1, w1, x2 - w2, y2, w2, COLOR_GREY);
    polygon(x1 - 1, y1, w1, x2 - 1, y2, w2, COLOR_GREY);

The markings (yellow and white) are drawn from one pixel. The source and a compiled version are included.

cl65 -t lynx -o roadseg.lnx roadseg.c

You can move the road segment around with left/right/up/down to see how it scales when it moves towards the horizon.

If you find out what to do to have the right edge straight please share your findings.

Edit: I tried to see what the road looks like when the yellow rumble is green. Now we have only left edges and the ugly stuff is gone.

PS. I have not tried these on a real Lynx. It would be nice to know if the real Lynx looks the same.

roadseg.zip

Edited December 14, 2016 by karri

+karri · December 14, 2016

So, by choosing the order of drawing and understanding that the left edge is OK, we can create this.

Both edges are now ok as the next draw hides the ugly stuff of the previous draw.

Modified source and lnx file attached.

roadfixed.zip

+karri · December 15, 2016

I also added the fire buttons for a little turning of the road segment.

roadturn.zip

Heaven/TQA · December 15, 2016

Thanks Karri that is hot....

Two things I never figured out when using poly routines which render triangles....

One draws the top triangle down to vertex 2 and then the second triangle to vertex 3.

For the bottom part I have tried the two ways I found.... one draws a vertically flipped sprite from v3 to v2....

But this seems to fail when the starting pixel is off screen, though I was under the impression that superclipping should cover that... my hoffset and voffset are 32,32.

When the starting pixel is off screen, the complete poly is skipped by Suzy, which is what I thought superclipping is for.

The second poly routine also draws two triangles per poly but always renders from top to bottom.... this works better, but there is still some poly popping; here I assume numeric overflow in my 3D routines.

I want to avoid software 2D clipping as most of it is in the hardware. (I am not talking about clamping; I mean real clipping, i.e. calculating intersection points etc.)

The other thing I found in all poly routines using stretching and tilting is the "jitter" of the right edge.

Heaven/TQA · December 15, 2016

Ah, btw, does the div in the library use the maths copro?

+karri · December 15, 2016

The cc65 does not use Suzy at all for math. We had a lengthy discussion about this and there was some concern about the effect of signed zero requiring extra code to fix up the results. Basically we could not prove any speedup from Suzy alone.

Thomas Harte wrote the engine I have used in Stardreamer. It has a few problems that I never sorted out. So the project is on hold...

You can use my polygon code for triangles also. Just set the w1 to 1 for triangle opening down or w2 to 1 for triangle opening up.

A long time ago I was making MRI math. At that time I used fixed point matrices that were always normalized. The technique was called fraqmul or something like that. In any case there were never rounding errors and you could rotate stuff any way you wanted. Multiplying a value in the range -1 .. 1 with another value in the range -1 .. 1 produces a result in the range -1 .. 1.

Creating some 16 bit math engine that works on normalized matrices should work perfectly on the Lynx also (7FFF would be 1 and 8000 would be -1).

My faint memory claims that c = a * b was carried out as c = (a * b + 0x7FFF) / 65535.

Or in reality

d = a * b + 0x8000

c = (d + (d >> 16)) >> 16

Edited December 15, 2016 by karri

+karri · December 15, 2016

Blah. Cannot edit the previous code. The last >> 16 should have been >> 15.
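
For what it's worth, a small C sketch of the same idea for the 7FFF = 1 convention. It is not the formula recalled above: it uses the exact form c = (|a| * |b| + 0x3FFF) / 0x7FFF on the magnitudes and puts the sign back afterwards, a choice made here to keep the rounding symmetric around zero. The name frac_mul is made up, and inputs are assumed to lie in -0x7FFF .. 0x7FFF (8000 is not handled).

```c
#include <stdint.h>

/* Normalized fixed-point multiply where 0x7FFF stands for exactly 1.0.
   Assumption: both inputs are in -0x7FFF .. 0x7FFF. */
static int16_t frac_mul(int16_t a, int16_t b)
{
    int32_t ma = a < 0 ? -(int32_t)a : (int32_t)a;  /* |a| */
    int32_t mb = b < 0 ? -(int32_t)b : (int32_t)b;  /* |b| */

    /* Divide by 0x7FFF (not 0x8000) so that 1.0 * x == x, with rounding. */
    int32_t c = (ma * mb + 0x3FFF) / 0x7FFF;

    /* Result is negative when exactly one input is negative. */
    return (int16_t)(((a < 0) != (b < 0)) ? -c : c);
}
```

With this, 1 * 1 stays 1 and 1 * x stays x, so repeated rotations do not shrink the matrix.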

Heaven/TQA · December 15, 2016

Have you ever got a clue what this cube root trivia is all about?

sage · December 15, 2016

joke

ThomH · March 1, 2017

A million years late, but further to add: Suzy doesn't just multiply, she multiplies with accumulation. So that's two steps at a time for vector and matrix computation, not one. Even if the timings are 4 MHz rather than 16 (and, like the author, I have no idea), I think you're still doing better than on the CPU.

The original Elite uses [-1,1] range numbers in all its matrices, but with only a single byte of precision. On a Lynx you could use 14 bits (to include the ability to store 1.0 and -1.0), 15 bits (conveniently introducing a rounding error and storing 1.0s as just less than their true value) but probably not 16 because you can't take the signs out and put them elsewhere, and get Suzy to accumulate. Also I guess you'd read two bytes askew and then shift by only one bit, rather than actually performing 15 steps but that's probably obvious.

I used a 14-bit scheme on a z80 project; the precision is pretty good.

ThomH · March 1, 2017

Actually, it strikes me that there's not even a need to be consistent. Suppose you define that each object is rendered according to exactly three inputs — its geometry, a model matrix and the camera matrix. Then you're going to compose the two matrices, then apply them to the geometry.

If each matrix is in 2:14 then you can multiply them together using Suzy to get a 4:28 result. Keep only the top two bytes and you're at 4:12. Suppose your model geometry is also 4:12 then when you apply the composed matrix to the geometry you'll end up at 8:24. Keep just the top two bytes and you're at 8:8 just before your clipping and perspective projection (or perspective projection and clipping, if you prefer doing the clipping in pixels). Which is a comfortable place to be.

So: trig tables at 2:14. Geometry at 4:12. Perspective and clipping code works in 8:8. No shifting required.
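
A minimal C sketch of that bookkeeping, modelling Suzy's multiply as a plain 16x16 to 32-bit signed product and "keep only the top two bytes" as a shift by 16. The helper name mul_top16 is invented for the example.

```c
#include <stdint.h>

/* Multiply two signed 16-bit fixed-point values and keep the top two bytes
   of the 32-bit product. If a is p:q and b is r:s, the result is
   (p+r):(q+s-16), e.g. 2:14 * 2:14 -> 4:28 -> 4:12. */
static int16_t mul_top16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;

    /* Assumes >> on a negative value is an arithmetic shift; that is
       implementation-defined in ISO C but true on common compilers. */
    return (int16_t)(product >> 16);
}
```

For example, 0.5 in 2:14 is 0x2000. mul_top16(0x2000, 0x2000) gives 0x0400, which is 0.25 in 4:12. Applying that to a coordinate of 2.0 in 4:12 (0x2000) gives 0x0080, which is 0.5 in 8:8.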

ThomH · March 1, 2017

My last post on the topic! I promise! But another good example of overlapping work returned to my conscious mind: my Mode 7 demo.

When drawing one of those perspective floors one iterates from left to right, maintaining a current texture map location and a vector. Look up the colour at the current location, put that into the pixel, add the vector, move to the next pixel.

In my version, the current location was held in Suzy's accumulation vector. The vector was updated by triggering a multiply with accumulate. The numbers being multiplied worked out to the vector.

So the implementation over on the CPU was:

Read current location.

Write trigger byte to begin addition.

Look up colour at current location.

Push to stack.

Repeat until done.

Then, at the end of each line: the next line of output is assembled at one-byte-per-pixel on the stack. Get Suzy to move it to the right place, scaling it down so as to repack to 4bpp.

(and, in a later optimisation: don't do 160 pixels per line, do only as many as there are unique texels at that scale, and have Suzy scale based on that; it creates raggedy edges from precision loss but cut out something like half the work at my particular scale. I could have ameliorated a little without cost through better rounding but didn't. I could possibly have saved by assembling multiple lines on the stack at once, if drawing some that are short enough, but didn't.)

Summary: performing 32 bit addition on Suzy by supplying one number as two of its factors and allowing a multiplication to occur is almost certainly not faster than just doing it on the CPU. But overlapping the work means that suddenly all the CPU is spending on it is the four cycles of a store absolute. Which is faster.
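
For anyone who hasn't written one of these, a plain C sketch of the per-line stepping described above, done entirely in software. This is not the demo's code: the 16x16 texture, the 8:8 coordinates and the name draw_floor_line are all assumptions for illustration. On the Lynx the two additions are what gets handed to Suzy.

```c
#include <stddef.h>
#include <stdint.h>

#define TEX_SIZE 16  /* hypothetical texture of 16x16 texels, one byte each */

/* Walk one output line: look up the texel at the current location,
   store it, add the step vector, move on. u and v are 8:8 fixed point. */
static void draw_floor_line(const uint8_t *texture,
                            uint16_t u, uint16_t v,
                            int16_t du, int16_t dv,
                            uint8_t *out, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        unsigned tx = (u >> 8) & (TEX_SIZE - 1);  /* integer part, wrapped */
        unsigned ty = (v >> 8) & (TEX_SIZE - 1);
        out[i] = texture[ty * TEX_SIZE + tx];
        u = (uint16_t)(u + du);  /* the addition Suzy would do in parallel */
        v = (uint16_t)(v + dv);
    }
}
```

With a step of 0x0080 (half a texel per pixel) each texel is output twice, which is also the case where the "unique texels only" optimisation would halve the work.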

Heaven/TQA · March 3, 2017

Thanks Tom... sounds interesting... right now I don't use the accumulation feature yet in my 3D engine only mul and div.

Regarding division I am not sure if CPU is not faster though or to use of course a reciprocal table and use signed mul.

Heaven/TQA · March 3, 2017

Thanks for mentioning accumulate... just implemented this feature into my matrix rotation steps for rotating vertices.

ThomH · March 3, 2017

With yet more very slight thought, I think my Mode 7 demo wittingly-or-otherwise shows that the cycle timings given in the documentation are against the 16 MHz bus. I'm confident that the per-pixel loop was something almost exactly like:

LDA abs ; load high byte of pixel address from Suzy's accumulation register
LDY abs ; load low byte of pixel address from Suzy's accumulation register
STX abs ; store to Suzy to trigger the next multiply with accumulate
STA abs ; dynamically modify the LDA below by storing the high byte of the pixel address to it
LDA abs,y ; load the next pixel of floor colour
PHA ; store the next pixel of floor colour

I'm questioning now whether I unrolled it*, but by my calculation adding the above up gives 4 + 4 + 4 + 4 + 5 (all lines are page aligned) + 2 = 23 cycles at 4 MHz. The documentation states that multiply with accumulate takes "54 ticks". Even if I didn't unroll, there's no way I spent 21 cycles on decrementing an 8-bit loop counter and jumping.

Therefore I'm going to say that the fact that the code above works** offers strong evidence that 54 ticks means 54 cycles at 16 MHz. So 13.5 CPU cycles is the number to beat if you want to do it entirely in software.
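
The arithmetic behind that, as a tiny C sketch. The cycle counts are the ones quoted above, not measurements, and the factor of four assumes one CPU cycle at 4 MHz equals four ticks at 16 MHz. Function names are made up.

```c
#include <stddef.h>

/* Per-instruction CPU cycles of the loop body as quoted above:
   LDA abs, LDY abs, STX abs, STA abs, LDA abs,y, PHA. */
static const int loop_cycles[] = { 4, 4, 4, 4, 5, 2 };

/* Total CPU cycles for one pass through the loop body. */
static int loop_total(void)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < sizeof loop_cycles / sizeof loop_cycles[0]; ++i)
        sum += loop_cycles[i];
    return sum;
}

/* Convert Suzy ticks to CPU cycles under the 16 MHz hypothesis. */
static double ticks_to_cpu_cycles(int ticks)
{
    return ticks / 4.0;
}
```

The loop body comes to 23 cycles, which is more than the 13.5 a multiply with accumulate would need at 16 MHz but far less than 54.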

* if I were writing now, I'd probably put eight copies of that plus a decrement and jump back to the start onto the zero page, to save a cycle in the dynamic reprogramming bit, then just jump in at the right position ala Duff's device.

** tested on real hardware:

Edited March 3, 2017 by ThomH

Heaven/TQA · March 4, 2017

I rewrote my matrix multiplication for rotating one vertex yesterday to use MAC, and in emulation it's slightly faster...

and that's the code....

;x
;reset accumulation
lda #$e0 ;signed mul+accumulate = MAC
sta $fc92
stz MATHK
stz MATHM
;x
rot_objx1 lda _px_l,x
sta MATHD ;T1
rot_objx2 lda _px_h,x
sta MATHC ;T1+1
lda mat+0
sta MATHB ;T2
lda mat+9 ;hi
sta MATHA ;T2+1
rot_objy1 lda _py_l,x
sta MATHD ;T1
rot_objy2 lda _py_h,x
sta MATHC ;T1+1
lda mat+1
sta MATHB ;T2
WAITSUZY
lda mat+9+1 ;hi
sta MATHA ;T2+1
rot_objz1 lda _pz_l,x
sta MATHD ;T1
rot_objz2 lda _pz_h,x
sta MATHC ;T1+1
lda mat+2
sta MATHB ;T2
WAITSUZY
lda mat+9+2 ;hi
sta MATHA ;T2+1
WAITSUZY
;accumulate
lda MATHK
sta xn
lda MATHJ
sta xn+1

;y
;y'''=x*m10+y*m11+z*m12
;reset accumulation
stz MATHK
stz MATHM
rot_objx3 lda _px_l,x
sta MATHD ;T1
rot_objx4 lda _px_h,x
sta MATHC ;T1+1
lda mat+3
sta MATHB ;T2
lda mat+9+3 ;hi
sta MATHA ;T2+1
rot_objy3 lda _py_l,x
sta MATHD ;T1
rot_objy4 lda _py_h,x
sta MATHC ;T1+1
lda mat+4
sta MATHB ;T2
WAITSUZY
lda mat+9+4 ;hi
sta MATHA ;T2+1
rot_objz3 lda _pz_l,x
sta MATHD ;T1
rot_objz4 lda _pz_h,x
sta MATHC ;T1+1
lda mat+5
sta MATHB ;T2
WAITSUZY
lda mat+9+5 ;hi
sta MATHA ;T2+1
WAITSUZY
;accumulate
lda MATHK
sta yn
lda MATHJ
sta yn+1

;z'''=x*m20+y*m21+z*m22
;reset accumulation
stz MATHK
stz MATHM
rot_objx5 lda _px_l,x
sta MATHD ;T1
rot_objx6 lda _px_h,x
sta MATHC ;T1+1
lda mat+6
sta MATHB ;T2
lda mat+9+6 ;hi
sta MATHA ;T2+1
rot_objy5 lda _py_l,x
sta MATHD ;T1
rot_objy6 lda _py_h,x
sta MATHC ;T1+1
lda mat+7
sta MATHB ;T2
WAITSUZY
lda mat+9+7 ;hi
sta MATHA ;T2+1
rot_objz5 lda _pz_l,x
sta MATHD ;T1
rot_objz6 lda _pz_h,x
sta MATHC ;T1+1
lda mat+8
sta MATHB ;T2
WAITSUZY
lda mat+9+8 ;hi
sta MATHA ;T2+1
WAITSUZY
;accumulate
lda MATHK
sta zn
lda MATHJ
sta zn+1
persp_trans:

But I have not checked it on real HW yet.
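
As a cross-check for emulator or hardware output, a small C reference for what one row of the listing appears to compute: three signed 16x16 multiplies summed into a 32-bit accumulator, with the upper two bytes kept as the result. That MATHK/MATHJ hold the upper two accumulator bytes is an assumption here, and the names Vec3, mac_row and rotate_vertex are invented for this sketch.

```c
#include <stdint.h>

typedef struct { int16_t x, y, z; } Vec3;

/* One matrix row times a vertex, accumulated like a MAC sequence.
   Assumptions: the sum of the three products fits in 32 bits (true for
   rows of a rotation matrix), and >> on a negative value is arithmetic. */
static int16_t mac_row(const int16_t row[3], Vec3 v)
{
    int32_t acc = 0;                  /* "reset accumulation" */
    acc += (int32_t)row[0] * v.x;
    acc += (int32_t)row[1] * v.y;
    acc += (int32_t)row[2] * v.z;
    return (int16_t)(acc >> 16);      /* keep the upper two bytes */
}

/* m is a 3x3 matrix in row-major order: m00 m01 m02 m10 ... m22. */
static Vec3 rotate_vertex(const int16_t m[9], Vec3 v)
{
    Vec3 out;
    out.x = mac_row(&m[0], v);
    out.y = mac_row(&m[3], v);
    out.z = mac_row(&m[6], v);
    return out;
}
```

With matrix entries in 2:14, an identity matrix has 0x4000 on the diagonal, and a vertex of (0x0400, 0x0800, 0x0C00) comes back as (0x0100, 0x0200, 0x0300): the same values with the binary point moved by two bits.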

Edited March 4, 2017 by Heaven/TQA

+selgus · March 4, 2017

I too would be really interested in hearing how this compares speed-wise on real hardware. I don't totally trust the output I get from the emulator I run on the Mac.

Twoface2 · March 6, 2017

I too would be really interested in hearing how this compares speed-wise on real hardware. I don't totally trust the output I get from the emulator I run on the Mac.

Post a ROM Image and I / we test it for you!

sage · March 6, 2017

** tested on real hardware:

Didnt remember that there was a soundtrack.

ThomH · March 6, 2017

Didnt remember that there was a soundtrack.

Haha, accidental capture. And so quiet that I didn't even realise it was there. Let this be a snapshot also of at least one thing I was listening to c.2010.
