Known fast line algorithms ?

R0ger · November 10, 2014

Hello Atari fans !

I'm looking for some fast line algorithms. I'm using graphics mode with 4 pixels per byte, so 1 pixel is 2 bits, and all coordinates will fit into byte. After some googling I found either very basic algorithms, or something not suited to this graphics mode, or not suited for 8bit atari at all.

I mean I already made some routines. And I think they are not bad. On average my routines cost 64 cycles per pixel of the line. But are they the best possible ?

Did someone for example tried to disassemble some famous vector games, like Mercenary ?

My routine is available here:

http://roger.questions.cz/atari/line.asm

It is Bresenham, with 4 variants. 2 for up and 2 for down. 2 for beyond 45 degrees and under 45 degrees. I always draw left to right, and swap coordinates when needed. I keep byte offset from starting address in Y, and index of pixel mask in X.

The code is not commented very extensively, but it should be clear enough for anybody alredy doing Bresenham in their live.

Please be gentle with me, this is my first 6502 project and I'm still not quite efficient in all intstruction times and such. I am interested in even slightest optimalizations though, especially in the exposed inner loops ! But some principal idea would be even better.

I could quickly make some small project which would demonstrate the routine, if someone feels like debugging through it.

Rybags · November 10, 2014

There's various degrees of improvement that can be had, depends how much memory you want to throw at the problem.

First up is the obvious ones like table lookup for each possible Y position and embedding the plot point within the line loop rather than calling seperate subroutine.

There was a thread not long ago about line draw optimisations so there's plenty more info in that. One of the top optimisations can be to draw from start and endpoint and meet in the middle.

Not sure if it's worth doing seperate routines dependant on angle and such - you can potentially lose out in that you need to do sorts and comparisons and the like externally anyway.

R0ger · November 10, 2014

There's various degrees of improvement that can be had, depends how much memory you want to throw at the problem.

First up is the obvious ones like table lookup for each possible Y position and embedding the plot point within the line loop rather than calling seperate subroutine.

There was a thread not long ago about line draw optimisations so there's plenty more info in that. One of the top optimisations can be to draw from start and endpoint and meet in the middle.

Not sure if it's worth doing seperate routines dependant on angle and such - you can potentially lose out in that you need to do sorts and comparisons and the like externally anyway.

Well draving the point inside is of course part of my routine. The putpixel routine on the start is only used for special case where input coordinates are the same.

The angle is quite important actually. With angle bellow 45 degrees you always go right, and sometimes also up. With lines above 45 degrees you always go up, and sometimes also right. This distincion simplifies the inner loops considerably.

The idea with drawing from both ends at the same time sounds good though ! I would have to mask to zero page, but halving the time of decission making could bring like 20% speed boost !

As for the thread .. I failed do find it. Could someone post a link ?

Edited November 10, 2014 by Dr.Sid

FifthPlayer · November 10, 2014

As for the thread .. I failed do find it. Could someone post a link ?

Here's one. I don't know if it's the thread Rybags was referring to or not.

http://atariage.com/forums/topic/211628-fast-line-drawing/

FifthPlayer · November 10, 2014

I'm no expert at 6502 code optimization (I wrote one piece of 6502 code in my life, back in the 80s), but one thing that seems obvious to me is separate routines for the hi-res (BASIC graphics 8 display) versus the other display resolutions. At 320 pixels horizontally, you need a second byte to store the high byte of the x coordinate, and that can be optimized out for lower res display modes. I took a brief look at your code, and you already seem to do that. There might be something to be gained by have a drawing routine specialized for each graphics mode, or by using self-modifying code to adjust the X and Y screen boundaries.

Edited November 10, 2014 by FifthPlayer

Rybags · November 10, 2014

That's the thread, there's probably more but we covered some good ground there.

I'd not bother with the OS plot routine for start=end situations. You'd be throwing away a good part of the speed advantage. Even the overhead of doing the proper call sequence alone would probably burn more cycles than just doing it yourself.

Maybe you could just patch in an RTS and call the embedded routine to save on memory.

OK, I get the 45 Degree thing... for some reason I was thinking in the realms of forward/backwards rather than the attribute of X/Y incrementing every time and the other one not.

I probably mentioned it in the other thread, but one enhancement I wanted to try one day is with the mask in that rather than looking it up every time have it dynamically changed when the XPos changes. It might save a few cycles. Similar with the address of the point to plot, rather than do lookup have it setup initially and altered as you go. But there's all sorts of testing/cycle counting there.

xxl · November 10, 2014

eru's fast draw:

erudraw.rar

+Stephen · November 10, 2014

Nice!

R0ger · November 10, 2014

Wow, thanks for reactions. Didn't expect so many so fast !

Some of the suggestions are actually already part of my routine. Some are irelevant, cause I'm not interested in supporting all modes at the moment. I only need this one mode.

As for Eru's routine .. now we're talking. It is the mode my routine also works with, 4 pixels per byte. He uses quite simmilar approach to mine, except his uses something slightly different then Bresenham. He compares deltas in every step. It doesn't even seem to work in all cases, but the decision making seems to be quite simmilar in the end. My routine at the moment is actually faster then Eru's. I have 64 cycles per pixel, his has around 80. Still mine does not replace the color as his does, it uses OR to background, which is faster and it is sufficient for me at the moment, so in the end they would be quite simmilar in speed.

Eru has few great ideads there though. I especially like his use of array with masks for whole line. That is really simple and smart, and not so memory intense. I could addapt that to my routine immediately.

Also I'm thinking about that drawing from both ends at the same time .. but that would take some more time. Also 20% was overly optimistic (as always). After some thinking I guess it will be on borderline of usefullnes. But as I said, I will taky any cycle I can get.

Before that I have to read through those C64 archives. I already use some of their code, like multiplication, or signed comparison. I just haven't found line routines there till now.

xxl · November 11, 2014

wow! nice!

declare variables and tabs for your draw routine. I want to compare speed.

dmsc · November 11, 2014

Hi!

Hello Atari fans !

I'm looking for some fast line algorithms. I'm using graphics mode with 4 pixels per byte, so 1 pixel is 2 bits, and all coordinates will fit into byte. After some googling I found either very basic algorithms, or something not suited to this graphics mode, or not suited for 8bit atari at all.

I mean I already made some routines. And I think they are not bad. On average my routines cost 64 cycles per pixel of the line. But are they the best possible ?

Did someone for example tried to disassemble some famous vector games, like Mercenary ?

My routine is available here:

http://roger.questions.cz/atari/line.asm

Well, your routine is already close to optimal, and by simply ORing the pixel to the screen you are saving a few cycles.

Some ideas, based on 8086 routines from the nineties, are:

- Instead of calculating the mask at each step, simply rotate the mask. The problem is detecting the need to increase the Y register, you could check carry after each ROR but that could be slow.

- On horizontal lines, accumulate the mask in some ZP location, and copy that to the screen when you increase the Y register, this can save a lot of cycles in lines of less than 26°. This can be combined with the above, simply rotate the output mask on each cycle and.

- Add/subtract the line width (40) to the Y register instead of the screen pointer, this could save 2 cycles. Also, this allows to use self-modifying code cheaply, as you increase the address in the code only on carry.

- On 1bpp modes (this is not your case, but perhaps could be adapted), there is a fast line code that simply uses ROR to accumulate the carries from the delta accumulators. Then, after 8 cycles you have a "line code" of the next 8 pixels. You jump to a different routine for each code. For example, with code "0", you simply write $FF to set all pixels, with code "$40" you write $F0 to the first pixels, increase the pointer one line and write $0F, etc. This needs a lot of code (256 different cases), but it is normally a lot faster. With 2bpp modes, you could implement only 16 cases.

Well, if you try to implement any of the above, tell me if it is faster :-)

R0ger · November 11, 2014

I made small demo, but it only draws 2 lines, each 50 pixel long. Good for profiling cycles per pixel. If hope making random lines would be no problem for you.

It is here: http://roger.questions.cz/atari/line.zip

Main file is main.asm, rest is included. It is structure of my bigger project, I only removed unnecessary code. So it does for example turn ROM off. I can't turn it on quickly, as I'm not experienced all too well in those shadow registers, I hope you can make it faster if you need it.

So far I tried 2 modifications. First using table for masks and byte offsets for each X, as in erudraw. That actually didn't help. Reason is this. When I move down, I do not increment pointer in zero page. I first add 40 to Y. Only when that overflows, I go to zero page. This was I'm saving few cycles, but I can't use Y with the table. So the table approach is marginally faster for flat angles (less then 45 degrees), it is 10% slower on steep angles. Together it is few % slower.

The I tried drawing symetrically from both ends. For this it's clear I need 2 pointers, two masks .. so there is no way I could do that in registers. So the table approach is actually better for this ! And it actually seems to be faster ! For flat angles the speed increase is the largest - about 10%. For steep it's only marginal. But that gives me like 5% together. The codes is a bit longer, and I will have to correctly solve the last middle pixel (which I don't do at the moment). But it seems to be an improvement. It will take me some time to make it complete.

As for the C64 code I've seen .. they use some crazy techniques with precomputed blocks of bytes. It seems to be lot faster .. but they only talk about 8 bits per pixel, where the balance can be rather different. Still I don't understand it all fully yet, so I'm not saying it's not usable.

R0ger · November 11, 2014

- Instead of calculating the mask at each step, simply rotate the mask. The problem is detecting the need to increase the Y register, you could check carry after each ROR but that could be slow.

- On horizontal lines, accumulate the mask in some ZP location, and copy that to the screen when you increase the Y register, this can save a lot of cycles in lines of less than 26°. This can be combined with the above, simply rotate the output mask on each cycle and.

- Add/subtract the line width (40) to the Y register instead of the screen pointer, this could save 2 cycles. Also, this allows to use self-modifying code cheaply, as you increase the address in the code only on carry.

- On 1bpp modes (this is not your case, but perhaps could be adapted), there is a fast line code that simply uses ROR to accumulate the carries from the delta accumulators. Then, after 8 cycles you have a "line code" of the next 8 pixels. You jump to a different routine for each code. For example, with code "0", you simply write $FF to set all pixels, with code "$40" you write $F0 to the first pixels, increase the pointer one line and write $0F, etc. This needs a lot of code (256 different cases), but it is normally a lot faster. With 2bpp modes, you could implement only 16 cases.

Well, if you try to implement any of the above, tell me if it is faster :-)

I tried rotating the mask . .problem is, that on 6502 it's like txa:lsr:lsr:tax .. it's not fast. Using simple 4 element table and x only like index is just faster. Or the full mask table for whole lines. Result: 5% slower.

As for add/subtract the line width (40) to the Y register .. I already do that. It helps, especially for vertical lines.

As for acumulating result in ZP first .. I will try that, for flat angles variant, or I can do another variant. It won't help for steep angles. But it could save few cycles.

Also the carry acumulation sounds interesting. It could allow some code unwrapping and some local optimalizations.

All those variations ! :-D

dmsc · November 11, 2014

Hi!,

I tried rotating the mask . .problem is, that on 6502 it's like txa:lsr:lsr:tax .. it's not fast. Using simple 4 element table and x only like index is just faster. Or the full mask table for whole lines. Result: 5% slower.

As for add/subtract the line width (40) to the Y register .. I already do that. It helps, especially for vertical lines.

As for acumulating result in ZP first .. I will try that, for flat angles variant, or I can do another variant. It won't help for steep angles. But it could save few cycles.

Also the carry acumulation sounds interesting. It could allow some code unwrapping and some local optimalizations.

All those variations !

My idea of rotating the mask was more like:

L3a             ;move right and up

                ;draw pixel
                lda (pos),y
                ora mask
                sta (pos),y

                ;quit
                dec pixcnt
                beq L3_out

                ;move mask right
                lsr mask
                bcs L3amove
                ror mask
                bcc L3a2

                .byte $2C ; Skip next two bytes!
L3amove
                ror mask
                ror mask

                iny
                bne L3a2
                inc pos+1

L3a2

Note that you now have a free register (X), that could be used for another variable.... But you are right, at 2bpp (needing the two ROR) it is probably slower.

On closer inspection of your code, I noticed that you have a few unneeded "bne L*a2" instructions just after an "inc pos+1", perhaps a leftover from older code.

xxl · November 11, 2014

I removed the unnecessary jumps.

yes, it is fast :-)

drsid15.xex

eru15.xex

Heaven/TQA · November 11, 2014

haven't looked in code yet... but just my 2 cents regarding "can not use y"... did you use illegal opcodes? have a look... in my latest code i was using LAX etc... so check the Oxyron website or codebase64 and see if there are some usecases.

XXL... the drsid file misses RUN adress?

Edited November 11, 2014 by Heaven/TQA

R0ger · November 11, 2014

Drsid15.exe doesn't work for me .. do I need some specific machine settings ? I'm using Altira.

Anyway .. I didn't do anything so far, but I was mostly thinking about unwrapping the code (expanding loops) with carry acumulation.

For flat variant it will be 1 unwrap for every byte. So first there will be unwrapped code for getting 4 bits into byte with direction changes, then there will be 16 code variants. The first bit will actually mean only if there is up movement before first pixel, so there only be 8 routines each with 2 entry points. For flat angles the main bonus (above removing loop jumps) will be addressing the one byte at once, and doing it using constatns, no mask rotation needed.

For steep angles it won't be that effective, but there still are some nice tricks. First, unwrapping alone is still good idea. Then I could do the unwrapped routine so it only addresses using Y and only Y changes. I can address up 6 lines like that. While I will have to mantain mask, it won't change much, and in all vertical case I will reuse the same mask every time. So again, the bonus will be huge. I think 4 bits and 16 variants (or 8 with 2 entry points) sounds reasonable. I could go up to 6 pixels per iteration, after that the effectivity will start dropping a bit.

Now what about combining this with drawing from both ends ? It clearly won't work for flat angles. Here I draw byte by byte .. and the line is not symetrical in bytes. But for steep angles I'm not limited to bytes, I solve mask and everything, so I could use this aproach. In that case I would first solve let's say 4 bits of directions, then I can use that to draw 8 pixels using dedicated routines.

And of course, using all the small tricks so far mentioned here.

Sounds like long trip .. but the speed bonus will be crazy. Hope I will be able to do at least one variant tonight.

Xuel · November 11, 2014

I added a run $2000 segment to xxl's drsid15.xex:

drsid15-fix.xex

Results I see:

eru15.xex: ~285 lines per second

drsid15-fix.xex: ~300 lines per second

R0ger · November 11, 2014

Thanks for the fix. Not bad. I will try to find some improvements for this 'old' routine, as it has normal size, and it can be usefull for somebody. But now I'm focused on the unwrapped patterns.

R0ger · November 11, 2014

On closer inspection of your code, I noticed that you have a few unneeded "bne L*a2" instructions just after an "inc pos+1", perhaps a leftover from older code.

Yes, I noticed that too in my profiling .. those instruction get almost no hits though, so it unfortunatelly has no effect on speed.

R0ger · November 11, 2014

O_O

So .. I have my first version ..it's for flat angles only. So far I only have unwrapped code for 5 variants out of all 16. That is enough to draw my one test line (roughly at 30 degree angle, 52 pixels long). I also don't have code for start and end of the line solved. I do decrement pixel count though, so for the inner-most loop, it should be more or less final.

Still .. the speed is just .. I can't believe it. I did compute it twice. Three times. It's still the same.

Instead of 64 cycles per pixel, I now have 35. THIRTY FIVE. That is almost twice as fast. Exactly 54% of cycles, in other words 81% faster.

I might completely redo it once or twice, and the code is A MESS .. s I won't show it just yet. But soon !

flashjazzcat · November 11, 2014

Loving this - great stuff!

R0ger · November 11, 2014

Btw. I am not storing the carry flags in a byte. I have to do branches based on the carry, to modify there Bresenham error value based on it. And since I have to do branches, the branch itself is defined by the carry flags of previous tests.

So I have binary tree of branches, 4 deep, 16 leafs. All withing branch range.

The leaf defines pattern - and the leaf contains single jump to predefined pattern (instead of jump table).

If you picture leaf index in binary, 0 means pixel and move left, 1 means pixel and move diagonally. The pattern routine has to draw the pattern, but it uses constant masks, so it's simple and fast.

For example pattern 5 (0101) means - draw $f0, move down, draw $0f, move down.

After that, odd pattern routines jumps to one 'ending code', while even use different. I adjust error value again and I reenter the branch tree from the beginning. Here I also decrement pixel count and do other things which would be same in all pattern routines.

R0ger · November 11, 2014

I rearanged the code a bit and completed the flat angle variant (still, no start and end). Clean horizontal line makes 23 cycles per pixel. As it gets toward 45 degrees, it's getting slower (obviously, more Y steps). Even so, average over all 76 pixel long lines from 0 to 45 degrees is 37 cycles per pixel.

Now the steep angles !

Code (don't even try to make it work, it has too many limitations so far):

http://roger.questions.cz/atari/line-unwrap2.asm

Edited November 11, 2014 by Dr.Sid

dmsc · November 12, 2014

Hi!

I rearanged the code a bit and completed the flat angle variant (still, no start and end). Clean horizontal line makes 23 cycles per pixel. As it gets toward 45 degrees, it's getting slower (obviously, more Y steps). Even so, average over all 76 pixel long lines from 0 to 45 degrees is 37 cycles per pixel.

Now the steep angles !

Wow, it is good to see that kind of speedup :-)

I see your decision tree, it's a very good approach indeed, and the joining of the states "i" and "i+8" is a good idea to reduce the number of possibilities. You could squeeze a few extra cycles by coding the "i+8" states directly after the "tax", so you use one less "jmp", and to jump from the other states, you can do a "bcc" or "bcs" instead of the "jmp", because you already know the state of the carry flag. This is also true in L1_xxx1 and L1_xxx0, you can just use "bcs L1_0xxx" at the end.

Overall, I'm very impressed, thanks for writing and testing the code!

Known fast line algorithms ?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members