R0ger

Posts posted by R0ger


  1. It's a lot easier to make a demo .. a demo only has to look good. You can stop at any time. Basically a demo can't be incomplete. A game needs to be fun, and it can be incomplete. You can make a demo on impulse, as a proof of concept. A game needs a lot more planning.

    Also, making demos and making games are different skills. You can make a great game in Game Maker with no advanced coding skills. You can make a demo which creates a cool wave-like fractal effect that would have absolutely no use in any game. Most of the time, demos and games are just unrelated.


  2. So XXL sent me the source code for the old tests. I modified it to use my new line, and here are all 3 benchmarks (and a zip with sources for the last one):

     

    So while the two old lines both did around 300 lines per second, this new one does about 400.

     

    Also note that my old line (drsid15) uses only ORA to draw pixels, so it's not actually much faster than Eru's line.

    eru15.xex

    drsid15.xex

    drsid2.xex

    drsid2.zip


  3. Yes, it seems clear that they basically draw a grid, only the lines they use are funky. The 'crappy line' algorithm seems to me the most interesting part. I can understand pretty well how they do everything else.

    It might be something like my pattern approach to lines .. except they use distorted patterns, not just straight lines.

    Clipping can be done in many ways, and it's usually nothing really slow. I don't think it's worth investigating.


  4. RoF works completely differently ;) and last 2 cents for RoF/Koronis Rift... RAM would not help much (ok... for the 3D maths yes, like mul-tables, perspective, unrolled code) but not for the fractal routine itself as it might use recursion... but what we discussed for RoF is precalculating more of the map in advance... but that helps only for fixed flight paths though... ;)

     

    Any links to how exactly RoF works ? I always wondered about that.

    As for Mercenary, I did some profiling on it. Not exactly reverse engineering, but I can still say that their line is quite fast, and so is their math. What is good about Mercenary is that it has 2 versions, low res and high res. By measuring FPS on both you can estimate how much time is spent in drawing and how much in math.
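
    The FPS comparison can be sketched as a small calculation. This is only an illustration in Python: it assumes (the post does not say so) that the math cost per frame is the same in both versions and that high-res drawing costs a known multiple of low-res drawing. The function name, the ratio of 2 and the FPS figures are all made up.

```python
def split_frame_time(fps_low, fps_high, draw_ratio=2.0):
    """Estimate per-frame math and drawing time from two FPS measurements.

    Assumes frame_time = math + draw, with identical math cost in both
    versions and high-res draw cost equal to draw_ratio * low-res draw cost.
    """
    t_low = 1.0 / fps_low
    t_high = 1.0 / fps_high
    draw_low = (t_high - t_low) / (draw_ratio - 1.0)
    math = t_low - draw_low
    return math, draw_low, draw_low * draw_ratio

# Hypothetical measurements: 10 FPS in low res, 8 FPS in high res.
math_t, draw_lo, draw_hi = split_frame_time(10.0, 8.0)
print(f"math {math_t:.4f}s, draw low {draw_lo:.4f}s, draw high {draw_hi:.4f}s")
```

    With real measurements the draw ratio would have to come from the actual pixel counts of the two modes.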


  5. So finally .. I made the 'up' variant (put it all in a standalone file, line3.asm) and added masking of old content for benchmarking. There were still a ton of errors I had to find and fix, but now it seems to be OK.

     

    I'm attaching the source code and a little demo.

     

    This demo cannot be directly compared to the older exes posted here. It uses a different graphics mode, and other things differ too, I suspect. But I don't have the source code of the old demos, so I can't make a comparable one.

    If you are curious what you are looking at, though: this demo draws about 680 lines per second, with lines randomly chosen from a 160x80 pixel rectangle. The older demos used at least twice as many pixels, though, so there it should be about twice as slow.

     

    There is most probably still room for some small local optimizations, especially reordering of jumps and subroutines. I wish there was some compiler which could do that, it's not really a job for humans :-D

    line2.zip

    main.xex


  6. I removed the redundant SEC and CLC, which gave me almost 2 cycles per pixel as expected. I also tried the page per line approach. I didn't really make it work, I just tested how fast it would be by replacing every adw #40 with INC and measuring the speed. It goes from 41 cycles per pixel to 37. It helps only with vertical steps of course. I guess I will run my project without this variant at the moment and use it later if needed, or generally leave this to some later stage of development.


  7. Thanks for reading the code. There are more small things like that, I guess.

    And I have another idea worth 2 cycles per pixel. Check my decision tree for the flat variant .. after each failed BCC there is SUB (in other words SEC) .. and after each successful BCC there is ADD (in other words CLC). Basically, in the whole decision tree the carry state is already what I need, and I don't have to set it or clear it.

    With the steep variant I use SUB in both cases, but that can be easily fixed.

     

    I plan to visit some more meetings in the future, so let's hope we'll meet some day !


  8. Thanks for the links. They have some good ideas, but mostly stuff that is useless at this stage. Btw. Dragonstomper, you are the author of RGB ? Czech COMRADE !? How come we didn't meet at Atariada ?

     

    Anyway: http://roger.questions.cz/atari/line2.asm (read at your own risk).

     

    Still lots of work. Mainly the up variant. It's still not implemented, so do not use this variant yet ! I also have to clean it all up .. improve comments. Make a benchmark so it can be compared to ERU's line.

     

    But I believe all cases (for the down direction) now work for flat and steep. The flat variant produces pixel by pixel the same results as my old line. The steep variant differs in the lower half, as this up/down variant is more symmetrical than what I could achieve in the old line.

     

    I added color setting. It's quite cheap, at the price of setcolor being a bit slower. Still, in my 'project' setcolor will be called only rarely.

    Basically instead of ( ora $ff ) I now have ( ora mff ). 'mff' is a zero page variable, which for color 3 contains $ff.

    I don't need all 16 variants .. I only need 10, because I don't need patterns like 11001100. You will see it clearly later when I post my zero page declarations.

    For prolog and epilog I use my old zero page table 03,0c,30,c0 .. but I made it so that the table overlaps with the m03,m0c,m30,mc0 variables, so it does not take any more space.

    On the other hand I need another table, mask2, which has the same bytes in reversed order, but I might yet solve that.

    So basically when doing SetColor, I have to set those 10 (mxx+mask) + 4 (mask2) bytes in zero page. Which is no big deal.
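
    To illustrate the SetColor bookkeeping, here is a rough Python sketch. It assumes a 2-bits-per-pixel mode with 4 pixels per byte, and it assumes that the 10 patterns are the contiguous pixel runs within a byte (which is what excluding patterns like 11001100 suggests). Only the names mff, m03, m0c, m30, mc0 and mask2 come from the post; everything else is hypothetical.

```python
def color_masks(color):
    """Return {name: byte} for the 10 contiguous pixel runs in one byte.

    Each pixel occupies 2 bits; 'color' is 0..3. The name encodes the
    pattern byte as it looks for color 3, e.g. 'mff' covers all four pixels.
    """
    fill = color * 0x55            # the color repeated in all four pixels
    masks = {}
    for start in range(4):         # first pixel of the run (0 = leftmost)
        for length in range(1, 5 - start):
            pattern = 0
            for p in range(start, start + length):
                pattern |= 0xC0 >> (2 * p)
            masks[f"m{pattern:02x}"] = pattern & fill
    return masks

def set_color(color):
    """Recompute the 10 pattern bytes plus the reversed single-pixel table."""
    masks = color_masks(color)
    single = [masks[n] for n in ("m03", "m0c", "m30", "mc0")]  # overlaid table
    return masks, single, single[::-1]                         # mask2: reversed

masks, mask, mask2 = set_color(3)
print(len(masks), hex(masks["mff"]), [hex(b) for b in mask2])
```

    Calling set_color(1) instead gives the same 14 bytes with only the low bit of each pixel set, which is the whole cost of changing color in this model.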

     

    So after all this, I did an average-case benchmark by filling an 80x80 square with lines from 0,0 to every pixel on the lower and right edges (so all angles are included). All lines are 80 pixels long.

    This is the test in which my old line does 64 cycles per pixel.

    This new line does 42 cycles per pixel in this test. Which is about a 35% boost, pretty sweet.
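
    The shape of this benchmark and the arithmetic behind the figure can be written down in a few lines of Python. This is a sketch only: it assumes coordinates run from 0 to 79 and that every line plots max(dx, dy) + 1 pixels; the function names are invented.

```python
def benchmark_endpoints(size=80):
    """End points of lines from (0,0) to every pixel on the lower and right edge."""
    last = size - 1
    bottom = [(x, last) for x in range(size)]
    right = [(last, y) for y in range(last)]   # corner is already in 'bottom'
    return bottom + right

def pixels_in_line(x, y):
    """Pixels plotted by a Bresenham-style line from (0,0) to (x,y)."""
    return max(x, y) + 1

ends = benchmark_endpoints()
assert all(pixels_in_line(x, y) == 80 for x, y in ends)

old_cpp, new_cpp = 64, 42          # cycles per pixel quoted in the post
saving = 1 - new_cpp / old_cpp     # fraction of cycles saved
print(len(ends), f"{saving:.1%}")
```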

     

    Code length of the new line is $A1C (2588) bytes, and it does not contain the up variant, so it will be almost double that. Quite a price, but perfectly acceptable for my purpose.

     

    There is also one more idea, which can make it faster still. I now use a Dlist with 80 graphics lines. I could align all lines to a page boundary. That way I would only have to INC the high byte of the VRAM address to move down.

    I have 3 buffers, each 40 bytes wide, for triple-buffering; that will leave 136 bytes free in each page, which will be a challenge to use. It will effectively cripple half of my basic 64k. Still, I have the other half for the code, and I think I might be able to store some static data, like texts, in those holes. Certainly worth a try !
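
    A small Python sketch of the page-per-line layout, assuming the three 40-byte buffers sit side by side at the start of each 256-byte page (the post does not spell out the exact placement). The base address and function name are hypothetical.

```python
PAGE = 256
LINE_BYTES = 40      # bytes per graphics line, from the post
BUFFERS = 3          # triple buffering, from the post

free_per_page = PAGE - BUFFERS * LINE_BYTES

def byte_address(base, row, buffer_index, column_byte):
    """Address of a screen byte when every row starts on its own page."""
    assert base % PAGE == 0                    # layout only works page aligned
    return base + row * PAGE + buffer_index * LINE_BYTES + column_byte

# Moving one row down changes only the high byte of the address.
one_row_down = byte_address(0x4000, 1, 0, 5) - byte_address(0x4000, 0, 0, 5)
print(free_per_page, hex(one_row_down))
```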


  9. Slow progress .. I have prolog and epilog for the flat angle variant. I mean drawing the pixels before and after the full byte patterns. I even managed to use the same loop for both prolog and epilog. I also introduced a new variable with the pattern count, which is computed at the beginning, so I can simply DEC it. So now I'm again a few cycles faster (on longer lines at least, but that's what matters for me).

    Well .. not sure if this line of mine will be the fastest line ever, but it's a good contender for the most complex one. The prolog and epilog combinations are pretty tough.

    The normal case is that you start with single pixels, then you switch to patterns, then you switch to single pixels again. But any of those phases can be omitted in some cases. I have some nasty computations before I start to draw, mainly the number of whole bytes I need to draw .. but that can be locally optimized later, and it only happens once per line.

    I hope I will have prolog for steep angles (no epilog needed here) tomorrow, so wait a bit longer for the code.


  10. So I made the steep angle variant which paints from both ends. And I hate it ! The code is really ugly. You can't use registers for anything. It's also a pain in the ALU to get it all correct. But it is faster.

     

    4 pixels per pattern gives 48/53/55 cycles per pixel (best case, avg case, worst case). Yes, that 25 cycles per pixel yesterday for the best case was some error.

     

    8 pixels per pattern gives 41/48/55 cycles per pixel.

     

    Flat variant for comparison, same method of measurement: 25/40/55 cycles per pixel. Note how the worst case is about the same no matter what method you use.

     

    The difference is over 10% on average, almost 20% in the best case. So I guess I will fly with it.

     

    Sry, no code so far, it's a mess.


  11. Hi!,

     

    If you don't want to update the mask at all positions, you can replace the code:

    DrawByte .macro x
            lda (pos),y
            ora #
            sta (pos),y

    with:

    DrawByte .macro x
            lda mask
            and #
            ora (pos),y
            sta (pos),y

    It is a little slower, but simple.

     

    For the line start/end you can simply use the old pixel-at-a-time code until the X (or Y) coordinates are divisible by 4, and then jump to the fast code, or at start you can special case the calculation of the code and jump to the middle of the already existing sequences....

     

    Yes, I didn't mean I don't know how to do it :-D

    But thanks for reading the code !


  12. So I added the steep angles variant. It does not benefit that much from the unwrap. On average it takes 53 cycles per pixel.

     

    But I want to make the steep variant draw from both ends. So there is still some room for improvement.

     

    Then I have to finish the routines with proper line start and end. I also have to somehow solve changing of colors ! That can only slow the code down, but I hope not by much. In the worst case I can have routines for every color. It's of course more of a problem in the flat variant; the steep variant works the same as my original line.

     

    Again there is a huge disparity between vertical and diagonal lines. Pure vertical needs only 25 cycles per pixel, as the mask just stays static.

     

    Current version: http://roger.questions.cz/atari/line-unwrap3.asm


  13. I rearranged the code a bit and completed the flat angle variant (still no start and end). A clean horizontal line takes 23 cycles per pixel. As it gets toward 45 degrees, it gets slower (obviously, more Y steps). Even so, the average over all 76 pixel long lines from 0 to 45 degrees is 37 cycles per pixel.

    Now the steep angles !

     

    Code (don't even try to make it work, it has too many limitations so far):

     

    http://roger.questions.cz/atari/line-unwrap2.asm


  14. Btw. I am not storing the carry flags in a byte. I have to do branches based on the carry, to modify the Bresenham error value based on it. And since I have to do branches, the branch taken is itself defined by the carry flags of the previous tests.

    So I have a binary tree of branches, 4 deep, 16 leaves. All within branch range.

    The leaf defines the pattern - and the leaf contains a single jump to a predefined pattern routine (instead of a jump table).

    If you picture the leaf index in binary, 0 means pixel and move left, 1 means pixel and move diagonally. The pattern routine has to draw the pattern, but it uses constant masks, so it's simple and fast.

    For example pattern 5 (0101) means - draw $f0, move down, draw $0f, move down.

    After that, odd pattern routines jump to one 'ending code', while even ones use a different one. I adjust the error value again and I reenter the branch tree from the beginning. Here I also decrement the pixel count and do other things which would be the same in all pattern routines.
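
    What each of the 16 leaf routines has to do can be tabulated with a short Python sketch. It assumes 2 bits per pixel with the first pixel in the top bits of the byte, and it reads the pattern from its most significant bit; the function name is made up, and the real routines are unrolled 6502 code with constant masks, not a loop.

```python
def pattern_ops(pattern):
    """Decode a 4-bit step pattern into (mask, move_down) draw operations.

    Bits are read from most significant to least significant, one per pixel:
    0 = plot and step horizontally, 1 = plot and step diagonally.
    """
    ops = []
    mask = 0
    for pixel in range(4):
        mask |= 0xC0 >> (2 * pixel)           # add this pixel to the current byte
        if (pattern >> (3 - pixel)) & 1:      # vertical step after this pixel
            ops.append((mask, True))
            mask = 0
    if mask:
        ops.append((mask, False))             # remaining pixels, no row change
    return ops

for p in (0, 5, 15):
    print(p, [(hex(m), down) for m, down in pattern_ops(p)])
```

    For pattern 5 this yields the two operations from the example: $f0 followed by a move down, then $0f followed by a move down.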


  15. O_O

     

    So .. I have my first version .. it's for flat angles only. So far I only have unwrapped code for 5 variants out of all 16. That is enough to draw my one test line (roughly at a 30 degree angle, 52 pixels long). I also haven't solved the code for the start and end of the line. I do decrement the pixel count though, so for the inner-most loop, it should be more or less final.

     

    Still .. the speed is just .. I can't believe it. I did compute it twice. Three times. It's still the same.

     

    Instead of 64 cycles per pixel, I now have 35. THIRTY FIVE. That is almost twice as fast. About 55% of the cycles, in other words about 83% faster.

     

    I might completely redo it once or twice, and the code is A MESS .. so I won't show it just yet. But soon !


  16. Drsid15.exe doesn't work for me .. do I need some specific machine settings ? I'm using Altirra.

     

    Anyway .. I didn't do anything so far, but I was mostly thinking about unwrapping the code (expanding loops) with carry accumulation.

     

    For the flat variant it will be 1 unwrap for every byte. So first there will be unwrapped code for getting 4 bits into a byte with direction changes, then there will be 16 code variants. The first bit will actually only mean whether there is an up movement before the first pixel, so there will only be 8 routines, each with 2 entry points. For flat angles the main bonus (on top of removing loop jumps) will be addressing one byte at once, and doing it using constants, no mask rotation needed.

     

    For steep angles it won't be that effective, but there are still some nice tricks. First, unwrapping alone is still a good idea. Then I could write the unwrapped routine so that it only addresses using Y, and only Y changes. I can address up to 6 lines like that. While I will have to maintain the mask, it won't change much, and in the all-vertical case I will reuse the same mask every time. So again, the bonus will be huge. I think 4 bits and 16 variants (or 8 with 2 entry points) sounds reasonable. I could go up to 6 pixels per iteration; after that the efficiency will start dropping a bit.

     

    Now what about combining this with drawing from both ends ? It clearly won't work for flat angles. There I draw byte by byte .. and the line is not symmetrical in bytes. But for steep angles I'm not limited to bytes, I solve the mask and everything, so I could use this approach. In that case I would first solve, let's say, 4 bits of directions, then I can use that to draw 8 pixels using dedicated routines.

     

    And of course, using all the small tricks so far mentioned here.

     

    Sounds like a long trip .. but the speed bonus will be crazy. Hope I will be able to do at least one variant tonight.


  17. - Instead of calculating the mask at each step, simply rotate the mask. The problem is detecting the need to increase the Y register, you could check carry after each ROR but that could be slow.

     

    - On horizontal lines, accumulate the mask in some ZP location, and copy that to the screen when you increase the Y register, this can save a lot of cycles in lines of less than 26°. This can be combined with the above, simply rotate the output mask on each cycle and.

     

    - Add/subtract the line width (40) to the Y register instead of the screen pointer, this could save 2 cycles. Also, this allows to use self-modifying code cheaply, as you increase the address in the code only on carry.

     

    - On 1bpp modes (this is not your case, but perhaps could be adapted), there is a fast line code that simply uses ROR to accumulate the carries from the delta accumulators. Then, after 8 cycles you have a "line code" of the next 8 pixels. You jump to a different routine for each code. For example, with code "0", you simply write $FF to set all pixels, with code "$40" you write $F0 to the first pixels, increase the pointer one line and write $0F, etc. This needs a lot of code (256 different cases), but it is normally a lot faster. With 2bpp modes, you could implement only 16 cases.

     

    Well, if you try to implement any of the above, tell me if it is faster :-)
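
    The carry-accumulation idea from the last point can be modelled in Python as follows. This is a sketch under stated assumptions: a flat line with 0 <= dy <= dx, an error term that starts wherever the caller puts it, and bits shifted in so that the first pixel ends up in the top bit (the bit order is an arbitrary choice here). On the 6502 the bit would come from the carry flag after the add; the function name is invented.

```python
def step_pattern(dx, dy, err, count=4):
    """Collect 'count' Bresenham step bits for a flat line (0 <= dy <= dx).

    Each pixel adds dy to the running error; reaching dx plays the role of
    the carry flag: it yields a 1 bit (vertical step) and subtracts dx.
    Returns (pattern, new_err).
    """
    pattern = 0
    for _ in range(count):
        err += dy
        bit = 1 if err >= dx else 0
        if bit:
            err -= dx
        pattern = (pattern << 1) | bit        # first pixel ends up in the top bit
    return pattern, err

# A 2:1 slope steps down after every second pixel.
pattern, err = step_pattern(dx=8, dy=4, err=0)
print(format(pattern, "04b"), err)
```

    The resulting pattern would then select one of the 16 dedicated drawing routines (or one of 256 with count=8 in a 1bpp mode).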

     

     

    I tried rotating the mask .. problem is that on the 6502 it's like txa:lsr:lsr:tax .. it's not fast. Using a simple 4 element table and X only as an index is just faster. Or the full mask table for whole lines. Result: 5% slower.

     

    As for add/subtract the line width (40) to the Y register .. I already do that. It helps, especially for vertical lines.

     

    As for accumulating the result in ZP first .. I will try that for the flat angles variant, or I can do another variant. It won't help for steep angles. But it could save a few cycles.

     

    Also the carry accumulation sounds interesting. It could allow some code unwrapping and some local optimizations.

     

    All those variations ! :-D


  18. I made a small demo, but it only draws 2 lines, each 50 pixels long. Good for profiling cycles per pixel. I hope making random lines would be no problem for you.

    It is here: http://roger.questions.cz/atari/line.zip

    Main file is main.asm, the rest is included. It is the structure of my bigger project, I only removed unnecessary code. So it does, for example, turn ROM off. I can't turn it on quickly, as I'm not all too experienced with those shadow registers; I hope you can make it faster if you need it.

     

    So far I tried 2 modifications. First, using a table for masks and byte offsets for each X, as in erudraw. That actually didn't help. The reason is this. When I move down, I do not increment the pointer in zero page. I first add 40 to Y. Only when that overflows do I go to zero page. This way I'm saving a few cycles, but I can't use Y with the table. So while the table approach is marginally faster for flat angles (less than 45 degrees), it is 10% slower on steep angles. Together it is a few % slower.
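
    The "add 40 to Y first" trick can be modelled in Python like this. It is only a sketch: ptr stands for the 16-bit zero page pointer, y for the 8-bit Y register, and the accessed byte is ptr + y as with (ptr),y addressing. The start address and names are hypothetical.

```python
LINE_BYTES = 40

def move_down(ptr, y):
    """Step one screen line down using the Y index first.

    Only when the 8-bit Y overflows is the pointer itself touched
    (its high byte is incremented), which keeps ptr + y correct.
    """
    y += LINE_BYTES
    if y > 0xFF:              # carry set after the 8-bit add
        y &= 0xFF
        ptr += 0x100          # INC pointer high byte
    return ptr, y

ptr, y = 0x4000, 10
pointer_updates = 0
for _ in range(20):
    new_ptr, y = move_down(ptr, y)
    pointer_updates += new_ptr != ptr
    ptr = new_ptr
print(hex(ptr + y), pointer_updates)
```

    In this run the zero page pointer is touched on only a few of the 20 steps; all the other steps are a plain add on Y. It also shows why Y is no longer free to index a per-X table.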

     

    Then I tried drawing symmetrically from both ends. For this it's clear I need 2 pointers, two masks .. so there is no way I could do that in registers. So the table approach is actually better for this ! And it actually seems to be faster ! For flat angles the speed increase is the largest - about 10%. For steep it's only marginal. But that gives me like 5% together. The code is a bit longer, and I will have to correctly solve the last middle pixel (which I don't do at the moment). But it seems to be an improvement. It will take me some time to make it complete.
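
    Drawing from both ends can be sketched in Python as one error accumulator driving two mirrored cursors, so the Bresenham decision is made only once per pair of pixels. This is an illustration for flat lines only (0 <= dy <= dx), not the actual routine; the function name and the starting error of dx // 2 are assumptions.

```python
def line_from_both_ends(x0, y0, x1, y1):
    """Flat line (0 <= y1-y0 <= x1-x0) plotted symmetrically from both ends."""
    dx, dy = x1 - x0, y1 - y0
    assert 0 <= dy <= dx
    pixels = set()
    err = dx // 2
    ax, ay = x0, y0            # cursor walking forward from the start
    bx, by = x1, y1            # mirrored cursor walking back from the end
    while ax < bx:
        pixels.add((ax, ay))
        pixels.add((bx, by))
        err -= dy
        ax += 1
        bx -= 1
        if err < 0:            # one error update drives both cursors
            err += dx
            ay += 1
            by -= 1
    if ax == bx:               # odd pixel count: single middle pixel
        pixels.add((ax, ay))
    return pixels

pts = line_from_both_ends(0, 0, 9, 3)
print(len(pts), sorted(pts))
```

    The final if is the "last middle pixel" case: with an odd number of pixels the two cursors meet on the same column and exactly one extra pixel has to be plotted.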

     

    As for the C64 code I've seen .. they use some crazy techniques with precomputed blocks of bytes. It seems to be a lot faster .. but they only talk about 8 pixels per byte, where the balance can be rather different. Still, I don't understand it all fully yet, so I'm not saying it's not usable.
