bugbiter Posted November 4, 2014

Back in the eighties I tried to do a Star Wars opening crawl in Graphics 14 with Turbo Basic and assembler. Even though I had very little knowledge of assembler programming back then, I managed to pull it off with MAC/65.

Crawl.zip

The routine worked, but I was quite disappointed with the framerate. Recently I recoded the scaling routine for the crawl. Back in the day I didn't even know about the status register, so I always put a CMP before branches (CMP #0's everywhere! :-) ). Mostly I restructured the central loop, inlined some subroutines to get rid of unnecessary JSR/RTS pairs, removed those CMP #0s, and so on. In the end the routine hasn't sped up a bit; I thought it took fewer cycles now, but I must have miscounted.

I am thinking about moving the code into the zero page, or using self-modifying code, but I don't know much about either, nor how much they can speed things up. So I ask you hardcore coders: what are the big secrets of writing superfast code? I see your insane demos and can't understand how they're done. What are the big points?

Martin
Rybags Posted November 5, 2014

"Use a bigger hammer." The two biggest hammers are self-modifying code and (sometimes massively) unrolled loops. For a scrolling/scaling situation, the biggest hammer is brute force: using separate instructions to directly address every point involved in the operation - i.e. the routine that moves the screen data around might take more memory than the screen itself. Executing from zero page is another big hammer, although you're very restricted in how much you can do from there.

Re your situation: even though TBXL is way faster than standard BASIC, in many time-critical game/demo scenarios it'll be practically as useless for the task.
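The unrolling hammer is usually swung with a code generator rather than by hand. A hypothetical Python sketch (nothing from this thread; names are made up) that emits straight-line 6502 source for a byte copy - trading the loop's index/branch overhead (roughly 14 cycles per byte for a lda abs,x / sta abs,x / inx / bne loop) for 8 cycles per byte of absolute loads and stores:

```python
# Sketch: generating an unrolled 6502 copy as source text.
# Each byte gets its own absolute lda/sta pair - no index register,
# no branch, at the cost of 6 bytes of code per byte copied.
def unroll_copy(src, dst, nbytes):
    """Emit straight-line 6502 source that copies nbytes bytes."""
    lines = []
    for i in range(nbytes):
        lines.append(f"    lda ${src + i:04X}")
        lines.append(f"    sta ${dst + i:04X}")
    return "\n".join(lines)

print(unroll_copy(0x4000, 0x5000, 4))
```

The generated text would then be assembled (or the generator reworked to poke opcodes into memory directly).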
flashjazzcat Posted November 5, 2014 (edited)

Look-up tables. You can completely eliminate bit rotation if you throw enough memory at LUTs.
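As an illustration of the kind of LUT meant here (a hypothetical sketch, assuming a 1 bpp, 160-pixel, 20-bytes-per-line mode like Gr. 14): precompute each pixel's byte offset and bit mask once, so plotting never divides or shifts at run time.

```python
# Tables replace run-time shifting: for pixel x, which byte and which bit.
BYTE_OF = [x >> 3 for x in range(160)]           # x / 8
MASK_OF = [0x80 >> (x & 7) for x in range(160)]  # bit within that byte

def plot(screen, line_base, x):
    """Set pixel x on a line using the tables (no shifts at plot time)."""
    screen[line_base + BYTE_OF[x]] |= MASK_OF[x]

scr = bytearray(20)   # one 160-pixel line
plot(scr, 0, 0)       # leftmost pixel -> $80 in byte 0
plot(scr, 0, 9)       # -> $40 in byte 1
```

On the 6502 the same idea becomes two table reads indexed by the pixel number, which is exactly what eliminates the ROL/ROR chains.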
phaeron Posted November 5, 2014

Fundamentally, doing any blit operation a pixel at a time is going to be slooooowwww. There's an old trick: if you reverse the bits of an ascending integer sequence, you get a pattern for iteratively shortening scanlines by deleting one pixel at a time: $80, $40, $C0, $20, $A0, $60, $E0, etc. It basically keeps deleting a pixel from the middle of one of the longest remaining spans. Using this trick, a lot of unrolling, and switching to a 1-bit mode, it should be possible to reduce the horizontal scaling to little more than a run of ROL abs,X or ROR abs,X instructions per scanline per frame, shifting pixels inward to incrementally shrink each scanline. Vertical scaling and scrolling can be handled with display list LMS instructions, but this means more scanlines have to be scaled horizontally than are actually displayed each frame.
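The bit-reversal trick is easy to verify with a few lines of Python (a sketch, not code from the thread):

```python
# Bit-reversing an ascending counter yields the order in which to delete
# pixels so that each deletion splits one of the longest remaining spans
# roughly in half - a cheap stand-in for proper error-diffused scaling.
def bit_reverse8(n):
    """Reverse the 8 bits of n."""
    r = 0
    for _ in range(8):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

seq = [bit_reverse8(i) for i in range(1, 8)]
print([hex(v) for v in seq])  # the $80, $40, $C0, $20, $A0, $60, $E0 pattern
```

On the real machine the sequence would of course live in a precomputed table, indexed by how many pixels have been deleted so far.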
bugbiter Posted November 5, 2014

The time lost to Turbo Basic is negligible considering that right now the USR routine takes more than 2 seconds to build one screen. Switching the whole thing to pure assembler would be no problem, but I want to know what I can gain inside the ML routine itself.

I already use lookup tables for division by 20 and for the mask bits used when picking and plotting pixels.

Massive unrolling... I guess I could unroll the loop once per line, which would make about 160 copies - possible, but there isn't much per-iteration overhead to be saved that way. Unrolling per pixel? That would mean 12,000 copies of the loop body - no way!

If I switch off the OS completely, I think I could fit the critical routine into page zero. As far as I know this does not affect the number of cycles each instruction takes, or does it?

Self-modifying code: am I right that this mainly means replacing indirect indexed addressing - LDA (zp),Y / STA (zp),Y - with absolute addressing - LDA addr - and changing addr in the code instead of updating the zero-page vector? The indirect form takes 5-6 cycles, the absolute form 4, and zero-page addressing 3. Well, that could be worth a try... Are there any other tricks with self-modifying code?
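The potential win is easy to put a number on. A back-of-envelope sketch using standard 6502 data-sheet timings and the ~12,000-pixel figure above, assuming one store per plotted pixel (a simplification - loads, masking and loop overhead come on top):

```python
# Standard 6502 store timings by addressing mode:
# sta (zp),y = 6 cycles, sta abs = 4, sta zp = 3.
CYCLES = {"sta_ind_y": 6, "sta_abs": 4, "sta_zp": 3}
PIXELS_PER_FRAME = 12_000  # figure quoted above

def frame_savings(old_mode, new_mode, n=PIXELS_PER_FRAME):
    """Cycles saved per frame by switching one store per pixel."""
    return (CYCLES[old_mode] - CYCLES[new_mode]) * n

print(frame_savings("sta_ind_y", "sta_abs"))  # 24000
```

24,000 cycles is real but small next to a multi-second frame (~2 million cycles per second on the 6502), which is why restructuring the algorithm matters more than shaving addressing modes.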
Heaven/TQA Posted November 5, 2014

Unrolling, and SMC (Oxyron jargon for Self-Modifying Code).

http://madteam.atari8.info/index.php?prod=fx
www.codebase64.org
Rybags Posted November 5, 2014

You shouldn't need to operate at the pixel level for shift + downscaling. Work at the byte level, so either 4 or 8 pixels per operation for most modes. Unless you really need the extra colour, a 1 bpp mode will make things a lot easier. If you want stars and such behind the text, you could use PMG objects; throw a few DLIs in and suddenly you have dozens at little cost.
bugbiter Posted November 5, 2014

That's exactly what I did :-) It's Gr. 14, so it's only 1 bpp. The stars are randomly spread PM spots; DLIs will be added later :-)

Basically, the routine picks single pixels out of the source picture (given the horizontal pixel number, the bit mask is looked up), then puts each pixel in a row in the output, where I can just cycle the bit mask through 128, 64, 32, 16, 8, 4, 2, 1. All the spots to pick from are taken from precalculated values.

Well, I'll move the routine into page zero and see what happens. I only have to shut down the system VBI to use all of page zero, right?
Rybags Posted November 5, 2014

You can do it without a lookup table - I looked into doing an SW scroller once. All you're doing is moving lines up, rotating the bytes, and selectively throwing away 2 pixels per row. The cheap/easy way is to shift part of each half line and just move the rest, which automatically throws a pixel away. Make sure the position of the discarded pixel moves around from line to line, so you're not always munching away at the same part of the text.

I was going to use brute-force coding, though - it becomes a little different with loops, since you have to do conditional stuff. Another option is a partial unroll: if you use a loop per scanline, you can use indexed rather than indirect addressing, which speeds things up.

Your best bet might be to do some cycle counting first. Work out how many bytes are involved in total, then guesstimate how many will be rotated and how many just moved. With that in hand you can get a rough cycle count - take the DMA overhead of the mode into account too.

Also, by starting the operation partway down the screen, it's more likely to complete before any screen corruption becomes visible. For an SW scroller you'd probably want to start just after the first visible line of the effect; if you get tearing, start a bit later.
bugbiter Posted November 5, 2014

Hmm, there is only one line where 2 pixels get thrown away, somewhere at the bottom of the screen. Every line has a different step size for picking pixels, and all the pixels that are not picked get thrown away: in the middle rows every second pixel is dropped, near the top almost every pixel is dropped. The step size is a 24-bit value from a precalculated table.

Your comment gives me an idea, though: in the bottom lines, where whole chunks of pixels stay together, those chunks could be bit-shifted and copied bytewise instead of copying every single pixel. That could be useful - the wide bottom lines take the most time to build.

Screen corruption? I don't have screen corruption. I use double buffering: the new frame is built in the invisible screen while the finished frame is displayed. Building one frame can take more than 100 jiffies, so there's no point trying to do the job in the vertical blank.
Rybags Posted November 5, 2014

OK... from the way you describe the process, it sounds like you're doing pick/place scaling based on the original rendered size? The way I describe it, it's an iterative process where you reuse existing data (except for the very bottom line, which is freshly rendered). If you're rendering everything from scratch each time, you're going to have speed issues regardless. The advantage of rendering everything afresh is that you could scroll both up and down, but as you've already found, the speed is just too slow on our old sub-2 MHz machines.

I've been putting some thought into a scaling process that mixes table lookup with throwing away pixels, but it's for something different and mightn't work well in an SW scroller. My suggestion would be the "fast" method: reuse existing data with a mix of shifts and moves to throw away those 2 pixels per scanline, and only render the bottom scanline from scratch.
Heaven/TQA Posted November 5, 2014

Basically, if you fill the zero page, it's better to cut off the OS completely.
bugbiter Posted November 5, 2014

But I have to render every line from scratch, because every line has its own source data. If I reused the same data for every line and threw away some pixels for each successive line, every line would show the same content. But each line has different content, so I have to fetch different data for each line; I can't reuse the data from the previous line.

I find it hard to make clear what I'm actually doing, so let me try to explain more graphically. The source is a 1 bpp bitmap in memory containing the text (160 pixels wide, 20 bytes per source line, any number of source lines - you can crawl any image, not just text). This represents my source image; a1 is the first pixel of the top line, f9 the last pixel of the last line:

a1 a2 a3 a4 a5 a6 a7 a8 a9
b1 b2 b3 b4 b5 b6 b7 b8 b9
c1 c2 c3 c4 c5 c6 c7 c8 c9
d1 d2 d3 d4 d5 d6 d7 d8 d9
e1 e2 e3 e4 e5 e6 e7 e8 e9
f1 f2 f3 f4 f5 f6 f7 f8 f9

The routine takes the last line first, with a step size of 1: it takes every pixel and puts it on the screen, one after another, and no pixel is thrown away. Then it looks up which source line comes next - line e, no line omitted - now with a step size of 1.25, so every 4th pixel gets thrown away. Then line d is scanned with step size 1.5, so every 3rd pixel is dropped. Then one line is omitted, and line b is scanned at every other pixel, step size 2... and so on.

Result:

b1 b3 b5 b7 b9
d1 d2 d4 d5 d7 d8 d9
e1 e2 e3 e5 e6 e7 e9
f1 f2 f3 f4 f5 f6 f7 f8 f9

For each frame, the scanning process starts one source line further down. I have to scan each line on its own, with its own step factor, and I can't prescale the source, because each text line must appear at every possible size on screen as the text crawls.
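The pick-and-place scheme described above can be simulated in a few lines of Python (a sketch of the algorithm, not the actual 6502 routine):

```python
import math

# Each output line scans its own source line with its own step size;
# pixels the step skips over are simply thrown away.
def scale_line(src, step):
    """Pick every step-th pixel of src (step may be fractional)."""
    out_len = math.ceil(len(src) / step)
    return [src[int(i * step)] for i in range(out_len)]

line_b = ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8", "b9"]
print(scale_line(line_b, 2))  # ['b1', 'b3', 'b5', 'b7', 'b9']
```

On the real machine the `int(i * step)` positions come from the precalculated 24-bit step tables, so no multiplication happens per pixel.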
Rybags Posted November 5, 2014

I get it. But the method I suggest does much the same: the operation does scrolling + shifting + rendering in one pass. The only new data is the 160 pixels at the bottom; the remainder comes from taking the line underneath and shifting it so that 2 pixels get thrown away. In effect it's what you're doing anyway, just more efficient. When you scale an image (disregarding filtering/colour averaging, which we wouldn't do anyway) so that the result is 1 pixel narrower than the original, all you're doing is choosing a single pixel to throw away.

The method I describe would let you easily scroll once per frame, with cycles to spare for other stuff. The pick-and-place method you're using now has no hope of running at full motion unless you restrict it to a ridiculously small area. The fact of the matter is that the machine isn't quick enough.
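The iterative alternative suggested here can be sketched the same way (hypothetical Python; the choice of where the dropped pixel lands is an assumption - the point is only that it rotates from line to line, and the real routine would work with byte shifts and moves, not per-pixel lists):

```python
# Build each line from the line below it by deleting a pixel whose
# position moves around, so the erosion isn't always munching the
# same spot in the text.
def shrink_once(line, k):
    """Return line with one pixel dropped at a position varying with k."""
    pos = (k * 7) % len(line)  # arbitrary rotating position (assumption)
    return line[:pos] + line[pos + 1:]

line = list(range(16))   # freshly rendered bottom line
rows = [line]
for k in range(1, 8):    # each row above reuses the row below it
    rows.append(shrink_once(rows[-1], k))
print([len(r) for r in rows])  # each row one pixel shorter than the last
```

This is why only the bottom line needs rendering from scratch: every other row is a cheap edit of existing data.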
Heaven/TQA Posted November 5, 2014

Y-scaling can be done with the display list. For X-scaling I would use unrolled code for each line size.
bugbiter Posted November 5, 2014

Finally I understand your method. It would be faster, for sure, but I think it would look quite different. It's not true that my method throws away 2 pixels from one line to the next: pixels thrown away in one line can reappear in the next, while others are left out. My method shows all the aliasing effects that are typical of pick-and-place scaling without interpolation.
flashjazzcat Posted November 5, 2014

Ah: so your scaling isn't "progressive", so to speak: you're always working from the same full-size master copy of the line, but scaling by differing factors depending on where the line ends up. I'm not sure how much you'd save by scaling down already-scaled lines instead of starting with the full-size version each time: a measurable amount, but I don't know if it would be drastic enough. Quite an interesting problem.
Rybags Posted November 5, 2014

Speed will still be a problem - in an SW scroller you're dealing with 25% of the screen. Assume a 160x100 screen: that's 16,000 pixels, so 4,000 pixel-level iterations per frame for pick/place. At the byte level, doing shift/move, it reduces to 500 iterations. Pick/place allows for more effects, but then that speed problem crops up again.

The other thing is that, especially at 160-pixel horizontal resolution, the detail of the text is lost fairly quickly as it shrinks, so the more "correct" scaling algorithm becomes less important anyway.
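The iteration counts quoted above, spelled out as arithmetic (a sketch of the estimate, assuming 1 bpp so 8 pixels per byte and 25% of the area touched per frame):

```python
# A 160x100 1 bpp area, updating a quarter of it per frame.
width_px, height = 160, 100
pixels = width_px * height                 # total pixels
pixel_ops = pixels // 4                    # pick/place, one op per pixel
byte_ops = (width_px // 8) * height // 4   # shift/move, one op per byte
print(pixels, pixel_ops, byte_ops)         # 16000 4000 500
```

An 8x reduction in iterations per frame, before even counting the cheaper per-iteration cost of a byte shift versus a pick-mask-plot sequence.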
bugbiter Posted November 5, 2014

Quite agree - only the bottom 2-3 text lines are legible anyway. I would like to see how your approach would look. What I like about my method is that it is mathematically "correct", albeit relatively slow. I would be happy to achieve 1 frame per second :-)
bugbiter Posted November 5, 2014

Exactly. No iterative processing of the screen data: every frame is freshly calculated, and the source bitmap never gets changed.
pseudografx Posted November 5, 2014

You might save some cycles by switching to the narrow playfield at the point where the text is narrow enough (128 pixels or fewer).
dmsc Posted November 7, 2014

Hi! Plotting all the points with precomputed tables is not that slow; see this (very simple, and random) example. The inner loop is:

drawLine:
        ; Set up display pointer
        lda posy
        clc
        adc #20
        tay
        jsr plotY

        ; Set up shrink table pointer
        ldy posy
        lda shrinkTabL,y
        sta pnt1
        lda shrinkTabH,y
        sta pnt1+1

        ; Set up point source pointer
        lda posy
        clc
        adc cnt
        and #63
        tay
        lda pointTabL,y
        sta pnt2
        lda pointTabH,y
        sta pnt2+1

        ; Clear the line
        jsr clrLine

        ; Draw points
        lda #0
        sta posz
drawPoint:
        ldy posz
        lda (pnt2),y
        beq endPnt
        tay
        lda (pnt1),y
        tax
        jsr plotX
        inc posz
        bne drawPoint
endPnt:
        ; Next line
        dec posy
        bne drawLine

Daniel.

scroll.zip
Heaven/TQA Posted November 7, 2014

Quite fast, even though it's not yet optimised to the full...
bugbiter Posted November 7, 2014

Basically, I'm doing that already. I even fetch the start byte and the bit position of the first plotted pixel in each line from a table to avoid calculations; from there the plot mask cycles through 128, 64, 32... for each following pixel in that line. The picking positions also come from a table, but their absolute locations have to be partly calculated, as they are not fixed relative to the source bitmap data.

I could post the source - it's not that long. Maybe together we could squeeze out some more cycles.
bugbiter Posted November 7, 2014

Yesterday I tried something I'd never done before: I wrote a little benchmark routine (filling a Gr. 8 screen with different patterns) to run in page zero. I thought zero-page code would run faster because some instructions take fewer cycles. Result: no visible increase in speed compared to running from higher memory. Disappointed!!
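One likely explanation, per standard 6502 behaviour: the cheaper "zero page" cycle counts apply to instructions whose *operands* are in page zero (zero-page addressing modes), not to code that merely *resides* there - fetching an opcode costs the same wherever the program counter points. A toy illustration (hypothetical Python, standard data-sheet timings):

```python
# Cycle cost depends on the operand's addressing mode, not on where
# the code itself lives; relocating code to page zero changes nothing.
CYCLES = {
    ("lda", "abs"): 4, ("lda", "zp"): 3,
    ("sta", "abs"): 4, ("sta", "zp"): 3,
    ("inx", "impl"): 2,
}
# The same instruction stream, "executed" from $0080 vs from $4000:
program = [("lda", "abs"), ("sta", "abs"), ("inx", "impl")]
cost_from_zp_code = sum(CYCLES[i] for i in program)  # code in page zero
cost_from_high = sum(CYCLES[i] for i in program)     # code at $4000
print(cost_from_zp_code, cost_from_high)  # identical
```

The win from page zero comes from putting *data* (pointers, counters, self-modified operands) there, so the zp addressing modes apply.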