Secrets of Speed

bugbiter · November 4, 2014

Back in the eighties I tried to do a Star Wars opening crawl in Graphics 14 with Turbo Basic/assembler.

Even though I had very little knowledge about assembler programming then I managed to pull it off with Mac/65.

The routine worked, but I was quite disappointed with the framerate.

Recently I recoded the scaling routine for the crawl - Back in the days I didn't even know about status registers so I always put a CMP before branches (CMP #0's everywhere! :-) Mostly I restructured the central loop, integrated some subroutines into linear code to get rid of unnecessary JSR/RTS calls, removed those CMP #0s and so on.

In the end the routine hasn't sped up a bit - I thought I took less cycles now, but I must have miscounted.

I am thinking about moving the code into zeropage, or using self-modifying code but I don't know much about that nor how much that can speed up the code.

So I ask you hardcore coders - what are the big secrets of writing superfast code? I see your insane demos and can't understand how that's done! What are the big points???

Martin

Rybags · November 5, 2014

"Use a bigger hammer".

The 2 biggest hammers are self-modifying code and (sometimes massively) unrolled loops.

For a scrolling/scaling situation the "biggest hammer" is brute force using seperate instructions to directly address every point involved in the operation - ie the routine to move the screen stuff around might take more memory than the screen itself.

Executing from z-page is another big hammer although you're very restricted on how much you can do from there.

Re your situation - even though TBXL is way faster than normal Basic, in many critical game/demo scenarios it'll be practically as useless for the task.

flashjazzcat · November 5, 2014

Look-up tables. You can completely eliminate bit rotation if you throw enough memory at LUTs.

Edited November 5, 2014 by flashjazzcat

phaeron · November 5, 2014

Fundamentally, doing any blit operation a pixel at a time is going to be sloooooowwww.

There's an old trick where if you reverse the bits of an ascending integer sequence, you get a pattern to iteratively shorten scanlines by deleting one pixel at a time: $80, $40, $C0, $20, $A0, $60, $E0, etc. It basically keeps deleting a pixel from the middle of one of the longest spans. Using this trick, a lot of unrolling, and switching to a 1-bit mode, it should be possible to reduce the horizontal scaling to a bit more than a run of ROL abs,X or ROR abs,X instructions per scanline per frame to shift pixels inward and incrementally shrink down each scanline. Vertical scaling and scrolling can be handled by display list LMS instructions, but this means that more scanlines have to be scaled horizontally than are actually being displayed each frame.

bugbiter · November 5, 2014

The time loss from turbo basic is negligible considering that right now the usr function takes more than 2 seconds to build one screen.

There will be no problem with switching the whole thing to complete assembler version, but I want to know what I can gain in the ml routine by that.

I use a lookup table already for division by 20 and for getting the mask bits for picking and plotting pixels

Massive unrolling... I guess i could unroll the loop for each line, that would make about 160 times, which could be possible but there is not much overhead for each of those loops that could be saved.

Unrolling for each pixel..? That would mean 12.000 loop codes... No way!

If I switch off the os completely I think I could fit the critical routine into page zero. As far as i know this does not affect the amount of cycles that each intruction will take, or does it?

Self modifying code: am I right that this means mainly replacing indirect indexing addressing sta or -lda (z),y- by absolute -lda mm- and changing mm in the code instead the zeropage vector. The first takes 6 cycles, the latter 5, in zeropage code 4 cycles. Well, that could be worth a try...

Are there any other tricks with self modifying code?

Heaven/TQA · November 5, 2014

unrolling, runrolling and SMC (Oxyron word = Self Modifiying Code).

http://madteam.atari8.info/index.php?prod=fx

www.codebase64.org

Rybags · November 5, 2014

You shouldn't need to do operations at the pixel level for shift + downscaling. It's at the byte level, so either 4 or 8 pixels for most modes.

Unless you really need the extra colour, then a 1 bpp mode will make things a lot easier. If you want stars and stuff behind then you could use PMG objects, throw a few DLIs in and suddenly you have dozens at little cost.

bugbiter · November 5, 2014

Thats exacly what I did :-) it's Gr14 so it's only 1bpp. The stars are randomly spread out PM spots. DLIs will be added later :-)

Basically the routine picks out single pixels from the source picture (Horizontal pixel number given. Bitmask is looked up) then puts that pixel in a row in the output. There I can just cycle the bitmask 128,64,32,16,8,4,2,1,

All the spots where to pick are taken from precalculated values.

Well, I'll move the routine into page0 and see what happens. I only have to shut down the system VBI to use all of page 0, right?

Rybags · November 5, 2014

You can do it without need for lookup table - I looked into doing an SW scroller, all you're doing is moving lines up, rotating the bytes and selectively throwing away 2 pixels per row.

The cheap/easy way is to just shift part of each half line and move the rest which automatically throws the pixel away.

Ensure that the pixel that gets thrown away moves around each line so that you're not just munching away at the same part of the text.

But I was going to use brute force coding - it becomes a little different if using loops since you have to do conditional stuff.

Another option could be a partial unroll - if you use loops for each scanline then you can use indexed rather than indirect addressing which speeds it up.

Best bet might be to do some cycle counting first. Work out how many bytes are involved in total, then guestimate how many will be rotated, how many will just be moved only.

With that info in hand, you can get a rough cycle count. Take the DMA of the mode into consideration too.

Also, by starting the operation part way down the screen, it's more likely to complete before screen corruption is evident. For a SW scroller you'd probably want to start just after the first visible line of the effect. If you get tearing, just start a bit later.

bugbiter · November 5, 2014

You can do it without need for lookup table - I looked into doing an SW scroller, all you're doing is moving lines up, rotating the bytes and selectively throwing away 2 pixels per row.

Hmm, there is only one line where 2 Pixels get thrown away, somewhere at the bottom of the screen. In every line there is a different step size for picking pixels All the pixels that are not picked are thrown away. in the middle row every second pixel gets thrown away, in the top almost every pixel gets thrown away. The step size is a 24 bit value from a precalculated table.

Your comment gives me an idea to avoid copying every single pixel in bottom lines where whole chunks of pixels will stay together. Those chunks of pixels could faster be bitshifted and copied bytewise ..

This could be useful, the wide bottom lines take the most time to build.

Screen corruption?

I don't have screen corruption. I use 2 screen flipping, the new frame is built in the invisible screen, the finished frame is displayed. Building one frame can take more that 100 jiffies so there is no use trying to do the job in VB.

Rybags · November 5, 2014

OK... by the way you describe the process, it sounds like you're doing pick/place scaling based on the original rendered size?

The way I describe, it's an iterative process where you use existing data (except the very bottom line which is a fresh rendering).

If you're rendering everything from scratch each time, then you're going to have speed issues regardless. The advantage you have of rendering everything afresh is that you could scroll both up and down but as you've found out already the speed is just too slow on our old sub-2Mhz machines.

I've been putting some thought into a scaling process that uses a mix of table lookup and throwing away pixels, but it's for something different and mightn't work well in a SW scroller.

My suggestion would be to just use the "fast" method in that you use existing data with a mix of shifts and moves to throw away those 2 pixels per scanline, and only need render the bottom scanline from scratch.

Heaven/TQA · November 5, 2014

basicly if you fill ZP then better to cut off OS completly.

bugbiter · November 5, 2014

But I have to render every line from scratch because every line has its own source data.

If I reuse the same data for every line and throw away some pixels for each next line, I would have the same visible content in every line.

But I have different content in each line, so I have to fetch different data for each line. I can't reuse the data from the last line.

I find it hard making myself clear what I'm actually doing, so I try to explain more graphically.

The source is a bitmap in memory with the text contained as a 1bpp Picture. (160 pixels wide, 20bytes per source line, any number of source lines)

(You can crawl any Image, not just text).

this represents my source image. a1 is the first pixel in the top line, f9 is the last pixel in the last line

a1 a2 a3 a4 a5 a6 a7 a8 a9

b1 b2 b3 b4 b5 b6 b7 b8 b9

c1 c2 c3 c4 c5 c6 c7 c8 c9

d1 d2 d3 d4 d5 d6 d7 d8 d9

e1 e2 e3 e4 e5 e6 e7 e8 e9

f1 f2 f3 f4 f5 f6 f7 f8 f9

The routine takes the last line first with a step size of 1, so it takes every pixel and puts it on the screen, one after another. No pixel gets thrown away.

Then it looks which source line comes next, it is line e, no line gets omitted

Now the step size is 1.25, every 4th pixel gets thrown away

Then line d gets scanned with step size 1.5, every 3rd pixel gets thrown away.

Then one line gets omitted, line b gets scanned with every other pixel, step width 2

..and so on.

result:

b1 b3 b5 b7 b9

d1 d2 d4 d5 d7 d8 d9

e1 e2 e3 e5 e6 e7 e9

f1 f2 f3 f4 f5 f6 f7 f8 f9

for each frame the scanning process stars at a source line further down.

I have to scan each line for its own with its own step size factor.

And I can't prescale the source, because each text line must appear in all possible sizes on the screen while the text is crawling.

Rybags · November 5, 2014

I get it. But the method I suggest does much the same.

The operation does scrolling + shifting + rendering. The only new data would be the 160 pixels at the bottom. The remainder is by taking the line underneath, shifting such that 2 pixels get thrown away.

In effect it's what you're doing anyway but is more efficient.

When you scale an image (disregarding filtering/colour averaging which we wouldn't do anyway) such that the finished product is 1 pixel smaller than the first, all you're doing anyway is choosing a single pixel to throw away.

The method I describe will allow you to easily scroll once per frame with cycles to spare for other stuff. The method you're using now (pick and place) has no hope of working in full-motion unless you restrict it to some ridiculously small area. The fact of the matter is that the machine isn't quick enough.

Heaven/TQA · November 5, 2014

y-scaling can be done by DLIST.

x-scaling I would take unrolled code for each line-sizes.

bugbiter · November 5, 2014

I get it. But the method I suggest does much the same.

The operation does scrolling + shifting + rendering. The only new data would be the 160 pixels at the bottom. The remainder is by taking the line underneath, shifting such that 2 pixels get thrown away.

In effect it's what you're doing anyway but is more efficient.

When you scale an image (disregarding filtering/colour averaging which we wouldn't do anyway) such that the finished product is 1 pixel smaller than the first, all you're doing anyway is choosing a single pixel to throw away.

The method I describe will allow you to easily scroll once per frame with cycles to spare for other stuff. The method you're using now (pick and place) has no hope of working in full-motion unless you restrict it to some ridiculously small area. The fact of the matter is that the machine isn't quick enough.

Finally I understand your method. It would be faster for sure. But I think it would look quite different.

It's not true that my method throws away 2 pixels from one line to the next. Pixels that got thrown away in one line can reappear in the next, but others are left out.

My method shows all the aliasing effects that are typical of pick-and-place scaling without interpolation.

flashjazzcat · November 5, 2014

Ah: so your scaling isn't "progressive", so to speak: you're always working from the same full-size master copy of the line, but scaling by differing factors depending where the line ends up. I'm not sure how much you'd save by scaling down already scaled lines instead of starting with the full-size version each time: a measurable amount, but I don't know if it would be drastic enough. Quite an interesting problem.

Rybags · November 5, 2014

Speed will be a problem still - a SW scroller you're dealing with 25% of the screen.

Assume a 160x100 screen, that's 16,000 pixels. Doing 4,000 iterations for pick/place at the pixel level. At the byte level doing shift/move it reduces to 500 iterations.

Pick/place allows for more effects though, but that speed problem crops up again.

The other thing is, especially at 160h res - the detail of the text starts to become lost fairly quickly, so the more "correct" scaling algorithm becomes less important anyway.

bugbiter · November 5, 2014

quite agree, only the bottom 2-3 text lines are legible anyway. It would like to see how your approach would look like...

on my method, I like that it is mathematically 'correct' albeit relatively slow. I would be happy to achieve 1 frame per second.. :-)

bugbiter · November 5, 2014

Ah: so your scaling isn't "progressive", so to speak: you're always working from the same full-size master copy of the line, but scaling by differing factors depending where the line ends up. I'm not sure how much you'd save by scaling down already scaled lines instead of starting with the full-size version each time: a measurable amount, but I don't know if it would be drastic enough. Quite an interesting problem.

Exactly. No iterative processing on the screen data. Every frame is freshly calculated and the source bitmap never gets changed.

pseudografx · November 5, 2014

You might save some cycles by switching to narrow screen at a point where the text is narrow enough (128 and fewer pixels).

dmsc · November 7, 2014

Hi!

Plotting all the points with pre-computed tables is not that slow, see this (very simple, and random) example. The inner loop is:

drawLine:

 ; Setup display pointer
 lda posy
 clc
 adc #20
 tay
 jsr plotY

 ; Setup shrink table pointer
 ldy posy
 lda shrinkTabL,y
 sta pnt1
 lda shrinkTabH,y
 sta pnt1+1

 ; Setup point source pointer
 lda posy
 clc
 adc cnt
 and #63
 tay
 lda pointTabL,y
 sta pnt2
 lda pointTabH,y
 sta pnt2+1

 ; Clean line
 jsr clrLine

 ; Draw points
 lda #0
 sta posz

drawPoint:
 ldy posz

 lda (pnt2),y
 beq endPnt
 tay
 lda (pnt1),y
 tax
 jsr plotX

 inc posz
 bne drawPoint

endPnt:
 ; next line
 dec posy
 bne drawLine

Daniel.

scroll.zip

Heaven/TQA · November 7, 2014

quite fast even not yet optimised to full edge...

bugbiter · November 7, 2014

Basically, I'm doing that already. I even fetch the start byte and the bit position for the first screen plotpixel in each line from a table to avoid calculations. From there the plot mask bit cycles through 128,64,32... for each following pixel in that line.

The picking positions also come from a table but their absolute loactions have to be partly calculated, as they are not fixed realive to the source bitmap data.

I could post the source, its not that long. Maybe together we could squeeze out some more cycles.

bugbiter · November 7, 2014

Yesterday I tried something I never did before: I wrote a little benchmark routine (filling a gr.8 screen with different patterns)

to run in page zero. I thought zeropage code runs faster because some commands take less cycles.

Result: No visible increase of speed compared to a run in higher memory.

Disappointed!!

Secrets of Speed

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members