Fastest Possible bulk copy?

bugbear · May 8, 2018

The context I'm thinking of is a bulk copy - 4K or more, e.g. a whole screen being paged in during flyback

that sort of thing.

So - simplifying assumptions; source and dest do NOT overlap, source and dest are page aligned, size of copy is a multiple of whole pages.

If we use the "natural" coding (two zero page pointers, indexed by Y, and unroll the loop enough, we can get asymptotically close to

12 cycles per byte;

lda (SRC),y       ; 5
sta (DST),y       ; 5
iny               ; 2

This is simple to understand, and (in fact) doesn't really need all my simplifying assumptions.

To go faster, I think we need self-modifying code, using abs,x addressing (4 cycles).

Consider the following fragment:

src1:  lda $aa00,x   ; 4
dst1:  sta $bb00,x   ; 4
src2:  lda $aa01,x   ; 4
dst2:  sta $bb01,x   ; 4
src3:  lda $aa02,x   ; 4
dst3:  sta $bb02,x   ; 4
.
.

By using incrementing low bytes on the absolute addresses, we avoid the need to increment x. And we've saved a cycle on the load and store. This is 8 cycles per byte.

So, we unroll the loop, and add "some number" to x each time round the loop. If we make the unrolling a power of 2, this will neatly come out to a page. However, we also need to perform the self modifying for each page, setting the (incrementing) page location of SRC and DST into the high byte of each absolute address; we need to set SRC at src1+2, src2+2, etc and DST at dst1+2, dst2+2, etc. Each of these STA $xxxx takes 4 cycles.

If we unroll by N, where N is one of 4,8,16 etc, the copying loop will take N*8 cycles (plus a handful more for the loop back, 8 or 10).
For the self-mod, we will also have N*8, although the self mod code is run once per page; the loop is run (256/N) per page.

So the total cycles per page, in an unroll of N is around:

N*8 + (256/N)*(N*8* + 10)

So if we make N too big, we'll do too much self-mod, and if we make N too small, the copy loop won't be fast. What's best?

Setting up this equation in a dirty fragment of perl, I get:

4:2720
rate 10.625000
8:2432
rate 9.500000
16:2336
rate 9.125000
32:2384
rate 9.312500
64:2600
rate 10.156250
128:3092
rate 12.078125
(rate is cycles per byte).

So 16 appears best.

The result (at the hand-waving level) is a block copy running at 9.1 cycles per byte. I've probably left out some overhead in this, but I think the overall analysis is OK.

BugBear

Edited May 8, 2018 by bugbear

ivop · May 8, 2018

    ldx #0          ; 2

    lda $2000,x     ; 4
    sta $3000,x     ; 4 
    lda $2100,x
    sta $3100,x
    lda $2200,x
    sta $3200,x
    lda $2300,x
    sta $3300,x
    lda $2400,x
    sta $3400,x
    lda $2500,x
    sta $3500,x
    lda $2600,x
    sta $3600,x
    lda $2700,x
    sta $3700,x
    lda $2800,x
    sta $3800,x
    lda $2900,x
    sta $3900,x
    lda $2a00,x
    sta $3a00,x
    lda $2b00,x
    sta $3b00,x
    lda $2c00,x
    sta $3c00,x
    lda $2d00,x
    sta $3d00,x
    lda $2e00,x
    sta $3e00,x
    lda $2f00,x
    sta $3f00,x

    inx             ; 2
    bne copy        ; 3

; (2 + (16*8 + 2 + 3) * 256)/4096 = 8.312988 cycles per byte

This is how I would bulk copy 4kB (16 pages).

Edited May 8, 2018 by ivop

Rybags · May 8, 2018

Absolute fastest is just

LDA SRC

STA DEST

LDA SRC + 1

STA DEST + 1

8 cycles per byte but an obscene memory requirement of <screen size> * 6 so about 48K for a hires or 24K for a Graphics 7 bitmap.

But not so bad for a text screen, under 6K.

A less costly method in terms of memory and not too much worse for cycles is using indexed addressing for every page of screen RAM.

To ensure maximum efficiency, set your screen base to minimize crossing page boundaries in the move.

LDX #0

COPY LDA SRC,X

STA DEST,X

LDA SRC+$100,X

STA DEST+$100,X

etc. until

LDA SRC+$1E00,X

STA DEST+$1E00,X

INX

BNE COPY

Alternative to this you could use self-modifying code to make the internals of the loop smaller and just modify the operands and run the loop again.

Or better alternative - given the flexibility of the Atari screen architecture you can use DLists with LMS and fine scrolling to mimimize or eliminate having to move memory around.

Edited May 8, 2018 by Rybags

Irgendwer · May 8, 2018

    sta $3000,x     ; 4

Unfortunately not (5).

Heaven/TQA · May 8, 2018

if I have the mem... I would do brute force

.rept 4096

LDA src+#

STA dst+#

.endr

bugbear · May 8, 2018

>> This is how I would bulk copy 4kB (16 pages).

Unrolling the outer loop instead of the inner - I love it!

BugBear

Rybags · May 8, 2018

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

Also worth mentioning the rendering technique exposed by Heaven in the Activision Zone Ranger game. Has a thread from several years ago somewhere around here.

ivop · May 8, 2018

Unfortunately not (5).

Aw, yes. Good catch.

(2 + (16*9 + 2 + 3) * 256)/4096 = 9.312988 cycles per byte

I just copied 4 and 4 from bugbear's post, but a abs,x store is indeed one cycle more.

bugbear · May 8, 2018

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

In the modern age, a GPU can shift giga-bytes per second. But (for me) being full-on-retro is more fun.

BugBear

Sign In

Fastest Possible bulk copy?

Recommended Posts

bugbear

Link to comment

Share on other sites

ivop

Link to comment

Share on other sites

Rybags

Link to comment

Share on other sites

Irgendwer

Link to comment

Share on other sites

Heaven/TQA

Link to comment

Share on other sites

bugbear

Link to comment

Share on other sites

Rybags

Link to comment

Share on other sites

ivop

Link to comment

Share on other sites

bugbear

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members

Apps

My Activity Streams

More