The context I'm thinking of is a bulk copy: 4K or more, e.g. a whole screen being paged in during flyback; that sort of thing.

So - simplifying assumptions; source and dest do NOT overlap, source and dest are page aligned, size of copy is a multiple of whole pages.

If we use the "natural" coding (two zero page pointers, indexed by Y) and unroll the loop enough, we can get asymptotically close to 13 cycles per byte:

lda (SRC),y ; 5
sta (DST),y ; 6
iny         ; 2

This is simple to understand, and (in fact) doesn't really need all my simplifying assumptions.
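A rough Python model of the cost (my sketch, assuming the standard NMOS 6502 timings: lda (zp),y = 5, sta (zp),y = 6, iny = 2, taken bne = 3) shows how unrolling amortises the branch:

```python
# Rough cost model for the "natural" (zp),Y copy loop, assuming standard
# NMOS 6502 timings: lda (zp),y = 5, sta (zp),y = 6, iny = 2, taken bne = 3.
def naive_cycles_per_byte(unroll):
    """Cycles per byte when the lda/sta/iny triple is unrolled `unroll` times."""
    return (unroll * (5 + 6 + 2) + 3) / unroll

print(naive_cycles_per_byte(1))   # 16.0 - no unrolling, branch every byte
print(naive_cycles_per_byte(32))  # ~13.09 - approaching the 13-cycle asymptote
```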

To go faster, I think we need self-modifying code, using abs,x addressing (4 cycles for the load, 5 for the store).

Consider the following fragment:

src1: lda $aa00,x ; 4
dst1: sta $bb00,x ; 5
src2: lda $aa01,x ; 4
dst2: sta $bb01,x ; 5
src3: lda $aa02,x ; 4
dst3: sta $bb02,x ; 5
      .
      .

By baking incrementing low bytes into the absolute addresses, we avoid the need to increment X within the fragment. And we've saved a cycle on the load (5 down to 4) and a cycle on the store (6 down to 5). This is 9 cycles per byte.

So, we unroll the loop, and add "some number" to X each time round the loop. If we make the unroll factor a power of 2, this will neatly come out to a page. However, we also need to perform the self-modification for each page, writing the (incrementing) page numbers of SRC and DST into the high byte of each absolute address: we need to store the SRC page at src1+2, src2+2, etc. and the DST page at dst1+2, dst2+2, etc. Each of these STA $xxxx takes 4 cycles.
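To make the patch offsets concrete, here is a hypothetical Python sketch (not from the original post; the $aa/$bb pages and label names just follow the fragment above) that emits an N-way unrolled fragment. The high byte of each absolute operand is the byte at srcK+2 / dstK+2, which is exactly what the per-page self-mod stores have to overwrite:

```python
def emit_unrolled_copy(n, src_page=0xAA, dst_page=0xBB):
    """Emit an N-way unrolled abs,X copy fragment as assembly source text.

    Each lda/sta pair bakes the byte offset into the operand's low byte;
    the operand's high byte (at srcK+2 / dstK+2 in memory) is the target
    of the per-page self-modification.
    """
    lines = []
    for k in range(n):
        lines.append(f"src{k + 1}: lda ${src_page:02x}{k:02x},x ; 4")
        lines.append(f"dst{k + 1}: sta ${dst_page:02x}{k:02x},x ; 5")
    return "\n".join(lines)

print(emit_unrolled_copy(4))
```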

If we unroll by N, where N is one of 4, 8, 16, etc., one pass of the copying loop will take N*9 cycles (plus a handful more for adding N to X and the loop-back branch; call it 10).

For the self-mod, we also pay N*8 cycles (2N stores of 4 cycles each), although the self-mod code is run once per page, while the copy loop is run 256/N times per page.

So the total cycles per page, with an unroll of N, is around:

N*8 + (256/N)*(N*9 + 10)

So if we make N too big, we'll do too much self-mod, and if we make N too small, the copy loop won't be fast. What's best?

Setting up this equation in a dirty fragment of perl, I get:

4:   2976  rate 11.625000
8:   2688  rate 10.500000
16:  2592  rate 10.125000
32:  2640  rate 10.312500
64:  2856  rate 11.156250
128: 3348  rate 13.078125

(rate is cycles per byte).
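The perl itself isn't shown, but the sum is easy to reproduce; here is a Python equivalent of the model (my sketch, using lda abs,X = 4 and sta abs,X = 5 cycles):

```python
# Cycles per page for an unroll of N:
#   self-mod: 2*N absolute stores at 4 cycles each = N*8, paid once per page
#   copy loop: run 256/N times, each pass N*9 cycles plus ~10 loop overhead
def cycles_per_page(n):
    return n * 8 + (256 // n) * (n * 9 + 10)

for n in (4, 8, 16, 32, 64, 128):
    print(f"{n}: {cycles_per_page(n)}  rate {cycles_per_page(n) / 256:.6f}")
```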

So 16 appears best.

The result (at the hand-waving level) is a block copy running at about 10.1 cycles per byte, against 13 for the simple version. I've probably left out some overhead in this, but I think the overall analysis is OK.

BugBear

**Edited by bugbear, Tue May 8, 2018 5:13 AM.**