Jump to content

Photo

Fastest Possible bulk copy?


8 replies to this topic

#1 bugbear OFFLINE  

bugbear

    Space Invader

  • 35 posts

Posted Tue May 8, 2018 3:24 AM

The context I'm thinking of is a bulk copy - 4K or more, e.g. a whole screen being paged in during flyback

that sort of thing.

 

So - simplifying assumptions; source and dest do NOT overlap, source and dest are page aligned, size of copy is a multiple of whole pages.

 

If we use the "natural" coding (two zero page pointers, indexed by Y, and unroll the loop enough, we can get asymptotically close to

12 cycles per byte;

lda (SRC),y       ; 5
sta (DST),y       ; 5
iny               ; 2

This is simple to understand, and (in fact) doesn't really need all my simplifying assumptions.

To go faster, I think we need self-modifying code, using abs,x addressing (4 cycles).

Consider the following fragment:

src1:  lda $aa00,x   ; 4
dst1:  sta $bb00,x   ; 4
src2:  lda $aa01,x   ; 4
dst2:  sta $bb01,x   ; 4
src3:  lda $aa02,x   ; 4
dst3:  sta $bb02,x   ; 4
.
.

By using incrementing low bytes on the absolute addresses, we avoid the need to increment x. And we've saved a cycle on the load and store. This is 8 cycles per byte.

So, we unroll the loop, and add "some number" to x each time round the loop. If we make the unrolling a power of 2, this will neatly come out to a page. However, we also need to perform the self modifying for each page, setting the (incrementing) page location of SRC and DST into the high byte of each absolute address; we need to set SRC at src1+2, src2+2, etc and DST at dst1+2, dst2+2, etc. Each of these STA $xxxx takes 4 cycles.

 

If we unroll by N, where N is one of 4,8,16 etc, the copying loop will take N*8 cycles (plus a handful more for the loop back, 8 or 10).
For the self-mod, we will also have N*8, although the self mod code is run once per page; the loop is run (256/N) per page.

So the total cycles per page, in an unroll of N is around:

N*8 + (256/N)*(N*8* + 10)

So if we make N too big, we'll do too much self-mod, and if we make N too small, the copy loop won't be fast. What's best?

Setting up this equation in a dirty fragment of perl, I get:

4:2720
rate 10.625000
8:2432
rate 9.500000
16:2336
rate 9.125000
32:2384
rate 9.312500
64:2600
rate 10.156250
128:3092
rate 12.078125
(rate is cycles per byte).
 

So 16 appears best.

The result (at the hand-waving level) is a block copy running at 9.1 cycles per byte. I've probably left out some overhead in this, but I think the overall analysis is OK.

   BugBear


Edited by bugbear, Tue May 8, 2018 5:13 AM.


#2 ivop ONLINE  

ivop

    Dragonstomper

  • 641 posts
  • Location:The Netherlands

Posted Tue May 8, 2018 6:35 AM

    ldx #0          ; 2

    lda $2000,x     ; 4
    sta $3000,x     ; 4 
    lda $2100,x
    sta $3100,x
    lda $2200,x
    sta $3200,x
    lda $2300,x
    sta $3300,x
    lda $2400,x
    sta $3400,x
    lda $2500,x
    sta $3500,x
    lda $2600,x
    sta $3600,x
    lda $2700,x
    sta $3700,x
    lda $2800,x
    sta $3800,x
    lda $2900,x
    sta $3900,x
    lda $2a00,x
    sta $3a00,x
    lda $2b00,x
    sta $3b00,x
    lda $2c00,x
    sta $3c00,x
    lda $2d00,x
    sta $3d00,x
    lda $2e00,x
    sta $3e00,x
    lda $2f00,x
    sta $3f00,x

    inx             ; 2
    bne copy        ; 3

; (2 + (16*8 + 2 + 3) * 256)/4096 = 8.312988 cycles per byte

This is how I would bulk copy 4kB (16 pages).


Edited by ivop, Tue May 8, 2018 6:35 AM.


#3 Rybags OFFLINE  

Rybags

    Quadrunner

  • 15,627 posts
  • Location:Australia

Posted Tue May 8, 2018 6:36 AM

Absolute fastest is just

LDA SRC

STA DEST

LDA SRC + 1

STA DEST + 1

 

8 cycles per byte but an obscene memory requirement of <screen size> * 6 so about 48K for a hires or 24K for a Graphics 7 bitmap.

But not so bad for a text screen, under 6K.

 

A less costly method in terms of memory and not too much worse for cycles is using indexed addressing for every page of screen RAM.

To ensure maximum efficiency, set your screen base to minimize crossing page boundaries in the move.

 

LDX #0

COPY  LDA SRC,X

STA DEST,X

LDA SRC+$100,X

STA DEST+$100,X

etc. until

LDA SRC+$1E00,X

STA DEST+$1E00,X

INX

BNE COPY

 

Alternative to this you could use self-modifying code to make the internals of the loop smaller and just modify the operands and run the loop again.

 

 

Or better alternative - given the flexibility of the Atari screen architecture you can use DLists with LMS and fine scrolling to mimimize or eliminate having to move memory around.


Edited by Rybags, Tue May 8, 2018 6:38 AM.


#4 Irgendwer OFFLINE  

Irgendwer

    Stargunner

  • 1,339 posts
  • Location:Germany

Posted Tue May 8, 2018 8:05 AM

    sta $3000,x     ; 4 

 

Unfortunately not (5).



#5 Heaven/TQA OFFLINE  

Heaven/TQA

    Quadrunner

  • 10,771 posts
  • Location:Baden-Württemberg, Germany

Posted Tue May 8, 2018 8:32 AM

if I have the mem... I would do brute force

 

.rept 4096

LDA src+#

STA dst+#

.endr



#6 bugbear OFFLINE  

bugbear

    Space Invader

  • Topic Starter
  • 35 posts

Posted Tue May 8, 2018 8:46 AM

>> This is how I would bulk copy 4kB (16 pages).


Unrolling the outer loop instead of the inner - I love it!

 

  BugBear



#7 Rybags OFFLINE  

Rybags

    Quadrunner

  • 15,627 posts
  • Location:Australia

Posted Tue May 8, 2018 9:06 AM

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

Also worth mentioning the rendering technique exposed by Heaven in the Activision Zone Ranger game.  Has a thread from several years ago somewhere around here.



#8 ivop ONLINE  

ivop

    Dragonstomper

  • 641 posts
  • Location:The Netherlands

Posted Tue May 8, 2018 12:07 PM

 

Unfortunately not (5).

 

Aw, yes. Good catch.

 

(2 + (16*9 + 2 + 3) * 256)/4096 = 9.312988 cycles per byte
 

I just copied 4 and 4 from bugbear's post, but a abs,x store is indeed one cycle more.



#9 bugbear OFFLINE  

bugbear

    Space Invader

  • Topic Starter
  • 35 posts

Posted Tue May 8, 2018 2:11 PM

In the modern day with large flashcarts and Ram sizes, entirely feasible to do enormous brute-force lumps of code rather than time consuming loops.

In the modern age, a GPU can shift giga-bytes per second. But (for me) being full-on-retro is more fun.

 

  BugBear






0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users