Assembly challenge: copy routine

Asmusr · September 7, 2020

For use in my raycaster project, I present this challenge: write the fastest possible routine to copy any number of bytes (even or odd) from any address in CPU memory (even or odd) to any other address in CPU memory (even or odd). The ability to handle both odd and even addresses is essential, and exactly the specified number of bytes must be copied. There is no need to support overlapping memory regions. Self-modifying code is acceptable.

Here is what I'm after, only faster:

* r0: source, r1: destination, r2: count
copy: movb *r0+,*r1+
      dec r2
      jne copy
      rt

apersson850 · September 7, 2020

If both the source and destination addresses are either odd or even, then it's always possible to move most of the bytes in pairs (MOV). But if the source is odd and the destination even, or the opposite, then all data must be moved bytewise (MOVB).

The overhead used to figure this out is of course not meaningful if it's just a few bytes we're talking about. At least testing against a fixed number of bytes is easy to do.

To see if both addresses are even or odd, or of different kind, is easy to do with a source XOR destination operation.

A completely general procedure can be written, of course. But is there any more likely data size? Like usually less than 50 bytes or more than 500 or whatever?

Does it matter if the memory occupied by the routine grows significantly compared to the sample you give?

+adamantyr · September 7, 2020

47 minutes ago, Asmusr said:
For use in my raycaster project, I present this challenge: write the fastest possible routine to copy any number of bytes (even or odd) from any address in CPU memory (even or odd) to any other address in CPU memory (even or odd). The ability to handle both odd and even addresses is essential, and exactly the specified number of bytes must be copied. There is no need to support overlapping memory regions. Self-modifying code is acceptable.

Here is what I'm after, only faster:
* r0: source, r1: destination, r2: count
copy: movb *r0+,*r1+
      dec r2
      jne copy
      rt

What is the use context? Can you assume the byte count will always be a minimum size?

Unrolling the copy loop with a check to see if the count is odd or even first seems the best approach. I see two options:

A static option (with it doing a consecutive 8 or 9 byte push, for example, based on even/odd and a wrap up for remainders). You'll lose some time in prep-work but make up for it with large enough unrolls.

Self-replicating code that will in a memory block just replicate the "movb *r0+,*r1+" for as many iterations as possible. The prep work could be much longer, so it's a balance of time to create the unrolled loop verses the speed gain of not having a status register check and decrement.

+TheBF · September 7, 2020

I was testing this yesterday for copying 4K SAMS pages as fast as I could.

This ASM code is reverse notation Forth Assembler so you might need to twist your head a bit.

The results are shown on the screen capture.

CMOVE is the same as MOVE16 below but using MOVB instruction ie: byte at a time, and does not correct the byte count to an even number of course.

MOVE32 has no benefit for moving a 4K block as you can see but was 20% faster moving an 8K block so it is better for >8K block moves.

Meanings so you can translate

-----------------------------------

BEGIN, is a universal label to jump back to in this assembler

OC WHILE, compiles to: JNC REPEAT+2

REPEAT, compiles to: JMP BEGIN

LTE UNTIL, compiles to: JGT BEGIN

TOS renamed R4

NEXT, returns to the Forth interpreter


CODE MOVE16  ( src dst n -- )  \ n= no. of CELLS to move
            *SP+ R0 MOV,       \ pop DEST into R0
            *SP+ R1 MOV,       \ pop source into R1
             TOS INC,          \ make sure n is even
             TOS -2 ANDI,
             BEGIN,
                TOS DECT,         \ dect by two, moving 2 bytes at once
             OC WHILE,            \ if n<0 get out
                R1 *+ R0 *+ MOV,  \ mem to mem move, auto increment
             REPEAT,
             TOS POP,
             NEXT,
             ENDCODE

\ no improvement for 4K byte moves.  20% faster for 8K bytes
CODE MOVE32  ( src dst n -- )  \ n= no. of CELLS to move
            *SP+ R0 MOV,       \ pop DEST into R0
            *SP+ R1 MOV,       \ pop source into R1
            BEGIN,
                R1 *+ R0 *+ MOV,  \ memory to memory move, auto increment
                R1 *+ R0 *+ MOV,  \ memory to memory move, auto increment
                TOS -4 AI,        \ we are moving 4 bytes at once!
            LTE UNTIL,
            TOS POP,
            NEXT,
            ENDCODE

Tursi · September 7, 2020

This is one of the very, very few cases I'd suggest using something like Duff's Device (although it's only called that in C, where it's an abomination).

You can write a lead-in and a lead-out to handle initial and trailing bytes, then jump to the correct place in an unrolled loop (calculated with masking) that copies words. Since you're incrementing source and destination, you want to copy as many words as you can in any case.

pseudo code only at the moment, but something like:

-odd start address? deal with single byte to make it even

-calculate div 8 count (mask)

-calculate mod 8 of count (mask), use to jump into the correct start point of the loop

-repeat:

- move 1 word

- decrement count, loop if not zero

- still an odd byte left? Deal with single final byte

Not sure if I explained it well, but looking up the device will explain better. If you normally copy fewer than say 8 bytes, a mod 4 might work better than 8.

+mizapf · September 7, 2020

2 hours ago, apersson850 said:

If both the source and destination addresses are either odd or even, then it's always possible to move most of the bytes in pairs (MOV). But if the source is odd and the destination even, or the opposite, then all data must be moved bytewise (MOVB).

I think MOVB should be avoided whereever possible. It takes just the same amount of cycles as MOV, but only for a single byte.

I'd cover the cases for odd start address and odd end address by a special case, and transfer the rest as words ([MOVB] (MOV)* [MOVB]). It should be easy to calculate the minimum amount of data where this becomes better than doing a MOVB for all.

Edit: This was truly synchronous with Tursi.

Edited September 7, 2020 by mizapf

Stuart · September 7, 2020

We don't know how such a routine would be used and how big the data blocks might be of course, but might there be scope to leave the data where it is and maintain a pointer to the data? Or maybe a linked list of pointers (and data lengths) for a block of data split into sub-blocks?

Asmusr · September 8, 2020

8 hours ago, Tursi said:

This is one of the very, very few cases I'd suggest using something like Duff's Device (although it's only called that in C, where it's an abomination).

You can write a lead-in and a lead-out to handle initial and trailing bytes, then jump to the correct place in an unrolled loop (calculated with masking) that copies words. Since you're incrementing source and destination, you want to copy as many words as you can in any case.

pseudo code only at the moment, but something like:

-odd start address? deal with single byte to make it even

-calculate div 8 count (mask)

-calculate mod 8 of count (mask), use to jump into the correct start point of the loop

-repeat:

- move 1 word

- move 1 word

- move 1 word

- move 1 word

- move 1 word

- move 1 word

- move 1 word

- move 1 word

- decrement count, loop if not zero

- still an odd byte left? Deal with single final byte

Not sure if I explained it well, but looking up the device will explain better. If you normally copy fewer than say 8 bytes, a mod 4 might work better than 8.

But that won't work if the source and destination words are not aligned ?.

Tursi · September 8, 2020

31 minutes ago, Asmusr said:

But that won't work if the source and destination words are not aligned ?.

Is there nothing you can do to change that? You're leaving nearly 50% of your performance on the floor otherwise. Maybe two sets of texture data, one even aligned and one odd aligned? ROM is cheaper than cycles.

apersson850 · September 8, 2020

8 hours ago, mizapf said:

I think MOVB should be avoided whereever possible. It takes just the same amount of cycles as MOV, but only for a single byte.

Exactly. But as now has already been commented upon, if the source block starts on an odd address, but the destination starts on an even address, then all odd bytes should go to even addresses, and vice versa. MOV can't accomplish that. MOVB can. But takes twice the time.

Asmusr · September 8, 2020

9 hours ago, Tursi said:

Is there nothing you can do to change that? You're leaving nearly 50% of your performance on the floor otherwise. Maybe two sets of texture data, one even aligned and one odd aligned? ROM is cheaper than cycles.

In this case I would have to add 16 more ROM banks for a bit of performance.

Asmusr · September 8, 2020

I shouldn't have mentioned a particular use case in a general challenge. But thanks for the feedback everyone. I conclude that there isn't a clever way I have overlooked to copy words between even and odd addresses, and that the general solution will need to have branches for several special cases.

+9640News · September 8, 2020

Depending upon how large your block is, you might be able to use an index of multiple MOV *R1+,*R2+ instructions to identify where to start within the block of MOV instructions. That would get rid of a DEC instruction, especially if you were going to test after every MOV instruction.

+mizapf · September 8, 2020

9 hours ago, apersson850 said:

Exactly. But as now has already been commented upon, if the source block starts on an odd address, but the destination starts on an even address, then all odd bytes should go to even addresses, and vice versa. MOV can't accomplish that. MOVB can. But takes twice the time.

You are certainly right! I obviously tried to ignore the even-to-odd problem. Moving even to odd means to change all memory words.

+InsaneMultitasker · September 8, 2020

Following fast vdp VMBW/VMBR examples left by programmers here in the forum, I've converted some of my non-vdp memory move loops to 4- or 8- instructions per loop with a test for residual 3 or 7 bytes. This mirrors at least one of the fast VMBW/R routines that @rasmus and others have been using. I've gone so far as to read/write 16 at a time but haven't seen much benefit beyond that for my purposes. The smaller loops also have the added bonus of fitting more easily in scratchpad RAM.

When the source and destination start on different boundaries, I've used both a second routine of MOVB (slower) or when space is a premium, used self-modifying code to update the scratchpad routine by replacing the 4- or 8- MOV (and DEC) with MOVB and DECT. The odd start/end bytes must of course be dealt with at some point. My preferred method has been to ensure that all of my data is on even boundaries and where possible, forgo worrying about residual bytes by taking them into account up front. All depends on what you want to move and how, and in some cases you might end up using a slow routine for some data; fast routine for other data.

Tursi · September 9, 2020

11 hours ago, Asmusr said:

In this case I would have to add 16 more ROM banks for a bit of performance.

128k. No big deal today. If it doubles your performance, why not?

+FarmerPotato · September 9, 2020

On 9/7/2020 at 1:16 PM, Asmusr said:
For use in my raycaster project, I present this challenge: write the fastest possible routine to copy any number of bytes (even or odd) from any address in CPU memory (even or odd) to any other address in CPU memory (even or odd). The ability to handle both odd and even addresses is essential, and exactly the specified number of bytes must be copied. There is no need to support overlapping memory regions. Self-modifying code is acceptable.

Here is what I'm after, only faster:
* r0: source, r1: destination, r2: count
copy: movb *r0+,*r1+
      dec r2
      jne copy
      rt

I haven't counted the cycles, but here is my attempt at resolving the odd/even boundaries. I use MOV in slow RAM (thee of 6 wait states per word.) I only use MOVB in PAD. Here is one case where that counts:

* wp in PAD so faster than movb in ram
padws equ >83e0
r3lb  equ padws+7
r4lb  equ padws+9

src  equ 0
dst  equ 1
cnt  equ 2

* src is odd, dst is even, cnt is even
* src:  a bc de fg h
* dst: ab cd ef gh
* combine dect/jeq, if cnt is always multiple of 4

copy:
* cache first byte
 movb *src+,r3
loop:
* fill two src words:
* grab a word
 mov  *src+,r4
 movb r4,@r3lb
 mov  r3,*dst+
 dect cnt
 jeq done
 swpb r4
* grab a word
 mov  *src+,r3
 movb r3,@r4lb
 mov  r4,*dst+
 swpb r3
 dect cnt
 jne  loop
done:
 rt

This code is 16 words.

matthew180 · September 13, 2020

If the source or destination is small enough, and word aligned, you can locate your workspace on top of the src or dst, and eliminate the overhead of indirect addressing for half of the copy.

Speed gains are usually found by intimately understanding the problem and writing custom solutions that combine operations, which is only possible due to having that external knowledge. This is where humans will out-smart computers/compilers for a very long time to come.

Squeezing performance will never be a "generic code" endeavor, IMO.

Asmusr · September 14, 2020

7 hours ago, matthew180 said:

Squeezing performance will never be a "generic code" endeavor, IMO.

In this case there is a generic solution, it's just not faster for small numbers of bytes, and the speed also depends on odd/even issue. If you know at compile time that you only need to copy a small number of bytes you shouldn't call the generic routine. But if you don't know, the speed gained at the high end is likely to outweigh the speed loss at the low end.

wierd_w · September 14, 2020

This is probably very unhelpful, but--

If address location is ODD, pad first byte by 1, then treat as even.

Then assume all reads and writes are even aligned.

When working with the data, check if address is odd; if so, skip first byte, then proceed as if even.

Just be wary of the dreaded off by 1, should somehow your check routines fail.

apersson850 · September 14, 2020

But it doesn't help if the source starts at an even address, but the destination at an odd one, or vice versa. If so, you can't use MOV, not without adding more instructions, which will slow it down even more than just using MOVB in the first place.

wierd_w · September 14, 2020

Maybe for the initial exchange of the copy..

Some simple rules to follow:

Is total data length odd? If so, always select an odd numbered address.

If address is odd, pad to make it even. (the same byte that makes the logical start address even, also pads the odd number of bytes to copy, making it even also.)

This gives you two signals with one check-- If the address is odd, you immediately know: Data length is logically even, but actually odd, and padded by 1 byte at the start.

This check need only be performed once. All subsequent parts of the duty loop are for whole words, since data size is guaranteed even divisible.

apersson850 · September 14, 2020

No, it still doesn't help when a block has to move from an odd starting address to an even starting address, or the opposite. Padding will not help you there. It will just add one more byte to move.

apersson850 · September 14, 2020

Think about it like this:

Source

| x | A |
| B | C |
| D | E |
| F | G |

Destination

| A | B |
| C | D |
| E | F |
| G | y |

Only the bytes A-G are included in the payload. The bytes x and y should not be moved/changed.

If the byte A is at address SOURCE and should be moved to DEST, then only MOVB @SOURCE,@DEST can do that. If you execute MOV @SOURCE,@DEST, then what you get is actually MOV @SOURCE-1,@DEST, which in turn means that the first word at the destination will be | x | A | instead of the intended | A | B |. Padding the source by replacing x with P will render the first word at the destination as | P | A |, which still isn't the intended | A | B |.

wierd_w · September 14, 2020

I get that. I am pointing out that it is not the showstopper you think it is.

In case I am not good at explaining it, here's the basic gist:

You want to copy a big block of data. Say something that is always either 511 bytes, or 512 bytes in size. (even or odd number of bytes, but still a lot of data.) This data lives in a ROM. When you master the ROM, you pre-place the 511 byte long elements on odd numbered addresses, and assure that there is a blank 00 byte just before it in the ROM.

So, something like this:

(addr)

00 01 02 03 04 ..... (out to end of element)

--------------------------------

(Data) | 00 XX XX XX XX ..... (out to end of element)

Your data index says the data rightly starts at 0001. which it does. It does because it is 511 bytes long, and is started at the first odd byte, per the artificially imposed convention. (which is how we know it is an odd number of bytes in the record.)

However, the leading byte is (by virtue of your mastering process), preceded by an unallocated (not really part of any stream) 00 byte, which functions as the padding. Copying it has no real consequence, you initiate copy at 0000 instead of 0001, and grab the 00 byte along with the first byte of real data. This happens only one time.

Since your destination is an odd numbered address, you will know that the data contains the padding, and can treat it accordingly when you use it in RAM.

You could even reclaim the byte "wasted", if you were to use an actual even number in the address space preceding it for the next write.

EG, your memory looks like this after you copy the first odd byte length record:

E O E O E O E O E O E O E O E O

00 00 00 00 00 00 00 00 PP XX XX XX XX XX XX XX

Then you copy the next odd length record into the space just before it, and overwrite the padding.

E O E O E O E O E O E O E O E O

PP YY YY YY YY YY YY YY YY XX XX XX XX XX XX XX

Writing always starts at an odd type address. This lets you know that padding was used, and that you should ignore the first byte, and initiate copy from the preceding even byte, to use it as padding.

Both operations use only MOVB. No RAM gets wasted, except at the last odd length record. (which occupies the highest order spot in the chain.)

Properly implemented, you waste only a single byte of RAM to padding this way, and use only MOVB.

Edited September 14, 2020 by wierd_w

Assembly challenge: copy routine

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members