Assembly challenge: copy routine

apersson850 · September 14, 2020

Using MOVB, it's simple. The problem is we want to use MOV for efficiency.

+mizapf · September 14, 2020

People still think too much in terms of 8 bit. Byte manipulation is expensive on the 16-bit TMS9900. As seen from the CPU, the memory is a sequence of 16-bit values, and every move across word boundaries will necessarily involve changing the memory contents, and not only the location.

wierd_w · September 14, 2020

(I was being pulled literally 5 different directions when I was writing the prior post. Take that with some salt. I am home now, and no longer have such demands placed on me.)

The intention was that you could pad your "Odd number of bytes" data with a preceding 1 byte pad, to make it even-aligned data. Then you can copy using only whole-word-copy operations.

If you retain "even aligned with padding", and don't try to reclaim the RAM, you lose 1 byte for every odd lengthed data element.

However, I was not contemplating that there is a restriction on the word vs memory situation. (EG, that the CPU addresses at the word level, so every single byte operation has to be chunked up and manipulated so that it correlates with a whole word being read or written.) Again, I was being literally pulled in 5 directions.

This sadly, means you cannot use "odd address start" as a signal, since you have to write whole words at a time, with even alignment, to get the value-add of MOV. It might be better to have the data element itself contain a header specifying how large the element is, then back-pad to even length, artificially imposing even-alignment, and just waste the memory with the padding for the performance boost. (Unless I am mistaken again, and you totally CAN do a whole word operation in one cycle on an odd numbered location that straddles a word boundary. In which case, you can reclaim the padding by overwriting it on subsequent writes, and end up wasting only a single byte at the end of the chain.)

apersson850 · September 14, 2020

No, you can't do word operations on odd addresses. If you know the hardware details of the TMS 9900 CPU, you know that it has only 15 address bits. Thus it's always addressing words at even addresses. If you do a byte access, then the hardware reads the whole word, since it can't access memory with less granularity than that, then shifts the read data internally if necessary, to make it look the same if the byte originally came from the left or right half of the word.

That's why I keep repeating that it's not a question about if there's an odd or even number of bytes to move. It's about if the source and the destination are differently aligned that there's no cure that will allow you to use MOV, none that doesn't cost much more than just using MOVB.

But it is of course possible to write code that can do MOV, if the source and destination are equally aligned, and revert to using MOVB if they aren't. Then at least half of the time it will be faster.

Otherwise you have to figure out algorithms that don't need to move data from odd to even alignment.

wierd_w · September 14, 2020

yes, that is congruent with my (now) current understanding.

The issue is that you want to use MOV, but your data is of an odd length. (and or, exists at a location that is an odd number, and thus straddles a word. Which only happens if a data element is an odd number of bytes somewhere in memory before it. In either case, padding it, and enforcing even alignment affords the speedup.)

This can cause problems in at least two ways.

First, if your source is not even aligned, then your reads will straddle a word boundary too, not just the destination. You would have to identify this, do a MOVB to read the bottom half of the word your data is straddling, then read your file out as if you had chopped the first byte off of it, at even aligned word boundaries until the end of the record using MOV.

The second is if you are writing to a location that is not even aligned, in which you would have to do similar. These two conditions could be either-or, or BOTH at the same time. As such, you have to check for all 3 conditions, with the OTHER condition being straight up MOV only copy (because data at both source and destination are even aligned.)

If this data is being read from a ROM, I would suggest a combination of padding in the ROM (to force even alignment), with a header or some other construct to know if the data was padded, so it can be treated properly. (Or better yet, require that your data format always be an even number of bytes.) With the padding already in the data structure, you can always trust that the data is even aligned, and not worry about it. The only issue would be to ensure that your initial program can determine if it has been loaded into an odd numbered address, pad itself to fix the alignment, then continue.

I would think the best option would be to strive to assure that your data never is anything other than word boundary aligned from the beginning.

Then you wouldn't have to invest instruction cycles on testing to see if it is not (because you can presume not, because you know exactly where cartridge memory is, and control how data is mapped there-- Assuming this is a ROM), and you wouldn't have to waste instruction cycles on wastefully slow copy operations that could run foul of compounding conditions of source, destination, or both not being word aligned, which could easily pile up if you are allowing data that does not meet the boundary alignment condition.

I think the best solution is a combination of environmental sanitization at init (check only once at first execution if program was loaded outside of word alignment, then fix it if it was), followed by strict data type rules.

BillG · September 14, 2020

If the size of the block is large enough, it may be worthwhile to try to copy words at a time.

                          00001 *
                          00002 * R0 = source address
                          00003 * R1 = destination address
                          00004 * R2 = count
                          00005 *
 0000                     00006 Move
 0000 0282 0006           00007         ci      R2,6            ; Enough to get clever?
 0004 1404 (000E)         00008         jhe     TryWord
                          00009
 0006                     00010 ByteLoop
 0006 DC70                00011         movb    *R0+,*R1+       ; Copy a byte at a time
 0008 0602                00012         dec     R2              ; More?
 000A 16FD (0006)         00013         jne     ByteLoop
                          00014
 000C                     00015 MoveDone
 000C 045B                00016         b       *R11            ; Return
                          00017
 000E                     00018 TryWord
 000E 0203 0001           00019         li      R3,1            ; Load mask for LSB
                          00020
 0012 C100                00021         mov     R0,R4           ; Are they aligned?
 0014 2901                00022         xor     R1,R4
 0016 2503                00023         czc     R3,R4
 0018 16F6 (0006)         00024         jne     ByteLoop        ; No, do a byte at a time
                          00025
 001A 24C0                00026         czc     R0,R3           ; Word aligned?
 001C 1302 (0022)         00027         jeq     Aligned
                          00028
 001E DC70                00029         movb    *R0+,*R1+       ; Copy the first byte
 0020 0602                00030         dec     R2
                          00031
 0022                     00032 Aligned
 0022 C102                00033         mov     R2,R4           ; Save count
                          00034
 0024 0912                00035         srl     R2,1            ; Convert to number of words
                          00036
 0026                     00037 WordLoop
 0026 CC70                00038         mov     *R0+,*R1+       ; Copy a word
 0028 0602                00039         dec     R2
 002A 16FD (0026)         00040         jne     WordLoop
                          00041
 002C 24C4                00042         czc     R4,R3           ; One byte left over?
 002E 13EE (000C)         00043         jeq     MoveDone        ; No
                          00044
 0030 DC70                00045         movb    *R0+,*R1+       ; Copy the last byte
                          00046
 0032 045B                00047         b       *R11            ; Return
                          00048
                          00049 *
                          00050 * Just to see the overhead of combining bytes into words
                          00051 *
 0034                     00052 NotAligned
 0034 24C0                00053         czc     R0,R3           ; Is source aligned?
 0036 160F (0056)         00054         jne     DestAligned
                          00055
 0038 C102                00056         mov     R2,R4           ; Save count
                          00057
 003A 0912                00058         srl     R2,1            ; Convert to number of words
                          00059
 003C D170                00060         movb    *R0+,R5         ; Get unaligned byte
                          00061
 003E                     00062 SrcLoop
 003E C1B0                00063         mov     *R0+,R6         ; Get next word
                          00064
 0040 06C6                00065         swpb    R6              ; Assemble dest word
 0042 C1C6                00066         mov     R6,R7
 0044 D185                00067         movb    R5,R6
                          00068
 0046 CC46                00069         mov     R6,*R1+         ; Store word
                          00070
 0048 D147                00071         movb    R7,R5           ; Ready for next
                          00072
 004A 0602                00073         dec     R2
 004C 16F8 (003E)         00074         jne     SrcLoop
                          00075
 004E 24C4                00076         czc     R4,R3           ; One byte left over?
 0050 13DD (000C)         00077         jeq     MoveDone        ; No
                          00078
 0052 DC45                00079         movb    R5,*R1+         ; Store the last byte
                          00080
 0054 045B                00081         b       *R11            ; Return
                          00082
 0056                     00083 DestAligned
                          00084 *
                          00085 * Why bother?  The overhead is too bad.
                          00086 *

+Lee Stewart · September 14, 2020

24 minutes ago, BillG said:

If the size of the block is large enough, it may be worthwhile to try to copy words at a time.

Spoiler



                          00001 *
                          00002 * R0 = source address
                          00003 * R1 = destination address
                          00004 * R2 = count
                          00005 *
 0000                     00006 Move
 0000 0282 0006           00007         ci      R2,6            ; Enough to get clever?
 0004 1404 (000E)         00008         jhe     TryWord
                          00009
 0006                     00010 ByteLoop
 0006 DC70                00011         movb    *R0+,*R1+       ; Copy a byte at a time
 0008 0602                00012         dec     R2              ; More?
 000A 16FD (0006)         00013         jne     ByteLoop
                          00014
 000C                     00015 MoveDone
 000C 045B                00016         b       *R11            ; Return
                          00017
 000E                     00018 TryWord
 000E 0203 0001           00019         li      R3,1            ; Load mask for LSB
                          00020
 0012 C100                00021         mov     R0,R4           ; Are they aligned?
 0014 2901                00022         xor     R1,R4
 0016 2503                00023         czc     R3,R4
 0018 16F6 (0006)         00024         jne     ByteLoop        ; No, do a byte at a time
                          00025
 001A 24C0                00026         czc     R0,R3           ; Word aligned?
 001C 1302 (0022)         00027         jeq     Aligned
                          00028
 001E DC70                00029         movb    *R0+,*R1+       ; Copy the first byte
 0020 0602                00030         dec     R2
                          00031
 0022                     00032 Aligned
 0022 C102                00033         mov     R2,R4           ; Save count
                          00034
 0024 0912                00035         srl     R2,1            ; Convert to number of words
                          00036
 0026                     00037 WordLoop
 0026 CC70                00038         mov     *R0+,*R1+       ; Copy a word
 0028 0602                00039         dec     R2
 002A 16FD (0026)         00040         jne     WordLoop
                          00041
 002C 24C4                00042         czc     R4,R3           ; One byte left over?
 002E 13EE (000C)         00043         jeq     MoveDone        ; No
                          00044
 0030 DC70                00045         movb    *R0+,*R1+       ; Copy the last byte
                          00046
 0032 045B                00047         b       *R11            ; Return
                          00048
                          00049 *
                          00050 * Just to see the overhead of combining bytes into words
                          00051 *
 0034                     00052 NotAligned
 0034 24C0                00053         czc     R0,R3           ; Is source aligned?
 0036 160F (0056)         00054         jne     DestAligned
                          00055
 0038 C102                00056         mov     R2,R4           ; Save count
                          00057
 003A 0912                00058         srl     R2,1            ; Convert to number of words
                          00059
 003C D170                00060         movb    *R0+,R5         ; Get unaligned byte
                          00061
 003E                     00062 SrcLoop
 003E C1B0                00063         mov     *R0+,R6         ; Get next word
                          00064
 0040 06C6                00065         swpb    R6              ; Assemble dest word
 0042 C1C6                00066         mov     R6,R7
 0044 D185                00067         movb    R5,R6
                          00068
 0046 CC46                00069         mov     R6,*R1+         ; Store word
                          00070
 0048 D147                00071         movb    R7,R5           ; Ready for next
                          00072
 004A 0602                00073         dec     R2
 004C 16F8 (003E)         00074         jne     SrcLoop
                          00075
 004E 24C4                00076         czc     R4,R3           ; One byte left over?
 0050 13DD (000C)         00077         jeq     MoveDone        ; No
                          00078
 0052 DC45                00079         movb    R5,*R1+         ; Store the last byte
                          00080
 0054 045B                00081         b       *R11            ; Return
                          00082
 0056                     00083 DestAligned
                          00084 *
                          00085 * Why bother?  The overhead is too bad.
                          00086 *

Very nice, but all of your CZC instructions after the first one need the operands reversed, I think.

...lee

BillG · September 14, 2020

I thought that CZC is just fancy 9900-ese for do an AND, set the EQ flag appropriately and throw away the result.

If not, then you are right that the order matters.

+Lee Stewart · September 14, 2020

21 minutes ago, BillG said:

I thought that CZC is just fancy 9900-ese for do an AND, set the EQ flag appropriately and throw away the result.

If not, then you are right that the order matters.

Yeah, it looks for zeros in the destination operand that correspond to ones in the source operand.

...lee

apersson850 · September 14, 2020

5 hours ago, wierd_w said:

I would think the best option would be to strive to assure that your data never is anything other than word boundary aligned from the beginning.

Obviously, but that wasn't the case in the original task here.

Note that there's no problem with an odd number of bytes to transfer. You can always handle the odd one with a MOVB.

Note that there's no problem with data starting at an odd address, or ending at an even, or both for that matter, as long as the destination does too. You can always handle the single bytes at the beginning and/or end by a MOVB, and handle the brunt of the data with MOV.

No, the problem is only when the source starts at an even address and the destination at an odd, or the opposite. In such a case, whatever you do will be less efficient than simply do the whole transfer using MOVB for every byte.

+FarmerPotato · September 15, 2020

17 hours ago, apersson850 said:

No, the problem is only when the source starts at an even address and the destination at an odd, or the opposite. In such a case, whatever you do will be less efficient than simply do the whole transfer using MOVB for every byte.

I did a cycle analysis of 4 versions of the inner loop.

Assume the source is even the dest is odd. Assume registers and code are in PAD (no wait state) and data is in expansion memory or cartridge.

         MOV   MOV4    MOVB   MOVB4
ALU       48    43      13     34
PAD       40    37       9     24
MEM        8     8       3     12
Cycles   240   224      68    212
per byte  60    56      68     53

where Cycles = 2*ALU + 2*PAD + (2+W)*MEM

MOVB4 wins

I did not realize that even MOV does read-before-write. The source and destination evaluation work the same--both do a read from the effective address.

Code

MOVB

loop:
 movb  *r0+,*r1+
 dec   r2
 jne   loop

MOVB x4, loop unrolled 4 times
loop:
 movb  *r0+,*r1+
 movb  *r0+,*r1+
 movb  *r0+,*r1+
 movb  *r0+,*r1+
 ai    r2,-4
 jne   loop

MOV

copy:
* cache first byte
 movb *src+,r3
loop:
* fill two src words:
* grab a word
 mov  *src+,r4
 movb r4,@r3lb
 mov  r3,*dst+
 dect cnt
 jeq done
 swpb r4
* grab a word
 mov  *src+,r3
 movb r3,@r4lb
 mov  r4,*dst+
 swpb r3
 dect cnt
 jne  loop
done:


MOV4

copy:
* cache first byte
 movb *src+,r3
loop:
* fill two src words:
* grab a word
 mov  *src+,r4
 movb r4,@r3lb
 mov  r3,*dst+
 swpb r4
* grab a word
 mov  *src+,r3
 movb r3,@r4lb
 mov  r4,*dst+
 swpb r3
 ai   r2,-4
 jne  loop
done:

To see all the equates and stuff, open the attached file.

copy.a99

Tursi · September 15, 2020

But MOV and MOV4 will write to the wrong address... assuming DST is odd, it's not incremented before the first MOV, only (src) is, so the mov r3,*dst+ will actually write to (dst-1). Actually, since the assumption was src is even, and a movb is used to increment it, the MOV will actually read from the wrong address too, and you'll get two copies of the first byte written starting at (dst-1). The gotcha that hit many of us was remembering that MOV is incapable of accessing an odd address, it will always truncate to 15 bits.


* assume src = A000 (bytes 11,22,33,44), dst = B001

* cache first byte
 movb *src+,r3            * read A000, r3=11xx, src=A001
loop:
* fill two src words:
* grab a word
 mov  *src+,r4            * read A000, R4=1122, src=A003
 movb r4,@r3lb            * MSB 11, r3=1111
 mov  r3,*dst+            * write 1111 to B000, dst=B003

~~Even though the addresses are right, MOV is not capable of accessing them.~~

I decided to run it a little farther, cause it seemed silly to me that you'd miss that... and as long as you can overlook the first byte being early... the data does land in the right place. Sorry about that! Fire alarm testing today, that's my excuse.

Edited September 15, 2020 by Tursi

+FarmerPotato · September 15, 2020

39 minutes ago, Tursi said:
But MOV and MOV4 will write to the wrong address... assuming DST is odd, it's not incremented before the first MOV, only (src) is, so the mov r3,*dst+ will actually write to (dst-1). Actually, since the assumption was src is even, and a movb is used to increment it, the MOV will actually read from the wrong address too, and you'll get two copies of the first byte written starting at (dst-1). The gotcha that hit many of us was remembering that MOV is incapable of accessing an odd address, it will always truncate to 15 bits.
* assume src = A000 (bytes 11,22,33,44), dst = B001

* cache first byte
 movb *src+,r3            * read A000, r3=11xx, src=A001
loop:
* fill two src words:
* grab a word
 mov  *src+,r4            * read A000, R4=1122, src=A003
 movb r4,@r3lb            * MSB 11, r3=1111
 mov  r3,*dst+            * write 1111 to B000, dst=B003
~~Even though the addresses are right, MOV is not capable of accessing them.~~

I decided to run it a little farther, cause it seemed silly to me that you'd miss that... and as long as you can overlook the first byte being early... the data does land in the right place. Sorry about that! Fire alarm testing today, that's my excuse.

Thanks for your understanding.

I goofed up and posted a src odd, dst even copy. The first MOVB aligns the source pointer to even, while caching the first byte.

I see that it gives the same result whether src is odd or even. But if dst is odd, the result is even.

My test code was:


 li  src,>a001
 li  dst,>b000
 li  r2,>400
 bl  @copy

apersson850 · September 16, 2020

I haven't actually done it, but my gut feeling is that you make two separate routines. One with MOV, one with MOVB. Start with a test figuring out which to use. For the MOV routine, handle an odd byte at the start and/or end with one or two MOVB.

+TheBF · September 16, 2020

8 hours ago, apersson850 said:

I haven't actually done it, but my gut feeling is that you make two separate routines. One with MOV, one with MOVB. Start with a test figuring out which to use. For the MOV routine, handle an odd byte at the start and/or end with one or two MOVB.

I think your gut is good.

An alternative is write both and call the correct code where needed, cuz you kept your data organized.

Patient: "Doctor it hurts when I do this"

Doctor: "Then don't do that!"

Asmusr · September 16, 2020

Are you suggesting to solve the problem by solving it somewhere else? ?

apersson850 · September 16, 2020

5 hours ago, TheBF said:

An alternative is write both and call the correct code where needed, cuz you kept your data organized.

Well, that is what I meant. If the program must work with both aligned and non-aligned data movement, then write a procedure, which contains both methods, and at each invocation, determine which one to call. Then you all the time get the best performance, depending on the circumstances. And if you run out of memory, you can revert to MOVB only.

Assembly challenge: copy routine

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members