Part 1 - Introduction
Part 2 - Learn 82.7% of Assembly Language in About Three Pages
Part 3 - The World Inside a USR() Routine
Part 4 - Implement DPEEK()
Part 5 - Implement DPOKE
Part 6 - Various Bit Manipulations
Part 7 - Convert Integer to Hex String
Part 8 - Convert Integer to Bit String
Part 9 - Memory Copy
Part 10 - Binary File I/O Part 1 (XIO is Broken)
Part 11 - Binary File I/O Part 2 (XIO is Broken)
A number of features on the Atari benefit from fast memory copies. High speed data copying provides convenience to the user by reducing wait times for actions that would use slow, time-consuming BASIC loops. Copying a character set (1024 bytes) is a good example. High speed memory moves also enable use of Atari features that are not otherwise possible due to BASIC's speed. For example, vertical movement and animation of Player/Missile graphics in BASIC is not realistic, but a memory move function makes P/M graphics animation practical in an Atari BASIC program.
Though Atari BASIC lacks a general purpose memory move there is a hack around this problem that exploits the way Atari BASIC manages strings. As simply as possible: A BASIC program may dig through the variable control structures and change the starting address of a string variable, effectively overlaying the string on any memory possible: screen memory, player/missile memory, color registers, character sets, hardware addresses, etc. String assignments then cause memory to be assigned or copied at near machine language speeds. The copy is done in ascending order and there is no consideration for source and destination memory (strings) that overlap. However, this string abuse method is a subject for an entirely different discussion and is not what will be done here.
The subject of memory moves on the 6502 includes many considerations and is the inspiration for numerous discussions, arguments, and holy wars between programmers around the subject of speed, efficiency, optimization, and the odors emanating from the grandmothers of those who disagree with one's obvious and indisputable common sense. In short, although copying from A to B seems like a simple subject, there are a many choices and different methods to arrange a 6502 algorithm to copy memory. Topics that make a difference in code:
- How many bytes are being copied? More or less than 128 bytes? More or less than 256 bytes? More or less than 32K bytes?
- Does the code need to consider whether or not the source and destination memory overlap?
- Should the memory be copied in ascending or descending order?
- Is speed important? Should loops be unrolled into explicit code?
- Is the size of the code more important than speed?
All of the above can start a group of programmers on a never-ending flame war. For the sake of defusing arguments I will admit that the result created here will not be everyone's idea of best, fastest and most efficient. We will follow the rule, “moderation in all things.” Well, more of a guideline than a rule.
Let's say that we are copying four bytes from a source to a destination. The fastest way to copy is to do it directly and explicitly:
LDA SOURCE STA DEST LDA SOURCE+1 STA DEST+1 LDA SOURCE+2 STA DEST+2 LDA SOURCE+3 STA DEST+3
That's about as fast as it gets on the 6502. Depending on whether or not source or destination is in Page Zero, this code uses about 4 to 6 bytes worth of instructions (total of 16 to 24 bytes just for this example!) and 6 to 8 cycles of CPU time for each byte it moves. That's all the good about it. Now the reality check: The code is not general purpose. When source and destination change then another block of code must be created. This method makes the assembly source code for a memory move of any substantial size very wordy, not to mention the tedious typing for the programmer. An explicit copy like this is done when performance is the ultimate goal.
The next level of optimization is a loop to move a range of bytes:
LDX #$00 LOOP LDA SOURCE,X STA DEST,X INX CPX #$04 BNE LOOP
This code uses an index to copy bytes in ascending order and makes a comparison for the number of bytes. The code occupies 9 to 11 bytes (without the LDX initialization), roughly half the size of the explicit method. The size efficiency becomes greater the more bytes are copied. This code can copy 4 bytes, 10 bytes, or 100 bytes and remains the same size, while the earlier explicit method expands with every byte. However, this code uses 15 to 17 CPU cycles (not including the LDX initialization which occurs only once) for every byte it copies which is more than twice as long as the explicit copy. Additionally, this still uses absolute addresses and is not a general purpose, reusable routine.
Here is a little optimization on this method:
LDX #$03 LOOP LDA SOURCE,X STA DEST,X DEX BPL LOOP
This version uses an index to copy bytes in descending order and eliminates the compare instruction by relying on the CPU negative flag changing when the index decrements from $00 to $FF. This reduces the code size to 7 to 9 bytes without the LDX initialization. Execution time (without the LDX) is reduced to 13 to 15 cycles per byte copied – just about twice as long at the explicit method. Again, like all other examples this uses absolute addresses and so is not a general purpose, reusable routine.
A general purpose routine must use indirection instructions allowing the source and destination addresses to be different each time the program uses the routine. This means Page Zero locations must be initialized for source and destination:.
LDA # <SOURCE STA ZSOURCE LDA # >SOURCE+1 STA ZSOURCE+1 LDA # <DEST STA ZDEST LDA # >DEST+1 STA ZDEST+1 ; LDY #$03 LOOP LDA (ZSOURCE),Y STA (ZDEST),Y DEY BPL LOOP
This version is not so short. The initialization (executed once) is 18 bytes long. While the actual looping section is only 7 bytes it executes in 15 to 17 cycles due to the indirection. The advantage here is that the looping section is reusable.
Another consideration of the indexed, looping methods is that they can copy a limited number of bytes. The maximum number of bytes a loop can copy using the X or Y register as an index is 256 bytes ($00 to $FF).
The method of comparison or detection of the end of the copy loop can also limit the number of bytes. An earlier example copies in descending order uses BPL to detect the index wrapping around from $00, a positive value, to $FF, a negative value. This means the branching evaluation cannot encounter any other negative value which limits this code 129 bytes. 129? not 128? Why?
LDX #$80 ; Start at +128 LOOP LDA SOURCE,X STA DEST,X DEX ; From here X as $7F descending BPL LOOP ; to $00 is valid
It is possible to copy a specific number of bytes less than 256 in ascending order without using a comparison instruction, but, this is less obvious. This code example below alters the starting address to compensate for the starting index, so that the branch decision can rely on the CPU flag change when the index reaches zero:
; $00 - $04 = $FC LDX #$FC LOOP LDA SOURCE-$FC,X STA DEST-$FC,X INX ; Continues for $FD, $FE, $FF BNE LOOP ; and quits at index $00
So, then what do do when more there are more than 256 bytes to copy? In the simple loop examples just add another loop that copies more bytes with different source and destination addresses. But, when there is a lot of memory to copy this quickly becomes redundant.
The advantage of using the indirection method is that the source and destination addresses in Page Zero can be easily modified and used to repeat the loop again. The example below copies 1024 bytes:
LDX #$04 ; Number of 256 byte pages LDY #$00 LOOP LDA (ZSOURCE),Y STA (ZDEST),Y INY BNE LOOP ; Copy a page INC ZSOURCE+1 ; Increment addresses' INC ZDEST+1 ; high bytes DEX BNE LOOP ; Do another page
The looping part is 14 bytes long. It can copy any number of 256 byte pages just by changing the X register. This is a practical routine for the Atari which has character sets 512 or 1024 bytes long. Player/missile graphics may use memory maps of 256 bytes. More generally, aligning any data to fit within the range of an index register allows reasonably compact code like this.
The next problem is how to copy any number of bytes. Simply change perspective and think of the problem in two parts. First, if the number of bytes is 256 or greater then the previous page copy shown above can take care of all the whole, 256-byte pages. The high byte of the size provides the number of whole 256 byte pages. That leaves the size's low byte which specifies the remaining zero to 255 bytes. This second part just needs a slightly different byte copying loop that will stop early at a specific count. Assuming the source, destination, and size are all set into designated page 0 locations, then the relocatable, re-usable code could look something like this:
MEMMOVE in Mac/65 Assembler Code
0100 ; MEMMOVE.M65 0105 ; 0110 ; GENERAL MEMORY MOVE 0115 ; 0120 ; GENERIC MEMORY MOVE FROM 0125 ; SOURCE TO DESTINATION 0130 ; ASCENDING. 0135 ; 0140 ; USR 3 ARGUMENTS: 0145 ; SOURCE == FROM ADDRESS 0150 ; DEST == TO ADDRESS 0155 ; SIZE == NUMBER OF BYTES 0160 ; 0165 ; RETURN VALUE IS BYTES COPIED. 0170 ; 0175 ZRET = $D4 ; FR0 $D4/$D5 Return Value 0180 ZARGS = $D5 ; $D6-1 for arg Pulldown loop 0185 ZSIZE = $D6 ; FR0 $D6/$D7 Size 0190 ZDEST = $D8 ; FR1 $D8/$D9 Destination 0195 ZSOURCE = $DA ; FR1 $DA/$DB Source 0200 ; 0205 .OPT OBJ 0210 ; 0215 *= $9200 ; Arbitrary. this is relocatable 0220 ; 0225 INIT 0230 LDY #$00 ; Make the return 0235 STY ZRET ; value clear to 0 0240 STY ZRET+1 ; by default. 0245 PLA ; Get argument count 0250 BEQ BYE ; Shortcut for no args 0255 ASL A ; Now number of bytes 0260 TAY 0265 CMP #$06 ; Source, Dest, Size 0270 BEQ PULLDOWN 0275 ; 0280 ; Bad arg count. Clean up for exit. 0285 ; 0290 DISPOSE ; Any number of arguments 0295 PLA 0300 DEY 0305 BNE DISPOSE 0310 RTS ; Abandon ship. 0315 ; 0320 ; Pull args into Page Zero. 0325 ; This code works the same 0330 ; for 1, 4, 8... arguments. 0335 ; 0340 PULLDOWN ; arguments in Y 0345 PLA 0350 STA ZARGS,Y 0355 DEY 0360 BNE PULLDOWN 0365 ; 0370 ; Since we're good to start the 0375 ; copy, then set the return 0380 ; value to the size. 0385 ; 0390 LDA ZSIZE+1 0395 STA ZRET+1 0400 LDA ZSIZE 0405 STA ZRET 0410 ; 0415 ; Moving full 256 byte pages or not? 0420 ; 0425 LDY #0 0430 LDX ZSIZE+1 ; Number of pages is 0435 BEQ MOVEPARTIAL ; Zero so, try partial. 0440 ; 0445 ; Copy full index range of 256 bytes. 0450 ; 0455 MOVEPAGE 0460 LDA (ZSOURCE),Y 0465 STA (ZDEST),Y 0470 INY 0475 BNE MOVEPAGE ; Y rolled $FF to $00 0480 INC ZSOURCE+1 ; Next page src 0485 INC ZDEST+1 ; next page dst 0490 DEX ; this page done 0495 BNE MOVEPAGE ; Non-zero means more 0500 ; 0505 ; A partial page remains? 0510 ; 0515 MOVEPARTIAL 0520 LDX ZSIZE ; Low byte remainder. 0525 BEQ BYE ; Zero, exit 0530 ; 0535 MOVEMEM 0540 LDA (ZSOURCE),Y 0545 STA (ZDEST),Y 0550 INY ; Copy ascending. 0555 DEX ; and subtract count. 0560 BNE MOVEMEM 0565 BYE 0570 RTS 0575 ; 0580 .END
MEMMOVE copies any number of bytes in ascending order from the source to destination. Earlier the article mentioned an Atari BASIC string hack that assigns strings to specific memory addresses and uses the string assignment action to copy from a source address to a target address. String assignments work in ascending order. Therefore, this routine achieves the same result as the string method (ascending copy) with a bit less hacking of the BASIC variable table. Copying in reverse order or automatically detecting source and target overlap would add more code for options that are not as frequently needed.
This is certainly not the highest performing option possible. But, it is fairly compact and uses Page zero allowing the routine to be general-purpose, and reusable, and that is a reasonable goal for a routine to support Atari BASIC. If this discussion concerned writing an entire video game in assembly then the code would focus on execution time and use unrolled loops or other bells and whistles for copying faster at the expense of code size.
The Atari BASIC program below, TESTMOVE.BAS, reasonably verifies operation of the memory move utility works as expected:
100 REM TEST MEMORY MOVE OPERATIONS 105 REM 110 GRAPHICS 0:POKE 710,0:POKE 82,0 115 DIM S$(258),D$(259) 120 GOSUB 503:REM RESET SRC AND DST 125 GOSUB 10000:REM LOAD MEMMOVE 130 REM RUN TESTS FOR 8,255,256,257 135 RESTORE 230 140 READ MSIZE 145 IF MSIZE<1 THEN ? "Done":END 150 ? "Testing move size ";MSIZE;" . . . "; 155 X=USR(MMOV,ADR(S$),ADR(D$)+1,MSIZE) 160 DCOUNT=0 165 FOR I=1 TO 259 170 IF D$(I,I)="*" THEN DCOUNT=DCOUNT+1 175 NEXT I 180 IF DCOUNT=MSIZE THEN ? "OK":GOTO 190 185 ? "FAILED! ";DCOUNT;" <> ";MSIZE 190 GOSUB 504:REM RESET DST 195 GOTO 140 200 REM 205 REM NUMBER OF BYTES TO MOVE FOR 210 REM EACH TEST. TESTS LESS THAN 215 REM ONE PAGE AND THE BORDER 220 REM CONDITIONS AROUND ONE PAGE. 225 REM 230 DATA 8,255,256,257,-1 235 END 500 REM 501 REM RESET MEMORY 502 REM 503 S$="*":S$(258)="*":S$(2)=S$ 504 D$=".":D$(259)=".":D$(2)=D$ 505 RETURN 9997 REM 9998 REM SETUP ML MEMMOVE UTILITY 9999 REM 10000 DIM MM$(68) 10001 MMOV=ADR(MM$) 10002 RESTORE 24000:? "Loading MMOV..." 10003 FOR I=0 TO 67 10004 READ D:POKE MMOV+I,D 10005 NEXT I:? 10006 RETURN 23996 REM H1:MEMMOVE.OBJ 23997 REM SIZE = 68 23998 REM START = 37376 23999 REM END = 37443 24000 DATA 160,0,132,212,132,213,104,240 24001 DATA 58,10,168,201,6,240,5,104 24002 DATA 136,208,252,96,104,153,213,0 24003 DATA 136,208,249,165,215,133,213,165 24004 DATA 214,133,212,160,0,166,215,240 24005 DATA 14,177,218,145,216,200,208,249 24006 DATA 230,219,230,217,202,208,242,166 24007 DATA 214,240,8,177,218,145,216,200 24008 DATA 202,208,248,96
The program tests different sizes to verify the partial and full page copy loops, and also tests the border conditions around a full page copy. After each each copy it tests the contents of the destination memory counting all the locations that were changed by the memory move. Successful output will look like this:
Below are the source files and examples of how to load the machine language routine into BASIC included in the disk image and archive:
MEMMOVE File List:
MEMMOVE.M65 Saved Mac/65 source
MEMMOVE.L65 Mac/65 source listing
MEMMOVE.T65 Mac/65 source listed to H6: (linux)
MEMMOVE.ASM Mac/65 assembly listing
MEMMOVE.TSM Mac/65 assembly listing to H6: (linux)
MEMMOVE.OBJ Mac/65 assembled machine language program (with load segments)
MEMMOVE.BIN Assembled machine language program without load segments
MEMMOVE.LIS LISTed DATA statements for MEMMOVE.BIN routine.
MEMMOVE.TLS LISTed DATA statements for MEMMOVE.BIN routine to H6: (linux)
MAKEMMOV.BAS BASIC program to create the MEMMOVE.BIN file. This also contains the MEMMOVE routine in DATA statements.
MAKEMMOV.LIS LISTed version of MAKEMMOV.BAS
MAKEMMOV.TLS LISTed version of MAKEMMOV.BAS to H6: (linux)
TESTMOVE.BAS BASIC program that tests the MEMMOVE USR() routine.
TESTMOVE.LIS LISTed version of TESTMOVE.BAS.
TESTMOVE.TLS LISTed version of TESTMOVE.BAS to H6: (linux)
ZIP archive of files:
Tar archive of files (remove the .zip after download)
Great peace have those who love your law; nothing can make them stumble.