Jump to content
  • entries
    15
  • comments
    7
  • views
    14,450

Part 9 of 11 -- Simple Assembly for Atari BASIC - Memory Copy


kenjennings

1,307 views

Memory Copy

==============================================================

Part 1 - Introduction
http://atariage.com/forums/blog/576/entry-13175-part-1-of-11-simple-assembly-for-atari-basic/

Part 2 - Learn 82.7% of Assembly Language in About Three Pages
http://atariage.com/forums/blog/576/entry-13176-part-2-of-11-simple-assembly-for-atari-basic/

Part 3 - The World Inside a USR() Routine
http://atariage.com/forums/blog/576/entry-13177-part-3-of-11-simple-assembly-for-atari-basic/

Part 4 - Implement DPEEK()
http://atariage.com/forums/blog/576/entry-13178-part-4-of-11-simple-assembly-for-atari-basic/

Part 5 - Implement DPOKE
http://atariage.com/forums/blog/576/entry-13180-part-5-of-11-simple-assembly-for-atari-basic/

Part 6 - Various Bit Manipulations
http://atariage.com/forums/blog/576/entry-13181-part-6-of-11-simple-assembly-for-atari-basic/

Part 7 - Convert Integer to Hex String
http://atariage.com/forums/blog/576/entry-13182-part-7-of-11-simple-assembly-for-atari-basic/

Part 8 - Convert Integer to Bit String
http://atariage.com/forums/blog/576/entry-13183-part-8-of-11-simple-assembly-for-atari-basic/

Part 9 - Memory Copy
http://atariage.com/forums/blog/576/entry-13184-part-9-of-11-simple-assembly-for-atari-basic/

Part 10 - Binary File I/O Part 1 (XIO is Broken)
http://atariage.com/forums/blog/576/entry-13185-part-10-of-11-simple-assembly-for-atari-basic/

Part 11 - Binary File I/O Part 2 (XIO is Broken)
http://atariage.com/forums/blog/576/entry-13186-part-11-simple-assembly-for-atari-basic-the-end/

==============================================================


A number of features on the Atari benefit from fast memory copies. High speed data copying provides convenience to the user by reducing wait times for actions that would use slow, time-consuming BASIC loops. Copying a character set (1024 bytes) is a good example. High speed memory moves also enable use of Atari features that are not otherwise possible due to BASIC's speed. For example, vertical movement and animation of Player/Missile graphics in BASIC is not realistic, but a memory move function makes P/M graphics animation practical in an Atari BASIC program.

Though Atari BASIC lacks a general purpose memory move there is a hack around this problem that exploits the way Atari BASIC manages strings. As simply as possible: A BASIC program may dig through the variable control structures and change the starting address of a string variable, effectively overlaying the string on any memory possible: screen memory, player/missile memory, color registers, character sets, hardware addresses, etc. String assignments then cause memory to be assigned or copied at near machine language speeds. The copy is done in ascending order and there is no consideration for source and destination memory (strings) that overlap. However, this string abuse method is a subject for an entirely different discussion and is not what will be done here.

The subject of memory moves on the 6502 includes many considerations and is the inspiration for numerous discussions, arguments, and holy wars between programmers around the subject of speed, efficiency, optimization, and the odors emanating from the grandmothers of those who disagree with one's obvious and indisputable common sense. In short, although copying from A to B seems like a simple subject, there are a many choices and different methods to arrange a 6502 algorithm to copy memory. Topics that make a difference in code:

  • How many bytes are being copied? More or less than 128 bytes? More or less than 256 bytes? More or less than 32K bytes?

 

  • Does the code need to consider whether or not the source and destination memory overlap?

 

  • Should the memory be copied in ascending or descending order?

 

  • Is speed important? Should loops be unrolled into explicit code?

 

  • Is the size of the code more important than speed?




All of the above can start a group of programmers on a never-ending flame war. For the sake of defusing arguments I will admit that the result created here will not be everyone's idea of best, fastest and most efficient. We will follow the rule, “moderation in all things.” Well, more of a guideline than a rule.


Let's say that we are copying four bytes from a source to a destination. The fastest way to copy is to do it directly and explicitly:

    LDA SOURCE
    STA DEST
    LDA SOURCE+1
    STA DEST+1
    LDA SOURCE+2
    STA DEST+2
    LDA SOURCE+3
    STA DEST+3
 


That's about as fast as it gets on the 6502. Depending on whether or not source or destination is in Page Zero, this code uses about 4 to 6 bytes worth of instructions (total of 16 to 24 bytes just for this example!) and 6 to 8 cycles of CPU time for each byte it moves. That's all the good about it. Now the reality check: The code is not general purpose. When source and destination change then another block of code must be created. This method makes the assembly source code for a memory move of any substantial size very wordy, not to mention the tedious typing for the programmer. An explicit copy like this is done when performance is the ultimate goal.


The next level of optimization is a loop to move a range of bytes:

    LDX #$00
LOOP
    LDA SOURCE,X
    STA DEST,X
    INX
    CPX #$04
    BNE LOOP     
 


This code uses an index to copy bytes in ascending order and makes a comparison for the number of bytes. The code occupies 9 to 11 bytes (without the LDX initialization), roughly half the size of the explicit method. The size efficiency becomes greater the more bytes are copied. This code can copy 4 bytes, 10 bytes, or 100 bytes and remains the same size, while the earlier explicit method expands with every byte. However, this code uses 15 to 17 CPU cycles (not including the LDX initialization which occurs only once) for every byte it copies which is more than twice as long as the explicit copy. Additionally, this still uses absolute addresses and is not a general purpose, reusable routine.

Here is a little optimization on this method:

    LDX #$03  
LOOP
    LDA SOURCE,X
    STA DEST,X
    DEX
    BPL LOOP
 


This version uses an index to copy bytes in descending order and eliminates the compare instruction by relying on the CPU negative flag changing when the index decrements from $00 to $FF. This reduces the code size to 7 to 9 bytes without the LDX initialization. Execution time (without the LDX) is reduced to 13 to 15 cycles per byte copied – just about twice as long at the explicit method. Again, like all other examples this uses absolute addresses and so is not a general purpose, reusable routine.

A general purpose routine must use indirection instructions allowing the source and destination addresses to be different each time the program uses the routine. This means Page Zero locations must be initialized for source and destination:.

    LDA # <SOURCE
    STA ZSOURCE
    LDA # >SOURCE+1
    STA ZSOURCE+1
    LDA # <DEST
    STA ZDEST
    LDA # >DEST+1
    STA ZDEST+1
    ;
    LDY #$03
LOOP
    LDA (ZSOURCE),Y
    STA (ZDEST),Y
    DEY
    BPL LOOP
 


This version is not so short. The initialization (executed once) is 18 bytes long. While the actual looping section is only 7 bytes it executes in 15 to 17 cycles due to the indirection. The advantage here is that the looping section is reusable.

Another consideration of the indexed, looping methods is that they can copy a limited number of bytes. The maximum number of bytes a loop can copy using the X or Y register as an index is 256 bytes ($00 to $FF).

The method of comparison or detection of the end of the copy loop can also limit the number of bytes. An earlier example copies in descending order uses BPL to detect the index wrapping around from $00, a positive value, to $FF, a negative value. This means the branching evaluation cannot encounter any other negative value which limits this code 129 bytes. 129? not 128? Why?

    LDX #$80 ; Start at +128  
LOOP    
    LDA SOURCE,X
    STA DEST,X
    DEX       ; From here X as $7F descending
    BPL LOOP  ; to $00 is valid
 


It is possible to copy a specific number of bytes less than 256 in ascending order without using a comparison instruction, but, this is less obvious. This code example below alters the starting address to compensate for the starting index, so that the branch decision can rely on the CPU flag change when the index reaches zero:

    ; $00 - $04 = $FC 
    
    LDX #$FC
LOOP
    LDA SOURCE-$FC,X
    STA DEST-$FC,X
    INX      ; Continues for $FD, $FE, $FF
    BNE LOOP ; and quits at index $00
 



So, then what do do when more there are more than 256 bytes to copy? In the simple loop examples just add another loop that copies more bytes with different source and destination addresses. But, when there is a lot of memory to copy this quickly becomes redundant.

The advantage of using the indirection method is that the source and destination addresses in Page Zero can be easily modified and used to repeat the loop again. The example below copies 1024 bytes:

    LDX #$04 ; Number of 256 byte pages    
    LDY #$00  
LOOP    
    LDA (ZSOURCE),Y    
    STA (ZDEST),Y    
    INY    
    BNE LOOP      ; Copy a page    
    INC ZSOURCE+1 ; Increment addresses'    
    INC ZDEST+1   ; high bytes    
    DEX    
    BNE LOOP      ; Do another page
 



The looping part is 14 bytes long. It can copy any number of 256 byte pages just by changing the X register. This is a practical routine for the Atari which has character sets 512 or 1024 bytes long. Player/missile graphics may use memory maps of 256 bytes. More generally, aligning any data to fit within the range of an index register allows reasonably compact code like this.

The next problem is how to copy any number of bytes. Simply change perspective and think of the problem in two parts. First, if the number of bytes is 256 or greater then the previous page copy shown above can take care of all the whole, 256-byte pages. The high byte of the size provides the number of whole 256 byte pages. That leaves the size's low byte which specifies the remaining zero to 255 bytes. This second part just needs a slightly different byte copying loop that will stop early at a specific count. Assuming the source, destination, and size are all set into designated page 0 locations, then the relocatable, re-usable code could look something like this:

MEMMOVE in Mac/65 Assembler Code

0100 ; MEMMOVE.M65
0105 ;
0110 ; GENERAL MEMORY MOVE
0115 ;
0120 ; GENERIC MEMORY MOVE FROM
0125 ; SOURCE TO DESTINATION
0130 ; ASCENDING.
0135 ;
0140 ; USR 3 ARGUMENTS:
0145 ; SOURCE == FROM ADDRESS
0150 ; DEST   == TO ADDRESS
0155 ; SIZE   == NUMBER OF BYTES
0160 ;
0165 ; RETURN VALUE IS BYTES COPIED.
0170 ;
0175 ZRET =  $D4     ; FR0 $D4/$D5 Return Value
0180 ZARGS = $D5     ; $D6-1 for arg Pulldown loop
0185 ZSIZE = $D6     ; FR0 $D6/$D7 Size
0190 ZDEST = $D8     ; FR1 $D8/$D9 Destination
0195 ZSOURCE = $DA   ; FR1 $DA/$DB Source
0200 ;
0205     .OPT OBJ
0210 ;
0215     *=  $9200   ; Arbitrary. this is relocatable
0220 ;
0225 INIT
0230     LDY #$00    ; Make the return
0235     STY ZRET    ; value clear to 0
0240     STY ZRET+1  ; by default.
0245     PLA         ; Get argument count
0250     BEQ BYE     ; Shortcut for no args
0255     ASL A       ; Now number of bytes
0260     TAY
0265     CMP #$06    ; Source, Dest, Size
0270     BEQ PULLDOWN
0275 ;
0280 ; Bad arg count.  Clean up for exit.
0285 ;
0290 DISPOSE ;       Any number of arguments
0295     PLA
0300     DEY
0305     BNE DISPOSE
0310     RTS         ; Abandon ship.
0315 ;
0320 ; Pull args into Page Zero.
0325 ; This code works the same
0330 ; for 1, 4, 8... arguments.
0335 ;
0340 PULLDOWN ;      arguments in Y
0345     PLA
0350     STA ZARGS,Y
0355     DEY
0360     BNE PULLDOWN
0365 ;
0370 ; Since we're good to start the
0375 ; copy, then set the return
0380 ; value to the size.
0385 ;
0390     LDA ZSIZE+1
0395     STA ZRET+1
0400     LDA ZSIZE
0405     STA ZRET
0410 ;
0415 ; Moving full 256 byte pages or not?
0420 ;
0425     LDY #0
0430     LDX ZSIZE+1 ; Number of pages is
0435     BEQ MOVEPARTIAL ; Zero so, try partial.
0440 ;
0445 ; Copy full index range of 256 bytes.
0450 ;
0455 MOVEPAGE
0460     LDA (ZSOURCE),Y
0465     STA (ZDEST),Y
0470     INY
0475     BNE MOVEPAGE ; Y rolled $FF to $00
0480     INC ZSOURCE+1 ; Next page src
0485     INC ZDEST+1 ; next page dst
0490     DEX         ; this page done
0495     BNE MOVEPAGE ; Non-zero means more
0500 ;
0505 ; A partial page remains?
0510 ;
0515 MOVEPARTIAL
0520     LDX ZSIZE ; Low byte remainder.
0525     BEQ BYE   ; Zero, exit
0530 ;
0535 MOVEMEM
0540     LDA (ZSOURCE),Y
0545     STA (ZDEST),Y
0550     INY       ; Copy ascending.
0555     DEX       ; and subtract count.
0560     BNE MOVEMEM
0565 BYE
0570     RTS
0575 ;
0580     .END
 



MEMMOVE copies any number of bytes in ascending order from the source to destination. Earlier the article mentioned an Atari BASIC string hack that assigns strings to specific memory addresses and uses the string assignment action to copy from a source address to a target address. String assignments work in ascending order. Therefore, this routine achieves the same result as the string method (ascending copy) with a bit less hacking of the BASIC variable table. Copying in reverse order or automatically detecting source and target overlap would add more code for options that are not as frequently needed.

This is certainly not the highest performing option possible. But, it is fairly compact and uses Page zero allowing the routine to be general-purpose, and reusable, and that is a reasonable goal for a routine to support Atari BASIC. If this discussion concerned writing an entire video game in assembly then the code would focus on execution time and use unrolled loops or other bells and whistles for copying faster at the expense of code size.



Testing MEMMOVE

The Atari BASIC program below, TESTMOVE.BAS, reasonably verifies operation of the memory move utility works as expected:

100 REM TEST MEMORY MOVE OPERATIONS
105 REM
110 GRAPHICS 0:POKE 710,0:POKE 82,0
115 DIM S$(258),D$(259)
120 GOSUB 503:REM RESET SRC AND DST
125 GOSUB 10000:REM LOAD MEMMOVE
130 REM RUN TESTS FOR 8,255,256,257
135 RESTORE 230
140 READ MSIZE
145 IF MSIZE<1 THEN ? "Done":END
150 ? "Testing move size ";MSIZE;" . . . ";
155 X=USR(MMOV,ADR(S$),ADR(D$)+1,MSIZE)
160 DCOUNT=0
165 FOR I=1 TO 259
170 IF D$(I,I)="*" THEN DCOUNT=DCOUNT+1
175 NEXT I
180 IF DCOUNT=MSIZE THEN ? "OK":GOTO 190
185 ? "FAILED!  ";DCOUNT;" <> ";MSIZE
190 GOSUB 504:REM RESET DST
195 GOTO 140
200 REM
205 REM NUMBER OF BYTES TO MOVE FOR
210 REM EACH TEST.  TESTS LESS THAN
215 REM ONE PAGE AND THE BORDER
220 REM CONDITIONS AROUND ONE PAGE.
225 REM
230 DATA 8,255,256,257,-1
235 END
500 REM
501 REM RESET MEMORY
502 REM
503 S$="*":S$(258)="*":S$(2)=S$
504 D$=".":D$(259)=".":D$(2)=D$
505 RETURN
9997 REM
9998 REM SETUP ML MEMMOVE UTILITY
9999 REM
10000 DIM MM$(68)
10001 MMOV=ADR(MM$)
10002 RESTORE 24000:? "Loading MMOV..."
10003 FOR I=0 TO 67
10004 READ D:POKE MMOV+I,D
10005 NEXT I:?
10006 RETURN
23996 REM H1:MEMMOVE.OBJ
23997 REM SIZE  = 68
23998 REM START = 37376
23999 REM END   = 37443
24000 DATA 160,0,132,212,132,213,104,240
24001 DATA 58,10,168,201,6,240,5,104
24002 DATA 136,208,252,96,104,153,213,0
24003 DATA 136,208,249,165,215,133,213,165
24004 DATA 214,133,212,160,0,166,215,240
24005 DATA 14,177,218,145,216,200,208,249
24006 DATA 230,219,230,217,202,208,242,166
24007 DATA 214,240,8,177,218,145,216,200
24008 DATA 202,208,248,96
 


The program tests different sizes to verify the partial and full page copy loops, and also tests the border conditions around a full page copy. After each each copy it tests the contents of the destination memory counting all the locations that were changed by the memory move. Successful output will look like this:

blogentry-32235-0-73038200-1469592514.png

Below are the source files and examples of how to load the machine language routine into BASIC included in the disk image and archive:

MEMMOVE File List:

MEMMOVE.M65 Saved Mac/65 source
MEMMOVE.L65 Mac/65 source listing
MEMMOVE.T65 Mac/65 source listed to H6: (linux)
MEMMOVE.ASM Mac/65 assembly listing
MEMMOVE.TSM Mac/65 assembly listing to H6: (linux)
MEMMOVE.OBJ Mac/65 assembled machine language program (with load segments)
MEMMOVE.BIN Assembled machine language program without load segments
MEMMOVE.LIS LISTed DATA statements for MEMMOVE.BIN routine.
MEMMOVE.TLS LISTed DATA statements for MEMMOVE.BIN routine to H6: (linux)

MAKEMMOV.BAS BASIC program to create the MEMMOVE.BIN file. This also contains the MEMMOVE routine in DATA statements.
MAKEMMOV.LIS LISTed version of MAKEMMOV.BAS
MAKEMMOV.TLS LISTed version of MAKEMMOV.BAS to H6: (linux)

TESTMOVE.BAS BASIC program that tests the MEMMOVE USR() routine.
TESTMOVE.LIS LISTed version of TESTMOVE.BAS.
TESTMOVE.TLS LISTed version of TESTMOVE.BAS to H6: (linux)




ZIP archive of files:

Move_Disk.zip


Tar archive of files (remove the .zip after download)

Move_Disk.tgz.zip






Great peace have those who love your law; nothing can make them stumble.
Psalm 119:165



 

  • Like 1

0 Comments


Recommended Comments

There are no comments to display.

Guest
Add a comment...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...