Need help with an assembly copy routine

jacobus · July 28, 2020

I assume this is eye-rollingly simple, but I'll ask anyway ...

I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy. Can someone please have a look at the attached code and tell me what I am doing wrong?

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;
; void* __fastcall__ memcpy (void* dest, const void* src, size_t n);
;
; NOTE: This function contains entry points for memmove, which will resort
; to memcpy for an upwards copy. Don't change this module without looking
; at memmove!
;

;        .export         _memcpy, memcpy_upwards, memcpy_getparams
;        .import         popax, popptr1
;        .importzp       sp, ptr1, ptr2, ptr3

; ----------------------------------------------------------------------
;;
;; compiler directives
;;

.lsfirst

;;
;; TASM Macros and Defines
;;
#define lo(work)                (work & $00FF)
#define hi(work)                ((work & $FF00) >> 8)
#define bitprefix		.byte $2C

zplocation		.equ	$CC	;; Uses memory locations starting at this address.  Make sure they are not in use by the system or your program.
org			.equ	$4C00	;; Start of main code

sp			.equ	zplocation+$0	;; Source pointer
ptr1			.equ	zplocation+$2	;; Source
ptr2			.equ	zplocation+$4	;; Destination
ptr3			.equ	zplocation+$6	;; Size

popax			.equ	$BA	;; 
popptr1			.equ	$BB	;; 

		;;
program_start	;;
		;;

		.org	org


_memcpy
        jsr     memcpy_getparams
 	.export _memcpy

memcpy_upwards                  ; assert Y = 0
        ldx     ptr3+1          ; Get high byte of n
        beq     L2              ; Jump if zero

L1      	                ; Unrolled to make it faster...
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny

        bne     L1
        inc     ptr1+1
        inc     ptr2+1
        dex                     ; Next 256 byte block
        bne     L1              ; Repeat if any

        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
        ; {
L2                              ; assert Y = 0
        ldx     ptr3            ; Get the low byte of n
        beq     done            ; something to copy

L3      lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        dex
        bne     L3

        ; }

done    jmp     popax           ; Pop ptr and return as result
	.export memcpy_upwards 

; ----------------------------------------------------------------------
; Get the parameters from stack as follows:
;
;       size            --> ptr3
;       src             --> ptr1
;       dest            --> ptr2
;       First argument (dest) will remain on stack and is returned in a/x!

memcpy_getparams                ; IMPORTANT! Function has to leave with Y=0!
        sta     ptr3
        stx     ptr3+1          ; save n to ptr3

        jsr     popptr1         ; save src to ptr1

                                ; save dest to ptr2
        iny                     ; Y=0 guaranteed by popptr1, we need '1' here...                        
                                ; (direct stack access is three cycles faster
                                ; (total cycle count with return))
        lda     (sp),y
        tax
        stx     ptr2+1          ; save high byte of ptr2
        dey                     ; Y = 0
        lda     (sp),y          ; Get ptr2 low
        sta     ptr2
        rts
	.export memcpy_getparams

	.end

thank you!

Copy.A65

zbyti · July 28, 2020

I don't have time to analyze your code (and I am not very skilled in ML), but maybe this helps? It's MoveBlock from the Action! runtime.

The first 3 bytes of the parameters are passed in registers (A, X, Y); the rest go through zero page ($A3 onward). In Action!, CARD is a 2-byte type.

PROC MoveBlock=*(CARD d, s, l)

2000: 85 A0     STA $A0     ;TSLNUM
2002: 86 A1     STX $A1     ;TSLNUM+1
2004: 84 A2     STY $A2     ;MVLNG
2006: A0 00     LDY #$00
2008: A5 A4     LDA $A4     ;ECSIZE
200A: D0 04     BNE $2010
200C: A5 A5     LDA $A5     ;ECSIZE+1
200E: F0 18     BEQ $2028
2010: B1 A2     LDA ($A2),Y ;MVLNG
2012: 91 A0     STA ($A0),Y ;TSLNUM
2014: C8        INY
2015: D0 04     BNE $201B
2017: E6 A1     INC $A1     ;TSLNUM+1
2019: E6 A3     INC $A3     ;MVLNG+1
201B: C6 A4     DEC $A4     ;ECSIZE
201D: A5 A4     LDA $A4     ;ECSIZE
201F: C9 FF     CMP #$FF
2021: D0 E5     BNE $2008
2023: C6 A5     DEC $A5     ;ECSIZE+1
2025: 38        SEC
2026: B0 E0     BCS $2008
2028: 60        RTS
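
Going by the listing above, a call from plain assembly might look like the sketch below. It assumes the routine sits at $2000 as in the disassembly (adjust to wherever it is actually assembled); the label `copy_example` and the addresses $1234 (source) and $5678 (destination) are made up for illustration. Note that the routine overwrites $A0-$A5.

```asm
MoveBlock	.equ	$2000	; assumption: address taken from the listing above

; copy_example (hypothetical): copy 56 bytes from $1234 to $5678
copy_example
	lda #$12	; high byte of source -> $A3
	sta $A3
	lda #56		; low byte of length -> $A4
	sta $A4
	lda #0		; high byte of length -> $A5
	sta $A5
	lda #$78	; A = low byte of destination
	ldx #$56	; X = high byte of destination
	ldy #$34	; Y = low byte of source
	jsr MoveBlock
	rts
```

Only the first three parameter bytes travel in registers, so the source high byte and the length have to be stored before the registers are loaded.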

Edited July 28, 2020 by zbyti
parameters info

dmsc · July 29, 2020

Hi!

19 hours ago, jacobus said:

I assume this is eye-rollingly simple, but I'll ask anyway ...

I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy. Can someone please have a look at the attached code and tell me what I am doing wrong?

Are you trying to convert this routine - which was made to be called from CC65-compiled C code - to assembly-only code?

Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 hold the parameters:

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;

.lsfirst

;;
;; TASM Macros and Defines
;;
#define lo(work)                (work & $00FF)
#define hi(work)                ((work & $FF00) >> 8)
#define bitprefix		.byte $2C

zplocation		.equ	$CC	;; Uses memory locations starting at this address.  Make sure they are not in use by the system or your program.
org			.equ	$4C00	;; Start of main code

ptr1			.equ	zplocation+$0	;; Source
ptr2			.equ	zplocation+$2	;; Destination
ptr3			.equ	zplocation+$4	;; Size

		.org	org

memcpy
        ldy     #0              ; Needs Y = 0
        ldx     ptr3+1          ; Get high byte of n
        beq     L2              ; Jump if zero

L1      	                ; Unrolled to make it faster...
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny

        bne     L1
        inc     ptr1+1
        inc     ptr2+1
        dex                     ; Next 256 byte block
        bne     L1              ; Repeat if any

        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
L2                              ; assert Y = 0
        ldx     ptr3            ; Get the low byte of n
        beq     done            ; Nothing left to copy

L3      lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        dex
        bne     L3

done    rts                     ; return

	.end

Have Fun!

jacobus · July 29, 2020

6 hours ago, dmsc said:

Hi!

Are you trying to convert this routine - which was made to be called from CC65-compiled C code - to assembly-only code?

Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 hold the parameters:


<snip>

Thank you - much appreciated!

Progress! It now simply hangs the computer instead of crashing Altirra! I think I forgot to mention that this needs to run in the VBI. Does that change anything?

zbyti · July 29, 2020

15 minutes ago, jacobus said:

I think I forgot to mention that this needs to run in the VBI. Does that change anything?

You must finish before the next interrupt, that's all.

Wrathchild · July 29, 2020

I would think it depends on what you are calling within the VBI, and hence whether that uses or trashes any of the shared variables.

thorfdbg · July 30, 2020

22 hours ago, jacobus said:

I think I forgot to mention that this needs to run in the VBI. Does that change anything?

Note well that the zero-page variables used in your function, and in the "stack handling functions" of the C compiler library, are then not available for anything else. In particular, since the stack handling functions are likely used by the main part of the program, and likely require some zero-page variables, this is likely to fail.

Interrupts typically mean "assembler only". The 6502 is badly equipped for higher languages that require stack handling.
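
One way around the shared zero-page locations, sketched under assumptions: a wrapper that saves the four pointer bytes on the CPU stack, does the copy, and restores them in reverse order. The names `safe_memcopy`, `p_src`, `p_dst` and `p_len` are made up for illustration, the zero-page base $CC is the one used earlier in the thread, and the parameter block is assumed to live in RAM (it would need to move if the code goes into a cart).

```asm
zplocation	.equ	$CC		; same base as earlier in the thread
src		.equ	zplocation+$0	; zero-page source pointer
dst		.equ	zplocation+$2	; zero-page destination pointer

; safe_memcopy (hypothetical name)
; copies p_len+1 bytes from the address in p_src to the address in p_dst
; p_len is the byte count minus 1, maximum of 127
; A, X and Y are not preserved

safe_memcopy
	ldx #3
save	lda src,x	; push dst+1, dst, src+1, src (in that order)
	pha
	dex
	bpl save

	lda p_src	; load the pointers from the parameter block
	sta src
	lda p_src+1
	sta src+1
	lda p_dst
	sta dst
	lda p_dst+1
	sta dst+1

	ldy p_len
copy	lda (src),y	; copy from the last byte down to byte 0
	sta (dst),y
	dey
	bpl copy

	ldx #0
rest	pla		; pull in reverse order: src, src+1, dst, dst+1
	sta src,x
	inx
	cpx #4
	bne rest
	rts

p_src	.word	0	; assumption: parameter block in RAM
p_dst	.word	0
p_len	.byte	0
```

The save and restore add some overhead to every call, so this trades speed for not having to reserve zero-page locations exclusively for the interrupt.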

jacobus · July 30, 2020

thanks for the responses!

Sounds like I was starting with the wrong code. Can anyone recommend an assembly routine that would do the following:

-Copy small sequences of data (<64 bytes) from one location to another

-Run in the VBI

-Compatible with the TASM cross assembler version 3.2

-Fast and light

thank you!

zbyti · July 30, 2020

@jacobus and what's wrong with the procedure from the Action! runtime? It is pure ML, easy to rewrite for any assembler.

ilmenit · July 30, 2020

2 hours ago, jacobus said:

-Copy small sequences of data (<64 bytes) from one location to another

-Run in the VBI

-Compatible with the TASM cross assembler version 3.2

-Fast and light

64 bytes fit in one page, which can greatly simplify the code. Do you select the source or destination location? Is at least one of them constant?

jacobus · August 10, 2020

On 7/30/2020 at 3:32 PM, zbyti said:

@jacobus and what's wrong with the procedure from the Action! runtime? It is pure ML, easy to rewrite for any assembler.

Nothing, but I couldn't see how to pass the parameters that I need - sorry to have ignored your response!

jacobus · August 10, 2020

On 7/30/2020 at 6:04 PM, ilmenit said:

64 bytes fit in one page, which can greatly simplify the code. Do you select the source or destination location? Is at least one of them constant?

Both source and destination are variable - the amount of data to copy is always 56 bytes - however I think I'd like to be able to specify that as well just in case I find another use for this code.

ivop · August 10, 2020

Do you cross page boundaries? Can page zero be a source or destination location?

Edit: do source and destination frequently change? Is self-modifying code allowed or might it need to run from ROM? Undocumented (not illegal) instructions, or should it also work on a non-standard machine with a 65C02 and up?

Edit2: tight code, or speed?

Edit3: size maximum of 128, 256, or more?

There are so many factors

Here's a sample:

src			.equ	zplocation+$0
dst			.equ	zplocation+$2

; caller sets src
; enter with A lo(dst) and X hi(dst)
; Y is size minus 1(!), maximum of 127 (i.e. 128 bytes)

memcopy
	sta dst
	stx dst+1

loop
	lda (src),y
	sta (dst),y
	dey
	bpl loop

	rts

main
	lda #$34	; lo($1234)
	sta src
	lda #$12	; hi($1234)
	sta src+1

	lda #$78	; lo($5678)
	ldx #$56	; hi($5678)
	ldy #55		; 56 bytes
	jsr memcopy

	rts

I moved the storing of dst into the memcopy routine. That saves space at the caller side. No need to sta dst/stx dst+1 every time you call memcopy.

This could be improved upon a lot, depending on your specific needs.

Edited August 10, 2020 by ivop

jacobus · August 11, 2020

On 8/10/2020 at 11:44 AM, ivop said:
Do you cross page boundaries? Can page zero be a source or destination location?

Edit: do source and destination frequently change? Is self-modifying code allowed or might it need to run from ROM? Undocumented (not illegal) instructions, or should it also work on a non-standard machine with a 65C02 and up?

Edit2: tight code, or speed?

Edit3: size maximum of 128, 256, or more?

There are so many factors

<snip>

Thank you very much for the reply!

In answer to your questions:

-I do not cross page boundaries

-zero page is not used as a source or destination

-source and destination change each time the copy routine is called

-I'd rather not have self-modifying code, as I may put this in a cart

-speed is my preference - I need to copy 56 bytes 32 times for a full screen redraw

-when you say size - do you mean size of code or bytes copied? Code size is not too important (under 256 bytes preferred); bytes copied each call is either 48 or 56

Questions

-I understand the first two routines, but I don't understand the main routine at all. Why the constants ($1234 and $5678)?

thank you!

Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Edited August 11, 2020 by jacobus

StickJock · August 11, 2020

If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled-loop of repeated LDA (source),Y ; STA (dest),Y ; DEY

Your subroutine could look something like this:

Spoiler

COPY56BYTES:

LDY #56

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 55

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 50

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

COPY48BYTESWITHYPRELOADEDWITH48:

LDA (COPYSOURCE),Y ; 48

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 45

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 40

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 35

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 30

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 25

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 20

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ;15

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 10

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y ; 5

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

DEY

LDA (COPYSOURCE),Y

STA (COPYDEST),Y

RTS

After loading your source address in COPYSOURCE and your destination address in COPYDEST, you can call COPY56BYTES to do the copy as fast as possible.

If you want to copy 48 bytes, then also load Y with 48 and call COPY48BYTESWITHYPRELOADEDWITH48. If you need some other number of bytes copied, you can add labels in the appropriate place in the unrolled loop and call them (with Y preloaded as appropriate).

I think that this will give you the fastest memory copy, using 2 + N*(5+6+2) - 2 + 6 cycles (that is, 13*N + 6, or 734 cycles for N=56, assuming the LDA never crosses a page boundary), not counting the JSR and the loading of the COPYSOURCE & COPYDEST words.

As for size, this sub will take 2+N*5 bytes. So, for N=56, it will be 282 bytes long. If you want the speed, you have to pay for it somewhere else!

Edited August 11, 2020 by StickJock
Added size calculation

ivop · August 12, 2020

On 8/11/2020 at 9:15 PM, jacobus said:

Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Exactly. main is just an example of how you call memcopy, i.e. set up src, load dst in A/X and size minus 1 in Y, and call the routine. ($1234 and $5678 are just example source and destination addresses.)

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

Edited August 12, 2020 by ivop

StickJock · August 12, 2020

10 minutes ago, ivop said:

Exactly. main is just an example of how you call memcopy, i.e. set up src, load dst in A/X and size minus 1 in Y, and call the routine.

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

Doh!

Classic mistake. Thanks for catching it. I didn't actually test this - I just wrote it here in the thread.

Change the LDY to #56-1, and call into the '48' label with Y = 48-1 (and maybe change the name of the label).

E474 · August 12, 2020

Hi,

Although completely unrolling the code is the fastest way to copy memory, you can also have a block of 8 lda/sta statements, and loop 6 or 7 times depending on whether you want to copy 48 bytes (6*8), or 56 bytes (7*8). This would use less memory for code, and only be a bit slower. Alternatively, you could copy 16 bytes at a time (16 lda/sta statements inside the loop), and have a final 8 lda/sta statements after the loop that only gets executed if you want to copy 56 bytes, or you could jump back halfway into the loop for the final 8 bytes (of the 56).

You would use the X register (decrementing) as a loop counter, and have to test a memory location when the loop has been completed for determining 48/56 byte copy mode, but I think this is a good balance between speed and size.
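
A minimal sketch of the 8-at-a-time variant, with one departure from the description above: because both sizes are multiples of 8, the sign of Y after each block of eight can end the loop, so no X counter or mode flag is needed. The label `copy8` is made up, `src`/`dst` and the $CC base follow the earlier samples in the thread, and the caller is assumed to set src and dst and to pass size minus 1 in Y.

```asm
zplocation	.equ	$CC		; same base as earlier in the thread
src		.equ	zplocation+$0	; zero-page source pointer
dst		.equ	zplocation+$2	; zero-page destination pointer

; copy8 (hypothetical name)
; caller sets src and dst
; enter with Y = size minus 1, where size is a multiple of 8 (maximum 128),
; e.g. 47 for 48 bytes or 55 for 56 bytes

copy8
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey
	lda (src),y
	sta (dst),y
	dey		; Y is now size-9, size-17, ... and finally $FF
	bpl copy8	; repeat until Y wraps below zero
	rts
```

By my count this is 43 bytes of code and, for 56 bytes with no page crossings, roughly 20 cycles slower than the fully unrolled routine.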

Hope this helps!

jacobus · August 12, 2020

23 hours ago, StickJock said:

If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled-loop of repeated LDA (source),Y ; STA (dest),Y ; DEY

<snip>

Thank you for this! Once I get the code working, I'll come back to this. For now, I'll go with readability.

jacobus · August 12, 2020

4 hours ago, ivop said:

Exactly. main is just an example of how you call memcopy, i.e. set up src, load dst in A/X and size minus 1 in Y, and call the routine. ($1234 and $5678 are just example source and destination addresses.)

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

Thanks for all the help, I am definitely closer! The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI. I should elaborate - it runs properly once when called from the VBI but then locks up the computer. Do I need to do something different when exiting?

.lsfirst

zplocation		.equ	$CC	;; Uses memory locations starting at this address.
org			.equ	$4C00	;; Start of main code

src			.equ	zplocation+$0	;; Source
dst			.equ	zplocation+$2	;; Destination
bytes			.equ	zplocation+$4	;; Size

		.org	org

; caller sets src, dst and bytes
; bytes is size minus 1(!), maximum of 127 (i.e. 128 bytes)

memcopy
;	sta dst
;	stx dst+1
        ldy bytes

loop
	lda (src),y
	sta (dst),y
	dey
	bpl loop

	rts

	.end

ivop · August 12, 2020

7 minutes ago, jacobus said:

Thanks for all the help, I am definitely closer! The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI. I should elaborate - it runs properly once when called from the VBI but then locks up the computer. Do I need to do something different when exiting?

<snip>

Could you post your VBI routine, and how it's calling memcopy? And how do you set up the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.

jacobus · August 12, 2020

1 hour ago, ivop said:

Could you post your VBI routine, and how it's calling memcopy? And how do you set up the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.

Sure, but it's in Quick, so it may not be much help. The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly. Is this something I can add to the copy routine instead?

INTER VERTBLNK
LOCAL
 BYTE
 [
  V
 ]
 WORD
 [
  VD,VM
  OP=130,SP
 ]
BEGIN
* PUSH
* IPUSH
* ZPUSH
* SP=OP

 IF VBENABLE=1
  ZZC=0
  REPEAT
   CALL($00,$00,$00,$4C00)
   ZZC+
  UNTIL ZZC=33
  VBENABLE=0
 ENDIF

 *Horizontal Scrolling
 IF SCRLH=1 ;scroll right -->
  IF HFS=9
   IF HCS<16 ;limit of screen
    IF MAPXR<65
     HFS=12
     HCS+
    ENDIF
   ENDIF
  ELSE
   HFS-
  ENDIF
 ENDIF
 IF SCRLH=255 ;scroll left <--
  IF HFS=12
   IF HCS>0
    IF MAPXL>0
     HFS=9
     HCS-
    ENDIF
   ENDIF
  ELSE
   HFS+
  ENDIF
 ENDIF

 HSCR00=HCS
 HSCR01=HCS
 HSCR02=HCS
 HSCR03=HCS
 HSCR04=HCS
 HSCR05=HCS
 HSCR06=HCS
 HSCR07=HCS
 HSCR08=HCS
 HSCR09=HCS
 HSCR10=HCS
 HSCR11=HCS
 HSCR12=HCS
 HSCR13=HCS
 HSCR14=HCS
 HSCR15=HCS
 HSCR16=HCS

 *vertical scrolling
 IF SCRLV=1 ;top down (plyr moves up)
  IF VFS=0
   IF VCS>0
    IF MAPYT>0
     VSCR00-
     VSCR01-
     VSCR02-
     VSCR03-
     VSCR04-
     VSCR05-
     VSCR06-
     VSCR07-
     VSCR08-
     VSCR09-
     VSCR10-
     VSCR11-
     VSCR12-
     VSCR13-
     VSCR14-
     VSCR15-
     VSCR16-
     VFS=7
     VCS-
    ENDIF
   ENDIF
  ELSE
   VFS-
  ENDIF
 ENDIF

 IF SCRLV=255 ;bot up (plyr moves dn)
  IF VFS=7
   IF VCS<16
    IF MAPYB<64
     VSCR00+
     VSCR01+
     VSCR02+
     VSCR03+
     VSCR04+
     VSCR05+
     VSCR06+
     VSCR07+
     VSCR08+
     VSCR09+
     VSCR10+
     VSCR11+
     VSCR12+
     VSCR13+
     VSCR14+
     VSCR15+
     VSCR16+
     VFS=0
     VCS+
    ENDIF
   ENDIF
  ELSE
   VFS+
  ENDIF
 ENDIF

 *handle joystick
 IF STICK0<>15
  V=JOYX(STICK0) ;LUT
  ADD(PX,V,PX)
  SCRLH=0 ;no scroll flag
  IF PX<124
   SUB(PX,V,PX)
   SCRLH=255 ;scroll left flag <--
  ENDIF
  IF PX>125
   SUB(PX,V,PX)
   SCRLH=1 ;scroll right flag -->
  ENDIF

  V=JOYY(STICK0) ;LUT
  ADD(PY,V,PY)
  SCRLV=0
  IF PY<119
   SUB(PY,V,PY)
   SCRLV=1
  ENDIF
  IF PY>121
   SUB(PY,V,PY)
   SCRLV=255
  ENDIF



 ELSE
  SCRLH=0
  SCRLV=0
 ENDIF

* OP=SP
* PULL
* IPULL
* ZPULL
ENDVBI

StickJock · August 12, 2020

21 minutes ago, jacobus said:

Sure, but it's in Quick, so it may not be much help. The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly. Is this something I can add to the copy routine instead?


<snip>

You need to pull/pop in the reverse order as your push.
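
The same last-in, first-out rule in plain 6502 terms, as a generic sketch: a subroutine that preserves A, X and Y across its body. The label `preserve_demo` is made up, and how Quick's PUSH/IPUSH/ZPUSH map onto this is not shown in the thread, so this only illustrates the ordering.

```asm
; preserve_demo (hypothetical name): keeps A, X and Y intact for the caller
preserve_demo
	pha		; push A first
	txa
	pha		; then X
	tya
	pha		; then Y

	; ... work that may change A, X and Y goes here ...

	pla		; Y was pushed last, so it comes off first
	tay
	pla		; then X
	tax
	pla		; A was pushed first, so it comes off last
	rts
```

Pulling in the same order as the pushes would swap A and Y and leave the registers scrambled on return.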

ivop · August 12, 2020

2 minutes ago, StickJock said:

You need to pull/pop in the reverse order as your push.

Sharp!

StickJock · August 12, 2020

14 minutes ago, ivop said:

Sharp!

Well, after my off-by-one bug, I had to save face!
