
Need help with an assembly copy routine



I assume this is eye-rollingly simple, but I'll ask anyway ...

 

I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy.  Can someone please have a look at the attached code and tell me what I am doing wrong?

 

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;
; void* __fastcall__ memcpy (void* dest, const void* src, size_t n);
;
; NOTE: This function contains entry points for memmove, which will resort
; to memcpy for an upwards copy. Don't change this module without looking
; at memmove!
;

;        .export         _memcpy, memcpy_upwards, memcpy_getparams
;        .import         popax, popptr1
;        .importzp       sp, ptr1, ptr2, ptr3

; ----------------------------------------------------------------------
;;
;; compiler directives
;;

.lsfirst

;;
;; TASM Macros and Defines
;;
#define lo(work)                (work & $00FF)
#define hi(work)                ((work & $FF00) >> 8)
#define bitprefix		.byte $2C

zplocation		.equ	$CC	;; Uses memory locations starting at this address.  Make sure they are not in use by the system or your program.
org			.equ	$4C00	;; Start of main code

sp			.equ	zplocation+$0	;; Source pointer
ptr1			.equ	zplocation+$2	;; Source
ptr2			.equ	zplocation+$4	;; Destination
ptr3			.equ	zplocation+$6	;; Size

popax			.equ	$BA	;; 
popptr1			.equ	$BB	;; 

		;;
program_start	;;
		;;

		.org	org


_memcpy
        jsr     memcpy_getparams
 	.export _memcpy

memcpy_upwards                  ; assert Y = 0
        ldx     ptr3+1          ; Get high byte of n
        beq     L2              ; Jump if zero

L1      	                ; Unrolled to make it faster...
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny

        bne     L1
        inc     ptr1+1
        inc     ptr2+1
        dex                     ; Next 256 byte block
        bne     L1              ; Repeat if any

        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
        ; {
L2                              ; assert Y = 0
        ldx     ptr3            ; Get the low byte of n
        beq     done            ; something to copy

L3      lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        dex
        bne     L3

        ; }

done    jmp     popax           ; Pop ptr and return as result
	.export memcpy_upwards 

; ----------------------------------------------------------------------
; Get the parameters from stack as follows:
;
;       size            --> ptr3
;       src             --> ptr1
;       dest            --> ptr2
;       First argument (dest) will remain on stack and is returned in a/x!

memcpy_getparams                ; IMPORTANT! Function has to leave with Y=0!
        sta     ptr3
        stx     ptr3+1          ; save n to ptr3

        jsr     popptr1         ; save src to ptr1

                                ; save dest to ptr2
        iny                     ; Y=0 guaranteed by popptr1, we need '1' here...                        
                                ; (direct stack access is three cycles faster
                                ; (total cycle count with return))
        lda     (sp),y
        tax
        stx     ptr2+1          ; save high byte of ptr2
        dey                     ; Y = 0
        lda     (sp),y          ; Get ptr2 low
        sta     ptr2
        rts
	.export memcpy_getparams

	.end

thank you!

Copy.A65


I don't have time to analyze your code (and I am not very skilled in ML anyway), but maybe this helps? It is MoveBlock from the Action! runtime.

The first 3 bytes of the parameters are passed in registers (A, X and Y, as the listing shows); the rest are passed in zero page at $Ax. In Action!, CARD is a 2-byte type.

PROC MoveBlock=*(CARD d, s, l)

2000: 85 A0     STA $A0     ;TSLNUM
2002: 86 A1     STX $A1     ;TSLNUM+1
2004: 84 A2     STY $A2     ;MVLNG
2006: A0 00     LDY #$00
2008: A5 A4     LDA $A4     ;ECSIZE
200A: D0 04     BNE $2010
200C: A5 A5     LDA $A5     ;ECSIZE+1
200E: F0 18     BEQ $2028
2010: B1 A2     LDA ($A2),Y ;MVLNG
2012: 91 A0     STA ($A0),Y ;TSLNUM
2014: C8        INY
2015: D0 04     BNE $201B
2017: E6 A1     INC $A1     ;TSLNUM+1
2019: E6 A3     INC $A3     ;MVLNG+1
201B: C6 A4     DEC $A4     ;ECSIZE
201D: A5 A4     LDA $A4     ;ECSIZE
201F: C9 FF     CMP #$FF
2021: D0 E5     BNE $2008
2023: C6 A5     DEC $A5     ;ECSIZE+1
2025: 38        SEC
2026: B0 E0     BCS $2008
2028: 60        RTS

 

Edited by zbyti
parameters info

Hi!

19 hours ago, jacobus said:

I assume this is eye-rollingly simple, but I'll ask anyway ...

 

I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy.  Can someone please have a look at the attached code and tell me what I am doing wrong?

Are you trying to convert this routine, which was made to be called from CC65-compiled C code, to ASM-only code?

Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 hold the parameters:

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;

.lsfirst

;;
;; TASM Macros and Defines
;;
#define lo(work)                (work & $00FF)
#define hi(work)                ((work & $FF00) >> 8)
#define bitprefix		.byte $2C

zplocation		.equ	$CC	;; Uses memory locations starting at this address.  Make sure they are not in use by the system or your program.
org			.equ	$4C00	;; Start of main code

ptr1			.equ	zplocation+$0	;; Source
ptr2			.equ	zplocation+$2	;; Destination
ptr3			.equ	zplocation+$4	;; Size

		.org	org

memcpy
        ldy     #0              ; Needs Y = 0
        ldx     ptr3+1          ; Get high byte of n
        beq     L2              ; Jump if zero

L1      	                ; Unrolled to make it faster...
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny

        bne     L1
        inc     ptr1+1
        inc     ptr2+1
        dex                     ; Next 256 byte block
        bne     L1              ; Repeat if any

        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
L2                              ; assert Y = 0
        ldx     ptr3            ; Get the low byte of n
        beq     done            ; something to copy

L3      lda     (ptr1),Y        ; copy a byte
        sta     (ptr2),Y
        iny
        dex
        bne     L3

done    rts                     ; return

	.end

 

Have Fun!

 


6 hours ago, dmsc said:

Hi!

Are you trying to convert this routine, which was made to be called from CC65-compiled C code, to ASM-only code?

Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 hold the parameters.



 

Thank you - much appreciated!

 

Progress!  It now simply hangs the computer instead of crashing Altirra!  I think I forgot to mention that this needs to run in the VBI.  Does that change anything?

 

 


22 hours ago, jacobus said:

I think I forgot to mention that this needs to run in the VBI.  Does that change anything?

 

 

Note that the zero-page variables used by your function, and by the "stack handling functions" of the C compiler library, are then not available for anything else. In particular, since the stack handling functions are likely used by the main part of the program and likely need some zero-page variables, this is likely to fail.

Interrupts typically mean "assembler only". The 6502 is badly equipped for higher-level languages that require stack handling.
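
If the copy has to run inside the interrupt anyway, one way to act on the zero-page concern is to save the bytes the routine uses and put them back afterwards. What follows is an untested sketch in the TASM syntax used in this thread. The labels vbi_copy, vz_save, vz_rest and mc_loop are made up for illustration, the six bytes at $CC are simply the locations used earlier in the thread, and the short memcpy at the end is only a stand-in for whichever copy routine is actually used.

```asm
; Untested sketch, TASM-style 6502 syntax as used in this thread.
zplocation	.equ	$CC		; same zero-page base as above
ptr1		.equ	zplocation+$0	; source
ptr2		.equ	zplocation+$2	; destination
ptr3		.equ	zplocation+$4	; size

		.org	$4C00

; Hypothetical wrapper to call from the interrupt instead of memcpy.
; Clobbers A, X and Y.
vbi_copy
	ldx	#5		; six bytes: ptr1, ptr2, ptr3
vz_save
	lda	zplocation,x	; push zplocation+5 first, zplocation+0 last
	pha
	dex
	bpl	vz_save

	; ... load ptr1/ptr2/ptr3 for this copy here ...
	jsr	memcpy

	ldx	#0
vz_rest
	pla			; pulls arrive in reverse order of the pushes,
	sta	zplocation,x	; so zplocation+0 is restored first
	inx
	cpx	#6
	bne	vz_rest
	rts

; Stand-in copy routine: low byte of the size only, sizes 1..255.
memcpy
	ldy	#0
mc_loop
	lda	(ptr1),y
	sta	(ptr2),y
	iny
	cpy	ptr3
	bne	mc_loop
	rts

	.end
```

This costs six bytes of hardware stack per call. It only protects these six locations; any other zero-page bytes shared between the main program and the interrupt would need the same treatment.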


thanks for the responses!

 

Sounds like I was starting with the wrong code.  Can anyone recommend an assembly routine that would do the following:

 

-Copy small sequences of data (<64 bytes) from one location to another

-Run in the VBI

-Compatible with the TASM cross assembler version 3.2

-Fast and light

 

 

thank you!

 


2 hours ago, jacobus said:

-Copy small sequences of data (<64 bytes) from one location to another

-Run in the VBI

-Compatible with the TASM cross assembler version 3.2

-Fast and light

64 bytes fit in one page, which can greatly simplify the code. Do you select source or destination location? Is at least one of them constant?


  • 2 weeks later...
On 7/30/2020 at 6:04 PM, ilmenit said:

64 bytes fit in one page, which can greatly simplify the code. Do you select source or destination location? Is at least one of them constant?

Both source and destination are variable - the amount of data to copy is always 56 bytes - however I think I'd like to be able to specify that as well just in case I find another use for this code.


Do you cross page boundaries? Can page zero be a source or destination location?

 

Edit: do source and destination frequently change? Is self-modifying code allowed or might it need to run from ROM? Are undocumented (not illegal) instructions OK, or should it also work on a non-standard machine with a 65C02 and up?

 

Edit2: tight code, or speed? :)

 

Edit3: size maximum of 128, 256, or more?

 

There are so many factors :)

 

Here's a sample:

src			.equ	zplocation+$0
dst			.equ	zplocation+$2

; caller sets src
; enter with A lo(dst) and X hi(dst)
; Y is size minus 1(!), maximum of 127 (i.e. 128 bytes)

memcopy
	sta dst
	stx dst+1

loop
	lda (src),y
	sta (dst),y
	dey
	bpl loop

	rts

main
	lda #$34	; lo($1234)
	sta src
	lda #$12	; hi($1234)
	sta src+1

	lda #$78	; lo($5678)
	ldx #$56	; hi($5678)
	ldy #55		; 56 bytes
	jsr memcopy

	rts

 

I moved storing of dst to the memcopy routine. That saves space at the caller side. No need to sta dst/stx dst+1 everytime you call memcopy.

 

This could be improved upon a lot, depending on your specific needs :)

Edited by ivop

On 8/10/2020 at 11:44 AM, ivop said:

Do you cross page boundaries? Can page zero be a source or destination location?

 

Edit: do source and destination frequently change? Is self-modifying code allowed or might it need to run from ROM? Are undocumented (not illegal) instructions OK, or should it also work on a non-standard machine with a 65C02 and up?

 

Edit2: tight code, or speed? :)

 

Edit3: size maximum of 128, 256, or more?

 

There are so many factors :)

 


Thank you very much for the reply!

 

In answer to your questions:

   -I do not cross page boundaries

   -zero page is not used as a source or destination

   -source and destination change each time the copy routine will be called

   -rather not have self-modifying code, I may put this in a cart

   -speed is my preference - I need to copy 56 bytes 32 times for a full screen redraw

   -when you say size - do you mean size of code or bytes copied?  Code size is not too important, (under 256 bytes preferred), bytes copied each call is either 48 or 56

 

Questions

   -I understand the first two routines, but I don't understand the main routine at all.  Why the constants ($1234 and $5678)?

 

thank you!

 

Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Edited by jacobus

If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled-loop of repeated LDA (source),Y ; STA (dest),Y ; DEY

 

Your subroutine could look something like this:


 

COPY56BYTES:

  LDY #56

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 55

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 50

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

COPY48BYTESWITHYPRELOADEDWITH48:

  LDA (COPYSOURCE),Y ; 48

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 45

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 40

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 35

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 30

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 25

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 20

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ;15

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 10

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y ; 5

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  DEY

  LDA (COPYSOURCE),Y

  STA (COPYDEST),Y

  RTS

 

 

 

After loading your source address in COPYSOURCE and your destination address in COPYDEST, you can call COPY56BYTES to do the copy as fast as possible.

If you want to copy 48 bytes, then also load Y with 48 and call COPY48BYTESWITHYPRELOADEDWITH48.  If you need some other number of bytes copied, you can add labels in the appropriate place in the unrolled loop and call them (with Y preloaded as appropriate).

I think that this will give you the fastest memory copy, using 2 + N*(5+6+2) - 2 + 6 cycles, not counting the JSR and the loading of the COPYSOURCE & COPYDEST words.

 

As for size, this sub will take 2+N*5 bytes.  So, for N=56, it will be 282 bytes long.  If you want the speed, you have to pay for it somewhere else!
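
As a worked check, here are the two formulas above with the numbers plugged in. The terms are 2 cycles for LDY #, 5 for LDA (zp),Y, 6 for STA (zp),Y, 2 for DEY, minus 2 because the last group has no DEY, and 6 for RTS. This assumes the source block does not cross a page boundary (a crossing adds one cycle to each affected LDA) and it does not account for cycles lost to display DMA. The 48-byte figure counts the caller's own LDY # as the leading 2.

```latex
\begin{align*}
\text{cycles}(N) &= 2 + N(5 + 6 + 2) - 2 + 6 = 13N + 6 \\
\text{cycles}(56) &= 13 \cdot 56 + 6 = 734 \\
\text{cycles}(48) &= 13 \cdot 48 + 6 = 630 \\
\text{bytes}(56) &= 2 + 5 \cdot 56 = 282
\end{align*}
```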

Edited by StickJock
Added size calculation

On 8/11/2020 at 9:15 PM, jacobus said:

Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Exactly :)  main is just an example of how you call memcopy, i.e. setup src, load dst in AX and size in Y, and call the routine. ($1234 and $5678 are just example source and destination addresses)

 

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

 

 

Edited by ivop

10 minutes ago, ivop said:

Exactly :)  main is just an example of how you call memcopy, i.e. setup src, load dst in AX and size in Y, and call the routine.

 

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

 

 

Doh!

Classic mistake.  Thanks for catching it.  I didn't actually test this - I just wrote it here in the thread.

 

Change the LDY to #56-1, and call into the '48' label with Y preloaded with 48-1 (and maybe change the name of the label).
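
To make the corrected indexing concrete, here is the same pattern shrunk to 4 bytes so it fits on one screen. This is an untested sketch in TASM-style syntax; the zero-page addresses and the label names COPY4BYTES and COPY3BYTESWITHYPRELOADED are assumptions made for illustration, not part of the routine posted above.

```asm
; Untested sketch: 4 bytes instead of 56 to keep it short.
COPYSOURCE	.equ	$CC		; assumed zero-page locations
COPYDEST	.equ	$CE

		.org	$4C00

COPY4BYTES
	LDY	#4-1			; last byte is at offset 3, not 4
	LDA	(COPYSOURCE),Y		; offset 3
	STA	(COPYDEST),Y
	DEY
COPY3BYTESWITHYPRELOADED		; enter here with Y = 3-1
	LDA	(COPYSOURCE),Y		; offset 2
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; offset 1
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; offset 0
	STA	(COPYDEST),Y
	RTS

	.end
```

The rule is the same at any size: Y starts at count minus 1, and the last group before the RTS copies offset 0.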


Hi,

 

   Although completely unrolling the code is the fastest way to copy memory, you can also have a block of 8 lda/sta statements, and loop 6 or 7 times depending on whether you want to copy 48 bytes (6*8), or 56 bytes (7*8). This would use less memory for code, and only be a bit slower. Alternatively, you could copy 16 bytes at a time (16 lda/sta statements inside the loop), and have a final 8 lda/sta statements after the loop that only gets executed if you want to copy 56 bytes, or you could jump back halfway into the loop for the final 8 bytes (of the 56).

 

   You would use the X register (decrementing) as a loop counter, and have to test a memory location when the loop has been completed for determining 48/56 byte copy mode, but I think this is a good balance between speed and size.
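
A partially unrolled loop along these lines could look like the untested sketch below, in the TASM-style syntax used in this thread. Note that it is a variation on the suggestion above: instead of an X counter and a mode flag in memory, it lets Y itself end the loop with BPL, which works because both 48 and 56 are multiples of 8. The zero-page addresses and the label COPYBLOCKS are assumptions for illustration.

```asm
; Untested sketch: 8 copies per pass, loop ends when Y goes negative.
COPYSOURCE	.equ	$CC		; assumed zero-page locations
COPYDEST	.equ	$CE

		.org	$4C00

; Enter with Y = byte count - 1 (47 for 48 bytes, 55 for 56 bytes).
; The byte count must be a multiple of 8 and at most 128.
COPYBLOCKS
	LDA	(COPYSOURCE),Y		; 1st byte of this block
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 2nd
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 3rd
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 4th
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 5th
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 6th
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 7th
	STA	(COPYDEST),Y
	DEY
	LDA	(COPYSOURCE),Y		; 8th
	STA	(COPYDEST),Y
	DEY
	BPL	COPYBLOCKS		; Y wraps to $FF once offset 0 is copied
	RTS

	.end
```

Compared with the fully unrolled version this adds one taken branch (3 cycles) per 8 bytes, and the code shrinks to a few dozen bytes. A count that is not a multiple of 8 would run past offset 0 and copy from the wrong end of the page, so the caller has to guarantee that.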

 

   Hope this helps! 


23 hours ago, StickJock said:

If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled-loop of repeated LDA (source),Y ; STA (dest),Y ; DEY

 


Thank you for this!  Once I get the code working, I'll come back to this; for now, I'll go with readability.


4 hours ago, ivop said:

Exactly :)  main is just an example of how you call memcopy, i.e. setup src, load dst in AX and size in Y, and call the routine. ($1234 and $5678 are just example source and destination addresses)

 

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having set COPYSOURCE (src in my case) by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

 

 

Thanks for all the help, I am definitely closer!  The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI.  I should elaborate - it runs properly once when called from the VBI but then locks up the computer.  Do I need to do something different when exiting?

 

.lsfirst

zplocation		.equ	$CC	;; Uses memory locations starting at this address.
org			.equ	$4C00	;; Start of main code

src			.equ	zplocation+$0	;; Source
dst			.equ	zplocation+$2	;; Destination
bytes			.equ	zplocation+$4	;; Size

		.org	org

; caller sets src, dst and bytes
; bytes is size minus 1(!), maximum of 127 (i.e. 128 bytes)

memcopy
;	sta dst
;	stx dst+1
        ldy bytes

loop
	lda (src),y
	sta (dst),y
	dey
	bpl loop

	rts

	.end

 


7 minutes ago, jacobus said:

Thanks for all the help, I am definitely closer!  The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI.  I should elaborate - it runs properly once when called from the VBI but then locks up the computer.  Do I need to do something different when exiting?


<snip> :)

Could you post your VBI routine, and how it's calling memcopy? And how do you setup the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.


1 hour ago, ivop said:

Could you post your VBI routine, and how it's calling memcopy? And how do you setup the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.

Sure, but it's in Quick so may not be much help.  The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers, but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly.  Is this something I can add to the copy routine instead?

 

INTER VERTBLNK
LOCAL
 BYTE
 [
  V
 ]
 WORD
 [
  VD,VM
  OP=130,SP
 ]
BEGIN
* PUSH
* IPUSH
* ZPUSH
* SP=OP

 IF VBENABLE=1
  ZZC=0
  REPEAT
   CALL($00,$00,$00,$4C00)
   ZZC+
  UNTIL ZZC=33
  VBENABLE=0
 ENDIF

 *Horizontal Scrolling
 IF SCRLH=1 ;scroll right -->
  IF HFS=9
   IF HCS<16 ;limit of screen
    IF MAPXR<65
     HFS=12
     HCS+
    ENDIF
   ENDIF
  ELSE
   HFS-
  ENDIF
 ENDIF
 IF SCRLH=255 ;scroll left <--
  IF HFS=12
   IF HCS>0
    IF MAPXL>0
     HFS=9
     HCS-
    ENDIF
   ENDIF
  ELSE
   HFS+
  ENDIF
 ENDIF

 HSCR00=HCS
 HSCR01=HCS
 HSCR02=HCS
 HSCR03=HCS
 HSCR04=HCS
 HSCR05=HCS
 HSCR06=HCS
 HSCR07=HCS
 HSCR08=HCS
 HSCR09=HCS
 HSCR10=HCS
 HSCR11=HCS
 HSCR12=HCS
 HSCR13=HCS
 HSCR14=HCS
 HSCR15=HCS
 HSCR16=HCS

 *vertical scrolling
 IF SCRLV=1 ;top down (plyr moves up)
  IF VFS=0
   IF VCS>0
    IF MAPYT>0
     VSCR00-
     VSCR01-
     VSCR02-
     VSCR03-
     VSCR04-
     VSCR05-
     VSCR06-
     VSCR07-
     VSCR08-
     VSCR09-
     VSCR10-
     VSCR11-
     VSCR12-
     VSCR13-
     VSCR14-
     VSCR15-
     VSCR16-
     VFS=7
     VCS-
    ENDIF
   ENDIF
  ELSE
   VFS-
  ENDIF
 ENDIF

 IF SCRLV=255 ;bot up (plyr moves dn)
  IF VFS=7
   IF VCS<16
    IF MAPYB<64
     VSCR00+
     VSCR01+
     VSCR02+
     VSCR03+
     VSCR04+
     VSCR05+
     VSCR06+
     VSCR07+
     VSCR08+
     VSCR09+
     VSCR10+
     VSCR11+
     VSCR12+
     VSCR13+
     VSCR14+
     VSCR15+
     VSCR16+
     VFS=0
     VCS+
    ENDIF
   ENDIF
  ELSE
   VFS+
  ENDIF
 ENDIF

 *handle joystick
 IF STICK0<>15
  V=JOYX(STICK0) ;LUT
  ADD(PX,V,PX)
  SCRLH=0 ;no scroll flag
  IF PX<124
   SUB(PX,V,PX)
   SCRLH=255 ;scroll left flag <--
  ENDIF
  IF PX>125
   SUB(PX,V,PX)
   SCRLH=1 ;scroll right flag -->
  ENDIF

  V=JOYY(STICK0) ;LUT
  ADD(PY,V,PY)
  SCRLV=0
  IF PY<119
   SUB(PY,V,PY)
   SCRLV=1
  ENDIF
  IF PY>121
   SUB(PY,V,PY)
   SCRLV=255
  ENDIF



 ELSE
  SCRLH=0
  SCRLV=0
 ENDIF

* OP=SP
* PULL
* IPULL
* ZPULL
ENDVBI

 


21 minutes ago, jacobus said:

Sure, but it's in Quick so may not be much help.  The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers, but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly.  Is this something I can add to the copy routine instead?

 



 

You need to pull/pop in the reverse order of your pushes.
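
In plain 6502 terms, the reverse-order rule looks like the untested sketch below, which also answers the question of whether the saving can live next to the copy routine. How Quick's PUSH/IPUSH/ZPUSH map onto this is not something this sketch knows; it only shows the stack discipline. The wrapper name memcopy_safe is made up for illustration, and whether saving registers is what cures the lock-up has not been confirmed in the thread.

```asm
; Untested sketch, TASM-style 6502 syntax as used in this thread.
src	.equ	$CC		; source pointer
dst	.equ	$CE		; destination pointer
bytes	.equ	$D0		; byte count minus 1

	.org	$4C00

; Hypothetical wrapper: preserves A, X and Y around the copy.
memcopy_safe
	pha			; push A
	txa
	pha			; push X
	tya
	pha			; push Y

	jsr	memcopy

	pla			; pull Y (pushed last, pulled first)
	tay
	pla			; pull X
	tax
	pla			; pull A (pushed first, pulled last)
	rts

memcopy
	ldy	bytes
loop
	lda	(src),y
	sta	(dst),y
	dey
	bpl	loop
	rts

	.end
```

Pulling in the same order as the pushes would hand Y's value back to A and vice versa, and an unbalanced push or pull would make the RTS return to a garbage address, which is one common way for a 6502 routine to hang.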

 

