jacobus Posted July 28, 2020

I assume this is eye-rollingly simple, but I'll ask anyway ... I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy. Can someone please have a look at the attached code and tell me what I am doing wrong?

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;
; void* __fastcall__ memcpy (void* dest, const void* src, size_t n);
;
; NOTE: This function contains entry points for memmove, which will resort
; to memcpy for an upwards copy. Don't change this module without looking
; at memmove!
;
; .export _memcpy, memcpy_upwards, memcpy_getparams
; .import popax, popptr1
; .importzp sp, ptr1, ptr2, ptr3
; ----------------------------------------------------------------------
;;
;; compiler directives
;;
        .lsfirst
;;
;; TASM Macros and Defines
;;
#define lo(work) (work & $00FF)
#define hi(work) ((work & $FF00) >> 8)
#define bitprefix .byte $2C

zplocation .equ $CC             ;; Uses memory locations starting at this address. Make sure they are not in use by the system or your program.
org        .equ $4C00           ;; Start of main code
sp         .equ zplocation+$0   ;; Source pointer
ptr1       .equ zplocation+$2   ;; Source
ptr2       .equ zplocation+$4   ;; Destination
ptr3       .equ zplocation+$6   ;; Size
popax      .equ $BA             ;;
popptr1    .equ $BB             ;;
;;
;; program_start
;;
;;
        .org org
_memcpy
        jsr memcpy_getparams
        .export _memcpy
memcpy_upwards                  ; assert Y = 0
        ldx ptr3+1              ; Get high byte of n
        beq L2                  ; Jump if zero
L1                              ; Unrolled to make it faster...
        lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        bne L1
        inc ptr1+1
        inc ptr2+1
        dex                     ; Next 256 byte block
        bne L1                  ; Repeat if any
        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
        ; {
L2                              ; assert Y = 0
        ldx ptr3                ; Get the low byte of n
        beq done                ; something to copy
L3      lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        dex
        bne L3
        ; }
done    jmp popax               ; Pop ptr and return as result
        .export memcpy_upwards
; ----------------------------------------------------------------------
; Get the parameters from stack as follows:
;
;       size --> ptr3
;       src  --> ptr1
;       dest --> ptr2
;       First argument (dest) will remain on stack and is returned in a/x!
memcpy_getparams                ; IMPORTANT! Function has to leave with Y=0!
        sta ptr3
        stx ptr3+1              ; save n to ptr3
        jsr popptr1             ; save src to ptr1
                                ; save dest to ptr2
        iny                     ; Y=0 guaranteed by popptr1, we need '1' here...
                                ; (direct stack access is three cycles faster
                                ; (total cycle count with return))
        lda (sp),y
        tax
        stx ptr2+1              ; save high byte of ptr2
        dey                     ; Y = 0
        lda (sp),y              ; Get ptr2 low
        sta ptr2
        rts
        .export memcpy_getparams
        .end

thank you!

Attached: Copy.A65
zbyti Posted July 28, 2020 (edited)

I don't have time to analyze your code (moreover, I am not very skilled in ML), but maybe this helps you: MoveBlock from the Action! runtime. The first 3 bytes of the parameters go through the registers, the rest through zero page $Ax. In Action!, CARD is a 2-byte type.

PROC MoveBlock=*(CARD d, s, l)
2000: 85 A0    STA $A0       ;TSLNUM
2002: 86 A1    STX $A1       ;TSLNUM+1
2004: 84 A2    STY $A2       ;MVLNG
2006: A0 00    LDY #$00
2008: A5 A4    LDA $A4       ;ECSIZE
200A: D0 04    BNE $2010
200C: A5 A5    LDA $A5       ;ECSIZE+1
200E: F0 18    BEQ $2028
2010: B1 A2    LDA ($A2),Y   ;MVLNG
2012: 91 A0    STA ($A0),Y   ;TSLNUM
2014: C8       INY
2015: D0 04    BNE $201B
2017: E6 A1    INC $A1       ;TSLNUM+1
2019: E6 A3    INC $A3       ;MVLNG+1
201B: C6 A4    DEC $A4       ;ECSIZE
201D: A5 A4    LDA $A4       ;ECSIZE
201F: C9 FF    CMP #$FF
2021: D0 E5    BNE $2008
2023: C6 A5    DEC $A5       ;ECSIZE+1
2025: 38       SEC
2026: B0 E0    BCS $2008
2028: 60       RTS

Edited July 28, 2020 by zbyti (parameters info)
dmsc Posted July 29, 2020

Hi!

19 hours ago, jacobus said: I assume this is eye-rollingly simple, but I'll ask anyway ... I'm trying to modify a general purpose copy routine to use three parameters - a source location, a destination location and the number of bytes to copy. Can someone please have a look at the attached code and tell me what I am doing wrong?

Are you trying to convert this routine - which was made to be called from CC65-compiled C code - to ASM-only code? Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 have the parameters:

;
; Ullrich von Bassewitz, 2003-08-20
; Performance increase (about 20%) by
; Christian Krueger, 2009-09-13
;
        .lsfirst
;;
;; TASM Macros and Defines
;;
#define lo(work) (work & $00FF)
#define hi(work) ((work & $FF00) >> 8)
#define bitprefix .byte $2C

zplocation .equ $CC             ;; Uses memory locations starting at this address. Make sure they are not in use by the system or your program.
org        .equ $4C00           ;; Start of main code
ptr1       .equ zplocation+$0   ;; Source
ptr2       .equ zplocation+$2   ;; Destination
ptr3       .equ zplocation+$4   ;; Size

        .org org
memcpy
        ldy #0                  ; Needs Y = 0
        ldx ptr3+1              ; Get high byte of n
        beq L2                  ; Jump if zero
L1                              ; Unrolled to make it faster...
        lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        bne L1
        inc ptr1+1
        inc ptr2+1
        dex                     ; Next 256 byte block
        bne L1                  ; Repeat if any
        ; the following section could be 10% faster if we were able to copy
        ; back to front - unfortunately we are forced to copy strict from
        ; low to high since this function is also used for
        ; memmove and blocks could be overlapping!
L2                              ; assert Y = 0
        ldx ptr3                ; Get the low byte of n
        beq done                ; something to copy
L3      lda (ptr1),Y            ; copy a byte
        sta (ptr2),Y
        iny
        dex
        bne L3
done    rts                     ; return
        .end

Have Fun!
jacobus (Author) Posted July 29, 2020

6 hours ago, dmsc said: Are you trying to convert this routine - which was made to be called from CC65-compiled C code - to ASM-only code? Then you should simply remove the usage of "SP" (the C stack) altogether, and assume that ptr1, ptr2 and ptr3 have the parameters: <snip>

Thank you - much appreciated! Progress! It now simply hangs the computer instead of crashing Altirra!

I think I forgot to mention that this needs to run in the VBI. Does that change anything?
zbyti Posted July 29, 2020

15 minutes ago, jacobus said: I think I forgot to mention that this needs to run in the VBI. Does that change anything?

You must finish the copy before the next interrupt, that's all.
Wrathchild Posted July 29, 2020

I would think it depends on what you are calling within the VBI, and hence whether that is using/trashing any of the shared variables.
thorfdbg Posted July 30, 2020

22 hours ago, jacobus said: I think I forgot to mention that this needs to run in the VBI. Does that change anything?

Note well that the zero-page variables used in your function, and in the "stack handling functions" of the C compiler library, are then not available for anything else. In particular, since the stack handling functions are likely used by the main part of the program, and likely require some zero-page variables, this is likely to fail. Interrupts typically mean "assembler only". The 6502 is badly equipped for higher-level languages that require stack handling.
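To illustrate the zero-page concern raised above, here is a minimal sketch. It assumes the copy pointers live at src/dst as in the TASM listings in this thread, and that the interrupted main program may be using the same locations. The label vbi_copy is hypothetical, and how the interrupt handler itself is entered and left depends on the language runtime and is not shown.

```asm
; Hypothetical sketch: preserve the shared zero-page pointers around a copy
; made from inside an interrupt, so the main program's values survive.
zplocation .equ $CC             ; assumed zero-page area, as in the posts above
src        .equ zplocation+$0
dst        .equ zplocation+$2

vbi_copy
        lda src                 ; push the four pointer bytes
        pha
        lda src+1
        pha
        lda dst
        pha
        lda dst+1
        pha

        ; ... set src/dst for this copy and call the copy routine here ...

        pla                     ; pull in the reverse order of the pushes
        sta dst+1
        pla
        sta dst
        pla
        sta src+1
        pla
        sta src
        rts
```

The alternative is to give the interrupt code its own zero-page locations that nothing else touches, which avoids the save and restore entirely.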
jacobus (Author) Posted July 30, 2020

Thanks for the responses! Sounds like I was starting with the wrong code. Can anyone recommend an assembly routine that would do the following:

-Copy small sequences of data (<64 bytes) from one location to another
-Run in the VBI
-Compatible with the TASM cross assembler version 3.2
-Fast and light

thank you!
zbyti Posted July 30, 2020

@jacobus and what's wrong with the procedure from the Action! runtime? It is pure ML and easy to rewrite for any assembler.
ilmenit Posted July 30, 2020

2 hours ago, jacobus said: -Copy small sequences of data (<64 bytes) from one location to another -Run in the VBI -Compatible with the TASM cross assembler version 3.2 -Fast and light

64 bytes fit in one page, which can greatly simplify the code. Do you select the source or destination location? Is at least one of them constant?
jacobus (Author) Posted August 10, 2020

On 7/30/2020 at 3:32 PM, zbyti said: @jacobus and what's wrong with the procedure from Action! runtime? It is pure ML easy to rewrite in any assembly

Nothing, but I couldn't see how to pass the parameters that I need - sorry to have ignored your response!
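On passing parameters to MoveBlock: the roles below are read off the disassembly posted earlier in the thread (A, X and Y are stored to $A0, $A1 and $A2, and the routine reads $A3, $A4 and $A5 directly). This is only a sketch of a possible caller. The label call_moveblock and the example addresses are hypothetical, and it assumes $A0-$A5 are free to use in your program.

```asm
; Hypothetical caller for PROC MoveBlock(CARD d, s, l), per the listing above:
;   A -> $A0 (dest low), X -> $A1 (dest high), Y -> $A2 (source low),
;   $A3 = source high, $A4/$A5 = length low/high.
MoveBlock  .equ $2000           ; address shown in the listing; adjust to where the routine lives

call_moveblock
        lda #$12                ; hi($1234), example source address
        sta $A3
        lda #56                 ; length, low byte
        sta $A4
        lda #0                  ; length, high byte
        sta $A5
        lda #$78                ; lo($5678), example destination address
        ldx #$56                ; hi($5678)
        ldy #$34                ; lo($1234)
        jsr MoveBlock
        rts
```

Note that the routine modifies $A1, $A3, $A4 and $A5 while it runs, so all of them have to be set again before each call.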
jacobus (Author) Posted August 10, 2020

On 7/30/2020 at 6:04 PM, ilmenit said: 64 bytes fit in one page what can greatly simplify the code. Do you select source or destination location? Is at least one of them constant?

Both source and destination are variable - the amount of data to copy is always 56 bytes - however I think I'd like to be able to specify that as well, just in case I find another use for this code.
ivop Posted August 10, 2020 (edited)

Do you cross page boundaries? Can page zero be a source or destination location?

Edit: do source and destination frequently change? Is self-modifying code allowed or might it need to run from ROM? Undocumented (not illegal) instructions, or should it also work on a non-standard machine with a 65C02 and up?

Edit2: tight code, or speed?

Edit3: size maximum of 128, 256, or more?

There are so many factors. Here's a sample:

src .equ zplocation+$0
dst .equ zplocation+$2

; caller sets src
; enter with A lo(dst) and X hi(dst)
; Y is size minus 1(!), maximum of 127 (i.e. 128 bytes)
memcopy
        sta dst
        stx dst+1
loop    lda (src),y
        sta (dst),y
        dey
        bpl loop
        rts

main
        lda #$34                ; lo($1234)
        sta src
        lda #$12                ; hi($1234)
        sta src+1
        lda #$78                ; lo($5678)
        ldx #$56                ; hi($5678)
        ldy #55                 ; 56 bytes
        jsr memcopy
        rts

I moved the storing of dst into the memcopy routine. That saves space on the caller side: no need to sta dst/stx dst+1 every time you call memcopy. This could be improved upon a lot, depending on your specific needs.

Edited August 10, 2020 by ivop
jacobus (Author) Posted August 11, 2020 (edited)

On 8/10/2020 at 11:44 AM, ivop said: Do you cross page boundaries? Can page zero be a source or destination location? <snip>

Thank you very much for the reply! In answer to your questions:

-I do not cross page boundaries
-zero page is not used as a source or destination
-source and destination change each time the copy routine is called
-rather not have self-modifying code, I may put this in a cart
-speed is my preference - I need to copy 56 bytes 32 times for a full screen redraw
-when you say size - do you mean size of code or bytes copied? Code size is not too important (under 256 bytes preferred); bytes copied each call is either 48 or 56

Questions:

-I understand the first two routines, but I don't understand the main routine at all. Why the constants ($1234 and $5678)?

thank you!

Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Edited August 11, 2020 by jacobus
StickJock Posted August 11, 2020 (edited)

If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled loop of repeated LDA (source),Y ; STA (dest),Y ; DEY

Your subroutine could look something like this:

COPY56BYTES:
        LDY #56
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 55
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 50
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
COPY48BYTESWITHYPRELOADEDWITH48:
        LDA (COPYSOURCE),Y      ; 48
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 45
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 40
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 35
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 30
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 25
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 20
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 15
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 10
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y      ; 5
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        DEY
        LDA (COPYSOURCE),Y
        STA (COPYDEST),Y
        RTS

After loading your source address in COPYSOURCE and your destination address in COPYDEST, you can call COPY56BYTES to do the copy as fast as possible. If you want to copy 48 bytes, then also load Y with 48 and call COPY48BYTESWITHYPRELOADEDWITH48. If you need some other number of bytes copied, you can add labels in the appropriate place in the unrolled loop and call them (with Y preloaded as appropriate).

I think that this will give you the fastest memory copy, using 2 + N*(5+6+2) - 2 + 6 cycles, not counting the JSR and the loading of the COPYSOURCE & COPYDEST words.

As for size, this sub will take 2+N*5 bytes. So, for N=56, it will be 282 bytes long. If you want the speed, you have to pay for it somewhere else!

Edited August 11, 2020 by StickJock (added size calculation)
ivop Posted August 12, 2020 (edited)

On 8/11/2020 at 9:15 PM, jacobus said: Edit: Wait a second, do memcopy and loop perform the actual copy and main simply sets up and calls it?

Exactly. main is just an example of how you call memcopy, i.e. set up src, load dst in A/X and size in Y, and call the routine. ($1234 and $5678 are just example source and destination addresses.)

@StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. You could add the same trick of having COPYSOURCE (src in my case) set by the caller, and COPYDEST (dst) by the callee, to save space everywhere you call this routine.

Edited August 12, 2020 by ivop
StickJock Posted August 12, 2020

10 minutes ago, ivop said: @StickJock's code is off by one. It copies bytes 1-56 instead of 0-55. But the unroll is correct and faster. <snip>

Doh! Classic mistake. Thanks for catching it. I didn't actually test this - I just wrote it here in the thread. Change the LDY to #56-1, and call into the '48' label with Y = 48-1 (and maybe change the name of the label).
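The off-by-one fix described above can be sketched on a shortened, 4-byte version of the unrolled copy; the full 56-byte routine follows the same pattern. The zero-page addresses and the label names COPY4BYTES and COPY2BYTES_Y1 are assumptions for illustration only.

```asm
; Sketch of the off-by-one fix on a 4-byte unrolled copy.
; With (zp),Y addressing, the N bytes of a block are at offsets 0 .. N-1.
COPYSOURCE .equ $CC             ; assumed zero-page pointer locations
COPYDEST   .equ $CE

COPY4BYTES
        ldy #4-1                ; start at the last offset: 3, not 4
        lda (COPYSOURCE),y      ; offset 3
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; offset 2
        sta (COPYDEST),y
        dey
COPY2BYTES_Y1                   ; alternate entry: call with Y preloaded with 2-1
        lda (COPYSOURCE),y      ; offset 1
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; offset 0
        sta (COPYDEST),y
        rts
```

The last copy is not followed by DEY, matching the original routine, so the cycle and size estimates given for it are unchanged by the fix.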
E474 Posted August 12, 2020

Hi,

Although completely unrolling the code is the fastest way to copy memory, you can also have a block of 8 lda/sta statements, and loop 6 or 7 times depending on whether you want to copy 48 bytes (6*8), or 56 bytes (7*8). This would use less memory for code, and only be a bit slower.

Alternatively, you could copy 16 bytes at a time (16 lda/sta statements inside the loop), and have a final 8 lda/sta statements after the loop that only get executed if you want to copy 56 bytes, or you could jump back halfway into the loop for the final 8 bytes (of the 56). You would use the X register (decrementing) as a loop counter, and have to test a memory location when the loop has been completed to determine 48/56 byte copy mode, but I think this is a good balance between speed and size.

Hope this helps!
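A minimal sketch of the first variant described above (8 copies per pass, 6 or 7 passes), assuming the COPYSOURCE/COPYDEST zero-page pointers from the earlier posts. The labels COPYBLOCKS and CBLOOP and the pointer addresses are hypothetical, and the routine has not been tested on hardware.

```asm
; Partially unrolled copy: 8 bytes per pass, X passes.
; Enter with X = 6 (48 bytes) or X = 7 (56 bytes); X must not be 0.
COPYSOURCE .equ $CC             ; assumed zero-page pointer locations
COPYDEST   .equ $CE

COPYBLOCKS
        txa
        asl a                   ; A = X * 8 = byte count
        asl a
        asl a
        tay
        dey                     ; Y = offset of the last byte
CBLOOP
        lda (COPYSOURCE),y      ; copy 1 of 8
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 2
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 3
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 4
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 5
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 6
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 7
        sta (COPYDEST),y
        dey
        lda (COPYSOURCE),y      ; copy 8
        sta (COPYDEST),y
        dey
        dex                     ; one block of 8 done
        bne CBLOOP
        rts
```

Compared with the fully unrolled routine, this adds one DEX/BNE pair per 8 bytes plus the short setup, in exchange for a much smaller code footprint.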
jacobus (Author) Posted August 12, 2020

23 hours ago, StickJock said: If you don't care about code size, and you want it as fast as possible, and the source & destination blocks can change every time, then I think that the fastest would be to set up your source & destination addresses in a pair of zero page words, and then use an unrolled-loop of repeated LDA (source),Y ; STA (dest),Y ; DEY <snip>

Thank you for this! Once I get the code working, I'll come back to this. For now, I'll go with readability.
jacobus (Author) Posted August 12, 2020

4 hours ago, ivop said: Exactly main is just an example of how you call memcopy, i.e. setup src, load dst in AX and size in Y, and call the routine. <snip>

Thanks for all the help, I am definitely closer! The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI. I should elaborate - it runs properly once when called from the VBI but then locks up the computer. Do I need to do something different when exiting?

        .lsfirst
zplocation .equ $CC             ;; Uses memory locations starting at this address.
org        .equ $4C00           ;; Start of main code
src        .equ zplocation+$0   ;; Source
dst        .equ zplocation+$2   ;; Destination
bytes      .equ zplocation+$4   ;; Size

        .org org
; caller sets src & dst & bytes
; Y is size minus 1(!), maximum of 127 (i.e. 128 bytes)
memcopy
; sta dst
; stx dst+1
        ldy bytes
loop    lda (src),y
        sta (dst),y
        dey
        bpl loop
        rts
        .end
ivop Posted August 12, 2020

7 minutes ago, jacobus said: Thanks for all the help, I am definitely closer! The code (with some very minor changes) works when called from the main part of the program but fails when run in the VBI. I should elaborate - it runs properly once when called from the VBI but then locks up the computer. Do I need to do something different when exiting? <snip>

Could you post your VBI routine, and how it's calling memcopy? And how do you set up the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.
jacobus (Author) Posted August 12, 2020

1 hour ago, ivop said: Could you post your VBI routine, and how it's calling memcopy? And how do you setup the VBI? It might be that registers (AXYP) are not preserved during the interrupt call.

Sure, but it's in Quick so may not be much help. The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers, but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly. Is this something I can add to the copy routine instead?

INTER VERTBLNK
LOCAL
BYTE
[
V
]
WORD
[
VD,VM
OP=130,SP
]
BEGIN
* PUSH
* IPUSH
* ZPUSH
* SP=OP
  IF VBENABLE=1
    ZZC=0
    REPEAT
      CALL($00,$00,$00,$4C00)
      ZZC+
    UNTIL ZZC=33
    VBENABLE=0
  ENDIF
*Horizontal Scrolling
  IF SCRLH=1 ;scroll right -->
    IF HFS=9
      IF HCS<16 ;limit of screen
        IF MAPXR<65
          HFS=12
          HCS+
        ENDIF
      ENDIF
    ELSE
      HFS-
    ENDIF
  ENDIF
  IF SCRLH=255 ;scroll left <--
    IF HFS=12
      IF HCS>0
        IF MAPXL>0
          HFS=9
          HCS-
        ENDIF
      ENDIF
    ELSE
      HFS+
    ENDIF
  ENDIF
  HSCR00=HCS
  HSCR01=HCS
  HSCR02=HCS
  HSCR03=HCS
  HSCR04=HCS
  HSCR05=HCS
  HSCR06=HCS
  HSCR07=HCS
  HSCR08=HCS
  HSCR09=HCS
  HSCR10=HCS
  HSCR11=HCS
  HSCR12=HCS
  HSCR13=HCS
  HSCR14=HCS
  HSCR15=HCS
  HSCR16=HCS
*vertical scrolling
  IF SCRLV=1 ;top down (plyr moves up)
    IF VFS=0
      IF VCS>0
        IF MAPYT>0
          VSCR00-
          VSCR01-
          VSCR02-
          VSCR03-
          VSCR04-
          VSCR05-
          VSCR06-
          VSCR07-
          VSCR08-
          VSCR09-
          VSCR10-
          VSCR11-
          VSCR12-
          VSCR13-
          VSCR14-
          VSCR15-
          VSCR16-
          VFS=7
          VCS-
        ENDIF
      ENDIF
    ELSE
      VFS-
    ENDIF
  ENDIF
  IF SCRLV=255 ;bot up (plyr moves dn)
    IF VFS=7
      IF VCS<16
        IF MAPYB<64
          VSCR00+
          VSCR01+
          VSCR02+
          VSCR03+
          VSCR04+
          VSCR05+
          VSCR06+
          VSCR07+
          VSCR08+
          VSCR09+
          VSCR10+
          VSCR11+
          VSCR12+
          VSCR13+
          VSCR14+
          VSCR15+
          VSCR16+
          VFS=0
          VCS+
        ENDIF
      ENDIF
    ELSE
      VFS+
    ENDIF
  ENDIF
*handle joystick
  IF STICK0<>15
    V=JOYX(STICK0) ;LUT
    ADD(PX,V,PX)
    SCRLH=0 ;no scroll flag
    IF PX<124
      SUB(PX,V,PX)
      SCRLH=255 ;scroll left flag <--
    ENDIF
    IF PX>125
      SUB(PX,V,PX)
      SCRLH=1 ;scroll right flag -->
    ENDIF
    V=JOYY(STICK0) ;LUT
    ADD(PY,V,PY)
    SCRLV=0
    IF PY<119
      SUB(PY,V,PY)
      SCRLV=1
    ENDIF
    IF PY>121
      SUB(PY,V,PY)
      SCRLV=255
    ENDIF
  ELSE
    SCRLH=0
    SCRLV=0
  ENDIF
* OP=SP
* PULL
* IPULL
* ZPULL
ENDVBI
StickJock Posted August 12, 2020

21 minutes ago, jacobus said: Sure, but it's in Quick so may not be much help. The PUSH/PULL, IPUSH/IPULL, ZPUSH/ZPULL (currently commented out) are supposed to be used to save and restore the CPU registers but the documentation is both confusing and contradictory and I have never quite figured out how to use them properly. <snip>

You need to pull/pop in the reverse order of your pushes.
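The reverse-order rule can be sketched in plain 6502 terms. This is a generic illustration with a hypothetical label, not the code Quick generates. If Quick's PULL, IPULL and ZPULL mirror PUSH, IPUSH and ZPUSH, the same rule would presumably give the restore order ZPULL, IPULL, PULL, but the Quick documentation should confirm that. It also assumes a handler that ends with RTI; handlers chained through the Atari OS vertical blank vectors normally exit through an OS routine instead, which is not shown here.

```asm
; Generic 6502 sketch: the stack is last-in, first-out, so registers
; pushed in the order A, X, Y must be pulled in the order Y, X, A.
irq_handler
        pha                     ; push A
        txa
        pha                     ; push X
        tya
        pha                     ; push Y

        ; ... interrupt work goes here ...

        pla                     ; pull Y (pushed last, pulled first)
        tay
        pla                     ; pull X
        tax
        pla                     ; pull A (pushed first, pulled last)
        rti
```

Pulling in the same order as the pushes would leave A, X and Y holding each other's values on return, which can crash or hang the interrupted program.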
ivop Posted August 12, 2020

2 minutes ago, StickJock said: You need to pull/pop in the reverse order as your push.

Sharp!
StickJock Posted August 12, 2020

14 minutes ago, ivop said: Sharp!

Well, after my off-by-one bug, I had to save face!