damosan Posted December 22, 2020

See attached code below. I'm calling this from a C program - it uses graphics mode 8 but only plots 256x192 so I can use bytes. YB, XB, and ARGUMENT are zero page. yindexhi, yindexlo, byteoffset256_table and bitmask_table are lookups.

It works pretty well plotting 49k pixels in about 106 jiffies. I can replace the JSR/RTS with two JMPs (saving about 6 cycles per pixel - a little less if I do a jump indirect back to the caller).

I've been staring at this for a while so I might be overlooking something short of inlining this.

Thanks.

;;;
;;; _plot_pixel_256 should be called the first time we plot on a
;;; new row.  As long as we're plotting on the same row we can
;;; call _plot_pixel_256_fast as the only item that changes is the
;;; column.
;;;
_plot_pixel_256:
        ldy     yb                      ; load row
        lda     yindexhi,y              ; get row address
        sta     argument+1              ; argument = memory to write to
        lda     yindexlo,y
        sta     argument
_plot_pixel_256_fast:                   ; call this if we're writing to same row
        ldx     xb                      ; load column
        ldy     byteoffset256_table,x   ; get byte offset (4 - 35)
        lda     (argument),y            ; load screen byte
        eor     bitmask_table,x         ; xor it with pixel bitmask
        sta     (argument),y            ; store it back to screen byte
        rts
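The four lookup tables are not shown in the post. Below is a minimal host-side C sketch of how they could be built. It assumes a linear Gr.8 screen with 40 bytes per line, the 256-pixel window centred at byte offset 4 (which gives the 4 - 35 range in the comment), and bit 7 as the leftmost pixel. The `screen` buffer, `build_tables`, and the `base` address are illustrative names, not part of the posted code.

```c
#include <stdint.h>

#define ROWS          192
#define BYTES_PER_ROW 40      /* Gr.8 normal playfield: 320 pixels / 8 */
#define LEFT_MARGIN   4       /* centres 256 pixels: (40 - 32) / 2 bytes */

static uint8_t screen[ROWS * BYTES_PER_ROW];   /* host stand-in for screen RAM */
static uint8_t yindexlo[ROWS], yindexhi[ROWS];
static uint8_t byteoffset256_table[256], bitmask_table[256];

/* Fill the four lookup tables; `base` is the (assumed) screen address. */
static void build_tables(uint16_t base)
{
    for (int y = 0; y < ROWS; ++y) {
        uint16_t addr = (uint16_t)(base + y * BYTES_PER_ROW);
        yindexlo[y] = (uint8_t)(addr & 0xFF);
        yindexhi[y] = (uint8_t)(addr >> 8);
    }
    for (int x = 0; x < 256; ++x) {
        byteoffset256_table[x] = (uint8_t)(LEFT_MARGIN + x / 8);
        bitmask_table[x] = (uint8_t)(0x80 >> (x % 8));  /* leftmost pixel = bit 7 */
    }
}

/* Same steps as _plot_pixel_256, applied to the host-side buffer. */
static void plot_pixel_256(uint16_t base, uint8_t xb, uint8_t yb)
{
    uint16_t row = (uint16_t)((yindexhi[yb] << 8) | yindexlo[yb]);
    screen[(row - base) + byteoffset256_table[xb]] ^= bitmask_table[xb];
}
```

With a hypothetical `base` of 0x4000, plotting x=9, y=2 toggles bit 6 of byte 5 in row 2, and plotting it a second time clears it again, which is the EOR behaviour of the routine.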
drac030 Posted December 22, 2020

Having each display line at address $yy00 one could perhaps shorten the first 5 instructions.
damosan Posted December 22, 2020

30 minutes ago, drac030 said: "Having each display line at address $yy00 one could perhaps shorten the first 5 instructions."

Using cc65 I believe I'd have to edit the config file to create a number of special 0xff byte segments - then define the tables to reside in these segments. Then it should be possible to do a 2 byte vs. 3 byte lookup?
danwinslow Posted December 22, 2020

I don't think you'd HAVE to do all the config stuff. You could just lay out a larger space and do your addressing inside of that, I think. Anyway even if you made the segments you'd still waste some memory because it couldn't pack around them all most likely anyway.
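To put numbers on the $yy00 idea, here is a small C sketch of the address arithmetic. It assumes one display line per 256-byte page starting at a hypothetical page `first_page`. With that layout the row pointer's low byte is always 0, so only the high byte would need to be written per row. The function names are illustrative.

```c
#include <stdint.h>

#define ROWS 192

/* Row address when every display line starts at $yy00 (assumed layout). */
static uint16_t row_address(uint8_t first_page, uint8_t y)
{
    return (uint16_t)((first_page + y) << 8);   /* low byte is always 0 */
}

/* Address range spanned by all rows: one full page per row. */
static uint32_t span_bytes(void)
{
    return (uint32_t)ROWS * 256u;
}

/* Bytes actually displayed, for a given number of bytes per line. */
static uint32_t used_bytes(uint8_t bytes_per_row)
{
    return (uint32_t)ROWS * bytes_per_row;
}
```

The rows span 49152 bytes of address space while a 32-byte-wide picture displays only 6144 of them. The remainder of each page is free for other data but cannot be packed into one block, which is the kind of waste mentioned above.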
damosan Posted December 23, 2020

As an aside I'm pushing 17k pixels per second - though I'm limiting it to 255x192 so I can push bytes. I modified the config file to put the buffers into page aligned memory. I'm going to mess around with the ZP loads.
ivop Posted December 23, 2020

Perhaps you can also try using narrow playfield (see $D400/DMACTL). It also saves on Antic DMA cycles.
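As a sketch of what the narrow playfield setting involves: the playfield width lives in bits 0-1 of DMACTL (1 = narrow, 2 = normal, 3 = wide), and a narrow hi-res line is 32 bytes, exactly 256 pixels. The helper names below are made up for illustration. On the Atari the value would normally go through the OS shadow register SDMCTL ($022F) rather than being computed on the host like this.

```c
#include <stdint.h>

enum playfield_width { PF_NONE = 0, PF_NARROW = 1, PF_NORMAL = 2, PF_WIDE = 3 };

/* Replace bits 0-1 (playfield width) of a DMACTL value, keep the other bits. */
static uint8_t dmactl_set_width(uint8_t dmactl, enum playfield_width w)
{
    return (uint8_t)((dmactl & 0xFCu) | (uint8_t)w);
}

/* Bytes ANTIC fetches per hi-res (Gr.8) mode line for each width. */
static uint8_t bytes_per_line(enum playfield_width w)
{
    static const uint8_t bytes[4] = { 0, 32, 40, 48 };
    return bytes[w];
}
```

Starting from the usual $22 (normal playfield plus display list DMA), the narrow setting is $21. Rows would then be 32 bytes apart and the byte offsets would run 0 - 31 instead of 4 - 35, so the lookup tables would need rebuilding.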
+Stephen Posted December 23, 2020

1 hour ago, damosan said: "As an aside I'm pushing 17k pixels per second - though I'm limiting it to 255x192 so I can push bytes. I modified the config file to put the buffers into page aligned memory. I'm going to mess around with the ZP loads."

Sounds cool - can you share what you're working on yet?
damosan Posted December 23, 2020

17 minutes ago, Stephen said: "Sounds cool - can you share what you're working on yet?"

Nothing in particular - just seeing how fast I can paint a screen with pixels painting one per pass (vs doing 8 pixels at a time by copying 0xff). The general use routine above is slower than this code. This can probably be made faster by increment argument by 40 after the first load.

;;; _speedtest
;;;
;;; at present this will paint a 255x192 Gr.8 screen at 777 pixels
;;; per jiffy (46.6k per second).
;;;
;;; runtime = 63 jiffies
;;;
_speedtest:
        lda     #0
        sta     yb
paint_row:
        ;;
        ;; get row offset
        ;;
        ldy     yb
        lda     yindexhi,y              ; get row address
        sta     argument+1              ; argument = memory to write to
        lda     yindexlo,y
        sta     argument
        ;;
        ;; init column to 0
        ;;
        ldx     #0
paint_col:
        ;;
        ;; plot pixel
        ;;
        ldy     byteoffset256_table,x   ; (4/5) get byte offset (4 - 35)
        lda     (argument),y            ; (5/6) load screen byte
        ora     bitmask_table,x         ; (4/5) OR it with pixel bitmask
        sta     (argument),y            ; (6) store it back to screen byte
        ;;
        ;; increment column
        ;;
        inx                             ; (2) increment X
        bne     paint_col               ; (2) if we increment $ff it wraps to $00 so...
end_of_cols:
        ;;
        ;; increment row
        ;;
        lda     yb
        cmp     #191
        beq     end_of_rows
        inc     yb
        jmp     paint_row
end_of_rows:
        rts
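The "increment argument by 40" remark can be checked on the host with a short C sketch. It assumes a linear screen with 40 bytes per row, and `rows_by_increment` and `base` are illustrative names. The pointer is stepped once per row instead of being fetched from yindexhi/yindexlo.

```c
#include <stdint.h>

#define ROWS          192
#define BYTES_PER_ROW 40

/* Fill rows[] by stepping a 16-bit pointer 40 bytes per row,
   instead of looking each row address up in a table. */
static void rows_by_increment(uint16_t base, uint16_t rows[ROWS])
{
    uint16_t p = base;
    for (int y = 0; y < ROWS; ++y) {
        rows[y] = p;
        /* On the 6502 this is a 16-bit add: add 40 to the low byte
           and propagate the carry into the high byte. */
        p = (uint16_t)(p + BYTES_PER_ROW);
    }
}
```

The result matches `base + 40 * y` for every row. Since the work happens once per row rather than once per pixel, any gain over the table lookup is likely to be small next to the inner loop.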
rensoup Posted December 23, 2020

57 minutes ago, damosan said: "Nothing in particular - just seeing how fast I can paint a screen with pixels painting one per pass (vs doing 8 pixels at a time by copying 0xff). The general use routine above is slower than this code. This can probably be made faster by increment argument by 40 after the first load."

It's difficult to optimize something that is so generic, it all depends on your use case... you could have 256 plot routines with #Imm instead of using those 2 tables, the problem would be selecting between those routines quickly enough
xxl Posted December 23, 2020

1 hour ago, damosan said: "Nothing in particular - just seeing how fast I can paint a screen with pixels painting one per pass (vs doing 8 pixels at a time by copying 0xff). The general use routine above is slower than this code. This can probably be made faster by increment argument by 40 after the first load."

5 less cycles per loop if you use the Y register to carry values from the end of the loop
-- and 3 per init
-- and one 4 cycle less per loop if plot on ZP
Rybags Posted December 23, 2020

A big saving could be had by just embedding it in the program rather than having it as a sub... though seeing it's used with C that mightn't be possible. Also, the position variables - if one or both could be used directly instead of copying.
damosan Posted December 24, 2020

2 hours ago, xxl said: "5 less cycles per loop if you use the Y register to carry values from the end of the loop -- and 3 per init -- and one 4 cycle less per loop if plot on ZP"

yb is a zp byte. The lookup tables are page aligned. How would you rewrite the above to use X and Y based on Y being required the way it is?
xxl Posted December 24, 2020

1 hour ago, damosan said: "yb is a zp byte. The lookup tables are page aligned. How would you rewrite the above to use X and Y based on Y being required the way it is?"

        ldy     yb
becomes:
        ldy     #
yb      equ     *-1
1 cycle less

        sta     argument
        sta     argument+1
from ABS (4cycle) becomes ZP (3 cycle)
2 less

        lda     (argument),y    (5 cycle)
becomes
        lda     $ffff,y         (4 cycle)
argument equ    *-2
1 less

        lda     #0
        sta     yb
paint_row:
        ;;
        ;; get row offset
        ;;
        ldy     yb
becomes
        ldy     #0
paint_row sty   yb
3 cycle less

        lda     yb
        cmp     #191
        beq     end_of_rows
        inc     yb
        jmp     paint_row
becomes
        ldy     #
yb      equ     *-1
        iny
        cpy     #192
        bcc     paint_row
6 cycles less?
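The savings quoted above can be cross-checked against the standard 6502 cycle counts. The C sketch below only adds up documented per-instruction costs. It assumes no page crossings and branches within one page, and the constant and function names are made up for the tally.

```c
#include <stdint.h>

/* Standard 6502 cycle counts for the addressing modes involved
   (no page crossing, branch target in the same page). */
enum {
    CYC_IMM        = 2,  /* lda/ldy/cmp/cpy #n   */
    CYC_ZP         = 3,  /* lda/ldy/sta/sty zp   */
    CYC_ABS        = 4,  /* lda/sta abs          */
    CYC_ABS_Y_LOAD = 4,  /* lda abs,y            */
    CYC_IND_Y_LOAD = 5,  /* lda (zp),y           */
    CYC_INC_ZP     = 5,  /* inc zp               */
    CYC_IMPLIED    = 2,  /* iny                  */
    CYC_JMP        = 3,  /* jmp abs              */
    CYC_BR_TAKEN   = 3,  /* branch taken         */
    CYC_BR_NOT     = 2   /* branch not taken     */
};

/* lda yb / cmp #191 / beq end_of_rows / inc yb / jmp paint_row */
static int old_row_step(void)
{
    return CYC_ZP + CYC_IMM + CYC_BR_NOT + CYC_INC_ZP + CYC_JMP;
}

/* ldy # / iny / cpy #192 / bcc paint_row */
static int new_row_step(void)
{
    return CYC_IMM + CYC_IMPLIED + CYC_IMM + CYC_BR_TAKEN;
}
```

By this count the row step drops from 15 to 9 cycles, which confirms the "6 cycles less?" guess. The immediate load saves 1 cycle over the zero page load, and the absolute indexed load saves 1 over the indirect indexed one. The 2 cycles from the `sta` change only apply if the assembler was emitting absolute stores to begin with.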
rensoup Posted December 24, 2020

3 hours ago, xxl said:
        lda     (argument),y    (5 cycle)
becomes
        lda     $ffff,y         (4 cycle)
argument equ    *-2

Oh yeah that works!
damosan Posted December 24, 2020

12 hours ago, Rybags said: "A big saving could be had by just embedding it in the program rather than having it as a sub... though seeing it's used with C that mightn't be possible. Also, the position variables - if one or both could be used directly instead of copying."

It's possible though it's kind of a PITA to embed assembly directly into C code - it's very easy, of course, to create separate assembly routines and let the linker figure it out.
Estece Posted December 25, 2020

Using all above suggestions and mads i got 42 PAL or 53 NTSC frames for full fill 256x192 pixels.

Attachment: fastantF.xex
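For comparing the frame and jiffy counts in this thread, a small C helper converts them to pixels per second. It assumes nominal rates of 50 frames per second for PAL and 60 for NTSC. The function name is illustrative.

```c
#include <stdint.h>

/* Pixels per second, given a frame (jiffy) count and the frame rate.
   Integer division, so the result is rounded down. */
static uint32_t pixels_per_second(uint32_t pixels, uint32_t frames, uint32_t hz)
{
    return (pixels * hz) / frames;
}
```

Under those assumptions, 256x192 pixels in 42 PAL frames is about 58.5k pixels per second and in 53 NTSC frames about 55.6k. The earlier 63-jiffy run over 255x192 pixels works out to about 46.6k, matching the figure in that listing's header comment.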
xxl Posted December 25, 2020

36 minutes ago, Estece said: "Using all above suggestions and mads i got 42 PAL or 53 NTSC frames for full fill 256x192 pixels. fastantF.xex"

hmmmmmm

0084: 84 8D     STY $8D
0086: 84 93     STY $93      - DELETE
0088: BC 00 07  LDY $0700,X
008B: B9 00 CA  LDA $xx00,Y
here equ *-2
008E: 1D 00 06  ORA $0600,X
0091: 99 00 CA  STA $xx00,Y  - REPLACE: STA (here),Y

check how much is slower (at a single point it can be faster)
Estece Posted December 25, 2020

Plus 2 frames :)

Attachment: plus2frames.xex
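A rough cycle count suggests why the `STA (here),Y` variant came out slower for a full fill. It trades one 3-cycle `STY` patch per row for one extra cycle on every store. The sketch below assumes the standard costs of 5 cycles for `sta abs,y`, 6 for `sta (zp),y` and 3 for `sty zp`. The function name is illustrative.

```c
/* Net cycle change per row when `sta abs,y` plus its extra `sty` patch
   is swapped for `sta (zp),y` with no extra patch.
   Negative means the indirect store is the faster choice. */
static int row_delta(int pixels_in_row)
{
    const int sta_abs_y = 5, sta_ind_y = 6, sty_zp = 3;
    return pixels_in_row * (sta_ind_y - sta_abs_y) - sty_zp;
}
```

For a single pixel per row the indirect store wins by 2 cycles, which fits the "at a single point it can be faster" remark. For 256 pixels per row it loses 253 cycles, or 48576 cycles over 192 rows. Assuming roughly 30,000 machine cycles per NTSC frame before DMA losses, that is in the region of the two extra frames reported.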
damosan Posted December 28, 2020

I got it down to 58 jiffies with this.

paint_row:
        ;;
        ;; get row offset
        ;;
        lda     yindexhi,y              ; get row address
        sta     ld+2                    ; write high byte to LD/WR so we...
        sta     wr+2                    ; ...can use an absolute version.
        lda     yindexlo,y
        sta     ld+1                    ; ditto...
        sta     wr+1
        ;;
        ;; init column to 0
        ;;
        ldx     #0
paint_col:
        ;;
        ;; plot pixel
        ;;
        ldy     byteoffset256_table,x   ; (4) get byte offset (4 - 35)
ld:     lda     $ffff,y                 ; (4)
        ora     bitmask_table,x         ; (4) OR it with pixel bitmask
wr:     sta     $ffff,y                 ; (5)
        ;;
        ;; increment column
        ;;
        inx                             ; (2) increment X
        bne     paint_col               ; (2) if we increment $ff it wraps to $00 so...
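The drop from 63 to 58 jiffies is about what the per-pixel cycle counts predict. The C sketch below tallies both inner loops using standard 6502 timings. It counts the taken `bne` as 3 cycles (the 2 in the listings is the not-taken cost) and assumes no page crossings. It also assumes that DMA losses and per-row overhead scale in the same proportion, which is a simplification.

```c
/* Cycles per pixel: ldy abs,x / lda (zp),y / ora abs,x / sta (zp),y / inx / bne taken */
static int loop_indirect(void)
{
    return 4 + 5 + 4 + 6 + 2 + 3;
}

/* Cycles per pixel: ldy abs,x / lda abs,y / ora abs,x / sta abs,y / inx / bne taken */
static int loop_absolute(void)
{
    return 4 + 4 + 4 + 5 + 2 + 3;
}

/* Scale a measured jiffy count by the cycle ratio, rounded to nearest. */
static int scaled_jiffies(int measured, int old_cycles, int new_cycles)
{
    return (measured * new_cycles + old_cycles / 2) / old_cycles;
}
```

By this count the self-modified loop takes 22 cycles per pixel against 24 for the indirect version, and 63 jiffies scaled by 22/24 comes to about 58.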