danwinslow Posted August 16, 2015

I am messing around writing a little graphics bit, and I wanted to check whether my single-pixel plot seems at least fairly fast. Right now it's running at 4,320 pixels per second, so it takes 53 jiffies (just under a second) to plot every pixel one by one on an Antic 0D screen. Does that seem close to par, or do I have more optimization to do? (Note that it's just a single-pixel-at-a-time plot; I know there are significant optimizations if I am plotting contiguous groups of pixels or lines.)

    and_masks: ; duplication is to make it conform to color masks
        .byte %00111111
        .byte %11001111
        .byte %11110011
        .byte %11111100
        .byte %00111111
        .byte %11001111
        .byte %11110011
        .byte %11111100
        .byte %00111111
        .byte %11001111
        .byte %11110011
        .byte %11111100
        .byte %00111111
        .byte %11001111
        .byte %11110011
        .byte %11111100

    c_masks:
        .byte %10000000 ; color 0: indices 0 1 2 3
        .byte %00100000
        .byte %00001000
        .byte %00000010
        .byte %01000000 ; color 1: indices 4 5 6 7
        .byte %00010000
        .byte %00000100
        .byte %00000001
        .byte %10000000 ; color 2: indices 8 9 10 11
        .byte %00100000
        .byte %00001000
        .byte %00000010
        .byte %11000000 ; color 3: indices 12 13 14 15
        .byte %00110000
        .byte %00001100
        .byte %00000011

    offset:     .res 1
    color_byte: .res 1
    mask_index: .res 1

        .importzp plot_x
        .importzp plot_y
        .importzp plot_c

        .code

    .proc _plot
        ldx plot_x
        ldy plot_y
        lda line_addr_lo,y
        sta ptr1
        lda line_addr_hi,y
        sta ptr1+1          ; ptr1 now points to start of screen row
        txa                 ; x to a
        lsr                 ; div 4
        lsr
        tay                 ; y now is offset to byte, (ptr1),y = screen byte
        ; get and mask
        txa
        and #3
        clc
        adc plot_c          ; pre-shifted color byte ( 0,4,8,12 )
        tax                 ; x = color map offset as well as and_masks offset
        lda (ptr1),y
        and and_masks,x
        ora c_masks,x
        sta (ptr1),y
        rts
    .endproc

        .export _plot
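For readers who want to check the table indexing without an emulator, here is a minimal Python sketch of what the routine above computes. The function name `plot`, the `bytearray` screen and `BYTES_PER_ROW = 40` are assumptions for illustration (the post uses a `line_addr` table instead of a multiply); the two mask tables are copied from the post.

```python
# Tables copied from the post; index = (x & 3) + plot_c, with plot_c in {0, 4, 8, 12}.
AND_MASKS = [0b00111111, 0b11001111, 0b11110011, 0b11111100] * 4
C_MASKS = [
    0b10000000, 0b00100000, 0b00001000, 0b00000010,  # plot_c = 0
    0b01000000, 0b00010000, 0b00000100, 0b00000001,  # plot_c = 4
    0b10000000, 0b00100000, 0b00001000, 0b00000010,  # plot_c = 8
    0b11000000, 0b00110000, 0b00001100, 0b00000011,  # plot_c = 12
]

BYTES_PER_ROW = 40  # assumption: 160 pixels per row at 2 bits per pixel


def plot(screen, x, y, plot_c):
    """Model of _plot: clear the pixel's two bits, then OR in the colour bits."""
    offset = y * BYTES_PER_ROW + (x >> 2)  # stands in for line_addr + x / 4
    index = (x & 3) + plot_c               # same index is used for both tables
    screen[offset] = (screen[offset] & AND_MASKS[index]) | C_MASKS[index]


screen = bytearray(BYTES_PER_ROW * 96)
plot(screen, 5, 2, 12)
print(f"{screen[2 * BYTES_PER_ROW + 1]:08b}")
```

One thing the model makes visible: in the tables as posted, the `plot_c = 0` row holds the same bit patterns as the `plot_c = 8` row, so plotting with colour 0 does not clear a pixel. Whether that is intended is not stated in the thread.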
Irgendwer Posted August 16, 2015

> Does that seem close to par or do I have more optimization to do?

What do you mean by 'have to'? Of course you could apply further optimizations, but depending on memory footprint etc., maybe your version is already fast enough? E.g. it may not be necessary to build additional table(s) that convert

    ; get and mask
    txa
    and #3
    clc
    adc plot_c ; pre-shifted color byte ( 0,4,8,12 )
    ...

color, position and source byte to a destination byte, only to squeeze some cycles out of the function. (Just to make a suggestion...)
danwinslow Posted August 16, 2015

Thanks, Irgendwer. I don't really have a criterion for how fast it has to be, and I am sure it's already more than fast enough for what I want to do, but I was wondering whether I had missed any big optimizations. I take it from your response that I haven't missed anything really obvious.
flashjazzcat Posted August 16, 2015

You can save 3-4 cycles per pixel by using a LUT to divide by 4:

    ldx plot_x
    ldy plot_y
    lda line_addr_lo,y
    sta ptr1
    lda line_addr_hi,y
    sta ptr1+1          ; ptr1 now points to start of screen row
    ldy div_by_4,x
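The `div_by_4` table itself is not shown in the thread. A small generator script is one way to produce it; the sketch below assumes a 256-entry table (so any 8-bit `plot_x` is a valid index) and emits ca65-style `.byte` lines. The helper names are made up for illustration.

```python
def div_by_4_table(size=256):
    """Entry x holds x // 4, the byte offset of pixel x within a row."""
    return [x >> 2 for x in range(size)]


def as_ca65_bytes(values, per_line=16):
    """Format a list of byte values as indented ca65 .byte lines."""
    lines = []
    for i in range(0, len(values), per_line):
        chunk = ", ".join(f"${v:02X}" for v in values[i:i + per_line])
        lines.append(f"    .byte {chunk}")
    return "\n".join(lines)


table = div_by_4_table()
print("div_by_4:")
print(as_ca65_bytes(table))
```

On the cycle count: `txa` / `lsr` / `lsr` / `tay` costs 8 cycles, while `ldy div_by_4,x` costs 4, or 5 if the indexed read crosses a page boundary. The `txa` that feeds `and #3` is still needed, which is where the 3-4 cycle figure comes from. Page-aligning the table should keep it at the 4-cycle case.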
danwinslow Posted August 16, 2015

Excellent, thanks, fjc.

*edit* That resulted in a 2-jiffy saving over the whole screen, so now at 4500pps.
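As a quick sanity check on the quoted figures: the same number of plots in fewer jiffies scales the rate by old/new. The sketch below assumes 60 jiffies per second (NTSC); the ratio itself does not depend on that assumption.

```python
JIFFIES_PER_SECOND = 60  # assumption: NTSC; a PAL machine counts 50 jiffies per second


def seconds(jiffies, rate=JIFFIES_PER_SECOND):
    """Convert a jiffy count to seconds."""
    return jiffies / rate


def scaled_rate(old_pps, old_jiffies, new_jiffies):
    """Same number of plots in fewer jiffies: the rate scales by old/new."""
    return old_pps * old_jiffies / new_jiffies


print(f"{seconds(53):.3f} s per pass before the change")
print(f"{scaled_rate(4320, 53, 51):.0f} pps after saving 2 jiffies")
```

Going from 53 to 51 jiffies takes 4,320 pps to roughly 4,490 pps, which is in line with the rounded 4500 quoted above.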
Rybags Posted August 16, 2015

You could save more by spending memory on a LUT for the pixel masks that covers every possible plot_x.

Then there's the whole subroutine thing: it costs 12 cycles for the JSR / RTS, which can be saved by embedding the routine within the program.

Further savings can be had by optimising the initial register loads. In an iterative situation where the pixel position is adjusted each loop, since you embed the plot point you could have one or two registers preloaded with coordinate data.
danwinslow Posted August 17, 2015

Hmm, another LUT... not sure how that would work. Do you mean the 'get masks index' part? I can see it would save maybe the txa / and #3 part, but I'm not sure how to combine the color index in there as well. The color index is 0,4,8,12 and the 'x and #3' provides the rest of the indexing.

The other two things you mention I've thought about. Unrolling the JSRs might be an option, and the register loads too, but mostly only for the special case where I'm just filling the whole screen sequentially in a speed check. I don't think normal usage would benefit as much from it. I just wanted a reasonably fast general-purpose plot.

    .proc _plot
        ldx plot_x
        ldy plot_y
        lda line_addr_lo,y
        sta ptr1
        lda line_addr_hi,y
        sta ptr1+1          ; ptr1 now points to start of screen row
        ldy x_div_by_4_lut,x
        ; get masks index
        txa
        and #3
        clc
        adc plot_c
        tax                 ; x = color map offset AND and_masks offset
        lda (ptr1),y
        and and_masks,x
        ora c_masks,x
        sta (ptr1),y
        rts
    .endproc
Rybags Posted August 17, 2015

Just make the tables bigger; skipping the need to AND #3 then saves some time. Though I notice you use a modified colour index, which probably won't be compatible with that method.
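One possible reading of "make the tables bigger" is a pair of tables indexed directly by `plot_x`, so the index no longer needs `and #3`. The sketch below is an interpretation, not Rybags' actual layout: it assumes a 160-pixel row and plain 2-bit colour values (0..3) rather than the thread's combined `c_masks` index, and all names are hypothetical.

```python
SCREEN_WIDTH = 160  # assumption: 160 pixels per row

AND_BY_POS = [0b00111111, 0b11001111, 0b11110011, 0b11111100]


def and_mask_by_x(width=SCREEN_WIDTH):
    """AND mask for every x: clears the two bits of that pixel's slot."""
    return [AND_BY_POS[x & 3] for x in range(width)]


def or_mask_by_x(colour, width=SCREEN_WIDTH):
    """OR mask for every x: the plain 2-bit colour shifted into the slot."""
    return [(colour & 3) << (6 - 2 * (x & 3)) for x in range(width)]


tables = [and_mask_by_x()] + [or_mask_by_x(c) for c in range(4)]
print(sum(len(t) for t in tables), "bytes of tables")
```

Under these assumptions the cost is one 160-byte AND table plus one 160-byte OR table per colour, 800 bytes in total. Because the colour now selects a table rather than being added into the index, the `(x & 3) + plot_c` scheme from the original routine would have to change, which matches the compatibility caveat above.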
RevEng Posted August 17, 2015

You can ditch the txa if you change "ldx plot_x" to "lax plot_x" and move it above "ldy x_div_by_4_lut,x"... yeah, I know. Send the opcode police after me.

It looks like the carry shouldn't get set during normal operation, so if this routine is called in a loop you can put the clc prior to the loop (assuming nothing else in the loop sets the carry).
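The carry claim can be checked by brute force. The sketch below models binary-mode ADC and tries every `plot_x` with each colour offset used in the thread (0, 4, 8, 12); the `adc` helper is a simplified model written for this check, not a full CPU emulation.

```python
def adc(a, operand, carry_in):
    """Binary-mode 6502 ADC: returns (8-bit result, carry out)."""
    total = a + operand + carry_in
    return total & 0xFF, 1 if total > 0xFF else 0


# Largest carry-out over every x and every colour offset, starting with carry clear.
worst = max(adc(x & 3, c, 0)[1] for x in range(256) for c in (0, 4, 8, 12))
print("carry can be set:", bool(worst))
```

The largest sum is 3 + 12 = 15, so the ADC never produces a carry, and the other instructions in the LUT version of the routine (loads, stores, AND, ORA, transfers) leave the carry flag alone. This does not hold for the original shift-based version, because `lsr` writes the carry flag.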
xxl Posted August 17, 2015

    txa
    and #3
    clc
    adc plot_c
    tax

equivalent to

    lda #3
    sbx #$100-plot_c

?
Heaven/TQA Posted August 17, 2015

buuuh XXL.... opcode police calling...
snicklin Posted August 17, 2015

How many pixels per second are you getting now?
danwinslow Posted August 17, 2015

Stop it, all of you! THEY...ARE...WATCHING.

But yeah, when, um, let's say unusual opcodes come out, I know it means it's probably optimized enough already for me. Although I don't understand xxl's suggestion entirely: the 'and #3' is not equivalent to lda #3, as there may be a bit pattern there from the original x offset that would get destroyed?
danwinslow Posted August 17, 2015

Snicklin - 4500
flashjazzcat Posted August 17, 2015

> ...now at 4500pps.
flashjazzcat Posted August 17, 2015

Beat me to it.
xxl Posted August 17, 2015

> I don't understand xxl's suggestion entirely - the 'and #3' is not equivalent to lda #3, as there may be a bit pattern there from the original x offset that would get destroyed?

try:

    ldx #$ff
    txa
    and #$0f
    clc
    adc #$01
    tax

compare to this:

    ldx #$ff
    lda #$0f
    sbx #$100-$01
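To see why the two sequences leave the same value in X, here is a small Python model. It assumes the commonly documented behaviour of the undocumented `$CB` opcode, X = (A AND X) - immediate, with carry set as for a compare. The function names are hypothetical.

```python
def sbx(a, x, imm):
    """Model of SBX/AXS #imm ($CB): returns (new X, carry)."""
    left = a & x
    carry = 1 if left >= imm else 0  # set like CMP: no borrow means carry set
    return (left - imm) & 0xFF, carry


def documented(x, and_imm, add_imm):
    """txa / and #and_imm / clc / adc #add_imm / tax: returns new X."""
    return ((x & and_imm) + add_imm) & 0xFF


# xxl's example: X = $FF, mask $0F, add $01.
print(hex(documented(0xFF, 0x0F, 0x01)), hex(sbx(0x0F, 0xFF, 0x100 - 0x01)[0]))
```

Subtracting `$100 - n` modulo 256 is the same as adding `n`, and the opcode does the AND with X itself, so `lda #3` only supplies the mask and the bit pattern in X is not lost. Three differences are worth keeping in mind. The accumulator ends up holding the mask rather than the sum, and the carry flag comes out differently. The operand is an immediate, so the colour offset has to be a constant known at assembly time; with `plot_c` declared as a zero-page variable, `#$100-plot_c` would subtract the symbol's address rather than its contents. And for an offset of 0 the expression `$100-0` does not fit in a byte, so that case would need an operand of 0.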
danwinslow Posted August 17, 2015

Hmm -

> AXS #i ($CB ii, 2 cycles)
> Sets X to {(A AND X) - #value without borrow}, and updates NZC. One might use TXA AXS #-element_size to iterate through an array of structures or other elements larger than a byte, where the 6502 architecture usually prefers a structure of arrays. For example, TXA AXS #$FC could step to the next OAM entry or to the next APU channel, saving one byte and four cycles over four INXs. Also called SBX.

So this includes an AND A as part of the operation? I'll give this a try and see what happens, thanks XXL.
danwinslow Posted August 17, 2015

Well, it didn't take SBX, so I used AXS, which supposedly is a synonym. But it didn't work.
Shawn Jefferson Posted August 18, 2015

Are you using the cc65 suite?

http://www.cc65.org/doc/ca65-4.html
danwinslow Posted August 18, 2015

Hi Shawn. Yes.
Shawn Jefferson Posted August 19, 2015

It looks like you have to explicitly enable illegal opcodes then, using:

    ca65 --cpu 6502X
Irgendwer Posted August 19, 2015

>     ...
>     lda (ptr1),y
>     and and_masks,x
>     ora c_masks,x
>     sta (ptr1),y
>     ...

Just in case you're plotting very often on empty space (e.g. a star-field), you could also do

        lda (ptr1),y
        bne needs_masking
        ora c_masks,x
        sta (ptr1),y
        rts
    needs_masking:
        and and_masks,x
        ora c_masks,x
        sta (ptr1),y
        rts

Could you provide your benchmark code? I'd like to try something out without making the effort to build a comparable benchmark.
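Whether the empty-byte fast path pays off depends on how often the target byte is zero. A rough estimate using standard 6502 timings (`and abs,x` 4 cycles, branch not taken 2, branch taken 3) is sketched below; it assumes no page crossings, and `p_empty` is a hypothetical parameter for the fraction of plots that land on an all-zero screen byte.

```python
AND_ABS_X = 4      # and and_masks,x (no page crossing)
BNE_NOT_TAKEN = 2  # byte was zero, fall through to the fast path
BNE_TAKEN = 3      # byte was non-zero, branch within the same page


def average_change(p_empty):
    """Average cycles added per plot (negative is faster) vs. always masking."""
    fast = BNE_NOT_TAKEN - AND_ABS_X  # empty byte: pay the branch, skip the AND
    slow = BNE_TAKEN                  # non-empty byte: the branch is pure overhead
    return p_empty * fast + (1 - p_empty) * slow


break_even = BNE_TAKEN / (BNE_TAKEN + AND_ABS_X - BNE_NOT_TAKEN)
print(f"break-even fraction of empty bytes: {break_even:.2f}")
```

Under these assumptions the fast path saves 2 cycles on an empty byte and costs 3 on a non-empty one, so it only wins when more than about 60% of plots hit a byte whose four pixels are all background. Note that the test is on the whole byte, not on the single target pixel.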
danwinslow Posted August 19, 2015

> It looks like you have to explicitly enable illegal opcodes then, using: ca65 --cpu 6502X

Yep, I did. It still didn't take SBX, but it took AXS, which is supposedly a synonym.
danwinslow Posted August 19, 2015

> Just in case you're plotting very often on empty space (e.g. star-field), you could also do
>
>         lda (ptr1),y
>         bne needs_masking
>         ora c_masks,x
>         sta (ptr1),y
>         rts
>     needs_masking:
>         and and_masks,x
>         ora c_masks,x
>         sta (ptr1),y
>         rts
>
> Could you provide your benchmark code? I'd like to try something out without making the effort to have a comparable benchmark.

Hehe, yeah, that's a good idea. Here's the build project. I use CC65 12.13.1. The main code file is graphics.s. The 'benchmark' is done by running in Altirra and examining the 'start' and 'stop' jiffy counters.

Attachment: ships.zip