pixels per second in Antic $0D

danwinslow · August 16, 2015

I am messing around writing a little graphics bit, and I wanted to check if my single pixel plot seems to be at least fairly fast -

Right now, it's running at 4,320 pixels per second, so it takes 53 jiffies (just under a second) to plot every pixel one by one on an Antic 0D screen.

Does that seem close to par or do I have more optimization to do?

(Note that this is just a single-pixel-at-a-time plot; I know there are significant optimizations available when plotting contiguous groups of pixels or lines.)

and_masks:           ; duplication is to make it conform to color masks
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
c_masks:             ; four entries per color, indexed by plot_c + (x and 3)
.byte %10000000  ; plot_c = 0: entries 0-3 (same bits as the plot_c = 8 row)
.byte %00100000
.byte %00001000
.byte %00000010

.byte %01000000  ; plot_c = 4: entries 4-7
.byte %00010000
.byte %00000100
.byte %00000001

.byte %10000000  ; plot_c = 8: entries 8-11
.byte %00100000
.byte %00001000
.byte %00000010

.byte %11000000  ; plot_c = 12: entries 12-15
.byte %00110000
.byte %00001100
.byte %00000011

offset:     .res 1
color_byte: .res 1
mask_index: .res 1
.importzp plot_x
.importzp plot_y
.importzp plot_c
.code
 
.proc _plot
  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  txa  ; x to a
  lsr  ; div 4
  lsr
  tay  ; y now is offset to byte (ptr1),y = screen byte
 
  ;get and mask
  txa
  and #3
  clc
  adc plot_c  ; pre-shifted color byte ( 0,4,8,12 )
  tax         ; x=color map offset as well as and_masks offset
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts
.endproc
.export _plot

Edited August 16, 2015 by danwinslow
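The routine above reads line_addr_lo and line_addr_hi, which the post does not show. A minimal sketch of how such row-address tables could be generated in ca65, assuming a standard 96-row Antic $0D screen with 40 bytes per row; SCREEN and its address are hypothetical placeholders:

```asm
; Sketch only: SCREEN and its address are hypothetical, not from the thread.
; Assumes 96 rows of 40 bytes each (160 pixels at 2 bits per pixel).
SCREEN = $9000

.rodata
line_addr_lo:
.repeat 96, row
  .byte <(SCREEN + row * 40)    ; low byte of the row's start address
.endrepeat

line_addr_hi:
.repeat 96, row
  .byte >(SCREEN + row * 40)    ; high byte of the row's start address
.endrepeat
```

Splitting the table into low and high halves is what lets the routine fetch both bytes with the same Y index.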

Irgendwer · August 16, 2015

Does that seem close to par or do I have more optimization to do?

What does 'have to' mean? Of course you could apply further optimizations, but depending on memory footprint etc., maybe your version is already fast enough?

E.g. it may not be necessary to build additional tables that replace

;get and mask
txa
and #3
clc
adc plot_c ; pre-shifted color byte ( 0,4,8,12 )
...

by converting color, position and source byte directly to a destination byte, only to squeeze some cycles out of the function. (Just to make a suggestion...)

danwinslow · August 16, 2015

Thanks, Irgendwer -

I don't really have a criterion for how fast it has to be. I am sure it's already more than fast enough for what I want to do, but I was wondering if I had missed any big optimizations. I take it from your response that I haven't missed anything really obvious.

flashjazzcat · August 16, 2015

You can save 3-4 cycles per pixel by using a LUT to divide by 4:

  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  ldy div_by_4,x

Edited August 16, 2015 by flashjazzcat
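The post doesn't show the table itself. A minimal ca65 sketch, assuming plot_x stays within 0..159 (one Antic $0D row); the label matches the div_by_4 name used above:

```asm
; Sketch only: the 160-entry size assumes plot_x is always 0..159.
.rodata
div_by_4:
.repeat 160, i
  .byte i / 4        ; byte offset within the row, 0..39
.endrepeat
```

The txa / lsr / lsr / tay sequence costs 8 cycles, while ldy div_by_4,x costs 4 (5 if the indexed read crosses a page boundary), which is where the 3-4 cycle saving comes from.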

danwinslow · August 16, 2015

!

excellent, thanks, fjc.

*edit*

That saved 2 jiffies over the whole screen, so I'm now at 4500pps.

Edited August 16, 2015 by danwinslow

Rybags · August 16, 2015

You could save more by spending memory on a LUT for the pixel masks that covers every possible plot_x.

Then there's the whole subroutine thing - it costs 12 cycles for the JSR / RTS which can be saved by embedding the routine within the program.

Further savings can be had by optimising the initial register loads: in an iterative situation where the pixel position is adjusted on each pass through the loop, and the plot code is embedded in that loop, you could keep one or two registers preloaded with coordinate data.

Edited August 16, 2015 by Rybags

danwinslow · August 17, 2015

Hmm, another LUT...not sure how that would work. Do you mean the 'get masks index' part? I can see it would save maybe the txa,and #3 part but I'm not sure how to combine the color index in there as well. The color index is 0,4,8,12 and the 'x and #3' provides the rest of the indexing.

The other two things you mention I've thought about...unrolling the JSRs might be an option, and the register loads too, but mostly only for the special case where I'm just filling the whole screen sequentially in a speed check. I don't think normal usage would benefit as much from it. I just wanted a reasonably fast general purpose plot.

.proc _plot
  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  ldy x_div_by_4_lut,x
 
  ;get masks index
  txa
  and #3
  clc
  adc plot_c
  tax         ; x=color map offset AND and_masks offset
 
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts
.endproc

Rybags · August 17, 2015

Just make the tables bigger; skipping the AND #3 then saves some time, though I notice you use a modified colour index, which probably won't be compatible with that method.
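One way to make an x-indexed table work with a colour parameter, sketched here as an assumption rather than as what Rybags had in mind: index the AND mask by plot_x directly and pass the colour as a full byte pattern, then merge with the EOR/AND/EOR idiom. keep_by_x and _plot_lut are hypothetical names, and color_byte is assumed to hold %00000000, %01010101, %10101010 or %11111111 instead of the 0,4,8,12 index.

```asm
; Sketch only, not from the thread. keep_by_x and _plot_lut are hypothetical.
; Assumes color_byte holds the colour repeated in all four pixel slots.
.rodata
keep_by_x:                      ; AND mask for every plot_x, 0..159
.repeat 160, i
  .byte $FF ^ ($C0 >> ((i & 3) * 2))
.endrepeat

.code
.proc _plot_lut
  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1            ; ptr1 now points to start of screen row
  ldy x_div_by_4_lut,x  ; y = byte offset within the row
  lda (ptr1),y
  eor color_byte        ; screen XOR colour
  and keep_by_x,x       ; zero the target pixel's two bits
  eor color_byte        ; other pixels restored, target pixel = colour
  sta (ptr1),y
  rts
.endproc
```

This drops the txa / and #3 / clc / adc / tax index setup and the c_masks table, at the cost of a 160-byte table and a different colour encoding.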

RevEng · August 17, 2015

You can ditch the txa if you change "ldx plot_x" to "lax plot_x" and move it down to just above "ldy x_div_by_4_lut,x"... yeah, I know. Send the opcode police after me.

It looks like the carry shouldn't get set during normal operation, so if this routine is called in a loop you can put the clc prior to the loop. (assuming nothing else in the loop sets the carry.)
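Both suggestions applied to the routine might look like the sketch below. It assumes ca65 in 6502X mode, and that the caller clears the carry once before its loop and nothing else in the loop sets it; _plot_lax is a hypothetical name.

```asm
; Sketch only: relies on the undocumented LAX opcode and on the caller
; having executed clc before the loop.
.setcpu "6502X"

.proc _plot_lax
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1              ; ptr1 now points to start of screen row
  lax plot_x              ; A = X = plot_x in one instruction
  ldy x_div_by_4_lut,x    ; leaves A untouched
  and #3                  ; A still holds plot_x, so no txa is needed
  adc plot_c              ; (0..3) + (0..12) never carries out
  tax
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts
.endproc
```

Because the sum never exceeds 15, the adc leaves the carry clear for the next call.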

xxl · August 17, 2015

txa
and #3
clc
adc plot_c
tax

equivalent to

lda #3
sbx #$100-plot_c

?

Heaven/TQA · August 17, 2015

buuuh XXL.... opcode police calling...

snicklin · August 17, 2015

How many pixels per second are you getting now?

danwinslow · August 17, 2015

Stop it, all of you! THEY...ARE...WATCHING.

But yeah, when, um, let's say unusual opcodes come out, I know it probably means it's already optimized enough for me.

Although I don't understand xxl's suggestion entirely - the 'and #3' is not equivalent to lda #3, as there may be a bit pattern there from the original x offset that would get destroyed?

danwinslow · August 17, 2015

Snicklin - 4500

flashjazzcat · August 17, 2015

...now at 4500pps.

flashjazzcat · August 17, 2015

Beat me to it.

xxl · August 17, 2015

I don't understand xxl's suggestion entirely - the 'and #3' is not equivalent to lda #3, as there may be a bit pattern there from the original x offset that would get destroyed?

try:

ldx #$ff

txa
and #$0f
clc
adc #$01
tax

compare to this:

ldx #$ff

lda #$0f
sbx #$100-$01
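The two sequences can be traced by hand. The annotated sketch below uses the ca65 spelling AXS for the same $CB opcode and assumes the commonly documented behaviour, X = (A AND X) - operand:

```asm
.setcpu "6502X"     ; ca65 needs 6502X mode for undocumented opcodes

; Documented sequence, starting from X = $FF:
  ldx #$ff
  txa               ; A = $FF
  and #$0f          ; A = $0F
  clc
  adc #$01          ; A = $10
  tax               ; X = $10

; Undocumented equivalent:
  ldx #$ff
  lda #$0f
  axs #$100-$01     ; (A AND X) = $0F; $0F - $FF = $10 mod 256, so X = $10
```

Both leave X = $10. They differ in the side effects: the second version leaves A at $0F, and it sets the carry the way a compare would rather than the way adc does.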

danwinslow · August 17, 2015

Hmm -

AXS #i ($CB ii, 2 cycles)
Sets X to {(A AND X) - #value without borrow}, and updates NZC. One might use TXA AXS #-element_size to iterate through an array of structures or other elements larger than a byte, where the 6502 architecture usually prefers a structure of arrays. For example, TXA AXS #$FC could step to the next OAM entry or to the next APU channel, saving one byte and four cycles over four INXs. Also called SBX.

So this includes an AND A as part of the operation? I'll give this a try and see what happens, thanks XXL.

danwinslow · August 17, 2015

Well, the assembler didn't accept SBX, so I used AXS, which is supposedly a synonym. But it didn't work.

Shawn Jefferson · August 18, 2015

Are you using the cc65 suite?

http://www.cc65.org/doc/ca65-4.html

danwinslow · August 18, 2015

Hi Shawn. Yes.

Shawn Jefferson · August 19, 2015

It looks like you have to explicitly enable illegal opcodes then, using:

ca65 --cpu 6502X
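In ca65 the same switch can also be made from inside the source file with .setcpu. One thing to watch with xxl's snippet (a guess, since the thread doesn't say what went wrong): the opcode only has an immediate addressing mode, so #$100-plot_c uses the address of the zero-page symbol plot_c, not the colour stored there. The sketch below therefore uses a hypothetical assemble-time constant, PLOT_COLOR.

```asm
.setcpu "6502X"             ; in-source equivalent of "ca65 --cpu 6502X"

PLOT_COLOR = 4              ; hypothetical constant: pre-shifted colour (0,4,8,12)

  ldx plot_x
  lda #3
  axs #<($100-PLOT_COLOR)   ; X = (plot_x AND 3) + PLOT_COLOR, mod 256
```

A colour that varies at run time would need something else, such as one copy of the routine per colour or patching the operand byte.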

Irgendwer · August 19, 2015

...
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
...

Just in case you're plotting very often on empty space (e.g. star-field), you could also do

  lda (ptr1),y
  bne needs_masking
  ora c_masks,x
  sta (ptr1),y
  rts

needs_masking:
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts

Could you provide your benchmark code? I'd like to try something out without making the effort to have a comparable benchmark.
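A rough cycle comparison for the masking tail, using standard NMOS 6502 timings and ignoring page crossings and ANTIC DMA (both are simplifying assumptions):

```asm
; original tail:  lda (5) + and (4) + ora (4) + sta (6)                 = 19
; empty byte:     lda (5) + bne not taken (2) + ora (4) + sta (6)       = 17
; non-empty byte: lda (5) + bne taken (3) + and (4) + ora (4) + sta (6) = 22
```

On these numbers the shortcut saves 2 cycles on an empty byte and costs 3 on a non-empty one, so it breaks even when about 60% of plots land on empty bytes.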

danwinslow · August 19, 2015

It looks like you have to explicitly enable illegal opcodes then, using:

ca65 --cpu 6502X

Yep, I did. Still didn't take SBX, but took the AXS, which is supposedly a synonym.

danwinslow · August 19, 2015

Just in case you're plotting very often on empty space (e.g. star-field), you could also do

  lda (ptr1),y
  bne needs_masking
  ora c_masks,x
  sta (ptr1),y
  rts

needs_masking:
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts

Could you provide your benchmark code? I'd like to try something out without making the effort to have a comparable benchmark.

Hehe, yeah that's a good idea.

Here's the build project. I use CC65 12.13.1. The main code file is graphics.s. The 'benchmark' is run in Altirra by examining the 'start' and 'stop' jiffy counters.

ships.zip

Edited August 20, 2015 by danwinslow
