
pixels per second in Antic $0D



I am messing around writing a little graphics routine, and I wanted to check whether my single-pixel plot is at least fairly fast.

 

Right now, it's running at 4,320 pixels per second, so it takes 53 jiffies (just under a second) to plot every pixel one by one on an Antic 0D screen.

Does that seem close to par or do I have more optimization to do?

 

(Note that it's just a single-pixel-at-a-time plot. I know there are significant optimizations if I am plotting contiguous groups of pixels or lines.)

and_masks:           ; duplication is to make it conform to color masks
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
.byte %00111111
.byte %11001111
.byte %11110011
.byte %11111100
 
c_masks:
.byte %00000000  ;0 0 1 2 3
.byte %00000000
.byte %00000000
.byte %00000000
     
.byte %01000000  ;1 4 5 6 7
.byte %00010000
.byte %00000100
.byte %00000001
     
.byte %10000000  ;2 8 9 10 11     
.byte %00100000
.byte %00001000
.byte %00000010
     
.byte %11000000  ;3 12 13 14 15
.byte %00110000
.byte %00001100
.byte %00000011

offset:     .res 1
color_byte: .res 1
mask_index: .res 1
.importzp plot_x
.importzp plot_y
.importzp plot_c
.code
 
.proc _plot
  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  txa  ; x to a
  lsr  ; div 4
  lsr
  tay  ; y now is offset to byte (ptr1),y = screen byte
 
  ;get and mask
  txa
  and #3
  clc
  adc plot_c  ; pre-shifted color byte ( 0,4,8,12 )
  tax         ; x=color map offset as well as and_masks offset
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts
.endproc
.export _plot
Edited by danwinslow
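For anyone who wants to check the masking logic off-target, here is a minimal Python sketch of what the routine above does to one screen byte. It assumes 40 bytes per row, 96 rows, and four 2-bit pixels per byte with the leftmost pixel in the top bits, as the AND masks imply. It also assumes color 0 ORs in nothing. `plot`, `read_pixel` and `BYTES_PER_ROW` are illustrative names, not part of the posted project.

```python
BYTES_PER_ROW = 40  # assumption: 160 pixels / 4 pixels per byte

# AND masks clear the 2-bit slot for pixel position 0..3 within a byte.
AND_MASKS = [0xFF ^ (0b11 << (6 - 2 * pos)) for pos in range(4)]

def color_mask(color, pos):
    """Bits to OR in for a 2-bit color (0..3) at position 0..3."""
    return color << (6 - 2 * pos)

def plot(screen, x, y, color):
    """Mirror of the 6502 routine: read, mask, merge, write back."""
    offset = y * BYTES_PER_ROW + (x >> 2)   # row start + x / 4
    pos = x & 3                             # pixel slot inside the byte
    screen[offset] = (screen[offset] & AND_MASKS[pos]) | color_mask(color, pos)

def read_pixel(screen, x, y):
    """Return the 2-bit color stored at (x, y)."""
    byte = screen[y * BYTES_PER_ROW + (x >> 2)]
    return (byte >> (6 - 2 * (x & 3))) & 0b11

screen = bytearray(BYTES_PER_ROW * 96)  # assumption: 96 rows
plot(screen, 5, 2, 3)
plot(screen, 6, 2, 1)
```

The list comprehension reproduces the four AND masks from the post, so the sketch can also be used to sanity-check a hand-typed table.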

Does that seem close to par or do I have more optimization to do?

 

What do you mean by 'have to'? Of course you could apply further optimizations, but depending on memory footprint etc., maybe your version is already fast enough?

 

E.g. it may not be necessary to build additional table(s) that convert

;get and mask
txa
and #3
clc
adc plot_c ; pre-shifted color byte ( 0,4,8,12 )
...

color, position and source byte into a destination byte just to squeeze a few more cycles out of the function. (Just a suggestion... ;) )


You could save more by spending memory on a LUT for the pixel masks that covers every possible plot_x.

 

Then there's the whole subroutine thing - it costs 12 cycles for the JSR / RTS which can be saved by embedding the routine within the program.
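To put the 12 cycles in proportion, here is a rough tally of the originally posted `_plot` using the documented base timings for each addressing mode. It is only an estimate: it assumes no page-crossing penalties and ignores ANTIC DMA and interrupts, which will add to the real figure.

```python
# Documented base cycle counts for the instructions used in _plot.
# Assumptions: no page-crossing penalties, no ANTIC DMA or interrupts.
ROUTINE = [
    ("ldx zp", 3), ("ldy zp", 3),
    ("lda abs,y", 4), ("sta zp", 3),
    ("lda abs,y", 4), ("sta zp", 3),
    ("txa", 2), ("lsr a", 2), ("lsr a", 2), ("tay", 2),
    ("txa", 2), ("and #imm", 2), ("clc", 2), ("adc zp", 3), ("tax", 2),
    ("lda (zp),y", 5), ("and abs,x", 4), ("ora abs,x", 4),
    ("sta (zp),y", 6),
]
JSR, RTS = 6, 6

body = sum(cycles for _, cycles in ROUTINE)   # work done by the plot itself
called = body + JSR + RTS                     # cost when reached via JSR
overhead_pct = round(100 * (JSR + RTS) / called, 1)
print(body, called, overhead_pct)
```

Under those assumptions the call overhead is a noticeable slice of each plot, which is why embedding the routine pays off in a tight loop.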

 

Further savings can be had by optimising the initial register loads: in an iterative situation where the pixel position is adjusted on each pass, and since the plot code is embedded in the loop, you could keep one or two registers preloaded with coordinate data.

Edited by Rybags

Hmm, another LUT...not sure how that would work. Do you mean the 'get masks index' part? I can see it would save maybe the txa,and #3 part but I'm not sure how to combine the color index in there as well. The color index is 0,4,8,12 and the 'x and #3' provides the rest of the indexing.

 

The other two things you mention I've thought about...unrolling the JSRs might be an option, and the register loads too, but mostly only for the special case where I'm just filling the whole screen sequentially in a speed check. I don't think normal usage would benefit as much from it. I just wanted a reasonably fast general purpose plot.

 

 

.proc _plot
  ldx plot_x
  ldy plot_y
  lda line_addr_lo,y
  sta ptr1
  lda line_addr_hi,y
  sta ptr1+1   ; ptr1 now points to start of screen row
  ldy x_div_by_4_lut,x
 
  ;get masks index
  txa
  and #3
  clc
  adc plot_c
  tax         ; x=color map offset AND and_masks offset
 
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
  rts
.endproc
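The `x_div_by_4_lut` table itself is not shown in the thread. One way to produce it is a small build-time script like the sketch below, which prints ca65 `.byte` lines. The 160-pixel width and the `byte_table` helper are assumptions for illustration.

```python
WIDTH = 160  # assumption: pixels per row in this mode

def byte_table(label, values, per_line=16):
    """Format values as a ca65 label followed by .byte lines."""
    lines = [f"{label}:"]
    for i in range(0, len(values), per_line):
        chunk = ", ".join(f"${v:02X}" for v in values[i:i + per_line])
        lines.append(f"    .byte {chunk}")
    return "\n".join(lines)

# Byte offset within the screen row for every possible plot_x.
x_div_by_4 = [x >> 2 for x in range(WIDTH)]

print(byte_table("x_div_by_4_lut", x_div_by_4))
```

By the documented timings, `ldy x_div_by_4_lut,x` takes 4 cycles against 8 for `txa / lsr / lsr / tay`, at the price of 160 bytes of table.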

You can ditch the txa if you change "ldx plot_x" to "lax plot_x" and move it to just above "ldy x_div_by_4_lut,x", so that A still holds plot_x when you reach the "and #3"... yeah, I know. Send the opcode police after me.

 

It looks like the carry shouldn't get set during normal operation, so if this routine is called in a loop you can put the clc prior to the loop. (assuming nothing else in the loop sets the carry.)
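The carry claim can be checked exhaustively. The sketch below assumes plot_c only ever holds the pre-shifted values 0, 4, 8 or 12, as stated in the code comments. It applies to the LUT version of the routine, where `adc` is the only instruction that touches the carry; the earlier version's `lsr` instructions also change it.

```python
# plot_c is pre-shifted to 0, 4, 8 or 12; (x AND 3) is 0..3.
COLOR_OFFSETS = (0, 4, 8, 12)

def adc(a, operand, carry_in):
    """8-bit binary-mode ADC: returns (result, carry_out)."""
    total = a + operand + carry_in
    return total & 0xFF, total >> 8

worst = 0
for x in range(256):
    for plot_c in COLOR_OFFSETS:
        result, carry = adc(x & 3, plot_c, 0)
        assert carry == 0          # the add itself never sets carry
        worst = max(worst, result)
print(worst)  # largest table index produced
```

Since the sum never exceeds 15, the carry stays clear from one call to the next as long as nothing else in the loop sets it.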


Stop it, all of you! THEY...ARE...WATCHING.

 

But yeah when, um let's say unusual opcodes come out, I know it means it's probably optimized enough already for me.

 

Although I don't understand xxl's suggestion entirely - the 'and #3' is not equivalent to lda #3, as there may be a bit pattern there from the original x offset that would get destroyed?


try:

ldx #$ff

txa
and #$0f
clc
adc #$01
tax

compare to this:

ldx #$ff

lda #$0f
sbx #$100-$01
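The two sequences can be compared off-target with a small model. This is a sketch based on the usual description of the undocumented SBX #imm opcode (X = (A AND X) - imm, with no borrow in), not on the posted project. The function names are made up for illustration.

```python
def legal_sequence(x, mask, addend):
    """txa / and #mask / clc / adc #addend / tax"""
    return ((x & mask) + addend) & 0xFF

def sbx(a, x, imm):
    """Model of undocumented SBX #imm: X = (A AND X) - imm, no borrow in."""
    return ((a & x) - imm) & 0xFF

def sbx_sequence(x, mask, addend):
    """lda #mask / sbx #$100-addend"""
    return sbx(mask, x, (0x100 - addend) & 0xFF)

print(hex(legal_sequence(0xFF, 0x0F, 0x01)), hex(sbx_sequence(0xFF, 0x0F, 0x01)))
```

Both leave the same value in X. Two differences are worth keeping in mind: the legal sequence also leaves the result in A, while the SBX version leaves the mask there, and SBX only exists in immediate form, so the value being added has to be known at assembly time.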

Hmm -

AXS #i ($CB ii, 2 cycles)
Sets X to {(A AND X) - #value without borrow}, and updates NZC. One might use TXA AXS #-element_size to iterate through an array of structures or other elements larger than a byte, where the 6502 architecture usually prefers a structure of arrays. For example, TXA AXS #$FC could step to the next OAM entry or to the next APU channel, saving one byte and four cycles over four INXs. Also called SBX.

 

So this includes an AND A as part of the operation? I'll give this a try and see what happens, thanks XXL.


...
  lda (ptr1),y
  and and_masks,x
  ora c_masks,x
  sta (ptr1),y
...

 

Just in case you're plotting very often on empty space (e.g. star-field), you could also do

 

lda (ptr1),y
bne needs_masking
ora c_masks,x
sta (ptr1),y
rts

needs_masking:
and and_masks,x
ora c_masks,x
sta (ptr1),y
rts
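Whether the extra branch pays off depends on how often the target byte is really empty. A rough estimate, using documented timings and assuming the branch does not cross a page:

```python
AND_ABS_X = 4        # cycles saved when the AND is skipped
BNE_NOT_TAKEN = 2    # byte was zero: fall through to the fast path
BNE_TAKEN = 3        # byte was non-zero: branch to needs_masking

def average_gain(p_empty):
    """Expected cycles saved per plot versus the unconditional version."""
    fast = AND_ABS_X - BNE_NOT_TAKEN      # +2 when the byte is empty
    slow = -BNE_TAKEN                     # -3 when it is not
    return p_empty * fast + (1 - p_empty) * slow

# Fraction of empty bytes at which the two versions cost the same.
break_even = BNE_TAKEN / (BNE_TAKEN + AND_ABS_X - BNE_NOT_TAKEN)
print(break_even)

# The shortcut is safe because masking a zero byte changes nothing.
assert all((0 & m) == 0 for m in (0x3F, 0xCF, 0xF3, 0xFC))
```

By this estimate the fast path only wins when clearly more than half of the plots land on an empty byte, which fits the star-field case but probably not a general-purpose plot.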

 

Could you provide your benchmark code? I'd like to try something out without making the effort to have a comparable benchmark.


Hehe, yeah that's a good idea.

 

Here's the build project. I use CC65 12.13.1. The main code file is graphics.s. The 'benchmark' is done by running it in Altirra and examining the 'start' and 'stop' jiffy counters.

 

ships.zip

Edited by danwinslow
