
GR8 Pixel Plotting (assembly)


damosan


See the code below.  I'm calling this from a C program.  It uses graphics mode 8 but only plots a 256x192 area so that both coordinates fit in a byte.  YB, XB, and ARGUMENT are zero-page locations; yindexhi, yindexlo, byteoffset256_table and bitmask_table are lookup tables.  It works pretty well, plotting 49k pixels in about 106 jiffies.

 

I can replace the JSR/RTS with two JMPs (saving about 6 cycles per pixel - a little less if I do a jump indirect back to the caller).

 

I've been staring at this for a while so I might be overlooking something short of inlining this.

 

Thanks.

 

;;;
;;; _plot_pixel_256 should be called the first time we plot on a
;;; new row.  As long as we're plotting on the same row we can
;;; call _plot_pixel_256_fast as the only item that changes is the
;;; column.
;;; 
_plot_pixel_256:
	ldy	yb		      ; load row
	lda	yindexhi,y	      ; get row address
	sta	argument+1	      ; argument = memory to write to
	lda	yindexlo,y
	sta	argument
_plot_pixel_256_fast:		      ; call this if we're writing to same row
	ldx	xb		      ; load column
	ldy	byteoffset256_table,x ; get byte offset (4 - 35)
	lda	(argument),y	      ; load screen byte
	eor	bitmask_table,x	      ; xor it with pixel bitmask
	sta	(argument),y	      ; store it back to screen byte
	rts
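For readers who want to see what the four lookup tables might contain, here is a minimal C sketch of how they could be built. It is not from the thread. The screen base address (`SCREEN_BASE`), the 40-byte row width, the 4-byte left margin (taken from the "4 - 35" comment above) and the most-significant-bit-first pixel order are all assumptions, and `build_plot_tables` and `pixel_byte_address` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the thread: a linear GR.8 screen at
   SCREEN_BASE, 40 bytes per row, and a 256-pixel window that starts
   4 bytes into each row (hence byte offsets 4..35). */
#define SCREEN_BASE   0xA000u   /* hypothetical address */
#define BYTES_PER_ROW 40u
#define LEFT_MARGIN   4u
#define ROWS          192u

uint8_t yindexlo[ROWS];
uint8_t yindexhi[ROWS];
uint8_t byteoffset256_table[256];
uint8_t bitmask_table[256];

/* Fill the row-address tables and the per-column offset/mask tables. */
void build_plot_tables(void)
{
    unsigned y, x;
    for (y = 0; y < ROWS; ++y) {
        unsigned addr = SCREEN_BASE + y * BYTES_PER_ROW;
        yindexlo[y] = (uint8_t)(addr & 0xFFu);
        yindexhi[y] = (uint8_t)(addr >> 8);
    }
    for (x = 0; x < 256; ++x) {
        byteoffset256_table[x] = (uint8_t)(LEFT_MARGIN + (x >> 3));
        bitmask_table[x] = (uint8_t)(0x80u >> (x & 7u)); /* leftmost pixel = bit 7 */
    }
}

/* Address of the screen byte holding pixel (x, y): row base plus the
   byte offset, which is what the (argument),y access resolves to. */
uint16_t pixel_byte_address(uint8_t x, uint8_t y)
{
    unsigned row_base;
    assert(y < ROWS);
    row_base = ((unsigned)yindexhi[y] << 8) | yindexlo[y];
    return (uint16_t)(row_base + byteoffset256_table[x]);
}
```

Splitting the row address into separate low and high tables is what lets the routine fetch each half with a single indexed load.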

 


30 minutes ago, drac030 said:

Having each display line at address $yy00 one could perhaps shorten the first 5 instructions.

Using cc65, I believe I'd have to edit the linker config file to create a number of special page-aligned 256-byte segments, then define the tables to reside in these segments.  Then it should be possible to do a 2-byte vs. 3-byte lookup?


1 hour ago, damosan said:

As an aside I'm pushing 17k pixels per second - though I'm limiting it to 255x192 so I can push bytes.  I modified the config file to put the buffers into page aligned memory.  I'm going to mess around with the ZP loads.

 

Sounds cool - can you share what you're working on yet?


17 minutes ago, Stephen said:

Sounds cool - can you share what you're working on yet?

Nothing in particular - just seeing how fast I can paint a screen one pixel per pass (vs. doing 8 pixels at a time by copying 0xff).  The general-use routine above is slower than this code.  This can probably be made faster by incrementing argument by 40 after the first load.

;;; _speedtest
;;;
;;; at present this will paint a 256x192 Gr.8 screen (X runs 0-255) at
;;; about 780 pixels per jiffy (about 46.8k per second at 60 jiffies/s).
;;;
;;; runtime = 63 jiffies
;;; 
_speedtest:
	lda	#0
	sta	yb
paint_row:
	;;
	;;  get row offset
	;; 
	ldy	yb
	lda	yindexhi,y	      ; get row address
	sta	argument+1	      ; argument = memory to write to
	lda	yindexlo,y
	sta	argument
	;;
	;; init column to 0
	;; 
	ldx	#0
paint_col:	
	;;
	;; plot pixel
	;;
	ldy	byteoffset256_table,x ; (4/5) get byte offset (4 - 35)
	lda	(argument),y	      ; (5/6) load screen byte
	ora	bitmask_table,x	      ; (4/5) OR it with pixel bitmask
	sta	(argument),y	      ; (6) store it back to screen byte
	;; 
	;; increment column
	;;
	inx			; (2) increment X
	bne	paint_col	; (3 taken, 2 on exit) X wraps from $ff to $00 after 256 columns
end_of_cols:
	;; 
	;; increment row
	;;
	lda	yb
	cmp	#191
	beq	end_of_rows
	inc	yb
	jmp	paint_row
end_of_rows:	
	rts
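As a rough sanity check on the timing, the per-instruction cycle counts can be tallied in C. This is only a sketch under stated assumptions: no page-crossing penalties, the taken `bne` counted as 3 cycles, the outer row loop ignored, and an NTSC frame taken as 114 machine cycles per scan line times 262 lines. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Documented 6502 cycle counts, assuming no page crossings. */
enum {
    CYC_LDY_ABS_X = 4,  /* ldy byteoffset256_table,x */
    CYC_LDA_IND_Y = 5,  /* lda (argument),y          */
    CYC_ORA_ABS_X = 4,  /* ora bitmask_table,x       */
    CYC_STA_IND_Y = 6,  /* sta (argument),y          */
    CYC_INX       = 2,  /* inx                       */
    CYC_BNE_TAKEN = 3   /* bne paint_col (taken)     */
};

#define PIXELS                (256UL * 192UL)
#define NTSC_CYCLES_PER_FRAME (114UL * 262UL)  /* assumption: NTSC timing */

/* Cycles spent per pixel in the inner loop. */
unsigned cycles_per_pixel(void)
{
    return CYC_LDY_ABS_X + CYC_LDA_IND_Y + CYC_ORA_ABS_X +
           CYC_STA_IND_Y + CYC_INX + CYC_BNE_TAKEN;
}

/* Total inner-loop cycles for the whole 256x192 fill. */
unsigned long inner_loop_cycles(void)
{
    return PIXELS * (unsigned long)cycles_per_pixel();
}

/* Frames needed if the CPU had every cycle to itself (rounded up). */
unsigned long ideal_frames(void)
{
    unsigned long c = inner_loop_cycles();
    return (c + NTSC_CYCLES_PER_FRAME - 1UL) / NTSC_CYCLES_PER_FRAME;
}
```

Under these assumptions the inner loop costs 24 cycles per pixel, or about 40 NTSC frames for the full fill. The measured 63 jiffies is higher, which would be consistent with display DMA, the row setup and interrupts taking the remainder, though the thread does not break that down.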

 


57 minutes ago, damosan said:

Nothing in particular - just seeing how fast I can paint a screen one pixel per pass (vs. doing 8 pixels at a time by copying 0xff).  The general-use routine above is slower than this code.  This can probably be made faster by incrementing argument by 40 after the first load.

It's difficult to optimize something that is so generic, it all depends on your use case...

 

You could have 256 plot routines that use immediate (#imm) operands instead of those 2 tables; the problem would be selecting between those routines quickly enough.


1 hour ago, damosan said:

Nothing in particular - just seeing how fast I can paint a screen one pixel per pass (vs. doing 8 pixels at a time by copying 0xff).  The general-use routine above is slower than this code.  This can probably be made faster by incrementing argument by 40 after the first load.



5 fewer cycles per loop if you use the Y register to carry the row value over from the end of the loop

--

and 3 fewer in the init

--

and 4 more cycles saved per loop if the plot code itself sits in ZP


2 hours ago, xxl said:

5 fewer cycles per loop if you use the Y register to carry the row value over from the end of the loop

--

and 3 fewer in the init

--

and 4 more cycles saved per loop if the plot code itself sits in ZP

yb is a zp byte.  The lookup tables are page aligned.

 

How would you rewrite the above to use X and Y based on Y being required the way it is?


1 hour ago, damosan said:

yb is a zp byte.  The lookup tables are page aligned.

 

How would you rewrite the above to use X and Y based on Y being required the way it is?

ldy yb becomes:

	ldy	#
yb equ *-1

1 cycle less

	sta	argument
	sta	argument+1

from ABS (4 cycles) becomes ZP (3 cycles)

2 less

	lda	(argument),y	; 5 cycles

becomes

	lda	$ffff,y		; 4 cycles
argument equ *-2

1 less

 

	lda	#0
	sta	yb
paint_row:
	;;
	;;  get row offset
	;; 
	ldy	yb

becomes

	ldy	#0
paint_row
	sty	yb

3 cycles less

 

	lda	yb
	cmp	#191
	beq	end_of_rows
	inc	yb
	jmp	paint_row

becomes

	ldy	#
yb equ *-1
        iny 
	cpy	#192
	bcc	paint_row

 6 cycles less?
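One thing worth checking with a rewrite like this is that the new exit test still paints exactly 192 rows. The C sketch below models only the loop control of the two versions (not the plotting) and counts iterations. The function names are hypothetical, and the mapping of C statements to instructions in the comments is my reading of the listings.

```c
#include <assert.h>
#include <stdint.h>

/* Original control flow:
   lda yb / cmp #191 / beq end_of_rows / inc yb / jmp paint_row */
unsigned rows_original(void)
{
    uint8_t yb = 0;             /* lda #0 / sta yb */
    unsigned painted = 0;
    for (;;) {
        ++painted;              /* paint_row body runs once per pass */
        if (yb == 191) {        /* cmp #191 / beq end_of_rows */
            break;
        }
        ++yb;                   /* inc yb, then jmp paint_row */
    }
    return painted;
}

/* Suggested control flow:
   ldy #(yb) / iny / cpy #192 / bcc paint_row */
unsigned rows_rewritten(void)
{
    uint8_t y = 0;              /* ldy #0 before paint_row */
    unsigned painted = 0;
    do {
        ++painted;              /* paint_row body runs once per pass */
        ++y;                    /* iny */
    } while (y < 192);          /* cpy #192 / bcc: loop while Y < 192, unsigned */
    return painted;
}
```

Both versions run the row body 192 times, so the shorter tail changes only the cycle count and not the number of rows painted.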

 


12 hours ago, Rybags said:

A big saving could be had by just embedding it in the program rather than having it as a sub... though seeing it's used with C that mightn't be possible.

 

Also, the position variables - if one or both could be used directly instead of copying.

It's possible though it's kind of a PITA to embed assembly directly into C code - it's very easy, of course, to create separate assembly routines and let the linker figure it out. 


36 minutes ago, Estece said:

Using all the above suggestions and MADS, I got 42 PAL or 53 NTSC frames for a full fill of 256x192 pixels.

fastantF.xex 1.24 kB · 2 downloads

 

hmmmmmm

 

0084: 84 8D     STY $8D

0086: 84 93     STY $93    - DELETE
0088: BC 00 07  LDY $0700,X
008B: B9 00 CA  LDA $xx00,Y

here equ *-2
008E: 1D 00 06  ORA $0600,X
0091: 99 00 CA  STA $xx00,Y   - REPLACE: STA (here),Y

 

Check how much slower that is (for a single point it can be faster).


I got it down to 58 jiffies with this.

paint_row:
	;;
	;;  get row offset
	;;
	lda	yindexhi,y	      ; get row address
	sta	ld+2		      ; write high byte to LD/WR so we...
	sta	wr+2		      ; ...can use an absolute version.
	lda	yindexlo,y
	sta	ld+1		      ; ditto...
	sta	wr+1
	;;
	;; init column to 0
	;; 
	ldx	#0
paint_col:	
	;;
	;; plot pixel
	;;
	ldy	byteoffset256_table,x ; (4) get byte offset (4 - 35)
ld:	lda	$ffff,y		      ; (4)
	ora	bitmask_table,x	      ; (4) OR it with pixel bitmask
wr:	sta	$ffff,y	       	      ; (5)
	;; 
	;; increment column
	;;
	inx			; (2) increment X
	bne	paint_col	; (3 taken, 2 on exit) X wraps from $ff to $00 after 256 columns
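When the inner loop is being reshuffled like this, a host-side model helps confirm the fill still touches the right bytes. The C sketch below is an assumption-laden stand-in, not the author's code: it uses a plain 40-byte-per-row buffer, a 4-byte left margin and bit 7 as the leftmost pixel, and it computes the offset and mask inline instead of reading the tables. `paint_all` and `fill_is_correct` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BYTES_PER_ROW 40u
#define ROWS          192u
#define LEFT_MARGIN   4u    /* assumption: window starts at byte 4 */

static uint8_t screen[ROWS * BYTES_PER_ROW];

/* Mirrors paint_row/paint_col: for each row, OR every column's bit
   into the byte that holds it. */
void paint_all(void)
{
    unsigned y, x;
    memset(screen, 0, sizeof screen);
    for (y = 0; y < ROWS; ++y) {
        /* Row base: what gets patched into the ld/wr operands. */
        uint8_t *row = &screen[y * BYTES_PER_ROW];
        for (x = 0; x < 256; ++x) {
            uint8_t offset = (uint8_t)(LEFT_MARGIN + (x >> 3));
            uint8_t mask = (uint8_t)(0x80u >> (x & 7u));
            row[offset] |= mask;
        }
    }
}

/* Returns 1 if bytes 4..35 of every row are $FF and the margins are 0. */
int fill_is_correct(void)
{
    unsigned y, b;
    for (y = 0; y < ROWS; ++y) {
        for (b = 0; b < BYTES_PER_ROW; ++b) {
            uint8_t expected =
                (b >= LEFT_MARGIN && b < LEFT_MARGIN + 32u) ? 0xFFu : 0x00u;
            if (screen[y * BYTES_PER_ROW + b] != expected) {
                return 0;
            }
        }
    }
    return 1;
}
```

Counting the taken branch as 3 cycles, the absolute-indexed version above comes to 4 + 4 + 4 + 5 + 2 + 3 = 22 cycles per pixel, against 24 for the (argument),y version.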

 

