Fast multiply by 7 to calculate an address?

JamesD · September 4, 2015

I'm using a lookup table that has groups of data 7 bytes in length and there are 96 groups.
Okay, it's a font not that it really matters.
I could use another table of pointers but with 96 characters it would be adding 192 bytes just for pointers and I'd like to avoid that.
I could group all 1st, all 2nd, all 3rb, etc... bytes together to skip the multiply and hard code loading the base address of each group in the unrolled rendering code but that's pretty ugly to edit. Maybe I'll use that in a later version.

So, I'm looking for a faster multiply than my current code.
On entry to this code, A contains the character # to print (ASCII - first printable character value, aka ASCII - ' ')
All I need to do is multiply that by 7 and add the base address of the font to the resulting number.
I'm using shifts to multiply by 2, 4, 8 and then subtracting the character number. I'm using X for the MSB and incrementing it when there is a carry.
There has to be a faster way without going to another table. (?)

	sta		<CharDataPtr		; save character offset for subtraction later
	lsl					; multiply by 2 (leftmost bit should always be clear so no carry)

	ldx		#>font			; MSB of the font address
	lsl					; now it's * 4
	bcc		next1
	inx					; add the carry to the MSB
next1:
	lsl					; now it's * 8
	bcc		next2
	inx					; add the carry to the MSB
next2:
	sbc		<CharDataPtr		; subtract to make it * 7
	bcc		mext3
	dex							
next3:
	add		#<font			; add the LSB of the Font Address
	sta		<CharDataPtr		; save the final LSB of the character pointer
	bcc		next4
	inx					; add the carry to MSB
next4:
	stx		>CharDataPtr		; save the MSB of the character pointer

Edited September 4, 2015 by JamesD

Xuel · September 4, 2015

Looks pretty optimal to me. But I think there should be two INX instructions right before next1 instead of just one. I count 41 cycles for the all branches not taken case with that correction.

You could waste 95 bytes to align to 8 byte boundaries. Then you could use something like one of these (untested!):

getptr
    asl @       ; 2
    asl @       ; 2
    ldx #0      ; 2
    stx ptr+1   ; 3
    rol ptr+1   ; 5
    asl @       ; 2
    rol ptr+1   ; 5
    adc #<font  ; 2
    sta ptr     ; 3
    lda ptr+1   ; 3
    adc #>font  ; 2
    sta ptr+1   ; 3
    ; cycles = 34
 
getptr_optimized
    asl @       ; 2
    asl @       ; 2
    ldx #>font  ; 2
    bcc next1   ; 2
    inx         ; 2
    inx         ; 2
next1
    asl @       ; 2
    bcc next2   ; 2
    inx         ; 2
next2
    add #<font  ; 5
    sta ptr     ; 3
    bcc next3   ; 2
    inx         ; 2
next3
    stx ptr+1   ; 3
    ; max cycles = 33

EDIT: Changed RORs to ROLs.

Edited September 4, 2015 by Xuel

phaeron · September 4, 2015

The canonical method would be to transpose to struct-of-arrays storage, i.e. 7 arrays of 96 byte elements, and then do 7 indexed accesses.

For array-of-structs, x7 is not that bad as it can be done as x8 followed by a subtraction, which is a small extension to the x8 routine. A slightly different way to do it is to calculate the x8 address with an offset and then index using a partial complement of the index. For instance, you could calculate ptr = font-127+index*8 and then index off of that with (ptr),Y starting at Y = index EOR $7F.

Xuel · September 4, 2015

If you are concerned about code editability with transposed fonts, you could look into assemblers with code generation features. For example, in XASM and MADS, the code for writing one character to a hardcoded screen address + offset could be concisely written on one line:

    :7 mva font+#*96,x scr+#*40,y

MADS also has #WHILE and .REPT loops which might help.

JamesD · September 4, 2015

The canonical method would be to transpose to struct-of-arrays storage, i.e. 7 arrays of 96 byte elements, and then do 7 indexed accesses.

For array-of-structs, x7 is not that bad as it can be done as x8 followed by a subtraction, which is a small extension to the x8 routine. A slightly different way to do it is to calculate the x8 address with an offset and then index using a partial complement of the index. For instance, you could calculate ptr = font-127+index*8 and then index off of that with (ptr),Y starting at Y = index EOR $7F.

I'm already using the 2nd approach.

The complement sounds interesting but I'm not sure if it will save me clock cycles. I'll see where I'm at after the first version.

As for the first version, I mentioned that in my post. As it turns out, converting the data wasn't as bad as I expected. I copied out the columns and them recorded a macro fixing the first one and just played it back for the other 6.

Looks pretty optimal to me. But I think there should be two INX instructions right before next1 instead of just one. I count 41 cycles for the all branches not taken case with that correction.

You could waste 95 bytes to align to 8 byte boundaries. Then you could use something like one of these (untested!):

<snip>

I'm not sure I understand why there needs to be an additional INX and why I don't need to subtract the original value to get from * 8 to * 7.

It might be a moot point anyway since I've already built the tables with 1st bytes, 2nd bytes., etc...

If I can dump 33 clock cycles by eliminating the multiply it's a significant optimization.

If you are concerned about code editability with transposed fonts, you could look into assemblers with code generation features. For example, in XASM and MADS, the code for writing one character to a hardcoded screen address + offset could be concisely written on one line:
    :7 mva font+#*96,x scr+#*40,y
MADS also has #WHILE and .REPT loops which might help.

That looks interesting but I'm using a 4 bit wide font so I can have 64 characters per line on what would be a 32 character display or 80 on a 40.

Plus I'm using the assembler with CC65 to insure compatibility... for now.

My biggest problem right now is not mixing Motorola syntax with 6502 syntax.

I've been writing some code for the 6803, 68HC11, and 6809.

All I'd need is to throw a new assembler into the mix on top of that.

*edit*

I just read the comment on the 8 byte boundaries. I see what you were talking about.

Still don't understand the extra INX though.

Edited September 4, 2015 by JamesD

Xuel · September 4, 2015

The extra INX is needed because the first carry should be multiplied by two.

Original Value: 00000000 0LMNOPQR
Multiplied by eight: 000000LM NOPQR000

The way you have it, instead of L*2 + M, you'll get L+M for the high byte.

JamesD · September 4, 2015

The extra INX is needed because the first carry should be multiplied by two.

Original Value: 00000000 0LMNOPQR

Multiplied by eight: 000000LM NOPQR000

The way you have it, instead of L*2 + M, you'll get L+M for the high byte.

Maybe I'm crazy but...

00000000 0LMNPQR

ASL

00000000 LMNPQR0

ASL

INX

0000000L MNPQR00

If I INX twice I get

000000L0 MNPQR00

*edit*

I skipped the letter O and I have no idea why. I'd love to say it was so it didn't look like a zero but I hadn't thought of that.

And now I see what you are saying. I actually need multiple INX instructions to bit shift on the next ASL

Edited September 4, 2015 by JamesD

GroovyBee · September 4, 2015

My 6502 is a bit rusty but for x7 I came up with this :-

	TAX
	LDA #>font
	CPX #64
	ADC #0
	CPX #32
	ADC #0
	STA ptr+1
	TXA
	STA temp
	ASL
	ASL
	ASL
	SEC
	SBC temp
	BCS skip
	DEC ptr+1
skip:
	sta ptr

And if you pad to 8 then this should work :-

	TAX
	LDA #>font
	CPX #64
	ADC #0
	CPX #32
	ADC #0
	STA ptr+1
	TXA
	ASL
	ASL
	ASL
	STA ptr

Xuel · September 4, 2015

Nice ones GroobyBee. I think you need two CPX #64/ADC #0 sequences for the same reason as for the two INXs. After that fix, I count 42 and 30 cycles for your x7 and x8 respectively:

GroovyBee_x7
     TAX        ;2
     LDA #>font ;2
     CPX #64    ;2
     ADC #0     ;2
     CPX #64    ;2
     ADC #0     ;2
     CPX #32    ;2
     ADC #0     ;2
     STA ptr+1  ;3
     TXA        ;2
     STA temp   ;3
     ASL        ;2
     ASL        ;2
     ASL        ;2
     SEC        ;2
     SBC temp   ;3
     BCS skip   ;2
     DEC ptr+1  ;5
skip:
     sta ptrAnd ;3
     ; max cycles = 42

GroovyBee_x8
     TAX        ;2
     LDA #>font ;2
     CPX #64    ;2
     ADC #0     ;2
     CPX #64    ;2
     ADC #0     ;2
     CPX #32    ;2
     ADC #0     ;2
     STA ptr+1  ;3
     TXA        ;2
     ASL        ;2
     ASL        ;2
     ASL        ;2
     STA ptr    ;3
     ; cycles = 30

GroovyBee · September 4, 2015

Nice ones GroobyBee. I think you need two CPX #64/ADC #0 sequences for the same reason as for the two INXs. After that fix, I count 42 and 30 cycles for your x7 and x8 respectively:

I don't agree. If you do LDA #95 before entering the code you get the answer $399 instead of $299 for x7.

ClausB · September 4, 2015

ACE80 draws 4-bit characters without multiplication. The character set is 8 consecutive arrays of 128 bytes. The ASCII value indexes the first array for the top bit pattern and then 128 is added to the pointer for each successive pattern byte.

The code is posted here:

http://atariage.com/forums/topic/99666-ace80-xl/page-3?hl=%20dumb%20%20terminal

Xuel · September 4, 2015

I don't agree. If you do LDA #95 before entering the code you get the answer $399 instead of $299 for x7.

I see! My fix made no sense without some kind of bit masking. Whereas you are taking advantage of the fact that the input value will never be greater than 95 which means that the second CPX/ADC can serve to double the carry from the first CPX/ADC for values between 64 and 95. So your cycle counts are 38 and 26. Well done sir.

ClausB · September 4, 2015

The first version of that code, from 1981:

http://atariage.com/forums/topic/223411-1981-sio-modem/?p=2954647

JamesD · September 4, 2015

ACE80 draws 4-bit characters without multiplication. The character set is 8 consecutive arrays of 128 bytes. The ASCII value indexes the first array for the top bit pattern and then 128 is added to the pointer for each successive pattern byte.

The code is posted here:

http://atariage.com/forums/topic/99666-ace80-xl/page-3?hl=%20dumb%20%20terminal

That's basically what I've been working on after completing the font conversion but the font is a little smaller.

ClausB · September 5, 2015

Cool. What's your end application?

JamesD · September 5, 2015

Cool. What's your end application?

At the moment... my own personal gratification.

Actually, I have been slowly working towards a cross platform library of routines that will let me write a game once and build it for multiple targets.

I'm looking more at strategy and simulation games. Since I plan on releasing the library code, other people could use it as well.

Right now I'm targeting systems with the 6847 and possibly the Spectrum since it has the same screen resolution, but I may eventually target machines like the Plus/4 and Atari. I'm not exactly working feverishly on anything though. It's just something to do in my spare time mostly.

JamesD · September 9, 2015

FWIW, thanks to using separate tables for each byte I could simplify this to the code below.
Keep in mind this hasn't been assembled yet and I've been messing with Motorola assembly so it may not be 100% correct.

The Acorn Atom has 32 bytes per row for 64 characters per line.
Counting the blank row at the top of the font, it's 8 rows per character.
8x32 = 256. So to point to the correct row, just add the row to the MSB of the screen address.
Then set the LSB to the column shifted right by one.
Call the left or right nibble print routine based on the rightmost bit of the column.

That's simpler than the multiply let alone the rest of the code for the other approach.

	; register a contains character
	sub		#' '				; printable character set data starts at space, ASCII 32
	tax						; save as character table offset

	; point screen to $8000 + row (Acorn Atom base screen address + row)
	lda		#$80
	adc		row				; adding row to MSB = 256 * row
	sta		>fscreen
	
	; add the column
	lda		col				; 2 columns / byte
	lsr
	sta		<fscreen			; save it

;	lda		col
;	lsr
	bcs		rightnibble

Edited September 11, 2015 by JamesD

Fast multiply by 7 to calculate an address?

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members