JamesD Posted September 4, 2015 Share Posted September 4, 2015 (edited) I'm using a lookup table that has groups of data 7 bytes in length and there are 96 groups. Okay, it's a font not that it really matters.I could use another table of pointers but with 96 characters it would be adding 192 bytes just for pointers and I'd like to avoid that.I could group all 1st, all 2nd, all 3rb, etc... bytes together to skip the multiply and hard code loading the base address of each group in the unrolled rendering code but that's pretty ugly to edit. Maybe I'll use that in a later version.So, I'm looking for a faster multiply than my current code. On entry to this code, A contains the character # to print (ASCII - first printable character value, aka ASCII - ' ')All I need to do is multiply that by 7 and add the base address of the font to the resulting number.I'm using shifts to multiply by 2, 4, 8 and then subtracting the character number. I'm using X for the MSB and incrementing it when there is a carry.There has to be a faster way without going to another table. (?) sta <CharDataPtr ; save character offset for subtraction later lsl ; multiply by 2 (leftmost bit should always be clear so no carry) ldx #>font ; MSB of the font address lsl ; now it's * 4 bcc next1 inx ; add the carry to the MSB next1: lsl ; now it's * 8 bcc next2 inx ; add the carry to the MSB next2: sbc <CharDataPtr ; subtract to make it * 7 bcc mext3 dex next3: add #<font ; add the LSB of the Font Address sta <CharDataPtr ; save the final LSB of the character pointer bcc next4 inx ; add the carry to MSB next4: stx >CharDataPtr ; save the MSB of the character pointer Edited September 4, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Xuel Posted September 4, 2015 Share Posted September 4, 2015 (edited) Looks pretty optimal to me. But I think there should be two INX instructions right before next1 instead of just one. I count 41 cycles for the all branches not taken case with that correction. You could waste 95 bytes to align to 8 byte boundaries. Then you could use something like one of these (untested!): getptr asl @ ; 2 asl @ ; 2 ldx #0 ; 2 stx ptr+1 ; 3 rol ptr+1 ; 5 asl @ ; 2 rol ptr+1 ; 5 adc #<font ; 2 sta ptr ; 3 lda ptr+1 ; 3 adc #>font ; 2 sta ptr+1 ; 3 ; cycles = 34 getptr_optimized asl @ ; 2 asl @ ; 2 ldx #>font ; 2 bcc next1 ; 2 inx ; 2 inx ; 2 next1 asl @ ; 2 bcc next2 ; 2 inx ; 2 next2 add #<font ; 5 sta ptr ; 3 bcc next3 ; 2 inx ; 2 next3 stx ptr+1 ; 3 ; max cycles = 33 EDIT: Changed RORs to ROLs. Edited September 4, 2015 by Xuel Quote Link to comment Share on other sites More sharing options...
phaeron Posted September 4, 2015 Share Posted September 4, 2015 The canonical method would be to transpose to struct-of-arrays storage, i.e. 7 arrays of 96 byte elements, and then do 7 indexed accesses. For array-of-structs, x7 is not that bad as it can be done as x8 followed by a subtraction, which is a small extension to the x8 routine. A slightly different way to do it is to calculate the x8 address with an offset and then index using a partial complement of the index. For instance, you could calculate ptr = font-127+index*8 and then index off of that with (ptr),Y starting at Y = index EOR $7F. 1 Quote Link to comment Share on other sites More sharing options...
Xuel Posted September 4, 2015 Share Posted September 4, 2015 If you are concerned about code editability with transposed fonts, you could look into assemblers with code generation features. For example, in XASM and MADS, the code for writing one character to a hardcoded screen address + offset could be concisely written on one line: :7 mva font+#*96,x scr+#*40,y MADS also has #WHILE and .REPT loops which might help. 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted September 4, 2015 Author Share Posted September 4, 2015 (edited) The canonical method would be to transpose to struct-of-arrays storage, i.e. 7 arrays of 96 byte elements, and then do 7 indexed accesses. For array-of-structs, x7 is not that bad as it can be done as x8 followed by a subtraction, which is a small extension to the x8 routine. A slightly different way to do it is to calculate the x8 address with an offset and then index using a partial complement of the index. For instance, you could calculate ptr = font-127+index*8 and then index off of that with (ptr),Y starting at Y = index EOR $7F. I'm already using the 2nd approach. The complement sounds interesting but I'm not sure if it will save me clock cycles. I'll see where I'm at after the first version. As for the first version, I mentioned that in my post. As it turns out, converting the data wasn't as bad as I expected. I copied out the columns and them recorded a macro fixing the first one and just played it back for the other 6. Looks pretty optimal to me. But I think there should be two INX instructions right before next1 instead of just one. I count 41 cycles for the all branches not taken case with that correction. You could waste 95 bytes to align to 8 byte boundaries. Then you could use something like one of these (untested!): <snip> I'm not sure I understand why there needs to be an additional INX and why I don't need to subtract the original value to get from * 8 to * 7. It might be a moot point anyway since I've already built the tables with 1st bytes, 2nd bytes., etc... If I can dump 33 clock cycles by eliminating the multiply it's a significant optimization. If you are concerned about code editability with transposed fonts, you could look into assemblers with code generation features. For example, in XASM and MADS, the code for writing one character to a hardcoded screen address + offset could be concisely written on one line: :7 mva font+#*96,x scr+#*40,y MADS also has #WHILE and .REPT loops which might help. That looks interesting but I'm using a 4 bit wide font so I can have 64 characters per line on what would be a 32 character display or 80 on a 40. Plus I'm using the assembler with CC65 to insure compatibility... for now. My biggest problem right now is not mixing Motorola syntax with 6502 syntax. I've been writing some code for the 6803, 68HC11, and 6809. All I'd need is to throw a new assembler into the mix on top of that. *edit* I just read the comment on the 8 byte boundaries. I see what you were talking about. Still don't understand the extra INX though. Edited September 4, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Xuel Posted September 4, 2015 Share Posted September 4, 2015 The extra INX is needed because the first carry should be multiplied by two. Original Value: 00000000 0LMNOPQRMultiplied by eight: 000000LM NOPQR000The way you have it, instead of L*2 + M, you'll get L+M for the high byte. Quote Link to comment Share on other sites More sharing options...
JamesD Posted September 4, 2015 Author Share Posted September 4, 2015 (edited) The extra INX is needed because the first carry should be multiplied by two. Original Value: 00000000 0LMNOPQR Multiplied by eight: 000000LM NOPQR000 The way you have it, instead of L*2 + M, you'll get L+M for the high byte. Maybe I'm crazy but... 00000000 0LMNPQR ASL 00000000 LMNPQR0 ASL INX 0000000L MNPQR00 If I INX twice I get 000000L0 MNPQR00 *edit* I skipped the letter O and I have no idea why. I'd love to say it was so it didn't look like a zero but I hadn't thought of that. And now I see what you are saying. I actually need multiple INX instructions to bit shift on the next ASL Edited September 4, 2015 by JamesD 1 Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted September 4, 2015 Share Posted September 4, 2015 My 6502 is a bit rusty but for x7 I came up with this :- TAX LDA #>font CPX #64 ADC #0 CPX #32 ADC #0 STA ptr+1 TXA STA temp ASL ASL ASL SEC SBC temp BCS skip DEC ptr+1 skip: sta ptr And if you pad to 8 then this should work :- TAX LDA #>font CPX #64 ADC #0 CPX #32 ADC #0 STA ptr+1 TXA ASL ASL ASL STA ptr 4 Quote Link to comment Share on other sites More sharing options...
Xuel Posted September 4, 2015 Share Posted September 4, 2015 Nice ones GroobyBee. I think you need two CPX #64/ADC #0 sequences for the same reason as for the two INXs. After that fix, I count 42 and 30 cycles for your x7 and x8 respectively: GroovyBee_x7 TAX ;2 LDA #>font ;2 CPX #64 ;2 ADC #0 ;2 CPX #64 ;2 ADC #0 ;2 CPX #32 ;2 ADC #0 ;2 STA ptr+1 ;3 TXA ;2 STA temp ;3 ASL ;2 ASL ;2 ASL ;2 SEC ;2 SBC temp ;3 BCS skip ;2 DEC ptr+1 ;5 skip: sta ptrAnd ;3 ; max cycles = 42 GroovyBee_x8 TAX ;2 LDA #>font ;2 CPX #64 ;2 ADC #0 ;2 CPX #64 ;2 ADC #0 ;2 CPX #32 ;2 ADC #0 ;2 STA ptr+1 ;3 TXA ;2 ASL ;2 ASL ;2 ASL ;2 STA ptr ;3 ; cycles = 30 Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted September 4, 2015 Share Posted September 4, 2015 Nice ones GroobyBee. I think you need two CPX #64/ADC #0 sequences for the same reason as for the two INXs. After that fix, I count 42 and 30 cycles for your x7 and x8 respectively: I don't agree. If you do LDA #95 before entering the code you get the answer $399 instead of $299 for x7. Quote Link to comment Share on other sites More sharing options...
ClausB Posted September 4, 2015 Share Posted September 4, 2015 ACE80 draws 4-bit characters without multiplication. The character set is 8 consecutive arrays of 128 bytes. The ASCII value indexes the first array for the top bit pattern and then 128 is added to the pointer for each successive pattern byte. The code is posted here: http://atariage.com/forums/topic/99666-ace80-xl/page-3?hl=%20dumb%20%20terminal 2 Quote Link to comment Share on other sites More sharing options...
Xuel Posted September 4, 2015 Share Posted September 4, 2015 I don't agree. If you do LDA #95 before entering the code you get the answer $399 instead of $299 for x7. I see! My fix made no sense without some kind of bit masking. Whereas you are taking advantage of the fact that the input value will never be greater than 95 which means that the second CPX/ADC can serve to double the carry from the first CPX/ADC for values between 64 and 95. So your cycle counts are 38 and 26. Well done sir. 1 Quote Link to comment Share on other sites More sharing options...
ClausB Posted September 4, 2015 Share Posted September 4, 2015 The first version of that code, from 1981: http://atariage.com/forums/topic/223411-1981-sio-modem/?p=2954647 Quote Link to comment Share on other sites More sharing options...
JamesD Posted September 4, 2015 Author Share Posted September 4, 2015 ACE80 draws 4-bit characters without multiplication. The character set is 8 consecutive arrays of 128 bytes. The ASCII value indexes the first array for the top bit pattern and then 128 is added to the pointer for each successive pattern byte. The code is posted here: http://atariage.com/forums/topic/99666-ace80-xl/page-3?hl=%20dumb%20%20terminal That's basically what I've been working on after completing the font conversion but the font is a little smaller. Quote Link to comment Share on other sites More sharing options...
ClausB Posted September 5, 2015 Share Posted September 5, 2015 Cool. What's your end application? Quote Link to comment Share on other sites More sharing options...
JamesD Posted September 5, 2015 Author Share Posted September 5, 2015 Cool. What's your end application?At the moment... my own personal gratification. Actually, I have been slowly working towards a cross platform library of routines that will let me write a game once and build it for multiple targets. I'm looking more at strategy and simulation games. Since I plan on releasing the library code, other people could use it as well. Right now I'm targeting systems with the 6847 and possibly the Spectrum since it has the same screen resolution, but I may eventually target machines like the Plus/4 and Atari. I'm not exactly working feverishly on anything though. It's just something to do in my spare time mostly. Quote Link to comment Share on other sites More sharing options...
JamesD Posted September 9, 2015 Author Share Posted September 9, 2015 (edited) FWIW, thanks to using separate tables for each byte I could simplify this to the code below.Keep in mind this hasn't been assembled yet and I've been messing with Motorola assembly so it may not be 100% correct. The Acorn Atom has 32 bytes per row for 64 characters per line.Counting the blank row at the top of the font, it's 8 rows per character. 8x32 = 256. So to point to the correct row, just add the row to the MSB of the screen address.Then set the LSB to the column shifted right by one.Call the left or right nibble print routine based on the rightmost bit of the column.That's simpler than the multiply let alone the rest of the code for the other approach. ; register a contains character sub #' ' ; printable character set data starts at space, ASCII 32 tax ; save as character table offset ; point screen to $8000 + row (Acorn Atom base screen address + row) lda #$80 adc row ; adding row to MSB = 256 * row sta >fscreen ; add the column lda col ; 2 columns / byte lsr sta <fscreen ; save it ; lda col ; lsr bcs rightnibble Edited September 11, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.