GroovyBee Posted September 12, 2010 Share Posted September 12, 2010 Since the CP1600 has no decimal mode or even any instructions to do BCD arithmetic its difficult to get any numeric information out of the machine without taking up a ton of precious cycles. The following code is taken from the score display routine in my latest game. It doesn't do anything fancy so there is no leading zero suppression. It will produce a 6 digit display with the least significant 16 bits in r0, the most significant 16 bits in r1. Register r4 is the destination pointer for the buffer and register r2 is the STIC colour information (which is XOR'd in). ;------------------------------------------------------------------------------ ; Display6DigitsInBCD ;------------------------------------------------------------------------------ ; Display 6 digits in BCD format. ;------------------------------------------------------------------------------ ; In :- ; RO - Least Significant Word. ; R1 - Most Significant Word. ; R2 - STIC colour mask (XOR'd in). ; R4 - Destination pointer. ; ; Out :- ; Nothing. ; ; Trashes :- ; r0, r1, r3, r4, r5 ; ;------------------------------------------------------------------------------ Display6DigitsInBCD: PROC pshr r5 ; Save return address. mvii #1 SHL 3, r5 ; Most used constant in a register for speed. ; Most significant digit. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop1: addr r5, r3 addi #-100000 AND $FFFF, r0 ; Get around the fact that adcr r1 ; there is no SBCR instruction. addi #(-100000 SHR 16) AND $FFFF, r1 ; on the CP1600. bc BCDLoop1 mvo@ r3, r4 ; Into the output buffer. mvii #(10+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop2: subr r5, r3 addi #10000, r0 adcr r1 bnc BCDLoop2 mvo@ r3, r4 ; Into the output buffer. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop3: addr r5, r3 subi #1000, r0 bc BCDLoop3 mvo@ r3, r4 ; Into the output buffer. mvii #(10+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop4: subr r5, r3 addi #100, r0 bnc BCDLoop4 mvo@ r3, r4 ; Into the output buffer. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop5: addr r5, r3 subi #10, r0 bc BCDLoop5 mvo@ r3, r4 ; Into the output buffer. ; Least significant digit. addi #10+'0'-' ', r0 ; Prep GROM character. sll r0, 2 addr r0, r0 ; x3 xorr r2, r0 ; Mix in the STIC colours. mvo@ r0, r4 ; Into the output buffer. ; All done! pulr pc ENDP Call the function as follows :- mvii #976541 AND $FFFF, r0 mvii #(976541 SHR 16) AND $FFFF, r1 mvii #C_RED, r2 mvii #MEMMAP.backtab, r4 call Display6DigitsInBCD You'll need the latest version of jzintv (not beta 3) to assemble it correctly. Quote Link to comment Share on other sites More sharing options...
intvnut Posted September 12, 2010 Share Posted September 12, 2010 Since the CP1600 has no decimal mode or even any instructions to do BCD arithmetic its difficult to get any numeric information out of the machine without taking up a ton of precious cycles. The following code is taken from the score display routine in my latest game. It doesn't do anything fancy so there is no leading zero suppression. It will produce a 6 digit display with the least significant 16 bits in r0, the most significant 16 bits in r1. Register r4 is the destination pointer for the buffer and register r2 is the STIC colour information (which is XOR'd in). Nice! I like the alternation between ADDI and SUBI. It saves having to "fix up" an over-subtract. Once you get to BCDloop3, you could store your power-of-10 in R1, which will cost you 8 cycles to generate, but will save you 2 cycles per loop iteration. Since you could loop up to 10 times and you always loop at least once, that'll save you cycles both in the worst case and on average. I actually went even more aggressively for some code I wrote recently for a Space Patrol update. Instead of looping, I went with a decision-tree approach, since it's guaranteed to never need more than ceil(log2(10)) steps for any digit. I didn't want to crash on all 9s or on 909090. :-) I also cheated another way: I display a 7 digit score, but I store it as a 16 bit number. I do this by making all the scores a multiple of 50 points. Ignore the very last digit (which remains fixed at 0). The rightmost changing digit is always either 0 or 5 based on the LSB of the score. The remaining 15 bits give me 0 - 32767 for the upper 5 digits of the score. The resulting decision-tree based code looks like this: MACRO digi(x) ($87 + (%x%) * ENDM ;; Displays a score from 000000 to 327675 MVI SCORE, R0 MVII #SCORE.DIG, R4 CLRC MVII #digi(5), R1 RRC R0 BC @@50 MVII #digi(0), R1 @@50 MVO R1, SCORE.DIG + 5 ;; 3xxxx, 2xxxx, 1xxxx, or 0xxxx? SUBI #20000, R0 BNC @@1xxxx_or_0xxxx @@3xxxx_or_2xxxx ;; 2xxxx or 3xxxx? Try subtracting more MVII #digi(3), R1 ; assume it's 3xxxx SUBI #10000, R0 BPL @@3xxxx ; assumed correctly ;; 2xxxx: We subtracted too much MVII #digi(2), R1 ; ADDI #10000, R0 B @@xxxx @@1xxxx_or_0xxxx: ;; 1xxxx or 0xxxx: We subtracted too much MVII #digi(1), R1 ; assume it's 1xxxx ADDI #10000, R0 BPL @@1xxxx ; assumed correctly! ;; 0xxxx: We still subtracted too much MVII #digi(0), R1 ADDI #10000, R0 ; add it back @@3xxxx: @@1xxxx: @@xxxx: MVO@ R1, R4 ; store out first digit ;; [9-5]xxx or [4-0]xxx? SUBI #5000, R0 BMI @@4xxx_or_less ;; [9-7]xxx or [6-5]xxx SUBI #2000, R0 BMI @@6xxx_or_5xxx ;; [9-8]xxx or 7xxx SUBI #1000, R0 BMI @@7xxx ;; 9xxx or 8xxx MVII #digi(9), R1 ; assume 9 SUBI #1000, R0 BPL @@9xxx ;; 8xxx MVII #digi(, R1 ; it's 8 ADDI #1000, R0 B @@xxx @@7xxx: MVII #digi(7), R1 ; it's 7 ADDI #1000, R0 B @@xxx @@6xxx_or_5xxx: ;; 6xxx or 5xxx MVII #digi(6), R1 ; assume 6 ADDI #1000, R0 BPL @@xxx ;; 5xxx MVII #digi(5), R1 ; it's 5 ADDI #1000, R0 B @@xxx @@4xxx_or_less ;; [4-2]xxx or [1-0]xxx SUBI #2000, R0 BMI @@1xxx_or_0xxx ;; [4-3]xxx or 2xxx SUBI #1000, R0 BMI @@2xxx ;; 4xxx or 3xxx MVII #digi(4), R1 ; assume 4 SUBI #1000, R0 BPL @@4xxx ;; 3xxx MVII #digi(3), R1 ; it's 3 ADDI #1000, R0 B @@xxx @@2xxx: MVII #digi(2), R1 ; it's 2 ADDI #1000, R0 B @@xxx @@1xxx_or_0xxx: ;; 1xxx or 0xxx MVII #digi(1), R1 ; assume 1 ADDI #1000, R0 BPL @@xxx ;; 0xxx MVII #digi(0), R1 ; it's 0 ADDI #1000, R0 B @@xxx @@9xxx @@4xxx @@xxx MVO@ R1, R4 ; store out second digit ;; [9-5]xx or [4-0]xx? SUBI #500, R0 BMI @@4xx_or_less ;; [9-7]xx or [6-5]xx SUBI #200, R0 BMI @@6xx_or_5xx ;; [9-8]xx or 7xx SUBI #100, R0 BMI @@7xx ;; 9xx or 8xx MVII #digi(9), R1 ; assume 9 SUBI #100, R0 BPL @@9xx ;; 8xx MVII #digi(, R1 ; it's 8 ADDI #100, R0 B @@xx @@7xx: MVII #digi(7), R1 ; it's 7 ADDI #100, R0 B @@xx @@6xx_or_5xx: ;; 6xx or 5xx MVII #digi(6), R1 ; assume 6 ADDI #100, R0 BPL @@xx ;; 5xx MVII #digi(5), R1 ; it's 5 ADDI #100, R0 B @@xx @@4xx_or_less ;; [4-2]xx or [1-0]xx SUBI #200, R0 BMI @@1xx_or_0xx ;; [4-3]xx or 2xx SUBI #100, R0 BMI @@2xx ;; 4xx or 3xx MVII #digi(4), R1 ; assume 4 SUBI #100, R0 BPL @@4xx ;; 3xx MVII #digi(3), R1 ; it's 3 ADDI #100, R0 B @@xx @@2xx: MVII #digi(2), R1 ; it's 2 ADDI #100, R0 B @@xx @@1xx_or_0xx: ;; 1xx or 0xx MVII #digi(1), R1 ; assume 1 ADDI #100, R0 BPL @@xx ;; 0xx MVII #digi(0), R1 ; it's 0 ADDI #100, R0 B @@xx @@9xx @@4xx @@xx MVO@ R1, R4 ; store out third digit ;; [9-5]x or [4-0]x? SUBI #50, R0 BMI @@4x_or_less ;; [9-7]x or [6-5]x SUBI #20, R0 BMI @@6x_or_5x ;; [9-8]x or 7x SUBI #10, R0 BMI @@7x ;; 9x or 8x MVII #digi(9), R1 ; assume 9 SUBI #10, R0 BPL @@9x ;; 8x MVII #digi(, R1 ; it's 8 ADDI #10, R0 B @@x @@7x: MVII #digi(7), R1 ; it's 7 ADDI #10, R0 B @@x @@6x_or_5x: ;; 6x or 5x MVII #digi(6), R1 ; assume 6 ADDI #10, R0 BPL @@x ;; 5x MVII #digi(5), R1 ; it's 5 ADDI #10, R0 B @@x @@4x_or_less ;; [4-2]x or [1-0]x SUBI #20, R0 BMI @@1x_or_0x ;; [4-3]x or 2x SUBI #10, R0 BMI @@2x ;; 4x or 3x MVII #digi(4), R1 ; assume 4 SUBI #10, R0 BPL @@4x ;; 3x MVII #digi(3), R1 ; it's 3 ADDI #10, R0 B @@x @@2x: MVII #digi(2), R1 ; it's 2 ADDI #10, R0 B @@x @@1x_or_0x: ;; 1x or 0x MVII #digi(1), R1 ; assume 1 ADDI #10, R0 BPL @@x ;; 0x MVII #digi(0), R1 ; it's 0 ADDI #10, R0 B @@x @@9x @@4x @@x MVO@ R1, R4 ; store out fourth digit ;; [9-5] or [4-0]? SUBI #5, R0 BMI @@4_or_less ;; [9-7] or [6-5] SUBI #2, R0 BMI @@6_or_5 ;; [9-8] or 7 DECR R0 BMI @@7 ;; 9 or 8 MVII #digi(9), R1 ; assume 9 DECR R0 BPL @@9 ;; 8 MVII #digi(, R1 ; it's 8 INCR R0 B @@last @@7: MVII #digi(7), R1 ; it's 7 INCR R0 B @@last @@6_or_5: ;; 6 or 5 MVII #digi(6), R1 ; assume 6 INCR R0 BPL @@last ;; 5 MVII #digi(5), R1 ; it's 5 INCR R0 B @@last @@4_or_less ;; [4-2] or [1-0] SUBI #2, R0 BMI @@1_or_0 ;; [4-3] or 2 DECR R0 BMI @@2 ;; 4 or 3 MVII #digi(4), R1 ; assume 4 DECR R0 BPL @@4 ;; 3 MVII #digi(3), R1 ; it's 3 INCR R0 B @@last @@2: MVII #digi(2), R1 ; it's 2 INCR R0 B @@last @@1_or_0: ;; 1 or 0 MVII #digi(1), R1 ; assume 1 INCR R0 BPL @@last ;; 0 MVII #digi(0), R1 ; it's 0 INCR R0 B @@last @@9 @@4 @@last MVO@ R1, R4 ; store out fifth digit Now that's a beast. :-) Your code is decidedly more approachable. Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted September 12, 2010 Author Share Posted September 12, 2010 Once you get to BCDloop3, you could store your power-of-10 in R1, which will cost you 8 cycles to generate, but will save you 2 cycles per loop iteration. Since you could loop up to 10 times and you always loop at least once, that'll save you cycles both in the worst case and on average. Thanks for the further optimisation tip. I'd completely forgotten that I'd another free register at that point. I actually went even more aggressively for some code I wrote recently for a Space Patrol update. Instead of looping, I went with a decision-tree approach, since it's guaranteed to never need more than ceil(log2(10)) steps for any digit. I didn't want to crash on all 9s or on 909090. :-) Wow! That's a ton of code and an interesting cheat. Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted September 13, 2010 Author Share Posted September 13, 2010 Here's the version with the optimisation :- ;------------------------------------------------------------------------------ ; Display6DigitsInBCD ;------------------------------------------------------------------------------ ; Display 6 digits in BCD format. ;------------------------------------------------------------------------------ ; In :- ; RO - Least Significant Word. ; R1 - Most Significant Word. ; R2 - STIC colour mask (XOR'd in). ; R4 - Destination pointer. ; ; Out :- ; Nothing. ; ; Trashes :- ; R0, R1, R3, R4, R5 ; ;------------------------------------------------------------------------------ Display6DigitsInBCD: PROC pshr r5 ; Save return address. mvii #1 SHL 3, r5 ; Most used constant in a register for speed. ; Most significant digit. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop1: addr r5, r3 addi #-100000 AND $FFFF, r0 ; Get around the fact that adcr r1 ; there is no SBCR instruction. addi #(-100000 SHR 16) AND $FFFF, r1 ; on the CP1600. bc BCDLoop1 mvo@ r3, r4 ; Into the output buffer. mvii #(10+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. BCDLoop2: subr r5, r3 addi #10000, r0 adcr r1 bnc BCDLoop2 mvo@ r3, r4 ; Into the output buffer. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. mvii #1000, r1 ; Power of 10. BCDLoop3: addr r5, r3 subr r1, r0 bc BCDLoop3 mvo@ r3, r4 ; Into the output buffer. mvii #(10+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. mvii #100, r1 ; Power of 10. BCDLoop4: subr r5, r3 addr r1, r0 bnc BCDLoop4 mvo@ r3, r4 ; Into the output buffer. mvii #(-1+'0'-' ') SHL 3, r3 ; Prep GROM character. xorr r2, r3 ; Mix in the STIC colours. mvii #10, r1 ; Power of 10. BCDLoop5: addr r5, r3 subr r1, r0 bc BCDLoop5 mvo@ r3, r4 ; Into the output buffer. ; Least significant digit. addi #10+'0'-' ', r0 ; Prep GROM character. sll r0, 2 addr r0, r0 ; x3 xorr r2, r0 ; Mix in the STIC colours. mvo@ r0, r4 ; Into the output buffer. ; All done! pulr pc ENDP Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted December 9, 2010 Share Posted December 9, 2010 Here's the version with the optimisation :- I'm sorry if this sounds dense, but would you mind explaining the algorithm a bit? You are obviously making some optimizations but I'm not sure I understand them. In particular, why the alternation between SUBR/ADDR for R3? Also, why use "ADDI #-100000 AND $FFFF" instead of "NEGR" once outside the loop and then just using ADDR with the 1000 in another register? Does NEGR not set the carry bit when an overflow occurs in the negation? I'm trying to implement my own bin2dec routine to display score, more as a learning exercise, but I can't seem to understand exactly the mechanisms that make yours work. -dZ. Quote Link to comment Share on other sites More sharing options...
catsfolly Posted December 10, 2010 Share Posted December 10, 2010 Here's the version with the optimisation :- I'm sorry if this sounds dense, but would you mind explaining the algorithm a bit? You are obviously making some optimizations but I'm not sure I understand them. In particular, why the alternation between SUBR/ADDR for R3? Also, why use "ADDI #-100000 AND $FFFF" instead of "NEGR" once outside the loop and then just using ADDR with the 1000 in another register? Does NEGR not set the carry bit when an overflow occurs in the negation? I'm trying to implement my own bin2dec routine to display score, more as a learning exercise, but I can't seem to understand exactly the mechanisms that make yours work. -dZ. I haven't looked into the carry bit operation much, but think I can explain the alternating subtracts and adds. Basically, we are doing a divide by doing subtracts and counting the number of subtracts we do. Imagine we are working on the hundred's digit. So we have a number "num" that is less than 1000. The code in C would look like this: digit = 0; while (num >= 100) { num = num - 100; digit = digit +1; } // use digit to set a display char digit = 0; while (num >= 10) { num = num - 10; digit = digit +1; } // use digit to set a display char In assembly, this would look like clrr r3 mvii #100, r1 ; Power of 10. BCDLoop3: cmp r1, r0 ; is num lt 100 blt done3 ; yep, done subr r1, r0 incr r3 b BCDLoop3 done3: ; output r3 clrr r3 mvii #10, r1 ; Power of 10. BCDLoop4: cmp r1, r0 ; is num lt 10 blt done4 subr r1, r0 incr r3 b BCDLoop3 done4: If we study this code, it seems inefficient. The cmp r1, r0 instruction subtracts r1 from r0, sets the flags, and then discards the result of the subtraction. Then the subtract instruction does the same subtract again. To fix this, we could do the subtract every time to set the flags. But this means we would be subtracting 100 one too many times from num, so we have to add back 100 to fix up num. digit = 0; while (1) { num = num - 100; if (num <0) break; digit = digit +1; } num = num + 100; // we did one too many subtracts, so fix num // use digit to set a display char digit = 0; while (1) { num = num - 10; if(num<0) break; digit = digit +1; } num = num + 10; // we did one too many subtracts, so fix num // use digit to set a display char In assembly, this would look like: clrr r3 mvii #100, r1 ; Power of 10. BCDLoop3: subr r1, r0 blt done3 incr r3 bc BCDLoop3 done3: add r1, r0 ; fix up r0 ; output r3 clrr r3 mvii #10, r1 ; Power of 10. BCDLoop4: subr r1, r0 blt done4 incr r3 bc BCDLoop3 done4: add r1, r0 ; fix up r0 ; output r3 As you can see, the loops are smaller. We could optimize this a bit more by initializing r3 to "-1" and then always doing the increment: mvii #-1, r3 mvii #100, r1 ; Power of 10. BCDLoop3: incr r3 subr r1, r0 bge BCDLoop3 done3: add r1, r0 ; fix up r0 ; output r3 mvii #-1, r3 mvii #10, r1 ; Power of 10. BCDLoop4: incr r3 subr r1, r0 bge BCDLoop44 done4: add r1, r0 ; fix up r0 ; output r3 But, is there some way to avoid doing the fixups? This is where the alternating subtracts and adds comes in. In C digit = -1; while (1) { digit = digit +1; num = num - 100; if (num <0) break; } // num is now negative number (or zero) // use digit to set a display char digit = 10; while (1) { digit = digit -1; num = num + 10; if (num>= 0 ) break; } // num is positive // use digit to output the character In assembly: mvii #-1, r3 mvii #100, r1 ; Power of 10. BCDLoop3: incr r3 subr r1, r0 bge BCDLoop3 done3: ; output r3 mvii #10, r3 mvii #10, r1 ; Power of 10. BCDLoop4: decr r3 addr r1, r0 blt BCDLoop44 done4: ; output r3 Does this make sense? Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted December 10, 2010 Share Posted December 10, 2010 Does this make sense? Fantastic! Yes, it makes a lot of sense. I understand it perfectly now. Thank you very much! -dZ. Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted December 10, 2010 Share Posted December 10, 2010 I think there may be one more optimization that can be done: According to the data sheet of the CP-1600 CPU, the conditional branching instructions cost 7 cycles when false and 9 when true. Thus, since we're testing on each iteration, it occurred to me that we could shave a few cycles by unrolling each division loop and taking the hit on the best case scenarios of breaking out early. For instance, instead of: mvii #-1, r3 mvii #100, r1 ; Power of 10. BCDLoop3: incr r3 subr r1, r0 bge BCDLoop3 done3: ; output r3 We could do something like: mvii #-1, r3 mvii #100, r1 ; Power of 10. BCDLoop3: ; Unroll loop and reverse break test REPEAT(9) incr r3 subr r1, r0 blt done3 ; break ENDR subr r1, r0 ; Perform last one outside repeat block ; since we don''t need the test/branch there. done3: ; output r3 That way, on the worse case scenarios, you save 20 cycles (2 cycles * 10 iterations) more per loop, and on the best case scenario you break even with the previous code. What do you think? -dZ. Quote Link to comment Share on other sites More sharing options...
GroovyBee Posted December 10, 2010 Author Share Posted December 10, 2010 That way, on the worse case scenarios, you save 20 cycles (2 cycles * 10 iterations) more per loop, and on the best case scenario you break even with the previous code. As with any optimisation its always a trade off between size and speed. In my code I only convert the score when its changed thus saving a ton of cycles. Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted December 10, 2010 Share Posted December 10, 2010 That way, on the worse case scenarios, you save 20 cycles (2 cycles * 10 iterations) more per loop, and on the best case scenario you break even with the previous code. As with any optimisation its always a trade off between size and speed. In my code I only convert the score when its changed thus saving a ton of cycles. I was just mentioning it because ROM is cheap nowadays, so the trade off between size and speed is not as significant as it once was. -dZ. Quote Link to comment Share on other sites More sharing options...
catsfolly Posted December 10, 2010 Share Posted December 10, 2010 I was just mentioning it because ROM is cheap nowadays, so the trade off between size and speed is not as significant as it once was. -dZ. i think GroovyBee's code probably does the conversion with the least amount of code. If you want to minimize the execution time, I think Joe's "divide and conquer" approach will work better, even though it takes more code. For example, we could add one check to our previous example code: subi #500, r1 ; check the halfway point bpl 5_or_more 4_or_less mvii #5, r3 mvii #100, r1 ; Power of 10. BCDLoop3: decr r3 addi r1, r0 bge BCDLoop3 b done3 5_or_more mvii #5, r3 mvii #100, r1 ; Power of 10. BCDLoop3: incr r3 subr r1, r0 bge BCDLoop3 done3: Now, the loops will only run a maximum of 4-5 times (I haven't tried any of this code, this is just to show the concept). Further "divide and conquer" checks can reduce or eliminate the loops entirely (check out Joe's code on this thread for details...). David Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.