moulinaie Posted October 24, 2019

Hi,

I'm not sure, but I think there was a way to compare a value (from a register? from memory?) to zero that is faster or shorter than:

    register: CI Ri,0      (compare immediate register i to zero) -> four bytes used
    or
    memory:   MOV @adr,Ri  (to set the flags according to what's in adr, erasing register i)

(I have used the 68000 so much that I expect every processor to have a TEST instruction!)

If someone knows...

Guillaume.
Lee Stewart Posted October 24, 2019

There is no explicit TEST instruction on the TMS9900 other than the compare instructions (C, CB, CI). Both of your examples take four bytes. “MOV Ri,Ri”, however, takes only two bytes and does not erase Ri, but it does take one more memory access than the CI instruction.

...lee
TheBF Posted October 24, 2019

I have used

    MOV Rx,Rx
    JNE ....
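Why MOV Rx,Rx works as a test: MOV compares the word it moves with zero and sets the L>, A>, and EQ status bits from that comparison, which is what CI Rx,0 does as an explicit compare. The minimal Python sketch below is a model, not an emulator: it tracks only those three bits (carry and overflow are ignored), and the function names are made up for illustration. It checks that the two agree for every 16-bit value, so a following conditional jump such as JNE behaves the same either way.

```python
MASK16 = 0xFFFF


def to_signed(word: int) -> int:
    """Interpret a 16-bit word as a two's-complement integer."""
    word &= MASK16
    return word - 0x10000 if word & 0x8000 else word


def compare_status(a: int, b: int) -> dict:
    """L>, A>, EQ as a compare (C, CI) of word a against word b would set them."""
    a &= MASK16
    b &= MASK16
    return {
        "L>": a > b,                        # logical (unsigned) greater than
        "A>": to_signed(a) > to_signed(b),  # arithmetic (signed) greater than
        "EQ": a == b,
    }


def compare_to_zero_status(word: int) -> dict:
    """L>, A>, EQ as set by an instruction that compares its result with zero (e.g. MOV)."""
    word &= MASK16
    return {
        "L>": word != 0,                          # any nonzero word is logically > 0
        "A>": word != 0 and not (word & 0x8000),  # positive: nonzero with sign bit clear
        "EQ": word == 0,
    }


# MOV Ri,Ri leaves Ri unchanged and compares it with zero; CI Ri,0 compares Ri with 0.
same = all(compare_to_zero_status(w) == compare_status(w, 0) for w in range(0x10000))
print("MOV Ri,Ri and CI Ri,0 agree on L>, A>, EQ:", same)
```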
FarmerPotato Posted October 24, 2019

(Replying to moulinaie's question above.)

There is another alternative: the CZC instruction. It tests for zero bits using a mask. It's more powerful but slower.

Summary: with registers in PAD (>8300), MOV and CZC are equal, and can be a little faster than CI. CZC has more uses than comparing to 0. With registers in external memory, CI and CZC can be equal, and a bit faster than MOV.

    ones   data >ffff     ; source word: all ones
           czc  @ones,r6

           seto r7        ; all ones, if you can keep a register constant around
           czc  r7,r6

Per 9900 Family Systems Design, p. 8-23 (I seem to recall there is an erratum for this table somewhere?):

    mov r6,r6     C=14  M=4
    ci r6,0       C=14  M=3
    czc @ones,r6  C=14  M=3   A: C+=8 M+=1
    czc r7,r6     C=14  M=3
    seto r7       C=10  M=3

C is the number of cycles.
M is the number of memory accesses.
A is the additional cycles and memory accesses for operands other than registers.
W is the number of wait states imposed by external memory.

If W=4 (2 wait states per byte on the 4A):

    Formula: C + W*(M+A)
    mov r6,r6     14 + 4*4 = 30
    ci r6,0       14 + 4*3 = 26
    czc @ones,r6  22 + 4*4 = 38
    seto r7       10 + 4*3 = 22
    czc r7,r6     14 + 4*3 = 26
      = total     24 + 4*6 = 48

If you prepare R7 ahead of time, CZC can be faster than MOV but is the same as CI.

If you put registers in PAD (LWPI >8300), then register memory accesses (R) don't have wait states.
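The wait-state arithmetic above, including the R adjustment just mentioned for a workspace in PAD, can be double-checked with a short model. This is a minimal Python sketch: the class and dictionary names are illustrative, the timing figures are copied from the table in this post rather than independently verified, and the per-instruction register-access counts (R) are the ones used in the scratchpad table of this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    cycles: int              # C: base clock cycles
    accesses: int            # M: base memory accesses
    extra_cycles: int = 0    # A: extra cycles for a non-register operand (e.g. @ones)
    extra_accesses: int = 0  # A: extra memory accesses for that operand

    def total(self, wait: int, register_accesses: int = 0) -> int:
        """C + W*(M+A-R): only accesses outside the scratchpad pay wait states.

        register_accesses (R) is 0 when the workspace is in external memory.
        """
        c = self.cycles + self.extra_cycles
        m = self.accesses + self.extra_accesses
        return c + wait * (m - register_accesses)


# Figures from the table in the post (9900 Family Systems Design, p. 8-23).
TIMINGS = {
    "mov r6,r6": Timing(14, 4),
    "ci r6,0": Timing(14, 3),
    "czc @ones,r6": Timing(14, 3, extra_cycles=8, extra_accesses=1),
    "czc r7,r6": Timing(14, 3),
    "seto r7": Timing(10, 3),
}

# Register (workspace) accesses per instruction, as counted in the post's PAD table.
PAD_REGISTER_ACCESSES = {
    "mov r6,r6": 3,
    "ci r6,0": 1,
    "czc @ones,r6": 1,
    "czc r7,r6": 2,
    "seto r7": 2,
}


def microseconds(cycles: int, clock_mhz: float = 3.0) -> float:
    """Convert clock cycles to microseconds at the given clock rate."""
    return cycles / clock_mhz


for name, t in TIMINGS.items():
    print(f"{name:13} external: {t.total(4):3}  PAD: {t.total(4, PAD_REGISTER_ACCESSES[name]):3}")
```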
I think this reduces to:

    Formula: C + 0*R + W*(M-R)
    mov r6,r6     14 + 0*3 + 4*1 = 18
    ci r6,0       14 + 0*1 + 4*2 = 22
    czc @ones,r6  22 + 0*1 + 4*3 = 34
    czc r7,r6     14 + 0*2 + 4*1 = 18
    seto r7       10 + 0*2 + 4*1 = 14

Summary: in the best conditions, MOV and CZC are equal, and can be a little faster than CI. CZC has more uses than comparing to 0. With registers in external memory, CI and CZC can be equal, and a bit faster than MOV.

Note: with a 3 MHz clock, 18 cycles is 6 microseconds.

Further, I use CZC to test loop conditions where I want to act each time the loop counter crosses a multiple of a power of two. For instance, when writing R2 bytes down a column of a bitmap screen, I test the VDP address for a multiple of 8 on each pass through the loop:

           li   r7,>7
           li   r0,>4000+PATTBL+>03b5  ; arbitrary starting address in pattern table
           li   r1,>1f00               ; pattern byte to write all down the column (MOVB takes the high byte)
           li   r2,42                  ; weird number of bytes to write down the column
           bl   @setva                 ; set the initial VDPWA from r0
    loop:  movb r1,@VDPWD              ; write a byte. note: optimize: store VDPWD in a register
           inc  r0
           czc  r7,r0                  ; r0 is the VDP address: EQ if its low 3 bits are all 0
           jne  next                   ; not a multiple of 8
           ai   r0,>F8                 ; calculate address of next row
                                       ; (hey: why do we do bitmap mode this way? why not 8 consecutive chars down instead of 32 across?)
           bl   @setva                 ; update VDPWA
    next:  dec  r2
           jne  loop

This might not be optimal compared to using two loop counters, but it's a pattern worth considering.
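To see what the column loop above actually does, here is a minimal Python simulation of it. Assumptions, stated loudly: PATTBL is taken as 0 and the >4000 write flag is left out (neither touches the low three bits the CZC tests), the VDP address is assumed to have been loaded from r0 before the loop, and the names `czc_eq` and `column_write` are hypothetical. It models CZC as "EQ when every bit set in the mask is zero in the value".

```python
def czc_eq(mask: int, value: int) -> bool:
    """CZC mask,value sets EQ when every bit that is 1 in mask is 0 in value."""
    return (mask & value & 0xFFFF) == 0


def column_write(start: int, count: int) -> tuple[list[int], int]:
    """Return the VDP addresses the loop writes to and how often it reloads VDPWA.

    start is the pattern-table offset only (PATTBL = 0, no >4000 flag).
    """
    r0 = start
    vdp_addr = start                      # VDPWA loaded from r0 before the loop
    written = []
    reloads = 0
    for _ in range(count):                # li r2,count ... dec r2 / jne loop
        written.append(vdp_addr)          # movb r1,@VDPWD
        vdp_addr += 1                     # the VDP auto-increments after each write
        r0 = (r0 + 1) & 0xFFFF            # inc r0
        if czc_eq(0x0007, r0):            # czc r7,r0 with r7 = >0007
            r0 = (r0 + 0x00F8) & 0xFFFF   # ai r0,>F8: same column, next character row
            vdp_addr = r0                 # bl @setva
            reloads += 1
    return written, reloads


addresses, reloads = column_write(0x03B5, 42)
print([hex(a) for a in addresses[:5]], "...", hex(addresses[-1]), "reloads:", reloads)
```

Starting at offset >03B5 (line 5 of its character cell), the run is 3 bytes, then four full 8-byte cells, then 7 bytes, with one VDPWA reload per cell boundary; the low byte of every written address stays in >B0..>B7, i.e. the same screen column.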
Asmusr Posted October 24, 2019

(Quoting FarmerPotato: "This might not be optimal compared to using two loop counters, but it's a pattern worth considering.")

I know this is not about optimizing drawing, but what I would do is draw the bottom lines of the top character first (if any), then draw the middle characters in an unrolled loop with 8 MOVBs, and finally draw the top lines of the bottom character (if any). It's a lot of work to code, but you dramatically reduce the average number of instructions it takes to write a byte to the screen, especially if you draw many lines.

(Edited October 26, 2019 by Asmusr)
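The split described above comes down to a little arithmetic on 8-line boundaries. A minimal Python sketch, assuming y is the starting pixel row and the run is `height` rows tall (the function name and signature are illustrative): it returns the lines left in the first character cell, the number of full cells for the unrolled 8-MOVB loop, and the lines in the last partial cell. In the standard bitmap layout, offset >03B5 from the earlier example is character row 3, line 5, so pixel row 29.

```python
def split_column(y: int, height: int) -> tuple[int, int, int]:
    """Split a vertical run into (top, middle, bottom).

    top:    lines written before the next 8-line boundary (0 if y is aligned)
    middle: number of full 8-line character cells
    bottom: lines written in the final partial cell
    """
    if height < 0:
        raise ValueError("height must be non-negative")
    top = min((-y) % 8, height)  # lines until the next multiple of 8, capped by height
    rest = height - top
    return top, rest // 8, rest % 8


print(split_column(29, 42))  # the >03B5, 42-byte example from the earlier post
```

Only the middle count needs the fast unrolled loop; the top and bottom parts are at most 7 bytes each.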
matthew180 Posted October 25, 2019

(Replying to moulinaie's opening question.)

Aside from the instructions already mentioned, the CPU automatically compares the result with zero after many instructions (MOV, A, S, INC, DEC and others set the L>, A>, and EQ status bits this way). Organizing your code and loops to take advantage of this automatic compare is going to be the fastest option. I'm pretty sure most people here are aware of this, but I did not see it mentioned.
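A typical way to exploit the automatic compare is a down-counting loop closed by `dec r2` / `jne loop`, with no separate compare instruction. The minimal Python model below (names are illustrative; only the EQ bit is modelled) shows the loop body runs exactly r2 times, and also shows the usual gotcha of this pattern: starting with r2 = 0 wraps around and runs 65536 times.

```python
def dec(word: int) -> tuple[int, bool]:
    """DEC: subtract 1 with 16-bit wraparound; EQ comes from comparing the result with 0."""
    result = (word - 1) & 0xFFFF
    return result, result == 0


def run_loop(r2: int) -> int:
    """Count passes through  loop: <body> / dec r2 / jne loop  (no explicit compare)."""
    passes = 0
    while True:
        passes += 1          # loop body
        r2, eq = dec(r2)     # dec r2 sets EQ by itself
        if eq:               # jne loop falls through only when EQ is set
            return passes


print(run_loop(42), run_loop(0))
```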
moulinaie Posted October 25, 2019

(Replying to FarmerPotato's CZC suggestion.)

Hi! This is very interesting. Instructions such as CZC and its relatives look very powerful and are probably underused. I like your "multiple of 8" example, that's clever!

When I am working with MLC, I do not touch the 256-byte fast RAM block, as most of it is reserved for XB use.

Guillaume.
moulinaie Posted October 25, 2019

(Replying to matthew180 about the automatic compare-to-zero.)

Yes, that's what I try to do most of the time. But in my current program, a flag may be set to zero inside a loop, and I have to test it after the loop has finished.

Guillaume.
FarmerPotato Posted October 25, 2019

(Replying to Asmusr's suggestion to draw the top, middle, and bottom characters separately.)

I think you're right. I wrote a fast rectangle fill not long ago that relied on SZCB and CZC instructions. However, those were used only to turn the coordinates into counts for the top, middle, and bottom chunks on 8-pixel boundaries, not to test the count inside the loop. Then an unrolled loop in PAD blasted the bytes out; I used the top or bottom count to jump into the right place in the unrolled loop.
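The rectangle-fill code itself isn't shown, so this is a swapped-in sketch of the general computed-entry technique (similar in spirit to Duff's device in C), not FarmerPotato's routine. It assumes a hypothetical unrolled block of 8 one-word MOVB instructions (for example `movb r1,*r15` with r15 holding VDPWD, 2 bytes each): to write n bytes, execution enters (8 - n) instructions from the start and falls through to the end.

```python
SLOTS = 8        # eight MOVB instructions in the unrolled block
SLOT_BYTES = 2   # assumes a one-word MOVB such as "movb r1,*r15" (r15 = VDPWD)


def entry_offset(n: int) -> int:
    """Byte offset into the unrolled block at which to enter so exactly n MOVBs run."""
    if not 0 <= n <= SLOTS:
        raise ValueError("n must be between 0 and 8")
    return (SLOTS - n) * SLOT_BYTES


def movbs_executed(offset: int) -> int:
    """How many MOVBs run when execution starts at offset and falls through to the end."""
    return (SLOTS * SLOT_BYTES - offset) // SLOT_BYTES


print([entry_offset(n) for n in range(SLOTS + 1)])
```

In assembly the same offset would be computed from the top or bottom count and added to the block's start address before branching; full middle cells simply enter at offset 0.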
InsaneMultitasker Posted October 26, 2019

Depending on the value range, you can also use ABS. It is a handy, fast way to check for zero/non-zero, though it isn't always the best test for iterative loops, since you probably need MOV, DEC, S, or some other instruction as part of the loop anyway. For flags, it is very handy and quick to use CLR / SETO along with ABS to test for zero/non-zero.
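"Depending on the value range" matters because ABS rewrites negative values. The minimal Python model below (names are illustrative; only the EQ bit is modelled) shows why it still works for a CLR / SETO flag: EQ is the same whether taken before or after ABS, since |x| is zero exactly when x is zero, and a SETO flag (>FFFF) becomes >0001 after the test, which is still non-zero, so testing it again gives the same answer.

```python
def abs_op(word: int) -> tuple[int, bool]:
    """ABS on a 16-bit word: returns (new word, EQ bit)."""
    word &= 0xFFFF
    eq = word == 0                # zero before ABS exactly when zero after it
    if word & 0x8000:             # negative: two's-complement negation
        word = (-word) & 0xFFFF   # >FFFF -> >0001; >8000 stays >8000
    return word, eq


CLR_VALUE = 0x0000   # value left by CLR @flag
SETO_VALUE = 0xFFFF  # value left by SETO @flag


def flag_is_set(flag: int) -> tuple[int, bool]:
    """Model ABS @flag / JNE: returns (flag after ABS, whether the JNE is taken)."""
    new_flag, eq = abs_op(flag)
    return new_flag, not eq


print(flag_is_set(CLR_VALUE), flag_is_set(SETO_VALUE))
```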
moulinaie Posted October 27, 2019

(Replying to InsaneMultitasker.)

Thanks a lot, I like this solution for my flag. I'm going to modify the source...

Guillaume.