+retroclouds Posted May 9, 2020 Share Posted May 9, 2020 (edited) So I have a few lines of code I want to optimize for speed. Have some ideas what I can do, but I'm also very interested seeing what the community comes up with... The code is used for reorganizing my TiVi editor index when I delete a line in the editor. Note that when I reorganize the index, I make sure that the index pages form a continuous memory region, so I don't have to worry about mapping pages or anything. That's already been taken care of. The index can grow up to 5 SAMS pages mapped to (>b000 to >ffff) for 10240 lines of text. ok, Here's what I would do: Copy code to scratchpad and work from there for reducing wait states Reduce loop counter overhead by putting multiple mov instructions in a single loop, say 8 Look at the fastest mov instructions (only use registers?) What else could be done for faster access? Note that in the code fragment tmp0=R4, tmp1=R5, tmp2=R6 and so on. *************************************************************** * _idx.entry.delete.reorg * Reorganize index slot entries *************************************************************** * bl @_idx.entry.delete.reorg *-------------------------------------------------------------- * Remarks * Private, only to be called from idx_entry_delete *-------------------------------------------------------------- _idx.entry.delete.reorg: ;------------------------------------------------------ ; Reorganize index entries ;------------------------------------------------------ ! mov @idx.top+2(tmp0),@idx.top+0(tmp0) inct tmp0 ; Next index entry dec tmp2 ; tmp2-- jne -! ; Loop unless completed b *r11 ; Return to caller *////////////////////////////////////////////////////////////// * TiVi Editor - Index Management *////////////////////////////////////////////////////////////// *************************************************************** * Size of index page is 4K and allows indexing of 2048 lines * per page. * * Each index slot (word) has the format: * +-----+-----+ * | MSB | LSB | * +-----|-----+ LSB = Pointer offset 00-ff. * * MSB = SAMS Page 00-ff * Allows addressing of up to 256 4K SAMS pages (1024 KB) * * LSB = Pointer offset in range 00-ff * * To calculate pointer to line in Editor buffer: * Pointer address = edb.top + (LSB * 16) * * Note that the editor buffer itself resides in own 4K memory range * starting at edb.top * * All support routines must assure that length-prefixed string in * Editor buffer always start on a 16 byte boundary for being * accessible via index. *************************************************************** Edited May 9, 2020 by retroclouds Quote Link to comment Share on other sites More sharing options...
Asmusr Posted May 9, 2020 Share Posted May 9, 2020 (edited) You're just copying a memory range two bytes back? Instead of using indexed addressing I guess it would be faster to set up two registers tmp0 and tmp1 and use auto-increment addressing: ai tmp0,idx.top mov tmp0,tmp1 inct tmp1 ! mov *tmp1+,*tmp0+ dec tmp2 jne -! Edited May 9, 2020 by Asmusr Quote Link to comment Share on other sites More sharing options...
Stuart Posted May 9, 2020 Share Posted May 9, 2020 Is your register set already in scratchpad RAM? 1 Quote Link to comment Share on other sites More sharing options...
Tursi Posted May 9, 2020 Share Posted May 9, 2020 - Use registers as Rasmus suggests - unroll - do 4 copies in a single loop - deal with the leftover externally. Even two loops (one unrolled and one normal) will be much faster The biggest win with registers is simply the number of bytes read. With very few exceptions, nearly all contrived, the smallest code on the TI is the fastest code. That said, the big non-contrived exception is unrolling loops. And this is just a matter of work/overhead. If you have a common loop like this: lp mov *r0+,*r1+ dec r2 jne lp Even if it's all in scratchpad, you are spending 30 cycles doing work (the mov), and 20 cycles counting (so 40% of the time, it's not working). Put another way, every word costs you 50 cycles to move. If you unroll 4 times: lp mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ dec r2 jne lp Now we have 120 cycles of work and 20 cycles counting (so the non-work portion is only 14%). On average, this loop costs 35 cycles per word. If your loops have any substantial size, the time gained can be significant and well worth the extra setup time (so assuming a shift, for instance, to set the count, you'd want to have more than about 12 words to move to see a benefit overall). 4 Quote Link to comment Share on other sites More sharing options...
+retroclouds Posted May 10, 2020 Author Share Posted May 10, 2020 9 hours ago, Stuart said: Is your register set already in scratchpad RAM? Hi Stuart. Yes, the register set is in scratchpad RAM. Quote Link to comment Share on other sites More sharing options...
+retroclouds Posted May 10, 2020 Author Share Posted May 10, 2020 9 hours ago, Asmusr said: You're just copying a memory range two bytes back? Instead of using indexed addressing I guess it would be faster to set up two registers tmp0 and tmp1 and use auto-increment addressing: ai tmp0,idx.top mov tmp0,tmp1 inct tmp1 ! mov *tmp1+,*tmp0+ dec tmp2 jne -! Hi Rasmus, that is correct. The index is a very simple memory structure and I’m wasting memory, compared to more fancy solutions. But with a 1MB SAMS we’ve got plenty of memory to burn and benefit is that index access is very fast. The disadvantage is that index delete/insert operations could get slow due to the amount of memory that needs shuffling. That being said, even with Tursi’s nr80 file (the largest file I’ve tested so far, some +5000 lines of text) it is still more than acceptable. So looking forwards seeing what can be pulled off with the optimized version. 2 Quote Link to comment Share on other sites More sharing options...
+retroclouds Posted May 10, 2020 Author Share Posted May 10, 2020 (edited) 8 hours ago, Tursi said: - Use registers as Rasmus suggests - unroll - do 4 copies in a single loop - deal with the leftover externally. Even two loops (one unrolled and one normal) will be much faster The biggest win with registers is simply the number of bytes read. With very few exceptions, nearly all contrived, the smallest code on the TI is the fastest code. That said, the big non-contrived exception is unrolling loops. And this is just a matter of work/overhead. If you have a common loop like this: lp mov *r0+,*r1+ dec r2 jne lp Even if it's all in scratchpad, you are spending 30 cycles doing work (the mov), and 20 cycles counting (so 40% of the time, it's not working). Put another way, every word costs you 50 cycles to move. If you unroll 4 times: lp mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ dec r2 jne lp Now we have 120 cycles of work and 20 cycles counting (so the non-work portion is only 14%). On average, this loop costs 35 cycles per word. If your loops have any substantial size, the time gained can be significant and well worth the extra setup time (so assuming a shift, for instance, to set the count, you'd want to have more than about 12 words to move to see a benefit overall). Tursi, thank you! I knew you’d have the numbers worked out. ? So that basically matches with most of what I had in mind, and it’s good to have it confirmed before starting implementation work. I’m going to experiment with different unrolls 4 and 8. But will have to come up with even larger files. I’m sure I’ll already notice a good difference with your nr80 file, but also want to try with a file of +10.000 lines. Edited May 10, 2020 by retroclouds Added link Quote Link to comment Share on other sites More sharing options...
Tursi Posted May 10, 2020 Share Posted May 10, 2020 Yeah, unrolling does give diminishing returns as you get bigger. The 100% work case in that example is 30 cycles per word, so 35 is pretty good. 8 would take you to about 32.5. When I did tests way back in the day I decided 8 was about the sweet spot. Looking at memory copies, when source or target is VDP you get to lose an increment, so it's quicker still. My nr80 file is over 10,000 lines, but less than half of it fits on a 360k disk image. If you are aiming for the disk image crowd - why aim for a file length that can't fit? Quote Link to comment Share on other sites More sharing options...
Tursi Posted May 10, 2020 Share Posted May 10, 2020 Here's the full text as a DV80 FIAD - 10,483 records. You'll have to unzip it of course. NR80FULL.zip 1 Quote Link to comment Share on other sites More sharing options...
GDMike Posted May 26, 2020 Share Posted May 26, 2020 (edited) Using this example, because im still confused. Does r0,r1 and r2 have anything in them to start with? Sorry, I'm just trying to get this. lp mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ dec r2 jne lp Edited May 26, 2020 by GDMike Quote Link to comment Share on other sites More sharing options...
HOME AUTOMATION Posted May 26, 2020 Share Posted May 26, 2020 In these cases... even nothing, can be something. Quote Link to comment Share on other sites More sharing options...
GDMike Posted May 26, 2020 Share Posted May 26, 2020 I'll run it in explorer and see. Quote Link to comment Share on other sites More sharing options...
Stuart Posted May 26, 2020 Share Posted May 26, 2020 1 hour ago, GDMike said: Using this example, because im still confused. Does r0,r1 and r2 have anything in them to start with? Sorry, I'm just trying to get this. lp mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ mov *r0+,*r1+ dec r2 jne lp R0 needs to be pointing at the data to copy, R1 needs to be pointing at where you want to copy it to, and R2 needs to contain the number of words (1 word = 2 bytes) to copy. 1 1 Quote Link to comment Share on other sites More sharing options...
GDMike Posted May 26, 2020 Share Posted May 26, 2020 18 minutes ago, HOME AUTOMATION said: In these cases... even nothing, can be something. I suppose garbage and if r2 is anything then maybe the loop works or does it..hmm Quote Link to comment Share on other sites More sharing options...
GDMike Posted May 26, 2020 Share Posted May 26, 2020 Just now, Stuart said: R0 needs to be pointing at the data to copy, R1 needs to be pointing at where you want to copy it to, and R2 needs to contain the number of words (1 word = 2 bytes) to copy. Sorry, I missed this. Ok, thank you for that..I'm the kind that needs everything spelled out, and it probably was I just didn't see it. But thx again Quote Link to comment Share on other sites More sharing options...
apersson850 Posted May 27, 2020 Share Posted May 27, 2020 So if the registers really do contain garbage, when the loop starts, then that implies that you will move a random number of words from one random place in memory to another one. If you are lucky, and the number of words aren't very many, you may get away with nothing happening at all, but most likely, the program will derail. By "lucky" I mean if you happen to move 123 words from some place to monitor ROM, which can't be changed, then nothing bad happens. You'll simply try to write to memory that will not change. So nothing bad happens, since actually nothing at all happens. In the worst case, you overwrite the code you are right now executing, doing the digital equivalent of suicide. 1 Quote Link to comment Share on other sites More sharing options...
HOME AUTOMATION Posted May 27, 2020 Share Posted May 27, 2020 haha... I didn't recognise the premise... The following code, will copy cartridge RAM/ROM to HIGH MEMORY... DEF MOVBLK MOVBLK LI R0,>6000 SOURCE ADDRESS LI R1,>A000 DESTINATION ADDRESS LI R2,>2000 NUMBER OF WORDS TO MOVE NEXT MOV *R0+,*R1+ DEC R2 JNE NEXT If you don't know the absolute addresses, you can use LABELS in your source code, to define the data to be moved... LABEL DATA >1234 DATA >5678 DATA >9ABC CLEN EQU $-LABEL * THE INSTRUCTIONS BELOW WILL MOVE(COPY) THE ABOVE DATA SEGMENT * AND OR ANY OTHER LINES PLACED * INBETWEEN "LABEL" AND "CLEN EQU $-LABEL", TO THE DESTINATION ADDRESS LI R0,LABEL SOURCE LI R1,??? DESTINATION LI R2,CLEN NUMBER OF BYTES TO MOVE NEXT MOVB *R0+,*R1+ MOVES 1 BYTE AT A TIME DEC R2 BYTE COUNT LEFT TO MOVE JNE NEXT DONE??? REPEAT UNTIL LAST BYTE IS MOVED By adding a few more... MOVB *R0+,*R1+ ...instructions ...you trade a little memory efficiency, for better time efficiency. Quote Link to comment Share on other sites More sharing options...
+TheBF Posted May 27, 2020 Share Posted May 27, 2020 LI R0,LABEL SOURCE LI R1,??? DESTINATION LI R2,CLEN NUMBER OF BYTES TO MOVE NEXT MOV *R0+,*R1+ MOVES 2 BYTES AT A TIME DECT R2 DEC BYTE COUNT BY 2 JNE NEXT DONE??? REPEAT UNTIL LAST BYTE IS MOVED With the DECT instruction just sitting there, this doubles the speed with no size increase *IF* you are always moving an even number of bytes on even address boundaries (I think) 1 Quote Link to comment Share on other sites More sharing options...
senior_falcon Posted May 27, 2020 Share Posted May 27, 2020 You've probably figured this out already, but the code below has a minor error in it. It moves cartridge ram from >6000 to >A000. R2 is >2000 so you will move >2000 words, which is >4000 bytes. R2 should be >1000 which will copy the correct number of bytes. Alternatively you could MOVB *R0+,*R1+ but that would be a silly way to do this. DEF MOVBLK MOVBLK LI R0,>6000 SOURCE ADDRESS LI R1,>A000 DESTINATION ADDRESS LI R2,>2000 NUMBER OF WORDS TO MOVE NEXT MOV *R0+,*R1+ DEC R2 JNE NEXT 1 Quote Link to comment Share on other sites More sharing options...
HOME AUTOMATION Posted May 27, 2020 Share Posted May 27, 2020 1 hour ago, TheBF said: LI R0,LABEL SOURCE LI R1,??? DESTINATION LI R2,CLEN NUMBER OF BYTES TO MOVE NEXT MOV *R0+,*R1+ MOVES 2 BYTES AT A TIME DECT R2 DEC BYTE COUNT BY 2 JNE NEXT DONE??? REPEAT UNTIL LAST BYTE IS MOVED With the DECT instruction just sitting there, this doubles the speed with no size increase *IF* you are always moving an even number of bytes on even address boundaries (I think) Counting has always been far too simple a task to maintain my focus! ...I should have run the code before posting. 1 hour ago, senior_falcon said: You've probably figured this out already, but the code below has a minor error in it. It moves cartridge ram from >6000 to >A000. R2 is >2000 so you will move >2000 words, which is >4000 bytes. R2 should be >1000 which will copy the correct number of bytes. Alternatively you could MOVB *R0+,*R1+ but that would be a silly way to do this. DEF MOVBLK MOVBLK LI R0,>6000 SOURCE ADDRESS LI R1,>A000 DESTINATION ADDRESS LI R2,>2000 NUMBER OF WORDS TO MOVE NEXT MOV *R0+,*R1+ DEC R2 JNE NEXT I think my logic may be contagious... I meant to move 8k, not 4k or 2k. P.S. ...Oh well, back to the drawing board. 1 1 Quote Link to comment Share on other sites More sharing options...
senior_falcon Posted May 27, 2020 Share Posted May 27, 2020 6 hours ago, HOME AUTOMATION said: I think my logic may be contagious... I meant to move 8k, not 4k or 2k. Sorry, but your code does none of the above 3 possibilities. >2000 words is 16K; you want to move >2000 bytes (8K) 1 Quote Link to comment Share on other sites More sharing options...
HOME AUTOMATION Posted May 27, 2020 Share Posted May 27, 2020 Confirmed! If I'm not looking at it in the HEX-EDITOR, I'm probably wrong. You've probably figured out already, that this is the very same formula I used to calculate the amount of fuel needed to complete the journey from my home-world where we speak in seemingly unending sentences with little or no punctuation that do not allow for the possibility of new considerations, to my destination planet in the Bhakti system. Thus, arriving HERE! P.S. If I ever get my ship re-fueled, I'll consider consulting with you first! 1 3 Quote Link to comment Share on other sites More sharing options...
senior_falcon Posted May 28, 2020 Share Posted May 28, 2020 Only if I can go along with you. What is the weather like in Bhakti? Scenic methane thunderstorms? Ammonia clouds? (At least your porthole windows would stay clean) If you haven't heard of it, look up Gimli Glider. In short, an Air Canada Boeing 767 took off with insufficient fuel due to faulty conversion between pounds and kilograms. (They thought they had 22,300 kilograms of fuel when in reality they had 22,300 pounds of fuel. They made an unpowered landing at Gimli, a former Canadian Air Force base. No fatalities, 10 minor injuries. They flew in fuel, fixed some minor damage, and flew the plane out. Quote Link to comment Share on other sites More sharing options...
+TheBF Posted May 28, 2020 Share Posted May 28, 2020 26 minutes ago, senior_falcon said: Only if I can go along with you. What is the weather like in Bhakti? Scenic methane thunderstorms? Ammonia clouds? (At least your porthole windows would stay clean) If you haven't heard of it, look up Gimli Glider. In short, an Air Canada Boeing 767 took off with insufficient fuel due to faulty conversion between pounds and kilograms. (They thought they had 22,300 kilograms of fuel when in reality they had 22,300 pounds of fuel. They made an unpowered landing at Gimli, a former Canadian Air Force base. No fatalities, 10 minor injuries. They flew in fuel, fixed some minor damage, and flew the plane out. I remember that event in the news. The pilot was a glider pilot as a hobby. He was up at 30,000 ft when they ran out of fuel. Auxilary power also failed! They had no instruments! The 767 was very new so the co-pilot had to read the manual to learn about a fan powered generator that could drop down from the fuselage to give them a little electricity. With that they got basic instruments. When they found Gimli on a chart the pilot side-slipped the aircraft like a glider to lose altitude and speed. Gimli was a WWII airport and was abandoned. As they made their approach at the end of the runway there were some teenagers sitting on a pickup truck. He had to hit the brakes so hard it damaged the nose wheel. I think it snapped. If the pickup wasn't there it would have been an almost normal landing. Never heard if those kids had to change their underwear. The pilot was fired because he had not properly confirmed the fuel. It was when Canada had just transitioned to metric. 1 Quote Link to comment Share on other sites More sharing options...
GDMike Posted May 28, 2020 Share Posted May 28, 2020 (edited) Smart pilot, dumb rules to go metric swap out. Hey, anyone? Can I use the area >FE00 through let's say, >FFFD or so for general use as a holding area for some data? I read in the ea manual something about not user space. I'm not using any fp math formulas. Edited May 28, 2020 by GDMike Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.