Optimize TMS9900 assembly code for speed

+retroclouds · May 9, 2020

So I have a few lines of code I want to optimize for speed. Have some ideas what I can do, but I'm also very interested seeing what the community comes up with...

The code is used for reorganizing my TiVi editor index when I delete a line in the editor.

Note that when I reorganize the index, I make sure that the index pages form a continuous memory region, so I don't have to worry about mapping pages or anything. That's already been taken care of.

The index can grow up to 5 SAMS pages mapped to (>b000 to >ffff) for 10240 lines of text.

ok, Here's what I would do:

Copy code to scratchpad and work from there for reducing wait states
Reduce loop counter overhead by putting multiple mov instructions in a single loop, say 8
Look at the fastest mov instructions (only use registers?)

What else could be done for faster access?

Note that in the code fragment tmp0=R4, tmp1=R5, tmp2=R6 and so on.

***************************************************************
* _idx.entry.delete.reorg
* Reorganize index slot entries
***************************************************************
* bl @_idx.entry.delete.reorg
*--------------------------------------------------------------
*  Remarks
*  Private, only to be called from idx_entry_delete
*--------------------------------------------------------------
_idx.entry.delete.reorg:
        ;------------------------------------------------------
        ; Reorganize index entries 
        ;------------------------------------------------------
!       mov   @idx.top+2(tmp0),@idx.top+0(tmp0)
        inct  tmp0                  ; Next index entry
        dec   tmp2                  ; tmp2--
        jne   -!                    ; Loop unless completed
        b     *r11                  ; Return to caller

*//////////////////////////////////////////////////////////////
*                  TiVi Editor - Index Management
*//////////////////////////////////////////////////////////////

***************************************************************
*  Size of index page is 4K and allows indexing of 2048 lines
*  per page.
* 
*  Each index slot (word) has the format:
*    +-----+-----+
*    | MSB | LSB |   
*    +-----|-----+   LSB = Pointer offset 00-ff.
*                      
*  MSB = SAMS Page 00-ff
*        Allows addressing of up to 256 4K SAMS pages (1024 KB)
*    
*  LSB = Pointer offset in range 00-ff
*
*        To calculate pointer to line in Editor buffer:
*        Pointer address = edb.top + (LSB * 16)
*
*        Note that the editor buffer itself resides in own 4K memory range
*        starting at edb.top
*
*        All support routines must assure that length-prefixed string in
*        Editor buffer always start on a 16 byte boundary for being
*        accessible via index.
***************************************************************

Edited May 9, 2020 by retroclouds

Asmusr · May 9, 2020

You're just copying a memory range two bytes back? Instead of using indexed addressing I guess it would be faster to set up two registers tmp0 and tmp1 and use auto-increment addressing:

  ai tmp0,idx.top
  mov tmp0,tmp1
  inct tmp1
! mov *tmp1+,*tmp0+
  dec tmp2
  jne -!

Edited May 9, 2020 by Asmusr

Stuart · May 9, 2020

Is your register set already in scratchpad RAM?

Tursi · May 9, 2020

- Use registers as Rasmus suggests

- unroll - do 4 copies in a single loop - deal with the leftover externally. Even two loops (one unrolled and one normal) will be much faster

The biggest win with registers is simply the number of bytes read. With very few exceptions, nearly all contrived, the smallest code on the TI is the fastest code.

That said, the big non-contrived exception is unrolling loops. And this is just a matter of work/overhead. If you have a common loop like this:

lp
  mov *r0+,*r1+
  dec r2
  jne lp

Even if it's all in scratchpad, you are spending 30 cycles doing work (the mov), and 20 cycles counting (so 40% of the time, it's not working). Put another way, every word costs you 50 cycles to move.

If you unroll 4 times:

lp
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  dec r2
  jne lp

Now we have 120 cycles of work and 20 cycles counting (so the non-work portion is only 14%). On average, this loop costs 35 cycles per word.

If your loops have any substantial size, the time gained can be significant and well worth the extra setup time (so assuming a shift, for instance, to set the count, you'd want to have more than about 12 words to move to see a benefit overall).

+retroclouds · May 10, 2020

9 hours ago, Stuart said:

Is your register set already in scratchpad RAM?

Hi Stuart. Yes, the register set is in scratchpad RAM.

+retroclouds · May 10, 2020

9 hours ago, Asmusr said:
You're just copying a memory range two bytes back? Instead of using indexed addressing I guess it would be faster to set up two registers tmp0 and tmp1 and use auto-increment addressing:
  ai tmp0,idx.top
  mov tmp0,tmp1
  inct tmp1
! mov *tmp1+,*tmp0+
  dec tmp2
  jne -!

Hi Rasmus,

that is correct. The index is a very simple memory structure and I’m wasting memory, compared to more fancy solutions.

But with a 1MB SAMS we’ve got plenty of memory to burn and benefit is that index access is very fast. The disadvantage is that index delete/insert operations could get slow due to the amount of memory that needs shuffling. That being said, even with Tursi’s nr80 file (the largest file I’ve tested so far, some +5000 lines of text) it is still more than acceptable. So looking forwards seeing what can be pulled off with the optimized version.

+retroclouds · May 10, 2020

8 hours ago, Tursi said:
- Use registers as Rasmus suggests

- unroll - do 4 copies in a single loop - deal with the leftover externally. Even two loops (one unrolled and one normal) will be much faster

The biggest win with registers is simply the number of bytes read. With very few exceptions, nearly all contrived, the smallest code on the TI is the fastest code.

That said, the big non-contrived exception is unrolling loops. And this is just a matter of work/overhead. If you have a common loop like this:
lp
  mov *r0+,*r1+
  dec r2
  jne lp
Even if it's all in scratchpad, you are spending 30 cycles doing work (the mov), and 20 cycles counting (so 40% of the time, it's not working). Put another way, every word costs you 50 cycles to move.

If you unroll 4 times:
lp
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  dec r2
  jne lp
Now we have 120 cycles of work and 20 cycles counting (so the non-work portion is only 14%). On average, this loop costs 35 cycles per word.

If your loops have any substantial size, the time gained can be significant and well worth the extra setup time (so assuming a shift, for instance, to set the count, you'd want to have more than about 12 words to move to see a benefit overall).

Tursi, thank you! I knew you’d have the numbers worked out. ?

So that basically matches with most of what I had in mind, and it’s good to have it confirmed before starting implementation work.

I’m going to experiment with different unrolls 4 and 8. But will have to come up with even larger files. I’m sure I’ll already notice a good difference with your nr80 file, but also want to try with a file of +10.000 lines.

Edited May 10, 2020 by retroclouds
Added link

Tursi · May 10, 2020

Yeah, unrolling does give diminishing returns as you get bigger. The 100% work case in that example is 30 cycles per word, so 35 is pretty good. 8 would take you to about 32.5. When I did tests way back in the day I decided 8 was about the sweet spot.

Looking at memory copies, when source or target is VDP you get to lose an increment, so it's quicker still.

My nr80 file is over 10,000 lines, but less than half of it fits on a 360k disk image. If you are aiming for the disk image crowd - why aim for a file length that can't fit?

Tursi · May 10, 2020

Here's the full text as a DV80 FIAD - 10,483 records. You'll have to unzip it of course.

NR80FULL.zip

GDMike · May 26, 2020

Using this example, because im still confused. Does r0,r1 and r2 have anything in them to start with? Sorry, I'm just trying to get this.

lp
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  dec r2
  jne lp

Edited May 26, 2020 by GDMike

HOME AUTOMATION · May 26, 2020

In these cases... even nothing, can be something. :ponder:

GDMike · May 26, 2020

I'll run it in explorer and see.

Stuart · May 26, 2020

1 hour ago, GDMike said:
Using this example, because im still confused. Does r0,r1 and r2 have anything in them to start with? Sorry, I'm just trying to get this.
lp
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  mov *r0+,*r1+
  dec r2
  jne lp

R0 needs to be pointing at the data to copy, R1 needs to be pointing at where you want to copy it to, and R2 needs to contain the number of words (1 word = 2 bytes) to copy.

GDMike · May 26, 2020

18 minutes ago, HOME AUTOMATION said:

In these cases... even nothing, can be something.

I suppose garbage and if r2 is anything then maybe the loop works or does it..hmm

GDMike · May 26, 2020

Just now, Stuart said:

R0 needs to be pointing at the data to copy, R1 needs to be pointing at where you want to copy it to, and R2 needs to contain the number of words (1 word = 2 bytes) to copy.

Sorry, I missed this. Ok, thank you for that..I'm the kind that needs everything spelled out, and it probably was I just didn't see it. But thx again

apersson850 · May 27, 2020

So if the registers really do contain garbage, when the loop starts, then that implies that you will move a random number of words from one random place in memory to another one. If you are lucky, and the number of words aren't very many, you may get away with nothing happening at all, but most likely, the program will derail.

By "lucky" I mean if you happen to move 123 words from some place to monitor ROM, which can't be changed, then nothing bad happens. You'll simply try to write to memory that will not change. So nothing bad happens, since actually nothing at all happens.

In the worst case, you overwrite the code you are right now executing, doing the digital equivalent of suicide.

HOME AUTOMATION · May 27, 2020

haha... I didn't recognise the premise...

The following code, will copy cartridge RAM/ROM to HIGH MEMORY...

       DEF  MOVBLK
MOVBLK LI   R0,>6000 SOURCE ADDRESS
       LI   R1,>A000 DESTINATION ADDRESS
       LI   R2,>2000 NUMBER OF WORDS TO MOVE
NEXT   MOV *R0+,*R1+
       DEC  R2
       JNE  NEXT

If you don't know the absolute addresses, you can use LABELS in your source code, to define the data to be moved...

LABEL  DATA >1234
       DATA >5678
       DATA >9ABC
CLEN   EQU  $-LABEL


* THE INSTRUCTIONS BELOW WILL MOVE(COPY) THE ABOVE DATA SEGMENT
* AND OR ANY OTHER LINES PLACED
* INBETWEEN "LABEL" AND "CLEN   EQU  $-LABEL", TO THE DESTINATION ADDRESS

       LI   R0,LABEL   SOURCE
       LI   R1,???     DESTINATION
       LI   R2,CLEN    NUMBER OF BYTES TO MOVE
NEXT   MOVB *R0+,*R1+  MOVES 1 BYTE AT A TIME
       DEC  R2         BYTE COUNT LEFT TO MOVE
       JNE  NEXT       DONE??? REPEAT UNTIL LAST BYTE IS MOVED

By adding a few more...

MOVB *R0+,*R1+ ...instructions

...you trade a little memory efficiency, for better time efficiency. ;-)

+TheBF · May 27, 2020

       LI   R0,LABEL    SOURCE
       LI   R1,???     DESTINATION
       LI   R2,CLEN    NUMBER OF BYTES TO MOVE

NEXT   MOV *R0+,*R1+   MOVES 2 BYTES AT A TIME
       DECT  R2        DEC BYTE COUNT  BY 2
       JNE  NEXT       DONE??? REPEAT UNTIL LAST BYTE IS MOVED

With the DECT instruction just sitting there, this doubles the speed with no size increase *IF* you are always moving an even number of bytes on even address boundaries (I think)

senior_falcon · May 27, 2020

You've probably figured this out already, but the code below has a minor error in it. It moves cartridge ram from >6000 to >A000. R2 is >2000 so you will move >2000 words, which is >4000 bytes. R2 should be >1000 which will copy the correct number of bytes. Alternatively you could MOVB *R0+,*R1+ but that would be a silly way to do this.

       DEF  MOVBLK
MOVBLK LI   R0,>6000 SOURCE ADDRESS
       LI   R1,>A000 DESTINATION ADDRESS
       LI   R2,>2000 NUMBER OF WORDS TO MOVE
NEXT   MOV *R0+,*R1+
       DEC  R2
       JNE  NEXT

HOME AUTOMATION · May 27, 2020

1 hour ago, TheBF said:
       LI   R0,LABEL    SOURCE
       LI   R1,???     DESTINATION
       LI   R2,CLEN    NUMBER OF BYTES TO MOVE

NEXT   MOV *R0+,*R1+   MOVES 2 BYTES AT A TIME
       DECT  R2        DEC BYTE COUNT  BY 2
       JNE  NEXT       DONE??? REPEAT UNTIL LAST BYTE IS MOVED
With the DECT instruction just sitting there, this doubles the speed with no size increase *IF* you are always moving an even number of bytes on even address boundaries (I think)

Counting has always been far too simple a task to maintain my focus! :rolling:

...I should have run the code before posting. :roll:

1 hour ago, senior_falcon said:
You've probably figured this out already, but the code below has a minor error in it. It moves cartridge ram from >6000 to >A000. R2 is >2000 so you will move >2000 words, which is >4000 bytes. R2 should be >1000 which will copy the correct number of bytes. Alternatively you could MOVB *R0+,*R1+ but that would be a silly way to do this.
       DEF  MOVBLK
MOVBLK LI   R0,>6000 SOURCE ADDRESS
       LI   R1,>A000 DESTINATION ADDRESS
       LI   R2,>2000 NUMBER OF WORDS TO MOVE
NEXT   MOV *R0+,*R1+
       DEC  R2
       JNE  NEXT

I think my logic may be contagious... :twisted: I meant to move 8k, not 4k or 2k. :ponder:

P.S. ...Oh well, back to the drawing board. :waving:

senior_falcon · May 27, 2020

6 hours ago, HOME AUTOMATION said:

I think my logic may be contagious... I meant to move 8k, not 4k or 2k.

Sorry, but your code does none of the above 3 possibilities. >2000 words is 16K; you want to move >2000 bytes (8K)

HOME AUTOMATION · May 27, 2020

Confirmed! If I'm not looking at it in the HEX-EDITOR, I'm probably wrong. :dunce:

You've probably figured out already, that this is the very same formula I used to calculate the amount of fuel needed to complete the journey from my home-world where we speak in seemingly unending sentences with little or no punctuation that do not allow for the possibility of new considerations, to my destination planet in the Bhakti system. Thus, arriving HERE! :grin:

P.S. If I ever get my ship re-fueled, I'll consider consulting with you first! :ponder:

senior_falcon · May 28, 2020

Only if I can go along with you. What is the weather like in Bhakti? Scenic methane thunderstorms? Ammonia clouds? (At least your porthole windows would stay clean)

If you haven't heard of it, look up Gimli Glider. In short, an Air Canada Boeing 767 took off with insufficient fuel due to faulty conversion between pounds and kilograms. (They thought they had 22,300 kilograms of fuel when in reality they had 22,300 pounds of fuel. They made an unpowered landing at Gimli, a former Canadian Air Force base. No fatalities, 10 minor injuries. They flew in fuel, fixed some minor damage, and flew the plane out.

+TheBF · May 28, 2020

26 minutes ago, senior_falcon said:

Only if I can go along with you. What is the weather like in Bhakti? Scenic methane thunderstorms? Ammonia clouds? (At least your porthole windows would stay clean)

If you haven't heard of it, look up Gimli Glider. In short, an Air Canada Boeing 767 took off with insufficient fuel due to faulty conversion between pounds and kilograms. (They thought they had 22,300 kilograms of fuel when in reality they had 22,300 pounds of fuel. They made an unpowered landing at Gimli, a former Canadian Air Force base. No fatalities, 10 minor injuries. They flew in fuel, fixed some minor damage, and flew the plane out.

I remember that event in the news. The pilot was a glider pilot as a hobby. He was up at 30,000 ft when they ran out of fuel. Auxilary power also failed! They had no instruments!

The 767 was very new so the co-pilot had to read the manual to learn about a fan powered generator that could drop down from the fuselage to give them a little electricity. With that they got basic instruments.

When they found Gimli on a chart the pilot side-slipped the aircraft like a glider to lose altitude and speed. Gimli was a WWII airport and was abandoned.

As they made their approach at the end of the runway there were some teenagers sitting on a pickup truck.

He had to hit the brakes so hard it damaged the nose wheel. I think it snapped. If the pickup wasn't there it would have been an almost normal landing.

Never heard if those kids had to change their underwear.

The pilot was fired because he had not properly confirmed the fuel. It was when Canada had just transitioned to metric.

GDMike · May 28, 2020

Smart pilot, dumb rules to go metric swap out.

Hey, anyone? Can I use the area >FE00 through let's say, >FFFD or so for general use as a holding area for some data? I read in the ea manual something about not user space.

I'm not using any fp math formulas.

Edited May 28, 2020 by GDMike

Optimize TMS9900 assembly code for speed

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members