retroclouds

TMS9900 assembly language tricks


I'd like to use this thread for collecting any cool tricks you can do in TMS9900 assembly language.

With "trick" I mean optimize a statement for speed and/or size.

Or just do something with an instruction you didn't think was possible at all.

 

So how about it, any cool tricks you wanna share ?

 

I'll kick it off with a little trick I found on Thierry's page:

 

C instruction

 

Apart from comparison, this instruction can also be used to increment a register by four:

 

C *Rx+,*Rx+

 

This uses only one word of memory as opposed to the equivalent :

 

INCT Rx

INCT Rx

Note that the corresponding CB instruction would increment the register by two, but there is no advantage over a plain vanilla INCT in this case.
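
As a rough sketch of where this helps: stepping a pointer through a table of 4-byte records. TABLE, TABEND, COUNT and the register choices are made up for illustration. Keep in mind that the C really does read both words and changes the status bits, so don't point Rx at a memory-mapped device.

```asm
* Count the 4-byte records whose first byte is non-zero.
* TABLE/TABEND and the register usage are assumptions.
COUNT  LI   R1,TABLE      * R1 -> first record
       CLR  R3            * R3 = number of non-zero records
LOOP   MOVB *R1,R2        * first byte of the record, EQ set if it is zero
       JEQ  SKIP
       INC  R3
SKIP   C    *R1+,*R1+     * R1 = R1 + 4 in one word, compare result ignored
       CI   R1,TABEND     * reached the end of the table?
       JL   LOOP          * unsigned "lower than": keep going
       B    *R11          * return to the caller
TABLE  DATA >0100,>0000   * record 0
       DATA >0000,>0000   * record 1
       DATA >FF00,>1234   * record 2
TABEND EQU  $
```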


Well, a similar trick to Thierry's with Compare is that you can increment /two/ registers in one instruction:

 

C *R1+,*R2+  -- increment each register by 2
CB *R1+,*R2+ -- increment each register by 1 

 

and since it supports full addressing modes, you can increment memory locations as well as registers. The idea is you get two for one of whatever it is. ;)

 

One I've used a few times was to use the parity status bit to test a bit - sort of. It's not often thought of, but if you know that the number of set bits will change in a byte, you can use JOP rather than masking and comparing, since the MOVB will set the parity bit if there are an odd number of set bits in the moved byte. (Doesn't work with words!)
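
A minimal sketch of the parity test, assuming a hypothetical byte STATE that is always either >C0 or >C1. Both values are negative and non-zero, so JLT and JEQ cannot tell them apart, but their parity differs.

```asm
* STATE, TSTBIT and BITSET are made-up names for this example.
TSTBIT CLR  R2            * assume the low bit is clear
       MOVB @STATE,R1     * a byte move sets the odd-parity (OP) status bit
       JOP  BITSET        * >C1 has three 1 bits (odd): low bit is set
       B    *R11          * >C0 has two 1 bits (even): return with R2 = 0
BITSET INC  R2            * return with R2 = 1
       B    *R11
STATE  BYTE >C1
       EVEN
```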

 

Storing a commonly used memory address or even a commonly used number in a register can make a huge difference in the size of your program. Smaller programs on the 9900 execute faster. We have 15 general purpose registers - put them to work!
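
A small sketch of the pointer-in-a-register idea. SCORE is a made-up word variable and R9 an arbitrary choice. Indirect addressing adds 4 cycles per operand, against 8 cycles and a second instruction word for the symbolic form, so the LI pays for itself from the third use on.

```asm
* SCORE is a hypothetical word variable in RAM.
       LI   R9,SCORE      * one-time cost: 2 words
       INC  *R9           * 1 word each from here on...
       MOV  *R9,R1
       C    *R9,R2
* ...instead of 2 words each for the symbolic forms:
*      INC  @SCORE
*      MOV  @SCORE,R1
*      C    @SCORE,R2
```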

 

Registers should pretty much always be in scratchpad if you care at all about performance of the code, or you use no registers. Not really a trick, I guess, but pretty important. Likewise, if you can spare the space, copying commonly used code to scratchpad becomes a performance win after about 4 instructions.
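
Copying a routine into scratchpad could look roughly like this. It assumes the routine is position independent (only relative jumps inside) and that >8320 is free scratchpad in your environment. Both are assumptions to check against your own memory map.

```asm
SPAD   EQU  >8320         * assumed free scratchpad area
* Routine to speed up, assembled in slow RAM
FAST   DEC  R4            * simple countdown loop as a stand-in
       JNE  FAST          * relative jump, still valid after the copy
       B    *R11
FASTEN EQU  $
* One-time setup
SETUP  LI   R0,FAST       * source
       LI   R1,SPAD       * destination
       LI   R2,FASTEN-FAST  * length in bytes, always whole words
COPY   MOV  *R0+,*R1+     * copy one word
       DECT R2
       JNE  COPY
* Later, call the scratchpad copy instead of the original
       LI   R4,1000
       BL   @SPAD
```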

 

This one is common to all architectures, but tail-call optimization (often loosely called tail recursion) is a common optimization, and it works very well on the 9900 where there's no stack. A common way to deal with a subroutine that calls a subroutine is to move R11 to another temporary register or memory location. Something like this:

 

SUB1 DO_SOME_WORK
    MOV R11,R10   * Save return address
    BL @SUB2
    B *R10

SUB2 DO_SOME_OTHER_WORK
    B *R11

 

If your second subroutine call is the last thing you do, then don't bother saving the return address. Branch to the subroutine instead and it will use YOUR R11 when it returns. Saves memory, and instructions. This does assume that SUB2 doesn't need R11 to point into SUB1 for any reason.

 

SUB1 DO_SOME_WORK
    B @SUB2     * SUB2 will return for us

SUB2 DO_SOME_OTHER_WORK
    B *R11

 

Another common-to-all-archs one: shifting is much faster than multiply and divide when you are working with a power of two. For instance, SLA R1,2 is much faster and saves a register over multiplying R1 by 4. If you are not multiplying by a power of two, it is commonly reputed to be faster to break it into two shifts and an addition. For example, to multiply R1 by 10, use a second register to make a copy, multiply one by 8 and one by 2, then add them, like so:

 

* 10 can be done in powers of two as 8 + 2
R1X10 MOV R1,R2
     SLA R1,3    * multiply by 8
     SLA R2,1    * multiply by 2
     A R2,R1     * put result in R1

 

Of course, the TI has a funny architecture. All other things being equal, the above doesn't work out for multiplies. The above code takes 14+18+14+14 = 60 cycles and 14 memory accesses, plus 8 bytes. With the code in 8-bit RAM and the regs in scratchpad, that would total 76 cycles. The MPY would probably look like this:

 

R1X10 LI R0,10
     MPY R0,R1

 

This takes 12+52 = 64 cycles and 8 memory accesses, plus 6 bytes. With the code in 8-bit RAM and the regs in scratchpad, it would total 76 cycles. Note it's the same, and the MPY takes less code space (though the product lands in R1:R2, with the low word in R2). If you don't need to load the 10 (i.e. it's already loaded elsewhere), the MPY can actually be faster. Worse, if the registers are in 8-bit RAM too, the shift/shift/add approach takes 116 cycles, while the MPY approach only goes up to 96 cycles.
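
The totals above can be reproduced as follows, assuming 4 wait states for every access to 8-bit memory on the 99/4A. The columns are base cycles, memory accesses and code words per instruction. R1X10M is just a second label so that both versions fit in one listing.

```asm
*                         cycles  accesses  code words
R1X10  MOV  R1,R2         * 14       4          1
       SLA  R1,3          * 18       3          1   (12 + 2 per bit)
       SLA  R2,1          * 14       3          1
       A    R2,R1         * 14       4          1
*                           60      14          4
* code in 8-bit RAM, regs in scratchpad: 60 + 4*4  = 76
* code and regs in 8-bit RAM:            60 + 14*4 = 116

R1X10M LI   R0,10         * 12       3          2
       MPY  R0,R1         * 52       5          1   (product in R1:R2)
*                           64       8          3
* code in 8-bit RAM, regs in scratchpad: 64 + 3*4 = 76
* code and regs in 8-bit RAM:            64 + 8*4 = 96
```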

 

So the rule there is: use SLA if it's a power of 2, which is much faster and saves a register; but if you need two shifts, the MPY is nearly always the better choice, unintuitively.

 

DIV, on the other hand, has a best case of 92 cycles (unless it overflows), but I don't think two shifts and an add work out.. but if you can find a way to avoid DIV your code will appreciate it. On the other hand, DIV is the slowest instruction on the chip and could be used for delays.

 

Remember that almost every instruction that touches memory does an implicit compare against zero. If you can make that meaningful, you can skip explicit compares after most operations. For instance, JNE/JEQ jump on exactly zero, JLT/JGT jumps based on the status of the highest bit (treating it as a sign bit), JOP jumps based on the number of '1' bits if it was a byte operation. Arranging your data will almost always give you the best savings, if you can.
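
A minimal sketch of letting the move do the compare: a zero-terminated string copy with no explicit compare instruction. STRCPY and the register assignment are made up for the example.

```asm
* R1 = source, R2 = destination (assumed to be set up by the caller)
STRCPY MOVB *R1+,*R2+     * the move compares the byte against zero for free
       JNE  STRCPY        * not the terminator yet: keep copying
       B    *R11          * the terminator has been copied too
```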

 

I've used this VDP trick once or twice - it works on hardware but not so much on some of the emulators. Remember that the only difference between setting a read address and setting a write address is whether the prefetch occurs. So if you are desperate and don't care about compatibility, setting a read address one less than the one you want will still have the correct result because the prefetch will bump the address up before you write, and is faster than the operations needed to set (and clear) a bit. For instance:

 

DEC R1 / INC R1 - 10 cycles each (write address as a read address one less than desired)

ORI R1 / ANDI R1 - 14 cycles each plus extra program memory read (slowest method)

XOR R1 / XOR R1 - 14 cycles each plus extra memory read

SOCB / SZCB - 14 cycles each plus extra memory read

 

Unfortunately, to reiterate, this trick is not compatible with some emulation.


Ah yes, those are some nice tricks.

Learned some of them the hard way. Especially these on register and scratch pad usage :)

 

I am a bit lost on the VDP trick, though. I suppose you are referring to reading a byte from VDP and in the next step writing a byte to the VDP without setting the write address specifically? Would you mind giving an example?


I often use ABS, CLR, INV, SETO for simple on/off flags. Sometimes STWP for a quick 'set' if I don't care about the value in the register. All except STWP can be used on non-registers [edited- originally, incorrectly wrote 'words'] equally well, though a little slower.

 

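      SETO R6      * set flag 
      ABS  R6       * test flag (status reflects the old value; note ABS turns >FFFF into >0001)
      JEQ  SETFLG   * jump if EQ is set, i.e. the flag was clear
      CLR  R6       * clear flag ... or we could use INV R6 assuming we used SETO/CLR and ABS has not changed it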

* Flip the flag:
      INV R6
      ABS  R6   * is it on or off?  

* sometimes I'll "set" the flag to nonzero like this - 8 cycles IIRC
      STWP R6   * always !=0, but INV can't be used to turn flag 'off'

Edited by InsaneMultitasker

I am a bit lost on the VDP trick, though. I suppose you are referring to reading a byte from VDP and in the next step writing a byte to the VDP without setting the write address specifically? Would you mind giving an example?

 

No, you set an address without the prefetch inhibit bit, but you write without reading first.

 

A huge amount of confusion about the way the 9918 address counter works was caused by TI telling you to "set a read address" or "set a write address". When you get down to the lowest levels, there is no such thing; the bit that you set for a "write" address is actually a prefetch inhibit.

 

So the normal approach to set a "write" address already stored in R1 (and leaving it untouched in the end) is something like this:

 

VDPWB DATA >4000

 

SOC @VDPWB,R1    * Make it a 'write' address
SWPB R1          * LSB first
MOVB R1,@VDPWA   * Write to VDP address register
SWPB R1          * Get MSB and delay
MOVB R1,@VDPWA   * Write to VDP address register
SZC @VDPWB,R1    * Get rid of the 'write' bit

 

This works just as well, and is slightly faster:

 

DEC R1           * Make one less to account for VDP prefetch
SWPB R1          * LSB first
MOVB R1,@VDPWA   * Write to VDP address register
SWPB R1          * Get MSB and delay
MOVB R1,@VDPWA   * Write to VDP address register
INC R1           * Get back the original value

 

In emulation it only works on emulators that get the VDP prefetch correct. Emulators that get it wrong will also fail the Diagnostic cartridge memory "checkerboard test", and the game Popeye will leave graphical glitches when a bottle is thrown.

 

Knowing about the way that VDP address register works can help your code a little, too, since you can freely change between reads and writes without changing the address register, if your data layout happens to work with that.

 

For that matter, if you need to skip one or two bytes of VDP memory, it's generally faster to just read them than to set the address explicitly again. (Since it takes two VDP writes to set the address again). You can do this even if you are writing data, you just have to be careful of the address counter, since reads and writes increment it at different times (reads increment before you read the data due to prefetch, writes after you write it). For instance, let's say I want to move two sprites in the sprite table, but not touch color or character data. The sprite table layout is: Y, X, Char, Color

 

(For simplicity, assume sprite 0's X and Y are in R0,R1, sprite 1's X and Y are in R2,R3, and the SAL is at >0300):

 

SALTAB EQU >0043  * another trick - pre-swapping the defined address saves a SWPB in the code. Note Ive set >4000 here for write.

LI R5,SALTAB  * Get address of sprite attribute list
MOVB R5,@VDPWA * pre-swapped, dont need the first SWPB
SWPB R5        * get MSB and delay
MOVB R5,@VDPWA * write MSB - address is now set. On a stock 99/4A we dont need to delay unless
              * we use register-only addressing to access the VDP in the very next instruction
              * but you can if you are nervous or want to work on accelerated machines.
MOVB R1,@VDPWD * write sprite 0 Y. No delay needed between writes on a stock 99/4A
MOVB R0,@VDPWD * write sprite 0 X. The address pointer now points to Spr0.Char, we want to skip two.
MOVB @VDPRD,R5 * read garbage from prefetch and increment address pointer. Prefetch now has Spr0.Char and address is Spr0.Color
MOVB @VDPRD,R5 * read Spr0.Char from prefetch. Prefetch now has Spr0.Color and Address is Spr1.Y. No delay needed on stock 99/4A
              * between reads unless you are using register-only addressing (even that is on the edge).
MOVB R3,@VDPWD * write sprite 1 Y. Address counter increments as you expect.
MOVB R2,@VDPWD * write sprite 1 X. We're done.

 

The trick really, is that the VDP doesn't have a "read mode" or a "write mode". It has an address register and a prefetch register, and a read port and a write port. Accessing the read port returns the prefetch register, fetches the data at the address register into the prefetch register, then increments the address register. Accessing the write port writes the data byte to memory at the address register, then increments the address register (it may store the byte temporarily in the prefetch register, I need to test that still).

 

There is a caveat to all the above, though. Later versions of the chip had separate read and write address pointers, meaning that these tricks will NOT work on the 9938 or 9958. If you want to be compatible then you do need to think of it in terms of how TI specified a "Read address" and a "write address".

Edited by Tursi


Also related to what Tursi is talking about is a situation where you set up a write address, but perform a read from the VDP. Since setting up the write address inhibits the prefetch of the data, what you actually read will not be the data at the address currently held in the VDP's address register. What you will get is *probably* the last byte that was read or written, but don't count on it.

 

Matthew


Tursi gave me a piece of advice today that may not be considered a "trick" per se, but it is quite helpful. It had primarily to do with "spreading out" your workload in your game loop. For instance, if you need to check for 16 collisions, do 8 per loop cycle and alternate between them. For instance, check for collisions 1-8, then on the next cycle check for 9-16. This doesn't work for collisions very well in XB, but in assembly, it reduces your loop length and still allows for excellent accuracy in the checks. Since the ISR occurs 60(+-) times a second, that gives roughly 30 checks per second of each detection, as long as you tie the routine into the ISR. I hope I did not misunderstand this advice, but I think I have the concept now. :)
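
A rough sketch of the alternating idea, assuming R6 is a spare register cleared at startup. CHK1, CHK2 and VSYNC are stand-ins for routines supplied elsewhere that return with B *R11.

```asm
* R6 toggles between >0000 and >FFFF once per frame.
GLOOP  INV  R6            * flip the toggle, status reflects the new value
       JEQ  HALF2         * back to zero: do the second half this frame
       BL   @CHK1         * collisions 1-8
       JMP  REST
HALF2  BL   @CHK2         * collisions 9-16
REST   BL   @VSYNC        * hypothetical: wait for the next frame
       JMP  GLOOP
```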


I don't know if this is a trick or not, but I thought this was kind of neat.

 

By using the BLWP instruction, you can have overlapping workspaces between the caller and callee. This allows parameters to be passed to the callee, but preserve some register values across the function call. This would be sort of like a "caller save" calling convention, where the caller saves all important info before calling a function.

 

Example: Save four registers, then call a function which takes three arguments and returns a value

Caller regs:              0 1 2 3 4 5 6 7 8 9 A B C D E F
Callee regs:  1 2 3 4 5 6 7 8 9 A B C D E F '--.--' '-.-'
                               | | | '-.-'    |      |
Arg3        --------------------' | |   |      |      |
Arg2        ----------------------' |   |      |      |
Arg1/Return ------------------------'   |      |      |
Callee context    ----------------------'      |      |
Caller saved regs -----------------------------'      |
Caller context    ------------------------------------'

 

In this example, R9 to R12 are saved across the call, the caller places arguments in R3, R4 and R5. A return value will be placed in R5 after the return.

 

The caller registers R0 to R8 are destroyed by the call, but the callee has nearly all registers available for its use.

 

* Argument setup
 li   r3, 1
 li   r4, 2
 li   r5, 3

* Call setup:
 stwp r6                 * 8     * Get current workspace
 ai   r6, -(4+3)*2       * 14+4  * Locate callee workspace 14 bytes below this one
 li   r7, FUNC           * 12+4  * Set jump address
 blwp r6                 * 26    * Jump to called function
                         * ----
                         * 68

* Return value in R5
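
For completeness, a hedged sketch of what the callee side might look like. Because the callee's workspace sits 7 registers lower, the caller's R3, R4 and R5 show up as the callee's R10, R11 and R12. The body (adding the three arguments) is only a placeholder.

```asm
* Callee side: caller r3 = r10, caller r4 = r11, caller r5 = r12 here
FUNC
 a    r10, r12           * r12 = caller r5 + caller r3
 a    r11, r12           * ... + caller r4, result lands in the caller's r5
 rtwp                    * reload the caller's WP, PC and ST from r13-r15
```

Note that r11 holds an argument here, so a BL inside the callee would overwrite it.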

 

Even though BLWP is slower than BL, this is pretty quick. The call convention I came up with for my GCC port takes about 100 cycles for a call setup, return and stack maintenance. Surprisingly, this is faster.

 

I don't know of any other architecture where you can do something like this, so no compiler would support this kind of call. Assembly only for this guy. There are some other obvious drawbacks:

 

The author would be restricted to using certain registers for certain purposes. The register restrictions could change from call to call, making tricky assembly code REALLY confusing.

 

The small amount of scratchpad memory restricts the "stack" usage, and call tree depth. Wrapping around the top of scratchpad memory would result in hard-to-find memory errors.

 

I'm not sure if "blwp r6" is valid or not to be honest. If not, that's OK, it just means the call setup needs a few additional instructions.

 

Honestly, I don't think this is a reasonable call method for a general case, but it is awfully cool.


He he, I never thought about using ABS. I would have used MOV R6,R6 in your example above. Sets the EQ bit if 0 :-)


Very quick one to round a value in a register (in this case R0) up to an even byte boundary:

 

INC R0        ; add 1 to r0
ANDI R0,>FFFE ; round down to even (word) address

 

That's all I can think of right now! Brain fried!

I totally agree. In fact I'm shamelessly stealing the ABS test for a GCC optimization step. Every cycle counts right?


How about computed jumps?

 

I was thinking of ways to take advantage of the X instruction, and this is what I came up with

 

In basic:

on index goto 100, 110, 120, 130

 

in assembly:

 

* Assume "index" is stored in R0, counting from 0 (BASIC's ON...GOTO counts from 1, so DEC R0 first in that case)
* Validate input value
 ci r0, 3
 jh badval     * Index is negative or greater than 3, bad value

* Jump to correct line
 ai r0, >1000  * This is the code for "JMP 0"
 x  r0         * Jump into the table below
 jmp line100
 jmp line110
 jmp line120
 jmp line130

 

Yeah, this is pretty much what X is for, but it's an easy instruction to overlook.
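
JMP only reaches about 128 words in either direction, so when the targets are farther away a table of addresses is the usual alternative. This is a different technique from the X trick, sketched under the assumption that the index (0-3) in R0 has already been validated. The name jmptab is made up.

```asm
* R0 cannot be used as an index register, so copy the index to R1 first
 mov  r0, r1
 sla  r1, 1              * word offset = index * 2
 mov  @jmptab(r1), r1    * fetch the target address
 b    *r1
jmptab
 data line100, line110, line120, line130
```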


Has been a while since I've seen some abracadabra :D

 

Today I went through TI Intern (page 41, address >0EDC) and found this little snippet:

 

  CLR R0
  MPY R0,R1

 

Looks kinda weird? It's code optimized for space. It clears register R0, R1, R2.

The cool thing is; the machine code for these 2 instructions is only 4 bytes.

 

Makes me wonder how many programmers worked on the TI-99/4A operating system.

Some things look real cool and others well :ponder:

 

Guess they did a good job on code size. There are still 18 bytes of free space at >1FEA.


In GPL a similar shortcut exists.

 

LN1 DEC @VARIABLE

LN2 BR LN1

 

The DEC does a -1 and checks for zero at the same time, so the BR will work without a CZ @VARIABLE

 

LN1 INC @VARIABLE

LN2 BR LN1

 

This works the same way, when VARIABLE hits >FF it then goes to >00 so you get the same without CZ @VARIABLE

 

LN1 DECT @VARIABLE

LN2 BR LN1

 

This one presents a problem: if the number is odd it never gets out of the loop, since >01 - >02 = >FF and it starts over. It only works if the number is even. (Not that anyone besides me does any GPL)

Edited by RXB


ok, here's a little trick I use quite often now.

 

I'm using shift instructions for clearing the X leftmost bits in a register.

In the below example I clear the 3 leftmost bits in register R1.

 

 

      SLA R1,3
      SRL R1,3

 

Normally one would use a single SZC or ANDI instruction.

But I find the above easier, and the machine code is short: 4 bytes, the same as ANDI and less than the 6 bytes SZC needs once you count a mask word in memory.

 

I haven't done the math on this yet, but in general short machine code takes less time to process.

So the two instructions combined should perform as well as or better than an SZC or ANDI instruction.

 

Note: by swapping the two instructions it should also work for the X rightmost bits in a register
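
A small sketch of the swapped version with a worked value, plus the single-instruction forms for comparison. The example value >F96A and the name LOW3 are only for illustration.

```asm
* Clear the 3 rightmost bits of R1, e.g. >F96A becomes >F968
       SRL  R1,3          * >F96A -> >1F2D, the low three bits fall off
       SLA  R1,3          * >1F2D -> >F968, zeros come back in
* Single-instruction alternatives (use one of them instead):
*      ANDI R1,>FFF8      * keep everything except the low 3 bits
*      SZC  @LOW3,R1      * clear the bits that are 1 in LOW3
* LOW3 DATA >0007
```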


Actually, on the 9900 the shift instructions can be slow since their execution time depends on the amount of shifting. Most CPUs have dedicated barrel shift units that make shifting really really fast. But alas, just like the 9900's external registers, it does not have any dedicated internal shift units (actually it does, but it is used exclusively for the CRU IO.) I think the 9900 was a test bed for all of TI's *ideas* about different CPU design...

 

Anyway, the shift instructions have a base execution time of 12-clocks, with an additional 2-clocks for every shift. Both ANDI and SZC have an execution time of 14-clocks, and only SZC has possible time modification based on the addressing mode of the two operands (which is zero if you go with registers for both.) So:

 

SLA R1,3 18-clocks

SRL R1,3 18-clocks

 

 

ANDI R1,>1FFF 14-clocks

 

I'm not sure how the SZC comes in to play with just clearing some bits? Am I missing something?

1 1 1 1  1 0 0 1  0 1 1 0  1 0 1 0  >F96A  some value to clear TOP 3 bits
0 0 0 1  1 1 1 1  1 1 1 1  1 1 1 1  >1FFF  mask
----------------------------------  AND
0 0 0 1  1 0 0 1  0 1 1 0  1 0 1 0  >196A


1 1 1 1  1 0 0 1  0 1 1 0  1 0 1 0  >F96A  some value to clear BOTTOM 3 bits
1 1 1 1  1 1 1 1  1 1 1 1  1 0 0 0  >FFF8  mask
----------------------------------  AND
1 1 1 1  1 0 0 1  0 1 1 0  1 0 0 0  >F968

Like you, I'm also a speed freak, so thanks for giving the facts. This is a really interesting topic :)

 

OK, even though it is slower than the ANDI instruction (which I really didn't expect), I'm still fine with it.

 

For most purposes I'm clearing max 3 bits and I'm not using it in loops.

So in my case compact, easy-to-read code is more important than the actual raw performance.

 

Good to know that you have to consider the clock cycles when using it in loops, though.
