
Assembly on the 99/4A


matthew180


It would have been more logical to interpret "B R15" as "branch to the address in R15". But instead, it means "branch to the address of R15".

 

It certainly would. And if B *R15 had meant "jump to the address contained in the word pointed to by R15," that would have been very useful for jump tables.


 

It is valid, but not the same. If your workspace pointer is set to >8300, "BL R15" is the same as "BL @>831E" (branching to the assembly code in memory at the same address as R15.)

 

I wonder what the C99 compiler was using it for? Maybe a breakpoint or tracepoint where R15 could easily be changed between "B *R11" and "something else" to cause compiled code to do something different dynamically at runtime? Just a guess...

 

It was used for a stack-push subroutine call. Just trying to keep it as quick as possible, I guess (register is faster than register indirect ;) ).


One of the reasons for FORTH being so fast is that it is STACK BASED.

 

Having several STACKS would seem to be much more efficient than swapping values in Registers to make room for new values.

 

BL GETSD

 

GETSD MOV @SAVEDDATA,R7

AI R7,41

MOV R7,@SAVEDDATA

RT

 

VS:

 

BLWP @NEWREGISTERSET

AI R7,41 * R7 is SAVEDDATA

RTWP

 

I am not a great Assembly Language programmer, but the second one seems much more efficient in the long term.

Now I could be wrong, but if you have a huge, complicated program, which one saves more space and is faster?

Edited by RXB

"B R15" ... means "branch to the address of R15".

 

Is the timing for the address modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance, to 8 clock cycles and 1 memory access, as for symbolic addressing?

 

...lee


I am not a great Assembly Language programmer, but the second one seems much more efficient in the long term.

Now I could be wrong, but if you have a huge, complicated program, which one saves more space and is faster?

Well, just for fun, we can look at the numbers:

Assuming a normal arrangement of 8 bit code and 16-bit registers:

 

 

BL @GETSD - 4 bytes, 28 cycles
GETSD
 MOV @SAVEDDATA,R7 - 4 bytes, 30 cycles
 AI R7,41 - 4 bytes, 22 cycles
 MOV R7,@SAVEDDATA - 4 bytes, 34 cycles
 RT - 2 bytes, 16 cycles
== 18 bytes, 130 cycles, 2 bytes of storage (R11 for BL/RT)

 

VS:

 

 

NEWREGISTERSET DATA NEWWP,GETSD 
...
 BLWP @NEWREGISTERSET - 4 bytes, 34 cycles
GETSD
 AI R7,41 - 4 bytes, 22 cycles
 RTWP - 2 bytes, 18 cycles
== 10 bytes, 74 cycles, 8 bytes of storage (R7,R13,R14,R15) and 4 bytes for the vector.

 

But, the first one could be exactly the same as the second, but slightly faster:

 

 

BL @GETSD - 4 bytes, 28 cycles
GETSD
 AI R7,41 - 4 bytes, 22 cycles 
 RT - 2 bytes, 16 cycles
== 10 bytes, 66 cycles, 4 bytes of storage (R7,R11)

 

You can relieve the pressure on R7 a little more easily, too:

 

 

VAL41 DATA 41
...
 BL @GETSD - 4 bytes, 28 cycles
GETSD
 A @VAL41,@SAVEDDATA - 6 bytes, 38 cycles
 RT - 2 bytes, 16 cycles
== 12 bytes, 82 cycles, 2 bytes of storage (R11 for BL/RT)

 

Assembly on the TI is always a set of tradeoffs. Are you coding for size or for speed? Are your registers constrained, or do you have lots free? BLWP is great if you need to swap in a new register set AND need all the information about the caller (if you don't care and can get back yourself, LWPI is only 18 cycles in 8-bit RAM and doesn't tie up any registers with caller information).

 

The one thing I've found (kind of) nice is that the TI is so memory-bound by the multiplexer and the base instruction cost that in many cases, coding for size ALSO generates the fastest code. ;) Not all, but certainly many.

 

I'd argue myself in the example given that a subroutine of any kind for a single add is silly - just put the add inline. It takes the same number of bytes as any literal jump and saves all the overhead (or, if you need it to be flexible but it fits in one word, consider the X instruction, which adds just 12 cycles plus the cost of reading the argument, which is 0 if it's a register!) But in practical terms you'd usually have more content.
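To illustrate that X trick with a sketch of my own (not from anyone's posted code here; R5 and the choice of INCT R7 are arbitrary assumptions), keep a one-word instruction in a free register and execute it in place:

       LI   R5,>05C7      >05C7 is the opcode for INCT R7
* ... later, wherever the flexible operation belongs:
       X    R5            executes the one-word instruction held in R5

Overwriting R5 at runtime changes what that X executes, with no branch taken.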

 

Stacks are tricky to do well on the 9900, what ARE the Forth developers using for their stack functions?

Edited by Tursi

Stacks are tricky to do well on the 9900, what ARE the Forth developers using for their stack functions?

 

In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on 16-bit bus) as the stack pointer (SP), and a return stack, with R14 as the stack pointer. Both stacks are 1 cell (2 bytes) wide and grow downward. The parameter stack grows down from the high end of “high” RAM toward the dictionary and the return stack grows down from the high end of “low” RAM toward the Forth support routines and block buffers.

 

Pushing a value onto either stack involves reserving space by decrementing the stack pointer by 2 and then copying the new value into that 16-bit space. For the parameter stack:

      DECT SP
      MOV  @ADDR,*SP

Popping values from either stack is easier because the stack pointer can be dereferenced and autoincremented in one instruction. For the return stack:

      MOV  *RP+,@ADDR

One can do other stackrobatics, but the basic pushing and popping of 16-bit values is as explained above.
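As one more example in the same style (my own sketch, not from the TI Forth source; it assumes SP has been EQUed to the parameter-stack register, e.g. SP EQU 9), a DUP of the top cell:

       DECT SP            reserve a new cell on the parameter stack
       MOV  @2(SP),*SP    DUP: copy the old top cell into the new one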

 

...lee


 

In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on 16-bit bus) as the stack pointer (SP), and a return stack, with R14 as the stack pointer. Both stacks are 1 cell (2 bytes) wide and grow downward. The parameter stack grows down from the high end of “high” RAM toward the dictionary and the return stack grows down from the high end of “low” RAM toward the Forth support routines and block buffers.

 

Pushing a value onto either stack involves reserving space by decrementing the stack pointer by 2 and then copying the new value into that 16-bit space. For the parameter stack:

      DECT SP
      MOV  @ADDR,*SP

Popping values from either stack is easier because the stack pointer can be dereferenced and autoincremented in one instruction. For the return stack:

      MOV  *RP+,@ADDR

One can do other stackrobatics, but the basic pushing and popping of 16-bit values is as explained above.

 

...lee

 

 

CAMEL99 Forth uses one simple variation of the above mechanisms. It caches the top element of the data stack (TOS) in R4.

This makes some operations much faster and others slower. The literature indicates about a 10% speed improvement on a threaded Forth for most CPUs. I can attest to a net improvement of about 8% on the 9900 versus not caching TOS.

 

It makes operations that are stack neutral very efficient and operations that consume 2 items from the stack and return a result to the stack are also very efficient.

CODE: 1+     ( n -- n')   
              TOS INC,
              NEXT,
              END-CODE
CODE: +       ( u1 u2 -- u )
             *SP+ TOS ADD,  \ ADD 2nd item to TOS and incr stack pointer.
              NEXT,
              END-CODE 

Managing the TOS register for more complex operations that consume all the input parameters on the stack normally means just doing a refill at the end of the operation.

CODE: !      ( n addr -- )  \ store n in address
             *SP+ *TOS MOV,       
              TOS POP,            
              NEXT,            
              END-CODE

But I think you can see that managing stacks is not so bad on the 9900.

In fact, using the Forth assembler, CAMEL99 creates PUSH and POP macros for the data stack and RPUSH and RPOP for the return stack.

 

This makes the code look like it's running on a stack machine. :-)

Edited by TheBF

I have found stacks are much better for program management than just spaghetti code using BL all the time.

 

Also, if R6 had 41 in it and you used BLWP, you could just do A R6,R7 instead of AI R7,41.

 

For some reason you omitted my code MOV @SAVEDDATA,R7, which my BLWP version did not need, as R7 was SAVEDDATA.

 

Additionally, having 32 registers is faster than using only 16 registers: the larger and more complex programs get, the more value swapping is needed.


Is the timing for the address modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance, to 8 clock cycles and 1 memory access, as for symbolic addressing?

 

The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.


I have found stacks are a much better management than just spaghetti code using BL all the time.

 

 

 

I did some Asm experiments with a macro I called 'CALL'.

It pushed R11 onto a return stack, did the BL to the sub-routine and when the sub-routine returned it popped R11 from the return stack.

 

It was aesthetically pleasing to be able to have sub-routines call a sub-routine which called a sub-routine etc...

But the size cost is quite high on the 9900: 4 instructions, 8 bytes per call. So for most Assembly language coders the need for it would be small, I think, especially if you have a free register to hold R11 temporarily.
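For illustration, one possible expansion of such a CALL macro might look like this (my own sketch; R10 as a dedicated return-stack pointer and the label SUBR are assumptions, and the exact size depends on the addressing modes used):

       DECT R10           make room on the return stack
       MOV  R11,*R10      save the caller's R11
       BL   @SUBR         BL overwrites R11 with our return address
       MOV  *R10+,R11     pop the saved R11 back after SUBR returns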

 

But it is neat to watch on the debugger. :-)


The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.

 

So... this means that Table 3 in the TMS9900 Microprocessor Data Manual will give the wrong answer for execution times of register direct addressing with the B and BL instructions, right?

 

...lee

 


No, they already included the memory access in the operation time. I recommend using 9900 Family System Design, chapter 4, which shows the micro-operations.

 

 

B
-------------------------------------------------
Cycle     Type           Description
  1       Memory read    AB=PC (address bus)
                         DB=Instruction (data bus)
 
  2       ALU            AB=NC (no change)
                         DB=SD (source data register, internal)
  Ns      Data derivation
 
  3+Ns    ALU            AB=NC
                         DB=SD
 

 

Note that each machine cycle shown here requires 2 clock cycles for the 9900.

 

The data derivation sequence takes 1 cycle for Rx, 3 for *Rx, 4 (Byte) or 5 (Word) for *Rx+, 5 for @>Addr, 5 for @>Addr(Rx).

 

The reason for the 1 cycle for Rx is that memory access; this is why they call it the "data derivation sequence". On the 9995, only the address is derived.


For some reason you omitted my code MOV @SAVEDDATA,R7 that my BLWP did not need to do as R7 was the SAVEDATA

Ah, because I couldn't figure out why you omitted that step. Comment your code! ;)

 

Edit: it's a good call though. If your alternate workspace has registers preset with values that you need in the subroutine, that's another potential use case for BLWP (or at least an alternate workspace). I do that on my music player.

Edited by Tursi

It has always been my view that a preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of Registers.

 

Would it not be fastest with entire programs being nothing but sets of Registers, and would it not allow modifying instructions on the fly?

 

Damn few CPUs could pull this off, and the 9900 is one of the few, as its registers are Memory Mapped.

 

Of course Scratch Pad is faster, but if it was nothing but Registers that would have to really pay off.

Edited by RXB

 

 

I did some Asm experiments with a macro I called 'CALL'.

It pushed R11 onto a return stack, did the BL to the sub-routine and when the sub-routine returned it popped R11 from the return stack.

 

It was aesthetically pleasing to be able to have sub-routines call a sub-routine which called a sub-routine etc...

But the size cost is quite high on the 9900: 4 instructions, 8 bytes per call. So for most Assembly language coders the need for it would be small, I think, especially if you have a free register to hold R11 temporarily.

 

But it is neat to watch on the debugger. :-)

This is exactly how XB Stacks work in the XB ROMs.


It has always been my view that a preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of Registers.

 

Would it not be fastest with entire programs being nothing but sets of Registers, and would it not allow modifying instructions on the fly?

 

Damn few CPUs could pull this off, and the 9900 is one of the few, as its registers are Memory Mapped.

 

Of course Scratch Pad is faster, but if it was nothing but Registers that would have to really pay off.

I'm sure there could be times when this approach would lead to greater speed. But I think usually it would be slower. Here's why:

1 - To start with, BLWP is quite a bit slower than BL. (the gurus here can tell you exactly how much slower.)

2 - Usually a subroutine is acting on values in the calling workspace. For example, when you BLWP @VSBW you have to put the screen address in R0 and the byte to write in R1. When VSBW starts, it has to fetch those values with MOV *R13,R0 and MOV @2(R13),R1 before it can do anything. If you used a BL @VSBW instead, those values would already be in the registers and ready to use.
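A minimal sketch of that calling pattern (the labels are illustrative, not TI's actual VSBW source; after BLWP, the new workspace's R13 holds the caller's workspace pointer):

VECWS  BSS  32            the subroutine's own 16-register workspace
VECTOR DATA VECWS,ENTRY   BLWP vector: new WP, then new PC
ENTRY  MOV  *R13,R0       caller's R0 (offset 0 from the old WP)
       MOV  @2(R13),R1    caller's R1 (2 bytes further on)
*      ... do the work with R0/R1 ...
       RTWP               restore the caller's WP, PC and status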

 

There is a time and place for everything, and BLWP is great if the subroutine is lengthy. In that case the overhead is a much smaller percentage of the program and is not so detrimental. And of course, not having to worry about messing up registers in the calling program is a big plus. And as noted above, preloaded registers would also help, but of course that means one fewer register available for use in the BLWP workspace.

Edited by senior_falcon

I'm sure there could be times when this approach would lead to greater speed. But I think usually it would be slower. Here's why:

1 - To start with, BLWP is quite a bit slower than BL. (the gurus here can tell you exactly how much slower.)

2 - Usually a subroutine is acting on values in the calling workspace. For example, when you BLWP @VSBW you have to put the screen address in R0 and the byte to write in R1. When VSBW starts, it has to fetch those values with MOV *R13,R0 and MOV @2(R13),R1 before it can do anything. If you used a BL @VSBW instead, those values would already be in the registers and ready to use.

 

There is a time and place for everything, and BLWP is great if the subroutine is lengthy. In that case the overhead is a much smaller percentage of the program and is not so detrimental. And of course, not having to worry about messing up registers in the calling program is a big plus. And as noted above, preloaded registers would also help, but of course that means one fewer register available for use in the BLWP workspace.

Yea, but then again you have an UNLIMITED NUMBER OF REGISTERS WITH BLWP, AS EACH ONE GIVES YOU 16 MORE!

 

Also, anything to do with the VDP, like VSBW, or with GROM would be slower. Yes, BL is faster than BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was that any variables or data should be in BLWP REGISTERS, so the loss in speed between BL and BLWP would become negligible as program complexity grows.

 

Think like Forth, but instead of 1 endless stack you have hundreds of small stacks, or treat it like 1 huge stack, but without ROT or SWAP needed to get a value used.


VSBW / VSBR are probably the worst examples for using BLWP, in terms of overhead. However, every VMBW with a length of at least, say, 5 or 10 characters makes the BLWP overhead small against the overall execution time. DSRLNK is arguably the best use of BLWP.

 

We are really a bit spoiled with direct access to devices. On a system with a 99105 (with a proper multiuser/multitasking operating system), any such access would need to be done through an XOP system call, and the LIMI command would be forbidden (for user processes).


Yea, but then again you have an UNLIMITED NUMBER OF REGISTERS WITH BLWP, AS EACH ONE GIVES YOU 16 MORE!

 

Also, anything to do with the VDP, like VSBW, or with GROM would be slower. Yes, BL is faster than BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was that any variables or data should be in BLWP REGISTERS, so the loss in speed between BL and BLWP would become negligible as program complexity grows.

 

Think like Forth, but instead of 1 endless stack you have hundreds of small stacks, or treat it like 1 huge stack, but without ROT or SWAP needed to get a value used.

 

 

I have been playing with BLWPing directly to the Forth stack. :)

 

It still takes some effort because you have to put 3 zeros on the bottom for R13, R14, and R15, but then you have space for 13 parameters.

It's not a panacea for everything but it is cool that the 9900 can do it!

 

I have not found an application for it, but I can envision something like a floating point package, where you need to pass a lot of parameters to a function, where it might be pretty cool.

 

B


Yea, but then again you have an UNLIMITED NUMBER OF REGISTERS WITH BLWP, AS EACH ONE GIVES YOU 16 MORE!

 

Also, anything to do with the VDP, like VSBW, or with GROM would be slower. Yes, BL is faster than BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was that any variables or data should be in BLWP REGISTERS, so the loss in speed between BL and BLWP would become negligible as program complexity grows.

Rich, I think your idea is interesting. If I understand correctly you are advocating for always using BLWP instead of BL, with the added nuance that every BLWP subroutine has its own unique workspace. The advantage to this is that no stack is necessary - each subroutine can find its way back to the code that called it.

If you were to implement a stack for BL, it takes an instruction to push an address onto the stack and go up a level, and another instruction (or two) to pop an address from the stack and go down a level. That eliminates the speed advantage of BL.

You still have the need for passing values from the calling program to the BLWP sub, and I'm not yet convinced that putting values/data into BLWP registers is as useful as you think it is. There is also the extra memory usage (36 bytes per BLWP sub) which could be a problem with a large program. But as I noted earlier, there is a time and a place for everything, and this is one more useful tool in the assembly programmer's repertoire.

 

Speaking of neat ideas, one of the slicker ones I have seen here is to overlap the BLWP workspace with the main workspace:

USRWS BSS 16 (but you are really using 32 bytes for USRWS)

BLWPWS BSS 32

You can use all 16 registers in the USRWS. When you BLWP to the subroutine, USRWS R8-R15 are BLWPWS R0-R7. And voila! Provided you are using R8-R15 to pass data, it is already in BLWPWS and ready for use.
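Spelled out as a sketch (the label names and vector are my own illustration): USRWS deliberately reserves only 16 bytes, so its R8-R15 occupy the first 16 bytes of BLWPWS.

USRWS  BSS  16            main workspace; its R8-R15 spill into BLWPWS
BLWPWS BSS  32            subroutine workspace; its R0-R7 alias USRWS R8-R15
SUBVEC DATA BLWPWS,MYSUB  vector for BLWP @SUBVEC

Load your arguments into USRWS R8-R15 before the BLWP, and the subroutine finds them already sitting in its R0-R7.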


Yea, that is what I was talking about. So say R8 to R15 are Screen pointers: R8 is Row, R9 is Column, R10 is the character buffer and color of that set,

then you have R11 Return address, R12 CRU, R13 GROM address, R14 Stack Address, and R15 VDP address for multiple writes/reads.

Overlap 8 bytes and now you have a whole new set, sort of like jumping up and down a stack all the time.

 

Everyone keeps thinking inside the 32 or 16 register-set box; instead, think outside that box and use less distance in the overlap, which gives you more stack-like features.

Edited by RXB

  • 5 months later...

I never saw how to implement the standard character set (SCS) that Matt started discussing. I also couldn't get his VSBR to work, and I patched between someone else's code for VMBW and Matt's to make my own VMBW work. Uncool. Note I'm using an E/A module on real hardware. Why did I have so much trouble? I just used the examples for VSBW, VMBW, VSBR, VMBR and VWTR. No REFs, just equates. And none worked for me using the workspace and info from the first post.


I never saw how to implement the standard character set (SCS) that Matt started discussing. I also couldn't get his VSBR to work, and I patched between someone else's code for VMBW and Matt's to make my own VMBW work. Uncool. Note I'm using an E/A module on real hardware. Why did I have so much trouble? I just used the examples for VSBW, VMBW, VSBR, VMBR and VWTR. No REFs, just equates. And none worked for me using the workspace and info from the first post.

 

 

Where is your code? We need code to assist you in discovering the difficulties.

 

...lee


Here is my thinking, as I sometimes do it: use BLWP with an overlapping area of memory for both BLWP workspaces.
 SET 1 : USE              : SET 2 : USE
--------------------------------------------------------------
R0     TEMP
R1     TEMP
R2     TEMP
R3     TEMP
R4     TEMP
R5     TEMP
R6     TEMP
R7     TEMP                R0     TEMP
R8     TEMP                R1     TEMP
R9     TEMP BL RETURN      R2     TEMP BL RETURN
R10    TEMP or ERROR 1     R3     TEMP or ERROR 1
R11    RETURN BL 1         R4     RETURN BL 1
R12    CRU USE 1           R5     CRU USE 1
R13    GROM ADDRESS 1      R6     GROM ADDRESS 1
R14    Interpreter flags   R7     Interpreter flags
R15    VDP ADDRESS 1       R8     VDP ADDRESS 1
                           R9     TEMP BL RTN 2
                           R10    TEMP
                           R11    RETURN BL 2
                           R12    CRU USE 2
                           R13    GROM ADDRESS 2
                           R14    TEMP
                           R15    VDP ADDRESS 2
You can see more registers and multiple uses.
It saves code space: in SET 1, LI R7,>7F00 also loads SET 2's R0,
and instead of A @VALUE,R7 you can use another BLWP to SET 2
and then use A R10,R0. This saves space, and if both workspaces are in
SCRATCH PAD RAM it makes more sense than a temp area.
Never understood why TI never did this?
Edited by RXB
