Assembly on the 99/4A


770 replies to this topic

#751 Asmusr OFFLINE  

Asmusr

    River Patroller

  • 2,870 posts
  • Location:Denmark

Posted Thu Aug 23, 2018 8:19 AM

It would have been more logical to interpret "B R15" as "branch to the address in R15". But instead, it means "branch to the address of R15".

 

It certainly would. And if B *R15 had meant "jump to the address contained in the word pointed to by R15", that would have been very useful for jump tables.
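For what it's worth, a table dispatch is only one instruction longer without that doubly-indirect mode. A minimal sketch, assuming R0 holds the routine index; TABLE and the ROUTx labels are hypothetical:

```asm
* Jump-table dispatch on the 9900 (sketch; labels are illustrative)
* R0 holds the routine number 0..2 on entry
       SLA  R0,1            * scale the index to a word offset
       MOV  @TABLE(R0),R1   * fetch the routine address from the table
       B    *R1             * the step a doubly-indirect B would have saved
TABLE  DATA ROUT0,ROUT1,ROUT2
```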



#752 Tursi OFFLINE  

Tursi

    Quadrunner

  • 5,231 posts
  • HarmlessLion
  • Location:BUR

Posted Thu Aug 23, 2018 10:56 AM

 

It is valid, but not the same.  If your workspace pointer is set to >8300, "BL R15" is the same as "BL @>831E" (branching to the assembly code in memory at the same address as R15.)

 

I wonder what the C99 compiler was using it for?  Maybe a breakpoint or tracepoint where R15 could easily be changed between "B *R11" and something else, to cause compiled code to do something different dynamically at runtime?  Just a guess...

 

It was used for a stack push subroutine call. Just trying to keep it as quick as possible, I guess (a register is faster than register indirect ;) ).



#753 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Thu Aug 23, 2018 4:30 PM

One of the reasons FORTH is so fast is that it is STACK BASED.

 

Having several STACKS would seem to be much more efficient than swapping values in Registers to make room for new values.

 

             BL    @GETSD

GETSD  MOV  @SAVEDDATA,R7
       AI   R7,41
       MOV  R7,@SAVEDDATA
       RT

 

VS:

 

BLWP @NEWREGISTERSET
AI   R7,41                               * R7 is SAVEDDATA in the new workspace
RTWP

 

I am not a great Assembly Language programmer, but the second one seems to make much more efficient sense long term.

Now I could be wrong, but if you have a huge, complicated program, which one saves more space and is faster?


Edited by RXB, Thu Aug 23, 2018 4:30 PM.


#754 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,726 posts
  • Location:Silver Run, Maryland

Posted Thu Aug 23, 2018 5:26 PM

"B R15" ... means "branch to the address of R15".

 

Is the timing for the address modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance, to 8 clock cycles and 1 memory access, as for symbolic addressing?

 

...lee



#755 Tursi OFFLINE  

Tursi

    Quadrunner

  • 5,231 posts
  • HarmlessLion
  • Location:BUR

Posted Thu Aug 23, 2018 5:39 PM

I am not a great Assembly Language programmer but the second one seems to make much more efficient sense long term.
Now I could be wrong but if you have a huge complicated program which one saves more space and is faster?


Well, just for fun, we can look at the numbers:
Assuming the normal arrangement of code in 8-bit memory and registers in 16-bit scratchpad:
 

BL @GETSD - 4 bytes, 28 cycles
GETSD
 MOV @SAVEDDATA,R7 - 4 bytes, 30 cycles
 AI R7,41 - 4 bytes, 22 cycles
 MOV R7,@SAVEDDATA - 4 bytes, 34 cycles
 RT - 2 bytes, 16 cycles
== 18 bytes, 130 cycles, 2 bytes of storage (R11 for BL/RT)

 
VS:

NEWREGISTERSET DATA NEWWP,GETSD 
...
 BLWP @NEWREGISTERSET - 4 bytes, 34 cycles
GETSD
 AI R7,41 - 4 bytes, 22 cycles
 RTWP - 2 bytes, 18 cycles
== 10 bytes, 74 cycles, 8 bytes of storage (R7,R13,R14,R15) and 4 bytes for the vector.

 

But, the first one could be exactly the same as the second, but slightly faster:

 

BL @GETSD - 4 bytes, 28 cycles
GETSD
 AI R7,41 - 4 bytes, 22 cycles 
 RT - 2 bytes, 16 cycles
== 10 bytes, 66 cycles, 4 bytes of storage (R7,R11)

 

You can relieve the pressure on R7 fairly easily, too:

VAL41 DATA 41
...
 BL @GETSD - 4 bytes, 28 cycles
GETSD
 A @VAL41,@SAVEDDATA - 6 bytes, 38 cycles
 RT - 2 bytes, 16 cycles
== 12 bytes, 82 cycles, 2 bytes of storage (R11 for BL/RT)

 

Assembly on the TI is always a set of tradeoffs. Are you coding for size or for speed? Are your registers constrained, or do you have lots free? BLWP is great if you need to swap in a new register set AND need all the information about the caller (if you don't care and can get back yourself, LWPI is only 18 cycles in 8-bit RAM and doesn't tie up any registers with caller information).

The one thing I've found (kind of) nice is that the TI is so memory-bound by the multiplexer and the base instruction cost that in many cases, coding for size ALSO generates the fastest code. ;) Not all, but certainly many.

In the example given, I'd argue that a subroutine of any kind for a single add is silly - just put the add inline. It takes the same number of bytes as any literal jump and saves all the overhead (or, if you need it to be flexible but it fits in one word, consider the X instruction, which adds just 12 cycles plus the cost of reading the argument, which is 0 if it's a register!) But in practical terms you'd usually have more content.
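As a sketch of that last point (the register choice and the stored opcode are just for illustration):

```asm
* Using X to execute a dynamically chosen instruction (sketch)
       MOV  @DYNOP,R5       * load a machine word to execute
       X    R5              * execute the contents of R5 as an instruction
DYNOP  DATA >0587           * the opcode of INC R7, prebuilt here
```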

Stacks are tricky to do well on the 9900; what ARE the Forth developers using for their stack functions?


Edited by Tursi, Thu Aug 23, 2018 10:19 PM.


#756 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,726 posts
  • Location:Silver Run, Maryland

Posted Thu Aug 23, 2018 6:51 PM

Stacks are tricky to do well on the 9900, what ARE the Forth developers using for their stack functions?

 

In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on 16-bit bus) as the stack pointer  (SP), and a return stack, with R14 as the stack pointer.  Both stacks are 1 cell (2 bytes) wide and grow downward.  The parameter stack grows down from the high end of “high” RAM toward the dictionary and the return stack grows down from the high end of “low” RAM toward the Forth support routines and block buffers.

 

Pushing a value onto either stack involves reserving space by decrementing the stack pointer by 2 and then copying the new value into that 16-bit space.  For the parameter stack:

      DECT SP
      MOV  @ADDR,*SP

Popping values from either stack is easier because the stack pointer can be dereferenced and autoincremented in one instruction.  For the return stack:

      MOV  *RP+,@ADDR

One can do other stackrobatics, but the basic pushing and popping of 16-bit values is as explained above.
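Two such stackrobatics, sketched under the same conventions (SP as an equate for the parameter-stack register; fbForth's actual definitions may differ):

```asm
* DUP and DROP as 9900 code (sketch; a real Forth word would end by
* jumping to the inner interpreter rather than RT)
DUP    DECT SP              * reserve a new cell
       MOV  @2(SP),*SP      * copy the old top of stack into it
       RT
DROP   INCT SP              * pop and discard the top cell
       RT
```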

 

...lee



#757 TheBF OFFLINE  

TheBF

    Dragonstomper

  • 703 posts
  • Location:The Great White North

Posted Thu Aug 23, 2018 7:51 PM

 

In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on 16-bit bus) as the stack pointer  (SP), and a return stack, with R14 as the stack pointer.  Both stacks are 1 cell (2 bytes) wide and grow downward.  The parameter stack grows down from the high end of “high” RAM toward the dictionary and the return stack grows down from the high end of “low” RAM toward the Forth support routines and block buffers.

 

Pushing a value onto either stack involves reserving space by decrementing the stack pointer by 2 and then copying the new value into that 16-bit space.  For the parameter stack:

      DECT SP
      MOV  @ADDR,*SP

Popping values from either stack is easier because the stack pointer can be dereferenced and autoincremented in one instruction.  For the return stack:

      MOV  *RP+,@ADDR

One can do other stackrobatics, but the basic pushing and popping of 16-bit values is as explained above.

 

...lee

 

 

CAMEL99 Forth uses a simple variation of the above mechanisms.  It caches the top element of the data stack (TOS) in R4.

This makes some operations much faster and others slower.  The literature indicates about a 10% speed improvement on a threaded Forth for most CPUs.  I can attest to a net improvement of about 8% on the 9900 versus not caching TOS.

 

It makes stack-neutral operations very efficient, and operations that consume 2 items from the stack and return a result to the stack are also very efficient.

CODE: 1+     ( n -- n')   
              TOS INC,
              NEXT,
              END-CODE
CODE: +       ( u1 u2 -- u )
             *SP+ TOS ADD,  \ ADD 2nd item to TOS and incr stack pointer.
              NEXT,
              END-CODE 

Managing the TOS register for more complex operations that consume all the input parameters on the stack normally just means doing a refill at the end of the operation.

CODE: !      ( n addr -- )  \ store n in address
             *SP+ *TOS MOV,       
              TOS POP,            
              NEXT,            
              END-CODE

But I think you can see that managing stacks is not so bad on the 9900.

In fact using a Forth assembler CAMEL99 creates macros for PUSH, POP for the data stack and RPUSH, RPOP for the return stack.
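Assuming SP and RP are equates for the two stack-pointer registers, those macros presumably expand to something like this (CAMEL99's actual expansions may differ):

```asm
* PUSH R4 might expand to:
       DECT SP              * make room on the data stack
       MOV  R4,*SP          * store R4 in the new cell
* RPOP R11 might expand to:
       MOV  *RP+,R11        * pop the return stack back into R11
```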

 

This makes the code look like it's running on a stack machine. :-)


Edited by TheBF, Thu Aug 23, 2018 7:54 PM.


#758 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Fri Aug 24, 2018 12:24 AM

I have found stacks give much better management than spaghetti code using BL all the time.

 

Also, if R6 had 41 in it and you used BLWP, you could just do A R6,R7 instead of AI R7,41.

 

For some reason you omitted my code MOV @SAVEDDATA,R7 that my BLWP version did not need, as R7 was SAVEDDATA.

 

Additionally, having 32 registers is faster than using only 16: the larger and more complex programs get, the more value swapping is needed.



#759 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,350 posts
  • Location:Germany

Posted Fri Aug 24, 2018 3:59 AM

Is the timing for the address modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance,  to 8 clock cycles and 1 memory access, as for symbolic addressing?

 

The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.



#760 TheBF OFFLINE  

TheBF

    Dragonstomper

  • 703 posts
  • Location:The Great White North

Posted Fri Aug 24, 2018 7:02 AM

I have found stacks give much better management than spaghetti code using BL all the time.

 

 

 

I did some Asm experiments with a macro I called 'CALL'.

It pushed R11 onto a return stack, did the BL to the subroutine, and when the subroutine returned it popped R11 from the return stack.

 

It was aesthetically pleasing to be able to have sub-routines call a sub-routine which called a sub-routine etc...

But the size cost is quite high on the 9900: 4 instructions, 8 bytes per call. So for most Assembly language coders the need for it would be small, I think, especially if you have a free register to hold R11 temporarily.
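A sketch of what such a CALL might expand to, assuming RP is an equate for a free register used as the return-stack pointer (the macro name and label are illustrative):

```asm
* CALL @SUBR (sketch): preserve R11 across a nested BL
       DECT RP              * push the caller's R11
       MOV  R11,*RP
       BL   @SUBR           * SUBR may itself CALL other subroutines
       MOV  *RP+,R11        * restore R11 on the way out
```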

 

But it is neat to watch on the debugger. :-)



#761 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,726 posts
  • Location:Silver Run, Maryland

Posted Fri Aug 24, 2018 8:26 AM

The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.

 

So... this means that Table 3 in the TMS9900 Microprocessor Data Manual will give the wrong execution times for direct register addressing with the B and BL instructions, right?

 

...lee

 




#762 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,350 posts
  • Location:Germany

Posted Fri Aug 24, 2018 11:46 AM

No, they already included the memory access in the operation time. I recommend using the 9900 Family System Design manual, chapter 4, which shows the micro-operations.

 

B
-------------------------------------------------
Cycle     Type           Description
  1       Memory read    AB=PC (address bus)
                         DB=Instruction (data bus)
 
  2       ALU            AB=NC (no change)
                         DB=SD (source data register, internal)
  Ns      Data derivation
 
  3+Ns    ALU            AB=NC
                         DB=SD
 

 

Note that each machine cycle shown here requires 2 clock cycles for the 9900.

 

The data derivation sequence takes 1 cycle for Rx, 3 for *Rx, 4 (Byte) or 5 (Word) for *Rx+, 5 for @>Addr, 5 for @>Addr(Rx).

 

That 1 cycle for Rx is the memory access, which is why they call it a "data derivation sequence". On the 9995, only the address is derived.



#763 Tursi OFFLINE  

Tursi

    Quadrunner

  • 5,231 posts
  • HarmlessLion
  • Location:BUR

Posted Fri Aug 24, 2018 5:19 PM

For some reason you omitted my code MOV @SAVEDDATA,R7 that my BLWP did not need to do as R7 was the SAVEDATA


Ah, because I couldn't figure out why you omitted that step. Comment your code! ;)

Edit: it's a good call, though. If your alternate workspace has registers preset with values that you need in the subroutine, that's another potential use case for BLWP (or at least an alternate workspace). I do that in my music player.

Edited by Tursi, Fri Aug 24, 2018 5:21 PM.

  • RXB likes this

#764 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Sat Aug 25, 2018 12:53 AM

It has always been my view that a preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of Registers.

 

Would it not be faster to build entire programs out of nothing but sets of Registers? That would also allow modifying instructions on the fly.

 

Damn few CPUs could pull this off, and the 9900 is one of the few, as its registers are Memory Mapped.

 

Of course Scratch Pad is faster, but if a program were nothing but Registers, that would have to really pay off.


Edited by RXB, Sat Aug 25, 2018 12:55 AM.


#765 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Sat Aug 25, 2018 12:58 AM

 

 

I did some Asm experiments with a macro i called 'CALL'

It pushed R11 onto a return stack, did the BL to the sub-routine and when the sub-routine returned it popped R11 from the return stack.

 

It was aesthetically pleasing to be able to have sub-routines call a sub-routine which called a sub-routine etc...

But the size cost is quite high on the 9900. 4 instructions, 8 bytes per call. So for most Assembly language coders the need for it would be small I think especially if you have a free register to hold R11 temporarily.

 

But it is neat to watch on the debugger. :-)

This is exactly how XB Stacks work in the XB ROMs.



#766 senior_falcon OFFLINE  

senior_falcon

    Stargunner

  • 1,229 posts
  • Location:Lansing, NY, USA

Posted Sat Aug 25, 2018 11:11 AM

It has always been my view that a preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of Registers.

 

Would it not be faster to build entire programs out of nothing but sets of Registers? That would also allow modifying instructions on the fly.

 

Damn few CPUs could pull this off, and the 9900 is one of the few, as its registers are Memory Mapped.

 

Of course Scratch Pad is faster, but if a program were nothing but Registers, that would have to really pay off.

I'm sure there could be times when this approach would lead to greater speed. But I think usually it would be slower. Here's why:

1 - To start with, BLWP is quite a bit slower than BL (the gurus here can tell you exactly how much slower).

2 - Usually a subroutine is acting on values in the calling workspace. For example, when you BLWP @VSBW you have to put the screen address in R0 and the byte to write in R1. When VSBW starts, it has to fetch those values with MOV *R13,R0 and MOV @2(R13),R1 before it can do anything. If you used a BL @VSBW instead, those values would already be in the registers and ready to use.

 

There is a time and place for everything, and BLWP is great if the subroutine is lengthy. In that case the overhead is a much smaller percentage of the program and is not so detrimental. And of course, not having to worry about messing up registers in the calling program is a big plus. And as noted above, preloaded registers would also help, but of course that means one less register available for use in the BLWP workspace.
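The overhead described above looks roughly like this at the top of a BLWP routine (a sketch in the style of the E/A utilities; the labels and register use are illustrative):

```asm
VSBWV  DATA UTILWS,VSBWE    * BLWP vector: new workspace, entry point
VSBWE  MOV  *R13,R0         * fetch caller's R0 (R13 holds the old WP)
       MOV  @2(R13),R1      * fetch caller's R1 - pure overhead vs. BL
       RTWP                 * (the real routine body would go here)
```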


Edited by senior_falcon, Sat Aug 25, 2018 11:12 AM.


#767 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Sat Aug 25, 2018 1:04 PM

I'm sure there could be times when this approach would lead to greater speed. But I think usually it would be slower. Here's why:

1 - To start with, BLWP is quite a bit slower than BL (the gurus here can tell you exactly how much slower).

2 - Usually a subroutine is acting on values in the calling workspace. For example, when you BLWP @VSBW you have to put the screen address in R0 and the byte to write in R1. When VSBW starts, it has to fetch those values with MOV *R13,R0 and MOV @2(R13),R1 before it can do anything. If you used a BL @VSBW instead, those values would already be in the registers and ready to use.

 

There is a time and place for everything, and BLWP is great if the subroutine is lengthy. In that case the overhead is a much smaller percentage of the program and is not so detrimental. And of course, not having to worry about messing up registers in the calling program is a big plus. And as noted above, preloaded registers would also help, but of course that means one less register available for use in the BLWP workspace.

Yea, but then again you have UNLIMITED NUMBER OF REGISTERS WITH BLWP AS EACH ONE GIVES YOU 16 MORE!

 

Also, anything to do with VDP like VSBW or GROM would be slower; yes, BL is faster than BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was that any variables or data should live in BLWP REGISTERS, so the loss in speed between BL and BLWP would become negligible as program complexity grows.

 

Think like Forth, but instead of 1 endless stack you have hundreds of small stacks - or treat it like 1 huge stack, but without ROT or SWAP needed to get at a value.



#768 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,350 posts
  • Location:Germany

Posted Sat Aug 25, 2018 2:55 PM

VSBW / VSBR are probably the worst examples for using BLWP, in terms of overhead. However, every VMBW with a length of at least, say, 5 or 10 characters makes the BLWP overhead small against the overall execution time. DSRLNK is arguably the best use of BLWP.

 

We are really a bit spoiled with direct access to devices. On a system with a 99105 (with a proper multiuser/multitasking operating system), any such access would need to go through an XOP system call, and the LIMI instruction would be forbidden (for user processes).


  • RXB likes this

#769 TheBF OFFLINE  

TheBF

    Dragonstomper

  • 703 posts
  • Location:The Great White North

Posted Sat Aug 25, 2018 7:10 PM

Yea, but then again you have UNLIMITED NUMBER OF REGISTERS WITH BLWP AS EACH ONE GIVES YOU 16 MORE!

 

Also, anything to do with VDP like VSBW or GROM would be slower; yes, BL is faster than BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was that any variables or data should live in BLWP REGISTERS, so the loss in speed between BL and BLWP would become negligible as program complexity grows.

 

Think like Forth, but instead of 1 endless stack you have hundreds of small stacks - or treat it like 1 huge stack, but without ROT or SWAP needed to get at a value.

 

 

I have been playing with BLWPing directly to the  Forth stack.  :)

 

It still takes some effort because you have to put 3 zeros on the bottom for R13,R14,R15 but then you have space for 13 parameters.

It's not a panacea for everything but it is cool that the 9900 can do it!
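A hedged sketch of the idea, assuming SP is the data-stack register and the vector's workspace word is patched at run time (none of this is CAMEL99's actual code):

```asm
* BLWP with the data stack as the new workspace (sketch)
       AI   SP,-32          * reserve a full 16-register frame on the stack
       MOV  SP,@FVECT       * the stack frame becomes the workspace
       BLWP @FVECT          * R13-R15 of the frame receive old WP, PC, ST
FVECT  DATA 0,FENTRY        * workspace word is patched just above
FENTRY RTWP                 * (a real body would use R0-R12 as parameters)
```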

 

I have not found an application for it yet, but I can envision something like a floating-point package, where you need to pass a lot of parameters to a function - there it might be pretty cool.

 

B



#770 senior_falcon OFFLINE  

senior_falcon

    Stargunner

  • 1,229 posts
  • Location:Lansing, NY, USA

Posted Sun Aug 26, 2018 10:47 AM

Yea, but then again you have UNLIMITED NUMBER OF REGISTERS WITH BLWP AS EACH ONE GIVES YOU 16 MORE!

 

Also anything to do with VDP like VSBW or GROM would be slower, yes BL is faster then BLWP, but again you are stuck with only 16 Registers.

 

What I was saying was any variables or data should be BLWP REGISTERS so the loss in speed between BL vs BLWP would be negligible the larger the program complexity.

Rich, I think your idea is interesting. If I understand correctly, you are advocating always using BLWP instead of BL, with the added nuance that every BLWP subroutine has its own unique workspace. The advantage is that no stack is necessary - each subroutine can find its way back to the code that called it.

If you were to implement a stack for BL, it takes an instruction to push an address on the stack and go up a level, and another instruction (or two) to pop an address from the stack and go down a level. That eliminates the speed advantage of BL.

You still have the need for passing values from the calling program to the BLWP sub, and I'm not yet convinced that putting values/data into BLWP registers is as useful as you think it is. There is also the extra memory usage (36 bytes per BLWP sub), which could be a problem with a large program. But as I noted earlier, there is a time and a place for everything, and this is one more useful tool in the assembly programmer's repertoire.

 

Speaking of neat ideas, one of the slicker ones I have seen here is to overlap the BLWP workspace with the main workspace:

USRWS    BSS 16    (but you are really using 32 bytes for USRWS)

BLWPWS BSS 32

You can use all 16 registers in the USRWS.  When you BLWP to the subroutine, USRWS R8-R15 are BLWPWS R0-R7. And voila! Provided you are using R8-R15 to pass data, it is already in  BLWPWS and ready for use.
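In assembly, that layout might look like this (a sketch of the trick described above; labels and the sample instruction are illustrative):

```asm
USRWS  BSS  16              * only R0-R7 of the caller are private...
BLWPWS BSS  32              * ...caller's R8-R15 land on BLWPWS R0-R7
SUBV   DATA BLWPWS,SUB      * BLWP vector for the subroutine
SUB    MOV  R0,R2           * caller's R8 is already here as R0
       RTWP
```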


  • RXB likes this

#771 RXB OFFLINE  

RXB

    River Patroller

  • 3,303 posts
  • Location:Vancouver, Washington, USA

Posted Sun Aug 26, 2018 2:07 PM

Yea, that is what I was talking about. So say R8 to R15 are screen pointers: say R8 is Row, R9 is Column, R10 is the character buffer and color of that set,

then you have R11 Return address, R12 CRU, R13 GROM address, R14 Stack Address and R15 VDP address for multiple writes/reads.

Overlap 8 bytes and now you have a whole new set - sort of like jumping up and down a stack all the time.

 

Everyone keeps thinking inside the 32- or 16-register box; instead, think outside that box and use a smaller overlap distance, and you get more stack-like features.


Edited by RXB, Sun Aug 26, 2018 2:09 PM.




