TMS9900 likes direct threaded code

+TheBF · June 7, 2017

All of the Forth systems that have been created for TI-99 that I know of to date, have used something called indirect threaded code.

This is a little involved to explain but if you really want to read about it, it's best left to people who could do a much better job than I can.

Like here for example: https://en.wikipedia.org/wiki/Threaded_code

So indirect threaded code is a clean way to implement Forth, everything is a simple address of a routine. Addresses are strung together like labels in an assembler program to make more complex routines and so on. It's very clever but since every piece of code is accessed "indirectly" meaning via another address there is a little speed penalty.

Willsy had said in another post that he felt that direct threaded code might be the best compromise for Forth on the TI-99 and so this got me churning.

It's a pretty big job to re-write a major work like Turbo Forth over again to use Direct threading, but my cheesy home-brew compiler can do the substitutions for you.

So I convinced it how to change up the threading mechanism and hey... it goes faster than before.

The bottom line seems to be the routine in Forth that interprets the NEXT address in a list. It's called... wait for it ... NEXT. :-)

A Forth system can spend 1/2 it's time running this little routine. Below is the difference in the NEXT routine in ITC and DTC on a TMS9900.

And look at that. ITC is 3 instructions but DTC is about the same as returning from a nested sub-routine call. 2 instructions.

I hear you asking "So why not use sub-routine calls?" Well if you want a sub-routine to call another sub-routine and so on and so on.. like Forth requires, in front of every address that you call you need 3 instructions!

( make space on the stack, save the current address on the stack, branch and link to the new address)

That get's real big real fast. So this direct threading mechanism doesn't need that, but it does need a little 4 byte piece of code to start off every Forth word that you create. But for every Assembly language word you need 2 less bytes than using indirect threading. So it almost balances out on the 9900.

\ Direct Threading vs Indirect vs Sub-routine Threading NEXT
==============================================================================
DTC
l: _next                     \ Forth DTC NEXT routine
@@9:         *IP+  W MOV,    \ move IP to W & auto inc IP               22
             *W    B,        \ branch to the address in W               16
---------------------------------------------------------------------------
                                                                      = 38
ITC
l: _next                     \ Forth ITC NEXT routine
@@9:         *IP+ W MOV,     \ move CFA into Working register         \ 22
             *W+  R5 MOV,    \ move contents of CFA to R5 & INCR W    \ 22
             *R5  B,         \ branch to the address in R5            \ 12
---------------------------------------------------------------------------
                                                                      = 56

STC Return     R11 RPOP,      \ pop old R11 off return stack            22
              *R11 B,         \ branch to the previous address          16
---------------------------------------------------------------------------
                                                                      = 38

So how much faster?

Well for a simple DO LOOP that ran >FFFF times (65,535) it was 30% faster.

For that same do loop incrementing a variable it was 16% faster.

So pretty promising.

I have broken my text interpreter at this stage, but can run code compiled by the cross-compiler.

More fun to come.

BF

Edited June 7, 2017 by TheBF

+TheBF · June 7, 2017

So this is looking quite promising. Here are some simple benchmarks and timings

: TEST   FFFF BEGIN 1- DUP 0= UNTIL DROP ;

TurboForth    11.6  secs
Camel99 ITC   10.6  sec
Camel99 DTC    8.3  secs

There is the maximum increase when you run Forth words written in Assembler.

The operators in the benchmark below are all Assembly language code.

HEX
: OPTEST      \ mixed
          2000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
               AAAA   ( lit)     \  HSRAM       HSRAM
               DUP               \  TOS         HSRAM
               SWAP              \  TOS         HSRAM
               OVER              \  TOS         HSRAM
               ROT               \  TOS         --
               DROP              \  TOS         HSRAM
               DUP AND           \  TOS         --
               DUP OR            \  TOS         --
               DUP XOR           \  TOS         --
               1+                \  TOS         HSRAM
               1-                \  TOS         HSRAM
               2+                \  TOS         HSRAM
               2-                \  TOS         HSRAM
               2*                \  TOS         --
               2/                \  TOS         --
               NEGATE            \  TOS         --
               ABS               \  TOS         --
               +                 \  TOS         HSRAM
               2 *               \  TOS         HSRAM
               DROP
          LOOP  ;

\ CAMEL99 ITC:   8.47 secs
\ CAMEL99 DTC:   6.73 sec  
\ TurboForth     8.88 secs

Willsy · June 8, 2017

That's very interesting. I'll have a proper read and digest over the weekend on a proper screen. This mobile phone screen is tricky to read code with!

I wonder how hard it would be to convert TF to DTC. Hmmmm....

Willsy · June 8, 2017

Is RPOP, a macro?

+TheBF · June 8, 2017

Ya it's mostly a win. And I think for TF the win is bigger because you wrote most of it as CODE. So you save 2 bytes on every code word. If you capitalize on that you could be faster and maybe smaller.

That is cool

Does your assembler support macros? If so you could rebuild in such a way that it can build either way.

Devil in the details of course.

B

Edited June 8, 2017 by TheBF

+TheBF · June 8, 2017

Is RPOP, a macro?

Yes. I made a Pop and Push and RPop and Rpush macros so the code reads more like a Forth machine.

B

+TheBF · June 8, 2017

To get you started here are some changes I had to make. The asterisk in the comment is the change.

And to give credit where credit is due I figured this out by studying Pygmy Forth by Frank Sergeant.

https://github.com/ForthHub/pygmy-forth (It is unreal fast and can re-build itself)

Brad Rodriguez referenced that Frank figure out how to do DTC using a branch rather than a sub-routine CALL.

http://www.bradrodriguez.com/papers/moving1.htm

Since 9900 doesn't have a nested sub-routine call, that's what we needed! :-)

l: _docol     RP DECT,        \ make room on the Rstack                           10
              IP *RP MOV,     \ push IP register onto the return stack            18
              W  4 ADDI,      \ *jump past the code fragment in the header         14
              W IP MOV,       \ move PFA into Forth IP register                   14
              NEXT,             

l: _docon     TOS PUSH,       \ make room in TOS
              W 4 ADDI,       \ *jump over DTC code fragment
             *W TOS MOV,      \ put contents of PFA (W+4) in TOS
              NEXT,             
              END-CODE

l: _dovar     TOS PUSH,       \ make room in TOS
              W 4 ADDI,       \ *jump over DTC code fragment
              W TOS MOV,      \ put the parameter field address in TOS
              NEXT,
              END-CODE

\ In CAMEL99 the WP register doubles as User memory pointer
\ CODE: DOUSER ( -- addr)
l: _douser    TOS PUSH,        \ Executor that executes a "USER VARIABLE" (local to each task)
              TOS STWP,        \ store workspace register WP in TOS
              W 4 ADDI,        \ *jump over DTC code fragment
             *W TOS ADD,       \ add the offset stored in the USER variable's parameter field
              NEXT,
              END-CODE

Then at the front of every Forth word instead of an pointer to the *Executors above, I had to add a code macro that Branches to these words.

I am just going to see if I can auto-increment W in _NEXT and change the ADDI to a single INCT in all the words...

B

*that's what I called them since the Forth standard doesn't have a name for these little code snippets that run the show :-(

Edited June 8, 2017 by TheBF

Sign In

TMS9900 likes direct threaded code

Recommended Posts

+TheBF

Link to comment

Share on other sites

+TheBF

Link to comment

Share on other sites

Willsy

Link to comment

Share on other sites

Willsy

Link to comment

Share on other sites

+TheBF

Link to comment

Share on other sites

+TheBF

Link to comment

Share on other sites

+TheBF

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members

Apps

My Activity Streams

More