Forth with top of stack cached in a register

+TheBF · January 31, 2017

After 30 years of wondering I finally got around to creating a Forth compiler for the TI-99 where the top of stack (TOS) is maintained in a register.

The literature said this would speed it up by about 10%. I used a DOS Forth system to create the cross-compiler to the build the TI-99 compiler so it was painful debugging both ends at the same time. (old brain hurts) I cross-compiled Brad Rodriguez's Camel Forth for the high level Forth words and wrote 105 Assembler primitives with hints here and there to the hard stuff from TI MSP430 Camel Forth and I had to look at Turbo Forth to help find a couple of gotchas with the 9900 instruction set. Sincere thanks to Willsy and Brad.

Anyway the answer is in.,, kind of sort of. Using Willys's excellent and highly optimized Turbo Forth as the benchmark for excellence I did a little comparison.

Turboforth uses the PAD RAM at >8300 to hold many simple code routines so they run very fast in that zero wait state memory.

To even begin to come close to Turbo Forth I found out I also had to put the Forth thread interpreter there along with branching and I stuck the literal run-time routine there as well. After that the only optimizing approach I used was this TOS thing

The TOS caching is a mixed blessing. For routines that take one input on the stack and produce one output like 1+ 2+ 2/ 2* @ C@ etc... it is about 40% faster. Very cool. For operations that take two inputs and generate one output or no output on the stack, ( ! C! + - * etc.) refilling the TOS can eat up all of the benefit on the 9900.

And for operators that need to make extra space on the stack for an output, the TMS9900 needs 2 instructions so they are actually slower because you have to push the TOS register onto the stack to make room for the new thing. (DUP OVER etc.)

FYI:

- my empty DO/LOOP structure runs the same speed as Turbo Forth so the test is truly comparing the math operations.

- Tests were run on Classi99 emulator under Windows 10 64bits (my real iron is in a box with a defective 32K memory card)

Test 1 tests all the routines Turbo Forth has in PAD Ram and the others as well, so it's mixed.

Test 2 is head to head TOS vs PAD RAM optimization.

Test 3 is TOS vs Forth operators that have no PAD RAM optimization.

We can see in test 3 the we get about 8% improvement not 10%.

The surprise for me was test 2 because the speedup was not suppose to be as fast as zero wait state ram but it seems the combination of everything netted out

to the same result. Weird.

In many other ways Turbo Forth is still faster by virtue of hand coding so much of the internals, but this demonstrates the TOS on math operations.

Now I have to stop doing this for a while. (addictions are hard to kick)

PS. I noticed I did not include NIP and TUCK but that's for another day.

PSS This means Turbo Forth 3.0 can be 8% faster. Just one more re-write Willsy :-)

theBF

HEX
: OPTEST      \ mixed
          1000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
               AAAA   ( lit)     \  HSRAM       HSRAM
               DUP               \  TOS         HSRAM
               SWAP              \  TOS         HSRAM
               OVER              \  TOS         HSRAM
               ROT               \  TOS         --
               DROP              \  TOS         HSRAM
               DUP AND           \  TOS         --
               DUP OR            \  TOS         --
               DUP XOR           \  TOS         --
               1+                \  TOS         HSRAM
               1-                \  TOS         HSRAM
               2+                \  TOS         HSRAM
               2-                \  TOS         HSRAM
               2*                \  TOS         --
               2/                \  TOS         --
               NEGATE            \  TOS         --
               ABS               \  TOS         --
               +                 \  TOS         HSRAM
               2 *               \  TOS         HSRAM
               DROP
          LOOP  ;

\ CAMEL99:   4 5 secs
\ TurboForth 4.7 secs
\ (Empty DO/LOOP are same speed)

 : OPTEST2  \ only HSRAM VS TOS
          2000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
               AAAA   ( lit)     \  HSRAM       HSRAM
               DUP               \  TOS         HSRAM
               SWAP              \  TOS         HSRAM
               OVER              \  TOS         HSRAM
               DUP AND           \  TOS         HSRAM
               DUP OR            \  TOS         HSRAM
               1+                \  TOS         HSRAM
               1-                \  TOS         HSRAM
               2+                \  TOS         HSRAM
               2-                \  TOS         HSRAM
               +                 \  TOS         HSRAM
               2 *               \  TOS         HSRAM
               DROP              \  TOS         HSRAM
               DROP              \  TOS         HSRAM
          LOOP  ;

\ CAMEL99:   6.4 secs
\ TurboForth 6.4 secs

HEX
: OPTEST3   \  TOS versus conventional Parameter stack

          3000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
              AAAA               \  HSRAM       HSRAM
              BBBB               \  HSRAM       HSRAM
              CCCC               \  HSRAM       HSRAM
               ROT               \  TOS         --
               AND               \  TOS         --
               OR                \  TOS         --
               DUP XOR           \  TOS         --
               2*                \  TOS         --
               2/                \  TOS         --
               NEGATE            \  TOS         --
               ABS               \  TOS         --
               DROP              \  TOS         --  
          LOOP  ;

\ CAMEL99:   7.5 secs
\ TurboForth 8.13 secs

Edited January 31, 2017 by TheBF

+Lee Stewart · February 1, 2017

I do not know whether my fbForth 2.0 source code would be of any particular use to you, but you are more than welcome to look at it. I have not yet officially published it because I have not quite finished cleaning up the comments that might make it easier for someone other than myself to have a chance to understand it. I am in the middle of updating the fbForth 2.0 Manual, anyway. @Willsy and I have worked a lot on and borrowed from each other’s Forth sources. I am rambling—I will stop talking now...

...lee

+TheBF · February 1, 2017

Thanks Lee,

It would be very helpful. My plan for this little Forth is to be a Camel Forth kernel in 8K (I hope) and then extend it with "INCLUDE" using ANS Forth file words.

I have been reviewing the DRSLink code where I find it but it is still PFM to me. (pure friggin' magic). Setting up a PAB and buffer is very simple but calling the TI system is pretty convoluted.

So having access to something that works for file access would be great. And if I remember TI Forth abstracted it out to system calls no? That would be perfect.

But don't rush because of me.

BF

+Lee Stewart · February 1, 2017

Sometime today or tomorrow, I will put together a ZIP file with the current ALC source code and post it here. In the meantime, you might be interested in my annotated version of TI’s original source code for TI Forth. It is in “Appendix O” of my TI Forth Instruction Manual “2nd Edition 2013”, which is available in the Forth section of the pinned TI-99/4A development resources thread. It was necessary for me to go through the annotation exercise before I had any hope of rewriting TI Forth to change from sector I/O for Forth blocks to file I/O and from there to hoist fbForth into cartridge space.

The DSRLNK code is identical to (or nearly so) to the copy in the Editor/Assembler (E/A) cartridge routine, which gets copied to low RAM when the E/A cartridge starts up. Neither one of them uses the console DSRLNK routine, which allows for GROM DSRs and tape I/O. Both TurboForth and fbForth 2.0 use the DSRLNK from Millers Graphics, which does call the console version.

...lee

+TheBF · February 1, 2017

That's fantastic. Thanks Lee!

BF

+Lee Stewart · February 2, 2017

Here is the source code for fbForth 2.0:9: fbForth209.zip

...lee

+TheBF · February 6, 2017

Again a very sincere thanks Lee. I hope to post my work some time this year...

BF

Opry99er · February 9, 2017

Cooooooool!!!

More Forths!!!

Good luck, and we eagerly await your progress!

Willsy · February 12, 2017

This is fantastic! So the TF speed crown is toppled eh? Right then... There's only one thing for it! :-)

+TheBF · February 13, 2017

LOL. Well the crown is still safe IMHO. These benchmarks are a hair faster, but your screen driver being mostly code is still faster for sure.

However there are some things that might be simpler and maybe a little smaller in CAMEL99. For example here is my KEY word.

Instead of counting in code for cursor flash in my own code I just read the interrupt timer in CURS@ and return the cursor or a blank.

KEY? does KSCAN like normal and returns true if a key is pressed.

( VC! is just like C! but to VDP RAM.

VPOS returns the VDP screen address)

It has been fascinating to compare and contrast writing code words versus Forth to keep the size down.

CODE: CURS@   ( -- char)   \ return cursor or space for flashing cursor
              TOS PUSH,
             _CURSR @@ TOS MOV,         \ load the cursor char to TOS from memory
              8379 @@ W MOVB,           \ read interrupt timer to control flash speed
              W 1800 ANDI,              \ AND it with >18. This mask made a pleasant looking flash
              W W MOV,                  \ compare result to zero
              @@1 JEQ,                  \ if = 0 then we return the cursor char
              TOS BL LI,                \ otherwise return a blank (space)              \ 18 bytes
@@1:          NEXT,
              END-CODE

: KEY        ( -- char)
              BEGIN
                pause                      \ multi-tasking switch
                CURS@ VPOS VC!             \ show the flashing cursor
                KEY?
              UNTIL                       \ wait for key press
              KVAL C@                     \ fetch char from KSCAN buffer
              BL VPOS VC!  ;              \ erase cursor

And in looking over Turbo Forth there is some simple things you could do to speed it up a touch.

1. Place the code for EXIT immediately before NEXT so you don't branch to next after EXIT, but rather run it inline.

2. For the code in PAD ram: Use a JMP instruction to NEXT rather than BRANCH through a register. It's the same size but a 20% faster instruction.

So there is the beginning of version 2.x :-)

Have at it. I had to get back to work since I am self employed now and that damned cross compiler was taking all my time.

Will get back to it one day.

BF

+Lee Stewart · February 13, 2017

...

And in looking over Turbo Forth there is some simple things you could do to speed it up a touch.

1. Place the code for EXIT immediately before NEXT so you don't branch to next after EXIT, but rather run it inline.

2. For the code in PAD ram: Use a JMP instruction to NEXT rather than BRANCH through a register. It's the same size but a 20% faster instruction.

...

Presuming you are not talking about @Willsy's high-level Forth definition of NEXT (as in FOR...NEXT) and, unless I am missing something otherwise, the inner interpreter code in PAD RAM for EXIT cannot be placed before NEXT because that would interrupt the flow from the code that is already in front of NEXT. It can, however, repeat inline the three instructions at NEXT. That is precisely what TI Forth and fbForth do. Mark may not have enough PAD RAM real estate for that, though. The JMP would certainly be a good alternative.

...lee

+TheBF · February 13, 2017

"It can, however, repeat inline the three instructions at NEXT."

This is what I meant to communicate un-successfully. Thank you for clarifying.

From what I can see he puts EXIT in PAD RAM with a NEXT, macro after it. So it should be NO problem to just transplant it to just before the 3 instructions of NEXT.

"That is precisely what TI Forth and fbForth do."

Yes this is an old trick for ITC Forths from way back.

B

Edited February 13, 2017 by TheBF

+Lee Stewart · February 13, 2017

H-m-m-m...I am still not sure I understand what you mean by “with a NEXT, macro after it”. I would not call it a macro. It is copying the actual code “B *NEXT” or “B *R12”. I compared the ALC for the TurboForth inner interpreter with that of the fbForth inner interpreter in the spoiler below to be sure we are talking about the same thing:

; Reserved TurboForth Workspace Regisiters...

; PC ( R3) Interpretive Pointer

; STACK ( R4) Stack Pointer

; RSTACK ( R5) Return Stack Pointer

; W ( R6) Inner Interpreter Current word Pointer

; (referred to as R6 in code below,

; but is reserved for this purpose)

; NEXT (R12) Always points to _NEXT

; Inner Interpreter starts at >8320 (DOCOL)

DOCOL DECT RSTACK ; make space on return stack

MOV PC,*RSTACK ; save PC to return stack

MOV R6,PC ; place in PC and drop down to NEXT

; NEXT ...loads the next CFA and branches to the address in the CFA.

_NEXT MOV *PC+,R6 ; get CFA in r6

MOV *R6+,R7 ; get contents of CFA

B *R7 ; execute it

; EXIT ...exits from a FORTH high level word (i.e. a word entered with DOCOL)

EXIT DATA EXIT+2 ; called by NEXT, so needs a pointer

MOV *RSTACK+,PC ; place saved PC into PC & pop return stack

B *NEXT ; do next instruction

; or JMP _NEXT ; do next instruction

; or MOV *PC+,R6 ; get CFA in r6

; MOV *R6+,R7 ; get contents of CFA

; B *R7 ; execute it

*==============================================================

* Reserved fbForth (and TI Forth) Workspace Registers...

* U ( R8) <--points to base of User Variable area

* SP ( R9) <--Parameter Stack Pointer

* W (R10) <--Inner Interpreter current Word Pointer

* LINK (R11) <--LINKage for subroutines in CODE routines

* CRU (R12) <--Used for CRU instructions

* IP (R13) <--Interpretive Pointer

* R (R14) <--Return Stack Pointer

* NEXT (R15) <--Points to the NEXT instruction fetch routine

** Inner Interpreter starts at >832E (DODOES)

DODOES DECT SP push current word pointer (W)..

MOV W,*SP ..to parameter stack

MOV LINK,W make return address the new current word

DOCOL DECT R push next instruction pointer(IP)..

MOV IP,*R ..to return stack

MOV W,IP copy current word pointer to next instruction pointer

$NEXT MOV *IP+,W copy to W the cfa pointed to by IP and..

** ..increment IP to point to next instruction

DOEXEC MOV *W+,R1 copy to R1 the code address pointed to by W and..

** ..increment W to pfa of current word

B *R1 branch to code at address in R1

$SEMIS MOV *R+,IP pop previous IP from return stack

MOV *IP+,W copy to W the cfa pointed to by IP and..

** ..increment IP to point to next instruction

MOV *W+,R1 copy to R1 the code address pointed to by W and..

** ..increment W to pfa of current word

B *R1 branch to code at address in R1

* The TurboForth Inner Interpreter is much the same as fbForth's except that

* ++DODOES is in ROM in TurboForth (TF)

* ++TF ALC label EXIT is the pfa of TF Forth word EXIT (;S in fbForth)

* with ALC label EXIT+2 as the cfa

* ++fbForth ALC label $SEMIS is the cfa of fbForth Forth word ;S

* with fbForth ALC label SEMIS (with ALC for ;S in ROM) as the pfa

...lee

+TheBF · February 13, 2017

Sorry, I was using my own terminology without looking at the code.

I use Forth as a macro tool to do this kind of thing for NEXT which I am sure you are familiar with.

I guess it's not "technically" a macro, but it acts like one so that's what I call it.

: NEXT, *NEXT B, ;

Turbo Forth in the code you show is spelling out the Branch instruction in full after EXIT.

I think we are in violent agreement on placing _EXIT as shown below so that NEXT runs as inline code rather than branching to it through a register after EXIT.

I leave it to Willsy on how to incorporate the label _EXIT into a Forth word in his dictionary.

(I do it with a cross-compiler so it's kind of like cheating)

Am I getting it right yet?

_EXIT   MOV  *RSTACK+,PC             ; place saved PC into PC & pop return stack
; NEXT ...loads the next CFA and branches to the address in the CFA.
_NEXT  MOV  *PC+,R6                 ; get CFA in r6
       MOV  *R6+,R7                 ; get contents of CFA
       B    *R7                     ; execute it

+Lee Stewart · February 13, 2017

Agreed. I guess my difficulty is in the labeling. The actual “_NEXT” and the one you show must be different. So, if the one you show must be labeled, it should perhaps be something like “_NXT”. The reason is that the actual “_NEXT” already has fall-through code ahead of it in the form of “DOCOL”. Don't mind me—I am probably a bit OCD!

...lee

+TheBF · February 14, 2017

Ha, I completely missed that fall thru on Docol.

Don't apologize. I am clearly not OCD enough to be a great coder. (which is why I move to management) :-)

OK, So then the question is which one nets the most benefit? EXIT fall thru or DOCOL fall thru?

Any thoughts on that? I should probably create a profiler in the code I guess.

BF

+Lee Stewart · February 14, 2017

I would think it a tossup because virtually every word that starts with DOCOL ends with EXIT. Readability, perhaps, goes to DOCOL, but that is my bias to the old TI Forth code. Fastest is obviously inline code, but the JMP will likely be better for @Willsy. For your project, I would go with inline if you can spare the extra 4 bytes.

...lee

+TheBF · February 21, 2017

Quick update.

I already had EXIT fall thru to NEXT inline so I added inline NEXT to DOCOL. The resulting change in benchmark timing was not material when measured by hand,

I could say the times tended to look faster by 100ths of a second, but I would not commit to them. Makes we want to build a better timer.

So as they say:

"Theory and practice are the same... in theory... but not in practice"

BF

Forth with top of stack cached in a register

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members