Jump to content

Photo

Forth with top of stack cached in a register

Ti-99 Forth compiler

17 replies to this topic

#1 TheBF OFFLINE  

TheBF

    Moonsweeper

  • 281 posts
  • Location:The Great White North

Posted Tue Jan 31, 2017 9:06 AM

After 30 years of wondering I finally got around to creating a Forth compiler for the TI-99 where the top of stack (TOS) is maintained in a register.

The literature said this would speed it up by about 10%.  I used a DOS Forth system to create the cross-compiler to the build the TI-99 compiler so it was painful debugging both ends at the same time. (old brain hurts)  I cross-compiled Brad Rodriguez's Camel Forth for the high level Forth words and wrote 105 Assembler primitives with hints here and there to the hard stuff from TI MSP430 Camel Forth and I had to look at Turbo Forth to help find a couple of gotchas with the 9900 instruction set.  Sincere thanks to Willsy and Brad.

 

Anyway the answer is in.,, kind of sort of.  Using Willys's excellent and highly optimized Turbo Forth as the benchmark for excellence I did a little comparison.

Turboforth uses the PAD RAM at >8300 to hold many simple code routines so they run very fast in that zero wait state memory.

To even begin to come close to Turbo Forth I found out I also had to put the Forth thread interpreter there along with branching and I stuck the literal run-time routine there as well. After that the only optimizing approach I used was this TOS thing

 

The TOS caching is a mixed blessing.  For routines that take one input on the stack and produce one output like 1+ 2+  2/ 2* @ C@ etc... it is about 40% faster. Very cool. For operations that take two inputs and generate one output or no output on the stack, ( ! C! + - * etc.) refilling the TOS can eat up all of the benefit on the 9900.

And for operators that need to make extra space on the stack for an output, the TMS9900 needs 2 instructions so they are actually slower because you have to push the TOS register onto the stack to make room for the new thing. (DUP  OVER etc.)

 

FYI:

- my empty DO/LOOP structure runs the same speed as Turbo Forth so the test is truly comparing the math operations.

- Tests were run on Classi99 emulator under Windows 10 64bits (my real iron is in a box with a defective 32K memory card)

 

Test 1 tests all the routines Turbo Forth has in PAD Ram and the others as well, so it's mixed.

 

Test 2 is head to head TOS vs PAD RAM optimization.

 

Test 3 is TOS vs Forth operators that have no PAD RAM optimization.

We can see in test 3 the we get about 8% improvement not 10%.

 

The surprise for me was test 2 because the speedup was not suppose to be as fast as zero wait state ram but it seems the combination of everything netted out

to the same result. Weird.

 

In many other ways Turbo Forth is still faster by virtue of hand coding so much of the internals, but this demonstrates the TOS on math operations.

 

Now I have to stop doing this for a while.  (addictions are hard to kick)

 

PS. I noticed I did not include NIP and TUCK but that's for another day.

 

PSS This means Turbo Forth 3.0 can be 8% faster.  Just one more re-write Willsy :-)

 

theBF

HEX
: OPTEST      \ mixed
          1000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
               AAAA   ( lit)     \  HSRAM       HSRAM
               DUP               \  TOS         HSRAM
               SWAP              \  TOS         HSRAM
               OVER              \  TOS         HSRAM
               ROT               \  TOS         --
               DROP              \  TOS         HSRAM
               DUP AND           \  TOS         --
               DUP OR            \  TOS         --
               DUP XOR           \  TOS         --
               1+                \  TOS         HSRAM
               1-                \  TOS         HSRAM
               2+                \  TOS         HSRAM
               2-                \  TOS         HSRAM
               2*                \  TOS         --
               2/                \  TOS         --
               NEGATE            \  TOS         --
               ABS               \  TOS         --
               +                 \  TOS         HSRAM
               2 *               \  TOS         HSRAM
               DROP
          LOOP  ;

\ CAMEL99:   4 5 secs
\ TurboForth 4.7 secs
\ (Empty DO/LOOP are same speed)

 : OPTEST2  \ only HSRAM VS TOS
          2000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
               AAAA   ( lit)     \  HSRAM       HSRAM
               DUP               \  TOS         HSRAM
               SWAP              \  TOS         HSRAM
               OVER              \  TOS         HSRAM
               DUP AND           \  TOS         HSRAM
               DUP OR            \  TOS         HSRAM
               1+                \  TOS         HSRAM
               1-                \  TOS         HSRAM
               2+                \  TOS         HSRAM
               2-                \  TOS         HSRAM
               +                 \  TOS         HSRAM
               2 *               \  TOS         HSRAM
               DROP              \  TOS         HSRAM
               DROP              \  TOS         HSRAM
          LOOP  ;

\ CAMEL99:   6.4 secs
\ TurboForth 6.4 secs

HEX
: OPTEST3   \  TOS versus conventional Parameter stack

          3000 0                 \ *OPTIMIZATION METHOD*
          DO                     \ CAMEL99    Turbo Forth
                                 \ ----------------------
              AAAA               \  HSRAM       HSRAM
              BBBB               \  HSRAM       HSRAM
              CCCC               \  HSRAM       HSRAM
               ROT               \  TOS         --
               AND               \  TOS         --
               OR                \  TOS         --
               DUP XOR           \  TOS         --
               2*                \  TOS         --
               2/                \  TOS         --
               NEGATE            \  TOS         --
               ABS               \  TOS         --
               DROP              \  TOS         --  
          LOOP  ;

\ CAMEL99:   7.5 secs
\ TurboForth 8.13 secs

Edited by TheBF, Tue Jan 31, 2017 9:11 AM.


#2 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Tue Jan 31, 2017 9:26 PM

I do not know whether my fbForth 2.0 source code would be of any particular use to you, but you are more than welcome to look at it.  I have not yet officially published it because I have not quite finished cleaning up the comments that might make it easier for someone other than myself to have a chance to understand it.  I am in the middle of updating the fbForth 2.0 Manual, anyway.  Willsy and I have worked a lot on and borrowed from each other’s Forth sources.  I am rambling—I will stop talking now...

 

...lee



#3 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Wed Feb 1, 2017 10:29 AM

Thanks Lee,

 

It would be very helpful.  My plan for this little Forth is to be a Camel Forth kernel in 8K (I hope) and then extend it with "INCLUDE"  using ANS Forth file words.

I have been reviewing the DRSLink code where I find it but it is still PFM to me. (pure friggin' magic).  Setting up a PAB and buffer is very simple but calling the TI system is pretty convoluted.

 

So having access to something that works for file access would be great. And if I remember TI Forth abstracted it out to system calls no? That would be perfect.

 

But don't rush because of me.

 

BF



#4 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Wed Feb 1, 2017 11:39 AM

Sometime today or tomorrow, I will put together a ZIP file with the current ALC source code and post it here.  In the meantime, you might be interested in my annotated version of TI’s original source code for TI Forth.  It is in “Appendix O” of my TI Forth Instruction Manual “2nd Edition 2013”, which is available in the Forth section of the pinned TI-99/4A development resources thread.  It was necessary for me to go through the annotation exercise before I had any hope of rewriting TI Forth to change from sector I/O for Forth blocks to file I/O and from there to hoist fbForth into cartridge space.

 

The DSRLNK code is identical to (or nearly so) to the copy in the Editor/Assembler (E/A) cartridge routine, which gets copied to low RAM when the E/A cartridge starts up.  Neither one of them uses the console DSRLNK routine, which allows for GROM DSRs and tape I/O.  Both TurboForth and fbForth 2.0 use the DSRLNK from Millers Graphics, which does call the console version.

 

...lee



#5 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Wed Feb 1, 2017 3:36 PM

That's fantastic. Thanks Lee!

 

BF



#6 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Wed Feb 1, 2017 8:38 PM

Here is the source code for fbForth 2.0:9:  Attached File  fbForth209.zip   184.57KB   13 downloads

 

...lee



#7 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Mon Feb 6, 2017 8:41 AM

Again a very sincere thanks Lee. I hope to post my work some time this year...

 

BF



#8 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 8,246 posts
  • Location:Cookeville, TN

Posted Thu Feb 9, 2017 10:26 AM

Cooooooool!!!

More Forths!!! ;)

Good luck, and we eagerly await your progress!

#9 Willsy OFFLINE  

Willsy

    River Patroller

  • 3,002 posts
  • Location:Uzbekistan (no, really!)

Posted Sun Feb 12, 2017 4:08 AM

This is fantastic! So the TF speed crown is toppled eh? Right then... There's only one thing for it! :-)

#10 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Sun Feb 12, 2017 9:09 PM

LOL.  Well the crown is still safe IMHO.  These benchmarks are a hair faster, but your screen driver being mostly code is still faster for sure.

 

However there are some things that might be simpler and maybe a little smaller in CAMEL99.  For example here is my KEY word.

Instead of counting in code for cursor flash in my own code I just read the interrupt timer in CURS@ and return the cursor or a blank.

KEY? does KSCAN like normal and returns true if a key is pressed.

 

( VC! is just like C! but to VDP RAM.

  VPOS returns the VDP screen address)

 

It has been fascinating to compare and contrast writing code words versus Forth to keep the size down.

CODE: CURS@   ( -- char)   \ return cursor or space for flashing cursor
              TOS PUSH,
             _CURSR @@ TOS MOV,         \ load the cursor char to TOS from memory
              8379 @@ W MOVB,           \ read interrupt timer to control flash speed
              W 1800 ANDI,              \ AND it with >18. This mask made a pleasant looking flash
              W W MOV,                  \ compare result to zero
              @@1 JEQ,                  \ if = 0 then we return the cursor char
              TOS BL LI,                \ otherwise return a blank (space)              \ 18 bytes
@@1:          NEXT,
              END-CODE

: KEY        ( -- char)
              BEGIN
                pause                      \ multi-tasking switch
                CURS@ VPOS VC!             \ show the flashing cursor
                KEY?
              UNTIL                       \ wait for key press
              KVAL C@                     \ fetch char from KSCAN buffer
              BL VPOS VC!  ;              \ erase cursor

And in looking over Turbo Forth there is some simple things you could do to speed it up a touch.

 

1. Place the code for EXIT immediately before NEXT so you don't branch to next after EXIT, but rather run it inline. 

2. For the code in PAD ram:  Use a JMP instruction to NEXT rather than BRANCH through a register. It's the same size but a 20% faster instruction.

 

So there is the beginning of version 2.x :-)

 

Have at it.  I had to get back to work since I am self employed now and that damned cross compiler was taking all my time. 

 

Will get back to it one day.

 

BF



#11 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Mon Feb 13, 2017 12:11 AM

...

And in looking over Turbo Forth there is some simple things you could do to speed it up a touch.

 

1. Place the code for EXIT immediately before NEXT so you don't branch to next after EXIT, but rather run it inline. 

2. For the code in PAD ram:  Use a JMP instruction to NEXT rather than BRANCH through a register. It's the same size but a 20% faster instruction.

...

 

Presuming you are not talking about Willsy's high-level Forth definition of NEXT (as in FOR...NEXT) and, unless I am missing something otherwise, the inner interpreter code in PAD RAM for EXIT cannot be placed before NEXT because that would interrupt the flow from the code that is already in front of NEXT.  It can, however, repeat inline the three instructions at NEXT.  That is precisely what TI Forth and fbForth do.  Mark may not have enough PAD RAM real estate for that, though.  The JMP would certainly be a good alternative.

 

...lee



#12 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Mon Feb 13, 2017 10:41 AM

"It can, however, repeat inline the three instructions at NEXT."

 

This is what I meant to communicate un-successfully.  Thank you for clarifying.

 

From what I can see he puts EXIT in PAD RAM with a NEXT, macro after it. So it should be NO problem to just transplant it to just before the 3 instructions of NEXT.

 

"That is precisely what TI Forth and fbForth do."

Yes this is an old trick for ITC Forths from way back.

 

B


Edited by TheBF, Mon Feb 13, 2017 10:52 AM.


#13 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Mon Feb 13, 2017 1:22 PM

H-m-m-m...I am still not sure I understand what you mean by “with a NEXT, macro after it”.  I would not call it a macro.  It is copying the actual code “B  *NEXT” or “B  *R12”.  I compared the ALC for the TurboForth inner interpreter with that of the fbForth inner interpreter in the spoiler below to be sure we are talking about the same thing:

 

Spoiler

 

...lee



#14 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Mon Feb 13, 2017 4:03 PM

Sorry, I was using my own terminology without looking at the code.

I use Forth as a macro tool to do this kind of thing for NEXT which I am sure you are familiar with.

I guess it's not "technically" a macro, but it acts like one so that's what I call it.

 

: NEXT,     *NEXT  B,  ; 

 

Turbo Forth in the code you show is spelling out the Branch instruction in full after EXIT.

 

I think we are in violent agreement on placing _EXIT as shown below so that NEXT runs as inline code rather than branching to it through a register after EXIT.

I leave it to Willsy on how to incorporate the label _EXIT into a Forth word in his dictionary.

(I do it with a cross-compiler so it's kind of like cheating)

 

Am I getting it right yet?

_EXIT   MOV  *RSTACK+,PC             ; place saved PC into PC & pop return stack
; NEXT ...loads the next CFA and branches to the address in the CFA.
_NEXT  MOV  *PC+,R6                 ; get CFA in r6
       MOV  *R6+,R7                 ; get contents of CFA
       B    *R7                     ; execute it


#15 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Mon Feb 13, 2017 5:31 PM

Agreed.  I guess my difficulty is in the labeling. The actual “_NEXT” and the one you show must be different. So, if the one you show must be labeled, it should perhaps be something like “_NXT”.  The reason is that the actual “_NEXT” already has fall-through code ahead of it in the form of “DOCOL”.  Don't mind me—I am probably a bit OCD!

 

...lee



#16 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Mon Feb 13, 2017 7:39 PM

Ha, I completely missed that fall thru on Docol.

 

Don't apologize.  I am clearly not OCD enough to be a great coder.   (which is why I move to management) :-)

 

OK, So then the question is which one nets the most benefit?  EXIT fall thru or DOCOL fall thru?

 

Any thoughts on that? I should probably create a profiler in the code I guess.

 

BF



#17 Lee Stewart ONLINE  

Lee Stewart

    River Patroller

  • 3,262 posts
  • Location:Silver Run, Maryland

Posted Mon Feb 13, 2017 8:41 PM

I would think it a tossup because virtually every word that starts with DOCOL ends with EXIT.  Readability, perhaps, goes to DOCOL, but that is my bias to the old TI Forth code.  Fastest is obviously inline code, but the JMP will likely be better for Willsy.  For your project, I would go with inline if you can spare the extra 4 bytes.

 

...lee



#18 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 281 posts
  • Location:The Great White North

Posted Mon Feb 20, 2017 11:00 PM

Quick update. 

 

I already had EXIT fall thru to NEXT inline so I added inline NEXT to DOCOL. The resulting change in benchmark timing was not material when measured by hand,

I could say the times tended to look faster by 100ths of a second, but I would not commit to them.  Makes we want to build a better timer. 

 

So as they say:

 

"Theory and practice are the same... in theory... but not in practice"

 

BF







Also tagged with one or more of these keywords: Ti-99, Forth, compiler

0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users