Jump to content

Photo

Squeezing Forth for more performance with INLINE[ ]


6 replies to this topic

#1 TheBF OFFLINE  

TheBF

    Dragonstomper

  • 602 posts
  • Location:The Great White North

Posted Thu Aug 3, 2017 2:14 PM

When the Forth language first came onto the world stage it was implemented as something called an "indirect threaded code" language. This is a very clever way to make a language that fit a lot of stuff in a small space. You can see that with the big Forth systems that are running on the TI-99.  An interpreter, a compiler, a text editor and a virtual memory disk system typically can fit in 16K using this approach.
 
The secret to this code density is that every routine is identified by just one address.  The secret to making it run quickly is creating a little routine to read those addresses and do the code associated with them as fast as possible. On the TMS9900 that little routine is 3 instructions of assembly language so it's pretty fast. But as we have seen in the benchmarks with GCC and assembly language code there is a price penalty with threaded code that can range from 2.5 to as much as 10X or more in the worst cases. 
 
It turns out that ITC Forth spends about 50% of the time running those three instructions that read the address which sometimes us called the inner interpreter.  (as opposed to the "outer" text interpreter)
 
At the bottom of every Forth system are a pile of little routines written in assembler that do all the real work. They are simple things that read the contents of a memory address or add two numbers together. The code is all there and it is normally just called by the inner interpreter. Each routine ends with a call back to the inner interpreter.

Wouldn't it be handy if we could use all that code the way a conventional compiler does and copy the routines into memory all in row and eliminate the inner interpreter?
 
This little piece of Forth code does just that.  It reads a string of Forth words that are assembly language words and strings the code together as one long routine. This is not as space efficient, but it is about 2.4 times faster!
 
Here is an example that takes one number on the stack and performs a pile of operations on it. 
HEX
: OPTEST      \ mixed
          3000 0   
          DO                     
               AAAA 
               DUP SWAP OVER ROT DROP DUP AND 
               DUP OR DUP XOR 1+ 1- 2+  2- 2*
               2/ NEGATE ABS + 2* DROP
          LOOP  ;

Here it is with the operations compiled into a new routine called F(X)

The speed increased by 2.4 times vs ITC version
 
CAMEL99 ITC:   11.7 secs
CAMEL99 DTC:    9.1 secs
CAMEL99 F(X)     4.9 secs
HEX
CODE F(X)  ( n -- )
     INLINE[ DUP SWAP OVER ROT DROP DUP AND ]
     INLINE[ DUP OR DUP XOR 1+ 1- 2+  2- 2* ]
     INLINE[ 2/ NEGATE ABS + 2* DROP ]
     NEXT,
END-CODE

: ILTEST   
          3000 0
          DO
             AAAA  F(X)
          LOOP  ;

And here are the extensions to the language that let us do the magic.

With some small changes this can work just the same in Turbo-Forth or FB-Forth.

 

Spoiler

 


Edited by TheBF, Thu Aug 3, 2017 2:16 PM.


#2 Willsy OFFLINE  

Willsy

    River Patroller

  • 3,062 posts
  • Location:Uzbekistan (no, really!)

Posted Thu Aug 3, 2017 4:27 PM

Really nice. I think you may have written about this a couple of years back on CLF. Inspired by that I wrote one for TF. I'll have to see if I still have it.

Top work!

#3 TheBF OFFLINE  

TheBF

    Dragonstomper

  • Topic Starter
  • 602 posts
  • Location:The Great White North

Posted Thu Aug 3, 2017 4:58 PM

We were talking by email when you first got the disease ... ummm I mean got interested in writing a Forth system.
:-)

#4 Lee Stewart OFFLINE  

Lee Stewart

    River Patroller

  • 3,691 posts
  • Location:Silver Run, Maryland

Posted Thu Aug 3, 2017 7:55 PM

Very nice, indeed!   :)  A version of this would work quite nicely for fbForth 2.0 words in the user dictionary in RAM.  However, I am afraid it would not work very well for code (ALC) words in the fbForth 2.0 resident dictionary (RD).  Though the cfa and pfa of virtually every word in the RD are in ROM bank 0, their bodies are scattered over all four ROM banks, which would not yield to a linear copy of their machine language.

 

...lee



#5 TheBF OFFLINE  

TheBF

    Dragonstomper

  • Topic Starter
  • 602 posts
  • Location:The Great White North

Posted Thu Aug 3, 2017 8:31 PM

Ah yes. The dreaded bank switching.  I am living in my little utopian world of RAM.

 

Is the code for ASM: words in the kernel not contiguous for each word. 

ie:. Are there a lot words that BL to other words in another bank?



#6 Lee Stewart OFFLINE  

Lee Stewart

    River Patroller

  • 3,691 posts
  • Location:Silver Run, Maryland

Posted Thu Aug 3, 2017 9:26 PM

Ah yes. The dreaded bank switching.  I am living in my little utopian world of RAM.

 

Is the code for ASM: words in the kernel not contiguous for each word. 

ie:. Are there a lot words that BL to other words in another bank?

 

Yes.  In fact, there are words that jump to code in another word in the same bank and finish up in the jumped-to code because it is identical and saves space.

 

I think all of the words in your example are safe except NEGATE , which is MINUS in fbForth 2.0 (inherited from fig-Forth through TI Forth).

 

...lee



#7 TheBF OFFLINE  

TheBF

    Dragonstomper

  • Topic Starter
  • 602 posts
  • Location:The Great White North

Posted Wed Nov 29, 2017 1:57 PM

Finally go around to making INLINE[  ] work without the need to create a new CODE word.

 

This version relies on 2 new words.  ASM[   ]ASM.

By themselves these words let you insert inline assembly language inside a Forth definition, like you might see in a C compiler.

ASM[ simply compiles the addresses into the Forth list of addresses that make it look like a new CODE word.

(Notice why it's called "indirect threaded code". (ITC)  It take 2 addresses to point to a machine language word.)

 

ASM[ also turns off the compiler! 

Fun FACT: The Forth assembler interprets text.

The only compiler is the comma operator which "compiles" a number into memory.

 

 

]ASM is a macro that compiles 2 instructions after the inline machine code 

LI R9,HERE+4
B *R10

R9 is the Forth instruction pointer (IP). We need to advance IP past the machine code that we compiled into the definition

R10 holds the address of the NEXT routine, which is the 3 instruction Forth "inner"  interpreter in ITC Forth.

 

An example of what can be done with it is shown in this COMPARE routine that is written in Forth.

There are long strings of CODE WORDs in this routine that are wasting time between each word

by going through the inner interpreter.  When we inline[ ] the words inside the loop we get a 56% speed up.

There is no reason to inline the words outside the loop because the size will increase with little improvement.

HEX

: A$  S" THIS IS A LONG STRING FOR COMPARING TO ANOTHER" ;
: B$  S" THIS IS A LONG STRING FOR COMPARING TO ANOTHER TOO" ;


: COMPARE       ( addr1 u1 addr2 u2 --- diff )
\ Compare two strings. diff is negative if addr1 u1 is smaller, 0 if it
\ is equal and positive if it is greater than addr2 u2.
                ROT 2DUP - >R                    \ compute U2-U1, push to RSTACK.
                MIN DUP
                IF
                    >R                           \ loop counter push to RSTACK
                    BEGIN
                       OVER C@ OVER C@ -
                       IF  SWAP C@ SWAP C@ -
                           2R> 2DROP             \ clean Rstack
                           EXIT                  \ & get out of this word
                       THEN
                       CHAR+ SWAP CHAR+ SWAP     \ incr. addr2 & addr1
                       R> 1- DUP >R              \ get rstack loop counter, decr, copy, push back
                     0= UNTIL                    \ loop until loop counter=0
                    R>                           \ get the U2-U1 value
                THEN
                DROP 2DROP R> NEGATE ;

: COMPTEST   100 0 DO  A$ B$ COMPARE DROP LOOP ;   \ 9.4 seconds

: FCOMPARE       ( addr1 u1 addr2 u2 --- diff )
                ROT 2DUP - >R  MIN DUP
                IF
                   >R
                   BEGIN
                      INLINE[  OVER C@ OVER C@ - ]
                       IF  INLINE[ SWAP C@ SWAP C@ - 2R> 2DROP ]
                           EXIT
                       THEN
                         INLINE[ CHAR+ SWAP CHAR+ SWAP R> 1- DUP >R ]
                   0= UNTIL
                   R>
                THEN
                DROP 2DROP R> NEGATE ;

: FCOMPTEST   100 0 DO  A$ B$ FCOMPARE DROP LOOP ;   \ 6 seconds



In the new version I also removed some the word names and just put the code inside the definition of INLINE[.

This bad form in Forth which favours short definitions, but I wanted to save some space.

TI-99 is not the most generous environment.

 

New version is in the spoiler for those who are curious.

 

Spoiler

Edited by TheBF, Thu Nov 30, 2017 5:52 AM.





0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users