Jump to content
IGNORED

Machine Forth OMG


TheBF

Recommended Posts

So before anybody says "Are you nuts?" let me explain what happened in chronological order.

 

  1. I wrote a cross-compiler to create a TI-99 Forth kernel to see if I could make it faster.
  2. That succeeded somewhat giving about 8% improvement over conventional methods. No biggy. ( now it's all hard work to make it into something practical)
  3. I made the mistake of reading the multi year posts by "Insomonia" here about porting GCC to TI-99 and saw the resulting code output.
  4. I don't want to say that I'm very competitive but... it was much faster code than threaded Forth. Dam!
  5. After seeing the C compiler output I thought hey!... I could make my cross-compiler do something like that. The C compiler is making asm code and creating a return stack for sub-routines and local variables.

I started down the road of making a sub-routine threaded Forth system and realized I didn't have to go to all that trouble.

​Many years ago the inventor of Forth, Chuck Moore, abandoned threaded code which essentially is; make a program that is lists of addresses, read the list of addresses and jump to each address to run the code. So threaded code has some overhead.

 

Chuck began working in something he called "machine Forth" which is so simple it's crazy.

 

The theory is why make a fancy compiler when all it's doing is putting some numbers in memory.

​Chuck took his Forth Assembler and made Forth words that simply put the correct code into memory something like macros. Many times a Forth word translates to just 1 machine instruction so why jump to it or call it like a sub-routine? When Machine Forth reads a word, it just puts the correct code into the next available memory location and waits for the next word.

 

If it's big pile of code for that word, call it like a sub-routine. If it's small just put it in memory as is.

 

Example:

Forth '+' removes 2 numbers from the stack, adds them together and puts the answer back on the stack. ​If we keep the top of the stack in a register for convenience here is the Assembler code to do '+':

 

A *SP+,TOS

 

So when I type '+' in my program it puts one 16 bit integer (the machine code) into memory. ​That's ALL it knows how to do.

 

And many Forth words are that simple. Others are 2 or 3 instructions.

 

So I expanded my Forth Assembler to do just that and in general things run between

2 times to up to 7 times FASTER than CAMEL99 Forth in my early testing.

 

OMG

 

Machine Forth lets you decide if you want to insert the code inline for speed or call it to save some space. You are essentially using little assembler routines but it looks like Forth program text.

 

One other thing that ticked me off was that GCC was able to allocate free registers in the CPU as variables. That's is ideal for making faster code. DAM!

 

So I have created a new variable called a LOCAL, which is just a register that is automatically allocated for you. You have 6 of them and you create them like this:

 

: MYSUB

LOCAL X LOCAL Y LOCAL Z

( code goes here... )

 

;

 

 

Locals only work in the current sub-routine and are freed up after your sub-routine completes.

I might even add an "infix" evaluator so you don't have to use reverse Polish notation but that's vapour-ware at the moment.

 

It's still a big work in progress but I think I have jumped through the looking glass...

 

Where's that damned cat?

 

theBF

Edited by TheBF
  • Like 8
Link to comment
Share on other sites

This is probably not exactly what you are talking about re Forth Assembler, but a lot of fbForth 2.0 and TurboForth is ALC, e.g.,

 

A *SP+,*SP
B *NEXT
The above code does not include the word’s header, but the CFA of + points to the above code, which is pretty much what you suggested.
As Mark said, I will be watching your progress with interest.

 

...lee

  • Like 1
Link to comment
Share on other sites

Ok so this is all very preliminary meaning yesterday and late into the night.

 

Here is a test program. Excuse all the verbiage, but it's how I control the assembler creating the binary EA5 file.

HEX

CROSS-ASSEMBLING
START. NEW.
A000 ABSOLUTE ORIGIN.
TI-99.EA5
FILENAME: DOLOOPDM

include cc9900\mforth99\forthvm.lib
include cc9900\mforth99\loops.lib

TARGET-COMPILING           \ name space set for TARGET words

VARIABLE: X

PROGRAM:                   \ sets entry address for
        FORTHVM            \ sets up workspace and stacks
        FFFF LIT, 0 LIT,   \ compile literal numbers
        DO                 \ you know...
          X 1+!            \  "   "
        LOOP               \  "   "
        BYE,               \ the comma prevents a name conflict
END.                       \ does some binary error checks on target image

CROSS-COMPILING                 \ UTILITY vocabulary
CR ." Compiled " FILENAME$ $.
CR ." Saving..." FILENAME$ $SAVE-EA5.

All the primitive Forth words like 1+! are just macros:

 

: 1+! *TOS INC, TOS POP, ;

 

and are kept in another .LIB file that is pre-loaded for convenience.

Since they only generate code when they are invoked it doesn't matter.

The cross-compiler has lots of room to hold them.

 

The includes in the program compile this code into the host Forth system.

\ FORTHVM.LIB virtual machine setup for MForth99
: FORTHVM
        8300 LWPI,            \ create the Forth VM
        SP 2080 LI,
        RP 2100 LI,
        ;

\ loops.lib
CROSS-COMPILING
\ branch calculators for MACHINE code words
: <BACK,    there - 2/ 00ff and there 1- tc! ;
: RESOLVE,  there over - 2- 2/ swap 1+ tc! ;
: AHEAD,    there 2- swap ;

: IF         CJMP THERE 2- 42 ;
: ENDIF      RESOLVE,   ;
: ELSE       0  CJMP AHEAD,  RESOLVE,  ;
: BEGIN      THERE  ;
: UNTIL     CJMP <BACK, ;
: AGAIN     0 CJMP <BACK, ;
: REPEAT
            >r >r 0 CJMP <BACK,
            r> r> 2- RESOLVE, ;
: WHILE   COMPILE IF 2+ ;
: RDROP,   RP INCT, ;

\ LOOPS  in machine Forth use a register for loops indices
\ ( R8 is renamed to 'L')
\ This is per Chuck Moore

: DO ( limit index -- )
      *SP+ W MOV,       \ pop the limit into W
       W RPUSH,         \ push Working reg.
       L RPUSH,         \ push loop index register to rstack
       TOS L MOV,       \ move TOS to index register
       TOS POP,         \ refill TOS
       BEGIN
       ;

: LOOP
       L INC,           \ inc the loop index register
       L *RP CMP,       \ comp limit to top of Rstack
       EQ UNTIL         \ until they are the same
       RP 4 ADDI,       \ collapse stack frame
       ;

And here is the resulting assembled code: 60 bytes no header. 4.1 seconds.

 

Equivalent program in ITC Forth is 48 bytes with the header so it's much smaller with no header. 14.7 seconds

   A012  0649  dect R9                     (14)
   A014  C646  mov  R6,*R9                 (30)
   A016  0206  li   R6,>ffff               (20)
         FFFF
   A01A  0649  dect R9                     (14)
   A01C  C646  mov  R6,*R9                 (30)
   A01E  0206  li   R6,>0000               (20)
         0000
   A022  C1F9  mov  *R9+,R7                (30)
   A024  064A  dect R10                    (14)
   A026  C687  mov  R7,*R10                (30)
   A028  064A  dect R10                    (14)
   A02A  C688  mov  R8,*R10                (30)
   A02C  C206  mov  R6,R8                  (18)
   A02E  C1B9  mov  *R9+,R6                (30)
   A030  0649  dect R9                     (14)
   A032  C646  mov  R6,*R9                 (30)
   A034  0206  li   R6,>a004               (20)
         A004
>  A038  0596  inc  *R6                   
   A03A  C1B9  mov  *R9+,R6               
   A03C  0588  inc  R8                    
   A03E  8688  c    R8,*R10               
   A040  16F7  jne  >a030                 
   A042  022A  ai   R10,>0004             
         0004
   A046  0300  limi >0002                 
         0002
   A04A  02E0  lwpi >83e0                 
         83E0
   A04E  0420  blwp @>0000                
         0000

My approach is that all .LIB code is a macro and will compile INLINE code if invoked.

For big code words we use machine forth ':' which compiles the code to memory

with a CALL, preamble and end the definition with machine Forth ';' which adds the RT, instruction.

 

So this would be valid with my current MFPRIMS.LIB

 

: CMOVE ( src dst cnt -- ) CMOVE, ; \ CMOVE, compiles all the code into memory

 

That's the plan. This could be added easily to a BLOCK based system where the LIB

ASM macro code just exists out in virtual memory on BLOCKs.

 

BF

  • Like 2
Link to comment
Share on other sites

LOL. Didn't test that do loop, but it worked fine for FFFF iterations. DUH!

 

I went back to Laxen & Perry style and made (DO) a callable sub-routine because it's kind of big.

 

So it looks like this now.

SUB: (DO) ( limit index -- ) \ compile a callable sub-routine in TARGET memory
       R0  8000 LI,      \ load "fudge factor" to LIMIT
      *SP+  R0 SUB,      \ LIMIT, compute 8000h-limit "fudge factor"
       R0  TOS ADD,      \ loop ctr = index+fudge
       R0  RPUSH,        \ rpush limit
       L RPUSH,
       TOS L MOV,        \ rpush index
       TOS POP,          \ refill TOS
;SUB

\ this is the high level WORD that branches to the sub-routine
\ We have to use BL, and not CALL, because (DO) is putting things on the return stack!!!
: DO   ( -- )   (DO) @@ BL,  BEGIN  ;


: LOOP            
       L INC,           \ increment loop
       OO UNTIL         \ if no overflow then loop again
       L RPOP,          \ restore L register
       RDROP,           \ RDROP the limit value
;

Currently all the branching uses short jumps. That will need fixing eventually.

 

*EDIT*

 

BY the way I notice that this DO/LOOP does nothing with LIMIT value after the calculation except push it onto the stack.

Why keep it? I am going to remove it.

 

B

Edited by TheBF
Link to comment
Share on other sites

Remember this?

PROGRAM:  
          FORTHVM
          1000 lit, 0 LIT,
          DO
              AAAA lit,
               DUP
               SWAP
               OVER
               ROT
               DROP
               DUP AND
               DUP OR
               DUP XOR
               1+
               1-
               2+
               2-
               2*
               2/
               NEGATE
               ABS
               +
               2 lit, *
              DROP
         LOOP
       bye,
END.

My log showed this: *corrected*

CAMEL99: 4.5 secs
TurboForth 4.6 secs

 

Machine Forth.... 1.8 secs! A speed-up factor of 2.55x

 

I like this.

 

BF

Edited by TheBF
Link to comment
Share on other sites

H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.):

 

 

 

HEX
ASM: 2*
*SP R1 MOV,
R1 1 SLA,
R1 *SP MOV,
;ASM
ASM: 2/
*SP R1 MOV,
R1 1 SRL,
R1 *SP MOV,
;ASM
: XXX
1000 0
DO
AAAA
DUP
SWAP
OVER
ROT
DROP
DUP AND
DUP OR
DUP XOR
1+
1-
2+
2-
2*
2/
MINUS
ABS
+
2 *
DROP
LOOP
;
DECIMAL

 

 

 

...lee

Link to comment
Share on other sites

The only disadvantage I can see is code size.

 

Yes it will be a challenge to keep size under control. The balance is that in my current system there are no headers because it is compiled offline from the TI-99 like GCC compile 'C'.

And some of the primitives are quite small. In fact, you can make them even smaller in some cases by not keeping TOS in a register, with a small speed penalty.

 

BF

Link to comment
Share on other sites

H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.):

 

 

 

HEX
ASM: 2*
*SP R1 MOV,
R1 1 SLA,
R1 *SP MOV,
;ASM
ASM: 2/
*SP R1 MOV,
R1 1 SRL,
R1 *SP MOV,
;ASM
: XXX
1000 0
DO
AAAA
DUP
SWAP
OVER
ROT
DROP
DUP AND
DUP OR
DUP XOR
1+
1-
2+
2-
2*
2/
MINUS
ABS
+
2 *
DROP
LOOP
;
DECIMAL

 

 

 

...lee

 

My bad. Mea maxima culpa.

 

That time of 18 secs. was for the test above in my log, that tested UM*.

 

I just re-timed the OPTEST and it was 4.7 second in Camel99 Forth.

 

So the Machine Forth speedup is only 2.6 x CAMEL99

 

And a speed up of 3.6 x vs FBForth.

 

Still a good improvement.

 

B

Link to comment
Share on other sites

H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.):

 

 

 

HEX

ASM: 2*

*SP R1 MOV,

R1 1 SLA,

R1 *SP MOV,

;ASM

ASM: 2/

*SP R1 MOV,

R1 1 SRL,

R1 *SP MOV,

;ASM

: XXX

1000 0

DO

AAAA

DUP

SWAP

OVER

ROT

DROP

DUP AND

DUP OR

DUP XOR

1+

1-

2+

2-

2*

2/

MINUS

ABS

+

2 *

DROP

LOOP

;

DECIMAL

 

 

...lee

 

 

According to my experiments, TF takes just a smidge over 1 second (running in classic99 on a 12 year old Toshiba Tecra M5 laptop). Maybe 1.2 seconds, something like that.

: test 1000 0
          DO
              $AAAA
               DUP
               SWAP
               OVER
               ROT
               DROP
               DUP AND
               DUP OR
               DUP XOR
               1+
               1-
               2+
               2-
               2*
               2/
               NEGATE
               ABS
               +
               2 *
              DROP
         LOOP ;

Bruce/Lee (Ha! Bruce Lee, geddit???) please try TF on your systems if possible.

 

Thanks

 

M

Edited by Willsy
Link to comment
Share on other sites

 

According to my experiments, TF takes just a smidge over 1 second (running in classic99 on a 12 year old Toshiba Tecra M5 laptop). Maybe 1.2 seconds, something like that.

...

 

Bruce/Lee (Ha! Bruce Lee, geddit???) please try TF on your systems if possible.

Thanks

 

M

 

Hah! :P Actually, TF takes 4.096 times that for >1000, which is on the order of Camel99’s speed.

 

...lee

Link to comment
Share on other sites

Ah! it's hex 1000 :dunce:

 

About 4.5 secs...

 

Yes it is HEX 1000 . Turboforth and Camel99 do it at about the same speed.

 

How do you get those ms timing value LEE?

 

I am doing it by the "Armstrong method" (manually) with my cell phone.

 

B

Link to comment
Share on other sites

So although machine Forth is nice to play with I am still longing to create a threaded system that uses the BL instruction based calling system with the return stack keeping old R11 values.

I know it will get bigger than ITC so what about "indirect sub-routine threading" Where we keep the Forth instruction pointer but DOCOL is looks more like:

 

( I am making this up here)

MOV R11 *RP

MOV *IP+ W

BL *W

 

 

I don't have the details clear in my head but does it inspire any thoughts from you guys?

 

BF

 

*EDIT*

After looking at numbers, the benefit of using BL and return stack to CALL and BL *R11 is smaller than I thought but it is still faster

Calling would be 5% slower, but returning is 65% faster than ITC NEXT.

So the real benefit of such a system depends on careful use of inlining the small primitives where calling offers no size improvements.

Edited by TheBF
Link to comment
Share on other sites

Here's one I did earlier. In 2012, no less!

 

Just experimenting.

 

 

 
;: PRIMES ( n - )
;  2 . 3 .
;  5 DO
;    DUP DUP * I < 
;    IF 1+ THEN 
;    1 OVER 1+ 3 DO
;        J I MOD 0= 
;        IF 1- LEAVE THEN
;    2 +LOOP
;    IF I . THEN 
;  2 +LOOP 
;
 
        def START
 
        aorg >a000
 
stack   equ r14
rstack  equ r15
 
START   li r14,stackb               ; set up data stack
        li r15,rstackb              ; set up return stack
        
primes  bl @lit                     ; push 5000
        data 5000
        bl @lit2                    ; push 2
        bl @swap                    ; SWAP
        
        bl @lit2                    ; push 2
        bl @dot                     ; write to memory
        
        bl @lit                     ; push 3
        data 3          
        bl @dot                     ; write to memory
        
        bl @lit                     ; push 5
        data 5
 
        bl @do                      ; DO
do1         mov *stack+,*stack      ; dup
            mov *stack+,*stack      ; dup
            bl @mult                ; *
            bl @geti                ; I
            bl @lessth              ; <
            bl @iffalse             ; 0BRANCH
            data xxxx
            inc *stack              ; 1+
xxxx        bl @lit1                ; push 1
            bl @over                ; OVER
            inc *stack              ; 1+
            bl @lit                 ; push 3
            data 3
            bl @do                  ; DO
do2             ; bl @getj
                inct stack              ; J
                mov @-6(rstack),*stack
                ; bl @geti
                inct stack              ; I
                mov @-2(rstack),*stack
                bl @mod             ; MOD
                bl @eq0             ; 0=
                bl @iffalse         ; 0BRANCH
                data yyyy
                dec *stack          ; drop
                bl @leave           ; LEAVE
                data gggg
yyyy        bl @lit2                ; push 2
            bl @ploop               ; +LOOP
            data do2
gggg        bl @iffalse             ; 0BRANCH
            data zzzz
            bl @geti                ; I
            bl @dot                 ; write to memory
zzzz    bl @lit2                    ; push 2
        bl @ploop                   ; +LOOP
        data do1
lock    jmp lock                    ; stop
 
; --------------------------------------
do      inct rstack
        mov *stack,*rstack+     ; move start value to rstack
        dect stack
        mov *stack,*rstack      ; move max value to rstack
        dect stack
        rt
 
; --------------------------------------
ploop   a *stack,@-2(rstack)
        dect stack
        c @-2(rstack),*rstack
        jhe doexit
        mov *r11,r11
        rt
doexit  ai rstack,-4
        inct r11
        rt
        
 
; --------------------------------------
leave   ai rstack,-4
        mov *r11,r11
        rt
 
; --------------------------------------
geti    inct stack
        mov @-2(rstack),*stack
        rt
 
; --------------------------------------
getj    inct stack
        mov @-6(rstack),*stack
        rt
 
; --------------------------------------
lit     inct stack
        mov *r11+,*stack
        rt
 
; --------------------------------------
lit1    li r0,1
        inct stack
        mov r0,*stack
        rt
 
; --------------------------------------
lit2    li r0,2
        inct stack
        mov r0,*stack
        rt
 
; --------------------------------------
swap    mov *stack,r0
        mov @-2(stack),*stack
        mov r0,@-2(stack)
        rt
 
; --------------------------------------
dot     mov @here,r0
        mov *stack,*r0+
        mov r0,@here
        dect stack
        rt
 
; --------------------------------------
dup     mov *stack+,*stack
        rt
 
; --------------------------------------        
mult    mov @-2(stack),r0
        mpy *stack,r0
        dect stack
        mov r1,*stack
        rt
        
; --------------------------------------
lessth  c @-2(stack),*stack
        jlt islt
        dect stack
        clr *stack
        rt
islt    dect stack
        seto *stack
        rt
 
; --------------------------------------
iffalse mov *stack,r0
        jeq dojmp
        dect stack
        inct r11
        rt
dojmp   dect stack
        mov *r11,r11
        rt
        
; --------------------------------------
over    inct stack
        mov @-4(stack),*stack
        rt
 
; --------------------------------------
; X Y -- X mod Y
mod     clr r1
        mov *stack,r0   ; y
        dect stack      ; point to x
        mov *stack,r2   ; x
        div r0,r1       ; remainder in r2
        mov r2,*stack   ; place remainder
        rt
 
; --------------------------------------
eq0     mov *stack,r0
        jeq eq0xit      ; if equal to 0 leave a TRUE
        clr *stack      ; otherwise place a FALSE
        rt
eq0xit  seto *stack
        rt
 
; --------------------------------------
tof     data 0
stackb  bss 20
rstackb bss 20
here    DATA $+2
 
        end
       
Link to comment
Share on other sites

Here's another one. This is a sort of optimised native assembler output.

 

 

 
; : test 0 1+ dup 65535 = 0branch [ -12 , ] ;
: Easily compilable version of the above (since TF V1.2 uses absolute branches):
: test 0 [ here ] 1+ dup 65535 = 0branch [ , ] ;
 
; itc (tf v1.2):                        14 sec
; direct assembly:                      13 sec
; direct assembly with inline 1+ & DUP: 11 sec
; as above with stack to higher address:10 SEC  (dup becomes faster)
 
 
 
        aorg >a000
dstack equ 4            ; data stack register is R4
 
        DEF START
 
START   lwpi >8300
        li r4,stack     ; stack pointer - stack grows to lower addresses
 
    
        bl @lit         ; lit 0
        data 0
        
loop    inc *dstack         ; 1+        
        mov *dstack+,*dstack    ; DUP
        
        bl @lit
        data >ffff
        
        bl @equal
        bl @zbranch
        data loop
lock    jmp lock
 
;[
lit     inct dstack         ; stack
        mov *r11+,*dstack   ; push literal to stack
        rt
;]
        
;[
plus1   inc *dstack
        rt
;]        
        
;[        
equal   c *dstack,@-2(dstack)
        jeq eq
        clr @tof
        dect dstack
        dect dstack
        rt
eq      seto @tof
        dect dstack
        dect dstack
        rt
;]
 
        
;[
zbranch mov *r11+,r0
        mov @tof,r1
        jne neq
        b *r0
neq     rt
;]
 
tof     data 0
 
stack   bes 100
 
        end
       
Link to comment
Share on other sites

... How do you get those ms timing value LEE?

 

I am doing it by the "Armstrong method" (manually) with my cell phone.

 

B

 

Same here—though I am not sure what you mean by “ms timing value”. If you mean the 4.096 value, that was just >1000/1000 = 4096/1000 = 4.096, which should be the comparison factor for how Mark first ran the program in TF.

 

...lee

Link to comment
Share on other sites

So although machine Forth is nice to play with I am still longing to create a threaded system that uses the BL instruction based calling system with the return stack keeping old R11 values.

I know it will get bigger than ITC so what about "indirect sub-routine threading" Where we keep the Forth instruction pointer but DOCOL is looks more like:

 

( I am making this up here)

MOV R11 *RP

MOV *IP+ W

BL *W

 

 

I don't have the details clear in my head but does it inspire any thoughts from you guys?

 

BF

 

I think direct threaded might be better suited to the 9900:

 

Let R0 be instruction pointer. So, you'd have:

 

 

next:
    b *r0+

 

That's it.

 

High level and low-level Forth words would appear the same. No difference. You'd still have DOCOL and EXIT but they would just be the addresses of the routines, just like DUP etc.

 

There would be no CFA. The first cell of a definition is the address (direct address) of some executable code.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...