+TheBF Posted March 10, 2017 Share Posted March 10, 2017 (edited) So before anybody says "Are you nuts?" let me explain what happened in chronological order. I wrote a cross-compiler to create a TI-99 Forth kernel to see if I could make it faster. That succeeded somewhat giving about 8% improvement over conventional methods. No biggy. ( now it's all hard work to make it into something practical) I made the mistake of reading the multi year posts by "Insomonia" here about porting GCC to TI-99 and saw the resulting code output. I don't want to say that I'm very competitive but... it was much faster code than threaded Forth. Dam! After seeing the C compiler output I thought hey!... I could make my cross-compiler do something like that. The C compiler is making asm code and creating a return stack for sub-routines and local variables. I started down the road of making a sub-routine threaded Forth system and realized I didn't have to go to all that trouble.Many years ago the inventor of Forth, Chuck Moore, abandoned threaded code which essentially is; make a program that is lists of addresses, read the list of addresses and jump to each address to run the code. So threaded code has some overhead. Chuck began working in something he called "machine Forth" which is so simple it's crazy. The theory is why make a fancy compiler when all it's doing is putting some numbers in memory. Chuck took his Forth Assembler and made Forth words that simply put the correct code into memory something like macros. Many times a Forth word translates to just 1 machine instruction so why jump to it or call it like a sub-routine? When Machine Forth reads a word, it just puts the correct code into the next available memory location and waits for the next word. If it's big pile of code for that word, call it like a sub-routine. If it's small just put it in memory as is. Example: Forth '+' removes 2 numbers from the stack, adds them together and puts the answer back on the stack. If we keep the top of the stack in a register for convenience here is the Assembler code to do '+': A *SP+,TOS So when I type '+' in my program it puts one 16 bit integer (the machine code) into memory. That's ALL it knows how to do. And many Forth words are that simple. Others are 2 or 3 instructions. So I expanded my Forth Assembler to do just that and in general things run between 2 times to up to 7 times FASTER than CAMEL99 Forth in my early testing. OMG Machine Forth lets you decide if you want to insert the code inline for speed or call it to save some space. You are essentially using little assembler routines but it looks like Forth program text. One other thing that ticked me off was that GCC was able to allocate free registers in the CPU as variables. That's is ideal for making faster code. DAM! So I have created a new variable called a LOCAL, which is just a register that is automatically allocated for you. You have 6 of them and you create them like this: : MYSUB LOCAL X LOCAL Y LOCAL Z ( code goes here... ) ; Locals only work in the current sub-routine and are freed up after your sub-routine completes. I might even add an "infix" evaluator so you don't have to use reverse Polish notation but that's vapour-ware at the moment. It's still a big work in progress but I think I have jumped through the looking glass... Where's that damned cat? theBF Edited March 10, 2017 by TheBF 8 Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 10, 2017 Share Posted March 10, 2017 Ha! I've been looking at the exact same thing with some of the same ideas! Looks like you might save me the trouble! Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 10, 2017 Share Posted March 10, 2017 There may be other optimisations that you can implement with a peep-hole optimizer. For example: 12 8 + can be compiled as LIT 20 Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 10, 2017 Share Posted March 10, 2017 This is probably not exactly what you are talking about re Forth Assembler, but a lot of fbForth 2.0 and TurboForth is ALC, e.g., A *SP+,*SP B *NEXT The above code does not include the word’s header, but the CFA of + points to the above code, which is pretty much what you suggested. As Mark said, I will be watching your progress with interest. ...lee 1 Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 10, 2017 Author Share Posted March 10, 2017 Ok so this is all very preliminary meaning yesterday and late into the night. Here is a test program. Excuse all the verbiage, but it's how I control the assembler creating the binary EA5 file. HEX CROSS-ASSEMBLING START. NEW. A000 ABSOLUTE ORIGIN. TI-99.EA5 FILENAME: DOLOOPDM include cc9900\mforth99\forthvm.lib include cc9900\mforth99\loops.lib TARGET-COMPILING \ name space set for TARGET words VARIABLE: X PROGRAM: \ sets entry address for FORTHVM \ sets up workspace and stacks FFFF LIT, 0 LIT, \ compile literal numbers DO \ you know... X 1+! \ " " LOOP \ " " BYE, \ the comma prevents a name conflict END. \ does some binary error checks on target image CROSS-COMPILING \ UTILITY vocabulary CR ." Compiled " FILENAME$ $. CR ." Saving..." FILENAME$ $SAVE-EA5. All the primitive Forth words like 1+! are just macros: : 1+! *TOS INC, TOS POP, ; and are kept in another .LIB file that is pre-loaded for convenience. Since they only generate code when they are invoked it doesn't matter. The cross-compiler has lots of room to hold them. The includes in the program compile this code into the host Forth system. \ FORTHVM.LIB virtual machine setup for MForth99 : FORTHVM 8300 LWPI, \ create the Forth VM SP 2080 LI, RP 2100 LI, ; \ loops.lib CROSS-COMPILING \ branch calculators for MACHINE code words : <BACK, there - 2/ 00ff and there 1- tc! ; : RESOLVE, there over - 2- 2/ swap 1+ tc! ; : AHEAD, there 2- swap ; : IF CJMP THERE 2- 42 ; : ENDIF RESOLVE, ; : ELSE 0 CJMP AHEAD, RESOLVE, ; : BEGIN THERE ; : UNTIL CJMP <BACK, ; : AGAIN 0 CJMP <BACK, ; : REPEAT >r >r 0 CJMP <BACK, r> r> 2- RESOLVE, ; : WHILE COMPILE IF 2+ ; : RDROP, RP INCT, ; \ LOOPS in machine Forth use a register for loops indices \ ( R8 is renamed to 'L') \ This is per Chuck Moore : DO ( limit index -- ) *SP+ W MOV, \ pop the limit into W W RPUSH, \ push Working reg. L RPUSH, \ push loop index register to rstack TOS L MOV, \ move TOS to index register TOS POP, \ refill TOS BEGIN ; : LOOP L INC, \ inc the loop index register L *RP CMP, \ comp limit to top of Rstack EQ UNTIL \ until they are the same RP 4 ADDI, \ collapse stack frame ; And here is the resulting assembled code: 60 bytes no header. 4.1 seconds. Equivalent program in ITC Forth is 48 bytes with the header so it's much smaller with no header. 14.7 seconds A012 0649 dect R9 (14) A014 C646 mov R6,*R9 (30) A016 0206 li R6,>ffff (20) FFFF A01A 0649 dect R9 (14) A01C C646 mov R6,*R9 (30) A01E 0206 li R6,>0000 (20) 0000 A022 C1F9 mov *R9+,R7 (30) A024 064A dect R10 (14) A026 C687 mov R7,*R10 (30) A028 064A dect R10 (14) A02A C688 mov R8,*R10 (30) A02C C206 mov R6,R8 (18) A02E C1B9 mov *R9+,R6 (30) A030 0649 dect R9 (14) A032 C646 mov R6,*R9 (30) A034 0206 li R6,>a004 (20) A004 > A038 0596 inc *R6 A03A C1B9 mov *R9+,R6 A03C 0588 inc R8 A03E 8688 c R8,*R10 A040 16F7 jne >a030 A042 022A ai R10,>0004 0004 A046 0300 limi >0002 0002 A04A 02E0 lwpi >83e0 83E0 A04E 0420 blwp @>0000 0000 My approach is that all .LIB code is a macro and will compile INLINE code if invoked. For big code words we use machine forth ':' which compiles the code to memory with a CALL, preamble and end the definition with machine Forth ';' which adds the RT, instruction. So this would be valid with my current MFPRIMS.LIB : CMOVE ( src dst cnt -- ) CMOVE, ; \ CMOVE, compiles all the code into memory That's the plan. This could be added easily to a BLOCK based system where the LIB ASM macro code just exists out in virtual memory on BLOCKs. BF 2 Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 10, 2017 Author Share Posted March 10, 2017 (edited) LOL. Didn't test that do loop, but it worked fine for FFFF iterations. DUH! I went back to Laxen & Perry style and made (DO) a callable sub-routine because it's kind of big. So it looks like this now. SUB: (DO) ( limit index -- ) \ compile a callable sub-routine in TARGET memory R0 8000 LI, \ load "fudge factor" to LIMIT *SP+ R0 SUB, \ LIMIT, compute 8000h-limit "fudge factor" R0 TOS ADD, \ loop ctr = index+fudge R0 RPUSH, \ rpush limit L RPUSH, TOS L MOV, \ rpush index TOS POP, \ refill TOS ;SUB \ this is the high level WORD that branches to the sub-routine \ We have to use BL, and not CALL, because (DO) is putting things on the return stack!!! : DO ( -- ) (DO) @@ BL, BEGIN ; : LOOP L INC, \ increment loop OO UNTIL \ if no overflow then loop again L RPOP, \ restore L register RDROP, \ RDROP the limit value ; Currently all the branching uses short jumps. That will need fixing eventually. *EDIT* BY the way I notice that this DO/LOOP does nothing with LIMIT value after the calculation except push it onto the stack. Why keep it? I am going to remove it. B Edited March 10, 2017 by TheBF Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 10, 2017 Author Share Posted March 10, 2017 (edited) Remember this? PROGRAM: FORTHVM 1000 lit, 0 LIT, DO AAAA lit, DUP SWAP OVER ROT DROP DUP AND DUP OR DUP XOR 1+ 1- 2+ 2- 2* 2/ NEGATE ABS + 2 lit, * DROP LOOP bye, END. My log showed this: *corrected* CAMEL99: 4.5 secs TurboForth 4.6 secs Machine Forth.... 1.8 secs! A speed-up factor of 2.55x I like this. BF Edited March 22, 2017 by TheBF Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 10, 2017 Share Posted March 10, 2017 Wow that's more like it :-) Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 10, 2017 Share Posted March 10, 2017 That's got me inspired. I've been kicking the tyres thinking about this for about a year but never did anything about it. Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 10, 2017 Share Posted March 10, 2017 The only disadvantage I can see is code size. Quote Link to comment Share on other sites More sharing options...
Opry99er Posted March 11, 2017 Share Posted March 11, 2017 128k bank switched ROM. You've got no downside, Willsy. Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 11, 2017 Share Posted March 11, 2017 H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.): HEX ASM: 2* *SP R1 MOV, R1 1 SLA, R1 *SP MOV, ;ASM ASM: 2/ *SP R1 MOV, R1 1 SRL, R1 *SP MOV, ;ASM : XXX 1000 0 DO AAAA DUP SWAP OVER ROT DROP DUP AND DUP OR DUP XOR 1+ 1- 2+ 2- 2* 2/ MINUS ABS + 2 * DROP LOOP ; DECIMAL ...lee Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 11, 2017 Author Share Posted March 11, 2017 The only disadvantage I can see is code size. Yes it will be a challenge to keep size under control. The balance is that in my current system there are no headers because it is compiled offline from the TI-99 like GCC compile 'C'. And some of the primitives are quite small. In fact, you can make them even smaller in some cases by not keeping TOS in a register, with a small speed penalty. BF Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 11, 2017 Author Share Posted March 11, 2017 H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.): HEX ASM: 2* *SP R1 MOV, R1 1 SLA, R1 *SP MOV, ;ASM ASM: 2/ *SP R1 MOV, R1 1 SRL, R1 *SP MOV, ;ASM : XXX 1000 0 DO AAAA DUP SWAP OVER ROT DROP DUP AND DUP OR DUP XOR 1+ 1- 2+ 2- 2* 2/ MINUS ABS + 2 * DROP LOOP ; DECIMAL ...lee My bad. Mea maxima culpa. That time of 18 secs. was for the test above in my log, that tested UM*. I just re-timed the OPTEST and it was 4.7 second in Camel99 Forth. So the Machine Forth speedup is only 2.6 x CAMEL99 And a speed up of 3.6 x vs FBForth. Still a good improvement. B Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 (edited) H-m-m-m...fbForth 2.0 takes 6.6 seconds. Maybe I did something different. Here is my code (I had to define 2* and 2/ , which are not in fbForth.): HEX ASM: 2* *SP R1 MOV, R1 1 SLA, R1 *SP MOV, ;ASM ASM: 2/ *SP R1 MOV, R1 1 SRL, R1 *SP MOV, ;ASM : XXX 1000 0 DO AAAA DUP SWAP OVER ROT DROP DUP AND DUP OR DUP XOR 1+ 1- 2+ 2- 2* 2/ MINUS ABS + 2 * DROP LOOP ; DECIMAL ...lee According to my experiments, TF takes just a smidge over 1 second (running in classic99 on a 12 year old Toshiba Tecra M5 laptop). Maybe 1.2 seconds, something like that. : test 1000 0 DO $AAAA DUP SWAP OVER ROT DROP DUP AND DUP OR DUP XOR 1+ 1- 2+ 2- 2* 2/ NEGATE ABS + 2 * DROP LOOP ; Bruce/Lee (Ha! Bruce Lee, geddit???) please try TF on your systems if possible. Thanks M Edited March 11, 2017 by Willsy Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 11, 2017 Share Posted March 11, 2017 According to my experiments, TF takes just a smidge over 1 second (running in classic99 on a 12 year old Toshiba Tecra M5 laptop). Maybe 1.2 seconds, something like that. ... Bruce/Lee (Ha! Bruce Lee, geddit???) please try TF on your systems if possible. Thanks M Hah! Actually, TF takes 4.096 times that for >1000, which is on the order of Camel99’s speed. ...lee Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 Ah! it's hex 1000 About 4.5 secs... Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 11, 2017 Author Share Posted March 11, 2017 Ah! it's hex 1000 About 4.5 secs... Yes it is HEX 1000 . Turboforth and Camel99 do it at about the same speed. How do you get those ms timing value LEE? I am doing it by the "Armstrong method" (manually) with my cell phone. B Quote Link to comment Share on other sites More sharing options...
+TheBF Posted March 11, 2017 Author Share Posted March 11, 2017 (edited) So although machine Forth is nice to play with I am still longing to create a threaded system that uses the BL instruction based calling system with the return stack keeping old R11 values. I know it will get bigger than ITC so what about "indirect sub-routine threading" Where we keep the Forth instruction pointer but DOCOL is looks more like: ( I am making this up here) MOV R11 *RP MOV *IP+ W BL *W I don't have the details clear in my head but does it inspire any thoughts from you guys? BF *EDIT* After looking at numbers, the benefit of using BL and return stack to CALL and BL *R11 is smaller than I thought but it is still faster Calling would be 5% slower, but returning is 65% faster than ITC NEXT. So the real benefit of such a system depends on careful use of inlining the small primitives where calling offers no size improvements. Edited March 11, 2017 by TheBF Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 Here's one I did earlier. In 2012, no less! Just experimenting. ;: PRIMES ( n - ) ; 2 . 3 . ; 5 DO ; DUP DUP * I < ; IF 1+ THEN ; 1 OVER 1+ 3 DO ; J I MOD 0= ; IF 1- LEAVE THEN ; 2 +LOOP ; IF I . THEN ; 2 +LOOP ; def START aorg >a000 stack equ r14 rstack equ r15 START li r14,stackb ; set up data stack li r15,rstackb ; set up return stack primes bl @lit ; push 5000 data 5000 bl @lit2 ; push 2 bl @swap ; SWAP bl @lit2 ; push 2 bl @dot ; write to memory bl @lit ; push 3 data 3 bl @dot ; write to memory bl @lit ; push 5 data 5 bl @do ; DO do1 mov *stack+,*stack ; dup mov *stack+,*stack ; dup bl @mult ; * bl @geti ; I bl @lessth ; < bl @iffalse ; 0BRANCH data xxxx inc *stack ; 1+ xxxx bl @lit1 ; push 1 bl @over ; OVER inc *stack ; 1+ bl @lit ; push 3 data 3 bl @do ; DO do2 ; bl @getj inct stack ; J mov @-6(rstack),*stack ; bl @geti inct stack ; I mov @-2(rstack),*stack bl @mod ; MOD bl @eq0 ; 0= bl @iffalse ; 0BRANCH data yyyy dec *stack ; drop bl @leave ; LEAVE data gggg yyyy bl @lit2 ; push 2 bl @ploop ; +LOOP data do2 gggg bl @iffalse ; 0BRANCH data zzzz bl @geti ; I bl @dot ; write to memory zzzz bl @lit2 ; push 2 bl @ploop ; +LOOP data do1 lock jmp lock ; stop ; -------------------------------------- do inct rstack mov *stack,*rstack+ ; move start value to rstack dect stack mov *stack,*rstack ; move max value to rstack dect stack rt ; -------------------------------------- ploop a *stack,@-2(rstack) dect stack c @-2(rstack),*rstack jhe doexit mov *r11,r11 rt doexit ai rstack,-4 inct r11 rt ; -------------------------------------- leave ai rstack,-4 mov *r11,r11 rt ; -------------------------------------- geti inct stack mov @-2(rstack),*stack rt ; -------------------------------------- getj inct stack mov @-6(rstack),*stack rt ; -------------------------------------- lit inct stack mov *r11+,*stack rt ; -------------------------------------- lit1 li r0,1 inct stack mov r0,*stack rt ; -------------------------------------- lit2 li r0,2 inct stack mov r0,*stack rt ; -------------------------------------- swap mov *stack,r0 mov @-2(stack),*stack mov r0,@-2(stack) rt ; -------------------------------------- dot mov @here,r0 mov *stack,*r0+ mov r0,@here dect stack rt ; -------------------------------------- dup mov *stack+,*stack rt ; -------------------------------------- mult mov @-2(stack),r0 mpy *stack,r0 dect stack mov r1,*stack rt ; -------------------------------------- lessth c @-2(stack),*stack jlt islt dect stack clr *stack rt islt dect stack seto *stack rt ; -------------------------------------- iffalse mov *stack,r0 jeq dojmp dect stack inct r11 rt dojmp dect stack mov *r11,r11 rt ; -------------------------------------- over inct stack mov @-4(stack),*stack rt ; -------------------------------------- ; X Y -- X mod Y mod clr r1 mov *stack,r0 ; y dect stack ; point to x mov *stack,r2 ; x div r0,r1 ; remainder in r2 mov r2,*stack ; place remainder rt ; -------------------------------------- eq0 mov *stack,r0 jeq eq0xit ; if equal to 0 leave a TRUE clr *stack ; otherwise place a FALSE rt eq0xit seto *stack rt ; -------------------------------------- tof data 0 stackb bss 20 rstackb bss 20 here DATA $+2 end Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 Here's another one. This is a sort of optimised native assembler output. ; : test 0 1+ dup 65535 = 0branch [ -12 , ] ; : Easily compilable version of the above (since TF V1.2 uses absolute branches): : test 0 [ here ] 1+ dup 65535 = 0branch [ , ] ; ; itc (tf v1.2): 14 sec ; direct assembly: 13 sec ; direct assembly with inline 1+ & DUP: 11 sec ; as above with stack to higher address:10 SEC (dup becomes faster) aorg >a000 dstack equ 4 ; data stack register is R4 DEF START START lwpi >8300 li r4,stack ; stack pointer - stack grows to lower addresses bl @lit ; lit 0 data 0 loop inc *dstack ; 1+ mov *dstack+,*dstack ; DUP bl @lit data >ffff bl @equal bl @zbranch data loop lock jmp lock ;[ lit inct dstack ; stack mov *r11+,*dstack ; push literal to stack rt ;] ;[ plus1 inc *dstack rt ;] ;[ equal c *dstack,@-2(dstack) jeq eq clr @tof dect dstack dect dstack rt eq seto @tof dect dstack dect dstack rt ;] ;[ zbranch mov *r11+,r0 mov @tof,r1 jne neq b *r0 neq rt ;] tof data 0 stack bes 100 end Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 11, 2017 Share Posted March 11, 2017 ... How do you get those ms timing value LEE? I am doing it by the "Armstrong method" (manually) with my cell phone. B Same here—though I am not sure what you mean by “ms timing value”. If you mean the 4.096 value, that was just >1000/1000 = 4096/1000 = 4.096, which should be the comparison factor for how Mark first ran the program in TF. ...lee Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 So although machine Forth is nice to play with I am still longing to create a threaded system that uses the BL instruction based calling system with the return stack keeping old R11 values. I know it will get bigger than ITC so what about "indirect sub-routine threading" Where we keep the Forth instruction pointer but DOCOL is looks more like: ( I am making this up here) MOV R11 *RP MOV *IP+ W BL *W I don't have the details clear in my head but does it inspire any thoughts from you guys? BF I think direct threaded might be better suited to the 9900: Let R0 be instruction pointer. So, you'd have: next: b *r0+ That's it. High level and low-level Forth words would appear the same. No difference. You'd still have DOCOL and EXIT but they would just be the addresses of the routines, just like DUP etc. There would be no CFA. The first cell of a definition is the address (direct address) of some executable code. Quote Link to comment Share on other sites More sharing options...
Willsy Posted March 11, 2017 Share Posted March 11, 2017 I think if I did another Forth for the TI it would be direct threaded, with the ability to switch to native assembly for the fast stuff :-) Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted March 11, 2017 Share Posted March 11, 2017 Here's one I did earlier. In 2012, no less! Just experimenting. ... Here's another one. This is a sort of optimised native assembler output. ... These were back when you were growing the stacks up instead of down, right? ...lee Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.