After 30 years of wondering, I finally got around to creating a Forth compiler for the TI-99 where the top of stack (TOS) is maintained in a register.
The literature said this would speed it up by about 10%. I used a DOS Forth system to create the cross-compiler to build the TI-99 compiler, so it was painful debugging both ends at the same time (old brain hurts). I cross-compiled Brad Rodriguez's Camel Forth for the high-level Forth words and wrote 105 Assembler primitives, taking hints here and there for the hard stuff from the TI MSP430 Camel Forth, and I had to look at Turbo Forth to help find a couple of gotchas with the 9900 instruction set. Sincere thanks to Willsy and Brad.
Anyway, the answer is in... kind of, sort of. Using Willsy's excellent and highly optimized Turbo Forth as the benchmark for excellence, I did a little comparison.
Turbo Forth uses the PAD RAM at >8300 to hold many simple code routines so they run very fast in that zero wait state memory.
To even begin to come close to Turbo Forth, I found I also had to put the Forth thread interpreter there, along with branching, and I stuck the literal run-time routine there as well. After that, the only optimizing approach I used was this TOS thing.
The TOS caching is a mixed blessing. For routines that take one input on the stack and produce one output (1+ 2+ 2/ 2* @ C@ etc.) it is about 40% faster. Very cool. For operations that take two inputs and generate one output or no output on the stack ( ! C! + - * etc.), refilling the TOS can eat up all of the benefit on the 9900.
And for operators that need to make extra space on the stack for an output (DUP OVER etc.), the TMS9900 needs 2 instructions, so they are actually slower because you have to push the TOS register onto the stack to make room for the new thing.
- My empty DO/LOOP structure runs at the same speed as Turbo Forth's, so the test is truly comparing the math operations.
- Tests were run on the Classic99 emulator under 64-bit Windows 10 (my real iron is in a box with a defective 32K memory card).
Test 1 tests all the routines Turbo Forth has in PAD RAM and the others as well, so it's mixed.
Test 2 is head to head TOS vs PAD RAM optimization.
Test 3 is TOS vs Forth operators that have no PAD RAM optimization.
We can see in test 3 that we get about 8% improvement, not 10%.
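For anyone who wants to check that test 3 figure from the timings, here is a minimal sketch in plain Forth. SAVED is a made-up helper name for this post, not a word from Camel Forth or Turbo Forth, and it assumes the times are typed in as hundredths of a second (8.13 secs becomes 813). It reports the run time saved in tenths of a percent.

```forth
DECIMAL
\ Hypothetical helper, not part of the original tests.
\ Inputs are run times in hundredths of a second.
\ Output is the time saved, in tenths of a percent.
: SAVED ( slow fast -- tenths )
   OVER SWAP -      \ -- slow diff
   1000 ROT */ ;    \ diff*1000/slow, using the double-cell intermediate of */

813 750 SAVED .     \ test 3 timings: prints 77, i.e. about 7.7%
```

This measures the reduction in run time. Taken the other way round, as a ratio of speeds (8.13 / 7.5), the gain comes out a little over 8%, so "about 8%" holds either way. Switch back to HEX afterwards if you are going to load the test words.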
The surprise for me was test 2, because the TOS speedup was not supposed to be as good as zero wait state RAM, but it seems the combination of everything netted out to the same result. Weird.
In many other ways Turbo Forth is still faster by virtue of hand coding so much of the internals, but this demonstrates the effect of TOS caching on math operations.
Now I have to stop doing this for a while. (addictions are hard to kick)
PS. I noticed I did not include NIP and TUCK but that's for another day.
PPS. This means Turbo Forth 3.0 can be 8% faster. Just one more re-write, Willsy :-)
HEX
: OPTEST              \ mixed
     1000 0           \ *OPTIMIZATION METHOD*
     DO               \ CAMEL99   Turbo Forth
                      \ ----------------------
        AAAA ( lit)   \ HSRAM     HSRAM
        DUP           \ TOS       HSRAM
        SWAP          \ TOS       HSRAM
        OVER          \ TOS       HSRAM
        ROT           \ TOS       --
        DROP          \ TOS       HSRAM
        DUP AND       \ TOS       --
        DUP OR        \ TOS       --
        DUP XOR       \ TOS       --
        1+            \ TOS       HSRAM
        1-            \ TOS       HSRAM
        2+            \ TOS       HSRAM
        2-            \ TOS       HSRAM
        2*            \ TOS       --
        2/            \ TOS       --
        NEGATE        \ TOS       --
        ABS           \ TOS       --
        +             \ TOS       HSRAM
        2 *           \ TOS       HSRAM
        DROP
     LOOP ;
\ CAMEL99:    4 5 secs
\ TurboForth  4.7 secs
\ (Empty DO/LOOP are same speed)

: OPTEST2             \ only HSRAM VS TOS
     2000 0           \ *OPTIMIZATION METHOD*
     DO               \ CAMEL99   Turbo Forth
                      \ ----------------------
        AAAA ( lit)   \ HSRAM     HSRAM
        DUP           \ TOS       HSRAM
        SWAP          \ TOS       HSRAM
        OVER          \ TOS       HSRAM
        DUP AND       \ TOS       HSRAM
        DUP OR        \ TOS       HSRAM
        1+            \ TOS       HSRAM
        1-            \ TOS       HSRAM
        2+            \ TOS       HSRAM
        2-            \ TOS       HSRAM
        +             \ TOS       HSRAM
        2 *           \ TOS       HSRAM
        DROP          \ TOS       HSRAM
        DROP          \ TOS       HSRAM
     LOOP ;
\ CAMEL99:    6.4 secs
\ TurboForth  6.4 secs

HEX
: OPTEST3             \ TOS versus conventional Parameter stack
     3000 0           \ *OPTIMIZATION METHOD*
     DO               \ CAMEL99   Turbo Forth
                      \ ----------------------
        AAAA          \ HSRAM     HSRAM
        BBBB          \ HSRAM     HSRAM
        CCCC          \ HSRAM     HSRAM
        ROT           \ TOS       --
        AND           \ TOS       --
        OR            \ TOS       --
        DUP XOR       \ TOS       --
        2*            \ TOS       --
        2/            \ TOS       --
        NEGATE        \ TOS       --
        ABS           \ TOS       --
        DROP          \ TOS       --
     LOOP ;
\ CAMEL99:    7.5 secs
\ TurboForth  8.13 secs
Edited by TheBF, Tue Jan 31, 2017 9:11 AM.