BASIC speed

TheBF · February 21, 2017

I have never written a BASIC compiler or even an interpreter, but I can understand how making it as safe as BASIC is designed to be would add lots of overhead. I still have grudges with TI over how they implemented the TI-BASIC and XB languages in GPL. Soooo slow. I spent so many hours in the '80s trying to make it go faster.

I found this video on YouTube demonstrating the speed-up you get using a compiler for XB, and it would have made me happy many years ago.

I am in the middle of writing a version of CAMEL Forth for the TI-99 and I wondered how it would compare on this simple test.

Here is the video I made using version .5 of the system.

It shows a bit more of what this ancient platform could have done if the engineers had been free to do it properly.

Camel Forth V.5 Demo.mov

Opry99er · February 21, 2017

That is a video I made to showcase seniorfalcon's compiler several years ago.

It is a brilliant utility, and allows for super fast games to be produced in XB.

I look forward to your Forth implementation!!

Lee Stewart · February 21, 2017

FYI, translated to fbForth 2.0, the first takes 5.5 seconds and the second ~0.4 second.

...lee

TheBF · February 21, 2017

Cool. I am new here, but I thought you might be on this site.

I used to love XB. I thought it was just great until one day I wrote something and showed it to my sister-in-law.

She said, "Why is it so slow?" She was comparing it to the Commodore 64. I was P.O.'d. :-)

Nice to make your acquaintance. I was looking for stuff on YouTube when I found your video.

Made me curious about how my code compared... of course.

It's got me thinking about putting a BASIC wrapper on top of Forth to make something more palatable for people. It's been done in the past on other machines, and it should go pretty fast.

So much code, so little time.

BF

TheBF · February 21, 2017

FYI, translated to fbForth 2.0, the first takes 5.5 seconds and the second ~0.4 second.

...lee

Hi Lee,

That's interesting. Is fbForth based on TI Forth? If so, I believe the difference is mostly in the EMIT implementation.

If I recall, TI Forth's EMIT called the system for some things and also provided a proper control-key interpreter.

My version of EMIT is very sparse. I tried something unusual to avoid multiplication when calculating the cursor position.

I keep track of the ROW as a VDP address and the column as an offset from it.

That way I only have to add them together in the word VPOS below, so it's pretty quick.

I do use multiplication when positioning the cursor manually with AT-XY, however.

I intend to use this implementation for a cross-compiler tutorial, so I am trying to keep a lot of the code high-level, with simple support words in Assembler.

: EMIT ( char -- )
    VPOS C/SCR @ = IF SCROLL THEN  \ if we are at last character in the display, scroll
    (EMIT)                          \ put the character on the screen & inc. the column
    VCOL @ C/L @ =                  \ are we at end of line?
    IF (CR) THEN  ;                 \ do carriage return math
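To make the row-address/column-offset idea concrete, here is a rough Python model of the EMIT above (an illustration, not CAMEL99 code). The 32x24 screen at VDP address 0, the bytearray standing in for VDP RAM, and SCROLL leaving the cursor at the start of the bottom line are all assumptions.

```python
# Rough model of the CAMEL99 EMIT bookkeeping (illustrative, not CAMEL99 code).
# Assumed geometry: 32 columns x 24 rows, screen image at VDP address 0.
C_L = 32              # characters per line (C/L)
ROWS = 24
C_SCR = C_L * ROWS    # characters per screen (C/SCR)


class Screen:
    def __init__(self):
        self.vram = bytearray(b" " * C_SCR)  # stand-in for the screen image in VDP RAM
        self.vrow = 0    # VDP address of the start of the current row
        self.vcol = 0    # column offset within that row (VCOL)

    def vpos(self):
        # VPOS: the cursor address is a single addition, no multiply.
        return self.vrow + self.vcol

    def at_xy(self, col, row):
        # AT-XY: manual positioning is where the multiply happens.
        self.vrow = row * C_L
        self.vcol = col

    def cr(self):
        # (CR): move the row address down one line, column back to 0.
        self.vrow += C_L
        self.vcol = 0

    def scroll(self):
        # Assumed SCROLL: shift lines up, blank the bottom line,
        # and leave the cursor at the start of the last line.
        self.vram[0:C_SCR - C_L] = self.vram[C_L:C_SCR]
        self.vram[C_SCR - C_L:C_SCR] = b" " * C_L
        self.vrow = C_SCR - C_L
        self.vcol = 0

    def emit(self, ch):
        if self.vpos() == C_SCR:          # past the last cell? scroll first
            self.scroll()
        self.vram[self.vpos()] = ord(ch)  # (EMIT): write the character...
        self.vcol += 1                    # ...and bump the column
        if self.vcol == C_L:              # end of line?
            self.cr()
```

The only multiply is in `at_xy`; finding the address of each character written costs a single addition in `vpos`.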

BF

matthew180 · February 21, 2017

...
She said, "Why is it so slow?" She was comparing it to the Commodore 64. I was P.O.'d. :-)
...

We all know the only thing people used C64 BASIC for was poking in assembly programs, so it was not a fair comparison. :-) Rewrite it in 9900 assembly and see how it compares then.

Lee Stewart · February 22, 2017

... Is fbForth based on TI-Forth?

fbForth is, indeed, based on TI Forth.

I believe the difference is mostly in the EMIT implementation.

Nope. My implementation of your routines is practically identical to yours. VC! becomes VSBW and [CHAR] becomes ASCII (same as TurboForth). The time difference may be because only the inner interpreter is in scratchpad RAM, on the 16-bit bus. If, like TurboForth, you are running some routines in scratchpad RAM, your CAMEL99 Forth will be faster.

If I recall, TI Forth's EMIT called the system for some things and also provided a proper control-key interpreter. ...

Indeed, EMIT calls the system ALC (assembly language code, listed below), which handles BEL, BS, LF and CR. Everything else is presumed to be a printable character. As noted above, nothing in my implementation of your code uses EMIT.

;[*== EMIT routine CODE = -4 =================
*
EMT    EQU  $LO+$-LLVSPT
       MOV  R2,R1             copy char to R1 for VSBW
       MOV  @$ALTO(U),R0      alternate output device?
       JEQ  EMIT0             jump to video display output if not
*
* R0 now points to PAB for alternate output device, the one-byte buffer
* for which must immediately precede its PAB. PAB must have been set up
* to write one byte.
*
       CLR  R7                ALTOUT active
       MOVB R7,@KYSTAT        zero status byte
       DEC  R0                point to one-byte VRAM buffer in front of PAB
       SWPB R1                char to MSB
       BLWP @VSBW             write char to buffer
       INCT R0                point to Flag/Status byte
       BLWP @VSBR             read it
       ANDI R1,>1F00          clear error bits without disturbing flag bits
       BLWP @VSBW             write it back to PAB
       AI   R0,8              Set up pointer to namelength byte of PAB
       MOV  R0,@SUBPTR        copy to DSR subroutine name-length pointer
       BLWP @DSRLNK           put 1 byte to device
       DATA >8
       B    @BKLINK           return to caller
*
* Output is going to the video display
*
EMIT0  CI   R1,7              Is it a bell?
       JNE  NOTBEL
       CLR  R2
       MOVB R2,@KYSTAT
       BLWP @GPLLNK
       DATA >0036             Emit error tone
       JMP  EMEXIT
*
NOTBEL CI   R1,8              Is it a backspace?
       JNE  NOTBS
       LI   R1,>2000
       MOV  @CURPO$(U),R0
       BLWP @VSBW
       JGT  DECCUR
       JMP  EMEXIT
DECCUR DEC  @CURPO$(U)
       JMP  EMEXIT
*
NOTBS  CI   R1,>A             Is it a line feed?
       JNE  NOTLF
       MOV  @$SEND(U),R7
       S    @$SWDTH(U),R7
       C    @CURPO$(U),R7
       JHE  SCRLL
       A    @$SWDTH(U),@CURPO$(u)
       JMP  EMEXIT
SCRLL  MOV  LINK,R7
       BL   @SCROLL
       MOV  R7,LINK
       JMP  EMEXIT
*
*** SCROLLING ROUTINE
*
SCROLL EQU  $LO+$-LLVSPT
       MOV  @$SSTRT(U),R0     VRAM addr
       LI   R1,LINBUF         Line buffer
       MOV  @$SWDTH(U),R2     Count
       A    R2,R0             Start at line 2
SCROL1 BLWP @VMBR
       S    R2,R0             One line back to write
       BLWP @VMBW
       A    R2,R0             Two lines ahead for next read
       A    R2,R0
       C    R0,@$SEND(U)      End of screen?
       JL   SCROL1
       MOV  R2,R1             Blank bottom row of screen
       LI   R0,>2000          Blank
       S    @$SEND(U),R2
       NEG  R2                Now contains address of start of last line
       MOV  LINK,R6
       BL   @FILL1            Write the blanks
       B    *R6
*
NOTLF  CI   R1,>D             Is it a carriage return?
       JNE  NOTCR
       CLR  R0
       MOV  @CURPO$(U),R1
       MOV  R1,R3
       S    @$SSTRT(U),R1     Adjusted for screen not at 0
       MOV  @$SWDTH(U),R2
       DIV  R2,R0
       S    R1,R3
       MOV  R3,@CURPO$(U)
       JMP  EMEXIT
*
NOTCR  SWPB R1                Assume it is a printable character
       MOV  @CURPO$(U),R0
       BLWP @VSBW
       MOV  @$SEND(U),R2
       DEC  R2
       C    R0,R2
       JNE  NOTCR1
       MOV  @$SEND(U),R0
       S    @$SWDTH(U),R0     Was last char on screen. Scroll
       MOV  R0,@CURPO$(U)
       JMP  SCRLL
NOTCR1 INC  R0                No scroll necessary
       MOV  R0,@CURPO$(U)
*
EMEXIT B    @BKLINK
;]*
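As a reading aid for the listing, here is a Python sketch of its video-display path only (BS, LF, CR and printable characters). The class and attribute names are hypothetical stand-ins for CURPO$, $SSTRT, $SWDTH and $SEND; the BEL tone and the alternate-output PAB/DSRLNK path are omitted, and the 40x24 default geometry is an assumption.

```python
class FbScreen:
    """Hypothetical model of the video-display branch of the EMIT listing."""

    def __init__(self, start=0, width=40, rows=24):
        self.start = start                  # $SSTRT: VRAM address of the screen
        self.width = width                  # $SWDTH: characters per line
        self.end = start + width * rows     # $SEND: one past the last cell
        self.vram = bytearray(b" " * self.end)
        self.cur = start                    # CURPO$: cursor address

    def scroll(self):
        # SCROLL: copy each line up one line, then blank the bottom line.
        s, w, e = self.start, self.width, self.end
        self.vram[s:e - w] = self.vram[s + w:e]
        self.vram[e - w:e] = b" " * w

    def emit(self, code):
        if code == 0x07:                    # BEL: the error tone is not modeled
            return
        if code == 0x08:                    # BS: blank the cell, back up if cursor > 0
            self.vram[self.cur] = 0x20
            if self.cur > 0:
                self.cur -= 1
        elif code == 0x0A:                  # LF: next line, or scroll on the last line
            if self.cur >= self.end - self.width:
                self.scroll()
            else:
                self.cur += self.width
        elif code == 0x0D:                  # CR: DIV leaves (cur - start) mod width
            self.cur -= (self.cur - self.start) % self.width
        else:                               # anything else is treated as printable
            self.vram[self.cur] = code
            if self.cur == self.end - 1:    # last cell: jump to last line, scroll
                self.cur = self.end - self.width
                self.scroll()
            else:
                self.cur += 1
```

The CR case is the one place the routine divides: DIV leaves (cursor minus screen start) mod width in R1, and subtracting that remainder snaps the cursor back to column 0.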

...lee

senior_falcon · February 22, 2017

We all know the only thing people used C64 BASIC for was poking in assembly programs, so it was not a fair comparison. :-) Rewrite it in 9900 assembly and see how it compares then.

I have no experience with C64 BASIC, but VIC-20 BASIC was much faster than TI BASIC. The kid who lived next door had a VIC-20, and when he saw the TI-99 his first comment was "Why is it so slow?"

TheBF · February 22, 2017


You are correct Lee. I have NEXT in PAD RAM along with EXIT, DOCOL, ?BRANCH and BRANCH.

When I tested different benchmarks on CAMEL99 without that speedup, things were 20% slower, so that lines up pretty much exactly with your timing.

You are the man.

BF

Lee Stewart · February 22, 2017


That might do it. We'll have to compare code sometime. The fbForth inner interpreter includes NEXT (actually, its ALC label is $NEXT) and the code fields of : [code field label = DOCOL] and EXIT ( ;S in fbForth) [code field label = $SEMIS], which are all in scratchpad RAM, as are fbForth's workspace registers.

?BRANCH and BRANCH are not in scratchpad RAM in fbForth, and I suppose those two words could be making the difference, because they are certainly used extensively in the loops in your code, especially in the first example, which is almost twice as fast in your CAMEL99 Forth.

The second example is almost a dead heat because its loop branch executes only ten times instead of the 7680 times in the first example. I wish I could put more code in scratchpad RAM, but that would be a pretty big rewrite. I know Mark put quite a few of the oft-used words there and always had to be aware of the need to save/restore scratchpad space that conflicted with other functions.

...lee

Willsy · February 22, 2017

Yes. File routines were the main offender. Calling the disk I/O routines clobbers pad RAM in some locations, and so does the floating-point code. The pad RAM layout for TF (TurboForth) is here:

http://turboforth.net/resources/pad_ram.html

RXB · February 22, 2017


Hmm, here is RXB doing something even more impressive than this video, using XB. Hard to beat the speed of this:

Or, if that is not evidence enough, here you go:

Or lastly try this in RXB:

100 CALL CLEAR
110 FOR L=49 TO 57
120 CALL HCHAR(1,1,L)
130 CALL MOVES("VV",767,0,1)
140 NEXT L
150 ! Test the speed is pretty fast.
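Line 130 appears to rely on an overlapping copy. Assuming CALL MOVES("VV",767,0,1) copies VDP RAM one byte at a time in ascending order (an assumption about RXB's internals, not something stated here), copying 767 bytes from address 0 to address 1 smears the character HCHAR placed at row 1, column 1 across all 768 screen positions. A minimal Python model, with a hypothetical `moves_vv` helper:

```python
SCREEN = 768  # 32 x 24 screen image table


def moves_vv(vram, count, src, dst):
    """Hypothetical model of a byte-at-a-time, ascending VDP-to-VDP copy."""
    for i in range(count):
        vram[dst + i] = vram[src + i]   # each read may see the byte just written


vram = bytearray(b" " * SCREEN)
vram[0] = ord("1")            # stand-in for the byte CALL HCHAR(1,1,L) writes
moves_vv(vram, 767, 0, 1)     # every cell now holds that same byte
```

If the copy instead buffered the source first (as Python's `vram[1:768] = vram[0:767]` does), the result would only shift the screen right by one cell.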

Edited February 22, 2017 by RXB

TheBF · February 22, 2017


Ok, this is interesting. So with DOCOL and EXIT in scratchpad RAM, we are the same.

My DO/LOOP primitives are actually not in scratchpad RAM, because they didn't work when I tried it quickly, so only BEGIN/UNTIL etc. and IF/ELSE/THEN are getting help from ?BRANCH and BRANCH.

My DO/LOOP code borrows from Laxen and Perry via CAMEL Forth and is shown below.
I originally implemented it with the loop index and limit in R13 and R14, but I want to keep those registers free for my multitasker.

To get the fastest Forth loops, I find a FOR/NEXT implementation like eForth's, with a simple down-counter to zero, is best.
It goes like crazy compared to DO/LOOP. Even Chuck Moore stopped using DO LOOP, but the legacy is too big for the
language to remove it completely.

Can you see fewer cycles in this code compared to yours?
(BTW, the macros POP, PUSH, RPOP and RPUSH work exactly as you would expect. I wrote this for Intel first, so I tried to make the assembler a little bit "universal" across Forth VMs.)
\ Adapted from CAMEL Forth MSP430
\ ; '83 and ANSI standard loops terminate when the boundary of
\ ; limit-1 and limit is crossed, in either direction.  This can
\ ; be conveniently implemented by making the limit 8000h, so that
\ ; arithmetic overflow logic can detect crossing.  I learned this
\ ; trick from Laxen & Perry F83.


\ CAMEL Forth tries to put loop index and limit in registers.
\ We have elected not to do this so we have free registers for
\ a TMS9900 specific, very fast cooperative TASK switcher.

\ Not keeping the loop index and limit in registers makes looping about 8% slower
\ ====================================================================
CODE: <?DO> ( limit ndx -- )
             *SP TOS CMP,        \ compare 2 #s
              @@1 JNE,           \ if they are not the same jump to regular 'do.'  (BELOW)
              IP RPOP,           \ otherwise do a forth 'exit'
              TOS POP,           \ clean the parameter stack
              NEXT,

+CODE: <DO> ( limit indx -- )
@@1:          R0  8000 LI,      \ load "fudge factor" to LIMIT
             *SP+  R0 SUB,      \ LIMIT, compute 8000h-limit "fudge factor"
              R0  TOS ADD,      \ loop ctr = index+fudge
              R0  RPUSH,        \ rpush fudge factor (8000h-limit)
              TOS RPUSH,        \ rpush index
              TOS POP,          \ refill TOS
              NEXT,
              END-CODE

CODE: <LOOP>
             *RP INC,           \ increment loop
@@2:          @@1 JNO,          \ if no overflow then loop again
              IP INCT,          \ move past (LOOP)'s in-line parameter
              *RP+ *RP+ CMP,    \ RP INC by 4 (1 cell, 30 clocks) Doesn't make much difference to loop speed.
              NEXT,
@@1:         *IP IP ADD,        \ jump back to top of loop  (branch)
              NEXT,
              END-CODE

+CODE: <+LOOP>
              TOS *RP ADD,      \ saving space by jumping into <loop>
              TOS POP,          \ refill TOS, (does not change overflow flag)
              @@2 JMP,
              END-CODE

CODE: I       ( -- n)
              TOS PUSH,            \ making space in TOS slows this down
              *RP TOS MOV,
              2 (RP) TOS SUB,      \ index = loopindex - fudge
              NEXT,
              END-CODE

CODE: J       ( -- n)
              TOS PUSH,
              4 (RP) TOS MOV,       \ outer loop index is on the rstack
              6 (RP) TOS SUB,       \ index = loopindex - fudge
              NEXT,
              END-CODE

CODE: LEAVE
              *RP+ *RP+ CMP,        \ collapse rstack frame in 1 CELL (TMS9900 trick)
               IP RPOP,             \ pop something else to do from the return stack
               NEXT,
               END-CODE
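The Laxen & Perry trick in the header comments can be checked with a small Python model of 16-bit arithmetic. This is a sketch, not CAMEL99 code: `do_loop` mimics what <DO> stores on the return stack (the loop counter and the 8000h - limit fudge), what I computes from them, and how <LOOP>/<+LOOP> stop when the TMS9900 overflow flag would be set. It models <DO> only, not the early exit in <?DO>.

```python
MASK = 0xFFFF


def signed16(x):
    """Interpret the low 16 bits of x as a two's-complement number."""
    x &= MASK
    return x - 0x10000 if x & 0x8000 else x


def add16(a, b):
    """16-bit add returning (result, overflow), like the TMS9900 OV flag."""
    result = (a + b) & MASK
    overflow = signed16(a) + signed16(b) != signed16(result)
    return result, overflow


def do_loop(limit, start, step=1):
    """Indices seen by `limit start DO ... step +LOOP` using the 8000h trick."""
    fudge = (0x8000 - limit) & MASK          # <DO>: 8000h - limit, kept on the rstack
    ctr = (start + fudge) & MASK             # <DO>: loop counter = index + fudge
    seen = []
    while True:
        seen.append(signed16(ctr - fudge))   # I: counter - fudge
        ctr, overflow = add16(ctr, step & MASK)  # <LOOP> / <+LOOP>
        if overflow:                         # limit-1/limit boundary crossed: exit
            return seen
```

With `do_loop(0, 0)` the counter runs through all 65536 values before the boundary is crossed, which is the case <?DO> exists to skip.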

Edited February 22, 2017 by TheBF

TheBF · February 22, 2017

I STAND CORRECTED.

I just wrote a FOR NEXT loop and the speed difference between the above DO LOOP and FOR NEXT is almost non-existent.

On an Intel ITC Forth I see a 30% improvement, so I am really surprised. On the 9900, though, FOR/NEXT saves only one instruction!

Benchmarks are always right.
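For comparison with the DO/LOOP primitives, here is a Python sketch of a down-counting FOR/NEXT. It assumes the classic eForth behavior, where FOR puts the count on the return stack and NEXT decrements it and branches back until it goes below zero, so `n FOR ... NEXT` runs n+1 times; other Forths differ on that detail, and `for_next` is a hypothetical name.

```python
def for_next(n, body):
    """Model of an eForth-style FOR ... NEXT loop (assumed semantics)."""
    count = n                 # FOR: move n to the return stack
    while True:
        body(count)           # R@ inside the loop sees the current count
        count -= 1            # NEXT: decrement the count...
        if count < 0:         # ...and fall through once it passes zero
            break


seen = []
for_next(3, seen.append)      # body runs for 3, 2, 1, 0
```

Per pass, the runtime only decrements and tests one cell, and reading the index needs no fudge subtraction, which is why it is cheap; the benchmark above suggests that on the 9900 this saves little over the 8000h-trick DO/LOOP.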

Lee Stewart · February 22, 2017

My head hurts! :-o

I can probably figure out how your macros work, but on the face of it, things look pretty similar.

For another thing, I have never used JMP, in TMS9900 Forth Assembler. Perhaps, I will give it a try sometime. The TI Forth Manual never explained how to use any of the jump codes. I have always used the structured assembler constructs from the TI Forth Assembler, which is certainly encouraged in the manual. However, the similar loop code in fbForth 2.0 is all written in straight-up TMS9900 Assembler, so it should not be very difficult to compare. It is in fbForth002_ResidentDictionary.a99, if you want to check on it before I try anything.

One note (you may already know this), in fbForth 2.0 (as in TI Forth) the loop crossover is different in the positive and negative directions from what they are in Forth 83.
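To illustrate the kind of difference meant here, the following Python sketch assumes TI Forth and fbForth follow the fig-FORTH rule: after adding the increment, branch back while the index is still below the limit (positive step) or above it (negative step). Forth-83 instead ends the loop only when the index crosses the boundary between limit-1 and limit, so `0 10 DO ... -1 +LOOP` runs 11 times (10 down to 0) and `0 0 DO ... LOOP` runs 65536 times. `fig_loop` is a hypothetical name.

```python
def fig_loop(limit, start, step=1):
    """Indices visited by `limit start DO ... step +LOOP` under the assumed fig-FORTH rule."""
    index, seen = start, []
    while True:
        seen.append(index)
        index += step
        if step >= 0 and index >= limit:   # positive step: stop at or past the limit
            return seen
        if step < 0 and index <= limit:    # negative step: stop at or below the limit
            return seen
```

Under this rule, `0 10 DO ... -1 +LOOP` stops after 1 (ten passes) and `0 0 DO ... LOOP` runs once.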

...lee

TheBF · February 22, 2017


Now my head will have to hurt while I wrap it around the loop-crossover implications.

My macros are below.

\ PUSH & POP on both stacks
: PUSH,         ( src -- )  SP DECT,  *SP   MOV, ;    \ 10+18 = 28  cycles
: POP,          ( dst -- )  *SP+      SWAP  MOV, ;    \ 22 cycles

: RPUSH,        ( src -- ) RP DECT,  *RP   MOV,  ;
: RPOP,         ( dst -- ) *RP+      SWAP  MOV,  ;

\ this one allows nested subroutine calls. Never really needed it
: CALL,         ( dst -- )   \ total cycles 44 to call,  34 to return
                R11 RPUSH,       \ save R11 on forth return stack
               ( addr) BL,       \ branch & link saves the PC in R11
                R11 RPOP, ;      \ R11 RPOP, is laid down by CALL, in the caller
                                 \ We have to lay it in the code after BL so
                                 \ when we return from the Branch&link, R11 is
                                 \ restored to the original value from the rstack
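Here is a toy Python model of what PUSH, and POP, do to a stack pointer (RPUSH, and RPOP, have the same shape on RP). It assumes 2-byte cells, a stack growing toward lower addresses as DECT implies, and a dict standing in for memory; `drop_two_cells` mirrors the `*RP+ *RP+ CMP,` trick used in <LOOP> and LEAVE, which reads two cells and ignores the comparison just to advance the pointer by 4. The class and method names are illustrative only.

```python
class Stack:
    """Toy model of the PUSH,/POP, macros (names are illustrative)."""

    def __init__(self, sp):
        self.sp = sp          # SP register: address of the top cell
        self.mem = {}         # address -> 16-bit word

    def push(self, value):    # PUSH,  ->  SP DECT,   src *SP MOV,
        self.sp -= 2
        self.mem[self.sp] = value & 0xFFFF

    def pop(self):            # POP,   ->  *SP+ dst MOV,
        value = self.mem[self.sp]
        self.sp += 2
        return value

    def drop_two_cells(self): # *SP+ *SP+ CMP,  (comparison result ignored)
        self.sp += 4
```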

I copied the jump mechanism for my assembler from Win32Forth. Going from conventional assembler to Forth assembler was always confusing, so that was my solution.

Sorry it's confusing going the other way. I understand.

I just peeked into your code, and I think the difference in loop speed is the classic space vs. speed trade-off. Yours saves lots of space by JMPing to BRANCH every chance you get, which is smart on a TI-99.

My BRANCH is far away in scratchpad RAM, so I just bit the bullet and spent the space.

If I can shoehorn (LOOP) into scratchpad RAM, I may have something a little faster again.

BF

Edited February 23, 2017 by TheBF
