Benchmarking Languages

JamesD · August 28, 2017

~~It makes me wonder how big a "Boolean" type is in the P-Code system. Any idea?~~

Oops re-read your earlier post completely. 16 bits for a Boolean.

How big can you go before it hits the limit?

Could you try ARRAY of CHAR ?

That's still only 20K. Seems a bit wasteful for what could be a single bit.

I think the array of char sounds like a reasonable idea, but if the P-Machine doesn't handle arrays across boundaries, it could still be a problem.

It sounds odd that it wouldn't support that though.

apersson850 · August 28, 2017

array [0..999] of char, just like array[0..999] of boolean, still occupies 1000 words, as each character/boolean is stored in one word (16 bits). The system can't allocate a 10000 word array.

Now if you declare a packed array[0..999] of char, then that's 500 words, and a packed array[0..999] of boolean is only 63 words. But then you have to add the overhead for packing and unpacking when accessing the array. So it works, but will take longer time.

I'm not sure which boundaries you are referring to, when asking if the p-system can handle arrays across boundaries? There are no boundaries to cross in this case.

Edited December 5, 2020 by apersson850

apersson850 · August 28, 2017

I realized I made a mistake in declaring the array of booleans. Corrected that and updated the list.

1000 primes
X BASIC         14.3 secs
TI BASIC        11.9
Pascal           2.786 s

10,000 primes
TF               8.0
CAMEL99 ITC     10.28
CAMEL99 DTC      7.25

When the program is running, function memavail reports 5697 words free. My unit realtime consumes some memory too, especially as it has a segment with assembly support that's not dynamically relocatable.

JamesD · August 28, 2017

array [0..999] of char, just like array[0.999] of boolean, still occupies 1000 words, as each character/boolean is stored in one word (16 bits). The system can't allocate a 10000 word array.

Now if you declare a packed array[0.999] of char, then that's 500 words, and a packed array[0..999] of boolean is only 63 words. But then you have to add the overhead for packing and unpacking when accessing the array. So it works, but will take longer time.

I'm not sure which boundaries you are referring to, when asking if the p-system can handle arrays across boundaries? There are no boundaries to cross in this case.

Isn't the memory expansion paged?

apersson850 · August 28, 2017

Well, it's split in one 8 K and one 24 K section. But they are all on the same page, at least as I define a page in this context. That is, in the same address space, which for the TMS 9900 comprises 64 Kbytes.

JamesD · August 28, 2017

Well, it's split in one 8 K and one 24 K section. But they are all on the same page, at least as I define a page in this context. That is, in the same address space, which for the TMS 9900 comprises 64 Kbytes.

Page as in turning a page, changing a page, having to change with RAM appears at an address to use it page

apersson850 · August 28, 2017

Well, that's the same definition, then. And in that case it's not paged. It's the same memory expansion as we always have had.

Edited August 29, 2017 by apersson850

+TheBF · August 30, 2017

array [0..999] of char, just like array[0.999] of boolean, still occupies 1000 words, as each character/boolean is stored in one word (16 bits). The system can't allocate a 10000 word array.

Now if you declare a packed array[0.999] of char, then that's 500 words, and a packed array[0..999] of boolean is only 63 words. But then you have to add the overhead for packing and unpacking when accessing the array. So it works, but will take longer time.

I'm not sure which boundaries you are referring to, when asking if the p-system can handle arrays across boundaries? There are no boundaries to cross in this case.

I always like the way Wirth's languages have these different data types. Forth makes you work hard for that nice stuff.

So in my Pascal Envy fever I thought I would see what it would take to make a "packed array of bits" and it is more code than I thought.

The overhead just to set 1 bit, according to the CLASSIC99 9901 timer is 1.8 milli-seconds in my Forth system.

EDIT: New times 1.5mS to set a bit.

1.1mS to read a bit (BIT@)

This could be really sped up with assembly language code words but it's still a lot of work to strip out bits.

EDIT: Here is the revised code with fixed RSHIFT and uses CELL+ and CELLS to be less platform dependant.

\ BOOLEAN array experiment

\ BOOLEAN data is one CELL (16 bits on TMS9900)

HEX
\ create & erase memory area for 'n' bits
: BITS:  ( n -- ) CREATE    8 /  HERE OVER 0 FILL  CELL+  ALLOT  ;  \ added 2 bytes extra

\ compute bit# in a cell & cell address
: BITFLD     ( bit# addr[] -- bit#' addr)
               SWAP 10 /MOD CELLS ROT +  ;

: BIT@      ( bit# addr -- ? )
              BITFLD @               \ compute bit# & fetch bits in cell
              SWAP RSHIFT            \ if bit#<>0 RSHIFT,
              0001 AND ;             \ mask 1 bit
              
: BIT#>MASK ( bit# -- n )  0001 SWAP LSHIFT ;

: BSET      ( bit# addr[] -- )
              BITFLD                   ( -- bit# addr)
              SWAP BIT#>MASK >R        \ save the mask
              DUP @                    \ -- addr bits
              R> OR SWAP ! ;           \ or mask into bit, store in addr

: BRST      ( bit# addr[] -- )
              BITFLD                   ( -- bit# addr)
              SWAP BIT#>MASK INVERT >R  \ invert and save mask
              DUP @                     \ -- addr bits
              R> AND SWAP ! ;           \ mask out bits, store back in addr

\ test code
 DECIMAL
  300 BITS: ]X      \ make array X of 1000 bits

: FILLBITS   300 0 DO  I ]X BSET   LOOP ;
: CLRBITS    300 0 DO  I ]X BRST   LOOP ;
: EVENBITS   ." Erasing..."  CLRBITS 300 0 DO  I ]X BSET   2 +LOOP ;
: SHOWBITS   300 0 DO  I ]X BIT@ . LOOP ;

Edited September 14, 2017 by TheBF

apersson850 · August 30, 2017

I've not checked how the PME routines for packing and unpacking works. But obviously they take more time than not having to run them.

+TheBF · September 14, 2017

So I ran some of the *HAYES tester on my Forth system and guess what?

I coded my bit shifting words in a way that made them FAIL. No shock because I didn't read the spec. (DUH)

In the WORD BIT@ in my earlier post, I added an IF statement to cope with a 0 bit shift. Well the ANS/ISO Forth standard already anticipated that.

So I recoded my shift words (LSHIFT & RSHIFT) to pass the test and now BIT@ is much faster and much simpler.

I have edited the previous POST to reflect the correction.

: BIT@      ( bit# addr -- ? )
              BITFLD @               \ compute bit# & fetch bits in cell
              SWAP RSHIFT            \ if bit#<>0 RSHIFT,
              0001 AND ;             \ mask 1 bit

* The Hayes tester is set of test words and test suites for every standard Forth word that let's you confirm that a Forth WORD is doing what it's supposed to do.

It looks like this for a simple word like SWAP:

T{ 1 2 3 SWAP -> 1 3 2 }T

You get OK if the output swap matches the '1 3 2' after the arrow and an error message if not.

To read more: http://forth-standard.org/standard/testsuite

Edited September 14, 2017 by TheBF

+TheBF · April 26, 2019

I had always wanted to come back to this little demo to properly do this in Forth in an idiomatic way and for the best performance.
This of course means using the Forth assembler. There is no other way to match compilers that generate native code with a Threaded code Forth.

So I took a look at Tursi's code.
I made a some slight improvements by keeping the loop limits in a register rather than re-loading the register every time through the loop.

I used the Forth compiler to make byte directive so signed bytes can be used for the counters. I liked the way that made the code look. ;-)
This happens at compile time so it costs nothing in run-time.

Then I chopped up the big program into 4 routines which would be the Forth approach. This meant I could test each movement independantly.
I used Forth for all the setup stuff, clearing the screen and creating the sprite.

After that it's just 4 words in a 100 iteration loop.

The spoiler has the new code and the video shows the timing using the VDP interrupt on the PC. It seems to agree with my iPhone stopwatch.

\ * assumes startup from CAMEL99 Forth
\ * Translated from ASM code by Tursi
\ stripped out graphics set up and used Forth.
\ Uses Forth Workspace at >8300

 \ NEEDS DUMP FROM DSK1.TOOLS
NEEDS MOV,   FROM DSK1.ASM9900
NEEDS CLEAR  FROM DSK1.GRAFIX
NEEDS SPRITE FROM DSK1.DIRSPRIT
NEEDS ELAPSE FROM DSK1.ELAPSE

HEX
8C00 CONSTANT VDPWD         \ vdp ram write data
8C02 CONSTANT VDPWA         \ vdp ram read/write address

\ MACRO to setup VDP write address from a register argument
: VDPWA, ( reg -- )
       DUP VDPWA @@  MOVB,   \ write 1st byte of address to VDP chip
       DUP           SWPB,
           VDPWA @@  MOVB,   \ write 2nd byte of address to VDP chip
                     NOP,
;

\ Macro to convert integer to signed byte at compile time
: byte  ( n -- c )  00FF AND >< ;

CODE MOVERIGHT ( -- )
          R3 CLR,            \ for x=0
DECIMAL   R8 239 byte LI,    \ to 239
          R13 01 byte LI,    \ step 1
          BEGIN,
HEX          R0 4301 >< LI,   ( >< swaps bytes at compile time)
             R0 VDPWA,
             R3 VDPWD @@ MOVB,
             R13 R3 ADD,      \ next x
             R3 R8 CMP,
          EQ UNTIL,
          NEXT,
          ENDCODE

CODE MOVEDOWN
          R4 CLR,              \ for y=0
DECIMAL   R8 175 byte LI,      \ to 175
          BEGIN,
HEX          R0 4300 >< LI,
             R0 VDPWA,
             R4 VDPWD @@ MOVB,
             R13 R4 ADD,        \ next y
             R4 R8 CMP,
          EQ UNTIL,
          NEXT,
          ENDCODE

CODE MOVELEFT
DECIMAL   R3 239 byte LI,       \ for x=239 to 0
HEX       R13 -1 byte LI,       \ step -1
          BEGIN, 
             R0 4301 >< LI,
             R0 VDPWA,
             R3 VDPWD @@ MOVB,
             R13 R3 ADD,        \ next x
          EQ UNTIL,
          NEXT,
          ENDCODE

CODE MOVEUP
DECIMAL   R4 175 byte LI,        \ * for y=175 to 0
          R13 -1 byte LI,         \ step -1
          BEGIN,
HEX          R0 4300 >< LI,
             R0 VDPWA,
             R4 VDPWD @@ MOVB,
             R13 R4 ADD,       \ * next y
          EQ UNTIL,
          NEXT,
          ENDCODE

DECIMAL
: START
    CLEAR
\    Ascii   color  y x spr#
    [CHAR] *   2    1 1  0 SPRITE
    1 MAGNIFY
    100 0
    DO
       MOVERIGHT MOVEDOWN MOVELEFT MOVEUP
    LOOP ;

Tursi's Benchmark.mp4

+TheBF · April 26, 2019

By the way, for reference, when I did a straight translation to Forth Assembler of Tursi's original code it was about 6.2 seconds on my Classic99/PC combo.

With the loop limit in a register it came down to 5.91

The setup Forth code takes about 22mS measured with the 9901 timer so it's insignificant.

I never saw anything close to the published 5 seconds. Is this a hardware issue on my end?

+TheBF · May 3, 2019

Since I found a way to take some instructions out of CAMEL99 Forth direct threaded version I re-did some times on the Sieve of Eratosthenes.
Each Forth system ran the same code in the spoiler.

Again we can see TurboForth's strategy of installing a maximum number of Forth routines in 16bit RAM as the one to beat.

But using the BL instruction to assist in making direct threaded code shaved the time down from 7.25 to 7.1 seconds.
And the newest version of Camel Forth, where more of the primitive routines are written in Assembler improves the time versus regular Camel Forth by 20%.

I believe the FBFORTH performance is being held back by the way FIG Forth implements calling Forth's "inner interpreters" per this discussion:

http://http://www.bradrodriguez.com/papers/moving3.htm

Two approaches have been used in the past:
1. The fig-Forth solution. Fig-Forth reserved the first cell of
the Parameter Field to hold the address of the high-level code.
The DODOES routine then obtained the Parameter Field address,
pushed the address of the actual data (typically PFA+2) onto
the stack, fetched the address of the high-level routine,
and EXECUTEd.

Manual timings
1000 primes
X BASIC 14.3 secs
TI BASIC 11.9

10000 Primes
----------------------
TurboForth 8.0
CAMEL99 10.28
CAMEL99G 8.3
CAMEL99 DTC 7.1 (new version)
FBFORTH 15.0

\ Sieve of Erathosenes in Forth
\ Tested with TurboForth FBFORTH and CAMEL99 Forth

\ array calculators
: []@ ( n addr -- c )    + C@ ;
: []! ( n ndx addr -- )  + C! ;

: ERASE  ( addr n -- ) 0 FILL ;

HEX
: FREEMEM  ( -- n) FF00 HERE - ;
: ?MEM     ( n -- )  FREEMEM OVER < IF ." Out of memory"  ABORT THEN ;

: SEEPRIMES ( -- )
        CR ." Primes: "
        2 DO
            I HERE []@ 0= IF I . THEN
            ?TERMINAL IF ." Primes halted" ABORT THEN
        LOOP ;

\ byte array uses un-allocated memory at HERE
DECIMAL
: PRIMES ( n -- )
        ?MEM
        CR ." Running..."
        HERE OVER ERASE
        1 0 HERE []!       \ mark as prime like 'C' version
        1 1 HERE []!
        2                  \ start at 2
        BEGIN
           OVER OVER DUP * >
        WHILE
           DUP HERE []@ 0=
           IF  OVER OVER DUP *
               DO
                  1 I HERE []!
               DUP +LOOP
           THEN
           1+
        REPEAT
        CR ." Complete."
        CR
        DROP
        CR ." Press ENTER to see primes:" KEY 13 =
        IF   SEEPRIMES   THEN
;

Edited May 3, 2019 by TheBF

+Lee Stewart · May 3, 2019

The fbForth time will likely improve if you first remove the hook to the fbForth ISR by storing 0 at >83C4:

HEX 0 83C4 !

...lee

+TheBF · May 3, 2019

I will do that and edit the result Lee when I get back to my desk.

Willsy · May 3, 2019

Nice!

RXB · May 3, 2019

Youtube deleted most of my videos off my channel and no reasons why or what the problem is.....?

Youtube management are IDIOTS at best and the software they use to find Porn or Hate Speech is pure crap.

Example is Youtube deleted a PBS channel as Porn and it was Sesame Street?

Could Youtube write a worse junk program?

+TheBF · May 4, 2019

The fbForth time will likely improve if you first remove the hook to the fbForth ISR by storing 0 at >83C4:
HEX 0 83C4 !
...lee

Lee,

I tried this twice just now.

After FBForth started I got 15 seconds.

Then again after 0 83C4 ! I got 14.86.

Your interrupts are not taking very much time away from the CPU when they are idling it seems.

When I look at your inner interpreter I cannot see where the overhead is coming from?

+TheBF · December 4, 2020

One More Time!

I have been working on this Machine Forth concept that uses Forth like language to generate native inline code and/or sub-routines.

I had to put this new compiler on the board.

VDPLIB is Forth assembler but the VDP byte write is not fully optimized and uses an internal BL as well.

I don't think I can get it down to 5 seconds but given that the compiler is only 274 lines of Forth + 170 lines for the cross-assembler vs GCC with ???? Megabytes it's not to bad.

I will try changing the VDP byte write routine to a fully INLINE version and see what happens.

Language     First Pass    Optimized
GCC            15 sec         5 sec
Assembly       17 sec         5 sec
Machine Forth  17 sec         TBD
TurboForth     48 sec        29 sec
Compiled XB    51 sec       none yet
FbForth        70 sec        26 sec
GPL            80 sec       none yet
ABASIC        490 sec       none yet
XB           2000 sec      none yet
UCSD Pascal  7300 sec      780 sec

\ Tursi's Sprite demo for Machine Forth  B Fox  Dec 3 2020
INCLUDE DSK1.ELAPSE
NEEDS MFORTH  FROM DSK2.MFORTH

COMPILER: HEX
NEW. 2000 ORIGIN.

INCLUDE DSK2.VDPLIB
INCLUDE DSK2.SCREENIO
INCLUDE DSK2.SPRITES

HOST:
300 CONSTANT $300
301 CONSTANT $301

[CC] DECIMAL

TARGET: 
PROG: SPRITEDEMO
      GRAPHICS
      DELALL
      1 # MAGNIFY
( char  colr   x   y )
  42 #   1 #   0 # 0 #  SPRITE0
      100 #
      FOR
          0 # 240 # FOR  DUP $301 # VC! 1+ NEXT DROP
          0 # 176 # FOR  DUP $300 # VC! 1+ NEXT DROP
              239 # FOR    I $301 # VC!  NEXT
              175 # FOR    I $300 # VC!  NEXT
      NEXT

      TEXT
      NEXT,
END.

GDMike · December 4, 2020

I think I've got one of them asterisks on my floor....

But good numbers ? to see

+TheBF · December 4, 2020

1 hour ago, GDMike said:

I think I've got one of them asterisks on my floor....

But good numbers ? to see

I don't quite understand. You mean sprite is moving around on your floor?

Do you a have prescription for that stuff yer takin' ?

GDMike · December 4, 2020

No, I'm like, gees I wasn't expecting all those sprite's all of a sudden! On a big screen TV it's like jumping at ya.. funny

+Lee Stewart · December 4, 2020

On 5/4/2019 at 3:03 PM, TheBF said:

Lee, I tried this twice just now. After fbForth started I got 15 seconds. Then again after 0 83C4 ! I got 14.86. Your interrupts are not taking very much time away from the CPU when they are idling it seems. When I look at your inner interpreter I cannot see where the overhead is coming from?

From the fbForth 2.0 ISR in the spoiler below,

Spoiler


*       _____   ____         __  __      ___________ 
*      / _/ /  / __/__  ____/ /_/ /     /  _/ __/ _ \
*     / _/ _ \/ _// _ \/ __/ __/ _ \   _/ /_\ \/ , _/
*    /_//_.__/_/  \___/_/  \__/_//_/  /___/___/_/|_| 
*
;[*** Interrupt Service =======================================================
* This routine is executed for every interrupt.  It processes any pending
* speech and souind.  It then looks to see whether a user ISR is installed in 
* ISR.  If so, it sets up NEXT for execution of the user ISR.  This will work 
* only if the user has installed an ISR using the following steps in the fol-
* lowing order:
*
*   (1) Write an ISR with entry point, say MYISR.
*   (2) Determine code field address of MYISR with this high-level Forth:
*           ' MYISR CFA
* <<< Maybe need a word to do #3 >>>
*   (3) Write CFA of MYISR into user variable ISR.
*
* Steps (2)-(3) in high-level Forth are shown below:
*           ' MYISR CFA
*           ISR !
* 
* <<< Perhaps last step above should be by a word that disables interrupts >>>
*
* The console ISR branches to the contents of >83C4 because it is non-zero,
* with the address, INT1, of the fbForth ISR entry point below (also, the
* contents of INTLNK).  This means that the console ISR will branch to INT1
* with BL *R12 from WP = GPLWS (>83E0), R12 containing INT1 below to first
* process any pending speech and sound.
* 
* If the user's ISR is properly installed, the code that processes the user
* ISR modifies NEXT so that the very next time B *NEXT or B *R15 is executed
* from Forth's workspace (MAINWS), the code at INT2 will process the user's
* ISR just before branching to the normal NEXT entry ($NEXT) in fbForth's
* inner interpreter.
*** ==========================================================================

* ¡¡¡ MUST REMEMBER THAT WE ARE IN GPL WORKSPACE UPON ENTRY. !!!

INT1   
       LI   R0,BRSTK          load address of top of Branch Address Stack
* 
* Set up for pending speech
*
       MOV  @SPCSVC,*R0       save Speech service address onto Branch Stack
       JEQ  SNDCH1            jump to sound-check if no speech
       INCT R0                increment Branch Stack
*
* Set up for pending sound table #1 (ST#1)
*
SNDCH1 MOV  @SND1ST,R2        sound table ST#1 to service?
       JEQ  SNDCH2            process speech and sound if needed
       LI   R1,PLAYT1         load PLAYT1 address and...
       MOV  R1,*R0+           ...push it onto Branch Stack
* 
* Set up for pending sound table #2 (ST#2)
*
SNDCH2 MOV  @SND2ST,R3        sound table ST#2 to service?
       JEQ  PRCSPS            process speech and sound if needed
       LI   R1,PLAYT2         load PLAYT2 address and...
       MOV  R1,*R0+           ...push it onto Branch Stack
*
* Process sound stack if both sound tables idle
*
PRCSPS SOC  R2,R3             OR R2 and R3..both sound tables idle?
       JNE  PRSPS2            nope..skip sound stack processing
       LWPI SND1WS            switch to ST#1 WS
       CI   R4,SNDST0         anything on sound stack?
       JEQ  PRSPS1            no..exit sound stack processing
       DECT R4                pop sound stack position
       MOV  *R4,R2            get sound table address from sound stack
       INC  R0                kick off sound processing of ST#1 (R0=1)
PRSPS1 LWPI GPLWS             switch back to GPL WS
* 
* Check for any pending speech and sound
*
PRSPS2 CI   R0,BRSTK          any speech or sound to process?
       JEQ  USRISR            if not, jump to user ISR processing
       LI   R1,BNKRST         yup..load return address
       MOV  R1,*R0            push return address onto Branch Stack
* 
* Process pending speech and sound
*
       MOV  @MYBANK,@BANKSV   save bank at interrupt
       CLR  @>6002            switch to bank 2 for speech & sound services
       LI   R7,BRSTK          load top of Branch Stack
       MOV  *R7+,R8           pop speech/sound ISR
       B    *R8               service speech/sound
*
* Restore interrupted bank
*
BNKRST                ; return point for speech and sound ISRs
       MOV  @BANKSV,R0        restore bank at interrupt
       SRL  R0,13             get the bank# to correct position
       AI   R0,>6000          make it a real bank-switch address
       CLR  *R0               switch to the bank at interrupt
*
* Process User ISR if defined
*
USRISR MOV  @$ISR+$UVAR,R0     User ISR installed?
       JEQ  INTEX             
*
* Fix NEXT so that the user's ISR is processed the next time B *NEXT (B *R15)
* is executed from Forth's WS (MAINWS = >8300), which it does at the end of
* every CODE word, keyboard scan and one or two other places.
* 
       LI   R1,INT2                 Load entry point, INT2
       MOV  R1,@2*NEXT+MAINWS       Copy it to Forth's NEXT (R15)
* 
* The following 2 instructions are copies of the remainder of the console ROM's
* ISR (except that 'CLR R8' was removed because it is only needed by TI Basic)
* because we're not going back there!
* 
INTEX  LWPI >83C0             Change to console's ISR WS  
       RTWP                   Return to caller of console ISR
* 
* Branch through above-modified NEXT (R15) gets us here. NEXT will be restored
* before executing user's ISR. INT3 (cleanup routine below) will be inserted
* in address list to get us back here for cleanup after user's ISR has finished.
* User's ISR is executed at the end of this section just before INT3.
* 
INT2   LIMI 0                 Disable interrupts
       MOVB @>83D4,R0         Get copy of VR01
       SRL  R0,8              ...to LSB
       ORI  R0,>100           Set up for VR01
       ANDI R0,>FFDF          Clear VDP-interrupt-enable bit
       BLWP @VWTR             Turn off VDP interrupt
       LI   NEXT,$NEXT        Restore NEXT
       SETO @INTACT           Set Forth "pending interrupt" flag
       DECT R                 Set up return linkage by pushing 
       MOV  IP,*R             ...IP (R13, next Forth CFA) to return stack and
       LI   IP,INT3           ...setting IP to INT3 (below) for cleanup
       MOV  @$ISR(U),W        Do the user's Forth ISR by executing
       B    @DOEXEC           ...it through Forth's inner interpreter
* 
* Clean up and re-enable interrupts.
*
INT3   DATA INT3+2            $NEXT (or $SEMIS) puts INT3+2 in W (R10)
       DATA INT3+4            DOEXEC (or $SEMIS) will branch to *W = INT3+4 (next instr)
       MOV  *R+,IP            Start cleanup: pop IP from before call to user's ISR
       CLR  @INTACT           Clear Forth "pending interrupt" flag
       MOVB @>83D4,R0         Prepare to restore VR01 by...
       SRL  R0,8              ...moving payload to LSB (enabling VDP interrupt) and
       AI   R0,>100           ...VR # (01) to MSB
       MOVB @VDPSTA,R1        Remove pending VDP interrupt by reading VDP status
       BLWP @VWTR             Write VR01
       LIMI 2                 Re-enable interrupts
       B    *NEXT             Continue normal task
;]*

you can see that, with no speech, sound or user ISR to service, only the following code executes at every interrupt—and, as you probably know, fbForth’s normal state has interrupts on:

INT1   LI   R0,BRSTK          load address of top of Branch Address Stack
       MOV  @SPCSVC,*R0       save Speech service address onto Branch Stack
       JEQ  SNDCH1            jump to sound-check if no speech
*        ...
SNDCH1 MOV  @SND1ST,R2        sound table ST#1 to service?
       JEQ  SNDCH2            process speech and sound if needed
*        ...
SNDCH2 MOV  @SND2ST,R3        sound table ST#2 to service?
       JEQ  PRCSPS            process speech and sound if needed
*        ...
PRCSPS SOC  R2,R3             OR R2 and R3..both sound tables idle?
       JNE  PRSPS2            nope..skip sound stack processing
*        ...
PRSPS2 CI   R0,BRSTK          any speech or sound to process?
       JEQ  USRISR            if not, jump to user ISR processing
*        ...
USRISR MOV  @$ISR+$UVAR,R0     User ISR installed?
       JEQ  INTEX             
*        ...
INTEX  LWPI >83C0             Change to console's ISR WS  
       RTWP                   Return to caller of console ISR

...lee

+TheBF · December 4, 2020

3 minutes ago, Lee Stewart said:

From the fbForth 2.0 ISR in the spoiler below,

you can see that, with no speech, sound or user ISR to service, only the following code executes at every interrupt—and, as you probably know, fbForth’s normal state has interrupts on:


INT1   LI   R0,BRSTK          load address of top of Branch Address Stack
       MOV  @SPCSVC,*R0       save Speech service address onto Branch Stack
       JEQ  SNDCH1            jump to sound-check if no speech
*        ...
SNDCH1 MOV  @SND1ST,R2        sound table ST#1 to service?
       JEQ  SNDCH2            process speech and sound if needed
*        ...
SNDCH2 MOV  @SND2ST,R3        sound table ST#2 to service?
       JEQ  PRCSPS            process speech and sound if needed
*        ...
PRCSPS SOC  R2,R3             OR R2 and R3..both sound tables idle?
       JNE  PRSPS2            nope..skip sound stack processing
*        ...
PRSPS2 CI   R0,BRSTK          any speech or sound to process?
       JEQ  USRISR            if not, jump to user ISR processing
*        ...
USRISR MOV  @$ISR+$UVAR,R0     User ISR installed?
       JEQ  INTEX             
*        ...
INTEX  LWPI >83C0             Change to console's ISR WS  
       RTWP                   Return to caller of console ISR

...lee

Wow. FbForth is like a duck. There is a LOT of activity beneath the water.

I had know idea.

+TheBF · December 4, 2020

Well I can't match GCC but it's close.

What was changed:

Wrote a macro to inline the VDP byte write.
Used Forth to add the write bit to the VDP address and swap the bytes at compile time for the 1st byte write
Changed the compiler's FOR/NEXT loop to use R15 for counting. Previously it was in memory on the return stack in 8 bit space.
Added a LIMI instruction to the compiler so I don't have to deal interrupts in the VSBW
Added a Register load command to the compiler. Used it to pre-load R3 with the VDPWA port before the loop.

There are some clear optimizations I can make in future where Forth is refilling the top of stack cache register (DROP) and the following it up right after with a push the TOS register onto the stack. (DUP)

( push/pop optimization) I have the beginnings of that working but some conditions break it so it's not reliable.

Language     First Pass    Optimized
GCC            15 sec         5 sec
Assembly       17 sec         5 sec
Machine Forth  17 sec         7 sec
TurboForth     48 sec        29 sec
Compiled XB    51 sec       none yet
FbForth        70 sec        26 sec
GPL            80 sec       none yet
ABASIC        490 sec       none yet
XB           2000 sec      none yet
UCSD Pascal  7300 sec      780 sec

Spoiler


\ Tursi's Sprite demo for Machine Forth Optimized   B Fox  Dec 4 2020
NEEDS MFORTH  FROM DSK2.MFORTH

COMPILER: HEX
NEW. 2000 ORIGIN.

INCLUDE DSK2.VDPLIB
INCLUDE DSK2.SCREENIO
INCLUDE DSK2.SPRITES

COMPILER:   \ extend the compiler commands
HEX
H: VSBW ( n --) \ MACRO, compiles inline code
\ Trick: Use Forth to OR the write bit
\        and swap bytes the first time at compile time :-)
             R0 SWAP 4000 OR >< LI,
             R0 R3 ** MOVB,
             R0  SWPB,
             R0 R3 ** MOVB,
             TOS SWPB,
             TOS VDPWD @@ MOVB,
             TOS POP,
;H

\ macros that use a macro
H: SP0.X! ( n --) 301 VSBW ;H
H: SP0.Y! ( n --) 300 VSBW ;H

: LIMI ( n -- )  LIMI, ;
: REG! ( n -- )  TOS SWAP MOV, TOS POP,  ;

DECIMAL
TARGET:
PROG: SPRITEDEMO
      GRAPHICS
      DELALL
      1 # MAGNIFY
      42 #  1 #   0 # 0 # SPRITE0
      0 LIMI
      VDPWA # 3 REG!
      100 #
      FOR
          0 # 240 # FOR  DUP   SP0.X! 1+  NEXT DROP
          0 # 176 # FOR  DUP   SP0.Y! 1+  NEXT DROP
              239 # FOR    I   SP0.X!  NEXT
              175 # FOR    I   SP0.Y!  NEXT
      NEXT
      TEXT
      NEXT,
END.

HOST:

Benchmarking Languages

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members