
Benchmarking Languages


Tursi


I just realized, even though I've written it many times, that it's "Senior Falcon", and I have always read it to myself as "Señor Falcon"... ;)

The "Senior Falcon" by Carl Goldberg Models was my favorite radio control airplane back in the day. You might say it is my "Rosebud". (See the movie "Citizen Kane" if you don't understand the reference.)


:|

 

Sorry about any barriers. The statement can easily be taken out of context and doesn't stand on its own. I hope my English isn't gibberish and that my wording doesn't come across as too strong.

 

:)

I know you said before, English isn't your native language...... I always thought your native language was Extended Basic. :P


FWIW, the PCODE engine UCSD Pascal uses is basically a stack-based CPU. There are no registers, so on top of being interpreted, all variable accesses go through stack-relative addressing on the P-Machine. If I remember right, the Apple version is somewhere between 30% and 50% faster than Applesoft depending on what you are doing. That's similar to Applesoft BASIC compilers.
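
To make the "no registers, stack-relative addressing" point concrete, here is a minimal Python sketch of a stack machine. It is not the real P-Machine: the opcode names (PUSH_CONST, LOAD_LOCAL, STORE_LOCAL, ADD, HALT) and the frame layout are made up for illustration. Every operand passes through the stack and every variable access is an offset from a frame base, which is the extra work an interpreter of this kind pays for.

```python
# Toy stack machine: no registers, operands live on a stack, and local
# variables are addressed relative to a frame base.
# Opcode names are hypothetical, not real UCSD p-codes.

def run(program, n_locals=2):
    stack = [0] * n_locals      # the frame: locals sit at the bottom
    base = 0                    # frame base; a local lives at base + offset
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH_CONST":
            stack.append(arg)
        elif op == "LOAD_LOCAL":            # stack-relative read
            stack.append(stack[base + arg])
        elif op == "STORE_LOCAL":           # stack-relative write
            stack[base + arg] = stack.pop()
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack[base:base + n_locals]

# x := 3; y := x + 4
PROGRAM = [
    ("PUSH_CONST", 3), ("STORE_LOCAL", 0),
    ("LOAD_LOCAL", 0), ("PUSH_CONST", 4), ("ADD", None),
    ("STORE_LOCAL", 1), ("HALT", None),
]
locals_after = run(PROGRAM)
print(locals_after)
```

Even the two-statement program needs seven interpreted steps, each with its own dispatch, which gives a feel for where the time goes.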

So in the TI's case, Pascal is running on a virtual stack-based CPU coded in GPL, which is a software-based virtual CPU running on the 9900, a 3 MHz CPU!


And when all this is running on an emulator, which in turn runs in a virtual machine ...

 

What about writing a scripting engine in UCSD Pascal? ;)


UCSD Pascal really wasn't designed to work well on microcomputers. It was originally designed as a common environment for students using mini-computers.
You have a 16 bit instruction set, stack based data (no registers), a compiler with no significant optimization, and it's supposed to have 64K of RAM available.
The 9900 could interpret the PCode alright, but the TI-99/4A will have a lot of waitstates you wouldn't have if the VM used a register based processor.
I'm sure a native 9900 PMachine would be in the speed range of Forth interpreters measured in the benchmarks.

Just for comparison, the 6809 has a compiler for OS-9 that uses something like PCode they refer to as ICode.
BASIC-09 is sort of a cross between BASIC and Pascal in syntax, and even though it is interpreted like PCode, it runs about 90% faster than Extended Color BASIC.
That's much better than Apple Pascal vs Applesoft, and Apple Pascal is optimized more than most versions of UCSD Pascal.
Both Extended Color BASIC and Applesoft are Microsoft BASIC interpreters so the speed comparison should be somewhat accurate.
BASIC-09 also completed the Ahl benchmark in under 10 seconds if I remember right.

FWIW, the 6809 is one of the few microprocessors from the 70s that didn't get a version of UCSD Pascal even though it was more capable of running it due to 16 bit support, a separate user stack pointer, and stack relative addressing.


Actually, according to this article, there was a version of the UCSD Pascal interpreter for the 6809 available in 1979, and it even lists the corporate contact data for the company that wrote it.

Interesting, I knew there was a 6800 version but not a 6809 version.

There's even a review in 68 Micro Journal.

I guess I only looked for a disk image.


  • 1 year later...

I just found this thread, and I just got TI-Forth's sprite words ported to CAMEL99, so I thought I would add one more language dialect.

 

CAMEL99 Forth uses an architectural optimization: it caches the top of the Forth parameter stack in Register 4. This speeds up some operations and adds overhead to others. On balance it's about 8 to 10% faster.
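
As a rough illustration of why a cached top-of-stack helps some words and hurts others, the Python sketch below counts parameter-stack memory accesses per word in a simplified model. The per-word counts are assumptions made for the example, not measured CAMEL99 figures, and the word LIT simply stands for pushing a literal.

```python
# Parameter-stack memory accesses per word in a simplified model.
# The counts are illustrative assumptions, not CAMEL99 measurements.
COST = {
    # word: (stack kept entirely in memory, top of stack cached in a register)
    "+":    (3, 1),   # 2 reads + 1 write    vs  1 read of the second operand
    "DUP":  (2, 1),   # 1 read + 1 write     vs  1 write to spill the copy
    "LIT":  (1, 1),   # 1 write              vs  1 write to spill the old top
    "DROP": (0, 1),   # pointer bump only    vs  1 read to refill the register
}

def stack_accesses(words):
    """Return (plain, cached) memory access totals for a word sequence."""
    plain = sum(COST[w][0] for w in words)
    cached = sum(COST[w][1] for w in words)
    return plain, cached

arith_heavy = stack_accesses(["LIT", "LIT", "+", "DUP", "+"])
drop_heavy = stack_accesses(["LIT", "DROP", "LIT", "DROP"])
print("arithmetic-heavy:", arith_heavy)
print("drop-heavy:      ", drop_heavy)
```

In this model arithmetic-heavy code halves its stack traffic, while code that mostly pushes and drops does more work because the register has to be refilled after every DROP.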

 

I just took Willy's code and made some word changes for my dialect.

 

The optimization is similar to Lee's in that I optimized VC! (VDP character store) to remove the sub-routine call and renamed it VSBW.

And I used 1-! , which decrements a variable by 1, on the loop counter.

 

The GCC results are really impressive but so should they be after 20 years of improvements to that compiler.

After the AI bots start writing code, I guess we will all be out of work.

VARIABLE CNT

: ASTERISK  0028 107C 1028 0000 12A CHARDEF ;

HEX 300 CONSTANT $300
$300 1+ CONSTANT $301

DECIMAL 
: TEST
      GRAPHICS
      PAGE
      1 MAGNIFY
      ASTERISK
      0 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO   I $301 VC!     LOOP
           175 0 DO   I $300 VC!     LOOP
           0 239 DO   I $301 VC! -1 +LOOP
           0 175 DO   I $300 VC! -1 +LOOP
           -1 CNT +!
      REPEAT
;  ( 30 seconds)

( Optimized with VSBW and 1-! operators)
: TEST-OPT
      GRAPHICS
      PAGE
      1 MAGNIFY
      ASTERISK
      0 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO   I $301 VSBW     LOOP
           175 0 DO   I $300 VSBW     LOOP
           0 239 DO   I $301 VSBW  -1 +LOOP
           0 175 DO   I $300 VSBW  -1 +LOOP
           CNT 1-!
      REPEAT
;  ( 28.5 seconds)


All these indirect threaded Forths are speed limited by the address interpreter, so they will not go much faster.
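
For readers who have not met the term, here is a small Python model of what the address interpreter does per word, comparing indirect threading with direct threading. It only counts lookups; the names (CODE_FIELDS, ROUTINES, do_lit5 and so on) are hypothetical and nothing here is taken from a real Forth kernel.

```python
# Indirect threading: a thread cell names a word's code field, and the
# code field in turn names the machine-code routine (two lookups).
# Direct threading: a thread cell names the routine itself (one lookup).

def do_lit5(stack):
    stack.append(5)

def do_dup(stack):
    stack.append(stack[-1])

def do_mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)

ROUTINES = {"do_lit5": do_lit5, "do_dup": do_dup, "do_mul": do_mul}
CODE_FIELDS = {"LIT5": "do_lit5", "DUP": "do_dup", "*": "do_mul"}

def run_indirect(thread, stack):
    fetches = 0
    for word in thread:
        fetches += 1                    # fetch the next thread cell
        routine_name = CODE_FIELDS[word]
        fetches += 1                    # fetch the contents of the code field
        ROUTINES[routine_name](stack)
    return fetches

def run_direct(thread, stack):
    fetches = 0
    for routine_name in thread:
        fetches += 1                    # the cell already names the routine
        ROUTINES[routine_name](stack)
    return fetches

itc_stack, dtc_stack = [], []
itc_fetches = run_indirect(["LIT5", "DUP", "*"], itc_stack)
dtc_fetches = run_direct(["do_lit5", "do_dup", "do_mul"], dtc_stack)
print(itc_stack, itc_fetches, dtc_stack, dtc_fetches)
```

Both threads compute the same result; the indirect one pays one extra lookup per word. How much that matters in practice depends on how long each word spends in its own body.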

Edited by TheBF

  • 1 month later...

I would expect your optimized version there to look very much like the one I wrote. That the unoptimized version BEATS my unoptimized version is almost certainly because of the overhead of VSBW (which does multiple BLWPs). I wrote libti99 to do things quick and inline. ;)

 

How is GCC calling the routines?

If it is NOT calling them but expanding everything inline, then that is hardly a fair comparison, is it?


I re-wrote my Cross-compiler to generate direct threaded code and to test it out I re-visited the now famous Tursi Benchmark.

 

Here are the results of the optimized version, which only improved the speed by about 14% versus the Indirect threaded version.

VARIABLE CNT
HEX
 300 CONSTANT $300
 301 CONSTANT $301
DECIMAL 
: BENCH
      GRAPHICS
      1 MAGNIFY
      2 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO  I $301 VC!      LOOP
           175 0 DO  I $300 VC!      LOOP
           0 239 DO  I $301 VC!  -1 +LOOP
           0 175 DO  I $300 VC!  -1 +LOOP
          -1 CNT +!
      REPEAT
;  ( DTC 26.3 seconds versus 30 seconds with ITC compiler)
   ( DTC w/fast VC!  25.2)

Then I realized that this is not how you would do this in Forth. You would add something to the compiler to do the job.

So I wrote VDP_IC! , which is Forth naming for "VDP loop-index character store".

Here is the new word added to Forth. It takes 2 parameters, loops and VDP address.

 

*EDIT* fixed the obvious mistakes: the TOS -> R0 move and 0 LIMI were inside the loop and are now done once before it.

CODE: VDP_IC! ( loops Vaddr  -- )       \ Video CHAR store
              R3 POP,                   \ POP loop counter into R3
              TOS R0 MOV,             \ need address in R0 to call WMODE
              0 LIMI,
              BEGIN,
                R0 4000  ORI,           \ set control bits to write mode (01)
                R0 SWPB,                \ R0= VDP-adr, set the VDP WRITE address
                R0 VDPWA @@ MOVB,       \ send low byte of vdp ram write address
                R0 SWPB,
                R0 VDPWA @@ MOVB,       \ send high byte of vdp ram write address
                R3 R5 MOV,
                R5 SWPB,
                R5 VDPWD @@ MOVB,       \ write char to vdp data port

                R3 DEC,
              EQ UNTIL,
              TOS POP,                  \ refill TOS
              2 LIMI,
              NEXT,
              END-CODE
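
The address setup in the loop above (OR with >4000, then two byte writes to the address port) can be checked on a PC with a few lines of Python. This is only a sketch of the byte arithmetic: the function name is made up, and it assumes the usual 9918A convention of a 14-bit VDP address with the 01 write-mode bits in the top of the second byte.

```python
def vdp_write_address_bytes(vdp_addr):
    """Return the two bytes sent to the VDP address port, in sending order."""
    if not 0 <= vdp_addr <= 0x3FFF:
        raise ValueError("VDP RAM address must fit in 14 bits")
    word = vdp_addr | 0x4000        # same effect as  R0 4000 ORI,
    low = word & 0xFF               # sent first
    high = (word >> 8) & 0xFF       # sent second, carries the write-mode bits
    return low, high

bytes_for_300 = vdp_write_address_bytes(0x300)
bytes_for_301 = vdp_write_address_bytes(0x301)
print([hex(b) for b in bytes_for_300], [hex(b) for b in bytes_for_301])
```

For the two addresses used in the benchmark the high byte is the same and only the low byte differs.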

And so with this optimization which is more in keeping with how you do Forth here are the results.

: BENCH2
      GRAPHICS
      1 MAGNIFY
      2 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ WHILE
           240 $301 VDP_IC!
           175 $300 VDP_IC!
           239 $301 VDP_IC!
           175 $300 VDP_IC!
           CNT 1-!
      REPEAT
;  ( DTC 6.5 seconds, faster after the editing VDP_IC!)

Edited by TheBF

For reference here is the slow speed version.

VARIABLE CNT
HEX
 300 CONSTANT $300
 301 CONSTANT $301

DECIMAL 

: BENCH
      GRAPHICS
      1 MAGNIFY
      2 0 0 42 1 SPRITE
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO    I   0 0 LOCATE     LOOP
           175 0 DO  239   I 0 LOCATE      LOOP
           0 239 DO    I 175 0 LOCATE  -1 +LOOP
           0 175 DO    0   I 0 LOCATE  -1 +LOOP
          -1 CNT +!
      REPEAT
;  ( DTC 49 seconds)
Edited by TheBF

So here is an updated table

Language      First Pass   Optimized
GCC             15 sec        5 sec
Assembly        17 sec        5 sec
TurboForth      48 sec       29 sec
CAMEL99 DTC     49 sec        7 sec
Compiled XB     51 sec     none yet
FbForth         70 sec       26 sec
GPL             80 sec     none yet
ABASIC         490 sec     none yet
XB            2000 sec     none yet
UCSD Pascal   7300 sec      780 sec
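
For anyone who wants the ratios rather than the raw seconds, this short Python snippet computes the first-pass to optimized speedup for the rows in the table above that have both numbers. The figures are copied from the table; nothing is re-measured.

```python
# Seconds as (first pass, optimized), copied from the table above.
RESULTS = {
    "GCC": (15, 5),
    "Assembly": (17, 5),
    "TurboForth": (48, 29),
    "CAMEL99 DTC": (49, 7),
    "FbForth": (70, 26),
    "UCSD Pascal": (7300, 780),
}

# Speedup = first-pass time divided by optimized time.
speedups = {lang: round(first / opt, 1) for lang, (first, opt) in RESULTS.items()}

for lang, factor in sorted(speedups.items(), key=lambda item: -item[1]):
    print(f"{lang:12s} {factor:4.1f}x")
```

The absolute times still decide the ranking, of course; a large speedup factor mostly says how slow the first pass was.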
I am really out of the loop on this (pun intended ;) ), but the assembly code in VDP_IC! for communicating with the 9918A is really inefficient. If you are going for speed and reduced instruction count, improvements can be made here.

 


So in the TI's case, Pascal is running on a virtual stack-based CPU coded in GPL, which is a software-based virtual CPU running on the 9900, a 3 MHz CPU!

No, the p-code system on the TI implements a PME (p-machine emulator) in TMS 9900 native code, where some parts of the code are copied to scratch-pad RAM for optimal speed.

The Pascal compiler produces p-code, which is then interpreted by the PME.

There's no GPL at all involved in the p-code system. For some reason it seems a lot of people think so. It could be because of the hardware design of the p-code card.

 

When the p-system is running, the PME runs assembly code on the p-code card, in RAM at >8300 and in low memory expansion. There's 12 K ROM on the p-code card, in >4000 - >5FFF. The first 4 K are always the same, the last can switch between two banks.

Then there are 48 K GROM on the p-code card too. This is probably the source of the GPL confusion, but these GROM chips simply implement a GROM-disk. This disk, showing up as the OS: volume in the system, contains the entire operating system for the p-system. Here you find files like SYSTEM.PASCAL, SYSTEM.MISCINFO etc., which define how the p-system works. Having them on GROM is actually faster than the traditional implementation, where you read everything from a floppy disk. You can still change the system, in spite of it being in ROM, since if you make a new SYSTEM.CHARAC file and store that on a disk, you redefine the character set used by the system, for example. If such a file is in the system drive on startup, it will override the file that's fixed on the OS: volume.

There's also assembly code stored in GROM. This is code that's transferred to low memory expansion on startup, and to some other places in RAM as well. But it's only read during startup.

 

The reasons for having code spread out everywhere are different.

  • Code in RAM at >8300 is there for a speed reason. It's the inner core of the interpreter that runs here.
  • Code in RAM at >2000 - >3FFF is there among other reasons to be able to run when the p-code card is disabled. Since the card is in the expansion box, it must turn off when the computer needs access to RS232 or a disk controller, for example. The interrupt service used by the p-system also runs here.
  • Code in the p-code card saves space in RAM. The brunt of the PME runs here, and also some low-level intrinsics like MOVERIGHT and SCAN.

Apart from that, it's true that the PME implements a stack-based machine. It's also flexible enough to run p-code from CPU RAM, VDP RAM or p-code card GROM. A lot of the optimizations done within the system are to make all the features work within the space of a 32+16 K RAM machine, not to run at the highest possible speed.

When a procedure is called, an environment record is created on the stack. Inside this record, all local variables are allocated. There are special p-codes used to address a variable with a certain offset inside this environment record. This is of course less efficient than reading a memory location directly. There are also advantages with this approach, though, but you need to better understand all the capabilities of the system to appreciate them.

One thing is recursion. Since local variables are pushed on the stack when a procedure is invoked, and popped when you return from the procedure, only available memory limits how many times a procedure can call itself.
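
The environment-record idea can be mimicked in a few lines of Python. This is a loose sketch, not how the PME lays out its records: the function name and the dictionary used as a record are invented for the example. Each call pushes a fresh record holding its own locals and pops it on return, so recursion depth is bounded only by how many records fit.

```python
def factorial_with_frames(n):
    """Compute n! recursively while tracking explicit environment records."""
    frames = []          # stands in for the machine stack of records
    max_depth = 0

    def call(arg):
        nonlocal max_depth
        frames.append({"n": arg})            # record created on procedure entry
        max_depth = max(max_depth, len(frames))
        local = frames[-1]                   # locals are reached via the record
        if local["n"] <= 1:
            result = 1
        else:
            result = local["n"] * call(local["n"] - 1)
        frames.pop()                         # record released on return
        return result

    value = call(n)
    return value, max_depth, len(frames)

value, deepest, left_over = factorial_with_frames(5)
print(value, deepest, left_over)
```

Every active call has its own copy of n, which is exactly what a scheme with fixed memory locations for variables cannot offer.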

Another thing is memory management. Since only global variables are static, p-code programs are dynamically relocatable. As Pascal allows for program segmentation, which means that you can split a large program into segmented procedures, which only need to be resident in memory when they are actually running, code may be moved around in the system at runtime, if the system runs out of memory when attempting to call a procedure that's currently on disk only.

 

The UCSD p-system Pascal compiler also allows separate compilation, using units. That means you can have a library of functions you frequently use, like the sprite functions in this case, and make them available to you just by writing uses sprite in the code. Everything else is automatic. It's mentioned above that the p-system couldn't find the sprite unit unless the compiler disk was in drive #4:. That's because the system was started with the file SYSTEM.LIBRARY in that drive, and there's no library reference file active to tell the system where else to look for it. The p-system is flexible enough to have any number of separate libraries available, even on different drives, but if you have that, you have to write a text file that lists the libraries that are available, including the drive number or disk name where they reside, and register the name of this library map file inside the p-system.

 

Putting everything together, I've not found any language available to the 99/4A to be faster than Pascal. But then I count the whole software development process, not just the time it's executing. I'm not too frequently helped by a program that runs in seconds instead of minutes, if the first one takes weeks to develop and the second only days. The slow program will be ready long before the fast one anyway.

There are of course things that require faster speed than Pascal can support. Then it's good that linking assembly routines is fairly simple. I typically develop programs in Pascal only (provided I don't need something that can't be done in Pascal), and then, when I know it's working, convert time-critical routines to assembly if I feel it's worth the effort. Sometimes you can improve your own programs simply by using the special intrinsics that exist in UCSD Pascal so that the whole operating system could be written in Pascal.

 

In the specific case here, moving sprites around, I don't know what takes so much time. Maybe the sprite library, which is pretty flexible, has a speed penalty for that. I've hardly ever used sprites in Pascal, so I don't know. At least 99% of the time I've stayed in the default VDP mode, which is 40 characters wide, text only.

 

The p-system loads code in the primary code pool first, if there's space available. For this short program, there definitely is. The alternate code pool runs slightly faster, though. I presume you haven't tried changing the language type from pseudo to m_9900, have you? Doing that forces the system to load the code in the secondary code pool.

 

To wrap it up, the flexibility and operating system support the p-system offers is unique within the 99/4A scope. The largest application I've written for the 99/4A is a bit above 4000 source code lines. That's quite a lot for a system with only 48 K RAM, but it runs.


I am really out of the loop on this (pun intended ;) ), but that assembly code for communicating with the 9918A is really inefficient. If you are going for speed and reduced instruction count, improvements can be made here.

 

 

You are far more experienced than I. Can you explain where the roadblocks are without having to re-write it for me?


<snip>

 

Putting everything together, I've not found any language available to the 99/4A to be faster than Pascal. But then I count the whole software development process, not just the time it's executing. I'm not too frequently helped by a program that runs in seconds instead of minutes, if the first one takes weeks to develop and the second only days. The slow program will be ready long before the fast one anyway.

 

 

To wrap it up, the flexibility and operating system support the p-system offers is unique within the 99/4A scope. The largest application I've written for the 99/4A is a bit above 4000 source code lines. That's quite a lot for a system with only 48 K RAM, but it runs.

 

That is a really impressive piece of work. The fully re-locatable code is cool. I had no idea it was such a comprehensive package.

Does it include an editor?

 

4K lines of Pascal is a pretty big program. What did it do?

 

I would pose that using a fully integrated system like Turbo Forth or FBForth would be just as productive (once you grok the language).

Living inside the Forth REPL with the editor at your fingertips and the ability to interactively test every sub-routine, ASM fragment, variable, constant, memory location or data structure is extremely productive.

It has more front-end loading than Pascal of course, but the typical process is to write up a set of routines that elevates the programming level, very much like you would in Pascal, even though it is a low-level language out of the box.

 

When I was learning Forth I was surprised how some of Wirth's thinking and Charles Moore's thinking overlapped. Albeit one is a European academic and the other is a self professed "computer cowboy". :-)

(like building up the code from the bottom with no forward references, although both languages provided ways around that ideal)


When you buy the p-code card, you get just the operating system. What you can do then is run programs compiled by somebody else. That's it.

Then you add a disk with Editor, Filer (another word for disk manager) and various Utilities. Now you can edit source code, copy files and do various extra things, like changing the code type of code files.

Then you add a disk with the Pascal compiler. Now you can compile your Pascal source files into executable code files, or compile source written as separately compiled units.

Then you add a disk with the Assembler and Linker. Now you can assemble source files into object files. They are still not executable, though. For that you use the Linker. It will resolve the procedure declarations that are external in the Pascal program, and link them to the corresponding procedure declarations in the assembly program. Then you get a single executable code file, which contains both the Pascal and assembly programs you need.

 

Another interesting feature is that assembly programs can be assembled and linked not only to allow them to be relocatable, i.e. that you can load them at any convenient place in memory, but you can actually make them dynamically relocatable. If you do, the code loader in the OS will not only resolve addresses on loading, but also save a relocation table, which makes it possible to move the code in memory, should you encounter an issue where you run out of stack space, but can solve that by moving code closer to the heap, for example.
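
What a saved relocation table buys you can be shown with a small Python model. It is only a sketch under simple assumptions: memory is a list of cells, the table is a list of offsets whose cells hold absolute addresses, and the function name and the 901/902 "opcodes" are invented. The p-system's real table format is not described in this thread.

```python
def relocate(memory, start, length, new_start, reloc_offsets):
    """Move a code block and patch the cells that hold absolute addresses."""
    delta = new_start - start
    block = memory[start:start + length]         # slicing copies the block
    for offset in reloc_offsets:
        block[offset] += delta                   # patch each absolute address
    memory[start:start + length] = [0] * length  # free the old location
    memory[new_start:new_start + length] = block
    return delta

memory = [0] * 32
# Block loaded at address 4: cells 1 and 3 hold absolute addresses (6 and 4)
# that point back into the block itself.
memory[4:8] = [901, 6, 902, 4]
moved_by = relocate(memory, start=4, length=4, new_start=20, reloc_offsets=[1, 3])
print(memory[20:24], moved_by)
```

Without the table the loader could not tell which cells are addresses and which are plain data, so the code could be placed once but never moved afterwards.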

 

It's the fact that you can have things like this done by the system, without having to do anything more than declaring the assembly procedure as RELPROC instead of PROC, that makes the p-system so powerful. If you want to do that in Forth, you have to write the whole mechanism for it first. You can chain programs in Extended BASIC, so that you can run larger program than fit in memory, but you have to control the mechanism yourself. With Pascal, you just add segment to procedures you know you don't always need, and then the operating system will roll them in and out of memory, on a need-to-do basis.

 

The 4000 line Pascal program was used to compute the correct piping sizes for dust evacuation systems. Such systems are for example used to get sawdust out of a sawmill. They can become complex enough that a decent manual calculation of pipe sizes could consume a week from an engineer. If you then needed to change anything, you spent several days updating the calculations, since everything affected other things.

After keying in the basic layout of such a system, a task which perhaps took a few hours, you could change a value and recalculate the whole thing in two minutes with the 99/4A. And you got the same result for the same system each time. As the calculations include boring iterative algorithms, where you calculate a possible range of pipe sizes, then test which one works best, someone who did that manually would take shortcuts and get different results for the same system each time he did the calculation.

An indication of the portability of the UCSD p-system is that I later ported this to Turbo Pascal 4.0 for use under DOS on a PC. Except for a few system related things, everything was a carbon copy of the program in the TI 99/4A. And it worked.


Cool program. Thanks for the description.

That portability is truly impressive. I spent a few years maintaining and upgrading a large factory testing program written in Turbo Pascal. I always felt comfortable with it.

I don't remember how big it was but I know that after a year of working on it for the factory users I filled the memory and had to switch to huge memory model with pointers.

It was pretty trivial to make the change. Wirth's ideas are very sound for sure.

 

The funny thing is some of the stuff you are talking about with these complex compilers are problems created by the complex environment itself.

By letting the machine control things you then have to provide ways for the machine to fix things too.

It becomes a bit of a pyramid.

 

One of the things that shocks me about Forth is that by exposing everything to the programmer, so much complexity melts away.

 

Example for re-locatable code:

Code goes where ever the variable DP (dictionary pointer) is set.

Normally this is done by the compiler automagically as you add code to the system.

But you are free to do whatever you want with it.

The Forth dictionary of routine names is the lookup table so that is also covered.

\ compile code in TI low memory >2E00
  HEX 2E00 DP !   \ set the dictionary pointer
: LOWMEM     CR ." Hello low memory!" ;

\ compile this in high memory >E000
  HEX E000 DP !
: HiMEM     CR ." Hello high memory!" ;

\ code will continue to compile here until you move DP again

When you expose the guts to the human and the system is so simple it's remarkable what you can do.

 

However the learning curve is more akin to learning an operating system than a language so longer ramp up time for sure.

But lots of fun nevertheless.

 

Thanks again for giving us all the in depth insights on the P-code system. It's remarkable that they stuffed all that into the TI-99.

Makes one imagine what could have been... :-)
