Benchmarking Languages

Tursi · January 22, 2016

While you're at it, how do you think GPL will compare with p-code? I have a very soft spot in my heart for Pascal, and would eventually like to develop a program for it on the TI, but I am concerned about its performance...

I couldn't say, I've never run it beyond the work I did debugging Classic99. In all that the best I did was assemble some Hello World programs...

For benchmarking languages, really... just write comparable programs. Trying to compare languages and implementations was always a battle, even back in the day, since the algorithm matters, which parts of the language you touch matter, which parts of the hardware you need to use matter, etc. But off the top of my head, a good quick one for the TI might be something like manually moving a sprite around the outer edge of the screen, one pixel at a time (no auto-motion). See how fast you can get it whipping around. Make it loop 100 times and then exit, so that you can time the total runtime.

Starting with the simple in XB...

100 CALL CLEAR
110 CALL MAGNIFY(2)
120 CALL SPRITE(#1,42,2,1,1)
130 CNT=100
140 FOR X=1 TO 240 :: CALL LOCATE(#1,1,X):: NEXT X
150 FOR Y=1 TO 176 :: CALL LOCATE(#1,Y,240):: NEXT Y
160 FOR X=240 TO 1 STEP -1 :: CALL LOCATE(#1,176,X):: NEXT X
170 FOR Y=176 TO 1 STEP -1 :: CALL LOCATE(#1,Y,1):: NEXT Y
180 CNT=CNT-1 :: IF CNT>0 THEN 140
190 END
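As a rough cross-check of how much work the listing above does, here is a small Python model of the four FOR loops. It is only a sketch of the loop bounds in lines 140-170 (the helper name `lap_positions` is made up for illustration); it counts CALL LOCATE invocations and shows that each corner is positioned twice per lap.

```python
def lap_positions():
    """Yield the (row, col) pair passed to CALL LOCATE for one lap, in order."""
    for x in range(1, 241):          # line 140: top edge, left to right
        yield (1, x)
    for y in range(1, 177):          # line 150: right edge, top to bottom
        yield (y, 240)
    for x in range(240, 0, -1):      # line 160: bottom edge, right to left
        yield (176, x)
    for y in range(176, 0, -1):      # line 170: left edge, bottom to top
        yield (y, 1)

lap = list(lap_positions())
calls_per_lap = len(lap)             # 240 + 176 + 240 + 176
total_calls = calls_per_lap * 100    # CNT=100 laps
distinct = len(set(lap))             # the four corners are each visited twice

print(calls_per_lap, total_calls, distinct)
```

So the whole run boils down to 83,200 single-coordinate sprite updates, which is the figure to keep in mind when reading the timings.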

The ASM and TurboForth versions follow.

ASM version:

* assumes startup from Editor/Assembler

  DEF START
  REF VDPWA,VDPWD
  
* make it work as EA5 if desired
  B @START

START
* call clear
  li r0,>0040     * write address >0000
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
  li r1,>2000
  li r2,768
lp1
  movb r1,@VDPWD
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>c181     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  
* call sprite(#1,42,2,1,1)
  li r0,>0186     * vdp register 6 to >01 (sprite descriptor table to >0800)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA

  li r0,>0043     * write address >0300
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
 
  li r0,>002A     * 1,1 (minus 1) and 42
  movb r0,@VDPWD
  nop
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
  li r0,>01d0     * color 2 (-1) and list terminator
  movb r0,@VDPWD
  swpb r0
  movb r0,@VDPWD
  
* cnt=100
  li r5,100
  
l140

* for x=1 to 240 (minus 1 for asm)
  clr r3
xlp1

* call locate(#1,1,x)
  li r0,>0143     * write address >0301 (X pos)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  nop
  movb r3,@VDPWD  
  
* next x
  ai r3,>0100
  ci r3,>f000
  jne xlp1
  
* for y=1 to 176
  clr r4
ylp1

* call locate(#1,y,240)
  li r0,>0043     * write address >0300 (Y pos)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  nop
  movb r4,@VDPWD
  
* next y
  ai r4,>0100
  ci r4,>b000
  jne ylp1
  
* for x=240 to 1 step -1
  li r3,>ef00
xlp2

* call locate(#1,176,x)
  li r0,>0143     * write address >0301 (X pos)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  nop
  movb r3,@VDPWD  
  
* next x
  ai r3,>ff00
  ci r3,>ff00
  jne xlp2
  
* for y=176 to 1 step -1
  li r4,>af00
  
ylp2
* call locate(#1,y,1)
  li r0,>0043     * write address >0300 (Y pos)
  movb r0,@VDPWA
  swpb r0
  movb r0,@VDPWA
  nop
  movb r4,@VDPWD
  
* next y
  ai r4,>ff00
  ci r4,>ff00
  jne ylp2
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  blwp @>0000
  
  end
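The magic constants loaded into R0 in the listing above (>0040, >0043, >0143, >c181, >0186) follow from how the VDP address port is fed: two bytes per command, sent with MOVB / SWPB / MOVB, so the first byte to go out sits in the high half of the register. A small Python sketch of that encoding (the function names are hypothetical, chosen only for this illustration):

```python
def vdp_write_addr_word(addr):
    """Word for R0 so MOVB/SWPB/MOVB sends the low address byte, then high byte | >40."""
    assert 0 <= addr < 0x4000
    low = addr & 0xFF
    high = (addr >> 8) | 0x40        # >40 flags "set address for writing"
    return (low << 8) | high

def vdp_reg_word(reg, value):
    """Word for R0 for a register write: data byte first, then >80 | register number."""
    assert 0 <= reg < 8 and 0 <= value < 0x100
    return (value << 8) | 0x80 | reg

for addr in (0x0000, 0x0300, 0x0301):
    print(f">{addr:04X} -> li r0,>{vdp_write_addr_word(addr):04X}")
print(f"reg 1 = >C1 -> li r0,>{vdp_reg_word(1, 0xC1):04X}")
print(f"reg 6 = >01 -> li r0,>{vdp_reg_word(6, 0x01):04X}")
```

This reproduces the constants used in the assembly version, which makes it easier to port the listing to a different sprite attribute address.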

TurboForth:

VARIABLE cnt

hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal

: test
1 gmode
page
1 magnify
asterisk
0 0 0 42 1 sprite
100 dup cnt !
begin while
  239 0 do 0 0 I sprloc loop
  175 0 do 0 I 239 sprloc loop
  0 239 do 0 175 I sprloc -1 +loop
  0 175 do 0 I 0 sprloc -1 +loop
  cnt @ 1- dup cnt !
repeat
bye
;

If porting - note how the corners overlap for one frame each! (For example, the X loop positions at 1,240, and then the Y loop ALSO positions at 1,240).

Alllllso, for XB you might want to only time one lap and multiply it by 100.

My timings for the above test come out like so:

XB (estimated): 2000 seconds (33 mins)

Assembly (8-bit code): 7 seconds

TurboForth: 48 seconds
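To put those totals in per-operation terms, here is a small Python sketch. It assumes the update count implied by the XB loop bounds (240 + 176 + 240 + 176 LOCATE calls per lap, 100 laps); `estimate_total` is a made-up helper that mirrors the "time one lap and multiply by 100" suggestion.

```python
LAPS = 100
CALLS_PER_LAP = 240 + 176 + 240 + 176    # from the XB FOR loop bounds
TOTAL_CALLS = LAPS * CALLS_PER_LAP

def estimate_total(lap_seconds, laps=LAPS):
    """Extrapolate a full run from the time of a single lap."""
    return lap_seconds * laps

# Totals as reported above, in seconds.
run_seconds = {"XB (estimated)": 2000, "Assembly": 7, "TurboForth": 48}

# Average cost of one sprite position update, in milliseconds.
per_call_ms = {name: 1000 * t / TOTAL_CALLS for name, t in run_seconds.items()}

for name, ms in per_call_ms.items():
    print(f"{name:16s} {ms:8.3f} ms per update")
print(estimate_total(20.0))              # a 20 s lap extrapolates to 2000 s
```

Read this way, the XB estimate corresponds to roughly 24 ms per CALL LOCATE, against well under a millisecond for the other two.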

I attempted a UCSD Pascal version, but it kept saying it couldn't find the library on 'USES SPRITE' when I tried to compile, so I gave up... and I'm out of time for the GPL version.

Willsy · January 22, 2016

Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version, i.e. use V! to poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but on the other hand SPRLOC and friends actually update a copy of the sprite attribute list in CPU RAM and copy portions of it to VDP, so there's a lot going on under the covers.

Edited January 22, 2016 by Willsy

sometimes99er · January 22, 2016

Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version, i.e. use V! to poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but ...

It would then only be fair that time is spent to make the 2 other implementations faster.

Willsy · January 22, 2016

Yes of course!

Willsy · January 22, 2016

This one is based on Tursi's code, but pokes VDP directly. Some other little optimisations:

VARIABLE cnt
 
hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal
 
: test
1 gmode
page
1 magnify
asterisk
0 0 0 42 1 sprite
100 cnt !
begin
  cnt @ 0> while
  239 0 do i $301 v! loop
  175 0 do i $300 v! loop
  0 239 do i $301 v! -1 +loop
  0 175 do i $300 v! -1 +loop
  -1 cnt +!
repeat
bye
;

and here's one that removes the need for a variable:

hex
: asterisk DATA 4 0028 107C 1028 0000 12a dchar ;
decimal
 
: test
    1 gmode
    page
    1 magnify
    asterisk
    0 0 0 42 1 sprite
    100 0 do
      239 0 do i $301 v! loop
      175 0 do i $300 v! loop
      0 239 do i $301 v! -1 +loop
      0 175 do i $300 v! -1 +loop
    loop
    bye 
;

Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB.

Rock on!

Edited January 22, 2016 by Willsy

+InsaneMultitasker · January 22, 2016

and here's one that removes the need for a variable: Both of them take 29 seconds. So that's 3.6 times slower than assembler.

Rock on!

So it's about one 'forth' as fast?

Willsy · January 22, 2016

So it's about one 'forth' as fast?

Ha ha yes!

sometimes99er · January 22, 2016

Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB.

So the newer language, with quite a few updates, gets optimized by its creator, and is then compared with the unoptimized versions.

Now let's compile the XB and have the ASM run on the GPU. It can be done. :-D

Willsy · January 22, 2016

Well not really. I was just seeing if I could improve tursi's time of 48 seconds.

sometimes99er · January 22, 2016

Well not really. I was just seeing if I could improve tursi's time of 48 seconds.

Wow. Sure looks like you did compare them:

Both of them take 29 seconds. So that's 3.6 times slower than assembler and 69 times faster than XB.

As Tursi said

Trying to compare languages and implementations was always a battle, ...

lucien2 · January 22, 2016

GPL: 80 seconds

When we compared TF and GPL with the bricks demo 4 1/2 years ago they were closer.

	grom	>6000
	data	>aa00,>0100,>0000
	data	menu
	data	>0000,>0000,>0000,>0000
menu	data	>0000
	data	start
	stri	'BENCHMARK'

upcase	equ	>0018
x	equ	arg
y	equ	arg+1
xy	equ	arg
cnt	equ	arg+2

start
* magnify 2
	st	>e1,@arg
	move	1,@arg,#1
* load uppercase character set
	dst	>0900,@fac
	call	upcase
* copy asterisk pattern to sprite char 0
	move	8,v@42*8+>800,v@>400
* define sprite 0 to character 0, color black	
	dst	>8001,v@>302
* locate sprite 0 to 1,1
	dst	>0000,v@>300
	
	st	100,@cnt
	
L5	clr	@x
L1	st	@x,v@>301
	inc	@x
	ch	239,@x
	br	L1
	
	clr	@y
L2	st	@y,v@>300
	inc	@y
	ch	175,@y
	br	L2
	
	st	239,@x
L3	st	@x,v@>301
	dec	@x
	ceq	255,@x
	br	L3
	
	st	175,@y
L4	st	@y,v@>300
	dec	@y
	ceq	255,@y
	br	L4

	dec	@cnt
	cz	@cnt
	br	L5

	exit

Tursi · January 23, 2016

Right.

Hmm.... it's academic but you might be able to make TF go faster by making it more like the assembly version, i.e. use V! to poke VDP memory. I'll have a look this evening and see if it'll be any faster. I was disappointed when I saw 48 seconds, but on the other hand SPRLOC and friends actually update a copy of the sprite attribute list in CPU RAM and copy portions of it to VDP, so there's a lot going on under the covers.

Yeah, what I was trying to do was use the language's features. The intent was to compare to the baseline Extended BASIC code; once you start bypassing the language it becomes a debate whether it's a sensible comparison. But the assembly version can be sped up with registers and scratchpad without changing the structure (also, the workspace was in 8-bit RAM, so I moved that too; that was actually a bug, as I never intended to leave the workspace outside scratchpad):

* assumes startup from Editor/Assembler

  DEF START
  REF VDPWA,VDPWD
  
* make it work as EA5 if desired
  B @START

START
* performance
  lwpi >8300
  li r6,VDPWA
  li r7,VDPWD
  li r8,>0043
  li r9,>0143
  li r10,>ff00
  li r11,>0100
  li r12,>f000
  li r13,>b000
  li r0,l140
  li r1,>8320
sclp
  mov *r0+,*r1+
  ci r1,>8400
  jne sclp

* call clear
  li r0,>0040     * write address >0000
  movb r0,*R6
  swpb r0
  movb r0,*R6
  
  li r1,>2000
  li r2,768
lp1
  movb r1,*R7
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>c181     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  movb r0,*R6
  swpb r0
  movb r0,*R6
  
* call sprite(#1,42,2,1,1)
  li r0,>0186     * vdp register 6 to >01 (sprite descriptor table to >0800)
  movb r0,*R6
  swpb r0
  movb r0,*R6

  mov r8,r0      * write address >0300
  movb r0,*R6
  swpb r0
  movb r0,*R6
 
  li r0,>002A     * 1,1 (minus 1) and 42
  movb r0,*R7
  nop
  movb r0,*R7
  swpb r0
  movb r0,*R7
  
  li r0,>01d0     * color 2 (-1) and list terminator
  movb r0,*R7
  swpb r0
  movb r0,*R7
  
* cnt=100
  li r5,100
  
  b @>8320
  
l140

* for x=1 to 240 (minus 1 for asm)
  clr r3
xlp1

* call locate(#1,1,x)
  mov r9,r0       * write address >0301 (X pos)
  movb r0,*R6
  swpb r0
  movb r0,*R6
  nop
  movb r3,*R7  
  
* next x
  a r11,r3
  c r12,r3
  jne xlp1
  
* for y=1 to 176
  clr r4
ylp1

* call locate(#1,y,240)
  mov r8,r0       * write address >0300 (Y pos)
  movb r0,*R6
  swpb r0
  movb r0,*R6
  nop
  movb r4,*R7
  
* next y
  a r11,r4
  c r13,r4
  jne ylp1
  
* for x=240 to 1 step -1
  li r3,>ef00
xlp2

* call locate(#1,176,x)
  mov r9,r0      * write address >0301 (X pos)
  movb r0,*R6
  swpb r0
  movb r0,*R6
  nop
  movb r3,*R7  
  
* next x
  a r10,r3
  c r10,r3
  jne xlp2
  
* for y=176 to 1 step -1
  li r4,>af00
  
ylp2
* call locate(#1,y,1)
  mov r8,r0       * write address >0300 (Y pos)
  movb r0,*R6
  swpb r0
  movb r0,*R6
  nop
  movb r4,*R7
  
* next y
  a r10,r4
  c r10,r4
  jne ylp2
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  blwp @>0000
  
  end

That gets it down to 4.5 seconds - and it's the scratchpad workspace that makes most of the difference (1.5s)... running this code in scratchpad only saved about 1s. Since it spends all its time writing to VDP this program is multiplexer bound. So we'll round up for the table and say 5s.

All that said, I totally get the desire to optimize and there's no actual cheating in the TF version directly hitting VDP RAM, since it's built in. If XB had the ability to VPOKE we could try it there -- maybe an RXB version to see if it's faster.

GPL: 80 seconds

Thanks Lucien! I was hoping someone would take that on. Looks pretty good!

I'll split up first pass and optimized times to be fair - barring extreme bugs the first pass may be how someone new to the language would write it, optimized will be any interested party's best time (without changing the output of the program).

To be fair there, I've retimed the assembly version using VSBW etc, since that's how a new assembly programmer would normally start. That actually takes 17 seconds!

* assumes startup from Editor/Assembler
* slower version

  DEF START
  REF VSBW,VWTR,VMBW
  
* make it work as EA5 if desired
  B @START
  
sprdat
  data >0000,>2A01,>d000

START

* call clear
  clr r0
  li r1,>2000
  li r2,768
lp1
  blwp @vsbw
  inc r0
  dec r2
  jne lp1
  
* call magnify(2)
  li r0,>01c1     * write VDP register 1 with >C1 (16k,enable, no int, double-size sprites)
  blwp @vwtr
  
* call sprite(#1,42,2,1,1)
  li r0,>0601     * vdp register 6 to >01 (sprite descriptor table to >0800)
  blwp @vwtr

  li r0,>0300     * write address >0300
  li r1,sprdat    * sprite table
  li r2,5
  blwp @vmbw
  
* cnt=100
  li r5,100
  
l140

* for x=1 to 240 (minus 1 for asm)
  clr r3
xlp1

* call locate(#1,1,x)
  li r0,>0301    * write address >0301 (X pos)
  movb r3,r1
  blwp @vsbw
  
* next x
  ai r3,>0100
  ci r3,>f000
  jne xlp1
  
* for y=1 to 176
  clr r4
ylp1

* call locate(#1,y,240)
  li r0,>0300     * write address >0300 (Y pos)
  movb r4,r1
  blwp @vsbw
  
* next y
  ai r4,>0100
  ci r4,>b000
  jne ylp1
  
* for x=240 to 1 step -1
  li r3,>ef00
xlp2

* call locate(#1,176,x)
  li r0,>0301     * write address >0301 (X pos)
  movb r3,r1
  blwp @vsbw
  
* next x
  ai r3,>ff00
  ci r3,>ff00
  jne xlp2
  
* for y=176 to 1 step -1
  li r4,>af00
  
ylp2
* call locate(#1,y,1)
  li r0,>0300     * write address >0300 (Y pos)
  movb r4,r1
  blwp @vsbw
  
* next y
  ai r4,>ff00
  ci r4,>ff00
  jne ylp2
  
* cnt=cnt-1
  dec r5
  jne l140
  
* end
  blwp @>0000
  
  end

So we have:

Language   First Pass    Optimized
Assembly     17 sec         5 sec
TurboForth   48 sec        29 sec
GPL          80 sec       none yet
XB         2000 sec       none yet

Frankly it's looking good for all of them so far versus XB.

Willsy · January 23, 2016

Ah. I see. Yes I think that's fair and I see what Sometimes was saying now.

+InsaneMultitasker · January 23, 2016

For giggles I typed the program into Myarc's Advanced BASIC for the Geneve. It took approximately 8.2 minutes (490 or so seconds) to complete. Considering this BASIC is written in assembly (no GPL) I would have expected it to be a bit faster. I wonder if some of the sluggishness in both XB and ABASIC isn't related to all the floating point manipulation.

senior_falcon · January 23, 2016

51 seconds for compiled XB 8 bit bus

37 seconds for compiled XB 16 bit bus

Willsy · January 23, 2016

51 seconds for compiled XB 8 bit bus

37 seconds for compiled XB 16 bit bus

Wow that's really good!

Asmusr · January 23, 2016

This seems like a straightforward benchmark, but what does it actually mean to move a sprite around at a rate faster than 1/60s, resulting in visual frames being skipped?
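To make the frame-skipping point concrete, here is a rough Python calculation of how many sprite position writes land inside one video frame. It assumes a 60 Hz display and the write count implied by the XB loop bounds; the run times are the ones reported earlier in the thread.

```python
FRAME_RATE_HZ = 60                         # assumption: NTSC console
UPDATES = 100 * (240 + 176 + 240 + 176)    # position writes in the whole run

run_seconds = {
    "XB": 2000,
    "GPL": 80,
    "TurboForth (optimized)": 29,
    "Assembly (optimized)": 5,
}

updates_per_frame = {
    name: UPDATES / seconds / FRAME_RATE_HZ
    for name, seconds in run_seconds.items()
}

for name, n in updates_per_frame.items():
    print(f"{name:24s} {n:7.1f} updates per frame")
```

Under these assumptions only XB stays below one update per frame; every other version rewrites the position many times between screen refreshes, so most positions are never displayed.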

globeron · January 23, 2016

I think I have the software somewhere (it was in Tijdingen, the TI-GG NL magazine, in the '80s), but there was this fun thing where changing the screen color continuously generated a kind of moving bars on the screen. I think it only works on CRT televisions (50 Hz/60 Hz), as I tried it on an LCD and did not see it happening.

It is very simple, something like

100 CALL SCREEN(4)
110 CALL SCREEN(5)
120 GOTO 100

The difference between languages was that the number of stripes increased with speed (e.g. BASIC had 2 or 3 large bars alternating, TP99 had several small stripes, and Assembler switched colours very fast).

Not sure if it is a good benchmark to compare languages, but it was visual. I just tried in Classic99, but here colours switch fast.

Retrospect · January 23, 2016

I didn't think BASIC on a TI would be able to do the raster CRT bars! ... cuz it uses CALLs which, I recently read, are one of the reasons for the slow speed. I did this trick on a Spectrum though.

Asmusr · January 23, 2016

If it doesn't work, it is because of the emulator: in some emulators the screen is drawn too fast or is not drawn concurrently with the CPU (it does work in MESS). You should always get some type of raster bars on the hardware if you change the background color at random intervals (and are not timing it with the vertical refresh). It has nothing to do with CRT vs LCD AFAIK. The problem on the TI is keeping the bars steady because the clocks of the CPU and the VDP are not synchronized. The only way I'm aware of to get a stable raster effect is to use the 5th sprite flag to measure when the VDP is reaching a specific scan line.

Edit: sorry for polluting this thread, the benchmark is fine as long as you realize it's basically about how fast you can update one VDP RAM byte with increasing or decreasing values.

+Lee Stewart · January 24, 2016

Here are the fbForth equivalents(?) of the two TurboForth sprite runs.

First pass:

HEX
064 VARIABLE CNT
: TEST
   GRAPHICS
   PAGE
   1 MAGNIFY
   0 0 1 02A 0 SPRITE
   BEGIN
      CNT @
   WHILE
      0EF 0 DO I 0 0 SPRPUT LOOP
      0AF 0 DO 0EF I 0 SPRPUT LOOP
      0 0EF DO I 0AF 0 SPRPUT -1 +LOOP
      0 0AF DO 0 I 0 SPRPUT -1 +LOOP
      -1 CNT +!
   REPEAT
   MON
;

and port of the TF optimized pass:

HEX
: TEST
   GRAPHICS
   PAGE
   1 MAGNIFY
   0 0 1 02A 0 SPRITE
   064 0 DO
      0EF 0 DO I 301 VSBW LOOP
      0AF 0 DO I 300 VSBW LOOP
      0 0EF DO I 301 VSBW -1 +LOOP
      0 0AF DO I 300 VSBW -1 +LOOP
   LOOP
   MON
;
DECIMAL

The first took 70 seconds and the second took 58 seconds.

I might be able to optimize further, but fbForth cannot really compete with the scratchpad-optimized words of TurboForth that run on the 16-bit bus.

...lee

Edited April 22, 2022 by Lee Stewart
Prettified the code

Tursi · January 24, 2016

Thanks for the continued updates folks! I'm finding this pretty interesting.

And yeah, the output to the screen is irrelevant, it's just about taking a normal operation to hardware (moving a sprite) and using it to benchmark the performance of the language. This is certainly not comprehensive, but I wanted something that was quick to implement and still at least somewhat real-world.

So what I see so far:

Language   First Pass    Optimized
Assembly     17 sec         5 sec
TurboForth   48 sec        29 sec
Compiled XB  51 sec        37 sec
fbForth      70 sec        58 sec
GPL          80 sec       none yet
ABASIC      490 sec       none yet
XB         2000 sec       none yet

(I included ABASIC although I don't know if it's a fair comparison since it's a different computer!)
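For anyone who prefers ratios to raw seconds, this short Python sketch turns the table into speedups relative to interpreted XB. The numbers are copied from the table; nothing here is newly measured.

```python
XB_SECONDS = 2000

# (first pass, optimized) in seconds; None where no optimized time was posted.
table = {
    "Assembly":    (17, 5),
    "TurboForth":  (48, 29),
    "Compiled XB": (51, 37),
    "fbForth":     (70, 58),
    "GPL":         (80, None),
    "ABASIC":      (490, None),
}

def speedup(seconds):
    """How many times faster than the 2000 s XB run; None if no time given."""
    return None if seconds is None else XB_SECONDS / seconds

speedups = {name: (speedup(a), speedup(b)) for name, (a, b) in table.items()}

for name, (first, best) in speedups.items():
    best_text = "none yet" if best is None else f"{best:6.1f}x"
    print(f"{name:12s} {first:6.1f}x   {best_text}")
```

On these figures every entry except ABASIC is at least 25 times faster than XB, and the optimized assembly version is 400 times faster.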

Tursi · January 24, 2016

The original question was 'how does GPL compare?'... to be honest I'm surprised. While it is the slowest (non-BASIC) tested so far, it's not the slowest by much. Any of those languages would be just fine.

If I posted my Pascal attempt, would someone be able to help figure out why it doesn't compile?

Edited January 24, 2016 by Tursi

+Lee Stewart · January 24, 2016

I would like to revise the fbForth optimized code. The following is more in line with the TurboForth code I was attempting to port. It defines V! similar to how it is defined in TurboForth:

HEX
ASM: V!
   *SP+ R0 MOV,         ( pop addr)
   *SP+ R1 MOV,         ( pop value)
   R1 SWPB,             ( get LSB of value into MSB)
   0 LIMI,              ( disable interrupts)
   R0 4000 ORI,         ( tell VDP processor "hey, this is a *write*")
   R0 SWPB,             ( get low byte of address)
   R0 8C02 @() MOVB,    ( write it to vdp address register)
   R0 SWPB,             ( get high byte of address)
   R0 8C02 @() MOVB,    ( write it)
   R1 8C00 @() MOVB,    ( write payload)
   2 LIMI,              ( enable interrupts)
;ASM

: TEST
   GRAPHICS
   PAGE
   1 MAGNIFY
   0 0 1 02A 0 SPRITE
   064 0 DO
      0EF 0 DO I 301 V! LOOP
      0AF 0 DO I 300 V! LOOP
      0 0EF DO I 301 V! -1 +LOOP
      0 0AF DO I 300 V! -1 +LOOP
   LOOP
   MON
;
DECIMAL

This runs in 26 seconds!
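For readers who do not have the VDP port protocol in their head, here is a toy Python model of what the V! sequence above does: two bytes to the address port (low byte, then high byte with the write flag), then one byte to the data port. The class and function names are invented for this sketch, and the model covers writes only.

```python
class VdpModel:
    """Toy model of the VDP address latch and 16K of VRAM (write path only)."""

    def __init__(self):
        self.vram = bytearray(0x4000)
        self.addr = 0
        self._first = None                    # pending first byte of a command

    def write_address_port(self, byte):
        if self._first is None:
            self._first = byte                # first byte: low half of address
            return
        low, self._first = self._first, None
        if (byte & 0xC0) == 0x40:             # 01xxxxxx: set address for writing
            self.addr = ((byte & 0x3F) << 8) | low
        # register writes and read set-up are not modelled here

    def write_data_port(self, byte):
        self.vram[self.addr] = byte & 0xFF
        self.addr = (self.addr + 1) & 0x3FFF  # address auto-increments


def v_store(vdp, value, addr):
    """Same byte order as V!: low address byte, high byte | >40, then the data."""
    word = addr | 0x4000
    vdp.write_address_port(word & 0xFF)
    vdp.write_address_port(word >> 8)
    vdp.write_data_port(value & 0xFF)


vdp = VdpModel()
for x in range(0xEF):                         # same range as `0EF 0 DO I 301 V! LOOP`
    v_store(vdp, x, 0x301)
print(vdp.vram[0x301], hex(vdp.addr))
```

Each position update costs three port writes, which is why the direct V! version gets so close to the hand-written assembly loop.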

...lee

Edited April 22, 2022 by Lee Stewart
Prettified the code

senior_falcon · January 24, 2016

Doggone it, now I suppose I'll have to do the program in XB using CALL LOADs. Results later today.

Oops, just remembered that I need to write to VDP, not CPU. So maybe no results today.

Edited January 24, 2016 by senior_falcon
