Benchmarking Languages

apersson850 · July 1, 2017

I guess you didn't do all the 100 loops. Did you multiply by the wrong value, maybe? I did five turns and multiplied by 20.

Anyway, just a few of the programs actually manage any kind of data structures for sprites. The rest focus on writing to the VDP RAM at certain locations, assuming the user knows which VDP RAM location to access. I'll make an optimization for the Pascal version where it also writes to VDP RAM directly, but still in Pascal, just to see what happens. I'm not sure if the routines in unit sprite will access the sprite data by themselves, when there's no countdown data in the sprite record, so I don't know if it works.

I've never used sprites for anything in Pascal, so I have no real reference to how fast or slow that handling is. But generally, Pascal runs a couple of times faster than Extended BASIC.

Oh, by the way, James D wrote earlier that the PME has no registers, but works with a stack, and that the instructions are 16-bit. The p-machine has some registers, which of course are emulated by the PME, and the instructions are 8-bit. Some of them do have one or more bytes in-line, though, as immediate data. So even if the instructions basically are 8 bits, some are extended to several bytes. Most of the data is referenced either by being on the stack, or via the stack. Variables in the current environment record are referenced through R9 in the TMS 9900 in this particular implementation of the PME. Global data is referenced through R14.

Edited July 2, 2017 by apersson850

JamesD · July 2, 2017

...

Oh, by the way, James D wrote earlier that the PME has no registers, but works with a stack, and that the instructions are 16-bit. The p-machine has some registers, which of course are emulated by the PME, and the instructions are 8-bit. Some of them do have a second byte, though, so even if the instructions basically are 8 bits, some are extended to 16 bits. Most of the data is referenced either by being on the stack, or via the stack. Variables in the current environment record are referenced through R9 in the TMS 9900 in this particular implementation of the PME.

The part about 16 bit... it's the Pascal numeric types that are normally 16 bit, not the opcodes themselves.

Sorry about that. But it means a lot of opcodes dealing with 16 bits.

As for registers... from the P Code Machine Wiki.

Like many other p-code machines, the UCSD p-Machine is a stack machine, which means that most instructions take their operands from the stack, and place results back on the stack. Thus, the "add" instruction replaces the two topmost elements of the stack with their sum. A few instructions take an immediate argument. Like Pascal, the p-code is strongly typed, supporting boolean (b), character (c), integer (i), real (r), set (s), and pointer (a) types natively.

If by registers you mean the following (from the 'ARCHITECTURE OF THE P-MACHINE' section of the docs), then yeah, it has registers.

But that's not user registers.

HARDWARE EMULATION: REGISTERS

The P-machine uses 16-bit words, with two 8-bit bytes per word. It has an evaluation stack, several registers, and a user memory containing a program stack and a heap. All registers are pointers to word-aligned structures, except IPC, which is a pointer to byte-aligned instructions. The registers, sometimes referred to as "pseudo-variables", are:

SP: evaluation Stack Pointer. A pointer to the current "top" of the evaluation stack (one byte beyond the last byte in use). In the Apple, the evaluation stack uses a portion of the 6502's hardware stack, starting in hex memory location 1FF and growing down toward hex location 100. It is used to pass parameters, return function values, and as an operand source for many instructions. The evaluation stack is extended by loads, and is cut back by stores and arithmetic operations.

IPC: Interpreter Program Counter. Contains the address of the next instruction to be executed, in the code segment of the currently executing procedure.

SEG: SEGment pointer points to the procedure dictionary of the segment to which the currently executing procedure belongs. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

JTAB: Jump TABle pointer. A pointer to the table of attributes and jump table entries in the procedure code section of the currently executing procedure. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

KP: program stacK Pointer. A pointer to the current top of the program stack. The program stack starts in high user memory and grows downward toward the heap. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

MP: Markstack Pointer. A pointer to the low byte of MSSTAT, in the topmost Markstack on the program stack, in the activation record of the currently executing procedure. Variables local to the current procedure are accessed by indexing off MP.

NP: New Pointer. A pointer to the current top of the dynamic heap (one byte beyond the last byte in use). The heap starts in low user memory and grows upward toward the program stack. It contains all dynamic variables (see Jensen and Wirth, Chapter 10). It is extended by the standard procedure 'new', and is cut back by the standard procedure 'release'.

BASE: BASE Procedure. A pointer to the activation record of the most recently invoked base procedure (lex level 0). Global (lex level 0) variables are accessed by indexing off BASE.

This is a perfect example of why the P-Machine isn't very efficient on 8 bit machines.

On a more advanced 16 bit CPU, this might only need 3 opcodes.

But this is what a 6502 has to do to execute a LOR (logical or) opcode for the p-machine.

This doesn't even show the code that decodes the opcode, calls the opcode routine, and the code run after the exit.

The 6502 is the worst of the 8 bit CPUs for this, but more machines had 6502s than anything else.

If opcodes worked on registers, all the slow indexed instructions and stack manipulation would go away, and it would probably be at least 30% faster.

On the 9900, putting data directly into registers would surely be faster than using the stack.

LOR
	TSX
	LDA	P1BASE+3,X
	ORA	P1BASE+1,X
	STA	P1BASE+3,X
	LDA	P1BASE+4,X
	ORA	P1BASE+2,X
	STA	P1BASE+4,X
	INX
	INX
	TXS
	JMP	UPDBY1

Edited July 2, 2017 by JamesD

apersson850 · July 2, 2017

The part about the registers you quote is from an earlier version of the p-system, not IV.x, so the registers aren't the same. But it's their equivalents I'm referring to, yes.

The interpreter in the 99/4A is further complicated by the fact that it can run code from CPU RAM, VDP RAM or GROM as well. If you like, I can post the inner part of the interpreter here.

apersson850 · July 2, 2017

Well, never mind waiting for anybody asking for anything...

Here's the PME (P-Machine Emulator) central parts. Note that the address of PMEFETCH is 8300H when the machine is running, so this is in 16-bit RAM. But there are two different fetch routines, one for code in CPU RAM and the other for code in VDP RAM or GROM. They are both loaded at the same place. The CPU RAM version needs some NOP instructions to occupy the same addresses, since there are six external entry points into the interpreter.

PMEFETCH gets the opcode of the instruction, and that's always 8 bits. It then looks into a table, which gives the address of the instruction's interpretation routine. That routine begins with a word holding the address where execution should start. This may be directly after the entry, or it may be one of five locations in the interpreter where one, two or three parameters that are inline in the code (not on the stack) are fetched to R3, R4 and R5 before the interpretation of the actual instruction continues.
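To make the fetch-and-dispatch idea concrete, here is a minimal Python sketch of a table-driven stack interpreter. It is not the real PME: the opcode numbers, the PUSHB instruction and the dictionary used as machine state are all made up for illustration. Only the general pattern (fetch an 8-bit opcode, look it up in a table, let the handler pick up any inline bytes) follows the description above.

```python
# Hypothetical miniature of a table-driven p-code loop (not the real PME).
# Opcode numbers and PUSHB are invented for this sketch.
OP_PUSHB, OP_ADI, OP_LOR, OP_HALT = 0, 1, 2, 3

def pushb(vm):
    # The operand is inline in the code stream, right after the opcode.
    vm["stack"].append(vm["code"][vm["pc"]])
    vm["pc"] += 1

def adi(vm):
    # Replace the two topmost words with their 16-bit sum.
    b = vm["stack"].pop()
    a = vm["stack"].pop()
    vm["stack"].append((a + b) & 0xFFFF)

def lor(vm):
    # Replace the two topmost words with their bitwise or.
    b = vm["stack"].pop()
    a = vm["stack"].pop()
    vm["stack"].append(a | b)

# One handler per opcode, like the OPCODE address table in the interpreter.
TABLE = {OP_PUSHB: pushb, OP_ADI: adi, OP_LOR: lor}

def run(code):
    vm = {"code": bytes(code), "pc": 0, "stack": []}
    while True:
        opcode = vm["code"][vm["pc"]]   # fetch: always one byte
        vm["pc"] += 1
        if opcode == OP_HALT:
            return vm["stack"]
        TABLE[opcode](vm)               # dispatch through the table

print(run([OP_PUSHB, 5, OP_PUSHB, 3, OP_ADI, OP_HALT]))
```

Every instruction pays for the fetch, the table lookup and the jump back to the loop, which is the overhead discussed in this thread.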

I've included a few instructions in full detail.

LDO loads a word from the program's global data area. R14 points to that area.

LOD loads data from a caller's local data area (environment record). There are two short forms, used to load data from the caller, or the caller's caller. Then there's one general form that can load from any lexical level above the currently running procedure. R9 points to the current environment record. It's up to the programmer to think about this. Using variables further up than two levels takes more time. There's a similar thing for local variables. There are faster instructions to pick the first variables, compared to those further down among the declarations. So declare short variables that are frequently used first. If you declare an array first, you may run out of reach of all short local variable load instructions in one fell swoop.
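The cost of reaching further up the lexical chain can be illustrated with a small Python sketch. The record layout here (a dictionary with a "parent" link and a list of "words") is purely hypothetical and only stands in for the linked environment records. The point is that each extra level costs one more link to follow before the word can be loaded.

```python
# Hypothetical model of environment records linked to their lexical parent.
def load_intermediate(current_record, levels, word_index):
    """Follow 'levels' parent links, then fetch one word from that record."""
    record = current_record
    for _ in range(levels):          # one link traversal per lexical level
        record = record["parent"]
    return record["words"][word_index]

grandparent = {"parent": None, "words": [10, 11]}
parent = {"parent": grandparent, "words": [20, 21]}
current = {"parent": parent, "words": [30]}

print(load_intermediate(current, 1, 1))   # word 1 of the caller's record
print(load_intermediate(current, 2, 0))   # word 0 two levels up
```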

ADI and LOR perform addition and logical or (so you can compare with the 6502 code above). It takes seven instructions to execute these codes, if they are in CPU RAM. Normally, p-code runs from VDP RAM, where it takes eight instructions to accomplish something the TMS 9900 could have done with one, if the p-code was converted to native code. But LOR is one byte long, SOC *SP+,*SP is two. So for this particular instruction, native code gives a speed gain of roughly eight times, at the cost of twice the memory.

There are p-code instructions for more complex things too, like calling global or local procedures. They are more like small program segments, and may invoke quite a lot of code, if a segment fault is issued on a call (the called procedure isn't currently in memory, but must be retrieved from disk).

The instructions SIGNAL and WAIT also have their own p-codes, as they shouldn't be interrupted.

This is not based on any source code or such, but on my own deciphering and inspection of my p-system on the 99/4A. It was necessary to understand more than the manual tells you to be able to implement pre-emptive multitasking and bit-map mode, to allow the system to do turtlegraphics.

; Inner interpreter for the PME in TI 99/4A

PASCALWS .EQU 8380H		;Interpreter's workspace
LSBYR3	.EQU PASCALWS+7
LSBYR4  .EQU PASCALWS+9
LSBYR5	.EQU PASCALWS+11
PC	.EQU 8			;Instruction pointer
ERECP	.EQU 9			;Current environment record pointer
SP	.EQU 10			;PME stack pointer
FETCH   .EQU 12		;Address of PME instruction fetch routine (pmefetch)
CODERD  .EQU 13			;Address to read code from.
GLOBDAT .EQU 14			;Global data frame pointer within current segment
CODEFLG	.EQU 15			;Code location flag. 0: CPU RAM, <0: VDP RAM, >0: GROM


;Interpreter when code is in RAM
PMEFETCH
	MOVB *PC+,R1		;Fetch instruction
	SRL  R1,7			;Make word index
	MOV  @OPCODE(R1),R2 ;Fetch address to interpreter's header
	MOV  *R2+,R0		;Fetch address to execute code		
	B    *R0
	NOP
RD2BYT	CLR  R4				;Reads two immediate bytes
	MOVB *PC+,@LSBYR4
	NOP
RD1BYT	CLR  R3				;Reads one immediate byte
	MOVB *PC+,@LSBYR3
	B    *R2
	NOP
RD3BYT	CLR  R5				;Reads three immediate bytes, last could be big
	NOP
	MOVB *PC+,@LSBYR5
RD2BIG	CLR  R4
	MOVB *PC+,@LSBYR4	;Reads two immediate bytes, last could be big
	NOP
RDBIG	CLR  R3				;Reads immediate byte, could be big
	MOVB *PC+,R3
	JLT  BIG
	SWPB R3				;Justify short big data
	B    *R2
BIG	ANDI R3,7F00H
	MOVB *PC+,@LSBYR3	;Reads LsByte of long big data
	B    *R2
;-------------------------------------------------------------------------------		
;Interpreter when code is in VDP RAM or in GROM
PMEFETCH
	INC  PC
	MOVB *CODERD,R1
	SRL  R1,7
	MOV  @OPCODE(R1),R2
	MOV  *R2+,R0
	B    *R0
	INC  PC
RD2BYT	CLR  R4
	MOVB *CODERD,@LSBYR4
	INC  PC
RD1BYT	CLR  R3
	MOVB *CODERD,@LSBYR3
	B	 *R2
	INC  PC
RD3BYT	CLR  R5
	MOVB *CODERD,@LSBYR5
	INC  PC
RD2BIG	CLR  R4
	MOVB *CODERD,@LSBYR4
RDBIG	CLR  R3
	MOVB *CODERD,R3
	JLT  BIG
	SWPB R3
	INC  PC
	B	 *R2
BIG	ANDI R3,7F00H
	MOVB *CODERD,@LSBYR3
	INCT PC
	B    *R2
		
; At OPCODE there's a table[0..255] of addresses (words) to each instruction

OPCODE	.WORD SLDC
	.WORD SLDC
;... only a few codes are given here		
OPCODE+133*2 .WORD LDO		;Load global word
OPCODE+136*2 .WORD LOD		;Load intermediate word from any lexical level above the current
OPCODE+160*2 .WORD LOR		;Logical or
OPCODE+162*2 .WORD ADI		;Add integer has opcode 162
OPCODE+173*2 .WORD SLOD1	;Short load intermediate word from lexical parent of current environment record
OPCODE+174*2 .WORD SLOD2	;Short load intermediate word from lexical grandparent of current environment record	
;... and so on
	.WORD RESERVE5
	.WORD RESERVE6

; Load a word from the global data area
LDO	.WORD RDBIG			;Get word index in global data frame
	SLA  R3,1
	A    GLOBDAT,R3		;Add base of global data
	DECT SP
	MOV  @8(R3),*SP		;Push word after data frame header
	B    *FETCH			;Fetch next p-code

; Load a word from the caller's environment record
SLOD1   .WORD RDBIG		;Get word index
	LI   R4,1		;Lexical level count
	JMP  LOD1
; Load a word from the caller's caller's environment record
SLOD2   .WORD RDBIG		;Get word index
	LI   R4,2		;Lexical level count
	JMP  LOD1
; Load a word from an intermediate activation record (data belonging to a caller more than two levels above)	
LOD	.WORD RD2BIG	;Get lexical level count and word index
LOD1	MOV  ERECP,R2
TRAV	MOV  *R2,R2		;Traverse activation record links
	DEC  R4			;Count levels
	JGT  TRAV
	SLA  R3,1		;Word index
	A    R2,R3		;Add environment record base
	DECT SP
	MOV  @8(R3),*SP	;Push word from environment record. Adjust for record header
	B    *FETCH
		
; Add integer adds two words at top of stack
ADI	.WORD ADI+2		;No immediate data
	A    *SP+,*SP	;Add top of stack words
	B    *FETCH
		
;Logical or of two words at top of stack
LOR	.WORD LOR1		;No immediate data
LOR1	SOC  *SP+,*SP	;Or top of stack words
	B    *FETCH

Edited July 2, 2017 by apersson850

+TheBF · July 2, 2017

This is cool to see inside this system.

Can you think of any good reason that the interpreter has three NOP instructions in it?

I find it astounding that someone wanted to slow down this critical piece of the system.

+Lee Stewart · July 2, 2017

This is cool to see inside this system.

Can you think of any good reason that the interpreter has three NOP instructions in it?

I find it astounding that someone wanted to slow down this critical piece of the system.

Re-read the third line of his response.

...lee

+TheBF · July 2, 2017

Got it.

apersson850 · July 2, 2017

Most of these NOP instructions aren't really executed, but are there just as fillers. When reading code from memory mapped devices, there's no auto-increment of the PC (R8), so it has to be advanced with extra INC instructions.

Edited July 2, 2017 by apersson850

JamesD · July 2, 2017

I'd be less worried about the NOPs and more worried about how much other code has to be executed just for a single opcode.

BTW, the first language I found that used this sort of interpreter was BCPL which came a few years before Pascal or Forth.
I'm pretty sure that's where the idea came from.

Edited July 2, 2017 by JamesD

apersson850 · July 2, 2017

Obviously, p-code runs seven times slower than pure assembly, as it takes seven instructions to execute one, which has a direct correlation to a TMS 9900 instruction.

Instructions that are used to find data in another procedure and such stuff do of course take longer time. But they couldn't execute in one single TMS 9900 instruction either, so the overhead there is less.

If you look at the SLOD2 instruction, it takes 24 TMS 9900 instructions to execute it, and six to decode it. Thus the overhead only adds 25%, not 700% as is the case with ADI.
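The arithmetic behind those two figures can be written out as a short Python sketch. The instruction counts are the ones quoted in this post. The function names are mine, and counting instructions rather than clock cycles is a simplification.

```python
def decode_overhead_percent(decode_instr, execute_instr):
    """Decoding cost as a percentage of the useful work done."""
    return 100.0 * decode_instr / execute_instr

def slowdown(total_instr, native_instr):
    """How many instructions are run per native instruction replaced."""
    return total_instr / native_instr

# SLOD2: six instructions to decode, 24 to execute.
print(decode_overhead_percent(6, 24))

# ADI from CPU RAM: seven instructions in total for what one native
# instruction could do.
print(slowdown(7, 1))
```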

I don't know where the idea to implement p-code for the UCSD system came from. But it's a fairly old idea to compile a language to some intermediate code and then either convert that all the way or interpret it. As they wanted portability, it's very efficient, since implementing the PME on a new platform is a significantly smaller task than modifying the compiler each time.

Edited July 2, 2017 by apersson850

apersson850 · July 5, 2017

Just to see how much of the time is consumed by the sprite routines in the Pascal program, I commented them out. The program still runs all the loops and does all the assignments, but it never calls set_sprite. Now it does the 100 loops in 150 seconds.

Next I'll make it write to the sprite attribute table directly, and we'll see what difference that makes.

apersson850 · July 5, 2017

In the next step, I added an external procedure which simply plugs the values for x and y directly into the sprite attribute table. Execution time is now 166 seconds.

This is of course still not near Forth or pure assembly, but it shows that the p-system and Pascal are at least normally significantly faster than Extended BASIC.

Edited July 5, 2017 by apersson850

apersson850 · July 5, 2017

Finally, a program which does the same, but without any external procedures. Pure Pascal in 273 seconds.

As far as I know, there's no way to poke to VDP memory in Extended BASIC, so that's probably as optimized as it will be. At least not without external functions, like special CALLs implemented on Horizon RAMdisks and such. Thus we have 2000 seconds vs. 273 seconds here. That's in line with what I've experienced before, where Pascal is a couple of times faster than BASIC, but of course not near assembly speed.

There are a few more things you can do to optimize further, but I don't bother now. The step from 780 to 273 seconds still proves that if you know the Pascal system well, you can make it perform better. And the step down to 166 seconds shows that if you use assembly support where it's best needed, then you can get some more. But that's true for most languages.

Language   First Pass     Optimized
GCC           15 sec         5 sec
Assembly      17 sec         5 sec
TurboForth    48 sec        29 sec
Compiled XB   51 sec        37 sec
FbForth       70 sec        26 sec
GPL           80 sec       none yet
ABASIC       490 sec       none yet
XB          2000 sec       none yet
UCSD Pascal 7300 sec       273 sec
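For anyone who wants the ratios spelled out, here is a small Python calculation using only the timings mentioned in this post (2000 s for XB, and 780 s, 273 s and 166 s for the three Pascal variants). The variable names are just labels for this sketch.

```python
# Timings in seconds, as quoted in the post above.
xb = 2000
pascal_first = 780
pascal_pure = 273
pascal_external = 166

print(round(xb / pascal_pure, 1))               # XB vs. pure Pascal
print(round(pascal_first / pascal_pure, 1))     # gain from direct VDP writes
print(round(pascal_pure / pascal_external, 1))  # extra gain from assembly support
```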

Edited July 5, 2017 by apersson850

JamesD · July 6, 2017

Ya know, there was a Pascal parser for GCC.
It's been dropped in recent versions due to lack of a maintainer, but if that were combined with the 9900 GCC changes, it would probably benchmark right up there with C.
In theory, since it uses strict typing, it should be able to optimize some code that can't be optimized under C.

apersson850 · July 6, 2017

Sure, but as far as I see it, it's not interesting. The value with the UCSD p-system lies in the word system. It's the whole system, with code and memory management, libraries and such that I like. Then I can live with that it doesn't outrun Forth, and that I occasionally have to write external procedures to get the desired performance.

+Vorticon · July 6, 2017

Could you please post the source code for that program? I'm very interested in seeing how you did it. BTW, this is 10 times faster than XB, not just a couple of times.

apersson850 · July 6, 2017

Sure. Here you go.

Note that the UCSD p-system on the somewhat peculiar TI 99/4A has dual code pools. One in VDP RAM, the other in the 24 K CPU RAM. Normally, the p-system will load pure p-code in the VDP pool. Code containing assembly programs must be loaded in the secondary code pool, as they can't run from video memory. But in this program, I'm writing to the VDP from Pascal. Thus you must use the utility setltype (set language type) to change the type of the code file you run from Pseudo to M_9900.

This is not very flexible, as the procedure that writes to VDP RAM is fixed to set the position values for sprite #1, nothing else. A few more milliseconds could have been saved by not calling any procedure at all, but putting the code in line in the main program. Then it would also have been possible to only write one coordinate, as the other is fixed in each loop. But I didn't bother. I just wanted to see if it was the versatile and complicated unit sprite which does some overkill for such a simple task as this one, and thus wastes a lot of time. And it obviously is.

The time related code is just to make the benchmark timing automatic. A few declarations aren't used; they remain from the first version.

program benchmark;

uses
  support,
  sprite,
  realtime;
  
const
  rddata  = -30720; (* VDPRD *)
  rdstat  = -30718; (* VDPST *)
  wrtdata = -29696; (* VDPWD *)
  wrtaddr = -29694; (* VDPWA *)
  wrtenab =  16384; (* hex 4000, setting VDP address to write *)
      
type
  byte = 0..255;
  window = record
    case boolean of
      true: (int: integer);
      false:(ptr:^integer);
    end;
var
  x,y,cnt: integer;
  vwaaddr,
  vwdaddr: window;
  timer: timerid;
  elapsed: ttime;
  ch: char;
      
procedure spr1_pos(x,y: integer);

var
  addr: integer;
  
begin
  vwaaddr.ptr^ := 1024;
  vwaaddr.ptr^ := 19456;
  vwdaddr.ptr^ := y*256;
  vwdaddr.ptr^ := x*256;
end;

begin
  vwaaddr.int := wrtaddr;
  vwdaddr.int := wrtdata;

  tmrnew(timer);
  tmrreset(timer);
  
  write('Rounds? ');
  readln(cnt);
  
  tmrstart(timer);
  
  page(output);
  set_screen(2);
  set_spr_size(1);
  set_spr_attribute(1,42,2,0,1,1,0,0);
  
  while cnt>0 do
  begin
    for x := 1 to 240 do
      spr1_pos(x,1);
    for y := 1 to 176 do
      spr1_pos(240,y);
    for x := 240 downto 1 do
      spr1_pos(x,176);
    for y := 176 downto 1 do
      spr1_pos(1,y);
    cnt := pred(cnt);
  end;
  
  tmrstop(timer);
  tmrread(timer,elapsed);
  set_screen(1);
  page(output);
  with elapsed do
  begin
    write('Time ',hour,':');
    if minute<10 then
      write('0');
    write(minute,':');
    if second<10 then
      write('0');
    write(second,',');
    if fract<100 then
      write('0');
    if fract<10 then
      write('0');
    writeln(fract);
  end;
  read(ch);
end.
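The two magic numbers in spr1_pos (1024 and 19456) can be derived rather than memorized. The Python sketch below assumes that the VDP only sees the most significant byte of each word written, that the address is sent low byte first, and that the second byte carries the write flag hex 40 (compare wrtenab = hex 4000 in the constants). Under those assumptions the two words correspond to VDP address hex 0C04, which I take to be the start of sprite #1's entry in the attribute table. That address is inferred from the constants, not stated in the post, and the function name is made up.

```python
def vdp_write_address_words(addr):
    """Return the two words to store at VDPWA to set a write address."""
    low = (addr & 0xFF) << 8                      # low address byte in the MSB
    high = (((addr >> 8) & 0x3F) | 0x40) << 8     # high bits plus write flag
    return low, high

def vdp_data_word(value):
    """A data byte must sit in the MSB of the word, hence y*256 and x*256."""
    return (value & 0xFF) << 8

print(vdp_write_address_words(0x0C04))
print(vdp_data_word(176))
```

After the address is set, the VDP auto-increments it, so the y and x bytes can be written back to back as the procedure does.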

Edited April 29, 2019 by apersson850

Tursi · July 6, 2017

As far as I know, there's no way to poke to VDP memory in Extended BASIC, so that's probably as optimized as it will be.

I've taken several stabs at it, and never done any better. (Which did surprise me).

We don't have TI LOGO in there, anyone want to try that one?

JamesD · July 6, 2017

Sure, but as far as I see it, it's not interesting. The value with the UCSD p-system lies in the word system. It's the whole system, with code and memory management, libraries and such that I like. Then I can live with that it doesn't outrun Forth, and that I occasionally have to write external procedures to get the desired performance.

So writing all the code in Pascal and then compiling pieces that need more speed with a different compiler and making them external procedures isn't interesting?

apersson850 · July 6, 2017

I thought you were talking about replacing the whole system with a compiler that produced code files which couldn't be loaded under the p-system. As I understand you now it's completely different. Automating that process is like having a native code generator (which normally does accompany the p-system), but perhaps even better, if it is an optimizing such thing.

How fast do you guys sort 1000 random integers? Using whatever language you like, thus most likely assembly?

+Vorticon · July 6, 2017

I've taken several stabs at it, and never done any better. (Which did surprise me).

We don't have TI LOGO in there, anyone want to try that one?

I just might! I also think RXB will be a worthwhile contestant as well with its low level access features.

+Vorticon · July 6, 2017

Thanks. I think I will need to look this over closely. I was not aware there was a unit called realtime...

apersson850 · July 6, 2017

Well, it's only my machine, and one of my friends' (but I doubt his is running any longer), that has a unit realtime. The p-system keeps track of today's date, and uses it to tag files when they are created or updated. But you have to key in the date manually, as the system always starts with the same date as the last time it was set. It's stored on the disk, so what you get at the next start is the last date on which the system disk was used.

If the computer had a battery-backed real time clock, keeping track of date and time, the setting of the date could be done automatically. As there was no such device available on the market at that time, I set about to design and build one. I did of course also write the software to use it. The p-system unit realtime is one part of that software.

I have two editions of the unit realtime, one that works with my own clock hardware and one that works with the Triple-Tech card, which was the first clock card available on the market that I came in touch with. A friend of mine bought one, so I wrote a driver for him to use it with the p-system.

My device uses the same clock chip as you find on the P-GRAM card. It's good for precision timing, as it counts down to 1/1000 s. The chip they used on the Triple-Tech card would do seconds only.

I've also designed my card so that I can use the interrupt generation capability of the clock. Thus the card can issue an alarm, via an interrupt, at a pre-programmed time, or use a counter rollover interrupt, and thus issue one every minute or every 1/10 s or every day. The every day interrupt can be used to change the system date not only when logging on, but also during operation of the p-system.

Since I can change the interrupt vector in my system (I can overlay the console ROM with RAM) I have the ability to for example reconfigure the TI to become a controller, which can take actions based on time.

But in this benchmark case the clock is simply used to time the activity. The unit realtime allows the dynamic creation of any number of timers (limited only by memory) that can run simultaneously. They can be halted and read individually, and disposed of when not needed any longer, to recover the memory used.

Anyway, don't look for it in the stock UCSD p-system, as it's not there.

JamesD · July 7, 2017

I thought you were talking about replacing the whole system with a compiler that produced code files which couldn't be loaded under the p-system. As I understand you now it's completely different. Automating that process is like having a native code generator (which normally does accompany the p-system), but perhaps even better, if it is an optimizing such thing.

How fast do you guys sort 1000 random integers? Using whatever language you like, thus most likely assembly?

Well, on smaller projects, you could skip the P-Machine and run at speeds similar to the current GCC C compiler.

But you could also generate native code modules to work with the P-System.

If you develop something under UCSD that needs more speed in certain parts, make them modules, then compile the code on the PC and have freakish fast speeds while still having the advantages of the P-Machine. How easy it would be to convert the GCC Pascal output to a UCSD module I don't know.

The problem with the UCSD native code translator, is that the output still works like the P-Machine, just without the decoding phase.

It can't take advantage of lots of registers, it uses the stack a lot, etc...

Translation will certainly cut the execution times by quite a bit, but nothing approaching GCC's code generator.

+TheBF · July 26, 2017

I have been using this exercise to beat up the low level code for my Sprite routines.

Here is the current state of the benchmark operating under Indirect Threaded Forth (ITC) and Direct Threaded Forth (DTC), with top of stack cached in a register.

Lee,

I can't figure out how you got FB-Forth to go faster than Turbo Forth in the optimized version.

Can you double check it when you have nothing better to do?

Note: None of my code is updated on GitHub. I need to do some housecleaning there.

Code:

HEX
VARIABLE CNT

DECIMAL
\ SP.LOC is Forth code that writes to sprite descriptor table
\ : SP.LOC    ( dx dy sprt# -- )  >R >CELL R> ]SDT V! ;
\ alternative to using mtask99 and automotion
: TURSI.BENCH
      GRAPHICS
      PAGE
      1 MAGNIFY
      2 42 0 0 0 SPRITE   \ CAMEL99 uses BASIC color #s
      SP.SHOW
      100 CNT !
      BEGIN
           CNT @ 0> WHILE
           239 0 DO   I   0  0 SP.LOC       LOOP
           175 0 DO 239   I  0 SP.LOC       LOOP
           0 239 DO   I 175  0 SP.LOC   -1 +LOOP
           0 175 DO   0   I  0 SP.LOC   -1 +LOOP
           -1 CNT +!
      REPEAT
;

HEX
300 CONSTANT $300
$300 1+ CONSTANT $301

DECIMAL 
: TURSI.OPT
      GRAPHICS
      PAGE
      1 MAGNIFY
      2 42 0 0 0 SPRITE   \ CAMEL99 uses BASIC color #s
      SP.SHOW
      100 CNT !
      BEGIN
           CNT @ WHILE
           239 0 DO   I $301 VC!     LOOP
           175 0 DO   I $300 VC!     LOOP
           0 239 DO   I $301 VC! -1 +LOOP
           0 175 DO   I $300 VC! -1 +LOOP
           -1 CNT +!
      REPEAT
;

Here are the standings

Language   First Pass     Optimized
GCC           15 sec         5 sec
Assembly      17 sec         5 sec
TurboForth    48 sec        29 sec
CAMEL99 DTC   49 sec        27 sec
Compiled XB   51 sec        37 sec
CAMEL99 ITC   55 sec        29 sec
FbForth       70 sec        26 sec
GPL           80 sec       none yet
ABASIC       490 sec       none yet
XB          2000 sec       none yet
UCSD Pascal 7300 sec       273 sec

Edited August 10, 2017 by TheBF
