
Is Pascal slow?


apersson850


To avoid misconceptions, I want to clarify that by Pascal I mean UCSD Pascal, since it's the only meaningful implementation for the TI 99/4A.

This in turn implies that Pascal isn't just another language for the TI; it's also a completely different operating system, which means a different way to work with the computer as well.

But to return to the original question: yes, Pascal is slower than theoretically possible. The fastest possible Pascal program would run at the same speed as a carefully written assembly program doing the same thing. Pascal clearly does not. There are two reasons for this.

First, a compiler isn't always as clever as a good programmer, especially not a compiler with roots in the late 1970's. The UCSD compiler isn't famous for being among the better ones at generating code, even by contemporary standards.

Second, the p-system compiler generates p-code, not pure machine code. This doesn't necessarily imply any speed penalty, as it's not unusual for a compiler to generate an intermediate code which is eventually translated to machine code. But in the p-system, the p-code is the final stage. Indeed, the p-system implements most of its functionality by executing p-code, since the p-system is mainly written in Pascal. One of the design goals of the UCSD p-system is just that: it should be self-compiling, so that new versions of the system can be compiled on the system itself.

Machine code is executed by the TMS 9900 CPU in the computer. P-code, on the other hand, is executed by the PME, short for P-Machine Emulator: a virtual CPU with p-code as its native language.

The PME handles data in various locations. Processing, like arithmetic operations, is done using a stack. Dynamic data that doesn't follow the stack's push/pop discipline is allocated on the heap. The system maintains a lot of data structures to keep track of code segments, constant declarations, global data and so on.

In the TI, the p-code is interpreted by the PME. For each instruction, several machine instructions are executed just to find out how to interpret the p-code. Then the actual interpretation is done, and the PME returns to processing the next instruction. This overhead does of course imply a cost in time.
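To make the principle concrete, here is a minimal sketch in Pascal of what such a fetch-decode-dispatch loop does. All names here are invented for illustration only; the actual inner interpreter, in assembly, is shown in the next section.

(* Minimal illustration of a byte-code dispatch loop. Not the real PME,
   just the principle: fetch a byte, decode it, do the work, repeat. *)
program dispatchsketch;
const codemax = 999;
var
  code   : packed array [0..codemax] of 0..255;   (* the p-code stream *)
  ipc    : integer;                                (* instruction pointer *)
  opcode : integer;
  running: boolean;
begin
  for ipc := 0 to codemax do code[ipc] := 0;       (* fill with NOPs *)
  code[5] := 255;                                  (* invented stop code *)
  ipc := 0;
  running := true;
  while running do
  begin
    opcode := code[ipc];       (* fetch the next opcode byte *)
    ipc := ipc + 1;
    case opcode of             (* decode: select the handler *)
      0:   ;                   (* NOP: do nothing *)
      255: running := false    (* invented stop code so the sketch ends *)
      (* ...one branch per p-code, doing the actual work... *)
    end
  end
end.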

Knowing this, several questions usually come up.

How much overhead cost is there?

Why was it implemented like this?

Can you do anything about it? Or is it actually a good idea?

 

PME structure

To understand the level of overhead, we have to look at how the PME works.

At startup, the inner interpreter is loaded into the CPU RAM PAD at 8300-8342H. It exists in different versions, depending on whether the p-code it's supposed to execute is in CPU RAM, VDP RAM or GROM. P-code is a byte code, so it lends itself well to being executed from VDP RAM or GROM, but addressing of the two is different, requiring two slightly different PME versions. CPU RAM, finally, isn't auto-incrementing and doesn't require an elaborate address setup, so it again needs a version of its own.

 

Here is the inner interpreter for code in CPU RAM. Execution starts at FETCH, which fetches the next instruction. Further down is code that retrieves the various parameters stored after the instruction itself.

FETCH	MOVB	*IPC+,R1	Fetch next opcode
	SRL	R1,7	Make word index
	MOV	@PCODTAB(R1),R2	Fetch interpreter code address
	MOV	*R2+,R0	
	B	*R0
	NOP
LDUBB	CLR	R4
	MOVB	*IPC+,@PASCALWS+9	Lsby R4
	NOP
LDUB	CLR	R3	Read byte after code
	MOVB	*IPC+,@PASCALWS+7	Lsby R3
	B	*R2
	NOP
	CLR	R5
	MOVB	*IPC+,@PASCALWS+11	Lsby R5
	NOP
LDUBBG	CLR	R4
	MOVB	*IPC+,@PASCALWS+9	Lsby R4
	NOP
	CLR	R3
	MOVB	*IPC+,R3
	JLT	BIG
	SWPB	R3
	B	*R2
BIG	ANDI	R3,7F00H
	MOVB	*IPC+,@PASCALWS+7 	Lsby R3
	B	*R2

The p-system's main workspace, PASCALWS, is located at 8380H. Some of the registers have pre-defined functions:

R8 (IPC): p-code instruction pointer.
R9 (ARECP): frame pointer for the activation record on the stack.
R10 (SP): stack pointer.
R11: return link.
R12 (FETCHP): pointer to the PME's FETCH address.
R13: read data address for the currently executing segment; PGRMRD, VDPRD or 0.
R14 (GLOBDATA): global data frame pointer for the current code segment.
R15: memory type flag for the currently executing p-code.

PCODTAB is a table in RAM, indexed by the p-code opcodes. The first entry contains the address of the code interpreting the p-code with opcode 0, and so on, for all 256 possible opcodes.

 

PME execution

P-codes exist in many variants. The simplest is NOP. It does nothing. The interpreter’s code looks like this:

NOP	DATA	FETCH

As you can see, the handler entry simply points back to FETCH, so the PME immediately goes on to the next p-code. Five CPU instructions are used to do this.

 

One of the simpler ones is DUPI. It duplicates the integer on top of the stack. This is the interpreter's code for DUPI:

DUPI	DATA	DUPI+2
	DECT	SP		The real work is done here
	MOV	@2(SP),*SP
	B	*FETCHP

The useful work takes two instructions. It takes the PME another six to execute them.

 

Some p-codes have parameters inline in the code. They are stored right after the instruction.

LDCB <UB>, Load Constant Byte, places the value of <UB> on the stack. It looks like this:

LDCB	DATA	LDUB
	DECT	SP		The real work is done here
	MOV	R3,*SP
	B	*FETCHP

Here the PME executes nine instructions, plus the two doing the work. But fetching the parameter is of course also real work, so it's rather seven for the PME and four to do the job.

 

Several p-codes are specifically designed to do tasks that frequently occur when executing Pascal programs. Accessing a variable in a procedure that’s one or more lexical levels above the current one is one such task. The p-code STR <DB>, <B> stores the value on top of stack in the variable with offset <B>  in the procedure <DB> levels above the current one.

STR	DATA	LDUBBG
	MOV	ARECP,R2	Activation record pointer
TRAVAREC	MOV	*R2,R2	Traverse activation records
	DEC	R4
	JGT	TRAVAREC
	SLA	R3,1
	A	R2,R3
	MOV	*SP+,@8(R3)
	B	*FETCHP

Here the total number of instructions is dependent on how many lexical levels we must traverse. An average may be 19 useful instructions and five to run them.
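For reference, this is the kind of Pascal source that makes the compiler emit an intermediate-level store like STR: the inner procedure assigns to a variable declared in the procedure one lexical level above it. (A made-up fragment; exactly which p-code is chosen depends on the level difference and the offset.)

program lexdemo;

procedure outer;
var count: integer;            (* lives in outer's activation record *)

  procedure inner;
  begin
    count := succ(count)       (* store to a variable one lexical level up *)
  end;

begin
  count := 0;
  inner
end;

begin
  outer
end.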

 

If we go into more complex instructions, like real arithmetic, the interpreter overhead diminishes. The PME calls the console’s floating point operations for these tasks, so there’s no difference in math speed or precision, compared to BASIC.

Even more complex tasks, like calling procedures, require so many instructions that the interpreter overhead almost disappears. Especially if the procedures are in other segments.

We can see that for the simplest instructions, the execution time may increase by six times, compared to the single instruction doing the work.

But for more complex instructions, the increase gets smaller. In the last example, it’s a factor of 1.25 or less.

 

Code size

When the p-system, and the concept of p-code, were first designed, it was with other goals than raw execution speed in mind.

One reason was to make it possible to execute the same program on different computers. As long as the computer had a PME, then the same p-code could be executed. Since the operating system also runs p-code, not only the application, but the entire p-system, with all its utilities, could be transferred to another kind of computer.

For us, using the TI 99/4A only, this is only of value if we want to run Pascal code written for other machines, as it’s difficult to transfer object code from other systems today.

The other reason was to make it possible to run Pascal on machines with rather small memories. At that time, 32 Kbyte was a large internal memory for a computer, so compact code was a priority.

P-code is a byte code. Instructions without parameters occupy one byte in memory. TMS 9900 instructions require two, four or six bytes, depending on the addressing. The p-code DUPI requires one byte, the equivalent TMS 9900 code six.

LDCB requires two bytes, TMS 9900 code eight.

STR requires three bytes, the equivalent TMS 9900 code 24 bytes. That's somewhere around the limit where it would be better to call a subroutine instead, but then the difference between the p-code and the assembly code shrinks anyway.

The inter-segment calls would most certainly be implemented as subroutine calls anyway, since they imply running hundreds of instructions.

Anyway, this difference in memory requirements means that much more complex programs can be loaded in the memory than would have been possible if they were compiled to native code. This is similar to Forth’s approach, where programs consist of a long line of addresses. But they are words, so each instruction there requires twice as much memory as for the p-code.

Improve execution speed

What can be done to improve execution speed?

There are a couple of things. Knowing how the compiler generates code is one. The first local variables in a procedure are accessed faster than those further down in its local variable list.

Code like x := x+1 executes slower than x := succ(x), because the compiler isn’t smart enough to generate an increment instruction on its own.

Copy array data using moveleft instead of one element at a time.
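A small sketch of the three hints above (nothing here is benchmarked, it just shows the constructs, using the UCSD intrinsics moveleft, fillchar and sizeof):

program speedhints;
var
  i, x: integer;                            (* declared first: quickest to reach *)
  src, dst: packed array [1..100] of char;
begin
  fillchar(src, sizeof(src), 'A');
  x := 0;
  for i := 1 to 1000 do
    x := succ(x);                           (* compiles to an increment,
                                               unlike x := x + 1 *)
  moveleft(src, dst, sizeof(src))           (* one block move instead of
                                               copying element by element *)
end.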

The most obvious other way is probably to support Pascal with assembly routines. The call interface is the same for Pascal and assembly routines. Thus you can debug the algorithm first in Pascal, then add external to the procedure declaration, write it in assembly and link that into your program.
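A sketch of what that looks like in source (the routine name and its job are made up; the point is that the heading stays the same, and only the body changes from Pascal to external):

program extdemo;
type
  buffer = packed array [1..256] of char;
var
  b: buffer;

(* While debugging, an ordinary Pascal body:
   procedure clearbuf(var buf: buffer);
   begin
     fillchar(buf, sizeof(buf), ' ')
   end;
*)

(* Final version: same heading, body replaced by external.
   The assembly routine is linked in afterwards. *)
procedure clearbuf(var buf: buffer);
  external;

begin
  clearbuf(b)
end.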

A not so obvious way is to let a program convert a time critical procedure to assembly automatically. It’s possible to add a directive to a procedure when compiling, to support a conversion to assembly language. Then you run a native code converter program, which will convert the marked procedures from p-code to assembly automatically. Unfortunately there is no such program for the TI 99/4A, but the p-system does support the converted code.

Since the support for executing such converted code is in place, I’ve played with the thought of creating a native code converter. The converters that did exist worked in such a way that they translated the instructions where the overhead is large. The simple ones, similar to those I pointed out above. Complicated p-codes, where the interpreter runs hundreds of instructions to execute one p-code, are left alone.

As an example, a p-code sequence which does add, subtract and negate with integers, then calls a procedure, contains these instructions:

ADI	Add integer
SBI	Subtract integer
NGI	Negate integer
CXG	Call global external procedure

After a native code generator has processed such a segment, it would look like this:

NAT			Native code follows
	A	*SP+,*SP
	S	*SP+,*SP
	NEG	*SP
	BL	*R11
CXG			Call global external procedure

When encountering such a code segment, the p-code interpreter will run the machine code right after the NAT p-code. BL *R11 returns to the PME, and since it's a BL, R11 will tell the PME where to continue executing p-code. In this case with the complex CXG instruction, which isn't converted.

In this example, 18 machine instructions are reduced to 13. Only three of them do the useful work; the rest decode and execute the NAT instruction. Fortunately, in most real applications there are more simple p-codes in a row than just three of them. When a loop is being executed, the same instructions may also run many times, giving more value for the cost. The cost in this example is 3 or 4 bytes of extra memory, depending on whether the NAT code starts at an even or odd address. Machine code can only start at even addresses, so in half of the cases one byte must be jumped over.

Finally, the jump table for interpreting all p-codes is stored in RAM. Thus it's possible to modify the handling of any p-code. The fact that ATTACH doesn't do anything can be changed, for example.

Summary

This was intended to give anyone interested a background to why the execution speed of Pascal programs is what it is. A consequence is that as long as you compare irrelevant examples, like looping a thousand times, Pascal will always be slower than the most bare-bones languages, like assembly or Forth. The speed of Forth comes from the fact that its interpreter reads a long row of code addresses, not opcodes. These addresses point directly to what to run (if it's a "final" word), instead of indexing into a table of addresses.

So, what's the point of Pascal then?

Apart from its generally good support for making programs that are reasonably well organized, the main advantage is the code size. Since all p-codes are bytes, the code is compact. But the operating system also provides good services for making programs that perform big tasks without running out of code memory. P-code is fully relocatable, even while being executed. In case of a memory problem, the p-system can temporarily interrupt execution of a program, move it in memory to free up more stack space, then continue the execution.

But that's not all. Without doing anything more than just using the unit concept, where you separately compile parts of your program, then use that from the main program (or from other units), you invoke the p-system's segment concept. A segment is a piece of code which doesn't have to be in memory unless it's actually executed. When exiting the procedures in the segment, it's marked as unused. If you call procedures in another segment, the first one you used can be erased from memory and the second brought in from disk. Such overlapping of code can be done in other systems too, even in Extended BASIC, but there is no other language for the TI where this happens automatically.

If you want to, you can also make certain procedures in your main program removable from memory when they are not needed. Just declare a segment procedure, and you are done.
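A sketch of both mechanisms (unit and procedure names are invented; the two pieces below are compiled separately). The unit ends up in a segment of its own, and the segment procedure in the main program only needs to be in memory while it actually runs:

(* Separately compiled unit - becomes a segment of its own,
   brought into memory only when one of its routines is called. *)
unit reportgen;
interface
  procedure printreport;
implementation
  procedure printreport;
  begin
    writeln('...the report goes here...')
  end;
end.

(* The main program. A segment procedure can be flushed from
   memory when it isn't executing. *)
program mainprog;
uses reportgen;

segment procedure initialize;
begin
  writeln('one-time setup, not kept in memory afterwards')
end;

begin
  initialize;
  printreport
end.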

 

So when looking at the development of a substantial program, which has a lot of code and also processes a substantial amount of data, then Pascal is fast. What you need to do is already there. Just use it.

Then if execution speed is critical for parts of the program, you have to look at writing that in assembly. Or develop and use a native code converter...


"Such overlapping of code can be done in other systems too, even in Extended BASIC, but there is no other language for the TI where this happens automatically."

 

Hmmm, incorrect. The TI99/4A has GPL built in, and where does everything return to after it errors out or crashes? GPL and the ROM OS that runs GPL.

Thus your statement is incorrect; there is a language that does do this automatically, and it is GPL.

It fires up the TI99/4A for use and controls all cartridges, and even your Pascal card.


No, it's not incorrect. But you misunderstood what I was referring to. I was referring to the fact that you get this kind of support for overlapping different pieces of code in memory in your own application.

GPL and the p-system are completely separate, though. GPL only makes sure there is a branch to the startup code on the p-code card, then GPL, or its interpreter, isn't executed any more, until you exit the p-system. The only software in the console that's used by the p-system is some ROM routines for floating point math.

 

The p-system IV.0 can use up to six disks. But since TI had a controller which could handle no more than three, they disabled the last three.

The simplest thing you can do is add a controller capable of using four disks, like the CorComp I have. In such a case it's the same controller for the extra disk, so the p-system's table for the fourth disk only has to be populated with the same pointer as for the first three.

It's a bit more complex when you add a disk that's handled by a different controller, as the pointer in the unit table in the p-system then can't be the same. You also have to create a PCB (Peripheral Control Block) somewhere in VDP RAM for this new controller. It's similar to a PAB in the normal operating system, just with some more information. Then it's doable.

 

I'm using two RAMdisks with my p-system. One is a Horizon RAMdisk. But the p-system uses different code to access peripherals than the console does. The current DSR for the Horizon card looks into data used by the console OS when accessing a disk. Since the p-system accesses a disk with code running in a different location, this assumption in the DSR for the Horizon card fails. This was done to save a few bytes in the DSR, since memory space is at a premium on the card.

Fortunately, the p-system only needs the sector read/write capability, so it wasn't too difficult to write a DSR that does that in a way compatible with the p-system. You have to re-install the standard DSR when not using the p-system, though.

 

System IV.0 can't use subdirectories, though. Hence a hard drive is of somewhat limited use, as it can only appear as one drive to the p-system, unless a DSR is written that simulates several disks on the hard drive.

Later versions of the p-system solved this by allowing the creation of virtual disks on a disk. Instead of just being able to create files on a disk, you could create a volume, which in turn could contain files. Only one level deep, though.


I can't edit my first post any longer.

There's a point that should be a comma in the code that interprets DUPI.

I also want to clarify that the purpose of the NOP instructions in the main interpreter is to make sure the parameter retrieval code has the same addresses for all interpreter versions. When running the interpreter which handles code in VDP RAM, the code isn't fetched by

MOVB *IPC+,R1

The code used is instead

INC IPC

MOVB *R13,R1

where R13 contains the read data address for the currently used memory, VDPRD in this case. The VDP address does autoincrement by itself on each read, but to keep track of how far into the code the PME has advanced, it must still increment the instruction pointer, R8, explicitly.

They have tried to use these NOP instructions as fillers only, but in some cases they are actually executed, which of course slows down the system further. Had the 99/4A used CPU RAM only for program execution, and let the VDP memory be used for just that, video, then this complication would never have existed.

 

When the PME branches to code in a different type of memory, it replaces the whole inner interpreter by a new version. Well, that's not always true, since the same interpreter is used to execute p-code from VDP RAM and GROM. But it changes when changing to and from CPU RAM.

Which in turn actually makes the p-code execute slower in the standard case, as p-code is by default loaded into the code pool in VDP RAM, and that interpreter has to execute more instructions.


"This is similar to Forth’s approach, where programs consist of a long line of addresses. But they are words, so each instruction there requires twice as much memory as for the p-code."

 

For full disclosure, Forth doesn't actually specify how you should do this. It is commonly addresses, but there are byte-coded Forths too.

The most famous byte-code Forth is Open Firmware, which booted Sun workstations, Apple PowerPC machines and, I think, one of the Linux distros.

 

<We now return to our regularly scheduled program>


4 hours ago, apersson850 said:

I was referring to Forth on the TI. I've not seen any byte-coded Forth there. Does it exist in some version?

Got it. 

You are correct. There is not one... yet.

 It's on my todo list of variants to create with the cross-compiler.

I think I can make good use of the indexed addressing mode to look up the primitives.


That's what the PME does. Take a look at the label FETCH in the first code window in my first post in this thread.

But doesn't Forth need more than 256 opcodes? The whole idea being that you define new words all the time. Or there's more to it than the obvious?

There are not as many as 256 p-codes, so that's not an issue for the p-system.


Yes, that is what I would need to do to make a Forth inner interpreter. It would only be the first 5 lines in my case.

 

From what I understand the way it is done in byte code Forth is to have a table of primitives written in Assembler or C.

The size of that table can be anything from a minimum of 35 or so Forth primitives all the way up to 256 if you wanted more speed.

 

Then another table is created for "secondary" definitions that are written in Forth calling the primitives in the 1st table.

Open-Firmware has something called the Active-Package which from what I can see is a selector for the secondary table. 

So each Active-package would be limited to 256 routines. More than enough for a pretty complex program.

 

 


On 6/22/2020 at 5:28 PM, apersson850 said:

How do they mix 16-bit and 8-bit codes? Or is that not possible?

I have not looked deeper than the surface. Got sidetracked working on my editor. :) 

From what I gather there are different ways people do it.  Anything is possible and probably somebody has tried it with Forth.

I will have to look into it further.  I would think there would have to be a primary token that tells the interpreter the next 2 bytes are a  (table,token) pair?

How is it handled in UCSD Pascal?

 


I found a paper that gives two options. 

 

https://dl.acm.org/doi/pdf/10.1145/146559.146561

 

Token-Pointer Threading

The Code Field contains a token signifying if this is a primitive or a secondary.

The Parameter Field contains:

  • Primitives: the machine-language definition of the word, ended with a jump to next (the Forth inner interpreter).
  • Secondaries: a list of pointers to the subwords in the definition, terminated with a pointer to return

 

Token-token Threading

The Code Field contains a token signifying if this is a primitive or a secondary.

The Parameter Field contains:

  • Primitives: the machine-language definition of the word, ended with a jump to next.
  • Secondaries: a list of tokens of the sub-words in the definition, terminated with the token for return.
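To make the two schemes a bit more concrete, here is a rough Pascal sketch of the token-token variant (every name is invented, and recursion stands in for the return stack): small tokens select primitives that are handled directly, larger tokens select secondaries whose bodies are just lists of further tokens ending in a return token.

program tokensketch;
(* Token-token threading in miniature. Tokens below 10 are primitives,
   handled directly; tokens from 10 up select secondaries, whose bodies
   are lists of tokens ended by a return token. *)
const
  rettok = 0;                           (* return from a secondary *)
  duptok = 1;                           (* primitive: duplicate top of stack *)
  addtok = 2;                           (* primitive: add the two top items *)
  dbltok = 10;                          (* secondary: DUP then ADD, i.e. double *)
var
  stack: array [1..50] of integer;
  sp: integer;
  secondaries: array [10..15, 1..20] of integer;

procedure run(token: integer);
var i: integer;
begin
  if token < 10 then
    case token of                       (* primitive: do the work *)
      duptok: begin sp := sp + 1; stack[sp] := stack[sp - 1] end;
      addtok: begin stack[sp - 1] := stack[sp - 1] + stack[sp]; sp := sp - 1 end
    end
  else
    begin                               (* secondary: interpret its token list *)
      i := 1;
      while secondaries[token, i] <> rettok do
      begin
        run(secondaries[token, i]);     (* recursion plays the return stack *)
        i := i + 1
      end
    end
end;

begin
  secondaries[dbltok, 1] := duptok;
  secondaries[dbltok, 2] := addtok;
  secondaries[dbltok, 3] := rettok;
  sp := 1;
  stack[1] := 21;
  run(dbltok);
  writeln(stack[sp])                    (* writes 42 *)
end.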

On 6/24/2020 at 12:03 AM, TheBF said:

How is it handled in UCSD Pascal?

All p-codes are bytes. If they are longer, it's due to in-line parameters, which are then fetched by the interpreter to pre-defined registers.

If native code is to be executed in-line with p-codes, the machine code is preceded by a NAT instruction, which on word-addressed processors (like the TMS 9900) makes the IPC even before execution begins. Native code ends with a BL to the interpreter.

