apersson850 (posted July 1, 2017; edited July 2, 2017):

I guess you didn't do all the 100 loops. Did you multiply by the wrong value, maybe? I did five turns and multiplied by 20.

Anyway, just a few of the programs actually manage any kind of data structures for sprites. The rest focus on writing to the VDP RAM at certain locations, assuming the user knows which VDP RAM location to access. I'll make an optimization for the Pascal version where it also writes to VDP RAM directly, but still in Pascal, just to see what happens. I'm not sure if the routines in unit sprite will access the sprite data by themselves when there's no countdown data in the sprite record, so I don't know if it works. I've never used sprites for anything in Pascal, so I have no real reference to how fast or slow that handling is. But generally, Pascal runs a couple of times faster than Extended BASIC.

Oh, by the way, James D wrote earlier that the PME has no registers, but works with a stack, and that the instructions are 16-bit. The p-machine has some registers, which of course are emulated by the PME, and the instructions are 8-bit. Some of them do have one or more bytes in-line, though, as immediate data. So even if the instructions basically are 8 bits, some are extended to several bytes. Most of the data is referenced either by being on the stack, or via the stack. Variables in the current environment record are referenced through R9 in the TMS 9900 in this particular implementation of the PME. Global data is referenced through R14.
JamesD (posted July 2, 2017; edited July 2, 2017):

apersson850 wrote: "... Oh, by the way, James D wrote earlier that the PME has no registers, but works with a stack, and that the instructions are 16-bit. The p-machine has some registers, which of course are emulated by the PME, and the instructions are 8-bit. Some of them do have a second byte, though, so even if the instructions basically are 8 bits, some are extended to 16 bits. Most of the data is referenced either by being on the stack, or via the stack. Variables in the current environment record are referenced through R9 in the TMS 9900 in this particular implementation of the PME."

The part about 16 bit... it's the Pascal numeric types that are normally 16 bit, not the opcodes themselves. Sorry about that. But it means a lot of opcodes dealing with 16 bits.

As for registers... from the P Code Machine Wiki:

"Like many other p-code machines, the UCSD p-Machine is a stack machine, which means that most instructions take their operands from the stack, and place results back on the stack. Thus, the "add" instruction replaces the two topmost elements of the stack with their sum. A few instructions take an immediate argument. Like Pascal, the p-code is strongly typed, supporting boolean (b), character (c), integer (i), real (r), set (s), and pointer (a) types natively."

If by registers you mean the following (from the 'ARCHITECTURE OF THE P-MACHINE' section of the docs), then yeah, it has registers. But those aren't user registers.

HARDWARE EMULATION: REGISTERS

The P-machine uses 16-bit words, with two 8-bit bytes per word. It has an evaluation stack, several registers, and a user memory containing a program stack and a heap. All registers are pointers to word-aligned structures, except IPC, which is a pointer to byte-aligned instructions. The registers, sometimes referred to as "pseudo-variables", are:

SP: evaluation Stack Pointer. A pointer to the current "top" of the evaluation stack (one byte beyond the last byte in use). In the Apple, the evaluation stack uses a portion of the 6502's hardware stack, starting in hex memory location 1FF and growing down toward hex location 100. It is used to pass parameters, return function values, and as an operand source for many instructions. The evaluation stack is extended by loads, and is cut back by stores and arithmetic operations.

IPC: Interpreter Program Counter. Contains the address of the next instruction to be executed, in the code segment of the currently executing procedure.

SEG: SEGment pointer. Points to the procedure dictionary of the segment to which the currently executing procedure belongs. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

JTAB: Jump TABle pointer. A pointer to the table of attributes and jump table entries in the procedure code section of the currently executing procedure. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

KP: program stacK Pointer. A pointer to the current top of the program stack. The program stack starts in high user memory and grows downward toward the heap. (See this manual's appendix OPERATION OF THE P-MACHINE for illustrations.)

MP: Markstack Pointer. A pointer to the low byte of MSSTAT, in the topmost Markstack on the program stack, in the activation record of the currently executing procedure. Variables local to the current procedure are accessed by indexing off MP.

NP: New Pointer. A pointer to the current top of the dynamic heap (one byte beyond the last byte in use). The heap starts in low user memory and grows upward toward the program stack. It contains all dynamic variables (see Jensen and Wirth, Chapter 10). It is extended by the standard procedure 'new', and is cut back by the standard procedure 'release'.

BASE: BASE Procedure. A pointer to the activation record of the most recently invoked base procedure (lex level 0). Global (lex level 0) variables are accessed by indexing off BASE.
This is a perfect example of why the P-Machine isn't very efficient on 8 bit machines. On a more advanced 16 bit CPU, this might only need 3 opcodes. But this is what a 6502 has to do to execute a LOR (logical or) opcode for the p-machine. This doesn't even show the code that decodes the opcode, calls the opcode routine, and the code run after the exit. The 6502 is the worst of the 8 bit CPUs for this, but more machines had 6502s than anything else. If opcodes worked on registers, all the slow indexed instructions and stack manipulation would go away, and it would probably be at least 30% faster. On the 9900, putting data directly into registers would surely be faster than using the stack.

LOR     TSX
        LDA     P1BASE+3,X
        ORA     P1BASE+1,X
        STA     P1BASE+3,X
        LDA     P1BASE+4,X
        ORA     P1BASE+2,X
        STA     P1BASE+4,X
        INX
        INX
        TXS
        JMP     UPDBY1
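To make the stack-machine point concrete, here is a minimal Python sketch of how "add integer" and "logical or" style operations consume the two topmost words of an evaluation stack and push one result back. It is a toy model, not the UCSD interpreter: the class name `EvalStack`, its method names and the explicit 16-bit masking are illustrative assumptions.

```python
MASK16 = 0xFFFF  # the p-machine works on 16-bit words


class EvalStack:
    """Toy evaluation stack; a hypothetical model, not UCSD code."""

    def __init__(self):
        self.words = []

    def push(self, value):
        self.words.append(value & MASK16)

    def pop(self):
        return self.words.pop()

    def adi(self):
        # Add integer: replace the two topmost words with their sum.
        b, a = self.pop(), self.pop()
        self.push(a + b)

    def lor(self):
        # Logical or: replace the two topmost words with their bitwise OR.
        b, a = self.pop(), self.pop()
        self.push(a | b)


stack = EvalStack()
stack.push(0x00F0)
stack.push(0x0F0F)
stack.lor()
lor_result = stack.pop()

stack.push(65535)
stack.push(2)
stack.adi()
adi_result = stack.pop()  # the sum wraps around at 16 bits
```

Every operation here touches memory twice to read operands and once to write the result, which is the traffic a register-based design would avoid.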
apersson850 (posted July 2, 2017):

The part about the registers you quote is from an earlier version of the p-system, not IV.x, so the registers aren't the same. But it's their equivalents I'm referring to, yes. The interpreter in the 99/4A is further complicated by the fact that it can run code from CPU RAM, VDP RAM or GROM as well. If you like, I can post the inner part of the interpreter here.
apersson850 (posted July 2, 2017; edited July 2, 2017):

Well, never mind waiting for anybody asking for anything... Here are the central parts of the PME (P-Machine Emulator).

Note that the address of PMEFETCH is 8300H when the machine is running, so this is in 16-bit RAM. But there are two different fetch routines, one for code in CPU RAM and the other for code in VDP RAM or GROM. They are both loaded at the same place. The CPU RAM version needs some NOP instructions to occupy the same addresses, since there are six external entry points into the interpreter.

PMEFETCH gets the opcode of the instruction, and that's always 8 bits. It then looks into a table, which gives the address of the instruction's interpretation. That part begins with a word holding the address where execution should start. This may be directly after the entry, or it may be one of five locations in the interpreter where one, two or three parameters that are inline in the code (not on the stack) are fetched to R3, R4 and R5 before the interpretation of the actual instruction continues.

I've included a few instructions in full detail. LDO loads a word from the program's global data area. R14 points to that area. LOD loads data from a caller's local data area (environment record). There are two short forms, used to load data from the caller, or the caller's caller. Then there is one general form that can load from any lexical level above the currently running procedure. R9 points to the current environment record. It's up to the programmer to think about this. Using variables further up than two levels takes more time.

There's a similar thing for local variables. There are faster instructions to pick the first variables, compared to those further down among the declarations. So declare short variables that are frequently used first. If you declare an array first, you may run out of reach of all short local variable load instructions in one fell swoop.
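The cost difference between near and far intermediate variables can be sketched in Python. This is a toy model under stated assumptions: each environment record is taken to hold a link to its lexical parent plus a list of word-sized variables, and the names `EnvRecord` and `load_intermediate` are hypothetical. It only mirrors the idea that one link is followed per lexical level.

```python
class EnvRecord:
    """Hypothetical environment record: a parent link plus local words."""

    def __init__(self, parent, words):
        self.parent = parent
        self.words = words


def load_intermediate(current, levels, index):
    """Follow `levels` parent links, then read the word at `index`.

    Returns the value and the number of links followed, which is the
    part of the work that grows with lexical distance.
    """
    record = current
    hops = 0
    for _ in range(levels):
        record = record.parent
        hops += 1
    return record.words[index], hops


outer = EnvRecord(None, [10, 11])
middle = EnvRecord(outer, [20, 21])
inner = EnvRecord(middle, [30, 31])

# One level up (the caller) and two levels up (the caller's caller).
parent_value, parent_hops = load_intermediate(inner, 1, 0)
grandparent_value, grandparent_hops = load_intermediate(inner, 2, 1)
```

The loop runs once per lexical level, so a variable three or four levels out costs proportionally more link-following than one in the immediate parent.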
ADI and LOR perform addition and logical or (so you can compare with the 6502 code above). It takes seven instructions to execute these codes, if they are in CPU RAM. Normally, p-code runs from VDP RAM, where it takes eight instructions to accomplish something the TMS 9900 could have done with one, if the p-code was converted to native code. But LOR is one byte long, SOC *SP+,*SP is two. So for this particular instruction, there's a speed gain of roughly eight times, at the loss of twice the memory use.

There are p-code instructions for more complex things too, like calling global or local procedures. They are more like small program segments, and may invoke quite a lot of code, if a segment fault is issued on a call (the called procedure isn't currently in memory, but must be retrieved from disk). The instructions SIGNAL and WAIT also have their own p-codes, as they shouldn't be interrupted.

This is not based on any source code or such, but on my own deciphering and inspection of my p-system on the 99/4A. It was necessary to understand more than the manual tells you to be able to implement pre-emptive multitasking and bit-map mode, to allow the system to do turtlegraphics.

; Inner interpreter for the PME in TI 99/4A

PASCALWS .EQU 8380H             ;Interpreter's workspace
LSBYR3   .EQU PASCALWS+7
LSBYR4   .EQU PASCALWS+9
LSBYR5   .EQU PASCALWS+11

PC       .EQU 8                 ;Instruction pointer
ERECP    .EQU 9                 ;Current environment record pointer
SP       .EQU 10                ;PME stack pointer
FETCH    .EQU 12                ;Address of PME instruction fetch routine (pmefetch)
CODERD   .EQU 13                ;Address to read code from.
GLOBDAT  .EQU 14                ;Global data frame pointer within current segment
CODEFLG  .EQU 15                ;Code location flag. 0: CPU RAM, <0: VDP RAM, >0: GROM

;Interpreter when code is in RAM
PMEFETCH MOVB *PC+,R1           ;Fetch instruction
         SRL  R1,7              ;Make word index
         MOV  @OPCODE(R1),R2    ;Fetch address to interpreter's header
         MOV  *R2+,R0           ;Fetch address to execute code
         B    *R0
         NOP
RD2BYT   CLR  R4                ;Reads two immediate bytes
         MOVB *PC+,@LSBYR4
         NOP
RD1BYT   CLR  R3                ;Reads one immediate byte
         MOVB *PC+,@LSBYR3
         B    *R2
         NOP
RD3BYT   CLR  R5                ;Reads three immediate bytes, last could be big
         NOP
         MOVB *PC+,@LSBYR5
RD2BIG   CLR  R4
         MOVB *PC+,@LSBYR4      ;Reads two immediate bytes, last could be big
         NOP
RDBIG    CLR  R3                ;Reads immediate byte, could be big
         MOVB *PC+,R3
         JLT  BIG
         SWPB R3                ;Justify short big data
         B    *R2
BIG      ANDI R3,7F00H
         MOVB *PC+,@LSBYR3      ;Reads LsByte of long big data
         B    *R2

;-------------------------------------------------------------------------------
;Interpreter when code is in VDP RAM or in GROM
PMEFETCH INC  PC
         MOVB *CODERD,R1
         SRL  R1,7
         MOV  @OPCODE(R1),R2
         MOV  *R2+,R0
         B    *R0
         INC  PC
RD2BYT   CLR  R4
         MOVB *CODERD,@LSBYR4
         INC  PC
RD1BYT   CLR  R3
         MOVB *CODERD,@LSBYR3
         B    *R2
         INC  PC
RD3BYT   CLR  R5
         MOVB *CODERD,@LSBYR5
         INC  PC
RD2BIG   CLR  R4
         MOVB *CODERD,@LSBYR4
RDBIG    CLR  R3
         MOVB *CODERD,R3
         JLT  BIG
         SWPB R3
         INC  PC
         B    *R2
BIG      ANDI R3,7F00H
         MOVB *CODERD,@LSBYR3
         INCT PC
         B    *R2

; At OPCODE there's a table[0..255] of addresses (words) to each instruction
OPCODE        .WORD SLDC
              .WORD SLDC
;... only a few codes are given here
OPCODE+133*2  .WORD LDO         ;Load global word
OPCODE+136*2  .WORD LOD         ;Load intermediate word from any lexical level above the current
OPCODE+160*2  .WORD LOR         ;Logical or
OPCODE+162*2  .WORD ADI         ;Add integer has opcode 162
OPCODE+173*2  .WORD SLOD1       ;Short load intermediate word from lexical parent of current environment record
OPCODE+174*2  .WORD SLOD2       ;Short load intermediate word from lexical grandparent of current environment record
;... and so on
              .WORD RESERVE5
              .WORD RESERVE6

; Load a word from the global data area
LDO      .WORD RDBIG            ;Get word index in global data frame
         SLA  R3,1
         A    GLOBDAT,R3        ;Add base of global data
         DECT SP
         MOV  @8(R3),*SP        ;Push word after data frame header
         B    *FETCH            ;Fetch next p-code

; Load a word from the caller's environment record
SLOD1    .WORD RDBIG            ;Get word index
         LI   R4,1              ;Lexical level count
         JMP  LOD1

; Load a word from the caller's caller's environment record
SLOD2    .WORD RDBIG            ;Get word index
         LI   R4,2              ;Lexical level count
         JMP  LOD1

; Load a word from an intermediate activation record (data belonging to a caller more than two levels above)
LOD      .WORD RD2BIG           ;Get lexical level count and word index
LOD1     MOV  ERECP,R2
TRAV     MOV  *R2,R2            ;Traverse activation record links
         DEC  R4                ;Count levels
         JGT  TRAV
         SLA  R3,1              ;Word index
         A    R2,R3             ;Add environment record base
         DECT SP
         MOV  @8(R3),*SP        ;Push word from environment record. Adjust for record header
         B    *FETCH

; Add integer adds two words at top of stack
ADI      .WORD ADI+2            ;No immediate data
         A    *SP+,*SP          ;Add top of stack words
         B    *FETCH

;Logical or of two words at top of stack
LOR      .WORD LOR1             ;No immediate data
LOR1     SOC  *SP+,*SP          ;Or top of stack words
         B    *FETCH
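The fetch, operand decode and dispatch steps of the listing can be modelled in a few lines of Python. This is a sketch under assumptions, not a port: `read_big` follows the RDBIG/BIG path as it reads in the listing (a first byte below 80H is the whole value, otherwise its low seven bits become the high byte and the next byte the low byte), while `run` is a hypothetical dispatch loop that ignores the data frame header offset and models only three of the opcodes from the table.

```python
def read_big(code, pc):
    """Decode a 'big' operand starting at code[pc].

    Returns (value, new_pc). One byte is consumed for values below 80H,
    two bytes otherwise.
    """
    first = code[pc]
    if first < 0x80:
        return first, pc + 1
    return ((first & 0x7F) << 8) | code[pc + 1], pc + 2


def run(code, global_words):
    """Tiny dispatch loop; opcode numbers follow the table in the post."""
    LDO, LOR, ADI = 133, 160, 162
    stack = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]          # every opcode is a single byte
        pc += 1
        if opcode == LDO:          # load global word, one big operand
            index, pc = read_big(code, pc)
            stack.append(global_words[index])
        elif opcode == ADI:        # add the two topmost words
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFF)
        elif opcode == LOR:        # or the two topmost words
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        else:
            raise ValueError(f"opcode {opcode} not modelled")
    return stack


globals_area = [0] * 301
globals_area[1] = 5
globals_area[300] = 7

# LDO 1, LDO 300 (encoded as the two bytes 81H 2CH), ADI
program = bytes([133, 1, 133, 0x81, 0x2C, 162])
result = run(program, globals_area)
```

The variable-length operand keeps the common case (small indexes) at one byte, which matches the advice in the post to declare frequently used variables first.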
+TheBF (posted July 2, 2017):

This is cool to see inside this system. Can you think of any good reason that the interpreter has three NOP instructions in it? I find it astounding that someone wanted to slow down this critical piece of the system.
+Lee Stewart (posted July 2, 2017):

+TheBF wrote: "This is cool to see inside this system. Can you think of any good reason that the interpreter has three NOP instructions in it? I find it astounding that someone wanted to slow down this critical piece of the system."

Re-read the third line of his response.

...lee
+TheBF (posted July 2, 2017):

Got it?
apersson850 (posted July 2, 2017; edited July 2, 2017):

Most of these NOP instructions aren't really executed, but are there just as fillers. When reading code from memory mapped devices, there's no auto-increment of the PC (R8), so it has to be advanced with extra INC instructions.
JamesD (posted July 2, 2017; edited July 2, 2017):

I'd be less worried about the NOPs and more worried about how much other code has to be executed just for a single opcode.

BTW, the first language I found that used this sort of interpreter was BCPL, which came a few years before Pascal or Forth. I'm pretty sure that's where the idea came from.
apersson850 (posted July 2, 2017; edited July 2, 2017):

Obviously, p-code runs seven times slower than pure assembly when it takes seven instructions to execute a p-code that corresponds directly to one TMS 9900 instruction. Instructions that are used to find data in another procedure and such stuff do of course take longer. But they couldn't execute in one single TMS 9900 instruction either, so the overhead there is less. If you look at the SLOD2 instruction, it takes 24 TMS 9900 instructions to execute it, and six to decode it. Thus the overhead only adds 25%, not 600% (six extra instructions for one useful one) as is the case with ADI.

I don't know where the idea to implement p-code for the UCSD system came from. But it's a fairly old idea to compile a language to some intermediate code and then either convert that all the way, or interpret it. As they wanted portability, it's very efficient, since implementing the PME on a new platform is a significantly smaller task than modifying the compiler each time.
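The overhead figures can be checked with a short calculation. The instruction counts are the ones given in the post (six instructions to fetch and decode, one to execute ADI, 24 to execute SLOD2); the function name `overhead` is just an illustrative helper.

```python
def overhead(decode_instructions, execute_instructions):
    """Return (slowdown factor, overhead relative to the useful work)."""
    total = decode_instructions + execute_instructions
    slowdown = total / execute_instructions
    relative_overhead = decode_instructions / execute_instructions
    return slowdown, relative_overhead


# Six decode instructions in both cases, as counted in the post.
adi_slowdown, adi_overhead = overhead(6, 1)
slod2_slowdown, slod2_overhead = overhead(6, 24)
```

With these counts ADI runs seven instructions for one of useful work, while SLOD2 runs thirty for twenty-four, so the fixed decode cost matters less the more work a p-code does.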
apersson850 (posted July 5, 2017):

Just to see how much of the time is consumed by the sprite routines in the Pascal program, I commented them out. The program still runs all the loops and does all the assignments, but it never calls set_sprite. Now it does the 100 loops in 150 seconds. Next I'll make it write to the sprite attribute table directly, and we'll see what difference that makes.
apersson850 (posted July 5, 2017; edited July 5, 2017):

In the next step, I added an external procedure which simply plugs the values for x and y directly into the sprite attribute table. Execution time is now 166 seconds. This is of course still not near Forth or pure assembly, but it shows that the p-system and Pascal are normally significantly faster than Extended BASIC.
apersson850 (posted July 5, 2017; edited July 5, 2017):

Finally, a program which does the same, but without any external procedures. Pure Pascal in 273 seconds.

As far as I know, there's no way to poke to VDP memory in Extended BASIC, so that's probably as optimized as it will be. At least not without external functions, like special CALLs implemented on Horizon RAMdisks and such. Thus we have 2000 seconds vs. 273 seconds here. That's in line with what I've experienced before, where Pascal is a couple of times faster than BASIC, but of course not near assembly speed.

There are a few more things you can do to optimize further, but I won't bother now. The step from 780 to 273 seconds still proves that if you know the Pascal system well, you can make it perform better. And the step down to 166 seconds shows that if you use assembly support where it's needed most, then you can get some more. But that's true for most languages.

Language       First Pass   Optimized
GCC            15 sec       5 sec
Assembly       17 sec       5 sec
TurboForth     48 sec       29 sec
Compiled XB    51 sec       37 sec
FbForth        70 sec       26 sec
GPL            80 sec       none yet
ABASIC         490 sec      none yet
XB             2000 sec     none yet
UCSD Pascal    7300 sec     273 sec
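For reference, the ratios behind these comparisons can be worked out from the timings quoted in the text. The short Python calculation below uses only figures from the posts (2000, 780, 273 and 166 seconds); the dictionary keys are descriptive labels added here, not names from the thread.

```python
# Timings in seconds for 100 loops, as quoted in the posts.
timings = {
    "Extended BASIC": 2000,
    "Pascal, starting point quoted in the text": 780,
    "pure Pascal, direct VDP writes": 273,
    "Pascal with external assembly procedure": 166,
}

pure = timings["pure Pascal, direct VDP writes"]

# How many times faster pure Pascal is than Extended BASIC.
xb_vs_pure = timings["Extended BASIC"] / pure

# Gain from bypassing unit sprite while staying in Pascal.
start_vs_pure = timings["Pascal, starting point quoted in the text"] / pure

# Further gain from moving the VDP write into assembly.
pure_vs_external = pure / timings["Pascal with external assembly procedure"]
```

Rounded to one decimal, these come out at about 7.3, 2.9 and 1.6 respectively.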
JamesD (posted July 6, 2017):

Ya know, there was a Pascal parser for GCC. It's been dropped in recent versions due to lack of a maintainer, but if that were combined with the 9900 GCC changes, it would probably benchmark right up there with C. In theory, since it uses strict typing, it should be able to optimize some code that can't be optimized under C.
apersson850 (posted July 6, 2017):

Sure, but as far as I see it, it's not interesting. The value of the UCSD p-system lies in the word system. It's the whole system, with code and memory management, libraries and such that I like. Then I can live with the fact that it doesn't outrun Forth, and that I occasionally have to write external procedures to get the desired performance.
+Vorticon (posted July 6, 2017):

apersson850 wrote: "Finally, a program which does the same, but without any external procedures. Pure Pascal in 273 seconds. ..."

Could you please post the source code for that program? I'm very interested in seeing how you did it. BTW, this is 10 times faster than XB, not just a couple of times.
apersson850 (posted July 6, 2017; edited April 29, 2019):

Sure. Here you go.

Note that the UCSD p-system on the somewhat peculiar TI 99/4A has dual code pools. One is in VDP RAM, the other in the 24 K CPU RAM. Normally, the p-system will load pure p-code in the VDP pool. Code containing assembly programs must be loaded in the secondary code pool, as they can't run from video memory. But in this program, I'm writing to the VDP from Pascal. Thus you must use the utility setltype (set language type) to change the type of the code file you run from Pseudo to M_9900.

This is not very flexible, as the procedure that writes to VDP RAM is fixed to set the position values for sprite #1, nothing else. A few more milliseconds could have been saved by not calling any procedure at all, but putting the code in line in the main program. Then it would also have been possible to only write one coordinate, as the other is fixed in each loop. But I didn't bother. I just wanted to see if it was the versatile and complicated unit sprite which does some overkill for such a simple task as this one, and thus wastes a lot of time. And it obviously is.

The time related code is just to make the benchmark timing automatic. A few declarations aren't used; they remain from the first version.

program benchmark;

uses
  support, sprite, realtime;

const
  rddata  = -30720;  (* VDPRD *)
  rdstat  = -30718;  (* VDPST *)
  wrtdata = -29696;  (* VDPWD *)
  wrtaddr = -29694;  (* VDPWA *)
  wrtenab = 16384;   (* hex 4000, setting VDP address to write *)

type
  byte = 0..255;
  window = record
    case boolean of
      true:  (int: integer);
      false: (ptr: ^integer);
  end;

var
  x, y, cnt: integer;
  vwaaddr, vwdaddr: window;
  timer: timerid;
  elapsed: ttime;
  ch: char;

procedure spr1_pos(x,y: integer);
var
  addr: integer;
begin
  vwaaddr.ptr^ := 1024;
  vwaaddr.ptr^ := 19456;
  vwdaddr.ptr^ := y*256;
  vwdaddr.ptr^ := x*256;
end;

begin
  vwaaddr.int := wrtaddr;
  vwdaddr.int := wrtdata;
  tmrnew(timer);
  tmrreset(timer);
  write('Rounds? ');
  readln(cnt);
  tmrstart(timer);
  page(output);
  set_screen(2);
  set_spr_size(1);
  set_spr_attribute(1,42,2,0,1,1,0,0);
  while cnt>0 do
  begin
    for x := 1 to 240 do
      spr1_pos(x,1);
    for y := 1 to 176 do
      spr1_pos(240,y);
    for x := 240 downto 1 do
      spr1_pos(x,176);
    for y := 176 downto 1 do
      spr1_pos(1,y);
    cnt := pred(cnt);
  end;
  tmrstop(timer);
  tmrread(timer,elapsed);
  set_screen(1);
  page(output);
  with elapsed do
  begin
    write('Time ',hour,':');
    if minute<10 then
      write('0');
    write(minute,':');
    if second<10 then
      write('0');
    write(second,',');
    if fract<100 then
      write('0');
    if fract<10 then
      write('0');
    writeln(fract);
  end;
  read(ch);
end.
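The magic numbers in the program can be decoded with a little arithmetic, sketched here in Python. Two things are assumptions rather than statements from the post: that only the most significant byte of each stored word reaches the VDP port, and that the target is VDP address 0C04H, which is simply what the constants 1024 and 19456 work out to under that assumption. The helper names are hypothetical.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a signed 16-bit Pascal integer."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


def vdp_write_address_words(vdp_address):
    """Return the two words to store at VDPWA to set a write address.

    First the low address byte, then the high address byte with the
    write flag (hex 4000) set, each shifted into the top byte of a word.
    """
    low = (vdp_address & 0x00FF) << 8
    high = ((vdp_address | 0x4000) >> 8) << 8
    return low, high


# The port addresses as the signed integers used in the const section.
ports = {
    "VDPRD": to_signed16(0x8800),
    "VDPST": to_signed16(0x8802),
    "VDPWD": to_signed16(0x8C00),
    "VDPWA": to_signed16(0x8C02),
}

# Working backwards from 1024 and 19456 gives VDP address 0C04H.
address_words = vdp_write_address_words(0x0C04)

# Data bytes are likewise moved to the top byte, as in y*256 and x*256.
y_word = 100 * 256
```

The same shift explains `y*256` and `x*256` in `spr1_pos`: the coordinate has to sit in the upper byte of the word that is stored to the data port.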
Tursi (posted July 6, 2017):

apersson850 wrote: "As far as I know, there's no way to poke to VDP memory in Extended BASIC, so that's probably as optimized as it will be."

I've taken several stabs at it, and never done any better. (Which did surprise me.) We don't have TI LOGO in there, anyone want to try that one?
JamesD (posted July 6, 2017):

apersson850 wrote: "Sure, but as far as I see it, it's not interesting. The value with the UCSD p-system lies in the word system. It's the whole system, with code and memory management, libraries and such that I like. Then I can live with that it doesn't outrun Forth, and that I occasionally have to write external procedures to get the desired performance."

So writing all the code in Pascal and then compiling pieces that need more speed with a different compiler and making them external procedures isn't interesting?
apersson850 (posted July 6, 2017):

I thought you were talking about replacing the whole system with a compiler that produced code files which couldn't be loaded under the p-system. As I understand you now, it's completely different. Automating that process is like having a native code generator (which normally does accompany the p-system), but perhaps even better, if it is an optimizing one.

How fast do you guys sort 1000 random integers? Using whatever language you like, thus most likely assembly?
+Vorticon (posted July 6, 2017):

Tursi wrote: "I've taken several stabs at it, and never done any better. (Which did surprise me). We don't have TI LOGO in there, anyone want to try that one?"

I just might! I also think RXB will be a worthwhile contestant as well with its low level access features.
+Vorticon (posted July 6, 2017):

apersson850 wrote: "Sure. Here you go. ..." (the explanation and the full benchmark program listing from the post above)

Thanks. I think I will need to look this over closely. I was not aware there was a unit called realtime...
apersson850 (posted July 6, 2017):

Well, it's only my machine, and that of one of my friends (though I doubt his is still running), that has a unit realtime.

The p-system keeps track of today's date, and uses it to tag files when they are created or updated. But you have to key in the date manually, as the system always starts with the same date as the last time it was set. It's stored on the disk, so what you get when you start next time is the last date the system disk was used. If the computer had a battery-backed real time clock, keeping track of date and time, the setting of the date could be done automatically. As there was no such device available on the market at that time, I set about to design and build one. I did of course also write the software to use it. The p-system unit realtime is one part of that software.

I have two editions of the unit realtime, one that works with my own clock hardware and one that works with the Triple-Tech card, which was the first clock card available on the market that I came in touch with. A friend of mine bought one, so I wrote a driver for him to use it with the p-system.

My device uses the same clock chip as you find on the P-GRAM card. It's good for precision timing, as it counts down to 1/1000 s. The chip they used on the Triple-Tech card would do seconds only. I've also designed my card so that I can use the interrupt generation capability of the clock. Thus the card can issue an alarm, via an interrupt, at a pre-programmed time, or use a counter rollover interrupt, and thus issue one every minute or every 1/10 s or every day. The every day interrupt can be used to change the system date not only when logging on, but also during operation of the p-system. Since I can change the interrupt vector in my system (I can overlay the console ROM with RAM), I have the ability to, for example, reconfigure the TI to become a controller which can take actions based on time.

But in this benchmark case the clock is simply used to time the activity. The unit realtime allows the dynamic creation of any number of timers (limited only by memory) that can run simultaneously. They can be halted and read individually, and disposed of when no longer needed, to recover the memory used.

Anyway, don't look for it in the stock UCSD p-system, as it's not there.
JamesD (posted July 7, 2017):

apersson850 wrote: "I thought you were talking about replacing the whole system with a compiler that produced code files which couldn't be loaded under the p-system. As I understand you now it's completely different. Automating that process is like having a native code generator (which normally does accompany the p-system), but perhaps even better, if it is an optimizing such thing. How fast do you guys sort 1000 random integers? Using whatever language you like, thus most likely assembly?"

Well, on smaller projects, you could skip the P-Machine and run at speeds similar to the current GCC C compiler. But you could also generate native code modules to work with the P-System. If you develop something under UCSD that needs more speed in certain parts, make them modules, then compile the code on the PC and have freakishly fast speeds while still having the advantages of the P-Machine. How easy it would be to convert the GCC Pascal output to a UCSD module I don't know.

The problem with the UCSD native code translator is that the output still works like the P-Machine, just without the decoding phase. It can't take advantage of lots of registers, it uses the stack a lot, etc. Translation will certainly cut the execution times by quite a bit, but nothing approaching GCC's code generator.
+TheBF (posted July 26, 2017; edited August 10, 2017):

I have been using this exercise to beat up the low level code for my Sprite routines. Here is the current state of the benchmark operating under Indirect Threaded Forth (ITC) and Direct Threaded Forth (DTC), with top of stack cached in a register.

Lee, I can't figure out how you got FB-Forth to go faster than Turbo Forth in the optimized version. Can you double check it when you have nothing better to do?

Note: None of my code is updated in GITHUB. I need to do some housecleaning there.

Code:

HEX
VARIABLE CNT
DECIMAL

\ SP.LOC is Forth code that writes to sprite descriptor table
\ : SP.LOC ( dx dy sprt# -- ) >R >CELL R> ]SDT V! ;

\ alternative to using mtask99 and automotion
: TURSI.BENCH
    GRAPHICS PAGE
    1 MAGNIFY
    2 42 0 0 0 SPRITE       \ CAMEL99 uses BASIC color #s
    SP.SHOW
    100 CNT !
    BEGIN
       CNT @ 0>
    WHILE
       239 0 DO   I 0 0 SP.LOC        LOOP
       175 0 DO   239 I 0 SP.LOC      LOOP
       0 239 DO   I 175 0 SP.LOC  -1 +LOOP
       0 175 DO   0 I 0 SP.LOC    -1 +LOOP
       -1 CNT +!
    REPEAT ;

HEX
300 CONSTANT $300
$300 1+ CONSTANT $301
DECIMAL

: TURSI.OPT
    GRAPHICS PAGE
    1 MAGNIFY
    2 42 0 0 0 SPRITE       \ CAMEL99 uses BASIC color #s
    SP.SHOW
    100 CNT !
    BEGIN
       CNT @
    WHILE
       239 0 DO   I $301 VC!      LOOP
       175 0 DO   I $300 VC!      LOOP
       0 239 DO   I $301 VC!  -1 +LOOP
       0 175 DO   I $300 VC!  -1 +LOOP
       -1 CNT +!
    REPEAT ;

Here are the standings:

Language       First Pass   Optimized
GCC            15 sec       5 sec
Assembly       17 sec       5 sec
TurboForth     48 sec       29 sec
CAMEL99 DTC    49 sec       27 sec
Compiled XB    51 sec       37 sec
CAMEL99 ITC    55 sec       29 sec
FbForth        70 sec       26 sec
GPL            80 sec       none yet
ABASIC         490 sec      none yet
XB             2000 sec     none yet
UCSD Pascal    7300 sec     273 sec