apersson850

Everything posted by apersson850

  1. Finally, a program that does the same, but without any external procedures. Pure Pascal in 273 seconds. As far as I know, there's no way to poke to VDP memory in Extended BASIC, so the XB time is probably as optimized as it will be. At least not without external functions, like special CALLs implemented on Horizon RAMdisks and such. Thus we have 2000 seconds vs. 273 seconds here. That's in line with what I've experienced before, where Pascal is a couple of times faster than BASIC, but of course not near assembly speed. There are a few more things you can do to optimize further, but I won't bother now. The step from 780 to 273 seconds still proves that if you know the Pascal system well, you can make it perform better. And the step down to 166 seconds shows that if you use assembly support where it's most needed, then you can get some more. But that's true for most languages.
     Language       First Pass   Optimized
     GCC            15 sec       5 sec
     Assembly       17 sec       5 sec
     TurboForth     48 sec       29 sec
     Compiled XB    51 sec       37 sec
     FbForth        70 sec       26 sec
     GPL            80 sec       none yet
     ABASIC         490 sec      none yet
     XB             2000 sec     none yet
     UCSD Pascal    7300 sec     273 sec
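The ratios quoted in this post can be checked with a few lines of arithmetic. This is only a sketch: the timings are the ones stated in the post, while the dictionary labels and the `speedup` helper are made up for illustration.

```python
# Timings in seconds, as quoted in the post. The labels are illustrative only.
TIMINGS = {
    "XB": 2000,
    "Pascal, earlier optimized version": 780,
    "Pascal, no external procedures": 273,
    "Pascal with external assembly procedure": 166,
}


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second timing is than the first."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    baseline = TIMINGS["XB"]
    for label, seconds in TIMINGS.items():
        print(f"{label}: {seconds} s, {speedup(baseline, seconds):.1f}x vs XB")
```

With these numbers, pure Pascal comes out a bit more than seven times faster than XB, which fits the "a couple of times faster" observation.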
  2. In the next step, I added an external procedure which simply plugs the values for x and y directly into the sprite attribute table. Execution time is now 166 seconds. This is of course still not near Forth or pure assembly, but shows that the p-system and Pascal are, at least normally, significantly faster than Extended BASIC.
  3. Just to see how much of the time is consumed by the sprite routines in the Pascal program, I commented them out. The program still runs all the loops and does all the assignments, but it never calls set_sprite. Now it does the 100 loops in 150 seconds. Next I'll make it write to the sprite attribute table directly, and we'll see what difference that makes.
  4. All right, I did an overhaul of the old Horizon RAMdisk I happened to get, and kind of got it going. I write "kind of", because there seems to be a bad connection somewhere on the card, as a few bytes get destroyed sometimes. Thus I can't really use it, but at least it works well enough to hold the data the p-system needs to think it's there. As can be seen in this picture, my system now thinks it has seven drives (blocked devices, indicated by a # before the name). There are four regular drives, named PASSYS, DPASCAL, DESIGN and TIMING. It also has two RAMdisks, named RAMDSK1 and RAMDSK2. The drive called OS is the GROMdisk on the p-code card. I'm sorry about the reflection from the window, but it's readable. Anyway, even if this particular RAMdisk has some issues, it's a proof of concept. The p-system can use two different RAMdisks, set as DSK5 and DSK6. Then they show up as units #11 and #12. I've yet to find out whether a disk reporting as DSK7 will come up as unit #13. But it doesn't matter too much. I actually laid my hands on three different Horizon RAMdisks at the same time, so the overhaul will continue with the next one, to see if it works better. The program listed in the file I've referred to before works. But if you install two disks, you must make sure you install one PCB in the character pattern table and the other in the sprite motion table. Regarding the compiler, we'll have to accept it as it is, if we want to run it on a real TI. When the compiler, the source and the object files are all on a RAMdisk, it's at least twice as fast as it normally runs. It compiles the RAMdisk install program (470 lines) in two and a half minutes, with the code on normal floppy disk.
  5. Yes, the TI will also accept BASIC commands in any case. Pascal is the same. Variable, VARIABLE and Vari_Able are the same thing. But device names are different. They are coded in the DSR, and the DSRLNK program compares the name literally against what's in the DSR. As Lee wrote above, the CorComp controller will also understand DSK4 (and dsk4), but it takes a trick to convince the p-system to use more than three drives.
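The difference between the two kinds of name handling can be sketched as follows. This is only an illustration: the name tables and the identifier normalization are assumptions chosen to reproduce the behaviour described in these posts, not the actual DSR contents or compiler logic.

```python
# Hypothetical name tables. A real DSR keeps its device names in its own ROM.
STANDARD_NAMES = {"DSK1", "DSK2", "DSK3"}
# Assumption: modelled on the observation that the CorComp controller
# answers to both DSK4 and dsk4.
CORCOMP_NAMES = STANDARD_NAMES | {"DSK4", "dsk1", "dsk2", "dsk3", "dsk4"}


def dsr_match(requested, dsr_names):
    """Literal comparison against the names in the DSR, with no case folding."""
    return requested in dsr_names


def pascal_identifier_key(name):
    """Assumed normalization under which Variable, VARIABLE and Vari_Able agree."""
    return name.replace("_", "").upper()


if __name__ == "__main__":
    for name in ("DSK4", "dsk4", "Dsk4"):
        print(name, dsr_match(name, CORCOMP_NAMES))
    print(pascal_identifier_key("Vari_Able") == pascal_identifier_key("VARIABLE"))
```

A mixed-case name such as Dsk4 fails the literal match, while the identifier comparison ignores case entirely.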
  6. It could very well be that such optimization would be useful here too. I haven't gone through the code looking for typical peephole candidates.
  7. The code doesn't have to be that horrible just because you push and pop data from the stack. But I know for sure that there are things you need to keep track of as a programmer, things a good optimizing compiler would figure out by itself and fix for you. So I still suspect it's a "you asked for it, you got it" compiler. As far as I've read, it will not use the INC and DEC opcodes unless you tell it to in the source (using pred and succ instead of +1 or -1). I haven't checked, even if that's easy enough to do. A p-code disassembler is part of the system. They make code generation simple, or at least simpler, by using the fact that some of the p-codes are specifically tailored to meet requirements a Pascal program has. Like the one I showed above, to find a local variable in a lexical parent, any level up. The general approach to speeding up Pascal programs, or rather p-code programs, is conversion to native code using the Native Code Generator program. Unfortunately, there's no such program delivered with the TI 99/4A. The p-codes required are supported, though, so if you study the code files enough, you could write your own.
  8. Nice. Is there any that will take the console?
  9. I have the CorComp double sided double density controller, and it also recognizes dsk1 as well as DSK1. But not Dsk1 or any other such combination.
  10. It's a bit confusing here, what you mean by "the compiler running". Are you referring to the compiler itself, when it compiles source code, or to the code that it has compiled? For the TI 99/4A there's the special issue that it's actually a 32 K RAM machine. UCSD Pascal needs a 48 K RAM machine to work reasonably well. But they've used the trick that the PME can run code from CPU RAM, VDP RAM and GROM (on the p-code card) to make it possible to implement the UCSD p-system IV.0 on the 99/4A. That wouldn't work if everything was in native code, as you couldn't run any programs from VDP RAM in that case. The p-code card could technically bank-switch a lot more ROM than it does (it has 12 K ROM, where 4 K remains the same, and the other 4 K address range switches between two different banks). Now it also has 48 K GROM, and they are easy enough to access, as they are seen through a byte-wide window only. But this also makes the interpreter slightly slower. Not much when running code in line (the IPC must be separately incremented, so you lose one CPU instruction per p-code), but more so when a jump has to be taken, as it takes longer to update the VDP or GROM read address than to just load a new value into the IPC (which is in R8). If you jump from code in VDP RAM or GROM to code in CPU RAM, or vice versa, you also have to load the other PME instruction/immediate data fetch routine. It runs in RAM at 8300H for speed, but only one version fits at a time.
  11. Obviously, p-code runs seven times slower than pure assembly when it takes seven instructions to execute one p-code that has a direct correlation to a single TMS 9900 instruction. Instructions that are used to find data in another procedure and such do of course take longer. But they couldn't execute in one single TMS 9900 instruction either, so the overhead there is less. If you look at the SLOD2 instruction, it takes 24 TMS 9900 instructions to execute it, and six to decode it. Thus the overhead only adds 25%, rather than the sevenfold slowdown seen with ADI. I don't know where the idea to implement p-code for the UCSD system came from. But compiling a language to some intermediate code, and then either converting that all the way or interpreting it, is a fairly old idea. As they wanted portability, it's very efficient, since implementing the PME on a new platform is a significantly smaller task than modifying the compiler each time.
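The overhead figures in this post work out as below. The sketch assumes that the seven instructions for ADI split into six for fetch and decode plus one that does the actual work, by analogy with the six decode instructions quoted for SLOD2; the post itself only gives the total. The function names are illustrative.

```python
def overhead(decode_instructions, work_instructions):
    """Interpreter overhead as a fraction of the useful work."""
    return decode_instructions / work_instructions


def slowdown(decode_instructions, work_instructions):
    """Total instructions executed per instruction of useful work."""
    return (decode_instructions + work_instructions) / work_instructions


# Assumption: ADI is 6 decode + 1 work instruction (7 in total).
ADI = (6, 1)
# From the post: SLOD2 takes 6 instructions to decode and 24 to execute.
SLOD2 = (6, 24)

if __name__ == "__main__":
    print("ADI slowdown:", slowdown(*ADI))
    print("SLOD2 overhead:", overhead(*SLOD2))
    print("SLOD2 slowdown:", slowdown(*SLOD2))
```

The fixed decode cost dominates for trivial p-codes and almost disappears for the complex ones, which is the point being made.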
  12. You can find the Pascal compiler manual at the WHTech site. The pre-compiled unit sprite is described on page 144. You can also check the chapter before, which describes sound processing in the p-system. As you can see, they've made quite elaborate designs here.
  13. Most of these NOP instructions aren't really executed, but are there just as fillers. When reading code from memory mapped devices, there's no auto-increment of the PC (R8), so it has to be advanced with extra INC instructions.
  14. Well, never mind waiting for anybody asking for anything... Here are the central parts of the PME (P-Machine Emulator). Note that the address of PMEFETCH is 8300H when the machine is running, so this is in 16-bit RAM. But there are two different fetch routines, one for code in CPU RAM and the other for code in VDP RAM or GROM. They are both loaded at the same place. The CPU RAM version needs some NOP instructions to occupy the same addresses, since there are six external entry points into the interpreter. PMEFETCH gets the opcode of the instruction, and that's always 8 bits. It then looks into a table, which gives the address of the instruction's interpretation. That part begins with an address where to start running the interpretation, since that may be directly after the entry, or it may branch to one of five locations in the interpreter, where one, two or three parameters that are in-line in the code (not on the stack) are fetched to R3, R4 and R5 before the interpretation of the actual instruction continues. I've included a few instructions in full detail. LDO loads a word from the program's global data area. R14 points to that area. LOD loads data from a caller's local data area (environment record). There are two short forms, used to load data from the caller, or the caller's caller. Then there's one general form that can load from any lexical level above the currently running procedure. R9 points to the current environment record. It's up to the programmer to think about this. Using variables further up than two levels takes more time. There's a similar thing for local variables. There are faster instructions to pick the first variables, compared to those further down among the declarations. So declare short variables that are frequently used first. If you declare an array first, you may run out of reach of all short local variable load instructions in one fell swoop. ADI and LOR perform addition and logical OR (so you can compare with the 6502 code above).
It takes seven instructions to execute these codes, if they are in CPU RAM. Normally, p-code runs from VDP RAM, where it takes eight instructions to accomplish something the TMS 9900 could have done with one, if the p-code was converted to native code. But LOR is one byte long, while SOC *SP+,*SP is two. So for this particular instruction, native code gives a speed gain of roughly eight times, at the cost of twice the memory use. There are p-code instructions for more complex things too, like calling global or local procedures. They are more like small program segments, and may invoke quite a lot of code, if a segment fault is issued on a call (the called procedure isn't currently in memory, but must be retrieved from disk). The instructions SIGNAL and WAIT also have their own p-codes, as they shouldn't be interrupted. This is not based on any source code or such, but on my own deciphering and inspection of my p-system on the 99/4A. It was necessary to understand more than the manual tells you to be able to implement pre-emptive multitasking and bit-map mode, to allow the system to do turtlegraphics.
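The fetch, table lookup and dispatch cycle described in this post can be sketched as a toy stack machine. This is not the PME code. The opcode numbers, the PUSH_BYTE and HALT instructions and all names are hypothetical; only the general shape (8-bit opcode, dispatch table, in-line parameters, ADI and LOR working on the stack) follows the description.

```python
# Hypothetical opcode numbers; the real p-code values are different.
OP_PUSH_BYTE = 0x01  # hypothetical: push the byte that follows in-line
OP_ADI = 0x02        # add the two top stack words
OP_LOR = 0x03        # logical OR of the two top stack words
OP_HALT = 0xFF       # hypothetical: stop and return the stack


def run(code):
    """Interpret a bytes object of toy p-code and return the final stack."""
    stack = []
    ipc = 0  # interpreter program counter

    def fetch():
        # Every fetch advances the counter explicitly, one byte at a time.
        nonlocal ipc
        value = code[ipc]
        ipc += 1
        return value

    def push_byte():
        stack.append(fetch())  # the parameter is in-line in the code

    def adi():
        b = stack.pop()
        a = stack.pop()
        stack.append((a + b) & 0xFFFF)  # keep results within 16 bits

    def lor():
        b = stack.pop()
        a = stack.pop()
        stack.append(a | b)

    # The opcode indexes a table of handlers.
    dispatch = {OP_PUSH_BYTE: push_byte, OP_ADI: adi, OP_LOR: lor}

    while True:
        opcode = fetch()
        if opcode == OP_HALT:
            return stack
        dispatch[opcode]()


if __name__ == "__main__":
    program = bytes([OP_PUSH_BYTE, 2, OP_PUSH_BYTE, 3, OP_ADI, OP_HALT])
    print(run(program))
```

Each p-code pays for one fetch and one table lookup before any useful work happens, which is where the fixed per-instruction cost comes from.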
  15. No, until proven wrong, I claim that the low performance is due to the implementation of the unit sprite. Again, it's more advanced than any other such implementation, and is really intended to allow free-running lists of sprite actions, running by themselves and updating on the VDP interrupt only. I assume doing modifications to the data structures inside this unit is what takes time.
  16. The part about the registers you quote is from an earlier version of the p-system, not IV.x, so the registers aren't the same. But it's their equivalents I'm referring to, yes. The interpreter in the 99/4A is further complicated by the fact that it can run code from CPU RAM, VDP RAM or GROM as well. If you like, I can post the inner part of the interpreter here.
  17. I guess you didn't do all the 100 loops. Did you multiply by the wrong value, maybe? I did five turns and multiplied by 20. Anyway, just a few of the programs actually manage any kind of data structures for sprites. The rest focus on writing to the VDP RAM at certain locations, assuming the user knows which VDP RAM location to access. I'll make an optimization for the Pascal version where it also writes to VDP RAM directly, but still in Pascal, just to see what happens. I'm not sure if the routines in unit sprite will access the sprite data by themselves, when there's no countdown data in the sprite record, so I don't know if it works. I've never used sprites for anything in Pascal, so I have no real reference to how fast or slow that handling is. But generally, Pascal runs a couple of times faster than Extended BASIC. Oh, by the way, James D wrote earlier that the PME has no registers, but works with a stack, and that the instructions are 16-bit. The p-machine has some registers, which of course are emulated by the PME, and the instructions are 8-bit. Some of them do have one or more bytes in-line, though, as immediate data. So even if the instructions basically are 8 bits, some are extended to several bytes. Most of the data is referenced either by being on the stack, or via the stack. Variables in the current environment record are referenced through R9 in the TMS 9900 in this particular implementation of the PME. Global data through R14.
  18. There is not a single GPL instruction running in the PME. The P-Machine Emulator runs in native code, of which some is in the RAM at 8300H to optimize speed. The problem with the chosen benchmark is that it mainly tests the system's unit sprite, which I suspect is written to implement a more advanced automatic motion/sprite redefinition system than most, and doesn't care much about efficiency when doing manual changes.
  19. From what I've read, the p-system Pascal compiler isn't doing much in the way of optimizing for speed. If it does anything, it's for compact code. But that's more a question about generating code that "fits" the PME (P-Machine Emulator) than anything else. As far as I remember, writing cnt := pred(cnt); generates faster (and more compact) code than writing cnt := cnt-1; But that's not where the time is spent in the benchmarks. They mainly test the library sprite, a pre-compiled unit which has a lot of features, but obviously doesn't work very fast. Since you can create linked lists with sprite descriptions, lists where each element has a timeout and links to the next item when the time runs out, they offer more functionality than I've seen implemented for other languages. It could even be that they are written so that you can only update the sprites on each interrupt, but that I don't know. Pure speculation. The p-system does have its own sprite auto-motion code, which runs on each VDP interrupt. Speaking about interrupts, the p-system also handles concurrency and a keyboard buffer during them. That's a few more instructions to pass through each time. Does Forth scan the keyboard to check for a user break key while words are running? As a general comment, the p-system is designed around the idea of fitting a complete Pascal compiler (equivalent to the capabilities found in Turbo Pascal 4, which took a lot of inspiration from UCSD Pascal), and a run-time system which allows for dynamic relocation of code and automatic swapping of code segments from disk during operation, inside the limits imposed by having only 48 K RAM, of which one third actually is video memory. Execution speed takes a hit there. Another cornerstone of the p-system design is portability. The compiler is the same program regardless of which system you run it on. If it generated native code, it would require substantial changes for each CPU. Thus portability would be lost.
Most p-systems do provide the NCG program, a Native Code Generator which can accept a critical program segment as input and translate that to machine language. The p-code which runs in-line assembly is supported, and the compiler on the TI does support generating in-line p-code directly, but then you have to know about it (it's not documented) and you have to handle how to get the assembly in there. I've found it easier to develop the programs in Pascal, when doable, but design them around calling procedures/functions most of the time, instead of large chunks of in-line code. Thus it's relatively easy to re-write critical things in assembly.
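The linked lists of sprite descriptions mentioned in this post, where each element has a timeout and links to the next one, could look roughly like the sketch below. Everything here is hypothetical: the record fields, the class names and the per-interrupt stepping are an illustration of the idea, not the layout or behaviour of the actual unit sprite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpriteAction:
    """One hypothetical list element: a position held for a number of ticks."""
    x: int
    y: int
    ticks: int  # timeout, counted in interrupts
    next: Optional["SpriteAction"] = None


class SpriteChannel:
    """Steps through a list of actions, one countdown step per interrupt."""

    def __init__(self, first):
        self.current = first
        self.remaining = first.ticks if first else 0

    def on_interrupt(self):
        """Count down; when the timeout expires, follow the link to the next item."""
        if self.current is None:
            return None
        self.remaining -= 1
        if self.remaining <= 0:
            self.current = self.current.next
            self.remaining = self.current.ticks if self.current else 0
        return self.current


if __name__ == "__main__":
    actions = SpriteAction(10, 20, ticks=2, next=SpriteAction(30, 40, ticks=1))
    channel = SpriteChannel(actions)
    for _ in range(3):
        print(channel.on_interrupt())
```

A structure like this runs by itself once set up, but any manual change has to go through the list bookkeeping, which is one plausible reason for manual updates being slow.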
  20. Now that I've got my own RAMdisk working in the system again, after accidentally wiping out all drivers before, I tried the benchmark. The optimized version (which is the only one that makes sense in the Pascal implementation) runs in 683 seconds on my machine, which has memory without wait states. I presume the 780 seconds were measured on a conventional memory expansion? The faster memory will of course benefit other languages too. How much depends on how many of their memory accesses go to the standard 32 K RAM expansion, of course. It's obvious that it's the sprite access through the SPRITE library in the p-system that's slow. On the other hand, it's more comprehensive than what you get in Extended BASIC, for example. I've never used sprites for anything serious with the p-system. But if I feel I have the time I'll check how the p-system behaves if you short-circuit the library's sprite handling and write to the VDP memory directly. By the way, the first edition runs in 8254 seconds on my machine. Are you sure the time of 7300 seconds is correct? I modified the program to measure the time with my real time clock, so I know my timing is correct.
  21. OK, so Poland took away one week for me. But now I've checked it carefully, and found that the additional code was not additional. It was just a shortcut that works with Horizon RAMdisk only. My code, the one already posted, was a bit more general. The reason I suspected that my code wasn't all that was needed was that it didn't work as expected. But that turned out to be due to some mistakes in re-creating it from the listing, not in the listing itself. I'll see if I can get my Horizon card to work at all, and if so, whether I can get a DSR in there. Without having to write one myself.
  22. I thought I already had posted something in this thread, but it turned out I was mistaken. So, here's a link to correct that assumption.
  23. I recognize that issue. My TI 99/4A has 64 K internal RAM, which is segmented in 8 K banks. Bank switching is via CRU bits. As 64 K is the entire address range of the TMS 9900, this implies that this memory can overlay everything. When I implemented a RAM disk, with a DSR on my I/O card, I wanted to use all available memory for the RAM disk. That includes console RAM at >4000->5FFF. But since enabling that memory disables the console's capability to access the DSR in the PEB, it took some extra code to fix that. Switching away the memory the current code is running in is the electronic version of suicide. Since there are (were) only three consoles I know about with my memory structure, I did of course also have to write my own loader software. I decided not to make my own assembler (which would have been necessary to add directives for bank switching), but instead split it up. If I wanted to assemble code that should be loaded at a "forbidden" place, i.e. where other computers didn't have RAM, I assembled with the .ABSOLUTE and .ORG directives. Then, when running the loader, I had to separately specify which CRU bit to turn on and which file to load with that bit on. To make it easy to use in daily life, when maybe several code segments should be loaded in various memory locations, I made the loader so that I can specify any number of CRU addresses and code files. The loader will then load them all in a sequence. My loader can also save the sequence data in a file, and retrieve and run that file on another occasion. I even automated it in such a way that when I start the DSRLOADER (I gave it that name, since the first application was to load DSR files in RAM on my own PEB cards), it looks for the file *SYSTEM.LOADER. If it exists, it will run the load sequence stored in that file.
Doing it like this made it possible to just turn the 99/4A on, and provided the right diskette was in the first drive, it booted not only the operating system, but also installed the clock and RAM-disk DSR files, set the system date, installed the fourth and fifth drives and copied the system files I mostly benefited from having on the RAM disk without any manual intervention. This has absolutely nothing to do with the SAMS card, but since it's about ways to use more memory than initially thought for this computer, I wrote it anyway. Just disregard if you didn't find anything useful to build on.
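The load sequence idea from this post (any number of CRU address and code file pairs, saved to a file and replayed automatically when that file exists) can be sketched like this. The JSON file format, the callback names and the step of turning the CRU bit off again after each load are assumptions made for the illustration; the real loader's file format and hardware access are not described in the post.

```python
import json
import os


def save_sequence(path, steps):
    """Store a list of (cru_address, file_name) pairs. JSON is an assumed format."""
    with open(path, "w") as handle:
        json.dump([[cru, name] for cru, name in steps], handle)


def load_sequence(path):
    """Read back the stored list of [cru_address, file_name] pairs."""
    with open(path) as handle:
        return json.load(handle)


def run_sequence(steps, set_cru_bit, load_file):
    """Load each file with its CRU bit turned on.

    set_cru_bit and load_file are hypothetical callbacks standing in for the
    hardware access. Turning the bit off afterwards is an assumption.
    """
    for cru, name in steps:
        set_cru_bit(cru, True)
        load_file(name)
        set_cru_bit(cru, False)


def autorun(path, set_cru_bit, load_file):
    """Run the stored sequence if the file exists; report whether it ran."""
    if not os.path.exists(path):
        return False
    run_sequence(load_sequence(path), set_cru_bit, load_file)
    return True
```

A caller would pass its own functions for flipping CRU bits and loading code files, so the same sequence file can drive both a real loader and a dry run that only logs the steps.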
  24. I was referring to the keyboard of the TI 99/4, yes. But some other computers had cheap keyboards too. The Sinclairs being the worst, with ZX81 and ZX Spectrum. At TI, they dismantled the Commodore VIC 20 as well. They found the content very funny, but couldn't get away from the fact that it was cheaper to produce. As stated above, it's true that TI took the consequences and killed the home computer business. It was also claimed in the referenced article that they decided to go for processors covering specific areas, like the successful TMS 320 series of DSP chips.