
GPL understanding & performance optimization


speccery


As discussed a little yesterday in the Pandemic club 4A zoom call (thanks for the interesting discussion!), I started to look a bit at how the GPL interpreter could be optimised. My original idea was to add some special-purpose instructions to my FPGA processor core to speed up the interpretation, but as I started to look at the code it became quite clear that a lot can be achieved with normal code optimisation.

 

@RXB mentioned that there has been a discussion with @Tursi about this topic. I somehow recall seeing that thread myself, but couldn't easily find it (which is probably my fault). As an obvious optimisation, the GPL instruction decoding could replace the multiple levels of tables with a single 256-entry lookup table (occupying 512 bytes), at the cost of using some more memory. For that part I could create a new instruction which combines a few TMS9900 instructions, in pseudo code:

// Address 0x78
MOVB *R13,R9
JGPL @TABLE,R9

// Here JGPL would be a new instruction, something like below.
// The instruction would only perform 2 memory fetches: Read R9, and fetch the jump vector from TABLE.
BANK 1			// Switch bottom 8K to a new bank, which has the jump table
MOV R9,TEMP		// Temp internal register
SRL TEMP,7		// TEMP >> 7, shift to a word index
MOV @TABLE(TEMP),TEMP	// Fetch from table
BANK 0			// Switch bottom 8K to normal bank
B *TEMP

In the arrangement above, since the opcode would be passed to the new instruction JGPL, that instruction could also be developed further to understand some GPL instructions directly, executing them in the CPU core itself instead of in TMS9900 code. Many GPL instructions are quite involved, so it would be best to be able to improve things incrementally, for example starting with branch instructions, which seem to be rather simple.
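To make the table lookup concrete, here is a minimal Python sketch of what the proposed JGPL would compute. The function names and the handler addresses are made up for illustration; they are not real console ROM entry points. It assumes the opcode sits in the high byte of R9 (as MOVB leaves it) and that the table holds big-endian 16-bit vectors.

```python
# Sketch of the table lookup behind the proposed JGPL instruction.
# Handler addresses are placeholders, not real console ROM entry points.

def jgpl_table_offset(r9: int) -> int:
    """Byte offset into the jump table for an opcode held in the high byte of R9."""
    # Masking the low byte is an assumption: MOVB leaves it unchanged, so it may be stale.
    return (r9 & 0xFF00) >> 7   # equals opcode * 2, one 16-bit vector per opcode

def build_table(handlers: dict, default: int) -> bytes:
    """Build a 256-entry, 512-byte table of big-endian 16-bit handler addresses."""
    table = bytearray()
    for opcode in range(256):
        table += handlers.get(opcode, default).to_bytes(2, "big")
    return bytes(table)

def dispatch(table: bytes, opcode: int) -> int:
    """Return the handler address that JGPL would branch to."""
    r9 = opcode << 8            # MOVB *R13,R9 places the opcode in the high byte
    offset = jgpl_table_offset(r9)
    return int.from_bytes(table[offset:offset + 2], "big")

TABLE = build_table({0x05: 0x1234}, default=0x0100)
```

With this table, `dispatch(TABLE, 0x05)` returns `0x1234`, and any opcode without its own entry falls through to the default handler address.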

 

I also realised that I am jumping the gun - I should try to look at some GPL code before going to the optimization phase, to understand how things work. To that end I started using xga99.py to assemble the GPL code for the Minimemory cartridge, as a test. I also picked it because I think this is a very cool cartridge which could be integrated and expanded in interesting ways in both my icy99 and StrangeCart projects.

 

So I got the GPL source code for Minimemory from Thierry's excellent TI-99/4A tech pages. I guess that code is for his GPL assembler. But I wanted to use the xdt99 package. So I started to assemble the source with xga99.py, like so:

 

xga99.py --aorg 6000 mmg.gpl -L mmg.lst

 

I quite quickly ran into a few problems, due to differences in syntax, for example:

  • The AORG directive in xdt99 does not accept addresses higher than 8K. This causes a number of problems, because there is a hole in the code, i.e. it AORGs to >70AC skipping a bunch of bytes. I guess I have to manually fill that range with some bytes.  
  • The multiplication instruction in the source is MPY, but xga99 uses MUL. Not a biggie.
  • Many lines in the code contain comments (which is great) after the code. I have never understood why the comments don't start with a special character like semicolon or something, that would make parsing easier for the assembler and it could probably also prevent some mistakes. Anyway, xga99 could not assemble a number of lines because the comments were separated by just a space. I just removed those comments after the code (by moving them to a separate comment line).
  • The HTEX instruction (in a FMT block) escapes hex bytes differently, simple change: from HTEX '[>0A]' to HTEX >0A
  • Some other opcodes also are different: CAR -> CARRY, PARS -> PARSE, DCGTE -> DCGT
  • The source code uses the BIAS command also outside a FMT - FEND block, where it appears to specify a constant to be added to strings specified with the STRI directive. The source I used first has BIAS >60 to set the TI Basic character code offset. I did not find a way to replicate this functionality in xga99. The advice goes: "use the source, Luke". And so I did, and created a new directive STRI60 for xga99, as follows. It's a hack for sure, but I didn't want to enter the text as BYTE statements.
* Original source (disassembled and commented by Thierry)
       BIAS  >60
G6E1A  STRI  "ILLEGAL TAG"
G6E26  STRI  "CHECKSUM ERROR"
----------------------------------
* Modified source for xga99:
*EPEP       BIAS  >60

G6E1A  STRI60  "ILLEGAL TAG"
G6E26  STRI60  "CHECKSUM ERROR"
----------------------------------
* xga99.py has been modified to support the new STRI60 as follows:
    # EP 2020-12-13 added new STRI60 operation to add the screen offset to each byte.
    # Used for Mini Memory porting
    @staticmethod
    def STRI60(asm, label, ops):
        asm.process_label(label)
        text = ''.join(asm.parser.text(op) for op in ops)
        asm.emit(len(text), *[ord(c)+0x60 for c in text])

And this is roughly where I am at the moment. I am comparing the generated GPL binary image to the original, and now the first >770 bytes match, except for the pointer to >70AC due to the AORG issue. I need to come up with a solution for that; probably I'll just fill the empty range with some bytes to get to >70AC.
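The comparison and the gap filling can be sketched in a few lines of Python. Both helper names are hypothetical, and the fill value of >00 is purely an assumption, since the thread does not say what the original image contains in the skipped range.

```python
def first_mismatch(a: bytes, b: bytes):
    """Offset of the first differing byte, or None if the images are identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # No difference in the common part: differ only if one image is longer.
    return None if len(a) == len(b) else min(len(a), len(b))

def pad_to(image: bytes, base: int, target: int, fill: int = 0x00) -> bytes:
    """Extend an image loaded at `base` with fill bytes so the next byte lands at `target`."""
    gap = target - (base + len(image))
    if gap < 0:
        raise ValueError("image already extends past the target address")
    return image + bytes([fill]) * gap
```

In use, the two images would be read with `open(path, "rb").read()` and passed to `first_mismatch`, and the code assembled before the hole would be padded with `pad_to(code, 0x6000, 0x70AC)` before appending the part that belongs at >70AC.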

Edited by speccery

I now have the Minimemory GPL code I assembled with xga99.py loaded on the icy99, and it seems to run fine. Icy99 does not yet support RAM at >7000, so in theory I need to enable that, although I don't know whether it matters since the 32K RAM expansion is there.

 

Now that I have a meaningful GPL program to play with, I can hopefully get an idea of what typical GPL instruction sequences look like, to understand where the low hanging fruit is for performance optimization in the interpreter. Minimemory does not need optimization though :)


My discussion about optimizing the interpreter is theory only, I haven't done any code. But IMO the lowest hanging fruit with the greatest benefits would be instruction decode, as you suggest using a lookup table, and the MOVE command. MOVE doesn't take advantage of the GROM or VDP auto increment, rather it re-sets the address for every read and every write. This is because it doesn't keep track of what type of move it's doing, so it has to make no assumptions (for instance, VDP to VDP is legal and would require the continuous address change anyway). To accommodate such moves, a small intermediate buffer (8 bytes is enough!) would do worlds of good, but I was considering different functions - one for moves that need the buffer (again, I think only VDP to VDP or GROM to GRAM), and one for moves that don't (everything else).
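To put rough numbers on the MOVE point, here is a toy Python model. It is not the console's actual MOVE code: `AutoIncPort`, `move_naive` and `move_buffered` are invented for illustration, the port is a simplified stand-in for a GROM or VDP with a single auto-incrementing address register, and overlapping source and destination ranges are ignored.

```python
class AutoIncPort:
    """Toy model of a GROM/VDP-style port: one address register that auto-increments."""
    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.addr = 0
        self.address_sets = 0   # how many times the address had to be written

    def set_address(self, addr: int) -> None:
        self.addr = addr
        self.address_sets += 1

    def read(self) -> int:
        value = self.mem[self.addr]
        self.addr += 1
        return value

    def write(self, value: int) -> None:
        self.mem[self.addr] = value
        self.addr += 1

def move_naive(src, src_addr, dst, dst_addr, count):
    """Re-set both addresses for every byte, as the post describes MOVE doing."""
    for i in range(count):
        src.set_address(src_addr + i)
        value = src.read()
        dst.set_address(dst_addr + i)
        dst.write(value)

def move_buffered(src, src_addr, dst, dst_addr, count, buf_size=8):
    """Copy through a small buffer; still correct when src and dst are the same port."""
    done = 0
    while done < count:
        chunk = min(buf_size, count - done)
        src.set_address(src_addr + done)
        buf = [src.read() for _ in range(chunk)]
        dst.set_address(dst_addr + done)
        for value in buf:
            dst.write(value)
        done += chunk
```

In this model a 16-byte VDP-to-VDP move costs 32 address set-ups with the naive loop and 4 with the 8-byte buffer. Moves between two different ports would not need the buffer at all and could set each address once.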

 

The next greatest benefit would be improving the address decode, but I haven't looked too deeply there. However, it's used all the time, so any optimization pays off.

 

GPL was written to try and save space, so it jumps around all over the place and repeats work a lot. I think speeding it up would be easy. I don't know how easy it would be to fit the sped-up interpreter in the same 8k, though, that requires actually doing it. ;)

 


3 hours ago, Tursi said:

...MOVE doesn't take advantage of the GROM or VDP auto increment, rather it re-sets the address for every read and every write...

 

The next greatest benefit would be improving the address decode, but I haven't looked too deeply there. However, it's used all the time, so any optimization pays off.

Thanks for these comments! I have read a bit more source code, mainly looking at decode in general (as discussed in the first message here), the arithmetic operations, storing of results (>228) and GPL addressing modes (from >077A onwards). The move point you mentioned I hadn't read.

In terms of new capabilities, it would seem there would be a lot to gain from being able to have direct 16-bit read/write access to the VDP address pointer, as well as having instructions to read and write 16-bit quantities to the VDP. These would not require CPU modifications, just small changes to the VDP.
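For context on why a 16-bit path to the address pointer would help: on the standard TMS9918A-family VDP the address is loaded as two separate byte writes to the control port, low byte first, then the upper six bits with a flag bit marking a write set-up. A small Python sketch of that byte sequence follows (`vdp_address_bytes` is just an illustrative name); a hypothetical 16-bit port in the FPGA VDP would collapse the pair into a single access.

```python
def vdp_address_bytes(addr: int, write: bool) -> bytes:
    """Two control-port bytes for a 14-bit VDP address: low byte, then high bits plus write flag."""
    if not 0 <= addr < 0x4000:
        raise ValueError("VDP addresses are 14 bits")
    high = (addr >> 8) & 0x3F
    if write:
        high |= 0x40            # bit 6 set selects address set-up for writing
    return bytes([addr & 0xFF, high])
```

For example, setting up a write to VDP address >1234 sends >34 followed by >52, and the interpreter has to split and shuffle the 16-bit value every time it does this.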

3 hours ago, Tursi said:

GPL was written to try and save space, so it jumps around all over the place and repeats work a lot. I think speeding it up would be easy. I don't know how easy it would be to fit the sped-up interpreter in the same 8k, though, that requires actually doing it. ;)

The existing 8K ROM space is an interesting limitation. I'm thinking there are a few different paths to accommodate or overcome the ROM size limitation:

1. Icy99. Since this is a FPGA which is not the original console, anything goes, as long as TI-99/4A software works. Extra instructions, higher clock speeds, multiple banks of console ROM, etc.

2. Real iron, updated ROMs. Since modifying console ROMs would probably be necessary, I suppose by the same token one could install larger ROMs and have multiple pages of ROM. This would make improvements a lot easier without breaking compatibility with existing software.

3. Real iron, existing ROMs. For certain things, such as speeding up TI Basic, it should be possible to make a cartridge which would contain a new main loop for GPL processing. Or perhaps this could be done in a DSR. Either way, it could be applied to an unmodified console. However, getting any speed benefits would probably be hard, as without any modifications the software updates would be constrained to executing from 8-bit wide memory. The algorithmic improvements would really have to be good to overcome the brake pedal imposed by 8-bit memory.

4. Real iron, existing ROMs, compile GPL to machine code. I don't know if GPL compilers exist, but this could be one potential avenue. It would result in pretty massive code size expansion, so one would probably have SAMS memory installed.

 

Out of these, #1 is clearly the easiest choice for me to get going. By implementing a couple of extra instructions and additional features I would save machine code space in the 8K ROM, making it possible to retain existing entry points while modifying the ROM.

Edited by speccery

Note that there's no compatibility in its widest meaning if you modify the console ROM. You never know where somebody has used some ROM data or code, expecting it to be in a certain place.

That's a very bad habit (we've had exactly that discussion regarding DSR drivers here before), but it happens. As that discussion proves.


5 hours ago, apersson850 said:

Note that there's no compatibility in its widest meaning if you modify the console ROM. You never know where somebody has used some ROM data or code, expecting it to be in a certain place.

That's a very bad habit (we've had exactly that discussion regarding DSR drivers here before), but it happens. As that discussion proves.

Yes that's a good point. With multiple ROM pages (I suppose this would initially apply to FPGA system) the original ROM could always be provided as one option. I am interested in understanding how the user experience would change if the GPL was optimised, ultimately what the performance would look like if GPL was machine code of a CPU running in the FPGA. For GPL content, the test cases would be TI Basic and Extended Basic, also RXB Basic. Doing GPL optimisation would not make any difference to a lot of recent programs, which to my understanding have mostly been developed in assembly.


3 hours ago, speccery said:

Yes that's a good point. With multiple ROM pages (I suppose this would initially apply to FPGA system) the original ROM could always be provided as one option. I am interested in understanding how the user experience would change if the GPL was optimised, ultimately what the performance would look like if GPL was machine code of a CPU running in the FPGA. For GPL content, the test cases would be TI Basic and Extended Basic, also RXB Basic. Doing GPL optimisation would not make any difference to a lot of recent programs, which to my understanding have mostly been developed in assembly.

I think a paging routine like you see in a Supercart or XB could work, where you swap out upper 4K pages within the 8K of ROM 0.

So ROM 0 would have one fixed lower 4K page and multiple switchable upper 4K pages, all within the ROM 0 8K space.

Of course a rewrite of ROM 0 would need to be done, but if you used the stack to remember the address you came from before the page switch, then much like the Supercart or XB you could switch pages in ROM 0.

 

Of course another can of worms is the programs that expect no changes to ROM 0, but that could be handled the way SAMS does it: you use a DSR to change pages of ROM 0, so like SAMS you need someplace to control that switch.

In RXB I store the pages in variables in the 24K XB program RAM, or in string space as string variables in VDP.

Having multiple pages of ROM 0 would allow a larger OS for the TI-99/4A, but it would be a bit of work.
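The stack-based page switching described above might look roughly like this toy Python model. It is only a sketch under assumptions: `PagedRom` and `call_in_page` are invented names, and it remembers the caller's page number on a stack rather than modelling real return addresses or the hardware that would select the page.

```python
class PagedRom:
    """Toy model: a fixed lower 4K plus one of several switchable upper 4K pages."""
    def __init__(self, page_count: int):
        self.page_count = page_count
        self.current = 0          # upper page that is currently mapped in
        self.return_stack = []    # pages to restore when calls return

    def call_in_page(self, page: int, routine):
        """Switch to `page`, run `routine`, then restore the caller's page."""
        if not 0 <= page < self.page_count:
            raise ValueError("no such page")
        self.return_stack.append(self.current)
        self.current = page
        try:
            return routine(self)
        finally:
            self.current = self.return_stack.pop()
```

Because the caller's page is pushed before every switch, calls can nest across pages and each return lands back in the page it came from, which is the property the fixed lower 4K trampoline code would have to guarantee.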

 

