
TI-99/4A with a Pipistrello FPGA board


speccery


Speaking of memory, I set up RXB so the SAMS support leaves 32K free for future use. (Currently not used.)

I may release an RXB 2016 with new routines to access that extra 32K, for the possible function of two 32K expansions: you could swap out the upper 24K holding one XB program for another 24K loaded with a different XB program.

 

There is a problem with this idea in that some variables in the 16K VDP memory would conflict unless both XB programs used the same string and numeric variable names.

Keep in mind that the XB ROMs search for those names, so location would not matter within the 16K of VDP RAM.

 

This could also be used for programs that work with the F18A, as having 24K more XB program space added to the current 24K would be pretty cool!

 

By the way, that would also leave 8K still unused in the SAMS with RXB.


That method might work really well in those instances where you would normally use a pair of chained programs, Rich. In those instances, the variables would have to be the same anyway, just for the sake of program continuity. Would the context switch clear the variable values or would you just be able to take up where the first program left off?



This is interesting to know. I don't know how BASIC stores variables: what gets stored in VDP RAM and what in the memory expansion? Based on what you wrote, it would seem the variable names are stored in VDP RAM, but the contents are in the memory expansion. Is there documentation about this topic somewhere?

 

Not that I would have any time to look at this; I am making progress with my TMS99105 retrochallenge project and that is a lot of fun. I just got the processor to execute some instructions for the first time, so the CPU seems to be alive :) there is hope the project might turn into something. I already thought I had blown something up, but it probably was just a short of the power leads; I had accidentally left the current limit at 1.23A, which was way too high...

 

I posted a bad-quality video at https://hackaday.io/project/15430-rc201699-ti-994a-clone-using-tms99105-cpu



It would have no effect on string variables at all, so it would take up where it left off, and my GPL routine in XB would handle the switch and LINE NUMBER.

 

EXAMPLE:

CALL SWITCH24k(1,23000) ! 0 is the first 24K bank and 1 is the second 24K bank, 23000 would be the line number.

 

Thus most of the subroutines that use SUB would all be in the second 24K bank and the main program would be in the first 24K bank.

Numeric variables are in the upper 24K, but when you initialize the memory after load you could duplicate them in both banks and be fine.

 

Not to mention that with RXB you would still have 960K of Lower 8K Assembly support.
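As a thought experiment, the scheme above can be sketched in Python (hypothetical: `TwoBankXB` and `switch24k` are just names modeled on the proposed CALL, not anything that exists): two 24K program banks share one variable table, and the switch simply resumes at a line number in the other bank without clearing variables.

```python
# Hypothetical model of the proposed two-bank scheme: two 24K program
# banks, a shared variable table (as if kept in VDP RAM), and a switch
# that resumes at a line number in the other bank.

class TwoBankXB:
    def __init__(self):
        self.banks = {0: {}, 1: {}}   # line number -> statement, per bank
        self.active = 0               # which 24K bank is mapped in
        self.variables = {}           # shared between banks

    def store_line(self, bank, line, stmt):
        self.banks[bank][line] = stmt

    def switch24k(self, bank, line):
        """Like the proposed CALL SWITCH24k(bank, line): map the other
        bank and continue at the given line; variables are untouched."""
        self.active = bank
        return self.banks[bank][line]

vm = TwoBankXB()
vm.store_line(0, 1810, 'REM MAIN PROGRAM')
vm.store_line(1, 23000, 'REM SUBPROGRAMS LIVE HERE')
vm.variables["A"] = 42                # set before the switch...
stmt = vm.switch24k(1, 23000)
print(stmt, vm.variables["A"])        # -> REM SUBPROGRAMS LIVE HERE 42
```

The key property this models is the one discussed in the thread: switching banks changes which program lines are visible but leaves the variable table alone.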

Edited by RXB


Quick question for clarification... So when you call a switch 1 on line 23000 (yeah, right), everything from line 23000 up goes in bank 1; then if you had a CALL SWITCH 2,33000, everything from line 33000 on gets stored in bank 2. So these switch calls need to be called early, and I'm guessing they themselves go into bank 1. Just checking myself to see if I'm following this correctly.



Well, you got it backwards, and I could simplify it with EXAMPLE: CALL 24KBANK2(23000) ! This would switch to the second bank of 24K and start at line 23000 in the 2nd bank.

It is not moving anything; it just jumps to the line number in the other 24K bank.

 

And to get back where you came from, EXAMPLE: CALL 24KBANK1(1810) ! This would switch to the first bank of 24K and start at line 1810 in the 1st bank.

(Of course it would be a disaster to use the same line numbers in the two banks of 24K; needless to say, GOSUB or SUB or SUBEXIT would create major issues.)

 

A cool postulated result would be an XB program no slower than a normal XB program but twice the size.


 

Not to mention that with RXB you would still have 960K of Lower 8K Assembly support.

 

 

Could you elaborate on this a little? I am curious to understand what happens in the lower 8K. With TI XB it seems that the BASIC puts some code and/or data into that area. Does RXB behave in a similar way? I guess I am asking what CALL INIT actually does...


 

 


 

In RXB the Lower 8K works exactly like it normally does in XB, but you have some added routines that RXB provides for accessing it. (The Lower 8K is Assembly support.)

In answer to your question, RXB has CALL INIT and there is no difference; but with SAMS support, RXB has CALL AMSINIT, which does the same thing for the SAMS that CALL INIT does for the lower 8K.

(But now you have 120 banks of lower 8K.)

 

Now, what I was talking about with the SAMS memory and RXB is that you have access to 960K of lower 8K banks you can switch out.

So instead of the 8K of lower 8K Assembly support in normal XB, in RXB you can swap out lower 8K banks for 960K of Assembly support.
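For what it's worth, the arithmetic behind the figures quoted here checks out (my reading of the post: the SAMS pages 8K banks through the lower-8K window):

```python
# 120 banks of 8K paged through the lower-8K window gives the quoted 960K.
bank_size_k = 8
num_banks = 120                   # "120 banks of lower 8K", per the post
total_k = bank_size_k * num_banks
print(total_k)                    # -> 960
```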

 

Now, the upper 24K bank switch is for swapping out the upper 24K where XB programs reside; you have two of them and a way to swap them out, for double the memory for XB programs.

Cool huh?

Edited by RXB

That's cool indeed - thanks! :)

 

So let me ask a more specific and probably simpler question then. When you say the lower 8K works like it normally does, what exactly goes on in there? I guess most (all?) of the 8K is not used by BASIC and is thus available for machine language programs, but does XB or RXB use some of that memory, and if they do, what part and for which purpose?



The Lower 8K in standard XB is for Assembly support, and CALL INIT loads the Assembly support routines so that CALL LOAD("DSK#.DF80FILE") can load into that Lower 8K.

Without CALL INIT your Assembly DF80 files would have to include all the support routines, and the other problem is that CALL LINK("PROGRM") would also not work, as you did not initialize with CALL INIT.

 

Now, RXB has a ton of extra features for Assembly support and more advanced use of Assembly. Way too many features to list here.

 

You can see these in the RXB 2015E manual, in the RXB DOCUMENTS file here.

(Included are PDF and Text versions)

Rich_Gilbertson.zip


Currently that's it. There is no speed-up for anything yet. A speedup could be realized by also implementing the console GROMs in the FPGA and then removing the original GROM chips from the console. The internal GROMs determine the speed of GROM accesses by delaying the CPU, making them horribly slow. The FPGA GROMs respond immediately; there is no need for any wait states.

Actually, this test has been done and there is no apparent speedup. The real problem is how poorly the GPL code is written, so removing the GROM wait does not give any performance benefit.


The Lower 8K in standard XB is for Assembly support, and CALL INIT loads the Assembly support routines so that CALL LOAD("DSK#.DF80FILE") can load into that Lower 8K.

Without CALL INIT your Assembly DF80 files would have to include all the support routines, and the other problem is that CALL LINK("PROGRM") would also not work, as you did not initialize with CALL INIT.

 

Now, RXB has a ton of extra features for Assembly support and more advanced use of Assembly. Way too many features to list here.

 

You can see these in the RXB 2015E manual, in the RXB DOCUMENTS file here.

(Included are PDF and Text versions)

 

 

Thank you for the docs and the explanation!


Actually, this test has been done and there is no apparent speedup. The real problem is how poorly the GPL code is written, so removing the GROM wait does not give any performance benefit.

 

Thanks - good to know. This does make sense; the number of GROM reads versus the number of CPU instructions is probably very low. Still, I was kind of hoping there would have been a noticeable benefit. I'll just try to get the TMS99105 version running then; there, at least, there will be a noticeable performance difference.


 


 

Yeah, I'm the one who did that test, and there's quite a high ratio of CPU work to actual GROM access. Even for the MOVE instruction, which you might expect to lean more heavily on reading data, there is a large amount of CPU effort between each byte. You can actually see it if you look at the TI Intern disassembly; just count the number of CPU instructions needed to interpret the simplest GPL instruction. My test GROM was roughly 4 times faster than the real thing and there was no visible difference in TI BASIC (used as a test program). I can't remember the numbers; I want to say it was somewhere around 50:1 (CPU:GROM). If that is correct, the best you could possibly see is only around 2%.
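That 50:1 estimate is effectively an Amdahl's-law bound. A quick sketch (Python; the 50:1 ratio is the post's rough recollection, not a measured figure):

```python
# Upper bound on speedup from faster GROM, given a CPU:GROM time ratio.
def max_speedup_pct(cpu_share, grom_share, grom_speedup):
    """Percent speedup if only the GROM share gets grom_speedup x faster."""
    total = cpu_share + grom_share
    new_total = cpu_share + grom_share / grom_speedup
    return (total / new_total - 1.0) * 100.0

# Even an infinitely fast GROM only removes the 1-in-51 GROM share:
bound = round(max_speedup_pct(50, 1, 1e9), 1)
print(bound)   # -> 2.0
```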


One thing I /did/ consider was the concept of a GPL accelerator. I considered running a GPL interpreter on the AVR that I use, and having it only expose the GPL I/O instructions to the TI's interpreter (running all the other GPL code itself). In that way it should be able to accelerate GPL programs. But... that would only work for GPL code inside the AVR, so no help on cartridges, so I didn't take it any farther than thought experiments. ;)


One thing I /did/ consider was the concept of a GPL accelerator. I considered running a GPL interpreter on the AVR that I use, and having it only expose the GPL I/O instructions to the TI's interpreter (running all the other GPL code itself). In that way it should be able to accelerate GPL programs. But... that would only work for GPL code inside the AVR, so no help on cartridges, so I didn't take it any farther than thought experiments. ;)

Yeah, the problem with GPL speed is not a software issue but a hardware problem.

People complain about GPL speed, but there never was anything wrong with GPL other than the hardware restrictions it has to deal with.


Sorry Rich, the point of the above was that yes, it is software and not hardware. The GPL interpreter could be much faster than it is if we rewrote it with techniques we know today. It's very true that GROMs are much slower than RAM or ROM, 30 times or so, but they are not read often enough in the average GPL program to contribute much to the performance.

 

As an example, sitting at the TI BASIC prompt just blinking the cursor, the GPL interpreter runs an overall average of 832 CPU cycles per GPL instruction. (Measured in Classic99 at >0070; the actual range is 284 to 2423 cycles for the code in that loop, averaged over 26000 cycles.)

 

I have measured actual GROM performance as averaging 4 GROM clocks, about 8us per operation (versus ~300ns for a CPU clock, so about 26 times slower).

 

832 cycles is roughly 277 microseconds. If we assume the average GPL instruction in this loop takes 2 bytes from GROM (many will actually be only 1), and we assume each needs an address set, which is the longest GROM operation, then a full cycle takes another 48us (2 address bytes and 1 data byte for each byte fetched) just for talking to the GROMs, so that's a 325us average loop. Of that 325us, 14% is GROM hardware access... which means in theory a full-speed GROM would make the TI BASIC wait loop 14% faster. That's not the huge difference that we've always assumed would be the case.

 

(Edited: forgot to include the address cycles. Okay... 14% is not a bad boost. But the CPU code could still do what it does in a lot fewer instructions, and we could set the address less often than we do :) )
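The wait-loop arithmetic above can be re-run directly (Python; the 333 ns cycle time is implied by 832 cycles being roughly 277 us, i.e. the 3 MHz TMS9900 clock):

```python
# Reproducing the GROM-share estimate with the figures from the post.
cpu_cycle_ns = 333                  # 3 MHz TMS9900 (the post rounds to 300 ns)
cycles_per_gpl_op = 832             # measured average in the wait loop
cpu_us = cycles_per_gpl_op * cpu_cycle_ns / 1000       # ~277 us

grom_op_us = 8                      # measured average per GROM operation
grom_ops = 2 * 3                    # 2 bytes, each 2 address writes + 1 read
grom_us = grom_ops * grom_op_us                        # 48 us

grom_share = round(100 * grom_us / (cpu_us + grom_us), 1)
print(grom_share)                   # -> 14.8, i.e. the ~14% quoted above
```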

Edited by Tursi

 

Yeah, I'm the one who did that test, and there's quite a high ratio of CPU work to actual GROM access. Even for the MOVE instruction, which you might expect to lean more heavily on reading data, there is a large amount of CPU effort between each byte. You can actually see it if you look at the TI Intern disassembly; just count the number of CPU instructions needed to interpret the simplest GPL instruction. My test GROM was roughly 4 times faster than the real thing and there was no visible difference in TI BASIC (used as a test program). I can't remember the numbers; I want to say it was somewhere around 50:1 (CPU:GROM). If that is correct, the best you could possibly see is only around 2%.

 

 

LOL well that's not worth a lot of effort! Thank you for sharing the findings.


One thing I /did/ consider was the concept of a GPL accelerator. I considered running a GPL interpreter on the AVR that I use, and having it only expose the GPL I/O instructions to the TI's interpreter (running all the other GPL code itself). In that way it should be able to accelerate GPL programs. But... that would only work for GPL code inside the AVR, so no help on cartridges, so I didn't take it any farther than thought experiments. ;)

 

Here is a kind of dream roadmap: I have my TMS99105 retrochallenge project, which I want to complete in due time, one month. If I am even a little bit successful with that project, I will have a system which implements (perhaps only partially, in time) much of the console in an FPGA. Crucially, all of the GROM accesses would pass through the FPGA. So in this system it becomes possible to build a GPL accelerator in FPGA hardware. I have not looked at the GPL opcodes much, so I don't know how feasible that would be, but I think at least it would be possible to build a custom processor that uses GPL as machine code for some of the GPL opcodes. The ones that are difficult to implement in hardware could be escaped out to some sort of microcode or back to TMS9900 instructions.

 

Internally the FPGA logic runs at 100MHz, on both of my FPGA projects actually. Your other post was very interesting, since I did not know the average rate of GPL execution for an example workload. Now I know that 277 us/op is a real number. The little I know about GPL suggests the operations are probably all going to be multi-cycle, with memory fetches and writes required. The FPGA board I am using has external static RAM with a 10ns access time, so in the best case, factoring in some delays, the FPGA could access the RAM every 20ns (at 32-bit width). Thus, theoretically, if the FPGA was executing GPL it could execute fairly complex instructions in one microsecond or less.
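Plugging those numbers into a back-of-the-envelope comparison (Python; the 10-accesses-per-op figure is my assumption for a "fairly complex" GPL instruction, not from the thread):

```python
# Rough ceiling for an FPGA GPL core vs. the measured TI figure.
sram_access_ns = 20        # effective SRAM access time quoted above
accesses_per_op = 10       # assumed: fetch + operands for a complex GPL op
fpga_op_us = sram_access_ns * accesses_per_op / 1000   # 0.2 us per op

ti_op_us = 277             # measured average from the earlier post
ratio = round(ti_op_us / fpga_op_us)
print(ratio)               # -> 1385, in this admittedly optimistic model
```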



One of the long-standing legends of the TI was that at one point there was going to be a GPL processor, and that's why GPL exists. Because of that, I looked at the GPL instruction set with a thought to whether it was reasonable to expect a CPU to execute the opcodes in hardware... and I'm doubtful. (Reasonably, that is.) The instructions are quite variable in length (1-6 bytes) and the encoding is sometimes a little inconsistent. Just my view on that one. ;)



 

 

Yeah, I think I heard about that too. Can't remember where though.

I haven't looked at the opcodes yet; that'll have to wait for a while. With a microcoded design one could implement quite weird systems. Failing that, I've been playing around with NXP's ARM microcontroller development boards quite a bit; I built one board with the LPC1343, and I am looking forward to building a board with the LPC4330. The chips are already on my desk back home, but this one will also have to wait a while. The LPC4330 runs at 204MHz, so it would run GPL pretty fast...


Just wondering: if GROM is 30 times slower than ROM, how does Parsec manage to display the scrolling graphics, which are stored in GROM AFAIK, at a reasonable speed?

 

 

I don't know how Parsec specifically works, and I know there are more experienced TI programmers here than I am, but in short, whatever you see on screen must be in VDP RAM. Thus even if the graphics are originally stored in GROM, they need to be moved over to VDP RAM for display. I don't know how Parsec implements scrolling, but the fastest way to run short machine code routines on the TI is to copy them from ROM (or GROM) to the internal scratchpad RAM and run them from there. That way the program is in the fastest memory on the device, accessing data in VDP RAM. During scrolling there is then no need to touch the GROMs.


From the Parsec source code it looks like it reads from GROM into a 64(?) byte buffer in scratchpad RAM, where the graphics are shifted and then sent on to the VDP. I guess the answer to my question is the same as for GPL: all these instructions take so long that the read from GROM is relatively insignificant in comparison.

Parsec_Source_Code.pdf
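That pipeline can be sketched abstractly (Python; a hypothetical model of the buffer-then-shift approach described above, not the actual Parsec code):

```python
# Hypothetical sketch of the pipeline: fetch a small pattern buffer from
# GROM once, then do the per-frame shifting in fast scratchpad RAM and
# write only the shifted patterns to VDP RAM.
def shift_row_left(row_bits, n):
    """Shift one 8-pixel character-pattern row left by n pixels."""
    return (row_bits << n) & 0xFF

grom_buffer = [0b00111100] * 8        # one-time read from (slow) GROM
for frame_shift in range(4):          # per-frame work never touches GROM
    vdp_patterns = [shift_row_left(r, frame_shift) for r in grom_buffer]
print(format(vdp_patterns[0], "08b")) # -> 11100000 after the 3-pixel shift
```

The point being made in the post holds in this model too: the GROM read happens once per buffer, while the shifting and VDP writes dominate the per-frame work.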

