
Icy99 FPGA 99/4A


speccery


Just for info to the interested: after the first production run of the ULX3S board, almost 1000 pieces, sold out in days, there is now a second production batch available on Mouser:
https://eu.mouser.com/Search/Refine?Keyword=ulx3s
If interested in this stuff, get one whilst supplies last. There is now the Icy99 implementation of the 99/4A + extensions, and there is also a TI99/2 implementation, and an implementation of the Mini-Cortex. Maybe an implementation of the TI99/8 will emerge over time.

ULX3S development is discussed at:

https://gitter.im/ulx3s/Lobby

The complete open source Verilog tool chain can be downloaded from
https://github.com/open-tool-forge/fpga-toolchain

It is a big install (~700MB installed), but still much, much better than the multi-gigabyte installs that the vendor tool chains require.

Edited by pnr
added tool chain link

Thanks @pnr for pointing that out to everyone here, in case someone is interested.

In addition to the computers you listed, there is also the Tomy Tutor, which I've been thinking of supporting. It's a simple computer, and with all the parts of the TI-99/4A available it should be a simple thing to do. TMS9995 instructions would probably be needed. Could you provide links to the TI-99/2 and Mini-Cortex?

 

Yesterday (evening my time) I participated in the pandemic TI-99ers Zoom call. That was fun, and it was nice to be able to associate some more faces with names :)

 

I showed icy99 a little during that call, completely unprepared as I had forgotten the call, but thanks to @kl99 I got a reminder and joined in. I asked for recommendations of software using SAMS extended memory, and Dungeons of Asgaard was mentioned. I tried it during the Zoom call. Unfortunately it didn't work. It could be a problem with the memory extension, or it could be something else. Icy99 is starting to be in a reasonably good state, but of course it has only been tested by me (well, and some people in the ULX3S community, but I doubt many of them have lots of TI-99/4A software to test).

 

@jedimatt42 thanks for the suggestion of using your memory test software to test further the icy99 SAMS implementation. I've been running it a while and all the tests pass.

Edited by speccery

And I just tried Dungeons of Asgaard again, and it works. I did not make any changes. The only difference from the tests I ran yesterday was that TIPI was not initialized (i.e. the TIPI DSR ROM was not loaded). There could be a bug in the DSR area decode logic. I need to run the memory tests again after installing TIPI and using it a little.

 

EDIT: After some more testing, both Dungeons of Asgaard and "In the dark" (IDT, disk based game for RXB, uses both disk and SAMS at the same time) work just fine.

I think the issues we saw in the Zoom call were simply that the FPGA was not reset properly beforehand. Resetting disables SAMS and sets the cartridge ROM paging register to zero (among a ton of other things). From a clean reset both DoA and IDT work well.

 


Edited by speccery

2 hours ago, speccery said:

Could you provide links to the TI-99/2 and Mini-Cortex?

The TI99/2 is here:

https://gitlab.com/pnru/ti99/-/tree/master/ti99_2

 

The Mini-Cortex is here:

https://gitlab.com/pnru/cortex

I've focused on the Unix side of it. Its main claim to fame is that it hosts a 99-native C compiler and tool chain, and hence it can re-compile itself. I have a native TCP/IP stack working, but the experience is not smooth yet. It uses the ESP32 as an ISP, and connects to it using a PPP serial line.
 

2 hours ago, speccery said:

In addition to the computers you listed, there is also the Tomy Tutor, which I've been thinking of supporting. It's a simple computer, and with all the parts of the TI-99/4A available it should be a simple thing to do. TMS9995 instructions would probably be needed.

 

Yes, I've been thinking about that as well. The CPU in the Mini-Cortex is my best approximation of the 9995 yet, implementing the extra 4 instructions. It is almost cycle accurate and the bus interface is that of the 99105. It also has code to emulate the 9995's interrupt lines, for the internal timer and CRU bits, etc. Tongue-in-cheek, I'm calling it the 99095.

 

What I had in the back of my mind was to do a version of the 9918 that mimicked the data paths of the real vintage silicon. I think it should fit in some 500-700 lines of Verilog and would of course have the same limits (4 sprites on a line, no 80 column text mode, etc.). Never got around to doing that code. Together with the 99095 it would allow for a very compact implementation of the Tomy Tutor. Probably using your 9918 is a quicker route to success.

 


Thanks @pnr for the links!

 

My TMS9918 currently takes around 1100 lines of Verilog. The line count is somewhat impacted by the support for ICE40HX FPGA chips. They don't have true dual port block memories and there is not enough on-chip memory for 16K VRAM and a line buffer. The former leads to the need to use two memory blocks in parallel for rendering and the latter requires an interface through which data from external RAM can be used. The use of external RAM complicates things quite a bit, since I also built in some pipelining to be able to use the external memory bus efficiently. I also added the ability to read registers back, which was useful for debugging. And finally my TMS9918 core has some unnecessary states, which could be removed by modifying the sprite drawing logic a bit. All of these add quite a bit of code, probably more than what your target line count would be. Even if you built a version which is close to real silicon, you would need to build a scan-doubler to support modern monitors.

 

Anyway, this is just a long way of saying that if you want to use my TMS9918, let me know and I will tidy up the code to make it easier to read. I am planning to clean up the code when I have a moment, and also to make a version without the external memory support to simplify the core.

 

The 99095 is a very cool name! And a cool core as well!

 

With regards to the TI-99/2, it would be interesting to port the Basic from it to the 99/4A. That would be quite a bit faster, as it is not using GPL to my understanding.


1 hour ago, speccery said:

probably more than what your target line count would be

It is not about the line count so much, it is about maximum simplicity. When using internal ram (the smallest version of the ULX3S has 112KB internal ram/rom capacity), doing a 9918 that just supports basic 256x192 VGA DVI output is very simple indeed, hardly more complex than the video circuit in the 99/2. The complexity is in the sprites, which are done with comparators/counters in 9918 silicon, 4 blocks of that. I'm thinking of duplicating that design in the FPGA, hopefully it leads to very simple & readable code. However, writing that takes time, which I currently don't have.
 

1 hour ago, speccery said:

They don't have true dual port block memories

Just the other day I learned that Yosys currently cannot infer true two-port anyway. It is limited to one R/W port and one R port -- this bit of the Yosys code is currently being rewritten, so hopefully this limitation will be gone soon. For true 2-port one currently has to use a library block (Emard has that in his repo).
 

1 hour ago, speccery said:

With regards to the TI-99/2, it would be interesting to port the Basic from it to the 99/4A. That would be quite a bit faster, as it is not using GPL to my understanding.

Yes, it does not use GPL, and it does not need to as the RAM is connected to the CPU. When debugging the TI99/2 I disassembled some parts of the 32KB ROM and it has a table driven parser that compiles into a token byte code ("IF", "NEXT", etc.). This token byte code is then interpreted by calling a subroutine for each token. I did not manage to fully understand the parser, but I think it is a bottom-up parser with separate left and right priorities for each token - I did not get to the bottom of it.
 

1 hour ago, speccery said:

Anyway, this is just a long way of saying that if you want to use my TMS9918, let me know and I will tidy up the code to make it easier to read. I am planning to clean up the code when I have a moment, and also to make a version without the external memory support to simplify the core.

At another time, yes please. At the moment work projects are keeping me away from hobby stuff and I'd like to complete three other hobby projects first:
- A 4-way write-back cache, to make sdram access fast. I have that working for Oberon, but I'm not happy with it yet.
- True HDMI video (as opposed to DVI). This means implementing data islands and sound encoding.
- Clean up TCP/IP for the Cortex
So, we're talking mid-2021 at the earliest, maybe 2022.

Maybe it is a cool project for a Tomy Tutor enthusiast...

 

Edited by pnr

On 11/29/2020 at 11:44 PM, pnr said:

It is not about the line count so much, it is about maximum simplicity. When using internal ram (the smallest version of the ULX3S has 112KB internal ram/rom capacity), doing a 9918 that just supports basic 256x192 VGA DVI output is very simple indeed, hardly more complex than the video circuit in the 99/2. The complexity is in the sprites, which are done with comparators/counters in 9918 silicon, 4 blocks of that. I'm thinking of duplicating that design in the FPGA, hopefully it leads to very simple & readable code. However, writing that takes time, which I currently don't have.

Yes, time is the problem, isn't it. Some things, such as detecting sprite collisions, become very easy with the original approach; just a simple four input logic gate will do the trick. 112KB is quite a lot of memory already. I haven't optimised internal memory use too much yet; currently the icy99 uses around 96KB but that can be reduced easily. For instance, the system ROM and GROMs are stored on-chip, although that is not necessary.

 

I am planning to increase VRAM to 64K. That would be a sizeable increase and enable some further extensions (along the line of 9938/9958/F18A). Having said that I am enjoying the simple 80 column mode a lot.

 

On 11/29/2020 at 11:44 PM, pnr said:

At another time, yes please. At the moment work projects are keeping me away from hobby stuff and I'd like to complete three other hobby projects first:
- A 4-way write-back cache, to make sdram access fast. I have that working for Oberon, but I'm not happy with it yet.
- True HDMI video (as opposed to DVI). This means implementing data islands and sound encoding.
- Clean up TCP/IP for the Cortex
So, we're talking mid-2021 at the earliest, maybe 2022.

These are very cool projects you have lined up, good luck with them !

Edited by speccery

Continued to work on the icy99 some more, with the goal to release some FPGA memories I was using for low bandwidth stuff like GROMs. That also meant some changes to the boot up method - now the ESP32 controller on the ULX3S first initialises the FPGA, and then writes to SDRAM the console ROM and GROM contents. That also had the side effect of slowing down the system somewhat, although I am still running at around 3.5x the original speed. I will probably bring back the cache as a configuration option, since that would dramatically increase the CPU speed. But right now I am not aiming for max speed.

 

Once I was done with that, I wondered how to improve the VDP a little, in light of the interesting software that's available. I decided that the first thing to do was to increase the amount of VRAM, and now it is 64K. On this specific FPGA board I could make it much larger, but I was thinking that 64K or 128K would already be good. I was wondering what would be the best way to make the memory available, and settled on the obvious approach of going along the lines of the V9938 VDP. That also means that I can support some of the existing features, such as reading the scanline position over standard interfaces. Anyway, the 9938 has quite a bit of functionality, and for now I was just looking for a way to support more than 16K of VDP memory. I added new VDP registers to support 17-bit (128K) VRAM addresses: 10, 11, 14, 15, 16, 17 and 19. Not all of them work yet. I also added status register 1 (7 to go, the 9938 has 9 of them).
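
To illustrate the 17-bit addressing idea, here is a minimal Python sketch of how a host program might split an address, assuming the V9938 convention where R14 carries the top three address bits and the remaining fourteen bits go through the usual two-byte address setup. The function name and the returned byte list are made up for this sketch; it is not taken from the icy99 sources.

```python
def vram_address_bytes(addr, write=False):
    """Return the bytes sent to the VDP control port to set a 17-bit address.

    Assumes the V9938 convention: R14 carries A16..A14, followed by the
    classic two-byte address setup for A13..A0.
    """
    if not 0 <= addr < (1 << 17):
        raise ValueError("address must fit in 17 bits")
    r14_value = (addr >> 14) & 0x07   # A16..A14
    low = addr & 0xFF                 # A7..A0
    high = (addr >> 8) & 0x3F         # A13..A8
    if write:
        high |= 0x40                  # bit 6 set marks a write setup
    # A register write is the data byte followed by 0x80 | register number.
    return [r14_value, 0x80 | 14, low, high]
```

With R14 left at zero the last two bytes are the same as a plain TMS9918 address setup, which is why software that never touches R14 stays within the first 16K.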

 

Anyway, as I was playing with this, I pretty much instantly ran into the problem that TI ROMs initialise some of the unused bits in the TMS9918 to ones, instead of zeros. The 9938 uses those unused bits, but is compatible with the TMS9918 if they're left at zero. I temporarily added a new register 63 which enables 9938 mode, and on reset the system boots in TMS9918 mode. The extra VDP memory is still available, but the address register extensions are not used for screen tables before the VDP is in 9938 mode. 

I expect there to be some incompatibilities with existing software when the extended registers are enabled. For example the megademo (Don't mess with Texas) has a phase, when going to scanline effects, where it writes the value one to R15, causing status register 1 to become active, and thus breaking a lot of stuff since the software thinks it's reading status register 0 when it is actually accessing status register 1. I fixed this by also making the extra status register only available in 9938 mode. I haven't done enough reading of the 9938 manuals; perhaps there is a standard way of making it look like a TMS9918.
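
A toy Python model of that gating, just to make the behaviour concrete. The class and attribute names are invented, and the assumption that any non-zero write to register 63 enables the extended mode is mine; the real enable value may differ.

```python
class StatusPort:
    """Toy model of status-register selection gated by an extended-mode flag."""

    def __init__(self):
        self.mode9938 = False      # off after reset, so the VDP looks like a TMS9918
        self.r15 = 0               # status register select
        self.status = [0] * 16     # illustrative register file

    def write_register(self, reg, value):
        if reg == 63:
            # Assumption: any non-zero value enables extended mode.
            self.mode9938 = value != 0
        elif reg == 15:
            self.r15 = value & 0x0F

    def read_status(self):
        # In TMS9918 mode R15 is ignored and status register 0 is returned.
        index = self.r15 if self.mode9938 else 0
        return self.status[index]
```

In this model a write of one to R15 before the mode is enabled has no visible effect on status reads, which is the behaviour the demo needs.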

 

One annoying problem I have been having lately is that sometimes the DVI output generated by a particular synthesis run is not displayed stably on my monitor. It is weird, since a typical synthesis report shows that I have quite large timing margins (the 125MHz DVI clock could run at 200MHz).


If you are thinking about enhancing the VDP, these are some things that would be interesting from my point of view:

 

  • 80 columns with 30 rows mode, as supported by the F18a
  • Sprite support in the 80 columns 24/30 rows modes, as in the F18a
  • 80 columns with 48 rows and 60 rows mode (doubling the existing F18a 24/30 rows mode)

The reason I'm proposing this is my WIP programming editor, Stevie. It uses a sprite for the cursor in 80/30 mode.

Would make it interesting to see how it behaves in 48 rows or 60 rows mode (well to be honest I probably could try it also by patching the js99er emulator I’m using for testing my development work).

 

Had a discussion about 48/60 rows mode implementation on the F18a MK1, but if I remember correctly it’d take quite some refactoring and VRAM is probably too limited on the MK1.


A bit more progress, I have now a strange mixture of features - development still on-going:

  • normal 40 column text mode
  • 40 column text mode with 26.5 lines (as in the 9938)
  • 40 column text mode with 30 lines (as in the F18A)
  • 80 column text mode with 24 lines
  • 80 column text mode with 26.5 lines (as in the 9938)
  • 80 column text mode with 30 lines (as in the F18A)

The 26.5 lines mode is enabled as in the 9938, i.e. setting bit 7 of R9.

The 30 lines mode is enabled as in the F18A, i.e. setting bit 6 of R49.
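
As a small sketch of how the two enable bits could map to a row count (the function name is made up, and the priority when both bits are set is an assumption, not something I have checked):

```python
def text_rows(r9, r49):
    """Return the number of text rows selected by the two mode bits.

    Bit 7 of R9 selects the 9938-style 212-line display (26.5 rows of 8 pixels).
    Bit 6 of R49 selects the F18A-style 30-row display.
    Assumption: if both bits are set, the 30-row mode wins.
    """
    if r49 & 0x40:
        return 30
    if r9 & 0x80:
        return 212 / 8   # 26.5 rows
    return 24
```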

 

As I wrote before, there is now 64K of VRAM. I need to test that it is actually there... :)

 

A question: I know that the F18A is locked on reset, and thus extended features are not there before they're enabled. But what about the 9938? I guess with it things are not locked. Like I wrote in the past, the TI-99/4A ROM initialises the unused bits in R0..R7 in a way which moves some of the screen tables beyond the 16K range with the "9938" in the icy99. I now work around this by having a new register 63, which must be written to in order to put the icy99 VDP in extended mode, but this is not compatible with any software using the F18A or 9938 in 80 column text modes. I suppose I could just modify the console ROM and/or GROM to be compatible with both the 9938 and the TMS9918, so that after a normal reset the screen pattern definitions or character areas do not move above the 16K range understood by TI Basic.
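
To show why stray upper bits matter, here is a hedged Python sketch for one register, the name table base in R2. The TMS9918 only looks at the low four bits (A13..A10), while the 9938 uses seven (A16..A10). The function name is made up, and the value 0xF0 in the usage note is only an illustrative input with the upper bits set, not a claim about what the console ROM writes.

```python
def name_table_base(r2, extended):
    """Compute the name table base address from VDP register 2."""
    # The 9938 decodes A16..A10, the TMS9918 only A13..A10.
    mask = 0x7F if extended else 0x0F
    return (r2 & mask) << 10
```

With an input of 0xF0 the TMS9918 interpretation gives address 0, while the extended interpretation lands at 0x1C000, far above the first 16K.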


I like the mention of 9938 and the 80-column, but how about the 9938 or 9958 graphics modes 6 and 7 together with the interlaced modes?

Then you can use YAPP and other software made for the TI-99 with 80-column cards. I know, wishes and dreams...

 

I like your result as it is and this will be a system I prefer over emulators on my PC or iMac; if I had this I might even use it more than my original system.


4 hours ago, Nick99 said:

I like the mention of 9938 and the 80-column, but how about the 9938 or 9958 graphics modes 6 and 7 together with the interlaced modes?

Then you can use YAPP and other software made for the TI-99 with 80-column cards. I know, wishes and dreams...

 

I like your result as it is and this will be a system I prefer over emulators on my PC or iMac; if I had this I might even use it more than my original system.

Thanks @Nick99 for these comments. To be honest with you, I haven't studied the 9938/9958 in much detail yet. I wanted to go above 16K of VRAM and I wanted to do that in a way which is compatible with something, otherwise it becomes just an obscurity.

But my plan is to bring in many more features. Since I am not familiar with the 80 column cards or TI-99/4A software available for them, could you provide me with some pointers please?


3 hours ago, speccery said:

Thanks @Nick99 for these comments. To be honest with you, I haven't studied the 9938/9958 in much detail yet. I wanted to go above 16K of VRAM and I wanted to do that in a way which is compatible with something, otherwise it becomes just an obscurity.

But my plan is to bring in many more features. Since I am not familiar with the 80 column cards or TI-99/4A software available for them, could you provide me with some pointers please?

YAPP and XHI come to mind; they can be found at http://ftp.whtech.com/Diskettes/

If I'm correct it should be a gif-viewer for the 80-column cards. Maybe someone who has an 80-column card has more software that uses the 9938/9958?

There were a couple of disks that came with the TIM card from OPA containing some demos or something, but I don't know where to find them.


8 hours ago, speccery said:

Thanks @Nick99 for these comments. To be honest with you, I haven't studied the 9938/9958 in much detail yet. I wanted to go above 16K of VRAM and I wanted to do that in a way which is compatible with something, otherwise it becomes just an obscurity.

But my plan is to bring in many more features. Since I am not familiar with the 80 column cards or TI-99/4A software available for them, could you provide me with some pointers please?

I'm not against 99x8 support, but I think it would be good to at least look at the F18A first. Most of the new software aiming for better-than-the-9918A graphics seems to be targeting that chip, and I've always found the existing 99x8 library a bit lacking, especially in the games department.


8 hours ago, TheMole said:

I'm not against 99x8 support, but I think it would be good to at least look at the F18A first. Most of the new software aiming for better-than-the-9918A graphics seems to be targeting that chip, and I've always found the existing 99x8 library a bit lacking, especially in the games department.

Yes, I really like the F18A. But since it has just over 16K of VRAM, while the 99x8 chips support 128K, on the VRAM extension part following the 99x8 model makes sense to me. The way Matthew built the F18A seems to have been in a way where the F18A extensions use registers which are not used by the 99x8, so at least partially it is possible to support both. I also have some ideas of my own that I want to try out. But one step at a time.


9 hours ago, speccery said:

Yes, I really like the F18A. But since it has just over 16K of VRAM, while the 99x8 chips support 128K, on the VRAM extension part following the 99x8 model makes sense to me. The way Matthew built the F18A seems to have been in a way where the F18A extensions use registers which are not used by the 99x8, so at least partially it is possible to support both. I also have some ideas of my own that I want to try out. But one step at a time.

True enough, but given that Matthew is working on the mkII which will also support more VRAM, it might be good to align with him if he's open to that. I for one look forward to whatever you guys come up with!


On 12/12/2020 at 6:32 PM, retroclouds said:

I've got myself an ULX3S. Hoping to play with it during the holidays. Main area of interest is the mini cortex and the Icy99. 

Would be great if I could get Stevie running on there. 

 

Which version of the ULX3S have you got? I haven't yet written instructions anywhere on how to get the icy99 running on an ULX3S. Just let me know if you need some help with that.

Super briefly, if you want to replicate my environment you need a micro SD card containing a couple of Python files, and a directory structure containing ROM & GROM images as well as the bitstream file for the FPGA. You also need to set up the ESP32 so that you can communicate with it, i.e. provide the WiFi SSID and password, after which you can use the MicroPython WebREPL to communicate with it from a browser, something like: http://micropython.org/webrepl/#192.168.0.123:8266/

Here you replace 192.168.0.123 with the IP address of the ESP32. Once connected to the ESP32, you issue two commands:

 

import osd
osd.run.load_roms()

After these steps the system is up and running. The "import osd" loads the Python script to get everything going. It initialises the FPGA and loads the ROMs, but currently there is a problem in the system and the ROMs need to be loaded again to get going.

 

You also need a keyboard which supports the PS/2 protocol. The keyboard is connected to the USB port using a USB On-The-Go (OTG) adapter. I am using the Steelseries 6GV2 gaming keyboard. It has a USB cable, but it also operates in PS/2 mode through the USB cable, which is a requirement since there is no USB host functionality.

Edited by speccery

  • 2 weeks later...

Now during the holidays I had some time to work on the icy99. I've been toying with the idea of adding GPL acceleration capabilities to the CPU core. Since it was rumoured that there had been an idea of a CPU actually running GPL as its machine code, I thought perhaps I could work on that a bit.

I've been looking at an incremental approach, of adding some new instructions which would accelerate key operations of GPL interpretation. 

So I added two new instructions, one of them I am calling GPLS and the other MOVU.

 

GPLS is a GPL operand address decoding instruction. It handles in one instruction most of the decoding job done by the routine in ROM at addresses >077A to >082C. Due to the incremental approach, I've arranged the instruction so that it will branch to ROM routines to handle the parts it does not yet do. Still, it does the work of around 25 instructions in one go. It is much faster than the equivalent TMS9900 code since there are no instruction fetch cycles, and all temporary values are held in the CPU's temporary internal registers. It only accesses memory to perform the necessary memory cycles. As one example, the instruction reads the value of R13 only once at the beginning to fetch the base address of the GROM currently in use, and then it fetches the operand address bytes from GROM as necessary, without re-reading R13. This gave me an opportunity to understand how GPL instruction operands are constructed. The instruction does this:

  • In all cases the GPLS instruction writes to R1 the address of the operand and to R0 the value. The value is either a byte or word, depending on bit 8 of R5 (I use the normal bit numbering, not TI numbering, so bit 8 is the LSB of the higher byte of R5). Icy99 CPU caches all writes to bit 8 of R5 in GPL workspace, so it always knows if this is a word or byte operation.
  • Fetches the first operand byte from GROM.
  • If the high bit of the operand is zero, this is a direct short operand, and the remaining 7 bits are an offset to the scratchpad area. In this case GPLS fetches a byte or word from scratchpad. It supports unaligned memory accesses, so it properly fetches 16 bit words from odd addresses.
  • If the high bit is one, the operand is longer and more bytes from GROM need to be fetched. GPL supports 7 bit addresses in scratchpad, 12 bit addresses starting from address >8300, and 16-bit addresses, also offset from >8300 which is a bit weird.
  • GPLS instruction also understands the use of index option, and will read a 16-bit value from an unaligned address in scratchpad to get the value of the base address, and the direct address is added to this. 
  • There are currently three cases the instruction does not yet handle, where it branches to ROM instead: a) the address >837D is read from (this is a magical address, causing a write to VDP memory using X and Y coordinates), b) the memory read is from VDP memory, or c) indirection is used. In all three cases most of the preprocessing is still done, though.
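
The steps above can be sketched in Python roughly as follows. This is only a software model under stated assumptions, not the icy99 state machine: the bit layout of the first byte of a long operand (1 X V I followed by the top address nibble, with X = indexed, V = VDP, I = indirect, and a nibble of all ones meaning that a 16-bit address follows) is the commonly documented GPL format as I understand it, and all function and parameter names are made up for the example.

```python
def read_word_unaligned(mem, addr):
    """Big-endian 16-bit read that also works at odd addresses."""
    return (mem[addr & 0xFFFF] << 8) | mem[(addr + 1) & 0xFFFF]


def decode_operand(grom, pos, cpu_mem, word):
    """Decode one GPL operand starting at grom[pos].

    Returns (address, value, needs_rom, next_pos). needs_rom is True for the
    cases handed back to the ROM routine (VDP, indirection, address >837D);
    value is None in those cases and address is only partially processed.
    """
    first = grom[pos]
    pos += 1
    if first & 0x80 == 0:
        # Short form: 7-bit offset into the scratchpad at >8300.
        addr, is_vdp, is_indirect = 0x8300 + first, False, False
    else:
        indexed = bool(first & 0x40)       # assumed X bit
        is_vdp = bool(first & 0x20)        # assumed V bit
        is_indirect = bool(first & 0x10)   # assumed I bit
        nibble = first & 0x0F
        if nibble == 0x0F:
            # Extended form: a full 16-bit address follows.
            addr = (grom[pos] << 8) | grom[pos + 1]
            pos += 2
        else:
            # 12-bit form: the nibble plus one more byte.
            addr = (nibble << 8) | grom[pos]
            pos += 1
        if indexed:
            # The index byte names a scratchpad location holding a 16-bit value.
            addr += read_word_unaligned(cpu_mem, 0x8300 + grom[pos])
            pos += 1
        if not is_vdp:
            addr += 0x8300                 # CPU addresses are offset from >8300
        addr &= 0xFFFF
    if is_vdp or is_indirect or addr == 0x837D:
        return addr, None, True, pos
    value = read_word_unaligned(cpu_mem, addr) if word else cpu_mem[addr]
    return addr, value, False, pos
```

The returned address and value correspond to what GPLS leaves in R1 and R0, and the word flag stands in for the cached bit 8 of R5.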

I also added the MOVU instruction, which has this format: MOVU *Rx,R0 where x can be from 0 to 7. MOVU performs an unaligned byte or word read operation (depending on bit 8 of R5) from the address pointed to Rx to R0. If the operation is for a byte, it will sign extend it to R0. Thus the read byte will end up in the lower byte of R0, unlike with MOVB which always deals with the high byte. 
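
A minimal Python model of what MOVU does with the data, as described above. The function name and the boolean word flag (standing in for bit 8 of R5) are illustrative only.

```python
def movu(mem, addr, word):
    """Model of an unaligned load into a 16-bit register.

    mem is a bytes-like object, addr may be odd, and word selects a
    16-bit load (True) or a sign-extended 8-bit load (False).
    """
    if word:
        # Big-endian word assembled from two consecutive bytes.
        return (mem[addr & 0xFFFF] << 8) | mem[(addr + 1) & 0xFFFF]
    value = mem[addr & 0xFFFF]
    # Sign-extend the byte so it ends up in the low byte of the register.
    return (value | 0xFF00) if (value & 0x80) else value
```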

 

Even if these are small changes to only one routine, this simple Basic program runs 10% faster:

10 FOR I=0 TO 1000
20 PRINT I;" ";
30 NEXT I

I needed to add 18 new states to the state machine of my TMS9900 core to support these new instructions. That is a big addition, since the entire TMS9900 instruction set occupies 96 states in my implementation. Doing these changes was very time consuming; if I want to continue on this somewhat obscure avenue of speeding up GPL I probably need to move to a microcoded architecture, since implementing instructions of this complexity makes the CPU logic very complex quickly. For this super CISC system a microcoded architecture would only add instructions to microcode ROM, instead of adding new logic. Still, it was interesting to understand a bit more of the inner workings of GPL.

 

 


This is a very interesting avenue of development!


Just throwing out some thoughts:

 

1. I heard (read) the GPL processor thing as well, but I am not sure it is correct. As I understood, the original plan was for a 99xx CPU with an 8 bit data path but this project did not (timely) materialise and the 16-bit 9900 was shoehorned in at a late stage. I also think I remember reading that the designers did not mind the "double interpreter" because they expected that a dedicated CPU would be used for a next gen system. I am not sure how the two things relate, if at all.

 

2. For a microcoded design, have a look at my 99000 version. It has ~200 states for the 9995 instruction set.

 

3. Another route could be to use the co-processor design of the 99xxx series. I am not implementing that, but it could help to keep complexity down, by separating the GPL part in a co-processor. That co-processor could have a data path optimised for GPL, with maybe a separate address ALU etc. The co-processor interface has facilities to transfer the WP, PC and ST registers between the CPU and the co-processor, so integration could be quite seamless.
 


22 hours ago, pnr said:

This is a very interesting avenue of development!


Just throwing out some thoughts:

 

1. I heard (read) the GPL processor thing as well, but I am not sure it is correct. As I understood, the original plan was for a 99xx CPU with an 8 bit data path but this project did not (timely) materialise and the 16-bit 9900 was shoehorned in at a late stage. I also think I remember reading that the designers did not mind the "double interpreter" because they expected that a dedicated CPU would be used for a next gen system. I am not sure how the two things relate, if at all.

 

2. For a microcoded design, have a look at my 99000 version. It has ~200 states for the 9995 instruction set.

 

3. Another route could be to use the co-processor design of the 99xxx series. I am not implementing that, but it could help to keep complexity down, by separating the GPL part in a co-processor. That co-processor could have a data path optimised for GPL, with maybe a separate address ALU etc. The co-processor interface has facilities to transfer the WP, PC and ST registers between the CPU and the co-processor, so integration could be quite seamless.
 

Thanks, these are good comments! I also don't know if there ever were serious plans to have a GPL based processor, but in the wonderful world of FPGA design these things are possible, so why not give it a go...

Since I wrote the last message I updated the GPLS instruction a bit more, it now also handles directly VDP address loading and indirection in the cases of VDP accesses. It's a bit ridiculous, it now occupies 30+ states. But an interesting exercise nevertheless.

 

I did take a look at your design - very interesting and good reference material! I am in the process of adopting a partially microcoded approach, so that I can hopefully do the microcode implementation incrementally, without necessitating the creation of an entirely microcoded processor from the start. As I was looking at your design, I noticed that your microcode (the array "pla") is not clocked. I wonder if it maps to block RAM during synthesis? I am going to attempt to make a clocked design for the microcode ROM, to ensure that the microcode will indeed be in ROM (i.e. initialised block RAM).

 

I am planning to make the microcode quite wide. I noticed that in your design you have a separate constant array, and the microcode only contains an index to the constants. That makes sense in keeping the microcode width smaller, but in order to accelerate GPL the microcode needs to generate a lot of constants, so I am going to just have a 16-bit field in every microcode step just for constants. That saves one level of indirection (a multiplexer) during the decode phase.

 

It appears that I'm going to need at least:

- a 3 bit field to choose ALU arg1 (my CPU core has registers as ALU inputs)

- a 5 bit field to choose ALU arg2

- a 4 bit field to choose ALU operation

- a 16-bit field for constants (which can be directed to either arg1 or arg2)

- a 8-bit field for next CPU state (planning for max 256 words at the moment, easy to extend if necessary)

- a 8-bit field for next CPU return state (to support microcode "subroutines", I already have this concept in the hardwired design)

- a 8-bit field for conditional next CPU state

- a field to choose whether the "next state" or "conditional next state" will be the next state, this will probably take 4 bits or more

- a bunch of 1 bit fields to trigger the loading of ALU output to the internal registers: PC, data write, temporary 1 or temporary 2, effective address etc. Probably around 8 overall.

 

All these together mean that the microcode is going to be something like 60 bits wide, maybe more. Multiples of 18 bits map well to the block memories, so I might go with 72 bits first.
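As a rough check of the numbers, the field list above can be written down as a small Python model. The field names, their ordering in the word, the 4-bit width for the condition select and the 8 load-enable bits are assumptions taken from the "probably" estimates in the list, so treat this as a sketch rather than the actual layout.

```python
# Field widths follow the list above; names and bit ordering are
# illustrative assumptions, not the real icy99 microcode layout.
FIELDS = [
    ("alu_arg1_sel", 3),
    ("alu_arg2_sel", 5),
    ("alu_op", 4),
    ("constant", 16),
    ("next_state", 8),
    ("return_state", 8),
    ("cond_next_state", 8),
    ("cond_select", 4),    # "4 bits or more" in the list
    ("load_enables", 8),   # "around 8" one-bit load strobes
]


def total_width(fields=FIELDS):
    """Sum of all field widths in bits."""
    return sum(width for _, width in fields)


def padded_width(bits, granule=18):
    """Round up to the next multiple of the block RAM data width."""
    return -(-bits // granule) * granule


def pack(values, fields=FIELDS):
    """Pack a dict of field values into one integer, first field in the low bits."""
    word, shift = 0, 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word, fields=FIELDS):
    """Inverse of pack(): split an integer back into named fields."""
    values, shift = {}, 0
    for name, width in fields:
        values[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return values
```

Under these assumptions `total_width()` gives 64 bits and `padded_width(64)` gives 72, which matches the "60 bits, maybe more" estimate and leaves 8 spare bits in a 72-bit word.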


On 1/2/2021 at 9:19 PM, speccery said:

I did take a look at your design - very interesting and good reference material! I am in the process of adopting a partially microcoded approach, so that I can hopefully do the microcode implementation incrementally, without necessitating the creation of an entirely microcoded processor from the start. As I was looking at your design, I noticed that your microcode (the array "pla") is not clocked. I wonder if it maps to block RAM during synthesis? I am going to attempt to make a clocked design for the microcode ROM, to ensure that the microcode will indeed be in ROM (i.e. initialised block RAM).

I am not sure that is a good idea.

 

Initially my thoughts were like yours, and I was aiming for the PLA to be in block RAM. Two things changed my mind:
(i) The "ROM" has lots of duplication in it and it turns out that generating signals from the state vector does not take all that many LUTs. This is probably the reason that CPUs from that era often used PLAs for microcode in the real silicon.

(ii) The LUT version is faster than the "ROM" version. This was the case on the ICE40 chips and perhaps even more so on the ECP5 chips.
Maybe the second reason drops away when the microcode lookup is more pipelined than in my design.

 

A now obsolete reason was that I wanted to conserve block RAM on the limited ICE40 chip.

 

On 1/2/2021 at 9:19 PM, speccery said:

I am planning to make the microcode quite wide. I noticed that in your design you have a separate constant array, and the microcode only contains an index to the constants. That makes sense in keeping the microcode width smaller, but in order to accelerate GPL the microcode needs to generate a lot of constants, so I am going to just have a 16-bit field in every microcode step just for constants. That saves one level of indirection (a multiplexer) during the decode phase.


Yes. This design choice was driven by a wish to stay close to the original silicon (see here and figure 3 in the 99105/99110 data manual). This too uses a constant table. Trying to eliminate multiplexers is a good idea, I think. In the NMOS silicon of the era, it was almost free to have a tri-state bus on the chip. On an FPGA this translates to multiplexers. The natural multiplexer seems to be a 4 bit 2:1 multiplexer in a single logic block and an 8-way multiplexer takes 3 layers of LUT. Including all the wire routing, the actual layout quickly becomes hard to predict/understand. Selecting ALU inputs and ALU function, and generating flag bits, is a critical timing path for me.
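The "3 layers" figure can be illustrated with a small Python model that builds an N:1 multiplexer out of 2:1 multiplexers and counts the layers. This is only a behavioural sketch with made-up function names; real synthesis onto ICE40 or ECP5 logic cells may pack the tree differently.

```python
def mux2(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux_tree(select_bits, inputs):
    """N:1 mux built from layers of 2:1 muxes.

    select_bits is given least significant bit first, and len(inputs)
    must equal 2 ** len(select_bits). Returns (selected value, layers).
    """
    if len(inputs) != 1 << len(select_bits):
        raise ValueError("need 2**len(select_bits) inputs")
    layer = list(inputs)
    depth = 0
    for bit in select_bits:
        # Each layer halves the number of candidate signals.
        layer = [mux2(bit, layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
        depth += 1
    return layer[0], depth
```

For eight inputs and the select value 5 (bits `[1, 0, 1]`, LSB first) the model returns input 5 after three layers, so every extra doubling of ALU input sources adds another layer to the timing path.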

 

The 99000 microcode is 152 bits wide. Mine is much narrower, but in part that is only apparent: fields have often been constrained to 4 bits, so that 1 LUT can derive single signals. I've never counted how many bits I have after such expansion.

For another take on microcode organisation, take a look at the microcode word of the 990/12. It is described briefly in one of the assembler manuals, but I cannot find the right link at the moment. It is 64 bits wide.

 

 


  • 2 years later...

Wow it's been almost 3 years since the last update...

I had received reports that the code no longer works with new versions of the oss-cad-suite toolchain. No wonder, as I haven't worked on the project for way too long. The project now works with the 2023-11-18 release, which I used during debugging, although I had to disable SDRAM support for now as it no longer works. I was using code from @pnr, but it instantiates a component which is no longer there. The project is still here at GitHub :)

 

I will try to find a bit more time to work on this thing. I didn't quite remember how cool this one was, even if I made it... I have been working on so many other projects in the meantime, and I had a bit of a pause with TI stuff during the summer. Perhaps it's time to enjoy some more FPGA goodness. Maybe I will combine a couple of projects: I could expose a GROM port on the GPIO pins of the ULX3S and connect that to the Grommy2 board, for example.

 

The nice thing is that with the ULX3S I get HDMI output. At the moment, though, the video mode I use does not work with my capture device, so I need to see if I can increase the resolution.
