

Member Since 31 Jan 2016
OFFLINE Last Active May 2 2019 2:35 PM

#4267388 TMS9900 CPU core creation attempt

Posted by speccery on Tue Apr 30, 2019 1:33 AM


How many logic cells did it take to re-create the 9900? 



Sorry for the long delay in answering. I have not tried synthesising the plain vanilla TMS9900 core on its own, without any peripherals.


Looking at one of the breadboard project targets on my GitHub account https://github.com/S...em_summary.html you can see that a minimal TMS9900 system took 1690 Xilinx Spartan 6 slice LUTs, or 29% of the XC6SLX9 chip. This system includes the TMS9900 core, 32K RAM, 32K ROM, and PNR's TMS9902 UART, all implemented using the FPGA's built-in resources. In a way this number is comparable to the 1072 logic cells for the J1, as that system also includes a memory interface, some I/O and a UART. However, the Spartan 6 logic elements are much more capable than what the Lattice ICE40HX provides, so the numbers 1690 and 1072 cannot be compared one to one.

#4244049 TMS9900 CPU core creation attempt

Posted by speccery on Sun Mar 24, 2019 9:50 AM

Cool.  I love that J1 processor.
For those who don't know about J1, imagine a subroutine call that takes 1 clock cycle and a return that takes 0 clocks. 
One of my thoughts, although I don't have any knowledge of Verilog, is that Forth CPUs would benefit from having a workspace register to assist multitasking. The Forth stacks typically are in very fast on-chip RAM, but if you want to change tasks it can be awkward swapping the stacks (i.e. registers) in and out of conventional memory. So... if there were larger memory spaces available for a number of tasks and a workspace register, the chip could have fast context switching, albeit for a finite number of tasks, which is typically OK for an embedded application.
You are probably one of the few people in the world who know about FPGA 9900 and J1. :-)

The J1 implements its stacks as two huge shift registers, where each shift operation is a shift by one word length, typically 16 or 32 bits. The stacks are not deep: in the 16-bit version the defaults are 15 entries for the data stack and 17 for the return stack. So these stacks are implemented in the FPGA logic fabric, not in block memory. This also means that there are no stack pointers, at least in the J1A version, so you don't know how deep you are in the stacks... The source code for the J1A is about 130 lines of Verilog. It is tiny. To my understanding it is inspired by the Novix NC4016. The J1 is an awesome project, and it comes with Swapforth already implemented. The basic J1 system for the BlackIce takes 1072 logic cells, so about one eighth of the total capacity.
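To illustrate the pointer-less stack described above, here is a minimal Python model. It is a sketch, not the J1's Verilog: the class name `ShiftStack` and the behaviour of reading zero from an empty stack are assumptions made for illustration. Only the word width and the default depth of 15 come from the post.

```python
class ShiftStack:
    """Fixed-depth stack modelled as one wide shift register (no stack pointer)."""

    def __init__(self, depth=15, width=16):
        self.depth = depth
        self.width = width
        self.word_mask = (1 << width) - 1
        self.total_mask = (1 << (depth * width)) - 1
        self.reg = 0  # depth * width bits, top of stack in the lowest word

    def push(self, value):
        # Shift everything up by one word; the oldest word falls off the end.
        self.reg = ((self.reg << self.width) | (value & self.word_mask)) & self.total_mask

    def pop(self):
        # Read the lowest word, then shift everything down by one word.
        value = self.reg & self.word_mask
        self.reg >>= self.width
        return value


stack = ShiftStack(depth=3)
for v in (1, 2, 3, 4):          # the fourth push silently drops the oldest entry
    stack.push(v)
popped = [stack.pop() for _ in range(4)]
print(popped)                   # [4, 3, 2, 0]
```

Because there is no pointer, neither overflow nor underflow can be detected: the fourth push loses the value 1 and the fourth pop simply reads back zero.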

Not only do subroutine calls and pretty much every other instruction take 1 clock cycle, you can also fold certain operations, such as the subroutine return, into another instruction. Oh, and it runs at 48 MHz on the BlackIce-II. I did not try to optimize it.
I think I also ported it over to the Pepino board, as a 32-bit version, along the lines of what James had done for the Xilinx Spartan 6.

Is there a repository of your code for the J1->BlackIce project?

No, but I guess I could set it up. I was playing with the Icestorm tools and used the J1 as the core to play with. I did not do much; my work amounted to merging the top level block from the BlackIce examples with the J1. I tested it with both place-and-route tools: arachne-pnr and the newer nextpnr. For the latter I had to study things a little to get the PLL set up properly (the input clock is 100MHz, which the PLL takes to 48MHz).

#4243406 TMS9900 CPU core creation attempt

Posted by speccery on Sat Mar 23, 2019 4:40 AM

It was great to be able to use PeteE's software. With it I found and fixed two bugs:

1. Despite my "testing" there was still a bug in the treatment of ST1 (the A> flag) with the ABS instruction. The implementation completely lacked the special case that the ABS instruction sets ST1 based on the source argument.

2. SLA0 did not set the overflow flag properly if the shift count was greater than one.
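The two flag rules above can be written down as a small Python reference model. This is only a sketch of the expected behaviour, not the VHDL of the core. The function names are hypothetical, the shift count is passed in directly (the R0 encoding is not modelled), and ABS is modelled as never setting carry, which is what the TMS99105 is reported to do elsewhere in this thread.

```python
# TMS9900 status register bit masks (bit 0 is the most significant bit).
ST_LGT, ST_AGT, ST_EQ, ST_C, ST_OV = 0x8000, 0x4000, 0x2000, 0x1000, 0x0800


def compare_to_zero(value):
    """Return the L>, A>, EQ bits for a 16-bit value compared against zero."""
    if value == 0:
        return ST_EQ
    flags = ST_LGT                      # any non-zero value is logically > 0
    if value < 0x8000:
        flags |= ST_AGT                 # positive in two's complement
    return flags


def abs_model(src):
    """ABS: the result is |src|, but the compare flags come from src itself."""
    result = (-src) & 0xFFFF if src & 0x8000 else src
    flags = compare_to_zero(src)        # rule 1: use the source, not the result
    if src == 0x8000:
        flags |= ST_OV                  # -32768 has no positive counterpart
    return result, flags


def sla_model(value, count):
    """SLA: OV is set if the sign bit changes at any step of the shift."""
    flags = 0
    carry = 0
    for _ in range(count):
        carry = (value >> 15) & 1       # last bit shifted out ends up in C
        shifted = (value << 1) & 0xFFFF
        if (shifted ^ value) & 0x8000:
            flags |= ST_OV              # rule 2: check every single-bit step
        value = shifted
    if carry:
        flags |= ST_C
    return value, flags | compare_to_zero(value)


print([hex(v) for v in abs_model(0xFFFF)])      # result 1, but A> stays clear
print([hex(v) for v in sla_model(0x4000, 2)])   # sign flips twice, OV still set
```

The second example is the interesting one for multi-bit shifts: >4000 shifted left by two ends at zero, so the final sign matches the original sign, yet the overflow flag must be set because the sign changed along the way.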


Fixing bug 1 got extended Basic fixed! So now I could resume what I was actually trying to implement: read access to the serial flash ROM chip. To my delight the code I had written worked, and I was able to access the serial flash ROM from Basic with a series of call load(...) and call peek(...) statements. I wish the Basic had direct support for hexadecimal numbers, both input and output. The Oric Atmos Basic features these and also DOKE and DEEK operations, which enable peeks and pokes but with 16-bit values...


Anyway, with the bugs fixed, all the test cases pass now. It's great that this test is now also very easy to repeat whenever the CPU is updated.

#4242782 TMS9900 CPU core creation attempt

Posted by speccery on Fri Mar 22, 2019 3:05 AM

Here is a cartridge program I've been working on for validating CPU instructions. It's based on your earlier description of a program that runs through various instructions with varying inputs and checks the results. It checks all combinations of the inputs 0,1,>7FFF,>8000,>AAAA,>FFFF with the following instructions: A AB ABS AI ANDI C CB CI COC CLR CZC DEC DECT DIV INC INCT INV LI MOV MOVB MPY NEG ORI S SB SETO SLA SLA0 SOC SOCB SRA SRA0 SRC SRC0 SRL SRL0 SWPB SZC SZCB XOR. The status flags are set to two known states before each test, to verify that only the proper status flags are modified. If all goes well, the display should show "OK!". For the first 24 failures, a line is printed containing: the instruction name, the first and second input, the result, the expected result, then the status flags, and finally the expected status flags. The first status flag byte is the result of the instruction after only EQ is set, and the second byte is the result after LGT AGT C OV OP are set. Instructions ending in I have the inputs swapped. Shift instructions ending in zero use the R0 register for the shift amount, otherwise they shift by 1. I've tested Classic99 and it is ok, so it will be interesting to see how it fares on your CPU.
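The test scheme described above (every pair of the six inputs, each run from two known starting flag states) can be sketched in Python. This is an illustration only, not the cartridge's code: the function names are hypothetical, and the reference model covers just the A (add words) instruction.

```python
from itertools import product

# TMS9900 status bits (bit 0 is the most significant bit).
LGT, AGT, EQ, C, OV, OP = 0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400

INPUTS = (0x0000, 0x0001, 0x7FFF, 0x8000, 0xAAAA, 0xFFFF)
# The two known starting states: only EQ set, then LGT AGT C OV OP set.
INITIAL_STATES = (EQ, LGT | AGT | C | OV | OP)


def add_words(src, dst, status):
    """Reference model of A (add words): returns (result, new status)."""
    total = src + dst
    result = total & 0xFFFF
    status &= ~(LGT | AGT | EQ | C | OV) & 0xFFFF   # A leaves OP untouched
    if result == 0:
        status |= EQ
    else:
        status |= LGT
        if result < 0x8000:
            status |= AGT
    if total > 0xFFFF:
        status |= C
    # Overflow: both operands share a sign and the result's sign differs.
    if ~(src ^ dst) & (src ^ result) & 0x8000:
        status |= OV
    return result, status


def expected_table(model):
    """Expected (result, status) for every input pair and both initial states."""
    return {
        (a, b, st): model(a, b, st)
        for a, b in product(INPUTS, repeat=2)
        for st in INITIAL_STATES
    }


table = expected_table(add_words)
print(len(table))                                   # 72 cases for one instruction
print([hex(v) for v in table[(0x0001, 0x7FFF, EQ)]])
```

Running each case from both starting states is what catches a flag that is never written at all: a bit left untouched matches the expected value in one of the two runs but not in the other.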



Thanks, this is awesome and extremely helpful to have an independent piece of verification code! I've not had time during the week to test this, but I am looking forward to doing so this evening. Hopefully something shows up immediately :)


Also, your testing methodology is better than mine: I should also test each instruction twice, to make sure the flags go both ways properly. Thus I can improve my test coverage by making a simple modification. Perhaps I should also work on the test code to make it a cartridge, as it could be useful to others too.

#4240509 TMS9900 CPU core creation attempt

Posted by speccery on Mon Mar 18, 2019 2:03 PM

Well, that was an interesting debugging session! In the end I understood that what I thought was a problem in computing subtraction is actually a problem that manifests itself in printing (and elsewhere too). Here is the problem under extended Basic, and below it the explanation of how I got there. I still don't know which CPU instruction is the offender, but I am making progress.


Test program under extended Basic


The process of finding the problem turned into a showcase of the FPGA system's features, combined with Stuart's cool LBLA / debugger module:


Since I thought the problem was in the subtract operation, I studied the excellent TI Intern book, following RXB's comment about the SSUB routine address. I wrote a simple Basic program:


and ran this under classic99, setting breakpoints at >D74 and >FA6 to see the contents of the scratchpad memory before and after the subtraction operation when running extended Basic. (I could have determined earlier the problem cannot be in this ROM code, as it is shared with regular TI Basic, and that was working, but bear with me - these things only make sense once you know where the problem is not present).

I could see the contents of floating point accumulator at 834A (the value 1) and the argument at 835C (the value 2) and after the operation the floating point accumulator became negative. That makes sense.


Next I wanted to verify if this is what happens with my FPGA CPU. This is where I got to use Stuart's cartridge and some features of the FPGA system.

First, taking advantage of the fact that in the FPGA system the ROM is actually RAM, I loaded Stuart's cartridge and modified the system ROM to call a subroutine at the beginning of the subtract operation (I added the BLWP @>1360 instruction).

Inserted BLWP @>1360 instruction at >D7C
Notice that I had to make space for my intercepting subroutine call: I overwrote the NEG instruction at >D7C and moved the NEG @>834A instruction to the intercept routine. I placed the routine at >1364, writing over cassette support code.
The code that the intercept at the beginning of subtract jumps to

I then did the same operation again at the end of the floating point routine, at >FA6, this time moving the instruction MOV @>834A,R1 to the interception routine.


The second intercept point at the end of subtract (or actually rounding)
Second intercept destination

The actual benefit of the intercept routines is that they copy the entire scratchpad memory to a safe place, before and after executing the Basic ROM's floating point subtract routine respectively. The FPGA system has 1 kilobyte of scratchpad memory instead of the regular 256 bytes, so I just copied the memory from 8300..83FF first to 8100..81FF and at the end to 8200..82FF.


After making those patches to the system ROM, I copied the modified ROM to the PC's disk. I then initialized the FPGA system again, this time with the modified ROM but with the extended Basic cartridge inserted instead of Stuart's cartridge. Next I performed my subtraction in Basic again. After running that piece of Basic code, I read back the two copies of scratchpad memory (before and after the subtract) and compared them. At this point I saw that the subtract had in fact executed correctly, and that the problem manifests itself when printing negative numbers - the minus sign does not appear. The problem also occurs with other operations, since the cos and sin functions also have issues.
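Comparing the two scratchpad copies is easy to automate once they have been read back to the PC. The following Python sketch assumes each copy is available as a 256-byte buffer. The function name `diff_snapshots` and the byte values in the example are made up for illustration; they are not the actual memory contents from this session.

```python
def diff_snapshots(before, after, base=0x8300):
    """Return (address, old, new) for every byte that differs."""
    if len(before) != len(after):
        raise ValueError("snapshots must have the same length")
    return [
        (base + i, old, new)
        for i, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


# Made-up data: two 256-byte snapshots that differ in one byte of the FAC.
before = bytearray(256)
after = bytearray(before)
before[0x4A] = 0x40          # hypothetical value at >834A before the call
after[0x4A] = 0xBF           # hypothetical value at >834A after the call

changes = diff_snapshots(before, after)
for addr, old, new in changes:
    print(f">{addr:04X}: >{old:02X} -> >{new:02X}")
```

The `base` argument maps buffer offsets back to the original scratchpad addresses, so the output can be read directly against a ROM listing regardless of where the copies were stored.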


I am very happy with the DMA feature of the FPGA system, as this enables me to read and write the TI clone's memory while the system is running - super handy for debugging. The same mechanism is used when the system is booted up from PC (it can also boot from flash ROM).


Now, after this debugging session, I know where the problem is not. Progress. 

#4239306 TMS9900 CPU core creation attempt

Posted by speccery on Sat Mar 16, 2019 1:22 PM

I've been hacking away at the TI-99/4A FPGA today, after a while away from it. I've been working on the Collectorvision Phoenix - it has been fun but is a little slow going, as the Atari core I am working on is not mine. It makes quite a big difference to work on a design you know inside out, as opposed to porting over code from someone else.


I did some refactoring of the TI-99/4A VHDL code, separating out the external memory interface from the top-level VHDL block, so that I can more easily adapt the design to other FPGA boards. As part of this process I wanted to enable direct execution of TMS9900 machine code from the FPGA's configuration flash ROM. This is a serial ROM chip, so reading it will be relatively slow, but that should be fine, as the system is running too fast for legacy software anyway unless it is slowed down. Having this capability would enable the TI-99/4A core to run on many barebones FPGA boards, even without any external memory, as long as the FPGA has approximately the same capabilities as the XC6SLX9 I am using.


When testing the hardware, I wanted to use extended Basic, but realized I have a bug in running extended Basic: I cannot enter negative numbers. Setting A=-1 for example always ignores the minus sign, and A becomes positive. I had similar problems earlier with the regular Basic, and tracked them down to the FPGA CPU's condition codes not working properly in certain cases. I thought I still might have that problem and ran my tests again. One overflow flag bug had crept in, and I also noticed that my ABS instruction implementation was sometimes setting the carry flag while a real CPU does not do it - at least the TMS99105 never sets carry when running ABS. Looking into the source code of classic99, the carry is likewise always cleared when running ABS. The data sheet is ambiguous here: it says ABS sets carry if there is a carry out from the ALU, but in practice it appears to be always zero. Anyway, my test machine code program now has exactly the same behavior as a real TMS99105 chip when running through test cases of the following instructions:

A, S, SOC, SZC, DIV, MPY, C, NEG, SRL, ANDI, CB, SB, AB, XOR, INC, DEC, SLA, SRA, SRC, MOV, MOVB, SOCB, SZCB, ABS and X. For each of those my test software executes the operation with 16 different input parameter value combinations, and the results and the top 6 bits of the status register are identical on both. This of course is not a comprehensive test of all instructions, but the coverage is pretty good - pretty much all games and other software work. Nevertheless there is still a bug somewhere, hopefully in the CPU and not in timing. But the behavior is so consistent that I believe it is a CPU bug.


So if anyone happens to know how extended Basic handles the minus sign, that would be greatly appreciated :)

#4211498 New TI Basic game: Skier 99

Posted by speccery on Sat Feb 2, 2019 9:06 AM

This is a fun game :) 

I barely resisted firing it up in classic99 (I did fire up classic99 but did not load the game) and instead ran it on my FPGA TI-99/4A for the first time.

It is pretty hysterical when the CPU is running at 39x the normal speed  ;-)

I was actually wondering why it is not running any faster than it is (which is very fast), but that is probably due to sound effects. I haven't looked and don't remember from Basic manual, but I assume the call sound (is that the name of it) commands have a timing parameter which is probably tied to vertical frame sync in its implementation, and thus can slow down the FPGA system the same way as the real iron. Any timing based on loops would just run crazy fast, but the sound effect lengths seem the same when I run at maximum speed and when I ran at "slow" speed. I also notice that my "slow" speed is not very slow at all anymore...


I also found a bug/limitation in my setup: in my system I am using a PC keyboard and capturing the keypresses on my PC. I have a Windows program I wrote which I use to load ROMs etc. to the FPGA; this same program also captures key presses and sends them to the FPGA through USB, using my own serial protocol. Now the game expects all key presses to be in upper case, but I don't support caps lock, so I need to hold shift while playing...

#4210286 TMS9900 CPU core creation attempt

Posted by speccery on Thu Jan 31, 2019 2:43 PM

Well hello after a long while. I have been preoccupied with other things, but during the past few weeks I've found a little time to work again on the TI-99/4A FPGA clone. I really ought to be working on the Collectorvision Atari 2600 code, but could not help but spend some time with the TI-99/4A first. Here, I wanted to follow a bit my original passion which was to have a fast TI-99/4A. This time I also wanted to put some computer architecture theory into practice: I added an on-chip cache memory to my TI-99/4A clone, while also optimising the VHDL code a bit. The result is that instruction execution speed jumped from 23 times the original to 39 times the original speed. The TMS9900 core is now a little simpler than it used to be, but still far from an elegant design, although getting a little better.


I have two plans on this project to follow up:

First, add more speed by going from the current non-pipelined design to a slightly pipelined one, in the sense that there would be a two-stage pipeline where both stages take multiple clock cycles. The first stage would be the instruction fetch and decode stage, while the second would be the instruction execution stage. I could not easily go in this direction in the past, since there was (and still is) only one memory bus. But now that I have a working cache, I have much more memory bandwidth to play with. The cache is currently outside the CPU core, so it serves both instruction and data fetches. It is a super simple design: direct mapped with a write-through update policy, with 1 kilobyte of data capacity and about half a kilobyte of tag memory. The whole thing is implemented as a simple 1k x 36 bit memory block (not all of the bits are used in each 36-bit word). Having the cache outside the CPU core is not ideal, so I am probably going to add another cache for instructions only and pull that inside the core, into the fetch/decode stage, so that it can operate in parallel with the execution stage. This should increase performance quite significantly.
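For readers unfamiliar with the terms, a direct-mapped, write-through cache can be modelled in a few lines of Python. This is a behavioural sketch, not the VHDL of this design. The geometry (512 lines of one 16-bit word each), the class name and the choice not to allocate a line on a write miss are all assumptions made for illustration.

```python
class DirectMappedCache:
    """Direct-mapped, write-through cache of 16-bit words in front of a slow memory."""

    def __init__(self, memory, lines=512):
        self.memory = memory            # backing store: dict of byte address -> word
        self.lines = lines              # 512 lines * 2 bytes = 1 KB of data (assumed)
        self.valid = [False] * lines
        self.tags = [0] * lines
        self.data = [0] * lines
        self.hits = 0
        self.misses = 0

    def _split(self, byte_addr):
        word = byte_addr >> 1           # accesses are modelled as word-wide
        return word % self.lines, word // self.lines   # (index, tag)

    def read(self, byte_addr):
        index, tag = self._split(byte_addr)
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return self.data[index]
        self.misses += 1                # fetch from slow memory and fill the line
        value = self.memory.get(byte_addr & ~1, 0)
        self.valid[index], self.tags[index], self.data[index] = True, tag, value
        return value

    def write(self, byte_addr, value):
        index, tag = self._split(byte_addr)
        self.memory[byte_addr & ~1] = value & 0xFFFF    # write-through: memory is always updated
        if self.valid[index] and self.tags[index] == tag:
            self.data[index] = value & 0xFFFF           # keep the cached copy coherent


mem = {0x6000: 0x1234}
cache = DirectMappedCache(mem)
cache.read(0x6000)            # miss, fills the line
cache.read(0x6000)            # hit
cache.write(0x6000, 0xBEEF)   # updates both the cache and memory
cache.read(0x6400)            # maps to the same line and evicts it
print(cache.hits, cache.misses, hex(mem[0x6000]))
```

Write-through keeps the design simple because memory is never stale: evicting a line needs no write-back, which is why the read of >6400 can overwrite the line immediately.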


The second intention I have is to port the TI-99/4A core to a few more FPGA boards, in order to make this design more accessible for others. The cache is also an enabler of sorts in this sense, since now I can easily support slow buses (such as SPI connected flash memories) for cartridge ROMs, I could support DRAM fetches in burst mode enabling FPGA boards with DRAMs only to be effectively used, and I can also support quite small FPGAs since I could now modify the design in a way that doesn't anymore need a lot of on-chip memory while still running at a reasonable speed. Specifically I have the low cost blackice-ii board in mind as one target for the TI implementation, this FPGA only has 16K RAM on board.

#4128791 TMS9900 CPU core creation attempt

Posted by speccery on Sun Oct 7, 2018 5:50 AM

FPGA CPU version of the video got uploaded. This is the original version of the demo, not some custom one.



After a break I continued to tweak the VHDL, in an attempt to get the splitscreen3_demo.a99 working more smoothly. I just added registers so that the COINC and 5TH sprite flags are set pending and actually flagged at the end of a scanline, as opposed to immediately when they occur. This way there would be some CPU time between two consecutive settings of the flags. The changes maybe helped a little, but the sine curves still are jerky.

#4128784 TMS9900 CPU core creation attempt

Posted by speccery on Sun Oct 7, 2018 5:26 AM

I uploaded two videos (the latter one is still uploading as I write this), demonstrating the Megademo running on the FPGA system using the TMS99105 CPU and then with my FPGA CPU core. The FPGA CPU video goes through the demo twice: once running with a lot of wait states, bringing execution speed close to the original TI-99/4A, and then running at zero wait states, or around 23x the CPU speed.
Here is a link to the TMS99105 version. This is a special compilation of the megademo: there are no actual code changes, but I edited the controller.a99 file so that the video starts with the multicolour demo. I had problems running this phase of the demo for obvious reasons - the multicolour mode was not implemented...


What is new here is that I now added to my TMS9918 code the ability to detect sprite coincidence, so the demo no longer gets stuck in the splitscreen3_demo.a99 phase. Timing behaviour is different though, as can be seen in the video.


I guess one of the next challenges for me then would be to make a new demo phase, which would take advantage of the increased processing speed.

#4126498 TMS9900 CPU core creation attempt

Posted by speccery on Wed Oct 3, 2018 1:48 PM

I had a little time today to tinker with my TI-99/4A FPGA clone. Strictly speaking I was now working on the TMS99105 version, but since this design shares most of the VHDL code with the full FPGA implementation, I can work on either as long as I am not working on the FPGA CPU core itself.


Anyway what I have decided to try to do is to improve compatibility and fix all the bugs I know about. The Megademo has been very useful in this regard, I found two bugs in the design, both on my TMS9918 implementation. I had already once decided not to complete my TMS9918 VDP since Matthew's F18A is already a feature complete version (with many additional features as I am sure people here would know), but had to revisit that decision since as long as the FPGA system is not correctly running all the software I throw at it I cannot know if something not working is due to the CPU or the VDP or something else.


One of the missing features is the multicolor mode (providing 64x48 resolution with 15 colors per pixel). The rotozoom portion of the demo uses this mode, and was displaying garbage. But no more - now it is fixed. I remain amazed how very small changes to the VHDL code create new features. Adding the multicolor mode amounted to only minor changes to pattern fetch address generation, and the pixel shifter. Overall perhaps 10 lines of code were added/changed.
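To show why multicolor mode needs only a different pattern address and a different use of the fetched byte, here is a Python sketch of how one pixel's colour is looked up. It follows my understanding of the TMS9918A multicolor layout and is not taken from the VHDL of this project. The function name and the table base addresses in the example are hypothetical.

```python
def multicolor_pixel(vram, name_base, pattern_base, x, y):
    """Return the 4-bit colour code of pixel (x, y) in multicolor mode."""
    row, col = y >> 3, x >> 3                   # position in the 32x24 name table
    name = vram[name_base + row * 32 + col]
    # Two of the pattern's eight bytes are used, selected by the row number;
    # the upper or lower 4x4 block within the cell picks one of the two.
    offset = ((row & 3) << 1) | ((y >> 2) & 1)
    byte = vram[pattern_base + name * 8 + offset]
    # The high nibble colours the left 4x4 block, the low nibble the right one.
    return byte >> 4 if (x & 4) == 0 else byte & 0x0F


vram = bytearray(16384)
NAME_BASE, PATTERN_BASE = 0x1800, 0x0000        # example table locations
vram[NAME_BASE + 1 * 32 + 0] = 5                # name 5 at row 1, column 0
vram[PATTERN_BASE + 5 * 8 + 3] = 0xA7           # row 1 uses bytes 2 and 3

left = multicolor_pixel(vram, NAME_BASE, PATTERN_BASE, 1, 12)
right = multicolor_pixel(vram, NAME_BASE, PATTERN_BASE, 5, 12)
print(hex(left), hex(right))
```

Compared with the normal graphics mode, the name table fetch is unchanged; only the low three bits of the pattern address and the interpretation of the byte (two colours instead of eight pixels) differ.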


And now the rotozoom runs - and it runs fast on the TMS99105! Overall the whole demo runs very nicely, that is - until it encounters the "sine wave split screen", where the system just halts. I have now found the Megademo source code and finally located the root cause of the halt: my VDP implementation does not yet generate the COINC status. I had completely forgotten that I did not build it. The COINC flag is set whenever two sprites have overlapping pixels and reset every time the VDP status register is read.


On real TMS9918 silicon the generation of this flag is easy, since it has dedicated hardware to support drawing four sprites per scanline, and it is easy to set the flag if any two sprite shifters are active simultaneously (or this is how I assume it works). My TMS9918 implementation is different: I have only one sprite generator, which renders to a scanline buffer. The hardware is run in a loop and can render all 32 sprites on a single scanline. In fact I think I could support many more sprites, probably at least 128 per scanline. Here is the problem: because the hardware is reused, it needs special additional support to detect sprite overlap. Currently when it is writing pixels to the line buffer it is doing just that - writing. It does not care what is already in the buffer; the pixels overlaid by sprites just get overwritten. Sprites are rendered from lowest priority to highest, so that the highest priority sprites are rendered last and will be visible on top of any other sprites or characters. Alas, this "writing only" approach cannot work when you need to know whether a pixel has been written to the line buffer by character data or by a previous sprite. So I will need to revise the state machine so that there is an additional per-pixel flag memory that is read when a sprite is rendered, to detect the scenario where there are two sprite writes to the same pixel. This in turn means that the state machine will now need additional states to perform the reads prior to the writes. According to the TMS9918 data sheet the COINC flag is set even for transparent sprites, so the flags will need to be read and written even if the actual pixels are not visible. What a pain, and it has to wait for another day and more time.
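The planned change, a per-pixel flag that is read before each sprite write, can be sketched in Python as follows. This is a behavioural model of the idea only, not the state machine itself. The function name and the sprite tuple format are assumptions made for illustration.

```python
def render_sprites(line_buffer, sprites):
    """Draw sprite spans into one scanline; return True if two sprites overlap.

    `sprites` is a list of (x, pixels, colour) in drawing order (lowest
    priority first); `pixels` holds the 0/1 pattern bits for this scanline.
    """
    touched = [False] * len(line_buffer)   # extra per-pixel "a sprite was here" flag
    coinc = False
    for x, pixels, colour in sprites:
        for i, bit in enumerate(pixels):
            pos = x + i
            if not bit or not 0 <= pos < len(line_buffer):
                continue
            if touched[pos]:               # read before write: a second sprite hit this pixel
                coinc = True
            touched[pos] = True            # the flag is set even for transparent sprites
            if colour != 0:                # colour 0 is transparent: nothing visible is drawn
                line_buffer[pos] = colour
    return coinc


line = [1] * 256                           # background from character data
overlap = render_sprites(line, [(10, [1] * 8, 4), (16, [1] * 8, 0)])
print(overlap, line[16], line[18])

line2 = [1] * 256
apart = render_sprites(line2, [(10, [1] * 8, 4), (20, [1] * 8, 6)])
print(apart)
```

In the first example the second sprite is transparent, so it changes no visible pixel, yet the overlap at positions 16 and 17 is still reported. Keeping the flags in a separate array is also what distinguishes a sprite pixel from character data already in the line buffer.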


Interestingly the source code of the megademo (splitscreen3_demo.a99) has bogus comments - they lead one to believe that scanline position detection is done with the 5th sprite flag in the VDP, but in reality the code is reading the COINC flag. I already support the 5th sprite flag, so that would not have been a problem. I initially thought the Megademo freezing on my FPGA hardware was due to something outside the VDP, but now I know that the CPU is polling the COINC flag in a busy loop. As it never gets raised in my FPGA design, the demo just freezes...

#4124364 TI-99/4A with a Pipistrello FPGA board

Posted by speccery on Sun Sep 30, 2018 12:04 PM

Yes, this is the exact same board. The manufacturer is QMTECH, and I did find all relevant documentation. 

SDRAM access is much more complex. This is actually still the phase I am at - after a 6 month pause in FPGA work I am still trying to remember where I left off. I have integrated the TMS9900 and TMS9902 cores on this board, and I integrated an SDRAM controller there too, but if I remember correctly I was still trying to get that working.


I will e-mail you the manuals, so you can take a look at what exactly is on board, but there are three chips on the red base daughter board: a CY7C68013 USB chip (which I have not used), an ADV7123 VGA DAC (24-bit) and a CP2102 USB to serial port. The last one I have used successfully with your TMS9902 core. I don't remember if I have used the VGA DAC yet or not. There are a bunch of sample projects; one of them uses VGA and another one SDRAM.


To make the story short - I found the boards so useful and affordable that I have 3 of the XC6SLX16 FPGA boards and two daughter boards. I don't recall if I bought them from the same seller, probably not.

#4123039 TI-99/4A with a Pipistrello FPGA board

Posted by speccery on Fri Sep 28, 2018 2:07 PM

After a long while an update to my TMS99105 project too - I ported some features of my second version of this project (using my own TMS9900 CPU design) back to the TMS99105 CPU version. That enabled me to run the very cool megademo for the first time with the TMS99105 CPU. More explanation at hackaday:



#4120060 TMS9900 CPU core creation attempt

Posted by speccery on Mon Sep 24, 2018 1:47 PM

Yesterday and today I fixed in total four bugs in the FPGA CPU, these are documented in two blog postings, here is a link to the latter one. Three bugs with flag handling and one major bug in the hardware divider fixed.



#4118724 TMS9900 CPU core creation attempt

Posted by speccery on Sat Sep 22, 2018 2:26 PM

Wow, it's been a long time without updates! The TI Treff is on-going in Germany. I did not have the time to go there, but inspired by the event - and also by the fact that after my house move my work room is beginning to be in good shape - I booted both versions of my FPGA TI-99/4A. I was happy to see that both FPGA boards still work. One had the TMS99105 daughterboard plugged in, while the other was running my VHDL TMS9900 core. I spent some time working on the latter, fixing a couple of reset related problems - and discovering a bug. Apparently even from BASIC my CPU claims that 1*-1 equals 1. Well whatever, the negative numbers are just a nuisance :)