
TMS9900 CPU core creation attempt


speccery


  • 3 months later...
  • 4 months later...

Wow, it's been a long time without updates! The TI Treff is ongoing in Germany. I did not have the time to go there, but inspired by the event - and also by the fact that after my house move my work room is beginning to be in good shape - I booted both versions of my FPGA TI-99/4A. I was happy to see that both FPGA boards still work. One had the TMS99105 daughterboard plugged in, while the other was running my VHDL TMS9900 core. I spent some time working on the latter, fixing a couple of reset-related problems - and discovering a bug. Apparently even from BASIC my CPU claims that 1*-1 equals 1. Well whatever, the negative numbers are just a nuisance :)

 

https://hackaday.io/project/20826-tms9900-compatible-cpu-core-in-vhdl


Yesterday and today I fixed a total of four bugs in the FPGA CPU: three in flag handling and one major bug in the hardware divider. These are documented in two blog postings; here is a link to the latter one.

 

https://hackaday.io/project/20826-tms9900-compatible-cpu-core-in-vhdl/log/153286-two-more-fpga-cpu-bug-fixes


  • 2 weeks later...

I had a little time today to tinker with my TI-99/4A FPGA clone. Strictly speaking I was working on the TMS99105 version, but since this design shares most of the VHDL code with the full FPGA implementation, I can work on either as long as I am not working on the FPGA CPU core itself.

 

Anyway, what I have decided to do is to improve compatibility and fix all the bugs I know about. The Megademo has been very useful in this regard: I found two bugs in the design, both in my TMS9918 implementation. I had already once decided not to complete my TMS9918 VDP since Matthew's F18A is already a feature complete version (with many additional features, as I am sure people here know), but I had to revisit that decision: as long as the FPGA system does not correctly run all the software I throw at it, I cannot know whether something not working is due to the CPU, the VDP or something else.

 

One of the missing features is the multicolor mode (providing 64x48 resolution with 15 colors per pixel). The rotozoom portion of the demo uses this mode, and was displaying garbage. But no more - now it is fixed. I remain amazed how very small changes to the VHDL code create new features. Adding the multicolor mode amounted to only minor changes to pattern fetch address generation, and the pixel shifter. Overall perhaps 10 lines of code were added/changed.

 

And now the rotozoom runs - and it runs fast on the TMS99105! Overall the whole demo runs very nicely, that is, until it encounters the "sine wave split screen" where the system just halts. I have now found the Megademo source code and finally located the root cause of the halt: my VDP implementation does not yet generate the COINC status. I had completely forgotten that I did not build it. The COINC flag is set whenever two sprites have overlapping pixels and reset every time the VDP status register is read.

 

On real TMS9918 silicon the generation of this flag is easy, since it has dedicated hardware to support drawing four sprites per scanline and it is easy to set the flag if any two sprite shifters are active simultaneously (or this is how I assume it works). My TMS9918 implementation is different: I have only one sprite generator, which renders to a scanline buffer. The hardware is run in a loop and can render all 32 sprites on a single scanline. In fact I think I could support many more sprites, probably at least 128 per scanline.

Here is the problem: because the hardware is reused, it needs additional support to detect sprite overlap. Currently when it is writing pixels to the line buffer it is doing just that - writing. It does not care what is already in the buffer; the pixels overlaid by sprites just get overwritten. Sprites are rendered from lowest priority to highest, so that the highest priority sprites are rendered last and will be visible on top of any other sprites or characters. Alas, this "writing only" approach cannot work when you need to know whether a pixel has been written to the line buffer by character data or by a previous sprite. So I will need to revise the state machine and add a per-pixel flag memory that is read when a sprite is rendered, to detect two sprite writes to the same pixel. This in turn means that the state machine will need additional states to perform the reads prior to the writes. According to the TMS9918 data sheet the COINC flag is set even for transparent sprites, so the flags will need to be read and written even if the actual pixels are not visible. What a pain, and it has to wait for another day and more time.
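As a rough software model of the read-before-write idea described above (not the actual VHDL state machine), the per-pixel flag memory could behave like the sketch below. The function name, the tuple layout of a sprite and the use of colour code 0 for transparent pixels are assumptions made for illustration.

```python
TRANSPARENT = 0  # assumed colour code for transparent sprite pixels

def render_sprites(line_buffer, sprites):
    """Render sprite pixels into line_buffer, lowest priority first.

    sprites is a list of hypothetical (x, colour, pattern_bits) tuples that
    is already ordered from lowest to highest priority.
    Returns True if any two sprites touched the same pixel (COINC).
    """
    sprite_flags = [False] * len(line_buffer)  # extra per-pixel flag memory
    coinc = False
    for x, colour, pattern in sprites:
        for offset, bit in enumerate(pattern):
            px = x + offset
            if not bit or not 0 <= px < len(line_buffer):
                continue
            # Read before write: this is the extra step the loop needs.
            if sprite_flags[px]:
                coinc = True
            sprite_flags[px] = True
            # A transparent sprite still marks the flag but leaves the pixel.
            if colour != TRANSPARENT:
                line_buffer[px] = colour
    return coinc
```

The flag array is separate from the pixel data, so character pixels already in the line buffer never trigger a false coincidence.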

 

Interestingly the source code of the Megademo (splitscreen3_demo.a99) has bogus comments: they lead one to believe that scanline position detection is done with the 5th sprite flag in the VDP, but in reality the code is reading the COINC flag. I already support the 5th sprite flag, so that would not have been a problem, and I initially thought the Megademo freezing on my FPGA hardware was due to something outside the VDP. But now I know that the CPU is polling the COINC flag in a busy loop. As it never gets raised in my FPGA design, the demo just freezes...

Edited by speccery

Sorry for the bogus comments. I was trying various techniques, and I found that using coinc to detect the scan line was slightly faster than using 5th sprite, but I seem to remember that it was not as reliable.

 

Regarding the sprites, I think you need a scanline buffer for the sprites alone. You need to initialize the buffer with a value like >ff that is not one of the 16 colors, and then write the sprites in order of the highest priority sprite first. When you get a collision you have the coinc bit and when/if you draw the 5th sprite you have that bit and you stop. The 9918A draws the buffer to the screen in the following line cycle, which accounts for the y coordinate being off by one.


Thanks @asmusr. No problem with the comment being old - this happens to the best of us. And thanks for putting in the time and energy to create such an amazing demo in the first place!

 

I probably did not write very clearly - I have had the scan line buffer in there from day 1 of my TMS9918 implementation. You're right that it's needed for sprites, but even more importantly it is required for scanline doubling for VGA output. In fact my scanline buffer has double the horizontal resolution: when TI graphics output is written to it each pixel is written twice to give a 512x192 resolution, which is scanline doubled to 512x384 and fitted into a 640x480 VGA screen.
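The arithmetic of that scaling can be illustrated with a small sketch. It assumes the standard 256x192 TI frame as input, and the function names are made up for the example; the real design does this per scanline in hardware rather than on a whole frame.

```python
def double_line(pixels):
    """Write every pixel twice: 256 pixels become 512."""
    return [p for p in pixels for _ in range(2)]

def scale_frame(frame):
    """Double each line horizontally, then emit it twice (scanline doubling)."""
    out = []
    for line in frame:
        wide = double_line(line)
        out.append(wide)
        out.append(list(wide))  # the repeated scanline
    return out
```

A 256x192 input therefore yields 384 lines of 512 pixels, which leaves a border inside the 640x480 VGA frame.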

 

The y-coordinate off by one feature I incorporated when I built the sprite engine... already in 2016 :)


I wasn't very clear either. :) My suggestion was to have two scanline buffers: one for the sprites alone and another combined buffer where you write the character data first and then write the data from the sprite buffer.
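The two-buffer suggestion can be sketched in software roughly as follows. This is only a model under stated assumptions: the names, the sprite tuple layout, and the treatment of colour 0 as "show the character pixel" are illustrative and not taken from any real implementation.

```python
EMPTY = 0xFF        # sentinel (>ff) that is not one of the 16 colour codes
MAX_PER_LINE = 4    # sprites shown per scanline before the 5th sprite flag

def render_sprite_line(sprites, width=256):
    """sprites: hypothetical (x, colour, pattern_bits) tuples, highest
    priority first. Returns (sprite_buffer, coinc, fifth)."""
    buf = [EMPTY] * width
    coinc = fifth = False
    for n, (x, colour, pattern) in enumerate(sprites):
        if n == MAX_PER_LINE:
            fifth = True        # flag the 5th sprite and stop drawing
            break
        for offset, bit in enumerate(pattern):
            px = x + offset
            if not bit or not 0 <= px < width:
                continue
            if buf[px] != EMPTY:
                coinc = True    # a higher priority sprite already owns it
            else:
                buf[px] = colour
    return buf, coinc, fifth

def composite(char_line, sprite_buf):
    """Combined buffer: character data first, sprite pixels on top."""
    return [c if s in (EMPTY, 0) else s for c, s in zip(char_line, sprite_buf)]
```

Because the highest priority sprite is written first and never overwritten, no separate flag memory is needed: a non-sentinel value in the sprite buffer is itself the "already written" marker.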

Edited by Asmusr

I uploaded two videos (the latter one is still uploading as I write this) demonstrating the Megademo running on the FPGA system, first using the TMS99105 CPU and then with my FPGA CPU core. The FPGA CPU video goes through the demo twice: once running with a lot of wait states, bringing execution speed close to the original TI-99/4A, and then running at zero wait states, or around 23x the original CPU speed.

Here is a link to the TMS99105 version. This is a special compilation of the Megademo: there are no actual code changes, but I edited the controller.a99 file so that the video starts with the multicolour demo. I previously had problems running this phase of the demo for an obvious reason - the multicolour mode was not implemented...

 

What is new here is that I now added to my TMS9918 code the ability to detect sprite coincidence, so the demo no longer gets stuck in the splitscreen3_demo.a99 phase. Timing behaviour is different though, as can be seen in the video.

 

I guess one of the next challenges for me then would be to make a new demo phase, which would take advantage of the increased processing speed.


FPGA CPU version of the video got uploaded. This is the original version of the demo, not some custom one.

 

updated:

After a break I continued to tweak the VHDL, in an attempt to get the splitscreen3_demo.a99 working more smoothly. I just added registers so that the COINC and 5TH sprite flags are set pending and actually flagged at the end of a scanline, as opposed to immediately when they occur. This way there would be some CPU time between two consecutive settings of the flags. The changes maybe helped a little, but the sine curves still are jerky.
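A behavioural sketch of the "pending, then flagged at end of scanline" registers might look like this. The class and method names are hypothetical, the bit positions follow the usual TMS9918 status layout (5th sprite at >40, coincidence at >20), and clearing everything on a read is a simplification.

```python
class StatusFlags:
    FIFTH = 0x40   # 5th sprite flag bit in the status byte
    COINC = 0x20   # sprite coincidence flag bit in the status byte

    def __init__(self):
        self.pending = 0   # set during rendering, not yet CPU-visible
        self.status = 0    # what the CPU sees when it reads the register

    def raise_flag(self, mask):
        self.pending |= mask

    def end_of_scanline(self):
        # Pending flags only become visible at the scanline boundary.
        self.status |= self.pending
        self.pending = 0

    def read_status(self):
        value = self.status
        self.status = 0    # reading the status register resets the flags
        return value
```

With this arrangement a CPU polling loop sees at most one new flag event per scanline, which is the extra CPU time between consecutive settings mentioned above.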

Edited by speccery

  • 3 months later...

Well hello after a long while. I have been preoccupied with other things, but during the past few weeks I've found a little time to work again on the TI-99/4A FPGA clone. I really ought to be working on the Collectorvision Atari 2600 code, but could not help but spend some time with the TI-99/4A first. Here, I wanted to follow a bit my original passion which was to have a fast TI-99/4A. This time I also wanted to put some computer architecture theory into practice: I added an on-chip cache memory to my TI-99/4A clone, while also optimising the VHDL code a bit. The result is that instruction execution speed jumped from 23 times the original to 39 times the original speed. The TMS9900 core is now a little simpler than it used to be, but still far from an elegant design, although getting a little better.

 

I have two plans on this project to follow up:

First, add more speed by going from the current non-pipelined design to a slightly pipelined design, in the sense that there would be a two-stage pipeline where both stages take multiple clock cycles. The first stage would be the instruction fetch and decode stage, while the second would be the instruction execution stage. I could not easily go in this direction in the past since there was (and still is) only one memory bus. But now that I have a working cache, I have much more memory bandwidth to play with. The cache is currently outside the CPU core, so it is serving both instruction and data fetches. It is a super simple design: direct mapped with a write-through update policy, 1 kilobyte of data capacity and about half a kilobyte of tag memory. The whole thing is implemented as a simple 1k x 36 bit memory block (not all of the bits are used in each 36-bit word). Having the cache outside the CPU core is not ideal, so I am probably going to add another cache for instructions only and pull that inside the core, into the fetch/decode stage, so that it can operate in parallel with the execution stage. This should increase performance quite significantly.
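For readers unfamiliar with the terms, here is a minimal software model of a direct-mapped, write-through cache. The geometry (512 lines of one 16-bit word each, giving 1 kilobyte of data) and the choice not to allocate a line on a write miss are assumptions for the sketch; the post does not spell out how the 1k x 36 bit block is organised.

```python
class DirectMappedCache:
    def __init__(self, memory, lines=512):
        self.memory = memory            # backing store: list of 16-bit words
        self.lines = lines
        self.valid = [False] * lines
        self.tags = [0] * lines
        self.data = [0] * lines
        self.hits = self.misses = 0

    def _split(self, word_addr):
        # Low address bits select the line, the remaining bits form the tag.
        return word_addr % self.lines, word_addr // self.lines

    def read(self, word_addr):
        index, tag = self._split(word_addr)
        if self.valid[index] and self.tags[index] == tag:
            self.hits += 1
            return self.data[index]
        self.misses += 1
        value = self.memory[word_addr]  # slow path: fetch and fill the line
        self.valid[index], self.tags[index], self.data[index] = True, tag, value
        return value

    def write(self, word_addr, value):
        index, tag = self._split(word_addr)
        self.memory[word_addr] = value  # write-through: memory always updated
        if self.valid[index] and self.tags[index] == tag:
            self.data[index] = value    # keep an existing cached copy coherent
```

Write-through keeps the backing memory authoritative at all times, so nothing needs to be written back when a line is replaced by a conflicting address.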

 

The second intention I have is to port the TI-99/4A core to a few more FPGA boards, in order to make this design more accessible for others. The cache is also an enabler of sorts in this sense: I can now easily support slow buses (such as SPI-connected flash memories) for cartridge ROMs, I could support DRAM fetches in burst mode so that FPGA boards with only DRAM can be used effectively, and I can also support quite small FPGAs, since I could now modify the design so that it no longer needs a lot of on-chip memory while still running at a reasonable speed. Specifically I have the low cost blackice-ii board in mind as one target for the TI implementation; this FPGA only has 16K RAM on board.

Edited by speccery

  • 1 month later...

Today I've been hacking away at the TI-99/4A FPGA after a while. I've been working on the collectorvision phoenix - it has been fun but is a little slow going, as the atari core I am working on is not mine. It makes quite a big difference to work on a design you know inside out, as opposed to porting someone else's code.

 

I did some refactoring of the TI-99/4A VHDL code, separating out the external memory interface from the top-level VHDL block, so that I can more easily adapt the design to other FPGA boards. As part of this process I wanted to enable direct execution of TMS9900 machine code from the FPGA's configuration flash ROM. This is a serial ROM chip, so reading it will be relatively slow, but that should be fine as the system runs too fast for legacy software anyway unless it is slowed down. Having this capability would enable the TI-99/4A core to run on many barebones FPGA boards, even without any external memory, as long as the FPGA has approximately the same capabilities as the XC6SLX9 I am using.

 

When testing the hardware, I wanted to use extended Basic, but realized I have a bug in running extended Basic: I cannot enter negative numbers. Setting A=-1 for example always ignores the minus sign, and A becomes positive. I had similar problems earlier with the regular Basic, and tracked that down to the FPGA CPU's condition codes not working properly in certain cases. I thought I still might have that problem and ran my tests again. One overflow flag bug had crept in, and I also noticed that my ABS instruction implementation was sometimes setting the carry flag while a real CPU does not: at least the TMS99105 never sets carry when running ABS, and in the source code of classic99 the carry is always cleared when running ABS. The data sheet is ambiguous here. It says ABS sets carry if there is a carry out from the ALU, but in practice it appears to be always zero. Anyway, now my test machine code program has exactly the same behavior as a real TMS99105 chip when running through test cases of the following instructions:

A, S, SOC, SZC, DIV, MPY, C, NEG, SRL, ANDI, CB, SB, AB, XOR, INC, DEC, SLA, SRA, SRC, MOV, MOVB, SOCB, SZCB, ABS and X. For each of those my test software executes the operation with 16 different combinations of input parameter values, and the results and the top 6 bits of the status register are identical on both CPUs. This of course is not a comprehensive test of all instructions, but the coverage is pretty good - pretty much all games and other software work. Nevertheless there is still a bug somewhere, hopefully in the CPU and not in timing. But the behavior is so consistent that I believe it is a CPU bug.
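The comparison step of such a test can be sketched as below. The data layout (a mapping from a test case name to a result and status pair) and the function name are assumptions for illustration; only the idea of masking the status word down to its top six bits comes from the description above.

```python
STATUS_MASK = 0xFC00  # top six bits of the 16-bit status register

def mismatches(reference, candidate):
    """Both arguments map a test case name to a (result, status) pair.

    Returns the names of the cases where the candidate CPU differs from
    the reference CPU in the result or in the masked status bits.
    """
    bad = []
    for name, (ref_result, ref_status) in reference.items():
        result, status = candidate[name]
        if result != ref_result or (status ^ ref_status) & STATUS_MASK:
            bad.append(name)
    return bad
```

Masking first means that differences in the lower status bits, such as the interrupt mask, do not show up as false failures.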

 

So if anyone happens to know how extended Basic handles the minus sign, that would be greatly appreciated :)

Edited by speccery

Well, GPL does not have anything much on the MINUS symbol other than handing it off to the XB ROMs.

* ARITHMETIC FUNCTIONS
PLUS   BL   @PSHPRS           Push left arg and PARSE right
       BYTE MINUSZ,0          Stop on a minus!!!!!!!!!!!!!!!
       LI   R2,SADD           Address of add routine
LEDEX  CLR  @FAC10            Clear error code
       BL   @ARGTST           Make sure both numerics
       JEQ  ARGT05            If strings, error
       BL   @SAVREG           Save registers
       BL   *R2               Do the operation
       BL   @SETREG           Restore registers
       MOVB @FAC10,R2         Test for overflow
       JNE  LEDERR            If overflow ->error
LEDEND B    @CONT             Continue the PARSE
LEDERR B    @WARNZZ           Overflow - issue warning
MINUS  BL   @PSHPRS           Push left arg and PARSE right
       BYTE MINUSZ,0          Parse to a minus
       LI   R2,SSUB           Address of subtract routine
       JMP  LEDEX             Common code for the operation
TIMES  BL   @PSHPRS           Push left arg and PARSE right
       BYTE DIVIZ,0           Parse to a divide!!!!!!!!!!!!!
       LI   R2,SMULT          Address of multiply routine
       JMP  LEDEX             Common code for the operation
DIVIDE BL   @PSHPRS           Push left arg and PARSE right
       BYTE DIVIZ,0           Parse to a divide
       LI   R2,SDIV           Address of divide routine
       JMP  LEDEX             Common code for the operation
*************************************************************
* NUD for unary minus                                       *
*************************************************************
NMINUS BL   @PARSE            Parse the expression
       BYTE MINUSZ,0          Up to another minus
       NEG  @FAC              Make it negative
NMIN10 CB   @FAC2,@CBH63      Must have a numeric
       JH   ERRSN1            If not, error
       JMP  BCON1             Continue
*************************************************************
* NUD for unary plus                                        *
*************************************************************
NPLUS  BL   @PARSE            Parse the expression
       BYTE PLUSZ,0
       JMP  NMIN10            Use common code

RXB GRAM & ROM.zip

Edited by RXB


I am not entirely sure how useful this information may be, but all numbers in both TI Basic and TI Extended Basic are handled as floating point (FP), i.e., 8-byte radix-100 numbers. Negative FP numbers have only the first word (16 bits, 2 bytes) of the FP number negated. Any FP mathematics keeps track of the negative sign(s), but only performs FP mathematical operations with positive numbers. An FP number is considered negative if the leftmost bit of the first word is set.
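A small sketch of that sign convention follows. The helper names are made up, and the example value assumes the commonly documented layout in which the first byte is an excess-64 (>40) exponent followed by radix-100 digits, so that 1 is stored as >40 >01 followed by zeros; that layout detail is not stated in this thread.

```python
def negate_fp(fp):
    """Negate an 8-byte radix-100 number by two's-complementing only its
    first 16-bit word, leaving the remaining six bytes untouched."""
    assert len(fp) == 8
    word = (fp[0] << 8) | fp[1]
    word = (-word) & 0xFFFF
    return bytes([word >> 8, word & 0xFF]) + bytes(fp[2:])

def is_negative(fp):
    """A number is negative if the leftmost bit of the first word is set."""
    return bool(fp[0] & 0x80)

# 1.0 under the assumed excess-64 exponent layout
ONE = bytes([0x40, 0x01, 0, 0, 0, 0, 0, 0])
```

This is the operation that the NEG @FAC instruction in the unary minus routine quoted earlier performs on the first word of the floating point accumulator, so a CPU fault in that instruction or its flags would show up exactly as a lost minus sign.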

 

I think that any TIB/XB operation that is expecting to use 16-bit integers will internally convert the FP numbers that are passed to it to 16-bit integers via CFI (Convert Floating point to Integer), I do not know whether any conversion back to FP is done, but, if it is, it would be done via CIF (Convert Integer to Floating point). Both CFI and CIF are ALC (Assembly Language Code).

 

...lee


 


Most are FP but a few are not.

What I posted is the SOURCE CODE for XB (RXB) GROM and XB ROMs.


Just read the disassembly, it would be a bug in the indirect NEG @FAC instruction.

 

But that would be if the interpretation is wrong. But what if the PRINT is wrong and "eats" the minus?

 

A way to test this would be to negate a variable in BASIC in two ways (first A = 5 followed by A = -A, then A = 2 followed by A = A - 5) and PRINT the result each time.


 


Thanks a lot and special points for very quick reply :)

The excerpt you provided was interesting, and I did add a whole bunch more test cases, but unfortunately I did not find the problem - yet. The source code will also be very useful, I'm sure, once I get a bit deeper into the rabbit hole.



Yes - reading the disassembly is one way to go and I may have to resort to that if nothing else helps. NEG instruction does seem to work, and is included in my test cases.

I tried many different varieties of providing negative numbers, ranging from the likes you provided to trigonometric expressions (such as cos(3.141592) - but that yields also bogus results). With TI Basic these operations work, but with extended Basic I get bogus results. The extended basic has a whole lot more code in it, so it is not very surprising that it reveals problems. My test cases for machine code instructions are not comprehensive but I did extend coverage quite a bit today, including use of various addressing modes - although not for all instructions. I can use an earlier version of this same design with the TMS99105 CPU but using my FPGA code for the rest of the TI. That works, so the problem must be in the CPU. As an extreme measure, if I cannot come up with anything easier, I could record memory bus traces when using the TMS99105 and compare those with the FPGA CPU. Or I could add my CPU core to the TMS99105 design and run it with the same data that the TMS99105 is using, but that also requires a lot of work so I am trying to come up with something easier. Probably just many more test cases.
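If bus traces from both CPUs were recorded, finding the first point where they differ is straightforward. The sketch below assumes each trace is a list of hypothetical (address, data, is_write) tuples; the capture format is an illustration, not something the design currently produces.

```python
from itertools import zip_longest

def first_divergence(reference, candidate):
    """Return (cycle_index, reference_entry, candidate_entry) for the first
    bus cycle that differs, or None if the two traces are identical.
    A missing entry (one trace is shorter) is reported as None."""
    pairs = zip_longest(reference, candidate)
    for i, (ref, cand) in enumerate(pairs):
        if ref != cand:
            return i, ref, cand
    return None
```

The first differing cycle usually points close to the faulty instruction, since everything before it executed identically on both CPUs.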

 

A lot of software works correctly, such as the Megademo, and it has quite a bit of code in it, so I have a reasonable amount of confidence in the CPU core, but clearly something is not working. Perhaps rather than reading the disassembly I could copy bits of code from it and compare how they run on the TMS99105 and on my CPU core.

Edited by speccery

The data sheet is ambiguous here, it says ABS sets carry if there is a carry out from the ALU, but it appears in practice it is always zero.

 

My guess is that ABS is implemented as

 

if (MSB==1) then NEG(operand) else NOP;

 

NEG should deliver a carry for NEG(0) because NEG(0) = INV(0)+1, and the +1 sets the carry. But the MSB of zero is 0.
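That guess can be checked exhaustively with a short 16-bit model. It only models the arithmetic described above (NEG as invert plus one, ABS as a conditional NEG) and makes no claim about how the real silicon is wired; the function names are made up.

```python
def neg16(value):
    """Return (result, carry) for NEG computed as INV(value) + 1 on 16 bits."""
    total = (~value & 0xFFFF) + 1
    return total & 0xFFFF, total > 0xFFFF

def abs16(value):
    """ABS modelled as: negate only when the most significant bit is set."""
    if value & 0x8000:
        return neg16(value)
    return value, False

# NEG(0) is the only input whose +1 carries out of bit 15 ...
assert [v for v in range(0x10000) if neg16(v)[1]] == [0]
# ... and ABS never negates 0, so it never produces a carry.
assert not any(abs16(v)[1] for v in range(0x10000))
```

Under this model the carry out of the ALU during ABS is always zero, which would reconcile the data sheet wording with the observed behaviour.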

 

Edit: Also see https://github.com/mamedev/mame/blob/master/src/devices/cpu/tms9900/tms9900.cpp, lines 2238

Edited by mizapf

Thanks for all the comments so far. I'll also post here a quick question on a different topic: when debugging hardware and the CPU, it would be convenient to have something akin to the minimemory and Line-by-Line-Assembler in ROM. (As an aside, I wish I had purchased mini memory as a kid instead of extended basic.) I have already been using Easybug and the minimemory ROM & GROMs, but my FPGA config does not yet support the 4K RAM of minimemory, although that is trivial to add. If I add the RAM to the cartridge address space I can easily enough load LBLA, but I am wondering if there already is a cartridge ROM with this capability that can be used with the 32K memory extension? Of course I could use E/A, which my system supports already, but I kind of like tweaking things LBLA style, and in most cases when debugging and testing I am only interested in running very short, quick and dirty bits of code.



Is my LBLA / TIBUG / Disassembler cartridge any use?

http://www.stuartconner.me.uk/ti/ti.htm#minimem_lbla_tibug_disassembler_cartridge


 


I should have known better, thanks Stuart! Once again you've already done what I was looking for, this seems perfect! :)

I am running out of time today on this project, need to continue tomorrow, first with your cartridge.

