
TMS9900 CPU core creation attempt


speccery


As many of you have probably seen in the Facebook updates, I got the TMS9900 core working and have been able to run a bunch of games on it.

I need to fix the keyboard repeat rate issue before testing it much more, since typing at the moment is nearly impossible - with the CPU running 15x the speed of the normal TI, the repeat triggers way too fast.

I also want to add a register which would show me how many cycles the instructions take, as I want to optimise the CPU core further - need to go faster :)

 

Update at hack-a-day: https://hackaday.io/project/20826-tms9900-compatible-cpu-core-in-vhdl/log/67326-success-fpga-based-ti-994a-working


The adventure continues, with adjustable CPU speed. The range just ain't enough yet...

 

https://hackaday.io/project/20826-tms9900-compatible-cpu-core-in-vhdl/log/67525-speed-control-needed-and-added

 

 

Oh wow, my Zombi game might actually be playable with this board! Just gotta flip a switch when there's 4 or more zombies on the screen... LOL

 

Seriously, this is awesome. If you could get one DIP switch to bring it down to real TI speed when needed... *THUD*

 

A true wizard!


 

 


 

 

I can't remember if your Zombi game was already downloadable from somewhere, that actually would be a really cool test.

Although I have tested Extended BASIC and it works, I have not yet tested the memory expansion (i.e. access to 32K of RAM). I assume it works and that my BASIC test program already used it, but I haven't verified that. I also haven't tested loading programs - it will be interesting to see if the disk support I built for the TMS99105 version works directly.

 

I need to expand my wait state counter further so that it can slow the processor down to real TI speeds; probably about 150 wait states per memory access would bring it close. It is actually relatively straightforward to model the behaviour of a real TI when it comes to memory access speed. It is an entirely different exercise to build a cycle-exact CPU - that was never my goal. My goal was to build a very fast TMS9900 clone. To that end I have a few very simple ideas I want to try out, to increase speed from the measly 23x to something faster :-D


It's unfinished until I work my way back to it, but level 10 I think is a good test, as the entire screen is filled with zombies. That slows the game to the point of being unplayable, which is where my reworking comes into play.

http://atariage.com/forums/topic/255837-new-32k-xb-gamezombi-work-in-progress/page-8?hl=%2Bzombi&do=findComment&comment=3693193


  • 2 weeks later...
  • 4 weeks later...

A brief update - I continue to work on the FPGA based TI-99/4A, although time has been limited. To be more specific, I followed the advice Tursi gave (thank you!) in a comment on my YouTube video showing the megademo running on my FPGA TI. In that video several of the demo phases showed erroneous graphics, due to the lack of character cell address masking in graphics mode 2. This feature is not documented in the TMS9918 data sheet I used as a basis for my implementation.

 

Basically in graphics mode 2 ordinarily 768 distinct characters can be shown, but you can use the low bits of registers 3 and 4 to limit the size of the character set, for example to 256 characters if you don't need full pixel addressability over the entire screen. I guess the demo code uses these features to reduce VDP memory use and to minimise memory transfer bandwidth between CPU and VDP.
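
For readers who want to see what the masking amounts to, here is a minimal Python sketch of the pattern table address calculation. It assumes the commonly described TMS9918A behaviour: bit 2 of register 4 selects the table base (>0000 or >2000) and the two low bits act as an AND mask on the two high bits of the 10-bit character number. The function name and its arguments are made up for illustration; this is not the VHDL from the core, and the colour table (register 3) works along similar lines but is left out.

```python
def pattern_address(reg4, name, screen_row, pixel_row):
    """Sketch of a graphics mode 2 pattern fetch address (assumed bit layout).

    reg4       -- value of VDP register 4
    name       -- character code read from the name table (0..255)
    screen_row -- character row on screen (0..23)
    pixel_row  -- pixel row inside the character cell (0..7)
    """
    third = screen_row // 8                 # which third of the screen (0..2)
    char_index = (third << 8) | name        # 10-bit character number, 0..767
    mask = ((reg4 & 0x03) << 8) | 0xFF      # low bits of R4 mask the "third" bits
    base = (reg4 & 0x04) << 11              # table at 0x0000 or 0x2000
    return base | ((char_index & mask) << 3) | (pixel_row & 7)


# Full 768-character set: the bottom third uses its own patterns.
print(hex(pattern_address(0x07, 0x41, 16, 3)))   # 0x320b
# Mask bits cleared: the bottom third falls back to the first 256 patterns.
print(hex(pattern_address(0x04, 0x41, 16, 3)))   # 0x220b
```

With the two low bits set, all 768 characters are distinct; with them cleared, every third of the screen shares the first 256 patterns, which is the 256-character case mentioned above.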

 

I added support for the masking feature - and it immediately fixed many of the demo phases (not sure if that's how the demo parts are called).

 

On an unrelated note, I've also finally gotten my hands dirty with coding for the NXP LPC4330/LPC4337 microcontrollers. My intention is to use one of these to build my own "nanopeb" style peripheral. The first thing is to wire one of these to the extension bus of the TI, but before that can be done the I/O pins need to be configured properly, and with these modern microcontrollers the I/O configuration is a non-trivial task. Pretty much all of the I/O pins are heavily multifunction: they are multiplexed between up to 8 peripherals per pin, and even for simple general purpose I/O use a whole bunch of registers around the chip have to be set up properly. The user manual being 1442 pages long does not help; this is a very complex beast. I did finally get some 24 I/O pins set up and working, and can produce very fast pulses on the I/O pins under software control. This is the first baby step in this project.

Luckily there is existing code for many higher level functions. I have already tested the development boards as USB peripherals, and connected a USB keyboard with the microcontroller acting as the USB host. This would make one wacky peripheral once it's done.



 

Thanks for reminding me to subscribe to your youtube channel, I love these updates.


Mother of God! NOW *THAT'S* WHAT I'M TALKING ABOUT!

 

 

Thanks :) Since that update I have done minor tweaks, and the highest performance mode so far is 30x original speed. This is just the beginning: my TMS9900 implementation is still very naive. There is a lot of room for speed improvements. For example I want to test an architecture change where there would be two parallel address generators, allowing the core to simultaneously calculate source and destination operands. This can be facilitated with an in-core scratchpad RAM (already done), with multiple read ports. A further improvement from that point would be to substitute internal scratchpad RAM with a proper multiport cache memory.

 

Having written the above, it seems many people would like to have a version of the core which would operate at the original speed. To that end I have added a wait state generator that can support up to 256 wait states per memory access, but that is not enough to slow it down to original speed: it currently runs at about 1.8x original speed at the slowest setting. In addition, the timing behaviour is different from the original TMS9900. For example, division takes about 40 cycles (or 400ns) depending on addressing modes and multiplication takes about 10 cycles, both of which are vastly faster than on the original. I haven't measured these carefully yet.



 

 

Knowing next to nothing about your wizardry, I can't help but think about simple things like a "fluff and stuff" load on the CPU between wait states to slow it down even further. Like a very large FOR-NEXT loop or something of that nature, stored in and run from a different area of RAM. A sort of load to slow down the CPU further that is invisible to the user? Or am I just too ignorant to understand how these things actually work in real life?


... for example division takes about 40 cycles (or 400ns) depending on addressing modes and multiplication takes about 10 cycles, both of which are vastly faster than the original speed. I haven't measured these carefully yet.

 

Is this because you are using an FPGA library ALU that is more efficient ?


 

 

Knowing next to nothing about your wizardry, I can't help think about simple things like a "fluff and stuff" load on the CPU between wait states to slow it down even further? Like a very large for next loop or something of that nature stored and ran from a different area of ram or something. A sort of load to slow down the CPU further that is invisible to the user? or am I just too ignorant to understand how these things actually work in real life?

 

 

Actually the wait state generator works like that: it is the hardware equivalent of a dummy FOR-NEXT loop just consuming time. I could make it bigger, from 8 bits to 9 bits (512 steps), to be able to slow down the CPU to speeds equivalent to or below that of an original TMS9900. As an alternative I could do this in software by adding a hardware timer which would trigger a repetitive interrupt, and then have a delay loop in the interrupt routine. Adding the load is not the problem; the problem is making the load behave equivalently to a TMS9900. If I remember my test properly I already made the memory cycles longer than in a real TI, but the system still runs faster since the internal operations of my CPU core run so much faster.
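
As a rough back-of-the-envelope check of the 8-bit versus 9-bit counter, here is a small Python sketch. It assumes that at the slowest settings the execution time is dominated by the wait state counter, so the speed scales inversely with the number of counter steps per memory access. That is a simplification, since the internal operations of the core also take time, and the function name is made up for illustration.

```python
def scaled_speed(measured_speed, measured_steps, new_steps):
    """Estimate relative speed after changing the wait state counter range.

    Assumes time per memory access is proportional to the number of
    counter steps, and that memory accesses dominate execution time.
    """
    return measured_speed * measured_steps / new_steps


# 1.8x original speed was reported with the 8-bit counter (256 steps).
# Estimate for a 9-bit counter (512 steps) at its slowest setting:
print(scaled_speed(1.8, 256, 512))   # 0.9
```

Under that assumption the 9-bit counter would land at roughly 0.9x, i.e. slightly below original speed, which matches the "equivalent or below" estimate. It says nothing about per-instruction timing, which would still differ from a real TMS9900.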

 

The other problem is more philosophical: I am much more interested in making the core run fast than slow i.e. at the original speed. That's part of the reason I embarked on this FPGA project in the first place. Perhaps at a later stage I will make it run at the original speed. It would be quite simple but somewhat tedious, by first slowing down the clock speed and then making the number of states in my CPU core equivalent to that of the TMS9900. Internally my core currently follows the same patterns as a real TMS9900, so the architectural similarity is there.


 

Is this because you are using an FPGA library ALU that is more efficient ?

 

For the multiplier yes, I am using one of the DSP units in the FPGA. These contain hardwired multipliers which are very fast.

 

For the divider I created my own simple design, which uses a traditional shift-compare-subtract pattern to calculate the quotient and remainder. It uses 2 clock cycles per bit, thus performing the division in 32 clock cycles. Additional clock cycles are required for instruction decode and operand access. A real TMS9900 seems to require a minimum of 92 clock cycles, up to 124 clocks (it does an early out). So a TMS9900 probably uses something like 6 or 7 clock cycles per bit for the actual division.
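
The shift-compare-subtract pattern described above can be sketched in Python as follows. This is an illustration of the general technique (restoring division of a 32-bit dividend by a 16-bit divisor, which is what the TMS9900 DIV instruction computes), not a transcription of the VHDL. The function name and the choice to return None on overflow are assumptions made for the example.

```python
def div32_by_16(dividend, divisor):
    """Restoring division: 32-bit dividend / 16-bit divisor.

    Returns (quotient, remainder), both 16-bit, or None when the
    quotient would not fit in 16 bits (including divide by zero).
    """
    high = (dividend >> 16) & 0xFFFF
    low = dividend & 0xFFFF
    if divisor <= high:          # overflow check, also catches divisor == 0
        return None

    remainder = high
    quotient = 0
    for _ in range(16):          # one iteration per quotient bit
        # Shift: bring the next dividend bit into the remainder.
        remainder = (remainder << 1) | (low >> 15)
        low = (low << 1) & 0xFFFF
        quotient <<= 1
        # Compare and subtract: set the quotient bit if the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(div32_by_16(100000, 7))    # (14285, 5)
print(div32_by_16(0x10000, 1))   # None (quotient does not fit in 16 bits)
```

In hardware each loop iteration would map to the shift step and the compare/subtract step, which is where 2 clocks per bit and 16 x 2 = 32 clocks come from.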


 

The other problem is more philosophical: I am much more interested in making the core run fast than slow i.e. at the original speed. That's part of the reason I embarked on this FPGA project in the first place. Perhaps at a later stage I will make it run at the original speed. It would be quite simple but somewhat tedious, by first slowing down the clock speed and then making the number of states in my CPU core equivalent to that of the TMS9900. Internally my core currently follows the same patterns as a real TMS9900, so the architectural similarity is there.

I think the issue is not so much what speed you want to run at. I like faster, not slower. However, when running native apps, it would be good to have a way to keep them from running too fast, like the PARSEC you demo'ed. Perhaps a different interrupt when cartridges are running? I love the idea of a turbocharged TI - but as you've shown, it could become unusable with legacy programs and cartridges.



 

 

Your thoughts are very much aligned with my plan. I already made the wait state configuration accessible via the DIP switches for this exact reason - so I have a turbo button (or buttons in fact since the system can be configured for many speeds).

 

I've been thinking of implementing a fairly accurate system to approach cycle accuracy by just adding a feature into the CPU core so that it would just calculate how much time a certain instruction would take on a real 3MHz TMS9900, and then wait at the end of each instruction the amount of time needed to get there. This could be partially done with a small ROM including the cycle counts, and then adding in the operand fetch times. This would be similar to how emulators work for timing.

That way I could still pursue my personal goal of creating a fast TMS9900 with all kinds of design features, but have a way to slow it down to the semicorrect speed regardless of differences in architecture.
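
The idea of padding each instruction out to its real-hardware duration can be sketched like this in Python. Several things here are assumptions for illustration only: the 100 MHz core clock is inferred from the "40 cycles (or 400ns)" figure earlier in the thread, the cycle counts in the table are placeholders rather than data sheet values, and the names are made up.

```python
FAST_CLOCK_HZ = 100_000_000   # assumed FPGA core clock (10 ns per cycle)
REAL_CLOCK_HZ = 3_000_000     # real TMS9900 clock in the TI-99/4A

# Placeholder base cycle counts per instruction, standing in for the
# small ROM mentioned above. These are NOT verified data sheet values.
BASE_CYCLES = {"MOV": 14, "A": 14, "MPY": 52, "DIV": 124}


def padding_cycles(mnemonic, operand_fetch_cycles, fast_cycles_used):
    """Fast-clock cycles to idle at the end of an instruction.

    mnemonic             -- instruction looked up in the cycle table
    operand_fetch_cycles -- extra real-CPU cycles for operand access
    fast_cycles_used     -- fast-clock cycles the core actually spent
    """
    real_cycles = BASE_CYCLES[mnemonic] + operand_fetch_cycles
    # Duration on the real CPU, expressed in fast-clock cycles.
    target = real_cycles * FAST_CLOCK_HZ // REAL_CLOCK_HZ
    return max(0, target - fast_cycles_used)


# A multiply that took 10 fast cycles, with 8 real cycles of operand fetch:
print(padding_cycles("MPY", 8, 10))   # 1990
```

The core would run each instruction at full speed and then idle for the returned number of cycles, so the total per-instruction time approximates the real CPU regardless of how the instruction is implemented internally.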


If the system could boot up at normal speed and have a software controllable speed setting, then programs could speed up and slow down as needed.
This was the approach used with the CoCo's high speed modes.
You used a POKE to enter high speed mode, and another one to slow it down before tape or disk I/O.
In this case you'd also change speed before using INPUT statements.

That way anything that might be speed dependent would boot normally, and anything designed to run faster could control the speed as needed.



 

 

Thanks - that is a good idea. I've been thinking of creating a PC BIOS-like screen to adjust the various settings that the system supports; that would also require programmatic control of the speed and other things.
