Some questions on the 9900

cbmeeks · November 8, 2019

I've started reading what I think is a pretty good book on assembly language programming for the TI-99/4A. It's called Fundamentals of TI-99/4A Assembly Language by M.S. Morley.

The 9900 is such a different animal compared to the 6502 world I'm used to.

Anyway, a couple questions have popped up.

It was stated that the Workspace Pointer (WP) can be changed to relocate the 16 accumulators anywhere in memory and that "memory-to-memory architecture" is possible. So I'm trying to think of the advantages of this. For one, correct me if I'm wrong, but this allows you to perform single cycle operations on 16 memory locations (such as adding) without the need to LOAD, ADD and then SAVE. Furthermore, with the ability to move the WP around, it's like having many dozens of accumulators (memory permitting).

What are the other advantages of this and what are some of the disadvantages?

Second is the CRU (Communications Register Unit). From what I understand, this takes 12 address lines (4096 possibilities) along with CRUOUT, CRUIN and CRUCLK and uses them for other devices such as the keyboard and joystick. But I'm not really getting what it's used for. Is it allowing memory-addressed access to these devices? Does that mean it could control up to 4096 devices? What am I missing here?

I'm sure I will have many more in the future.

Thanks for any help!

Edited November 8, 2019 by cbmeeks

+adamantyr · November 8, 2019

Hi and welcome!

The TMS9900 architecture is substantially different, not just from the 6502 but from the 8086 as well. It has some strong advantages but disadvantages too.

One thing to note is that there is no accumulator. You can do many operations register to register, register to memory, memory to register, and even memory to memory! So doing a MOV @THIS,@THAT is totally fine, where both of those are just addresses in memory. The reason is that the registers themselves are just in memory, which also allows you to switch quickly between different register sets. The advantage is flexibility; the disadvantage is general latency overall, as every register access takes extra clock cycles. There is a 256-byte RAM area in the base console (called the scratch pad) which is true 16-bit access memory. Most of us put our registers in that space as often as we can, as it gains us some speed by cutting out a 4-cycle delay on every memory access.

Some op-codes require registers as either the source or destination. All the immediate instructions (Add Immediate, Compare Immediate) only work with registers. This isn't a limit of the architecture so much as just a limit of opcode decoding.

An important consideration, though, is that every op-code on the TI is a 2-byte word, and each memory address operand adds another 2-byte word. So the MOV operation I noted above is actually 6 bytes. While we can get more done in far fewer instructions than a 6502, they do consume more memory. On the other hand, we do have unsigned Multiply and Divide op-codes, which save a lot of memory. (Although you'll hear a lot of guys grumbling about how long divide takes.)
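To make the size arithmetic concrete, here is a minimal Python sketch (a model, not 9900 code). It assumes the usual rule that a symbolic or indexed operand adds one extra word after the opcode word, while register, indirect and auto-increment operands do not. The function name `instruction_bytes` is made up for this example.

```python
# Addressing modes that need an extra 16-bit word after the opcode word.
EXTRA_WORD_MODES = {"symbolic", "indexed"}

def instruction_bytes(src_mode, dst_mode):
    """Size in bytes of a two-operand instruction such as MOV (sketch)."""
    words = 1  # the opcode word itself
    for mode in (src_mode, dst_mode):
        if mode in EXTRA_WORD_MODES:
            words += 1
    return words * 2

reg_to_reg = instruction_bytes("register", "register")   # MOV R1,R2
mem_to_mem = instruction_bytes("symbolic", "symbolic")   # MOV @THIS,@THAT
print(reg_to_reg, mem_to_mem)
```

So the register-to-register form stays at one word, and each memory address you name costs two more bytes.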

Also, because of the way the address lines on the TI work, there are byte and word equivalent instructions for most everything (Compare versus Compare Byte, Add versus Add Byte, and so forth). The byte versions only work on the first/high byte; you have to use shifts or a swap byte opcode to access the other byte.

I don't have immediate answers about your CRU questions, but I will point you to a grand resource: The TI-99/4A Tech Pages. You'll find answers for a lot of the hardware and architecture design there.

+TheBF · November 8, 2019

2 hours ago, cbmeeks said:

Second is the CRU (Communications Register Unit). From what I understand, this takes 12 address lines (4096 possibilities) along with CRUOUT, CRUIN and CRUCLK and uses them for other devices such as the keyboard and joystick. But I'm not really getting what it's used for. Is it allowing memory-addressed access to these devices? Does that mean it could control up to 4096 devices? What am I missing here?

I'm sure I will have many more in the future.

Thanks for any help!

That signature is pretty funny.

+mizapf · November 8, 2019

As @adamantyr already said: The 9900 is a pure memory/memory architecture (by hardware), although the instructions actually use register numbers. With the help of the workspace pointer, these register numbers are mapped to memory locations. That is, everything that it processes comes from memory, and all results go to memory. There are no internal working registers, only the program counter, the status register and the workspace pointer. The extreme opposite would be the MIPS architecture (load/store), where every operation works on the 32 internal registers, and you have explicit load/store operations.
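The mapping from register number to memory location is simple enough to sketch in Python. This is only a model: the helper name `register_address` is made up, and the workspace address is just an example value.

```python
def register_address(wp, n):
    """Memory address of workspace register Rn for a given workspace pointer."""
    if not 0 <= n <= 15:
        raise ValueError("the 9900 has registers R0..R15")
    return (wp + 2 * n) & 0xFFFF   # each register is one 16-bit word

# With the workspace in the scratch pad at >8300, R5 is ten bytes further on.
r5_address = register_address(0x8300, 5)
print(hex(r5_address))
```

Changing the workspace pointer changes every one of the sixteen addresses at once, which is all a "register set switch" amounts to.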

The advantage of the workspace pointer concept is that you can have "private" registers. You can write a subroutine with a reserved area of 32 bytes, which will serve as the "local memory" of that subroutine. Then you can glue all those routines together into a large program, and you do not have to care about memory locations, as long as every component brings its own workspace. This way you can even write recursive subroutines.

The disadvantage is, of course, the "expensive" memory access (von Neumann bottleneck). To make things worse, the memory access of the TI console is stuffed with wait states, and the data bus is multiplexed (folded from 16 to 8 bits).

As for CRU, this is the port-based access to devices. Other systems call it port addressing, where TI talks about CRU addressing. The CRU addresses are a separate address space, but they re-use the address bus. Data is transferred serially, i.e. you have a data width of 1 bit. There are commands that can access subsequent CRU addresses in one go (LDCR, STCR), or those that set individual bits (SBO, SBZ) or query them (TB).
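A toy Python model may help to picture that separate, bit-wide address space. It is a deliberate simplification: it treats the CRU as 4096 one-bit cells that read back whatever was written (real devices need not behave that way), and it passes the bit address directly instead of going through the CRU base in R12. The class and method names are made up after the instruction mnemonics.

```python
class CruModel:
    """Toy model of the CRU: 4096 independently addressable 1-bit cells."""

    def __init__(self):
        self.bits = [0] * 4096

    def sbo(self, addr):
        """Set Bit to One."""
        self.bits[addr] = 1

    def sbz(self, addr):
        """Set Bit to Zero."""
        self.bits[addr] = 0

    def tb(self, addr):
        """Test Bit: return the current value of one bit."""
        return self.bits[addr]

    def ldcr(self, base, value, count):
        """Send `count` bits of `value` to consecutive addresses, LSB first."""
        for i in range(count):
            self.bits[base + i] = (value >> i) & 1

    def stcr(self, base, count):
        """Read `count` consecutive bits back into an integer, LSB first."""
        value = 0
        for i in range(count):
            value |= self.bits[base + i] << i
        return value


cru = CruModel()
cru.sbo(5)                    # switch a single line on
cru.ldcr(16, 0b10110, 5)      # write a 5-bit field in one go
field = cru.stcr(16, 5)       # and read it back
```

So the 4096 figure counts individual bits, not devices; one device typically occupies a run of neighbouring bit addresses.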

Edited November 8, 2019 by mizapf
Forgot the disadvantage

apersson850 · November 8, 2019

1 hour ago, adamantyr said:

The byte versions only work on the first/high byte, you have to use shifts or a swap byte opcode to access the other byte.

This is not generally true.

Byte operations like Add byte can use the addresses 123 and 345, in an instruction like AB @123,@345, and actually add the two odd-addressed bytes together, and store the result at 345. So odd-byte addressing is possible.

However, when using registers (those that aren't, but pretend to be), in an instruction like AB R2,R4, you'll add the most significant byte of R2 to the most significant byte of R4, and store the result "to the left" in R4. If we assume that LWPI >1000, which places the virtual R0 at >1000 in memory, preceded the instruction

AB R2,R4

then this is completely equivalent to

AB @>1004,@>1008

The only difference is that the register only instruction requires one word, the other three. Since the first is shorter and has fewer memory accesses, it's also faster. This means that if you want to add the two least significant bytes of R2 and R4, you can use

AB @>1005,@>1009
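The equivalence can be checked with a small Python model of byte-addressable memory. This is a sketch only: `add_byte` models nothing but the 8-bit addition (no status flags), and the register contents are made-up values.

```python
memory = bytearray(0x10000)       # 64 KB, byte addressable
WP = 0x1000                       # as if LWPI >1000 had been executed

def reg_addr(n):
    """Address of the high (most significant) byte of register Rn."""
    return WP + 2 * n

def add_byte(src, dst):
    """Model of AB: add the byte at src to the byte at dst, 8-bit wraparound."""
    memory[dst] = (memory[src] + memory[dst]) & 0xFF

# R2 = >1234 and R4 = >0101; the high byte sits at the even address.
memory[0x1004:0x1006] = bytes([0x12, 0x34])
memory[0x1008:0x100A] = bytes([0x01, 0x01])

add_byte(reg_addr(2), reg_addr(4))            # AB R2,R4 == AB @>1004,@>1008
add_byte(reg_addr(2) + 1, reg_addr(4) + 1)    # AB @>1005,@>1009

print(hex(memory[0x1008]), hex(memory[0x1009]))
```

The first call touches only the even-addressed high bytes, the second only the odd-addressed low bytes.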

The register file in memory makes it easy to use a separate register set momentarily, by LWPI NewWP, do whatever, then LWPI OldWP and you continue with your old values in the registers.

The 9900 doesn't have a predefined stack register. You return from a subroutine called by BL @Sub by executing B *R11. The return address is stored in R11, so by branching indirectly through it, you return to the instruction after your Branch and Link. As a consequence, you can't call another subroutine from the first unless you take care to save the first-level return address yourself. Either you implement a stack, or you save it in some handy location.
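Why the nested call needs that extra care can be shown with a few lines of Python. This is only a model of the bookkeeping: the addresses are made up, and the list stands in for whatever software stack or save location you choose.

```python
regs = {"R11": None}
stack = []                      # software stack; the 9900 has no hardware one

def bl(return_address):
    """Model of BL: the return address lands in R11, replacing the old value."""
    regs["R11"] = return_address

bl(0xA004)                      # main calls SUB1; SUB1 must return to >A004

stack.append(regs["R11"])       # SUB1 saves R11 before calling further
bl(0xB008)                      # SUB1 calls SUB2; R11 is overwritten
returned_to_sub1 = regs["R11"]  # B *R11 at the end of SUB2

regs["R11"] = stack.pop()       # SUB1 restores its own return address
returned_to_main = regs["R11"]  # B *R11 at the end of SUB1
```

Without the save and restore, SUB1 would branch back to >B008 forever instead of returning to main.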

But you can do a more elaborate BLWP @SepWSSub, to call a subroutine with a separate workspace. The Branch and Load Workspace Pointer instruction branches via a vector. This vector consists of two words: the address of the new workspace, followed by the address of the code to run.

In the new workspace, the BLWP instruction will automatically store the old WP in R13, the return address (old PC) in R14 and the status register, as it was prior to making the call, in R15. Hence this will give you a subroutine that has its own workspace, but can access data in the caller's registers via R13, or data stored after the call instruction via R14. To access the caller's R5, and store this in the subroutine's R5, you'd execute

MOV @10(R13),R5

since the caller's R5 is ten bytes down in his workspace.

To fetch in-line data after the call to your R6, you'd use

MOV *R14+,R6

This will move a word located right after the BLWP instruction to R6, and also increment R14 by two, so that it points to the actual code following the call instruction.

When an interrupt is signalled, an implicit BLWP, via a vector in a pre-defined address in memory, occurs. This means that without explicitly saving anything on a stack, the interrupt starts running with a fresh set of registers, as well as with the previous status word handy, so you can return to the previously executing code and restore the context you had completely. A return is done with the RTWP instruction in both cases. RTWP uses the data in R13, R14 and R15 to restore the registers WP, PC and ST in the CPU.
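The BLWP/RTWP bookkeeping described above can be sketched as a toy Python model. Everything here is illustrative: the `Cpu` class, the addresses and the status value are made up, and only the WP/PC/ST handling is modelled.

```python
class Cpu:
    """Toy model of the three real 9900 registers plus word-wide memory."""

    def __init__(self, wp, pc, st):
        self.wp, self.pc, self.st = wp, pc, st
        self.mem = {}                      # even byte address -> 16-bit word

    def read(self, addr):
        return self.mem.get(addr, 0)

    def write(self, addr, value):
        self.mem[addr] = value & 0xFFFF

    def reg(self, n):
        return self.wp + 2 * n             # address of Rn in current workspace

    def blwp(self, vector):
        old_wp, old_pc, old_st = self.wp, self.pc, self.st
        self.wp = self.read(vector)        # first vector word: new workspace
        self.pc = self.read(vector + 2)    # second vector word: code address
        self.write(self.reg(13), old_wp)   # caller context goes into the
        self.write(self.reg(14), old_pc)   # NEW workspace's R13..R15
        self.write(self.reg(15), old_st)

    def rtwp(self):
        st, pc, wp = (self.read(self.reg(n)) for n in (15, 14, 13))
        self.wp, self.pc, self.st = wp, pc, st


cpu = Cpu(wp=0x8300, pc=0x6004, st=0x2000)   # caller state, values made up
cpu.write(cpu.reg(5), 0x1234)                # caller's R5
cpu.write(0x7000, 0x8320)                    # vector: workspace address
cpu.write(0x7002, 0x7100)                    # vector: entry point

cpu.blwp(0x7000)
caller_wp = cpu.read(cpu.reg(13))
cpu.write(cpu.reg(5), cpu.read(caller_wp + 10))   # MOV @10(R13),R5
sub_r5 = cpu.read(cpu.reg(5))
cpu.rtwp()
```

After the final call the model is back in the caller's workspace with PC and ST as they were, and nothing was ever pushed on a stack.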

As explained above, the CRU is Texas Instruments' version of I/O addressing. You install hardware, wired to look at the CRU command signal and decode the address given, so that you can set or read a bit from this hardware. The hardware can be anything: Latches, LSI chips with pre-defined functions, like UART, general timer/IO-chip, floppy disc controllers etc. Since it's bit-serial, you can easily define hardware ports of odd sizes. Two eight-bit latches can make three ports, five, seven and four bits wide. Many microprocessors would require three eight-bit latches to implement that efficiently, where three, one and four bits were unused.
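The odd-sized port example can be sketched in Python as follows. It assumes the two latches appear as 16 consecutive CRU bits and that fields are transferred least significant bit first; the port names and layout table are invented for the illustration.

```python
latch_bits = [0] * 16   # two 8-bit latches seen as 16 consecutive CRU bits

# (first bit, width) of each port; names are made up for this example.
PORTS = {"a": (0, 5), "b": (5, 7), "c": (12, 4)}

def write_port(name, value):
    """Write one port without touching its neighbours (like an LDCR)."""
    start, width = PORTS[name]
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the port")
    for i in range(width):
        latch_bits[start + i] = (value >> i) & 1

def read_port(name):
    """Read one port back as an integer (like an STCR)."""
    start, width = PORTS[name]
    return sum(latch_bits[start + i] << i for i in range(width))

write_port("a", 0b10101)
write_port("b", 0b1111111)
write_port("c", 0b1001)
```

All 16 latch bits are used, and the seven-bit port straddles the boundary between the two latches without any special handling.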

The 9900 concept is more about flexibility than utmost speed. Many instructions allow a general address, which can be either directly to a register, indirectly via a register, indirect with auto-increment, indexed or symbolic. There are no restrictions on which of these you use in that case.

+adamantyr · November 8, 2019

Yes, I meant in the context of registers. Unlike the 8086 design where you can access the high or low byte of a register independently.

Memory to memory everything is wide open. You can even cheat a bit and use the actual workspace pointer address to get to your register low-byte. If you are using >8300 for your workspace registers, then the low byte of R0 is >8301, for example. If you are in a context switch via BLWP, you can use @>0001(R13) to get to the low byte of the calling routine's R0.

You do get a LOT of bugs coming out of the word-versus-byte considerations, though. For example:

The CLR op-code only works on whole words. But if you use an index value with it, indices are always byte counts, so the first word is at index 0, the second word at index 2, and so forth.

So if you want to clear byte-only data, using CLR can overwrite something you didn't intend to if you have an odd number of bytes in an array.

You're better served just using a MOVB @ZERO,*R1+ approach to ensure your data is cleared correctly.
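The overrun is easy to reproduce in a small Python model. This is a sketch under simple assumptions: a 5-byte array sits at the start of an 8-byte buffer filled with >FF, `clr_word` models CLR on a word-aligned address, and `movb_zero` models MOVB @ZERO to a single byte. All names are made up.

```python
ARRAY_LEN = 5   # an odd number of bytes

def clr_word(mem, addr):
    """Model of CLR: clears a whole word; the low address bit is ignored."""
    addr &= ~1
    mem[addr] = 0
    mem[addr + 1] = 0

def movb_zero(mem, addr):
    """Model of MOVB @ZERO,<addr>: clears exactly one byte."""
    mem[addr] = 0

with_clr = bytearray([0xFF] * 8)
for offset in range(0, ARRAY_LEN, 2):      # indices 0, 2, 4
    clr_word(with_clr, offset)

with_movb = bytearray([0xFF] * 8)
for offset in range(ARRAY_LEN):            # indices 0, 1, 2, 3, 4
    movb_zero(with_movb, offset)

print(with_clr.hex(), with_movb.hex())
```

In the CLR version the byte at index 5, which belongs to whatever follows the array, is wiped as well; the byte-wise version leaves it alone.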

Opry99er · November 8, 2019

This is an excellent resource with great example code...

https://atariage.com/forums/topic/162941-assembly-on-the-994a/

matthew180 · November 12, 2019

Keep in mind, the TMS-9900 is the microchip version of TI's 990 minicomputer series, which was designed for a multi-user environment. The memory-based registers meant a context switch could happen very quickly, since there are only three real CPU registers that have to be saved (the PC, WP, and Flags). Compare this to other CPUs that have to execute a lot of PUSH and POP instructions, or SAVEs and LOADs, to do a context switch. Unfortunately this is not as beneficial for a single-user system, and it also meant register access is slower compared to internal CPU registers. It also meant the CPU electronics would use fewer chips (when talking about the 990's CPU, which was implemented on circuit boards), and fewer transistors when the CPU was made into a microchip, which meant it would cost less.

The 990 did a lot of strange things, and it feels like TI was experimenting with ideas. I suppose a lot of minicomputers did that back in the day, but the PDP-11 was getting most things right at the time, so maybe TI was looking for an advantage over DEC and other designs. Ironically, the 990's assembly language feels very much like the PDP-11's.

There are many things the 9900 could have done, like a wider address bus, byte control lines to avoid read-before-write, etc., but alas it is what it is.

Unfortunately, when TI crammed the 9900 into the 99/4A, the 16-bit CPU was severely crippled by the 8-bit bus design and slow memory access, so the 99/4A is not the computer it could have been. This is a favorite topic to kick around, hash over, discuss, and argue. ;-)

apersson850 · November 12, 2019

Yes, the mini computer TI 990/9 was the original LSI implementation of the CPU, which became integrated into a chip as the TMS 9900. The TMS 9900, in turn, was used to run the TI 990/4 and TI 990/5 mini computers. Unlike its bigger brothers (e.g. the TI 990/10A), these models were usually limited to max 64 K memory. Hence the width of the address bus.

Those wondering why TI selected model names like 99/4, 99/4A, 99/8 etc. have food for thought when looking at the minicomputer series' names.

Edited November 12, 2019 by apersson850

cbmeeks · November 12, 2019

Thanks for all of the helpful comments!

I've read many times (and non-TI lovers like to point out) that the biggest "flaw" of the TI-99/4A is the whole 16/8-bit bus debacle. However, if TI had done everything "right", would we even have our eccentric computer today? To me, it was the goofy "mistakes" that made the TI what it is and why I find it so charming. It still sold millions of units. A failure for TI, but a success for us. Go over to the Mattel Aquarius forums and see how active they are (which is sad because I actually like that computer).

Not to go too far OT, but I think the single biggest failure TI made with the 99/4A was not allowing third parties (or even home users) FULL access to the machine at the beginning. In fact, I might start a new thread for this very topic. ;-)

apersson850 · November 12, 2019

Well, plug in Editor/Assembler, or even better, Mini Memory, and you have that. It's not the bare console, no, but I think many of these were sold to people who plugged in Invaders and Parsec and never took it any further than that.

Mini Memory gave you something to work with, but otherwise there wasn't too much that was interesting to the skilled hobbyists inside the machine, as long as it had no memory expansion.

Just making it like my modified console, with 64 K 16-bit contiguous RAM, in addition to the 16 K video RAM, makes it a whole different beast. It's a concept similar to the Commodore 64, which also had 64 K RAM, but not access to all of it when the machine was running "normally". But you could page in RAM all over, under software control, just like my console allows you to.

+mizapf · November 12, 2019

5 hours ago, cbmeeks said:

I've read many times (and non-TI lovers like to point out) the biggest "flaw" of the TI-99/4A is the whole 16/8-bit bus debacle. However, if TI would have done everything "right", would we even have our eccentric computer today?

I'd say it is not the bus folding alone. As I said elsewhere, this is not so uncommon, just consider the Intel 8088.

The problem gets worse with TI's generous usage of wait states in the TI console. Every memory access (except for the ROM and the 256 bytes of RAM) gets two wait states, and with the 2 sequential 8-bit accesses, you get 4 wait states in total for every read/write operation. In addition, the 9900 is a 16-bit processor and cannot address single bytes in memory, so it must do a word transfer for every byte change. (The CPU does not even have a 2^0 address line, so it actually addresses 32K 16-bit words.)
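The arithmetic behind those figures, as a short Python sketch. It uses only the numbers quoted in this post and counts wait states, not total clock cycles. Treating a byte change as one word read plus one word write is an assumption made for the illustration, and all names are made up.

```python
WAIT_STATES_PER_8BIT_ACCESS = 2   # figure quoted in the post above
ACCESSES_PER_WORD = 2             # 16-bit word folded into two 8-bit transfers

def wait_states(word_transfers):
    """Wait states spent on a number of 16-bit transfers over the 8-bit bus."""
    return word_transfers * ACCESSES_PER_WORD * WAIT_STATES_PER_8BIT_ACCESS

word_access = wait_states(1)      # one word read or one word write
byte_change = wait_states(2)      # assumed: read the word, then write it back

addressable_words = 2 ** 15       # 15 address lines, no 2^0 line
addressable_bytes = addressable_words * 2

print(word_access, byte_change, addressable_words, addressable_bytes)
```

Under these assumptions, changing a single byte in expansion memory costs twice the wait states of a plain word access.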

The worst decision - in my humble opinion - was to build a virtual architecture inside (GPL) and to write the BASIC interpreter in this language. (Yes, GPL has lots of nice features, but it obviously has a dramatic impact on performance.)

Three things that I found attractive about the TI-99/4A:

- By its name-based device identification, the system has a high level of extensibility.

- The BASICs are pretty high-level, providing a high-level interface to many (but not all) system resources.

- The machine language has lots of comfort features that make assembly programming fun.

apersson850 · November 15, 2019

All the features of BASIC most probably wouldn't have been there, without GPL.

+TheBF · November 16, 2019

13 hours ago, apersson850 said:

All the features of BASIC most probably wouldn't have been there, without GPL.

Do you mean because of the size of the code required? A byte code machine is compact yes.

I am pretty sure it could be done with another virtual machine as well that might be a little more efficient.

One of the things I am learning about 9900 machine code is that it allows some pretty hi-level concepts in a small number of instructions.

apersson850 · November 16, 2019

Yes, and that ROM has a different address space.

The TMS 9900 has a good instruction set, but being a true 16 bit design, each occupies at least two bytes.

Edited November 16, 2019 by apersson850

+mizapf · November 16, 2019

Instead of the additional (indirect) GROM address space, I would have suggested a ROM banking mechanism. For instance, for the address area from 0000-3FFF or just from 2000-3FFF, one could have used a CRU-based scheme or ROM-write scheme for switching banks.

However, TI preferred their GROMs, maybe because of cost or physical (size) constraints. Or, sometimes, you simply have a bad idea.

RXB · November 16, 2019

4 hours ago, mizapf said:

Instead of the additional (indirect) GROM address space, I would have suggested a ROM banking mechanism. For instance, for the address area from 0000-3FFF or just from 2000-3FFF, one could have used a CRU-based scheme or ROM-write scheme for switching banks.

However, TI preferred their GROMs, maybe because of cost or physical (size) constraints. Or, sometimes, you simply have a bad idea.

Well let us examine the arguments here:

1) GROM had up to 512K or 640K of memory for a 16 bit CPU with only 32K of RAM. i.e. 16 banks of 32K of 6K GROMs or 16 banks of 8K GROMS.

2) Simply moving GROM to RAM would have been a great solution but for unknown reasons was NEVER USED?

3) I do this all the time in RXB, move saved Assembly in GROM to RAM then execute it. (Of course FAST RAM is limited in size but no reason normal RAM can not be used.)

4) I think it was chip costs mostly, but when you examine the implications of size in the future, GROM wins by sheer volume.

Asmusr · November 16, 2019

In a machine where the majority of the RAM is VDP memory it makes sense to have a way to execute code from this type of RAM. Were GPL programs ever distributed in a way that made it possible to run them from VDP memory?

Edited November 16, 2019 by Asmusr

RXB · November 16, 2019

4 hours ago, Asmusr said:

In a machine where the majority of the RAM is VDP memory it makes sense to have a way to execute code from this type of RAM. Were GPL programs ever distributed in a way that made it possible to run them from VDP memory?

Hmm, ONLY BASIC and GPL can run from VDP. Assembly can only run from RAM, and the only RAM in the console was 256 bytes of FAST RAM.

The only other place for RAM was 8K in Cartridge slot.

+mizapf · November 17, 2019

7 hours ago, Asmusr said:

In a machine where the majority of the RAM is VDP memory it makes sense to have a way to execute code from this type of RAM. Were GPL programs ever distributed in a way that made it possible to run them from VDP memory?

I guess no. The most salient difference between VRAM handling and GROM handling is that you cannot read the video RAM pointer, but you can read the GROM address register. In terms of symbols: There is a GRMRA, but no VDPRA. Many techniques of the GPL architecture like pushing addresses on the stack are not possible in this way.

apersson850 · November 17, 2019

P-code can run from any memory. The PME (p-machine emulator) is written in such a way, that it will do the same thing to the code, regardless of where it is. There's a flag to keep track of which kind of memory the current code is executing from, so it can see if it needs to do any special action, when the program is calling procedures that may be in a different type of memory.

Thus the p-system has two code pools for the user, in VDP RAM and CPU RAM (24 K expansion), as well as in GROM on the p-code card for the system itself.

Edited November 17, 2019 by apersson850
