
Is the TMS9900 Byte Addressable in the comp-sci sense?


jedimatt42


So, my daughter has gone off to college and entered a computer science program. ( I'm a secretly ecstatic Dad )

 

I studied comp-sci in college, and have been in industry for 20 years, but I skipped the freshman, sophomore, and junior CS classes, so I've only learned the fundamentals on demand.

 

The question was: If a 16 bit computer is byte-addressable, how much memory can be addressed. I know the answer is 2^16. 64k bytes.

 

When I start looking at retro-computing, and the 8-bit era... was it? Were the Z80 and 6502 8 bit computers? They also addressed 2^16 bytes of memory.

The Z80 had 16-bit registers ( It would put two 8-bit registers together, or so wikipedia says.. ) Yet, these computers were all referred/marketed as 8-bit computers.

 

So then we come along with our TI-99 - awesome 16-bit computer... and can only address 2^16... same as the 8-bit guys, but we have a 16 bit data bus, and only 15 address lines.. :)

And then there is this 8-bit multiplexing memory hardware inside the console, outside the CPU... that hijacks the CRU-out, and uses it for byte addressing...

 

And then I start thinking about the instruction set, and the read-before-write carried out by the byte-based instructions (MOVB, SWPB, etc.), and realize that I think the TMS9900 is not byte-addressable at the physical level.

 

Is it still considered byte-addressable in the comp-sci sense? or is it a 16-bit-word-addressable microprocessor with extra instructions for manipulating only half of a word?

 

So.. what characteristic leaves the Z80 as an 8-bit cpu?

 

-M@


Been working with both. And I guess this is just my opinion. ;)

Is it still considered byte-addressable in the comp-sci sense? or is it a 16-bit-word-addressable microprocessor with extra instructions for manipulating only half of a word?

 

In and around the CPU, it's mostly 16 bit. Yes, the address is 15 bit. In the TI-99/4A, it sometimes feels like an 8 bitter. And if you think about it, it's almost a sad story - stuck with it (retro-wise) and like it anyway (quite awkward and first love perhaps).

So.. what characteristic leaves the Z80 as an 8-bit cpu?

 

In and around the CPU, it's mostly 8 bit. Yes, there are some 16 bit internal operations and the address is 16 bit.

:)

 

 

 


sometime99er, I get what you are saying... 'mostly' is probably key, as there probably aren't very many CPUs ever made that are strictly one way... Many machines of the 16-bit home computer era had 20- or 24-bit address buses... and were, in many senses, 32 bit internally.

 

I feel like the internal operand size is probably the deciding factor. Or is it the opcode size? :) Or something else?

 

-M@


It's generally the word size that defined the CPU - the Z80 managed data 8 bits at a time. The 9900 managed data 16 bits at a time.

 

Then it got confusing with the 8088 (8 bit external but 16 bit words) and the 68000 (16 bit external but 32 bit words).

 

The 9900 is incapable of /accessing/ a single byte, it /always/ processes 16-bits at a time. However, the byte-oriented instructions can perform operations on just 8 of those 16-bits. But so far as the hardware goes, all reads and all writes are 16-bits wide.

 

Note that the CRU is a separate address space and uses separate instructions, it's not related to normal memory accesses.


At some point in the past memory (RAM, ROM, disk storage, tape, etc.) started to be measured in bytes. This has become the standard unit of measure for digital computer storage to this day. It does not matter how a CPU addresses memory, it is still measured the same way and should not be confused with the width of the CPU's address bus.

 

As for an "8 bit", "16 bit", or whatever-bit CPU, that is typically determined by the size of the data that the CPU's ALU can operate on at one time, the width of the address bus, or some combination thereof. For example, the Z80 is an 8-bit CPU because the ALU can only operate on 8-bit operands at one time. It does have a 16-bit address bus to access memory, but the data bus is only 8-bits. So when the Z80 addresses memory it can only read back a single byte in one memory operation. Internally the Z80 has a few 16-bit registers, like the Program Counter (PC) so it can address the full 64KiB of memory, but that 16-bit register has to be manipulated in two 8-bit operations.

 

The 8088 is a 8/16-bit CPU because its ALU is 16-bits, it has 16-bit registers and a 16-bit address bus. However, its data bus is only 8-bits like the Z80, 6502, and other "8-bit" CPUs, so it must make two memory operations to read a 16-bit value. The 8086 on the other hand has a 16-bit ALU, and 16-bit address and data bus, thus it is a full 16-bit CPU. So, even though internally the 8088 and 8086 are almost identical, 8086 computers are generally faster since they can access two bytes in a single memory operation.

 

The 9900 CPU has a 16-bit ALU, 16-bit registers, and a 16-bit data bus. It also has a 16-bit address bus, however only 15 address lines are physically wired to external address pins. Thus, the 9900 can only generate 32768 addresses, however it expects to read two bytes at each of those addresses, thus 64KiB of memory. That means when the 9900 accesses memory measured in "bytes", it can only read memory from even addresses. This is also the reason for the read-before-write nature of the 9900. If the 9900 needs to write a single byte to memory, it has to read both bytes at a specified address, manipulate only one of the two bytes, and write both bytes back to memory. Sadly this causes slow access to memory for byte operations. Some other CPUs (like the 8086) have extra pins that specify which bytes are being accessed during a memory write, so byte addressable memory can be masked externally and the CPU avoids the need for a read-before-write for byte operations.
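
To make the read-before-write concrete, here is a minimal Python sketch of a memory that can only be accessed one 16-bit word at a time. The names `WordMemory` and `store_byte` are made up for illustration and are not TI terminology; the only hardware facts assumed are the ones above (15 address lines, 16-bit data bus) plus the 9900's big-endian byte order, where the even address holds the high-order byte.

```python
# Sketch only: WordMemory and store_byte are hypothetical names for illustration.
class WordMemory:
    """64 KiB seen as 32768 16-bit words; only whole words can be read or written."""

    def __init__(self):
        self.words = [0] * 32768      # 15 address lines -> 2**15 word locations
        self.reads = 0                # count bus cycles to show the extra read
        self.writes = 0

    def read_word(self, byte_addr):
        self.reads += 1
        # The lowest address bit never reaches the bus, so it is dropped here.
        return self.words[(byte_addr & 0xFFFF) >> 1]

    def write_word(self, byte_addr, value):
        self.writes += 1
        self.words[(byte_addr & 0xFFFF) >> 1] = value & 0xFFFF


def store_byte(mem, byte_addr, value):
    """Byte store done as read-modify-write of the containing word."""
    word = mem.read_word(byte_addr)                      # the read before the write
    if byte_addr & 1:
        word = (word & 0xFF00) | (value & 0xFF)          # odd address: low-order byte
    else:
        word = (word & 0x00FF) | ((value & 0xFF) << 8)   # even address: high-order byte
    mem.write_word(byte_addr, word)                      # both bytes go back out
```

Calling `store_byte(mem, 0x8301, 0xAB)` on a word that holds `0x1234` leaves `0x12AB` behind, at the cost of one read cycle and one write cycle instead of a single write.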

Edited by matthew180

If the 9900 needs to write a single byte to memory, it has to read both bytes at a specified address, manipulate only one of the two bytes, and write both bytes back to memory. Sadly this causes slow access to memory for byte operations.

 

It's also doing read before write for word operations, right? That's what's really sad.


The Motorola 68000. ;)

Address bus

The 68000 was a clever compromise. When the 68000 was introduced, 16-bit buses were really the most practical size. However, the 68000 was designed with 32-bit registers and address spaces, on the assumption that hardware prices would fall.

It is important to note that even though the 68000 had a 16bit ALU, addresses were always manipulated by 32bit instructions. I.e. not only was the address space 32bit, but flat addressing was used. Contrast this to the 8086, which had 20bit address space, but could only access 16bit (64 kilobyte) chunks without resorting to slow extra instructions. The clever 68000 compromise was that in spite of databus and ALU width being 16bit, address arithmetic always is 32bit (further, even for all dataregister ops there is a 32bit version of the instruction). For the complex addressing mode, there is a fullsize address adder outside the ALU. E.g. a full 32bit address register postincrement goes without speed penalty.

So even though starting out as "16bit" cpu, the 68000 instruction set describes a 32bit architecture. The importance of architecture cannot be emphasized enough. Throughout history, addressing pains have not been hardware implementation problems, but always architecture problems (instruction set problems, i.e. software compatibility problems). The successor 68020 with 32bit ALU and 32bit databus runs unchanged 68000 software at "32bit speed", manipulating data up to 4 gigabyte, far beyond what software of other "16bit" cpus (e.g. 8086) could do.

To address the perceived markets, the actual 68000 was designed in three forms. The base-form had a 24-bit address, and a 16-bit data bus. The short form, the 68008, had an 18-bit address (possibly 19 or 20 bits, at least one firm addressed 512KBytes with 68008s), and an 8-bit data bus. A planned future form (later the 68020) had a 32-bit data and address bus.

Internal registers

The CPU had 8 general-purpose data registers (D0-D7), and 8 address registers (A0-A7). The last address register was also the standard stack pointer, and could be called either A7 or SP. This was a good number of registers in many ways. It was small enough to make the 68000 respond quickly to interrupts (because only 15 or 16 had to be saved), and yet large enough to make most calculations fast.

Having two types of registers was mildly annoying at times, but really not hard to use in practice. Reportedly, it allowed the CPU designers to achieve a higher degree of parallelism, by using an auxiliary execution unit for the address registers.


 

It's also doing read before write for word operations, right? That's what's really sad.

 

 

That is sad indeed. As an example the move instructions MOV (16 bit) and MOVB (8 bit) take the same amount of clock cycles on the TMS9900, despite the fact that there is no need to do a destination read-before-write when dealing with 16-bit operands.

 

This was fixed for the TMS9995 and successors. With regards to the 8088 / 8086 comparison that matthew180 illustrated, the TMS9995 would be similar to the 8088 and TMS9900 to the 8086 from an external bus width point of view. However, the TMS9995 is also an architectural enhancement, so in practice it is much faster despite having an 8-bit external bus. It also has an internal memory block (256 bytes) that can be accessed at 16-bit width. It does not do unnecessary memory cycles.

 

It is actually interesting to compare the amount of clock cycles it takes the TMS9900 and TMS9995 to do the same 16-bit operation. If we look at MOV R1,*R2 instruction (write contents of workspace register 1 to the address designated by workspace register 2) for these two processors and assume zero wait states for both, and also assume that the TMS9995's internal RAM is not used (worst case conditions) so that it has to go through the 8 bit external bus, it is still faster: on the TMS9995 that instruction takes 8 clock cycles, for the TMS9900 it takes 18 clock cycles.

 

As an aside, one thing often confused about the TMS9995 is its cycle time: the processor runs at 12MHz, but that gets divided internally by 4, so the machine cycles are 333 ns each and thus similar to the TMS9900. The memory cycles for the TMS9995 are only 1 machine cycle, whereas for the TMS9900 they are 4 cycles, so there is a big difference there.
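
The arithmetic behind those numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it takes the cycle counts quoted above (8 and 18) at face value and assumes both processors see a 333 ns cycle, i.e. a TMS9900 clocked at 3 MHz as in the TI-99/4A. The function name `cycle_ns` is made up for the example.

```python
def cycle_ns(clock_hz, divider=1):
    """Length of one cycle in nanoseconds for a clock divided by `divider`."""
    return 1e9 * divider / clock_hz


# TMS9995: 12 MHz input divided by 4 gives a 3 MHz machine cycle.
machine_cycle = cycle_ns(12_000_000, divider=4)   # about 333.3 ns

# MOV R1,*R2 using the cycle counts quoted above (assumed, not re-measured).
t_9995 = 8 * machine_cycle     # about 2.67 microseconds
t_9900 = 18 * machine_cycle    # about 6.0 microseconds
speedup = t_9900 / t_9995      # 18 / 8 = 2.25
```

So under these assumptions the same 16-bit store finishes a bit more than twice as fast on the TMS9995, even through the 8-bit external bus.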


This is the reason why the Geneve is so much faster than the TI-99/4A; the 12 MHz, however, are not immediately divided down to 3 MHz but they seem to serve for running the microprograms inside. Otherwise there would be no way to realize such an efficient processing of commands; it's just doing too many things in one cycle. To the outside, the CPU delivers the CLKOUT at the usual 3 MHz. This is also an interesting example where a multiplexed data bus is actually faster than the full data bus of the 9900 - because there is no need for a read-before-write.

 

I have one more question on 8-bit, though. As explained above, we refer to the internal, architectural width when talking about 8-bit, 16-bit etc. All 8-bit CPUs that I know of have more than 8 bit address width, mostly 16 bit. How does an 8-bit ALU handle addresses with that size?

 

If you take the 9900, it is pretty well understandable that we cannot easily make use of more than 16 address bits: The address can be loaded into a register, and it can be modified by arithmetic calculation, so if your ALU is 16 bits wide, you are running into trouble when you reach address FFFF and add another one. Also, if you think about 32-bit PC platforms, you cannot address more memory (flat, no segmentation) than 2^32 byte (4 GiB), which is a good incentive to switch to 64-bit platforms. (Note that the common 64-bitters "only" offer 48 address lines.)

 

One way to get past the architectural borders is to use segmentation or mappers (like in the Geneve or the 99/8) that have to be set before, which add the missing address bits.

 

So as for the Z80 or 6502/6510, they seem to have an address register of 16 bits, but how do they load it, and can it be part of a calculation?


This is the reason why the Geneve is so much faster than the TI-99/4A; the 12 MHz, however, are not immediately divided down to 3 MHz but they seem to serve for running the microprograms inside. Otherwise there would be no way to realize such an efficient processing of commands; it's just doing too many things in one cycle. To the outside, the CPU delivers the CLKOUT at the usual 3 MHz. This is also an interesting example where a multiplexed data bus is actually faster than the full data bus of the 9900 - because there is no need for a read-before-write.

I am not sure if that is true, at least all the timing diagrams seem to indicate that the processor would be working from a single clock, CLKOUT. Modern processors (and even old RISC processors) typically do all of their processing from a single clock. The TMS9900 has a four phase clock - you could say the individual phases run at 12MHz even with it. I don't know if the TMS9995 has a similar arrangement internally, with phased clocks not visible to the outside.

 

I have one more question on 8-bit, though. As explained above, we refer to the internal, architectural width when talking about 8-bit, 16-bit etc. All 8-bit CPUs that I know of have more than 8 bit address width, mostly 16 bit. How does an 8-bit ALU handle addresses with that size?

They normally use multiple clock cycles through the 8-bit ALU to handle 16-bit quantities. As an example, the Z80 has 16-bit arithmetic instructions, but those take more clock cycles (or T-states as Zilog calls them). So a 16-bit operation on the Z80 like ADD HL,DE would internally be divided into something like ADD L,E followed by ADC H,D (the latter addition takes into account the carry that might be created from the first addition). On the 6502 this is also evident: for example you can use 16-bit addresses and 8-bit offsets (e.g. LDA $4050,X) but that instruction takes an extra cycle if the address calculation yields a carry from the 8-bit offset addition. So the ALU would be used twice for address calculation in that case. If I remember correctly the 6502 has some undocumented behavior related to this, so in some circumstances the carry is not properly propagated and the addition wraps around within the lowest 8 bits - apparently it was used by some clever game programmers to make copy protection algorithms and such.
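
As a rough illustration of the two-pass idea, here is a Python sketch. It is not a model of the real Z80 or 6502 microarchitecture; `alu_add8`, `add16_via_8bit_alu` and `indexed_address_6502` are hypothetical helper names, and the only thing being shown is how an 8-bit adder with carry can build a 16-bit result in two steps, and why a carry out of the low byte costs an extra pass.

```python
def alu_add8(a, b, carry_in=0):
    """One pass through an 8-bit adder: returns (8-bit result, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add16_via_8bit_alu(x, y):
    """16-bit add done the 'ADD low, then ADC high' way."""
    lo, carry = alu_add8(x & 0xFF, y & 0xFF)        # first pass: low bytes
    hi, carry_out = alu_add8(x >> 8, y >> 8, carry)  # second pass: high bytes plus carry
    return (hi << 8) | lo, carry_out


def indexed_address_6502(base, index):
    """Base address plus 8-bit index; reports whether a second ALU pass was needed."""
    lo, carry = alu_add8(base & 0xFF, index)
    hi = (base >> 8) & 0xFF
    page_crossed = carry == 1
    if page_crossed:
        hi, _ = alu_add8(hi, 0, carry)               # extra pass to fix up the high byte
    return (hi << 8) | lo, page_crossed
```

For example, `add16_via_8bit_alu(0x12FF, 0x0001)` gives `(0x1300, 0)`, and `indexed_address_6502(0x40F0, 0x20)` gives `(0x4110, True)`, the `True` standing in for the extra cycle mentioned above.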

 

If you take the 9900, it is pretty well understandable that we cannot easily make use of more than 16 address bits: The address can be loaded into a register, and it can be modified by arithmetic calculation, so if your ALU is 16 bits wide, you are running into trouble when you reach address FFFF and add another one. Also, if you think about 32-bit PC platforms, you cannot address more memory (flat, no segmentation) than 2^32 byte (4 GiB), which is a good incentive to switch to 64-bit platforms. (Note that the common 64-bitters "only" offer 48 address lines.)

Actually starting from Pentium Pro Intel added a feature called PAE (Physical Address Extension) which allowed the 32-bit architecture to address more than 4GB (it expanded physical memory address space to 36 bits). Basically the virtual memory page tables gave you additional address bits, similarly to the memory banking on the TI :) . But that gets messy, like you say 64-bit platforms are the way to go.

 

One way to get past the architectural borders is to use segmentation or mappers (like in the Geneve or the 99/8) that have to be set before, which add the missing address bits.

 

So as for the Z80 or 6502/6510, they seem to have an address register of 16 bits, but how do they load it, and can it be part of a calculation?

Yes it can, but 16-bit operations become multi cycle operations internally. That by the way also adds to the internal complexity of the processor, as you have to operate the machinery multiple times to get through an instruction. The TMS9900 is a nice and clean architecture compared to the 8-bit processors. Having said that there are nice 8-bit processors too, Atmel's AVR architecture is a good modern example of that.


Unlike for the TMS9900, we don't have similarly detailed information about the microprograms inside the 9995. When I implemented the microprograms in MAME I had to guess what is actually happening at each clock cycle. I tried to derive that from the instruction timings, and from the microprograms for the 9900.

 

At some points I was somewhat unsure whether it is possible to do that many microoperations in the same cycle. Either the 9995 has much better parallelism, or it makes use of the cycles according to the CLKIN 12 MHz, that is, for each external clock cycle (at 3 MHz, visible at the pin CLKOUT) there are 4 cycles internally. This is just a guess, and I did not rely on that theory for the MAME implementation. But why should it require a clock at 12 MHz if it immediately divides it down to 3 MHz?

 

Anyway, it is quite interesting to have a look at other architectures to be able to understand some of the design decisions for the 99xx family. In particular, I thought it was common to have something like a read-before-write for platforms that use byte operations in a 16 bit or wider environment, until I found that the x86 architecture simply uses bus control lines to be able to turn off parts of the data bus. Thanks for your information!


Those are great comments. Maybe the TMS9995 indeed does some internal processing at 12MHz. The only "solid" counter argument I would have is that if the processor was internally operating at 12MHz, why would the external memory cycles always need to be done at 3MHz granularity? In other words, if you needed a wait state for a memory device, you go in 333 ns increments and that has a huge impact, basically at zero wait state the memory access time would need to be something like 160ns, and if you can't do that, the next option is about 500ns (one 333ns wait state), so that is a huge step, even with memories of the day. If that could be done at 80ns increments (1/12MHz) it would have been more efficient.


I always assumed that the 9995 uses a 4-phase 3MHz clock generated from the 12MHz crystal - they've in effect incorporated the TIM9904 clock generator onto the processor die. Phase 3 of the clock is brought externally as CLKOUT, in a similar way to Phi 3 being the only clock phase used by external components in a 9900 system.


Which means that if TI had decided to add just two more lines to the TMS9900 (and maybe drop others, like two of the interrupt level lines), they would have been able to address single bytes on the data bus merely by turning on/off one half of the bus - without having to resort to read-before-write ... but we also know that there were many more considerations about the architecture.


I always assumed that the 9995 uses a 4-phase 3MHz clock generated from the 12MHz crystal - they've in effect incorporated the TIM9904 clock generator onto the processor die. Phase 3 of the clock is brought externally as CLKOUT, in a similar way to Phi 3 being the only clock phase used by external components in a 9900 system.

 

 

That could simply be it :)


Well, for some reason I can't edit my previous post. I realized I said the 8088/8086 were 16-bit address and data CPUs, however the address bus is actually 20 bits wide on those CPUs, which allows 1MiB of addressable memory. Addressing the 1MiB on a 16-bit architecture was done using the now infamous "segmented" memory scheme.
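
For anyone who has not met the segmented scheme: in real mode the 8086/8088 forms a 20-bit physical address by shifting a 16-bit segment value left by four bits and adding a 16-bit offset. A minimal Python sketch follows; the function name `physical_address` is just for illustration.

```python
def physical_address(segment, offset):
    """8086 real-mode address: (segment * 16 + offset), truncated to 20 bits."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & 0xFFFFF   # only 20 address lines, so anything above wraps around
```

`physical_address(0x1234, 0x0010)` and `physical_address(0x1235, 0x0000)` both come out as `0x12350`, which shows the overlap between segments that made the scheme so awkward to work with.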

 

 

Which means that if TI had decided to add just two more lines to the TMS9900 (and maybe drop others, like two of the interrupt level lines), they would have been able to address single bytes on the data bus merely by turning on/off one half of the bus - without having to resort to read-before-write ... but we also know that there were many more considerations about the architecture.

 

Most memory ICs today with a data bus width wider than 8-bits will typically have inputs for byte masking. On the 8086 the pin is called BHE: "Bus High Enable. Enables the most significant data bus bits (D15-D8) during a read or write operation." That single pin allows avoiding read-before-write. I think the 9900 could have found one pin in the huge 64-pin package to do that... Ah well, nothing to do about it now.


I first wanted to write that we just need one more line, but actually, we need two: The 8086 has two relevant lines for the selection of upper and lower byte, A0 (LSB) and BHE. (I prepared such slides for my students, in fact.) There are only 15 lines for the TMS9900; the A15 line (which is A0 on those other architectures) is missing.
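
The combination of the two lines can be written down as a small truth table. The Python sketch below follows the usual 8086 convention (BHE is active low, the even-addressed byte travels on D7-D0 and the odd-addressed byte on D15-D8); the function name `byte_lanes` is invented for the example.

```python
def byte_lanes(a0, bhe_n):
    """Which halves of a 16-bit data bus are active in an 8086-style bus cycle.

    a0    -- lowest address bit (0 selects the even byte on D7-D0)
    bhe_n -- BHE pin, active low (0 selects the odd byte on D15-D8)
    Returns (low_lane_active, high_lane_active).
    """
    low_lane = a0 == 0
    high_lane = bhe_n == 0
    return low_lane, high_lane


# All four combinations of the two lines:
TRANSFERS = {
    (0, 0): "whole word",             # both lanes active
    (0, 1): "even byte on D7-D0",     # low lane only
    (1, 0): "odd byte on D15-D8",     # high lane only
    (1, 1): "no transfer",            # neither lane active
}
```

With that, a byte write only has to enable one lane, and the memory on the other lane keeps its old contents without the CPU ever reading it first.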

Edited by mizapf
