How long would the following instructions take?


Willsy

You know, the 9900 Data Manual is not that scary, you should have a look sometime. :-) Pg. 28 has a summary of all the instruction timings and the adjustments for the various addressing modes (indirect, symbolic, etc.)

 

DEC is 10 clock cycles with the destination being a register

JMP is 10 clock cycles

 

20 clock cycles = 20 * 0.333 = 6.66uS execution time for both instructions together. The stock 99/4A has a 333nS (nanosecond) clock cycle, which is why you multiply by 0.333 to convert the timing to microseconds. The Data Manual also has an example of calculating the instruction timing on pg. 29 (just after the chart of instruction timings).
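
For anyone who wants to check the arithmetic, here is a minimal Python sketch of that conversion. The names `CLOCK_PERIOD_US` and `cycles_to_us` are made up for illustration, and the sketch assumes no wait states at all.

```python
# Clock period of a stock 99/4A as quoted above (333 ns), in microseconds.
CLOCK_PERIOD_US = 0.333

def cycles_to_us(cycles, period_us=CLOCK_PERIOD_US):
    """Convert a clock-cycle count to microseconds (no wait states assumed)."""
    return cycles * period_us

dec_cycles = 10  # DEC with a register destination
jmp_cycles = 10  # JMP
total_us = cycles_to_us(dec_cycles + jmp_cycles)
print(f"{total_us:.2f} uS")  # prints 6.66 uS
```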


Thanks Matthew.

 

Do I double it if the code is running in the 32K?

 

Thanks

 

Mark

(Who doesn't have the 9900 Data Manual ;)


Not necessarily double, but that would certainly give you a worst-case scenario. Some of an instruction's cycles are spent in the ALU, decoding, etc. and don't involve memory access. Also, if your workspace is in the 16-bit scratchpad (which it always should be), then register access will not cause wait states. However, instruction fetching will, so at least 1 of the 3 memory accesses in the DEC instruction will cause wait states in this particular example.

 

Doesn't everyone have the 9900 Data Manual next to their bed on the night stand?


For every access to 8-bit RAM, you have to add 4 cycles. The datasheet will tell you for each instruction and each addressing mode how many memory accesses are involved. It doesn't have the "4 cycle" value since that's specific to the 99/4A console, not the 9900.

 

The main trick can be figuring out which cycles are which when you are dealing with both 8 bit and 16 bit RAM. It's not documented as far as I know... but, by keeping a few facts in mind you can make an educated guess. This is actually the most common case since registers are usually in scratchpad and code in 8-bit RAM.

 

-One memory access per word of the instruction (has to get out of RAM)

-All writes are preceded by a read of the target address (except there seem to be a couple of exceptions, like LI... you'll spot them when the access cycles don't add up)

 

It's not hard to put the cycles together, but it is a bit tedious to remember all the parts.

 

Basically, there are two tables - the opcode table and the addressing mode table. You start by looking up the opcode in the opcode table - this gives you a number of cycles, a number of memory accesses, and a reference into the second table (A or B column). The second table gives you the timing for each addressing mode (Register, Indirect, etc). It also gives you a number of cycles and a number of memory accesses.

 

If everything is in the same type of memory, then you add up the cycles and add up the memory accesses separately. You then multiply the memory accesses by the number of cycles per memory access. Add the total, and multiply by the time per cycle (which Matthew gave above), and you have the total execution time.

 

If you have both 8-bit and 16-bit RAM involved, then you need to work out how many memory accesses are in each. For instance, the instruction and any related words need to be read from the program memory, registers are read (and maybe read/written), and memory locations may be accessed. This requires a little intuition and possibly guesswork since we don't know exactly how the microcode works.

 

A simple example: JMP LABEL in 8-bit RAM.

 

JMP is listed with two values, one when the PC is changed, and one when it's not. JMP always changes the PC (it's unconditional), so we take that one. It tells us 10 cycles and 1 memory access. We know JMP is a single-word instruction, so that 1 access is reading the instruction, which means it's in 8-bit RAM. There is no reference to the addressing table.

 

The datasheet tells us the math is : T = tc( C + W*M )

T is Total Time

tc is the time per clock cycle (0.333uS)

C is the clock cycles from the table (10)

W is the number of wait states per memory access (8-bit RAM, so 4)

M is the number of memory accesses from the table (1)

 

So we get: T = 0.333(10 + 4*1) = 0.333(10+4) = 0.333(14) = 4.662uS
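
The same formula can be sketched as a small Python helper. This is only an illustration of T = tc( C + W*M ) with the numbers used above; the function name `exec_time_us` and its parameter names are invented here, not anything from the datasheet.

```python
def exec_time_us(cycles, mem_accesses, wait_states, tc_us=0.333):
    """T = tc * (C + W * M), following the formula quoted above."""
    return tc_us * (cycles + wait_states * mem_accesses)

# JMP with the instruction in 8-bit RAM: C=10, M=1, W=4
jmp_8bit = exec_time_us(10, 1, 4)
# The same JMP fetched from 16-bit memory: W=0
jmp_16bit = exec_time_us(10, 1, 0)
print(f"{jmp_8bit:.3f} uS vs {jmp_16bit:.3f} uS")  # prints 4.662 uS vs 3.330 uS
```

Setting W to 0 gives the plain datasheet timing, so the helper also covers code that runs entirely from 16-bit memory.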

 

A complicated example: MOV *R1+,@>8400, code in 8-bit RAM, registers in 16-bit RAM. R1 contains >2000 (a memory address in 8-bit RAM).

 

MOV lists as 14 cycles and 4 memory accesses, and uses Addressing table A. We have two arguments to look up in that table.

R1 is WR indirect auto-increment mode, which adds 8 clock cycles and 2 memory accesses.

@>8400 is Symbolic mode, it adds 8 clock cycles and 1 memory access.

 

Before we can do the math, we have to break down the memory accesses between 8-bit and 16-bit RAM. Start with just the opcode - there are 4 memory accesses.

 

1 is used to read the instruction - that's 8-bit RAM

1 is used to read the source value. That's R1, which is in 16-bit RAM.

2 are used to write the destination value (read-before-write!). That's @>8400, which is in 8-bit memory (technically).

 

So for the MOV so far we have 3 memory accesses to 8-bit, and 1 to 16-bit.

 

*R1+ adds 2 memory accesses. We know that reading the register is already accounted for. But the CPU still needs to read the /actual/ desired data, and it needs to write the new value of R1. Since R1 points to >2000, we know that's an 8-bit access, and writing the value back to R1 is a 16-bit access. Note there's no read-before-write on the R1 update because the CPU has just read it. It's a read-modify-write access. If you weren't sure, you could infer it simply by seeing there aren't enough memory access cycles for the extra read.

 

@>8400 adds one memory access cycle. This is simply the cycle needed to fetch the data from the program stream, so we know it's an 8-bit access.

 

Add these in, and we get 2 more 8-bit accesses, for a total of 5, and one more 16-bit access for a total of 2.

 

We can also add up all the cycles, *R1+ adds 8 cycles, and @>8400 also adds 8 cycles. The total instruction cycles are 14+8+8 = 30.

 

Now we are ready to do the math. (Note we have to extend the wait state math for the two different wait states!)

 

T = 0.333(30 + 4*5 + 0*2) = 0.333(30+20+0) = 0.333(50) = 16.65uS
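
Here is the mixed-memory version of the math as a rough Python sketch. It just extends the formula to a table of "wait states per access -> number of accesses"; the function name `exec_time_mixed_us` and the dictionary layout are assumptions made for this example, and the access breakdown is the educated guess described above.

```python
TC_US = 0.333  # time per clock cycle on a stock 99/4A, as quoted above

def exec_time_mixed_us(cycles, accesses_by_wait):
    """Extend T = tc(C + W*M) to several memory types.

    accesses_by_wait maps wait states per access -> number of accesses.
    """
    wait_cycles = sum(w * m for w, m in accesses_by_wait.items())
    return TC_US * (cycles + wait_cycles)

# MOV *R1+,@>8400 as broken down above
cycles = 14 + 8 + 8        # MOV base + *R1+ + @>8400
accesses = {4: 5, 0: 2}    # five 8-bit accesses, two 16-bit accesses
t = exec_time_mixed_us(cycles, accesses)
print(f"{t:.2f} uS")  # prints 16.65 uS
```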

 

Now it's your turn! :)


  • 10 months later...

Bumping this to ask a question:

 

Tursi, does the classic99 debugger (in particular, the T command) take the memory types into account when calculating memory cycles?

 

I had a look in the classic99 manual but there's no mention of that.

 

Mark


Yes, Classic99 does handle different wait states on the different memory types, including treating the 32k as 8-bit and the scratchpad and ROMs as 16-bit.

 

I do believe that it's missing one type of cycle somewhere, but I need to dig into it again. It's close, but I recommend that people treat it either as "close" or only do relative comparisons (ie: path 1 takes 'x' cycles and path 2 is 'y' cycles longer). Where it's wrong, it's definitely consistent, so comparisons are fair. I also haven't proven that I've missed something yet, it just doesn't seem quite right.


-All writes are preceded by a read of the target address (except there seem to be a couple of exceptions, like LI... you'll spot them when the access cycles don't add up)
The read before write is because the TMS 9900 has a 16 bit wide data bus and a 15 bit address bus, but is still byte addressable across the 65536 byte address range. Thus when you execute an instruction like MOVB R2,@>A768, the CPU can't write just to the byte at >A768, without also writing to the byte at >A769. The only way it has to preserve the value at >A769, when modifying the byte at >A768, is to first read the 16 bit word at >A768, modify the most significant byte and write the whole 16 bit chunk back to memory again.

This is true for register accesses as well, since they are all memory in reality.
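
To make the read-before-write idea concrete, here is a small Python model of a byte write on a bus that can only move whole 16-bit words. It is a sketch of the principle only, not of the real microcode; the `write_byte` function and the dictionary used as memory are invented for illustration.

```python
def write_byte(memory, address, value):
    """Model a byte write on a CPU that can only transfer whole 16-bit words.

    memory is a dict keyed by even (word) address holding 16-bit values.
    Returns the number of bus accesses performed (one read plus one write).
    """
    word_addr = address & 0xFFFE       # the bus has no low address bit
    word = memory.get(word_addr, 0)    # access 1: read the whole word
    if address & 1:                    # odd address -> least significant byte
        word = (word & 0xFF00) | (value & 0xFF)
    else:                              # even address -> most significant byte
        word = (word & 0x00FF) | ((value & 0xFF) << 8)
    memory[word_addr] = word           # access 2: write the whole word back
    return 2

ram = {0xA768: 0x1234}
write_byte(ram, 0xA768, 0xAB)   # like MOVB R2,@>A768 with >AB in the high byte of R2
print(hex(ram[0xA768]))  # prints 0xab34
```

The byte at >A769 (>34 here) survives only because the whole word was read first.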

 

Now this is obvious for byte access, but one could think that word access wouldn't be affected. However, to keep the same microcode logic for instructions like A, S, MOV, SOC and so on, which have 8 bit counterparts (AB, SB, MOVB, SOCB), TI took the decision to keep the read before write for 16 bit access as well.

 

However, some instructions, like LI, don't have any LIB (Load Immediate Byte) counterpart. Thus the CPU in this case doesn't need to read before write, since it will always write the whole 16 bit value at the two memory bytes which make up the destination. The destination for a LI is always a register, but since that's memory anyway, it doesn't matter.


Now this is obvious for byte access, but one could think that word access wouldn't be affected. However, to keep the same microcode logic for instructions like A, S, MOV, SOC and so on, which have 8 bit counterparts (AB, SB, MOVB, SOCB), TI took the decision to keep the read before write for 16 bit access as well.

 

So do you know how it ticks when using MOVB to access memory mapped devices such as VDP, GRM, SOUND and SPCH ?


It behaves in exactly the same way, since the CPU isn't aware of the fact that it's not memory, but a memory mapped device.

If you study the design of the 99/4A, you'll see that there are never any memory mapped 8-bit ports that are adjacent to each other. They are at least two bytes apart. That's to avoid having the 9900 mess with the odd byte when it's interested in the even one, and vice versa, since the read before write trick usually doesn't work with a memory mapped device. In most cases reading the port doesn't disclose the last value written to the same port. Thus you can't avoid writing irrelevant values to the port you aren't interested in, if it's within the same 16-bit word, when doing a byte-wide instruction like MOVB.

 

It's only addresses within the workspace that can be written to without that previous read, but for obvious reasons you usually don't have the ability to place the workspace across memory mapped devices.


There is a thread somewhere here on this 99/4A sub-forum or A.A. where we explored the idea of mapping the workspace on top of the VDP's memory mapped area to get the fastest possible write access to it. The idea was that you would map the workspace and unroll the loop. I don't remember the details or if anyone managed to get working code. "Search" is your friend.


  • 3 years later...

Does anybody know how the TMS9900 CPU executes the DIV instruction in its microcode? Or does anybody know where I can find useful information? First of all, I'm interested in knowing how to calculate the exact number of machine cycles for this instruction, primarily with a focus on the data dependency. Chapter 4 of 9900-FamilySystemsDesign-1stEdition (page 94) doesn't give me enough information.

 

Can anybody help?

Thanx!


Usually I would point you to my implementation in MESS, but if I remember correctly, I cheated a little bit there... :)

I used the same source for the microcode implementation, so this may possibly not reveal anything new for you. Will have a look later today, nevertheless.

 

[Edit: Just had a look; as I said, I did not implement the exact ALU operation, only the microoperations around that core activity.]

Edited by mizapf


Try the TMS9900 Microprocessor Data Manual, §3.6 “TMS9900 Execution Times”, Table 3, page 28. The DIV instruction execution time is definitely dependent on the data. The number of clock cycles listed in the table is 92 – 124 with this footnote: “Execution time is dependent upon the partial quotient after each clock cycle during execution.” Of course, with the TI-99/4A, this also depends on wait states and memory accesses, as indicated in the tables, starting with Table 3.

 

...lee


Thanks for your quick response!

 

Lee, I already know about the data dependency, but I need to know how it works in detail. The manual you mention says that a DIV needs 92 - 124 cycles when the overflow bit is reset. I want to know how I can determine the exact number of cycles.

In the documentation I mentioned, every machine cycle of every TMS9900 instruction is described; nevertheless I don't get the secret of DIV. For the ALU machine cycle TI writes "Divide sequence consisting of Ni cycles where 48 <= Ni <= 32. Ni is data dependent" (the bounds look reversed; presumably 32 <= Ni <= 48 is meant). But how are Ni and the dependency defined? I can't find any clue.

 

Michael, I remember you wrote to me in an email about your detailed specification of every component in MESS. So my next step would be to look at your code, if no one else can help me here. Hopefully it helps me move on. :-)


By the way...

 

I read that undefined opcodes use six clock cycles. I do not understand this, because I think the processor needs two cycles for the instruction fetch and two more to decode the instruction, so four in total. What happens in the remaining two cycles?

 

[Edit: Oh, I see right now there are so many errors in 'my' book! The abstract microcode which describes the machine cycles is incorrect and very sketchy! :-(

Michael, where do you get your information from for your implementations in MESS?]

Edited by HackMac

First, I started with the microprograms in the aforementioned book chapter. As far as I can tell they look quite reasonable. Do you have doubts? Second, I compared them to the machine cycle tables in the Pocket Guide and other books.

 

If you have a look at the implementation in MESS (src/emu/cpu/tms99xx/tms9900.c) you will find the microprograms in a quite similar form as shown in the Family Systems Design book.

 

Finally, I did some benchmarks on my real Geneve (which only helps for the 9995). I used the real-time clock of the Geneve to measure the time of an empty loop and then of a loop with a command inside (two nested loops, if I remember correctly, about 4 million iterations).

 

For DIV and MPY I did not implement the sub-sub-level of the command because I was concerned about the performance. So I only implemented the level where memory operations take place, and then I simply multiplied or divided the values. The only exception was for the overflow detection because you have to determine the overflow *before* calculation starts (otherwise it would take too long and be inconsistent with the timing specs).


As far as I can tell they look quite reasonable. Do you have doubts?

 

Yes, I have.

My first example was my DIV question (see post #17, which is still not clear to me); the next example I gave in post #21.

Another example is the MPY instruction. A table in one TI document says it needs 52 clock cycles. [EDIT: Correct is 52+Ns (where Ns is between 0 and 8), so it varies between 52 and 60 clock cycles.] The machine cycle explanation in another document says it needs 26 + Ns machine cycles (Ns takes 1 to 5 cycles depending on addressing mode). In sum this is 27 to 31 machine cycles, which at two clock cycles per machine cycle corresponds to 54 up to 62 clock cycles. This does not match 52.
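
The conversion behind those numbers can be sketched in a few lines of Python. The factor of two clock cycles per machine cycle is taken from the figures above (27 machine cycles = 54 clock cycles); the function name `mpy_clock_range` and its parameters are made up for this illustration.

```python
CLOCKS_PER_MACHINE_CYCLE = 2  # implied by 27 machine cycles = 54 clock cycles

def mpy_clock_range(base_machine_cycles=26, ns_min=1, ns_max=5):
    """Clock-cycle range for MPY from the machine-cycle description."""
    lo = (base_machine_cycles + ns_min) * CLOCKS_PER_MACHINE_CYCLE
    hi = (base_machine_cycles + ns_max) * CLOCKS_PER_MACHINE_CYCLE
    return lo, hi

table_range = (52, 52 + 8)         # 52+Ns with Ns between 0 and 8, from the table
microcode_range = mpy_clock_range()
print(table_range, microcode_range)  # prints (52, 60) (54, 62)
```

Both ranges are 8 clock cycles wide, but they are offset by 2 clock cycles (one machine cycle), which is the discrepancy in question.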

 

Most of the explanations of the individual machine cycles of each instruction are very sketchy; they don't tell exactly what happens during an ALU cycle, for example.

 

Am I wrong? Please correct me.

There are even more examples, but I'm too lazy to name them all in detail.

 

I hope there is someone out there who can explain or can point me to information...

Edited by HackMac

Right, the 52 clock cycles seem wrong to me, going by the microinstruction cycles. It should be 54. Also, as far as I saw, the data does not have any impact on the performance of MPY.

 

Concerning the ALU cycles, they indeed do not reveal much about those phases. You can only guess what happens with the AB (address bus) and DB (data bus) values. In MESS I just put everything that I believed to happen between the memory accesses so that I would end up with the specified number of cycles.

 

I don't know a better source than the Family System Design.

 

Suggestion: Write a small benchmark program in assembly language where you put a MPY R2,R0 in a nested loop. Let it repeat about 4 million times (61 * 65536) and check whether it takes 69.2 sec (52 cycles) or 71.8 sec (54 cycles). You will have to measure the difference to an empty loop, of course.
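
As a rough check of what the benchmark should show, the expected MPY-only times can be estimated in Python. This sketch assumes the 0.333 uS clock cycle quoted earlier in the thread, no wait states, and that the loop overhead has already been subtracted; the names `loop_seconds`, `TC_US` and `ITERATIONS` are invented here.

```python
TC_US = 0.333            # clock cycle time in microseconds (stock 99/4A)
ITERATIONS = 61 * 65536  # about 4 million

def loop_seconds(cycles_per_iteration, iterations=ITERATIONS, tc_us=TC_US):
    """Total time in seconds spent in the instruction under test."""
    return cycles_per_iteration * iterations * tc_us / 1_000_000

t52 = loop_seconds(52)
t54 = loop_seconds(54)
print(f"52 cycles: {t52:.1f} s")
print(f"54 cycles: {t54:.1f} s")
print(f"difference: {t54 - t52:.1f} s")
```

The two cases come out roughly 2.7 seconds apart, so the measurement needs to be accurate to better than about a second to tell them apart.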


The TI has no clock that can measure the loop time. Or can it be done by the 9901?

I think what I need is a logic analyzer, so I can count clock cycles between memory accesses. :-(

 

But if there is a shy guy here in the forum, who is under cover and has any idea, please be courageous and contribute your part. (I believe there are more people than Lee and Michael.)
