99110 ROM disassembly

pnr · January 21, 2018

Now that the internal ROM of the 99110 processor has been read out, it becomes interesting to see what the TI engineers had put in it.

This thread is intended for posts about recreating the source code for the ROM. It is meant to be wide in scope and include discussion about the best tools for this specific job, the ins and outs of floating point formats and their implementation on the 99xx and 99xxx, etc. All contributions towards these topics are most welcome.

This first post includes a binary dump (the .bin file) and a quick disassembly using xda99 (which was the first tool I came across).

The entry table (see section 7.3.1 of the data sheet) is:

            AORG >0800
; macrostore entry vectors (see table 7 of datasheet)
;
0800 0BE0   DATA >0BE0         ; entry point for 00xx opcodes
0802 0BE0   DATA >0BE0         ; entry point for 01xx opcodes
0804 0BE0   DATA >0BE0         ; entry point for 02xx opcodes
0806 0B1C   DATA >0B1C         ; entry point for 03xx opcodes
0808 0B6E   DATA >0B6E         ; entry point for 0Cxx opcodes
080A 0B80   DATA >0B80         ; entry point for 0Dxx opcodes
080C 0BE0   DATA >0BE0         ; entry point for 0Exx opcodes
080E 0AF6   DATA >0AF6         ; entry point for 0Fxx + 07xx opcodes
0810 0BE0   DATA >0BE0         ; entry point for two-word opcodes
0812 0BE0   DATA >0BE0         ; entry point for macro XOP's


Most of the entries refer to the exit code at >0BE0. This code implements the extension interface documented in section 7.3.6 of the data sheet.

; unimplemented instructions jump here to check for external macro ROM.
; Officially, 0BE0-0BFF was reserved for factory test code
;
0BE0 C1E0   MOV  @>1000, R7       ; test macro location >1000 for >AAAA magic    
0BE2 1000
0BE4 0287   CI   R7, >AAAA
0BE6 AAAA
0BE8 1602   JNE  >0BEE            ; if not present exit
0BEA 0460   B    @>1002           ; jump to external macro code
0BEC 1002
0BEE 0382   RTWP2                 ; return & trigger ILLOP interrupt
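For readers less familiar with 99xx assembler, the exit logic at >0BE0 can be modelled in a few lines of Python. This is only a sketch: the function name `exit_code` and the dictionary standing in for macrostore memory are illustrative assumptions, not anything found in the ROM.

```python
MAGIC = 0xAAAA   # signature the ROM expects at macrostore address >1000


def exit_code(macrostore):
    """Model of the code at >0BE0 (hypothetical helper, not TI's code).

    `macrostore` maps word addresses to 16-bit values. Absent addresses
    read as 0 here, which is an assumption about unpopulated memory.
    """
    r7 = macrostore.get(0x1000, 0)     # MOV @>1000, R7
    if r7 == MAGIC:                    # CI R7, >AAAA / JNE >0BEE
        return ("branch", 0x1002)      # B @>1002: external macro code
    return ("illop", None)             # RTWP2: return & trigger ILLOP


print(exit_code({}))
print(exit_code({0x1000: 0xAAAA}))
```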

macrorom.bin

macrorom.txt

pnr · January 21, 2018

If you use an Apple Computer, you can use the TI-Disk Manager for disassembling code for nearly all processors of the 99xxx family.

It has an interactive tool (Disassembler Editor) for producing clean source code.

That is very interesting. How would I best use this tool for the job at hand? Does it support the extra 99xxx instructions? And how about the (obscure) macrostore specific instructions (EVAD, the interrupt jumps, the RTWP variants)? Can the disassembler tool be used on a stand alone basis?

pnr · January 21, 2018

And here is the code for the 8th slot in the vector table. This slot (>0Fxx, >07xx) handles only the LDS (>0780) and LDD (>07C0) opcodes.


; entry point for the 0Fxx and 07xx opcodes
; only the 74LS612 mapper variant of LDD and LDS are recognized
;
0AF6 2560   CZC  @>0B18, R5    ; is the opcode >0780 or >07C0?
0AF8 0B18
0AFA 1672   JNE  >0BE0         ; no: test for extension & exit
0AFC 27E0   CZC  @>0B1A, R15   ; are we in user mode?
0AFE 0B1A
0B00 1303   JEQ  >0B08
0B02 0300   LIMI >0000         ; yes: set up PRIVOP error
0B04 0000                      ;  (will cause INT #2 after the RTWP)
0B06 0380   RTWP
0B08 0283   CI   R3, >C000     ; is this a first LDS?
0B0A C000
0B0C 1303   JEQ  >0B14
0B0E 0283   CI   R3, >6000     ; is this a first LDD?
0B10 6000
0B12 1601   JNE  >0B16
0B14 C08E   MOV  R14, R2       ; save address+2 of first LDS/LDD in a sequence
0B16 0384   RTWP4              ; return & defer interrupt

0B18 F83F   DATA >F83F         ; reverse bit pattern of LDD/LDS
0B1A 0100   DATA >0100         ; PRIV bit in ST register


This has two interesting points.

LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) a privilege violation error must be created. This is achieved by using LIMI 0. This is a hardware instruction that is also only valid in system mode and will set the PRIVOP error bit. This normally immediately causes an INT #2 to occur. However, macrocode cannot be interrupted so it remains pending. Only after we return to normal code with the RTWP is the interrupt honored.

Saving the address+2 of the first LDD/LDS in a sequence is an obscure feature that can be used when implementing TI990/12 style interruptible instructions in an external macro ROM (also see bottom of page 107 of the data sheet). When such an instruction is used in combination with LDD/LDS and allows itself to be interrupted, it will save its progress in a checkpoint register and reset the saved PC in R14 to the address of the first LDD/LDS. After the interrupt has finished, the instruction will restart from the first LDD/LDS (setting up the hardware assists) and the interruptible instruction will restart from its checkpoint.

The 990/12 assembler manual has more information about interruptible instructions and checkpoint registers.
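The decision flow of the handler at >0AF6 can be summarised as a small Python model. It is a sketch under assumptions: the function names and the returned labels are invented for illustration, and the meaning of R3 is taken purely from the comments in the listing above.

```python
OPCODE_MASK = 0xF83F   # word at >0B18
PRIV_BIT = 0x0100      # word at >0B1A


def czc_eq(mask, value):
    """CZC sets EQ when every bit that is set in `mask` is zero in `value`."""
    return (value & mask) == 0


def lds_ldd_entry(r5_opcode, r15_status, r3):
    """Hypothetical model of the >0AF6 entry; returns a label, not machine state."""
    if not czc_eq(OPCODE_MASK, r5_opcode):
        return "extension-check"       # JNE >0BE0
    if not czc_eq(PRIV_BIT, r15_status):
        return "privop"                # LIMI 0, RTWP: INT #2 stays pending
    if r3 in (0xC000, 0x6000):         # first LDS / first LDD (per the listing)
        return "save-pc-and-defer"     # MOV R14,R2 then RTWP4
    return "defer"                     # RTWP4 only


print(lds_ldd_entry(0x0780, 0x0000, 0xC000))
print(lds_ldd_entry(0x07C0, 0x0100, 0x6000))
```

Note that the mask test only passes when the six operand bits of the opcode are zero, so within this entry group it accepts >0780 and >07C0 but not, say, >0781.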

Edited January 21, 2018 by pnr

HackMac · January 21, 2018

That is very interesting. How would I best use this tool for the job at hand? Does it support the extra 99xxx instructions? And how about the (obscure) macrostore specific instructions (EVAD, the interrupt jumps, the RTWP variants)? Can the disassembler tool be used on a stand alone basis?

To disassemble the macrorom.bin file, you should copy it onto a TI disk image, which you can create with the TI-Disk Manager. Then you drag & drop the file from the Mac to the file list of the specific TI disk image (or move the file directly onto the disk image item in the right list), which is displayed in the main window of the TI-Disk Manager, to import this binary file in PROGRAM format (you will be asked for import options while importing). Then you can right click on the newly created file while holding the option key pressed (and also the shift key? I forget). In the context menu select opening in the Disassembler Editor. The file will be disassembled and the source code will be displayed in a new window. There you should play around with the mouse. There are some tool tips and links you can use to jump to specific addresses (jump back and forth with the arrow buttons at the bottom of the window), or you can format the source using the context menu. There are also some options at the bottom of the editor window, where you can select the processor type you need.

When you are done, you can export your result as a text file on your Mac. You can close the Disassembler Editor window at any time. The disassembling session will be saved automatically, so you can continue later. Select your last session for continuing your work via the Window menu.

You can find more information on the Wiki pages at bitbucket. Please use the latest release (2.9.2) I submit this afternoon.

Enjoy

Edited January 21, 2018 by HackMac

+mizapf · January 21, 2018

LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) a privilege violation error must be created.

Is the PRIV bit negative logic? (Possibly makes sense, because the bit is set to 0 for TMS99xx, which would turn its programs to non-privileged.)

pnr · January 22, 2018

Is the PRIV bit negative logic? (Possibly makes sense, because the bit is set to 0 for TMS99xx, which would turn its programs to non-privileged.)

Yes it is negative logic and confusingly named. I think USER would have been a better name:

- When the bit is 0 the CPU is in privileged mode and can execute all instructions.

- When the bit is 1 the CPU is in unprivileged mode and I/O type instructions become restricted. Attempts to make the bit 0 again become illegal as well.

The only way to get back to privileged mode is via a reset, interrupt or XOP. As that code is normally controlled by the operating system it can retain control.

See section 6 of the data sheet for details.

+mizapf · January 22, 2018

Yes, I already had a look at the specs, in particular because this is still pending work for me to do in MAME. There was once an implementation of the TMS99xxx, but after the rewrite (already 6 years ago) I left the 99xxx unfinished, because there was no system to test it with. As for my question, I seem to have swapped the semantics of that bit in my own memory.

The privileged mode is what is typically also called the kernel mode.

For a long time I wondered what the XOP instructions with their fixed vector could be used for. Studying more about other OS and architectures, I finally understood that the XOPs are actually system calls. On the 99xx platform, the mode switch is missing, though. I found it fascinating to see that with the full support in the 99xxx, there would be no LIMI access to user programs (which is so ubiquitous in our assembly programming), and you could protect access to system resources e.g. on lower CRU addresses.

pnr · January 22, 2018

I left the 99xxx unfinished, because there was no system to test it with.

Well, now there is Stuart's excellent 99xxx project:

http://www.stuartconner.me.uk/tms99110_breadboard/tms99110_breadboard.htm

I'm having a ton of fun experimenting with the 99xxx using this board.

+mizapf · January 22, 2018

Indeed. But being the only contributor to the TI emulation, I just see my todo list getting longer all the time...

java.lang.CloneNotSupportedException at mizapf:xxxx

speccery · January 22, 2018

This has two interesting points.

LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) a privilege violation error must be created. This is achieved by using LIMI 0. This is a hardware instruction that is also only valid in system mode and will set the PRIVOP error bit. This normally immediately causes an INT #2 to occur. However, macrocode cannot be interrupted so it remains pending. Only after we return to normal code with the RTWP is the interrupt honored.

Saving the address+2 of the first LDD/LDS in a sequence is an obscure feature that can be used when implementing TI990/12 style interruptible instructions in an external macro ROM (also see bottom of page 107 of the data sheet). When such an instruction is used in combination with LDD/LDS and allows itself to be interrupted, it will save its progress in a checkpoint register and reset the saved PC in R14 to the address of the first LDD/LDS. After the interrupt has finished, the instruction will restart from the first LDD/LDS (setting up the hardware assists) and the interruptible instruction will restart from its checkpoint.

The 990/12 assembler manual has more information about interruptible instructions and checkpoint registers.

Thank you for sharing your insights, it is a very interesting read. I can only see that there is a lot of documentation to be read. It is also very interesting how the minicomputer implementation has interesting ramifications on microprocessor design for the TMS99xxx. I suppose the TMS99105 still does not have what it would take to implement virtual memory, i.e. no support for page faults and generic restartable instructions?

+mizapf · January 22, 2018

As far as I saw, there is still a limit at 16 bit addresses, maybe a second bank, but not more. (I think there is a map bit.) It is a pity that the TMS architecture does not allow for more.

The process space is typically limited by the address width. You may use some tricks like segmented memory, as Intel did, but eventually you will always have the width as the limiting factor. This is the primary reason why 32 bit systems cannot make proper use of memory beyond 4 GiB.

We know this from the Geneve which is built with an address space of up to 2 MiB, but not directly addressable; it requires mapping.

[Edit:]

One might think that this just amounts to allowing for a wider address range, like 24 bit, inside the CPU. Maybe there could be another instruction format that allows for using wider addresses, like MOV @>12ABCD,R1, which requires a three-word object code. But this is not sufficient. Since we are using registers for indirect addressing, the registers would need to be wider, too, to accommodate the longer addresses. At that point we are beyond 16-bit processing. The above-mentioned 32-bit systems limit the address bus width (for linear addressing) just by being unable to handle values with more than 32 bit. The modern 64 bit processors do not even use 64 address bits but typically offer 48 address bits.

Edited January 22, 2018 by mizapf

speccery · January 22, 2018

Yes it does still have the 16-bit address space limit. I think you are referring to the PSEL# bit, which allows two 64K banks to exist, but in the data sheets they are actually using the PSEL# line to enable a 74LS612 or other memory mapper device. That would mean that the user processes could have paged memory, in a 24 bit address space for example. What I am wondering is whether or not the TMS99000 architecture supports page faults, i.e. missing pages. That is not possible with the 74LS612 alone, but a more sophisticated MMU (memory management unit) would be needed, which could interrupt the CPU if a page is not there, in a way that would allow the page to be fetched from disk storage by the operating system (in supervisor mode).

I once wrote a paged memory system for the Intel 386 CPU's MMU, and with that you can mark in your page directory entries whether or not a given page is accessible. If it is, the CPU will just access the data, if not a page fault will occur. In the page fault handler you can decide how to serve the fault. If the fault occurred due to access to an invalid address, the process is typically killed. If it is accessing a memory page frame that is valid but not loaded, the OS could load it. This type of functionality requires restartable instructions and a hardware based translation lookaside buffer (TLB) as a minimum - it does not need all the machinery that the 386 offers. It does not make any sense in a 16-bit address space alone, but if the 16 bits can be paged to a much larger physical address space, it starts to make sense. Still I know this whole thing probably does not make sense for the TMS99000 in the greater scope of things, but just wondering if the bare ingredients in the CPU are there...

pnr · January 23, 2018

This is a very interesting and broad topic. It will be fun to discuss, but let's use a separate thread, e.g. "Designing MMU's"

Below are some short comments that can be used to kick off such a thread.

I suppose the TMS99105 still does not have what it would take to implement virtual memory, i.e. no support for page faults and generic restartable instructions?

In short:

- No, it does not have such support out of the box

- But maybe it can. Because of the 'registers in RAM' architecture, I think it might be possible with the help of external hardware (also for the 9900 & 9995)

But what is the point of demand paging when virtual memory space (64KB) is so much smaller than physical memory (say 1MB)? Maybe it only makes sense in the reverse situation. Also, does a TLB make sense when virtual memory is small?

Next to address translation, the other purpose of a MMU is memory protection. How to implement that on a 99xx/99xxx is an interesting question too.

As far as I saw, there is still a limit at 16 bit addresses, maybe a second bank, but not more. (I think there is a map bit.) It is a pity that the TMS architecture does not allow for more.

Yes, from a non-kernel program viewpoint, space is limited to 16 bits. The 99000 has two kludges to make it somewhat 17-18 bit like.

- Separation of instruction space and data space (not used on TI990 mini's). This was used with great success on PDP11 mini's and early Unix.

- The PSEL bit. Most mini's of the era had two memory spaces (kernel/user) driven by the supervisor bit in the status register. TI separated the two functions into separate bits, but when using a 74LS612 mapper or the TI990 MMU this is not fully exploited and the two bits move in tandem. With some new macro instructions PSEL could be made more useful.

Let's continue in a new thread.

pnr · January 24, 2018

I've now analysed the entry code for group >0Cxx. This group has most of the floating point instructions, see the 99110 annex of the data sheet for details.

It has revealed an undocumented opcode: 'XIT'

The macro code begins thus:

; entry point for opcodes 0Cxx
;
0B6E 0285   CI   R5, >0C3F        ; zero or one operand opcode?
0B70 0C3F
0B72 1506   JGT  >0B80            ; jump if one operand
0B74 2560   CZC  @>0BD4, R5       ; valid zero operand instruction?
0B76 0BD4
0B78 161C   JNE  >0BB2            ; no: test for 'XIT'
0B7A 0245   ANDI R5, >0006        ; if valid zero operand opcode,
0B7C 0006                         
0B7E 1011   JMP  >0BA2            ; go fetch FPAC & go to opcode routine
..
0BD4 0039   DATA >0039            ; opcode bit test pattern

The above is all straightforward.

The next macro code section proceeds to handle the source operand:

0B80 C2C5   MOV  R5, R11          ; handle one operand case
0B82 0245   ANDI R5, >01FF        ; isolate <src> bits
0B84 01FF
0B86 0105   EVAD R5               ; calculate EA
0B88 1609   JNE  >0B9C            ; Ts = 3 ?
0B8A 024B   ANDI R11, >FFC0       ; mask out operand bits
0B8C FFC0
0B8E 028B   CI   R11, >0C80       ; opcode is CIR?
0B90 0C80
0B92 1602   JNE  >0B98            ; yes: autoincrement by 2, else by 4
0B94 05DA   INCT *R10
0B96 1002   JMP  >0B9C
0B98 A6A0   A    @>0B2E, *R10     ; >0B2E contains 4
0B9A 0B2E
0B9C 0855   SRA  R5, 5            ; calculate switch index from opcode
0B9E 0225   AI   R5, >0006
0BA0 0006

This macro code section uses the special macro opcode "EVAD" (evaluate address). This instruction is documented in section 7.3.3.5 of the data manual. EVAD analyzes the operand bits in the instruction and calculates the address of the source operand. This address is placed in R8. If the operand uses the *Rx+ format, a pointer to Rx is placed in R10 and the EQ status bit is set.

As floating point numbers are 32 bit, i.e. 4 bytes, registers are auto-incremented by 4. The only exception is CIR, which has an integer word operand and increments by 2.

Then the macro code proceeds with:

0BA2 C01D   MOV  *R13, R0         ; fetch FPAC into local R0,R1
0BA4 C06D   MOV  @2(R13), R1
0BA6 0002
0BA8 C08F   MOV  R15, R2          ; save status & clear ST0-ST4
0BAA 024F   ANDI R15, >07FF       ; (status is saved to restore ST3-ST4 as needed)
0BAC 07FF
0BAE 0165   BIND @>0BBE(R5)       ; jump to specific opcode routine
0BB0 0BBE

; branch table for 0Cxx group (first 4 are zero operand)
;
0BBE 09CE   DATA >09CE            ; CRI
0BC0 08E4   DATA >08E4            ; NEGR
0BC2 09D2   DATA >09D2            ; CRE
0BC4 0A86   DATA >0A86            ; CER
0BC6 081E   DATA >081E            ; AR
0BC8 0A80   DATA >0A80            ; CIR
0BCA 0814   DATA >0814            ; SR
0BCC 08F4   DATA >08F4            ; MR
0BCE 0946   DATA >0946            ; DR
0BD0 08D2   DATA >08D2            ; LR
0BD2 08D8   DATA >08D8            ; STR

This code fetches the floating point accumulator ("FPAC") from the user's R0,R1 and places this in our local R0,R1. Then it clears ST0-ST4 in the user's status register. The various instructions will set these bits as needed. The original status register is saved in R2 because some instructions only affect ST0-ST2 and must hence restore ST3-ST4.

Then it uses the 99000 specific BIND (Branch Indirect) instruction to jump to a specific opcode handler routine via a jump table.
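The opcode-to-handler mapping of the >0Cxx/>0Dxx entry code can be checked with a short Python model. This is a sketch, not the ROM's logic verbatim: the function name is made up, and dropping the low bit of the table index is my assumption about how a word fetch from an odd address behaves (it is needed because the Ts field leaks into bit 0 of the index after the shift).

```python
# Handler names in table order, one per word at >0BBE..>0BD2
BRANCH_TABLE = ["CRI", "NEGR", "CRE", "CER", "AR", "CIR",
                "SR", "MR", "DR", "LR", "STR"]


def decode_fp_opcode(opcode):
    """Return the handler name for an opcode in >0C00..>0DFF.

    Returns "XIT" for the no-op and None when the ROM would fall
    through to the extension test at >0BE0.
    """
    if opcode <= 0x0C3F:                       # CI R5,>0C3F / JGT >0B80
        if opcode & 0x0039:                    # CZC @>0BD4,R5 not EQ
            if opcode >> 1 == 0x0607:          # >0C0E or >0C0F
                return "XIT"
            return None
        index = opcode & 0x0006                # ANDI R5,>0006
    else:
        index = ((opcode & 0x01FF) >> 5) + 6   # ANDI >01FF, SRA 5, AI 6
    # Assumption: the word fetch ignores address bit 0, so an odd
    # index still selects the intended table entry.
    return BRANCH_TABLE[index >> 1]


for op in (0x0C00, 0x0C0E, 0x0C40, 0x0C80, 0x0DC0):
    print(hex(op), decode_fp_opcode(op))
```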

That leaves the mysterious undocumented 'XIT'. It is actually rather boring:

; Test and implement XIT
;
0BB2 C1C5   MOV  R5, R7           ; test for XIT (>0C0E and >0C0F)
0BB4 0917   SRL  R7, 1            ;   XIT is a no-op
0BB6 0287   CI   R7, >0607
0BB8 0607
0BBA 1612   JNE  >0BE0            ; no: test for extension & exit
0BBC 0380   RTWP                  ; macro processing complete

XIT is an instruction on the TI990/12 and is also a NOP there. It is used as part of floating point handling by the TI990 Fortran compiler, to create code that would run both on machines with (the /12) and without (the /10) native floating point (explanation courtesy of Dave Pitts).

The Fortran compiler would for example generate:

      BLWP    @F$RITP
      LR      *R9+
      AR      *R9
      STR     *R8
      XIT

On a 990/10 the "F$RITP" routine is a floating point library that reads the instructions following the BLWP and emulates the floating point hardware. When it sees an XIT instruction it stops emulating and returns. Hence "exit interpreter" or XIT. On a 990/12 the "F$RITP" routine would be empty (i.e. do a RTWP immediately) and the 990/12 hardware would execute the floating point code natively. When it saw the XIT it would treat it as a NOP.

It would seem that the 99110 implemented the "XIT" instruction for the exact same purpose.

speccery · January 25, 2018

Thanks pnr for the explanation! You've proceeded quickly with your analysis. So did I understand correctly that the 990/12 implemented these floating point instructions natively in hardware? Again I'm thinking about an FPGA implementation of the 99000 here - the great thing is that there are standard floating point instructions then.

pnr · January 25, 2018

So did I understand correctly that the 990/12 implemented these floating point instructions natively in hardware?

The answer depends on what you mean by native.

Yes: the 990/12 had microcode for all these instructions, and also for the double precision variants (AD, SD, MD, etc.).

No: the 990/12 did not have specialized data paths to support floating point, and the microcode calculated the results using the normal 16 bit data path. Simply put, the microcode did the same operations as the macro code on a 99110. Of course, not having to fetch opcodes etc. it runs faster in microcode.

One could say this is "low end native". An example of high end native would be an FPU co-processor as existed for the PDP11:

http://www.psych.usyd.edu.au/pdp-11/11_34_fpp.html

As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggests that TI was thinking about it at the time. Had the series been more successful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231:

http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf

speccery · January 26, 2018

As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggests that TI was thinking about it at the time. Had the series been more successful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231:

http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf

Interesting, didn't know of this chip before. Looks like a predecessor of the 8087 - same stack based idea for the operands.

pnr · January 26, 2018

Let's take a look at the remaining entry point, for opcodes in the >03xx range.

It starts thus:

; Start of table entry 0806 (opcodes 03xx)
; Only >0301 (CR) and >0302 (MM) are valid on a 99110
;
0B1C 0285   CI   R5, >0302       ; CR or MM opcode?
0B1E 0302
0B20 155F   JGT  >0BE0           ; no: test extension & exit
0B22 1301   JEQ  >0B26           ; for CR clear R5 (as a flag)
0B24 04C5   CLR  R5
0B26 024F   ANDI R15, >07FF      ; clear status bits
0B28 07FF
0B2A C0BE   MOV  *R14+, R2       ; fetch second opcode word
0B2C 0206   LI   R6, >0004       ; four byte operands
0B2E 0004

It only accepts CR and MM and all other opcodes from the group are referred to the extension test. Later on we need an easy test for opcode CR versus MM, and R5 is cleared for this purpose. The status bits ST0-ST4 are cleared, as we saw with the 0Cxx opcodes. Then the second opcode word is fetched and R6 is preloaded with an auto-increment constant.

Next it prepares the source operand:

0B30 C042   MOV  R2, R1          ; extract src bits
0B32 0241   ANDI R1, >003F
0B34 003F
0B36 0101   EVAD R1              ; calculate src address
0B38 1601   JNE  >0B3C           ; if Ts = 3, autoincrement src ptr
0B3A A686   A    R6, *R10

This uses the EVAD instruction, which is discussed in data sheet section 7.3.3.5. This instruction takes a 6 bit operand field and calculates the actual address of the operand. If the modifier bits signify *Rn+ the EQ bit is set (for a source operand) and a pointer to Rn is loaded in R10. Because we are dealing with 32 bit operands the register is auto-incremented by 4 bytes.

It proceeds with preparing the destination operand:

0B3C C008   MOV  R8, R0          ; save source address during 2nd EVAD
0B3E 0242   ANDI R2, >0FC0       ; extract dst bits
0B40 0FC0
0B42 0102   EVAD R2
0B44 1B04   JH   >0B4E           ; if Td = 3, autoincrement dst ptr
0B46 C145   MOV  R5, R5          ; for MM, increment is 8
0B48 1301   JEQ  >0B4C
0B4A 0A16   SLA  R6, 1
0B4C A646   A    R6, *R9

This code is pretty much the same. For auto-increment in the destination field the L> status flag (ST0) is set and the associated pointer is in R9. Because MM has a 64 bit result, the auto-increment is upped to 8 bytes. I'm not sure why the code uses two calls on EVAD, as this instruction can do both src and dst at the same time. If anybody sees a good reason for this, please post your observations.
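The field extraction and the auto-increment amounts described above can be sketched in Python. The function name and the returned dictionary are illustrative only, and the sketch follows the description in the text (auto-increment when Ts or Td equals 3) rather than modelling EVAD's status flags.

```python
def split_second_word(word, is_mm):
    """Split the second opcode word of CR/MM into its operand fields.

    Hypothetical helper: returns the Ts/S and Td/D fields plus the
    auto-increment (in bytes) applied to each operand register.
    """
    src = word & 0x003F                 # ANDI R1, >003F
    dst = (word & 0x0FC0) >> 6          # ANDI R2, >0FC0 (shifted for readability)
    ts, s = src >> 4, src & 0xF
    td, d = dst >> 4, dst & 0xF
    src_inc = 4 if ts == 3 else 0       # 32 bit source operand
    dst_step = 8 if is_mm else 4        # MM stores a 64 bit result
    dst_inc = dst_step if td == 3 else 0
    return {"ts": ts, "s": s, "td": td, "d": d,
            "src_inc": src_inc, "dst_inc": dst_inc}


# Example: source *R9+, destination *R8+
word = (3 << 10) | (8 << 6) | (3 << 4) | 9
print(split_second_word(word, is_mm=True))
```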

With the operand access prepared the code moves on to actually fetch the operands:

0B4E C085   MOV  R5, R2          ; move opcode to R2
0B50 C200   MOV  R0, R8          ; restore source address
0B52 C038   MOV  *R8+, R0        ; fetch S to R0,R1 and D to R4,R5
0B54 C058   MOV  *R8, R1
0B56 C117   MOV  *R7, R4
0B58 C167   MOV  @2(R7), R5      
0B5A 0002

The source operand is fetched into R0,R1 and the destination into R4,R5. As we will see later, these registers are chosen for good reason. It does mean that we have to move the (first word of the) opcode out of the way to R2.

The choice to move R0 back to R8 is significant. Indirect addressing via R7/R8 generates special bus status codes. When using indirect addressing with R8/R7, the CPU generates WS and DOP/SOP bus status codes. If Td/Ts was zero during EVAD a WS cycle is used and if Td/Ts was not zero a DOP/SOP cycle is used (see section 7.3.3.4). This way, external hardware cannot tell whether an instruction is a macro instruction or implemented in microcode.

The data sheet is a bit vague, but there seems to be a mechanism that makes this work when two EVAD instructions are used as well.

Finally we get to execution:

0B5C 0706   SETO R6              ; set the MM / CR flag
0B5E C082   MOV  R2, R2          ; was opcode CR?
0B60 1604   JNE  >0B6A
0B62 0224   AI   R4, >8000       ; change sign of D
0B64 8000
0B66 0460   B    @>0830          ; perform CR = S+(-D) without store
0B68 0830
0B6A 0460   B    @>0900          ; perform MM
0B6C 0900

The execution path of CR is partly shared with AR and that of MM with MR. Hence a flag (R6) is set to keep track of which paths to follow.

CR is evaluated by calculating S+(-D), and suppressing storage of the result - just the status bits are set.
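The sign change via `AI R4, >8000` works because adding >8000 modulo 2^16 only toggles the top bit of the word. A minimal Python check (the helper name is made up for illustration):

```python
def flip_sign(hi_word):
    """Model of AI R4, >8000 on the high word of a real number."""
    return (hi_word + 0x8000) & 0xFFFF


print(hex(flip_sign(0x4110)))   # high word of +1.0
print(hex(flip_sign(0xC110)))   # high word of -1.0
```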

Edited January 26, 2018 by pnr

pnr · January 26, 2018

And here is the analysis for MM.

It starts with doing the 32x32 bit multiply:

; 32 x 32 => 64 bit multiply. S is R0,R1 and D is R4,R5
; result is in R0-R3
;
; used for both MM (R6!=0) and MR (R6==0)
; in case of MR it multiplies two 24 bit mantissas
;
0900 C085   MOV  R5, R2      ; long multiply in four 16x16 bit steps
0902 3881   MPY  R1, R2
0904 C205   MOV  R5, R8
0906 3A00   MPY  R0, R8
0908 C284   MOV  R4, R10
090A 3A81   MPY  R1, R10
090C 3804   MPY  R4, R0

090E 002A   AM   R10, R8     ; add the partial results
0910 420A
0912 1701   JNC  >0916
0914 0580   INC  R0
0916 002A   AM   R8, R1
0918 4048
091A 1701   JNC  >091E
091C 0580   INC  R0


What the above code does is easier to understand if it is written out like a manual multiplication:

                    -R0-.-R1-   = S
                    -R4-.-R5-   = D
          -------------------x
                    -R2-.-R3-   = RL = R1 x R5
               -R8-.-R9-.0000   = T1 = R0 x R5
               -RA-.-RB-.0000   = T2 = R4 x R1
          -R0-.-R1-.0000.0000   = RH = R4 x R0
          ===================+
          -R0-.-R1-.-R2-.-R3-   = R

In the above figure I've used RA for R10 and RB for R11 to keep alignment.
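To convince oneself that the two carry fix-ups are sufficient, the sequence can be replayed in Python with explicit 16-bit registers. This is a sketch: the helpers `mpy` and `am` are simplified stand-ins for the MPY and AM instructions (unsigned arithmetic only, no status bits other than carry), and all names are chosen for illustration.

```python
M16 = 0xFFFF


def mpy(a, b):
    """16 x 16 -> 32 bit unsigned multiply, returned as (high, low) words."""
    p = a * b
    return p >> 16, p & M16


def am(src_hi, src_lo, dst_hi, dst_lo):
    """32 bit add of two register pairs; returns (high, low, carry)."""
    total = ((dst_hi << 16) | dst_lo) + ((src_hi << 16) | src_lo)
    return (total >> 16) & M16, total & M16, total >> 32


def mul32(s, d):
    """Replay the code at >0900..>091C for unsigned 32 bit s and d."""
    r0, r1 = s >> 16, s & M16
    r4, r5 = d >> 16, d & M16
    r2, r3 = mpy(r1, r5)                # RL = R1 x R5
    r8, r9 = mpy(r0, r5)                # T1 = R0 x R5
    r10, r11 = mpy(r4, r1)              # T2 = R4 x R1
    r0, r1 = mpy(r4, r0)                # RH = R4 x R0
    r8, r9, c = am(r10, r11, r8, r9)    # AM R10, R8
    r0 = (r0 + c) & M16                 # JNC / INC R0
    r1, r2, c = am(r8, r9, r1, r2)      # AM R8, R1
    r0 = (r0 + c) & M16                 # JNC / INC R0
    return (r0 << 48) | (r1 << 32) | (r2 << 16) | r3


print(hex(mul32(0xFFFFFFFF, 0xFFFFFFFF)))
```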

The last bit is nothing more than storing the result and setting the status flags:

091E D186   MOVB R6, R6      ; is this a MR or MM instruction?
0920 1607   JNE  >0930       ; jump if MM

[ a little code for MR skipped ]

0930 CDC0   MOV  R0, *R7+    ; MM: store 8 byte result in D
0932 CDC1   MOV  R1, *R7+
0934 CDC2   MOV  R2, *R7+
0936 C5C3   MOV  R3, *R7
0938 E001   SOC  R1, R0      ; if result is 0, set EQ flag
093A E002   SOC  R2, R0
093C E003   SOC  R3, R0
093E 1602   JNE  >0944
0940 026F   ORI  R15, >2000
0942 2000    
0944 0380   RTWP             ; macro execution complete


That is one math operation out of the way.

Edited January 26, 2018 by pnr

pnr · January 29, 2018

Time to work with floating point ("real") numbers. The simplest ones are CIR and CER, which convert a 16 bit or a 32 bit integer into a real number. The two instructions share nearly all of their code.

The TI990 floating point format is described in section B.4 of the data sheet (i.e. in the 99110 appendix). It is the IBM360 single precision format. In summary, a real number is expressed as:

N = S x 0.MMMMMM x 16 ^ EE

S is the sign bit, M is a 'mantissa' of 6 hex digits (note: always unsigned) and EE an exponent with 7 bits, i.e. the exponent range is -64 to +63. The number is 'normalized' so that the first hex digit of the mantissa is always non-zero (this keeps accuracy to a maximum). This is achieved by shifting the mantissa the required number of hex digits and adjusting the exponent accordingly. The exponent is in "excess 64" format; this means that it has 64 added to it. In this way the range becomes 0 to 127 and we can work with the exponent as an unsigned number.

The code in the 99110 ROM often splits a real number into its component parts to work with them independently and recombines the components at the end of the calculation. Often the mantissa is calculated in more precision than 6 hex digits (= 24 bits) to reduce rounding errors.
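As a quick illustration of the format described above, here is a Python sketch that decodes a 32 bit real into a host floating point value. The function name is an assumption for illustration; only the field layout (sign bit, 7 bit excess-64 exponent, 24 bit mantissa) comes from the description.

```python
def decode_real(word32):
    """Decode a 32 bit IBM360-style single precision real (sketch)."""
    sign = -1.0 if word32 & 0x80000000 else 1.0
    exponent = ((word32 >> 24) & 0x7F) - 64            # remove excess 64
    mantissa = (word32 & 0x00FFFFFF) / float(1 << 24)  # 0.MMMMMM
    return sign * mantissa * 16.0 ** exponent


print(decode_real(0x41100000))
print(decode_real(0xC1200000))
print(decode_real(0x42640000))
```

For example, >41100000 has exponent >41 - 64 = 1 and mantissa 0.100000 (hex), i.e. 1/16 x 16^1.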

Let's start with CIR. After the entry code that was analyzed earlier in this thread, it starts with:

; entry point for CIR
;
0A80 C018   MOV  *R8, R0       ; fetch S and sign extend into R0,R1
0A82 C040   MOV  R0, R1
0A84 08F0   SRA  R0, 15

This fetches the 16 bit integer operand and sign-extends it to a 32 bit operand located in our local floating point accumulator, FPAC. The rest of the code can now be the same as for CER. That instruction starts with:

; entry point for CER
;
0A86 026F   ORI  R15, >1000    ; set C bit unconditionally
0A88 1000

0A8A C080   MOV  R0, R2        ; if S is zero, clear FPAC & finish
0A8C E081   SOC  R1, R2
0A8E 13A4   JEQ  >09D8

This sets the C bit unconditionally, which is how the data sheet specifies it. I'm not sure why this is useful: comments welcome. Then it special-cases a zero operand; we'll look at that further at the end.

Then we begin the conversion: the integer is separated into a sign bit and an unsigned number:

0A90 C1C0   MOV  R0, R7        ; extract sign bit
0A92 0247   ANDI R7, >8000
0A94 8000
0A96 1304   JEQ  >0AA0         ; if negative, negate the number
0A98 0540   INV  R0
0A9A 0501   NEG  R1
0A9C 1701   JNC  >0AA0
0A9E 0580   INC  R0

In effect we now have S in R7 and a (32 bit) mantissa in R0,R1. The exponent is implicitly 0. However, the number is not normalized, as the mantissa must be 0.MMMMMM, and it is now MMMMMMMM.0. Conceptually, this can easily be fixed by saying the decimal point is not to the right of the mantissa, but to its left and setting the exponent to +8. Including the excess-64 the exponent becomes 72, or >48 in hex:

0AA0 0206   LI   R6, >0048     ; start exponent at +8
0AA2 0048

.

We're still not done, because the integer number may have had leading zeros, and the mantissa must always start with a non-zero digit. As the number cannot be zero (we excluded that case above), this can always be achieved by shifting the mantissa between 0 and 7 hex digits to the left and adjusting the exponent accordingly:

0AA4 C000   MOV  R0, R0        ; if top word zero, shift 4 nibbles
0AA6 1604   JNE  >0AB0
0AA8 C001   MOV  R1, R0
0AAA 04C1   CLR  R1
0AAC 0226   AI   R6, -4        ; and adjust exponent accordingly
0AAE FFFC

0AB0 D000   MOVB R0, R0        ; if top byte zero, shift 2 nibbles
0AB2 1603   JNE  >0ABA
0AB4 001D   SLAM R0, 8
0AB6 4200  
0AB8 0646   DECT R6            ; and adjust exponent accordingly

0ABA C080   MOV  R0, R2        ; if top nibble is zero, shift one nibble
0ABC 0242   ANDI R2, >F000
0ABE F000
0AC0 1603   JNE  >0AC8
0AC2 001D   SLAM R0, 4
0AC4 4100
0AC6 0606   DEC  R6            ; and adjust exponent accordingly

.

After the above steps, we have the sign bit in R7, the mantissa in R0,R1 and the exponent in R6. The last step to make the real number is combining all component parts:

0AC8 06C6   SWPB R6            ; merge exponent (R6), mantissa (R0,R1) and
0ACA 001C   SRAM R0, 8         ; sign (R7) together
0ACC 4200
0ACE D006   MOVB R6, R0
0AD0 E007   SOC  R7, R0

Note that the mantissa is shifted 8 bits to make room for the sign and exponent. This loses 8 bits of accuracy. The lost bits are truncated, i.e. the remaining 24 bit mantissa is not rounded up if the lost bits are >80 or above. Such rounding could have been achieved by adding >0000 0080 to the mantissa, using the AM instruction (the top *bit* of the mantissa will always be zero, thus this cannot overflow). However, the ROM is almost full and I don't think there is space left to add such rounding to all floating point instructions.
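The conversion steps above can be modelled in a few lines of Python. This is a sketch of my reading of the listing, not a verified emulation; the function name `cer` is just borrowed from the opcode and the loop stands in for the three separate shift tests in the ROM.

```python
def cer(value):
    """Model of the CER steps above: 32 bit signed int -> 32 bit real."""
    if value == 0:
        return 0                            # zero is special-cased in the ROM
    sign = 0x80000000 if value < 0 else 0
    mant = abs(value) & 0xFFFFFFFF          # unsigned magnitude (R0,R1)
    exp = 0x48                              # +8 plus the excess 64
    for nibbles in (4, 2, 1):               # the three shift tests
        bits = 4 * nibbles
        if mant >> (32 - bits) == 0:        # top 4, 2 or 1 nibbles all zero?
            mant = (mant << bits) & 0xFFFFFFFF
            exp -= nibbles
    # drop the low 8 bits (truncation, no rounding) and merge the fields
    return sign | (exp << 24) | (mant >> 8)

print(hex(cer(1)), hex(cer(-32768)), hex(cer(0x01234567)))
```

For CIR the only difference is that the 16 bit operand is sign-extended first, which Python integers do implicitly. The last example shows the truncation: the low digits >67 of the integer do not fit in six hex digits and are simply dropped.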

What remains is storing the number in the user's FPAC and setting the status bits appropriately:

0AD2 10BB   JMP  >0A4A         ; store FPAC & finish
..
0A4A 10B4   JMP  >09B4
..
; compare FPAC against zero & store result
;
09B4 C000   MOV  R0, R0            ; test sign
09B6 1105   JLT  >09C2             ; if negative only set L> bit
09B8 1602   JNE  >09BE             ; if positive set L> and A> bits
09BA C041   MOV  R1, R1            ; if zero only set EQ bit
09BC 13F6   JEQ  >09AA
09BE 026F   ORI  R15, >C000        ; set L> and A> status bits
09C0 C000
09C2 026F   ORI  R15, >8000        ; set L> status bit
09C4 8000

09C6 C740   MOV  R0, *R13           ; store FPAC
09C8 CB41   MOV  R1, @2(R13)
09CA 0002
09CC 0380   RTWP                    ; macro code complete

.

This bit of code is used at the end of nearly all floating point routines. Note that the macro entry code has already reset ST0-ST4 in R15, so only setting the right bits remains. The handling of a zero result is done in a separate routine.
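As a cross-check of the branches in the exit code, here is a small Python model of which status bits end up set. The constant names are mine; the bit values are the ones used in the ORI instructions of the listings.

```python
ST_LGT, ST_AGT, ST_EQ = 0x8000, 0x4000, 0x2000   # ST0 (L>), ST1 (A>), ST2 (EQ)

def compare_status(r0, r1):
    """Status bits set by the exit code at >09B4 for FPAC in r0,r1."""
    if r0 & 0x8000:                  # sign bit set: negative
        return ST_LGT                # only "logical greater than"
    if r0 == 0 and r1 == 0:          # both words zero
        return ST_EQ
    return ST_LGT | ST_AGT           # positive
```

A negative real sets only L> because, seen as an unsigned number, a word with the top bit set is "logically" larger than zero.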

The "result is zero" exit routine is also heavily used, including by the CIR and CER instructions (remember the test for zero at the start of that code):

..
09D8 13E6   JEQ  >09A6
..

; clear FPAC, set EQ status bit & store
;
09A6 04C0   CLR  R0
09A8 04C1   CLR  R1
09AA 026F   ORI  R15, >2000
09AC 2000
09AE 100B   JMP  >09C6             ; store FPAC & exit

.

That concludes the first two floating point instructions.

Edited January 29, 2018 by pnr

pnr · January 31, 2018

Today a look at three short routines, implementing STR, LR and NEGR, which store, load or negate the accumulator ("FPAC") respectively.

The code is:

; entry point for LR
;
08D2 C038   MOV  *R8+, R0       ; load S into local FPAC
08D4 C058   MOV  *R8, R1
08D6 1002   JMP  >08DC

; entry point for STR
;
08D8 CE00   MOV  R0, *R8+       ; store FPAC into S
08DA C601   MOV  R1, *R8
08DC 0242   ANDI R2, >1800      ; C and AF status bits unaffected
08DE 1800
08E0 E3C2   SOC  R2, R15
08E2 1068   JMP  >09B4          ; store result, set flags & finish

; code for NEGR
;
08E4 0242   ANDI R2, >1800      ; C and AF status bits unaffected
08E6 1800
08E8 E3C2   SOC  R2, R15
08EA C000   MOV  R0, R0         ; is FPAC zero?
08EC 135C   JEQ  >09A6          ; yes, set EQ flag & finish
08EE 0220   AI   R0, >8000      ; no, invert sign bit
08F0 8000
08F2 1060   JMP  >09B4          ; store result, set flags & finish

The code is really very simple and straightforward. The only special thing is that these instructions only affect status bits ST0-2, and hence ST3 and ST4 are restored from the copy of R15 that was made in the generic entry routine.

Next up will be a look at CRI and CRE, which share a lot of code and appear to have a few corner case bugs.

pnr · February 3, 2018

Conversion from floating point back to integers is done with CRI and CRE, for a 16 bit or 32 bit integer respectively. In principle this is just the reverse of CIR and CER that were analyzed above, but it is a bit more involved as the code has to check for overflow: the real number may be larger than what fits in the integer.

In my view the code in the macro rom for CRI and CRE is a bit convoluted and borderline buggy, but maybe I don't understand the code right. Better insights are welcome.

The code for CRI and CRE starts with:

; CRI: convert real to integer
;
09CE 04C8   CLR  R8
09D0 1001   JMP  >09D4

; CRE: convert real to extended
;
09D2 0708   SETO R8

09D4 04C2   CLR  R2                ; prepare for 48 bit shift in R0,R1,R2

09D6 C1C0   MOV  R0, R7            ; if FPAC is zero, nothing to do:
09D8 13E6   JEQ  >09A6             ;   store zero result & exit

CRI and CRE share most of their code, using R8 as a flag to keep track. Also, the case where the real number is zero is special cased so that the remaining code can assume that the number is in standard format. The register R2 is cleared, the reason for which becomes clear further below. The test for zero has the side effect of saving the sign bit in R7.

The next bit of code is also clear:

09DA C180   MOV  R0, R6            ; separate mantissa
09DC 7000   SB   R0, R0            ;     and put exponent in R6
09DE 06C6   SWPB R6
09E0 0246   ANDI R6, >007F
09E2 007F

It separates out the mantissa (into R0,R1) from the exponent (into R6) and the sign bit (already in R7). Now the mantissa is in 0.MMMMMM format, and this must be converted to MMMMMMMM.0 format, i.e. the reverse operation of that in CIR and CER. This only works if the exponent is in the range +1 to +8 (= +65 to +72 including the excess 64). If the exponent is less than 1 the real number is between (and excluding) +1 and -1 and will be truncated to 0. If the exponent is larger than 8, the number does not fit in 32 bits. This is all handled by the following code:

09E4 0226   AI   R6, -65           ; is exponent at least 1?
09E6 FFBF
09E8 112D   JLT  >0A44             ; if less than 1, result is zero
  
09EA 0506   NEG  R6                ; get 32 bit result in R1,R2
09EC 0226   AI   R6, >0009         ; by shifting mantissa between
09EE 0009                          ; 2 and 9 hex digits right.
09F0 0606   DEC  R6
09F2 1108   JLT  >0A04
09F4 001C   SRAM R1, 4
09F6 4101
09F8 0A41   SLA  R1, 4
09FA 001C   SRAM R0, 4
09FC 4100
09FE 0240   ANDI R0, >0FFF         ;   (bug: superfluous?)
0A00 0FFF
0A02 10F6   JMP  >09F0

0A04 C100   MOV  R0, R4            ; if exponent was >8, R4 will be non-zero

First it tests for an exponent less than 1 and returns a zero result if so. The test for +8 is skipped as this is handled in another way that will become clear shortly. Instead it calculates the number of places that the mantissa has to be shifted. It uses a 48 bit shift in R0-R1-R2, shifting the mantissa between 2 and 9 nibbles (hex digits) right: the loop runs 10 minus the exponent times, and the exponent is at least 1 at this point. This leaves the mantissa in MMMMMMMM.0 format in R1,R2 and leaves R0 zero. Note that for a large number the rightmost 2 hex digits will be zero as the mantissa only has 6 hex digits.

The test for an exponent larger than 8 is implicit: the mantissa will be shifted 1 or 0 nibbles and R0 will not be zero. This fact is used later when the result is tested for being in range.

In the above code AND-ing out the top digit of R0 seems superfluous: The top byte has been set to zero when the exponent and sign were separated out and hence SRAM will always shift in zeroes. Perhaps this is a leftover from earlier code. I would have thought it more logical to leave the mantissa in R0,R1 and first shift it two places to the left, followed by 0 to 7 places to the right (i.e. the exact reverse of what is done in the CIR/CER code). This would have required a separate test for the exponent being out of range, but the code would still have been shorter and faster, I think. In that code structure the AND-ing out would have been necessary.
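To check the shift counts, here is a Python sketch of the code from >09DA to >0A04 as I read it. It treats R0,R1,R2 as one 48 bit integer instead of looping nibble by nibble, and it simply returns zeroes for the "less than 1" case that the ROM handles at >0A44. The function name is made up.

```python
def shift_to_integer(r0, r1):
    """Returns (r1, r2, r4) after the 48 bit shift, for FPAC in r0,r1."""
    exp = (r0 >> 8) & 0x7F                     # excess-64 exponent
    if exp < 65:                               # |value| < 1 truncates to zero
        return 0, 0, 0
    acc = ((r0 & 0x00FF) << 32) | (r1 << 16)   # mantissa in R0,R1 with R2 = 0
    shifts = max(0, 10 - (exp - 64))           # nibble shifts the loop performs
    acc >>= 4 * shifts
    r4 = (acc >> 32) & 0xFFFF                  # left-over in R0: exponent was > 8
    return (acc >> 16) & 0xFFFF, acc & 0xFFFF, r4
```

With an exponent of +1 this shifts 9 nibbles, with +8 it shifts 2 nibbles, and with +9 or more part of the mantissa stays behind in R0 (copied to R4), which is what the later range test relies on.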

Next we come to handling the sign bit and range tests. Here the code for CRI and CRE diverges again:

0A06 C208   MOV  R8, R8            ; opcode was CRE or CRI?
0A08 160D   JNE  >0A24

0A0A C002   MOV  R2, R0            ; CRI: fit result in 16 bits
0A0C C1C7   MOV  R7, R7            ; if real was negative, negate int
0A0E 1501   JGT  >0A12             ;   (bug: should jump to >0A18)
0A10 0500   NEG  R0
0A12 0282   CI   R2, >8000         ; value -32768 is okay
0A14 8000
0A16 1302   JEQ  >0A1C

0A18 C082   MOV  R2, R2            ; check range -32767..+32767
0A1A 11B0   JLT  >097C             ; -> report overflow (>0A20?)
0A1C E101   SOC  R1, R4            ; number was >65535?
0A1E 1314   JEQ  >0A48             ; no: store result (bug: should be >0A46)
0A20 04C1   CLR  R1
0A22 10AC   JMP  >097C             ; report overflow

First we test for a negative sign and negate the 16 bit integer as necessary. There is also a check for the value -32768, which is okay whereas +32768 is out of range. The jump instruction seems to be wrong and allows +32768 as well. This bug means that the real number +32768 is converted to the integer -32768 instead of being reported as an overflow error.

Next is the check that the (unsigned) mantissa was in the proper range of -32767 to +32767 and an overflow is reported if outside. Also if the mantissa was larger than 65535 or the exponent was larger than 8, an overflow error is reported.

A last bit of strangeness is the value of R1 upon return. The documentation is silent on what value R1 should have. In some cases it is set to zero, in other cases the absolute value of the number is left behind. Changing the destination address of one jump ensures that R1 is always set to zero.
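The +32768 corner case is easy to reproduce in a small Python model of the CRI tail (>0A0A to >0A22). This is my interpretation of the listing only; the `fixed` flag models the changed jump target (>0A18) suggested in the comments, and the function name is invented.

```python
def cri_finish(negative, r1, r2, r4, fixed=False):
    """Returns (r0, overflow) for the magnitude in r1,r2 and left-over r4."""
    r0 = (-r2) & 0xFFFF if negative else r2        # NEG R0 when negative
    # The ROM reaches the ">8000 is okay" test on both paths; with the
    # suggested fix (JGT >0A18) only negative numbers would reach it.
    special_case = r2 == 0x8000 and (negative or not fixed)
    too_big = r2 >= 0x8000 and not special_case    # MOV R2,R2 / JLT
    overflow = too_big or (r1 | r4) != 0           # SOC R1,R4 / JEQ
    return r0, overflow

print(cri_finish(False, 0, 0x8000, 0))               # ROM behaviour for +32768
print(cri_finish(False, 0, 0x8000, 0, fixed=True))   # with the changed jump
```

The model ignores what is left in R1 on exit; it only tracks the 16 bit result and whether overflow is reported.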

The range check for CRE is similar (including bugs):

0A24 C001   MOV  R1, R0            ; CRE: fit result in 32 bits
0A26 C1C7   MOV  R7, R7            ; if real was negative, negate 32 bit
0A28 1504   JGT  >0A32             ;   (bug: should jump to >0A38)
0A2A 0540   INV  R0
0A2C 0502   NEG  R2
0A2E 1701   JNC  >0A32
0A30 0580   INC  R0
0A32 0281   CI   R1, >8000         ; value -2147483648 is okay
0A34 8000                          ;   (note: test cannot be exact)
0A36 1302   JEQ  >0A3C

0A38 C041   MOV  R1, R1            ; check range -2147483647..+2147483647
0A3A 1102   JLT  >0A40             ; -> report overflow
0A3C C104   MOV  R4, R4            ; number was >4294967295?
0A3E 1304   JEQ  >0A48             ; no: store result
0A40 C042   MOV  R2, R1
0A42 10EF   JMP  >0A22             ; report overflow

The code for handling the sign bit is a bit longer as it has to negate a 32 bit number. Again the jump for a positive number seems to be off, not skipping the test for -2147483648 as within range.

However, the test for -2147483648 is conceptually wrong: that number cannot be expressed accurately in a single precision floating point number: it requires 8 hex digits of accuracy and the IBM360 format only has 6. The result is that a number like -2.14750e9 (which is definitely out of range) is reported as okay. The mantissa for -2.14750e9 is >800040 and this ends up in R1,R2 as >80004000. After negating this becomes >7FFFC000 which is +2147467264. Something similar happens for +2.14750e9.

It would have been better to exclude trying to handle the -2147483648 case altogether and simply suffice with the -2147483647..+2147483647 range test (which due to the six digit accuracy is actually a test for -2147483392..+2147483392).
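The numbers in this example are quick to verify with Python integers (a worked check only, nothing here comes from the ROM):

```python
mantissa = 0x800040                   # six hex digits, exponent +8
magnitude = mantissa << 8             # value as it lands in R1,R2
negated = (-magnitude) & 0xFFFFFFFF   # 32 bit two's complement negate
largest = 0x7FFFFF << 8               # biggest six digit value with top bit clear

print(hex(magnitude), magnitude)
print(hex(negated), negated)
print(hex(largest), largest)
```

Because the top word of the magnitude is exactly >8000, the "CI R1, >8000" test accepts it, even though the low word is not zero.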

The last bit of code deals with clearing out the FPAC when the real number truncates to zero (as tested for at the start of the code) and setting the high word (R1) of FPAC as necessary:

0A44 04C0   CLR  R0                ; clear FPAC
0A46 04C2   CLR  R2

0A48 C042   MOV  R2, R1            ; set high word of FPAC
0A4A 10B4   JMP  >09B4             ; store result & exit

That only leaves the reporting of an overflow condition:

; overflow: set C and AF status bits & store result
097C 026F   ORI  R15, >1800
097E 1800
0980 1019   JMP  >09B4        ; store FPAC & status bits

All it does is set the C and AF (arithmetic fault) status bits (the C bit indicates it is an overflow, not an underflow) and then perform a normal return. However, if the AFIE status bit (arithmetic fault interrupt enable) was also set, this means that immediately after the exit from macrocode a level 2 interrupt is generated. If the AFIE bit is not set, the user program must separately check for the AF error bit being set.

All in all, as I understand it, the code for CRI and CRE has two corner case bugs and looks a bit suspect in two other places. Perhaps it was written the day after the Christmas party. I wonder if the corner case bugs were known back in the day (perhaps the corner cases did not matter enough to be detected).

Edited February 3, 2018 by pnr

pnr · February 5, 2018

With all the supplementary operations out of the way, time to analyze the arithmetic floating point operations: MR, DR, AR and SR. First up is MR.

To understand the code, let's first look at the math involved. Suppose we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

S1 x 0.M1 x 16 ^ E1

and

S2 x 0.M2 x 16 ^ E2

The product will be:

S1 x 0.M1 x 16 ^ E1 x S2 x 0.M2 x 16 ^ E2

which is the same as:

(S1 x S2) x (0.M1 x 0.M2) x (16 ^ E1 x 16 ^ E2)

which is the same as:

(S1 x S2) x (0.M1 x 0.M2) x 16 ^ (E1 + E2)

This last formula is what the code calculates.

The code begins with:

; entry point for MR
;
08F4 C138   MOV  *R8+, R4   ; is multiplier equal to zero?
08F6 1357   JEQ  >09A6      ; yes: set FPAC to zero & finish

This handles the case where the accumulator is multiplied by zero: the result is zero.

Next comes a subroutine that handles the exponents and the sign bits:

08F8 06A0   BL   @>0A4C     ; separate & add exponents
08FA 0A4C
08FC A187   DATA >A187      ; = "A R7, R6" (for MR add exponents)
08FE FFC0   DATA -64        ; = subtract double excess 64

The subroutine is followed by two data words, which make it usable for both multiplication and division. The function of the data words will become clear when walking through the subroutine code.

THE SUBROUTINE

; subroutine for MR and DR: calculate result exponent and sign
;
0A4C C000   MOV  R0, R0        ; is FPAC zero?
0A4E 13C4   JEQ  >09D8         ; yes: set flags & finish

0A50 C158   MOV  *R8, R5       ; fetch 2nd word of operand

The subroutine starts with a check for FPAC equalling zero (i.e. the multiplicand or the numerator is zero); in that case the result is zero too. Next, it fetches the second word of the operand which had not been fetched earlier. The code can now rely on both FPAC (R0,R1) and the operand (R4,R5) being in standard normalized format. The first thing it does is separating the mantissa from the sign bits and exponents:

0A52 C180   MOV  R0, R6        ; save exponents in R6 and R7
0A54 C1C4   MOV  R4, R7
0A56 7000   SB   R0, R0        ; remove exponents from mantissas
0A58 7104   SB   R4, R4

The next thing is multiplying the sign bits:

0A5A C207   MOV  R7, R8        ; figure out sign of result in R8
0A5C 2A06   XOR  R6, R8

Multiplying two bits is the same as taking their exclusive OR. Note that the top bit in R8 will have the sign of the result, but the other 15 bits are not zero -- the other bits are meaningless to the multiplication. This is followed by placing the excess-64 exponents as proper integers in R6 and R7:

0A5E 06C6   SWPB R6            ; place FPAC exponent in R6
0A60 0246   ANDI R6, >007F
0A62 007F
0A64 06C7   SWPB R7            ; place operand exponent in R7
0A66 0247   ANDI R7, >007F
0A68 007F

Now we are ready to add the two exponents together (or subtract them for division). This is where the two data words that followed the subroutine call are used:

0A6A 04BB   X    *R11+         ; MR: "A R7,R6", DR: "S R7,R6"
0A6C A1BB   A    *R11+, R6     ; MR: -64,       DR: +64
0A6E 0286   CI   R6, >007F     ; exponent in range?
0A70 007F
0A72 15D7   JGT  >0A22         ; jump on overflow
0A74 1B9D   JH   >09B0         ; jump on underflow

First it executes the instruction in the first data word. For MR this is "A R7,R6", which adds the exponents. However, by adding the exponents the excess 64 is now included twice and must be removed once. The next data word contains "-64", which is added to the exponents. The result is that the right excess-64 exponent is now in R6. This is followed by a range check. Here the utility of the excess-64 encoding becomes clear to see. I'm not sure why the jump to overflow at >097C is done via >0A22: the real target is (just) within range. The subroutine finishes by merging the result sign bit back into the result exponent:

0A76 0A18   SLA  R8, 1         ; put sign bit back in exponent
0A78 1702   JNC  >0A7E
0A7A 0226   AI   R6, >80
0A7C 0080

0A7E 045B   RT

END OF SUBROUTINE
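In Python terms, the subroutine boils down to the following sketch. The two data words after the BL are modelled by a `divide` flag; the function name and the ('ok', ...) style return value are my own choices, not anything from the ROM.

```python
def result_sign_exponent(fpac_hi, operand_hi, divide=False):
    """Model of the subroutine at >0A4C, given the high words of FPAC and
    the operand. Returns (status, merged sign/exponent byte or None)."""
    sign = ((fpac_hi ^ operand_hi) >> 15) & 1        # XOR of the sign bits
    e1 = (fpac_hi >> 8) & 0x7F                       # excess-64 exponents
    e2 = (operand_hi >> 8) & 0x7F
    # MR: "A R7,R6" then -64.  DR: "S R7,R6" then +64.
    exp = e1 - e2 + 64 if divide else e1 + e2 - 64
    if exp > 0x7F:
        return 'overflow', None                      # JGT after CI R6,>007F
    if exp < 0:
        return 'underflow', None                     # JH: negative looks "high"
    return 'ok', (sign << 7) | exp
```

The underflow test works because a negative exponent, seen as an unsigned word, is "logically higher" than >007F.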

Now we can go back to the main MR routine at >0900. This happens to be the 32 x 32 -> 64 bit multiplication routine that we already saw as part of the MM instruction:

0900 C085   MOV  R5, R2      ; long multiply in four 16x16 bit steps
0902 3881   MPY  R1, R2
0904 C205   MOV  R5, R8
0906 3A00   MPY  R0, R8
0908 C284   MOV  R4, R10
090A 3A81   MPY  R1, R10
090C 3804   MPY  R4, R0

090E 002A   AM   R10, R8     ; add the partial results
0910 420A
0912 1701   JNC  >0916
0914 0580   INC  R0
0916 002A   AM   R8, R1
0918 4048
091A 1701   JNC  >091E
091C 0580   INC  R0

I'll not discuss it again, simply scroll up to the analysis of the MM instruction for detail on the above code. In essence it multiplies R0,R1 by R4,R5 leaving its result in R0..R3.
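For reference, the arithmetic behind the four MPY steps is just the schoolbook product of two 2-digit numbers in base 65536. A Python sketch (function name invented; Python integers do not overflow, so the two "INC R0" carry fix-ups have no counterpart here):

```python
def mul32x32(r0, r1, r4, r5):
    """64 bit product of R0,R1 and R4,R5 from four 16x16 bit products."""
    low = r1 * r5               # low  x low
    mid = r0 * r5 + r1 * r4     # the two cross products, summed
    high = r0 * r4              # high x high
    return (high << 32) + (mid << 16) + low
```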

Next comes the bit of MR code that was skipped in the MM discussion. That code is:

091E D186   MOVB R6, R6      ; is this a MR or MM instruction?
0920 1607   JNE  >0930       ; jump if MM

0922 D001   MOVB R1, R0      ; MR: prenormalize mantissa
0924 06C0   SWPB R0
0926 D042   MOVB R2, R1
0928 06C1   SWPB R1
092A 06C2   SWPB R2
092C C0C6   MOV  R6, R3
092E 10C2   JMP  >08B4

First the flag byte in the upper half of R6 is checked. For MR this will be zero, as the exponent cannot be larger than >007F.

Next the code pre-normalizes the mantissa by moving it two hex digits (one byte) to the left. The simple way to think about this is that we are multiplying two 24 bit mantissas into a 48 bit result. We are only interested in the top 24 bits of that result and moving two digits to the left places these 24 bits in R0,R1 properly aligned for combination with the sign and exponent. The more precise way to think about this is that we are doing fixed point arithmetic here, and that a six digit shift right is needed to keep the hexadecimal point in the right place; shifting two hex digits to the left and taking the high two words is functionally the same (and leaves some extra digits available).

However, we are not done as it is possible that the first hex digit is still zero. This is easy to see when using two decimal examples:

0.10 x 0.10 = 0.01 and 0.99 x 0.99 = 0.98

Even though we have kept the decimal point in the right place, the first digit can still be zero in some cases. To normalize this there is a routine that is shared by the other arithmetical operations. This routine expects the result sign/exponent in R3 and so it is moved there first. It also expects the next hex digit in the top of R2.

The shared tail routine is:

; normalize FPAC mantissa (leftward)
;
08B4 0280   CI   R0, >000F      ; is the highest nibble 0?
08B6 000F
08B8 1509   JGT  >08CC          ; no: mantissa is normalized
08BA 24E0   CZC  @>0BD6, R3     ; exponent already 0?
08BC 0BD6
08BE 1378   JEQ  >09B0          ; yes: underflow
08C0 0603   DEC  R3             ; reduce exponent & shift mantissa one nibble    
08C2 001D   SLAM R0,4
08C4 4100
08C6 09C2   SRL  R2, 12         ; shift in one nibble extra precision
08C8 A042   A    R2, R1
08CA 10F4   JMP  >08B4

..
0BD6 007F   DATA >007F            ; exponent bits
..

First it checks whether the first mantissa digit is zero. If not, the mantissa is already normalized. If it is, it checks the exponent. If that is already zero, the mantissa cannot be shifted further: it would require the exponent to be reduced by one, which would put it out of range (the true exponent would move from -64 to -65). In that case an underflow is reported.

In the other case, the exponent is reduced and the mantissa shifted left by one. To keep accuracy, a 'spare' extra digit of precision kept in R2 is shifted in. Because it is a common tail, the routine will check if further shifts are necessary, but in the case of MR it will only ever perform one shift. After that, only merging the result exponent back in remains:

08CC 06C3   SWPB R3             ; merge exponent back in
08CE D003   MOVB R3, R0
08D0 1071   JMP  >09B4          ; store FPAC & set status bits

.
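Putting the pre-normalization and the shared tail together, the mantissa part of MR can be sketched in Python as below. This is a simplified model under my reading of the listings (the product is handled as one 48 bit integer rather than as byte moves between registers), and the function name is hypothetical.

```python
def mr_mantissa(m1, m2, exp):
    """Multiply two normalized 24 bit mantissas; exp is the excess-64
    result exponent. Returns (mantissa, exp), or None on underflow."""
    product = m1 * m2                    # 48 bit product
    r01 = product >> 24                  # top 24 bits ("prenormalize")
    spare = (product >> 20) & 0xF        # next hex digit, kept in R2
    while r01 <= 0x0FFFFF:               # leading hex digit still zero?
        if exp == 0:
            return None                  # exponent cannot go below 0
        exp -= 1
        r01 = (r01 << 4) | spare         # shift in the spare digit
        spare = 0                        # only one spare digit is available
    return r01, exp

print(mr_mantissa(0x100000, 0x200000, 0x42))   # mantissas of 1.0 and 2.0
```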

The code for underflow is simple, and very similar to the code for overflow:

; underflow: additionally set AF status bit
09B0 026F   ORI  R15, >0800
09B2 0800

<continues with normal exit code at >09B4>

Underflow only sets the arithmetic fault (AF) status bit. This allows the user program to distinguish overflow (C bit also set) from underflow.

pnr · February 11, 2018

Next up is floating point division, the "DR" instruction. It has a clever algorithm, but also a strange bit in its implementation.

First let's look at the algorithm it uses. As with multiplication, we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

S1 x 0.M1 x 16 ^ E1

and

S2 x 0.M2 x 16 ^ E2

The division will be:

(S1 x 0.M1 x 16 ^ E1) / (S2 x 0.M2 x 16 ^ E2)

which is the same as:

(S1/S2) x (0.M1 / 0.M2) x (16 ^ E1 / 16 ^ E2)

which is the same as:

(S1/S2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

which is the same as:

(S1xS2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

The subroutine that multiplies the signs and handles the subtraction of exponents is already there, as discussed above for multiplication. The problem is in dividing the mantissas. What is needed is 32 x 32 bit division and the 99000 only offers 32 x 16 bit division (the DIV instruction). It would be possible to write a routine to do 32 x 32 bit division from basics, but that would be a long and slow routine. Instead, it does something clever and uses the DIV instruction to the max.

In short it approximates the result by putting M1 in the 32 bit dividend and then divides by the top 16 bits of M2 (i.e. it truncates the last two hex digits of the divisor to zero). This already gives a result that is accurate to 3 or 4 hex digits. As the divisor is slightly too small, the result is slightly too large. It then subtracts a correction factor from the estimate that makes it accurate to 6 or 7 hex digits. As it turns out, the correction factor is fairly easy to calculate.

I'm not a mathematician, but I think the derivation of the correction factor is as follows. It is easiest to think about the problem in base 65536 numbers, i.e. a number system where there are 65536 different digits and "10" means 65536. As I don't have 65536 symbols available, I'll use [xxxx] as notation, where xxxx is 4 hex digits. The division of M1 by M2 can then be expressed as dividing the two digit number AB by two digit number CD giving a two digit result EF:

The dividend AB is a mantissa shifted 4 bits left, i.e. the range of
AB is [0100][0000] to [0FFF][FFF0].

The divisor CD is a mantissa shifted 8 bits left, i.e. the range of
CD is [1000][0000] to [FFFF][FF00].

The result is 0.EF, where the range of EF is [0100][0001] to [FFFF][FF00]

In the below, 100 and 10 are shorthand for [0001][0000][0000] and
[0001][0000] respectively.

[1] AB / CD = 0.EF

[2] AB = 0.EF x CD

       = EF x 0.CD

       = (E x C) + (E x D/10) + (F x C/10) + (F x D/100)

       = C x ( E + (E x D/10)/C + F/10 + (F x D/100)/C )

       = C x ( E.F + E x (D/10C) + (F x D)/100C )

[3] AB / C = E.F  +  E x (D/10C)  +  (F x D)/100C

    As C is at least [1000], the value of (F x D)/100C is at most [0000].[000F]
    and not significant:

    E x (D/10C) + (F x D)/100C ≈ E x (D/10C)
	

[4] AB / C  = E.F + E x (D/10C)

    AB / C - E x (D/10C) = E.F
	

[5] Now define a first estimate E'F':

    AB / C = E'F'

    As E x (D/10C) is at most [000F], the difference between E and E'
    is at most [000F].
    Calculating E x (D/10C) as E' x (D/10C) has an error of at
    most [0000].[000F] and this error is not significant.
	
[6] Hence, AB/CD can be calculated with sufficient precision using:

    E'F' = AB / C

    T    = (D / C) x E'

    EF   = E'F' - T/10

Back to the simple terms, the initial estimate is AB / C and the correction factor is (D/C) x E' / 10.
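The result of the derivation can be tried out numerically. The Python sketch below follows steps [5] and [6] with the same scaling as described (dividend shifted 4 bits, divisor shifted 8 bits) and compares against an exact integer division. The function names and the test values are mine, and the D/C scaling by 2^12 is my interpretation of how the shifts fit together.

```python
def divide_mantissas(m1, m2):
    """Approximate (m1 / m2) * 2**28 using estimate minus correction."""
    ab = m1 << 4                         # dividend AB, 32 bits
    cd = m2 << 8                         # divisor CD, 32 bits
    c, d = cd >> 16, cd & 0xFFFF         # its two base 65536 digits
    estimate = (ab << 16) // c           # E'F' = AB / C (slightly too large)
    e1 = estimate >> 16                  # E'
    d_over_c = ((d >> 4) << 16) // c     # D / C, scaled by 2**12
    t = (e1 * d_over_c) >> 12            # T/10 = E' x D / C
    return estimate - t

def exact_quotient(m1, m2):
    """Reference value: the same quotient by exact integer division."""
    return ((m1 << 4) << 32) // (m2 << 8)

for m1, m2 in [(0x100000, 0x1000FF), (0x123456, 0xABCDEF), (0xFFFFFF, 0x1000FF)]:
    approx, exact = divide_mantissas(m1, m2), exact_quotient(m1, m2)
    print(hex(approx), hex(exact), approx - exact)
```

In these cases the difference should stay confined to the lowest two or three hex digits of the 32 bit quotient, which is in line with the "6 or 7 hex digits" estimate.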

With the mathematics and the algorithm out of the way, let's dive into the actual code. We'll see that doing the M1 / M2 division only takes 20 instructions, with no loops.

The code starts out pretty much like the code for MR:

; entry point for DR
;
0946 C138   MOV  *R8+, R4    ; if div-by-zero, report overflow
0948 1319   JEQ  >097C

094A 06A0   BL   @>0A4C      ; extract and subtract exponents
094C 0A4C     
094E 6187   DATA >6187       ; = "S R7, R6"
0950 0040   DATA 64          ; add back excess

First there is a check for a zero operand, and an overflow error is reported if it is. Then the exponent subroutine is called to extract the signs and exponents and to calculate the result sign and exponent. For division, the first data word is "S R7, R6" as the exponents must now be subtracted. The second data word is +64: the exponent subtraction will cancel out the excess-64 part and this needs to be added back.

Next there is a range check:

0952 8100   C    R0, R4      ; if dividend >= divisor, result will be >= 1
0954 1107   JLT  >0964
0956 1502   JGT  >095C
0958 8141   C    R1, R5
095A 1A04   JL   >0964

095C 0586   INC  R6          ; increase result exponent and test for
095E 25A0   CZC  @>0BD6, R6  ; overflow (mantissa shift happens 992-99A)
0960 0BD6
0962 130C   JEQ  >097C

Depending on the values of the accumulator and the operand, the result can be 1 or larger (but always less than 16 decimal), and because the normalized mantissa must be of the form 0.MMMMMM an additional mantissa shift and exponent update may be necessary.

Now the code starts to perform the actual division of M1 by M2. First M1 and M2 are positioned to make the algorithm work:

0964 001D   SLAM R0, 4       ; align dividend & divisor for accuracy
0966 4100
0968 001D   SLAM R4, 8       ; make sure divisor larger than dividend
096A 4204

The next step is to calculate the estimate result (which will already be accurate to some 3 hex digits):

096C 3C04   DIV  R4, R0      ; calculate estimate E'F' = AB / C
096E 04C2   CLR  R2          ;  (using two steps of long division)
0970 3C44   DIV  R4, R1

To get a 32 bit result, the remainder is divided by the divisor again, just as one would do in a manual long division. Note that R4 cannot be zero and that neither division can overflow (a remainder must necessarily be smaller than the divisor) and hence there are no checks for errors.

Next comes the calculation of the correction factor:

0972 C245   MOV  R5, R9      ; now calculate error term: T = D / C x E'
0974 0949   SRL  R9, 4       ; align C with AB (i.e. make D/C < 1)
0976 04CA   CLR  R10
0978 3E44   DIV  R4, R9      ; calc D / C
097A 1903   JNO  >0982       ; always jump

...

0982 3A40   MPY  R0, R9       ; calc T = E' x (D / C)
0984 04C8   CLR  R8           ; align T/10 with E'F' and place into R8,R9
0986 001D   SLAM R8,4
0988 4108
098A 09CA   SRL  R10, 12
098C A24A   A    R10, R9

098E 0029   SM   R8, R0       ; now subtract error term from estimate
0990 4008

First we make sure that D is smaller than C to prevent overflow (and three digits of accuracy are enough). Then it calculates D/C. As C cannot be zero the division must succeed, just as the earlier two DIV operations; no error checking is necessary.

Here we have some strangeness: despite the above, the code checks for overflow and jumps over the overflow exit code. There is no reason that the overflow code has to be located here: it is not necessary to bring jumps into range or something like that. Other than the programmer being confused, I see no reason for this jump in the code. Maybe I'm missing something, if so please post.

The code then proceeds to multiply by E' and finish the calculation of T. The range shift of >0974 is undone and by taking the high word T is effectively divided by 10-base-65536. As a last step the correction factor is subtracted from the first estimate, giving a result accurate to 6 or 7 hex digits.

After this, only combining the exponent, sign and "EF" into a normalized real number remains:

0992 C200   MOV  R0, R8       ; normalize mantissa
0994 09C8   SRL  R8, 12       ; one or two nibbles as needed
0996 1302   JEQ  >099C
0998 001C   SRAM R0, 4
099A 4100
099C 001C   SRAM R0, 4
099E 4100

09A0 06C6   SWPB R6          ; merge sign+exponent with mantissa
09A2 D006   MOVB R6, R0
09A4 1007   JMP  >09B4       ; compare FPAC against zero & store result

First it checks if the result of the mantissa divide was 1 or larger: if this is the case, the top digit of EF will be non-zero. It then shifts by one or two hex digits to the right to create the normalized mantissa. No change to the exponent is necessary as one shift is merely compensating all the clever shifts we did at the start of the code (i.e. one shift puts the fixed point in the proper place). For the other shift, it has already made the required adjustment to the exponent at the start, see code at >095C.

The last step is to merge in the sign/exponent byte and to jump to the standard exit routine.
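The final mantissa shift can be sketched in Python as follows (illustrative only; `ef` stands for the 32 bit quotient in R0,R1 and the function name is invented):

```python
def normalize_quotient(ef):
    """Turn the 32 bit quotient EF into a 24 bit normalized mantissa."""
    if ef >> 28:                 # top hex digit non-zero: quotient was >= 1
        return ef >> 8           # two nibbles (exponent was bumped at >095C)
    return ef >> 4               # one nibble

print(hex(normalize_quotient(0x08000000)))   # mantissa quotient of 1.0 / 2.0
```

Either way the result has a zero top byte and a non-zero leading hex digit, ready for the sign/exponent byte to be merged in.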

All in all, TI has used a very clever and fast algorithm for floating point divide.

Edited February 13, 2018 by pnr

pnr · February 12, 2018

The analysis of the DR instruction made me wonder about the speed of 99110 floating point operations. I haven't done any detailed cycle counts or run benchmark tests, but some rough scoping gives interesting results.

For floating point operations, the 99110 can always run at the full 6 MHz, as it is not dependent on slow external memory and wait states. I think the average floating point operation in that case takes around 70-80 microseconds. This equates to some 12-15 kFLOPS.

This compares well with the FPU chips of the late seventies and early eighties. The three main choices in 1981 were the AMD9511/i8231 from 1978, the AMD9512/i8232 from 1979 and the i8087 from 1980/81. The 99110 is from 1981 as well.

http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/

The 9511 needs about 200 clock cycles for a floating point operation, or 100 microseconds when run at 2 MHz (which seems to have been the norm BITD). When run at its 3MHz maximum it is around 70 microseconds. That the numbers are so similar is perhaps not surprising: the 9511 also has a 16 bit data path inside and would be executing similar algorithms. Running a custom designed microcode gives it an advantage in cycles, but the 99110 compensates for this with a high clock speed.
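The conversions behind these rough numbers are simple enough to check in a couple of lines of Python (helper names invented; the inputs are the estimates quoted above, not measurements):

```python
def kflops(microseconds_per_op):
    """Operations per second, in thousands, for a given time per operation."""
    return 1000.0 / microseconds_per_op

def op_time_us(cycles, clock_mhz):
    """Time per operation in microseconds for a cycle count and clock rate."""
    return cycles / clock_mhz

print(kflops(80), kflops(70))                      # 99110 estimate
print(op_time_us(200, 2), op_time_us(200, 3))      # 9511 at 2 MHz and 3 MHz
```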

The 9512 also needs about 100 microseconds for multiply and divide, but addition/subtraction is sped up to about 50 microseconds. It can also do double precision floating point (i.e. a 64-bit format). This is much slower than single precision: operations take between 500 and 800 microseconds. I think this would be the same for the 99110, if one would code up double precision routines in a fast external macro rom. As the 9512 still has a 16 bit data path (17 bit actually, to deal with the 'hidden 1' bit of the IEEE format used), the similarity is again not surprising. So, at double precision the speed would only be some 1-2 kFLOPS.

The real difference comes with the 8087 FPU. This chip internally always works with 80 bit floating point numbers. It is also much faster: it has separate ALUs for the exponent and mantissa, with wide data paths for both (15 and 64 bits respectively). Its speed on single precision arithmetic is around 50 kFLOPS and on double precision it is around 30 kFLOPS. However, only limited quantities of this chip were available in 1981, and this is one of the reasons why the original PC had a socket for an 8087 that was almost never filled.

My understanding is that all these chips were expensive. The 9511 and the 9512 were selling for between $50 and $100, and the 8087 well above that. If correct, the 99110 with a volume price of around $100 was good value. On the other hand, most applications back then did not need fast floating point.

Of course, the competing 16-bit processors (8086, Z8000 and 68000) could run floating point in software emulation about as fast as a 9512 or 99110 (when run at full clock speed with fast memory, all four were about equally fast). Viewed that way, the 99110 only has a convenience advantage.

A last consideration would have been the IBM360 format. Although popular in the 60's and early 70's, it was going out of fashion in the late 70's. The 9512 and the 8087 were much closer to the emerging IEEE floating point standard.
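For readers less familiar with the IBM360 format, the following Python sketch decodes a 32-bit single precision word using the layout as it is commonly documented: 1 sign bit, a 7-bit excess-64 exponent that is a power of 16, and a 24-bit fraction with no hidden bit. The function name is hypothetical, and I am assuming the plain IBM layout here; how the 99110 ROM treats corner cases such as unnormalized numbers is not covered.

```python
def ibm360_single_to_float(word):
    """Decode a 32-bit IBM360-style single precision word to a Python float.

    Assumed layout (classic IBM hexadecimal floating point):
      bit 31      sign
      bits 30-24  exponent, excess-64, base 16
      bits 23-0   fraction, value in [0, 1), no hidden bit
    """
    sign = -1.0 if word & 0x80000000 else 1.0
    exponent = (word >> 24) & 0x7F
    fraction = word & 0x00FFFFFF
    return sign * (fraction / float(1 << 24)) * 16.0 ** (exponent - 64)


if __name__ == "__main__":
    for w in (0x41100000, 0xC276A000):
        print(">%08X = %r" % (w, ibm360_single_to_float(w)))
```

Because the exponent is a power of 16, a normalized fraction can have up to three leading zero bits, which is one of the reasons the format fell out of favour compared to the binary formats with a hidden 1.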

For comparison, a high-end IBM360 mainframe in the 1960's would do about 10 MFLOPS. The supercomputer of the 70's, the Cray-1, was rated at 160 MFLOPS (both numbers for single precision arithmetic).

Edited February 13, 2018 by pnr
