Posts posted by pnr


  1. Another possibility is to build an extended VHDL design that allows for all of the existing speed range of a standard 9902 and adds something at the top end to make it top out at speeds comparable to the PC serial chips. That way it retains compatibility with all of the older software but can still be used in high-speed modes. The rest of the system will have to be able to keep up, but if it can, this may improve things for connections to local PCs (HDX or other serial applications) or UDS10 (or other) network connections.

     

    That by itself would not be too hard, I think. A real 9902 typically runs on an internal clock of 1 MHz (a 3 MHz clock internally divided by 3). An FPGA version should be able to go much faster than that (say >50x faster). One could use one of the spare bits in the control register to enable such a turbo mode.

     

    Keeping up would be hard: even in a tightly coded unrolled loop I'd be surprised if a 9900 could move more than 100 kbps into or out of the 9902, say 50 kbps for full duplex, and that would then consume 100% of the CPU...

     

    For going much faster you will need an FPGA-based turbo CPU as well.


  2. The other day I came across one of Grant Searle's projects that I had not noticed before: his "multicomp" project. It is a simple, low cost FPGA project, where he has more or less created a "software breadboard" for prototyping various retro computers. Googling for it shows various successful builds, including those with ready made PCB's, such as:

    https://www.retrobrewcomputers.org/forum/index.php?t=msg&th=111&start=0&

    (but there are several more such efforts).

     

    Maybe it is a nice idea to extend this idea to the 99xx world and create a few components to make it easy to prototype various small 99xx systems. The main thing missing in Grant's setup is the CRU bus and VHDL for CRU based chips. Probably the 9900 CPU core that 'speccery' did for last year's retro challenge can be reworked into a component for such a project without too much effort. Other chips to be added would be the 9902 and the 9901.

     

    As I have a need for a 9902 in VHDL for another project, I thought I might as well attempt to write one. Looking at the datasheet (figure 2 on page 3) it is not all that complex. There seem to be:

    - 6 plain registers

    - 2 shift registers

    - 3 counters

    - 3 controllers / FSM's

    The logic for the controllers is documented in flow charts on pages 16, 18 and 19. I'm guessing it will be about 1,000 lines of VHDL?

     

    I'll post my public domain source as I go along. Peer review certainly welcome.

     

    PS : one guy did an interesting VHDL project around recreating the 9902.

     

     


  3. I did a lot more experiments with a logic analyser hooked up.

     

    Things appear to be more complex: taking the READY signal low does not entirely stop the microcode from running. On the first clock the current bus cycle is extended, but on the second the CPU progresses to the next machine state / bus cycle anyway and only then starts the reset/interrupt 0 sequence.

     

    For nearly all instructions this does not matter much, as nothing irreversible happens during that extra machine state. There is one exception: in that extra machine state the LWP instruction overwrites the WP register. If reading the new WP from a register fails (because the register workspace is located on a fault page) it cannot be restarted without first restoring the WP in some way.

     

    So now there are two troublesome instructions to consider, BLWP and LWP. Part of the solution could be to save the previous WP just like the previous PC is saved. Luckily, every change to WP is echoed on the address bus in a special "WS update" bus cycle, so that isn't too hard to do. It does drive up complexity, though.

     

     

     

     

     

     


  4. Well, I went ahead and set up another experiment on Stuart's 99110 PCB. Let's see if the 99000 can do Z8000 and 16032 style "reset-aborts"....

     

    This time I added circuitry to generate a three clock reset signal upon a (simulated) page fault, keeping ready low at the same time. This seems to work, at least for what I have tested so far: it will abort the instruction halfway through, saving the proper PC and WP. A promising result!

     

    The circuit is in a 22V10 GAL and the relevant logic formulas are:

    fault  = !/mem * a0 * !a1 * a2 * a3 * !nclk;
    
    /resout = !(fault * q2 + resin);
    
    q0    := !fault;
    q1    := q0;
    q2    := q1;

    Note: using a "=" sign means creating a combinatorial output and using a ":=" sign means creating a registered output (clocked on the rising edge of the pin 1 input, which is "nclk"). "/resin" is the button reset via a pair of Schmitt triggers (1/3rd of a 74ls14) and "nclk" is the inverted clkout. "/resout" goes to both the "/reset" and "ready" pins of the 99105. This test circuit makes address range >B000 .. >C000 illegal to simulate a page fault.
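
    To see why this gives a three clock reset, the registered part of the equations can be stepped through in software. The following is a minimal Python sketch, not the GAL source: it assumes "fault" stays asserted once the faulting access starts, samples it once per clock (ignoring the "!nclk" gating within a cycle) and leaves out the reset button term. The function name simulate_reset is mine.

```python
def simulate_reset(clocks=6):
    """Step the q0..q2 shift register and return reset activity per clock.

    Assumption: fault is held asserted from clock 0 onward and the
    registers start at 1 (no fault seen in the previous three clocks).
    """
    q0 = q1 = q2 = 1
    fault = 1
    active = []
    for _ in range(clocks):
        # combinatorial term: fault * q2 (reset button term left out)
        active.append(bool(fault and q2))
        # registered outputs, all clocked on the same edge
        q0, q1, q2 = int(not fault), q0, q1
    return active


print(simulate_reset())
```

    Under these assumptions the reset term is active for the first three clocks and then drops, even though the fault condition itself is still present.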

     

    With this circuit, the instruction sequence

       LWPI >A000
       LI R1,>B000
       MOV *R1+,*R1+
    

    will get aborted in the third instruction, with R1 increased to >B002. The microcode first reads R1 (storing its value internally), increases it by two, stores the new R1 and then proceeds to fetch the source operand (see table 18 in the data book). Had the instruction run to completion (i.e. had reset behaved like nmi), R1 would have increased to >B004.

     

    For everything I tested so far, it would suffice to have:

    • the reset/abort logic as per above
    • a register holding the last correct IAQ address. If the IAQ itself causes the fault, the register is not updated. This is so because the last state of the previous instruction (the final write in many cases) did not take place and hence the previous instruction must be run again after the faulty prefetch is brought into memory.
    • a 4 bit register counting the number of memory accesses since the last correct IAQ. This is needed so that the roll back routine knows how far the instruction got before the page fault took place.

    I think the above might fit in two 74ls374 chips and a 22V10 GAL.
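
    As a way of pinning down what those two registers would have to do, here is a small software model. It is only a sketch under my own assumptions: the class and method names are made up, the IAQ address >0104 is arbitrary, and I chose not to count the faulting access itself.

```python
class AbortTracker:
    """Software model of the two proposed registers (names are hypothetical)."""

    def __init__(self):
        self.last_iaq = None   # address of the last IAQ that did not fault
        self.count = 0         # 4 bit count of memory accesses since then
        self.frozen = False    # set once a fault has been seen

    def bus_cycle(self, address, is_iaq, fault):
        if self.frozen:
            return
        if fault:
            # keep both registers as they are for the roll back routine
            self.frozen = True
        elif is_iaq:
            self.last_iaq = address
            self.count = 0
        else:
            self.count = (self.count + 1) & 0xF


tracker = AbortTracker()
tracker.bus_cycle(0x0104, is_iaq=True, fault=False)   # fetch MOV *R1+,*R1+
tracker.bus_cycle(0xA002, is_iaq=False, fault=False)  # read R1
tracker.bus_cycle(0xA002, is_iaq=False, fault=False)  # write R1 + 2
tracker.bus_cycle(0xB000, is_iaq=False, fault=True)   # source fetch faults
print(hex(tracker.last_iaq), tracker.count)
```

    For the MOV *R1+,*R1+ example the roll back routine would find a count of 2 and could conclude that R1 was already incremented once.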

    I'm lucky with how RTWP works. It turns out that the new WP is the last thing fetched (see page 85 of the data book), and the old WP is saved for every fault up to & including the fetch of the new WP: it can always be restarted with the old WP.

    The problem appears to be with BLWP (see page 82). The first thing it does is fetch the new WP from the transfer vector. If this fetch faults, it would seem that the value read from the aborted fetch is stored as the old WP value in the reset workspace. Maybe this is my test setup having issues (this outcome is a bit strange after all), but if correct it means that such a fault will immediately lose the old WP value and hence the instruction becomes impossible to restart.

    One solution could be to make a page fault during a BLWP a fatal error. My Unix C compiler does not generate BLWP instructions, so it would be an uncommon thing in that context.

    Maybe there is a way to handle it that currently escapes me: more work to do!

     


  5. And here are my notes on another source of ideas (good and bad): the Z8000 series.

     

    The Z8000 appeared early in 1979, in between the 8086 and the 68000. It was not a direct successor to the Z80, but shared its philosophy. It came in two versions, the Z8001 and the Z8002. The Z8002 was very similar to the 99xxx but with conventional registers and without the macrostore concept. The Z8001 added segmentation to the Z8002: each address was extended by a 7 bit segment number. Like the 68000, the Z8000 started without an established software ecosystem and Zilog ported V7 Unix instead, called “Zeus”. Zeus included a version of RM/Cobol to make it attractive as a business machine. Microsoft also ported Xenix to the Z8000. Nonetheless the chip failed to be a market success.

    Segmentation on the Z8000 was somewhat similar to the page “0” and “1” on a 99xxx, but then with 128 segments instead of 2 pages. Each segment was up to 64KB in size. The program counter had a dedicated segment register and data accesses used two adjacent registers or memory words to hold an address and a segment number. The 7 bit segment and the 16 bit offset were distinct and could not easily be used as a 23 bit “flat” address. Next to these explicit segments, the Z8000 could also use functional segmentation (instructions/data/stack and user/supervisor). There was a companion Z8010 MMU chip that mapped up to 64 segments to real addresses; two of these could be used in parallel.

    The Z8001 and Z8002 had an input signal to report memory faults. This input effectively was a non-maskable interrupt that could abort a faulting program. However, the faulting instruction itself ran to completion and could have irrevocably changed register values. Zilog realised early on that this was a mistake and announced the Z8003 and Z8004 that could abort instructions halfway through, along with a paging MMU, the Z8015. I did not find any 1981/82 designs that used the 03/04, maybe they were released later.

    The abort mechanism on a Z8003/4 is intriguing: the abort signal seems to be a form of reset. When a memory access causes a fault the abort signal must be held active for 5 clocks simultaneously with the ‘wait’ signal active as well. Then a non-maskable interrupt must be asserted and the abort and wait signals released (note that on a Z8000 the reset signal must also be held active for 5 clocks to be recognised). What seems to happen is the following:

    - asserting ‘wait’ stops the bus transaction from completing

    - asserting ‘abort’ resets the microcode sequencer to a state where it recognises a non-maskable interrupt at the end of the current bus cycle (instead of the end of the current instruction)

    - asserting the non-maskable interrupt causes the state of the processor (PC, status) to be saved and a recovery routine entered.

    The Z8015 MMU latches the PC into a register on every instruction fetch, and counts the number of bus cycles since the last instruction fetch. After a fault, this information is frozen. Using these registers, a relatively simple routine can revert any changes the aborted instruction made so that it can be restarted later. All the details are in the 1983 data book (sections 7 and 9, and Appendix D):

    http://bitsavers.trailing-edge.com/components/zilog/z8000/Z8000_CPU_Technical_Manual_Jan83.pdf

    I'm not sure the Z8015 ever made it into production -- maybe it never got beyond engineering samples, like the later Z80,000 CPU.

    Note that on the NS16032 abort and reset are actually the same pin, which suggests a close link also on that processor. It also almost makes you wonder if the Z8003/04 were really different from the Z8001/02 or whether it was just marketing, a bit like the 99105 has turned out not to be unique silicon.

    Maybe the equivalent approach will also work on a 99xxx. On a 99xxx the reset signal is actually also a non-maskable interrupt, and it too will abort an instruction at the end of a bus cycle. However, it takes 3 clocks to be recognised and that could equate to 3 bus cycles. The following might work:

    - identify an abort condition before the falling edge of CLKOUT;

    - simultaneously assert ‘reset’ and de-assert ‘ready’, and wait for 3 clocks;

    - release ‘reset’ and re-assert ‘ready’.

    According to the datasheet, the 99xxx CPU will now finish the current bus cycle and proceed to save the processor state in the reset workspace R13-R15.

    It will take some experimentation to find out if further hardware support is needed to be able to revert back any changes the aborted instruction may have made.


  6. Below my notes on the 8086 MMU approach and how it could relate to a 99xxx.

     

    The 8086 first appeared late in 1978 and essentially extended the 8080/8085 to 16 bits. It was not object code compatible, but 8080 assembler source code could be automatically converted into working 8086 source code using a conversion program that Intel provided. This proved a tremendous advantage as the existing CP/M code base could easily be ported to the 8086. Intel’s investment in helping Gary Kildall to develop PL/M and CP/M in the 1973-1975 era really paid off here.

    The 8086 seems to have used all the chip area that technology would allow in 1978 to add an on-board MMU to the CPU chip and did not add any mini-computer features like a supervisor mode or support to deal with page faults. The MMU is simple, but effective:

    - The MMU implements a segmentation scheme. Segmentation is along functional lines: instruction space (“code space”) is separated from data space as had been done on mini computers in the 70’s. Data space was optionally separated into normal, stack and 'extra' data spaces.

    - Each segment had a segment register (CS, DS, SS and ES respectively) which was 16 bits long. These 16 bits were added - offset by 4 bits - to a normal 16 bit address to create a 20 bit physical address.

    - In the typical case, instructions were fetched using the CS segment register, data was stored/fetched using the DS segment register and stack operations used the SS register. ES was used with string instructions. However, using a prefix instruction a non-default segment register could be chosen.

    - There were no facilities to limit a segment to less than 64KB in length and hence also no facilities to abort an illegal memory access. In this sense, it was no different from the earlier 8-bit CPU generation.
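
    The address calculation in the second point is simple enough to write out. A minimal sketch in Python (the function name is mine; the mask models the 20 address lines, so a sum above 1MB wraps around):

```python
def physical_address(segment, offset):
    """Combine a 16 bit segment and a 16 bit offset into a 20 bit address."""
    return ((segment << 4) + offset) & 0xFFFFF  # only 20 address lines


print(hex(physical_address(0x1234, 0x0010)))
```

    Note that many segment:offset pairs map to the same physical address, which is part of why the scheme offers no protection between segments.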

    The 99xxx could use an 8086-like MMU with a little external hardware. Four 74LS170 chips can implement the four segment registers, and four 74LS283 fast adder chips can be used to do the addition of the segment to the base address. The segment registers could be loaded using parallel CRU I/O. The four segments could have been (i) instructions, (ii) workspace, (iii) data and (iv) extra. The first three derive directly from bus status codes, the fourth could have been selected using a prefix instruction (like LDD/LDS on a TI990). The prefix instructions and the instructions to load the segment registers could all easily be implemented in macro code.

    I guess this all could have fitted in a single 48 pin ULA chip, which would have made a nice 8086-style MMU for the 99xxx. In a way, this would have been vaguely similar to the setup in a TI99/8. The key to making it work would have been in implementing an adder with full carry look-ahead so that it could be fast. Because implementing the four segment registers does not take much space, this would have been possible I think.

    If done as full-custom silicon, such an MMU could have added a small amount of ROM with the matching supporting macro code. I wonder how successful such an add-on for the 99xxx would have been.


  7. I’ve found an archived copy of that Elektor supplement from March 1981 that I devoured BITD. It can be found here (in Dutch):

    https://archive.org/details/Elektuur20919813Gen

    It is funny to see these old CPU's referred to as the new "super chips".

    I think it appeared with the April 1981 issue of the UK edition of Elektor, but I have not been able to find an archived copy of this English version. I assume that Elektor had similar supplements in the other language editions. Does anybody else remember those supplements?

    Edit:

    Did find it:

    https://archive.org/stream/ElektorMagazine/Elektor%5Bnonlinear.ir%5D%201981-04#page/n23/mode/2up

    In the UK it was not a supplement, but simply pages 23-46 of the April 1981 issue.


  8. Thanks for those insightful comments James!

     

    You are right, although the 68000 internally was a 16 bit chip, doing 32 bit operations in two steps, the architecture was 32 bit. With 24 physical address lines it could address 16MB directly, huge for 1981. I'll read up on 68010 a bit more.

     

    I agree that a simple paging design, with only functional segmentation (instruction/data, user/supervisor), is probably the way to go. My interest in virtual memory on the 99xxx is more of a "retro challenge" than anything else, although it would enable experimenting with copy-on-write in early Unix.

     

    When it comes to compilers I'm focussed on the C compiler from 2.11BSD that I ported to the 9995 a few years ago. It has support for overlays and separate instruction/data spaces built in. I used that compiler to port V6 Unix to the mini Cortex and this compiler now runs natively on 99xx hardware.

     


    When considering MMU designs, perhaps it is good to look at the competitive field from a 1981 perspective. In 1981, arguably, the cottage industry around microcomputers separated in a “business” segment (Osborne, Kaypro, IBM PC, etc.) and a “home” segment. Early in 1981 the hobby magazine “Elektor” had a special supplement about 16 bit chips that I devoured. In the end I settled on a TI99/4A as the basis for my 16 bit endeavours.

    At that time there were four 16 bit processors on the market:

    - the 68000

    - the 8086

    - the Z8000

    - the 99xxx / 99xx

    These 4 processors were all remarkably similar: they came in DIP packaging, had a 16 bit data path, ALU and databus and roughly similar performance. Also the bus interface logic was quite similar across these chips.

    Potentially, the list should also include the National Semiconductor 16032 (later renamed 32016). However, this chip had a 32 bit data path and ALU internally. Also, the chip was initially very buggy and usable silicon did not appear until about 1983, after 14 revisions of the design (revision letter “N”!)

    I’d like to look at these chips from three perspectives: supervisor mode capability, segmentation vs. paging, and the handling of memory faults. The 99xxx looks to be interesting from all three perspectives.

    - When it comes to supervisor mode capability, three of the four offer this: only the 8086 lacks this capability. The TI990 mini’s and the 99xxx offer this capability, but the 9900 and 9995 do not. It is hard to add with external hardware, because interrupts and system calls (XOP’s) must switch back to supervisor mode and the 9900 and 9995 do not offer (easy) signals to recognise this externally.

    - All these designs initially chose segmentation to manage memory and only later reworked it into paging designs. This is interesting because the mini computer world had already decided in the late 70’s that paging was the way to go. I’m not sure why the microprocessor world initially chose segmentation. In the case of the 68000 the address space was linear, but its first MMU chip (the 68451) was designed around a segmentation scheme. The 8086 and the Z8000 series CPU’s were designed with native segmentation. The 99xxx could go either way: the TI990/10A used the chip with a segmenting MMU, but a paging scheme around a 74LS612 mapper was equally supported.

    - The 68000 and the Z8000 were designed with hardware support for recovering from memory faults, but in both cases it did not work due to design errors. The 68000 had to be redesigned into the 68010 and the Z8001/2 into the Z8003/4 to get this fixed. So, from a 1981 perspective, none of the chips had working support for demand segmentation or demand paging. The 8086 does not claim to offer support for this; it would not be supported until the 80286. The 99xxx datasheet is silent on the topic, but my hunch is that the 99xxx does support demand paging with minimal external hardware.


  10. Thanks for that link and that is indeed pricey!

     

    The TM990 series are development boards: http://www.stuartconner.me.uk/tm990/tm990.htm

     

    Although they can be rack mounted, they are very distinct from the TI990 series mini computers. In a way, they are more reminiscent of PEB boards.

     

    The board that was sold on eBay is a trainer board with a 9981 CPU and a calculator style user interface.

     

    (there are some pictures of a TI990 here:

    http://www.computinghistory.org.uk/det/11554/Texas-Instruments-TI-990-Computer-System/)


  11. I've built a little modification on Stuart's 99110 board to test the copying of status bits ST7 to ST11 to external flip-flops.

     

    I'm decoding bus status "ST" (binary 1101) and then clocking the address bus bits 7-11 to an external flip-flop on the rising edge of CLKOUT. This appears to work fine.
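
    In software terms the latch behaves roughly as below. This is a sketch only: it assumes the bus carries the status bits in the same positions as the status register, it uses TI bit numbering (bit 0 is the most significant bit), and the function names and the example bus value >0180 are made up.

```python
BST_ST = 0b1101  # bus status code "ST", as decoded in the test circuit


def ti_bit(word, n):
    """Return bit n of a 16 bit word in TI numbering (bit 0 is the MSB)."""
    return (word >> (15 - n)) & 1


def latch_status(bst, bus_word, latched):
    """Return the flip-flop contents after one rising edge of CLKOUT."""
    if bst != BST_ST:
        return latched  # not an ST cycle: flip-flops keep their value
    return {f"ST{n}": ti_bit(bus_word, n) for n in range(7, 12)}


flops = {f"ST{n}": 0 for n in range(7, 12)}
flops = latch_status(BST_ST, 0x0180, flops)  # hypothetical bus value
print(flops)
```

    Any other bus status code leaves the flip-flops untouched, so the external copy only changes when the CPU reports a status register update.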

     

    Copying status bits to external flip-flops makes it possible:

     

    (i) to make the non-privileged/privileged status (ST7) available to external hardware

     

    (ii) to use the status map select bit (ST8) directly in external hardware and to use the PSEL signal to signify "use another map" to the MMU (as the TI990/10A mini does)

     

    I've also found that the unassigned status bit (ST9) is present as a real register bit on 99xxx silicon: it can be set and reset. Like ST7 and ST8, the ST9 bit is reset whenever a reset, interrupt or XOP occurs. On the TI990/12 mini this bit enables error checking by the MMU. Maybe in a new design other creative uses are possible.

     

    I'm finding the 99xxx an ever more intriguing design!

     

     


  12. Last up is an analysis of AR, SR, and CR. All three share most of their code.

     

    Although addition is perhaps conceptually the easiest operation, the code is surprisingly long and involved, as there are many cases to consider. As a result, floating point addition is not much faster than multiplication or division.

     

    The main issue is that the mantissas of two floating point numbers can only be added together if their exponents are equal. If they are not equal, the smaller number must be denormalized to make the exponents equal:

    0.1234 x 16^4 + 0.12 x 16^2 = 0.1234 x 16^4 + 0.0012 x 16^4 = 0.1246 x 16^4

    If the difference between the exponents is more than 6, the smaller number becomes insignificant and effectively equals zero.
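
    The alignment step can be checked with a few lines of Python. This is a simplified sketch and not a transcription of the macro code: it handles positive numbers only, stores each mantissa as a 6 hex digit integer (0x123400 stands for 0.1234), and ignores the guard digit, carry out of the top digit and normalization. The function name is mine.

```python
def align_and_add(mant_a, exp_a, mant_b, exp_b):
    """Add two positive hex floating point values.

    Mantissas are 6 hex digit integers (0x123400 means 0.1234),
    exponents are powers of 16.
    """
    if exp_a < exp_b:
        mant_a, exp_a, mant_b, exp_b = mant_b, exp_b, mant_a, exp_a
    # denormalize the smaller number: one nibble per exponent step
    mant_b >>= 4 * (exp_a - exp_b)
    return mant_a + mant_b, exp_a


mant, exp = align_and_add(0x123400, 4, 0x120000, 2)
print(f"0.{mant:06X} x 16^{exp}")
```

    With an exponent difference of 6 or more the shift leaves nothing of the smaller mantissa, which is the "insignificant" case.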

     

    The entry code for SR is as follows:

    ; entry point for SR
    ;
    0814 C138   MOV  *R8+, R4       ; fetch 1st word of S
    0816 136D   JEQ  >08F2          ; if S is zero, nothing to do
    0818 0224   AI   R4, >8000      ; flip sign bit
    081A 8000
    081C 1002   JMP  >0822          ; now handle as AR
    

    It checks the operand for being zero, and if so the accumulator already has the right result. If not zero, it flips the sign bit and handles FPAC-S as FPAC+(-S).

     

    Next is the entry code for AR:

    ; entry point for AR
    ;
    081E C138   MOV  *R8+, R4       ; fetch 1st word of S
    0820 1368   JEQ  >08F2          ; if S is zero, nothing to do
    

    It only checks for the operand being zero, and FPAC already containing the result.

     

    From here on, AR and SR have an identical code path.

    0822 C000   MOV  R0, R0         ; if FPAC is zero, S is the result
    0824 1603   JNE  >082C
    0826 C004   MOV  R4, R0         ; move S to local FPAC
    0828 C058   MOV  *R8, R1
    082A 1063   JMP  >08F2          ; store FPAC & set status bits
    ;
    082C 04C6   CLR  R6             ; clear flag (= store result)
    082E C158   MOV  *R8, R5        ; fetch 2nd word of S
    

    The code first checks for another special case: if the accumulator is zero, the result is equal to the operand. If not, it enters the full calculation. It clears the CR flag (R6): at >0830 the code path for CR merges in (see entry code for CR discussed earlier), and the CR code path will separate towards the end of the algorithm.

     

    Note that the CR code path does not have checks for either the accumulator or the operand being zero. Effectively a zero here is handled as meaning "+0.0 x 16^-64" and this will not lead to issues in the CR code path.

    ; CR jumps here (with R6 all ones = set status flags only)
    ;
    0830 04C2   CLR  R2             ; clear extra mantissa bits
    0832 C0C0   MOV  R0, R3         ; save exponents
    0834 C1C4   MOV  R4, R7
    0836 7000   SB   R0, R0         ; remove exponents from mantissas
    0838 7104   SB   R4, R4
    

    As usual the code starts out separating the sign and exponent from the mantissa. R2 is prepared to hold an extra 'guard' digit of precision.

     

    Next the sign bit and exponent are separated for the accumulator:

    083A 0A13   SLA  R3, 1          ; is FPAC negative?
    083C 1702   JNC  >0842
    083E 06A0   BL   @>0AE6         ; yes: negate extended FPAC mantissa
    0840 0AE6
    0842 0993   SRL  R3, 9          ; FPAC exponent in R3
    

    If FPAC is negative, the mantissa is negated. There is a subroutine for this, as the negation has to happen again when the result is converted back to standard IBM360 format. The subroutine is:

    ; subroutine to negate extended FPAC mantissa
    ;
    0AE6 0540   INV  R0
    0AE8 0541   INV  R1
    0AEA 0502   NEG  R2
    0AEC 1703   JNC  >0AF4
    0AEE 0581   INC  R1
    0AF0 1701   JNC  >0AF4
    0AF2 0580   INC  R0
    0AF4 045B   RT
    

    Including R2 in the negation is superfluous at this point. Also note that with the sign/exponent removed, the mantissa has two extra hex digits on the left, and hence does not need to consider the >800000 overflow condition when negating.
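
    The carry ripple in this subroutine is easy to get wrong, so here is a Python model that compares it against a plain 48 bit negation. It is a sketch of my reading of the listing: it assumes NEG sets carry only for a zero operand and INC sets carry only when the result wraps to zero. The helper names are mine.

```python
def negate_extended(r0, r1, r2):
    """Model of the subroutine at >0AE6 on three 16 bit registers."""
    r0 ^= 0xFFFF                 # INV R0
    r1 ^= 0xFFFF                 # INV R1
    r2 = -r2 & 0xFFFF            # NEG R2
    if r2 == 0:                  # carry out of NEG: ripple into R1
        r1 = (r1 + 1) & 0xFFFF   # INC R1
        if r1 == 0:              # carry out of INC: ripple into R0
            r0 = (r0 + 1) & 0xFFFF
    return r0, r1, r2


def as_int(r0, r1, r2):
    """Join the three registers into one 48 bit integer."""
    return (r0 << 32) | (r1 << 16) | r2


for value in (1, 0x0012_3456_7000, 0x0001_0000_0000, 0x0000_0001_0000):
    regs = (value >> 32, (value >> 16) & 0xFFFF, value & 0xFFFF)
    assert as_int(*negate_extended(*regs)) == -value & 0xFFFFFFFFFFFF
```

    The loop runs silently when the model agrees with ordinary two's complement negation for the chosen values, including the two cases where the carry has to ripple.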

     

    Next, the sign and exponent of the operand are separated:

    0844 0A17   SLA  R7, 1          ; is S negative?
    0846 1704   JNC  >0850
    0848 0544   INV  R4             ; yes: negate S mantissa
    084A 0505   NEG  R5
    084C 1701   JNC  >0850
    084E 0584   INC  R4
    0850 0997   SRL  R7, 9          ; S exponent in R7
    

    Here too the mantissa is negated if the operand is negative, but this time it happens in line because it does not need to be reversed later.

     

    With the mantissas prepared and including the sign bit, the code considers the exponents and the relative size of the accumulator and the operand:

    0852 C247   MOV  R7, R9         ; compare exponents
    0854 6243   S    R3, R9
    0856 1319   JEQ  >088A          ; if equal, directly add the mantissas
    
    0858 0289   CI   R9, >0006      ; S much larger than FPAC?
    085A 0006
    085C 110F   JLT  >087C
    085E C0C7   MOV  R7, R3         ; if FPAC is insignificant, result is S
    0860 04C0   CLR  R0
    0862 04C1   CLR  R1
    0864 1012   JMP  >088A
    
    ...
    
    087C 0509   NEG  R9             ; FPAC much larger than S?
    087E 0289   CI   R9, >0006
    0880 0006
    0882 11F6   JLT  >0870
    0884 C1C3   MOV  R3, R7         ; if S is insignificant, result is FPAC
    0886 04C4   CLR  R4 
    0888 04C5   CLR  R5
    
    

    The code first handles the three easy cases: exponents equal, FPAC dominates and S dominates. If the exponents are equal, there is nothing to do. If S is more than 6 hex digits larger than FPAC, FPAC is effectively zero and the exponent of S becomes the exponent of the result. If FPAC is more than 6 hex digits larger than S, S is effectively zero and the exponent of FPAC becomes the exponent of the result.

     

    The complex case is handled by a clever loop that shifts either FPAC or S into place. The loop code is entered in the middle (>0870):

    ; denormalize & align smallest mantissa
    ;
    0866 C085   MOV  R5, R2         ; shift S one nibble right
    0868 0AC2   SLA  R2, 12
    086A 001C   SRAM R4, 4
    086C 4104
    086E 0587   INC  R7             ; and adjust exponent
    
    0870 81C3   C    R3, R7         ; exponents equal?
    0872 130B   JEQ  >088A          ; yes: add mantissas
    0874 15F8   JGT  >0866          ; exp FPAC > exp S?
    
    0876 06A0   BL   @>0AD4         ; no: shift FPAC one nibble right
    0878 0AD4                       ;     and adjust exponent
    087A 10FA   JMP  >0870
    

    The loop compares the exponents and if they have become equal (which they must within 6 shifts), the work is done and we proceed with the actual addition at >088A. If S is the smallest the loop runs from >0866 to >0874 and shifts the operand in place (keeping one guard digit in R2). If FPAC is the smallest, the loop runs from >0870 to >087A and shifts the accumulator in place (again keeping one guard digit in R2).

     

    The accumulator shift is also used again later in the algorithm and hence in a subroutine:

    ; subroutine to (de)normalize FPAC mantissa
    ; to the right one hex digit (nibble)
    ;
    0AD4 C081   MOV  R1, R2        ; shift extended mantissa one nibble
    0AD6 0AC2   SLA  R2, 12
    0AD8 001C   SRAM R0, 4
    0ADA 4100
    0ADC 0583   INC  R3            ; adjust exponent
    0ADE 24E0   CZC  @>0BD6, R3    ; exponent in range?
    0AE0 0BD6
    0AE2 139F   JEQ  >0A22         ; no: overflow
    0AE4 045B   RT
    

    At this point in time, the range check on the exponent is superfluous, as the exponent must be in range (because S is in range).
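
    The three instructions that do the actual shift can be modelled as follows. This is a sketch under the assumption that SRAM treats R0:R1 as one 32 bit quantity and shifts it arithmetically, which is how I read the listing; the function name is mine and the exponent handling is left out.

```python
def shift_right_nibble(r0, r1):
    """Model of MOV R1,R2 / SLA R2,12 / SRAM R0,4 on 16 bit registers.

    Returns (r0, r1, r2) where r2 holds the guard digit in its top nibble.
    """
    r2 = (r1 << 12) & 0xFFFF             # low nibble of R1 becomes guard digit
    value = (r0 << 16) | r1              # R0:R1 as one 32 bit quantity
    if value & 0x80000000:               # arithmetic shift: keep the sign
        value -= 1 << 32
    value = (value >> 4) & 0xFFFFFFFF
    return value >> 16, value & 0xFFFF, r2


print([hex(r) for r in shift_right_nibble(0x0012, 0x3456)])
```

    The digit that falls off the right end of R1 ends up in the top nibble of R2, which is where the leftward normalization picks it up again with SRL R2,12.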

     

    With both numbers properly aligned, we can do the actual addition. At this point, the code for CR takes its own path again:

    088A C186   MOV  R6, R6         ; was opcode CR, or AR/SR?
    088C 1307   JEQ  >089C
    088E 002A   AM   R4, R0         ; CR: add mantissas & return status bits
    0890 4004
    0892 02CA   STST R10
    0894 024A   ANDI R10, >E000     ; mask out L>, A>, EQ status bits
    0896 E000
    0898 E3CA   SOC  R10, R15
    089A 0380   RTWP                ; macro processing complete
    

    For the CR instruction, we add the mantissas and only look at the status bits (L>, A> and EQ) and return those to the user routine. No result is stored back to the user accumulator.

     

    For AR and SR, there is more work to do:

    089C 002A   AM   R4, R0         ; add mantissas
    089E 4004
    08A0 1325   JEQ  >08EC          ; if zero, clear FPAC & finish
    08A2 1504   JGT  >08AC          ; if negative,
    08A4 06A0   BL   @>0AE6         ;   negate extended mantissa
    08A6 0AE6
    08A8 0263   ORI  R3, >0080      ;   and flip sign bit
    08AA 0080
    08AC D000   MOVB R0, R0         ; if mantissa too large
    08AE 1302   JEQ  >08B4
    08B0 06A0   BL   @>0AD4         ; normalize it rightward one nibble
    08B2 0AD4
                [JMP to >08CC seems missing]
    

    Again, the two mantissas are added. If the result is zero, FPAC is cleared (the normalized version of zero) and the status bits are set accordingly.

     

    If the result is negative, the result mantissa is negated back to positive (note this time negating the guard digit as well is not superfluous) and the sign bit is set accordingly.

     

    If the result has one more hex digit (i.e. something like 0.800000 + 0.A00000 = 1.200000), the mantissa is normalized one hex digit to the right (note that in this case, the range check is not superfluous). As the mantissa must now be in normalized form, the code could proceed to merging in the sign/exponent byte. However, it drops into the code for another check.

     

    It is possible in addition that several hex digits cancel out, and that there are a lot of leading zeroes in the result mantissa. An example would be:

    0.123456 - 0.123400 = 0.000056

    This must be normalized to 0.56x16^-4. In this case no precision is lost. However, with a denormalized number one guard digit is required:

    0.100001 - 0.123400x16^-5 = 0.100001 - 0.000001(2) = 0.0FFFFF(E)

    This must be normalized to 0.FFFFFEx16^-1 and in this case we need the guard digit shifted in. If I'm not mistaken only one guard digit can possibly shift in, and hence that is all we have in R2.

     

    In code, this leads to the following:

    ; normalize FPAC mantissa (leftward)
    ;
    08B4 0280   CI   R0, >000F      ; is the highest nibble 0?
    08B6 000F
    08B8 1509   JGT  >08CC          ; no: mantissa is normalized
    08BA 24E0   CZC  @>0BD6, R3     ; exponent already 0?
    08BC 0BD6
    08BE 1378   JEQ  >09B0          ; yes: underflow
    08C0 0603   DEC  R3             ; reduce exponent & shift mantissa one nibble    
    08C2 001D   SLAM R0,4
    08C4 4100
    08C6 09C2   SRL  R2, 12         ; shift in guard digit
    08C8 A042   A    R2, R1
    08CA 10F4   JMP  >08B4
    ;
    08CC 06C3   SWPB R3             ; merge exponent back in
    08CE D003   MOVB R3, R0
    08D0 1071   JMP  >09B4          ; store FPAC & set status bits
    

    This code was discussed before in the post on multiplication: this tail is shared between AR, SR and MR.
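
    As a rough illustration (not the ROM code itself), the leftward normalization with a single guard digit can be sketched in Python. The function name, the plain-integer arguments and the returned underflow flag are assumptions made for this sketch; the ROM works on R0..R3 instead.

```python
def normalize_left(mantissa, guard, exponent):
    """Shift a 24-bit mantissa left until its top hex digit is non-zero.

    mantissa: 0..0xFFFFFF, guard: one extra hex digit (0..15),
    exponent: excess-64 exponent (0..127).
    Returns (mantissa, exponent, underflow).
    """
    while mantissa <= 0x0FFFFF:          # top nibble still zero?
        if exponent == 0:                # exponent cannot be reduced further
            return mantissa, exponent, True
        exponent -= 1
        mantissa = ((mantissa << 4) | guard) & 0xFFFFFF
        guard = 0                        # the guard digit is shifted in only once
    return mantissa, exponent, False
```

    With the two examples from the text: normalize_left(0x000056, 0, 64) gives 0x560000 with the exponent reduced by 4, and normalize_left(0x0FFFFF, 0xE, 64) gives 0xFFFFFE with the exponent reduced by 1.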

     

    I wonder if the AR/SR code is the shortest possible. It would seem that the checks for zero accumulator and operand are for performance only, as the rest of the algorithm would seem to work for AR/SR just as it does for CR. Also, maybe it is faster to operate on mantissas shifted one hex digit to the left; this still leaves one "overflow digit" to the left, but makes room to include the guard digit on the right. Finally, the range check could be taken out of the "shift right" subroutine and moved to immediately after the second subroutine call.

     

    That completes our tour of the 99110 macrorom: there is no other code left to discuss.

     

     


  13. The analysis of the DR instruction made me wonder about the speed of 99110 floating point operations. I haven't done any detailed cycle counts or run benchmark tests, but some rough scoping gives interesting results.

     

    For floating point operations, the 99110 can always run at the full 6 MHz, as it is not dependent on slow external memory and wait states. I think the average floating point operation in that case takes around 70-80 microseconds. This equates to some 12-15 kFLOPS.

     

    This compares well with the FPU chips of the late seventies and early eighties. The three main choices in 1981 were the AMD9511/i8231 from 1978, the AMD9512/i8232 from 1979 and the i8087 from 1980/81. The 99110 is from 1981 as well.

    http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/

     

    The 9511 needs about 200 clock cycles for a floating point operation, or 100 microseconds when run at 2 MHz (which seems to have been the norm BITD). When run at its 3 MHz maximum it is around 70 microseconds. That the numbers are so similar is perhaps not surprising: the 9511 also has a 16 bit data path inside and would be executing similar algorithms. Running a custom designed microcode gives it an advantage in cycles, but the 99110 compensates for this with a high clock speed.
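
    The back-of-the-envelope conversions used here are easy to redo. A small Python sketch (the helper names are made up for the illustration, and the inputs are the rough estimates from the text, not measurements):

```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Time for one operation, given its cycle count and the clock in MHz."""
    return cycles / clock_mhz


def kflops(microseconds_per_op):
    """Operations per second, in thousands, for a given time per operation."""
    return 1000.0 / microseconds_per_op


# 99110 estimate: 70-80 microseconds per operation
print(round(kflops(80), 1), round(kflops(70), 1))
# 9511 estimate: about 200 cycles at 2 MHz and at 3 MHz
print(cycles_to_microseconds(200, 2.0), round(cycles_to_microseconds(200, 3.0), 1))
```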

     

    The 9512 also needs about 100 microseconds for multiply and divide, but addition/subtraction is sped up to about 50 microseconds. It can also do double precision floating point (i.e. a 64-bit format). This is much slower than single precision: operations take between 500 and 800 microseconds. I think this would be the same for the 99110, if one were to code up double precision routines in a fast external macro rom. As the 9512 still has a 16 bit data path (17 bit actually, to deal with the 'hidden 1' bit of the IEEE format used), the similarity is again not surprising. So, at double precision the speed would only be some 1-2 kFLOPS.

     

    The real difference comes with the 8087 FPU. This chip internally always works with 80 bit floating point numbers. It is also much faster: it has separate ALUs for the exponent and mantissa, with wide data paths for both (15 and 64 bits respectively). Its speed on single precision arithmetic is around 50 kFLOPS and on double precision it is around 30 kFLOPS. However, only limited quantities of this chip were available in 1981, which is one of the reasons why the original PC had a socket for an 8087 that was almost never filled.

     

    My understanding is that all these chips were expensive. The 9511 and the 9512 were selling for between $50 and $100, and the 8087 well above that. If correct, the 99110 with a volume price of around $100 was good value. On the other hand, most applications back then did not need fast floating point.

     

    Of course, the competing 16-bit processors (8086, Z8000 and 68000) could run floating point in software emulation about as fast as a 9512 or 99110 (when run at full clock speed with fast memory, all four were about equally fast). Viewed that way, the 99110 only has a convenience advantage.

     

    A last consideration would have been the IBM360 format. Although popular in the 60's and early 70's, it was going out of fashion in the late 70's. The 9512 and the 8087 were much closer to the emerging IEEE floating point standard.

     

    For comparison, a high-end IBM360 mainframe in the 1960's would do about 10 MFLOPS. The supercomputer of the 70's, the Cray-1, was rated at 160 MFLOPS (both numbers for single precision arithmetic).


  14. Next up is floating point division, the "DR" instruction. It has a clever algorithm, but also a strange bit in its implementation.

     

    First let's look at the algorithm it uses. As with multiplication, we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

    S1 x 0.M1 x 16 ^ E1

    and

    S2 x 0.M2 x 16 ^ E2

    The division will be:

    (S1 x 0.M1 x 16 ^ E1) / (S2 x 0.M2 x 16 ^ E2)

    which is the same as:

    (S1/S2) x (0.M1 / 0.M2) x (16 ^ E1 / 16 ^ E2)

    which is the same as:

    (S1/S2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

    which is the same as:

    (S1xS2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

    The subroutine that multiplies the signs and handles the subtraction of exponents is already there, as discussed above for multiplication. The problem is in dividing the mantissas. What is needed is 32 x 32 bit division and the 99000 only offers 32 x 16 bit division (the DIV instruction). It would be possible to write a routine to do 32 x 32 bit division from basics, but that would be a long and slow routine. Instead, it does something clever and uses the DIV instruction to the max.

    In short, it approximates the result by putting M1 in the 32 bit dividend and then divides by the top 16 bits of M2 (i.e. it truncates the last two hex digits of the divisor to zero). This already gives a result that is accurate to 3 or 4 hex digits. As the divisor is slightly too small, the result is slightly too large. It then subtracts a correction factor from the estimate that makes it accurate to 6 or 7 hex digits. As it turns out, the correction factor is fairly easy to calculate.

    I'm not a mathematician, but I think the derivation of the correction factor is as follows. It is easiest to think about the problem in base 65536 numbers, i.e. a number system where there are 65536 different digits and "10" means 65536. As I don't have 65536 symbols available, I'll use [xxxx] as notation, where xxxx is 4 hex digits. The division of M1 by M2 can then be expressed as dividing the two digit number AB by the two digit number CD, giving a two digit result EF:

    The dividend AB is a mantissa shifted 4 bits left, i.e. the range of
    AB is [0100][0000] to [0FFF][FFF0].
    
    The divisor CD is a mantissa shifted 8 bits left, i.e. the range of
    CD is [1000][0000] to [FFFF][FF00].
    
    The result is 0.EF, where the range of EF is [0100][0001] to [FFFF][FF00]
    
    In the below, 100 and 10 are shorthand for [0001][0000][0000] and
    [0001][0000] respectively.
    
    [1] AB / CD = 0.EF
    
    [2] AB = 0.EF x CD
    
           = EF x 0.CD
    
           = (E x C) + (E x D/10) + (F x C/10) + (F x D/100)

           = C x ( E + (E x D/10)/C + F/10 + (F x D/100)/C )
       
           = C x ( E.F + E x (D/10C) + (F x D)/100C )
    
    [3] AB / C = E.F  +  E x (D/10C)  +  (F x D)/100C
    
        As C is at least [1000], the value of (F x D)/100C is at most [0000].[000F]
    	and not significant:
    	
    	E x (D/10C) + (F x D)/100C ≈ E x (D/10C)
    	
    
    [4] AB / C  = E.F + E x (D/10C)
    
        AB / C - E x (D/10C) = E.F
    	
    
    [5] Now define a first estimate E'F':
    
        AB / C = E'F'
    
        As E x (D/10C) is at most [000F], the difference between E and E'
    	is at most [000F].
        Calculating E x (D/10C) as E' x (D/10C) has an error of at
        most [0000].[000F] and this error is not significant.
    	
    [6] Hence, AB/CD can be calculated with sufficient precision using:
    
        E'F' = AB / C
    
        T    = (D / C) x E'
    
        EF   = E'F' - T/10
    

    Back to the simple terms, the initial estimate is AB / C and the correction factor is (D/C) x E' / 10.
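
    To check the idea numerically, here is a small Python sketch of the estimate-and-correct scheme. It follows the steps of the derivation with plain integers; the function name and the (mantissa, shifted) return convention are assumptions for the illustration and not how the ROM passes values around.

```python
def divide_mantissas(m1, m2):
    """Approximate m1/m2 for normalized 24-bit mantissas (0x100000..0xFFFFFF).

    Returns (mantissa, shifted): the quotient is mantissa * 16**shifted / 2**24.
    """
    ab = m1 << 4                      # dividend AB
    cd = m2 << 8                      # divisor CD, always larger than AB
    c, d = cd >> 16, cd & 0xFFFF
    e1, rem = divmod(ab, c)           # first 32/16 divide: E' and remainder
    f1 = (rem << 16) // c             # second divide on the remainder: F'
    estimate = (e1 << 16) | f1        # E'F' = AB / C
    q = ((d >> 4) << 16) // c         # D / C, pre-shifted so it fits 16 bits
    t = (e1 * q) >> 12                # correction E' x (D / C), realigned
    ef = estimate - t
    if ef >> 28:                      # top hex digit set: quotient was >= 1
        return ef >> 8, 1
    return ef >> 4, 0
```

    For example, divide_mantissas(0x100000, 0x200000) returns (0x800000, 0), i.e. 0.5, and divide_mantissas(0x200000, 0x100000) returns (0x200000, 1), i.e. 2. Only three integer divisions by the 16 bit value C and one multiplication are needed, which is the whole point of the trick.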

     

    With the mathematics and the algorithm out of the way, let's dive into the actual code. We'll see that doing the M1 / M2 division only takes 20 instructions, with no loops.

     

    The code starts out pretty much like the code for MR:

    ; entry point for DR
    ;
    0946 C138   MOV  *R8+, R4    ; if div-by-zero, report overflow
    0948 1319   JEQ  >097C
    
    094A 06A0   BL   @>0A4C      ; extract and subtract exponents
    094C 0A4C     
    094E 6187   DATA >6187       ; = "S R7, R6"
    0950 0040   DATA 64          ; add back excess
    

    First there is a check for a zero operand (the divisor), and an overflow error is reported if it is zero. Then the exponent subroutine is called to extract the signs and exponents and to calculate the result sign and exponent. For division, the first data word is "S R7, R6" as the exponents must now be subtracted. The second data word is +64: the exponent subtraction will cancel out the excess-64 part and this needs to be added back.

     

    Next there is a range check:

    0952 8100   C    R0, R4      ; if dividend > divisor, result will be >1
    0954 1107   JLT  >0964
    0956 1502   JGT  >095C
    0958 8141   C    R1, R5
    095A 1A04   JL   >0964
    
    095C 0586   INC  R6          ; increase result exponent and test for
    095E 25A0   CZC  @>0BD6, R6  ; overflow (mantissa shift happens 992-99A)
    0960 0BD6
    0962 130C   JEQ  >097C
    

    Depending on the values of the accumulator and the operand, the result can be larger than 1 (but no larger than 15 decimal), and because the normalized mantissa must be of the form 0.MMMMMM an additional mantissa shift and exponent update may be necessary.

     

    Now the code starts to perform the actual division of M1 by M2. First M1 and M2 are positioned to make the algorithm work:

    0964 001D   SLAM R0, 4       ; align dividend & divisor for accuracy
    0966 4100
    0968 001D   SLAM R4, 8       ; make sure divisor larger than dividend
    096A 4204
    

    The next step is to calculate the estimate result (which will already be accurate to some 3 hex digits):

    096C 3C04   DIV  R4, R0      ; calculate estimate E'F' = AB / C
    096E 04C2   CLR  R2          ;  (using two steps of long division)
    0970 3C44   DIV  R4, R1
    

    To get a 32 bit result, the remainder is divided by the divisor again, just as one would do in a manual long division. Note that R4 cannot be zero and that neither division can overflow (a remainder must necessarily be smaller than the divisor) and hence there are no checks for errors.

     

    Next comes the calculation of the correction factor:

    0972 C245   MOV  R5, R9      ; now calculate error term: T = D / C x E'
    0974 0949   SRL  R9, 4       ; align C with AB (i.e. make D/C < 1)
    0976 04CA   CLR  R10
    0978 3E44   DIV  R4, R9      ; calc D / C
    097A 1903   JNO  >0982       ; always jump
    
    ...
    
    0982 3A40   MPY  R0, R9       ; calc T = E' x (D / C)
    0984 04C8   CLR  R8           ; align T/10 with E'F' and place into R8,R9
    0986 001D   SLAM R8,4
    0988 4108
    098A 09CA   SRL  R10, 12
    098C A24A   A    R10, R9
    
    098E 0029   SM   R8, R0       ; now subtract error term from estimate
    0990 4008
    

    First we make sure that D is smaller than C to prevent overflow (and three digits of accuracy are enough). Then it calculates D/C. As C cannot be zero the division must succeed, just as the earlier two DIV operations; no error checking is necessary.

     

    Here we have some strangeness: despite the above, the code checks for overflow and jumps over the overflow exit code. There is no reason that the overflow code has to be located here: it is not necessary to bring jumps into range or something like that. Other than the programmer being confused, I see no reason for this jump in the code. Maybe I'm missing something; if so, please post.

     

    The code then proceeds to multiply by E' and finish the calculation of T. The range shift of >0974 is undone and by taking the high word T is effectively divided by 10-base-65536. As a last step the correction factor is subtracted from the first estimate, giving a result accurate to 6 or 7 hex digits.

     

    After this, only combining the exponent, sign and "EF" into a normalized real number remains:

    0992 C200   MOV  R0, R8       ; normalize mantissa
    0994 09C8   SRL  R8, 12       ; one or two nibbles as needed
    0996 1302   JEQ  >099C
    0998 001C   SRAM R0, 4
    099A 4100
    099C 001C   SRAM R0, 4
    099E 4100
    
    09A0 06C6   SWPB R6          ; merge sign+exponent with mantissa
    09A2 D006   MOVB R6, R0
    09A4 1007   JMP  >09B4       ; compare FPAC against zero & store result
    

    First it checks if the result of the mantissa divide was larger than 1: if this is the case, the top digit of EF will be non-zero. It then shifts by one or two hex digits to the right to create the normalized mantissa. No change to the exponent is necessary, as one shift is merely compensating for all the clever shifts we did at the start of the code (i.e. one shift puts the fixed point in the proper place). For the other shift, it has already made the required adjustment to the exponent at the start, see the code at >095C.

     

    The last step is to merge in the sign/exponent byte and to jump to the standard exit routine.

     

    All in all, TI has used a very clever and fast algorithm for floating point divide.


  15. With all the supplementary operations out of the way, time to analyze the arithmetic floating point operations: MR, DR, AR and SR. First up is MR.

     

    To understand the code, let's first look at the math involved. Suppose we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

    S1 x 0.M1 x 16 ^ E1

    and

    S2 x 0.M2 x 16 ^ E2

     

    The product will be:

    S1 x 0.M1 x 16 ^ E1 x S2 x 0.M2 x 16 ^ E2

    which is the same as:

    (S1 x S2) x (0.M1 x 0.M2) x (16 ^ E1 x 16 ^ E2)

    which is the same as:

    (S1 x S2) x (0.M1 x 0.M2) x 16 ^ (E1 + E2)

    This last formula is what the code calculates.

     

    The code begins with:

    ; entry point for MR
    ;
    08F4 C138   MOV  *R8+, R4   ; is multiplier equal to zero?
    08F6 1357   JEQ  >09A6      ; yes: set FPAC to zero & finish
    

    This handles the case where the accumulator is multiplied by zero: the result is zero.

     

    Next comes a subroutine that handles the exponents and the sign bits:

    08F8 06A0   BL   @>0A4C     ; separate & add exponents
    08FA 0A4C
    08FC A187   DATA >A187      ; = "A R7, R6" (for MR add exponents)
    08FE FFC0   DATA -64        ; = subtract double excess 64
    

    The subroutine is followed by two data words, which make it usable for both multiplication and division. The function of the data words will become clear when walking through the subroutine code.

     

    THE SUBROUTINE

    ; subroutine for MR and DR: calculate result exponent and sign
    ;
    0A4C C000   MOV  R0, R0        ; is FPAC zero?
    0A4E 13C4   JEQ  >09D8         ; yes: set flags & finish
    
    0A50 C158   MOV  *R8, R5       ; fetch 2nd word of operand
    

    The subroutine starts with a check for FPAC equalling zero (i.e. the multiplicand or the numerator is zero); in that case the result is zero too. Next, it fetches the second word of the operand, which had not been fetched earlier. The code can now rely on both FPAC (R0,R1) and the operand (R4,R5) being in standard normalized format. The first thing it does is separate the mantissas from the sign bits and exponents:

    0A52 C180   MOV  R0, R6        ; save exponents in R6 and R7
    0A54 C1C4   MOV  R4, R7
    0A56 7000   SB   R0, R0        ; remove exponents from mantissas
    0A58 7104   SB   R4, R4
    

    The next thing is multiplying the sign bits:

    0A5A C207   MOV  R7, R8        ; figure out sign of result in R8
    0A5C 2A06   XOR  R6, R8
    

    Multiplying two signs is the same as taking the exclusive OR of their sign bits. Note that the top bit in R8 will have the sign of the result, but the other 15 bits are not zero; those bits are meaningless to the multiplication. This is followed by placing the excess-64 exponents as proper integers in R6 and R7:

    0A5E 06C6   SWPB R6            ; place FPAC exponent in R6
    0A60 0246   ANDI R6, >007F
    0A62 007F
    0A64 06C7   SWPB R7            ; place operand exponent in R7
    0A66 0247   ANDI R7, >007F
    0A68 007F
    

    Now we are ready to add the two exponents together (or subtract them for division). This is where the two data words that followed the subroutine call are used:

    0A6A 04BB   X    *R11+         ; MR: "A R7,R6", DR: "S R7,R6"
    0A6C A1BB   A    *R11+, R6     ; MR: -64,       DR: +64
    0A6E 0286   CI   R6, >007F     ; exponent in range?
    0A70 007F
    0A72 15D7   JGT  >0A22         ; jump on overflow
    0A74 1B9D   JH   >09B0         ; jump on underflow
    

    First it executes the instruction in the first data word. For MR this is "A R7,R6", which adds the exponents. However, by adding the exponents the excess 64 is now included twice and must be removed once. The next data word contains "-64", which is added to the exponents. The result is that the right excess-64 exponent is now in R6. This is followed by a range check. Here the utility of the excess-64 encoding becomes clear to see. I'm not sure why the jump to overflow at >097C is done via >0A22: the real target is (just) within range. The subroutine finishes by merging the result sign bit back into the result exponent:

    0A76 0A18   SLA  R8, 1         ; put sign bit back in exponent
    0A78 1702   JNC  >0A7E
    0A7A 0226   AI   R6, >80
    0A7C 0080
    
    0A7E 045B   RT
    

    END OF SUBROUTINE
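
    The net effect of the subroutine can be summarized in a few lines of Python. This is only a sketch of the arithmetic: the function name, the divide flag (standing in for the two DATA words) and the status strings are assumptions for the illustration.

```python
def result_sign_exponent(fpac_word, operand_word, divide):
    """Combine the sign/exponent bytes of two reals in IBM360 format.

    fpac_word, operand_word: the first 16-bit word of each real
    (sign bit, 7-bit excess-64 exponent, top mantissa byte).
    Returns (sign_exponent_byte, status).
    """
    sign = ((fpac_word ^ operand_word) >> 15) & 1   # XOR of the sign bits
    e1 = (fpac_word >> 8) & 0x7F
    e2 = (operand_word >> 8) & 0x7F
    if divide:
        exponent = e1 - e2 + 64      # subtraction cancels the excess: add it back
    else:
        exponent = e1 + e2 - 64      # addition doubles the excess: remove it once
    if exponent > 0x7F:
        return None, "overflow"
    if exponent < 0:
        return None, "underflow"
    return (sign << 7) | exponent, "ok"
```

    For example, multiplying +1.0 (>4110) by -1.0 (>C110) gives >C2: a negative sign with exponent 66, which the normalization step later brings back to 65.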

     

    Now we can go back to the main MR routine at >0900. This happens to be the 32 x 32 -> 64 bit multiplication routine that we already saw as part of the MM instruction:

    0900 C085   MOV  R5, R2      ; long multiply in four 16x16 bit steps
    0902 3881   MPY  R1, R2
    0904 C205   MOV  R5, R8
    0906 3A00   MPY  R0, R8
    0908 C284   MOV  R4, R10
    090A 3A81   MPY  R1, R10
    090C 3804   MPY  R4, R0
    
    090E 002A   AM   R10, R8     ; add the partial results
    0910 420A
    0912 1701   JNC  >0916
    0914 0580   INC  R0
    0916 002A   AM   R8, R1
    0918 4048
    091A 1701   JNC  >091E
    091C 0580   INC  R0
    

    I'll not discuss it again; simply scroll up to the analysis of the MM instruction for detail on the above code. In essence it multiplies R0,R1 by R4,R5, leaving its result in R0..R3.

     

    Next comes the bit of MR code that was skipped in the MM discussion. That code is:

    091E D186   MOVB R6, R6      ; is this a MR or MM instruction?
    0920 1607   JNE  >0930       ; jump if MM
    
    0922 D001   MOVB R1, R0      ; MR: prenormalize mantissa
    0924 06C0   SWPB R0
    0926 D042   MOVB R2, R1
    0928 06C1   SWPB R1
    092A 06C2   SWPB R2
    092C C0C6   MOV  R6, R3
    092E 10C2   JMP  >08B4
    

    First the flag byte in the upper half of R6 is checked. For MR this will be zero, as the exponent cannot be larger than >007F.

     

    Next the code pre-normalizes the mantissa by moving it two hex digits (one byte) to the left. The simple way to think about this is that we are multiplying two 24 bit mantissas into a 48 bit result. We are only interested in the top 24 bits of that result, and moving two digits to the left places these 24 bits in R0,R1 properly aligned for combination with the sign and exponent. The more precise way to think about this is that we are doing fixed point arithmetic here, and that a six digit shift right is needed to keep the decimal point in the right place; shifting two hex digits to the left and taking the high two words is functionally the same (and leaves some extra digits available).

     

    However, we are not done as it is possible that the first hex digit is still zero. This is easy to see when using two decimal examples:

    0.10 x 0.10 = 0.01 and 0.99 x 0.99 = 0.98

    Even though we have kept the decimal point in the right place, the first digit can still be zero in some cases. To normalize this there is a routine that is shared by the other arithmetical operations. This routine expects the result sign/exponent in R3 and so it is moved there first. It also expects the next hex digit in the top of R2.

     

    The shared tail routine is:

    ; normalize FPAC mantissa (leftward)
    ;
    08B4 0280   CI   R0, >000F      ; is the highest nibble 0?
    08B6 000F
    08B8 1509   JGT  >08CC          ; no: mantissa is normalized
    08BA 24E0   CZC  @>0BD6, R3     ; exponent already 0?
    08BC 0BD6
    08BE 1378   JEQ  >09B0          ; yes: underflow
    08C0 0603   DEC  R3             ; reduce exponent & shift mantissa one nibble    
    08C2 001D   SLAM R0,4
    08C4 4100
    08C6 09C2   SRL  R2, 12         ; shift in one nibble extra precision
    08C8 A042   A    R2, R1
    08CA 10F4   JMP  >08B4
    
    ..
    0BD6 007F   DATA >007F            ; exponent bits
    ..
    

    First it checks whether the first mantissa digit is zero. If not, the mantissa is already normalized. If it is, the exponent is checked. If the exponent is already zero, the mantissa cannot be shifted further: that would require the exponent to be reduced by one, which puts it out of range (the excess-64 exponent would move from -64 to -65). In that case an underflow is reported.

    In the other case, the exponent is reduced and the mantissa shifted left by one. To keep accuracy, a 'spare' extra digit of precision kept in R2 is shifted in. Because it is a common tail, the routine will check if further shifts are necessary, but in the case of MR it will only ever perform one shift. After that, only merging the result exponent back in remains:

    08CC 06C3   SWPB R3             ; merge exponent back in
    08CE D003   MOVB R3, R0
    08D0 1071   JMP  >09B4          ; store FPAC & set status bits
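
    Putting the prenormalization and the single left shift together, the mantissa part of MR can be sketched in Python as follows. The function name and the (mantissa, exponent adjustment) return value are assumptions for this sketch; the ROM keeps everything in R0..R3.

```python
def multiply_mantissas(m1, m2):
    """Multiply two normalized 24-bit mantissas (0x100000..0xFFFFFF).

    Returns (mantissa, adjust), where adjust is 0 or -1 and must be
    added to the result exponent.
    """
    product = m1 * m2                  # 48-bit fixed point product
    mantissa = product >> 24           # top six hex digits
    guard = (product >> 20) & 0xF      # seventh digit, kept as spare precision
    adjust = 0
    if mantissa <= 0x0FFFFF:           # leading digit zero: shift left once
        mantissa = ((mantissa << 4) | guard) & 0xFFFFFF
        adjust = -1
    return mantissa, adjust
```

    The smallest possible product is 0x100000 x 0x100000, which gives 0x010000 before the shift, so one left shift is always enough. This matches the decimal examples above: 0.10 x 0.10 needs the shift, 0.99 x 0.99 does not.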
    


    The code for underflow is simple, and very similar to the code for overflow:

    ; underflow: additionally set AF status bit
    09B0 026F   ORI  R15, >0800
    09B2 0800
    
    <continues with normal exit code at >09B4>
    

    Underflow only sets the arithmetic fault (AF) status bit. This allows the user program to distinguish overflow (C bit also set) from underflow.

     

     


  16. Conversion from floating point back to integers is done with CRI and CRE, for a 16 bit or 32 bit integer respectively. In principle this is just the reverse of CIR and CER that were analyzed above, but it is a bit more involved as the code has to check for overflow: the real number may be larger than what fits in the integer.

     

    In my view the code in the macro rom for CRI and CRE is a bit convoluted and borderline buggy, but maybe I don't understand the code right. Better insights are welcome.

     

    The code for CRI and CRE starts with:

    ; CRI: convert real to integer
    ;
    09CE 04C8   CLR  R8
    09D0 1001   JMP  >09D4
    
    ; CRE: convert real to extended
    ;
    09D2 0708   SETO R8
    
    09D4 04C2   CLR  R2                ; prepare for 48 bit shift in R0,R1,R2
    
    09D6 C1C0   MOV  R0, R7            ; if FPAC is zero, nothing to do:
    09D8 13E6   JEQ  >09A6             ;   store zero result & exit
    

    CRI and CRE share most of their code, using R8 as a flag to keep track. Also, the case where the real number is zero is special cased so that the remaining code can assume that the number is in standard format. The register R2 is cleared, the reason for which becomes clear further below. The test for zero has the side effect of saving the sign bit in R7.

     

    The next bit of code is also clear:

    09DA C180   MOV  R0, R6            ; separate mantissa
    09DC 7000   SB   R0, R0            ;     and put exponent in R6
    09DE 06C6   SWPB R6
    09E0 0246   ANDI R6, >007F
    09E2 007F
    

    It separates out the mantissa (into R0,R1) from the exponent (into R6) and the sign bit (already in R7). Now the mantissa is in 0.MMMMMM format, and this must be converted to MMMMMMMM.0 format, i.e. the reverse operation of that in CIR and CER. This only works if the exponent is in the range +1 to +8 (= +65 to +72 including the excess 64). If the exponent is less than 1 the real number is between (and excluding) +1 and -1 and will be truncated to 0. If the exponent is larger than 8, the number does not fit in 32 bits. This is all handled by the following code:

    09E4 0226   AI   R6, -65           ; is exponent at least 1?
    09E6 FFBF
    09E8 112D   JLT  >0A44             ; if less than 1, result is zero
      
    09EA 0506   NEG  R6                ; get 32 bit result in R1,R2
    09EC 0226   AI   R6, >0009         ; by shifting mantissa between
    09EE 0009                          ; 2 and 10 hex digits right.
    09F0 0606   DEC  R6
    09F2 1108   JLT  >0A04
    09F4 001C   SRAM R1, 4
    09F6 4101
    09F8 0A41   SLA  R1, 4
    09FA 001C   SRAM R0, 4
    09FC 4100
    09FE 0240   ANDI R0, >0FFF         ;   (bug: superfluous?)
    0A00 0FFF
    0A02 10F6   JMP  >09F0
    
    0A04 C100   MOV  R0, R4            ; if exponent was >8, R4 will be non-zero
    

    First it tests for an exponent less than 1 and returns a zero result if so. The test for +8 is skipped, as this is handled in another way that will become clear shortly. Instead it calculates the number of places that the mantissa has to be shifted. It uses a 48 bit shift in R0-R1-R2, shifting the mantissa between 2 and 10 nibbles (hex digits) right. This leaves the mantissa in MMMMMMMM.0 format in R1,R2 and leaves R0 zero. Note that for a large number the rightmost 2 hex digits will be zero, as the mantissa only has 6 hex digits.

     

    The test for an exponent larger than 8 is implicit: the mantissa will be shifted 1 or 0 nibbles and R0 will not be zero. This fact is used later when the result is tested for being in range.

     

    In the above code AND-ing out the top digit of R0 seems superfluous: The top byte has been set to zero when the exponent and sign were separated out and hence SRAM will always shift in zeroes. Perhaps this is a leftover from earlier code. I would have thought it more logical to leave the mantissa in R0,R1 and first shift it two places to the left, followed by 0 to 7 places to the right (i.e. the exact reverse of what is done in the CIR/CER code). This would have required a separate test for the exponent being out of range, but the code would still have been shorter and faster, I think. In that code structure the AND-ing out would have been necessary.

     

    Next we come to handling the sign bit and range tests. Here the code for CRI and CRE diverges again:

    0A06 C208   MOV  R8, R8            ; opcode was CRE or CRI?
    0A08 160D   JNE  >0A24
    
    0A0A C002   MOV  R2, R0            ; CRI: fit result in 16 bits
    0A0C C1C7   MOV  R7, R7            ; if real was negative, negate int
    0A0E 1501   JGT  >0A12             ;   (bug: should jump to >0A18)
    0A10 0500   NEG  R0
    0A12 0282   CI   R2, >8000         ; value -32768 is okay
    0A14 8000
    0A16 1302   JEQ  >0A1C
    
    0A18 C082   MOV  R2, R2            ; check range -32767..+32767
    0A1A 11B0   JLT  >097C             ; -> report overflow (>0A20?)
    0A1C E101   SOC  R1, R4            ; number was >65535?
    0A1E 1314   JEQ  >0A48             ; no: store result (bug: should be >0A46)
    0A20 04C1   CLR  R1
    0A22 10AC   JMP  >097C             ; report overflow
    

    First we test for a negative sign and negate the 16 bit integer as necessary. There is also a check for the value -32768, which is okay whereas +32768 is out of range. The jump instruction seems to be wrong and allows +32768 as well. This bug means that the real number +32768 is converted to the integer -32768 instead of being reported as an overflow error.

     

    Next is the check that the (unsigned) mantissa was in the proper range of -32767 to +32767 and an overflow is reported if outside. Also, if the mantissa was larger than 65535 or the exponent was larger than 8, an overflow error is reported.

     

    A last bit of strangeness is the value of R1 upon return. The documentation is silent on what value R1 should have. In some cases it is set to zero, in other cases the absolute value of the number is left behind. Changing the destination address of one jump ensures that R1 is always set to zero.

     

    The range check for CRE is similar (including bugs):

    0A24 C001   MOV  R1, R0            ; CRE: fit result in 32 bits
    0A26 C1C7   MOV  R7, R7            ; if real was negative, negate 32 bit
    0A28 1504   JGT  >0A32             ;   (bug: should jump to >0A38)
    0A2A 0540   INV  R0
    0A2C 0502   NEG  R2
    0A2E 1701   JNC  >0A32
    0A30 0580   INC  R0
    0A32 0281   CI   R1, >8000         ; value -2147483648 is okay
    0A34 8000                          ;   (note: test cannot be exact)
    0A36 1302   JEQ  >0A3C
    
    0A38 C041   MOV  R1, R1            ; check range -2147483647..+2147483647
    0A3A 1102   JLT  >0A40             ; -> report overflow
    0A3C C104   MOV  R4, R4            ; number was >4294967295?
    0A3E 1304   JEQ  >0A48             ; no: store result
    0A40 C042   MOV  R2, R1
    0A42 10EF   JMP  >0A22             ; report overflow
    

    The code for handling the sign bit is a bit longer as it has to negate a 32 bit number. Again the jump for a positive number seems to be off, not skipping the test for -2147483648 as within range.

     

    However, the test for -2147483648 is conceptually wrong: that number cannot be expressed accurately in a single precision floating point number: it requires 8 hex digits of accuracy and the IBM360 format only has 6. The result is that a number like -2.14750e9 (which is definitely out of range) is reported as okay. The mantissa for -2.14750e9 is >800040 and this ends up in R1,R2 as >80004000. After negating this becomes >7FFFC000 which is +2147467264. Something similar happens for +2.14750e9.

     

    It would have been better to exclude trying to handle the -2147483648 case altogether and simply suffice with the -2147483647..+2147483647 range test (which due to the six digit accuracy is actually a test for -2147483392..+2147483392).

     

    The last bit of code deals with clearing out the FPAC when the real number truncates to zero (as tested for at the start of the code) and setting the high word (R1) of FPAC as necessary:

    0A44 04C0   CLR  R0                ; clear FPAC
    0A46 04C2   CLR  R2
    
    0A48 C042   MOV  R2, R1            ; set high word of FPAC
    0A4A 10B4   JMP  >09B4             ; store result & exit
    

    That only leaves the reporting of an overflow condition:

    ; overflow: set C and AF status bits & store result
    097C 026F   ORI  R15, >1800
    097E 1800
    0980 1019   JMP  >09B4        ; store FPAC & status bits
    

    All it does is set the C and AF (arithmetic fault) status bits (the C bit indicates it is an overflow, not an underflow) and then perform a normal return. However, if the AFIE status bit (arithmetic fault interrupt enable) was also set, this means that immediately after the exit from macrocode a level 2 interrupt is generated. If the AFIE bit is not set, the user program must separately check for the AF error bit being set.

     

    All in all, as I understand it, the code for CRI and CRE has two corner case bugs and looks a bit suspect in two other places. Perhaps it was written the day after the Christmas party. I wonder if the corner case bugs were known back in the day (perhaps the corner cases did not matter enough to be detected).


  17. I've done a bit more testing on the 99105 chips that seem to support the TI990 variant of the LDS, LDD and LMF instructions. The theory here is that these chips have silicon inside with the macro ROM that was used for the TI990/10A mini. The further tests support that theory.

     

    The TI990 assembler manual

    http://bitsavers.informatik.uni-stuttgart.de/pdf/ti/990/assembler/2270509-9701A_AsmRef_Nov82.pdf
    documents what LDD, LDS and LMF were supposed to do.

    The TI990/10A general description
    http://bitsavers.informatik.uni-stuttgart.de/pdf/ti/990/990-10/2302633_990-10A_GenDescr_Sep82.pdf
    documents that a 10A has its mapper registers in the parallel CRU area >9F80->9FFF.

    (there is a table with the CRU address map that has this info on page 1-19).

     

    So I set up a test to see if my 99105 chip supported that. The test looks to see if the LDD/LDS/LMF instructions indeed store the map information to parallel CRU addresses in the >9F80->9FFF range. They do:

     

    LMF R0,0 loads the 6 words to >9F80 (i.e. this is where map 0 is)

    LMF R0,1 loads the 6 words to >9FA0 (i.e. this is where map 1 is)
    LDD and LDS both load the 6 words to >9FC0 (i.e. this is where map 2 is)

    This means that a sequence LDD-LDS-MOV is of limited use: both the source and destination will use the "map 2" loaded by LDS, as that overwrites the upload by LDD.

     

    My conclusion is that the code for LDD/LDS/LMF in the 99000 ROM is simple and easy to factory test. This probably means that there really is no backdoor to its ROM. On the other hand, using the 99110 ROM as a guide, it is now fairly easy to newly write the code for a 99000 ROM. It will of course not be exact, but it will document what is functionally in there.

     

    This is my go at recreating that code:

    ; Macro ROM jump table
    ;
            AORG >0800
    
            DATA ILLOP
            DATA ILLOP
            DATA ILLOP
            DATA LMENTR
            DATA ILLOP
            DATA ILLOP
            DATA LDENTR
            DATA ILLOP
            DATA ILLOP
            DATA ILLOP
    
    ; LMF
    ;
    LMENTR  MOV R5,R2          ; Is opcode LMF?
            ANDI R2, >FFF0     ; keep the map bit so >0320 and >0330 differ
            CI R2, >0320
            JNE MAP1
            LI R12, >9F80      ; Yes: LMF for map 0
            JMP PRVTST
    MAP1    CI R2, >0330
            JNE ILLOP
            LI R12, >9FA0      ; Yes: LMF for map 1
    
    PRVTST  CZC @USER,R15      ; CPU is in supervisor mode?
            JNE PRVERR
    
            ANDI R5, >000F     ; Fetch map pointer from register W
            ORI R5, >0010      ; by pretending W is *R (Ts = 01)
            EVAD R5
    
            LDCR *R8+,11       ; Load six words into MMU
            LDCR *R8+,11
            LDCR *R8+,11
            LDCR *R8+,11
            LDCR *R8+,11
            LDCR *R8, 10
            RTWP               ; Normal exit      
    
    ILLOP   RTWP2              ; Return with ILLOP error
    
    PRVERR  LIMI 0             ; Return with PRIVOP error
            RTWP
    
    ; LDS/LDD
    ;
    LDENTR  MOV R5, R2         ; Opcode is LDD or LDS?
            ANDI R2, >FF80
            CI R2, >0780
            JNE ILLOP
    
            CZC @USER,R15      ; CPU is in supervisor mode?
            JNE PRVERR
    
            ANDI R5, >003F     ; Mask out src bits
            EVAD R5
            JNE NOINCR         ; Handle auto-increment
            INCT *R10
    
    NOINCR  LI R12, >9FC0      ; Base address of map register 2
            MOV *R8, R2        ; Fetch pointer to new map
    
            LDCR *R2+,11       ; Load six words into MMU
            LDCR *R2+,11
            LDCR *R2+,11
            LDCR *R2+,11
            LDCR *R2+,11
            LDCR *R2, 10
            RTWP4              ; return, skip interrupt test
    
    USER    DATA >0100
    
    

    Note that for CRU addresses in parallel space (i.e. R12 has its top bit set), a bit count of 11 means 'transfer a word and post-increment R12 by 2' and a bit count of 10 means 'transfer a word, leave R12 as-is'.
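    To make the effect of those six LDCR transfers concrete, here is a small Python model. It is a sketch under assumptions: the dictionary standing in for the parallel CRU space and the names MAP_BASE and load_map are invented for the illustration, and only the three base addresses come from the test results above.

```python
# Base addresses as observed in the LMF/LDD/LDS test
MAP_BASE = {0: 0x9F80, 1: 0x9FA0, 2: 0x9FC0}

def load_map(cru, map_number, words):
    """Model six LDCR transfers: five with count 11, the last with count 10."""
    assert len(words) == 6
    r12 = MAP_BASE[map_number]
    for index, word in enumerate(words):
        cru[r12] = word & 0xFFFF
        if index < 5:
            r12 += 2          # bit count 11: post-increment R12 by 2
        # bit count 10 on the last transfer leaves R12 as-is
    return r12

cru = {}                      # hypothetical stand-in for parallel CRU space
final_r12 = load_map(cru, 1, [1, 2, 3, 4, 5, 6])

# LDD followed by LDS: both land in map 2, so the LDS values win
load_map(cru, 2, [0x11] * 6)  # "LDD"
load_map(cru, 2, [0x22] * 6)  # "LDS"
print(hex(final_r12), hex(cru[0x9FC0]))
```

    The second half shows why the LDD-LDS-MOV sequence is of limited use on this chip: whatever LDD loaded is gone by the time the MOV runs.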

     

    I think I have also finally figured out how the TI990/10A mapper can work out that a LDD/LDS is in effect (as the supervisor can use both PSEL=0 and PSEL=1, it is hard to tell when the PSEL/D15 line is being inverted).

     

    However, every change to the PSEL status bit is echoed on the address bus during an ST machine cycle (bus code 1101, see section 2.4.2. and table 2 in the data sheet). The mapper can use this to keep a copy of the PSEL bit in the status register in a flip-flop, and doing a XOR with the bit on the PSEL/D15 line will tell it when the bit is inverted and "map 2" has to be used.
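    The flip-flop and XOR idea can be sketched as follows. This is my reading of how such a mapper could work, not a description of the actual 990/10A hardware; the class and method names are invented for the illustration.

```python
class PselTracker:
    """Hypothetical model of a mapper that shadows the PSEL status bit."""

    def __init__(self):
        self.psel_copy = 0            # the flip-flop

    def on_st_cycle(self, psel_bit):
        """Latch the PSEL bit echoed during an ST machine cycle."""
        self.psel_copy = psel_bit & 1

    def map2_selected(self, psel_line):
        """The line differs from the latched copy only while it is inverted."""
        return bool(self.psel_copy ^ (psel_line & 1))

tracker = PselTracker()
tracker.on_st_cycle(1)                    # status register now has PSEL=1
normal = tracker.map2_selected(1)         # line matches the copy
inverted = tracker.map2_selected(0)       # line is inverted: use map 2
print(normal, inverted)
```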

     

     


  18. Today a look at three short routines, implementing STR, LR and NEGR, which store, load or negate the accumulator ("FPAC") respectively.

     

    The code is:

    ; entry point for LR
    ;
    08D2 C038   MOV  *R8+, R0       ; load S into local FPAC
    08D4 C058   MOV  *R8, R1
    08D6 1002   JMP  >08DC
    
    ; entry point for STR
    ;
    08D8 CE00   MOV  R0, *R8+       ; store FPAC into S
    08DA C601   MOV  R1, *R8
    08DC 0242   ANDI R2, >1800      ; C and AF status bits unaffected
    08DE 1800
    08E0 E3C2   SOC  R2, R15
    08E2 1068   JMP  >09B4          ; store result, set flags & finish
    
    ; code for NEGR
    ;
    08E4 0242   ANDI R2, >1800      ; C and AF status bits unaffected
    08E6 1800
    08E8 E3C2   SOC  R2, R15
    08EA C000   MOV  R0, R0         ; is FPAC zero?
    08EC 135C   JEQ  >09A6          ; yes, set EQ flag & finish
    08EE 0220   AI   R0, >8000      ; no, invert sign bit
    08F0 8000
    08F2 1060   JMP  >09B4          ; store result, set flags & finish
    

    The code is really very simple and straightforward. The only special thing is that these instructions only affect status bits ST0-2, and hence ST3 and ST4 are restored from the copy of R15 that was made in the generic entry routine.
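    As a cross-check of the NEGR path, here is the same logic in Python. It is a sketch only: the function name negr is made up, and the FPAC is modelled as two 16 bit words (R0, R1) as in the listing.

```python
def negr(r0, r1):
    """Negate a real held in two 16-bit words, following the listing above."""
    if r0 == 0:                       # MOV R0,R0 / JEQ: zero exit clears FPAC
        return 0, 0
    # AI R0,>8000: adding >8000 modulo 2**16 flips only the sign bit
    return (r0 + 0x8000) & 0xFFFF, r1

plus_one = (0x4110, 0x0000)           # +1.0 in the IBM360 style format
minus_one = negr(*plus_one)
back_again = negr(*minus_one)
print([hex(w) for w in minus_one], [hex(w) for w in back_again])
```

    Adding >8000 rather than using an XOR works because the carry out of the top bit is simply discarded.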

     

    Next up will be a look at CRI and CRE, which share a lot of code and appear to have a few corner case bugs.

     


  19. Time to work with floating point ("real") numbers. The simplest ones are CIR and CER, which convert a 16 bit or a 32 bit integer into a real number. The two instructions share nearly all of their code.

     

    The TI990 floating point format is described in section B.4 of the data sheet (i.e. in the 99110 appendix). It is the IBM360 single precision format. In summary, a real number is expressed as:

     

    N = S x 0.MMMMMM x 16 ^ EE

     

    S is the sign bit, M is a 'mantissa' of 6 hex digits (note: always unsigned) and EE an exponent with 7 bits, i.e. the exponent range is -64 to +63. The number is 'normalized' so that the first hex digit of the mantissa is always non-zero (this keeps accuracy to a maximum). This is achieved by shifting the mantissa the required number of hex digits and adjusting the exponent accordingly. The exponent is in "excess 64" format; this means that it has 64 added to it. In this way the range becomes 0 to 127 and we can work with the exponent as an unsigned number.
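    For those who want to play with the format, a decoder fits in a few lines of Python. This is a minimal sketch based on the description above; the function name decode_real is mine, and it takes the real as one 32 bit value (high word first).

```python
def decode_real(word32):
    """Decode a 32-bit real: sign bit, excess-64 exponent, 24-bit mantissa."""
    sign = -1 if word32 & 0x80000000 else 1
    exponent = ((word32 >> 24) & 0x7F) - 64      # remove the excess-64 bias
    mantissa = word32 & 0x00FFFFFF               # six hex digits, 0.MMMMMM
    return sign * (mantissa / 16**6) * 16.0**exponent

print(decode_real(0x41100000))   # 0.100000 (hex) x 16^1
print(decode_real(0xC1100000))
print(decode_real(0x42640000))   # 0.640000 (hex) x 16^2
```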

     

    The code in the 99110 ROM often splits a real number into its component parts to work with them independently and recombines the components at the end of the calculation. Often the mantissa is calculated in more precision than 6 hex digits (= 24 bits) to reduce rounding errors.

     

    Let's start with CIR. After the entry code that was analyzed earlier in this thread, it starts with:

    ; entry point for CIR
    ;
    0A80 C018   MOV  *R8, R0       ; fetch S and sign extend into R0,R1
    0A82 C040   MOV  R0, R1
    0A84 08F0   SRA  R0, 15
    

    This fetches the 16 bit integer operand and sign-extends it to a 32 bit operand located in our local floating point accumulator, FPAC. The rest of the code can now be the same as for CER. That instruction starts with:

    ; entry point for CER
    ;
    0A86 026F   ORI  R15, >1000    ; set C bit unconditionally
    0A88 1000
    
    0A8A C080   MOV  R0, R2        ; if S is zero, clear FPAC & finish
    0A8C E081   SOC  R1, R2
    0A8E 13A4   JEQ  >09D8
    

    This sets the C bit unconditionally, which is how the data sheet specifies it. I'm not sure why this is useful: comments welcome. Then it special-cases a zero operand; we'll look at that further at the end.

     

    Then we begin the conversion: the integer is separated into a sign bit and an unsigned number:

    0A90 C1C0   MOV  R0, R7        ; extract sign bit
    0A92 0247   ANDI R7, >8000
    0A94 8000
    0A96 1304   JEQ  >0AA0         ; if negative, negate the number
    0A98 0540   INV  R0
    0A9A 0501   NEG  R1
    0A9C 1701   JNC  >0AA0
    0A9E 0580   INC  R0
    

    In effect we now have S in R7 and a (32 bit) mantissa in R0,R1. The exponent is implicitly 0. However, the number is not normalized, as the mantissa must be 0.MMMMMM, and it is now MMMMMMMM.0. Conceptually, this can easily be fixed by saying the radix point is not to the right of the mantissa, but to its left and setting the exponent to +8. Including the excess-64 the exponent becomes 72, or >48 in hex:

    0AA0 0206   LI   R6, >0048     ; start exponent at +8
    0AA2 0048
    


    We're still not done, because the integer number may have had leading zeros, and the mantissa must always start with a non-zero digit. As the number cannot be zero (we excluded that case above), this can always be achieved by shifting the mantissa between 0 and 7 hex digits to the left and adjusting the exponent accordingly:

    0AA4 C000   MOV  R0, R0        ; if top word zero, shift 4 nibbles
    0AA6 1604   JNE  >0AB0
    0AA8 C001   MOV  R1, R0
    0AAA 04C1   CLR  R1
    0AAC 0226   AI   R6, -4        ; and adjust exponent accordingly
    0AAE FFFC
    
    0AB0 D000   MOVB R0, R0        ; if top byte zero, shift 2 nibbles
    0AB2 1603   JNE  >0ABA
    0AB4 001D   SLAM R0, 8
    0AB6 4200  
    0AB8 0646   DECT R6            ; and adjust exponent accordingly
    
    0ABA C080   MOV  R0, R2        ; if top nibble is zero, shift one nibble
    0ABC 0242   ANDI R2, >F000
    0ABE F000
    0AC0 1603   JNE  >0AC8
    0AC2 001D   SLAM R0, 4
    0AC4 4100
    0AC6 0606   DEC  R6            ; and adjust exponent accordingly
    


    After the above steps, we have the sign bit in R7, the mantissa in R0,R1 and the exponent in R6. The last step to make the real number is combining all component parts:

    0AC8 06C6   SWPB R6            ; merge exponent (R6), mantissa (R0,R1) and
    0ACA 001C   SRAM R0, 8         ; sign (R7) together
    0ACC 4200
    0ACE D006   MOVB R6, R0
    0AD0 E007   SOC  R7, R0
    

    Note that the mantissa is shifted 8 bits to make room for the sign and exponent. This loses 8 bits of accuracy. The lost bits are truncated, i.e. the remaining 24 bit mantissa is not rounded up if the lost bits are above >80. Such rounding could have been achieved by adding >0000 0080 to the mantissa, using the AM instruction (the top *bit* of the mantissa will always be zero, thus this cannot overflow). However, the ROM is almost full and I don't think there is space left to add such rounding to all floating point instructions.
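    The whole conversion can be replayed in Python to check the steps. This is a sketch of my understanding of the listing, not a cycle-accurate model: the function name cer is borrowed from the instruction, the register pairs are collapsed into plain integers, and the truncation (no rounding) is kept as in the ROM.

```python
def cer(value):
    """Convert a signed 32-bit integer to the 32-bit real format (truncating)."""
    if value == 0:
        return 0                              # zero is special-cased up front
    sign = 0x80000000 if value < 0 else 0     # R7
    mantissa = abs(value) & 0xFFFFFFFF        # R0,R1 after the negate
    exponent = 0x48                           # +8 in excess-64 notation (R6)
    if mantissa >> 16 == 0:                   # top word zero: shift 4 nibbles
        mantissa <<= 16
        exponent -= 4
    if mantissa >> 24 == 0:                   # top byte zero: shift 2 nibbles
        mantissa <<= 8
        exponent -= 2
    if mantissa >> 28 == 0:                   # top nibble zero: shift 1 nibble
        mantissa <<= 4
        exponent -= 1
    # Drop the low 8 bits (truncation) and merge sign, exponent and mantissa
    return sign | (exponent << 24) | (mantissa >> 8)

for n in (1, -1, 100, 0x7FFFFFFF):
    print(n, hex(cer(n)))
```

    The last case shows the truncation at work: the low byte >FF of >7FFFFFFF is dropped, not rounded.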

     

    What remains is storing the number in the user's FPAC and setting the status bits appropriately:

    0AD2 10BB   JMP  >0A4A         ; store FPAC & finish
    ..
    0A4A 10B4   JMP  >09B4
    ..
    ; compare FPAC against zero & store result
    ;
    09B4 C000   MOV  R0, R0            ; test sign
    09B6 1105   JLT  >09C2             ; if negative only set L> bit
    09B8 1602   JNE  >09BE             ; if positive set L> and A> bits
    09BA C041   MOV  R1, R1            ; if zero only set EQ bit
    09BC 13F6   JEQ  >09AA
    09BE 026F   ORI  R15, >C000        ; set L> and A> status bits
    09C0 C000
    09C2 026F   ORI  R15, >8000        ; set L> status bit
    09C4 8000
    
    09C6 C740   MOV  R0, *R13           ; store FPAC
    09C8 CB41   MOV  R1, @2(R13)
    09CA 0002
    09CC 0380   RTWP                    ; macro code complete
     


    This bit of code is used at the end of nearly all floating point routines. Note that the macro entry code has already reset ST0-ST4 in R15, so only setting the right bits remains. The handling of a zero result is done in a separate routine.
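    The flag logic of that exit routine boils down to three cases. Here it is as a Python sketch; the constant and function names are mine, and the bit values are the ones used in the ORI instructions of the listing.

```python
ST_LGT = 0x8000    # L> (ST0)
ST_AGT = 0x4000    # A> (ST1)
ST_EQ = 0x2000     # EQ (ST2)

def compare_with_zero(r0, r1, status=0):
    """Return the status word after comparing FPAC (R0,R1) against zero."""
    if r0 & 0x8000:                    # JLT: negative, only L>
        return status | ST_LGT
    if r0 != 0 or r1 != 0:             # positive: L> and A>
        return status | ST_LGT | ST_AGT
    return status | ST_EQ              # zero: only EQ

print(hex(compare_with_zero(0xC110, 0)),
      hex(compare_with_zero(0x4110, 0)),
      hex(compare_with_zero(0, 0)))
```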

     

    The "result is zero" exit routine is also heavily used, including by the CIR and CER instructions (remember the test for zero at the start of that code):

    ..
    09D8 13E6   JEQ  >09A6
    ..
    
    ; clear FPAC, set EQ status bit & store
    ;
    09A6 04C0   CLR  R0
    09A8 04C1   CLR  R1
    09AA 026F   ORI  R15, >2000
    09AC 2000
    09AE 100B   JMP  >09C6             ; store FPAC & exit
    


    That concludes the first two floating point instructions.


  20. And here is the analysis for MM.

     

    It starts with doing the 32x32 bit multiply:

    ; 32 x 32 => 64 bit multiply. S is R0,R1 and D is R4,R5
    ; result is in R0-R3
    ;
    ; used for both MM (R6!=0) and MR (R6==0)
    ; in case of MR it multiplies two 24 bit mantissas
    ;
    0900 C085   MOV  R5, R2      ; long multiply in four 16x16 bit steps
    0902 3881   MPY  R1, R2
    0904 C205   MOV  R5, R8
    0906 3A00   MPY  R0, R8
    0908 C284   MOV  R4, R10
    090A 3A81   MPY  R1, R10
    090C 3804   MPY  R4, R0
    
    090E 002A   AM   R10, R8     ; add the partial results
    0910 420A
    0912 1701   JNC  >0916
    0914 0580   INC  R0
    0916 002A   AM   R8, R1
    0918 4048
    091A 1701   JNC  >091E
    091C 0580   INC  R0
    


    What the above code does is easier to understand if it is written out like a manual multiplication:

                        -R0-.-R1-   = S
                        -R4-.-R5-   = D
              -------------------x
                        -R2-.-R3-   = RL = R1 x R5
                   -R8-.-R9-.0000   = T1 = R0 x R5
                   -RA-.-RB-.0000   = T2 = R4 x R1
              -R0-.-R1-.0000.0000   = RH = R4 x R0
              ===================+
              -R0-.-R1-.-R2-.-R3-   = R
    

    In the above figure I've used RA for R10 and RB for R11 to keep alignment.
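    The same four-step scheme in Python, for anyone who wants to convince themselves that the two carry fix-ups are sufficient. This is only a sketch of the unsigned core shown above (the function name mul32x32 is invented here, and any sign handling elsewhere in the ROM is not modelled).

```python
def mul32x32(s, d):
    """Unsigned 32 x 32 -> 64 bit multiply built from four 16 x 16 products."""
    s_hi, s_lo = s >> 16, s & 0xFFFF          # R0, R1
    d_hi, d_lo = d >> 16, d & 0xFFFF          # R4, R5

    rl = s_lo * d_lo                          # RL in R2,R3
    t1 = s_hi * d_lo                          # T1 in R8,R9
    t2 = d_hi * s_lo                          # T2 in R10,R11
    rh = d_hi * s_hi                          # RH in R0,R1

    r0, r1 = rh >> 16, rh & 0xFFFF
    r2, r3 = rl >> 16, rl & 0xFFFF

    middle = t1 + t2                          # AM R10,R8
    if middle > 0xFFFFFFFF:                   # carry out: INC R0
        r0 = (r0 + 1) & 0xFFFF
        middle &= 0xFFFFFFFF

    acc = ((r1 << 16) | r2) + middle          # AM R8,R1
    if acc > 0xFFFFFFFF:                      # carry out: INC R0
        r0 = (r0 + 1) & 0xFFFF
        acc &= 0xFFFFFFFF

    return (r0 << 48) | (acc << 16) | r3

print(hex(mul32x32(0xFFFFFFFF, 0xFFFFFFFF)))
```

    Both carries land in R0 because the two middle products sit 16 bits up: a carry out of their 32 bit sum has weight 2^48, which is the lowest bit of R0.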

     

    The last bit is nothing more than storing the result and setting the status flags:

    091E D186   MOVB R6, R6      ; is this a MR or MM instruction?
    0920 1607   JNE  >0930       ; jump if MM
    
    [ a little code for MR skipped ]
    
    0930 CDC0   MOV  R0, *R7+    ; MM: store 8 byte result in D
    0932 CDC1   MOV  R1, *R7+
    0934 CDC2   MOV  R2, *R7+
    0936 C5C3   MOV  R3, *R7
    0938 E001   SOC  R1, R0      ; if result is 0, set EQ flag
    093A E002   SOC  R2, R0
    093C E003   SOC  R3, R0
    093E 1602   JNE  >0944
    0940 026F   ORI  R15, >2000
    0942 2000    
    0944 0380   RTWP             ; macro execution complete
    


    That is one math operation out of the way.


  21. Let's take a look at the remaining entry point, for opcodes in the >03xx range.

     

    It starts thus:

    ; Start of table entry 0806 (opcodes 03xx)
    ; Only >0301 (CR) and >0302 (MM) are valid on a 99110
    ;
    0B1C 0285   CI   R5, >0302       ; CR or MM opcode?
    0B1E 0302
    0B20 155F   JGT  >0BE0           ; no: test extension & exit
    0B22 1301   JEQ  >0B26           ; for CR clear R5 (as a flag)
    0B24 04C5   CLR  R5
    0B26 024F   ANDI R15, >07FF      ; clear status bits
    0B28 07FF
    0B2A C0BE   MOV  *R14+, R2       ; fetch second opcode word
    0B2C 0206   LI   R6, >0004       ; four byte operands
    0B2E 0004
    

    It only accepts CR and MM and all other opcodes from the group are referred to the extension test. Later on we need an easy test for opcode CR versus MM, and R5 is cleared for this purpose. The status bits ST0-ST4 are cleared, as we saw with the 0Cxx opcodes. Then the second opcode word is fetched and R6 is preloaded with an auto-increment constant.

     

    Next it prepares the source operand:

    0B30 C042   MOV  R2, R1          ; extract src bits
    0B32 0241   ANDI R1, >003F
    0B34 003F
    0B36 0101   EVAD R1              ; calculate src address
    0B38 1601   JNE  >0B3C           ; if Ts = 3, autoincrement src ptr
    0B3A A686   A    R6, *R10 

    This uses the EVAD instruction, which is discussed in data sheet section 7.3.3.5. This instruction takes a 6 bit operand field and calculates the actual address of the operand. If the modifier bits signify *Rn+ the EQ bit is set (for a source operand) and a pointer to Rn is loaded in R10. Because we are dealing with 32 bit operands the register is auto-incremented by 4 bytes.

     

    It proceeds with preparing the destination operand:

    0B3C C008   MOV  R8, R0          ; save source address during 2nd EVAD
    0B3E 0242   ANDI R2, >0FC0       ; extract dst bits
    0B40 0FC0
    0B42 0102   EVAD R2
    0B44 1B04   JH   >0B4E           ; if Td = 3, autoincrement dst ptr
    0B46 C145   MOV  R5, R5          ; for MM, increment is 8
    0B48 1301   JEQ  >0B4C
    0B4A 0A16   SLA  R6, 1
    0B4C A646   A    R6, *R9 

    This code is pretty much the same. For auto-increment in the destination field the L> status flag (ST0) is set and the associated pointer is in R9. Because MM has a 64 bit result, the auto-increment is upped to 8 bytes. I'm not sure why the code uses two calls on EVAD, as this instruction can do both src and dst at the same time. If anybody sees a good reason for this, please post your observations.
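    The two masks used before the EVAD calls split the second opcode word into the usual source and destination fields. A Python sketch of that split, with the auto-increment sizes added; the function names are made up for the illustration, and the field layout is the standard 9900 two-operand layout (T field above a 4 bit register number).

```python
def split_operand_word(word):
    """Split the second opcode word into its Ts/S and Td/D fields."""
    src = word & 0x003F                # ANDI R1,>003F
    dst = (word & 0x0FC0) >> 6         # ANDI R2,>0FC0, shown right-aligned
    return {"Ts": src >> 4, "S": src & 0xF, "Td": dst >> 4, "D": dst & 0xF}

def autoincrement_bytes(is_mm):
    """Source steps by 4 bytes; destination by 8 for MM, 4 for CR."""
    return 4, (8 if is_mm else 4)

fields = split_operand_word(0x0C71)    # *R1+ as both source and destination
print(fields, autoincrement_bytes(True), autoincrement_bytes(False))
```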

     

    With the operand access prepared the code moves on to actually fetch the operands:

    0B4E C085   MOV  R5, R2          ; move opcode to R2
    0B50 C200   MOV  R0, R8          ; restore source address
    0B52 C038   MOV  *R8+, R0        ; fetch S to R0,R1 and D to R4,R5
    0B54 C058   MOV  *R8, R1
    0B56 C117   MOV  *R7, R4
    0B58 C167   MOV  @2(R7), R5      
    0B5A 0002 

    The source operand is fetched into R0,R1 and the destination into R4,R5. As we will see later, these registers are chosen for good reason. It does mean that we have to move the (first word of the) opcode out of the way to R2.

     

    The choice to move R0 back to R8 is significant. Indirect addressing via R7/R8 generates special bus status codes. When using indirect addressing with R8/R7, the CPU generates WS and DOP/SOP bus status codes. If Td/Ts was zero during EVAD a WS cycle is used and if Td/Ts was not zero a DOP/SOP cycle is used (see section 7.3.3.4). This way, external hardware cannot tell whether an instruction is a macro instruction or implemented in microcode.

     

    The data sheet is a bit vague, but there seems to be a mechanism that makes this work when two EVAD instructions are used as well.

     

    Finally we get to execution:

    0B5C 0706   SETO R6              ; set the MM / CR flag
    0B5E C082   MOV  R2, R2          ; was opcode CR?
    0B60 1604   JNE  >0B6A
    0B62 0224   AI   R4, >8000       ; change sign of D
    0B64 8000
    0B66 0460   B    @>0830          ; perform CR = S+(-D) without store
    0B68 0830
    0B6A 0460   B    @>0900          ; perform MM
    0B6C 0900 

    The execution path of CR is partly shared with AR and that of MM with MR. Hence a flag (R6) is set to keep track of which paths to follow.

     

    CR is evaluated by calculating S+(-D), and suppressing storage of the result - just the status bits are set.


  22. So did I understand correctly that the 990/12 implemented these floating point instructions natively in hardware?

     

    The answer depends on what you mean by native.

     

    Yes: the 990/12 had microcode for all these instructions, and also for the double precision variants (AD, SD, MD, etc.).

     

    No: the 990/12 did not have specialized data paths to support floating point, and the microcode calculated the results using the normal 16 bit data path. Simply put, the microcode did the same operations as the macro code on a 99110. Of course, not having to fetch opcodes etc. it runs faster in microcode.

     

    One could say this is "low end native". An example of high end native would be an FPU co-processor as existed for the PDP11:

    http://www.psych.usyd.edu.au/pdp-11/11_34_fpp.html

    As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggests that TI was thinking about it at the time. Had the series been more successful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231:

    http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf
