pnr

Everything posted by pnr

  1. That by itself would not be too hard, I think. A real 9902 (typically) runs on an internal clock of 1 MHz (a 3 MHz clock internally divided by 3). An FPGA version should be able to go much faster than that (say >50x faster). One could use one of the spare bits in the control register to enable such a turbo mode. Keeping up would be hard: even in a tightly coded unrolled loop I'd be surprised if a 9900 could move more than 100 kbps into or out of the 9902, say 50 kbps for full duplex, and that would then consume 100% of the CPU... To go much faster you will need an FPGA-based turbo CPU as well.
  2. The other day I came across one of Grant Searle's projects that I had not noticed before: his "multicomp" project. It is a simple, low cost FPGA project, where he has more or less created a "software breadboard" for prototyping various retro computers. Googling for it shows various successful builds, including those with ready made PCBs, such as: https://www.retrobrewcomputers.org/forum/index.php?t=msg&th=111&start=0& (but there are several more such efforts). Maybe it is a nice idea to extend this to the 99xx world and create a few components to make it easy to prototype various small 99xx systems. The main thing missing in Grant's setup is the CRU bus and VHDL for CRU based chips. Probably the 9900 CPU core that 'speccery' did for last year's retro challenge can be reworked into a component for such a project without too much effort. Other chips to be added would be the 9902 and the 9901. As I have a need for a 9902 in VHDL for another project, I thought I might as well attempt to write one. Looking at the datasheet (figure 2 on page 3) it is not all that complex. There seem to be:
     - 6 plain registers
     - 2 shift registers
     - 3 counters
     - 3 controllers / FSMs
     The logic for the controllers is documented in flow charts on pages 16, 18 and 19. I'm guessing it will be about 1,000 lines of VHDL? I'll post my public domain source as I go along. Peer review certainly welcome. PS: one guy did an interesting VHDL project around recreating the 9902.
  3. Maybe this is common knowledge, but just to be sure: there is an open source program to drive the TL866 from Linux/macOS. It's on github here: https://github.com/vdudouyt/minipro
  4. I did a lot more experiments with a logic analyser hooked up. Things appear to be more complex: taking the READY signal low does not entirely stop the microcode from running. On the first clock the current bus cycle is extended, but on the second the CPU progresses to the next machine state / bus cycle anyway and only then starts the reset/interrupt 0 sequence. For nearly all instructions this does not matter much, as nothing irreversible happens during that extra machine state. There is one exception: in that extra machine state the LWP instruction overwrites the WP register. If reading the new WP from a register fails (because the register workspace is located on a faulting page), the instruction cannot be restarted without first restoring the WP in some way. So now there are two troublesome instructions to consider, BLWP and LWP. Part of the solution could be to save the previous WP just like the previous PC is saved. Luckily, every change to WP is echoed on the address bus in a special "WS update" bus cycle, so that isn't too hard to do. It does drive up complexity, though.
  5. Well, I went ahead and set up another experiment on Stuart's 99110 PCB. Let's see if the 99000 can do Z8000 and 16032 style "reset-aborts"... This time I added circuitry to generate a three clock reset signal upon a (simulated) page fault, keeping ready low at the same time. This seems to work, at least for what I have tested so far: it will abort the instruction halfway through, saving the proper PC and WP. A promising result!

     The circuit is in a 22V10 GAL and the relevant logic formulas are:

         fault   = !/mem * a0 * !a1 * a2 * a3 * !nclk;
         /resout = !(fault * q2 + resin);
         q0 := !fault;
         q1 := q0;
         q2 := q1;

     Note: using a "=" sign means creating a combinatorial output and using a ":=" sign means creating a registered output (clocked on the rising edge of the pin 1 input, which is "nclk"). "/resin" is the button reset via a pair of schmitt-triggers (1/3rd of a 74ls14) and "nclk" is the inverted clkout. "/resout" goes to both the "/reset" and "ready" pins of the 99105. This test circuit makes address range >B000 .. >C000 illegal to simulate a page fault.

     With this circuit, the instruction sequence

         LWPI >A000
         LI   R1,>B000
         MOV  *R1+,*R1+

     will get aborted in the third instruction, with R1 increased to >B002. The microcode first reads R1 (storing its value internally), increases it by two, stores the new R1 and then proceeds to fetch the source operand (see table 18 in the data book). Had the instruction run to completion (i.e. reset behaves like nmi), R1 would have increased to >B004.

     For everything I tested so far, it would suffice to have:
     - the reset/abort logic as per above
     - a register holding the last correct IAQ address. If the IAQ itself causes the fault, the register is not updated. This is so because the last state of the previous instruction (the final write in many cases) did not take place and hence the previous instruction must be run again after the faulty prefetch is brought into memory.
     - a 4 bit register counting the number of memory accesses since the last correct IAQ. This is needed so that the roll back routine knows how far the instruction got before the page fault took place.
     I think the above might fit in two 74ls374 chips and a 22V10 GAL.

     I'm lucky with how RTWP works. It turns out that the new WP is the last thing fetched (see page 85 of the data book), and the old WP is saved for every fault up to & including the fetch of the new WP: it can always be restarted with the old WP.

     The problem appears to be with BLWP (see page 82). The first thing it does is fetch the new WP from the transfer vector. If this fetch faults, it would seem that the value read from the aborted fetch is stored as the old WP value in the reset workspace. Maybe this is my test setup having issues (this outcome is a bit strange after all), but if correct it means that such a fault will immediately lose the old WP value and hence the instruction becomes impossible to restart. One solution could be to make a page fault during a BLWP a fatal error. My Unix C compiler does not generate BLWP instructions, so it would be an uncommon thing in that context. Maybe there is a way to handle it that currently escapes me: more work to do!
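     For readers who want to play with the GAL equations above without a programmer at hand, here is a rough Python model. It is only a sketch under stated assumptions: the function names are made up for illustration, the `!nclk` gating term is ignored, "resin" is assumed inactive, and the fault term is assumed to stay asserted for as long as ready is held low.

```python
def fault_decode(addr, mem_cycle=True):
    """Address decode part of the 'fault' equation: a0 * !a1 * a2 * a3.

    TI numbers address bits from the MSB (a0), so a0..a3 is the top nibble
    and the pattern 1011 selects addresses whose top hex digit is >B.
    The !nclk term of the real equation is deliberately left out here.
    """
    top_nibble = (addr >> 12) & 0xF
    return mem_cycle and top_nibble == 0b1011


def reset_pulse_clocks(max_clocks=10):
    """Count clock periods during which (fault * q2) is true.

    Models q0 := !fault; q1 := q0; q2 := q1 with the fault term held
    asserted (assumption) and all three registers starting at 1 (no
    earlier fault).
    """
    q0 = q1 = q2 = True
    active = 0
    for _ in range(max_clocks):
        fault = True
        if fault and q2:          # reset output is active this period
            active += 1
        # registered outputs update together on the clock edge
        q0, q1, q2 = (not fault), q0, q1
    return active


if __name__ == "__main__":
    print(fault_decode(0xB002), fault_decode(0xA000))
    print(reset_pulse_clocks())
```

     Under these assumptions the shift register stretches the reset to three clock periods, which is what the three flip-flops were added for.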
  6. And here are my notes on another source of ideas (good and bad): the Z8000 series. The Z8000 appeared early in 1979, in between the 8086 and the 68000. It was not a direct successor to the Z80, but shared its philosophy. It came in two versions, the Z8001 and the Z8002. The Z8002 was very similar to the 99xxx but with conventional registers and without the macrostore concept. The Z8001 added segmentation to the Z8002: each address was extended by a 7 bit segment number. Like the 68000, the Z8000 started without an established software ecosystem and Zilog ported V7 Unix instead, called “Zeus”. Zeus included a version of RM/Cobol to make it attractive as a business machine. Microsoft also ported Xenix to the Z8000. Nonetheless the chip failed to be a market success. Segmentation on the Z8000 was somewhat similar to the page “0” and “1” on a 99xxx, but then with 128 segments instead of 2 pages. Each segment was up to 64KB in size. The program counter had a dedicated segment register and data accesses used two adjacent registers or memory words to hold an address and a segment number. The 7 bit segment and the 16 bit offset were distinct and could not easily be used as a 23 bit “flat” address. Next to these explicit segments, the Z8000 could also use functional segmentation (instructions/data/stack and user/supervisor). There was a companion Z8010 MMU chip that mapped up to 64 segments to real addresses; two of these could be used in parallel. The Z8001 and Z8002 had an input signal to report memory faults. This input effectively was a non-maskable interrupt that could abort a faulting program. However, the faulting instruction itself ran to completion and could have irrevocably changed register values. Zilog realised early on that this was a mistake and announced the Z8003 and Z8004 that could abort instructions halfway through, along with a paging MMU, the Z8015. I did not find any 1981/82 designs that used the 03/04, maybe they were released later. 
The abort mechanism on a Z8003/4 is intriguing: the abort signal seems to be a form of reset. When a memory access causes a fault the abort signal must be held active for 5 clocks simultaneously with the ‘wait’ signal active as well. Then a non-maskable interrupt must be asserted and the abort and wait signals released (note that on a Z8000 the reset signal must also be held active for 5 clocks to be recognised). What seems to happen is the following: - asserting ‘wait’ stops the bus transaction from completing - asserting ‘abort’ resets the microcode sequencer to a state where it recognises a non-maskable interrupt at the end of the current bus cycle (instead of the end of the current instruction) - asserting the non-maskable interrupt causes the state of the processor (PC, status) to be saved and a recovery routine entered. The Z8015 MMU latches the PC into a register on every instruction fetch, and counts the number of bus cycles since the last instruction fetch. After a fault, this information is frozen. Using these registers, a relatively simple routine can revert any changes the aborted instruction made so that it can be restarted later. All the details are in the 1983 data book (section 7, 9 and Appendix D): http://bitsavers.trailing-edge.com/components/zilog/z8000/Z8000_CPU_Technical_Manual_Jan83.pdf I'm not sure the Z8015 ever made it into production -- maybe it never got beyond engineering samples, like the later Z80,000 CPU. Note that on the NS16032 abort and reset are actually the same pin, which suggests a close link also on that processor. It also almost makes you wonder if the Z8003/04 were really different from the Z8001/02 or whether it was just marketing, a bit like the 99105 has turned out not to be unique silicon. Maybe the equivalent approach will also work on a 99xxx. On a 99xxx the reset signal is actually also a non-maskable interrupt, and it too will abort an instruction at the end of a bus cycle. 
     However, it takes 3 clocks to be recognised and that could equate to 3 bus cycles. The following might work:
     - identify an abort condition before the falling edge of CLKOUT;
     - simultaneously assert ‘reset’ and de-assert ‘ready’, and wait for 3 clocks;
     - release ‘reset’ and re-assert ‘ready’.
     According to the datasheet, the 99xxx CPU will now finish the current bus cycle and proceed to save the processor state in the reset workspace R13-R15. It will take some experimentation to find out if further hardware support is needed to be able to revert any changes the aborted instruction may have made.
  7. It would seem that John Walker of AutoCAD and Marinchip fame agrees with you, James! This is what he wrote in May 1982 about the 8086: https://www.fourmilab.ch/autofile/www/section2_10_8.html By the way, writing in September 1981 he also has a view on other processors: https://www.fourmilab.ch/autofile/www/chapter2_110.html
  8. Below my notes on the 8086 MMU approach and how it could relate to a 99xxx. The 8086 first appeared late in 1978 and essentially extended the 8080/8085 to 16 bits. It was not object code compatible, but 8080 assembler source code could be automatically converted into working 8086 source code using a conversion program that Intel provided. This proved a tremendous advantage as the existing CP/M code base could easily be ported to the 8086. Intel’s investment in helping Gary Kildall to develop PL/M and CP/M in the 1973-1975 era really paid off here. The 8086 seems to have used all the chip area that technology would allow in 1978 to add an on-board MMU to the CPU chip and did not add any mini-computer features like a supervisor mode or support to deal with page faults. The MMU is simple, but effective: - The MMU implements a segmentation scheme. Segmentation is along functional lines: instruction space (“code space”) is separated from data space as had been done on mini computers in the 70’s. Data space was optionally separated in a normal, a stack and an 'extra' data space. - Each segment had a segment register (CS, DS, SS and ES respectively) which was 16 bits long. These 16 bits were added - offset by 4 bits - to a normal 16 bit address to create a 20 bit physical address. - In the typical case, instructions were fetched using the CS segment register, data was stored/fetched using the DS segment register and stack operations used the SS register. ES was used with string instructions. However, using a prefix instruction a non-default segment register could be chosen. - There were no facilities to limit a segment to less than 64KB in length and hence also no facilities to abort an illegal memory access. In this sense, it was no different from the earlier 8-bit CPU generation. The 99xxx could use a 8086-like MMU with a little external hardware. 
Four 74LS170 chips can implement the four segment registers, and four 74LS283 fast adder chips can be used to do the addition of the segment to the base address. The segment registers could be loaded using parallel CRU I/O. The four segments could have been (i) instructions, (ii) workspace, (iii) data and (iv) extra. The first three derive directly from bus status codes, the fourth could have been selected using a prefix instruction (like LDD/LDS on a TI990). The prefix instructions and the instructions to load the segment registers could all easily be implemented in macro code. I guess this all could have fitted in a single 48 pin ULA chip, which would have made a nice 8086-style MMU for the 99xxx. In a way, this would have been vaguely similar to the setup in a TI99/8. The key to making it work would have been in implementing an adder with full carry look-ahead so that it could be fast. Because implementing the four segment registers does not take much space, this would have been possible I think. If done as full-custom silicon, such an MMU could have added a small amount of ROM with the matching supporting macro code. I wonder how successful such an add-on for the 99xxx would have been.
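     The 8086 address arithmetic described above is easy to try out. The following is a minimal sketch in Python; the function and table names are my own for illustration and it only models the 8086 side, not the hypothetical 99xxx add-on.

```python
# Default segment register per access type, as described for the 8086.
# The dictionary keys are illustrative labels, not Intel terminology.
DEFAULT_SEGMENT = {
    "code": "CS",    # instruction fetches
    "data": "DS",    # ordinary loads and stores
    "stack": "SS",   # stack operations
    "string": "ES",  # string instruction destination
}


def physical_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address.

    The segment is added shifted left by 4 bits; the result wraps at 1 MB
    because the 8086 has only 20 address lines.
    """
    return ((segment << 4) + offset) & 0xFFFFF


def translate(seg_regs, access, offset, override=None):
    """Pick the segment register (default or prefix override) and translate."""
    reg = override if override is not None else DEFAULT_SEGMENT[access]
    return physical_address(seg_regs[reg], offset)


if __name__ == "__main__":
    regs = {"CS": 0x1000, "DS": 0x2000, "SS": 0x3000, "ES": 0x4000}
    print(hex(translate(regs, "code", 0x0100)))
    print(hex(translate(regs, "data", 0x0100, override="ES")))
```

     Note that there is no limit check anywhere: every segment is a full 64KB window, which is exactly why an illegal access cannot be detected.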
  9. I’ve found an archived copy of that Elektor supplement from March 1981 that I devoured BITD. It can be found here (in Dutch): https://archive.org/details/Elektuur20919813Gen It is funny to see these old CPUs referred to as the new "super chips". I think it appeared with the April 1981 issue of the UK edition of Elektor, but I have not been able to find an archived copy of this English version. I assume that Elektor had similar supplements in the other language editions. Does anybody else remember those supplements? Edit: Did find it: https://archive.org/stream/ElektorMagazine/Elektor%5Bnonlinear.ir%5D%201981-04#page/n23/mode/2up In the UK it was not a supplement, but simply pages 23-46 of the April 1981 issue.
  10. Thanks for those insightful comments James! You are right, although the 68000 internally was a 16 bit chip, doing 32 bit operations in two steps, the architecture was 32 bit. With 24 physical address lines it could address 16MB directly, huge for 1981. I'll read up on 68010 a bit more. I agree that a simple paging design, with only functional segmentation (instruction/data, user/supervisor), is probably the way to go. My interest in virtual memory on the 99xxx is more of a "retro challenge" than anything else, although it would enable experimenting with copy-on-write in early Unix. When it comes to compilers I'm focussed on the C compiler from 2.11BSD that I ported to the 9995 a few years ago. It has support for overlays and separate instruction/data spaces built in. I used that compiler to port V6 Unix to the mini Cortex and this compiler now runs natively on 99xx hardware.
  11. When considering MMU designs, perhaps it is good to look at the competitive field from an 1981 perspective. In 1981, arguably, the cottage industry around microcomputers separated in a “business” segment (Osborne, Kaypro, IBM PC, etc.) and a “home” segment. Early in 1981 the hobby magazine “Elektor” had a special supplement about 16 bit chips that I devoured. In the end I settled on a TI99/4A as the basis for my 16 bit endeavours. At that time there were four 16 bit processors on the market: - the 68000 - the 8086 - the Z8000 - the 99xxx / 99xx These 4 processors were all remarkably similar: they came in DIP packaging, had a 16 bit data path, ALU and databus and roughly similar performance. Also the bus interface logic was quite similar across these chips. Potentially, the list should also include the National Semiconductor 16032 (later renamed 32016). However, this chip had a 32 bit data path and ALU internally. Also, the chip was initially very buggy and usable silicon did not appear until about 1983, after 14 revisions of the design (revision letter “N”!) I’d like to look at these chips from three perspectives: supervisor mode capability, segmentation vs. paging, and the handling of memory faults. The 99xxx looks to be interesting from all three perspectives. - When it comes to supervisor mode capability, three of the four offer this: only the 8086 lacks this capability. The TI990 mini’s and the 99xxx offer this capability, but the 9900 and 9995 do not. It is hard to add with external hardware, because interrupts and system calls (XOP’s) must switch back to supervisor mode and the 9900 and 9995 do not offer (easy) signals to recognise this externally. - All these designs initially chose segmentation to manage memory and only later reworked it into paging designs. This is interesting because the mini computer world had already decided in the late 70’s that paging was the way to go. I’m not sure why the microprocessor world initially chose segmentation. 
In the case of the 68000 the address space was linear, but its first MMU chip (the 68451) was designed around a segmentation scheme. The 8086 and the Z8000 series CPU’s were designed with native segmentation. The 99xxx could go either way: the TI990/10A used the chip with a segmenting MMU, but a paging scheme around a 74LS612 mapper was equally supported. - The 68000 and the Z8000 were designed with hardware support for recovering from memory faults, but in both cases it did not work due to design errors. The 68000 had to be redesigned into the 68010 and the Z8001/2 into the Z8003/4 to get this fixed. So, from a 1981 perspective, none of the chips had working support for demand segmentation or demand paging. The 8086 does not claim to offer support for this; it would not be supported until the 80286. The 99xxx datasheet is silent on the topic, but my hunch is that the 99xxx does support demand paging with minimal external hardware.
  12. Thanks for that link and that is indeed pricey! The TM990 series are development boards: http://www.stuartconner.me.uk/tm990/tm990.htm Although they can be rack mounted, they are very distinct from the TI990 series mini computers. In a way, they are more reminiscent of PEB boards. The board that was sold on eBay is a trainer board with a 9981 CPU and a calculator style user interface. (There are some pictures of a TI990 here: http://www.computinghistory.org.uk/det/11554/Texas-Instruments-TI-990-Computer-System/)
  13. I've built a little modification on Stuart's 99110 board to test the copying of status bits ST7 to ST11 to external flip-flops. I'm decoding bus status "ST" (binary 1101) and then clocking the address bus bits 7-11 into external flip-flops on the rising edge of CLKOUT. This appears to work fine. Copying status bits to external flip-flops makes it possible:
      (i) to make the non-privileged/privileged status (ST7) available to external hardware
      (ii) to use the status map select bit (ST8) directly in external hardware and to use the PSEL signal to signify "use another map" to the MMU (as the TI990/10A mini does)
      I've also found that the unassigned status bit (ST9) is present as a real register bit on 99xxx silicon: it can be set and reset. Like ST7 and ST8, the ST9 bit is reset whenever a reset, interrupt or XOP occurs. On the TI990/12 mini this bit enables error checking by the MMU. Maybe in a new design other creative uses are possible. I'm finding the 99xxx an ever more intriguing design!
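      Because TI numbers bits from the most significant end, it is easy to pick the wrong lines when wiring up such a latch. As a sanity check, here is a small Python sketch of what the flip-flops capture. The function names are hypothetical, and it assumes the status register value appears on the address bus with status bit n on address bit n during the "ST" bus cycle.

```python
def ti_bit(word, n):
    """Return TI-numbered bit n of a 16-bit word (bit 0 is the MSB)."""
    return (word >> (15 - n)) & 1


def latch_st7_to_st11(address_bus):
    """Model the external flip-flops capturing ST7..ST11 from the bus."""
    return {f"ST{n}": ti_bit(address_bus, n) for n in range(7, 12)}


if __name__ == "__main__":
    # >0100 has only TI bit 7 set, i.e. the privileged-mode bit ST7
    print(latch_st7_to_st11(0x0100))
```

      In conventional (LSB = 0) numbering, ST7 is therefore bit 8 and ST11 is bit 4 of the word.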
  14. Last up is an analysis of AR, SR, and CR. All three share most of their code. Although addition is perhaps conceptually the easiest operation, the code is surprisingly long and involved, as there are many cases to consider. As a result, floating point addition is not much faster than multiplication or division.

      The main issue is that the mantissas of two floating point numbers can only be added together if their exponents are equal. If they are not equal, the smaller number must be denormalized to make the exponents equal:

          0.1234 x 16^4 + 0.12 x 16^2
            = 0.1234 x 16^4 + 0.0012 x 16^4
            = 0.1246 x 16^4

      If the difference between the exponents is more than 6, the smaller number becomes insignificant and effectively equals zero.

      The entry code for SR is as follows:

          ; entry point for SR
          ;
          0814 C138  MOV  *R8+, R4    ; fetch 1st word of S
          0816 136D  JEQ  >08F2       ; if S is zero, nothing to do
          0818 0224  AI   R4, >8000   ; flip sign bit
          081A 8000
          081C 1002  JMP  >0822       ; now handle as AR

      It checks the operand for being zero, and if so the accumulator already has the right result. If not zero, it flips the sign bit and handles FPAC-S as FPAC+(-S).

      Next is the entry code for AR:

          ; entry point for AR
          ;
          081E C138  MOV  *R8+, R4    ; fetch 1st word of S
          0820 1368  JEQ  >08F2       ; if S is zero, nothing to do

      It only checks whether the operand is zero, in which case FPAC already contains the result. From here on, AR and SR have an identical code path.

          0822 C000  MOV  R0, R0      ; if FPAC is zero, S is the result
          0824 1603  JNE  >082C
          0826 C004  MOV  R4, R0      ; move S to local FPAC
          0828 C058  MOV  *R8, R1
          082A 1063  JMP  >08F2       ; store FPAC & set status bits
          ;
          082C 04C6  CLR  R6          ; clear flag (= store result)
          082E C158  MOV  *R8, R5     ; fetch 2nd word of S

      The code first checks for another special case: if the accumulator is zero, the result is equal to the operand. If not, it enters the full calculation.
      It clears the CR flag (R6): at >0830 the code path for CR merges in (see entry code for CR discussed earlier), and the CR code path will separate towards the end of the algorithm. Note that the CR code path does not have checks for either the accumulator or the operand being zero. Effectively a zero here is handled as meaning "+0.0 x 16^-64" and this will not lead to issues in the CR code path.

          ; CR jumps here (with R6 all ones = set status flags only)
          ;
          0830 04C2  CLR  R2          ; clear extra mantissa bits
          0832 C0C0  MOV  R0, R3      ; save exponents
          0834 C1C4  MOV  R4, R7
          0836 7000  SB   R0, R0      ; remove exponents from mantissas
          0838 7104  SB   R4, R4

      As usual the code starts out separating the sign and exponent from the mantissa. R2 is prepared to hold an extra 'guard' digit of precision. Next the sign bit and exponent are separated for the accumulator:

          083A 0A13  SLA  R3, 1       ; is FPAC negative?
          083C 1702  JNC  >0842
          083E 06A0  BL   @>0AE6      ; yes: negate extended FPAC mantissa
          0840 0AE6
          0842 0993  SRL  R3, 9       ; FPAC exponent in R3

      If FPAC is negative, the mantissa is negated. There is a subroutine for this, as the negation has to happen again when the result is converted back to standard IBM360 format. The subroutine is:

          ; subroutine to negate extended FPAC mantissa
          ;
          0AE6 0540  INV  R0
          0AE8 0541  INV  R1
          0AEA 0502  NEG  R2
          0AEC 1703  JNC  >0AF4
          0AEE 0581  INC  R1
          0AF0 1701  JNC  >0AF4
          0AF2 0580  INC  R0
          0AF4 045B  RT

      Including R2 in the negation is superfluous at this point. Also note that with the sign/exponent removed, the mantissa has two extra hex digits on the left, and hence does not need to consider the >800000 overflow condition when negating. Next, the sign and exponent of the operand are separated:

          0844 0A17  SLA  R7, 1       ; is S negative?
          0846 1704  JNC  >0850
          0848 0544  INV  R4          ; yes: negate S mantissa
          084A 0505  NEG  R5
          084C 1701  JNC  >0850
          084E 0584  INC  R4
          0850 0997  SRL  R7, 9       ; S exponent in R7

      Here too the mantissa is negated if the operand is negative, but this time it happens in line because it does not need to be reversed later. With the mantissas prepared and including the sign bit, the code considers the exponents and the relative size of the accumulator and the operand:

          0852 C247  MOV  R7, R9      ; compare exponents
          0854 6243  S    R3, R9
          0856 1319  JEQ  >088A       ; if equal, directly add the mantissas
          0858 0289  CI   R9, >0006   ; S much larger than FPAC?
          085A 0006
          085C 110F  JLT  >087C
          085E C0C7  MOV  R7, R3      ; if FPAC is insignificant, result is S
          0860 04C0  CLR  R0
          0862 04C1  CLR  R1
          0864 1012  JMP  >088A
          ...
          087C 0509  NEG  R9          ; FPAC much larger than S?
          087E 0289  CI   R9, >0006
          0880 0006
          0882 11F6  JLT  >0870
          0884 C1C3  MOV  R3, R7      ; if S is insignificant, result is FPAC
          0886 04C4  CLR  R4
          0888 04C5  CLR  R5

      The code first handles the three easy cases: exponents equal, FPAC dominates and S dominates. If the exponents are equal, there is nothing to do. If S is more than 6 hex digits larger than FPAC, FPAC is effectively zero and the exponent of S becomes the exponent of the result. If FPAC is more than 6 hex digits larger than S, S is effectively zero and the exponent of FPAC becomes the exponent of the result.

      The complex case is handled by a clever loop that shifts either FPAC or S into place. The loop code is entered in the middle (>0870):

          ; denormalize & align smallest mantissa
          ;
          0866 C085  MOV  R5, R2      ; shift S one nibble right
          0868 0AC2  SLA  R2, 12
          086A 001C  SRAM R4, 4
          086C 4104
          086E 0587  INC  R7          ; and adjust exponent
          0870 81C3  C    R3, R7      ; exponents equal?
          0872 130B  JEQ  >088A       ; yes: add mantissas
          0874 15F8  JGT  >0866       ; exp FPAC > exp S?
          0876 06A0  BL   @>0AD4      ; no: shift FPAC one nibble right
          0878 0AD4                   ; and adjust exponent
          087A 10FA  JMP  >0870

      The loop compares the exponents and if they have become equal (which they must within 6 shifts), the work is done and we proceed with the actual addition at >088A. If S is the smallest, the loop runs from >0866 to >0874 and shifts the operand in place (keeping one guard digit in R2). If FPAC is the smallest, the loop runs from >0870 to >087A and shifts the accumulator in place (again keeping one guard digit in R2). The accumulator shift is also used again later in the algorithm and hence is in a subroutine:

          ; subroutine to (de)normalize FPAC mantissa
          ; to the right one hex digit (nibble)
          ;
          0AD4 C081  MOV  R1, R2      ; shift extended mantissa one nibble
          0AD6 0AC2  SLA  R2, 12
          0AD8 001C  SRAM R0, 4
          0ADA 4100
          0ADC 0583  INC  R3          ; adjust exponent
          0ADE 24E0  CZC  @>0BD6, R3  ; exponent in range?
          0AE0 0BD6
          0AE2 139F  JEQ  >0A22       ; no: overflow
          0AE4 045B  RT

      At this point in time, the range check on the exponent is superfluous, as the exponent must be in range (because S is in range).

      With both numbers properly aligned, we can do the actual addition. At this point, the code for CR takes its own path again:

          088A C186  MOV  R6, R6      ; was opcode CR, or AR/SR?
          088C 1307  JEQ  >089C
          088E 002A  AM   R4, R0      ; CR: add mantissas & return status bits
          0890 4004
          0892 02CA  STST R10
          0894 024A  ANDI R10, >E000  ; mask out L>, A>, EQ status bits
          0896 E000
          0898 E3CA  SOC  R10, R15
          089A 0380  RTWP             ; macro processing complete

      For the CR instruction, we add the mantissas and only look at the status bits (L>, A> and EQ) and return those to the user routine. No result is stored back to the user accumulator.
      For AR and SR, there is more work to do:

          089C 002A  AM   R4, R0      ; add mantissas
          089E 4004
          08A0 1325  JEQ  >08EC       ; if zero, clear FPAC & finish
          08A2 1504  JGT  >08AC       ; if negative,
          08A4 06A0  BL   @>0AE6      ; negate extended mantissa
          08A6 0AE6
          08A8 0263  ORI  R3, >0080   ; and flip sign bit
          08AA 0080
          08AC D000  MOVB R0, R0      ; if mantissa too large
          08AE 1302  JEQ  >08B4
          08B0 06A0  BL   @>0AD4      ; normalize it rightward one nibble
          08B2 0AD4
          [JMP to >08CC seems missing]

      Again, the two mantissas are added. If the result is zero, FPAC is cleared (the normalized version of zero) and the status bits are set accordingly. If the result is negative, the result mantissa is negated back to positive (note this time negating the guard digit as well is not superfluous) and the sign bit is set accordingly. If the result has one more hex digit (i.e. something like 0.800000 + 0.A00000 = 1.200000), the mantissa is normalized one hex digit to the right (note that in this case, the range check is not superfluous).

      As the mantissa must now be in normalized form, the code could proceed to merging in the sign/exponent byte. However, it drops into the code for another check. It is possible in addition that several hex digits cancel out, and that there are a lot of leading zeroes in the result mantissa. An example would be:

          0.123456 - 0.123400 = 0.000056

      This must be normalized to 0.56x16^-4. In this case no precision is lost. However, with a denormalized number one guard digit is required:

          0.100001 - 0.123400x16^-5 = 0.100001 - 0.000001(2) = 0.0FFFFF(E)

      This must be normalized to 0.FFFFFEx16^-1 and in this case we need the guard digit shifted in. If I'm not mistaken only one guard digit can possibly shift in, and hence that is all we have in R2. In code, this leads to the following:

          ; normalize FPAC mantissa (leftward)
          ;
          08B4 0280  CI   R0, >000F   ; is the highest nibble 0?
          08B6 000F
          08B8 1509  JGT  >08CC       ; no: mantissa is normalized
          08BA 24E0  CZC  @>0BD6, R3  ; exponent already 0?
          08BC 0BD6
          08BE 1378  JEQ  >09B0       ; yes: underflow
          08C0 0603  DEC  R3          ; reduce exponent & shift mantissa one nibble
          08C2 001D  SLAM R0,4
          08C4 4100
          08C6 09C2  SRL  R2, 12      ; shift in guard digit
          08C8 A042  A    R2, R1
          08CA 10F4  JMP  >08B4
          ;
          08CC 06C3  SWPB R3          ; merge exponent back in
          08CE D003  MOVB R3, R0
          08D0 1071  JMP  >09B4       ; store FPAC & set status bits

      This code was discussed before in the post on multiplication: this tail is shared between AR, SR and MR.

      I wonder if the AR/SR code is the shortest possible. It would seem that the checks for zero accumulator and operand are for performance only, as the rest of the algorithm would seem to work for AR/SR just as it does for CR. Also, maybe it is faster to operate on mantissas shifted one hex digit to the left; this still leaves one "overflow digit" to the left, but makes room to include the guard digit on the right. Finally, the range check could be taken out of the "shift right" subroutine and moved to immediately after the second subroutine call.

      That completes our tour of the 99110 macrorom: there is no other code left to discuss.
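      To check the worked examples in this post, the align / add / normalize steps can be modelled in a few lines of Python. This is a sketch of the general technique, not a transcription of the macrorom: the function names are invented, mantissas are plain signed integers of 6 hex digits, alignment truncates the magnitude (an assumption chosen to match the examples, whereas the 99110 code shifts two's complement values), there is no "insignificant operand" shortcut and no overflow/underflow check.

```python
GUARD_BITS = 4            # one extra hex digit (nibble) of precision
DIGITS = 6                # hex digits in an IBM360 single precision mantissa


def shift_right_truncate(value, nibbles):
    """Shift a signed value right by whole nibbles, truncating toward zero."""
    magnitude = abs(value) >> (4 * nibbles)
    return -magnitude if value < 0 else magnitude


def hex_fp_add(m_a, e_a, m_b, e_b):
    """Add two hex floats; each value is m / 16**6 * 16**e.

    Returns a normalized (mantissa, exponent) pair.
    """
    a = m_a << GUARD_BITS                 # make room for the guard digit
    b = m_b << GUARD_BITS
    if e_a < e_b:                         # denormalize the smaller operand
        a = shift_right_truncate(a, e_b - e_a)
        exp = e_b
    else:
        b = shift_right_truncate(b, e_a - e_b)
        exp = e_a
    total = a + b
    if total == 0:
        return 0, 0                       # normalized zero
    sign = -1 if total < 0 else 1
    mag = abs(total)
    top = 1 << (4 * DIGITS + GUARD_BITS)  # first value that needs 8 digits
    if mag >= top:                        # carry out: shift right one nibble
        mag >>= 4
        exp += 1
    while mag < (top >> 4):               # leading zero digit: shift left
        mag <<= 4
        exp -= 1
    return sign * (mag >> GUARD_BITS), exp


if __name__ == "__main__":
    m, e = hex_fp_add(0x100001, 0, -0x123400, -5)
    print(f"0.{m:06X} x 16^{e}")
```

      Dropping the guard digit (setting GUARD_BITS to 0) makes the last example come out as 0.FFFFF0 x 16^-1, which shows why R2 is carried along.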
  15. The analysis of the DR instruction made me wonder about the speed of 99110 floating point operations. I haven't done any detailed cycle counts or run benchmark tests, but some rough scoping gives interesting results. For floating point operations, the 99110 can always run at the full 6 Mhz, as it is not dependent on slow external memory and wait states. I think the average floating point operation in that case takes around 70-80 microseconds. This equates to some 12-15 kFLOPS. This compares well with the FPU chips of the late seventies and early eighties. The three main choices in 1981 were the AMD9511/i8231 from 1978, the AMD9512/i8232 from 1979 and the i8087 from 1980/81. The 99110 is from 1981 as well. http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/ The 9511 needs about 200 clock cycles for a floating point operation, or 100 microseconds when run at 2 MHz (which seems to have been the norm BITD). When run at its 3MHz maximum it is around 70 microseconds. That the numbers are so similar is perhaps not surprising: the 9511 also has a 16 bit data path inside and would be executing similar algorithms. Running a custom designed microcode gives it an advantage in cycles, but the 99110 compensates for this with a high clock speed. The 9512 also needs about 100 microseconds for multiply and divide, but addition/subtraction is sped up to about 50 microseconds. It can also do double precision floating point (i.e. a 64-bit format). This is much slower than single precision: operations take between 500 and 800 microseconds. I think this would be the same for the 99110, if one would code up double precision routines in a fast external macro rom. As the 9512 still has a 16 bit data path (17 bit actually, to deal with the 'hidden 1' bit of the IEEE format used), the similarity is again not surprising. So, at double precision the speed would only be some 2-3 kFLOPS. The real difference comes with the 8087 FPU. 
This chip internally always works with 80 bit floating point numbers. It is also much faster: it has separate ALU's for the exponent and mantissa, with wide data paths for both (15 and 64 bits respectively). Its speed on single precision arithmetic is around 50 kFLOPS and on double precision it is around 30 kFLOPS. However, only limited quantities of this chip were available in 1981 and this is one of the reasons why the original PC had a socket for a 8087 but it was almost never filled. My understanding is that all these chips were expensive. The 9511 and the 9512 were selling for between $50 and $100, and the 8087 well above that. If correct, the 99110 with a volume price of around $100 was good value. On the other hand, most applications back then did not need fast floating point. Of course, the competing 16-bit processors (8086, Z8000 and 68000) could run floating point in software emulation about as fast as a 9512 or 99110 (when run at full clock speed with fast memory, all four were about equally fast). Viewed that way, the 99110 only has a convenience advantage. A last consideration would have been the IBM360 format. Although popular in the 60's and early 70's, it was going out of fashion in the late 70's. The 9512 and the 8087 were much closer to the emerging IEEE floating point standard. For comparison, a high-end IBM360 mainframe in the 1960's would do about 10 MFLOPS. The supercomputer of the 70's, the Cray-1, was rated at 160 MFLOPS (both numbers for single precision arithmetic).
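The rough numbers above are easy to re-derive. The sketch below only restates the arithmetic from the post (microseconds per operation to kFLOPS, and clock cycles to microseconds); the function names are made up for illustration and the inputs are the estimates quoted above, not measurements.

```python
def kflops(microseconds_per_op: float) -> float:
    """Operations per second, in thousands, for a given time per operation."""
    return 1000.0 / microseconds_per_op


def op_time_us(cycles: int, clock_mhz: float) -> float:
    """Time for one operation in microseconds, given a cycle count and a clock in MHz."""
    return cycles / clock_mhz


if __name__ == "__main__":
    # 99110 estimate from the post: 70-80 microseconds per operation.
    print("99110:", round(kflops(80), 1), "to", round(kflops(70), 1), "kFLOPS")
    # 9511 estimate from the post: about 200 cycles per operation.
    print("9511 at 2 MHz:", op_time_us(200, 2.0), "us")
    print("9511 at 3 MHz:", round(op_time_us(200, 3.0), 1), "us")
```

The same helper gives the 2-3 kFLOPS range mentioned for double precision if the 500-800 microsecond figures quoted above are used, but only after widening the rounding a little (kflops(500) is 2.0 and kflops(800) is 1.25).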
  16. Next up is floating point division, the "DR" instruction. It has a clever algorithm, but also a strange bit in its implementation. First let's look at the algorithm it uses. As with multiplication, we have two real numbers N1 and N2. In the IBM360 format these will be expressed as S1 x 0.M1 x 16 ^ E1 and S2 x 0.M2 x 16 ^ E2 The division will be: (S1 x 0.M1 x 16 ^ E1) / (S2 x 0.M2 x 16 ^ E2) which is the same as: (S1/S2) x (0.M1 / 0.M2) x (16 ^ E1 / 16 ^ E2) which is the same as: (S1/S2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2) which is the same as: (S1 x S2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2) The subroutine that multiplies the signs and handles the subtraction of exponents is already there, as discussed above for multiplication. The problem is in dividing the mantissas. What is needed is 32 x 32 bit division and the 99000 only offers 32 x 16 bit division (the DIV instruction). It would be possible to write a routine to do 32 x 32 bit division from basics, but that would be a long and slow routine. Instead, it does something clever and uses the DIV instruction to the max. In short it approximates the result by putting M1 in the 32 bit dividend and then divides by the top 16 bits of M2 (i.e. it truncates the last two hex digits of the divisor to zero). This already gives a result that is accurate to 3 or 4 hex digits. As the divisor is slightly too small, the result is slightly too large. It then subtracts a correction factor from the estimate that makes it accurate to 6 or 7 hex digits. As it turns out, the correction factor is fairly easy to calculate. I'm not a mathematician, but I think the derivation of the correction factor is as follows. It is easiest to think about the problem in base 65536 numbers, i.e. a number system where there are 65536 different digits and "10" means 65536. As I don't have 65536 symbols available, I'll use [xxxx] as notation, where xxxx is 4 hex digits. 
The division of M1 by M2 can then be expressed as dividing the two digit number AB by the two digit number CD giving a two digit result EF: The dividend AB is a mantissa shifted 4 bits left, i.e. the range of AB is [0100][0000] to [0FFF][FFF0]. The divisor CD is a mantissa shifted 8 bits left, i.e. the range of CD is [1000][0000] to [FFFF][FF00]. The result is 0.EF, where the range of EF is [0100][0001] to [FFFF][FF00]. In the below, 100 and 10 are shorthand for [0001][0000][0000] and [0001][0000] respectively. [1] AB / CD = 0.EF [2] AB = 0.EF x CD = EF x 0.CD = (E x C) + (E x D/10) + (F x C/10) + (F x D/100) = C x ( E + (E x D/10)/C + F/10 + (F x D/100)/C ) = C x ( E.F + E x (D/10C) + (F x D)/100C ) [3] AB / C = E.F + E x (D/10C) + (F x D)/100C As C is at least [1000], the value of (F x D)/100C is at most [0000].[000F] and not significant: E x (D/10C) + (F x D)/100C ≈ E x (D/10C) [4] AB / C = E.F + E x (D/10C) AB / C - E x (D/10C) = E.F [5] Now define a first estimate E'F': AB / C = E'F' As E x (D/10C) is at most [000F], the difference between E and E' is at most [000F]. Calculating E x (D/10C) as E' x (D/10C) has an error of at most [0000].[000F] and this error is not significant. [6] Hence, AB/CD can be calculated with sufficient precision using: E'F' = AB / C T = (D / C) x E' EF = E'F' - T/10 Back to the simple terms, the initial estimate is AB / C and the correction factor is (D/C) x E' / 10. With the mathematics and the algorithm out of the way, let's dive into the actual code. We'll see that doing the M1 / M2 division only takes 20 instructions, with no loops. The code starts out pretty much like the code for MR: ; entry point for DR ; 0946 C138 MOV *R8+, R4 ; if div-by-zero, report overflow 0948 1319 JEQ >097C 094A 06A0 BL @>0A4C ; extract and subtract exponents 094C 0A4C 094E 6187 DATA >6187 ; = "S R7, R6" 0950 0040 DATA 64 ; add back excess First there is a check for a zero operand, and an overflow error is reported if it is zero. 
Then the exponent subroutine is called to extract the signs and exponents and to calculate the result sign and exponent. For division, the first data word is "S R7, R6" as the exponents must now be subtracted. The second data word is +64: the exponent subtraction will cancel out the excess-64 part and this needs to be added back. Next there is a range check: 0952 8100 C R0, R4 ; if dividend > divisor, result will be >1 0954 1107 JLT >0964 0956 1502 JGT >095C 0958 8141 C R1, R5 095A 1A04 JL >0964 095C 0586 INC R6 ; increase result exponent and test for 095E 25A0 CZC @>0BD6, R6 ; overflow (mantissa shift happens at >0992->099A) 0960 0BD6 0962 130C JEQ >097C Depending on the values of the accumulator and the operand, the result can be larger than 1 (but no larger than 15 decimal) and because the normalized mantissa must be of the form 0.MMMMMM an additional mantissa shift and exponent update may be necessary. Now the code starts to perform the actual division of M1 by M2. First M1 and M2 are positioned to make the algorithm work: 0964 001D SLAM R0, 4 ; align dividend & divisor for accuracy 0966 4100 0968 001D SLAM R4, 8 ; make sure divisor larger than dividend 096A 4204 The next step is to calculate the estimated result (which will already be accurate to some 3 hex digits): 096C 3C04 DIV R4, R0 ; calculate estimate E'F' = AB / C 096E 04C2 CLR R2 ; (using two steps of long division) 0970 3C44 DIV R4, R1 To get a 32 bit result, the remainder is divided by the divisor again, just as one would do in a manual long division. Note that R4 cannot be zero and that neither division can overflow (a remainder must necessarily be smaller than the divisor) and hence there are no checks for errors. Next comes the calculation of the correction factor: 0972 C245 MOV R5, R9 ; now calculate error term: T = D / C x E' 0974 0949 SRL R9, 4 ; align C with AB (i.e. make D/C < 1) 0976 04CA CLR R10 0978 3E44 DIV R4, R9 ; calc D / C 097A 1903 JNO >0982 ; always jump ... 
0982 3A40 MPY R0, R9 ; calc T = E' x (D / C) 0984 04C8 CLR R8 ; align T/10 with E'F' and place into R8,R9 0986 001D SLAM R8,4 0988 4108 098A 09CA SRL R10, 12 098C A24A A R10, R9 098E 0029 SM R8, R0 ; now subtract error term from estimate 0990 4008 First we make sure that D is smaller than C to prevent overflow (and three digits of accuracy are enough). Then it calculates D/C. As C cannot be zero the division must succeed, just as the earlier two DIV operations; no error checking is necessary. Here we have some strangeness: despite the above, the code checks for overflow and jumps over the overflow exit code. There is no reason that the overflow code has to be located here: it is not necessary to bring jumps into range or something like that. Other than the programmer being confused, I see no reason for this jump in the code. Maybe I'm missing something, if so please post. The code then proceeds to multiply by E' and finish the calculation of T. The range shift of >0974 is undone and by taking the high word T is effectively divided by 10-base-65536. As a last step the correction factor is subtracted from the first estimate, giving a result accurate to 6 or 7 hex digits. After this, only combining the exponent, sign and "EF" into a normalized real number remains: 0992 C200 MOV R0, R8 ; normalize mantissa 0994 09C8 SRL R8, 12 ; one or two nibbles as needed 0996 1302 JEQ >099C 0998 001C SRAM R0, 4 099A 4100 099C 001C SRAM R0, 4 099E 4100 09A0 06C6 SWPB R6 ; merge sign+exponent with mantissa 09A2 D006 MOVB R6, R0 09A4 1007 JMP >09B4 ; compare FPAC against zero & store result First it checks if the result of the mantissa divide was larger than 1: if this is the case, the top digit of EF will be non-zero. It then shifts by one or two hex digits to the right to create the normalized mantissa. No change to the exponent is necessary as one shift is merely compensating for all the clever shifts we did at the start of the code (i.e. 
one shift puts the fixed point in the proper place). For the other shift, it has already made the required adjustment to the exponent at the start, see code at >095C. The last step is to merge in the sign/exponent byte and to jump to the standard exit routine. All in all, TI has used a very clever and fast algorithm for floating point divide.
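To see the estimate-and-correct idea from the DR analysis at work, here is a minimal Python model of the mantissa division. It is a sketch of my reading of the listing, not a transcription of the ROM: the function names are invented, Python integers stand in for the register pairs, and the shifts follow the comments in the code above (dividend shifted 4 bits, divisor 8 bits, D/C computed with 12 fractional bits).

```python
def divide_mantissas(m1: int, m2: int) -> tuple[int, int]:
    """Model of the DR mantissa divide for two normalized 24-bit mantissas.

    Returns (first_estimate, corrected_result), both as the 32-bit value EF.
    """
    ab = m1 << 4                      # dividend AB: mantissa shifted 4 bits left
    cd = m2 << 8                      # divisor CD: mantissa shifted 8 bits left
    c, d = cd >> 16, cd & 0xFFFF      # split the divisor into its two 16-bit "digits"

    estimate = (ab << 16) // c        # E'F' = AB / C (the two DIV steps)
    e_prime = estimate >> 16          # E', the high word of the estimate

    ratio = ((d >> 4) << 16) // c     # D / C, scaled by 2^12 so it stays below 1
    t = (e_prime * ratio) >> 12       # T = (D / C) x E', shift undone again
    return estimate, estimate - t     # EF = E'F' - T/10


def exact_quotient(m1: int, m2: int) -> int:
    """Reference value: the same quotient computed with full precision."""
    return ((m1 << 4) << 32) // (m2 << 8)


if __name__ == "__main__":
    m1, m2 = 0xFFFFFF, 0x1000FF       # worst-ish case: small C, large D
    estimate, corrected = divide_mantissas(m1, m2)
    exact = exact_quotient(m1, m2)
    print(hex(estimate), hex(corrected), hex(exact))
    print("estimate error:", estimate - exact, "corrected error:", corrected - exact)
```

Running it on a divisor with a small C and a large D shows the first estimate coming out too large, as the post says, with the correction term removing almost all of the difference.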
  17. With all the supplementary operations out of the way, time to analyze the arithmetic floating point operations: MR, DR, AR and SR. First up is MR. To understand the code, let's first look at the math involved. Suppose we have two real numbers N1 and N2. In the IBM360 format these will be expressed as S1 x 0.M1 x 16 ^ E1 and S2 x 0.M2 x 16 ^ E2 The product will be: S1 x 0.M1 x 16 ^ E1 x S2 x 0.M2 x 16 ^ E2 which is the same as: (S1 x S2) x (0.M1 x 0.M2) x (16 ^ E1 x 16 ^ E2) which is the same as: (S1 x S2) x (0.M1 x 0.M2) x 16 ^ (E1 + E2) This last formula is what the code calculates. The code begins with: ; entry point for MR ; 08F4 C138 MOV *R8+, R4 ; is multiplier equal to zero? 08F6 1357 JEQ >09A6 ; yes: set FPAC to zero & finish This handles the case where the accumulator is multiplied by zero: the result is zero. Next comes a subroutine that handles the exponents and the sign bits: 08F8 06A0 BL @>0A4C ; separate & add exponents 08FA 0A4C 08FC A187 DATA >A187 ; = "A R7, R6" (for MR add exponents) 08FE FFC0 DATA -64 ; = subtract double excess 64 The subroutine is followed by two data words, which make it usable for both multiplication and division. The function of the data words will become clear when walking through the subroutine code. THE SUBROUTINE ; subroutine for MR and DR: calculate result exponent and sign ; 0A4C C000 MOV R0, R0 ; is FPAC zero? 0A4E 13C4 JEQ >09D8 ; yes: set flags & finish 0A50 C158 MOV *R8, R5 ; fetch 2nd word of operand The subroutine starts with a check for FPAC equalling zero (i.e. the multiplicand or the numerator is zero); in that case the result is zero too. Next, it fetches the second word of the operand which had not been fetched earlier. The code can now rely on both FPAC (R0,R1) and the operand (R4,R5) being in standard normalized format. 
The first thing it does is separating the mantissa from the sign bits and exponents: 0A52 C180 MOV R0, R6 ; save exponents in R6 and R7 0A54 C1C4 MOV R4, R7 0A56 7000 SB R0, R0 ; remove exponents from mantissas 0A58 7104 SB R4, R4 The next thing is multiplying the sign bits: 0A5A C207 MOV R7, R8 ; figure out sign of result in R8 0A5C 2A06 XOR R6, R8 Multiplying two bits is the same as taking their exclusive OR. Note that the top bit in R8 will have the sign of the result, but the other 15 bits are not zero -- the other bits are meaningless to the multiplication. This is followed by placing the excess-64 exponents as proper integers in R6 and R7: 0A5E 06C6 SWPB R6 ; place FPAC exponent in R6 0A60 0246 ANDI R6, >007F 0A62 007F 0A64 06C7 SWPB R7 ; place operand exponent in R7 0A66 0247 ANDI R7, >007F 0A68 007F Now we are ready to add the two exponents together (or subtract them for division). This is where the two data words that followed the subroutine call are used: 0A6A 04BB X *R11+ ; MR: "A R7,R6", DR: "S R7,R6" 0A6C A1BB A *R11+, R6 ; MR: -64, DR: +64 0A6E 0286 CI R6, >007F ; exponent in range? 0A70 007F 0A72 15D7 JGT >0A22 ; jump on overflow 0A74 1B9D JH >09B0 ; jump on underflow First it executes the instruction in the first data word. For MR this is "A R7,R6", which adds the exponents. However, by adding the exponents the excess 64 is now included twice and must be removed once. The next data word contains "-64", which is added to the exponents. The result is that the right excess-64 exponent is now in R6. This is followed by a range check. Here the utility of the excess-64 encoding becomes clear to see. I'm not sure why the jump to overflow at >097C is done via >0A22: the real target is (just) within range. 
The subroutine finishes by merging the result sign bit back into the result exponent: 0A76 0A18 SLA R8, 1 ; put sign bit back in exponent 0A78 1702 JNC >0A7E 0A7A 0226 AI R6, >80 0A7C 0080 0A7E 045B RT END OF SUBROUTINE Now we can go back to the main MR routine at >0900. This happens to be the 32 x 32 -> 64 bit multiplication routine that we already saw as part of the MM instruction: 0900 C085 MOV R5, R2 ; long multiply in four 16x16 bit steps 0902 3881 MPY R1, R2 0904 C205 MOV R5, R8 0906 3A00 MPY R0, R8 0908 C284 MOV R4, R10 090A 3A81 MPY R1, R10 090C 3804 MPY R4, R0 090E 002A AM R10, R8 ; add the partial results 0910 420A 0912 1701 JNC >0916 0914 0580 INC R0 0916 002A AM R8, R1 0918 4048 091A 1701 JNC >091E 091C 0580 INC R0 I'll not discuss it again, simply scroll up to the analysis of the MM instruction for detail on the above code. In essence it multiplies R0,R1 by R4,R5 leaving its result in R0..R3. Next comes the bit of MR code that was skipped in the MM discussion. That code is: 091E D186 MOVB R6, R6 ; is this a MR or MM instruction? 0920 1607 JNE >0930 ; jump if MM 0922 D001 MOVB R1, R0 ; MR: prenormalize mantissa 0924 06C0 SWPB R0 0926 D042 MOVB R2, R1 0928 06C1 SWPB R1 092A 06C2 SWPB R2 092C C0C6 MOV R6, R3 092E 10C2 JMP >08B4 First the flag byte in the upper half of R6 is checked. For MR this will be zero, as the exponent cannot be larger than >007F. Next the code pre-normalizes the mantissa by moving it two hex digits (one byte) to the left. The simple way to think about this is that we are multiplying two 24 bit mantissa's into a 48 bit result. We are only interested in the top 24 bits of that result and moving two digits to the left places these 24 bits in R0,R1 properly aligned for combination with the sign and exponent. 
The more precise way to think about this is that we are doing fixed point arithmetic here, and that a six digit shift right is needed to keep the decimal point in the right place; shifting two hex digits to the left and taking the high two words is functionally the same (and leaves some extra digits available). However, we are not done as it is possible that the first hex digit is still zero. This is easy to see when using two decimal examples: 0.10 x 0.10 = 0.01 and 0.99 x 0.99 = 0.98 Even though we have kept the decimal point in the right place, the first digit can still be zero in some cases. To normalize this there is a routine that is shared by the other arithmetical operations. This routine expects the result sign/exponent in R3 and so it is moved there first. It also expects the next hex digit in the top of R2. The shared tail routine is: ; normalize FPAC mantissa (leftward) ; 08B4 0280 CI R0, >000F ; is the highest nibble 0? 08B6 000F 08B8 1509 JGT >08CC ; no: mantissa is normalized 08BA 24E0 CZC @>0BD6, R3 ; exponent already 0? 08BC 0BD6 08BE 1378 JEQ >09B0 ; yes: underflow 08C0 0603 DEC R3 ; reduce exponent & shift mantissa one nibble 08C2 001D SLAM R0,4 08C4 4100 08C6 09C2 SRL R2, 12 ; shift in one nibble extra precision 08C8 A042 A R2, R1 08CA 10F4 JMP >08B4 .. 0BD6 007F DATA >007F ; exponent bits .. First it checks that the first mantissa digit is zero. If not, the mantissa is already normalized. If it is it checks the exponent. If it is already zero, the mantissa cannot be shifted further: it would require the exponent to be reduced by one and puts it out of range (the excess-64 exponent would move from -64 to -65). In that case an underflow is reported. In the other case, the exponent is reduced and the mantissa shifted left by one. To keep accuracy, a 'spare' extra digit of precision kept in R2 is shifted in. 
Because it is a common tail, the routine will check if further shifts are necessary, but in the case of MR it will only ever perform one shift. After that, only merging the result exponent back in remains: 08CC 06C3 SWPB R3 ; merge exponent back in 08CE D003 MOVB R3, R0 08D0 1071 JMP >09B4 ; store FPAC & set status bits . The code for underflow is simple, and very similar to the code for overflow: ; underflow: additionally set AF status bit 09B0 026F ORI R15, >0800 09B2 0800 <continues with normal exit code at >09B4> Underflow only sets the arithmetic fault (AF) status bit. This allows the user program to distinguish overflow (C bit also set) from underflow.
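The MR flow described above (XOR the signs, add the exponents and remove one excess-64, multiply the mantissas, shift left one hex digit if the top digit is zero) can be sketched in a few lines of Python. This is an illustrative model under my own assumptions, not the ROM code: the names are made up, a single Python multiply replaces the four 16x16 steps, and overflow/underflow are raised as exceptions instead of setting status bits.

```python
EXCESS = 64  # exponent bias of the IBM360 single precision format


def unpack(word: int) -> tuple[int, int, int]:
    """Split a 32-bit real into (sign, excess-64 exponent, 24-bit mantissa)."""
    return word >> 31, (word >> 24) & 0x7F, word & 0xFFFFFF


def pack(sign: int, exponent: int, mantissa: int) -> int:
    """Combine sign, excess-64 exponent and 24-bit mantissa into one word."""
    return (sign << 31) | (exponent << 24) | mantissa


def multiply(a: int, b: int) -> int:
    """Multiply two normalized IBM360-format reals, truncating the result."""
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    if ma == 0 or mb == 0:
        return 0                          # anything times zero is zero
    sign = sa ^ sb                        # multiplying signs is an XOR
    exponent = ea + eb - EXCESS           # the excess is counted twice, remove it once
    if exponent > 127:
        raise OverflowError("exponent overflow")
    if exponent < 0:
        raise ArithmeticError("exponent underflow")
    product = ma * mb                     # 24 x 24 -> 48 bit product
    if product >> 44 == 0:                # top hex digit zero: normalize by one digit
        if exponent == 0:
            raise ArithmeticError("exponent underflow")
        product <<= 4
        exponent -= 1
    return pack(sign, exponent, product >> 24)   # keep the top 24 bits


if __name__ == "__main__":
    print(hex(multiply(0x41200000, 0x41300000)))  # 2.0 x 3.0
    print(hex(multiply(0x40800000, 0x40800000)))  # 0.5 x 0.5
```

Because both mantissas are normalized, the product can have at most one leading zero digit, which is why a single conditional shift is enough here, matching the remark that MR only ever performs one shift in the shared tail.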
  18. Conversion from floating point back to integers is done with CRI and CRE, for a 16 bit or 32 bit integer respectively. In principle this is just the reverse of CIR and CER that were analyzed above, but it is a bit more involved as the code has to check for overflow: the real number may be larger than what fits in the integer. In my view the code in the macro ROM for CRI and CRE is a bit convoluted and borderline buggy, but maybe I don't understand the code right. Better insights are welcome. The code for CRI and CRE starts with: ; CRI: convert real to integer ; 09CE 04C8 CLR R8 09D0 1001 JMP >09D4 ; CRE: convert real to extended ; 09D2 0708 SETO R8 09D4 04C2 CLR R2 ; prepare for 48 bit shift in R0,R1,R2 09D6 C1C0 MOV R0, R7 ; if FPAC is zero, nothing to do: 09D8 13E6 JEQ >09A6 ; store zero result & exit CRI and CRE share most of their code, using R8 as a flag to keep track. Also, the case where the real number is zero is special cased so that the remaining code can assume that the number is in standard format. The register R2 is cleared, the reason for which becomes clear further below. The test for zero has the side effect of saving the sign bit in R7. The next bit of code is also clear: 09DA C180 MOV R0, R6 ; separate mantissa 09DC 7000 SB R0, R0 ; and put exponent in R6 09DE 06C6 SWPB R6 09E0 0246 ANDI R6, >007F 09E2 007F It separates out the mantissa (into R0,R1) from the exponent (into R6) and the sign bit (already in R7). Now the mantissa is in 0.MMMMMM format, and this must be converted to MMMMMMMM.0 format, i.e. the reverse operation of that in CIR and CER. This only works if the exponent is in the range +1 to +8 (= +65 to +72 including the excess 64). If the exponent is less than 1 the real number is between (and excluding) +1 and -1 and will be truncated to 0. If the exponent is larger than 8, the number does not fit in 32 bits. This is all handled by the following code: 09E4 0226 AI R6, -65 ; is exponent at least 1? 
09E6 FFBF 09E8 112D JLT >0A44 ; if less than 1, result is zero 09EA 0506 NEG R6 ; get 32 bit result in R1,R2 09EC 0226 AI R6, >0009 ; by shifting mantissa between 09EE 0009 ; 2 and 10 hex digits right. 09F0 0606 DEC R6 09F2 1108 JLT >0A04 09F4 001C SRAM R1, 4 09F6 4101 09F8 0A41 SLA R1, 4 09FA 001C SRAM R0, 4 09FC 4100 09FE 0240 ANDI R0, >0FFF ; (bug: superfluous?) 0A00 0FFF 0A02 10F6 JMP >09F0 0A04 C100 MOV R0, R4 ; if exponent was >8, R4 will be non-zero First it tests for an exponent less than 1 and returns a zero result if so. The test for +8 is skipped as this is handled in another way that will become clear shortly. Instead it calculates the number of places that the mantissa has to be shifted. It uses a 48 bit shift in R0-R1-R2, shifting the mantissa between 2 and 10 nibbles (hex digits) right. This leaves the mantissa in MMMMMMMM.0 format in R1,R2 and leaves R0 zero. Note that for a large number the rightmost 2 hex digits will be zero as the mantissa only has 6 hex digits. The test for an exponent larger than 8 is implicit: the mantissa will be shifted 1 or 0 nibbles and R0 will not be zero. This fact is used later when the result is tested for being in range. In the above code AND-ing out the top digit of R0 seems superfluous: the top byte has been set to zero when the exponent and sign were separated out and hence SRAM will always shift in zeroes. Perhaps this is a leftover from earlier code. I would have thought it more logical to leave the mantissa in R0,R1 and first shift it two places to the left, followed by 0 to 7 places to the right (i.e. the exact reverse of what is done in the CIR/CER code). This would have required a separate test for the exponent being out of range, but the code would still have been shorter and faster, I think. In that code structure the AND-ing out would have been necessary. Next we come to handling the sign bit and range tests. Here the code for CRI and CRE diverges again: 0A06 C208 MOV R8, R8 ; opcode was CRE or CRI? 
0A08 160D JNE >0A24 0A0A C002 MOV R2, R0 ; CRI: fit result in 16 bits 0A0C C1C7 MOV R7, R7 ; if real was negative, negate int 0A0E 1501 JGT >0A12 ; (bug: should jump to >0A18) 0A10 0500 NEG R0 0A12 0282 CI R2, >8000 ; value -32768 is okay 0A14 8000 0A16 1302 JEQ >0A1C 0A18 C082 MOV R2, R2 ; check range -32767..+32767 0A1A 11B0 JLT >097C ; -> report overflow (>0A20?) 0A1C E101 SOC R1, R4 ; number was >65535? 0A1E 1314 JEQ >0A48 ; no: store result (bug: should be >0A46) 0A20 04C1 CLR R1 0A22 10AC JMP >097C ; report overflow First we test for a negative sign and negate the 16 bit integer as necessary. There is also a check for the value -32768, which is okay whereas +32768 is out of range. The jump instruction seems to be wrong and allows +32768 as well. This bug means that the real number +32768 is converted to the integer -32768 instead of being reported as an overflow error. Next is the check that the (unsigned) mantissa was in the proper range of -32767 to +32767 and an overflow is reported if outside. Also if the mantissa was larger than 65535 or the exponent was larger than 8, an overflow error is reported. A last bit of strangeness is the value of R1 upon return. The documentation is silent on what value R1 should have. In some cases it is set to zero, in other cases the absolute value of the number is left behind. Changing the destination address of one jump ensures that R1 is always set to zero. The range check for CRE is similar (including bugs): 0A24 C001 MOV R1, R0 ; CRE: fit result in 32 bits 0A26 C1C7 MOV R7, R7 ; if real was negative, negate 32 bit 0A28 1504 JGT >0A32 ; (bug: should jump to >0A38) 0A2A 0540 INV R0 0A2C 0502 NEG R2 0A2E 1701 JNC >0A32 0A30 0580 INC R0 0A32 0281 CI R1, >8000 ; value -2147483648 is okay 0A34 8000 ; (note: test cannot be exact) 0A36 1302 JEQ >0A3C 0A38 C041 MOV R1, R1 ; check range -2147483647..+2147483647 0A3A 1102 JLT >0A40 ; -> report overflow 0A3C C104 MOV R4, R4 ; number was >4294967295? 
0A3E 1304 JEQ >0A48 ; no: store result 0A40 C042 MOV R2, R1 0A42 10EF JMP >0A22 ; report overflow The code for handling the sign bit is a bit longer as it has to negate a 32 bit number. Again the jump for a positive number seems to be off, not skipping the test for -2147483648 as within range. However, the test for -2147483648 is conceptually wrong: that number cannot be expressed accurately in a single precision floating point number: it requires 8 hex digits of accuracy and the IBM360 format only has 6. The result is that a number like -2.14750e9 (which is definitely out of range) is reported as okay. The mantissa for -2.14750e9 is >800040 and this ends up in R1,R2 as >80004000. After negating this becomes >7FFFC000 which is +2147467264. Something similar happens for +2.14750e9. It would have been better to exclude trying to handle the -2147483648 case altogether and simply suffice with the -2147483647..+2147483647 range test (which due to the six digit accuracy is actually a test for -2147483392..+2147483392). The last bit of code deals with clearing out the FPAC when the real number truncates to zero (as tested for at the start of the code) and setting the high word (R1) of FPAC as necessary: 0A44 04C0 CLR R0 ; clear FPAC 0A46 04C2 CLR R2 0A48 C042 MOV R2, R1 ; set high word of FPAC 0A4A 10B4 JMP >09B4 ; store result & exit That only leaves the reporting of an overflow condition: ; overflow: set C and AF status bits & store result 097C 026F ORI R15, >1800 097E 1800 0980 1019 JMP >09B4 ; store FPAC & status bits All it does is setting the C and AF (arithmetic fault) status bits (the C bit indicates it is an overflow, not an underflow) and then perform a normal return. However, if the AFIE status bit (arithmetic fault interrupt enable) was also set, this means that immediately after the exit from macrocode a level 2 interrupt is generated. If the AFIE bit is not set, the user program must separately check for the AF error bit being set. 
All in all, as I understand it, the code for CRI and CRE has two corner case bugs and looks a bit suspect in two other places. Perhaps it was written the day after the Christmas party. I wonder if the corner case bugs were known back in the day (perhaps the corner cases did not matter enough to be detected).
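For checking corner cases like the ones discussed for CRI and CRE, it helps to have a small reference model of what the conversion is expected to do. The sketch below is such a model and nothing more: it is not a transcription of the ROM logic, the function name is invented, and it assumes the intended behaviour is "truncate toward zero, report overflow outside the signed 16 or 32 bit range". Overflow is signalled by returning None instead of setting the C and AF status bits.

```python
from typing import Optional


def real_to_int(word: int, bits: int) -> Optional[int]:
    """Convert an IBM360-format real to a signed integer of the given width.

    Returns the truncated integer, or None where an overflow would be reported.
    """
    sign = word >> 31
    exponent = ((word >> 24) & 0x7F) - 64     # remove the excess-64 bias
    mantissa = word & 0xFFFFFF                # 0.MMMMMM as a 24-bit integer
    if mantissa == 0 or exponent < 1:
        return 0                              # magnitude below 1 truncates to zero

    # The mantissa has 6 hex digits after the point, so an exponent of 6 means
    # "use the mantissa bits as-is"; other exponents shift by whole hex digits.
    shift = 4 * (exponent - 6)
    magnitude = mantissa << shift if shift >= 0 else mantissa >> -shift
    value = -magnitude if sign else magnitude

    limit = 1 << (bits - 1)
    if not -limit <= value < limit:
        return None                           # does not fit: overflow
    return value


if __name__ == "__main__":
    print(real_to_int(0x41180000, 16))   # 1.5 truncates to 1
    print(real_to_int(0x44800000, 16))   # +32768 does not fit in 16 bits
    print(real_to_int(0xC4800000, 16))   # -32768 does fit
```

In this model +32768 is rejected for a 16 bit result while -32768 is accepted, which is the asymmetry the CRI range test is trying to implement according to the analysis above.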
  19. I've done a bit more testing on the 99105 chips that seem to support the TI990 variant of the LDS, LDD and LMF instructions. The theory here is that these chips have silicon inside with the macro ROM that was used for the TI990/10A mini. The further tests support that theory. The TI990 assembler manual http://bitsavers.informatik.uni-stuttgart.de/pdf/ti/990/assembler/2270509-9701A_AsmRef_Nov82.pdf documents what LDD, LDS and LMF were supposed to do. The TI990/10A general description http://bitsavers.informatik.uni-stuttgart.de/pdf/ti/990/990-10/2302633_990-10A_GenDescr_Sep82.pdf documents that a 10A has its mapper registers in the parallel CRU area >9F80->9FFF. (there is table with the CRU address map that has this info on page 1-19). So I set up a test to see if my 99105 chip supported that. The test looks to see if the LDD/LDS/LMF instructions indeed store the map information to parallel CRU addresses in the >9F80->9FFF range. They do: LMF R0,0 loads the 6 words to >9F80 (i.e. this is where map 0 is) LMF R0,1 loads the 6 words to >9FA0 (i.e. this is where map 1 is) LDD and LDS both load the 6 words to >9FC0 (i.e. this is where map 2 is) This means that a sequence LDD-LDS-MOV is of limited use: both the source and destination will use the "map 2" loaded by LDS, as that overwrites the upload by LDD. My conclusion is that the code for LDD/LDS/LMF in the 99000 ROM is simple and easy to factory test. This probably means that there really is no backdoor to its ROM. On the other hand, using the 99110 ROM as a guide, it is now fairly easy to newly write the code for a 99000 ROM. It will of course not be exact, but it will document what is functionally in there. This is my go at recreating that code: ; Macro ROM jump table ; AORG >0800 DATA ILLOP DATA ILLOP DATA ILLOP DATA LMENTR DATA ILLOP DATA ILLOP DATA LDENTR DATA ILLOP DATA ILLOP DATA ILLOP ; LMF ; LMENTR MOV R5,R2 ; Is opcode LMF? 
ANDI R2, >FFE0 CI R2, >0320 JNE MAP1 LI R12, >9F80 ; Yes: LMF for map 0 JMP PRVTST MAP1 CI R2, >0330 JNE ILLOP LI R12, >9FA0 ; Yes: LMF for map 1 PRVTST CZC @USER,R15 ; CPU is in supervisor mode? JNE PRVERR ANDI R5, >000F ; Fetch map pointer from register W ORI R5, >0020 ; by pretending W is *R EVAD R5 LDCR *R8+,11 ; Load six words into MMU LDCR *R8+,11 LDCR *R8+,11 LDCR *R8+,11 LDCR *R8+,11 LDCR *R8, 10 RTWP ; Normal exit ILLOP RTWP2 ; Return with ILLOP error PRVERR LIMI 0 ; Return with PRIVOP error RTWP ; LDS/LDD ; LDENTR MOV R5, R2 ; Opcode is LDD or LDS? ANDI R2, >FF80 CI R2, >0780 JNE ILLOP CZC @USER,R15 ; CPU is in supervisor mode? JNE PRVERR ANDI R5, >003F ; Mask out src bits EVAD R5 JNE NOINCR ; Handle auto-increment INCT *R10 NOINCR LI R12, >9FC0 ; Base address of map register 2 MOV *R8, R2 ; Fetch pointer to new map LDCR *R2+,11 ; Load six words into MMU LDCR *R2+,11 LDCR *R2+,11 LDCR *R2+,11 LDCR *R2+,11 LDCR *R2, 10 RTWP4 ; return, skip interrupt test USER DATA >0100 Note that for CRU addresses in parallel space (i.e. R12 has top bit set), a bit count of 11 means 'transfer a word and post-increment R12 by 2' and a bit count of 10 means 'transfer a word, leave R12 as-is'. I think I have also finally figured out how the TI990/10A mapper can work out that a LDD/LDS is in effect (as the supervisor can use both PSEL=0 and PSEL=1, it is hard to tell when the PSEL/D15 line is being inverted). However, every change to the PSEL status bit is echoed on the address bus during an ST machine cycle (bus code 1101, see section 2.4.2. and table 2 in the data sheet). The mapper can use this to keep a copy of the PSEL bit in the status register in a flip-flop, and doing a XOR with the bit on the PSEL/D15 line will tell it when the bit is inverted and "map 2" has to be used.
  20. Today a look at three short routines, implementing STR, LR and NEGR, which store, load or negate the accumulator ("FPAC") respectively. The code is: ; entry point for LR ; 08D2 C038 MOV *R8+, R0 ; load S into local FPAC 08D4 C058 MOV *R8, R1 08D6 1002 JMP >08DC ; entry point for STR ; 08D8 CE00 MOV R0, *R8+ ; store FPAC into S 08DA C601 MOV R1, *R8 08DC 0242 ANDI R2, >1800 ; C and AF status bits unaffected 08DE 1800 08E0 E3C2 SOC R2, R15 08E2 1068 JMP >09B4 ; store result, set flags & finish ; code for NEGR ; 08E4 0242 ANDI R2, >1800 ; C and AF status bits unaffected 08E6 1800 08E8 E3C2 SOC R2, R15 08EA C000 MOV R0, R0 ; is FPAC zero? 08EC 135C JEQ >09A6 ; yes, set EQ flag & finish 08EE 0220 AI R0, >8000 ; no, invert sign bit 08F0 8000 08F2 1060 JMP >09B4 ; store result, set flags & finish The code is really very simple and straightforward. The only special thing is that these instructions only affect status bits ST0-2, and hence ST3 and ST4 are restored from the copy of R15 that was made in the generic entry routine. Next up will be a look at CRI and CRE, which share a lot of code and appear to have a few corner case bugs.
  21. Time to work with floating point ("real") numbers. The simplest ones are CIR and CER, which convert a 16 bit or a 32 bit integer into a real number. The two instructions share nearly all of their code. The TI990 floating point format is described in section B.4 of the data sheet (i.e. in the 99110 appendix). It is the IBM360 single precision format. In summary, a real number is expressed as: N = S x 0.MMMMMM x 16 ^ EE S is the sign bit, M is a 'mantissa' of 6 hex digits (note: always unsigned) and EE an exponent with 7 bits, i.e. the exponent range is -64 to +63. The number is 'normalized' so that the first hex digit of the mantissa is always non-zero (this keeps accuracy to a maximum). This is achieved by shifting the mantissa the required number of hex digits and adjusting the exponent accordingly. The exponent is in "excess 64" format; this means that it has 64 added to it. In this way the range becomes 0 to 127 and we can work with the exponent as an unsigned number. The code in the 99110 ROM often splits a real number into its component parts to work with them independently and recombines the components at the end of the calculation. Often the mantissa is calculated in more precision than 6 hex digits (= 24 bits) to reduce rounding errors. Let's start with CIR. After the entry code that was analyzed earlier in this thread, it starts with: ; entry point for CIR ; 0A80 C018 MOV *R8, R0 ; fetch S and sign extend into R0,R1 0A82 C040 MOV R0, R1 0A84 08F0 SRA R0, 15 This fetches the 16 bit integer operand and sign-extends it to a 32 bit operand located in our local floating point accumulator, FPAC. The rest of the code can now be the same as for CER. That instruction starts with: ; entry point for CER ; 0A86 026F ORI R15, >1000 ; set C bit unconditionally 0A88 1000 0A8A C080 MOV R0, R2 ; if S is zero, clear FPAC & finish 0A8C E081 SOC R1, R2 0A8E 13A4 JEQ >09D8 This sets the C bit unconditionally, which is how the data sheet specifies it. 
I'm not sure why this is useful: comments welcome. Then it special-cases a zero operand; we'll look at that further at the end. Then we begin the conversion: the integer is separated into a sign bit and an unsigned number:

    0A90  C1C0  MOV R0, R7       ; extract sign bit
    0A92  0247  ANDI R7, >8000
    0A94  8000
    0A96  1304  JEQ >0AA0        ; if negative, negate the number
    0A98  0540  INV R0
    0A9A  0501  NEG R1
    0A9C  1701  JNC >0AA0
    0A9E  0580  INC R0

In effect we now have S in R7 and a (32 bit) mantissa in R0,R1. The exponent is implicitly 0. However, the number is not normalized, as the mantissa must be 0.MMMMMM, and it is now MMMMMMMM.0. Conceptually, this can easily be fixed by saying the radix point is not to the right of the mantissa, but to its left, and setting the exponent to +8. Including the excess-64 the exponent becomes 72, or >48 in hex:

    0AA0  0206  LI R6, >0048     ; start exponent at +8
    0AA2  0048

We're still not done, because the integer number may have had leading zeros, and the mantissa must always start with a non-zero digit. As the number cannot be zero (we excluded that case above), this can always be achieved by shifting the mantissa between 0 and 7 hex digits to the left and adjusting the exponent accordingly:

    0AA4  C000  MOV R0, R0       ; if top word zero, shift 4 nibbles
    0AA6  1604  JNE >0AB0
    0AA8  C001  MOV R1, R0
    0AAA  04C1  CLR R1
    0AAC  0226  AI R6, -4        ; and adjust exponent accordingly
    0AAE  FFFC
    0AB0  D000  MOVB R0, R0      ; if top byte zero, shift 2 nibbles
    0AB2  1603  JNE >0ABA
    0AB4  001D  SLAM R0, 8
    0AB6  4200
    0AB8  0646  DECT R6          ; and adjust exponent accordingly
    0ABA  C080  MOV R0, R2       ; if top nibble is zero, shift one nibble
    0ABC  0242  ANDI R2, >F000
    0ABE  F000
    0AC0  1603  JNE >0AC8
    0AC2  001D  SLAM R0, 4
    0AC4  4100
    0AC6  0606  DEC R6           ; and adjust exponent accordingly

After the above steps, we have the sign bit in R7, the mantissa in R0,R1 and the exponent in R6.
The last step to make the real number is combining all component parts:

    0AC8  06C6  SWPB R6          ; merge exponent (R6), mantissa (R0,R1) and
    0ACA  001C  SRAM R0, 8       ; sign (R7) together
    0ACC  4200
    0ACE  D006  MOVB R6, R0
    0AD0  E007  SOC R7, R0

Note that the mantissa is shifted 8 bits to make room for the sign and exponent. This loses 8 bits of accuracy. The lost bits are truncated, i.e. the remaining 24 bit mantissa is not rounded up if the lost bits are above >80. Such rounding could have been achieved by adding >0000 0080 to the mantissa, using the AM instruction (the top *bit* of the mantissa will always be zero, thus this cannot overflow). However, the ROM is almost full and I don't think there is space left to add such rounding to all floating point instructions.

What remains is storing the number in the user's FPAC and setting the status bits appropriately:

    0AD2  10BB  JMP >0A4A        ; store FPAC & finish
    ..
    0A4A  10B4  JMP >09B4
    ..
    ; compare FPAC against zero & store result
    ;
    09B4  C000  MOV R0, R0       ; test sign
    09B6  1105  JLT >09C2        ; if negative only set L> bit
    09B8  1602  JNE >09BE        ; if positive set L> and A> bits
    09BA  C041  MOV R1, R1       ; if zero only set EQ bit
    09BC  13F6  JEQ >09AA
    09BE  026F  ORI R15, >C000   ; set L> and A> status bits
    09C0  C000
    09C2  026F  ORI R15, >8000   ; set L> status bit
    09C4  8000
    09C6  C740  MOV R0, *R13     ; store FPAC
    09C8  CB41  MOV R1, @2(R13)
    09CA  0002
    09CC  0380  RTWP             ; macro code complete

This bit of code is used at the end of nearly all floating point routines. Note that the macro entry code has already reset ST0-ST4 in R15, so only setting the right bits remains. The handling of a zero result is done in a separate routine. The "result is zero" exit routine is also heavily used, including by the CIR and CER instructions (remember the test for zero at the start of that code):

    ..
    09D8  13E6  JEQ >09A6
    ..
    ; clear FPAC, set EQ status bit & store
    ;
    09A6  04C0  CLR R0
    09A8  04C1  CLR R1
    09AA  026F  ORI R15, >2000
    09AC  2000
    09AE  100B  JMP >09C6        ; store FPAC & exit

That concludes the first two floating point instructions.
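To sanity-check the walkthrough, the whole conversion can be modelled in a few lines of Python. This is a sketch under my own assumptions, not a transcription of the ROM: `int_to_real` and `real_to_float` are hypothetical names, the real is handled as one 32 bit value rather than a register pair, and status bits are ignored. It follows the same steps as the listing: split off the sign, start the excess-64 exponent at 72, normalize by 4, 2 and 1 hex digits, then truncate the mantissa to 24 bits.

```python
MASK32 = 0xFFFFFFFF


def int_to_real(n):
    """Convert a signed integer (16 or 32 bit range) to the IBM360-style real."""
    if n == 0:
        return 0                      # zero is special-cased, as in the ROM
    sign = 0x80000000 if n < 0 else 0
    mant = abs(n) & MASK32            # unsigned 32 bit magnitude
    exp = 72                          # +8 hex digits, plus the excess-64 bias
    if mant >> 16 == 0:               # top word zero: shift 4 hex digits
        mant = (mant << 16) & MASK32
        exp -= 4
    if mant >> 24 == 0:               # top byte zero: shift 2 hex digits
        mant = (mant << 8) & MASK32
        exp -= 2
    if mant >> 28 == 0:               # top nibble zero: shift 1 hex digit
        mant = (mant << 4) & MASK32
        exp -= 1
    # keep the top 24 mantissa bits; the low 8 bits are truncated, not rounded
    return sign | (exp << 24) | (mant >> 8)


def real_to_float(word):
    """Decode the 32 bit real back to a Python float, for checking only."""
    sign = -1.0 if word & 0x80000000 else 1.0
    exp = (word >> 24) & 0x7F
    mant = word & 0x00FFFFFF
    return sign * (mant / 2**24) * 16.0 ** (exp - 64)


for value in (1, -100, 256):
    print(value, f"{int_to_real(value):08X}", real_to_float(int_to_real(value)))
```

Integers that need more than 24 significant bits come back slightly smaller in magnitude after a round trip, which is the truncation discussed above.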
  22. And here is the analysis for MM. It starts with doing the 32x32 bit multiply:

    ; 32 x 32 => 64 bit multiply. S is R0,R1 and D is R4,R5
    ; result is in R0-R3
    ;
    ; used for both MM (R6!=0) and MR (R6==0)
    ; in case of MR it multiplies two 24 bit mantissas
    ;
    0900  C085  MOV R5, R2       ; long multiply in four 16x16 bit steps
    0902  3881  MPY R1, R2
    0904  C205  MOV R5, R8
    0906  3A00  MPY R0, R8
    0908  C284  MOV R4, R10
    090A  3A81  MPY R1, R10
    090C  3804  MPY R4, R0
    090E  002A  AM R10, R8       ; add the partial results
    0910  420A
    0912  1701  JNC >0916
    0914  0580  INC R0
    0916  002A  AM R8, R1
    0918  4048
    091A  1701  JNC >091E
    091C  0580  INC R0

What the above code does is easier to understand if it is written out like a manual multiplication:

              -R0-.-R1-     = S
              -R4-.-R5-     = D
        ---------------x
              -R2-.-R3-     = RL = R1 x R5
         -R8-.-R9-.0000     = T1 = R0 x R5
         -RA-.-RB-.0000     = T2 = R4 x R1
    -R0-.-R1-.0000.0000     = RH = R4 x R0
    ===================+
    -R0-.-R1-.-R2-.-R3-     = R

In the above figure I've used RA for R10 and RB for R11 to keep alignment. The last bit is nothing more than storing the result and setting the status flags:

    091E  D186  MOVB R6, R6      ; is this a MR or MM instruction?
    0920  1607  JNE >0930        ; jump if MM

    [ a little code for MR skipped ]

    0930  CDC0  MOV R0, *R7+     ; MM: store 8 byte result in D
    0932  CDC1  MOV R1, *R7+
    0934  CDC2  MOV R2, *R7+
    0936  C5C3  MOV R3, *R7
    0938  E001  SOC R1, R0       ; if result is 0, set EQ flag
    093A  E002  SOC R2, R0
    093C  E003  SOC R3, R0
    093E  1602  JNE >0944
    0940  026F  ORI R15, >2000
    0942  2000
    0944  0380  RTWP             ; macro execution complete

That is one math operation out of the way.
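The four-step multiply above can be checked with a small Python model. Again this is just a sketch under my own assumptions: `mul32x32` is a hypothetical name, both operands are treated as unsigned (as the 16x16 MPY steps are), and the variables w0..w3 stand for the four result words that end up in R0-R3. The two carry checks correspond to the two JNC / INC R0 pairs in the listing.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32x32(s, d):
    """Unsigned 32 x 32 => 64 bit multiply built from four 16 x 16 products."""
    s_hi, s_lo = s >> 16, s & MASK16      # R0, R1
    d_hi, d_lo = d >> 16, d & MASK16      # R4, R5

    rl = s_lo * d_lo                      # RL = R1 x R5
    t1 = s_hi * d_lo                      # T1 = R0 x R5
    t2 = d_hi * s_lo                      # T2 = R4 x R1
    rh = d_hi * s_hi                      # RH = R4 x R0

    w0, w1 = rh >> 16, rh & MASK16        # top two result words
    w2, w3 = rl >> 16, rl & MASK16        # bottom two result words

    t = t1 + t2                           # first AM: add the two middle terms
    if t > MASK32:                        # carry out lands in the top word
        w0 += 1
        t &= MASK32

    mid = ((w1 << 16) | w2) + t           # second AM: add into the middle words
    if mid > MASK32:                      # carry out lands in the top word again
        w0 += 1
        mid &= MASK32
    w1, w2 = mid >> 16, mid & MASK16

    return (w0 << 48) | (w1 << 32) | (w2 << 16) | w3


print(f"{mul32x32(0xFFFFFFFF, 0xFFFFFFFF):016X}")
```

The top word never overflows, because the full product of two 32 bit numbers always fits in 64 bits.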
  23. Let's take a look at the remaining entry point, for opcodes in the >03xx range. It starts thus:

    ; Start of table entry 0806 (opcodes 03xx)
    ; Only >0301 (CR) and >0302 (MM) are valid on a 99110
    ;
    0B1C  0285  CI R5, >0302     ; CR or MM opcode?
    0B1E  0302
    0B20  155F  JGT >0BE0        ; no: test extension & exit
    0B22  1301  JEQ >0B26        ; for CR clear R5 (as a flag)
    0B24  04C5  CLR R5
    0B26  024F  ANDI R15, >07FF  ; clear status bits
    0B28  07FF
    0B2A  C0BE  MOV *R14+, R2    ; fetch second opcode word
    0B2C  0206  LI R6, >0004     ; four byte operands
    0B2E  0004

It only accepts CR and MM and all other opcodes from the group are referred to the extension test. Later on we need an easy test for opcode CR versus MM, and R5 is cleared for this purpose. The status bits ST0-ST4 are cleared, as we saw with the 0Cxx opcodes. Then the second opcode word is fetched and R6 is preloaded with an auto-increment constant. Next it prepares the source operand:

    0B30  C042  MOV R2, R1       ; extract src bits
    0B32  0241  ANDI R1, >003F
    0B34  003F
    0B36  0101  EVAD R1          ; calculate src address
    0B38  1601  JNE >0B3C        ; if Ts = 3, autoincrement src ptr
    0B3A  A686  A R6, *R10

This uses the EVAD instruction, which is discussed in data sheet section 7.3.3.5. This instruction takes a 6 bit operand field and calculates the actual address of the operand. If the modifier bits signify *Rn+ the EQ bit is set (for a source operand) and a pointer to Rn is loaded in R10. Because we are dealing with 32 bit operands the register is auto-incremented by 4 bytes. It proceeds with preparing the destination operand:

    0B3C  C008  MOV R8, R0       ; save source address during 2nd EVAD
    0B3E  0242  ANDI R2, >0FC0   ; extract dst bits
    0B40  0FC0
    0B42  0102  EVAD R2
    0B44  1B04  JH >0B4E         ; if Td = 3, autoincrement dst ptr
    0B46  C145  MOV R5, R5       ; for MM, increment is 8
    0B48  1301  JEQ >0B4C
    0B4A  0A16  SLA R6, 1
    0B4C  A646  A R6, *R9

This code is pretty much the same. For auto-increment in the destination field the A> status flag (ST0) is set and the associated pointer is in R9.
Because MM has a 64 bit result, the auto-increment is upped to 8 bytes. I'm not sure why the code uses two calls on EVAD, as this instruction can do both src and dst at the same time. If anybody sees a good reason for this, please post your observations.

With the operand access prepared the code moves on to actually fetch the operands:

    0B4E  C085  MOV R5, R2       ; move opcode to R2
    0B50  C200  MOV R0, R8       ; restore source address
    0B52  C038  MOV *R8+, R0     ; fetch S to R0,R1 and D to R4,R5
    0B54  C058  MOV *R8, R1
    0B56  C117  MOV *R7, R4
    0B58  C167  MOV @2(R7), R5
    0B5A  0002

The source operand is fetched into R0,R1 and the destination into R4,R5. As we will see later, these registers are chosen for good reason. It does mean that we have to move the (first word of the) opcode out of the way to R2.

The choice to move R0 back to R8 is significant. When using indirect addressing with R8/R7, the CPU generates the special WS and DOP/SOP bus status codes. If Td/Ts was zero during EVAD a WS cycle is used and if Td/Ts was not zero a DOP/SOP cycle is used (see section 7.3.3.4). This way, external hardware cannot tell whether an instruction is a macro instruction or implemented in microcode. The data sheet is a bit vague, but there seems to be a mechanism that makes this work when two EVAD instructions are used as well.

Finally we get to execution:

    0B5C  0706  SETO R6          ; set the MM / CR flag
    0B5E  C082  MOV R2, R2       ; was opcode CR?
    0B60  1604  JNE >0B6A
    0B62  0224  AI R4, >8000     ; change sign of D
    0B64  8000
    0B66  0460  B @>0830         ; perform CR = S+(-D) without store
    0B68  0830
    0B6A  0460  B @>0900         ; perform MM
    0B6C  0900

The execution path of CR is partly shared with AR and that of MM with MR. Hence a flag (R6) is set to keep track of which paths to follow. CR is evaluated by calculating S+(-D), and suppressing storage of the result - just the status bits are set.
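The operand handling in this entry code boils down to splitting the second opcode word into two 6 bit fields and choosing an auto-increment step. A rough Python sketch of just that bookkeeping follows. It assumes the usual 9900 layout of a 6 bit field (2 modifier bits followed by a 4 bit register number), which matches the >003F and >0FC0 masks in the listing. The function name and the returned dictionary are my own invention, and unlike the ROM I shift the destination field down before splitting it. The address calculation that EVAD performs is not modelled.

```python
def decode_operand_word(word2, is_mm):
    """Split the second opcode word of CR/MM into src/dst fields (hypothetical helper)."""
    src = word2 & 0x003F                 # ANDI R1,>003F
    dst = (word2 & 0x0FC0) >> 6          # ANDI R2,>0FC0, shifted down for convenience

    ts, s = src >> 4, src & 0xF          # modifier bits and register number
    td, d = dst >> 4, dst & 0xF

    # T = 3 means *Rn+; the pointer register is bumped by the operand size
    src_inc = 4 if ts == 3 else 0
    # MM writes a 64 bit result, so its destination step is doubled (SLA R6,1)
    dst_inc = (8 if is_mm else 4) if td == 3 else 0

    return {"ts": ts, "s": s, "td": td, "d": d,
            "src_inc": src_inc, "dst_inc": dst_inc}


# Example: source *R2+, destination *R1+
print(decode_operand_word(0x0C72, is_mm=True))
```

With `is_mm=False` the same word gives a destination step of 4, which is the CR case.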
  24. The answer depends on what you mean by native. Yes: the 990/12 had microcode for all these instructions, and also for the double precision variants (AD, SD, MD, etc.). No: the 990/12 did not have specialized data paths to support floating point, and the microcode calculated the results using the normal 16 bit data path. Simply put, the microcode did the same operations as the macro code on a 99110. Of course, not having to fetch opcodes etc., it runs faster in microcode. One could say this is "low end native". An example of high end native would be an FPU co-processor as existed for the PDP11: http://www.psych.usyd.edu.au/pdp-11/11_34_fpp.html As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggests that TI was thinking about it at the time. Had the series been more successful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231: http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf