pnr Posted January 21, 2018

Now that the internal ROM of the 99110 processor has been read out, it becomes interesting to see what the TI engineers had put in it. This thread is intended for posts about recreating the source code for the ROM. It is meant to be wide in scope and include discussion about the best tools for this specific job, the ins and outs of floating point formats and their implementation on the 99xx and 99xxx, etc. All contributions towards these topics are most welcome.

This first post includes a binary dump (the .bin file) and a quick disassembly using xda99 (which was the first tool I came across). The entry table (see section 7.3.1 of the data sheet) is:

               AORG >0800
    ; macrostore entry vectors (see table 7 of datasheet)
    ;
    0800 0BE0  DATA >0BE0          ; entry point for 00xx opcodes
    0802 0BE0  DATA >0BE0          ; entry point for 01xx opcodes
    0804 0BE0  DATA >0BE0          ; entry point for 02xx opcodes
    0806 0B1C  DATA >0B1C          ; entry point for 03xx opcodes
    0808 0B6E  DATA >0B6E          ; entry point for 0Cxx opcodes
    080A 0B80  DATA >0B80          ; entry point for 0Dxx opcodes
    080C 0BE0  DATA >0BE0          ; entry point for 0Exx opcodes
    080E 0AF6  DATA >0AF6          ; entry point for 0Fxx + 07xx opcodes
    0810 0BE0  DATA >0BE0          ; entry point for two-word opcodes
    0812 0BE0  DATA >0BE0          ; entry point for macro XOP's

Most of the entries refer to the exit code at >0BE0. This code implements the extension interface documented in section 7.3.6 of the data sheet.

    ; unimplemented instructions jump here to check for external macro ROM.
    ; Officially, 0BE0-0BFF was reserved for factory test code
    ;
    0BE0 C1E0  MOV  @>1000, R7     ; test macro location >1000 for >AAAA magic
    0BE2 1000
    0BE4 0287  CI   R7, >AAAA
    0BE6 AAAA
    0BE8 1602  JNE  >0BEE          ; if not present exit
    0BEA 0460  B    @>1002         ; jump to external macro code
    0BEC 1002
    0BEE 0382  RTWP2               ; return & trigger ILLOP interrupt

Attachments: macrorom.bin macrorom.txt
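For readers who prefer to see the control flow of the first post outside of assembly, here is a minimal Python sketch of the entry table and the exit code at >0BE0. It is an illustration only, not recovered source: the names ENTRY_VECTORS, implemented_groups and extension_check, and the read_word callback that stands in for a macrostore read, are all hypothetical.

```python
# Entry vectors copied from the listing at >0800 (group -> entry address).
ENTRY_VECTORS = {
    "00xx": 0x0BE0, "01xx": 0x0BE0, "02xx": 0x0BE0, "03xx": 0x0B1C,
    "0Cxx": 0x0B6E, "0Dxx": 0x0B80, "0Exx": 0x0BE0, "0Fxx+07xx": 0x0AF6,
    "two-word": 0x0BE0, "macro XOP": 0x0BE0,
}
EXIT_CODE = 0x0BE0      # shared "test for extension & exit" routine
MAGIC_ADDR = 0x1000     # location tested by MOV @>1000,R7
MAGIC = 0xAAAA          # value compared by CI R7,>AAAA
EXT_ENTRY = 0x1002      # target of B @>1002


def implemented_groups():
    """Groups whose vector does not point straight at the exit code."""
    return [g for g, addr in ENTRY_VECTORS.items() if addr != EXIT_CODE]


def extension_check(read_word):
    """Model of >0BE0->0BEE; read_word(addr) is a hypothetical memory read."""
    if read_word(MAGIC_ADDR) == MAGIC:
        return ("branch", EXT_ENTRY)   # external macro code is present
    return ("illop", None)             # RTWP2: return and raise ILLOP


if __name__ == "__main__":
    print(implemented_groups())
    print(extension_check(lambda addr: 0xAAAA))
    print(extension_check(lambda addr: 0x0000))
```

Only four of the ten vectors lead to real code in the internal ROM; everything else falls through to the extension check.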
pnr Posted January 21, 2018

"If you use an Apple Computer, you can use the TI-Disk Manager for disassembling code for nearly all processors of the 99xxx family. It has an interactive tool (Disassembler Editor) for producing clean source code."

That is very interesting. How would I best use this tool for the job at hand? Does it support the extra 99xxx instructions? And how about the (obscure) macrostore specific instructions (EVAD, the interrupt jumps, the RTWP variants)? Can the disassembler tool be used on a stand alone basis?
pnr Posted January 21, 2018

And here is the code for the 8th slot in the vector table. This slot (>0Fxx, >07xx) handles only the LDS (>0780) and LDD (>07C0) opcodes.

    ; entry point for the 0Fxx and 07xx opcodes
    ; only the 74LS612 mapper variant of LDD and LDS are recognized
    ;
    0AF6 2560  CZC  @>0B18, R5     ; is the opcode >0780 or >07C0?
    0AF8 0B18
    0AFA 1672  JNE  >0BE0          ; no: test for extension & exit
    0AFC 27E0  CZC  @>0B1A, R15    ; are we in user mode?
    0AFE 0B1A
    0B00 1303  JEQ  >0B08
    0B02 0300  LIMI >0000          ; yes: set up PRIVOP error
    0B04 0000                      ; (will cause INT #2 after the RTWP)
    0B06 0380  RTWP
    0B08 0283  CI   R3, >C000      ; is this a first LDS?
    0B0A C000
    0B0C 1303  JEQ  >0B14
    0B0E 0283  CI   R3, >6000      ; is this a first LDD?
    0B10 6000
    0B12 1601  JNE  >0B16
    0B14 C08E  MOV  R14, R2        ; save address+2 of first LDS/LDD in a sequence
    0B16 0384  RTWP4               ; return & defer interrupt
    0B18 F83F  DATA >F83F          ; reverse bit pattern of LDD/LDS
    0B1A 0100  DATA >0100          ; PRIV bit in ST register

This has two interesting points.

LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) a privilege violation error must be created. This is achieved by using LIMI 0. This is a hardware instruction that is also only valid in system mode and will set the PRIVOP error bit. This normally immediately causes an INT #2 to occur. However, macrocode cannot be interrupted so it remains pending. Only after we return to normal code with the RTWP is the interrupt honored.

Saving the address+2 of the first LDD/LDS in a sequence is an obscure feature that can be used when implementing TI990/12 style interruptible instructions in an external macro ROM (also see bottom of page 107 of the data sheet). When such an instruction is used in combination with LDD/LDS and allows itself to be interrupted, it will save its progress in a checkpoint register and reset the saved PC in R14 to the address of the first LDD/LDS.
After the interrupt has finished, the instruction will restart from the first LDD/LDS (setting up the hardware assists) and the interruptible instruction will restart from its checkpoint. The 990/12 assembler manual has more information about interruptible instructions and checkpoint registers.

Edited January 21, 2018 by pnr
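The two CZC tests at the top of the LDS/LDD entry code can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ROM's source: the function names are hypothetical, and only the two mask words (>F83F and >0100) and the order of the tests are taken from the listing above. CZC sets EQ when every bit that is 1 in the mask is 0 in the tested word.

```python
LDX_MASK = 0xF83F   # DATA word at >0B18: bits that must be clear in the opcode
PRIV_BIT = 0x0100   # DATA word at >0B1A: PRIV bit in the status register


def czc_equal(mask, value):
    """CZC mask,value: EQ is set when (value AND mask) is zero."""
    return (value & mask) == 0


def lds_ldd_entry(opcode, status):
    """Hypothetical model of the decisions made at >0AF6->0B06."""
    if not czc_equal(LDX_MASK, opcode):
        return "extension check"       # JNE >0BE0
    if not czc_equal(PRIV_BIT, status):
        return "PRIVOP"                # LIMI 0 sets the error, then RTWP
    return "accepted"                  # continue at >0B08


if __name__ == "__main__":
    for op, st in [(0x0780, 0x0000), (0x07C0, 0x0100), (0x0F00, 0x0000)]:
        print(f">{op:04X} with ST=>{st:04X}: {lds_ldd_entry(op, st)}")
```

The sketch stops where the ROM starts looking at R3; the "first LDS/LDD in a sequence" bookkeeping is not modelled.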
HackMac Posted January 21, 2018

"That is very interesting. How would I best use this tool for the job at hand? Does it support the extra 99xxx instructions? And how about the (obscure) macrostore specific instructions (EVAD, the interrupt jumps, the RTWP variants)? Can the disassembler tool be used on a stand alone basis?"

To disassemble the macrorom.bin file, you should copy it onto a TI disk image, which you can create with the TI-Disk Manager. Then you drag & drop the file from the Mac to the file list of the specific TI disk image (or move the file directly onto the disk image item in the right list), which is displayed in the main window of the TI-Disk Manager, to import this binary file in PROGRAM format (you will be asked for import options while importing).

Then you can right click on the newly created file while holding the option key (and also the shift key? I forgot.) pressed. In the context menu select opening in the Disassembler Editor. The file will be disassembled and the source code will be displayed in a new window.

There you should play around with the mouse. There are some tool tips and links you can use to jump to specific addresses (jump back and forth with the arrow buttons at the bottom of the window), or you can format the source using the context menu. There are also some options at the bottom of the editor window; there you can select the processor type you need. When you are done, you can export your result as a text file on your Mac.

You can close the Disassembler Editor window at any time. The disassembling session will be saved automatically, so you can continue later. Select your last session for continuing your work via the Window menu.

You can find more information on the Wiki pages at bitbucket. Please use the latest release (2.9.2), which I submitted this afternoon.

Enjoy

Edited January 21, 2018 by HackMac
mizapf Posted January 21, 2018

"LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) an privilege violation error must be created."

Is the PRIV bit negative logic? (Possibly makes sense, because the bit is set to 0 for TMS99xx, which would turn its programs to non-privileged.)
pnr Posted January 22, 2018

"Is the PRIV bit negative logic? (Possibly makes sense, because the bit is set to 0 for TMS99xx, which would turn its programs to non-privileged.)"

Yes it is negative logic and confusingly named. I think USER would have been a better name:

- When the bit is 0 the CPU is in privileged mode and can execute all instructions.
- When the bit is 1 the CPU is in unprivileged mode and I/O type instructions become restricted. Attempts to make the bit 0 again become illegal as well.

The only way to get back to privileged mode is via a reset, interrupt or XOP. As that code is normally controlled by the operating system it can retain control. See section 6 of the data sheet for details.
mizapf Posted January 22, 2018

Yes, I already had a look at the specs, in particular because this is still pending work for me to do in MAME. There was once an implementation of the TMS99xxx, but after the rewrite (already 6 years ago) I left the 99xxx unfinished, because there was no system to test it with.

As for my question, I seem to have swapped the semantics of that bit in my own memory. The privileged mode is what is typically also called the kernel mode.

For a long time I wondered what the XOP instructions with their fixed vector could be used for. Studying more about other OS and architectures, I finally understood that the XOPs are actually system calls. On the 99xx platform, the mode switch is missing, though.

I found it fascinating to see that with the full support in the 99xxx, there would be no LIMI access to user programs (which is so ubiquitous in our assembly programming), and you could protect access to system resources e.g. on lower CRU addresses.
pnr Posted January 22, 2018

"I left the 99xxx unfinished, because there was no system to test it with."

Well, now there is Stuart's excellent 99xxx project: http://www.stuartconner.me.uk/tms99110_breadboard/tms99110_breadboard.htm

I'm having a ton of fun experimenting with the 99xxx using this board.
mizapf Posted January 22, 2018

Indeed. But being the only contributor to the TI emulation, I just see my todo list getting longer all the time...

java.lang.CloneNotSupportedException at mizapf:xxxx
speccery Posted January 22, 2018

"This has two interesting points. LDD and LDS are only valid when in system mode (PRIV bit not set). If they are called from user mode (PRIV bit set) a privilege violation error must be created. This is achieved by using LIMI 0. This is a hardware instruction that is also only valid in system mode and will set the PRIVOP error bit. This normally immediately causes an INT #2 to occur. However, macrocode cannot be interrupted so it remains pending. Only after we return to normal code with the RTWP is the interrupt honored.

"Saving the address+2 of the first LDD/LDS in a sequence is an obscure feature that can be used when implementing TI990/12 style interruptible instructions in an external macro ROM (also see bottom of page 107 of the data sheet). When such an instruction is used in combination with LDD/LDS and allows itself to be interrupted, it will save its progress in a checkpoint register and reset the saved PC in R14 to the address of the first LDD/LDS. After the interrupt has finished, the instruction will restart from the first LDD/LDS (setting up the hardware assists) and the interruptible instruction will restart from its checkpoint. The 990/12 assembler manual has more information about interruptible instructions and checkpoint registers."

Thank you for sharing your insights, it is a very interesting read. I can only see that there is a lot of documentation to be read. It is also very interesting how the minicomputer implementation has ramifications for the microprocessor design of the TMS991XX.

I suppose the TMS99105 still does not have what it would take to implement virtual memory, i.e. no support for page faults and generic restartable instructions?
mizapf Posted January 22, 2018

As far as I saw, there is still a limit at 16 bit addresses, maybe a second bank, but not more. (I think there is a map bit.) It is a pity that the TMS architecture does not allow for more.

The process space is typically limited by the address width. You may use some tricks like segmented memory, as Intel did, but eventually you will always have the width as the limiting factor. This is the primary reason why 32 bit systems cannot make proper use of memory beyond 4 GiB. We know this from the Geneve which is built with an address space of up to 2 MiB, but not directly addressable; it requires mapping.

[Edit:] One might think that this just amounts to allowing for a wider address range, like 24 bit, inside the CPU. Maybe there could be another instruction format that allows for using wider addresses, like MOV @>12ABCD,R1, which requires a three-word object code. But this is not sufficient. Since we are using registers for indirect addressing, the registers would need to be wider, too, to accommodate the longer addresses. At that point we are beyond 16-bit processing. The above-mentioned 32-bit systems limit the address bus width (for linear addressing) just by being unable to handle values with more than 32 bit. The modern 64 bit processors do not even use 64 address bits but typically offer 48 address bits.

Edited January 22, 2018 by mizapf
speccery Posted January 22, 2018

Yes it does still have the 16-bit address space limit. I think you are referring to the PSEL# bit, which allows two 64K banks to exist, but in the data sheets they are actually using the PSEL# line to enable a 74LS612 or other memory mapper device. That would mean that the user processes could have paged memory, in a 24 bit address space for example.

What I am wondering is whether or not the TMS99000 architecture supports page faults, i.e. missing pages. That is not possible with the 74LS612 alone; a more sophisticated MMU (memory management unit) would be needed, which could interrupt the CPU if a page is not there, in a way that would allow the page to be fetched from disk storage by the operating system (in supervisor mode).

I once wrote a paged memory system for the Intel 386 CPU's MMU, and with that you can mark in your page directory entries whether or not a given page is accessible. If it is, the CPU will just access the data; if not, a page fault will occur. In the page fault handler you can decide how to serve the fault. If the fault occurred due to access to an invalid address, the process is typically killed. If it is accessing a memory page frame that is valid but not loaded, the OS could load it. This type of functionality requires restartable instructions and a hardware based translation lookaside buffer (TLB) as a minimum - it does not need all the machinery that the 386 offers. It does not make any sense in a 16-bit address space alone, but if the 16 bits can be paged to a much larger physical address space, it starts to make sense.

Still I know this whole thing probably does not make sense for the TMS99000 in the greater scope of things, but just wondering if the bare ingredients in the CPU are there...
pnr Posted January 23, 2018

This is a very interesting and broad topic. It will be fun to discuss, but let's use a separate thread, e.g. "Designing MMU's". Below are some short comments that can be used to kick off such a thread.

"I suppose the TMS99105 still does not have what it would take to implement virtual memory, i.e. no support for page faults and generic restartable instructions?"

In short:

- No, it does not have such support out of the box.
- But maybe it can. Because of the 'registers in RAM' architecture, I think it might be possible with the help of external hardware (also for the 9900 & 9995).

But what is the point of demand paging when virtual memory space (64KB) is so much smaller than physical memory (say 1MB)? Maybe it only makes sense in the reverse situation. Also, does a TLB make sense when virtual memory is small?

Next to address translation, the other purpose of a MMU is memory protection. How to implement that on a 99xx/99xxx is an interesting question too.

"As far as I saw, there is still a limit at 16 bit addresses, maybe a second bank, but not more. (I think there is a map bit.) It is a pity that the TMS architecture does not allow for more."

Yes, from a non-kernel program viewpoint, space is limited to 16 bits. The 99000 has two kludges to make it somewhat 17-18 bit like:

- Separation of instruction space and data space (not used on TI990 mini's). This was used with great success on PDP11 mini's and early Unix.
- The PSEL bit. Most mini's of the era had two memory spaces (kernel/user) driven by the supervisor bit in the status register. TI separated the two functions into separate bits, but when using a 74LS612 mapper or the TI990 MMU this is not fully exploited and the two bits move in tandem. With some new macro instructions PSEL could be made more useful.

Let's continue in a new thread.
pnr Posted January 24, 2018

I've now analysed the entry code for group >0Cxx. This group has most of the floating point instructions, see the 99110 annex of the data sheet for details. It has revealed an undocumented opcode: 'XIT'.

The macro code begins thus:

    ; entry point for opcodes 0Cxx
    ;
    0B6E 0285  CI   R5, >0C3F      ; zero or one operand opcode?
    0B70 0C3F
    0B72 1506  JGT  >0B80          ; jump if one operand
    0B74 2560  CZC  @>0BD4, R5     ; valid zero operand instruction?
    0B76 0BD4
    0B78 161C  JNE  >0BB2          ; no: test for 'XIT'
    0B7A 0245  ANDI R5, >0006      ; if valid zero operand opcode,
    0B7C 0006
    0B7E 1011  JMP  >0BA2          ; go fetch FPAC & go to opcode routine
    ..
    0BD4 0039  DATA >0039          ; opcode bit test pattern

The above is all straightforward. The next macro code section proceeds to handle the source operand:

    0B80 C2C5  MOV  R5, R11        ; handle one operand case
    0B82 0245  ANDI R5, >01FF      ; isolate <src> bits
    0B84 01FF
    0B86 0105  EVAD R5             ; calculate EA
    0B88 1609  JNE  >0B9C          ; Ts = 3 ?
    0B8A 024B  ANDI R11, >FFC0     ; mask out operand bits
    0B8C FFC0
    0B8E 028B  CI   R11, >0C80     ; opcode is CIR?
    0B90 0C80
    0B92 1602  JNE  >0B98          ; yes: autoincrement by 2, else by 4
    0B94 05DA  INCT *R10
    0B96 1002  JMP  >0B9C
    0B98 A6A0  A    @>0B2E, *R10   ; >0B2E contains 4
    0B9A 0B2E
    0B9C 0855  SRA  R5, 5          ; calculate switch index from opcode
    0B9E 0225  AI   R5, >0006
    0BA0 0006

This macro code section uses the special macro opcode "EVAD" (evaluate address). This instruction is documented in section 7.3.3.5 of the data manual. EVAD analyzes the operand bits in the instruction and calculates the address of the source operand. This address is placed in R8. If the operand uses the *Rx+ format, a pointer to Rx is placed in R10 and the EQ status bit is set. As floating point numbers are 32 bit, i.e. 4 bytes, registers are auto-incremented by 4. The only exception is CIR, which has an integer word operand and increments by 2.
Then the macro code proceeds with:

    0BA2 C01D  MOV  *R13, R0       ; fetch FPAC into local R0,R1
    0BA4 C06D  MOV  @2(R13), R1
    0BA6 0002
    0BA8 C08F  MOV  R15, R2        ; save status & clear ST0-ST4
    0BAA 024F  ANDI R15, >07FF     ; (status is saved to restore ST3-ST4 as needed)
    0BAC 07FF
    0BAE 0165  BIND @>0BBE(R5)     ; jump to specific opcode routine
    0BB0 0BBE

    ; branch table for 0Cxx group (first 4 are zero operand)
    ;
    0BBE 09CE  DATA >09CE          ; CRI
    0BC0 08E4  DATA >08E4          ; NEGR
    0BC2 09D2  DATA >09D2          ; CRE
    0BC4 0A86  DATA >0A86          ; CER
    0BC6 081E  DATA >081E          ; AR
    0BC8 0A80  DATA >0A80          ; CIR
    0BCA 0814  DATA >0814          ; SR
    0BCC 08F4  DATA >08F4          ; MR
    0BCE 0946  DATA >0946          ; DR
    0BD0 08D2  DATA >08D2          ; LR
    0BD2 08D8  DATA >08D8          ; STR

This code fetches the floating point accumulator ("FPAC") from the user's R0,R1 and places this in our local R0,R1. Then it clears ST0-ST4 in the user's status register. The various instructions will set these bits as needed. The original status register is saved in R2 because some instructions only affect ST0-ST2 and must hence restore ST3-ST4. Then it uses the 99000 specific BIND (Branch Indirect) instruction to jump to a specific opcode handler routine via a jump table.

That leaves the mysterious undocumented 'XIT'. It is actually rather boring:

    ; Test and implement XIT
    ;
    0BB2 C1C5  MOV  R5, R7         ; test for XIT (>0C0E and >0C0F)
    0BB4 0917  SRL  R7, 1          ; XIT is a no-op
    0BB6 0287  CI   R7, >0607
    0BB8 0607
    0BBA 1612  JNE  >0BE0          ; no: test for extension & exit
    0BBC 0380  RTWP                ; macro processing complete

XIT is an instruction on the TI990/12 and is also a NOP there.
It is used as part of floating point handling by the TI990 Fortran compiler, to create code that would run both on machines with (the /12) and without (the /10) native floating point (explanation courtesy of Dave Pitts). The Fortran compiler would for example generate:

    BLWP @F$RITP
    LR   *R9+
    AR   *R9
    STR  *R8
    XIT

On a 990/10 the "F$RITP" routine is a floating point library that reads the instructions following the BLWP and emulates the floating point hardware. When it sees a XIT instruction it stops emulating and returns. Hence "exit interpreter" or XIT.

On a 990/12 the "F$RITP" routine would be empty (i.e. do a RTWP immediately) and the 990/12 hardware would execute the floating point code natively. When it saw the XIT it would treat it as a NOP.

It would seem that the 99110 implemented the "XIT" instruction for the exact same purpose.
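The dispatch logic of the >0Cxx / >0Dxx entry code described in this post can be summarised in a short Python sketch. It is a model for illustration, not recovered source: the function name classify_0cxx and the list BRANCH_TABLE are hypothetical, and the integer division by two assumes that an odd byte offset into the table selects the same word as the even offset below it (word accesses on the 99xx family ignore the lowest address bit).

```python
# Handler names in the order of the branch table at >0BBE (one word each).
BRANCH_TABLE = ["CRI", "NEGR", "CRE", "CER", "AR", "CIR",
                "SR", "MR", "DR", "LR", "STR"]


def classify_0cxx(opcode):
    """Return the routine the entry code would select for a 0Cxx/0Dxx opcode."""
    if opcode > 0x0C3F:                          # CI R5,>0C3F / JGT >0B80
        # ANDI R5,>01FF ; SRA R5,5 ; AI R5,>0006
        offset = ((opcode & 0x01FF) >> 5) + 6
    elif (opcode & 0x0039) == 0:                 # CZC @>0BD4,R5
        offset = opcode & 0x0006                 # ANDI R5,>0006
    elif (opcode >> 1) == 0x0607:                # SRL R7,1 / CI R7,>0607
        return "XIT"                             # no-op, plain RTWP
    else:
        return "extension check"                 # JNE >0BE0
    # Assumption: BIND fetches a word, so the low bit of the offset is ignored.
    return BRANCH_TABLE[offset // 2]


if __name__ == "__main__":
    for op in (0x0C00, 0x0C06, 0x0C0E, 0x0C08, 0x0C40, 0x0C80, 0x0DC5):
        print(f">{op:04X} -> {classify_0cxx(op)}")
```

The offset of 6 added in the one-operand path skips the four zero-operand entries once the opcode bits have been shifted down, which is why both groups can share one table.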
speccery Posted January 25, 2018

Thanks pnr for the explanation! You've proceeded quickly with your analysis. So did I understand correctly that the 990/12 implemented these floating point instructions natively in hardware?

Again I'm thinking about an FPGA implementation of the 99000 here - the great thing is that there are standard floating point instructions then.
pnr Posted January 25, 2018

"So did I understand correctly that the 990/12 implemented these floating point instructions natively in hardware?"

The answer depends on what you mean by native.

Yes: the 990/12 had microcode for all these instructions, and also for the double precision variants (AD, SD, MD, etc.).

No: the 990/12 did not have specialized data paths to support floating point, and the microcode calculated the results using the normal 16 bit data path. Simply put, the microcode did the same operations as the macro code on a 99110. Of course, not having to fetch opcodes etc., it runs faster in microcode.

One could say this is "low end native". An example of high end native would be an FPU co-processor as existed for the PDP11: http://www.psych.usyd.edu.au/pdp-11/11_34_fpp.html

As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggests that TI was thinking about it at the time. Had the series been more successful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231: http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf
speccery Posted January 26, 2018

"As far as I know there never was such a specialized FPU for the TI990. The co-processor interface on the 99xxx CPU suggest that TI was thinking about it at the time. Had the series been more succesful we could have seen a 99-series FPU chip, perhaps with capabilities like the Intel 8231: http://www.cpu-galaxy.at/cpu/ram%20rom%20eprom/other_intel_chips/other_intel-Dateien/8231A_datasheet.pdf"

Interesting, didn't know of this chip before. Looks like a predecessor of the 8087 - same stack based idea for the operands.
pnr Posted January 26, 2018

Let's take a look at the remaining entry point, for opcodes in the >03xx range. It starts thus:

    ; Start of table entry 0806 (opcodes 03xx)
    ; Only >0301 (CR) and >0302 (MM) are valid on a 99110
    ;
    0B1C 0285  CI   R5, >0302      ; CR or MM opcode?
    0B1E 0302
    0B20 155F  JGT  >0BE0          ; no: test extension & exit
    0B22 1301  JEQ  >0B26          ; for CR clear R5 (as a flag)
    0B24 04C5  CLR  R5
    0B26 024F  ANDI R15, >07FF     ; clear status bits
    0B28 07FF
    0B2A C0BE  MOV  *R14+, R2      ; fetch second opcode word
    0B2C 0206  LI   R6, >0004      ; four byte operands
    0B2E 0004

It only accepts CR and MM and all other opcodes from the group are referred to the extension test. Later on we need an easy test for opcode CR versus MM, and R5 is cleared for this purpose. The status bits ST0-ST4 are cleared, as we saw with the 0Cxx opcodes. Then the second opcode word is fetched and R6 is preloaded with an auto-increment constant.

Next it prepares the source operand:

    0B30 C042  MOV  R2, R1         ; extract src bits
    0B32 0241  ANDI R1, >003F
    0B34 003F
    0B36 0101  EVAD R1             ; calculate src address
    0B38 1601  JNE  >0B3C          ; if Ts = 3, autoincrement src ptr
    0B3A A686  A    R6, *R10

This uses the EVAD instruction, which is discussed in data sheet section 7.3.3.5. This instruction takes a 6 bit operand field and calculates the actual address of the operand. If the modifier bits signify *Rn+ the EQ bit is set (for a source operand) and a pointer to Rn is loaded in R10. Because we are dealing with 32 bit operands the register is auto-incremented by 4 bytes.

It proceeds with preparing the destination operand:

    0B3C C008  MOV  R8, R0         ; save source address during 2nd EVAD
    0B3E 0242  ANDI R2, >0FC0      ; extract dst bits
    0B40 0FC0
    0B42 0102  EVAD R2
    0B44 1B04  JH   >0B4E          ; if Td = 3, autoincrement dst ptr
    0B46 C145  MOV  R5, R5         ; for MM, increment is 8
    0B48 1301  JEQ  >0B4C
    0B4A 0A16  SLA  R6, 1
    0B4C A646  A    R6, *R9

This code is pretty much the same.
For auto-increment in the destination field the L> status flag (ST0) is set and the associated pointer is in R9. Because MM has a 64 bit result, the auto-increment is upped to 8 bytes.

I'm not sure why the code uses two calls on EVAD, as this instruction can do both src and dst at the same time. If anybody sees a good reason for this, please post your observations.

With the operand access prepared the code moves on to actually fetch the operands:

    0B4E C085  MOV  R5, R2         ; move opcode to R2
    0B50 C200  MOV  R0, R8         ; restore source address
    0B52 C038  MOV  *R8+, R0       ; fetch S to R0,R1 and D to R4,R5
    0B54 C058  MOV  *R8, R1
    0B56 C117  MOV  *R7, R4
    0B58 C167  MOV  @2(R7), R5
    0B5A 0002

The source operand is fetched into R0,R1 and the destination into R4,R5. As we will see later, these registers are chosen for good reason. It does mean that we have to move the (first word of the) opcode out of the way to R2.

The choice to move R0 back to R8 is significant, because indirect addressing via R7/R8 generates special bus status codes: the CPU generates WS and DOP/SOP bus status codes. If Td/Ts was zero during EVAD a WS cycle is used and if Td/Ts was not zero a DOP/SOP cycle is used (see section 7.3.3.4). This way, external hardware cannot tell whether an instruction is a macro instruction or implemented in microcode. The data sheet is a bit vague, but there seems to be a mechanism that makes this work when two EVAD instructions are used as well.

Finally we get to execution:

    0B5C 0706  SETO R6             ; set the MM / CR flag
    0B5E C082  MOV  R2, R2         ; was opcode CR?
    0B60 1604  JNE  >0B6A
    0B62 0224  AI   R4, >8000      ; change sign of D
    0B64 8000
    0B66 0460  B    @>0830         ; perform CR = S+(-D) without store
    0B68 0830
    0B6A 0460  B    @>0900         ; perform MM
    0B6C 0900

The execution path of CR is partly shared with AR and that of MM with MR. Hence a flag (R6) is set to keep track of which paths to follow.
CR is evaluated by calculating S+(-D), and suppressing storage of the result - just the status bits are set.

Edited January 26, 2018 by pnr
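The operand handling for CR and MM can be illustrated with a small Python sketch. It only models what the post describes: which bits of the second opcode word are handed to EVAD, the auto-increment amounts, and the sign flip used for CR. The function names are hypothetical, and the split of each 6 bit field into a 2 bit T part and a 4 bit register number is an assumption based on the usual 99xx operand layout.

```python
def decode_field(field):
    """Split a 6 bit operand field into (T, register); usual 99xx layout assumed."""
    return (field >> 4) & 0x3, field & 0xF


def cr_mm_setup(opcode, word2):
    """Model of the operand preparation at >0B1C->0B4C (illustration only)."""
    is_mm = opcode == 0x0302                       # CR (>0301) clears R5 instead
    ts, s = decode_field(word2 & 0x003F)           # ANDI R1,>003F
    td, d = decode_field((word2 & 0x0FC0) >> 6)    # ANDI R2,>0FC0 (shift is sketch-only)
    step = 4                                       # LI R6,>0004
    return {
        "src": (ts, s),
        "dst": (td, d),
        "src_autoinc": step if ts == 3 else 0,     # A R6,*R10
        # SLA R6,1 doubles the step for MM before A R6,*R9
        "dst_autoinc": (2 * step if is_mm else step) if td == 3 else 0,
    }


def flip_sign_word(r4):
    """AI R4,>8000: adding >8000 modulo 2^16 toggles only the sign bit."""
    return (r4 + 0x8000) & 0xFFFF


if __name__ == "__main__":
    print(cr_mm_setup(0x0302, 0x0CB1))   # MM *R1+,*R2+ under the assumed layout
    print(cr_mm_setup(0x0301, 0x0CB1))   # CR with the same operands
    print(hex(flip_sign_word(0x4110)))
```

The sign flip works because the sign is the top bit of the first word of a real number, so CR can reuse the add path to compute S+(-D).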
pnr Posted January 26, 2018

And here is the analysis for MM. It starts with doing the 32x32 bit multiply:

    ; 32 x 32 => 64 bit multiply. S is R0,R1 and D is R4,R5
    ; result is in R0-R3
    ;
    ; used for both MM (R6!=0) and MR (R6==0)
    ; in case of MR it multiplies two 24 bit mantissas
    ;
    0900 C085  MOV  R5, R2         ; long multiply in four 16x16 bit steps
    0902 3881  MPY  R1, R2
    0904 C205  MOV  R5, R8
    0906 3A00  MPY  R0, R8
    0908 C284  MOV  R4, R10
    090A 3A81  MPY  R1, R10
    090C 3804  MPY  R4, R0
    090E 002A  AM   R10, R8        ; add the partial results
    0910 420A
    0912 1701  JNC  >0916
    0914 0580  INC  R0
    0916 002A  AM   R8, R1
    0918 4048
    091A 1701  JNC  >091E
    091C 0580  INC  R0

What the above code does is easier to understand if it is written out like a manual multiplication:

              -R0-.-R1-   = S
              -R4-.-R5-   = D
         ---------------x
              -R2-.-R3-   = RL = R1 x R5
         -R8-.-R9-.0000   = T1 = R0 x R5
         -RA-.-RB-.0000   = T2 = R4 x R1
    -R0-.-R1-.0000.0000   = RH = R4 x R0
    ===================+
    -R0-.-R1-.-R2-.-R3-   = R

In the above figure I've used RA for R10 and RB for R11 to keep alignment.

The last bit is nothing more than storing the result and setting the status flags:

    091E D186  MOVB R6, R6         ; is this a MR or MM instruction?
    0920 1607  JNE  >0930          ; jump if MM

    [ a little code for MR skipped ]

    0930 CDC0  MOV  R0, *R7+       ; MM: store 8 byte result in D
    0932 CDC1  MOV  R1, *R7+
    0934 CDC2  MOV  R2, *R7+
    0936 C5C3  MOV  R3, *R7
    0938 E001  SOC  R1, R0         ; if result is 0, set EQ flag
    093A E002  SOC  R2, R0
    093C E003  SOC  R3, R0
    093E 1602  JNE  >0944
    0940 026F  ORI  R15, >2000
    0942 2000
    0944 0380  RTWP                ; macro execution complete

That is one math operation out of the way.

Edited January 26, 2018 by pnr
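The four-step multiply from this post can be checked with a short Python model. It is a sketch for illustration, treating both operands as unsigned 32 bit values exactly as the listed code does; the function name mul32x32 and the variable names are hypothetical, chosen to follow the register names in the manual-multiplication figure.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32x32(s, d):
    """Unsigned 32x32 -> 64 bit multiply built from four 16x16 products."""
    r0, r1 = s >> 16, s & MASK16        # S in R0,R1
    r4, r5 = d >> 16, d & MASK16        # D in R4,R5

    rl = r1 * r5                        # MPY R1,R2   (R2 = R5)  -> R2,R3
    t1 = r0 * r5                        # MPY R0,R8   (R8 = R5)  -> R8,R9
    t2 = r4 * r1                        # MPY R1,R10  (R10 = R4) -> R10,R11
    rh = r4 * r0                        # MPY R4,R0              -> R0,R1

    r0, r1 = rh >> 16, rh & MASK16
    r2, r3 = rl >> 16, rl & MASK16

    t = t1 + t2                         # AM R10,R8
    if t > MASK32:                      # carry out: INC R0
        r0 = (r0 + 1) & MASK16
    t &= MASK32

    mid = ((r1 << 16) | r2) + t         # AM R8,R1 adds into R1,R2
    if mid > MASK32:                    # carry out: INC R0
        r0 = (r0 + 1) & MASK16
    mid &= MASK32
    r1, r2 = mid >> 16, mid & MASK16

    return (r0 << 48) | (r1 << 32) | (r2 << 16) | r3


if __name__ == "__main__":
    for a, b in [(3, 5), (0x12345678, 0x9ABCDEF0), (MASK32, MASK32)]:
        assert mul32x32(a, b) == a * b
        print(f"{a:#010x} x {b:#010x} = {mul32x32(a, b):#018x}")
```

Both carries end up in R0 because each 32 bit addition overflows into bit 48 of the final result, which is the lowest bit of R0.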
pnr Posted January 29, 2018

Time to work with floating point ("real") numbers. The simplest ones are CIR and CER, which convert a 16 bit or a 32 bit integer into a real number. The two instructions share nearly all of their code.

The TI990 floating point format is described in section B.4 of the data sheet (i.e. in the 99110 appendix). It is the IBM360 single precision format. In summary, a real number is expressed as:

    N = S x 0.MMMMMM x 16 ^ EE

S is the sign bit, M is a 'mantissa' of 6 hex digits (note: always unsigned) and EE an exponent with 7 bits, i.e. the exponent range is -64 to +63. The number is 'normalized' so that the first hex digit of the mantissa is always non-zero (this keeps accuracy to a maximum). This is achieved by shifting the mantissa the required number of hex digits and adjusting the exponent accordingly. The exponent is in "excess 64" format; this means that it has 64 added to it. In this way the range becomes 0 to 127 and we can work with the exponent as an unsigned number.

The code in the 99110 ROM often splits a real number into its component parts to work with them independently and recombines the components at the end of the calculation. Often the mantissa is calculated in more precision than 6 hex digits (= 24 bits) to reduce rounding errors.

Let's start with CIR. After the entry code that was analyzed earlier in this thread, it starts with:

    ; entry point for CIR
    ;
    0A80 C018  MOV  *R8, R0        ; fetch S and sign extend into R0,R1
    0A82 C040  MOV  R0, R1
    0A84 08F0  SRA  R0, 15

This fetches the 16 bit integer operand and sign-extends it to a 32 bit operand located in our local floating point accumulator, FPAC. The rest of the code can now be the same as for CER.
That instruction starts with:

    ; entry point for CER
    ;
    0A86 026F  ORI  R15, >1000     ; set C bit unconditionally
    0A88 1000
    0A8A C080  MOV  R0, R2         ; if S is zero, clear FPAC & finish
    0A8C E081  SOC  R1, R2
    0A8E 13A4  JEQ  >09D8

This sets the C bit unconditionally, which is how the data sheet specifies it. I'm not sure why this is useful: comments welcome. Then it special-cases a zero operand; we'll look at that further at the end.

Then we begin the conversion: the integer is separated into a sign bit and an unsigned number:

    0A90 C1C0  MOV  R0, R7         ; extract sign bit
    0A92 0247  ANDI R7, >8000
    0A94 8000
    0A96 1304  JEQ  >0AA0          ; if negative, negate the number
    0A98 0540  INV  R0
    0A9A 0501  NEG  R1
    0A9C 1701  JNC  >0AA0
    0A9E 0580  INC  R0

In effect we now have S in R7 and a (32 bit) mantissa in R0,R1. The exponent is implicitly 0. However, the number is not normalized, as the mantissa must be 0.MMMMMM, and it is now MMMMMMMM.0. Conceptually, this can easily be fixed by saying the radix point is not to the right of the mantissa, but to its left, and setting the exponent to +8. Including the excess-64 the exponent becomes 72, or >48 in hex:

    0AA0 0206  LI   R6, >0048      ; start exponent at +8
    0AA2 0048

We're still not done, because the integer number may have had leading zeros, and the mantissa must always start with a non-zero digit.
As the number cannot be zero (we excluded that case above), this can always be achieved by shifting the mantissa between 0 and 7 hex digits to the left and adjusting the exponent accordingly:

    0AA4 C000  MOV  R0, R0         ; if top word zero, shift 4 nibbles
    0AA6 1604  JNE  >0AB0
    0AA8 C001  MOV  R1, R0
    0AAA 04C1  CLR  R1
    0AAC 0226  AI   R6, -4         ; and adjust exponent accordingly
    0AAE FFFC
    0AB0 D000  MOVB R0, R0         ; if top byte zero, shift 2 nibbles
    0AB2 1603  JNE  >0ABA
    0AB4 001D  SLAM R0, 8
    0AB6 4200
    0AB8 0646  DECT R6             ; and adjust exponent accordingly
    0ABA C080  MOV  R0, R2         ; if top nibble is zero, shift one nibble
    0ABC 0242  ANDI R2, >F000
    0ABE F000
    0AC0 1603  JNE  >0AC8
    0AC2 001D  SLAM R0, 4
    0AC4 4100
    0AC6 0606  DEC  R6             ; and adjust exponent accordingly

After the above steps, we have the sign bit in R7, the mantissa in R0,R1 and the exponent in R6. The last step to make the real number is combining all component parts:

    0AC8 06C6  SWPB R6             ; merge exponent (R6), mantissa (R0,R1) and
    0ACA 001C  SRAM R0, 8          ; sign (R7) together
    0ACC 4200
    0ACE D006  MOVB R6, R0
    0AD0 E007  SOC  R7, R0

Note that the mantissa is shifted 8 bits to make room for the sign and exponent. This loses 8 bits of accuracy. The lost bits are truncated, i.e. the remaining 24 bit mantissa is not rounded up if the lost bits are above >80. Such rounding could have been achieved by adding >0000 0080 to the mantissa, using the AM instruction (the top *bit* of the mantissa will always be zero, thus this cannot overflow). However, the ROM is almost full and I don't think there is space left to add such rounding to all floating point instructions.

What remains is storing the number in the user's FPAC and setting the status bits appropriately:

    0AD2 10BB  JMP  >0A4A          ; store FPAC & finish
    ..
    0A4A 10B4  JMP  >09B4
    ..
    ; compare FPAC against zero & store result
    ;
    09B4 C000  MOV  R0, R0         ; test sign
    09B6 1105  JLT  >09C2          ; if negative only set L> bit
    09B8 1602  JNE  >09BE          ; if positive set L> and A> bits
    09BA C041  MOV  R1, R1         ; if zero only set EQ bit
    09BC 13F6  JEQ  >09AA
    09BE 026F  ORI  R15, >C000     ; set L> and A> status bits
    09C0 C000
    09C2 026F  ORI  R15, >8000     ; set L> status bit
    09C4 8000
    09C6 C740  MOV  R0, *R13       ; store FPAC
    09C8 CB41  MOV  R1, @2(R13)
    09CA 0002
    09CC 0380  RTWP                ; macro code complete

This bit of code is used at the end of nearly all floating point routines. Note that the macro entry code has already reset ST0-ST4 in R15, so only setting the right bits remains. The handling of a zero result is done in a separate routine.

The "result is zero" exit routine is also heavily used, including by the CIR and CER instructions (remember the test for zero at the start of that code):

    ..
    09D8 13E6  JEQ  >09A6
    ..
    ; clear FPAC, set EQ status bit & store
    ;
    09A6 04C0  CLR  R0
    09A8 04C1  CLR  R1
    09AA 026F  ORI  R15, >2000
    09AC 2000
    09AE 100B  JMP  >09C6          ; store FPAC & exit

That concludes the first two floating point instructions.

Edited January 29, 2018 by pnr
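For readers who want to experiment with the CIR/CER steps outside the ROM, here is a minimal Python sketch of the same idea: split off the sign, start the exponent at +8 (excess 64), normalize leftward and truncate to 24 bits. The function name `cer` is only borrowed from the instruction for illustration, and the loop shifts one hex digit at a time rather than using the ROM's 4/2/1 nibble steps, so treat it as a model of the result, not of the code path.

```python
def cer(value):
    """Convert a signed 32-bit integer to a 32-bit IBM360-style real.

    Illustrative model only: the mantissa is truncated, not rounded,
    matching the behaviour described in the post.
    """
    if value == 0:
        return 0                          # zero is special-cased
    sign = 0x80000000 if value < 0 else 0
    mant = abs(value) & 0xFFFFFFFF        # unsigned 32-bit magnitude
    exp = 0x48                            # +8, plus the excess 64
    while (mant & 0xF0000000) == 0:       # leading hex digit still zero?
        mant <<= 4                        # shift one hex digit left
        exp -= 1                          # and compensate in the exponent
    # keep the top 24 bits; the low 8 bits are simply dropped
    return sign | (exp << 24) | (mant >> 8)


if __name__ == "__main__":
    for n in (1, -1, 32768, 0x12345678):
        print(n, hex(cer(n)))
```

A 16 bit CIR operand can be passed to the same function, since a Python integer is already sign-extended.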
pnr Posted January 31, 2018

Today a look at three short routines, implementing STR, LR and NEGR, which store, load or negate the accumulator ("FPAC") respectively. The code is:

    ; entry point for LR
    ;
    08D2 C038  MOV  *R8+, R0       ; load S into local FPAC
    08D4 C058  MOV  *R8, R1
    08D6 1002  JMP  >08DC
    ; entry point for STR
    ;
    08D8 CE00  MOV  R0, *R8+       ; store FPAC into S
    08DA C601  MOV  R1, *R8
    08DC 0242  ANDI R2, >1800      ; C and AF status bits unaffected
    08DE 1800
    08E0 E3C2  SOC  R2, R15
    08E2 1068  JMP  >09B4          ; store result, set flags & finish
    ; code for NEGR
    ;
    08E4 0242  ANDI R2, >1800      ; C and AF status bits unaffected
    08E6 1800
    08E8 E3C2  SOC  R2, R15
    08EA C000  MOV  R0, R0         ; is FPAC zero?
    08EC 135C  JEQ  >09A6          ; yes, set EQ flag & finish
    08EE 0220  AI   R0, >8000      ; no, invert sign bit
    08F0 8000
    08F2 1060  JMP  >09B4          ; store result, set flags & finish

The code is really very simple and straightforward. The only special thing is that these instructions only affect status bits ST0-2, and hence ST3 and ST4 are restored from the copy of R15 that was made in the generic entry routine.

Next up will be a look at CRI and CRE, which share a lot of code and appear to have a few corner case bugs.
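The NEGR trick of flipping the sign by adding >8000 can be modelled in a couple of lines. This is a small sketch, assuming the FPAC is held as two 16 bit words; the name `negr` and the tuple return are just for illustration.

```python
def negr(hi, lo):
    """Negate a real held as two 16-bit words (hi holds sign and exponent)."""
    if hi == 0:
        return 0, 0                    # zero path: both words are cleared
    # adding 0x8000 modulo 2**16 toggles only the top (sign) bit
    return (hi + 0x8000) & 0xFFFF, lo
```

Note that, like the ROM code, the sketch only tests the high word for zero.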
pnr Posted February 3, 2018 (edited)

Conversion from floating point back to integers is done with CRI and CRE, for a 16 bit or 32 bit integer respectively. In principle this is just the reverse of CIR and CER that were analyzed above, but it is a bit more involved as the code has to check for overflow: the real number may be larger than what fits in the integer. In my view the code in the macro rom for CRI and CRE is a bit convoluted and borderline buggy, but maybe I don't understand the code right. Better insights are welcome.

The code for CRI and CRE starts with:

    ; CRI: convert real to integer
    ;
    09CE 04C8  CLR  R8
    09D0 1001  JMP  >09D4
    ; CRE: convert real to extended
    ;
    09D2 0708  SETO R8
    09D4 04C2  CLR  R2             ; prepare for 48 bit shift in R0,R1,R2
    09D6 C1C0  MOV  R0, R7         ; if FPAC is zero, nothing to do:
    09D8 13E6  JEQ  >09A6          ; store zero result & exit

CRI and CRE share most of their code, using R8 as a flag to keep track. Also, the case where the real number is zero is special cased so that the remaining code can assume that the number is in standard format. The register R2 is cleared, the reason for which becomes clear further below. The test for zero has the side effect of saving the sign bit in R7.

The next bit of code is also clear:

    09DA C180  MOV  R0, R6         ; separate mantissa
    09DC 7000  SB   R0, R0         ; and put exponent in R6
    09DE 06C6  SWPB R6
    09E0 0246  ANDI R6, >007F
    09E2 007F

It separates out the mantissa (into R0,R1) from the exponent (into R6) and the sign bit (already in R7).

Now the mantissa is in 0.MMMMMM format, and this must be converted to MMMMMMMM.0 format, i.e. the reverse operation of that in CIR and CER. This only works if the exponent is in the range +1 to +8 (= +65 to +72 including the excess 64). If the exponent is less than 1 the real number is between (and excluding) +1 and -1 and will be truncated to 0. If the exponent is larger than 8, the number does not fit in 32 bits.
This is all handled by the following code:

    09E4 0226  AI   R6, -65        ; is exponent at least 1?
    09E6 FFBF
    09E8 112D  JLT  >0A44          ; if less than 1, result is zero
    09EA 0506  NEG  R6             ; get 32 bit result in R1,R2
    09EC 0226  AI   R6, >0009      ; by shifting mantissa between
    09EE 0009                      ; 2 and 10 hex digits right.
    09F0 0606  DEC  R6
    09F2 1108  JLT  >0A04
    09F4 001C  SRAM R1, 4
    09F6 4101
    09F8 0A41  SLA  R1, 4
    09FA 001C  SRAM R0, 4
    09FC 4100
    09FE 0240  ANDI R0, >0FFF      ; (bug: superfluous?)
    0A00 0FFF
    0A02 10F6  JMP  >09F0
    0A04 C100  MOV  R0, R4         ; if exponent was >8, R4 will be non-zero

First it tests for an exponent less than 1 and returns a zero result if so. The test for +8 is skipped as this is handled in another way that will become clear shortly. Instead it calculates the number of places that the mantissa has to be shifted. It uses a 48 bit shift in R0-R1-R2, shifting the mantissa between 2 and 10 nibbles (hex digits) right. This leaves the mantissa in MMMMMMMM.0 format in R1,R2 and leaves R0 zero. Note that for a large number the rightmost 2 hex digits will be zero as the mantissa only has 6 hex digits.

The test for an exponent larger than 8 is implicit: the mantissa will be shifted 1 or 0 nibbles and R0 will not be zero. This fact is used later when the result is tested for being in range.

In the above code AND-ing out the top digit of R0 seems superfluous: the top byte has been set to zero when the exponent and sign were separated out and hence SRAM will always shift in zeroes. Perhaps this is a leftover from earlier code. I would have thought it more logical to leave the mantissa in R0,R1 and first shift it two places to the left, followed by 0 to 7 places to the right (i.e. the exact reverse of what is done in the CIR/CER code). This would have required a separate test for the exponent being out of range, but the code would still have been shorter and faster, I think. In that code structure the AND-ing out would have been necessary.

Next we come to handling the sign bit and range tests.
Here the code for CRI and CRE diverges again:

    0A06 C208  MOV  R8, R8         ; opcode was CRE or CRI?
    0A08 160D  JNE  >0A24
    0A0A C002  MOV  R2, R0         ; CRI: fit result in 16 bits
    0A0C C1C7  MOV  R7, R7         ; if real was negative, negate int
    0A0E 1501  JGT  >0A12          ; (bug: should jump to >0A18)
    0A10 0500  NEG  R0
    0A12 0282  CI   R2, >8000      ; value -32768 is okay
    0A14 8000
    0A16 1302  JEQ  >0A1C
    0A18 C082  MOV  R2, R2         ; check range -32767..+32767
    0A1A 11B0  JLT  >097C          ; -> report overflow (>0A20?)
    0A1C E101  SOC  R1, R4         ; number was >65535?
    0A1E 1314  JEQ  >0A48          ; no: store result (bug: should be >0A46)
    0A20 04C1  CLR  R1
    0A22 10AC  JMP  >097C          ; report overflow

First we test for a negative sign and negate the 16 bit integer as necessary. There is also a check for the value -32768, which is okay whereas +32768 is out of range. The jump instruction seems to be wrong and allows +32768 as well. This bug means that the real number +32768 is converted to the integer -32768 instead of being reported as an overflow error.

Next is the check that the (unsigned) mantissa was in the proper range of -32767 to +32767 and an overflow is reported if outside. Also if the mantissa was larger than 65535 or the exponent was larger than 8, an overflow error is reported.

A last bit of strangeness is the value of R1 upon return. The documentation is silent on what value R1 should have. In some cases it is set to zero, in other cases the absolute value of the number is left behind. Changing the destination address of one jump ensures that R1 is always set to zero.
The range check for CRE is similar (including bugs):

    0A24 C001  MOV  R1, R0         ; CRE: fit result in 32 bits
    0A26 C1C7  MOV  R7, R7         ; if real was negative, negate 32 bit
    0A28 1504  JGT  >0A32          ; (bug: should jump to >0A38)
    0A2A 0540  INV  R0
    0A2C 0502  NEG  R2
    0A2E 1701  JNC  >0A32
    0A30 0580  INC  R0
    0A32 0281  CI   R1, >8000      ; value -2147483648 is okay
    0A34 8000                      ; (note: test cannot be exact)
    0A36 1302  JEQ  >0A3C
    0A38 C041  MOV  R1, R1         ; check range -2147483647..+2147483647
    0A3A 1102  JLT  >0A40          ; -> report overflow
    0A3C C104  MOV  R4, R4         ; number was >4294967296?
    0A3E 1304  JEQ  >0A48          ; no: store result
    0A40 C042  MOV  R2, R1
    0A42 10EF  JMP  >0A22          ; report overflow

The code for handling the sign bit is a bit longer as it has to negate a 32 bit number. Again the jump for a positive number seems to be off, not skipping the test for -2147483648 as within range.

However, the test for -2147483648 is conceptually wrong: that number cannot be expressed accurately in a single precision floating point number: it requires 8 hex digits of accuracy and the IBM360 format only has 6. The result is that a number like -2.14750e9 (which is definitely out of range) is reported as okay. The mantissa for -2.14750e9 is >800040 and this ends up in R1,R2 as >80004000. After negating this becomes >7FFFC000 which is +2147467264. Something similar happens for +2.14750e9.

It would have been better to exclude trying to handle the -2147483648 case altogether and simply suffice with the -2147483647..+2147483647 range test (which due to the six digit accuracy is actually a test for -2147483392..+2147483392).
The last bit of code deals with clearing out the FPAC when the real number truncates to zero (as tested for at the start of the code) and setting the high word (R1) of FPAC as necessary:

    0A44 04C0  CLR  R0             ; clear FPAC
    0A46 04C2  CLR  R2
    0A48 C042  MOV  R2, R1         ; set high word of FPAC
    0A4A 10B4  JMP  >09B4          ; store result & exit

That only leaves the reporting of an overflow condition:

    ; overflow: set C and AF status bits & store result
    097C 026F  ORI  R15, >1800
    097E 1800
    0980 1019  JMP  >09B4          ; store FPAC & status bits

All it does is set the C and AF (arithmetic fault) status bits (the C bit indicates it is an overflow, not an underflow) and then perform a normal return. However, if the AFIE status bit (arithmetic fault interrupt enable) was also set, this means that immediately after the exit from macrocode a level 2 interrupt is generated. If the AFIE bit is not set, the user program must separately check for the AF error bit being set.

All in all, as I understand it, the code for CRI and CRE has two corner case bugs and looks a bit suspect in two other places. Perhaps it was written the day after the Christmas party. I wonder if the corner case bugs were known back in the day (perhaps the corner cases did not matter enough to be detected).

Edited February 3, 2018 by pnr
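To make the corner cases easier to check by hand, here is a short Python sketch. The first part redoes the -2.14750e9 arithmetic from the analysis above. The second part, `real_to_int`, is a hypothetical reference conversion (not the ROM algorithm) with a plain range test, which can be used to see what a bug-free CRI/CRE might be expected to report. It assumes the real is passed as a 32 bit integer in IBM360 layout.

```python
# Worked example: mantissa >800040 with exponent +8
mantissa = 0x800040
as_int = mantissa << 8                 # MMMMMMMM.0 form, i.e. >80004000
negated = (-as_int) & 0xFFFFFFFF       # 32-bit two's complement negate


def real_to_int(real, bits):
    """Truncate an IBM360-style real to a signed integer of `bits` bits.

    Hypothetical reference model; returns (value, overflow), with
    value set to None when the result does not fit.
    """
    sign = real >> 31
    exp = ((real >> 24) & 0x7F) - 64   # remove the excess 64
    mant = real & 0xFFFFFF
    if mant == 0 or exp < 1:
        return 0, False                # magnitude below 1 truncates to 0
    magnitude = (mant << (4 * exp)) >> 24   # 0.MMMMMM x 16^exp, truncated
    value = -magnitude if sign else magnitude
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if value < lowest or value > highest:
        return None, True
    return value, False


if __name__ == "__main__":
    print(hex(as_int), hex(negated), negated)
    print(real_to_int(0x44800000, 16))   # +32768 as a 16 bit result
    print(real_to_int(0xC4800000, 16))   # -32768 as a 16 bit result
```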
pnr Posted February 5, 2018

With all the supplementary operations out of the way, time to analyze the arithmetic floating point operations: MR, DR, AR and SR. First up is MR.

To understand the code, let's first look at the math involved. Suppose we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

    S1 x 0.M1 x 16 ^ E1   and   S2 x 0.M2 x 16 ^ E2

The product will be:

    S1 x 0.M1 x 16 ^ E1 x S2 x 0.M2 x 16 ^ E2

which is the same as:

    (S1 x S2) x (0.M1 x 0.M2) x (16 ^ E1 x 16 ^ E2)

which is the same as:

    (S1 x S2) x (0.M1 x 0.M2) x 16 ^ (E1 + E2)

This last formula is what the code calculates. The code begins with:

    ; entry point for MR
    ;
    08F4 C138  MOV  *R8+, R4       ; is multiplier equal to zero?
    08F6 1357  JEQ  >09A6          ; yes: set FPAC to zero & finish

This handles the case where the accumulator is multiplied by zero: the result is zero. Next comes a subroutine that handles the exponents and the sign bits:

    08F8 06A0  BL   @>0A4C         ; separate & add exponents
    08FA 0A4C
    08FC A187  DATA >A187          ; = "A R7, R6" (for MR add exponents)
    08FE FFC0  DATA -64            ; = subtract double excess 64

The subroutine is followed by two data words, which make it usable for both multiplication and division. The function of the data words will become clear when walking through the subroutine code.

THE SUBROUTINE

    ; subroutine for MR and DR: calculate result exponent and sign
    ;
    0A4C C000  MOV  R0, R0         ; is FPAC zero?
    0A4E 13C4  JEQ  >09D8          ; yes: set flags & finish
    0A50 C158  MOV  *R8, R5        ; fetch 2nd word of operand

The subroutine starts with a check for FPAC equalling zero (i.e. the multiplicand or the numerator is zero); in that case the result is zero too. Next, it fetches the second word of the operand which had not been fetched earlier. The code can now rely on both FPAC (R0,R1) and the operand (R4,R5) being in standard normalized format.
The first thing it does is separating the mantissa from the sign bits and exponents:

    0A52 C180  MOV  R0, R6         ; save exponents in R6 and R7
    0A54 C1C4  MOV  R4, R7
    0A56 7000  SB   R0, R0         ; remove exponents from mantissas
    0A58 7104  SB   R4, R4

The next thing is multiplying the sign bits:

    0A5A C207  MOV  R7, R8         ; figure out sign of result in R8
    0A5C 2A06  XOR  R6, R8

Multiplying two bits is the same as taking their exclusive OR. Note that the top bit in R8 will have the sign of the result, but the other 15 bits are not zero -- the other bits are meaningless to the multiplication. This is followed by placing the excess-64 exponents as proper integers in R6 and R7:

    0A5E 06C6  SWPB R6             ; place FPAC exponent in R6
    0A60 0246  ANDI R6, >007F
    0A62 007F
    0A64 06C7  SWPB R7             ; place operand exponent in R7
    0A66 0247  ANDI R7, >007F
    0A68 007F

Now we are ready to add the two exponents together (or subtract them for division). This is where the two data words that followed the subroutine call are used:

    0A6A 04BB  X    *R11+          ; MR: "A R7,R6", DR: "S R7,R6"
    0A6C A1BB  A    *R11+, R6      ; MR: -64, DR: +64
    0A6E 0286  CI   R6, >007F      ; exponent in range?
    0A70 007F
    0A72 15D7  JGT  >0A22          ; jump on overflow
    0A74 1B9D  JH   >09B0          ; jump on underflow

First it executes the instruction in the first data word. For MR this is "A R7,R6", which adds the exponents. However, by adding the exponents the excess 64 is now included twice and must be removed once. The next data word contains "-64", which is added to the exponents. The result is that the right excess-64 exponent is now in R6. This is followed by a range check. Here the utility of the excess-64 encoding becomes clear to see. I'm not sure why the jump to overflow at >097C is done via >0A22: the real target is (just) within range.
The subroutine finishes by merging the result sign bit back into the result exponent:

    0A76 0A18  SLA  R8, 1          ; put sign bit back in exponent
    0A78 1702  JNC  >0A7E
    0A7A 0226  AI   R6, >80
    0A7C 0080
    0A7E 045B  RT

END OF SUBROUTINE

Now we can go back to the main MR routine at >0900. This happens to be the 32 x 32 -> 64 bit multiplication routine that we already saw as part of the MM instruction:

    0900 C085  MOV  R5, R2         ; long multiply in four 16x16 bit steps
    0902 3881  MPY  R1, R2
    0904 C205  MOV  R5, R8
    0906 3A00  MPY  R0, R8
    0908 C284  MOV  R4, R10
    090A 3A81  MPY  R1, R10
    090C 3804  MPY  R4, R0
    090E 002A  AM   R10, R8        ; add the partial results
    0910 420A
    0912 1701  JNC  >0916
    0914 0580  INC  R0
    0916 002A  AM   R8, R1
    0918 4048
    091A 1701  JNC  >091E
    091C 0580  INC  R0

I'll not discuss it again, simply scroll up to the analysis of the MM instruction for detail on the above code. In essence it multiplies R0,R1 by R4,R5 leaving its result in R0..R3.

Next comes the bit of MR code that was skipped in the MM discussion. That code is:

    091E D186  MOVB R6, R6         ; is this a MR or MM instruction?
    0920 1607  JNE  >0930          ; jump if MM
    0922 D001  MOVB R1, R0         ; MR: prenormalize mantissa
    0924 06C0  SWPB R0
    0926 D042  MOVB R2, R1
    0928 06C1  SWPB R1
    092A 06C2  SWPB R2
    092C C0C6  MOV  R6, R3
    092E 10C2  JMP  >08B4

First the flag byte in the upper half of R6 is checked. For MR this will be zero, as the exponent cannot be larger than >007F. Next the code pre-normalizes the mantissa by moving it two hex digits (one byte) to the left. The simple way to think about this is that we are multiplying two 24 bit mantissas into a 48 bit result. We are only interested in the top 24 bits of that result and moving two digits to the left places these 24 bits in R0,R1 properly aligned for combination with the sign and exponent.
The more precise way to think about this is that we are doing fixed point arithmetic here, and that a six digit shift right is needed to keep the decimal point in the right place; shifting two hex digits to the left and taking the high two words is functionally the same (and leaves some extra digits available).

However, we are not done as it is possible that the first hex digit is still zero. This is easy to see when using two decimal examples:

    0.10 x 0.10 = 0.01   and   0.99 x 0.99 = 0.98

Even though we have kept the decimal point in the right place, the first digit can still be zero in some cases. To normalize this there is a routine that is shared by the other arithmetical operations. This routine expects the result sign/exponent in R3 and so it is moved there first. It also expects the next hex digit in the top of R2. The shared tail routine is:

    ; normalize FPAC mantissa (leftward)
    ;
    08B4 0280  CI   R0, >000F      ; is the highest nibble 0?
    08B6 000F
    08B8 1509  JGT  >08CC          ; no: mantissa is normalized
    08BA 24E0  CZC  @>0BD6, R3     ; exponent already 0?
    08BC 0BD6
    08BE 1378  JEQ  >09B0          ; yes: underflow
    08C0 0603  DEC  R3             ; reduce exponent & shift mantissa one nibble
    08C2 001D  SLAM R0, 4
    08C4 4100
    08C6 09C2  SRL  R2, 12         ; shift in one nibble extra precision
    08C8 A042  A    R2, R1
    08CA 10F4  JMP  >08B4
    ..
    0BD6 007F  DATA >007F          ; exponent bits
    ..

First it checks whether the first mantissa digit is zero. If not, the mantissa is already normalized. If it is, it checks the exponent. If that is already zero, the mantissa cannot be shifted further: it would require the exponent to be reduced by one, which puts it out of range (the excess-64 exponent would move from -64 to -65). In that case an underflow is reported. In the other case, the exponent is reduced and the mantissa shifted left by one. To keep accuracy, a 'spare' extra digit of precision kept in R2 is shifted in.
Because it is a common tail, the routine will check if further shifts are necessary, but in the case of MR it will only ever perform one shift. After that, only merging the result exponent back in remains:

    08CC 06C3  SWPB R3             ; merge exponent back in
    08CE D003  MOVB R3, R0
    08D0 1071  JMP  >09B4          ; store FPAC & set status bits

The code for underflow is simple, and very similar to the code for overflow:

    ; underflow: additionally set AF status bit
    09B0 026F  ORI  R15, >0800
    09B2 0800
    <continues with normal exit code at >09B4>

Underflow only sets the arithmetic fault (AF) status bit. This allows the user program to distinguish overflow (C bit also set) from underflow.
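The MR flow described above (XOR the signs, add the exponents and remove one excess 64, multiply the mantissas, then normalize by at most one hex digit) can be modelled compactly. The sketch below is an assumption-laden model, not a transcription: `split` and `mr` are illustrative names, reals are passed as 32 bit integers, and Python's big integers stand in for the four 16x16 MPY steps.

```python
def split(real):
    """Return (sign, excess-64 exponent, 24-bit mantissa) of a real."""
    return real >> 31, (real >> 24) & 0x7F, real & 0xFFFFFF


def mr(a, b):
    """Multiply two IBM360-style reals; returns (result, fault).

    fault is None, "overflow" or "underflow". Illustrative model only.
    """
    sa, ea, ma = split(a)
    sb, eb, mb = split(b)
    if ma == 0 or mb == 0:
        return 0, None                     # anything times zero is zero
    sign = sa ^ sb                         # multiplying signs is an XOR
    exp = ea + eb - 64                     # remove the doubled excess 64
    if exp > 0x7F:
        return None, "overflow"
    if exp < 0:
        return None, "underflow"
    prod = ma * mb                         # 48-bit product of the mantissas
    if (prod >> 44) == 0:                  # leading hex digit is zero
        if exp == 0:
            return None, "underflow"       # cannot reduce exponent further
        exp -= 1
        mant = (prod >> 20) & 0xFFFFFF     # shift in one spare digit
    else:
        mant = prod >> 24                  # top 24 bits, truncated
    return (sign << 31) | (exp << 24) | mant, None


if __name__ == "__main__":
    print(mr(0x41200000, 0x41300000))      # 2.0 x 3.0
```

Because both mantissas are normalized, the product can only ever have one leading zero digit, which is why a single conditional shift is enough here.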
pnr Posted February 11, 2018 (edited)

Next up is floating point division, the "DR" instruction. It has a clever algorithm, but also a strange bit in its implementation. First let's look at the algorithm it uses.

As with multiplication, we have two real numbers N1 and N2. In the IBM360 format these will be expressed as

    S1 x 0.M1 x 16 ^ E1   and   S2 x 0.M2 x 16 ^ E2

The division will be:

    (S1 x 0.M1 x 16 ^ E1) / (S2 x 0.M2 x 16 ^ E2)

which is the same as:

    (S1/S2) x (0.M1 / 0.M2) x (16 ^ E1 / 16 ^ E2)

which is the same as:

    (S1/S2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

which is the same as:

    (S1xS2) x (0.M1 / 0.M2) x 16 ^ (E1 - E2)

The subroutine that multiplies the signs and handles the subtraction of exponents is already there, as discussed above for multiplication. The problem is in dividing the mantissas. What is needed is 32 x 32 bit division and the 99000 only offers 32 x 16 bit division (the DIV instruction). It would be possible to write a routine to do 32 x 32 bit division from basics, but that would be a long and slow routine. Instead, it does something clever and uses the DIV instruction to the max.

In short it approximates the result by putting M1 in the 32 bit dividend and then divides by the top 16 bits of M2 (i.e. it truncates the last two hex digits of the divisor to zero). This already gives a result that is accurate to 3 or 4 hex digits. As the divisor is slightly too small, the result is slightly too large. It then subtracts a correction factor from the estimate that makes it accurate to 6 or 7 hex digits. As it turns out, the correction factor is fairly easy to calculate.

I'm not a mathematician, but I think the derivation of the correction factor is as follows. It is easiest to think about the problem in base 65536 numbers, i.e. a number system where there are 65536 different digits and "10" means 65536. As I don't have 65536 symbols available, I'll use [xxxx] as notation, where xxxx is 4 hex digits.
The division of M1 by M2 can then be expressed as dividing the two digit number AB by two digit number CD giving a two digit result EF:

    The dividend AB is a mantissa shifted 4 bits left, i.e. the range of AB is [0100][0000] to [0FFF][FFF0].
    The divisor CD is a mantissa shifted 8 bits left, i.e. the range of CD is [1000][0000] to [FFFF][FF00].
    The result is 0.EF, where the range of EF is [0100][0001] to [FFFF][FF00].

In the below, 100 and 10 are shorthand for [0001][0000][0000] and [0001][0000] respectively.

    [1] AB / CD = 0.EF

    [2] AB = 0.EF x CD
           = EF x 0.CD
           = (E x C) + (E x D/10) + (F x C/10) + (F x D/100)
           = C x (E + (E x D/10)/C + F/10 + (F x D/100)/C)
           = C x (E.F + E x (D/10C) + (F x D)/100C)

    [3] AB / C = E.F + E x (D/10C) + (F x D)/100C

        As C is at least [1000], the value of (F x D)/100C is at most
        [0000].[000F] and not significant:

        E x (D/10C) + (F x D)/100C ≈ E x (D/10C)

    [4] AB / C = E.F + E x (D/10C)
        AB / C - E x (D/10C) = E.F

    [5] Now define a first estimate E'F':

        AB / C = E'F'

        As E x (D/10C) is at most [000F], the difference between E and E'
        is at most [000F]. Calculating E x (D/10C) as E' x (D/10C) has an
        error of at most [0000].[000F] and this error is not significant.

    [6] Hence, AB/CD can be calculated with sufficient precision using:

        E'F' = AB / C
        T    = (D / C) x E'
        EF   = E'F' - T/10

Back to the simple terms, the initial estimate is AB / C and the correction factor is (D/C) x E' / 10.

With the mathematics and the algorithm out of the way, let's dive into the actual code. We'll see that doing the M1 / M2 division only takes 20 instructions, with no loops. The code starts out pretty much like the code for MR:

    ; entry point for DR
    ;
    0946 C138  MOV  *R8+, R4       ; if div-by-zero, report overflow
    0948 1319  JEQ  >097C
    094A 06A0  BL   @>0A4C         ; extract and subtract exponents
    094C 0A4C
    094E 6187  DATA >6187          ; = "S R7, R6"
    0950 0040  DATA 64             ; add back excess

First there is a check for a zero operand, and an overflow error is reported if it is.
Then the exponent subroutine is called to extract the signs and exponents and to calculate the result sign and exponent. For division, the first data word is "S R7, R6" as the exponents must now be subtracted. The second data word is +64: the exponent subtraction will cancel out the excess-64 part and this needs to be added back.

Next there is a range check:

    0952 8100  C    R0, R4         ; if dividend > divisor, result will be >1
    0954 1107  JLT  >0964
    0956 1502  JGT  >095C
    0958 8141  C    R1, R5
    095A 1A04  JL   >0964
    095C 0586  INC  R6             ; increase result exponent and test for
    095E 25A0  CZC  @>0BD6, R6     ; overflow (mantissa shift happens 992-99A)
    0960 0BD6
    0962 130C  JEQ  >097C

Depending on the values of the accumulator and the operand, the result can be larger than 1 (but no larger than 15 decimal) and because the normalized mantissa must be of the form 0.MMMMMM an additional mantissa shift and exponent update may be necessary.

Now the code starts to perform the actual division of M1 by M2. First M1 and M2 are positioned to make the algorithm work:

    0964 001D  SLAM R0, 4          ; align dividend & divisor for accuracy
    0966 4100
    0968 001D  SLAM R4, 8          ; make sure divisor larger than dividend
    096A 4204

The next step is to calculate the estimate result (which will already be accurate to some 3 hex digits):

    096C 3C04  DIV  R4, R0         ; calculate estimate E'F' = AB / C
    096E 04C2  CLR  R2             ; (using two steps of long division)
    0970 3C44  DIV  R4, R1

To get a 32 bit result, the remainder is divided by the divisor again, just as one would do in a manual long division. Note that R4 cannot be zero and that neither division can overflow (a remainder must necessarily be smaller than the divisor) and hence there are no checks for errors.

Next comes the calculation of the correction factor:

    0972 C245  MOV  R5, R9         ; now calculate error term: T = D / C x E'
    0974 0949  SRL  R9, 4          ; align C with AB (i.e. make D/C < 1)
    0976 04CA  CLR  R10
    0978 3E44  DIV  R4, R9         ; calc D / C
    097A 1903  JNO  >0982          ; always jump
    ...
    0982 3A40  MPY  R0, R9         ; calc T = E' x (D / C)
    0984 04C8  CLR  R8             ; align T/10 with E'F' and place into R8,R9
    0986 001D  SLAM R8, 4
    0988 4108
    098A 09CA  SRL  R10, 12
    098C A24A  A    R10, R9
    098E 0029  SM   R8, R0         ; now subtract error term from estimate
    0990 4008

First we make sure that D is smaller than C to prevent overflow (and three digits of accuracy are enough). Then it calculates D/C. As C cannot be zero the division must succeed, just as the earlier two DIV operations; no error checking is necessary.

Here we have some strangeness: despite the above, the code checks for overflow and jumps over the overflow exit code. There is no reason that the overflow code has to be located here: it is not necessary to bring jumps into range or something like that. Other than the programmer being confused, I see no reason for this jump in the code. Maybe I'm missing something, if so please post.

The code then proceeds to multiply by E' and finish the calculation of T. The range shift of >0974 is undone and by taking the high word T is effectively divided by 10-base-65536. As a last step the correction factor is subtracted from the first estimate, giving a result accurate to 6 or 7 hex digits.

After this, only combining the exponent, sign and "EF" into a normalized real number remains:

    0992 C200  MOV  R0, R8         ; normalize mantissa
    0994 09C8  SRL  R8, 12         ; one or two nibbles as needed
    0996 1302  JEQ  >099C
    0998 001C  SRAM R0, 4
    099A 4100
    099C 001C  SRAM R0, 4
    099E 4100
    09A0 06C6  SWPB R6             ; merge sign+exponent with mantissa
    09A2 D006  MOVB R6, R0
    09A4 1007  JMP  >09B4          ; compare FPAC against zero & store result

First it checks if the result of the mantissa divide was larger than 1: if this is the case, the top digit of EF will be non-zero. It then shifts by one or two hex digits to the right to create the normalized mantissa. No change to the exponent is necessary as one shift is merely compensating all the clever shifts we did at the start of the code (i.e. one shift puts the fixed point in the proper place). For the other shift, it has already made the required adjustment to the exponent at the start, see code at >095C. The last step is to merge in the sign/exponent byte and to jump to the standard exit routine.

All in all, TI has used a very clever and fast algorithm for floating point divide.

Edited February 13, 2018 by pnr
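The estimate-plus-correction scheme is easy to try out numerically. The sketch below follows the steps as I read them from the listing (AB = M1 shifted 4 bits, CD = M2 shifted 8 bits, E'F' from two chained divides by C, and T from D/C times E'), and compares against an exact big-integer quotient. The function names are made up for illustration, and Python's integer division stands in for the 32/16 bit DIV instruction, so this models the arithmetic only.

```python
def div_mantissas(m1, m2):
    """Approximate the 32-bit quotient EF of two 24-bit mantissas
    using only divisions by the 16-bit value C (illustrative model)."""
    ab = m1 << 4                       # dividend AB, top hex digit zero
    cd = m2 << 8                       # divisor CD, top hex digit non-zero
    c, d = cd >> 16, cd & 0xFFFF
    estimate = (ab << 16) // c         # E'F' = AB / C (two long-division steps)
    e1 = estimate >> 16                # E', high word of the estimate
    ratio = ((d >> 4) << 16) // c      # D / C, pre-shifted so that it is < 1
    t = (e1 * ratio) >> 12             # undo the pre-shift, keep T/10
    return estimate - t                # EF = E'F' - T/10


def exact_quotient(m1, m2):
    """Reference value of the same fixed-point quotient."""
    return (m1 << 36) // (m2 << 8)


if __name__ == "__main__":
    for m1, m2 in ((0x100000, 0x100000), (0x123456, 0x2000FF),
                   (0xFFFFFF, 0x1000FF), (0x100000, 0xFFFFFF)):
        approx, ref = div_mantissas(m1, m2), exact_quotient(m1, m2)
        print(hex(m1), hex(m2), hex(approx), hex(ref), approx - ref)
```

When the low byte of M2 is zero, D is zero and the estimate is already exact; otherwise the printed difference shows how much error is left after the correction.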
pnr Posted February 12, 2018 (edited)

The analysis of the DR instruction made me wonder about the speed of 99110 floating point operations. I haven't done any detailed cycle counts or run benchmark tests, but some rough scoping gives interesting results.

For floating point operations, the 99110 can always run at the full 6 MHz, as it is not dependent on slow external memory and wait states. I think the average floating point operation in that case takes around 70-80 microseconds. This equates to some 12-15 kFLOPS. This compares well with the FPU chips of the late seventies and early eighties. The three main choices in 1981 were the AMD9511/i8231 from 1978, the AMD9512/i8232 from 1979 and the i8087 from 1980/81. The 99110 is from 1981 as well.

http://www.cpushack.com/2010/09/23/arithmetic-processors-then-and-now/

The 9511 needs about 200 clock cycles for a floating point operation, or 100 microseconds when run at 2 MHz (which seems to have been the norm BITD). When run at its 3 MHz maximum it is around 70 microseconds. That the numbers are so similar is perhaps not surprising: the 9511 also has a 16 bit data path inside and would be executing similar algorithms. Running a custom designed microcode gives it an advantage in cycles, but the 99110 compensates for this with a high clock speed.

The 9512 also needs about 100 microseconds for multiply and divide, but addition/subtraction is sped up to about 50 microseconds. It can also do double precision floating point (i.e. a 64-bit format). This is much slower than single precision: operations take between 500 and 800 microseconds. I think this would be the same for the 99110, if one would code up double precision routines in a fast external macro rom. As the 9512 still has a 16 bit data path (17 bit actually, to deal with the 'hidden 1' bit of the IEEE format used), the similarity is again not surprising. So, at double precision the speed would only be some 2-3 kFLOPS.
The real difference comes with the 8087 FPU. This chip internally always works with 80 bit floating point numbers. It is also much faster: it has separate ALUs for the exponent and mantissa, with wide data paths for both (15 and 64 bits respectively). Its speed on single precision arithmetic is around 50 kFLOPS and on double precision it is around 30 kFLOPS. However, only limited quantities of this chip were available in 1981 and this is one of the reasons why the original PC had a socket for an 8087 but it was almost never filled.

My understanding is that all these chips were expensive. The 9511 and the 9512 were selling for between $50 and $100, and the 8087 well above that. If correct, the 99110 with a volume price of around $100 was good value. On the other hand, most applications back then did not need fast floating point.

Of course, the competing 16-bit processors (8086, Z8000 and 68000) could run floating point in software emulation about as fast as a 9512 or 99110 (when run at full clock speed with fast memory, all four were about equally fast). Viewed that way, the 99110 only has a convenience advantage.

A last consideration would have been the IBM360 format. Although popular in the 60's and early 70's, it was going out of fashion in the late 70's. The 9512 and the 8087 were much closer to the emerging IEEE floating point standard.

For comparison, a high-end IBM360 mainframe in the 1960's would do about 10 MFLOPS. The supercomputer of the 70's, the Cray-1, was rated at 160 MFLOPS (both numbers for single precision arithmetic).

Edited February 13, 2018 by pnr
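The rough throughput figures above follow from simple arithmetic, which can be redone with a few lines of Python. This is only a back-of-the-envelope helper with made-up function names; the inputs are the estimates quoted in the post, not measurements.

```python
def kflops(microseconds_per_op):
    """Operations per second, in thousands, for a given time per operation."""
    return 1000.0 / microseconds_per_op


def op_time_us(cycles, clock_mhz):
    """Time per operation in microseconds for a cycle count and clock speed."""
    return cycles / clock_mhz


if __name__ == "__main__":
    print(kflops(80), kflops(70))                  # 99110 estimate
    print(op_time_us(200, 2), op_time_us(200, 3))  # 9511 at 2 and 3 MHz
```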