Started by pnr, Jan 21 2018 4:30 PM

Posted Mon Feb 12, 2018 4:36 AM

Thanks pnr for very interesting and in-depth analysis!

Posted Wed Feb 14, 2018 6:08 PM

Last up is an analysis of AR, SR, and CR. All three share most of their code.

Although addition is perhaps conceptually the easiest operation, the code is surprisingly long and involved, as there are many cases to consider. As a result, floating point addition is not much faster than multiplication or division.

The main issue is that the mantissas of two floating point numbers can only be added together if their exponents are equal. If they are not equal, the smaller number must be denormalized to make the exponents equal:

0.1234 x 16^4 + 0.12 x 16^2 = 0.1234 x 16^4 + 0.0012 x 16^4 = 0.1246 x 16^4

If the exponents differ by 6 or more (the threshold the code below tests for), the smaller number becomes insignificant and is effectively zero.
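The alignment rule can be sketched in a few lines of Python. This is only an illustration, not a transcription of the macro ROM: mantissas are modelled as 24-bit integers (6 hex digits), the name `align_and_add` is made up for this post, and signs, the guard digit and normalization of the result are all ignored.

```python
def align_and_add(mant_a, exp_a, mant_b, exp_b):
    """Add two positive hex floats given as (24-bit mantissa, exponent) pairs."""
    # Make (mant_a, exp_a) the operand with the larger exponent.
    if exp_a < exp_b:
        mant_a, exp_a, mant_b, exp_b = mant_b, exp_b, mant_a, exp_a
    # Denormalize the smaller operand: one hex digit (4 bits) per exponent step.
    shift_digits = exp_a - exp_b
    mant_b >>= 4 * shift_digits
    return mant_a + mant_b, exp_a


# The example from the text: 0.1234 x 16^4 + 0.12 x 16^2
total, exponent = align_and_add(0x123400, 4, 0x120000, 2)
print(f"0.{total:06X} x 16^{exponent}")
```

Note that a shift of 6 or more digits pushes every digit of a 24-bit mantissa out on the right, so in this sketch the smaller operand then contributes nothing.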

The entry code for SR is as follows:

    ; entry point for SR
    ;
    0814 C138  MOV  *R8+, R4     ; fetch 1st word of S
    0816 136D  JEQ  >08F2        ; if S is zero, nothing to do
    0818 0224  AI   R4, >8000    ; flip sign bit
    081A 8000
    081C 1002  JMP  >0822        ; now handle as AR

It checks whether the operand is zero; if so, the accumulator already holds the right result. Otherwise it flips the sign bit of the operand and handles FPAC-S as FPAC+(-S).
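The sign flip uses an add rather than an XOR: adding >8000 to a 16-bit word toggles the top bit and leaves the other 15 bits alone, because the carry out of bit 15 is simply dropped. A quick Python check of that claim (the function name `flip_sign` is just for illustration):

```python
def flip_sign(word):
    """Model of AI R4, >8000 on a 16-bit register: add >8000 and wrap."""
    return (word + 0x8000) & 0xFFFF


# Adding >8000 modulo 2^16 is the same as toggling bit 15, for every word.
assert all(flip_sign(w) == (w ^ 0x8000) for w in range(0x10000))
```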

Next is the entry code for AR:

    ; entry point for AR
    ;
    081E C138  MOV  *R8+, R4     ; fetch 1st word of S
    0820 1368  JEQ  >08F2        ; if S is zero, nothing to do

It only checks whether the operand is zero, in which case FPAC already contains the result.

From here on, AR and SR have an identical code path.

    0822 C000  MOV  R0, R0       ; if FPAC is zero, S is the result
    0824 1603  JNE  >082C
    0826 C004  MOV  R4, R0       ; move S to local FPAC
    0828 C058  MOV  *R8, R1
    082A 1063  JMP  >08F2        ; store FPAC & set status bits
    ;
    082C 04C6  CLR  R6           ; clear flag (= store result)
    082E C158  MOV  *R8, R5      ; fetch 2nd word of S

The code first checks for another special case: if the accumulator is zero, the result is equal to the operand. If not, it enters the full calculation. It clears the CR flag (R6): at >0830 the code path for CR merges in (see entry code for CR discussed earlier), and the CR code path will separate towards the end of the algorithm.

Note that the CR code path does not have checks for either the accumulator or the operand being zero. Effectively a zero here is handled as meaning "+0.0 x 16^-64" and this will not lead to issues in the CR code path.

    ; CR jumps here (with R6 all ones = set status flags only)
    ;
    0830 04C2  CLR  R2           ; clear extra mantissa bits
    0832 C0C0  MOV  R0, R3       ; save exponents
    0834 C1C4  MOV  R4, R7
    0836 7000  SB   R0, R0       ; remove exponents from mantissas
    0838 7104  SB   R4, R4

As usual the code starts out separating the sign and exponent from the mantissa. R2 is prepared to hold an extra 'guard' digit of precision.
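For reference, here is the field split written out in Python. It assumes the usual IBM 360 single-precision layout (1 sign bit, 7-bit excess-64 exponent, 24-bit mantissa) spread over two 16-bit words; the function name `unpack` and the tuple it returns are my own choices, not anything from the ROM.

```python
def unpack(word0, word1):
    """Split a two-word hex float into (sign, biased exponent, 24-bit mantissa)."""
    sign = word0 >> 15                    # top bit of the first word
    exponent = (word0 >> 8) & 0x7F        # 7 bits, still excess-64 biased
    mantissa = ((word0 & 0x00FF) << 16) | word1
    return sign, exponent, mantissa


# >C112 >3400 is -0.1234 x 16^1 (biased exponent >41)
sign, exponent, mantissa = unpack(0xC112, 0x3400)
print(sign, hex(exponent), f"{mantissa:06X}")
```

The ROM gets the same three pieces with a register copy, `SB Rx, Rx` to zero the top byte, and the `SLA 1` / `SRL 9` pair shown in the next listings.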

Next the sign bit and exponent are separated for the accumulator:

    083A 0A13  SLA  R3, 1        ; is FPAC negative?
    083C 1702  JNC  >0842
    083E 06A0  BL   @>0AE6       ; yes: negate extended FPAC mantissa
    0840 0AE6
    0842 0993  SRL  R3, 9        ; FPAC exponent in R3

If FPAC is negative, the mantissa is negated. There is a subroutine for this, as the negation has to happen again when the result is converted back to standard IBM360 format. The subroutine is:

    ; subroutine to negate extended FPAC mantissa
    ;
    0AE6 0540  INV  R0
    0AE8 0541  INV  R1
    0AEA 0502  NEG  R2
    0AEC 1703  JNC  >0AF4
    0AEE 0581  INC  R1
    0AF0 1701  JNC  >0AF4
    0AF2 0580  INC  R0
    0AF4 045B  RT

Including R2 in the negation is superfluous at this point. Also note that with the sign/exponent removed, the mantissa has two extra hex digits on the left, and hence does not need to consider the >800000 overflow condition when negating.
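The subroutine is a three-word two's complement negation with the carry rippled by hand. A Python model of the same steps, as a sketch: the name `negate_extended` is mine, and the carry behaviour (NEG carries out only for a zero operand, INC carries out only when wrapping to zero) is my reading of how the JNC tests are meant to work.

```python
MASK16 = 0xFFFF


def negate_extended(r0, r1, r2):
    """Negate the 48-bit value r0:r1:r2, held as three 16-bit words."""
    r0 = ~r0 & MASK16              # INV R0
    r1 = ~r1 & MASK16              # INV R1
    r2 = -r2 & MASK16              # NEG R2
    if r2 == 0:                    # NEG carried out: ripple into R1
        r1 = (r1 + 1) & MASK16     # INC R1
        if r1 == 0:                # INC carried out: ripple into R0
            r0 = (r0 + 1) & MASK16 # INC R0
    return r0, r1, r2


# Cross-check against plain integer negation modulo 2^48.
for value in (0, 1, 0x10000, 0x001234000000, 0x00120000FFFF):
    words = ((value >> 32) & MASK16, (value >> 16) & MASK16, value & MASK16)
    r0, r1, r2 = negate_extended(*words)
    assert (r0 << 32 | r1 << 16 | r2) == (-value) & (2**48 - 1)
```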

Next, the sign and exponent of the operand are separated:

    0844 0A17  SLA  R7, 1        ; is S negative?
    0846 1704  JNC  >0850
    0848 0544  INV  R4           ; yes: negate S mantissa
    084A 0505  NEG  R5
    084C 1701  JNC  >0850
    084E 0584  INC  R4
    0850 0997  SRL  R7, 9        ; S exponent in R7

Here too the mantissa is negated if the operand is negative, but this time it happens inline because it does not need to be reversed later.

With both mantissas now held as signed values, the code considers the exponents and the relative size of the accumulator and the operand:

    0852 C247  MOV  R7, R9       ; compare exponents
    0854 6243  S    R3, R9
    0856 1319  JEQ  >088A        ; if equal, directly add the mantissas
    0858 0289  CI   R9, >0006    ; S much larger than FPAC?
    085A 0006
    085C 110F  JLT  >087C
    085E C0C7  MOV  R7, R3       ; if FPAC is insignificant, result is S
    0860 04C0  CLR  R0
    0862 04C1  CLR  R1
    0864 1012  JMP  >088A
    ...
    087C 0509  NEG  R9           ; FPAC much larger than S?
    087E 0289  CI   R9, >0006
    0880 0006
    0882 11F6  JLT  >0870
    0884 C1C3  MOV  R3, R7       ; if S is insignificant, result is FPAC
    0886 04C4  CLR  R4
    0888 04C5  CLR  R5

The code first handles the three easy cases: exponents equal, S dominates and FPAC dominates. If the exponents are equal, there is nothing to do. If S is 6 or more hex digits larger than FPAC, FPAC is effectively zero and the exponent of S becomes the exponent of the result. If FPAC is 6 or more hex digits larger than S, S is effectively zero and the exponent of FPAC becomes the exponent of the result.

The complex case is handled by a clever loop that shifts either FPAC or S into place. The loop code is entered in the middle (>0870):

    ; denormalize & align smallest mantissa
    ;
    0866 C085  MOV  R5, R2       ; shift S one nibble right
    0868 0AC2  SLA  R2, 12
    086A 001C  SRAM R4, 4
    086C 4104
    086E 0587  INC  R7           ; and adjust exponent
    0870 81C3  C    R3, R7       ; exponents equal?
    0872 130B  JEQ  >088A        ; yes: add mantissas
    0874 15F8  JGT  >0866        ; exp FPAC > exp S?
    0876 06A0  BL   @>0AD4       ; no: shift FPAC one nibble right
    0878 0AD4                    ; and adjust exponent
    087A 10FA  JMP  >0870

The loop compares the exponents and if they have become equal (which they must within 6 shifts), the work is done and we proceed with the actual addition at >088A. If S is the smaller number, the loop runs from >0866 to >0874 and shifts the operand into place (keeping one guard digit in R2). If FPAC is the smaller number, the loop runs from >0870 to >087A and shifts the accumulator into place (again keeping one guard digit in R2).

The accumulator shift is used again later in the algorithm and hence lives in a subroutine:
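In Python the loop looks roughly like this. It is a simplified model under a few assumptions of mine: mantissas are plain (signed) Python integers rather than register pairs, `guard` stands in for the nibble that ends up at the top of R2, the easy cases have already been filtered out so the exponents are fewer than 6 apart, and the exponent range check done by the FPAC shift subroutine is left out. The name `align` is hypothetical.

```python
def align(fpac, exp_fpac, s, exp_s):
    """Shift the operand with the smaller exponent right until exponents match.

    Returns (fpac, s, common_exponent, guard), where guard is the last
    hex digit that was shifted out on the right.
    """
    guard = 0
    while exp_fpac != exp_s:
        if exp_fpac > exp_s:
            guard = s & 0xF       # nibble about to fall off the right end
            s >>= 4               # arithmetic shift, like SRAM R4, 4
            exp_s += 1
        else:
            guard = fpac & 0xF
            fpac >>= 4            # same shift for the accumulator
            exp_fpac += 1
    return fpac, s, exp_fpac, guard


# S is 5 hex digits smaller than FPAC: 0.123400 becomes 0.000001 with guard 2.
fpac, s, exponent, guard = align(0x100001, 0x40, 0x123400, 0x3B)
print(f"{fpac:06X} {s:06X} {exponent:02X} guard={guard:X}")
```

Only the most recently shifted-out digit survives, which matches the single guard digit kept in R2.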

    ; subroutine to (de)normalize FPAC mantissa
    ; to the right one hex digit (nibble)
    ;
    0AD4 C081  MOV  R1, R2       ; shift extended mantissa one nibble
    0AD6 0AC2  SLA  R2, 12
    0AD8 001C  SRAM R0, 4
    0ADA 4100
    0ADC 0583  INC  R3           ; adjust exponent
    0ADE 24E0  CZC  @>0BD6, R3   ; exponent in range?
    0AE0 0BD6
    0AE2 139F  JEQ  >0A22        ; no: overflow
    0AE4 045B  RT

At this point, the range check on the exponent is superfluous, as the exponent must be in range (because S is in range).

With both numbers properly aligned, we can do the actual addition. At this point, the code for CR takes its own path again:

    088A C186  MOV  R6, R6       ; was opcode CR, or AR/SR?
    088C 1307  JEQ  >089C
    088E 002A  AM   R4, R0       ; CR: add mantissas & return status bits
    0890 4004
    0892 02CA  STST R10
    0894 024A  ANDI R10, >E000   ; mask out L>, A>, EQ status bits
    0896 E000
    0898 E3CA  SOC  R10, R15
    089A 0380  RTWP              ; macro processing complete

For the CR instruction, we add the mantissas and only look at the status bits (L>, A> and EQ) and return those to the user routine. No result is stored back to the user accumulator.

For AR and SR, there is more work to do:

    089C 002A  AM   R4, R0       ; add mantissas
    089E 4004
    08A0 1325  JEQ  >08EC        ; if zero, clear FPAC & finish
    08A2 1504  JGT  >08AC        ; if negative,
    08A4 06A0  BL   @>0AE6       ; negate extended mantissa
    08A6 0AE6
    08A8 0263  ORI  R3, >0080    ; and flip sign bit
    08AA 0080
    08AC D000  MOVB R0, R0       ; if mantissa too large
    08AE 1302  JEQ  >08B4
    08B0 06A0  BL   @>0AD4       ; normalize it rightward one nibble
    08B2 0AD4
    [JMP to >08CC seems missing]

Again, the two mantissas are added. If the result is zero, FPAC is cleared (the normalized version of zero) and the status bits are set accordingly.

If the result is negative, the result mantissa is negated back to positive (note this time negating the guard digit as well is not superfluous) and the sign bit is set accordingly.

If the result has one more hex digit (i.e. something like 0.800000 + 0.A00000 = 1.200000), the mantissa is normalized one hex digit to the right (note that in this case, the range check is not superfluous). As the mantissa must now be in normalized form, the code could proceed to merging in the sign/exponent byte. However, it drops into the code for another check.
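The carry-out case is easy to try in Python. This sketch works on unsigned 24-bit mantissas only; the name `add_and_renormalize` is invented, and treating anything above >7F as exponent overflow is my assumption about what the range check is testing (the mask at >0BD6 is not shown in this post).

```python
def add_and_renormalize(mant_a, mant_b, exponent):
    """Add two aligned 24-bit mantissas and fix up a carry into a 7th hex digit."""
    total = mant_a + mant_b
    if total > 0xFFFFFF:           # same condition as a non-zero top byte in R0
        total >>= 4                # normalize rightward one nibble
        exponent += 1
        if exponent > 0x7F:        # assumed limit of the 7-bit exponent field
            raise OverflowError("exponent overflow")
    return total, exponent


# 0.800000 + 0.A00000 = 1.200000, which becomes 0.120000 with the exponent bumped
total, exponent = add_and_renormalize(0x800000, 0xA00000, 0x40)
print(f"{total:06X} {exponent:02X}")
```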

It is possible in addition that several hex digits cancel out, and that there are a lot of leading zeroes in the result mantissa. An example would be:

0.123456 - 0.123400 = 0.000056

This must be normalized to 0.56x16^-4. In this case no precision is lost. However, with a denormalized number one guard digit is required:

0.100001 - 0.123400x16^-5 = 0.100001 - 0.000001(2) = 0.0FFFFF(E)

This must be normalized to 0.FFFFFEx16^-1 and in this case we need the guard digit shifted in. If I'm not mistaken only one guard digit can possibly shift in, and hence that is all we have in R2.
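The guard digit example can be reproduced with ordinary integers by carrying a seventh hex digit along. This is a conceptual sketch in sign-magnitude terms, not how the ROM does it (the ROM works on negated two's complement mantissas with the guard in R2); the name `sub_with_guard` is made up for this post.

```python
def sub_with_guard(mant_a, mant_b, shift_digits):
    """Compute a - b on 7-digit values (6 mantissa digits + 1 guard digit).

    mant_b belongs to the operand whose exponent is shift_digits smaller.
    """
    a7 = mant_a << 4                          # guard digit of a is 0
    b7 = (mant_b << 4) >> (4 * shift_digits)  # align b, keep one extra digit
    return a7 - b7


# 0.100001 - 0.123400 x 16^-5: 0x1000010 - 0x0000012 = 0x0FFFFFE
diff = sub_with_guard(0x100001, 0x123400, 5)
mantissa, guard = diff >> 4, diff & 0xF
print(f"0.{mantissa:06X}({guard:X})")
```

Without the seventh digit the subtraction would be 0x100001 - 0x000001 = 0x100000, and the trailing E of 0.FFFFFE would be lost.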

In code, this leads to the following:

    ; normalize FPAC mantissa (leftward)
    ;
    08B4 0280  CI   R0, >000F    ; is the highest nibble 0?
    08B6 000F
    08B8 1509  JGT  >08CC        ; no: mantissa is normalized
    08BA 24E0  CZC  @>0BD6, R3   ; exponent already 0?
    08BC 0BD6
    08BE 1378  JEQ  >09B0        ; yes: underflow
    08C0 0603  DEC  R3           ; reduce exponent & shift mantissa one nibble
    08C2 001D  SLAM R0,4
    08C4 4100
    08C6 09C2  SRL  R2, 12       ; shift in guard digit
    08C8 A042  A    R2, R1
    08CA 10F4  JMP  >08B4
    ;
    08CC 06C3  SWPB R3           ; merge exponent back in
    08CE D003  MOVB R3, R0
    08D0 1071  JMP  >09B4        ; store FPAC & set status bits

This code was discussed before in the post on multiplication: this tail is shared between AR, SR and MR.
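As a recap, the leftward normalization loop modelled in Python. It is a sketch with assumptions: the mantissa is a non-zero 24-bit integer, `guard` is a single hex digit, the exponent is the biased 7-bit value, and "exponent already 0" is taken literally as the underflow condition. The name `normalize_left` is hypothetical.

```python
def normalize_left(mantissa, guard, exponent):
    """Shift left until the top hex digit is non-zero; the guard digit enters once."""
    while mantissa < 0x100000:            # highest nibble is still 0
        if exponent == 0:
            raise ArithmeticError("exponent underflow")
        exponent -= 1
        mantissa = (mantissa << 4) | guard
        guard = 0                         # after the first shift only zeros come in
    return mantissa, exponent


# 0.000056 normalizes to 0.560000 x 16^-4; 0.0FFFFF(E) to 0.FFFFFE x 16^-1
print(normalize_left(0x000056, 0x0, 0x40))
print(normalize_left(0x0FFFFF, 0xE, 0x40))
```

Clearing `guard` after the first pass mirrors the listing: the second `SRL R2, 12` shifts the already-used digit out, so R2 contributes zero from then on.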

I wonder if the AR/SR code is the shortest possible. It would seem that the checks for zero accumulator and operand are for performance only, as the rest of the algorithm would seem to work for AR/SR just as it does for CR. Also, maybe it is faster to operate on mantissas shifted one hex digit to the left; this still leaves one "overflow digit" to the left, but makes room to include the guard digit on the right. Finally, the range check could be taken out of the "shift right" subroutine and moved to immediately after the second subroutine call.

That completes our tour of the 99110 macrorom: there is no other code left to discuss.
