While looking for something else the other night, I stumbled on this Wikipedia article that describes Quarter Square Multiplication. I had never seen this algorithm before. For a machine like the Intellivision, it appears to be a perfect fit!
I won't repeat the Wikipedia article here, but the short version: With a few additions, subtractions and table lookups, you can perform arbitrary multiplication wicked fast. For an 8-bit x 8-bit multiplication (16-bit result), we're talking ~70 cycles. For 16x16 multiplication (also 16-bit result), we're talking ~302 cycles.
It didn't take me very long to code it up at all. The algorithm is that simple. It just needs a 511 word lookup table.
Here's the implementation of both 8x8 => 16 and 16x16 => 16 multiplies.
; Quarter Square Multiplication ; Assembly code by Joe Zbiciak, 2015 ; Released to public domain. QSQR8_TBL PROC DECLE $0000, $0000, $0001, $0002, $0004, $0006, $0009, $000C DECLE $0010, $0014, $0019, $001E, $0024, $002A, $0031, $0038 DECLE $0040, $0048, $0051, $005A, $0064, $006E, $0079, $0084 DECLE $0090, $009C, $00A9, $00B6, $00C4, $00D2, $00E1, $00F0 DECLE $0100, $0110, $0121, $0132, $0144, $0156, $0169, $017C DECLE $0190, $01A4, $01B9, $01CE, $01E4, $01FA, $0211, $0228 DECLE $0240, $0258, $0271, $028A, $02A4, $02BE, $02D9, $02F4 DECLE $0310, $032C, $0349, $0366, $0384, $03A2, $03C1, $03E0 DECLE $0400, $0420, $0441, $0462, $0484, $04A6, $04C9, $04EC DECLE $0510, $0534, $0559, $057E, $05A4, $05CA, $05F1, $0618 DECLE $0640, $0668, $0691, $06BA, $06E4, $070E, $0739, $0764 DECLE $0790, $07BC, $07E9, $0816, $0844, $0872, $08A1, $08D0 DECLE $0900, $0930, $0961, $0992, $09C4, $09F6, $0A29, $0A5C DECLE $0A90, $0AC4, $0AF9, $0B2E, $0B64, $0B9A, $0BD1, $0C08 DECLE $0C40, $0C78, $0CB1, $0CEA, $0D24, $0D5E, $0D99, $0DD4 DECLE $0E10, $0E4C, $0E89, $0EC6, $0F04, $0F42, $0F81, $0FC0 DECLE $1000, $1040, $1081, $10C2, $1104, $1146, $1189, $11CC DECLE $1210, $1254, $1299, $12DE, $1324, $136A, $13B1, $13F8 DECLE $1440, $1488, $14D1, $151A, $1564, $15AE, $15F9, $1644 DECLE $1690, $16DC, $1729, $1776, $17C4, $1812, $1861, $18B0 DECLE $1900, $1950, $19A1, $19F2, $1A44, $1A96, $1AE9, $1B3C DECLE $1B90, $1BE4, $1C39, $1C8E, $1CE4, $1D3A, $1D91, $1DE8 DECLE $1E40, $1E98, $1EF1, $1F4A, $1FA4, $1FFE, $2059, $20B4 DECLE $2110, $216C, $21C9, $2226, $2284, $22E2, $2341, $23A0 DECLE $2400, $2460, $24C1, $2522, $2584, $25E6, $2649, $26AC DECLE $2710, $2774, $27D9, $283E, $28A4, $290A, $2971, $29D8 DECLE $2A40, $2AA8, $2B11, $2B7A, $2BE4, $2C4E, $2CB9, $2D24 DECLE $2D90, $2DFC, $2E69, $2ED6, $2F44, $2FB2, $3021, $3090 DECLE $3100, $3170, $31E1, $3252, $32C4, $3336, $33A9, $341C DECLE $3490, $3504, $3579, $35EE, $3664, $36DA, $3751, $37C8 DECLE $3840, $38B8, $3931, $39AA, $3A24, $3A9E, $3B19, $3B94 DECLE $3C10, $3C8C, $3D09, $3D86, $3E04, $3E82, $3F01, $3F80 DECLE $4000, $4080, $4101, $4182, $4204, $4286, $4309, $438C DECLE $4410, $4494, $4519, $459E, $4624, $46AA, $4731, $47B8 DECLE $4840, $48C8, $4951, $49DA, $4A64, $4AEE, $4B79, $4C04 DECLE $4C90, $4D1C, $4DA9, $4E36, $4EC4, $4F52, $4FE1, $5070 DECLE $5100, $5190, $5221, $52B2, $5344, $53D6, $5469, $54FC DECLE $5590, $5624, $56B9, $574E, $57E4, $587A, $5911, $59A8 DECLE $5A40, $5AD8, $5B71, $5C0A, $5CA4, $5D3E, $5DD9, $5E74 DECLE $5F10, $5FAC, $6049, $60E6, $6184, $6222, $62C1, $6360 DECLE $6400, $64A0, $6541, $65E2, $6684, $6726, $67C9, $686C DECLE $6910, $69B4, $6A59, $6AFE, $6BA4, $6C4A, $6CF1, $6D98 DECLE $6E40, $6EE8, $6F91, $703A, $70E4, $718E, $7239, $72E4 DECLE $7390, $743C, $74E9, $7596, $7644, $76F2, $77A1, $7850 DECLE $7900, $79B0, $7A61, $7B12, $7BC4, $7C76, $7D29, $7DDC DECLE $7E90, $7F44, $7FF9, $80AE, $8164, $821A, $82D1, $8388 DECLE $8440, $84F8, $85B1, $866A, $8724, $87DE, $8899, $8954 DECLE $8A10, $8ACC, $8B89, $8C46, $8D04, $8DC2, $8E81, $8F40 DECLE $9000, $90C0, $9181, $9242, $9304, $93C6, $9489, $954C DECLE $9610, $96D4, $9799, $985E, $9924, $99EA, $9AB1, $9B78 DECLE $9C40, $9D08, $9DD1, $9E9A, $9F64, $A02E, $A0F9, $A1C4 DECLE $A290, $A35C, $A429, $A4F6, $A5C4, $A692, $A761, $A830 DECLE $A900, $A9D0, $AAA1, $AB72, $AC44, $AD16, $ADE9, $AEBC DECLE $AF90, $B064, $B139, $B20E, $B2E4, $B3BA, $B491, $B568 DECLE $B640, $B718, $B7F1, $B8CA, $B9A4, $BA7E, $BB59, $BC34 DECLE $BD10, $BDEC, $BEC9, $BFA6, $C084, $C162, $C241, $C320 DECLE $C400, $C4E0, $C5C1, $C6A2, $C784, $C866, $C949, $CA2C DECLE $CB10, $CBF4, $CCD9, $CDBE, $CEA4, $CF8A, $D071, $D158 DECLE $D240, $D328, $D411, $D4FA, $D5E4, $D6CE, $D7B9, $D8A4 DECLE $D990, $DA7C, $DB69, $DC56, $DD44, $DE32, $DF21, $E010 DECLE $E100, $E1F0, $E2E1, $E3D2, $E4C4, $E5B6, $E6A9, $E79C DECLE $E890, $E984, $EA79, $EB6E, $EC64, $ED5A, $EE51, $EF48 DECLE $F040, $F138, $F231, $F32A, $F424, $F51E, $F619, $F714 DECLE $F810, $F90C, $FA09, $FB06, $FC04, $FD02, $FE01 ENDP ; R2 = R0 * R1, where R0 and R1 are unsigned 8-bit values ; Destroys R1 qs_mpy8 PROC MOVR R0, R2 ; 6 ADDR R1, R2 ; 6 a + b SUBR R0, R1 ; 6 a - b BPL @@ok ; 7 NEGR R1 ; 6 @@ok: ADDI #QSQR8_TBL, R2 ; 8 ADDI #QSQR8_TBL, R1 ; 8 MVI@ R2, R2 ; 8 SUB@ R1, R2 ; 8 JR R5 ; 7 ;---- ; 70 ENDP ; R1 = R0 * R1, where R0 and R1 are 16-bit values ; destroys R0, R2, R3, R4, R5 qs_mpy16 PROC PSHR R5 ; 9 MVII #QSQR8_TBL, R5 ; 8 MOVR R0, R2 ; 6 R2 is orig 16-bit a MOVR R1, R3 ; 6 R3 is orig 16-bit b ; lo * lo ANDI #$FF, R0 ; 8 R0 is lo(a) MOVR R0, R4 ; 6 R4 is lo(a) ANDI #$FF, R1 ; 8 R1 is lo(b) PSHR R1 ; 9 save lo(b) ADDR R1, R4 ; 6 R4 = lo(a) + lo(b) SUBR R0, R1 ; 6 R1 = lo(a) - lo(b) BPL @@pos_ll ; 7 NEGR R1 ; 6 @@pos_ll: ADDR R5, R4 ; 6 R4 = &qstbl[lo(a)+lo(b)] ADDR R5, R1 ; 6 R1 = &qstbl[lo(a)-lo(b)] MVI@ R4, R4 ; 8 R4 = qstbl[lo(a)+lo(b)] SUB@ R1, R4 ; 8 R4 = lo(a)*lo(b) ;---- ; 113 ; lo * hi SWAP R3 ; 6 \_ R3 = hi(b) ANDI #$FF, R3 ; 8 / MOVR R0, R1 ; 6 R0 = R1 = lo(a) ADDR R3, R1 ; 6 R1 = hi(b) + lo(a) SUBR R0, R3 ; 6 R3 = hi(b) - lo(a) BPL @@pos_lh ; 7 NEGR R3 ; 6 @@pos_lh: ADDR R5, R1 ; 6 R1 = &qstbl[hi(b)+lo(a)] ADDR R5, R3 ; 6 R3 = &qstbl[hi(b)-lo(a)] MVI@ R1, R1 ; 8 R1 = qstbl[hi(b)-lo(a)] SUB@ R3, R1 ; 8 R1 = lo(a)*hi(b) ;---- ; 73 ; 113 (carried forward) ;---- ; 186 ; hi * lo SWAP R2 ; 6 ANDI #$FF, R2 ; 8 R2 = hi(a) PULR R3 ; 11 R3 = lo(b) MOVR R3, R0 ; 6 R0 = lo(b) ADDR R2, R3 ; 6 R3 = hi(a) + lo(b) SUBR R0, R2 ; 6 R2 = hi(a) - lo(b) BPL @@pos_hl ; 7 NEGR R2 ; 6 @@pos_hl: ADDR R5, R3 ; 6 R3 = &qstbl[hi(a)+lo(b)] ADDR R5, R2 ; 6 R2 = &qstbl[hi(a)-lo(b)] ADD@ R3, R1 ; 8 \_ R1 = lo(a)*hi(b) + hi(a)*lo(b) SUB@ R2, R1 ; 8 / ;---- ; 84 ; 186 (carried forward) ;---- ; 270 SWAP R1 ; 6 \_ shift upper product left 8 ANDI #$FF00, R1 ; 8 / ADDR R4, R1 ; 6 final product PULR PC ; 12 ;---- ; 32 ; 270 (carried forward) ;---- ; 302 ENDP
I spent maybe 20 minutes on the 8x8 implementation, and another 40 - 60 minutes on the 16x16 implementation. (I first did a naive implementation that called qs_mpy8, and then a smarter implementation that inlined everything.) Tonight, I ran both through literally millions of random tests, comparing against JLP's multiply accelerator.
(Side note: If you run jzIntv with rate control off (-r0), audio off (-a0), and minimize the window, it runs very, very fast. About 280x on my machine. 270x if you forget and leave audio on.)
As the comment says, I release this to the public domain. If you use this in one of your games / programs / whatever, I don't mind a tip of the hat my way if you happen to think of it. And if not, eh, we still all get more cool games.
If any of you find ways to improve this code (I'm sure it's possible), post it here and share with everyone!
Homework assignment for the motivated: Add an 8x16 multiply to the library. With all three, then programs like IntyBASIC have a nice arsenal for implementing multiplication efficiently, with no surprises for oddball multiplicands. (That's presuming IntyBASIC can figure out if the arguments to the multiply come from 8 bit variables or 16-bit variables/intermediate results.) I think adding an 8x16 multiply mainly requires removing the 'hl' portion of the 16x16 multiply code and a PSHR, saving ~93 cycles.
Edited by intvnut, Tue Mar 31, 2015 10:55 PM.