**Fast multiplication**(original article in Polish disk magazine "Syzygy" no. 6)

The problem of multiplication on the 6502 is as old as this processor.

The algorithms that can be found in various asm courses base on bit shifting

and addition, and need about 100 cycles for calculating product of two

8-bit numbers. Such algorithms are not fast at all.

There is an effective solution to this problem: create 2-dimensional array

(multiplication table) that contains all the possible results.

Unfortunately not always one can afford 16 KB memory for this.

What I suggest is an intermediate solution: implement multiplication

using an addition and a function "calculated" using a lookup table.

Consider the following formulas:

(1) a*b = exp( log(a)+log(b) )

(2) a*b = ( cos(x+y)+cos(x-y) )/2

where: x = arc cos (a), y = arc cos (b)

(3) a*b = ( (a+b)^2-a^2-b^2 )/2

(4) a*b = ( (a+b)^2-(a-b)^2 )/4

Which formula to choose? The best would be one:

- simple to calculate,

- true for any a and b,

- that not requires high precision,

- uses a lookup table easy to generate.

These conditions are fulfilled best by the formula (4).

The calculation is very simple:

a*b = f(a+b)-f(a-b)

where: f(x) = (x*x)/4

At first glance, one thinks that you have to store

the fractional part of (x*x)/4 in the lookup table

in order to get an exact result. This is not the case,

because:

1) for even x, x = 2*k :

f(2*k) = k*k

- the result is an integer

2) for odd x, x = 2*k+1

f(2*k+1) = k*k+k+1/4

- the result is an integer + 0.25.

However, if a+b is odd, then a-b is odd too.

The result of f(a+b)-f(a-b) would be therefore same,

no matter if f() includes this 0.25 or not.

Calculating successive squares is easy once you remember

that:

1*1 = 1 = 0+1

2*2 = 4 = 0+1+3

3*3 = 9 = 0+1+3+5

...

That is, you only need to sum successive odd numbers.

The algorithm is general enough to calculate products

of signed numbers.

There is no universal "fastest multiplication" routine,

because universal routines are not fast and vice-versa.

When implementing your own fast multiplication routine

you should consider the following things:

- how big are numbers which you want to multiplicate

- are the numbers signed or not

- where the factors are stored and where the result

should be stored

- what precision you want - for example, you may need

just the high byte of the product and +-1 is acceptable

But to not make this article too theoretical, I present

two routines:

- MULU8 - multiplicates unsigned numbers in A and X registers

- MULS8 - multiplicates signed numbers in A and X

A few words about the routines:

- both routines calculate sums using 6502's indexed addressing mode.

Unfortunately -A needs to calculated separately, which can be done

with EOR #$ff and adding one. Instead of adding 1, we use a separate

lookup table for (x+1)^2.

- for MULU8 one lookup table contains squares of 0..510 and the other

one -255..255

- for MULS8 both tables contain squares of -255..255

- for MULS8 $80 needs to be added to both input numbers, using EOR #$80.

This eliminates the ambiguity that arises when adding signed numbers,

for example:

$60+$60=$c0

-$40+0=$c0

Instead, we have:

$e0+$e0=$1c0 (positive $c0)

$40+$80=$c0 (negative -$40)

If the numbers are small (precisely: abs(A)+abs(X)<128), the EOR #$80

are redundant, but you need to use different lookup tables.

**mulu8.asm**:

* Unsigned multiplication
* A*X -> A:Y (A-high byte, Y-low)
* b. Fox/Tqa
opt 21
org $480
sq1l equ $a000
sq1h equ $a200
sq2l equ $a400
sq2h equ $a600
jsr init
lda #123
ldx #234
sta l1+1
sta h1+1
eor #$ff
sta l2+1
sta h2+1
sec
l1 lda sq1l,x
l2 sbc sq2l,x
tay
h1 lda sq1h,x
h2 sbc sq2h,x
brk
brk
brk
init ldx #0
stx sq1l
stx sq1h
stx sq2l+$ff
stx sq2h+$ff
ldy #$ff
msq1 txa
lsr @
adc sq1l,x
sta sq1l+1,x
sta sq2l-1,y
sta sq2l+$100,x
lda #0
adc sq1h,x
sta sq1h+1,x
sta sq2h-1,y
sta sq2h+$100,x
inx
dey
bne msq1
msq2 tya
sbc #0 -
ror @
adc sq1l+$ff,y
sta sq1l+$100,y
lda #0
adc sq1h+$ff,y
sta sq1h+$100,y
iny
bne msq2
rts
end

**muls8.asm**:

* Signed multiplication
* A*X -> A:Y (A-high byte, Y-low)
* b. Fox/Tqa
opt 21
org $480
sq1l equ $a000
sq1h equ $a200
sq2l equ $a400
sq2h equ $a600
jsr init
lda <123
ldx <0-45
eor #$80
sta l1+1
sta h1+1
eor #$ff
sta l2+1
sta h2+1
txa
eor #$80
tax
sec
l1 lda sq1l,x
l2 sbc sq2l,x
tay
h1 lda sq1h,x
h2 sbc sq2h,x
brk
brk
brk
init ldx #0
stx sq1l+$100
stx sq1h+$100
stx sq2l+$ff
stx sq2h+$ff
ldy #$ff
msqr txa
lsr @
adc sq1l+$100,x
sta sq1l,y
sta sq1l+$101,x
sta sq2l-1,y
sta sq2l+$100,x
lda #0
adc sq1h+$100,x
sta sq1h,y
sta sq1h+$101,x
sta sq2h-1,y
sta sq2h+$100,x
inx
dey
bne msqr
rts
end