# 6502 Killer hacks

144 replies to this topic

### #26 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Jul 21, 2005 3:09 PM

Hello,

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

The beq done and bcs done are optional, one or both might save cycles but I haven't tested to see if it actually decreases or increases cycles on average.
```; x and a contain multiplicands, result in a

stx temp1
sta temp2
lda #0
rept
lsr temp2
bcc skip
clc
;bcs done might save cycles?
skip
;beq done might save cycles?
asl temp1
bne rept
done
```
I tried to work up a good division routine too, but all I could come up with is successive subtraction, which is sub-optimal for small denominator values. Anyone know of a better way?
```; x=numerator a=denominator, result in x
lsr
beq end;div by 0 = bad, div by 1=no calc needed, so bail out
rol;restore a
sta temp1
txa
ldx #\$ff
sec
loop
sbc temp1
inx
bcs loop
end; result in x
```

### #27 Thomas JentzschOFFLINE

Thomas Jentzsch

Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

• 23,940 posts
• Always left from right here!
• Location:Düsseldorf, Germany, Europe, Earth

Posted Thu Jul 21, 2005 3:28 PM

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

Edited by Thomas Jentzsch, Thu Jul 21, 2005 3:38 PM.

### #28 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Jul 21, 2005 3:56 PM

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

How big is the table?

### #29 supercatOFFLINE

supercat

• 6,401 posts

Posted Thu Jul 21, 2005 4:22 PM

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. The 6502's lack of add-without-carry is somewhat irksome, but there are a variety of workarounds that could be used.
```; Compute mul1*mul2+acc -> acc:mul1 [mul2 is unchanged]

ldx #8
dec mul2
lp:
lsr
ror mul1
bcc nope
nope:
dex
bne lp
inc mul2
```

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. I'll try to work one up for you.

Edited by supercat, Thu Jul 21, 2005 4:35 PM.

### #30 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Jul 21, 2005 4:33 PM

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. I'll have to see if I can work one up that deals with the 6502's "fun" add with carry behavior.

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. Again, I'll try to work one up for you.

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space. It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

But don't let this stop you, as someone may find it useful, sometime, somewhere...

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

### #31 supercatOFFLINE

supercat

• 6,401 posts

Posted Thu Jul 21, 2005 4:42 PM

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space.  It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

Well, I just snuck 8x8->16 in up above. I would think that for your fixed-point maths you'd want an 8x8->16 multiply. As noted, it's pretty easy.

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

Here's a simple divide routine off the top of my head, but it probably isn't quite right. I'll have to simulate and see if I got it correct. It should divide A:quot by denom, leaving the quotient in quot and the remainder in A.
```  ldx #8
lp:
cmp denom
bcc toosmall
sbc denom   ; Note: Carry is, and will remain, set.
rol quot
rol
dex
bne lp
beq done
toosmall:
rol quol
rol
dex
bne lp
done:
```

### #32 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Jul 21, 2005 4:51 PM

Well, I just snuck 8x8->16 in up above.  I would think that for your fixed-point maths you'd want an 8x8->16 multiply.  As noted, it's pretty easy.

Wow, I had no idea it was so simple. I'll go ahead and use it, and just return the lower byte in the result, but leave a note in the docs on how to get the upper byte for integer multiplication if someone really needs it in bB!

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

Edited by batari, Thu Jul 21, 2005 4:53 PM.

### #33 retroconOFFLINE

retrocon

Chopper Commander

• 119 posts

Posted Thu Jul 21, 2005 4:58 PM

As for division, the normal approach is to do a shift-and-subtract.  A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits.  I'll try to work one up for you.

Supercat,

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5? Table not an option of course.

"This is often easier to understand when expressed in binary.

1/5 equals 0.00110011001100110011 etc.

0.00110011 etc equals 11 multiplied by 0.000100010001 etc.

0.000100010001 etc equals 10001 multiplied by 0.0000000100000001 etc.

This suggests the following result (in C syntax):

Y1 = (X + X<<1)

Y2 = (Y1 + Y1<<4)

Y3 = (Y2 + Y2<<

and we have a result accurate to 16 bits with only three adders.

Y4 = (Y3 + Y3<<16)

32 bits with only four adders. "

[EDIT] Somehow I think this is related to this algorithm here, but I'm not seeing it quite yet: http://www.sxlist.co...th/divconst.htm

Edited by retrocon, Thu Jul 21, 2005 5:27 PM.

### #34 supercatOFFLINE

supercat

• 6,401 posts

Posted Thu Jul 21, 2005 7:07 PM

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

That's what the "dec mul2" and "inc mul2" instructions are for. Other ways of dealing with the issue would be to subtract mul1 from the accumulator before the multiplication (which would work nicely for the low-half, but correctly dealing with the high half would be more work), inverting mul1 beforehand and then reversing the sense of the branch (adds two bytes/two cycles at the spot where mul1 gets loaded, but makes the routine more 'opaque'), or adding a "clc" before the "adc" [which costs one byte instead of four, but sixteen cycles instead of ten].

### #35 supercatOFFLINE

supercat

• 6,401 posts

Posted Thu Jul 21, 2005 7:28 PM

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5?

Basically, you're replacing division by a constant with multiplication by a constant. This is an optimization approach which goes back at least to the 1960's, and probably even earlier. It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation. One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).
```FastMult:; Multiplies n1 by n2.  Requires that n1+n2 be 255 or less, and n1>=n2.
lda n1
clc
tax
lda n1
sec
sbc n2
tay
lda squareL,x
sbc squareL,y
sta resultL
lda squareH,x
sbc squareH,y
sta resultH
```
Fourty-two cycles for an 8x8->16 multiply, subject to the restrictions noted above. Considerably faster than the shift-add method, but at the expense of 512 bytes worth of tables. The tables should be computed such that ([SquareH+n]:[SquareL+n]) equals int(n*n/4).

### #36 djmipsOFFLINE

djmips

Dragonstomper

• Topic Starter
• 634 posts
• scrolling
• Location:Seattle

Posted Thu Jul 21, 2005 10:26 PM

Basically, you're replacing division by a constant with multiplication by a constant.  This is an optimization approach which goes back at least to the 1960's, and probably even earlier.  It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation.  One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).

goes back to at least the 1960's

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

The formula is:
ab = [(a+b)^2 - (a-b)^2]/4

Babylonian Mathematics

There are also the useful: sqrt(a^2 + b) approx = a + b/2a

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd at C=Hacking (noted in old Stella post) and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

Fast Signed Multiply

```;        Multiply Y * A and result in A

STA ZP1	;ZP1 -- zero page pointer to table of g(x)
EOR #\$FF
CLC
STA ZP2	;ZP2 also points to g(x)
LDA (ZP1),Y;g(Y+A)
SEC
SBC (ZP2),Y;g(Y-A)

```

### #37 supercatOFFLINE

supercat

• 6,401 posts

Posted Thu Jul 21, 2005 11:24 PM

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

Yeah, well that's even further before my time than the 1960's.

BTW, I developed an interesting method for doing long division which would be really cool if someone hadn't patented the same thing shortly before (and unbeknownst to me). It would be especially useful for performing really huge divisions on machines with a fast multiply but slow divide.

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd  at C=Hacking (noted in old Stella post)  and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

Cute bit of code. Looks like it should be extendable to produce 16-bit results also.

### #38 Atari-JessOFFLINE

Atari-Jess

River Patroller

• 4,243 posts
• Location:Toronto

Posted Fri Jul 22, 2005 4:25 AM

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

### #39 mos6507ONLINE

mos6507

River Patroller

• 4,916 posts

Posted Tue Aug 30, 2005 6:22 PM

That interview must be from about 10 years ago. He's talking about Super Nintendo and the Pentium Pro like they are current products.

### #40 djmipsOFFLINE

djmips

Dragonstomper

• Topic Starter
• 634 posts
• scrolling
• Location:Seattle

Posted Tue Nov 8, 2005 9:39 PM

Some code courtesy of

Lee Davison 6502 code 'shorts'

Range test. By Lee Davison

For all of these we assume that the byte to be tested is in A and that the start and end values, n and m, are already defined. Also that 0 < n < m < \$FF.
If you don't need to preserve the byte in A then testing the byte can be done in five bytes and only six cycles. This sets the carry if A is in the range n to m.

```	CLC	; clear carry for add
ADC	#\$FF-m; make m = \$FF
ADC	#m-n+1; carry set if in range n to m
```

### #41 supercatOFFLINE

supercat

• 6,401 posts

Posted Tue Nov 8, 2005 11:08 PM

Some code courtesy of

Lee Davison 6502 code 'shorts'

I like his hex-digit conversion routine. Saves a byte over the one I normally use if the state of carry on entry is unknown. I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that.

BTW, one thing that's often useful is the "cookie cutter" operation. For example, to store bits 3-5 of the accumulator into a memory location while leaving the other bits untouched:
```  eor dest
and #\$38; bits 3-5
eor dest
sta dest
```
If you want to read bits 3-5 of a memory location into the corresponding bits of the accumulator, you may similarly do:
```  eor src
and #\$C7 ; not bits 3-5
eor src
```
but there's a very important proviso here: src MUST NOT change between the two 'eor' instructions. Any bit of src which changes between those instructions will cause the corresponding bit in the accumulator to be toggled if not in the range of bit 3..5.

### #42 vdub_bobbyOFFLINE

vdub_bobby

• 5,832 posts
• Boom bam.
• Location:Seattle, WA

Posted Thu Nov 10, 2005 2:00 PM

Here's some interesting code from batari:

This:
```  ldx #56
loop
;do some stuff here
txa
sec
sbc #14
tax
bne loop```
Can be rewritten like this:
```  ldx #56
loop
;do some stuff here
txa
sbx #14
bne loop```

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

### #43 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Nov 10, 2005 3:24 PM

Here's some interesting code from batari:

This:

```  ldx #56
loop
;do some stuff here
txa
sbx #14
bne loop```

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

The comment was misleading. The loop shown is executed 4 times, but an outer, unseen loop was executed 10 times for a total of 40. So 160 cycles would be correct.

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

### #44 Thomas JentzschOFFLINE

Thomas Jentzsch

Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

• 23,940 posts
• Always left from right here!
• Location:Düsseldorf, Germany, Europe, Earth

Posted Thu Nov 10, 2005 3:34 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

### #45 vdub_bobbyOFFLINE

vdub_bobby

• 5,832 posts
• Boom bam.
• Location:Seattle, WA

Posted Thu Nov 10, 2005 3:53 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

Agreed.

### #46 mojofltrOFFLINE

mojofltr

River Patroller

• 2,632 posts

Posted Thu Nov 10, 2005 3:59 PM

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

Yikes! I couldn't agree more! The guy sounds like he drank a couple of bottles of Robitussin before the interview! Still, when he isn't talking, the video is quite interesting.

### #47 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Nov 10, 2005 5:04 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

Edited by batari, Thu Nov 10, 2005 5:05 PM.

### #48 vdub_bobbyOFFLINE

vdub_bobby

• 5,832 posts
• Boom bam.
• Location:Seattle, WA

Posted Thu Nov 10, 2005 5:21 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

EDIT: Sure as heck doesn't! Wonder why...

Edited by vdub_bobby, Thu Nov 10, 2005 5:24 PM.

### #49 batariOFFLINE

batari

)66]U('=I;B\$*

• 6,680 posts
• begin 644 contest

Posted Thu Nov 10, 2005 6:07 PM

Hmmm...I wonder if Go Fish! 1K will run in PCAE?  I never even bothered to check...

EDIT: Sure as heck doesn't!  Wonder why...

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

### #50 vdub_bobbyOFFLINE

vdub_bobby

• 5,832 posts
• Boom bam.
• Location:Seattle, WA

Posted Thu Nov 10, 2005 6:17 PM

Hmmm...I wonder if Go Fish! 1K will run in PCAE?  I never even bothered to check...

EDIT: Sure as heck doesn't!  Wonder why...

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

Actually...I don't know what is going on. For some reason, the file menu is showing 3 different gofish1k.bin files, each of which behaves differently. There is only one file by that name in the directory...

And for kicks, none of the three behaves like it does in z26. Really weird.

#### 0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users