6502 Killer hacks

+batari · July 21, 2005

Hello,

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

The beq done and bcs done are optional, one or both might save cycles but I haven't tested to see if it actually decreases or increases cycles on average.

; x and a contain multiplicands, result in a

stx temp1
sta temp2
lda #0
rept
lsr temp2
bcc skip
clc
adc temp1
;bcs done might save cycles?
skip
;beq done might save cycles?
asl temp1
bne rept
done

I tried to work up a good division routine too, but all I could come up with is successive subtraction, which is sub-optimal for small denominator values. Anyone know of a better way?

; x=numerator a=denominator, result in x
lsr
beq end;div by 0 = bad, div by 1=no calc needed, so bail out
rol;restore a
sta temp1
txa 
ldx #$ff
sec
loop
sbc temp1
inx
bcs loop
end; result in x

Thomas Jentzsch · July 21, 2005

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

Edited July 21, 2005 by Thomas Jentzsch

+batari · July 21, 2005

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

895999[/snapback]

How big is the table?

supercat · July 21, 2005

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. The 6502's lack of add-without-carry is somewhat irksome, but there are a variety of workarounds that could be used.

; Compute mul1*mul2+acc -> acc:mul1 [mul2 is unchanged]

 ldx #8
 dec mul2
lp:
 lsr
 ror mul1
 bcc nope
 adc mul2
nope:
 dex
 bne lp
 inc mul2

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. I'll try to work one up for you.

Edited July 21, 2005 by supercat

+batari · July 21, 2005

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. I'll have to see if I can work one up that deals with the 6502's "fun" add with carry behavior.

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. Again, I'll try to work one up for you.

896028[/snapback]

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space. It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

But don't let this stop you, as someone may find it useful, sometime, somewhere...

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

supercat · July 21, 2005

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space. It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

Well, I just snuck 8x8->16 in up above. I would think that for your fixed-point maths you'd want an 8x8->16 multiply. As noted, it's pretty easy.

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

Here's a simple divide routine off the top of my head, but it probably isn't quite right. I'll have to simulate and see if I got it correct. It should divide A:quot by denom, leaving the quotient in quot and the remainder in A.

 ldx #8
lp:
 cmp denom
 bcc toosmall
 sbc denom   ; Note: Carry is, and will remain, set.
 rol quot
 rol
 dex
 bne lp
 beq done
toosmall:
 rol quol
 rol
 dex
 bne lp
done:

+batari · July 21, 2005

Well, I just snuck 8x8->16 in up above. I would think that for your fixed-point maths you'd want an 8x8->16 multiply. As noted, it's pretty easy.

Wow, I had no idea it was so simple. I'll go ahead and use it, and just return the lower byte in the result, but leave a note in the docs on how to get the upper byte for integer multiplication if someone really needs it in bB!

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

Edited July 21, 2005 by batari

retrocon · July 21, 2005

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. I'll try to work one up for you.

896028[/snapback]

Supercat,

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5? Table not an option of course.

"This is often easier to understand when expressed in binary.

1/5 equals 0.00110011001100110011 etc.

0.00110011 etc equals 11 multiplied by 0.000100010001 etc.

0.000100010001 etc equals 10001 multiplied by 0.0000000100000001 etc.

This suggests the following result (in C syntax):

Y1 = (X + X<<1)

Y2 = (Y1 + Y1<<4)

Y3 = (Y2 + Y2<<

and we have a result accurate to 16 bits with only three adders.

Y4 = (Y3 + Y3<<16)

32 bits with only four adders. "

[EDIT] Somehow I think this is related to this algorithm here, but I'm not seeing it quite yet: http://www.sxlist.com/techref/method/math/divconst.htm

Edited July 21, 2005 by retrocon

supercat · July 22, 2005

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

896042[/snapback]

That's what the "dec mul2" and "inc mul2" instructions are for. Other ways of dealing with the issue would be to subtract mul1 from the accumulator before the multiplication (which would work nicely for the low-half, but correctly dealing with the high half would be more work), inverting mul1 beforehand and then reversing the sense of the branch (adds two bytes/two cycles at the spot where mul1 gets loaded, but makes the routine more 'opaque'), or adding a "clc" before the "adc" [which costs one byte instead of four, but sixteen cycles instead of ten].

supercat · July 22, 2005

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5?

Basically, you're replacing division by a constant with multiplication by a constant. This is an optimization approach which goes back at least to the 1960's, and probably even earlier. It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation. One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).

FastMult:; Multiplies n1 by n2.  Requires that n1+n2 be 255 or less, and n1>=n2.
 lda n1
 clc
 adc n2
 tax
 lda n1
 sec
 sbc n2
 tay
 lda squareL,x
 sbc squareL,y
 sta resultL
 lda squareH,x
 sbc squareH,y
 sta resultH

Fourty-two cycles for an 8x8->16 multiply, subject to the restrictions noted above. Considerably faster than the shift-add method, but at the expense of 512 bytes worth of tables. The tables should be computed such that ([squareH+n]:[squareL+n]) equals int(n*n/4).

djmips · July 22, 2005

Basically, you're replacing division by a constant with multiplication by a constant. This is an optimization approach which goes back at least to the 1960's, and probably even earlier. It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation. One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).

goes back to at least the 1960's

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

The formula is:

ab = [(a+b)^2 - (a-b)^2]/4

Babylonian Mathematics

There are also the useful: sqrt(a^2 + b) approx = a + b/2a

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd at C=Hacking (noted in old Stella post) and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

Fast Signed Multiply

;        Multiply Y * A and result in A

STA ZP1	;ZP1 -- zero page pointer to table of g(x)
EOR #$FF
CLC
ADC #$01
STA ZP2	;ZP2 also points to g(x)
LDA (ZP1),Y;g(Y+A)
SEC
SBC (ZP2),Y;g(Y-A)

supercat · July 22, 2005

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

Yeah, well that's even further before my time than the 1960's.

BTW, I developed an interesting method for doing long division which would be really cool if someone hadn't patented the same thing shortly before (and unbeknownst to me). It would be especially useful for performing really huge divisions on machines with a fast multiply but slow divide.

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd at C=Hacking (noted in old Stella post) and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

896209[/snapback]

Cute bit of code. Looks like it should be extendable to produce 16-bit results also.

Atari-Jess · July 22, 2005

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

891481[/snapback]

That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

mos6507 · August 31, 2005

That interview must be from about 10 years ago. He's talking about Super Nintendo and the Pentium Pro like they are current products.

djmips · November 9, 2005

Some code courtesy of

Lee Davison 6502 code 'shorts'

Range test. By Lee Davison

For all of these we assume that the byte to be tested is in A and that the start and end values, n and m, are already defined. Also that 0 < n < m < $FF.
If you don't need to preserve the byte in A then testing the byte can be done in five bytes and only six cycles. This sets the carry if A is in the range n to m.

CLC	; clear carry for add
ADC	#$FF-m; make m = $FF
ADC	#m-n+1; carry set if in range n to m

supercat · November 9, 2005

Some code courtesy of

Lee Davison 6502 code 'shorts'

962533[/snapback]

I like his hex-digit conversion routine. Saves a byte over the one I normally use if the state of carry on entry is unknown. I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that.

BTW, one thing that's often useful is the "cookie cutter" operation. For example, to store bits 3-5 of the accumulator into a memory location while leaving the other bits untouched:

 eor dest
 and #$38; bits 3-5
 eor dest
 sta dest

If you want to read bits 3-5 of a memory location into the corresponding bits of the accumulator, you may similarly do:

 eor src
 and #$C7 ; not bits 3-5
 eor src

but there's a very important proviso here: src MUST NOT change between the two 'eor' instructions. Any bit of src which changes between those instructions will cause the corresponding bit in the accumulator to be toggled if not in the range of bit 3..5.

vdub_bobby · November 10, 2005

Here's some interesting code from batari:

This:

   ldx #56
  loop
 ;do some stuff here
  txa
  sec
  sbc #14
  tax
  bne loop

Can be rewritten like this:

   ldx #56
  loop
 ;do some stuff here
  txa
  sbx #14
  bne loop

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

+batari · November 10, 2005

Here's some interesting code from batari:

This:
   ldx #56
  loop
;do some stuff here
  txa
  sbx #14
  bne loop
And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

963347[/snapback]

The comment was misleading. The loop shown is executed 4 times, but an outer, unseen loop was executed 10 times for a total of 40. So 160 cycles would be correct.

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

Thomas Jentzsch · November 10, 2005

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

vdub_bobby · November 10, 2005

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

Agreed.

uosipa llamxew · November 10, 2005

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

891481[/snapback]

That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

896314[/snapback]

Yikes! I couldn't agree more! The guy sounds like he drank a couple of bottles of Robitussin before the interview! Still, when he isn't talking, the video is quite interesting.

+batari · November 10, 2005

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

Edited November 10, 2005 by batari

vdub_bobby · November 10, 2005

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

963421[/snapback]

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

EDIT: Sure as heck doesn't! Wonder why...

Edited November 10, 2005 by vdub_bobby

+batari · November 11, 2005

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

EDIT: Sure as heck doesn't! Wonder why...

963428[/snapback]

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

vdub_bobby · November 11, 2005

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

EDIT: Sure as heck doesn't! Wonder why...

963428[/snapback]

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

963447[/snapback]

Actually...I don't know what is going on. For some reason, the file menu is showing 3 different gofish1k.bin files, each of which behaves differently. There is only one file by that name in the directory...

And for kicks, none of the three behaves like it does in z26. Really weird.

6502 Killer hacks

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members