Jump to content
IGNORED

6502 Killer hacks


djmips

Recommended Posts

Hello,

 

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

 

The beq done and bcs done are optional, one or both might save cycles but I haven't tested to see if it actually decreases or increases cycles on average.

; x and a contain multiplicands, result in a

stx temp1
sta temp2
lda #0
rept
lsr temp2
bcc skip
clc
adc temp1
;bcs done might save cycles?
skip
;beq done might save cycles?
asl temp1
bne rept
done

I tried to work up a good division routine too, but all I could come up with is successive subtraction, which is sub-optimal for small denominator values. Anyone know of a better way?

; x=numerator a=denominator, result in x
lsr
beq end;div by 0 = bad, div by 1=no calc needed, so bail out
rol;restore a
sta temp1
txa 
ldx #$ff
sec
loop
sbc temp1
inx
bcs loop
end; result in x

Link to comment
Share on other sites

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.

 

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. The 6502's lack of add-without-carry is somewhat irksome, but there are a variety of workarounds that could be used.

; Compute mul1*mul2+acc -> acc:mul1 [mul2 is unchanged]

 ldx #8
 dec mul2
lp:
 lsr
 ror mul1
 bcc nope
 adc mul2
nope:
 dex
 bne lp
 inc mul2

 

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. I'll try to work one up for you.

Edited by supercat
Link to comment
Share on other sites

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.

 

Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. I'll have to see if I can work one up that deals with the 6502's "fun" add with carry behavior.

 

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. Again, I'll try to work one up for you.

896028[/snapback]

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space. It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

 

But don't let this stop you, as someone may find it useful, sometime, somewhere...

 

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

Link to comment
Share on other sites

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space.  It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

 

Well, I just snuck 8x8->16 in up above. I would think that for your fixed-point maths you'd want an 8x8->16 multiply. As noted, it's pretty easy.

 

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

 

Here's a simple divide routine off the top of my head, but it probably isn't quite right. I'll have to simulate and see if I got it correct. It should divide A:quot by denom, leaving the quotient in quot and the remainder in A.

 ldx #8
lp:
 cmp denom
 bcc toosmall
 sbc denom   ; Note: Carry is, and will remain, set.
 rol quot
 rol
 dex
 bne lp
 beq done
toosmall:
 rol quol
 rol
 dex
 bne lp
done:

Link to comment
Share on other sites

Well, I just snuck 8x8->16 in up above.  I would think that for your fixed-point maths you'd want an 8x8->16 multiply.  As noted, it's pretty easy.

Wow, I had no idea it was so simple. I'll go ahead and use it, and just return the lower byte in the result, but leave a note in the docs on how to get the upper byte for integer multiplication if someone really needs it in bB!

 

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

Edited by batari
Link to comment
Share on other sites

As for division, the normal approach is to do a shift-and-subtract.  A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits.  I'll try to work one up for you.

896028[/snapback]

 

Supercat,

 

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5? Table not an option of course.

 

"This is often easier to understand when expressed in binary.

 

1/5 equals 0.00110011001100110011 etc.

 

0.00110011 etc equals 11 multiplied by 0.000100010001 etc.

 

0.000100010001 etc equals 10001 multiplied by 0.0000000100000001 etc.

 

This suggests the following result (in C syntax):

 

Y1 = (X + X<<1)

 

Y2 = (Y1 + Y1<<4)

 

Y3 = (Y2 + Y2<<8)

 

and we have a result accurate to 16 bits with only three adders.

 

Y4 = (Y3 + Y3<<16)

 

32 bits with only four adders. "

 

[EDIT] Somehow I think this is related to this algorithm here, but I'm not seeing it quite yet: http://www.sxlist.com/techref/method/math/divconst.htm

Edited by retrocon
Link to comment
Share on other sites

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

896042[/snapback]

That's what the "dec mul2" and "inc mul2" instructions are for. Other ways of dealing with the issue would be to subtract mul1 from the accumulator before the multiplication (which would work nicely for the low-half, but correctly dealing with the high half would be more work), inverting mul1 beforehand and then reversing the sense of the branch (adds two bytes/two cycles at the spot where mul1 gets loaded, but makes the routine more 'opaque'), or adding a "clc" before the "adc" [which costs one byte instead of four, but sixteen cycles instead of ten].

Link to comment
Share on other sites

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5?

 

Basically, you're replacing division by a constant with multiplication by a constant. This is an optimization approach which goes back at least to the 1960's, and probably even earlier. It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

 

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation. One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).

FastMult:; Multiplies n1 by n2.  Requires that n1+n2 be 255 or less, and n1>=n2.
 lda n1
 clc
 adc n2
 tax
 lda n1
 sec
 sbc n2
 tay
 lda squareL,x
 sbc squareL,y
 sta resultL
 lda squareH,x
 sbc squareH,y
 sta resultH

Fourty-two cycles for an 8x8->16 multiply, subject to the restrictions noted above. Considerably faster than the shift-add method, but at the expense of 512 bytes worth of tables. The tables should be computed such that ([squareH+n]:[squareL+n]) equals int(n*n/4).

Link to comment
Share on other sites

Basically, you're replacing division by a constant with multiplication by a constant.  This is an optimization approach which goes back at least to the 1960's, and probably even earlier.  It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

 

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation.  One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).

 

goes back to at least the 1960's :D

 

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

 

The formula is:

ab = [(a+b)^2 - (a-b)^2]/4

 

Babylonian Mathematics

 

There are also the useful: sqrt(a^2 + b) approx = a + b/2a

 

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd at C=Hacking (noted in old Stella post) and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

 

Fast Signed Multiply

 

;        Multiply Y * A and result in A

STA ZP1	;ZP1 -- zero page pointer to table of g(x)
EOR #$FF
CLC
ADC #$01
STA ZP2	;ZP2 also points to g(x)
LDA (ZP1),Y;g(Y+A)
SEC
SBC (ZP2),Y;g(Y-A)

Link to comment
Share on other sites

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

 

Yeah, well that's even further before my time than the 1960's.

 

BTW, I developed an interesting method for doing long division which would be really cool if someone hadn't patented the same thing shortly before (and unbeknownst to me). It would be especially useful for performing really huge divisions on machines with a fast multiply but slow divide.

 

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd  at C=Hacking (noted in old Stella post)  and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

896209[/snapback]

 

Cute bit of code. Looks like it should be extendable to produce 16-bit results also.

Link to comment
Share on other sites

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

 

Real format video of interview

891481[/snapback]

 

That was an awesome interview, but man, I greatly disliked the interviewer.

 

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

 

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

Link to comment
Share on other sites

  • 1 month later...
  • 2 months later...

Some code courtesy of

 

Lee Davison 6502 code 'shorts'

 

Range test. By Lee Davison

 

For all of these we assume that the byte to be tested is in A and that the start and end values, n and m, are already defined. Also that 0 < n < m < $FF.

If you don't need to preserve the byte in A then testing the byte can be done in five bytes and only six cycles. This sets the carry if A is in the range n to m.

 

CLC	; clear carry for add
ADC	#$FF-m; make m = $FF
ADC	#m-n+1; carry set if in range n to m	

Link to comment
Share on other sites

Some code courtesy of

 

Lee Davison 6502 code 'shorts'

962533[/snapback]

 

I like his hex-digit conversion routine. Saves a byte over the one I normally use if the state of carry on entry is unknown. I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that.

 

BTW, one thing that's often useful is the "cookie cutter" operation. For example, to store bits 3-5 of the accumulator into a memory location while leaving the other bits untouched:

 eor dest
 and #$38; bits 3-5
 eor dest
 sta dest

If you want to read bits 3-5 of a memory location into the corresponding bits of the accumulator, you may similarly do:

 eor src
 and #$C7 ; not bits 3-5
 eor src

but there's a very important proviso here: src MUST NOT change between the two 'eor' instructions. Any bit of src which changes between those instructions will cause the corresponding bit in the accumulator to be toggled if not in the range of bit 3..5.

Link to comment
Share on other sites

Here's some interesting code from batari:

 

This:

   ldx #56
  loop
 ;do some stuff here
  txa
  sec
  sbc #14
  tax
  bne loop

Can be rewritten like this:

   ldx #56
  loop
 ;do some stuff here
  txa
  sbx #14
  bne loop

 

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

Link to comment
Share on other sites

Here's some interesting code from batari:

 

This:

   ldx #56
  loop
;do some stuff here
  txa
  sbx #14
  bne loop

 

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

963347[/snapback]

The comment was misleading. The loop shown is executed 4 times, but an outer, unseen loop was executed 10 times for a total of 40. So 160 cycles would be correct.

 

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it :(

Link to comment
Share on other sites

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

  • Like 1
Link to comment
Share on other sites

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

Agreed.

Link to comment
Share on other sites

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

 

Real format video of interview

891481[/snapback]

 

That was an awesome interview, but man, I greatly disliked the interviewer.

 

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

 

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

896314[/snapback]

 

Yikes! I couldn't agree more! The guy sounds like he drank a couple of bottles of Robitussin before the interview! :P Still, when he isn't talking, the video is quite interesting.

Link to comment
Share on other sites

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

 

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

Edited by batari
Link to comment
Share on other sites

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

963398[/snapback]

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

 

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

963421[/snapback]

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

 

EDIT: Sure as heck doesn't! Wonder why...

Edited by vdub_bobby
Link to comment
Share on other sites

Hmmm...I wonder if Go Fish! 1K will run in PCAE?  I never even bothered to check...

 

EDIT: Sure as heck doesn't!  Wonder why...

963428[/snapback]

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

963447[/snapback]

Actually...I don't know what is going on. For some reason, the file menu is showing 3 different gofish1k.bin files, each of which behaves differently. There is only one file by that name in the directory...

 

And for kicks, none of the three behaves like it does in z26. Really weird.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...