Jump to content

Photo

6502 Killer hacks


144 replies to this topic

#26 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Jul 21, 2005 3:09 PM

Hello,

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.) It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway. The loop will never repeat more than 8 times.

The beq done and bcs done are optional, one or both might save cycles but I haven't tested to see if it actually decreases or increases cycles on average.
; x and a contain multiplicands, result in a

 stx temp1
 sta temp2
 lda #0
rept
 lsr temp2
 bcc skip
 clc
 adc temp1
;bcs done might save cycles?
skip
;beq done might save cycles?
 asl temp1
 bne rept
done
I tried to work up a good division routine too, but all I could come up with is successive subtraction, which is sub-optimal for small denominator values. Anyone know of a better way?
; x=numerator a=denominator, result in x
 lsr
 beq end;div by 0 = bad, div by 1=no calc needed, so bail out
 rol;restore a
 sta temp1
 txa 
 ldx #$ff
 sec
loop
 sbc temp1
 inx
 bcs loop
end; result in x


#27 Thomas Jentzsch OFFLINE  

Thomas Jentzsch

    Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

  • 23,940 posts
  • Always left from right here!
  • Location:Düsseldorf, Germany, Europe, Earth

Posted Thu Jul 21, 2005 3:28 PM

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

Edited by Thomas Jentzsch, Thu Jul 21, 2005 3:38 PM.


#28 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Jul 21, 2005 3:56 PM

Basically, that's how multiplication without tables works. The code could be optimized, but the table based approach is much faster.

View Post

How big is the table?

#29 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Thu Jul 21, 2005 4:22 PM

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.


Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. The 6502's lack of add-without-carry is somewhat irksome, but there are a variety of workarounds that could be used.
; Compute mul1*mul2+acc -> acc:mul1 [mul2 is unchanged]

  ldx #8
  dec mul2
lp:
  lsr
  ror mul1
  bcc nope
  adc mul2
nope:
  dex
  bne lp
  inc mul2

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. I'll try to work one up for you.

Edited by supercat, Thu Jul 21, 2005 4:35 PM.


#30 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Jul 21, 2005 4:33 PM

A few days ago I worked up a slick routine for 8-bit unsigned multiplication (or I think it's slick, anyway.)  It works as long as the values do not overflow 8 bits, in which case the value should not be relied upon anyway.  The loop will never repeat more than 8 times.


Computing 8x8->16 multiply is more useful than 8x8->8, and isn't really any harder. I'll have to see if I can work one up that deals with the 6502's "fun" add with carry behavior.

As for division, the normal approach is to do a shift-and-subtract. A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits. Again, I'll try to work one up for you.

View Post

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space. It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...

But don't let this stop you, as someone may find it useful, sometime, somewhere...

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.

#31 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Thu Jul 21, 2005 4:42 PM

You can whip one up for 16-bit multiplication but I don't need it right now - I just need 8x8->8 for batari Basic, actually, where you can't use 16-bit numbers anyway, and probably won't ever be able to - as the utility of these is somewhat limited and the routines to implement them would use up precious ROM space.  It's wise to keep bB from getting too top-heavy, since it's supposed to be a stepping stone to asm, and is only intended for making games, not doing complex math...


Well, I just snuck 8x8->16 in up above. I would think that for your fixed-point maths you'd want an 8x8->16 multiply. As noted, it's pretty easy.

As for division, all I want is 8/8->8... If you want to do 16/8->8+8, go ahead, but again, I probably don't need it, for the same reasons.


Here's a simple divide routine off the top of my head, but it probably isn't quite right. I'll have to simulate and see if I got it correct. It should divide A:quot by denom, leaving the quotient in quot and the remainder in A.
  ldx #8
lp:
  cmp denom
  bcc toosmall
  sbc denom   ; Note: Carry is, and will remain, set.
  rol quot
  rol
  dex
  bne lp
  beq done
toosmall:
  rol quol
  rol
  dex
  bne lp
done:


#32 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Jul 21, 2005 4:51 PM

Well, I just snuck 8x8->16 in up above.  I would think that for your fixed-point maths you'd want an 8x8->16 multiply.  As noted, it's pretty easy.

Wow, I had no idea it was so simple. I'll go ahead and use it, and just return the lower byte in the result, but leave a note in the docs on how to get the upper byte for integer multiplication if someone really needs it in bB!

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

Edited by batari, Thu Jul 21, 2005 4:53 PM.


#33 retrocon OFFLINE  

retrocon

    Chopper Commander

  • 119 posts

Posted Thu Jul 21, 2005 4:58 PM

As for division, the normal approach is to do a shift-and-subtract.  A 16/8->8+8 result is pretty easy, with the caveat that the results will be meaningless if the quotient doesn't fit in eight bits.  I'll try to work one up for you.

View Post


Supercat,

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5? Table not an option of course.

"This is often easier to understand when expressed in binary.

1/5 equals 0.00110011001100110011 etc.

0.00110011 etc equals 11 multiplied by 0.000100010001 etc.

0.000100010001 etc equals 10001 multiplied by 0.0000000100000001 etc.

This suggests the following result (in C syntax):

Y1 = (X + X<<1)

Y2 = (Y1 + Y1<<4)

Y3 = (Y2 + Y2<<8)

and we have a result accurate to 16 bits with only three adders.

Y4 = (Y3 + Y3<<16)

32 bits with only four adders. "

[EDIT] Somehow I think this is related to this algorithm here, but I'm not seeing it quite yet: http://www.sxlist.co...th/divconst.htm

Edited by retrocon, Thu Jul 21, 2005 5:27 PM.


#34 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Thu Jul 21, 2005 7:07 PM

EDIT: Just noticed, before the ADC, the carry is always set... do you need to clear it first?

View Post

That's what the "dec mul2" and "inc mul2" instructions are for. Other ways of dealing with the issue would be to subtract mul1 from the accumulator before the multiplication (which would work nicely for the low-half, but correctly dealing with the high half would be more work), inverting mul1 beforehand and then reversing the sense of the branch (adds two bytes/two cycles at the spot where mul1 gets loaded, but makes the routine more 'opaque'), or adding a "clc" before the "adc" [which costs one byte instead of four, but sixteen cycles instead of ten].

#35 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Thu Jul 21, 2005 7:28 PM

For a division shortcut, does this make any sense to you? This was a response I just got on a mailing list when asking how to optimize out a divide by 5 on a 10 bit unsigned number. I'm not sure I understand it fully, but I'd like to understand the concept. Do you know what would be the cheapest way to calculate a divide by 5?


Basically, you're replacing division by a constant with multiplication by a constant. This is an optimization approach which goes back at least to the 1960's, and probably even earlier. It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation. One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).
FastMult:; Multiplies n1 by n2.  Requires that n1+n2 be 255 or less, and n1>=n2.
  lda n1
  clc
  adc n2
  tax
  lda n1
  sec
  sbc n2
  tay
  lda squareL,x
  sbc squareL,y
  sta resultL
  lda squareH,x
  sbc squareH,y
  sta resultH
Fourty-two cycles for an 8x8->16 multiply, subject to the restrictions noted above. Considerably faster than the shift-add method, but at the expense of 512 bytes worth of tables. The tables should be computed such that ([SquareH+n]:[SquareL+n]) equals int(n*n/4).

#36 djmips OFFLINE  

djmips

    Dragonstomper

  • Topic Starter
  • 634 posts
  • scrolling
  • Location:Seattle

Posted Thu Jul 21, 2005 10:26 PM

Basically, you're replacing division by a constant with multiplication by a constant.  This is an optimization approach which goes back at least to the 1960's, and probably even earlier.  It's often an effective approach, but one needs to be careful to ensure that the results will aways be rounded in the expected direction (if one cares).

By the way, there are some interesting table-based approaches to multiplication by non-constants which can be somewhat practical if one is doing a lot of multiplication, but require some care to ensure correct operation.  One nice method, for example, takes advantage of the fact that A*A-B*B = (A+B)*(A-B).


goes back to at least the 1960's :D

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)

The formula is:
ab = [(a+b)^2 - (a-b)^2]/4

Babylonian Mathematics

There are also the useful: sqrt(a^2 + b) approx = a + b/2a

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd at C=Hacking (noted in old Stella post) and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

Fast Signed Multiply

;        Multiply Y * A and result in A

	STA ZP1	;ZP1 -- zero page pointer to table of g(x)
	EOR #$FF
	CLC
	ADC #$01
	STA ZP2	;ZP2 also points to g(x)
	LDA (ZP1),Y;g(Y+A)
	SEC
	SBC (ZP2),Y;g(Y-A)



#37 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Thu Jul 21, 2005 11:24 PM

Both the multiply by 1/x and using the tables of squares for multiplication can be traced back to the Babylonians (2000-600BC)


Yeah, well that's even further before my time than the 1960's.

BTW, I developed an interesting method for doing long division which would be really cool if someone hadn't patented the same thing shortly before (and unbeknownst to me). It would be especially useful for performing really huge divisions on machines with a fast multiply but slow divide.

There are several variations on this code and the following mores specialized version is attributed to Stephen Judd  at C=Hacking (noted in old Stella post)  and is 24-26 cycles at the expense of some range. Multiplying several numbers by a particular number is quite fast as you just need to reload Y each time.

View Post


Cute bit of code. Looks like it should be extendable to produce 16-bit results also.

#38 Atari-Jess OFFLINE  

Atari-Jess

    River Patroller

  • 4,243 posts
  • Location:Toronto

Posted Fri Jul 22, 2005 4:25 AM

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

View Post


That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

#39 mos6507 ONLINE  

mos6507

    River Patroller

  • 4,916 posts

Posted Tue Aug 30, 2005 6:22 PM

That interview must be from about 10 years ago. He's talking about Super Nintendo and the Pentium Pro like they are current products.

#40 djmips OFFLINE  

djmips

    Dragonstomper

  • Topic Starter
  • 634 posts
  • scrolling
  • Location:Seattle

Posted Tue Nov 8, 2005 9:39 PM

Some code courtesy of

Lee Davison 6502 code 'shorts'

Range test. By Lee Davison

For all of these we assume that the byte to be tested is in A and that the start and end values, n and m, are already defined. Also that 0 < n < m < $FF.
If you don't need to preserve the byte in A then testing the byte can be done in five bytes and only six cycles. This sets the carry if A is in the range n to m.


	CLC	; clear carry for add
	ADC	#$FF-m; make m = $FF
	ADC	#m-n+1; carry set if in range n to m	


#41 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Tue Nov 8, 2005 11:08 PM

Some code courtesy of

Lee Davison 6502 code 'shorts'

View Post


I like his hex-digit conversion routine. Saves a byte over the one I normally use if the state of carry on entry is unknown. I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that.

BTW, one thing that's often useful is the "cookie cutter" operation. For example, to store bits 3-5 of the accumulator into a memory location while leaving the other bits untouched:
  eor dest
  and #$38; bits 3-5
  eor dest
  sta dest
If you want to read bits 3-5 of a memory location into the corresponding bits of the accumulator, you may similarly do:
  eor src
  and #$C7 ; not bits 3-5
  eor src
but there's a very important proviso here: src MUST NOT change between the two 'eor' instructions. Any bit of src which changes between those instructions will cause the corresponding bit in the accumulator to be toggled if not in the range of bit 3..5.

#42 vdub_bobby OFFLINE  

vdub_bobby

    Quadrunner

  • 5,832 posts
  • Boom bam.
  • Location:Seattle, WA

Posted Thu Nov 10, 2005 2:00 PM

Here's some interesting code from batari:

This:
  ldx #56
   loop
  ;do some stuff here
   txa
   sec
   sbc #14
   tax
   bne loop
Can be rewritten like this:
  ldx #56
   loop
  ;do some stuff here
   txa
   sbx #14
   bne loop

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

#43 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Nov 10, 2005 3:24 PM

Here's some interesting code from batari:

This:

  ldx #56
   loop
 ;do some stuff here
   txa
   sbx #14
   bne loop

And his comment:

the whole routine above was in a loop that was executed up to 10 times, thus saving 2 bytes and up to 160 cycles.

Though I think that "160 cycles" is wrong; I think if the loop is executed 10 times you would save 40 cycles altogether.

View Post

The comment was misleading. The loop shown is executed 4 times, but an outer, unseen loop was executed 10 times for a total of 40. So 160 cycles would be correct.

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.) So I recoded it :(

#44 Thomas Jentzsch OFFLINE  

Thomas Jentzsch

    Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

  • 23,940 posts
  • Always left from right here!
  • Location:Düsseldorf, Germany, Europe, Earth

Posted Thu Nov 10, 2005 3:34 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

#45 vdub_bobby OFFLINE  

vdub_bobby

    Quadrunner

  • 5,832 posts
  • Boom bam.
  • Location:Seattle, WA

Posted Thu Nov 10, 2005 3:53 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

View Post

Agreed.

#46 mojofltr OFFLINE  

mojofltr

    River Patroller

  • 2,632 posts

Posted Thu Nov 10, 2005 3:59 PM

Not a killer hack but I'd like to share a link to an interview with William Mensch, 6502 designer.

Real format video of interview

View Post


That was an awesome interview, but man, I greatly disliked the interviewer.

He bumbled with his words and he sounded like a smart ass and only genuinely interested when it seemed like something he knew already.

I apologize if its someone here or another "legend" in microprocessing, but man, I can't stand the interviewer. But Mensch was awesome.

View Post


Yikes! I couldn't agree more! The guy sounds like he drank a couple of bottles of Robitussin before the interview! :P Still, when he isn't talking, the video is quite interesting.

#47 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Nov 10, 2005 5:04 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

View Post

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

Edited by batari, Thu Nov 10, 2005 5:05 PM.


#48 vdub_bobby OFFLINE  

vdub_bobby

    Quadrunner

  • 5,832 posts
  • Boom bam.
  • Location:Seattle, WA

Posted Thu Nov 10, 2005 5:21 PM

This code was in my game Zirconium until I found that the FB2 doesn't do SBX (nor does PCAE.)  So I recoded it :(

IMO it's a bad idea to tailor code to partially incompatible hardware or quite outdated emulators. The target system should be the Atari 2600 and nothing else.

View Post

I tend to agree too, but there are still a staggering number of PCAE users, many of which are almost dogmatic about it. Maybe a bad decision, but I didn't want people not judging my game in the minigame contest because it wouldn't run on what they still consider the "best" emulator despite the fact that it's very dated (and the author has annouced that he will no longer update it.)

The FB2 was a lesser consideration... but I have to say I really like the quality of the composite video, and yes I do play my own game occasionally.

View Post

Hmmm...I wonder if Go Fish! 1K will run in PCAE? I never even bothered to check...

EDIT: Sure as heck doesn't! Wonder why...

Edited by vdub_bobby, Thu Nov 10, 2005 5:24 PM.


#49 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,680 posts
  • begin 644 contest

Posted Thu Nov 10, 2005 6:07 PM

Hmmm...I wonder if Go Fish! 1K will run in PCAE?  I never even bothered to check...

EDIT: Sure as heck doesn't!  Wonder why...

View Post

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

#50 vdub_bobby OFFLINE  

vdub_bobby

    Quadrunner

  • 5,832 posts
  • Boom bam.
  • Location:Seattle, WA

Posted Thu Nov 10, 2005 6:17 PM

Hmmm...I wonder if Go Fish! 1K will run in PCAE?  I never even bothered to check...

EDIT: Sure as heck doesn't!  Wonder why...

View Post

Are you using v2.6? It seems to load up just fine for me. The score is messed up though...

View Post

Actually...I don't know what is going on. For some reason, the file menu is showing 3 different gofish1k.bin files, each of which behaves differently. There is only one file by that name in the directory...

And for kicks, none of the three behaves like it does in z26. Really weird.




0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users