Jump to content
djmips

6502 Killer hacks

Recommended Posts

I also like his 'toggle carry' method, though I still find myself wishing the 6502 allowed a direct method to do that.

 

Yes, that is a nice one. At a quick glance, I thought it trashed A (but it doesn't as bit seven is safely tucked away in the carry and is restored in the ror)

 

rol	; Cb into b0  
eor #$01	; toggle bit
ror	; b0 into Cb

Share this post


Link to post
Share on other sites

Here's my all-time favorite hack for the 6502. The "skip 2 bytes" pseudo-instruction, which piggybacks off the BIT instruction. It saves a byte when performing "either/or" branching.

 

Consider the following example. You want to store a value in either Loc1 or Loc2, depending on whether it's negative. Normally, your code would look like:

 

LDA MyNumber;2
BPL B1;2
STA Loc1;2
BMI Done;2 (This is effectively a BRA to Done.)
B1
STA Loc2;2
Done

It uses 10 bytes. Now consider this alternative:

 

LDA MyNumber;2
BPL B1;2
STA Loc1;2
.BYTE $2C;1 (Skips the next 2 bytes, effectively doing a BRA to Done.)
B1
STA Loc2;2
Done

It only uses 9 bytes, and does the same thing. When the processor hits the ".BYTE $2C", it reads this as an Absolute BIT command, which is a 3 byte command. So, it uses the two bytes of the STA Loc2 command as the argument to the BIT command, performs a meaningless BIT test, and then continues to run, without ever executing the STA Loc2 command. The only thing to be careful with this is that the N, V, and Z flags will be corrupted. But the accumulator is not affected. So you've just saved a byte. Also be wary that one extra cycle is used. BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.)

Edited by TROGDOR

Share this post


Link to post
Share on other sites
The only thing to be careful with this is that the N, V, and Z flags will be corrupted.  But the accumulator is not affected.  So you've just saved a byte.  Also be wary that one extra cycle is used.  BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.)

986490[/snapback]

 

Another thing to beware of is that certain memory locations can be affected by reads. In a 4K cart, this isn't apt to be a problem (the only thing affected by a read is the timer-interrupt flag on the RIOT which is cleared by reading INTIM). In a 16K or 32K cart, skipping "INC nn,X" or "ISB nn,x" will trigger a bank switch if 'nn' is $1F, $3F, $5F, $7F, $9F, $BF, $DF, or $FF, but such an instruction isn't too likely to occur. In a 32K cart, skipping "SBC nn,X" will also trigger a bankswitch.

 

None of those bankswitch hotspots is in practice likely to be a problem. On the other hand, RAM based carts open up a much bigger danger zone. On a Superchip cart, skipping any instruction whose opcode is $00-$7F and whose operand is $10, $30, $50, $70, $90, $B0, $D0, or $F0 will trash a byte of Superchip RAM. On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting".

Share this post


Link to post
Share on other sites
Yes, that is a nice one. At a quick glance, I thought it trashed A (but it doesn't as bit seven is safely tucked away in the carry and is restored in the ror)

965779[/snapback]

 

BTW, if it's desired to copy bit 7 of an address into all bits of the accumulator (so it's 00 or FF):

 lda #$7F
 cmp Address ; Carry clear if >= $80
 adc #$80  ; $00 if carry set; $FF if carry clear

Share this post


Link to post
Share on other sites
The only thing to be careful with this is that the N, V, and Z flags will be corrupted.  But the accumulator is not affected.  So you've just saved a byte.  Also be wary that one extra cycle is used.  BIT Absolute is 4 cycles, and a BMI branch is 3 cycles (or 4 if it crosses a page boundary.)

986490[/snapback]

 

Another thing to beware of is that certain memory locations can be affected by reads. In a 4K cart, this isn't apt to be a problem (the only thing affected by a read is the timer-interrupt flag on the RIOT which is cleared by reading INTIM). In a 16K or 32K cart, skipping "INC nn,X" or "ISB nn,x" will trigger a bank switch if 'nn' is $1F, $3F, $5F, $7F, $9F, $BF, $DF, or $FF, but such an instruction isn't too likely to occur. In a 32K cart, skipping "SBC nn,X" will also trigger a bankswitch.

 

None of those bankswitch hotspots is in practice likely to be a problem. On the other hand, RAM based carts open up a much bigger danger zone. On a Superchip cart, skipping any instruction whose opcode is $00-$7F and whose operand is $10, $30, $50, $70, $90, $B0, $D0, or $F0 will trash a byte of Superchip RAM. On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting".

986508[/snapback]

 

Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack. :sad:

Share this post


Link to post
Share on other sites
On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting".

986508[/snapback]

Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack.

No need to abandon the hack unless you are writing actual Supercharger games. If you create a regular 4k binary and run it on a Supercharger, you will have writes disabled anyway so this doesn't apply.

Edited by batari

Share this post


Link to post
Share on other sites
On a Supercharger, skipping any two-byte instruction whose operand is one of the above will be, to put it mildly, "interesting".

986508[/snapback]

Thanks for the heads-up supercat. I want to maintain Supercharger compatibility for all my programs, so it looks like I'll have to ditch my favorite hack.

No need to abandon the hack unless you are writing actual Supercharger games. If you create a regular 4k binary and run it on a Supercharger, you will have writes disabled anyway so this doesn't apply.

986549[/snapback]

 

Ah, I have been wondering for some time why this trick was unreliable when I used it within my PoP code :idea:

 

Thanks,

Chris

Share this post


Link to post
Share on other sites
Here's the software collision routine I worked up for Go Fish!

A little different - I use it to check if two boxes overlap. Call it once for X values, then call it again with Y values.

CheckBoundaries
	lda rect1.leftortop
	cmp rect2.leftortop	  [*]
	bmi Check2
	cmp rect2.rightorbottom
	bmi InsideBoundingBox
Check2
	lda rect2.leftortop
	cmp rect1.leftortop	 [*]
	bmi NotInsideBoundingBox
	cmp rect1.rightorbottom
	bpl NotInsideBoundingBox
InsideBoundingBox
	sec
	rts
NotInsideBoundingBox
	clc
	rts

 

I was looking back through this thread for a collision detection routine for my Juno First game, and the above routine looks like it will be ideal. However, if I am not mistaken, the comparisons marked [*] are the inverse of each other, so I think the code can be simplified as follows:

 

CheckBoundaries
	lda rect1.leftortop
	cmp rect2.leftortop  
	bmi Check2
	cmp rect2.rightorbottom
	bmi InsideBoundingBox
NotInsideBoundingBox
	clc
	rts
Check2
	lda rect2.leftortop
	cmp rect1.rightorbottom
	bpl NotInsideBoundingBox
InsideBoundingBox
	sec
	rts

 

Chris

Share this post


Link to post
Share on other sites
:thumbsup:

Don't know how I missed that; that looks nice. :)

 

Thanks - I haven't actually tried it yet, but it should work unless I am missing something :)

 

Chris

Share this post


Link to post
Share on other sites
I was looking back through this thread for a collision detection routine for my Juno First game, and the above routine looks like it will be ideal. However, if I am not mistaken, the comparisons marked [*] are the inverse of each other, so I think the code can be simplified as follows:

 

If one is trying to determine whether two objects of fixed size are overlapping, why not just use:

 ; Start with carry clear
 lda xpos1
 sbc xpos2 ; Note will subtract n-1
 sbc #SIZE2-1
 adc #SIZE1+SIZE2-1 ; Carry set if overlap

 

If SIZE1=4 and SIZE2=8, xpos1=20...

- If xpos2=24, the first SBC yields $FB (no carry) and the second, $F3 (carry set). The ADC generates no carry.

- If xpos2=12, the first SBC yields $07 (carry set) and the second, $00 (carry set). Again, the ADC generates no carry.

 

- If xpos2=23, the first SBC yields $FC (no carry) and the second, $F4 (carry set). The ADC generates carry.

- If xpos2=13, the first SBC yields $06 (carry set) and the second, $FF (carry clear). Again, the ADC generates carry.

Share this post


Link to post
Share on other sites
If one is trying to determine whether two objects of fixed size are overlapping, why not just use:

; Start with carry clear
 lda xpos1
 sbc xpos2; Note will subtract n-1
 sbc #SIZE2-1
 adc #SIZE1+SIZE2-1; Carry set if overlap

 

If SIZE1=4 and SIZE2=8, xpos1=20...

- If xpos2=24, the first SBC yields $FB (no carry) and the second, $F3 (carry set). The ADC generates no carry.

- If xpos2=12, the first SBC yields $07 (carry set) and the second, $00 (carry set). Again, the ADC generates no carry.

 

- If xpos2=23, the first SBC yields $FC (no carry) and the second, $F4 (carry set). The ADC generates carry.

- If xpos2=13, the first SBC yields $06 (carry set) and the second, $FF (carry clear). Again, the ADC generates carry.

 

This seems to be an even better approach, and is particularly useful for me as it has a constant cycle count. I wish I could write code like this, but I always seem to have trouble figuring the arithmetic :) Incidentally, I implemented the previous version which I posted above and it works fine.

 

Chris

Edited by cd-w

Share this post


Link to post
Share on other sites

Here's another one for the code wizards out there. I'm trying to write some good code for 7-bit binary->8-bit BCD. Here's two examples I've come up with:

 

Obvious method (23 bytes):

; input: binary value in A

  ldx #$FF
  sec
count
  sbc #10
  inx
  bcs count
  adc #10
  ora table,x

; output: BCD value in A

table .byte $0,$10,$20,$30,$40,$50,$60,$70,$80,$90

 

 

Obfuscated method: 19 bytes

; input: binary value in A

  ldx #8
  ldy #4
repeat
  dey
  bmi noadd
  cmp #$50
  bcc noadd
  adc #$2f
noadd
  cmp #$80
  rol
  dex
  bne repeat

; output: BCD value in A

Can anyone guess how #2 works?

 

I'm not satisfied with either though. It seems that there should be a better way, in terms of speed and cycles... Maybe something using the "D" flag?

Share this post


Link to post
Share on other sites

Quickly written... probably more cycles - table values adjusted since carry will be set for add.

 

tay 
and #3; starting figure (0-7) 
pha 
tya 
lsr a 
lsr a
lsr a 
sta ztemp 
pla 
ldx #4
loop:  lsr ztemp 
bcc noadd 
adc dectab-1,x 
noadd:  dex 
bne loop
rts 
dectab .byte $63,$31,$15,$07

 

Another tip: Instead of using branch, load/or, etc in certain conditions, you can just use the Processor Status, e.g. preserve bit 7 of a location.

 

Lda (screen),y
php
<other stuff>
plp
and #$80
ora newchar
sta (screen),y

 

I used a similar technique to preserve bit 7 of screen data in a softsprite routine - so that multicolour bitpair 11 still uses the right playfield.

 

Handily, the N and V flags and bit 7 and 6, so something like a BIT instruction followed by PHP will preserve the bit values for later processing.

Share this post


Link to post
Share on other sites
Quickly written... probably more cycles - table values adjusted since carry will be set for add.

Actually it might be quicker in terms of cycles, but it is quite a bit longer.

 

After research, I came up with something really short (17 bytes)

; Binary in A

  sed
  sta temp1
  lda #0
  ldx #8
loop
  asl temp1
  sta temp2
  adc temp2
  dex
  bne loop
  cld

; BCD in A

What's cool about this one is that it actually will do 8-bit binary -> 9-bit BCD, with the 9th bit contained in the carry! Can this be improved any more, though?

Edited by batari

Share this post


Link to post
Share on other sites

Theoretically, allowing full representation of 0-127, I suppose.

 

Probably would want some modification to allow for calling and using the carry bit in each initial calculation.

 

Anyway, stuff like that you only need once in a program and memory usage is mainly a 2600 consideration for stuff like BCD score / timers etc.

Share this post


Link to post
Share on other sites

This isn't exactly a 6502 Killer hack but I wanted to share this link.

 

http://graphics.stanford.edu/~seander/bithacks.html

 

There you will find some useful methods that apply to your 6502 coding.

Some of those are pretty cool.

 

I thought I'd translate some of them to 6502 assembly:

Counting bits set, Brian Kernighan's way

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
 v &= v - 1; // clear the least significant bit set
}

Assembly:

;--count set bits in v
;  trashes v, accumulator, result is in c

  lda #0
  sta c
Loop
  lda v
  beq Done
  inc c
  dec v
  and v
  sta v
  bne Loop
Done

Seems like that should be improveable.

Share this post


Link to post
Share on other sites

Here's another:

Swapping values with XOR

#define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))

This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable.

;--swap a and b

  lda a
  eor b
  sta a
  eor b
  sta b
  eor a
  sta a

That right?

Edited by vdub_bobby

Share this post


Link to post
Share on other sites

Here's another:

Swapping values with XOR

#define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))

This is an old trick to exchange the values of the variables a and b without using extra space for a temporary variable.

;--swap a and b

  lda a
  eor b
  sta a
  eor b
  sta b
  eor a
  sta a

That right?

That does have some academic appeal, but this is how I usually do it:

lda a

ldy b

sta b

sty a

I suppose if you absolutely can't use RAM or another register, the EOR trick would work.

Share this post


Link to post
Share on other sites

Assembly:

;--count set bits in v
;  trashes v, accumulator, result is in c

  lda #0
  sta c
Loop
  lda v
  beq Done
  inc c
  dec v
  and v
  sta v
  bne Loop
Done

Seems like that should be improveable.

 

I'll take a stab at it...

 

Execution time for your version is 11 cycles if v is 0, or 5+n*23-1 cycles if v is not zero (where n is the number of bits set in v). Average time (4 bits set in input) is 96 cycles.

 

One obvious improvement would be to use the X register instead of memory location c. Use "ldx #0" to initialize (saves 3 cycles, assuming "c" was a zero page address), and "inx" instead of "inc c" (saves 3 cycles per set bit in v). Also the code gets 2 bytes smaller. Of course, if you already had written a lot of code that used this as a subroutine, all that code would have to change (to look for the result in X instead of c). Depending on how the calling code used the result, you might end up losing a lot more than the 2 bytes you've saved.

 

Another improvement would be to move the Loop label down two lines (below the "beq Done")... You only need to do "lda v : beq Done" once, before the first loop iteration (to catch the special case where the initial value of v is zero). At the end of the loop, you're making this check again (the "bne Loop")... if this branch is taken (back to Loop), it's redundant to check v for zero again (you're guaranteed it's not, since you just took the "bne Loop" branch). This will save you 4 cycles per set bit in v, and it won't break compatibility with (hypothetical) existing code that calls the routine, either. (It will add 4 constant cycles before the loop, though)

 

Applying both improvements (X register and moving the label) would save 7 cycles per loop iteration, which is a 30% speedup (of just the loop)... not bad.

 

Code looks like:

 

;--count set bits in v
;  trashes v, accumulator, result is in X register

  ldx #0
  lda v
  beq Done
Loop
  inx
  dec v
  and v
  sta v
  bne Loop
Done

 

Timing is 8 cycles if v is 0, or 7+16*n-1 if v is not 0... Average case (4 bits set in input) is 70 cycles, 28% faster than the original, assuming I've gotten all the numbers right (if not, I apologize). Total code size is 15 bytes (3 bytes smaller than the original).

 

However, there are a couple of other ways to count set bits that are faster at the expense of more code. I was messing around with this a while back, and came up with 5 other solutions (I never came up with the Kernighan method on my own, though). Here are a couple of them:

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Unrolled version
; Result in A
; Uses one byte of RAM at tmp (preferably ZP)
; On exit: X and Y untouched, C/Z/N undefined
; Code size: 36 bytes (+1 for RTS = 37)
; Execution time: Constant, 5+(7*8) = 61 cycles (+6 for RTS = 67)
countbits_3:
 sta tmp; 3
 lda #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2
 rol tmp; 5
 adc #0 ; 2

 rts

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Nybble table version
; Result in A
; Uses one byte of RAM at tmp (preferably ZP)
; Execution time: Constant, 35 cycles (+6 for RTS)
; Code size: 17 bytes table + 19 bytes code = 36 bytes (+1 for RTS = 37)
; This is faster than the fully-unrolled version,
; but it trashes all the registers and flags. Could use PHA/PLA instead of
; TAY/TYA, to preserve the Y register, at the expense of 2 cycles.
nyb_bits: byte 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
countbits_4:
 tay		   ; 2
 and #$0F	  ; 3
 tax		   ; 2
 lda nyb_bits,x; 4
 sta tmp	   ; 3
 tya		   ; 2
 lsr		   ; 2
 lsr		   ; 2
 lsr		   ; 2
 lsr		   ; 2
 tax		   ; 2
 lda nyb_bits,x; 4
 clc		   ; 2
 adc tmp	   ; 3

 rts		   ; 6

 

Neither of these is all that original. The unrolled version is obvious to any 6502'er, and the nybble-table idea came from some code I saw either here or on the [stella] list (I couldn't find the original code, so I rewrote it from scratch).

 

Here's one I came up with on my own. It's not all that fast, but I like it because it fits in 11 bytes of code and doesn't use a temp location in RAM. Also, if I added a ROL before the RTS, the accumulator and carry flag would keep their original values, which might simplify things for the calling code.

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Result in Y
; Uses no RAM
; On exit:
; Y = result, X = 0, A = undefined, C = undefined, Z = 1, N = 0
; Min execution time (input $00, 0 bits set):
; 4+(2+3+2+3)*8-1 = 83 cycles
; Avg execution time (input $AA, 4 bits set):
; 4+(2+3+2+3)*4+(2+2+2+2+3)*4-1 = 87 cycles
; Max execution time (input $FF, 8 bits set):
; 4+(2+2+2+2+3)*8-1 = 91 cycles
; (add 6 cycles for RTS to all 3 counts)
; Code size: 11 bytes (+1 for RTS = 12)

countbits_2:
ldx #8; 2
ldy #0; 2

cb2loop
 rol		 ; 2
 bcc cb2_noiny; 2/3
 iny		 ; 2
cb2_noiny
 dex		 ; 2
 bne cb2loop ; 3

; rol; adding this ROL preserves the original values of A and C (and sets Z and N according to the A value)
 rts; 6 

Share this post


Link to post
Share on other sites

I suppose if you absolutely can't use RAM or another register, the EOR trick would work.

 

The EOR trick still uses RAM (locations a and b).

Share this post


Link to post
Share on other sites

A while ago, I wrote this bit of code to calculate even/odd parity. It could probably stand some improvement... I definitely don't consider it a "killer hack": it's coded for clarity more than speed or size, but it does implement something from the bithack page, so I thought someone might be interested.

 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; calc_parity will calculate even or odd parity for the low 7 bits of A,
; sets the high bit and N flag to the result.

; Input:
; A = data to calculate parity for (bit 7 ignored)
; C = carry set for odd parity, clear for even
; X, Y, other flags: ignored

; Output:
; A = data with correct parity bit in bit 7 (bits 0-6 unchanged)
; N = parity bit (same as A bit 7)
; C = parity bit (same as A bit 7 and N)
; Z = 1 if A is 0, or 0 otherwise
; X = 0
; Y = bits 0-6 of input
; tmp = same as Y

; Memory usage: One byte at tmp. No stack used.
; If desired, the routine could easily be modified to preserve the
; Y register (and use another RAM temp instead).

; Execution time: Constant, 108 cycles, +6 for RTS. Add 10 cycles if tmp is
; not located in zero page.

; Code size: 25 bytes, +1 for RTS (+4 if tmp not zero page)

; Code:
calc_parity: subroutine;{
 and #$7F ; 2; throw away top bit (it will get replaced anyway)
 sta tmp  ; 3; copy in tmp will be repeatedly shifted (destroyed),
 tay	  ; 2; so stash another copy in Y
 lda #0   ; 2; init accumulator:
 rol	  ; 2; A=1 if C was set, otherwise A=0
 ldx #7   ; 2
	   ; =13 (+1 if tmp not ZP)

;  {
.loop	 ; loop executes 7 times...
 lsr tmp  ; 5
 adc #0   ; 2
 dex	  ; 2
 bne .loop; 3
	   ; =12*7-1 = 83, +7 if tmp not ZP
;  }

; A now has count of set bits from bits 0-6 of the original argument,
; plus 1 if C flag was initially set.
; Bit 0 of A will be the new parity bit.
 ror	  ; 2; C = bit 0 of A
 lda #0   ; 2
 ror	  ; 2; bit 7 of A = C, bits 0-6 are 0
 sty tmp  ; 3
 ora tmp  ; 3
	   ; =12 (+2 if tmp not ZP)

 rts	  ; 6
;}

 

BTW, the comments with the curly braces (lines like "; }") are a Stupid Editor Trick: My editor lets me jump from an opening { to a closing } quickly (very handy for C or Java code), so I've gotten in the habit of using them in 6502 asm code for easy navigation. I use Vim, and I know Emacs has support for this, too. I'd bet that any halfway decent code editor or IDE supports it, too.

 

I've also got a vim macro that selects the current block of code (from the { to the }), for cut/copy or applying other commands. I just press F2 to highlight... and I've got F1 bound to a macro that comments out the block (or uncomments it, if it's already commented). Who says vi is hard to use? :)

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...

  • Recently Browsing   0 members

    No registered users viewing this page.

×
×
  • Create New...