Optimizing 6502 Assembly Programming

peteym5 · February 19, 2015

Lets start with what I found.

http://wiki.nesdev.com/w/index.php/6502_assembly_optimisations

This is a Nintendo Entertainment System programming page, it had a great resource about optimizing 6502 assembly. A few I had been doing for years like storing a 16bit word in two seperate tables for low and high order byte. Or using a macro for something that only gets called once, I usually just copy and past where a JSR call is and remove the RTS.

The one I found interesting is this RTS trick for a jump table. Lets say you have a bunch of different codes that need to jump to be processed by a routine. Up to I seen this I was doing CMP #Code1; BNE Check_Code2, JMP Do_Code1, Check_Code2, etc. Something like that actually could save space.

I know of a few others that had been done on the 2600 like using a BRK instruction for a frequently used subroutine. You have other things like self-modifying code, or smarter data encrypting.

I personally want to develop great games and see others develop something great also. Some programs have to fit into smaller spaces, like for an EPROM for a cartridge. Maybe the VBI processes has to be fitted into one bank.

I know we also have compression tools like exomizer and inflate, but even with those, we are still hitting a wall at times.

So if you would like to share an optimization ideal, don't be afraid to post something here.

snicklin · February 19, 2015

Talking about optimisation in general and not necessarily with the 6502, there's a few things that I do to optimise...

1) Does this function really need to exist?

2) Can this function be made inline (if speed is required)

3) Constantly re-review your code. One minor change that you made may have made some now irrelevant.

4) Print out your code, not the whole lot but just want you want to target. Sit there with a pen and strike out what isn't needed. Viewing it on paper is for some reason a lot easier for me.

5) Lots of comments with any optimised code as it isn't always so easy to read.

6) If working in a group setting, compete against someone to get the smallest or the fastest code. The competition will spur you both on to better things.

7) Is this a generic routine that someone has done before? Especially with 6502, people have optimised things to the hilt already, take a look at their code to see if it fits your program.

Always believe that your code could be better. Anyone who boasts that their code is perfect is either a liar or a fool.

9) (For size): Is there already a function that exists that does part of your work, try re-using that.

10) Is there another way of doing the same thing? i.e. Double a value could be a+a=b or you could code 'asl a' which is quicker.

11) Read other people's code. What may look like madness may be a new lesson to learn.

12) And just be careful when optimising... do you ~really~ want to optimise the routine so much that any amendments would render it useless? Groovybee gave me that great tip, he said that it is best to get your programming doing what you want and then optimise it later.

I think that point 12 led on to a thought of mine that optimisation isn't always about the code, I need to think about optimising my use of programming time. If I spend forever optimising code which then changes and needs re-optimising, my code is getting nowhere and is useless. I still need to learn this art more but I am getting better at it.

Bryan · February 19, 2015

This thread might be useful.

http://atariage.com/forums/topic/71120-6502-killer-hacks/

phaeron · February 19, 2015

When it comes to reducing code size, I'm fond of abusing fall-throughs:

IoPutSpace2x:
    jsr   IoPutSpace
IoPutSpace:
    lda   #' '
IoPutchar:
    ldx   #0
    jmp   (putvec)

Rybags · February 19, 2015

CIO uses the RTS trick for vectoring - just put bytes from a table onto the stack then RTS.

Not mentioned on the Nes page... store registers into load immediate operands to speed up interrupts, put entire interrupt (or any block of code) into z-page and use self-modifying.

There's plenty around... it all comes to what you want. Speed, code compactness... problem is those 2 things are usually mutually exclusive.

peteym5 · February 19, 2015

I looked at some of the killer hacks. Interesting some people use undocumented (illegal) opcodes.

Something I do to negate a value without tempories is to subtract value from 0: 1 compliments 255, 2 compliments 254, etc. Doing an EOR #255 changes a 1 to 254, have to add 1 after.

LDA #0

SEC

SBC Value

STA Value

(size 7 bytes, 10 cycles zero page)

The version on that page:

LDA value

EOR #255

CLC

ADC #1

STA value

(size 9 bytes, 12 cycles zero page)

MaPa · February 19, 2015

Without temporaries they mean that you have value in A as result of some operations, not stored somewhere in RAM and you need to negate it. So you do only eor,clc,adc... 5 bytes, 6 cycles and you have negated number in A.

Island2Live · February 22, 2015

Because no-one did it so far, I'll do it:

Thank you very much for the link and for starting this thread. Very valuable information! :cool:

JamesD · February 23, 2015

I was aware of the separate highbyte lowbyte table optimization for jump tables but for some reason I was still using a single table in my music player.
Making separate high/low tables let me drop an ASL in the 6502 version.
The 65C02 code is still half as many instructions though.

.IF	USE_C02 > 0
	asl						; multiply by 2 (addresses are 2 bytes in size)
	tax						; move the command table offset to x

	jmp	(commandtable,x)
.ELSE
	tax						; move the command table offset to x

	lda	commandtableH,x	; load MSB of the command address
	pha						; push it on the stack
	lda	commandtableL,x		; load LSB of the command address
	pha						; push it on the stack	
	rts						; call address on the stack
.ENDIF

Edited February 23, 2015 by JamesD

peteym5 · February 23, 2015

It looks like the magic number to use jump tables or that RTS trick is 4. If you have 3 or less codes you are testing for, the CMP #, BNE, works better. This is affected if you need to branch long(CMP #, BEQ, JMP). The RTS trick saves both in clock cycles and memory space.

I use a case logic system to control multiple types of enemies in my games. Code 0 is no enemy, 1 is a Tiger, 2 is a Bat, 3 is a Dragon, etc.. Each needs its own control logic and sprite animation settings. The RTS trick does help.

I also use procedure setup for screens. codes that tell the program to draw vertical, horizontal, rectangle blocks, patterns, etc..., and the RTS trick opened up a few bytes there as well. I do a procedure set up instead of trying to save 1K per screen in memory. A procedure set up system has data for instructions that set the screen up and does it many times under 128 bytes. Gives around a 8 to 1 compression ratio.

flashjazzcat · February 23, 2015

I just love look-up-tables, but they consume a lot of space. Turn 16-bit graphics coordinate into byte and pixel offset:

	lda cx	; LSB of 16-bit coordinate
	tay
	and #7	; mod 8
	sta PixOff
	lda DivLoTable,y ; cx / 8
	ldy cx+1	; MSB of coordinate
	ora DivHiTable,y ; or in bits 7-5
	sta ByteOff

I suppose MOD 8 could be done with a LUT as well but it seems too trivial. The two tables (DivLo and DivHi) contain (index / 8 ), with the high or low bits zeroed out respectively. Note ByteOff may be 0-255, since we're dealing with a large clipped coordinate system.

Edited February 23, 2015 by flashjazzcat

Heaven/TQA · February 23, 2015

Latest project... Popmilo knows it

- recode over and over same routine

- use illegals there are nice stable ones... SAX fex

- use extensively ZP

- use self modifing code

- unroll code

- use clever lookup tables esp for 16bit maths

- prepare data best suited for your code on PC

- think of antic sucking DMA so design clever your display list to minimize penalties

- use not complicated RMT instruments

- switch off pm Dma when unnecessary or even feed data 2600 like directly into gtia

dmsc · February 24, 2015

Hi,

I was aware of the separate highbyte lowbyte table optimization for jump tables but for some reason I was still using a single table in my music player.

Making separate high/low tables let me drop an ASL in the 6502 version.

The 65C02 code is still half as many instructions though.
.IF	USE_C02 > 0
	asl						; multiply by 2 (addresses are 2 bytes in size)
	tax						; move the command table offset to x

	jmp	(commandtable,x)
.ELSE
	tax						; move the command table offset to x

	lda	commandtableH,x	; load MSB of the command address
	pha						; push it on the stack
	lda	commandtableL,x		; load LSB of the command address
	pha						; push it on the stack	
	rts						; call address on the stack
.ENDIF
 

If the command code is small, and you allow self-modifying code, you can use a "branch" table instead of a jump table, halving the code usage (only one byte per command):

	tax				; move the command table offset to x
	lda	commandtableSkip,x	; load target "distance", from 0 to 127
	sta	commandFirst-1		; store into branch target
	bpl	commandFirst		; branch, always because A is positive
commandFirst:

If all of your command code is more than 127 bytes, you can put branches in the target for the uncommon cases, but this makes the code slower and a little bigger. Or if your code is less than 255 bytes, you could store only the low part of the jump table and assume the high part:

	tax				; move the command table offset to x
	lda	#<*			; load high part of target address
	pha				; put into stack
	lda	commandtableL,x		; load low part of target address
	pha				; put into stack
	rts				; call address on the stack

JamesD · February 24, 2015

Hi,

If the command code is small, and you allow self-modifying code, you can use a "branch" table instead of a jump table, halving the code usage (only one byte per command):
	tax				; move the command table offset to x
	lda	commandtableSkip,x	; load target "distance", from 0 to 127
	sta	commandFirst-1		; store into branch target
	bpl	commandFirst		; branch, always because A is positive
commandFirst:
If all of your command code is more than 127 bytes, you can put branches in the target for the uncommon cases, but this makes the code slower and a little bigger. Or if your code is less than 255 bytes, you could store only the low part of the jump table and assume the high part:
	tax				; move the command table offset to x
	lda	#<*			; load high part of target address
	pha				; put into stack
	lda	commandtableL,x		; load low part of target address
	pha				; put into stack
	rts				; call address on the stack

I chose a JMP table so that additional commands could be added within code that uses the player.

This would let on screen events be synced to the music or sound effects which I planned to add support for.

You could start music or a sound effect and be notified when it reached a certain point or finished.

Sounds and music can actually become timers for events without adding additional code within the interrupt itself.

At least that was the idea.

peteym5 · February 24, 2015

Just in case some missed it, the procedure set up for 1K screens would be Antic 2 or 4. Character/Tile mapped screens. A typical command would be like a $05 to draw a hoz line, 2 bytes for where to put it, character, and # of times repeated. I know there are ways to condense it like the 6 high bits for a command plus 2 low bits be part of the screen address. I like using character modes because I may require certain reactions when colliding with certain characters. Again you can use RTS trick there also.

phaeron · February 24, 2015

It looks like the magic number to use jump tables or that RTS trick is 4. If you have 3 or less codes you are testing for, the CMP #, BNE, works better. This is affected if you need to branch long(CMP #, BEQ, JMP). The RTS trick saves both in clock cycles and memory space.

RTS dispatch is not quite that good in speed since it takes 20 cycles, in which time a branch tree can split four deep. The branch tree can also be optimized size-wise by using DEX/INX for densely packed cases, or doubling up on the branches by splitting three ways each time on both C and Z:

  ldx   single_bit
  bmi   bit_7
  cpx   #$20
  beq   bit_5
  bcs   bit_6
  cpx   #$08
  beq   bit_3
  bcs   bit_4
  cpx   #$02
  beq   bit_1
  bcs   bit_2
bit_0:

When using hi/lo RTS tables, they're easier to maintain if generated off of one source table (MADS):

.macro FUNCTION_DISPATCH_TABLE
    dta   :1[funAsc-1]
    dta   :1[funVal-1]
    dta   :1[funLen-1]
    ...
.endm

.proc functionDispatchTableLo
    FUNCTION_DISPATCH_TABLE <
.endp

.proc functionDispatchTableHi
    FUNCTION_DISPATCH_TABLE >
.endp

Creature XL · February 24, 2015

  ldx   single_bit
  bmi   bit_7
  cpx   #$20
  beq   bit_5
  bcs   bit_6
  cpx   #$08
  beq   bit_3
  bcs   bit_4
  cpx   #$02
  beq   bit_1
  bcs   bit_2
bit_0:

Finally something new for an advanced 6502-programmer. Never thought about this idea. Will try it out.

For now I mostly use DEX- or LSR-BCC-style dispatchers. But I do that shortly before the release. While developing I use standard CMPs with nicely named constants. For the single purpose of being able to add stuff while I code.

peteym5 · February 25, 2015

You can always use thise "Single Byte" into A, use BMI, then do ASL @ for each bit. If we are using the bits in "Single Byte" for flag indicators. That is also if each branch is within 128 bytes of the BMI.

Looks like I am correct about a tree being more than 4 branches deep to where the RTS trick starts saving room. There are cases where we might have to branch long or over 128 bytes.

I have to see if there really are instances in where these illegal or unused 6502 opcodes can be used in places. Most of us will try to keep our programs running even for those that modify the computer with a 65816 CPU.

Edited February 25, 2015 by peteym5

JamesD · February 25, 2015

Finally something new for an advanced 6502-programmer. Never thought about this idea. Will try it out.

Agreed.

I've used several approaches to doing this but never that one.

I can use that in a piece of code I want to tackle next.

peteym5 · March 3, 2015

Another optimization I started applying to games is a table assisted joystick routine. Older games were just reading the bits of PORTA, and branching for adding or subtracting variables that represents a position of a sprite.

The new technique, has a 4 byte table:

PLUS_MINUS_STICK
DTA 000,001,255,000

and the joystick read section

LDA PORTA

AND #3

TAX

LDA PLUS_MINUS_STICK,X

CLC

ADC YPOS

STA YPOS

LDA PORTA

AND #12

LSR

TAX

CLC

ADC XPOS

STA XPOS

This is actually smaller than loading PORTA and doing a series of LSR@, BCS, for the 4 directions. It also prevents a problem if you manage to trip both left and right at the same time. I did not get into testing for screen edges, but I am sure anyone can figure that out.

Edited March 3, 2015 by peteym5

Creature XL · March 4, 2015

For simple player controls this might work well. For more complex stuff a FSM is needed anyway.

Even in simple control routines, the horizontal position should be half the "speed" of the vertical (in 2x1 pixel modes) ...

Although, when using the tables for routines and not for offsets might be interesting. However, this table has to be adjusted depending on the player's state...

In the end, finding the optimal dispatching, as interesting as it may be, is not essential. Saving a few bytes or cycles is not that essential for me in the case of the player control.

UNIXcoffee928 · March 8, 2015

Being an Atari 8-Bit user, I was never a fan of Commodore 8-Bits, but the other day I was thinking about the C128 that a friend had, back in 1985, so I was looking stuff up about it, and stumbled upon the C128 Programmer's Reference Guide. I was amazed at the level of detail that Commodore put into their manuals. Too bad Atari hadn't included a manual like this one, it would have significantly minimized the learning curve. The Assembly section is really good in this book. Additionally, I would never have even known that an 8502 had existed, if I hadn't stumbled upon this book. The C128 in 128 Mode was clearly a nice system, too bad that most people had just used it in C64 Mode. After reading this book, my feeling is that I don't think that this computer really had enough lifecycle time, on the market, to be appreciated, before it was killed off. Anyway, it's a cool & useful book, check it out.

flashjazzcat · March 8, 2015

I suppose the Atari's lack of a built-in assembler is responsible for the absence of an exhaustive 6502 assembly reference in the manuals, but that's some decent documentation there nevertheless. And yes - it looks like a nice machine too.

Edited March 8, 2015 by flashjazzcat

+slx · March 9, 2015

I suppose the Atari's lack of a built-in assembler is responsible for the absence of an exhaustive 6502 assembly reference in the manuals, but that's some decent documentation there nevertheless. And yes - it looks like a nice machine too.

I always found that "triple nature" of the C128 interesting but moved on from the 800 to an ST in the 80s without ever seeing a C128 in the wild.

A short while ago I went through all the Retrobits podcasts and after listening to an episode about the C128 and an interview with Bil Herd, one of its main designers, I finally bought one. Due to lack of time it's still awaiting its re-awakening (plus a JiffyDOS and SD drive update) in my basement :-(

While it has a better BASIC than the C64 Commodore still managed to leave out BASIC commands for some advanced graphics functions (the 80 col mode). One should have assumed that they would have learned by then but price without power seems to have been more important.

I'd guess that of the unknown percentage of Atari users not interested in gaming alone, another 90% were not interested in assembly language. While it was common knowledge that you could only max out the Atari using machine language, the learning curve was reaaaally steep and even with a couple of good books you were likely to get stuck at some point.

After

UNIXcoffee928 · March 10, 2015

The book also kinda made me think that those C128 mode specific ICs could be interfaced to the Atari for a pretty quick to design hardware device that could give the Atari some extra functionality. I liked how the book specifically showed how to utilize those new ICs with machine code. I don't know if those ICs are commonly available, but somebody with modern Commodore hobbyist experience would know.

Optimizing 6502 Assembly Programming

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members