peteym5 Posted February 19, 2015 Share Posted February 19, 2015 Lets start with what I found. http://wiki.nesdev.com/w/index.php/6502_assembly_optimisations This is a Nintendo Entertainment System programming page, it had a great resource about optimizing 6502 assembly. A few I had been doing for years like storing a 16bit word in two seperate tables for low and high order byte. Or using a macro for something that only gets called once, I usually just copy and past where a JSR call is and remove the RTS. The one I found interesting is this RTS trick for a jump table. Lets say you have a bunch of different codes that need to jump to be processed by a routine. Up to I seen this I was doing CMP #Code1; BNE Check_Code2, JMP Do_Code1, Check_Code2, etc. Something like that actually could save space. I know of a few others that had been done on the 2600 like using a BRK instruction for a frequently used subroutine. You have other things like self-modifying code, or smarter data encrypting. I personally want to develop great games and see others develop something great also. Some programs have to fit into smaller spaces, like for an EPROM for a cartridge. Maybe the VBI processes has to be fitted into one bank. I know we also have compression tools like exomizer and inflate, but even with those, we are still hitting a wall at times. So if you would like to share an optimization ideal, don't be afraid to post something here. Quote Link to comment Share on other sites More sharing options...
snicklin Posted February 19, 2015 Share Posted February 19, 2015 Talking about optimisation in general and not necessarily with the 6502, there's a few things that I do to optimise... 1) Does this function really need to exist? 2) Can this function be made inline (if speed is required) 3) Constantly re-review your code. One minor change that you made may have made some now irrelevant. 4) Print out your code, not the whole lot but just want you want to target. Sit there with a pen and strike out what isn't needed. Viewing it on paper is for some reason a lot easier for me. 5) Lots of comments with any optimised code as it isn't always so easy to read. 6) If working in a group setting, compete against someone to get the smallest or the fastest code. The competition will spur you both on to better things. 7) Is this a generic routine that someone has done before? Especially with 6502, people have optimised things to the hilt already, take a look at their code to see if it fits your program. Always believe that your code could be better. Anyone who boasts that their code is perfect is either a liar or a fool. 9) (For size): Is there already a function that exists that does part of your work, try re-using that. 10) Is there another way of doing the same thing? i.e. Double a value could be a+a=b or you could code 'asl a' which is quicker. 11) Read other people's code. What may look like madness may be a new lesson to learn. 12) And just be careful when optimising... do you ~really~ want to optimise the routine so much that any amendments would render it useless? Groovybee gave me that great tip, he said that it is best to get your programming doing what you want and then optimise it later. I think that point 12 led on to a thought of mine that optimisation isn't always about the code, I need to think about optimising my use of programming time. If I spend forever optimising code which then changes and needs re-optimising, my code is getting nowhere and is useless. I still need to learn this art more but I am getting better at it. 2 Quote Link to comment Share on other sites More sharing options...
Bryan Posted February 19, 2015 Share Posted February 19, 2015 This thread might be useful. http://atariage.com/forums/topic/71120-6502-killer-hacks/ Quote Link to comment Share on other sites More sharing options...
phaeron Posted February 19, 2015 Share Posted February 19, 2015 When it comes to reducing code size, I'm fond of abusing fall-throughs: IoPutSpace2x: jsr IoPutSpace IoPutSpace: lda #' ' IoPutchar: ldx #0 jmp (putvec) 3 Quote Link to comment Share on other sites More sharing options...
Rybags Posted February 19, 2015 Share Posted February 19, 2015 CIO uses the RTS trick for vectoring - just put bytes from a table onto the stack then RTS. Not mentioned on the Nes page... store registers into load immediate operands to speed up interrupts, put entire interrupt (or any block of code) into z-page and use self-modifying. There's plenty around... it all comes to what you want. Speed, code compactness... problem is those 2 things are usually mutually exclusive. Quote Link to comment Share on other sites More sharing options...
peteym5 Posted February 19, 2015 Author Share Posted February 19, 2015 I looked at some of the killer hacks. Interesting some people use undocumented (illegal) opcodes. Something I do to negate a value without tempories is to subtract value from 0: 1 compliments 255, 2 compliments 254, etc. Doing an EOR #255 changes a 1 to 254, have to add 1 after. LDA #0 SEC SBC Value STA Value (size 7 bytes, 10 cycles zero page) The version on that page: LDA value EOR #255 CLC ADC #1 STA value (size 9 bytes, 12 cycles zero page) Quote Link to comment Share on other sites More sharing options...
MaPa Posted February 19, 2015 Share Posted February 19, 2015 Without temporaries they mean that you have value in A as result of some operations, not stored somewhere in RAM and you need to negate it. So you do only eor,clc,adc... 5 bytes, 6 cycles and you have negated number in A. Quote Link to comment Share on other sites More sharing options...
Island2Live Posted February 22, 2015 Share Posted February 22, 2015 Because no-one did it so far, I'll do it:Thank you very much for the link and for starting this thread. Very valuable information! Quote Link to comment Share on other sites More sharing options...
JamesD Posted February 23, 2015 Share Posted February 23, 2015 (edited) I was aware of the separate highbyte lowbyte table optimization for jump tables but for some reason I was still using a single table in my music player.Making separate high/low tables let me drop an ASL in the 6502 version.The 65C02 code is still half as many instructions though. .IF USE_C02 > 0 asl ; multiply by 2 (addresses are 2 bytes in size) tax ; move the command table offset to x jmp (commandtable,x) .ELSE tax ; move the command table offset to x lda commandtableH,x ; load MSB of the command address pha ; push it on the stack lda commandtableL,x ; load LSB of the command address pha ; push it on the stack rts ; call address on the stack .ENDIF Edited February 23, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
peteym5 Posted February 23, 2015 Author Share Posted February 23, 2015 It looks like the magic number to use jump tables or that RTS trick is 4. If you have 3 or less codes you are testing for, the CMP #, BNE, works better. This is affected if you need to branch long(CMP #, BEQ, JMP). The RTS trick saves both in clock cycles and memory space. I use a case logic system to control multiple types of enemies in my games. Code 0 is no enemy, 1 is a Tiger, 2 is a Bat, 3 is a Dragon, etc.. Each needs its own control logic and sprite animation settings. The RTS trick does help. I also use procedure setup for screens. codes that tell the program to draw vertical, horizontal, rectangle blocks, patterns, etc..., and the RTS trick opened up a few bytes there as well. I do a procedure set up instead of trying to save 1K per screen in memory. A procedure set up system has data for instructions that set the screen up and does it many times under 128 bytes. Gives around a 8 to 1 compression ratio. Quote Link to comment Share on other sites More sharing options...
flashjazzcat Posted February 23, 2015 Share Posted February 23, 2015 (edited) I just love look-up-tables, but they consume a lot of space. Turn 16-bit graphics coordinate into byte and pixel offset: lda cx ; LSB of 16-bit coordinate tay and #7 ; mod 8 sta PixOff lda DivLoTable,y ; cx / 8 ldy cx+1 ; MSB of coordinate ora DivHiTable,y ; or in bits 7-5 sta ByteOff I suppose MOD 8 could be done with a LUT as well but it seems too trivial. The two tables (DivLo and DivHi) contain (index / 8 ), with the high or low bits zeroed out respectively. Note ByteOff may be 0-255, since we're dealing with a large clipped coordinate system. Edited February 23, 2015 by flashjazzcat Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted February 23, 2015 Share Posted February 23, 2015 Latest project... Popmilo knows it - recode over and over same routine - use illegals there are nice stable ones... SAX fex - use extensively ZP - use self modifing code - unroll code - use clever lookup tables esp for 16bit maths - prepare data best suited for your code on PC - think of antic sucking DMA so design clever your display list to minimize penalties - use not complicated RMT instruments - switch off pm Dma when unnecessary or even feed data 2600 like directly into gtia 1 Quote Link to comment Share on other sites More sharing options...
dmsc Posted February 24, 2015 Share Posted February 24, 2015 Hi, I was aware of the separate highbyte lowbyte table optimization for jump tables but for some reason I was still using a single table in my music player. Making separate high/low tables let me drop an ASL in the 6502 version. The 65C02 code is still half as many instructions though. .IF USE_C02 > 0 asl ; multiply by 2 (addresses are 2 bytes in size) tax ; move the command table offset to x jmp (commandtable,x) .ELSE tax ; move the command table offset to x lda commandtableH,x ; load MSB of the command address pha ; push it on the stack lda commandtableL,x ; load LSB of the command address pha ; push it on the stack rts ; call address on the stack .ENDIF If the command code is small, and you allow self-modifying code, you can use a "branch" table instead of a jump table, halving the code usage (only one byte per command): tax ; move the command table offset to x lda commandtableSkip,x ; load target "distance", from 0 to 127 sta commandFirst-1 ; store into branch target bpl commandFirst ; branch, always because A is positive commandFirst: If all of your command code is more than 127 bytes, you can put branches in the target for the uncommon cases, but this makes the code slower and a little bigger. Or if your code is less than 255 bytes, you could store only the low part of the jump table and assume the high part: tax ; move the command table offset to x lda #<* ; load high part of target address pha ; put into stack lda commandtableL,x ; load low part of target address pha ; put into stack rts ; call address on the stack Quote Link to comment Share on other sites More sharing options...
JamesD Posted February 24, 2015 Share Posted February 24, 2015 Hi, If the command code is small, and you allow self-modifying code, you can use a "branch" table instead of a jump table, halving the code usage (only one byte per command): tax ; move the command table offset to x lda commandtableSkip,x ; load target "distance", from 0 to 127 sta commandFirst-1 ; store into branch target bpl commandFirst ; branch, always because A is positive commandFirst: If all of your command code is more than 127 bytes, you can put branches in the target for the uncommon cases, but this makes the code slower and a little bigger. Or if your code is less than 255 bytes, you could store only the low part of the jump table and assume the high part: tax ; move the command table offset to x lda #<* ; load high part of target address pha ; put into stack lda commandtableL,x ; load low part of target address pha ; put into stack rts ; call address on the stack I chose a JMP table so that additional commands could be added within code that uses the player. This would let on screen events be synced to the music or sound effects which I planned to add support for. You could start music or a sound effect and be notified when it reached a certain point or finished. Sounds and music can actually become timers for events without adding additional code within the interrupt itself. At least that was the idea. Quote Link to comment Share on other sites More sharing options...
peteym5 Posted February 24, 2015 Author Share Posted February 24, 2015 Just in case some missed it, the procedure set up for 1K screens would be Antic 2 or 4. Character/Tile mapped screens. A typical command would be like a $05 to draw a hoz line, 2 bytes for where to put it, character, and # of times repeated. I know there are ways to condense it like the 6 high bits for a command plus 2 low bits be part of the screen address. I like using character modes because I may require certain reactions when colliding with certain characters. Again you can use RTS trick there also. Quote Link to comment Share on other sites More sharing options...
phaeron Posted February 24, 2015 Share Posted February 24, 2015 It looks like the magic number to use jump tables or that RTS trick is 4. If you have 3 or less codes you are testing for, the CMP #, BNE, works better. This is affected if you need to branch long(CMP #, BEQ, JMP). The RTS trick saves both in clock cycles and memory space. RTS dispatch is not quite that good in speed since it takes 20 cycles, in which time a branch tree can split four deep. The branch tree can also be optimized size-wise by using DEX/INX for densely packed cases, or doubling up on the branches by splitting three ways each time on both C and Z: ldx single_bit bmi bit_7 cpx #$20 beq bit_5 bcs bit_6 cpx #$08 beq bit_3 bcs bit_4 cpx #$02 beq bit_1 bcs bit_2 bit_0: When using hi/lo RTS tables, they're easier to maintain if generated off of one source table (MADS): .macro FUNCTION_DISPATCH_TABLE dta :1[funAsc-1] dta :1[funVal-1] dta :1[funLen-1] ... .endm .proc functionDispatchTableLo FUNCTION_DISPATCH_TABLE < .endp .proc functionDispatchTableHi FUNCTION_DISPATCH_TABLE > .endp 5 Quote Link to comment Share on other sites More sharing options...
Creature XL Posted February 24, 2015 Share Posted February 24, 2015 ldx single_bit bmi bit_7 cpx #$20 beq bit_5 bcs bit_6 cpx #$08 beq bit_3 bcs bit_4 cpx #$02 beq bit_1 bcs bit_2 bit_0: Finally something new for an advanced 6502-programmer. Never thought about this idea. Will try it out. For now I mostly use DEX- or LSR-BCC-style dispatchers. But I do that shortly before the release. While developing I use standard CMPs with nicely named constants. For the single purpose of being able to add stuff while I code. 2 Quote Link to comment Share on other sites More sharing options...
peteym5 Posted February 25, 2015 Author Share Posted February 25, 2015 (edited) You can always use thise "Single Byte" into A, use BMI, then do ASL @ for each bit. If we are using the bits in "Single Byte" for flag indicators. That is also if each branch is within 128 bytes of the BMI. Looks like I am correct about a tree being more than 4 branches deep to where the RTS trick starts saving room. There are cases where we might have to branch long or over 128 bytes. I have to see if there really are instances in where these illegal or unused 6502 opcodes can be used in places. Most of us will try to keep our programs running even for those that modify the computer with a 65816 CPU. Edited February 25, 2015 by peteym5 Quote Link to comment Share on other sites More sharing options...
JamesD Posted February 25, 2015 Share Posted February 25, 2015 Finally something new for an advanced 6502-programmer. Never thought about this idea. Will try it out. Agreed. I've used several approaches to doing this but never that one. I can use that in a piece of code I want to tackle next. Quote Link to comment Share on other sites More sharing options...
peteym5 Posted March 3, 2015 Author Share Posted March 3, 2015 (edited) Another optimization I started applying to games is a table assisted joystick routine. Older games were just reading the bits of PORTA, and branching for adding or subtracting variables that represents a position of a sprite. The new technique, has a 4 byte table: PLUS_MINUS_STICK DTA 000,001,255,000 and the joystick read section LDA PORTA AND #3 TAX LDA PLUS_MINUS_STICK,X CLC ADC YPOS STA YPOS LDA PORTA AND #12 LSR LSR TAX CLC ADC XPOS STA XPOS This is actually smaller than loading PORTA and doing a series of LSR@, BCS, for the 4 directions. It also prevents a problem if you manage to trip both left and right at the same time. I did not get into testing for screen edges, but I am sure anyone can figure that out. Edited March 3, 2015 by peteym5 1 Quote Link to comment Share on other sites More sharing options...
Creature XL Posted March 4, 2015 Share Posted March 4, 2015 For simple player controls this might work well. For more complex stuff a FSM is needed anyway. Even in simple control routines, the horizontal position should be half the "speed" of the vertical (in 2x1 pixel modes) ... Although, when using the tables for routines and not for offsets might be interesting. However, this table has to be adjusted depending on the player's state... In the end, finding the optimal dispatching, as interesting as it may be, is not essential. Saving a few bytes or cycles is not that essential for me in the case of the player control. Quote Link to comment Share on other sites More sharing options...
UNIXcoffee928 Posted March 8, 2015 Share Posted March 8, 2015 Being an Atari 8-Bit user, I was never a fan of Commodore 8-Bits, but the other day I was thinking about the C128 that a friend had, back in 1985, so I was looking stuff up about it, and stumbled upon the C128 Programmer's Reference Guide. I was amazed at the level of detail that Commodore put into their manuals. Too bad Atari hadn't included a manual like this one, it would have significantly minimized the learning curve. The Assembly section is really good in this book. Additionally, I would never have even known that an 8502 had existed, if I hadn't stumbled upon this book. The C128 in 128 Mode was clearly a nice system, too bad that most people had just used it in C64 Mode. After reading this book, my feeling is that I don't think that this computer really had enough lifecycle time, on the market, to be appreciated, before it was killed off. Anyway, it's a cool & useful book, check it out. 1 Quote Link to comment Share on other sites More sharing options...
flashjazzcat Posted March 8, 2015 Share Posted March 8, 2015 (edited) I suppose the Atari's lack of a built-in assembler is responsible for the absence of an exhaustive 6502 assembly reference in the manuals, but that's some decent documentation there nevertheless. And yes - it looks like a nice machine too. Edited March 8, 2015 by flashjazzcat Quote Link to comment Share on other sites More sharing options...
+slx Posted March 9, 2015 Share Posted March 9, 2015 I suppose the Atari's lack of a built-in assembler is responsible for the absence of an exhaustive 6502 assembly reference in the manuals, but that's some decent documentation there nevertheless. And yes - it looks like a nice machine too. I always found that "triple nature" of the C128 interesting but moved on from the 800 to an ST in the 80s without ever seeing a C128 in the wild. A short while ago I went through all the Retrobits podcasts and after listening to an episode about the C128 and an interview with Bil Herd, one of its main designers, I finally bought one. Due to lack of time it's still awaiting its re-awakening (plus a JiffyDOS and SD drive update) in my basement :-( While it has a better BASIC than the C64 Commodore still managed to leave out BASIC commands for some advanced graphics functions (the 80 col mode). One should have assumed that they would have learned by then but price without power seems to have been more important. I'd guess that of the unknown percentage of Atari users not interested in gaming alone, another 90% were not interested in assembly language. While it was common knowledge that you could only max out the Atari using machine language, the learning curve was reaaaally steep and even with a couple of good books you were likely to get stuck at some point. After Quote Link to comment Share on other sites More sharing options...
UNIXcoffee928 Posted March 10, 2015 Share Posted March 10, 2015 The book also kinda made me think that those C128 mode specific ICs could be interfaced to the Atari for a pretty quick to design hardware device that could give the Atari some extra functionality. I liked how the book specifically showed how to utilize those new ICs with machine code. I don't know if those ICs are commonly available, but somebody with modern Commodore hobbyist experience would know. Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.