ilmenit

Advanced optimizations in CC65

Very nice stuff, Ilmenit. I learned a lot from you over the years. In particular, I wrote an Atari ST mouse handler for the A8 completely in CC65, using your techniques. It was an interrupt handler running off one of the POKEY timers, and it worked great.


Wow, this is incredibly helpful, thanks a lot for taking the time to write it down!

I'm not programming for the Atari 5200 (yet?), but I use CC65 for making Lynx and NES games, and this will be really helpful for speeding up my games!

(The struct of arrays vs. array of structs case, for example: I would never have thought of that!)
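
For readers who have not met the struct-of-arrays idea yet, here is a minimal sketch in plain C. The names (`MAX_ENEMIES`, `enemy_x` and so on) are made up for illustration and are not taken from the guide. With an array of structs the index typically has to be scaled by the struct size before each access, while separate byte arrays with an 8-bit index map naturally onto the 6502's indexed addressing.

```c
#define MAX_ENEMIES 8

/* Array of structs: element i lives at base + i * sizeof(struct enemy). */
struct enemy {
    unsigned char x;
    unsigned char y;
    unsigned char hp;
};
struct enemy enemies_aos[MAX_ENEMIES];

/* Struct of arrays: one byte array per field, all indexed by the same i. */
unsigned char enemy_x[MAX_ENEMIES];
unsigned char enemy_y[MAX_ENEMIES];
unsigned char enemy_hp[MAX_ENEMIES];

/* Move every enemy one pixel to the right, array-of-structs version. */
void move_right_aos(void)
{
    unsigned char i;
    for (i = 0; i < MAX_ENEMIES; ++i) {
        ++enemies_aos[i].x;
    }
}

/* Same operation, struct-of-arrays version. */
void move_right_soa(void)
{
    unsigned char i;
    for (i = 0; i < MAX_ENEMIES; ++i) {
        ++enemy_x[i];
    }
}
```

Both functions do the same work; any difference is only in the code the compiler has to generate for the indexing, so it is worth checking the assembly listing for your own data layout.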


Hi!

15 hours ago, ilmenit said:

Hi,

I wrote a "small" article about optimizing C code for CC65 compiler. Let me know if you have any feedback or additional questions. Enjoy reading!

https://github.com/ilmenit/CC65-Advanced-Optimizations

 

Many thanks!

 

Now I can link to this document each time someone asks related questions :)

 

I think you should reconsider the advice about function parameters. Sometimes allowing one parameter to a function actually produces faster and smaller code:

- at each call site, the compiler can just emit "LDA / LDX" to load the parameter, instead of the LDA/STA pairs needed when the parameter is passed in static variables,

- but the code generator insists on pushing the received value onto the stack at function entry. Even so, for functions that are called a lot this is a win.

 

IMHO, one of the most important optimizations the compiler could make is to avoid using the stack in leaf functions. There are two uses of the stack that could be avoided:

- using the stack to store the arguments passed in A/X;

- using the stack to save the values of the ZP "registers", as those are call-saved.

Both could be replaced by allocating a small static area per function and storing the values there.
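
To make the "small static area per function" idea concrete, here is a hand-written sketch of what it looks like when done manually in C today. The function and variable names are hypothetical, and this only illustrates the idea; it is not something the compiler currently does on its own.

```c
/* Conventional version: the parameters and 'sum' are ordinary
   automatic variables, which cc65 keeps on its software stack. */
unsigned char average_stack(unsigned char a, unsigned char b)
{
    unsigned int sum = (unsigned int)a + b;
    return (unsigned char)(sum >> 1);
}

/* Hand-made "static area" for one leaf function: the caller stores
   the arguments in fixed locations before the call. */
unsigned char avg_a;
unsigned char avg_b;

unsigned char average_static(void)
{
    static unsigned int sum;  /* fixed address, no stack traffic */
    sum = (unsigned int)avg_a + avg_b;
    return (unsigned char)(sum >> 1);
}
```

The static version is only safe for functions that are neither recursive nor called from an interrupt handler while they are already running, because every call shares the same storage.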

 

Have Fun!

 


Excellent !

Bravo !!!

Thank you very much.

I like CC65 very much, and your article shows how much more can be achieved with it.

8 hours ago, dmsc said:

I think you should reconsider the advice about function parameters. Sometimes allowing one parameter to a function actually produces faster and smaller code:

- at each call site, the compiler can just emit "LDA / LDX" to load the parameter, instead of the LDA/STA pairs needed when the parameter is passed in static variables,

- but the code generator insists on pushing the received value onto the stack at function entry. Even so, for functions that are called a lot this is a win.

CC65 supports the "fastcall" calling convention, which passes data through the A/X registers:

https://github.com/cc65/wiki/wiki/Parameter-passing-and-calling-conventions

However, when I tested it, the compiler always generated "jsr pushax" at the beginning of the function (as you wrote), which negates the benefit of passing through registers, and when the code was benchmarked there was no measurable benefit. Do you have an example of a function where it gives a boost? I could add such a section to the guide.
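
For reference, a minimal sketch of the kind of single-argument function being discussed. The function name `invert` is hypothetical, and the `#ifndef __CC65__` guard is my own addition so that the same file also builds with a host compiler; whether fastcall pays off for a given function is best checked in the generated assembly.

```c
#ifndef __CC65__
#define __fastcall__ /* not a cc65 build: make the keyword a no-op */
#endif

/* Single-argument function: with fastcall the 8-bit argument
   arrives in the A register instead of on the software stack. */
unsigned char __fastcall__ invert(unsigned char value)
{
    return (unsigned char)(value ^ 0xFF);
}
```

A call such as `invert(12)` then only needs to load the constant and jump to the function at the call site; what happens to the value inside the function is up to the code generator.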


Hi!

On 4/24/2020 at 5:43 AM, ilmenit said:

CC65 supports the "fastcall" calling convention, which passes data through the A/X registers:

https://github.com/cc65/wiki/wiki/Parameter-passing-and-calling-conventions

However, when I tested it, the compiler always generated "jsr pushax" at the beginning of the function (as you wrote), which negates the benefit of passing through registers, and when the code was benchmarked there was no measurable benefit. Do you have an example of a function where it gives a boost? I could add such a section to the guide.

It would normally be slower, but if you call the function a lot, the program will be shorter.

 

There is a trick you can use to save the function argument automatically, but it requires the called function to be in a separate C file from the caller:

// This is in the "fun.c" file:
// The function is defined as taking "void", but inline ASM moves the value in A into the local variable
unsigned char fun(void)
{
    static unsigned char x;
    __asm__ ("sta %v", x);
    // Your function here
    return 0xFF^x;
}

// This also works with integer arguments, moving A and X into the local variable
unsigned fun16(void)
{
    static unsigned x;
    // Low byte arrives in A, high byte in X
    __asm__ ("sta %v", x);
    __asm__ ("stx %v+1", x);
    // Your function here
    return 0xFF^x;
}

 

Now, in a separate file, you declare the function with the arguments:

// This is in "main.c" file:

// Declare the functions with arguments
unsigned char fun(unsigned char x);
unsigned fun16(unsigned x);

int main()
{
    // Call the functions!
    return fun(12) + fun16(7);
}

 

In this case, this is the code produced for the "fun.c" file:

; ---------------------------------------------------------------
; unsigned char __near__ fun (void)
.segment        "CODE"
.proc   _fun: near
.segment        "BSS"
L0002:
        .res    1,$00
.segment        "CODE"
        sta     L0002
        eor     #$FF
        ldx     #$00
        rts
.endproc

; ---------------------------------------------------------------
; unsigned int __near__ fun16 (void)
.segment        "CODE"
.proc   _fun16: near
.segment        "BSS"
L0007:
        .res    2,$00
.segment        "CODE"
        sta     L0007
        stx     L0007+1
        eor     #$FF
        rts
.endproc

 

As you can see, CC65 even knows that the values are already in A and X, so it does not need to reload them.

 

Have Fun!

 

On 4/23/2020 at 12:28 PM, ilmenit said:

Hi,

I wrote a "small" article about optimizing C code for CC65 compiler. Let me know if you have any feedback or additional questions. Enjoy reading!

https://github.com/ilmenit/CC65-Advanced-Optimizations

@ilmenit, you did excellent work, and this could become a reference for everyone who wishes to program well in CC65.

Job well done!


Thanks a lot, this looks really handy. It will definitely save a lot of time compared with reinventing the wheel :D


There is a fairly recent cc65 feature that I believe is worth adding to your optimization guide. For some reason, variables placed in ZP were accessed with a two-byte address instead of a single byte: https://github.com/cc65/cc65/issues/917

The custom fix mentioned in https://github.com/cc65/cc65/issues/917#issuecomment-647326244 was merged into the master branch around the end of the year (you need fresh sources, not the release archive).

With this cc65 build I was able to gain about 700 bytes of RAM in my project just by placing global variables in ZP, and there is a noticeable performance boost too.

 

Edited by clth

Hi. Thanks for pointing this out! I will need to check how well it works. I was already recommending ZPSYM to make sure that single-byte addressing is used.

Edited by ilmenit


I couldn't find 'benchmarks.h'.  Is my implementation wrong?  I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit

 

This is 'git clone' current version.

 

i added to your example--

 

typedef unsigned int word;

word ticks;

 

void start_benchmark(void)
{
    ticks = PEEKW(18);
}

void end_benchmark(void)
{
    printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20));
}

 

sorry if obvious or beginner error.

 

edit: 'git clone' of cc65, not your repo... and should have said I also do "#include <peekpoke.h>".

Edited by thank you
forgot something

4 hours ago, thank you said:

I couldn't find 'benchmarks.h'.  Is my implementation wrong?  I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit

 

This is 'git clone' current version.

 

i added to your example--

 

typedef unsigned int word;

word ticks;

 

void start_benchmark(void)
{
    ticks = PEEKW(18);
}

void end_benchmark(void)
{
    printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20));
}

 

sorry if obvious or beginner error.

 

edit: 'git clone' of cc65, not your repo... and should have said I also do "#include <peekpoke.h>".

The "benchmark.h" file is in my repo, e.g. https://github.com/ilmenit/CC65-Advanced-Optimizations/blob/master/03-smallest unsigned data types/benchmark.h

533 vs 528 is a very small difference. While it may depend on the compiler version or the selected compilation options, remember that the timer at memory location 20 holds a value from 0 to 255. You are not zeroing it in your code, so the final result may differ by anything from +0 to +255 depending on the moment you run the code.
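
As a rough sketch of how the counter can be read without that offset (this is not the code from benchmark.h): it assumes the Atari OS clock is three bytes at locations 18, 19 and 20, with location 20 being the byte that changes every frame and location 19 counting its overflows. The helper names are hypothetical. Taking a start reading and subtracting it makes it unnecessary to zero the timer at all.

```c
/* 'rtclok' points at the three-byte clock; on real hardware this would be
   (const volatile unsigned char *)18, here it is a parameter so the
   function can also be exercised against an ordinary array. */
unsigned int read_ticks(const volatile unsigned char *rtclok)
{
    /* rtclok[1] is location 19 (high byte), rtclok[2] is location 20 (low byte). */
    return ((unsigned int)rtclok[1] << 8) | rtclok[2];
}

/* Difference between two readings, correct across a 16-bit wrap-around. */
unsigned int elapsed_ticks(unsigned int start, unsigned int end)
{
    return (end - start) & 0xFFFFu;
}
```

Because the two bytes are read one after the other, a frame interrupt between the reads can still cause an occasional glitch; for a quick benchmark that is usually acceptable.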

5 hours ago, thank you said:

I couldn't find 'benchmarks.h'.  Is my implementation wrong?  I'm getting 533 ticks instead of 528 at Step 01 of your guide, @ilmenit

 

This is 'git clone' current version.

 

i added to your example--

 

typedef unsigned int word;

word ticks;

 

void start_benchmark(void)
{
    ticks = PEEKW(18);
}

void end_benchmark(void)
{
    printf("Ticks used: %d\n", PEEKW(18) - ticks + PEEK(20));
}

 

sorry if obvious or beginner error.

 

edit: 'git clone' of cc65, not your repo... and should have said I also do "#include <peekpoke.h>".

 

Instead of using PEEK() functions you could use clock(). Then it wouldn't look so BASIC-like. 🙂
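
A minimal sketch of that suggestion, using only the standard `clock()` from `<time.h>`. The function names mirror the ones in the quoted code, but here `end_benchmark` returns the elapsed value instead of printing it; that change is my own choice for illustration.

```c
#include <time.h>

static clock_t bench_start;

/* Remember the clock value at the start of the measured section. */
void start_benchmark(void)
{
    bench_start = clock();
}

/* Return the number of clock ticks elapsed since start_benchmark(). */
unsigned long end_benchmark(void)
{
    return (unsigned long)(clock() - bench_start);
}
```

The result is in units of `1 / CLOCKS_PER_SEC` seconds, so it can be printed with `printf("Ticks used: %lu\n", end_benchmark());`.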


Thanks @ilmenit, I made it through the lesson, very interesting results... I learned a lot. Somehow I missed the handy links at the top of the page to the various steps of the code, and I am as bad at searching on GitHub as I am on this forum.

@sanny I should probably RTFM, thanks :)


Ahem, hardly an advanced one, but worth mentioning somewhere. While trying to improve the output of the cc65 random number generator, I started setting the seed on every loop with a slightly more random value taken from a DLI.

 

Original code, simple line copied from somewhere

srand((unsigned) time(NULL));

Updated variant

srand(dli_variable);

Using time() consumes an extra 1700+ bytes. I use almost no library code, but this one slipped through.

 


Unless you need a "deterministic RNG" like the one provided by srand/rand, it is even shorter to use the POKEY RANDOM register 🙂
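
A minimal sketch of reading that register from C. It assumes the POKEY RANDOM register is readable at $D20A on the Atari 8-bit; the macro and function names are hypothetical, and the register is passed in as a pointer so the function can also be tried on a host machine.

```c
/* Assumption: POKEY RANDOM is mapped at $D20A on Atari 8-bit computers. */
#define POKEY_RANDOM ((const volatile unsigned char *)0xD20A)

/* Return one byte from the given hardware register. 'volatile' makes
   sure every call performs a fresh read. */
unsigned char random_byte(const volatile unsigned char *reg)
{
    return *reg;
}
```

On the Atari the call would be `random_byte(POKEY_RANDOM)`. There is no seed, so the sequence cannot be replayed, which is exactly the trade-off mentioned above.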

Quote

If you didn't read "Advanced optimizations in CC65" by now (shame on you!), you don't know how the optimized code is a mess to read. All available tricks have been used, even the author does not recommend going so far in real life.

:D 

2 hours ago, ilmenit said:

Comparison of different C compilers (cc65, vbcc, kickc, gcc + asm):

https://www.videogamesage.com/topic/762-super-tilt-bro-for-nes/page/2/?tab=comments#comment-163145

I didn't read through it yet, just sharing for now.

Nice to see the author of the 6502 gcc backend answered, and even fixed a couple of bugs! Perhaps I should resurrect the Atari 8-bit port again, and have it merged as soon as possible, so it'll track the latest gcc sources.

10 hours ago, ilmenit said:

Comparison of different C compilers (cc65, vbcc, kickc, gcc + asm):

https://www.videogamesage.com/topic/762-super-tilt-bro-for-nes/page/2/?tab=comments#comment-163145

I didn't read through it yet, just sharing for now.

As this blog also got referenced in another forum, I will add my observations regarding this comparison here as well:

Someone pointed me to this comparison some time ago, because it seemed to mention a bug in my compiler (vbcc). Trying to verify this was made somewhat tedious, because the author of this comparison uses his own simulator and, in the case of vbcc, his own linker scripts and configuration files. After a short look I found that the test that did not work with vbcc uses the pages 0x300 and 0x400 to write the results into a simulated frame buffer or something like that. However, his vbcc linker file contains:

MEMORY
{
  ...
  ram:     org=0x0300, len=0x0500
}

SECTIONS
{
  ...
  data:   {*(data)} >ram AT>out
  ...
  bss (NOLOAD): {*(bss)} >ram
  ...
}

I did not investigate further, but putting the data and bss section in the frame buffer does seem suspicious to me. As I did not want to waste much time with the tinkered configs, I slightly adapted the code to the C64 screen buffer and the result compiled with vbcc for C64 looked very similar to the one compiled with cc65. Strangely however, the player "sprite" only showed with cc65. Further investigation showed that the test code used an uninitialized variable for the y-coordinate of the player. After fixing this bug in the test, the result of vbcc exactly matched that of cc65.

When I tried to add a timer variable to measure run-time on the C64, the code did not compile anymore on cc65, because it exceeded the 256 byte limitation of cc65. Apparently the test was exactly tailored to cc65's limitations whereas the vbcc result was basically sabotaged. It is obviously not an unbiased comparison but rather the author started with code for cc65 and did no further investigations when the code did not work with a compiler he personally dislikes (whereas for gcc which apparently generated actually broken code he even went out of his way to fix the assembly code by hand). Using this approach your daily-use compiler will of course tend to look more stable. That does not really say much.

As I wrote above, I am the author of the compiler that the author of this comparison hates, so obviously I am not unbiased either. When I checked his article, I did not write anything and I did not really want to get involved (and write lengthy posts like this one). However, if people point to this blog, I have to say that after what I have checked so far, it is my firm (and, as my findings hopefully show, mostly fact-based) opinion that this comparison is much too flawed to base a compiler decision on.
