gcc-6502 vs cc65

ivop · March 2, 2018

Hi,

I have done some work on gcc-6502 to support the Atari executable format and some other small support stuff and bug fixes. If you want to try it out, try the following: https://github.com/ivop/gcc-6502-bits

$ git clone https://github.com/ivop/gcc-6502-bits.git
$ cd gcc-6502-bits
$ git clone https://github.com/ivop/gcc-6502.git gcc-src

$ CC65_PATH=/path/to/your/cc65/bin ./build.sh 2>&1 | tee build.log

Now you should have a 6502-gcc toolchain in prefix/bin. cc65 is used as "binutils" only (i.e. assembling and linking).

A small test:

#include <stdio.h>
#include <stdint.h>

#ifdef __CC65__
#pragma static-locals(1);
#endif

#define COUNT           16384           /* Up to what number? */
#define SQRT_COUNT      128             /* Sqrt of COUNT */

uint8_t *const RTCL20 = (uint8_t *) 20;
uint8_t *const RTCL19 = (uint8_t *) 19;
uint8_t *const RTCL18 = (uint8_t *) 18;

uint8_t sieve[COUNT];
uint16_t ticks;

int main (void) {
    register unsigned char* s;
    register unsigned       i, j;

    printf ("Sieve benchmark - calculating primes\n");
    printf ("between 2 and %u\n", COUNT);

    ticks = *RTCL20 + (*RTCL19 << ;

    i = 2;
    while (i < SQRT_COUNT) {
        if (!sieve[i]) {
            /* Prime number - mark multiples */
            j = i*2;
            s = &sieve[j];
            while (j < COUNT) {
                *s = 1;
                s += i;
                j += i;
            }
        }
        ++i;
    }

    ticks = (*RTCL20 + (*RTCL19 << ) - ticks;

    printf("Ticks used: %d\n", ticks);
    printf("%d.%03d seconds\n", ticks/50, ticks%50*20);
    for (i = 2; i < COUNT; ++i)
        if (!sieve[i]) printf ("%d, ", i);

    return 0;
}

cc65: 301 ticks (6.02s)
gcc: 97 ticks (1.94s)

$ ls -l *.xex
-rw-r--r-- 1 ivo ivo 3712 Mar  2 13:26 sieve-cc65.xex
-rw-r--r-- 1 ivo ivo 4861 Mar  2 13:40 sieve-gcc.xex

Note that gcc-6502 has an extremely limited libc for now and is in no way a drop-in replacement for cc65. Nevertheless, it's pretty functional and fast.

Edit: BTW this is based on puppeh's m65x-gcc7 branch.

sieve.zip

Edited March 2, 2018 by ivop

carlsson · March 2, 2018

I've noticed that in the Oric community, the lcc65 is commonly used instead of cc65. I haven't checked how complete it is or if it generates better code. I know there used to be a few commercial cross compilers as well, but perhaps a sane gcc would eventually outperform those anyway given that the commercial ones were made some 30 years ago.

Irgendwer · March 2, 2018

Nice! Always good to have a choice.

cc65: 301 ticks (6.02s)
gcc: 97 ticks (1.94s)

cc65 (with "-Oirs" optimization):

cc65: 75 ticks (1.500s)

Edit:

File sizes are also interesting:

gcc: 4861 Bytes

cc65: 3712 Bytes

cc65 -Oirs: 3502 Bytes

Edited March 2, 2018 by Irgendwer

ivop · March 2, 2018

It has been a long time since I did something with cc65, so thanks for pointing out the best optimization flags

File size difference is because of -O3, but mostly because of a different printf implementation (c-only). -Os is 4792 bytes. cc65 has a printf in assembly IIRC.

@carlsson, I have been looking for the lcc 6502 backend source code for years, but it seems to be closed. There's a 816 version though in the SNES-SDK.

ivop · March 2, 2018

Here's something more beefier than sieve.c.

$ cl65 --static-locals -t atari -Oirs -o dhrystone-cc65.xex dhry_1.c dhry_2.c
$ 6502-gcc -mmach=atari -o dhrystone-gcc.xex dhry_1.c dhry_2.c -O3

cc65, 1000 runs, 474 ticks (9.480s)

gcc-6502, 1000 runs, 378 ticks (7.560s)

dhrystone.zip

ivop · March 3, 2018

This calculates an MD5 hash for a 512 bytes buffer, five times for benchmarking:

execution times:

md5-cc65.xex:       811 ticks (16.220s)
md5-gcc.xex:        457 ticks ( 9.140s)
md5-gcc-O0.xex:     457 ticks ( 9.140s)
md5-gcc-Os.xex:     445 ticks ( 8.900s)
md5-gcc-O1.xex:     427 ticks ( 8.540s)
md5-gcc-O2.xex:     404 ticks ( 8.080s)
md5-gcc-O3.xex:     404 ticks ( 8.080s)
md5-gcc-Ofast.xex:  404 ticks ( 8.080s)

file sizes:

  8493 md5-cc65.xex
 10689 md5-gcc-Os.xex
 11269 md5-gcc-O2.xex
 11285 md5-gcc-O3.xex
 11285 md5-gcc-Ofast.xex
 11545 md5-gcc-O1.xex
 12322 md5-gcc-O0.xex
 12322 md5-gcc.xex

For comparison:

$ md5sum 512bytes.dat
f73654db8a587962996923d44a99d420 512bytes.dat

BTW all this tests are done on a PAL system and one tick is hardcoded to be 20ms.

Edit: fixed numbers

md5.zip

Edited March 3, 2018 by ivop

VladR · March 3, 2018

This is amazing!

What would it take or would be needed (except me implementing it ) to have at least some primitive C++ support ? And I don't mean something insane like C99.

Literally just few most basic things from C++:

- classes with polymorphism and inheritance

- proper parameter overloading

It doesn't have to be an optimized 6502 code or anything, just something that compiles and could be used for rapid prototyping (or for turn-based games, where performance does not matter).

ivop · March 3, 2018

I have not looked into C++ yet. I suppose we need at least a rudimentary libstdc++.

Edit: I'll try compiling with --enable-languages=c,c++ and see what's missing

Edited March 3, 2018 by ivop

Wrathchild · March 3, 2018

what are the headline differences/approach of the assembler produced?

ivop · March 3, 2018

The backend is located in gcc-src/gcc/config/6502

Instead of the AX register combination and a lot of soft stack manipulations, it uses 48+ zero page locations as registers, including combinations for 16 bits and beyond, and let the gcc register allocator do its magic A, X, and Y are shadowed in page 0, too.

ivop · March 3, 2018

As for C++ support, it has been too long since I dabbled with the gcc internals (think 2.9.x era). A LOT has changed since. They even completely rewrote the c-parser in C++. I think it's best to ask puppeh what needs to be done to at least compile a couple of classes and cout << some string. It's beyond me at this point in time as I do not sufficiently know my way around the source code anymore.

VladR · March 3, 2018

I have not looked into C++ yet. I suppose we need at least a rudimentary libstdc++.

Edit: I'll try compiling with --enable-languages=c,c++ and see what's missing

I would truly appreciate that. If for nothing else, at least to see what kind of effort would be needed to get the lowest possible level of support for C++ features. Highly likely, the dependencies won't make it easy to extract just few features (e.g. initially disable templates, streams, STL, exceptions, etc.) but if it'd make things easier, that'd be the best route).

Because I am really perplexed that nobody seems to miss C++ on 6502. In past I heard absurdities like - "oh, it'd be slow", yet the utter insanity of using floats in standard basic even for loops has been just fine for half a century.

Of course, it doesn't have to be native - I reckon it'd have a hard time fitting within 64 KB

A linux/PC-based linker producing a 6502 binary is absolutely fine.

VladR · March 3, 2018

As for C++ support, it has been too long since I dabbled with the gcc internals (think 2.9.x era). A LOT has changed since. They even completely rewrote the c-parser in C++. I think it's best to ask puppeh what needs to be done to at least compile a couple of classes and cout << some string. It's beyond me at this point in time as I do not sufficiently know my way around the source code anymore.

Well, C++0X, C++11, C++14 took its toll, obviously.

They still host the source code from around 1998 ? Because even that kind of feature set would be not needed.

VladR · March 3, 2018

As for C++ support, it has been too long since I dabbled with the gcc internals (think 2.9.x era). A LOT has changed since. They even completely rewrote the c-parser in C++. I think it's best to ask puppeh what needs to be done to at least compile a couple of classes and cout << some string.

Streams and strings - while undeniably nice are far from the impact of classes : I would consider those a syntactic sugar on 6502

It's really the classes and polymorphism that enable to write highly modular, reusable, instantly readable and debuggable code, especially for things like RPG games.

ivop · March 3, 2018

I think back-porting this 6502 backend to the 2.95.x era gcc is going to be a huge task in itself and then there's still no C++ support out of the box.

There's also https://github.com/JuliaComputing/llvm-cbe, the resurrected c backend for llvm. In theory, you can compile C++ code with clang to llvm-ir, use llvm-cbe to generate c code and compile that with 6502-gcc

ivop · March 3, 2018

Re: llvm-cbe. The code produced by this backend is horrible and relies heavily on uint128_t type. This can be emulated with uint64_t pairs, but I guess it suffices to say the code runs slooooow. Another path to simple C++ support may be this: https://github.com/seyko2/cfront-3

Irgendwer · March 3, 2018

Because I am really perplexed that nobody seems to miss C++ on 6502. In past I heard absurdities like - "oh, it'd be slow", yet the utter insanity of using floats in standard basic even for loops has been just fine for half a century.

Of course, it doesn't have to be native - I reckon it'd have a hard time fitting within 64 KB

A linux/PC-based linker producing a 6502 binary is absolutely fine.

A problem is the memory footprint of the resulting executable. Not that I didn't find C++ appealing, but the overhead is just too big for our little machines. So chances are high that you're not even getting a somewhat slower, but also a "too large for your target" application. Knowing that, CC65 favours by design at many places a smaller size than a higher speed.

But aren't just these restrictions what us attracts to these old computers? ;-)

VladR · March 4, 2018

A problem is the memory footprint of the resulting executable. Not that I didn't find C++ appealing, but the overhead is just too big for our little machines. So chances are high that you're not even getting a somewhat slower, but also a "too large for your target" application.

I don't accept this as an argument, as this has been constantly used for last 2-3 decades, without any hard data. Whatsoever.

We have 4 MB expansions now. Even without that, we got 128 KB in stock Atari. And 256-320 KB expansions were pretty common over a decade ago. That's a lot for 6502 code.

Let me do a bit of a research work here, in this thread, so we are on the same page. Here's the quick draft of the base class common to all players and enemies in an hypothetical RPG game:

class CParameters
{
    public:
  char Strength;
  char Speed;
  char Agility;

  char Armor;
  char Health;

  char Level;
  char Experience;

  void Damage (char value)
  {
    Health -= value;
  }
  void Heal   (char value);
  {
    Health += value;
  }

  void Initialize ()
  {
    Experience = 0;
    Level = 1;
    Health = 100;
    Armor = 0;
    Strength = 1;
    Speed = 1;
    Agility = 1;
  }
  
  CParameters ()
  {
    Initialize ();
  }
};

Here's the declarations and instantiations. Note that they're global, hence at compile-time we (compiler) already know all addresses.

  // The following instances are global, thus removing any need for run-time instantiation and heap memory management
CParameters Player;
  // Each enemy's parameters will be set upon encountering him in the dungeon
CParameters EnemyGhost;
CParameters EnemyZombie;
CParameters EnemySpider;

There's billion ways how to implement it, so the following one is just one of the many and I don't claim it's the fastest, shortest, but I do believe it's fast enough:

- each class has 7 Bytes of data
- compiler thus can determine the address of each global instance:
  $1000:Player
  $1007:EnemyGhost
  $100E:EnemyZombie
  $1015:EnemySpider

- the code for the methods will address the class variables via 0-page ptr : InstancePtr

Here's what a code for Initialize could look like:

CParameter_Initialize:

  ;Strength
LDA #1
LDY #0
STA (InstancePtr),Y
  ;Speed
LDA #1
LDY #1
STA (InstancePtr),Y
  ;Agility
LDA #1
LDY #2
STA (InstancePtr),Y
  ;Armor
LDA #0
LDY #3
STA (InstancePtr),Y
  ;Health
LDA #100
LDY #4
STA (InstancePtr),Y
  ;Level
LDA #1
LDY #5
STA (InstancePtr),Y
  ;Experience
LDA #0
LDY #6
STA (InstancePtr),Y

RTS

; When a method call is encountered, compiler will simply put the address of the instance into InstancePtr (10 cycles) and call the method:
LDX #$10           ; 2c
LDY #$00           ; 2c
STX InstancePtrHi  ; 3c
STY InstancePtrLo  ; 3c
JSR CParameter_Initialize ;

Now, here's the code for Heal/Damage (for the example's simplicity, does not deal with overflows)

CParameters_Heal:

LDY #4
LDA FunctionParameter1
CLC
ADC (InstancePtr),Y
STA (InstancePtr),Y

RTS


CParameters_Damage:

LDY #4
LDA (InstancePtr),Y
SEC
SBC FunctionParameter1
STA (InstancePtr),Y

RTS


; Upon encountering the calls to these methods, compiler - seeing that there's just 1 parameter - could avoid the stack, and just store the desired value to FunctionParameter1 address and then just JSR to the function:

LDA ValueOfParameter    ; 4c
STA FunctionParameter1  ; 4c
JSR CParameters_Heal    ; 6c

So, we finally get to the actual run-time usage:


This call
......
Player.Heal (MediumMedkitValue);
......

can be translated as:
LDX #$10               ; 2c
LDY #$00               ; 2c
STX InstancePtrHi      ; 3c
STY InstancePtrLo      ; 3c
LDA MediumMedkitValue  ; 4c
STA FunctionParameter1 ; 4c
JSR CParameter_Heal    ; 6c


Of course, compiler should also store/restore the previous value of InstancePtr, as those calls may be nested, but that's outside of scope of this example.

24 cycles. Damn! Of 29,895c only 29,871 cycles are remaining, at 50 fps (160x96x4) ! Too slow !!! Back to BASIC :lol:

Sarcasm aside, the default usage scenario, without virtual methods can be fast (and even those are really only a double look-up, so far from the horrors that C++ books make it out to be). Of course, the above example is simplified, as compiler has to take into account few other things, but there's no reason, that it couldn't have a codepath like the above, when it encounters this particular case.

Irgendwer · March 4, 2018

I don't accept this as an argument, as this has been constantly used for last 2-3 decades, without any hard data. Whatsoever.

We have 4 MB expansions now. Even without that, we got 128 KB in stock Atari. And 256-320 KB expansions were pretty common over a decade ago. That's a lot for 6502 code.

...

Let me do a bit of a research work here, in this thread, so we are on the same page. Here's the quick draft of the base class common to all players and enemies in an hypothetical RPG game:
24 cycles. Damn! Of 29,895c only 29,871 cycles are remaining, at 50 fps (160x96x4) ! Too slow !!! Back to BASIC

I see your point and know that a lot of optimizations are available today ( http://atariage.com/forums/topic/259931-atari-8-bit-games-in-c-cc65/?do=findComment&comment=3657588) and like said contrary to the memory footprint the speed maybe not of much an issue. But a 320k machine with 8k chunks of segmented memory is not a typical C target and also for me not a typical retro machine any more. Like said, the limited resources are the challenge I like, and carrying around f.e. the instance pointer is just counterproductive in such a scenario (there is a reason for static-locals in CC65).

The 6502 is even an ambitious C target: 256 bytes hard located hardware stack make things difficult and your memory expansion doesn't help you with that, nor the limited space in zero page.

That said, I like to see things evolving and maybe some day I jump on the C++ train for 6502 development too...

Edited March 4, 2018 by Irgendwer

Wrathchild · March 4, 2018

As a slight diversion, seeing as c++ would be fun - we've seen recently the C++17 thread for the C64 and comments such as here.

In terms of OO support I would feel that having something that takes the byte-code of C# or Java and translates that into 6502 asm would be pretty good.

flashjazzcat · March 4, 2018

But a 320k machine with 8k chunks of segmented memory is not a typical C target and also for me not a typical retro machine any more. Like said, the limited resources are the challenge I like, and carrying around f.e. the instance pointer is just counterproductive in such a scenario (there is a reason for static-locals in CC65).

16KB banking windows, but aside from whether or not this represents a typical retro machine, lots of banked extended RAM indeed does not present a free pass to produce executables 100s of KBs in size. Inter-bank calls aren't such a problem, but the classic issue is how to directly access data in one extended bank from code in another without some kind of costly indirection. One can strategically position code and data to get around this (see SpartaDOS X for examples of well designed memory management strategies), but once code and data break out of the 64KB limit, the (welcome) challenges imposed by limited resources become even more interesting. I wrote a page-based memory allocator for my GOS which can access up to 1MB of extended memory, but since (relocatable) parallel processes live in the exact same address space as the allocated memory, it's far from plain sailing.

VladR · March 4, 2018

But a 320k machine with 8k chunks of segmented memory is not a typical C target and also for me not a typical retro machine any more.

I agree, but there's no reason why the compiler couldn't support something like .ORG - e.g. placing the requested code chunk at a specific address (e.g. $4000, $8000, ...), to allow for the bank switching (16 KB, btw).

It doesn't have to do any checking or anything and that's alright. Hell, the Jaguar compiler doesn't. It's your job to figure out how to know that your RISC code didn't fit into the 4 KB cache. I had to do macros that were notifying my of that during the build, as the tools there are grossly - uhm- suboptimal.

Like said, the limited resources are the challenge I like, and carrying around f.e. the instance pointer is just counterproductive in such a scenario (there is a reason for static-locals in CC65).

There's a reason why I didn't use stack You run the risk of running out of stack. Hell, that was easy even on 80286 and using Pascal The whole point of using C++ is to be able to use expressive / behavioral language, so you can go 10 levels deep (if you wish to suffer the size&performance penalty associated with it - which for early prototyping should not be a concern, ever), but at the very least, 5. If the compiler was placing parameters and everything on stack, we'd run out of HW stack long before that.

Of course, if a compiler developer was generous, he could make this optional (as a separate codepath/flag in the compiler), so it could be up to coder to risk the stack overflow. I wouldn't be against that

The 6502 is even an ambitious C target: 256 bytes hard located hardware stack make things difficult and your memory expansion doesn't help you with that, nor the limited space in zero page.

Correct. If you're using stack for parameters and nested calls. But that should be, like, one of first two things, to come up in the compiler design for 6502, no ?

As for zero page usage, I'd rather see some hard data - internal statistics from the compiler, to make sure zeropage is really used only for the most common stuff (whatever may that be).

I can't say, how many times, I was surprised, once I introduced benchmark counters for various pieces of engine components. Once the thing is scattered, the truth is, you have no idea exactly how much is something called. Global counters help me with that and give my instincts a run for their money

VladR · March 4, 2018

Inter-bank calls aren't such a problem, but the classic issue is how to directly access data in one extended bank from code in another without some kind of costly indirection.

Oh, that's very simple actually. You just don't do that

This shouldn't be a compiler's job, but programmer's, of course. You have to split the whole engine into components that can be run serially, with minimum amount of shared data from the main 64 KB.

This is the exact same problem as on jaguar - you run of 4 KB GPU cache veeery quickly, so one has to split the whole engine into serial components, each massaging the output data of previous stage, just enough for next 4 KB code chunk to proceed. Some games have over a dozen of those chunks

Except, on jaguar, the performance penalty of using another 4 KB chunk is pretty brutal:

- you have to first wait for Blitter to finish it's current job

- this locks GPU, which is bleeding tens of thousands of cycles in that timeframe

- when Blitter finally becomes available, you initiate the 4 kB blit

- again, Blitter is not doing anything meaningful, just working around the incredible HW design limitation by copying the next chunk

- this locks the GPU, who obviously, can't do anything else, as it's working RAM is being flushed down the toilet, so it just burns thousands of cycles again.

- as a byproduct, since the code is in main RAM, the blit locks out 68000, who up to this point was running some low-maintenance code in parallel, but as both can't access RAM, 68000 is sitting on bench too

From what I hear you guys say, on 6502, the bank switch takes about a cycle or two, correct ? That's incredible for such a machine, really !

Wrathchild · March 4, 2018

This shouldn't be a compiler's job, but programmer's, of course. You have to split the whole engine into components that can be run serially, with minimum amount of shared data from the main 64 KB.

I'd agree as I'd tried an approach a long time back with C and jump tables. There are some subtleties though as the context of the bank needs to included on the stack.

Because there is a pattern to it though, rather than the job of the compiler it could be the job of the linker, given that it should be able to spot where the cross bank calls are occurring.

Having evolved myself from assembler on very limit 8-bit PIC micros in the 90's, e.g. PIC16C71, onto the C compilers that target them, it was nice to not have to be concerned about subtle little things like some page-boundary things as these were avoided for you.

The quality of some of the code produced by is high though its a shame (but understandable) that the majority of compilers for PICs/Adruino etc are propriety and not open-source as there'd be a lot to learn from them.

flashjazzcat · March 4, 2018

I certainly wasn't implicating it should be the compiler's job to worry about bank management. But C programmers especially are used to calling the memory allocator and obtaining a handle to a buffer they can address directly. My solution with regard to GOS applications is - as was suggested - to simply not call the allocator at all if it's possible to avoid it, and instead allocate buffers at fixed addresses at the end of the executable. The relocating loader counts the uninitialised blocks as part of the overall program size and requests the total from the kernel's memory allocator. Thus the executable is guaranteed direct access to its buffers. But for big data this simply won't do. It would be possible for large GOS executables to span multiple banks (by being non-relocatable: the loader will simply load consecutive absolute segments into discreet 16KB banks and eventually return a list of bank numbers for the purpose of the application's inter-bank calling mechanism), and something similar would be necessary on a non-multitasking system unless one knew in advance the actual PORTB bits of each memory bank on the target system.

Unfortunately - in order to actually accomplish anything useful - one will eventually require direct access to a large (and by large, I mean 16KB) amounts of linear RAM. Of course I may be approaching things from the point of view of someone who writes applications which commonly manipulate large amounts of linear data and which have to do so as efficiently as possible, so I suppose YMMV.

Edited March 4, 2018 by flashjazzcat

gcc-6502 vs cc65

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members