
gcc-6502 vs cc65


Recommended Posts

(16 KB, btw).

 

Yes, sorry for that; my mind seemed to be stuck on a cartridge banking scheme.

 

 

There's a reason why I didn't use the stack :) You run the risk of running out of stack space. Hell, that was easy to do even on an 80286 using Pascal :) The whole point of using C++ is to be able to use an expressive, behavioral language, so you can go 10 levels deep (if you are willing to suffer the size and performance penalty associated with it - which for early prototyping should never be a concern), but at the very least 5. If the compiler was placing parameters and everything else on the HW stack, we'd run out of it long before that.

 

No explanations needed. CC65 also uses the HW stack only for return addresses.

 

But even a soft-stack is quite slow on a 6502:

https://github.com/cc65/cc65/blob/master/libsrc/runtime/popa.s

https://github.com/cc65/cc65/blob/master/libsrc/runtime/pushax.s

 

So how would you realize parameter passing? By having fixed, reserved addresses for each function? Would return values be retrieved from another fixed address shared by all functions?

Would you completely abstain from supporting recursion?

 

I find much truth in http://atariage.com/forums/topic/274341-taking-advantage-of-c17-to-build-6502-code/?do=findComment&comment=3935969

 

 

From what I hear you guys say, on the 6502 the bank switch takes about a cycle or two, correct? That's incredible for such a machine, really!

 

The shortest sequence I can imagine would be

 

LDA #SOMEBANK

STA PORTB

 

that would be 6 cycles in the best absolute case.


I certainly wasn't implying it should be the compiler's job to worry about bank management. But C programmers especially are used to calling the memory allocator and obtaining a handle to a buffer they can address directly.

Yeah, when banking the code, we must unfortunately totally forget the classic heap functionality via malloc, as every single bank block starts at the same address (well, that's assuming the compiler is actually able to create an unlimited number of 16 KB blocks that all start at the same address).

It's a totally different approach compared to linear unbanked RAM - kind of a paradigm shift, where instead of looking at your program as a nice, coherent, single chunk of linear code, it's broken down into a dozen dumb pieces, with potentially substantial code duplication (we can't access that small function in bank 03, as we're in bank 05 now, so we just have to duplicate it into bank 05, as otherwise we'd run out of main RAM very quickly if we put all the reused code there), pieces that have very limited options for talking to each other, other than up/down the chain one level.

 

 

But for big data this simply won't do. It would be possible for large GOS executables to span multiple banks (by being non-relocatable: the loader will simply load consecutive absolute segments into discrete 16 KB banks and eventually return a list of bank numbers for the purpose of the application's inter-bank calling mechanism), and something similar would be necessary on a non-multitasking system unless one knew in advance the actual PORTB bits of each memory bank on the target system.

If I'm understanding you correctly, I don't think there's a way for the compiler to automate / foolproof this. All we can really ask of the compiler is the bank number that we want for a particular block of functions, and it's up to the coder not to screw up.

If you, as a coder, use an address from a different bank, then - well - it's like dereferencing an uninitialized pointer :)

However, given that, unlike on the Jaguar, we have zero penalty for switching banks, we can simply pair code and data within the same 16 KB. That should substantially reduce the need for different banks to talk to each other during execution.

 

For example, in a 3D engine, you can't do clipping before you've done the frustum culling, and once it's done, you can forget about it - all these pipeline stages are nicely linear and dependent on each other, so you could do something like:

Bank 0: Code: Frustum Culling, Data: Level's 3D vertices (8 KB)

Bank 1: Code: 3D Transform, Data: table for Z (12 KB)

Bank 2: Code: Line/Edge computation, Data: FixedPoint table (12 KB)

Bank 3: Code: Heavily unrolled Polygon Rasterizing + Clipping (8 KB), Data: various small acceleration tables

Bank 4: Code: Unrolled framebuffer 1 Clear (12 KB)

Bank 5: Code: Unrolled framebuffer 2 Clear (12 KB)

Bank 6: Code: Menus, Data: Menu gfx (8 KB)

 

While in linear RAM all of the above might take up only 72 KB, broken down into banks it will consume 112 KB. But all of those stages are totally independent and so can happily be isolated in their specific banks.

 

Each stage would only write a little data to main RAM, so that the next stage could proceed. The downside is that there would be some waste of RAM, since you can't merge two stages if their combined data is > 16 KB. But with 1 MB - 4 MB extensions, it's not a huge problem that each bank holds some empty space.
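In cc65 terms, the pairing could be expressed with segment pragmas - purely a sketch, since the segment names (BANK0_CODE / BANK0_DATA) and the custom linker config that would map them into one 16 KB bank are assumptions on my part, not anything the compiler sets up for you:

/* Sketch only: pin the frustum-culling code and its vertex data into
   segments that a custom linker config would place into the same 16 KB bank.
   Segment names and the linker config are assumed, not provided by cc65. */

#pragma code-name (push, "BANK0_CODE")
#pragma rodata-name (push, "BANK0_DATA")

static const signed char level_vertices[] = {
    10, 20, -5,     /* the level's 3D vertices live next to the code using them */
    12, 22, -5
};

void frustum_cull(void)
{
    /* ... culling code reading level_vertices directly ... */
}

#pragma rodata-name (pop)
#pragma code-name (pop)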

 

Unfortunately - in order to actually accomplish anything useful - one will eventually require direct access to a large (and by large, I mean 16 KB) amount of linear RAM. Of course I may be approaching things from the point of view of someone who writes applications which commonly manipulate large amounts of linear data and which have to do so as efficiently as possible, so I suppose YMMV.

Well, this is a game site, so if you take a look at the breakdown of code/data above, a game engine is actually a pretty useful application (if wasteful, but that's a small price for doing something like this on a natively 64 KB system).


Yeah, when banking the code, we must unfortunately totally forget the classic heap functionality via malloc, as every single bank block starts at the same address (well, that's assuming the compiler is actually able to create an unlimited number of 16 KB blocks that all start at the same address).

Don't I know it. It's so disappointing that only the kernel and UI library get to call the nice bitfield-managed memory allocator with complete freedom. :)

 

It's a totally different approach compared to linear unbanked RAM - kind of a paradigm shift, where instead of looking at your program as a nice, coherent, single chunk of linear code, it's broken down into a dozen dumb pieces, with potentially substantial code duplication (we can't access that small function in bank 03, as we're in bank 05 now, so we just have to duplicate it into bank 05, as otherwise we'd run out of main RAM very quickly if we put all the reused code there), pieces that have very limited options for talking to each other, other than up/down the chain one level.

Yep. One saving grace in a multitasking environment is that I can spawn child processes and communicate with them via message-passing. That should help avoid huge executables. :)

 

If I'm understanding you correctly, I don't think there's a way for the compiler to automate / foolproof this. All we can really ask of the compiler is the bank number that we want for a particular block of functions, and it's up to the coder not to screw up.

There's a simple solution to the problem I was referring to. If DOS doesn't already present a list of PORTB banking values which your inter-bank calling mechanism can index via an abstracted "bank number", then it's easy enough for the application to build such a list at load time.
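Roughly something like this, say (a sketch only - the table size, the helper, and the assumption that the caller itself lives outside the banked window are mine; the actual PORTB values depend on the expansion and would be filled in by DOS or by a detection routine that is omitted here):

/* Hypothetical sketch: an abstract "bank number" -> PORTB value table,
   filled once at load time, plus a far-call helper that selects the bank,
   calls into it, and restores the previous PORTB value afterwards. */

#define PORTB      (*(volatile unsigned char *)0xD301)
#define MAX_BANKS  64

static unsigned char bank_portb[MAX_BANKS];   /* filled at load time */
static unsigned char bank_count;

/* far_call itself (and its caller) must live in non-banked RAM;
   fn must point into the banked 16 KB window of the selected bank. */
void far_call(unsigned char bank, void (*fn)(void))
{
    unsigned char previous = PORTB;
    PORTB = bank_portb[bank];
    fn();
    PORTB = previous;
}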

 

Definitely, coding for non-linear layouts (especially 1.5-2 KB PBI ROMs) requires some ingenuity. :)

 

Well, this is a game site...

It is? I wonder what on earth I have been doing here publishing applications and drivers for the past ten years, then. :)


Well, this is a game site...


It is? I wonder what on earth I have been doing here publishing applications and drivers for the past ten years, then.

 

I thought it was an engineering site, no, a coding site, um, a history site, maybe a nostalgia site, perhaps a conversation site... and an archive....

 

oh heck it's all of these things and more.... go figure!

 

Some great applications have been made for the Atari, x86, and Motorola machines over the years, many of which are used to create other things.

Edited by _The Doctor__

No explanations needed. CC65 also uses the HW stack only for return addresses.

 

But even a soft-stack is quite slow on a 6502:

https://github.com/cc65/cc65/blob/master/libsrc/runtime/popa.s

https://github.com/cc65/cc65/blob/master/libsrc/runtime/pushax.s

That's good to know that CC65 uses it only for return addresses, thanks!

I'm pretty sure I wouldn't mind the slower soft-stack, for the enhanced productivity of not having to constantly browse a 10-page-long list of global variables, and instead directly seeing the 5 variables that the given function requires.

Once the algorithm works, a coder can always refactor for speed - e.g. replace local variables with global ones, thus removing the expensive parameter (re)store from the equation, altogether.
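cc65 actually exposes that trade-off directly: with the --static-locals option (used later in this thread) or the matching pragma, locals go into fixed storage instead of the soft stack - faster, but the function stops being reentrant. A tiny made-up example:

/* Made-up example: with "#pragma static-locals (on)" (or cl65 --static-locals),
   dx and dy live at fixed addresses instead of on the soft stack.
   Faster, but the function is no longer reentrant, so no recursion. */
#pragma static-locals (on)

unsigned char manhattan(unsigned char x1, unsigned char y1,
                        unsigned char x2, unsigned char y2)
{
    unsigned char dx = (x1 > x2) ? (x1 - x2) : (x2 - x1);
    unsigned char dy = (y1 > y2) ? (y1 - y2) : (y2 - y1);
    return dx + dy;
}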

 

 

So how would you realize parameter passing? By having fixed, reserved addresses for each function? Would return values be retrieved from another fixed address shared by all functions?

Would you completely abstain from supporting recursion?

Well, there's no manager cracking a whip over our necks saying this release needs to ship next Monday :)

So, ideally, a coder would be able to choose the implementation method of parameter passing for each function, wrapping the call in a relevant macro (so the compiler can select the appropriate code path).
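Something along these lines, say (everything below - the fixed parameter slots, the macro, the function - is hypothetical, just to show the shape of it):

/* Hypothetical: a "fast" function gets fixed, reserved parameter slots
   instead of stack traffic; the macro only keeps the call sites readable.
   Shared slots mean no recursion and no reentrancy, of course. */

unsigned char draw_sprite_x;     /* fixed parameter slots for draw_sprite() */
unsigned char draw_sprite_y;
unsigned char draw_sprite_id;

void draw_sprite(void)
{
    /* ... plot sprite draw_sprite_id at (draw_sprite_x, draw_sprite_y) ... */
}

#define DRAW_SPRITE(x, y, id)     \
    do {                          \
        draw_sprite_x  = (x);     \
        draw_sprite_y  = (y);     \
        draw_sprite_id = (id);    \
        draw_sprite();            \
    } while (0)

/* call site: DRAW_SPRITE(40, 100, 3); */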

 

Recursion would be a special case, which would only have one way of parameter passing. I wouldn't mind at all if recursion wasn't supported anyway - it's too much of an overhead for 8-bit. I'm pretty sure that in my PC engine I refactored every single recursion into a series of nested loops, except the A* algorithm for pathfinding, which I later replaced with a totally different, linear one.
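The usual shape of such a refactor, just as an illustration (the tree type and the depth limit are invented): the recursive walk becomes a loop over a small explicit stack, so the 256-byte HW stack never gets involved.

/* Invented example: depth-first tree walk without recursion,
   using a small fixed-size explicit stack. */
#define MAX_DEPTH 16

struct node {
    struct node  *left;
    struct node  *right;
    unsigned char value;
};

unsigned char sum_tree(struct node *root)
{
    struct node  *stack[MAX_DEPTH];
    unsigned char top = 0;
    unsigned char sum = 0;

    if (root) stack[top++] = root;
    while (top) {
        struct node *n = stack[--top];
        sum += n->value;
        if (n->right) stack[top++] = n->right;
        if (n->left)  stack[top++] = n->left;
    }
    return sum;
}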

 

 

 

The shortest sequence I can imagine would be

 

LDA #SOMEBANK

STA PORTB

 

that would be 6 cycles in the best absolute case.

Oh, yeah - sorry - I didn't mean the selection of the bank - that's highly likely going to involve some indirect addressing, tables, and restoring of registers. So, a dozen (or a few dozen) cycles. Still beats copying the 16 KB of data with LDA/STA :)

I meant that once the value is stored to PORTB, the HW can make the new bank available within a few cycles. I believe flashjazzcat replied to me recently in one of the threads and mentioned that there is actually behavior that relies on the banked data being available in either the first or second (don't recall exactly now) cycle after PORTB is set.

 

I don't quite understand how it's remotely technically possible to do it in 1-2 cycles, especially at 1.79 MHz?! I guess it may make sense if you know and see the memory controller design. Must have been in the top 3 of the HW design requirements, for sure!

Still, compared to the Jaguar's mighty 64-bit HW, it's mind-blowing that a machine from the '70s can swap between banks (in extensions that, at currently 4 MB, double the Jaguar's RAM size) in a bloody cycle!


I believe flashjazzcat replied to me recently in one of the threads and mentioned that there is actually behavior that relies on the banked data being available in either the first or second (don't recall exactly now) cycle after PORTB is set.

It's often useful to have your inter-bank calling mechanism at the same address in every bank (especially in ROM), so you can say something like:

 

lda #banknum

sta bankreg

jmp target

 

The bank switch happens right at the end of the second instruction, and execution of instruction three happens in a different bank. So if bank switching wasn't essentially instantaneous, this would never work.


It's often useful to have your inter-bank calling mechanism at the same address in every bank (especially in ROM), so you can say something like:

 

lda #banknum

sta bankreg

jmp target

 

The bank switch happens right at the end of the second instruction, and execution of instruction three happens in a different bank. So if bank switching wasn't essentially instantaneous, this would never work.

Thanks! I always presumed a couple of NOPs had to be inserted (or a VBL wait was in order, or some other HW register would be updated by the bank switching mechanism). Haven't yet played with the actual bank switching (but will have to, soon).

 

Truly amazing HW design for the '70s!


Hi VladR, if you do not care about speed, why not go the JMV-on-6502 path, https://mzattera.github.io/b2fJ/, as someone suggested beforehand? This could simplify the bank switching as well, as the bytecode could be spread just about any way one wishes, as long as the VM runs in main memory. It seems not only a much more viable approach, but also imagine really fast prototyping with Jython :)


 

Not really, it's more or less just a switch which redirects lines of the address bus to other memory locations. Nothing to copy or swap.

Well, I'm sure you would understand my excitement about this feature if you had spent a similar amount of effort - and especially sanity - like I did, attempting to tame the Jaguar's RISC :lol:

 

Hi VladR, if you do not care about speed, why not go the JMV-on-6502 path, https://mzattera.github.io/b2fJ/, as someone suggested beforehand? This could simplify the bank switching as well, as the bytecode could be spread just about any way one wishes, as long as the VM runs in main memory. It seems not only a much more viable approach, but also imagine really fast prototyping with Jython :)

I'll need to have a look at it, as I'm not familiar with JMV. Thanks

Edited by VladR

That was a typo - should have been Java Virtual Machine (JVM) - the Lego Mindstorms-targeting leJOS was an interesting thing (I found some code when I looked at it back in 2005!).

 

Last year I was also able to get the Jump stuff to build and to make binaries that run on the POSE (Palm Pilot) emulator. Code is on GitHub.

Edited by Wrathchild

  • 10 months later...

I've noticed that in the Oric community, lcc65 is commonly used instead of cc65. I haven't checked how complete it is or whether it generates better code. I know there used to be a few commercial cross compilers as well, but perhaps a sane gcc would eventually outperform those anyway, given that the commercial ones were made some 30 years ago.

 

Hi everyone. Funny that you mention lcc65. I just stumbled on this thread today while looking for an alternative to it on the Oric (for which I'm currently coding a small demo effect).

lcc65's output is pretty much what one can call a complete disaster in its current state, and the amount of redundant, wasted operations it performs is quite impressive.
A working peephole optimizer could probably remove some of the waste, but as of now lcc65's does not even compile. ;)
Others in the Oric community have been using cc65, and it seems to generate much better code (even if imperfect, obviously).

 

The backend is located in gcc-src/gcc/config/6502

 

Instead of the A/X register combination and a lot of soft-stack manipulation, it uses 48+ zero-page locations as registers, including combinations for 16 bits and beyond, and lets the gcc register allocator do its magic :) A, X, and Y are shadowed in page 0, too.

 

cc65 and lcc65 seem to rely on a similar technique but reserve one of the registers for indexing purposes (if I remember correctly), which leads to some glaring inefficiencies when register pressure is high and one would benefit from all available registers.

So I would expect that gcc would generate much better code than these.

 

 

No explanations needed. CC65 also uses the HW stack only for return addresses.

 

But even a soft-stack is quite slow on a 6502:

https://github.com/cc65/cc65/blob/master/libsrc/runtime/popa.s

https://github.com/cc65/cc65/blob/master/libsrc/runtime/pushax.s

 

So how would you realize parameter passing? By having fixed, reserved addresses for each function? Would return values be retrieved from another fixed address shared by all functions?

Would you completely abstain from supporting recursion?

 

I find much truth in http://atariage.com/forums/topic/274341-taking-advantage-of-c17-to-build-6502-code/?do=findComment&comment=3935969

 

 

The shortest sequence I can imagine would be

 

LDA #SOMEBANK

STA PORTB

 

that would be 6 cycles in the best absolute case.

 

Ideally, it would be nice if gcc were able to select an appropriate calling convention depending on when/how the function is used (in a tight loop, only once, etc.) to provide optimal performance, but I doubt it is capable of such a thing.

A compromise would be to use gcc's attribute system to override the default calling convention and replace it with dedicated ones (register-based, memory-based) when more appropriate.

I am not sure whether it supports changing the calling convention on the fly, though, so this is quite speculative as well.

A more realistic and practical aim, though, would be to use inline assembly to pass parameters as users desire; C++ templates can probably be leveraged to make this boilerplate-free.
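As a very rough sketch of the inline-assembly route (the zero-page slot, the GTIA register poke, and the assumption that gcc simply passes basic asm statements through are all mine, not something gcc-6502 provides):

/* Hypothetical sketch: pass a parameter through a fixed zero-page slot and
   let a tiny inline-asm stub consume it, bypassing the default calling
   convention.  The slot address ($80) is assumed to be free; COLBK ($D01A)
   is the GTIA border colour register (its OS shadow is ignored here). */
#define PARAM_COLOR (*(volatile unsigned char *)0x0080)

static void set_border(unsigned char c)
{
    PARAM_COLOR = c;
    __asm__ volatile ("lda $80\n\tsta $D01A" ::: "memory");
}

A thin C++ template or preprocessor wrapper could then generate the slot assignments per function, which is roughly where the boilerplate-free part would come in.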

 

Edit: removed some leftover from an uncontrolled copy/paste and added last line.

Edited by Nekoniaow

IIRC lcc has always been a teaching tool accompanying a book about compiler design. When it was first released, it was not competing with then-current compilers. That was not the point.

 

There's another "teaching" C compiler that has a 65xx backend, and that's ACK (the Amsterdam Compiler Kit). Basically, it targets a VM (virtual machine) which is then implemented for each target CPU, including the MOS 65xx series. The VM is not interpreted, BTW, but each VM instruction has its target equivalent (sometimes multiple instructions), and that's the output. After that there's a peephole stage, assembling, and linking. Its 6502 code was not that bad - I think comparable to the native version of cc65. Current cc65 is better.

 

As for gcc-6502, its generated code is generally faster than cc65, but takes more space. -Os doesn't do much with the current back-end.

Edited by ivop

OK, good (?) to know that lcc65 is something of a dead end, so that's one compiler less to dream about.

 

 

I would not say dead end; it is useful as is, as long as no performance is expected, and it has been the backbone of many an Oric demo for the past two decades.

Only if one wants to get as much performance as possible out of a C program would I recommend not using it.

The fact that this happens to be my use case is just a coincidence. ;)

 

IIRC lcc has always been a teaching tool accompanying a book about compiler design. When it was first released, it was not competing with then-current compilers. That was not the point.

[...]

As for gcc-6502, its generated code is generally faster than cc65, but takes more space. -Os doesn't do much with the current back-end.

 

I see that the sources have not changed for the last 11 months; is there still ongoing work on it and/or plans to improve it? I am willing to invest a bit of time into it (a few hours per week) if that can be useful.

 

Edit: 11 months, not years. ;)

Edited by Nekoniaow

  • 1 month later...

For info: I could compile it on Ubuntu 16.04.6, but

apt-get build-dep gcc-4.8

fails. So I did

 

sudo apt-get install libgmp-dev libmpfr-dev libmpc-dev

and the build worked like a charm (OK, replace -j 4 with -j 8).

Edited by 42bs

For info: I could compile it on Ubuntu 16.04.6, but

 

apt-get build-dep gcc-4.8

fails. So I did

 

sudo apt-get install libgmp-dev libmpfr-dev libmpc-dev

 

and the build worked like a charm (OK, replace -j 4 with -j 8).

You probably don't even need to do that - just go to gcc top directory and type

 

./contrib/download_prerequisites

Then configuring and building gcc should also build those libraries.


You probably don't even need to do that - just go to gcc top directory and type

./contrib/download_prerequisites

 

I used his "build.sh" script. But this is elegant now. Back in the day, compiling gcc was real rocket science :-)


  • 1 year later...
On 3/2/2018 at 6:09 PM, ivop said:

Here's something a bit beefier than sieve.c.


$ cl65 --static-locals -t atari -Oirs -o dhrystone-cc65.xex dhry_1.c dhry_2.c
$ 6502-gcc -mmach=atari -o dhrystone-gcc.xex dhry_1.c dhry_2.c -O3

cc65, 1000 runs, 474 ticks (9.480s)

gcc-6502, 1000 runs, 378 ticks (7.560s)

 

dhrystone.zip

I just found this thread while uploading my first Atari supporting version of vbcc6502. Of course I downloaded the examples and compiled them with vbcc.

 

Not sure if everything is correct as I am not familiar with the Atari, but with an emulator I get the same results for the posted binaries of gcc and cc65.

So here are my results:

cc65: 474 ticks

gcc: 378 ticks

vbcc: 165 ticks (vc +atari -O3 -speed dhry*.c -o dhrystone-vbcc.xex)

 

For md5:

cc65: 811 ticks

gcc-Ofast: 404 ticks

vbcc: 361 ticks (vc +atari -O3 -speed md5.c -o md5-vbcc.xex -c99)

vbcc_bench.zip


On 7/29/2020 at 5:45 PM, vbc said:

I just found this thread while uploading my first Atari supporting version of vbcc6502.

Hey this looks quite promising, thank you!

 

It took me a few minutes to figure out that what I downloaded had a C64 config as the default and to rename the Atari one, then set the environment variable - plus editing a line from the dhrystone demo (apparently // isn't parsed as a comment?) - but after that I got close to the result you did with PAL: 164 ticks, and in NTSC 212.

 

Just to see the pure computational speed I turned off ANTIC in dhry_1.c just before/after the tick measurement and got a value of 147, under 3 seconds!

 

I have found a C environment to be a great boon when working on 8-bit code, due to the quick prototyping ability. Having it be fast compiled code is quite a bonus; I could see this being a good system for writing a C framework that has inline assembly for the heavy lifting. After all this time I still find things like a display list assignment easier on the eyes in C syntax.
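For instance, something like this is what I mean by a display list being easier on the eyes in C (a sketch: the ANTIC opcodes and the SDLSTL/SDLSTH shadow at $0230/$0231 are standard Atari facts, everything else here is made up):

/* Sketch: an ANTIC display list as plain C data - 24 blank scanlines, a
   load-memory-scan line pointing at screen_ram, a few more mode 2 lines,
   then a jump-and-wait-for-vertical-blank back to the start. */
static unsigned char screen_ram[40 * 4];

static unsigned char display_list[] = {
    0x70, 0x70, 0x70,      /* 3 x 8 blank scanlines                  */
    0x42, 0x00, 0x00,      /* ANTIC mode 2 + LMS, screen address     */
    0x02, 0x02, 0x02,      /* three more mode 2 lines                */
    0x41, 0x00, 0x00       /* JVB: jump back and wait for VBLANK     */
};

void install_display_list(void)
{
    /* patch the LMS and JVB addresses, then point the OS shadow
       (SDLSTL/SDLSTH at $0230/$0231) at our list */
    display_list[4]  = (unsigned char)((unsigned)screen_ram & 0xFF);
    display_list[5]  = (unsigned char)((unsigned)screen_ram >> 8);
    display_list[10] = (unsigned char)((unsigned)display_list & 0xFF);
    display_list[11] = (unsigned char)((unsigned)display_list >> 8);

    *(volatile unsigned char *)0x0230 = (unsigned char)((unsigned)display_list & 0xFF);
    *(volatile unsigned char *)0x0231 = (unsigned char)((unsigned)display_list >> 8);
}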


  • 6 months later...

Just FYI, I ported B2FJ to the Atari; it allows real multitasking and multithreading, and writing object-oriented code that runs on the Atari.

 

B2FJ is compiled with cc65, but it seems the 6502 code generated is not very optimized. Does anybody know of an optimizing standard 6502 C compiler that can be used in place of cc65?

 

 

 

