Hi everyone. Funny that you mention lcc65. I just stumbled on this thread today while looking for an alternative to it on the Oric (for which I am currently coding a small demo effect).
In its current state, lcc65's output is pretty much a complete disaster, and the number of redundant, wasted operations it emits is quite impressive.
A working peephole optimizer could probably remove some of the waste, but as of now lcc65's optimizer does not compile.
Others in the Oric community have been using cc65 and it seems to generate much better code (even if imperfect, obviously).
cc65 and lcc65 seem to rely on a similar technique but do reserve one of the registers for indexing purposes (if I remember correctly), which leads to some glaring inefficiencies when register pressure is high and the code would benefit from all available registers.
So I would expect that gcc would generate much better code than these.
Ideally, it would be nice if gcc were able to select an appropriate calling convention depending on when and how the function is used (in a tight loop, only once, etc.) to provide optimal performance, but I doubt it is capable of such a thing.
A compromise would be to use gcc's attribute system to override the default calling convention and replace it with dedicated ones (register based, memory based) when more appropriate.
I am not sure whether it supports changing the calling convention on the fly, though, so this is quite speculative as well.
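For what it's worth, gcc on x86-64 does accept per-function attributes that switch the calling convention (ms_abi and sysv_abi), so the mechanism itself exists. Whether a 6502 port would expose anything similar is purely an assumption on my part. Below is a minimal sketch of what the idea looks like on x86-64; the macro names CC_SYSV and CC_MS and the functions are made up for illustration, and on other targets the macros expand to nothing.

```cpp
#include <cassert>

// Hypothetical macro names. The attributes themselves are real gcc/clang
// x86-64 attributes; on any other target they expand to nothing, so the
// default convention is used.
#if defined(__GNUC__) && defined(__x86_64__)
#define CC_SYSV __attribute__((sysv_abi))  // first integer args in rdi, rsi
#define CC_MS   __attribute__((ms_abi))    // first integer args in rcx, rdx
#else
#define CC_SYSV
#define CC_MS
#endif

// Same body, two different calling conventions chosen per function.
CC_SYSV int add_sysv(int a, int b) { return a + b; }
CC_MS   int add_ms(int a, int b)   { return a + b; }

// Usage: the caller does not change; the compiler emits the right
// argument-passing sequence for each callee.
inline void demo_conventions() {
    assert(add_sysv(2, 3) == 5);
    assert(add_ms(2, 3) == 5);
}
```

The point is only that the choice is made per declaration, with no change at the call site. A register-based versus memory-based split on the 6502 would need the port to define its own attributes.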
A more realistic and practical aim, though, would be to use inline assembly to pass parameters as users desire; C++ templates can probably be leveraged to make this boilerplate-free.
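To make the inline assembly idea a bit more concrete, here is a rough sketch using gcc's extended asm on x86, since I cannot show it for a 6502 gcc port. The template add_in_registers is a hypothetical name; the "a" and "d" constraints pin the two parameters to EAX and EDX, and the single add instruction stands in for a hand-written routine that expects its inputs there. On non-x86 targets it falls back to plain C++.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical wrapper: the template hides the asm boilerplate, and the
// constraints decide which registers carry the parameters.
template <typename T>
T add_in_registers(T a, T b) {
    static_assert(std::is_integral<T>::value && sizeof(T) == 4,
                  "this sketch only handles 32-bit integers");
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // "+a": a is passed in EAX and the result comes back in EAX.
    // "d" : b is passed in EDX.
    __asm__("addl %1, %0" : "+a"(a) : "d"(b) : "cc");
    return a;
#else
    // Portable fallback so the sketch still builds elsewhere.
    return static_cast<T>(static_cast<std::uint32_t>(a) +
                          static_cast<std::uint32_t>(b));
#endif
}

// Usage: callers see an ordinary function template.
inline void demo_register_passing() {
    assert(add_in_registers(40u, 2u) == 42u);
    assert(add_in_registers(-5, 7) == 2);
}
```

A 6502 version would need that port's own constraint letters for A, X and Y, which I am only assuming it would provide.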
Edit: removed some leftover from an uncontrolled copy/paste and added last line.