I've been working on and off for the past several weeks and I've made some progress; which can, at times, seem like a lot. Other times, this all still seems to very much be a daunting task.
When I started writing this entry, I realized I should probably give a general overview/background of how this compiler works under the covers. Here is a basic outline of the process that the compiler follows:
(1) Input files are run thru a handwritten Recursive Descent parser to generate an Abstract Syntax Tree (AST) composed of linked lists with each node in the linked list can be a token (enumerations), a single char, an identifier (string), a number, or another list. Using an AST allows multiple passes with less processing (walking the tree) and easier manipulation (potential optimizer step before code generation) The AST is essentially following "Greenspun's Tenth rule"... a crude, implementation-specific version of a LISP list. (see footnote)
There is a semi-implemented, semi-separate tokenizer/lexer utilized by the parser that's implemented in an on-demand way... As parser attempts to fetch a token, the tokenizer checks the next input character and uses that to determine what to fetch. It skips comments and the newline characters (and keeps track of the current line number). It usually returns a string to the parser with a token type... that token type indicator also serves as an indicator of a double symbol (++, --, and the like). Beyond that, the parser is mainly responsible for doing token selection when building the tree.
(2) Next the symbol table (list of variables and constants) is generated from the AST. The code walks the tree, finds the variable declaration statements and function definitions, and adds them to the table. It keeps track of 3 different symbol tables: Global, current Function Parameters, current Function Locals. The symbol tables for each function's params and locals are stored under that function's entry in the symbol table. After table is generated, variable allocations are done.
(3) Finally, the code generation is done. The code generator is made up of two parts: the AST walker, and the Instruction Generator. The code generator walks the tree in similar fashion to how the parser builds it, walking each node list in order and processing all sub-list nodes first before handling the operation of the parent list. When handling the operations, the Instruction generator is called to output the code, in DASM-compatible assembly language.
In a way, it feels like it's been way too long with still nothing to show. I've actually done a ton of work that's only partially visible on the surface of the compiler. I've done a lot refactoring of the underlying code:
- I've switched the parser->AST to rely more on using enumerations for tokens instead of string comparisons for operation designation. Originally there were a LOT of string comparisons all over the place; more or less due to procrastinating this rework. I'd like to try and push this more into the "tokenizer" but that's not really necessary... more of a nice to have.
- I've filled out most of the code generation code for handling all the operations... although most operations are still 8-bit only. That's mainly because features have only been added as I've run into the need for them. And there have been a few cases where I've put off adding a feature because I don't want to spend the time implementing it... I just want my example code to run. I know, lazy.
- I've created a constant expression evaluator that I call at a few different points in the code generator. The most obvious case is when initializing a constant variable.
- I've got a full blown inline assembly language processor working. Inline assembly code has access to the symbol tables; it can use variables and preform simple array and structure accesses.
I'm currently deep in decoupling the assembly instruction generation portion of the code. I'm working on building an Instruction List module to encapsulate a block of instructions so that I can more easily determine code size for dealing with conditional branching and making sure to only use jumps when necessary. The Instruction List will also help with function inlineing and allow me to finally implement a module system. Currently, only variables are processed in any included files.
Things I still need to do:
- implement better variable allocation for local variables. Currently local vars are located immediately after the global vars... Nested function calls don't work well because, for every function, the local vars start at the exact same spot. I need to add a function call tree to finagle that around a bit... that and automatic function inlining...
- work on the expression processing code. It's still not quite working for more complex expressions... I have some simple stack-based code generation stuff written that isn't quite working right. Also, there are some complications with conditional expressions that I need to handle.
- improve the handling of function parameter passing. I've started with the approach of using just the registers with a hinting system in place. I first implemented this to handle the inline assembly version of the position sprite function. But that's about the only case where it works nicely. That and very simple functions... since the code generation for the functions themselves doesn't quite know about register preservation
- pass line and position information into the AST to allow for use when generating errors beyond the parsing stage. Currently I'm only spitting out the AST chunk that contains the erroneous code, which is generally enough for me, but not very user-friendly.
- clean up memory management. I'm very loose with memory management for the compiler code itself at the moment. So far I've been relying on the OS to clean up afterwards. I should trying compiling this using an old 16-bit compiler and see what kind of fallout there is on an older computer.
I'm thinking and hoping that with another month of work on this project, that it should be at a good point for a beta release. I still need to figure out a new name for the project. I've come up with a few so far:
- Nimble - It's short, simple, and seems to capture the spirit of this project (easy mixing of C and Assembly)
- Elemental -> ElementalC - denotes the origins of the language while also subtly signifying that it's a subset.
- Jumble - A bit of humor, since mixing of two languages can end up jumbled together.
I'm leaning towards Elemental... but I haven't quite decided yet.
(footnote - from wikipedia)
* Greenspun's tenth rule of programming is an aphorism in computer programming and especially programming language circles that states:
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.