Warning, brain dump follows!
After the benchmarking thread of last summer, I began getting interested in seeing which parts of BASIC deserve some attention, as opposed to guessing. For instance, one might guess that you would want to optimize GOTO because they're so slow. But GOTOs aren't all that common, and often run only once. Is that the right place to spend bytes, or some other place?
So I started thinking about a tool that would collect statistics from BASIC programs on which to make informed decisions. These would be static, not runtime traces or anything like that. At first I thought I could do it all in something like Python, but as the complexity became clearer I decided the "simplest" solution was to just write an entire BASIC in lex/yacc. I found a useful starting point and I've been hacking it an hour here an hour there ever since. And as it turns out, not so simple...
Well as of yesterday it correctly parsed and partially ran Super Star Trek - partially because I don't eval DEF FNs yet (although it should be possible). And the statistics I'm seeing definitely seem useful! For instance, in Super Star Trek (SST) one finds:
A total of 825 statements on 425 lines, which implies about 2 statements per line. This hides the fact that SST generally either has a single statement, PRINT, or 3 or 4 statements, which may or may not be typical of other programs.
There are 709 numeric constants, of which only 41 are not integers. Of the 668 integer constants, 139 of them are zero, and 179 of them are ones! 169 of the remaining are a single digit (2 through 9). Looking at it another way, 490 of the numbers could be represented in a single byte, and 178 in two bytes - which is every integer in the app.
There are 272 string constants, of which 140 are larger than 16 chars. The largest is 62.
There are 184 branches; 7 ONs, 36 GOSUBs and 151 GOTOs. Of the GOTOs, a slight majority jump forward, whereas in IF statements the vast majority are backward. I didn't bother checking GOSUBs, I assumed they were generally forward, and ONs can go both so I didn't collect these.
So the immediate lessons one learns:
The VAST majority of numeric constants in BASIC are single-digit integers. Zero and one represent over a third of all numbers. This STRONGLY suggests that in an Atari-a-like BASIC where numbers are converted at parse time, tokens $D0 and $D1 should be used to represent these constants. In SST, this would save 1390 bytes of memory, which is not a small amount! Using $D2 through $D9 for the rest of the single-digit variables would save an additional 845 bytes. This too seems worthwhile, although it would slightly complicate the code compared to zero and one only. With this change we've already saved 1875 bytes.
At runtime, when one of these tokens is encountered, the evaluation stack is loaded with a constant, rather than copying over the next five bytes. Everything else is unchanged. This appears to be a trivial modification that would have BIG paybacks in terms of program storage (and load time too!) while requiring only a few bytes of code in the parser and runtime.
One could also attack this problem by using a binary format for small numbers, say a $D0 followed by a one or two-byte value. This improves the number of values that can be represented in this format, but requires a runtime conversion (assuming no INT math package). If you go the one-byte route you get 490 replacements at 2 bytes, a savings of 4 per number, for a total of 1960. If you go two-byte, you get 668 replacements with a savings of 3 bytes per, so 2004.
This sort of comparison is precisely why I wrote the parser. It seems the single-digit-token approach saves almost the same amount of memory, but would be much easier to implement in code, and be faster as well. This is assuming you don't have a separate integer math package, at which point you need a two-byte format anyway.
Turning to the GOTO/GOSUB/ETC issue, which was notoriously slow in Atari BASIC. Since the value is ultimately a 16-bit int, that does provide a slight additional argument that there should be 16-bit constants, as it would save that one conversion at runtime. However, that would also require a separate "GOTO-16-bit" token and a little code to look for this case and do the conversion. This is because if the line is "GOTO I*10" then you don't want to convert the 10 to 16-bit, leaving it as FP makes more sense.
What is interesting here is that the majority of branches are backward. I guess this shouldn't be entirely surprising. But what that does suggest is that the "big problem" in Atari, that the line search always starts at the top, likely isn't as much of a problem as you'd guess. The counterexample is GOSUB - of the 49 forward branches, 36 of them are GOSUBs!
So this is also extremely informative, we can make better decisions on how to optimize these lookups. The options include:
1) having a separate 32-bit token for line numbers, where the first 16-bits is the line and the second is the memory address. Now one can branch immediately to the code without any searching - this is what BASIC XL did IIRC. The problem with this approach is that every time the program is edited you have to go through the entire source and do the lookups. So let's exclude this one, given the statistics above, this really doesn't seem like a worthwhile approach.
2) Turbo's solution, store a small table of hashed numbers for more rapid lookup. After calculating the number (whether its a constant or not), you hash the result, find the closest match in the hash, and start from there. So for a few burned cycles you skip over a lot of code. Better yet, when the program runs you only have this little cache to recreate, fast enough that you could do it with every edit.
3) my febrile idea, a "variable like" table - Atari stores its variables in a table that is created when the program is typed in. Perhaps we could use this solution for GOTO? In this case the table would simply have 127 entries of 16-bits. At edit time, any branch-to-constant is replaced by an index in this table, and the original number is saved as a string in the same fashion as variable names. At runtime, if the value is zero you do the search (potentially also doing (2) if you wish) and drop that number into the slot in the table. If the code changes, this table is set to zero.
(3) only works if the table is small. With 194 branches in SST (although I didn't see what numbers they branched too, the number of unique branches might be smaller) it seems unlikely that (3) works. It looks like some variation on Turbo is the way to go.
Still to do: I need to fix the problems with variable counting and add code for FOR loop stats. I strongly suspect that one should have two separate FOR handlers, one for FOR I = (someint) TO (someint) (STEP int), but I'll know once it's implemented. I also need to get command line parsing working so you can script it.
The code is GPL and once I get the last bits working I'll post to GitHub. It's currently running in Xcode because I like IDEs over CLIs, but the code itself should run on anything remotely Unix-y, including GW on win.