Blog Comments posted by supercat
-
Oh, for six more bits, use bits 8, 10, 11, 13, 14, and 15 of PC. Not sure the most effective way to do that since I don't think you can afford a computed jump, and if you don't use a computed jump the code size will double with each bit you want to save, but here's how to save three bits:
; Assume we want to save the three MSBs of WOWZO
        lda #32
        bit WOWZO
        bmi b1xx
b0xx    bvs b01x
b00x    bne b001
b000    jmp ZPCODE
b001    jmp ZPCODE+$2000
b01x    bne b011
b010    jmp ZPCODE+$4000
b011    jmp ZPCODE+$6000
b1xx    bvs b11x
b10x    bne b101
b100    jmp ZPCODE+$8000
b101    jmp ZPCODE+$A000
b11x    bne b111
b110    jmp ZPCODE+$C000
b111    jmp ZPCODE+$E000
The MSBs of the program counter can easily be retrieved after the BRK instruction.
Another spot you may be able to save a bit is in bit 0 of Y. You'd have to adjust the operands of indexed instructions, but it might be doable.
Another spot to save 4.5 bits (a one-of-24 selection) would be the horizontal positions of P0, P1, M0, and M1. You can permute them any which way without affecting the on-screen appearance. Not sure you could avoid HMOVE lines if you did that, though, since you'd have to HMOVE the squares back and forth.
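To make the one-of-24 idea concrete, here is a rough Python sketch of storing a value as the order in which the four objects are assigned to four fixed positions. The names `OBJECTS`, `encode` and `decode` are made up for illustration; a real kernel would do this with the position registers, not with tuples.

```python
from itertools import permutations
from math import log2

OBJECTS = ("P0", "P1", "M0", "M1")

# All 4! = 24 ways to hand the four objects to four fixed positions,
# listed in a fixed order so that the list index is the stored value.
ORDERINGS = list(permutations(OBJECTS))

def encode(value):
    """Return the ordering that represents value (0..23)."""
    return ORDERINGS[value]

def decode(ordering):
    """Recover the stored value from an ordering of the four objects."""
    return ORDERINGS.index(tuple(ordering))

# Information capacity of a one-of-24 choice, a bit over 4.5 bits.
capacity_bits = log2(len(ORDERINGS))
```

Since 24 is not a power of two, only four whole bits are directly usable unless the selection is combined with other state.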
-
Yeah, freeing more bytes without compromising the height of the pieces would be nice. Right now there are 17 bits free (2 bytes of RAM plus overflow flag), and I think that is barely enough for a playable chess game.
Well, presumably you're not going to be displaying the board while you're 'thinking'. During 'thought', the RAM used by the kernel could be used for other things.
Looking through your self-modifying code, I think I see space to store about 16 bits "within" it. Each indexed STA can be done using either X or Y (adjust the operand as needed), and every TIA store can be done with the normal address or with address+64. Further, you might be able to use the ball horizontal position and size to store 9 bits, eleven collision latches to store 11 bits(*), the RIOT timer to store 8 bits, and two bits of SWCHB and SWBCNT to store 4 bits. See--I just saved you six bytes.
(*) The playfield is going to collide with both players and both missiles, but no other collisions would have to occur, leaving eleven collision latches available for use.
The obstacle to a 4th PLA was the number of cycles between color changes. I tried a lot of things, but couldn't squeeze four cycles in. You're right that it would have been more elegant.
It might be helpful if you offer a version of the demo in which a row of pieces alternates between white and black, and in which the PF data is all $FF, so as to show exactly when the color changes are taking place.
-
ALL systems that generate composite video and use a stationary color phase (basically, the "comb" mentioned above) will have this problem. Note that broadcast video has the color phase alternate every scan line (even in NTSC) so as to minimize these effects.
Finally remembered to look this up. I think you are confusing two things. NTSC line period is 227.5 cycles of colorburst, so the phase of the colorburst alternates each line. For PAL the phase of the Red-Cyan color axis (wrt colorburst) alternates each line. The extra half cycle in NTSC moves the modulated color signal from high frequency horizontal luma into high-frequency diagonal luma (or the fookie-nookie hole as one coworker once described it). PAL (phase alternating line) does this so colorburst phase errors (which is why old NTSC TVs needed a tint control) can be averaged out.
In broadcast NTSC, the phase on alternate scan lines will be shifted by 180 degrees as a result of the 'half' cycle. In PAL, the phase direction is reversed on alternate scan lines (so that half the scan lines are 'red green blue' and half are 'blue green red').
One side-effect of this is that in broadcast NTSC, any solid saturated color will appear as a checkerboard; in PAL, some colors will appear as a checkerboard, some as stripes, and some as a herringbone pattern. Unlike NTSC, someone examining a PAL picture on a monochrome display without a comb filter would be able to tell something about the colors displayed therein.
-
Looks nice. My earlier pessimism may have been unwarranted.
Using 32 bytes for the board may be excessive. Since there are only 32 pieces on the entire board, it may be possible to do something like using 8 bytes to keep track of which squares are occupied, and then use 16 bytes (32 nybbles) to say what's in the occupied squares from top to bottom. Not sure you'd have enough time for the unpacking, though. My guess would be that average case time would be improved, but worst-case time would be made worse. But I don't know what you're doing at present, so I can't say how the change would affect things.
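A quick Python sketch of the packing I have in mind, just to show the 8 + 16 byte layout. The piece codes (1..15, with 0 meaning empty) and the function names are assumptions for illustration; on the 2600 the unpacking cost is the real question, and this says nothing about that.

```python
def pack_board(board):
    """board: 64 entries, 0 for an empty square or a piece code 1..15.

    Returns (occupancy, pieces): 8 bytes with one bit per square, then
    16 bytes holding up to 32 nybbles in square order.
    """
    occupancy = bytearray(8)
    nybbles = []
    for square, piece in enumerate(board):
        if piece:
            occupancy[square // 8] |= 1 << (square % 8)
            nybbles.append(piece)
    if len(nybbles) > 32:
        raise ValueError("a chess board never holds more than 32 pieces")
    nybbles += [0] * (32 - len(nybbles))       # pad unused slots
    pieces = bytearray(16)
    for i in range(16):
        pieces[i] = nybbles[2 * i] | (nybbles[2 * i + 1] << 4)
    return bytes(occupancy), bytes(pieces)

def unpack_board(occupancy, pieces):
    """Inverse of pack_board: walk the squares, pulling one nybble per set bit."""
    board = [0] * 64
    n = 0                                      # index of the next nybble
    for square in range(64):
        if occupancy[square // 8] & (1 << (square % 8)):
            byte = pieces[n // 2]
            board[square] = (byte >> 4) if n % 2 else (byte & 0x0F)
            n += 1
    return board
```

That is 24 bytes instead of 32, at the price of a sequential scan to find what sits on any given square.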
Otherwise, I must say I'm pretty impressed with your ability to eke out some cleverly-compact code. It took a while to figure out how the self-modifying parts work since it's not very clear. It would be much more elegant if there were some way to use a fourth PLA.
-
I had a couple of kernels I wanted to test anyway, so I fired up the Kroko Cart. Juno First looks good on the TV (hacked to 262 lines). It's too bad the vertical movement is jumpy, but it's better than I could have done.
Yeah, I messed up the scan line count. Sorry 'bout that.
-
Maybe I'll try a quick mockup to see how it would look.
I did try a mockup; it's posted in my blog.
-
Thanks for the hint. Hm... it does seem like the grid just provides some sense of motion, so it's not really essential, or?
The grid in Juno First does "just" provide some sense of motion, but I'd consider it a defining part of what makes the game what it is.
If one didn't mind having grid points limited to 3-pixel horizontal resolution, it would be possible to use the Ball for the grid and trigger it everywhere the grid should appear (using a special batch of code for every variation). This might be somewhat workable if sprites were limited to two-line resolution, but would seem somewhat gnarly.
Maybe I'll try a quick mockup to see how it would look.
-
Actually, I found an article which uses this effect as evidence that the Apple ][ effectively has 560 pixel horizontal resolution on a monochrome monitor.
Not sure what you mean by "evidence". I've described what the bit does. In some rare cases, that's kinda sorta like 560-dot resolution, but it's pretty limited.
The 2600 has much less artifacting than the Apple ][ because its dot clock is equal to colorburst so each pixel is a full colorburst cycle. The 7800, on the other hand, uses 2*colorburst in its 320 modes so suffers from lots of artifacting because luma changes will produce false color. Dot crawl probably has more to do with modern comb filters as opposed to the heavy notch filters used back in those days.
Many black and white TV sets had, and have, no filter to take out colorburst. The striping could be objectionable, which is part of the reason some games force black and white, but I'm surprised no real efforts have been made to 'use' the striping.
-
On a monochrome display, there is. You do indeed see individual pixels instead of color (this is what I got on my monochrome //c display.) Well, to be exact, some monochromes back then were green or amber...
Well, sure, if you want to call everything that isn't a black pixel a white pixel, then yeah I guess there were white pixels.
And if a bunch of pixels in a row are turned on, they'll blend together to produce white even if individually they're not "really" white.
Fundamentally, artifact color acts as a sort of 'hybrid' resolution mode. Provided that areas of a given color are at least one chroma clock wide (2 pixels on 280/320-dot machines; 4 on 560/640-dot machines), boundaries between colors may be placed with a precision that depends upon the colors involved but may be as high as the dot rate (for white on black or black on white).
A game like King's Quest, using artifact color, can show 16 colors on screen at once but still also display legible 40-column text without having to split graphics modes mid-screen. This is why King's Quest looks better on a CGA composite monitor than on an RGB one.
-
0. high CPU or memory cost of address and pixel calculations
I think a table lookup is basically mandatory regardless of the byte ordering. The only nasty thing here is that the Apple ][ uses seven pixels/byte instead of eight.
1. odd/even pixel bit position in the byte changes based on whether the byte is on an odd or even address
2. also the number of odd/even pixels per byte depends on whether the byte is on an odd or even address
The #2 follows from #1, but I don't really see why it's a separate annoyance. Why does a program care "how many" even or odd pixels are in a particular byte? I'll certainly agree that #1 is annoying.
3. white pixels can span multiple bytes and may occur unintentionally
4. overlapping colored sprites required masking to prevent interference (and I'm not certain some interference can be completely avoided)
5. difficult/impossible to create truly multi-colored sprites (hmm, I think this also applies to #4)
There are two issues here. The first is that because the Apple ][ uses "artifact" color, there's not really such a thing as a white pixel. What's helpful is to think of the screen as having a striped color 'comb' superimposed on a black and white picture. The image is displayed through the comb and the resulting colors are then smeared. What happens effectively is that if you display an even pixel adjacent to an odd pixel, you'll effectively get (IIRC) a fat blue-cyan pixel and a fat orange pixel which largely overlap. The area where they overlap will be white, but the left edge will be blue-cyan and the right edge will be orange. If you had an odd pixel and then an even one, you'd have an orange pixel and a blue-cyan one, so the left fringe would be orange and the right fringe blue-cyan.
To be sure, it's not possible to display orange next to blue-cyan without having the area between them appear black or white, but it should be noted the Apple is hardly unique in that regard. ALL systems that generate composite video and use a stationary color phase (basically, the "comb" mentioned above) will have this problem. Note that broadcast video has the color phase alternate every scan line (even in NTSC) so as to minimize these effects.
The second issue with the Apple II hires color is that the MSB of each byte shifts the pixels in that byte 1/2 dot to the right. If a shifted byte follows an unshifted byte, the last pixel of the former will be 1.5x normal width; if an unshifted byte follows a shifted one, the last pixel on the shifted one will be half normal width. The primary purpose of shifting the pixels by half a pixel is to shift their position on the "color comb", but the shift can also be used as a means of making certain graphics appear less jagged. Unfortunately, the fact that clusters of 7 pixels have to be shifted somewhat limits the utility of this feature.
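For anyone who wants to play with the seven-pixels-per-byte layout and the half-dot shift, here is a simplified Python model. It assumes bit 0 is the leftmost pixel of each byte and measures positions in half-dot units (0..559); the table and function names are mine, and it ignores color entirely.

```python
# The divide-by-7 would normally be a pair of lookup tables, since the
# 6502 has no divide instruction.
BYTE_OF = [x // 7 for x in range(280)]   # byte offset within the row
BIT_OF = [x % 7 for x in range(280)]     # bit number within that byte

def locate_pixel(x):
    """Map hi-res column x (0..279) to (byte offset, bit number)."""
    return BYTE_OF[x], BIT_OF[x]

def dot_positions(row_bytes):
    """Left edges, in half-dot units, of every lit pixel in a row.

    Bits 0..6 are pixels (bit 0 leftmost); bit 7 shifts all seven
    pixels of that byte right by half a dot.
    """
    positions = []
    for offset, value in enumerate(row_bytes):
        shift = 1 if value & 0x80 else 0
        for bit in range(7):
            if value & (1 << bit):
                positions.append(2 * (7 * offset + bit) + shift)
    return positions
```

With `dot_positions([0x40, 0x81])` the two lit pixels start three half-dots apart instead of two, which is the 1.5x-wide pixel at the unshifted/shifted boundary described above.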
Incidentally, the Apple's "lo res" mode works by treating the four bits of color data as a repeating pattern of four "double high res" pixels. If you view lo-res graphics programs on a monochrome monitor, this is clearly visible. The "double hi-res" mode of the Apple IIe offers a full bitmap at that higher resolution.
Many people swear at artifact color. It can be annoying to work with, to be sure, but it's not without its advantages. Chief among these is that provided certain graphical design requirements are met, the effective resolution is higher than could be achieved with the same number of colors by other means.
BTW, I'm somewhat surprised that nobody's used artifacting to any advantage on black and white televisions with the 2600. Although a solid area of color $18 is indistinguishable from a solid area of $88, an area that changes $18,$28,$38,$48,... will appear to have a "dot crawl" in the opposite direction from one that changes $48,$38,$28,$18. Perhaps the RF on many 2600's was so poor as to make the dot crawl not useful?
-
Wait, you forgot the biggest groaner of them all: The bit->pixel order is reversed! Kind of like the 2600 playfield. But it's only 7 bits that are reversed, so it makes it that much more fun to deal with.
Why is that a groaner? Seems like far less of an annoyance than the use of 7 pixels/byte instead of 8.
Actually, when I designed some custom display hardware for a single-board computer, I deliberately put the LSB of each byte at the left even though many PCs and such put it on the right. For little-endian machines with 16-bit registers (such as the 8x88) it's much better to have display data be little-endian as well, since it allows data to be shifted within a 16-bit register and written to the screen without having to byte-swap anything. The 6502 doesn't have any instructions that span bytes in any manner that would make byte-order significant, but since it's generally a little-endian machine I see nothing wrong with Woz's decision.
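Here is a small Python sketch of why LSB-at-left pairs nicely with a little-endian store. It assumes a hypothetical 1-bit-per-pixel display where bit 0 of each byte is the leftmost pixel; the function names are illustrative only.

```python
def shift_row_lsb_left(sprite_byte, x_offset):
    """Place an 8-pixel row x_offset (0..7) pixels to the right.

    On an LSB-left display, moving right on screen is a left shift in
    the register. The 16-bit result is stored little-endian, so the low
    byte lands at the lower (left-hand) address with no byte swap.
    """
    word = sprite_byte << x_offset
    return word.to_bytes(2, "little")

def pixels(screen_bytes):
    """Render bytes as text, bit 0 of each byte leftmost."""
    return "".join(
        "#" if value & (1 << bit) else "."
        for value in screen_bytes
        for bit in range(8)
    )
```

For example, `pixels(shift_row_lsb_left(0xFF, 3))` gives `...########.....`. With the MSB at the left, the same trick needs the high byte at the lower address, which means a swap on a little-endian CPU.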
-
Glad you're back. Interesting that rather than use a direction bit which would be set if the ball goes off the top or clear if it goes off the bottom, they have a "wrong direction" bit that inverts the effective direction from the last latched collision. I wonder if this has anything to do with supposed bugs at the corners? If a player hits the ball with the top of the paddle when the ball is already at the top edge of the screen, could this cause the ball to get stuck in odd fashion?
-
4a. 2D, 1D-Scrolling Shooters (Defender, River Raid)
4b. 2D, 2D-Scrolling Shooters (Thrust)
4a. "Rails" V-shooters (River Raid)
4b. "Rails" H-shooters (Phaser Patrol)
4c. "Free" H-shooters (Defender; Chopper Command)
4d. Are there any "free" V-shooters?
4e. "Free" HV-shooters (Thrust+)
Note that the different waves in something like Vanguard and Gorf are effectively different games.
-
A feeling I have, is when seeing all your required stunts like minimizing stack usage and tradeoffs like the one above, all of this before you can even post a demo of your work, is that possibly your "set of compromises" may be flawed and that you may need to reconsider your plans, having better one or another feature redesigned for less RAM usage. I'd think you'll probably always need another 10 or more bytes for taking a conceptual prototype to an actual game.
If I use 8 bytes for the kernel and do everything as planned, I could use a 16-high playfield and a 64-long snake and have seven bytes free (assuming one byte for return address, but before doing some of the other tricks I mentioned). If things become too difficult, I could shorten the snake (4 units/byte) or cut the playfield height (5 bytes/row) but I'd prefer not to do either unless I have to. If the snake limit ends up being 48 units instead of 64, I won't mind too much, but if it shrinks below 32 that wouldn't be as much fun.
-
The intersection of the two sets "people who want this mod" and "people with the technical skill to do it themselves" is probably very small. I'm sure a few would do the mod, but I think it has potential to sell quite a few units if you make it into a pass-through adapter (if this is even possible.) That is, a DB9 male and female with the microcontroller in between in a tiny case. I don't know how much this would cost, but it would be more likely to catch on and it would work with two paddles at once.
Solving the jitter problem would require modifying the paddle assemblies; a pass-through adapter alone would not suffice. Using a pass-through adapter would allow the mod to be made simpler--either a simple wire between the unconnected end of the pot and the ground terminal on the button (which would prevent the paddle from working without the pass-through adapter) or a Schottky diode between those points (anode on the switch ground) which would allow the paddles to be used with or without the adapter.
Easing the software interface without solving the jitter problem would certainly be possible with an adapter, though I would see that as being more of a chicken-and-egg problem. People aren't going to spend money on an adapter with no software support, and people aren't going to write games for an adapter nobody has. The anti-jitter feature would IMHO be the key to making the microcontrollers useful even without special software support, and I see no way around requiring a paddle mod for that.
Though there is another possibility... are there any other cheaply- and commonly-available analog controllers? Perhaps one of those might be adaptable?
-
12 isn't so bad.
But am I reading this right; you can access $F4-$Fx with the stack only?
How does that work? Which opcodes can you use? JSR, BRK? What about PHA, PLA? And, just curious, why not LDA/STA/etc.?
On the 2600, address pin 8 is not connected to the RAM or the Stella, nor is it involved in the chip-select logic. Thus, all addresses of the form "xxx0 xx0x 1nnn nnnn" will access location "nnn nnnn" of RAM. So "lda $6985" will behave the same as "lda $0185" or "lda.w $0085", and all three will read the same data as "lda $85" (though the latter will be a cycle shorter).
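The bit pattern above is easy to check mechanically. A minimal Python sketch, assuming only what the pattern says (A12 = 0, A9 = 0, A7 = 1 selects RAM, and the low seven bits pick the byte); the constant and function names are mine:

```python
RAM_SELECT_MASK = 0x1280    # the bits that matter: A12, A9 and A7
RAM_SELECT_VALUE = 0x0080   # A12 = 0, A9 = 0, A7 = 1

def ram_location(address):
    """Return the RAM index (0..127) a 16-bit address lands on, or None."""
    if (address & RAM_SELECT_MASK) == RAM_SELECT_VALUE:
        return address & 0x7F
    return None
```

`ram_location(0x6985)` and `ram_location(0x0085)` both come out as 5, i.e. the byte normally written as $85.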
Even though the TIA and RIOT might not be able to distinguish among these various addresses, the cartridge port can (except for the top 3 bits anyway, and even those it can do sometimes). To see how this is relevant with hotspots, consider the following example.
Assume A contains 5, X contains 7, Y contains 12, and S contains $FF. Note that address $FE is a hotspot for selecting a flash page to appear at $1E00-$1EFF and $FF is a hotspot for selecting a page of RAM to appear there. The following instructions are run in sequence.
    STA $FE     ; Stores 5 in $FE *and* selects page 5 of flash
    STX $01FE   ; Stores a 7 in $FE, but leaves page 5 of flash selected
    STY $FF     ; Stores a 12 in $FF *and* selects page 12 of RAM
    LDA $FE     ; Loads a 7 *and* selects page 7 of flash
    INC $FF     ; Stores a 13 in $FF (was 12) *and* selects page 13 of RAM
    NOP $FE     ; Selects page 7 of flash (no registers or flags affected)
    LDA #$1A
    PHA         ; Stores a $1A in $FF, but leaves page 7 of flash selected
    NOP $FF     ; Selects page $1A of RAM
If all one wants to do is select a particular known bank of RAM or flash into the $1E00 block and the page number isn't in a register, it may be done in 4 cycles via the $6C00-$6DFF hotspots. "NOP $6C53" will select page $53 of ROM; "NOP $6DEA" will select page $EA of RAM; neither instruction affects flags or registers. But if it's necessary to switch frequently among a few pages, the RAM hotspots allow that to be done in 3 cycles. Further, use of instructions like "inc" and "dec" on the RAM hotspots makes indexing highly convenient.
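To make the $FE/$FF behaviour easier to follow, here is a toy Python model of just those two hotspots. It is a sketch built from the description above, not from the real hardware: it assumes the cartridge sees address lines A0..A12, that any read or write of $00FE or $00FF triggers the hotspot with whatever byte ends up on the bus, and that mirrored addresses such as $01FE reach the same RAM without triggering anything. The class and method names are made up.

```python
class HotspotModel:
    """Toy model of the two zero-page hotspots; callers pass RAM addresses."""

    def __init__(self):
        self.ram = [0] * 128
        self.selected = None            # e.g. ("flash", 5) or ("ram", 12)

    def _check_hotspot(self, address):
        bus = address & 0x1FFF          # what the cartridge port can see
        value = self.ram[address & 0x7F]
        if bus == 0x00FE:
            self.selected = ("flash", value)
        elif bus == 0x00FF:
            self.selected = ("ram", value)

    def write(self, address, value):
        self.ram[address & 0x7F] = value & 0xFF
        self._check_hotspot(address)

    def read(self, address):
        self._check_hotspot(address)
        return self.ram[address & 0x7F]

cart = HotspotModel()
cart.write(0x00FE, 5)                        # STA $FE   : hotspot, flash page 5
cart.write(0x01FE, 7)                        # STX $01FE : same byte, no hotspot
cart.write(0x00FF, 12)                       # STY $FF   : hotspot, RAM page 12
a = cart.read(0x00FE)                        # LDA $FE   : reads 7, flash page 7
cart.write(0x00FF, cart.read(0x00FF) + 1)    # INC $FF   : 12 -> 13, RAM page 13
cart.read(0x00FE)                            # NOP $FE   : flash page 7 again
cart.write(0x01FF, 0x1A)                     # PHA, S=$FF: stores $1A, no hotspot
cart.read(0x00FF)                            # NOP $FF   : hotspot sees the $1A
```

The point of the model is the asymmetry: the zero-page form of an address both touches RAM and switches pages, while the $01xx form only touches RAM.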
-
What information do you need to have associated with each room in POP? If you use a few multi-level table lookups, I'd think you could compress things pretty well.
BTW, by multi-level table lookups, I mean something like:
    ldy roomnumber
    lda table1,y
    and #$0F
    tax
    lda table1L1,x
    sta property1
    lda table1L2,x
    sta property2
    lda table1,y
    lsr
    lsr
    lsr
    and #$07    ; Sharing MSB from first batch (in LSB)
    tax
    lda table1M1,x
    sta property3
    lda table1,y
    asl         ; 8-bit rotate left, twice: asl/adc #0 wraps bit 7
    adc #0      ; around into bit 0 (a plain rol would pull in the 0
    asl         ; that asl shifted in, losing the shared bit)
    adc #0
    and #$07    ; Sharing LSB from first batch (in MSB)
    tax
    lda table1H1,x
    sta property4
    lda table2,y
    etc.
The results of one lookup table are munged and then used to drive another, smaller, lookup table. Depending upon what combinations of properties exist, you may be able to store an awful lot of information in 1-2 bytes per room plus a few dozen bytes' worth of smaller tables. Working out how to arrange the smaller tables can be an interesting challenge, but tasks like that can sometimes be automated considerably.
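The same idea in a few lines of Python, for anyone who wants to experiment with table arrangements off-line. The table contents are dummy placeholders and every name here is hypothetical; the only thing taken from the description above is how one byte is carved into three overlapping indices.

```python
# Placeholder second-level tables; real contents depend on the game.
TABLE1_L1 = [f"L1-{i}" for i in range(16)]
TABLE1_L2 = [f"L2-{i}" for i in range(16)]
TABLE1_M1 = [f"M1-{i}" for i in range(8)]
TABLE1_H1 = [f"H1-{i}" for i in range(8)]

def split_indices(room_byte):
    """Carve one byte into three overlapping table indices."""
    low = room_byte & 0x0F                   # bits 0-3
    mid = (room_byte >> 3) & 0x07            # bits 3-5; bit 3 shared with low
    rotated = ((room_byte << 2) | (room_byte >> 6)) & 0xFF
    high = rotated & 0x07                    # bits 0,7,6; bit 0 shared with low
    return low, mid, high

def room_properties(room_byte):
    """Drive the small tables from the indices, as in the assembly sketch."""
    low, mid, high = split_indices(room_byte)
    return TABLE1_L1[low], TABLE1_L2[low], TABLE1_M1[mid], TABLE1_H1[high]
```

Because the fields overlap, not every combination of properties is reachable; picking table orderings that make the needed combinations reachable is the puzzle.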
-
For the worm game, I personally think 1K ROM for 2 bytes of RAM is an unacceptable compromise, if that helps your decision any
Before making a tradeoff like that, it's often necessary to ascertain (1) whether the memory is really saved, or whether another 'critical depth' still needs it at a different time in program execution; (2) whether there aren't already another couple spare bytes available.
Unrolling loops often not only saves loop overhead, including the memory required for a loop counter, but it also often allows other useful things to be done. For example, in Strat-O-Gems, the scan lines within a row of gems are not looped, because I compute the colors differently on different rows. The top three rows of gems are in a 3-high loop, and I might have been able to figure a way to loop the scan lines within each row, EXCEPT that I use some of my spare cycles during each row to precompute the colors for the fourth row of gems (note that this computation ends up getting done three times when it only needs to happen once, but the cycles exist so that's harmless). When I added the remaining-gems counter to Strat-O-Gems deluxe, I added another unrolled version of that loop. I could perhaps have used the same loop as I'd been using for all three of the top rows, but I needed some of the 'spare' cycles that precomputed the row 3 colors to instead generate and show the gem counter. Had I been trying to squish things into 8K, I could probably have figured out how to get everything in the one loop, but I was in a rush.
Incidentally, in Strat-O-Gems, the 'next gem' indicator sprite is positioned to the right of the central area. In Strat-O-Gems deluxe, it's to the left but I use 'two copies wide' mode. On the row where the level indicator is shown, the sprite appears in front of the playfield. On the other rows, it shows an extra copy of the 'next' gem but is covered up by the playfield.
To be sure, sometimes even when the first few iterations of a loop need to be unrolled, it can be useful to loop the rest, but sometimes it may be cleaner to just unroll the whole thing.
-
Yeah, I realized after writing this up that the vertical scrolling would have to be handled by adjusting the starting/ending Y index. Not really a problem, though.
That's not what I meant. Only adjusting start and end will limit your scrolling range severely. I meant a table, which loads the next Y value.
.loop:
    tay
    ...
    lda Table,y
    bne .loop
Table: ; in RAM!
    .byte 0, 2, 3, 4, 5... 158, 159, 0 ; start with Y = 1
Which later changes e.g. into:
    .byte 0, 2, 0, 4, 5... 158, 159, 1 ; start with Y = 3
I like it. You only have to change two bytes in your table to perform a scroll, and your table lookup only costs four cycles as compared with a dey (and in some cases you can even 'absorb' those).
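For reference, the two-byte update can be modelled in Python like this. The 160-entry size comes from the example tables; the helper names and the idea of tracking the last line in a variable are my own assumptions.

```python
def make_table(lines=160):
    """table[y] is the Y value to draw after y; 0 ends the loop."""
    table = [0] * lines
    for y in range(1, lines - 1):
        table[y] = y + 1
    return table                 # draw order 1, 2, ..., lines-1

def walk(table, start):
    """Return the order in which the kernel loop would visit Y values."""
    order = []
    y = start
    while y != 0:
        order.append(y)
        y = table[y]
    return order

def scroll(table, start, last):
    """Move the first line to the end of the chain: exactly two writes."""
    new_start = table[start]
    table[last] = start          # old last line now continues to old first
    table[start] = 0             # old first line becomes the terminator
    return new_start, start
```

Scrolling twice from the initial table leaves index 2 holding 0 and index 159 holding 1, with the loop starting at Y = 3, which is the second table shown.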
BTW, using a double-line kernel can often be useful even if you're drawing some shapes at single-line resolution. If your displayed area is 170 pixels or fewer, you can put three 85-byte tables in a page. Color and shape data for both players on every scan line would take 2 2/3 pages. If you use a 'straight' single-line kernel, the data would have to occupy 170 bytes on each of four separate pages--not nearly as nice.
If some shapes are at single-line resolution and others are at double, the decision becomes even easier.
-
Also, am I reading that last section correctly? So the standard clean start macro should not be used since it will hit all the bank switch spots?
Actually, it won't trip any of them because it writes to the RAM using addresses $0180-$01FF rather than $0080-$00FF.
And...24 zeropage bytes that are unusable is a bummer. There isn't a way around this?
Mea culpa. It's $F4-$FF, not $E8. So twelve bytes. And unused presets won't really cost memory because you can set the stack pointer just below the lowest used one and use that space for stack. In the event that your stack doesn't use up all your presets (e.g. you only need $FA-$FF for presets and four bytes of stack) you could still address the other locations as $01F4 and $01F5. Not zero-page, but always available nonetheless.
Further, I really don't think zero-page RAM is going to be as scarce and precious a commodity in a 4A50 game as in a normal one (since other RAM will be available). When I did SDI (a 1K minigame for the SuperCharger), I had boatloads of zero-page RAM left. So much that spending 8 bytes of RAM to save two bytes of code was a no-brainer.
-
Regarding Strat-O-Gems, I personally would've kept it at 4K. It seemed complete, polished and charming that way and I'm not sure if $5 extra are really justified for the difference of the two versions. Especially having PAL/NTSC in one binary seems to be an unnecessary overkill gimmick. I wouldn't know anyone *needing* both versions of a game
Dual-system capability alone would not have justified going from a 4K cart to something bigger. Even the speech would not have. The thing that pushed Strat-O-Gems over 4K was the AtariVox/Memcard "instant replay" and the saving of great plays. I would agree that for people without an AtariVox or Memcard there isn't much new to justify the extra $5 cost, but I wanted to do something that would help make the AtariVox/Memcard a "must-have" and there was no way that was going to fit in a 4K cart. If I'd offered a 4K cart for people who don't have an AtariVox/Memcard today, they wouldn't be able to use those things if they got them later.
Since a 32K cart was the same price as 8K, I figured I may as well use the space to maximum advantage; since Albert indicated it would be easier for him to deal with one .bin than two, I included both PAL and NTSC versions in the same cart.
FYI, a rough breakdown of SOG memory usage in bank order:
2x4K -- Main game kernel (PAL/NTSC) and AtariVox code
2x4K -- Main game (PAL/NTSC)
2x4K -- Title screen/instant replay/Great play kernel (PAL/NTSC)
4K -- Original Strat-O-Gems
4K -- AtariVox speech data (about 1K, common to PAL/NTSC) and Boot Menu
PS -- Although the SOG info page doesn't mention it, the 32K cart includes the original 4K version. Hold fire on startup to bring up the game menu, then push and hold fire while pulling down.
PPS -- The instant replay and record-play-saves function really is neat. I highly recommend that anyone with Strat-O-Gems at least spend the $10 for a Memcard. Having great plays recorded really adds a new dimension to the game.
-
...as an embedded systems engineer...
Do you ever work with ARMs? I like the ARM, as it's got a simple, straightforward instruction set, and nearly all instructions can be conditional, not just branches. This leads to some really funky code, sort of like we do here with the 6502.
Never used the ARM. Most of my stuff is on smaller micros. Other than PC support software, the "largest" project I did was for an 8x86 with 512K of RAM.
-
Yes, sprite animation is where I fall down also. Any tips on how to create a half-decent animation sequence would be very welcome
I'm hardly an expert, but I think one of the keys is to figure out where all of the parts of the sprite are going in every frame, even though there may be some frames where they aren't visible.
For an example of how not to do things, look at the Odyssey2 "left standing man" and "left walking man" graphics (reproduced roughly below). How do the different parts of the man's leg move in going from one frame to the other? Where does the man's "foot", present in the "standing" graphic only, "go" in the walking man?
...##...   ...##...
...##...   ...##...
....#...   ....#...
..###...   ..###...
....#...   ....#...
....#...   ...#.#..
...##...   ..#...#.
-
4. Corpse Bride
I liked the movie, but I still can't see how anyone gave it the green light. I can't imagine anyone who wasn't a die-hard fan of Tim Burton's stop-motion animation wanting to go see it based upon the previews.

Juno First kernel demo
in 2600 in 2006
A blog by supercat
Not really, actually.
When I read about your thinking about JF, I figured that any good JF game ought to have the scrolling dot thingies since--regardless of whether they really affect gameplay--they are a defining aspect of the game's visuals.
I'm not really interested in writing the game myself, but wanted to if possible prevent the 'crime' of someone releasing a JF game without the dotty background.
So I'll offer my full source code for the demo, including the code generator, gratis to anyone who wants it. I'll even do a bit of a write-up for it and offer any help you may need.
Some preliminary points:
-1- The code generator is written in QBASIC. I think QBASIC is available free on Microsoft's web site someplace; it's certainly not hard to find.
-2- Each row of dots is stored as a starting position and a string of digits; each digit is the number of cycles to the next dot, minus six. So if there should be six dots, starting at position 10, and there should be seven cycles between them except between the third and fourth where there are eight, that row would be generated as a 10 in the start array and a "11211". Note that the string is five characters because six dots will have five "gaps".
-3- Seven cycles was chosen as the minimum dot spacing because that's the time for "STA RESBL / DEY / BEQ exit". Larger spacings use STA.W or add NOPs. I use overlapping code for different dot spacings and using the above means the code generator can be pretty simple. I could have accommodated closer dots and still gotten some code overlap if I'd generated "STA RESBL / DEY / STA RESBL / BEQ exit" but since such code sequences could only exit at the BEQ the code generator would have to be more complicated.
-4- The code generator right now works by generating the strings top to bottom. When generating each string, it checks to see if the master template string W$ already contains it. If not, it appends to W$, borrowing from the end of W$ if possible (e.g. if W$ ends with "112112" and I want to add "11212", I would just add the "12" to W$ so it would end with "11211212"). A better packing algorithm could probably avoid some redundancy here.
-5- Because of branching restrictions, the generated code has to be split into pages. The existing algorithm is very crude and makes no effort to find the best arrangement of code into pages.
-6- To make things quick-and-easy, I used four-byte entries for each row in the master table. This is rather expensive. If the screen is 32 dots tall (as at present), each horizontal scroll position will take up 128 bytes. Since I would expect there to be fewer than 256 entry points into the dot-drawing code, and since both the starting delay and number of dots are always less than 15, it should be possible to compress this into two bytes/row.
-7- At present, the generated code for each line starts with "STA WSYNC/LDY #2/STY ENABL". It may be more useful to do something like "LDA #2/STA RESBL/STA ENABL" sometime in the line before the dot line, and then starting the dot line with "STA WSYNC/STA GRP0". If VDELP1 is set, this would allow for clean sprite shape updates.
-8- Because there are no delay registers for missile enables, it may be difficult to get those done without 'tearing' in the line following a dot line. I would suggest that it may be useful to compute two double-lines worth of missile data earlier in the kernel, so that immediately following a dot line the kernel can simply "LDA missiles / STA ENAM0 / ASL / STA ENAM1". If missiles are only one double-scan-line high, that might not be necessary: set up SP before invoking the dot line, then after the dot line do "cpx missiley1 / php / cpx missiley2 / php".
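Since the generator itself is in QBASIC, here is a rough Python equivalent of points -2-, -4- and -6- for anyone who would rather experiment in something else. It is a sketch, not a port: the function names are made up, and the two-byte layout in `pack_entry` (entry point in one byte, delay and dot count as nybbles in the other) is only one possible arrangement.

```python
def encode_row(dot_cycles):
    """Turn dot cycle positions into (start, gap string), gaps minus six."""
    start = dot_cycles[0]
    gaps = [b - a for a, b in zip(dot_cycles, dot_cycles[1:])]
    if any(g < 7 or g > 15 for g in gaps):
        raise ValueError("gaps must be 7..15 cycles to fit one digit")
    return start, "".join(str(g - 6) for g in gaps)

def add_to_template(template, row):
    """Add row to the master string, reusing any overlap at its end.

    Returns (new template, offset at which row can be found).
    """
    found = template.find(row)
    if found >= 0:
        return template, found               # already present somewhere
    for overlap in range(min(len(template), len(row)), 0, -1):
        if template.endswith(row[:overlap]):
            return template + row[overlap:], len(template) - overlap
    return template + row, len(template)     # nothing to borrow

def pack_entry(entry_point, start_delay, dot_count):
    """Assumed two-byte row entry: entry index, then delay/count nybbles."""
    if not (0 <= entry_point < 256 and 0 <= start_delay < 16
            and 0 <= dot_count < 16):
        raise ValueError("field out of range")
    return bytes([entry_point, (start_delay << 4) | dot_count])
```

`encode_row([10, 17, 24, 32, 39, 46])` gives `(10, "11211")`, and `add_to_template("112112", "11212")` gives `("11211212", 3)`, matching the two examples above. A smarter packer would look for overlaps at both ends or reorder the rows first.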
Let me know if you need any other assistance.