The first thing (which may or may not be something anyone cares about but me) is that this is all actual hardware, with no FPGA or emulation on the board. There is going to be a CPLD handling some of the logic functions, but that's only because otherwise I'd have to fit around 15 chips' worth of discrete logic gates on a board that's already a little cramped. This tickles my fancy: I've personally never felt that excited about a big black box of an FPGA doing everything. It's technically very impressive, but at that point I feel like I might as well just run an emulator. It also means I can leave out the single most expensive component in the AppleSqueezer and use cheaper parts that are also more readily available. This is just speculation on my part, but I'd imagine component shortages are part of the reason other new accelerators are sold out and unavailable to me, so I've been making sure to avoid any parts that'll be difficult to get my hands on.
The second thing is the bottleneck the IIGS has. Writing to video RAM is the biggest obstacle for games: writes always slow down to the speed of the 1 MHz system bus regardless of how fast the CPU is. The specific idea I'm working with is a queue for any writes to VRAM. The CPU can fill it at the full accelerated 14 MHz speed, and the queue then drains onto the slow bus, saturating 100% of its bandwidth. Basically it'd work more like DMA, transferring data on every single slow-bus clock cycle.
As far as I know, the fastest way to move data into VRAM is the PEA instruction, which takes 5 cycles to put 2 bytes into VRAM (assuming the stack pointer has already been set to point there). Only 2 of those cycles actually need to touch slow memory, so the hypothetical best time for something like this would be 2000 ns for the two slow writes plus about another 214 ns for the remaining 3 cycles at 14 MHz, or roughly 2214 ns in total. But with the queue approach, as long as the queue isn't full, every cycle can run accelerated, meaning the whole instruction takes only about 357 ns (5 cycles at 14 MHz).
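To double-check the arithmetic above, here's a quick sketch in Python. It assumes nominal clocks of exactly 14 MHz and 1 MHz (the real clocks will differ slightly), and the function and variable names are just ones I made up for illustration.

```python
# Assumed nominal clock rates, taken from the figures in the post.
FAST_MHZ = 14.0  # accelerated CPU clock
SLOW_MHZ = 1.0   # slow system bus clock

PEA_CYCLES = 5         # total cycles for one PEA instruction
SLOW_WRITE_CYCLES = 2  # cycles that actually write to slow memory


def ns(cycles, mhz):
    """Duration of `cycles` clock cycles at `mhz`, in nanoseconds."""
    return cycles * 1000.0 / mhz


# Without the queue: the 2 write cycles run at bus speed, the other 3 run fast.
unqueued_ns = ns(SLOW_WRITE_CYCLES, SLOW_MHZ) + ns(PEA_CYCLES - SLOW_WRITE_CYCLES, FAST_MHZ)

# With the queue (and room in it): all 5 cycles run at the accelerated speed.
queued_ns = ns(PEA_CYCLES, FAST_MHZ)

speedup = unqueued_ns / queued_ns

print(f"unqueued: {unqueued_ns:.0f} ns, queued: {queued_ns:.0f} ns, speedup: {speedup:.1f}x")
```

This is only the best case for a single instruction; it says nothing about what happens once the queue fills up.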
2214/357 ~= 6.2, so roughly a 6x speed increase at the worst bottleneck in the system. Even if this only holds part of the time, it can be a significant performance increase. Of course, once the CPU fills the queue it'll be stuck waiting to do writes the old-fashioned slow way. But since the queue is a decent length, as long as we're writing on average no more than 1 byte to VRAM every 14 cycles, it empties as quickly as it fills and we stay at max speed. Factoring in other code (most programs and games aren't just writing a test pattern endlessly, after all), I'm willing to bet this is true more often than not. Of course, the proof is in the pudding, and I won't know if any of this actually works out until I have the thing in my hands and test it.
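The fill-versus-drain argument can be sanity-checked with a toy simulation. This is a rough sketch, not a model of the real hardware: the queue depth of 16 is a number I picked purely for illustration, the CPU is assumed to issue one byte write every `write_interval` fast cycles, and the slow bus is assumed to retire exactly one queued byte every 14 fast cycles.

```python
def simulate(write_interval, queue_depth=16, total_writes=1000, drain_interval=14):
    """Count the fast cycles the CPU spends stalled waiting for queue space.

    write_interval: fast cycles between VRAM byte writes issued by the CPU.
    queue_depth:    hypothetical queue size (the real depth is undecided).
    drain_interval: fast cycles per slow-bus cycle (14 MHz / 1 MHz).
    """
    queued = 0      # bytes currently waiting in the queue
    stalled = 0     # fast cycles lost because the queue was full
    written = 0     # writes accepted into the queue so far
    next_write = 0  # cycle at which the CPU wants to issue its next write
    cycle = 0
    while written < total_writes:
        # The slow bus retires one byte per slow cycle, if anything is queued.
        if cycle > 0 and cycle % drain_interval == 0 and queued > 0:
            queued -= 1
        if cycle >= next_write:
            if queued < queue_depth:
                queued += 1
                written += 1
                next_write = cycle + write_interval
            else:
                stalled += 1  # queue full: CPU waits for the slow bus
        cycle += 1
    return stalled


for interval in (5, 14, 20):
    print(f"write every {interval} cycles: {simulate(interval)} stalled cycles")
```

In this toy model a sustained rate of one byte every 14 cycles or slower never stalls, a short burst that fits in the queue doesn't stall either, and only a long run of faster writes ends up waiting on the slow bus.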