Alright, after playing around with this for a while, I now understand how the engine can be so fast.
Instead of just storing the sprite bytes and copying them with indirect addressing, it relies on lists of immediate loads (like lda #0) followed by absolute,X stores (like sta $C600+0,x). This saves quite a few cycles when setting the sprites, but it takes up a huge amount of memory, especially because the instruction lists are duplicated (for PMG 0/1 and 2/3). Each sprite takes up about 6~8 times more memory.
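To make the trade-off concrete, here is a rough Python sketch of what such a "compiled sprite" list looks like and what it costs. The function name, the sample data and the reuse_a option are my own illustration, not taken from the engine. The byte and cycle counts are the standard 6502 figures (lda #imm: 2 bytes, 2 cycles; sta abs,x: 3 bytes, 5 cycles).

```python
# Sketch of a "compiled sprite" generator. All names are hypothetical;
# this is not the engine's actual code.

LDA_IMM_BYTES, LDA_IMM_CYCLES = 2, 2    # lda #imm
STA_ABSX_BYTES, STA_ABSX_CYCLES = 3, 5  # sta abs,x (stores always take 5)

def compile_sprite(rows, base=0xC600, reuse_a=False):
    """Return (asm_lines, code_bytes, cycles) for one column of sprite bytes.

    reuse_a is an assumed optimization: skip the load when the accumulator
    already holds the value to be stored.
    """
    lines, size, cycles = [], 0, 0
    in_a = None  # value currently known to be in the accumulator
    for offset, value in enumerate(rows):
        if not (reuse_a and value == in_a):
            lines.append(f"lda #${value:02X}")
            size += LDA_IMM_BYTES
            cycles += LDA_IMM_CYCLES
            in_a = value
        lines.append(f"sta ${base:04X}+{offset},x")
        size += STA_ABSX_BYTES
        cycles += STA_ABSX_CYCLES
    return lines, size, cycles

# Made-up 8-line shape, just to have some data to compile.
rows = [0x00, 0x3C, 0x7E, 0x7E, 0x3C, 0x00, 0x00, 0x00]
asm, size, cycles = compile_sprite(rows)
print("\n".join(asm))
print(size, "bytes of code for", len(rows), "bytes of data,", cycles, "cycles")
```

Without any load reuse, every data byte costs 5 bytes of code per copy of the list, so the memory blow-up is built into the technique.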
I won't be able to use the engine for 8bit-Strike, as I have 112 multicolor sprites of 8 x 16 pixels (they would take about 28K of memory instead of the current 3.5K).
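The numbers work out as follows, as a quick sanity check. I am assuming here that a multicolor sprite is two overlaid players with one byte per scanline each, which is what gives 3.5K for the raw data. The variable names are just for illustration.

```python
# Back-of-the-envelope memory estimate for the sprite set.
FRAMES = 112            # number of sprites
LINES = 16              # sprite height in scanlines (8 pixels = 1 byte wide)
PLAYERS_PER_SPRITE = 2  # assumption: multicolor = two overlaid players

raw_bytes = FRAMES * LINES * PLAYERS_PER_SPRITE
print("raw data:", raw_bytes, "bytes =", raw_bytes / 1024, "K")

# Apply the 6~8x growth quoted above for the load/store lists.
for factor in (6, 8):
    compiled = raw_bytes * factor
    print(f"x{factor}:", compiled, "bytes =", compiled / 1024, "K")
```

The 28K figure corresponds to the upper end (8x) of the estimate. Even at 6x the set would need about 21K.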