Another exercise is clearing RAM. Can you do it faster than this? (Hope my code is OK, haven't tried it).
Code looks fine! It was tough, but I think I beat it. It requires a few more cycles for setup and looping, but you can use STST to write the zeros even faster than CLR (8 cycles versus 10 cycles to 16-bit RAM, and 16 cycles versus 22 cycles to 8-bit RAM, thanks to one fewer memory access: STST doesn't read before write). I tried two ways of getting a zero into the status register.
For one, just doing math. I loaded a zero into R3, then
AB R3,R3 * 0+0=0, clears L>, A>, C, OV, OP, sets EQ
COC R1,R3 * assuming R1<>0, clears EQ
This clears the status register, provided you are also running with interrupts disabled (LIMI 0) and not inside an XOP. It takes a spare register and 28 cycles (0-wait state), and has to be done every loop before the LWPI.
If you can spare the high registers in the workspace, then you can use RTWP to do it instead. This also lets you do it without self-modifying code.
LI R13,START * will become the WP
LI R14,LOOP2 * is the address branched to
CLR R15 * will become the ST register
LOOP2 STST R0
If I'm doing my math right, assuming 16-bit code and data both, the CLR version should take 34 cycles and 2 registers to set up, and take 236 cycles per loop (an average of 7.375 cycles per byte). If the target is in 8-bit RAM, it would take another 128 cycles per loop (due to the read-before-write, 8 wait states are inserted), which makes it 11.375 cycles per byte.
The STST version needs 56 cycles to set up and 4 registers (and fixed registers for three of those), but manages 186 cycles per loop (averaging 5.8125 cycles per byte). STST does not have the read-before-write, so it jumps ahead in 8-bit target memory, adding only 64 cycles (4 wait states per write). This makes an average of 7.8125 cycles per byte.
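For anyone who wants to double-check the per-byte averages, here is a quick sketch in Python (not TMS9900 code) that just redoes the arithmetic from the figures above. The 32 bytes per loop is my assumption, inferred from 236 cycles working out to 7.375 cycles per byte; the names are made up for illustration.

```python
# Assumption: each loop clears 32 bytes (inferred from 236 / 7.375).
BYTES_PER_LOOP = 32


def cycles_per_byte(loop_cycles, wait_cycles=0):
    """Average cycles per byte for one loop, including any wait-state penalty."""
    return (loop_cycles + wait_cycles) / BYTES_PER_LOOP


# Loop and wait-state cycle counts are the ones quoted in the post.
results = {
    "CLR, 16-bit target": cycles_per_byte(236),
    "CLR, 8-bit target": cycles_per_byte(236, 128),
    "STST, 16-bit target": cycles_per_byte(186),
    "STST, 8-bit target": cycles_per_byte(186, 64),
}

for name, value in results.items():
    print(f"{name}: {value} cycles per byte")
```

Going by the same figures, the STST version costs 22 more cycles to set up (56 versus 34) but saves 50 cycles per loop (236 versus 186), so it should already be ahead after the first loop.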
Fun puzzle, and I almost gave up!