but there's only a single cycle free...
Without looking further into details, I think there might(!) be some potential for optimization:
1. There is a WSYNC write, which could be removed if the kernel is timed to exactly 76 cycles per scanline. (3 cycles gained)
2. There might be a chance to merge the two repositioning branches and check outside the main loop which one is relevant. (2 cycles)
3. As an alternative to 2, one repositioning branch could be merged with the kernel loop branch (2 cycles), or maybe even both, so that there is only one branch left. (4 cycles)
4. The kernel could be unrolled once to save the loop branch on every other line (1.5 cycles per line on average). Or unrolled further for more cycles.
5. Maybe the lower bits of AUDV0 could be shared with one PF0 write, since PF0 only uses the upper four bits. Though you need one bit for ENAMx there. Not sure if one could cheat here.
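To make idea 5 a bit more concrete, here is a rough Python sketch of how the TIA would interpret one shared byte. It relies on the register layout as I remember it from the Stella Programmer's Guide (PF0 uses D7-D4, AUDV0 uses D3-D0, ENAMx uses D1); the function names are made up for illustration and have nothing to do with the actual kernel.

```python
def pack_shared_byte(pf0_bits: int, volume: int) -> int:
    """Combine a 4-bit PF0 pattern (D7-D4) and a 4-bit volume (D3-D0) in one byte."""
    if not 0 <= pf0_bits <= 0x0F or not 0 <= volume <= 0x0F:
        raise ValueError("both fields must fit in 4 bits")
    return (pf0_bits << 4) | volume


def split_shared_byte(value: int) -> dict:
    """Show what each register would pick up if the same byte were written to it."""
    return {
        "PF0": (value >> 4) & 0x0F,   # upper nibble only
        "AUDV0": value & 0x0F,        # lower nibble only
        "ENAMx": (value >> 1) & 0x01, # D1, which overlaps the volume nibble
    }


# Volume 5 (binary 0101) leaves D1 clear, volume 7 (binary 0111) sets it.
quiet = split_shared_byte(pack_shared_byte(0x0A, 5))
loud = split_shared_byte(pack_shared_byte(0x0A, 7))
```

The catch is visible in the last two lines: D1 belongs to both the volume and the missile enable, so if the same byte also feeds ENAMx, only volumes with the right D1 value are usable on a given line.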
I know not all ideas can be combined. But freeing enough cycles for at least one extra load/write seems doable.
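A quick way to tally the estimates above is sketched below. The gains are just the numbers quoted in the list, the one free cycle is the one mentioned at the top, and the 6 cycle target is my assumption of a zero page LDA plus a zero page STA (3 cycles each). Only the explicit "2 or 3" alternative is encoded as a conflict; other combinations may clash in the real kernel too.

```python
# Estimated per-line gains from the list above ("3a" = one branch merged, "3b" = both).
GAINS = {"1": 3.0, "2": 2.0, "3a": 2.0, "3b": 4.0, "4": 1.5}

# Ideas 2 and 3 are alternatives, so at most one entry of this group may be chosen.
EXCLUSIVE_GROUPS = [{"2", "3a", "3b"}]

FREE_NOW = 1.0  # the single spare cycle in the current kernel
NEEDED = 6.0    # assumption: LDA zp (3) + STA zp (3)


def cycles_available(ideas):
    """Return the spare cycles per line after applying the chosen ideas."""
    chosen = set(ideas)
    for group in EXCLUSIVE_GROUPS:
        if len(chosen & group) > 1:
            raise ValueError(f"ideas {sorted(chosen & group)} exclude each other")
    return FREE_NOW + sum(GAINS[idea] for idea in chosen)


def fits_extra_write(ideas):
    """True if the chosen ideas free enough cycles for one more load/store pair."""
    return cycles_available(ideas) >= NEEDED
```

With these numbers, ideas 1 and 2 together give exactly 6 cycles, while idea 1 alone gives 4 and would not be enough for a full load/store pair.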