Lynx loader from scratch


While trying to squeeze the STNICC scene onto a card, I found that ethusi's loader wastes some space. So here is another version, which fetches the data that directly follows the encrypted loader. Three bytes are left for personal use ;-)

; micro loader
; program must start at $1ff, first byte must contain number of
; pages to load (see demo.s), so actual code at $200
; Note: Does not clear AUDIN, therefore not for use with bank-switching carts!
;      (lda #$1a; sta $FD8A, and FE00 sets AUDIN (B4) == 0)

RCART_0		EQU $fcb2 ; cart data register

BLOCKNR		EQU 0		; zeroed by ROM

	RUN    $0200

	; SP = 3 after ROM, so push 3 bytes plus
	ldx	#(b9+1)-b0+3
	stz	$fda0,x		; clear colors
	lda	b0,x		; copy loader
	bpl	cpy

	ldy	#51+1		; already 51 bytes loaded from 1st block!
	bra	$200-(b9+1-b1)

	; to be copied into stack
	bne	b2
	inc	BLOCKNR		; next block
	jsr	$fe00		; select block
	ldx	#4		; 4 pages per block
	lda	RCART_0
	sta	$200-(51+2),y	; first byte goes to $1ff (PAGECNT)
	bne	b2

	inc	$200-(b9+1-DST)+2	; next dst page

	bne	b0
	dc.b 	$80		; opcode "BRA"
	; PAGECNT will be here, if zero => BRA $200

	; program is here ...

size	set endofbl-$200
free	set 49-size
	echo "Free %Dfree"
	IF free < 0
	echo "Size must be <= 50!"
	ENDIF

	; fill remaining space
	IF free > 0
	REPT	free
	dc.b	$42		; unused space shall not be 0!
	ENDR
	ENDIF
	dc.b 	$00		; end mark!

The program must start at $1ff with the number of pages, so actual code begins at $200.

This one just loads a single file; there is no directory. A directory can be implemented in the application, though.

Here is my Makefile:

all: ml.lnx

ml.lyx: ml_enc.bin demo.bin
	cat ml_enc.bin demo.bin >$@

ml.lnx: ml.lyx
	make_lnx $< -b0 256K -o $@

ml_enc.bin: ml.bin
	lynxenc $< $@

ml.bin: micro_loader.s
	lyxass -d -o $@ $<

demo.bin: demo.s
	lyxass -d -o $@ $<

.PHONY: clean
clean:
	rm -f *.bin
	rm -f ml.lnx ml.lyx

Just a hint:

The second stage in the micro loader is not encrypted, so there is no real speed difference compared to your approach.

If you want to do a REAL optimization, you have to choose the filler bytes such that the multiplication is faster.
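To illustrate why the filler bytes can matter, here is a minimal sketch of a bit-serial (shift-and-add) modular multiplication. It assumes the decryption walks the bits of one operand and only adds for 1 bits; the function name and the add counter are made up for illustration, and this is not the ROM routine itself.

```python
def mulmod_shift_add(a, b, n):
    """Return (a*b mod n, number of conditional additions performed)."""
    result = 0
    adds = 0
    # Walk the bits of b from most to least significant.
    for i in reversed(range(b.bit_length())):
        result = (result << 1) % n      # doubling step, done for every bit
        if (b >> i) & 1:
            result = (result + a) % n   # extra add happens only for 1 bits
            adds += 1
    return result, adds


# Same product, but the operand with fewer 1 bits needs fewer additions.
print(mulmod_shift_add(3, 0b10000001, 1000))
print(mulmod_shift_add(3, 0b11111111, 1000))
```

In this model every 0 bit in the multiplier skips one addition, which is the effect a well-chosen filler would try to exploit.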



The challenge was to fit as much as possible into the first 50 bytes, since this is the minimum.


But, oh, I get your point, if the first stage is only a few bytes that loads the rest and we fill the remainder with optimal values, decryption plus additional loading might be quicker.


But to find this kind of optimized code one needs exact cycle counts. There is a guy in the 6502-FB group who made a simulator with lots of debugging features, but it is for the Apple ][.


I do not trust handybug's cycle count, but the decryption could be run in another simulator ...


Next challenge :-)


I did some profiling of that modular multiplication algorithm while it decoded Karri's micro-loader and here are the results:




The last four columns show the number of branches taken vs. not taken. Overall it took 1486320 cycles. I think it might be faster if the encoded stream had more zero bits. But how long does it take the Lynx to execute that many cycles? Is it much more than 0.4s? So is there really any sense optimizing it?
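As a rough check of that estimate, the arithmetic can be sketched like this. The 4 MHz clock is an assumption (a nominal figure; the effective rate on real hardware is likely lower), and the cycle count comes from a different machine's emulator, so treat the result as a ballpark only.

```python
CYCLES = 1_486_320            # figure from the profiling run
ASSUMED_CLOCK_HZ = 4_000_000  # assumption: nominal 4 MHz CPU clock


def cycles_to_seconds(cycles, clock_hz):
    """Convert a cycle count to seconds for a given clock rate."""
    return cycles / clock_hz


seconds = cycles_to_seconds(CYCLES, ASSUMED_CLOCK_HZ)
print(f"{seconds:.3f} s")
```

Under that assumption one block decodes in a bit under 0.4 s, so a few percent saved would be on the order of hundredths of a second.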

"So is there really any sense optimizing it?"


IMHO, optimizing up to a certain point is a "must". Beyond this point it is just: "because I can" ;-)


Yeah, yeah, I know. Premature optimization is my hobby too ;) But the question was: what is at stake here? I don't own the console and really don't know how long it takes to decrypt one 51-byte block. I presume that decrypting one block instead of two is perceptible, but would speeding up the process by a few percent be noticeable? I don't even know how much faster it can get. I could do some tests: generate some blocks with different numbers of zeros and see how many cycles each takes. But overall the means of optimization here are... cumbersome at least. It involves filling the extra space with different values and checking the number of zeros in the result after encoding. Pure brute force. If someone has idle cycles on his/her machine, it can be done in spare time.
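The brute-force idea could be sketched roughly as below. Everything here is hypothetical: `best_filler`, `zero_bits` and the offsets are made-up names, and `toy_encode` is only a stand-in so the sketch runs. It is NOT the Lynx encryption, which would have to be supplied in its place.

```python
from itertools import product


def zero_bits(data: bytes) -> int:
    """Count the 0 bits in a byte string."""
    return 8 * len(data) - sum(bin(b).count("1") for b in data)


def best_filler(block, free_offsets, encode, candidates=range(1, 256)):
    """Try every filler combination; return (zero-bit score, best block).

    candidates starts at 1 because unused bytes must not be 0.
    """
    best = None
    for filler in product(candidates, repeat=len(free_offsets)):
        trial = bytearray(block)
        for off, value in zip(free_offsets, filler):
            trial[off] = value
        score = zero_bits(encode(bytes(trial)))
        if best is None or score > best[0]:
            best = (score, bytes(trial))
    return best


def toy_encode(block: bytes) -> bytes:
    """Placeholder only (cube modulo a toy modulus), not the real encoder."""
    modulus = (1 << (8 * len(block))) - 1
    x = int.from_bytes(block, "big")
    return pow(x, 3, modulus).to_bytes(len(block), "big")


# Tiny demo: two free bytes at offsets 2 and 3, small candidate set.
print(best_filler(bytes([0x10, 0x20, 0x42, 0x42]), [2, 3], toy_encode,
                  candidates=range(1, 8)))
```

With three free bytes and 255 candidate values each, this is about 16.6 million encodings, which is why it would be a job for idle time.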

On a real (original) Lynx, the short loader finishes before the tube has stabilized.

Also, the ROM zeroes all RAM, which takes far more time than decoding one block.


So, finding the optimal block is academic and of no real use. :-)

I see :)

Nevertheless, if someone feels an urge to pursue a Monty Python level of academicity, I could prepare an idle-priority brute-force encrypter that would walk through the whole space of different fillings of the unused bytes in search of an encoded block with as many zeros as possible :-)

"Also, the ROM zeroes all RAM, which takes far more time than decoding one block."


I don't see this zeroing of the RAM, as there seem to be garbage values in uninitialized variables. Or they could be leftovers from the decryption process.

This is the code in the ROM:


STZ z01
LDA #$00
STA (z00),Y
INC z01


It is called at the very beginning:


BEQ end
LDY #$02
STZ z00
JMP clearMem


"I don't see this zeroing of the RAM, as there seem to be garbage values in uninitialized variables. Or they could be leftovers from the decryption process."


Here's an animation of the contents of page zero before and after decryption. So yes, page zero is littered with leftovers from the decryption process.




"How did you profile Lynx code?"


I ran the decryption in Altirra. The processor is roughly the same :)

But it might be a good idea to add a profiler to Handy. It should not be very difficult, given that it has cycle-exact emulation.

