
Any compressor between RLE and LZ4?



Hi!

 

Hi,

 

I'm just wondering if there's anything between these two, as RLE just doesn't compress too well and LZ4 compresses well enough but is a little too slow when decrunching?

 

Perhaps some modified form of RLE?

 

Thanks

LZ4 should be as fast as (or even faster than) RLE. Are you sure you are using an optimized implementation?


Hi!

LZ4 should be as fast as (or even faster than) RLE. Are you sure you are using an optimized implementation?

Really? Well, I tried Fox and XXL's version; it's optimized for size, so probably not for speed. It looks like it can decrunch about 200-300 bytes max per frame at 50 fps.

 

The other problem with LZ4 for what I have in mind is that it can't decrunch in chunks, as it uses previously decrunched data (while RLE can).

 

I would still be interested to get a faster version if you have one.

 

Thanks!


 

 

Really? Well, I tried Fox and XXL's version; it's optimized for size, so probably not for speed. It looks like it can decrunch about 200-300 bytes max per frame at 50 fps.

 

 

Do you want to depack movies at 50 fps in real time on the A8?! That could be a difficult task on the little Atari...


I'm just wondering if there's anything between these two, as RLE just doesn't compress too well and LZ4 compresses well enough but is a little too slow when decrunching?

 

Three years ago I created "autogamy" and used it to compress the level data for "dye". (Also used in "E-Type".)

 

Features:

* fast and compact depacker (using self-modifying code)

* nearly as good as LZ4 - for small files often even better

* in-place decompression if the source data ends at least 4 bytes behind the end of the target buffer

* compression also uses self-referencing - so no streaming (windowing currently unsupported)

 

The archive contains the packer for Windows (exe) and source code for the de-packer (cc65 header / ca65 source).

 

You may like to give it a try.

Autogamy.zip


Download the MADS assembler and have a look at the various crunchers.

 

I used exomizer, deflate, and lz4 in the past, but have now stuck with the C64 depacker called "B2" & pucrunch.

 

I've started using MADS and indeed there are plenty of crunchers. I will take a closer look, but perhaps some folks here have experience with them and can tell me if one of them fits the requirements?

 

Can't find any reference to "B2", but pucrunch/exomizer aren't quite what I'm after... way too slow to decrunch.


 

Do you want to depack movies at 50 fps in real time on the A8?! That could be a difficult task on the little Atari...

 

Well, there are various things I want to do... and decoding gfx data is one of them. A single 32x32 1-color sprite would take about half a frame to decode... That hurts.
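
(For scale: a 32x32 1-bpp sprite is 32 × 32 / 8 = 128 bytes, so at the 200-300 bytes per frame measured above, that is indeed roughly half a frame.)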


 

Three years ago I created "autogamy" and used it to compress the level data for "dye". [...] You may like to give it a try.

 

Thanks! :thumbsup:

 

I will give it a try for sure. Sounds like it would be a good fit for one of my use cases.

 

I've got another use case which requires -no- self-referencing, so I will look into what's in MADS and possibly have to try my own RLE variant (not even sure it will compress much better than plain RLE, though); one possible shape is sketched below.
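
One "modified RLE" that might fit is a PackBits-style scheme: a control byte distinguishes literal runs from repeat runs, which avoids classic RLE's worst case of doubling non-repetitive data. A hedged C sketch of the encoder (thresholds and names here are illustrative, not any particular existing packer):

#include <stddef.h>
#include <stdint.h>

/* Control byte h:
   0..127   -> copy the next h+1 bytes verbatim (literal run)
   129..255 -> repeat the next byte 257-h times (2..128)
   128      -> unused (often reserved for end-of-data)
   dst must be sized for the worst case (len + len/128 + 1). */
size_t packbits_encode(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t i = 0, o = 0;
    while (i < len) {
        /* measure the run of identical bytes starting at i */
        size_t run = 1;
        while (i + run < len && src[i + run] == src[i] && run < 128)
            run++;
        if (run >= 2) {                       /* encode as a repeat */
            dst[o++] = (uint8_t)(257 - run);
            dst[o++] = src[i];
            i += run;
        } else {                              /* gather literal bytes */
            size_t start = i, lit = 0;
            while (i < len && lit < 128) {
                if (i + 2 < len && src[i] == src[i + 1]
                                && src[i] == src[i + 2])
                    break;                    /* a worthwhile run begins */
                i++; lit++;
            }
            dst[o++] = (uint8_t)(lit - 1);
            while (start < i) dst[o++] = src[start++];
        }
    }
    return o;
}

The matching depacker costs about the same as plain RLE on the 6502: one compare on the control byte, then either a copy loop or a fill loop.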


Hi!

 

Really? Well, I tried Fox and XXL's version; it's optimized for size, so probably not for speed. It looks like it can decrunch about 200-300 bytes max per frame at 50 fps.

 

The other problem with LZ4 for what I have in mind is that it can't decrunch in chunks, as it uses previously decrunched data (while RLE can).

I don't understand this. LZ4 can be decompressed "almost" in place (with about 8 bytes of gap between compressed/decompressed data).

 

I would still be interested to get a faster version if you have one.

 

Thanks!

With a "small" (124 bytes) implementation, I get a little less than 49 cycles per byte, this is 500 bytes/PAL frame on GR.0 (the slowest possible), about 650 bytes/PAL frame without DMA. This is my main "copy" loop, in ZP:

 

        org     $80             ; runs from zero page for size/speed
docopy
        jsr     getsrc          ; fetch next packed byte
dst = * + 1
        sta     output_addr     ; operand patched via "dst" (self-modifying)
        inc     dst             ; bump 16-bit destination pointer
        bne     skip4
        inc     dst+1
skip4:
        dex                     ; X/Y form the 16-bit byte count
        bne     docopy
        dey
        bne     docopy
        rts

getsrc
src = * + 1
        lda     packed_data     ; operand patched via "src" (self-modifying)
        inc     src             ; bump 16-bit source pointer
        beq     skip5
        rts
skip5:  inc     src+1
        rts
Note that the "docopy" loop uses JSR to the read routine, and the pointer is incremented on each byte read/write, this is to minimize size.

 

You can implement it faster if you use indexed addressing; the main copy loop would look like this:

 

docopy
        lda     source_addr, X  ; operands self-modified per 256-byte page
        sta     output_addr, X
        inx
        bne     docopy          ; inner loop copies one page
        dey                     ; Y counts remaining pages (high-byte
        bne     docopy          ;  advance elided in this sketch)
        rts
This would take 14 cycles/byte in the best case, so a maximum of about 1700 bytes per PAL frame. A version three times faster than my original code should certainly be possible.
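
(For scale: a PAL frame is 312 lines × 114 cycles ≈ 35,500 CPU cycles, and 1700 bytes × 14 cycles ≈ 23,800 cycles, so that best case assumes most of the frame is free of DMA.)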

 

Attached are sample source code and a program that compresses a XEX file using LZ4 (Windows/Linux versions).

lzxex.zip

lz4dec.asm


LZ4 compresses well enough but is a little too slow when decrunching?

 

 

LZ4 is very fast

 

gpl3.txt - 35147 bytes

exomizer - 12382 bytes + depacker 1 page =~ 12.3 KB, decompress 128 frames (2.6 sec)

deflate - 11559 bytes + depacker 2 pages =~ 11.8 KB, decompress 179 frames (3.6 sec)

LZ4 - 15622 bytes + depacker <150 bytes =~ 15.3 KB, decompress 55 frames (1.1 sec)

There is also a bootloader with LZ4 streaming decompression.

test-lz4.atr

test-normal.atr


I don't understand this. LZ4 can be decompressed "almost" in place (with about 8 bytes of gap between compressed/decompressed data).

 

Yes, but if the source data is bigger than the available target-buffer memory (-> streaming), you've lost. So you need something like windowing the data during compression, or no self-referencing at all.


Are we talking about packing 32x4 bytes = 128 bytes total? Is there any chance to win here? Sprite data might be very pixelized, so I'm not sure compression helps a lot. How many of those frames do you want to pack as a stream? What is the total size? That's something we must know to find the "best" packer for your use-case scenario.


Hi!

 

Yes, but if the source data is bigger than the available target-buffer memory (-> streaming), you've lost. So you need something like windowing the data during compression, or no self-referencing at all.

Well, the LZ4 format codes two types of runs, literal runs and copy-over runs, so you can simply limit the window size to allow streaming with any buffer size. Of course, the compression will be a lot worse with smaller windows.
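
To make the two run types concrete, here is a minimal C sketch of a raw LZ4 block decoder (block format only, no frame header; names and error handling are mine, not taken from any of the attached depackers):

#include <stddef.h>
#include <stdint.h>

/* Decode one raw LZ4 block. Returns bytes written, or -1 on bad input. */
ptrdiff_t lz4_block_decode(const uint8_t *src, size_t srclen,
                           uint8_t *dst, size_t dstcap)
{
    const uint8_t *s = src, *send = src + srclen;
    uint8_t *d = dst, *dend = dst + dstcap;

    while (s < send) {
        uint8_t token = *s++;

        /* literal run: length in the high nibble, 15 = "more bytes follow" */
        size_t lit = token >> 4;
        if (lit == 15) {
            uint8_t b;
            do { b = *s++; lit += b; } while (b == 255 && s < send);
        }
        if (lit > (size_t)(send - s) || lit > (size_t)(dend - d)) return -1;
        while (lit--) *d++ = *s++;
        if (s >= send) break;            /* a block ends with literals */

        /* copy-over run: 16-bit little-endian distance back into output */
        size_t dist = s[0] | (s[1] << 8);
        s += 2;
        size_t mlen = token & 15;
        if (mlen == 15) {
            uint8_t b;
            do { b = *s++; mlen += b; } while (b == 255 && s < send);
        }
        mlen += 4;                       /* minimum match length is 4 */
        if (dist == 0 || dist > (size_t)(d - dst)
                      || mlen > (size_t)(dend - d)) return -1;
        const uint8_t *m = d - dist;     /* may overlap d: copy byte-wise */
        while (mlen--) *d++ = *m++;
    }
    return d - dst;
}

The only state it needs besides the output buffer is the back-pointer d - dist, which is exactly what makes unlimited-window LZ4 awkward for streaming.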

 

In my code above I made an optimization: I replaced the "copy distance" field with a "copy address" field (= position - distance), so the copy can be done directly. This also allows (by modifying the compressor) using *any* place in the target memory as a source, so you could stream frames using the read buffer and the target screen, sending only the changed bytes.
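
A compressor-side sketch of that tweak, assuming the decompressed data lives at a known fixed address on the Atari (dst_base and the function name are mine):

#include <stdint.h>

/* Emit an absolute 6502 "copy address" instead of a match distance, so
   the depacker can patch it straight into the operand of an absolute
   LDA and skip the 16-bit "position - distance" subtraction. */
static void emit_match_addr(uint8_t **out, uint16_t dst_base,
                            uint16_t pos, uint16_t dist)
{
    uint16_t copy_addr = dst_base + pos - dist;  /* match source address */
    *(*out)++ = (uint8_t)(copy_addr & 0xFF);     /* little-endian, as    */
    *(*out)++ = (uint8_t)(copy_addr >> 8);       /* the 6502 expects     */
}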


Hi!

With a "small" (124-byte) implementation, I get a little less than 49 cycles per byte [...] This would take 14 cycles/byte in the best case, so a maximum of about 1700 bytes per PAL frame.

Sounds interesting as well; 1700 bytes best case is a lot better for LZ4. :thumbsup:


 

LZ4 is very fast [...] LZ4 - 15622 bytes + depacker <150 bytes =~ 15.3 KB, decompress 55 frames (1.1 sec)

 

Well, 15.3 KB / 55 ≈ 280 bytes per frame, which is what I was seeing as well...

 

Sure, it's faster than exomizer/deflate, but they're at the far end of the spectrum when it comes to speed.

 

 

I was looking for something between LZ4 & RLE in terms of speed and compression ratio. Looks like I've got two possibilities, with dmsc & irgendwer, at least when self-referencing is OK.

 

 

BTW, I tried compressing data with your RLE implementation that I found in MADS; it seems to perform a little worse than this one sometimes (files can be 1-10% bigger):

 

https://csdb.dk/release/?id=34685


Are we talking about packing 32x4 bytes = 128 bytes total? Is there any chance to win here? Sprite data might be very pixelized, so I'm not sure compression helps a lot. How many of those frames do you want to pack as a stream? What is the total size? That's something we must know to find the "best" packer for your use-case scenario.

 

OK, so I guess that was a little confusing. I mixed two use cases into one question.

 

My first use case was simple:

 

Like pirx said, a decent streaming compression scheme would be perfect for SAP-R, which is exactly what I was trying to do!

 

Deinterleaved Pokey data compresses pretty well with lz4... so I modified XXL's lz4 decompression code to output only 9 bytes per frame, only to realize that it didn't work because it was trying to copy data that I'd already discarded (yeah, I know very little about compression :) )

 

I also realized that it was taking about 10 scanlines to output 9 bytes!

 

I had another use case in mind (compressing sprites), but at 300 bytes/frame it just wouldn't cut it either. With the code posted before, I have hope for this one.


 

In my code above I made an optimization: I replaced the "copy distance" field with a "copy address" field (= position - distance), so the copy can be done directly. This also allows (by modifying the compressor) using *any* place in the target memory as a source, so you could stream frames using the read buffer and the target screen, sending only the changed bytes.

 

 

Indeed, not what I want to do at all though... sorry for the confusion :arrow:

 

 

 

 

Well, the LZ4 format codes two types of runs, literal runs and copy-over runs, so you can simply limit the window size to allow streaming with any buffer size. Of course, the compression will be a lot worse with smaller windows.

 

 

Yes, unfortunately for SAP-R the window has to be 128-256 bytes for any decent compression, and 128 * 9 * 2 ≈ 2.3 KB just for buffers, so you're starting to lose the benefit of the compression. I get the feeling that the code would be clunky too. A cheaper compressor might do just as well.


Deinterleaved Pokey data compresses pretty well with lz4... so I modified XXL's lz4 decompression code to output only 9 bytes per frame, only to realize that it didn't work because it was trying to copy data that I'd already discarded (yeah, I know very little about compression :) )

You might want to take a look at the Wikipedia article on LZSS to understand the basic idea behind all of the compressors you've been looking at (beyond RLE).

 

It's a really simple concept, but, as you've found, you need to keep the decompressed data around so it can be referenced.

 

This usually isn't a problem, but you do need to know that you've got to do that.

 

Pure LZSS might be what you're looking for. The compression ratio is worse than LZ4's, but the coding for the decompressor is absolutely trivial.

 

It is what a lot of games in the late 1980s and early 1990s used, before programmers tweaked the details of the algorithm to get even better compression at some cost in speed.
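
To back up the "absolutely trivial" claim, here is a hedged C sketch of a classic Okumura-style LZSS decoder (assumed parameters: 4 KB ring buffer, 12-bit position + 4-bit length, one flag byte per 8 items; this mirrors the scheme in the Mark Nelson source linked below):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N         4096   /* ring buffer (window) size */
#define THRESHOLD 2      /* matches this short are stored as literals */

void lzss_decode(FILE *in, FILE *out)
{
    static uint8_t ring[N];
    int r = N - 18;              /* conventional initial write position */
    unsigned flags = 0;
    int c;

    memset(ring, ' ', N);        /* conventional initial window contents */
    for (;;) {
        if (((flags >>= 1) & 0x100) == 0) {
            if ((c = fgetc(in)) == EOF) break;
            flags = c | 0xFF00;  /* high byte counts down 8 items */
        }
        if (flags & 1) {         /* flag bit set: literal byte */
            if ((c = fgetc(in)) == EOF) break;
            fputc(c, out);
            ring[r++] = (uint8_t)c; r &= N - 1;
        } else {                 /* flag bit clear: (position, length) pair */
            int lo = fgetc(in), hi = fgetc(in);
            if (lo == EOF || hi == EOF) break;
            int pos = lo | ((hi & 0xF0) << 4);      /* 12-bit ring position */
            int len = (hi & 0x0F) + THRESHOLD + 1;  /* 3..18 bytes */
            for (int k = 0; k < len; k++) {
                c = ring[(pos + k) & (N - 1)];
                fputc(c, out);
                ring[r++] = (uint8_t)c; r &= N - 1;
            }
        }
    }
}

On the 6502 the same loop maps nicely onto self-modifying absolute,X addressing, which is where the speed comes from.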

 

 

LZ4 was optimized for compression/decompression speed in real time on 32-bit CPUs... I've yet to find a case where it actually makes sense to use it on 8-bit or 16-bit computers/consoles.

 

Something like Jørgen Ibsen's aPLib offers better compression with only a little extra cost in speed... and if you really need the speed, then pure LZSS is hard to beat.

 

 

As you've pointed out, to get good decompression speed with any of the LZSS variants you've got to make good use of the X and Y index registers and self-modifying code. You'd also be well advised to avoid subroutine calls and to inline where possible.

 

 

You can find the C source for an LZSS compressor (from Mark Nelson's book) here...

 

https://github.com/videogamepreservation/homeworld/blob/master/src/Game/LZSS.C


 

Well, 15.3 KB / 55 ≈ 280 bytes per frame, which is what I was seeing as well... [...] BTW, I tried compressing data with your RLE implementation that I found in MADS; it seems to perform a little worse than this one sometimes (files can be 1-10% bigger): https://csdb.dk/release/?id=34685

Did you benchmark LZO? It's supposed to be designed around decompression speed.

 

Also, if the idea is just to blast this to the screen as you decompress, you could even add a special flag to the decompressor that says "move to this address before continuing to decompress". Then you can just let the decompression loose on an entire run of data to do a frame update, without having to write "same" data to parts of the screen.
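
That suggestion could look something like this hypothetical stream format (all opcode values and the layout are invented here for illustration):

#include <stdint.h>
#include <string.h>

/* One extra opcode repositions the output pointer, so unchanged parts
   of the screen are skipped instead of rewritten. */
enum { OP_COPY = 0, OP_SEEK = 1, OP_END = 2 };

void depack_with_seek(const uint8_t *src, uint8_t *mem /* 64 KB image */)
{
    uint16_t dst = 0;
    for (;;) {
        uint8_t op = *src++;
        if (op == OP_END)
            return;
        if (op == OP_SEEK) {               /* 2 bytes: new output address */
            dst = (uint16_t)(src[0] | (src[1] << 8));
            src += 2;
        } else {                           /* OP_COPY: count, then raw bytes */
            uint8_t n = *src++;
            memcpy(mem + dst, src, n);
            src += n;
            dst += n;
        }
    }
}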


Hi!

 

OK, so I guess that was a little confusing. I mixed two use cases into one question. [...] I had another use case in mind (compressing sprites), but at 300 bytes/frame it just wouldn't cut it either.

Look at the attached program; it is a (really simple) animation with 128 frames of 7680 bytes each. The compressor uses the current frame and the last one as references, so streaming output is possible. Attached are the C program that compresses the video and the decompressor ASM source.
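
A sketch of the dictionary layout that makes this work (my reconstruction, not necessarily what the attached genvideo compressor does): match candidates come from the previous frame followed by the already-emitted part of the current frame, so unchanged screen regions become long, cheap matches.

#include <stddef.h>
#include <stdint.h>

#define FRAME 7680               /* one frame, as in the post above */

typedef struct { size_t start; size_t len; } Match;

/* Brute-force longest match for cur[pos..] in the virtual dictionary
   prev[0..FRAME) ++ cur[0..pos). Fine for a sketch, slow for real use. */
Match find_match(const uint8_t *prev, const uint8_t *cur, size_t pos)
{
    Match best = { 0, 0 };
    size_t dict_len = FRAME + pos;
    for (size_t s = 0; s < dict_len; s++) {
        size_t len = 0;
        while (pos + len < FRAME) {
            size_t di = s + len;
            /* sources past the dictionary end are bytes the decoder will
               already have produced: the usual overlapping-match trick */
            uint8_t d = (di < FRAME) ? prev[di] : cur[di - FRAME];
            if (d != cur[pos + len]) break;
            len++;
        }
        if (len > best.len) { best.start = s; best.len = len; }
    }
    return best;
}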

lz4vid.xex

genvideo.zip

lz4vid.asm


Hi!

 

Sounds interesting as well; 1700 bytes best case is a lot better for LZ4. :thumbsup:

Here is a somewhat optimized version; it decompresses at 126,228 cycles per frame, which is 16.4 cycles/byte (so about 9.5 fps on PAL, 9.0 fps on NTSC).

 

It could be optimized a little more, but I doubt more than about 15%.

lz4vid8.xex

lz4vid8.asm


 

Well, 15.3 KB / 55 ≈ 280 bytes per frame, which is what I was seeing as well...

 

 

 

No:

compressed data: 15 KB

decompressed: 35147 bytes

so: 35147 / 55 = 639 bytes per frame

