Inflate/Deflate: Improving the speed of decompression



I recently saw that "Inflate.asm" has been downsized so it can inflate into RAM faster and the routine uses less space. I have done a few things with it over the years to see if it can inflate any faster, but this program is outside my normal knowledge when it comes to compression: many times when I experimented with an optimization, I got garbage loaded into RAM. I did see that it was originally inefficient with CPU usage and had many extra JSR calls that could be removed. Has anybody else done anything with this program?

 

Hmmmm, whenever the inflate routine was used in an A8 program, it was stated: "inflate routine by FoX", so you most likely have to ask FoX about it... ;-) His real-world name is Piotr Fusik and you can find some code and info on GitHub:

 

https://github.com/pfusik/zlib6502

https://github.com/pfusik/zlib6502/commit/3faede80c34b7a060715873e4834d1fe3acebbfa

 

And no, I am not a programmer, so I never do anything with ASM sources (or any other programming sources)...

 

I managed to improve the speed further inside the getBits_loop by branching out only when getBit_buffer equals 0, since a branch takes an extra cycle when taken.
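
To make that concrete, here is a minimal sketch of the idea (illustrative labels, not necessarily the exact zlib6502 ones): order the routine so the common case, buffer not yet empty, falls through the branch, and only the rare refill path takes it.

getBit
  lsr getBit_buffer      ; shift the next bit into carry
  beq getBit_refill      ; buffer ran dry: take the branch only on this rare path
  rts                    ; common case: an untaken branch is 2 cycles, not 3
getBit_refill
  pha                    ; preserve A for the caller
  ldy #0
  lda (inputPointer),y   ; fetch the next compressed byte
  inc inputPointer
  bne getBit_store
  inc inputPointer+1
getBit_store
  sec                    ; sentinel bit will mark the end of this byte
  ror a                  ; first data bit -> carry, sentinel -> bit 7
  sta getBit_buffer
  pla
  rts

It only saves one cycle per bit in the common case, but getBit sits in the innermost loop, so it adds up.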

 


I see many things that could be done. It mainly depends on how much space you want to trade for speed.

The getBits_loop could be unrolled completely (max 8 times), getting rid of 12 JSR/RTS cycles per bit.

Also, store_byte should be moved to zero page with absolute addressing, to get rid of the (),Y indirection and the LDY.
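
A sketch of that store_byte idea (names made up for illustration): copy a tiny store routine into zero page and let its own absolute-address operand serve as the output pointer, so each store is a plain STA abs plus an in-place increment instead of LDY #0 / STA (outputPointer),Y.

storeByte                ; copied into zero page at init time
  sta $FFFF              ; $FFFF is a placeholder; the operand bytes act as the output pointer
  inc storeByte+1        ; bump the low byte of the embedded address
  bne storeByte_done
  inc storeByte+2        ; carry into the high byte
storeByte_done
  rts

The stub is self-modifying and must sit in RAM (zero page), but the inflate code that calls it can still live in ROM.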


The speed of decompression has been on my mind the last few weeks, on and off.

 

Here's a situation that I wonder about and maybe some of you have some rough ideas here:

 

- If I have 5kb of compressed data on a standard Atari disk

- Let's say there is around 8kb of data overall once decompressed

- Standard 1050 being used....

 

If you look at the total time taken to load and decompress, is it quicker to have it all on disk uncompressed and just do a straight load, or is it quicker overall to load the compressed data and then decompress it? (The theory being that since it is then in memory, this should be quicker.)

 

And how would this fare if we were using cartridge instead?

 

I know, I could test all this out, but it's not something I'm going to spend ages setting up just to get an answer :)


With what I am doing, it is approaching 8kb per second. But as said, the data needs to be loaded into one part of RAM and expanded into another part. After expanding, you can clear the original area where the compressed data was and load something else there. Compression is useful for cartridge ROMs, where the smaller and less expensive chips can be used instead of something bigger that costs more for everyone.


Here's a situation that I wonder about and maybe some of you have some rough ideas here:

 

- If I have 5kb of compressed data on a standard Atari disk

- Let's say there is around 8kb of data overall once decompressed

- Standard 1050 being used....

 

If you look at the total time taken to load and decompress, is it quicker to have it all on disk uncompressed and just do a straight load, or is it quicker overall to load the compressed data and then decompress it? (The theory being that since it is then in memory, this should be quicker.)

This is a classic linear optimization problem :)

 

You need to load the decompressor plus compressed data from your storage media and then decompress it. So loading speed is one factor and decompression speed and size of the decompressed image another.

 

Let's do some ballpark calculations. Assuming compressed size is 5k, decompressed 8k and decompressing takes about 1 second. Also let's assume the decompressor code is about 0.5k.

 

On a stock 1050 loading speed will be about 1.5k/sec.

 

Loading an 8k uncompressed block will take 5.3 seconds.

 

Loading 5k compressed data plus 0.5k decompressor will take 5.5/1.5 = 3.7 seconds.

 

So total time for the compressed case will be 4.7 sec (3.7 sec loading plus 1 sec uncompressing) which is about half a second faster than uncompressed.

 

On a highspeed SIO drive like a Happy/Speedy loading speed is about 5k/sec.

 

Uncompressed loading will take 1.6 sec, compressed loading 1.1 sec; plus the 1 sec decompressor time that's a total of 2.1 sec - meh, that's half a second slower than uncompressed loading.

 

With PBI hard drives loading speed will be (about) 50k/sec.

 

Uncompressed loading: 0.16 sec, compressed 0.11 + 1sec = 1.11 sec - wow, uncompressed loading now is roughly 7 times as fast as compressed version.

 

So, to sum it up: while you'll have a slightly (~10%) reduced loading time on slow, stock storage media, compressing data increases loading time quite a bit (even drastically) on faster storage media.
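
Condensing this reasoning into one break-even formula (my summary, not part of the original post): with transfer speed v, uncompressed size Su, compressed size plus decompressor Sc, and decompression time td, compression pays off exactly when

  Sc/v + td < Su/v,  i.e.  td < (Su - Sc)/v

Plugging in the numbers above: the decompression budget is (8 - 5.5)/1.5 ≈ 1.7 sec on a stock 1050, but only (8 - 5.5)/5 = 0.5 sec on a Happy/Speedy, which is why the 1 sec decompressor wins in the first case and loses in the second.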

 

And this is why I'm personally not a fan of games, programs, ... being distributed as compressed EXE files. If I feel the need to compress these I usually can do it quite easily myself with the tools out there. But getting a decompressed EXE from a compressed one is messy.

 

And how would this fare if we were using cartridge instead?

If you are programming your software to run from cartridge (like, for example, Space Harrier) you'll most certainly use the cart memory directly. Swapping in an 8k bank takes a few CPU cycles and you'll also have plenty of space available. The smallest chips you'd currently buy are 512k, which cost about $1.50 (39SF040). 4MB chips (S29GL032) cost about the same, but then you also need a level shifter on your cart as these are 3.3V devices. If you need even more memory you can get a 128MB chip for $10 (S29GL01, that's the chip we used on The!Cart).
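
To illustrate how cheap that bank swap is (a hypothetical example; the exact scheme depends on the cartridge type): many banked carts select a bank simply by touching an address in the $D5xx cartridge control area, which is a single store.

  ldx bankNumber         ; bank to map in (hypothetical variable)
  sta $D500,x            ; on such carts the accessed address, not the value,
                         ; selects the bank: one absolute,X store = 5 CPU cycles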

 

The only reason I can see for using compression with cart memory is when you've got an older cart sitting around somewhere and you want to squeeze in a little bit more (at rather drastic speed costs). If you design a new game from scratch, better not do that, you'll be limiting yourself too much - look at the other available options and spend your time on creative stuff like adding more graphics, additional levels and such.

 

so long,

 

Hias


I looked at the inflateCodes_copyByte loop section this morning, trying to figure out why its length is something called "inflateCodes_lengthMinus2", and wondering if we can just do a single lda (ndx0),y / sta (ndx1),y / dey setup and remove the whole JSR there.


The only reason I can see for using compression with cart memory is when you've got an older cart sitting around somewhere and you want to squeeze in a little bit more (at rather drastic speed costs). If you design a new game from scratch, better not do that, you'll be limiting yourself too much - look at the other available options and spend your time on creative stuff like adding more graphics, additional levels and such.

Exactly so. I found myself using compression in cart-based software the other week solely because I was constrained to 16KB of a 512KB flash chip. Were it not for these pre-ordained architectural constraints, I would simply have grabbed as many banks as I needed.

Edited by flashjazzcat

I looked at the inflateCodes_copyByte loop section this morning, trying to figure out why its length is something called "inflateCodes_lengthMinus2", and wondering if we can just do a single lda (ndx0),y / sta (ndx1),y / dey setup and remove the whole JSR there.

 

This is due to two reasons: because of the way that Deflate splits copy lengths into a base Huffman-encoded length plus extra length bits, and because it is necessary to fit the maximum length of 258 bytes into an 8-bit counter.
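
For reference, this is how that split looks in the Deflate format (RFC 1951): a match length of 3 to 258 is coded as a Huffman length symbol plus 0 to 5 extra bits, for example:

  symbol 257          -> length 3        (0 extra bits)
  symbol 269 + 2 bits -> lengths 19-22   (base 19 plus the extra bits)
  symbol 284 + 5 bits -> lengths 227-257
  symbol 285          -> length 258      (0 extra bits)

The decoder computes length = base + extra, and since 258 does not fit in one byte, storing length minus 2 (1 to 256, with 256 wrapping to 0) is one way to keep the counter in 8 bits.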

Edited by phaeron

Loading 5k compressed data plus 0.5k decompressor will take 5.5/1.5 = 3.7 seconds.

 

 

I loved your post above. Thanks for that.

 

I would probably ignore the time taken to load the decompressor, as in my use case (with what I have in mind) all of that will be loaded in advance, and then as different situations occur in the game I would load in other parts. The idea is that it doesn't matter too much how long the initial load takes, but loading further parts of the game needs to be quick. The situation is that I released a game a long time ago in Turbo Basic which loads a program and then loads another in a chain. I want to slowly start converting it to assembly language, with it loading the engine for the game, then the engine for the other parts, then the menu system. There's too much to have it all in memory at once.


 

This is due to two reasons: because of the way that Deflate splits copy lengths into a base Huffman-encoded length plus extra length bits, and because it is necessary to fit the maximum length of 258 bytes into an 8-bit counter.

This can be optimized in two different ways. One is to make inflateCodes_length a two-byte counter, lo and hi, and just use the Y register, decrementing the hi byte when Y hits 0. The second is to keep the inflate code lengths in the original location where the deflate table was loaded into RAM and set some table pointers there. I have to look at the code tomorrow when I am off from work.
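
A sketch of the first option (labels illustrative): copy whole 256-byte pages first, then the remainder, so the 16-bit count lives in length_hi/length_lo and the per-byte work stays 8-bit.

  ldy #0
  ldx length_hi          ; number of complete 256-byte pages
  beq copyRest
copyPage
  lda (sourcePointer),y
  sta (outputPointer),y
  iny
  bne copyPage
  inc sourcePointer+1    ; Y wrapped to 0: advance both pointers one page
  inc outputPointer+1
  dex
  bne copyPage
copyRest
  cpy length_lo          ; then copy the remaining 0-255 bytes
  beq copyDone
  lda (sourcePointer),y
  sta (outputPointer),y
  iny
  bne copyRest
copyDone

The page-pointer bookkeeping is paid once per 256 bytes instead of testing a 16-bit counter on every byte.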


This version of Inflate modifies the LDA (inflateCodes_sourcePointer),Y / STA (outputPointer),Y loop to just INY and compare Y against inflateCodes_length, looping for the number of bytes it needs to copy; it adds inflateCodes_length to outputPointer afterwards. With this change, a 27K binary decompresses in about 3.0 seconds, versus 3.5 seconds before the modification. The most recent version from FoX takes about 4 seconds. I am not posting any versions that use self-modifying code, since those force the routine to execute from RAM.
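
Roughly, the described change looks like this (my reconstruction of the idea, not the posted file; it assumes the 256+ case is handled separately):

  ldy #0
copyLoop
  lda (inflateCodes_sourcePointer),y
  sta (outputPointer),y
  iny
  cpy inflateCodes_length  ; loop for exactly length bytes (1-255 here)
  bne copyLoop
  clc                      ; afterwards advance outputPointer by the length
  lda outputPointer
  adc inflateCodes_length
  sta outputPointer
  bcc lengthAdded
  inc outputPointer+1
lengthAdded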

Edited by peteym5

There was a slight glitch I discovered while decompressing larger blocks, when calculating the offset before copying duplicated memory. I was pushing the processor status before getting the high byte, and there is a compare of the X register to 10; it appears to be an issue with what the carry flag needs to be. I need to research it and see if I can fix the problem.

Edited by peteym5

I do not intend to keep bumping this thread, but I found another problem with the prior version of inflate, where it totally stopped working because I accidentally deleted a CLC instruction. I also fixed another case of incorrect inflating: when it tried to copy runs over 255 bytes in length, it advanced the pages forward at the wrong time. This update gets everything inflated 100% where it is supposed to be.

inflate_ver3.asm

Edited by peteym5

A slight change to save a little bit of room. Right after "inflateCodes_setOffsetHighByte", when it needs to add 2 to the length, we just compare to 3. If the result is less, the copy length wrapped past 255 (256 to 258) and we need to loop more than 255 times.

  LDX #0                        ; X = high byte of the copy count
  LDA inflateCodes_length       ; stored length (biased by -2)
  CLC
  ADC #2                        ; restore the true length (low byte)
  CMP #3
  BCS Set_Adjusted_Copy_Length  ; >= 3: the length fits in one byte
Set_Copy_Page_2                 ; wrapped past 255: copy length is 256-258
  INX                           ; count one extra full page
  STA inflate_t0                ; remaining 0-2 bytes after that page
  LDA #0

I am not posting any versions that use self-modifying code. What you see can work from either ROM or RAM.

inflate_ver3.asm


The speed of decompression has been on my mind the last few weeks, on and off.

 

Here's a situation that I wonder about and maybe some of you have some rough ideas here:

 

- If I have 5kb of compressed data on a standard Atari disk

- Let's say there is around 8kb of data overall once decompressed

- Standard 1050 being used....

 

If you look at the total time taken to load and decompress, is it quicker to have it all on disk uncompressed and just do a straight load, or is it quicker overall to load the compressed data and then decompress it? (The theory being that since it is then in memory, this should be quicker.)

 

And how would this fare if we were using cartridge instead?

 

I know, I could test all this out, but it's not something I'm going to spend ages setting up just to get an answer :)

 

I thought about the idea of decompressing while it is loading: load each sector into a buffer and decompress it to its final memory location. Decompress straight from disk.

 

 

This is due to two reasons: because of the way that Deflate splits copy lengths into a base Huffman-encoded length plus extra length bits, and because it is necessary to fit the maximum length of 258 bytes into an 8-bit counter.

I figured out this 258 counter: it is actually 3 to 258 bytes, duplicating a repeated pattern. What I did was add 2 to the stored counter and have the Y register count up to its size. If it is more than 255, it pushes the amount onto the stack and copies one page plus the remaining 1 to 2 bytes.

 

I applied the speed-up changes to the version made by Fox, which I am calling "Inflate 2017" because he posted his changes this year.

 

I am benchmarking it on a 27k binary file used for a game. It started out taking over 4 seconds; the latest modification gets it just under 3 seconds.

 

I may start looking at what I can do to speed up the preparation steps leading into the routine's loops, which seem to take up a fair amount of time.

inflate_2017_ver4.asm

Edited by peteym5

  • 1 month later...

Interesting topic, I need to visit this forum more often. :)

 

HiassofT's calculations are interesting, but for me a 1.11 sec program loading time feels just as good as 0.16 sec, not "seven times slower", and most Atari users don't have that fast storage. The data we compress now is usually bigger than 8k, so a compressed program takes several seconds less to load from a standard drive, which does make a difference for me.

 

What's more important, disk and memory capacities are limited. For example, Numen wouldn't fit on 2*130k floppy and in 320k RAM without compression. Did you notice there's background decompression running in the demo and it's the old "inefficient" code?

 

Obviously, I optimized "inflate" for size and not speed. Nevertheless, it's always been much faster than the old-school "crunchers" which decompressed with dots and bars on screen for several seconds.

 

peteym5, I'm curious about a more accurate time comparison. I've reviewed your code. I like the way you handled 256-byte sequences. :) Some comments:

 

bcc inflateCodes_loop
inflateCodes_loop
This bcc is pointless.

 

 
sta inflateCodes_sourcePointer
  php    
sta inflateCodes_sourcePointer

One sta is enough.

 

clc
inflateCodes_setOffsetHighByte

You can remove this clc now.

 

  ;TYA
  lda inflateCodes_length

Why not use tya instead of the lda?

Edited by fox

Well,

 

I would normally not compress an 8k program if the compressed program is still 5k - except for one case: the program must fit on a certain disk or disk magazine (disk space is limited there, so every kbyte counts)...

 

Although I do have most loading devices (SIO2PC, SIO2SD, SIO2USB, IDE+/IDE-2, Ultimate Cart, etc.) available, I still use my 1050 with Speedy upgrade very often. And there are dozens if not hundreds of A8 users out there who still use their floppy drive. So when downloading a 190kbyte file like e.g. Laura, Unmec, etc. in uncompressed form, I could load it with some XEX loader from SIO2SD or something similar. But instead I compress the file with Exomizer to something like 85kbytes, which a) can easily be loaded from a real diskette and b) surely loads and decompresses faster than a 190k file would load on a 1050 drive (would, because a 190k file would not even fit on a 180k disk, the maximum a 1050 can handle).
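
For scale, using HiassofT's ~1.5k/sec figure for a stock 1050 from earlier in the thread (and setting aside that 190k wouldn't fit on the disk at all): 190k would be roughly 190/1.5 ≈ 127 seconds of pure transfer, versus about 85/1.5 ≈ 57 seconds for the compressed file, leaving around 70 seconds of headroom for decompression, so the compressed version wins comfortably.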

 

With TIP Animator I often had animations that were 300k (or more) in length and could be compressed to less than 50%, sometimes even less than 33%, of the original size. Some years ago a playable demo of GTIA Blast was released with an uncompressed length of more than 400kbytes (the demo requires 1088k RAM); compressed with Exomizer it was only 105kbytes (and still required 1088k RAM). So we would need some new speed calculations - especially for floppy drives - where compressed files have only 50% (or 33%, or 25%) of the size of the uncompressed file...


Interesting topic, I need to visit this forum more often. :)

 

HiassofT's calculations are interesting, but for me a 1.11 sec program loading time feels just as good as 0.16 sec, not "seven times slower", and most Atari users don't have that fast storage. The data we compress now is usually bigger than 8k, so a compressed program takes several seconds less to load from a standard drive, which does make a difference for me.

 

What's more important, disk and memory capacities are limited. For example, Numen wouldn't fit on 2*130k floppy and in 320k RAM without compression. Did you notice there's background decompression running in the demo and it's the old "inefficient" code?

 

Obviously, I optimized "inflate" for size and not speed. Nevertheless, it's always been much faster than the old-school "crunchers" which decompressed with dots and bars on screen for several seconds.

 

peteym5, I'm curious about a more accurate time comparison. I've reviewed your code. I like the way you handled 256-byte sequences. :) Some comments:

 

bcc inflateCodes_loop
inflateCodes_loop
This bcc is pointless.

 

 
sta inflateCodes_sourcePointer
  php    
sta inflateCodes_sourcePointer

One sta is enough.

 

clc
inflateCodes_setOffsetHighByte

You can remove this clc now.

 

  ;TYA
  lda inflateCodes_length

Why not use tya instead of the lda?

 

Thank you, I will try these things and see if it still works.

 

My main use for compression is for ROM cartridges that need to load blocks into RAM: fonts, data, text, and code that uses self-modifying techniques.

An issue with some of the larger bank-switching cartridges is that the extra logic chips add to the cost, and some EPROM cartridges can be harder to obtain. I have started using flash cartridges for larger games now, like Tempest Elite.

However, imagine what you could do with a 1024K AtariMax cartridge and compressed data. That would be a very serious game. I estimated I could make a game like Secretum Labyrinth with 25,000 different screens (rooms), hundreds of different monsters, and hundreds of different puzzles to challenge you, and use full Graphics 15 images to show things off. Lots of possibilities.

