Now Reading
FC8 – Quicker 68K Decompression

FC8 – Quicker 68K Decompression

2023-09-11 13:19:34


Knowledge compression is enjoyable! I’ve written a brand new compression scheme that’s designed to be as quick as potential to decompress on a 68K CPU, whereas nonetheless sustaining a good compression density. I’m calling it FC8, and you will get the generic C implementation and optimized 68K decompressor from the project’s Github repository. I’ve most likely reinvented the wheel with this, and in a non-optimal method too, however as soon as I began I discovered that I couldn’t cease. My meant platform for FC8 is 68020 and 68030-based classic Macintosh computer systems, however it needs to be simply moveable to different basic computer systems, microcontrollers, and comparable minimal techniques.

The primary loop of the 68K decompressor is precisely 256 bytes, so it matches completely inside the instruction cache of the 68020/030. Decompression velocity on a 68030 is about 25% as quick as an optimized memcpy of uncompressed knowledge, which is actually an unrolled loop of 4-byte transfer.l directions with no computation concerned. In comparison with that, I believe 25% is fairly good, however I can all the time hope for extra. ????

Within the previous post, I described how I used to be utilizing compression to squeeze a bigger rom-disk picture right into a customized substitute Macintosh ROM that I’m designing. I started with a compression algorithm referred to as LZG, written by Marcus Geelnard. It labored properly, however the 68K decompression appeared disappointingly gradual. I attempted to contact the writer to debate it, however couldn’t discover any e mail tackle or different contact data, so I finally drifted in the direction of creating my very own compression methodology loosely primarily based on LZG. This grew to become FC8. On a 68030 CPU, FC8 compresses knowledge equally as tightly as LZG and decompresses 1.5x to 2x quicker. FC8 retains a lot of the compression acceleration code from LZG, in addition to the thought of quantizing lengths, however the encoding and decompressor are new.

The algorithm is predicated on the basic LZ77 compression scheme, with a 128K sliding historical past window and with duplicated knowledge changed by (distance,size) backref markers pointing to earlier cases of the identical knowledge. No further RAM is required throughout decompression, apart from the enter and output buffers. The compressed knowledge is a collection of tokens on this format:

  • LIT = 00aaaaaa = subsequent aaaaaa+1 bytes are literals
  • BR0 = 01baaaaa = backref to offset aaaaa, size b+3
  • EOF = 01×00000 = finish of file
  • BR1 = 10bbbaaa’aaaaaaaa = backref to offset aaa’aaaaaaaa, size bbb+3
  • BR2 = 11bbbbba’aaaaaaaa’aaaaaaaa = backref to offset a’aaaaaaaa’aaaaaaaa, size lookup_table[bbbbb]

The encoding might look barely unusual, reminiscent of solely a single bit for the backref size in BR0, however this produced the most effective ends in my testing with pattern knowledge. The size lookup desk allows encoding of backrefs as much as 256 bytes in size utilizing solely 5 bits, although some longer lengths can’t be encoded straight. These are encoded as two successive backrefs, every with a smaller size.

The most important conceptual adjustments vs LZG had been the introductions of the LIT and EOF tokens. EOF eliminates the necessity to test the enter pointer after decoding every token to find out if decompression is full, and speeds issues up barely. LIT allows an entire block of literals to be rapidly copied to the output buffer, as a substitute of checking each to see if it’s a backref token. This speeds issues up considerably, but additionally swells the dimensions of the information. Within the worst case, a single literal would encode as 1 byte in LZG however 2 bytes in FC8, making it twice as costly! All the opposite adjustments had been wanted to cancel out the compression bloat launched by the LIT token, with the tip consequence that FC8 compresses equally as compactly as LZG. Each compressed my pattern knowledge to about 63% of unique dimension.

The 68K decompressor code could be seen here.

Decompression on the Fly

A number of folks talked about the opportunity of on-the-fly decompression, because the meant use is a compressed disk picture. That’s one thing I plan to discover, however it’s not so simple as it may appear at first. Disk sectors are 512 bytes, however there’s no option to decompress a particular 512 byte vary from the compressed knowledge, because the entire compression scheme depends upon having 128K of prior knowledge to attract on for backref matches. You possibly can compress your entire disk picture as a collection of separate 512 byte blocks, however then the compression density would go to hell. A greater resolution would compress your entire disk picture as a collection of bigger blocks, possibly 128K or slightly smaller, after which design a caching scheme to maintain observe of whether or not the block containing a selected sector had been already decompressed and obtainable. This might nonetheless have a detrimental affect on the compression density, and it will make disk I/O slower, however would most likely nonetheless be OK.

Finally I believe the 2 decompression approaches every have strengths and weaknesses, so your best option depends upon the necessities.

Boot-Time Decompression:
Professionals: Greatest compression density, quickest I/O speeds as soon as the disk picture is decompressed
Cons: 5-10 second await decompression at boot time, requires sufficient RAM to carry your entire disk picture

On-the-Fly Decompression:
Professionals: No wait at boot time, required quantity of RAM is configurable (dimension of the decompressed block cache)
Cons: Worse compression density, slower I/O speeds, extra advanced implementation
{Hardware} Exams

I found {that a} Macintosh IIci in 8-bit shade mode decompresses about 20% slower than in 1-bit shade mode. However a IIsi decompresses on the similar velocity whatever the shade settings. Each machines are utilizing the built-in graphics {hardware}, which steals some reminiscence cycles from the CPU with a purpose to refresh the show. I’m unsure why solely the IIci confirmed a dependence on the colour depth. Each machines needs to be quicker when utilizing a discrete graphics card, although I didn’t check this.

The unique LZG compression confirmed a a lot greater velocity distinction between the IIci and IIsi, nearer to a 50% distinction, which I assumed was as a result of 32K cache card within the IIci in addition to its quicker CPU. It’s not clear why the discrepancy is smaller with FC8, or whether or not it means the IIci has gotten worse or the IIsi has gotten higher, comparatively talking. In comparison with the identical machine with the LZG compression, FC8 is 1.57x quicker on the IIci and 1.99x quicker on the IIsi. Primarily based on checks below emulation with MESS, I used to be anticipating a 1.78x speedup.

See Also


Whereas engaged on this, I found many locations the place compression compactness could possibly be traded for decompression velocity. My first try at FC8 had a minimal match dimension of two bytes as a substitute of three, which compressed about 0.7% smaller however was 13% slower to decompress as a result of bigger variety of backrefs. On the different excessive, the introduction of a LIT token with out some other adjustments resulted within the quickest decompression velocity of all, about 7% quicker than FC8, however the compressed information had been about 6% bigger, and I made a decision the tradeoff wasn’t value it.

I explored many different concepts to enhance the compression density, however every thing I considered proved to have solely a tiny profit at finest, not sufficient to justify the affect on decompression velocity. An algorithm primarily based on one thing aside from LZ77 would seemingly have compressed considerably extra densely, or say a mixture of LZ77 and Huffman coding. However decompression of LZ77-based strategies are far simpler and quicker to implement.

Compression Heuristics

It will definitely grew to become apparent to me that defining the token format doesn’t inform you a lot about the best way to finest encode the information in that format. A grasping algorithm appeared to work pretty properly, in order that’s what I used. At every level within the uncompressed knowledge, the compressor substitutes the most effective match it may well make (if any) between that knowledge and former knowledge within the historical past window.

Nonetheless, there are some examples the place selecting a non-optimal match would permit for a good higher match later, leading to higher total compression. This may occur because of quirks within the quantizing of match lengths, or with lengthy runs of repeated bytes that are solely partially matched within the earlier knowledge. It’s a bit like sacrificing your queen in chess, and generally you want to settle for a short-term penalty with a purpose to notice a long-term profit. Higher compression heuristics that took this into consideration might most likely squeeze one other 1% out of the information, with out altering the compression format or the decompressor in any respect.


Source Link

What's Your Reaction?
In Love
Not Sure
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top