GZIP algorithm in FPGA

Hi,

I tried to find some references for lossless compression algorithms implemented in an FPGA. I found a lot about bitstream compression but (nearly) nothing about FPGA implementations of lossless compression algorithms.

Which lossless compression algorithms are suited to FPGA implementation?

Thank you.

Reply to
lenz

Some years ago I had a need for a losslessly compressed database, and found some source code for the Lempel-Ziv-Welch (LZW) compression program. I extracted the core code from it and inserted it into the database read-write program, and got it working. I didn't dwell on the details, but it looked pretty simple to me. It built a 'dictionary' of multi-byte patterns it had seen, and assigned single-byte and then later 9, 10, 11, 12-etc. bit codes for these patterns. These functions seem like they could be done with an FPGA and an external RAM to hold the dictionary. (The dictionary is written in the output stream one entry at a time during compression, and then rebuilt when decompressing.)
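Jon's description maps onto a fairly small piece of code. Here is a minimal LZW encoder sketch (my own illustration, not the source Jon used): the dictionary is just an array of (prefix code, appended byte) pairs, which in an FPGA could live in that external RAM. A real coder would also pack the codes into the growing 9..12-bit widths; this sketch stops at emitting integer codes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DICT 4096          /* cap at 12-bit codes */

/* Dictionary entry i (i >= 256) is the string for dict_prefix[i]
   with dict_byte[i] appended; codes 0..255 are single-byte seeds. */
static int dict_prefix[MAX_DICT];
static unsigned char dict_byte[MAX_DICT];
static int dict_size;

/* Return the code of (prefix, c) if present, else -1 (linear search;
   this is the lookup an FPGA would do with RAM/CAM structures). */
static int lzw_find(int prefix, unsigned char c) {
    for (int i = 256; i < dict_size; i++)
        if (dict_prefix[i] == prefix && dict_byte[i] == c)
            return i;
    return -1;
}

/* Compress in[0..n) into out[], returning the number of codes emitted. */
static size_t lzw_encode(const unsigned char *in, size_t n, int *out) {
    size_t nout = 0;
    int cur = -1;                       /* code of the current match */
    dict_size = 256;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        int next = (cur < 0) ? c : lzw_find(cur, c);
        if (next >= 0) {
            cur = next;                 /* extend the current match */
        } else {
            out[nout++] = cur;          /* emit longest match seen */
            if (dict_size < MAX_DICT) { /* learn the pattern (cur, c) */
                dict_prefix[dict_size] = cur;
                dict_byte[dict_size] = c;
                dict_size++;
            }
            cur = c;                    /* restart match at c */
        }
    }
    if (cur >= 0)
        out[nout++] = cur;
    return nout;
}
```

Encoding "ABABABA" emits 'A', 'B', then code 256 ("AB") and code 258 ("ABA"), showing how the dictionary grows one entry per emitted code, exactly the property that lets the decompressor rebuild it.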

You should be able to find some source code for LZW somewhere.

Jon

Reply to
Jon Elson

Since you mentioned GZIP, I will start with Lempel-Ziv (LZ) based compression schemes:

Huang, Saxena and McCluskey "A Reliable Compressor on Reconfigurable Coprocessors"

formatting link

Jung and Burleson. "Efficient VLSI for Lempel-Ziv Compression in Wireless Data Communication Networks"

formatting link

It should be noted that the systolic array implementation described in the papers would violate the Stac patents. Basically, Stac has patented the notion of using a shift register to store the history buffer. Though I personally believe the concept of storing the dictionary in a shift register to be rather obvious, we have opted not to use a shift-register implementation so as to not risk litigation.

We are in the process of completing an LZ-based compression core that is very similar to LZS, but does not violate the patents. The performance in a Xilinx V2Pro is expected to be on the order of 100M bytes/second, with a 512-byte dictionary. We expect to release this core for Xilinx V2/V2Pro/S3 and Altera Stratix by the end of March.

These papers should at least get you started. Large CAMs are not a particularly FPGA-friendly structure, but similar structures can be effectively implemented.

For decompression, sequential implementations will be just as fast as parallel ones, as decompression only needs to read one location from the dictionary per output token. For compression, a parallel implementation will allow for processing up to 100M tokens per second; a sequential approach will be much slower.

LZ compression is a very good approach for "generic" data, in areas such as storage or networking. Since you did not mention what sort of data you were going to compress, I will make brief mention of non-LZ approaches.

IBM's ALDC is something worth studying to see an alternative approach to compression. As far as patents go... the adaptive nature of the algorithm is really quite novel, and dare I say mathematically quite cool.

If it is images, one is much better off going with an algorithm that is tailored to this sort of data set, for example LOCO/JPEG-LS for continuous-tone images. We are wrapping up an implementation of this algorithm, and our first hardware implementation is running at 35M 12-bit pixels/sec using ~65% of an XC2V1000-4. The challenge with this algorithm is in the pipelining, as each pixel prediction, error calculation, and statistics update must be completed prior to processing the next pixel. We expect the final implementation to be quite a bit faster and slightly smaller.

One of the challenges faced implementing compression algorithms in hardware is that the algorithms are most often specified in the standards documents in C-pseudocode. So implementation often requires translating a substantially serial implementation into a pipelineable design.

The more significant challenge with implementing these algorithms in hardware is that each pixel/token is completely dependent upon the error and updated statistics from the previous pixel/token. If you can identify what this minimal loop is, and put all other logic into pipelines before and after this loop, your performance will be limited solely by the minimal loop. Most of these algorithms look at minimizing the total compute required, rather than minimizing the complexity of this minimal loop. The former will dictate software performance, the latter hardware, in optimal implementations. As an example, JPEG-LS in near-lossless mode is only marginally slower in software, but about 50% as fast when implemented in hardware, each as compared to JPEG-LS in lossless mode.
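To make the JPEG-LS dependency concrete: the prediction step itself is the standard MED (median edge detector) function from LOCO-I, shown below. The predictor is combinational and cheap; it is the bias correction and context-statistics update feeding back into the next pixel (not shown here) that forms the minimal loop limiting the clock rate.

```c
#include <assert.h>

/* MED predictor from LOCO-I/JPEG-LS.
   a = left neighbour, b = above, c = above-left. */
static int med_predict(int a, int b, int c) {
    int mn = a < b ? a : b;
    int mx = a < b ? b : a;
    if (c >= mx) return mn;   /* edge above: clamp to the smaller neighbour */
    if (c <= mn) return mx;   /* edge to the left: clamp to the larger one */
    return a + b - c;         /* smooth region: planar interpolation */
}
```

With a horizontal edge (c at or above both neighbours) it predicts the smaller neighbour, with a vertical edge the larger, and otherwise falls back to the plane through a, b, c.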

I hope these comments at least offer a brief introduction. I also hope that my comments were not too commercial in nature, having mentioned two cores that we are getting ready to release.

Regards, Erik Widding.

--
Birger Engineering, Inc. -------------------------------- 617.695.9233
100 Boylston St #1070; Boston, MA 02116 -------- http://www.birger.com
Reply to
Erik Widding

If you know some information about the type of data in advance (e.g. English-language text), then Huffman coding could be a quick and small solution. It could, for example, easily be done with a lookup table.
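As a sketch of the lookup-table idea: with a fixed, precomputed code table, encoding is just a table read plus a shift per symbol, which is exactly a ROM lookup in an FPGA. The codes below are hypothetical (a toy 4-symbol alphabet), not from any standard table.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t bits; int len; } HuffEntry;

/* Hypothetical prefix-free codes: symbol 0 is assumed frequent
   (1 bit), the rarer symbols get longer codes. */
static const HuffEntry table[4] = {
    { 0x0, 1 },   /* symbol 0: "0"   */
    { 0x2, 2 },   /* symbol 1: "10"  */
    { 0x6, 3 },   /* symbol 2: "110" */
    { 0x7, 3 },   /* symbol 3: "111" */
};

/* Append each symbol's code to a bit accumulator (MSB first).
   Returns the total number of bits produced. */
static int huff_encode(const int *sym, int n, uint64_t *out_bits) {
    uint64_t acc = 0;
    int nbits = 0;
    for (int i = 0; i < n; i++) {
        acc = (acc << table[sym[i]].len) | table[sym[i]].bits;
        nbits += table[sym[i]].len;
    }
    *out_bits = acc;
    return nbits;
}
```

Encoding the symbols 0, 1, 0, 3 yields the bit string "0 10 0 111", i.e. 7 bits where fixed 2-bit symbols would need 8.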

Florian

--
int m,u,e=0;float l,_,I;main(){for(;1840-e;putchar((++e>907&&942>e?61-m:u)
["\t#*fg-pa.vwCh`lwp-e+#h`lwP##mbjqloE"]^3))for(u=_=l=0;79-(m=e%80)&&
Reply to
Florian-Wolfgang Stock

The part with the email address is obvious; the thing around it is called the Mandelbrot set (in German also called "Apfelmaennchen"). And no, I haven't implemented it on an FPGA yet (but it's just some complex-number calculation, so it should not be difficult :)

Florian

Reply to
Florian-Wolfgang Stock

Good info Erik. How does the above performance compare to a 2 or 3 GHz 64-bit processor?

Steve

Reply to
Steve Casselman

Here is an implementation from 2000.

formatting link
It was a student project for my lab course.

Kolja Sulimma

Reply to
Kolja Sulimma

Something exactly like that is what I had in mind when I mentioned it. The restriction of needing advance information could be circumvented by generating the lookup table dynamically (the drawback here is that you need two passes over the data you want to code).

Florian

Reply to
Florian-Wolfgang Stock

Isn't there a variation that generates the table on the fly? I don't know any details, but wouldn't they have this same problem in modem compression, where you can only see the data once? Or do they buffer up a block before compressing? Back in the early days, an engineer talked to me about the possibility of patenting a method that worked like this.

--
Rick "rickman" Collins

rick.collins@XYarius.com
Reply to
rickman

That isn't possible, at least not with normal Huffman coding. With the ZIP algorithm it is possible to build a dictionary on the fly, so you need only one pass. Huffman coding is based on replacing often-appearing characters with short bit strings and seldom-appearing ones with long bit strings. To know which characters are frequent or rare, you need to count them. There are some ways to reduce this problem; most often a precomputed dictionary is used, based on the character distribution of average English text. This works well if you want to encode normal text (or, if you know what to expect, do a distribution analysis on some sample data and build your dictionary from it).

Another approach could be to use an internal buffer of a given size, and build your dictionary on just that buffer's distribution.
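Either way, the first pass over the text (or over the buffer) is just a frequency count; the code table is then built so that high counts get short codes. A minimal sketch of that counting pass:

```c
#include <assert.h>
#include <stddef.h>

/* First pass of a two-pass (or buffered) Huffman coder: count how
   often each byte value occurs, so short codes can be assigned to
   the frequent values in the second pass. */
static void count_freqs(const unsigned char *buf, size_t n,
                        unsigned freq[256]) {
    for (int i = 0; i < 256; i++)
        freq[i] = 0;
    for (size_t i = 0; i < n; i++)
        freq[buf[i]]++;        /* one table update per input byte */
}
```

In hardware this is one block-RAM read-modify-write per byte, which is why the buffered variant maps well to an FPGA.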

Florian

Reply to
Florian-Wolfgang Stock

Reply to
Ray Andraka

I find this to be a very stupid, and irrelevant, question. Not only is this heavily affected by the word size of the CPU; a 2 GHz general-purpose CPU is going to have (compared to an FPGA) very large amounts of very fast "block RAM". CPUs do most things faster than FPGAs, and that includes most kinds of compression. This doesn't mean such things aren't interesting on FPGAs, or that FPGAs couldn't do even those things alongside CPUs, but it does reduce the number of things for which FPGAs are practical if they sit alongside CPUs.

Now of course, if you are power limited then said CPUs go missing from the picture (though not entirely; look at the performance and power needs of, say, the VIA C3).

--
	Sander

+++ Out of cheese error +++
Reply to
Sander Vesik

Your comments seem odd. You have no understanding of why Steve was asking the question. So how can you presume to judge it to be "stupid"? Your analysis is just one way of looking at the comparison. If nothing else, perhaps he wanted the comparison just as a way of judging the speed in less technical terms, for presentation to management, for example.

Reply to
rickman

Yes there is ... it is called Golomb-Rice coding ...

It's a Huffman variation that assumes a geometric distribution. At the moment I'm still simulating in C to estimate the coding performance; the FPGA implementation should be easy. It seems to work quite well on DPCM images ...
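For reference, a Rice code with parameter k writes a value as a unary quotient (n >> k ones, then a zero) followed by the low k remainder bits, so no code table is needed at all, just shifts and compares. A sketch of the encoder (my own illustration, not Michael's simulator):

```c
#include <assert.h>
#include <stdint.h>

/* Rice-encode value n with parameter k into *bits (MSB first).
   Returns the code length in bits. Sketch only: assumes the code
   fits in 64 bits, i.e. (n >> k) + 1 + k <= 64. */
static int rice_encode(unsigned n, unsigned k, uint64_t *bits) {
    unsigned q = n >> k;                          /* unary quotient */
    uint64_t acc = ((uint64_t)1 << q) - 1;        /* q ones */
    acc = acc << 1;                               /* terminating zero */
    acc = (acc << k) | (n & ((1u << k) - 1));     /* k remainder bits */
    *bits = acc;
    return (int)(q + 1 + k);
}
```

With k = 2, the value 9 encodes as "11 0 01" (5 bits): quotient 2 in unary, then the remainder 1. Small values, the common case for DPCM residuals, get the shortest codes.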

bye, Michael

Reply to
Michael Schöberl

I've never played with this problem before, but I fail to see why you couldn't do it adaptively, at the cost of more memory of course. Just maintain a heap of occurrence counts of the codes seen up to this point. To emit a word, search for it in the heap (hashing would do); its position in the heap directly translates into its Huffman code. Then update its count, which could cause the code to move up in the heap and thus update a bunch of word -> code mappings.

To decode you'd do almost exactly the same, except you're looking up Huffman codes in the heap and translating them to words. (Of course I assume that previously unseen words are emitted escaped in some way, e.g. prefixed with a reserved code.)

Improvements include starting with a good default dictionary and heuristics for resetting the dictionary when attained compression is too low.

Is anything wrong with this obvious approach?

/Tommy

Reply to
user

Adaptive Huffman codes have been around since at least the early 1980s.

If you are interested in (adaptive) Huffman codes, you may want to read the following:

igm.univmlv.fr/~mac/DOC/C10.ps

formatting link

Reply to
Sander Vesik

Maybe. If I understand what you are saying, initially a sequence will be sent unencoded until enough information is saved up to assign codes to various strings that occur. But the transmitting end sees the raw stream while the receiving end sees the encoded stream. How does the receiving end know when the first code is sent and what it stands for? Each time a new code is sent, the receiving end needs to know what is happening. Or does the decision to send a code instead of the data depend only on the history and not on the present data?

Reply to
rickman

Some compression schemes have a way to indicate that the next N bytes are raw data, don't look in the dictionary.

You can also use the data stream (the part you have sent already) as a dictionary, encoding things like copy X bytes starting -Y bytes back from here.
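That second scheme, the LZ77 family, decodes each back-reference with a plain byte-by-byte copy. A sketch (hypothetical helper, illustrating the semantics only): copying byte-by-byte is what makes overlapping matches (distance smaller than length) legal, since each copied byte becomes available as a source for the next.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one back-reference: append len bytes to out, each copied
   from dist bytes behind the write position. pos is the current
   output length; returns the new output length. */
static size_t lz_copy(unsigned char *out, size_t pos,
                      size_t dist, size_t len) {
    for (size_t i = 0; i < len; i++)
        out[pos + i] = out[pos + i - dist];  /* source may overlap dest */
    return pos + len;
}
```

Given the output so far "ab", the reference (dist = 2, len = 4) extends it to "ababab"; with dist = 1 a single seed byte expands into a run, the classic RLE special case.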

In general, most good compression techniques are a tradeoff among memory usage, CPU/cache usage, and compression efficiency.

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
Reply to
Hal Murray

Steve,

We have not optimized the reference encoder or run it on such a fast machine yet, but I can make a guess as to the performance based on the instruction flow. Best case, one might get 2-3M bytes/second throughput. I am being very generous with the estimate, unless I have overlooked an optimization.

I am generally wary to give performance numbers for our reference encoders, as we generally write them to prove an algorithm "works", i.e. for test bench data generation and verification of compression ratios. It would be rather misleading for us to say "our hardware goes x-times faster than our software implementation" as we invest a great deal of effort in the hardware optimizations and the algorithms, and almost none in the software beyond demonstrating the correctness of the implementation and algorithm. Our reference encoder probably won't get anywhere near 2MB/sec.

Regards, Erik Widding.

P.S. I grouped this response with one below, as they are closely related.

*******************************************

Sander,

Actually I found Steve's question to be right on target, and quite typical of those asked by our customers. I had not yet had a chance to reply. As for the rest of your comments... just about everything in your post is incorrect.

Most of these algorithms deal with memory access in a very predetermined fashion globally, with random access on a much more localized basis. So the lack of large amounts of blockram does not benefit the processor implementation. If memory is needed, one generally adds SDRAM next to the FPGA.

If that were true, the compression engines for tape drives would not be located in the drive, but would instead run on the host processors themselves. As an example, for minimally compressible data in our LZ derivative (or LZS, for that matter) with a 512-byte history-buffer/dictionary, there will be 512 comparisons and 512 address increments, requiring 1024 instructions minimum per input byte. In an FPGA these 512 comparisons can be executed in a single clock period. The FPGA clock rate may only be 100 MHz to the processor's 3 GHz, a ratio of 1:30, whereas the FPGA only requires one clock cycle to the processor's 1000+.
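The loop Erik is counting looks roughly like the sketch below (my own illustration, not Birger's code): one compare and one index increment per history position, per input byte. In an FPGA, all 512 comparators evaluate in parallel in one clock; in software this whole loop runs once per byte.

```c
#include <assert.h>

#define HIST 512   /* 512-byte history buffer, as in the example above */

/* Count history positions matching the current input byte: the
   software cost is ~2 operations (compare + increment) per position,
   i.e. ~1024 instructions per input byte, vs. one FPGA clock. */
static int count_matches(const unsigned char *hist, unsigned char c) {
    int matches = 0;
    for (int i = 0; i < HIST; i++)
        if (hist[i] == c)
            matches++;
    return matches;
}
```

A real matcher would extend each hit to find the longest run rather than just count hits, but the per-byte cost ratio is the same.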

I do not know of a single compression algorithm that will run faster on a general-purpose processor than on an FPGA with external SDRAM. Many won't even need the external memory to beat the processor. Assume both implementations are done by experts in the respective areas.

Feel free to propose an algorithm where I am wrong. It has been my experience that the hardware implementations are generally one to three orders of magnitude faster than the software, with the performance normalized for the cost of the respective hardware.

Regards, Erik Widding.

Reply to
Erik Widding
