20080710 Thursday July 10, 2008

Compressing Text Data Using BWT

Text data can be more efficiently compressed by following certain semantics of block sorting and run length encoding. It all started with a research paper 'A Block­sorting Lossless Data Compression Algorithm' by M.Burrows and D.J. Wheeler, sometime in 1994. The algorithm they proposed was so elegant and easy to adopt that it was implemented in every second compression program.

But there was no pure Java implementation primarily because of the performance expectations of compression algorithms. So here is my Java implementation of the BWT algorithm, called BAR (BWT Archiver).

Here is the doc explaining what I'm trying to do, which is a slight modification of the BWT recommendation:

I do RLE -> Forward BWT -> MTF -> RLE -> ENT(Huffman)
Reverse ENT(Huffman) -> RLE -> MTF -> Reverse BWT -> RLE

Surprisingly, here is the comparison results:

For Three Musketeers (http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=1257)

Tools           Orig. Size (Bytes)  Comp. Size(Bytes)         Bps (Bits/Symbol) 
WinZip          1,349,189           503,693(Best Compression) 2.986 
WinRAR          1,349,189           416,342(Best Compression) 2.468 
BAR             1,349,189           371,442                   2.202                     

For Large Corpus (http://corpus.canterbury.ac.nz/descriptions/index.html#large)

Tools           Orig. Size (Bytes)  Comp. Size(Bytes)           Bps (Bits/Symbol) 
WinZip          11,159,482          3,331,499(Best Compression) 2.3882 
WinRAR          11,159,482          2,837,133(Best Compression) 2.0338 
BAR             11,159,482          2,602,030                   1.8653                                               
                                    


Download the jar file and try it out:
java -jar Bar.jar -c input.txt output.bar
I have made some tweaking to make Bar work off nio. But still performance can be improved.

( Jul 10 2008, 01:28:01 AM PDT ) Permalink
Comments:

Post a Comment:

Comments are closed for this entry.