Compressing Text Data Using BWT
Text data can be more efficiently compressed by following certain semantics of block sorting and run length encoding. It all started with a research paper 'A Blocksorting Lossless Data Compression Algorithm' by M.Burrows and D.J. Wheeler, sometime in 1994. The algorithm they proposed was so elegant and easy to adopt that it was implemented in every second compression program.But there was no pure Java implementation primarily because of the performance expectations of compression algorithms. So here is my Java implementation of the BWT algorithm, called BAR (BWT Archiver).
Here is the doc explaining what I'm trying to do, which is a slight modification of the BWT recommendation:
I do RLE -> Forward BWT -> MTF -> RLE -> ENT(Huffman)
Reverse ENT(Huffman) -> RLE -> MTF -> Reverse BWT -> RLE
Surprisingly, here is the comparison results:
For Three Musketeers (http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=1257)
Tools Orig. Size (Bytes) Comp. Size(Bytes) Bps (Bits/Symbol)
WinZip 1,349,189 503,693(Best Compression) 2.986
WinRAR 1,349,189 416,342(Best Compression) 2.468
BAR 1,349,189 371,442 2.202
For Large Corpus (http://corpus.canterbury.ac.nz/descriptions/index.html#large)
Tools Orig. Size (Bytes) Comp. Size(Bytes) Bps (Bits/Symbol)
WinZip 11,159,482 3,331,499(Best Compression) 2.3882
WinRAR 11,159,482 2,837,133(Best Compression) 2.0338
BAR 11,159,482 2,602,030 1.8653
Download the jar file and try it out:
java -jar Bar.jar -c input.txt output.barI have made some tweaking to make Bar work off nio. But still performance can be improved.
( Jul 10 2008, 01:28:01 AM PDT ) Permalink
Comments:
Post a Comment:
Comments are closed for this entry.




