Thursday, March 11, 2010

wrung out YAML

Because of a project at work I have recently become interested in compression algorithms. The only one I was exposed to at school was the famous Huffman Algorithm which is often used as an archetypal greedy algorithm. In particular we need to compress some very small YAML documents, at most a thousand bytes or so for transmission over modem channels. Yes the dinosaur dial up kind.

So we went looking so far we have tried LZW, the Burrows-Wheeler transform followed by run length encoding and a really simple one called Byte Pair encoding. LZW does not have time to really get going before it runs out of data and ends up increasing the size of the message. BWT-RLE gives us some savings but the very simple byte-pair-encoding seems to be winning the day right now.

The algorithm is easy. Basically you count how many times each pair of bytes is in the message, you replace the most often occurring one with a byte that does not appear elsewhere in the message. The pair must appear at least 4 times to get any savings. The way my particular implementation the expansion dictionary is pre-pended to the message. The algorithm runs multiple passes over the data and the dictionary is pre-pended at the end of each round. Which means the dictionary is subject to compression of it's own.

That leads to some interesting recursive compressions many times. A length 3 string repeated 4 times ends up compressing from 12 to 10 bytes:

 round 1: abcabcabcabc
 round 2: ZabZcZcZcZc
 round 3: YZcZabYYYY
which does not seem very good. But the empirical evidence so far says that this simple compression scheme works very well. At least on YAML.