Suppose you wish to losslessly compress data which can be expressed as a sequence of symbols (a regular text file is a classic example: its symbols are letters and punctuation marks). There are many good compression algorithms out there that solve this problem. This post will present a famous one called Huffman's algorithm, which compresses data by creating an encoding scheme that is optimal in the sense that no other symbol-by-symbol encoding will generate a better compression of the data. This form of compression is certainly not the only one available, but it is very easy to understand and also very important since it is used by codecs such as JPEG and MP3.

Huffman's algorithm works as follows:

1. | count the number of occurrences $n_i$ of each distinct symbol $s_i$ in the data |

2. | for each distinct symbol $s_i$, generate a tree containing a single node with key $s_i$ and weight $n_i$ |

3. | select the two trees with smallest root node weights and merge them under a new root node; the key of this new root node is the concatenation of the keys of the root nodes of the merged trees and its weight is the sum of their weights (so if the two chosen trees have root nodes with keys $\alpha_j$ and $\alpha_k$ and weights $w_j$ and $w_k$ respectively, we put them under a new root node with key $\alpha_j\alpha_k$ and weight $w_j + w_k$) |

4. | if only a single tree is left, stop; otherwise go back to step 3 |

5. | the representation of a given symbol is then the bit sequence which one obtains by starting from the root node of the final tree and moving along the tree until the node containing the given symbol as key is reached: each time we go to the left child, we append a 0 to the bit representation; each time we go to the right child, we append a 1 to it. |
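
The steps above can be sketched in Python roughly as follows (a minimal sketch: the function name and the tuple-based tree representation are choices made here, not part of the implementation linked at the end of the post):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code for the symbols in data.

    Returns a dict mapping each distinct symbol to its bit string.
    A tree is either a symbol (a leaf) or a (left, right) pair.
    """
    # Steps 1-2: one (weight, tiebreaker, tree) entry per distinct symbol.
    # The integer tiebreaker keeps heapq from ever comparing two trees.
    counts = Counter(data)
    heap = [(n, i, s) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    # Steps 3-4: repeatedly merge the two lightest trees until one remains.
    i = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, (t1, t2)))
        i += 1
    # Step 5: walk the tree, appending '0' for left and '1' for right.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # edge case: only one distinct symbol
    _, _, root = heap[0]
    walk(root, "")
    return codes
```

Note that which subtree ends up on the left is arbitrary, so the exact bit strings may differ from one implementation to another; only the code lengths (and hence the compressed size) are determined by the algorithm.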

Although the description above is somewhat abstract, the algorithm is really simple. Figure 1 describes visually how it works.

Fig. 1: | A visual description of Huffman's algorithm for some text which contains only the letters in the set {A,B,C,D}. In the first step, we merge the trees with root nodes C and D, then we merge the trees with root nodes B and CD, and finally we merge the trees with root nodes A and BCD. The representation of a given symbol is a bit string which is obtained by traversing the final tree (starting from its root node) until the node whose key is this symbol is reached. Above, we have the following symbol-to-bit-string mappings: A $\rightarrow$ 0, B $\rightarrow$ 10, C $\rightarrow$ 110, D $\rightarrow$ 111. |

As mentioned above, Huffman's algorithm is optimal in the sense that no other symbol-by-symbol encoding will generate a better compression of the data. Consider the following sequence with $20$ characters where $n_\textrm{A} = 12$, $n_\textrm{B} = 5$, $n_\textrm{C} = 2$ and $n_\textrm{D} = 1$ (as in figure 1):

AABACAABBAABAAACABAD

Below is the representation of the sequence above using ASCII codes (which is a very common encoding scheme used for storing text files on computers):

0100000101000001010000100100000101000011010000010100000101000010010000100100000101000001010000100100000101000001010000010100001101000001010000100100000101000100

And here is the representation of the same sequence using the encoding scheme from figure 1:

0010011000101000100001100100111
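
As a sanity check, this bit string can be reproduced by concatenating the code words from figure 1 (the `encode` helper below is purely illustrative):

```python
# The symbol-to-bit-string mapping from figure 1.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(text, codes):
    """Concatenate the bit strings of the symbols in text."""
    return "".join(codes[s] for s in text)

bits = encode("AABACAABBAABAAACABAD", codes)
print(bits)       # 0010011000101000100001100100111
print(len(bits))  # 31
```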

Huffman's representation of the original character sequence is clearly better than the ASCII one, but how much better is it? If $L^{\textrm{A}}_i$ is the number of bits used to represent the $i$-th symbol in the alphabet {A,B,C,D} in the ASCII encoding, then the total number of bits necessary to represent the character sequence using ASCII is: $$ N^{\textrm{A}} = \sum_i L^{\textrm{A}}_i n_i = 8n_{\textrm{A}} + 8n_{\textrm{B}} + 8n_{\textrm{C}} + 8n_{\textrm{D}} = 160 $$ The equivalent quantity for the Huffman encoding is: $$ N^{\textrm{H}} = \sum_i L^{\textrm{H}}_i n_i = 1n_{\textrm{A}} + 2n_{\textrm{B}} + 3n_{\textrm{C}} + 3n_{\textrm{D}} = 31 $$ which is less than $20\%$ of the amount of space needed with the ASCII encoding scheme. The Huffman encoding uses a clever trick: it assigns short bit strings to the characters which appear more often in the data and longer bit strings to the ones which occur less frequently.
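
The arithmetic above can be verified in a few lines (the variable names here are ad hoc):

```python
# Symbol counts from the example sequence (as in figure 1).
counts = {"A": 12, "B": 5, "C": 2, "D": 1}
ascii_bits = {s: 8 for s in counts}              # every ASCII character uses 8 bits
huffman_bits = {"A": 1, "B": 2, "C": 3, "D": 3}  # code lengths from figure 1

N_ascii = sum(ascii_bits[s] * n for s, n in counts.items())      # 160
N_huffman = sum(huffman_bits[s] * n for s, n in counts.items())  # 31
print(N_huffman / N_ascii)  # 0.19375, i.e. under 20%
```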

One interesting aspect of Huffman's algorithm is the fact that it always generates encodings which are not ambiguous, in the sense that the original data can be directly obtained from the compressed version expressed as a concatenation of bit strings. This is due to the fact that Huffman's algorithm always produces prefix codes: encodings in which no code word is a prefix of another. To clarify this point, consider this mapping: A $\rightarrow$ 0, B $\rightarrow$ 10 and C $\rightarrow$ 101. If you have the following bit sequence: 01010, what does it represent: ABB or ACA? The ambiguity here comes from the fact that the representation of B is a prefix of the representation of C. Huffman's algorithm will never yield an encoding scheme like this.
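
A minimal sketch of why prefix codes decode unambiguously: read the bits left to right and emit a symbol as soon as the accumulated bits match a code word (the `decode` helper below is a name chosen here for illustration):

```python
def decode(bits, codes):
    """Decode a concatenated bit string produced by a prefix code.

    Because no code word is a prefix of another, the moment the
    buffer matches a code word there is exactly one possible symbol,
    so the original text is recovered unambiguously.
    """
    inverse = {code: symbol for symbol, code in codes.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

# The prefix code from figure 1 recovers the original sequence:
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode("0010011000101000100001100100111", codes))  # AABACAABBAABAAACABAD
```

With a non-prefix code like the A/B/C mapping above, this greedy strategy would fail precisely on ambiguous strings such as 01010.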

An implementation of Huffman's algorithm written in Python can be found here.