Huffman Codes

Learn about the Huffman codes problem and its solution using greedy algorithms.

Binary codes and prefix-free codes

A binary code assigns a string of 0s and 1s to each character in the alphabet. A binary code is prefix-free if no code is a prefix of any other. (Confusingly, prefix-free codes are also commonly called prefix codes.) 7-bit ASCII and Unicode’s UTF-8 are both prefix-free binary codes. Morse code is a binary code with symbols • and —, but it is not prefix-free, because the code for E ( • ) is a prefix of the codes for I ( • • ), S ( • • • ), and H ( • • • • ).
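To make the property concrete, here is a small sketch (not from the text) that tests whether a set of code words is prefix-free, using a Morse-like code with dots written as 0s as the failing example:

```python
def is_prefix_free(codes):
    """Check whether no code word is a prefix of any other."""
    words = sorted(codes.values())
    # After sorting lexicographically, a prefix always immediately
    # precedes some word it prefixes, so adjacent pairs suffice.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

# Morse-like code: E's code word is a prefix of I's and S's.
print(is_prefix_free({'E': '0', 'I': '00', 'S': '000'}))   # False
print(is_prefix_free({'E': '00', 'I': '01', 'S': '10'}))   # True
```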

Binary trees and code words

Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves. The code word for any symbol is given by the path from the root to the corresponding leaf; 0 for left, 1 for right. Thus, the length of any symbol’s code word is the depth of the corresponding leaf in the code tree. Although they are superficially similar, binary code trees are not binary search trees; we don’t care at all about the order of symbols on the leaves.
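As a quick illustration (the tree below is made up, not one from this lesson), a code tree can be stored as nested pairs, and the code words can be read off by recording 0 for each left step and 1 for each right step:

```python
def code_words(tree, prefix=""):
    """Read code words off a nested-tuple code tree: 0 = left, 1 = right.
    A leaf is any non-tuple value; its code word is the root-to-leaf path,
    so its code length equals its depth."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    left, right = tree
    return {**code_words(left, prefix + "0"), **code_words(right, prefix + "1")}

# Example tree: 'a' at depth 1, 'b' and 'c' at depth 2.
print(code_words(('a', ('b', 'c'))))   # {'a': '0', 'b': '10', 'c': '11'}
```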

Optimizing prefix-free binary codes

Suppose we want to encode a message written in an n-character alphabet so that the encoded message is as short as possible. Specifically, given an array of frequency counts f[1..n], we want to compute a prefix-free binary code that minimizes the total encoded length of the message:

$$\sum_{i=1}^{n} f(i) \cdot \mathit{depth}(i).$$

This is exactly the same cost function we considered for optimizing binary search trees, but the optimization problem is different because code trees are not required to keep the keys in any particular order.
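In code, the objective is just a frequency-weighted sum of code-word lengths; a tiny sketch with made-up numbers:

```python
def encoded_length(freq, depth):
    """Total encoded length: sum over symbols of frequency * code-word length."""
    return sum(freq[ch] * depth[ch] for ch in freq)

# Hypothetical 3-symbol example: the frequent symbol 'a' gets the shorter code.
print(encoded_length({'a': 10, 'b': 3, 'c': 2}, {'a': 1, 'b': 2, 'c': 2}))   # 20
```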

In 1951, as a PhD student at MIT, David Huffman developed the following greedy algorithm to produce such an optimal code:
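A minimal Python sketch of the same greedy idea is given below, using heapq as the priority queue; the function and variable names are illustrative, not from the text.

```python
import heapq

def huffman_code(freq):
    """Build a prefix-free code for a {character: frequency} map by
    repeatedly merging the two least frequent subtrees (Huffman's algorithm)."""
    # Heap entries are (weight, tie_breaker, tree); a tree is either a single
    # character or a (left, right) pair of subtrees. The tie_breaker keeps
    # heapq from ever comparing two trees directly.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # least frequent subtree
        w2, _, t2 = heapq.heappop(heap)   # second least frequent subtree
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    # Read code words off the finished tree: 0 for left, 1 for right.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-character edge case
    walk(heap[0][2], "")
    return codes
```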


Huffman’s algorithm

Huffman’s algorithm is best illustrated through an example. Suppose we want to encode the following helpfully self-descriptive sentence, discovered by Lee Sallows:

This sentence contains three a’s, three c’s, two d’s, twenty-six e’s, five f’s, three g’s, eight h’s, thirteen i’s, two l’s, sixteen n’s, nine o’s, six r’s, twenty-seven s’s, twenty-two t’s, two u’s, five v’s, eight w’s, four x’s, five y’s, and only one z.

To keep things simple, let’s ignore the forty-four spaces, nineteen apostrophes, nineteen commas, three hyphens, and only one period, and encode only the letters as though the message were written in scriptio continua:

THISSENTENCECONTAINSTHREEASTHREECSTWODSTWENTYSIXESFIVEFSTHREEGSEIGHTHSTHIRTEENISTWOLSSIXTEENNSNINEOSSIXRSTWENTYSEVENSSTWENTYTWOTSTWOUSFIVEVSEIGHTWSFOURXSFIVEYSANDONLYONEZ

Here is the frequency table for Sallows’ sentence:

A:3, C:3, D:2, E:26, F:5, G:3, H:8, I:13, L:2, N:16, O:9, R:6, S:27, T:22, U:2, V:5, W:8, X:4, Y:5, Z:1

Huffman’s algorithm picks out the two least frequent letters, breaking ties arbitrarily—in this case, say, Z and D—and merges them together into a single new character DZ with frequency 3. This new character becomes an internal node in the code tree we are constructing, with Z and D as its children; it doesn’t matter which child is which. The algorithm then recursively constructs a Huffman code for the new frequency table.


After 19 merges, all 20 letters have been merged together. The record of merges gives us our code tree. The algorithm makes a number of arbitrary choices; as a result, there are actually several different Huffman codes. One such Huffman code is shown below; numbers in non-leaf nodes are frequencies for merged characters. For example, the code for A is 101000, and the code for S is 111.

[Figure: One possible Huffman code for Sallows' sentence]

Encoding Sallows’ sentence with this particular Huffman code would yield a bit string that starts like so:

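Producing that bit string is simply a matter of concatenating code words; a minimal sketch (the code dictionary in the usage line is made up, not the exact code from the figure):

```python
def encode(message, code):
    """Concatenate code words; because the code is prefix-free, the result
    decodes unambiguously even with no separators between code words."""
    return "".join(code[ch] for ch in message)

# Tiny made-up example (not the code from the figure above):
print(encode("abba", {"a": "0", "b": "10", "c": "11"}))   # 010100
```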

Here is the list of costs for encoding each character in Sallows’ sentence, along with that character’s contribution to the total length of the encoded sentence:

[Table: each character's code length and its contribution to the total encoded length]

Optimality of Huffman codes: theorem and proof

Altogether, the encoded message is 649 bits long. Different Huffman codes encode the same characters differently, possibly with code words of different lengths, but the overall length of the encoded message is the same for every Huffman code: 649 bits.
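One way to check the 649-bit figure without drawing the whole tree: each merge creates an internal node whose weight counts every character in its subtree exactly once, so the sum of all merge weights equals the total encoded length, the sum of f(i) · depth(i) over all characters. A short sketch using heapq, with the frequencies taken straight from the sentence:

```python
import heapq

# Frequencies from Sallows' self-descriptive sentence.
freq = {'A': 3, 'C': 3, 'D': 2, 'E': 26, 'F': 5, 'G': 3, 'H': 8,
        'I': 13, 'L': 2, 'N': 16, 'O': 9, 'R': 6, 'S': 27, 'T': 22,
        'U': 2, 'V': 5, 'W': 8, 'X': 4, 'Y': 5, 'Z': 1}

# Repeatedly merge the two smallest weights and accumulate the merge weights;
# the running total is the length of the encoded message.
heap = list(freq.values())
heapq.heapify(heap)
total = 0
while len(heap) > 1:
    a, b = heapq.heappop(heap), heapq.heappop(heap)
    total += a + b
    heapq.heappush(heap, a + b)

print(total)   # 649
```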

Given the simple structure of Huffman’s algorithm, it’s rather surprising that it produces an optimal prefix-free binary code. Encoding Sallows’ sentence using any prefix-free code requires at least 649 bits! Fortunately, the recursive structure makes this claim easy to prove using an exchange argument, similar to our earlier optimality proofs. We start by proving that the algorithm’s very first choice is correct.

Lemma: Let x and y be the two least frequent characters (breaking ties between equally frequent characters arbitrarily). There is an optimal code tree in which x and y are siblings.

Proof: Let’s actually prove a stronger statement than this lemma: There is an optimal code in which x and y are siblings and have the largest depth of any leaf.

Let T be an optimal code tree, and suppose this tree has depth d. Because T is a full binary tree, it has at least two leaves at depth d that are siblings. (Verify this claim by induction!) Suppose those two leaves are not x and y, but some other characters a and b.

Let T′ be the code tree obtained by swapping x ...