 In Progress
Lesson 30 of 43
In Progress

# Optimal Tree Problem

Suppose we have to encode a text that comprises symbols from some n-symbol alphabet by assigning to each of the text’s symbols some sequence of bits called the codeword. For example, we can use a fixed-length encoding that assigns to each symbol a bit string of the same length m (m ≥ log2 n). This is exactly what the standard ASCII code does. One way of getting a coding scheme that yields a shorter bit string on the average is based on the old idea of assigning shorter codewords to more frequent symbols and longer codewords to less frequent symbols. This idea was used, in particular, in the telegraph code invented in the mid-19th century by Samuel Morse. In that code, frequent letters such as e (.) and a (.−) are assigned short sequences of dots and dashes while infrequent letters such as q (− − .−) and z (− − ..) have longer ones.

Variable-length encoding, which assigns codewords of different lengths to different symbols, introduces a problem that fixed-length encoding does not have. Namely, how can we tell how many bits of an encoded text represent the first (or, more generally, the ith) symbol? To avoid this complication, we can limit ourselves to the so-called prefix-free (or simply prefix) codes. In a prefix code, no codeword is a prefix of a codeword of another symbol. Hence, with such an encoding, we can simply scan a bit string until we get the first group of bits that is a codeword for some symbol, replace these bits by this symbol, and repeat this operation until the bit string’s end is reached.

If we want to create a binary prefix code for some alphabet, it is natural to associate the alphabet’s symbols with leaves of a binary tree in which all the left edges are labeled by 0 and all the right edges are labeled by 1. The codeword of a symbol can then be obtained by recording the labels on the simple path from the root to the symbol’s leaf. Since there is no simple path to a leaf that continues to another leaf, no codeword can be a prefix of another codeword; hence, any such tree yields a prefix code.

Among the many trees that can be constructed in this manner for a given alphabet with known frequencies of the symbol occurrences, how can we construct a tree that would assign shorter bit strings to high-frequency symbols and longer ones to low-frequency symbols? It can be done by the following greedy algorithm, invented by David Huffman while he was a graduate student at MIT.

Lesson Content
0% Complete 0/1 Steps