Design and Analysis of Algorithms: Huffman codes

Huffman codes

Huffman codes compress data very effectively: savings of 20% to 90% are typical, depending on the

characteristics of the data being compressed.

There are many options for how to represent such a file of information.

Consider the problem of designing a binary character code in which each character is represented by

a unique binary string, called a code word. If the code word of fixed length code words are used,

more bits are necessary to store compared to variable length code words.

Huffman codes uses Variable length code words for data compression. A variable-length code can

do considerably better than a fixed-length code, by giving frequent characters short code words and

infrequent characters long code words. These types of variable-length codes are called prefix codes.

A prefix code can always achieve the optimal data compression among any character code.

Prefix codes are desirable because they simplify decoding. Since no code word is a prefix of any other, the code

word that begins an encoded file is unambiguous.

Above figure is the tree corresponding to the fixed-length code of three bits. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its sub tree.

Above figure is the tree corresponding to the optimal prefix code. An optimal code for a file is always represented by a full binary tree.

Binary code word for a character is the path from the root to that character where 0 means "go to left child"and 1 means "go to right child".

Ex: file contains "aabe" then the encoding is 0.0.101.1101, where '. ' denotes concatenation. Decoding is also simple, since prefix code is used no other code word is prefix of any other code word.

Given a tree T corresponding to a prefix code, the number of bits required to encode a file can be easily computed. For each character ‘c’ in the alphabet C, let the attribute c.freq denote the frequency of ‘c’ in the file and let d_T(c) denote the depth of c’s leaf in the tree.
The number of bits required to encode a file (cost of the tree T) is thus

Constructing a Huffman code

The algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C| - 1 “merging” operations to create the final tree.

The algorithm uses a min-priority queue Q, keyed on the freq attribute, to identify the two least-frequent objects to merge together.

Line 2 initializes the min-priority queue Q with the characters in C. The for loop in lines 3–8 repeatedly extracts the two nodes x and y of lowest frequency from the queue, replacing them in the queue with a new node ‘z ‘representing their merger. The frequency of z is computed as the sum of the frequencies of x and y in line 7. The node z has x as its left child and y as its right child. After n - 1 mergers, line 9 returns the one node left in the queue, which is the root of the code tree.

The total running time of HUFFMAN on a set of n characters is O(n lg n).

Example:

Consider the characters in the file occur with the frequencies given by