Huffman codes
Huffman codes compress
data very effectively: savings of 20% to 90% are typical, depending on the
characteristics of the data being compressed.
There are many options for how to represent such
a file of information.
Consider the problem of designing a binary
character code in which each character is represented by
a unique binary
string, called a code word. If fixed-length code words are used,
more bits are necessary to store the file than with well-chosen variable-length code words.
Huffman codes use variable-length code words for data compression. A variable-length code can
do considerably better than a fixed-length code, by giving frequent characters
short code words and
infrequent characters long code words. The variable-length codes considered here are prefix codes:
codes in which no code word is also a prefix of some other code word.
A prefix code can always achieve the optimal data compression among any character code, so there is no loss of generality in restricting attention to prefix codes.
Prefix
codes are desirable because they simplify decoding. Since no code word is a
prefix of any other, the code word that begins an encoded file is unambiguous: simply identify the initial code word, translate it back to the original character, and repeat the process on the remainder of the encoded file.
The figure above shows the tree corresponding to the fixed-length code of three bits. Each leaf is labeled with a character and its
frequency of occurrence. Each internal node is labeled with the sum of the
frequencies of the leaves in its subtree.
The figure above shows the tree corresponding to the optimal prefix
code. An optimal code for a file is always represented by a full binary tree, in which every non-leaf node has two children.
The binary code word for a character is given by the path from the root to that character's leaf, where 0 means "go to the left child" and 1 means "go to the right child".
Example: if the file contains "aabe", then the encoding is 0·0·101·1101, where '·' denotes concatenation. Decoding is also simple: since a prefix code is used, no code word is a prefix of any other code word, so each code word can be recognized as soon as it is read.
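The encoding and decoding described above can be sketched in a few lines of Python. The code words for a, b, and e come from the "aabe" example; the code words assumed for c, d, and f are hypothetical, added only to complete the table:

```python
# Prefix code: a, b, e match the example above; c, d, f are assumed.
code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def encode(text):
    # Concatenate the code word of each character.
    return ''.join(code[ch] for ch in text)

def decode(bits):
    # Because no code word is a prefix of another, we can scan left to
    # right and emit a character as soon as the buffer matches a code word.
    inverse = {w: ch for ch, w in code.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ''
    return ''.join(out)

print(encode('aabe'))      # -> 001011101
print(decode('001011101')) # -> aabe
```

The decoder never needs lookahead: the prefix property guarantees that the first complete code word it sees is the right one.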
Given
a tree T corresponding to a prefix code, the number of
bits required to encode a file can be easily computed. For each character c in the alphabet C, let
the attribute c.freq denote the frequency of c in the file and let d_T(c)
denote the depth of c's leaf in the tree; d_T(c) is also the length of the code word for character c.
The number of bits required to encode the file, which we define as the cost B(T) of the tree T, is thus

    B(T) = Σ_{c ∈ C} c.freq · d_T(c)
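As a quick check, the cost formula can be evaluated for a small alphabet. The frequencies below are assumptions chosen for illustration; the depths are the code-word lengths of a prefix code like the one above (a=0, b=101, c=100, d=111, e=1101, f=1100):

```python
# Cost B(T) = sum of c.freq * d_T(c) over all characters c.
# Frequencies are assumed example values; depths equal code-word lengths.
freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
depth = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}

cost = sum(freq[c] * depth[c] for c in freq)
print(cost)  # 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224
```

A fixed-length code for six characters needs 3 bits per character, i.e. 3 · 100 = 300 bits for these frequencies, so the variable-length code saves about 25%.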
Constructing
a Huffman code
The
algorithm builds the tree T corresponding to the optimal code in a bottom-up manner.
It begins with a set of |C| leaves and performs a sequence of |C| - 1 “merging”
operations to create the final tree.
The algorithm uses a min-priority queue Q,
keyed on the freq attribute, to identify the two least-frequent objects
to merge together.
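The merging procedure can be written as the following pseudocode (the standard presentation of the HUFFMAN procedure; the line numbers match the references in the description that follows):

```
HUFFMAN(C)
1   n = |C|
2   Q = C
3   for i = 1 to n - 1
4       allocate a new node z
5       z.left  = x = EXTRACT-MIN(Q)
6       z.right = y = EXTRACT-MIN(Q)
7       z.freq  = x.freq + y.freq
8       INSERT(Q, z)
9   return EXTRACT-MIN(Q)
```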
Line 2 initializes the min-priority queue Q with the characters in C. The for loop in lines 3–8 repeatedly extracts the two nodes x and y of lowest frequency from the queue, replacing them in the queue with a new node z representing their merger. The frequency of z is computed as the sum of the frequencies of x and y in line 7. The node z has x as its left child and y as its right child. After n - 1 mergers, line 9 returns the one node left in the queue, which is the root of the code tree.
The total running time of HUFFMAN on a set of n characters is O(n lg n).
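The procedure above can be sketched in Python using the standard-library heapq module as the min-priority queue. The class and function names, and the example frequencies at the end, are our own choices for illustration:

```python
import heapq

class Node:
    """A tree node keyed on frequency; leaves carry a character."""
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char = freq, char
        self.left, self.right = left, right
    def __lt__(self, other):           # lets heapq order nodes by freq
        return self.freq < other.freq

def huffman(freqs):
    # Start with |C| leaves in the min-priority queue.
    q = [Node(f, c) for c, f in freqs.items()]
    heapq.heapify(q)
    # |C| - 1 merges: each merges the two least-frequent nodes.
    while len(q) > 1:
        x = heapq.heappop(q)
        y = heapq.heappop(q)
        heapq.heappush(q, Node(x.freq + y.freq, left=x, right=y))
    return q[0]                        # root of the code tree

def code_words(root, prefix='', table=None):
    # Walk the tree: 0 = left edge, 1 = right edge.
    if table is None:
        table = {}
    if root.char is not None:
        table[root.char] = prefix or '0'
        return table
    code_words(root.left, prefix + '0', table)
    code_words(root.right, prefix + '1', table)
    return table

freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}  # assumed
root = huffman(freqs)
print(code_words(root))
```

Each heap operation costs O(lg n) and there are O(n) of them, which gives the O(n lg n) running time stated above. Ties between equal frequencies can be broken either way; any tie-breaking yields an optimal code.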
Example:
Suppose the characters in the file occur with given frequencies.
Arrange the frequencies in monotonically increasing order in a min-priority queue.
Choose the two minimum frequencies and merge them into a single node whose frequency is their sum; the queue now contains this merged node in place of the two it replaced.
Repeat the same process, choosing and merging the two minimum frequencies, until only one node remains in the queue. That node is the root of the Huffman tree.
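The repeated choose-and-merge steps can be traced directly on the frequencies alone. The six frequency values below are hypothetical, chosen only to make the trace concrete:

```python
import heapq

# Assumed leaf frequencies for the trace: 5, 9, 12, 13, 16, 45.
q = [45, 13, 12, 16, 9, 5]
heapq.heapify(q)          # the min-priority queue of frequencies

merges = []
while len(q) > 1:
    x = heapq.heappop(q)  # smallest frequency
    y = heapq.heappop(q)  # second smallest
    heapq.heappush(q, x + y)
    merges.append((x, y, x + y))
    print(f"merge {x} + {y} -> {x + y}; queue now {sorted(q)}")
```

After five merges (|C| - 1 = 5 for six characters) the queue holds a single value, the total frequency, which corresponds to the root of the tree.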