Thursday, August 22, 2013

Data Compression Techniques - 1

Hi Everyone, My post today is about Data Compression Techniques, I’ve decided to post about Data Compression because It’s really widely used everywhere around us in many applications that we are using everyday… I really like this CS topic trying everyday to know more about it…
Today I’ll be talking about:
  1. Types/ Techniques of Data Compression
  2. Lossless Data Compression
    1. Huffman Coding
    2. LZC Coding
  3. Lossy Data Compression
    1. DCT Coding
Let’s Start now;
First, Simply There is 2 main techniques in data compression:
  1. Lossless Data Compression
  2. Lossy Data Compression
1. Lossless Data Compression: Where no Loss of info –> The compressed data can be decoded into exactly the original data.
2. Lossy Compression: Because sometimes we need to save size and time we are using this technique; Lossy Compression is not accurate 100%, This Compression technique is used in analog ranged data such as videos, JPEG encoding and sounds.
and now let’s talk about the most known fast Lossless Data Compression Algorithms that are used nowadays: Huffman and Lempel-ZIV coding … let’s take a look on them

Examples of Lossless Data Compression Algorithms

1-Huffman Algorithm:
A Huffman code is designed by merging together the two least probable characters, and repeating this process until there is only one character remaining. A code tree is thus generated and the Huffman code is obtained from the labeling of the code tree. An example of how this is done is shown below.
Huffman Design Animation
The final static code tree is given below:
Huffman Design Animation

  • It does not matter how the characters are arranged. The Tree have been arranged above so that the final code tree looks nice and neat.
  • It does not matter how the final code tree are labeled (with 0s and 1s). we have chosen to label the upper branches with 0s and the lower branches with 1s.
  • There may be cases where there is a tie for the two least probable characters. In such cases, any tie-breaking procedure is acceptable.
  • Huffman codes are not unique.
  • Huffman codes are optimal in the sense that no other lossless fixed-to-variable length code has a lower average rate.
  • The rate of the above code is 2.94 bits/character. [1]
  • The entropy lower bound is 2.88 bits/character. [2]

  • 2-Lempel-Ziv Coding (LZ Coding):
    The basic idea is to parse the input sequence into non-overlapping blocks of different lengths while constructing a dictionary of blocks seen thus far.

    Encoding Algorithm:
    1. Initialize the dictionary to contain all blocks of length one (D={a,b}).
    2. Search for the longest block W which has appeared in the dictionary.
    3. Encode W by its index in the dictionary.
    4. Add W followed by the first symbol of the next block to the dictionary.
    5. Go to Step 2.

      The final code results are given below: Lempel-Ziv Encoding Animation
    after talking about LZ Coding … now let’s talk about one of the most known Lossy Data Compression Algorithms : The Discrete Cosine Transform (DCT) used in JPEG and MP3 Compressions

    Example of Lossy Data Compression Algorithms

    1-Discrete Cosine Transform (DCT) Algorithm:
    expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio (e.g. MP3) and images (e.g. JPEG) (where small high-frequency components can be discarded) (Wikipedia)
    you can take a look on this Demo Java Applet that applies the DCT Algorithm:
    DCT Simulation tool:
    Consider this 8x8 grayscale image of capital letter A.

    Original size, scaled 10x (nearest neighbor), scaled 10x (bilinear).
    DCT of the image.
6.1917 & -0.3411 & 1.2418  &  0.1492  &  0.1583  &  0.2742 &  -0.0724  &  0.0561 \\
0.2205 & 0.0214 & 0.4503  &  0.3947  & -0.7846 &  -0.4391  &  0.1001  & -0.2554 \\
1.0423 & 0.2214 & -1.0017 &  -0.2720  &  0.0789 &  -0.1952  &  0.2801  &  0.4713 \\
-0.2340 & -0.0392 & -0.2617 &  -0.2866 &   0.6351 &   0.3501 &  -0.1433  &  0.3550 \\
0.2750 & 0.0226 & 0.1229  &  0.2183  & -0.2583  & -0.0742  & -0.2042  & -0.5906 \\
0.0653 & 0.0428 & -0.4721 &  -0.2905  &  0.4745  &  0.2875  & -0.0284  & -0.1311 \\
0.3169 & 0.0541 & -0.1033 &  -0.0225  & -0.0056  &  0.1017  & -0.1650 &  -0.1500 \\
-0.2970 & -0.0627 & 0.1960 &   0.0644  & -0.1136 &  -0.1031 &   0.1887  &  0.1444 \\

    Basis functions of the discrete cosine transformation with corresponding coefficients (specific for our image).
    Each basis function is multiplied by its coefficient and then this product is added to the final image.

    On the left is final image. In the middle is weighted function (multiplied by coefficient) which is added to the final image. On the right is the current function and corresponding coefficient. Images are scaled (using bilinear interpolation) by factor 10x.
    you can know more about DCT here :

    [1] Compression Ratio: It’s the ration between the bits of data before compression and the Bits of data after compression ( Original Bits:Compressed Bits)
    if we have a file with an original data size of 65,535 bytes .. This file became 16,384 bytes after applying some data compression algorithms on it. we can say that the compression ratio is 65535:16384 which is approximately 4:1 or we can just say now that the file is 75% compressed (The total size is 100% and the compressed size is 25% of the total);
    Now if we have 8 bits per each byte we can say that we are representing each byte by just 2 bits (25 %) or 2 bits per byte or 2 bits/character

    The Difference between the Original data and the Reconstruction data is called “Distortion”  also “Fidelity” and “Quality”.

    [2] The entropy rate of a source: is a number which depends only on the statistical nature of the source. If the source has a simple model, then this number can be easily calculated. Here, we consider an arbitrary source:
    while paying special attention to the case where is English text.
    Zero-Order Model: The characters are statistically independent of each other and every letter of the alphabet,, are equally likely to occur. Let be the size of the alphabet. In this case, the entropy rate is given by

    For English text, the alphabet size is m=27. Thus, if this had been an accurate model for English text, then the entropy rate would have been H=log2 27=4.75 bits/character.

    1-Khalid Sayoud Introduction to Data Compression 2nd Edition, 2006.
    2-A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
    3-D. A. Huffman, ``A Method for the Construction of Minimum Redundancy Codes,'' Proceedings of the IRE, Vol. 40, pp. 1098--1101, 1952.
    4-J. Ziv and A. Lempel, ``A Universal Algorithm for Sequential Data Compression,'' IEEE Transactions on Information Theory, Vol. 23, pp. 337--342, 1977.
    5-J. Ziv and A. Lempel, ``Compression of Individual Sequences Via Variable-Rate Coding,'' IEEE Transactions on Information Theory, Vol. 24, pp. 530--536, 1978.
    6-T. A. Welch, ``A Technique for High-Performance Data Compression,'' Computer, pp. 8--18, 1984.
    7.C. E. Shannon, ``Prediction and Entropy of Printed English," available in Shannon: Collected Papers.

    Hope you enjoyed this Data Compression Introduction and Thanks for Reading;  Smile
    For More Info Please Visit Data Compression - Debra A. Lelewer and Daniel S. Hirschberg

    Search This Blog