Compression Performance Evaluation Report
0. Purposes:
· To evaluate bzip2 compression on real NASA data by comparing its performance with that of gzip compression
· To assess the feasibility of integrating bzip2 into HDF5
1. What’s bzip2?
Burrows-Wheeler block-sorting text compression [1] + Huffman coding [2]
1) Utility:
· Compression: bzip2 original_name
· Decompression: bunzip2 original_name.bz2 or bzip2 -d original_name.bz2
· Error detection: a 32-bit CRC, which only tells you that something is wrong
· Compression levels 1-9 (like gzip: from fastest to best)
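The utility behaviour listed above can be approximated with Python's standard bz2 module, which wraps the same libbz2 library. The sketch below is an illustration only and was not part of the evaluation; the sample payload and the helper names `roundtrip` and `detects_corruption` are assumptions. It compresses at a chosen level, decompresses again, and checks that a damaged stream is rejected rather than silently accepted.

```python
import bz2


def roundtrip(data: bytes, level: int = 9) -> bytes:
    """Compress at the given level (1 = fastest, 9 = best) and decompress again."""
    packed = bz2.compress(data, compresslevel=level)
    return bz2.decompress(packed)


def detects_corruption(data: bytes) -> bool:
    """Flip one byte in the middle of the compressed stream and report
    whether decompression fails. The check only signals that something
    is wrong; it does not locate or repair the damage."""
    packed = bytearray(bz2.compress(data))
    packed[len(packed) // 2] ^= 0xFF
    try:
        bz2.decompress(bytes(packed))
    except (OSError, ValueError, EOFError):
        return True
    return False


if __name__ == "__main__":
    sample = b"NASA EOS sample record " * 1000  # hypothetical payload
    assert roundtrip(sample, level=1) == sample
    print("corruption detected:", detects_corruption(sample))
```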
2) Library:
· Interfaces similar to those of the zlib library, including low-level, high-level and utility functions
· Low-level interface: thread-safe
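The similarity between the two library interfaces is also visible in Python's standard wrappers, where both libraries expose an incremental compressor object with compress() and flush() methods. The following is a minimal sketch under that assumption, not the C interface itself; the helper names and the 4096-byte piece size are arbitrary choices. Note that zlib.compressobj writes the zlib container rather than a gzip file, although the underlying deflate compression is the same.

```python
import bz2
import zlib


def stream_compress(compressor, chunks):
    """Feed chunks to an incremental compressor and collect the output."""
    out = [compressor.compress(chunk) for chunk in chunks]
    out.append(compressor.flush())
    return b"".join(out)


def split(data: bytes, size: int = 4096):
    """Yield fixed-size pieces of data (the size is an arbitrary choice)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


if __name__ == "__main__":
    payload = b"block-sorting compression " * 2000  # hypothetical input
    # The same driver serves both libraries because the compressor
    # objects share the compress()/flush() shape.
    bz_stream = stream_compress(bz2.BZ2Compressor(9), split(payload))
    gz_stream = stream_compress(zlib.compressobj(6), split(payload))
    assert bz2.decompress(bz_stream) == payload
    assert zlib.decompress(gz_stream) == payload
```

Each compressor object keeps its own state, so separate objects can be used independently of one another.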
3) Features according to the author:
· Not good for highly repetitive data
· May perform best on machines with very large caches
· Error handling is considered reliable by the author
4) Misc:
· Does not use autoconf
5) Future work according to the author:
· In the library interface, a parameter called “work factor” should be adjusted automatically by the library instead of by the user application. The author may remove this parameter in a future version of the library.
For more information, check the bzip2 web page at [3].
2. Users’ point of view of bzip2
According to one user’s email:
· Compression is slower than with gzip
· Compression ratio is 5% better than with gzip
3. Some definitions in this report
1) Compression ratio: The ratio of the compressed file size or array size to the original file size or array size. A smaller value means better compression.
2) Encoding time of the utility: Elapsed time from the start to the end of the process that compresses the file
3) Decoding time of the utility: Elapsed time from the start to the end of the process that decompresses the file
4) Encoding time of the library: Difference between the elapsed times of writing an HDF5 dataset with compression and without compression
5) Decoding time of the library: Difference between the elapsed times of reading an HDF5 dataset with compression and without compression
Note:
· The “time” utility is used to measure the encoding and decoding times of the utility
· gettimeofday is used to measure the encoding and decoding times of the library
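The definitions above can be expressed as a small measurement helper. This is a sketch under stated assumptions, not the harness used for this report: it uses Python's time.perf_counter in place of the “time” utility and gettimeofday, it runs on in-memory buffers instead of files or HDF5 datasets, and the names `measure` and `library_time` are hypothetical.

```python
import bz2
import time
import zlib


def measure(compress, decompress, data: bytes) -> dict:
    """Return the compression ratio and elapsed encode/decode times in seconds."""
    start = time.perf_counter()
    packed = compress(data)
    encode_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decode_s = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip did not reproduce the input")
    return {
        "ratio": len(packed) / len(data),  # compressed size / original size
        "encode_s": encode_s,
        "decode_s": decode_s,
    }


def library_time(with_compression_s: float, without_compression_s: float) -> float:
    """Library encoding or decoding time as defined above: elapsed time
    with compression minus elapsed time without compression."""
    return with_compression_s - without_compression_s


if __name__ == "__main__":
    data = bytes(i % 251 for i in range(200_000))  # synthetic stand-in data
    for name, enc, dec in [
        ("bzip2 L9", lambda b: bz2.compress(b, 9), bz2.decompress),
        ("gzip  L6", lambda b: zlib.compress(b, 6), zlib.decompress),
    ]:
        print(name, measure(enc, dec, data))
```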
4. Data
1) NASA data with HDF4 to HDF5 converter utility
We use semi-real NASA data for the performance analyses. Since the current NASA EOS data are all stored in HDF4 format, we use the NCSA H4toH5 converter utility [4] to convert all NASA data from HDF4 format to HDF5 format before running the analyses in HDF5. A rough comparison of file sizes between the converted HDF5 files and the original HDF4 files indicates that the converted HDF5 files are reliable.
2) Detailed data information
From about 30 real data samples, we chose 10. These data are SSMI, CERES, TOMS, TRIM, MODIS, MISR, ASTER and LANDSAT products. The file sizes and data types are listed in Table 1.
Table 1: File information of the experiment
| File name | File size (MB) | Data type |
| SSMI | 2.02737 | Unsigned 8-bit big-endian integer |
| TOMS | 6.093707 | Unsigned 16-bit big-endian integer |
| TRIM | 13.69254 | Unsigned 16-bit big-endian integer |
| CERES1 | 22.77592 | IEEE 32-bit float |
| MISR | 70.01059 | Unsigned 16-bit big-endian integer (most); IEEE 32-bit float (least) |
| CERES2 | 72.66951 | IEEE 32-bit float |
| ASTER2 | 74.94336 | Unsigned 16-bit big-endian integer (most); unsigned 8-bit big-endian integer (least) |
| ASTER1 | 118.6585 | Unsigned 8-bit big-endian integer (most); unsigned 16-bit big-endian integer (least) |
| MODIS1 | 262.343 | Unsigned 16-bit big-endian integer (most); unsigned 8-bit big-endian integer (middle); IEEE 32-bit float (least) |
| LANDSAT | 561.8911 | Unsigned 8-bit big-endian integer |
5. Utility performance analysis result
· Platform independence (SGI O2K, Windows 2000, Linux 2.2.18, Solaris 2.7): the results are strongly similar across all four platforms. For elapsed encoding and decoding times, Windows is the fastest and SGI O2K is the slowest. The compression ratio is exactly the same on every platform, as it should be: not even one byte should differ. In the following, only charts from the Linux runs are used to show typical results.
· Bzip2 always gives a better compression ratio than gzip; the improvement ranges from 0.1% to almost 20%.
· Bzip2 almost always takes longer to encode and decode the data. Its decoding time in particular is much longer for all data samples.
· The compression ratio improves as the compression level increases for both compression packages. However, the compression ratio is not very sensitive to the compression level for either bzip2 or gzip.
· Decoding time is not sensitive to the compression level for either gzip or bzip2, which is the behavior expected in theory.
· Encoding time grows as the compression level increases for both compression packages, and it is more sensitive to the level for gzip than for bzip2. In fact, gzip level 9 takes even longer to encode the MODIS file than bzip2 level 9.
· Outlier: floating-point data (CERES data) show a poor compression ratio and poor encoding and decoding times for both bzip2 and gzip. Bzip2 gains little in compression ratio on floating-point data, yet it takes much longer to decode.
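A level sweep of the kind summarized in the list above can be reproduced in miniature with Python's standard bz2 and zlib modules. This is a sketch on synthetic stand-in arrays, not the NASA samples, so its numbers are not comparable with the results in this report; the function name `sweep_levels` and the generated data are assumptions. One point worth keeping in mind when reading such a sweep is that the bzip2 level selects the block size in units of 100 kB, so levels cannot differ on inputs smaller than one block.

```python
import bz2
import struct
import zlib


def sweep_levels(data: bytes, levels=(1, 6, 9)) -> list:
    """Return (codec, level, ratio) tuples, where ratio = compressed / original."""
    rows = []
    for level in levels:
        rows.append(("bzip2", level, len(bz2.compress(data, level)) / len(data)))
        rows.append(("gzip", level, len(zlib.compress(data, level)) / len(data)))
    return rows


if __name__ == "__main__":
    n = 400_000
    # Synthetic stand-ins: a slow big-endian 16-bit ramp and big-endian
    # 32-bit floats. Neither is taken from the evaluated files.
    ints = struct.pack(f">{n}H", *[(i // 7) % 65536 for i in range(n)])
    floats = struct.pack(f">{n}f", *[i * 0.37 for i in range(n)])
    for label, blob in (("uint16", ints), ("float32", floats)):
        for codec, level, ratio in sweep_levels(blob):
            print(f"{label:8s} {codec:6s} L{level}: {ratio:.4f}")
```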
The following six figures will show comparisons of compression ratio, encoding time and decoding time in detail.






6. Performance comparison with compression library calls
1) Working procedure
· A user-provided bzip2 filter is integrated with the HDF5 library to compare the performance of bzip2 and gzip.
· Based on the utility performance comparison, we selected three arrays with different datatypes: float, 16-bit integer and 8-bit integer.
· We measure the compression ratio and the encoding and decoding times of the library.
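The general shape of such a filter can be sketched as follows. This is not the user-provided filter evaluated here and it does not use the HDF5 API; it only illustrates the idea of a single callback that handles both directions and is applied to one chunk of a dataset at a time. The names `bzip2_filter`, `write_chunks` and `read_chunks`, and the chunk size, are hypothetical.

```python
import bz2


def bzip2_filter(chunk: bytes, decode: bool, level: int = 9) -> bytes:
    """Hypothetical filter callback: one function serves both directions,
    selected by a flag, and operates on a single chunk buffer."""
    if decode:
        return bz2.decompress(chunk)
    return bz2.compress(chunk, level)


def write_chunks(array_bytes: bytes, chunk_size: int, level: int = 9) -> list:
    """Split a dataset into chunks and pass each one through the filter."""
    return [
        bzip2_filter(array_bytes[i:i + chunk_size], decode=False, level=level)
        for i in range(0, len(array_bytes), chunk_size)
    ]


def read_chunks(stored: list) -> bytes:
    """Reverse the filter on every stored chunk and reassemble the dataset."""
    return b"".join(bzip2_filter(chunk, decode=True) for chunk in stored)


if __name__ == "__main__":
    dataset = bytes(range(256)) * 400  # synthetic stand-in for an array
    stored = write_chunks(dataset, chunk_size=16384)
    assert read_chunks(stored) == dataset
    print(len(stored), "chunks,", sum(map(len, stored)), "bytes stored")
```

Because every chunk is compressed independently, the chunk size chosen by the application affects both the compression ratio and the timing.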
2) Tables and charts
The following tables show the performance results for the three arrays. For comparison and consistency checking between the compression libraries and utilities, the compression ratio and the encoding and decoding times of the three corresponding files are also included afterwards.
i) Unsigned 8-bit integer
Data source: ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) data on Terra
File name: ast1.h5
Array size: 22908000 bytes
Data type: unsigned char
Array dimensions: 4600 * 4980
| | Compression ratio | Encoding time (second) | Decoding time (second) |
| Bzip2 level 1 | 0.441 | 21.6 | 15.87 |
| Bzip2 level 6 | 0.428 | 22.2 | 21.3 |
| Bzip2 level 9 (default) | 0.426 | 22.7 | 22.24 |
| Gzip level 1 | 0.593 | 9.43 | 1.95 |
| Gzip level 6 (default) | 0.586 | 14.86 | 1.86 |
| Gzip level 9 | 0.585 | 18.08 | 1.86 |
ii) Unsigned 16-bit integer
Data source: ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) data on Terra
File name: ast2.h5
Array size: 10458000 bytes
Data type: unsigned short
Array dimensions: 2100 * 2490
| | Compression ratio | Encoding time (second) | Decoding time (second) |
| Bzip2 level 1 | 0.1171 | 9.27 | 5.08 |
| Bzip2 level 6 | 0.1141 | 9.76 | 7.44 |
| Bzip2 level 9 (default) | 0.1136 | 10.43 | 8.01 |
| Gzip level 1 | 0.2014 | 2.03 | 0.53 |
| Gzip level 6 (default) | 0.1656 | 7.9 | 0.44 |
| Gzip level 9 | 0.1647 | 48.09 | 0.44 |
iii) 32-bit float
Data source: CERES (Clouds and the Earth’s Radiant Energy System)
File name: ceres2.h5
Array size: 5359200 bytes
Data type: float
Array dimensions: 2030 * 660
| | Compression ratio | Encoding time (second) | Decoding time (second) |
| Bzip2 level 1 | 0.4570 | 19.61 | 4.14 |
| Bzip2 level 6 | 0.4545 | 25.23 | 5.08 |
| Bzip2 level 9 (default) | 0.4519 | 26.47 | 5.42 |
| Gzip level 1 | 0.4801 | 1.95 | 0.36 |
| Gzip level 6 (default) | 0.4703 | 2.6 | 0.34 |
| Gzip level 9 | 0.4700 | 3.38 | 0.34 |



3) Library analysis results:
i. The library analysis results are qualitatively consistent with the utility analysis results.
ii. The overhead of calling the compression library from inside the HDF5 library is acceptable.
The tables below compare the encoding throughput (size per second) of the library and the utility, where
Relative efficiency = (Library compression size/second) / (Utility compression size/second)
According to the tables, the relative efficiencies of both libraries are above 70%.
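The relative efficiency can be checked directly from the reported figures. The short sketch below recomputes it for the bzip2 level 9 rows, using the sizes, encoding times and utility throughputs given in Tables 2 to 4; the function names are illustrative only.

```python
def throughput(size_bytes: int, seconds: float) -> float:
    """Encoding throughput in bytes per second."""
    return size_bytes / seconds


def relative_efficiency(library_rate: float, utility_rate: float) -> float:
    """Library throughput divided by utility throughput."""
    return library_rate / utility_rate


if __name__ == "__main__":
    # (file, array size in bytes, library encoding time in seconds,
    #  utility throughput in byte/s), all taken from Tables 2 to 4.
    bzip2_rows = [
        ("ceres2.h5", 5359200, 26.47, 275995.2),
        ("ast2.h5", 10458000, 10.43, 1245583),
        ("ast1.h5", 22908000, 22.7, 1080807),
    ]
    for name, size, seconds, utility_rate in bzip2_rows:
        rate = throughput(size, seconds)
        eff = relative_efficiency(rate, utility_rate)
        print(f"{name}: {rate:.1f} byte/s, relative efficiency {eff:.6f}")
```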
Table 2: Relative encoding time comparison between library and utility for CERES data
Data source: ceres2.h5
| | Size (byte) | Encoding time (second) | Encoding size/second (byte/s) | Relative efficiency |
| Bzip2 library (L9) | 5359200 | 26.47 | 202463.2 | 0.733575 |
| Bzip2 utility (L9) | | 276.09 | 275995.2 | 1 |
| Gzip library (L6) | 5359200 | 2.6 | 2061231 | 0.851007 |
| Gzip utility (L6) | | 31.46 | 2422108 | 1 |
Table 3: Relative encoding time comparison between library and utility for ASTER data I
Data source: ast2.h5
| | Size (byte) | Encoding time (second) | Encoding size/second (byte/s) | Relative efficiency |
| Bzip2 library (L9) | 10458000 | 10.43 | 1002685 | 0.804992 |
| Bzip2 utility (L9) | | 63.09 | 1245583 | 1 |
| Gzip library (L6) | 10458000 | 7.9 | 1323797 | 0.791747 |
| Gzip utility (L6) | | 47 | 1671996 | 1 |
Table 4: Relative encoding time comparison between library and utility for ASTER data II
Data source: ast1.h5
| | Size (byte) | Encoding time (second) | Encoding size/second (byte/s) | Relative efficiency |
| Bzip2 library (L9) | 22908000 | 22.7 | 1009163 | 0.933713 |
| Bzip2 utility (L9) | | 115.12 | 1080807 | 1 |
| Gzip library (L6) | 22908000 | 14.86 | 1541588 | 0.886625 |
| Gzip utility (L6) | | 71.56 | 1738715 | 1 |
7. Concluding remarks and suggestions
Based on analyses of a very limited set of samples, we find:
· Bzip2 always achieves a better compression ratio than gzip.
· Bzip2 almost always takes longer to process the data than gzip, especially for decoding.
· Neither compression package works well on floating-point data.
· The compression ratio is not very sensitive to the compression level for either compression package.
· Encoding time grows as the compression level increases for both compression packages, and it is more sensitive to the level for gzip than for bzip2.
· Overall, the library analysis is consistent with the utility analysis.
· Integrating bzip2 into HDF5 is not hard, but the maintenance effort cannot be ignored.
Suggestions:
· Don’t use bzip2 for floating point data if you don’t have to.
· If you care about compression ratio more than anything else, you may consider using bzip2.
· If you care about decoding time more than anything else, you may choose to use gzip.
8. What’s left?
· No configuration integration with the HDF5 tests and tools
· No proper comments have been added to the user-provided bzip2 filter
· No tests on other platforms
· No implementation of the bzip2 filter in user applications
9. References:
1. Burrows, M. and Wheeler, D.J., 1994. “A block-sorting lossless data compression algorithm,” Digital SRC Research Report 124. ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps
2. Huffman, D.A., 1952. “A method for the construction of minimum redundancy codes,” Proceedings of the IRE, Volume 40, Number 9, pages 1098-1101.
3. bzip2 URL: http://sources.redhat.com/bzip2/
4. h4toh5 utility URL: http://hdf.ncsa.uiuc.edu/h4toh5/