Compression Performance Evaluation Report

 

0. Purposes:

·        To evaluate bzip2 compression on real NASA data by comparing its performance with that of gzip compression

·        To check the possibility of integrating bzip2 into HDF5

 

 

1. What’s bzip2?

 

Burrows-Wheeler block-sorting text compression [1] + Huffman coding [2]

 

1) Utility:

·        Compression: bzip2 original_name

·        Decompression: bunzip2 original_name.bz2 or bzip2 -d original_name.bz2

·        Error detection: 32-bit CRC; it can only tell you that something is wrong

·        Compression level 1-9 (as with gzip: from fastest to best compression)

 

2) Library:

·        Interfaces are similar to those of the zlib library, including low-level, high-level and utility functions

·        Low-level interface: thread-safe
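
The similarity between the two library interfaces can be illustrated with a minimal sketch. It uses Python's standard bz2 and zlib modules (which wrap libbzip2 and zlib) rather than the C interfaces discussed in this report. The helper name stream_compress and the sample data are hypothetical and chosen only for illustration. Note that zlib.compressobj produces a zlib stream, which uses the same deflate algorithm as gzip.

```python
import bz2
import zlib


def stream_compress(compressor, chunks):
    """Feed chunks to an incremental compressor and return the compressed bytes."""
    out = [compressor.compress(chunk) for chunk in chunks]
    out.append(compressor.flush())  # emit any data still buffered
    return b"".join(out)


# Hypothetical sample data, split into chunks as a streaming caller would see it.
chunks = [bytes([i % 7]) * 1000 for i in range(50)]
original = b"".join(chunks)

# Both modules expose the same compress()/flush() pattern; the level is 1-9.
bz_bytes = stream_compress(bz2.BZ2Compressor(9), chunks)
gz_bytes = stream_compress(zlib.compressobj(6), chunks)

# Round-trip check: decompression must reproduce the input exactly.
assert bz2.decompress(bz_bytes) == original
assert zlib.decompress(gz_bytes) == original
```

Because the two compressor objects share the same calling pattern, a caller can switch between them by changing only the constructor.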

 

3) Features according to the author:

·        Not good for highly repetitive data

·        May perform best on machines with very large caches

·        The author is confident about the error handling

 

4) Misc:

·        Does not use autoconf

 

5) Future work according to the author:

·        In the library interface, a parameter called “working factor” currently has to be set by the user application; it should instead be adjusted automatically by the library. The author may remove this parameter in a future version of the library.

 

For more information, check the bzip2 web page at [3].

2. Users’ point of view of bzip2

 

 

According to one user’s email:

·        Compression is slower than with gzip

·        5% better compression ratio than gzip

 

 

 

3. Some definitions in this report

 

 

1) Compression ratio: The compressed file size or array size relative to the original file size or array size. The tables in this report give it as a fraction (for example, 0.441), so a smaller value means better compression.

2) Encoding time of the utility: Elapsed time from the start of the file compression process until its end

3) Decoding time of the utility: Elapsed time from the start of the file decompression process until its end

4) Encoding time of the library: Difference in elapsed time between writing an HDF5 dataset with compression and writing it without compression

5) Decoding time of the library: Difference in elapsed time between reading an HDF5 dataset with compression and reading it without compression

     Note:

·        The “time” utility is used to measure the encoding time and decoding time of the utility

·        gettimeofday is used to measure the encoding time and decoding time of the library
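
The definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's bz2 and zlib modules and time.perf_counter instead of the “time” utility and gettimeofday used for this report, it works on an in-memory buffer rather than an HDF5 dataset, and the function name measure and the sample data are hypothetical.

```python
import bz2
import time
import zlib


def measure(compress, decompress, data):
    """Return (compression ratio, encoding seconds, decoding seconds)."""
    start = time.perf_counter()
    packed = compress(data)
    encode_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decode_s = time.perf_counter() - start

    assert unpacked == data  # lossless round trip
    # Ratio as defined above: compressed size relative to original size.
    return len(packed) / len(data), encode_s, decode_s


# Hypothetical stand-in for an array read from a data file.
data = bytes((i * i) % 251 for i in range(200_000))

for name, comp, decomp in [
    ("bzip2 level 9", lambda d: bz2.compress(d, 9), bz2.decompress),
    ("gzip level 6", lambda d: zlib.compress(d, 6), zlib.decompress),
]:
    ratio, enc, dec = measure(comp, decomp, data)
    print(f"{name}: ratio={ratio:.4f} encode={enc:.3f}s decode={dec:.3f}s")
```

The library timings in this report are differences between runs with and without compression. That subtraction is not reproduced here, since the sketch involves no file I/O.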


 

4. Data

 

1) NASA data with HDF4 to HDF5 converter utility

 

We use semi-real NASA data for the performance analyses. Because current NASA EOS data are all stored in HDF4 format, we use the NCSA H4toH5 converter utility [4] to convert all NASA data from HDF4 to HDF5 before analyzing performance in HDF5. A rough comparison of file sizes between the converted HDF5 files and the original HDF4 files indicates that the converted files are reliable.

 

2) Detailed data information

 

From about 30 real data samples, we chose 10. They are SSMI, CERES, TOMS, TRIM, MODIS, MISR, ASTER and LANDSAT products. The file sizes and data types are listed in Table 1.

 

Table 1: File information of the experiment

File name    File size (MB)    Data type
---------    --------------    ---------
SSMI         2.02737           Unsigned 8-bit big-endian integer
TOMS         6.093707          Unsigned 16-bit big-endian integer
TRIM         13.69254          Unsigned 16-bit big-endian integer
CERES1       22.77592          IEEE 32-bit float
MISR         70.01059          Unsigned 16-bit big-endian integer (most), IEEE 32-bit float (least)
CERES2       72.66951          IEEE 32-bit float
ASTER2       74.94336          Unsigned 16-bit big-endian integer (most), Unsigned 8-bit big-endian integer (least)
ASTER1       118.6585          Unsigned 8-bit big-endian integer (most), Unsigned 16-bit big-endian integer (least)
MODIS1       262.343           Unsigned 16-bit big-endian integer (most), Unsigned 8-bit big-endian integer (middle), IEEE 32-bit float (least)
LANDSAT      561.8911          Unsigned 8-bit big-endian integer

 

 


 

 

5. Utility performance analysis result

 

 

·        Platform independence (SGI O2K, Windows 2000, Linux 2.2.18, Solaris 2.7): the results are very similar on all four platforms. For elapsed encoding and decoding times, Windows is the fastest and SGI O2K is the slowest. The compression ratio is exactly the same on every platform, as it should be (not even one byte should differ). In the following, only charts from the Linux runs are used to show typical results.

 

·        Bzip2 always gives a better compression ratio than gzip; the improvement ranges from 0.1% to almost 20%.

 

·        Bzip2 almost always takes longer to encode and decode the data; the decoding time in particular is much longer for all data samples.

 

·        The compression ratio improves as the compression level increases for both compression packages. However, the compression ratio is not very sensitive to the compression level for either bzip2 or gzip.

 

·        Decoding time is not sensitive to the compression level for either gzip or bzip2, which is the behavior expected from theory.

 

·        Encoding time gets longer as the compression level increases for both compression packages. Encoding time is more sensitive to the level for gzip than for bzip2. In fact, the gzip level 9 encoding time for the MODIS file is even longer than the bzip2 level 9 encoding time for the same file.

 

·        Outlier: floating-point data (CERES data) show a poor compression ratio and poor encoding and decoding times for both bzip2 and gzip. Bzip2 gains little in compression ratio on floating-point data, yet it takes much longer to decode.
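
The contrast between integer and floating-point data can be explored with a small sketch. It does not reproduce the CERES measurements: the two arrays below are synthetic and hypothetical (a slowly varying 16-bit signal and a noisy 32-bit float signal), and Python's bz2 and zlib modules stand in for the bzip2 and gzip utilities.

```python
import bz2
import math
import random
import struct
import zlib

random.seed(0)  # reproducible synthetic data
n = 50_000

# Synthetic, hypothetical samples, both packed big-endian.
ints = [(i // 16) % 65536 for i in range(n)]
floats = [math.sin(i * 0.001) + random.gauss(0.0, 0.01) for i in range(n)]
samples = {
    "uint16": struct.pack(f">{n}H", *ints),
    "float32": struct.pack(f">{n}f", *floats),
}

# Compression ratio = compressed size / original size (smaller is better).
ratios = {}
for name, raw in samples.items():
    ratios[name] = {
        "bzip2": len(bz2.compress(raw, 9)) / len(raw),
        "gzip": len(zlib.compress(raw, 6)) / len(raw),
    }
    print(name, ratios[name])
```

On data like this, the float array should compress noticeably worse with both packages, plausibly because the low-order mantissa bytes of noisy floats look nearly random to a byte-oriented compressor. Real instrument data may behave differently.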

 

 

The following six figures will show comparisons of compression ratio, encoding time and decoding time in detail.

6. Performance comparison with compression library calls

 

 

1) Working procedure

·        A user-provided bzip2 filter is integrated with the HDF5 library so that bzip2 and gzip can be compared through library calls.

·        Based on the utility performance comparison, we selected three arrays with different datatypes: float, 16-bit integer and 8-bit integer.

·        For each array we calculate the compression ratio, the encoding time of the library and the decoding time of the library.

 

 

2) Tables and charts

The following tables show the performance results for the three arrays. To allow comparison and consistency checking between the compression libraries and the utilities, the compression ratio, encoding time and decoding time of the three corresponding files are also included afterwards.

 

 

 

i) Unsigned 8-bit integer

 

Data source: ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) data on Terra

File name: ast1.h5

Array size: 22908000 bytes

Data type: unsigned char

Array dimensions: 4600 * 4980

 

 

Setting                    Compression ratio    Encoding time (second)    Decoding time (second)
-----------------------    -----------------    ----------------------    ----------------------
Bzip2 level 1              0.441                21.6                      15.87
Bzip2 level 6              0.428                22.2                      21.3
Bzip2 level 9 (default)    0.426                22.7                      22.24

Gzip level 1               0.593                9.43                      1.95
Gzip level 6 (default)     0.586                14.86                     1.86
Gzip level 9               0.585                18.08                     1.86

 


 

ii) Unsigned 16-bit integer

 

Data source: ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) data on Terra

File name: ast2.h5

Array size: 10458000 bytes

Data type: unsigned short

Array dimensions: 2100 * 2490

 

 

 

Setting                    Compression ratio    Encoding time (second)    Decoding time (second)
-----------------------    -----------------    ----------------------    ----------------------
Bzip2 level 1              0.1171               9.27                      5.08
Bzip2 level 6              0.1141               9.76                      7.44
Bzip2 level 9 (default)    0.1136               10.43                     8.01

Gzip level 1               0.2014               2.03                      0.53
Gzip level 6 (default)     0.1656               7.9                       0.44
Gzip level 9               0.1647               48.09                     0.44

 

 

iii) 32-bit float

 

Data source: CERES (Clouds and the Earth’s Radiant Energy System)

File name: ceres2.h5

Array size: 5359200 bytes

Data type: float

Array dimensions: 2030*660

 

 

 

Setting                    Compression ratio    Encoding time (second)    Decoding time (second)
-----------------------    -----------------    ----------------------    ----------------------
Bzip2 level 1              0.4570               19.61                     4.14
Bzip2 level 6              0.4545               25.23                     5.08
Bzip2 level 9 (default)    0.4519               26.47                     5.42

Gzip level 1               0.4801               1.95                      0.36
Gzip level 6 (default)     0.4703               2.6                       0.34
Gzip level 9               0.4700               3.38                      0.34

 


3) Library analysis results:

 

i. The library analysis results are qualitatively consistent with the utility analysis results.

ii. The overhead of calling the compression library inside the HDF5 library is acceptable.

The tables below compare the encoding throughput (size/second) of the library with that of the utility:

 

Relative efficiency = (Library compression size/second) / (Utility compression size/second)

 

According to the tables, the relative efficiency of both libraries is above 70% in every case.
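
As a worked check of the relative efficiency definition, the sketch below recomputes two values from the sizes and encoding times listed in Table 2. The function names throughput and relative_efficiency are hypothetical helpers introduced only for this illustration.

```python
def throughput(size_bytes, seconds):
    """Bytes processed per second of encoding time."""
    return size_bytes / seconds


def relative_efficiency(lib_size, lib_seconds, util_size, util_seconds):
    """Library throughput divided by utility throughput."""
    return throughput(lib_size, lib_seconds) / throughput(util_size, util_seconds)


# bzip2 level 9 figures for ceres2.h5, copied from Table 2.
bzip2_eff = relative_efficiency(5359200, 26.47, 76199508, 276.09)
# gzip level 6 figures for ceres2.h5, copied from Table 2.
gzip_eff = relative_efficiency(5359200, 2.6, 76199508, 31.46)

print(f"bzip2: {bzip2_eff:.4f}")  # Table 2 reports 0.733575
print(f"gzip:  {gzip_eff:.4f}")   # Table 2 reports 0.851007
```

The same two-step calculation (size divided by time, then library divided by utility) applies to the ASTER values in Tables 3 and 4.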

 

 

 

Table 2: Relative encoding time comparison between library and utility for CERES data

Data source: ceres2.h5

Package               Size (byte)    Encoding time (second)    Encoding size/second (byte/s)    Relative efficiency
------------------    -----------    ----------------------    -----------------------------    -------------------
Bzip2 Library (L9)    5359200        26.47                     202463.2                         0.733575
Bzip2 Utility (L9)    76199508       276.09                    275995.2                         1
Gzip Library (L6)     5359200        2.6                       2061231                          0.851007
Gzip Utility (L6)     76199508       31.46                     2422108                          1

 


 

 

 

Table 3: Relative encoding time comparison between library and utility for ASTER data I

Data source: ast2.h5

Package               Size (byte)    Encoding time (second)    Encoding size/second (byte/s)    Relative efficiency
------------------    -----------    ----------------------    -----------------------------    -------------------
Bzip2 Library (L9)    10458000       10.43                     1002685                          0.804992
Bzip2 Utility (L9)    78583805       63.09                     1245583                          1
Gzip Library (L6)     10458000       7.9                       1323797                          0.791747
Gzip Utility (L6)     78583805       47                        1671996                          1

 

 

 

 

Table 4: Relative encoding time comparison between library and utility for ASTER data II

Data source: ast1.h5

Package               Size (byte)    Encoding time (second)    Encoding size/second (byte/s)    Relative efficiency
------------------    -----------    ----------------------    -----------------------------    -------------------
Bzip2 Library (L9)    22908000       22.7                      1009163                          0.933713
Bzip2 Utility (L9)    124422472      115.12                    1080807                          1
Gzip Library (L6)     22908000       14.86                     1541588                          0.886625
Gzip Utility (L6)     124422472      71.56                     1738715                          1

 


 

 

7. Concluding remarks and suggestions

 

From these analyses, which are based on a very limited number of samples, we find:

·        Bzip2 always achieves a better compression ratio than gzip.

·        Bzip2 almost always takes longer to process the data than gzip, especially for decoding.

·        Neither compression package is good for floating-point data.

·        The compression ratio is not very sensitive to the compression level for either compression package.

·        Encoding time gets longer as the compression level increases for both compression packages. Encoding time is more sensitive to the level for gzip than for bzip2.

·        Overall, the library analysis is consistent with the utility analysis.

·        Integrating bzip2 into HDF5 is not hard, but the maintenance effort cannot be ignored.

 

Suggestions:

·        Don’t use bzip2 for floating point data if you don’t have to.

·        If you care about compression ratio more than anything else, you may consider using bzip2.

·        If you care about decoding time more than anything else, you may choose to use gzip.

 

8. What’s left?

 

·        The bzip2 filter is not yet integrated with the HDF5 configuration, tests and tools

·        The user-provided bzip2 filter has not been properly commented

·        No tests on other platforms

·        The bzip2 filter has not been implemented in user applications

 

 

 

 


 

9. Reference:

 

1. Burrows, M. and Wheeler, D.J., 1994. “A block-sorting lossless data compression algorithm,” Digital SRC Research Report 124. ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps

 

2. Huffman, D.A., 1952. “A method for the construction of minimum redundancy codes,” Proceedings of the IRE, Volume 40, Number 9, pages 1098-1101.

 

3. bzip2 URL: http://sources.redhat.com/bzip2/

 

4. h4toh5 utility URL: http://hdf.ncsa.uiuc.edu/h4toh5/