Revising Chunked Dataset Storage in HDF5

Quincey Koziol
The HDF Group
koziol@hdfgroup.org
Version 0.2
June 12rd, 2008
  1. This Document's Intended Audience

    This document is written for the HDF5 development team, and outside developers who are familiar with HDF5's current data model. Certain commonly used terms in HDF5 are not explained in detail, please see the HDF5 documentation at http://www.hdfgroup.org/HDF5 for more information about HDF5.

  2. Current Implementation of Indexing for Chunked HDF5 Datasets

    HDF5 datasets that are stored in "chunked" form. To retrieve elements in the dataset, the chunk containing those elements must be located and accessed. The locations of chunks for a dataset are stored in a B-tree data structure, which maps the index of the chunk to the file offset where the chunk's elements are stored. This diagram shows an example of a B-tree that maps chunk indices to file offsets for a dataset with one dimension and a chunk size of 30 elements:

    Figure 1

    The algorithms for operating on these B-trees follow the normal rules for searching, inserting and deleting records in B-trees. For example, to locate the chunk containing element 108, the following nodes/chunks are accessed: I(0) -> I(1) -> L(1) -> C(90). (This assumes that C(0) is the address of the chunk with elements 0-29, C(30) is the address of the chunk with elements 30-59, etc.)

    The record stored in the B-tree for locating the chunk (the "chunk record") stores the following information:

  3. Requirements of Index Method for Chunked HDF5 Datasets

    Data structures chosen to index chunked datasets in HDF5 must achieve certain minimum requirements:

  4. Drawbacks of Current B-tree Implementation for Indexing Chunked HDF5 Datasets

    The current (as of version 1.8.x) method of using B-trees to index chunks in chunked HDF5 datasets fulfills all the minimum requirements for chunked dataset indices. However, there are some drawbacks to the current B-tree index implementation:

  5. Improvements to Storing Chunked HDF5 Datasets

    In order to avoid the drawbacks of the current chunking algorithms, two types of changes to how chunks are stored and indexed should be made: "intrinsic" changes to chunked dataset storage (which don't involve explicitly changing the indexing method for chunks), and changes to the indexing method for looking up the address of a chunk in the file.

    The following "intrinsic" changes are suggested:

    The following changes to how chunks are indexed are suggested:

    Note that the default storage for chunked datasets must continue to be the current B-tree indexing, etc. in order to retain as much "forward" compatibility as possible. All the improvements above will only be used when the H5F_LIBVER_LATEST value is used for the "high bound" to H5Pset_libver_bounds.

  6. Other Improvements to Chunked HDF5 Datasets

    Some other improvements could be made to the internal HDF5 library algorithms for handling chunks, and are listed here:


QAK:06/12/2008