Dirfile Encodings

Introduction

Starting with GetData 0.4.0, the binary files associated with RAW fields may be encoded into alternate forms when convenient. A common use-case for encoding a dirfile is to compress the binary files to save disk space. Only data is modified by an encoding scheme. Database metadata is unaffected.

Because the Dirfile Standards do not limit the types of encodings which may be present in a dirfile, GetData can gracefully handle unknown encoding schemes. It does this by simply refusing to perform any operation which would require I/O on a binary file having an unknown encoding. In this case, I/O functions will fail with the GD_E_UNKNOWN_ENCODING error code.

The encoding scheme of a dirfile may be specified using the /ENCODING directive. An encoding scheme is local to the particular format file fragment in which the /ENCODING directive is found. (That is, only RAW fields defined in the same fragment, or in a sub-fragment not containing its own /ENCODING directive, will be assumed to have the indicated encoding.) This allows a single dirfile to have binary files which are stored using multiple encodings, by having them defined in multiple fragments.

The Encoding Framework

The GetData library provides an encoding framework which abstracts data file I/O, allowing for generic support for a wide variety of encoding schemes. Functions which (potentially) make use of the encoding framework are:

gd_nframes(), gd_eof(),
gd_getdata(),
gd_putdata(),
gd_framenum(), gd_framenum_subset(),
functions which add RAW fields: gd_add(), gd_add_raw(), gd_add_spec(),
functions which modify RAW field metadata: gd_alter_entry(), gd_alter_raw(), gd_alter_spec(), gd_move(), gd_rename(),
the fragment metadata altering functions: gd_alter_encoding(), gd_alter_endianness(), and gd_alter_frameoffset(), since they may have to translate binary files affected by the metadata changes.

Most of the encodings supported by GetData are implemented through external libraries which handle the actual file I/O and data translation. All such libraries are optional: a build of GetData which omits an external library will lack support for the associated encoding scheme. In this case, GetData will still properly identify the encoding scheme, but attempts to use GetData for file I/O via one of the above functions will fail with the GD_E_UNSUPPORTED error code.

GetData functionality for the external encodings can be put into modules which are loaded by GetData on-demand at run time. As a result, whether a given encoding is supported by the library may not be known in advance. Run-time support for encodings can be determined using the gd_encoding_support() function.

Automatic Encoding Detection

If the encoding is not explicitly specified, GetData will attempt to figure out how a particular RAW field has been encoded.

GetData discovers the encoding scheme of a particular RAW field by noting the filename extension of files associated with the field. Binary files which form an unencoded dirfile have no file extension. The file extensions used by the other encodings are noted below. Encoding discovery proceeds by searching for files with the known list of file extensions (in an unspecified order) and stopping when the first successful match is made. Because of this, when multiple data files with different, supported file extensions exist in the same directory and could legitimately be associated with the RAW field being considered, the encoding scheme discovered by GetData is not well defined.

Once the encoding has been determined for a RAW field, GetData assumes that encoding for all other RAW fields defined in the same fragment (i.e., automatic encoding detection still enforces a single encoding per fragment).
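The extension matching described above can be sketched as follows. This is only an illustration, not GetData's implementation: the table, the labels, and the function name guess_encoding() are hypothetical, and the zzip and zzslim encodings are left out because their archive is not named after the field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table pairing the file extensions named in this document
 * with a label for the encoding; it is not GetData's internal table. */
struct ext_entry { const char *ext; const char *encoding; };

static const struct ext_entry ext_table[] = {
    { ".bz2",  "bzip2" },
    { ".flac", "flac" },
    { ".gz",   "gzip" },
    { ".xz",   "lzma" },
    { ".lzma", "lzma" },
    { ".sie",  "sample-index" },
    { ".slm",  "slim" },
    { ".txt",  "text" },
};

/* Return the label implied by whatever follows the field name in
 * `filename`: "unencoded" when nothing follows it, the matching table
 * entry when a known extension follows it, and NULL when the file does
 * not belong to the field or the extension is unknown. */
const char *guess_encoding(const char *field, const char *filename)
{
    assert(field != NULL && filename != NULL);

    size_t flen = strlen(field);
    if (strncmp(filename, field, flen) != 0)
        return NULL;

    const char *ext = filename + flen;
    if (*ext == '\0')
        return "unencoded";

    for (size_t i = 0; i < sizeof ext_table / sizeof ext_table[0]; ++i)
        if (strcmp(ext, ext_table[i].ext) == 0)
            return ext_table[i].encoding;

    return NULL;
}
```

Calling guess_encoding("data", "data.gz") yields the gzip label. If both data.gz and data.bz2 were present, the result would depend on which file was examined first, which is the ambiguity noted above.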

Out-of-Place Writes

Some of the encodings listed below support writing only via out-of-place writes; that is, raw files are written in a temporary location and only moved into place when closed. As a result, writing to these encodings requires making a copy of the whole binary data file. A further side effect is that a third party usually cannot concurrently read a dirfile which is being written to using one of these encodings.

Reading from a field so encoded after writing to it will cause writing to the temporary file to be finished and then the file moved into place before the read occurs, which may take some time to do. Encodings which perform out-of-place writes are: bzip2, flac, gzip, and lzma.
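The general pattern of an out-of-place write can be sketched as below. This is a minimal illustration under stated assumptions, not GetData's code: the function name write_out_of_place() and the ".tmp" suffix are hypothetical, the sketch writes a fresh buffer rather than copying and re-encoding an existing data file, and it relies on POSIX rename() semantics for replacing an existing file.

```c
#include <assert.h>
#include <stdio.h>

/* Write `len` bytes to `final_path` out of place: the data go to a
 * temporary file first, which is moved into place only after it has
 * been completely written and closed.
 * Returns 0 on success and -1 on failure. */
int write_out_of_place(const char *final_path, const void *buf, size_t len)
{
    assert(final_path != NULL && buf != NULL);

    char tmp_path[4096];
    int k = snprintf(tmp_path, sizeof tmp_path, "%s.tmp", final_path);
    if (k < 0 || (size_t)k >= sizeof tmp_path)
        return -1;

    FILE *fp = fopen(tmp_path, "wb");
    if (fp == NULL)
        return -1;

    int ok = (fwrite(buf, 1, len, fp) == len);
    if (fclose(fp) != 0)
        ok = 0;

    /* Readers see the new contents only once the rename succeeds. */
    if (!ok || rename(tmp_path, final_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

Until the rename happens, anyone opening the final path sees the old file (or none at all), which is why concurrent reads of a dirfile being written this way are unreliable.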

Supported Encoding Schemes

In addition to raw (unencoded) data, GetData 0.9 and newer support nine other encoding schemes:

bzip2 encoding,
gzip encoding,
flac encoding,
lzma encoding,
sample-index encoding,
slim encoding,
text encoding,
zzip encoding, and
zzslim encoding.

Bzip2 Encoding

The bzip2 encoding compresses raw binary files using the bzip2 compression scheme (Burrows-Wheeler block sorting algorithm with Huffman coding), using the bz2 library developed by Julian Seward. All operations are supported by the bzip2 encoding, but writing occurs out-of-place. The file extension of the bzip2 encoding is .bz2.

FLAC Encoding

The flac encoding compresses raw binary files using the Free Lossless Audio Codec. GetData's FLAC Encoding scheme is implemented through the libFLAC reference implementation developed by Josh Coalson and the Xiph.Org Foundation. All operations are supported by the flac encoding, but writing occurs out-of-place.

The FLAC format only permits samples up to 32 bits wide, but the libFLAC reference codec can only handle samples up to 24 bits wide. GetData gets around this by slicing data that is wider than 16 bits into multiple 16-bit channels (2, 4, or 8, depending on width). For big-endian data, the most-significant 16 bits are in channel 0, the second 16 bits in channel 1, etc. For little-endian data, this is reversed, with the least-significant word in channel 0.
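The channel ordering can be sketched as follows. The function split_sample() is a hypothetical helper written for this illustration; it handles only 32-bit and 64-bit samples and ignores how the 16-bit words are then handed to libFLAC.

```c
#include <assert.h>
#include <stdint.h>

/* Slice one sample of `width_bits` (32 or 64) into 16-bit words, one
 * per channel.  For big-endian data channel 0 receives the
 * most-significant word; for little-endian data channel 0 receives the
 * least-significant word.  Returns the number of channels used. */
int split_sample(uint64_t sample, int width_bits, int big_endian,
                 uint16_t channels[4])
{
    assert(width_bits == 32 || width_bits == 64);

    int n = width_bits / 16;
    for (int i = 0; i < n; ++i) {
        /* Index of the word, counted from the least-significant end. */
        int word = big_endian ? (n - 1 - i) : i;
        channels[i] = (uint16_t)(sample >> (16 * word));
    }
    return n;
}
```

For the 32-bit sample 0x11223344, big-endian data put 0x1122 in channel 0 and 0x3344 in channel 1; little-endian data swap the two.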

The sample rate specified in the FLAC header is ignored and may be any valid value. FLAC files written by GetData use a sample rate of 1 Hz. The file extension of the flac encoding is .flac. The Ogg container format is not supported.

Gzip Encoding

The gzip encoding compresses raw binary files using the gzip compression scheme (Lempel-Ziv coding, LZ77), using the zlib library written by Jean-loup Gailly and Mark Adler. All operations are supported by the gzip encoding, but writing occurs out-of-place. The file extension of the gzip encoding is .gz.

To speed the operation of gd_nframes() and gd_eof(), the gzip encoding takes the uncompressed size of the file from the gzip footer, which contains the file's uncompressed size in bytes, modulo 2³². As a result, using a field with an (uncompressed) binary file size of 2³² bytes (4 GiB) or more as the reference field will result in a wrong number of frames being reported.
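The effect of the modulo can be sketched with a little arithmetic. The helper names are hypothetical, and the frame count assumes a simple model in which a frame occupies sample_bytes times samples_per_frame bytes with no frame offset.

```c
#include <assert.h>
#include <stdint.h>

/* Size a gzip footer can report: the true size modulo 2^32. */
uint32_t gzip_footer_size(uint64_t true_size)
{
    return (uint32_t)(true_size & 0xFFFFFFFFu);
}

/* Number of whole frames implied by the footer alone. */
uint64_t frames_from_footer(uint64_t true_size, unsigned sample_bytes,
                            unsigned samples_per_frame)
{
    assert(sample_bytes > 0 && samples_per_frame > 0);
    return gzip_footer_size(true_size)
           / ((uint64_t)sample_bytes * samples_per_frame);
}
```

A 5 GiB file of 8-byte samples, one per frame, really holds 671088640 frames, but its footer reports only 1 GiB, so the footer-based count is 134217728.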

LZMA Encoding

The lzma encoding compresses raw binary files using the Lempel-Ziv Markov Chain Algorithm (LZMA), implemented in the xz container format, using the liblzma library, part of the XZ Utils suite written by Lasse Collin, Ville Koskinen, and Igor Pavlov. All operations are supported by the lzma encoding, but writing occurs out-of-place. The file extension of the lzma encoding is .xz or .lzma.

Sample-Index Encoding

The sample-index encoding (SIE), like the text encoding, requires no external library. As a result, all builds of the library contain full support for this encoding. The sample-index encoding is a lossless compression similar to run-length encoding. It replaces contiguous stretches of data having the same value with a record consisting of a 64-bit sample number and the value of the data in that run. The sample number indicates the last sample in the uncompressed data which took the associated value. All operations are supported by the sample-index encoding.
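The record structure can be sketched in memory as follows. This is an illustration only: the struct and function names are hypothetical, samples are assumed to be numbered from zero, the data type is fixed to 32-bit integers, and the on-disk byte layout is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One record: the number of the last sample of a run, and the value
 * which the data take throughout that run. */
struct sie_record {
    uint64_t last_sample;
    int32_t  value;
};

/* Encode `n` samples into `out`, which must have room for up to `n`
 * records.  Returns the number of records written. */
size_t sie_encode(const int32_t *data, size_t n, struct sie_record *out)
{
    assert(n == 0 || (data != NULL && out != NULL));

    size_t nrec = 0;
    for (size_t i = 0; i < n; ++i) {
        /* A run ends at the final sample, or where the next value differs. */
        if (i + 1 == n || data[i + 1] != data[i]) {
            out[nrec].last_sample = i;
            out[nrec].value = data[i];
            ++nrec;
        }
    }
    return nrec;
}
```

Under these assumptions, the six samples 5, 5, 5, 7, 7, 9 become three records: (2, 5), (4, 7), and (5, 9).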

All runs, even those of length one, are replaced by such a record. As a result, sample-index encoding only produces a reduction in the size of a file when the data contain long stretches of identical values, in which case it provides significant access speed improvements over the more complex compression encodings. Specifically, given an n-sample long data stream of s-byte-wide values which change value m times, sample-index encoding will reduce the size of the data on disk compared to raw data only when

$m < n / (1 + 8/s) - 1$.
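The inequality follows from comparing sizes: m changes of value give m + 1 runs, each stored as an 8-byte sample number plus an s-byte value, against n values of s bytes each for raw data. A small sketch of that comparison, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed by sample-index encoding: (m + 1) records of 8 + s bytes. */
uint64_t sie_size_bytes(uint64_t m, unsigned s)
{
    assert(s > 0);
    return (m + 1) * (8u + (uint64_t)s);
}

/* Bytes needed by the unencoded data: n samples of s bytes. */
uint64_t raw_size_bytes(uint64_t n, unsigned s)
{
    assert(s > 0);
    return n * (uint64_t)s;
}

/* Non-zero when sample-index encoding is strictly smaller than raw,
 * which is equivalent to m < n / (1 + 8/s) - 1. */
int sie_is_smaller(uint64_t n, unsigned s, uint64_t m)
{
    return sie_size_bytes(m, s) < raw_size_bytes(n, s);
}
```

For example, with n = 1000 samples of s = 8 bytes the bound is m < 499: data which change value 498 times are smaller when sample-index encoded, while data which change value 499 times are not.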

The file extension of the sample-index encoding is .sie.

Slim Encoding

The slim encoding compresses raw binary files using the slim compression scheme, which was developed at Princeton University to compress dirfile or similar binary data for ACT. GetData's slim encoding framework currently lacks write capabilities; as a result, only gd_getdata(), gd_nframes(), and gd_eof() are supported by the encoding. The file extension of the slim encoding is .slm.

Slim was written by Joseph Fowler. It has been released under the GNU General Public License, and is distributed on SourceForge.

Text Encoding

The text encoding, like the sample-index encoding, requires no external library. As a result, all builds of the library contain full support for this encoding. It is meant to serve as a reference encoding and example of the encoding framework for work on other encoding schemes.

The text encoding replaces the binary data files with 7-bit ASCII files containing a decimal text encoding of the data, one sample per line. All operations are supported by the text encoding. Note: because this is a decimal encoding, storing floating point data may lose precision. The file extension of the text encoding is .txt.
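Why a decimal representation can lose precision is sketched below. The number of significant digits GetData itself writes is not stated here, so the digits parameter and the function name are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format `x` as one decimal text line with `digits` significant
 * digits, parse the line back, and report whether the value read is
 * identical to the value written. */
int survives_text_round_trip(double x, int digits)
{
    char line[64];

    assert(digits > 0 && digits <= 17);
    snprintf(line, sizeof line, "%.*g\n", digits, x);
    return strtod(line, NULL) == x;
}
```

With six significant digits, 3.141592653589793 is written as 3.14159 and does not read back unchanged; with 17 significant digits an IEEE 754 double always does. Values which are short in decimal, such as 0.5, survive either way.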

ZZip Encoding

The zzip encoding reads raw binary files compressed using the DEFLATE algorithm as implemented in the PKWARE ZIP archive container format. GetData's zzip encoding scheme is implemented through the zzip library written by Tomi Ollila and Guido Draheim. The zzip encoding framework currently lacks write capabilities; as a result, only gd_getdata(), gd_nframes(), and gd_eof() are supported by the encoding.

Unlike most encoding schemes, the zzip encoding merges all binary data files defined in a given fragment into a single ZIP archive. The name of this archive is raw.zip by default, but a different name may be specified using the second parameter to the /ENCODING directive. For example,

/ENCODING zzip archive

indicates that the ZIP archive is called archive.zip. The file extension of the zzip encoding is .zip.

ZZSlim Encoding

The zzslim encoding is a combination of the slim encoding and the zzip encoding. To create zzslim encoded files, first the raw data are compressed using the slim library, and then these slim-compressed files are archived (and compressed again) into a ZIP archive. As with the zzip encoding, the ZIP archive is raw.zip by default, but a different name may be specified using the second parameter to the /ENCODING directive.

Note: since the archives have the same name as zzip encoded data, automatic encoding detection on zzslim encoded data always fails: the data are incorrectly identified as simply zzip encoded. As a result, an /ENCODING directive in the format file or else a GD_ZZSLIM_ENCODED flag passed to gd_open() is required to read zzslim encoded data. The file extension of the zzslim encoding is .zip.
