The Dirfile Standards


Introduction

The Dirfile Standards describe the dirfile database format, a filesystem-based database for time-ordered binary data. Dirfiles are designed to be a fast, simple format for storing and reading binary time-ordered data. This document provides an unofficial overview of the Dirfile Standards. The official Dirfile Standards are distributed with GetData as three Unix man pages: dirfile(5), dirfile-format(5), and dirfile-encoding(5). Additionally, this document discusses some implementation-dependent behaviour specific to GetData not found in those documents. The latest release of the Dirfile Standards is Standards Version 10 (January 2017).

The dirfile database is centred around one or more time-ordered data streams (time streams). Each time stream is written to disk in a separate file, in its native binary format. The name of each time stream file corresponds to the time stream's field name, a descriptive textual tag.

Two time streams may have different constant sampling frequencies and mechanisms exist within the dirfile format to ensure these time streams remain properly sequenced in time. To do this, the time streams in the dirfile are subdivided into frames. Each frame contains an integer number of samples of each time stream. When synchronous retrieval of data from more than one time stream is required, position in the dirfile can be specified in frames, which will ensure synchronicity.

The time stream files are all located in a central directory, known as the dirfile directory. The dirfile as a whole is typically referred to by its dirfile directory. Included in the dirfile along with the time streams is the dirfile format specification, which is an ASCII text file called format located in the dirfile directory.

Version 3 of the Dirfile Standards introduced the large dirfile extension. This extension added the ability to distribute the dirfile format specification among multiple files (called fragments) in addition to the format file, as well as the ability to house portions of the database in subdirfiles. These subdirfiles may be fully-fledged dirfiles in their own right, but may also be contained within a larger, parent dirfile.

In addition to the raw fields on disk, the dirfile format specification may also specify derived fields which are calculated from one or more raw or derived time streams. Derived fields behave identically to raw fields when read via GetData. See below for a complete list of derived field types.

Dirfiles are designed to be written to and read simultaneously. The dirfile specification dictates that one particular raw field (specified either explicitly with the /REFERENCE directive or implicitly by the format file) is to be used as the reference field: all other vector fields are assumed to have at least as many frames as the reference field has, and the size (in frames) of the reference field is used as the size of the dirfile as a whole.

Version 6 of the Dirfile Standards added the ability to encode the binary files on disk. Each fragment may have its own encoding scheme. Notably this can be used to compress these files. See Dirfile Encodings for information on encoding schemes.

Dirfile Example

An example dirfile is presented as Figure 1 below. The dirfile as a whole is referenced by the name of the directory which forms its base, in this case "dirfile". The base directory contains a format file which contains the metadata for the dirfile database. This format file includes other files (called format file fragments) which provide additional database metadata, including a format file in a subdirectory (a subdirfile). A full description of the format file syntax is given below.

Figure 1: Graphical representation of a dirfile

This dirfile contains four time streams (indicated by RAW in the format file, and often called raw fields), one of which is located in the subdirfile. Five derived fields are also defined (indicated by LINCOM, MULTIPLY, and BIT in the format file; other derived field types also exist, see below).

Each time stream has a corresponding file of binary data containing the time stream itself. Each time stream may have a different sample rate, also indicated in the format file. In this example, for every sample of field1, there are four samples of field2, twelve samples of field3, and eight samples of bits. Derived fields inherit the sample rate of their first input field, so, for example, the derived field diff has the same sample rate as field2.
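As a rough illustration of how a position given in frames translates into sample positions, the following Python sketch uses the samples-per-frame values quoted for the example above. The names SAMPLES_PER_FRAME and sample_range are hypothetical, and the sketch assumes no frame offset is in effect.

```python
# Samples per frame for the raw fields of the example dirfile
# (values taken from the text; the dictionary itself is illustrative).
SAMPLES_PER_FRAME = {"field1": 1, "field2": 4, "field3": 12, "bits": 8}


def sample_range(field, first_frame, num_frames):
    """Return (first_sample, num_samples) covering the requested frames.

    Assumes the field starts at frame zero (no /FRAMEOFFSET).
    """
    spf = SAMPLES_PER_FRAME[field]
    return first_frame * spf, num_frames * spf


# Requesting the same frame range from every field keeps the data
# synchronous even though each field yields a different sample count.
for name in SAMPLES_PER_FRAME:
    print(name, sample_range(name, first_frame=10, num_frames=2))
```

Specifying the position in frames rather than samples is what keeps the four reads aligned in time.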

The binary file associated with the time stream bits is located in the subdirfile because it is defined in the format file fragment in that directory. This is a general rule: the binary file associated with a time stream must reside in the directory which contains the fragment that defines the field. (This is true even if the fragment isn't called format, i.e. binary files associated with raw fields defined in extra_format would have to reside in dirfile.) If this rule is not followed, GetData will be unable to locate the time stream on disk.

The subdirfile is a fully formed dirfile in its own right, since its metadata is fully specified by its format file, which does not refer to any fields defined by its parent. Note, however, that dirfile is not a complete dirfile without the inclusion of the subdirfile, since the format fragment extra_format refers to a field defined in the subdirfile.

The Format File

The format file is a case-sensitive text file which contains the dirfile database metadata. The explicit text encoding is not specified by the Standards, but it must be 7-bit ASCII compatible. Examples of acceptable character encodings include all the ISO 8859 character sets (i.e. Latin-1 through Latin-10, among others), as well as the UTF-8 encoding of Unicode and UCS.

The format file is composed of directive lines and field specification lines, optionally separated by blank lines, or lines containing only whitespace. Lines are separated by the line-feed character (0x0A). Unless escaped (see below), the hash mark (#) is the comment delimiter; the comment delimiter, and any text following it to the end of the line, is ignored.

Tokens

Both directive lines and field specification lines consist of several tokens separated by whitespace. Whitespace consists of one or more whitespace characters. These are: space (0x20), horizontal tab (0x09), vertical tab (0x0B), form-feed (0x0C), and carriage return (0x0D). The first token of a directive line is always a reserved word, while a field specification line begins with a field name. As a result, no field may have the same name as a reserved word (although, as of Standards Version 8, all reserved words contain a forward slash character, /, which is prohibited in field names in any case).

Since tokens are separated by whitespace, to include a whitespace character in a token, it must either be escaped by preceding it with a backslash character (\) or replaced by a character escape sequence (see Table 1, below), or else the token must be enclosed in quotation marks ("). The quotation marks themselves are stripped from the token. The null-token (that is, the token consisting of zero characters) may be specified by a pair of quotation marks with nothing between them (""). To include a literal quotation mark or backslash character in a token, it must be escaped (\" or \\). Similarly, a hash mark may be included in a token by including it in a quoted token or else by escaping it (\#), otherwise the hash mark will be understood as the comment delimiter.

It is a syntax error to have a line which contains unmatched quotation marks, or in which the last character is an un-escaped backslash (i.e., line continuation is not allowed).

Several characters, when escaped by a preceding backslash character, are interpreted as special characters in tokens. Some of these have already been mentioned. The full list of character escape sequences is presented in Table 1. Any other character which is escaped is interpreted as the character itself (i.e. \c is interpreted as c).

Table 1: Character Escape Sequences
Sequence    Interpretation                  Byte
\"          a quotation mark character      0x22
\#          a hash mark character           0x23
\a          an alert (bell) character       0x07
\b          a backspace character           0x08
\e          an escape character             0x1B
\f          a form-feed character           0x0C
\n          a line-feed character           0x0A
\r          a carriage return character     0x0D
\t          a horizontal tab character      0x09
\v          a vertical tab character        0x0B
\\          a backslash character           0x5C
\ooo        the single byte given by the octal number ooo (1 to 3 octal digits)  0ooo
\xhh        the single byte given by the hexadecimal number hh (1 or 2 hexadecimal digits)  0xhh
\uhhhhhhh   the UTF-8 byte sequence encoding the Unicode code point given by the hexadecimal number hhhhhhh (1 to 7 hexadecimal digits)

No token may contain the NUL character (0x00). Furthermore, although support is present to create UTF-8 byte sequences, tokens are not required to be valid UTF-8 sequences. Any byte sequence not containing the NUL character forms a valid token. However, there may be further restrictions on allowed characters for a token in a particular situation (for example, when used as a field name).
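The tokenising rules above can be sketched in Python as follows. This is a minimal illustration, not GetData's parser: the function names are hypothetical, quotation marks are assumed to toggle quoting anywhere in a token, an \x or \u with no hexadecimal digits after it is assumed to fall back to the literal character, and code points beyond U+10FFFF are rejected because Python cannot encode them.

```python
WHITESPACE = " \t\v\f\r"
NAMED = {"a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,
         "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B}
OCT_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def take_digits(text, pos, allowed, limit):
    """Return up to `limit` consecutive characters from `allowed` at `pos`."""
    end = pos
    while end < len(text) and end - pos < limit and text[end] in allowed:
        end += 1
    return text[pos:end]


def decode_escape(text, pos):
    """Decode the escape whose backslash precedes `pos`; return (bytes, new_pos)."""
    ch = text[pos]
    if ch in OCT_DIGITS:
        digits = take_digits(text, pos, OCT_DIGITS, 3)
        value, pos = int(digits, 8), pos + len(digits)
        if value > 0xFF:
            raise ValueError("octal escape does not fit in one byte")
        out = bytes([value])
    elif ch == "x" and take_digits(text, pos + 1, HEX_DIGITS, 2):
        digits = take_digits(text, pos + 1, HEX_DIGITS, 2)
        out, pos = bytes([int(digits, 16)]), pos + 1 + len(digits)
    elif ch == "u" and take_digits(text, pos + 1, HEX_DIGITS, 7):
        digits = take_digits(text, pos + 1, HEX_DIGITS, 7)
        # chr() raises ValueError above U+10FFFF (a limit of this sketch).
        out = chr(int(digits, 16)).encode("utf-8", "surrogatepass")
        pos += 1 + len(digits)
    elif ch in NAMED:
        out, pos = bytes([NAMED[ch]]), pos + 1
    else:
        out, pos = ch.encode("utf-8"), pos + 1  # \c is just c
    if b"\x00" in out:
        raise ValueError("tokens may not contain NUL")
    return out, pos


def tokenize(line):
    """Split one format-file line (without its line-feed) into byte tokens."""
    if "\0" in line:
        raise ValueError("tokens may not contain NUL")
    tokens, current, in_token, quoted = [], bytearray(), False, False
    pos = 0
    while pos < len(line):
        ch = line[pos]
        if ch == "\\":
            if pos + 1 >= len(line):
                raise ValueError("line ends with an unescaped backslash")
            piece, pos = decode_escape(line, pos + 1)
            current += piece
            in_token = True
            continue
        pos += 1
        if ch == '"':
            quoted, in_token = not quoted, True  # quotes are stripped
        elif quoted:
            current += ch.encode("utf-8")
        elif ch == "#":
            break  # unescaped, unquoted hash mark starts a comment
        elif ch in WHITESPACE:
            if in_token:
                tokens.append(bytes(current))
                current, in_token = bytearray(), False
        else:
            current += ch.encode("utf-8")
            in_token = True
    if quoted:
        raise ValueError("unmatched quotation mark")
    if in_token:
        tokens.append(bytes(current))
    return tokens


print(tokenize(r'field\ one RAW "UINT8" 1 # a comment'))
```

Tokens are returned as bytes rather than text because, as noted above, a token need not be a valid UTF-8 sequence.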

Standards Versions 5 and earlier do not recognise the character escape sequences, nor allow quoting of tokens. As a result, they prohibit both whitespace and the comment delimiter from being used in tokens.

Directives

There are eleven directives, each specified by a different reserved word which cannot be used as field names in the dirfile; all directives are optional. As of Standards Version 8, all reserved words start with an initial forward slash (/), to distinguish them from field names. Standards Versions 5, 6 and 7 permit any reserved word to optionally omit its initial forward slash, without change in meaning. Reserved words in Standards Version 4 and earlier may not have an initial forward slash. Like the rest of the format specification, directives are case sensitive.

A number of the directives have fragment scope. A directive with fragment scope only applies to the fragment in which it is present, plus any sub-fragments indicated by the /INCLUDE directive, but only if those sub-fragments don't have their own corresponding directive. Directives which have fragment scope are: /ENCODING, /ENDIAN, /FRAMEOFFSET, and /PROTECT. Because of these scoping rules, different portions of the dirfile may have different encodings, endiannesses, frame offsets, or protection levels.

If a directive with fragment scope appears more than once in a fragment, only the last such directive is honoured, with the exception that the effect of a directive is not propagated to sub-fragments if the directive line appears after the sub-fragment is included. The scoping rules of the remaining directives are discussed below.

Field Specifications

Any line which does not start with a reserved word is assumed to be a field specification line. A field specification line consists of at least two tokens. The first token is the field name. The second token is the field type. Subsequent tokens are field parameters. The meaning and number of these parameters depend on the field type specified.

Field Names

A field name consists of one or more characters, excluding both ASCII control characters (bytes 0x00 through 0x1F) and the reserved characters listed in Table 2 according to Standards Version. Furthermore, the field name of a RAW field may only contain characters allowed in filenames. Although never allowed in a field name, a forward slash (/) can be used to define metafields; see above under the /META directive. Like the rest of the format file, field names are case sensitive.

Table 2: Reserved Characters in Field Names
Version       Reserved Characters
0–4           # / whitespace
5             # / & ; < > | whitespace
6 and later   / & ; < > | .
‡: The hash mark and whitespace are reserved by virtue of there being no way to include such characters in tokens.

The field name may not be INDEX, which is a special, implicit field which contains the integer frame index. Standards Version 5 and earlier also prohibit FILEFRAM as a field name; it was an alias for INDEX (which arose in the prehistoric times of ReadData, GetData's spiritual predecessor).

Standards Versions 3 and 4 restrict field names to 50 characters. Standards Versions 2 and earlier restrict field names to 16 characters. Additionally, the filesystem will put restrictions on the length of a RAW field name, regardless of Standards Version*.

Starting in Standards Version 7, if the field name beginning a field specification line contains exactly one forward slash character (/), the line is assumed to specify a metafield. See the /META directive above for further details. A field name may not contain more than one forward slash. Starting in Standards Version 10, any field name may be preceded by a namespace tag. The namespace tag and the field name are separated by a dot (.). See the Namespaces section, following, for details.

Regarding the characters allowed in the name of a RAW field: consult the documentation of the filesystem backing the database for details, although most modern filesystems permit any byte except NUL (0x00) or, failing that, any Unicode character except NUL.

*: Again, consult your filesystem documentation, but most modern filesystems permit filenames of at least 255 bytes.

Namespaces

Beginning with Standards Version 10, every field in a Dirfile is contained in a namespace. Every namespace is identified by a namespace tag which consists of the same restricted set of characters used for field names (see Table 2, above). Namespaces nest arbitrarily deep. Subnamespaces are identified by concatenating all namespace tags, separating tags by dots (.), with the outermost namespace leftmost:
topspace.subspace.subsubspace

Each fragment has an immutable root namespace. The root namespace of the primary format file is the null namespace, identified by the null-token (""). The root namespace of other fragments is specified when they are introduced (see the /INCLUDE directive). Each fragment also has a current namespace which may be changed as often as needed using the /NAMESPACE directive, and defaults to the root namespace. The current namespace is always either the root namespace or else a subspace under the root namespace.

If a field name or field code starts with a leading dot, then that name or code is taken to be relative to the fragment's root space. If it does not start with a dot, it is taken to be relative to the current namespace.

For example, if both the root namespace and current namespace of a fragment start off as rootspace, then:

aaaa       RAW UINT8 1
.bbbb      RAW UINT8 1
cccc.dddd  RAW UINT8 1
.eeee.ffff RAW UINT8 1

/NAMESPACE newspace

gggg       RAW UINT8 1
.hhhh      RAW UINT8 1
iiii.jjjj  RAW UINT8 1
.kkkk.llll RAW UINT8 1
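specifies, respectively, the fields rootspace.aaaa, rootspace.bbbb, rootspace.cccc.dddd, rootspace.eeee.ffff, rootspace.newspace.gggg, rootspace.hhhh, rootspace.newspace.iiii.jjjj, and rootspace.kkkk.llll.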

Note that a field code may specify deeper subspaces under either the root namespace or the current namespace (meaning it is never necessary to use the /NAMESPACE directive). Note also that there is no way for metadata in a given fragment to refer to fields outside the fragment's root space.
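The leading-dot rule can be sketched in a few lines of Python. The helper names qualify and resolve_fragment are hypothetical, and the sketch assumes that the argument of /NAMESPACE is resolved by the same rule as a field name. It does not model the INDEX exception or any other directive.

```python
def qualify(name, root, current):
    """Return the fully qualified form of a field name or namespace tag."""
    if name.startswith("."):
        base, rest = root, name[1:]   # leading dot: relative to the root space
    else:
        base, rest = current, name    # otherwise: relative to the current space
    return f"{base}.{rest}" if base else rest


def resolve_fragment(lines, root):
    """List the fully qualified field names defined by a fragment's lines."""
    current = root                    # the current namespace defaults to the root
    fields = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "/NAMESPACE":
            # Assumption: the new namespace follows the same relative rule.
            current = qualify(tokens[1], root, current)
        else:
            fields.append(qualify(tokens[0], root, current))
    return fields


example = """
aaaa       RAW UINT8 1
.bbbb      RAW UINT8 1
cccc.dddd  RAW UINT8 1
.eeee.ffff RAW UINT8 1

/NAMESPACE newspace

gggg       RAW UINT8 1
.hhhh      RAW UINT8 1
iiii.jjjj  RAW UINT8 1
.kkkk.llll RAW UINT8 1
""".splitlines()

for field in resolve_fragment(example, root="rootspace"):
    print(field)
```

Because every name is resolved against either the root or the current namespace, no result can ever fall outside the fragment's root space.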

There is one exception to this namespace scoping rule: the implicit INDEX vector is always in the null (top-level) namespace, and namespace tags specified with it, either explicitly or implicitly, even a fragment root namespace, are ignored. So, in a fragment with root namespace rootspace, and current namespace rootspace.subspace,

all refer to the same INDEX field.

Field Types

There are eighteen field types. Of these, fourteen are of vector type (BIT, DIVIDE, INDIR, LINCOM, LINTERP, MPLEX, MULTIPLY, PHASE, POLYNOM, RAW, RECIP, SBIT, SINDIR, and WINDOW) and four are of scalar type (CARRAY, CONST, SARRAY, and STRING). The thirteen vector field types other than RAW are also called derived fields, since they derive their value from one or more input fields.

Five of these derived fields (DIVIDE, LINCOM, MPLEX, MULTIPLY, and WINDOW) may have more than one vector input field. In situations where these input fields have differing sample rates, the sample rate of the derived field is the same as the sample rate of the first (left-most) input field specified. Furthermore, the input fields are synchronised by aligning them on frame boundaries, assuming equally-spaced sampling throughout a frame, and using the last sample of each input field which did not occur after the sample of the derived field being computed. That is, if the first and second input fields have sample rates s1 and s2, the derived field also has sample rate s1 and, for every sample of the derived field, n, the n'th sample of the first field is used (since they have the same sample rate by definition), and the sample number used of the second field, m, is computed as:

m = floor((n * s2) / s1).
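A small Python sketch of this index calculation may help. The function name aligned_sample is hypothetical, and integer floor division is used to implement the floor.

```python
def aligned_sample(n, s1, s2):
    """Sample of the second input used for sample n of the derived field.

    s1 and s2 are the samples per frame of the first and second inputs.
    """
    return (n * s2) // s1


# First input at 4 samples per frame, second at 12: every third sample is used.
print([aligned_sample(n, s1=4, s2=12) for n in range(8)])

# First input at 4 samples per frame, second at 1: each sample of the
# second input is reused for a whole frame of the derived field.
print([aligned_sample(n, s1=4, s2=1) for n in range(8)])
```

In both cases the sample chosen from the second input is the last one that does not occur after the derived sample being computed.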

Starting in Standards Version 6, certain scalar field parameters in the field specifications may be specified using CONST or CARRAY fields, instead of literal values. A list of parameters for which this is allowed is given below in the Field Parameters section.

The possible field types are:

Field Parameters

All input vector field parameters should be field codes. Additionally, in Standards Version 6 and later, some of the numerical field parameters may be either literal numbers or else the field code of a CONST or CARRAY scalar field containing the value. In the case of a CARRAY, the field code may be immediately followed by an integer enclosed in angle brackets (< >) specifying which element (counting from zero) of the CARRAY to use (so: field_code<n>). If this is omitted, the first element is assumed. Parameters for which this is possible are:

Since it is possible to create a field code which is identical to a literal number, a parameter is assumed to be the field code of a scalar field only if it doesn't look like a number.
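The field_code<n> notation for CARRAY elements can be separated with a short Python sketch. The name split_element is hypothetical, and the sketch assumes the element index is written in decimal and that nothing follows the closing angle bracket.

```python
import re

# A field code followed by an optional <n> element index. Angle brackets are
# reserved in field names, so they cannot appear in the code itself.
_ELEMENT = re.compile(r"^(?P<code>[^<>]+)<(?P<index>[0-9]+)>$")


def split_element(parameter):
    """Return (field_code, element_index) for a scalar field parameter."""
    match = _ELEMENT.match(parameter)
    if match:
        return match.group("code"), int(match.group("index"))
    return parameter, 0  # no index given: the first element is assumed


print(split_element("gains<2>"))
print(split_element("gains"))
```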

Starting in Standards Version 9, in addition to decimal notation, literal integer parameters may be specified as hexadecimal numbers, by prefixing the number with 0x or 0X, or as octal numbers, by prefixing the number with 0. Both uppercase and lowercase hexadecimal digits may be used.
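The prefix rules for integer literals can be sketched in Python as follows. The function name parse_int_literal is hypothetical, and accepting an optional leading sign is an assumption of this sketch rather than something stated above.

```python
def parse_int_literal(token):
    """Parse a decimal, hexadecimal (0x/0X) or octal (leading 0) integer."""
    sign, body = 1, token
    if body.startswith(("+", "-")):  # assumption: an optional sign is allowed
        sign = -1 if body[0] == "-" else 1
        body = body[1:]
    if body[:2] in ("0x", "0X"):
        base, digits, allowed = 16, body[2:], "0123456789abcdefABCDEF"
    elif len(body) > 1 and body[0] == "0":
        base, digits, allowed = 8, body[1:], "01234567"
    else:
        base, digits, allowed = 10, body, "0123456789"
    # Check the digits explicitly so that int() quirks (underscores,
    # surrounding whitespace) are not silently accepted.
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"not an integer literal: {token!r}")
    return sign * int(digits, base)


for literal in ("42", "0x1f", "0X1F", "017", "0"):
    print(literal, parse_int_literal(literal))
```

Note that a leading zero changes the meaning of a literal: 017 is fifteen, not seventeen.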

In Standards Version 7 and later, a literal complex number is specified as two real (floating point) numbers separated by a semicolon (;) with no intervening whitespace. So, for example, the tokens:

1;0 0;1 4;0 0;5 9.313e2;74.1
represent, respectively, the real unit, the imaginary unit, the real number four, the imaginary number 5i, and the complex number 931.3+74.1i. Because the semicolon character cannot be used in field names, a complex valued literal can never be mistaken for a field code.
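A complex literal of this form can be split with a short Python sketch. The name parse_complex_literal is hypothetical, and Python's float() is somewhat more permissive than a strict parser would be (it also accepts values such as "inf").

```python
def parse_complex_literal(token):
    """Convert a 'real;imaginary' token into a Python complex number."""
    real, separator, imaginary = token.partition(";")
    if not separator:
        raise ValueError(f"not a complex literal: {token!r}")
    return complex(float(real), float(imaginary))


for token in "1;0 0;1 4;0 0;5 9.313e2;74.1".split():
    print(token, parse_complex_literal(token))
```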

Complex literals allow, among other things, the composition of complex valued fields from purely real input fields. For example, a complex valued field, z, may be created from a real valued field re, representing the real part of the complex number, and the real valued field im, representing the imaginary part of the complex number, by specifying:

z LINCOM re 1 0 im 0;1 0

Field Codes

Both when specifying the inputs to a field (as a non-literal scalar parameter, or as an input vector field to a field), and when specifying a field to a GetData call, field codes are used. A field code consists of, in order:

A representation suffix may be used to extract a real number from a complex value. The available suffixes (listed here with their preceding dot) and their meanings are:

If the specified field is purely real, the representations are calculated as if the imaginary part was equal to +0. For example, given a complex valued vector, z, a vector containing the real part of z, called re_z, could be produced with:

re_z PHASE z.r 0
and similarly for the complex field's imaginary part, argument, and absolute value. (Although it should be pointed out this simplistic an example isn't strictly necessary, since z.r could be used wherever re_z would be.)

History

The latest version of the Dirfile Standards is Version 10.

Table 4: Dirfile Standards Version history
Version  Release Date       Notes
10       January 2017       Added the INDIR, SARRAY, and SINDIR field types, the /NAMESPACE directive, the optional namespace tag to the /INCLUDE directive and the .z representation suffix.
9        April 2012         Added the MPLEX and WINDOW field types, the /ALIAS and /HIDDEN directives, the affixes to /INCLUDE, and the optional enc-datum token to /ENCODING. It permitted specification of integer literals in octal and hexadecimal. Finally, it deprecated the type aliases FLOAT and DOUBLE.
8        November 2010      Added the DIVIDE, RECIP and CARRAY field types, made the forward slash on reserved words mandatory, and prohibited using the single-character type aliases in the specification of RAW fields. It also introduced the optional second (arm) token to the /ENDIAN directive.
7        October 2009       Added the POLYNOM and SBIT field types, and complex data types COMPLEX64 and COMPLEX128. It also introduced representation suffixes to field codes, made the n_fields parameter to LINCOM optional, and introduced the directive-free method of specifying metafields.
6        October 2008       Added the /ENCODING, /META, /PROTECT, and /REFERENCE directives and the CONST and STRING field types. It permitted whitespace in tokens and introduced the character escape sequences. It allowed CONST fields to be used as parameters in field specification lines. It also removed FILEFRAM as an alias for INDEX, and allowed # and \ in field codes.
5        August 2008        Added VERSION and ENDIAN, and removed the restriction on field name length. It introduced the data types INT8, INT64, and UINT64, the new-style type specifiers, and increased the range of the BIT field type from 32 to 64 bits. It also prohibited the characters #&/;<>\.| in field names.
4        October 2006       Added the PHASE field type.
3        January 2006       (The "Large Dirfile Extension") Added INCLUDE, support for sub-dirfiles, and increased the allowed length of a field name from 16 to 50 characters.
2        September 2005     Added the MULTIPLY field type, and added support for LINCOM fields with inputs of differing sample rates.
1        November 2004      Added FRAMEOFFSET and the optional fourth argument to the BIT field type.
0        before March 2003  Refers to the dirfile standards supported by the GetData library originally introduced into the kst sources, which contained support for all other features covered by this document.
© 2008, 2009, 2010, 2011, 2012, 2013, 2016, 2017 D. V. Wiebe