Data format specification

I envision a data format specification language similar to Sun's XDR (RFC1832).

While XDR uses 32 bits for many types, I'll try to hardcode as few size assumptions as possible.

I'm documenting on this page a few (binary) formats that I'm currently using. Some of them aim at minimum processing overhead, some at minimum space consumption. I hope the design of the specification system will allow all of the formats and their aims to be unified.

Some thoughts about supported types.

Examples

Formats of these are documented here (I'm using all of them):

mail spool file
log file for my Go playing CGI
request protocol used by that CGI for communicating with the game engine
tunnel protocol for passing the end-of-stream condition through connections made by rsh and ssh

Formats of these will also be documented:

compressing protocol for passing client-server IRC (RFC1459) traffic
remote file update protocol (like rdist)

Endianness

A few thoughts about endianness:

If data is only handled in fixed sizes, endianness doesn't matter and the choice can be made arbitrarily.
For packed strings (one byte per character), big-endian makes lexical string comparison easier, because an unsigned 32-bit value is then smaller iff the associated 4-character string comes first lexically.
For arbitrarily sized integers, little-endian allows a simple polynomial formula for calculating the value: v = sum a[i]*256^i, where a[i] is the byte at offset i.

Thus for XDR, big-endian can be considered a good choice, because it has packed strings and only fixed-size integers. (Of course it seems to be more than a coincidence that both the 680x0 and SPARC processors used by XDR's inventor, Sun, are big-endian.)
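The packed-string argument can be checked with a small sketch. It assumes 4-character strings of single-byte characters read as unsigned 32-bit values; the helper names are made up for this illustration and are not part of XDR or of my system.

```python
import struct

def as_big_endian(s: bytes) -> int:
    """Read exactly four bytes as an unsigned big-endian 32-bit value."""
    return struct.unpack(">I", s)[0]

def as_little_endian(s: bytes) -> int:
    """Read exactly four bytes as an unsigned little-endian 32-bit value."""
    return struct.unpack("<I", s)[0]

words = [b"abcd", b"abce", b"baaa", b"azzz"]

by_bytes = sorted(words)                        # plain lexical order
by_be = sorted(words, key=as_big_endian)        # order of big-endian values
by_le = sorted(words, key=as_little_endian)     # order of little-endian values
```

With these sample words the big-endian order matches the lexical order, while the little-endian order does not, because there the last character becomes the most significant byte.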

My system will probably not have a special string type, but it will have arbitrary integer sizes, so it will use a little-endian format.

Olivier Galibert noted that big-endian is better suited for evaluating the polynomial with Horner's scheme, but that the evaluation might be too simple for this to matter, as it would only consist of shifting bytes.
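Both ways of evaluating the value can be written down in a few lines. This is only a sketch for unsigned integers of arbitrary length; the function names are chosen for this example.

```python
def decode_little_endian(data: bytes) -> int:
    """Direct polynomial: v = sum a[i] * 256^i, a[i] being the byte at offset i."""
    return sum(byte * 256**i for i, byte in enumerate(data))

def decode_big_endian_horner(data: bytes) -> int:
    """Horner's scheme: the most significant byte comes first in the stream."""
    value = 0
    for byte in data:
        value = value * 256 + byte   # the same as shifting left by one byte
    return value

sample = bytes([0x01, 0x02, 0x03])
le_value = decode_little_endian(sample)       # 0x030201
be_value = decode_big_endian_horner(sample)   # 0x010203
```

The Horner loop needs no exponent and no byte index, which is the advantage noted above. The little-endian sum in turn lets each byte be weighted by its own offset without knowing the total length.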

Canonical representation

My system might allow the definition of type names like XDR's "typedef" and maybe even have a list of predefined types.

Format definitions will be considered equal if their expanded form matches. This ensures that an implementation need not know all predefined types.

I will also try to eliminate redundant structure, like "struct" types with only one component. Ideally, two definitions should be considered equal if they parse the same encoded data stream in the same way, for example by expecting the same range for each code item. (This must of course be defined in a formal way to be useful at all.)
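A minimal sketch of such a comparison follows. The representation is purely hypothetical: a type is a tuple, ("range", n) stands for a code item with n possible values, ("name", x) refers to a typedef, and ("struct", [...]) lists components. None of this is the actual specification language.

```python
def canonical(t, typedefs):
    """Expand type names and drop structs that have only one component."""
    kind = t[0]
    if kind == "name":
        # Replace a typedef name by its (canonicalized) definition.
        return canonical(typedefs[t[1]], typedefs)
    if kind == "struct":
        parts = [canonical(p, typedefs) for p in t[1]]
        if len(parts) == 1:
            return parts[0]          # a one-component struct is redundant
        return ("struct", tuple(parts))
    return t                         # basic items are already canonical

# Hypothetical typedefs, including one that only wraps another type.
typedefs = {
    "byte": ("range", 256),
    "wrapped": ("struct", [("name", "byte")]),
}

a = ("struct", [("name", "byte"), ("name", "wrapped")])
b = ("struct", [("range", 256), ("range", 256)])
same = canonical(a, typedefs) == canonical(b, typedefs)
```

Here the two definitions compare equal after expansion, although only one of them uses type names. The sketch does not handle recursive typedefs, and it does not flatten nested structs, which a formal definition of equality would also have to settle.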

Assigned numbers

Definitions will be registered and assigned a serial number in order to make it easy to talk about them or refer to them at the beginning of a file or communication stream.

There might be demand for local extensions, and I'm not yet sure how to handle them. Two choices come to mind: either reserve some numbers, like a range of 256 or all odd numbers, or define one number for raw data and expect that some kind of "magic numbers" will be used, as is done with existing file types on many platforms.
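To make the "all odd numbers" choice concrete, here is a sketch of a stream that starts with its definition's serial number. The header layout (one length byte followed by the number in little-endian order) is an assumption made up for this example only, as are the function names; nothing here is decided.

```python
def write_header(serial: int) -> bytes:
    """Hypothetical header: a length byte, then the serial number, little-endian."""
    length = max(1, (serial.bit_length() + 7) // 8)
    return bytes([length]) + serial.to_bytes(length, "little")

def read_header(stream: bytes):
    """Return the serial number and the rest of the stream after the header."""
    length = stream[0]
    serial = int.from_bytes(stream[1:1 + length], "little")
    return serial, stream[1 + length:]

def is_local_extension(serial: int) -> bool:
    """Assumed convention: odd numbers are reserved for local extensions."""
    return serial % 2 == 1

serial, rest = read_header(write_header(300) + b"payload")
```

Under this convention a reader can tell from the lowest bit alone whether a number was registered or assigned locally, without consulting a list of reserved ranges.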


Page created: Jun 30, 1997 - last update: Nov 26, 2002 - version 2.2
Jörg Czeranski (Impressum)