Tencoding

Version 1, Tom Anderson <twic@urchin.earth.li>, 2005-01-23

Tencoding is a simple scheme for encoding structured data in a terse binary format. It is similar to, but much simpler and more practical than, ASN.1 BER. The name is a homage to bencoding.

The data model supported by tencoding is very simple. The universe consists of four kinds of objects:

  1. Blobs. These are just chunks of opaque binary data; they're useful for passing things like pictures, external files, etc. They have no structure or semantics within the framework of tencoding.
  2. Integers. These are common or garden whole numbers, positive or negative.
  3. Strings. These are sequences of unicode characters.
  4. Lists. These are sequences of references (aka pointers) to objects, of any kind.

Tencoding does not directly support booleans (use C-style integers, 0 for false and 1 for true), characters (use a length-1 string) or floating-point numbers (since portable encoding is a huge pain in the arse; use fixed-point or a pair (mantissa, exponent)).

However, there is slightly more to it than this: as well as a fundamental kind (blob, integer, string, composite), every object has a type. A type is simply a name, and indicates to applications what the semantics of an object, beyond those implied by its kind, are. For example, in describing customers, you might have types like Name and Address; you could then have a type Customer which was a list containing Name and Address objects. Note that types are derived from kinds; all objects of a given type are of a specific kind.

So, how is this universe encoded? Every object is encoded as a type - length - value (TLV) triple, consisting of the concatenation of the encodings of the object's type, the length of its encoded value, and its value. Types and lengths are handled as integers, and encoded using 'stretchy ints'; values are encoded according to specific rules for each kind.

Lengths are clearly already integers, but types are not; accordingly, for encoding, types must be represented as integers. To this end, all types are simply assigned a number. The least significant two bits of the number reflect the kind of the type: 00 for blobs, 01 for integers, 10 for strings and 11 for lists. The rest of the type number is up to the application. Type numbers, and indeed types generally, are not globally unique: every application is free to assign its own numbers. However, the type number 0 (which would otherwise be a blob) is always reserved (it is used as a marker for references in lists).

A stretchy int is a simple way of efficiently encoding unsigned integers of arbitary size. The integer is considered as a bitstring (with most significant bit first), and broken into groups of seven bits (if its length is not a multiple of seven, the string is left-padded with zeroes so that it is). Each group of bits is then written, preceded by a continuation bit, set to 0 if this is the last group in the sequence, and 1 if it is not. For example, 0 is encoded as 00000000, 1 as 00000001, 127 as 01111111, 128 as 10000001 00000000 and 316 as 10000010 00111100.

The values of objects are then encoded as follows:

  1. Blobs are encoded by writing the bytes of the data one after the other. Obvious, but it has to be said.
  2. Integers are encoded by splitting them into bytes and writing them out, most significant first (big-endian or network byte order). Note that leading zeroes can, and should, be discarded.
  3. Strings are encoded by encoding as UTF-8, then writing the bytes out one after another. Note that US ASCII is a subset of UTF-8, so applications not able to support full unicode may use that (and not ISO 8859-1 or relatives!). A byte order mark is not required (a byte order mark is never required for UTF-8). If the string must be normalised, use normalisation form C.
  4. Lists are encoded as the concatenation of the encodings of each item. Note that a list is a list of references, not objects, so this encoding is not quite as straightforward as you might think! How a reference is encoded depends on whether the referent has already been encoded in the stream: if it has, a pointer to the encoding is written; if not, it is encoded inline. Pointers are encoded as a zero byte followed by the offset of the encoding of the referent, relative to the pointer; offsets are measured from start to start (zero byte to first byte of encoded type), with positive being towards the start of the file (so offsets are always positive). Inline objects are encoded as described. Note that no encoded type can start with a zero byte (an initial zero byte would mean a type number 0, which is forbidden), so it is possible to distinguish pointers from inline objects simply by examining the first byte.

I'm working on a tencoding library in Java; bug me if you want a copy.