We Built A New Data Format for Serializing Named Binary Buffers. Introducing BFAST

Written by cdiggins | Published 2019/08/26
Tech Story Tags: programming | serialization | fileformat | dataformat | binarydata | binaryformat | crossplatform | latest-tech-stories

TLDR Given a collection of byte arrays, how do you pack them efficiently into a single block to use in memory, write to disk, and transmit over a network? We decided to put together a formal specification called BFAST for encoding binary arrays. BFAST is an acronym for “Binary Format for Array Serialization and Transmission’. The BFAST format consists of three sections: Fixed size descriptor (32 bytes) A pointer table that identifies where each binary buffer starts and ends within the data block. A list of Utf-8 encoded strings separated by null characters.via the TL;DR App

While developing the G3D geometry file format at VIM AEC we encountered the following challenge:
Given a collection of byte arrays, how do you pack them efficiently into a single block to use in memory, write to disk, and transmit over a network, so that it can be easily unpacked on different platforms including Web, Mobile, and AR devices like the Magic Leap.
This simple use case of storing and processing collections of large arrays of binary data pops up in a large number of problem domains, such as audio, image, video, 3D, geographical information systems (GIS), continuous fluid dynamics (CFD), photogrammetry, etc. In fact a lot of tabular data can be represented of as collections of named arrays.
“Isn’t this a solved problem?” you might ask, “Can’t we just use one of several cross-platform binary solutions like the Concise Binary Object Representation (CBOR) or any of other various Binary formats that exist?”. Certainly you can use them, but all of these binary formats are for nested structured data, so as a result they are necessarily more complex and less efficient than a specialized solution for the common use case of serializing byte arrays.

A Naïve Solution

Obviously, one could implement a simple solution with relative ease. A common first approach to the problem (which several serialization libraries use) is to write the number of arrays, then for each byte array, write the number of bytes, followed by the byte data. Here is some sample code in C# to illustrate this approach:
public static byte[] NaivePack(IList<byte[]> buffers)
{
    using (var stream = new MemoryStream())
    using (var bw = new BinaryWriter(stream))
    {
        bw.Write(buffers.Count);
        foreach (var b in buffers)
        {
            bw.Write(b.Length);
            bw.Write(b);
        }
        return stream.ToArray();
    }
}

public static IList<byte[]> NaiveUnpack(byte[] data)
{
    var r = new List<byte[]>();
    using (var stream = new MemoryStream())
    using (var br = new BinaryReader(stream))
    {
        var n = br.ReadInt32();
        for (var i = 0; i < n; ++i)
        {
            var localN = br.ReadInt32();
            var bytes = br.ReadBytes(localN);
            r.Add(bytes);
        }
    }
    return r;
}
This works well enough in simple cases and has acceptable performance for some use cases. However, it suffers from several shortcomings:
  • The read code must be executed on an architecture that has the same endianness than the writer
  • You can’t encode arrays that are larger than 2GB.
  • A reader has no easy way to know that a data block was encoded using this scheme.
  • Accessing data in an array can require multiple disk accesses (one to each array, to figure out where the next array is)
  • Individual data buffers aren’t strictly aligned, so casting data to pointers won’t work reliably
Overall, this is not a very robust or efficient solution.

A Better Approach

The following simple steps improve upon the previous naïve solution:
  • Add a magic number to Identify the endianness of the reader. This will also help decoders quickly identify the encoding scheme of the data.
  • Add a pointer table that identifies where each binary buffer starts and ends within the data block.
  • Align the data-buffers to 64-byte addresses
Given these observations, we decided to put together a formal specification called BFAST for encoding binary arrays, so that we could easily share binary arrays across different applications written in different languages.

Introducing the BFAST Specification

BFAST is an acronym for “Binary Format for Array Serialization and Transmission”. The BFAST format consists of three sections:
  • Header — Fixed size descriptor (32 bytes) describing the file contents
  • Ranges — An array of offset pairs indicating the begin and end of each buffer (relative to file begin)
  • Data — 64-byte aligned data buffers
The header is a 32-byte struct with the following layout:
[StructLayout(LayoutKind.Explicit, Pack = 8, Size = 32)]
public struct Header
{
    [FieldOffset(0)] public long Magic; // 0xBFA5
    [FieldOffset(8)] public long DataStart; // <= File size and >= 32 + Sizeof(Range) * NumArrays
    [FieldOffset(16)] public long DataEnd; // >= DataStart and <= file size
    [FieldOffset(24)] public long NumArrays; // Number of all buffers, including name buffer
}
The ranges start at byte 32. There are
NumArrays
of them. This is the total count of all buffers, including the first buffer that contains the names. Each
Begin
and
End
values are byte offsets relative to the beginning of the file. The ranges have the following layout:
[StructLayout(LayoutKind.Explicit, Pack = 8, Size = 16)]
public struct Range
{
    [FieldOffset(0)] public long Begin;
    [FieldOffset(8)] public long End;
}
The data section starts at the first 64 byte aligned address immediately following the last Range struct. This value is stored for validation purposes in the header as
DataStart
.
The first data buffer contain the names of the subsequent buffers as a concatenated list of Utf-8 encoded strings separated by null characters. Names may be zero-length and are not guaranteed to be unique. A name may contain any Utf-8 encoded character except the null character.
There must be N-1 names where N is the number of ranges (i.e. the
NumArrays
value in header).

Final Words

We have released our implementation as an open-source project at https://github.com/vimaec/bfast with a commercially friendly open-source license (the MIT license) in the hope that others find it useful and can contribute back improvement. Let us know if you use the format or the implementation. It is always rewarding for developers to hear when their code is useful to others.

Written by cdiggins | As head of research at http://vimaec.com, I help transform complex problems into simple ones
Published by HackerNoon on 2019/08/26