While developing the G3D geometry file format at VIM AEC we encountered the following challenge:
Given a collection of byte arrays, how do you pack them efficiently into a single block to use in memory, write to disk, and transmit over a network, so that it can be easily unpacked on different platforms including Web, Mobile, and AR devices like the Magic Leap.
This simple use case of storing and processing collections of large arrays of binary data pops up in a large number of problem domains, such as audio, image, video, 3D, geographical information systems (GIS), continuous fluid dynamics (CFD), photogrammetry, etc. In fact a lot of tabular data can be represented of as collections of named arrays.
“Isn’t this a solved problem?” you might ask, “Can’t we just use one of several cross-platform binary solutions like the Concise Binary Object Representation (CBOR) or any of other various Binary formats that exist?”. Certainly you can use them, but all of these binary formats are for nested structured data, so as a result they are necessarily more complex and less efficient than a specialized solution for the common use case of serializing byte arrays.
Obviously, one could implement a simple solution with relative ease. A common first approach to the problem (which several serialization libraries use) is to write the number of arrays, then for each byte array, write the number of bytes, followed by the byte data. Here is some sample code in C# to illustrate this approach:
public static byte[] NaivePack(IList<byte[]> buffers)
{
using (var stream = new MemoryStream())
using (var bw = new BinaryWriter(stream))
{
bw.Write(buffers.Count);
foreach (var b in buffers)
{
bw.Write(b.Length);
bw.Write(b);
}
return stream.ToArray();
}
}
public static IList<byte[]> NaiveUnpack(byte[] data)
{
var r = new List<byte[]>();
using (var stream = new MemoryStream())
using (var br = new BinaryReader(stream))
{
var n = br.ReadInt32();
for (var i = 0; i < n; ++i)
{
var localN = br.ReadInt32();
var bytes = br.ReadBytes(localN);
r.Add(bytes);
}
}
return r;
}
This works well enough in simple cases and has acceptable performance for some use cases. However, it suffers from several shortcomings:
Overall, this is not a very robust or efficient solution.
The following simple steps improve upon the previous naïve solution:
Given these observations, we decided to put together a formal specification called BFAST for encoding binary arrays, so that we could easily share binary arrays across different applications written in different languages.
BFAST is an acronym for “Binary Format for Array Serialization and Transmission”. The BFAST format consists of three sections:
The header is a 32-byte struct with the following layout:
[StructLayout(LayoutKind.Explicit, Pack = 8, Size = 32)]
public struct Header
{
[FieldOffset(0)] public long Magic; // 0xBFA5
[FieldOffset(8)] public long DataStart; // <= File size and >= 32 + Sizeof(Range) * NumArrays
[FieldOffset(16)] public long DataEnd; // >= DataStart and <= file size
[FieldOffset(24)] public long NumArrays; // Number of all buffers, including name buffer
}
The ranges start at byte 32. There are
NumArrays
of them. This is the total count of all buffers, including the first buffer that contains the names. Each Begin
and End
values are byte offsets relative to the beginning of the file. The ranges have the following layout:[StructLayout(LayoutKind.Explicit, Pack = 8, Size = 16)]
public struct Range
{
[FieldOffset(0)] public long Begin;
[FieldOffset(8)] public long End;
}
The data section starts at the first 64 byte aligned address immediately following the last Range struct. This value is stored for validation purposes in the header as
DataStart
.The first data buffer contain the names of the subsequent buffers as a concatenated list of Utf-8 encoded strings separated by null characters. Names may be zero-length and are not guaranteed to be unique. A name may contain any Utf-8 encoded character except the null character.
There must be N-1 names where N is the number of ranges (i.e. the
NumArrays
value in header).We have released our implementation as an open-source project at https://github.com/vimaec/bfast with a commercially friendly open-source license (the MIT license) in the hope that others find it useful and can contribute back improvement. Let us know if you use the format or the implementation. It is always rewarding for developers to hear when their code is useful to others.