Change is an inevitable consequence of the development process. During development new insights surface, requirements change, and feature requests arise. All of this has one consequence: we need to change our data representations. Some of those changes are breaking changes, some are not. The consequences of a breaking change are manageable when the data usage is contained to a single place. It becomes much more complicated when data is sent from one independent system to another.
A change to a structure can be considered additive when we add information without removing any. Say we have an object which represents a person:
{"name": "Maxim", "email": "maxim@#####.com"}
And we add a phone number to it:
{"name":"Maxim", "email":"maxim@#####.com", "phone":"+49#####"}
This is an additive change and can be considered non-breaking, unless you define the phone number as a required property in your own application model.
JSON as such does not have a concept of a required property. A JSON object is a list of key/value pairs; even the order of the pairs is not important. However, when we define a model class in our application, we might insist on a property being required. We do it because we want to have “clean” application logic. However, this implies that additive changes become breaking changes as well. Meaning that if a new client gets data from an old client, it will break, because old clients do not know about the phone number property. BTW: a new client being able to communicate with old clients is called backwards compatibility.
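Here is a minimal Swift sketch of how this plays out with a Codable model (the model and payload are just for illustration):

import Foundation

// New client’s model: phone is declared as required (non-optional).
struct Person: Codable {
    let name: String
    let email: String
    let phone: String
}

// Payload produced by an old client, which knows nothing about phone.
let oldPayload = Data("{\"name\": \"Maxim\", \"email\": \"maxim@#####.com\"}".utf8)

do {
    _ = try JSONDecoder().decode(Person.self, from: oldPayload)
} catch {
    // Fails with a keyNotFound error for "phone":
    // the additive change became a breaking one.
    print("decoding failed:", error)
}

Declaring the property as optional (phone: String?) would keep the change non-breaking.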
Another case of an additive breaking change is when we have something like an enum construct in our data:
enum Gender { case male, female }
And we translate the case names into strings:
{"name": "Maxim", "email": "maxim@#####.com", "gender":"male"}
If we decide to add more cases to the gender enum, it is an additive, non-destructive change, but it might break old clients. Say we change the gender enum like the following:
enum Gender { case male, female, other }
If the following object is received by an old client:
{"name": "Maxim", "email": "maxim@#####.com", "gender":"other"}
The old client will try to convert the gender property to its own model representation of gender. In the best case scenario it will produce a nil and lose/ignore the information about gender. In the worst case scenario it will crash trying to transform an unknown key. BTW: old clients being able to talk to new clients is called forwards compatibility.
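A small Swift sketch of both scenarios, assuming the enum is mapped via its String raw values:

// Old client’s model: knows only the original two cases.
enum Gender: String {
    case male, female
}

let received = "other" // value sent by a new client

// Best case: the failable initializer returns nil,
// and the gender information is silently lost.
let gender = Gender(rawValue: received) // nil

// Worst case: the old client force-unwraps and crashes at runtime.
// let crashing = Gender(rawValue: received)!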
Protocol Buffers is a binary data serialisation format based on a schema definition. So let’s define the schema for our first example:
syntax = "proto3";message Person {string name = 1;string email = 2;}
From this schema we can generate data classes for our language of choice. We will have a Person class with name and email properties. If we add a phone number to person:
syntax = "proto3";message Person {string name = 1;string email = 2;string phone = 3;}
and generate again, our Person class will get another property. The new version of Person will be able to transform messages created by the old version of Person and vice versa. Backwards and forwards compatibility is guaranteed by design.
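Here is a hedged sketch of that round trip in Swift, using types generated by the SwiftProtobuf plugin (in reality the old and new Person would live in two separate builds; putting them side by side is just for illustration):

import SwiftProtobuf

do {
    // Old client, generated from the schema without phone.
    var oldPerson = Person()
    oldPerson.name = "Maxim"
    oldPerson.email = "maxim@#####.com"
    let payload = try oldPerson.serializedData()

    // New client, generated from the schema with phone, can still
    // decode the old payload; phone simply keeps its default value.
    let newPerson = try Person(serializedData: payload)
    print(newPerson.phone) // ""
} catch {
    print("serialisation failed:", error)
}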
BTW: in proto3 the required keyword was removed, for the reasons I mentioned in the “Additive / non destructive changes in JSON” section.
An enum can be defined as follows:
enum Gender {
  MALE = 0;
  FEMALE = 1;
}
Additive changes to enums are handled as follows with regard to forward compatibility:
During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. (https://developers.google.com/protocol-buffers/docs/proto3#enum)
This is not perfect, but a big improvement compared to the undefined behaviour of enums in JSON.
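In Swift, for example, the SwiftProtobuf code generator adds an UNRECOGNIZED case to every proto3 enum for exactly this purpose. A sketch of how an old client could handle it:

// Old client, generated from the two-case schema.
func describe(_ gender: Gender) -> String {
    switch gender {
    case .male:
        return "male"
    case .female:
        return "female"
    case .UNRECOGNIZED(let rawValue):
        // The unknown numeric value is preserved, so re-serialising
        // the message does not lose the information.
        return "unknown gender (\(rawValue))"
    }
}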
FlatBuffers is also a binary data serialisation format based on a schema definition. However, its goal is slightly different. The main benefits of FlatBuffers are random value access and that we can store data as a DAG (directed acyclic graph).
In FlatBuffers the Person definition would look like the following:
table Person {
  name: string;
  email: string;
}
And if we want to add a phone number we can define it as follows:
table Person {
  name: string;
  email: string;
  phone: string;
}
FlatBuffers provides the same backwards and forwards compatibility guarantees as Protocol Buffers, however there is one small difference. In both formats the properties in the payload are stored not by property key but by their index/id. In Protocol Buffers we have to set the id explicitly: string phone = 3;. In FlatBuffers it is not needed, meaning the id is derived from the property’s position in the schema. So if we defined Person as follows:
table Person {
  name: string;
  phone: string;
  email: string;
}
We would introduce a breaking change. You might argue that having such a fragile implicit convention is a bad idea, and you are absolutely right. This is why in FlatBuffers we can add an explicit attribute to a property, defining its id:
table Person {
  name: string (id: 0);
  phone: string (id: 2);
  email: string (id: 1);
}
Now our new Person is compatible with the old one again.
In FlatBuffers we can also define structs. Structs are fixed-size value types. We would not be able to represent a person as a struct, because a person has string properties and strings are not fixed-size. However, I think it is important to mention that structs are fixed and do not support evolution. Therefore I would recommend using structs only for simple things which are defined by a standard and will not change in future versions, like for example an IPv4 address.
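A sketch of what such a struct could look like in a FlatBuffers schema (the IPv4 type and its field names are my own illustration):

struct IPv4 {
  a: ubyte;
  b: ubyte;
  c: ubyte;
  d: ubyte;
}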
The definition of an enum in FlatBuffers looks like the following:
enum Gender : byte { male, female }
They are a bit more rigid than in Protocol Buffers. Both are stored as numbers, but in the case of Protocol Buffers the number is a VLQ (variable-length quantity), so we don’t need to care about the size. In FlatBuffers we have to define the size; in our case it is 1 byte (byte), which can also become an unlikely but still possible breaking change, if the number of cases grows beyond 127.
In the case of FlatBuffers, the handling of an unknown case (forward compatibility) is also at the mercy of the particular language implementation.
Imagine we got a new requirement to remove the email property.
In the case of JSON, it is not a big deal, unless we defined email as required. As mentioned before, required is a bad idea when it comes to data evolution. Otherwise, if a new client receives a JSON which contains an email property, it will just ignore it.
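A minimal Swift sketch: JSONDecoder simply ignores keys the model does not declare (model and payload are illustrative):

import Foundation

// New model: email has been removed.
struct Person: Codable {
    let name: String
}

// Old payload still contains email; the extra key is ignored.
let oldPayload = Data("{\"name\": \"Maxim\", \"email\": \"maxim@#####.com\"}".utf8)
let person = try? JSONDecoder().decode(Person.self, from: oldPayload)
print(person?.name ?? "") // "Maxim"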
In the case of Protocol Buffers and FlatBuffers, it is possible to mark a property as deprecated. This means that you will not be able to set this property in the new version of the Person class any more.
In Protocol Buffers deprecation comes at zero cost, as the message contains only the values that are actually stored. In FlatBuffers, because of random value access, a stored object has so-called vTables, which point to the values. If a value is deprecated, it will still be present in the vTable. This implies that a deprecation will still cost us 2 bytes per vTable (a vTable can be reused between objects, but I will not go into details about that in this article).
Another nice addition in proto3 is that we don’t even have to keep the definition of the deprecated property in the schema; we can use the reserved keyword with the id to make sure that this particular id will not be used in future versions.
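For example, after removing phone this way, the schema could look like the following:

syntax = "proto3";

message Person {
  reserved 3; // the old phone id can never be reused
  string name = 1;
  string email = 2;
}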
As a matter of fact, the property name is unimportant for FlatBuffers and Protocol Buffers, as properties are stored and looked up by id and not by name. So we can also rename a deprecated property:
table Person {name: string;__phone: string (deprecated);email: string;}
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
Let’s say we realised that name is not an accurate description of the property and we should change it to fullName or maybe full_name. This is a breaking change in JSON. An old client will ask the person object for “name”, so if a new client starts sending “full_name” instead, the two will not be able to communicate any more.
As mentioned before, FlatBuffers and Protocol Buffers do not store properties by name; they store them by id/index. Only the generated code reflects the name we put in the schema. Here is a simplified example of what generated code does:
var name: String {
    return self[0] as! String
}
So changing the property name in the schema has an effect on the code, but not on the underlying data representation.
var fullName: String {
    return self[0] as! String
}
This means that renaming properties is not a breaking change in FlatBuffers and Protocol Buffers. The same goes for renaming enum cases, because the cases are also stored internally as ints and not as strings.
Sometimes we realise that the type we chose for a property is not sufficient. Let’s say we want to keep the property name, but it should now store an object which contains first and last names:
{"name": {"firstName" : "Maxim", "lastName": "Zaks"}}
This would be a breaking change in JSON, FlatBuffers and Protocol Buffers.
However, in FlatBuffers and Protocol Buffers we could perform the following trick. We could deprecate the old name property, rename it to oldName and introduce a new property name of type FullName.
table Person {
  oldName: string (deprecated);
  email: string;
  name: FullName;
}
This is a series of additive, non-destructive changes which provides a clean evolution of the data representation.
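A hedged Swift sketch of how a new client could bridge both generations of the payload (the accessor names are assumptions based on the schema above):

// Old payloads carry oldName, new payloads carry a FullName object.
func displayName(of person: Person) -> String {
    if let fullName = person.name {
        return "\(fullName.firstName) \(fullName.lastName)"
    }
    // Fallback for messages written by old clients.
    return person.oldName ?? ""
}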
Change is inevitable, but with the right tools and a bit of creativity, it is possible to make it non-breaking.
Thank you for reading, I appreciate a clap or two.