FlatBuffers in 2024: Can We Recreate Old Success? Performance Optimization Takes Center Stage!

Hi there!

What is this article about?

In 2021, I worked on a project where there were no extra funds for almost everything – there were no resources from AWS, and the mobile phones for which we were developing software were very basic, performing only essential functions and providing internet access. At times, these were tablets from our supplier, which were also not flagship devices and consistently lagged.

In this article, I will share how we significantly optimized the speed of our services, and later, we will write new versions in 2024 and put them to the test!

Happy reading!

Let’s start.

A client approached us with a three-month timeline for launching an MVP to be tested by real users. Our task was to develop a relatively straightforward backend for a mobile application. The client provided detailed requirements, specifications, and integration modules from the outset. The primary goal was to collect data from the mobile application, review it, and send it to the specified integrations. Essentially, our role was to function as a validating proxy service that also recorded events.

What’s the usual challenge we’re facing? It’s either cranking out a quick microservice or a combo of services that’ll be catching requests from the app. Most times, our clients are rocking top-notch gear and flagship devices.

But what if our case is:

A feeble AWS cluster that needs to squeeze in over 10 logic services plus monitoring.
Our phones are like special Android gadgets with no more than 4GB RAM, often tablets.
We’re frequently shooting snapshots from the app to the backend.
We need to validate a chunk of data before pushing it further down the business flow.

AS IS

So, here’s the deal — weak backend, feeble devices, 1 MVP, and 3 devs. The mission is to expand as minimally as possible, not blowing extra cash on AWS while we’re in the MVP zone.

If the MVP nails it, resources flow.

If not, the project might hit pause.

Sounds like a challenge?

We rolled up our sleeves and started experimenting. What’s on the market and what we’re doing with services:

REST (JSON)
gRPC (Proto, binary)
“Special guest” (binary too)

One more note on the specification. A sample document looks like this:

{
  "docs": {
    "name": "name_for_documents",
    "department": {
      "code": "uuid_code",
      "time": 123123123,
      "employee": {
        "name": "Ivan",
        "surname": "Polich",
        "code": "uuidv4"
      }
    },
    "price": {
      "categoryA": "1.0",
      "categoryB": "2.0",
      "categoryC": "3.0"
    },
    "owner": {
      "uuid": "uuid",
      "secret": "dsfdwr32fd0fdspsod"
    },
    "data": {
      "transaction": {
        "type": "CODE",
        "uuid": "df23erd0sfods0fw",
        "pointCode": "01"
      }
    },
    "delivery": {
      "company": "TTC",
      "address": {
        "code": "01",
        "country": "uk",
        "street": "Main avenue",
        "apartment": "1A"
      }
    },
    "goods": [
      {
        "name": "toaster v12",
        "amount": 15,
        "code": "12312reds12313e1"
      }
    ]
  }
}

For instance, we have a compact service with just two methods:

Save docs and validate department code, delivery company, and address.
Find all with limit/offset pagination.
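To make the two methods concrete, here is a minimal sketch of the validation and pagination logic with no HTTP or database involved. The `Document` type is a trimmed-down, hypothetical version of the sample above, and the rule "checked fields must not be empty" is an assumption, since the real validation rules are not spelled out here.

```go
package main

import (
	"errors"
	"fmt"
)

// Address and Document are hypothetical, trimmed-down types that keep
// only the fields the Save method is said to validate.
type Address struct {
	Code    string
	Country string
	Street  string
}

type Document struct {
	Name            string
	DepartmentCode  string
	DeliveryCompany string
	Address         Address
}

// validate rejects a document when any checked field is empty.
// The real rules (allowed codes, known companies) are an open assumption.
func validate(d Document) error {
	switch {
	case d.DepartmentCode == "":
		return errors.New("department code is required")
	case d.DeliveryCompany == "":
		return errors.New("delivery company is required")
	case d.Address.Code == "" || d.Address.Country == "" || d.Address.Street == "":
		return errors.New("delivery address is incomplete")
	}
	return nil
}

// paginate returns at most limit items starting at offset.
// Out-of-range values yield an empty slice instead of panicking.
func paginate(docs []Document, limit, offset int) []Document {
	if limit <= 0 || offset < 0 || offset >= len(docs) {
		return []Document{}
	}
	end := offset + limit
	if end > len(docs) {
		end = len(docs)
	}
	return docs[offset:end]
}

func main() {
	docs := []Document{
		{Name: "a", DepartmentCode: "d1", DeliveryCompany: "TTC",
			Address: Address{Code: "01", Country: "uk", Street: "Main avenue"}},
		{Name: "b", DepartmentCode: "d2", DeliveryCompany: "TTC",
			Address: Address{Code: "02", Country: "uk", Street: "Side street"}},
		{Name: "c"}, // fails validation: no department code
	}
	for _, d := range docs {
		fmt.Println(d.Name, validate(d))
	}
	fmt.Println("page size:", len(paginate(docs, 2, 1)))
}
```

In the real services the slice would be replaced by a database query, but the limit/offset arithmetic stays the same.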

Stage 1. REST service

Nothing special: we are going to create a small service with Gin Gonic and the standard net/http library. As a good example, see “Golang RESTful API”: click here

Let’s code something like this. Full code here: Github json

const (
 post = "/report"
 get  = "/reports"
 TTL  = 5
)

func main() {
 router := gin.Default()
 p := ginprometheus.NewPrometheus("gin")
 p.Use(router)

 sv := service.NewReportService()
 gw := middle.NewHttpGateway(*sv)

 router.POST(post, gw.Save)
 router.GET(get, gw.Find)

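 srv := &http.Server{
  Addr:    "localhost:8080",
  Handler: router,
 }

 // Start serving; without this call the process would exit right away.
 if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
  log.Fatal(err)
 }
}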

The benchmark below, `BenchmarkCreateAndMarshal`, measures one full round trip: creating a document, marshaling it to JSON, and unmarshaling it back.

// BenchmarkCreateAndMarshal-10       168706       7045 ns/op
func BenchmarkCreateAndMarshal(b *testing.B) {
 for i := 0; i < b.N; i++ {
  doc := createDoc()
  _ = doc.Docs.Name // for tests

  bt, err := json.Marshal(doc)
  if err != nil {
   log.Fatal("parse error")
  }

  parsedDoc := new(m.Document)
  if json.Unmarshal(bt, parsedDoc) != nil {
   log.Fatal("parse error")
  }
  _ = parsedDoc.Docs.Name
 }
}

BenchmarkCreateAndMarshal-10: This is the benchmark name; the -10 suffix is the GOMAXPROCS value used for the run.
168706: This is the number of iterations that were executed during the test.
7045 ns/op: This is the average time taken for one iteration in nanoseconds. Here, ns/op stands for nanoseconds per operation.

Thus, one create, marshal, and unmarshal round trip takes approximately 7045 nanoseconds, averaged over 168706 iterations.
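If you want to see where a number like ns/op comes from, here is a rough sketch that times a JSON round trip by hand using only the standard library. The `Doc` and `Good` types and the `nsPerOp` helper are hypothetical stand-ins, not the code from the repository; `go test -bench` does the same division but picks the iteration count for you.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Doc and Good are hypothetical stand-ins for the article's document model.
type Doc struct {
	Name  string `json:"name"`
	Goods []Good `json:"goods"`
}

type Good struct {
	Name   string `json:"name"`
	Amount int    `json:"amount"`
}

// roundTrip marshals a document to JSON and parses it back.
func roundTrip(d Doc) (Doc, error) {
	raw, err := json.Marshal(d)
	if err != nil {
		return Doc{}, err
	}
	var out Doc
	err = json.Unmarshal(raw, &out)
	return out, err
}

// nsPerOp runs op n times and returns the average wall-clock
// time per call in nanoseconds: total elapsed time divided by n.
func nsPerOp(n int, op func()) float64 {
	start := time.Now()
	for i := 0; i < n; i++ {
		op()
	}
	return float64(time.Since(start).Nanoseconds()) / float64(n)
}

func main() {
	doc := Doc{Name: "name_for_documents", Goods: []Good{{Name: "toaster v12", Amount: 15}}}
	avg := nsPerOp(100000, func() {
		if _, err := roundTrip(doc); err != nil {
			panic(err)
		}
	})
	// 1e9 ns in a second, so ops/s is simply 1e9 divided by ns/op.
	fmt.Printf("%.0f ns/op, about %.0f ops/s on one core\n", avg, 1e9/avg)
}
```

The absolute numbers will differ from machine to machine, so only compare results taken on the same hardware.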

This is where our journey began, and it is the first checkpoint on our path. Was it enough to launch? Yes! Would it hold up for long? No.

We successfully completed the first part and launched the MVP for testing, including with synthetic loads. What was the result? We ran out of memory; when transmitting a batch of reports, our environment turned red from the load, and we were operating, but not as fast as we could.

From here, a new branch of our exploration opens up. Why add memory when we can use some processes more efficiently? Yes, we're talking about serialization, and the second chapter begins, significantly speeding up our processing.

Stage 2. Move to gRPC

If you’re new to gRPC, no worries: taking it step by step works well. I remember being in the same boat, and copying and pasting from the documentation is a perfectly good way to grasp the concepts and see how things work. Happy coding! 🚀 Read more here: https://protobuf.dev/overview/

So, gRPC:

gRPC provides more efficient and compact binary communication than text-based JSON over plain HTTP.

Type: Oriented towards transferring binary data and structured messages.
Protocol: Supports long-lived connections with streaming and full-duplex communication.
Data Format: Protocol Buffers (protobuf) — a binary data serialization format.
Transport: Uses HTTP/2 as the transport protocol.

In rapid development, a shared schema is crucial because you’ve got to think about backward compatibility and all that jazz. If you’re just changing everything on the fly and not storing the spec anywhere, it’s a recipe for disaster. You’ll end up chasing bugs related to backward compatibility! Plus, you need to nail down versioning.

In the simple REST world, we use Swagger or OAS3 tools like Apicurio and such — it saves time and makes the process more transparent, but it still takes time. And guess what, let’s take another look at Protobuf — it already comes with a schema, and there’s a version of it (if we store it in Git) — a massive plus. Share it with the team, and now everyone has the spec.

Ok, how does it work?

Here’s an easy example. Let’s write a file named example.proto:

syntax = "proto3";

message Person {
 string name = 1;
 int32 id = 2;
 optional string email = 3;
}

And generate it:

protoc --python_out=. example.proto

This will create the file `example_pb2.py`, which contains the generated code for working with the data defined in `example.proto`. Usage in Python:

import example_pb2

# Create a Person object
person = example_pb2.Person()
person.name = "John"
person.id = 123
person.email = "john@example.com"

# Serialize to binary format
serialized_data = person.SerializeToString()

# Deserialize from binary format
new_person = example_pb2.Person()
new_person.ParseFromString(serialized_data)

Here, we create a `Person` object, set its fields, serialize it to a binary format, and then deserialize it back. Note: `example_pb2` is the generated module created by the protobuf compiler. Protocol Buffers provide a binary data format that is compact and efficient for transmission. It also supports various programming languages, making it convenient for use in different parts of your technology stack.

Still no idea how it works? Let’s use the previous example of `Person`. Imagine we have a `Person` object with filled fields:

Person person = {
 name: "John Doe",
 id: 123,
 email: "john@example.com"
};

When this object is serialized into binary format, each field will be represented as a tagged element. In this case, the tags are the numbers 1, 2, and 3. After serialization, the binary data stream might look something like this (in a simplified form):

0A 08 4A 6F 68 6E 20 44 6F 65 10 7B 1A 10 6A 6F 68 6E 40 65 78 61 6D 70 6C 65 2E 63 6F 6D

Let’s break it down:

0A is the key for tag 1 (the name field): field number 1 combined with wire type 2 (length-delimited). It is followed by the field’s length, 08.
4A 6F 68 6E 20 44 6F 65 represents the ASCII codes for the string “John Doe.”
10 is the key for tag 2 (the id field) with wire type 0, followed by the value 123 (7B) in variable-length encoding (Varint).
1A is the key for tag 3 (the email field) with wire type 2, followed by the string length 16 (10) and the ASCII codes for the string “john@example.com.”

Thus, tags and their order allow parsing the serialized data stream and determining which field contains what information. Benefits of using a binary format and tags:

Efficiency: Binary format provides a more compact representation of data, reducing the volume of transmitted information over the network.
Speed: Serialization and deserialization operations are faster since binary data can be processed efficiently.
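To see the tag layout in action, here is a small sketch that encodes the `Person` example by hand, following the Protocol Buffers wire format: every field starts with a key equal to (field number << 3) | wire type, strings carry a length prefix, and integers are varints. The helper names are made up for this illustration; in real code the generated `proto.Marshal` path does all of this for you.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	wireVarint = 0 // int32, int64, bool, enum
	wireBytes  = 2 // string, bytes, nested messages
)

// appendUvarint appends v in base-128 varint form,
// the same integer encoding protobuf uses.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// appendKey writes the field key: (field number << 3) | wire type.
func appendKey(buf []byte, field, wireType uint64) []byte {
	return appendUvarint(buf, field<<3|wireType)
}

// appendString writes key, byte length, then the raw string bytes.
func appendString(buf []byte, field uint64, s string) []byte {
	buf = appendKey(buf, field, wireBytes)
	buf = appendUvarint(buf, uint64(len(s)))
	return append(buf, s...)
}

// appendVarintField writes key, then the value as a varint.
func appendVarintField(buf []byte, field, v uint64) []byte {
	buf = appendKey(buf, field, wireVarint)
	return appendUvarint(buf, v)
}

// encodePerson lays out name (1), id (2), and email (3) in field order.
func encodePerson(name string, id uint64, email string) []byte {
	var buf []byte
	buf = appendString(buf, 1, name)
	buf = appendVarintField(buf, 2, id)
	buf = appendString(buf, 3, email)
	return buf
}

func main() {
	// Prints the message as space-separated hex bytes.
	fmt.Printf("% X\n", encodePerson("John Doe", 123, "john@example.com"))
}
```

The whole message takes 30 bytes, with only two bytes of overhead per string field and no field names on the wire.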

It’s time to create our proto service with a specification:

syntax = "proto3";

package docs;
option go_package = "proto-docs-service/docs";

service DocumentService {
    rpc GetAllByLimitAndOffset(GetAllByLimitAndOffsetRequest) returns (GetAllByLimitAndOffsetResponse) {}
    rpc Save(SaveRequest) returns (SaveResponse) {}
}

message GetAllByLimitAndOffsetRequest {
    int32 limit = 1;
    int32 offset = 2;
}

message GetAllByLimitAndOffsetResponse {
    repeated Document documents = 1;
}

message SaveRequest {
    Document document = 1;
}

message SaveResponse {
    string message = 1;
}

message Document {
  string name = 1;
  Department department = 2;
  Price price = 3;
  Owner owner = 4;
  Data data = 5;
  Delivery delivery = 6;
  repeated Goods goods = 7;
}

message Department {
  string code = 1;
  int64 time = 2;
  Employee employee = 3;
}

message Employee {
  string name = 1;
  string surname = 2;
  string code = 3;
}

message Price {
  string categoryA = 1;
  string categoryB = 2;
  string categoryC = 3;
}

message Owner {
  string uuid = 1;
  string secret = 2;
}

message Data {
  Transaction transaction = 1;
}

message Transaction {
  string type = 1;
  string uuid = 2;
  string pointCode = 3;
}

message Delivery {
  string company = 1;
  Address address = 2;
}

message Address {
  string code = 1;
  string country = 2;
  string street = 3;
  string apartment = 4;
}

message Goods {
  string name = 1;
  int32 amount = 2;
  string code = 3;
}

Next, we generate the Go code with a simple script:

# first-time setup:
brew install protobuf
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
export PATH="$PATH:$(go env GOPATH)/bin"

# only generator
cd .. && cd grpc
mkdir "docs"
protoc --go_out=./docs --go_opt=paths=source_relative \
    --go-grpc_out=./docs --go-grpc_opt=paths=source_relative docs.proto

If everything went well, the docs folder now contains the generated Go files.

How do you test locally? I prefer BloomRPC (unfortunately, it has been deprecated :D; Postman can do the same). This time, I will skip the implementation details of the server and the document-processing logic. However, once again, we will write a benchmark. Naturally, we expect a big speedup!

// BenchmarkCreateAndMarshal-10       651063       1827 ns/op
func BenchmarkCreateAndMarshal(b *testing.B) {
 for i := 0; i < b.N; i++ {
  doc := CreateDoc()
  _ = doc.GetName()
  r, e := proto.Marshal(&doc)
  if e != nil {
   log.Fatal("problem with marshal")
  }

  nd := new(docs.Document)
  if proto.Unmarshal(r, nd) != nil {
   log.Fatal("problem with unmarshal")
  }
  _ = nd.GetName()
 }
}

This benchmark has the same shape as the JSON one: create a document, marshal it, and unmarshal it back. On average, one iteration takes 1827 nanoseconds over 651063 iterations, roughly 3.9 times faster than JSON. So, the full code here: click pls

It may seem like a success, and we could have stopped, but why? It appears that we can still squeeze more performance out of the service and achieve better results, but how? And here comes the final chapter of the story...

Stage 3. Move to FlatBuffers

And now, let’s introduce our guest among the protocols — most of you probably haven’t even heard of it — it’s FlatBuffers.

FlatBuffers steps onto the scene with a different swagger compared to the others. Picture this: no parsing is needed on the client side. Why? Because the data is accessed directly, no unpacking is required. This makes it super efficient for mobile devices and resource-constrained environments. It’s like handing over a ready-to-eat meal instead of making your client cook it up. The schema is minimalistic, and you get a flat binary right away. Plus, no versioning headache because you can add new fields without breaking anything — a win for the backward compatibility game. Example:

Person person;
person.id = 123;
person.name = "John Doe";
person.age = 30;

Let’s represent the serialized bytes in hexadecimal format for the given Person structure:

// Serialized bytes (hexadecimal representation)
// (assuming little-endian byte order)
19 00 00 00                   // Data size (including these 4 bytes)
7B 00 00 00                   // ID (123 in little-endian byte order)
09 00 00 00                   // Name string length (including null-terminator)
4A 6F 68 6E 20 44 6F 65 00    // Name ("John Doe" in ASCII, including null-terminator)
1E 00 00 00                   // Age (30 in little-endian byte order)

In this example:

The first 4 bytes represent the data size, including these 4 bytes. In this case, the size is 25 bytes (0x19).
The next 4 bytes represent the id (123 in little-endian byte order).
Following that, 4 bytes represent the length of the name string (9 bytes).
The subsequent 9 bytes represent the name string “John Doe,” including the null-terminator.
The last 4 bytes represent the age (30, or 0x1E, in little-endian byte order).

Please note that this is a simplified illustration rather than the exact FlatBuffers wire format. A real FlatBuffers buffer also stores a root offset and a vtable of field offsets, which is what lets a reader jump straight to a field without parsing the rest.
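The "no parsing" idea can be sketched with nothing but the standard library. The code below uses a simplified layout of the same shape as the illustration above (size, id, name length, name, age); it is not the real FlatBuffers format, and the function names are made up for the example. The point is that each accessor reads its field straight from the byte slice instead of unpacking the whole message first.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encode writes a simplified flat layout (NOT real FlatBuffers):
//   [0:4]        total size, uint32 little-endian
//   [4:8]        id
//   [8:12]       name length, including the NUL terminator
//   [12:12+n]    name bytes followed by NUL
//   [12+n:16+n]  age
func encode(id uint32, name string, age uint32) []byte {
	nameLen := len(name) + 1
	buf := make([]byte, 16+nameLen)
	binary.LittleEndian.PutUint32(buf[0:], uint32(len(buf)))
	binary.LittleEndian.PutUint32(buf[4:], id)
	binary.LittleEndian.PutUint32(buf[8:], uint32(nameLen))
	copy(buf[12:], name) // the trailing byte stays 0 as the terminator
	binary.LittleEndian.PutUint32(buf[12+nameLen:], age)
	return buf
}

// The accessors below read directly from the buffer at known offsets.
// A production reader would also bounds-check untrusted input.

func readID(buf []byte) uint32 {
	return binary.LittleEndian.Uint32(buf[4:])
}

func readName(buf []byte) string {
	n := int(binary.LittleEndian.Uint32(buf[8:]))
	return string(buf[12 : 12+n-1]) // drop the NUL terminator
}

func readAge(buf []byte) uint32 {
	n := int(binary.LittleEndian.Uint32(buf[8:]))
	return binary.LittleEndian.Uint32(buf[12+n:])
}

func main() {
	buf := encode(123, "John Doe", 30)
	fmt.Printf("% X\n", buf)
	// Reading age touches 8 bytes of the buffer, not the whole message.
	fmt.Println(readID(buf), readName(buf), readAge(buf))
}
```

Real FlatBuffers goes further by storing per-field offsets in a vtable, so optional and newly added fields can be skipped just as cheaply.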

This time around, we can’t just breeze through because we’ll have to write all the serialization code ourselves. However, we’re expecting some silver linings from this.

Sure, we’ll need to get our hands dirty with manual serialization coding, but the potential payoffs are worth it. It’s a bit of a trade-off — more effort upfront, but the control and potential performance boost might just make it a sweet deal in the long run. After all, sometimes you gotta get your hands deep in the code to make the magic happen, right?

Code sample here: GitHub

// BenchmarkCreateAndMarshalBuilderPool-10      1681384        711.2 ns/op
func BenchmarkCreateAndMarshalBuilderPool(b *testing.B) {
 builderPool := builder.NewBuilderPool(100)

 for i := 0; i < b.N; i++ {
  currentBuilder := builderPool.Get()

  buf := BuildDocs(currentBuilder)
  doc := sample.GetRootAsDocument(buf, 0)
  _ = doc.Name()

  sb := doc.Table().Bytes
  cd := sample.GetRootAsDocument(sb, 0)
  _ = cd.Name()

  builderPool.Put(currentBuilder)
 }
}

Since we’re in the “do-it-yourself optimization” mode, I decided to whip up a small pool of builders that I clear after use. This way, we can recycle them without allocating memory again and again.

It’s a bit like having a toolkit that we tidy up after each use — keeps things tidy and efficient. Why waste resources on creating new builders when we can repurpose the ones we’ve got, right? It’s all about that DIY efficiency.

const builderInitSize = 1024

// Pool - pool with builders.
type Pool struct {
 mu     sync.Mutex
 pool   chan *flatbuffers.Builder
 maxCap int
}

// NewBuilderPool - create new pool with max capacity (maxCap)
func NewBuilderPool(maxCap int) *Pool {
 return &Pool{
  pool:   make(chan *flatbuffers.Builder, maxCap),
  maxCap: maxCap,
 }
}

// Get - return builder or create new if it is empty
func (p *Pool) Get() *flatbuffers.Builder {
 p.mu.Lock()
 defer p.mu.Unlock()

 select {
 case builder := <-p.pool:
  return builder
 default:
  return flatbuffers.NewBuilder(builderInitSize)
 }
}

// Put return builder to the pool
func (p *Pool) Put(builder *flatbuffers.Builder) {
 p.mu.Lock()
 defer p.mu.Unlock()

 builder.Reset()

 select {
 case p.pool <- builder:
  // return to the pool
 default:
  // ignore
 }
}
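For comparison, here is a leaner variant of the same pooling idea. It uses `bytes.Buffer` as a stand-in for `flatbuffers.Builder` so the sketch runs with the standard library alone, and it drops the mutex, because sends and receives on a buffered channel are already safe for concurrent use. The `BufferPool` name is hypothetical; treat this as one possible shape, not the code from the repository.

```go
package main

import (
	"bytes"
	"fmt"
)

// BufferPool is a fixed-capacity pool backed by a buffered channel.
// The channel itself synchronizes access, so no extra lock is needed.
type BufferPool struct {
	pool chan *bytes.Buffer
}

// NewBufferPool creates a pool that keeps at most maxCap idle buffers.
func NewBufferPool(maxCap int) *BufferPool {
	return &BufferPool{pool: make(chan *bytes.Buffer, maxCap)}
}

// Get returns an idle buffer, or allocates a new one if the pool is empty.
func (p *BufferPool) Get() *bytes.Buffer {
	select {
	case b := <-p.pool:
		return b
	default:
		return new(bytes.Buffer)
	}
}

// Put resets the buffer and returns it to the pool.
// If the pool is already full, the buffer is dropped for the GC.
func (p *BufferPool) Put(b *bytes.Buffer) {
	b.Reset()
	select {
	case p.pool <- b:
	default:
	}
}

func main() {
	p := NewBufferPool(2)

	b := p.Get()
	b.WriteString("payload")
	p.Put(b)

	again := p.Get()
	// The same buffer comes back, already emptied by Reset.
	fmt.Println(again == b, again.Len())
}
```

The standard library's `sync.Pool` is another option when a hard cap on idle objects is not needed.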

Time to check results

Now, let’s dive into the results of our tests, and here’s what we see:

protocol	iterations	speed
json	168706	7045 ns/op
proto	651063	1827 ns/op
flat	1681384	711.2 ns/op

Well, well, well: looks like Flat is the speed demon here, leaving the others in the dust. It is roughly 10 times faster than JSON and about 2.6 times faster than Proto. The numbers don’t lie, and it seems like our DIY optimization is paying off big time!
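The ratios between the three results are easy to check. This small sketch takes the ns/op values from the table above and derives the speedup and a rough single-core throughput; the helper names are invented for the example.

```go
package main

import "fmt"

// speedup tells how many times faster the candidate is than the baseline.
func speedup(baselineNs, candidateNs float64) float64 {
	return baselineNs / candidateNs
}

// opsPerSecond converts ns/op into operations per second.
func opsPerSecond(nsPerOp float64) float64 {
	return 1e9 / nsPerOp
}

func main() {
	// ns/op values copied from the benchmark table.
	results := []struct {
		name string
		ns   float64
	}{
		{"json", 7045},
		{"proto", 1827},
		{"flat", 711.2},
	}

	flat := results[2].ns
	for _, r := range results {
		fmt.Printf("%-5s %8.1f ns/op  %9.0f ops/s  flat is %.1fx faster\n",
			r.name, r.ns, opsPerSecond(r.ns), speedup(r.ns, flat))
	}
}
```

Keep in mind these are micro-benchmark figures for serialization alone; they say nothing yet about a full request through the network and the database.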

Well, then, testing is all good, but let's try writing services using these protocols and see what results we get!

Technical requirements:

Language: Golang
HTTP framework: Gin Gonic
Database: MongoDB

Now it’s time to put our protocols to the real test — we’ll spin up the services, hook them up with Prometheus metrics, add MongoDB connections, and generally make them full-fledged services. We might skip tests for now, but that’s not the priority.

In the classic setup, as mentioned earlier, we’ll have two methods — save and find by limit and offset. We’ll implement these for all three implementations and stress test the whole shebang using Yandex Tank + Pandora.

To keep it simple on the graph side, I’m using a Yandex service called Overload, and I’ll leave links to our tests. Let’s get down to business!

Save method, 1000 rps, 60 sec, profile:

rps: { duration: 60s, type: const,  ops: 1000 }

Results (response time percentiles, in ms):

JSON	first	second
99%	1.630	1.260
98%	1.160	1.070
95%	1.000	0.920

Links: first test and second.

PROTO	first	second
99%	1.800	2.040
98%	1.380	1.540
95%	1.160	1.220

Links: first test and second.

FLAT	first	second
99%	3.220	3.010
98%	2.420	2.490
95%	1.850	1.840

Links: first test and second
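A quick reminder of how to read these tables: a 99% value of 1.630 means that 99 out of 100 requests finished within that time. Here is a minimal sketch of a percentile calculation using the nearest-rank method. This is an assumption for illustration only; the load-testing tools may use a different interpolation, and the sample latencies are made up.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples
// using the nearest-rank method. The input slice is left untouched.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	// Rank is the smallest position that covers p percent of the samples.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical response times in milliseconds.
	latencies := []float64{0.8, 0.9, 0.7, 1.1, 0.6, 3.2, 0.9, 1.0, 0.8, 1.4}
	for _, p := range []float64{95, 98, 99} {
		fmt.Printf("%.0f%%: %.3f ms\n", p, percentile(latencies, p))
	}
}
```

High percentiles are driven by the slowest few requests, which is why they move more between runs than the 95% line does.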

And now, let’s throw in another method that covers that very case I mentioned at the beginning — we need to quickly extract a field from the request and validate it. If there are any issues, we reject the request; if it’s all good, we proceed.
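For the JSON service, that "extract one field and validate it" step might look like the sketch below. The struct mirrors only the docs.department.code path of the sample document, and the list of known departments is hypothetical. Note that `encoding/json` still has to scan the whole payload; it just skips building values for the fields that are not declared.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// departmentOnly declares just the path docs.department.code.
// Keys without a matching struct field are ignored by encoding/json.
type departmentOnly struct {
	Docs struct {
		Department struct {
			Code string `json:"code"`
		} `json:"department"`
	} `json:"docs"`
}

var errRejected = errors.New("request rejected")

// knownDepartments is a hypothetical allow-list used only for this sketch.
var knownDepartments = map[string]bool{"uuid_code": true}

// validateRequest pulls the department code out of the body and
// rejects the request if the JSON is malformed or the code is unknown.
func validateRequest(body []byte) error {
	var req departmentOnly
	if err := json.Unmarshal(body, &req); err != nil {
		return fmt.Errorf("%w: malformed JSON", errRejected)
	}
	code := req.Docs.Department.Code
	if !knownDepartments[code] {
		return fmt.Errorf("%w: unknown department %q", errRejected, code)
	}
	return nil
}

func main() {
	good := []byte(`{"docs":{"name":"n","department":{"code":"uuid_code","time":123},"goods":[{"name":"toaster v12"}]}}`)
	bad := []byte(`{"docs":{"department":{"code":"nope"}}}`)

	fmt.Println(validateRequest(good))
	fmt.Println(validateRequest(bad))
}
```

With Protobuf or FlatBuffers the same check goes through generated accessors instead, which is where the formats start to differ in cost.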

Validate method, 1000 rps, 60 sec, same profile:

    rps: { duration: 60s, type: const,  ops: 1000 }

JSON	first	second
99%	1.810	1.980
98%	1.230	1.290
95%	0.970	1.070

Links: first test and second.

PROTO	first	second
99%	1.060	1.010
98%	0.700	0.660
95%	0.550	0.530

Links: first test and second.

FLAT	first	second
99%	2.920	3.010
98%	2.170	2.490
95%	1.540	1.510

Links: first test and second

Conclusion

We ran these experiments to speed up our application step by step, and that is what inspired this article. Back in 2021 we moved from JSON to Proto2, then to Proto3, but the real performance boost came only with FlatBuffers. That boost was substantial, and the whole approach fit well with our logic and processes. Fast forward over two years, and the developers and the community have significantly improved Protobuf. Our stack has changed too: we now use the high-performance Go language instead of Kotlin with coroutines and Spring Boot.

So, if you ever find yourself in a situation where rapid serialization is crucial, consider FlatBuffers.

The 2024 results are quite ambiguous. Serialization with FlatBuffers is, as expected, faster than plain JSON. The load tests were the surprise: we expected FlatBuffers to win the validation case, but gRPC came out on top. So why did another protocol win under load?

For these examples, I wrote a couple of deliberately simple services that only process incoming messages or "check" them. As we can see, gRPC and Protocol Buffers are now the winners in the validation case. Treat this experiment as a baseline: if you need to speed up serialization and message transmission down the chain, test it on your own process. Keep in mind that the programming language and stack you use matter as well. And once again, I want to underline the importance of building an MVP first if you are thinking about migrating to another serialization protocol.

If we talk about serialization:

JSON: In the serialization benchmark, JSON is the slowest of the three, with an execution time of approximately 7045 nanoseconds per operation.

Protobuf: Protobuf is considerably more efficient than JSON, with an execution time of about 1827 nanoseconds per operation in the same benchmark.

FlatBuffers: FlatBuffers stands out among the others, with a significantly lower execution time of around 711.2 nanoseconds per operation in the same benchmark.

These results show that FlatBuffers has a significant advantage over JSON and Protobuf in raw serialization speed. It has a steeper learning curve and takes more manual code, but the payoff suggests that investing in performance optimization can be worth it in the long run.

So, let's summarize:

Judging by the load tests of the save method, JSON still works faster than the others - the numbers don't lie, right? Then again, we created a service that just saves everything to a database. But what if we need network calls to other services besides the database? There, gRPC has more advantages, as it operates on the HTTP/2 protocol.

If you need to serialize data quickly, use FlatBuffers.
If you have many services and need to transmit requests between them, use gRPC - nothing seems to match its speed.
If you just need a JSON translator service from a phone to a database, choose REST + JSON.
If you need to save memory on the device and can wait a bit for processing on the server, use FlatBuffers.

Looking at our metrics, we're talking about 1-3 ms - just think how fast that is!

I hope this was helpful to you. Thank you! 🙏

Here are my articles about load testing, in case they are useful:

https://hackernoon.com/turbocharge-load-testing-yandextank-ghz-combo-for-lightning-fast-code-checks?embedable=true

https://hackernoon.com/leveraging-yandex-pandora-stress-testing-grpc-and-flatbuffer-services?embedable=true
