Protobuf Under the Hood: How Serialization and Deserialization Work in Go

by Tatyana, November 26th, 2024

Too Long; Didn't Read

Protocol Buffers (Protobuf) is a fast, efficient, and language-agnostic data serialization mechanism. It converts structured data into a compact binary format, which can then be deserialized back into its original form. In this article, we will dive deep into the internals of how Protobuf serialization and deserialization work in Go.


Efficient Serialization and Deserialization in Protobuf with Go: A Deep Dive

When building distributed systems, microservices, or any performance-critical application, handling data efficiently is paramount. Protocol Buffers (Protobuf) by Google is a fast, efficient, and language-agnostic data serialization mechanism allowing compact and optimized binary data formats. In this article, we will dive deep into the internals of how Protobuf serialization and deserialization work in Go, explore complex data types and provide optimization tips to ensure these operations happen with minimal delay.

Introduction to Protobuf and Its Importance

Protocol Buffers (Protobuf) are designed to be an efficient method for serializing structured data. By converting data into a compact binary format, Protobuf helps minimize memory consumption and bandwidth usage, making it a perfect solution for performance-critical applications such as real-time systems, distributed microservices, and mobile applications where resources are limited.

How Protobuf Works

At its core, Protobuf operates based on a predefined schema, which describes the structure of the data to be serialized. This schema is compiled into specific language bindings (such as Go, Python, or Java), allowing for cross-platform communication. Protobuf’s serialization mechanism converts structured data into a highly efficient binary format, which can then be deserialized back into its original form.

Schema Definition in Protobuf

Before we can serialize any data, we must define the structure of the data in a .proto file. The .proto file defines the schema, which describes how Protobuf should serialize and deserialize the data.

Here’s an example schema for a Person and Address:

syntax = "proto3";

message Address {
  string street = 1;
  string city = 2;
  string state = 3;
  int32 zip_code = 4;
}

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  Address address = 4;
  repeated string phone_numbers = 5;
}

In this example:

Person contains basic fields like name, id, and email.
Address is a nested message within Person.
The repeated keyword indicates a list of phone_numbers.

Each field is assigned a unique field number, which plays a crucial role during serialization, allowing Protobuf to encode the field efficiently.

Serialization in Protobuf

Serialization is the process of converting an in-memory Go struct into a binary format. This binary format is highly optimized for both size and speed. Let’s go over how serialization works internally and how you can optimize it for complex types in Go.

Step 1: Compiling the Schema

To use the schema defined in the .proto file, it needs to be compiled into Go code using the protoc compiler:

protoc --go_out=. --go_opt=paths=source_relative person.proto

This generates a .pb.go file, containing Go structs and methods for serialization and deserialization.

Step 2: Serialization in Go

Here's an example of serializing a Person struct in Go:

package main

import (
	"log"

	// The older github.com/golang/protobuf module is deprecated in favor of this one.
	"google.golang.org/protobuf/proto"

	proto_package "path/to/your/proto/package" // Adjust the import path
)

func main() {
	person := &proto_package.Person{
		Name:  "John Doe",
		Id:    150,
		Email: "john.doe@example.com",
		Address: &proto_package.Address{
			Street:  "123 Main St",
			City:    "Springfield",
			State:   "IL",
			ZipCode: 62704,
		},
		PhoneNumbers: []string{"123-456-7890", "098-765-4321"},
	}

	data, err := proto.Marshal(person)
	if err != nil {
		log.Fatalf("Failed to serialize person: %v", err)
	}
	log.Printf("Serialized data: %x", data)
}

In this example:

A Person message is created.
proto.Marshal() is used to serialize the message into a compact binary format.

This binary format is highly efficient, but when dealing with complex or large data, there are several ways to optimize performance.

Internal Steps of Protobuf Serialization

1. Field Number and Wire Type Determination

The first step in serialization is identifying each field in the Person message, extracting its value, and determining its field number and wire type.

Field Number: Each field in a Protobuf message has a unique field number (specified in the .proto file). For example, in the Person message, name has a field number of 1, id has a field number of 2, and so on.
Wire Type: The wire type specifies how the data for each field is encoded. Protobuf uses different wire types for different kinds of data (e.g., varint for integers, length-delimited for strings and nested messages, fixed-width for certain data types).

Each field is represented as a tag, which is a combination of the field number and the wire type.

Tag Encoding

A tag is encoded by combining the field number and the wire type. The formula is:

tag = (field number << 3) | wire type

For example:

The tag for the name field (field number 1, wire type 2 for length-delimited) would be: tag = (1 << 3) | 2 = 0x0A

This tag indicates the start of the serialized name field in the binary stream.
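The tag formula can be sketched in a few lines of Go. The helper name encodeTag and the constant names are illustrative, not part of the Protobuf library; the wire type numbers are the ones used throughout this article.

```go
package main

import "fmt"

// Wire type numbers used by the Protobuf wire format.
const (
	wireVarint  = 0 // int32, int64, uint32, uint64, bool, enum
	wireFixed64 = 1 // fixed64, sfixed64, double
	wireBytes   = 2 // string, bytes, nested messages
	wireFixed32 = 5 // fixed32, sfixed32, float
)

// encodeTag is a hypothetical helper: it combines a field number and a
// wire type into a single tag value.
func encodeTag(fieldNumber, wireType uint64) uint64 {
	return fieldNumber<<3 | wireType
}

func main() {
	fmt.Printf("name:    0x%02X\n", encodeTag(1, wireBytes))  // 0x0A
	fmt.Printf("id:      0x%02X\n", encodeTag(2, wireVarint)) // 0x10
	fmt.Printf("address: 0x%02X\n", encodeTag(4, wireBytes))  // 0x22
}
```

The tag value is itself written to the stream as a varint, so it fits in a single byte only for field numbers 1 through 15.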

2. Encoding Each Field Based on Wire Type

After determining the tag, Protobuf serializes the field’s value based on its wire type. Different wire types are encoded in different ways:

Varint Encoding (Wire Type 0)

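Varint encoding is used for fields with integer types (int32, int64, uint32, uint64, sint32, sint64), as well as bool and enum. Varints use a variable number of bytes depending on the magnitude of the value: each byte carries 7 bits of payload, least significant group first, and the most significant bit (MSB) signals whether another byte follows.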

For the id field, which has a value of 150, the varint encoding works as follows:
- 150 is represented as 0x96 0x01 in varint format. The first byte (0x96) indicates that more bytes are part of the varint (because the MSB is set), and the second byte (0x01) completes the value.
- The id field is serialized as:
  - Tag: 0x10 (field number 2, wire type 0 for varint)
  - Value: 0x96 0x01 (encoded value of 150).
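As a minimal sketch of the varint rule, the following Go code encodes 150 by hand. The helper appendVarint is hypothetical and written only for illustration; the real library has its own optimized encoder.

```go
package main

import "fmt"

// appendVarint is a hypothetical helper that appends v to buf as a varint:
// seven payload bits per byte, least significant group first, with the top
// bit set on every byte except the last.
func appendVarint(buf []byte, v uint64) []byte {
	for v >= 0x80 {
		buf = append(buf, byte(v)|0x80)
		v >>= 7
	}
	return append(buf, byte(v))
}

func main() {
	field := []byte{0x10}            // tag: field number 2, wire type 0
	field = appendVarint(field, 150) // value
	fmt.Printf("% X\n", field)       // 10 96 01
}
```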

Length-Delimited Encoding (Wire Type 2)

Length-delimited encoding is used for fields that contain variable-length data, such as strings, byte arrays, and nested messages.

For the name field, which has a value of "John Doe", the serialization process is:
- First, the length of the string in bytes is calculated: "John Doe" occupies 8 bytes in UTF-8.
- The length (8) is encoded as a varint (0x08).
- Then the string "John Doe" is encoded in UTF-8 bytes: 0x4A 0x6F 0x68 0x6E 0x20 0x44 0x6F 0x65.
- The name field is serialized as:
  - Tag: 0x0A (field number 1, wire type 2 for length-delimited)
  - Length: 0x08 (length of the string)
  - Value: 0x4A 0x6F 0x68 0x6E 0x20 0x44 0x6F 0x65 (UTF-8 encoded string "John Doe").
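The same three parts (tag, length, value) can be reproduced with the standard library alone. This is a sketch, not the library's encoder: appendStringField is a hypothetical helper, and it relies on binary.AppendUvarint from encoding/binary, which produces the same base-128 varints that Protobuf uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendStringField is a hypothetical helper that appends a string field as
// tag, byte length, and raw UTF-8 bytes.
func appendStringField(buf []byte, fieldNumber uint64, s string) []byte {
	buf = binary.AppendUvarint(buf, fieldNumber<<3|2) // wire type 2
	buf = binary.AppendUvarint(buf, uint64(len(s)))   // length in bytes, not runes
	return append(buf, s...)
}

func main() {
	fmt.Printf("% X\n", appendStringField(nil, 1, "John Doe"))
	// 0A 08 4A 6F 68 6E 20 44 6F 65
}
```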

Fixed-Length Encoding (Wire Types 1 and 5)

Fixed-length encoding is used for fixed-width types such as fixed32, fixed64, sfixed32, and sfixed64. These fields are serialized using a fixed number of bytes (4 or 8 bytes depending on the type).

If the Person message had a fixed32 or fixed64 field, the corresponding value would be serialized in exactly 4 or 8 bytes, respectively, without any extra length or varint encoding.
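A short sketch of what that would look like on the wire. Person has no fixed-width fields, so field numbers 6 and 7 below are made up for illustration, as are the helper names. Fixed-width values are written in little-endian byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendFixed32Field is a hypothetical helper: tag with wire type 5, then
// exactly four little-endian bytes.
func appendFixed32Field(buf []byte, fieldNumber uint64, v uint32) []byte {
	buf = binary.AppendUvarint(buf, fieldNumber<<3|5)
	return binary.LittleEndian.AppendUint32(buf, v)
}

// appendFixed64Field does the same with wire type 1 and eight bytes.
func appendFixed64Field(buf []byte, fieldNumber uint64, v uint64) []byte {
	buf = binary.AppendUvarint(buf, fieldNumber<<3|1)
	return binary.LittleEndian.AppendUint64(buf, v)
}

func main() {
	fmt.Printf("% X\n", appendFixed32Field(nil, 6, 150)) // 35 96 00 00 00
	fmt.Printf("% X\n", appendFixed64Field(nil, 7, 150)) // 39 96 00 00 00 00 00 00 00
}
```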

3. Handling Nested Messages

For fields that are themselves Protobuf messages (like the Address field inside the Person message), Protobuf treats them as length-delimited fields. The nested message is serialized first, and then its length and value are encoded in the parent message.

For the Address field:

The nested Address message (street, city, state, zip_code) is serialized independently.
Protobuf calculates the total length of the serialized Address message.
The Address field is serialized in the Person message with:
- Tag: 0x22 (field number 4, wire type 2 for length-delimited).
- Length: Length of the serialized Address message.
- Value: Serialized binary representation of the Address message.
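The steps above can be sketched by hand-encoding the example Address and wrapping it as field 4 of Person. The helper names are hypothetical, and the sketch uses only encoding/binary rather than the Protobuf library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendBytesField appends tag, length, and payload for any wire type 2 field.
func appendBytesField(buf []byte, fieldNumber uint64, payload []byte) []byte {
	buf = binary.AppendUvarint(buf, fieldNumber<<3|2)
	buf = binary.AppendUvarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

// encodeAddress serializes the example Address on its own.
func encodeAddress() []byte {
	var b []byte
	b = appendBytesField(b, 1, []byte("123 Main St")) // street
	b = appendBytesField(b, 2, []byte("Springfield")) // city
	b = appendBytesField(b, 3, []byte("IL"))          // state
	b = binary.AppendUvarint(b, 4<<3)                 // zip_code tag: field 4, wire type 0
	b = binary.AppendUvarint(b, 62704)                // zip_code value
	return b
}

func main() {
	addr := encodeAddress()
	field := appendBytesField(nil, 4, addr) // Person.address
	fmt.Printf("address payload: %d bytes\n", len(addr))
	fmt.Printf("% X\n", field)
}
```

For this particular data the payload is 34 bytes, so the length byte happens to equal the tag byte (both 0x22).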

4. Handling Repeated Fields

For repeated fields like phone_numbers, Protobuf serializes each element in the list individually. Each item is serialized with the same tag but with different values.

For example:

The phone_numbers field contains two strings: "123-456-7890" and "098-765-4321".
Each string is serialized as a length-delimited field:
- First string ("123-456-7890") is serialized as:
  - Tag: 0x2A (field number 5, wire type 2 for length-delimited).
  - Length: 0x0C (length of the string, 12 bytes).
  - Value: 0x31 0x32 0x33 0x2D 0x34 0x35 0x36 0x2D 0x37 0x38 0x39 0x30.
- Second string ("098-765-4321") is serialized similarly with the same tag (0x2A), length, and UTF-8 encoded string value.

Protobuf automatically handles repeated fields by serializing each element separately with the same tag.
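A minimal sketch of that behavior for a repeated string field follows. The helper appendRepeatedStrings is hypothetical and exists only to show that each element becomes its own tag-length-value record.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendRepeatedStrings is a hypothetical helper that writes one complete
// tag-length-value record per element, all sharing the same field number.
func appendRepeatedStrings(buf []byte, fieldNumber uint64, values []string) []byte {
	for _, v := range values {
		buf = binary.AppendUvarint(buf, fieldNumber<<3|2)
		buf = binary.AppendUvarint(buf, uint64(len(v)))
		buf = append(buf, v...)
	}
	return buf
}

func main() {
	out := appendRepeatedStrings(nil, 5, []string{"123-456-7890", "098-765-4321"})
	fmt.Printf("% X\n", out)
}
```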

5. Completing the Serialization

After all fields are serialized into binary format, Protobuf concatenates the binary representations of all fields into a single binary message. This compact binary representation is the final serialized message.

For example, the final serialized message might look something like this (in hexadecimal form):

0A 08 4A 6F 68 6E 20 44 6F 65
10 96 01
1A 14 6A 6F 68 6E 2E 64 6F 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D
22 22 0A 0B 31 32 33 20 4D 61 69 6E 20 53 74 12 0B 53 70 72 69 6E 67 66 69 65 6C 64 1A 02 49 4C 20 F0 E9 03
2A 0C 31 32 33 2D 34 35 36 2D 37 38 39 30
2A 0C 30 39 38 2D 37 36 35 2D 34 33 32 31

Optimization Techniques for Efficient Serialization

1. Use Fixed-Width Types for Known Data Ranges

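Protobuf provides both variable-length and fixed-length integer types. Variable-length encoding (int32, int64) is more space-efficient for small numbers, but large values cost more: a varint needs 5 bytes once a value reaches 2^28 and up to 10 bytes for 64-bit values, and negative int32 and int64 values always take 10 bytes. If you expect your values to be consistently large, use fixed32 (always 4 bytes) or fixed64 (always 8 bytes).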

message Product {
  string name = 1;
  fixed32 quantity = 2;   // Use fixed-width types for performance
  fixed64 price = 3;
}

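For consistently large values, fixed-width fields are smaller on the wire and simpler to decode, since the parser reads a known number of bytes instead of checking a continuation bit on each one. For small values, varints remain the more compact choice.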
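To make the size trade-off concrete, this sketch counts how many bytes a value needs as a varint, which can then be compared with the constant 4 bytes of fixed32 and 8 bytes of fixed64. The function varintSize is a hypothetical helper.

```go
package main

import "fmt"

// varintSize reports how many bytes v occupies as a varint (1 to 10).
func varintSize(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

func main() {
	for _, v := range []uint64{1, 150, 1<<28 - 1, 1 << 28, 1 << 56} {
		fmt.Printf("value %d: varint takes %d byte(s)\n", v, varintSize(v))
	}
}
```

The crossover points are where a varint grows past the fixed width: 2^28 for a 4-byte field and 2^56 for an 8-byte field.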

2. Use `packed` for Repeated Primitive Fields

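When working with repeated fields of primitive numeric types, packing them can improve performance by eliminating redundant field tags during serialization. Packing groups multiple values into a single length-delimited block. In proto3, which the examples in this article use, repeated scalar numeric fields are packed by default, so the explicit option matters mainly for proto2 schemas. Strings, bytes, and nested messages cannot be packed.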

message Inventory {
  repeated int32 item_ids = 1 [packed=true];
}

Packing reduces the size of the serialized message, making the serialization and deserialization processes faster.
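The difference between the two layouts can be sketched by encoding the same values both ways. Both functions are hypothetical helpers built on encoding/binary, and the sample values are arbitrary.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeUnpacked writes one tag (wire type 0) and one varint per element.
func encodeUnpacked(fieldNumber uint64, ids []int32) []byte {
	var buf []byte
	for _, id := range ids {
		buf = binary.AppendUvarint(buf, fieldNumber<<3) // wire type 0
		buf = binary.AppendUvarint(buf, uint64(id))     // sign-extends like int32 on the wire
	}
	return buf
}

// encodePacked writes a single tag (wire type 2), the payload length, and
// then the varints back to back.
func encodePacked(fieldNumber uint64, ids []int32) []byte {
	var payload []byte
	for _, id := range ids {
		payload = binary.AppendUvarint(payload, uint64(id))
	}
	buf := binary.AppendUvarint(nil, fieldNumber<<3|2)
	buf = binary.AppendUvarint(buf, uint64(len(payload)))
	return append(buf, payload...)
}

func main() {
	ids := []int32{3, 270, 86942}
	fmt.Printf("unpacked: % X\n", encodeUnpacked(1, ids)) // 08 03 08 8E 02 08 9E A7 05
	fmt.Printf("packed:   % X\n", encodePacked(1, ids))   // 0A 06 03 8E 02 9E A7 05
}
```

With three elements the saving is a single byte; it grows with the number of elements, because the packed form pays for the tag only once.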

3. Limit Nesting and Flatten Structures

Deeply nested structures slow down both serialization and deserialization, as Protobuf needs to recursively process each level of nesting. A flatter structure leads to faster processing.

Before (Deep Nesting):

message Department {
  message Team {
    message Employee {
      string name = 1;
    }
  }
}

After (Flatter Structure):

message Employee {
  string name = 1;
}

message Team {
  repeated Employee employees = 1;
}

message Department {
  repeated Team teams = 1;
}

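Flattening reduces recursive processing only when it removes levels of message fields from the data itself, since each nested message field is written as its own length-prefixed submessage. Where a message type is declared (inside another message or at the top level) does not change the wire format; declaring types at the top level mainly makes them easier to reuse.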

4. Stream Large Data Sets

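For large datasets, it’s often inefficient to serialize everything at once. Instead, break large datasets into chunks and handle serialization and deserialization incrementally using streams. The example below uses a client-streaming RPC from gRPC, a separate framework built on top of Protobuf; the UploadStatus message is assumed to be defined elsewhere in the schema.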

message DataChunk {
  bytes chunk = 1;
  int32 sequence_number = 2;
}

service FileService {
  rpc UploadFile(stream DataChunk) returns (UploadStatus);
}

Streaming allows for efficient handling of large datasets, avoiding memory overhead and delays caused by processing entire messages at once.
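Outside of gRPC, a common way to stream many messages over a plain connection or file is to prefix each serialized message with its length. The sketch below shows that framing technique with the standard library only; it is not the gRPC stream API, and the payloads are placeholder bytes standing in for serialized DataChunk messages. All function names are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame writes one already-serialized message prefixed with its length
// as a varint, so a reader can tell where each message ends.
func writeFrame(w io.Writer, msg []byte) error {
	prefix := binary.AppendUvarint(nil, uint64(len(msg)))
	if _, err := w.Write(prefix); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readFrame reads one length-prefixed message. It returns io.EOF when the
// stream ends cleanly between frames. Production code should also cap n.
func readFrame(r *bufio.Reader) ([]byte, error) {
	n, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	msg := make([]byte, n)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

// roundTrip frames every chunk into an in-memory stream and reads them back
// one at a time, as a receiver would.
func roundTrip(chunks [][]byte) ([][]byte, error) {
	var stream bytes.Buffer
	for _, c := range chunks {
		if err := writeFrame(&stream, c); err != nil {
			return nil, err
		}
	}
	var out [][]byte
	r := bufio.NewReader(&stream)
	for {
		msg, err := readFrame(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, msg)
	}
}

func main() {
	got, err := roundTrip([][]byte{[]byte("chunk-0"), []byte("chunk-1")})
	if err != nil {
		panic(err)
	}
	for i, msg := range got {
		fmt.Printf("frame %d: %d bytes\n", i, len(msg))
	}
}
```

Because each frame is read and processed independently, the receiver never needs to hold more than one chunk in memory.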

5. Use Caching for Frequently Serialized Data

If you frequently serialize the same data (e.g., common configurations or settings), consider caching the serialized form. This way, you can avoid repeating the serialization process.

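// cache must be initialized before use: writing to a nil map panics.
// It is not safe for concurrent use without additional locking.
var cache = make(map[string][]byte)

// serializeWithCache returns the cached bytes for key and marshals only on a miss.
// The caller must remove the entry if the underlying message changes.
func serializeWithCache(key string, message proto.Message) ([]byte, error) {
	if cachedData, ok := cache[key]; ok {
		return cachedData, nil
	}

	data, err := proto.Marshal(message)
	if err != nil {
		return nil, err
	}

	cache[key] = data
	return data, nil
}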

Caching serialized data reduces redundant work on the serialization path. It does not speed up deserialization, and cached entries must be invalidated whenever the underlying message changes.
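In a server, such a cache is usually shared between goroutines, and a plain map is not safe for concurrent writes. One possible variant is sketched below with a sync.RWMutex. The type and method names are hypothetical, and the encode callback stands in for a call to proto.Marshal so that the sketch runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sync"
)

// encodedCache is a hypothetical concurrency-safe cache of serialized bytes.
type encodedCache struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func newEncodedCache() *encodedCache {
	return &encodedCache{data: make(map[string][]byte)}
}

// getOrEncode returns the cached bytes for key, calling encode only on a miss.
// Callers must treat the returned slice as read-only.
func (c *encodedCache) getOrEncode(key string, encode func() ([]byte, error)) ([]byte, error) {
	c.mu.RLock()
	cached, ok := c.data[key]
	c.mu.RUnlock()
	if ok {
		return cached, nil
	}

	data, err := encode()
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.data[key] = data
	c.mu.Unlock()
	return data, nil
}

// invalidate must be called whenever the message behind key changes.
func (c *encodedCache) invalidate(key string) {
	c.mu.Lock()
	delete(c.data, key)
	c.mu.Unlock()
}

func main() {
	cache := newEncodedCache()
	calls := 0
	encode := func() ([]byte, error) {
		calls++
		return []byte{0x0A, 0x03, 'c', 'f', 'g'}, nil // stand-in for proto.Marshal
	}
	for i := 0; i < 3; i++ {
		if _, err := cache.getOrEncode("config", encode); err != nil {
			panic(err)
		}
	}
	fmt.Println("encode calls:", calls) // 1
}
```

Two goroutines that miss at the same moment may both encode the message; the result is the same either way, so the sketch accepts that duplicate work rather than holding the write lock during encoding.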

Deserialization in Protobuf

Deserialization is the reverse process where the binary data is converted back into a Go struct. Protobuf’s deserialization process is highly optimized, but understanding how to handle complex types and large datasets efficiently can improve overall performance.

Deserialization Example in Go

package main

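import (
	"log"

	// The older github.com/golang/protobuf module is deprecated in favor of this one.
	"google.golang.org/protobuf/proto"

	proto_package "path/to/your/proto/package" // Adjust the import path
)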

func main() {
	data := []byte{ /* serialized data */ }

	person := &proto_package.Person{}
	err := proto.Unmarshal(data, person)
	if err != nil {
		log.Fatalf("Failed to deserialize: %v", err)
	}

	log.Printf("Deserialized Name: %s", person.Name)
}

In this example, proto.Unmarshal() converts the binary data back into a Go struct. The performance of deserialization can also be optimized by applying the same techniques as serialization, such as reducing nesting and streaming large data.

Internal Steps of Protobuf Deserialization

When the proto.Unmarshal() function is called, several steps occur internally to convert the binary data into the corresponding Go struct.

1. Parsing the Binary Data Stream

The first thing that happens is that the binary data is read sequentially. Protobuf messages are encoded in a tag-value format, where each field is stored along with its tag (containing the field number and wire type). The deserialization process needs to parse this tag and determine how to interpret the subsequent bytes.

Tags: Each tag is encoded as a combination of the field number and wire type. The wire type indicates how the data is encoded (e.g., varint, fixed-width, or length-delimited). The tag is decoded by extracting the field number and wire type. The tag is read as:

tag = (field number << 3) | wire type

For example, a tag like 0x08 means:
- Field Number: The field number is extracted by shifting the tag to the right (tag >> 3), which gives 1.
- Wire Type: The wire type is determined by masking the tag with 0x07 (tag & 0x07), which gives the wire type (for example, 0 means varint).

This step involves reading the tag and interpreting what type of data it represents.
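The shift and mask can be sketched directly in Go. The helper decodeTag is hypothetical; it assumes the tag has already been read from the stream as a varint.

```go
package main

import "fmt"

// decodeTag is a hypothetical helper that splits a decoded tag value into
// its field number and wire type.
func decodeTag(tag uint64) (fieldNumber uint64, wireType uint8) {
	return tag >> 3, uint8(tag & 0x07)
}

func main() {
	for _, tag := range []uint64{0x08, 0x0A, 0x10, 0x22, 0x2A} {
		num, wt := decodeTag(tag)
		fmt.Printf("tag 0x%02X -> field %d, wire type %d\n", tag, num, wt)
	}
}
```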

2. Decoding Wire Types

Once the field number and wire type are extracted, the deserializer proceeds to read the actual field data. Each wire type dictates how the data should be interpreted.

Varint (Wire Type 0): This is the wire type used for most integer fields (int32, int64, bool). Varint encoding stores integers in a variable number of bytes, with smaller numbers using fewer bytes. The deserialization process reads one byte at a time, checking the most significant bit (MSB) to determine if more bytes are part of the integer.

Example:
- For an id field with a value of 150, the binary representation would be 0x96 0x01. The first byte (0x96) tells Protobuf that the integer continues (since the MSB is set), and the second byte (0x01) completes the value. The deserializer combines these bytes to get 150.

Length-Delimited (Wire Type 2): This wire type is used for strings, byte arrays, and nested messages. The deserializer first reads the length of the data (encoded as a varint), and then reads that many bytes.

Example:
- For the field name = "John Doe", the binary data might look like 0A 08 4A 6F 68 6E 20 44 6F 65. The deserializer first reads the tag 0x0A (field 1, length-delimited). Then it reads the length 0x08, indicating that the next 8 bytes are the string "John Doe".

Fixed-Length Types (Wire Type 1 for fixed64, Wire Type 5 for fixed32): These are used for fixed-width integers and floats, and the deserializer reads 4 bytes for fixed32 and 8 bytes for fixed64 without additional interpretation.
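The three reading strategies can be sketched as small helpers that each consume one value from the front of a byte slice. The function names are hypothetical, and the sketch uses binary.Uvarint from the standard library, which decodes the same base-128 varints.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errMalformed = errors.New("malformed or truncated input")

// consumeVarint reads one varint from the front of b and returns the value
// and the remaining bytes.
func consumeVarint(b []byte) (uint64, []byte, error) {
	v, n := binary.Uvarint(b)
	if n <= 0 {
		return 0, nil, errMalformed
	}
	return v, b[n:], nil
}

// consumeBytes reads a varint length followed by that many bytes.
func consumeBytes(b []byte) ([]byte, []byte, error) {
	n, rest, err := consumeVarint(b)
	if err != nil {
		return nil, nil, err
	}
	if n > uint64(len(rest)) {
		return nil, nil, errMalformed
	}
	return rest[:n], rest[n:], nil
}

// consumeFixed32 reads four little-endian bytes.
func consumeFixed32(b []byte) (uint32, []byte, error) {
	if len(b) < 4 {
		return 0, nil, errMalformed
	}
	return binary.LittleEndian.Uint32(b), b[4:], nil
}

func main() {
	id, _, _ := consumeVarint([]byte{0x96, 0x01})
	fmt.Println(id) // 150

	name, _, _ := consumeBytes([]byte{0x08, 'J', 'o', 'h', 'n', ' ', 'D', 'o', 'e'})
	fmt.Printf("%s\n", name) // John Doe

	fixed, _, _ := consumeFixed32([]byte{0x96, 0x00, 0x00, 0x00})
	fmt.Println(fixed) // 150
}
```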

3. Field Mapping and Assignment

Once the deserializer has interpreted the field number and read the associated data, it maps the field to the corresponding struct field in Go. The deserializer performs a lookup using the field number defined in the schema to determine which Go struct field corresponds to the data it has just decoded.

For instance, when the deserializer reads the field with field number 1 and wire type 2 (indicating that it is a length-delimited string), it knows that this corresponds to the name field in the Person struct. It then assigns the decoded value "John Doe" to the Name field in the Go object.

person.Name = "John Doe"
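A simplified parse loop shows how field numbers drive the assignment. The real Go implementation uses generated metadata and table lookups rather than a hand-written switch, so treat this as an illustration only: the person struct is a hand-written stand-in for the generated type, and the function handles just the name and id fields.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// person is a hand-written stand-in for part of the generated Person struct.
type person struct {
	Name string
	Id   int32
}

// unmarshalPerson is a simplified sketch: it handles only fields 1 and 2
// and rejects anything else instead of skipping it.
func unmarshalPerson(b []byte, p *person) error {
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return errors.New("bad tag")
		}
		b = b[n:]

		fieldNumber, wireType := tag>>3, tag&7
		switch {
		case fieldNumber == 1 && wireType == 2: // name
			size, m := binary.Uvarint(b)
			if m <= 0 || size > uint64(len(b)-m) {
				return errors.New("bad length")
			}
			p.Name = string(b[m : m+int(size)])
			b = b[m+int(size):]
		case fieldNumber == 2 && wireType == 0: // id
			v, m := binary.Uvarint(b)
			if m <= 0 {
				return errors.New("bad varint")
			}
			p.Id = int32(v)
			b = b[m:]
		default:
			return fmt.Errorf("unhandled field %d (wire type %d)", fieldNumber, wireType)
		}
	}
	return nil
}

func main() {
	data := []byte{0x0A, 0x08, 'J', 'o', 'h', 'n', ' ', 'D', 'o', 'e', 0x10, 0x96, 0x01}
	var p person
	if err := unmarshalPerson(data, &p); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p) // {Name:John Doe Id:150}
}
```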

4. Handling Repeated Fields

If a field is marked as repeated, the deserializer keeps track of multiple instances of that field. For example, the phone_numbers field in the Person message is a repeated string field. The deserializer collects each occurrence of the field and appends it to the list of phone numbers in the Go struct.

person.PhoneNumbers = append(person.PhoneNumbers, "123-456-7890")
person.PhoneNumbers = append(person.PhoneNumbers, "098-765-4321")

5. Handling Nested Messages

When deserializing nested messages (like the Address message inside the Person message), the deserializer treats them as length-delimited fields. After reading the length, it recursively parses the nested message's binary data into the corresponding Go struct.

For example, in the Person message:

message Address {
  string street = 1;
  string city = 2;
  string state = 3;
  int32 zip_code = 4;
}

message Person {
  string name = 1;
  Address address = 4;
}

When deserializing the Address field (field number 4), Protobuf reads the length of the Address message, and then recursively deserializes the binary data for the Address into the Address struct inside the Person.

6. Handling Unknown Fields

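One of the key features of Protobuf is forward and backward compatibility. During deserialization, if the binary data contains a field that is not recognized (perhaps because it was added in a newer version of the schema), the deserializer can either store the unknown field data for later use or simply ignore it. The Go implementation preserves unknown fields by default, so they survive a parse and re-serialize round trip; setting DiscardUnknown in proto.UnmarshalOptions drops them instead.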

This ensures that older versions of the code can still read newer messages without crashing.
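Skipping an unrecognized field is possible because the wire type alone tells the parser how many bytes the value occupies. The sketch below illustrates the idea with hypothetical helpers; field number 9 is made up to play the role of a field added by a newer schema, and the deprecated group wire types (3 and 4) are not handled.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// skipField returns the number of bytes occupied by one field value of the
// given wire type at the front of b.
func skipField(b []byte, wireType uint64) (int, error) {
	switch wireType {
	case 0: // varint
		_, n := binary.Uvarint(b)
		if n <= 0 {
			return 0, errors.New("bad varint")
		}
		return n, nil
	case 1: // 64-bit
		if len(b) < 8 {
			return 0, errors.New("truncated fixed64")
		}
		return 8, nil
	case 2: // length-delimited
		size, n := binary.Uvarint(b)
		if n <= 0 || size > uint64(len(b)-n) {
			return 0, errors.New("bad length")
		}
		return n + int(size), nil
	case 5: // 32-bit
		if len(b) < 4 {
			return 0, errors.New("truncated fixed32")
		}
		return 4, nil
	default:
		return 0, fmt.Errorf("unsupported wire type %d", wireType)
	}
}

// splitUnknown walks a message and returns the raw bytes (tag and value) of
// every field whose number is not in known.
func splitUnknown(b []byte, known map[uint64]bool) ([]byte, error) {
	var unknown []byte
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("bad tag")
		}
		size, err := skipField(b[n:], tag&7)
		if err != nil {
			return nil, err
		}
		if !known[tag>>3] {
			unknown = append(unknown, b[:n+size]...) // keep tag and value verbatim
		}
		b = b[n+size:]
	}
	return unknown, nil
}

func main() {
	// Fields 1 and 2 as in Person, plus a field 9 this reader has never seen.
	data := []byte{0x0A, 0x02, 'H', 'i', 0x48, 0x2A, 0x10, 0x96, 0x01}
	unknown, err := splitUnknown(data, map[uint64]bool{1: true, 2: true})
	if err != nil {
		panic(err)
	}
	fmt.Printf("unknown bytes: % X\n", unknown) // 48 2A
}
```

Keeping the unknown bytes verbatim is what allows an older program to re-serialize a message without losing data written by a newer one.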

7. Completing the Deserialization Process

Once all fields are processed, and the binary stream is fully read, the deserialization is complete. The resulting Go struct is fully populated with the deserialized data.

At this point, the application can access the Person object as if it had been constructed manually in Go.

Conclusion

Serialization and deserialization in Protobuf are highly efficient, but working with complex types and large datasets requires careful consideration. By following the optimization techniques outlined in this article—such as using fixed-width types, packing repeated fields, flattening structures, streaming large datasets, and caching—you can minimize delays and ensure high performance in your Go applications.

These strategies are particularly useful in systems where efficiency and speed are critical, such as in real-time applications, distributed microservices, or high-volume data processing pipelines. Understanding and leveraging Protobuf's internal mechanics allows developers to unlock the full potential of this powerful serialization framework.