In the world of high-performance distributed systems, data serialization frameworks like Google Protocol Buffers (Protobuf) are indispensable. They offer compact binary formats and efficient parsing, making them ideal for everything from inter-service communication to persistent data storage. But when it comes to updating just a small part of an already serialized data blob, a common question arises: can we "patch" it directly, avoiding the overhead of reading, modifying, and re-writing the entire thing?

The short answer, for most practical purposes, is no. While Protobuf provides clever mechanisms that seem to offer direct patching, the reality is more nuanced. Let's dive into why the full "read-modify-write" cycle remains largely unavoidable and where the true efficiencies lie.

The Core Challenge: Binary Data's Unfixed Nature

Imagine a book where every word's length can change, and there are no fixed page numbers for individual words. If you change a single word, all subsequent words on that page (and potentially the entire book) would shift, requiring a complete re-layout. This is akin to the challenge of patching a binary serialized blob.

Protobuf, like Apache Thrift, uses a compact, variable-length binary encoding. Fields are identified by unique numeric tags, and their values are encoded efficiently, often with variable-length integers or length-prefixed strings. This design is fantastic for minimizing data size and maximizing parsing speed. However, it means that the exact byte offset and length of any given field are not fixed. Changing the value of a field, especially a string, can alter its byte length, which would then shift the positions of all subsequent fields in the binary stream. Attempting an "in-place" modification without recalculating and shifting all subsequent bytes would lead to data corruption.
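To make the shifting-offset problem concrete, here is a minimal Python sketch that hand-encodes two fields the way the Protobuf wire format lays them out (a varint tag, then either a varint value or a length prefix plus payload). The schema is an assumption for illustration only (name as field 1, age as field 2), and the helper names are hypothetical, not part of any Protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag (wire type 2), then a varint length prefix, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Tag (wire type 0), then the varint-encoded value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


# Assumed schema for this sketch: name = field 1 (string), age = field 2 (int32).
name_before = encode_string_field(1, "Alice")
name_after = encode_string_field(1, "Alicia")
age = encode_varint_field(2, 30)

before = name_before + age
after = name_after + age

# The age field starts right after the name field, so its offset moves
# as soon as the name grows by one byte.
age_offset_before = len(name_before)
age_offset_after = len(name_after)

print(before.hex())  # 0a05416c696365101e
print(after.hex())   # 0a06416c69636961101e
print(age_offset_before, age_offset_after)  # 7 8
```

Overwriting the five bytes of "Alice" in place with the six bytes of "Alicia" would clobber the tag of the age field, and the length prefix would also have to change from 5 to 6. Everything after the edited field has to move.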
Misconception 1: The "Last Field Wins" Magic Trick

One intriguing feature of Protocol Buffers is its "last field wins" merge behavior for non-repeated fields. If you concatenate the binary forms of two serialized messages of the same type and deserialize the combined stream, each non-repeated field takes the value of its last occurrence in the stream. For repeated fields, new values are appended, not overwritten.

How it seems to work (and why it's misleading for patching):

Let's say you have an original Person object serialized into a blob:

Original Blob: [name="Alice", age=30, phone_number=["111", "222"]]

You want to update only the name to "Alicia". You could create a new, small Protobuf message containing just the updated name:

Patch Blob: [name="Alicia"]

Then, you could append this Patch Blob to the Original Blob:

Combined Blob: [name="Alice", age=30, phone_number=["111", "222"]] + [name="Alicia"]

When a Protobuf parser reads this Combined Blob, due to "last field wins," the name will indeed resolve to "Alicia," while age and phone_number will retain their original values.

The Catch: While this appears to be a patch, it's a deserialization rule, not a binary patching mechanism. The parser still has to read and process the entire concatenated stream to determine the final state of the message. You haven't avoided the deserialization cost; you've just changed how the parser resolves conflicts during deserialization.
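The concatenation behavior can be imitated with a small hand-written parser. The sketch below is not the Protobuf library; it is a simplified stand-in that assumes a hypothetical Person schema (name = field 1, age = field 2, repeated phone_number = field 3) and only handles varint and length-delimited wire types.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int):
    """Return (value, next position) for the varint that starts at pos."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def string_field(number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def varint_field(number: int, value: int) -> bytes:
    return encode_varint(number << 3) + encode_varint(value)


# Assumed field numbers for this sketch; only these three are understood.
NAMES = {1: "name", 2: "age", 3: "phone_number"}


def parse_person(blob: bytes) -> dict:
    """Walk every byte: singular fields keep the last value, repeated ones append."""
    person = {"phone_number": []}
    pos = 0
    while pos < len(blob):
        tag, pos = read_varint(blob, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(blob, pos)
        elif wire_type == 2:
            length, pos = read_varint(blob, pos)
            value = blob[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if number == 3:
            person["phone_number"].append(value)   # repeated: append
        else:
            person[NAMES[number]] = value          # singular: last one wins
    return person


original = (string_field(1, "Alice") + varint_field(2, 30)
            + string_field(3, "111") + string_field(3, "222"))
patch = string_field(1, "Alicia")
combined = original + patch

patched = parse_person(combined)
print(patched["name"], patched["age"], patched["phone_number"])

# Trying to "patch" a phone number only appends a third entry.
phone_patched = parse_person(original + string_field(3, "333"))
print(phone_patched["phone_number"])
```

Note that parse_person still loops over the whole combined blob, including the stale "Alice" bytes, before it knows the final name.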
Furthermore, this approach has severe limitations:
Only for Root Objects and Non-Repeated Fields: It "only works well for the root object" and "doesn't work for repeated" fields. If you tried to update a specific phone number, the new value would be appended to the list instead of replacing an entry. Singular nested messages are merged field by field, but a value that sits inside a repeated nested message cannot be targeted at all.
Increased Storage/Transmission Size: You're now storing or transmitting more data (original + patch) than if you had simply re-serialized the whole object.

Misconception 2: FieldMask Saves Re-serialization Cost

Google's official Protobuf best practices recommend using FieldMask for supporting partial updates in APIs. This is an excellent pattern, but it's crucial to understand where its efficiency gains truly lie.

How FieldMask works:

A FieldMask is a separate Protobuf message that explicitly lists the paths of the fields a client intends to modify (e.g., name, address.street). When a client wants to update a resource, it sends a small request containing:
The FieldMask itself.
Only the partial data for the fields specified in the mask.

Example of a network payload using FieldMask:

Instead of sending:

{ "name": "Alicia", "age": 30, "phone_number": ["111", "222"] } // (full object)

A client might send:

{ "update_mask": { "paths": ["name"] }, "person": { "name": "Alicia" } } // (only the changed field; much smaller when the object is large)

Where FieldMask truly shines (and why re-serialization is still needed):

FieldMask significantly improves efficiency, but not by avoiding the deserialization/re-serialization cycle on the server's persistent data. Its benefits are primarily at the network communication and application logic layers:
Bandwidth Optimization: By sending only the FieldMask and the partial data, the request payload size is drastically reduced. This saves network bandwidth, especially critical for mobile clients or high-volume APIs.
Reduced Server-Side Processing: The server receives explicit instructions on which fields to update. This streamlines the application logic, preventing the server from having to infer changes or process a large object where most fields are unchanged.

However, once the server receives this partial update request, to apply it to the stored, serialized data, it still performs the following steps:
Retrieve Existing Data: The server fetches the full, existing serialized blob from its storage.
Deserialize: The entire blob is deserialized into a complete in-memory Protobuf object.
Apply Patch: The application logic uses the FieldMask to update only the specified fields on this in-memory object.
Re-serialize: The entire modified in-memory object is then re-serialized into a new binary blob.
Persist: This new blob replaces the old one in storage.

The Unavoidable Truth: Read-Modify-Write

For any robust and reliable modification of a Protocol Buffer serialized data blob, the read-modify-write cycle is the standard and necessary approach. This is because:
Data Integrity: It ensures that the entire object remains consistent and correctly encoded after the modification.
Schema Evolution: It gracefully handles schema changes (adding/removing fields) by allowing the parser to correctly interpret the full data structure.
Binary Format Constraints: The variable-length nature of Protobuf's encoding makes direct byte-level manipulation impractical and prone to corruption.

Conclusion

Protocol Buffers are incredibly powerful for efficient data serialization and schema evolution. Features like "last field wins" and FieldMask are valuable tools, but their utility for "patching" existing serialized blobs is often misunderstood.

The "last field wins" behavior is a deserialization rule that can be leveraged for simple, non-repeated field updates via concatenation, but it still requires full deserialization and is not a general-purpose binary patching solution.

The FieldMask is an excellent API design pattern that optimizes network bandwidth and simplifies application logic for partial updates, but the server still performs a full read-modify-write cycle on the underlying data.

Ultimately, if you need to modify a Protobuf serialized blob, prepare for the full read-modify-write dance. The true efficiencies come from optimizing the communication of the patch (e.g., with FieldMask) and the in-memory processing, rather than magically altering bytes on disk.

Further Reading
Protocol Buffers Documentation
Protobuf Language Guide (proto3)
Google Protobuf FieldMask
Google Cloud API Design Guide - Partial Updates
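As a companion to the five server-side steps described earlier (retrieve, deserialize, apply patch, re-serialize, persist), here is a rough Python sketch of the whole cycle. It does not use the Protobuf library: the encoder and decoder are simplified stand-ins, the Person schema and field numbers are assumptions, the storage is a plain dict, and the update mask is just a list of top-level field names (nested paths such as address.street are not handled).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int):
    """Return (value, next position) for the varint that starts at pos."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Assumed schema: field number -> (name, wire type, repeated?)
SCHEMA = {1: ("name", 2, False), 2: ("age", 0, False), 3: ("phone_number", 2, True)}


def serialize(person: dict) -> bytes:
    out = bytearray()
    for number, (name, wire_type, repeated) in SCHEMA.items():
        if name not in person:
            continue
        values = person[name] if repeated else [person[name]]
        for value in values:
            out += encode_varint((number << 3) | wire_type)
            if wire_type == 0:
                out += encode_varint(value)
            else:
                payload = value.encode("utf-8")
                out += encode_varint(len(payload)) + payload
    return bytes(out)


def deserialize(blob: bytes) -> dict:
    person, pos = {}, 0
    while pos < len(blob):
        tag, pos = read_varint(blob, pos)
        name, wire_type, repeated = SCHEMA[tag >> 3]
        if wire_type == 0:
            value, pos = read_varint(blob, pos)
        else:
            length, pos = read_varint(blob, pos)
            value = blob[pos:pos + length].decode("utf-8")
            pos += length
        if repeated:
            person.setdefault(name, []).append(value)
        else:
            person[name] = value
    return person


def apply_update(storage: dict, key: str, update_mask: list, partial: dict) -> bytes:
    blob = storage[key]              # 1. Retrieve the full existing blob
    person = deserialize(blob)       # 2. Deserialize the entire blob
    for path in update_mask:         # 3. Apply only the masked fields
        person[path] = partial[path]
    new_blob = serialize(person)     # 4. Re-serialize the entire object
    storage[key] = new_blob          # 5. Persist, replacing the old blob
    return new_blob


storage = {"person/1": serialize(
    {"name": "Alice", "age": 30, "phone_number": ["111", "222"]})}
apply_update(storage, "person/1", ["name"], {"name": "Alicia"})
updated = deserialize(storage["person/1"])
print(updated)
```

The request that reaches apply_update is small, but steps 2 and 4 still touch every field of the stored object, which is the point the article makes about FieldMask.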


Can You Patch a Protobuf File? Not Really—and Here’s Why
