How to Manage Schema Change with Apache Pulsar

Written by datastax | Published 2023/05/30

TL;DR: Schemas provide a blueprint of the data format. They're the glue between services and middleware. When the blueprint changes, the change needs to be graceful and support different versions. Learn how the Apache Pulsar messaging platform provides a simple, clean way of managing schemas between topics.

Schemas are an essential part of any data platform. They’re metadata that define the shape of the data and the properties’ names and data types. Schemas are helpful in two ways. First, they provide a fixed blueprint for the data format, which can prevent badly formed data from being used within the context of the schema. Second, schemas let users understand how to parse the data and what to expect from its properties.

Apache Pulsar is an open source, distributed publish-subscribe (pub-sub) messaging system that enables server-to-server data transportation using schemas. Messages are sent via topics, with producers putting messages onto these topics to be read by consumers. A user can define a schema for a topic, which is stored in Pulsar's schema registry. When messages are added to the topic, the Pulsar broker checks that they conform to the schema and ensures only valid messages are sent. The schema acts as a contract between the producer and consumers, so both parties know the exact format of the data.

Over time, as applications evolve, the data they produce often changes too. However, schema changes can affect downstream consumers that expect data in a specific format. Without a way to manage schemas between producers and consumers, it's difficult to change the data written in messages or events without risking breaking downstream applications. To avoid this kind of issue, the schema of the messages must be able to evolve as new properties are added, while still allowing consumers to understand data in both the old and new formats. This concept is known as schema evolution, and Pulsar supports it.

This article discusses why schemas evolve before diving into how Pulsar implements and supports schema evolution.

Why Schemas Evolve

Schemas provide context about raw data. They often describe a particular entity in a system, encompassing all the properties of that entity. For example, you may have an application that signs up users. It stores user details such as names, email addresses and ages. There would be a user schema that describes the underlying data and provides context, such as the name of the field and the type of data in it. The schema may look as follows:
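One way to express this is as an Avro-style record definition. The exact field names below are illustrative assumptions:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}
```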

Now, say you wanted to expand the data captured to include address data and support direct mailing to the users. You would then need to expand the schema to include the new fields for capturing addresses, such as the first line of the address, city and ZIP code. After including these new fields, the schema will look like this:
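Continuing the illustrative Avro sketch, the new address fields could be added as optional (nullable, with defaults), which is the usual way to keep such an addition non-breaking:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "addressLine1", "type": ["null", "string"], "default": null },
    { "name": "city", "type": ["null", "string"], "default": null },
    { "name": "zipCode", "type": ["null", "string"], "default": null }
  ]
}
```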

This is a simple form of schema evolution, as the original fields haven’t changed, and only new fields have been added. In most cases, this shouldn’t be a breaking change for downstream consumers, as consumers can continue as if the new fields were absent. The consumers would just need to be updated to consume and use the new properties.

However, sometimes existing fields need to be amended to support new functionality. For example, say that users were uncomfortable giving an exact age, so you change the application to capture age ranges like 18-24, 25-39, 40-49 and 60+ instead. The age field would need its data type amended from integer to string.
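In the illustrative Avro sketch, this change amounts to swapping the field's type, with the string values holding ranges such as "18-24":

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "age", "type": "string" }
  ]
}
```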

This is a more complex schema evolution, as it could break downstream consumers that process the age property expecting it to be a number, or that parse it in a strongly typed language like Java. They might also perform numeric calculations on the property, which would no longer work in its new format.

To overcome this challenge, data platforms can support schema evolution to handle scenarios like this. Pulsar recognizes the importance of schemas to data processing; in fact, it treats them as first-class citizens by including built-in schema evolution support. Let's look at how Pulsar does this.

How Pulsar Supports Schema Evolution

Pulsar defines schemas in a data structure called `SchemaInfo`. This is a Pulsar-specific data structure, which is itself a schema, that describes the schema of messages transported via Pulsar. Each `SchemaInfo` belongs to a topic (message channel) and describes the messages to be produced and consumed using that topic.

Each `SchemaInfo` has a type that details the kind of schema being used. This can be anything from a primitive type, such as an integer or a string, to a complex schema format such as Avro or Protobuf.
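As a rough sketch (not the exact wire format), the `SchemaInfo` for an Avro-based user topic carries a name, a type, the schema definition itself, and a free-form properties map:

```json
{
  "name": "users",
  "type": "AVRO",
  "schema": "<the Avro record definition, stored as bytes>",
  "properties": {}
}
```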

To support schema evolution, Pulsar uses schema compatibility checks to compare an incoming schema against a topic's known schema(s). These checks occur when a producer or consumer attempts to connect to a particular topic. The strategy is chosen when configuring the broker, via the `schemaCompatibilityStrategy` setting. The goal is to compare the `SchemaInfo` provided by the connecting client with the existing `SchemaInfo` to see if they are compatible.
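For example, the broker-wide default can be set in `broker.conf`; the value names one of the compatibility strategies described below (FULL is the usual default):

```
# broker.conf: default strategy used for schema compatibility checks
schemaCompatibilityStrategy=FULL
```

Namespaces can also override this default through `pulsar-admin`, if different topics need different rules.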


Pulsar supports eight different schema compatibility strategies, which you can choose based on the requirements of the change. These strategies also come with rules outlining what changes are allowed and pointers on which clients to update first. (Check out the Pulsar documentation for a deeper explanation of each compatibility strategy.)

Returning to the earlier examples, let’s implement the schema changes using Pulsar’s compatibility strategy. First, start with the initial user schema (without the address).
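In the Avro sketch used earlier (field names are illustrative), that initial schema is the record without any address fields:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}
```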

This will be V1 of your schema. So, when you implement a Pulsar producer or consumer for the first time, the `SchemaInfo` for this version will be stored, and the producer and consumer will work as expected.

Next, you want to add the new address fields to your user schema. The first step is to consult the schema compatibility strategies and determine which one is best for this change. Using the “changes allowed” column in the documentation, you’re looking for any strategy that allows the addition of new fields. This gives you BACKWARD, BACKWARD_TRANSITIVE, FORWARD, and FORWARD_TRANSITIVE.

BACKWARD should be used when there's no guarantee that consumers using an older schema version can understand data produced with the new one. FORWARD is used when consumers on the latest schema version might not be able to read data produced with older versions. If you want to update all of the consumers to the new schema first, use a BACKWARD strategy; otherwise, FORWARD is best.
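To build intuition for what these strategies promise, here's a toy sketch in plain Python. This is not Pulsar's actual check, just the reader-side idea: FORWARD-compatible changes let old readers ignore new fields, while BACKWARD-compatible changes let new readers fill missing fields from schema defaults.

```python
# Toy illustration of compatibility semantics using plain dicts
# (an assumption-laden sketch, not Pulsar's implementation).

OLD_FIELDS = {"name", "email", "age"}
NEW_FIELDS = OLD_FIELDS | {"city", "zip"}
NEW_DEFAULTS = {"city": "", "zip": ""}  # defaults declared for the added fields

def read(record: dict, reader_fields: set, defaults: dict) -> dict:
    """Project a record onto the reader's schema: ignore unknown fields,
    fill missing ones from the reader schema's defaults."""
    return {f: record.get(f, defaults.get(f)) for f in reader_fields}

# FORWARD: a consumer still on the OLD schema reads NEW-schema data;
# the added fields are simply ignored.
new_record = {"name": "Ada", "email": "ada@example.com", "age": 36,
              "city": "London", "zip": "NW1"}
old_view = read(new_record, OLD_FIELDS, {})

# BACKWARD: a consumer upgraded to the NEW schema reads OLD-schema data;
# the missing fields are filled from the schema's defaults.
old_record = {"name": "Bob", "email": "bob@example.com", "age": 41}
new_view = read(old_record, NEW_FIELDS, NEW_DEFAULTS)
```

The `_TRANSITIVE` variants apply the same idea against every previously registered version of the schema, not just the latest one.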

Looking at the bigger picture, Pulsar refers to the entire act of evolving a topic’s schema as schema verification. It’s the combination of a producer or consumer providing `SchemaInfo`, the chosen compatibility strategy for the topic, and the broker deciding what to do.

Conclusion

It’s rare for schemas to stay the same forever. As new features are introduced into applications, schemas often have to evolve to support these features. However, keeping the data's producers and consumers in sync can often be a challenge when schemas are modified.

Pulsar's built-in schema evolution concepts help deal with these changes. Schema compatibility strategies define the rules for how compatible different versions of a schema must be. Pulsar uses these rules in a schema verification process to determine which schemas a producer or consumer can use when connecting to a particular topic.




Written by datastax | DataStax is the real-time data company for building production GenAI applications.