Trying to Scale Apache Kafka? Consider Using Apache Pulsar

Written by datastax | Published 2023/04/10

TL;DR: We compare the differences between Kafka and Pulsar, demonstrating how a logical next step for scalability when using Kafka is switching to Pulsar.

Today, even the most basic web and mobile applications consume a lot of data. The key to exchanging and acting on this data is a messaging system backed by an event-driven architecture.

An event-driven system enables messaging solutions and processing to be scalable and asynchronous. Asynchronous systems can handle more requests, as each request is handled in the background.

When a request is made to the server, it’s added to a queue, where a processor will read it. This enables organizations to build systems accepting hundreds of thousands — or even millions — of requests per second at scale by processing the requests in a separate cluster.
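The queue-and-processor flow described above can be sketched in a few lines of standard-library Python. This is a minimal, in-process illustration only: `handle` and `run_pipeline` are hypothetical names, and a real deployment would put a broker such as Kafka or Pulsar between the server and a separate processing cluster.

```python
import queue
import threading


def handle(request: str) -> str:
    """Stand-in for real request processing (hypothetical handler)."""
    return request.upper()


def run_pipeline(requests: list[str], workers: int = 2) -> list[str]:
    """Enqueue requests and let background workers process them."""
    inbox: queue.Queue = queue.Queue()
    results: list[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more work for this worker
                inbox.task_done()
                return
            processed = handle(item)
            with lock:  # protect the shared results list
                results.append(processed)
            inbox.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for r in requests:  # the "server" only enqueues, then moves on
        inbox.put(r)
    for _ in threads:  # one sentinel per worker
        inbox.put(None)
    inbox.join()
    for t in threads:
        t.join()
    return results
```

Because the workers run concurrently, the order of the results is not guaranteed to match the order of the requests.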

The industry has produced several message-broker systems and topic-driven publish-subscribe (pub-sub) platforms that follow this event- and message-driven format. Apache Kafka and Apache Pulsar are two widely used examples of distributed message delivery and streaming systems.

Kafka and Pulsar are both built on a pub-sub pattern that you can use to scale message delivery to thousands of connected clients. Both offer a persistent storage model to ensure the messages aren’t lost, and both use partitions to store and process the messages.
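To make the idea of partitions concrete, here is a small sketch of key-based partitioning. It assumes a toy in-memory `Topic` class (hypothetical) and uses CRC32 purely as a stable illustrative hash; it is not the hash function either system actually uses.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition using a stable hash.

    CRC32 is an illustrative choice, not what Kafka or Pulsar use.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Toy in-memory topic: one append-only list per partition."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: list[list[tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def publish(self, key: str, value: str) -> int:
        """Append a message and return the partition it landed in."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p
```

Messages that share a key always land in the same partition, which is what preserves per-key ordering while still letting different partitions be processed in parallel.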

While Kafka and Pulsar are similar in many ways, they have some notable differences in capabilities — particularly when managing large amounts of data, creating real-time applications and developing at scale.

Kafka provides many benefits, but Pulsar’s support for scalability and growth is unmatched. And at a certain point in growth, the optimal choice is to no longer attempt to optimize Kafka, but instead part ways with it. Here, we’ll compare the differences between Kafka and Pulsar, demonstrating how a logical next step for scalability when using Kafka is switching to Pulsar.

Challenges with Apache Kafka Apps

Kafka is the de facto standard for distributed pub-sub patterns in software architecture. An organization using Kafka can handle thousands of messages and broadcast them to several consumers at the same time.

Kafka has several benefits, but it also has certain limitations when trying to scale. Let’s explore some challenges you’ll face when trying to scale applications built with Apache Kafka.

Storage Limitations

Kafka’s architecture creates the first challenge you’ll face when scaling your applications in Kafka: storage.

Stateful brokers are the first reason an organization finds it challenging to scale. In Kafka, each partition’s data is stored on the local disk of the broker that leads it (and of the brokers that hold its replicas). Because the data is tied to the nodes, the brokers are stateful. This means that once a broker’s disk has reached its maximum capacity, it can’t accept more messages for the partitions it hosts unless infrastructure storage is increased. This is challenging because, in an ever-growing environment, a cluster will require multiple upgrades.

One way to surmount this challenge is to purchase a large storage cluster, which is very expensive.

Additionally, based on this architecture, once the platform has hit the maximum storage or memory limit, it can’t accept new incoming messages. This can lead to a huge loss for business-critical applications. The architecture of Kafka is designed to accept and broadcast a lot of messages. Long-term data storage isn’t a priority. As a result, scaling a Kafka application is very challenging because it can’t provide the storage you need — at least not without a hefty price tag.
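A rough back-of-the-envelope calculation shows why broker-local storage fills up quickly. The function below is a hypothetical helper, and it rests on simplifying assumptions: data is spread evenly across brokers, nothing is deleted by retention, and units are decimal (1 GB = 1,000 MB).

```python
def days_until_full(
    brokers: int,
    disk_gb_per_broker: float,
    ingest_mb_per_s: float,
    replication_factor: int,
) -> float:
    """Estimate how many days until the cluster's local disks are full.

    Assumes even data distribution and no retention-based deletion.
    """
    total_capacity_mb = brokers * disk_gb_per_broker * 1000
    # Every ingested byte is written once per replica.
    write_rate_mb_per_s = ingest_mb_per_s * replication_factor
    seconds = total_capacity_mb / write_rate_mb_per_s
    return seconds / 86_400  # seconds per day
```

For example, three brokers with 2,000 GB each, ingesting 50 MB/s at a replication factor of 3, would fill their disks in 40,000 seconds, or about 11 hours, under these assumptions. Retention policies change that picture in practice, but the disk attached to each broker remains the hard limit.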

Troubles with Message Processing

Managing Kafka is challenging because it doesn’t include features necessary for activity monitoring, processing messages, and data persistence.

Kafka shines for headless message broadcasting systems, where you don’t need to mutate a message before delivery. However, suppose you need to process a message before forwarding it to the consumers; this requires reliance on additional platforms, which makes it more challenging and complex to process messages with Kafka.

Moreover, involving other platforms, such as a separate stream-processing engine, significantly increases the complexity of your data delivery system, as each component of the streaming platform requires maintenance and has limitations that apply to the entire cluster. Additionally, Kafka clusters offer limited data and message persistence, which becomes a constraint as your data requirements grow over time.

Complicated Client Libraries

Enterprises mainly use Kafka for its streaming services. The streaming API is written on top of the pub-sub message delivery to support a distinct business case. The Kafka Streams API is a separate client library that offers advanced features aimed at enterprise customers. Its most notable feature, transactions, helps enterprises ensure the consistency of the output generated by the flow of messages. As a result, Kafka has a separate API for each use case.

For example, the Kafka Streams library enables enterprises to offer “exactly once” delivery for messages. The delivery guarantees that both Kafka and Pulsar offer are:

  • At least once
  • At most once
  • Exactly once

The “exactly once” guarantee means that each message produces exactly one associated output, even if a consumer crashes and restarts. However, the plain Consumer API, which allows applications to read streams of data from topics in the Kafka cluster, doesn’t provide this on its own, requiring you to write most of these features yourself. This makes it difficult to use a single client library for all the features you need for your business, which isn’t sustainable when you’re working at scale.
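The difference between these guarantees is easier to see in a small simulation. The sketch below is not how Kafka or Pulsar implement exactly-once semantics. It is a simplified, hypothetical model in which an at-least-once delivery loop redelivers a message after a simulated consumer crash, and an idempotent processor deduplicates by message ID so that each message still yields a single output.

```python
class DedupingProcessor:
    """Processes each message ID at most once (idempotent consumer)."""

    def __init__(self) -> None:
        self.seen_ids: set[int] = set()
        self.outputs: list[str] = []

    def process(self, msg_id: int, payload: str) -> bool:
        """Return True if the message was processed, False if skipped."""
        if msg_id in self.seen_ids:
            return False  # duplicate delivery: ignore it
        self.outputs.append(payload.upper())
        self.seen_ids.add(msg_id)
        return True


def deliver_at_least_once(
    messages: list[tuple[int, str]],
    processor: DedupingProcessor,
    crash_at_index: int,
) -> int:
    """Deliver every message, redelivering one after a simulated crash.

    Returns the total number of deliveries attempted.
    """
    deliveries = 0
    for i, (msg_id, payload) in enumerate(messages):
        processor.process(msg_id, payload)
        deliveries += 1
        if i == crash_at_index:
            # The acknowledgment was lost, so the same message arrives again.
            processor.process(msg_id, payload)
            deliveries += 1
    return deliveries
```

With three messages and one simulated crash, four deliveries occur but only three outputs are produced. Without the deduplication step, the same loop would give at-least-once behavior, with one duplicated output.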

Enter Pulsar

For each Kafka limitation highlighted above, Pulsar has a solution. The following sections outline some of Pulsar’s benefits.

Persistent Data Storage

Pulsar provides the message streaming and publishing features that Kafka does, but adds the ability to persist the data for longer periods.

Pulsar offers data storage persistence using Apache BookKeeper. BookKeeper maintains the data and moves persistence out of the brokers. You can also use other storage media, such as AWS S3, to store data and scale beyond the limits of a local disk, meaning you can expand your applications without the same storage concerns.

Additionally, Pulsar includes a tiered storage feature that helps move the data between hot and cold storage options; data can then be stored in cold storage for as long as the business needs. The cluster doesn’t require a continuous change in the infrastructure size for the storage options.

Pulsar can also automatically move older messages from BookKeeper to a cheaper, cold storage option. Once a segment of the data is sealed and immutable, it can be moved to cheaper storage, so the amount of data Pulsar can retain is no longer bounded by the size of a local disk.
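The segment-based offloading described above can be modeled with a toy log. The `TieredLog` class and its parameters are hypothetical and only illustrate the principle: new entries go to an active segment, full segments are sealed, and the oldest sealed segments move to cold storage once the hot tier holds more than a configured number of them.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    entries: list = field(default_factory=list)
    sealed: bool = False  # sealed segments are never modified again


class TieredLog:
    """Toy append-only log with a hot tier and a cold tier."""

    def __init__(self, segment_size: int, hot_limit: int) -> None:
        self.segment_size = segment_size
        self.hot_limit = hot_limit  # max sealed segments kept in the hot tier
        self.hot: list[Segment] = [Segment()]  # last element is the active one
        self.cold: list[Segment] = []

    def append(self, entry) -> None:
        active = self.hot[-1]
        active.entries.append(entry)
        if len(active.entries) >= self.segment_size:
            active.sealed = True  # segment is now immutable
            self.hot.append(Segment())
            self._offload()

    def _offload(self) -> None:
        # Every hot segment except the last (active) one is sealed.
        while len(self.hot) - 1 > self.hot_limit:
            self.cold.append(self.hot.pop(0))  # oldest sealed segment first

    def read_all(self) -> list:
        """Read entries in order, transparently spanning both tiers."""
        return [e for seg in self.cold + self.hot for e in seg.entries]
```

Readers see one continuous log regardless of where a segment lives, which is what lets the hot tier stay small while total retention keeps growing.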

Developer Experience

From the developer’s perspective, Pulsar offers an integrated, simple client library for all major languages (Java, Python, Go, and C#). The libraries help developers get started with the platform quickly, which is key when developing and releasing applications at scale. Pulsar’s binary protocol extends the features of the client library as needed, making the library suitable for growth. (Here’s the list of available and officially supported Pulsar client libraries.)
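As an illustration of the client library, here is a minimal sketch using the Pulsar Python client. It assumes the `pulsar-client` package is installed and that a broker is reachable at the URL you pass in. The topic name, subscription name, and the `encode_events` helper are placeholders chosen for this example.

```python
def encode_events(events: list[str]) -> list[bytes]:
    """Pulsar message payloads are bytes, so encode strings first."""
    return [e.encode("utf-8") for e in events]


def publish_and_consume(service_url: str, topic: str, events: list[str]) -> list[str]:
    """Send events to a topic and read them back on one subscription."""
    # Imported here so the helper above works without the client installed.
    import pulsar  # third-party: pip install pulsar-client

    client = pulsar.Client(service_url)
    try:
        # Subscribe before producing so no message is missed.
        consumer = client.subscribe(topic, subscription_name="demo-subscription")
        producer = client.create_producer(topic)
        for payload in encode_events(events):
            producer.send(payload)

        received = []
        for _ in events:
            msg = consumer.receive(timeout_millis=5000)
            received.append(msg.data().decode("utf-8"))
            consumer.acknowledge(msg)  # tell the broker it was handled
        return received
    finally:
        client.close()
```

Against a local standalone broker, a call might look like `publish_and_consume("pulsar://localhost:6650", "demo-topic", ["hello"])`.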

Pulsar Functions

Pulsar Functions is an out-of-the-box feature that enables developers to write custom code that can process messages in the message stream without needing to deploy a system like Apache Heron, Apache Flink, or Apache Storm.

Pulsar Functions also underpin Pulsar IO, a serverless connector framework that makes it easier to move data into and out of Pulsar. This out-of-the-box system enables Pulsar to be connected to external SQL and NoSQL databases, such as Apache Cassandra.

Additionally, this message processing is stream-native, meaning the messages are processed and transformed inside the cluster before they’re delivered to the consumers. Because Pulsar Functions are the computing infrastructure of the Pulsar messaging system, they support business-level objectives, including developer productivity, easy troubleshooting, and operational simplicity — qualities crucial to application and team performance when working at scale.
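In Python, a Pulsar Function can be as small as a module that defines a `process` function, which Pulsar calls once per incoming message and whose return value is published to the output topic. The sketch below assumes string payloads, and the normalization logic is only an example.

```python
def process(input):
    """Normalize each incoming message before it reaches consumers.

    Assumes the payload arrives as a string. The parameter is named
    `input` to follow the convention used in Pulsar's own examples.
    """
    return input.strip().lower()
```

Because the function is plain Python, it can be unit-tested locally before it is deployed to a cluster.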

Scalability

In addition to the features and services mentioned above and their influence on scalability, Pulsar offers various features that make it a scalable option for your enterprise’s message streaming and publishing needs.

Pulsar’s geo-replication feature enables Pulsar to be highly scalable. The cluster replicates the data to multiple locations across the globe for use in case a disaster brings down the application. Replication can be synchronous or asynchronous. Asynchronous replication is faster but provides weaker data consistency guarantees than synchronous replication.

Pulsar uses a broker-per-topic concept, ensuring that the same broker handles all the requests for a topic. The Pulsar architecture demonstrates how the broker-based approach improves the system’s performance compared to a Kafka cluster.
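One way to picture topic ownership is the simplified model below. Pulsar groups topics into bundles and assigns bundles to brokers. The hash function, the round-robin assignment, and all function names here are illustrative assumptions rather than Pulsar's actual load-balancing logic.

```python
import zlib


def bundle_for(topic: str, num_bundles: int) -> int:
    """Hash a topic name into one of a fixed number of bundles.

    CRC32 is used only because it is stable across runs.
    """
    return zlib.crc32(topic.encode("utf-8")) % num_bundles


def assign_bundles(num_bundles: int, brokers: list[str]) -> dict[int, str]:
    """Assign bundles to brokers round-robin (illustrative policy)."""
    return {b: brokers[b % len(brokers)] for b in range(num_bundles)}


def owner_of(topic: str, num_bundles: int, assignment: dict[int, str]) -> str:
    """Look up the single broker that serves every request for a topic."""
    return assignment[bundle_for(topic, num_bundles)]
```

Every lookup for the same topic resolves to the same broker. Because brokers hold no message data themselves, moving a bundle to another broker changes only this mapping, not where the data is stored.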

Wrapping up

Kafka and Pulsar have some similarities, but there are some fundamental differences worth considering when selecting which platform to use — especially when you need scalability.

Kafka’s architecture, storage capabilities, and usability present numerous challenges that can inhibit an organization’s ability to grow. Trying to scale your Kafka clusters beyond a point becomes expensive and is often more trouble than it’s worth. From the way it stores data to the way it supports message transformation, Pulsar is the next-generation, unified challenger to Kafka that’s built for scalability.

Learn about DataStax Astra Streaming, built on Apache Pulsar and delivered as a fully managed service.

By Mary Grygleski. Mary is a streaming developer advocate at DataStax. She focuses on developing community advocacy and outreach for Java, open source and cloud technology including cloud native, serverless, event-driven, microservices, and reactive architectures.


Published by HackerNoon on 2023/04/10