Welcome to this issue of Activation Function. Every other week, I introduce you to a new and exciting open-source backend technology (that you’ve probably only kind of heard about… ) and explain it to you in 5 minutes or less so you can make better technical decisions moving forward.
In this issue, we’ll explore Redpanda, a Kafka-API-compatible distributed data streaming platform. Redpanda bills itself as a simple, high-performance, cost-effective drop-in replacement for Kafka.
I know… I know… high-performance, drop-in replacement, cost-effective, all those are the usual suspects every new vendor throws around. Is Redpanda the real deal? Let’s find out.
Note: Data stream processing, event stream processing, event data stream processing; I’m sure some of you can argue the nuances of these, but I don’t want to be part of that conversation, so I’ll consider them all interchangeable.
Stream processing isn’t anything new. It’s been an active research field for over 20 years. However, it has evolved significantly and moved from a niche to a mainstream technology over the past ten years, driven mainly by technological advancements (e.g., MapReduce) and new requirements (e.g., real-time data feeds at scale).
Without going down the rabbit hole, here’s a quick and oversimplified rundown of how we got to where we are today:
Over the years, Kafka evolved from a simple message broker into a full-fledged streaming platform used by over 80 percent of Fortune 100 companies. It’s performant, reliable, and scalable.
Cool… So why do we need alternatives?
The simple answer is that things have evolved, and it can be hard to adapt older technologies to new needs (e.g., ML workloads, cloud-native apps, etc.) or leverage new technology advancements (e.g., new-generation hardware). Eventually, people start running into bottlenecks and start thinking about new ways of getting things done.
To some, Kafka’s “outdated” architecture makes it difficult and labor-intensive to deploy, manage, and optimize.
That’s what Redpanda is looking to capitalize on.
Redpanda (formerly Vectorized) was launched in 2019 to build a modern data streaming platform that is a more performant, less complex, and cost-effective alternative to Kafka. How? By building from the ground up and going low-level to fully leverage modern hardware.
Before we delve into the differences between Redpanda and Kafka, let’s briefly go over the similarities:
[Note] Redpanda has NOT achieved 100% API compatibility with Kafka, so you shouldn’t expect 100% of the same behavior as with Kafka.
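To make the "drop-in" idea concrete, here is a minimal sketch of what Kafka API compatibility means from the client side: an ordinary Kafka producer, pointed at a broker address. It assumes the third-party kafka-python package and a broker listening on an address you supply; the function names, topic, and client ID are hypothetical. Nothing in it is Redpanda-specific, which is the point, but given the note above, treat it as something to verify against your own cluster rather than a guarantee.

```python
def producer_config(bootstrap_servers: str) -> dict:
    """Settings for a standard Kafka producer; nothing here is Redpanda-specific."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "acks": "all",  # wait for all in-sync replicas to acknowledge
        "client_id": "activation-function-demo",  # hypothetical client name
    }


def send_one(bootstrap_servers: str, topic: str, payload: bytes):
    """Send a single message and return (topic, partition, offset)."""
    # Imported lazily so this module loads even where kafka-python is missing.
    from kafka import KafkaProducer

    producer = KafkaProducer(**producer_config(bootstrap_servers))
    try:
        # send() is asynchronous; get() blocks until the broker acknowledges.
        metadata = producer.send(topic, value=payload).get(timeout=10)
        return metadata.topic, metadata.partition, metadata.offset
    finally:
        producer.close()
```

Calling send_one("localhost:9092", "demo-topic", b"hello") against a Kafka broker and then against a Redpanda broker, with only the address changed, is a cheap first compatibility check before you test the features you actually depend on.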
Redpanda draws its performance advantage from a few factors:
[Thought] I wonder how much of the C++ performance advantage is negated if you use Java-based components like Kafka Connect, which Redpanda integrates with.
[Note] If you want to go down the rabbit hole of Thread-Per-Core Architecture, start here.
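If you'd rather see the thread-per-core idea than read about it, here is a toy sketch. It is not Redpanda's implementation: Python threads are not pinned to CPU cores, and every name in it (Shard, ShardedStore, shard_for) is made up for illustration. What it does show is the core principle: each worker exclusively owns its slice of the data, and work reaches it through a message queue rather than through shared state guarded by locks.

```python
import queue
import threading
import zlib


class Shard:
    """One worker that exclusively owns its slice of the data."""

    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.inbox = queue.Queue()
        self.log = {}  # key -> list of values; only this shard's thread writes here
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # shutdown signal
                break
            key, value = item
            self.log.setdefault(key, []).append(value)


class ShardedStore:
    """Routes every key to exactly one shard, so shards never share data."""

    def __init__(self, num_shards: int):
        self.shards = [Shard(i) for i in range(num_shards)]

    def shard_for(self, key: str) -> int:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        return zlib.crc32(key.encode()) % len(self.shards)

    def append(self, key: str, value: str) -> None:
        # Message passing instead of locking: hand the work to the owning shard.
        self.shards[self.shard_for(key)].inbox.put((key, value))

    def close(self) -> None:
        for shard in self.shards:
            shard.inbox.put(None)
        for shard in self.shards:
            shard.thread.join()
```

Because a key always lands on the same shard and each shard drains its inbox in order, per-key ordering is preserved without any locks around the data itself.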
All of this means that, in theory, Redpanda should outperform Kafka, but as we know, it’s a bit more complicated than that. Long story short, you’ll need to find out by running benchmarks for your specific use case.
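If you do run your own benchmarks, here is a minimal sketch of the measurement side, using only the standard library. The helper names are hypothetical, and `operation` stands in for whatever you want to time (for example, a produce call against your own cluster). Reporting percentiles rather than just an average is my suggestion, not something either vendor prescribes, since tail latency is often where streaming platforms differ.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(operation, iterations):
    """Call `operation` repeatedly and return each call's latency in ms."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    """Median, tail, and worst-case latency in milliseconds."""
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "max_ms": max(latencies),
    }
```

Run the same workload, message size, and hardware against both platforms, and compare the summaries side by side.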
But just in case you’re interested in existing benchmarks, the Redpanda team released these benchmarks in 2022, to which Jack Vanlightly (Staff Technologist at Confluent) responded with his own benchmarks.
TL;DR:
This is an area that is easier to assess, and many agree that Redpanda has the upper hand when it comes to self-managed environments. This is mainly because Redpanda boasts a more streamlined architecture than Kafka without relying on external components like the JVM or dedicated ZooKeeper servers.
Redpanda can be set up in a cluster with one or multiple nodes, similar to Kafka. These clusters can span various availability zones or regions. However, unlike Kafka, every Redpanda node operates using the same binary, taking on different roles, such as acting as a data broker or providing supplementary services like an HTTP proxy or schema registry.
Each node inherently uses the Raft consensus algorithm, eliminating the need for external services like the Quorum-based Controller in Kafka's KRaft.
[Technical Note] Kafka moved away from ZooKeeper to KRaft to reduce operational overhead and improve scalability.
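Since Raft comes up in both camps, here is a quick sketch of the one piece of arithmetic worth remembering: Raft commits a write once a majority of nodes has acknowledged it. The function names are mine, and this deliberately leaves out everything else Raft does (terms, leader election, log matching).

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


def is_committed(acks: int, nodes: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(nodes)
```

This is why clusters are usually sized with an odd number of nodes: three nodes tolerate one failure, and a fourth node raises the quorum to three without letting you lose a second node.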
Just like performance, TCO is hard to generalize, so you’ll have to do your own benchmarking. Redpanda claims that its self-hosted version is much cheaper than Kafka as it requires fewer nodes to achieve similar performance (aka smaller infrastructure footprint) and requires less administrative overhead.
That being said, remember the TCO includes other factors like labor cost, which can tip the scales in favor of Kafka, as it’s safe to assume that finding engineers who know Kafka is cheaper and easier given the much larger expert community.
Also, remember that even if you decide to go with the hosted version, you’ll have many more vendors and pricing options to choose from when going with Kafka. On the other hand, Redpanda is the only vendor offering a fully managed version of its platform.
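To keep the TCO comparison honest, it helps to write the arithmetic down. Below is a deliberately simple sketch that adds infrastructure and labor cost per month. The class and field names are hypothetical, and it ships with no numbers on purpose: every input has to come from your own quotes, benchmarks, and payroll.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Monthly cost inputs for one self-hosted option (all values are yours)."""

    name: str
    nodes: int
    node_cost_per_month: float
    ops_hours_per_month: float
    engineer_hourly_rate: float

    def monthly_tco(self) -> float:
        # Infrastructure: what the cluster itself costs to run.
        infrastructure = self.nodes * self.node_cost_per_month
        # Labor: the engineering time spent operating it.
        labor = self.ops_hours_per_month * self.engineer_hourly_rate
        return infrastructure + labor
```

Build one Deployment per option and compare monthly_tco(). A smaller node count can be outweighed by a higher hourly rate or more operations hours, which is exactly the trade-off described above.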
As you can imagine, there are many more technical nuances to unpack for such a technical product. So here are a few quick facts you should be aware of:
Redpanda is still in its early days but has already amassed a solid customer base with names like Cisco, Akamai, Lacework, Vodafone, Moody’s, Midjourney, Activision, and many others.
Redpanda is used in a variety of use cases where latency is essential, including real-time cybersecurity monitoring, real-time market pricing and analytics, and many more.
Based on the case studies listed above, it looks like Redpanda is a good fit in the following situations:
Many of these will be apparent, and I’ve already mentioned a few, but reasons you wouldn't want to use Redpanda include:
This one is a tricky question to answer, and my opinion is based solely on my experience working on Memgraph (a C++ wire-compatible graph DB alternative to Java-based Neo4j) and my observation of ScyllaDB (a C++, API-compatible replacement for Java-based Cassandra).
Many organizations have sunk a lot of resources into Kafka, and it will make more sense for them to keep throwing money at Confluent to deal with operational complexity, performance optimization, etc. That said, today’s focus on cost-cutting might push a few to replace some of their Kafka clusters to test the waters.
I think new startups or even new projects within larger organizations will adopt Redpanda to speed up development, reduce complexity, and achieve better performance for more demanding use cases. But this will be a slow adoption driven by a few discerning engineers.
I might sound bearish, but I’m excited about watching Redpanda grow and reach new heights. I think the team has built a great piece of technology, and as they grow their customer portfolio, community, resources, and tooling ecosystem, I think they’ll give Confluent a run for its money.
Alright, so now that I hopefully got you intrigued by Redpanda, here are a few resources you should check out to dig a little deeper:
That’s it, folks! I hope this gave you an overview of Redpanda and how it compares to Kafka on a high level. As always, there is a lot to learn and many nuances to consider for your specific needs. The only way to determine if this is for you is to take it for a quick spin, which should be simple, thanks to the single binary deployment.
Until next time!