Let's say you're building an application, for example, an e-commerce store. This application will contain necessary components like user authentication, notifications, order processing, inventory management, etc.
One challenge you'll encounter is deciding how these components should communicate. You might have them interact through direct integration or other messaging patterns, such as point-to-point messaging or the request-response pattern.
While these communication patterns can work, they often lead to tightly coupled systems and can become complex to manage as the application grows. Well, there's a better way to manage communication between these components, and that's with Pub/Sub systems.
Pub/Sub systems help decouple these components, making the system more flexible, scalable, and easier to maintain. With pub/sub, components don't need to know about each other's existence and can communicate through a centralized message broker.
This article aims to introduce you to Pub/Sub systems with Apache Kafka. You'll understand how Kafka works, the key components of its architecture, and what you can do with pub/sub systems. By the end of this article, you'll have a conceptual understanding of Kafka and pub/sub systems.
Pub/sub, short for publish/subscribe, is a messaging pattern that allows communication between different components or services in a distributed architecture. As the name suggests, a pub/sub system has publishers and subscribers.
In a pub/sub system, publishers send messages to a central message broker without knowing who the subscribers are. Subscribers express interest in specific messages or topics and receive only the messages that match their criteria.
Traditional messaging patterns follow the point-to-point messaging model. In this model, components communicate by sending messages directly to each other. This means the sender must know who the receiver is.
Meanwhile, Pub/Sub follows the publish/subscribe model, where components don't need to know about each other's existence. Instead, a publisher sends messages to a message broker, and a subscriber receives them.
From the above description, you can pick out the three core components of a pub/sub system. These are:
Publishers: the components that produce messages and send them to the message broker.
Subscribers: the components that express interest in certain messages or topics and receive the ones that match.
Message broker: the intermediary that receives messages from publishers and routes them to the right subscribers.
Several examples of pub/sub systems include RabbitMQ, Redis, Amazon SQS, Amazon SNS, Azure Service Bus, Google Cloud Pub/Sub, and the focus of this article, Kafka.
Apache Kafka is an open-source event streaming platform that allows you to decouple communication between distributed systems. As a streaming platform, Kafka handles continuous flows of data that can be processed in real time.
As a distributed system, Kafka works across multiple machines or servers, known as brokers, in a cluster where each broker manages a portion of the data, handles data processing and serves client requests.
Overall, Kafka allows organizations to build real-time data pipelines, stream processing applications, and event-driven architectures by providing an infrastructure for processing continuous data streams across distributed systems.
You can only fully grasp the concept of Kafka if you understand the key components of its architecture. Below are the components of the Kafka architecture:
Kafka Producer
These are applications or components that publish messages to Kafka topics. They write messages to one or more topic partitions, typically using a specified partitioning strategy.
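To make this concrete, here's a minimal sketch of a Java producer. The broker address, topic name, and key are illustrative assumptions, not fixed names:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of a broker in the cluster (assumed to be running locally)
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key ("order-42") influences which partition the message lands on
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}"));
        }
    }
}
```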
Kafka Consumer
Consumers are components that subscribe to Kafka topics and consume messages from their partitions. Each consumer belongs to a consumer group, and each group can have one or more consumer instances; Kafka divides a topic's partitions among the members of a group.
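Here's a matching consumer sketch; the group.id, topic name, and broker address are the same illustrative assumptions as above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions among themselves
        props.put("group.id", "order-processors");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Fetch whatever records have arrived since the last poll
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```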
Kafka Streams
This is a client library for building real-time stream-processing applications using Kafka. It allows developers to perform stateful and stateless processing of data streams directly within Kafka.
With Kafka Streams, you can read data from Kafka topics, process it using transformations or aggregations, and write the results back to Kafka topics or external systems.
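As a rough sketch, the following application reads records from a hypothetical input-topic, uppercases each value, and writes the results to output-topic; both topic names and the application ID are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from one topic, transform each value, and write to another topic
        builder.<String, String>stream("input-topic")
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```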
Kafka Connect
Kafka Connect is a framework for building and running connectors that stream data between Kafka and other data systems.
Connectors are plugins that handle the integration with external systems such as databases, message queues, file systems, cloud services, and more. Connectors can be source connectors (ingest data into Kafka from external systems) or sink connectors (which export data from Kafka to external systems).
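For example, here's a minimal standalone source connector configuration, modelled on the file connector example that ships with Kafka; the connector name, file path, and topic name are placeholders:

```properties
# Tails a local file and publishes each new line to a Kafka topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/app.log
topic=log-lines
```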
Admin Client
The Admin Client is a Java client library that provides administrative operations for managing Kafka clusters, topics, and configurations.
It allows developers and administrators to create, delete, list, describe, and alter topics, as well as query metadata about brokers, topics, partitions, and consumer groups.
With the Admin Client, you can manage Kafka clusters programmatically, automate administrative tasks, and build management tools and utilities.
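As an illustration, here's a minimal sketch that creates a topic programmatically; the topic name, partition count, and replication factor are arbitrary choices for the example:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic with 3 partitions and a replication factor of 1
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```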
REST Proxy
Kafka REST Proxy, a component of the Confluent Platform rather than Apache Kafka itself, is a RESTful interface for interacting with Kafka clusters over HTTP. It provides a simple HTTP-based API for producing and consuming messages from Kafka topics, managing consumer offsets, and querying metadata about topics and partitions.
The REST Proxy allows clients that do not have native Kafka libraries or support to interact with Kafka using standard HTTP methods, making Kafka accessible from a wide range of programming languages and platforms.
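As a rough sketch, the following snippet produces a JSON message over HTTP, assuming a REST Proxy instance listening on its default port (8082) and the same hypothetical orders topic used earlier:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProduce {
    public static void main(String[] args) throws Exception {
        // POST one record to the "orders" topic via the REST Proxy's v2 API
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/orders"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"records\":[{\"value\":{\"status\":\"created\"}}]}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```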
ZooKeeper (Optional)
ZooKeeper is used for coordination and management tasks in Kafka, such as leader election, partition reassignment, and maintaining cluster metadata.
While newer versions of Kafka can operate without ZooKeeper thanks to KRaft (Kafka Raft) mode, earlier versions relied on ZooKeeper for these essential coordination tasks.
Each component in the Kafka architecture has its own underlying process of operation. However, for a high-level overview of how all these components work together, let's consider how Kafka might work with a popular live-streaming platform: Twitch.
In the context of Twitch, the producer component would be responsible for generating and sending streaming data to the Kafka cluster. When a streamer goes live on Twitch, the Twitch servers act as producers. These servers continuously generate various events, such as chat messages, viewer interactions, video frames, and metadata about the live stream.
As streamers interact with their audience, these events are packaged into messages and sent to Kafka topics. Each streamer's channel could have its own topic, ensuring that events related to a particular stream are organized and accessible. For instance, chat messages might be sent to a chat_messages topic, while viewer interactions like likes, follows, or donations could be sent to a viewer_engagement topic.
The producer uses the Kafka client libraries to publish these messages to the Kafka cluster, where they are distributed across the brokers for fault tolerance and scalability. Kafka sends the producer an acknowledgement once a message has been successfully received and persisted in the cluster, ensuring reliability.
On the consumer side, Twitch would deploy consumer applications that subscribe to the relevant Kafka topics to process and analyze the streaming data. These consumer applications could be responsible for a variety of tasks, such as real-time moderation of chat messages, generating viewer engagement metrics, detecting copyright violations in video streams, and more.
These consumer applications leverage the Kafka consumer APIs to subscribe to one or more topics and receive messages as they are published. The Kafka cluster delivers each message to every consumer group subscribed to the corresponding topic (within a group, the topic's partitions are divided among the members), enabling real-time processing and analysis of the streaming data.
Optionally, Twitch could use the stream processing library, Kafka Streams, to perform more complex analytics and transformations on the streaming data. For example, Twitch could compute real-time statistics such as trending topics, viewer demographics, or popular streamers.
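For instance, here's a sketch of a Kafka Streams application that keeps a running count of chat messages per channel. It assumes records in the hypothetical chat_messages topic are keyed by channel name; the topic names and application ID are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Produced;

public class ChatMessageCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "chat-analytics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Since records are keyed by channel, counting by key yields
        // a continuously updated message count per channel
        builder.<String, String>stream("chat_messages")
               .groupByKey()
               .count()
               .toStream()
               .to("chat_message_counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```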
This overview doesn't go into detail about the whole Kafka architecture. To get more information, check out the Kafka documentation.
Kafka's appeal goes beyond being a distributed event streaming platform. The following are some of the advantages of using Kafka:
Unified Platform for Data Integration: Kafka combines data ingestion, storage, processing, and delivery into a single platform, making it easier to build and manage data pipelines. It can integrate with databases, file systems, and other data systems, allowing you to build end-to-end data pipelines and streamline data integration workflows.
Decoupling: Kafka's pub/sub model decouples producers from consumers, allowing them to operate independently and asynchronously. Producers can publish messages to Kafka topics without knowing who will consume them, and consumers can consume messages at their own pace.
Single source of truth: Kafka allows you to configure retention policies for topics, specifying how long messages should be retained. By default, Kafka retains messages for a configurable period, but you can also cap retention by log size. Because consumers can replay this retained history at any time, Kafka can act as a single source of truth for every message that passes through it (a minimal sketch of adjusting a topic's retention appears after this list).
Real-time Stream Processing: Kafka Streams allows you to build real-time stream processing applications directly within Kafka. With Kafka Streams, you can perform complex data transformations, aggregations, and analytics on data streams in real-time, enabling use cases such as real-time analytics, fraud detection, and monitoring.
Scalability: Kafka can scale horizontally, allowing you to add more brokers to your cluster to handle increased throughput and storage requirements. This scalability makes Kafka suitable for handling large volumes of data and supporting growing workloads without sacrificing performance.
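Here's the retention sketch referenced above: a minimal Admin Client example that sets a topic's retention.ms to seven days. The topic name is a placeholder, and the broker address is assumed:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // retention.ms is in milliseconds; 604800000 ms = 7 days
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, List.of(setRetention));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```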
In this article, you learnt what Pub/Sub systems are and explored a popular example: Kafka. You saw what Kafka is, how it works, and a few of its benefits.
Kafka is very versatile, with use cases ranging from streaming data processing to log aggregation, so gaining a deeper understanding of it pays off.
This article is just an introduction to Pub/Sub systems and Kafka. In the following article, you'll learn about the inner workings of the Kafka architecture. See you soon!