With ever-increasing demands from other business units, IT departments have to be constantly looking for service improvements and cost-saving opportunities. This article showcases several concrete use-cases for companies that are investigating or already using Kafka, in particular, Kafka Connect.
Kafka Connect is an enterprise-grade solution for integrating a plethora of applications, ranging from traditional databases to business applications like Salesforce and SAP. Possible integration scenarios range from continuously streaming events and data between applications to large-scale, configurable batch jobs that can be used to replace manual data transfers.
Apache Kafka started as an in-house project at LinkedIn intended to replace the various queuing and message-passing systems that had evolved over the years. The goal was to create a horizontally scalable system, able to reliably store and handle many millions of messages per second. Today, the Kafka community provides a whole ecosystem of tools and features ranging from simple log-message passing to performing basic on-the-fly ETL jobs to complex data-pipeline processing. Its versatility and robustness in addition to its scalability make Kafka a great choice for the enterprise.
The power of Kafka comes at a price: While it's easy to use Kafka from a client perspective, the setup and operation of Kafka is a difficult task. Setting up a reliable Kafka cluster is a challenge that requires experience. If your company does not have the experience, it can take weeks, even months to set up and fine-tune a production-ready cluster. That's where Kafka on Heroku shines: Setting up a private Kafka cluster literally takes minutes, which makes it feasible for IT departments to focus on the services they want to provide instead of operating a Kafka cluster.
Adding Kafka Connect is the logical next step, once you are using Kafka. But Kafka Connect on Heroku is a huge time-saver, even if you are not planning to use Kafka directly from your application. With little operational overhead, Kafka Connect provides an integration point for any number of applications. Instead of running and maintaining different solutions for recurring data transfer, on-demand data transfer, and continuous updates between systems, Kafka Connect provides a unified solution to those problems.
In the next section you will learn about some typical use cases, and also find out which applications Kafka Connect is able to integrate.
IT departments constantly struggle with requirements from business units that must connect and integrate all kinds of custom and off-the-shelf applications. These integrations often use less-reliable mechanisms, like recurring batch jobs, instead of continuously "streaming" updates across systems. Kafka Connect provides a scalable, reliable solution that can often replace slow batch jobs by instead "streaming" updates. Common scenarios include:
IT departments could even go as far as creating new self-service opportunities, where manual or labor-intensive requests for data transfer or extraction are replaced by self-service processes that reduce the IT department's workload while improving response time for business units. Consider a system in which your company's marketing department creates a ticket in your service-desk software, which automatically creates and triggers a data-transfer that otherwise the IT service desk would have had to handle manually.
Kafka Connect can read from and write to an ever-increasing number of applications and services, including enterprise applications, like SAP and Salesforce:
Let's take a concrete example. A common integration scenario is this: You have two SQL databases and you need to update one database with information from the other database. For example, when the status of order changes in your fulfillment system, you might need to update the internal CRM database to reflect this change. Here is the concrete scenario:
For simplicity, we'll assume both are MySQL databases. The steps for other connectors would be the same, only with different configurations.
It takes two steps to set up this integration, assuming you have a working Kafka Cluster with Kafka Connect:
Here are the corresponding cURL calls, and an explanation of what each of them does. They use the Kafka Connect REST API to create the source and sink. Note that these calls are not specific to Heroku. They will work with any Kafka Connect installation:
1. Creating the source-connection.
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
"name": "jdbc_source_mysql_01",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:mysql://fulfillmentdbhost:3306/fulfillmentdb",
"connection.user": "fullfilment_user",
"connection.password": "<password>",
"topic.prefix": "order-status-update-",
"mode":"timestamp",
"table.whitelist" : "fulfullmentdb.status",
"timestamp.column.name": "LAST_UPDATED",
"validate.non.null": false
}
}'
2. Creating the sink-connection.
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
"name": "jdbc_sink_mysql_01",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:mysql://crmdbhost:3306/crmdb",
"connection.user": "crm_user",
"connection.password": "<password>",
"topics": "order-status-update-status",
"table.name.format" : "crmdb.order_status"
}
}'
And you're done! From this point on, any new entries in the STATUS table of the fulfillment database will automatically be written to the ORDER_STATUS table of the CRM database. Let's dive a little into what each call is doing.
The first cURL command tells Kafka Connect to use a specific type of source connector, namely JdbcSourceConnector, to connect to the MySQL database at fulfillmentdbhost:3306/fulfillmentdb using the supplied credentials. It also configures this connector to monitor the table called "status" in that database for new entries based on the timestamp in the column "LAST_UPDATED." As a result, whenever a new row is written to the table, Kafka Connect will automatically write a new message into the Kafka topic "order-status-update-status" for each new row.
Next, we need a sink connector that monitors this Kafka topic for changes. That's what the second cURL call does.
The second cURL command creates a sink connector (JdbcSinkConnector) that connects to a different database, namely crmdbhost:3306/crmdb. This connector will monitor the topic "order-status-update-status" for new messages and will write any new messages to the table "order_status" in the crmdb-database.
This example uses the JdbcSourceConnector and JdbcSinkConnector to connect two relational databases. Other types of connectors can be installed and then used in the same way. Each connector comes with its own documentation and set of options. An unofficial list of available connectors with links to their documentation can be found on the Confluent website.
You have seen how to set up a continuously streaming integration between two SQL databases with just two commands. In a similar fashion, you could integrate any of the applications that Kafka Connect provides connectors for -- or even write your own. Just set up a source and a sink connector and you are ready to stream your updates.
But this is not all. Once you have set up a source connector, you can use the same source to update multiple destinations ("sinks"). In our example, you could run a third cURL command to write all order-status updates to another database as well.
If your team faces any of the problems described in this article, you should give Kafka Connect a try. Kafka Connect adds a whole new set of capabilities to an existing Kafka cluster, that will make your team's life easier in the long run. And if you are currently not running a Kafka cluster, this article has shown you how to set it up within minutes. Once Kafka Connect is up and running, your team can focus on solving actual business problems.
Once your team has mastered the basic workflows described in this article, you might want also want to check out Kafka Connect's ability to transform data on the fly, which can be used to solve a whole new set of problems.