By Adam Bellemare, Principal Technologist at Confluent
The emergence of generative AI has resurfaced a long-debated question: how do you get your systems and services the data they need to do their jobs? While this question is most often asked about microservices and populating a data lake, generative AI has now pushed its way to the front of the list. This article explores how the data demands of generative AI are an extension of the age-old problem of data access, and how data streams can provide the missing answer.
The key problem with accessing data is that the services that create the original record are not necessarily well suited to hosting ad-hoc access to it. Your service may be perfectly capable of performing its actual business responsibilities, yet unable to serve that data to prospective clients. Even if you expose the data through an interface, the service might not be able to handle the query volume or the types of queries that are expected.
Data analysts ran into this problem decades ago, when the original system of record (an OLTP database) couldn't provide the power and performance needed for analytical use cases. A data engineer would extract the data from the original system of record and load it into an OLAP database so the data analysts could do their jobs. While the tools and technologies have changed over the decades, the gist remains the same: copying data from the operational space to the analytical space.
Fig 1: A simple Extract-Transform-Load (ETL) job copying data from the operational domain to the analytical domain.
Microservices have the same problem: how do they get the data they need? One common option is a direct query to the original system of record, via HTTP, SOAP, or RPC, for example. As in the data analyst case, the same limitations apply: the service cannot handle the access patterns, latency requirements, and load put on it by other dependent services. Updating the system to handle the new requirements may not be reasonable either, given its complexity, limited resources, and competing priorities.
Fig 2: Other services will require the data to solve their own business use cases, resulting in a web of point-to-point connections.
The crux of the matter is that the services that create the data are also expected to provide access to it for external systems. This open-ended requirement complicates things: the service must do a good job fulfilling its direct business responsibilities, while also supporting data access patterns that go well beyond those responsibilities.
Fig 3: The application that created the data is also responsible for fulfilling the on-demand data queries of all other services.
The solution to providing data access to services, systems, and AIs is a dedicated data communications layer, responsible only for the circulation and distribution of data across an organization. This is where data streaming comes in (also sometimes known as event streaming).
In short, your services publish important business data to durable, scalable, and replayable data streams. Other services that need that data can subscribe to the relevant data streams, consume the data, and react to it according to their business needs.
Fig 4: A dedicated data communication layer, provided by data streams, simplifies the exchange of data across your organization.
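To make the publish side concrete, here is a minimal producer sketch using the confluent-kafka Python client as one possible choice. The broker address, topic name, and event fields are illustrative assumptions, not a prescribed schema.

```python
# Minimal producer sketch (illustrative): a service publishes an order event
# to a durable, replayable stream after completing its own business logic.
# Broker address, topic name, and fields are assumptions for this example.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

def publish_order_created(order: dict) -> None:
    """Publish the business fact once; any number of consumers can read it later."""
    producer.produce(
        "orders",                                 # assumed topic name
        key=str(order["order_id"]).encode(),      # keyed so events for one order stay ordered
        value=json.dumps(order).encode(),
    )
    producer.flush()  # block until the broker has durably acknowledged the event

publish_order_created(
    {"order_id": 42, "customer_id": 7, "total": 99.95, "status": "CREATED"}
)
```

The producer's only new responsibility is emitting the event; it never has to answer downstream queries about it.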
Data streaming allows you to power services of any size (either micro or macro), populate your data lakes and other analytical endpoints, and power AI applications and services across your business.
Services don’t have to write all of their data to the data stream, only the data that is useful to others. A good place to start is to investigate the requests a service already handles, such as GET requests, as they illustrate the types of data commonly requested by others. Also, talk to your colleagues, as they’ll have a good idea of the types of data their services need to accomplish their tasks.
Other services read the data from the data streams and react to it by updating their own state stores, applying their own business logic, and generating results, which they may in turn publish to their own streams. There are three big changes for the consumer (a minimal consumer sketch follows the list):
- They no longer request data ad-hoc from the producer service - instead, they get all their data through the data stream, including new data, deleted data, and changes made to data.
- Since they no longer request data on demand, they must maintain a replica of the state that they care about within their own data stores. (Note: They do not need to store ALL data, just the fields that they care about)
- The consumer becomes solely responsible for its own performance metrics, as long as the data is available in the data stream. It is no longer reliant on the producer to handle its load or meet its SLAs.
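As a rough sketch of the consumer side, the following uses the confluent-kafka Python client to subscribe to a stream and maintain a local replica of just the fields the service cares about. The group id, topic name, fields, and in-memory store are assumptions for illustration; a real service would typically persist this state.

```python
# Minimal consumer sketch (illustrative): a dependent service subscribes to the
# stream and builds its own replica of only the fields it needs, instead of
# querying the producer on demand. Names and broker address are assumptions.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker
    "group.id": "shipping-service",          # hypothetical consumer group
    "auto.offset.reset": "earliest",         # replay the stream from the beginning
})
consumer.subscribe(["orders"])               # assumed topic name

order_totals = {}                            # local state store (in-memory for the sketch)

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        # Keep only the fields this service needs for its own business logic.
        order_totals[str(order["order_id"])] = order["total"]
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```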
Data streaming offers significant benefits to microservices, AI, and analytics.
- Makes data available to whatever systems, processes, or services need it. Data written to streams can be made widely available across your organization. The producer service writes the data once, and consumers can read it as often as they need. Cheap disk and cloud storage let you keep the data in the stream for as long as you require (including infinite retention!).
- Simplifies dependencies between producers and consumers. The producer is no longer responsible for serving the query patterns of those dependent on its data, and the consumer is no longer reliant on the producer’s compute and storage performance to serve its business needs. You significantly reduce the number of point-to-point connections across your business, relying instead on reusable, self-updating data sets.
- Decoupling: Consumer services can tolerate producer outages without significant service degradation, though the data stream will no longer be updated and will eventually become stale. Additionally, you can modify or swap out producers without affecting existing consumers, as they remain coupled only to the event streams.
- Power operational (OLTP-based) systems: Data streams enable you to build event-driven (micro)services that both consume data from streams and write their own data back to them. New services can be built on data that is already available in the streams, rather than on new point-to-point requests.
- Power both real-time and batch analytics: The very same data streams can feed real-time analytics, or serve as the source for building Iceberg or Delta tables for batch analytics.
- Fuel generative AI and AI agents: The same streams can also power generative AI. Data streams enable low-latency retrieval-augmented generation (RAG) and context building, so your AI queries always have the most relevant and up-to-date information (see the sketch after this list). Additionally, the emerging field of AI agents benefits from the very same event-based communication patterns that serve event-driven microservices.
- Fix bad data once, propagate everywhere: You can fix bad data at the source and propagate the correction through the data stream to all downstream consumers. While there are some nuances to handling bad data in events, there are many ways both to prevent bad data from getting in and to fix it if it does occur.
- You can still use point-to-point request/response connections. It’s not an all-or-nothing proposition. You can gradually migrate some services and workloads to data streaming, while leaving others on their existing request-response architectures.
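As one illustration of the RAG point above, the sketch below consumes the same stream and upserts each event into a context index so that retrieval stays current as business facts change. The embedding function and vector index here are toy placeholders standing in for whatever embedding model and vector store you actually use; topic, group id, and fields are assumptions carried over from the earlier sketches.

```python
# Illustrative low-latency RAG context building: consume business events from
# the same stream and upsert them into an index so AI queries retrieve fresh
# data. embed() and VectorIndex are placeholders, not real model/store clients.
import json
from confluent_kafka import Consumer

def embed(text: str) -> list:
    """Placeholder embedding; swap in a real embedding model client."""
    return [float(ord(c) % 7) for c in text[:16]]

class VectorIndex:
    """Toy in-memory index standing in for a real vector database."""
    def __init__(self):
        self.docs = {}
    def upsert(self, doc_id: str, vector: list, text: str) -> None:
        self.docs[doc_id] = (vector, text)

index = VectorIndex()
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker
    "group.id": "rag-context-builder",      # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])              # assumed topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        text = (f"Order {event['order_id']} for customer "
                f"{event['customer_id']}: status {event['status']}")
        # Context stays fresh as each new event arrives on the stream.
        index.upsert(str(event["order_id"]), embed(text), text)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```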
Data streams enable you to power operations, analytics, and AI, all from the same data source. As a data communication layer, they make it easy for your colleagues and their services to find and use the data they need for their business use cases.
The last major benefit is strategic. It is more difficult to quantify, but it is undoubtedly one of the most important. By investing in a data streaming layer, you open up a wide range of possibilities for putting your data to work. Apache Kafka, a popular choice for data streaming, offers a wide range of connectors for integrating with all kinds of systems and services. You’re no longer restricted to the AIs integrated with your data lake offering, or to those attached to the cloud service provider that stores your analytical data. Instead, you can easily trial models from all sorts of providers as they become available, giving you a first-mover advantage in leveraging the latest tools.
Thinking about data, how to access it, and how to get it to where it needs to be has always been a challenge, particularly across the operational/analytical divide. The advent of generative AI has added even more urgency to solving this age-old problem. At its heart is a simple principle: let your business services focus on their business use cases, and let the data communication layer provide data to all who need it through low-latency data streams. From that single set of data streams, you’ll be able to power your operational, analytical, and AI use cases.
