Real-time processing provides a notable advantage over batch processing: data becomes available to consumers faster. In a traditional ETL, you would not be able to analyze events from today until tomorrow's nightly jobs had finished. These days, many businesses rely on data being available within minutes, seconds, or even milliseconds. With streaming technologies, we no longer need to wait for scheduled batch jobs to see new data events. Live dashboards are updated automatically as new data comes in.

Despite all the benefits, real-time streaming adds a lot of additional complexity to the overall data processes, tooling, and even data format. Therefore, it's crucial to carefully weigh the pros and cons of switching to real-time data pipelines. In this article, we'll look at several options to reap the benefits of a real-time paradigm with the least amount of architectural changes and maintenance effort.

Traditional approach

When you hear about real-time data pipelines, you may immediately start thinking about Apache Kafka, Flink, Spark Streaming, and similar frameworks, which require a lot of knowledge to operate a distributed event streaming platform. Those open-source platforms are best suited to the following scenarios:
- when you need to continuously ingest and process reasonably large amounts of real-time data,
- when you anticipate multiple producers and consumers and want to decouple their communication,
- or when you want to own the underlying infrastructure, possibly on-prem (e.g. for compliance).

While many companies and services attempt to facilitate the management of the underlying distributed clusters, the architecture still remains fairly complex. Therefore, you need to consider:
- whether you have the resources to operate those clusters,
- how much data you plan to process by using this platform,
- whether the added complexity is worth the effort.

In the next sections, we'll look at alternative options if your real-time needs don't justify the added complexity and costs of a self-managed distributed streaming platform.

Amazon Kinesis

AWS realized the customers' difficulties in managing message-bus architectures a long time ago (2013). As a result, they came up with Kinesis, a family of services that attempt to make real-time analytics easier. By leveraging serverless Kinesis Data Streams, you can create a data stream with a few clicks in the AWS management console. Once you have configured your estimated throughput and the number of shards, you can start implementing data producers and consumers. Even though Kinesis is serverless, you still need to monitor the message size and the number of shards to ensure that you don't encounter any unexpected write throttles.

In my previous article, you can find an example of a Kinesis producer (source) sending data to a Kinesis data stream using a Python client, as well as a way to continuously send micro-batches of data records to S3 (consumer/destination) by leveraging a Kinesis Data Firehose delivery stream.

Alternatively, to consume data from a Kinesis Data Stream, we could:
- aggregate and analyze data with Kinesis Data Analytics,
- use Apache Flink to send this data into Amazon Timestream.

The main benefits of using Kinesis Data Streams, as compared to sending data directly to your desired application, are latency and decoupling. Kinesis allows you to store data within the stream for up to seven days and have multiple consumers that receive data at the same time. This means that if a new application needs to collect the same data, you can simply add a new consumer to the process. This new consumer would not affect other data consumers or producers thanks to the decoupling on the Kinesis architecture level.
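To make the producer side concrete, here is a minimal sketch of a Python producer built with boto3. The stream name (crypto-prices), the region, and the record shape are illustrative assumptions, not details from the referenced article:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A hypothetical event: the latest BTC price quote
record = {"symbol": "BTC", "price_usd": 34406.27}

# The partition key determines which shard receives the record;
# records sharing a key keep their relative order within a shard.
response = kinesis.put_record(
    StreamName="crypto-prices",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["symbol"],
)
print(response["ShardId"], response["SequenceNumber"])
```

Any number of consumers can then read the same records independently during the retention window, which is exactly what makes the decoupling described above possible.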
Amazon Timestream

As mentioned in the previous section, the major advantage of Kinesis is decoupling. If you don't need multiple applications that would regularly consume data from the stream, you can considerably streamline the process by using Amazon Timestream, a serverless time-series data store allowing you to analyze data in (near) real-time. The underlying architecture is smart enough to ingest data first into an in-memory store for fast retrieval of real-time data, and then it automatically moves "old" data to cheaper long-term storage according to the specified retention period.

Why would you use a time-series database for real-time data? Any new data record comes into the stream at a particular time. You may be tracking price changes over time, sensor measurements, logs, CPU utilization: practically any real-time streaming data is some sort of a time series. Therefore, it makes sense to consider using a time-series database such as Timestream. The simplicity of the service makes it very appealing, especially if you would like to use SQL to retrieve data for analytics.

When comparing the SQL interface of Timestream against the one available in Kinesis Data Analytics, Timestream is a clear winner. Kinesis SQL is quite obscure and introduces a lot of specific vocabulary. In contrast, Timestream provides an intuitive SQL interface with many useful built-in time-series functions, making time-based aggregation (e.g. minutely or hourly time buckets) much easier.

Side note: don't use semicolons at the end of your queries in Timestream. If you do, you'll get an error.

Demo: Real-time ingestion into Timestream using Python

To demonstrate how Timestream works, we'll be sending cryptocurrency price changes into a Timestream table. Let's start by creating a Timestream database and table. We can do all that either from the AWS management console or from the AWS CLI. Make sure that you use one of the regions in which Timestream is available.

Side note: the easiest way to find available regions for any AWS service is to check the pricing page: https://aws.amazon.com/timestream/pricing/.

After creating the database, we can create a table. You need to specify your in-memory and magnetic store retention periods.
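A minimal AWS CLI sketch of both steps could look as follows. The database name (crypto), table name (prices), and retention values are illustrative placeholders:

```bash
# Create the Timestream database (serverless, nothing to provision)
aws timestream-write create-database --database-name crypto

# Create the table; recent data lives in the fast in-memory store
# (hours), older data in the cheaper magnetic store (days)
aws timestream-write create-table \
    --database-name crypto \
    --table-name prices \
    --retention-properties \
    "MemoryStoreRetentionPeriodInHours=24,MagneticStoreRetentionPeriodInDays=365"
```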
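As a preview of the time-series functions praised above, once data is flowing into this table, a minutely price aggregation could look like the sketch below. It assumes the symbol dimension and price_usd measure that the ingestion sketch further below will write; bin() and ago() are built-in Timestream functions, and note the absence of a trailing semicolon:

```sql
-- Average price per symbol in one-minute buckets over the last hour
SELECT symbol,
       bin(time, 1m) AS minute,
       avg(measure_value::double) AS avg_price_usd
FROM "crypto"."prices"
WHERE time > ago(1h)
  AND measure_name = 'price_usd'
GROUP BY symbol, bin(time, 1m)
ORDER BY minute
```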
Our database and table are created. Now we can get the latest price data from the Cryptocompare API. This API provides many useful endpoints to get the latest information about the cryptocurrency market. We will focus on getting real-time price data for selected cryptocurrencies.

We'll get data in the following format:

{'BTC': {'USD': 34406.27}, 'DASH': {'USD': 178.1}, 'ETH': {'USD': 2263.64}, 'REP': {'USD': 26.6}}

Additionally, we need to convert this data to the proper Timestream format with a time column, measures, and dimensions. Here is the full script that we can use to ingest new data every 10 seconds: https://gist.github.com/d00b8173d7dbaba08ba785d1cdb880c8.
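The gist linked above contains the full script; a condensed sketch of the same idea could look like this. It assumes the crypto database and prices table from the CLI sketch earlier, and infers the Cryptocompare pricemulti parameters from the sample output above:

```python
import time

import boto3
import requests

SYMBOLS = ["BTC", "DASH", "ETH", "REP"]
API_URL = "https://min-api.cryptocompare.com/data/pricemulti"

timestream = boto3.client("timestream-write")


def fetch_prices() -> dict:
    # Returns e.g. {'BTC': {'USD': 34406.27}, 'ETH': {'USD': 2263.64}, ...}
    response = requests.get(
        API_URL, params={"fsyms": ",".join(SYMBOLS), "tsyms": "USD"}
    )
    response.raise_for_status()
    return response.json()


def build_records(prices: dict) -> list:
    # Each Timestream record needs a time column, dimensions
    # (metadata such as the coin symbol), and a measure (the price)
    now_ms = str(int(time.time() * 1000))
    return [
        {
            "Time": now_ms,
            "TimeUnit": "MILLISECONDS",
            "Dimensions": [{"Name": "symbol", "Value": symbol}],
            "MeasureName": "price_usd",
            "MeasureValue": str(quotes["USD"]),
            "MeasureValueType": "DOUBLE",
        }
        for symbol, quotes in prices.items()
    ]


# Ingest new data roughly every 10 seconds
while True:
    records = build_records(fetch_prices())
    timestream.write_records(
        DatabaseName="crypto", TableName="prices", Records=records
    )
    time.sleep(10)
```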
That's it! The most time-consuming part is the definition of your dimensions and measures (lines 21–44 of the gist). You should be careful about the design of your measures and dimensions: with Timestream, you can only query data from a single table. No JOINs between tables are allowed. Therefore, it's important to think ahead about your access patterns before you start ingesting data into Timestream.

Here is how the data looks in the end. Note that the ingestion time is presented in UTC:

AWS Timestream: exploring the results in the query console (image by author)

We could now easily connect Timestream to Grafana for near real-time visualization. But that's a story for another article.

Never-ending script

In the Timestream example above, we used a never-ending loop defined using while True, running in a single process. This is a common approach for a simple service that ingests data all the time, typically running as a background process or as a service in a container orchestration platform.

Minutely scheduled jobs

An alternative to a continuously running script is a service that is scheduled to run every minute. The benefit of this approach is that it allows you to treat this near real-time process as a batch job, which simplifies your architecture. You can think of it as a reversed Kappa architecture: while Kappa processes batch data in the same way as real-time data (a streaming-first approach), this approach "batchifies" real-time data streams into micro-batches (a batch-first approach).

Instead of while True, we still ingest data roughly every 10 seconds, but the actual process is executed once per minute. This allows us to track which runs were successful, and it no longer depends on the health of a single never-ending job. There is no "right" or "wrong" approach. The main purpose of this method is to treat near real-time ingestion as a batch job. Here is a full Gist: https://gist.github.com/d953cdbc6edbf8b224815cc5d8b53f73.

Which option should you choose?

The following questions may help you to make the right decision for your use case:
- Which problem(s) do you want to solve using real-time streaming: is it anomaly detection, alerting, product recommendation, a dynamic pricing algorithm, tracking current market prices, understanding user behavior? Having a specific use case in mind can help you determine the right tool for the job, especially because there are a lot of specialized tools on the market.
- Which latency is acceptable in your use case? Is it OK if your data is available for analytics one minute after the event or stream has been received? Or, on the contrary, do you need millisecond latency because otherwise the data will no longer be actionable?
- How many resources (employees and budget) do you have to keep your platform operational? Does it make sense to spin up your own Kafka cluster, use a managed service, or could a serverless option such as Amazon Kinesis or Amazon Timestream address your needs?
- How do you plan to monitor and observe the health of your data streams?
- How much training would be needed to teach your team how to use this specific platform?
- Which data sources would need to be ingested in real-time, i.e. who are the data producers?
- What is the target datastore (data lake, data warehouse, specific database) from which you would want to retrieve this data, i.e. who are the data consumers? And how do you want to retrieve this data: via SQL, Python, or perhaps only via analytical dashboards?
- In which way (architecture-wise) would you want to process this data? Is Kappa, Lambda, or another architecture worth considering to distinguish between real-time and batch?

Conclusion

Ultimately, it depends on the problem that you are trying to solve with real-time processing technologies, the scale of that problem, and the available resources. In many scenarios, a simple minutely batch job may be sufficient. It lets you keep a single architecture for all data processing needs, with data available within a few minutes or even seconds after its generation. For other scenarios, Kinesis Data Streams or Amazon Timestream may provide simple yet effective means to add (near) real-time capabilities with very little maintenance effort. Lastly, if you do have employees who know how to operate Kafka, Flink, or Spark Streaming, those platforms can be helpful if you want to own your infrastructure and not be reliant on cloud providers.

As always, thinking about the problem at hand will help assess the trade-offs and make the best decision for your use case.

Also published on: https://dashbird.io/blog/real-time-processing-analytical/