Unbounded data refers to continuous, never-ending data streams with no defined beginning or end. The data becomes available over time, and anyone who wishes to act on it can do so without downloading it first.

As Martin Kleppmann states in his famous book, unbounded data will never be “complete” in any meaningful way.

"In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way." (Martin Kleppmann, Designing Data-Intensive Applications)

Processing unbounded data requires an entirely different approach from its counterpart, batch processing. This article summarises the value of unbounded data and how you can build systems to harness the power of real-time data.

Bounded and unbounded data

Unlike its “bounded” counterpart, batch data, unbounded data has no boundaries defined in terms of time. The data may have started arriving in the past, continues to arrive today, and is expected to keep arriving in the future.

Events, messages, and streams

Streaming data, real-time data, and event streams are similar terms for unbounded data.

A stream of events consists of a sequence of immutable records that carry information about state changes that happened in a system. These records are called "events" rather than "messages" because they merely report what just happened. Conversely, "messages" are a way of transferring control or telling someone what to do next. I’ve written about this difference sometime back.

A continuous flow of related events makes an event stream. Each event in a stream has a timestamp.

Event sources continuously generate streams of events and transmit them over a network. Hardware sensors, servers, mobile devices, applications, web browsers, and microservices are examples of event sources.

Benefits of real-time data and practical examples

Today’s organizations are abandoning batch processing systems and moving towards real-time processing systems. Their main goal is to gain business insights on time and act upon them without delay.

Batch processing is better at providing the most accurate analysis of data. But it takes a long time to deliver results and requires careful planning and a complex architecture. In a world full of competition, businesses tend to trade accuracy for quick but reliable information. That is where real-time processing systems start to shine.

The following are some practical examples that demonstrate the power of real-time event processing:
- Real-time customer 360 view implementations
- Recommendation engines
- Fraud/anomaly detection
- Predictive maintenance using IoT
- Streaming ETL systems

Making sense of streaming data

Moving data alone is of little value to an organization. It has to be ingested and processed to gain value from it.

The best analogy I can provide is a small stream (a real stream where water flows). If you consider the stream in isolation, it is not that useful except for supporting the life forms that depend on it. But when multiple streams are combined, they carry real power. When stored in a reservoir, they can even generate electricity to power an entire village.

Photo by Cédric Dhaenens on Unsplash

Likewise, event streams flowing into an organization have to undergo several stages before they become useful. Usually, an architecture that processes streams employs two layers to facilitate that: the ingestion layer and the processing layer.

Event Ingestion And Processing Is Challenging

The ingestion layer needs to support fast, inexpensive reads and writes of large streams of data. The ordering of incoming data and strong consistency are concerns as well.
The processing layer is responsible for consuming data from the ingestion layer, running computations on that data, and then notifying the storage layer to delete data that is no longer needed. The processing also has to be done on time, so that organizations can act on the data quickly.

An organization planning to build a real-time data processing system should plan for scalability, data durability, and fault tolerance in both the ingestion and processing layers.

A Reference Architecture For Stream Processing

A real-time processing architecture should have the following logical components to address the challenges mentioned above.

1. The Event Sources

Event sources can be almost any source of data: system or weblog data, social network data, financial trading information, geospatial data, mobile app data, or telemetry from connected IoT devices. The data can be structured or unstructured.

2. The Ingestion System

The ingestion system captures and stores real-time messages until they are processed. A distributed file system can be a good fit here. But a messaging system would do a better job, as it acts as a buffer for the messages and supports reliable delivery of messages to consumers.

The stored messages can be processed by many consumers, multiple times. Hence, the ingestion system must support message replay and non-destructive message consumption. That goes beyond the capabilities of traditional messaging systems and opens the door for a distributed log.

Distributed, append-only logs provide the foundation for the architecture of ingestion systems. Such a system appends received messages to a distributed log while preserving their order of arrival. The log is partitioned across multiple machines to support scalable message consumption and fault tolerance.

The following figure illustrates the structure of a Kafka topic.

[Figure: structure of a Kafka topic]

The architecture and behavior of an ingestion system go beyond the scope of this post. Hence, I’ll save them for a separate article.
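To make the idea of a partitioned, append-only log a little more concrete, here is a minimal in-memory sketch in Python. It is not how Kafka or any other product is implemented. The class name `PartitionedLog`, its methods, and the choice of hashing the message key to pick a partition are all assumptions made for illustration, and the sketch ignores replication, durability, and networking entirely. It only shows two properties described above: messages keep their order of arrival within a partition, and reading does not remove anything, so any consumer can replay from any offset.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class PartitionedLog:
    """Toy, single-process stand-in for a distributed append-only log (hypothetical)."""

    num_partitions: int = 3
    partitions: list = field(init=False)

    def __post_init__(self):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, key: str, value: str) -> tuple:
        """Append a record and return (partition, offset)."""
        # Assumption: records with the same key go to the same partition,
        # so the order of arrival is preserved per key.
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0) -> list:
        """Return records from `offset` onwards without deleting them."""
        return list(self.partitions[partition][offset:])


log = PartitionedLog()
partition, first_offset = log.append("sensor-1", "temp=20")
_, second_offset = log.append("sensor-1", "temp=21")

first_pass = log.read(partition)                   # a consumer reads everything
replay = log.read(partition)                       # a second read sees the same records
tail = log.read(partition, offset=second_offset)   # another consumer starts later
```

Because `read` never mutates the log, each consumer only has to remember its own offset, which is what makes replay and multiple independent consumers cheap in this style of design.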
Technology choices: Apache Kafka, AWS Kinesis Streams, Azure Event Hubs, and Azure IoT Hub.

3. The Stream Processing System

After ingestion, the messages go through one or more stream processors that can route the data or perform analytics and other processing.

A stream processor can run perpetual queries against an unbounded stream of data. These queries consume streams of data from the ingestion system and analyze them in real time to detect anomalies, recognize patterns over rolling time windows, or trigger alerts when a specific condition occurs in the stream.

After processing, the stream processor writes the result to event sinks such as storage systems and databases, or directly to real-time dashboards.

Technology choices: Apache Flink, Apache Storm, Apache Spark Streaming, Apache Samza, Kafka Streams, Siddhi, AWS Kinesis Analytics, and Azure Stream Analytics.

4. The Cold Storage

In some cases, the stream processor writes the ingested messages into cold storage for archiving or batch analysis. This data serves as the source of truth for the organization and is not expected to be read frequently. Later, batch analytics systems can process it to produce new reports and views required by the organization.

Cold storage systems can take the form of object storage or more sophisticated data lakes.

Technology choices: Amazon S3, Azure Blob Containers, Azure Data Lake Store, and HDFS.

5. The Analytical Datastore

Processed real-time data is transformed into a structured format and stored so that analytical tools can query it. This data store is often called the serving layer, and it feeds multiple applications that require analytics.

The serving layer can be a relational database, a NoSQL database, or a store distributed over a file system. Data warehouses are a common choice for this role in many organizations. However, the serving layer requires strong support for random reads with low latency.
Sometimes, batch processing systems use this store as their final destination. Hence, the serving layer should also support random writes, as the loading of processed data can introduce delays.

Technology choices: Amazon Redshift, Azure Synapse Analytics, Apache Hive, Apache Spark, and Apache HBase.

6. Monitoring and Notification

The stream processor can trigger alerts on different systems when it detects anomalies or other specific conditions. Publishing an event to a pub/sub topic, triggering an event-driven workflow, and invoking a serverless function are a few examples. The ultimate goal is to notify the relevant party that has the authority to act upon the alert.

Technology choices: pub/sub messaging systems and AWS Lambda.

7. Reporting and Visualization

Reporting applications and dashboards use the processed data in the analytical store for historical reporting and visualizations. Additionally, the stream processor can directly update real-time reports and dashboards if a low-latency answer is needed.

Technology choices: Microsoft Power BI, Amazon QuickSight, Kibana, and reactive web and mobile applications.

8. Machine Learning

Real-time data can be passed through machine learning systems to train and score ML models in real time.

Conclusion

Real-time processing can help organizations get actionable insights on time. A real-time processing architecture should cater to high volumes of streaming data, scale-out processing, and fault tolerance.

But real-time processing can fall short when it comes to accuracy. Often, real-time systems trade accuracy for low latency. For a better level of accuracy, real-time and batch processing need to be combined.

Also published at https://medium.com/event-driven-utopia/making-sense-of-unbounded-data-4abdfa0edad2

Cover Photo by Mathew Schwartz on Unsplash

Making Sense of Unbounded Data & Real-Time Processing Systems
