Photo by Atanas Malamov on Unsplash

The amount of data produced in this world has exploded. Whether it’s from the Internet of Things (IoT), social media, app metrics, or device analytics, the amount of data that can be processed and analyzed at any given moment is staggering. Big data is on the rise, and data systems are tasked with handling it. But this raises the question: Are these systems up for the task?

The Challenges of Big Data

Data systems are most useful when they correctly answer the right questions about all the data they contain. With big data, however, new data comes into the system every minute, and there’s a lot of it. The data system faces two fundamental challenges.

Challenge #1: The Latency Problem. “I can get super-precise answers to your questions based on all the data streamed from the beginning of my existence until this very moment, but that will take me a while. I hope you don’t mind waiting.”

Challenge #2: The Accuracy Problem. “I’ve finally got my super-precise answers for you. Sadly, they took me so long to acquire, they’re no longer up-to-date and accurate.”

Introducing Lambda Architecture

Lambda Architecture is a big data paradigm, a way of structuring a data system to overcome the latency problem and the accuracy problem. It was first introduced by Nathan Marz and James Warren in 2015.

Lambda Architecture structures the system in three layers: a batch layer, a speed layer, and a serving layer. We will talk about these in detail shortly.

Lambda Architecture is horizontally scalable. This means that if your data set becomes too large or the data views you need are too numerous, all you need to do is add more machines. It also confines the most complex part of the system to the speed layer, where the outputs are temporary and can be discarded (every few hours) if ever there is a need for refinement or correction.

How Does Lambda Architecture Work?

As mentioned above, Lambda Architecture is made up of three layers.
The first layer, the batch layer, stores the entire data set and computes batch views. The stored data set is immutable and append-only. New data is continually streamed in and appended to the data set, but old data will always remain unchanged. The batch layer also computes batch views, which are queries or functions on the entire data set. These views can subsequently be queried for low-latency answers to questions of the entire data set. The drawback, however, is that it takes a lot of time to compute these batch views.

The second layer in Lambda Architecture is the serving layer. The serving layer loads in the batch views and, much like a traditional database, allows for read-only querying on those batch views, providing low-latency responses. As soon as the batch layer has a new set of batch views ready, the serving layer swaps out the now-obsolete set of batch views for the current set.

The third layer is the speed layer. The data that streams into the batch layer also streams into the speed layer. The difference is that while the batch layer keeps all of the data since the beginning of its time, the speed layer only cares about the data that has arrived since the last set of batch views completed. The speed layer makes up for the high latency in computing batch views by processing queries on the most recent data that the batch views have yet to take into account.

The Three Layers: An Analogy

Consider the elderly man who lives alone in a giant mansion. Every room of his mansion has a clock, but except for the clock in the kitchen, all of the clocks are wrong. The man decides one day that he will set all of the clocks, using the kitchen clock as the correct time. But his memory is poor, so he writes down the current time on the kitchen clock (9:04 AM) on a piece of paper. He begins his slow walk around the mansion, setting all of his clocks to 9:04 AM. What happens?
By the time he gets to the very last clock in the guest bedroom of the east wing, it is 9:51 AM. He sets this clock too, as he has done with all of the other clocks, to the time written on his paper: 9:04 AM. No wonder all of his clocks are wrong!

This is the problem we would experience if a data system only had a batch layer. The answers to the questions we’re asking would no longer be up-to-date because it took so long to get those answers.

Fortunately, the man remembers that he has an old runner’s stopwatch. The next day, he starts again in the kitchen at 9:04 AM. He writes down the time on a piece of paper. He starts his stopwatch, his speed layer, and begins walking around his mansion. Now, when he gets to the very last clock in the east wing, he sees the paper with 9:04 AM written, and he sees his stopwatch says “47 minutes and 16 seconds”. With a little bit of math, he knows to set this last clock to 9:51 AM.

In this analogy, which, admittedly, is not perfect, the man is the serving layer. He takes the batch view (the paper with “9:04 AM” written on it) around his mansion to answer “What time is it?” But he also does the additional work of reconciling the batch view with the speed layer to get the most accurate answer possible.

Why Use Lambda Architecture?

In Marz and Warren’s seminal book on Lambda Architecture, Big Data, they list eight desirable properties in a big data system, describing how Lambda Architecture satisfies each one:

Robustness and fault tolerance. Because the batch layer is designed to be append-only, containing the entire data set since the beginning of time, the system is human-fault tolerant. If there is any data corruption, then all of the data from the point of corruption forward can be deleted and replaced with correct data. Batch views can be swapped out for completely recomputed ones. The speed layer can be discarded.
In the time it takes to generate a new set of batch views, the entire system can be reset and running again.

Scalability. Lambda Architecture is designed with layers built as distributed systems. By simply adding more machines, end users can easily horizontally scale those systems.

Generalization. Since Lambda Architecture is a general paradigm, adopters aren’t locked into a specific way of computing this or that batch view. Batch views and speed layer computations can be designed to meet the specific needs of the data system.

Extensibility. As new types of data enter the data system, new views will become necessary. Data systems are not locked into certain kinds or a certain number of batch views. New views can be coded and added to the system, with the only constraint being resources, which are easily scalable.

Ad hoc queries. If necessary, the batch layer can support ad hoc queries that were not available in the batch views. Assuming the high latency of these ad hoc queries is permissible, the batch layer’s usefulness is not restricted to the batch views it generates.

Minimal maintenance. Lambda Architecture, in its typical incarnation, uses Apache Hadoop for the batch layer and ElephantDB for the serving layer. Both are fairly simple to maintain.

Debuggability. The inputs to the batch layer’s computation of batch views are always the same: the entire data set. In contrast to debugging views computed on a snapshot of a stream of data, the inputs and outputs for each layer in Lambda Architecture are not moving targets, vastly simplifying the debugging of computations and queries.

Low latency reads and updates. In Lambda Architecture, the last property of a big data system is fulfilled by the speed layer, which offers real-time queries of the latest data set.

The Disadvantages of Lambda Architecture

While the advantages of Lambda Architecture seem numerous and straightforward, there are some disadvantages to keep in mind.
First and foremost, cost will become a consideration. While how to scale is not very complex (just add more machines), the batch layer will necessarily need to expand over time. Since all data is append-only and no data in the batch layer is discarded, the cost of scaling will necessarily grow with time.

Others have noted the challenge of maintaining two separate sets of code to compute views for the batch layer and the speed layer. Both layers operate on the same set (or, in the case of the speed layer, subset) of data, and the questions asked of both layers are similar. However, because the two layers are built on completely different systems (for example, Hadoop or Snowflake for the batch layer, but Storm or Spark for the speed layer), code maintenance for two separate systems can be complicated.

Lambda Architecture in Machine Learning

In the field of machine learning, there’s no doubt that more data is better. For machine learning to apply algorithms or detect patterns, however, it needs to receive its data in a way that makes sense.

Rather than receiving data from different directions without any semblance of structure, machine learning can benefit by processing data through a Lambda Architecture data system first. From there, machine learning algorithms can ask questions and begin to make sense of the data that enters the system.

Lambda Architecture for IoT

While machine learning might be on the output side of a Lambda Architecture, IoT might very well be on the input side of the data system. Imagine a city of millions of automobiles, each one equipped with sensors to send data on weather, air quality, traffic, location information, driving habits, and so on. This is the massive stream of data that would be fed into the batch layer and speed layer of a Lambda Architecture. IoT devices are a perfect example of providing the data in big data.
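
To make the flow of sensor events through the three layers more concrete, here is a minimal in-memory sketch in Python. It is an illustration under simplifying assumptions, not how a production system would be built: the `Event` shape, the class names, and the per-city event count used as the "view" are all hypothetical, and the batch view is computed synchronously so that no events arrive while it is being built.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable sensor reading (hypothetical shape)."""
    car_id: str
    city: str


class BatchLayer:
    """Keeps the full append-only data set and recomputes views from scratch."""

    def __init__(self):
        self.master = []  # append-only: old events are never modified

    def append(self, event):
        self.master.append(event)

    def compute_view(self):
        # In a real system this is the slow step: a function of ALL the data.
        return Counter(e.city for e in self.master)


class SpeedLayer:
    """Updates a running count per event; the events themselves are not kept."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event.city] += 1

    def reset(self):
        self.counts.clear()


class ServingLayer:
    """Answers queries by reconciling the batch view with the speed layer."""

    def __init__(self, speed):
        self.batch_view = Counter()
        self.speed = speed

    def swap_in(self, batch_view):
        # A fresh batch view covers everything seen so far (an assumption of
        # this sketch), so the speed layer's temporary output is discarded.
        self.batch_view = batch_view
        self.speed.reset()

    def query(self, city):
        return self.batch_view[city] + self.speed.counts[city]


batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(speed)


def ingest(event):
    """Every incoming event goes to both the batch layer and the speed layer."""
    batch.append(event)
    speed.ingest(event)


for car_id, city in [("car-1", "Sofia"), ("car-2", "Sofia"), ("car-3", "Plovdiv")]:
    ingest(Event(car_id, city))

serving.swap_in(batch.compute_view())  # a batch run completes
ingest(Event("car-4", "Sofia"))        # arrives after the batch run

print(serving.query("Sofia"))  # 3: two from the batch view, one from the speed layer
```

The key design point mirrored here is that the speed layer's state is disposable: each time a new batch view is swapped in, the recent counts are thrown away because the batch view now accounts for them.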

Stream Processing and Lambda Architecture Challenges

We noted above that “the speed layer only cares about the data that has arrived since the completion of the last set of batch views.” While this is true, it’s important to make a clarification: that small subset of data is not stored; it is processed immediately as it streams in, and then it is discarded. The speed layer is also referred to as the “stream-processing layer.” Remember that the goal of the speed layer is to provide low-latency, real-time views of the most recent data, the data that the batch views have yet to take into account.

On this point, the original authors of the Lambda Architecture refer to “eventual accuracy,” noting that the batch layer strives for exact computation while the speed layer strives for approximate computation. The approximate computation of the speed layer will eventually be replaced by the next set of batch views, moving the system towards “eventual accuracy.”

Processing the stream in real time in order to produce views that are constantly updated as new data streams in (on the order of milliseconds) is an incredibly complex task. Partnering a document-based database with an indexing and querying system is often recommended in these cases.

Differences Between Lambda Architecture and Kappa Architecture

We noted above that a considerable disadvantage of Lambda Architecture is maintaining two separate code bases to handle similar processing, since the batch layer and the speed layer are different distributed systems. Kappa Architecture seeks to address this concern by removing the batch layer altogether.

Instead, both the real-time views computed from recent data and the batch views computed from all data are performed within a single stream processing layer. The entire data set, the append-only log of immutable data, streams through the system quickly in order to produce the views with the exact computations.
Meanwhile, the original “speed layer” tasks from Lambda Architecture are retained in Kappa Architecture, still providing the low-latency views with approximate computations.

This difference in Kappa Architecture allows for the maintaining of a single system for generating views, which simplifies the system’s code base considerably.

Lambda Architecture via Containers on Heroku

Coordinating and deploying the various tools needed to support a Lambda Architecture, especially when you’re in the starting up and experimenting stage, is accomplished easily with Docker.

Heroku serves well as a container-based cloud platform-as-a-service (PaaS), allowing you to deploy and scale your applications with ease. For the batch layer, you would likely deploy a Docker container for Apache Hadoop. As the speed layer, you might consider deploying Apache Storm or Apache Spark. Lastly, for the serving layer, you could deploy Docker containers for Apache Cassandra or MongoDB, coupled with indexing and querying by Elasticsearch.

Conclusion

Taking on the task of big data is not for the faint of heart. Scaling and system robustness are huge challenges, so paradigms like Lambda Architecture bring excellent guidance. As massive amounts of data stream into the data system, the batch layer provides high-latency accuracy, while the speed layer provides a low-latency approximation. Meanwhile, the serving layer responds to queries by reconciling these two views to provide the best possible response.

Implementing a Lambda Architecture data system is not trivial. While perhaps complex and initially intimidating, the tools are available and ready to be deployed.

Previously published here.
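
As a closing illustration of the Kappa comparison discussed earlier, the following Python sketch shows the core idea of keeping a single piece of view logic. The names (`update_view`, `replay`, the dictionary-shaped events) are hypothetical and the log is just an in-memory list; the point is only that the same function serves both a full replay of the append-only log and the processing of a newly arrived event.

```python
from collections import Counter


def update_view(view, event):
    """The one and only piece of view logic: count events per city."""
    view[event["city"]] += 1
    return view


def replay(log):
    """Rebuild a view from scratch by streaming the whole log through update_view."""
    view = Counter()
    for event in log:
        update_view(view, event)
    return view


# Append-only log of immutable events (in-memory stand-in for a real log).
log = [{"city": "Sofia"}, {"city": "Plovdiv"}]

# A view built by replaying everything recorded so far.
live_view = replay(log)

# A new event arrives: it is appended to the log and applied incrementally
# with the very same function, so there is no second code base to maintain.
new_event = {"city": "Sofia"}
log.append(new_event)
update_view(live_view, new_event)

# Replaying the full log yields the same view as the incremental path.
assert live_view == replay(log)
```

In a Lambda Architecture, by contrast, the replay logic and the incremental logic would typically live in two different systems and be written twice.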

I write about technology michael.bogan@gmail.com


Lambda Architecture: A Comprehensive Introduction and Breakdown
