In this part I will be talking about the batch layer of the Lambda Architecture.
Batch layer is computed by applying a function to the whole historical dataset, to answer some high level questions which cannot be answered by either speed layer or serving layer.
The computations typically take hours or days to run, and the results are stored usually in a distributed file system (although this is not a requirement). For example, the queries that might need to be answered would range from the beginning of the dataset to now, or in our case, till date how many cabs have served how many passengers, or what is the total distance driven by all the cabs.
In this article I will try to answer questions like these based on the dataset that I have. The code for the article can be found here.
Batch layer contains master copy of the dataset and precomputes batch view on that dataset. It needs to be able to do two things, to be able to store a growing, immutable copy of the dataset, and compute arbitrary functions on that dataset, using a batch processing system like Hadoop or Spark. It can be represented as:
batch view= function (all data)
Batch layer, like serving layer, satisfies requirements of big data systems, like:
Batch layer precomputes master dataset into batch views so that queries can be resolved with low latency. These batch views are loaded into serving layer. These precomputations over master dataset take time, so it should be taken as an opportunity to delve deeper into what kind of views are being computed.
Since master data is immutable and ever growing, two types of algorithms come into the picture when taking into account new data, whether to use a recomputation algorithm to generate views for the whole new dataset, or use an incremental algorithm to update views on the fly.
For example, to compute total number of records in the master dataset, you can either use a recomputation alogrithm to determine updated count, or use an incremental algorithm to determine new rows, and add it to the old count.
The thing to keep in mind is that incremental algorithm uses less resources, and is faster, whereas recomputation algorithm is more resource extensive, and slower.
In terms of fault tolerance, recomputation algorithm might be a better option. If you make a mistake while running a recomputation algorithm, all you have to do is fix the mistake and run the algorithm again. It consumes more resources, but the fix is simple.
But with incremental algorithm, if you make that mistake, you’d have to find the records that have been affected by that mistake, and go back and fix those errors. It might consume less resources, but is extremely time consuming.
Incremental algorithms are usually tailor made for specific use cases, and mostly, recomputation algorithms are preferred. With these things in mind, I will try to answer some questions that will require computation on the batch dataset that i have, in the form of visualizations. I will not be storing it in a database, just flat files for the purpose of visualization.
The dataset has been obtained from academictorrents, and consists of trips made by taxis in New York, including the medallion number, pickup and dropoff locations, distance travelled, and time of the trip taken in seconds. It contains several more columns, but only these are relevant to us at the moment. An example is shown below:
medallion pickup_datetime dropoff_datetime trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
740BD5BE61840BE4FE3905CC3EBE3E7E, 2013–10–01 12:44:29, 2013–10–01 12:53:26, 536, 1.2, -73.974319, 40.741859, -73.99115, 40.742424
Based on this data, the goal of the task at hand is to determine:
Tech stack that I have used is Spark for aggregation, Python for data manipulation and plotting. Since we are working with batch data there was no need for a streaming engine, and since we are plotting data from flat files there was no need for a NoSQL database, although we could have used it.
Based on the data that I had, the results are as following:
a) Comparison of months based on distance driven
As we can see, August saw the most miles driven among other months. This can be attributed to the fact that a lot of students come to join universities in the month of September, and that is why the distance driven is so high. But in my opinion this might be an anomaly because the distance driven is unusually high (20 times higher).
b) Comparison of months based on trip time
As we can see, August again tops the list. As distance driven was the highest in August, total time spent driving is also highest in August. But it is not unusually high as compared to distance driven, confirming the suspicion that the distance plot might have been an anomaly.
c) Comparison of months based on trips taken
Number of trips each month are relatively equal for all months, meaning the number of trips made by passengers are almost the same, but the distance and time was greater in the month of August.
d) Comparison of Months based on Distance: HeatMap
This is a heatmap of total distance driven for all the months, and we can see that August has the highest number of distance driven,
e) Comparison of Months based on time spent driving: HeatMap
When it comes to time spent driving for each month, the values are almost the same for all months.
e) For any month, top 5 rows based on distance driven for each driver
Medallion, Distance Driven, Rank
06EAD4C8D98202F1E2D7057F2899CFE5, 9.90, 13, 1
06EAD4C8D98202F1E2D7057F2899CFE5, 9.80, 11, 2
06EAD4C8D98202F1E2D7057F2899CFE5, 9.70, 16, 3
06EAD4C8D98202F1E2D7057F2899CFE5, 9.60, 11, 4
06EAD4C8D98202F1E2D7057F2899CFE5, 9.50, 17, 5
0F621E366CFE63044BFED29EA126CDB9, 9.99, 1, 1
0F621E366CFE63044BFED29EA126CDB9, 9.95, 1, 2
0F621E366CFE63044BFED29EA126CDB9, 9.94, 1, 3
0F621E366CFE63044BFED29EA126CDB9, 9.91, 2, 4
0F621E366CFE63044BFED29EA126CDB9, 9.90, 1, 5
I used a rank function to calculate the top 5 trips made by each driver. It is shown below:
Rank function calculates the rank of a particular column within a particular window. In our case, the window is medallion number and total distance driven, and rank function assigns rank based on each value of distance driven.
f) Total distance driven is 210 million miles by all drivers
g) Total time spent driving is 39000 hours
h) Total trips made by all drivers is 98 million
In this article we went through what a Batch Layer is, and what is it’s purpose in the Lambda Architecture. Batch Layer keeps an ever growing master dataset, that is immutable. It allows for views to be computed on either the whole dataset or the subset of the dataset.
There are two main type of algorithms that batch layer implements, recomputation and incremental algorithms. The output of batch layer can either be flat files, or it can be saved in NoSQL database.
Then we went through a real world study of how batch layer functions. We took a taxi dataset, and determined the total distance driven, time spent driving and total trips made. We also made a comparison of driving statistics between each month.
Read about Speed layer here
https://hackernoon.com/lambda-architecture-speed-layer-real-time-visualization-for-taxi-part-1-a31931tr
And about Serving layer here
https://hackernoon.com/lambda-architecture-speed-layer-real-time-visualization-for-taxi-part-2-8i1p31q7
References: Big Data