Lambda Architecture Speed Layer: Real-Time Visualization For Taxi [Part 1]

Written by srivassid | Published 2020/12/18


Lambda architecture has three components: a) the speed layer, which handles streaming or real-time data; b) the serving layer, the database layer, which is derived by aggregating data from the speed layer; and c) the batch layer, the set of computations performed on large datasets, typically stored in a distributed file system. In this post I will be talking about how to implement the speed layer by visualizing real-time taxi data. The visualization will then allow us to make some real-time business decisions. Code for this article can be found here.

Background on Speed Layer

The three views (speed layer, batch layer, and serving layer) answer different questions, and at different scales. The data is processed three times, changing its shape and volume along the way, but these three views answer very different queries that would not be possible from a single point of view.
The serving layer updates only when a batch computation has finished, which means the only data not available in the system belongs to the time frame in which that computation was running. The speed layer exists to compensate for those few minutes or hours of missing data. The difference is that the batch layer takes the whole dataset into account, while the speed layer looks only at recent data, which means it works with incremental updates. This can be summarized as
realtime view = function (realtime view, new data)
The speed layer solves mainly one problem: low-latency updates. Its data can be discarded once it has been processed by the pipeline, and since the real-time dataset within a window is relatively small (compared to batch data), the processing power required is low as well. The size of the window differs from application to application. The two main points to consider when dealing with the speed layer are storing the real-time views and processing the incoming data stream to update those views; a minimal sketch of such an incremental update follows the list below. The speed layer should support:
  • a) Fast random reads, to answer queries quickly.
  • b) Fast random writes, to update or append new data.
  • c) Scalability, because there is a very real chance the incoming data volume will grow.
  • d) Fault tolerance: data and processing logic should be replicated across machines in case one fails.
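
As an illustration of the formula above, here is a minimal Python sketch of an incremental realtime view: a running count of dropoffs per zone, updated one event at a time rather than recomputed from the full dataset. The zone names and events are hypothetical.

```python
from collections import defaultdict

# The realtime view: a running count of dropoffs per zone.
realtime_view = defaultdict(int)

def update_view(view, event):
    """realtime view = function(realtime view, new data)"""
    view[event["zone"]] += 1
    return view

# Hypothetical incoming events.
for event in [{"zone": "Manhattan"}, {"zone": "JFK"}, {"zone": "Manhattan"}]:
    realtime_view = update_view(realtime_view, event)

print(dict(realtime_view))  # {'Manhattan': 2, 'JFK': 1}
```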

CAP Theorem

An additional challenge arises when working with the speed layer, and that is the CAP theorem. It states:
“When a distributed data system is partitioned, it can be consistent or available, but not both”.
Let us see what that means. Take a dataset stored in HDFS, for example. It would be stored on one of the nodes. It would be available, but not consistent, as not every node knows where the data is; only the namenode would.
If you take a partitioned NoSQL database, data might be stored across partitions, and it might eventually become consistent and available, but not during the partition itself (at which time it might be required to answer business questions, as we will see in this post).
To sum up, the speed layer involves real-time data, typically acquired within seconds of the event happening. That data has to be processed in real time (which means a fast processing requirement), has to be stored relatively quickly, and can be discarded after it has answered the questions it was meant to answer, because the same data will be available in the serving or batch layer.

Dataset and Goals

The dataset was obtained from academictorrents and consists of trips made by taxis in New York, including the medallion number, pickup and dropoff locations, distance travelled, and the trip time in seconds. It contains several more columns, but only these are relevant to us at the moment. An example is shown below.
medallion: 740BD5BE61840BE4FE3905CC3EBE3E7E
pickup_datetime: 2013-10-01 12:44:29
dropoff_datetime: 2013-10-01 12:53:26
trip_time_in_secs: 536
trip_distance: 1.2
pickup_longitude: -73.974319
pickup_latitude: 40.741859
dropoff_longitude: -73.99115
dropoff_latitude: 40.742424
Based on this data, the goal of the task at hand is
Visualize the pickup / dropoff locations to determine “hot zones” 
This would allow us to make certain decisions, such as
  • a) Places where empty taxis can be routed to
  • b) Based on the locations of the "hot zones", the under-used routes in the city to which traffic can be redirected in order to balance itself

Tech Stack used

The tech stack used is Kafka, Python, Elasticsearch and Kibana. Although we have a static file, I simulated a streaming dataset using Kafka. That data is then indexed in Elasticsearch and finally visualized in Kibana. For the purpose of demonstration, streaming the whole dataset is not required; a couple of thousand rows is enough.

Implementation

Produce the data using Kafka

To start with, I read data from the file and stream it to a topic to which our consumer can subscribe. The dataset originally carries timestamps from 2013, but to view it in real time I replace them with current ones: dropoff_time becomes the current time, and pickup_time becomes dropoff_time minus trip_time_in_secs. I also pause the program for one second after every 15 rows produced; since I am attaching the current timestamp to each row, without that break there would be too much data within a small time frame, and it would be hard to visualize.
A sketch of the code follows.
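This is a minimal sketch rather than the repository's exact code: it assumes the kafka-python client, a broker on localhost:9092, and hypothetical names for the input file (trip_data.csv) and the topic (taxi_trips).

```python
import json
import time
from datetime import datetime, timedelta

import pandas as pd
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

df = pd.read_csv("trip_data.csv")  # hypothetical file name

for i, (_, row) in enumerate(df.iterrows()):
    # Swap the 2013 timestamps for current ones: dropoff is "now",
    # pickup is dropoff minus the recorded trip duration.
    dropoff = datetime.now()
    pickup = dropoff - timedelta(seconds=int(row["trip_time_in_secs"]))
    producer.send(
        "taxi_trips",  # hypothetical topic name
        {
            "medallion": str(row["medallion"]),
            "pickup_time": pickup.isoformat(),
            "dropoff_time": dropoff.isoformat(),
            "trip_time_in_secs": int(row["trip_time_in_secs"]),
            "trip_distance": float(row["trip_distance"]),
            "pickup_location": [float(row["pickup_longitude"]),
                                float(row["pickup_latitude"])],
            "dropoff_location": [float(row["dropoff_longitude"]),
                                 float(row["dropoff_latitude"])],
        },
    )
    if i % 15 == 14:
        time.sleep(1)  # pause every 15 rows so the data is easy to visualize

producer.flush()
```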

Consume data using Kafka

Now that I have a topic into which data is being streamed, I can consume it. I consume the data from the topic and append it to a dataframe to be used later. A sketch of the consumer code follows.
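Again as a sketch under the same assumptions (the kafka-python client and the hypothetical taxi_trips topic):

```python
import json

import pandas as pd
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "taxi_trips",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

rows = []
for message in consumer:
    rows.append(message.value)  # each message is one trip record
    df = pd.DataFrame(rows)     # the dataframe to be used later
```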

Index data to ElasticSearch

Now that we have started consuming the data, we can index it into Elasticsearch. But before we insert documents, we have to define a mapping for the data, to specify which field holds which data type; for example, dropoff_time is of type date, and dropoff_location is of type geo_point. We do that with a mapping along the following lines.
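This is a sketch of such a mapping; the field names are assumptions based on the columns described above, and my_index is the index name used in the next step.

```python
import requests

# dropoff_time as a date, dropoff_location as a geo_point, and so on.
mapping = {
    "mappings": {
        "properties": {
            "medallion": {"type": "keyword"},
            "pickup_time": {"type": "date"},
            "dropoff_time": {"type": "date"},
            "trip_time_in_secs": {"type": "integer"},
            "trip_distance": {"type": "float"},
            "pickup_location": {"type": "geo_point"},
            "dropoff_location": {"type": "geo_point"},
        }
    }
}

# Create the index with this mapping.
requests.put("http://localhost:9200/my_index", json=mapping)
```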
We put this mapping to http://localhost:9200/my_index, and then we are good to go.
Now that we have our mapping in place, we can start inserting data. A sketch of the code follows.
This code is a continuation of the previous consumer code. Once I get the data I append it to a buffer, and once there are 15 rows I insert them into Elasticsearch in one batch. To do that I use the helpers module of the Elasticsearch Python client, which allows us to insert data into the database in bulk.
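As a self-contained sketch under the same assumed names, using a plain list as the buffer for simplicity:

```python
import json

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch
from kafka import KafkaConsumer

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "taxi_trips",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 15:
        # helpers.bulk sends all 15 documents in a single request.
        helpers.bulk(
            es,
            ({"_index": "my_index", "_source": row} for row in batch),
        )
        batch = []  # the buffered rows can be discarded once indexed
```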
One thing to note is that once the data has been visualized, it can be removed from the database, as it is no longer needed there; it will be available in the serving or batch layer.

Visualizing data using Kibana

Now that the data has been inserted into Elasticsearch, we can start to visualize it. To do that, open up Kibana, create a new dashboard, and select Maps. Make sure to add an index pattern for the index name. Once you select the data, set the time range to the last 5 minutes and the auto-refresh interval to 2 seconds. You'll see the taxis in real time.
As you can see, we are looking at taxis in real time, specifically their dropoff locations. By looking at the data, we can make decisions as to where empty taxis should be routed. By the looks of it, Manhattan seems to be a busy area, whereas the airports and outer areas are not as busy.

Conclusion

In this post we went through how to visualize streaming data. The speed layer is inherently different from the batch or serving layer, as it operates on real-time data, uses incremental algorithms, and computes answers over a small dataset. The writes can be asynchronous or synchronous.
Real-time dashboards like this enable a user to make business decisions in real time. The data volume can be very high, so it is wise to delete data once it is a day old, or even sooner. What we did not cover in this post is how to handle out-of-order or duplicate data, but that is a different topic altogether (although you can sort the bucket, for instance).
Read about Serving layer here
https://hackernoon.com/lambda-architecture-speed-layer-real-time-visualization-for-taxi-part-2-8i1p31q7
And about batch layer here
https://hackernoon.com/lambda-architecture-batch-layer-visualizing-all-time-taxi-data-part-3-n74l31lq
I hope you enjoyed my post.

References

  • https://www.google.co.in/books/edition/Big_Data/HW-kMQEACAAJ?hl=en
