Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. © Wikipedia
Let's explore how exactly distributed storage works in Hadoop.
Imagine that we need to store a ton of documents, do some computation on them like counting all the unique users, and store the measurement of the users close to the source data. To regularly complete this operation with Hadoop, we first have to store the data to our hardware.
For the beginning, let’s imagine storing the data with one machine without Hadoop:
Now, imagine a scenario in which we need to handle an enormous number of files. Imagine that a file can be even bigger than a single machine’s HDD. For this situation, we can upgrade our machine by adding more HDD. Doing this will result in the following impediments:
Fitting enormous measures of assets into a single machine is incredibly costly. If our machine goes down unexpectedly, we can’t handle any data any longer.
To take care of these issues, we need to move towards splitting calculations into a few machines. To achieve this, we add 3 extra machines connected. We want them to be a computing cluster. The objective is to store data, utilizing as many resources of each machine as possible at the same time. We can call these machines workers or nodes.
Now, we can simply divide all input records by 4 (number of our nodes) so each machine would accept ¼ of input files. We expect that all documents should be loaded similarly on 4 machines. But then we face considerably more disadvantages:
Before we begin handling any data, we have to analyze each worker to know whether it has enough capacity to store the data. It can require some time (particularly if the connection isn’t strong enough) if we have multiple workers. We need to monitor the processing advancement of every worker to know when the activity is finished.
If a node breaks down, we have to distinguish it and figure out how to manage the documents relegated to this machine. For this situation, the processed data ought to be reassigned to alive nodes.aWe need a solitary point to access all the stored data. Else, we can never be certain that we asked all of the workers who may contain the information we need.
Now, let’s talk about real Hadoop architecture. It turns out, to beat the troubles mentioned above, we have to characterize an essential node (known as NameNode in Hadoop) from one of the workers (or DataNodes in Hadoop).
If we consider one node as NameNode, it can be liable for allocating input files to workers, monitor the executions, and, if necessary, reassign input documents to various workers. Moreover, if we need our information to be reliable, the NameNode should manage all the replications of the stored data.
For these purposes, all the HDD space of each node becomes divided by blocks. The information about each block and data inside is stored in Metadata on the NameNode.
At the point when we have the NameNode, writing the data to Hadoop become to resemble this:
When there is a sign to write the data to the Hadoop cluster, it goes to the NameNode.
The NameNode checks its metadata about blocks with enough free space. Then, it chooses in what block exactly each part of the data will be stored (and replicas as well).
When the choice about target blocks is done, NameNode tells the client which DataNodes should be utilized for each part of the data and in what block it should put each part of the data.
While the client writes its data, NameNode waits for it to acknowledge the writing or to reset the Metadata.
The client connects to the objective DataNode, writes the data to the target blocks, and waits for the acknowledgment.
After the data is written, before sending the acknowledge, DataNode checks if we need replications for just loaded data. If we do, the DataNode replicates new data to another DataNodes, targeted by NameNode at the beginning and so on.
When the last DataNode completed the replication, it sends acknowledge back to the previous DataNode,
The acknowledgment should reach the original client.
When the client gets the acknowledgment, it goes to the NameNode and commits the successful writing. The NameNode saves the metadata about each block that was acknowledged.
At this moment, the writing is done, and data is saved and replicated among DataNodes. Now NameNode has the information in what blocks exactly each part of data is stored.
Described Hadoop architecture allows us to get rid of the challenges mentioned above:
NameNode always knows about all of the other nodes. No need to check all of the nodes every time we want to write the data. NameNode monitors every active writing and resets its metadata when timeout is reached. If DataNode goes down, NameNode will redirect the writing to alive DataNode.NameNode becomes a solitary point to access all the stored data because it keeps metadata about each block on other nodes. Finally, If we need more computing resources, we can connect more DataNodes to enlarge the cluster’s capacity.
Also published on Medium's Subdomain.