Overview of Fundamental Technology

Apache Hadoop is a software-hardware complex that allows tasks related to the storage and processing of large data arrays to be performed. Hadoop is designed for batch data processing using the computational power of more than one physical machine, as well as for storing datasets that do not fit on a single node.

You may need to use Hadoop if your needs meet the following criteria:

- It is necessary to accept unstructured/semi-structured data (JSON) and then convert it into a normal/structured form.
- Long-term storage and processing of large data arrays are required, with volumes that cannot be accommodated in a relational database management system (more than 100GB).
- Data processing is needed that cannot be performed within the scope of a service/set of services because it does not fit into RAM or requires the storage of large intermediate results.
- It is necessary to transfer large datasets (hundreds of millions of rows, hundreds of gigabytes) between services/clusters, using Hadoop as an integration bus.
- Multi-stage data processing operations (ETL) are required that cannot be implemented within a single service, or that require the use of logic in languages prohibited for integration in services (Python).
- Ad-hoc analytics on data stored in Hadoop.

Examples of tasks currently implemented on Hadoop:

- Intermediate storage of data extracted from master systems for further use in DWH (ETL, integration bus).
- Calculation of mathematical models and forecasts.
- Storage of user behavior data in a tabular form.

Prerequisites for Hadoop Across Multiple Data Centers

As a rule, Hadoop is located in one data center, but what if we require high availability of a service like Hadoop that would allow it to withstand the shutdown of one of the data centers? Or perhaps you simply do not have enough racks in one data center, and you need to increase both data and resource capacity.
We faced a choice: to place consumers on different Hadoops in different data centers (DC) or to do something special. We chose to do something special, and thus a concept that we developed, tested, and implemented was born: Hadoop Multi Data Center (HMDC).

Hadoop Multi Data Center

As the name suggests, it lives in multiple data centers. This solution provides cluster resilience to the loss of capacity in any individual data center and efficiently distributes resources among consumers. HMDC provides two main services to consumers: the distributed file system HDFS and the resource manager YARN, which also acts as a task scheduler.

This technology allows for a much more fault-tolerant Hadoop cluster, although it has certain features, a brief overview of which is given below.

In each data center, there is one Master node and a set of worker Slave nodes. The Master nodes have the following components installed:

- HDFS NameNode: responsible for the metadata of our HDFS, that is, where each data block is located and to which file it belongs.
- YARN ResourceManager: our entry point for running tasks on YARN, responsible for distributing tasks among Slave nodes.

Both services operate in High-Availability mode based on the Active-Standby principle. That is, at any given moment, one Master node serves client requests. If it becomes unavailable, another Master node switches to “Active” mode and starts serving client requests. In this way, we can survive a data center outage (DC-1) without losing functionality.

Other advantages of this “stretched” architecture include:

- Data providers support uploads to only one cluster.
- With a replication factor of 3, a copy of the data is stored in each data center (DC).
- Easy switching between DCs during a DC-1 outage.
- Adding new resources is practically “on-demand”.
- Migrating clients from one DC to another causes no problems and takes only minutes.
- YARN tasks of a specific queue (client) run only in one DC, and data is read only in that DC.
As a result, we generate cross-DC traffic only during writing.

Distinctive features of HMDC

HDFS in HMDC uses Rack-awareness based on the principle “one data center is one rack”. The definition of the nearest cluster is specified in the core-site.xml file. HDFS is deployed in a High Availability configuration with QJM and, as a service, consists of the following components:

- HDFS Namenode
- HDFS Journalnode
- HDFS ZKFC
- HTTPFS
- HDFS Datanode

YARN Capacity Scheduler in HMDC relies on Node Labels for resource allocation among node managers in different data centers. YARN is represented by the following components:

- YARN Resource Manager
- YARN Node Manager

HDFS Datanodes (DN) and YARN Node Managers (NM) are configured on machines functionally grouped as slave nodes. These are the main working machines in the cluster: the sum of the available volumes of allocated hard drives (taking replication into account), RAM, and virtual cores on these machines makes up the total resource pool of the HMDC cluster.

Control services (HDFS Namenode, ZKFC, YARN ResourceManager, HiveServer2, and HiveMetastore) are placed on machines grouped as master nodes. In each data center, there is one master node. The configuration of each service is implemented in such a way that the three master nodes provide High Availability for the service as a whole.

Main Components of the System

Let's take a closer look at the main components of the system.

HDFS Datanode (DN)

HDFS Datanode is a service responsible for storing data on cluster machines and a daemon that manages the data. On the machine, it is represented as a systemd service called hadoop-hdfs-datanode.service and runs as root, as it requires access to privileged ports for secure operation.

YARN Node Manager (NM)

YARN NodeManager is a daemon that manages the operation of YARN application processes on a specific machine.
When a YARN application is accepted by the resource manager (RM), it instructs the selected NM to allocate resources (memory and virtual cores) for a container in which a specific application process will run (for example, Spark Executor, Spark Driver, MapReduce Mapper, or Reducer).

HDFS NameNode (NN)

HDFS NameNode is the main controlling service of HDFS. In the HMDC cluster, three NNs operate simultaneously, but only one can be active at any given time. The other two NNs are in Standby State during this time (sometimes they are called Standby NN or SbNN).

HDFS JournalNode (JN)

HDFS JournalNode is a daemon responsible for synchronizing HDFS Namespace changes between NameNodes. On each master node, a systemd service called hadoop-hdfs-journalnode is running, and together they support the distributed QJM journal. The active NN reports changes in the HDFS Namespace through edits, which carry information about completed transactions in HDFS.

HDFS ZKFC

To ensure automatic fault detection of HDFS High Availability (HA) components and automatic failover, a Zookeeper Quorum and ZKFailoverController (ZKFC) are running on HMDC master nodes. ZKFC is a process represented by the systemd service hadoop-hdfs-zkfc.service, acting as a Zookeeper client and providing:

- Monitoring the state of NN through health check probes.
- Managing sessions in Zookeeper: ZKFC keeps an active session in Zookeeper if the NN is healthy. Also, if the NN is active, it holds a znode. If the NN is healthy, and no other NN holds a znode, ZKFC will attempt to win the election for a new active NN.

Zookeeper

In HMDC, Zookeeper servers are also deployed on the master nodes, providing a means for distributed coordination of HA services. Zookeeper has two main functions:

- Detecting faults in NN operation.
- Ensuring re-election of the active NN.

YARN ResourceManager (RM)

YARN ResourceManager is a service responsible for managing resources in the YARN cluster.
It operates in an HA configuration and is represented as a systemd service called hadoop-yarn-resourcemanager.service on the master nodes. It provides a web interface on port 8088 (Standby RM always redirects to the active RM's UI). To fully utilize resources and prevent downtime, RM implements two mechanisms:

- Queue Elasticity: Each queue is assigned parameters that determine guaranteed and maximum possible resources. If resources in the queue are underutilized, the application can get more resources than guaranteed, but not exceeding the values set by the second parameter.
- Container Preemption: To ensure applications always receive guaranteed resources, a mechanism is in place to reclaim resources from already running applications that are using more than their guaranteed capacities. Containers are terminated with the corresponding exit code and status in the UI.

When a client authenticates to the cluster and sends a request to launch an application, RM calculates whether the required resources are available, accepts the application into the queue, and allocates memory and virtual cores to it.

Hive High Availability

HA for Hive is achieved by installing multiple Hive Metastore servers and Hiveserver2 in each of the data centers, respectively. For HA Hiveserver2, Zookeeper is used. You can connect to Hiveserver2 using JDBC. When serviceDiscoveryMode=zooKeeper is set, a connection is established to a random available Hiveserver2 server.

Benefits of HMDC

As I said at the very beginning, we were faced with a choice: to create independent data centers, where each consumer is located in their own data center (in our case, there are 3), or to implement this concept, HMDC. It is worth noting that despite the separation of consumers by data centers, many consumers need the same data (it is not possible to divide the data).
Let's compare these 2 paradigms:

|  | Independent DCs | HMDC |
| --- | --- | --- |
| Data management | ×3, mapping | As a single cluster |
| Error with data | Possible loss | Possible loss |
| Data volume | ×3 to ×9 | ×3 |
| DC loss migration | Several weeks | Hours, config editing |
| Inter-DC data replication | Custom service | HDFS functionality |
| Support | 3 different domains | Unified cluster |
| Inequality of clusters | Flexibility in uploading | Block Placement Policies |

What Does HMDC Give Us as a whole?

- In case of a DC outage, you can migrate users with just a configuration change.
- This concept allows you to manage data under the paradigm of “it's a single cluster”.
- In case of an error during data processing, there may be data loss, as replication cannot be carried out.
- The volume of stored data is spread across three data centers, with one copy in each data center. Yes, this creates some drawbacks. By lowering replication to two, you risk increasing network load due to data retrieval from other data centers.
- Data replication between data centers is carried out through HDFS functionality.
- If necessary, and with proper control, you can manage stored data by isolating it in one of the DCs thanks to the HDFS block placement policy.
- You can add servers and data centers to your installation virtually on-demand when scalability is needed.
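
To make the “one data center is one rack” rack-awareness principle more concrete, here is a minimal sketch of a topology script of the kind Hadoop can invoke through the net.topology.script.file.name property in core-site.xml. Hadoop passes one or more host addresses as arguments and reads the rack names from standard output. The article does not show the actual script or addressing plan, so the subnets, the rack names, and the rack_for helper below are assumptions for illustration only.

```python
#!/usr/bin/env python3
"""Hypothetical topology script: every host in a data center maps to one rack."""
import ipaddress
import sys

# Assumed subnet-to-data-center layout; replace with the real addressing plan.
DC_SUBNETS = {
    "/dc1": ipaddress.ip_network("10.1.0.0/16"),
    "/dc2": ipaddress.ip_network("10.2.0.0/16"),
    "/dc3": ipaddress.ip_network("10.3.0.0/16"),
}

# Rack returned for anything the script cannot place in a known data center.
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Return the rack (here: the data center) for a single IP address."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP address (for example a bare hostname): fall back.
        return DEFAULT_RACK
    for rack, network in DC_SUBNETS.items():
        if addr in network:
            return rack
    return DEFAULT_RACK


def main(argv):
    # One rack name per argument, separated by whitespace.
    print(" ".join(rack_for(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

With such a mapping, HDFS treats every node in a data center as belonging to the same rack, which is what lets block placement spread the three replicas across data centers. A real deployment may resolve hostnames or read the mapping from a file instead of hardcoding subnets.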

Hadoop Across Multiple Data Centers
