The entire perception of big data and its management underwent a revolutionary change with the introduction of Apache Hadoop, an open source Java framework first released in 2006.
Because it supports the processing of huge volumes of data, companies have realized its importance in a distributed computing environment.
And what’s more, Hadoop is scalable as well: it can expand from a single server to many (sometimes running on thousands of machines), each with its own computation and storage facilities. Every time you scale out, you add to both the storage and the computational capacity of the framework.
Hadoop, HDFS and MapReduce
The major advantage of HDFS is that it can scale to clusters of up to 4,500 servers and 200 PB of data. Storage and computation are handled across all servers, and when demand grows, storage capacity can be scaled out while still remaining economical.
Hadoop is designed to stay reliable through hardware and software failures. Many of the problems you had to face with previous data platforms are resolved with HDFS.
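Conceptually, HDFS achieves this by splitting each file into large fixed-size blocks and storing several replicas of every block on different nodes. The sketch below is a plain-Python illustration of that idea, not the real HDFS API; the 128 MB block size and replication factor of 3 are simply HDFS's common defaults.

```python
import itertools

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct nodes, round-robin style."""
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    node_cycle = itertools.cycle(nodes)
    placement = {}
    for block_id in range(num_blocks):
        # Each replica of a block must live on a different node.
        replicas = []
        while len(replicas) < replication:
            node = next(node_cycle)
            if node not in replicas:
                replicas.append(node)
        placement[block_id] = replicas
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size_mb=500, nodes=nodes)
for block_id, replicas in placement.items():
    print(block_id, replicas)
```

A 500 MB file becomes four 128 MB blocks, each stored on three of the four nodes, so losing any single node never loses data.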
While HDFS is the storage layer of Apache Hadoop, MapReduce is its processing engine. With MapReduce, enterprises can process and generate huge unstructured data sets across clusters of commodity hardware (remember, each node in the cluster contributes its own storage).
MapReduce is, thus, a major aspect of Hadoop and performs two main functions:
(1) It delegates work to the different nodes in the cluster (the map step), and
(2) It collects all the results from the query into one cohesive answer (the reduce step).
The main components of classic MapReduce are the JobTracker (which runs on the master node), the TaskTrackers (agents on each worker node, with functions of their own) and the JobHistoryServer (deployed as a separate function, but a component that tracks completed jobs).
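The two functions above are easiest to see in word count, the canonical MapReduce example. The sketch below simulates the map, shuffle and reduce phases in plain Python rather than the actual Hadoop Java API, so the data flow is easy to follow; in a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: each 'node' emits (word, 1) pairs for its share of the input."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word.lower(), 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into one cohesive answer."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data big clusters", "data nodes and data storage"]
counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(counts)
```

Here the map step corresponds to the work delegated to individual nodes, and the reduce step is the collection of all partial results into one final answer.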
Hadoop can streamline huge amounts of data using a simple programming model across clusters of computers. Hadoop has evolved to handle both structured and unstructured data, and all you need to do to handle growing volumes of data is add extra servers to a Hadoop cluster.
Have a look at the main reasons why enterprises insist on using Hadoop for business:
Scalability — Hadoop’s scalability is hard to beat: a single Hadoop cluster can scale to thousands of inexpensive servers, and you can keep adding nodes as your needs grow.
No more changing of formats — Traditional warehouses require data to be converted into a predefined format, and data can be lost during that translation. Hadoop stores data in its raw form, so this problem is avoided entirely.
Highly effective and quick data processing — The storage method of Hadoop is, indeed, unique: the processing is moved to where the data is stored, rather than moving the data to the computation. It is also interesting to note how large volumes of unstructured data are handled. Hadoop can process petabytes of data in a matter of hours and terabytes in minutes.
Cost-effective — One of the most attractive features of Hadoop is that the software is free and open source. If any expense is incurred, it is for the commodity hardware needed to store huge amounts of data. That still makes Hadoop inexpensive.
Highly robust — The fault tolerance feature of Hadoop makes it really popular. If a node fails, processing automatically continues using the data’s replicas on other nodes. This ensures that data processing carries on without any hitches.
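The fault-tolerance behavior in the list above can be sketched as follows. This is a conceptual simulation, not HDFS internals: when a node is lost, any block that drops below its replication factor is re-copied from a surviving replica onto another live node.

```python
def rereplicate(placement, failed_node, all_nodes, replication=3):
    """After a node failure, restore each under-replicated block by
    copying it from a surviving replica to a node that lacks it."""
    live_nodes = [n for n in all_nodes if n != failed_node]
    for block_id, replicas in placement.items():
        survivors = [n for n in replicas if n != failed_node]
        # Pick new homes from live nodes that don't already hold the block.
        candidates = [n for n in live_nodes if n not in survivors]
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        placement[block_id] = survivors
    return placement

# Two blocks, each replicated on three of four nodes; then node2 fails.
placement = {0: ["node1", "node2", "node3"], 1: ["node2", "node3", "node4"]}
placement = rereplicate(placement, failed_node="node2",
                        all_nodes=["node1", "node2", "node3", "node4"])
print(placement)
```

After the failure, every block is back at full replication on the remaining nodes, which is why a job can keep running through a node loss.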
Now let’s look at some of the main use cases of Hadoop.
1. Converting Site Visitors — How Hotels.com did it
When you have a huge company like the Expedia-owned hotel booking site Hotels.com, you can only imagine the huge amounts of data that get churned by the millisecond. With so many people coming and going, how do you convert those visitors into customers? Hotels.com intelligently solved that problem by using Hadoop.
Of course, they were already using the cloud to power some of the smaller functions, like the auto-complete suggestions that popped up as soon as a visitor typed in the search box. However, during peak season, things started getting tougher as more and more people poured in. They needed to use data to get closer to their customers.
This had to be done in such a way that the site’s performance didn’t falter, because customers have little patience for slow-performing websites. The site had to respond quickly to incoming demand and carry on with the number crunching while allowing people to book holidays without hitches.
Hotels.com started using the NoSQL database Apache Cassandra, and Cassandra was a major boon in several ways. For example, when you look at a particular property, you can see messages such as “XX people are also looking at this property at this moment” or “This property was last booked on — — ”.
When they started using Hadoop, it amplified what Cassandra could do, helping them convert more visitors and give better service to them all. The conversion rate of the website also changed for the better.
2. Easy Data Analysis from Multiple Sources — How Marks and Spencer does it
The retail giant Marks and Spencer was really successful in building a cross-functional team, drawing on marketing, ecommerce, finance and IT, to analyze data from many different sources.
For this purpose, M&S deployed Cloudera Enterprise Data Hub Edition to enhance its digital platform and gain a better understanding of consumer behavior. Cloudera provides Apache Hadoop-based support and services to store and process data seamlessly.
M&S was looking for a strategic technology partner that would help its in-house team handle and manage big data in a customizable and flexible manner. With Cloudera, they have been able to leverage data in a robust and scalable way, with security treated as a primary concern.
Through its decision to use Hadoop, the company has been able to successfully use predictive analytics to keep its shelves stocked during the festive season. While competitors try hard to predict customer behavior and struggle to meet demand, M&S manages to stay a step ahead, with far fewer disappointed customers.
3. Improve Customer Satisfaction — How Royal Bank of Scotland does it
Royal Bank of Scotland is at the forefront of driving customer insights with Hadoop and big data analytics to improve customer interactions. Though the bank has invested heavily since the banking crash to win back customers, the move was well worth it.
The aim of the bank was to know each customer personally, much as branch managers knew their customers in the 1970s. Before the banking crash, RBS was one of the biggest banks in Scotland, and it sought to bring back its lost glory by learning to improve customer satisfaction through excellent service, rather than by focusing on beating out competitors.
Like M&S, RBS also invested in Cloudera to help its customers get the best out of its banking services. The bank was also quick to step in and guide customers whenever they faced problems, and it has incorporated several technologies to provide tailored recommendations for its customers.
This is especially helpful when customers are unsure which plans to follow and which products to buy. Combine this technology with predictive analytics and you have a solid plan for understanding what customers are thinking.
4. Get to Know the Customer — How British Airways did it
Enterprises need something solid to gain an edge over the competition. Collecting customer data and analyzing customer behavior with this information is one way to bring results.
British Airways introduced a new program known as Know Me, a unique plan intended to understand its customers better than its competitors do. Through the Know Me program, the airline identified the customers that remained loyal to it and rewarded them with benefits and offers.
If a particular traveler gets stuck on the freeway, the airline sends them a message asking whether they want to reschedule their flight. This is just one of the many ways in which BA strives to deliver the best service to its customers.
The above use case is an excellent example of how important knowing your customer is. BA does this with the help of Hadoop, and since the airline has large amounts of data, storing, processing and indexing it is easier with Hadoop.
They are able to manage the data effectively despite the heavy influx. With Hadoop’s extensive capabilities, data can be easily governed and controlled, thereby aiding intelligent data archiving and processing.
5. Saving Millions in Hardware Costs — How Yahoo accomplished it
For a huge multinational IT company like Yahoo, saving millions of dollars in hardware costs is both a necessity and a challenge. It is said that more than 150 terabytes of machine data go through its data warehouses every day.
Yahoo uses Hunk, a Hadoop analytics tool from the data analytics company Splunk, to manage its data in real time. In fact, the company entered the Hadoop game much earlier than most companies; the original intention was to speed up the indexing of web pages by its web crawlers.
Yahoo swears by the benefits it has earned through Hadoop, and the company admits that it leaves the largest Hadoop footprint in the tech world. There are about 4,500 nodes in the largest Hadoop cluster owned by Yahoo, and it runs the framework on over 100,000 CPUs across more than 40,000 servers. Now that’s a huge footprint, right?
Hadoop plays an important role in detecting and blocking spam messages. It also powers the personalization feature, giving customers the best content based on their interests and hobbies.
Yahoo sends value-added packages to its customers by combining both automated analysis and real editors to define customer interests. Yahoo uses Hadoop in partnership with other technologies to deliver the best results to advertisers and marketers as well.
The power of big data, coupled with the ubiquitous use of mobile apps, has drastically transformed the business sector. With Hadoop, businesses can not only manage their data, but also free up their internal teams to analyze the insights derived from their proprietary data warehouses.
And that’s not all.
Hadoop gives you real insight into what’s happening in your data, so having this framework makes the difference between having that knowledge and not having it. It is one of the most cost-effective and efficient ways to handle escalating data volumes.
If you like this post, please share!!