A year ago, we started looking for ways to solve some data-related problems we had. At that point, our environment was struggling to manage our Point of Sale (POS) feed: the POS systems generate millions of messages, which we collect in order to run analytics on them.
The second problem was associating the POS data with other data sources, which could ultimately help us predict sales in stores and order the right amount of product from distributors.
We wanted to correlate all these sources quickly and easily, but we were far from doing it effectively. It was being done, but only with a lot of effort from our organization.
We got the idea of trying Hadoop from another team within the company that had already invested time investigating it, since they had a similar problem.
Our first Proof of Concept (POC) took a few weeks to complete and, in my opinion, it was our first step into the Hadoop world.
This POC was done using Amazon Elastic MapReduce (EMR). We had a bunch of data stored in Amazon Simple Storage Service (S3), and we could dynamically spin up clusters and process this data using Apache Spark. It was just a POC, so we only demonstrated some key features of the solution. We needed to know how to process data stored somewhere, how to secure it, and how to do analytics with Business Intelligence (BI) tools, such as Microstrategy, that non-technical people could use.
After completing our first approach, we basically got shot down: we had the in-house infrastructure to do the same thing, so a cloud-based solution should not be our focus. Instead, we were asked to implement a similar solution on premises.
Even though we did, in fact, have the infrastructure to host a solution like the one proposed, we knew it would take time to implement what we needed. Still, we all committed ourselves to solving our organization’s problems.
The first vendor we tried was Hortonworks (Horton), but at that moment we were just starting and the process of learning was slow and painful.
A team from Horton joined us every week for some weeks to check the progress we were making on their platform. They were very knowledgeable, but something was still missing. Horton is based on the open source community, and some of the components we needed were not up to date, so we just could not use them.
Horton’s business model is also based on open source ideas: they give you the entire platform for free, yet there is no support engagement until some money is on the table. For us that was a big deal. We were exploring a completely new environment and did not have anyone to ask for help.
“This is madness,” a colleague of mine would say from time to time while trying to figure out why Hive was not working on our Hadoop cluster.
Yes, we had a lot of configuration problems: our cluster was not set up correctly and some of the latest software components were not present. We also could not find a nice way to automate the creation and addition of virtual machines to the cluster using the recommended Horton tools.
We had problems, and that was fine, but we were expecting some help from them and we never got it.
From our initial tests on the Hortonworks platform, we were able to define a design that would become the basis of our future implementations. So that everyone can follow along, let’s review our data workflow.
We have a constant flow of messages coming through a web service, and those messages are written to Apache Kafka, a general-purpose, distributed, high-performance log queue designed at LinkedIn. From there, an Apache Spark job streams those messages into two different flows: one aggregates them and writes them to the Hadoop Distributed File System (HDFS), and the other writes them in a relational fashion to our Teradata environment.
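To make the two flows concrete, here is a minimal sketch in plain Python. It is not our Spark job: a list of strings stands in for the Kafka topic, an in-memory SQLite database stands in for Teradata, and the message fields (`store`, `sku`, `amount`) are made up for illustration.

```python
import json
import sqlite3


def fan_out(messages, db):
    """Send every message down both flows of the design.

    Flow 1 aggregates the raw messages into a single blob, the way the
    aggregation flow builds one large file for HDFS.
    Flow 2 writes each message as a row, the way the relational flow
    loads Teradata (SQLite is only a stand-in here).
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS sales (store TEXT, sku TEXT, amount REAL)"
    )
    aggregated = []
    for raw in messages:
        msg = json.loads(raw)
        aggregated.append(raw)  # flow 1: keep the raw message
        db.execute(             # flow 2: relational row
            "INSERT INTO sales VALUES (?, ?, ?)",
            (msg["store"], msg["sku"], msg["amount"]),
        )
    db.commit()
    return "\n".join(aggregated)


# Hypothetical messages standing in for the POS feed.
feed = [
    '{"store": "s1", "sku": "a", "amount": 2.5}',
    '{"store": "s2", "sku": "b", "amount": 1.0}',
]
db = sqlite3.connect(":memory:")
blob = fan_out(feed, db)
```

In the real pipeline both flows run continuously inside a streaming job. The sketch only shows that one input feeds two independent outputs.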
Now we have our messages stored in HDFS, yet most of the time we don’t know how to correlate those messages or how to look at them. We need a way to explore this data. We need a self-service BI (self BI) environment.
A common problem in the BI world is the time required to develop ETL jobs. Yes, there are very skilled people whose work is exactly this, and they do it right and fast. However, what we have seen recently is that by the time developers finish creating a solution (ETL jobs), the business questions have already changed, so those jobs are now useless.
On the other hand, developers can spend a lot of time creating jobs that will be used only a few times and then deprecated, since they no longer answer the questions the business is asking.
You might wonder why this is happening. Why can’t business people agree on what they really want? There might be many answers, but in our world it happens because we don’t know all the data we have. We need to explore it, make sense of it, and then create the relations that will allow us to get reports out of it.
Today we want certain reports, and right away ETL developers start working on them. Tomorrow we discover that new data could be related to the data the ETL developers are working with, and by the time they finish, the original requirements have changed so much that the results of their work are entirely useless. Now we need to start all over again.
Wouldn’t it be nice if business people had the ability to explore their own data by themselves?
Now that we have explained why we need self BI environments, let’s go back to our design session.
At this point, we have hundreds of gigabytes written to HDFS, but we need to access them efficiently.
There are several tools we could use. Hortonworks provides Apache Hive out of the box. Hive works as a SQL engine that runs on top of HDFS. But wait, our business users don’t need to know SQL! So we put Microstrategy at their disposal: they drag and click while Microstrategy generates SQL statements that are executed over our data through Hive on HDFS.
Because the business users will use Microstrategy as a front-end tool, we could swap Hive (our back-end query engine) for any other processing engine, such as Spark SQL, and they would still perform the same tasks on their end without problems, hopefully.
With the Hortonworks platform we experienced a lot of performance issues when using Hive. That was one of the main arguments for moving to the next vendor we tried out.
Apache Spark worked well, but some team members were not happy about writing Scala code every time a new data feed was added (even though this whole process can be automated, which means the argument no longer holds).
Apache Spark has become one of the most popular data processing frameworks out there, and we definitely have to come back to it.
One day, the guys from Cloudera appeared at the office and we were introduced to their platform.
To keep the story of our experience with Cloudera short: the platform is truly enterprise level, which is what we were looking for. All components were up to date, we got their customer support from day one, and they worked with us to set up the cluster right from the beginning. For us that was just impressive after our experience with our previous vendor.
The management tools were easy to use and understand. The Cloudera tools allowed us to work fast and to discover the advantages of their platform with ease.
Cloudera has the same components as Hortonworks and also adds a new one that we were quite interested in, Apache Impala.
Impala is the fastest SQL-compliant tool we have seen so far. It was considerably faster than Hive. Impala integrates with Microstrategy just fine, so our front end does not have to change at all.
On both platforms, Hortonworks and Cloudera, we faced one common challenge: how do we expose new data for which we haven’t defined a schema?
The process is the following:
The question is, who is going to do this? No one wants to do it. It should be automated, but sometimes the business wants to look at files that are completely new to everyone (extractions from different sources that they generate manually and use). There was no way to automate that. The business depends on developers to follow the previous process every time it wants to do these operations. Our self BI idea was not possible. It was incomplete, and these two platforms were not enough for us.
The truth is that we were skeptical when the guys from MapR showed up. However, by the end of their talk, I think we were more or less convinced that they had our solution.
Even though this vendor is not the most popular one in the Hadoop world, what they showed us matched our requirements perfectly, so we decided to give it a try.
Setting the cluster up was as easy as with Cloudera, and I don’t think there were any problems. However, the management console was not as good as the one in the Cloudera cluster. It was a little more difficult to navigate and understand.
MapR is POSIX compliant! That means our data path gets significantly simpler. Having a POSIX platform implies we can create a volume inside the cluster and mount it in our Linux or Windows environment. The volume will appear as any other drive we have on our computer. We can read from and write to it as we do with any other attached storage, while in fact we are reading from and writing to distributed storage (our MapR cluster).
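As a rough illustration of what that buys us, the sketch below appends a message to a file using nothing but ordinary file calls. The mount point is an assumption: a temporary directory stands in for wherever the MapR volume would be mounted, and the `pos/feed.log` path is made up.

```python
import os
import tempfile


def append_message(mount_root, relative_path, line):
    """Append one line to a file under a mounted volume.

    Nothing here is cluster-specific: if mount_root is a mounted
    POSIX volume, the same calls write to the distributed storage.
    """
    path = os.path.join(mount_root, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return path


# A temporary directory stands in for the mounted volume (assumption).
mount_root = tempfile.mkdtemp()
path = append_message(mount_root, "pos/feed.log", '{"store": "s1"}')

with open(path, encoding="utf-8") as fh:
    content = fh.read()
```

The point is that no Hadoop client library appears anywhere. Any tool that can write a file can feed the cluster.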
Now we can remove Kafka and the Spark job that aggregates messages and writes them to HDFS, and with that we also remove the complexity of all those moving parts in the middle of our data path. We now have fewer components to worry about and lower hardware requirements.
Part of our initial design was streaming messages to Teradata. We can still do that by changing our Spark streaming source to be the file system instead of Kafka, and everything else just works.
In an initial data movement test, we ran an rsync command between a network share and a mounted drive that pointed to the MapR file system. We got a 100 Mb/s transfer rate between our environment and the MapR cluster. For us, that was more than enough.
MapR brings a new tool to the table in addition to the ones offered by the other vendors: Apache Drill, a tool that auto-discovers schemas from different sources and exposes them without any prior registration. This is the tool we were looking for.
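To give a feel for what schema auto-discovery means, here is a toy sketch in plain Python. It is not how Drill is implemented. It only shows the idea of deriving field names and types from the data itself instead of registering a schema first, and the sample records are made up.

```python
import json


def discover_schema(lines):
    """Infer field names and the value types seen for each field
    from newline-delimited JSON records, with no schema declared
    up front."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is stable and easy to compare.
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical records: the second one adds a field the first lacks.
records = [
    '{"store": "s1", "amount": 2.5}',
    '{"store": "s2", "qty": 3}',
]
schema = discover_schema(records)
```

A new field simply shows up in the result the first time it appears in the data, which is the behavior we wanted for files nobody has seen before.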
Yes, we had a lot of problems with it, mostly related to performance. Some team members dedicated hours to working them out, and they did. We can get results back as fast as with Impala, with the addition of schema auto-discovery.
Drill exposes data to Microstrategy in the same way that Hive and Impala do, so it can be used in the same fashion.
Because we can mount different volumes, business people can now have their own space in the MapR cluster, so they can copy and paste the data they want to explore (streaming and automatic feeds are handled by automated processes such as Spark jobs). If someone creates an extract file and wants to correlate it with another one that comes in through the automatic feed, they only need to:
Behind the scenes,
The small files problem is a big deal in Hadoop. On Hortonworks and Cloudera it was something we really needed to take care of, and it was the main reason behind the aggregation process we had: we aggregated a lot of small files into one big file to reduce the number of files written to HDFS.
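The aggregation step can be sketched roughly as follows. This is a plain Python stand-in for what our Spark job did, with local temporary directories in place of HDFS and made-up file names.

```python
import os
import shutil
import tempfile


def aggregate_small_files(src_dir, dest_path):
    """Concatenate every file in src_dir into one file at dest_path.

    Assumes dest_path lives outside src_dir and that each small file
    ends with a newline, so records do not run together.
    Returns the number of files merged.
    """
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with open(dest_path, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return len(names)


# Three tiny files stand in for a burst of small message files.
src_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(src_dir, f"msg-{i}.json"), "w") as fh:
        fh.write(f'{{"id": {i}}}\n')

dest_path = os.path.join(tempfile.mkdtemp(), "batch.json")
merged = aggregate_small_files(src_dir, dest_path)
```

One large file replaces many small ones, which is all the aggregation was for.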
MapR does not use HDFS. Instead, they have implemented MapR-FS, a file system that exposes an HDFS interface so it can be used by any tool designed to work with HDFS. It can also be accessed through POSIX commands, so it can work like any other file system compatible with most operating systems.
In MapR-FS, we can have millions of small files without a problem. The internal specifications of this file system and how to work with it can be found here. MapR has done an incredible job in this area and we can all benefit from it.
MapR, in the same way Cloudera does, relies heavily on Spark as one of the main components of its solution. Spark is used extensively as a data processing framework, and it can also serve as the execution engine behind SQL tools such as Hive (Impala and Drill run on their own engines). Spark is by far one of the most used components in any big data solution because of its versatility across tasks.
As we can see, MapR not only simplified our solution but also offers a different way to attack the problems we have. MapR can be seen as an extension of our environment rather than a separate environment that we need to architect in order to move and process data within it. Even though MapR’s management tools are a little behind the ones offered by Cloudera, they work as they should and keep our environment stable and always working. MapR helps us eliminate unnecessary processes so we can focus on the data and how to work with it, and that is what we really need.
If you enjoyed this piece, you might also like Apache Spark as a Distributed SQL Engine.
This blog doesn’t represent the thoughts, plans or strategies of my employer.