A few months ago, I decided that I wanted to pursue a career in data engineering. I was fascinated by data engineering because of all the big data tools involved, like Spark, but beyond that I knew very little. This is what I learned in the past four months.
The first thing you need to grok is: what is the point of all this data? Ultimately, the data helps the people who are making decisions make better decisions. For instance, if you sell T-shirts and you find that most of your customers are between 18 and 25, then you can put Justin Bieber’s face on the T-shirts and all of a sudden your sales will go through the roof. Once you have the data, you can run some statistics on it, make fancy visualizations, run some SQL, and the organization as a whole can make better decisions.
Big data, in the most general sense, refers to a quantity of data that either can’t be stored on a single server, or over which computing something using a single server would take too long (what counts as too long depends on the application). Historically, the scale of data was small enough to fit on one server. As a result, the tools that developed, e.g. Relational Database Management Systems (RDBMS), were largely meant for single-node deployments. (This isn’t entirely true: distributed computing models have existed since the 1960s, and distributed databases like Teradata since the 1970s, but these were very niche.) Even though our compute and storage capacities were growing exponentially, the data being generated was growing even faster.
Although some solutions existed, the modern era of distributed computing and storage started in 2004, when Google published MapReduce: Simplified Data Processing on Large Clusters. The key proposition of this paper, along with Google’s 2003 paper, The Google File System, was how to perform distributed, fault-tolerant computing and storage on top of commodity servers.
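To make the programming model concrete, here is a toy, single-machine sketch of MapReduce-style word counting in Python. The real value of MapReduce is the distributed, fault-tolerant runtime that sits behind these two functions; this sketch only illustrates the map and reduce phases themselves.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts per word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# The "shuffle" step: gather intermediate pairs from every mapper.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))  # {'the': 3, 'quick': 1, ...}
```

In the real system, the mappers and reducers run on many machines, and the framework handles partitioning the data, shuffling intermediate results, and re-running tasks on failed nodes.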
Around the same time, Doug Cutting was working on Nutch, an open source web crawler. Doug was facing exactly these problems of distributed storage and processing, and when Google’s papers were released, he decided to implement them as part of Nutch. Eventually, these implementations would become Hadoop and HDFS. (If you are more interested in the history, see https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704 for an excellent writeup on the topic.)
I covered two aspects of the big data ecosystem, namely distributed computing and distributed storage. However, the ecosystem is much more diverse and has many more moving parts. Another important component is the ingestion layer. Why do we need an ingestion layer? Let’s say we have logs being generated on many nodes, and we want to store these in HDFS and run a Hadoop job on them. The ingestion layer basically acts as a fault-tolerant, distributed middle layer between the producers of data on one side and the consumers on the other.
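Apache Kafka is a popular choice for this layer. As a rough illustration (assuming the kafka-python client, a broker at localhost:9092, and a made-up topic and file name), a log-producing node might do something like this:

```python
from kafka import KafkaProducer  # pip install kafka-python

# Assumes a Kafka broker is running locally; in production this would
# be a list of brokers in the cluster.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each node pushes its log lines to a topic instead of writing to HDFS
# directly; downstream consumers (e.g. a job that batches logs into
# HDFS) read from the same topic at their own pace.
with open("app.log", "rb") as log_file:
    for line in log_file:
        producer.send("app-logs", value=line)

producer.flush()  # block until all buffered messages are sent
```

The point is the decoupling: producers don’t need to know who consumes the logs, and the broker buffers data if consumers fall behind or temporarily fail.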
Now that we have some sense of what the big data ecosystem looks like, let’s consider the two key players in the game: data scientists and data engineers.
The job of the data scientist is to turn big data into actionable insights. This requires them to apply machine learning, statistics, and analytical thinking to the data. Some data scientists will have strong programming skills, and some will have a deep understanding of the ecosystem of tools. However, their primary job is to run experiments, develop algorithms, and derive insights.
Data engineering is a new enough role that each organization defines it a little differently. Broadly speaking, though, the data engineer’s job is to manage the data and make sure it can be channeled wherever it is required. In some companies, this means data engineers build the underlying systems that allow data scientists to do their job efficiently; at Netflix, for example, data engineers may build and maintain the infrastructure that allows data scientists to experiment with recommendation algorithms. In other companies, data engineering is the whole shebang; at Twitter, for example, the biggest challenge is making data flow as quickly and efficiently as possible.
Typically, a data engineer will have strong programming skills and a deep understanding of the big data ecosystem and of distributed systems in general. A data engineer will perform one or more of the following:
I mentioned before that sometimes the job of the data engineer is to make data accessible, for instance so that a data scientist can run queries on top of it. However, this raises the question of how the data should be stored. The answer is very nuanced, and this is where the concept of a data warehouse comes into play.
A data warehouse is a consolidation of all the data in an organization, built to make the data easier to analyze. The raw data is often heterogeneous in both format and location. For instance, some data may come from log files and other data from transactional databases, e.g. a MySQL database that stores the order quantity and order amount. To give you a sense of the subtleties in working with data warehouses, consider that transactional databases store data in normalized form, so that updates cannot introduce anomalies. Performing complex queries on normalized data requires joins. Joins, in general, are expensive, and at big data scale, when the data is spread across multiple servers on a network, joins become prohibitively expensive. A solution to this is to denormalize the data. However, denormalizing the data typically increases the storage space required.
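To make the trade-off concrete, here is a small sketch in Python using pandas (the table and column names are made up for illustration). The join merges two normalized tables into one wide, denormalized table: each customer’s name ends up repeated on every one of their orders, which makes queries cheaper but storage larger.

```python
import pandas as pd

# Normalized form: customer attributes live in exactly one place.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Ada", "Grace"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [25.0, 40.0, 15.0],
})

# Answering "total spend per customer name" requires a join.
joined = orders.merge(customers, on="customer_id")
print(joined.groupby("name")["amount"].sum())

# Denormalized form: store `joined` directly. No join is needed at
# query time, but "Ada" is now duplicated across all of her orders.
```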
ETL (Extract, Transform, Load) is closely related to the concept of a data warehouse. Essentially, ETL refers to moving data from some initial source to a different location. For instance, if we have CSV log files and we want to put them in our data warehouse, we must first read them from the source (extract), then parse the comma-separated fields and remove malformed lines (transform), and finally write them into the data warehouse (load).
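Here is a minimal sketch of that pipeline in Python. The file paths, the expected three-column layout, and the “warehouse” being a plain output file are all assumptions for illustration; a real pipeline would load into an actual warehouse system.

```python
import csv

def extract(path):
    """Extract: read raw rows from the CSV source."""
    with open(path, newline="") as f:
        yield from csv.reader(f)

def transform(rows, expected_columns=3):
    """Transform: drop malformed lines and clean up the fields."""
    for row in rows:
        if len(row) != expected_columns:
            continue  # skip malformed lines
        yield [field.strip() for field in row]

def load(rows, path):
    """Load: write the cleaned rows to the 'warehouse' (a file here)."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

load(transform(extract("raw_logs.csv")), "warehouse_logs.csv")
```

Because each stage is a generator, the rows stream through the pipeline one at a time, so even a log file larger than memory can be processed on a single machine.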
In some cases, data engineers will have to work with the raw compute and storage nodes themselves, e.g. installing the software. This is more typical in smaller organizations, where the roles are not clearly delineated; larger organizations will have specialized DevOps roles.
In some cases, data engineers will work on infrastructure tools, e.g. adding an extension to Hadoop. Again, this is more typical of smaller organizations, with larger organizations having separate tooling teams.
This is closer to the role of a data scientist, but depending on your background and interests you may end up more or less involved in the data science side of things. Again, this varies from organization to organization.
Data engineering is a very exciting field, and as data-driven decision making becomes more central to our businesses, the challenges and opportunities of data engineering will only grow.