With immense amounts of data, we need a tool to rapidly digest it Data is all around us. The IDC estimated the size of the “digital universe” at 4.4 Zettabytes (1 Trillion Gigabytes) in 2013. Currently, the digital universe grows by 40% every year, and by 2020 the IDC expects it to be as large as 44 Zettabytes, amounting to a single bit of data for every star in the physical universe. We have a lot of data, and we aren’t getting rid of any of it. We need a way to store increasing amounts of data at scale, with protections against data-loss stemming from hardware failure. Finally we need a way to digest all this data with a quick feedback loop. Thank the cosmos we have Hadoop and Spark. To demonstrate the usefulness of Spark, let’s begin with an example dataset. 500 GB of sample weather data comprised of: C_ountry|City|Date|Temperature_ Say we’re asked to calculate the maximum temperature by city for these data, and we start with a native Java program since Java is your second favorite programming language: Java Solution However at 500 GB in size, even such a simple task will take nearly to complete utilizing this native Java method. 5 hours “Java sucks, I’ll just write this in Ruby and kick if off real quick, Ruby is my favorite” Ruby Solution However Ruby being your favorite doesn’t make it any more equipped for this task. I/O simply isn’t Ruby’s strong suit, so Ruby will take even longer to find the max temperature than Java will. Finding maximum temperature by city is a problem best solved using This is where MapReduce shines, mapping cities as keys and temperatures as values, we’ll find our results in much less time, around 15 minutes compared to the previous 5+ hours in Java. Apache MapReduce (I bet you thought I’d say Spark). MaxTemperatureMapper MaxTemperatureReducer MapReduce is a perfectly viable solution to this problem. This approach will run much faster compared to the native Java solution because the MapReduce framework excels in delegating map tasks across our cluster of workers. Lines are fed from our file to each of the cluster nodes in parallel, whereas they were fed into our native Java’s main method one at a time. The Problem Calculating max temperatures for each country is a novel task in itself, but it’s hardly groundbreaking analysis. Real world data carries with it more cumbersome schemas and complex analyses, pushing us toward tools that fill our specific niches. What if instead of max temperature, we were asked to find max temperature by country and city, then we were asked to break this down by day? What if we mixed it up and were asked to find the country with the highest average temperatures? Or if you wanted to find your habitat niche where the temperature is never below 58 or above 68 ( doesn’t seem so bad). Antananarivo Madagascar MapReduce excels at batch data processing, however it lags behind when it comes to repeat analysis and small feedback loops. The only way to reuse data between computations is to write it to an external storage system (a la HDFS). MapReduce writes out the contents of its Maps with each job — before the reduce step. This means each MapReduce job will complete a single task, defined at its onset. If we wanted to do the above analysis, it would require three separate MapReduce Jobs: MaxTemperatureMapper, MaxTemperatureReducer, MaxTemperatureRunner MaxTemperatureByCityMapper, MaxTemperatureByCityReducer, MaxTemperatureByCityRunner MaxTemperatureByCityByDayMapper, MaxTemperatureByCityByDayReducer, MaxTemperatureByCityByDayRunner It’s apparent how easily this can run amok. Data sharing is slow in MapReduce due to the benefits of distributed file systems: replication, serialization, and most importantly disk IO. Many MapReduce applications can spend up to 90% of their time reading and writing from disk. Having recognized the above problem, researchers set out to develop a specialized framework that could accomplish what MapReduce could not: in memory computation across a cluster of connected machines. Spark: The Solution Spark solves this problem for us. Spark provides us with tight feedback loops, and allows us to process multiple queries quickly, and with little overhead. All 3 of the above Mappers can be embedded into the same spark job, outputting multiple results if desired. The lines above that are commented out could easily be used to set the correct key depending on our specific job requirements. Spark Implementation of MaxTemperatureMapper using RDDs Spark will also iterate up to 10x faster than MapReduce for comparable tasks as Spark operates — so it never has to write/read from disk, a generally slow and expensive operation. entirely in memory Apache Spark is a wonderfully powerful tool for data analysis and transformation. If this post an interest stay tuned: over the next several posts we’ll take an incremental deep-dive into the ins and outs of the Apache Spark Framework. sparked

Cosmos

Why We Need Apache Spark

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

High Level Overview of Apache Spark

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

10 Ways to Optimize Your Database

10 Essential Computer Skills for Data Mining

10 Most Evolving Big Data Technologies to Catch Up on in 2022

Top 10 JavaScript Charting Libraries for Every Data Visualization Need

High Level Overview of Apache Spark

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

10 Ways to Optimize Your Database

10 Essential Computer Skills for Data Mining

10 Most Evolving Big Data Technologies to Catch Up on in 2022

Top 10 JavaScript Charting Libraries for Every Data Visualization Need

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps