In my last post we introduced a problem: copious, never ending streams of data, and it’s solution: Apache Spark. Here in Part II we’ll focus on Spark’s internal architecture and data structures, and in Part III we’ll focus more on Spark’s available APIs and Functions in Java.
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers — Grace Hopper
With the scale of data growing at a rapid and ominous pace, we needed a way to process potential petabytes of data quickly, and we simply couldn’t make a single computer process that amount of data at a reasonable pace. This problem is solved by creating a cluster of machines to perform the work for you, but how do those machines work together to solve the common problem?
Photo by Jez Timms on Unsplash
Spark is the cluster computing framework for large-scale data processing. Spark offers a set of libraries in 3 languages (Java, Scala, Python) for its unified computing engine. What does this definition actually mean?
Unified: With Spark, there is no need to piece together an application out of multiple APIs or systems. Spark provides you with enough built-in APIs to get the job done
Computing Engine: Spark handles loading data from various file systems and runs computations on it, but does not store any data itself permanently. Spark operates entirely in memory — allowing unparalleled performance and speed
Libraries: Spark is comprised of a series of libraries built for data science tasks. Spark includes libraries for SQL (SparkSQL), Machine Learning (MLlib), Stream Processing (Spark Streaming and Structured Streaming), and Graph Analytics (GraphX)
Every Spark Application consists of a Driver and a set of distributed worker processes (Executors).
The Driver runs the main() method of our application and is where the SparkContext is created. The Spark Driver has the following duties:
An executor is a distributed process responsible for the execution of tasks. Each Spark Application has its own set of executors, which stay alive for the life cycle of a single Spark application.
When you submit a job to Spark for processing, there is a lot that goes on behind the scenes.
Lets take a deeper look at the Spark Job we wrote in Part I to find max temperature by country. This abstraction hid a lot of set-up code, including the initialization of our SparkContext, lets fill in the gaps:
MaxTemperature Spark Setup
Remember that Spark is a framework, in this case implemented in Java. It isn’t until line 16 that Spark needs to do any work at all. Sure, we initialized our SparkContext, however loading data into an RDD is the first bit of code that requires work be sent to our executors.
By now you may have seen the term “RDD” appear multiple times, it’s about time we define it.
Spark has a well-defined layered architecture with loosely coupled components based on two primary abstractions:
RDDs are essentially the building blocks of Spark: everything is comprised of them. Even Sparks higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood. What does it mean to be a Resilient Distributed Dataset?
All data we work with in Spark will be stored inside some form of RDD — it is therefore imperative to fully understand them.
Spark offers a slew of “Higher Level” APIs built on top of RDDs designed to abstract away complexity, namely the DataFrame and Dataset. With a strong focus on Read-Evaluate-Print-Loops (REPLs), Spark-Submit and the Spark-Shell in Scala and Python are targeted toward Data Scientists, who often desire repeat analysis on a dataset. The RDD is still imperative to understand, as it’s the underlying structure of all data in Spark.
An RDD is colloquially equivalent to: “Distributed Data Structure”. A JavaRDD<String> is essentially just a List<String> dispersed amongst each node in our cluster, with each node getting several different chunks of our List. With Spark, we need to think in a distributed context, always.
RDDs work by splitting up their data into a series of partitions to be stored on each executor node. Each node will then perform its work only on its own partitions. This is what makes Spark so powerful: If an executor dies or a task fails Spark can rebuild just the partitions it needs from the original source and re-submit the task for completion.
Spark RDD partitioned amongst executors
RDDs are Immutable, meaning that once they are created, they cannot be altered in any way, they can only be transformed. The notion of transforming RDDs is at the core of Spark, and Spark Jobs can be thought of as nothing more than any combination of these steps:
In fact, every Spark job I’ve written is comprised of exclusively those types of tasks, with vanilla Java for flavour.
Spark defines a set of APIs for working with RDDs that can be broken down into two large groups: Transformations and Actions.
Transformations create a new RDD from an existing one.
Actions return a value, or values, to the Driver program after running a computation on its RDD.
For example, the map function weatherData.map() is a transformation that passes each element of an RDD through a function.
Reduce is an RDD action that aggregates all the elements of an RDD using some function and returns the final result to the driver program.
“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. — Bill Gates”
All transformations in Spark are lazy. This means that when we tell Spark to create an RDD via transformations of an existing RDD, it won’t generate that dataset until a specific action is performed on it or one of it's children. Spark will then perform the transformation and the action that triggered it. This allows Spark to run much more efficiently.
Let’s re-examine the function declarations from our earlier Spark example to identify which functions are actions and which are transformations:
16: JavaRDD<String> weatherData = sc.textFile(inputPath);
Line 16 is neither an action or a transformation; it’s a function of sc, our JavaSparkContext.
17: JavaPairRDD<String, Integer> tempsByCountry = weatherData.mapToPair(new Func.....
Line 17 is a transformation of the weatherData RDD, in it we map each line of weatherData to a pair comprised of (City, Temperature)
26: JavaPairRDD<String, Integer> maxTempByCountry = tempsByCountry.reduce(new Func....
Line 26 is also a transformation because we are iterating over key-value pairs. its a transformation of tempsByCountry in which we reduce each city to its highest recorded temperature.
31: maxTempByCountry.saveAsHadoopFile(destPath, String.class, Integer.class, TextOutputFormat.class);
Finally on line 31 we trigger a Spark action: saving our RDD to our file system. Since Spark subscribes to the lazy execution model, it isn’t until this line that Spark generates weatherData, tempsByCountry, and maxTempsByCountry before finally saving our result.
Whenever an action is performed on an RDD, Spark creates a DAG, a finite direct graph with no directed cycles (otherwise our job would run forever). Remember that a graph is nothing more than a series of connected vertices and edges, and this graph is no different. Each vertex in the DAG is a Spark function, some operation performed on an RDD (map, mapToPair, reduceByKey, etc).
In MapReduce, the DAG consists of two vertices: Map → Reduce.
In our above example of MaxTemperatureByCountry, the DAG is a little more involved:
parallelize → map → mapToPair → reduce → saveAsHadoopFile
The DAG allows Spark to optimize its execution plan and minimize shuffling. We’ll discuss the DAG in greater depth in later posts, as it’s outside the scope of this Spark overview.
With our new vocabulary, let us re-examine the problem with MapReduce as I defined in Part I, quoted below:
MapReduce excels at batch data processing, however it lags behind when it comes to repeat analysis and small feedback loops. The only way to reuse data between computations is to write it to an external storage system (a la HDFS)”
‘Re-use data between computations’? Sounds like an RDD that can have multiple actions performed on it! Lets suppose we have a file “data.txt” and want to accomplish two computations:
In MapReduce, each task would require a separate job or a fancy MulitpleOutputFormat implementation. Spark makes this a breeze in just four simple steps:
Load contents of data.txt into an RDD
JavaRDD<String> lines = sc.textFile("data.txt");
2. Map each line of ‘lines’ to its length (Lambda functions used for brevity)
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
3. To solve for total length: reduce lineLengths to find the total line length sum, in this case the sum of every element in the RDD
int totalLength = lineLengths.reduce((a, b) -> a + b);
4. To solve for longest length: reduce lineLengths to find the maximum line length
int maxLength = lineLengths.reduce((a, b) -> Math.max(a,b));
Note that steps 3 and 4 are RDD actions, so they return a result to our Driver program, in this case a Java int. Also recall that Spark is lazy and refuses to do any work until it sees an action, in this case it will not begin any real work until step 3.
So far we’ve introduced our data problem and its solution: Apache Spark. We reviewed Spark’s architecture and workflow, it’s flagship internal abstraction (RDD), and its execution model. Next we’ll look into Functions and Syntax in Java, getting progressively more technical as we dive deeper into the framework.