What is Spark? Let’s take a look under the hood In we introduced a problem: copious, never ending streams of data, and it’s solution: Apache Spark. Here in Part II we’ll focus on Spark’s internal architecture and data structures, and in Part III we’ll focus more on Spark’s available APIs and Functions in Java. my last post In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers — Grace Hopper With the scale of data growing at a rapid and ominous pace, we needed a way to process potential petabytes of data quickly, and we simply couldn’t make a single computer process that amount of data at a reasonable pace. This problem is solved by creating a cluster of machines to perform the work for you, but how do those machines work together to solve the common problem? Meet Spark Photo by on Jez Timms Unsplash Spark is the cluster computing framework for large-scale data processing. Spark offers a set of libraries in 3 languages (Java, Scala, Python) for its unified computing engine. What does this definition actually mean? Unified: With Spark, there is no need to piece together an application out of multiple APIs or systems. Spark provides you with enough built-in APIs to get the job done Computing Engine: Spark handles loading data from various file systems and runs computations on it, but does not store any data itself permanently. Spark operates entirely in memory — allowing unparalleled performance and speed Libraries: Spark is comprised of a series of libraries built for data science tasks. Spark includes libraries for SQL (SparkSQL), Machine Learning (MLlib), Stream Processing (Spark Streaming and Structured Streaming), and Graph Analytics (GraphX) The Spark Application Every Spark Application consists of a and a set of distributed worker processes ( ). Driver Executors Spark Driver The Driver runs the main() method of our application and is where the SparkContext is created. The Spark Driver has the following duties: Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager Responds to user’s program or input Analyzes, schedules, and distributes work across the exectors Stores metadata about the running application and conveniently exposes it in a webUI Spark Executors An executor is a distributed process responsible for the execution of tasks. Each Spark Application has its own set of executors, which stay alive for the life cycle of a single Spark application. Executors perform all data processing of a Spark job Stores results in memory, only persisting to disk when specifically instructed by the driver program Returns results to the driver once they have been completed Each node can have anywhere from 1 executor per node to 1 executor per core Spark’s Application Workflow When you submit a job to Spark for processing, there is a lot that goes on behind the scenes. Our Standalone Application is kicked off, and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver Our Driver program asks the Cluster Manager for resources to launch its executors The Cluster Manager launches the executors Our Driver runs our actual Spark code Executors run tasks and send their results back to the driver SparkContext is stopped and all executors are shut down, returning resources back to the cluster MaxTemperature, Revisited Lets take a deeper look at the Spark Job we wrote in to find max temperature by country. This abstraction hid a lot of set-up code, including the initialization of our SparkContext, lets fill in the gaps: Part I MaxTemperature Spark Setup Remember that Spark is a framework, in this case implemented in Java. It isn’t until line 16 that Spark needs to do any work at all. Sure, we initialized our SparkContext, however loading data into an RDD is the first bit of code that requires work be sent to our executors. By now you may have seen the term “RDD” appear multiple times, it’s about time we define it. Spark Architecture Overview Spark has a well-defined layered architecture with loosely coupled components based on two primary abstractions: Resilient Distributed Datasets (RDDs) Directed Acyclic Graph (DAG) Resilient Distributed Datasets RDDs are essentially the building blocks of Spark: everything is comprised of them. Even Sparks higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood. What does it mean to be a Resilient Distributed Dataset? Resilient: Since Spark runs on a cluster of machines, data-loss from hardware failure is a very real concern, so RDDs are fault tolerant and can rebuild themselves in the event of failure Distributed: A single RDD is stored on a series of different nodes in the cluster, belonging to no single source (and no single point of failure). This way our cluster can operate on our RDD in parallel Dataset: A collection of values — this one you should probably have known already All data we work with in Spark will be stored inside some form of RDD — it is therefore imperative to fully understand them. Spark offers a slew of “Higher Level” APIs built on top of RDDs designed to abstract away complexity, namely the DataFrame and Dataset. With a strong focus on Read-Evaluate-Print-Loops (REPLs), Spark-Submit and the Spark-Shell in Scala and Python are targeted toward Data Scientists, who often desire repeat analysis on a dataset. The RDD is still imperative to understand, as it’s the underlying structure of all data in Spark. An RDD is colloquially equivalent to: “Distributed Data Structure”. A JavaRDD<String> is essentially just a List<String> dispersed amongst each node in our cluster, with each node getting several different chunks of our List. With Spark, we need to think in a distributed context, always. RDDs work by splitting up their data into a series of partitions to be stored on each executor node. Each node will then perform its work only on its own partitions. This is what makes Spark so powerful: If an executor dies or a task fails Spark can rebuild just the partitions it needs from the original source and re-submit the task for completion. Spark RDD partitioned amongst executors RDD Operations RDDs are Immutable, meaning that once they are created, they cannot be altered in any way, they can only be transformed. The notion of transforming RDDs is at the core of Spark, and Spark Jobs can be thought of as nothing more than any combination of these steps: Loading data into an RDD Transforming an RDD Performing an Action on an RDD In fact, every Spark job I’ve written is comprised of exclusively those types of tasks, with vanilla Java for flavour. Spark defines a set of APIs for working with RDDs that can be broken down into two large groups: and . Transformations Actions Transformations create a new RDD from an existing one. Actions return a value, or values, to the Driver program after running a computation on its RDD. For example, the function weatherData.map() is a transformation that passes each element of an RDD through a function. map is an RDD action that aggregates all the elements of an RDD using some function and returns the final result to the driver program. Reduce Lazy Evaluation “I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. — Bill Gates” All transformations in Spark are lazy. This means that when we tell Spark to create an RDD via transformations of an existing RDD, it won’t generate that dataset until a specific action is performed on it or one of it's children. Spark will then perform the transformation and the action that triggered it. This allows Spark to run much more efficiently. Let’s re-examine the function declarations from our earlier Spark example to identify which functions are actions and which are transformations: 16: JavaRDD<String> weatherData = sc.textFile(inputPath); Line 16 is neither an action or a transformation; it’s a function of , our . sc JavaSparkContext 17: JavaPairRDD<String, Integer> tempsByCountry = weatherData.mapToPair(new Func..... Line 17 is a of the RDD, in it we map each line of to a pair comprised of (City, Temperature) transformation weatherData weatherData 26: JavaPairRDD<String, Integer> maxTempByCountry = tempsByCountry.reduce(new Func.... Line 26 is also a because we are iterating over key-value pairs. its a transformation of in which we reduce each city to its highest recorded temperature. transformation tempsByCountry 31: maxTempByCountry.saveAsHadoopFile(destPath, String.class, Integer.class, TextOutputFormat.class); Finally on line 31 we trigger a Spark : saving our RDD to our file system. Since Spark subscribes to the lazy execution model, it isn’t until this line that Spark generates , , and before finally saving our result. action weatherData tempsByCountry maxTempsByCountry Directed Acyclic Graph Whenever an action is performed on an RDD, Spark creates a DAG, a finite direct graph with no directed cycles (otherwise our job would run forever). Remember that a graph is nothing more than a series of connected vertices and edges, and this graph is no different. Each vertex in the DAG is a Spark function, some operation performed on an RDD (map, mapToPair, reduceByKey, etc). In MapReduce, the DAG consists of two vertices: Map → Reduce. In our above example of MaxTemperatureByCountry, the DAG is a little more involved: parallelize → map → mapToPair → reduce → saveAsHadoopFile The DAG allows Spark to optimize its execution plan and minimize shuffling. We’ll discuss the DAG in greater depth in later posts, as it’s outside the scope of this Spark overview. Evaluation Loops With our new vocabulary, let us re-examine the problem with MapReduce as I defined in , quoted below: Part I MapReduce excels at batch data processing, however it lags behind when it comes to repeat analysis and small feedback loops. The only way to reuse data between computations is to write it to an external storage system (a la HDFS)” ‘Re-use data between computations’? Sounds like an RDD that can have multiple actions performed on it! Lets suppose we have a file “data.txt” and want to accomplish two computations: Total length of all lines in the file Length of the longest line in the file In MapReduce, each task would require a separate job or a fancy MulitpleOutputFormat implementation. Spark makes this a breeze in just four simple steps: Load contents of into an RDD data.txt JavaRDD<String> lines = sc.textFile("data.txt"); 2. Map each line of ‘ ’ to its length lines (Lambda functions used for brevity) JavaRDD<Integer> lineLengths = lines.map(s -> s.length()); 3. To solve for total length: reduce to find the total line length sum, in this case the sum of every element in the RDD lineLengths int totalLength = lineLengths.reduce((a, b) -> a + b); 4. To solve for longest length: reduce to find the maximum line length lineLengths int maxLength = lineLengths.reduce((a, b) -> Math.max(a,b)); Note that steps 3 and 4 are RDD actions, so they return a result to our Driver program, in this case a Java int. Also recall that Spark is lazy and refuses to do any work until it sees an action, in this case it will not begin any real work until step 3. Next Steps So far we’ve introduced our data problem and its solution: Apache Spark. We reviewed Spark’s architecture and workflow, it’s flagship internal abstraction (RDD), and its execution model. Next we’ll look into Functions and Syntax in Java, getting progressively more technical as we dive deeper into the framework.

High Level Overview of Apache Spark

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Untitled Story

How Your Job May Be Crippling Your Tech Skills

3 Best Hadoop Alternatives to Consider for Migration

The Noonification: Introduction to Python Debugging with Pdb (8/25/2022)

8 Lessons For Building Data Companies On Solid Ground

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

How Your Job May Be Crippling Your Tech Skills

3 Best Hadoop Alternatives to Consider for Migration

The Noonification: Introduction to Python Debugging with Pdb (8/25/2022)

8 Lessons For Building Data Companies On Solid Ground

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution

Light-Mode

Classic

Newspaper

Dark-Mode

Neon Noir

Minty

HN StartUps