Meet the powerhouse data processing engines dueling to define the next era in big data
This article is part of Alibaba’s Flink series.
When it comes to big data, there’s no avoiding the importance of stream computing and the powerful analytics it enables in real time. It also goes that when it comes to stream computing, there’s no avoiding the field’s two most powerful data processing engines: Spark and Flink.
Fast-rising in popularity since 2014, Apache Spark exceeds predecessor Hadoop MapReduce’s performance by triple-digit factors in some cases, offering a unified engine with support for all common data processing scenarios like batch processing, stream processing, interactive query, and machine learning. With its high performance and comprehensive scenario support, it has continued to hold favor with its early adopters in big data development.
Shortly after Spark’s appearance, Apache Flink started to enter the public view as an outside challenger before becoming widely known around 2016. Where early Spark users faced usability issues in scenarios like real-time stream processing, Flink offered a superior stream processing engine with support for a wide range of scenarios, as well as other advantages.
Throughout their brief rivalry, Spark has continued to optimize its real-time streaming capabilities, with version 2.3 (released in February) introducing a continuous processing model that reduces stream processing latency to milliseconds. Flink likewise presents a formidable innovator and it remains to be seen which of the two frameworks will ultimately prevail to define the next generation of big data computing.
With a comprehensive analysis of their respective technologies and uses, this article should help to shed light on the question.
Origins of Big Data Computing Engines
Hadoop and other MapReduce-based data processing systems first emerged to address data processing demands beyond the capacities of traditional databases. Following the wave of development since Google’s release of the MapReduce white paper in 2004, processing big data with Hadoop’s open source ecosystem or similar systems has become a basic requirement of the industry.
Despite recent work to ease entry barriers, organizations inevitably encounter a series of problems when developing their own data processing systems, often discovering that the investment needed to derive value from their data vastly exceeds expectation.
The following sections detail the most prevalent of these issues, which should help to explain the basis on which Spark and Flink continue to compete for industry preference.
A very steep learning curve
Newcomers to big data are often stunned by the amount of technology they need to grasp. Where traditional databases developed over the past several decades generally come built for comprehensive data processing, big data ecosystems like Hadoop require several different subsystems, each with its own specialization and advantage before the various demands scenarios present.
The above picture describes a typical lambda architecture. Showing just two scenarios (batch processing and stream processing), it already involves at least four or five technologies, not counting alternatives that often need to be considered. Adding real-time query, interactive analysis, machine learning, and other scenarios, every situation involves choices among several technologies covering overlapping areas in different ways. As such, a business often needs to use numerous technologies to support complete data processing. Coupled with research and selection, the amount of information that investors need to digest is tremendous.
For a sense of the technologies available, consider the following overview of the big data industry.
Inefficient development and operation
Because of the variety of systems involved, where each system has its own development tools and language, development efficiency in big data is by default quite limited. Because data needs to be transferred between multiple systems, further development and operational costs inevitably appear. Data consistency, meanwhile, remains difficult to guarantee.
In many organizations, more than half of development efforts are spent on the transfer of data between systems.
Complex operation, data quality, and other problems
Multiple systems, each requiring its own operation and maintenance, introduce higher operating costs and increase the possibility of system errors. Further, it is difficult to guarantee the quality of data, and when issues do appear it is difficult to track and resolve them.
Last but not least, there are people problems. In many cases, the complexity of the system means that support and use for each subsystem must be implemented in different departments, which are not always aligned in goals and priorities.
Toward a solution
In light of these issues, it is easy to understand Spark’s popularity. At the time of its rise in 2014, Spark introduced not only enhancements that raised Hadoop MapReduce’s performance but also a common engine to support a full range of data processing scenarios. Seeing a Spark demo with all of the aforementioned scenarios working together in a single notebook, it’s a relatively easy decision for many developers to make the shift to Spark. It should come as no surprise, then, that Spark has since emerged as a complete replacement for the MapReduce engine in Hadoop.
At the same time, Flink was brought forth to offer greater ease of use in a range of scenarios, especially in the real-time processing of data streams.
With the field of competition thus established, the following sections will compare the two rival frameworks at the technical level.
Processing Engines in Spark and Flink
This section focuses on architectural features of the Spark and Flink engines, with focus on the potential and the limitations of their architectures. As well as their data and processing models, the two are distinct in their emphasis among data processing scenarios, approaches to stateful processing, and programming models.
Data model and processing model
To understand the engine characteristics in Spark and Flink, it is essential to first examine their respective data models.
Spark uses a Resilient Distributed Datasets (RDD) data model. RDD is more abstract than MapReduce’s file model and relies on lineage to ensure recoverability. Often RDD can be implemented as distributed shared memory or fully virtualized. This is to say that some intermediate result RDD can be optimized and omitted when downstream processing is completely local. This saves a lot of unnecessary input and output and is the main basis of Spark’s early performance advantages.
Spark also uses transformations (operators) on RDD to describe data processing. Each operator (such as map, filter, join) generates a new RDD. Together, all operators form a directed acyclic graph (DAG). Spark simply divides edges into wide dependencies and narrow dependencies. When upstream and downstream data does not require shuffle, the edge is a narrow dependency. In this case, the upstream and downstream operators can be processed locally in the same stage, and materialization of the upstream result RDD can be omitted. The figure below shows the basic concepts involved.
By contrast, Flink’s basic data model is comprised of data streams — i.e., sequences of events. Data streams, as a basic model of data, may not be as intuitive and familiar as tables or data blocks, but can nevertheless provide a completely equivalent set of features. A stream can be an infinite stream that is boundless, which is the common perception. It can also be a finite stream with boundaries, and processing these streams is equivalent to batch processing.
To describe data processing, Flink uses operators on data streams, with each operator generating a new data stream. In terms of operators, DAGs, and chaining of upstream and downstream operators, the overall model is roughly equivalent to Spark’s. Flink’s vertices are roughly equivalent to stages in Spark, and dividing operators into vertices will be basically the same as the dividing stages in Spark DAG in the above figure.
Spark and Flink have one significant difference in DAG execution. In Flink’s stream execution mode, the output of an event after processing on one node can be sent to the next node for immediate processing. This way the execution engine does not introduce any extra delay. Correspondingly, all nodes need to be running at the same time. On the contrary Spark’s micro batch execution is no different from its normal batch execution in that only after the upstream stage completes processing of a micro batch, the downstream stage starts processing its output.
In Flink’s stream execution mode, multiple events can be transferred or calculated together for improved efficiency. However, this is purely an optimization at the execution engine’s discretion. It can be determined independently for each operator, and it is not bound to any boundary of the dataset such as RDD as it is in the batch processing model. It can leave flexibility for optimization and at the same time meet low latency requirements.
Flink uses an asynchronous checkpoint mechanism to achieve the recoverability of the task state to ensure processing consistency. Thus, I/O delay can be eliminated in the entire main processing path between the data source and output to achieve higher performance and lower latency.
Data processing scenarios
In addition to batch processing, Spark also supports real-time data stream processing, interactive query, machine learning, and graph computing, among other scenarios.
The main difference between real-time data stream processing and batch processing is the low latency requirement. Because Spark RDD is memory-based, it can be easily cut into smaller blocks for processing. Handling these small blocks quickly enough allows for the achievement of low latency.
Spark can also support interactive queries if all the data is in memory and processed fast enough.
Spark’s machine learning and graph calculations can be seen as different categories of RDD operators. Spark provides libraries to support common operations, and users or third-party libraries can also extend and provide more operations. It is worth mentioning that Spark’s RDD model is very compatible with machine learning model training’s iterative calculation. From its outset, it has brought remarkable performance improvements in some scenarios.
In light of these features, Spark is essentially a faster memory-based batch processor than Hadoop MapReduce and uses fast enough batch processing to implement various scenarios.
In Flink, if the input data stream is bounded, the effect of batch processing results naturally. The difference between stream processing and batch processing lies only in input types and is independent of the underlying implementation and optimization, thus the logic that the user needs to implement is exactly the same, yielding a cleaner kind of abstraction.
Flink also provides libraries to support scenarios such as machine learning and graph computing. In this respect, it is not so different from Spark.
Notably, Flink’s low-level API can be used to implement some data-driven distributed services using Flink clusters alone. Some companies use Flink clusters to implement social networking, web crawling, and other services. These uses reflect the versatility of Flink as a generic computing engine and benefit from Flink’s built-in state support.
In general, both Spark and Flink aim to support most data processing scenarios in a single execution engine, and both should be able to achieve it. The main difference is that the respective architecture of each can prove limiting in certain scenarios. One notable place where this is the case is the micro-batch execution mode of Spark Streaming. The Spark community should already be aware of this and has recently begun to work on continuous processing. We’ll come back to this later.
Another very unique aspect of Flink is the introduction of managed state in the engine. To understand the managed state, we have to first start with stateful processing. If the result of processing an event (or a piece of data) is only related to the content of the event itself, it is called stateless processing; otherwise, the result is related to the previously processed event, and it is called stateful processing. Any non-trivial data processing, such as basic aggregation, is often stateful processing. Flink has long believed that without good state support there will be no efficient streaming, for which reason managed state and state API were introduced early on.
Generally, stateful processing is considered in the context of streaming, but looking closely it also affects batch processing. Taking, for example, the common case of windowed aggregation, if the batch data period is larger than the window, the intermediate state can be ignored, and user logic will tend to ignore this problem. When the batch period becomes smaller than the window, however, the results of a batch actually depend on the batch that has been processed previously. Because batch engines generally do not see this demand, they usually don’t provide built-in state support, requiring the user to manually maintain the states. In the case of window aggregation, for example, the user will need an intermediate result table to store the results of an incomplete window. Thus, when the user shortens the batch period, the processing logic becomes more complicated. This was a common problem for early Spark Streaming users before Structured Streaming was released.
On the other hand, Flink as a streaming engine had to face this problem from the beginning and introduced managed state as a general solution. In addition to making users’ jobs easier, a built-in solution can also achieve better performance compared with user-implemented solutions. Most importantly, it can provide a better guarantee of consistency.
Simply put, there are some problems inherent to the data processing logic that can be ignored or simplified in batch processing without hindering results, while they will be exposed and solved in stream processing. Therefore, implementing batch processing in the streaming engine as finite streams can naturally produce correct results, and the main effort is in specialized implementation in some areas for the sake of optimization. On the contrary, simulating streaming with smaller batches means new kinds of problems will be exposed. When the computing engine does not have a general solution for a problem, it requires users to solve it themselves. In addition to state, problems include dimensional table changes (such as updating user information), batch data boundaries, late arriving data, and so on.
One of Spark’s original intentions was to provide a unified programming model capable of solving the various needs of different users — a focus to which it has devoted a great deal of effort. Spark’s original RDD-based API was already capable of all kinds of data processing. Later, in order to simplify users’ development, higher level DataFrame (adding columns to structured data in RDD) and Dataset (adding types to DataFrame columns) were introduced and consolidated in Spark 2.0 (DataFrame = DataSet[Row]). Spark SQL support was also introduced relatively early. With the continuous improvement of scenario-specific APIs, such as Structured Streaming and integration with machine learning and deep learning, Spark’s APIs have become very easy to use, and today form one of the framework’s strongest aspects.
Flink’s API has followed a similar set of goals and development path. The core APIs of Flink and Spark can be considered rough counterparts. Today, Spark’s API is more complete in general, as per the integration of machine learning and deep learning over the past two years. Flink still leads in streaming-related aspects, such as its support for watermark, window, and trigger.
Both Spark and Flink are general-purpose computing engines that support very large-scale data processing and various types of processing. Each presents a great deal to explore which has not been touched on here, such as SQL optimization and machine learning integration. The main goal of this comparison is to review the two systems in terms of their basic architecture and design features. The rationale is that it’s more practical to learn collaboratively to catch up on upper-level functionalities, whereas changes in fundamental design will tend to be more costly and prohibitive.
The biggest difference between Spark and Flink’s different execution models lies in their support for stream processing. The initial Spark Streaming’s approach to stream processing was too simplistic and turned out to be problematic in more complex processing. Structured Streaming, introduced in Spark 2.0, cleaned up streaming semantics and added support for event-time processing and end-to-end consistency. Although there are still many limitations in terms of functionality, it makes considerable progress over past iterations. Problems owing to the micro-batch execution method still exist, especially performance on a very large scale. Recently, Spark has been spurred by application demands to develop a Continuous Processing mode. The experimental release in 2.3 only supports simple map-like operations.
Following updates at the recent Spark+AI Summit conference, Continuous Processing appears positioned to evolve into an execution engine with close similarities to Flink’s stream processing model. However, as shown in the above image, the main functions continue to undergo development. It remains to be seen how well these will perform and how they will be integrated with Spark’s original batch execution engine in the future.
(Original article by Wang Haitao王海涛)
This article is part of Alibaba’s Flink series.