Apache Flink is an open-source streaming platform that lets you run real-time data-processing pipelines in a fault-tolerant way, at a scale of millions of events per second.
The key point is that it does all this using the minimum possible resources, at single-digit millisecond latencies.
So how does it manage that, and what makes it better than other solutions in the same domain?
Flink is based on the Dataflow model, i.e., it processes elements as and when they arrive rather than processing them in micro-batches (as Spark Streaming does).
Micro-batches can contain a huge number of elements, and the resources needed to process all of those elements at once can be substantial. For a sparse data stream (one that only produces bursts of data at irregular intervals), this becomes a major pain point.
You also avoid the trial and error of tuning the micro-batch size so that a batch’s processing time doesn’t exceed its accumulation time. If it does, batches start to queue up, and eventually all processing comes to a halt.
The Dataflow model allows Flink to process millions of records per minute at millisecond latencies on a single machine (Flink’s managed memory and custom serialisation also play a part, but more on that in the next article). Here are some benchmarks.
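To make the model concrete, here is a minimal sketch of an event-at-a-time job; the Kafka topic, broker address, and consumer group id are made-up placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class EventAtATimeJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        env.addSource(new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props))
           // each record flows through the pipeline as soon as it arrives;
           // no micro-batch is accumulated anywhere
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        env.execute("event-at-a-time job");
    }
}
```

Every operator here handles one element at a time, which is exactly what keeps latencies in the millisecond range even when the input is sparse or bursty.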
Flink provides seamless connectivity to a variety of data sources and sinks, such as Kafka, Cassandra, Elasticsearch, and HDFS.
Flink provides robust fault tolerance using checkpointing (periodically saving its internal state to external storage such as HDFS).
Moreover, Flink’s checkpointing mechanism can be made incremental (saving only the changes, not the whole state), which greatly reduces the amount of data written to HDFS and the I/O duration. The checkpointing overhead is almost negligible, which lets users keep very large states inside Flink applications.
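As a sketch, enabling incremental checkpointing with the RocksDB state backend looks like this; the checkpoint interval and HDFS path are placeholders, and the flink-statebackend-rocksdb dependency is assumed to be on the classpath:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // take a checkpoint every 10 seconds
        env.enableCheckpointing(10_000);

        // the second argument enables incremental checkpoints: only the changes
        // since the last checkpoint are written to HDFS, not the whole state
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // ... sources, transformations, and sinks go here ...
    }
}
```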
Flink also provides a high-availability setup through ZooKeeper. This re-spawns the job in cases where the driver (known as the JobManager in Flink) crashes due to some error.
Unlike Apache Storm (which also follows a dataflow model), Flink provides an extremely simple high-level API in the form of Map, Reduce, Filter, Window, GroupBy, Sort, and Join operators.
This gives developers a lot of flexibility and speeds up development when writing new jobs.
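For instance, a windowed word count is just a handful of lines with these operators; the input stream of (word, 1) tuples is assumed to be built elsewhere:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedWordCount {
    // 'words' is assumed to be an existing stream of (word, 1) tuples
    static DataStream<Tuple2<String, Integer>> countPerWindow(
            DataStream<Tuple2<String, Integer>> words) {
        return words
                .keyBy(0)                      // group by the word (tuple field 0)
                .timeWindow(Time.seconds(30))  // tumbling 30-second windows
                .sum(1);                       // sum the counts (tuple field 1)
    }
}
```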
Sometimes an operation requires some configuration or data from another source to do its work. A simple example would be counting the number of records of type Y in a stream X; this counter is known as the state of the operation.
Flink provides a simple API to interact with state, much like you would interact with a plain Java object. State can be backed by memory, the filesystem, or RocksDB, and it is checkpointed and therefore fault-tolerant. Returning to the example above: if your application restarts, your counter’s value is still preserved.
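Here is a sketch of that counter written against Flink’s keyed state API; the string comparison is a stand-in for whatever “type Y” means in your stream:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class TypeYCounter extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // naming the state lets Flink checkpoint and restore it
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("y-count", Long.class));
    }

    @Override
    public void flatMap(String record, Collector<Long> out) throws Exception {
        if ("Y".equals(record)) {                       // stand-in for "type Y"
            Long current = count.value();               // null before the first update
            long updated = (current == null ? 0L : current) + 1;
            count.update(updated);                      // survives restarts via checkpoints
            out.collect(updated);
        }
    }
}
```

Note that keyed state like this is only available after a keyBy(), so the function would be applied as something like stream.keyBy(...).flatMap(new TypeYCounter()).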
Apache Flink provides exactly-once processing, as Kafka 0.11 and above does, with minimal overhead and zero dev effort. This is not trivial to achieve in other streaming solutions such as Spark Streaming and Storm, and it is not supported in Apache Samza.
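As a hedged sketch, end-to-end exactly-once delivery into Kafka 0.11 comes down to picking the right semantic on the producer; the topic name, broker address, and input stream below are placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceSink {
    // 'results' is assumed to be an existing stream of output records
    static void writeExactlyOnce(DataStream<String> results) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        results.addSink(new FlinkKafkaProducer011<>(
                "output-topic",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                // EXACTLY_ONCE uses Kafka transactions tied to Flink's checkpoints
                FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
    }
}
```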
Exactly-once Semantics is Possible: Here’s How Apache Kafka Does it (www.confluent.io)
Like Spark Streaming, Flink provides a SQL interface, which makes writing a job easier for people with a non-programming background. Flink SQL is maturing day by day and is already used by companies such as Uber and Alibaba for analytics on real-time data.
Flink SQL & TableAPI in Large Scale Production at Alibaba
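A minimal sketch of what the SQL API looks like; the stream, table name, and field names are all hypothetical, and the exact method names vary slightly between Flink versions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class ClickAnalytics {
    // 'clickStream' is assumed to be an existing stream of (userId, url) pairs
    static Table countClicksPerUser(StreamExecutionEnvironment env,
                                    DataStream<Tuple2<String, String>> clickStream) {
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // expose the stream to SQL under the name 'clicks'
        tableEnv.registerDataStream("clicks", clickStream, "userId, url");

        // plain SQL over an unbounded stream
        return tableEnv.sqlQuery(
                "SELECT userId, COUNT(url) AS cnt FROM clicks GROUP BY userId");
    }
}
```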
A Flink job can run on a distributed cluster or on a local machine. The program can run on Mesos, YARN, and Kubernetes, as well as in standalone mode (e.g., in Docker containers). Since Flink 1.4, Hadoop is no longer a prerequisite, which opens up a number of possibilities for where a Flink job can run.
Flink has a great dev community, which allows for frequent new features and bug fixes, as well as great tools that further reduce developer effort.
Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba and contributed back to Flink.
Note: Spark Streaming 2.3 has started offering support for continuous processing rather than micro-batching. Check it out here. I’ll run some benchmarks using yahoo-streaming-benchmarks and post the results in the next article.
Connect with me on LinkedIn or Facebook, or drop a mail to [email protected] to share feedback.