Building data pipelines is a common task for almost every company nowadays. However, the ways these pipelines are built differ quite a bit from one place to another.
We, as a company, are no different, so we build these data paths too; the tooling, though, always varies from one place to the next.
What I am about to describe can be achieved with any of the available big data tools, such as Apache Spark, of which I am also a huge fan. However, our team has invested heavily in Akka, and we have built several internal systems with it, so Akka is how we are going to tackle the task at hand.
Simplifying the problem, we will say that our task consists of calculating a value that represents each unique line of a given file. For example, given a file A with the following content:
This is just a
For this example
The result of processing this file with a function f(s: String): Int is:
Please, don’t worry too much about the function f; let’s focus on the overall view of the process.
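For illustration only, f could be any deterministic, hash-like function that yields a (practically) unique value per line. The definition below is a hypothetical stand-in, not the one we actually use:

```scala
// Hypothetical stand-in for f: a simple polynomial hash of the line.
// Any deterministic function that is (practically) unique per line works.
def f(s: String): Int =
  s.foldLeft(17)((acc, c) => acc * 31 + c.toInt)
```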
Using Akka Streams we could create a simple pipeline to process these files. Let’s see how.
First, we define the building blocks of our stream. Then, we have to put them together.
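A minimal sketch of what those building blocks could look like (the file name, the weight function, and the printing sink are illustrative assumptions, not the original code):

```scala
import java.nio.file.Paths

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Flow, Framing, Sink, Source}
import akka.util.ByteString

object StreamPipeline extends App {
  implicit val system: ActorSystem = ActorSystem("pipeline")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Stand-in for the weight function f.
  def weight(line: String): Int = line.hashCode

  // Building block 1: a source of lines read from the file.
  val lines: Source[String, _] =
    FileIO.fromPath(Paths.get("fileA.txt"))
      .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 4096, allowTruncation = true))
      .map(_.utf8String)

  // Building block 2: a flow that turns each line into its weight.
  val weights: Flow[String, Int, _] = Flow[String].map(weight)

  // Building block 3: a sink that writes each value down.
  val writer: Sink[Int, _] = Sink.foreach[Int](w => println(w))

  // Putting them together.
  lines.via(weights).runWith(writer)
}
```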
And now we are basically done. We have built a stream that will process each file, line by line, calculating the weight of each line and finally writing the value down.
The Real Problem
The described stream does its job, but it takes long hours to finish because the files can be huge.
First Attempt. Increasing Stream Parallelism
Our first try is to increase the number of concurrent mappers in the stream. We can redefine parts of the pipeline to do more work at any given moment.
Because we have big machines, we can increase the parallelism. Also, we don’t care too much about ordering: since the calculated value is unique per line, we can easily identify the exact (file, line) pair given the line’s weight.
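With Akka Streams, one way to do this is mapAsyncUnordered, which runs several computations concurrently and emits results in completion order. A sketch (the parallelism value and the weight function are illustrative):

```scala
import scala.concurrent.Future

import akka.actor.ActorSystem
import akka.stream.scaladsl.Flow

object ParallelWeights {
  implicit val system: ActorSystem = ActorSystem("pipeline")
  import system.dispatcher

  // Stand-in for the weight function f.
  def weight(line: String): Int = line.hashCode

  // Up to 8 lines are weighed at the same time; results may come out
  // of order, which is fine because each weight identifies its line.
  val weights: Flow[String, Int, _] =
    Flow[String].mapAsyncUnordered(parallelism = 8) { line =>
      Future(weight(line))
    }
}
```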
As you might expect, performance is better now, but still not good enough given the amount of data waiting to be processed. Even though we are doing more processing at any given time, we cannot go beyond the power of a single computer; Akka Streams does not support distributing a stream across machines, at least for now.
Second Attempt. Going Back to the Basics
Working with streams is easy, nice, and clean, but sometimes we need to go back to the basic abstraction: Actors.
We are going to use Akka Cluster and Distributed Pub / Sub to move our data through the pipeline.
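A sketch of the two processing actors, wired through Distributed Pub/Sub (the message types and the in topic name are assumptions; out is the topic the writer listens on):

```scala
import akka.actor.Actor
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Publish, Subscribe}

// Illustrative message types flowing through the pipeline.
final case class Line(file: String, line: String)
final case class Weight(file: String, value: Int)

// Computes the weight of each line and publishes it for the writer.
class WeightCalculator extends Actor {
  val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("in", self)

  def receive = {
    case Line(file, line) =>
      mediator ! Publish("out", Weight(file, line.hashCode)) // hashCode stands in for f
    case _ => // ignore SubscribeAck
  }
}

// Writes each calculated weight down.
class Writer extends Actor {
  val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("out", self)

  def receive = {
    case Weight(file, value) => println(s"$file -> $value")
    case _ => // ignore SubscribeAck
  }
}
```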
Here, we have created two actors which do the same work we had before with streams. Still, we are missing the one that reads our dataset.
This actor will take a stream and publish its elements (file, line) to the cluster so WeightCalculator can do its corresponding job.
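The reader could be sketched as an actor that walks the dataset on startup and publishes every element to the cluster (the Line message type and the in topic name are illustrative assumptions):

```scala
import akka.actor.Actor
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.Publish

// Illustrative element of the dataset, as consumed by WeightCalculator.
final case class Line(file: String, line: String)

// Reads the dataset and pushes every (file, line) into the cluster.
class Streamer(file: String, lines: Iterator[String]) extends Actor {
  val mediator = DistributedPubSub(context.system).mediator

  override def preStart(): Unit =
    lines.foreach(l => mediator ! Publish("in", Line(file, l)))

  def receive = Actor.emptyBehavior
}
```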
Let’s create an app that runs this pipeline to see how everything fits together.
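Such an app might look like the following sketch, assuming the WeightCalculator, Writer, and Streamer actors described in this section and a working Akka Cluster setup (cluster provider and seed nodes) in application.conf:

```scala
import scala.io.Source

import akka.actor.{ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object PipelineApp extends App {
  // Cluster configuration is expected in application.conf; omitted here.
  val system = ActorSystem("pipeline", ConfigFactory.load())

  system.actorOf(Props[WeightCalculator], "weight-calculator")
  system.actorOf(Props[Writer], "writer")
  system.actorOf(
    Props(new Streamer("fileA.txt", Source.fromFile("fileA.txt").getLines())),
    "streamer")
}
```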
At this point we have a runnable application that does the same thing we had with streams. However, this one can scale much more easily: we can deploy more actors of a certain type (WeightCalculator or Writer) in the same JVM or in a different one. We can even deploy them remotely on different machines on the network. All these deployment strategies are backed by Akka Cluster.
There is no limit on the number of processors we can have for each step, but this approach has an interesting problem.
In the streams world we have built-in back pressure, so we don’t have to worry about flooding actors with too many messages. With our new approach, however, we are pushing a lot of messages into the cluster, which, in our experience, results in a lot of JVM crashes.
Third Attempt. Pulling vs Pushing
It would be nice if we could keep the Akka Cluster deployment and, at the same time, have control over how fast messages are propagated from the source. With some changes to our actors, we can achieve this.
We need to change our source first, so we pull from it.
Streamer doesn’t push anything by default. It waits for SendNext(to) messages published on the cluster by some consumer, where to is the address of that consumer. Streamer then sends the next element of the stream directly to it. This mechanism guarantees that elements are sent only to the actors that requested them.
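A pull-based Streamer could be sketched like this (SendNext comes from the text; the demand topic name and the Line message type are illustrative assumptions):

```scala
import akka.actor.{Actor, ActorRef}
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.Subscribe

// A consumer requests the next element by publishing SendNext with
// its own address; "demand" is an assumed topic name.
final case class SendNext(to: ActorRef)
final case class Line(file: String, line: String) // element of the dataset

class Streamer(file: String, lines: Iterator[String]) extends Actor {
  val mediator = DistributedPubSub(context.system).mediator
  mediator ! Subscribe("demand", self)

  def receive = {
    // Send the next element directly to the actor that asked for it.
    case SendNext(to) if lines.hasNext => to ! Line(file, lines.next())
    case _ => // stream exhausted, or a SubscribeAck
  }
}
```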
On the other hand, any actor can request elements by publishing a request on the cluster. We need to change WeightCalculator so it requests elements when it needs them.
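A sketch of the pulling WeightCalculator (message types and the demand topic name are the same illustrative assumptions; out is the topic the writer listens on):

```scala
import akka.actor.{Actor, ActorRef}
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.Publish

final case class SendNext(to: ActorRef)           // demand signal
final case class Line(file: String, line: String) // input element
final case class Weight(file: String, value: Int) // result

class WeightCalculator extends Actor {
  val mediator = DistributedPubSub(context.system).mediator

  // Ask for the first element as soon as the actor starts.
  override def preStart(): Unit = mediator ! Publish("demand", SendNext(self))

  def receive = {
    case Line(file, line) =>
      // Publish the result for the writer...
      mediator ! Publish("out", Weight(file, line.hashCode)) // hashCode stands in for f
      // ...and only then request the next element.
      mediator ! Publish("demand", SendNext(self))
  }
}
```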
As you can see, WeightCalculator requests new elements only when it has finished processing the previous one. In this way we avoid flooding it with more messages than it can process at once. Once WeightCalculator has finished processing a message, it publishes the result on the cluster using the topic out, so it can be written down by the Writer.
We could use the same app to run our pipeline, but remember that, thanks to the Cluster capabilities, we could deploy our actors anywhere; we are not limited to a single computer.
Using Akka Cluster we can scale out without limits, but even with a great number of computers we ended up flooding actors to the point of crashing. The way messages propagate matters in our particular case, and switching to a pulling approach turned out to be the solution. Pulling from the source allows different components to work together, each of them processing at its own pace; the slow ones process less than the fast ones, but nothing crashes.
We believe the problem we faced is a very common one, and this post is meant to give you some ideas about what could go wrong and how this solution works for us.
We hope our simplistic example illustrates the problem clearly while helping you find a solution to yours.
Hopefully we will someday see Akka Streams running on clusters, since their abstractions are nice to use and easy to put together; for now, we need to work with alternatives.