Building data pipelines are common tasks for almost every company nowadays. However, the ways they are being built are quite different from one to another. We, as a company, are not different from others so we are building these data paths too, but the tooling always varies from one place to another one. What I am about to describe can be achieved using any of the available big data tools such as Apache Spark which I am also a huge fan. However, the team has invested heavily in Akka, and we have built several internal systems with it, which in some way, means we are going to explore how to finish the task at hand. Simplifying the problem, we will say that our task consists in calculating a value that represent the unique line of a specific file. For example, given a file with the following content: A This is just a Test fileFor this example The result of processing this file using the function is: f(s: String) => int 234234212341234151245234 Please, don’t worry too much about the function f, let’s focus the overall view of the process. Using Akka Streams we could create a simple pipeline to process these files. Let’s see how. First, we define the building blocks of our stream. Then, have to put them together. And now we are basically done. We have built a stream that will process each file, line by line, calculating the weight of each of them and finally writing the value down. The Real Problem When the described stream does its job, it takes long hours to process because the file can be huge. First Attempt. Increasing Stream Parallelism Our first try is to increase the number of concurrent mappers on the stream. We can redefine parts of pipe to do more work at a given moment in time. Because we have big machines, we can increase the parallelism. Also, we don’t care too much about the order since the calculated value is unique per line so we can easily identify the exact (file, line) given the line’s weight. As you can expect, the performance is better now, but not good enough given the amount of data waiting to be processed. Even though we are doing more processing at any given time, we cannot go beyond the power of a single computer, Akka Streams do not support it for now. Second Attempt. Going Back to the Basics Working with streams is easy, nice, and clean, but sometimes we need to go back to basic abstractions, . Actors We are going to use and to move our data through the pipeline. Akka Cluster Distributed Pub / Sub Here, we have created two actors which do the same we had before with streams. Still, we are missing the one that read our dataset. This actor will take a stream and will publish its elements (file, lines) to the cluster so can do its corresponding job. WeightCalculator Let’s create an app that runs this pipeline to see how everything fits together. At this point we have a runnable application that does the same we had using streams. However, this one, can scale much easier by deploying more actors of a certain type ( ) in the same or in a different JVM. We can even deploy them remotely on different machines on the network. . WeightCalculator or Writer All these deployment strategies are backed by Akka Cluster We have no limits regarding the number of processors for each step we could have, but this approach has an interesting problem. In the stream’s world, we have built-in back pressure so we don’t have to worry about flooding actors with too many messages. However, with our new approach, we are pushing a lot of messages into the cluster, which, in our experience, results in a lot of JVM crashes. Third Attempt. Pulling vs Pushing It would be nice if we could use the Akka Cluster deployment and at the same time have control on how fast messages are propagated from the . Doing some changes in our actors we could achieve this. source We need to change our source first, so we pull from it. In here, doesn’t push anything by default. It waits for messages published on the cluster by some consumer where the value of is the address of the consumer. Then will send the next element of the stream directly to it. This mechanism guarantees that elements are sent only to actors that requested them. Streamer SendNext(to) to Streamer On the other hand, any actor can request elements by publishing a request on the cluster. We need to change so it requests elements when it needs them. WeightCalculator As you could see, requests new elements only when it has finished processing a given element. In this way we avoid over flooding it with too many messages that it cannot process at once. WeightCalculator After has finished processing a message, it will publish the results in the cluster using the topic so it can be written down by the . WeightCalculator out Writer We could use the same to run our pipeline, but remember that because of the Cluster capabilities, we could deploy our actors anywhere, we are not limited by a single computer. app Using Akka Cluster we can scale up without limits, but even with a great number of computers, we ended up over flooding actors to the point of crash and failure. The way messages propagate is important in our particular case and changing to a approach seems to be the solution. By pulling from the source allows us to have different components working together, each of them processing at its own pace; pulling the slow ones process less than the faster ones, but no crashes. We believe that the problem we had is a very common one, and this post is meant to give you some ideas of what could go wrong and how this solution works for us. We hope that our simplistic example illustrate with clarity the problem while helping you find a solution to yours. Hopefully we will see Akka Streams running on Clusters since their abstractions are nice to use and easy to put together; for now, we need to work on alternatives.

Akka Streams, a Story of Scalability

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Untitled Story

Miami Scala 2017 Presentations and Conferences Journal, with Pictures.

Akka Cluster in Docker. A Straight Forward Configuration.

Akka.io, sbt-assembly and Docker

How to Store Data with a Few Million Akka Actors

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Miami Scala 2017 Presentations and Conferences Journal, with Pictures.

Akka Cluster in Docker. A Straight Forward Configuration.

Akka.io, sbt-assembly and Docker

How to Store Data with a Few Million Akka Actors

The Noonification: Feature Optimization for Price Prediction (11/26/2023)

Light-Mode

Classic

Newspaper

Dark-Mode

Neon Noir

Minty

HN StartUps