Collect or not Collect your Java Stream?

by Nicolas A Perez, July 17th, 2018

We have been exploring some of the Java APIs, how they have changed the way we use this popular language, and how to write better and more performant code using these new tools. For more information, read <a href="https://hackernoon.com/finally-functional-programming-in-java-ad4d388fb92e" target="_blank">Finally Functional Programming in Java</a> and <a href="https://hackernoon.com/a-sad-story-about-concurrency-346990a9a3fe" target="_blank">Snippets About Concurrency</a>.

Yet, there is one particular API that we, as an organization, use extensively, sometimes in combination with other parts of the language that we have discussed before. We are going to use this post to extend our previous posts while discussing how the Stream API is used and some of the problems we should be aware of when using it.

Let’s start with some code examples that shed light on the problem in question.
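The original listing is not reproduced here, so what follows is only a minimal sketch of the kind of code we have in mind. The class name CollectEveryStep and the helpers evens, doubled and process are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class CollectEveryStep {

    // Opens a stream, filters it, and materializes a new list.
    static List<Integer> evens(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .collect(Collectors.toList());
    }

    // Opens another stream over the previous list and materializes again.
    static List<Integer> doubled(List<Integer> values) {
        return values.stream()
                .map(v -> v * 2)
                .collect(Collectors.toList());
    }

    // Every stage hands a fully built list to the next one.
    static List<Integer> process(List<Integer> values) {
        return doubled(evens(values));
    }

    public static void main(String[] args) {
        System.out.println(process(Arrays.asList(1, 2, 3, 4, 5, 6))); // [4, 8, 12]
    }
}
```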

Hopefully, we all see the problem here. At every step we collect the stream and materialize it into a list; then, in order to execute the next operation, we convert that list back into a stream, only to materialize the result into a list again.

The main idea behind streams is their lazy nature, which makes them a good fit for handling continuous data or datasets that are large enough to overflow your memory.
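To make this laziness concrete, here is a small sketch (LazyPipeline and run are our own illustrative names, not part of any library). It counts how many elements the map stage has touched before and after a terminal operation is called.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyPipeline {

    // Returns {elements mapped before the terminal operation,
    //          elements mapped after findFirst has run}.
    static int[] run(int threshold) {
        AtomicInteger mapped = new AtomicInteger();

        // Intermediate operation only: this builds the pipeline, nothing runs yet.
        Stream<Integer> squares = Stream.of(1, 2, 3, 4, 5)
                .map(n -> {
                    mapped.incrementAndGet();
                    return n * n;
                });
        int beforeTerminal = mapped.get();

        // Terminal, short-circuiting operation: pulls elements one by one
        // and stops as soon as a square above the threshold is found.
        squares.filter(sq -> sq > threshold).findFirst();
        int afterTerminal = mapped.get();

        return new int[] {beforeTerminal, afterTerminal};
    }

    public static void main(String[] args) {
        int[] counts = run(3);
        System.out.println("mapped before terminal operation: " + counts[0]); // 0
        System.out.println("mapped after findFirst: " + counts[1]);           // 2
    }
}
```

With a threshold of 3, only the first two of the five elements are ever mapped, because the pipeline stops as soon as findFirst has its answer.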

There is another problem, which is also implicit in the nature of laziness. When working with streams, there is no way to know the size of a stream without materializing it, and if the stream never ends, that means we can never find its size at all. This might sound confusing, but let’s look at one example.
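As a sketch of such a stream, and assuming a hypothetical helper called numbers, we can build an unbounded sequence with Stream.iterate. Creating it is cheap because no element is generated until a terminal operation asks for one.

```java
import java.util.stream.Stream;

class UnboundedNumbers {

    // Unbounded stream: 0, 1, 2, ... with no natural end.
    static Stream<Integer> numbers() {
        return Stream.iterate(0, n -> n + 1);
    }

    public static void main(String[] args) {
        Stream<Integer> numbers = numbers();

        // A short-circuiting terminal operation only pulls what it needs,
        // so this returns immediately even though the stream has no end.
        System.out.println(numbers.findFirst().get()); // 0
    }
}
```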

If we try to get its size (for example with .count()), our program will hang forever since this is an unbounded stream.

The same will happen if we try to materialize it using one of the Collectors such as:
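The following is a hedged sketch rather than the original listing; materialize and neverReturns are hypothetical names. The call on the unbounded stream lives in its own method that main deliberately does not invoke, so the sketch itself terminates.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectUnbounded {

    // Materializes whatever stream it is given into a list.
    static List<Integer> materialize(Stream<Integer> stream) {
        return stream.collect(Collectors.toList());
    }

    // Collecting an unbounded stream: the collector keeps adding elements
    // and this method never returns a list.
    static List<Integer> neverReturns() {
        Stream<Integer> numbers = Stream.iterate(0, n -> n + 1);
        return materialize(numbers);
    }

    public static void main(String[] args) {
        // neverReturns() is intentionally not called here.
        // The very same collector is fine on a finite stream.
        System.out.println(materialize(Stream.of(1, 2, 3))); // [1, 2, 3]
    }
}
```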

A program that does this never ends because the numbers stream does not end; in practice it keeps accumulating elements until the JVM runs out of memory.

Now, suppose that we can convert the numbers stream into a bounded stream.
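One way to sketch this, assuming we bound the stream with limit (the helper names firstNumbers and collectAtEveryStage are hypothetical), is shown next. The pipeline now terminates, but it still builds a full list after every stage.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BoundedNumbers {

    // limit() turns the unbounded sequence into a bounded one.
    static Stream<Integer> firstNumbers(long howMany) {
        return Stream.iterate(0, n -> n + 1).limit(howMany);
    }

    // Materializes after every stage: two intermediate lists plus the result.
    static List<Integer> collectAtEveryStage(long howMany) {
        List<Integer> all = firstNumbers(howMany)
                .collect(Collectors.toList());
        List<Integer> evens = all.stream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());
        return evens.stream()
                .map(n -> n * 10)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(collectAtEveryStage(10)); // [0, 20, 40, 60, 80]
    }
}
```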

As we can see, even when we know that materializing the stream is possible, collecting it just to convert it back to a stream might be very costly.

These operations need to go over the entire dataset again on every .stream and .collect.

At this point we have discussed two problems: the performance hit the application takes when it materializes streams and converts them back, and the risk that an unbounded stream leaves us stuck in processing that never ends.

As a rule of thumb, we should not materialize streams until the very end of the processing chain AND the materialization should only happen if we are certain that our stream is bounded at this point.

During code reviews, we have found snippets of code similar to the following.
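The reviewed code itself is not reproduced here. As a stand-in, this is a minimal sketch over a hypothetical User type: the fields, the sample data, and the helpers adults and active are all assumptions made for illustration, and only the name veryInterstingUser (spelled as in the text) comes from the discussion that follows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CollectingUsers {

    // Hypothetical domain type, invented for this example.
    static final class User {
        final String name; final int age; final boolean active; final String country;
        User(String name, int age, boolean active, String country) {
            this.name = name; this.age = age; this.active = active; this.country = country;
        }
    }

    static List<User> sampleUsers() {
        return Arrays.asList(
                new User("Ana", 34, true, "ES"), new User("Bob", 17, true, "ES"),
                new User("Luis", 41, true, "ES"), new User("Mia", 29, false, "US"),
                new User("Tom", 52, true, "US"));
    }

    // Each stage streams over a list and immediately collects into a new list.
    static List<User> adults(List<User> users) {
        return users.stream().filter(u -> u.age >= 18).collect(Collectors.toList());
    }

    static List<User> active(List<User> users) {
        return users.stream().filter(u -> u.active).collect(Collectors.toList());
    }

    // Active adults living in a country that has at least two of them.
    static List<User> veryInterstingUser(List<User> users) {
        List<User> activeAdults = active(adults(users));
        Map<String, List<User>> byCountry = activeAdults.stream()
                .collect(Collectors.groupingBy(u -> u.country));
        return byCountry.values().stream()
                .filter(group -> group.size() >= 2)
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (User u : veryInterstingUser(sampleUsers())) {
            System.out.println(u.name); // Ana, then Luis
        }
    }
}
```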

Notice how many times we have to go over the entire collection: at least twice in each computational stage. Depending on the size of the user population, even simple examples like this one can have a performance impact on the application.

Let’s see how we can make this better.
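Again as a sketch only, using the same hypothetical User type, sample data and helper names (adults, active, crowdedCountries) rather than the original code, the stages can be rewritten to accept and return streams.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamingUsers {

    // Hypothetical domain type, invented for this example.
    static final class User {
        final String name; final int age; final boolean active; final String country;
        User(String name, int age, boolean active, String country) {
            this.name = name; this.age = age; this.active = active; this.country = country;
        }
    }

    static List<User> sampleUsers() {
        return Arrays.asList(
                new User("Ana", 34, true, "ES"), new User("Bob", 17, true, "ES"),
                new User("Luis", 41, true, "ES"), new User("Mia", 29, false, "US"),
                new User("Tom", 52, true, "US"));
    }

    // Lazy stages: each one only adds an operation to the pipeline.
    static Stream<User> adults(Stream<User> users) {
        return users.filter(u -> u.age >= 18);
    }

    static Stream<User> active(Stream<User> users) {
        return users.filter(u -> u.active);
    }

    // Grouping is the one stage that has to materialize its input.
    static Stream<List<User>> crowdedCountries(Stream<User> users) {
        Map<String, List<User>> byCountry =
                users.collect(Collectors.groupingBy(u -> u.country));
        return byCountry.values().stream().filter(group -> group.size() >= 2);
    }

    // Only the final step collects the result into a list.
    static List<User> veryInterstingUser(Stream<User> users) {
        return crowdedCountries(active(adults(users)))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (User u : veryInterstingUser(sampleUsers().stream())) {
            System.out.println(u.name); // Ana, then Luis
        }
    }
}
```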

Notice that all our functions now receive and return Streams. Each computation stage passes a new stream to the next step in the computation. If we think about it, nothing gets materialized until the very end. There is just one final iteration over the stream, since the final stream is a composition of the previous ones. There is one exception, though: in order to group, the stream must be materialized, and there is no way around it. Other than that, every other operation is executed lazily, and it is not until the very final step that we trigger the materialization of this dataset.

There is something else about performance we should take into consideration. Because we are using streams, we can process a large number of users without worrying too much about memory consumption, since we might never have to deal with the entire dataset at once. Of course, our last function, veryInterstingUser, is not making use of this, but in real-world applications we might return a stream here as well and then consume that stream instead of a List<>.

Conclusions

The Java concurrent API and the streaming API are very interesting and powerful tools that all of us using Java should learn to use responsibly so we can deliver better and more performant applications. It might take some time to get used to these new concepts, especially when we first start working with them, but it is just a matter of practice and good engineering techniques until we are experts in these areas.

Enjoy your Java :)