Node.js streams in action

by Max Kharandziuk, April 17th, 2017

This article shows how to apply Node.js Streams and a bit of reactive programming to a real(tm) problem. It is meant to be highly practical and is aimed at an intermediate reader, so I intentionally omit some basic explanations. If you miss something, check the API documentation (https://nodejs.org/api/stream.html) or a retelling of it (e.g. the stream handbook: https://github.com/substack/stream-handbook).

So, let's start with the problem description. We need to implement a simple web scraper which grabs all the data from some REST API, processes the data somehow and inserts it into our database. For simplicity, I omit the details of the actual database and REST API (in real life it was the API of a travel fare aggregator website and a Pg database).

Suppose we have two functions (the code of the I/O simulator functions and the rest of the article's code is here):
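To make the setup concrete, here is a minimal sketch of what two such simulator functions could look like. The names `getPage` and `insertPage`, the promise-based style, the page size, the number of pages and the delays are all assumptions made for illustration, not the original code.

```javascript
// Hypothetical stand-ins for the REST API and the database.
// Names, page size, page count and delays are assumptions.
const PAGE_SIZE = 1000;
const TOTAL_PAGES = 5;

// Simulates a paginated REST call: resolves with a list of items,
// or with an empty list when there are no more pages.
function getPage(page) {
  return new Promise((resolve) => {
    setTimeout(() => {
      if (page >= TOTAL_PAGES) {
        resolve([]);
        return;
      }
      const items = Array.from({ length: PAGE_SIZE }, (_, i) => ({
        id: page * PAGE_SIZE + i,
      }));
      resolve(items);
    }, 10);
  });
}

// Simulates a bulk insert: resolves with the number of rows "written".
function insertPage(items) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(items.length), 10);
  });
}
```

An empty list is used as the "no more data" signal so that the caller does not need to know the total number of pages in advance.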

For this article we will ignore the errors which can occur. Maybe I will describe them in the next article.

There are also some requirements for the data processing. Let's say that we need to remove all the items whose id contains the digit 3, and to extend all the items whose id contains the digit 9 with the current timestamp (e.g.: {id: 9} -> {id: 9, timestamp: 1490571732068}). These requirements are silly, but they are similar to the ones which appear in real web scrapers.
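The two rules can be written as plain functions over a list, independent of any I/O. This is only a sketch: the helper names (`hasDigit`, `keep`, `stamp`, `processItems`) are mine, and the injectable `now` argument is an assumption added to keep the example deterministic.

```javascript
// True when the decimal representation of `id` contains `digit`.
const hasDigit = (id, digit) => String(id).includes(String(digit));

// Rule 1: drop every item whose id contains the digit 3.
const keep = (item) => !hasDigit(item.id, 3);

// Rule 2: add a timestamp to every item whose id contains the digit 9.
// `now` is injectable so the function is easy to test.
const stamp = (item, now = Date.now()) =>
  hasDigit(item.id, 9) ? { ...item, timestamp: now } : item;

// Apply both rules to a list of items.
const processItems = (items, now = Date.now()) =>
  items.filter(keep).map((item) => stamp(item, now));

console.log(
  processItems([{ id: 1 }, { id: 3 }, { id: 9 }, { id: 39 }], 1490571732068)
);
// prints [ { id: 1 }, { id: 9, timestamp: 1490571732068 } ]
```

Note that an id such as 39 matches both rules; filtering first means it is simply removed.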

Time to get our hands dirty! A naive implementation can look like this:
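The original listing is not reproduced here, so what follows is only a sketch of what such a naive scraper might look like. The simulators `getPage` and `insertPage`, the three-page data set and the use of async/await are assumptions for illustration.

```javascript
// Hypothetical simulators (assumptions, not the article's original code).
const getPage = (page) =>
  new Promise((resolve) => {
    const items =
      page < 3
        ? Array.from({ length: 1000 }, (_, i) => ({ id: page * 1000 + i }))
        : [];
    setTimeout(() => resolve(items), 10);
  });

const inserted = [];
const insertPage = (items) =>
  new Promise((resolve) => {
    setTimeout(() => {
      for (const item of items) inserted.push(item);
      resolve();
    }, 10);
  });

// Naive scraper: fetching, processing and inserting are interleaved
// in a single loop.
async function scrape() {
  let page = 0;
  while (true) {
    const items = await getPage(page);
    if (items.length === 0) break; // an empty page means "no more data"
    const processed = items
      .filter((item) => !String(item.id).includes('3'))
      .map((item) =>
        String(item.id).includes('9')
          ? { ...item, timestamp: Date.now() }
          : item
      );
    await insertPage(processed); // the next fetch waits for this insert
    page += 1;
  }
  return inserted.length;
}

const done = scrape();
done.then((count) => console.log(`inserted ${count} items`));
```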

What are the problems with such code?

It’s hard to understand what the code does at first glance. We can improve it with some comments, but it’s better to show in the code that we are reading the data in one place and writing it in another.
It’s too specific. It’s hard to write some logic for processing the values. How can we separate the processing logic from the I/O logic?
It isn’t efficient. Our producer waits until the consumer has used a portion of data instead of buffering some portions and writing all of them at once.

So, you may have already guessed that there is a better way to solve this problem with Node Streams. Let’s split our problem into three parts: input, output and processing.

So, our Readable Stream can look like this:
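As a sketch of the idea (not the original listing), a Readable stream over a paginated source might be written like this. The class name `Source`, the simulator `getPage` and the three-page data set are assumptions for illustration.

```javascript
const { Readable } = require('stream');

// Hypothetical paginated source (an assumption for illustration):
// three pages of 1000 items, then an empty page.
const getPage = (page) =>
  new Promise((resolve) => {
    const items =
      page < 3
        ? Array.from({ length: 1000 }, (_, i) => ({ id: page * 1000 + i }))
        : [];
    setTimeout(() => resolve(items), 10);
  });

class Source extends Readable {
  constructor(options = {}) {
    // objectMode: each chunk is a JS value (here, a list of items).
    super({ ...options, objectMode: true });
    this.page = 0;
  }

  // Called by the stream machinery whenever the consumer wants more data.
  _read() {
    const page = this.page;
    this.page += 1;
    getPage(page).then(
      // push(null) signals the end of the stream.
      (items) => this.push(items.length > 0 ? items : null),
      (err) => this.destroy(err)
    );
  }
}

// Usage: collect every chunk the stream produces.
const chunks = [];
const source = new Source();
const finished = new Promise((resolve, reject) => {
  source.on('data', (chunk) => chunks.push(chunk));
  source.on('end', resolve);
  source.on('error', reject);
});
```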

It looks a little bit heavy, but it’s a standard interface. As a chunk of data we will use a list of a thousand items. The important thing here is objectMode: true – we want to operate with objects instead of binary data.

Now the output part. We need to implement the Writable Stream. Something like this:
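A minimal sketch of such a Writable stream, under assumptions: the class name `Sink`, the simulator `insertPage` and the `highWaterMark` value of 10 are mine, not taken from the original code.

```javascript
const { Writable } = require('stream');

// Hypothetical bulk insert (an assumption for illustration); it records
// the size of every batch it receives.
const batches = [];
const insertPage = (items) =>
  new Promise((resolve) => {
    setTimeout(() => {
      batches.push(items.length);
      resolve();
    }, 10);
  });

class Sink extends Writable {
  constructor(options = {}) {
    // highWaterMark counts buffered chunks (lists), not bytes.
    super({ highWaterMark: 10, ...options, objectMode: true });
  }

  // Writes a single chunk, i.e. one list of items.
  _write(chunk, encoding, callback) {
    insertPage(chunk).then(() => callback(), callback);
  }

  // Writes all chunks that were buffered while a previous write was
  // in flight, merged into one insert. Each entry is { chunk, encoding }.
  _writev(chunks, callback) {
    const items = [].concat(...chunks.map((entry) => entry.chunk));
    insertPage(items).then(() => callback(), callback);
  }
}

// Usage: three quick writes; the later ones are buffered while the
// first insert is still running.
const sink = new Sink();
const finished = new Promise((resolve, reject) => {
  sink.on('finish', resolve);
  sink.on('error', reject);
});
sink.write([{ id: 1 }]);
sink.write([{ id: 2 }]);
sink.write([{ id: 4 }]);
sink.end();
```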

Some important bits here:

objectMode: we want to operate with objects
highWaterMark: the size of our buffer… in objects. You should be careful with it because there is no direct connection between the number of objects and their real size in bytes. In our case one object is a list of some size
_writev: ‘explains’ how the stream should write more than one chunk from the buffer at a time.

And now we can connect them:

https://gist.github.com/9e3174a05adb55a5fbb750b5e972bc0e
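For readers without access to the gist, connecting the two ends might look roughly like the sketch below. It uses the constructor-options form of Readable and Writable to stay short; the simulators `getPage` and `insertPage` and all sizes are assumptions for illustration.

```javascript
const { Readable, Writable } = require('stream');

// Hypothetical simulators (assumptions for illustration).
const getPage = (page) =>
  new Promise((resolve) => {
    const items =
      page < 3
        ? Array.from({ length: 1000 }, (_, i) => ({ id: page * 1000 + i }))
        : [];
    setTimeout(() => resolve(items), 10);
  });

let insertedCount = 0;
const insertPage = (items) =>
  new Promise((resolve) => {
    setTimeout(() => {
      insertedCount += items.length;
      resolve();
    }, 10);
  });

// Input: one chunk per page, null when the source is exhausted.
let page = 0;
const source = new Readable({
  objectMode: true,
  read() {
    const current = page;
    page += 1;
    getPage(current).then(
      (items) => this.push(items.length > 0 ? items : null),
      (err) => this.destroy(err)
    );
  },
});

// Output: single writes and merged writes of buffered chunks.
const sink = new Writable({
  objectMode: true,
  highWaterMark: 10,
  write(chunk, encoding, callback) {
    insertPage(chunk).then(() => callback(), callback);
  },
  writev(chunks, callback) {
    const items = [].concat(...chunks.map((entry) => entry.chunk));
    insertPage(items).then(() => callback(), callback);
  },
});

// The whole data flow in one line; pipe() handles back-pressure.
const finished = new Promise((resolve, reject) => {
  source.pipe(sink).on('finish', resolve).on('error', reject);
});

finished.then(() => console.log(`inserted ${insertedCount} items`));
```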

From my perspective the code doesn’t require any comments. We show the data flow and hide the implementation details until you ask for them. Also, it’s definitely efficient:

it doesn’t block the event loop
it makes extensive use of standard library primitives, which should be efficient
it uses buffering and back-pressure

And for dessert we will implement the data processing. We could implement a Transform stream manually, but there is no fun in that. We will use a library called Highland.js which lets us use well-known functional programming primitives (.filter, .map, etc.) with our streams. Actually, Highland.js is much bigger than just .map and .filter, but I don’t want to widen the scope of the article. So, the transformation can look like this:

It is much the same as working with plain JS lists. We need .flatten() and .batchWithTimeOrCount(100, 1000) because our streams operate on lists of items instead of individual items.
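For comparison, the manual Transform stream mentioned above (the "no fun" alternative to Highland.js) could be sketched with only the standard library. This is an illustration under assumptions: the class name `Processor` is mine, and it re-batches by count only, leaving out the time-based flush that .batchWithTimeOrCount provides.

```javascript
const { Transform } = require('stream');

const hasDigit = (id, digit) => String(id).includes(String(digit));

class Processor extends Transform {
  constructor(batchSize = 1000) {
    super({ objectMode: true });
    this.batchSize = batchSize;
    this.batch = [];
  }

  // Receives one list of items, works on individual items ("flatten"),
  // and emits new lists of at most `batchSize` items.
  _transform(items, encoding, callback) {
    for (const item of items) {
      if (hasDigit(item.id, 3)) continue; // drop ids containing 3
      this.batch.push(
        hasDigit(item.id, 9) ? { ...item, timestamp: Date.now() } : item
      );
      if (this.batch.length >= this.batchSize) {
        this.push(this.batch);
        this.batch = [];
      }
    }
    callback();
  }

  // Emits whatever is left when the input ends.
  _flush(callback) {
    if (this.batch.length > 0) this.push(this.batch);
    callback();
  }
}

// Usage with a tiny batch size so the re-batching is visible.
const processor = new Processor(2);
const out = [];
const finished = new Promise((resolve, reject) => {
  processor.on('data', (batch) => out.push(batch));
  processor.on('end', resolve);
  processor.on('error', reject);
});
processor.write([{ id: 1 }, { id: 3 }, { id: 9 }]);
processor.write([{ id: 13 }, { id: 5 }]);
processor.end();
```

In a full pipeline such a stream would sit between the two ends, e.g. `source.pipe(new Processor()).pipe(sink)`, where `source` and `sink` stand for the Readable and Writable streams described earlier.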

So, that’s all. I hope this article gives you some motivation to learn Node Streams and Highland.js.
