Node.js War Stories: Debugging Issues in Production

Written by RisingStack | Published 2017/04/26
Tech Story Tags: nodejs | javascript | tech | debugging | monitoring


In this article, you can read stories from Netflix, RisingStack & nearForm about Node.js issues in production — so you can learn from our mistakes and avoid repeating them. You’ll also learn what methods we used to debug these Node.js issues.

Special shoutout to Yunong Xiao of Netflix, Matteo Collina of nearForm & Shubhra Kar from Strongloop for helping us with their insights for this post!

At RisingStack, we have accumulated tremendous experience running Node apps in production over the past 4 years — thanks to our Node.js consulting, training and development business.

Like the Node teams at Netflix & nearForm, we picked up the habit of always writing thorough postmortems, so the whole team (and now the whole world) could learn from the mistakes we made.

Netflix & Debugging Node: Know your Dependencies

Let’s start with a slowdown story from Yunong Xiao, which happened with our friends at Netflix.

The trouble started with the Netflix team noticing that their application's response time increased progressively — some of their endpoints' latency grew by 10ms every hour.

This was also reflected in the growing CPU usage.

Request latencies for each region over time — photo credit: Netflix

At first, they started to investigate whether the request handler was responsible for slowing things down.

When they tested it in isolation, it turned out that the request handler had a constant response time of around 1ms.

So the problem was not there, and they began to suspect that it was deeper in the stack.

The next thing Yunong & the Netflix team tried was CPU flame graphs and Linux Perf Events.

Flame graph of the Netflix slowdown — photo credit: Netflix

What you can see in the flame graph above is that

  • it has high stacks (which means a lot of function calls)
  • and the boxes are wide (meaning we are spending quite some time in those functions).

After further inspection, the team found that Express's router.handle and router.handle.next have lots of references.

The Express.js source code reveals a couple of interesting tidbits:

  • Route handlers for all endpoints are stored in one global array.
  • Express.js recursively iterates through and invokes all handlers until it finds the right route handler.

Before revealing the solution to this mystery, we have to cover one more detail:

Netflix's codebase contained a periodic job that ran every 6 minutes: it grabbed new route configs from an external resource and updated the application's route handlers to reflect the changes.

This was done by deleting old handlers and adding new ones. Accidentally, it also added the same static handler all over again — even before the API route handlers. With each refresh, requests had to pass through more and more duplicate handlers, and as it turned out, this was what caused the extra 10ms of response time every hour.
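To make the failure mode concrete, here is a hypothetical sketch (not Netflix's actual code) of how re-registering the same handler on every refresh makes the router's handler stack grow without bound:

const express = require('express')
const app = express()

// A pass-through handler: cheap on its own, expensive once it has been
// duplicated thousands of times, because Express walks the handler
// stack on every request until it finds a matching route.
function staticHandler (req, res, next) {
  next()
}

function refreshRoutes () {
  // Intended behavior: replace stale route handlers with fresh ones.
  // The bug: staticHandler is re-added before the API routes on every
  // refresh, so after N refreshes each request walks N copies of it.
  app.use(staticHandler)
}

setInterval(refreshRoutes, 6 * 60 * 1000) // runs every 6 minutes

app.get('/api/data', (req, res) => res.json({ ok: true }))
app.listen(3000)

Each refresh adds a constant number of extra function calls to every request, which is exactly the kind of slow, linear latency growth Netflix observed.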

Takeaways from Netflix’s Issue

  • Always know your dependencies: fully understand them before going into production with them.
  • Observability is key — flame graphs helped the Netflix engineering team to get to the bottom of the issue.

Read the full story here: Node.js in Flames.

RisingStack CTO: “Crypto takes time”

You may have already heard from our CTO, Peter Marton, the story of how we broke down the monolithic infrastructure of Trace (our Node.js monitoring solution) into microservices.

The issue we’ll talk about now is a slowdown which affected Trace in production:

The very first versions of Trace ran on a PaaS and used the public cloud to communicate with our other services.

To ensure the integrity of our requests, we decided to sign all of them. To do so, we went with Joyent's HTTP signing library. What's really great about it is that the request module supports HTTP signatures out of the box.
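For reference, signing a request with the request module looks roughly like this (the URL, keyId, and key path below are placeholders, not our actual values):

const fs = require('fs')
const request = require('request')

request.get({
  url: 'https://internal-service.example.com/v1/records', // placeholder URL
  httpSignature: {
    keyId: 'trace-signing-key', // placeholder key identifier
    key: fs.readFileSync('/etc/keys/private.pem', 'ascii') // placeholder path
  }
}, (err, res, body) => {
  if (err) throw err
  console.log(res.statusCode)
})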

This solution was not only expensive, but it also had a bad impact on our response times.

The network delay built up our response times — photo: Trace

As you can see on the graph above, the given endpoint had a response time of 180ms, of which 100ms was just the network delay between the two services.

As the first step, we migrated from the PaaS provider to Kubernetes. We expected our response times to be a lot better, as we could leverage internal networking.

We were right — our latency improved.

However, we had expected better results, along with a much bigger drop in our CPU usage. The next step was to do CPU profiling, just like the team at Netflix:

As you can see on the screenshot, the crypto.sign function takes up most of the CPU time, consuming 10ms on each request. To solve this, you have two options:

  • if you are running in a trusted environment, you can drop request signing,
  • if you are in an untrusted environment, you can scale up your machines to have stronger CPUs.
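If you want to see what signing costs on your own hardware before picking an option, a minimal benchmark along these lines works (the throwaway 2048-bit RSA key is generated purely for illustration, and crypto.generateKeyPairSync requires a reasonably recent Node version):

const crypto = require('crypto')

// Throwaway 2048-bit RSA key pair, generated only for this benchmark
const { privateKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 })

const payload = Buffer.from('GET /api/resource\ndate: ' + new Date().toUTCString())

console.time('1000 signatures')
for (let i = 0; i < 1000; i++) {
  crypto.createSign('RSA-SHA256').update(payload).sign(privateKey)
}
console.timeEnd('1000 signatures') // divide by 1000 for the per-request cost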

Takeaways from Peter Marton

  • Latency in-between your services has a huge impact on user experience — whenever you can, leverage internal networking.
  • Crypto can take a LOT of time.

nearForm: Don’t block the Node.js Event Loop

React is more popular than ever. Developers use it for both the frontend and the backend, or they take it a step further and build isomorphic JavaScript applications.

However, rendering React pages can put some heavy load on the CPU, as rendering complex React components is CPU bound.

When your Node.js process is rendering, it blocks the event loop, because rendering is a synchronous operation.

As a result, the server can become entirely unresponsive — requests accumulate, which all puts load on the CPU.

What can be even worse is that the server will keep serving requests whose clients have long since disconnected, still putting load on the Node.js application, as Matteo Collina of nearForm explains.
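You can observe the blocking yourself with a sketch like this (it assumes react and react-dom are installed; the component and its size are arbitrary):

const React = require('react')
const { renderToString } = require('react-dom/server')

function Row ({ i }) {
  return React.createElement('li', null, 'row ' + i)
}

function Page () {
  return React.createElement('ul', null,
    Array.from({ length: 50000 }, (_, i) =>
      React.createElement(Row, { key: i, i })))
}

console.time('renderToString')
renderToString(React.createElement(Page))
// The event loop was blocked for this entire duration:
console.timeEnd('renderToString')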

It is not just React, but string operations in general. If you are building JSON REST APIs, you should always pay attention to JSON.parse and JSON.stringify.

As Shubhra Kar from Strongloop (now Joyent) explained, parsing and stringifying huge payloads can take a lot of time as well (blocking the event loop in the meantime).

function requestHandler (req, res) {
  const body = req.rawBody
  let parsedBody

  try {
    // parse the whole body in one blocking step
    parsedBody = JSON.parse(body)
  } catch (e) {
    res.statusCode = 400
    res.end('Error parsing the body')
    return
  }

  res.end('Record successfully received')
}

Simple request handler

The example above shows a simple request handler that just parses the body. For small payloads, it works like a charm; however, if the JSON's size is measured in megabytes, the execution time can be seconds instead of milliseconds. The same applies to JSON.stringify.
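A quick way to see this on your own machine (payload size and timings will vary):

// Build a multi-megabyte JSON string, then time how long parsing blocks for
const payload = JSON.stringify({
  records: Array.from({ length: 500000 }, (_, i) => ({ id: i, name: 'record-' + i }))
})
console.log('payload size: ' + (payload.length / 1024 / 1024).toFixed(1) + ' MB')

console.time('JSON.parse')
JSON.parse(payload)
console.timeEnd('JSON.parse') // the event loop is blocked for this entire duration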

To mitigate these issues, first, you have to know about them. For that, you can use Matteo’s loopbench module, or Trace’s event loop metrics feature.

With loopbench, you can return a status code of 503 to the load balancer if a request cannot be fulfilled. To enable this pattern, check the instance.overLimit property. This way, ELB or NGINX can retry the request on a different backend, where it may be served.
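A minimal sketch of that pattern looks like this:

const http = require('http')
const loopbench = require('loopbench')()

const server = http.createServer((req, res) => {
  // Shed load when the event loop lag exceeds loopbench's limit
  if (loopbench.overLimit) {
    res.statusCode = 503
    res.setHeader('Retry-After', '2') // hint for the load balancer to retry later
    return res.end('Service Unavailable')
  }
  res.end('OK')
})

server.listen(3000)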

Once you know about the issue and understand it, you can start working on fixing it — you can do it either by leveraging Node.js streams or by tweaking the architecture you are using.
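On the streams side, one option (among several) is a streaming JSON parser such as the JSONStream module, so you process records as they arrive instead of buffering the whole body for one blocking JSON.parse call. A sketch, assuming a body shaped like { "records": [...] }:

const http = require('http')
const JSONStream = require('JSONStream')

http.createServer((req, res) => {
  let count = 0
  req.pipe(JSONStream.parse('records.*'))
    .on('data', () => { count++ }) // handle one record at a time
    .on('end', () => res.end('received ' + count + ' records'))
    .on('error', () => {
      res.statusCode = 400
      res.end('invalid JSON')
    })
}).listen(3000)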

Takeaways from nearForm

  • Always pay attention to CPU-bound operations: the more you have, the more pressure you put on your event loop.
  • String operations are CPU-heavy.

Debugging Node.js Issues in Production

I hope these examples from Netflix, RisingStack & nearForm will help you to debug your Node.js apps in Production.

If you'd like to learn more, I recommend checking out our other recent posts on the RisingStack blog, which will help you deepen your Node.js knowledge.

If you have any questions, please let us know in the comments!

Originally published at blog.risingstack.com on April 4, 2017.

