tl;dr

This is the story of how we analyzed and fixed an unstable Node.js service in our infrastructure.

Background

A few months ago, we decided to schedule a downtime and migrate all our existing services from EC2 Classic to EC2 VPC for security, stability and control. We had been systematically planning this migration for two months before D-day. With over 100 services making the switch from Classic to VPC, we didn't want to leave any room for glitches. We migrated each service to VPC, tested each of them for stability and selectively routed some of our production traffic to the new instances, monitoring system stability and metrics throughout.

But after a few hours, we observed that a specific feature was intermittently slow or unavailable. The feature depended on a service that handled long-lived WebSocket connections and was hosted behind ELB and nginx. We grappled with this service that was acting up: restarting it seemed to fix things, only for more issues to show up later.

Digging deeper

The ELB CloudWatch Metrics were our starting point. Clearly, there were spikes in the number of TCP connections each time the service failed. These plateaued each time we restarted the service, only to work up a frenzy of new TCP connections after a short while.

It had to be the network, right? So we scoured all our network settings and group policies but couldn't build a case against them. They seemed standard and nothing seemed out of place.

We ran the tcpaccept tool from the BCC/eBPF suite on one of our Node.js instances to see the rate at which connections were being accepted.

    [root@nodejs-app5 ~]# /usr/share/bcc/tools/tcpaccept -t
    …….
    TIME(s)   PID    COMM  IP  RADDR                  LADDR                  LPORT
    4194.519  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.528  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.529  24848  node  6   ::ffff:127.0.0.1       ::ffff:127.0.0.1       81
    4194.535  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.539  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.541  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.548  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.553  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.625  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4194.625  24848  node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.074  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.075  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.499  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.511  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.514  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.515  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.516  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4196.785  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    4197.837  1763   sshd  4   172.16.3.17            172.16.150.139         22
    4198.073  9706   node  6   ::ffff:172.16.151.146  ::ffff:172.16.150.139  81
    …….

The connection acceptance rates looked normal (remember, we were only redirecting a small subset of our traffic) and didn't show anything unusual. Sigh.

Still unsure about where to start looking, we started with the obvious: a packet capture, because the connection count kept going up. We waited for the issue to resurface (after a restart), connected to one of the machines that hosted the nginx service and ran tshark to capture the packets.

    [root@nginx1 ~]# tshark -f 'host 172.16.150.139 and (tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)'
    ……….
    36.348790 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10954 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899165 TSecr=906255810
    36.369605 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10955 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899170 TSecr=906255815
    36.372122 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10956 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899171 TSecr=906255816
    36.399220 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10957 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899178 TSecr=906255822
    36.408761 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10958 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899180 TSecr=906255825
    36.416402 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10959 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899182 TSecr=906255827
    36.419439 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10960 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899183 TSecr=906255827
    36.457502 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10961 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
    36.458672 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10962 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
    36.459733 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10963 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899193 TSecr=906255837
    36.463407 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10964 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899194 TSecr=906255838
    36.485216 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10965 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
    36.487000 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10966 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
    36.491632 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10967 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899201 TSecr=906255845
    36.499982 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10941 [FIN, ACK] Seq=1 Ack=1 Win=16 Len=0 TSval=3899203 TSecr=906255847
    36.500013 172.16.151.146 -> 172.16.150.139 TCP 66 10941 > 81 [FIN, ACK] Seq=1 Ack=2 Win=16 Len=0 TSval=906255848 TSecr=3899203
    36.526724 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10969 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899209 TSecr=906255854
    36.555598 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10970 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899217 TSecr=906255860
    36.559707 172.16.150.139 -> 172.16.151.146 TCP 66 81 > 10971 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899218 TSecr=906255862
    ……….

Wading through the captured packets, we figured that the TCP connections were being reset by the remote app machine and new connections were being established. Repeatedly. So it looked like something was wrong with our machine setup and/or our app (as it always turns out to be).

To see why these supposedly long-lived TCP (WebSocket) connections were being closed, we ran strace on the Node process.

    [root@nodejs-app5 ~]# strace -ff -p `pidof node`
    …….
    clock_gettime(CLOCK_MONOTONIC, {8666, 663543136}) = 0
    clock_gettime(CLOCK_MONOTONIC, {8666, 663607713}) = 0
    epoll_wait(5, 7ffd00abd8d0, 1024, 11) = -1 EINTR (Interrupted system call)
    --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=11784, si_uid=0} ---
    tgkill(2100, 2100, SIGTERM) = 0
    rt_sigreturn() = -1 EINTR (Interrupted system call)
    --- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=2100, si_uid=0} ---
    +++ killed by SIGTERM +++
    …….

Surprisingly, the Node.js app was actually being killed and then restarted automatically, since we had Monit watching over it. Now we knew why the TCP connections were being reset. Take a closer look at the tcpaccept output and you'll notice that the PID of the process accepting connections kept changing, definitely something we'd overlooked earlier.

The hunch: An ominous SIGTERM

At this point, we had a lot of questions! What was killing the Node.js process? We looked at dmesg and our memory usage seemed normal (besides, we have Nagios with check_oomkiller to notify us should the OOM killer be on the rampage), so it couldn't be the OOM killer.
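For anyone repeating the dmesg check by hand, a minimal sketch follows. It assumes the kernel log is readable through dmesg and that OOM kills appear with the usual "Out of memory", "oom-killer" or "Killed process" wording. The helper name oom_lines is hypothetical and not part of our setup.

```shell
#!/bin/sh
# Hypothetical helper: filter kernel log text on stdin down to the lines
# that look like OOM-killer activity. The patterns are an assumption about
# the usual kernel wording, not an exhaustive list.
oom_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -Ei 'out of memory|oom-killer|killed process' || true
}

# Typical use on the affected host (may need root to read the ring buffer):
#   dmesg | oom_lines
```

No output from the pipeline is consistent with the OOM killer not being involved, which is what we saw.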
Was the app doing something unexpected? To find out, we attached GDB to the delinquent process, set breakpoints on all the exit syscalls and then waited.

    [root@nodejs-app5 ~]# gdb -ex 'catch syscall exit exit_group' -ex 'break exit' -ex 'continue' -p `pidof node`
    …….
    Loaded symbols for /lib64/libresolv.so.2
    0x00007fbc8c8f6419 in syscall () from /lib64/libc.so.6
    Catchpoint 1 (syscalls 'exit' [60] 'exit_group' [231])
    Breakpoint 2 at 0x7fbc8c83dea0
    Continuing.
    Program received signal SIGTERM, Terminated.
    0x00007fbc8c8f6419 in syscall () from /lib64/libc.so.6
    …….

Our hunch was confirmed: the app itself didn't crash or call exit but was killed by a SIGTERM. To find out where the signal was coming from, we started stracing the Monit process, since we knew Monit was watching over the Node.js process and could kill/restart it should there be any problem with it.

    [root@nodejs-app5 ~]# strace -ff -p `pidof monit`
    …….
    connect(4, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
    poll([{fd=4, events=POLLIN|POLLOUT}], 1, 10000) = 1 ([{fd=4, revents=POLLIN|POLLOUT}])
    getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
    poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
    read(4, "", 1024) = 0
    write(4, "GET / HTTP/1.1\r\nHost: 127.0.0.1:"..., 95) = 95
    read(4, "", 1024) = 0
    write(2, "HTTP: error receiving data -- Su"..., 38) = 38
    sendto(3, "<11>Nov 15 10:08:01 monit[8186]:"..., 71, MSG_NOSIGNAL, NULL, 0) = 71
    shutdown(4, SHUT_RDWR) = -1 ENOTCONN (Transport endpoint is not connected)
    close(4) = 0
    …….

From the trace, we understood that Monit tries to reach the Node.js process over HTTP but gets no data back (both reads return 0) and logs an error. Following this, it sends a SIGTERM and restarts the app.
This is the Monit check that was watching the app:

    check host node_web_app_0 with address 127.0.0.1
      GROUP node_web_app
      start program = "/bin/sh -c 'cmd_to_start_the_service'"
      stop program = "cmd_to_stop_the_service"
      if failed port 81 protocol HTTP
        request /
        with timeout 10 seconds
      then restart

We then discovered that Monit, upon trying to check the Node.js app's health status (by running an "http GET of /"), fails and kills/restarts the app. Here's a closer look at the strace on the Node process, limited to network activity:

    [root@nodejs-app5 ~]# strace -ff -e trace=network -p `pidof node`
    …….
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
    [pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
    …….

With an EMFILE error popping up often, we wondered if the app was repeatedly reaching the max file descriptor limit configured in limits.conf and was then being restarted by Monit because Monit couldn't reach the Node.js process.

    [root@nodejs-app5 ~]# cat /etc/security/limits.conf | grep -i nofile
    # - nofile - max number of open files
    root hard nofile 500000
    root soft nofile 500000
    * hard nofile 500000
    * soft nofile 500000

So, we decided to get to the bottom of things and counted the descriptors the process actually had open.

    [root@nodejs-app5 ~]# ls -l /proc/`pidof node`/fd | wc -l
    1025

That is 1024 open descriptors plus the "total" line that ls -l prints. It looked like the limit kept falling back to 1024 despite us setting it to a much higher value.

    [root@nodejs-app5 ~]# grep 'open files' /proc/`pidof node`/limits
    Max open files            1024                 4096                 files

Connecting to another machine that had just been started, we verified that the Node.js process there had its limits set to what we have in limits.conf.
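The descriptor count and the limit lookup above can be folded into one helper. This is only a sketch, assuming a Linux /proc filesystem. The name fd_usage is hypothetical. It lists the fd directory with plain ls, so there is no extra "total" line to subtract.

```shell
#!/bin/sh
# Hypothetical helper: print "<open fds> <soft limit>" for a PID.
# Linux-only: relies on /proc/<pid>/fd and /proc/<pid>/limits.
fd_usage() {
    pid=$1
    # One directory entry per open descriptor; plain ls prints no "total" line.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # In the "Max open files" row, the fourth field is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $soft"
}

# Example: inspect the current shell. For the service it would be something
# like: fd_usage "$(pidof node)"   (assumes a single node process)
fd_usage $$
```

When the two numbers are close, as they were for our restarted app, EMFILE errors on accept are to be expected.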
Why did the limits fall back to these low values? We know for sure that these values are inherited from the parent process. Each time we connected to a machine that had just started, our Chef OpsWorks-based infrastructure had set the right limits for the Node.js process. It's when Monit takes over and restarts the process that something goes wrong. The Monit process's limits matched the low values in the Node.js app. We wondered how.

Started from the bottom, now we're here

Once the kernel is done initializing, it starts the first user space program, init. At that point all the hard-coded values for the various limits are still intact, with the open file descriptor limit at soft=1024 and hard=4096. On cue, all the programs configured to start during system startup go live. So all these system services/daemons, Monit included, inherited those low rlimit values. The higher values from limits.conf were applied only to the log-in shell; in effect, just the programs started from an interactive shell would reflect them.

In case you lost us somewhere, here's a quick rundown:

1. Our app started with high rlimit values (as set by OpsWorks/the Chef script).
2. A few hours in, our app crashed due to an unrelated bug (this, we narrowed down to an internal call that was introduced only in this new setup).
3. Monit, on detecting the app crash, would rush in, only to restart the app at lower rlimit values (similar to its own).
4. Once the app reached the 1024 descriptor limit, it quashed incoming connections.
5. Monit, on its patrol (http GET of /), fails to receive the app's health status and restarts it, resetting all connections.
6. From the browser angle, since the connections (which use socket.io) are closed, it would start its re-connection logic. With each connection running this logic, only a few would succeed while the rest succumb (reaching the 1024 limit) to a restart.

Out of the multiple ways to fix this, we ended up updating our Monit script to set higher file limits.
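The inheritance behind this rundown is easy to reproduce in a shell. The sketch below is illustrative only: 256 and 512 are arbitrary stand-in values (not our production numbers), run_with_nofile is a hypothetical wrapper, and the second call assumes the hard limit on the machine is at least 512.

```shell
#!/bin/sh
# Hypothetical wrapper: set the soft open-file limit, then run a command
# under it. The subshell keeps the caller's own limit untouched.
run_with_nofile() {
    limit=$1
    shift
    ( ulimit -S -n "$limit" && exec "$@" )
}

# 1. A child inherits whatever limit its parent has. The outer sh plays the
#    supervisor running with a low limit; its child reports what it sees.
run_with_nofile 256 sh -c 'sh -c "ulimit -n"'    # prints 256

# 2. The soft limit can be raised again (up to the hard limit) right before
#    the real command starts, even if the supervisor itself runs with a low one.
run_with_nofile 256 sh -c 'ulimit -S -n 512 && exec sh -c "ulimit -n"'    # prints 512
```

The second call has the same shape as putting ulimit -n in front of the command that a low-limit supervisor starts.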
The updated check, with ulimit -n 100000 added to the start program:

    check host node_web_app_0 with address 127.0.0.1
      GROUP node_web_app
      start program = "/bin/sh -c 'ulimit -n 100000; cmd_to_start_the_service'"
      stop program = "cmd_to_stop_the_service"
      if failed port 81 protocol HTTP
        request /
        with timeout 10 seconds
      then restart

About The Authors

Suman Kumar Dey is Technical Lead, DevOps, Freshdesk, and Suresh Kumar is a Technical Architect with Freshdesk. Alina Benny is a Corporate Marketing Associate with Freshdesk. Say hello to her on Twitter and LinkedIn.


The Case of the Delinquent Node.js Service
