Alibaba Tech

@alitech_2017

TCP Three-Way Handshake

How well do you really know it? — Best practice from the Alibaba tech team

Ren Xijun is a member of Alibaba’s middleware technology team. Recently, he encountered a problem with a client-side communication server that was constantly throwing an exception. To his dismay, despite scouring the Internet and making repeated attempts to locate the cause, he could not find anything that explained the two queues involved in establishing a TCP connection or how to observe their metrics.

Undeterred, he took it upon himself to get to the bottom of the issue. He wrote this article to record how he identified and resolved the issue.

An Annoying Problem

A Java client and server were communicating over a socket, with the server built on NIO. The following symptoms were observed:

· Intermittently, the client and the server completed a three-way handshake to create a connection, but the listen socket then did not respond.

· The problem then occurred in many other connections at the same time.

· The NIO selector was never destroyed and recreated; the original selector was used throughout.

· The problems occurred when the program was started, and appeared intermittently thereafter.

Recap: How Does a TCP Three-Way Handshake Work?

The first thing I did was to remind myself of the standard process for a three-way handshake when establishing a TCP connection. The standard process takes place as follows:

1. The client sends a SYN packet to the server to initiate a handshake.

2. Upon receipt of this, the server sends a SYN-ACK packet to the client.

3. Finally, the client sends an ACK packet to the server to indicate that it has received the server’s SYN-ACK packet. (By this point, the connection to the server has already been established through port 56911 of the client.)

Process of a TCP three-way handshake
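
The handshake steps above can be observed with a minimal loopback sketch. Python is used here purely for illustration (the original system was written in Java), and the function name handshake_demo, the loopback address, and the backlog of 5 are assumptions made for this example. It shows that connect() returns on the client before the server application has called accept(), because the operating system completes the handshake and queues the connection on its own.

```python
import socket


def handshake_demo():
    """Complete a TCP handshake on loopback before accept() is ever called."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.settimeout(5)
        server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        server.listen(5)                 # backlog of 5 is an arbitrary choice
        port = server.getsockname()[1]

        client.settimeout(5)
        # connect() returns once the client has received the SYN-ACK
        # and sent its ACK; the server application has not accepted yet.
        client.connect(("127.0.0.1", port))
        client_addr = client.getsockname()

        # Only now does the application take the connection from the queue.
        conn, peer_addr = server.accept()
        conn.close()
        return client_addr, peer_addr
    finally:
        client.close()
        server.close()
```

Calling handshake_demo() returns the client's local address and the peer address seen by the server; on loopback the two should be identical, including the ephemeral client port.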

A Quick Fix

Judging by the description, the problem resembled what happens when the TCP complete connection queue (also called the accept queue, discussed later) is full while a TCP connection is being established. To confirm this, I checked the queue’s overflow statistics via netstat -s | egrep "listen".

667399 times the listen queue of a socket overflowed

After checking three times, I found that the value was increasing continuously. It was clear that the accept queue on the server had overflowed.

Next, I checked how the OS handles the overflow.

# cat /proc/sys/net/ipv4/tcp_abort_on_overflow
0

With tcp_abort_on_overflow being 0, if the accept queue is full in the third step of the three-way handshake, the server throws away the ACK packet sent by the client, as it presumes that the connection has not been established on the server side.

To prove that the exception was related to the complete connection queue, I first changed tcp_abort_on_overflow to 1. If the complete connection queue was full in the third step, the server would send a reset packet to the client, indicating that it should end both the handshake process and the connection. (The connection was in fact not established on the server side.)

I then proceeded with the test, finding that there were a number of “connection reset by peer” exceptions in the client. We came to the conclusion that the complete connection queue’s overflow was in turn causing the client error, which helped us quickly identify key parts of the problem.

The development team looked at the Java source code and found that the default value of the backlog of the socket was 50 (this value controls the size of the complete connection queue and will be detailed later). I increased the value and ran it again, and after over 12 hours of stress testing, I noticed that the error wasn’t showing up anymore and that the overflow also wasn’t increasing.

So, it’s as simple as that. Once the TCP three-way handshake completes, the connection is placed in the complete connection queue, and only from this queue can the server’s accept call pick it up. The default backlog of 50 is easy to overflow. If the queue overflows at the third step of the handshake, the server ignores the ACK packet sent by the client and repeats the second step (sending the SYN-ACK packet to the client) at regular intervals. If the client times out before the connection is queued, the client sees an exception.

But although we had solved the problem, I still wasn’t satisfied. I wanted to use this whole encounter as a learning experience, so I looked into the problem further.

Delving Deeper: TCP Handshake Process and Queues

Two queues are involved on the server side: a SYN queue (or incomplete connection queue) and an accept queue (or complete connection queue).

In the three-way handshake, after receiving a SYN packet from the client, the server places the connection information in the SYN queue and sends a SYN-ACK packet back to the client.

The server then receives an ACK packet from the client. If the accept queue isn’t full, the server removes the connection information from the SYN queue and puts it into the accept queue; otherwise, it acts as tcp_abort_on_overflow instructs.

At this point, if the accept queue is full and tcp_abort_on_overflow is 0, the server sends a SYN-ACK packet to the client again after a certain period of time (in other words, it repeats the second step of the handshake). If the client is configured with even a fairly short timeout, it can easily hit an exception.
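
The interplay of the two queues can be sketched as a toy model. This is a deliberately simplified illustration, not kernel code: the class ListenSocketModel, its method names, and the returned status strings are all hypothetical, and real kernels apply additional rules that are ignored here.

```python
from collections import deque


class ListenSocketModel:
    """Toy model of the SYN queue and accept queue of one listening socket."""

    def __init__(self, accept_backlog, abort_on_overflow=0):
        self.accept_backlog = accept_backlog
        self.abort_on_overflow = abort_on_overflow
        self.syn_queue = set()        # half-open connections (SYN received)
        self.accept_queue = deque()   # established, waiting for accept()
        self.overflows = 0

    def on_syn(self, client):
        """Steps 1 and 2: remember the client and answer with SYN-ACK."""
        self.syn_queue.add(client)
        return "SYN-ACK"

    def on_ack(self, client):
        """Step 3: try to move the connection into the accept queue."""
        if client not in self.syn_queue:
            return "IGNORED"
        if len(self.accept_queue) >= self.accept_backlog:
            self.overflows += 1
            if self.abort_on_overflow:
                self.syn_queue.discard(client)
                return "RST"
            # Entry stays in the SYN queue; the SYN-ACK is resent later.
            return "ACK-DROPPED"
        self.syn_queue.discard(client)
        self.accept_queue.append(client)
        return "ESTABLISHED"

    def accept(self):
        """The application takes the oldest established connection, if any."""
        return self.accept_queue.popleft() if self.accept_queue else None
```

With accept_backlog=2 and no accept() calls, the third client's ACK is dropped in this model; once the application accepts one connection, a retried ACK from that client succeeds. Setting abort_on_overflow=1 turns the dropped ACK into a reset instead.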

In our OS, the second step is retried twice by default (five times for CentOS).

net.ipv4.tcp_synack_retries = 2
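
To get a feel for what a retry count means in wall-clock time, the sketch below computes when each SYN-ACK retransmission would happen. It assumes an initial retransmission timeout of one second that doubles after every attempt; both the one-second value and the doubling are assumptions of this example, and actual timing depends on the kernel version and configuration. The function name is hypothetical.

```python
def synack_retransmit_times(retries, initial_rto=1.0):
    """Return (retransmission times in seconds, time at which the server gives up).

    Assumes the timeout starts at initial_rto and doubles after each attempt.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto        # wait one timeout, then retransmit
        times.append(elapsed)
        rto *= 2              # exponential backoff
    give_up = elapsed + rto   # final wait after the last retransmission
    return times, give_up
```

Under these assumptions, a client whose connect timeout is shorter than the first retransmission interval will fail before the server ever retries, which is consistent with the client exceptions described above.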

A New Approach

The solution detailed above is a tad confusing, and you may be wondering whether there’s a faster or simpler way to solve these problems. Let’s start by taking a look at some useful commands.

Commands

netstat -s

[root@server ~]# netstat -s | egrep "listen|LISTEN"
667399 times the listen queue of a socket overflowed
667399 SYNs to LISTEN sockets ignored

Here, 667399 is the number of times the accept queue has overflowed. Run this command every few seconds; if the number keeps increasing, the accept queue is overflowing.
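
Comparing the counter across samples can be automated. The sketch below parses the two lines shown above and compares two samples; the function names are hypothetical, and the regular expressions assume exactly the wording printed in this article, which may differ between netstat versions.

```python
import re

# Patterns assume the exact wording shown in the netstat -s output above.
OVERFLOW_RE = re.compile(r"(\d+) times the listen queue of a socket overflowed")
IGNORED_RE = re.compile(r"(\d+) SYNs to LISTEN sockets ignored")


def parse_listen_counters(netstat_output):
    """Extract the listen-queue counters from `netstat -s` text."""
    overflow = OVERFLOW_RE.search(netstat_output)
    ignored = IGNORED_RE.search(netstat_output)
    return {
        "overflowed": int(overflow.group(1)) if overflow else 0,
        "syns_ignored": int(ignored.group(1)) if ignored else 0,
    }


def is_overflowing(earlier, later):
    """True if the overflow counter grew between two samples."""
    return later["overflowed"] > earlier["overflowed"]
```

In practice the text could come from subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout, sampled a few seconds apart.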

ss command

[root@server ~]# ss -lnt
Recv-Q Send-Q Local Address:Port Peer Address:Port
0      50     *:3306             *:*

Here, the Send-Q value in the second column is 50, indicating that the accept queue of the listening port (shown in the third column) can hold at most 50 connections. The first column, Recv-Q, indicates how much of the accept queue is currently in use.
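
Reading these two columns can also be scripted. The following sketch parses ss -lnt text into records and flags listeners whose accept queue is at or over its limit. The function name and the record keys are hypothetical, and the parser assumes the whitespace-separated layouts shown in this article (with or without a leading State column).

```python
def parse_ss_listen(ss_output):
    """Parse `ss -lnt` text into one record per listening socket."""
    rows = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Some ss versions prefix each row with a State column (LISTEN).
        if fields and fields[0] == "LISTEN":
            fields = fields[1:]
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # header line or anything that is not a data row
        recv_q, send_q = int(fields[0]), int(fields[1])
        rows.append({
            "local": fields[2],
            "in_use": recv_q,           # Recv-Q: connections waiting in the accept queue
            "limit": send_q,            # Send-Q: maximum accept queue size
            "full": recv_q >= send_q,
        })
    return rows
```

A monitoring script could run this periodically and alert when any record has "full" set, rather than waiting for client-side exceptions.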

The size of the accept queue is min(backlog, somaxconn). The backlog is supplied by the application when it puts the socket into the listening state (the listen call), and somaxconn is an OS-level system parameter.
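
The min(backlog, somaxconn) rule can be written down as a small helper. This is a sketch under the assumption that the system exposes the limit at /proc/sys/net/core/somaxconn, as Linux does; the function names are hypothetical.

```python
from pathlib import Path

# Location of the OS-level limit on Linux (an assumption about the target system).
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path=SOMAXCONN_PATH):
    """Read the current somaxconn value from the proc filesystem."""
    return int(Path(path).read_text().strip())


def effective_accept_queue(backlog, somaxconn):
    """Accept queue size: the requested backlog, capped by somaxconn."""
    return min(backlog, somaxconn)
```

For example, a backlog of 50 with somaxconn at 128 yields a queue of 50, while a backlog of 1024 with the same somaxconn is capped at 128, so raising the backlog in the application alone may not be enough.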

At this point, we can connect these parameters to our code. For example, when you create a ServerSocket in Java, the constructor lets you pass in the value of the backlog.

(Source: https://docs.oracle.com/javase/7/docs/api/java/net/ServerSocket.html)

The size of the SYN queue is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog), although the exact rule may differ between OS versions.

netstat command

Send-Q and Recv-Q can also be shown via the netstat command, just as with the ss command. However, if the connection is not in the Listen state, Recv-Q is the number of bytes that have been received into the buffer but not yet read by the process, and Send-Q is the number of bytes in the send queue that have not been acknowledged by the remote host.

$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address      State
tcp        0      0 100.81.180.187:8182  10.183.199.10:15260  SYN_RECV
tcp        0      0 100.81.180.187:43511 10.137.67.18:19796   TIME_WAIT
tcp        0      0 100.81.180.187:2376  100.81.183.84:42459  ESTABLISHED

It is important to note that the Recv-Q data shown by netstat -tn has nothing to do with the accept queue or SYN queue. This must be emphasized here so as not to confuse it with the Recv-Q data shown by ss -lnt.

For example, if netstat -t shows that Recv-Q has accumulated a lot of data, the cause is generally that the receiving process is not reading fast enough, typically because of a CPU processing problem.

Verification Process

To verify the information detailed above, change the backlog value in Java to 10 (the smaller the value, the easier it is to overflow) and run the stress test again. The client then starts to report exceptions, and the following can be observed via the ss command on the server.

Fri May  5 13:50:23 CST 2017
Recv-Q Send-Q Local Address:Port Peer Address:Port
11     10     *:3306             *:*

Here we can see that the accept queue of the service on port 3306 holds at most 10 connections, yet 11 connections are now in the queue. Further connections cannot be queued, so the queue overflows. The overflow counter does indeed keep increasing at the same time.

Accept Queue Size in Tomcat and Nginx

Tomcat uses transient (short-lived) connections by default. In Ali-Tomcat, the default value of the backlog (called acceptCount in Tomcat) is 200. In Apache Tomcat, it’s 100.

# ss -lnt
Recv-Q Send-Q Local Address:Port Peer Address:Port
0      100    *:8080             *:*

In Nginx, the default value of the backlog is 511.

$ sudo ss -lnt
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      511    *:8085             *:*
LISTEN 0      511    *:8085             *:*

Nginx runs in multi-process mode, so port 8085 appears more than once in the output: multiple worker processes listen on the same port, both to avoid context switching and to improve performance.

Summary

When an overflow occurs, CPU usage and thread states look normal, yet the load the server handles cannot be pushed any higher. From the client’s perspective, response time (network + queue + service time) is high, but the true service time recorded in the server log is actually very short. The default backlog in some frameworks, such as the JDK and Netty, is small, which may lead to performance issues in some cases.

I hope this article serves to help you understand the concepts, principles, and functions of the SYN queue and the accept queue in the establishment of a TCP connection. The overflow problem with accept queues and SYN queues is easily neglected, but it’s critical, especially in scenarios where transient connections are used (such as Nginx and PHP, although they also support persistent connections).

(Original article by Ren Xijun, 任喜军)
