This article is translated from tmq.qq.com (Tencent Mobile Quality Center)
One dark night, an O2O taxi service provider, ‘X’, put their new discount deals online. The campaign was a bit too successful for the back-end servers, and the deluge of requests crashed the query suggestion service (QSS). After the incident, the dev and ops teams had a joint meeting and decided to thoroughly investigate the capacity of the service ASAP!
The challenge in undertaking the task is to come up with answers to the following questions: 1) for this particular type of service (QSS), what are the vital performance metrics? 2) what are their implications? and 3) what are the pass criteria for the performance test?
In this case study, we first discuss the performance metrics we care about in general. Then we benchmark QSS for real and complete the circle by providing the in-practice solution based on the analysis.
Performance, like Hamlet, appears differently in the eyes of two different groups of people: the API users (i.e., ‘X’, the O2O taxi service platform) and the resource owner (i.e., us). ‘X’ can only see external metrics such as throughput, latency and error rate, while as the resource owner we also have to take into account internal metrics and factors, including but not limited to CPU and memory usage, I/O and network activities, system load, server hardware/software configuration, server architecture, service deployment strategy, etc.
In practice, the latency requirement should correspond to the specific service type. In our case it is <100 ms, since QSS users expect real-time feedback after every single character they type. On the other hand, users of a map & navigation service would find 2~5 seconds for a route suggestion acceptable.
We consider that the mean value alone is not sufficient to reflect the performance of an application; thus, we normally collect the mean, the .90 and .99 percentiles, as well as the data distribution (a sample is given in the following figures).
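As a rough sketch of how such percentiles can be computed offline, assuming a hypothetical file latency_ms.txt with one latency value (in ms) per line:
sort -n latency_ms.txt | awk '
  { v[NR] = $1; sum += $1 }
  END { printf "mean %.2f  p90 %.2f  p99 %.2f\n", sum / NR, v[int(NR * 0.90)], v[int(NR * 0.99)] }'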
Throughput indicates the volume of traffic a service handles. There is a notable correlation between throughput and latency in a performance test: latency increases along with the growth of throughput, simply because under high pressure a server generally responds more slowly than under a normal load. To be more specific, when throughput is low, the latency stays largely consistent regardless of throughput fluctuations; when the throughput goes beyond a threshold, the latency starts increasing linearly (sometimes close to exponentially); and when the throughput reaches the crisis point, the latency skyrockets. The crisis point is determined by the hardware configuration, the software architecture and network conditions.
Normally, errors caused by network issues (e.g., congestion) should not exceed 5% of the total requests, and errors caused by the application itself should not exceed 1%.
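To spot-check the application-side error rate after a run, a quick sketch assuming an nginx-style access log named access.log whose 9th field is the HTTP status code (both the file name and the log format are assumptions):
awk '{ total++; if ($9 >= 500) errors++ }
     END { printf "5xx error rate: %.2f%%\n", 100 * errors / total }' access.log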
Basically, a slow service can be identified by high latency, low throughput or both. A problematic service can be identified by high error rate.
As the workhorse, CPU largely determines a server’s performance. So let’s start with CPU usage, briefly,
when your code does not bother the kernel, it increases us, e.g.,
i++;  /* pure user-space arithmetic */
when your code bothers the kernel, it increases both us and sy, e.g.,
gettimeofday(&t, NULL);  /* calls into the kernel */
We use a relay server in production as an example and give the output of the top command in the following figure.
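The original screenshot is not reproduced here; purely as an illustration of the fields we refer to below (the numbers are made up, not taken from that relay server), the CPU summary of top looks like:
$ top -bn1 | head -n 3
top - 10:00:01 up 30 days,  2:11,  1 user,  load average: 0.52, 0.48, 0.45
Tasks: 120 total,   1 running, 119 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.3 us,  1.2 sy,  0.0 ni, 92.8 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st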
Next, we look at each metric in more detail:
rule of thumb, a high sy indicates that the server switches between user mode and kernel mode too often, which impacts the overall performance negatively; we need to profile the process for more information. Moreover, in a multi-core architecture, CPU0 is responsible for scheduling among cores, so CPU0 deserves extra attention because a high usage of CPU0 will affect the other CPU cores as well.
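One way to start such profiling, assuming the service runs as PID 12345 (a placeholder) and the sysstat and strace tools are installed:
# per-process user vs. kernel CPU time, sampled every second
pidstat -u -p 12345 1
# which system calls dominate; interrupt with Ctrl-C to print the per-call summary
sudo strace -c -p 12345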
rule of thumb, a service normally should not be given a favorable niceness that lets it hog CPU which could otherwise be used by other services. If it does, we need to check the system configuration and the service startup parameters.
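A quick way to see which processes enjoy a favorable niceness (standard procps ps, no service-specific assumptions):
# lowest (most favorable) niceness first
ps -eo pid,ni,pri,comm --sort=ni | head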
rule of thumb, for any online service, we normally preserve a certain amount of id to deal with throughput spikes. However, low id combined with low throughput (i.e., a slow service) means that some processes are using the CPU excessively. If this only happens after a recent release, it is highly likely that some new code introduced a hefty big-O, so better check the newly merged pull requests; if not, it may be time for horizontal scaling.
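To see which processes are burning the CPU when id drops, one simple check (standard procps ps, sorted by CPU usage):
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head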
rule of thumb, if high wa is observed in a service that is not I/O intensive, we need to a) run iostat to check the detailed I/O activities from the system perspective, and b) check the logging strategy, data load frequency, etc., from the application perspective. A low wa in a slow service only rules out disk I/O, which makes network activity the next checkpoint.
rule of thumb, hi should stay in a reasonable range for most services, while si is normally higher in I/O-intensive services than in ordinary ones.
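To get a feel for where hardware and soft interrupts are going, the kernel's own counters can be inspected directly, e.g.:
# hardware interrupts per CPU
head /proc/interrupts
# soft-interrupt counters; NET_RX/NET_TX grow quickly for network-heavy services
grep -E 'NET_RX|NET_TX' /proc/softirqs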
Technically, CPU load averages (a.k.a. run-queue length) count the number of processes that are either running or waiting to run.
Linux provides several commands for checking load averages, such as top and uptime. The output of both commands includes the load averages sampled over the last 1 minute, last 5 minutes and last 15 minutes.
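The same three numbers can also be read straight from /proc/loadavg (the values below are illustrative only):
$ cat /proc/loadavg
0.52 0.48 0.45 1/123 4567
The first three fields are the 1-, 5- and 15-minute load averages.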
In a system with one CPU core, 1 is the ideal load: at that load the CPU is fully saturated and running right at its threshold, so any value above 1 indicates some level of issue. Likewise, in multi-core systems, the number of cores is the actual threshold, which can be checked with:
cat /proc/cpuinfo|grep 'model name'|wc -l
rule of thumb, in production, 70%~80% of the threshold is our experience value for servers running in an ideal state: most of the server's capacity is utilized while there is still margin for potential throughput spikes. In performance tests, the system load may approach but should not exceed the threshold during a pressure test; it should not exceed 80% of the threshold during a concurrency test; and it should stay around 50% during a stability test.
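A minimal sketch of turning this rule of thumb into an automated check (the 80% factor follows the text above; nproc and /proc/loadavg are standard on Linux):
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)
# warn when the 1-minute load exceeds 80% of the core count
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > 0.8 * c) printf "load %.2f exceeds 80%% of %d cores\n", l, c }'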
Difference between VIRT and RES is like the difference between a budget and real money: VIRT is how much memory a process claims to use, while RES is how much physical memory it is actually using. To be specific,
int *i = malloc(1024 * sizeof(int));
claims the memory (VIRT grows), whilst
*i = 1;
starts actually using it (RES grows).
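The two figures can be compared for a live process with ps, where VSZ roughly corresponds to VIRT and RSS to RES (PID 12345 is a placeholder):
# sizes are reported in kilobytes
ps -o pid,vsz,rss,comm -p 12345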
bandwidth
nethogs, like top, runs in a real-time manner.
Because we are investigating a query service, most likely bandwidth is not a concern.
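An illustrative invocation, assuming the traffic goes through an interface named eth0 (root privileges are usually required):
sudo nethogs eth0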
connection state
In performance tests we monitor connection state changes and exceptions. For services based on TCP and HTTP, we need to monitor the established TCP connection states, the in-test process' network buffers, the TIME_WAIT state and the connection count. We use netstat and ss for this purpose.
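A few concrete checks along these lines (standard iproute2/net-tools invocations, no service-specific assumptions):
# overall socket summary
ss -s
# number of connections stuck in TIME_WAIT (skip the header line)
ss -tan state time-wait | tail -n +2 | wc -l
# per-state counts of all TCP connections
netstat -ant | awk 'NR > 2 { print $6 }' | sort | uniq -c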
Reading from or writing to the disk too frequently puts the service in the I/O-waiting state (counted as wa), which causes long latency and low throughput.
We use iostat to monitor disk I/O:
We can gather basic information from the command above. For more information, we need to add -x.
N.b., the numbers listed above are in kilobytes.
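As a concrete example of the extended mode (column names vary slightly across sysstat versions; r/s, w/s, await and %util are usually the most telling):
# extended stats in kilobytes, 3 samples at 1-second intervals
iostat -xk 1 3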
Throughput reached the upper limit while CPU load is low: normally this is caused by a lack of resources allocated to the service. If such a phenomenon is noticed, we need to check ulimit, the thread count and the memory allowance of the process in test.
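A few quick checks for this situation, assuming the process under test is PID 12345 (a placeholder):
# limits of the current shell (e.g., max open files)
ulimit -n
# limits actually applied to the running process
cat /proc/12345/limits
# number of threads the process has spawned
ps -o nlwp= -p 12345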
us and sy are low, and wa is high: high wa is normal if the service is I/O intensive; otherwise, high wa means problems. There are roughly two potential causes: a) application-logic flaws, be it too many read/write requests, overly large I/O payloads, an irrational data-loading strategy or excessive logging; and b) memory deficiency that leads to excessive swapping.
Latency jitter for the same API: if the throughput is normal, this is usually caused by unnecessary contention, due to a) flaws in the inter-thread/inter-process locking logic; or b) limited resources (e.g., threads or memory allocated to the service), so that it frequently has to wait for resources to become available.
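One hedged way to confirm such contention is to watch the process' context switches with pidstat (part of sysstat; PID 12345 is a placeholder):
# cswch/s = voluntary, nvcswch/s = involuntary context switches per second;
# a high involuntary rate often hints at lock or CPU contention
pidstat -w -p 12345 1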
Memory usage increases over time: under constant throughput, this is a typical indicator of a memory leak. We need to use valgrind to profile the service process.
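An illustrative invocation (./qss_service stands in for the actual service binary, which is not named in the original):
valgrind --leak-check=full --show-leak-kinds=all --log-file=valgrind.log ./qss_service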
Let’s get back to the investigation of QSS. First, we look at the general architecture of the system under test.
As the figure shows, the underlying data layer determines the performance upper limit, i.e., 3500 qps. Hence, the task is to find the thresholds for the rest of the layers under this prerequisite.
This section briefly explains the test process. The full cycle of a performance test is given in the following figure.
preparation
in test
We use JMeter to simulate user requests. The performance test config files include 1) data set config (thread sharing mode, new-line behavior, etc.), 2) throughput control, 3) HTTP sampler (domain, port, HTTP method, request body, etc.), and 4) response assertion. We give a sample config in the following figures:
data set config
throughput control
HTTP sampler
response assertion
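For reference, such a test plan is typically run in non-GUI mode; the file names below are placeholders:
# -n non-GUI, -t test plan, -l raw results, -e/-o generate an HTML report
jmeter -n -t qss_test.jmx -l result.jtl -e -o report/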
During the test, we also measure the performance metrics using the utilities discussed in the previous sections, and we collect the data generated in the test process to check for anything abnormal.
test result
The test result should include:
We concluded that within QSS, the throughput threshold of a single node is 300 qps, and that of the whole cluster is 1800 qps. The peak throughput on that dark night was around 5000+ qps, which crashed the whole system as the load far exceeded the capacity.
In the end, we scaled up the QSS cluster and set a throughput quota for ‘X’. Through this practice, we secured the taxi booking service, which in turn offers a smoother UX for our users who rely on it to get to bars, dinners and dates. With the enhanced success rate of QSS and all sorts of derivative activities, we have made the world a better place.
In the next post, I will continue discussing more techniques for back-end server design and implementation.
If you like this read please clap for it or follow by clicking the buttons. Thanks, and hope to see you the next time.