This article is translated from tmq.qq.com (Tencent Mobile Quality Center).
One dark night, an O2O taxi service provider, ‘X’, put their new discount deals online. The campaign was a bit too successful for the back-end servers, and the deluge of requests crashed the query suggestion service (QSS). After the incident, the dev and ops teams held a joint meeting and decided to thoroughly investigate the capacity of the service ASAP!
To undertake the task, we have to come up with answers to the following questions: 1) for this particular type of service (QSS), what are the vital performance metrics? 2) what are their implications? and 3) what are the pass criteria for the performance test?
In this case study, we first discuss the performance metrics we care about on a general basis. Then we benchmark QSS for real, and complete the circle by providing the in-practice solution based on the analysis.
The same Hamlet appears differently in the eyes of two different groups of people: the API users (i.e., ‘X’, the O2O taxi service platform) and the resource owner (i.e., us). ‘X’ can only see external metrics such as throughput, latency and error rate. As the resource owner, we also have to take into account internal metrics and factors, including but not limited to CPU and memory usage, I/O and network activity, system load, server hardware/software configuration, server architecture, service deployment strategy, etc.
In practice, the latency requirement should correspond to the specific service type. In our case it is <100ms, since QSS users expect real-time feedback after every single character they type. On the other hand, users of a map & navigation service would find 2~5 seconds for a route suggestion acceptable.
We consider that the mean value alone is not sufficient to reflect the performance of an application; thus, we normally collect the mean value, the 90th and 99th percentiles (.90, .99), as well as the data distribution (a sample is given in the following figures).
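As an illustration of why the mean alone misleads, here is a minimal sketch of collecting the mean, p90 and p99 with the nearest-rank method; the latency samples are made up for illustration:

```python
import math

# Sketch: summarizing latency samples with mean, p90 and p99 using the
# nearest-rank method. The sample values are made up for illustration.

def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 1), nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 95, 16, 14, 13, 15, 240]  # hypothetical
mean = sum(latencies_ms) / len(latencies_ms)
p90, p99 = percentile(latencies_ms, 0.90), percentile(latencies_ms, 0.99)
print(f"mean={mean}ms p90={p90}ms p99={p99}ms")
```

With these samples the mean looks acceptable while p99 reveals the slow tail, which is exactly why we report all three.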
Throughput indicates the volume of traffic a service handles. There is a well-known correlation between throughput and latency in a performance test: latency increases along with the growth of throughput, simply because under high pressure a server generally responds more slowly than under normal load. To be more specific, when throughput is low, latency largely stays consistent regardless of throughput fluctuation; when throughput grows past a threshold, latency starts increasing linearly (sometimes close to exponentially); and when throughput reaches the crisis point, latency skyrockets. The crisis point is determined by hardware configuration, software architecture and network conditions.
Normally, errors caused by network issues (e.g., congestion) should not exceed 5% of total requests, and errors caused by the application should not exceed 1%.
Basically, a slow service can be identified by high latency, low throughput or both. A problematic service can be identified by high error rate.
As the workhorse, CPU largely determines a server’s performance. So let’s start with CPU usage, briefly,
- us: the percentage of CPU time spent in user mode;
when your code does not call into the kernel, it increases us, e.g., pure arithmetic in a tight loop;
- sy: the percentage of CPU time spent in kernel mode;
when your code calls into the kernel, it increases both us and sy, e.g., a loop of system calls such as read();
- ni: the percentage of CPU time spent executing niced processes;
- id: the percentage of time the CPU idles;
- wa: the percentage of time the CPU waits for I/O;
- hi: the percentage of time the CPU spends servicing hardware interrupts;
- si: the percentage of time spent on software interrupts.
We use a relay server in production as an example; the output of the
top command is given in the following figure.
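To make the us/sy split concrete, here is a small sketch that burns CPU purely in user mode versus through system calls, and reads the process's own user/system CPU time with os.times(); the function names and iteration counts are our own illustration:

```python
import os

# Sketch: how user-mode vs kernel-mode work shows up as `us` vs `sy`.
# Function names and iteration counts are made up for illustration.

def user_mode_work(n=500_000):
    # A tight arithmetic loop never enters the kernel: pure `us`.
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def kernel_mode_work(n=20_000):
    # Every os.stat() is a system call, so time accrues in `sy` too.
    for _ in range(n):
        os.stat(".")

before = os.times()   # .user and .system CPU seconds so far
user_mode_work()
kernel_mode_work()
after = os.times()
print(f"user CPU: {after.user - before.user:.3f}s, "
      f"system CPU: {after.system - before.system:.3f}s")
```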
Next, we look at each metric in more detail,
- us & sy: in the majority of back-end servers, most CPU time is spent in us and sy. us indicates the CPU time spent in user mode and sy indicates that in kernel mode. The two compete for the fixed CPU capacity: the higher us goes, the lower sy goes, and vice versa;
rule of thumb, a high sy indicates the server switches between user mode and kernel mode too often, which impacts overall performance negatively; we need to profile the process for more information. Moreover, in a multi-core architecture, CPU0 is responsible for scheduling among cores, so CPU0 deserves extra notice: high usage of CPU0 will affect the other CPU cores as well.
- ni: every Linux process has a priority, and the CPU executes processes with higher priority values (pri) first. Besides, Linux defines niceness for fine-tuning priority;
rule of thumb, a service normally should not have a favorable niceness that lets it hog CPU which could otherwise be used by other services. If it does, we need to check the system configuration and the service startup parameters.
- id: indicates the amount of CPU time that is not used by any processes;
rule of thumb, for any online service, we normally preserve a certain amount of id to deal with throughput spikes. However, low id plus low throughput (i.e., a slow service) means that some processes are using the CPU excessively. If this only happens after a recent release, it is highly possible that some new code introduced hefty algorithmic complexity, so better check the newly merged pull requests. If not, maybe it is time for horizontal scaling.
- wa: it indicates time spent waiting for disk I/O activities (local disk or NFS, but NOT network waits);
rule of thumb, if high wa is observed in a service that is not I/O intensive, we need to a) run
iostat to check the detailed I/O activities from the system perspective, and b) check the logging strategy and data-loading frequency etc. from the application perspective. A low wa in a slow service can only rule out disk I/O, which highlights network activities as the next checkpoint.
- hi & si: basically, hi and si together show the CPU time spent on processing interrupts.
rule of thumb, hi should stay in a reasonable range for most services; si is normally higher in I/O-intensive services than in ordinary ones.
Technically, CPU load averages (a.k.a. run-queue length) count the number of processes that are running or waiting to run.
Linux provides several commands for checking load averages, such as
top and
uptime. The output of the two commands includes load averages sampled over the last 1 minute, last 5 minutes and last 15 minutes.
In systems with one CPU core,
1 is the ideal load: at that load the CPU is fully saturated and running at its threshold. Hence any value above
1 indicates some level of issues. Likewise, in multi-core systems, the number of cores is the actual threshold, which can be checked with:
cat /proc/cpuinfo | grep 'model name' | wc -l
rule of thumb, in production, 70%~80% of the threshold is our experience value for servers running in an ideal state: most of the server's capacity is utilized while there is still margin for potential throughput increases. In performance tests, the system load can be close to, but should not exceed, the threshold value during a pressure test; it should not exceed 80% during a concurrency test; and it should be maintained around 50% during a stability test.
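The rule of thumb above is easy to script; the following sketch applies it with Python's os.getloadavg() and os.cpu_count() (Unix only) — the cut-offs and messages are our own illustration, not a standard:

```python
import os

# Sketch: comparing the 1-minute load average against the core count,
# following the 70%~80% rule of thumb. Thresholds and wording are ours.
cores = os.cpu_count()
load_1m, load_5m, load_15m = os.getloadavg()  # Unix only

if load_1m > cores:
    print("overloaded: run queue exceeds the core count")
elif load_1m > 0.8 * cores:
    print("near capacity: little margin left for spikes")
else:
    print("healthy: headroom remains for throughput spikes")
```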
- VIRT: the virtual memory used by a process;
- RES: the physical memory used by a process;
The difference between VIRT and RES is like the difference between a budget and real money: VIRT is how much memory a process claims to use, and RES is how much physical memory it is actually using. To be specific,
int *i = malloc(1024 * sizeof(int));
claims the memory, whilst
*i = 1;
starts actually using it.
- SHR: the amount of memory shared with other processes;
- SWAP: the size of the swap area for a process. The OS swaps some data to disk when physical memory falls short, which is one of the causes of high wa;
- DATA: the sum of the stack and heap used by a process.
All of the above can be observed with
top, which runs in a real-time manner.
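top itself reads these counters from /proc; a minimal Linux-specific sketch of pulling a process's own VmSize (VIRT) and VmRSS (RES) the same way:

```python
# Sketch (Linux-specific): top reads VIRT and RES from /proc; here we
# pull our own process's VmSize (VIRT) and VmRSS (RES) the same way.

def memory_kb():
    fields = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":")
                fields[key] = int(value.split()[0])  # reported in kB
    return fields["VmSize"], fields["VmRSS"]

virt, res = memory_kb()
print(f"VIRT={virt} kB, RES={res} kB")  # VIRT >= RES, as with budgets
```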
Because we are investigating a query service, bandwidth is most likely not a concern.
In the performance test we monitor connection state changes and exceptions. For services based on TCP and HTTP, we need to monitor the established TCP connections, the in-test process's network buffers, TIME_WAIT states and the connection count. We use
ss for this purpose.
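A sketch of the kind of summary ss -s gives, built by counting the state column of /proc/net/tcp directly (Linux-specific; only the states we care about here are mapped, everything else is bucketed as OTHER):

```python
from collections import Counter

# Sketch (Linux-specific): a summary similar to `ss -s`, built by
# counting the state column of /proc/net/tcp.
STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def tcp_state_counts():
    counts = Counter()
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            state_hex = line.split()[3]  # 4th column is the state code
            counts[STATES.get(state_hex, "OTHER")] += 1
    return counts

print(tcp_state_counts())
```

A sudden pile-up of TIME_WAIT entries during a test is the kind of change this check is meant to catch.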
Reading/writing the disk too frequently leaves the service in the I/O-wait state (as counted by wa), which causes long latency and low throughput. We use
iostat to monitor disk I/O:
- tps: the number of I/O requests issued per second. Because multiple logical requests can be merged into one I/O request, the exact request count and the size per request cannot be fully reflected by this number;
- kB_read/s: the size of data read from the device per second;
- kB_wrtn/s: the size of data written to the device per second;
- kB_read: total amount of data read from the device;
- kB_wrtn: total amount of data written to the device.
We can gather basic information from the command above. For more information, we need to add the -x option:
- rrqm/s: how many read requests are merged per second (if the file system notices that requests sent from the OS read from the same block, it merges them);
- wrqm/s: how many write requests are merged;
- await: the average waiting time for all I/O requests (milliseconds);
- %util: the percentage of time spent on I/O. This parameter implies how busy the device is.
N.b., the unit of the kB_read/kB_wrtn numbers listed above is kilobytes.
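iostat derives its numbers from /proc/diskstats; a Linux-specific sketch that reads the raw completed and merged read/write counts per device (cumulative since boot, not per-second rates):

```python
# Sketch (Linux-specific): reading the raw per-device counters that
# iostat turns into rates. Values are cumulative since boot.

def disk_counters():
    devices = {}
    with open("/proc/diskstats") as f:
        for line in f:
            p = line.split()
            devices[p[2]] = {                # p[2] is the device name
                "reads": int(p[3]),          # reads completed
                "reads_merged": int(p[4]),   # raw rrqm counterpart
                "writes": int(p[7]),         # writes completed
                "writes_merged": int(p[8]),  # raw wrqm counterpart
            }
    return devices

for name, c in disk_counters().items():
    print(name, c)
```

Sampling these counters twice and dividing by the interval gives the same per-second figures iostat reports.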
Some common performance issues
Throughput reaches its upper limit while CPU load is low: normally this is caused by a lack of resources allocated to the service. If such a phenomenon is noticed, we need to check
ulimit, the thread count and the memory allowance of the process in test.
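A quick way to check the first of these from inside the process: the sketch below reads the open-file-descriptor allowance (what ulimit -n shows) via the stdlib resource module:

```python
import resource

# Sketch: checking the open-file-descriptor allowance (what
# `ulimit -n` prints) from inside the process under test.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft}, hard={hard}")
# A service that accepts thousands of connections but runs with a
# small soft limit will hit a ceiling long before the CPU is busy.
```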
us and sy are low, and wa is high: high wa is normal if the service is I/O intensive; otherwise, high wa means problems. There are roughly two potential reasons: a) application-logic flaws, be it too many read/write requests, too-large I/O payloads, an irrational data-loading strategy or excessive logging; and b) memory deficiency causing excessive swapping.
Latency jitter for the same API: if the throughput is normal, this is usually caused by unnecessary contention, due to a) flaws in inter-thread/process locking logic; or b) limited resources (e.g., threads or memory allocated to the service), forcing it to wait for available resources frequently.
Memory usage increases over time: under constant throughput, this is a typical indicator of a memory leak. We need to use
valgrind to profile the service process.
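valgrind targets native code; for a service written in Python, the stdlib tracemalloc module offers a comparable snapshot diff. A sketch with a deliberately contrived leaking handler:

```python
import tracemalloc

# Sketch: diffing two tracemalloc snapshots to find the allocation
# site that keeps growing. The leaking handler below is contrived.

leak = []  # grows on every "request" and is never released

def handle_request():
    leak.append(bytearray(10_000))

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(100):
    handle_request()
after = tracemalloc.take_snapshot()

top = after.compare_to(before, "lineno")[0]  # biggest growth first
print(top)
tracemalloc.stop()
```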
Get back to work
Let’s get back to the investigation of the QSS. Firstly, we look at the general architecture of the system in test.
- Web API layer: translates HTTP requests into the socket protocol used internally;
- Proxy layer: relays the socket requests to the different business-logic modules;
- Business layer: a) compute service, which renders the output based on results calculated from data fetched from the data layer; b) error-correcting service, which corrects the query string;
- Data layer: the underlying data service.
As shown in the figure, the underlying data layer determines the performance upper limit, i.e., 3500qps. Hence, the task is to find the thresholds for the rest of the layers under this prerequisite.
This section briefly explains the test process. A full circle performance test is given in the following figure.
- Test data: we build the test data from the service log fetched from the day when the service crashed;
- Estimation of the threshold qps: obtaining this value is the primary purpose of this test;
- Service config: we use the exact same hardware/software config as the production environment.
We use
Jmeter to simulate user requests. The performance test config files include 1) data set config (thread sharing mode, newline behavior, etc.), 2) throughput control, 3) HTTP sampler (domain, port, HTTP method, request body, etc.), and 4) response assertions. We give a sample config in the following figures:
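For intuition, the essentials of this setup (throughput control, latency sampling, response assertion) can be sketched in a few lines; the stub server and the /suggest URL below are stand-ins of our own, not the real QSS endpoint:

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch: a tiny fixed-throughput load loop against a local stub.
# The stub and the /suggest URL are stand-ins, not the real QSS API.

class Stub(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the test output quiet

server = HTTPServer(("127.0.0.1", 0), Stub)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/suggest?q=a"

target_qps, seconds = 20, 1
latencies = []
for _ in range(target_qps * seconds):
    t0 = time.perf_counter()
    body = urllib.request.urlopen(url).read()
    latencies.append((time.perf_counter() - t0) * 1000)
    assert body == b"ok"        # response assertion
    time.sleep(1 / target_qps)  # crude fixed-throughput control
print(f"max latency: {max(latencies):.1f} ms")
server.shutdown()
```

JMeter does the same things at scale, with proper ramp-up, thread groups and reporting.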
In the test, we also measure the performance metrics using the utilities discussed in the previous sections. And we collect data generated in the test process to see if there is anything abnormal.
The test result should include:
- Test environment description: performance requirement, server config, test data source, test method etc;
- Performance statistic: latency, qps, CPU, memory, bandwidth etc;
- Test conclusion: the comparison between maximum qps, latency and the expected values, deployment suggestion etc.
We concluded that within QSS, the throughput threshold of a single node is 300qps, and that of the whole cluster is 1800qps. The peak throughput on that dark night was around 5000qps, which crashed the whole system as the load far exceeded the capacity.
End of the day
At last, we scaled up the QSS cluster and set a throughput quota for ‘X’. Through this practice, we secured the taxi-booking service, which in turn offers a smoother UX for the users who rely on it to go to bars, dinners and dates. With the enhanced success rate of QSS and all sorts of derivative activities, we have made the world a better place.
In the next post, I will continue discussing more techniques for back-end server design and implementation.
If you liked this read, please clap for it or follow by clicking the buttons. Thanks, and hope to see you next time.