This article is translated from tmq.qq.com (Tencent Mobile Quality Center).

One dark night, an O2O taxi service provider, 'X', put their new discount deals online. The campaign was a bit too successful for the back-end servers, and the deluge of requests crashed the query suggestion service (QSS). After the incident, the dev and ops teams held a joint meeting and decided to thoroughly investigate the capacity of the service ASAP!

To undertake the task, the challenge is to come up with answers to the following questions: 1) for this particular type of service (QSS), what are the vital performance metrics? 2) what are their implications? and 3) what are the performance test pass criteria? In this case study, we first discuss the performance metrics we care about in general. Then we benchmark QSS for real, and complete the circle by providing the in-practice solution based on the analysis.

Abstract

There are a thousand Hamlets in a thousand people's eyes: the service appears differently to two groups of people, the API users (i.e., 'X', the O2O taxi service platform) and the resource owner (i.e., us). 'X' can only see external metrics such as throughput, latency and error rate. As the resource owner, we also have to take account of internal metrics and factors including, but not limited to, CPU and memory usage, I/O and network activities, system load, server hardware/software configuration, server architecture and service deployment strategy, etc.

External metrics

latency

In practice, the latency requirement should correspond to the specific service type. In our case it is <100ms, since QSS users expect real-time feedback after every single character they type. On the other hand, users of a map & navigation service would find 2~5 seconds for a route suggestion acceptable.

We consider the mean value insufficient to reflect the performance of an application, so we normally collect the mean, the .90 and .99 percentiles, as well as the data distribution (a sample is given in the following figures). A minimal sketch of how these percentiles can be computed is given at the end of this section.

throughput

Throughput indicates the volume of traffic a service handles. There is a phenomenal correlation between throughput and latency in a performance test: latency increases along with the growth of throughput, simply because a server under high pressure is generally slower than one under normal load. To be more specific, when throughput is low, latency stays largely constant regardless of throughput fluctuation; when throughput rises above a threshold, latency starts increasing linearly (sometimes close to exponentially); and when throughput reaches the crisis point, latency skyrockets. The crisis point is determined by hardware configuration, software architecture and network conditions.

error rate

Normally, errors caused by network issues (e.g., congestion) should not exceed 5% of the total requests, and errors caused by the application should not exceed 1%.

Basically, a slow service can be identified by high latency, low throughput or both. A problematic service can be identified by a high error rate.
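To make the percentile bookkeeping above concrete, here is a minimal sketch (my own illustration, not from the original article) of deriving the mean, .90 and .99 values from a batch of collected latency samples; the sample values and the simple index-based percentile are purely illustrative:

/* percentiles.c - illustrative sketch of mean/p90/p99 over latency samples. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Simple percentile: index into the sorted array at fraction p of its length. */
static double percentile(const double *sorted, size_t n, double p) {
    size_t rank = (size_t)(p * (double)n);
    if (rank >= n) rank = n - 1;
    return sorted[rank];
}

int main(void) {
    /* Hypothetical latency samples in milliseconds. */
    double samples[] = {12, 15, 11, 340, 14, 13, 18, 95, 16, 12};
    size_t n = sizeof(samples) / sizeof(samples[0]);

    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += samples[i];

    qsort(samples, n, sizeof(double), cmp_double);

    printf("mean=%.1fms p90=%.1fms p99=%.1fms\n",
           sum / (double)n,
           percentile(samples, n, 0.90),
           percentile(samples, n, 0.99));
    return 0;
}

In practice the load tool (JMeter in our case, described later) reports these figures directly; the sketch only spells out what the .90/.99 numbers mean.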
Internal metrics

CPU

As the workhorse, the CPU largely determines a server's performance, so let's start with CPU usage. Briefly:

us: the percentage of CPU time spent in user mode; when your code does not bother the kernel, it increases us, e.g., i++;
sy: the percentage of CPU time spent in kernel mode; when your code bothers the kernel, it increases both us and sy, e.g., gettimeofday(&t, NULL);
ni: the CPU time spent executing processes that are niced;
id: the time a CPU spends idle;
wa: the time a CPU spends waiting for I/O;
hi: the time a CPU spends responding to hardware interrupts;
si: the percentage of time spent on software interrupts.

We use a relay server in production as an example and give the output of the top command in the following figure.

Next, we look at each metric in more detail.

us & sy: in the majority of back-end servers, most CPU time is spent in us and sy. us indicates the CPU time spent in user mode and sy indicates that spent in kernel mode. The two compete for the same fixed CPU capacity: the higher us goes, the lower sy goes, and vice versa. Rule of thumb: a high sy indicates the server switches between user mode and kernel mode too often, which impacts overall performance negatively, and we need to profile the process for more information. Moreover, in a multi-core architecture CPU0 is responsible for scheduling among the cores, so CPU0 deserves extra attention: a high usage of CPU0 will affect the other cores as well. A small runnable sketch contrasting us-heavy and sy-heavy code is given at the end of this CPU walkthrough.

ni: every Linux process has a priority (pri), and the CPU executes higher-priority processes first. In addition, Linux defines niceness for fine-tuning priority. Rule of thumb: a service normally should not have a favorable niceness that lets it hog CPU that could otherwise be used by other services. If it does, we need to check the system configuration and the service's startup parameters.

id: indicates the amount of CPU time not used by any process. Rule of thumb: for any online service, we normally preserve a certain amount of id to deal with throughput spikes. However, low id plus low throughput (i.e., a slow service) means some processes are using the CPU excessively. If this only happens after a recent release, it is highly possible that some new code introduced a hefty big-O, so better check the newly merged pull requests. If not, maybe it is time for horizontal scaling.

wa: indicates time spent waiting for disk I/O activities (local disk or NFS, but NOT network waits). Rule of thumb: if a high wa is observed in a service that is not I/O intensive, we need to a) run iostat to check the detailed I/O activities from the system perspective, and b) check the logging strategy, data load frequency, etc., from the application perspective. A low wa in a slow service can only rule out disk I/O, which makes network activity the next checkpoint.

hi & si: basically, hi and si together show the CPU time spent processing interrupts. Rule of thumb: hi should stay in a reasonable range for most services; si is normally higher in I/O intensive services than in ordinary ones.
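To make the i++ versus gettimeofday() distinction tangible, here is a small sketch (my own, not from the original article) that you can run next to top: the first loop does pure user-mode work and mainly moves us, while the second forces a kernel entry on every iteration and moves sy as well. Note that gettimeofday() itself may be answered from the vDSO on modern kernels without entering the kernel, so the sketch uses a raw syscall to guarantee the mode switch.

/* us_vs_sy.c - illustrative sketch only: run it alongside `top` to watch us and sy move. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    volatile unsigned long counter = 0;

    /* Phase 1: pure user-mode work - mostly increases us. */
    for (unsigned long i = 0; i < 2000000000UL; i++)
        counter++;

    /* Phase 2: a kernel crossing on every iteration - increases sy as well. */
    for (unsigned long i = 0; i < 20000000UL; i++)
        syscall(SYS_getpid);

    printf("done, counter=%lu\n", (unsigned long)counter);
    return 0;
}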
Load averages

Technically, CPU load averages (a.k.a. run-queue length) count the number of processes that are running or waiting to run. Linux provides several commands for checking load averages, such as top and uptime. Both report the load averages sampled over the last 1 minute, last 5 minutes and last 15 minutes.

In a system with one CPU core, 1 is the ideal load: at load 1 the CPU is fully saturated and running at its threshold, hence any value above 1 indicates some level of trouble. Likewise, in multi-core systems the number of cores is the actual threshold, which can be checked with:

cat /proc/cpuinfo | grep 'model name' | wc -l

Rule of thumb: in production, 70%~80% of the threshold is our experience value for a server running in the ideal state; at that level most of the server's capacity is utilized while there is still margin for a potential throughput increase. In performance tests, the system load may get close to, but should not exceed, the threshold during a pressure test; it should not exceed 80% of it during a concurrency test; and it should stay around 50% during a stability test. A small sketch that automates this load-versus-cores check is given after the disk I/O discussion below.

Memory

VIRT: the virtual memory used by a process;
RES: the physical memory used by a process.

The difference between VIRT and RES is like the difference between a budget and real money: VIRT is how much memory a process claims to use, and RES is how much physical memory it is actually using. To be specific, in

int *i = malloc(1024 * sizeof(int)); *i = 1;

the malloc claims the memory (VIRT), whilst *i = 1 starts actually using it (RES).

SHR: the amount of memory shared with other processes;
SWAP: the size of the swap area for a process. The OS swaps some data out to disk when physical memory falls short, which is one of the causes of a high wa;
DATA: the sum of the stack and heap used by a process.

Network

bandwidth

We use nethogs, which, like top, runs in a real-time manner. Because we are investigating a query service, bandwidth is most likely not a concern.

connection state

In a performance test we monitor connection state changes and exceptions. For services based on TCP and HTTP, we need to monitor the established TCP connection states, the in-test process's network buffers, TIME_WAIT states and the connection count. We use netstat and ss for this purpose.

Disk I/O

Reading from or writing to the disk too frequently puts the service in the I/O waiting state (counted as wa), which leads to long latency and low throughput.

We use iostat to monitor disk I/O:

tps: the number of I/O requests per second. Because multiple logical requests can be merged into one I/O request, the exact request count and the size per request cannot be fully reflected by this number;
kB_read/s: the amount of data read from the device per second;
kB_wrtn/s: the amount of data written to the device per second;
kB_read: the total amount of data read from the device;
kB_wrtn: the total amount of data written to the device.

N.b., the unit of the kB_* numbers listed above is kilobytes.

The plain command gives us the basic picture; for more detail, we add -x:

rrqm/s: how many read requests are merged per second (if the FS notices that requests sent from the OS read from the same block, it merges them);
wrqm/s: how many write requests are merged per second;
await: the average waiting time of all I/O requests (milliseconds);
%util: the percentage of time spent on I/O, which implies how busy the device is.
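As referenced in the load-averages section, below is a minimal sketch (my own, not from the article) that reads the 1-minute load from /proc/loadavg, divides it by the number of online cores, and flags anything beyond the 70%~80% comfort zone; the threshold is just the experience value quoted above.

/* loadcheck.c - illustrative sketch: compare the 1-minute load average
 * against the core count and the 70%~80% rule of thumb (Linux/glibc). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE *fp = fopen("/proc/loadavg", "r");
    if (fp == NULL) {
        perror("fopen /proc/loadavg");
        return 1;
    }

    double load1 = 0.0;
    if (fscanf(fp, "%lf", &load1) != 1) {
        fclose(fp);
        fprintf(stderr, "unexpected /proc/loadavg format\n");
        return 1;
    }
    fclose(fp);

    /* Same information as `cat /proc/cpuinfo | grep 'model name' | wc -l`. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    double ratio = load1 / (double)cores;
    printf("load(1m)=%.2f cores=%ld ratio=%.0f%%\n", load1, cores, ratio * 100);

    if (ratio > 0.8)
        printf("above the 70%%~80%% comfort zone - investigate\n");
    else
        printf("within the comfort zone\n");
    return 0;
}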
Some common performance issues

Throughput reaches its upper limit while the CPU load stays low: normally this is caused by a lack of resources allocated to the service. If this phenomenon is noticed, we need to check ulimit, the thread count and the memory allowance of the process under test.

us and sy are low while wa is high: a high wa is normal if the service is I/O intensive; otherwise, a high wa means trouble. There are roughly two potential causes: a) application-logic flaws, such as too many read/write requests, too large I/O payloads, an irrational data-loading strategy or excessive logging; b) excessive swapping caused by a memory deficiency.

Latency jitters for the same API: if the throughput is normal, this is usually caused by unnecessary contention, due to a) flaws in the inter-thread/inter-process locking logic, or b) limited resources, e.g., too few threads or too little memory allocated to the service, so it frequently has to wait for resources to become available.

Memory usage increases over time: under constant throughput, this is a typical indicator of a memory leak. We need to use valgrind to profile the service process.

Get back to work

System architecture

Let's get back to the investigation of the QSS. First, we look at the general architecture of the system under test.

Web API layer: translates HTTP requests into the socket protocol used internally;
Proxy layer: relays the socket requests to the different business logic modules;
Business layer: a) the compute service, which renders the output from results calculated over data fetched from the data layer; b) the error-correcting service, which corrects the query string;
Data layer: the underlying data service.

As the figure shows, the underlying data layer determines the performance upper limit, i.e., 3500qps. Hence, the task is to find the thresholds for the remaining layers under this constraint.

Test process

This section briefly explains the test process. A full-circle performance test is given in the following figure.

preparation

Test data: we use test data based on the service log from the day the service crashed;
Estimation of the threshold qps: obtaining this value is the primary purpose of this test;
Service config: we use exactly the same hardware/software config as the production environment.

in test

We use JMeter to simulate user requests. The performance test config files include 1) the data set config (thread sharing mode, new-line behavior, etc.), 2) throughput control, 3) the HTTP sampler (domain, port, HTTP method, request body, etc.), and 4) the response assertion. Sample configs are given in the following figures: data set config, throughput control, HTTP sampler and response assertion.

During the test, we also measure the performance metrics using the utilities discussed in the previous sections, and we collect the data generated in the test process to see whether anything is abnormal.

test result

The test result should include:

Test environment description: performance requirements, server config, test data source, test method, etc.;
Performance statistics: latency, qps, CPU, memory, bandwidth, etc.;
Test conclusion: the comparison between the measured maximum qps and latency and the expected values, deployment suggestions, etc.

We concluded that within the QSS, the throughput threshold of a single node is 300qps, and that of the whole cluster is 1800qps. The peak throughput on that dark night was around 5000qps+, which crashed the whole system as the load exceeded its capacity by far.

End of the day

In the end, we scaled up the QSS cluster and set a throughput quota for 'X'. Through this exercise, we secured the taxi booking service, which in turn offers a smoother UX for the users who rely on it to get to bars, dinners and dates. With the enhanced success rate of QSS and all sorts of derivative activities, we have made the world a better place.

In the next post, I will continue discussing more techniques for back-end server design and implementation.

If you like this read, please clap for it or follow by clicking the buttons. Thanks, and hope to see you next time.