Suman Karumuri | Pinterest tech lead, Visibility
To ensure we’re providing a fast experience for Pinners, we’re constantly improving the request latency of our backend services. Earlier this year, we open sourced Pintrace collector, and contributed the code to the OpenZipkin community. In a previous post, we explained the architecture of Pintrace and how we trace backend applications.
In addition to using Pintrace to understand and improve the latency of our backend, we’ve developed applications of trace data to solve a few day-to-day issues. While there’s a lot of literature on how distributed tracing can be used for improving request latency, there’s surprisingly little documentation on how this trace data can be used in other ways. Here, we’ll share how we use the trace data to understand, debug and tune our backend services.
Pintrace has been running in production for over six months now. Currently, we sample a small portion of production traffic (<0.5%) for tracing, with negligible overhead, and we can use the resulting traces to answer several architectural questions easily.
Along the way, our traces have surfaced some fun facts about the Pinterest backend.
We’ll share examples of how we use traces to improve the latency of our backend, as well as how we’ve made trace data a foundational substrate for all performance data at Pinterest, starting with mobile clients.
Even an individual trace of a request can provide a lot of insight into the system. It can tell us what actions took place during a request and how long each took. The image below shows a trace for a single home feed request. The trace visualization page consists of a trace summary and a Gantt chart visualization of the trace.
The Zipkin UI shows a summary of the trace at the top of the Gantt chart. This summary is useful for getting an overview of the request's execution, and it provides the following insights:
The spans in a trace show all the actions that occurred during a request. When the same spans are repeated in a trace, however, it's a sign of duplicate computation, or of computation that could be optimized with caching or other techniques. Given the complex interactions between components in a system and the software abstractions involved, these kinds of issues are hard to avoid and not easily detected in complex systems. So, if a trace contains identical spans, it's worth taking a closer look.
For example, when looking at a trace for a home feed request, we noticed two identical Thrift calls to our follower service backend, separated by a few milliseconds. The image below shows an annotated screenshot of the trace. When we looked at the code making these calls, we found that the second call was unnecessary; the developer wasn't aware of the duplicate because different methods in the code were using the same underlying API. Deleting the duplicate call improved the overall latency of the home feed request by 20+ms (about 5%) and halved the load on the backend service.
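This kind of check can also be done programmatically by scanning a trace for spans that repeat the same operation against the same service. Below is a minimal sketch; the simplified span fields and the equality rule are assumptions for illustration, not Pintrace's actual heuristic.

from collections import Counter

def find_repeated_spans(spans):
    # `spans` is assumed to be a list of Zipkin-style dicts,
    # e.g. {"serviceName": "followerservice", "name": "getfollowers", ...}.
    # Repeated (service, operation) pairs are a hint of duplicate work.
    counts = Counter((s["serviceName"], s["name"]) for s in spans)
    return {key: n for key, n in counts.items() if n > 1}

# Example: two identical Thrift calls to the follower service in one trace.
trace = [
    {"serviceName": "ngapi", "name": "v3_home_feed"},
    {"serviceName": "followerservice", "name": "getfollowers"},
    {"serviceName": "followerservice", "name": "getfollowers"},
]
print(find_repeated_spans(trace))  # {('followerservice', 'getfollowers'): 2}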
Each span in a trace captures the amount of time it took to execute an operation. A Gantt chart (like the one shown in Figure 1) shows the spans in a waterfall view. In this view, a blue bar shows the relative time during which the span was active during the request. By identifying the spans with the longest duration (the widest blue bars), we can see where the request spent its time.
For example, in Figure 1, you can see the entire v3_home_feed call in ngapi took 581ms. By looking at the spans we can see that getfeed call to smartfeedservice took 177ms. So, the request spent 30% of the time in the getfeed API call to smartfeedservice. This is acceptable in this case, since smartfeedservice returns a list of pins to be shown in the home feed.
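The same reading of the Gantt chart can be automated by ranking spans by duration and expressing each one as a share of the root span. Here's a quick sketch using the numbers above; the flat span dictionaries are an assumption for illustration.

def slowest_spans(spans, top=5):
    # Assume the longest span is the root span covering the whole request.
    root = max(s["duration"] for s in spans)
    ranked = sorted(spans, key=lambda s: s["duration"], reverse=True)
    return [(s["name"], s["duration"], s["duration"] / root) for s in ranked[:top]]

spans = [
    {"name": "v3_home_feed", "duration": 581},  # ms, from Figure 1
    {"name": "getfeed", "duration": 177},
]
for name, ms, share in slowest_spans(spans):
    print(f"{name}: {ms}ms ({share:.0%} of the request)")
# getfeed comes out at roughly 30% of the 581ms request, matching the trace.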
The Gantt chart visualization of a trace can also clearly show serial execution: spans executed serially by the same service show up as a step pattern. In Figure 1, we can see that the memcache_get calls are executed in parallel because they start at the same time. However, in a later part of that trace, we see get_many calls to memcache from ngapi executed in a step pattern, as shown in the image below. This step pattern is characteristic of serial execution.
Inspecting the code revealed a bug: we had failed to wrap the memcache get_many call with a Python decorator that runs it asynchronously in a greenlet. Once we added the decorator to the method, all our get_many calls to memcache ran in parallel. We confirmed the fix by looking at a trace generated from the new code, as shown in the image below.
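The sketch below shows the general shape of such a decorator, built on gevent; the decorator name and the stand-in get_many body are hypothetical, not our production code.

import functools
import gevent

def async_in_greenlet(func):
    # Run the wrapped call in a greenlet so the caller can issue several
    # lookups before blocking on any of the results.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return gevent.spawn(func, *args, **kwargs)  # returns a Greenlet immediately
    return wrapper

@async_in_greenlet
def get_many(keys):
    gevent.sleep(0.05)  # stand-in for a real memcache round trip
    return {key: None for key in keys}

# Both lookups are in flight at the same time; .get() waits for each result.
g1 = get_many(["pin:1", "pin:2"])
g2 = get_many(["board:7"])
results = g1.get(), g2.get()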
By looking at a trace, one may identify opportunities for optimization. For example, an architect can identify calls which can take advantage of a cache or other architectural optimizations. However, improvements like these need deep domain expertise as they usually involve architectural changes.
For example, in the home feed trace in Figure 1, we saw that 30% of the time is spent identifying the Pins to be shown to a user. By examining the rest of the trace (not shown in that image), we noticed that we spend roughly 40% of the time materializing Pins, i.e., building actual Pins from the Pin ids returned by smartfeedservice. Currently, this logic needs calls to several services. Based on this trace data, we've started looking into ways to optimize Pin materialization.
Traces have also helped us identify performance bottlenecks in the code. For example, we realized we could make a call to the ads API much sooner, reducing the overall latency. Using traces, we also identified several opportunities for dependent services to cache data. We're working on these optimizations in the first half of 2017.
The generic Python Zipkin tracer generates a span for incoming and outgoing network requests like HTTP and Thrift calls. This is a fine choice in many cases, since network latency accounts for most of the request processing time. However, where complex business logic is involved, capturing network calls alone is not enough to understand what's going on. For example, in a trace for the heavily used v3_get_experience API, we'd see a span showing the API took 300ms, with a few short network calls. But the trace wouldn't tell us why those network calls were made or how the rest of those 300ms was spent.
So, we provided developers with the ability to add custom non-network spans to their traces. Developers can add custom spans using a simple Python context manager, as shown in the code below.
with custom_span('my_operation', tags={'key1': 'value1', 'key2': 'value2'}):
    ...
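Under the hood, a context manager like this only needs to open a span on entry, attach the tags, and record the duration on exit. Here's a minimal sketch of what such a helper could look like; the report_span hook and the dictionary span format are assumptions, not the actual Pintrace implementation.

import time
from contextlib import contextmanager

def report_span(span):
    # Stand-in for handing the finished span to the tracer's collector.
    print(span)

@contextmanager
def custom_span(name, tags=None):
    span = {"name": name, "tags": dict(tags or {}), "start_us": int(time.time() * 1e6)}
    try:
        yield span
    finally:
        span["duration_us"] = int(time.time() * 1e6) - span["start_us"]
        report_span(span)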
By annotating the v3_get_experiences API with custom spans, our experience framework team gained a deeper understanding of how time was spent in their API and found several optimizations that would improve its latency by 20%.
Given our complex monolithic Python front-end, our developers haven't had good tools to understand the performance impact of the changes they make. So, we automatically trace all requests on development machines, letting developers easily trace their code and understand how it behaves.
In a microservices architecture, understanding why a request is slow is very hard because any component in the request path can add to the latency. Tracking down a p99 latency issue is even harder, because the extra latency can come from any component in the request processing path for a wide variety of reasons. A trace captures a request throughout its execution, so it can pinpoint the component causing the slowness. Since we sample a small percentage of requests for tracing, the captured traces can identify p50, p75 and sometimes p90 latency issues easily; because p99 issues are rare, we have to search for traces over longer durations to find one exhibiting p99 latency.
For example, we've used traces to determine the root cause of a p99 latency issue in the follower service. A team member searched for follower service spans whose duration was more than a second, the p99 latency for that service. By comparing a p99 latency trace with a normal trace, we quickly identified that the issue was caused by a client looking up far more keys in the database than a normal request does. The issue was fixed by limiting the number of keys a client can look up.
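Finding such outlier traces doesn't require combing through the UI by hand, since the Zipkin query API accepts a minimum span duration. A sketch of such a query is below; the host, service name and threshold are made up, and minDuration is expressed in microseconds.

import requests

ZIPKIN = "http://zipkin.example.com:9411"  # assumed Zipkin host

resp = requests.get(
    ZIPKIN + "/api/v1/traces",
    params={
        "serviceName": "followerservice",
        "minDuration": 1000000,  # only spans slower than 1s, i.e. past the p99
        "limit": 20,
    },
)
slow_traces = resp.json()
print(f"found {len(slow_traces)} traces with follower service spans over 1s")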
For fault tolerance, it's very common to use multiple clusters or replicas of databases and caches. During troubleshooting or debugging, the first questions an engineer has to ask and answer are things like: which memcache cluster is this request hitting? Is this request served by the master or the slave database? Logs are typically used to answer these questions, but even with a distributed log search system, this is cumbersome: we have to carefully comb through the application logs to identify the exact time of the request, and then comb through the database or cache logs to find the request around that time. The process requires manually inferring the causality of events and reconstructing extra context.
Traces are a better alternative to logs for this use case, since they capture the causality and the needed context, like the host IP and port of the remote server, during a request. We can also annotate a span with additional information, such as the name of the database or the SQL query executed, to simplify debugging. For example, while investigating an incident, the Zen team would like to know whether a request was served from a MySQL cluster or an HBase cluster. Since Pintrace captures the network address and port of every request, they use this information to do a DNS lookup and determine the cluster that served the request.
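Because every client span records the remote server's IP and port, mapping a request back to a cluster can be as simple as a reverse DNS lookup on that address. A small sketch follows; the hostname-to-cluster naming convention is an assumption.

import socket

def cluster_for(ip):
    # Reverse-resolve the server IP recorded in the span and guess the cluster
    # from the hostname, assuming hosts are named like "hbase-zen-007.example.com".
    hostname, _, _ = socket.gethostbyaddr(ip)
    return hostname.split("-")[0]  # e.g. "hbase" vs "mysql"

print(cluster_for("10.1.2.3"))  # IP taken from the span's server endpoint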
For performance, we cache a lot of data across several memcache clusters. While investigating a latency issue, we'd like to know whether the cache was hit during a request and which cluster served it. To simplify answering these questions, we've instrumented our memcache tracing code to add the name of the memcache cluster serving the request to the span, which has helped in debugging several issues.
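The instrumentation itself can be as small as one extra tag on the client span. The sketch below reuses the hypothetical custom_span helper from earlier; the cluster name argument and the memcache client API are assumptions.

def traced_cache_get(client, cluster_name, keys):
    # Wrap a memcache lookup in a span tagged with the serving cluster, so any
    # trace immediately shows which cluster handled the request and whether it hit.
    with custom_span("memcache_get_many", tags={"memcache.cluster": cluster_name}) as span:
        values = client.get_many(keys)  # assumed client API returning found keys
        span["tags"]["cache.hit"] = len(values) == len(keys)  # rough hit indicator
        return values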
Pinterest is a data-driven company, so several services measure the time it takes to perform an action, like loading a video or opening the app. Some of these actions contain sub-actions; for example, opening the app authenticates the user, sends a request for the home feed, downloads images and renders the home feed. Currently, this latency information is captured in a custom data format for each action, each with its own backend data processing job and storage format.
However, we noticed several parallels between the data format for this action data and span data. A Zipkin span can encapsulate the action data perfectly, using the span name as the action name, a start time, an end time and a map of key-value pairs to annotate the span. Similarly, each sub-action can be represented as a span, and the parent-child relationship among spans can be used to relate sub-actions to an action. By standardizing the data format for all these actions as spans, we can also standardize the data processing pipeline for this information.
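Concretely, an app-open action and its sub-actions map onto Zipkin-style spans as a parent with children sharing a trace id. The sketch below is illustrative only; the ids and field names are simplified.

import time
import uuid

def make_span(trace_id, name, parent_id=None):
    return {
        "traceId": trace_id,
        "id": uuid.uuid4().hex[:16],
        "parentId": parent_id,
        "name": name,
        "timestamp": int(time.time() * 1e6),  # microseconds, per the Zipkin convention
        "duration": None,  # filled in when the action completes
    }

trace_id = uuid.uuid4().hex[:16]
app_open = make_span(trace_id, "app_open")
sub_actions = [
    make_span(trace_id, name, parent_id=app_open["id"])
    for name in ("authenticate", "home_feed_request", "image_download", "render_home_feed")
]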
In response, we’ve started a process to standardize all the latency data into span format. We’ve also built a new pipeline called the latency pipeline to collect, process, store and query this information at scale using big data tools like Hadoop, Spark and Hive. Currently, we process some mobile client latency metrics using this new pipeline. Watch out for a new blog post on this soon.
In addition to strengthening existing use cases, we see several ways to use trace data in the future.
Pintrace started as a way to trace requests across our backend services to improve request latency. But we soon realized that the span model, the trace pipeline and the trace data provide novel ways of understanding, debugging and tuning our systems, from mobile devices and backend services to databases. With these new use cases, we see tracing infrastructure as the third pillar of monitoring our services, alongside metrics and log search systems. In the future, we plan to further analyze trace data to gather insights for capacity planning, calculating chargeback and estimating the cost to serve a request. We're just getting started.
Acknowledgements: The contributors to this project are Joe Gordon, Sam Meder, Naoman Abbas, Xiaoqiao Meng and Suman Karumuri.