Years ago, I used average latency on every dashboard and every alarm. That is, until I woke up to the problems of average latencies, along with everybody else in the industry:

- When the dataset is small, it can be easily skewed by a small number of outliers.
- When the dataset is large, it can hide important details, such as 10% of your users experiencing slow responses!
- It is just a statistical value; on its own it's almost meaningless. Until we plot the latency distribution, we won't actually understand how our users are experiencing our system.

Nowadays, the leading practice is to use 95th or 99th percentile latencies (often referred to as tail latency) instead. These percentile latencies tell us the worst response time 95% or 99% of users are getting. They generally align with our SLOs or SLAs, and give us a meaningful target to work towards, e.g. "99% of requests should complete in 1s or less".

Using percentile latencies is a big improvement on average latencies. But over the years I have experienced a number of pain points with them, and I think we can do better.

The problems with percentile latencies

The biggest problem with using percentile latencies is not actually with percentile latencies themselves, but with the way they are implemented by almost every single vendor out there.

Percentile latencies are "averaged"

Because it takes a lot of storage and data processing power to ingest all the raw data, most vendors generate the percentile latency values at the agent level. This means that by the time the latency data is ingested, it has lost all granularity and comes in as summaries: mean, min, max, and some predefined percentiles. To show you the final 99th percentile latency, the vendor would (by default) average the 99th percentile latencies that have been ingested. You can't average percentiles; it doesn't make any sense!
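To make this concrete, here is a minimal sketch with made-up numbers. The two agents, their request counts and the latencies are all hypothetical, and `percentile` is a simple nearest-rank helper I wrote for illustration, not any vendor's actual implementation.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds, as seen by two agents.
agent_a = [100] * 990 + [200] * 10  # busy, healthy host: 1,000 requests
agent_b = [100] * 5 + [5000] * 5    # quiet, struggling host: 10 requests

# Each agent only ships its own p99 summary.
p99_a = percentile(agent_a, 99)
p99_b = percentile(agent_b, 99)

averaged_p99 = (p99_a + p99_b) / 2            # "average of p99s" aggregation
max_p99 = max(p99_a, p99_b)                   # "max of p99s" aggregation
true_p99 = percentile(agent_a + agent_b, 99)  # p99 over all the raw requests

print(f"average of p99s: {averaged_p99:.0f} ms")
print(f"max of p99s:     {max_p99} ms")
print(f"true p99:        {true_p99} ms")
```

With these made-up numbers, the averaged value comes out at 2,550 ms and the max at 5,000 ms, while the 99th percentile computed over the raw requests is 200 ms. The averaged figure gives the 10-request agent the same weight as the 1,000-request agent, which is exactly the information the summaries have thrown away.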
This whole practice gives you a meaningless statistical value, and it's in no way the true 99th percentile latency of your service. Averaging the percentiles inherits all the same problems with averages that percentile latencies were supposed to address!

I have seen 99th percentile latencies differ by an order of magnitude depending on how I choose to aggregate them. Seriously, how am I supposed to trust this number when choosing the max over the average can produce a 10x difference! You might as well stick a randomly generated number on the dashboard; it's almost as meaningful as "the average of 99th percentiles".

This practice is so widespread that almost every monitoring tool I have tried does it. Honeycomb is one of the few exceptions because they actually ingest and process all the raw events.

Can't tell how bad the bad times are

It's great that we can use percentiles to monitor our compliance with SLOs/SLAs. When things are going well, it gives us that warm and fuzzy feeling that all is well with the world.

But when they go wrong, and sometimes they go very wrong, we are left wondering just how bad things are. Are 10% of my users getting response times of 1s and above? Is it 20%? Could it be that 50% of my users are getting a bad experience? I just don't know!

I can use various percentiles as gates, but that approach only goes so far before it overwhelms my dashboards.

Most data points are not actionable

I love to stare at those green tiles and line graphs and know that:

- We have done a good job, go team!
- Everything's fine, there's no need to do anything.

But most of the information I consume when I look at the dashboard is not immediately actionable.

To be clear, I'm not saying that percentile latencies are not useful and that you shouldn't show them on dashboards.
But as the on-call engineer, my attention is heavily biased towards "what is wrong" rather than "what is right". I want dashboards that match my focus and don't force me to scan through tons of information and pay the cognitive price to separate the signal from the noise.

As an application developer, my definition of "what is wrong" is quite different. I'm looking for unexpected changes in application performance. If the latency profile of my service changes after a deployment, or another related event (e.g. a marketing campaign, or a new feature being toggled on), then I need to investigate it.

This dichotomy in what's important for ops engineers and application developers means we should have separate dashboards for each. More on this later.

What can we do instead?

What could we use instead of percentiles as the primary metric to monitor our application's performance and alert us when it starts to deteriorate?

If you go back to your SLOs or SLAs, you probably have something along the lines of "99% of requests should complete in 1s or less". In other words, less than 1% of requests is allowed to take more than 1s to complete.

So what if we monitor the percentage of requests that are over the threshold instead? To alert us when our SLAs are violated, we can trigger alarms when that percentage is greater than 1% over some predefined time window.

Unlike percentiles, this percentage can be easily aggregated across multiple agents:

1. Each agent submits the total request count and the number of requests over the threshold.
2. Sum the two numbers across all agents.
3. Divide the total number of requests over the threshold by the total request count, and you have an accurate percentage.

During an outage, when our SLAs are impacted, this metric tells us the number of requests that have been affected.
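The aggregation steps above can be sketched in a few lines. This is only an illustration under my own assumptions: `AgentReport`, `percent_over_threshold`, `ERROR_BUDGET_PERCENT` and the sample counts are hypothetical names and numbers, not part of any existing monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class AgentReport:
    """Hypothetical per-agent summary for one time window."""
    total_requests: int
    over_threshold: int  # requests slower than the SLO threshold, e.g. 1s


def percent_over_threshold(reports):
    """Sum both counters across agents, then divide once at the end."""
    total = sum(r.total_requests for r in reports)
    over = sum(r.over_threshold for r in reports)
    if total == 0:
        return 0.0  # no traffic in this window, so nothing is over the threshold
    return 100.0 * over / total


# Assumed SLO: "99% of requests should complete in 1s or less".
ERROR_BUDGET_PERCENT = 1.0

# Made-up reports from three agents for the same time window.
reports = [AgentReport(1000, 5), AgentReport(10, 5), AgentReport(4000, 60)]

percent = percent_over_threshold(reports)
sla_breached = percent > ERROR_BUDGET_PERCENT
print(f"{percent:.2f}% of requests over threshold, breached={sla_breached}")
```

Because the agents ship raw counts rather than ratios, a quiet agent with 10 requests cannot distort the result the way it does when percentiles are averaged. Here 70 out of 5,010 requests are over the threshold, so the alarm condition is met.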
Once we understand the blast radius of the outage, the percentile and max latencies then become useful metrics to gauge how much the user experience has been impacted.

Move aside, error count

We can apply the same approach to how we monitor errors. For any given system, you have a small and finite number of success cases. You also have a finite number of known failure cases, which you can actively monitor. But then there are the unknown unknowns: the failure cases that you hadn't even realised you have and wouldn't know to monitor!

So instead of putting all your efforts into monitoring every single way your system can possibly fail, you should monitor for the absence of a success indicator. For APIs, this can be the percentage of requests that do not have a 2xx or 4xx response. For event processing systems, it might be the percentage of incoming events that do not have a corresponding outgoing event or observable side-effect.

This tells you at a high level that "something is wrong", but not "what is wrong". To figure out the "what", you need to build observability into your system so you can ask arbitrary questions about its state and debug problems that you hadn't thought of ahead of time.

Different dashboards for different disciplines

As we discussed earlier, different disciplines require different views of the system. One of the most important design principles of a dashboard is that it must present information that is actionable. And since the action you will likely take depends on your role in the organization, you really need dashboards that show you information that is actionable for you!

Don't try to create the dashboard to rule them all by cramming every metric onto it. You will just end up with something nobody actually wants!
Instead, consider creating a few specialised dashboards, one for each discipline, for instance:

- Ops/SRE engineers care about outages and incidents first and foremost. Actionable information for them would help them detect incidents quickly and assess their severity easily. For example, the percentage of requests that are over the threshold, or the percentage of requests that did not yield a successful response.
- Developers care about application performance. Percentile latencies are very relevant here, as are resource metrics such as CPU and memory usage.
- Product owners and business analysts might need their own dashboards too. They care about business metrics such as retention, conversion rate, or sales.

Summary

When you go to see a doctor, the doctor would try to ascertain (as part of the diagnosis):

- What, and where, your symptoms are.
- The severity of your symptoms.
- How long you have experienced these symptoms.
- Any correlated events that could have triggered the symptoms.

The doctor would use this information to derive a treatment plan, or not, as might be the case. As the on-call engineer dealing with an incident, I go through the same process to figure out what went wrong and how I should respond.

In this post we discussed the shortcomings of percentile latencies, which make them a poor choice of metric in these scenarios:

- They are usually calculated at the agent level, and averaged, which produces a nonsensical value that doesn't reflect the true percentile latency of my system.
- They don't tell you the impact of an incident.

We proposed an alternative approach: monitor service health by tracking the percentage of requests whose response time is over the threshold. Unlike percentiles, this metric aggregates well when summarising results from multiple agents, and gives us a clear picture of the impact of an outage.
We can apply the same approach to how we monitor errors. Instead of monitoring each and every error we know about, and missing all the errors we don't know about, we should monitor for the absence of success indicators.

Unfortunately, this is not how existing monitoring tools work… For this vision to come to pass, we need support from our vendors and a change in the way we handle monitoring data.

The next time you meet with your vendor, let them know that you need something better than percentile latencies ;-)

And if you know of any tools that let you implement the approach I outlined here, please let me know via the comments below!
