That time a dormant bug came alive and everything went wrong at the same time

Written by SkyscannerEng | Published 2017/11/02
Tech Story Tags: outage | bugs | lean-startup | tech | startup-lessons


Skyscanner Engineering Post Mortem, shared

That time a dormant bug came alive and everything went wrong at the same time. Illustration by Skyscanner’s Gavin Spence.

By Dave Archer and Matt Hailey

At Skyscanner we know that our technology or our processes will let us down at some point, and that’s OK. This is the second post in our Post Mortem series, which we hope to continue in future.

Background

We often roll out new features using configuration flags that can be set remotely. The web nodes (known internally as the scaffold nodes) poll for these flags and their respective constraints (user in a particular market, served to a percentage of traffic, etc.) and evaluate them on a per-request basis.
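
As a rough, hypothetical sketch of how this works (flag names, constraint types and structure are ours for illustration, not the actual scaffold code), per-request evaluation might look something like this:

```python
import hashlib

# Hypothetical sketch of per-request flag evaluation; names and structure
# are illustrative, not the actual scaffold code. Nodes poll a config
# service, cache definitions like the one below, and evaluate per request.
remote_config = {
    "new-crawler-provider": {
        "value": True,
        "constraints": [
            {"type": "market", "allowed": {"UK", "ES"}},
            {"type": "percentage", "value": 1},  # roll out to 1% of traffic
        ],
    },
}

def _bucket(request_id: str) -> int:
    """Deterministically map a request id onto a 0-99 bucket."""
    return int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100

def constraint_matches(constraint: dict, request: dict) -> bool:
    if constraint["type"] == "market":
        return request["market"] in constraint["allowed"]
    if constraint["type"] == "percentage":
        return _bucket(request["id"]) < constraint["value"]
    return False  # unknown constraint types fail closed in this sketch

def flag_enabled(name: str, request: dict) -> bool:
    flag = remote_config.get(name)
    if flag is None:
        return False  # safe default when no configuration exists
    # Constraints are evaluated on every request before the stored value is used.
    if not all(constraint_matches(c, request) for c in flag["constraints"]):
        return False
    return flag["value"]
```

In a model like this, every constraint is evaluated on every request even when the rollout percentage is tiny, which is exactly the property noted in the footnotes to the incident timeline below.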

Summary of the issue

A long-dormant bug in our scaffold codebase surfaced when we tried to roll out a change that would replace the provider we use for web crawler detection. The remote configuration flag was enabled, and evaluating it triggered an infinite loop in request execution. This overloaded all our nodes, causing the entire fleet to fall over.

During production incidents, our process usually starts with reverting recent changes (including remote configuration flags), but this proved to be problematic for a few reasons:

  • We didn’t roll back the offending configuration correctly in the first instance, meaning we ended up chasing our tails for a few hours
  • Our scaffold servers run under IIS. With this kind of catastrophic failure, IIS shuts down the application pool after seeing too many consecutive errors, meaning the nodes did not automatically pick up changes we were trying to make to remote configuration flags

Eventually the offending configuration was correctly reverted, nodes restarted and order restored.

During the incident

The root cause of this issue is outlined in its own section below. What is more interesting about this issue in particular is the sequence of events during the incident that led to this being one of our longest-lived outages. For brevity, the sequence is listed here in point form.

  • ‘Crawler detection provider’ flag enabled for 1%* of requests
  • Web nodes crash as they pick up the new configuration and evaluate the flag
  • ‘Crawler detection provider’ incorrectly rolled back. The flag value was changed from TRUE to FALSE but the constraints were left in place**. From this point on we were investigating based on false assumptions
  • Incorrect assumption was made that the nodes were struggling to start up and fetch the updated remote configuration (see the section below ‘Why did it take so long…’)
  • Attempts were made to protect the servers from traffic to allow them to start up with the updated remote configuration
  • Manually copied updated configuration to a portion of nodes. Issue still persisted, indicating the problem was within the configuration itself
  • Fully rolled back the remote configuration setup
  • Nodes begin to recover and serve traffic successfully

* The 1% flag is evaluated on all requests on all nodes, but the feature is only activated for 1% of requests

** The issue was not with the new code path, but rather the evaluation of constraints

Additional factors contributed to the time to fully diagnose and resolve the issue, including:

  • The configuration UI was unavailable during the outage. It had not yet been removed from the monolithic scaffold codebase (though the underlying service had).
  • All dashboards were unavailable due to an unrelated, but unfortunately timed, issue with our data platform.
  • Attempts to drain individual data centres at the Akamai (our CDN) level didn’t have the desired effect; drained data centres continued to receive traffic. The root cause of that behaviour is out of scope for this report, though a separate investigation was launched.

Root Causes

As mentioned above, we were in the process of changing providers for crawler detection. Wanting to do this safely, we chose to put the switch behind a configuration flag and use an ‘experiment’ (the term is used very loosely here) constraint to slowly ramp up traffic to the new provider.

  • To identify a request as a web crawler, the Scaffolding must evaluate the configuration flag to choose between the two crawler detection providers
  • Evaluation requires checking all constraints defined in Config Service; in this case the experiment constraint was of particular interest
  • To avoid a network call, allocation to experiment variants is performed by a fat client of Dr Jekyll (our experiment platform), based on selected request attributes
  • One attribute indicates whether request originated from a crawler
  • However, because of a similar flag related to this roll-out, determining whether a request is from a crawler in turn depends on experiment allocation data, which creates a circular dependency: crawler information requires experiment data, which is evaluated based on crawler information
  • Allocation parameters are calculated once per website request inside a critical section defined by a mutually-exclusive, request-scoped lock. Recurrent evaluation of these parameters, triggered by the request for crawler information, led to the IIS application pools being stopped (a simplified sketch of the cycle follows this list)
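
A minimal, hypothetical model of that cycle (our own simplification, not the real Dr Jekyll client code):

```python
# Simplified, hypothetical model of the circular dependency; not the real
# Dr Jekyll client. Each helper ends up calling the other, so evaluating
# the flag for a single request never terminates. In Python this surfaces
# as a RecursionError; in the scaffold it meant request execution never
# completing and, ultimately, IIS stopping the application pools.

def is_crawler(request: dict) -> bool:
    # Choosing the crawler-detection provider sits behind a flag whose
    # experiment constraint needs the variant allocation...
    variant = allocate_variant(request)
    return detect_with_provider(variant, request)

def allocate_variant(request: dict) -> str:
    # ...and computing the allocation uses "is this request a crawler?"
    # as one of its inputs, completing the cycle.
    crawler = is_crawler(request)
    return "new_provider" if not crawler else "old_provider"

def detect_with_provider(variant: str, request: dict) -> bool:
    # Placeholder for the actual provider call; never reached in practice.
    return "bot" in request.get("user_agent", "").lower()
```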

Time to Detect

Time to Detect: target 15 minutes, actual four minutes.
Time to Resolution: target 60 minutes, actual 215 minutes.

The bug was raised through automated reports and flagged on our internal Slack channel for employees to report issues, and we also received user reports through our user satisfaction team and our Twitter channels.

The impact?

On our web products, travellers visiting the website were served the Houston error page. On our mobile apps, redirects to partners are sent via the browser, meaning travellers using our apps could not complete any bookings.

We also saw an impact to SEO within one hour of the bug surfacing.

Our Houston error page

Questions

Why wasn’t this issue detected before impacting production?

The new feature was fully enabled in pre-prod via a configuration flag. However, it wasn’t enabled with an experiment as a constraint; it was switched from 0% to 100% via config only. Up until this incident, this was relatively standard practice for us.

Automated test coverage was also missing the production case of both configuration and experiment constraint enabled.
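
For illustration, the missing case might have been covered by a test along these lines (a pytest-style sketch reusing the hypothetical remote_config / flag_enabled helpers from the Background section; a real test would exercise the scaffold’s actual evaluation path):

```python
# Hypothetical pytest-style sketch of the missing case: the flag enabled
# *and* an experiment constraint attached, evaluated on a representative
# request. Names reuse the illustrative helpers sketched earlier.

def test_flag_with_experiment_constraint_completes():
    remote_config["new-crawler-provider"]["constraints"].append(
        {"type": "experiment", "name": "crawler-provider-rollout"}
    )
    request = {"id": "req-123", "market": "UK", "user_agent": "Mozilla/5.0"}
    # Evaluation should return a boolean promptly rather than recursing
    # until the worker process is killed.
    assert flag_enabled("new-crawler-provider", request) in (True, False)
```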

Why did it take so long to fix when the cause was detected early on?

The configuration update was identified as the trigger early on. To attempt to fix it, the entry was changed from TRUE to FALSE.

However, as we now know, the root issue was caused by the constraint, which still existed.

When the team attempted to bring Scaffold servers back online (by restarting the IIS app pool) they immediately fell over again. The (incorrect) assumption at the time was that the Scaffold servers weren’t picking up the latest config before falling over.

The team then copied the latest config file to Scaffold boxes before restarting (to bypass the polling mechanism in Scaffold). However, the Scaffold servers still fell over when restarted.

With this new knowledge, the config entry was fully deleted (Scaffold has safe defaults to fall back on when no config is available).

The app pools were restarted and this time the Scaffold servers picked up the new config and stayed online.
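
Mapping this back onto the hypothetical evaluation sketch from the Background section: because that sketch checks constraints before the stored value is consulted, flipping the value alone could never avoid the dangerous constraint evaluation; only removing the entry (or the constraint) could.

```python
# Hypothetical recap against the earlier flag_enabled sketch (not the
# actual scaffold code): which part each remediation step touched.
remote_config["new-crawler-provider"] = {
    "value": False,  # first fix attempt: TRUE -> FALSE. The experiment
                     # constraint below is still evaluated on every request.
    "constraints": [{"type": "experiment", "name": "crawler-provider-rollout"}],
}

# The fix that worked: delete the entry entirely, so flag_enabled()
# short-circuits to the safe default before touching any constraint.
remote_config.pop("new-crawler-provider", None)
```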

Key Learnings

  • Test the combination of config + experimentation safely prior to production (config had been tested, but not with a backing experiment)
  • Revert erroneous config/experiment value if identified as cause of issue rather than fix forward
  • Separate systems as much as possible to limit impact of an outage (Scaffold + Config UI, in this case)
  • Fully understand and consider the blast radius of changes, and limit it wherever possible
  • Site instability impacts user performance metrics (such as time between searches, due to quick bounces) as well as SEO

Feedback from the team

Regardless of the specifics of this incident, we needed to ensure we limited the blast radius and fault domain of any release in future. Our mechanism for limiting traffic to the 1% mentioned above was a software-based decision, when it should have been physically limited to code running on a subset of nodes (a minimal sketch of such a node-level gate follows the list below).

  • We needed to freeze other experiments during outages to prevent a snowball effect and confusion around root causes.
  • We needed to understand and limit the blast radius, and work out how to experiment in future without taking on such a big risk.
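
As a rough sketch of what “physically limited to a subset of nodes” could look like (the hostnames, environment variable and function names are ours, purely illustrative, and it reuses the hypothetical flag_enabled helper from the Background sketch):

```python
import os
import socket

# Hypothetical node-level gate; environment variable and names are purely
# illustrative. A bad rollout can, at worst, take out the canary group
# rather than the whole fleet.
CANARY_HOSTS = set(
    filter(None, os.environ.get("CRAWLER_PROVIDER_CANARY_HOSTS", "").split(","))
)

def node_in_canary_group() -> bool:
    return socket.gethostname() in CANARY_HOSTS

def choose_crawler_provider(request: dict) -> str:
    # The per-request flag is only consulted on canary nodes; everywhere
    # else the old provider is used unconditionally.
    if node_in_canary_group() and flag_enabled("new-crawler-provider", request):
        return "new_provider"
    return "old_provider"
```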

All of the above has now been addressed.

What do you think? Comment below or tweet us with your theories, comments and thoughts.

Like what you hear? Work with us

We do things differently at Skyscanner and we’re on the lookout for more Engineering Tribe Members across our global offices. Take a look at our Skyscanner Jobs for more vacancies.

Skyscanner Jobs

About the authors

Hi, I’m Matt and I’m an Engineering Manager at Skyscanner, working alongside Dave and the rest of our core web team. Working on the front-line, there’s never a dull day. We juggle our time between maintaining the website and developing new systems that are fundamental to our success as we continue to scale across the globe. Working for a travel company, it comes as no surprise that I like to travel — China is next on the list!

Matt Hailey, Skyscanner

