By Dave Archer and Matt Hailey
At Skyscanner we know that our technology or our processes will let us down at some point, and that’s OK. This is the second post in our Post Mortem series, which we hope to continue in future.
We often roll out new features using configuration flags that can be set remotely. The web nodes (known internally as the scaffold nodes) poll for these flags and their respective constraints (user in a particular market, served to a percentage of traffic, etc.) and evaluate them on a per-request basis.
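As a rough illustration of per-request flag evaluation, here is a minimal Python sketch. It is not the scaffold implementation: the `Flag`, `Request`, `bucket` and `is_active` names are hypothetical, and the hash-based bucketing is just one common way to serve a stable percentage of traffic.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    user_id: str
    market: str


@dataclass
class Flag:
    name: str
    enabled: bool
    markets: Optional[set] = None  # None means "any market" (assumed convention)
    percentage: int = 100          # share of traffic that gets the feature, 0-100


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_active(flag: Flag, request: Request) -> bool:
    """Evaluate the flag and its constraints for a single request."""
    if not flag.enabled:
        return False
    if flag.markets is not None and request.market not in flag.markets:
        return False
    return bucket(flag.name, request.user_id) < flag.percentage
```

In this sketch `is_active` is called once per request, so the cost and the correctness of the constraint checks matter for all traffic, not only for the requests that end up seeing the feature.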
A long-dormant bug in our scaffold codebase surfaced when we tried to roll out a change that would replace the provider we use for web crawler detection. When the remote configuration flag was enabled, its evaluation triggered an infinite loop in request execution. This overloaded all our nodes, causing the entire fleet to fall over.
During production incidents, our process usually starts with reverting recent changes (including remote configuration flags), but this proved to be problematic for a few reasons:
Eventually the offending configuration was correctly reverted, nodes restarted and order restored.
The root cause of this issue is outlined in its own section below. What is more interesting with this issue in particular is the sequence of events during the incident that led to this being one of our longest-lived outages. For brevity, the sequence is listed here in point form.
* The 1% flag is evaluated on all requests on all nodes, but the feature is only activated for 1% of requests
** The issue was not with the new code path, but rather the evaluation of constraints
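The two points above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names are hypothetical, and a raised exception stands in for the real failure (an infinite loop) so that the example terminates.

```python
def healthy_evaluate(request_id: int, percentage: int) -> bool:
    """Decide whether this request falls inside the rollout percentage."""
    return request_id % 100 < percentage


def broken_evaluate(request_id: int, percentage: int) -> bool:
    """Stand-in for a constraint evaluation that fails before deciding anything."""
    raise RuntimeError("constraint evaluation failed")


def serve(request_ids, percentage, evaluate):
    """Count how requests end up when the flag is evaluated on every one of them."""
    outcome = {"new": 0, "old": 0, "failed": 0}
    for request_id in request_ids:
        try:
            active = evaluate(request_id, percentage)  # runs for every request
        except RuntimeError:
            outcome["failed"] += 1
            continue
        outcome["new" if active else "old"] += 1
    return outcome


print(serve(range(1000), 1, healthy_evaluate))  # {'new': 10, 'old': 990, 'failed': 0}
print(serve(range(1000), 1, broken_evaluate))   # {'new': 0, 'old': 0, 'failed': 1000}
```

With a healthy evaluator only 1% of requests take the new path, but a fault inside the evaluation itself affects every request, whatever the percentage is set to.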
Additional factors contributed to the time to fully diagnose and resolve the issue, including:
As mentioned above, we were in the process of changing providers for crawler detection. Wanting to do this safely, we chose to put the switch behind a configuration flag and use an ‘experiment’ (the term is used very lightly here) constraint to slowly ramp up traffic to the new provider.
Time to Detection: target is 15 mins, actual was four mins. Time to Resolution: target is 60 mins, actual was 215 mins. Our detection centres raised the bug through automated reports, and it was also flagged in the internal Slack channel our employees use to report issues. We also received user reports through our user satisfaction team and our Twitter channels.
On our web products, travellers visiting the web site were served the Houston error page. For our mobile apps, users redirecting to partners are sent via browser, meaning travellers using our apps could not complete any bookings.
We also saw an impact to SEO within one hour of the bug surfacing.
Why wasn’t this issue detected before impacting production?
The new feature was fully enabled in pre-prod via configuration flag. However, it wasn’t enabled with an experiment as a constraint; it was switched from 0% to 100% via config only. Up until this incident, this was relatively standard practice for us.
Automated test coverage was also missing the production case of both configuration and experiment constraint enabled.
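One way to make a gap like this visible is to enumerate every combination of flag value and constraint and compare it with what has actually been exercised. The sketch below is illustrative only; the case labels and helper names are assumptions, not our test suite.

```python
from itertools import product

FLAG_VALUES = (False, True)
CONSTRAINTS = ("none", "experiment")


def all_cases():
    """Every combination of flag value and constraint that could reach production."""
    return set(product(FLAG_VALUES, CONSTRAINTS))


def missing_cases(covered):
    """Combinations that no test or pre-prod run has exercised."""
    return sorted(all_cases() - set(covered))


# Pre-prod only switched the flag on and off, with no experiment constraint.
covered_in_pre_prod = [(False, "none"), (True, "none")]
print(missing_cases(covered_in_pre_prod))
# [(False, 'experiment'), (True, 'experiment')]
```

The second missing case, flag enabled with an experiment constraint, is the combination that was rolled out to production.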
Why did it take so long to fix when the cause was detected early on?
The configuration update was identified as the trigger early on. To attempt to fix it, the entry was changed from TRUE to FALSE.
However, as we now know, the root issue was caused by the constraint, which still existed.
When the team attempted to bring Scaffold servers back online (by restarting the IIS app pool) they immediately fell over again. The (incorrect) assumption at the time was that the Scaffold servers weren’t picking up the latest config before falling over.
The team then copied the latest config file to Scaffold boxes before restarting (to bypass the polling mechanism in Scaffold). However, the Scaffold servers still fell over when restarted.
With this new knowledge, the config entry was fully deleted (Scaffold has safe defaults to fall back on when no config is available).
The app pools were restarted and this time the Scaffold servers picked up the new config and stayed online.
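The difference between switching the entry to FALSE and deleting it can be sketched as follows. This is a simplified model, assuming (as the incident suggests) that constraints attached to an entry are still evaluated whatever its value is; the entry name, the dictionary layout and the helper functions are hypothetical.

```python
# Hypothetical safe default: feature off, nothing to evaluate.
SAFE_DEFAULTS = {"new_crawler_detection": {"value": False, "constraints": []}}


def resolve(config, name):
    """Return the entry for name, falling back to the safe default if absent."""
    return config.get(name, SAFE_DEFAULTS[name])


def constraints_to_evaluate(config, name):
    """Constraints on an entry are evaluated regardless of the entry's value."""
    return list(resolve(config, name)["constraints"])


disabled = {"new_crawler_detection": {"value": False, "constraints": ["experiment"]}}
deleted = {}

print(constraints_to_evaluate(disabled, "new_crawler_detection"))  # ['experiment']
print(constraints_to_evaluate(deleted, "new_crawler_detection"))   # []
```

In this model, setting the value to FALSE leaves the problematic constraint in the evaluation path, while deleting the entry removes it entirely.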
Regardless of the specifics of this incident, we needed to ensure we limited the blast radius and fault domain of any release in future. Our mechanism for limiting traffic to the 1% mentioned above was a software-based decision where it should have been physically limited to code running on a subset of nodes.
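To illustrate the idea of limiting a release to a subset of nodes, here is a minimal sketch. It is not a description of our tooling: the node names, the `canary_nodes` and `config_for` helpers, and the hash-based selection are all assumptions made for the example.

```python
import hashlib


def canary_nodes(nodes, percent):
    """Pick a stable subset of nodes (at least one) to receive a release first."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    count = max(1, len(nodes) * percent // 100)
    # Rank by hash so the same nodes are chosen every time for a given fleet.
    ranked = sorted(nodes, key=lambda node: hashlib.sha256(node.encode()).hexdigest())
    return set(ranked[:count])


def config_for(node, canaries, new_config, current_config):
    """Only canary nodes ever load the new configuration."""
    return new_config if node in canaries else current_config


fleet = [f"scaffold-{i:03d}" for i in range(200)]  # hypothetical host names
canaries = canary_nodes(fleet, 1)
print(len(canaries))  # 2
```

Because non-canary nodes never load the new configuration, a fault in evaluating it can only take down the canary subset rather than the whole fleet.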
All of the above has now been addressed.
What do you think? Comment below or tweet us with your theories, comments and thoughts.
We do things differently at Skyscanner and we’re on the lookout for more Engineering Tribe Members across our global offices. Take a look at our Skyscanner Jobs for more vacancies.
Hi, I’m Matt and I’m an Engineering Manager at Skyscanner, working alongside Dave and the rest of our core web team. Working on the front-line, there’s never a dull day. We juggle our time between maintaining the website and developing new systems that are fundamental to our success as we continue to scale across the globe. Working for a travel company, it comes as no surprise that I like to travel — China is next on the list!