John Doran

@johnwildoran

Resolving our platform stability issues

A couple of years back we started to face some serious scaling issues. As the platform had grown and we had onboarded more and more customers, system response times began to suffer due to server exhaustion and database contention. Our hosting costs were rising much faster than our growth rate, which was clearly unsustainable for the business. Things became very serious when we suffered outages as various components of the platform started falling over on a regular basis.

We found that we were continuously firefighting to keep the system running smoothly. The support team ended up with a giant red button on their floor that was pressed numerous times per week. Our reputation for delivering amazing customer service was being hurt. The teams who had to deal with frustrated customers weren’t able to do their jobs, that is, helping salons utilise our platform to grow their business.

We needed to take serious action and fix the stability issues, rethinking our values and our decision-making process when an incident happened. We could no longer just “redeploy it” or “make x bigger” to solve production issues.

Every incident that occurred needed to have an outage report. This allowed us to have clear actions and solutions for every type of incident.

Each outage report followed the same format, was published by the software engineers working on the problem and was put onto our wiki. It was shared with everyone in the company, giving transparency and reassurance that we were putting preventative measures in place.

Outage report format

Description

Digestible one-liner of what happened.

Outage time

hh:mm

Number of support tickets raised

Detailed numbers of impact on the support team

Affected functionality

Description of the functions of the system affected by the outage

Explanation of the problem

A clear technical description of what happened

The report ensures we have a clear understanding of what actually happened.

Investigations

Details of where the engineer looked, how they came to fix the issue and how long it took, along with screenshots of metrics or logs from the incident.

Preventative measures and actions

What are we going to do to stop this from happening again?

The minimum expectation here would be an alert to help us pre-empt the issue. Each action needed to be tracked in Jira.
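As an illustration only, the report format above could be captured as a small data structure so that every report carries the same fields. This is a minimal sketch, not the tooling we used (our reports lived on the wiki). The class name `OutageReport`, its field names and the issue key in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class OutageReport:
    """One outage report, mirroring the headings of the format above."""
    description: str             # digestible one-liner of what happened
    outage_time: timedelta       # duration of the outage, rendered as hh:mm
    support_tickets: int         # number of support tickets raised
    affected_functionality: str  # functions of the system that were affected
    explanation: str             # clear technical description of the problem
    investigations: str          # where the engineer looked, how it was fixed
    # Hypothetical issue-tracker keys, one per preventative action.
    preventative_actions: List[str] = field(default_factory=list)

    def outage_time_hhmm(self) -> str:
        """Format the outage duration as hh:mm."""
        total_minutes = int(self.outage_time.total_seconds()) // 60
        hours, minutes = divmod(total_minutes, 60)
        return f"{hours:02d}:{minutes:02d}"

    def is_complete(self) -> bool:
        """A report needs a description and at least one tracked action."""
        return bool(self.description.strip()) and len(self.preventative_actions) > 0
```

With a structure like this, a report with no tracked action (for example, a new alert) stays incomplete: `is_complete()` returns `False` until an issue key such as a hypothetical `"OPS-1"` is appended to `preventative_actions`.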

What we found

  • Weaknesses in our automation and deployment procedures
  • A build process and deployment speed that were too slow
  • Gaps in monitoring and alerting (customers knew about issues before we did)
  • Server components that struggled to deal with traffic volumes
  • Outdated versions of libraries and code that had memory leaks

When analysing the data we were clearly able to see how much the instability was hindering our product development. Engineers were constantly being pulled in different directions to firefight. That instability in velocity and delivery meant we couldn’t accurately predict when new features or improvements could be delivered. Two of our core values are growth and thinking long term, so we knew it was time to fix these issues and evolve our platform.

Fixing the stability issues

The effort to fix everything was too large for a small engineering team that was also continuing product development work. We had to make a big decision to halt all product development work and undertake a large piece of engineering effort to fix the problems. This had large knock-on effects as we had business commitments made and expectations to meet.

The goal was clear: to improve the stability of our system while helping it scale as we grow our customer base. We called this engineering effort project Darwin as it was about the evolution of our system. From an engineering side it was extremely difficult to know when we would be done, but we broke it down into small measurable increments.

Some of the major pieces of work we took on were:

  • Adding test coverage at an API and integration level first, so we would know if we broke anything
  • Writing Gatling performance tests to ensure we could simulate production environments
  • Dividing the monolithic backend into separate services (bounded contexts per responsibility)
  • Migrating from classic EC2 baked AMI deployments to Docker
  • Making our containers self-healing and load balancing them behind ALBs
  • Moving our infrastructure to code
  • Migrating our databases to Amazon Aurora
  • Making our services stateless and removing caches
  • Adding autoscaling capabilities
  • Fully automating our build and release processes
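To make the self-healing and load balancing item more concrete, here is a minimal sketch of a health endpoint that a load balancer health check could poll. It assumes a Python service and uses only the standard library. The `/health` path, port 8080, the `CHECKS` table and the helper names are all hypothetical, and it does not describe our actual services.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check; return 200 only if all pass, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    code = 200 if all(results.values()) else 503
    return code, json.dumps(results, sort_keys=True)


# Hypothetical checks; a real service would ping its database, queue, etc.
CHECKS: Dict[str, Callable[[], bool]] = {"database": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_status(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The idea is that a container whose checks fail answers with 503, so the load balancer stops routing traffic to it and whatever schedules the containers can replace it. Because the services are stateless, replacing an instance loses nothing.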

Looking back a year later

While it was painful to stop feature development and fix the issues, we can safely say our stability problems are gone. There is no more firefighting and the red button on the support floor is thankfully gathering dust. By using our long term values as guidance, we took on project Darwin to attain platform stability, fault tolerance and elasticity.

So that we never have to fall back into this big bang approach to fixing things, we have adopted a continuous improvement mindset, which is now a core part of our engineering values. We take periodic breaks in our development sprints to work on our technical backlog: fixing niggling issues, upgrading areas of the system, answering the unknowns and always making the system better.

As mentioned, our hosting costs were unsustainable. Looking back, we now see a lower and non-fluctuating AWS bill.

On a more personal note, this was one of the hardest engineering challenges I have ever faced. It wouldn’t have been possible without the talented engineers and the support of the team at Phorest.
