I spent quite some time in 2015 trying to improve the performance of web-scale digital services. Of course it wasn’t just me; I was with the teams at Kainos working to deliver new digital services for UK citizens.
So what did I learn?
Optimisation is Under-Valued
Very often performance optimisation is under-valued. For some software projects performance is a box that needs to be ticked at the 11th hour before the first production release. For some folks it is all about performance testing using tool X from Acme Corporation. Others successfully test and optimise but still the production service fails when real users get at it. And of course there are those that don’t bother at all.
Let’s be really clear then: performance optimisation is more important than all the features of a service combined (hint: users won’t use your features if they are slow or unresponsive). This is a different perspective to begin with, and it will challenge how you address performance.
Optimisation not Testing
I say “performance optimisation” because many reduce it to performance testing. And that means hiring a performance tester who knows how to use tool X from Acme Corp. But testing (and tooling) is only one small part of performance optimisation. Don’t be fooled into a false sense of safety just because you do performance testing or you selected an industry-leading tool. Make sure you are analysing the results and fixing the service (by service I mean application and infrastructure together) and repeating. And repeating again (hint: you can’t stop performance optimisation if you keep adding features).
Performance as a Feature
It’s helpful to consider performance to be a feature not some non-functional bucket of technical debt to risk manage. Once performance is a feature then it’s harder to de-prioritise. You can do this in multiple ways but one way is to make performance part of the acceptance criteria for your user stories. This way it becomes avoidable only by intentional negligence of your teams.
To allow your product owners to do this you cannot leave performance optimisation to the end. Nor can you leave the provisioning of a technical environment for representative performance testing to the end. You will need to be ready to performance optimise early. Having a feature should mean having a performance-optimised feature. So when should you start? I would suggest as soon as you have the user need and have chosen to build a service for production; this is aligned with the start of the Beta phase for UK Digital by Default.
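As a rough illustration of performance as an acceptance criterion, a story could carry a latency limit that is checked like any other test. This is only a sketch: the 95th percentile measure, the 500 ms limit and the sample timings are all assumptions made up for the example, not figures from any real service.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def meets_acceptance(latencies_ms, p95_limit_ms):
    """True when the 95th percentile latency is within the story's limit."""
    return percentile(latencies_ms, 95) <= p95_limit_ms

# Hypothetical response times for one user story, in milliseconds.
measured = [120, 135, 150, 160, 180, 210, 240, 300, 420, 900]

# Hypothetical acceptance criterion: 95% of requests complete within 500 ms.
print(meets_acceptance(measured, p95_limit_ms=500))
```

With a check like this in the story’s definition of done, a slow feature fails acceptance in the same way a broken one does.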
Beware the NFR Trap
There are some who get performance optimisation. They test, they analyse and they fix. But when real users start to use their service they have performance problems. Unfortunately these folks often get snared by the NFR Trap. The NFR — Non-Functional Requirements — did not describe the level of real-world usage that real users impose on the service. Instead very complicated, hard to understand and detailed requirements were constructed by a smart software architect that didn’t get the user need or user context.
So challenge your performance NFRs. If you’re not satisfied, tear them up and write new performance targets. Ask yourself some questions:
- Are they intelligible?
- Do they cover the basic user needs for all user personas?
- Do they cover spikes in historic usage?
- Do they allow for the worst-case number of concurrent users?
- Do they allow for the worst-case number of concurrent transactions?
- Do they cover unknown future spikes (hint: past performance is no guarantee of future results)? Define stretch targets as part of your performance indicators.
- Do they cover no-bandwidth or low-bandwidth users?
- Do they include the types of devices users will adopt?
- Do they cover everyone logging in at once (hint: sometimes things go wrong and your service goes down. When this happens everyone attempts to authenticate at the same time)?
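Targets that pass these questions can usually be written down very simply. The sketch below shows one possible shape, with a target, a stretch target and an excessive stretch target per measure. The class name, the field names and every number are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerformanceTarget:
    name: str
    target: int      # level the service must sustain
    stretch: int     # level covering a spike beyond past usage
    excessive: int   # level well beyond anything anticipated

def rate(measured: int, t: PerformanceTarget) -> str:
    """Describe a measured capacity against the three levels."""
    if measured >= t.excessive:
        return "meets excessive stretch target"
    if measured >= t.stretch:
        return "meets stretch target"
    if measured >= t.target:
        return "meets target"
    return "below target"

# Hypothetical numbers, not taken from any real service.
logins = PerformanceTarget("concurrent logins", target=2000, stretch=5000, excessive=10000)
print(f"{logins.name}: {rate(3200, logins)}")
```

A one-line rating per measure is something the whole project team can read without a glossary.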
If you just cannot agree on performance targets then an alternative is to provide product owners with quantitative data on the capacity available when using the feature. This can be reached by stress testing on a production-like environment with a full dataset. This allows the performance risk to be assessed by product owners or senior managers.
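A minimal sketch of that kind of stress test follows: ramp up concurrent users in steps and report the last level the service sustained. The function names, the step sizes and the 1% error threshold are assumptions, and `fake_run_step` is a stand-in for whatever load generator you actually drive against your production-like environment.

```python
def find_capacity(run_step, start=50, step=50, limit=5000, max_error_rate=0.01):
    """Ramp concurrent users until a step fails; return the last level that passed."""
    sustained = 0
    users = start
    while users <= limit:
        error_rate = run_step(users)
        if error_rate > max_error_rate:
            break
        sustained = users
        users += step
    return sustained

def fake_run_step(users):
    """Stand-in for a real load generator: errors appear above 400 users."""
    return 0.0 if users <= 400 else 0.2

print(find_capacity(fake_run_step))
```

The number that comes out is the capacity figure to put in front of product owners, rather than a pass or fail against a target nobody agreed on.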
Publish Early and Often
Experienced folks worry that there will be performance issues with a newly live service; it can really kill a launch. To build confidence, be ready to publish your performance results to your whole project team. To do this you will need to make them intelligible (so typical NFRs won’t cut it). Think of summary performance targets that can demonstrate the progress of performance optimisation and the risks if the feature is accepted in its current state.
Test-Driven Infrastructure
So you’ve heard of TDD — Test-Driven Development. Test-Driven Infrastructure brings the same result-oriented approach to infrastructure builds that TDD does for applications.
Very often infrastructure is built using a waterfall process even when the development teams are agile. This has the downside that results only appear near the end, and the design rests on the smarts of the architect. And given that the Production environment tends to be provisioned late in the project lifecycle, this exacerbates the performance risks.
Instead bring the benefits of agile to your infrastructure. Be test-driven, particularly performance testing leading to performance optimisation and working infrastructure. Starting with infrastructure tests – performance, failure, security – means your infrastructure design is proven as you develop it. Scaling can be done later as an iteration and further tested to find the sweet spot for your workloads.
So how will you analyse your Production infrastructure as you test it early? Your expensive performance test tooling won’t help much for this. Instead embrace the DevOps Doctrine to bring forward all of the traditional ops monitoring features (hint: typically they are only needed after go-live so don’t get built until the end). This way you can test and optimise your monitoring as you test during development. Development and Ops together!
Unfixable Performance Defects
Be ready for performance defects you cannot fix. You will need to make bold decisions based on the risk to your service.
This event is more likely than you may realise even if you have the skills to fix your own code. If for example you have selected commercial products to include in your service you already are exposed to this risk. You will not have control to fix issues and will instead be dependent on the vendor.
So instead, when selecting products you need to get assurances before you buy that there are real-world case studies or lab test results that demonstrate it can meet your performance targets. If your project buys anyway then you are dependent on the vendor unless you design your way out of it. It is clearly a last resort to do this but if the risk is too great then you may need to protect the users of your service.
All The Data
Data remains one of the most difficult elements to reach agreement on, especially for government services given information security and privacy. Information Security custodians may inform you that you may not use real data for non-production uses for reasons. This can often result in no full-sized real data being used for functional and performance testing.
However it is clear that you cannot performance test without a full-sized dataset. There are also risks in using synthetic full-sized datasets for functional and performance testing, given that the actual real data can be important (hint: data is more important than the code, invest more of your time getting data right).
So make the clear case for full-sized real data for performance testing. Take it to senior management if necessary. It may be necessary to pseudo-anonymise personally identifiable information as part of the agreement to use it but this is worth it.
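One common way to pseudo-anonymise is to replace each identifying value with a keyed hash, so the dataset keeps its full size and the same person still maps to the same token across tables. The sketch below assumes HMAC-SHA256 from Python’s standard library. The field names, the record and the key are hypothetical, and whether this is acceptable for your data is a decision for your Information Security custodians.

```python
import hashlib
import hmac

def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed hash; the same input always gives the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a longer token lowers the chance of collisions.
    return digest.hexdigest()[:16]

def pseudonymise_record(record: dict, pii_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with only the PII fields replaced."""
    return {
        field: pseudonymise(str(value), secret_key) if field in pii_fields else value
        for field, value in record.items()
    }

# Hypothetical key and record; a real key must be kept out of the test environment.
key = b"example-key-not-for-real-use"
citizen = {"name": "Jane Example", "postcode": "BT1 1AA", "claims": 3}
print(pseudonymise_record(citizen, {"name", "postcode"}, key))
```

Note that the tokens no longer look like names or postcodes, so code paths that depend on the format of the real values will not be exercised in the same way.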
Hey, so what about those top 10 optimisations you mentioned at the beginning…
In summary then, 10 things I’d recommend you follow to help mitigate performance risks.
- Build Production early. As early as the start of Beta. This will allow you to test on a Production-sized environment as you develop.
- Don’t rely on Non-Functional Requirements. Define easy to follow targets based on capacity and concurrency. Then define a stretch target. Then define an excessive stretch target.
- If your targets are uncertain, focus on stress testing not load testing, so that your Production environment can be described to your product owners in terms of the capacity it will sustain.
- Script your performance tests before you have Production. It is helpful to design these to be flexible about data expected from the start so they can be run with live data and test data.
- Test your Production as you build it. This ensures it is both tested and optimised for your targets.
- Be open. Publish your performance results to senior management and the rest of the project team. Be careful to present your results against your targets and stretch targets.
- Build deep monitoring of infrastructure and application at the beginning of building Production and tune these during your performance testing.
- Always performance test with a full-sized dataset.
- Be ready for unfixable performance defects. Look ahead when selecting commercial or open source products and demand test results up-front.
- The holy grail is when performance testing is continuous, part of your deployment pipeline. At the very least each sprint should be performance tested.
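For the last point, a continuous check in a deployment pipeline can be as small as comparing the latest run against a stored baseline and failing the build on a regression. This is a sketch under assumptions: the metric names, the numbers and the 10% tolerance are invented, and it treats every metric as lower-is-better.

```python
def regressions(baseline, current, tolerance=0.10):
    """Return the metrics that got worse than baseline by more than the tolerance."""
    failed = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # metric not measured in this run
        if now > base * (1 + tolerance):
            failed[name] = (base, now)
    return failed

def gate(baseline, current, tolerance=0.10):
    """Exit code for a pipeline step: 0 to proceed, 1 to stop the deployment."""
    return 1 if regressions(baseline, current, tolerance) else 0

# Hypothetical 95th percentile latencies in milliseconds.
baseline = {"login_p95_ms": 400, "search_p95_ms": 250}
current = {"login_p95_ms": 410, "search_p95_ms": 320}

print(regressions(baseline, current))
print(gate(baseline, current))
```

A pipeline step would pass the result of `gate` to `sys.exit` so that a slower build is stopped in the same way as a failing one.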