One thing stands out most from what I have learned in the past 10 years.
Non-functional failure is the most dangerous technical risk in software.
Agile is designed to allow for change, encouraging experimentation. And if you are experimenting — with your design, user experience or technology — you should expect to fail. By failing you learn more, allow innovation and will have a better product afterwards. We expect this from controlled experimentation, but are caught out when a whole service fails. It is this macro-failure that we must beware of.
There are five macro reasons I can see why digital services fail.
- Building a product that doesn’t work. This is the most obvious type of failure: the software is not functional. It is the right product but it’s defective.
- Starting but not finishing the product. Fundamentally the software was not delivered to its users.
- Spending more money than expected. Running over budget is not automatically a failure, but a major budget overrun added to one of the others is.
- Building the wrong product. You might have the best developers with the best technologies but if the product you give to your users is the wrong one then you’ve failed.
- Building a product that may work but can’t be used. This is the silent but deadly reason for failure. It is often badly understood by product owners. This is the failure I want to explore.
What is Non-Functional?
My definition is,
non-functionals are concerned with how your software product works not what it does (functional).
So for example, if your service allows users to obtain a fishing license but in doing this, your data is exposed, then it is insecure. This is a non-functional issue.
Some have abandoned the word “non-functional” altogether and have adopted words like “constraint” but I’m not sure this adequately covers what is needed.
It is common to see software products that have non-functional failures because these concerns are often badly understood for software. If you think about a new car, its non-functional concerns are well understood: it should be drivable by one adult (accessibility), do 100 mph without falling apart (performance), prevent others from stealing it (security) and be able to do 20,000 miles until its first service (reliability).
It is dangerous to think non-functional issues are more relevant to software architects than users. There is a close relationship between what your service does and how it does it. Issues with how your service works are often barriers to any use of your service. If your service is not accessible on mobiles or tablets users will avoid using your service. If your service has performance issues users will not be able to use your service. Look what happened to Pokémon Go in its launch week.
So we all need to take more care to ensure non-functionals are taken seriously.
Silent But Deadly: The NFR Trap
I have written before about the NFR Trap in relation to system performance. The trap is to believe your team won’t have non-functional issues because you’ve got “The NFRs”.
There are some who get performance optimisation. They test, they analyse and they fix. But when real users start to use their service they have performance problems. Unfortunately these folks often get snared by the NFR Trap.
The NFR — Non-Functional Requirements — did not describe the level of real-world usage that real users impose on the service. Instead very complicated, hard to understand and detailed requirements were constructed by a smart software architect that didn’t get the user need or user context.
With service design and agile development we now have a focus on users and their needs represented as user stories instead of business requirements. Yet non-functional requirements are often represented as a list of abstract statements about things like system performance, security, usability, accessibility, availability, maintainability and business continuity. They are abstract because they don’t relate to users (bad), don’t mean much to most people (bad) and are difficult to test (very bad).
Often The NFR are kept separate and referenced, but are very difficult to corroborate or approve. The NFR are often derived from templates that carry an undue reverence. Take this NFR template for example. It is typical of what is perpetuated by many teams but unfortunately it’s inadequate.
Let’s look at two common examples to understand why traditional NFRs are inadequate. Ask yourself for each one what it means for the users.
Availability must be no less than 99.9%
What this means is that the system should be available (working) 99.9% of the time — except when it is down for scheduled maintenance.
There are a few problems with this. It doesn’t relate to availability when users need it. If the service is only used in the daytime and peak usage is in the morning, then availability is business critical in the morning, important in the afternoon and not required in the evening.
How will scheduled downtime affect users? It is rare to see scheduled downtime targeted in NFRs, but users don’t care whether an outage was scheduled: downtime of any kind means they can’t use the service. So for a 24x7 service it will be important to understand what users can tolerate and to design for minimal or zero downtime.
How can you be sure 99.9% is even necessary? It isn’t untypical for these numbers to be guessed, written in The NFR by The Architect and never questioned ever again. A better question to ask is, what is the impact to users when the service is not available and what alternatives will they have? Designing a contingency or having high impact areas of your service less complex (to allow easier redundancy) may be time better spent for your users instead of a focus on an uptime threshold.
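To make the arithmetic behind this concrete, here is a minimal Python sketch. It works out how much downtime a 99.9% target actually permits in a year, and then scores a one-hour outage by how much user demand it hits rather than by the clock. The hourly demand figures are invented for illustration and the function names are my own, not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime a target permits over a period, before any maintenance exclusions."""
    return period_hours * (1 - availability_pct / 100)


def demand_weighted_availability(demand_by_hour: list[float],
                                 down_hours: set[int]) -> float:
    """Percentage of user demand that was served, given which hours were down.

    demand_by_hour holds 24 relative demand weights (hypothetical figures).
    """
    total = sum(demand_by_hour)
    lost = sum(w for hour, w in enumerate(demand_by_hour) if hour in down_hours)
    return 100 * (total - lost) / total


# Hypothetical daytime-only service: busy mornings, quieter afternoons, idle otherwise.
# Hours 0-7 idle, 8-11 peak, 12-16 moderate, 17-23 idle.
demand = [0] * 8 + [10] * 4 + [5] * 5 + [0] * 7

print(round(allowed_downtime_hours(99.9), 2))     # hours of downtime per year
print(demand_weighted_availability(demand, {9}))  # one-hour outage at 09:00
print(demand_weighted_availability(demand, {20})) # one-hour outage at 20:00
```

Measured by the clock, both outages give the same daily figure (23 hours out of 24). Weighted by this made-up demand profile, the 09:00 outage loses about 15% of the day’s demand and the 20:00 outage loses none, which is the distinction a flat 99.9% cannot express.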
90% of all page requests must be completed within 1.5 secs
This is an attempt to describe how responsive your service should be based on experiences with popular websites. It is written to provide confidence that your service will be “fast” for its users. Lots of time may have been spent in meetings discussing whether the target should be 1.5 seconds, 2 seconds, 3 seconds or something else. But isn’t this missing the point?
Surely the point is to understand what performance is expected by your users? What performance level will allow them to use your service without frustration? To understand this, it’s necessary to speak to users and do effective research with them and analyse performance data. Some parts of your service might be time-critical, others less so. Use this to prioritise critical features within your service avoiding generalised page response time targets like the one above.
While you’re doing this make sure you validate actual performance by testing it.
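As a rough illustration of per-feature targets, the Python sketch below compares measured response times for each feature against a threshold you might have agreed through user research. The feature names, timings and thresholds are all hypothetical, and the 95th percentile is just one reasonable choice of measure.

```python
from statistics import quantiles


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) of the samples (needs 2+ samples)."""
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def features_over_target(timings: dict[str, list[float]],
                         targets: dict[str, float],
                         pct: int = 95) -> dict[str, float]:
    """Return the features whose pct-th percentile response time exceeds its target."""
    breaches = {}
    for feature, samples in timings.items():
        observed = percentile(samples, pct)
        if observed > targets[feature]:
            breaches[feature] = observed
    return breaches


# Hypothetical measurements in seconds for two features of a licensing service.
timings = {
    "pay_for_license": [0.4, 0.5, 0.6, 0.7, 3.0],   # time-critical for users
    "download_receipt": [1.0, 1.2, 1.1, 1.3, 1.4],  # users tolerate a short wait
}
# Hypothetical thresholds, assumed to come from user research.
targets = {"pay_for_license": 1.0, "download_receipt": 3.0}

breaches = features_over_target(timings, targets)
print(breaches)
```

With these invented numbers only the payment step is flagged, even though its typical response is faster than the receipt download. A single page-wide target would hide that.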
Test It, Don’t Just Require It
It’s vital to ensure features are measured, especially for performance. Performance testing is more important than performance requirements. This is because, once you are testing, you can easily iterate and optimise performance based on test results and user research. Active performance testing of features by teams should become the new normal. And the results can even be used to guide acceptable performance for users.
There is a risk though that your testing could be providing false confidence. To help avoid bogus performance testing, you need to ensure a number of realism-factors are present.
- Test on production-sized infrastructure. Even better, test on Production if it’s a new service.
- Test with production-sized datasets, ideally production data or anonymised production data.
- Test response time at peak load. You want your service to be as responsive on its worst day as its best day. This means testing at peak load.
- Ensure the testing is end-to-end. Often response time testing is done from the edge of the data centre only (largely because it’s easier to measure). However this doesn’t help your users.
- Ensure the testing throttles connectivity and client performance based on user devices. If you have a significant group of users with older devices on sub-broadband speeds their performance will be substantially slower.
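To show why the last two factors matter, here is a back-of-envelope model in Python of what a user experiences compared with what you measure at the edge of the data centre. It is deliberately simplistic (it ignores things like caching and parallel connections), and the client profiles and page figures are invented, so treat it as a sketch rather than a substitute for real throttled testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientProfile:
    name: str
    bandwidth_mbps: float  # downstream bandwidth in megabits per second
    round_trip_ms: float   # network latency per round trip
    render_factor: float   # client-side slowdown relative to a fast desktop


def estimated_page_seconds(server_s: float, payload_mb: float, round_trips: int,
                           client_work_s: float, profile: ClientProfile) -> float:
    """Rough end-to-end time: server + transfer + latency + client-side work."""
    transfer_s = payload_mb * 8 / profile.bandwidth_mbps
    latency_s = round_trips * profile.round_trip_ms / 1000
    return server_s + transfer_s + latency_s + client_work_s * profile.render_factor


# Hypothetical user groups.
desktop = ClientProfile("desktop on fibre", bandwidth_mbps=50, round_trip_ms=20, render_factor=1.0)
old_phone = ClientProfile("older phone, slow mobile", bandwidth_mbps=1, round_trip_ms=300, render_factor=4.0)

# Hypothetical page: 0.4 s at the data-centre edge, 1.5 MB payload,
# 6 round trips, 0.3 s of client-side work on a fast desktop.
for profile in (desktop, old_phone):
    seconds = estimated_page_seconds(0.4, 1.5, 6, 0.3, profile)
    print(f"{profile.name}: {seconds:.2f} s")
```

In this made-up case the same 0.4-second server response becomes roughly 1 second for the first profile and over 15 seconds for the second. An edge-only measurement would report both as 0.4 seconds.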
So we’ve seen that The NFR are often too abstract and are rarely considered in context of the users. Let’s rip up The NFR and start again with non-functional needs that are user-focused, testable and a regular aspect of team development.
Non-functionals can be normalised within agile development by considering them as features. Many non-functionals, as we’ve already seen, heavily impact user experience and so can be written as user stories. Alternatively, orthogonal features such as performance expectations can be integrated into your stories as acceptance criteria.
When is the right time to do this? Beta. The Beta phase is where you build out an end-to-end service and start using it with production data. Just make sure your non-functional features are developed in your backlog at the beginning of Beta. Waiting until near the end of the Beta phase is an invitation to fail.
At Kainos we have written some guidance for teams moving into Beta. These were written by a bunch of Kainos technical architects who have seen the lows of non-functional failure. These will help guide some of the more important non-functional features you should be thinking about.
- You have a Backlog that contains non-functional features.
- You have a Backlog that ensures operational (live running) aspects of functional stories are accounted for.
- You have a production environment for the start of private Beta.
- You have a deployment pipeline to build, package and release features into production rapidly.
- You have a deployment pipeline to build, package and release patches into production rapidly.
- You have invested in automation for builds, tests and deployment (application and infrastructure).
- You have instrumentation to understand what your users are doing with the service.
- You have got aggressive scale and performance targets. Don’t be satisfied with historic peaks. You have load, stress, bandwidth and soak tested your service beyond these performance targets.
- You have tested all integration points (internal and external to your service) for performance, scale and stability.
- You have performance, application and infrastructure monitoring to understand what your service is doing.
- You have alerting to proactively identify performance and stability issues.
- You are aggregating log information to a central point and making it available for all developers.
- You have accessibility testing planned with real users.
- You understand the security classification of the service and its data — and what this means for development, testing and production.
- You have in-sprint security testing or regular checkpoints with the security specialists.
- Security controls for both application and infrastructure required for live operations are present across all environments in the pipeline.
- You are able to deliver the service to a wide range of browsers and devices.
- You know how you will open source your code repositories without including sensitive configuration.
- You are able to test failure of your service and know how it will respond if parts aren’t available.
- You have discussed and agreed the contingency options for the digital service with the customer and where appropriate prioritised work on building contingency up front (particularly important when working with hard deadlines).
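The points above about testing failure and agreeing contingency can be sketched in a few lines of Python. This is an illustrative example only: the payment client, the exception and the retry queue are all hypothetical stand-ins, and a real service would use its own integration points and test framework.

```python
class PaymentUnavailable(Exception):
    """Raised by the (hypothetical) payment client when the provider is unreachable."""


class WorkingPayments:
    """Test double for a healthy payment provider."""

    def charge(self, amount_pence: int) -> str:
        return "paid"


class FailingPayments:
    """Test double that simulates the payment provider being down."""

    def charge(self, amount_pence: int) -> str:
        raise PaymentUnavailable("provider unreachable")


def apply_for_license(payments, retry_queue: list, amount_pence: int) -> str:
    """Take payment, or fall back to the agreed contingency if payments are down."""
    try:
        payments.charge(amount_pence)
        return "license issued"
    except PaymentUnavailable:
        # Contingency: keep the application and collect payment later.
        retry_queue.append(amount_pence)
        return "application saved, payment pending"


pending: list = []
print(apply_for_license(WorkingPayments(), pending, 3000))
print(apply_for_license(FailingPayments(), pending, 3000))
print(pending)
```

The useful part is less the code than the conversation it forces: someone has to decide, up front, what the user should see when the payment provider is down.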
Let’s not be complacent when building digital services for citizens and customers. Macro-failure is bad for everyone, so let’s work hard to avoid it.
Thanks to @johnstrudwick who has very kindly edited this post into something much more readable.