Prepare your customers to survive cloud service disruptions
Monday, 27 February, 2017: Another new customer went live with our Approov authentication service. It’s fun to launch a new customer and watch the traffic grow between their mobile apps and our AWS-hosted service. Day 1 for our new customer was a smooth success. I love my job.
Tuesday, 28 February, 2017: It’s early morning, and I’m on call for Approov operations support. In passing, I notice that some websites I’m visiting are a bit sluggish; images aren’t rendering. Odd. What’s up, Amazon Web Services? S3 cloud storage is experiencing increased error rates? For hours? No alarms on our Approov service. Hmm, this could be a nerve-racking day. I love my job?
Recognizing Only Real Apps
About nine months ago, my company, CriticalBlue, launched a new product, Approov, a cloud service which dynamically attests that the app you are using on your phone is genuine and has not been tampered with. You might log in to Retailer X with valid credentials, but if the Retailer X app with that tempting 50% off sale is a fake, you’re not going to get what you paid for.
Mobile-first is a popular mindset for product managers, but unsurprisingly, mobile is also a popular target for cybercriminals. Fake apps draw users and revenue away from your product sites, and fake apps stuffed with ads or stealing customer credentials will significantly damage your brand. We are experts in low-level system performance optimization, so it was natural for us to consider binary-level attestation to weed out phony apps.
Beyond fake apps, other attacks focus on reverse-engineering product APIs. Using bots, competitors can scrape valuable information off product back-ends and siphon customers and profits to their own sites, using your information against you. Alternatively, competitors may try flooding your product servers with expensive application-layer API calls to degrade your service and frustrate your customers. You need a way to reliably separate the good traffic from the bad.
Traditionally, you protect your APIs with secret keys, but any secrets hidden in your app are juicy targets for hackers, so we went looking for a way to get these secrets out of the apps altogether. Our solution was to partition our authentication between the app itself and an external service. Your secrets are now managed completely off-device, which is a big win. Customers agreed with us, and because of that, Amazon Web Services has become a key component in our security solution.
Engineering for Availability
Adding the Approov service requires very few changes to your normal app development and deployment flow. However, it does introduce a cloud services dependency in our own services flow.
The scary thing about dependencies is that it’s very easy to bet your company, and potentially your customers’ companies, on them by assuming they’ll always be there. As an engineer wrote in our company blog, Amazon’s S3 service has a 99.9% uptime guarantee, which translates into less than 44 minutes of downtime per month. That’s 44 minutes you had better be prepared for.
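To make that arithmetic concrete, here is a small sketch of the downtime budget implied by an uptime percentage. It assumes an average month of 365.25 / 12 days, which is my own simplification; a real SLA defines its own measurement window. The function name is illustrative only.

```python
# Rough downtime budget implied by an uptime percentage.
# Assumption: an "average" month of 365.25 / 12 days. Real SLAs define
# their own measurement window, so treat this as illustrative arithmetic.

MINUTES_PER_DAY = 24 * 60
AVG_DAYS_PER_MONTH = 365.25 / 12


def downtime_minutes_per_month(uptime_percent: float) -> float:
    """Return the minutes per average month a service may be unavailable."""
    unavailable_fraction = 1.0 - uptime_percent / 100.0
    return unavailable_fraction * AVG_DAYS_PER_MONTH * MINUTES_PER_DAY


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        minutes = downtime_minutes_per_month(uptime)
        print(f"{uptime}% uptime -> {minutes:.1f} min/month")
```

For 99.9% this works out to a little under 44 minutes, and each extra nine cuts the budget by a factor of ten.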
And don’t blame AWS; they advise developers and provide extensive guidance on architecting for performance and resiliency. Failures in parts of a complex system will occur frequently, but failures of the whole system must be made vanishingly rare.
So how did we do throughout the four-hour-long February 28th S3 downtime? To our engineering team’s credit, all customer authentication services remained operational. High availability was a product design goal, so the initial product rollout featured modestly overprovisioned, redundant, and elastic services with strong isolation between different customer services.
We do use S3 to store attestation data, but we cache crucial data on the attestor itself, so we can withstand temporary S3 failures. While S3 was down for hours, we continued attesting for customers. If every one of a customer’s redundant attestors had failed, we would have been hard-pressed to restart their service, but we were resilient to up to N - 1 failures, where N is the number of redundant attestors. If necessary, we were prepared to redirect attestation requests to a passthrough service to keep application traffic flowing.
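The caching idea can be sketched as a read-through cache that keeps serving its last known copy while the backing store is unavailable. This is a generic illustration, not the Approov implementation: the class, the `StoreUnavailable` exception, and the `fetch` callable standing in for an S3 read are all hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) fetch function when the store is down."""


class StaleTolerantCache:
    """Read-through cache that serves stale entries while the store is down."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # stand-in for a read from remote storage
        self._ttl = ttl_seconds      # how long an entry counts as fresh
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key: str) -> Optional[bytes]:
        cached = self._entries.get(key)
        now = self._clock()
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh hit, no store access needed
        try:
            value = self._fetch(key)
        except StoreUnavailable:
            # Store is down: fall back to the stale copy if we have one.
            return cached[1] if cached is not None else None
        self._entries[key] = (now, value)
        return value
```

The trade-off is that data may be out of date for the length of the outage, and a key that was never cached still cannot be served, which is one reason a cold restart during an outage is the hard case.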
What if even more AWS services failed? We maintain redundant services with another cloud provider and can fail over if necessary.
We continue to test and improve our design-for-availability architecture and how quickly we adapt to various failure scenarios.
Courting Experienced Customers
Working with experienced customers can make a big difference in managing a successful launch. They place a premium on avoiding service disruptions and tend to gradually stress new systems as they bring them up.
In the case of my new customer who launched through the S3 failures, the customer had prepared a conservative launch plan with extensive pre-production testing, user-transparent attestation switching, and the ability to monitor and analyze attestation failures and performance variations.
This customer offers both iOS and Android apps. They chose to initially do a single-platform release to fewer than 25K users. Over time and in a controlled manner, they increased the percentage of users on both platforms eligible for live upgrade to the more secure Approov-enabled apps. It’s been a smooth ramp.
Cultivating these customers early is a double win, because they are often willing to share their successful launch strategies with less experienced potential customers.
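One common way to implement this kind of controlled ramp is to hash each user ID into a stable bucket and compare it against the current rollout percentage. The sketch below shows that general technique only; I am not describing how this customer actually did it, and the function names and salt value are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "example-rollout") -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < rollout_percent
```

Because the bucket is derived from the user ID, a user who becomes eligible at 10% stays eligible as the percentage rises, so nobody flips back and forth during the ramp.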
Cloud Services Success
Bringing a product to market requires a well-thought-out plan, which inevitably changes during rollout. As a product manager, you want to minimize external dependencies. Each dependency must justify its profit contribution against the risk of failure and loss of control.
Using cloud services enables us to ultimately offer you a lower price because we only need to charge you for the services you use, and we can elastically scale to meet demand without the worst-case capitalization costs of maintaining our own data center. But that does introduce a dependency on AWS, and you should expect us to walk you through every reasonable failure scenario you care about to demonstrate consistently high-availability service throughout.
The return on good engineering is loyal and savvy customers. They’ve helped us improve our product, and they’ve helped bring on even more loyal and savvy customers.
Preparing for failure is preparing for success.
Wednesday, 22 March, 2017: Three weeks after the AWS outage, another quiet day on call for ops support. Engineering did a bit of testing on the live system, but no operational alarms triggered. Another good day. The new customer has surpassed ¼ million monthly active users with us and is still ramping. I do love my job.
To learn more about API security and related topics, visit approov.io or follow @criitblue on Twitter.