Like any good startup, Betswaps (www.betswaps.com) hacked together a prototype of their vision to test out their assumptions. A few developers later they launched the product and validated the demand for their service. To get working capital to accelerate growth, Michael went on Shark Tank Australia. The only catch with being on Shark Tank is that it comes with a spike of traffic, all of which hits in the space of about a minute. We were told to prepare for a spike of up to 10,000 users in the first minute of being live.

That's when I got involved.

With huge tech debt and limited time, our only focus was: what will move the needle so we can survive the spike?

A quick inspection of the code revealed:

- no caching
- no CDN
- no tests
- missing database indexes
- highly coupled code
- a 7 second server response time

To get visibility into why everything was so slow, we turned on the framework's profiler and set up New Relic. Digging deeper revealed:

- data structures that required lots of queries
- the MVC pattern of the framework wasn't being followed correctly
- unused code and libraries that had never been removed
- undocumented code that several different people had worked on

To move the needle on the site's performance, we decided not to change:

- the process to import events
- the data structures
- the UI and UX
- the JavaScript and CSS

With the scope set, so it begins.

Step 1 — Minimize database calls

The first task was to make the homepage more responsive: the server response time went from 7,000 ms to 133 ms. This was achieved by batching the database calls. The profiler's breakdown table showed that 3,330 queries were being made to build up the category tree available on the site. These queries built the dropdowns that let you filter 100,000+ markets by:

- category (e.g. soccer)
- event (e.g. the Super Bowl)
- market category (the things in an event that you can bet on)
- market (the things you can bet on in a market category)

The work mostly consisted of:

- moving the database calls into models
- batching the database calls
- writing transformers to output the data formatted for the views

Once the process of refactoring the code was established, it was a case of rinse and repeat for each page.

Step 2 — Add caching

The next easy win was to add caching to the database calls, as most of the data changes infrequently; with appropriate caching we could greatly reduce the number of reads that need to be done on the DB server. (A short sketch of this batch-then-cache pattern follows step 3.)

[Graph: database queries during the Shark Tank spike]

Now that the server response times looked more respectable, it was time to move on to load testing.

Step 3 — Network I/O bottleneck

We used loadimpact.com to do the load testing, with a simple process of: stress it, break it, fix it, repeat.

The first bottleneck hit at 750 virtual users. The server graphs showed CPU, memory and disk I/O all had excess capacity, yet the transaction time spiked significantly even though the server profile was very similar to the previous two tests. The bottleneck was the amount of data being sent back from the server: the AWS m3.xlarge instance has a bandwidth limit of 62.5 Mbps, and the JS, CSS and images were all being served from the application server. Moving the assets to S3 and putting CloudFront in front of them solved the issue.
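Before moving on, here is the batch-then-cache pattern from steps 1 and 2 as a minimal PHP sketch. It is illustrative only: the `categories` table layout, the cache key and the Memcached wiring are assumptions for this post, not the actual Betswaps code.

```php
<?php
// Illustrative sketch only: the table layout, cache key and client wiring
// are assumptions for this post, not the real Betswaps code.

// Step 1 in miniature: instead of one query per node while walking the
// category tree (the N+1 pattern behind the 3,330 queries), fetch every
// row in a single query and assemble the tree in memory.
function buildCategoryTree(PDO $pdo): array
{
    $rows = $pdo->query(
        'SELECT id, parent_id, name FROM categories'
    )->fetchAll(PDO::FETCH_ASSOC);

    // Index the rows by id, then attach each node to its parent.
    $nodes = [];
    foreach ($rows as $row) {
        $nodes[$row['id']] = $row + ['children' => []];
    }

    $tree = [];
    foreach ($nodes as $id => &$node) {
        if ($node['parent_id'] !== null && isset($nodes[$node['parent_id']])) {
            $nodes[$node['parent_id']]['children'][] = &$node;
        } else {
            $tree[] = &$node; // no parent: it's a root category
        }
    }
    unset($node);

    return $tree;
}

// Step 2 in miniature: the data changes infrequently, so cache the
// assembled tree and rebuild it only on a miss.
function cachedCategoryTree(PDO $pdo, Memcached $cache, int $ttl = 300): array
{
    $tree = $cache->get('category_tree');
    if ($cache->getResultCode() === Memcached::RES_NOTFOUND) {
        $tree = buildCategoryTree($pdo);
        $cache->set('category_tree', $tree, $ttl);
    }
    return $tree;
}
```

The payoff of the second function is that a cache miss costs one query instead of thousands, and a cache hit costs none.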
Step 4 — Hidden DB calls in the views

Redbean was initially used as the ORM, and it doesn't show its database calls in the framework's profiler; these extra calls only became evident under load in New Relic. They created two problems: they hammered the database, and they left persistent database connections open. Why there were DB calls in the views at all is unclear. The solution was to edit the database connection settings so that connections weren't persistent, move the calls into the appropriate models, and cache the remaining uncached calls.

Step 5 — Template engine

The next bottleneck was the template engine. Running the same load test before and after turning on the template engine's cache made the difference obvious: with template caching the response is almost instant. This was diagnosed by looking at the details of the transaction traces.

Showtime!

The final steps before airing involved warming up the cache and tweaking the autoscaling settings. To avoid a thundering herd from all of the cache values expiring at the same time, each cache TTL was set to a random value between four and five minutes just for the Shark Tank airing (sketched at the end of this post); the TTL was dialed back after the spike had passed.

The end result was that the site weathered the Shark Tank spike and the response times from the server stayed fast. The main peak in traffic corresponds to the airtime in Melbourne, Sydney and Brisbane, followed half an hour later by Adelaide and an hour and a half later by Perth.

Final thoughts

When contemplating just getting it done, don't forget about velociraptors. https://xkcd.com/292/
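P.S. The cache-jitter trick from the Showtime section is small enough to show in full. A minimal sketch, again assuming Memcached; the key name and server address are illustrative assumptions, not the production setup.

```php
<?php
// Illustrative sketch of the thundering-herd mitigation: give every cache
// entry a random TTL between four and five minutes so the entries don't
// all expire (and all hit the database) at the same moment.
function storeWithJitter(Memcached $cache, string $key, $value): bool
{
    $ttl = random_int(240, 300); // four to five minutes, in seconds
    return $cache->set($key, $value, $ttl);
}

// Warming the cache before airtime means the first wave of viewers hits
// the cache, not the database. Server address and key are assumptions.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
storeWithJitter($cache, 'category_tree', ['warmed' => true]);
```

After the spike has passed, the jitter window can simply be widened or the TTL set back to a fixed, longer value.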