Awful to Awesome: A Software Development Story

by Joshua Kerievsky, November 14th, 2017

When Stas Zvinyatskovsky, Ed Kraay and team began to pull Yahoo!’s software development out of the stone ages, teams averaged 20 weeks to build, test, stabilize and deploy large software releases. Every release was accompanied by painful, stressful production incidents and programmers worked overtime, fixing defects, handling emergencies and just trying to survive. One developer said, “We work really hard but we don’t know if we’re making a difference.”

Yahoo!’s competitors were running circles around them and something had to be done. Two years later, Yahoo! teams released to production daily or weekly and production incidents were almost non-existent. It was a joyous end to a painful journey that Stas and Ed described as pulling an elephant out of a tar pit.

The tar pit at Yahoo! was a software development process that was the opposite of agile. Programmers didn’t write unit tests, which led to numerous defects leaking into production. Programmers worked in branches for long periods of time, isolated from others’ code changes, which made integration slow and hard later. A manual build process was scheduled by a manager and would often break when code wouldn’t compile. After a successful build was finally produced, quality assurance people would find serious defects in it. Programmers fixed defects, new builds were attempted and the cycle repeated. The work was so slow and painful that most teams barely managed 3 releases per year.

Continuous deployment is the safe, automatic deployment of frequent, small code changes, enabling the release of fine-grained changes to production rather than far riskier big-batch releases. Doing continuous deployment is table stakes in Silicon Valley today because releasing small changes frequently allows you to incrementally improve your software, rapidly fix defects, learn quickly from experiments in production and deliver happiness to customers faster.

Back when Yahoo! was stuck in the tar pit, they were nowhere close to continuous deployment. They had to make some serious changes to get there and it wasn’t obvious what to do. “It took us many attempts and many failures to find our way out of the tar pit, but we found our way out,” Stas said.

An early experiment involved creating an “end-to-end” environment that would simulate production and allow people to test releases there. The trouble with this approach was that it led programmers to perform only superficial testing on their code, because they figured the real testing would happen in the end-to-end environment. As a result, poor quality code was pushed to the end-to-end environment and that code broke the environment when it failed to work correctly.

They tried shifting from working together on one big release to working separately on features but this led to a complex mess of feature branches, requiring lots of painful merges to handle dependencies on shared code.

With every change attempted, they were still only releasing 3 times a year with poor quality.

They attempted to go faster, got a release ready, rolled it out successfully to 1% of production machines, then 5% of production machines and then 100% of production machines, whereupon it crashed. Yahoo!’s best engineers worked to find the defect, which only happened in production (those are the hardest defects to fix), fix the defect and then try another release. The same cycle happened, leading to a release that had to be rolled back. It took 7 attempts and 3 months (burning valuable time from the company’s best engineers) to finally get a good release into production. This was so utterly painful that management was open to anything that could genuinely help.
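For readers who have not seen a staged rollout, here is a minimal Python sketch of the 1%, 5%, 100% pattern described above. It is an illustration only, not Yahoo!’s tooling: `staged_rollout`, `deploy`, `is_healthy` and `rollback` are hypothetical names, and the health check simply simulates a defect that appears only at full production load.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to growing percentages of machines; roll back on the first failure."""
    for percent in stages:
        deploy(percent)
        if not is_healthy(percent):
            rollback()
            return False
    return True


# Hypothetical stand-ins: record each action and simulate a defect
# that only shows up once 100% of machines run the release.
log: List[str] = []
succeeded = staged_rollout(
    stages=[1, 5, 100],
    deploy=lambda percent: log.append(f"deploy {percent}%"),
    is_healthy=lambda percent: percent < 100,
    rollback=lambda: log.append("rollback"),
)
```

The sketch shows why the early stages gave false confidence: a defect that only appears at full load passes every smaller stage, so the rollback happens after the release has already reached every machine.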

Changing Everything

Most of the defects that were happening in production could have been caught by unit testing. Stas and Ed had a solution and they were finally invited to try it. In January of 2011, a team agreed to do continuous integration for real (always keeping the build working) and unit testing for any and all code changes. When they released to production 6 months later, there were no rollbacks!

It was a small ray of light. Continuous integration had made a huge difference! Management decided that everyone needed to be doing it and mandated it across all of engineering. But they continued to struggle. Quality problems persisted and they failed to get more than 3 releases per year.

This is when they decided that they needed to fundamentally change the way they worked at Yahoo! Instead of having a goal to be agile or implement continuous integration, they realized their real goal was to “move fast to delight customers.”

To do that, they reasoned that they could no longer work in silos, one for product management, one for engineers, one for quality assurance, one for release engineering, where each silo threw work over the wall to the next silo. Instead, they formed into genuine cross-functional communities of 40–50 people, the kind that could move from product idea to production without ever asking any other team or department for help. Inside of these communities they organized into small groups, but the community was responsible for delivering delight to customers. They threw out their expensive end-to-end environment and committed to implementing continuous deployment: if you check code in, and it passes the automated build, it automatically goes to production without any human intervention. To perform continuous deployment well, they re-committed to continuous integration. They also hired my company, Industrial Logic, to help train engineers in test-driven development and refactoring and train quality assurance people to become quality engineers by learning how to collaborate on features, craft acceptance criteria and automate acceptance tests.
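The continuous deployment rule above (check in, pass the automated build, go to production with no human step) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline Yahoo! built: `Commit`, `process_commit` and the `deploy` callback are hypothetical names, and `build_passed` stands in for the result of a real automated build and test run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Commit:
    sha: str
    build_passed: bool  # assumed result of the automated build, tests included


def process_commit(commit: Commit, deploy: Callable[[str], None]) -> bool:
    """Deploy a commit only if its automated build passed; no manual approval."""
    if not commit.build_passed:
        return False  # a failed build never reaches production
    deploy(commit.sha)
    return True


# Hypothetical check-ins: the middle one fails its build.
deployed: List[str] = []
commits = [Commit("a1", True), Commit("b2", False), Commit("c3", True)]
results = [process_commit(commit, deployed.append) for commit in commits]
```

The only gate in this sketch is the build result, which is why the approach depends so heavily on continuous integration and thorough automated tests: they are the sole safeguard between a check-in and production.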

And this was the giant breakthrough they needed all along to delight customers and make working at Yahoo! fun again. The number of releases to production trended sharply up while the number of production incidents trended steeply down. Managers and employees across the entire 1,000-person, geographically split organization loved the changes and customers were delighted. It began to become normal to have daily releases and close to zero production incidents.

Yahoo! pulled themselves out of the software development tar pit through patience, trial and error and sheer determination to improve and compete. Before they began the journey, some developers had said “continuous deployment was crazy” and vowed they would never do it. Now they said they would never go back to the awful, older way of working.

Note: This story was originally told by my friend Stas. You can watch his video here: https://www.youtube.com/watch?v=C-Itdv1p11A