Troubleshooting on an engineering team is an art, not a science. More importantly, it is a team effort. As a company scales from a five-person engineering team to 50, or from twelve customers to 1,200, troubleshooting becomes far more complex. Larger engineering teams, dispersed operations, service dependencies and more customers all make it harder to keep a service reliable. The process starts out simple: you fix problems by getting your team into a room together. As you grow, it balloons into solving multiple concurrent problems that span broad product surface areas, several services and external dependencies, across a large, dispersed team.
Throughout their growth stages, successful companies have figured out how to be profitable while maintaining reliability, uptime and service performance, but it’s certainly not easy. Many of the challenges companies face are not unique. So, in addition to having the right testing, synthetic transactions, fine-tuned monitoring and alerting, on-call rotations and change control processes in place, there are a few lessons Okta’s engineering team has learned along the way, especially as we evolved our troubleshooting strategy. Hopefully you’ll be able to incorporate some of these practices and ideas into your own team.
There Are Two Types of Engineers
After years of hands-on, technical experience, I’ve found that the art of troubleshooting begins with finding the right combination of “builders” and “doctors” on an engineering team: two kinds of engineers who approach problems differently. This combination helps organizations operate at peak performance during a crisis. Variety means team members will look at the system from different perspectives and attack a problem from multiple angles. A varied approach matters for many reasons, but chiefly because it speeds up resolution and sets the stage for preventing the problem from happening again.
While any robust team needs the typical engineer mindset, meaning individuals who are able to uncover the root cause of problems, that’s not all it takes to be successful. A team of passionate engineers will still fail to prevent future issues if it isn’t made up of the right mix of builders and doctors. So, what do I mean by builders and doctors, and what are the differences between the two?
The Differences Between “Builders” and “Doctors”
Builders and doctors, in the engineering world, differ most distinctly in the way they approach their work and the situations they face. Here are five scenarios to demonstrate this point:
How to Master the Art of Troubleshooting
Now that we’ve established the types of engineers that are required to be on a team, it’s important to lay out a standardized approach and define a problem-solving culture that will help handle troubleshooting efforts across those teams. Here are some things that will help you maximize your team’s troubleshooting efforts:
Identification: You should look at your team (including yourself) to identify behaviors and combine skills that will provide an effective mix of builders and doctors. Doctors tend to lose interest in non-urgent tasks and are more engaged during emergencies when the heart monitor is beeping furiously. Builders, on the other hand, enjoy the holistic approach of looking at the whole picture and clearly identifying what happened during an event.
Balance: At a minimum, you need one builder and one doctor per team, but depending on how much control you have over your environment, you may need to tweak the ratio. The more control you have (fewer external dependencies), the more builders you should have. The less control you have (big dependencies on external systems), the more doctors you’ll want. Assembling the right team for the right situation is important, right down to the breakdown of builders and doctors. You want not only brain surgeons with laser-focused precision, but also general practitioners who can do everything from fixing minor bugs to full-on crisis control.
Speak up and challenge: Train your junior engineers to speak up when they don’t believe a hypothesis from a senior engineer, or even a more established theory. Suppressing ideas should be strongly discouraged for engineers at every level. One of the simplest ways to encourage this is to ask every member of your team: “What do you think?” Or, have team members focus on quickly proving or disproving any theory. This is a good mechanism for weeding out random theories that can cause your team to spin out of control.
Narrow the focus: Aim to create an interruption-shielding program, so that your team can focus on troubleshooting when required and not get pulled in multiple directions by other members of your organization. One way to do this is to dedicate an incident response manager to keeping the team focused. Without this individual, when a database crashes, some people start looking at the logs and focusing on forensics, which does not help the people who are trying to recover the system. Engineers already in forensics mode need to be told to wait. The firefighters need to put out the fire before the investigators can jump in to figure out the cause.
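For teams that track incidents in tooling, the "firefighters first, investigators second" rule can be pictured as a small gate in front of the responders. The sketch below is only an illustration under assumptions: the `Incident` and `IncidentManager` names, the two request kinds and the returned strings are all hypothetical, and it does not describe any tool Okta actually uses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; every name here is illustrative."""
    name: str
    recovered: bool = False
    deferred_forensics: deque = field(default_factory=deque)


class IncidentManager:
    """Shields responders: forensics requests wait until recovery is done."""

    def __init__(self, incident):
        self.incident = incident

    def request(self, kind, task):
        # Recovery work always goes straight through.
        if kind == "recovery":
            return f"run now: {task}"
        # Forensics is deferred while the system is still down.
        if not self.incident.recovered:
            self.incident.deferred_forensics.append(task)
            return f"deferred: {task}"
        return f"run now: {task}"

    def mark_recovered(self):
        # Once the fire is out, release the queued investigation work in order.
        self.incident.recovered = True
        released = list(self.incident.deferred_forensics)
        self.incident.deferred_forensics.clear()
        return released


# Example: a forensics request arrives before the system is recovered.
manager = IncidentManager(Incident("database crash"))
manager.request("forensics", "read crash logs")      # queued, not run
manager.request("recovery", "fail over to replica")  # runs immediately
manager.mark_recovered()                             # releases the queued work
```

The point of the design is that nobody's investigation is thrown away; it is simply held until recovery is complete, which is the role the incident response manager plays in person.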
In the end, finding the right combination of personalities will give your team an invaluable advantage. Your doctors will acquire builder techniques at times and vice versa, so even after building a team with both skill sets, it is still important to keep helping your team develop new and complementary skills. Troubleshooting is hard for an engineering team, and it only gets more difficult as your organization grows and scales, but it can also be an enjoyable engineering challenge. Once you find the right blend of builders and doctors, watching your team in action will be an amazing experience!