If you see a recurring issue in your production environment, what do you do? Surprisingly, a restart works most of the time. But if you are managing tens of services on hundreds of machines, with the code running in multiple data centers, even restarting can become a full-time maintenance job. I have seen teams managing microservices that have to keep 2–3 software engineers just to manage the service and restart servers from time to time.

Production issues are by no means rare, and maintaining services that have a lot of unknown production issues can be really, really painful. It can lead to all sorts of problems:

- People start to search for new jobs, which can create a feedback loop and make the problem even worse.
- Developers are constantly distracted by having to resolve issues.
- Working on the actual new task becomes hard.
- Over time, people and the team can become afraid to release new features.
- Managers stop trusting their team members and start to come up with a lot of processes to stop people from releasing things.

Okay, so now we have established that it's a serious problem. What's the solution? Basically, I think there are two problems with fixing production issues:

1. Prioritization
2. Technical challenges / debugging

I have seen that sometimes picking up the task to fix the issue can be a bigger problem than actually solving it, and I think there are multiple reasons for this:

1. Production issues come out of nowhere and disrupt planned tasks, so we tend to ignore them.
2. It is very hard to predict how much time it will take to fix the issue.
3. We somehow hope that it goes away by itself: "It just happened a few times."
4. Restarting again and again seems an easier option to many people.

Solving the above problems can be tricky. You have to convince your manager and team members to stop working on some task and pick up the investigation. But let's say you somehow got it prioritized. How do you solve it?
Over time I have learned that debugging, or any production investigation, is like a scientific process of problem-solving. There are a few things you have to keep in mind.

1. Before making any hypothesis about the issue, try to learn about the system around it. Do a lot of experimentation around it. Plot graphs if required, make notes, etc. This is critical because if you make a hypothesis or assumptions about the issue without knowing the system, it is very hard to step back from those assumptions and see the full picture.
2. Try to reproduce the issue in a minimum setting. This makes it faster to experiment and try things out. Eventually, it will help you validate the fix as well.
3. Validate simple hypotheses first. A lot of the time we come up with some complicated hypothesis that is also hard to validate, but the problem ends up being pretty simple.
4. A lot of times the issue is a known issue inside some library. Search on Stack Overflow, or post a new question there with all the details. You will be surprised by the insights you can get from just posting the question, because it forces you to construct the question as a minimum reproduction scenario. Of course, the Stack Overflow community is also great, so they can help you with the answers.
5. If you still can't find the root cause, give the issue to a coworker for a fresh perspective. Sometimes it's just very hard to step back from your own assumptions.
6. If you still can't find the issue, then document your main findings, find a workaround (it should be better than restarting every time the issue happens :P) and move on.

Learn from your mistakes and be better next time.

I see lots and lots of hours getting wasted in solving issues, so if this can help even a few people fix issues faster, then writing this is worth it.

If you liked the content, please clap (👏) and let me know what you think.
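As a rough illustration of the advice above about reproducing the issue in a minimum setting and using that reproduction to validate a fix, here is a small Python sketch. Everything in it is hypothetical: `make_flaky_operation` is a simulated stand-in for whatever code path you are investigating, and `failure_rate` is an illustrative helper, not part of any library. The idea is only that once the failure can be triggered in a tight loop, you can put a number on it and compare that number before and after a candidate fix.

```python
import random
from typing import Callable


def failure_rate(operation: Callable[[], None], attempts: int = 1000) -> float:
    """Call `operation` repeatedly and return the fraction of calls that raised."""
    failures = 0
    for _ in range(attempts):
        try:
            operation()
        except Exception:
            failures += 1
    return failures / attempts


def make_flaky_operation(seed: int, fail_probability: float) -> Callable[[], None]:
    """Hypothetical stand-in for the real code path under investigation.

    It fails at random with the given probability; the seed keeps runs repeatable.
    """
    rng = random.Random(seed)

    def operation() -> None:
        if rng.random() < fail_probability:
            raise TimeoutError("simulated intermittent failure")

    return operation


if __name__ == "__main__":
    # Assumed numbers, purely for illustration: 5% failures before the fix, none after.
    before = failure_rate(make_flaky_operation(seed=42, fail_probability=0.05))
    after = failure_rate(make_flaky_operation(seed=42, fail_probability=0.0))
    print(f"failure rate before fix: {before:.3f}")
    print(f"failure rate after fix:  {after:.3f}")
```

In a real investigation you would replace the simulated operation with the smallest piece of your own code that still shows the problem. A measured failure rate also helps with prioritization, since it is harder to wave away than "it just happened a few times".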

How to investigate and fix production issues?
