Jupyter Notebook is a fantastic tool for data exploration. It combines markdown text, executable code, and output in a single document served in a browser. While Jupyter is great for data science, I’m going to demonstrate Notebook for a completely different use case: the DevOps runbook or, simply put, a way to respond quickly to system outages.
Imagine you are spending an evening with your loved one and suddenly see a flurry of Slack/pager alerts about your API latency climbing. It’s all downhill from there. You get online and check all the usual suspects: recent deployments, dependent services, load balancer, incoming traffic, database and so on. You jump from terminal to AWS console to NewRelic to conference call and whatnot. Let’s just say the whole experience is stressful until you find and fix the issue.
More mature organisations maintain runbooks for incident response. A runbook outlines the steps to be followed and takes the guesswork out of debugging. First, let’s see some challenges with the current form of runbooks:
To tackle some of these problems, I am proposing the use of Jupyter Notebooks for writing runbooks. Here’s how your API latency debugging session might look inside a Notebook environment (please watch this video in full screen mode, original YouTube link if needed).
Incident response with Jupyter Notebook
You can pull in graphs, check deployment times, roll back changes, run SQL queries and shell scripts, and SSH into servers, all from within a Notebook.
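To make this concrete, here is a minimal sketch of what one runbook cell could contain: a shell command followed by a SQL query, both run from Python. Everything in it is an assumption for illustration. The helper names `run_shell` and `slow_requests`, the `requests` table, and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real production database.

```python
import sqlite3
import subprocess
import sys


def run_shell(cmd):
    """Run a command (given as a list of arguments) and return its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def slow_requests(conn, threshold_ms):
    """Return (path, latency_ms) rows above the threshold, slowest first."""
    cursor = conn.execute(
        "SELECT path, latency_ms FROM requests "
        "WHERE latency_ms > ? ORDER BY latency_ms DESC",
        (threshold_ms,),
    )
    return cursor.fetchall()


# Hypothetical stand-in for a production database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (path TEXT, latency_ms INTEGER)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("/users", 120), ("/orders", 950), ("/search", 2300)],
)

# Step 1: run a shell command (here a trivial Python one-liner as a placeholder
# for a real check such as listing recent deployments).
shell_output = run_shell([sys.executable, "-c", "print('deploy check ok')"])
print(shell_output)

# Step 2: query for endpoints slower than 500 ms.
slow = slow_requests(conn, 500)
for path, latency_ms in slow:
    print(f"{path}: {latency_ms} ms")
```

In a real runbook, the placeholder command and the sample table would be replaced by your own deployment tooling and database connection, with the markdown cells around the code explaining when to run each step.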
Here are some benefits of maintaining your runbooks in an executable Notebook format.
The executable Notebook format is promising, but here are some challenges with the current Jupyter implementation.
I’m building Nurtch, a platform that tackles these challenges and provides an easy way to write and share executable runbooks within a team. The docs provide a complete overview of Nurtch capabilities and how-tos. Let me know what you think of this approach to managing runbooks.
Originally published at blog.amirathi.com on March 27, 2018.