I’m a software engineer at Plaid on the Data Science & Infrastructure team. We support a myriad of services for both external clients and internal employees: we enable the product, growth, and engineering teams with tools like Redshift for large analytics queries and the ELK stack for real-time investigation across the many services powering Plaid.
Supporting these analytics & monitoring services is mission-critical: if they go down, dozens of workflows across the company break, so it’s important that we catch any degradation of service before it becomes an issue. Over the course of 2018 we’ve added dozens of new features to the services we support: new datasets, faster analytics capabilities, and more self-service data portals. We need to find issues before our users do:
We use PagerDuty to send real-time alerts when there is a degradation in service performance. Our team has an on-call rotation: four engineers take turns being the directly responsible individual for handling any service-performance alerts generated through PagerDuty. The week you’re on call, you use the PagerDuty app to receive any real-time DSI-Urgent alerts, and the PagerDuty Slack integration to route other messages to the DSI-Notify Slack channel.
PagerDuty slackbot — friend to all.
Towards the end of the year we realized that while we had done well adding new PagerDuty alerts to support SLAs for our services, we weren’t as disciplined about triaging and fixing the underlying long-term issues. In order to step back from treating the symptoms and work towards overall service quality, we decided to add an “on-call summary & handoff” digest to our weekly team meetings. This helped us discuss why we were being paged, iterate on services to improve reliability, and restructure alerts to reduce false alarms.
Trialing a formalized handoff document, with categories of alerts to facilitate trend analysis and comments on individual alerts for specific details.
This worked well as a pilot: we were able to better surface common issues, more easily identify patterns, and decouple the on-call cadence of addressing real-time issues from the longer-term work of making substantive changes.
It was not, however, our long-term vision to add manual work to the on-call rotation. PagerDuty offers a few analytics views, but we wanted finer-grained control over our data and the ability to join it to other analytics data sources. We set out to use the PagerDuty V2 API to pull the data that would help us identify patterns of incidents and supplant the manual tracking in our handoff document.
We use Apache Airflow for a wide range of recurring data ETL jobs, and we wrote a one-off Airflow DAG that downloads a history of the data from PagerDuty, writes it into S3, and then copies it from S3 to Redshift. You can see a sanitized version of the Python script in this gist.
Airflow DAG for ingesting PagerDuty data
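The gist has the real implementation; purely as an illustration of its shape, a minimal DAG along these lines (the DAG id, task ids, connection id, S3 path, IAM role, and helper function are all hypothetical, and the COPY statement assumes a JSON layout) chains the PagerDuty download to a Redshift load:

# Hypothetical sketch of a PagerDuty -> S3 -> Redshift DAG (Airflow 1.x style);
# see the gist for the actual implementation.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.postgres_operator import PostgresOperator

default_args = {
    "owner": "dsi",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "pagerduty_ingest",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)


def fetch_pagerduty_to_s3(**context):
    # Placeholder: page through the PagerDuty V2 API and write the results
    # to S3 (the API side is sketched further below).
    pass


fetch_incidents = PythonOperator(
    task_id="fetch_incidents_to_s3",
    python_callable=fetch_pagerduty_to_s3,
    provide_context=True,
    dag=dag,
)

# Load the freshly written S3 files into Redshift with a COPY statement,
# run over a Redshift connection configured in Airflow.
copy_to_redshift = PostgresOperator(
    task_id="copy_incidents_to_redshift",
    postgres_conn_id="redshift",  # hypothetical connection id
    sql="""
        copy pagerduty.incidents
        from 's3://example-bucket/pagerduty/incidents/'
        iam_role 'arn:aws:iam::123456789012:role/example-redshift-copy'
        format as json 'auto';
    """,
    dag=dag,
)

fetch_incidents >> copy_to_redshift

Keeping the download and the COPY as separate tasks means a failed Redshift load can be retried without re-hitting the PagerDuty API.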
We focused on two tables: incidents and log_entries, which together give us a history of which team got notified about which alert on which date. PagerDuty offers a one-month history of incidents through the API. Because a user can modify an incident well after the original date of the alert, we re-harvest the entire 30-day incident history and update any modified data. By contrast, log_entries is an append-only log, so we simply pull new data at a regular cadence and copy it into Redshift.
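To make that distinction concrete, here is a rough sketch (not the code from the gist) of the two fetch patterns against the PagerDuty V2 REST API; the since/limit/offset parameters and the more flag come from the public API, while the token and the bookmark handling are placeholders:

# Sketch: pull a rolling 30-day window of incidents, and only new log entries,
# from the PagerDuty V2 REST API.
from datetime import datetime, timedelta

import requests

API_BASE = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=YOUR_PAGERDUTY_API_TOKEN",  # placeholder
    "Accept": "application/vnd.pagerduty+json;version=2",
}


def fetch_all(endpoint, params):
    """Page through a PagerDuty list endpoint using offset pagination."""
    results, offset = [], 0
    while True:
        resp = requests.get(
            API_BASE + "/" + endpoint,
            headers=HEADERS,
            params=dict(params, limit=100, offset=offset),
        )
        resp.raise_for_status()
        body = resp.json()
        results.extend(body[endpoint])
        if not body.get("more"):
            return results
        offset += 100


# incidents: re-harvest the full rolling 30-day window on every run, since
# an incident can still be modified well after it was created.
since = (datetime.utcnow() - timedelta(days=30)).isoformat()
incidents = fetch_all("incidents", {"since": since})

# log_entries: append-only, so only fetch entries newer than the most recent
# timestamp already loaded into Redshift (tracked elsewhere; placeholder here).
last_loaded_at = "2018-12-01T00:00:00Z"
log_entries = fetch_all("log_entries", {"since": last_loaded_at})

In Redshift, the incidents load then replaces the overlapping 30-day window, while log_entries is a straight append.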
Finally, with the data in Redshift we can create an automated version of our handoff document! Here’s an example of how we query the final schema to pull insights out of the incident and notification history:
-- categorize alerts, and count by category + date
-- takes a rolling five day history
select
  date_trunc('day', (created_at)::timestamp)::date,
  case
    when incidents.summary like '%airflow heartbeat%' then 'Airflow Heartbeat'
    when incidents.summary like '%mysql_dump%' then 'MysqlDump'
    when incidents.summary like '%TaskInstance: dump_%' then 'MongoDump'
    when incidents.summary like '%ES%not green%' then 'ES not green'
    when incidents.summary like '%queue is building%' then 'Logstash Queue'
    when incidents.summary like '%kinesis%behind%' then 'Kinesis Behind'
    when incidents.summary like '%TaskInstance%' then 'Airflow - Other'
    -- remove numbers to more easily group alerts
    else regexp_replace(incidents.summary, '[0-9]', '')
  end as summary,
  count(1)
from
  pagerduty.incidents
where
  abs(datediff('day', getdate(), created_at)) < 5
  and (escalation_policy_summary = 'DW-DSI')
group by
  1, 2
order by
  3 desc
Which leads us to the grand conclusion: a Periscope dashboard where we can see the number of alerts, break them down by category and service, and track trends over time.
The final dashboard in Periscope, summarizing alerts over the past 7 days.
Our handoff doc is a little cleaner now: rather than having to manually type out reports about individual alerts, we can summarize the week from our dashboard and jump straight to the collective and individual actions we want to take to build resilient and reliable data infrastructure.
New handoff document, which focuses on concrete Action Items
This new process has helped us prioritize and fix noisy alerts and persistent incidents. As a bonus, it helps our users surface and escalate lingering issues that had been masked by the cadence of on-call alerts, and lets us focus on other important issues:
The good case: our internal users attacking the real issues.