About 9 months ago, I set out to leave my six-year teaching career to become a Software Engineer. I attended a 3-month programming bootcamp called Hackbright Academy, during which I learned not only the fundamentals of programming but, more importantly, what type of work excites me. I realized that I loved design: data-model design, user experience design, architectural design, system design… The list goes on. I love design. Because of this, I thought the best place for me would be as a Front End Engineer. Boy, was I wrong.
Dropbox Office Library in San Francisco, 2017
At Hackbright, we spent four weeks working on a single project. As one might imagine, with such a small-scale project, I didn’t think much about reliability and scalability.
It was not until I began speaking to industry professionals, specifically several SREs, that I realized I was interested in thinking about large-scale systems and making them scalable and reliable.
I now work as a Site Reliability Engineer (SRE) on the Monitoring Team at Dropbox, and I absolutely love the work that I am doing. I get to work with my teammates to design large-scale systems as well as the internal tools that monitor them!
Since I started at Dropbox, I have had the opportunity to speak with many career changers like me and share what I have learned over the past few months. The question that I am constantly asked is:
What is an SRE, and how do I become one?
This post is designed for those who are interested in learning more about SRE. I give a general explanation about the position, offer key questions to ask yourself before committing to applying to SRE roles and point to several resources to jumpstart your journey.
To understand the answer to this question, it’s important to learn a bit of history. Let’s talk about the traditional approach to system management. Prior to Google’s creation of the SRE position, System Administrators ran company operations.
System Administrators worked on the “operations” side of things, whereas engineers worked on the “development” side of things.
According to the SRE Book, this approach caused division and conflict between developers and sysadmins. Because the two had different backgrounds, skills, and incentives, it meant that they had different vocabulary and thought about reliability very differently. Developers wanted new features to get out to users as quickly as possible whereas the operations team members (sysadmins) wanted to avoid breaking anything. Google saw the concerns with this approach and created the idea of “Site Reliability Engineering.”
Ben Treynor, the creator of the position at Google, defines SRE in this interview as:
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”
Example of a data center
Let me offer a concrete example of how SRE makes the “new way” even better.
A few months ago, I had the opportunity to visit a data center just like the one you see pictured here. I toured several large, warehouse-sized rooms filled with thousands of machines. The magnitude of this space is remarkable.
Now let’s say that one of the servers in the data center went down and needed to be replaced. With the “old way,” a new server would be configured manually by a system administrator. What this means is that the sysadmin would manually make sure the new machine has the proper operating system, software, tags, etc. Now imagine that 1,000 servers need to be replaced. See where I am going with this? It would take forever, or the company would need a lot of sysadmins to do the labor.
Now consider the “new way” as described in this bullet point that I took from Dropbox’s Site Reliability Engineer Job posting:
“You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement.”
Without any human involvement.
In this example, an SRE would be responsible for writing the software that automates the server configuration process. Cool, right? This example really helped me to understand what an SRE truly is:
Site Reliability Engineer = Software Engineer + Systems Enthusiast
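To make the contrast between the “old way” and the “new way” concrete, here is a minimal sketch of the idea in Python. It is a toy, in-memory model and not how Dropbox or any real provisioning system works: the `Server` class, the `provision` function, and the operating system, package, and tag values are all hypothetical names made up for illustration. The point is only that once the steps are written as code, replacing 1,000 servers costs the same human effort as replacing one.

```python
from dataclasses import dataclass, field

# Hypothetical desired state for every machine in the fleet (illustrative values only).
DESIRED_OS = "example-linux-1.0"
DESIRED_PACKAGES = ("monitoring-agent", "log-shipper")


@dataclass
class Server:
    """A toy stand-in for a freshly plugged-in machine."""
    hostname: str
    os: str = ""
    packages: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)
    in_fleet: bool = False


def provision(server: Server, role: str) -> Server:
    """Apply the same configuration steps a sysadmin would otherwise do by hand."""
    server.os = DESIRED_OS                    # install the proper operating system
    server.packages.update(DESIRED_PACKAGES)  # install the required software
    server.tags["role"] = role                # apply tags
    server.in_fleet = True                    # the machine joins the fleet
    return server


# Replacing 1,000 servers is one loop instead of 1,000 manual sessions.
new_servers = [Server(hostname=f"server-{i:04d}") for i in range(1000)]
fleet = [provision(server, role="storage") for server in new_servers]

print(sum(server.in_fleet for server in fleet))  # 1000
```

In a real data center each of those steps would talk to actual hardware and services, but the shape stays the same: describe the desired state once, then let software apply it to every machine.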
According to Tammy Butow, SRE Manager at Dropbox,
“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”
By eliminating human interaction through automation, SREs make systems more reliable. So essentially, an SRE’s job is to automate themselves out of a job.
The reason I found this to be really cool is the same reason I decided to study math. Math allows you to use functions and rules to solve large-scale problems. One of my favorite lessons from when I was teaching is based on this problem:
The Staircase Problem Visual Aid
“You own a landscaping business and one of your specialties is outdoor brick staircases. How many bricks would you need to bring if a customer ordered a 10-high stairwell? How many bricks would a customer need for a 20-high stairwell? How many bricks would a customer need for a 38-high stairwell?”
Gauss’ Trick for summing the numbers from 1 to n
My students quickly realized that counting the bricks was an okay strategy for the smaller staircases. But as I increased the height all the way to a 100-high stairwell, they were forced to find another way. They realized that math can be used as a tool to solve large-scale problems without a brute-force approach. (In my Algebra 1 courses, I would get students to discover that they could use the formula n(n+1)/2 for the staircase problem.)
Just as math is a tool for solving large-scale problems, in the world of computers, code is a tool for managing large-scale systems. It is a tool that allows for automating tasks through software and eliminating the need for manual human labor. Site Reliability Engineers are behind this work: they manage and automate these systems using their systems knowledge and their code, making the system more reliable with every bit.
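Here is a small sketch of the two strategies side by side, assuming (as the formula n(n+1)/2 implies) that step k of the staircase uses k bricks. The function names are my own, chosen for illustration.

```python
def bricks_by_counting(height: int) -> int:
    """Brute force: add up the bricks in each step, one step at a time."""
    total = 0
    for step in range(1, height + 1):
        total += step  # assumption: step k of the staircase uses k bricks
    return total


def bricks_by_formula(height: int) -> int:
    """Gauss' trick: the sum 1 + 2 + ... + n equals n(n+1)/2."""
    return height * (height + 1) // 2


for height in (10, 20, 38, 100):
    # Both strategies agree; the formula just skips the counting.
    assert bricks_by_counting(height) == bricks_by_formula(height)
    print(height, bricks_by_formula(height))
# 10 55
# 20 210
# 38 741
# 100 5050
```

Counting takes longer as the staircase grows, while the formula takes the same single calculation whether the staircase is 10 high or 10,000 high.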
This is a big question that comes up when I speak to job seekers considering pursuing SRE roles. I put together some important questions to ask yourself before you commit completely.
If you answered yes to at least 8 of these questions, SRE could be a good position for you. Read on to find more resources on SRE and a list of companies that offer SRE roles.
There are many resources out there that are useful to start learning more about SRE, as well as gain the skills needed to obtain a role. Here are a few that I recommend starting with.
Still trying to wrap your mind around what SRE means? Check out these resources:
🌐 Google’s SRE Resources — A website that contains Google’s definition of SRE, the transcript of an interview with the creator of the position, as well as other resources (including the online version of the SRE Book).
🌐 SRE Book Notes — If you are not ready to go out and spend $40-$50 on the SRE book, this is an awesome set of notes on each chapter of the book by Dan Luu.
🎥 — A talk given by the creator of the SRE role Ben Treynor of Google.
🎥 — A Webinar with Google SREs.
🎥 Site Reliability Engineering at Dropbox — A talk given by Tammy Butow, SRE Manager at Dropbox.
🎥 — A talk by Jonah Horowitz, SRE at Netflix.
🎥 — A panel of SREs at SRECon16.
📰 Andrew Fong on Tackling the Full Stack — An interview with Andrew Fong, an SRE manager at Dropbox.
📰 Site Reliability Engineers: “We Solve Cooler Problems” — An article about SRE at Google.
📰 Love DevOps? Wait until you meet SRE — An article about SRE at Atlassian.
Curious which companies out there hire SREs? Here are just a few:
…and many more! I have linked to sample job posts so that you can get a feel for what SRE means at the different companies.
Once you feel you have a handle on the definition of SRE, check out these resources to start expanding your skill set.
General Resources
📰 Graduating from Bootcamp and interested in becoming a Site Reliability Engineer? — A comprehensive resource list that Tammy Butow and I put together for Bootcamp Grads interested in SRE roles.
📰 The Must Know Checklist For DevOps & Site Reliability Engineers — A list of skills and mindsets of an SRE.
🌐 Introduction to Distributed System Design — A Google Code University Course.
🌐 Awesome Site Reliability Engineering — An amazing curated list of SRE and Production Engineering resources by Pavlos Ratis. Definitely bookmark worthy if you become an SRE.
Linux Resources
Since I did not have a lot of Linux experience when I started my Junior-level SRE role, I have included a separate section for the tutorials and tools that I have found useful for learning Linux fundamentals.
🎥 Eli the Computer Guy’s Linux Video Series — A series of awesome videos that breaks down key Linux concepts.
🌐 Unix/Linux Tutorial for Beginners — A step by step tutorial.
🌐 LinuxCommand.org — A step by step tutorial.
📋 Cheatography Linux Cheatsheet
Dropbox’s Magic Pocket
In addition to the Dropbox-specific resources covered in Tammy’s blog post, here are some others that made me super interested in working as an SRE at Dropbox:
📰 Scaling to exabytes and beyond
Krishelle is a former High School Math and Spanish Teacher turned Site Reliability Engineer at Dropbox. Read about her journey from education to tech and connect with her on Twitter and LinkedIn.