About nine months ago, I set out to leave my six-year teaching career to become a Software Engineer. I attended a three-month programming bootcamp called Hackbright Academy, during which I learned not only the fundamentals of programming but, more importantly, what type of work excites me. I realized that I loved design. I loved data-model design, user experience design, architectural design, system design… The list goes on. I love design. Because of this, I thought the best place for me would be as a Front End Engineer. Boy, was I wrong.
At Hackbright, we spent four weeks working on a single project. As you might imagine, with such a small-scale project, I didn’t think much about reliability and scalability.
It was not until I began speaking to industry professionals, specifically several SREs, that I realized I was interested in large-scale systems and in making them scalable and reliable.
I now work as a Site Reliability Engineer (SRE) on the Monitoring Team at Dropbox and I absolutely love the work that I am doing. I get to work with my teammates to design large scale systems as well as the internal tools that monitor them!
Since I started at Dropbox, I have had the opportunity to speak with many career changers like me and share what I have learned over the past few months. The question that I am constantly asked is:
What is an SRE, and how do I become one?
This post is designed for those who are interested in learning more about SRE. I give a general explanation about the position, offer key questions to ask yourself before committing to applying to SRE roles and point to several resources to jumpstart your journey.
What is an SRE anyway?
To understand the answer to this question, it’s important that you learn a bit of history. Let’s talk about the traditional approach to system management. Prior to Google’s creation of the SRE position, System Administrators ran company operations.
What is a system administrator?
- A system administrator or ‘sysadmin’ is someone who is responsible for the configuration, upkeep and reliability of complex computing systems.
- They assemble software components (that are written by developers) and deploy them to produce a service.
- They monitor these services and respond to any events that affect them.
System Administrators worked on the “operations” side of things, whereas engineers worked on the “development” side of things.
What’s so bad about this approach?
According to the SRE Book, this approach caused division and conflict between developers and sysadmins. Because the two groups had different backgrounds, skills, and incentives, they used different vocabulary and thought about reliability very differently. Developers wanted new features to get out to users as quickly as possible, whereas the operations team members (sysadmins) wanted to avoid breaking anything. Google saw the problems with this approach and created the idea of “Site Reliability Engineering.”
So again, WHAT IS AN SRE?
Ben Treynor, the creator of the position at Google, defines SRE in this interview as:
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”
Let me offer a concrete example of how SRE’s “new way” improves on the old one.
A few months ago, I had the opportunity to visit a data center just like the one you see pictured here. I toured several large, warehouse-sized rooms filled with thousands of machines. The magnitude of this space is remarkable.
Now let’s say that one of the servers in the data center went down and needed to be replaced. With the “old way,” a new server would be configured manually by a system administrator. What this means is that the sysadmin would manually make sure the new machine has the proper operating system, software, tags, etc. Now imagine that 1,000 servers need to be replaced. See where I am going with this? It would take forever, or the company would need a lot of sysadmins to do the labor.
Now consider the “new way” as described in this bullet point that I took from Dropbox’s Site Reliability Engineer Job posting:
“You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement.”
Without any human involvement.
In this example, an SRE would be responsible for writing the software that automates the server configuration process. Cool, right? This example really helped me to understand what an SRE truly is:
Site Reliability Engineer = Software Engineer + Systems Enthusiast
According to Tammy Butow, SRE Manager at Dropbox,
“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”
By eliminating human interaction through automation, SREs make systems more reliable. So essentially, an SRE’s job is to automate themselves out of a job.
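To make the server provisioning example a bit more concrete, here is a minimal sketch in Python of what "configure one machine by hand" turns into once it is written as code. Everything in it is made up for illustration: the `Server` class, the `provision` and `provision_all` functions, the operating system name, the package names and the `role` tag are all hypothetical, and this is not how Dropbox (or anyone else) actually provisions machines. It only shows the idea that the same steps can run for 1 server or 1,000 without extra human labor.

```python
from dataclasses import dataclass, field

# Hypothetical desired state for every machine in the fleet.
DESIRED_OS = "linux-lts"
DESIRED_PACKAGES = ("monitoring-agent", "ssh-server")


@dataclass
class Server:
    """A toy stand-in for a freshly plugged-in machine."""
    hostname: str
    os: str = ""
    packages: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)
    in_fleet: bool = False


def provision(server: Server, role: str) -> Server:
    """Apply the steps a sysadmin would otherwise perform by hand."""
    server.os = DESIRED_OS                    # install the operating system
    server.packages.update(DESIRED_PACKAGES)  # install required software
    server.tags["role"] = role                # tag the machine
    server.in_fleet = True                    # join the fleet
    return server


def provision_all(hostnames, role: str) -> list:
    """Run the same steps for every new machine."""
    return [provision(Server(name), role) for name in hostnames]


# Replacing 1,000 servers is now a single call instead of 1,000 manual jobs.
fleet = provision_all([f"server-{i:04d}" for i in range(1000)], role="storage")
print(len(fleet), all(server.in_fleet for server in fleet))
```

A real provisioning system would talk to actual hardware and handle failures and retries, but the shape is the same: the manual checklist becomes a function, and the function runs on every machine.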
But Krishelle, why do you think this is cool?
The reason I found this to be really cool is the same reason I decided to study math. Math allows you to use functions and rules to solve large-scale problems. One of my favorite lessons from when I was teaching is based on this problem:
“You own a landscaping business and one of your specialties is outdoor brick staircases. How many bricks would you need to bring if a customer ordered a 10-high stairwell? How many bricks would a customer need for a 20-high stairwell? How many bricks would a customer need for a 38-high stairwell?”
My students quickly realized that counting the bricks was an okay strategy for the smaller staircases. But as I increased the height all the way to a 100-high stairwell, they were forced to find another way. They realized that math can be used as a tool to solve large-scale problems without resorting to brute force. (In my Algebra 1 courses, I would get students to discover that they could use the formula n(n+1)/2, where n is the height of the staircase.)
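Here is a small sketch of the staircase problem in Python, comparing the counting strategy with the formula. It assumes the staircase implied by n(n+1)/2, where the first step uses 1 brick, the second uses 2, and so on up to n. The function names are my own, chosen just for this illustration.

```python
def bricks_by_counting(height: int) -> int:
    """Brute force: add up the bricks one step at a time."""
    total = 0
    for step in range(1, height + 1):
        total += step  # assumption: step k is built from k bricks
    return total


def bricks_by_formula(height: int) -> int:
    """Closed form: n(n+1)/2 for a staircase of height n."""
    return height * (height + 1) // 2


for height in (10, 20, 38, 100):
    # Both strategies must agree; the formula just skips the counting.
    assert bricks_by_counting(height) == bricks_by_formula(height)
    print(f"{height}-high stairwell: {bricks_by_formula(height)} bricks")
```

The counting version does more work as the staircase grows, while the formula takes the same single step whether the stairwell is 10 high or 100 high.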
Just as math is a tool for solving large-scale problems, in the world of computers, code is a tool for managing large-scale systems. It allows you to automate tasks through software and eliminate the need for manual human labor. Site Reliability Engineers are behind this work. They manage and automate these systems using their systems knowledge and their code, making the system more reliable with every bit.
How do I know if SRE is right for me?
This is a big question that comes up when I speak to job seekers considering pursuing SRE roles. I put together some important questions to ask yourself before you commit completely.
SRE Compatibility Quiz
- Do you like thinking about large scale problems that have a lot of moving parts?
- Do you like thinking about how to make large systems more reliable?
- Are you okay with working on software that will likely never be overtly seen by an external user?
- Do you enjoy looking at a terminal for large amounts of time?
- Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves system level problems that you cannot always see?
- Do you enjoy thinking about system information (e.g. disk space, cpu, os, kernel, etc.) and system level functionality (e.g. ssh, proc, cron, swaps, etc.)?
- Are you comfortable with the idea of being “on-call,” in which you are likely to be in a high-stakes scenario where something needs to be fixed?
- Are you able to stay calm under pressure?
- Do you approach problems in a logical, process-oriented way?
- Are you comfortable attempting a problem that has never been solved before?
- Are you someone who thinks about how you can make things better?
If you answered yes to at least 8 of these questions, SRE could be a good position for you. Read on to find more resources on SRE and a list of companies that offer SRE roles.
So I really want to be an SRE, now what?
There are many resources out there that are useful for learning more about SRE and for gaining the skills needed to land a role. Here are a few that I recommend starting with.
Understanding SRE Role and Responsibilities
Still trying to wrap your mind around what SRE means? Check out these resources:
🌐 Google’s SRE Resources — A website that contains Google’s definition of SRE, the transcript of an interview with the creator of the position, as well as other resources (including the online version of the SRE Book).
🎥 Keys to SRE — A talk given by the creator of the SRE role Ben Treynor of Google.
🎥 Site Reliability Engineers — Keeping Google up and running 24/7 — A Webinar with Google SREs.
🎥 Site Reliability Engineering at Dropbox — A talk given by Tammy Butow, SRE Manager at Dropbox.
🎥 Site Reliability Engineering at Netflix — A talk by Jonah Horowitz, SRE at Netflix.
🎥 Who/What is SRE? — A panel of SREs at SRECon16.
📰 Andrew Fong on Tackling the Full Stack — An interview with Andrew Fong, an SRE manager at Dropbox.
📰 Site Reliability Engineers: “We Solve Cooler Problems” — An article about SRE at Google.
📰 Love DevOps? Wait until you meet SRE — An article about SRE at Atlassian.
Companies that Hire SREs
Curious which companies out there hire SREs? Here are just a few:
- Facebook (called Production Engineer)
- Reddit (called DevOps Software Engineer)
…and many more! I have linked to sample job posts so that you can get a feel for what SRE means at the different companies.
Resources for an Aspiring SRE
Once you feel you have a handle on the definition of SRE, check out these resources to start expanding your skill set.
📰 Graduating from Bootcamp and interested in becoming a Site Reliability Engineer? — A comprehensive resource list that Tammy Butow and I put together for Bootcamp Grads interested in SRE roles.
📰 The Must Know Checklist For DevOps & Site Reliability Engineers — A list of skills and mindsets of an SRE.
🌐 Introduction to Distributed System Design — A Google Code University Course.
🌐 Awesome Site Reliability Engineering — An amazing curated list of SRE and Production Engineering resources by Pavlos Ratis. Definitely bookmark worthy if you become an SRE.
Since I did not have a lot of Linux experience when I started my Junior-level SRE role, I have included a separate section for the tutorials and tools that I have found useful for learning Linux fundamentals.
🎥 Eli the Computer Guy’s Linux Video Series — A series of awesome videos that breaks down key Linux concepts.
🌐 Unix/Linux Tutorial for Beginners — A step by step tutorial
🌐 LinuxCommand.org — A step by step tutorial.
Twitter Accounts and Blogs to Follow
- Julia Evans on Twitter and Blog
- Brendan Gregg
- Eli Bendersky
- Several Nines
- Philip Fisher-Ogden
- Tammy Butow
- Nora Jones
- Pavlos Ratis — Pavlos offers follow recommendations too!
- SRE Weekly
Dropbox Infrastructure Resources
In addition to the Dropbox specific resources covered in Tammy’s blog post, here are some others that made me super interested in working as an SRE at Dropbox:
About the author
Looking to make a Career Change but not sure where to start? Job Searching and not sure how to get organized? Check out these articles by Krishelle: