So you want to be an SRE?

About 9 months ago I set out to leave my teaching career of six years to pursue a career as a Software Engineer. I attended a 3 month programming bootcamp called Hackbright Academy, during which I learned not only the fundamentals of programming but, more importantly, the fundamentals of what type of work excites me. I realized that I loved design. I loved data-model design, user experience design, architectural design, system design… The list goes on. I love design. Because of this, I thought the best place for me would be as a Front End Engineer. Boy, was I wrong.

[Image: Dropbox Office Library in San Francisco, 2017]

At Hackbright, we spent four weeks working on one singular project. As one might imagine, with such a small scale project, I didn’t think a ton about reliability and scalability. It was not until I began speaking to industry professionals, specifically several SREs, that I realized I had an interest in thinking about large scale systems and making them scalable and reliable.

I now work as a Site Reliability Engineer (SRE) on the Monitoring Team at Dropbox and I absolutely love the work that I am doing. I get to work with my teammates to design large scale systems as well as the internal tools that monitor them!

Since I started at Dropbox, I have had the opportunity to speak with many career changers like me and share what I have learned over the past few months. The question that I am constantly asked is: What is an SRE, and how do I become one?

This post is designed for those who are interested in learning more about SRE. I give a general explanation of the position, offer key questions to ask yourself before committing to applying to SRE roles, and point to several resources to jumpstart your journey.

What is an SRE anyway?

To understand the answer to this question, it’s important that you learn a bit of history. Let’s talk about the traditional approach to system management. Prior to Google’s creation of the SRE position, System Administrators ran company operations.

What is a system administrator?

A system administrator or ‘sysadmin’ is someone who is responsible for the configuration, upkeep and reliability of complex computing systems. They assemble software components (written by developers) and deploy them to produce a service. They monitor these services and respond to any events that occur with them. System Administrators worked on the “operations” side of things, whereas engineers worked on the “development” side of things.

What’s so bad about this approach?

According to the SRE Book, this approach caused division and conflict between developers and sysadmins. Because the two groups had different backgrounds, skills, and incentives, they used different vocabulary and thought about reliability very differently. Developers wanted new features to get out to users as quickly as possible, whereas the operations team members (sysadmins) wanted to avoid breaking anything. Google saw the concerns with this approach and created the idea of “Site Reliability Engineering.”

So again, WHAT IS AN SRE?

Ben Treynor, the creator of the position at Google, defines SRE in this interview as:

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”

[Image: Example of a data center]

Let me offer a concrete example of how SRE makes the “new way” even better. A few months ago, I had the opportunity to visit a data center just like the one pictured here. I toured several large warehouse sized rooms filled with thousands of machines. The magnitude of this space is remarkable.

Now let’s say that one of the servers in the data center went down and needed to be replaced. With the “old way,” a new server would be configured manually by a system administrator. What this means is that the sysadmin would make sure the new machine has the proper operating system, software, tags, etc. Now imagine that 1,000 servers need to be replaced. See where I am going with this? It would take forever, or the company would need a lot of sysadmins to do the labor manually.

Now consider the “new way” as described in this bullet point that I took from Dropbox’s Site Reliability Engineer job posting:

“You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement.”

Without any human involvement. In this example, an SRE would be responsible for writing the software that automates the server configuration process. Cool, right? This example really helped me to understand what an SRE truly is:

Site Reliability Engineer = Software Engineer + Systems Enthusiast

According to Tammy Butow, SRE Manager at Dropbox,

“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”

By eliminating human interaction through automation, SREs make systems more reliable. So essentially, an SRE’s job is to automate themselves out of a job.

But Krishelle, why do you think this is cool?

The reason I found this to be really cool is the same reason I decided to study math. Math allows you to use functions and rules to compute large scale problems. One of my favorite lessons from when I was teaching is based on this problem:

[Image: The Staircase Problem Visual Aid]

“You own a landscaping business and one of your specialties is outdoor brick staircases. How many bricks would you need to bring if a customer ordered a 10-high stairwell? How many bricks would a customer need for a 20-high stairwell? How many bricks would a customer need for a 38-high stairwell?”

[Image: Gauss’ Trick for summing the numbers from 1 to n]

My students quickly realized that counting the bricks was an okay strategy for the smaller staircases. But as I increased the height all the way to a 100-high stairwell, they were forced to find another way. They realized that math can be used as a tool to calculate large scale problems, avoiding a brute force approach. (In my Algebra 1 courses, I would get students to discover they could use the formula n(n+1)/2 for the staircase problem.)

Just as math is a tool for solving large scale problems, in the world of computers, code is a tool for managing large scale systems. It is a tool that allows for automating tasks through software and eliminating the need for manual human labor. Site Reliability Engineers are behind this work: they manage and automate these systems using their systems knowledge and their code, making the system more reliable with every bit.

How do I know if SRE is right for me?

This is a big question that comes up when I speak to job seekers considering pursuing SRE roles. I put together some important questions to ask yourself before you commit completely.

SRE Compatibility Quiz

Do you like thinking about large scale problems that have a lot of moving parts?
Do you like thinking about how to make large systems more reliable?
Are you okay with working on software that will likely never be overtly seen by an external user?
Do you enjoy looking at a terminal for large amounts of time?
Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves system level problems that you cannot always see?
Do you enjoy thinking about system information (e.g. disk space, cpu, os, kernel, etc.) and system level functionality (e.g. ssh, proc, cron, swaps, etc.)?
Are you comfortable with the idea of being “on-call,” where you are likely to be in a high-stakes scenario in which something needs to be fixed?
Are you able to stay calm under pressure?
Do you approach problems in a logical, process-oriented way?
Are you comfortable attempting a problem that has never been solved before?
Are you someone who thinks about how you can make things better?

If you answered yes to at least 8 of these questions, SRE could be a good position for you. Read on to find more resources on SRE and a list of companies that offer SRE roles.

So I really want to be an SRE, now what?

There are many resources out there that are useful for learning more about SRE, as well as for gaining the skills needed to obtain a role. Here are a few that I recommend starting with.

Understanding SRE Role and Responsibilities

Still trying to wrap your mind around what SRE means? Check out these resources:

🌐 Google’s SRE Resources - A website that contains Google’s definition of SRE, the transcript of an interview with the creator of the position, as well as other resources (including the online version of the SRE Book).
🌐 SRE Book Notes - If you are not ready to go out and spend $40-$50 on the SRE book, this is an awesome set of notes on each chapter of the book by Dan Luu.
🎥 Keys to SRE - A talk given by the creator of the SRE role, Ben Treynor of Google.
🎥 Site Reliability Engineers: Keeping Google up and running 24/7 - A webinar with Google SREs.
🎥 Site Reliability Engineering at Dropbox - A talk given by Tammy Butow, SRE Manager at Dropbox.
🎥 Site Reliability Engineering at Netflix - A talk by Jonah Horowitz, SRE at Netflix.
🎥 Who/What is SRE? - A panel of SREs at SRECon16.
📰 Andrew Fong on Tackling the Full Stack - An interview with Andrew Fong, an SRE manager at Dropbox.
📰 Site Reliability Engineers: “We Solve Cooler Problems” - An article about SRE at Google.
📰 Love DevOps? Wait until you meet SRE - An article about SRE at Atlassian.

Companies that Hire SREs

Curious which companies out there hire SREs? Here are just a few:

Dropbox
Google
Netflix
GitHub
Atlassian
Stripe
Pinterest
Facebook (called Production Engineer)
Reddit (called DevOps Software Engineer)
…and many more!

I have linked to sample job posts so that you can get a feel for what SRE means at the different companies.

Resources for an Aspiring SRE

Once you feel you have a handle on the definition of SRE, check out these resources to start expanding your skill set.

General Resources

📰 Graduating from Bootcamp and interested in becoming a Site Reliability Engineer? - A comprehensive resource list that Tammy Butow and I put together for bootcamp grads interested in SRE roles.
📰 The Must Know Checklist For DevOps & Site Reliability Engineers - A list of skills and mindsets of an SRE.
🌐 Introduction to Distributed System Design - A Google Code University course.
🌐 Awesome Site Reliability Engineering - An amazing curated list of SRE and Production Engineering resources by Pavlos Ratis. Definitely bookmark worthy if you become an SRE.

Linux Resources

Since I did not have a lot of Linux experience when I started my junior-level SRE role, I have included a separate section for the tutorials and tools that I have found useful for learning Linux fundamentals.

🎥 Eli the Computer Guy’s Linux Video Series - A series of awesome videos that break down key Linux concepts.
🌐 Unix/Linux Tutorial for Beginners - A step by step tutorial.
🌐 LinuxCommand.org - A step by step tutorial.
📋 Cheatography Linux Cheatsheet
📋 FossWire Linux Cheatsheet

Twitter Accounts and Blogs to Follow

Julia Evans on Twitter and Blog
Brendan Gregg
Eli Bendersky
Several Nines
Philip Fisher-Ogden
Tammy Butow
Nora Jones
Pavlos Ratis - Pavlos offers follow recommendations too!
SRE Weekly
SREcon

Dropbox Infrastructure Resources

[Image: Dropbox’s Magic Pocket]

In addition to the Dropbox specific resources covered in Tammy’s blog post, here are some others that made me super interested in working as an SRE at Dropbox:

🎥 Magic Pocket
📰 Scaling to exabytes and beyond
📰 Inside the Magic Pocket
🌐 Dropbox Infrastructure Blog

Found this post useful? Kindly tap the ❤ button below and share with friends considering SRE roles!

About the author

Krishelle is a former High School Math and Spanish Teacher turned Site Reliability Engineer at Dropbox. Read about her journey from education to tech and connect with her on Twitter and LinkedIn.

Looking to make a Career Change but not sure where to start? Job Searching and not sure how to get organized? Check out these articles by Krishelle:

Designing your Career Change
How I Transitioned From Teaching to Engineering (Webinar)
Tools for Organizing Your Job Search
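
One last aside on the staircase problem from earlier in this post, for readers who would like to try it in code. The snippet below is only a minimal Python sketch: it assumes that step k of the staircase uses k bricks (so a staircase of height n needs 1 + 2 + ... + n bricks), and the function names are made up for this illustration. It compares the counting strategy with the n(n+1)/2 shortcut.

```python
def bricks_by_counting(height: int) -> int:
    """Brute force: add up the bricks step by step, 1 + 2 + ... + height.

    Assumption for this sketch: step k of the staircase uses k bricks.
    """
    return sum(range(1, height + 1))


def bricks_by_formula(height: int) -> int:
    """Shortcut from the post: n(n+1)/2, computed with integer division."""
    return height * (height + 1) // 2


# Heights taken from the staircase problem in the post.
for height in (10, 20, 38, 100):
    # Both strategies should agree for every height.
    assert bricks_by_counting(height) == bricks_by_formula(height)
    print(f"{height}-high staircase: {bricks_by_formula(height)} bricks")
```

For the heights in the problem this gives 55, 210, 741, and 5050 bricks. The counting version does more work as the staircase grows, while the formula stays a single step, which is the same trade an SRE makes when replacing manual labor with automation.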
