I don't know about you, but lately, I've been hearing quite a lot about SREs (or, non-acronymized, Site Reliability Engineers). Now, there are probably a dozen different meanings for this role and it varies from company to company. I'm going to talk about what we had in the Agoda Homes team, and the impact on morale and the impact on the actual reliability of our platform. Basically, for my definition, an SRE is an engineer within the team tasked with monitoring the reliability of the product, investigating the cause of bugs, and determining their priority.

The Job No Engineer Wanted

Initially, we created the role within our product because we were almost feature-complete and we had traffic. So, we needed someone (or a team) to monitor how our production environment was performing, determine which bugs were critical to the success of the project, and work out what the actual impact of those bugs was. I can tell you now that if you create this role out of thin air, your engineers will probably hate you. I'm being dramatic (of course), but in the end, no engineer wanted to take on the role. It was rotated every sprint (we figured a week was too short and a month was probably too long).

First a Team

As I kind-of alluded to above, we first started out assigning this SRE role to a team. We'd reduce the number of stories the team would need to produce and let them have free rein on which bugs to tackle and how to determine impact. Now, as I said, the point of the SRE is not to solve the bugs, but to investigate and determine priority. Can you already guess where I'm heading? Rather than investigating and determining priority, the team would usually investigate and solve. That sounds nice, until the team is spending a significant amount of time on bugs that probably aren't a high priority when we have features that need to be completed.

In the end, though, the SRE role assigned to a team led to decreased morale (within the team chasing bugs) and very low productivity. We didn't really change the reliability of our product, and we ended up affecting our velocity. With bugs being reported all the time, the team were constantly dropping product work and context switching within a sprint. The cost of this constant ramp-up (think: where did I get to with the story?) was too great.

Then, a Single Engineer

Right, so the team as an SRE role didn't work. We also tried having a single engineer from the product act as the SRE every sprint. This was better, but still not good. Basically, the one poor software engineer ended up being named the bug buster. Or bug boy. Or any play on the word bug you could imagine. Now, what happened is that this single engineer would need one to three days of handover from the previous bug boy. That's a lot of time spent just getting to know what the bugs are in the system. Remember, this software engineer was not meant to solve the bugs, but to figure out where they were happening and how big a priority each should be. That's hard.

We had a rotating roster. We didn't ask for volunteers; it was mandatory. Also not great for culture. But it worked. People got on with their jobs. But the bug boy was left isolated and alone. They were no longer part of the team (even though they came to stand-ups and meetings). They had different priorities from the rest of the engineers. What we found was that this role became very inefficient. There was so much time spent on ramping up each sprint and on knowledge transfer that bugs were left on our radar for weeks at a time because they were not reproducible (which should mean low priority, right?).

We also found that engineers who were the SRE didn't necessarily come back with knowledge of the different parts of the system (as you might expect). What ended up happening is that a high priority bug would come through from the PO (Product Owner), the QAs (Quality Assurance/Testers), or customer feedback, and the SRE would have to drop the bug she/he was working on and figure out the new one.
So, their knowledge was reduced to the high profile bug.

For the rest of the engineers, there were no more distractions. This was what we wanted, right? No POs nagging us and product work pushing ahead full steam. But having a member away from your sprint meant that the teams became disconnected. Knowledge of bugs was passed from SRE to SRE rather than shared among the team. It was like a "rite of passage" to be an SRE. No one looked forward to the role.

What We Do Now

We no longer have SREs within the Agoda Homes team. The toll the role took on the people and on the effectiveness of the teams was too great. We still get high priority bugs. We still investigate bugs. But it's more like a Product task now. The PO chats with the QAs. The QAs help determine how much of an impact the bug has on the product. The PO weighs up product and bug work and determines what will bring the most business value. It's not perfect, but as engineers, we work together as a team again.

Originally published at www.alexaitken.nz on July 23, 2018.