Diffstat (limited to 'gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi')
| -rw-r--r-- | gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi | 33 |
1 files changed, 20 insertions, 13 deletions
diff --git a/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi b/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
index 96ddea9b..609c6e58 100644
--- a/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
+++ b/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
@@ -1,12 +1,13 @@
-# Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect
+# Site Reliability Engineering - Part 3: On-Call Culture

 > Published at 2024-01-09T18:35:48+02:00

-This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.
+Welcome to Part 3 of my Site Reliability Engineering (SRE) series. I'm currently working as a Site Reliability Engineer, and I’m here to share what SRE is all about in this blog series.

 => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture
-=> ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE
-=> ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect (You are currently reading this)
+=> ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance
+=> ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture (You are currently reading this)
+=> ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers

 ```
 ..--""""----..
@@ -34,23 +35,29 @@ This is the third part of my Site Reliability Engineering (SRE) series. I am cur
 ```

-## On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability
+## Putting Well-being First

-Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring an healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor.
+Site Reliability Engineering is all about keeping systems reliable, but we often forget how important the human side is. A healthy on-call culture is just as crucial as any technical fix. The well-being of the engineers really matters.

-Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. One ceavat is, that engineers should be willing to learn. Especially in on-call rotation embedding SREs with other engineers (for example Software Engineers or QA Engineers), it's difficult to motivate everyone to engage. QA Engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE.
+First off, a healthy on-call rotation is about more than just handling incidents. It's about creating a supportive ecosystem. This means cutting down on pain points, offering mentorship, quickly iterating on processes, and making sure engineers have the right tools. But there's a catch—engineers need to be willing to learn. Especially in on-call rotations where SREs work with Software Engineers or QA Engineers, it can be tough to get everyone motivated. QA Engineers want to test, Software Engineers want to build new features; they don’t want to deal with production issues. This can be really frustrating for the SREs trying to mentor them.

-Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles.
+Plus, measuring a good on-call experience isn't always clear-cut. You might think fewer pages mean a better on-call setup—and yeah, no one wants to get paged after hours—but it's not just about the number of pages. Trust, ownership, accountability, and solid communication are what really matter.

-An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better.
+A key part is giving feedback about the on-call experience to keep learning and improving. If alerts are mostly noise, they need to be tweaked or even ditched. If alerts are helpful, can we automate the repetitive tasks? If there are knowledge gaps, is the documentation lacking? Regular retrospectives ensure that the systems get better over time and the on-call experience improves for the engineers.

-Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place.
+Getting new team members ready for on-call duties is super important for keeping systems reliable and efficient. This means giving them the knowledge, tools, and support they need to handle incidents with confidence. It starts with a rundown of the system architecture and common issues, then training on monitoring tools, alerting systems, and incident response protocols. Watching experienced on-call engineers in action can provide some hands-on learning. Too often, though, new engineers get thrown into the deep end without proper onboarding because the more experienced engineers are too busy dealing with ongoing production issues.

-An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves.
+A culture where everyone's always on and alert can cause burnout. Engineers need to know their limits, take breaks, and ask for help when they need it. This isn't just about personal health; a burnt-out engineer can drag down the whole team and the systems they manage. A good on-call culture keeps systems running while making sure engineers are happy, healthy, and supported. Experienced engineers should take the time to mentor juniors, but junior engineers should also stay engaged, investigate issues, and learn new things on their own.

-For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet won't scale forever, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. The experts will still be there for guidance and step in when the junior gets stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential.
+For junior engineers, it's tempting to always ask the experts for help whenever something goes wrong. While that might seem reasonable, constantly handing out solutions doesn't scale—there are endless ways for production systems to break. So, every engineer needs to learn how to debug, troubleshoot, and resolve incidents on their own. The experts should be there for guidance and can step in when a junior gets really stuck, but they also need to give space for less experienced engineers to grow and learn.

-A blameless on-call culture is a must for a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged.
+A blameless on-call culture is essential for creating a safe and collaborative environment where engineers can handle incidents without worrying about getting blamed. It recognizes that mistakes are just part of learning and innovating. When people know they won’t be punished for errors, they’re more likely to talk openly about what went wrong, which helps the whole team learn and improve. Plus, a blameless culture boosts psychological safety, job satisfaction, and reduces burnout, keeping everyone committed and engaged.
+
+Mistakes are gonna happen, which is why having a blameless on-call culture is so important.
+
+Continue with the fourth part of this series:
+
+=> ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers

 E-Mail your comments to `paul@nospam.buetow.org` :-)
