18 files changed, 220 insertions, 94 deletions
diff --git a/gemfeed/2023-03-16-the-pragmatic-programmer-book-notes.gmi b/gemfeed/2023-03-16-the-pragmatic-programmer-book-notes.gmi index 78d8c13c..d01e1bf0 100644 --- a/gemfeed/2023-03-16-the-pragmatic-programmer-book-notes.gmi +++ b/gemfeed/2023-03-16-the-pragmatic-programmer-book-notes.gmi @@ -81,7 +81,7 @@ Other book notes of mine are: => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes (You are currently reading this) => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes -=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes +=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-04-01-never-split-the-difference-book-notes.gmi b/gemfeed/2023-04-01-never-split-the-difference-book-notes.gmi index 2b430e6b..2fea61d8 100644 --- a/gemfeed/2023-04-01-never-split-the-difference-book-notes.gmi +++ b/gemfeed/2023-04-01-never-split-the-difference-book-notes.gmi @@ -125,7 +125,7 @@ Other book notes of mine are: => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes (You are currently reading this) => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes -=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes +=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes E-Mail your comments 
to paul at buetow.org :-) diff --git a/gemfeed/2023-05-06-the-obstacle-is-the-way-book-notes.gmi b/gemfeed/2023-05-06-the-obstacle-is-the-way-book-notes.gmi index 781291a5..4eda16b9 100644 --- a/gemfeed/2023-05-06-the-obstacle-is-the-way-book-notes.gmi +++ b/gemfeed/2023-05-06-the-obstacle-is-the-way-book-notes.gmi @@ -87,7 +87,7 @@ Other book notes of mine are: => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes (You are currently reading this) -=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes +=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-07-17-career-guide-and-soft-skills-book-notes.gmi b/gemfeed/2023-07-17-career-guide-and-soft-skills-book-notes.gmi index 141ca604..6ee31f3c 100644 --- a/gemfeed/2023-07-17-career-guide-and-soft-skills-book-notes.gmi +++ b/gemfeed/2023-07-17-career-guide-and-soft-skills-book-notes.gmi @@ -277,7 +277,7 @@ Other book notes of mine are: => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes -=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes (You are currently reading this) +=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" 
book notes (You are currently reading this) E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi index 57ba1d62..97530523 100644 --- a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi +++ b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi @@ -6,6 +6,7 @@ The universe of Site Reliability Engineering (SRE) is like an intricate tapestry => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this) => ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE +=> ./2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect ``` ▓▓▓▓░░ diff --git a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi index c37dd9e7..d03b15e4 100644 --- a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi +++ b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi @@ -6,6 +6,7 @@ This is the second part of my Site Reliability Engineering (SRE) series. I am cu => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture => ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this) +=> ./2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect ``` ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ @@ -40,7 +41,9 @@ In conclusion, operational balance in SRE is not a static thing but an ongoing j That all sounds very romantic. 
The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it! -The third part of this blog series will be published soon :-) +Continue with the third part of this series: + +=> ./2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl index 6cfac681..84722fd2 100644 --- a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl +++ b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl @@ -39,7 +39,9 @@ In conclusion, operational balance in SRE is not a static thing but an ongoing j That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it! -The third part of this blog series will be published soon :-) +Continue with the third part of this series: + +<< template::inline::index site-reliability-engineering-part-3 E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi b/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi new file mode 100644 index 00000000..352954f0 --- /dev/null +++ b/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi @@ -0,0 +1,59 @@ +# Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect + +> Published at 2023-08-20T12:17:56+03:00 + +This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series. 
+ +=> ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture +=> ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE +=> ./2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect (You are currently reading this) + +``` + ..--""""----.. + .-" ..--""""--.j-. + .-" .-" .--.""--.. + .-" .-" ..--"-. \/ ; + .-" .-"_.--..--"" ..--' "-. : + .' .' / `. \..--"" __ _ \ ; + :.__.-" \ / .' ( )"-. Y + ; ;: ( ) ( ). \ + .': /:: : \ \ + .'.-"\._ _.-" ; ; ( ) .-. ( ) \ + " `.""" .j" : : \ ; ; \ + bug /"""""/ ; ( ) "" :.( ) \ + /\ / : \ \`.: _ \ + : `. / ; `( ) (\/ :" \ \ + \ `. : "-.(_)_.' t-' ; + \ `. ; ..--": + `. `. : ..--"" : + `. "-. ; ..--"" ; + `. "-.:_..--"" ..--" + `. : ..--"" + "-. : ..--"" + "-.;_..--"" + +``` + +## On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability + +Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring a healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor. + +Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. One ceavat is, that engineers should be willing to learn. Especially in on-call rotation mixing SREs with other engineers (for example Software Engineers or QA engineers), it's difficult to motivate everyone to engage. 
QA engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE. + +Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a certain degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles. + +An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better. + +Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place. + +An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. 
This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves. + +For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet will only scale in the short run, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. The experts will still be there for guidance and step in when the junior is stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential. + +A blameless on-call culture is a must for fostering a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged. 
+ +The fourth part of this blog series will be published soon :-) + +E-Mail your comments to paul at buetow.org :-) + +=> ../ Back to the main site diff --git a/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi.tpl b/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi.tpl new file mode 100644 index 00000000..a186f7bf --- /dev/null +++ b/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi.tpl @@ -0,0 +1,57 @@ +# Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect + +> Published at 2023-08-20T12:17:56+03:00 + +This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series. + +<< template::inline::index site-reliability-engineering-part + +``` + ..--""""----.. + .-" ..--""""--.j-. + .-" .-" .--.""--.. + .-" .-" ..--"-. \/ ; + .-" .-"_.--..--"" ..--' "-. : + .' .' / `. \..--"" __ _ \ ; + :.__.-" \ / .' ( )"-. Y + ; ;: ( ) ( ). \ + .': /:: : \ \ + .'.-"\._ _.-" ; ; ( ) .-. ( ) \ + " `.""" .j" : : \ ; ; \ + bug /"""""/ ; ( ) "" :.( ) \ + /\ / : \ \`.: _ \ + : `. / ; `( ) (\/ :" \ \ + \ `. : "-.(_)_.' t-' ; + \ `. ; ..--": + `. `. : ..--"" : + `. "-. ; ..--"" ; + `. "-.:_..--"" ..--" + `. : ..--"" + "-. : ..--"" + "-.;_..--"" + +``` + +## On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability + +Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring a healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor. + +Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. 
This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. One ceavat is, that engineers should be willing to learn. Especially in on-call rotation mixing SREs with other engineers (for example Software Engineers or QA engineers), it's difficult to motivate everyone to engage. QA engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE. + +Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a certain degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles. + +An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better. + +Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. 
Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place. + +An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves. + +For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet will only scale in the short run, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. The experts will still be there for guidance and step in when the junior is stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential. + +A blameless on-call culture is a must for fostering a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. 
Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged. + +The fourth part of this blog series will be published soon :-) + +E-Mail your comments to paul at buetow.org :-) + +=> ../ Back to the main site diff --git a/gemfeed/DRAFT-site-reliability-engineering.gmi b/gemfeed/DRAFT-site-reliability-engineering.gmi index cb0f398b..776cd71b 100644 --- a/gemfeed/DRAFT-site-reliability-engineering.gmi +++ b/gemfeed/DRAFT-site-reliability-engineering.gmi @@ -1,19 +1,3 @@ -## On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability - -Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated component of this discipline. Fostering a healthy on-call culture is as critical as any technical solution. In the world of constant alerts, pages, and incident management, the well-being of the engineers becomes paramount. - -Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. Establishing happy and healthy on-call rotations is akin to possessing a superpower. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. It acknowledges that while systems are crucial, the engineers who maintain them are invaluable. - -However, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise, it's not the volume of pages that matters most. Instead, the underlying culture plays a pivotal role. Trust, ownership, accountability, and effective communication are the pillars upon which successful on-call experiences are built. 
The essence lies in the approach to incident management, not just the incidents themselves. - -A significant part of this approach is the feedback mechanism. On-call postmortems are vital to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better. - -But beyond processes and postmortems, there's a profound human element involved. No engineer should ever feel that being jolted awake in the middle of the night for an incident is a rite of passage. "Trial by fire" should never be a prerequisite for being good on-call. Instead, mentorship is invaluable. Having every on-caller shadow a more experienced engineer provides a safety net, ensuring that new members are brought into the fold with care and guidance. - -Moreover, the psychological well-being of the engineers is vital. An always-on, always-alert culture can lead to burnout. Mental health is paramount. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. - -In conclusion, while SRE has its roots in technical solutions and ensuring system reliability, it's fundamentally a discipline that thrives on its human component. A successful on-call culture recognises this and ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The human aspect, thus, becomes the heart of SRE, driving it forward with passion, dedication, and care. - ## The Heroic Facade and Team Dynamics: Rethinking Success in SRE The realm of Site Reliability Engineering is punctuated by the constant ebb and flow of system challenges. 
While individual excellence is commendable, the overarching belief in the SRE culture should be that true success lies in cohesive teamwork and not in individual heroics. diff --git a/gemfeed/atom.xml b/gemfeed/atom.xml index f19577fb..36946f41 100644 --- a/gemfeed/atom.xml +++ b/gemfeed/atom.xml @@ -1,12 +1,86 @@ <?xml version="1.0" encoding="utf-8"?> <feed xmlns="http://www.w3.org/2005/Atom"> - <updated>2023-08-19T11:27:51+03:00</updated> + <updated>2023-08-20T12:35:50+03:00</updated> <title>foo.zone feed</title> <subtitle>To be in the .zone!</subtitle> <link href="gemini://foo.zone/gemfeed/atom.xml" rel="self" /> <link href="gemini://foo.zone/" /> <id>gemini://foo.zone/</id> <entry> + <title>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</title> + <link href="gemini://foo.zone/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi" /> + <id>gemini://foo.zone/gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi</id> + <updated>2023-08-20T12:17:56+03:00</updated> + <author> + <name>Paul Buetow aka snonux</name> + <email>paul@dev.buetow.org</email> + </author> + <summary>This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> + <content type="xhtml"> + <div xmlns="http://www.w3.org/1999/xhtml"> + <h1 style='display: inline'>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</h1><br /> +<br /> +<span class='quote'>Published at 2023-08-20T12:17:56+03:00</span><br /> +<br /> +<span>This is the third part of my Site Reliability Engineering (SRE) series. 
I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> +<br /> +<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> +<a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> +<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect (You are currently reading this)</a><br /> +<br /> +<pre> + ..--""""----.. + .-" ..--""""--.j-. + .-" .-" .--.""--.. + .-" .-" ..--"-. \/ ; + .-" .-"_.--..--"" ..--' "-. : + .' .' / `. \..--"" __ _ \ ; + :.__.-" \ / .' ( )"-. Y + ; ;: ( ) ( ). \ + .': /:: : \ \ + .'.-"\._ _.-" ; ; ( ) .-. ( ) \ + " `.""" .j" : : \ ; ; \ + bug /"""""/ ; ( ) "" :.( ) \ + /\ / : \ \`.: _ \ + : `. / ; `( ) (\/ :" \ \ + \ `. : "-.(_)_.' t-' ; + \ `. ; ..--": + `. `. : ..--"" : + `. "-. ; ..--"" ; + `. "-.:_..--"" ..--" + `. : ..--"" + "-. : ..--"" + "-.;_..--"" + +</pre> +<br /> +<h2 style='display: inline'>On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability</h2><br /> +<br /> +<span>Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring a healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor.</span><br /> +<br /> +<span>Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. 
One ceavat is, that engineers should be willing to learn. Especially in on-call rotation mixing SREs with other engineers (for example Software Engineers or QA engineers), it's difficult to motivate everyone to engage. QA engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE.</span><br /> +<br /> +<span>Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a certain degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles.</span><br /> +<br /> +<span>An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better.</span><br /> +<br /> +<span>Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. 
Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place.</span><br /> +<br /> +<span>An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves.</span><br /> +<br /> +<span>For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet will only scale in the short run, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. The experts will still be there for guidance and step in when the junior is stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential.</span><br /> +<br /> +<span>A blameless on-call culture is a must for fostering a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. 
When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged.</span><br /> +<br /> +<span>The fourth part of this blog series will be published soon :-)</span><br /> +<br /> +<span>E-Mail your comments to paul at buetow.org :-)</span><br /> +<br /> +<a class='textlink' href='../'>Back to the main site</a><br /> + </div> + </content> + </entry> + <entry> <title>Site Reliability Engineering - Part 2: Operational Balance in SRE</title> <link href="gemini://foo.zone/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi" /> <id>gemini://foo.zone/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi</id> @@ -26,6 +100,7 @@ <br /> <a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> <a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this)</a><br /> +<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> <br /> <pre> ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ @@ -60,7 +135,9 @@ <br /> <span>That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. 
But at least we should aim for it!</span><br />
 <br />
-<span>The third part of this blog series will be published soon :-)</span><br />
+<span>Continue with the third part of this series:</span><br />
+<br />
+<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br />
 <br />
 <span>E-Mail your comments to paul at buetow.org :-)</span><br />
 <br />
@@ -88,6 +165,7 @@
 <br />
 <a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this)</a><br />
 <a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br />
+<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br />
 <br />
 <pre>
 ▓▓▓▓░░
@@ -584,7 +662,7 @@ http://www.gnu.org/software/src-highlite -->
 <a class='textlink' href='./2023-03-16-the-pragmatic-programmer-book-notes.html'>2023-03-16 "The Pragmatic Programmer" book notes</a><br />
 <a class='textlink' href='./2023-04-01-never-split-the-difference-book-notes.html'>2023-04-01 "Never split the difference" book notes</a><br />
 <a class='textlink' href='./2023-05-06-the-obstacle-is-the-way-book-notes.html'>2023-05-06 "The Obstacle is the Way" book notes</a><br />
-<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes (You are currently reading this)</a><br />
+<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes (You are currently reading this)</a><br />
 <br />
 <span>E-Mail your comments to paul at buetow.org :-)</span><br />
 <br />
@@ -971,7 +1049,7 @@ http://www.gnu.org/software/src-highlite -->
 <a class='textlink' href='./2023-03-16-the-pragmatic-programmer-book-notes.html'>2023-03-16 "The Pragmatic Programmer" book notes</a><br />
 <a class='textlink' href='./2023-04-01-never-split-the-difference-book-notes.html'>2023-04-01 "Never split the difference" book notes</a><br />
 <a class='textlink' href='./2023-05-06-the-obstacle-is-the-way-book-notes.html'>2023-05-06 "The Obstacle is the Way" book notes (You are currently reading this)</a><br />
-<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes</a><br />
+<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes</a><br />
 <br />
 <span>E-Mail your comments to paul at buetow.org :-)</span><br />
 <br />
@@ -1582,7 +1660,7 @@ ok codeberg<font color="#990000">.</font>org/snonux/algorithms/sort <fo
 <a class='textlink' href='./2023-03-16-the-pragmatic-programmer-book-notes.html'>2023-03-16 "The Pragmatic Programmer" book notes</a><br />
 <a class='textlink' href='./2023-04-01-never-split-the-difference-book-notes.html'>2023-04-01 "Never split the difference" book notes (You are currently reading this)</a><br />
 <a class='textlink' href='./2023-05-06-the-obstacle-is-the-way-book-notes.html'>2023-05-06 "The Obstacle is the Way" book notes</a><br />
-<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes</a><br />
+<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes</a><br />
 <br />
 <span>E-Mail your comments to paul at buetow.org :-)</span><br />
 <br />
@@ -1863,7 +1941,7 @@ The remaining content of the Gemtext file<font color="#990000">...</font>
 <a class='textlink' href='./2023-03-16-the-pragmatic-programmer-book-notes.html'>2023-03-16 "The Pragmatic Programmer" book notes (You are currently reading this)</a><br />
 <a class='textlink' href='./2023-04-01-never-split-the-difference-book-notes.html'>2023-04-01 "Never split the difference" book notes</a><br />
 <a class='textlink' href='./2023-05-06-the-obstacle-is-the-way-book-notes.html'>2023-05-06 "The Obstacle is the Way" book notes</a><br />
-<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes</a><br />
+<a class='textlink' href='./2023-07-17-career-guide-and-soft-skills-book-notes.html'>2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes</a><br />
 <br />
 <span>E-Mail your comments to paul at buetow.org :-)</span><br />
 <br />
@@ -8491,64 +8569,4 @@ Notice: Finished catalog run in 206.09 seconds
 </div>
 </content>
 </entry>
- <entry>
- <title>Offsite backup with ZFS</title>
- <link href="gemini://foo.zone/gemfeed/2016-04-03-offsite-backup-with-zfs.gmi" />
- <id>gemini://foo.zone/gemfeed/2016-04-03-offsite-backup-with-zfs.gmi</id>
- <updated>2016-04-03T22:43:42+01:00</updated>
- <author>
- <name>Paul Buetow aka snonux</name>
- <email>paul@dev.buetow.org</email>
- </author>
- <summary>When it comes to data storage and potential data loss, I am a paranoid person. It is due to my job and a personal experience I encountered over ten years ago: A single drive failure and loss of all my data (pictures, music, etc.).</summary>
- <content type="xhtml">
- <div xmlns="http://www.w3.org/1999/xhtml">
- <h1 style='display: inline'>Offsite backup with ZFS</h1><br />
-<br />
-<span class='quote'>Published at 2016-04-03T22:43:42+01:00</span><br />
-<br />
-<pre>
- ________________
-|# : : #|
-| : ZFS/GELI : |
-| : Offsite : |
-| : Backup : |
-| :___________: |
-| _________ |
-| | __ | |
-| || | | |
-\____||__|_____|__|
-</pre>
-<br />
-<a class='textlink' href='./2016-04-03-offsite-backup-with-zfs.html'>Offsite backup with ZFS Part 1 (you are reading this atm.)</a><br />
-<a class='textlink' href='./2016-04-16-offsite-backup-with-zfs-part2.html'>Offsite backup with ZFS Part 2</a><br />
-<br />
-<h2 style='display: inline'>Please don't lose all my pictures again!</h2><br />
-<br />
-<span>When it comes to data storage and potential data loss, I am a paranoid person. It is due to my job and a personal experience I encountered over ten years ago: A single drive failure and loss of all my data (pictures, music, etc.).</span><br />
-<br />
-<span>A little about my personal infrastructure: I am running my own (mostly FreeBSD based) root servers (across several countries: Two in Germany, one in Canada, one in Bulgaria) which store all my online data (E-Mail and my Git repositories). I am syncing incremental (and encrypted) ZFS snapshots between these servers forth and back so either data can be recovered from the other server.</span><br />
-<br />
-<h2 style='display: inline'>Local storage box for offline data</h2><br />
-<br />
-<span>Also, I am operating a local server (an HP MicroServer) at home in my apartment. Full snapshots of all ZFS volumes are pulled from the "online" servers to the local server every other week and the incremental ZFS snapshots every day. That local server has a ZFS ZMIRROR with three disks configured (local triple redundancy). I keep up to half a year worth of ZFS snapshots of all volumes. That local server also contains all my offline data such as pictures, private documents, videos, books, various other backups, etc.</span><br />
-<br />
-<span>Once weekly, all the local server data is copied to two external USB drives as a backup (without the historic snapshots). For simplicity, these USB drives are not formatted with ZFS but with good old UFS. This gives me a chance to recover from a (potential) ZFS disaster. ZFS is a complex thing. Sometimes it is good not to trust complicated things!</span><br />
-<br />
-<h2 style='display: inline'>Storing it at my apartment is not enough</h2><br />
-<br />
-<span>Now I am thinking about an offsite backup of all this local data. The problem is that all the data remains on a single physical location: My local MicroServer. What happens when the house burns or my server, including the internal disks and the attached USB drives, gets stolen? My first thought was to back up everything to the "cloud". However, the significant issue here is the limited amount of available upload bandwidth (only 1MBit/s).</span><br />
-<br />
-<span>The solution is adding another USB drive (2TB) with an encryption container (GELI) and a ZFS pool. The GELI encryption requires a secret key and a secret passphrase. I am updating the data to that drive once every three months (my calendar is reminding me about it), and afterwards, I keep that drive at a secret location outside of my apartment. All the information needed to decrypt (mounting the GELI container) is stored at another (secure) place. Key and passphrase are kept at different sites, though. Even if someone knew of it, he would not be able to decrypt it as some additional insider knowledge would be required as well.</span><br />
-<br />
-<h2 style='display: inline'>Walking one round less</h2><br />
-<br />
-<span>I am thinking of buying a second 2TB USB drive and setting it up the same way as the first one. So I could alternate the backups. One drive would be at the secret location, and the other drive would be at home. And these drives would swap place after each cycle. This would give some security about the failure of that drive, and I would have to go to the secret location only once (swapping the drives) instead of twice (picking that drive up to update the data + bringing it back to the remote location).</span><br />
-<br />
-<span>E-Mail your comments to paul at buetow.org :-)</span><br />
-<br />
-<a class='textlink' href='../'>Back to the main site</a><br />
- </div>
- </content>
- </entry>
 </feed>
diff --git a/gemfeed/index.gmi b/gemfeed/index.gmi
index 9e3a5348..3831178a 100644
--- a/gemfeed/index.gmi
+++ b/gemfeed/index.gmi
@@ -2,6 +2,7 @@ ## To be in the .zone!
+=> ./2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 - Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect
 => ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 - Site Reliability Engineering - Part 2: Operational Balance in SRE
 => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 - Site Reliability Engineering - Part 1: SRE and Organizational Culture
 => ./2023-07-21-gemtexter-2.1.0-lets-gemtext-again-3.gmi 2023-07-21 - Gemtexter 2.1.0 - Let's Gemtext again³
@@ -1,6 +1,6 @@
 # foo.zone
-> This site was generated at 2023-08-19T11:27:51+03:00 by `Gemtexter`
+> This site was generated at 2023-08-20T12:35:50+03:00 by `Gemtexter`
 ```
 |\---/|
@@ -33,6 +33,7 @@ If you reach this site via the modern web, please read this:
 ### Posts
+=> ./gemfeed/2023-08-20-site-reliability-engineering-part-3.gmi 2023-08-20 - Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect
 => ./gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 - Site Reliability Engineering - Part 2: Operational Balance in SRE
 => ./gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 - Site Reliability Engineering - Part 1: SRE and Organizational Culture
 => ./gemfeed/2023-07-21-gemtexter-2.1.0-lets-gemtext-again-3.gmi 2023-07-21 - Gemtexter 2.1.0 - Let's Gemtext again³
diff --git a/notes/career-guide-and-soft-skills.gmi b/notes/career-guide-and-soft-skills.gmi
index 141ca604..6ee31f3c 100644
--- a/notes/career-guide-and-soft-skills.gmi
+++ b/notes/career-guide-and-soft-skills.gmi
@@ -277,7 +277,7 @@ Other book notes of mine are:
 => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes
 => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes
 => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes
-=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes (You are currently reading this)
+=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes (You are currently reading this)
 E-Mail your comments to paul at buetow.org :-)
diff --git a/notes/never-split-the-difference.gmi b/notes/never-split-the-difference.gmi
index 2b430e6b..2fea61d8 100644
--- a/notes/never-split-the-difference.gmi
+++ b/notes/never-split-the-difference.gmi
@@ -125,7 +125,7 @@ Other book notes of mine are:
 => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes
 => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes (You are currently reading this)
 => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes
-=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes
+=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes
 E-Mail your comments to paul at buetow.org :-)
diff --git a/notes/the-obstacle-is-the-way.gmi b/notes/the-obstacle-is-the-way.gmi
index 781291a5..4eda16b9 100644
--- a/notes/the-obstacle-is-the-way.gmi
+++ b/notes/the-obstacle-is-the-way.gmi
@@ -87,7 +87,7 @@ Other book notes of mine are:
 => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes
 => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes
 => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes (You are currently reading this)
-=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes
+=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes
 E-Mail your comments to paul at buetow.org :-)
diff --git a/notes/the-pragmatic-programmer.gmi b/notes/the-pragmatic-programmer.gmi
index 78d8c13c..d01e1bf0 100644
--- a/notes/the-pragmatic-programmer.gmi
+++ b/notes/the-pragmatic-programmer.gmi
@@ -81,7 +81,7 @@ Other book notes of mine are:
 => ./2023-03-16-the-pragmatic-programmer-book-notes.gmi 2023-03-16 "The Pragmatic Programmer" book notes (You are currently reading this)
 => ./2023-04-01-never-split-the-difference-book-notes.gmi 2023-04-01 "Never split the difference" book notes
 => ./2023-05-06-the-obstacle-is-the-way-book-notes.gmi 2023-05-06 "The Obstacle is the Way" book notes
-=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide & Soft Skills" book notes
+=> ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 "Software Developmers Career Guide and Soft Skills" book notes
 E-Mail your comments to paul at buetow.org :-)
diff --git a/uptime-stats.gmi b/uptime-stats.gmi
index 1a9cd5aa..3706419e 100644
--- a/uptime-stats.gmi
+++ b/uptime-stats.gmi
@@ -1,6 +1,6 @@
 # My machine uptime stats
-> This site was last updated at 2023-08-19T11:27:51+03:00
+> This site was last updated at 2023-08-20T12:35:50+03:00
 The following stats were collected via `uptimed` on all of my personal computers over many years and the output was generated by `guprecords`, the global uptime records stats analyser of mine.
