| author | Paul Buetow <paul@buetow.org> | 2024-01-09 18:45:34 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2024-01-09 18:45:34 +0200 |
| commit | c19bc77ae895abbf12ef5c3161fea7aed34bd374 (patch) | |
| tree | 96f14ba0216a15247d3e2276286911cdf80d9c46 /gemfeed/atom.xml | |
| parent | c3b96b220c7abb624de848399a72593fc83621b2 (diff) | |
Update content for html
Diffstat (limited to 'gemfeed/atom.xml')
| -rw-r--r-- | gemfeed/atom.xml | 290 |
1 files changed, 144 insertions, 146 deletions
diff --git a/gemfeed/atom.xml b/gemfeed/atom.xml index f38c9ff6..051ee50c 100644 --- a/gemfeed/atom.xml +++ b/gemfeed/atom.xml @@ -1,12 +1,84 @@ <?xml version="1.0" encoding="utf-8"?> <feed xmlns="http://www.w3.org/2005/Atom"> - <updated>2023-12-10T17:11:56+02:00</updated> + <updated>2024-01-09T18:45:17+02:00</updated> <title>foo.zone feed</title> <subtitle>To be in the .zone!</subtitle> <link href="https://foo.zone/gemfeed/atom.xml" rel="self" /> <link href="https://foo.zone/" /> <id>https://foo.zone/</id> <entry> + <title>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</title> + <link href="https://foo.zone/gemfeed/2024-01-09-site-reliability-engineering-part-3.html" /> + <id>https://foo.zone/gemfeed/2024-01-09-site-reliability-engineering-part-3.html</id> + <updated>2024-01-09T18:35:48+02:00</updated> + <author> + <name>Paul Buetow aka snonux</name> + <email>paul@dev.buetow.org</email> + </author> + <summary>This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> + <content type="xhtml"> + <div xmlns="http://www.w3.org/1999/xhtml"> + <h1 style='display: inline'>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</h1><br /> +<br /> +<span class='quote'>Published at 2024-01-09T18:35:48+02:00</span><br /> +<br /> +<span>This is the third part of my Site Reliability Engineering (SRE) series. 
I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> +<br /> +<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> +<a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> +<a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect (You are currently reading this)</a><br /> +<br /> +<pre> + ..--""""----.. + .-" ..--""""--.j-. + .-" .-" .--.""--.. + .-" .-" ..--"-. \/ ; + .-" .-"_.--..--"" ..--' "-. : + .' .' / `. \..--"" __ _ \ ; + :.__.-" \ / .' ( )"-. Y + ; ;: ( ) ( ). \ + .': /:: : \ \ + .'.-"\._ _.-" ; ; ( ) .-. ( ) \ + " `.""" .j" : : \ ; ; \ + bug /"""""/ ; ( ) "" :.( ) \ + /\ / : \ \`.: _ \ + : `. / ; `( ) (\/ :" \ \ + \ `. : "-.(_)_.' t-' ; + \ `. ; ..--": + `. `. : ..--"" : + `. "-. ; ..--"" ; + `. "-.:_..--"" ..--" + `. : ..--"" + "-. : ..--"" + "-.;_..--"" + +</pre> +<br /> +<h2 style='display: inline'>On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability</h2><br /> +<br /> +<span>Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring an healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor.</span><br /> +<br /> +<span>Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. 
One ceavat is, that engineers should be willing to learn. Especially in on-call rotation embedding SREs with other engineers (for example Software Engineers or QA Engineers), it's difficult to motivate everyone to engage. QA Engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE.</span><br /> +<br /> +<span>Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles.</span><br /> +<br /> +<span>An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better.</span><br /> +<br /> +<span>Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. 
Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place.</span><br /> +<br /> +<span>An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves.</span><br /> +<br /> +<span>For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet won't scale forever, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. The experts will still be there for guidance and step in when the junior gets stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential.</span><br /> +<br /> +<span>A blameless on-call culture is a must for a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. 
When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged.</span><br /> +<br /> +<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> +<br /> +<a class='textlink' href='../'>Back to the main site</a><br /> + </div> + </content> + </entry> + <entry> <title>Bash Golf Part 3</title> <link href="https://foo.zone/gemfeed/2023-12-10-bash-golf-part-3.html" /> <id>https://foo.zone/gemfeed/2023-12-10-bash-golf-part-3.html</id> @@ -425,6 +497,71 @@ echo baz </content> </entry> <entry> + <title>Site Reliability Engineering - Part 2: Operational Balance in SRE</title> + <link href="https://foo.zone/gemfeed/2023-11-19-site-reliability-engineering-part-2.html" /> + <id>https://foo.zone/gemfeed/2023-11-19-site-reliability-engineering-part-2.html</id> + <updated>2023-11-19T00:18:18+03:00</updated> + <author> + <name>Paul Buetow aka snonux</name> + <email>paul@dev.buetow.org</email> + </author> + <summary>This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> + <content type="xhtml"> + <div xmlns="http://www.w3.org/1999/xhtml"> + <h1 style='display: inline'>Site Reliability Engineering - Part 2: Operational Balance in SRE</h1><br /> +<br /> +<span class='quote'>Published at 2023-11-19T00:18:18+03:00</span><br /> +<br /> +<span>This is the second part of my Site Reliability Engineering (SRE) series. 
I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> +<br /> +<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> +<a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this)</a><br /> +<a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> +<br /> +<pre> +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ +⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ +⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ +⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ +⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ +⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ +⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ +</pre> +<br /> +<h2 style='display: inline'>Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity</h2><br /> +<br /> +<span>Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management.</span><br /> +<br /> +<span>In the universe of software production, two fundamental forces are often at odds: The drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a approach to mitigate these conflicting drives through concepts like error budgets and SLIs/SLOs. 
These mechanisms offer a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.</span><br /> +<br /> +<span>An important part of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding - 50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges. </span><br /> +<br /> +<span>However, not all operational tasks are equal. SRE differentiates between "ops work" and "toil". While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.</span><br /> +<br /> +<span>A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the balance between system stability and forward momentum.</span><br /> +<br /> +<span>Moreover, operational balance isn't just a technological or process challenge; it's a human one. 
The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed. </span><br /> +<br /> +<span>In conclusion, operational balance in SRE isn't static thing but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they have time for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.</span><br /> +<br /> +<span>That all sounds very romantic. The truth is, it's brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!</span><br /> +<br /> +<span>Continue with the third part of this series:</span><br /> +<br /> +<a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> +<br /> +<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> +<br /> +<a class='textlink' href='../'>Back to the main site</a><br /> + </div> + </content> + </entry> + <entry> <title>'Mind Management' book notes</title> <link href="https://foo.zone/gemfeed/2023-11-11-mind-management-book-notes.html" /> <id>https://foo.zone/gemfeed/2023-11-11-mind-management-book-notes.html</id> @@ -1169,145 +1306,6 @@ http://www.gnu.org/software/src-highlite --> </content> </entry> <entry> - <title>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</title> - <link href="https://foo.zone/gemfeed/2023-08-20-site-reliability-engineering-part-3.html" /> - <id>https://foo.zone/gemfeed/2023-08-20-site-reliability-engineering-part-3.html</id> - <updated>2023-08-20T12:17:56+03:00</updated> 
- <author> - <name>Paul Buetow aka snonux</name> - <email>paul@dev.buetow.org</email> - </author> - <summary>This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> - <content type="xhtml"> - <div xmlns="http://www.w3.org/1999/xhtml"> - <h1 style='display: inline'>Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</h1><br /> -<br /> -<span class='quote'>Published at 2023-08-20T12:17:56+03:00</span><br /> -<br /> -<span>This is the third part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> -<br /> -<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> -<a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> -<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect (You are currently reading this)</a><br /> -<br /> -<pre> - ..--""""----.. - .-" ..--""""--.j-. - .-" .-" .--.""--.. - .-" .-" ..--"-. \/ ; - .-" .-"_.--..--"" ..--' "-. : - .' .' / `. \..--"" __ _ \ ; - :.__.-" \ / .' ( )"-. Y - ; ;: ( ) ( ). \ - .': /:: : \ \ - .'.-"\._ _.-" ; ; ( ) .-. ( ) \ - " `.""" .j" : : \ ; ; \ - bug /"""""/ ; ( ) "" :.( ) \ - /\ / : \ \`.: _ \ - : `. / ; `( ) (\/ :" \ \ - \ `. : "-.(_)_.' t-' ; - \ `. ; ..--": - `. `. : ..--"" : - `. "-. ; ..--"" ; - `. "-.:_..--"" ..--" - `. : ..--"" - "-. 
: ..--"" - "-.;_..--"" - -</pre> -<br /> -<h2 style='display: inline'>On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability</h2><br /> -<br /> -<span>Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated part of this discipline. Ensuring an healthy on-call culture is as critical as any technical solution. The well-being of the engineers is an important factor.</span><br /> -<br /> -<span>Firstly, a healthy on-call rotation is about more than just managing and responding to incidents. It's about the entire ecosystem that supports this practice. This involves reducing pain points, offering mentorship, rapid iteration, and ensuring that engineers have the right tools and processes. One ceavat is, that engineers should be willing to learn. Especially in on-call rotation embedding SREs with other engineers (for example Software Engineers or QA Engineers), it's difficult to motivate everyone to engage. QA Engineers want to test the software, Software Engineers want to implement new features; they don't want to troubleshoot and debug production incidents. It can be depressing for the mentoring SRE.</span><br /> -<br /> -<span>Furthermore, the metrics that measure the success of an on-call experience are only sometimes straightforward. While one might assume that fewer pages translate to better on-call expertise (which is true to a degree, as who wants to receive a page out of office hours?), it's not always the volume of pages that matters most. Trust, ownership, accountability, and effective communication play the important roles.</span><br /> -<br /> -<span>An important part is giving feedback about the on-call experience to ensure continuous learning. If alerts are mostly noise, they should be tuned or even eliminated. If alerts are actionable, can recurring tasks be automated? If there are knowledge gaps, is the documentation not good enough? 
Continuous retrospection ensures that not only do systems evolve, but the experience for the on-call engineers becomes progressively better.</span><br /> -<br /> -<span>Onboarding for on-call duties is a crucial aspect of ensuring the reliability and efficiency of systems. This process involves equipping new team members with the knowledge, tools, and support to handle incidents confidently. It begins with an overview of the system architecture and common challenges, followed by training on monitoring tools, alerting mechanisms, and incident response protocols. Shadowing experienced on-call engineers can offer practical exposure. Too often, new engineers are thrown into the cold water without proper onboarding and training because the more experienced engineers are too busy fire-fighting production issues in the first place.</span><br /> -<br /> -<span>An always-on, always-alert culture can lead to burnout. Engineers should be encouraged to recognise their limits, take breaks, and seek support when needed. This isn't just about individual health; a burnt-out engineer can have cascading effects on the entire team and the systems they manage. A successful on-call culture ensures that while systems are kept running, the engineers are kept happy, healthy, and supported. The more experienced engineers should take time to mentor the junior engineers, but the junior engineers should also be fully engaged, try to investigate and learn new things by themselves.</span><br /> -<br /> -<span>For the junior engineer, it's too easy to fall back and ask the experts in the team every time an issue arises. This seems reasonable, but serving recipes for solving production issues on a silver tablet won't scale forever, as there are infinite scenarios of how production systems can break. So every engineer should learn to debug, troubleshoot and resolve production incidents independently. 
The experts will still be there for guidance and step in when the junior gets stuck after trying, but the experts should also learn to step down so that lesser experienced engineers can step up and learn. But mistakes can always happen here; that's why having a blameless on-call culture is essential.</span><br /> -<br /> -<span>A blameless on-call culture is a must for a safe and collaborative environment where engineers can effectively respond to incidents without fear of retribution. This approach acknowledges that mistakes are a natural part of the learning and innovation process. When individuals are assured they won't be punished for errors, they're more likely to openly discuss mistakes, allowing the entire team to learn and grow from each incident. Furthermore, a blameless culture promotes psychological safety, enhances job satisfaction, reduces burnout, and ensures that talent remains committed and engaged.</span><br /> -<br /> -<span>The fourth part of this blog series will be published soon :-)</span><br /> -<br /> -<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> -<br /> -<a class='textlink' href='../'>Back to the main site</a><br /> - </div> - </content> - </entry> - <entry> - <title>Site Reliability Engineering - Part 2: Operational Balance in SRE</title> - <link href="https://foo.zone/gemfeed/2023-08-19-site-reliability-engineering-part-2.html" /> - <id>https://foo.zone/gemfeed/2023-08-19-site-reliability-engineering-part-2.html</id> - <updated>2023-08-19T00:18:18+03:00</updated> - <author> - <name>Paul Buetow aka snonux</name> - <email>paul@dev.buetow.org</email> - </author> - <summary>This is the second part of my Site Reliability Engineering (SRE) series. 
I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> - <content type="xhtml"> - <div xmlns="http://www.w3.org/1999/xhtml"> - <h1 style='display: inline'>Site Reliability Engineering - Part 2: Operational Balance in SRE</h1><br /> -<br /> -<span class='quote'>Published at 2023-08-19T00:18:18+03:00</span><br /> -<br /> -<span>This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> -<br /> -<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> -<a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this)</a><br /> -<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> -<br /> -<pre> -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ -⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ -⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ -⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ -⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ -⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ -⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ -</pre> -<br /> -<h2 style='display: inline'>Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity</h2><br /> -<br /> -<span>Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. 
Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management.</span><br /> -<br /> -<span>In the universe of software production, two fundamental forces are often at odds: The drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a approach to mitigate these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms offer a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.</span><br /> -<br /> -<span>An important part of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding - 50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges. </span><br /> -<br /> -<span>However, not all operational tasks are equal. SRE differentiates between "ops work" and "toil". While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.</span><br /> -<br /> -<span>A cornerstone of achieving operational balance lies in the tools and processes SREs use. 
Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the balance between system stability and forward momentum.</span><br /> -<br /> -<span>Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed. </span><br /> -<br /> -<span>In conclusion, operational balance in SRE isn't static thing but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they have time for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.</span><br /> -<br /> -<span>That all sounds very romantic. The truth is, it's brutal to archive the perfect balance. No system will ever be perfect. 
But at least we should aim for it!</span><br /> -<br /> -<span>Continue with the third part of this series:</span><br /> -<br /> -<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> -<br /> -<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> -<br /> -<a class='textlink' href='../'>Back to the main site</a><br /> - </div> - </content> - </entry> - <entry> <title>Site Reliability Engineering - Part 1: SRE and Organizational Culture</title> <link href="https://foo.zone/gemfeed/2023-08-18-site-reliability-engineering-part-1.html" /> <id>https://foo.zone/gemfeed/2023-08-18-site-reliability-engineering-part-1.html</id> @@ -1316,18 +1314,18 @@ http://www.gnu.org/software/src-highlite --> <name>Paul Buetow aka snonux</name> <email>paul@dev.buetow.org</email> </author> - <summary>The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it's impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> + <summary>The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it's impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. 
I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</summary> <content type="xhtml"> <div xmlns="http://www.w3.org/1999/xhtml"> <h1 style='display: inline'>Site Reliability Engineering - Part 1: SRE and Organizational Culture</h1><br /> <br /> <span class='quote'>Published at 2023-08-18T22:43:47+03:00</span><br /> <br /> -<span>The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it's impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. I am currently employed as a Principal Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> +<span>The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it's impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. 
I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.</span><br /> <br /> <a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this)</a><br /> -<a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> -<a class='textlink' href='./2023-08-20-site-reliability-engineering-part-3.html'>2023-08-20 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> +<a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> +<a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Aspect</a><br /> <br /> <pre> ▓▓▓▓░░ @@ -1357,7 +1355,7 @@ DC on fire: <br /> <span>Another defining SRE idea concept the "error budget." This ingenious framework accepts that no system is flawless. Failures are inevitable. However, instead of being punitive, the culture here is to accept, learn, and iterate. By providing teams with a "budget" for errors, organisations create an environment where innovation is encouraged, and failures are viewed as learning opportunities.</span><br /> <br /> -<span>But SRE isn't just about technology and metrics; it's deeply human. It challenges the "hero culture" that plagues many IT teams. While individual heroics might occasionally save the day, a sustainable model requires collective expertise. An SRE culture recognises that heroes achieve their best within teams, negating the need for a hero-centric environment. 
This philosophy promotes a balanced on-call experience, emphasising the importance of trust, ownership, effective communication, and collaboration as cornerstones of team success. I personally have fallen into the hero trap, and know it's unsustainable to be the only go-to person for every problem.</span><br /> +<span>But SRE isn't just about technology and metrics; it's also human. It challenges the "hero culture" that plagues many IT teams. While individual heroics might occasionally save the day, a sustainable model requires collective expertise. An SRE culture recognises that heroes achieve their best within teams, negating the need for a hero-centric environment. This philosophy promotes a balanced on-call experience, emphasising the importance of trust, ownership, effective communication, and collaboration as cornerstones of team success. I personally have fallen into the hero trap, and know it's unsustainable to be the only go-to person for every arising problem.</span><br /> <br /> <span>Additionally, the SRE model requires good documentation. However, it's essential ensuring that this documentation undergoes the same quality checks as code, reinforcing effective onboarding, training and communication.</span><br /> <br /> @@ -1373,7 +1371,7 @@ DC on fire: <br /> <span>Continue with the second part of this series:</span><br /> <br /> -<a class='textlink' href='./2023-08-19-site-reliability-engineering-part-2.html'>2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> +<a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE</a><br /> <br /> <span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> <br /> |
