author: Paul Buetow <paul@buetow.org> 2023-08-18 23:12:28 +0300
committer: Paul Buetow <paul@buetow.org> 2023-08-18 23:12:28 +0300
commit: 826f0ac2b011f2b3eee91475fba63a0fc70168bc
tree: 91534cd055959a2d0bfc9bfab9ff2ffbc6bcd383 /gemfeed
parent: 67d7123b1b4add41613831ca633fd04635d8f88e
Update content for md
Diffstat (limited to 'gemfeed')
-rw-r--r-- gemfeed/DRAFT-site-reliability-engineering-part-2.md 44
-rw-r--r-- gemfeed/DRAFT-site-reliability-engineering.md 16
2 files changed, 44 insertions, 16 deletions
diff --git a/gemfeed/DRAFT-site-reliability-engineering-part-2.md b/gemfeed/DRAFT-site-reliability-engineering-part-2.md
new file mode 100644
index 00000000..147ed440
--- /dev/null
+++ b/gemfeed/DRAFT-site-reliability-engineering-part-2.md
@@ -0,0 +1,44 @@
+# Site Reliability Engineering - Part 2: Operational Balance in SRE
+
+This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.
+
+[2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture](./2023-08-18-site-reliability-engineering-part-1.md)
+
+```
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀
+⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀
+⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀
+⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀
+⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀
+⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀
+⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀
+```
+
+## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity
+
+Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management.
+
+In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.
+
+A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges.
+
+However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
+
+A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tooling, and the ability to handle high-cardinality data are foundational. These aren't just technical requisites; they reflect an organisational culture that prioritises proactive problem-solving. By having systems that flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
+
+Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it in tooling, operational expertise, or resources) ensure that the human elements of operations are not overlooked.
+
+In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
+
+That all sounds very romantic. The truth is, it is brutal to achieve the perfect balance. No system will ever be perfect. But at least we should aim for it!
+
+The next entry of this blog series will be published soon :-)
+
+E-Mail your comments to paul at buetow.org :-)
+
+[Back to the main site](../)
diff --git a/gemfeed/DRAFT-site-reliability-engineering.md b/gemfeed/DRAFT-site-reliability-engineering.md
index 5caa1c60..c6ec6076 100644
--- a/gemfeed/DRAFT-site-reliability-engineering.md
+++ b/gemfeed/DRAFT-site-reliability-engineering.md
@@ -1,19 +1,3 @@
-## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity
-
-Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management.
-
-In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.
-
-A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges.
-
-However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
-
-A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
-
-Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
-
-In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
-
## On-Call Culture and the Human Aspect: Prioritising Well-being in the Realm of Reliability
Site Reliability Engineering is synonymous with ensuring system reliability, but the human factor is an often-underestimated component of this discipline. It is evident that fostering a healthy on-call culture is as critical as any technical solution. In the world of constant alerts, pages, and incident management, the well-being of the engineers becomes paramount.