From 5d5c00a184690b095cb04308c0402128b8b0d205 Mon Sep 17 00:00:00 2001
From: Paul Buetow
Date: Sat, 19 Aug 2023 00:21:24 +0300
Subject: Update content for gemtext

---
 ...3-08-18-site-reliability-engineering-part-1.gmi | 5 +-
 ...-18-site-reliability-engineering-part-1.gmi.tpl | 4 +-
 ...3-08-19-site-reliability-engineering-part-2.gmi | 47 ++++
 ...-19-site-reliability-engineering-part-2.gmi.tpl | 46 ++++
 .../DRAFT-site-reliability-engineering-part-2.gmi | 44 ----
 ...AFT-site-reliability-engineering-part-2.gmi.tpl | 44 ----
 gemfeed/atom.xml.tmp | 282 +++++---------------------
 gemfeed/index.gmi | 1 +
 8 files changed, 168 insertions(+), 305 deletions(-)
 create mode 100644 gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi
 create mode 100644 gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl
 delete mode 100644 gemfeed/DRAFT-site-reliability-engineering-part-2.gmi
 delete mode 100644 gemfeed/DRAFT-site-reliability-engineering-part-2.gmi.tpl
 (limited to 'gemfeed')

diff --git a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
index ce60a5e2..28007a5b 100644
--- a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
+++ b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
@@ -4,6 +4,7 @@
 The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it is impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.
+=> ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this) ``` @@ -48,7 +49,9 @@ In essence, the integration of SRE principles transcends technical practices. It Organisations with the implementation of SLIs, SLOs and error budgets are already advanced in their SRE journey. It takes a lot of communication, convincing, and patience until that point is reached. -The next entry of this blog series will be published soon :-) +Continue with the second part of this series: + +=> ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi.tpl b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi.tpl index 29d9efad..b5bfd6a0 100644 --- a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi.tpl +++ b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi.tpl @@ -48,7 +48,9 @@ In essence, the integration of SRE principles transcends technical practices. It Organisations with the implementation of SLIs, SLOs and error budgets are already advanced in their SRE journey. It takes a lot of communication, convincing, and patience until that point is reached. 
-The next entry of this blog series will be published soon :-) +Continue with the second part of this series: + +<< template::inline::index site-reliability-engineering-part-2 E-Mail your comments to paul at buetow.org :-) diff --git a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi new file mode 100644 index 00000000..fca15271 --- /dev/null +++ b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi @@ -0,0 +1,47 @@ +# Site Reliability Engineering - Part 2: Operational Balance in SRE + +> Published at 2023-08-19T00:18:18+03:00 + +This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series. + +=> ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this) +=> ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture + +``` +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ +⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ +⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ +⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ +⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ +⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ +⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ +``` + +## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity + +Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management. 
+
+In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.
+
+A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges.
+
+However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
+
+A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
+
+Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
+
+In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
+
+That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!
+
+The third part of this blog series will be published soon :-)
+
+E-Mail your comments to paul at buetow.org :-)
+
+=> ../ Back to the main site
diff --git a/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl
new file mode 100644
index 00000000..7bac0257
--- /dev/null
+++ b/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi.tpl
@@ -0,0 +1,46 @@
+# Site Reliability Engineering - Part 2: Operational Balance in SRE
+
+> Published at 2023-08-19T00:18:18+03:00
+
+This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.
+ +<< template::inline::index site-reliability-engineering-part + +``` +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ +⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ +⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ +⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ +⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ +⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ +⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ +⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ +``` + +## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity + +Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management. + +In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability. + +A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges. 
+
+However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
+
+A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
+
+Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
+
+In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
+
+That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!
+ +The third part of this blog series will be published soon :-) + +E-Mail your comments to paul at buetow.org :-) + +=> ../ Back to the main site diff --git a/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi b/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi deleted file mode 100644 index 7aa607ad..00000000 --- a/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi +++ /dev/null @@ -1,44 +0,0 @@ -# Site Reliability Engineering - Part 2: Operational Balance in SRE - -This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series. - -=> ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture - -``` -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ -⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ -⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ -⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ -⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ -⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ -⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ -``` - -## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity - -Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management. - -In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. 
These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.
-
-A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges.
-
-However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
-
-A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
-
-Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
-
-In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
-
-That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!
-
-The next entry of this blog series will be published soon :-)
-
-E-Mail your comments to paul at buetow.org :-)
-
-=> ../ Back to the main site
diff --git a/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi.tpl b/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi.tpl
deleted file mode 100644
index 9bc85270..00000000
--- a/gemfeed/DRAFT-site-reliability-engineering-part-2.gmi.tpl
+++ /dev/null
@@ -1,44 +0,0 @@
-# Site Reliability Engineering - Part 2: Operational Balance in SRE
-
-This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.
- -<< template::inline::index site-reliability-engineering-part - -``` -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀ -⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀ -⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀ -⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀ -⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀ -⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀ -⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ -⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀ -``` - -## Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity - -Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management. - -In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability. - -A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges. 
-
-However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
-
-A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
-
-Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
-
-In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
-
-That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!
- -The next entry of this blog series will be published soon :-) - -E-Mail your comments to paul at buetow.org :-) - -=> ../ Back to the main site diff --git a/gemfeed/atom.xml.tmp b/gemfeed/atom.xml.tmp index d42ca52e..1bd29d00 100644 --- a/gemfeed/atom.xml.tmp +++ b/gemfeed/atom.xml.tmp @@ -1,11 +1,73 @@ - 2023-08-18T23:12:15+03:00 + 2023-08-19T00:21:09+03:00 foo.zone feed To be in the .zone! gemini://foo.zone/ + + Site Reliability Engineering - Part 2: Operational Balance in SRE + + gemini://foo.zone/gemfeed/2023-08-19-site-reliability-engineering-part-2.gmi + 2023-08-19T00:18:18+03:00 + + Paul Buetow aka snonux + paul@dev.buetow.org + + This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series. + +
+

Site Reliability Engineering - Part 2: Operational Balance in SRE


+
+Published at 2023-08-19T00:18:18+03:00
+
+This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.
+
+2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this)
+2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture
+
+
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⣾⠿⠿⠿⠶⠾⠿⠿⣿⣿⣿⣿⣿⣿⠿⠿⠶⠶⠿⠿⠿⣷⠀⠀⠀⠀
+⠀⠀⠀⣸⢿⣆⠀⠀⠀⠀⠀⠀⠀⠙⢿⡿⠉⠀⠀⠀⠀⠀⠀⠀⣸⣿⡆⠀⠀⠀
+⠀⠀⢠⡟⠀⢻⣆⠀⠀⠀⠀⠀⠀⠀⣾⣧⠀⠀⠀⠀⠀⠀⠀⣰⡟⠀⢻⡄⠀⠀
+⠀⢀⣾⠃⠀⠀⢿⡄⠀⠀⠀⠀⠀⢠⣿⣿⡀⠀⠀⠀⠀⠀⢠⡿⠀⠀⠘⣷⡀⠀
+⠀⣼⣏⣀⣀⣀⣈⣿⡀⠀⠀⠀⠀⣸⣿⣿⡇⠀⠀⠀⠀⢀⣿⣃⣀⣀⣀⣸⣧⠀
+⠀⢻⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⣿⣿⣿⣿⠀⠀⠀⠀⠈⢿⣿⣿⣿⣿⣿⡿⠀
+⠀⠀⠉⠛⠛⠛⠋⠁⠀⠀⠀⠀⢸⣿⣿⣿⣿⡆⠀⠀⠀⠀⠈⠙⠛⠛⠛⠉⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣿⣷⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⣿⣿⣿⣿⣿⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
+⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀
+
+
+

Operational Balance in SRE: Finding the Equilibrium in Reliability and Velocity


+
+Site Reliability Engineering has established itself as more than just a set of best practices or methodologies. Instead, it stands as a beacon of operational excellence, which guides engineering teams through the turbulent waters of modern software development and system management.
+
+In the universe of software production, two fundamental forces are often at odds: the drive for rapid feature release (velocity) and the need for system reliability. Traditionally, the faster teams moved, the more risk was introduced into systems. SRE offers a profound approach to reconciling these conflicting drives through concepts like error budgets and SLIs/SLOs. These mechanisms provide a tangible metric, allowing teams to quantify how much they can push changes while ensuring they don't compromise system health. Thus, the error budget becomes a balancing act, where teams weigh the trade-offs between innovation and reliability.
+
+A quintessential component of this balance is the dichotomy between operations and coding. According to SRE principles, an engineer should ideally spend an equal amount of time on operations work and coding—50% on each. This isn't just a random metric; it's a reflection of the value SRE places on both maintaining operational excellence and progressing forward with innovations. This balance ensures that while SREs are solving today's problems, they are also preparing for tomorrow's challenges.
+
+However, not all operational tasks are equal. SRE differentiates between 'ops work' and 'toil'. While ops work is integral to system maintenance and can provide value, toil represents repetitive, mundane tasks which offer little value in the long run. Recognising and minimising toil is crucial. A culture that allows engineers to drown in toil stifles innovation and growth. Hence, an organisation's approach to toil indicates its operational health and commitment to balance.
+
+A cornerstone of achieving operational balance lies in the tools and processes SREs use. Effective monitoring, observability tools, and ensuring that tools can handle high cardinality data are foundational. These aren't just technical requisites but reflective of an organisational culture prioritising proactive problem-solving. By having systems that effectively flag potential issues before they escalate, SREs can maintain the delicate balance between system stability and forward momentum.
+
+Moreover, operational balance isn't just a technological or process challenge; it's a human one. The health of on-call engineers is as crucial as the health of the services they manage. On-call postmortems, continuous feedback loops, and recognising gaps (be it tooling, operational expertise, or resources) ensure that the human elements of operations are noticed.
+
+In conclusion, operational balance in SRE is not a static goalpost but an ongoing journey. It requires organisations to constantly evaluate their practices, tools, and, most importantly, their culture. By achieving this balance, organisations can ensure that they are poised for innovation while maintaining the robustness and reliability of their systems, resulting in sustainable long-term success.
+
+That all sounds very romantic. The truth is, it is brutal to archive the perfect balance. No system will ever be perfect. But at least we should aim for it!
+
+The third part of this blog series will be published soon :-)
+
+E-Mail your comments to paul at buetow.org :-)
+
+Back to the main site
+
+
+
Site Reliability Engineering - Part 1: SRE and Organizational Culture @@ -24,6 +86,7 @@
The universe of Site Reliability Engineering (SRE) is like an intricate tapestry woven with diverse technology, culture, and personal grit threads. Site Reliability Engineering is one of the most demanding jobs. With all the facets, it is impossible to get bored. There is always a new challenge to master, and there is always a new technology to tinker with. It's not just technical; it's also about communication, collaboration and teamwork. I am currently employed as a Principal Site Reliability Engineer and will attempt to share what SRE is about in this blog series.

+2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE
2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this)

@@ -68,7 +131,9 @@ DC on fire:
 
Organisations with the implementation of SLIs, SLOs and error budgets are already advanced in their SRE journey. It takes a lot of communication, convincing, and patience until that point is reached.

-The next entry of this blog series will be published soon :-)
+Continue with the second part of this series:
+
+2023-08-19 Site Reliability Engineering - Part 2: Operational Balance in SRE

E-Mail your comments to paul at buetow.org :-)

@@ -8482,219 +8547,6 @@ Notice: Finished catalog run in 206.09 seconds
E-Mail your comments to paul at buetow.org :-)

-Back to the main site
- - - - - Run Debian on your phone with Debroid - - gemini://foo.zone/gemfeed/2015-12-05-run-debian-on-your-phone-with-debroid.gmi - 2015-12-05T16:12:57+00:00 - - Paul Buetow aka snonux - paul@dev.buetow.org - - You can use the following tutorial to install a full-blown Debian GNU/Linux Chroot on an LG G3 D855 CyanogenMod 13 (Android 6). First of all, you need to have root permissions on your phone, and you also need to have the developer mode activated. The following steps have been tested on Linux (Fedora 23). - -
-

Run Debian on your phone with Debroid


-
-Published at 2015-12-05T16:12:57+00:00; Updated at 2021-05-16
-
-
- ____       _               _     _ 
-|  _ \  ___| |__  _ __ ___ (_) __| |
-| | | |/ _ \ '_ \| '__/ _ \| |/ _` |
-| |_| |  __/ |_) | | | (_) | | (_| |
-|____/ \___|_.__/|_|  \___/|_|\__,_|
-                                    
-
-
-You can use the following tutorial to install a full-blown Debian GNU/Linux Chroot on an LG G3 D855 CyanogenMod 13 (Android 6). First of all, you need to have root permissions on your phone, and you also need to have the developer mode activated. The following steps have been tested on Linux (Fedora 23).
-
-
-
-

Foreword


-
-A couple of years have passed since I last worked on Debroid. Currently, I am using the Termux app on Android, which is less sophisticated than a fully blown Debian installation but sufficient for my current requirements. The content of this site may be still relevant, and it would also work with more recent versions of Debian and Android. I would expect that some minor modifications need to be made, though.
-
-

Step by step guide


-
-All scripts mentioned here can be found on GitHub at:
-
-https://codeberg.org/snonux/debroid
-
-

First debootstrap stage


-
-This is to be performed on a Fedora Linux machine (could work on a Debian too, but Fedora is just what I use on my Laptop). The following steps prepare an initial Debian base image, which can then be transferred to the phone.
-
- -
sudo dnf install debootstrap
-# 5g
-dd if=/dev/zero of=jessie.img bs=$[ 1024 * 1024 ] \
-  count=$[ 1024 * 5 ]
-
-# Show used loop devices
-sudo losetup -f
-# Store the next free one to $loop
-loop=loopN
-sudo losetup /dev/$loop jessie.img
-
-mkdir jessie
-sudo mkfs.ext4 /dev/$loop
-sudo mount /dev/$loop jessie
-sudo debootstrap --foreign --variant=minbase \
-  --arch armel jessie jessie/ \
-  http://http.debian.net/debian
-sudo umount jessie
-
-
-

Copy Debian image to the phone


-
-Now setup the Debian image on an external SD card on the Phone via Android Debugger as follows:
-
- -
adb root && adb wait-for-device && adb shell
-mkdir -p /storage/sdcard1/Linux/jessie
-exit
-
-# Sparse image problem, may be too big for copying otherwise
-gzip jessie.img
-# Copy over
-adb push jessie.img.gz /storage/sdcard1/Linux/jessie.img.gz
-adb shell
-cd /storage/sdcard1/Linux
-gunzip jessie.img.gz
-
-# Show used loop devices
-losetup -f
-# Store the next free one to $loop
-loop=loopN
-
-# Use the next free one (replace the loop number)
-losetup /dev/block/$loop $(pwd)/jessie.img
-mount -t ext4 /dev/block/$loop $(pwd)/jessie
-
-# Bind-Mound proc, dev, sys`
-busybox mount --bind /proc $(pwd)/jessie/proc
-busybox mount --bind /dev $(pwd)/jessie/dev
-busybox mount --bind /dev/pts $(pwd)/jessie/dev/pts
-busybox mount --bind /sys $(pwd)/jessie/sys
-
-# Bind-Mound the rest of Android
-mkdir -p $(pwd)/jessie/storage/sdcard{0,1}
-busybox mount --bind /storage/emulated \
-  $(pwd)/jessie/storage/sdcard0
-busybox mount --bind /storage/sdcard1 \
-  $(pwd)/jessie/storage/sdcard1
-
-# Check mounts
-mount | grep jessie
-
-
-

Second debootstrap stage


-
-This is to be performed on the Android phone itself (inside a Debian chroot):
-
- -
chroot $(pwd)/jessie /bin/bash -l
-export PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin
-/debootstrap/debootstrap --second-stage
-exit # Leave chroot
-exit # Leave adb shell
-
-
-

Setup of various scripts


-
-jessie.sh deals with all the loopback mount magic and so on. It will be run later every time you start Debroid on your phone.
-
- -
# Install script jessie.sh
-adb push storage/sdcard1/Linux/jessie.sh /storage/sdcard/Linux/jessie.sh
-adb shell
-cd /storage/sdcard1/Linux
-sh jessie.sh enter
-
-# Bashrc
-cat <<END >~/.bashrc
-export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH
-export EDITOR=vim
-hostname $(cat /etc/hostname)
-END
-
-# Fixing an error message while loading the profile
-sed -i s#id#/usr/bin/id# /etc/profile
-
-# Setting the hostname
-echo phobos > /etc/hostname
-echo 127.0.0.1 phobos > /etc/hosts
-hostname phobos
-
-# Apt-sources
-cat <<END > sources.list
-deb http://ftp.uk.debian.org/debian/ jessie main contrib non-free
-deb-src http://ftp.uk.debian.org/debian/ jessie main contrib non-free
-END
-apt-get update
-apt-get upgrade
-apt-get dist-upgrade
-exit # Exit chroot
-
-
-

Entering Debroid and enable a service


-
-This enters Debroid on your phone and starts the example service uptimed:
-
- -
sh jessie.sh enter
-
-# Setup example serice uptimed
-apt-get install uptimed
-cat <<END > /etc/rc.debroid
-export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$PATH
-service uptimed status &>/dev/null || service uptimed start
-exit 0
-END
-
-chmod 0755 /etc/rc.debroid
-exit # Exit chroot
-exit # Exit adb shell
-
-
-

Include to Android startup:


-
-If you want to start Debroid automatically whenever your phone starts, then do the following:
-
- -
adb push data/local/userinit.sh /data/local/userinit.sh
-adb shell
-chmod +x /data/local/userinit.sh
-exit
-
-
-Reboot & test! Enjoy!
-
-E-Mail your comments to paul at buetow.org :-)
-
Back to the main site
diff --git a/gemfeed/index.gmi b/gemfeed/index.gmi
index b41f3397..df3f9e13 100644
--- a/gemfeed/index.gmi
+++ b/gemfeed/index.gmi
@@ -2,6 +2,7 @@
 ## To be in the .zone!
+=> ./2023-08-19-site-reliability-engineering-part-2.gmi 2023-08-19 - Site Reliability Engineering - Part 2: Operational Balance in SRE
 => ./2023-08-18-site-reliability-engineering-part-1.gmi 2023-08-18 - Site Reliability Engineering - Part 1: SRE and Organizational Culture
 => ./2023-07-21-gemtexter-2.1.0-lets-gemtext-again-3.gmi 2023-07-21 - Gemtexter 2.1.0 - Let's Gemtext again³
 => ./2023-07-17-career-guide-and-soft-skills-book-notes.gmi 2023-07-17 - 'Software Developmers Career Guide & Soft Skills' book notes
--
cgit v1.2.3