Diffstat (limited to 'gemfeed')
| -rw-r--r-- | gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi | 1 |
| -rw-r--r-- | gemfeed/2023-11-19-site-reliability-engineering-part-2.gmi | 1 |
| -rw-r--r-- | gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi | 1 |
| -rw-r--r-- | gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi | 7 |
| -rw-r--r-- | gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi.tpl | 6 |
| -rw-r--r-- | gemfeed/DRAFT-site-reliability-engineering.gmi | 152 |
| -rw-r--r-- | gemfeed/DRAFT-site-reliability-engineering.gmi.tpl | 152 |
| -rw-r--r-- | gemfeed/atom.xml | 419 |
| -rw-r--r-- | gemfeed/index.gmi | 2 |
9 files changed, 106 insertions, 635 deletions
diff --git a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
index 91b1b0d9..c5097463 100644
--- a/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
+++ b/gemfeed/2023-08-18-site-reliability-engineering-part-1.gmi
@@ -8,6 +8,7 @@ Being a Site Reliability Engineer (SRE) is like stepping into a lively, ever-evo
 => ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance
 => ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture
 => ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning

 ```
 ▓▓▓▓░░
diff --git a/gemfeed/2023-11-19-site-reliability-engineering-part-2.gmi b/gemfeed/2023-11-19-site-reliability-engineering-part-2.gmi
index abb1255d..5864ffe3 100644
--- a/gemfeed/2023-11-19-site-reliability-engineering-part-2.gmi
+++ b/gemfeed/2023-11-19-site-reliability-engineering-part-2.gmi
@@ -8,6 +8,7 @@ This is the second part of my Site Reliability Engineering (SRE) series. I am cu
 => ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance (You are currently reading this)
 => ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture
 => ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning

 ```
 ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
diff --git a/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi b/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
index 609c6e58..4bc5ded0 100644
--- a/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
+++ b/gemfeed/2024-01-09-site-reliability-engineering-part-3.gmi
@@ -8,6 +8,7 @@ Welcome to Part 3 of my Site Reliability Engineering (SRE) series. I'm currently
 => ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance
 => ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture (You are currently reading this)
 => ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning

 ```
 ..--""""----..
diff --git a/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi b/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi
index 4500e7bb..71b22cd5 100644
--- a/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi
+++ b/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi
@@ -8,6 +8,7 @@ Welcome to Part 4 of my Site Reliability Engineering (SRE) series. I'm currently
 => ./2023-11-19-site-reliability-engineering-part-2.gmi 2023-11-19 Site Reliability Engineering - Part 2: Operational Balance
 => ./2024-01-09-site-reliability-engineering-part-3.gmi 2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture
 => ./2024-09-07-site-reliability-engineering-part-4.gmi 2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers (You are currently reading this)
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning

 ```
 __..._ _...__
@@ -68,4 +69,10 @@ By structuring the onboarding process with KT sessions, shadowing, comprehensive

 If you're looking to optimize your on-call onboarding process, these strategies could be your ticket to a more efficient and effective transition. Happy on-calling!

+Continue with the fifth part of this series:
+
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning
+
+E-Mail your comments to `paul@nospam.buetow.org` :-)
+
 => ../ Back to the main site
diff --git a/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi.tpl b/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi.tpl
index b07e94fa..b0a5a28b 100644
--- a/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi.tpl
+++ b/gemfeed/2024-09-07-site-reliability-engineering-part-4.gmi.tpl
@@ -65,4 +65,10 @@ By structuring the onboarding process with KT sessions, shadowing, comprehensive

 If you're looking to optimize your on-call onboarding process, these strategies could be your ticket to a more efficient and effective transition. Happy on-calling!
+Continue with the fifth part of this series:
+
+<< template::inline::index site-reliability-engineering-part-5
+
+E-Mail your comments to `paul@nospam.buetow.org` :-)
+
 => ../ Back to the main site
diff --git a/gemfeed/DRAFT-site-reliability-engineering.gmi b/gemfeed/DRAFT-site-reliability-engineering.gmi
deleted file mode 100644
index be26119e..00000000
--- a/gemfeed/DRAFT-site-reliability-engineering.gmi
+++ /dev/null
@@ -1,152 +0,0 @@
-## System Design and Incident Analysis: Building Resilience in the SRE Landscape
-
-A significant portion of the work revolves around system design and incident analysis.
-
-The first axiom is the acceptance of a bitter truth: things will always break. No matter the precision of which a system is crafted, the inevitability of failures looms large. However, what distinguishes a well-designed system from a mediocre one is its ability to minimise and contain cascading failures. These failures, if left unchecked, can spiral into global outages with come with consequences.
-
-There's a growing emphasis on building resilient systems to avoid such cascading failures to circumvent this. Such resilience requires foresight in system design, wherein potential weakpoints are identified and addressed before deployed to production. Prevention is better than cure. The primary objective is ensuring that services remain uninterrupted and dependable.
-
-Yet, despite these preventative measures, when incidents do arise, their analysis becomes a goldmine of learning. Every incident exposes gaps within the system. Instead of attributing these incidents to nebulous concepts like "human error," the onus is on dissecting them to uncover underlying systemic issues. Whether it's a tooling gap where operational tools prove insufficient or an operational expertise gap where engineers lack critical skills, incident analysis shines a light on these deficiencies.
-
-In doing so, incident analysis is about rectifying the immediate issue and learning and evolving the system design. Every incident offers an opportunity, a feedback loop, to refine the system further. Through rigorous postmortems focusing on customer impact, organisations can distil valuable lessons. These lessons, when incorporated, make the system more robust and less susceptible to similar failures in the future.
-
-Moreover, as systems grow more complex, the importance of observability tools cannot be overstated. These tools, designed to query against high cardinality data, provide granular insights into system operations. They enable engineers to diagnose problems rapidly, especially in the chaotic aftermath of an incident, giving clarity amidst the turmoil.
-
-In conclusion, the symbiotic relationship between system design and incident analysis underscores the evolving ethos of SRE. While impeccable system design lays the foundation for reliable operations, incident analysis ensures that this foundation remains robust and dynamic, adapting to challenges. Together, they form the pillars of a resilient, customer-centric service environment that stands the test of time.
-
-Add paragraph about product wants features, but observability is often an afterthought. So often, during an incident, people start agreeing, and then it was already too late.
-
-=> add 6 minutes to wt.
-
-## The Heroic Facade and Team Dynamics: Rethinking Success in SRE
-
-The realm of Site Reliability Engineering is punctuated by the constant ebb and flow of system challenges. While individual excellence is commendable, the overarching belief in the SRE culture should be that true success lies in cohesive teamwork and not in individual heroics.
-
-he SRE Hero is an anti-pattern that can occur when a few individuals consistently step in to save the day during incidents or emergencies, earning themselves the status of heroes. While this might seem positive at first, it can lead to several negative outcomes and should be addressed to ensure the reliability and sustainability of the SRE team's operations. These individuals might possess specialized knowledge, quick problem-solving skills, or simply a willingness to work long hours. As a result, they become the go-to people whenever something goes wrong.
-
-This culture can emerge for various reasons:
-
-- Immediate Problem Solving: Heroes are praised for their ability to solve issues quickly. However, this may lead to bypassing proper post-incident analysis and learning, as the focus is on getting systems up and running as fast as possible.
-
-- Burnout and Fatigue: Heroes are often overworked and stressed, leading to burnout and high turnover rates.
-
-- Skill Asymmetry: If only a few team members possess specific knowledge or skills, others may not have the chance to learn, grow, and take on more responsibilities.
-
-- Dependency: Teams become dependent on heroes, leading to a lack of collaboration and shared ownership of systems.
-
-How can you fix it?
-
-- Incident Reviews and Post-Mortems: Conduct thorough post-incident reviews to understand the root causes of issues. Focus on learning and prevention rather than just quick fixes.
-
-- Distribute Knowledge: Encourage knowledge sharing by documenting incidents, solutions, and best practices. Consider implementing a knowledge-sharing platform or wiki.
-
-- Rotating Responsibilities: Rotate on-call and incident response responsibilities among team members. This prevents burnout and ensures that everyone gains experience.
-
-- Automation and Tooling: Develop automation and tools that enable the entire team to handle incidents more effectively, reducing the reliance on individual heroics.
-
-- Training and Skill Development: Provide training and resources to help all team members enhance their skills. This levels the playing field and reduces skill asymmetry.
-
-- Recognize Collaborative Efforts: Shift the focus from individual heroics to collaborative efforts. Recognize and reward team members who contribute to preventive measures, incident response improvements, and system stability.
-
-- Leadership Support: Management should actively support efforts to address the hero culture. This might involve setting expectations for collaboration, learning, and shared responsibility.
-
-- Celebrate Learning: Emphasize that learning from failures is a positive outcome. This encourages a culture of continuous improvement rather than blame.
-
-By addressing the hero culture and fostering a collaborative, learning-oriented environment, SRE teams can enhance their overall effectiveness, prevent burnout, and ensure the long-term stability of the systems they manage.
-
-
-
-
-The allure of the "hero" is undeniable. There's a certain appeal in being the one who swoops in, fixes critical incidents, and saves the day. However, this hero culture, while often romanticised, has its pitfalls. Heroes are necessary, no doubt, but a hero culture can often obscure the collaborative essence of SRE. Recognising that heroes do their best work as part of a team is a profound acknowledgement that true heroes don't need a hero culture to excel.
-
-The danger of a hero-driven approach is that it can lead to an over-reliance on specific individuals. The assumption that certain team members will always be there to address and mitigate issues can be a dangerous precedent. It fosters a reactive culture rather than a proactive one. Instead of developing inherently more resilient and reliable systems, the organisation starts relying on these heroes as a Band-Aid® solution, masking deeper systemic problems.
-
-A further dimension to this issue is the impact on team morale. Continually being in the spotlight, heroes might be inadvertently sidelining other team members, leading to feelings of underappreciation or undervaluation. Such a dynamic can hinder sharing knowledge, collaboration, and preparation – the pillars that successful SRE teams are built on.
-
-However, this isn't to say that individual excellence should be curbed. Instead, it's about shifting the narrative. Building a team culture based on collaboration ensures that knowledge sharing becomes second nature. Such an environment propels teams towards a dynamic where preparation and proactive measures are valued over-reactive heroics. When success stories are shared as a collective win, it boosts team morale and fosters a sense of shared responsibility.
-
-In the broader spectrum of SRE, it's also crucial to recognise the silent work – the preventive measures, the well-thought-out systems, the meticulous planning – that ensures incidents don't occur. This proactive approach often goes unnoticed because, in a well-functioning system, the absence of issues is the norm. But this 'silence' is a testament to a team working harmoniously, with every member contributing towards system reliability.
-
-To conclude, while the heroics in SRE can often be the stuff of legends, it's vital to see beyond this facade. The countless hours of teamwork, collaboration, and shared responsibility lie in the shadows of these heroic acts. The future of SRE lies not in individual heroics but in teams that operate like well-oiled machines, with every cog, big or small, playing its part to perfection.
-
-## Monitoring, Observability, and the SRE Arsenal: Navigating the Nuances of System Reliability
-
-Site Reliability Engineering is characterised by a relentless quest for reliability, uptime, and seamless user experiences. Within this universe, the notions of monitoring and observability emerge not as mere tools but as critical lifelines that guide decision-making, error diagnosis, and preventive strategies.
-
-At its core, monitoring is vigilantly monitoring system health, alerting engineers to potential anomalies that might adversely impact system performance or availability. Every alert is treated as an exceptional circumstance warranting immediate attention. However, it's worth noting that only some alerts translate into genuine threats. As such, if an alert merely adds noise without substance, the onus is on refining the monitoring system to filter out such distractions. This process of continuous refinement underscores the dynamism inherent in effective monitoring.
-
-In tandem with monitoring is the concept of observability. Beyond just knowing that something went wrong, observability equips engineers with the 'why.' It offers a deep dive into the system's intricate operations, allowing for a granular understanding of its behaviours. Observability tools designed to query against high cardinality data become the SRE's best allies in this endeavour. They help comprehensively diagnose problems, especially when conventional monitoring alerts might not capture the nuanced layers of an issue.
-
-However, monitoring and observability aren't standalone entities; they feed into the broader ambit of error budgets, service level objectives (SLOs), and service level indicators (SLIs). These metrics and frameworks collectively serve as a mirror, reflecting the true health of services. While SLIs define quantitative measures about the reliability of services, SLOs set targets for these measures. On the other hand, error budgets provide a tangible metric of 'how wrong things can go' before the service quality deteriorates below acceptable levels.
-
-Yet, the human element remains paramount amidst this arsenal of tools and methodologies. No matter how sophisticated, observational tools are only as valuable as the engineers wielding them. It demands a spirit of curiosity, a relentless quest for knowledge, and a willingness to delve deep into data-driven narratives. SREs, therefore, need to be both technically adept and intrinsically motivated to leverage these tools to their fullest potential.
-
-To sum it up, monitoring and observability play pivotal roles in the intricate dance of system reliability. They are the compass and map, guiding SREs through the labyrinthine challenges of modern systems. By leveraging them effectively and in conjunction with other SRE methodologies, organisations can achieve the zenith of reliability, ensuring that their services remain robust, resilient, and remarkably user-centric.
-
-## The Ever-evolving Landscape of SRE
-
-To begin, the very fabric of SRE is interwoven with organisational culture. Successful SRE adoption transcends the mere automation of software operations—it is deeply cultural. It demands a seismic shift in how organisations perceive failures, value preventative work, and prioritise communication. In such an environment, writing is not just a skill but a critical tool for reliability. Precise communication enhances clarity, mitigates risks, and facilitates collaboration.
-
-Central to SRE's operational philosophy is the balance between innovation and stability. Every system has its error budget, representing the acceptable threshold of issues before service quality falls below expectations. These error budgets are more than mere metrics—they guide decisions, helping organisations balance pushing new features and ensuring system reliability. Such operational nuances remind us that while things will inevitably break in the engineering world, the informed response, driven by data and proactive work, sets SRE apart.
-
-However, the brilliance of SRE is not merely in the systems but the people powering it. The human element in SRE is both its strength and vulnerability. On the one hand, SREs must be ceaselessly curious, ready to adopt new learnings, and willing to iterate rapidly. On the other, the high-stakes environment and demanding on-call rotations place them at risk of burnout. It's a stark reminder that while systems need monitoring, the well-being of those who maintain them is equally crucial. Organisations must ensure that on-call schedules are sustainable, mentorship is available, and continuous learning is encouraged.
-
-The SRE world is also marked by its vast arsenal of monitoring systems, observability tools, postmortems, and more. These tools, designed for high cardinality data querying, are pivotal in diagnosing problems, especially when traditional monitoring might miss the subtleties. Yet, tools alone aren't the panacea. The SRE's mindset, the ability to discern tooling gaps, operational expertise voids, and resource inadequacies truly elevates the discipline.
-
-In conclusion, as a discipline, SRE is a beacon of continuous evolution. As systems grow more complex and user expectations rise, the SRE landscape will inevitably shift, demanding adaptability, resilience, and foresight from its practitioners. But in this ever-changing terrain, the core tenets remain—balancing innovation with reliability, valuing human well-being, and leveraging tools and data for informed decision-making. In the grand tapestry of engineering, SRE stands out as a dynamic, challenging, yet immensely rewarding realm, ever-responsive to the rhythms of technology and human ingenuity.
-
-## Effective Communication and Collaboration in SRE
-
-Site Reliability Engineering is not merely a technical discipline. At its core, SRE underscores the importance of effective communication and collaboration as critical tenets of a resilient and efficient system.
-
-The dynamics of modern organisations, especially those heavily reliant on technology, present systems of such complexity that no single individual possesses a complete understanding. As highlighted from the insights, "Each person inside an organisation has only a partial understanding of how the overall system works." Such compartmentalisation necessitates a culture of open communication and collaboration to ensure that different components, managed by other teams, work in harmony.
-
-The importance of communication is not just limited to the intra-team dynamics but extends to how teams convey the value of their work, especially the preventive work that pre-empts potential incidents. As many SREs might attest, we live in a data-driven world, but capturing the metrics on incidents that didn't occur due to preventive measures is a challenge. This highlights the need for SREs to be adept at articulating the significance of their roles and their actions to ensure system reliability. It's about making the invisible work visible, ensuring stakeholders understand the value delivered.
-
-Further emphasising the role of communication, the insights suggest, "Writing is good for reliability; the more precise, the better." Precise communication, whether in documentation, runbooks, or postmortems, is essential for ensuring that every team member, whether an SRE or from an allied discipline, is on the same page. It mitigates the risk of misunderstandings that could compromise system reliability.
-
-On the other side of the coin is collaboration. An SRE's role frequently involves liaising with various teams, be it developers, back-end teams, or dedicated incident response teams. Effective collaboration with these teams is crucial in a crisis. When cascading failures occur and overload symptoms present simultaneously, this culture of collaboration can make the difference between swift mitigation and a full-blown global outage.
-
-Furthermore, the insights provide a perspective against fostering a hero culture. "Recognise that heroes do their best work as part of a team, and true heroes don't need a hero culture to do good." Such a sentiment emphasises the collective over the individual. It's a call to ensure team dynamics are built on mutual respect, trust, and a shared understanding of goals rather than relying on individual brilliance.
-
-In conclusion, while SRE is deeply technical, its efficacy is intertwined with the soft skills of communication and collaboration. As systems grow more intricate and the stakes rise, the ability to communicate clearly and collaborate effectively will distinguish successful SRE teams from the rest. It's a reminder that there are people at the heart of every machine, every line of code, and nurturing human connections is paramount to ensuring machine efficiency.
-
-## Inherent Curiosity and Continual Learning in SRE
-
-The realm of Site Reliability Engineering is expansive, dynamic, and deeply integrated with the ever-evolving technological landscape. It's evident that an essential trait underpinning successful SRE practice combines inherent curiosity and an unwavering commitment to continual learning.
-
-Within modern organisations, technology infrastructures have burgeoned into complex ecosystems. It's been highlighted that "each person inside an organisation has only a partial understanding of how the overall system works." In such an environment, an SRE cannot afford to be siloed or static in their knowledge. The intricacies of systems and the myriad of potential issues necessitate that SREs possess an innate curiosity. It's this curiosity that drives them to explore beyond their immediate purview, to question why systems behave the way they do, and to unravel the intricacies that lie beneath surface-level observations.
-
-Yet, curiosity alone isn't enough. The pace at which technology evolves is staggering. New tools emerge architectural paradigms shift, and what was once a best practice might become obsolete in a short span. To keep up with this dynamism, SREs need to be invested in continual learning. Whether mastering a new observability tool designed for high cardinality data or understanding the nuances of error budgets and their implications, SREs must be lifelong learners.
-
-This commitment to learning is about more than just keeping up-to-date with tools and practices. It's about broadening one's horizon and developing a holistic understanding of systems. As cascading failures emerge and system outages threaten, an SRE with a comprehensive knowledge base built on continual learning is better equipped to identify root causes, devise mitigation strategies, and ensure system resilience.
-
-Furthermore, as we glean from the insights, there's a marked shift in the perception of SRE as a discipline. We're transitioning into an era where "an SRE mindset will be an important hiring requirement for every engineering role." Such a shift implies that the principles of SRE are becoming fundamental to the broader engineering domain. And at the heart of this mindset is the thirst for knowledge and the spirit of exploration.
-
-In conclusion, the world of Site Reliability Engineering is not for the complacent. It's a domain that rewards the curious, the seekers, and those with an insatiable appetite for knowledge. As systems grow in complexity and the stakes become higher, this inherent curiosity and dedication to continual learning will define the success and resilience of SRE endeavours. The journey of an SRE, thus, is one of perpetual exploration, driven by the quest to know more and do better.
-
-## The Iterative Spirit of SRE
-
-Site Reliability Engineering is more than just a technical discipline; it embodies a mindset that embraces iteration, proactive problem-solving, and continuous enhancement.
-
-At the core of the SRE ethos lies the principle that prevention trumps cure. To build systems resilient to cascading failures and ensure that user impact is minimised, SREs work diligently to improve system designs. However, a crucial component of this prevention strategy is recognising that system designs will never be perfect. Instead, they are continually refined based on real-world performance, learnings from incidents, and shifting user needs. By leveraging tools like error budgets and performance metrics, SREs can gauge the effectiveness of their systems, identify areas of concern, and make informed decisions about where to allocate resources for improvements.
-
-Moreover, the SRE approach to incident analysis further underscores this iterative spirit. No matter how minor, every incident is viewed as an opportunity to learn. Incidents expose gaps, areas where the system's design or execution fell short. Through postmortems focusing on customer impact and detailed investigations, these gaps become learning avenues, leading to system refinements. The emphasis isn't on apportioning blame but on extracting insights that can fuel the next iteration of the system.
-
-In conjunction with system design, the tools and practices employed by SREs are also subject to this iterative refinement. Observability tools designed for high cardinality data, rollback automation, and failover tooling are all components of the SRE arsenal, but their effectiveness isn't taken for granted. SREs are consistently evaluating the efficacy of their tools, ensuring that they align with the current system demands and making enhancements as required. The idea is not to find the 'perfect' tool but to recognise that as systems evolve, the tools to manage them must evolve in tandem.
-
-Finally, the SRE's iterative spirit extends to collaboration and communication. The continual drive to enhance and refine is not a solitary endeavour. SREs actively collaborate with developers, back-end teams, and dedicated incident response units. This collaborative approach ensures that diverse perspectives contribute to the iterative process and collective wisdom is harnessed.
-
-In summary, the essence of Site Reliability Engineering is characterised by an iterative spirit, a recognition that perfection is a journey, not a destination. Whether refining system designs, enhancing tooling or fostering collaborative dialogues, SREs are always looking for the next improvement, refinement, and iteration. It's this spirit that ensures systems are reliable and continually evolving to meet the ever-changing demands of the digital age.
-
-## The role of simplicity Simplicity
-
-## Book tips
-
-* 97 Things Every SRE Should Know: Collective Wisdom from the Experts by Emily Stolarsky and Jaime Woo
-* Site Reliability Engineering: How Google runs Production Systems by by Jennifer Petoff, Niall Murphy, Betsy Beyer and Chris Jones
-* Implementing Service Level Objectives by Alex Hidalgo
-
-E-Mail your comments to `paul@nospam.buetow.org` :-)
-
-=> ../ Back to the main site
diff --git a/gemfeed/DRAFT-site-reliability-engineering.gmi.tpl b/gemfeed/DRAFT-site-reliability-engineering.gmi.tpl
deleted file mode 100644
index be26119e..00000000
--- a/gemfeed/DRAFT-site-reliability-engineering.gmi.tpl
+++ /dev/null
@@ -1,152 +0,0 @@
-## System Design and Incident Analysis: Building Resilience in the SRE Landscape
-
-A significant portion of the work revolves around system design and incident analysis.
-
-The first axiom is the acceptance of a bitter truth: things will always break. No matter the precision of which a system is crafted, the inevitability of failures looms large. However, what distinguishes a well-designed system from a mediocre one is its ability to minimise and contain cascading failures. These failures, if left unchecked, can spiral into global outages with come with consequences.
-
-There's a growing emphasis on building resilient systems to avoid such cascading failures to circumvent this. Such resilience requires foresight in system design, wherein potential weakpoints are identified and addressed before deployed to production. Prevention is better than cure. The primary objective is ensuring that services remain uninterrupted and dependable.
-
-Yet, despite these preventative measures, when incidents do arise, their analysis becomes a goldmine of learning. Every incident exposes gaps within the system. Instead of attributing these incidents to nebulous concepts like "human error," the onus is on dissecting them to uncover underlying systemic issues. Whether it's a tooling gap where operational tools prove insufficient or an operational expertise gap where engineers lack critical skills, incident analysis shines a light on these deficiencies.
-
-In doing so, incident analysis is about rectifying the immediate issue and learning and evolving the system design. Every incident offers an opportunity, a feedback loop, to refine the system further. Through rigorous postmortems focusing on customer impact, organisations can distil valuable lessons. These lessons, when incorporated, make the system more robust and less susceptible to similar failures in the future.
-
-Moreover, as systems grow more complex, the importance of observability tools cannot be overstated. These tools, designed to query against high cardinality data, provide granular insights into system operations. They enable engineers to diagnose problems rapidly, especially in the chaotic aftermath of an incident, giving clarity amidst the turmoil.
-
-In conclusion, the symbiotic relationship between system design and incident analysis underscores the evolving ethos of SRE. While impeccable system design lays the foundation for reliable operations, incident analysis ensures that this foundation remains robust and dynamic, adapting to challenges. Together, they form the pillars of a resilient, customer-centric service environment that stands the test of time.
-
-Add paragraph about product wants features, but observability is often an afterthought. So often, during an incident, people start agreeing, and then it was already too late.
- -=> add 6 minutes to wt. - -## The Heroic Facade and Team Dynamics: Rethinking Success in SRE - -The realm of Site Reliability Engineering is punctuated by the constant ebb and flow of system challenges. While individual excellence is commendable, the overarching belief in the SRE culture should be that true success lies in cohesive teamwork and not in individual heroics. - -The SRE Hero is an anti-pattern that can occur when a few individuals consistently step in to save the day during incidents or emergencies, earning themselves the status of heroes. While this might seem positive at first, it can lead to several negative outcomes and should be addressed to ensure the reliability and sustainability of the SRE team's operations. These individuals might possess specialized knowledge, quick problem-solving skills, or simply a willingness to work long hours. As a result, they become the go-to people whenever something goes wrong. - -This culture can emerge for various reasons: - -- Immediate Problem Solving: Heroes are praised for their ability to solve issues quickly. However, this may lead to bypassing proper post-incident analysis and learning, as the focus is on getting systems up and running as fast as possible. - -- Burnout and Fatigue: Heroes are often overworked and stressed, leading to burnout and high turnover rates. - -- Skill Asymmetry: If only a few team members possess specific knowledge or skills, others may not have the chance to learn, grow, and take on more responsibilities. - -- Dependency: Teams become dependent on heroes, leading to a lack of collaboration and shared ownership of systems. - -How can you fix it? - -- Incident Reviews and Post-Mortems: Conduct thorough post-incident reviews to understand the root causes of issues. Focus on learning and prevention rather than just quick fixes. - -- Distribute Knowledge: Encourage knowledge sharing by documenting incidents, solutions, and best practices. 
Consider implementing a knowledge-sharing platform or wiki. - -- Rotating Responsibilities: Rotate on-call and incident response responsibilities among team members. This prevents burnout and ensures that everyone gains experience. - -- Automation and Tooling: Develop automation and tools that enable the entire team to handle incidents more effectively, reducing the reliance on individual heroics. - -- Training and Skill Development: Provide training and resources to help all team members enhance their skills. This levels the playing field and reduces skill asymmetry. - -- Recognize Collaborative Efforts: Shift the focus from individual heroics to collaborative efforts. Recognize and reward team members who contribute to preventive measures, incident response improvements, and system stability. - -- Leadership Support: Management should actively support efforts to address the hero culture. This might involve setting expectations for collaboration, learning, and shared responsibility. - -- Celebrate Learning: Emphasize that learning from failures is a positive outcome. This encourages a culture of continuous improvement rather than blame. - -By addressing the hero culture and fostering a collaborative, learning-oriented environment, SRE teams can enhance their overall effectiveness, prevent burnout, and ensure the long-term stability of the systems they manage. - - - - -The allure of the "hero" is undeniable. There's a certain appeal in being the one who swoops in, fixes critical incidents, and saves the day. However, this hero culture, while often romanticised, has its pitfalls. Heroes are necessary, no doubt, but a hero culture can often obscure the collaborative essence of SRE. Recognising that heroes do their best work as part of a team is a profound acknowledgement that true heroes don't need a hero culture to excel. - -The danger of a hero-driven approach is that it can lead to an over-reliance on specific individuals. 
The assumption that certain team members will always be there to address and mitigate issues can be a dangerous precedent. It fosters a reactive culture rather than a proactive one. Instead of developing inherently more resilient and reliable systems, the organisation starts relying on these heroes as a Band-Aid® solution, masking deeper systemic problems. - -A further dimension to this issue is the impact on team morale. Continually being in the spotlight, heroes might be inadvertently sidelining other team members, leading to feelings of underappreciation or undervaluation. Such a dynamic can hinder sharing knowledge, collaboration, and preparation – the pillars that successful SRE teams are built on. - -However, this isn't to say that individual excellence should be curbed. Instead, it's about shifting the narrative. Building a team culture based on collaboration ensures that knowledge sharing becomes second nature. Such an environment propels teams towards a dynamic where preparation and proactive measures are valued over reactive heroics. When success stories are shared as a collective win, it boosts team morale and fosters a sense of shared responsibility. - -In the broader spectrum of SRE, it's also crucial to recognise the silent work – the preventive measures, the well-thought-out systems, the meticulous planning – that ensures incidents don't occur. This proactive approach often goes unnoticed because, in a well-functioning system, the absence of issues is the norm. But this 'silence' is a testament to a team working harmoniously, with every member contributing towards system reliability. - -To conclude, while the heroics in SRE can often be the stuff of legends, it's vital to see beyond this facade. The countless hours of teamwork, collaboration, and shared responsibility lie in the shadows of these heroic acts. 
The future of SRE lies not in individual heroics but in teams that operate like well-oiled machines, with every cog, big or small, playing its part to perfection. - -## Monitoring, Observability, and the SRE Arsenal: Navigating the Nuances of System Reliability - -Site Reliability Engineering is characterised by a relentless quest for reliability, uptime, and seamless user experiences. Within this universe, the notions of monitoring and observability emerge not as mere tools but as critical lifelines that guide decision-making, error diagnosis, and preventive strategies. - -At its core, monitoring means vigilantly watching system health, alerting engineers to potential anomalies that might adversely impact system performance or availability. Every alert is treated as an exceptional circumstance warranting immediate attention. However, it's worth noting that only some alerts translate into genuine threats. As such, if an alert merely adds noise without substance, the onus is on refining the monitoring system to filter out such distractions. This process of continuous refinement underscores the dynamism inherent in effective monitoring. - -In tandem with monitoring is the concept of observability. Beyond just knowing that something went wrong, observability equips engineers with the 'why.' It offers a deep dive into the system's intricate operations, allowing for a granular understanding of its behaviours. Observability tools designed to query against high cardinality data become the SRE's best allies in this endeavour. They help comprehensively diagnose problems, especially when conventional monitoring alerts might not capture the nuanced layers of an issue. - -However, monitoring and observability aren't standalone entities; they feed into the broader ambit of error budgets, service level objectives (SLOs), and service level indicators (SLIs). These metrics and frameworks collectively serve as a mirror, reflecting the true health of services. 
While SLIs define quantitative measures about the reliability of services, SLOs set targets for these measures. On the other hand, error budgets provide a tangible metric of 'how wrong things can go' before the service quality deteriorates below acceptable levels. - -Yet, the human element remains paramount amidst this arsenal of tools and methodologies. No matter how sophisticated, observational tools are only as valuable as the engineers wielding them. It demands a spirit of curiosity, a relentless quest for knowledge, and a willingness to delve deep into data-driven narratives. SREs, therefore, need to be both technically adept and intrinsically motivated to leverage these tools to their fullest potential. - -To sum it up, monitoring and observability play pivotal roles in the intricate dance of system reliability. They are the compass and map, guiding SREs through the labyrinthine challenges of modern systems. By leveraging them effectively and in conjunction with other SRE methodologies, organisations can achieve the zenith of reliability, ensuring that their services remain robust, resilient, and remarkably user-centric. - -## The Ever-evolving Landscape of SRE - -To begin, the very fabric of SRE is interwoven with organisational culture. Successful SRE adoption transcends the mere automation of software operations—it is deeply cultural. It demands a seismic shift in how organisations perceive failures, value preventative work, and prioritise communication. In such an environment, writing is not just a skill but a critical tool for reliability. Precise communication enhances clarity, mitigates risks, and facilitates collaboration. - -Central to SRE's operational philosophy is the balance between innovation and stability. Every system has its error budget, representing the acceptable threshold of issues before service quality falls below expectations. 
These error budgets are more than mere metrics—they guide decisions, helping organisations balance pushing new features and ensuring system reliability. Such operational nuances remind us that while things will inevitably break in the engineering world, the informed response, driven by data and proactive work, sets SRE apart. - -However, the brilliance of SRE is not merely in the systems but the people powering it. The human element in SRE is both its strength and vulnerability. On the one hand, SREs must be ceaselessly curious, ready to adopt new learnings, and willing to iterate rapidly. On the other, the high-stakes environment and demanding on-call rotations place them at risk of burnout. It's a stark reminder that while systems need monitoring, the well-being of those who maintain them is equally crucial. Organisations must ensure that on-call schedules are sustainable, mentorship is available, and continuous learning is encouraged. - -The SRE world is also marked by its vast arsenal of monitoring systems, observability tools, postmortems, and more. These tools, designed for high cardinality data querying, are pivotal in diagnosing problems, especially when traditional monitoring might miss the subtleties. Yet, tools alone aren't the panacea. The SRE's mindset, the ability to discern tooling gaps, operational expertise voids, and resource inadequacies truly elevates the discipline. - -In conclusion, as a discipline, SRE is a beacon of continuous evolution. As systems grow more complex and user expectations rise, the SRE landscape will inevitably shift, demanding adaptability, resilience, and foresight from its practitioners. But in this ever-changing terrain, the core tenets remain—balancing innovation with reliability, valuing human well-being, and leveraging tools and data for informed decision-making. 
In the grand tapestry of engineering, SRE stands out as a dynamic, challenging, yet immensely rewarding realm, ever-responsive to the rhythms of technology and human ingenuity. - -## Effective Communication and Collaboration in SRE - -Site Reliability Engineering is not merely a technical discipline. At its core, SRE underscores the importance of effective communication and collaboration as critical tenets of a resilient and efficient system. - -The dynamics of modern organisations, especially those heavily reliant on technology, present systems of such complexity that no single individual possesses a complete understanding. As highlighted from the insights, "Each person inside an organisation has only a partial understanding of how the overall system works." Such compartmentalisation necessitates a culture of open communication and collaboration to ensure that different components, managed by other teams, work in harmony. - -The importance of communication is not just limited to the intra-team dynamics but extends to how teams convey the value of their work, especially the preventive work that pre-empts potential incidents. As many SREs might attest, we live in a data-driven world, but capturing the metrics on incidents that didn't occur due to preventive measures is a challenge. This highlights the need for SREs to be adept at articulating the significance of their roles and their actions to ensure system reliability. It's about making the invisible work visible, ensuring stakeholders understand the value delivered. - -Further emphasising the role of communication, the insights suggest, "Writing is good for reliability; the more precise, the better." Precise communication, whether in documentation, runbooks, or postmortems, is essential for ensuring that every team member, whether an SRE or from an allied discipline, is on the same page. It mitigates the risk of misunderstandings that could compromise system reliability. 
- -On the other side of the coin is collaboration. An SRE's role frequently involves liaising with various teams, be it developers, back-end teams, or dedicated incident response teams. Effective collaboration with these teams is crucial in a crisis. When cascading failures occur and overload symptoms present simultaneously, this culture of collaboration can make the difference between swift mitigation and a full-blown global outage. - -Furthermore, the insights provide a perspective against fostering a hero culture. "Recognise that heroes do their best work as part of a team, and true heroes don't need a hero culture to do good." Such a sentiment emphasises the collective over the individual. It's a call to ensure team dynamics are built on mutual respect, trust, and a shared understanding of goals rather than relying on individual brilliance. - -In conclusion, while SRE is deeply technical, its efficacy is intertwined with the soft skills of communication and collaboration. As systems grow more intricate and the stakes rise, the ability to communicate clearly and collaborate effectively will distinguish successful SRE teams from the rest. It's a reminder that there are people at the heart of every machine, every line of code, and nurturing human connections is paramount to ensuring machine efficiency. - -## Inherent Curiosity and Continual Learning in SRE - -The realm of Site Reliability Engineering is expansive, dynamic, and deeply integrated with the ever-evolving technological landscape. It's evident that an essential trait underpinning successful SRE practice combines inherent curiosity and an unwavering commitment to continual learning. - -Within modern organisations, technology infrastructures have burgeoned into complex ecosystems. It's been highlighted that "each person inside an organisation has only a partial understanding of how the overall system works." In such an environment, an SRE cannot afford to be siloed or static in their knowledge. 
The intricacies of systems and the myriad of potential issues necessitate that SREs possess an innate curiosity. It's this curiosity that drives them to explore beyond their immediate purview, to question why systems behave the way they do, and to unravel the intricacies that lie beneath surface-level observations. - -Yet, curiosity alone isn't enough. The pace at which technology evolves is staggering. New tools emerge, architectural paradigms shift, and what was once a best practice might become obsolete in a short span. To keep up with this dynamism, SREs need to be invested in continual learning. Whether mastering a new observability tool designed for high cardinality data or understanding the nuances of error budgets and their implications, SREs must be lifelong learners. - -This commitment to learning is about more than just keeping up-to-date with tools and practices. It's about broadening one's horizon and developing a holistic understanding of systems. As cascading failures emerge and system outages threaten, an SRE with a comprehensive knowledge base built on continual learning is better equipped to identify root causes, devise mitigation strategies, and ensure system resilience. - -Furthermore, as we glean from the insights, there's a marked shift in the perception of SRE as a discipline. We're transitioning into an era where "an SRE mindset will be an important hiring requirement for every engineering role." Such a shift implies that the principles of SRE are becoming fundamental to the broader engineering domain. And at the heart of this mindset is the thirst for knowledge and the spirit of exploration. - -In conclusion, the world of Site Reliability Engineering is not for the complacent. It's a domain that rewards the curious, the seekers, and those with an insatiable appetite for knowledge. 
As systems grow in complexity and the stakes become higher, this inherent curiosity and dedication to continual learning will define the success and resilience of SRE endeavours. The journey of an SRE, thus, is one of perpetual exploration, driven by the quest to know more and do better. - -## The Iterative Spirit of SRE - -Site Reliability Engineering is more than just a technical discipline; it embodies a mindset that embraces iteration, proactive problem-solving, and continuous enhancement. - -At the core of the SRE ethos lies the principle that prevention trumps cure. To build systems resilient to cascading failures and ensure that user impact is minimised, SREs work diligently to improve system designs. However, a crucial component of this prevention strategy is recognising that system designs will never be perfect. Instead, they are continually refined based on real-world performance, learnings from incidents, and shifting user needs. By leveraging tools like error budgets and performance metrics, SREs can gauge the effectiveness of their systems, identify areas of concern, and make informed decisions about where to allocate resources for improvements. - -Moreover, the SRE approach to incident analysis further underscores this iterative spirit. No matter how minor, every incident is viewed as an opportunity to learn. Incidents expose gaps, areas where the system's design or execution fell short. Through postmortems focusing on customer impact and detailed investigations, these gaps become learning avenues, leading to system refinements. The emphasis isn't on apportioning blame but on extracting insights that can fuel the next iteration of the system. - -In conjunction with system design, the tools and practices employed by SREs are also subject to this iterative refinement. Observability tools designed for high cardinality data, rollback automation, and failover tooling are all components of the SRE arsenal, but their effectiveness isn't taken for granted. 
SREs are consistently evaluating the efficacy of their tools, ensuring that they align with the current system demands and making enhancements as required. The idea is not to find the 'perfect' tool but to recognise that as systems evolve, the tools to manage them must evolve in tandem. - -Finally, the SRE's iterative spirit extends to collaboration and communication. The continual drive to enhance and refine is not a solitary endeavour. SREs actively collaborate with developers, back-end teams, and dedicated incident response units. This collaborative approach ensures that diverse perspectives contribute to the iterative process and collective wisdom is harnessed. - -In summary, the essence of Site Reliability Engineering is characterised by an iterative spirit, a recognition that perfection is a journey, not a destination. Whether refining system designs, enhancing tooling or fostering collaborative dialogues, SREs are always looking for the next improvement, refinement, and iteration. It's this spirit that ensures systems are reliable and continually evolving to meet the ever-changing demands of the digital age. 
- -## The role of simplicity Simplicity - -## Book tips - -* 97 Things Every SRE Should Know: Collective Wisdom from the Experts by Emily Stolarsky and Jaime Woo -* Site Reliability Engineering: How Google runs Production Systems by by Jennifer Petoff, Niall Murphy, Betsy Beyer and Chris Jones -* Implementing Service Level Objectives by Alex Hidalgo - -E-Mail your comments to `paul@nospam.buetow.org` :-) - -=> ../ Back to the main site diff --git a/gemfeed/atom.xml b/gemfeed/atom.xml index a4a22681..d63ddc19 100644 --- a/gemfeed/atom.xml +++ b/gemfeed/atom.xml @@ -1,12 +1,93 @@ <?xml version="1.0" encoding="utf-8"?> <feed xmlns="http://www.w3.org/2005/Atom"> - <updated>2026-02-22T08:00:29+02:00</updated> + <updated>2026-02-28T17:02:03+02:00</updated> <title>foo.zone feed</title> <subtitle>To be in the .zone!</subtitle> <link href="gemini://foo.zone/gemfeed/atom.xml" rel="self" /> <link href="gemini://foo.zone/" /> <id>gemini://foo.zone/</id> <entry> + <title>Site Reliability Engineering - Part 5: System Design, Incidents, and Learning</title> + <link href="gemini://foo.zone/gemfeed/2026-03-01-site-reliability-engineering-part-5.gmi" /> + <id>gemini://foo.zone/gemfeed/2026-03-01-site-reliability-engineering-part-5.gmi</id> + <updated>2026-03-01T12:00:00+02:00</updated> + <author> + <name>Paul Buetow aka snonux</name> + <email>paul@dev.buetow.org</email> + </author> + <summary>Welcome to Part 5 of my Site Reliability Engineering (SRE) series. 
I'm currently working as a Site Reliability Engineer, and I'm here to share what SRE is all about in this blog series.</summary> + <content type="xhtml"> + <div xmlns="http://www.w3.org/1999/xhtml"> + <h1 style='display: inline' id='site-reliability-engineering---part-5-system-design-incidents-and-learning'>Site Reliability Engineering - Part 5: System Design, Incidents, and Learning</h1><br /> +<br /> +<span class='quote'>Published at 2026-03-01T12:00:00+02:00</span><br /> +<br /> +<span>Welcome to Part 5 of my Site Reliability Engineering (SRE) series. I'm currently working as a Site Reliability Engineer, and I'm here to share what SRE is all about in this blog series.</span><br /> +<br /> +<a class='textlink' href='./2023-08-18-site-reliability-engineering-part-1.html'>2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture</a><br /> +<a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance</a><br /> +<a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture</a><br /> +<a class='textlink' href='./2024-09-07-site-reliability-engineering-part-4.html'>2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers</a><br /> +<a class='textlink' href='./2026-03-01-site-reliability-engineering-part-5.html'>2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning (You are currently reading this)</a><br /> +<br /> +<pre> + ___ + / \ resilience + | o | <---------- learning + \___/ +</pre> +<br /> +<span>This time I want to share some themes that build on what we've already covered: how system design and incident analysis fit together, why observability should not be an afterthought, and how a design‑improvement loop keeps systems getting better. 
Let's dive in!</span><br /> +<br /> +<h2 style='display: inline' id='table-of-contents'>Table of Contents</h2><br /> +<br /> +<ul> +<li><a href='#site-reliability-engineering---part-5-system-design-incidents-and-learning'>Site Reliability Engineering - Part 5: System Design, Incidents, and Learning</a></li> +<li>⇢ <a href='#system-design-and-incident-analysis'>System Design and Incident Analysis</a></li> +<li>⇢ ⇢ <a href='#resilience-and-cascading-failures'>Resilience and cascading failures</a></li> +<li>⇢ ⇢ <a href='#learning-from-incidents'>Learning from incidents</a></li> +<li>⇢ <a href='#observability-don-t-leave-it-for-when-it-s-too-late'>Observability: Don't leave it for when it's too late</a></li> +<li>⇢ <a href='#the-iterative-spirit'>The iterative spirit</a></li> +<li>⇢ <a href='#book-tips'>Book tips</a></li> +</ul><br /> +<h2 style='display: inline' id='system-design-and-incident-analysis'>System Design and Incident Analysis</h2><br /> +<br /> +<span>A big chunk of SRE work revolves around system design and incident analysis. What separates a well-designed system from a mediocre one is its ability to minimise and contain cascading failures. Unchecked, those can spiral into global outages.</span><br /> +<br /> +<h3 style='display: inline' id='resilience-and-cascading-failures'>Resilience and cascading failures</h3><br /> +<br /> +<span>There's a growing emphasis on building resilient systems so that when something fails, the blast radius stays small. That resilience needs to be baked in at design time: we identify weak points and address them before production. The goal is to keep services dependable and uninterrupted.</span><br /> +<br /> +<h3 style='display: inline' id='learning-from-incidents'>Learning from incidents</h3><br /> +<br /> +<span>When incidents do happen, their analysis is a goldmine. Every incident exposes gaps—whether in tooling (ops tools that aren't up to the job) or in skills (engineers missing critical know-how). 
Blaming "human error" doesn't help. The job is to dig into root causes and fix the system. Postmortems that focus on customer impact help us distil lessons and make the system more robust so we're less likely to repeat the same failure.</span><br /> +<br /> +<span>System design and incident analysis form a feedback loop: we improve the design based on what we learn from incidents, and a better design reduces the impact of the next one.</span><br /> +<br /> +<h2 style='display: inline' id='observability-don-t-leave-it-for-when-it-s-too-late'>Observability: Don't leave it for when it's too late</h2><br /> +<br /> +<span>Product and features often get the spotlight; observability is often an afterthought. Teams agree that "we need better observability" when they're already in the middle of an incident—and by then it's too late. Good observability needs to be in place before things go wrong. Tools that can query high-cardinality data and give granular insight into system behaviour are what let us diagnose problems quickly when chaos hits. So invest in observability early. When the next incident happens, you'll be glad you did.</span><br /> +<br /> +<h2 style='display: inline' id='the-iterative-spirit'>The iterative spirit</h2><br /> +<br /> +<span>We also accept that system design is never "done." We refine it based on real-world performance, incident learnings, and changing needs. Every incident is a chance to learn and improve; the emphasis is on learning, not blame. SREs work with developers, backend teams, and incident response so that the whole system keeps getting better. 
Perfection is a journey, not a destination.</span><br /> +<br /> +<h2 style='display: inline' id='book-tips'>Book tips</h2><br /> +<br /> +<span>If you want to go deeper, here are a few books I can recommend:</span><br /> +<br /> +<ul> +<li>97 Things Every SRE Should Know: Collective Wisdom from the Experts by Emily Stolarsky and Jaime Woo</li> +<li>Site Reliability Engineering: How Google Runs Production Systems by Jennifer Petoff, Niall Murphy, Betsy Beyer, and Chris Jones</li> +<li>Implementing Service Level Objectives by Alex Hidalgo</li> +</ul><br /> +<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br /> +<br /> +<a class='textlink' href='../'>Back to the main site</a><br /> + </div> + </content> + </entry> + <entry> <title>My desk rack: DeskPi RackMate T0</title> <link href="gemini://foo.zone/gemfeed/2026-02-22-my-desk-rack.gmi" /> <id>gemini://foo.zone/gemfeed/2026-02-22-my-desk-rack.gmi</id> @@ -376,335 +457,6 @@ mage test </content> </entry> <entry> - <title>Meta slash-commands to manage prompts, skills, and context for coding agents</title> - <link href="gemini://foo.zone/gemfeed/2026-02-14-meta-slash-commands-for-prompts-and-context.gmi" /> - <id>gemini://foo.zone/gemfeed/2026-02-14-meta-slash-commands-for-prompts-and-context.gmi</id> - <updated>2026-02-14T13:44:45+02:00, last updated Tue 17 Feb 14:00:00 EET 2026</updated> - <author> - <name>Paul Buetow aka snonux</name> - <email>paul@dev.buetow.org</email> - </author> - <summary>I work on many small, repeatable tasks. Instead of retyping the same instructions every time, I want to turn successful prompts into reusable slash-commands and keep background knowledge in loadable context files. This post describes a set of *meta* slash-commands: commands that create, update, and delete other commands, context files, and skills. 
They live as markdown in a dotfiles repo and work with any coding agent that supports slash-commands—Claude Code CLI, Cursor Agent, OpenCode, Ampcode, and others.</summary> - <content type="xhtml"> - <div xmlns="http://www.w3.org/1999/xhtml"> - <h1 style='display: inline' id='meta-slash-commands-to-manage-prompts-skills-and-context-for-coding-agents'>Meta slash-commands to manage prompts, skills, and context for coding agents</h1><br /> -<br /> -<span class='quote'>Published at 2026-02-14T13:44:45+02:00, last updated Tue 17 Feb 14:00:00 EET 2026</span><br /> -<br /> -<span>I work on many small, repeatable tasks. Instead of retyping the same instructions every time, I want to turn successful prompts into reusable slash-commands and keep background knowledge in loadable context files. This post describes a set of *meta* slash-commands: commands that create, update, and delete other commands, context files, and skills. They live as markdown in a dotfiles repo and work with any coding agent that supports slash-commands—Claude Code CLI, Cursor Agent, OpenCode, Ampcode, and others.</span><br /> -<br /> -<span class='quote'>Updated Tue 17 Feb: Added section about skill management commands and the differences between commands and skills</span><br /> -<br /> -<pre> - ┌─────────────────────────────────────────────────────────────┐ - │ Cursor Agent [~][□][X] │ - ├─────────────────────────────────────────────────────────────┤ - │ │ - │ → /load-context api-guidelines │ - │ │ - │ Context loaded: api-guidelines.md │ - │ Ready. Ask me to implement something. │ - │ │ - │ → /create-skill docker-compose │ - │ │ - │ Analyzing "docker-compose"... │ - │ Generated: SKILL.md with frontmatter + instructions. │ - │ Save to skills/docker-compose/ ? [Y] │ - │ │ - │ ✓ Saved. Use /docker-compose anytime. 
│ - │ │ - └─────────────────────────────────────────────────────────────┘ - │ - │ slash-commands & skills - ▼ - ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ - │ /load- │ │ /create- │ │ /create- │ │ /docker- │ - │ context │ │ command │ │ skill │ │ compose │ - └──────────┘ └──────────┘ └──────────┘ └──────────┘ - │ │ │ │ - └──────────────┴──────────────┴──────────────┘ - │ - coding agent executes - your prompt library -</pre> -<br /> -<h2 style='display: inline' id='table-of-contents'>Table of Contents</h2><br /> -<br /> -<ul> -<li><a href='#meta-slash-commands-to-manage-prompts-skills-and-context-for-coding-agents'>Meta slash-commands to manage prompts, skills, and context for coding agents</a></li> -<li>⇢ <a href='#motivation-collecting-prompts-for-later-re-use'>Motivation: collecting prompts for later re-use</a></li> -<li>⇢ <a href='#loading-whole-context-before-asking-the-agent-to-do-something'>Loading whole context before asking the agent to do something</a></li> -<li>⇢ <a href='#works-with-any-coding-agent-that-supports-slash-commands'>Works with any coding agent that supports slash-commands</a></li> -<li>⇢ <a href='#commands-that-manage-slash-commands'>Commands that manage slash-commands</a></li> -<li>⇢ ⇢ <a href='#create-command'><span class='inlinecode'>/create-command</span></a></li> -<li>⇢ ⇢ <a href='#update-command'><span class='inlinecode'>/update-command</span></a></li> -<li>⇢ ⇢ <a href='#delete-command'><span class='inlinecode'>/delete-command</span></a></li> -<li>⇢ <a href='#commands-vs-skills-when-to-use-which'>Commands vs skills: when to use which</a></li> -<li>⇢ <a href='#commands-that-manage-skills'>Commands that manage skills</a></li> -<li>⇢ ⇢ <a href='#create-skill'><span class='inlinecode'>/create-skill</span></a></li> -<li>⇢ ⇢ <a href='#update-skill'><span class='inlinecode'>/update-skill</span></a></li> -<li>⇢ ⇢ <a href='#delete-skill'><span class='inlinecode'>/delete-skill</span></a></li> -<li>⇢ <a 
href='#commands-that-manage-context-files'>Commands that manage context files</a></li> -<li>⇢ ⇢ <a href='#create-context'><span class='inlinecode'>/create-context</span></a></li> -<li>⇢ ⇢ <a href='#update-context'><span class='inlinecode'>/update-context</span></a></li> -<li>⇢ ⇢ <a href='#delete-context'><span class='inlinecode'>/delete-context</span></a></li> -<li>⇢ ⇢ <a href='#load-context'><span class='inlinecode'>/load-context</span></a></li> -<li>⇢ <a href='#summary'>Summary</a></li> -</ul><br /> -<h2 style='display: inline' id='motivation-collecting-prompts-for-later-re-use'>Motivation: collecting prompts for later re-use</h2><br /> -<br /> -<span>When I use a coding agent, I often find myself repeating the same kind of request: "review this function," "explain this error," "add tests for this module," "format this as a blog post" and may other cases. Typing long prompts from scratch is tedious, and ad-hoc prompts are easy to forget. I'd rather capture what works and reuse it.</span><br /> -<br /> -<span>The solution is to treat prompts as first-class artefacts: store them as markdown files (one file per slash-command or per context), and use a small set of *meta* commands to manage them. The agent then creates, updates, or deletes these files through conversation—no hand-editing of markdowns. I can say <span class='inlinecode'>/create-command review-code we just did a code review</span> and the agent generates the command file based on the current agent's context, shows a preview, and saves it. Later I run <span class='inlinecode'>/review-code</span> and get a consistent workflow every time.</span><br /> -<br /> -<span>Because everything is just markdown in directories (<span class='inlinecode'>commands/</span> for commands, <span class='inlinecode'>skills/</span> for skills, and <span class='inlinecode'>context/</span> for context), I can version it in git, sync it across machines, and gradually build a library of prompts. 
When a command grows too complex for a single file, I promote it to a skill—a structured directory with YAML frontmatter, a "When to Use" section, and detailed instructions.</span><br /> -<br /> -<h2 style='display: inline' id='loading-whole-context-before-asking-the-agent-to-do-something'>Loading whole context before asking the agent to do something</h2><br /> -<br /> -<span>A separate but related need is *context*: background information the agent should have before I ask it to do anything. For example, I might have a document describing our Kubernetes setup, API conventions, or the architecture of a specific service. If I ask "add a new endpoint for X" without that context, the agent guesses and without having a reference to an existing project with an <span class='inlinecode'>AGENTS.md</span>. If I first load the relevant context file, the agent knows the naming conventions, the existing patterns, and the infrastructure—and its edits are more accurate.</span><br /> -<br /> -<span>So I keep three kinds of artefacts:</span><br /> -<br /> -<ul> -<li>Commands — Reusable workflows (e.g. "review code", "explain error"). They live as single <span class='inlinecode'>.md</span> files in a <span class='inlinecode'>commands/</span> directory. Meta-commands create, update, and delete them. Commands are simple: one file, one prompt. They work with any coding agent.</li> -<li>Skills — Richer, more structured artefacts than commands. Each skill lives in its own directory (e.g. <span class='inlinecode'>skills/go-best-practices/SKILL.md</span>) and includes YAML frontmatter with metadata (name, description), a "When to Use" section, and detailed multi-step instructions. Skills can include additional files alongside the <span class='inlinecode'>SKILL.md</span>. They are the right choice when a workflow needs more structure, domain knowledge, or multiple steps.</li> -<li>Context — Reusable background (project rules, API notes, infrastructure docs, personas). 
They live as <span class='inlinecode'>.md</span> files in a <span class='inlinecode'>context/</span> directory. I can create, update, delete, and—importantly—*load* them. Loading a context file injects that content into the conversation so the agent has it in mind for subsequent requests.</li> -</ul><br /> -<span>The use case is: start a session, run <span class='inlinecode'>/load-context api-guidelines</span> (or whatever context name), then ask the agent to implement a feature or fix a bug. The agent already knows the guidelines. No need to paste a wall of text every time; the context is on demand.</span><br /> -<br /> -<h2 style='display: inline' id='works-with-any-coding-agent-that-supports-slash-commands'>Works with any coding agent that supports slash-commands</h2><br /> -<br /> -<span>I use different agents depending on the task: Claude Code CLI, Cursor Agent (CLI), OpenCode, Ampcode and others. What they have in common is support for custom slash-commands (or the ability to read prompt files). My meta-commands, skills, and context files are just markdown; there is no lock-in. Point your agent at the same directories and you get the same prompts, skills, and context. I don't need an MCP server returning prompts right now—the files on disk are enough.</span><br /> -<br /> -<h2 style='display: inline' id='commands-that-manage-slash-commands'>Commands that manage slash-commands</h2><br /> -<br /> -<span>These meta-commands create, update, and delete other slash-commands. The target files live in <span class='inlinecode'>~/Notes/Prompts/commands/</span> (or your chosen path). Each command is one <span class='inlinecode'>.md</span> file. 
You can see the commands (and the context files) here:</span><br /> -<br /> -<a class='textlink' href='https://codeberg.org/snonux/dotfiles/src/branch/master/prompts/'>https://codeberg.org/snonux/dotfiles/src/branch/master/prompts/</a><br /> -<br /> -<h3 style='display: inline' id='create-command'><span class='inlinecode'>/create-command</span></h3><br /> -<br /> -<span>Creates a new slash-command by inferring its purpose from the name you give.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>command_name</span> (e.g. <span class='inlinecode'>review-code</span>, <span class='inlinecode'>explain-error</span>, <span class='inlinecode'>optimize-function</span>)</li> -<li>What it does: The agent analyses the name, infers intent and parameters, writes a description and prompt, shows a preview, and saves <span class='inlinecode'>{{command_name}}.md</span> to the commands directory.</li> -<li>Good for: Turning the current task or a recurring need into a reusable command without editing files by hand.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/create-command review-code -/create-command explain-error -</pre> -<br /> -<h3 style='display: inline' id='update-command'><span class='inlinecode'>/update-command</span></h3><br /> -<br /> -<span>Updates an existing slash-command step by step.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>command_name</span> (e.g. 
<span class='inlinecode'>create-command</span>, <span class='inlinecode'>review-code</span>)</li> -<li>What it does: Reads the existing <span class='inlinecode'>.md</span> file, shows the current content, asks what to change (description, parameters, prompt text), applies edits, shows a preview, and saves.</li> -<li>Good for: Refining a command after you've used it a few times or when requirements change.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/update-command create-command -/update-command review-code -</pre> -<br /> -<h3 style='display: inline' id='delete-command'><span class='inlinecode'>/delete-command</span></h3><br /> -<br /> -<span>Removes a slash-command by deleting its definition file.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>command_name</span> (e.g. <span class='inlinecode'>testing</span>, <span class='inlinecode'>review-code</span>)</li> -<li>What it does: Verifies the file exists, shows what will be deleted, asks for confirmation, then deletes the file.</li> -<li>Good for: Cleaning up experiments or commands you no longer use.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/delete-command testing -/delete-command review-code -</pre> -<br /> -<h2 style='display: inline' id='commands-vs-skills-when-to-use-which'>Commands vs skills: when to use which</h2><br /> -<br /> -<span>Commands and skills both produce reusable slash-commands, but they differ in structure and intent:</span><br /> -<br /> -<pre> -| Aspect | Command | Skill | -|-----------------|--------------------------------|----------------------------------------| -| File layout | Single .md file in commands/ | Directory with SKILL.md in skills/ | -| Metadata | Markdown heading + description | YAML frontmatter (name, description) | -| Structure | Free-form prompt text | "When to Use" + structured instructions| -| Complexity | Simple, single-purpose prompts | Multi-step workflows, domain knowledge | -| Extra files | No 
| Yes (can include supporting files) | -| Best for | Quick one-shot tasks | Rich, repeatable processes | -</pre> -<br /> -<span>Use a **command** when you need a quick, single-purpose prompt—something like "review this PR" or "explain this error." Use a **skill** when the workflow is more involved: it needs structured instructions, domain-specific knowledge, or multiple steps that the agent should follow in order. For example, my <span class='inlinecode'>go-best-practices</span> skill contains detailed conventions for project structure, naming, error handling, and testing—far more than would fit comfortably in a flat command file.</span><br /> -<br /> -<span>The YAML frontmatter in skills (<span class='inlinecode'>name</span> and <span class='inlinecode'>description</span> between <span class='inlinecode'>---</span> fences at the top of the file) is what makes skills discoverable by the coding agent. When the agent starts a session, it scans the skills directory and reads the frontmatter to build a list of available skills—without having to parse the entire file. The <span class='inlinecode'>name</span> field gives the skill its slash-command name, and the <span class='inlinecode'>description</span> tells the agent (and the user) what the skill does, so the agent can suggest the right skill for a given task. Commands don't need this metadata because they are simpler: the filename *is* the command name, and the first heading serves as the description.</span><br /> -<br /> -<span>In practice, I start with a command and promote it to a skill once it grows beyond a simple prompt.</span><br /> -<br /> -<h2 style='display: inline' id='commands-that-manage-skills'>Commands that manage skills</h2><br /> -<br /> -<span>These meta-commands create, update, and delete skills. 
Skills live in <span class='inlinecode'>~/Notes/Prompts/skills/</span>, each in its own directory containing a <span class='inlinecode'>SKILL.md</span> file with YAML frontmatter.</span><br /> -<br /> -<h3 style='display: inline' id='create-skill'><span class='inlinecode'>/create-skill</span></h3><br /> -<br /> -<span>Creates a new skill by inferring its purpose from the name you give.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>skill_name</span> (e.g. <span class='inlinecode'>docker-compose</span>, <span class='inlinecode'>rust-conventions</span>)</li> -<li>What it does: The agent analyses the name, infers intent, creates a directory <span class='inlinecode'>skills/{{skill_name}}/</span>, generates a <span class='inlinecode'>SKILL.md</span> with YAML frontmatter (<span class='inlinecode'>name</span>, <span class='inlinecode'>description</span>), a "When to Use" section, and detailed instructions. Shows a preview before saving.</li> -<li>Good for: Creating structured, multi-step workflows that need more organisation than a simple command.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/create-skill docker-compose -/create-skill rust-conventions -</pre> -<br /> -<h3 style='display: inline' id='update-skill'><span class='inlinecode'>/update-skill</span></h3><br /> -<br /> -<span>Updates an existing skill step by step.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>skill_name</span> (e.g. 
<span class='inlinecode'>go-best-practices</span>, <span class='inlinecode'>compose-blog-post</span>)</li> -<li>What it does: Reads the existing <span class='inlinecode'>SKILL.md</span>, shows its current content, asks what to change (description, "When to Use" section, instructions), applies edits, shows a preview, and saves.</li> -<li>Good for: Refining a skill after real-world usage or when conventions evolve.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/update-skill go-best-practices -/update-skill compose-blog-post -</pre> -<br /> -<h3 style='display: inline' id='delete-skill'><span class='inlinecode'>/delete-skill</span></h3><br /> -<br /> -<span>Removes a skill by deleting its entire directory.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>skill_name</span> (e.g. <span class='inlinecode'>docker-compose</span>, <span class='inlinecode'>rust-conventions</span>)</li> -<li>What it does: Verifies the skill exists, shows what will be deleted, asks for confirmation, then removes the <span class='inlinecode'>skills/{{skill_name}}/</span> directory.</li> -<li>Good for: Cleaning up experimental or unused skills.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/delete-skill docker-compose -/delete-skill rust-conventions -</pre> -<br /> -<h2 style='display: inline' id='commands-that-manage-context-files'>Commands that manage context files</h2><br /> -<br /> -<span>These meta-commands create, update, delete, and *load* context files. Context files live in <span class='inlinecode'>~/Notes/Prompts/context/</span>. 
Loading a context injects its content into the conversation so the agent can use it for subsequent requests.</span><br /> -<br /> -<h3 style='display: inline' id='create-context'><span class='inlinecode'>/create-context</span></h3><br /> -<br /> -<span>Creates a new context file.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>context_name</span> (without <span class='inlinecode'>.md</span>), e.g. <span class='inlinecode'>epimetheus</span>, <span class='inlinecode'>api-guidelines</span></li> -<li>What it does: Checks if the context already exists, asks what the context should contain (background, structure, sections), then writes <span class='inlinecode'>{{context_name}}.md</span> to the context directory.</li> -<li>Good for: Capturing project rules, API conventions, or infrastructure notes once and reusing them via <span class='inlinecode'>/load-context</span>.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/create-context epimetheus -/create-context api-guidelines -</pre> -<br /> -<h3 style='display: inline' id='update-context'><span class='inlinecode'>/update-context</span></h3><br /> -<br /> -<span>Updates an existing context file by adding, modifying, or removing content.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>context_name</span> (e.g. <span class='inlinecode'>epimetheus</span>, <span class='inlinecode'>api-guidelines</span>). 
If omitted, lists available context files.</li> -<li>What it does: Reads the existing file, asks what to change (add section, modify section, remove section, rewrite, or full overhaul), applies changes, and saves.</li> -<li>Good for: Keeping context up to date as the project or infrastructure evolves.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/update-context epimetheus -/update-context api-guidelines -/update-context -</pre> -<br /> -<h3 style='display: inline' id='delete-context'><span class='inlinecode'>/delete-context</span></h3><br /> -<br /> -<span>Deletes a context file after confirmation.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>context_name</span> (e.g. <span class='inlinecode'>epimetheus</span>, <span class='inlinecode'>old-api-guidelines</span>). If omitted, lists available context files.</li> -<li>What it does: Verifies the file exists, shows a preview or summary, asks for confirmation, then deletes the file.</li> -<li>Good for: Removing outdated or unused context.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/delete-context epimetheus -/delete-context old-api-guidelines -/delete-context -</pre> -<br /> -<h3 style='display: inline' id='load-context'><span class='inlinecode'>/load-context</span></h3><br /> -<br /> -<span>Loads a context file into the conversation so the agent has that background for subsequent requests.</span><br /> -<br /> -<ul> -<li>Parameter: <span class='inlinecode'>context_name</span> (e.g. <span class='inlinecode'>epimetheus</span>, <span class='inlinecode'>api-guidelines</span>). If omitted, lists available context files.</li> -<li>What it does: Reads the context file, displays its content, and confirms it is loaded. 
From then on, the agent can use that information when you ask it to implement features, fix bugs, or answer questions.</li> -<li>Good for: Starting a session with "load our API guidelines" or "load our Kubernetes runbook" so the agent knows the infrastructure and conventions before you ask it to do something.</li> -</ul><br /> -<span>Example usage:</span><br /> -<br /> -<pre> -/load-context epimetheus -/load-context api-guidelines -/load-context -</pre> -<br /> -<h2 style='display: inline' id='summary'>Summary</h2><br /> -<br /> -<pre> -| Meta-command | Purpose | Good for | -|--------------------|----------------------------------------------|---------------------------------------------------| -| /create-command | Create new slash-command from name | Turning current or recurring tasks into commands | -| /update-command | Edit existing slash-command | Refining commands over time | -| /delete-command | Remove slash-command file | Cleaning up unused commands | -| /create-skill | Create new skill with structured instructions| Building rich, multi-step workflows | -| /update-skill | Edit existing skill | Refining skills as conventions evolve | -| /delete-skill | Remove skill directory | Cleaning up experimental or unused skills | -| /create-context | Create new context file | Capturing project/infra knowledge once | -| /update-context | Edit existing context file | Keeping context up to date | -| /delete-context | Remove context file | Removing outdated context | -| /load-context | Load context into conversation | Giving the agent background before tasks | -</pre> -<br /> -<span>Context is what the agent *knows*; commands and skills are what the agent *does*—commands for simple prompts, skills for structured multi-step workflows. All three are markdown files you can create, update, and delete on the fly through the same coding agent—Claude Code CLI, Cursor Agent, OpenCode, Ampcode, or any other that supports slash-commands or prompt files. 
Start with commands for quick tasks, promote to skills when complexity grows, and load context when the agent needs background knowledge.</span><br />
-<br />
-<span>Other related posts:</span><br />
-<br />
-<a class='textlink' href='./2026-02-14-meta-slash-commands-for-prompts-and-context.html'>2026-02-14 Meta slash-commands to manage prompts, skills, and context for coding agents (You are currently reading this)</a><br />
-<a class='textlink' href='./2026-02-02-tmux-popup-editor-for-cursor-agent-prompts.html'>2026-02-02 A tmux popup editor for Cursor Agent CLI prompts</a><br />
-<br />
-<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br />
-<br />
-<a class='textlink' href='../'>Back to the main site</a><br />
- </div>
- </content>
- </entry>
- <entry>
 <title>A tmux popup editor for Cursor Agent CLI prompts</title>
 <link href="gemini://foo.zone/gemfeed/2026-02-02-tmux-popup-editor-for-cursor-agent-prompts.gmi" />
 <id>gemini://foo.zone/gemfeed/2026-02-02-tmux-popup-editor-for-cursor-agent-prompts.gmi</id>
@@ -16640,6 +16392,7 @@ http://www.gnu.org/software/src-highlite -->
 <a class='textlink' href='./2023-11-19-site-reliability-engineering-part-2.html'>2023-11-19 Site Reliability Engineering - Part 2: Operational Balance</a><br />
 <a class='textlink' href='./2024-01-09-site-reliability-engineering-part-3.html'>2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture</a><br />
 <a class='textlink' href='./2024-09-07-site-reliability-engineering-part-4.html'>2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers (You are currently reading this)</a><br />
+<a class='textlink' href='./2026-03-01-site-reliability-engineering-part-5.html'>2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning</a><br />
 <br />
 <pre>
 __..._ _...__
@@ -16702,6 +16455,12 @@ jgs \\`_..---.Y.---.._`//
 <br />
 <span>If you're looking to optimize your on-call onboarding process,
these strategies could be your ticket to a more efficient and effective transition. Happy on-calling!</span><br />
 <br />
+<span>Continue with the fifth part of this series:</span><br />
+<br />
+<a class='textlink' href='./2026-03-01-site-reliability-engineering-part-5.html'>2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning</a><br />
+<br />
+<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> :-)</span><br />
+<br />
 <a class='textlink' href='../'>Back to the main site</a><br />
 </div>
 </content>
diff --git a/gemfeed/index.gmi b/gemfeed/index.gmi
index 47a398b8..4c80cc0e 100644
--- a/gemfeed/index.gmi
+++ b/gemfeed/index.gmi
@@ -2,9 +2,9 @@
 ## To be in the .zone!
+=> ./2026-03-01-site-reliability-engineering-part-5.gmi 2026-03-01 - Site Reliability Engineering - Part 5: System Design, Incidents, and Learning
 => ./2026-02-22-my-desk-rack.gmi 2026-02-22 - My desk rack: DeskPi RackMate T0
 => ./2026-02-15-loadbars-resurrected-from-perl-to-go.gmi 2026-02-15 - Loadbars resurrected: From Perl to Go after 15 years
-=> ./2026-02-14-meta-slash-commands-for-prompts-and-context.gmi 2026-02-14 - Meta slash-commands to manage prompts, skills, and context for coding agents
 => ./2026-02-02-tmux-popup-editor-for-cursor-agent-prompts.gmi 2026-02-02 - A tmux popup editor for Cursor Agent CLI prompts
 => ./2026-01-01-using-supernote-nomad-offline.gmi 2026-01-01 - Using Supernote Nomad offline
 => ./2026-01-01-posts-from-july-to-december-2025.gmi 2026-01-01 - Posts from July to December 2025
