From ce3f82e7b966c356885dcc14be7b3d6e38180ca0 Mon Sep 17 00:00:00 2001 From: Paul Buetow Date: Fri, 21 Feb 2025 17:08:01 +0200 Subject: Update content for html --- notes/97-things-every-sre-should-know.html | 362 +++++++++++++++++++++++ notes/implementing-service-level-objectives.html | 109 +++++++ notes/index.html | 3 + notes/site-reliability-engineering.html | 112 +++++++ 4 files changed, 586 insertions(+) create mode 100644 notes/97-things-every-sre-should-know.html create mode 100644 notes/implementing-service-level-objectives.html create mode 100644 notes/site-reliability-engineering.html (limited to 'notes') diff --git a/notes/97-things-every-sre-should-know.html b/notes/97-things-every-sre-should-know.html new file mode 100644 index 00000000..4435b470 --- /dev/null +++ b/notes/97-things-every-sre-should-know.html @@ -0,0 +1,362 @@ + + + + +'97 Things Every SRE Should Know' book notes + + + + + +

+Home | Markdown | Gemini +

+

"97 Things Every SRE Should Know" book notes


+
+These are my personal book notes on Emil Stolarsky's and Jaime Woo's "97 Things Every SRE Should Know". They are for myself, but I hope they might be useful to you too.
+
+

Table of Contents


+
+
+

Introduction


+
+That willingness to learn makes sense for SREs, given the need to work with complex systems. The systems change constantly, and the role requires someone wanting to ask questions about how they work. Curiosity is a trait found in many SREs.
+
+It's normal (and fine) for some of our work to deal with immediate needs, but teams that operate only on the urgent side of the Eisenhower matrix are limited in what they can achieve. Nothing is ever perfect, so don’t aim for it. Ensure instead that you’re aiming to be reliable just enough of the time. Because that’s where the power is.
+
+
+

Observability


+
+Recording everything would swamp most observability systems with an obscene amount of storage and scale. It would simply be impractical to pay for a system capable of doing that. Observability helps your investigation of problems pinpoint likely sources. Observability is not for debugging your code logic. It is for figuring out where in your systems to find the code you need to debug.
+
+

The ancient art of writing things down


+
+When it comes to reliability, we’re used to discussing new advances in the field, but one of the most powerful forces for reliability is also one of the oldest: the ancient art of writing things down.
+
+
+An SLO of 99.9% only tells you anything if you know what the service’s owners consider “available” to mean. If there’s an accompanying SLO definition document that explains that a one-second response is considered a success, and you were hoping for 10-millisecond latencies, you’ll reevaluate whether this backend is the one for you.
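To make the point about definitions concrete, here is a minimal Python sketch (not from the book) that scores the same traffic against two latency thresholds. The `Request` type, the sample data, and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float  # observed response time in milliseconds
    ok: bool           # True if the response was not an error


def availability(requests: Sequence[Request], max_latency_ms: float) -> float:
    """Fraction of requests counted as 'good' under a given latency threshold."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= max_latency_ms)
    return good / len(requests)


# Hypothetical sample: 998 fast successes, 1 slow success, 1 fast error.
sample = (
    [Request(latency_ms=5.0, ok=True)] * 998
    + [Request(latency_ms=800.0, ok=True)]
    + [Request(latency_ms=5.0, ok=False)]
)

lenient = availability(sample, max_latency_ms=1000.0)  # one-second definition
strict = availability(sample, max_latency_ms=10.0)     # 10-millisecond definition
```

With this made-up sample, the one-second definition gives 99.9% and the 10-millisecond definition gives 99.8%, so whether the service meets a 99.9% SLO depends only on what "available" is taken to mean.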
+
+
+

The team's health


+
+To decide, you must know what you value most in a job and what you can expect from companies. As we fine-tune SLOs and iterate on rotation design, it’s equally important to keep in touch with the pulse of the team’s health, and constantly ask: As a group, are we working in a way that is sustainable over the long haul?
+
+
+The second notion was, “No one pays for generalists; you need to specialize.” Now I’m an SRE. Burnout is a challenge. It will happen a few times, and each time you think, “I’ll never fall for that again.” Again, you will work too hard, too long, without reward or appreciation. It can permanently damage your health.
+
+
+Dilemmas get easier when you ask, “In ten years, what will I wish I’d done?” Feeling financially trapped makes situations far worse. We work for managers, not companies. Ensure you are only 80% sure you can do the jobs you apply for, so you stretch yourself. Managers aren’t your friends; they are your agents. Fire them if you don’t like the community, work, or money they bring you.
+
+The efforts and personal sacrifices of engineers are meaningless if they do not resonate at a strategic level. The Space Shuttle Challenger was approved for launch by NASA managers seeking to avoid delays in an already beleaguered schedule, despite known engineer concerns about the safety of the orbiter vehicle in subzero launch temperatures.
+
+
+

Sharing responsibilities


+
+Embracing the idea “you build it, you run it” empowers everyone in your organization with shared responsibility for reliability and broad use of your team’s skills.
+
+
+

The roles and the solo SRE


+
+Three roles: incident manager, expert/operator, and communications. Typically, incident management roles include an incident commander, technical lead, and communications lead. Incident management is a natural progression after observability.
+
+The most important point to remember in being a solo SRE is that although you can effect change within your organization, you cannot do it alone, so don’t try to carry the weight of your organization’s problems on your shoulders.
+
+
+

Being customer-focused


+
+SRE is about being customer-focused. Regardless of the stage of development, it is critical to understand the bottlenecks in your system and communicate them to stakeholders. There is likely to be a strong push to ignore SRE capability work and focus on new features. However, for most enterprises, introducing SLOs and error budgets to business-critical services remains a key differentiator for establishing SRE.
+
+
+

Don't have all the answers


+
+There is unfortunate pressure on people to feel like they have all the answers. In meetings, we often see someone tap dancing nervously around an answer they don’t have, especially when asked by someone higher up the management chain. It’s not our role as engineers and leaders always to have the answers.
+
+
+

Runbooks


+
+Once a mental model can be recorded, reproduced, and shared, it becomes a general-purpose abstraction. It speeds communication and gives people standard tools to refer to when reasoning about behavior, outages, and proposed changes to the system.
+
+
+Runbooks cannot and will not solve every incident. But that’s fine. As incidents become more novel, there is a point at which an investment in runbooks starts to show diminishing returns.
+
+
+

Alerts per shift


+
+
+In a transitory phase, people who are more often on call will get two mandatory consecutive days of recovery to prevent burnout.
+
+
+

Balancing velocity


+
+As SREs, we see our job as balancing velocity with reliability.
+
+You Don’t Know for Sure Until It Runs in Production.
+
+We often view production as a house of cards, a fragile ecosystem that needs to be approached with care, silk gloves, or bunker gear. Incidents give us the space to zoom out and notice detrimental complexity, and incident reviews are a perfect opportunity to target and remove it.
+
+Simpler systems that aren’t perfect are usually better than complex ones. We often think of incidents in terms of TTx (time to x), like time to detect or time to mitigate, but these metrics provide little insight into what makes an incident interesting.
+
+If an engineer is a hero, there’s a gap in the process, the infrastructure, or the tooling.
+
+"Measure everything" is a trap. Metrics are raw numbers: how many items are in a queue, how many days have passed since the last failure. SLIs are combinations of metrics that tell a story, such as what happens if the queue keeps filling at the current rate.
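As an illustration of combining raw metrics into something that tells a story, the following Python sketch projects when a queue would overflow if it keeps filling at the current rate. The function name and parameters are hypothetical and do not come from the book.

```python
from typing import Optional


def seconds_until_full(depth: int, capacity: int, fill_rate_per_s: float) -> Optional[float]:
    """Project when a queue overflows if the current net fill rate continues.

    depth and capacity are raw metrics (items); fill_rate_per_s is the net
    rate of change (enqueues minus dequeues per second).
    Returns None when the queue is draining or stable.
    """
    if fill_rate_per_s <= 0:
        return None
    remaining = max(capacity - depth, 0)
    return remaining / fill_rate_per_s


# Example: 800 of 1000 slots used, growing by 4 items per second.
eta = seconds_until_full(depth=800, capacity=1000, fill_rate_per_s=4.0)
```

The queue depth alone says little; the projected time until the queue is full is something an on-caller can act on.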
+
+
+

The power in knowing how to be self-sufficient


+
+There is power in knowing how to be self-sufficient, in having the tools and the fearlessness to track answers down through layers of abstractions. SLOs are about quantifying delivered service, setting appropriate expectations, and changing tactics when things aren’t going well.
+
+
+

Prioritize towards the overall reliability goal


+
+Part of the solution is to prioritize working on something small towards the overall reliability goals every day, rather than working on it for a week and then moving on, never to return.
+
+
+Mentoring others becomes part of this. It requires time, energy, dedication, and goodwill, so it is considered additional work.
+
+
+

The quiet time vs the burnout


+
+Quiet time in the morning can be used to work on tasks with fewer interruptions. Remote ICs (individual contributors) have opportunities to be productive differently than before, like time-shifting work or breaking up their day.
+
+
+It's important for mental and physical health to create and maintain personal margin to avoid burnout. Renewing activities counter environmental uncertainty: breaks, changes of scenery, and exercise. Incidents are unplanned investments in understanding systems. The learning budget is where you explore new, creative approaches.
+
+

Error budget as a learning budget


+
+Also known as the error budget, this leftover part is where or when the service does not meet the objective. It's more helpful to think about this as the learning budget. Shouldn't we just be open that we’re all committed to reliability and have leadership prioritize it? Sure, in a perfect world, but driving culture change means being passionate about the vision and patient enough to know folks need training wheels.
+
+Focus not just on a single night; rather, lay the groundwork for creating an operationally mature organization. We are creatures of habit—sudden changes of routine and operating outside our comfort zone attract doubt. Changing too much too quickly leads to confusion and skepticism.
+
+

Introducing SRE


+
+Bringing SRE means overcoming inertia and requires substantial investment in time to educate and continuously reinforce practices and behaviors.
+
+Change is hard, especially in large organizations. Focus initially on the most critical behaviors to adapt and help spread awareness.
+
+
+discuss several sources of complexity. The biggest and hardest to deal with is state. State influences control flow, and the number of potential software states grows exponentially with the number of variables. Separating the SRE team from development teams (sometimes by creating a Center of Excellence) causes problems rather than solving them.
+Elitism and knowledge constraints are issues.
+
+
+SLOs, SLIs, and error budgets will require convincing within the organization. Some may prioritize feature velocity over reliability work. Once engineering, operations, and product teams buy in, it's essential to engage senior leadership. The benefits of SRE practices, such as greater release velocity and early insights into the user experience, should be emphasized to them.
+
+The key argument to leadership is that SRE practices will provide better feature velocity over time.
+
+

Heroes and On-Call Practices


+
+Heroes are necessary, but hero culture is not. A hero culture can easily form, but an SRE mindset helps combat this. If no action is required, tweak thresholds or delete alerts. Treat every page as an exceptional circumstance. Include on-call behaviors in developmental and career progression frameworks. On-callers should shadow experienced engineers to practice incident response. Trial by fire is not a prerequisite for being good on call. The best improvement ideas often come from the on-callers themselves.
+
+Regular retrospectives and reflection improve on-call experiences. Good communication and collaboration multiply team efficiency. Successful teams frequently meet to improve processes and keep documentation up to date.
+
+
+

Prevent failures through improved system design


+
+When a cascading failure occurs, many issues arise simultaneously, overwhelming systems. Even prepared teams can struggle to mitigate without serious user impact. A more effective strategy involves preventing failures through improved system design.
+
+
+

On-call health and postmortems


+
+On-call health is crucial. Postmortems should analyze alerts for noise and automate recurring tasks. Action items from retrospectives should be completed promptly.
+
+
+

Time Management and Cultural Considerations


+
+
+Toil reduction should be a goal across all engineering disciplines. Reliability and operability demand proactive planning, not just reactive fixes. An SRE team should ensure systems need less human intervention to function. It's crucial to make SRE contributions visible to prevent organizational decay. While we cannot track prevented incidents, preventive efforts are invaluable.
+
+
+Remember, engineering will always face breakages, which can lead to burnout. Mental health is a priority. Error budgets provide data for better decision-making. When faced with incidents outside SREs' control, cultural shifts ensure long-term success.
+
+Building a successful team in large enterprises is challenging. A culture emphasizing knowledge sharing, collaboration, and preparation is more beneficial than runbooks alone.
+
+
+SREs protect critical systems, facing excitement and risk of burnout. Reliable systems require quick improvements and avoidance of delay-inducing processes. Modernize systems incrementally, focusing on small, frequent deployments to manage risk.
+
+Establishing a solid SRE culture is vital for sustainable success. Comprehensive documentation should not undergo the same review as code. Heroes do their best work as part of a team; a hero culture isn’t essential.
+
+
+

Alert volume vs effectiveness


+
+The volume of alerts isn’t as critical as handling them effectively. Trust, ownership, communication, and collaboration underpin successful teams, improving processes and reliability. Like maintaining fire safety, regularly test systems to prevent outages.
+
+
+Introducing SRE courses in academia would signify a new era in engineering.
+
+Other book notes of mine are:
+
+
+E-Mail your comments to paul@nospam.buetow.org :-)
+
+Back to the main site
+ + + diff --git a/notes/implementing-service-level-objectives.html b/notes/implementing-service-level-objectives.html new file mode 100644 index 00000000..865bd86a --- /dev/null +++ b/notes/implementing-service-level-objectives.html @@ -0,0 +1,109 @@ + + + + +'Implementing Service Level Objectives' book notes + + + + + +

+Home | Markdown | Gemini +

+

"Implementing Service Level Objectives" book notes


+
+These are my personal book notes on Alex Hidalgo's "Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets". They are for myself, but I hope they might be useful to you too.
+
+

Table of Contents


+
+
+

Introduction


+
+Service Level Objectives (SLOs) are a fundamental component in ensuring service reliability, enhancing engineering effectiveness, and aligning organizational goals. Below is a comprehensive guide to understanding and implementing SLOs, focusing on the critical documentation required and the three phases of SLO implementation.
+
+

Importance of Documentation


+
+Documentation Support: Strong documentation is essential in supporting both you and your organization throughout the SLO implementation process. It provides clarity and guidance, making the transition smoother and more efficient.
+
+

Implementation Phases


+
+

The Three Phases of SLO Implementation


+
+
+

Phase 1: Defining SLOs


+
+Strategy Document:
+
+Create a one-page strategy document. This document is vital in the initial 'crawl' phase, outlining what you are trying to achieve, why, and how. It should be concise, allowing anyone to read it in less than ten minutes. It's crucial to get this document right, as it answers:
+
+
+SLO Definition Document:
+
+Draft a two-page document providing a high-level definition of SLOs, including examples of effective ones. This should guide engineers by making SLO implementation accessible and generate interest without overwhelming them with volumes of information.
+
+FAQ Document:
+
+Compile a FAQ document to address anticipated questions as teams begin their SLO journey. Example questions include:
+
+
+

Phase 2: Collecting SLIs


+
+Instrumentation Guide:
+
+Once the high-level SLO definition is clear, provide a detailed guide on how to instrument services to collect SLIs. Be specific and include examples from your organization's monitoring platforms. Address scenarios like collecting latency data, using percentiles, and instrumenting different types of services. Offer code snippets to facilitate the instrumentation process.
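The kind of snippet such an instrumentation guide might contain can be sketched as follows: a nearest-rank percentile over collected latency samples, in plain Python. The sample values and the function name are assumptions for illustration; a real guide would use the organization's own monitoring platform.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 18, 17, 900]

p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```

In this made-up data the median is 15 ms while the 99th percentile is 900 ms, which is why a latency SLI is usually phrased in terms of percentiles rather than an average.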
+
+

Phase 3: Utilizing SLOs


+
+Use Case Documentation:
+
+
+

Best Practices


+
+Quality Documentation:
+
+
+This systematic approach to SLO implementation, supported by robust documentation, will help your organization effectively adopt SLOs and improve overall service reliability.
+Other book notes of mine are:
+
+
+E-Mail your comments to paul@nospam.buetow.org :-)
+
+Back to the main site
+ + + diff --git a/notes/index.html b/notes/index.html index 5044c940..fef4aa83 100644 --- a/notes/index.html +++ b/notes/index.html @@ -23,6 +23,7 @@ 'The Obstacle is the Way' book notes
'Staff Engineer' book notes
'Slow Productivity' book notes
+'Site Reliability Engineering' book notes
'Search Inside Yourself' book notes
'Never split the difference' book notes
'Mind Management' book notes
@@ -30,10 +31,12 @@ 'Love People, Use Things' book notes
'Joy on Demand' book notes
'Influence without Authority' book notes
+'Implementing Service Level Objectives' book notes
'Fluent Forever' book notes
'Eat That Frog' book notes
'Software Developer's Career Guide and Soft Skills' book notes
'A Monk's Guide to Happiness' book notes
+'97 Things Every SRE Should Know' book notes

Those were all the notes. Hope they were useful!

diff --git a/notes/site-reliability-engineering.html b/notes/site-reliability-engineering.html new file mode 100644 index 00000000..e84761ed --- /dev/null +++ b/notes/site-reliability-engineering.html @@ -0,0 +1,112 @@ + + + + +'Site Reliability Engineering' book notes + + + + + +

+Home | Markdown | Gemini +

+

"Site Reliability Engineering" book notes


+
+These are my personal book notes on Niall Richard Murphy's "Site Reliability Engineering: How Google Runs Production Systems". They are for myself, but I hope they might be useful to you too.
+
+

Table of Contents


+
+
+

Key Concepts in SRE


+
+

Role of an SRE:


+
+Ideally, SREs should spend no more than 50% of their time on operational work. The focus should primarily be on development activities. Systems should self-heal automatically.
+
+

Error Budget


+
+No development work should occur when the error budget is exceeded for a whole quarter, which requires strong management support. Error budgets help resolve conflicts between development and operational work by creating a common incentive, allowing both product development and SRE teams to balance innovation with reliability. This removes the need for negotiations on the number of feature changes allowed.
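The arithmetic behind an error budget can be sketched in a few lines of Python. This assumes a time-based availability SLO and a simple "stop when the budget is spent" policy; the function names and the example numbers are illustrative, not taken from the book.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the window for a time-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent; negative once it is exceeded."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


def feature_work_allowed(slo_target: float, window_days: int, bad_minutes: float) -> bool:
    """Assumed policy: feature work continues only while the budget is not exceeded."""
    return budget_remaining(slo_target, window_days, bad_minutes) >= 0.0


# A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
monthly_budget = error_budget_minutes(0.999, 30)
```

Because both teams read the same number, the question of whether another release can ship becomes a lookup rather than a negotiation.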
+
+

On-call Management


+
+An on-call engineer should encounter a maximum of two events per eight hours to ensure sufficient time for cleanup and post-mortems. This allows thorough investigation and learning without overwhelming engineers. Monitoring should alert only when human intervention is required. Logs should be used for later forensics and not require immediate attention. Uptime is calculated from the share of successful requests, potentially broken down by source, taking volume, partial writes, and HTTP response codes into account.
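A team could check its paging load against the two-events-per-eight-hours guideline with something like the sketch below. The constant names, the function names, and the timestamps are hypothetical; real data would come from the paging system.

```python
from datetime import datetime, timedelta
from typing import Sequence

SHIFT_LENGTH = timedelta(hours=8)
MAX_EVENTS_PER_SHIFT = 2  # the limit quoted in the notes above


def events_in_shift(events: Sequence[datetime], shift_start: datetime) -> int:
    """Count events inside the half-open window [shift_start, shift_start + 8h)."""
    shift_end = shift_start + SHIFT_LENGTH
    return sum(1 for t in events if shift_start <= t < shift_end)


def shift_overloaded(events: Sequence[datetime], shift_start: datetime) -> bool:
    """True when a shift saw more events than the guideline allows."""
    return events_in_shift(events, shift_start) > MAX_EVENTS_PER_SHIFT


# Hypothetical pages: three during the first shift, one during the next.
start = datetime(2025, 2, 21, 8, 0)
pages = [start + timedelta(hours=h) for h in (1, 3, 7, 9)]

first_shift_overloaded = shift_overloaded(pages, start)
second_shift_overloaded = shift_overloaded(pages, start + SHIFT_LENGTH)
```

Shifts flagged this way are candidates for alert tuning or for redistributing the rotation.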
+
+

Reliability Metrics


+
+
+

Service Indicators


+
+Choose an appropriate number of SLIs/KPIs to maintain focus without missing vital aspects of system performance. KPIs and SLIs are crucial for real-time metrics like uptime, latency, and throughput, often aggregated for analysis. Risk tolerance should be set in collaboration with product teams for user-facing services.
+
+

Metrics and Error Rates


+
+Accurate measurement involves considering all system components, including infrastructure error rates from networking, hardware, etc. High-availability (HA) solutions and ISP background error rates can influence the impact of network outages on the error budget.
+
+

Testing and Monitoring


+
+Regular DR/Chaos testing is essential to gauge the impact of outages (like DC outages) on availability. Comprehensive testing ensures systems can handle variable loads without catastrophic failure. Monitoring and alert systems should swiftly address concerns, measuring latency on errors to distinguish 'slow' from 'fast' failures.
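One way to see why latency should be measured on errors too is to report it separately per outcome, as in this small Python sketch. The threshold, the function names, and the sample data are assumptions for illustration.

```python
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_latency_by_outcome(requests: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Median latency in ms, reported separately for successes and errors.

    Each request is a (latency_ms, ok) pair.
    """
    buckets: Dict[str, List[float]] = {"ok": [], "error": []}
    for latency_ms, ok in requests:
        buckets["ok" if ok else "error"].append(latency_ms)
    return {outcome: median(values) for outcome, values in buckets.items() if values}


def classify_failure(latency_ms: float, slow_threshold_ms: float = 1000.0) -> str:
    """Label a failed request 'slow' (e.g. a timeout) or 'fast' (e.g. an immediate rejection)."""
    return "slow" if latency_ms >= slow_threshold_ms else "fast"


# Hypothetical traffic: three successes, two fast failures, one timeout.
observed = [(20.0, True), (25.0, True), (30.0, True),
            (2.0, False), (3.0, False), (5000.0, False)]
medians = median_latency_by_outcome(observed)
```

If errors were mixed into a single latency series, the fast failures in this sample would make the service look quicker than it is, while the one timeout would be hidden in the tail.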
+
+

Automation and Human Involvement


+
+While automation can replace manual error resolution, maintaining human expertise is vital to operate systems when automation fails or becomes opaque over time.
+
+

SRE Work Distribution


+
+Google SREs, for example, allocate their work as 25% on-call, 25% non-urgent operations, and 50% engineering tasks.
+
+

Post-mortem Practices


+
+Creating post-mortems is a learning opportunity, not a punishment. They must be deliberate and not merely procedural. Post-mortems should be comprehensive to ensure lessons are applied effectively.
+
+

Load Testing


+
+Proper load testing identifies when a system begins rejecting traffic and observes how it handles excess load. Systems should be tested at the subsystem level to identify different thresholds.
+
+

Criticality and Throttling


+
+Client-side rate limiting can implement adaptive throttling based on error counts. Systems should be designed to prioritize requests of higher criticality.
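A common formulation of client-side adaptive throttling, as far as I recall it from the SRE book's discussion of handling overload, rejects requests locally with probability max(0, (requests - K * accepts) / (requests + 1)). The Python sketch below implements that formula. The function names, the default K = 2, and the injectable random source are assumptions of this sketch, and keeping the counters over a sliding time window is left out.

```python
import random
from typing import Callable


def reject_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability of rejecting a request locally, before it reaches the backend.

    requests: calls the client attempted in the recent window
    accepts:  calls the backend accepted in the same window
    k:        multiplier; lower values throttle more aggressively
    """
    if requests < 0 or accepts < 0:
        raise ValueError("counters must be non-negative")
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_reject(requests: int, accepts: int, k: float = 2.0,
                  rng: Callable[[], float] = random.random) -> bool:
    """Randomly decide whether to drop this request on the client side."""
    return rng() < reject_probability(requests, accepts, k)


healthy = reject_probability(requests=100, accepts=100)    # backend accepts everything
overloaded = reject_probability(requests=100, accepts=20)  # backend rejects 80%
```

While the backend accepts most traffic the probability stays at zero; as rejections (requests minus accepts) pile up, the client sheds a growing share of its own load.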
+
+

Toil Management


+
+Toil should account for less than 50% of an SRE's work and must be minimized. Toil is repetitive, manual work that could be automated. A balance must be struck, as occasional toil can prove insightful, but excessive toil detrimentally affects morale and productivity. Different engineers have varied thresholds for tolerating toil, influencing job satisfaction and retention.
+
+

Efficient Operations


+
+Toil, overhead, and non-operational tasks should be distinguished from core operational activities, which do not relate to direct HR or interview processes. Monitoring alerts should inform the necessary actions with clear context ("the what and the why") to minimize unnecessary manual efforts.
+
+Other book notes of mine are:
+
+
+E-Mail your comments to paul@nospam.buetow.org :-)
+
+Back to the main site
+ + + -- cgit v1.2.3