author Paul Buetow <paul@buetow.org> 2024-09-07 16:33:42 +0300
committer Paul Buetow <paul@buetow.org> 2024-09-07 16:33:42 +0300
commit ae825dce02aea0bfbe7aeeceba470e5b16d4eafb (patch)
tree 7542e9c6bdb205be6fa2a08929dc6a6cdf114520
parent bbf4aa0ed380f1b7892bd9ec6eb38ad2a236c5a5 (diff)
Update content for md
-rw-r--r-- gemfeed/2023-08-18-site-reliability-engineering-part-1.md | 7
-rw-r--r-- gemfeed/2023-11-19-site-reliability-engineering-part-2.md | 11
-rw-r--r-- gemfeed/2024-01-09-site-reliability-engineering-part-3.md | 9
-rw-r--r-- gemfeed/2024-09-07-site-reliability-engineering-part-4.md | 69
-rw-r--r-- gemfeed/index.md | 5
-rw-r--r-- index.md | 7
-rw-r--r-- uptime-stats.md | 2
7 files changed, 92 insertions, 18 deletions
diff --git a/gemfeed/2023-08-18-site-reliability-engineering-part-1.md b/gemfeed/2023-08-18-site-reliability-engineering-part-1.md
index 37976a18..5c61f9ac 100644
--- a/gemfeed/2023-08-18-site-reliability-engineering-part-1.md
+++ b/gemfeed/2023-08-18-site-reliability-engineering-part-1.md
@@ -5,8 +5,9 @@
Being a Site Reliability Engineer (SRE) is like stepping into a lively, ever-evolving universe. The world of SRE mixes together different tech, a unique culture, and a whole lot of determination. It’s one of the toughest but most exciting jobs out there. There's zero chance of getting bored because there's always a fresh challenge to tackle and new technology to play around with. It's not just about the tech side of things either; it's heavily rooted in communication, collaboration, and teamwork. As someone currently working as an SRE, I’m here to break it all down for you in this blog series. Let's dive into what SRE is really all about!
[2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture (You are currently reading this)](./2023-08-18-site-reliability-engineering-part-1.md)
-[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE](./2023-11-19-site-reliability-engineering-part-2.md)
-[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Side](./2024-01-09-site-reliability-engineering-part-3.md)
+[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance](./2023-11-19-site-reliability-engineering-part-2.md)
+[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers](./2024-09-07-site-reliability-engineering-part-4.md)
```
▓▓▓▓░░
@@ -52,7 +53,7 @@ Organizations that have SLIs, SLOs, and error budgets in place are already prett
Continue with the second part of this series:
-[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE](./2023-11-19-site-reliability-engineering-part-2.md)
+[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance](./2023-11-19-site-reliability-engineering-part-2.md)
E-Mail your comments to `paul@nospam.buetow.org` :-)
diff --git a/gemfeed/2023-11-19-site-reliability-engineering-part-2.md b/gemfeed/2023-11-19-site-reliability-engineering-part-2.md
index 9bddfbcc..29749257 100644
--- a/gemfeed/2023-11-19-site-reliability-engineering-part-2.md
+++ b/gemfeed/2023-11-19-site-reliability-engineering-part-2.md
@@ -1,12 +1,13 @@
-# Site Reliability Engineering - Part 2: Operational Balance in SRE
+# Site Reliability Engineering - Part 2: Operational Balance
> Published at 2023-11-19T00:18:18+03:00
This is the second part of my Site Reliability Engineering (SRE) series. I am currently employed as a Site Reliability Engineer and will try to share what SRE is about in this blog series.
[2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture](./2023-08-18-site-reliability-engineering-part-1.md)
-[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE (You are currently reading this)](./2023-11-19-site-reliability-engineering-part-2.md)
-[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Side](./2024-01-09-site-reliability-engineering-part-3.md)
+[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance (You are currently reading this)](./2023-11-19-site-reliability-engineering-part-2.md)
+[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers](./2024-09-07-site-reliability-engineering-part-4.md)
```
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣾⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
@@ -23,7 +24,7 @@ This is the second part of my Site Reliability Engineering (SRE) series. I am cu
⠀⠀⠀⠀⠀⠀⠴⠶⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠿⠶⠦⠀⠀
```
-## Operational Balance in SRE: Striking the Right Balance Between Reliability and Speed
+## Striking the Right Balance Between Reliability and Speed
Site Reliability Engineering is more than just a bunch of best practices or methods. It's a guiding light for engineering teams, helping them navigate the tricky waters of modern software development and system management.
In the world of software production, there are two big forces that often clash: the push for fast feature releases (velocity) and the need for reliable systems. Traditionally, moving faster meant more risk. SRE helps balance these opposing goals with things like error budgets and SLIs/SLOs. These tools give teams a clear way to measure how much they can push changes without hurting system health. So, the error budget becomes a balancing act, helping teams trade off between innovation and reliability.
@@ -42,7 +43,7 @@ That all sounds pretty idealistic. The reality is that getting the perfect balan
Continue with the third part of this series:
-[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Side](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture](./2024-01-09-site-reliability-engineering-part-3.md)
E-Mail your comments to `paul@nospam.buetow.org` :-)
diff --git a/gemfeed/2024-01-09-site-reliability-engineering-part-3.md b/gemfeed/2024-01-09-site-reliability-engineering-part-3.md
index 314f50e0..af82c899 100644
--- a/gemfeed/2024-01-09-site-reliability-engineering-part-3.md
+++ b/gemfeed/2024-01-09-site-reliability-engineering-part-3.md
@@ -1,12 +1,13 @@
-# Site Reliability Engineering - Part 3: On-Call Culture and the Human Side
+# Site Reliability Engineering - Part 3: On-Call Culture
> Published at 2024-01-09T18:35:48+02:00
Welcome to Part 3 of my Site Reliability Engineering (SRE) series. I'm currently working as a Site Reliability Engineer, and I’m here to share what SRE is all about in this blog series.
[2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture](./2023-08-18-site-reliability-engineering-part-1.md)
-[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance in SRE](./2023-11-19-site-reliability-engineering-part-2.md)
-[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture and the Human Side (You are currently reading this)](./2024-01-09-site-reliability-engineering-part-3.md)
+[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance](./2023-11-19-site-reliability-engineering-part-2.md)
+[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture (You are currently reading this)](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers](./2024-09-07-site-reliability-engineering-part-4.md)
```
..--""""----..
@@ -34,7 +35,7 @@ Welcome to Part 3 of my Site Reliability Engineering (SRE) series. I'm currently
```
-## On-Call Culture and the Human Side: Putting Well-being First in the World of Reliability
+## Putting Well-being First
Site Reliability Engineering is all about keeping systems reliable, but we often forget how important the human side is. A healthy on-call culture is just as crucial as any technical fix. The well-being of the engineers really matters.
diff --git a/gemfeed/2024-09-07-site-reliability-engineering-part-4.md b/gemfeed/2024-09-07-site-reliability-engineering-part-4.md
new file mode 100644
index 00000000..3425956a
--- /dev/null
+++ b/gemfeed/2024-09-07-site-reliability-engineering-part-4.md
@@ -0,0 +1,69 @@
+# Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers
+
+> Published at 2024-09-07T16:27:58+03:00
+
+Welcome to Part 4 of my Site Reliability Engineering (SRE) series. I'm currently working as a Site Reliability Engineer, and I’m here to share what SRE is all about in this blog series.
+
+[2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture](./2023-08-18-site-reliability-engineering-part-1.md)
+[2023-11-19 Site Reliability Engineering - Part 2: Operational Balance](./2023-11-19-site-reliability-engineering-part-2.md)
+[2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers (You are currently reading this)](./2024-09-07-site-reliability-engineering-part-4.md)
+
+```
+ __..._ _...__
+ _..-" `Y` "-._
+ \ Once upon | /
+ \\ a time..| //
+ \\\ | ///
+ \\\ _..---.|.---.._ ///
+jgs \\`_..---.Y.---.._`//
+```
+
+This time, I want to share some tips on how to onboard software engineers, QA engineers, and Site Reliability Engineers (SREs) to the primary on-call rotation. Traditionally, onboarding might take half a year (depending on the complexity of the infrastructure), but with a bit of strategy and structured sessions, we've managed to reduce it to just six weeks per person. Let's dive in!
+
+## Setting the Scene: Tier-1 On-Call Rotation
+
+First things first, let's talk about Tier-1. This is where the magic begins. Tier-1 covers over 80% of the common on-call cases and is the perfect place for new on-call engineers to get their feet wet. It's designed to be a manageable training ground.
+
+### Why Tier-1?
+
+* Easy to Understand: Every on-call engineer should be familiar with Tier-1 tasks.
+* Training Ground: This is where engineers start their on-call career. It's purposefully kept simple so that it's not overwhelming right off the bat.
+* Runbook/recipe driven: Every alert is attached to a comprehensive runbook, making it easy for every engineer to follow.
+
+## Onboarding Process: From 6 Months to 6 Weeks
+
+So how did we cut down the onboarding time so drastically? Here’s the breakdown of our process:
+
+Knowledge Transfer (KT) Sessions: We kicked things off with more than 10 KT sessions, complete with video recordings. These sessions are comprehensive and cover everything from the basics to some more advanced topics. The recorded sessions mean that new engineers can revisit them anytime they need a refresher.
+
+Shadowing Sessions: Each new engineer shadows two full on-call weeks. This hands-on experience is invaluable. They get to see real-time incident handling and resolution, gaining practical knowledge that's hard to get from just reading docs.
+
+Comprehensive Runbooks: We created 64 runbooks (by the time of writing, probably more than 100) that are composable like Lego bricks. Each runbook covers a specific scenario and guides the engineer step-by-step to resolution. Monitoring alerts link directly to Confluence docs, and from there to the respective runbooks, so every alert can be navigated with ease (well, there are always exceptions to the rule...).
+
+Self-Sufficiency & Confidence Building: With all these resources at their fingertips, our on-call engineers become self-sufficient for most of the common issues they'll face (new starters can now handle around 80% of the most common issues six weeks after joining the company). This boosts their confidence and ensures they can handle Tier-1 incidents independently.
+
+Documentation and Feedback Loop: Continuous improvement is key. We regularly update our documentation based on feedback from the engineers. This makes our process even more robust and user-friendly.
+
+## It's All About the Tiers
+
+Let’s briefly touch on the Tier levels:
+
+* Tier 1: Easy and foundational tasks. Perfect for getting new engineers started. This covers around 80% of all on-call cases we face. This is what we trained on.
+* Tier 2: Slightly more complex, requiring more background knowledge. We trained on some of the topics but not all.
+* Tier 3: Requires a good understanding of the platform/architecture. Likely needs KT sessions with domain experts.
+* Tier DE (Domain Expert): The heavy hitters. Domain experts are required for these tasks.
+
+### Growing into Higher Tiers
+
+From Tier-1, engineers naturally grow into Tier-2 and beyond. The structured training and gradual increase in complexity help ensure a smooth transition as they gain experience and confidence. The key here is that engineers stay curious and engaged in on-call work, so that they always keep learning.
+
+## Keeping Runbooks Up to Date
+
+Runbooks are not a "project to be finished"; they have to be maintained and updated over time. Sections may change, new runbooks need to be added, and old ones can be deleted. So the acceptance criteria of an on-call shift should include not just reacting to alerts and incidents, but also reviewing and updating the current runbooks.
+
+## Conclusion
+
+By structuring the onboarding process with KT sessions, shadowing, comprehensive runbooks, and a feedback loop, we've been able to fast-track the process from six months to just six weeks. This not only prepares our engineers for the on-call rotation quicker but also ensures they're confident and capable when handling incidents.
+
+If you're looking to optimize your on-call onboarding process, these strategies could be your ticket to a more efficient and effective transition. Happy on-calling!
diff --git a/gemfeed/index.md b/gemfeed/index.md
index 0b888b34..1ec71ea7 100644
--- a/gemfeed/index.md
+++ b/gemfeed/index.md
@@ -2,6 +2,7 @@
## To be in the .zone!
+[2024-09-07 - Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers](./2024-09-07-site-reliability-engineering-part-4.md)
[2024-09-07 - Projects I support](./2024-09-07-projects-i-support.md)
[2024-08-05 - Typing `127.1` words per minute (`>100wpm average`)](./2024-08-05-typing-127.1-words-per-minute.md)
[2024-07-07 - 'The Stoic Challenge' book notes](./2024-07-07-the-stoic-challenge-book-notes.md)
@@ -13,9 +14,9 @@
[2024-03-03 - A fine Fyne Android app for quickly logging ideas programmed in Go](./2024-03-03-a-fine-fyne-android-app-for-quickly-logging-ideas-programmed-in-golang.md)
[2024-02-04 - From `babylon5.buetow.org` to `*.buetow.cloud`](./2024-02-04-from-babylon5.buetow.org-to-.cloud.md)
[2024-01-13 - One reason why I love OpenBSD](./2024-01-13-one-reason-why-i-love-openbsd.md)
-[2024-01-09 - Site Reliability Engineering - Part 3: On-Call Culture and the Human Side](./2024-01-09-site-reliability-engineering-part-3.md)
+[2024-01-09 - Site Reliability Engineering - Part 3: On-Call Culture](./2024-01-09-site-reliability-engineering-part-3.md)
[2023-12-10 - Bash Golf Part 3](./2023-12-10-bash-golf-part-3.md)
-[2023-11-19 - Site Reliability Engineering - Part 2: Operational Balance in SRE](./2023-11-19-site-reliability-engineering-part-2.md)
+[2023-11-19 - Site Reliability Engineering - Part 2: Operational Balance](./2023-11-19-site-reliability-engineering-part-2.md)
[2023-11-11 - 'Mind Management' book notes](./2023-11-11-mind-management-book-notes.md)
[2023-10-29 - KISS static web photo albums with `photoalbum.sh`](./2023-10-29-kiss-static-web-photo-albums-with-photoalbum.sh.md)
[2023-09-25 - DTail usage examples](./2023-09-25-dtail-usage-examples.md)
diff --git a/index.md b/index.md
index 44f571a0..225e2077 100644
--- a/index.md
+++ b/index.md
@@ -1,6 +1,6 @@
# foo.zone
-> This site was generated at 2024-09-07T16:11:00+03:00 by `Gemtexter`
+> This site was generated at 2024-09-07T16:32:50+03:00 by `Gemtexter`
Welcome to the foo.zone. Everything you read on this site is my personal opinion and experience. You can call me a Linux/*BSD enthusiast and hobbyist. I mainly write about tech, IT, programming and sometimes also about self-improvement here. Note that this blog usually does not overlap with what I do at my day job as a Site Reliability Engineer.
@@ -30,6 +30,7 @@ If you reach this site via the modern web, please read this:
### Posts
+[2024-09-07 - Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers](./gemfeed/2024-09-07-site-reliability-engineering-part-4.md)
[2024-09-07 - Projects I support](./gemfeed/2024-09-07-projects-i-support.md)
[2024-08-05 - Typing `127.1` words per minute (`>100wpm average`)](./gemfeed/2024-08-05-typing-127.1-words-per-minute.md)
[2024-07-07 - 'The Stoic Challenge' book notes](./gemfeed/2024-07-07-the-stoic-challenge-book-notes.md)
@@ -41,9 +42,9 @@ If you reach this site via the modern web, please read this:
[2024-03-03 - A fine Fyne Android app for quickly logging ideas programmed in Go](./gemfeed/2024-03-03-a-fine-fyne-android-app-for-quickly-logging-ideas-programmed-in-golang.md)
[2024-02-04 - From `babylon5.buetow.org` to `*.buetow.cloud`](./gemfeed/2024-02-04-from-babylon5.buetow.org-to-.cloud.md)
[2024-01-13 - One reason why I love OpenBSD](./gemfeed/2024-01-13-one-reason-why-i-love-openbsd.md)
-[2024-01-09 - Site Reliability Engineering - Part 3: On-Call Culture and the Human Side](./gemfeed/2024-01-09-site-reliability-engineering-part-3.md)
+[2024-01-09 - Site Reliability Engineering - Part 3: On-Call Culture](./gemfeed/2024-01-09-site-reliability-engineering-part-3.md)
[2023-12-10 - Bash Golf Part 3](./gemfeed/2023-12-10-bash-golf-part-3.md)
-[2023-11-19 - Site Reliability Engineering - Part 2: Operational Balance in SRE](./gemfeed/2023-11-19-site-reliability-engineering-part-2.md)
+[2023-11-19 - Site Reliability Engineering - Part 2: Operational Balance](./gemfeed/2023-11-19-site-reliability-engineering-part-2.md)
[2023-11-11 - 'Mind Management' book notes](./gemfeed/2023-11-11-mind-management-book-notes.md)
[2023-10-29 - KISS static web photo albums with `photoalbum.sh`](./gemfeed/2023-10-29-kiss-static-web-photo-albums-with-photoalbum.sh.md)
[2023-09-25 - DTail usage examples](./gemfeed/2023-09-25-dtail-usage-examples.md)
diff --git a/uptime-stats.md b/uptime-stats.md
index 1658eced..bbeeef1b 100644
--- a/uptime-stats.md
+++ b/uptime-stats.md
@@ -1,6 +1,6 @@
# My machine uptime stats
-> This site was last updated at 2024-09-07T16:11:00+03:00
+> This site was last updated at 2024-09-07T16:32:50+03:00
The following stats were collected via `uptimed` on all of my personal computers over many years and the output was generated by `guprecords`, the global uptime records stats analyser of mine.