| author | Paul Buetow <paul@buetow.org> | 2025-01-15 00:17:16 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2025-01-15 00:17:16 +0200 |
| commit | 26dd99a97c483ffdacba22fcda2faa8a10148c0f (patch) | |
| tree | f1eecb0811035b1a8ad774044c854aaff54ea55f /gemfeed | |
| parent | 5c0cbc2301e3faf793f7d0b989c43f44cc56f182 (diff) | |
Update content for md
Diffstat (limited to 'gemfeed')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | gemfeed/2024-12-03-f3s-kubernetes-with-freebsd-part-2.md | 3 |
| -rw-r--r-- | gemfeed/2025-01-15-working-with-an-sre-interview.md | 177 |
| -rw-r--r-- | gemfeed/DRAFT-f3s-kubernetes-with-freebsd-bhyve.md | 91 |
| -rw-r--r-- | gemfeed/index.md | 1 |
4 files changed, 256 insertions, 16 deletions
diff --git a/gemfeed/2024-12-03-f3s-kubernetes-with-freebsd-part-2.md b/gemfeed/2024-12-03-f3s-kubernetes-with-freebsd-part-2.md
index 23dd19fa..06c11371 100644
--- a/gemfeed/2024-12-03-f3s-kubernetes-with-freebsd-part-2.md
+++ b/gemfeed/2024-12-03-f3s-kubernetes-with-freebsd-part-2.md
@@ -131,6 +131,7 @@ root@f0:~ # freebsd-update reboot
 ```
 I also added the following entries for the three FreeBSD boxes to the `/etc/hosts` file:
+
 ```sh
 root@f0:~ # cat <<END >>/etc/hosts
 192.168.1.130 f0 f0.lan f0.lan.buetow.org
@@ -139,6 +140,8 @@ root@f0:~ # cat <<END >>/etc/hosts
 END
 ```
+You might wonder why bother using the hosts file? Why not use DNS properly? The reason is simplicity. I don't manage 100 hosts, only a few here and there. Having an OpenWRT router in my home, I could also configure everything there, but maybe I'll do that later. For now, keep it simple and straightforward.
+
 ## After install
 After that, I installed the following additional packages:
diff --git a/gemfeed/2025-01-15-working-with-an-sre-interview.md b/gemfeed/2025-01-15-working-with-an-sre-interview.md
new file mode 100644
index 00000000..e9556cfb
--- /dev/null
+++ b/gemfeed/2025-01-15-working-with-an-sre-interview.md
@@ -0,0 +1,177 @@
+# Working with an SRE Interview
+
+> Published at 2025-01-15T00:16:04+02:00
+
+I have been interviewed by Florian Buetow about what it's like working with a Site Reliability Engineer from the point of view of a Software Engineer, Data Scientist, and AI Engineer.
+
+[See original interview here](https://www.cracking-ai-engineering.com/writing/2025/01/12/working-with-an-sre-interview/)
+
+Below, I am posting the interview here on my blog as well.
+
+## Table of Contents
+
+* [⇢ Working with an SRE Interview](#working-with-an-sre-interview)
+* [⇢ ⇢ Preamble ](#preamble-)
+* [⇢ ⇢ Introducing Paul](#introducing-paul)
+* [⇢ ⇢ How did you get started?](#how-did-you-get-started)
+* [⇢ ⇢ Roles and Career Progression](#roles-and-career-progression)
+* [⇢ ⇢ Anecdotes and Best Practices](#anecdotes-and-best-practices)
+* [⇢ ⇢ Working with Different Teams](#working-with-different-teams)
+* [⇢ ⇢ Using AI Tools](#using-ai-tools)
+* [⇢ ⇢ SRE Learning Resources](#sre-learning-resources)
+* [⇢ ⇢ Blogging](#blogging)
+* [⇢ ⇢ Wrap-up](#wrap-up)
+* [⇢ ⇢ Closing comments](#closing-comments)
+
+## Preamble
+
+In this insightful interview, Paul Bütow, a Principal Site Reliability Engineer at Mimecast, shares over a decade of experience in the field. Paul highlights the role of an Embedded SRE, emphasizing the importance of automation, observability, and effective incident management. We also focused on the key question of how you can work effectively with an SRE weather you are an individual contributor or a manager, a software engineer or data scientist. And how you can learn more about site reliability engineering.
+
+## Introducing Paul
+
+Hi Paul, please introduce yourself briefly to the audience. Who are you, what do you do for a living, and where do you work?
+
+> My name is Paul Bütow, I work at Mimecast, and I’m a Principal Site Reliability Engineer there. I’ve been with Mimecast for almost ten years now. The company specializes in email security, including things like archiving, phishing detection, malware protection, and spam filtering.
+
+You mentioned that you’re an ‘Embedded SRE.’ What does that mean exactly?
+
+> It means that I’m directly part of the software engineering team, not in a separate Ops department. I ensure that nothing is deployed manually, and everything runs through automation. I also set up monitoring and observability. These are two distinct aspects: monitoring alerts us when something breaks, while observability helps us identify trends. I also create runbooks so we know what to do when specific incidents occur frequently.
+
+> Infrastructure SREs on the other hand handle the foundational setup, like providing the Kubernetes cluster itself or ensuring the operating systems are installed. They don't work on the application directly but ensure the base infrastructure is there for others to use. This works well when a company has multiple teams that need shared infrastructure.
+
+## How did you get started?
+
+How did your interest in Linux or FreeBSD start?
+
+> It began during my school days. We had a PC with DOS at home, and I eventually bought Suse Linux 5.3. Shortly after, I discovered FreeBSD because I liked its handbook so much. I wanted to understand exactly how everything worked, so I also tried Linux from Scratch. That involves installing every package manually to gain a better understanding of operating systems.
+
+[https://www.FreeBSD.org](https://www.FreeBSD.org)
+[https://linuxfromscratch.org/](https://linuxfromscratch.org/)
+
+And after school, you pursued computer science, correct?
+
+> Exactly. I wasn’t sure at first whether I wanted to be a software developer or a system administrator. I applied for both and eventually accepted an offer as a Linux system administrator. This was before 'SRE' became a buzzword, but much of what I did back then-automation, infrastructure as code, monitoring-is now considered part of the typical SRE role.
+
+## Roles and Career Progression
+
+Tell us about how you joined Mimecast. When did you fully embrace the SRE role?
+
+> I started as a Linux sysadmin at 1&1. I managed an ad server farm with hundreds of systems and later handled load balancers. Together with an architect, we managed F5 load balancers distributing around 2,000 services, including for portals like web.de and GMX. I also led the operations team technically for a while before moving to London to join Mimecast.
+
+> At Mimecast, the job title was explicitly 'Site Reliability Engineer.' The biggest difference was that I was no longer in a separate Ops department but embedded directly within the storage and search backend team. I loved that because we could plan features together-from automation to measurability and observability. Mimecast also operates thousands of physical servers for email archiving, which was fascinating since I already had experience with large distributed systems at 1&1. It was the right step for me because it allowed me to work close to the code while remaining hands-on with infrastructure.
+
+What are the differences between SRE, DevOps, SysAdmin, and Architects?
+
+> SREs are like the next step after SysAdmins. A SysAdmin might manually install servers, replace disks, or use simple scripts for automation, while SREs use infrastructure as code and focus on reliability through SLIs, SLOs, and automation. DevOps isn’t really a job-it’s more of a way of working, where developers are involved in operations tasks like setting up CI/CD pipelines or on-call shifts. Architects focus on designing systems and infrastructures, such as load balancers or distributed systems, working alongside SREs to ensure the systems meet the reliability and scalability requirements. The specific responsibilities of each role depend on the company, and there is often overlap.
+
+What are the most important reliability lessons you’ve learned so far?
+
+* Don’t leave SRE aspects as an afterthought. It’s much better to discuss automation, monitoring, SLIs, and SLOs early on. Traditional sysadmins often installed systems manually, but today, we do everything via infrastructure as code-using tools like Terraform or Puppet.
+* I also distinguish between monitoring and observability. Monitoring tells us, 'The server is down, alarm!' Observability dives deeper, showing trends like increasing latency so we can act proactively.
+* SLI, SLO, and SLA are core elements. We focus on what users actually experience-for example, how quickly an email is sent-and set our goals accordingly.
+* Runbooks are also crucial. When something goes wrong at night, you don’t want to start from scratch. A runbook outlines how to debug and resolve specific problems, saving time and reducing downtime.
+
+## Anecdotes and Best Practices
+
+Runbooks sound very practical. Can you explain how they’re used day-to-day?
+
+> Runbooks are essentially guides for handling specific incidents. For instance, if a service won’t start, the runbook will specify where the logs are and which commands to use. Observability takes it a step further, helping us spot changes early-like rising error rates or latency-so we can address issues before they escalate.
+
+When should you decide to put something into a runbook, and when is it unnecessary?
+
+> If an issue happens frequently, it should be documented in a runbook so that anyone, even someone new, can follow the steps to fix it. The idea is that 90% of the common incidents should be covered. For example, if a service is down, the runbook would specify where to find logs, which commands to check, and what actions to take. On the other hand, rare or complex issues, where the resolution depends heavily on context or varies each time, don’t make sense to include in detail. For those, it’s better to focus on general troubleshooting steps.
+
+How do you search for and find the correct runbooks?
+
+> Runbooks should be linked directly in the alert you receive. For example, if you get an alert about a service not running, the alert will have a link to the runbook that tells you what to check, like logs or commands to run. Runbooks are best stored in an internal wiki, so if you don’t find the link in the alert, you know where to search. The important thing is that runbooks are easy to find and up to date because that’s what makes them useful during incidents.
+
+Do you have an interesting war story you can share with us?
+
+> Sure. At 1&1, we had a proprietary ad server software that ran a SQL query during startup. The query got slower over time, eventually timing out and preventing the server from starting. Since we couldn’t access the source code, we searched the binary for the SQL and patched it. By pinpointing the issue, a developer was able to adjust the SQL. This collaboration between sysadmin and developer perspectives highlights the value of SRE work.
+
+## Working with Different Teams
+
+You’re embedded in a team-how does collaboration with developers work practically?
+
+> We plan everything together from the start. If there’s a new feature, we discuss infrastructure, automated deployments, and monitoring right away. Developers are experts in the code, and I bring the infrastructure expertise. This avoids unpleasant surprises before going live.
+
+How about working with data scientists or ML engineers? Are there differences?
+
+> The principles are the same. ML models also need to be deployed and monitored. You deal with monitoring, resource allocation, and identifying performance drops. Whether it’s a microservice or an ML job, at the end of the day, it’s all running on servers or clusters that must remain stable.
+
+What about working with managers or the FinOps team?
+
+> We often discuss costs, especially in the cloud, where scaling up resources is easy. It’s crucial to know our metrics: do we have enough capacity? Do we need all instances? Or is the CPU only at 5% utilization? This data helps managers decide whether the budget is sufficient or if optimizations are needed.
+
+Do you have practical tips for working with SREs?
+
+> Yes, I have a few:
+
+* Early involvement: Include SREs from the beginning in your project.
+* Runbooks & documentation: Document recurring errors.
+* Try first: Try to understand the issue yourself before immediately asking the SRE.
+* Basic infra knowledge: Kubernetes and Terraform aren’t magic. Some basic understanding helps every developer.
+
+## Using AI Tools
+
+Let’s talk about AI. How do you use it in your daily work?
+
+> For boilerplate code, like Terraform snippets, I often use ChatGPT. It saves time, although I always review and adjust the output. Log analysis is another exciting application. Instead of manually going through millions of lines, AI can summarize key outliers or errors.
+
+Do you think AI could largely replace SREs or significantly change the role?
+
+> I see AI as an additional tool. SRE requires a deep understanding of how distributed systems work internally. While AI can assist with routine tasks or quickly detect anomalies, human expertise is indispensable for complex issues.
+
+## SRE Learning Resources
+
+What resources would you recommend for learning about SRE?
+
+> The Google SRE book is a classic, though a bit dry. I really like 'Seeking SRE,' as it offers various perspectives on SRE, with many practical stories from different companies.
+
+[https://sre.google/books/](https://sre.google/books/)
+[Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856)
+
+Do you have a podcast recommendation?
+
+> The Google SRE prodcast is quite interesting. It offers insights into how Google approaches SRE, along with perspectives from external guests.
+
+[https://sre.google/prodcast/](https://sre.google/prodcast/)
+
+## Blogging
+
+You also have a blog. What motivates you to write regularly?
+
+> Writing helps me learn the most. It also serves as a personal reference. Sometimes I look up how I solved a problem a year ago. And of course, others tackling similar projects might find inspiration in my posts.
+
+What do you blog about?
+
+> Mostly technical topics I find exciting, like homelab projects, Kubernetes, or book summaries on IT and productivity. It’s a personal blog, so I write about what I enjoy.
+
+## Wrap-up
+
+To wrap up, what are three things every team should keep in mind for stability?
+
+> First, maintain runbooks and documentation to avoid chaos at night. Second, automate everything-manual installs in production are risky. Third, define SLIs, SLOs, and SLAs early so everyone knows what we’re monitoring and guaranteeing.
+
+Is there a motto or mindset that particularly inspires you as an SRE?
+
+> "Keep it simple and stupid"-KISS. Not everything has to be overly complex. And always stay curious. I’m still fascinated by how systems work under the hood.
+
+Where can people find you online?
+
+> You can find links to my socials on my website paul.buetow.org
+> I regularly post articles and link to everything else I’m working on outside of work.
+
+[https://paul.buetow.org](https://paul.buetow.org)
+
+Thank you very much for your time and this insightful interview into the world of site reliability engineering
+
+> My pleasure, this was fun.
+
+## Closing comments
+
+Dear reader, I hope this conversation with Paul Bütow provided an exciting peak into the world of Site Reliability Engineering. Whether you’re a software developer, data scientist, ML engineer, or manager, reliable systems are always a team effort. Hopefully, you’ve taken some insights or tips from Paul’s experiences for your own team or next project. Thanks for joining us, and best of luck refining your own SRE practices!
+
+E-Mail your comments to `paul@nospam.buetow.org` :-)
+
+[Back to the main site](../)
diff --git a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-bhyve.md b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-bhyve.md
index 4d769fe2..d9932801 100644
--- a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-bhyve.md
+++ b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-bhyve.md
@@ -16,12 +16,15 @@ This is the third blog post about my f3s series for my self-hosting demands in m
 * [⇢ ⇢ ⇢ ISO download](#iso-download)
 * [⇢ ⇢ ⇢ VM configuration](#vm-configuration)
 * [⇢ ⇢ ⇢ VM installation](#vm-installation)
+* [⇢ ⇢ ⇢ Increase of the disk image](#increase-of-the-disk-image)
+* [⇢ ⇢ ⇢ Connect to VPN](#connect-to-vpn)
+* [⇢ ⇢ After install](#after-install)
 ## Introduction
 In this blog post, we are going to install the Bhyve hypervisor.
-The FreeBSD Bhyve hypervisor is a lightweight, modern hypervisor that enables virtualization on FreeBSD systems. Bhyve's strengths include its minimal overhead, which allows it to achieve near-native performance for virtual machines. It is designed to be efficient and lightweight, leveraging the capabilities of the FreeBSD operating system for performance and network management.
+The FreeBSD Bhyve hypervisor is a lightweight, modern hypervisor that enables virtualization on FreeBSD systems. Bhyve's strengths include its minimal overhead, which allows it to achieve near-native performance for virtual machines. It is designed to be efficient and lightweight, leveraging the capabilities of the FreeBSD operating system for performance and network management.
 Bhyve supports running a variety of guest operating systems, including FreeBSD, Linux, and Windows, on hardware platforms that support hardware virtualization extensions (such as Intel VT-x or AMD-V). In our case, we are going to virtualize Rocky Linux, which later on in this series will be used to run k3s.
@@ -34,35 +37,35 @@ For the management of the Bhyve VMs, we are using `vm-bhyve`, a tool not part of
 The following commands are executed on all three hosts `f0`, `f1`, and `f2`, where `re0` is the name of the Ethernet interface (which may need to be adjusted if your hardware is different):
 ```sh
-paul@f2:~ % doas pkg install vm-bhyve bhyve-firmware
-paul@f2:~ % doas sysrc vm_enable=YES
+paul@f0:~ % doas pkg install vm-bhyve bhyve-firmware
+paul@f0:~ % doas sysrc vm_enable=YES
 vm_enable: -> YES
-paul@f2:~ % doas sysrc vm_dir=zfs:zroot/bhyve
+paul@f0:~ % doas sysrc vm_dir=zfs:zroot/bhyve
 vm_dir: -> zfs:zroot/bhyve
-paul@f2:~ % doas zfs create zroot/bhyve
-paul@f2:~ % doas vm init
-paul@f2:~ % doas vm create public
-paul@f2:~ % doas vm switch add public re0
+paul@f0:~ % doas zfs create zroot/bhyve
+paul@f0:~ % doas vm init
+paul@f0:~ % doas vm switch create public
+paul@f0:~ % doas vm switch add public re0
 ```
 Bhyve stores all it's data in the `/bhyve` of the `zroot` ZFS pool:
 ```sh
-paul@f2:~ % zfs list | grep bhyve
+paul@f0:~ % zfs list | grep bhyve
 zroot/bhyve 1.74M 453G 1.74M /zroot/bhyve
 ```
 For convenience, we also create this symlink:
 ```sh
-paul@f2:~ % doas ln -s /zroot/bhyve/ /bhyve
+paul@f0:~ % doas ln -s /zroot/bhyve/ /bhyve
 ```
 Now, Bhyve is ready to rumble, but no VMs are there yet:
 ```sh
-paul@f2:~ % doas vm list
+paul@f0:~ % doas vm list
 NAME DATASTORE LOADER CPU MEMORY VNC AUTO STATE
 ```
@@ -73,17 +76,17 @@ NAME DATASTORE LOADER CPU MEMORY VNC AUTO STATE
 We're going to install the Rocky Linux from the latest minimal iso:
 ```sh
-paul@f2:~ % doas vm iso \
+paul@f0:~ % doas vm iso \
 https://download.rockylinux.org/pub/rocky/9/isos/x86_64/Rocky-9.5-x86_64-minimal.iso
 /zroot/bhyve/.iso/Rocky-9.5-x86_64-minimal.iso 1808 MB 4780 kBps 06m28s
-paul@f2:/bhyve % doas vm create rocky
+paul@f0:/bhyve % doas vm create rocky
 ```
 ### VM configuration
 The default configuration looks like this now:
 ```sh
-paul@f2:/bhyve/rocky % cat rocky.conf
+paul@f0:/bhyve/rocky % cat rocky.conf
 loader="bhyveload"
 cpu=1
 memory=256M
@@ -95,12 +98,30 @@ uuid="1c4655ac-c828-11ef-a920-e8ff1ed71ca0"
 network0_mac="58:9c:fc:0d:13:3f"
 ```
-but in order to make Rocky Linux boot, it...
+Whereas the `uuid` and the `network0_mac` differ on each of the 3 hosts.
+
+but in order to make Rocky Linux boot it (plus some other adjustments, e.g. as I am intending to run the majority of the workload in the k3s cluster running on those linux VMs, I give them beefy specs like 4 CPU cores and 14GB RAM), I modified it to:
+
+```sh
+guest="linux"
+loader="uefi"
+uefi_vars="yes"
+cpu=4
+memory=14G
+network0_type="virtio-net"
+network0_switch="public"
+disk0_type="virtio-blk"
+disk0_name="disk0.img"
+graphics="yes"
+graphics_vga=io
+uuid="1c45400b-c828-11ef-8871-e8ff1ed71cac"
+network0_mac="58:9c:fc:0d:13:3f"
+```
 ### VM installation
 ```sh
-paul@f2:~ % doas vm install rocky Rocky-9.5-x86_64-minimal.iso
+paul@f0:~ % doas vm install rocky Rocky-9.5-x86_64-minimal.iso
 Starting rocky
 * found guest in /zroot/bhyve/rocky
 * booting...
@@ -115,6 +136,44 @@ root bhyve 6079 8 tcp4 *:5900 *:*
 Port 5900 is now also open for VNC connections, so we connect to it with a VNC client and run through the installation dialogs. I'm sure this could be done unattended or more automated, but we have only 3 VMs to install, and the automation doesn't seem worth it as we are doing it only once.
+### Increase of the disk image
+
+By default the VMs disk image is only 20G, which is a bit small for my purposes, so I stopped the VMs again and run `truncate` on the image file to enlarge them to 100G, and re-started the installation:
+
+```sh
+paul@f0:/bhyve/rocky % doas vm stop rocky
+paul@f0:/bhyve/rocky % doas truncate -s 100G disk0.img
+paul@f0:/bhyve/rocky % doas vm install rocky Rocky-9.5-x86_64-minimal.iso
+```
+
+### Connect to VPN
+
+For the installation, I opened the VPN client on my Fedora laptop (GNOME comes with a simple VPN client) and ran through the base installation for each of the VMs manually. I am sure this could have been automated a bit more, but there were just 3 VMs, and it wasn't worth the effort. The three VNC addresses of the VMs were: `vnc://f0:5900`, `vnc://f1:5900`, and `vnc://f0:5900`.
+
+I mostly selected the default settings (auto partitioning on the 100GB drive and a root user password). After the installation, the VMs were rebooted.
+
+## After install
+
+After that, I changed the network configuration to be static here as well.
+
+As per previous post of this series, the 3 FreeBSD hosts were already in my `/etc/hosts` file:
+
+```
+192.168.1.130 f0 f0.lan f0.lan.buetow.org
+192.168.1.131 f1 f1.lan f1.lan.buetow.org
+192.168.1.132 f2 f2.lan f2.lan.buetow.org
+```
+
+For the Rocky VMs I added those:
+
+```sh
+cat <<END >>/etc/hosts
+192.168.1.120 r0 r0.lan r0.lan.buetow.org
+192.168.1.121 r1 r1.lan r1.lan.buetow.org
+192.168.1.122 r2 r2.lan r2.lan.buetow.org
+END
+```
+and configured the IPs accordingly on the VMs themselves.
 Other *BSD-related posts:
diff --git a/gemfeed/index.md b/gemfeed/index.md
index eeaa7d8c..1650a18b 100644
--- a/gemfeed/index.md
+++ b/gemfeed/index.md
@@ -2,6 +2,7 @@
 ## To be in the .zone!
+[2025-01-15 - Working with an SRE Interview](./2025-01-15-working-with-an-sre-interview.md)
 [2025-01-01 - Posts from October to December 2024](./2025-01-01-posts-from-october-to-december-2024.md)
 [2024-12-15 - Random Helix Themes](./2024-12-15-random-helix-themes.md)
 [2024-12-03 - f3s: Kubernetes with FreeBSD - Part 2: Hardware and base installation](./2024-12-03-f3s-kubernetes-with-freebsd-part-2.md)
