<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Working with an SRE Interview</title>
<link rel="shortcut icon" type="image/gif" href="/favicon.ico" />
<link rel="stylesheet" href="../style.css" />
<link rel="stylesheet" href="style-override.css" />
</head>
<body>
<p class="header">
<a href="https://foo.zone">Home</a> | <a href="https://codeberg.org/snonux/foo.zone/src/branch/content-md/gemfeed/2025-01-15-working-with-an-sre-interview.md">Markdown</a> | <a href="gemini://foo.zone/gemfeed/2025-01-15-working-with-an-sre-interview.gmi">Gemini</a>
</p>
<h1 style='display: inline' id='working-with-an-sre-interview'>Working with an SRE Interview</h1><br />
<br />
<span class='quote'>Published at 2025-01-15T00:16:04+02:00</span><br />
<br />
<span>I was interviewed by Florian Buetow on <span class='inlinecode'>cracking-ai-engineering.com</span> about what it&#39;s like working with a Site Reliability Engineer from the point of view of a Software Engineer, Data Scientist, and AI Engineer.</span><br />
<br />
<a class='textlink' href='https://www.cracking-ai-engineering.com/writing/2025/01/12/working-with-an-sre-interview/'>See original interview here</a><br />
<a class='textlink' href='https://www.cracking-ai-engineering.com'>Cracking AI Engineering</a><br />
<br />
<span>Below, I am posting the interview here on my blog as well.</span><br />
<br />
<h2 style='display: inline' id='table-of-contents'>Table of Contents</h2><br />
<br />
<ul>
<li><a href='#working-with-an-sre-interview'>Working with an SRE Interview</a></li>
<li>⇢ <a href='#preamble-'>Preamble </a></li>
<li>⇢ <a href='#introducing-paul'>Introducing Paul</a></li>
<li>⇢ <a href='#how-did-you-get-started'>How did you get started?</a></li>
<li>⇢ <a href='#roles-and-career-progression'>Roles and Career Progression</a></li>
<li>⇢ <a href='#anecdotes-and-best-practices'>Anecdotes and Best Practices</a></li>
<li>⇢ <a href='#working-with-different-teams'>Working with Different Teams</a></li>
<li>⇢ <a href='#using-ai-tools'>Using AI Tools</a></li>
<li>⇢ <a href='#sre-learning-resources'>SRE Learning Resources</a></li>
<li>⇢ <a href='#blogging'>Blogging</a></li>
<li>⇢ <a href='#wrap-up'>Wrap-up</a></li>
<li>⇢ <a href='#closing-comments'>Closing comments</a></li>
</ul><br />
<h2 style='display: inline' id='preamble-'>Preamble </h2><br />
<br />
<span>Florian from Cracking AI Engineering interviewed me about my work as a Principal SRE at Mimecast. We talked about what an Embedded SRE actually does, automation, observability, incident management, and how to work well with an SRE, whether you&#39;re a developer, data scientist, or manager.</span><br />
<br />
<h2 style='display: inline' id='introducing-paul'>Introducing Paul</h2><br />
<br />
<span>Hi Paul, please introduce yourself briefly to the audience. Who are you, what do you do for a living, and where do you work?</span><br />
<br />
<span class='quote'>My name is Paul Bütow, I work at Mimecast, and I’m a Principal Site Reliability Engineer there. I’ve been with Mimecast for almost ten years now. The company specializes in email security, including things like archiving, phishing detection, malware protection, and spam filtering.</span><br />
<br />
<span>You mentioned that you’re an ‘Embedded SRE.’ What does that mean exactly?</span><br />
<br />
<span class='quote'>It means that I’m directly part of the software engineering team, not in a separate Ops department. I ensure that nothing is deployed manually, and everything runs through automation. I also set up monitoring and observability. These are two distinct aspects: monitoring alerts us when something breaks, while observability helps us identify trends. I also create runbooks so we know what to do when specific incidents occur frequently.</span><br />
<br />
<span class='quote'>Infrastructure SREs, on the other hand, handle the foundational setup, like providing the Kubernetes cluster itself or ensuring the operating systems are installed. They don&#39;t work on the application directly but ensure the base infrastructure is there for others to use. This works well when a company has multiple teams that need shared infrastructure.</span><br />
<br />
<h2 style='display: inline' id='how-did-you-get-started'>How did you get started?</h2><br />
<br />
<span>How did your interest in Linux or FreeBSD start?</span><br />
<br />
<span class='quote'>It began during my school days. We had a PC with DOS at home, and I eventually bought Suse Linux 5.3. Shortly after, I discovered FreeBSD because I liked its handbook so much. I wanted to understand exactly how everything worked, so I also tried Linux from Scratch. That involves installing every package manually to gain a better understanding of operating systems.</span><br />
<br />
<a class='textlink' href='https://www.FreeBSD.org'>https://www.FreeBSD.org</a><br />
<a class='textlink' href='https://linuxfromscratch.org/'>https://linuxfromscratch.org/</a><br />
<br />
<span>And after school, you pursued computer science, correct?</span><br />
<br />
<span class='quote'>Exactly. I wasn’t sure at first whether I wanted to be a software developer or a system administrator. I applied for both and eventually accepted an offer as a Linux system administrator. This was before &#39;SRE&#39; became a buzzword, but much of what I did back then (automation, infrastructure as code, monitoring) is now considered part of the typical SRE role.</span><br />
<br />
<h2 style='display: inline' id='roles-and-career-progression'>Roles and Career Progression</h2><br />
<br />
<span>Tell us about how you joined Mimecast. When did you fully embrace the SRE role?</span><br />
<br />
<span class='quote'>I started as a Linux sysadmin at 1&amp;1. I managed an ad server farm with hundreds of systems and later handled load balancers. Together with an architect, I managed F5 load balancers distributing around 2,000 services, including for portals like web.de and GMX. I also led the operations team technically for a while before moving to London to join Mimecast.</span><br />
<br />
<span class='quote'>At Mimecast, the job title was explicitly &#39;Site Reliability Engineer.&#39; The biggest difference was that I was no longer in a separate Ops department but embedded directly within the storage and search backend team. I loved that because we could plan features together, from automation to measurability and observability. Mimecast also operates thousands of physical servers for email archiving, which was fascinating since I already had experience with large distributed systems at 1&amp;1. It was the right step for me because it allowed me to work close to the code while remaining hands-on with infrastructure.</span><br />
<br />
<span>What are the differences between SRE, DevOps, SysAdmin, and Architects?</span><br />
<br />
<span class='quote'>SREs are like the next step after SysAdmins. A SysAdmin might manually install servers, replace disks, or use simple scripts for automation, while SREs use infrastructure as code and focus on reliability through SLIs, SLOs, and automation. DevOps isn’t really a job; it’s more of a way of working, where developers are involved in operations tasks like setting up CI/CD pipelines or on-call shifts. Architects focus on designing systems and infrastructures, such as load balancers or distributed systems, working alongside SREs to ensure the systems meet the reliability and scalability requirements. The specific responsibilities of each role depend on the company, and there is often overlap.</span><br />
<br />
<span>What are the most important reliability lessons you’ve learned so far?</span><br />
<br />
<ul>
<li>Don’t leave SRE aspects as an afterthought. It’s much better to discuss automation, monitoring, SLIs, and SLOs early on. Traditional sysadmins often installed systems manually, but today, we do everything via infrastructure as code, using tools like Terraform or Puppet.</li>
<li>I also distinguish between monitoring and observability. Monitoring tells us, &#39;The server is down, alarm!&#39; Observability dives deeper, showing trends like increasing latency so we can act proactively.</li>
<li>SLI, SLO, and SLA are core elements. We focus on what users actually experience (for example, how quickly an email is sent) and set our goals accordingly.</li>
<li>Runbooks are also crucial. When something goes wrong at night, you don’t want to start from scratch. A runbook outlines how to debug and resolve specific problems, saving time and reducing downtime.</li>
</ul><br />
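<span>The SLI and SLO point can be made concrete with a small Python sketch. It assumes a hypothetical SLI (the share of emails delivered within some latency threshold) and a made-up 99.9% target; the names and numbers are illustrative only and do not come from the interview.</span><br />
<br />
<pre>
```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # fraction of good events, 0.0 to 1.0
    target: float            # SLO target, e.g. 0.999
    budget_remaining: float  # share of the error budget still unspent


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compute the SLI and remaining error budget for one time window."""
    if total_events == 0:
        # No traffic in the window: nothing violated the objective.
        return SloReport(sli=1.0, target=target, budget_remaining=1.0)
    sli = good_events / total_events
    allowed_bad = (1.0 - target) * total_events  # error budget, in events
    actual_bad = total_events - good_events
    if allowed_bad > 0:
        remaining = 1.0 - actual_bad / allowed_bad
    else:
        # A 100% target leaves no budget at all.
        remaining = 0.0
    return SloReport(sli=sli, target=target, budget_remaining=remaining)


# Hypothetical window: 1,000,000 emails, 400 of them slower than the threshold.
report = evaluate_slo(good_events=999_600, total_events=1_000_000, target=0.999)
print(f"SLI={report.sli:.4%} budget left={report.budget_remaining:.0%}")
```
</pre>
<br />
<span>A negative <span class='inlinecode'>budget_remaining</span> means the window has already burned more than its error budget, which is usually the signal to prioritize reliability work over new features.</span><br />
<br />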
<h2 style='display: inline' id='anecdotes-and-best-practices'>Anecdotes and Best Practices</h2><br />
<br />
<span>Runbooks sound very practical. Can you explain how they’re used day-to-day?</span><br />
<br />
<span class='quote'>Runbooks are essentially guides for handling specific incidents. For instance, if a service won’t start, the runbook will specify where the logs are and which commands to use. Observability takes it a step further, helping us spot changes early (like rising error rates or latency) so we can address issues before they escalate.</span><br />
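<br />
<span>The difference between a monitoring check and an observability signal can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the latency samples, and the slope threshold of 5 ms per sample are all hypothetical, and a real setup would rely on a metrics system rather than hand-rolled math.</span><br />
<br />
<pre>
```python
from typing import List


def should_page(service_is_up: bool) -> bool:
    """Monitoring: a binary check that fires once something is broken."""
    return not service_is_up


def latency_slope(samples_ms: List[float]) -> float:
    """Least-squares slope of latency over the sample index (ms per sample)."""
    n = len(samples_ms)
    if 2 > n:
        return 0.0  # not enough data to call it a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_ms))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def should_investigate(samples_ms: List[float], max_slope_ms: float = 5.0) -> bool:
    """Observability: flag a rising trend before any hard limit is hit."""
    return latency_slope(samples_ms) > max_slope_ms


latencies = [100.0, 110.0, 120.0, 130.0]  # hypothetical p95 latency per hour
print(should_page(True), should_investigate(latencies))
```
</pre>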
<br />
<span>When should you decide to put something into a runbook, and when is it unnecessary?</span><br />
<br />
<span class='quote'>If an issue happens frequently, it should be documented in a runbook so that anyone, even someone new, can follow the steps to fix it. The idea is that 90% of the common incidents should be covered. For example, if a service is down, the runbook would specify where to find logs, which commands to run, and what actions to take. On the other hand, rare or complex issues, where the resolution depends heavily on context or varies each time, don’t make sense to include in detail. For those, it’s better to focus on general troubleshooting steps.</span><br />
<br />
<span>How do you search for and find the correct runbooks?</span><br />
<br />
<span class='quote'>Runbooks should be linked directly in the alert you receive. For example, if you get an alert about a service not running, the alert will have a link to the runbook that tells you what to check, like logs or commands to run. Runbooks are best stored in an internal wiki, so if you don’t find the link in the alert, you know where to search. The important thing is that runbooks are easy to find and up to date because that’s what makes them useful during incidents. </span><br />
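<br />
<span>One way to keep runbooks attached to alerts is to store the link next to the alert definition and to check regularly for alerts that have none. The Python sketch below assumes a hypothetical <span class='inlinecode'>AlertRule</span> structure and a made-up wiki at <span class='inlinecode'>wiki.example.com</span>; it is not tied to any particular alerting tool.</span><br />
<br />
<pre>
```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import quote

# Hypothetical wiki search page, used when a rule has no direct runbook link.
WIKI_SEARCH_URL = "https://wiki.example.com/search"


@dataclass
class AlertRule:
    name: str
    summary: str
    runbook_url: Optional[str] = None


def render_notification(rule: AlertRule) -> str:
    """Build the alert text, always ending with a place to look for help."""
    link = rule.runbook_url or f"{WIKI_SEARCH_URL}?q={quote(rule.name)}"
    return f"[ALERT] {rule.name}: {rule.summary}\nRunbook: {link}"


def rules_missing_runbook(rules: List[AlertRule]) -> List[str]:
    """List alert names without a runbook, e.g. as a periodic lint check."""
    return [rule.name for rule in rules if not rule.runbook_url]


rules = [
    AlertRule("ServiceDown", "service is not running",
              "https://wiki.example.com/runbooks/service-down"),
    AlertRule("DiskFull", "data volume above 95%"),
]
print(render_notification(rules[0]))
print(rules_missing_runbook(rules))
```
</pre>
<br />
<span>Falling back to a wiki search link mirrors the advice above: if the direct link is missing, the person on call still knows where to look.</span><br />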
<br />
<span>Do you have an interesting war story you can share with us?</span><br />
<br />
<span class='quote'>Sure. At 1&amp;1, we had a proprietary ad server software that ran a SQL query during startup. The query got slower over time, eventually timing out and preventing the server from starting. Since we couldn’t access the source code, we searched the binary for the SQL and patched it. Once we had pinpointed the issue, a developer was able to adjust the SQL. This collaboration between sysadmin and developer perspectives highlights the value of SRE work.</span><br />
<br />
<h2 style='display: inline' id='working-with-different-teams'>Working with Different Teams</h2><br />
<br />
<span>You’re embedded in a team. How does collaboration with developers work practically?</span><br />
<br />
<span class='quote'>We plan everything together from the start. If there’s a new feature, we discuss infrastructure, automated deployments, and monitoring right away. Developers are experts in the code, and I bring the infrastructure expertise. This avoids unpleasant surprises before going live.</span><br />
<br />
<span>How about working with data scientists or ML engineers? Are there differences?</span><br />
<br />
<span class='quote'>The principles are the same. ML models also need to be deployed and monitored. You deal with monitoring, resource allocation, and identifying performance drops. Whether it’s a microservice or an ML job, at the end of the day, it’s all running on servers or clusters that must remain stable.</span><br />
<br />
<span>What about working with managers or the FinOps team?</span><br />
<br />
<span class='quote'>We often discuss costs, especially in the cloud, where scaling up resources is easy. It’s crucial to know our metrics: do we have enough capacity? Do we need all instances? Or is the CPU only at 5% utilization? This data helps managers decide whether the budget is sufficient or if optimizations are needed.</span><br />
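<br />
<span>The capacity questions above translate into a very small report. The following Python sketch assumes CPU utilization samples are already available per instance; the instance names, the sample values, and the 10% threshold are hypothetical.</span><br />
<br />
<pre>
```python
from statistics import mean
from typing import Dict, List


def underutilized(cpu_by_instance: Dict[str, List[float]],
                  threshold_pct: float = 10.0) -> List[str]:
    """Return instances whose average CPU utilization is below the threshold."""
    return sorted(
        name
        for name, samples in cpu_by_instance.items()
        if samples and threshold_pct > mean(samples)
    )


# Hypothetical CPU utilization samples in percent.
usage = {
    "search-01": [62.0, 71.5, 58.0],
    "search-02": [4.0, 5.5, 5.0],
    "batch-01": [],  # no data yet: skipped rather than flagged
}
print(underutilized(usage))
```
</pre>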
<br />
<span>Do you have practical tips for working with SREs?</span><br />
<br />
<span class='quote'>Yes, I have a few:</span><br />
<br />
<ul>
<li>Early involvement: Include SREs from the beginning in your project.</li>
<li>Runbooks &amp; documentation: Document recurring errors.</li>
<li>Try first: Try to understand the issue yourself before immediately asking the SRE.</li>
<li>Basic infra knowledge: Kubernetes and Terraform aren’t magic. Some basic understanding helps every developer.</li>
</ul><br />
<h2 style='display: inline' id='using-ai-tools'>Using AI Tools</h2><br />
<br />
<span>Let’s talk about AI. How do you use it in your daily work?</span><br />
<br />
<span class='quote'>For boilerplate code, like Terraform snippets, I often use ChatGPT. It saves time, although I always review and adjust the output. Log analysis is another exciting application. Instead of manually going through millions of lines, AI can summarize key outliers or errors.</span><br />
<br />
<span>Do you think AI could largely replace SREs or significantly change the role?</span><br />
<br />
<span class='quote'>I see AI as an additional tool. SRE requires a deep understanding of how distributed systems work internally. While AI can assist with routine tasks or quickly detect anomalies, human expertise is indispensable for complex issues.</span><br />
<br />
<h2 style='display: inline' id='sre-learning-resources'>SRE Learning Resources</h2><br />
<br />
<span>What resources would you recommend for learning about SRE?</span><br />
<br />
<span class='quote'>The Google SRE book is a classic, though a bit dry. I really like &#39;Seeking SRE,&#39; as it offers various perspectives on SRE, with many practical stories from different companies.</span><br />
<br />
<a class='textlink' href='https://sre.google/books/'>https://sre.google/books/</a><br />
<a class='textlink' href='https://www.oreilly.com/library/view/seeking-sre/9781491978856'>Seeking SRE</a><br />
<br />
<span>Do you have a podcast recommendation?</span><br />
<br />
<span class='quote'>The Google SRE Prodcast is quite interesting. It offers insights into how Google approaches SRE, along with perspectives from external guests.</span><br />
<br />
<a class='textlink' href='https://sre.google/prodcast/'>https://sre.google/prodcast/</a><br />
<br />
<h2 style='display: inline' id='blogging'>Blogging</h2><br />
<br />
<span>You also have a blog. What motivates you to write regularly?</span><br />
<br />
<span class='quote'>Writing helps me learn the most. It also serves as a personal reference. Sometimes I look up how I solved a problem a year ago. And of course, others tackling similar projects might find inspiration in my posts.</span><br />
<br />
<span>What do you blog about?</span><br />
<br />
<span class='quote'>Mostly technical topics I find exciting, like homelab projects, Kubernetes, or book summaries on IT and productivity. It’s a personal blog, so I write about what I enjoy.</span><br />
<br />
<h2 style='display: inline' id='wrap-up'>Wrap-up</h2><br />
<br />
<span>To wrap up, what are three things every team should keep in mind for stability?</span><br />
<br />
<span class='quote'>First, maintain runbooks and documentation to avoid chaos at night. Second, automate everything; manual installs in production are risky. Third, define SLIs, SLOs, and SLAs early so everyone knows what we’re monitoring and guaranteeing.</span><br />
<br />
<span>Is there a motto or mindset that particularly inspires you as an SRE?</span><br />
<br />
<span class='quote'>"Keep it simple and stupid" (KISS). Not everything has to be overly complex. And always stay curious. I’m still fascinated by how systems work under the hood.</span><br />
<br />
<span>Where can people find you online?</span><br />
<br />
<span class='quote'>You can find links to my socials on my website, paul.buetow.org.</span><br />
<span class='quote'>I regularly post articles and link to everything else I’m working on outside of work.</span><br />
<br />
<a class='textlink' href='https://paul.buetow.org'>https://paul.buetow.org</a><br />
<br />
<span>Thank you very much for your time and this insightful interview into the world of site reliability engineering.</span><br />
<br />
<span class='quote'>My pleasure, this was fun.</span><br />
<br />
<h2 style='display: inline' id='closing-comments'>Closing comments</h2><br />
<br />
<span>Thanks for reading! Hopefully there’s something useful in here for your own work. Reliable systems are a team effort, after all.</span><br />
<br />
<span>E-Mail your comments to <span class='inlinecode'>paul@nospam.buetow.org</span> or contact Florian via the Cracking AI Engineering website :-)</span><br />
<br />
<a class='textlink' href='../'>Back to the main site</a><br />
<p class="footer">
	Generated with <a href="https://codeberg.org/snonux/gemtexter">Gemtexter 3.0.1-develop</a> |
	served by <a href="https://www.OpenBSD.org">OpenBSD</a>/<a href="https://man.openbsd.org/relayd.8">relayd(8)</a>+<a href="https://man.openbsd.org/httpd.8">httpd(8)</a> |
	<a href="https://foo.zone/site-mirrors.html">Site Mirrors</a>
	<br />
	Webring: <a href="https://shring.sh/foo.zone/previous">previous</a> | <a href="https://shring.sh">shring</a> | <a href="https://shring.sh/foo.zone/next">next</a>
</p>
</body>
</html>