author: Paul Buetow <paul@buetow.org> 2024-08-26 23:07:43 +0300
committer: Paul Buetow <paul@buetow.org> 2024-08-26 23:07:43 +0300
commit: a060115f62061039758dd3cadae97889c4eadb1b
tree: a5b1699fa0078ea6b37fc871a40f0aef53453054 /gemfeed/2021-10-22-defensive-devops.html
parent: c7026fd7eb994ad457419e2251403d741e98e3df
Update content for html
Diffstat (limited to 'gemfeed/2021-10-22-defensive-devops.html')
-rw-r--r-- gemfeed/2021-10-22-defensive-devops.html | 31
1 files changed, 22 insertions, 9 deletions
diff --git a/gemfeed/2021-10-22-defensive-devops.html b/gemfeed/2021-10-22-defensive-devops.html
index f81976d8..fa66c14d 100644
--- a/gemfeed/2021-10-22-defensive-devops.html
+++ b/gemfeed/2021-10-22-defensive-devops.html
@@ -12,6 +12,16 @@
<br />
<span class='quote'>Published at 2021-10-22T10:02:46+03:00</span><br />
<br />
+<span>I have seen many different setups and infrastructures during my career. My roles always included front-line, ad-hoc fire-fighting of production issues. This often involves identifying and fixing problems under time pressure, without the comfort of 2-week-long Scrum sprints and without an exhaustive QA process. I also wrote a lot of code (Bash, Ruby, Perl, Go, and a little Java), and I followed the typical software development process, but that did not always apply to critical production issues.</span><br />
+<br />
+<span>Unfortunately, no system is 100% reliable, and there is always a subset of the possible problem space that you can never be prepared for. IT infrastructures can be complex, and a microservice-based infrastructure can complicate things even further, not even mentioning Kubernetes yet. You can take care of 99% of all potential problems by following all DevOps best practices. Those best practices are not the subject of this blog post; this post is about the remaining sub-1% of issues that arise from nowhere and that you can&#39;t be prepared for. </span><br />
+<br />
+<span>Is there a software bug in production, even though the software passed QA (after all, it is challenging to reproduce production behaviour in an artificial testing environment) and ran without issues for a week after deployment, until a special case came up just now? Are there multiple hardware failures happening, causing loss of service redundancy or data inaccessibility? Is the automation of external customers connected to your infrastructure putting unexpected extra pressure on your grid, driving higher latencies and putting the SLAs at risk? You bet the solution is: Sysadmins, SREs and DevOps Engineers to the rescue. </span><br />
+<br />
+<span>You may agree that fixing production issues this way is not proactive but rather reactive. I prefer to call it defensive, though, as you "defend" your system against a production issue. At the same time, you have to take a cautious (defensive) approach to fixing it, as you don&#39;t want to make things worse. </span><br />
+<br />
+<span>Over time, I have compiled a list of fire-fighting automation strategies, which I would like to share here. </span><br />
+<br />
<pre>
c=====e
H
@@ -22,16 +32,19 @@
ASCII Art by Clyde Watson
</pre>
<br />
-<span>I have seen many different setups and infrastructures during my career. My roles always included front-line, ad-hoc fire-fighting of production issues. This often involves identifying and fixing problems under time pressure, without the comfort of 2-week-long Scrum sprints and without an exhaustive QA process. I also wrote a lot of code (Bash, Ruby, Perl, Go, and a little Java), and I followed the typical software development process, but that did not always apply to critical production issues.</span><br />
-<br />
-<span>Unfortunately, no system is 100% reliable, and there is always a subset of the possible problem space that you can never be prepared for. IT infrastructures can be complex, and a microservice-based infrastructure can complicate things even further, not even mentioning Kubernetes yet. You can take care of 99% of all potential problems by following all DevOps best practices. Those best practices are not the subject of this blog post; this post is about the remaining sub-1% of issues that arise from nowhere and that you can&#39;t be prepared for. </span><br />
-<br />
-<span>Is there a software bug in production, even though the software passed QA (after all, it is challenging to reproduce production behaviour in an artificial testing environment) and ran without issues for a week after deployment, until a special case came up just now? Are there multiple hardware failures happening, causing loss of service redundancy or data inaccessibility? Is the automation of external customers connected to your infrastructure putting unexpected extra pressure on your grid, driving higher latencies and putting the SLAs at risk? You bet the solution is: Sysadmins, SREs and DevOps Engineers to the rescue. </span><br />
-<br />
-<span>You may agree that fixing production issues this way is not proactive but rather reactive. I prefer to call it defensive, though, as you "defend" your system against a production issue. At the same time, you have to take a cautious (defensive) approach to fixing it, as you don&#39;t want to make things worse. </span><br />
-<br />
-<span>Over time, I have compiled a list of fire-fighting automation strategies, which I would like to share here. </span><br />
+<h2 style='display: inline' id='table-of-contents'>Table of Contents</h2><br />
<br />
+<ul>
+<li><a href='#defensive-devops'>Defensive DevOps</a></li>
+<li>⇢ <a href='#meet-defensive-devops'>Meet Defensive DevOps</a></li>
+<li>⇢ <a href='#don-t-fully-automate-from-the-beginning'>Don&#39;t fully automate from the beginning</a></li>
+<li>⇢ <a href='#develop-code-directly-on-production-systems'>Develop code directly on production systems</a></li>
+<li>⇢ ⇢ <a href='#don-t-make-it-worse'>Don&#39;t make it worse</a></li>
+<li>⇢ <a href='#test-your-code'>Test your code</a></li>
+<li>⇢ <a href='#automation'>Automation</a></li>
+<li>⇢ <a href='#out-of-office-hours'>Out of office hours</a></li>
+<li>⇢ <a href='#retrospective'>Retrospective</a></li>
+</ul><br />
<h2 style='display: inline' id='meet-defensive-devops'>Meet Defensive DevOps</h2><br />
<br />
<span>Defensive DevOps is a term I coined myself. I define it this way:</span><br />