diff options
Diffstat (limited to 'gemfeed/2021-10-22-defensive-devops.html')
| -rw-r--r-- | gemfeed/2021-10-22-defensive-devops.html | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/gemfeed/2021-10-22-defensive-devops.html b/gemfeed/2021-10-22-defensive-devops.html index 519342b3..81e26478 100644 --- a/gemfeed/2021-10-22-defensive-devops.html +++ b/gemfeed/2021-10-22-defensive-devops.html @@ -80,7 +80,7 @@ p.quote:after { (__((__((___()()()------------------------------------' |_____| ASCII Art by Clyde Watson </pre> -<p class="quote"><i>Written by Paul Buetow 2021-10-22</i></p> +<p class="quote"><i>Published by Paul Buetow 2021-10-22</i></p> <p>I have seen many different setups and infrastructures during my carreer. My roles always included front-line ad-hoc fire fighting production issues. This often involves identifying and fixing these under time pressure, without the comfort of 2-week-long SCRUM sprints and without an exhaustive QA process. I also wrote a lot of code (Bash, Ruby, Perl, Go, and a little Java), and I followed the typical software development process, but that did not always apply to critical production issues.</p> <p>Unfortunately, no system is 100% reliable, and you can never be prepared for a subset of the possible problem-space. IT infrastructures can be complex. Not even mentioning Kubernetes yet, a Microservice-based infrastructure can complicate things even further. You can take care of 99% of all potential problems by following all DevOps best practices. Those best practices are not the subject of this blog post; this post is about the sub 1% of the issues arising from nowhere you can't be prepared for. </p> <p>Is there a software bug in a production, even though the software passed QA (after all, it is challenging to reproduce production behaviour in an artificial testing environment) and the software didn't show any issues running in production until a special case came up just now after it got deployed to production a week ago? Are there multiple hardware failure happening which causes loss of service redundancy or data inaccessibility? Is the automation of external customers connected to our infrastructure putting unexpectedly extra pressure on your grid, driving higher latencies and putting the SLAs at risk? You bet the solution is: Sysadmins, SREs and DevOps Engineers to the rescue. </p> |
