Site Reliability Engineering - Part 5: System Design, Incidents, and Learning

Published at 2026-03-01T12:00:00+02:00

Welcome to Part 5 of my Site Reliability Engineering (SRE) series. I'm currently working as a Site Reliability Engineer, and I'm here to share what SRE is all about in this blog series.

2023-08-18 Site Reliability Engineering - Part 1: SRE and Organizational Culture
2023-11-19 Site Reliability Engineering - Part 2: Operational Balance
2024-01-09 Site Reliability Engineering - Part 3: On-Call Culture
2024-09-07 Site Reliability Engineering - Part 4: Onboarding for On-Call Engineers
2026-03-01 Site Reliability Engineering - Part 5: System Design, Incidents, and Learning (You are currently reading this)

    ___
   /   \     resilience
  |  o  |  <----------  learning
   \___/

This time I want to share some themes that build on what we've already covered: how system design and incident analysis fit together, why observability should not be an afterthought, and how a design‑improvement loop keeps systems getting better. Let's dive in!

Site Reliability Engineering - Part 5: System Design, Incidents, and Learning
⇢ System Design and Incident Analysis
⇢ ⇢ Resilience and cascading failures
⇢ ⇢ Learning from incidents
⇢ Observability: Don't leave it for when it's too late
⇢ The iterative spirit
⇢ Book tips

System Design and Incident Analysis

In my experience, a big chunk of SRE work revolves around system design and incident analysis. What matters most is whether your system can contain cascading failures: if it can't, one bad component can take everything down.

Resilience and cascading failures

What I've seen work well is thinking about resilience early—at design time, not after the first outage. You look for the weak points, address them before production, and try to keep the blast radius small when (not if) something fails.
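One common way to keep the blast radius small is a circuit breaker: after a few consecutive failures, callers stop hammering the broken dependency and fail fast instead. Here is a minimal sketch of the idea, not a production implementation. The class name, the thresholds, and the injectable clock are all my own assumptions for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    # max_failures and reset_after are illustrative defaults, not recommendations.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of piling on load.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each call to a flaky dependency in breaker.call(...) and handle CircuitOpenError with a fallback, such as a cached value or a degraded response. That way the failure stays contained in one place instead of spreading upstream.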

Learning from incidents

When incidents do happen, their analysis is a goldmine. Every incident exposes gaps—whether in tooling (ops tools that aren't up to the job) or in skills (engineers missing critical know-how). Blaming "human error" doesn't help. The job is to dig into root causes and fix the system. Postmortems that focus on customer impact help us distil lessons and make the system more robust so we're less likely to repeat the same failure.

System design and incident analysis form a feedback loop: we improve the design based on what we learn from incidents, and a better design reduces the impact of the next one.

Observability: Don't leave it for when it's too late

Here's something I've seen over and over: teams agree that "we need better observability" when they're already in the middle of an incident, and by then it's too late. Observability tends to be treated as an afterthought compared to product features, but you really need it in place before things go wrong. Tools that can query high-cardinality data (fields with many distinct values, such as customer or request IDs) and give you granular insight into what's happening are what save you when chaos hits. So invest in it early. Trust me on this one.
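To make "high-cardinality data" a bit more concrete, here is a small sketch of the idea: emit one wide, structured event per request, then slice the failures by whatever field looks suspicious. The field names (customer_id, build_id) and the sample values are made up for illustration. A real setup would ship these events to an observability backend rather than keep them in a list.

```python
import json
from collections import Counter


def make_event(route, status, duration_ms, customer_id, build_id):
    """Build one wide event per request; every field name here is hypothetical."""
    return {
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        "customer_id": customer_id,  # high cardinality: one value per customer
        "build_id": build_id,        # high cardinality: one value per deploy
    }


def errors_by(events, field):
    """Count failed requests (status >= 500) grouped by an arbitrary field."""
    return Counter(e[field] for e in events if e["status"] >= 500)


# Made-up sample data, standing in for events collected in production.
events = [
    make_event("/checkout", 200, 120, "cust-17", "build-41"),
    make_event("/checkout", 500, 950, "cust-42", "build-42"),
    make_event("/checkout", 500, 990, "cust-99", "build-42"),
    make_event("/search", 200, 35, "cust-42", "build-42"),
]

for event in events:
    print(json.dumps(event))  # one JSON line per request, ready for a log pipeline

# Which build do the failures cluster around?
print(errors_by(events, "build_id").most_common(1))  # [('build-42', 2)]
```

The point is that the grouping field is chosen at query time. If the events don't already carry customer_id or build_id when the incident starts, there is nothing to slice by.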

The iterative spirit

We also accept that system design is never "done." We refine it based on real-world performance, incident learnings, and changing needs. Every incident is a chance to learn and improve; the emphasis is on learning, not blame. SREs work with developers, backend teams, and incident responders so that the whole system keeps getting better. It's never perfect, but that's kind of the point.

Book tips

If you want to go deeper, here are a few books I can recommend:

97 Things Every SRE Should Know: Collective Wisdom from the Experts by Emil Stolarsky and Jaime Woo
Site Reliability Engineering: How Google Runs Production Systems by Jennifer Petoff, Niall Murphy, Betsy Beyer, and Chris Jones
Implementing Service Level Objectives by Alex Hidalgo

E-Mail your comments to paul@nospam.buetow.org :-)
