The incident spiral
I’ve watched this pattern unfold more than twice, so it’s time to write it down.
- A team has a quality problem. Reliance on unreliable partners, technical debt after a couple of years of fast growth, accrued complexity — whatever the causes, we’ve become aware that things aren’t just humming along.
- The team starts taking incidents more seriously. A new leader joins, or senior leadership puts the team under a microscope. The intent is good.
- Incident retrospectives become more thorough; these take time. Each retro produces more action items, which take time to address. Some action items improve detection, so on-call engineers get paged more, and we open more incidents to track. The team is now spending 20–40% of its engineering bandwidth on incidents, retros, and action items.
- Meanwhile, the team is expected to keep shipping features and absorbing company-wide changes. Those changes themselves increase the surface area for things to go wrong. Running at 60% capacity means corners are cut on feature work and people are tired. More incidents.
This is a bad situation: the harder the team tries to meet expectations (shipping features while taking incidents seriously), the worse things get.
Action items are a trap
In most companies, incident action items bypass normal prioritization. A P1 comes with a one-week deadline and supersedes other work for whichever engineer is assigned the ticket. This is by design, because, without that deadline and that accountability, the backlog of action items will keep getting punted until after the next sprint, when the cold light of day shines on the list of things you told your product manager you’d get done. You sat in a meeting and said these things were necessary to prevent recurrence, accelerate detection, help with mitigation, or reduce blast radius, right?
But consider the tradeoff that occurs in practice. Teams are always balancing feature and reliability work, and some of that planned reliability work might matter more than any incident action item.
A P1 action item for a minor incident might preempt a planned project to deprecate a problematic system or remove a single point of failure.
The incident action item addresses a failure mode we’ve just seen. The planned work might avoid a catastrophic failure that we haven’t yet seen. We prioritize the former because it’s vivid. We’ve been lucky on the latter — so far.
Lorin Hochstein captures this with a thought experiment he calls the Oracle of Delphi. Imagine an oracle tells you that if you do an incident’s follow-up work, you’ll avoid a recurrence… but you’ll suffer a novel eight-hour outage next month. If you do the reliability work that was already on your backlog, you’ll have another minor incident like the one you just had, but avoid the big one. Which do you choose?
Creating a P1 or P2 incident action item is a statement that this is the most important thing you can work on this week, based on an implicit assumption that the last incident is a strong predictor of future reliability issues. That assumption is often wrong. You were surprised before. You’ll be surprised again.
The complexity that was supposed to help
There’s a related problem with the kind of action items that retros produce. The natural response to an incident is to add something: a check, a cache, a retry, a fallback, a circuit breaker. Each is reasonable in isolation, and they’re easy for folks on the periphery of a system to imagine and propose as action items. Over time, they accumulate.
Hochstein has a conjecture about this: once a system reaches a certain level of reliability, most major incidents will involve either a manual intervention intended to mitigate a minor incident, or unexpected behavior of a subsystem whose primary purpose was to improve reliability.
This isn’t theoretical. The October 2025 AWS outage involved an unanticipated interaction between multiple reliability mechanisms: redundant enactor instances, a locking mechanism, a cleanup mechanism, a transactional mechanism, and a rollback mechanism. All sensible design decisions, but the incident emerged from their interaction. The December 2025 Cloudflare outage was triggered by a killswitch — a mechanism specifically designed to quickly disable misbehaving rules — that had worked well in the past but failed in a corner case.
Caches, retries, and bimodal fallback paths are all common contributors to incidents. I see them proposed as action items every week.
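To make the retry case concrete, here’s a minimal sketch (the failure rate and retry policy are made up, not taken from any incident above) of how a well-intentioned retry wrapper multiplies load on a partner that is already struggling:

```python
import random

FAILURE_RATE = 0.8   # partner brownout: 80% of calls time out
MAX_ATTEMPTS = 3     # a "reasonable" retry policy added as an action item

partner_calls = 0

def call_partner(payload):
    """Stand-in for a flaky partner API during a brownout."""
    global partner_calls
    partner_calls += 1
    if random.random() < FAILURE_RATE:
        raise TimeoutError("partner timed out")
    return "ok"

def call_with_retries(payload):
    """Retry wrapper: helpful for blips, harmful during a sustained brownout."""
    for _ in range(MAX_ATTEMPTS):
        try:
            return call_partner(payload)
        except TimeoutError:
            continue
    return "failed"

user_requests = 1000
for i in range(user_requests):
    call_with_retries(i)

# Each user request fans out into up to MAX_ATTEMPTS partner calls, so the
# struggling partner sees roughly 2.4x its normal traffic at exactly the
# moment it can least absorb it.
print(f"{user_requests} user requests -> {partner_calls} partner calls")
```

Caches and rarely exercised fallback paths have their own versions of this dynamic: they behave one way in the mode you test every day and another way in the mode you only see during an incident.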
This doesn’t mean reliability mechanisms are bad. We need retries, timeouts, bulkheading, failovers, rate limiting, caches, circuit breakers, and all the rest. Complexity is inevitable, and indeed we must learn to ‘surf’ it, as Hochstein puts it. However, we must also continually work to simplify to make room for the new complexity we’re adding.
Breaking out
The urge to take incidents seriously is correct. The problem is that when every incident generates two weeks of follow-up work, the team can’t keep up. Prevention competes with response, and both compete with the roadmap.
Here are some approaches I’ve found helpful to reduce the burden and increase agency, starting with the two I think matter most.
Predictable external failures shouldn’t be incidents. One airline being unreachable isn’t an incident for a travel booking site. One restaurant being unexpectedly closed isn’t an incident for Uber Eats. Some failure modes exist by design — they will happen, and the system should handle them gracefully, informing the user as appropriate. (To do this right you might need to monitor external dependencies with synthetics, rather than just watching error rates.)
This doesn’t mean ignoring these failures. Track them. Build dashboards. Set up weekly reviews to spot trends. If a partner’s error rate doubles, that’s a conversation to have with the partner, or a reason to reconsider the integration. But don’t page someone at 2am.
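As a sketch of what that can look like in practice (the partner names are hypothetical, and printing stands in for whatever metrics pipeline you already have), a synthetic check records partner availability for a dashboard and a weekly review instead of opening an incident:

```python
import time
import requests  # assumed HTTP client; any will do

# Hypothetical partner endpoints we depend on but don't control.
PARTNERS = {
    "airline-a": "https://api.airline-a.example/health",
    "airline-b": "https://api.airline-b.example/health",
}

def probe(name, url):
    """Synthetic check: measure availability, don't open an incident."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        ok = resp.status_code < 500
    except requests.RequestException:
        ok = False
    latency = time.monotonic() - start
    # Emit a metric for the dashboard and the weekly review; printing
    # stands in for a real metrics client here.
    print(f"partner_availability,partner={name} ok={int(ok)} latency={latency:.2f}s")

if __name__ == "__main__":
    for name, url in PARTNERS.items():
        probe(name, url)
```

Any alerting on top of this should key off sustained trends, such as a partner’s error rate doubling, and route to something reviewed during business hours rather than to the on-call pager.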
You can achieve similar benefits by weakening guarantees: retries, job queues, dead-letter queues, and similar mechanisms all turn failures into latency. If your users can tolerate that latency, these mechanisms make your system more resilient.
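Here’s a minimal sketch of that idea, assuming a simple in-process queue (a real system would use a durable job queue): a failed partner call becomes a delayed retry rather than a user-facing error, and only lands in a dead-letter queue after repeated failures.

```python
import time
from collections import deque

MAX_ATTEMPTS = 5
BASE_DELAY_S = 0.05      # tiny here; seconds to minutes in a real system

jobs = deque()           # work waiting to be (re)tried
dead_letters = []        # jobs we gave up on: a ticket for tomorrow, not a 2am page

failures_remaining = 3   # simulate a partner that recovers after three failures

def book_with_partner(job):
    """Stand-in for a partner call that fails transiently, then recovers."""
    global failures_remaining
    if failures_remaining > 0:
        failures_remaining -= 1
        raise TimeoutError("partner unavailable")
    return "confirmed"

def worker():
    while jobs:
        job = jobs.popleft()
        try:
            result = book_with_partner(job)
            print(f"booking {job['booking_id']}: {result} after {job['attempts']} retries")
        except TimeoutError:
            job["attempts"] += 1
            if job["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(job)
            else:
                # Exponential backoff: the failure surfaces to the user as
                # latency ("your booking is still being confirmed"), not an error.
                time.sleep(BASE_DELAY_S * 2 ** job["attempts"])
                jobs.append(job)

jobs.append({"booking_id": 42, "attempts": 0})
worker()
print(f"dead-lettered: {len(dead_letters)}")
```

The tradeoff is explicit: the user waits longer, but nobody gets paged for a failure mode the system was designed to absorb.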
You might make a different decision here if you have a lot of influence or control over the partner or destination, if the partner is fungible, if there’s something you can do to unblock your own customer in extremis, or if you want to actively track these failures. However, I think this point will be immediately recognizable to teams trapped in the spiral described above.
Be more discriminating about action items. Action items should fill urgent gaps in detection or ability to mitigate, fix obvious bugs, and prevent cascading failures. Not every incident needs a code change. Good action items should address a class of incidents, not just the specific failure you observed.
Almost everything else you can think of, particularly migrations and rewrites, should be added to the backlog to be weighed against the other reliability work you were already planning.
In retros, I’m fond of asking the questions I learned at Amazon. They’re simple, but they nudge you into taking different perspectives and generalizing to a class of failures: what would have halved your time to detection? What would have halved your time to recovery? What would have halved the blast radius?
It’s also useful to ask what would halve the cost of this class of incident, because achieving that can help us break out of the spiral. Sometimes the answer is “nothing practical, and we accept this will happen occasionally” — that might be better than doing a bunch of work just to feel like you’re responding to the incident. If you’re never declining action items, you’re not being selective enough.
After an incident, look for complexity to remove, not just safeguards to add. We should be skeptical of action items that add new things that can interact and fail. Ask instead: what could we remove to make this simpler? Could we eliminate the dependency that failed rather than wrap it in more error handling? Could we make two components share fate? Could we drop a feature that isn’t worth its operational cost?
Asking simplifying questions all the time is something I expect of staff engineers.
Be honest about staffing. A team spending 30% of its time on incidents might be understaffed for its scope. Sometimes the answer is “we can’t operate this much surface area with this many people.” That’s uncomfortable to say, but the alternative — denial, leading to degraded quality, burnout, and attrition — is worse.
Make the cost visible with an incident budget. Allocate a fixed percentage of engineering time — say, 15% — to incident work. When that’s exhausted, remaining action items compete with features through normal prioritization. This makes the cost visible, forces explicit tradeoffs, and creates pressure to make incidents cheaper.
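A back-of-the-envelope version of that budget, with made-up numbers, looks like this:

```python
# Illustrative incident-budget arithmetic; every number here is made up.
ENGINEERS = 8
HOURS_PER_WEEK = 40
BUDGET_FRACTION = 0.15   # the fixed allocation for incident work

weekly_budget_hours = ENGINEERS * HOURS_PER_WEEK * BUDGET_FRACTION   # 48 hours

# Hours logged this week against on-call, retros, and action items.
logged_incident_hours = 62

overrun = logged_incident_hours - weekly_budget_hours
if overrun > 0:
    print(f"Budget exhausted by {overrun:.0f}h: remaining action items compete "
          "with features through normal prioritization.")
else:
    print(f"{-overrun:.0f}h of incident budget left this week.")
```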
None of these changes will fix everything overnight, but they might break the feedback loop.
Engineering needs to be sustainable. A team drowning in incidents can’t think strategically or make deliberate choices. The first step is to reclaim the ability to plan, then use that ability to chart a course towards holistic reliability.