An incident lens
I started owning the operational posture of modern services (SaaS, microservices, cloud, and so on) in 2018, when I joined AWS.
With service ownership came incidents (and CoEs, AWS’s Correction of Error reviews!), and the ensuing six years built my awareness of availability and formed the lenses I use in incident retrospectives. Those lenses now shape the advice I give.
Last week’s example: a senior engineer wanted to improve the deployment process for a nascent service. Today it is deployed manually to a shared testing environment and a production environment, usually in that order, once per week.
The first thought of the engineer and his manager was that we should build a ‘stable’ pre-production environment — let’s call it pilot for the sake of discussion — avoiding contention in the testing environment. Changes would bake there for a while before flowing to prod. Various additional suggestions, most of them good, built on that: make deployments include only a single change, so we know what broke; build a pipeline that a person would promote instead of running Terraform, and so on.
These are all worth doing! But from an incident perspective, we tend to ask some very simple questions:
- How did you know something went wrong?
- How long did it take you to find out?
- How big was the impact?
- How long did it take you to recover?
In a retrospective, we’ll rotate these questions a bit:
- How would you have halved the time to detection?
- How would you have halved the time to recovery?
- How would you have halved the blast radius?
Looking through that lens led to my advice: there is little point in building a pilot environment or a release pipeline unless you can (a) reliably roll back deployments, (b) detect bustage in prod quickly enough to automatically decide to roll back, and (c) roll back quickly enough, and/or deploy to small enough populations, to keep the number of affected users within whatever your service’s maturity and SLA allow.
We have no automatic rollback other than ECS blue-green deploys, and while we have alarms on service metrics, many endpoints take little to no traffic. (I’ve seen this situation in a few places: uncommon endpoints take no traffic, so they can easily regress without being detected in any reasonable rollback window.)
Introducing a pilot environment would only give us a false sense of security, or even make things worse precisely by improving developer productivity: we’d deploy broken changes faster, bake them without detecting any problems, then roll the broken change out to 100% of users in prod. Maybe we’d even watch the dashboards and not see any problem for a week or two!
The need to detect bustage before it affects lots of users, and roll back quickly, dictates the roadmap I suggested:
- Build synthetic tests (“canaries”) that guarantee a volume of traffic mimicking real usage on every API endpoint (“which API endpoints are you willing to break without noticing?”). A sketch follows this list.
- Build detectors on those canaries and service metrics to fire alarms, so that any bustage is detectable (see the second sketch below). Now your time-to-detection is low.
- Build a pipeline that rolls back if those alarms fire (see the third sketch below). Now your time-to-recovery for simple bugs is low.
- Then build a pilot stage to serve as a smaller ‘canary’ population, where each release bakes before it goes out. Now you have reduced the blast radius of a bad change and gained a place to run canaries, with real internal users generating novel traffic.
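
To make the first step concrete, here’s roughly what a canary could look like, assuming CloudWatch as the metric store; the base URL, endpoint list, and namespace below are placeholders rather than anything from our real service. The shape matters more than the details: hit every endpoint on a schedule and publish a per-endpoint success metric, so even endpoints with no organic traffic produce a signal.

```python
# A minimal canary sketch, assuming a hypothetical base URL, endpoint list,
# and CloudWatch metric namespace. Run it on a schedule (cron, a Lambda,
# CloudWatch Synthetics, etc.) so every endpoint sees traffic even when real
# users are quiet.
import boto3
import requests

BASE_URL = "https://api.example.internal"                  # hypothetical
ENDPOINTS = ["/health", "/v1/widgets", "/v1/widgets/123"]  # hypothetical

cloudwatch = boto3.client("cloudwatch")


def run_canary() -> None:
    for path in ENDPOINTS:
        try:
            response = requests.get(BASE_URL + path, timeout=5)
            success = 1.0 if response.ok else 0.0
        except requests.RequestException:
            success = 0.0
        # One metric per endpoint, so a detector can alarm on any single
        # endpoint failing (or going silent) rather than on an aggregate.
        cloudwatch.put_metric_data(
            Namespace="Canary/MyService",                   # hypothetical
            MetricData=[{
                "MetricName": "Success",
                "Dimensions": [{"Name": "Endpoint", "Value": path}],
                "Value": success,
                "Unit": "Count",
            }],
        )


if __name__ == "__main__":
    run_canary()
```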
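
The second step is where the “endpoints with little to no traffic” problem bites. Here’s a detector sketch with the same placeholder names; the important choice is treating missing data as breaching, so a canary that stops reporting counts as bustage rather than silence.

```python
# A detector sketch to pair with the canary above (names are hypothetical).
# TreatMissingData="breaching" means a silent canary is itself an alarm.
import boto3

cloudwatch = boto3.client("cloudwatch")


def create_endpoint_alarm(path: str) -> None:
    cloudwatch.put_metric_alarm(
        AlarmName="canary-failure" + path.replace("/", "-"),  # e.g. canary-failure-health
        Namespace="Canary/MyService",
        MetricName="Success",
        Dimensions=[{"Name": "Endpoint", "Value": path}],
        Statistic="Minimum",
        Period=60,                     # one evaluation per minute of canary runs
        EvaluationPeriods=3,           # three bad minutes before we act
        Threshold=1.0,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",  # no data is treated as a failure
    )


for path in ["/health", "/v1/widgets", "/v1/widgets/123"]:
    create_endpoint_alarm(path)
```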
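
Finally, a rough sketch of the third step under the same hypothetical names: deploy the new ECS task definition, bake while polling the canary alarms, and put the previous task definition back if anything fires. In practice a managed pipeline (CodeDeploy’s alarm-triggered rollback for ECS blue-green deploys, for example) can do this for you; the loop below just shows the shape.

```python
# An alarm-gated deploy-and-rollback sketch; cluster, service, and alarm
# names are hypothetical, and a real pipeline would replace this loop.
import time

import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-cluster"                                                  # hypothetical
SERVICE = "my-service"                                                  # hypothetical
CANARY_ALARMS = ["canary-failure-health", "canary-failure-v1-widgets"]  # hypothetical
BAKE_SECONDS = 30 * 60


def any_alarm_firing() -> bool:
    response = cloudwatch.describe_alarms(AlarmNames=CANARY_ALARMS)
    return any(alarm["StateValue"] == "ALARM" for alarm in response["MetricAlarms"])


def deploy_with_rollback(new_task_def: str, previous_task_def: str) -> bool:
    ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=new_task_def)
    deadline = time.time() + BAKE_SECONDS
    while time.time() < deadline:
        if any_alarm_firing():
            # Put the known-good task definition back and report failure.
            ecs.update_service(cluster=CLUSTER, service=SERVICE,
                               taskDefinition=previous_task_def)
            return False
        time.sleep(60)
    return True  # baked cleanly; safe to promote to the next stage
```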
Other concerns, like smaller, more rapid deployments, are secondary to these. They might be important — perhaps developer velocity is hampered by the shared testing environment — but I’ve found that bad deployments are themselves a large source of friction, as well as being bad for the customer.