Design for small, local failures so you avoid catastrophic cascades

Hard · Requires significant effort · Recommended

A payments startup had a single database and a single on‑call engineer who ‘knew the magic commands.’ It worked—until it didn’t. One Friday, a minor schema change froze transactions. Support queues exploded, social media fumed, and the fix took hours. The issue wasn’t the change itself. It was the architecture that let one change affect everyone, instantly.

The team mapped single points of failure and split services. Each product line got its own database and deploy pipeline. They added caps on batch sizes and a progressive rollout so no change hit more than 5% of users at first. Then they ran black‑box drills, intentionally taking down a noncritical service on a quiet afternoon. The first two drills were bumpy, but by the fourth, the blast radius stayed small and clearly bounded.

Metrics shifted. Incident counts didn’t drop at first, but the scale of incidents did. Outages were smaller, faster, and more contained. Customers noticed stability even when small issues occurred. One engineer said the new rules felt ‘slower’ at first, then admitted he slept better knowing a mistake wouldn’t drown the company.

Nature nudges us toward this design. Forests with firebreaks burn more often in small patches and less often as infernos. Decentralized systems absorb hits and self‑repair. Centralized, tightly coupled systems promise speed, then surprise you with rare, devastating failures. Choose where you want your pain: a series of small, local lessons—or one lesson you can’t afford.

Walk the system and mark where one failure can take down too much, then split responsibilities and services so trouble stays local. Add explicit limits—batch caps, approval thresholds, and progressive rollouts—so changes can’t cascade. Practice with low‑stakes black‑box drills to make sure your firebreaks hold and the team learns calmly. Expect incidents to become smaller rather than vanish. Design for that, and your weekends get much quieter. Schedule a first drill this month.
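To make those limits concrete, here is a minimal sketch (illustrative only, not from the book) of two such firebreaks in code: a progressive rollout gate that exposes a change to roughly 5% of users at first, and a hard cap on batch size so one job cannot touch too many records at once. All names here (`is_in_rollout`, `run_batch`, `ROLLOUT_PERCENT`, `MAX_BATCH_SIZE`) are assumptions for the example, not a standard API.

```python
# Minimal sketch (illustrative, not from the book): two simple firebreaks
# for software changes: a progressive rollout gate and a batch-size cap.
import hashlib

ROLLOUT_PERCENT = 5    # expose a new change to at most 5% of users at first
MAX_BATCH_SIZE = 500   # cap how many records one batch job may touch

def is_in_rollout(user_id: str, feature: str, percent: int = ROLLOUT_PERCENT) -> bool:
    """Deterministically place a user inside or outside the rollout cohort.

    Hashing user_id + feature gives a stable bucket in 0-99, so the same
    user always gets the same answer, and the cohort only grows when the
    percentage is raised.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

def run_batch(records: list) -> None:
    """Refuse to process a batch larger than the agreed cap."""
    if len(records) > MAX_BATCH_SIZE:
        raise ValueError(
            f"Batch of {len(records)} exceeds cap of {MAX_BATCH_SIZE}; split it up."
        )
    for record in records:
        ...  # process one record at a time

# Usage: only the small rollout cohort sees the new code path.
if is_in_rollout(user_id="u-12345", feature="new-settlement-flow"):
    pass  # new behaviour, limited to a few percent of users
else:
    pass  # existing behaviour for everyone else
```

Hashing the user ID rather than sampling randomly keeps the cohort stable between requests, so widening the rollout from 5% to 25% to 100% is just a configuration change and never flip-flops a user between old and new behaviour.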

What You'll Achieve

Internally, shift from fear of failure to comfort with small, contained learning events. Externally, reduce blast radius, shorten recovery times, and improve overall reliability through decentralization and limits.

Localize risk and decentralize decisions

1. Map single points of failure
List where one error can take down the whole system: a monopoly vendor, one manager bottleneck, one shared database.

2. Split into smaller units
Divide responsibilities, vendors, or services so a failure stays local. Prefer city‑block scale over skyscraper scale.

3. Set firebreaks and limits
Add caps on batch sizes, approval thresholds, and exposure. Ensure one unit’s mistake can’t cascade.

4. Run black‑box drills
Periodically take down a noncritical component on purpose and check whether the blast radius remains contained (see the sketch after this list).
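For step 4, a rough sketch of how a drill could be scripted (the disable/enable hooks, service names, and health URLs below are placeholders, not a prescribed tool): disable one noncritical component, poll the health endpoints of the services that should be unaffected, and report whether the blast radius stayed contained.

```python
# Minimal drill sketch (hypothetical names and URLs): turn off one noncritical
# component, then verify that unrelated services stay healthy.
import time
import urllib.request

# Placeholder health endpoints for the services that should NOT be affected.
UNRELATED_SERVICES = {
    "payments-api": "http://localhost:8001/health",
    "billing": "http://localhost:8002/health",
}

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 200 within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

def run_drill(disable, enable, duration_seconds: int = 300) -> bool:
    """Disable the target, watch the rest of the system, then restore it."""
    contained = True
    disable()  # e.g. flip a feature flag or stop a noncritical worker
    try:
        deadline = time.time() + duration_seconds
        while time.time() < deadline:
            for name, url in UNRELATED_SERVICES.items():
                if not is_healthy(url):
                    print(f"Blast radius escaped: {name} degraded during the drill")
                    contained = False
            time.sleep(10)
    finally:
        enable()  # always restore the component, even if checks fail
    return contained
```

You would call `run_drill` with whatever disable/enable hooks you already have (a feature flag, a process stop/start, a load-balancer drain); if it returns False, the firebreaks need work before the next drill.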

Reflection Questions

  • Where is my hidden single point of failure?
  • What is the smallest unit size that still works well here?
  • Which cap or limit would prevent cascades with minimal friction?
  • How will I practice failure without real‑world damage?

Personalization Tips

  • Team: Give each squad its own deploy pipeline and rollback switch, with clear limits on what they can touch.
  • Community: Organize neighborhood prep groups rather than waiting on a national hotline in a crisis.
  • Finance: Avoid concentrated debt; keep obligations small and spread across time.

Antifragile: Things That Gain from Disorder

Nassim Nicholas Taleb 2012
Insight 7 of 8
