A field guide to distributed-systems failure

Distributed systems don't fail the way single machines do. They fail partially, ambiguously, and at the worst possible moment. A practical taxonomy of the failures that will find you, and the postures that survive them.


A single machine fails honestly. It crashes, or it doesn’t; the process is up or it’s down; the disk is there or it’s gone. You can reason about it because the states are discrete and the failure is total. A distributed system offers no such mercy. It fails partially — one node is slow but not dead, a message arrives but its acknowledgment doesn’t, a replica believes it’s the leader while another replica believes the same thing. The space between “working” and “broken” is where you’ll spend your career.

The instinct most engineers bring from single-machine work is that failure is an exception — something that happens occasionally and gets handled in a catch block. In a distributed system, failure is the steady state. At any given moment something is retrying, something is timing out, something is half-committed. The question is never “what if a component fails” but “which component is failing right now, and does the system notice.”

The partial-failure problem

The defining hazard of distributed systems is that a component can fail in a way that is invisible to its callers. A service that returns errors is easy: you see the errors. A service that accepts your request, does nothing, and never replies is a nightmare, because from the outside it looks identical to a service that’s merely slow.

This is why timeouts are not a nicety; they’re load-bearing. A call without a timeout is a call that can hang forever, and a thread blocked forever is a thread that’s gone. Stack enough of those and a single slow dependency takes down a service that was otherwise perfectly healthy. That is the classic cascading failure, where the thing that breaks is never the thing that caused it.
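As a rough illustration of bounding a call, here is a minimal Python sketch. The names `call_with_timeout`, `slow_dependency`, and `_pool` are hypothetical and exist only for this example; it assumes the dependency is an ordinary blocking function.

```python
import concurrent.futures
import time

# A small shared pool for outbound calls (hypothetical name for this sketch).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args), but stop waiting for it after timeout_s seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started running yet.
        future.cancel()
        raise TimeoutError(f"no reply within {timeout_s}s")

def slow_dependency():
    # Stand-in for a service that accepts the request and then stalls.
    time.sleep(0.3)
    return "late reply"

try:
    call_with_timeout(slow_dependency, 0.05)
except TimeoutError as exc:
    print("gave up:", exc)
```

The caller gets control back, but the worker thread stays occupied until the slow call returns. That is one reason to also set the timeout on the underlying socket or HTTP client wherever it offers one, rather than relying only on a wrapper like this.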

The first law of distributed systems is that the network is not reliable. The second is that you will forget the first law, and the network will remind you. — a sticky note that has outlived three of my laptops

Idempotency is the cure for “did it actually happen”

Here’s a question with no clean answer: you sent a request, the connection dropped, and you never got a response. Did the request succeed? You genuinely cannot know. The server may have processed it and failed to reply, or never received it at all. From where you sit, the two are indistinguishable.

The only sane response is to make the operation safe to repeat. An idempotent operation produces the same result whether it runs once or five times, which means “I’m not sure if it happened, so I’ll do it again” stops being dangerous. You attach a unique key to each logical operation and the server deduplicates on it; retries become safe, and the unanswerable question stops mattering.

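def charge_card(idempotency_key, amount):
    # `ledger` and `payment_gateway` are assumed to exist elsewhere:
    # a durable key-value store and a payment client, respectively.
    # Already processed this exact operation? Return the prior result.
    existing = ledger.get(idempotency_key)
    if existing is not None:
        return existing  # safe to retry: no double charge

    # Caveat: this check-then-charge is not atomic. Two concurrent retries
    # with the same key can both miss the ledger, so a production version
    # would reserve the key atomically before charging.
    result = payment_gateway.charge(amount)
    ledger.put(idempotency_key, result)
    return result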

Without the key, a retry double-charges the customer. With it, the client can retry as aggressively as it wants and the worst case is a wasted round trip. The retry behavior didn’t change — the operation became safe to retry, which is a completely different and far cheaper thing to get right.
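The client side of that contract can be sketched as follows. This is an illustration under stated assumptions, not a real payment API: `FakePaymentServer`, `DroppedConnection`, and `charge_with_retries` are hypothetical names, and the server is an in-memory stand-in that does the work and then loses its first reply.

```python
import uuid

class DroppedConnection(Exception):
    """The request may or may not have reached the server."""

class FakePaymentServer:
    """In-memory stand-in for a server that deduplicates on a key."""

    def __init__(self, drop_first_n_replies=1):
        self.results = {}   # idempotency_key -> prior result
        self.charges = []   # every charge actually applied
        self.drops_left = drop_first_n_replies

    def charge(self, idempotency_key, amount):
        if idempotency_key not in self.results:
            self.charges.append(amount)
            self.results[idempotency_key] = {"status": "charged", "amount": amount}
        if self.drops_left > 0:
            # The work happened, but the reply is lost on the way back.
            self.drops_left -= 1
            raise DroppedConnection("reply lost")
        return self.results[idempotency_key]

def charge_with_retries(server, amount, max_attempts=3):
    # One key per logical operation, generated once and reused on every retry.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return server.charge(key, amount)
        except DroppedConnection:
            continue  # outcome unknown: repeat with the same key
    raise DroppedConnection(f"gave up after {max_attempts} attempts")

server = FakePaymentServer(drop_first_n_replies=1)
print(charge_with_retries(server, 25), "charges applied:", len(server.charges))
```

The detail that matters is that the key is generated once, outside the loop. A fresh key per attempt would make every retry look like a new operation and bring the double charge back. A real client would also wait between attempts, typically with backoff and jitter, which this sketch leaves out.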

Failure containment beats failure prevention

You cannot prevent failure in a distributed system; you can only decide how far it spreads. The whole discipline is really about drawing boundaries — bulkheads, circuit breakers, isolated pools — so that when a component goes bad, the blast radius is one component and not the whole fleet.

A circuit breaker is the cleanest version of this idea. When a dependency starts failing, you stop calling it for a while instead of hammering it with requests that are going to fail anyway. You fail fast and locally rather than slowly and globally, which both protects the struggling dependency and frees your own resources to keep serving the requests that don’t depend on it.
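A minimal sketch of the idea in Python, assuming a single-threaded caller and a simple consecutive-failure count. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical; production libraries add locking, rolling error rates, and stricter half-open handling.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so the cool-down can be tested
        self.failures = 0         # consecutive failures seen so far
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast and locally instead of hammering the dependency.
                raise CircuitOpenError("dependency disabled; failing fast")
            # Cool-down elapsed: let a trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, `call` raises `CircuitOpenError` without touching the dependency, so the caller can serve a fallback or return an error immediately. The threshold and cool-down values here are placeholders, not recommendations.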

The systems that survive in production aren’t the ones that never fail. They’re the ones where failure is contained, observable, and boring — where a dead node is a non-event because three others picked up its work, and the only evidence is a blip on a dashboard nobody had to wake up for.

Common questions

How do distributed systems fail differently from single machines?

A single machine fails totally: it is up or it is down. A distributed system fails partially. A node is slow but not dead, a message arrives but its acknowledgment does not, two replicas each believe they are the leader. The hard work lives in that gap between working and broken.

Why do timeouts matter so much in distributed systems?

A call without a timeout can hang forever, and a thread blocked forever is a thread you have lost. Stack enough of those and one slow dependency takes down a healthy service. That is the classic cascading failure, where the thing that breaks is never the thing that caused it.

What is idempotency and why does it help with retries?

An idempotent operation produces the same result whether it runs once or five times. Attach a unique key to each logical operation and have the server deduplicate on it. Then a retry after a dropped connection is safe, and the unanswerable question of whether the first attempt succeeded stops mattering.

Can you prevent failure in a distributed system?

No. You can only decide how far it spreads. The discipline is containment: bulkheads, circuit breakers, and isolated pools that keep a bad component from taking the whole fleet with it. Systems survive because failure is contained and boring, not because it never happens.