Reliability & Resilience

The one distinction that trips everyone up: high availability minimizes downtime; fault tolerance eliminates it. Everything else here hangs off that. The rest of the vocabulary — availability, durability, disaster recovery, resilience — gets used interchangeably and shouldn’t be:

| Term | In one line |
| --- | --- |
| Availability | Fraction of time the system serves successfully, quoted in nines (99.9% → 99.99% → …). |
| High availability (HA) | Stays up almost always: on failure a standby is promoted automatically. Fast, but with a brief gap. |
| Fault tolerance (FT) | Stays up through a failure with no gap: redundant parts run in parallel, so one dying changes nothing. |
| Durability | Committed data isn’t lost, even through failures. Separate from uptime: data can be safe while the system is briefly down. |
| Disaster recovery (DR) | Coming back from a large-scale loss (a whole region), graded by RTO/RPO. |
| Resilience | The umbrella: degrade gracefully and recover from faults of any kind. |

The difference between HA and FT is what happens the instant a component dies.

  • HA keeps a standby that must be brought into service — detect the failure, promote a replacement, repoint traffic. Fast (seconds), but non-zero: a brief window where requests fail. HA’s whole job is making that window short and automatic.
  • FT runs redundant components in parallel, all live. One fails, the others were already serving — a voter or load balancer just stops counting it. No failover step, no gap. That’s what “zero downtime” actually means.
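
The two bullets above can be sketched in a few lines of Python. This is an illustration only: `ha_failover_gap` and `vote` are hypothetical names, the timings are made up, and a dead replica is modelled simply as `None`.

```python
from collections import Counter


def ha_failover_gap(detect_s: float, promote_s: float, repoint_s: float) -> float:
    """HA: the outage window is the sum of the failover steps."""
    return detect_s + promote_s + repoint_s


def vote(replica_outputs):
    """FT: return the majority answer from replicas that all ran in parallel.

    A failed replica is represented as None and is simply not counted.
    A strict majority of the full replica set must agree on the answer.
    """
    live = [out for out in replica_outputs if out is not None]
    if not live:
        raise RuntimeError("all replicas failed")
    answer, count = Counter(live).most_common(1)[0]
    if count <= len(replica_outputs) // 2:
        raise RuntimeError("no majority among replicas")
    return answer


# HA: requests fail for the whole detect + promote + repoint window.
print(ha_failover_gap(detect_s=3.0, promote_s=1.5, repoint_s=0.5))  # 5.0

# FT: one replica died, the other two were already serving.
print(vote([42, 42, None]))  # 42

# FT: one replica returned a wrong value and is outvoted.
print(vote([42, 41, 42]))  # 42
```

The HA function has a non-zero result by construction, while the voter returns an answer without any promote step. That is the distinction the text draws.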

High availability has a brief failover gap — detect, promote, repoint — while fault tolerance has none because redundant components serve in parallel

Mermaid source
flowchart LR
classDef ok fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a;
classDef gap fill:#fdeaea,stroke:#d9534f,stroke-width:1.5px,color:#0f172a;
classDef step fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;
classDef fault fill:#fef6e7,stroke:#d9a441,stroke-width:1.5px,color:#0f172a;
subgraph HA["High availability — brief failover gap"]
direction LR
h0["serving"]:::ok --> h1["fault"]:::fault --> h2["detect<br/>(health check timeout)"]:::gap --> h3["promote standby<br/>+ repoint traffic"]:::gap --> h4["serving again"]:::ok
end
subgraph FT["Fault tolerance — no gap"]
direction LR
f0["serving<br/>(N replicas in parallel)"]:::ok --> f1["fault in one"]:::fault --> f2["peers already serving<br/>voter masks the fault"]:::ok
end
HA ~~~ FT

FT costs far more — fully redundant capacity running idle-but-live, plus the machinery to keep replicas in lockstep — so you reserve it for where any interruption is unacceptable (payments auth, flight control, telecom) and use HA for everything else.

HA and FT occupy the upper rungs of a redundancy ladder. Each rung shortens the gap and raises the cost; the line between HA and FT is whether failover includes a promote step.

| Posture | Backup state | Failover | Data loss | Typical use |
| --- | --- | --- | --- | --- |
| No redundancy (SPOF) | none | full outage | high | |
| Cold standby | off | minutes–hours (boot + restore) | to last backup | DR for non-critical |
| Warm standby | running, not serving | seconds–minutes (promote) | seconds (lag) | HA |
| Hot standby (active-passive) | live, ready | sub-second (promote) | ~zero (sync) | HA |
| Active-active | all nodes serving | none (only capacity is lost) | ~zero | FT |
| Lockstep + voter | parallel, in sync | none (fault masked) | zero | FT |

You can’t design reliability without a number to design to.

  • Nines: the downtime budget. 99.9% ≈ 8.8 h/yr, 99.99% ≈ 52 min/yr, 99.999% ≈ 5 min/yr (full ladder). Each extra nine means roughly 10× less downtime and roughly an order of magnitude more cost, so name the target before chasing it.
  • Availability = MTBF / (MTBF + MTTR) — the lever is usually MTTR (recover faster), which is exactly what HA’s automated failover attacks.
  • Redundancy multiplies it — two independent 99% nodes fail only together: 99.99%. The catch is independent: two nodes sharing one power feed have a hidden SPOF. Conversely, serial dependencies erode it — a request through three 99.9% services is only ~99.7%.
  • RTO / RPO — recovery time (how fast you’re back) vs recovery point (how much data you can lose); RPO is set by backup/replication frequency.
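
The arithmetic in the bullets above can be checked with a short sketch. The function names are hypothetical, a year is taken as 365 days, and the parallel case assumes failures are fully independent, which is the caveat the text raises.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_from_mtbf(mtbf_h: float, mttr_h: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


def parallel(*availabilities: float) -> float:
    """Redundant, independent components: down only if all are down."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


def serial(*availabilities: float) -> float:
    """Chained dependencies: up only if every one is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


print(f"{downtime_hours_per_year(0.999):.2f} h/yr")          # 8.76 h/yr
print(f"{downtime_hours_per_year(0.9999) * 60:.1f} min/yr")  # 52.6 min/yr
print(f"{parallel(0.99, 0.99):.4f}")                         # 0.9999
print(f"{serial(0.999, 0.999, 0.999):.4f}")                  # 0.9970

# Halving MTTR at the same MTBF: the lever HA's automated failover pulls.
print(f"{availability_from_mtbf(999, 1.0):.5f}")  # 0.99900
print(f"{availability_from_mtbf(999, 0.5):.5f}")  # 0.99950
```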

No single switch delivers reliability. It comes from redundancy + automatic failure detection + failover/voting, at every layer:

| Layer | What buys reliability |
| --- | --- |
| Hardware | redundant PSUs/NICs, ECC, hot-swap (N+1); RAID/erasure coding; lockstep CPUs + voter for true FT |
| Data | replication; consensus (Raft/Paxos) + quorum so a minority can fail with no data loss and a leader is re-elected automatically; WAL for recovery |
| Compute | stateless services behind a load balancer (health checks route around dead ones); Kubernetes reschedules pods; autoscaling replaces dead VMs |
| Geography | multi-AZ / multi-region, anycast, DNS failover: survive a datacenter loss |
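
To make the Compute row concrete, here is a minimal sketch of a load balancer that routes around backends whose health checks fail. `HealthCheckedBalancer` and its methods are hypothetical names, not any real load balancer's API, and the health probe itself is assumed to happen elsewhere.

```python
class HealthCheckedBalancer:
    """Round-robin over backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # position of the next round-robin candidate

    def mark(self, backend, is_healthy):
        """Record a health-check result (the probe is out of scope here)."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none remain."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = HealthCheckedBalancer(["a", "b", "c"])
lb.mark("b", False)                   # health check on "b" failed
print([lb.pick() for _ in range(4)])  # ['a', 'c', 'a', 'c']
```

Because the services are stateless, dropping "b" from rotation costs only capacity. No state has to be promoted or moved.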

The key stateful primitive is the replicated state machine: N replicas applying the same consensus-ordered log. That is what makes a stateful service fault-tolerant (it survives the loss of a minority of replicas, re-elects a leader automatically, and avoids split-brain). It runs under etcd, Spanner, CockroachDB, and Kafka’s metadata.
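
A minimal sketch of the idea, under a strong assumption: the log has already been ordered by consensus. Nothing below implements Raft or Paxos, and `KVStateMachine`, `quorum_size`, and `tolerated_failures` are hypothetical names. It shows only the majority arithmetic and the fact that deterministic replicas fed the same log end in the same state.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return (n - 1) // 2


class KVStateMachine:
    """Deterministic key-value state machine: same log in, same state out."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # index of the next log entry to apply

    def apply_log(self, log):
        """Apply any entries not yet applied, in log order."""
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "del":
                self.state.pop(key, None)
            self.applied += 1


# Assumed to be the agreed, consensus-ordered log.
log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]

replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    replica.apply_log(log)

print(all(r.state == {"y": 2} for r in replicas))  # True
print(quorum_size(3), tolerated_failures(3))       # 2 1
print(quorum_size(5), tolerated_failures(5))       # 3 2
```

Note that four replicas tolerate no more failures than three, which is why such clusters are usually sized with an odd number of nodes.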

Reliability is also about how you fail. Two orthogonal toolkits:

  • Failure modes: fail-fast (stop on a bad state, don’t corrupt), fail-soft / graceful degradation (shed non-essentials, keep the core: the feed loads even when recommendations don’t), fail-safe / fail-secure (fall to a safe vs locked-down default).
  • Containment patterns (at the call boundary, often a service mesh) — timeouts (never wait forever), retries + idempotency (recover from blips safely), circuit breaker (stop calling a sick dependency), bulkhead (isolate pools so one slow dependency can’t sink everything), backpressure (shed load instead of collapsing).
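
As one possible illustration of the two toolkits working together, the sketch below pairs a small circuit breaker with a fail-soft fallback. All names (`CircuitBreaker`, `CircuitOpenError`, `load_feed`) and the thresholds are assumptions made for the example. A production service would more likely get this behaviour from a service mesh or an established library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a dependency marked sick."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock     # injectable so tests need not sleep
        self.failures = 0      # consecutive failures seen so far
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0      # any success closes the circuit
        self.opened_at = None
        return result


def load_feed(breaker, fetch_recommendations):
    """Fail-soft: the feed still loads when recommendations do not."""
    feed = {"posts": ["p1", "p2"], "recommendations": []}
    try:
        feed["recommendations"] = breaker.call(fetch_recommendations)
    except Exception:
        pass  # degrade: keep the core, drop the non-essential part
    return feed


def broken_recommendations():
    raise TimeoutError("recommendation service timed out")


breaker = CircuitBreaker(max_failures=2)
for _ in range(3):
    print(load_feed(breaker, broken_recommendations))
# Each call prints the posts with an empty recommendations list; the third
# call is rejected by the open breaker without touching the dependency.
```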

Reliability is per-subsystem, not global — for each part, ask what failure there costs:

  • Money / safety (payments, inventory, regulated) → FT, strong consistency, sync replication.
  • User-facing but recoverable (the request path) → HA: multi-AZ, automated failover, 99.9–99.99%.
  • Best-effort (analytics, recommendations, batch) → graceful degradation; let it lag without taking the core down.

Then spend nothing on the nines you don’t need.