CS‑6.7  ·  The Big Three
Enterprise Systems Design · Deep Dive · 6.7

Reliable systems from
unreliable machines.

Between 2003 and 2012, four Google papers — a file system, a compute model, a wide-column store, and a globally-consistent database — taught the industry one idea: assume the hardware will fail, and build the software to expect it.

Read the story

Drag to rotate · tap a node to fail it · watch the cluster heal

Act IThe Inversion

Stop making machines reliable. Make failure normal.

The whole library of distributed-data ideas rests on a single reversal of the prevailing wisdom — and once you accept it, every later design decision follows almost mechanically.

Through the 1990s, the answer to "my data is too big" was to buy a bigger, more reliable machine — more redundant power, error-correcting memory, a support contract. That is scaling up, and for a workload the size of the web it was a dead end: the biggest machine money can buy is still finite, the price climbs faster than the capability, and one very-reliable machine is still a single point of failure.

Google made the opposite bet. Rather than spend on reliable hardware to avoid failure, spend on software that tolerates it, and run that software on the cheapest hardware available — ordinary PCs with ordinary disks. This is scaling out. And it comes with a consequence most designs flinch from: at the scale of thousands of cheap machines, failure stops being an event and becomes a steady-state condition. Disks die, machines reboot, racks lose power — every single day. Something is always broken.

A system built assuming hardware is reliable will spend all its time in its error-handling paths. A system built assuming hardware fails treats those events as routine and keeps working. That is the inversion. Everything else in these four papers is a consequence of taking it seriously.

Prior art, in one sentence

The previous generation tried to make individual machines reliable; Google made the collection of machines reliable in software, and let each machine be cheap and unreliable. The hero above is that idea, live: kill a node and the data survives, because it was never trusting any single machine in the first place.

Four papers, one problem, four layers

The shared problem — "store and process the whole web on hardware that is always partly broken, and make that invisible to application programmers" — decomposes into four sub-problems, each answered by one system. They are the closest thing the field has to a founding library; almost every large-scale data system in use today traces its lineage to one of them.

2003 · Storage
GFS

Store files far larger than one disk, durably, on disks that fail routinely. Single master for metadata; data replicated 3× on commodity chunkservers.

2004 · Computation
MapReduce

Compute over that data across thousands of machines, some of which die mid-job. A restricted map→shuffle→reduce model makes re-execution a safe recovery mechanism.

2006 · Structured access
Bigtable

Fast, random, structured read/write at the same scale. A sparse, sorted, multi-dimensional map over GFS, partitioned into tablets.

2012 · Global consistency
Spanner

Strong, transactional guarantees across data spread worldwide. TrueTime bounds clock uncertainty so commit order can match real time.

Name the binding constraint first

Several forces shaped these systems — petabyte data volumes, commodity hardware, throughput over latency — but only one is binding, the one the whole design exists to relieve. It is not data volume; horizontal scaling handles that almost mechanically. It is the combination of constant failure and the cost of coordination. Storing and computing over enormous data would be easy if machines never failed and coordinating them were free. Neither is true — so every system here is a particular answer to one question: how do we get useful work done across many machines that keep failing, while paying as little as possible for the coordination that keeping them consistent requires?

FactThree different answers to the coordination problem

Watch how the four split on the coordination half of the question. GFS and Bigtable centralise coordination in a single master to keep it simple. MapReduce avoids most coordination by restricting the computation model so tasks don't need to talk. Spanner pays for coordination directly — buying tightly-bounded time so machines can agree on order without constantly talking. Centralise, avoid, or pay: hold that three-way contrast through the next five acts.

From the master brief's "binding constraint" framing (Chapter 4).

Act IIThe Foundation · GFS, 2003

One master holds the map. The data lives everywhere else.

GFS is the layer everything else stands on — MapReduce reads and writes it, Bigtable stores itself in it. It is the cleanest possible illustration of the inversion, so it's worth slowing down on.

A GFS cluster has a single master and many chunkservers, used by many clients. Files are split into fixed 64-megabyte chunks, each replicated — by default — on three chunkservers. The master holds all the metadata (the namespace, the file-to-chunk map, the replica locations) in memory, for speed.

The decisive choice is what the master does not do: it never touches file data. A client asks the master only "which chunkservers hold chunk N of this file?", caches the answer, then talks straight to a chunkserver for the bytes. Keeping bulk data off the master is what stops one master from becoming a bandwidth bottleneck — and why a single master can sit behind a cluster moving enormous aggregate throughput.

gfs · metadata on the master, bytes on the chunkservers
flowchart TB
  C([Client]):::c
  M["Master
metadata in memory:
namespace · chunk map · replica locations"]:::m C -->|"1 · where is chunk N?"| M M -->|"2 · handle + replica locations"| C subgraph CS["Chunkservers — commodity disks"] direction LR A[Chunkserver A
chunk 64MB]:::s B[Chunkserver B
chunk 64MB]:::s D[Chunkserver C
chunk 64MB]:::s end C ==>|"3 · read / append bytes (direct, bulk)"| A M -.->|HeartBeat + control| A M -.-> B M -.-> D classDef c fill:#0e2a39,stroke:#46C2E8,color:#bfe9fb; classDef m fill:#23190d,stroke:#F2A24B,color:#F2A24B; classDef s fill:#1E283A,stroke:#3A465E,color:#cfd6e2;
Replication factor 3 by default — the master re-replicates on failure. The master is on the control path, never the data path. That single separation is what lets it stay small while the cluster stays huge.
FactWhy is a GFS chunk a whole 64 MB — thousands of times bigger than a filesystem block?

The big chunk is a direct consequence of the single-master design. Bigger chunks mean fewer chunks, which means less metadata for the master to hold in memory and fewer client-to-master round trips — a client doing a large sequential read touches the master once, then streams from chunkservers. The cost: a small file made of one partial chunk gets little parallelism, and a popular small file can turn its few chunkservers into hot spots. GFS mitigated that by raising the replication factor for such files.

GFS (SOSP 2003), §on chunk size.

Designing for failure, concretely

Failure-handling isn't bolted on; it's woven through every mechanism. The master sends periodic HeartBeat messages to learn which chunkservers are alive; when one goes silent, the master re-replicates its chunks elsewhere to restore the factor of three. Every chunk is split into 64 KB blocks, each protected by a 32-bit checksum verified on every read, so disk corruption is caught and the bad replica repaired from a good one. And the system favours appends over overwrites, with an atomic record-append, because append-mostly access is both what Google's workloads needed and far easier to keep consistent across replicas.

a single read · the master's entire involvement is two messages
sequenceDiagram
  autonumber
  participant Cl as Client
  participant Ma as Master
  participant Cs as Nearest replica
  Note over Cl: chunks are fixed 64MB, so the client computes which chunk index it needs — no lookup
  Cl->>Ma: file name + chunk index
  Ma-->>Cl: chunk handle + replica locations
  Note over Cl,Ma: client caches this — a re-read skips the master entirely
  Cl->>Cs: give me byte range of this chunk
  Cs->>Cs: verify 64KB blocks against checksums
  Cs-->>Cl: bytes
  Note over Ma,Cs: if a replica dies, heartbeats notice and re-replicate — no human paged
What "design for failure" looks like at one operation. If a chunkserver dies between two reads, the cached location simply fails over to another replica, and the master quietly restores the replication factor. No one gets paged for a single dead disk.
GFS in one idea

Separate metadata from data; centralise the metadata in one master for simplicity; replicate the data across cheap machines for durability; and treat failure as routine by checking health continuously and re-replicating automatically. Every later distributed file system — HDFS first among them — is a variation on this template.

Act IIIThe Computation · MapReduce, 2004

Restrict what a program can do, and fault tolerance falls out for free.

If GFS answers "how do we store it?", MapReduce answers "how do we compute over it?" — with an abstraction so restricted that recovering from failure becomes trivial. That trade, generality for automatic resilience, is the lesson of the act.

The programmer writes just two functions. map(key, value) emits intermediate key/value pairs. reduce(key, list-of-values) combines all values for a key into a smaller set of outputs. Between them sits the shuffle: the framework groups every intermediate value by key and routes it to the right reducer, sorting along the way. Everything else — splitting the input, scheduling thousands of tasks, moving data, handling stragglers and failures — the framework does, not the programmer.

mapreduce · map → shuffle/sort by key → reduce
flowchart LR
  s1[split 1]:::in --> m1["map()"]:::map
  s2[split 2]:::in --> m2["map()"]:::map
  s3[split 3]:::in --> m3["map()"]:::map
  m1 --> r1["reduce()"]:::red
  m1 --> r2["reduce()"]:::red
  m2 --> r1
  m2 --> r2
  m3 --> r1
  m3 --> r2
  r1 --> o1[out 1]:::out
  r2 --> o2[out 2]:::out
  classDef in fill:#0e2a39,stroke:#46C2E8,color:#bfe9fb;
  classDef map fill:#102a23,stroke:#6BD08F,color:#bde9cd;
  classDef red fill:#23190d,stroke:#F2A24B,color:#f6d8a8;
  classDef out fill:#1E283A,stroke:#3A465E,color:#cfd6e2;
Shuffle is a barrier. A reduce task can't start until every map has finished — which is exactly why MapReduce is fundamentally a batch model, and why the arrows from every map fan into every reduce.

Why the restriction buys fault tolerance

This is the heart of the act, and one of the most teachable ideas in the whole case. Because map and reduce are pure functions — same input, same output, no hidden side effects — a task that fails, or runs on a machine that dies, can simply be run again somewhere else, with no special recovery logic. The master tracks every task's state, pings workers to detect death, and reschedules. No checkpoints, no rollback — just run it again.

This only works because the model is restricted. An arbitrary parallel program with shared mutable state could not be safely re-run — a partial execution might have already changed something. By forbidding that, the framework earns the right to recover by repetition. Toggle the restriction below and watch it break.

re-execution simulator · tap a task to kill its machine
done machine died re-running
The restriction is the feature. With pure functions, a dead task just re-runs and the job finishes. Switch to "side effects" — imagine tasks that mutate shared state — and the same death can't be safely repeated: the job is stuck or corrupt. The restriction isn't a limitation fault tolerance survives despite; it's what makes fault tolerance possible.

The straggler, and the canonical example

A subtler failure is the straggler: not a dead machine but a slow one — a failing disk, a contended CPU — holding up the whole job because reduce can't begin until every map finishes. MapReduce handles it elegantly: as a job nears completion, the master launches backup executions of the remaining tasks, and whichever copy finishes first wins. Because tasks are idempotent, running one twice is harmless.

The model is abstract until you watch word-count run on it. The whole program is two short functions: map receives a document and emits (word, 1) for every word; reduce receives a word and a long list of 1s and adds them up. That's the entire application — no code about machines, networks, splitting, or failure. The framework then supplies locality-aware scheduling (move the computation to the data), a combiner that collapses millions of (the, 1) into one (the, 53000) per task to cut network traffic, sorting, fault tolerance, and straggler mitigation. The engineer wrote ten lines about words; the framework did everything hard. That gap is the whole value of the abstraction.

The transferable lesson

Restricting what a program is allowed to do can be a feature, not a limitation. By accepting only pure functions in a fixed dataflow, MapReduce made re-execution a safe recovery strategy and put massively parallel computation in reach of ordinary programmers. Next time you see a system constrain its own expressiveness, ask what guarantee the constraint is buying.

Act IVThe Structure · Bigtable, 2006

Don't reinvent the hard parts. Compose them.

GFS gives whole-file, throughput-oriented access. A web index, Analytics, Earth, Finance need something else: fast, random, structured read and write of individual rows — at the same scale, on the same unreliable hardware. Bigtable provides it by building cleverly on top of GFS rather than replacing it.

A Bigtable is one enormous sparse, sorted, multi-dimensional map: it maps a (row, column, timestamp) triple to an uninterpreted array of bytes. Every word is load-bearing. Sorted: rows are kept in lexicographic order, so a range scan touches a small contiguous set of partitions. Sparse: a row may use any subset of columns and absent ones cost nothing, so a table can have millions of columns of which any row uses a handful. What it is not is relational — no schema, no joins, and (originally) no transaction spanning more than one row.

Read carefullyWhat does each refusal in Bigtable's definition buy?

Notice how much the abstraction quietly forbids — and that the refusals are the design. No cross-row transactions, no joins, no fixed schema. By promising less, Bigtable delivers what it does promise at a scale the richer relational abstraction could never reach. Single-row atomic writes are cheap to provide and cover most callers' needs; cross-row transactions were judged too costly at the scale. The same pattern runs through GFS (no POSIX) and MapReduce (no general programs). Spanner, two acts on, is the deliberate exception that proves the rule.

Bigtable (OSDI 2006); master brief Chapter 5.

Tablets, and writing without overwriting

The map is far too big for one machine, so it's partitioned by row range into tablets — the unit of distribution and load balancing. As a table grows, tablets split; as load shifts, they move between servers. And the storage engine rests on an idea that recurs throughout modern data systems: never modify data in place. Persistent data lives in SSTables — immutable, sorted files of key/value pairs, stored in GFS. Because an SSTable is never changed after writing, it's trivially safe to replicate, cache, and share.

New writes don't edit an SSTable. They go into a commit log (in GFS, for durability) and an in-memory buffer, the memtable. A read merges the memtable with the SSTables. When the memtable grows large it's flushed to a new SSTable, and periodic compactions merge SSTables to keep reads efficient. This log-structured, immutable-file approach turns random writes into sequential ones — exactly what makes data cheap to store reliably on a replicated file system.

bigtable · immutable SSTables, a write buffer in memory, coordination in Chubby
flowchart TB
  W([write]):::w --> TS
  subgraph TS["Tablet server — one tablet, a row range"]
    direction TB
    MT[memtable
recent writes, in RAM]:::mem CL[commit log
redo, in GFS]:::log MT -->|flush when large| SS subgraph SS["SSTables — immutable, sorted, in GFS"] direction LR s1[SSTable]:::sst s2[SSTable]:::sst s3[SSTable]:::sst end end CH["Chubby
lock service · 5 replicas · Paxos
one active master · tablet-location root"]:::chub TS <-->|coordinates| CH R([read]):::w -.->|merges memtable + SSTables| MT classDef w fill:#0e2a39,stroke:#46C2E8,color:#bfe9fb; classDef mem fill:#102a23,stroke:#6BD08F,color:#bde9cd; classDef log fill:#1E283A,stroke:#3A465E,color:#cfd6e2; classDef sst fill:#23190d,stroke:#F2A24B,color:#f6d8a8; classDef chub fill:#191a2e,stroke:#9B8CFF,color:#cabfff;
Reads merge the in-memory buffer with the immutable files. This same LSM-tree pattern underlies LevelDB, RocksDB, Cassandra, and HBase — a direct line of descent from this one diagram.

Chubby: the small, critical dependency

Bigtable doesn't solve distributed coordination itself. It leans on Chubby, a highly-available lock and small-file service: five replicas running Paxos, one elected master. Bigtable uses it only for the handful of things that genuinely need agreement — ensuring exactly one active Bigtable master, discovering tablet servers, storing schema, anchoring the tablet-location root. Push the hard, small problem of consensus into a dedicated service and keep it off the data path. That pattern is itself influential: ZooKeeper, the open-source coordinator at the heart of countless systems, is directly modelled on Chubby.

Bigtable in one idea

Compose, don't reinvent. Bigtable gets durability and replication from GFS, gets consensus from Chubby, and adds only what's genuinely its own: a sorted, sparse, versioned map, partitioned into tablets, backed by immutable SSTables and an in-memory buffer. The lesson is architectural humility — build the new capability as a thin, well-chosen layer over services that already solve the hard sub-problems.

Act VThe Reckoning · Spanner & TrueTime, 2012

Buying back the guarantee everyone else gave up on.

The first three systems all bought scale by relaxing a guarantee. Spanner went back for the hardest one — strong, transactional consistency across data spread around the world — and found a way to pay for it. It's the climax of the case, because it confronts the thing the others avoided: global coordination.

Spanner's headline guarantee is external consistency (strict serializability): if transaction T1 commits before T2 begins in real, wall-clock time, then T1 is ordered before T2 in the database, and T2 sees T1's effects. Concretely — commit a change, then phone a colleague who issues a read, and that read is guaranteed to see your change even from the far side of the planet. Most databases promise only that some serial order exists. Spanner promises the order matches what actually happened in the world.

spanner · each partition is a Paxos group across datacenters
flowchart LR
  subgraph DC1["Datacenter 1"]
    L[Paxos leader]:::lead
  end
  subgraph DC2["Datacenter 2"]
    R1[Paxos replica]:::rep
  end
  subgraph DC3["Datacenter 3"]
    R2[Paxos replica]:::rep
  end
  L <-->|synchronous Paxos replication of one partition| R1
  R1 <--> R2
  L <--> R2
  TT["TrueTime — time as a bounded interval
earliest · now() · latest   width = 2ε  (GPS + atomic clocks)"]:::tt L -.->|asks the time| TT classDef lead fill:#191a2e,stroke:#9B8CFF,color:#cabfff; classDef rep fill:#1E283A,stroke:#3A465E,color:#cfd6e2; classDef tt fill:#23190d,stroke:#F2A24B,color:#f6d8a8;
Availability from Paxos, order from TrueTime. A partition's replicas form a Paxos group, typically in different datacenters; the group keeps serving as long as a majority survive. A transaction spanning partitions runs two-phase commit layered over the Paxos groups.

The hard part: agreeing on time

Strict ordering by real time requires machines to agree on what time it is — and ordinary clocks drift and disagree by far more than a transaction's duration. Spanner's answer is TrueTime, an API that confronts the uncertainty instead of pretending it away. Rather than a single timestamp, it returns an interval — earliest and latest — and guarantees the true time lies inside it. The interval's half-width is epsilon (ε); its full width is 2ε. TrueTime keeps it narrow by deploying both GPS receivers and atomic clocks in every datacenter — two sources with independent failure modes, so the failure of one doesn't silently corrupt the time.

Commit-wait: spending time to buy order

TrueTime is put to work through commit-wait. When a transaction commits, it's assigned a timestamp; before its writes become visible, the leader deliberately waits until TrueTime guarantees that timestamp is in the past — it waits out the uncertainty, a delay on the order of ε. By the time anyone in the world can observe the commit, real time has provably moved past its timestamp, so any later transaction gets a larger timestamp and is ordered after it. Spanner literally spends a few milliseconds of waiting to buy globally correct ordering. Move ε below and watch the bill.

truetime + commit-wait · drag the clock uncertainty
Clock uncertainty ε4 ms
The cost of strong global consistency, made visible. Every write pauses for about 2ε — the width of the clock uncertainty — so that "after in real time" forces "after in timestamp order," which forces "sees your write." Tighten the clocks (smaller ε) and the tax shrinks. That's why Spanner spends real money on atomic clocks.
Worked exampleThe phone call, traced millisecond by millisecond

You commit a write. Its leader asks TrueTime and gets back an interval — say the true time is between 10:00:00.000 and 10:00:00.006, an uncertainty of 6 ms. The leader assigns the commit the latest edge, 10:00:00.006. Then commit-wait: it does not reveal the write until TrueTime can guarantee 10:00:00.006 is definitely in the past — until the earliest bound passes it, about 6 ms later. Only then are the locks released.

Now your colleague's read happens, in real time, after your phone call — so after 10:00:00.006. When their transaction asks TrueTime, even its earliest bound is past 10:00:00.006, so any timestamp it receives is larger than your commit's. A read at a later timestamp sees all writes with earlier timestamps. Therefore it sees your write. The few milliseconds of commit-wait are exactly what closed the gap.

Spanner (OSDI 2012); the canonical motivating scenario.

Spanner in one idea

You cannot make distributed clocks perfect, but you can bound how imperfect they are — and a known bound is enough. By exposing clock uncertainty as an interval and waiting it out before revealing a commit, Spanner provides the strong global guarantee everyone else had written off as too expensive. It's the clearest demonstration in the case that an abandoned guarantee can sometimes be bought back, if you're willing to pay the specific price it demands.

Act VIThe Spectrum

The one axis that turns four designs into a single lesson.

The four systems are usually taught as four separate designs. The thread that makes them cohere is consistency — what a system promises about whether different observers see the same data in the same order. It's the most important conceptual axis in distributed data, and these systems span it almost end to end.

At the weak end sits eventual consistency: after a write, replicas agree eventually, but a read in the meantime may see stale data. Cheap, highly available, and fine for a shopping cart or a social feed. At the strong end sits linearizability — every operation appears to take effect at one instant and everyone agrees on the order — and beyond it external consistency, which adds that the agreed order matches real wall-clock time. The further right you go, the more the system must coordinate, and coordination costs latency and availability. Drag along the spectrum:

the consistency spectrum · weaker & cheaper ← → stronger & costlier
Choosing the consistency model is choosing the system. GFS was content at the weak end; Spanner went to extraordinary lengths to reach the strong one. Neither is "right" — each is the correct point for a particular workload.

CAP, briefly and honestly

No discussion of the spectrum is complete without the CAP theorem: when a network partition occurs, a distributed system must choose between staying consistent and staying available — it can't have both. CAP is often over-stated; partitions are occasional, not constant, and the real choice is usually about latency. Spanner is the interesting case: it chooses consistency, and during a partition it makes the minority side unavailable rather than serve inconsistent data. Its designers argue that, because Google runs its own highly-reliable network, partitions are rare enough that this is a good bargain. The honest framing: consistency is not free, the price is paid in availability and latency, and the right point depends entirely on what the application can tolerate.

The question to carry to every data system

When you evaluate any data store, find its place on this spectrum first. Ask: what does it promise when two clients read and write the same data at once, and what does it do during a network partition? The answer tells you more about whether the system fits your problem than any benchmark.

Act VIIThe Verdict

Every design is a trade. Pay only for the guarantees you need.

The strongest version of the lesson holds all four trades in view at once: the first three spend simplicity and strong guarantees to buy scale and availability; Spanner crosses to the other side on consistency and pays the latency and infrastructure bill to do it.

Pick a system to see exactly what it bought, what it cost, and what grew from it.

system profiles · choose one
Read the ledger as a whole. The case is not "distributed beats centralised" or "strong beats eventual." It's that each guarantee has a price, and good design means paying only for the ones your problem actually needs.

The strongest counter-argument: you are not Google

Almost no one operates at Google's scale, and adopting Google's machinery without Google's problem is a well-documented way to waste years. Modern hardware is enormous — a single server holds terabytes of RAM and tens of terabytes of disk, and a well-indexed conventional database on one machine will out-perform a distributed system for all but genuinely huge workloads, while being far simpler to operate. Reach for these designs only when the data truly doesn't fit on one machine, or when availability genuinely requires geographic replication — and prefer the simplest point on the consistency spectrum your application can tolerate.

The meta-anti-pattern

Cargo-culting scale. The deepest misuse is adopting GFS-and-MapReduce-style machinery for a problem that fits comfortably on one machine. These systems exist to overcome the specific pain of data too big for one box and hardware that fails constantly. Without that pain, their costs — operational complexity, relaxed guarantees, latency — are all bill and no benefit. Use them when the constraint is real, and not before.

The ultimate validation: the progeny

The most telling evaluation isn't any benchmark in the papers — it's what happened next. Released as papers with no source code, these designs were re-implemented, independently, by people outside Google who recognised them as the right answers. A design that strangers rebuild from a description, and that then underpins a decade of industry infrastructure, has passed the most demanding evaluation there is.

lineage · four papers, a generation of systems
flowchart LR
  G[GFS · 2003]:::p --> H[HDFS]:::d
  MR[MapReduce · 2004]:::p --> HM["Hadoop MapReduce → Spark / Flink"]:::d
  B["Bigtable / Chubby · 2006"]:::p --> HB["HBase · Cassandra · ZooKeeper"]:::d
  S[Spanner · 2012]:::p --> CK["CockroachDB · YugabyteDB · Cloud Spanner"]:::d
  classDef p fill:#191a2e,stroke:#9B8CFF,color:#cabfff;
  classDef d fill:#0e2a39,stroke:#46C2E8,color:#bfe9fb;
Released with no code, rebuilt by strangers. GFS+MapReduce became Hadoop (named after Doug Cutting's son's toy elephant). Bigtable became HBase and inspired Cassandra; Chubby became ZooKeeper; Spanner became CockroachDB and Cloud Spanner — generally reproducing the global guarantee without the atomic clocks.

The vocabulary they left behind

Beyond specific systems, the papers gave the field its working language. "Eventual" versus "strong" consistency, "externally consistent" transactions, "shuffle" as a phase of computation, "tablet" and "SSTable" and "memtable," the very idea of separating storage from compute — all entered common use through this body of work. An engineer discussing data systems today is, largely without knowing it, speaking a language these four papers wrote.

Principle
Assume the parts will fail

Push the burden of coping with failure down into the storage and compute layers, so application programmers never have to think about it.

Principle
A restriction can be a feature

By promising less, a system can deliver what it does promise at a scale the richer abstraction never could. The refusals are the design.

Principle
Compose over hard services

Get durability from a file system, consensus from a lock service, and add only what's genuinely new. Architectural humility scales.

Principle
Choose the weakest consistency you can

Every guarantee above the weakest your problem can tolerate is one you pay for in coordination, latency, or cost.

Take-home challenge

Pick a realistic application — a ride-sharing backend, a messaging app, an online store. For three core data operations (say: record a payment, post a message, update a driver's location), place each on the consistency spectrum: state the weakest guarantee that operation can correctly tolerate, justify it, and name which of the four systems (or their modern descendants) you'd use and why. The trap: picking one system for everything, or reaching for the strongest guarantee everywhere "to be safe" and silently paying for it. A strong answer shows that different operations in the same application sit at different points — and that recognising this is the whole skill.

Assume the parts will fail, and decide deliberately what to promise about consistency — every guarantee above the weakest your problem can tolerate is one you must pay for. — the one principle to carry out of this case