Cassandra
Apache Cassandra is a distributed wide-column store built for write-heavy, always-on, horizontally-scaled workloads. It fuses two lineages: Dynamo’s leaderless replication + tunable consistency (no single point of failure, linear scale, multi-DC) and Bigtable’s wide-column data model. If the Distributed Systems page is the theory, Cassandra is the textbook Dynamo-style system — leaderless quorums, consistent hashing, eventual consistency, gossip — made concrete.
Data model — query-first
Section titled “Data model — query-first”A keyspace holds tables; a table’s primary key is (partition key, clustering columns…). The partition key decides which nodes store the row; clustering columns sort rows within a partition.
The mindset is the opposite of relational: model around your queries, not your entities. No joins, no ad-hoc WHERE — the partition key must be in the query. You denormalize aggressively and write the same data into several tables, one per access pattern. Duplication is the design, not a smell.
Partitioning: the token ring
Section titled “Partitioning: the token ring”Cassandra places data with consistent hashing: the partition key is hashed to a token, the ring of all tokens is divided into ranges, and each node owns a set of ranges. Adding or removing a node remaps only a slice of the ring, not the whole keyspace. Vnodes (many small token ranges per node) smooth out distribution and make rebalancing cheaper. This is the partitioning half of the Distributed Systems page, in production.
Replication & tunable consistency
Section titled “Replication & tunable consistency”Each keyspace sets a replication factor (RF) per datacenter — every partition is stored on RF nodes, all equal, no leader. NetworkTopologyStrategy spreads replicas across racks and DCs. Any node can act as the coordinator for a request and forward to the replicas.
Consistency is chosen per query via the consistency level (CL) — ONE, QUORUM, LOCAL_QUORUM, ALL. The key relation: R + W > RF gives you read-your-writes (a quorum read overlaps a quorum write on at least one current replica); LOCAL_QUORUM keeps multi-DC requests fast by staying in-region. This is the PACELC latency-vs-consistency dial exposed as a per-query knob — the same quorum math from the theory page.
Mermaid source
flowchart LR classDef client fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a; classDef coord fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a; classDef replica fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a; Client(["Client"]):::client Coord{{"Coordinator<br/>(any node, per request)"}}:::coord R1[("Replica A")]:::replica R2[("Replica B")]:::replica R3[("Replica C")]:::replica Client -->|"CL = QUORUM"| Coord Coord -->|"partition key → token"| R1 Coord --> R2 Coord --> R3 R1 -.->|ack| Coord R2 -.->|ack| Coord Coord -.->|"2 of RF=3 agree → respond"| ClientWrite path — why writes are so fast
Section titled “Write path — why writes are so fast”A write appends to the commit log (durability) and updates an in-memory memtable — then it’s acknowledged. No read-before-write, no in-place update. When a memtable fills, it flushes to an immutable on-disk SSTable. This is an LSM-tree: all writes are sequential appends, which is why Cassandra ingests at very high volume without flinching.
Mermaid source
flowchart LR classDef io fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a; classDef mem fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a; classDef disk fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a; W(["Write"]):::io CL[("Commit log<br/>append-only · durability")]:::disk MT["Memtable<br/>in-memory, sorted"]:::mem SS1[("SSTable")]:::disk SS2[("SSTable")]:::disk CMP[("Compaction<br/>merge · drop tombstones")]:::disk W --> CL W --> MT MT -->|"flush when full"| SS1 MT -->|"flush"| SS2 SS1 --> CMP SS2 --> CMPRead path & compaction
Section titled “Read path & compaction”A read merges the memtable with potentially many SSTables holding pieces of the partition; bloom filters skip SSTables that can’t contain the key, and a partition index + caches narrow the rest. Reads are costlier than writes — the LSM trade.
Compaction merges SSTables in the background, dropping overwritten cells and expired tombstones (delete markers). Strategies fit the workload: STCS (write-heavy), LCS (read-heavy, bounds SSTables per read), TWCS (time-series, drops whole old windows). Watch tombstones — a partition with many of them gets slow until gc_grace passes and compaction reclaims them.
Conflict resolution & anti-entropy
Section titled “Conflict resolution & anti-entropy”With no leader, replicas drift and must reconcile:
- Last-write-wins (LWW) — conflicts resolve by each cell’s timestamp, so clock skew across nodes can silently drop a write. (Modern Cassandra uses LWW, not vector clocks.)
- Hinted handoff — a coordinator stashes writes meant for a down replica and replays them when it returns.
- Read repair — on a quorum read, replicas that disagree are corrected inline.
- Anti-entropy repair — Merkle-tree comparison (
nodetool repair) reconciles deeper divergence; it must be run regularly. - Gossip — a peer-to-peer protocol for membership and failure detection (phi-accrual). No master: every node learns the ring and who’s up.
When to use — and not
Section titled “When to use — and not”Use it for: write-heavy ingest, time-series / event / IoT data, always-on (AP) availability, multi-region active-active, linear horizontal scale, and known access patterns you can model tables around.
Avoid it for: ad-hoc queries, joins, or aggregations (it isn’t relational — reach for Postgres); workloads needing strong consistency or multi-key transactions by default; small datasets where the operational weight (repair, compaction tuning, capacity planning) isn’t worth it.
These are working notes — Cassandra as the concrete embodiment of the leaderless, quorum-based ideas on the Distributed Systems page, with the one-liner terms (quorum, consistent hashing, eventual consistency) in the Study List. The throughline: trade joins and default-strong consistency for linear write scale and no single point of failure.