ClickHouse
ClickHouse is a columnar OLAP database built to scan billions of rows and return aggregations in well under a second. Where Postgres is row-oriented OLTP and Cassandra is a write-optimized key-value/wide-column store, ClickHouse is for analytics: dashboards, observability, event and time-series data, ad-hoc GROUP BY over enormous tables. It’s the concrete answer to the “analytics / stream aggregation” question pattern — and the store Momentic leaned on to serve ~2M cache queries/day at ~250ms.
Columnar storage
Section titled “Columnar storage”The single idea everything else builds on: store data by column, not by row — one compressed file per column. Two wins follow directly:
- Read only what you query. An aggregation over 2 columns of a 40-column table touches 2 files, not the whole row. A row store reads every row in full.
- Compression that actually works. Adjacent values in a column are similar (same type, often sorted/low-cardinality), so LZ4/ZSTD crush them — less disk, less I/O, more rows per cache line. Execution is vectorized: operate on whole column blocks with SIMD instead of row-at-a-time.
Mermaid source
flowchart LR classDef q fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a; classDef read fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a; classDef skip fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a; Q["SELECT avg(price)<br/>WHERE day = '...'"]:::q subgraph T["Columnar table — one compressed file per column"] direction TB Cday["day ▮▮▮▮▮"]:::read Cprice["price ▮▮▮▮▮"]:::read Cuser["user_id ▮▮▮▮▮"]:::skip Cmeta["... 40 more cols"]:::skip end Q -->|"reads only the 2<br/>columns it needs"| Cday Q --> Cprice Cuser -.->|not read| Q Cmeta -.->|not read| QMergeTree — the engine
Section titled “MergeTree — the engine”The default engine family. Data is written as immutable, sorted parts and merged in the background — an LSM-flavored design tuned for analytical reads:
- Sorted by
ORDER BY. Each part stores rows sorted by the table’s sorting key; this clustering is what makes range scans and the index work. - Sparse primary index. The primary key is not unique and doesn’t point at rows — it stores one mark per granule (~8192 rows). Combined with
PARTITION BY(usually a month) and skip indexes (minmax, set, bloom filter), a query prunes down to a handful of granules before reading anything. - Background merges. Parts are merged into bigger sorted parts over time (reclaiming space, applying engine-specific logic). The cost: insert in batches — every insert creates a part, and thousands of tiny inserts make “too many parts” the classic ClickHouse failure mode.
Mermaid source
flowchart LR classDef ins fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a; classDef part fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a; classDef merged fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a; classDef idx fill:#fef6e7,stroke:#d9a441,stroke-width:1.5px,color:#0f172a; I1(["Batch insert"]):::ins I2(["Batch insert"]):::ins I3(["Batch insert"]):::ins P1[("Part · sorted by ORDER BY")]:::part P2[("Part")]:::part P3[("Part")]:::part M[("Merged part<br/>re-sorted · dedup/aggregate")]:::merged IDX["Sparse primary index<br/>1 mark / ~8192-row granule"]:::idx I1 --> P1 I2 --> P2 I3 --> P3 P1 -->|background merge| M P2 --> M P3 --> M IDX -.->|skip granules| MMerge engines & materialized views
Section titled “Merge engines & materialized views”The merge step is a hook for doing work as data settles, which is how ClickHouse fakes updates and pre-aggregates:
- ReplacingMergeTree — keeps only the latest row per sorting key at merge time (dedup / “latest value”). The insert-only pattern: never
UPDATE, just insert a newer version and let merges collapse it — what Momentic used to drop a whole Redis layer. - AggregatingMergeTree / SummingMergeTree — fold rows into running aggregates at merge.
- Materialized views — incremental aggregation: each insert into the source table is transformed and written to a pre-aggregated target, so dashboards read a tiny rollup instead of scanning raw events.
Note the trade these reveal: ClickHouse has no cheap point updates/deletes. ALTER ... UPDATE/DELETE are heavy async mutations that rewrite parts. You model around inserts + merge-time collapse, not in-place edits.
Sharding & replication
Section titled “Sharding & replication”Scale-out is two orthogonal axes:
- ReplicatedMergeTree (replicas) — copies of a shard for HA and read scaling, coordinated through ClickHouse Keeper (a built-in Raft service replacing ZooKeeper). Replication is asynchronous and eventually consistent.
- Distributed table (shards) — a thin table that fans a query out across shards and merges the results; rows are placed by a sharding key. Shards add write/scan throughput; replicas add availability.
It’s not a transactional store: no multi-row ACID, eventually-consistent replication, async mutations — all the things you’d lean on Postgres for.
When to use — and not
Section titled “When to use — and not”Use it for: analytical aggregation over huge tables, time-series and event data, observability/metrics/logs, real-time dashboards — anything that’s scan-and-aggregate heavy with append-mostly writes.
Avoid it for: OLTP, frequent updates/deletes, point lookups by key (a row store or KV fits), per-row transactions, or high-concurrency single-record reads. It trades flexible writes and transactions for raw analytical scan speed.
These are working notes — ClickHouse as the columnar/OLAP point in the data-store space, beside row-oriented Postgres and wide-column Cassandra. The throughline: columnar layout + sorted parts + sparse index + vectorized execution is why it scans billions of rows fast — and why it’s the wrong tool for a single-row update.