
Kubernetes & Orchestration

Container orchestration is the problem of running many containers across many machines without hand-placing them: scheduling, healing, scaling, networking, rollout. Kubernetes (K8s) is the de-facto answer, and its one big idea is declarative reconciliation — you describe the desired state, and control loops continuously drive actual state toward it. Everything below is a consequence of that idea. The page ends on a worked architecture: a Postgres cluster managed by an Operator, which ties straight back to the Postgres Internals page (replication, WAL/PITR) and the DBaaS-shaped drills in the Study List.

You don’t tell Kubernetes how to do things; you declare what you want (a YAML manifest), and a controller makes it so. Each controller runs a loop:

observe actual state → diff against desired (spec) → act to converge → record status → repeat.

Two properties matter:

  • Declarative, not imperative — “I want 3 replicas,” not “start a container.” If one dies, the loop notices the diff (2 ≠ 3) and creates a replacement. Self-healing is just reconciliation.
  • Level-triggered, not edge-triggered — the loop re-reads the whole current state every time rather than reacting to individual events, so a missed or duplicated event can’t corrupt it. It’s eventually consistent by design.

This is the same control-loop pattern as a thermostat, and it’s the mental model for everything in K8s — including the Operators you write yourself.
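A minimal sketch of one such loop in plain Go, with no Kubernetes libraries involved. The `Cluster` type, its fields, and the `Kill` helper are hypothetical stand-ins invented for illustration. The point is only that the loop compares counts (levels) on every pass, so it heals a failure it was never told about.

```go
package main

import "fmt"

// Cluster is a toy stand-in for real cluster state; all names are hypothetical.
type Cluster struct {
	Desired int   // spec: how many replicas we want
	Running []int // actual: IDs of replicas currently running
	nextID  int   // counter used to mint new replica IDs
}

// Reconcile performs one pass: observe, diff, act. It returns the number of
// actions taken, so a return value of 0 means the state has converged.
func (c *Cluster) Reconcile() int {
	actions := 0
	for len(c.Running) < c.Desired { // too few: create replacements
		c.nextID++
		c.Running = append(c.Running, c.nextID)
		actions++
	}
	for len(c.Running) > c.Desired { // too many: remove extras
		c.Running = c.Running[:len(c.Running)-1]
		actions++
	}
	return actions
}

// Kill simulates a replica dying without any event reaching the loop.
func (c *Cluster) Kill(id int) {
	for i, r := range c.Running {
		if r == id {
			c.Running = append(c.Running[:i], c.Running[i+1:]...)
			return
		}
	}
}

func main() {
	c := &Cluster{Desired: 3}
	fmt.Println("actions:", c.Reconcile(), "running:", c.Running)
	c.Kill(2) // the loop is not notified
	fmt.Println("after failure:", c.Running)
	fmt.Println("actions:", c.Reconcile(), "running:", c.Running)
	fmt.Println("actions:", c.Reconcile(), "(nothing left to do)")
}
```

Because each pass starts from the full observed state, running `Reconcile` again after convergence is a no-op. That idempotence is what makes missed or duplicated events harmless.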

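A cluster splits into a control plane (the brain: decides what should run) and worker nodes (the muscle: actually run containers). Control-plane components and kubelets coordinate only through the API server; they never call each other directly. (The API server itself talks to etcd, and each kubelet drives its local container runtime.)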

Kubernetes cluster architecture — control plane and worker nodes

| Component | Plane | Role |
| --- | --- | --- |
| kube-apiserver | control | The single front door: validates every request and is the only component that reads/writes etcd. All other components are clients of it. |
| etcd | control | Consistent, replicated key-value store; the cluster’s source of truth. (Raft-based: this is your CP store.) |
| scheduler | control | Watches for unassigned Pods and binds each to a Node by resources, affinity, taints. |
| controller-manager | control | Runs the built-in reconcile loops (Deployment, ReplicaSet, Node, Job…). |
| cloud-controller-manager | control | Bridges to the cloud provider: manages load balancers, routes, and node lifecycle. (Volume provisioning mostly goes through CSI drivers rather than this component.) |
| kubelet | node | Per-node agent; ensures the containers in its assigned PodSpecs are running and healthy. |
| kube-proxy | node | Programs node networking so a Service’s virtual IP load-balances to its Pods. |
| container runtime | node | containerd/CRI-O: actually pulls images and runs containers. |

Note the design: a declarative API server backed by a consistent store, with stateless reconcilers as clients. That’s a clean system-design pattern in its own right — control plane / data plane separation, single source of truth, level-triggered convergence.
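The scheduler's job from the table above (bind each unassigned Pod to a Node by resources and taints) can be sketched as a filter step followed by a score step. This is a simplified illustration, not the real kube-scheduler: the `Node` and `Pod` shapes are hypothetical, affinity is left out, and the "most CPU left over" scoring rule is an assumption chosen for brevity.

```go
package main

import "fmt"

// Node and Pod are simplified, hypothetical shapes, not the real API types.
type Node struct {
	Name    string
	FreeCPU int             // millicores still unallocated
	Taints  map[string]bool // taint keys present on the node
}

type Pod struct {
	Name        string
	CPURequest  int             // millicores requested
	Tolerations map[string]bool // taint keys the pod tolerates
}

// feasible reports whether the pod fits on the node: enough free CPU, and
// every taint on the node is tolerated by the pod.
func feasible(p Pod, n Node) bool {
	if n.FreeCPU < p.CPURequest {
		return false
	}
	for taint := range n.Taints {
		if !p.Tolerations[taint] {
			return false
		}
	}
	return true
}

// pickNode filters to feasible nodes, then picks the one with the most CPU
// left after placement (a scoring rule assumed for this sketch).
func pickNode(p Pod, nodes []Node) (string, bool) {
	best, bestScore, found := "", -1, false
	for _, n := range nodes {
		if !feasible(p, n) {
			continue
		}
		if score := n.FreeCPU - p.CPURequest; score > bestScore {
			best, bestScore, found = n.Name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500},
		{Name: "node-b", FreeCPU: 2000, Taints: map[string]bool{"gpu": true}},
		{Name: "node-c", FreeCPU: 1000},
	}
	name, ok := pickNode(Pod{Name: "web", CPURequest: 400}, nodes)
	fmt.Println("bind web to:", name, ok)
}
```

In the real system the result is written back through the API server as a binding; the scheduler never contacts the kubelet itself.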

Mermaid source
flowchart TB
classDef cp fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;
classDef node fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;
classDef store fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a;
classDef client fill:#fef6e7,stroke:#d9a441,stroke-width:1.5px,color:#0f172a;
Kubectl(["kubectl / clients<br/>apply desired state"]):::client
subgraph CP["Control plane"]
API["kube-apiserver<br/>the one front door"]:::cp
ETCD[("etcd<br/>cluster source of truth")]:::store
SCHED["scheduler<br/>assigns Pods → Nodes"]:::cp
CM["controller-manager<br/>built-in reconcile loops"]:::cp
CCM["cloud-controller-manager<br/>LBs · routes · nodes"]:::cp
end
subgraph N1["Worker node"]
K1["kubelet<br/>runs assigned Pods"]:::node
PR1["kube-proxy<br/>Service routing"]:::node
RT1["containerd<br/>runtime"]:::node
POD1(["Pods"]):::node
end
subgraph N2["Worker node"]
K2["kubelet"]:::node
PR2["kube-proxy"]:::node
RT2["containerd"]:::node
POD2(["Pods"]):::node
end
Kubectl --> API
API <--> ETCD
SCHED <--> API
CM <--> API
CCM <--> API
K1 <--> API
K2 <--> API
K1 --> RT1 --> POD1
K2 --> RT2 --> POD2

The nouns the reconcilers manage. The hierarchy is the useful part: a Deployment owns ReplicaSets which own Pods.

| Object | What it is |
| --- | --- |
| Pod | Smallest deployable unit: one or more co-located containers sharing network/storage. Treated as cattle, not pets. |
| ReplicaSet | Keeps N identical Pods running. Rarely managed directly. |
| Deployment | Manages ReplicaSets to give rolling updates + rollback for stateless apps. |
| StatefulSet | Like a Deployment but with stable identity + stable storage per Pod (pg-0, pg-1…), which is what databases need. |
| DaemonSet | One Pod per node (log shippers, agents). |
| Job / CronJob | Run-to-completion / scheduled work. |
| Service | Stable virtual IP + DNS name load-balancing to a set of Pods (Pods are ephemeral; Services are not). |
| Ingress | HTTP(S) routing from outside the cluster to Services. |
| ConfigMap / Secret | Externalized config and credentials. |
| PersistentVolume / PVC | Durable storage decoupled from any Pod’s lifetime. |
| Namespace | A scope for isolating and grouping resources. |
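The Deployment → ReplicaSet → Pod hierarchy is recorded as owner references on the owned objects. The sketch below models that with a flat map, which is enough to show two things: walking up from a Pod to its top-level owner, and working out what a cascading delete would remove. The `Object` type, the string keys, and the object names are hypothetical, and the code assumes the ownership graph has no cycles.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a hypothetical, flattened view of a Kubernetes object: an
// identity plus an optional reference to its owner.
type Object struct {
	Kind, Name string
	Owner      string // key of the owning object, "" if none
}

// rootOwner follows owner references upward until it reaches an object with
// no owner (or an unknown key) and returns that key.
func rootOwner(objs map[string]Object, k string) string {
	for {
		o, ok := objs[k]
		if !ok || o.Owner == "" {
			return k
		}
		k = o.Owner
	}
}

// cascade returns, sorted, every key that goes away if root is deleted:
// root itself plus all transitive dependents.
func cascade(objs map[string]Object, root string) []string {
	doomed := map[string]bool{root: true}
	for changed := true; changed; {
		changed = false
		for k, o := range objs {
			if !doomed[k] && doomed[o.Owner] {
				doomed[k] = true
				changed = true
			}
		}
	}
	out := make([]string, 0, len(doomed))
	for k := range doomed {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	objs := map[string]Object{
		"Deployment/web":    {Kind: "Deployment", Name: "web"},
		"ReplicaSet/web-1":  {Kind: "ReplicaSet", Name: "web-1", Owner: "Deployment/web"},
		"Pod/web-1-a":       {Kind: "Pod", Name: "web-1-a", Owner: "ReplicaSet/web-1"},
		"Pod/web-1-b":       {Kind: "Pod", Name: "web-1-b", Owner: "ReplicaSet/web-1"},
		"Service/web":       {Kind: "Service", Name: "web"},
	}
	fmt.Println("root of Pod/web-1-a:", rootOwner(objs, "Pod/web-1-a"))
	fmt.Println("deleting the Deployment removes:", cascade(objs, "Deployment/web"))
}
```

The Service survives the cascade because nothing owns it; it finds Pods by label selector rather than by ownership.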

Kubernetes lets you add your own object types and your own reconcilers — that’s how you teach the cluster to run a thing it didn’t ship with.

  • CRD (CustomResourceDefinition) — registers a new kind (e.g. PostgresCluster) with the API server. From then on it’s a first-class API object: stored in etcd, gettable with kubectl get postgrescluster, RBAC-controlled — just no behavior yet.
  • Custom controller — a reconcile loop you write that watches your custom resources and makes reality match them.
  • Operator = CRD + custom controller that encodes operational domain knowledge. Instead of a human running the runbook (“primary died → promote a replica → repoint the service”), the controller does it. The Operator pattern turns “how to operate X” into code.
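As a rough illustration of what a custom kind looks like on the Go side, here is a stdlib-only sketch of a `PostgresCluster` type with the usual spec/status split. Every field name is hypothetical, and the layout is flattened: real kubebuilder types embed `metav1.TypeMeta` and `metav1.ObjectMeta` and carry generated deepcopy code, all omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostgresClusterSpec is the desired state, written by the user.
type PostgresClusterSpec struct {
	Instances   int    `json:"instances"`
	StorageSize string `json:"storageSize"`
}

// PostgresClusterStatus is the observed state, written by the controller.
type PostgresClusterStatus struct {
	ReadyInstances int    `json:"readyInstances"`
	CurrentPrimary string `json:"currentPrimary,omitempty"`
}

// PostgresCluster is a flattened stand-in for a custom resource.
type PostgresCluster struct {
	Kind   string                `json:"kind"`
	Name   string                `json:"name"` // real objects nest this under metadata
	Spec   PostgresClusterSpec   `json:"spec"`
	Status PostgresClusterStatus `json:"status"`
}

// decode parses a JSON document into a PostgresCluster.
func decode(raw []byte) (PostgresCluster, error) {
	var pc PostgresCluster
	err := json.Unmarshal(raw, &pc)
	return pc, err
}

// Converged reports whether observed state matches desired state for the
// one dimension this sketch tracks.
func (pc PostgresCluster) Converged() bool {
	return pc.Status.ReadyInstances == pc.Spec.Instances
}

func main() {
	raw := []byte(`{"kind":"PostgresCluster","name":"pg",
		"spec":{"instances":3,"storageSize":"100Gi"},
		"status":{"readyInstances":2}}`)
	pc, err := decode(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s/%s converged=%v\n", pc.Kind, pc.Name, pc.Converged())
}
```

The split matters: users own `spec`, the controller owns `status`, and the gap between them is exactly the work the reconcile loop has left to do.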

kubebuilder / controller-runtime are the Go toolchain for this. controller-runtime gives you a Manager (wires up shared caches/clients), informers/watches that feed a work queue, and a Reconciler interface — you implement one method:

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Called per object key; controller-runtime handles the watching, queueing, and retries.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. fetch the desired state (the custom resource named by req.NamespacedName)
	// 2. fetch actual state (the Pods/Services/PVCs it owns)
	// 3. create/update/delete to converge
	// 4. update .status
	return ctrl.Result{RequeueAfter: time.Minute}, nil // re-check later (level-triggered)
}

kubebuilder scaffolds the project, the CRD types, and the RBAC. You write the diff-and-converge logic; everything else (caching, leader election, requeue with backoff) is the framework’s.
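One reason the framework's queueing fits level-triggered reconcilers is that the work queue holds object keys, not events, so a burst of changes to one object collapses into a single pending reconcile. The following is a stdlib-only, single-goroutine sketch of that deduplication idea. It is not the client-go workqueue API: the `Queue` type and its methods are hypothetical, and rate limiting, retries, and concurrency are left out.

```go
package main

import "fmt"

// Queue is a minimal deduplicating FIFO of object keys.
type Queue struct {
	order   []string        // keys in arrival order
	pending map[string]bool // keys currently waiting in the queue
}

// NewQueue returns an empty queue.
func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

// Add enqueues a key unless that key is already waiting.
func (q *Queue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest key; ok is false when the queue is empty.
func (q *Queue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key, q.order = q.order[0], q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *Queue) Len() int { return len(q.order) }

func main() {
	q := NewQueue()
	for i := 0; i < 5; i++ {
		q.Add("default/pg") // five watch events for the same object
	}
	q.Add("default/web")

	reconciles := 0
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		reconciles++
		fmt.Println("reconcile", key)
	}
	fmt.Println("events: 6, reconciles:", reconciles)
}
```

Dropping the duplicate events is safe only because the reconciler re-reads current state for the key rather than trusting the contents of any individual event.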

Worked architecture: Postgres on Kubernetes


Stateful workloads are the hard case for orchestration — identity and storage must be stable, and failover is domain-specific. This is exactly what a database Operator (e.g. CloudNativePG, Zalando’s postgres-operator) exists for. You declare a small custom resource…

kind: Cluster
spec:
  instances: 3                            # 1 primary + 2 replicas
  storage: { size: 100Gi }
  backup: { target: s3://…, wal: true }   # continuous WAL archiving → PITR

…and the operator reconciles it into the full topology below — and keeps reconciling it, which is what makes failover and rolling upgrades automatic rather than a 3am runbook.
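The "reconciles it into the full topology" step can be pictured as two pure functions: expand the small spec into the set of objects that should exist, then diff that set against what does exist. The sketch below assumes a made-up naming scheme (`pg-0`, `pg-rw`, and so on) and treats objects as plain string keys; real operators build typed API objects and are far more cautious about deleting storage.

```go
package main

import (
	"fmt"
	"sort"
)

// desired expands a cluster name and instance count into the set of objects
// the operator should own. The naming scheme is invented for this sketch.
func desired(cluster string, instances int) map[string]bool {
	want := map[string]bool{
		"Service/" + cluster + "-rw": true, // points at the primary
		"Service/" + cluster + "-ro": true, // spread across replicas
	}
	for i := 0; i < instances; i++ {
		want[fmt.Sprintf("Pod/%s-%d", cluster, i)] = true
		want[fmt.Sprintf("PVC/%s-%d", cluster, i)] = true
	}
	return want
}

// diff returns what to create (wanted but missing) and what to remove
// (present but unwanted), each sorted for stable output.
func diff(want, have map[string]bool) (create, remove []string) {
	for k := range want {
		if !have[k] {
			create = append(create, k)
		}
	}
	for k := range have {
		if !want[k] {
			remove = append(remove, k)
		}
	}
	sort.Strings(create)
	sort.Strings(remove)
	return create, remove
}

func main() {
	want := desired("pg", 3)
	have := map[string]bool{
		"Pod/pg-0": true, "PVC/pg-0": true, "Service/pg-rw": true,
	}
	create, remove := diff(want, have)
	fmt.Println("create:", create)
	fmt.Println("remove:", remove)
}
```

Running the same diff on every pass is what "keeps reconciling" means in practice: a deleted Pod simply shows up in `create` on the next pass.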

Postgres cluster managed by a Kubernetes operator

What the operator owns, mapped to the DBaaS drills:

  • Topology: a primary + replica Pods (each with its own PVC), streaming-replicated. Stable identity comes either from a StatefulSet (Zalando’s operator) or from Pods and PVCs the operator manages directly (CloudNativePG).
  • Services — a read-write Service pointing at the primary, a read-only Service spread across replicas. Clients use the stable names; Pods churn underneath.
  • Automated failover — if the primary’s health checks fail, the controller promotes a replica and repoints the rw Service. (Leader election, split-brain avoidance, RTO/RPO — the HA drill.)
  • Backup + PITR — base backups plus continuous WAL archiving to object storage, enabling point-in-time recovery.
  • Near-zero-downtime upgrades: the operator does rolling minor-version changes (upgrade replicas, switch over, upgrade the old primary), so the switchover is the only brief write interruption.
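The failover bullet can be sketched as one more reconcile function. This is a deliberately naive illustration under stated assumptions: the `Instance` and `Cluster` types are hypothetical, the promotion rule (healthy replica with the highest replayed WAL position) is one plausible choice rather than any particular operator's algorithm, and fencing the old primary, which real split-brain avoidance depends on, is omitted entirely.

```go
package main

import "fmt"

// Instance is a hypothetical view of one Postgres Pod as the operator sees it.
type Instance struct {
	Name      string
	Healthy   bool
	ReplayLSN uint64 // WAL position replayed so far; higher means more caught up
}

// Cluster is the toy state this sketch reconciles.
type Cluster struct {
	Primary  Instance
	Replicas []Instance
	RWTarget string // Pod name the rw Service currently points at
}

// pickPromotion returns the healthy replica with the highest replay position.
func pickPromotion(replicas []Instance) (Instance, bool) {
	var best Instance
	found := false
	for _, r := range replicas {
		if !r.Healthy {
			continue
		}
		if !found || r.ReplayLSN > best.ReplayLSN {
			best, found = r, true
		}
	}
	return best, found
}

// Reconcile keeps the rw Service on a healthy primary, promoting a replica
// when the current primary is unhealthy. It reports whether a writable
// primary exists afterwards.
func (c *Cluster) Reconcile() bool {
	if !c.Primary.Healthy {
		next, ok := pickPromotion(c.Replicas)
		if !ok {
			return false // nothing safe to promote; leave state untouched
		}
		remaining := make([]Instance, 0, len(c.Replicas))
		for _, r := range c.Replicas {
			if r.Name != next.Name {
				remaining = append(remaining, r)
			}
		}
		c.Primary, c.Replicas = next, remaining
	}
	c.RWTarget = c.Primary.Name // repoint (or confirm) the rw Service
	return true
}

func main() {
	c := &Cluster{
		Primary:  Instance{Name: "pg-0", Healthy: false},
		Replicas: []Instance{{"pg-1", true, 100}, {"pg-2", true, 250}},
		RWTarget: "pg-0",
	}
	fmt.Println("writable:", c.Reconcile(), "rw ->", c.RWTarget)
}
```

Preferring the most caught-up replica is how this sketch minimizes lost writes (RPO). How quickly the health check trips and the Service repoints determines RTO.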

This is a managed-database control plane in miniature — the same control-plane/data-plane split as RDS, just built on K8s primitives.

Mermaid source
flowchart LR
classDef cp fill:#eef0fe,stroke:#6366f1,stroke-width:1.5px,color:#0f172a;
classDef node fill:#eef2f8,stroke:#94a3b8,stroke-width:1.5px,color:#0f172a;
classDef store fill:#e7f5ec,stroke:#3f9c5a,stroke-width:1.5px,color:#0f172a;
classDef op fill:#fef6e7,stroke:#d9a441,stroke-width:1.5px,color:#0f172a;
CR["Cluster CR<br/>kind: Cluster<br/>instances: 3"]:::op
OP("Postgres operator<br/>custom controller<br/>reconcile loop"):::cp
subgraph WL["Reconciled workload"]
PRIM(["Primary Pod"]):::node
R1(["Replica Pod"]):::node
R2(["Replica Pod"]):::node
PV1[("PVC")]:::store
PV2[("PVC")]:::store
PV3[("PVC")]:::store
end
RW{{"rw Service<br/>→ primary"}}:::cp
RO{{"ro Service<br/>→ replicas"}}:::cp
BAK[("Object storage<br/>base backup + WAL · PITR")]:::store
CR --> OP
OP -->|create / heal| PRIM
OP -->|create / heal| R1
OP -->|create / heal| R2
PRIM --- PV1
R1 --- PV2
R2 --- PV3
PRIM -->|streaming replication| R1
PRIM -->|streaming replication| R2
RW --> PRIM
RO --> R1
RO --> R2
OP -->|archive WAL| BAK
OP -.->|promote on failure| R1

Is Kubernetes the right base for a cloud product?


Worth knowing because it’s a live debate — and the operator example above is exactly where it bites. A managed cloud product is itself a control plane, so building it on Kubernetes means stacking control planes: your operator reconciles your CRD into K8s objects, K8s reconciles those into containers on nodes, and the cloud’s own control plane schedules those nodes onto hardware. Critics call this wrapping a wrapper — each layer is a general-purpose, leaky abstraction you must operate, debug, and pay for, even though you run one known workload that doesn’t need K8s’s generality (arbitrary scheduling, the CNI/CSI plugin matrix, the cluster-upgrade treadmill). Their pitch: orchestrate VMs (or microVMs) directly, with automation built for your single workload — fewer layers, tighter control of networking/storage/placement, and failure modes that belong to your product rather than to Kubernetes.

The counter is plain build-vs-buy. K8s hands you a hardened reconciliation engine, self-healing, rollouts, and a whole ecosystem (operators, CSI storage, service mesh) you’d otherwise reinvent — plus multi-cloud portability. For a team that can’t afford to build and run its own control plane, that’s a decade of solved problems for free, and the operator pattern maps cleanly onto “encode our ops knowledge.” The trade is generality-overhead vs. build-it-yourself cost, and it splits real companies:

| Built on Kubernetes | Deliberately not K8s (VMs / own orchestrator) |
| --- | --- |
| Confluent Cloud (Kafka) | AWS: its own services run on EC2 + decades of homegrown automation |
| ClickHouse Cloud (pods + object storage) | MongoDB Atlas: VMs + a homegrown automation agent across clouds |
| …and many newer data/infra clouds | Snowflake: VM-based compute clusters, own architecture |
| | Fly.io: Firecracker microVMs, own orchestrator (famously, loudly anti-K8s) |
| | Railway: built their own orchestrator |
| | Oxide: own stack down to the hardware |

Two names sit on the seam — they support K8s without their hosted product being built on it. Supabase ships official K8s Helm charts for self-hosting, but its managed platform leans on dedicated per-project instances. Temporal ships a Helm chart and is commonly self-hosted on K8s (and your Workers can run there) — yet Temporal Cloud itself is documented as a cell-based architecture on AWS/GCP, not a Kubernetes product. That distinction — runs on K8s vs. can be run on K8s — is exactly why neither is in the left column.

Rule of thumb: if orchestration is your product’s hard part and you have the engineers, rolling your own wins control (Fly, Railway, Oxide); if orchestration is incidental and you want to ship a managed service fast, K8s buys you the most. It’s the same build-vs-bolt-on call as the Postgres extensions trade-off, one layer down.


These are from-zero working notes — a map and a mental model, not a substitute for the Kubernetes docs or the kubebuilder book. The reconciliation/control-loop pattern is the one thing worth internalizing; the rest follows from it.