Introduction to CockroachDB Part 2 - Architecture Under the Hood

CockroachDB, Distributed SQL Database, Scalability

I. Overview

In Part 1, we covered what CockroachDB is and why it was built. Now it's time to look under the hood — how does CockroachDB actually achieve distributed consistency, resilience, and scalability at the same time?

Understanding the internals will help you make better design decisions, debug production issues, and appreciate why CockroachDB behaves the way it does.

II. The Node — CockroachDB's Basic Building Block

A CockroachDB cluster is made up of multiple nodes — each node is simply a running CockroachDB process, typically on its own machine or VM.

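What makes CockroachDB resilient is that every node is equal: there is no single master, and any node can accept reads or writes from clients. When a node goes down, the cluster keeps operating as long as a majority of the replicas of each piece of data are still alive. With the default of 3 replicas, that means the cluster tolerates the loss of one node.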

💡 Key Takeaway: No single point of failure. Nodes fail; clusters survive.

III. Range-Based Data Distribution

CockroachDB stores all data in a single, monolithic sorted key-value map — but it splits this map into chunks called Ranges.

How Ranges work:

  • Each Range holds up to 512 MiB of data by default in current versions (older releases used a 64 MiB default)
  • Each Range is replicated across 3 nodes by default (replication factor = 3)
  • Ranges are defined by key boundaries: e.g., [/Table/users/1, /Table/users/1000)
  • As data grows, CockroachDB automatically splits Ranges and rebalances them across nodes

This means:

  • Your data is always spread across the cluster automatically
  • No manual sharding required
  • Adding a new node triggers automatic rebalancing
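
To make the automatic splitting concrete, here is a minimal sketch. It is not CockroachDB's actual logic: real splits are driven by byte size (and load), while this toy version uses a hypothetical `max_keys` threshold and simply cuts a sorted key list at its midpoint until every piece fits.

```python
def split_if_needed(keys: list[int], max_keys: int) -> list[list[int]]:
    """Split a sorted key list until each piece holds at most max_keys keys.

    max_keys is a stand-in for the real size threshold (an assumption
    made for this sketch; CockroachDB measures bytes, not key counts).
    """
    if max_keys < 1:
        raise ValueError("max_keys must be at least 1")
    if len(keys) <= max_keys:
        return [keys]
    mid = len(keys) // 2
    # Each half becomes its own Range and may need to split again.
    return split_if_needed(keys[:mid], max_keys) + split_if_needed(keys[mid:], max_keys)


ranges = split_if_needed(list(range(1, 11)), max_keys=4)
for r in ranges:
    # Print each resulting Range as a half-open key span [start, end).
    print(f"[{r[0]}, {r[-1] + 1})")
```

Each resulting piece is a contiguous, non-overlapping key span, which is what lets a single lookup decide where any key lives.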

Example:

Imagine a users table with 1 million rows. CockroachDB splits it into multiple Ranges:

  • Range 1: rows 1 → 15,000 → stored on Nodes 1, 2, 3
  • Range 2: rows 15,001 → 30,000 → stored on Nodes 2, 3, 4
  • Range 3: rows 30,001 → 45,000 → stored on Nodes 1, 3, 4

Each row lives in exactly one Range, but that Range exists on 3 nodes simultaneously.
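
The lookup implied by the example above can be sketched with a binary search over the Range start keys. The `RANGES` table and `find_range` helper are hypothetical names for illustration, and the row boundaries are the ones from the example, not real cluster metadata.

```python
import bisect

# Hypothetical range table: (first row in the Range, nodes holding a replica).
# Each Range covers [its start, the next Range's start).
RANGES = [
    (1, (1, 2, 3)),
    (15_001, (2, 3, 4)),
    (30_001, (1, 3, 4)),
]
START_KEYS = [start for start, _ in RANGES]


def find_range(row_id: int) -> tuple[int, tuple[int, ...]]:
    """Return (range_number, replica_nodes) for the Range that owns row_id."""
    if row_id < START_KEYS[0]:
        raise KeyError(f"row {row_id} is below the first Range boundary")
    # bisect_right finds the first start key greater than row_id;
    # the owning Range is the one just before it.
    idx = bisect.bisect_right(START_KEYS, row_id) - 1
    return idx + 1, RANGES[idx][1]


print(find_range(15_000))  # (1, (1, 2, 3))
print(find_range(15_001))  # (2, (2, 3, 4))
```

Because the boundaries are sorted, the lookup is logarithmic in the number of Ranges, and a row can never match more than one Range.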

IV. The Raft Consensus Algorithm

With 3 copies of each Range, how does CockroachDB decide which copy is "correct"? This is where Raft comes in.

Raft is a consensus algorithm that ensures all replicas of a Range agree on the same state, even in the presence of failures.

How Raft works in CockroachDB:

  1. Leader Election: Among the 3 replicas of a Range, one is elected as the Raft Leader
  2. Log Replication: All writes go through the Leader. The Leader writes to its log and sends the entry to Followers
  3. Quorum Commit: A write is only considered committed when a majority (2 out of 3) acknowledge it
  4. Follower Apply: Once committed, all replicas apply the change
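
The quorum rule is easy to see in a toy model. The sketch below is a heavy simplification, assuming synchronous acknowledgments and no terms, elections, or retries; `Replica` and `replicate` are hypothetical names, not part of any Raft library.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    alive: bool = True
    log: list[str] = field(default_factory=list)

    def append(self, entry: str) -> bool:
        """Append the entry to the local log; the return value is the ack."""
        if not self.alive:
            return False
        self.log.append(entry)
        return True


def replicate(leader: Replica, followers: list[Replica], entry: str) -> bool:
    """Return True if the entry was acknowledged by a majority of replicas."""
    if not leader.append(entry):
        return False  # a dead leader cannot propose anything
    acks = 1 + sum(1 for f in followers if f.append(entry))
    quorum = (1 + len(followers)) // 2 + 1  # 2 of 3, 3 of 5, ...
    return acks >= quorum


leader = Replica("n1")
followers = [Replica("n2"), Replica("n3", alive=False)]
print(replicate(leader, followers, "put k=v"))  # True: 2 of 3 acknowledged
```

With both Followers down the same call returns False, which is the toy equivalent of a write that cannot commit.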

What happens when a node fails?

  • If a Follower fails → the Leader and remaining Follower still form a majority → cluster continues uninterrupted
  • If the Leader fails → the remaining Followers elect a new Leader within seconds → cluster recovers automatically

💡 Key Takeaway: Raft ensures every write is durable on a majority of replicas before it's acknowledged to the client. Committed data survives as long as a majority of a Range's replicas survive.

V. The Leaseholder

Raft handles write consensus, but what about reads? Reading from any replica could return stale data. Enter the Leaseholder.

Each Range has exactly one Leaseholder at any given time. The Leaseholder is the node that:

  • Serves reads for that Range directly, without a Raft round trip, because it is guaranteed to have the latest committed data
  • Coordinates all writes for that Range (via Raft)

The lease is time-bounded and must be renewed. If the Leaseholder node dies, its lease lapses and another replica acquires the lease automatically.

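The Leaseholder is usually the same node as the Raft Leader. CockroachDB tries to keep the two roles together, because a Leaseholder that is also the Leader can propose writes without an extra network hop. They can differ briefly, for example right after a lease transfer or a leader election.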
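
A time-bounded lease can be sketched in a few lines. This is an illustrative model only: the `RangeLease` class, its fields, and the explicit `now` argument are assumptions made for the example, and CockroachDB's real lease mechanism is considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RangeLease:
    holder: Optional[str] = None
    expires_at: float = 0.0

    def is_valid(self, now: float) -> bool:
        """A lease is valid only while someone holds it and it has not expired."""
        return self.holder is not None and now < self.expires_at

    def acquire(self, node: str, now: float, duration: float) -> bool:
        """Grant or renew the lease; refuse if another node holds a valid lease."""
        if self.is_valid(now) and self.holder != node:
            return False
        self.holder = node
        self.expires_at = now + duration
        return True


lease = RangeLease()
print(lease.acquire("n1", now=0.0, duration=9.0))   # True: lease was free
print(lease.acquire("n2", now=5.0, duration=9.0))   # False: n1 still holds it
print(lease.acquire("n1", now=5.0, duration=9.0))   # True: renewal by the holder
print(lease.acquire("n2", now=20.0, duration=9.0))  # True: n1's lease expired
```

Passing `now` in explicitly keeps the sketch deterministic; a real system would read it from the node's clock.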

VI. Hybrid Logical Clock (HLC)

In a distributed system, clocks on different machines are never perfectly synchronized. This creates a problem: how do you determine the correct order of events across nodes?

CockroachDB solves this with Hybrid Logical Clocks (HLC) — a combination of:

  • Physical time (wall clock) — for approximate ordering
  • Logical time (monotonically increasing counter) — for tie-breaking

Every event in CockroachDB is stamped with an HLC timestamp. When a node receives a message carrying a higher timestamp than its own, it advances its clock to match. This preserves the order of causally related events without requiring perfectly synchronized clocks, although CockroachDB still expects node clocks to stay within a configured maximum offset of each other.
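
Here is a minimal sketch of the general hybrid logical clock algorithm, to show how the physical and logical parts interact. It is not CockroachDB's implementation: the `Timestamp` and `HLC` classes are hypothetical, and the physical clock is injected as a plain callable so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int     # physical component (e.g. nanoseconds)
    logical: int  # counter that breaks ties within one wall value


class HLC:
    def __init__(self, physical_clock: Callable[[], int]):
        self._physical_clock = physical_clock
        self._last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        """Timestamp a local event; never goes backwards."""
        wall = self._physical_clock()
        if wall > self._last.wall:
            self._last = Timestamp(wall, 0)
        else:
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp carried by a received message."""
        wall = self._physical_clock()
        max_wall = max(wall, self._last.wall, remote.wall)
        if max_wall == self._last.wall and max_wall == remote.wall:
            logical = max(self._last.logical, remote.logical) + 1
        elif max_wall == self._last.wall:
            logical = self._last.logical + 1
        elif max_wall == remote.wall:
            logical = remote.logical + 1
        else:
            logical = 0  # the local physical clock is strictly ahead
        self._last = Timestamp(max_wall, logical)
        return self._last


node_a = HLC(lambda: 100)  # node A's wall clock reads 100
node_b = HLC(lambda: 90)   # node B's wall clock is 10 units behind

sent = node_a.now()            # Timestamp(wall=100, logical=0)
received = node_b.update(sent)  # Timestamp(wall=100, logical=1)
print(sent < received)          # True: the receive is ordered after the send
```

Even though node B's wall clock is behind, the receive event still gets a later timestamp than the send, which is the property the section relies on.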

💡 Key Takeaway: HLC timestamps are a key ingredient in how CockroachDB provides serializable isolation across a distributed cluster. They give transactions a consistent order even across geographically distant nodes.

VII. Putting It All Together

Here's how a write request flows through CockroachDB:

  1. Client sends a write to any node (the Gateway node)
  2. Gateway looks up which Range owns the key and routes to the Leaseholder
  3. Leaseholder proposes the write to Raft
  4. Raft Leader replicates the write to Followers; waits for quorum acknowledgment
  5. Write is committed and acknowledged to the client

For reads:

  1. Client sends read to any node
  2. Gateway routes to the Leaseholder for that Range
  3. Leaseholder serves the read directly (guaranteed consistent)
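
The two flows above can be traced with a small simulation. Everything here is hypothetical: the key boundaries, node names, and the `trace` helper exist only to show where the read and write paths diverge.

```python
import bisect

# Hypothetical metadata: sorted Range start keys and each Range's Leaseholder.
RANGE_STARTS = ["a", "h", "p"]
LEASEHOLDERS = ["node2", "node1", "node3"]


def leaseholder_for(key: str) -> str:
    """Find the Leaseholder of the Range whose key span contains `key`."""
    idx = bisect.bisect_right(RANGE_STARTS, key) - 1
    if idx < 0:
        raise KeyError(f"no Range covers key {key!r}")
    return LEASEHOLDERS[idx]


def trace(op: str, key: str, gateway: str) -> list[str]:
    """Return the ordered steps a request takes through the cluster."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    leaseholder = leaseholder_for(key)
    steps = [f"client -> {gateway} (gateway)"]
    if leaseholder != gateway:
        # The gateway forwards the request when it is not the Leaseholder.
        steps.append(f"{gateway} -> {leaseholder} (leaseholder)")
    if op == "write":
        steps.append(f"{leaseholder} proposes write to Raft, waits for quorum")
    else:
        steps.append(f"{leaseholder} serves read locally")
    steps.append("response -> client")
    return steps


for step in trace("write", "kiwi", gateway="node3"):
    print(step)
```

The only difference between the two paths is the Raft step: a read stops at the Leaseholder, while a write must also reach a quorum of replicas.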

VIII. Conclusion

CockroachDB's architecture — Ranges, Raft consensus, Leaseholders, and HLC — works together to deliver a system that is simultaneously consistent, highly available, and horizontally scalable.

In Part 3, we'll get hands-on and set up a local CockroachDB cluster to see all of this in action.

Thank you for reading! 😊