Design and Architecture of CockroachDb

Linearizability

First a word about Spanner. By combining judicious use of wait intervals with accurate time signals, Spanner provides a global ordering between any two non-overlapping transactions (in absolute time) with ~14ms latencies. Put another way: Spanner guarantees that if a transaction T1 commits (in absolute time) before another transaction T2 starts, then T1's assigned commit timestamp is smaller than T2's. Using atomic clocks and GPS receivers, Spanner reduces their clock skew uncertainty to < 10ms (ε). To make good on the promised guarantee, transactions must take at least double the clock skew uncertainty interval to commit (2 * ε). See this article for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware, at the expense of longer wait times. If servers in the cluster were configured to work only with NTP, transaction wait times would likely to be in excess of 150ms. For wide-area zones, this would be somewhat mitigated by overlap from cross datacenter link latencies. If clocks were made more accurate, the minimal limit for commit latencies would improve.

However, let’s take a step back and evaluate whether Spanner’s external consistency guarantee is worth the automatic commit wait. First, if the commit wait is omitted completely, the system still yields a consistent view of the map at an arbitrary timestamp. However with clock skew, it would become possible for commit timestamps on non-overlapping but causally related transactions to suffer temporal reverse. In other words, the following scenario is possible for a client without global ordering:

    Start transaction T1 to modify value x with commit time s1
    On commit of T1, start T2 to modify value y with commit time s2
    Read x and y and discover that s1 > s2 (!)

The external consistency which Spanner guarantees is referred to as linearizability. It goes beyond serializability by preserving information about the causality inherent in how external processes interacted with the database. The strength of Spanner’s guarantee can be formulated as follows: any two processes, with clock skew within expected bounds, may independently record their wall times for the completion of transaction T1 (T end1) and start of transaction T2 (T start2) respectively, and if later compared such that T end1 < T start2, then commit timestamps s1 < s2. This guarantee is broad enough to completely cover all cases of explicit causality, in addition to covering any and all imaginable scenarios of implicit causality.

Our contention is that causality is chiefly important from the perspective of a single client or a chain of successive clients (if a tree falls in the forest and nobody hears…). As such, Cockroach provides two mechanisms to provide linearizability for the vast majority of use cases without a mandatory transaction commit wait or an elaborate system to minimize clock skew.

Our contention is that causality is chiefly important from the perspective of a single client or a chain of successive clients (if a tree falls in the forest and nobody hears…). As such, Cockroach provides two mechanisms to provide linearizability for the vast majority of use cases without a mandatory transaction commit wait or an elaborate system to minimize clock skew.

  1. Clients provide the highest transaction commit timestamp with successive transactions. This allows node clocks from previous transactions to effectively participate in the formulation of the commit timestamp for the current transaction. This guarantees linearizability for transactions committed by this client.

Newly launched clients wait at least 2 * ε from process start time before beginning their first transaction. This preserves the same property even on client restart, and the wait will be mitigated by process initialization.

All causally-related events within Cockroach maintain linearizability. Message queues, for example, guarantee that the receipt timestamp is greater than send timestamp, and that delivered messages may not be reaped until after the commit wait.

  1. Committed transactions respond with a commit wait parameter which represents the remaining time in the nominal commit wait. This will typically be less than the full commit wait as the consensus write at the coordinator accounts for a portion of it.

Clients taking any action outside of another Cockroach transaction (e.g. writing to another distributed system component) can either choose to wait the remaining interval before proceeding, or alternatively, pass the wait and/or commit timestamp to the execution of the outside action for its consideration. This pushes the burden of linearizability to clients, but is a useful tool in mitigating commit latencies if the clock skew is potentially large. This functionality can be used for ordering in the face of backchannel dependencies as mentioned in the AugmentedTime paper.

Using these mechanisms in place of commit wait, Cockroach’s guarantee can be formulated as follows: any process which signals the start of transaction T2 (T start2) after the completion of transaction T1 (T end1), will have commit timestamps such that s1 < s2.

Logically, the map contains a series of reserved system key / value pairs covering accounting, range metadata, node accounting and permissions before the actual key / value pairs for non-system data (e.g. the actual meat of the map).

Insert table here

There are some additional system entries sprinkled amongst the non-system keys. See the Key-Prefix Accounting section in this document for further details.