Design and Architecture of CockroachDb

Choosing a Timestamp

A key challenge of reading data in a distributed system with clock skew is choosing a timestamp guaranteed to be greater than the latest timestamp of any committed transaction (in absolute time). No system can claim consistency and fail to read already-committed data.

Accomplishing this for transactions (or just single operations) accessing a single node is easy. The transaction supplies 0 for timestamp, indicating that the node should use its current time (time for a node is kept using a hybrid clock which combines wall time and a logical time). This guarantees data already committed to that node have earlier timestamps.

For multiple nodes, the timestamp of the node coordinating the transaction “t” is used. In addition, a maximum timestamp “t+ε” is supplied to provide an upper bound on timestamps for already-committed data (ε is the maximum clock skew). As the transaction progresses, any data read which have timestamps greater than t but less than t+ε cause the transaction to abort and retry with the conflicting timestamp tc, where tc > t. The maximum timestamp t+ε remains the same. Time spent retrying because of reading recently committed data has an upper bound of ε.

We expect retries will be rare, but this assumption may need to be revisited if retries become problematic. Note that this problem does not apply to historical reads. An alternate approach which does not require retries would be to make a round to all node participants in advance and choose the highest reported node wall time as the timestamp. However, knowing which nodes will be accessed in advance is difficult and potentially limiting. Cockroach could also potentially use a global clock (Google did this with Pinax TODO: link to paper), which would be feasible for smaller, geographically-proximate clusters.