Design and Architecture of CockroachDB

Conflict Resolution

Things get more interesting when a reader or writer encounters an intent record or newly-committed value in a location that it needs to read or write. This is a conflict, and it usually causes one of the transactions to abort or restart, depending on the type of conflict.

Transaction restart: This is the usual (and more efficient) type of behaviour and is used except when the transaction was aborted (for instance by another transaction). In effect, that reduces to two cases. The first is the one outlined above: an SSI transaction that finds (upon attempting to commit) that its commit timestamp has been pushed. In the second case, a transaction actively encounters a conflict, that is, one of its readers or writers encounters data that necessitates conflict resolution.

When a transaction restarts, it changes its priority and/or moves its timestamp forward depending on data tied to the conflict, and the transaction begins anew, updating its intents. Since the set of keys being written may change between restarts, the client maintains a set of keys written during prior attempts at the transaction. As it restarts the transaction from the beginning, it removes keys from this set as it writes them again. The remaining keys, should the transaction run to completion, are stale write intents which must be deleted before the transaction commit record’s status is set to COMMITTED. Many transactions will have no keys in this set.
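The bookkeeping described above can be sketched in Go as follows. This is a minimal illustration, not the actual client code: the type intentTracker and its methods RecordWrite, Restart, and StaleKeys are hypothetical names introduced here, and keys are modelled as plain strings.

```go
package main

import "sort"

// intentTracker is a hypothetical client-side helper (not a CockroachDB API).
// It remembers which keys were written in earlier attempts of a transaction
// and have not been written again in the current attempt.
type intentTracker struct {
	prior   map[string]struct{} // written in earlier attempts, not yet rewritten
	current map[string]struct{} // written in the current attempt
}

func newIntentTracker() *intentTracker {
	return &intentTracker{
		prior:   make(map[string]struct{}),
		current: make(map[string]struct{}),
	}
}

// RecordWrite notes a write in the current attempt. A key that is written
// again is removed from the set of prior-attempt keys.
func (tr *intentTracker) RecordWrite(key string) {
	tr.current[key] = struct{}{}
	delete(tr.prior, key)
}

// Restart begins a new attempt: every key written so far becomes a
// prior-attempt key until the new attempt writes it again.
func (tr *intentTracker) Restart() {
	for k := range tr.current {
		tr.prior[k] = struct{}{}
	}
	tr.current = make(map[string]struct{})
}

// StaleKeys returns, in sorted order, the keys whose intents are left over
// from earlier attempts and must be deleted before the commit record is
// set to COMMITTED.
func (tr *intentTracker) StaleKeys() []string {
	keys := make([]string, 0, len(tr.prior))
	for k := range tr.prior {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

For example, if the first attempt writes keys "a" and "b" and, after a restart, the second attempt writes "a" and "c", then StaleKeys reports only "b". A transaction that writes the same keys on every attempt ends with an empty set.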

Transaction abort: This is the case in which a transaction, upon reading its transaction table entry, finds that it has been aborted. In this case, the transaction cannot reuse its intents; it returns control to the client before cleaning them up (other readers and writers would clean up dangling intents as they encounter them) but will make an effort to clean up after itself. The next attempt (if applicable) then runs as a new transaction.

There are several scenarios in which transactions interact:

  • Reader encounters write intent with newer timestamp far enough in the future: This is not a conflict. The reader is free to proceed; after all, it will be reading an older version of the value and so does not conflict. Recall that the write intent may be committed with a later timestamp than its candidate; it will never commit with an earlier one. Side note: if the reader finds an intent with a newer timestamp which the reader’s own transaction has written, the reader always returns that value.

  • Reader encounters write intent or value with newer timestamp in the near future: In this case, we have to be careful. The newer intent may, in absolute terms, have happened in our read's past if the clock of the writer is ahead of the clock of the node serving the value. In that case, we would need to take this value into account, but we just don't know. Hence the transaction restarts, using the future timestamp instead (but remembering a maximum timestamp used to limit the uncertainty window to the maximum clock skew). See the details under "choosing a time stamp" below.

  • Reader encounters write intent with older timestamp: the reader must follow the intent’s transaction ID to the transaction table. If the transaction has already been committed, then the reader can just read the value. If the write transaction has not yet been committed, then the reader has two options. If the write conflict is from an SI transaction, the reader can push that transaction's commit timestamp into the future. This is simple to do: the reader just updates the transaction’s commit timestamp to indicate that when/if the transaction does commit, it should use a timestamp at least as high. However, if the write conflict is from an SSI transaction, the reader must compare priorities. If the reader has the higher priority, it pushes the transaction’s commit timestamp, as with SI (that transaction will then notice its timestamp has been pushed, and restart). If the reader has the lower priority, it retries itself, using as its new priority max(new random priority, conflicting txn’s priority - 1).

  • Writer encounters uncommitted write intent with lower priority: writer aborts the conflicting transaction.

  • Writer encounters uncommitted write intent with higher priority: the transaction retries, using as a new priority max(new random priority, conflicting txn’s priority - 1); the retry occurs after a short, randomized backoff interval.
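The scenarios above can be summarized as a small decision function. The Go sketch below is illustrative only and rests on several assumptions not stated in the text: all type and function names (Txn, Action, ResolveRead, ResolveWrite, RetryPriority) are hypothetical; timestamps are plain integers; "far enough in the future" is taken to mean beyond the reader's maximum timestamp; an intent whose timestamp equals the reader's is treated as older; and equal priorities are treated like the lower-priority case. The side note about a reader finding its own intent is left out, and ResolveWrite assumes the conflicting intent is uncommitted.

```go
package main

// Isolation is the isolation level of a transaction.
type Isolation int

const (
	SI Isolation = iota
	SSI
)

// Action is the outcome of resolving a single conflict.
type Action int

const (
	Proceed            Action = iota // not a conflict: read the older version
	RestartAtTimestamp               // ordering uncertain: restart at the newer timestamp
	ReadCommitted                    // writer already committed: read its value
	PushWriter                       // push the writer's commit timestamp forward
	RetrySelf                        // retry with an adjusted priority
	AbortOther                       // abort the conflicting transaction
)

// Txn holds only the fields this sketch needs (hypothetical layout).
type Txn struct {
	Timestamp    int64 // candidate timestamp
	MaxTimestamp int64 // upper bound of the uncertainty window
	Priority     int32
	Isolation    Isolation
	Committed    bool
}

// ResolveRead decides what a reader does when it meets an intent written
// by writer at timestamp intentTS.
func ResolveRead(reader, writer Txn, intentTS int64) Action {
	if intentTS > reader.Timestamp {
		if intentTS > reader.MaxTimestamp {
			return Proceed // far enough in the future (assumed: beyond MaxTimestamp)
		}
		return RestartAtTimestamp // near future: inside the uncertainty window
	}
	// Older intent: consult the writer's transaction table entry.
	if writer.Committed {
		return ReadCommitted
	}
	if writer.Isolation == SI {
		return PushWriter
	}
	// SSI writer: compare priorities (ties assumed to favour the writer).
	if reader.Priority > writer.Priority {
		return PushWriter
	}
	return RetrySelf
}

// ResolveWrite decides what a writer does when it meets an uncommitted
// intent held by holder (ties assumed to favour the holder).
func ResolveWrite(writer, holder Txn) Action {
	if holder.Priority < writer.Priority {
		return AbortOther
	}
	return RetrySelf // the retry follows a short, randomized backoff
}

// RetryPriority computes max(new random priority, conflicting priority - 1).
func RetryPriority(randomPriority, conflictingPriority int32) int32 {
	if p := conflictingPriority - 1; p > randomPriority {
		return p
	}
	return randomPriority
}
```

As a usage example, a reader at timestamp 100 with maximum timestamp 110 proceeds past an intent at 200, restarts on an intent at 105, and pushes an uncommitted SI writer whose intent is at 90. When a retry is required against a transaction with priority 9 and the new random priority is 2, RetryPriority returns 8.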