Design and Architecture of CockroachDb

Transaction Management

Transactions are managed by the client proxy (or gateway in SQL Azure parlance). Unlike in Spanner, writes are not buffered but are sent directly to all implicated ranges. This allows the transaction to abort quickly if it encounters a write conflict. The client proxy keeps track of all written keys in order to cleanup write intents upon transaction completion.

If a transaction is completed successfully, all intents are upgraded to committed. In the event a transaction is aborted, all written intents are deleted. The client proxy doesn’t guarantee it will cleanup intents; but dangling intents are upgraded or deleted when encountered by future readers and writers and the system does not depend on their timely cleanup for correctness.

In the event the client proxy restarts before the pending transaction is completed, the dangling transaction would continue to live in the transaction table until aborted by another transaction. Transactions heartbeat the transaction table every five seconds by default. Transactions encountered by readers or writers with dangling intents which haven’t been heartbeat within the required interval are aborted.

An exploration of retries with contention and abort times with abandoned transaction is here.

**Transaction Table**

    enum Isolation {
      SNAPSHOT = 1;
      SERIALIZABLE = 2;
    }
    enum Status {
      PENDING = 1;
      COMMITTED = 2;
      ABORTED = 3;
    }
    uint64 tx_id
    uint64 candidate_timestamp
    uint64 heartbeat_timestamp
    uint32 priority
    Isolation isolation
    Status status

Pros

  1. No requirement for reliable code execution to prevent stalled 2PC protocol.
  2. Readers never block with SI semantics; with SSI semantics, they may abort.
  3. Lower latency than traditional 2PC commit protocol (w/o contention) because second phase requires only a single write to the transaction table instead of a synchronous round to all transaction participants.
  4. Priorities avoid starvation for arbitrarily long transactions and always pick a winner from between contending transactions (no mutual aborts).
  5. Writes not buffered at client; writes fail fast.
  6. No read-locking overhead required for serializable SI (in contrast to other SSI implementations).
  7. Well-chosen (i.e. less random) priorities can flexibly give probabilistic guarantees on latency for arbitrary transactions (for example: make OLTP transactions 10x less likely to abort than low priority transactions, such as asynchronously scheduled jobs).

Cons

  1. Reads from non-leader replicas still require a ping to the leader to update latest-read-cache.
  2. Abandoned transactions may block contending writers for up to the heartbeat interval, though average wait is likely to be considerably shorter (see graph in link). This is likely considerably more performant than detecting and restarting 2PC in order to release read and write locks.
  3. Behavior different than other SI implementations: no first writer wins, and shorter transactions do not always finish quickly. Element of surprise for OLTP systems may be a problematic factor.
  4. Aborts can decrease throughput in a contended system compared with two phase locking. Aborts and retries increase read and write traffic, increase latency and decrease throughput.