Design and Architecture of CockroachDb

Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring every RoachNode to know about the status of all or even a large number of peer nodes --or-- alternatively requiring a specialized curator or master with sufficiently global knowledge, we use a gossip protocol to efficiently communicate only interesting information between all of the nodes in the cluster. What’s interesting information? One example would be whether a particular node has a lot of spare capacity. Each node, when gossiping, compares each topic of gossip to its own state. If its own state is somehow “more interesting” than the least interesting item in the topic it’s seen recently, it includes its own state as part of the next gossip session with a peer node. In this way, a node with capacity sufficiently in excess of the mean quickly becomes discovered by the entire cluster. To avoid piling onto outliers, nodes from the high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

Peer Selection: each node maintains up to N peers with which it regularly communicates. It selects peers with an eye towards maximizing fanout. A peer node which itself communicates with an array of otherwise unknown nodes will be selected over one which communicates with a set containing significant overlap. Each time gossip is initiated, each nodes’ set of peers is exchanged. Each node is then free to incorporate the other’s peers as it sees fit. To avoid any node suffering from excess incoming requests, a node may refuse to answer a gossip exchange. Each node is biased towards answering requests from nodes without significant overlap and refusing requests otherwise.

Peers are efficiently selected using a heuristic as described in Agarwal & Trachtenberg (2006).

TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works.

Gossip Selection: what to communicate. Gossip is divided into topics. Load characteristics (capacity per disk, cpu load, and state [e.g. draining, ok, failure]) are used to drive node allocation. Range statistics (range read/write load, missing replicas, unavailable ranges) and network topology (inter-rack bandwidth/latency, inter-datacenter bandwidth/latency, subnet outages) are used for determining when to split ranges, when to recover replicas vs. wait for network connectivity, and for debugging / sysops. In all cases, a set of minimums and a set of maximums is propagated; each node applies its own view of the world to augment the values. Each minimum and maximum value is tagged with the reporting node and other accompanying contextual information. Each topic of gossip has its own protobuf to hold the structured data. The number of items of gossip in each topic is limited by a configurable bound.

For efficiency, nodes assign each new item of gossip a sequence number and keep track of the highest sequence number each peer node has seen. Each round of gossip communicates only the delta containing new items.

Node Accounting The gossip protocol discussed in the previous section is useful to quickly communicate fragments of important information in a decentralized manner. However, complete accounting for each node is also stored to a central location, available to any dashboard process. This is done using the map itself. Each node periodically writes its state to the map with keys prefixed by \0node, similar to the first level of range metadata, but with an ‘node’ suffix. Each value is a protobuf containing the full complement of node statistics--everything communicated normally via the gossip protocol plus other useful, but non-critical data.

The range containing the first key in the node accounting table is responsible for gossiping the total count of nodes. This total count is used by the gossip network to most efficiently organize itself. In particular, the maximum number of hops for gossipped information to take before reaching a node is given by ceil(log(node count) / log(max fanout)) + 1.