New nodes must be allocated when a range is split. Instead of requiring every RoachNode to know about the status of all or even a large number of peer nodes --or-- alternatively requiring a specialized curator or master with sufficiently global knowledge, we use a gossip protocol to efficiently communicate only interesting information between all of the nodes in the cluster. What’s interesting information? One example would be whether a particular node has a lot of spare capacity. Each node, when gossiping, compares each topic of gossip to its own state. If its own state is somehow “more interesting” than the least interesting item in the topic it’s seen recently, it includes its own state as part of the next gossip session with a peer node. In this way, a node with capacity sufficiently in excess of the mean quickly becomes discovered by the entire cluster. To avoid piling onto outliers, nodes from the high capacity set are selected at random for allocation.
The gossip protocol itself contains two primary components:
Peers are efficiently selected using a heuristic as described in Agarwal & Trachtenberg (2006).
TBD: how to avoid partitions? Need to work out a simulation of the protocol to tune the behavior and see empirically how well it works.
For efficiency, nodes assign each new item of gossip a sequence number and keep track of the highest sequence number each peer node has seen. Each round of gossip communicates only the delta containing new items.
Node Accounting The gossip protocol discussed in the previous section is useful to quickly communicate fragments of important information in a decentralized manner. However, complete accounting for each node is also stored to a central location, available to any dashboard process. This is done using the map itself. Each node periodically writes its state to the map with keys prefixed by \0node, similar to the first level of range metadata, but with an ‘node’ suffix. Each value is a protobuf containing the full complement of node statistics--everything communicated normally via the gossip protocol plus other useful, but non-critical data.
The range containing the first key in the node accounting table is responsible for gossiping the total count of nodes. This total count is used by the gossip network to most efficiently organize itself. In particular, the maximum number of hops for gossipped information to take before reaching a node is given by ceil(log(node count) / log(max fanout)) + 1.