Design and Architecture of CockroachDb

Range-Spanning Binary Tree

A crucial enhancement to the organization of range metadata is to augment the bi-level range metadata lookup with a minimum spanning tree, implemented as a left-leaning red-black tree over all ranges in the map. This tree structure allows the system to start at any key prefix and efficiently traverse an arbitrary key range with minimal RPC traffic, minimal fan-in and fan-out, and with bounded time complexity equal to 2 * log2N steps, where N is the total number of ranges in the system.

Unlike the range metadata rows prefixed with \0\0meta[12], the metadata for the range-spanning tree (e.g. parent range and left / right child ranges) is stored directly at the ranges as non-map metadata. The metadata for each node of the tree (e.g. links to parent range, left child range, and right child range) is stored with the range metadata. In effect, the tree metadata is stored implicitly. In order to traverse the tree, for example, you’d need to query each range in turn for its metadata.

Any time a range is split or merged, both the bi-level range lookup metadata and the per-range binary tree metadata are updated as part of the same distributed transaction. The total number of nodes involved in the update is bounded by 2 + log2N (i.e. 2 updates for meta1 and meta2, and up to log2N updates to balance the range-spanning tree). The range corresponding to the root node of the tree is stored in \0tree_root. As an example, consider the following set of nine ranges and their associated range-spanning tree:

Insert image here...

The range-spanning tree has many beneficial uses in Cockroach. It makes the problem of efficiently aggregating accounting information of potentially vast ranges of data tractable. Imagine a subrange of data over which accounting is being kept. For example, the photos table in a public photo sharing site. To efficiently keep track of data about the table (e.g. total size, number of rows, etc.), messages can be passed first up the tree and then down to the left until updates arrive at the key prefix under which accounting is aggregated. This makes worst case number of hops for an update to propagate into the accounting totals 2 * log2N. A 64T database will require 1M ranges, meaning 40 hops worst case. In our experience, accounting tasks over vast ranges of data are most often map/reduce jobs scheduled with coarse-grained periodicity. By contrast, we expect Cockroach to maintain statistics with sub 10s accuracy and with minimal cycles and minimal IOPs.

Another use for the range-spanning tree is to push accounting, zones and permissions configurations to all ranges. In the case of zones and permissions, this is an efficient way to pass updated configuration information with exponential fan-out. When adding accounting configurations (i.e. specifying a new key prefix to track), the implicated ranges are transactionally scanned and zero-state accounting information is computed as well. Deleting accounting configurations is similar, except accounting records are deleted.

Last but not least, the range-spanning tree provides a convenient mechanism for planning and executing parallel queries. These provide the basis for Dremel-like query execution trees and it’s easy to imagine supporting a subset of SQL or even javascript-based user functions for complex data analysis tasks.