Timestamp Oracle
As mentioned earlier, when a transaction performs a write on a key, its write timestamp must be greater than any read timestamps performed on the same key to prevent read-write conflicts. Therefore, the database needs to track the maximum timestamp that key ranges were read from. That is the job of the timestamp oracle. Timestamp Oracle is a data structure that stores the maximum timestamp that key ranges were read from.
Timestamp oracle is introduced in Yabandeh’s research paper, A Critique of Snapshot Isolation. In the paper, Yabandeh proves that read-write conflict avoidance is sufficient for serializability. For a more detailed explanation of the timestamp oracle, check out this CockroachDB blog and scroll down to the section Read-Write Conflicts – Read Timestamp Cache
.
Timestamp Oracle API
The Timestamp Oracle’s API consists of two methods: get_max_timestamp and add
Get_max_timestamp: (start_key, end_key) → Timestamp
The method returns the max timestamp, which overlaps the interval [start_key, end_key].
Add: (timestamp, start_key, end_key) → ()
The method adds the timestamp for the range [start_key, end_key] to the oracle.
Red-Black Tree
The timestamp oracle is implemented with a red-black interval tree. There are plenty of materials online on how to implement a red-black tree, so I won’t go into the details. In summary, a red-black tree is a self-balancing binary search tree in which each node is marked as either black or red. The red-black tree has some invariants around the red and black nodes. The start key of the interval is used to order keys in the red-black tree.
Here is my implementation of the red-black tree.
Implementing the Timestamp Oracle API
Add
Add’s implementation is really simple. It just inserts the interval into the red-black tree.
Get_max_timestamp
First, the function collects all overlap nodes. This is efficient because the timestamp oracle is a Binary Search Tree. Next, the function finds the max timestamp amongst the nodes and returns it.
CockroachDB’s Implementation
CockroachDB calls the timestamp oracle their timestamp cache. CockroachDB’s interface for the timestamp cache is here. They have two implementations for the cache: red-black tree based and BTree based.
CockroachDB’s cache is bounded. It evicts keys if the cache exceeds a certain size It also maintains a “low watermark” that is used whenever GetMax is called but the interval isn’t in the cache. To decide which keys to evict, CockroachDB holds a double-linked list of Entry elements. The order of entries is based on the eviction policy, one of which is LRU and another is FIFO. I didn’t bother to implement this in my MVP.