MVCC

As we mentioned earlier, the MVCC system is the bedrock for how CRDB provides atomicity and isolation to transactions. In this section, we will talk about how the MVCC layer is implemented.

In an MVCC system, each record contains a key, value, and timestamp.

Similarly to CRDB, my toy database also uses RocksDB as the MVCC's storage engine. RocksDB is a key-value store, so we need to figure out a way to store key, value, and timestamp into a key-value pair. The toy database combines the key and the timestamp of the MVCC system into a single MVCCKey which will be encoded into a single RocksDB key.

#![allow(unused)]
fn main() {
pub struct MVCCKey {
    pub key: Key,
    pub timestamp: Timestamp,
}
}

Each timestamp in an MVCCKey is a HLC (hybrid logical clock) time:

#![allow(unused)]
fn main() {
pub struct Timestamp {
    pub wall_time: u64,
    pub logical_time: u32,
}
}

The Key’s type is

#![allow(unused)]
fn main() {
pub type Key = Vec<u8>;
}

To store the MVCCKey into RocksDB, we need to encode the MVCCKey into a RocksDB key, which is just a Vec<u8>. Inspired by CRDB's EncodeMVCCKey method, the MVCCKey is encoded into the following form:

[key] [wall_time] [logical_time]

Here is the implementation:

#![allow(unused)]
fn main() {
pub fn encode_mvcc_key(mvcc_key: &MVCCKey) -> Vec<u8> {
    let mut key_vec = mvcc_key.key.to_vec();
    let timestamp_vec = encode_timestamp(mvcc_key.timestamp);
    key_vec.extend(timestamp_vec);
    key_vec
}

pub fn encode_timestamp(timestamp: Timestamp) -> Vec<u8> {
    let mut wall_time_bytes = timestamp.wall_time.to_le_bytes().to_vec();
    let logical_time_bytes = timestamp.logical_time.to_le_bytes().to_vec();
    wall_time_bytes.extend(logical_time_bytes);
    wall_time_bytes
}
}

Earlier, we mentioned that each key can only have one write intent at a time, so we need a unique key to represent an intent key. We use the MVCCKey with the zero timestamp: Timestamp { wall_time: 0, logical_time: 0 } to represent the key for a write intent.

When users query for a key, the database returns the latest version of that key. To make this faster, MVCCKeys are sorted from the highest timestamp to the lowest timestamp in the storage engine. However, we need the query to return the zero timestamp, which represents the intent key, if there is one.

Luckily, RocksDB allows developers to customize the order of keys in the table through set_comparator. Here is my implementation of the custom comparator. The comparator gives the highest precedence to the zero timestamp for the same key. Otherwise, it uses the key and the timestamp to order the MVCC keys.

Core APIs

The core API of the MVCC layer includes:

  • MVCC_SCAN
  • MVCC_GET
  • MVCC_PUT

These are the three APIs that the other entities of the transactional layer interact with the MVCC layer.

MVCC_SCAN

mvcc_scan(&self, start_key: Key, end_key: Key, timestamp: Timestamp, scan_params: MVCCScanParams) -> MVCCScanResult

MVCC_SCAN takes a start key, an end key, and a timestamp as inputs and returns an array of results that contains the keys in the range [start_key, end_key). Only keys with a timestamp less than or equal to the input timestamp are added to the scan results. MVCC_SCAN also collects any write intents that it found along the way.

Algorithm

CockroachDB’s MVCCScan

For reference, here is CockroachDB’s implementation of MVCCScan. The core idea is similar. It initializes a mvccScanner, seeks to the start of the scan, and keeps looping until it exceeds the range.

On each iteration, it checks if the key is an intent key by checking if it’s empty. It then checks if the MVCC key’s timestamp is equal to the input timestamp, greater than the input timestamp, or less than the input timestamp, in that case, it seeks an older version of the key.

MVCC_GET

fn mvcc_get<'a>(&self, key: &'a Key, timestamp: Timestamp, params: MVCCGetParams) -> MVCCGetResult

MVCC_GET takes a key, a timestamp, and an optional transaction as inputs and returns the most recent value for the specified key whose timestamp is less than or equal to the supplied timestamp. If it runs into an uncommitted value, it returns a WriteIntentError.

Algorithm

CockroachDB’s MVCCGet

For reference, here is CockroachDB’s implementation of MVCCGet. The idea to use MVCCScanner is inspired by them. In their implementation, they implement mvcc_get as a scan with start_key = end_key.

MVCC_PUT

fn mvcc_put<T: Serialize>(&self, key: Key, timestamp: Option<Timestamp>, txn: Option<TxnLink>, value: T) -> Result<MVCCKey, WriteIntentError>

MVCC_PUT takes a key, a timestamp, and an optional transaction and tries to insert a timestamped record into the MVCC database. If a transaction is provided, a write intent is placed in the database. Otherwise, the raw value is placed into the database.

Before writing, MVCC_PUT must verify that there is no uncommitted intent for the same key. This is because there can only be a write intent for a key. If an uncommitted intent already exists for the same key, an error is returned.

Algorithm:

CockroachDB’s MVCCPut

For reference, here is CockroachDB’s implementation of MVCCPut. The core idea is the same. It checks if an intent was found. If there exists an uncommitted write intent whose transaction ID is not the same as the MVCCPut’s Transaction ID, a WriteIntentError is returned.

Otherwise, if the intent’s timestamp is less than the write timestamp, clear it so that MVCCPut can overwrite it. Finally, MVCCPut writes the mvcc value.