
DEV Community

Dylan Dumont


Log-Structured Merge Trees: The Data Structure That Powers Modern Databases

LSM trees optimize write performance by buffering changes in memory before flushing to disk sequentially.

What We're Building

We are implementing a simplified LSM tree to understand the mechanics behind high-throughput databases like Cassandra and RocksDB. The focus is the core trade-off between write speed and durability: write-heavy workloads are served with sequential disk access rather than random seeks, decoupling the ingest path from reads. This pattern underpins modern backend systems that absorb massive logging or event-ingestion streams.

Step 1 — In-Memory Write Buffer

The core idea of an LSM tree is an in-memory write buffer called a memtable. Incoming writes land in this structure instead of hitting the disk immediately, which avoids an expensive random seek per update. Production engines typically keep the memtable sorted (a skip list or balanced tree) so it can be flushed in key order; here we use a plain vector to keep the example small. This buffer lets the system absorb burst traffic by batching updates until the capacity limit is reached.

struct MemTable {
    entries: Vec<(String, String)>, // buffered key-value writes, newest last
    capacity_limit: usize,          // flush threshold, counted in entries
}

impl MemTable {
    fn new(capacity: usize) -> Self {
        MemTable {
            entries: Vec::new(),
            capacity_limit: capacity,
        }
    }

    // Appending is O(1); no disk I/O happens on the write path.
    fn append(&mut self, key: String, value: String) {
        self.entries.push((key, value));
    }
}

Rust's `Vec` stores its elements contiguously, so a frozen buffer can later be written out in one sequential pass, which is the access pattern disks and SSDs serve at the highest bandwidth.
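As a usage sketch of the write path (the `MemTable` definition is repeated here so the snippet compiles on its own, and an `is_full` helper is added; `capacity_limit` counts entries rather than bytes, which is a simplification):

```rust
struct MemTable {
    entries: Vec<(String, String)>,
    capacity_limit: usize,
}

impl MemTable {
    fn new(capacity: usize) -> Self {
        MemTable { entries: Vec::new(), capacity_limit: capacity }
    }

    fn append(&mut self, key: String, value: String) {
        self.entries.push((key, value));
    }

    // True once the buffer should be frozen and flushed (Step 2).
    fn is_full(&self) -> bool {
        self.entries.len() >= self.capacity_limit
    }
}

fn main() {
    let mut table = MemTable::new(2);
    table.append("user:1".into(), "alice".into());
    assert!(!table.is_full());
    table.append("user:2".into(), "bob".into());
    assert!(table.is_full()); // time to hand the buffer to the flush path
}
```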

Step 2 — Flushing to SSTables

Once the memtable reaches its capacity limit, the buffer is frozen and flushed to disk as a Sorted String Table (SSTable). The flush runs in the background while a fresh memtable absorbs new writes. Note that the flush alone does not guarantee durability: data still in memory would be lost on a crash, which is why production engines also append every write to a write-ahead log. The SSTable file itself is written sequentially, avoiding the random-write amplification that hurts traditional B-Trees under heavy update loads.

fn flush_memtable_to_disk(memtable: &MemTable) -> String {
    // Serialize the frozen buffer into the SSTable format.
    // A real implementation writes these bytes to disk sequentially,
    // typically with block compression and per-block checksums.
    let _bytes = serde_json::to_vec(&memtable.entries)
        .expect("memtable entries are always serializable");
    String::from("sst-000001") // identifier of the new immutable file
}

Immutable files allow multiple versions to coexist without locking, enabling readers to see consistent snapshots of the database state.
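To make the flush a bit more concrete, the sketch below sorts the frozen buffer by key before serializing it, since the "Sorted" in SSTable means the file's entries are in key order. The line-oriented text format is purely illustrative; real SSTables use a binary block format with indexes and checksums:

```rust
// Serialize a frozen buffer as a sorted, line-oriented run.
// Hypothetical format: "key<TAB>value\n" per entry, keys ascending.
fn serialize_sstable(mut entries: Vec<(String, String)>) -> String {
    entries.sort_by(|a, b| a.0.cmp(&b.0)); // SSTables are sorted by key
    entries
        .into_iter()
        .map(|(k, v)| format!("{}\t{}\n", k, v))
        .collect()
}

fn main() {
    let buffer = vec![
        ("b".to_string(), "2".to_string()),
        ("a".to_string(), "1".to_string()),
    ];
    let run = serialize_sstable(buffer);
    assert_eq!(run, "a\t1\nb\t2\n"); // keys come out in sorted order
}
```

Sorting at flush time is what later makes range scans and compaction merges cheap: every file on disk is already an ordered run.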

Step 3 — Indexing for Random Reads

A read must consult the memtable and potentially every SSTable, so each SSTable keeps small in-memory metadata. A bloom filter answers "might this key be in this file?" before any disk read, and a sparse block index narrows the search to a single block when it might. Both structures are tiny relative to the data, so a point lookup touches at most a handful of disk blocks even on large datasets.

// Placeholder for a bloom-filter membership probe.
// A real filter hashes the key k times and tests k bit positions.
fn check_bloom_filter(sst_key: &str) -> bool {
    sst_key.contains("valid") // stand-in for the bit-array check
}

A bloom filter can return false positives but never false negatives: any key that was inserted is always reported as possibly present, so no stored key is ever skipped. A false positive merely costs one unnecessary disk read.
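For readers who want more than the stub above, here is a minimal working bloom filter, assuming the standard library's `DefaultHasher` seeded per probe as the hash family (sizing and hash count are arbitrary here; real engines tune them to a target false-positive rate):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A minimal bloom filter: k hash probes into a fixed bit array.
struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(size: usize, hashes: u64) -> Self {
        BloomFilter { bits: vec![false; size], hashes }
    }

    // Derive the i-th probe position by seeding the hasher with i.
    fn position(&self, key: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &str) {
        for i in 0..self.hashes {
            let p = self.position(key, i);
            self.bits[p] = true;
        }
    }

    // May return a false positive, never a false negative.
    fn might_contain(&self, key: &str) -> bool {
        (0..self.hashes).all(|i| self.bits[self.position(key, i)])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 3);
    filter.insert("user:1");
    assert!(filter.might_contain("user:1")); // inserted keys always hit
}
```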

Step 4 — Asynchronous Compaction

Over time the number of SSTables grows, and the same key may appear in several files. A background compaction process merges overlapping files into fewer, larger sorted runs, keeping only the newest version of each key and discarding deleted entries (tombstones). This bounds both read amplification and disk usage as the active dataset churns.

fn compact_sstables(sst_files: &[String]) -> String {
    // Simulate merging several sorted files into one new SSTable.
    // A real compaction streams a k-way merge over the inputs,
    // emitting only the newest version of each key.
    format!("merged-{}", sst_files.join("+"))
}

This process runs on a separate thread to ensure that write throughput is not degraded by background cleanup tasks.
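The merge step itself can be sketched concretely. The function below compacts two sorted runs into one, preferring the newer run's value whenever the same key appears in both; tombstone handling and k-way merging are left out for brevity:

```rust
use std::cmp::Ordering;

// Merge two sorted runs; `newer` wins on duplicate keys.
fn merge_runs(
    older: &[(String, String)],
    newer: &[(String, String)],
) -> Vec<(String, String)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < older.len() && j < newer.len() {
        match older[i].0.cmp(&newer[j].0) {
            Ordering::Less => { out.push(older[i].clone()); i += 1; }
            Ordering::Greater => { out.push(newer[j].clone()); j += 1; }
            Ordering::Equal => {
                // Same key in both runs: keep only the newer value.
                out.push(newer[j].clone());
                i += 1;
                j += 1;
            }
        }
    }
    out.extend_from_slice(&older[i..]);
    out.extend_from_slice(&newer[j..]);
    out
}

fn main() {
    let older = vec![("a".into(), "1".into()), ("b".into(), "old".into())];
    let newer = vec![("b".into(), "new".into()), ("c".into(), "3".into())];
    let merged = merge_runs(&older, &newer);
    assert_eq!(merged.len(), 3);
    assert_eq!(merged[1], ("b".to_string(), "new".to_string()));
}
```

Because both inputs are already sorted, the merge is a single linear pass, which is exactly why flushing sorted SSTables in Step 2 pays off here.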

Key Takeaways

  • Write Amplification: Updates are batched and written sequentially, avoiding a page rewrite per small update; compaction later rewrites data, so total amplification depends on the compaction strategy.
  • Sequential I/O: By avoiding random seeks, the system exploits the high sequential bandwidth of modern storage.
  • Durability: An unflushed memtable is volatile; engines pair it with a write-ahead log so acknowledged writes survive a crash, while flushed SSTables are immutable on disk.
  • Concurrency: The write buffer absorbs concurrent writers without contention on the storage device, and immutable SSTables can be read without locks.
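The durability point deserves one concrete note: because the memtable lives in memory, production engines append every write to a write-ahead log and sync it before acknowledging the client. A minimal sketch of that append-and-sync path follows (the file name and record format are illustrative assumptions):

```rust
use std::io::Write;

// Append one record to the write-ahead log and force it to stable
// storage, so a crash cannot lose an acknowledged write that is
// still sitting in the in-memory buffer.
fn log_write(log: &mut std::fs::File, key: &str, value: &str) -> std::io::Result<()> {
    writeln!(log, "{}\t{}", key, value)?;
    log.sync_all() // fsync: the record survives a process or OS crash
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("lsm-demo.wal");
    let mut log = std::fs::File::create(&path)?;
    log_write(&mut log, "user:1", "alice")?;
    assert!(std::fs::metadata(&path)?.len() > 0); // bytes reached the file
    std::fs::remove_file(&path)?;
    Ok(())
}
```

On restart, the engine replays this log to rebuild the memtable, then discards log segments once their contents have been flushed into SSTables.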

What's Next?

  1. Explore immutable snapshots for point-in-time recovery.
  2. Investigate read amplification: how a point lookup must consult the memtable and potentially several SSTables.
  3. Consider the trade-offs when implementing bloom filters in low-memory environments.
  4. Compare LSM tree implementations across database engines such as LevelDB, RocksDB, and Cassandra.
