Sorted Hash Memtable

A new memtable type that improves seek and read while writing performance.

Overview

There are currently 5 RocksDB memtable representation options, with the default being SkipList.

We designed a new memtable representation algorithm that allows parallel writes without requiring the insertion of synchronization locks, also improving read and seek operations.

This new memtable is available in version 2.0.0 and later.

Algorithm

The Speedb memtable combines hash and sorted vectors (which are created in background thread).

The default hash size is 1 Million buckets, and the maximum size for each vector is 10K keys. (this can be tuned as part of column family creation properties).

The goal of the hash is to achieve direct access on read and write operations (O(1)).

The goal of the sorted vector is to improve seek operations, taking into consideration that writes are still progressing.

A new write is added to the hash and atomic write list.

Write Operation

Hash insertion:

Since the default hash size is 1 Million buckets, the possibility of more than 1 write on the same bucket is reduced (not including an overwrite operation). This gives us direct access without synchronization.

Write list insertion:

Elements are inserted into the write list atomically (the anchor pointer is an atomic exchange). The write list is not sorted - this occurs in the background.

Sorted thread:

In the background, the sorted thread divides the write list into sorted vectors if at least one of the following conditions occurs:

  1. The vector size limit is reached (default 10,000)

  2. A seek operation occurs

The vector will only be valid if it is sorted.

Read Operation:

Read is done by direct access to the match memtable hash bucket. No synchronized lock is required.

Seek Operation:

The seek request signals the sorted thread to divide a new sorted vector by adding a special entry to the write list.

Note: New writes are not suspended!

The seek request waits for the cutoff vector to be sorted. Each seek iterator has its own binary heap of shared sorted vectors relevant to the seek time creation, so there's no dependency on other iterators' progress.

If no write occurs, the last sorted vector is used (a new one isn't created).

Merge sorted vectors:

When there are many seek operations, a situation could arise with many small sorted vectors.

In this case, the sorted thread's responsibility is to merge small continuous sorted vectors, so the seek iterator will be created from a small number of sorted vectors.

Preparing the memtable

As mentioned above, the creation of this memtable is expensive. As such, we've created a CF background thread that creates a standby memtable, which allows us to switch memtables without wasting time.

Advantages

  1. High performance in a heavy write workload

  2. A write request during a seek request is not blocked.

  3. Sorted vectors are used on each seek operation without being copied

  4. If no write occurs between two seek operations, both seek iterators will have the same sorted vector

Disadvantages

  1. The memtable constructor size is about 8mg (depends on the bucket size)

  2. memtable creation is expensive (creating 1000000 buckets takes time)

Performance Results

The new sorted hash memtable was tested with db_bench, using the configuration below.

The following benchmark results compare the performance of Rocksdb v7.2.2 with skiplist compared to Speedb with the new sorted hash memtable code.

Note: More details about how we test performance you can read in the Performance Testing chapter.

Configuration:

  • Number of objects: 1Billion

  • Value size: 1KB

  • Write buffer size 64MB (default)

  • Number of Threads: 4

  • Max number of write buffers: 4

  • Number of CF: 1 (default)

  • Number of CPU cores: 16

  • Compression mode: none

*Up to 155% improvement in overwrite workload

Up to 68% improvement in 50% random read workload

Up to 15% improvement in random seek workload

Usage

db_bench/db_stress

./db_bench--memtablerep=speedb.HashSpdRepFactory It will use the default bucket size = 1000000 In order to change the default bucket size, use the following syntax: ./db_bench --memtablerep=speedb.HashSpdRepFactory:1000 (where 1000 is the bucket size)

Configuring in the user’s application code

This should be set in the DB Options object:

Options options;
ConfigOptions config_options;
config_options.ignore_unknown_options = false;
config_options.ignore_unsupported_options = false;
Status s = MemTableRepFactory::CreateFromString(config_options, "speedb.HashSpdRepFactory", &options.memtable_factory);
assert(s.ok());

Last updated