# turbo-persistence

This crate provides a way to persist key value pairs into a folder and restore them later.

The API only allows a single write transaction at a time, but multiple threads can fill the transaction with (non-conflicting) data concurrently.

When pushing data into the WriteBatch it is already persisted to disk, but only becomes visible after the transaction is committed. On startup left-over uncommitted files on disk are automatically cleaned up.

The architecture is optimized for pushing a lot data to disk in a single transaction, while still allowing for fast random reads.

It supports having multiple key families, which are stored in separate files, but a write batch can contain keys from multiple families. Each key family defines a separate key space. Entries in different key families doesn't influence each other (also not performance-wise).

## On disk format

There is a single `CURRENT` file which stores the latest committed sequence number.

All other files have a sequence number as file name, e. g. `0000123.sst`. All files are immutable once there sequence number is <= the committed sequence number. But they might be deleted when they are superseeded by other committed files.

There are four different file types:

- Static Sorted Table (SST, `*.sst`): These files contain key value pairs.
- Blob files (`*.blob`): These files contain large values.
- Delete files (`*.del`): These files contain a list of sequence numbers of files that should be considered as deleted.
- Meta files (`*.meta`): These files contain metadata about the SST files. They contains the hash range and a AMQF for quick filtering.

Therefore there are these value types:

- INLINE: Values ≤ 8 bytes stored directly in key blocks within `*.sst` files.
- SMALL: Values 9–4096 bytes packed into shared value blocks within `*.sst` files.
- MEDIUM: Values 4097 bytes – 64 MB stored in dedicated value blocks within `*.sst` files.
- BLOB: Values > 64 MB stored in separate `*.blob` files.
- DELETED: Values that are deleted. (Tombstone)
- Future:
  - MERGE: An application specific update operation that is applied on the old value.

### Value type trade-offs

|                       | Inline            | Small                                           | Medium                                | Blob                                      |
| --------------------- | ----------------- | ----------------------------------------------- | ------------------------------------- | ----------------------------------------- |
| Size                  | ≤ 8 B             | 9 B .. 4 kB                                     | 4 kB .. 64 MB                         | > 64 MB                                   |
| Compression unit      | key block         | shared value block (≥ 8 kB)                     | dedicated value block                 | separate file                             |
| Compression unit size | ≤ 16 kB           | 8 kB .. 12 kB                                   | 4 kB .. 64 MB                         | > 64 MB                                   |
| Access cost           | no extra overhead | decompress shared block (~8 kB)                 | decompress value size                 | open separate file, decompress value size |
| Storage overhead      | 0                 | 8 B in key block + 8 B per ~8 kB in block table | 2 B in key block + 8 B in block table | 4 B in key block + 8 B in blob header     |
| Compaction            | re-compressed     | re-compressed                                   | copied compressed                     | pointer copied                            |

Small value blocks are emitted once they accumulate at least `MIN_SMALL_VALUE_BLOCK_SIZE` (8 kB) of data. This means actual block sizes range from 8 kB up to 8 kB + `MAX_SMALL_VALUE_SIZE` (4 kB) = 12 kB. This provides a good balance between compression efficiency (blocks ≥ 4 kB compress well with LZ4) and access cost (only ~8–12 kB needs to be decompressed per lookup).

### Meta file

A meta file can contain metadata about multiple SST files. The metadata is stored in a single file to avoid having too many small files.

- Header
  - 4 bytes magic number (0xFE4ADA4A)
  - 4 bytes key family
  - 4 bytes count of obsolete SST files
  - foreach obsolete SST file
    - 4 bytes sequence number of the obsolete SST file
  - 4 bytes count of described SST files
  - foreach described SST file
    - 4 bytes sequence number of the SST file
    - 2 bytes block count
    - 8 bytes min hash
    - 8 bytes max hash
    - 8 bytes SST file size
    - 4 bytes flags
      - bit 0: cold (compacted and not recently accessed)
      - bit 1: fresh (not yet compacted)
    - 4 bytes end of AMQF offset relative to start of all AMQF data
  - 4 bytes end of AMQF offset relative to start of all AMQF data of the "used key hashes" AMQF
- foreach described SST file
  - serialized AMQF
- serialized "used key hashes" AMQF

### SST file

The SST file contains only data without any header.

- foreach block
  - 4 bytes block header (uncompressed length or sentinel)
  - 4 bytes checksum (CRC32 of uncompressed block data)
  - block data (compressed or uncompressed)
- foreach block
  - 4 bytes end of block offset relative to start of all blocks

#### Block Compression

Blocks can be stored compressed (LZ4) or uncompressed. The 4-byte header distinguishes them:

- **Header > 0**: Block is LZ4 compressed. Header value is the uncompressed length.
- **Header = 0**: Block is stored uncompressed. Actual length is derived from block offsets.

#### Block Checksum

Each block stores a 4-byte CRC32 checksum (big-endian) computed on the **on-disk** block data (i.e. after compression). On read, the checksum is verified **before** decompression so that on-disk damage is caught before passing data to LZ4. A checksum mismatch returns an error indicating that the cached data is damaged.

#### Index Block

- 1 byte block type (0: index block)
- 2 bytes block index
- `n` times
  - 8 bytes hash
  - 2 bytes block index

An Index block contains `n` 8 bytes hashes, which specify `n - 1` hash ranges (eq hash goes into the prev range, except for the first key). Between these `n` hashes there are `n - 1` 2 byte block indicies that point to the block that contains the hash range.

The hashes are sorted.

`n` is `(block size + 1) / 10`

#### Key Block (variable-size)

- 1 byte block type (1: key block with hash, 2: key block without hash)
- 3 bytes entry count
- foreach entry
  - 1 byte type
  - 3 bytes position in block after header
- Max block size: 16 KB

A Key block contains n keys, which specify n key value pairs.

The block type determines whether the key hash is stored per entry:

- Block type 1 (with hash): Full 8-byte hash stored per entry
- Block type 2 (no hash): No hash stored (for keys ≤ 32 bytes)

During lookup, if block type is 2, the full hash is recomputed from the key data.

Depending on the `type` field entry has a different format:

- 0: normal key (small value)
  - 8 bytes key hash (if block type 1)
  - key data
  - 2 byte block index
  - 2 bytes size
  - 4 bytes position in block
- 1: blob reference
  - 8 bytes key hash (if block type 1)
  - key data
  - 4 bytes sequence number
- 2: deleted key / tombstone (no data)
  - 8 bytes key hash (if block type 1)
  - key data
- 3: normal key (medium sized value)
  - 8 bytes key hash (if block type 1)
  - key data
  - 2 byte block index
- 7: merge key (future)
  - key data
  - 2 byte block index
  - 3 bytes size
  - 4 bytes position in block
- 8..255: inlined value (currently only values ≤8 bytes are inlined, though the format supports up to 247)
  - 8 bytes key hash (if block type 1)
  - key data
  - (type - 8) bytes value data (inline, no separate value block)

The entries are sorted by key hash and key.

#### Key Block (fixed-size)

Used when all entries in a block have the same key size and value type. Eliminates the per-entry offset table, enabling direct arithmetic indexing during binary search.

- 1 byte block type (3: fixed-size with hash, 4: fixed-size without hash)
- 3 bytes entry count
- 1 byte key size (uniform across all entries)
- 1 byte value type (shared by all entries, same encoding as variable-size type field)
- foreach entry (packed at stride = hash_len + key_size + val_size):
  - 8 bytes key hash (if block type 3)
  - key data (key_size bytes)
  - value data (size determined by value type)

Entry position for index `i` is computed as `header_size + i * stride` with no indirection. The writer automatically selects fixed-size format when all entries in a block qualify; otherwise falls back to the variable-size format above.

#### Value Block

- no header, all bytes are data referenced by other blocks
- max block size: 4 GB

### Blob file

The plain value compressed with dynamic compression. Each blob file has an 8-byte header:

- 4 bytes: uncompressed length (u32 big-endian)
- 4 bytes: CRC32 checksum of the compressed data (u32 big-endian)
- remaining bytes: LZ4-compressed value data

The checksum is verified on the compressed data **before** decompression when the blob is read.

## Reading

Reading start from the current sequence number and goes downwards.

- We have all SST files memory mapped
- for i = CURRENT sequence number .. 0
  - Check AMQF from SST file for key existence -> if not continue
  - let block = 0
  - loop
    - Index Block: find key range that contains the key by binary search
      - found -> set block, continue
      - not found -> break
    - Key Block: find key by binary search
      - found -> lookup value from value block, return
          - read value as inline, or by using the block index in the key to find the value elsewhere in the file.
      - not found -> break

## Writing

Writing starts by creating a new WriteBatch. It maintains an atomic counter of the next free sequence number.

The WriteBatch has a thread local buffer that accumulates operations until a certain threshold is reached. Then the buffer is sorted and written to a new SST file (and maybe some blob files).

When the WriteBatch is committed all thread local buffers are merged into a single global buffer and written into new SST files (potentially multiple when threshold is reached).

fsync! The new sequence number is written to the `CURRENT` file.

After that optimization might take place.

## Compaction

For compaction we compute the "coverage" of the SST files. The coverage is the average number of SST files that need to be touched to figure out that a key is missing. The coverage can be computed by looking at the min_hash and max_hash of the SST files only.

For a single SST file we can compute `(max_hash - min_hash) / u64::MAX` as the coverage of the SST file. We sum up all these coverages to get the total coverage.

Compaction chooses a few SST files and runs the merge step of merge sort on tham to create a few new SST files with sorted ranges.

Example:

```text
key hash range: | 0    ...    u64::MAX |
SST 1:             |----------------|
SST 2:                |----------------|
SST 3:            |-----|
```

can be compacted into:

```text
key hash range: | 0    ...    u64::MAX |
SST 1':           |-------|
SST 2':                   |------|
SST 3':                          |-----|
```

The merge operation decreases the total coverage since the new SST files will have a coverage of < 1.

But we need to be careful to insert the SST files in the correct location again, since items in these SST files might be overridden in later SST file and we don't want to change that.

Since SST files that are smaller than the current sequence number are immutable we can't change the files and we can't insert new files at this sequence numbers.
Instead we need to insert the new SST after the current sequence number and copy all SST files after the original SST files after them. (Actually we only need to copy SST files with overlapping key hash ranges. And we can hardlink them instead). Later we will write the current sequence number and delete them original and all copied SST files.

We can run multiple merge operations concurrently when the key hash ranges are not overlapping or they are from different key families. The copy operation need to be strictly after all merge operations.

There must not be another SST file with overlapping key hash range between files of a merge operation.

During the merge operation we eliminate duplicate keys. When blob references are eliminated we delete the blob file after the current sequence number was updated.

Since the process might exit unexpectedly, to avoid "forgetting" to delete the SST files we keep track of that in a `*.del` file. This file contains the sequence number of SST and blob files that should be deleted. We write that file before the current sequence number is updated. On restart we execute the deletes again.

We limit the number of SST files that are merged at once to avoid long compactions.

Full example:

Example:

```text
key hash range: | 0    ...    u64::MAX | Family
SST 1:             |-|                   1
SST 2:             |----------------|    1
SST 3:                |----------------| 1
SST 4:            |-----|                2
SST 5:                |-----|            2
SST 6:                 |-------|         1
SST 7:                    |-------|      1
SST 8:                 |--------|        2
SST 9:                     |--------|    2
CURRENT: 9
```

Compactions could selects SST 2, 3, 6 and SST 4, 5, 8 for merging (we limited to 3 SST files per merge operation). This also selects SST 7, 9 for copying. The current sequence number is 9.

We merge SST 2, 3, 6 into new SST files 10, 12, 14 and SST 4, 5, 8 into new SST files 11, 13. Both operations are done concurrently so they might choose free sequence numbers in random order. The operation might result in less SST files due to duplicate keys.

After that we copy SST files 7, 9 to new SST files 15, 16.

We write a "del" file at sequence number 17.

After that we write the new current sequence number 17.

Then we delete SST files 2, 3, 6 and 4, 5, 8 and 7, 9. The

SST files 1 stays unchanged.

```text
key hash range: | 0    ...    u64::MAX | Family
SST 1:             |-|                   1
SST 10:            |-----|               1
SST 12:                  |-----|         1
SST 11:            |------|              2
SST 14:                        |-------| 1
SST 13:                   |-----|        2
SST 15:                   |-------|      1
SST 16:                    |--------|    2
DEL 17:  (2, 3, 4, 5, 6, 7, 8, 9)
CURRENT: 17
```

Configuration options for compations are:

- max number of SST files that are merged at once
- coverage when compaction is triggered (otherwise calling compact is a noop)

## Opening

- Read the `CURRENT` file
- Delete all files with a higher sequence number than the one in the `CURRENT` file.
- Read all `*.del` files and delete the files that are listed in there.
- Read all `*.sst` files and memory map them.

## Closing

- fsync!
- (this also deleted enqueued files)