# databases

# Disk-based algorithms: External merge-sort

### 2017-11-13

I have lately been studying how databases work under the hood. There are many problems that a database has to solve. At the moment, I’ve been having fun exploring how data is stored on disk. Previously, I wrote about how a hashtable is stored on disk. This post is somewhat similar– I talk about how data stored on disk can be sorted.

Databases frequently have to sort data of massive sizes, sizes big enough that they don’t fit in memory at once. Sorting algorithms that can sort data residing in disk are called external sorting algorithms. External merge-sort is one such algorithm.

## External merge-sort

Just like merge-sort, external merge-sort is a divide-and-conquer algorithm– it recursively sorts smaller subarrays, and then merges those subarrays. Because disk I/O happens in blocks, however, the smallest subarray is the size of a disk page. So for a file with n pages, the algorithm proceeds as follows.

First, all pages are sorted internally. To achieve this, we bring in however many pages can fit into the program’s buffers(remember, the entire data doesn’t fit in memory), sort each page and write them back. This step is usually referred to as pass 0.

Next, in pass 1, we sort *runs* of length 1. A *run* is a collection
of sorted pages, so that means that we take pairs of pages and merge
them to get $n/2$ runs of length 2. Now, the contents of pages 0 and 1
are sorted; as are 2 and 3, and so on. In pass 2, we do the same
thing, but with runs of length 2. And so on, until finally we have two
subarrays(runs of length $n/2$ pages). Merging those gives us the
final sorted result. We start with $n$ runs and halve at each pass, so
there will be $log_2(n)$ passes.

All this may seem very familiar to you if you’re already acquainted
with merge-sort. The difference here is that is some extra work to be
done when merging. Take, for example, the last pass where we do the
final merge between two runs of $n/2$ pages. Our original constraint
was that n pages don’t fit in memory, so 2 runs of $n/2$ pages
certainly won’t either. To solve this, the pages in the two input runs
are *streamed in*.

First we bring in the first pages from each of the runs. As with “regular” merge-sort, we iterate through both, picking off whichever has the smaller item. Eventually, one will run out. At that point, bring in another page from that run into the buffer, and continue. Our output is also a file, so we need to do the same on the output side– once the output buffer is filled, write it to the output file, and start with an empty buffer.

You can check out my Rust implementation of this algorithm on Github.

## Multi-way merging

What I’ve described above is called *two-way* merging since at any
given step we merge *two* runs. However, it can be advantageous to
merge $k$ runs at once. This would result in $log_k(n)$ number of
passes. Since we write out to disk in each pass, the efficiency gain
this reduction in the number of passes gives us can be significant.

You can read more about the ways to do k-way merging on Wikipedia.