I have lately been studying how databases work under the hood. There are many problems that a database has to solve. At the moment, I’ve been having fun exploring how data is stored on disk. Previously, I wrote about how a hashtable is stored on disk. This post is somewhat similar– I talk about how data stored on disk can be sorted.
Databases frequently have to sort data of massive sizes, sizes big enough that they don’t fit in memory at once. Sorting algorithms that can sort data residing in disk are called external sorting algorithms. External merge-sort is one such algorithm.
Just like merge-sort, external merge-sort is a divide-and-conquer algorithm– it recursively sorts smaller subarrays, and then merges those subarrays. Because disk I/O happens in blocks, however, the smallest subarray is the size of a disk page. So for a file with n pages, the algorithm proceeds as follows.
First, all pages are sorted internally. To achieve this, we bring in however many pages can fit into the program’s buffers(remember, the entire data doesn’t fit in memory), sort each page and write them back. This step is usually referred to as pass 0.
Next, in pass 1, we sort runs of length 1. A run is a collection of sorted pages, so that means that we take pairs of pages and merge them to get $n/2$ runs of length 2. Now, the contents of pages 0 and 1 are sorted; as are 2 and 3, and so on. In pass 2, we do the same thing, but with runs of length 2. And so on, until finally we have two subarrays(runs of length $n/2$ pages). Merging those gives us the final sorted result. We start with $n$ runs and halve at each pass, so there will be $log_2(n)$ passes.
All this may seem very familiar to you if you’re already acquainted with merge-sort. The difference here is that is some extra work to be done when merging. Take, for example, the last pass where we do the final merge between two runs of $n/2$ pages. Our original constraint was that n pages don’t fit in memory, so 2 runs of $n/2$ pages certainly won’t either. To solve this, the pages in the two input runs are streamed in.
First we bring in the first pages from each of the runs. As with “regular” merge-sort, we iterate through both, picking off whichever has the smaller item. Eventually, one will run out. At that point, bring in another page from that run into the buffer, and continue. Our output is also a file, so we need to do the same on the output side– once the output buffer is filled, write it to the output file, and start with an empty buffer.
You can check out my Rust implementation of this algorithm on Github.
What I’ve described above is called two-way merging since at any given step we merge two runs. However, it can be advantageous to merge $k$ runs at once. This would result in $log_k(n)$ number of passes. Since we write out to disk in each pass, the efficiency gain this reduction in the number of passes gives us can be significant.
You can read more about the ways to do k-way merging on Wikipedia.