Myers' diff algorithm in Clojure
If you’re a programmer, you’re probably fairly comfortable with
understanding and interpreting diff patches of the kind that git
and
other version control systems use. In this post we’ll walk through a
Clojure implementation of the Myers' diff algorithm, which is the
algorithm used by Git.
I have been implementing Git following along James Coglan’s book “Building Git”– you can check out the project here. The project itself is in Rust but I thought it might be an interesting exercise to implement the algorithm in Clojure.
Edit graph
Say we want to turn a file with the lines A,B,C
into A,C,E
. The
Myers' diff algorithm finds out the shortest edit script that does
so. The algorithm works on an edit graph as shown above– a rightward
traversal from one vertex to another indicates deletion of a line from
the old file; a downward traversal indicates insertion of a line from
the new file.
Also notice that in our example ABC > ACE
there are two lines which
are the same(A & C). Myers' algorithm leverages this to produce the
shortest edit script. In our edit graph above, you can see that where
lines are equal we have drawn dotted lines to indicate diagonal
traversals which require neither insertion nor deletion. The Myers'
algorithm maximizes the number of diagonal traversals in the edit
script it produces^{1}.
Let’s walk through the output Myers' algorithm produces for our example above:
The first character is the same in both files so take the first
diagonal down. Then delete B
. The third character in the old
file(C
) is the same as the second in the new file so keep it and do
a diagonal traversal again. Then insert E
. This series of edits on
the old file has produced our desired new file, and so we stop.
Finally, before we dive into the algorithm, we have one more addition
to the edit graph. We will draw downwarddiagonals across the graph,
also called klines. Each kline has the equation k = x  y
.
Implementation
The maximum number of operations(insertions or deletions) the edit
script can have is len(old) + len(new)
– in this case we would be
deleting every line in the old file and inserting every line from the
new file.
We iterate through 0 to maximum number of operations and see what the
farthest point along each kline we can go. In the reduce
, we can
see that our accumulator is a tuple with v
and trace
.
v
maps each k
to the farthest x
we have found so far along that
kline. We can discard the y
values since we can easily compute
them(y = x  k
).
trace
is a vector of the v
s we’ve seen so far.
We can see that the function ultimately returns trace
. This will be
used to reconstruct the path that we took along the graph.
(defn shortestedit [a b]
(let [max (+ (count a) (count b))
[_ trace] (reduce (fn [[v trace] d]
(let [[foundend? farthestxs] (getfarthestxsford v a b d)
[newv newtrace] [(merge v farthestxs)
(conj trace v)]]
(if foundend?
(reduced [newv newtrace])
[newv newtrace])))
[{1 0} []]
(range 0 (inc max)))]
trace))
To get to a line k, we can either go down from k+1 or go right from
line k1. We use the movedown?
function to decide whether to move
down or right:
(defn movedown?
[v d k]
(or (= k ( d))
(and (not= k d)
(< (get v (dec k))
(get v (inc k))))))

If k=d, it is not possible to go right from k1, so go down from k+1.

If k=d, it is not possible to go down from k+1, so go right from k1.

Otherwise, go down if
v[k1] < v[k+1]
.What this expression in the predicate is saying is: if we are further along on the k+1 line, we want to go down to meet the k line. Else go right from the k1 line.
Next, in getfarthestxyalongk*
we figure out what the farthest
(x,y) we can reach in this round d
along the k
line given the v
map we’ve built so far. a
and b
contain the lines that we want to
diff. This simply uses movedown?
as described above; then if
possible walks down any diagonals.
(defn getfarthestxyalongk*
[v d a b k]
(let [x (if (movedown? v d k)
(get v (inc k))
(inc (get v (dec k))))
y ( x k)
;; walk down diagonals
[x y] (loop [[x y] [x y]]
(if (and (< x (count a))
(< y (count b))
(= (get a x) (get b y)))
(recur [(inc x) (inc y)])
[x y]))]
[x y]))
getfarthestxsford
iterates from d to d(both inclusive), 2 steps
at a time^{2} and stores the farthest x
for each k
, returning
early if we reach the bottomright corner in our edit graph.
(defn getfarthestxsford
[v a b d]
(let [getfarthestxyalongk (partial getfarthestxyalongk* v d a b)]
(reduce (fn [[foundend? k>x] k]
(let [[x y] (getfarthestxyalongk k)]
(if (and (>= x (count a))
(>= y (count b)))
(reduced [true k>x])
[foundend? (assoc k>x k x)])))
[false {}]
(range ( d) (inc d) 2))))
So far, we’ve built up functions that give us a trace:
myers.core> (shortestedit (clojure.string/split "ABCABBA" #"")
(clojure.string/split "CBABAC" #""))
[{1 0}
{1 0, 0 0}
{1 1, 0 0, 1 0}
{1 1, 0 2, 1 0, 2 2, 2 3}
{1 5, 0 2, 1 4, 2 2, 2 3, 3 3, 3 5}
{1 5, 0 5, 1 4, 2 4, 2 7, 3 3, 3 5, 4 3, 4 7}]
myers.core>
Now, let’s look at how to turn this into a diff.
Backtracking and producing the diff
This next function will take an input a trace and output an edit
sequence. Each element in the editsequence is of the form [prevx prevy x y]
.
This time we start from the bottomright node of the edit graph. In each iteration, we figure out the previous (x,y)– also walking back up diagonals if required.
(defn trace>editseq
[n m trace]
(let [[_ editseq]
(reduce (fn [[[x y] editseq] [d v]]
(let [k ( x y)
prevk (if (movedown? v d k)
(inc k)
(dec k))
prevx (get v prevk)
prevy ( prevx prevk)
[[x y] editseq] (loop [[x y] [x y]
editseq editseq]
(if (and (> x prevx) (> y prevy))
(recur [(dec x) (dec y)]
(conj editseq
[(dec x) (dec y) x y]))
[[x y] editseq]))
editseq (if (pos? d)
(conj editseq [prevx prevy x y])
editseq)]
[[prevx prevy] editseq]))
[[n m] []]
(reverse (map vector (range) trace)))]
(reverse editseq)))
Once we have an edit sequence, printing the diff is pretty straightforward:
(defn diff
[a b]
(for [[prevx prevy x y] (trace>editseq (count a)
(count b)
(shortestedit a b))]
(let [aline (get a prevx)
bline (get b prevy)]
(cond
(= x prevx) {:op :ins
:content bline}
(= y prevy) {:op :del
:content aline}
:else {:op :eql
:content aline}))))
(defn printdiff
[diff]
(doseq [{:keys [op content]} diff]
(let [dout (case op
:ins (format "+%s" content)
:del (format "%s" content)
:eql (format " %s" content))]
(println dout))))