Difftastic, the Implausible Diff – Wilfred Hughes::Weblog
I’ve all the time needed a structural diff device, so I constructed
difftastic.
This has been probably the most fascinating, most irritating, and most
difficult program I’ve ever written.
How Onerous May It Be?
In the event you write Lisp code for some time, you begin to see code like
JSON. Every part is mainly an inventory.
json-diff already exists,
and it’s fairly good. I needed one thing related for programming
languages.
After an enormous quantity of experimentation, I’ve one thing that
works. On this publish, I’ll present you the way it works.
I received’t present the various, many lifeless ends and failed designs alongside the
method. We are able to faux that I acquired it proper first time.
Parsing The Code
If I wish to examine two applications, I first want a parse tree for every
program. I want an correct lexer, a fundamental parser, and I have to
protect feedback.
tree-sitter was an amazing
match right here. You outline a grammar in JSON or JS, and it generates a C
library that anybody can use. It’s not 100% correct (e.g. the C++
parser doesn’t have preprocessor knowledge) but it surely’s greater than ok.
checklist: ($) => seq("(", repeat($._sexp), ")"),
vector: ($) => seq("[", repeat($._sexp), "]"),
Right here’s an excerpt from my Emacs Lisp
grammar. There’s a ton
of tree-sitter parsers obtainable too. Difftastic now helps 44
completely different syntaxes, and including new ones is so simple that my
manual includes a worked
example.
After parsing, difftastic converts the tree-sitter parse tree to an
s-expression. Every part is an inventory or an atom. This uniform
illustration permits the diffing logic to work on any language that
I can parse.
For instance, given a JavaScript program like this:
tree-sitter parses it to this parse tree:
expression_statement
call_expression
identifier "foo"
arguments
(
quantity "1"
,
quantity "2"
)
difftastic then converts the tree to this s-expression illustration:
Checklist {
open_content: "",
kids: [
Atom "foo",
List {
open_content: "(",
children: [
Atom "1",
Atom ",",
Atom "2",
],
close_content: ")",
},
],
close_content: "",
}
Calculating The Diff
Enjoyable truth: I considered diffing applications as figuring out what has
modified. The purpose of diffing is definitely to work out what hasn’t
modified! The extra stuff you may match up, the smaller and extra
readable your diff.
How do I work out what’s modified between two parse bushes? There are a
bunch of various design concepts on the market, however
autochrome is probably the most
efficient method I discovered, and it consists of an extremely useful
labored instance.
Autochrome and difftastic signify diffing as a shortest path
downside on a directed acyclic graph. A vertex represents a pair of
positions: the place within the left-hand facet s-expression (earlier than), and the place in
the right-hand facet s-expression (after).
The purpose is to seek out the shortest route from the beginning vertex (the place
each positions are earlier than the primary merchandise within the applications) to the top
vertex (the place each positions are after the final merchandise in this system).
For instance, suppose you’re evaluating this system A
with X A
. The
vertices appear like this.
START
+---------------------+
| Left: A Proper: X A |
| ^ ^ |
+---------------------+
END
+---------------------+
| Left: A Proper: X A |
| ^ ^|
+---------------------+
The perimeters within the graph signify the modifications required to remodel the
left-hand facet program into the right-hand facet.
On this instance, there are two doable modifications from the beginning vertex,
so it has two edges. Difftastic might (1) think about the left-hand facet to
be novel and increment that place, or (2) it might think about the
right-hand facet to be novel and increment that place.
A novel merchandise on the left-hand facet is a removing, and a novel merchandise on
the right-hand facet is an addition.
START
+---------------------+
| Left: A Proper: X A |
| ^ ^ |
+---------------------+
/
Novel atom L / Novel atom R
1 v 2 v
+---------------------+ +---------------------+
| Left: A Proper: X A | | Left: A Proper: X A |
| ^ ^ | | ^ ^ |
+---------------------+ +---------------------+
When each positions are pointing to an similar s-expression, a 3rd
edge is added within the graph. This edge represents matching an
s-expression on each side.
2
+---------------------+
| Left: A Proper: X A |
| ^ ^ |
+---------------------+
/ |
Novel atom L / | Novel atom R
v | v
+---------------------+ | +---------------------+
| Left: A Proper: X A | | | Left: A Proper: X A |
| ^ ^ | | | ^ ^|
+---------------------+ | +---------------------+
| | |
| Novel atom R | Nodes match | Novel atom L
| | |
| END v |
| +---------------------+ |
+-------->| Left: A Proper: X A |<---------+
| ^ ^|
+---------------------+
This ‘unchanged node’ edge has a decrease value than the ‘novel node’
edges. The routing downside is then a matter of discovering the route with
probably the most unchanged edges.
The shortest route with applications A
and X A
is [Novel Right,
.
Unchanged]
Discovering The Shortest Route
Difftastic and Autochrome each use Dijkstra’s algorithm. Sadly
the scale of the graph is quadratic: it’s O(L * R), the place L is the
variety of objects within the left-hand s-expression and R is the variety of
objects within the right-hand s-expression.
To make efficiency bearable, difftastic aggressively discards
clearly unchanged s-expressions initially, center and finish of
the file. In the event you’ve solely modified the final operate in a file,
difftastic received’t think about the opposite capabilities in any respect.
(GNU Diff has an identical function with its horizon-lines
option,
though its efficiency is a lot better basically.)
As a result of sheer measurement of the graph (a number of million vertices), the
greatest efficiency bottleneck is vertex development. I explored
higher route discovering algorithms (e.g. A*) however I didn’t see a lot
enchancment. My present answer is to assemble the graph lazily, so
few vertices are constructed.
Even with this, difftastic typically struggles with efficiency. I’ve
profiled and optimised every little thing I can consider (utilizing Rust actually
helped right here). If the graph is simply too massive, difftastic falls again to a
typical line-oriented diff.
What About Nesting?
When incrementing positions throughout graph traversal, I wanted to be
cautious with getting into and leaving delimiters.
;; Earlier than
(x y)
;; After
(x) y
The specified consequence right here is
(x y)
and
(x) y
.
When the s-expression place is on a delimiter, difftastic can
both think about the delimiter novel, or unchanged. That is much like
the s-expression atom case.
START
+---------------------------+
| Left: (x y) Proper: (x) y |
| ^ ^ |
+---------------------------+
/ |
Novel delimiter L / | Novel delimiter R
v | v
+---------------------------+ | +---------------------------+
| Left: (x y) Proper: (x) y | | | Left: (x y) Proper: (x) y |
| ^ ^ | | | ^ ^ |
+---------------------------+ | +---------------------------+
|
| Unchanged delimiters
v
+---------------------------+
| Left: (x y) Proper: (x) y |
| ^ ^ |
+---------------------------+
What occurs when the place is on the finish of the checklist? If
difftastic entered the delimiters collectively (‘unchanged delimiter’), it
should exit them collectively. This requires each s-expressions positions to
level to the exit delimiter.
If the delimiters have been entered individually (‘novel delimiter’), then
the delimiters may be exited individually too.
+---------------------------+
| Left: (x y) Proper: (x) y |
| ^ ^ |
+---------------------------+
/
Novel node L / Exit delimiter R
v v
+---------------------------+ +---------------------------+
| Left: (x y) Proper: (x) y | | Left: (x y) Proper: (x) y |
| ^ ^ | | ^ ^ |
+---------------------------+ +---------------------------+
The ‘exit delimiter proper’ edge is barely allowed if the delimiter was
additionally entered with ‘novel delimiter proper’.
Which means graph vertices are actually a tuple of three objects:
(left-hand facet place, right-hand facet place,
list_of_parents_to_exit_together).
This exponentially will increase the scale of the graph, O(2N) the place N
is the very best checklist nesting stage in both enter.
I solved this by solely contemplating at most two graph vertices for every
place pair. I all the time discover a route this manner (there exists a path to
the top vertex from each different vertex) however it isn’t essentially the
shortest. In observe this appears to discover sufficient of the graph that
the outcomes are persistently nice.
Constructing The Interface
Phew! After wrestling with diffing algorithms for a while, I assumed
constructing the UI can be simple. I used to be fallacious.
Difftastic is aware of which s-expression nodes are unchanged. It utterly
ignores whitespace.
Nonetheless, there is no such thing as a assure that there are any traces in frequent
between the 2 recordsdata. It’s additionally doable {that a} line within the first
file may need matches in zero, one or many traces within the second file.
The show logic iterates by all of the matched traces, and tries to
align as many as doable. It makes use of a two-column show by default, so
you may reformat your complete file and it’ll nonetheless produce a smart
output.
Engaged on diff UIs has made me realise how unhealthy the standard diff
show is.
The header @@ -392,7 +392,12 @@
signifies that the diff begins at
line 392. The consumer is compelled to rely traces to work out that the change
itself is on line 395!
Dogfooding
Ultimately I added the power to make use of difftastic with git or
mercurial. This was an distinctive option to discover bugs and see how effectively
the design works in observe.
This uncovered a bunch of delicate points with structural diffing.
;; Earlier than
(foo (bar))
;; After
(foo (novel) (bar))
Diff algorithms are described within the literature as “discovering a minimal
edit script”, the place the edit script is the additions/removals required to
flip the enter file into the output file.
On this instance, (foo (novel)
can be a very
(bar))
legitimate, minimal diff. It’s including the image novel
and including one
pair of parentheses.
This isn’t what the consumer needs although. They’d somewhat see (foo (novel) (bar))
despite the fact that it’s equally minimal.
I solved this by adjusting the graph edge value mannequin to provide nicer
outcomes. I additionally added a secondary go on the diff consequence to examine for extra
aesthetically pleasing outcomes with the identical edge value.
(I encountered a number of extra instances that structural diffs battle with,
see the Tricky Cases page in the
manual for extra
particulars.)
Future Work
Difftastic is improbable when it really works, and I take advantage of it every day. It’s nonetheless
not good although.
Difftastic has some difficult failure modes. Altering massive string
literals is a problem (syntactically they’re single atoms, however customers
typically desire a word-level diff).
The minimal diff isn’t all the time useful both. Generally difftastic goes too far.
Closing Ideas
I had no concept what I used to be moving into once I began engaged on this.
I’d been questioning why one of these device is so uncommon. Now I do know: it’s
extraordinarily difficult to construct. Regardless of its limitations, I’m stunned
at how usually it really works fantastically.
Difftastic is OSS below a MIT license, so I hope it permits extra diff
instruments that may perceive construction. In the event you’re feeling courageous, you may
even try it
yourself!