I just subscribed to this list, so sorry I can't respond to the threads already started here. I'm the guy who was mailing Linus about an algorithm to do similarity detection in linear time, mentioned here:

Subject: [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject: [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2

Jeff, in this message

http://marc.info/?l=git&m=119303566130201&w=2

I think you basically hit on the first half of what I was getting at. The step 1 you describe is the first step of the algorithm -- make an inverted index of lines (and you can use a hash code of the line to stand in for the line).

To avoid the m*n memory usage of step 2, I use a hash table that maps 2-tuples to counts to represent the sparse similarity matrix, instead of representing the matrix directly. The 2-tuple is a pair of filenames, corresponding to the row/column of the matrix, and the counts are the entries in the matrix.

You can also prune entries which have a long "postings list" (using the term in the IR sense here: http://www.xapian.org/docs/intro_ir.html). This has the nice side effect of getting rid of the quadratic behavior, *and* making the algorithm more accurate, because it stops counting common lines like "#endif" as contributing to similarity. This pruning of common lines from the index is what makes step 3 linear. In fact, I prune the index to only include lines which occur just *once*. Nearly every line of code in real data sets is unique, so this works well in practice.

I already sent this demo to Linus, but I think it's worth sending to the list as well. I am just going to copy parts of my earlier e-mails below, and attach the same demos (hopefully it is kosher to send attachments to this list).

Before I do that, Jeff, can you still reproduce this problem?

http://marc.info/?l=git&m=118975452330644&w=2

  "Of course, 2 minutes for a git-status is still way too slow, so there
  we might still need a limiter. It also looks like 57% of my time is
  spent in spanhash_find, and another 29% in diffcore_count_changes."

I know there have been a couple of fixes since you posted that, but if it was the O(m*n) behavior that was causing your problem, it should still be reproducible. Linus suggested that this is a good data set to try things out with. Is there a repository I can clone, and git commands to run, to reproduce this?

OK, so attached is a little demo of the algorithm. It is (very little) Python code, but with comments, so non-Python people can hopefully follow it. Because it's Python, the timings are not very meaningful, but it proves that the algorithm doesn't blow up.

I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases. It is *only* considering file content: you could rename every file in both source trees to completely random strings and it would still match files up. There is nothing about filenames, and no shortcut for identical files; you can consider the whole 2.4 side "all deletes" and the whole 2.6 side "all adds". The full matrix would have around 286 million cells, but here I only represent the non-zero entries, of which there are just 15,406.
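Before the demo output, here is a tiny sketch of the two data structures involved, since they are easier to see in code than in prose. This is *not* an excerpt from the attached script -- it's a few lines I'm writing inline for this mail, and the names (index_lines, sparse_common_lines) are made up -- but it is the same idea:

from collections import Counter, defaultdict

def index_lines(files, threshold=1):
    # 'files' maps a filename to its *set* of lines (unordered, unique).
    # Build the inverted index { line => postings list of filenames },
    # then drop every line whose postings list is longer than 'threshold',
    # so common lines like "#endif" never contribute to similarity.
    index = defaultdict(list)
    for name, lines in files.items():
        for line in lines:
            index[line].append(name)
    return dict((line, names) for line, names in index.items()
                if len(names) <= threshold)

def sparse_common_lines(left_index, right_index):
    # The sparse similarity matrix: { (left file, right file) => count }.
    # Only pairs that actually share a surviving line ever get an entry,
    # which is how 286 million potential cells collapse to a few thousand.
    matrix = Counter()
    for line, left_names in left_index.items():
        right_names = right_index.get(line)
        if not right_names:
            continue
        for l in left_names:
            for r in right_names:
                matrix[(l, r)] += 1
    return matrix

With threshold=1, every surviving postings list has length 1, so the inner loops do at most one increment per shared line -- that is where the linear behavior comes from. Anyway, here is what a run of the actual script on the two kernel trees looks like: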
$ wc -l similarity_demo.py
233 similarity_demo.py

$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End

------ Report ------
Indexing the first set of 12697 files (threshold=1)    29.540s
Indexing the second set of 22530 files (threshold=1)  111.041s
Computing union of lines in the indices                 7.450s
Making sparse common line matrix                       13.468s
Calculating similarity for 15406 pairs of files         0.055s
Sorting 15406 similar pairs                             0.030s
Writing similarity report                               0.249s
Total time                                            161.834s

The script outputs a text file with 15,406 similar pairs, in order of similarity (1.0's are at the top):

andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt
andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
( 51) linux-2.4.35.3/include/asm-m68k/apollodma.h ( 51) linux-2.6.23.1/include/asm-m68k/apollodma.h 51 1.000
( 94) linux-2.4.35.3/fs/jfs/jfs_btree.h ( 94) linux-2.6.23.1/fs/jfs/jfs_btree.h 94 1.000
( 21) linux-2.4.35.3/Documentation/fb/00-INDEX ( 21) linux-2.6.23.1/Documentation/fb/00-INDEX 21 1.000
...

And here is my explanation from an earlier mail, with some slight edits.

So the algorithm is:

1) Make an inverted index of the left set and of the right set. That is, { line => ["postings list", i.e. the list of files the line appears in] }. To get rid of common lines like "} else {" or "#endif", there is an arbitrary "line threshold" on the length of a postings list.

2) Combine the 2 indices into a (sparse) rectangular matrix. For each line, iterate over all (left file, right file) pairs from its two postings lists and increment the cell in the matrix for that pair by 1. The index is extremely shallow, since nearly all lines of code are unique: the common case is a postings list of length 1, and the line threshold caps its length. In the code the matrix is represented by a hash of filename pairs to integer counts, so the count is the number of lines the 2 files have in common.

3) Compute the similarity metric, which I've defined here as max(c/|left file|, c/|right file|), where c is the entry in the matrix for the file pair. Note that the files are treated as *sets* of lines (unordered, unique). The similarity score is a number between 0.0 and 1.0. Other similarity metrics are certainly possible. (There is a tiny sketch of this metric in the P.S. at the end of this mail.)

A few things to notice about this algorithm:

* It takes advantage of the fact that code edits are typically line-oriented, and nearly every line of code is unique. (The same technique can be used for arbitrary documents like English text, but it's a bit harder, since you basically have to find a way to make a "deep" index of words shallow to speed things up. For code, the index of lines is already shallow.)

* Some people might be concerned that it treats files as unordered sets of lines. The first thought might be to use this only as a preprocessing step to cull the list of candidates, and then do a real delta. But in my experience, I haven't encountered a case where there is all that much to gain by doing that.
* The line_threshold might appear to be a hack, but it actually *improves* accuracy. If you think about it, lines like "#endif" should not contribute to the similarity between 2 files. It also makes it impossible to construct pathological blow-up cases: if you have 1000 files on the left and 1000 files on the right that are all identical to each other, then every line gets pruned, and the entire similarity matrix is 0, which is arguably what you want. There is a -l flag to the script to experiment with this threshold.

* This can be run on all files, not just adds/deletes. If I have a change consisting only of edits, it could be the case that I moved 500 lines from one 1000-line file to another 1000-line file, and changed 75 lines within the 500. It would be meaningful to see a diff in this case, so that I can see those 75 edits (and a great feature!).

* The similarity is defined that way so that if one file is completely contained in another, the similarity is 1.0. So if I move a 1000-line file and add 9,000 lines to it, the similarity for the file pair will still be 1.0. I believe this is a feature, like the point above.

* I don't have a C implementation, but I believe the constant factors should be very low. You could use the same CRC you were talking about to reduce the memory spent storing the lines. As you mention, this algorithm seems amenable to trading memory for speed. Since it is raw string manipulation, C should be at least 10x faster than Python, and I wouldn't be surprised if an optimized implementation is 50 or 100x faster.

...

If anything about the explanation is unclear, let me know and I will try to clarify it. Playing around with the demo should illuminate what it does. You can run it on data sets of your own: all you need is 2 source trees and the "find" command to produce input to the script (see setup_demo.sh).

As mentioned, I will try to do some tests on this, perhaps with Jeff's hard data set, and show that the results are good and that the algorithm is faster because the quadratic behavior is gone (if Linus doesn't beat me to it!).

thanks,
Andy
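P.S. Since step 3 is also easier to see in code than in prose, here is the similarity metric as a few lines of Python. Again, this is just an inline illustration, not an excerpt from the attached demo: 'matrix' is the hash of filename pairs to counts from step 2, and left_sizes/right_sizes are assumed to hold the number of unique lines in each file.

def similarity(matrix, left_sizes, right_sizes):
    # matrix:      { (left file, right file) => number of shared lines c }
    # left_sizes:  { left file  => number of unique lines in it }
    # right_sizes: { right file => number of unique lines in it }
    scores = {}
    for (left, right), c in matrix.items():
        scores[(left, right)] = max(c / float(left_sizes[left]),
                                    c / float(right_sizes[right]))
    return scores

# The containment property from the bullets above: move a 1000-line file
# and add 9,000 new lines at the destination.  Every original line still
# matches, so c = 1000, |left| = 1000, |right| = 10000, and the score is
# max(1000/1000.0, 1000/10000.0) == 1.0.

Sorting these scores in decreasing order gives exactly the kind of report shown earlier (out-in-all-24-vs-in-all-26.1.txt).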