I just subscribed to this list, so sorry I can't respond to the threads already started here. I'm the guy who was mailing Linus about an algorithm to do similarity detection in linear time, mentioned here:

Subject: [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject: [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2

Jeff, in this message

http://marc.info/?l=git&m=119303566130201&w=2

I think you basically hit on the first half of what I was getting at. The step 1 you describe is the first step of the algorithm -- make an inverted index of lines (and you can use a hash code of the line to stand in for the line).

To avoid the m*n memory usage of step 2, I use a hash table that maps 2-tuples to counts to represent the sparse similarity matrix, instead of representing the matrix directly. The 2-tuple is a pair of filenames, corresponding to the row/column of the matrix, and the counts are the entries in the matrix.

You can also prune entries which have a long "postings list" (using the term in the IR sense here: http://www.xapian.org/docs/intro_ir.html). This has the nice side effect of getting rid of the quadratic behavior, *and* making the algorithm more accurate, because it stops counting common lines like "#endif" as contributing to similarity. This pruning of common lines from the index is what makes step 3 linear. In fact, I prune the index to only include lines which occur just *once*. Nearly every line of code in real data sets is unique, so this works well in practice.

I already sent this demo to Linus, but I think it's worth sending to the list as well. I am just going to copy parts of my earlier e-mails below, and attach the same demos (hopefully it is kosher to send attachments to this list).

Before I do that, Jeff, can you still reproduce this problem?

http://marc.info/?l=git&m=118975452330644&w=2

  "Of course, 2 minutes for a git-status is still way too slow, so there
  we might still need a limiter. It also looks like 57% of my time is
  spent in spanhash_find, and another 29% in diffcore_count_changes."

I know there have been a couple of fixes since you posted that, but if it was the O(m*n) behavior that was causing your problem, it should still be reproducible. Linus suggested that this is a good data set to try things out with. Is there a repository I can clone, and git commands to run, to reproduce this?

OK, so attached is a little demo of the algorithm. It is (very little) Python code, but with comments, so non-Python people can hopefully follow it. Because it's Python, the timings are not very meaningful, but it proves that the algorithm doesn't blow up.

I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases. It is *only* considering file content: you could rename every file in both source trees to completely random strings and it would still match files up. There is nothing about filenames, and no shortcut for identical files; you can consider the whole 2.4 side "all deletes" and the whole 2.6 side "all adds". The full matrix would have around 286 million cells, but here I only represent the non-zero entries, of which there are just 15,406.
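Before the demo output, here is a tiny sketch of the two data structures involved, since they are easier to see in code than in prose. This is *not* an excerpt from the attached script -- it's a few lines I'm writing inline for this mail, and the names (index_lines, sparse_common_lines) are made up -- but it is the same idea:

from collections import Counter, defaultdict

def index_lines(files, threshold=1):
    # 'files' maps a filename to its *set* of lines (unordered, unique).
    # Build the inverted index { line => postings list of filenames },
    # then drop every line whose postings list is longer than 'threshold',
    # so common lines like "#endif" never contribute to similarity.
    index = defaultdict(list)
    for name, lines in files.items():
        for line in lines:
            index[line].append(name)
    return dict((line, names) for line, names in index.items()
                if len(names) <= threshold)

def sparse_common_lines(left_index, right_index):
    # The sparse similarity matrix: { (left file, right file) => count }.
    # Only pairs that actually share a surviving line ever get an entry,
    # which is how 286 million potential cells collapse to a few thousand.
    matrix = Counter()
    for line, left_names in left_index.items():
        right_names = right_index.get(line)
        if not right_names:
            continue
        for l in left_names:
            for r in right_names:
                matrix[(l, r)] += 1
    return matrix

With threshold=1, every surviving postings list has length 1, so the inner loops do at most one increment per shared line -- that is where the linear behavior comes from. Anyway, here is what a run of the actual script on the two kernel trees looks like: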
$ wc -l similarity_demo.py
233 similarity_demo.py

$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End

------ Report ------
Indexing the first set of 12697 files (threshold=1)    29.540s
Indexing the second set of 22530 files (threshold=1)  111.041s
Computing union of lines in the indices                 7.450s
Making sparse common line matrix                       13.468s
Calculating similarity for 15406 pairs of files         0.055s
Sorting 15406 similar pairs                             0.030s
Writing similarity report                               0.249s
Total time                                            161.834s

The script outputs a text file with 15,406 similar pairs, in order of similarity (1.0's are at the top):

andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt
andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
( 51) linux-2.4.35.3/include/asm-m68k/apollodma.h ( 51) linux-2.6.23.1/include/asm-m68k/apollodma.h 51 1.000
( 94) linux-2.4.35.3/fs/jfs/jfs_btree.h ( 94) linux-2.6.23.1/fs/jfs/jfs_btree.h 94 1.000
( 21) linux-2.4.35.3/Documentation/fb/00-INDEX ( 21) linux-2.6.23.1/Documentation/fb/00-INDEX 21 1.000
...

And here is my explanation from an earlier mail, with some slight edits.

So the algorithm is:

1) Make an inverted index of the left set and of the right set. That is, { line => ["postings list", i.e. the list of files the line appears in] }. To get rid of common lines like "} else {" or "#endif", there is an arbitrary "line threshold" on the length of a postings list.

2) Combine the 2 indices into a (sparse) rectangular matrix. For each line, iterate over all (left file, right file) pairs from its two postings lists and increment the cell in the matrix for that pair by 1. The index is extremely shallow, since nearly all lines of code are unique: the common case is a postings list of length 1, and the line threshold caps its length. In the code the matrix is represented by a hash of filename pairs to integer counts, so the count is the number of lines the 2 files have in common.

3) Compute the similarity metric, which I've defined here as max(c/|left file|, c/|right file|), where c is the entry in the matrix for the file pair. Note that the files are treated as *sets* of lines (unordered, unique). The similarity score is a number between 0.0 and 1.0. Other similarity metrics are certainly possible. (There is a tiny sketch of this metric in the P.S. at the end of this mail.)

A few things to notice about this algorithm:

* It takes advantage of the fact that code edits are typically line-oriented, and nearly every line of code is unique. (The same technique can be used for arbitrary documents like English text, but it's a bit harder, since you basically have to find a way to make a "deep" index of words shallow to speed things up. For code, the index of lines is already shallow.)

* Some people might be concerned that it treats files as unordered sets of lines. The first thought might be to use this only as a preprocessing step to cull the list of candidates, and then do a real delta. But in my experience, I haven't encountered a case where there is all that much to gain by doing that.
* The line_threshold might appear to be a hack, but it actually *improves* accuracy. If you think about it, lines like "#endif" should not contribute to the similarity between 2 files. It also makes it impossible to construct pathological blow-up cases: if you have 1000 files on the left and 1000 files on the right that are all identical to each other, then every line gets pruned, and the entire similarity matrix is 0, which is arguably what you want. There is a -l flag to the script to experiment with this threshold.

* This can be run on all files, not just adds/deletes. If I have a change consisting only of edits, it could be the case that I moved 500 lines from one 1000-line file to another 1000-line file, and changed 75 lines within the 500. It would be meaningful to see a diff in this case, so that I can see those 75 edits (and a great feature!).

* The similarity is defined that way so that if one file is completely contained in another, the similarity is 1.0. So if I move a 1000-line file and add 9,000 lines to it, the similarity for the file pair will still be 1.0. I believe this is a feature, like the point above.

* I don't have a C implementation, but I believe the constant factors should be very low. You could use the same CRC you were talking about to reduce the memory spent storing the lines. As you mention, this algorithm seems amenable to trading memory for speed. Since it is raw string manipulation, C should be at least 10x faster than Python, and I wouldn't be surprised if an optimized implementation is 50 or 100x faster.

...

If anything about the explanation is unclear, let me know and I will try to clarify it. Playing around with the demo should illuminate what it does. You can run it on data sets of your own: all you need is 2 source trees and the "find" command to produce input to the script (see setup_demo.sh).

As mentioned, I will try to do some tests on this, perhaps with Jeff's hard data set, and show that the results are good and that the algorithm is faster because the quadratic behavior is gone (if Linus doesn't beat me to it!).

thanks,
Andy
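P.S. Since step 3 is also easier to see in code than in prose, here is the similarity metric as a few lines of Python. Again, this is just an inline illustration, not an excerpt from the attached demo: 'matrix' is the hash of filename pairs to counts from step 2, and left_sizes/right_sizes are assumed to hold the number of unique lines in each file.

def similarity(matrix, left_sizes, right_sizes):
    # matrix:      { (left file, right file) => number of shared lines c }
    # left_sizes:  { left file  => number of unique lines in it }
    # right_sizes: { right file => number of unique lines in it }
    scores = {}
    for (left, right), c in matrix.items():
        scores[(left, right)] = max(c / float(left_sizes[left]),
                                    c / float(right_sizes[right]))
    return scores

# The containment property from the bullets above: move a 1000-line file
# and add 9,000 new lines at the destination.  Every original line still
# matches, so c = 1000, |left| = 1000, |right| = 10000, and the score is
# max(1000/1000.0, 1000/10000.0) == 1.0.

Sorting these scores in decreasing order gives exactly the kind of report shown earlier (out-in-all-24-vs-in-all-26.1.txt).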