From: "Andy C" <andychup@gmail.com>
To: git@vger.kernel.org
Subject: Linear time/space rename logic for *inexact* case
Date: Mon, 22 Oct 2007 02:40:08 -0700
Message-ID: <596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com>


I just subscribed to this list, so sorry I can't respond to the
threads already started here.  I'm the guy who was mailing Linus
about this algorithm to do similarity detection in linear time,
mentioned here:

Subject:    [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject:    [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2


Jeff, in this message http://marc.info/?l=git&m=119303566130201&w=2 I
think you basically hit on the first half of what I was getting at.

The step 1 you describe is the first step of my algorithm -- make an
inverted index of lines (and you can use a hash code of the line to
stand in for the line itself).
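
For concreteness, here is a minimal sketch of that indexing step (the
function name and the use of Python's builtin hash() are my own; the
attached script stores the raw line text instead):

def MakeIndex(files):
  """Sketch: inverted index keyed by a hash of each line."""
  index = {}  # { line hash -> list of files the line appears in }
  for filename in files:
    seen = set()  # count each distinct line only once per file
    for line in open(filename):
      line = line.strip()
      if not line:  # skip blank lines
        continue
      h = hash(line)  # builtin hash standing in for a CRC/MD5 of the line
      if h not in seen:
        seen.add(h)
        index.setdefault(h, []).append(filename)
  return index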

To avoid the O(m*n) memory usage of step 2, I represent the sparse
similarity matrix with a hash table that maps 2-tuples to counts,
instead of representing the matrix directly.  Each 2-tuple is a pair
of filenames, corresponding to a row/column of the matrix, and the
counts are the nonzero entries of the matrix.
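
In Python terms, that is just a dict (a sketch of what the attached
script does in FindSimilarPairs; index1 and index2 are the inverted
indices over the two file sets):

matrix = {}  # { (left filename, right filename) -> common line count }
for line, files1 in index1.iteritems():
  files2 = index2.get(line)
  if files2:
    for f1 in files1:
      for f2 in files2:
        matrix[(f1, f2)] = matrix.get((f1, f2), 0) + 1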

You can also prune entries which have a long "postings list" (using
the term in the IR sense here:
http://www.xapian.org/docs/intro_ir.html).  This has the nice side
effect of getting rid of quadratic behavior -- each surviving line
appears in at most "threshold" files per side, so it contributes at
most threshold^2 pairs -- *and* it makes the algorithm more accurate,
because it stops considering common lines like "#endif" as
contributing to similarity.

This pruning of common lines from the index makes step 3 linear.  In
fact, I prune the index to include only lines which occur just *once*.
Nearly every line of code in real data sets is unique, so this works
well in practice.
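
As a sketch (again my own code; the attached script handles the
threshold while it builds the index), that pruning is a single pass:

# Keep only lines whose postings list has exactly one file; every
# surviving line then contributes to at most one (f1, f2) pair.
pruned = dict((line, files) for line, files in index.iteritems()
              if len(files) == 1)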


I already sent this demo to Linus, but I think it's worth sending to
the list as well.  I am just going to copy parts of my earlier e-mails
below, and attach the same demos (hopefully it is kosher to send
attachments to this list).

Before I do that, Jeff, can you still reproduce this problem:

http://marc.info/?l=git&m=118975452330644&w=2
"Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes."

I know there have been a couple of fixes since you posted that, but if it
was the O(m*n) behavior that was causing your problem, it should still
be reproducible.  Linus suggested that this is a good data set to try
things out with.  Is there a repository I can clone, and git commands
I can run, to reproduce this?


OK, so attached is a little demo of the algorithm, which is (very
little) Python code, but with comments so non-Python people can
hopefully follow it.  Because it's Python, the timings are not very
meaningful, but it does show that the algorithm doesn't blow up.

I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases.  It is
*only* considering file content.  You can rename every file in both
source trees to completely random strings and it will still match
files up.  There is nothing about filenames, or identical files, and
you can consider the whole 2.4 side "all deletes" and the whole 2.6
side "all adds".  The size of the matrix would be around 286 million
cells, but here I only represent the non-zero entries in the matrix,
which is only 15,406 cells.

$ wc -l similarity_demo.py
233 similarity_demo.py

$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End
------
Report
------
Indexing the first set of 12697 files (threshold=1)          29.540s
Indexing the second set of 22530 files (threshold=1)         111.041s
Computing union of lines in the indices                      7.450s
Making sparse common line matrix                             13.468s
Calculating similarity for 15406 pairs of files              0.055s
Sorting 15406 similar pairs                                  0.030s
Writing similarity report                                    0.249s
Total time                                                   161.834s

The script outputs a text file with 15,406 similar pairs, in order of
similarity (1.0's are at the top):

andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt

andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
(  51) linux-2.4.35.3/include/asm-m68k/apollodma.h
(  51) linux-2.6.23.1/include/asm-m68k/apollodma.h   51 1.000

(  94) linux-2.4.35.3/fs/jfs/jfs_btree.h
(  94) linux-2.6.23.1/fs/jfs/jfs_btree.h   94 1.000

(  21) linux-2.4.35.3/Documentation/fb/00-INDEX
(  21) linux-2.6.23.1/Documentation/fb/00-INDEX   21 1.000
...


And here is my explanation from an earlier mail, with some slight edits:

So the algorithm is:

1) Make an inverted index of the left set and right set.  That is {
line => ["postings list", i.e. the list of files the line appears in]
}.  To get rid of common lines like "} else {" or "#endif", there is
an arbitrary "line threshold".

2) Combine the 2 indices into a (sparse) rectangular matrix.  For each
line, iterate over all pairs of files drawn from its two postings
lists, and increment the cell in the matrix for each pair by 1.  The
index is extremely
shallow, since nearly all lines of code are unique.  The common case
is that the postings list is of length 1.  And the line threshold caps
the length of the postings list.

In the code, the matrix is represented by a hash from filename pairs
to integer counts, so each count is the number of lines that the 2
files have in common.

3) Compute the similarity metric, which I've defined here as
max(c/|left file|, c/|right file|), where c is the entry in the matrix
for the file pair.  Note that the files are treated as *sets* of lines
(unordered, unique).  The similarity score is a number between 0.0 and
1.0.  Other similarity metrics are certainly possible.
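
A quick worked example with made-up numbers: say the left file has 100
unique lines, the right file has 80, and they have c = 30 lines in
common:

c = 30.0
similarity = max(c / 100, c / 80)  # = max(0.300, 0.375) = 0.375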

A few things to notice about this algorithm:

* It takes advantage of the fact that code edits are typically
line-oriented, and nearly every line of code is unique.  (This same
technique can be used for arbitrary documents like English text, but
it's a bit harder since you basically have to find a way to make a
"deep" index of words shallow, to speed it up.  For code, the index of
lines is already shallow.)

* Some people might be concerned that it treats files as unordered
sets of lines.  The first thought might be to use this as a
preprocessing step to cull the list of candidates, and then do a real
delta.  But in my experience, I haven't encountered a case where
there's all that much to gain by doing that.

* The line_threshold might appear to be a hack, but it actually
*improves* accuracy.  If you think about it, lines like "#endif"
should not contribute to the similarity between 2 files.  It also
makes it impossible to construct pathological blow-up cases.  If you
have 1000 files on the left and 1000 files on the right that are all
identical to each other, then every line will get pruned, and thus the
entire similarity matrix will be 0, which is arguably what you want.
There is a -l flag to the script to experiment with this threshold.

* This can be run on all files, not just adds/deletes.  If I have a
change consisting only of edits, it could be the case that I moved 500
lines from one 1000-line file to another 1000-line file, and changed
75 lines within the 500.  It would be meaningful to see a diff in this
case, so that I can see those 75 edits (a great feature!)

* The similarity is defined that way so that if one file is completely
contained in another, the similarity is 1.0.  So if I move a 1000-line
file and add 9,000 lines to it, the similarity for the file pair will
still be 1.0.  I believe this is a feature, like the point above.
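
Working through the numbers in that example: the pair has 1000 common
lines, and the two files have 1000 and 10,000 unique lines:

max(1000.0 / 1000, 1000.0 / 10000)  # = max(1.0, 0.1) = 1.0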

* I don't have a C implementation but I believe the constant factors
should be very low.  You could use the same CRC you were talking about
to reduce memory in storing the lines.  It seems like this algorithm
is amenable to trading memory for speed, as you mention.  Since it is
raw string manipulation, C should be at least 10x faster than Python,
and I wouldn't be surprised if an optimized implementation is 50 or
100x faster.

...

If anything about the explanation is unclear, let me know and I will
try to clarify it.  Playing around with the demo should illuminate
what it does.  You can run it on data sets of your own.  All you need
is 2 source trees and the "find" command to produce input to the
script (see setup_demo.sh).

As mentioned, I will try to do some tests on this, perhaps with Jeff's
hard data set, and show that the results are good and that the
algorithm is faster because the quadratic behavior is gone (if Linus
doesn't beat me to it!).

thanks,
Andy

[-- Attachment #2: setup_demo.sh --]
[-- Type: application/x-sh, Size: 983 bytes --]

[-- Attachment #3: similarity_demo.py --]
[-- Type: text/x-python; name=similarity_demo.py, Size: 7823 bytes --]

#!/usr/bin/python2.4

"""
Demo of linear-time code similarity calculation.

Author: Andy Chu (andychu at google dot com)
"""

import optparse
import os
import sys
import time


class MultiTimer:
  """To give a general idea of how long each step takes."""

  def __init__(self):
    self.timestamps = []
    self.descriptions = []

  def checkpoint(self, description=''):
    print description
    self.timestamps.append(time.time())
    self.descriptions.append(description)

  def report(self):
    print "------"
    print "Report"
    print "------"

    # Compute time differences
    format = "%-60s %.3fs"
    for i in xrange(len(self.timestamps)-1):
      interval = self.timestamps[i+1] - self.timestamps[i]
      print format % (self.descriptions[i], interval)
    print format % ('Total time', self.timestamps[-1] - self.timestamps[0])


def IndexFiles(files, line_threshold):
  """
  Given a list of filenames, produce an inverted index.
  
  Any line which occurs in more than "line_threshold" files is thrown out.

  Also, for similarity scoring, keep track of the number of important lines in
  each file.  Files are treated as (unordered) *sets* of (unique) lines.
  Unimportant lines are within-file duplicates and lines which exceed
  the line_threshold.
  """
  # { line -> list of filenames the line appears in }
  index = {}  
  # { filename -> number of unique lines in the file }
  sizes = {}

  for filename in files:
    try:
      f = file(filename)
    except IOError, e:
      print e, filename
      sys.exit(1)

    for line in f:
      line = line.strip()  # Could remove *all* whitespace here
      if line:  # Skip blank lines
        if line in index:
          filelist = index[line]
          # Stop keeping track once we reach the threshold
          if len(filelist) == line_threshold + 1:
            continue
          # Only count the first occurrence of the line in the file
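          # (works because files are scanned one at a time, so the last
          # posting is this file exactly when the line was already seen here)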
          if filelist[-1] != filename:
            filelist.append(filename)
            sizes[filename] = sizes.get(filename, 0) + 1
        else:
          index[line] = [filename]
          sizes[filename] = sizes.get(filename, 0) + 1
    f.close()

  # Now remove any lines that hit the threshold from the index, and adjust the
  # file sizes.
  to_remove = []
  for line, filelist in index.iteritems():
    if len(filelist) == line_threshold + 1:
      to_remove.append(line)
      for f in filelist:
        sizes[f] -= 1
  for line in to_remove:
    del index[line]

  return index, sizes


def FindSimilarPairs(set1, set2, reportfile, line_threshold):
  """Calculates pairwise similarity between two sets of files, using no other
  information but the contents of the files (i.e. no filename information).

  Args:
    set1, set2: The 2 lists of file system paths to compare.
    reportfile: name of the output text file
    line_threshold:
      We prune the index of entries that occur more than this number of times.
      For example, the line "} else {" may occur very frequently in the code,
      but we just throw it out, since the fact that 2 files have this line in
      common is not very meaningful.
      
      Making line_threshold a constant also makes the algorithm linear.  In
      practice, with real data sets, the results should be quite stable with
      respect to this parameter.  That is, they should not vary very much at
      all if line_threshold = 1 or 100 or 1000.
  """
  mt = MultiTimer()

  print '%s * %s = %s total possible pairs of files' % (
      len(set1), len(set2), len(set1) * len(set2))

  #
  # 1. Generate inverted indices and size information for each set.
  #
  mt.checkpoint("Indexing the first set of %d files (threshold=%d)" % (
    len(set1), line_threshold))
  index1, sizes1 = IndexFiles(set1, line_threshold)

  mt.checkpoint("Indexing the second set of %d files (threshold=%d)" % (
    len(set2), line_threshold))
  index2, sizes2 = IndexFiles(set2, line_threshold)

  print "Sum of file sizes in first set: %s lines" % sum(sizes1.values())
  print "Sum of file sizes in second set: %s lines" % sum(sizes2.values())
  print "Size of index over first set:", len(index1)
  print "Size of index over second set:", len(index2)

  #
  # 2. Combine the 2 indices to form a sparse matrix that counts common lines.
  #
  # Pairs which do not appear have an implicit count of 0.
  #
  # This is a sparse matrix represented as a dictionary of 2-tuples:
  # (filename in set1, filename in set2) -> number of (unique) lines that
  # they have in common

  mt.checkpoint("Computing union of lines in the indices")
  all_lines = set(index1) | set(index2)
  print 'Total unique lines in both indices: %s' % len(all_lines)

  mt.checkpoint("Making sparse common line matrix")

  matrix = {}
  for line in all_lines:
    files1 = index1.get(line)
    files2 = index2.get(line)
    if files1 and files2:
      # For every pair of files that contain this line, increment the
      # corresponding cell in the common line matrix.  
      #
      # Since we pruned the index, this whole double loop is constant time.  If
      # the line_threshold is 1 (default), then the whole thing is just a
      # single iteration.  
      for f1 in files1:
        for f2 in files2:
          matrix[(f1, f2)] = matrix.get((f1, f2), 0) + 1

  mt.checkpoint("Calculating similarity for %s pairs of files" % len(matrix))

  #
  # 3. Calculate the similarity of each cell in the matrix.
  #
  # The similarity metric is number of common lines divided by the file size
  # (and take the greater of the 2).  Note that this means that if file1 is
  # entirely *contained* in file2, then the similarity of the pair will be 1.0.
  # This is desirable since you can detect if 99,000 lines were added to a 1,000
  # line file, etc.  Other similarity metrics are possible.
  #
  similar_pairs = []
  for (file1, file2), num_common in matrix.iteritems():
    c = float(num_common)
    # TODO: Write notes on similarity metric
    similarity = max(c / sizes1[file1], c / sizes2[file2])
    similar_pairs.append((file1, file2, num_common, similarity))

  mt.checkpoint("Sorting %d similar pairs" % len(similar_pairs))
  similar_pairs.sort(key=lambda x: x[3], reverse=True)

  mt.checkpoint("Writing similarity report")
  f = open(reportfile, 'w')
  for file1, file2, num_common, similarity in similar_pairs:
    # Put a * after entries where the relative paths are *not* the same, just
    # for convenience when looking at the report.
    if file1.split('/')[1:] != file2.split('/')[1:]:
      mark = '*'
    else:
      mark = ''

    print >> f, '(%4d) %-40s (%4d) %-40s %4d %.3f %s' % (
        sizes1[file1], file1, sizes2[file2], file2, num_common, similarity,
        mark)
  f.close()
  print 'Wrote %s' % reportfile

  mt.checkpoint("End")
  mt.report()


if __name__ == '__main__':
  parser = optparse.OptionParser()

  parser.add_option(
      '-l', '--line-threshold', dest='line_threshold', type='int', default=1,
      help='ignore lines that occur more than this number of times in either '
      'set')
  parser.add_option(
      '-o', '--out', dest='reportfile', default=None,
      help='Write output to this file instead of the default.')

  (options, argv) = parser.parse_args()

  try:
    left, right = argv[:2]
  except ValueError:
    print 'Pass 2 files containing paths as arguments (left tree, right tree).'
    sys.exit(1)

  if not options.reportfile:
    options.reportfile = 'out-%s-vs-%s.%s.txt' % (
        os.path.splitext(os.path.basename(left))[0],
        os.path.splitext(os.path.basename(right))[0],
        options.line_threshold)

  set1 = [f.strip() for f in open(left).readlines()]
  set2 = [f.strip() for f in open(right).readlines()]
  FindSimilarPairs(set1, set2, options.reportfile, options.line_threshold)
