git.vger.kernel.org archive mirror
* Linear time/space rename logic for *inexact* case
@ 2007-10-22  9:40 Andy C
  2007-10-22 10:09 ` Andy C
  2007-10-22 10:34 ` Jeff King
  0 siblings, 2 replies; 5+ messages in thread
From: Andy C @ 2007-10-22  9:40 UTC (permalink / raw)
  To: git


I just subscribed to this list, so sorry I can't respond to the
threads already started here.  I'm the guy that was mailing Linus
about this algorithm to do similarity detection in linear time,
mentioned here:

Subject:    [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject:    [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2


Jeff, in this message http://marc.info/?l=git&m=119303566130201&w=2 I
think you basically hit on the first half of what I was getting at.

The step 1 you describe is the first step of the algorithm -- make an
inverted index of lines (and you can use a hash code of the line to
stand in for the line).

To avoid the m*n memory usage of step 2, I use a hash table which maps
2-tuples to counts to represent the sparse similarity matrix, instead
of representing it directly.  The 2-tuple is a pair of filenames,
corresponding to the row/column of the matrix, and the counts/values
are the entries in the matrix.
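
Concretely, in Python the sparse matrix is just a dictionary keyed by
filename pairs -- a toy illustration (the filenames here are made up):

  # (left filename, right filename) -> number of lines in common.
  # Pairs that never get incremented are implicitly 0 and cost nothing.
  matrix = {}
  pair = ('linux-2.4/fs/foo.c', 'linux-2.6/fs/foo.c')
  matrix[pair] = matrix.get(pair, 0) + 1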

You can also prune entries which have a long "postings list" (using
the term in the IR sense here:
http://www.xapian.org/docs/intro_ir.html).  This has the nice side
effect of getting rid of quadratic behavior, *and* making the
algorithm more accurate because it stops considering common lines like
"#endif" as contributing to similarity.

This pruning of common lines from the index makes step 3 linear.  In
fact, I prune the index to only include lines which occur just *once*.
 Nearly every line of code in real data sets is unique, so this works
well in practice.


I already sent this demo to Linus, but I think it's worth sending to
the list as well.  I am just going to copy parts of my earlier e-mails
below, and attach the same demos (hopefully it is kosher to send
attachments to this list).

Before I do that, Jeff, can you still reproduce this problem:

http://marc.info/?l=git&m=118975452330644&w=2
"Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes."

I know there have been a couple fixes since you posted that, but if it
was the O(m*n) behavior that was causing your problem, it should still
be reproducible.  Linus suggested that this is a good data set to try
things out with.  Is there a repository I can clone, with git
commands to run to repro this?


OK, so attached is a little demo of the algorithm, which is (very
little) Python code, but with comments so non-Python people can
hopefully follow it.  Because it's Python, the timings are not very
meaningful, but it does show that the algorithm doesn't blow up.

I ran it on the entire Linux 2.4 vs. Linux 2.6 codebases.  It is
*only* considering file content.  You can rename every file in both
source trees to completely random strings and it will still match
files up.  There is nothing about filenames, or identical files, and
you can consider the whole 2.4 side "all deletes" and the whole 2.6
side "all adds".  The size of the matrix would be around 286 million
cells, but here I only represent the non-zero entries in the matrix,
which is only 15,406 cells.

$ wc -l similarity_demo.py
233 similarity_demo.py

$ ./similarity_demo.py in-all-*
12697 * 22530 = 286063410 total possible pairs of files
Indexing the first set of 12697 files (threshold=1)
Indexing the second set of 22530 files (threshold=1)
Sum of file sizes in first set: 2134249 lines
Sum of file sizes in second set: 3424338 lines
Size of index over first set: 2134249
Size of index over second set: 3424338
Computing union of lines in the indices
Total unique lines in both indices: 4384375
Making sparse common line matrix
Calculating similarity for 15406 pairs of files
Sorting 15406 similar pairs
Writing similarity report
Wrote out-in-all-24-vs-in-all-26.1.txt
End
------
Report
------
Indexing the first set of 12697 files (threshold=1)          29.540s
Indexing the second set of 22530 files (threshold=1)         111.041s
Computing union of lines in the indices                      7.450s
Making sparse common line matrix                             13.468s
Calculating similarity for 15406 pairs of files              0.055s
Sorting 15406 similar pairs                                  0.030s
Writing similarity report                                    0.249s
Total time                                                   161.834s

The script outputs a text file with 15,406 similar pairs, in order of
similarity (1.0's are at the top):

andychu demo$ wc -l out-in-all-24-vs-in-all-26.1.txt
15406 out-in-all-24-vs-in-all-26.1.txt

andychu demo$ head -n3 out-in-all-24-vs-in-all-26.1.txt
(  51) linux-2.4.35.3/include/asm-m68k/apollodma.h
(  51) linux-2.6.23.1/include/asm-m68k/apollodma.h   51 1.000

(  94) linux-2.4.35.3/fs/jfs/jfs_btree.h
(  94) linux-2.6.23.1/fs/jfs/jfs_btree.h   94 1.000

(  21) linux-2.4.35.3/Documentation/fb/00-INDEX
(  21) linux-2.6.23.1/Documentation/fb/00-INDEX   21 1.000
...


And here is my explanation from an earlier mail, with some slight edits:

So the algorithm is:

1) Make an inverted index of the left set and right set.  That is {
line => ["postings list", i.e. the list of files the line appears in]
}.  To get rid of common lines like "} else {" or "#endif", there is
an arbitrary "line threshold".

2) Combine the 2 indices into a (sparse) rectangular matrix.  For each
line, iterate over all pairs in the postings list, and increment the
cell in the matrix for that pair by 1.  The index is extremely
shallow, since nearly all lines of code are unique.  The common case
is that the postings list is of length 1.  And the line threshold caps
the length of the postings list.

In the code, the matrix is represented by a hash of filename pairs to
integer counts.  So the count is the number of lines that the 2 files
have in common.

3) Compute the similarity metric, which I've defined here as
max(c/|left file|, c/|right file|), where c is the entry in the matrix
for the file pair.  Note that the files are treated as *sets* of lines
(unordered, unique).  The similarity score is a number between 0.0 and
1.0.  Other similarity metrics are certainly possible.
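
(Quick worked example of that metric: if two files have c = 300 unique
lines in common, and they contain 400 and 1000 unique lines
respectively, the score is max(300/400, 300/1000) = 0.75.)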

A few things to notice about this algorithm:

* It takes advantage of the fact that code edits are typically
line-oriented, and nearly every line of code is unique.  (This same
technique can be used for arbitrary documents like English text, but
it's a bit harder since you basically have to find a way to make a
"deep" index of words shallow, to speed it up.  For code, the index of
lines is already shallow.)

* Some people might be concerned that it treats files as unordered
sets of lines.  The first thought might be to do this as a
preprocessing step to cull the list of candidates, and then do a real
delta.  But in my experience, I haven't encountered a case where
there's all that much to gain by doing that.

* The line_threshold might appear to be a hack, but it actually
*improves* accuracy.  If you think about it, lines like "#endif"
should not contribute to the similarity between 2 files.  It also
makes it impossible to construct pathological blow-up cases.  If you
have 1000 files on the left and 1000 files on the right that are all
identical to each other, then every line will get pruned, and thus the
entire similarity matrix will be 0, which is arguably what you want.
There is a -l flag to the script to experiment with this threshold.

* This can be run on all files, not just adds/deletes.  If I have a
change of only edits, it could be the case that I moved 500 lines from
one 1000-line file to another 1000-line file, and changed 75 lines
within the 500.  It would be meaningful to see a diff in this case, so
that I can see those 75 edits (and that would be a great feature!).

* The similarity is defined that way so that if one file is completely
contained in another, the similarity is 1.0.  So if I move a 1000 line
file and add 9,000 lines to it, the similarity for the file pair will
still be 1.0.  I believe this is a feature, like the point above.

* I don't have a C implementation but I believe the constant factors
should be very low.  You could use the same CRC you were talking about
to reduce memory in storing the lines.  It seems like this algorithm
is amenable to trading memory for speed, as you mention.  Since it is
raw string manipulation, C should be at least 10x faster than Python,
and I wouldn't be surprised if an optimized implementation is 50 or
100x faster.

...

If anything about the explanation is unclear, let me know and I will
try to clarify it.  Playing around with the demo should illuminate
what it does.  You can run it on data sets of your own.  All you need
is 2 source trees and the "find" command to produce input to the
script (see setup_demo.sh).

As mentioned, I will try to do some tests on this, perhaps with Jeff's
hard data set, and show that the results are good and that the
algorithm is faster because the quadratic behavior is gone (if Linus
doesn't beat me to it!).

thanks,
Andy

[-- Attachment #2: setup_demo.sh --]
[-- Type: application/x-sh, Size: 983 bytes --]

[-- Attachment #3: similarity_demo.py --]
[-- Type: text/x-python; name=similarity_demo.py, Size: 7823 bytes --]

#!/usr/bin/python2.4

"""
Demo of linear-time code similarity calculation.

Author: Andy Chu (andychu at google dot com)
"""

import optparse
import os
import md5
import re
import sys
import time
from pprint import pprint


class MultiTimer:
  """To give a general idea of how long each step takes."""

  def __init__(self, loud=False):
    self.timestamps = []
    self.descriptions = []

  def checkpoint(self, description=''):
    print description
    self.timestamps.append(time.time())
    self.descriptions.append(description)

  def report(self):
    print "------"
    print "Report"
    print "------"

    # Compute time differences
    format = "%-60s %.3fs"
    for i in xrange(len(self.timestamps)-1):
      interval = self.timestamps[i+1] - self.timestamps[i]
      print format % (self.descriptions[i], interval)
    print format % ('Total time', self.timestamps[-1] - self.timestamps[0])


def IndexFiles(files, line_threshold):
  """
  Given a list of filenames, produce an inverted index.
  
  Any lines which occur in more than "line_threshold" files are thrown out.

  Also, for similarity scoring, keep track of the number of important lines in
  each file.  Files are treated as (unordered) *sets* of (unique) lines.
  Unimportant lines are duplicates and lines which exceed line_threshold.
  """
  # { line -> list of filenames the line appears in }
  index = {}  
  # { filename -> number of unique lines in the file }
  sizes = {}

  for filename in files:
    try:
      f = file(filename)
    except IOError, e:
      print e, filename
      sys.exit(1)

    for line in f:
      line = line.strip()  # Could remove *all* whitespace here
      if line:  # Skip blank lines
        if line in index:
          filelist = index[line]
          # Stop keeping track once we reach the threshold
          if len(filelist) == line_threshold + 1:
            continue
          # Only count the first occurrence of the line in the file
          if filelist[-1] != filename:
            filelist.append(filename)
            sizes[filename] = sizes.get(filename, 0) + 1
        else:
          index[line] = [filename]
          sizes[filename] = sizes.get(filename, 0) + 1
    f.close()

  # Now remove any lines that hit the threshold from the index, and adjust the
  # file sizes.
  to_remove = []
  for line, filelist in index.iteritems():
    if len(filelist) == line_threshold + 1:
      to_remove.append(line)
      for f in filelist:
        sizes[f] -= 1
  for line in to_remove:
    del index[line]

  return index, sizes


def FindSimilarPairs(set1, set2, reportfile, line_threshold):
  """Calculates pairwise similarity between two sets of files, using no other
  information but the contents of the files (i.e. no filename information).

  Args:
    set1, set2: The 2 lists of file system paths to compare.
    reportfile: name of the output text file
    line_threshold:
      We prune the index of entries that occur more than this number of times.
      For example, the line "} else {" may occur very frequently in the code,
      but we just throw it out, since the fact that 2 files have this line in
      common is not very meaningful.
      
      Making line_threshold a constant also makes the algorithm linear.  In
      practice, with real data sets, the results should be quite stable with
      respect to this parameter.  That is, they should not vary very much at
      all if line_threshold = 1 or 100 or 1000.
  """
  mt = MultiTimer()

  print '%s * %s = %s total possible pairs of files' % (
      len(set1), len(set2), len(set1) * len(set2))

  #
  # 1. Generate inverted indices and size information for each set.
  #
  mt.checkpoint("Indexing the first set of %d files (threshold=%d)" % (
    len(set1), line_threshold))
  index1, sizes1 = IndexFiles(set1, line_threshold)

  mt.checkpoint("Indexing the second set of %d files (threshold=%d)" % (
    len(set2), line_threshold))
  index2, sizes2 = IndexFiles(set2, line_threshold)

  print "Sum of file sizes in first set: %s lines" % sum(sizes1.values())
  print "Sum of file sizes in second set: %s lines" % sum(sizes2.values())
  print "Size of index over first set:", len(index1)
  print "Size of index over second set:", len(index2)

  #
  # 2. Combine the 2 indices to form sparse matrix that counts common lines.
  #
  # Pairs which do not appear have an implicit count of 0.
  #
  # This is a sparse matrix represented as a dictionary of 2-tuples:
  # (filename in set1, filename in set2) -> number of (unique) lines that
  # they have in common

  mt.checkpoint("Computing union of lines in the indices")
  all_lines = set(index1) | set(index2)
  print 'Total unique lines in both indices: %s' % len(all_lines)

  mt.checkpoint("Making sparse common line matrix")

  matrix = {}
  for line in all_lines:
    files1 = index1.get(line)
    files2 = index2.get(line)
    if files1 and files2:
      # For every pair of files that contain this line, increment the
      # corresponding cell in the common line matrix.  
      #
      # Since we pruned the index, this whole double loop is constant time.  If
      # the line_threshold is 1 (default), then the whole thing is just a
      # single iteration.  
      for f1 in files1:
        for f2 in files2:
          matrix[(f1, f2)] = matrix.get((f1, f2), 0) + 1

  mt.checkpoint("Calculating similarity for %s pairs of files" % len(matrix))

  #
  # 3. Calculate the similarity of each cell in the matrix.
  #
  # The similarity metric is number of common lines divided by the file size
  # (and take the greater of the 2).  Note that this means that if file1 is
  # entirely *contained* in file2, then the similarity of the pair will be 1.0.
  # This is desirable since you can detect if 99,000 lines were added to a 1,000
  # line file, etc.  Other similarity metrics are possible.
  #
  similar_pairs = []
  for (file1, file2), num_common in matrix.iteritems():
    c = float(num_common)
    # TODO: Write notes on similarity metric
    similarity = max(c / sizes1[file1], c / sizes2[file2])
    similar_pairs.append((file1, file2, num_common, similarity))

  mt.checkpoint("Sorting %d similar pairs" % len(similar_pairs))
  similar_pairs.sort(key=lambda x: x[3], reverse=True)

  mt.checkpoint("Writing similarity report")
  f = open(reportfile, 'w')
  for file1, file2, num_common, similarity in similar_pairs:
    # Put a * after entries where the relative paths are *not* the same, just
    # for convenience when looking at the report.
    if file1.split('/')[1:] != file2.split('/')[1:]:
      mark = '*'
    else:
      mark = ''

    print >> f, '(%4d) %-40s (%4d) %-40s %4d %.3f %s' % (
        sizes1[file1], file1, sizes2[file2], file2, num_common, similarity,
        mark)
  f.close()
  print 'Wrote %s' % reportfile

  mt.checkpoint("End")
  mt.report()


if __name__ == '__main__':
  parser = optparse.OptionParser()

  parser.add_option(
      '-l', '--line-threshold', dest='line_threshold', type='int', default=1,
      help='ignore lines that occur more than this number of times in either '
      'set')
  parser.add_option(
      '-o', '--out', dest='reportfile', default=None,
      help='Write output to this file instead of the default.')

  (options, argv) = parser.parse_args()

  try:
    left, right = argv[:2]
  except ValueError:
    print 'Pass 2 files containing paths as arguments (left tree, right tree).'
    sys.exit(1)

  if not options.reportfile:
    options.reportfile = 'out-%s-vs-%s.%s.txt' % (
        os.path.splitext(os.path.basename(left))[0],
        os.path.splitext(os.path.basename(right))[0],
        options.line_threshold)

  set1 = [f.strip() for f in open(left).readlines()]
  set2 = [f.strip() for f in open(right).readlines()]
  FindSimilarPairs(set1, set2, options.reportfile, options.line_threshold)


* Re: Linear time/space rename logic for *inexact* case
  2007-10-22  9:40 Linear time/space rename logic for *inexact* case Andy C
@ 2007-10-22 10:09 ` Andy C
  2007-10-22 10:34 ` Jeff King
  1 sibling, 0 replies; 5+ messages in thread
From: Andy C @ 2007-10-22 10:09 UTC (permalink / raw)
  To: git

On 10/22/07, Andy C <andychup@gmail.com> wrote:
> So the algorithm is:

I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.

1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.

2) Remove any lines that appear in more than one doc from the left
index.  Do the same for the right index.  (this corresponds to
line_threshold=1 case)

3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1.  The
matrix is represented by a hash of 2-tuples => counts.

After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets.  The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.

4) Then you can use this to compute similarity scores.
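
Here is roughly the same thing boiled down to runnable Python -- just a
condensation of the attached similarity_demo.py, with line_threshold
fixed at 1 and the per-file size bookkeeping (needed for the actual
score) left out:

  def index_files(paths):
      # Step 1: line -> list of files the line appears in.
      index = {}
      for path in paths:
          lines = set()
          for raw in open(path).read().splitlines():
              line = raw.strip()
              if line:
                  lines.add(line)
          for line in lines:
              index.setdefault(line, []).append(path)
      return index

  def similarity_matrix(left_paths, right_paths):
      left = index_files(left_paths)
      right = index_files(right_paths)
      # Step 2: keep only lines that appear in exactly one file on each
      # side (the line_threshold=1 case).
      left = dict((l, f) for l, f in left.items() if len(f) == 1)
      right = dict((l, f) for l, f in right.items() if len(f) == 1)
      # Step 3: bump the (left file, right file) cell for every line the
      # two sides share.
      matrix = {}
      for line in left:
          if line in right:
              pair = (left[line][0], right[line][0])
              matrix[pair] = matrix.get(pair, 0) + 1
      # Step 4: feed these counts into whatever similarity score you like.
      return matrix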

Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has.  I am fairly sure it does,
but the demo should allow us to evaluate that.

Andy


* Re: Linear time/space rename logic for *inexact* case
  2007-10-22  9:40 Linear time/space rename logic for *inexact* case Andy C
  2007-10-22 10:09 ` Andy C
@ 2007-10-22 10:34 ` Jeff King
  2007-10-22 23:47   ` Andy C
  1 sibling, 1 reply; 5+ messages in thread
From: Jeff King @ 2007-10-22 10:34 UTC (permalink / raw)
  To: Andy C; +Cc: git

On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:

> I just subscribed to this list, so sorry I can't respond to the
> threads already started here.  I'm the guy that was mailing Linus
> about this algorithm to do similarity detection in linear time,
> mentioned here:

Great, welcome aboard. :)

> To avoid the m*n memory usage of step 2, I use a hash table which maps
> 2-tuples to counts to represent the sparse similarity matrix, instead
> of representing it directly.  The 2-tuple is a pair of filenames,
> corresponding to the row/column of the matrix, and the counts/values
> are the entries in the matrix.

OK, makes sense (that's what I was trying to propose near the end of my
mail).

> You can also prune entries which have a long "postings list" (using
> the term in the IR sense here:
> http://www.xapian.org/docs/intro_ir.html).  This has the nice side
> effect of getting rid of quadratic behavior, *and* making the
> algorithm more accurate because it stops considering common lines like
> "#endif" as contributing to similarity.

Ah, very clever. I think that should have nice behavior most of the
time, though I wonder about a pathological case where we have many
files, all very similar to each other, and then have a commit where they
all start to diverge, but just by a small amount, while getting moved.
We would end up with an artificially low score between renamed files
(because we've thrown out all of the commonality) which may lead us to
believe there was no rename at all.

But it might be worth ignoring that case.

> This pruning of common lines from the index makes step 3 linear.  In
> fact, I prune the index to only include lines which occur just *once*.
>  Nearly every line of code in real data sets is unique, so this works
> well in practice.

Makes sense.

> http://marc.info/?l=git&m=118975452330644&w=2
> "Of course, 2 minutes for a git-status is still way too slow, so there we
> might still need a limiter. It also looks like 57% of my time is spent
> in spanhash_find, and another 29% in diffcore_count_changes."
> 
> I know there have been a couple fixes since you posted that, but if it
> was the O(m*n) behavior that was causing your problem, it should still
> be reproducible.  Linus suggested that this is a good data set to try
> things out with.  Is there a repository I can clone, with git
> commands to run to repro this?

Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
down from 20 minutes; the latest patch from Linus just covers the "exact
match" case where two files have identical SHA1s).

It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
it, here's a way to reproduce a repo with similar properties:

-- >8 --
#!/bin/sh
rm -rf repo
mkdir repo && cd repo

# seq and openssl aren't portable, but the
# point is to generate 600 random 100K files
for i in `seq -f %03g 1 600`; do
  openssl rand 100000 >$i.rand
done

# make repo, fully packed
# we don't bother trying to delta in the pack
# since the files are all random
git-init
git-add .
git-commit -q -m one
git-repack -a -d --window=0

# move every file
mkdir new
git-mv *.rand new

# modify every file
for i in new/*.rand; do
  echo foo >>$i
done
git-add new

# this is the operation of interest
time git-diff --cached --raw -M -l0 >/dev/null

-- >8 --

The idea is to have a large number of files that are slightly changed
and moved, and to try to find the pairs.  The diff takes about 20
seconds to run for me (the real repo has 1M files rather than 100K
files, but it's nice to have the tests take a lot less time). If you
want a bigger test, bump up the file size (or increase the number of
files, which will emphasize the quadratic behavior).

> 3) Compute the similarity metric, which I've defined here as
> max(c/|left file|, c/|right file|), where c is the entry in the matrix
> for the file pair.  Note that the files are treated as *sets* of lines
> (unordered, unique).  The similarity score is a number between 0.0 and
> 1.0.  Other similarity metrics are certainly possible.

We have to handle binary files, too. In the current implementation, we
consider either lines or "chunks", and similarity is increased by the
size of the chunk.

> * Some people might be concerned that it treats files as unordered
> sets of lines.  The first thought might be to do this as a
> preprocessing step to cull the list of candidates, and then do a real
> delta.  But in my experience, I haven't encountered a case where
> there's all that much to gain by doing that.

I think we are already treating the files as unordered sets of lines.
And really, I think there is some value in that, anyway. If I reverse
the order of all lines in a file, it might be useful for git to say
"this file came from that file".

> * This can be run on all files, not just adds/deletes.  If I have a

Great. git has a "look for copies also" flag, but it is usually disabled
because of the computational cost. If we can get it low enough, it might
actually become a lot more useful.

> If anything about the explanation is unclear, let me know and I will
> try to clarify it.  Playing around with the demo should illuminate
> what it does.  You can run it on data sets of your own.  All you need
> is 2 source trees and the "find" command to produce input to the
> script (see setup_demo.sh).

I'll try it on my test data, but it sounds like it doesn't really handle
binary files.

> As mentioned, I will try to do some tests on this, perhaps with Jeff's
> hard data set, and show that the results are good and that the
> algorithm is faster because the quadratic behavior is gone (if Linus
> doesn't beat me to it!).

Trying to fit it into the C git code would be useful, but I doubt I'll
have time to work on it tonight, since it's getting onto dawn here.

-Peff


* Re: Linear time/space rename logic for *inexact* case
  2007-10-22 10:34 ` Jeff King
@ 2007-10-22 23:47   ` Andy C
  2007-10-23  0:49     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Andy C @ 2007-10-22 23:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On 10/22/07, Jeff King <peff@peff.net> wrote:
> On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:
> > You can also prune entries which have a long "postings list" (using
> > the term in the IR sense here:
> > http://www.xapian.org/docs/intro_ir.html).  This has the nice side
> > effect of getting rid of quadratic behavior, *and* making the
> > algorithm more accurate because it stops considering common lines like
> > "#endif" as contributing to similarity.
>
> Ah, very clever. I think that should have nice behavior most of the
> time, though I wonder about a pathological case where we have many
> files, all very similar to each other, and then have a commit where they
> all start to diverge, but just by a small amount, while getting moved.
> We would end up with an artificially low score between renamed files
> (because we've thrown out all of the commonality) which may lead us to
> believe there was no rename at all.
>
> But it might be worth ignoring that case.

Well I do adjust the file sizes when pruning.  In the .py demo code
you can see it at these lines:

      for f in filelist:
        sizes[f] -= 1

The similarity metric depends on the sizes, so that way it won't get
out of whack if there are a lot of frequently occurring lines.

I did mention that pathological case to Linus.  The worst thing is if
you have N adds and N deletes and every single file is identical.
Well the convenient thing is that the pruning handles this gracefully:
*every* line in both indices will be pruned, the similarity matrix will
be completely empty, and no renames will be recorded.  Arguably that is
what you want -- you don't want N^2 renames.  (I guess you could also
just assign them to each other randomly to get N renames.)


> > http://marc.info/?l=git&m=118975452330644&w=2
> > "Of course, 2 minutes for a git-status is still way too slow, so there we
> > might still need a limiter. It also looks like 57% of my time is spent
> > in spanhash_find, and another 29% in diffcore_count_changes."
> >
> > I know there have been a couple fixes since you posted that, but if it
> > was the O(m*n) behavior that was causing your problem, it should still
> > be reproducible.  Linus suggested that this is a good data set to try
> > things out with.  Is there a repository I can clone, with git
> > commands to run to repro this?
>
> Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
> down from 20 minutes; the latest patch from Linus just covers the "exact
> match" case where two files have identical SHA1s).
>
> It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
> it, here's a way to reproduce a repo with similar properties:

Ah OK thanks.   This is a good test case for binary files, and my
Linux 2.4 vs Linux 2.6 test seems reasonable for text files.

> > 3) Compute the similarity metric, which I've defined here as
> > max(c/|left file|, c/|right file|), where c is the entry in the matrix
> > for the file pair.  Note that the files are treated as *sets* of lines
> > (unordered, unique).  The similarity score is a number between 0.0 and
> > 1.0.  Other similarity metrics are certainly possible.
>
> We have to handle binary files, too. In the current implementation, we
> consider either lines or "chunks", and similarity is increased by the
> size of the chunk.

Yes, my thing doesn't handle binary files, but I have a variation of
the idea for that too which I'll explain below.

Can you explain what the current implementation does for binary files
in more detail, or point me to some docs?  I'm having a hard time
figuring out what the code is doing, even though the comments insist
that it is very simple : )

from diffcore-delta.c:
"
 * Idea here is very simple.
 *
 * Almost all data we are interested in are text, but sometimes we have
 * to deal with binary data.  So we cut them into chunks delimited by
 * LF byte, or 64-byte sequence, whichever comes first, and hash them.
"

What is the similarity metric for binary files?  Is it related to the
number of 64 byte chunks they have in common?  How do you know that
the 64 byte chunks match up?  Suppose I have a 10k file, and then I
insert 10 random bytes at positions 1000, 2000, 3000, etc.  How does
that work?

> > * This can be run on all files, not just adds/deletes.  If I have a
>
> Great. git has a "look for copies also" flag, but it is usually disabled
> because of the computational cost. If we can get it low enough, it might
> actually become a lot more useful.

Yes, there seem to be quite a few flags that have to do with trading
off features for computational time/memory.  I believe this algorithm
could be made fast enough to simplify the user interface a bit.  I'm
not yet super familiar with git, but I imagine it's probably hard for
some users to intuit what these thresholds do.

> > If anything about the explanation is unclear, let me know and I will
> > try to clarify it.  Playing around with the demo should illuminate
> > what it does.  You can run it on data sets of your own.  All you need
> > is 2 source trees and the "find" command to produce input to the
> > script (see setup_demo.sh).
>
> I'll try it on my test data, but it sounds like it doesn't really handle
> binary files.

Yes, I think we should try it out for text files first, since that's
probably the more important case.  But at the risk of getting a bit
ahead of myself, I think you can do a similar thing for binary files.

This first part isn't my idea; I heard it at a talk by Udi Manber at
Google, but I think it could be combined with my pruning and sparse
similarity matrix thing.  And it looks like there are already some of
the same ideas in git, though I'm still deciphering the code (any docs
on this area would help).

Basically his idea is that you take a rolling window of say 64 bytes
and create an inverted index, just like you do for text files.  So if
you have a 10,000 byte file, then there are (10,000 - 64) hashes to be
made.

This could be a lot, so you can speed it up by sampling.  But you want
to sample in such a way that preserves commonality.  That is, you
don't want to sample a given 64-byte sequence in one doc but miss it
in the other one.  So then all you do is keep all windows where the
hash code is 0 mod k (and you can adjust k to achieve the
sampling rate you want).  If a given sequence is sampled in *one* doc,
then it's guaranteed to be sampled in *all* docs.

The second point is that you can do this rolling window hashing very
fast with a polynomial hash.  You can use the results of the previous
hash to compute the next hash.  I think you already have this in
diff-delta.c, but I'm not quite sure.  I'm not sure exactly what that
code is used for either (I don't think it is for the rename matching,
because that is done by estimate_similarity and
diffcore_count_changes?).

Once you have these 2 inverted indices of { sampled subset of 64-byte
substrings -> [list of documents] }  then you simply do the same thing
as I suggested for text files.  The logic is more or less the same.
If 2 files have many 64-byte substrings in common compared to their
sizes, then they are similar.  If a given 64-byte sequence (say
0000000000000...000) appears in many docs, then it's not a good
differentiator or indicator of similarity -- so throw it out.

(It occurs to me that it might be useful to do this with a small
window (8 bytes?) and a larger one (>64), and somehow combine the
results.  I guess some experimentation is in order, or maybe there is
some theory about that.)

So this is a linear algorithm too.  The hashing could have a high
constant factor, but the sampling and polynomial hashing should
basically eliminate that, I believe.
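
To make that concrete, here is a rough, untested sketch of the rolling
hash plus "0 mod k" sampling in Python (the base, modulus and k values
are arbitrary choices of mine, not anything taken from git):

  WINDOW = 64
  BASE = 257
  MOD = (1 << 61) - 1   # a large Mersenne prime

  def sampled_windows(data, k=16):
      # Return {hash -> first offset} for every 64-byte window whose
      # hash is 0 mod k.  A window sampled in one file is guaranteed to
      # be sampled in any other file containing the same 64 bytes.
      buf = bytearray(data)
      if len(buf) < WINDOW:
          return {}
      top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
      h = 0
      for b in buf[:WINDOW]:
          h = (h * BASE + b) % MOD
      picked = {}
      if h % k == 0:
          picked.setdefault(h, 0)
      for i in range(WINDOW, len(buf)):
          # Slide by one byte: drop buf[i - WINDOW], append buf[i].
          h = ((h - buf[i - WINDOW] * top) * BASE + buf[i]) % MOD
          if h % k == 0:
              picked.setdefault(h, i - WINDOW + 1)
      return picked

Comparing two binary files is then the same as before: build
{ sampled hash -> [list of files] } indices for both sides, prune the
hashes with long postings lists, and count collisions per file pair.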

But like I said, it's probably better to concentrate on the text part
before anyone tackles that (at least I will look at the text case
first).  I have less experience with this algorithm, but it seems like
it should work just as well, and is quite simple.  I think both of
these algorithms are a bit more general and should lead to a net
reduction in lines of code, which is a good thing.

Andy


* Re: Linear time/space rename logic for *inexact* case
  2007-10-22 23:47   ` Andy C
@ 2007-10-23  0:49     ` Linus Torvalds
  0 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2007-10-23  0:49 UTC (permalink / raw)
  To: Andy C; +Cc: Jeff King, git



On Mon, 22 Oct 2007, Andy C wrote:
> 
> from diffcore-delta.c:
> "
>  * Idea here is very simple.
>  *
>  * Almost all data we are interested in are text, but sometimes we have
>  * to deal with binary data.  So we cut them into chunks delimited by
>  * LF byte, or 64-byte sequence, whichever comes first, and hash them.
> "
> 
> What is the similarity metric for binary files?  Is it related to the
> number of 64 byte chunks they have in common?  How do you know that
> the 64 byte chunks match up?  Suppose I have a 10k file, and then I
> insert 10 random bytes at positions 1000, 2000, 3000, etc.  How does
> that work?

The 'LF' byte will start a new window, so even with binary files (assuming 
some random-enough distribution that you have *some* 'LF' bytes!), you 
basically get a synchronization event on each LF.

So the 64-byte hunks "line up" automatically, even for binary files. Maybe 
not immediately, but soon enough (ie on average, assuming some reasonably 
compressed - ie virtually random - binary format) you should find a LF 
roughly every fourth hunk.

There are probably smarter ways to guarantee it even in the absence of 
certain bit-patterns, but I bet the "break every 64 bytes or when you see 
a LF" thing works well for any reasonable binary file too.

But text files are obviously the most important case. Diffs on binaries 
are somewhat hit-and-miss regardless.

			Linus

