git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* diffcore and directory renames
@ 2008-09-01 21:39 Yann Dirson
  2008-09-01 22:50 ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Yann Dirson @ 2008-09-01 21:39 UTC (permalink / raw)
  To: GIT list

I often found myself lost when looking at a diff where a couple of
large dirs where renamed, and a handful of files were modified to take
the rename into account - not a rare situation, I'd say.  In such a
case, the diffs themselves are mostly hidden among numerous rename
entries.

To that, I felt that git ought to know better, and could easily
present a directory rename as such.  Among the advantages of having
such "dir rename detection", aside from readability which we should
not neglect, we'd have more accurate "svn dcommit" for such renames
and, and better decisions made when merging a branch with adds a file
into a dir, with a branch which renames that dir (ok, maybe not that
frequent, but still).


Looking closer at the current behaviour, I realized that git-diff-tree
*is* already able to report such renames in raw mode :

$ ./git-diff-tree 0f1027 -M
0f1027e1aceb4bc5fa682776ab9f72935e2cd1b3
:040000 040000 6f6159f0245784352414ff38ffb68bae80f30bd6 6f6159f0245784352414ff38ffb68bae80f30bd6 R100   ppc     moved

... but patch output breaks this into file moves, and even other
commands which I would have expected to display the same do in fact
not (eg. "git show --raw").  Is there any reason for this ?  I would
have expected git-show to do the same as diff-tree, and it would not
be so much of a problem to introduce a special form of the "rename"
chunk to handle dir renames ?


Since current diff-tree behaviour is based on comparing tree hashs, it
only detects 100%-similarity renames, and naturally diff-index and
diff-files cannot even do that much yet.  However, an optional
factorization pass following rename detection could quite easily
detect if, when a containing directory disappears because of a set of
renames - this is what I have started to look at.

Since what I propose is indeed just a factorization of file renames,
it could be adequate to represent them as such instead of directory
renames, so as to read something like:

|diff --git a/ppc/* b/moved/
|rename from ppc/*
|rename to moved/

This would have the additional advantage of also working when the
target destination was pre-existing, and when merging several dirs
into a single one - while at the same time to mistaking the reader too
much into thinking that git cares about directories so much ;)

Callers should also be careful about interpreting the output when
mixing this behaviour with file filtering, obviously, so it is not
meant to be activated by default.

Does any of this sound sensible ?


On the technical side, I figured out that the best way to plug the
factorization was inside diffcore_rename to take advantage of the fact
it has a list of all renames, that's not going to be published out of
this function - that only left me the "cleanup:" label as insertion
point, not that I find it conceptually very comfortable :)

But if I have understood correctly, diffcore only gets access to
changed filepairs, and does not know about the trees it is comparing.

The most sensible way to go I can think of would be to give to
diffcore enough information (src and dst tree-hash|"INDEX"|"WORKING
COPY") to be able to know for directory existence.  I'd rather check
it does not sound completely wrong to anyone before going into that -
opinions ?

Best regards,
-- 
Yann

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: diffcore and directory renames
  2008-09-01 21:39 diffcore and directory renames Yann Dirson
@ 2008-09-01 22:50 ` Junio C Hamano
  2008-09-02 20:48   ` Yann Dirson
  2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
  0 siblings, 2 replies; 11+ messages in thread
From: Junio C Hamano @ 2008-09-01 22:50 UTC (permalink / raw)
  To: Yann Dirson; +Cc: GIT list

Yann Dirson <ydirson@altern.org> writes:

> I often found myself lost when looking at a diff where a couple of
> large dirs where renamed, and a handful of files were modified to take
> the rename into account - not a rare situation, I'd say.  In such a
> case, the diffs themselves are mostly hidden among numerous rename
> entries.

Yes, I've hinted here number of times that rename detection could notice
the fact that neighbouring paths are migrating to the same directory and
boost similarity scores of leftover paths that did not otherwise made the
threshold in such a situation.

I am glad to see finally somebody got interested ;-)

> $ ./git-diff-tree 0f1027 -M
> 0f1027e1aceb4bc5fa682776ab9f72935e2cd1b3
> :040000 040000 6f6159f0245784352414ff38ffb68bae80f30bd6 6f6159f0245784352414ff38ffb68bae80f30bd6 R100   ppc     moved

Yes, diff-tree can show this, and you should be able to teach diff-index
with a clean cache-tree to do similar, but this only applies to the
non-recursive 100% rename at the toplevel, which is too narrow a special
case to be interesting at all.  We perhaps further could run the
similarity comparison on the raw tree objects if we wanted to so that you
can find inexact matches, but I think it is going in the wrong direction.

The thing is, diffcore is designed to be a general mechanism to unify
comparisons on two arbitrary sets of files, be they from tree-vs-tree,
tree-vs-index, tree-vs-worktree, or index-vs-worktree.  The machinery
should be able to operate on any two sets of files in the same way and
produce more-or-less the same results.  It is Ok for callers to give extra
hints, when available, to speed up the computation, but the core of the
algorithm should not depend on the presense of such hints to deduce the
renames.

So please aim to make the hint-less case, "diff-files --no-index", produce
a sensible result.  That should be the first step.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: diffcore and directory renames
  2008-09-01 22:50 ` Junio C Hamano
@ 2008-09-02 20:48   ` Yann Dirson
  2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
  1 sibling, 0 replies; 11+ messages in thread
From: Yann Dirson @ 2008-09-02 20:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: GIT list

On Mon, Sep 01, 2008 at 03:50:22PM -0700, Junio C Hamano wrote:
> Yes, I've hinted here number of times that rename detection could notice
> the fact that neighbouring paths are migrating to the same directory and
> boost similarity scores of leftover paths that did not otherwise made the
> threshold in such a situation.

Wow, that's a great idea I did not think about :)

> I am glad to see finally somebody got interested ;-)

I am glad we're on the same side :)

> > $ ./git-diff-tree 0f1027 -M
> > 0f1027e1aceb4bc5fa682776ab9f72935e2cd1b3
> > :040000 040000 6f6159f0245784352414ff38ffb68bae80f30bd6 6f6159f0245784352414ff38ffb68bae80f30bd6 R100   ppc     moved
> 
> Yes, diff-tree can show this, and you should be able to teach diff-index
> with a clean cache-tree to do similar, but this only applies to the
> non-recursive 100% rename at the toplevel, which is too narrow a special
> case to be interesting at all.  We perhaps further could run the
> similarity comparison on the raw tree objects if we wanted to so that you
> can find inexact matches, but I think it is going in the wrong direction.

Agreed.

> The thing is, diffcore is designed to be a general mechanism to unify
> comparisons on two arbitrary sets of files, be they from tree-vs-tree,
> tree-vs-index, tree-vs-worktree, or index-vs-worktree.  The machinery
> should be able to operate on any two sets of files in the same way and
> produce more-or-less the same results.  It is Ok for callers to give extra
> hints, when available, to speed up the computation, but the core of the
> algorithm should not depend on the presense of such hints to deduce the
> renames.

My idea was rather to just see tree/index/worktree as just 3
specialized forms of a single "abstract tree" of some sort,
effectively unifying things.

But anyway afterwards I thought that there may be something for me to
learn in how --find-copies-harder gets its information for
non-modified sources of copies -- I'll have a look at that.


> So please aim to make the hint-less case, "diff-files --no-index", produce
> a sensible result.  That should be the first step.

OK.

Best regards,
-- 
Yann

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH] detection of directory renames
  2008-09-01 22:50 ` Junio C Hamano
  2008-09-02 20:48   ` Yann Dirson
@ 2008-09-25 21:47   ` Yann Dirson
  2008-09-25 21:47     ` [PATCH] Introduce patch factorization in diffcore Yann Dirson
                       ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Yann Dirson @ 2008-09-25 21:47 UTC (permalink / raw)
  To: git

This is a first very preliminar patch, mostly so people can comment
early on the big picture.  It has a number of limitations, many but
not all already listed as FIXME's in the patch itself.  If anything in
this patch seems so wrong it is not worth polishing it, it's the
perfect time to say so :)

The idea is to add a new pass to diffcore-rename, to group file renames
looking like a full directory move, and then to hide those file
renames which do not carry any additionnal information.

Here is a sample run:

$ ./git-diff-tree ee491 --factorize-renames -r 
[DBG] possible rename from arm/ to bar/
[DBG] possible rename from ppc/ to moved/
[DBG] discarding dir rename of arm/, mixing moved/ and bar/
[DBG] ppc/* -> moved/* makes ppc/sha1ppc.S -> moved/sha1ppc.S uninteresting
[DBG] ppc/* -> moved/* makes ppc/sha1.c -> moved/sha1.c uninteresting
ee491a42190ec6e716f46a55fa0a7f4e307f1629
:040000 040000 0000000000000000000000000000000000000000 0000000000000000000000000000000000000000 R100   ppc/    moved/
:100644 100644 9e3ae038e818f4e21bc50f864fc5204f6fa44daa 9e3ae038e818f4e21bc50f864fc5204f6fa44daa R100   arm/sha1.c      bar/sha1.c
:100644 100644 3952646349cf9d033177e69ba9433652a378c0e9 3952646349cf9d033177e69ba9433652a378c0e9 R100   arm/sha1.h      bar/sha1.h
:100644 100644 c3c51aa4d487f2e85c02b0257c1f0b57d6158d76 c065eeef7d68ea91863431788e20cd814c5ac52c R099   ppc/sha1.h      moved/sha1.h
:100644 100644 8c1cb99fb403875af85e4d1524d21f7eb818f59b 8c1cb99fb403875af85e4d1524d21f7eb818f59b R100   arm/sha1_arm.S  moved/sha1_arm.S

-- 
Yann Dirson    <ydirson@altern.org> |
Debian-related: <dirson@debian.org> |   Support Debian GNU/Linux:
                                    |  Freedom, Power, Stability, Gratis
     http://ydirson.free.fr/        | Check <http://www.debian.org/>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Introduce patch factorization in diffcore.
  2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
@ 2008-09-25 21:47     ` Yann Dirson
  2008-09-25 22:39       ` Andreas Ericsson
  2008-09-25 23:19     ` [RFC PATCH] detection of directory renames Jakub Narebski
  2008-10-27 13:35     ` Jakub Narebski
  2 siblings, 1 reply; 11+ messages in thread
From: Yann Dirson @ 2008-09-25 21:47 UTC (permalink / raw)
  To: git


---

 diff-lib.c        |    6 +-
 diff.c            |    5 +
 diff.h            |    3 +
 diffcore-rename.c |  194 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 diffcore.h        |    1 
 tree-diff.c       |    4 +
 6 files changed, 202 insertions(+), 11 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index ae96c64..dcc4c2c 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -179,7 +179,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		changed = ce_match_stat(ce, &st, ce_option);
 		if (!changed) {
 			ce_mark_uptodate(ce);
-			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER))
+			if (!DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) &&
+			    !DIFF_OPT_TST(&revs->diffopt, FACTORIZE_RENAMES))
 				continue;
 		}
 		oldmode = ce->ce_mode;
@@ -310,7 +311,8 @@ static int show_modified(struct oneway_unpack_data *cbdata,
 
 	oldmode = old->ce_mode;
 	if (mode == oldmode && !hashcmp(sha1, old->sha1) &&
-	    !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER))
+	    !DIFF_OPT_TST(&revs->diffopt, FIND_COPIES_HARDER) &&
+	    !DIFF_OPT_TST(&revs->diffopt, FACTORIZE_RENAMES))
 		return 0;
 
 	diff_change(&revs->diffopt, oldmode, mode,
diff --git a/diff.c b/diff.c
index 9053d4d..2315980 100644
--- a/diff.c
+++ b/diff.c
@@ -2599,6 +2599,11 @@ int diff_opt_parse(struct diff_options *options, const char **av, int ac)
 		DIFF_OPT_SET(options, REVERSE_DIFF);
 	else if (!strcmp(arg, "--find-copies-harder"))
 		DIFF_OPT_SET(options, FIND_COPIES_HARDER);
+	else if (!strcmp(arg, "--factorize-renames")) {
+		DIFF_OPT_SET(options, FACTORIZE_RENAMES);
+		if (!options->detect_rename)
+			options->detect_rename = DIFF_DETECT_RENAME;
+	}
 	else if (!strcmp(arg, "--follow"))
 		DIFF_OPT_SET(options, FOLLOW_RENAMES);
 	else if (!strcmp(arg, "--color"))
diff --git a/diff.h b/diff.h
index 1ca4f46..7b862c6 100644
--- a/diff.h
+++ b/diff.h
@@ -64,6 +64,7 @@ typedef void (*diff_format_fn_t)(struct diff_queue_struct *q,
 #define DIFF_OPT_RELATIVE_NAME       (1 << 17)
 #define DIFF_OPT_IGNORE_SUBMODULES   (1 << 18)
 #define DIFF_OPT_DIRSTAT_CUMULATIVE  (1 << 19)
+#define DIFF_OPT_FACTORIZE_RENAMES   (1 << 20)
 #define DIFF_OPT_TST(opts, flag)    ((opts)->flags & DIFF_OPT_##flag)
 #define DIFF_OPT_SET(opts, flag)    ((opts)->flags |= DIFF_OPT_##flag)
 #define DIFF_OPT_CLR(opts, flag)    ((opts)->flags &= ~DIFF_OPT_##flag)
@@ -219,6 +220,8 @@ extern void diffcore_std(struct diff_options *);
 "  -C            detect copies.\n" \
 "  --find-copies-harder\n" \
 "                try unchanged files as candidate for copy detection.\n" \
+"  --factorize_renames\n" \
+"                factorize renames of all files of a directory.\n" \
 "  -l<n>         limit rename attempts up to <n> paths.\n" \
 "  -O<file>      reorder diffs according to the <file>.\n" \
 "  -S<string>    find filepair whose only one side contains the string.\n" \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1b2ebb4..2757204 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -6,6 +6,8 @@
 #include "diffcore.h"
 #include "hash.h"
 
+#include <libgen.h>
+
 /* Table of rename/copy destinations */
 
 static struct diff_rename_dst {
@@ -52,6 +54,32 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 	return &(rename_dst[first]);
 }
 
+static struct diff_rename_dst *locate_rename_dst_dir(struct diff_filespec *dir)
+{
+	/* code mostly duplicated from locate_rename_dst - not sure we
+	 * could merge them efficiently,though
+	 */
+	int first, last;
+	int dirlength = strlen(dir->path);
+
+	first = 0;
+	last = rename_dst_nr;
+	while (last > first) {
+		int next = (last + first) >> 1;
+		struct diff_rename_dst *dst = &(rename_dst[next]);
+		int cmp = strncmp(dir->path, dst->two->path, dirlength);
+		if (!cmp)
+			return dst;
+		if (cmp < 0) {
+			last = next;
+			continue;
+		}
+		first = next+1;
+	}
+	/* not found */
+	return NULL;
+}
+
 /* Table of rename/copy src files */
 static struct diff_rename_src {
 	struct diff_filespec *one;
@@ -409,6 +437,130 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
 		m[worst] = *o;
 }
 
+struct diff_dir_rename {
+	struct diff_filespec *one;
+	struct diff_filespec *two;
+	int discarded;
+	struct diff_dir_rename* next;
+};
+
+/*
+ * Marks as such file_rename if it is made uninteresting by dir_rename.
+ * Returns -1 if the file_rename is outside of the range in which given
+ * renames concerned by dir_rename are to be found (ie. end of loop),
+ * 0 otherwise.
+ */
+static int maybe_mark_uninteresting(struct diff_rename_dst* file_rename,
+				    struct diff_dir_rename* dir_rename,
+				    int one_len, int two_len) {
+	if (strncmp(file_rename->two->path,
+		    dir_rename->two->path, two_len))
+		return -1;
+	if (strncmp(file_rename->pair->one->path,
+		    dir_rename->one->path, one_len) ||
+	    !basename_same(file_rename->pair->one, file_rename->two) ||
+	    file_rename->pair->score != MAX_SCORE)
+		return 0;
+
+	file_rename->pair->uninteresting_rename = 1;
+	printf ("[DBG] %s* -> %s* makes %s -> %s uninteresting\n",
+		dir_rename->one->path, dir_rename->two->path,
+		file_rename->pair->one->path, file_rename->two->path);
+	return 0;
+}
+
+/*
+ * FIXME: we could optimize the 100%-rename case by preventing
+ * recursion to unfold what we know we would refold here.
+ * FIXME: do we want to replace linked list with sorted array ?
+ * FIXME: this prototype only handles renaming of dirs without
+ * a subdir.
+ * FIXME: leaks like hell.
+ * FIXME: ideas to evaluate a similarity score, anyone ?
+ *  10% * tree similarity + 90% * moved files similarity ?
+ */
+static struct diff_dir_rename* factorization_candidates = NULL;
+static void diffcore_factorize_renames(void)
+{
+	char buffer[PATH_MAX], one_parent_path[PATH_MAX], two_parent_path[PATH_MAX];
+	int i;
+
+	for (i = 0; i < rename_dst_nr; i++) {
+		if (!rename_dst[i].pair)
+			continue;
+		strcpy(buffer, rename_dst[i].pair->one->path);
+		strcpy(one_parent_path, dirname(buffer));
+		strcat(one_parent_path, "/");
+
+		struct diff_filespec* one_parent = alloc_filespec(one_parent_path);
+		fill_filespec(one_parent, null_sha1 /*FIXME*/, S_IFDIR);
+
+		if (!locate_rename_dst_dir(one_parent)) {
+			// one_parent_path is empty in result tree
+
+			// already considered ?
+			struct diff_dir_rename* seen;
+			for (seen=factorization_candidates; seen; seen = seen->next)
+				if (!strcmp(seen->one->path, one_parent_path)) break;
+			if (!seen) {
+				// record potential dir rename
+				strcpy(buffer, rename_dst[i].pair->two->path);
+				strcpy(two_parent_path, dirname(buffer));
+				strcat(two_parent_path, "/");
+
+				seen = xmalloc(sizeof(*seen));
+				seen->one = one_parent;
+				seen->two = alloc_filespec(two_parent_path);
+				fill_filespec(seen->two, null_sha1 /*FIXME*/, S_IFDIR);
+				seen->discarded = 0;
+				seen->next = factorization_candidates;
+				factorization_candidates = seen;
+				printf("[DBG] possible rename from %s to %s\n",
+				       one_parent_path, two_parent_path);
+				continue;
+			}
+			if (seen->discarded)
+				continue;
+			// check that seen entry matches this rename
+			strcpy(buffer, rename_dst[i].pair->two->path);
+			strcpy(two_parent_path, dirname(buffer));
+			strcat(two_parent_path, "/");
+			if (strcmp(two_parent_path, seen->two->path)) {
+				printf("[DBG] discarding dir rename of %s, mixing %s and %s\n",
+				       one_parent_path, two_parent_path, seen->two->path);
+				seen->discarded = 1;
+			}
+
+			/* all checks ok, we keep that entry */
+		}
+	}
+
+	// turn candidates into real renames
+	struct diff_dir_rename* candidate;
+	for (candidate=factorization_candidates; candidate; candidate = candidate->next) {
+		int two_idx, i, one_len, two_len;
+		if (candidate->discarded)
+			continue;
+
+		// any entry within candidate dst dir
+		two_idx = locate_rename_dst_dir(candidate->two) - rename_dst;
+
+		// now remove extraneous 100% files inside.
+		one_len = strlen(candidate->one->path);
+		two_len = strlen(candidate->two->path);
+		for (i = two_idx; i < rename_dst_nr; i++)
+			if (maybe_mark_uninteresting (rename_dst+i, candidate,
+						      one_len, two_len) < 0)
+				break;
+		for (i = two_idx-1; i >= 0; i--)
+			if (maybe_mark_uninteresting (rename_dst+i, candidate,
+						      one_len, two_len) < 0)
+				break;
+	}
+
+	return;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -446,13 +598,22 @@ void diffcore_rename(struct diff_options *options)
 				p->one->rename_used++;
 			register_rename_src(p->one, p->score);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY) {
-			/*
-			 * Increment the "rename_used" score by
-			 * one, to indicate ourselves as a user.
-			 */
-			p->one->rename_used++;
-			register_rename_src(p->one, p->score);
+		else {
+			if (detect_rename == DIFF_DETECT_COPY) {
+				/*
+				 * Increment the "rename_used" score by
+				 * one, to indicate ourselves as a user.
+				 */
+				p->one->rename_used++;
+				register_rename_src(p->one, p->score);
+			}
+			if (DIFF_OPT_TST(options, FACTORIZE_RENAMES)) {
+				/* similarly, rename factorization needs to
+				 * see all files from second tree
+				 */
+				//p->two->rename_used++; // FIXME: would we need that ?
+				locate_rename_dst(p->two, 1);
+			}
 		}
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)
@@ -561,8 +722,24 @@ void diffcore_rename(struct diff_options *options)
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
+
+	/* Now possibly factorize those renames and copies. */
+	if (DIFF_OPT_TST(options, FACTORIZE_RENAMES))
+		diffcore_factorize_renames();
+
 	outq.queue = NULL;
 	outq.nr = outq.alloc = 0;
+
+	// first, turn factorization_candidates into real renames
+	struct diff_dir_rename* candidate;
+	for (candidate=factorization_candidates; candidate; candidate = candidate->next) {
+		struct diff_filepair* pair;
+		if (candidate->discarded) continue;
+		pair = diff_queue(&outq, candidate->one, candidate->two);
+		pair->score = MAX_SCORE;
+		pair->renamed_pair = 1;
+	}
+
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
 		struct diff_filepair *pair_to_free = NULL;
@@ -577,7 +754,8 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_rename_dst *dst =
 				locate_rename_dst(p->two, 0);
 			if (dst && dst->pair) {
-				diff_q(&outq, dst->pair);
+				if (!dst->pair->uninteresting_rename)
+					diff_q(&outq, dst->pair);
 				pair_to_free = p;
 			}
 			else
diff --git a/diffcore.h b/diffcore.h
index 8ae3578..e4aa099 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -62,6 +62,7 @@ struct diff_filepair {
 	unsigned broken_pair : 1;
 	unsigned renamed_pair : 1;
 	unsigned is_unmerged : 1;
+	unsigned uninteresting_rename : 1;
 };
 #define DIFF_PAIR_UNMERGED(p) ((p)->is_unmerged)
 
diff --git a/tree-diff.c b/tree-diff.c
index 9f67af6..872f757 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -49,7 +49,9 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2, const
 		show_entry(opt, "+", t2, base, baselen);
 		return 1;
 	}
-	if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER) && !hashcmp(sha1, sha2) && mode1 == mode2)
+	if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER) &&
+	    !DIFF_OPT_TST(opt, FACTORIZE_RENAMES) &&
+	    !hashcmp(sha1, sha2) && mode1 == mode2)
 		return 0;
 
 	/*

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] Introduce patch factorization in diffcore.
  2008-09-25 21:47     ` [PATCH] Introduce patch factorization in diffcore Yann Dirson
@ 2008-09-25 22:39       ` Andreas Ericsson
  2008-09-26 19:21         ` Yann Dirson
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Ericsson @ 2008-09-25 22:39 UTC (permalink / raw)
  To: Yann Dirson; +Cc: git

Yann Dirson wrote:
>  "                try unchanged files as candidate for copy detection.\n" \
> +"  --factorize_renames\n" \
> +"                factorize renames of all files of a directory.\n" \


Use dashes to separate words in arguments, please.


>  
> +#include <libgen.h>
> +

 +#ifdef basename
 +# undef basename
 +#endif

We might as well use the GNU version of basename() at least. Even if
you don't use it, I'd rather not see this bite some unwary programmer
coming along after you. Worst case scenario, sha1's won't add up if
POSIX basename alters its argument, making for one fiendishly tricky
bug to find.

> +/*
> + * FIXME: we could optimize the 100%-rename case by preventing
> + * recursion to unfold what we know we would refold here.
> + * FIXME: do we want to replace linked list with sorted array ?

Either that or a hash. Most of the time seems to be spent in lookups.
With a hash you get quick lookups and reasonably quick inserts. With
a sorted array you get lower memory footprint than anything else and
can bisect your way to the right entry, which performs reasonably
close to skiplists. The sort overhead might be a deterrant factor,
but insofar as I understand it trees are always sorted in git anyway,
so perhaps that'd be the best solution.

Apart from that, I'd need to apply the patch to review it properly,
and I'm far too tired for that now.

I like the direction this is going though, so thanks a lot for doing
it :)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] detection of directory renames
  2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
  2008-09-25 21:47     ` [PATCH] Introduce patch factorization in diffcore Yann Dirson
@ 2008-09-25 23:19     ` Jakub Narebski
  2008-09-26 19:06       ` Yann Dirson
  2008-10-27 13:35     ` Jakub Narebski
  2 siblings, 1 reply; 11+ messages in thread
From: Jakub Narebski @ 2008-09-25 23:19 UTC (permalink / raw)
  To: Yann Dirson; +Cc: git

Yann Dirson <ydirson@altern.org> writes:

> This is a first very preliminar patch, mostly so people can comment
> early on the big picture.  It has a number of limitations, many but
> not all already listed as FIXME's in the patch itself.  If anything in
> this patch seems so wrong it is not worth polishing it, it's the
> perfect time to say so :)
> 
> The idea is to add a new pass to diffcore-rename, to group file renames
> looking like a full directory move, and then to hide those file
> renames which do not carry any additionnal information.
> 
> Here is a sample run:
> 
> $ ./git-diff-tree ee491 --factorize-renames -r 
> [DBG] possible rename from arm/ to bar/
> [DBG] possible rename from ppc/ to moved/
> [DBG] discarding dir rename of arm/, mixing moved/ and bar/
> [DBG] ppc/* -> moved/* makes ppc/sha1ppc.S -> moved/sha1ppc.S uninteresting
> [DBG] ppc/* -> moved/* makes ppc/sha1.c -> moved/sha1.c uninteresting
> ee491a42190ec6e716f46a55fa0a7f4e307f1629
> :040000 040000 0000000000000000000000000000000000000000 0000000000000000000000000000000000000000 R100   ppc/    moved/
> :100644 100644 9e3ae038e818f4e21bc50f864fc5204f6fa44daa 9e3ae038e818f4e21bc50f864fc5204f6fa44daa R100   arm/sha1.c      bar/sha1.c
> :100644 100644 3952646349cf9d033177e69ba9433652a378c0e9 3952646349cf9d033177e69ba9433652a378c0e9 R100   arm/sha1.h      bar/sha1.h
> :100644 100644 c3c51aa4d487f2e85c02b0257c1f0b57d6158d76 c065eeef7d68ea91863431788e20cd814c5ac52c R099   ppc/sha1.h      moved/sha1.h
> :100644 100644 8c1cb99fb403875af85e4d1524d21f7eb818f59b 8c1cb99fb403875af85e4d1524d21f7eb818f59b R100   arm/sha1_arm.S  moved/sha1_arm.S

Wonderfull!!!

I like it very much, and hope that it would get improved. And that you
would be able to use it in merges too, to allow for example for new
files to be created in renamed directory instead of old one which does
not exist any longer (c.f. http://www.markshuttleworth.com/archives/123
and following articles).

For the future, could you show examples with --abbrev option in use? TIA.
-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] detection of directory renames
  2008-09-25 23:19     ` [RFC PATCH] detection of directory renames Jakub Narebski
@ 2008-09-26 19:06       ` Yann Dirson
  0 siblings, 0 replies; 11+ messages in thread
From: Yann Dirson @ 2008-09-26 19:06 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Thu, Sep 25, 2008 at 04:19:20PM -0700, Jakub Narebski wrote:
> Wonderfull!!!

Wow, I love this feedback :)

> I like it very much, and hope that it would get improved. And that you
> would be able to use it in merges too, to allow for example for new
> files to be created in renamed directory instead of old one which does
> not exist any longer (c.f. http://www.markshuttleworth.com/archives/123
> and following articles).

Yes, that's one of planned things.  Along the same line (once renames
are stabilized), I think about extending things so that "split dirs"
are detected as well, so merge can flag a conflict when there is a
doubt that a file would really be added in a "split directory".  The
situation may be obvious in my example (arm/ contents dispatched into
bar/ and moved/), but may also be useful when we can detect that a
significant number of files have been moved out of a directory.

> For the future, could you show examples with --abbrev option in use? TIA.

Good idea.

Best regards,
-- 
Yann

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Introduce patch factorization in diffcore.
  2008-09-25 22:39       ` Andreas Ericsson
@ 2008-09-26 19:21         ` Yann Dirson
  0 siblings, 0 replies; 11+ messages in thread
From: Yann Dirson @ 2008-09-26 19:21 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git

On Fri, Sep 26, 2008 at 12:39:54AM +0200, Andreas Ericsson wrote:
> Yann Dirson wrote:
>>  "                try unchanged files as candidate for copy detection.\n" \
>> +"  --factorize_renames\n" \
>> +"                factorize renames of all files of a directory.\n" \
>
>
> Use dashes to separate words in arguments, please.

Fortunately, the typo only made it to the help string, thanks for
catching it :)

>>  +#include <libgen.h>
>> +
>
> +#ifdef basename
> +# undef basename
> +#endif
>
> We might as well use the GNU version of basename() at least. Even if
> you don't use it, I'd rather not see this bite some unwary programmer
> coming along after you. Worst case scenario, sha1's won't add up if
> POSIX basename alters its argument, making for one fiendishly tricky
> bug to find.

I'll look into that, thanks

>> +/*
>> + * FIXME: we could optimize the 100%-rename case by preventing
>> + * recursion to unfold what we know we would refold here.
>> + * FIXME: do we want to replace linked list with sorted array ?
>
> Either that or a hash. Most of the time seems to be spent in lookups.
> With a hash you get quick lookups and reasonably quick inserts. With
> a sorted array you get lower memory footprint than anything else and
> can bisect your way to the right entry, which performs reasonably
> close to skiplists. The sort overhead might be a deterrant factor,
> but insofar as I understand it trees are always sorted in git anyway,
> so perhaps that'd be the best solution.

Thanks for this insight - I'll wait before changing the data structure,
since the current implementation may need to be able to recurse (consider
renamed subdirs as well as renamed files).

> Apart from that, I'd need to apply the patch to review it properly,
> and I'm far too tired for that now.
>
> I like the direction this is going though, so thanks a lot for doing
> it :)

Great :)

Best regards,
-- 
Yann

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] detection of directory renames
  2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
  2008-09-25 21:47     ` [PATCH] Introduce patch factorization in diffcore Yann Dirson
  2008-09-25 23:19     ` [RFC PATCH] detection of directory renames Jakub Narebski
@ 2008-10-27 13:35     ` Jakub Narebski
  2008-10-27 19:13       ` Yann Dirson
  2 siblings, 1 reply; 11+ messages in thread
From: Jakub Narebski @ 2008-10-27 13:35 UTC (permalink / raw)
  To: Yann Dirson; +Cc: git

Yann Dirson <ydirson@altern.org> writes:

> This is a first very preliminar patch, mostly so people can comment
> early on the big picture.  It has a number of limitations, many but
> not all already listed as FIXME's in the patch itself.  If anything in
> this patch seems so wrong it is not worth polishing it, it's the
> perfect time to say so :)

What happened to this work? I din't see it mentioned in "what's
cooking..." announcements...

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] detection of directory renames
  2008-10-27 13:35     ` Jakub Narebski
@ 2008-10-27 19:13       ` Yann Dirson
  0 siblings, 0 replies; 11+ messages in thread
From: Yann Dirson @ 2008-10-27 19:13 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

On Mon, Oct 27, 2008 at 06:35:44AM -0700, Jakub Narebski wrote:
> Yann Dirson <ydirson@altern.org> writes:
> 
> > This is a first very preliminar patch, mostly so people can comment
> > early on the big picture.  It has a number of limitations, many but
> > not all already listed as FIXME's in the patch itself.  If anything in
> > this patch seems so wrong it is not worth polishing it, it's the
> > perfect time to say so :)
> 
> What happened to this work? I din't see it mentioned in "what's
> cooking..." announcements...

I'm slowly bugfixing it, and plan to post a slightly updated version
soon.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-10-27 19:14 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-01 21:39 diffcore and directory renames Yann Dirson
2008-09-01 22:50 ` Junio C Hamano
2008-09-02 20:48   ` Yann Dirson
2008-09-25 21:47   ` [RFC PATCH] detection of " Yann Dirson
2008-09-25 21:47     ` [PATCH] Introduce patch factorization in diffcore Yann Dirson
2008-09-25 22:39       ` Andreas Ericsson
2008-09-26 19:21         ` Yann Dirson
2008-09-25 23:19     ` [RFC PATCH] detection of directory renames Jakub Narebski
2008-09-26 19:06       ` Yann Dirson
2008-10-27 13:35     ` Jakub Narebski
2008-10-27 19:13       ` Yann Dirson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).