git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFD] Combine diff improvements
@ 2013-03-14 22:32 Antoine Pelisse
  2013-03-14 22:52 ` Junio C Hamano
  2013-03-15  5:54 ` Junio C Hamano
  0 siblings, 2 replies; 9+ messages in thread
From: Antoine Pelisse @ 2013-03-14 22:32 UTC (permalink / raw)
  To: git

Hi,
I've been working on combine diff recently and have seen that the current
implementation does not produce optimal results for coalescing lost lines.

I would like to discuss some improvements.

I think we can define an optimal solution as the shortest diff
output. That is that we coalesced as much line as we could.
It means that if each parent has the same file removed, the output will
be n lines long (all lines coalesce to each others).

The current implementation doesn't look for optimal solution but for the
easiest/greedy solution. It also stops matching with a parent if it
can't find a match. As an example, it means that losing [1, 2, 3, 4] from
parent A, and [3, 1, 2, 4] from parent B would give:
- 1
- 2
--3
- 4
 -1
 -2
 -4

While if we invert the two parents order we will have:
- 3
--1
--2
- 4
 -3
 -4

While clearly, in both cases, the optimal solution would be close to:
- 3
--1
--2
 -3
--4

Let's say we have only one empty file (deleted), with p parents. In each
parent, the file was n lines long, but the lines don't have to be the
same.

It clearly looks like an extension/variation of the Longest Common
Subsequence (LCS) problem, but with p sequences (and the common
subsequences doesn't have to be common to all sequences).

LCS dynamic programming is O(n²) in time and space complexity for two
sequences.

We could extend it this way:
After processing two of the p parents, let's find LCS and build a new
optimal list of removals. Then find LCS again between this new list and
the third parent removals. And repeat until we made each parents.

Best-case analysis:
All p parents have the same n lines.
We will find LCS and provide a n lines (the same lines) new list in
O(n²), and then run it again in O(n²) with the next parent, etc.
It will end-up being O(pn²).

Worst-case analysis:
All p parents have no lines in common.
We will find LCS and provide a 2n new list in O(n²).
Then we run it again in O(2n x n), and again O(3n x n), etc, until
O(pn x n).
When we sum these all, we end-up with O(p² x n²)

Does it make any sense ? And is it worth implementing ?

Cheers,
Antoine

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFD] Combine diff improvements
  2013-03-14 22:32 [RFD] Combine diff improvements Antoine Pelisse
@ 2013-03-14 22:52 ` Junio C Hamano
  2013-03-15  5:54 ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2013-03-14 22:52 UTC (permalink / raw)
  To: Antoine Pelisse; +Cc: git

Antoine Pelisse <apelisse@gmail.com> writes:

> I would like to discuss some improvements.

;-)

> The current implementation doesn't look for optimal solution but for the
> easiest/greedy solution.

Yes, you nailed it.

> Does it make any sense? And is it worth implementing?

Probably yes, and definitely.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFD] Combine diff improvements
  2013-03-14 22:32 [RFD] Combine diff improvements Antoine Pelisse
  2013-03-14 22:52 ` Junio C Hamano
@ 2013-03-15  5:54 ` Junio C Hamano
  2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-03-15  5:54 UTC (permalink / raw)
  To: Antoine Pelisse; +Cc: git

Antoine Pelisse <apelisse@gmail.com> writes:

> While if we invert the two parents order we will have:
> - 3
> --1
> --2
> - 4
>  -3
>  -4
>
> While clearly, in both cases, the optimal solution would be close to:
> - 3
> --1
> --2
>  -3
> --4

We should shoot for better readability as the primary goal.  When I
am viewing "--cc" output, I find it easier to read if the lines for
the same parent are consecutive.  That is, I find the right one far
easier to grok than the left one:

     - 1                - 1
      -2                - 3
     - 3                 -2
      -4                 -4

when trying to see what got changed from the first parent (we had 1
and then 3 and whatever comes after this block; 2 and 4 are noise
and it makes it easier to skip them because they are together).

This can probably be done as a post-process phase after you find the
coalescing matches to express the result in the smallest number of
lines, but I thought it is worth mentioning that not all solutions
with the same number of lines are equally good.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] combine-diff: coalesce lost lines optimally
  2013-03-15  5:54 ` Junio C Hamano
@ 2013-03-17 13:03   ` Antoine Pelisse
  2013-03-17 13:58     ` Antoine Pelisse
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Antoine Pelisse @ 2013-03-17 13:03 UTC (permalink / raw)
  To: git; +Cc: Antoine Pelisse, Junio C Hamano

This replaces the greedy implementation to coalesce lost lines by using
dynamic programming to find the Longest Common Subsequence.

The O(n²) time complexity is obviously bigger than previous
implementation but it can produce shorter diff results (and most likely
easier to read).

List of lost lines is now doubly-linked because we reverse-read it when
reading the direction matrix.

Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
---
Hi,
This is a very first draft for improving the way we coalesce lost
lines. It has only been tested with the two scenarios below.

What is left to do:
- Test it more extensively
- Had some tests scenarios

I'm also having a hard time trying it with more than two parents. How I
am supposed to have more than two parents while octopus merge refuses if
there are conflicts ?

Tested scenarios:
git init
>test
git add test
git commit -m initial

git checkout -b side1
seq 10 >>test
git commit -m all -a
git checkout master
seq 1 2 10 >test
git commit -m three -a

git merge side1
>test
git commit -m merge -a
git show

AND

git init
>test
git add test
git commit -m initial

echo "3\n1\n2\n4" >test
git commit -m shuffled -a

git checkout -b side HEAD^
echo "1\n2\n3\n4" >test
git commit -m sorted -a

git merge master
>test
git commit -m merge -a
git show

 combine-diff.c |  192 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 160 insertions(+), 32 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 35d41cd..252dd72 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -73,16 +73,24 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,

 /* Lines lost from parent */
 struct lline {
-	struct lline *next;
+	struct lline *next, *prev;
 	int len;
 	unsigned long parent_map;
 	char line[FLEX_ARRAY];
 };

+/* Lines lost from current parent (before coalescing) */
+struct plost {
+	struct lline *lost_head, *lost_tail;
+	int len;
+};
+
 /* Lines surviving in the merge result */
 struct sline {
-	struct lline *lost_head, **lost_tail;
-	struct lline *next_lost;
+	/* Accumulated and coalesced lost lines */
+	struct lline *lost;
+	int lenlost;
+	struct plost plost;
 	char *bol;
 	int len;
 	/* bit 0 up to (N-1) are on if the parent has this line (i.e.
@@ -94,6 +102,132 @@ struct sline {
 	unsigned long *p_lno;
 };

+enum coalesce_direction { MATCH, BASE, NEW };
+
+/* Coalesce new lines into base by finding LCS */
+static struct lline *coalesce_lines(struct lline *base, int *lenbase,
+				    struct lline *new, int lennew,
+				    unsigned long parent)
+{
+	int **lcs;
+	enum coalesce_direction **directions;
+	struct lline *baseend, *newend;
+	int i, j, origbaselen = *lenbase;
+
+	if (new == NULL)
+		return base;
+
+	if (base == NULL) {
+		*lenbase = lennew;
+		return new;
+	}
+
+	/*
+	 * Coalesce new lines into base by finding the LCS
+	 * - Create the table to run dynamic programing
+	 * - Compute the LCS
+	 * - Then reverse read the direction structure:
+	 *   - If we have MATCH, assign parent to base flag, and consume
+	 *   both baseend and newend
+	 *   - Else if we have BASE, consume baseend
+	 *   - Else if we have NEW, insert newend lline into base and
+	 *   consume newend
+	 */
+	lcs = xcalloc(origbaselen + 1, sizeof(int*));
+	directions = xcalloc(origbaselen + 1, sizeof(enum coalesce_direction*));
+	for (i = 0; i < origbaselen + 1; i++) {
+		lcs[i] = xcalloc(lennew + 1, sizeof(int));
+		directions[i] = xcalloc(lennew + 1, sizeof(enum coalesce_direction));
+		directions[i][0] = BASE;
+	}
+	for (j = 1; j < lennew + 1; j++)
+		directions[0][j] = NEW;
+
+	for (i = 1, baseend = base; i < origbaselen + 1; i++) {
+		for (j = 1, newend = new; j < lennew + 1; j++) {
+			if (baseend->len == newend->len &&
+			    !memcmp(baseend->line, newend->line, baseend->len)) {
+				lcs[i][j] = lcs[i - 1][j - 1] + 1;
+				directions[i][j] = MATCH;
+			} else if (lcs[i][j - 1] >= lcs[i - 1][j]) {
+				lcs[i][j] = lcs[i][j - 1];
+				directions[i][j] = NEW;
+			} else {
+				lcs[i][j] = lcs[i - 1][j];
+				directions[i][j] = BASE;
+			}
+			if (newend->next)
+				newend = newend->next;
+		}
+		if (baseend->next)
+			baseend = baseend->next;
+	}
+
+	for (i = 0; i < origbaselen + 1; i++)
+		free(lcs[i]);
+	free(lcs);
+
+	/* At this point, baseend and newend point to the end of each lists */
+	i--;
+	j--;
+	while (i != 0 || j != 0) {
+		if (directions[i][j] == MATCH) {
+			baseend->parent_map |= 1<<parent;
+			baseend = baseend->prev;
+			newend = newend->prev;
+			i--;
+			j--;
+		} else if (directions[i][j] == NEW) {
+			struct lline *lline;
+
+			lline = newend;
+			/* Remove lline from new list and update newend */
+			if (lline->prev)
+				lline->prev->next = lline->next;
+			else
+				new = lline->next;
+			if (lline->next)
+				lline->next->prev = lline->prev;
+
+			newend = lline->prev;
+			j--;
+
+			/* Add lline to base list */
+			if (baseend) {
+				lline->next = baseend->next;
+				lline->prev = baseend;
+				if (lline->prev)
+					lline->prev->next = lline;
+			}
+			else {
+				lline->next = base;
+				base = lline;
+			}
+			(*lenbase)++;
+
+			if (lline->next)
+				lline->next->prev = lline;
+
+		} else {
+			baseend = baseend->prev;
+			i--;
+		}
+	}
+
+	newend = new;
+	while (newend) {
+		struct lline *lline = newend;
+		newend = newend->next;
+		free(lline);
+	}
+
+	for (i = 0; i < origbaselen + 1; i++)
+		free(directions[i]);
+	free(directions);
+
+	return base;
+}
+
 static char *grab_blob(const unsigned char *sha1, unsigned int mode,
 		       unsigned long *size, struct userdiff_driver *textconv,
 		       const char *path)
@@ -129,29 +263,19 @@ static void append_lost(struct sline *sline, int n, const char *line, int len)
 	if (line[len-1] == '\n')
 		len--;

-	/* Check to see if we can squash things */
-	if (sline->lost_head) {
-		lline = sline->next_lost;
-		while (lline) {
-			if (lline->len == len &&
-			    !memcmp(lline->line, line, len)) {
-				lline->parent_map |= this_mask;
-				sline->next_lost = lline->next;
-				return;
-			}
-			lline = lline->next;
-		}
-	}
-
 	lline = xmalloc(sizeof(*lline) + len + 1);
 	lline->len = len;
 	lline->next = NULL;
+	lline->prev = sline->plost.lost_tail;
+	if (lline->prev)
+		lline->prev->next = lline;
+	else
+		sline->plost.lost_head = lline;
+	sline->plost.lost_tail = lline;
+	sline->plost.len++;
 	lline->parent_map = this_mask;
 	memcpy(lline->line, line, len);
 	lline->line[len] = 0;
-	*sline->lost_tail = lline;
-	sline->lost_tail = &lline->next;
-	sline->next_lost = NULL;
 }

 struct combine_diff_state {
@@ -194,7 +318,6 @@ static void consume_line(void *state_, char *line, unsigned long len)
 				xcalloc(state->num_parent,
 					sizeof(unsigned long));
 		state->sline[state->nb-1].p_lno[state->n] = state->ob;
-		state->lost_bucket->next_lost = state->lost_bucket->lost_head;
 		return;
 	}
 	if (!state->lost_bucket)
@@ -255,8 +378,17 @@ static void combine_diff(const unsigned char *parent, unsigned int mode,
 		struct lline *ll;
 		sline[lno].p_lno[n] = p_lno;

+		/* Coalesce new lines */
+		if (sline[lno].plost.lost_head) {
+			struct sline *sl = &sline[lno];
+			sl->lost = coalesce_lines(sl->lost, &sl->lenlost,
+						  sl->plost.lost_head, sl->plost.len, n);
+			sl->plost.lost_head = sl->plost.lost_tail = NULL;
+			sl->plost.len = 0;
+		}
+
 		/* How many lines would this sline advance the p_lno? */
-		ll = sline[lno].lost_head;
+		ll = sline[lno].lost;
 		while (ll) {
 			if (ll->parent_map & nmask)
 				p_lno++; /* '-' means parent had it */
@@ -276,7 +408,7 @@ static int interesting(struct sline *sline, unsigned long all_mask)
 	/* If some parents lost lines here, or if we have added to
 	 * some parent, it is interesting.
 	 */
-	return ((sline->flag & all_mask) || sline->lost_head);
+	return ((sline->flag & all_mask) || sline->lost);
 }

 static unsigned long adjust_hunk_tail(struct sline *sline,
@@ -459,7 +591,7 @@ static int make_hunks(struct sline *sline, unsigned long cnt,
 		has_interesting = 0;
 		for (j = i; j < hunk_end && !has_interesting; j++) {
 			unsigned long this_diff = sline[j].flag & all_mask;
-			struct lline *ll = sline[j].lost_head;
+			struct lline *ll = sline[j].lost;
 			if (this_diff) {
 				/* This has some changes.  Is it the
 				 * same as others?
@@ -613,7 +745,7 @@ static void dump_sline(struct sline *sline, const char *line_prefix,
 			int j;
 			unsigned long p_mask;
 			struct sline *sl = &sline[lno++];
-			ll = (sl->flag & no_pre_delete) ? NULL : sl->lost_head;
+			ll = (sl->flag & no_pre_delete) ? NULL : sl->lost;
 			while (ll) {
 				printf("%s%s", line_prefix, c_old);
 				for (j = 0; j < num_parent; j++) {
@@ -664,7 +796,7 @@ static void reuse_combine_diff(struct sline *sline, unsigned long cnt,
 	jmask = (1UL<<j);

 	for (lno = 0; lno <= cnt; lno++) {
-		struct lline *ll = sline->lost_head;
+		struct lline *ll = sline->lost;
 		sline->p_lno[i] = sline->p_lno[j];
 		while (ll) {
 			if (ll->parent_map & jmask)
@@ -923,10 +1055,6 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,

 	sline = xcalloc(cnt+2, sizeof(*sline));
 	sline[0].bol = result;
-	for (lno = 0; lno <= cnt + 1; lno++) {
-		sline[lno].lost_tail = &sline[lno].lost_head;
-		sline[lno].flag = 0;
-	}
 	for (lno = 0, cp = result; cp < result + result_size; cp++) {
 		if (*cp == '\n') {
 			sline[lno].len = cp - sline[lno].bol;
@@ -976,8 +1104,8 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 	free(result);

 	for (lno = 0; lno < cnt; lno++) {
-		if (sline[lno].lost_head) {
-			struct lline *ll = sline[lno].lost_head;
+		if (sline[lno].lost) {
+			struct lline *ll = sline[lno].lost;
 			while (ll) {
 				struct lline *tmp = ll;
 				ll = ll->next;
--
1.7.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] combine-diff: coalesce lost lines optimally
  2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
@ 2013-03-17 13:58     ` Antoine Pelisse
  2013-03-17 20:10     ` Junio C Hamano
  2013-03-17 22:37     ` Junio C Hamano
  2 siblings, 0 replies; 9+ messages in thread
From: Antoine Pelisse @ 2013-03-17 13:58 UTC (permalink / raw)
  To: git; +Cc: Antoine Pelisse, Junio C Hamano

> I'm also having a hard time trying it with more than two parents. How I
> am supposed to have more than two parents while octopus merge refuses if
> there are conflicts ?

OK, creating the merge commit myself solves the issue:

git init
>test
git add test
git commit -m initial

seq 100 >test
git commit -m all -a

git checkout -b side1 HEAD^1
seq 1 2 100 >test
git commit -m side1 -a

git checkout -b side2 HEAD^1
seq 1 4 100 >test
git commit -m side2 -a

git checkout -b side3 HEAD^1
seq 1 8 100 >test
git commit -m side3 -a

git checkout -b side4 HEAD^1
seq 1 16 100 >test
git commit -m side4 -a

git checkout master
>test
git add test
TREE=$(git write-tree)
COMMIT=$(git commit-tree $TREE -p master -p side1 -p side2 -p side3 -p
side4 -m merge)
git show $COMMIT

This will work with the basic greedy implementation if all parents are
in this order. But the optimal result will be lost if we change the
order of -p parameters in git-commit-tree.
The patch seems to be correct, always finding the best result (we
always have 100 lines diff) whatever the order of parents.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] combine-diff: coalesce lost lines optimally
  2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
  2013-03-17 13:58     ` Antoine Pelisse
@ 2013-03-17 20:10     ` Junio C Hamano
  2013-03-17 20:37       ` Antoine Pelisse
  2013-03-17 22:37     ` Junio C Hamano
  2 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-03-17 20:10 UTC (permalink / raw)
  To: Antoine Pelisse; +Cc: git

Antoine Pelisse <apelisse@gmail.com> writes:

> This replaces the greedy implementation to coalesce lost lines by using
> dynamic programming to find the Longest Common Subsequence.
>
> The O(n²) time complexity is obviously bigger than previous
> implementation but it can produce shorter diff results (and most likely
> easier to read).
>
> List of lost lines is now doubly-linked because we reverse-read it when
> reading the direction matrix.
>
> Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
> ---
> Hi,
> This is a very first draft for improving the way we coalesce lost
> lines. It has only been tested with the two scenarios below.
>
> What is left to do:
> - Test it more extensively
> - Had some tests scenarios
>
> I'm also having a hard time trying it with more than two parents. How I
> am supposed to have more than two parents while octopus merge refuses if
> there are conflicts ?

9fdb62af92c7 ([ACPI] merge 3549 4320 4485 4588 4980 5483 5651 acpica
asus fops pnpacpi branches into release, 2006-01-24) is one of the
amusing examples ;-)  Cf.

    http://thread.gmane.org/gmane.comp.version-control.git/15486

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] combine-diff: coalesce lost lines optimally
  2013-03-17 20:10     ` Junio C Hamano
@ 2013-03-17 20:37       ` Antoine Pelisse
  0 siblings, 0 replies; 9+ messages in thread
From: Antoine Pelisse @ 2013-03-17 20:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hopefully, my patch takes about the same time as git 1.7.9.5 and
produces the same output on that commit ;)
Unfortunately on a commit that would remove A LOT of lines (10000)
from 7 parents, the times goes from 0.01s to 1.5s... I'm pretty sure
that scenario is quite uncommon though.

On Sun, Mar 17, 2013 at 9:10 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Antoine Pelisse <apelisse@gmail.com> writes:
>
>> This replaces the greedy implementation to coalesce lost lines by using
>> dynamic programming to find the Longest Common Subsequence.
>>
>> The O(n²) time complexity is obviously bigger than previous
>> implementation but it can produce shorter diff results (and most likely
>> easier to read).
>>
>> List of lost lines is now doubly-linked because we reverse-read it when
>> reading the direction matrix.
>>
>> Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
>> ---
>> Hi,
>> This is a very first draft for improving the way we coalesce lost
>> lines. It has only been tested with the two scenarios below.
>>
>> What is left to do:
>> - Test it more extensively
>> - Had some tests scenarios
>>
>> I'm also having a hard time trying it with more than two parents. How I
>> am supposed to have more than two parents while octopus merge refuses if
>> there are conflicts ?
>
> 9fdb62af92c7 ([ACPI] merge 3549 4320 4485 4588 4980 5483 5651 acpica
> asus fops pnpacpi branches into release, 2006-01-24) is one of the
> amusing examples ;-)  Cf.
>
>     http://thread.gmane.org/gmane.comp.version-control.git/15486

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] combine-diff: coalesce lost lines optimally
  2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
  2013-03-17 13:58     ` Antoine Pelisse
  2013-03-17 20:10     ` Junio C Hamano
@ 2013-03-17 22:37     ` Junio C Hamano
  2013-03-23 17:23       ` [PATCH v2] " Antoine Pelisse
  2 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2013-03-17 22:37 UTC (permalink / raw)
  To: Antoine Pelisse; +Cc: git

Antoine Pelisse <apelisse@gmail.com> writes:

> +/* Coalesce new lines into base by finding LCS */
> +static struct lline *coalesce_lines(struct lline *base, int *lenbase,
> +				    struct lline *new, int lennew,
> +				    unsigned long parent)
> +{

Don't you want to pass flags so that you can use match_string_spaces()
at places like this:

> +	for (i = 1, baseend = base; i < origbaselen + 1; i++) {
> +		for (j = 1, newend = new; j < lennew + 1; j++) {
> +			if (baseend->len == newend->len &&
> +			    !memcmp(baseend->line, newend->line, baseend->len)) {

IOW, it looks to me that this wants to be rebuilt on top of
fa04ae0be8cc (Allow combined diff to ignore white-spaces,
2013-03-14).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2] combine-diff: coalesce lost lines optimally
  2013-03-17 22:37     ` Junio C Hamano
@ 2013-03-23 17:23       ` Antoine Pelisse
  0 siblings, 0 replies; 9+ messages in thread
From: Antoine Pelisse @ 2013-03-23 17:23 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Antoine Pelisse

This replaces the greedy implementation to coalesce lost lines by using
dynamic programming to find the Longest Common Subsequence.

The O(n²) time complexity is obviously bigger than previous
implementation but it can produce shorter diff results (and most likely
easier to read).

List of lost lines is now doubly-linked because we reverse-read it when
reading the direction matrix.

Signed-off-by: Antoine Pelisse <apelisse@gmail.com>
---
Hey,
This version includes some tests and is based on
ap/combine-diff-ignore-whitespace.
As you can see the last test is broken because the solution is not
optimal for more than two parents. It would probably require to extend
the dynamic programming to a k-dimension matrix (for k parents) but the
result would end-up being O(n^k) (when removing n consecutives lines
from p parents). I'm not sure there is any better solution known yet to
the k-LCS problem.
Implementing the dynamic solution with the k-dimension matrix would
probably require to re-hash the strings (I guess it's already done by
xdiff), as the number of string comparisons would increase.

Anyway I'm not exactly sure where to go from here :)

Cheers,
Antoine

 combine-diff.c           |  255 ++++++++++++++++++++++++++++++++++------------
 t/t4038-diff-combined.sh |  129 +++++++++++++++++++++++
 2 files changed, 320 insertions(+), 64 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 6288485..77d7872 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -74,16 +74,24 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,

 /* Lines lost from parent */
 struct lline {
-	struct lline *next;
+	struct lline *next, *prev;
 	int len;
 	unsigned long parent_map;
 	char line[FLEX_ARRAY];
 };

+/* Lines lost from current parent (before coalescing) */
+struct plost {
+	struct lline *lost_head, *lost_tail;
+	int len;
+};
+
 /* Lines surviving in the merge result */
 struct sline {
-	struct lline *lost_head, **lost_tail;
-	struct lline *next_lost;
+	/* Accumulated and coalesced lost lines */
+	struct lline *lost;
+	int lenlost;
+	struct plost plost;
 	char *bol;
 	int len;
 	/* bit 0 up to (N-1) are on if the parent has this line (i.e.
@@ -95,34 +103,6 @@ struct sline {
 	unsigned long *p_lno;
 };

-static char *grab_blob(const unsigned char *sha1, unsigned int mode,
-		       unsigned long *size, struct userdiff_driver *textconv,
-		       const char *path)
-{
-	char *blob;
-	enum object_type type;
-
-	if (S_ISGITLINK(mode)) {
-		blob = xmalloc(100);
-		*size = snprintf(blob, 100,
-				 "Subproject commit %s\n", sha1_to_hex(sha1));
-	} else if (is_null_sha1(sha1)) {
-		/* deleted blob */
-		*size = 0;
-		return xcalloc(1, 1);
-	} else if (textconv) {
-		struct diff_filespec *df = alloc_filespec(path);
-		fill_filespec(df, sha1, 1, mode);
-		*size = fill_textconv(textconv, df, &blob);
-		free_filespec(df);
-	} else {
-		blob = read_sha1_file(sha1, &type, size);
-		if (type != OBJ_BLOB)
-			die("object '%s' is not a blob!", sha1_to_hex(sha1));
-	}
-	return blob;
-}
-
 static int match_string_spaces(const char *line1, int len1,
 			       const char *line2, int len2,
 			       long flags)
@@ -163,36 +143,180 @@ static int match_string_spaces(const char *line1, int len1,
 	return 0;
 }

-static void append_lost(struct sline *sline, int n, const char *line, int len, long flags)
+enum coalesce_direction { MATCH, BASE, NEW };
+
+/* Coalesce new lines into base by finding LCS */
+static struct lline *coalesce_lines(struct lline *base, int *lenbase,
+				    struct lline *new, int lennew,
+				    unsigned long parent, long flags)
 {
-	struct lline *lline;
-	unsigned long this_mask = (1UL<<n);
-	if (line[len-1] == '\n')
-		len--;
+	int **lcs;
+	enum coalesce_direction **directions;
+	struct lline *baseend, *newend = NULL;
+	int i, j, origbaselen = *lenbase;

-	/* Check to see if we can squash things */
-	if (sline->lost_head) {
-		lline = sline->next_lost;
-		while (lline) {
-			if (match_string_spaces(lline->line, lline->len,
-						line, len, flags)) {
-				lline->parent_map |= this_mask;
-				sline->next_lost = lline->next;
-				return;
+	if (new == NULL)
+		return base;
+
+	if (base == NULL) {
+		*lenbase = lennew;
+		return new;
+	}
+
+	/*
+	 * Coalesce new lines into base by finding the LCS
+	 * - Create the table to run dynamic programing
+	 * - Compute the LCS
+	 * - Then reverse read the direction structure:
+	 *   - If we have MATCH, assign parent to base flag, and consume
+	 *   both baseend and newend
+	 *   - Else if we have BASE, consume baseend
+	 *   - Else if we have NEW, insert newend lline into base and
+	 *   consume newend
+	 */
+	lcs = xcalloc(origbaselen + 1, sizeof(int*));
+	directions = xcalloc(origbaselen + 1, sizeof(enum coalesce_direction*));
+	for (i = 0; i < origbaselen + 1; i++) {
+		lcs[i] = xcalloc(lennew + 1, sizeof(int));
+		directions[i] = xcalloc(lennew + 1, sizeof(enum coalesce_direction));
+		directions[i][0] = BASE;
+	}
+	for (j = 1; j < lennew + 1; j++)
+		directions[0][j] = NEW;
+
+	for (i = 1, baseend = base; i < origbaselen + 1; i++) {
+		for (j = 1, newend = new; j < lennew + 1; j++) {
+			if (match_string_spaces(baseend->line, baseend->len,
+						newend->line, newend->len, flags)) {
+				lcs[i][j] = lcs[i - 1][j - 1] + 1;
+				directions[i][j] = MATCH;
+			} else if (lcs[i][j - 1] >= lcs[i - 1][j]) {
+				lcs[i][j] = lcs[i][j - 1];
+				directions[i][j] = NEW;
+			} else {
+				lcs[i][j] = lcs[i - 1][j];
+				directions[i][j] = BASE;
+			}
+			if (newend->next)
+				newend = newend->next;
+		}
+		if (baseend->next)
+			baseend = baseend->next;
+	}
+
+	for (i = 0; i < origbaselen + 1; i++)
+		free(lcs[i]);
+	free(lcs);
+
+	/* At this point, baseend and newend point to the end of each lists */
+	i--;
+	j--;
+	while (i != 0 || j != 0) {
+		if (directions[i][j] == MATCH) {
+			baseend->parent_map |= 1<<parent;
+			baseend = baseend->prev;
+			newend = newend->prev;
+			i--;
+			j--;
+		} else if (directions[i][j] == NEW) {
+			struct lline *lline;
+
+			lline = newend;
+			/* Remove lline from new list and update newend */
+			if (lline->prev)
+				lline->prev->next = lline->next;
+			else
+				new = lline->next;
+			if (lline->next)
+				lline->next->prev = lline->prev;
+
+			newend = lline->prev;
+			j--;
+
+			/* Add lline to base list */
+			if (baseend) {
+				lline->next = baseend->next;
+				lline->prev = baseend;
+				if (lline->prev)
+					lline->prev->next = lline;
 			}
-			lline = lline->next;
+			else {
+				lline->next = base;
+				base = lline;
+			}
+			(*lenbase)++;
+
+			if (lline->next)
+				lline->next->prev = lline;
+
+		} else {
+			baseend = baseend->prev;
+			i--;
 		}
 	}

+	newend = new;
+	while (newend) {
+		struct lline *lline = newend;
+		newend = newend->next;
+		free(lline);
+	}
+
+	for (i = 0; i < origbaselen + 1; i++)
+		free(directions[i]);
+	free(directions);
+
+	return base;
+}
+
+static char *grab_blob(const unsigned char *sha1, unsigned int mode,
+		       unsigned long *size, struct userdiff_driver *textconv,
+		       const char *path)
+{
+	char *blob;
+	enum object_type type;
+
+	if (S_ISGITLINK(mode)) {
+		blob = xmalloc(100);
+		*size = snprintf(blob, 100,
+				 "Subproject commit %s\n", sha1_to_hex(sha1));
+	} else if (is_null_sha1(sha1)) {
+		/* deleted blob */
+		*size = 0;
+		return xcalloc(1, 1);
+	} else if (textconv) {
+		struct diff_filespec *df = alloc_filespec(path);
+		fill_filespec(df, sha1, 1, mode);
+		*size = fill_textconv(textconv, df, &blob);
+		free_filespec(df);
+	} else {
+		blob = read_sha1_file(sha1, &type, size);
+		if (type != OBJ_BLOB)
+			die("object '%s' is not a blob!", sha1_to_hex(sha1));
+	}
+	return blob;
+}
+
+static void append_lost(struct sline *sline, int n, const char *line, int len)
+{
+	struct lline *lline;
+	unsigned long this_mask = (1UL<<n);
+	if (line[len-1] == '\n')
+		len--;
+
 	lline = xmalloc(sizeof(*lline) + len + 1);
 	lline->len = len;
 	lline->next = NULL;
+	lline->prev = sline->plost.lost_tail;
+	if (lline->prev)
+		lline->prev->next = lline;
+	else
+		sline->plost.lost_head = lline;
+	sline->plost.lost_tail = lline;
+	sline->plost.len++;
 	lline->parent_map = this_mask;
 	memcpy(lline->line, line, len);
 	lline->line[len] = 0;
-	*sline->lost_tail = lline;
-	sline->lost_tail = &lline->next;
-	sline->next_lost = NULL;
 }

 struct combine_diff_state {
@@ -203,7 +327,6 @@ struct combine_diff_state {
 	int n;
 	struct sline *sline;
 	struct sline *lost_bucket;
-	long flags;
 };

 static void consume_line(void *state_, char *line, unsigned long len)
@@ -236,14 +359,13 @@ static void consume_line(void *state_, char *line, unsigned long len)
 				xcalloc(state->num_parent,
 					sizeof(unsigned long));
 		state->sline[state->nb-1].p_lno[state->n] = state->ob;
-		state->lost_bucket->next_lost = state->lost_bucket->lost_head;
 		return;
 	}
 	if (!state->lost_bucket)
 		return; /* not in any hunk yet */
 	switch (line[0]) {
 	case '-':
-		append_lost(state->lost_bucket, state->n, line+1, len-1, state->flags);
+		append_lost(state->lost_bucket, state->n, line+1, len-1);
 		break;
 	case '+':
 		state->sline[state->lno-1].flag |= state->nmask;
@@ -276,7 +398,6 @@ static void combine_diff(const unsigned char *parent, unsigned int mode,
 	xpp.flags = flags;
 	memset(&xecfg, 0, sizeof(xecfg));
 	memset(&state, 0, sizeof(state));
-	state.flags = flags;
 	state.nmask = nmask;
 	state.sline = sline;
 	state.lno = 1;
@@ -298,8 +419,18 @@ static void combine_diff(const unsigned char *parent, unsigned int mode,
 		struct lline *ll;
 		sline[lno].p_lno[n] = p_lno;

+		/* Coalesce new lines */
+		if (sline[lno].plost.lost_head) {
+			struct sline *sl = &sline[lno];
+			sl->lost = coalesce_lines(sl->lost, &sl->lenlost,
+						  sl->plost.lost_head,
+						  sl->plost.len, n, flags);
+			sl->plost.lost_head = sl->plost.lost_tail = NULL;
+			sl->plost.len = 0;
+		}
+
 		/* How many lines would this sline advance the p_lno? */
-		ll = sline[lno].lost_head;
+		ll = sline[lno].lost;
 		while (ll) {
 			if (ll->parent_map & nmask)
 				p_lno++; /* '-' means parent had it */
@@ -319,7 +450,7 @@ static int interesting(struct sline *sline, unsigned long all_mask)
 	/* If some parents lost lines here, or if we have added to
 	 * some parent, it is interesting.
 	 */
-	return ((sline->flag & all_mask) || sline->lost_head);
+	return ((sline->flag & all_mask) || sline->lost);
 }

 static unsigned long adjust_hunk_tail(struct sline *sline,
@@ -502,7 +633,7 @@ static int make_hunks(struct sline *sline, unsigned long cnt,
 		has_interesting = 0;
 		for (j = i; j < hunk_end && !has_interesting; j++) {
 			unsigned long this_diff = sline[j].flag & all_mask;
-			struct lline *ll = sline[j].lost_head;
+			struct lline *ll = sline[j].lost;
 			if (this_diff) {
 				/* This has some changes.  Is it the
 				 * same as others?
@@ -656,7 +787,7 @@ static void dump_sline(struct sline *sline, const char *line_prefix,
 			int j;
 			unsigned long p_mask;
 			struct sline *sl = &sline[lno++];
-			ll = (sl->flag & no_pre_delete) ? NULL : sl->lost_head;
+			ll = (sl->flag & no_pre_delete) ? NULL : sl->lost;
 			while (ll) {
 				printf("%s%s", line_prefix, c_old);
 				for (j = 0; j < num_parent; j++) {
@@ -707,7 +838,7 @@ static void reuse_combine_diff(struct sline *sline, unsigned long cnt,
 	jmask = (1UL<<j);

 	for (lno = 0; lno <= cnt; lno++) {
-		struct lline *ll = sline->lost_head;
+		struct lline *ll = sline->lost;
 		sline->p_lno[i] = sline->p_lno[j];
 		while (ll) {
 			if (ll->parent_map & jmask)
@@ -966,10 +1097,6 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,

 	sline = xcalloc(cnt+2, sizeof(*sline));
 	sline[0].bol = result;
-	for (lno = 0; lno <= cnt + 1; lno++) {
-		sline[lno].lost_tail = &sline[lno].lost_head;
-		sline[lno].flag = 0;
-	}
 	for (lno = 0, cp = result; cp < result + result_size; cp++) {
 		if (*cp == '\n') {
 			sline[lno].len = cp - sline[lno].bol;
@@ -1019,8 +1146,8 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
 	free(result);

 	for (lno = 0; lno < cnt; lno++) {
-		if (sline[lno].lost_head) {
-			struct lline *ll = sline[lno].lost_head;
+		if (sline[lno].lost) {
+			struct lline *ll = sline[lno].lost;
 			while (ll) {
 				struct lline *tmp = ll;
 				ll = ll->next;
diff --git a/t/t4038-diff-combined.sh b/t/t4038-diff-combined.sh
index b7e16a7..1261dbb 100755
--- a/t/t4038-diff-combined.sh
+++ b/t/t4038-diff-combined.sh
@@ -224,4 +224,133 @@ test_expect_success 'check combined output (ignore all spaces)' '
 	compare_diff_patch expected actual
 '

+test_expect_success 'combine diff coalesce simple' '
+	>test &&
+	git add test &&
+	git commit -m initial &&
+	test_seq 4 >test &&
+	git commit -a -m empty1 &&
+	git branch side1 &&
+	git checkout HEAD^ &&
+	test_seq 5 >test &&
+	git commit -a -m empty2 &&
+	test_must_fail git merge side1 &&
+	>test &&
+	git commit -a -m merge &&
+	git show >actual.tmp &&
+	sed -e "1,/^@@@/d" < actual.tmp >actual &&
+	tr -d Q <<-\EOF >expected &&
+	--1
+	--2
+	--3
+	--4
+	- 5
+	EOF
+	compare_diff_patch expected actual
+'
+
+test_expect_success 'combine diff coalesce tricky' '
+	>test &&
+	git add test &&
+	git commit -m initial --allow-empty &&
+	cat <<-\EOF >test &&
+	3
+	1
+	2
+	3
+	4
+	EOF
+	git commit -a -m empty1 &&
+	git branch -f side1 &&
+	git checkout HEAD^ &&
+	cat <<-\EOF >test &&
+	1
+	3
+	5
+	4
+	EOF
+	git commit -a -m empty2 &&
+	git branch -f side2 &&
+	test_must_fail git merge side1 &&
+	>test &&
+	git commit -a -m merge &&
+	git show >actual.tmp &&
+	sed -e "1,/^@@@/d" < actual.tmp >actual &&
+	tr -d Q <<-\EOF >expected &&
+	 -3
+	--1
+	 -2
+	--3
+	- 5
+	--4
+	EOF
+	compare_diff_patch expected actual &&
+	git checkout -f side1 &&
+	test_must_fail git merge side2 &&
+	>test &&
+	git commit -a -m merge &&
+	git show >actual.tmp &&
+	sed -e "1,/^@@@/d" < actual.tmp >actual &&
+	tr -d Q <<-\EOF >expected &&
+	- 3
+	--1
+	- 2
+	--3
+	 -5
+	--4
+	EOF
+	compare_diff_patch expected actual
+'
+
+test_expect_failure 'combine diff coalesce three parents' '
+	>test &&
+	git add test &&
+	git commit -m initial --allow-empty &&
+	cat <<-\EOF >test &&
+	3
+	1
+	2
+	3
+	4
+	EOF
+	git commit -a -m empty1 &&
+	git checkout -B side1 &&
+	git checkout HEAD^ &&
+	cat <<-\EOF >test &&
+	1
+	3
+	7
+	5
+	4
+	EOF
+	git commit -a -m empty2 &&
+	git branch -f side2 &&
+	git checkout HEAD^ &&
+	cat <<-\EOF >test &&
+	3
+	1
+	6
+	5
+	4
+	EOF
+	git commit -a -m empty3 &&
+	>test &&
+	git add test &&
+	TREE=$(git write-tree) &&
+	COMMIT=$(git commit-tree -p HEAD -p side1 -p side2 -m merge $TREE) &&
+	git show $COMMIT >actual.tmp &&
+	sed -e "1,/^@@@/d" < actual.tmp >actual &&
+	tr -d Q <<-\EOF >expected &&
+	-- 3
+	---1
+	-  6
+	 - 2
+	 --3
+	  -7
+	- -5
+	---4
+	EOF
+	compare_diff_patch expected actual
+'
+
 test_done
--
1.7.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-03-23 17:24 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-14 22:32 [RFD] Combine diff improvements Antoine Pelisse
2013-03-14 22:52 ` Junio C Hamano
2013-03-15  5:54 ` Junio C Hamano
2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
2013-03-17 13:58     ` Antoine Pelisse
2013-03-17 20:10     ` Junio C Hamano
2013-03-17 20:37       ` Antoine Pelisse
2013-03-17 22:37     ` Junio C Hamano
2013-03-23 17:23       ` [PATCH v2] " Antoine Pelisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).