git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Antoine Pelisse <apelisse@gmail.com>
To: git <git@vger.kernel.org>
Subject: [RFD] Combine diff improvements
Date: Thu, 14 Mar 2013 23:32:35 +0100	[thread overview]
Message-ID: <CALWbr2zY3uo==QTd69t1eXNS4-CX1w3T-AaMjOLVmxj-SMJyvg@mail.gmail.com> (raw)

Hi,
I've been working on combine diff recently and have seen that the current
implementation does not produce optimal results for coalescing lost lines.

I would like to discuss some improvements.

I think we can define an optimal solution as the shortest diff
output. That is that we coalesced as much line as we could.
It means that if each parent has the same file removed, the output will
be n lines long (all lines coalesce to each others).

The current implementation doesn't look for optimal solution but for the
easiest/greedy solution. It also stops matching with a parent if it
can't find a match. As an example, it means that losing [1, 2, 3, 4] from
parent A, and [3, 1, 2, 4] from parent B would give:
- 1
- 2
--3
- 4
 -1
 -2
 -4

While if we invert the two parents order we will have:
- 3
--1
--2
- 4
 -3
 -4

While clearly, in both cases, the optimal solution would be close to:
- 3
--1
--2
 -3
--4

Let's say we have only one empty file (deleted), with p parents. In each
parent, the file was n lines long, but the lines don't have to be the
same.

It clearly looks like an extension/variation of the Longest Common
Subsequence (LCS) problem, but with p sequences (and the common
subsequences doesn't have to be common to all sequences).

LCS dynamic programming is O(n²) in time and space complexity for two
sequences.

We could extend it this way:
After processing two of the p parents, let's find LCS and build a new
optimal list of removals. Then find LCS again between this new list and
the third parent removals. And repeat until we made each parents.

Best-case analysis:
All p parents have the same n lines.
We will find LCS and provide a n lines (the same lines) new list in
O(n²), and then run it again in O(n²) with the next parent, etc.
It will end-up being O(pn²).

Worst-case analysis:
All p parents have no lines in common.
We will find LCS and provide a 2n new list in O(n²).
Then we run it again in O(2n x n), and again O(3n x n), etc, until
O(pn x n).
When we sum these all, we end-up with O(p² x n²)

Does it make any sense ? And is it worth implementing ?

Cheers,
Antoine

             reply	other threads:[~2013-03-14 22:33 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-14 22:32 Antoine Pelisse [this message]
2013-03-14 22:52 ` [RFD] Combine diff improvements Junio C Hamano
2013-03-15  5:54 ` Junio C Hamano
2013-03-17 13:03   ` [PATCH] combine-diff: coalesce lost lines optimally Antoine Pelisse
2013-03-17 13:58     ` Antoine Pelisse
2013-03-17 20:10     ` Junio C Hamano
2013-03-17 20:37       ` Antoine Pelisse
2013-03-17 22:37     ` Junio C Hamano
2013-03-23 17:23       ` [PATCH v2] " Antoine Pelisse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALWbr2zY3uo==QTd69t1eXNS4-CX1w3T-AaMjOLVmxj-SMJyvg@mail.gmail.com' \
    --to=apelisse@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).