* Comparing rebase --am with --interactive via p3400
@ 2019-02-01  6:04 Johannes Schindelin
  2019-02-01  7:22 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Johannes Schindelin @ 2019-02-01  6:04 UTC
  To: Elijah Newren; +Cc: git

Hi Elijah,

as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
--am backend) and then with --keep-empty to force the interactive backend
to be used. Here are the best of 10, on my relatively powerful Windows 10
laptop, with current `master`.

With regular rebase --am:

3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)

with --keep-empty to force the interactive backend:

3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)

I then changed it to -m to test the current scripted version, trying to
let it run overnight, but my laptop eventually went to sleep and the tests
were not even done. I'll let them continue and report back.

My conclusion after seeing these numbers is: the interactive rebase is
really close to the performance of the --am backend. So to me, it makes a
whole lot of sense to switch --merge over to it, and to make --merge the
default. We still should investigate why the split-index performance is so
significantly worse, though.
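
For anyone reading the tables cold: each figure is printed by the perf suite as elapsed(user+sys), in seconds. What follows is only a small Python sketch for comparing the two runs above; `parse_timing` and `relative_change` are names made up for this illustration and are not part of git's perf tooling.

```python
import re

# Timings copied from the two tables above, in the form elapsed(user+sys).
AM = {
    "3400.2": "5.32(0.06+0.15)",
    "3400.4": "33.08(0.04+0.18)",
    "3400.6": "30.29(0.03+0.18)",
}
INTERACTIVE = {
    "3400.2": "3.92(0.03+0.18)",
    "3400.4": "33.92(0.03+0.22)",
    "3400.6": "38.82(0.03+0.16)",
}


def parse_timing(text):
    """Split 'elapsed(user+sys)' into three floats (hypothetical helper)."""
    match = re.fullmatch(r"([\d.]+)\(([\d.]+)\+([\d.]+)\)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    return tuple(float(group) for group in match.groups())


def relative_change(baseline, candidate):
    """Relative change in elapsed time; a positive value means slower."""
    base_elapsed = parse_timing(baseline)[0]
    cand_elapsed = parse_timing(candidate)[0]
    return (cand_elapsed - base_elapsed) / base_elapsed


for test in sorted(AM):
    change = relative_change(AM[test], INTERACTIVE[test])
    print(f"{test}: {change:+.1%}")
```

On these numbers the interactive backend is roughly a quarter faster on 3400.2, about even on 3400.4, and a bit under 30% slower on 3400.6, which is the split-index gap mentioned above.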

Ciao,
Dscho


* Re: Comparing rebase --am with --interactive via p3400
  2019-02-01  6:04 Comparing rebase --am with --interactive via p3400 Johannes Schindelin
@ 2019-02-01  7:22 ` Johannes Schindelin
  2019-02-01  9:26 ` Elijah Newren
  2019-12-27 21:11 ` Alban Gruin
  2 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2019-02-01  7:22 UTC
  To: Elijah Newren; +Cc: git

Hi Elijah,

On Fri, 1 Feb 2019, Johannes Schindelin wrote:

> as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
> --am backend) and then with --keep-empty to force the interactive backend
> to be used. Here are the best of 10, on my relatively powerful Windows 10
> laptop, with current `master`.
> 
> With regular rebase --am:
> 
> 3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
> 3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
> 3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)
> 
> with --keep-empty to force the interactive backend:
> 
> 3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
> 3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
> 3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)
> 
> I then changed it to -m to test the current scripted version, trying to
> let it run overnight, but my laptop eventually went to sleep and the tests
> were not even done. I'll let them continue and report back.

It finally finished:

3400.2: rebase on top of a lot of unrelated changes             7.37(0.09+0.19) 
3400.4: rebase a lot of unrelated changes without split-index 393.96(0.04+0.15)
3400.6: rebase a lot of unrelated changes with split-index    404.65(0.01+0.24)

So there is a seemingly significant cost to using the split-index that is
just very unfortunate. In any case, just switching from the scripted
--merge backend to the built-in interactive backend results in a >10x
faster execution. So I *definitely* want that scripted `--merge` backend
to go away. Thank you for doing this.
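
To make the ">10x" claim concrete, here is a back-of-the-envelope check in Python. It uses only the elapsed times reported in this thread; the dictionary names are illustrative.

```python
# Elapsed seconds copied from the tables in this thread.
scripted_merge = {"3400.2": 7.37, "3400.4": 393.96, "3400.6": 404.65}
builtin_interactive = {"3400.2": 3.92, "3400.4": 33.92, "3400.6": 38.82}

# Speedup factor of the built-in interactive backend over scripted --merge.
speedup = {
    test: scripted_merge[test] / builtin_interactive[test]
    for test in scripted_merge
}

for test, factor in sorted(speedup.items()):
    print(f"{test}: {factor:.1f}x")
```

By this arithmetic the >10x figure holds for 3400.4 and 3400.6 (roughly 11.6x and 10.4x); for 3400.2 the gap is closer to 1.9x.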

Ciao,
Dscho


* Re: Comparing rebase --am with --interactive via p3400
  2019-02-01  6:04 Comparing rebase --am with --interactive via p3400 Johannes Schindelin
  2019-02-01  7:22 ` Johannes Schindelin
@ 2019-02-01  9:26 ` Elijah Newren
  2019-12-27 21:11 ` Alban Gruin
  2 siblings, 0 replies; 11+ messages in thread
From: Elijah Newren @ 2019-02-01  9:26 UTC
  To: Johannes Schindelin; +Cc: Git Mailing List

Hi Dscho,

On Thu, Jan 31, 2019 at 10:04 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
> as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
> --am backend) and then with --keep-empty to force the interactive backend
> to be used. Here are the best of 10, on my relatively powerful Windows 10
> laptop, with current `master`.
>
> With regular rebase --am:
>
> 3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
> 3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
> 3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)
>
> with --keep-empty to force the interactive backend:
>
> 3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
> 3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
> 3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)

Awesome, thanks for checking that out.  I ran on both linux and mac
and saw similar relative performances.  Comparing am-based rebase to
an implied-interactive rebase on both linux and mac (with a version of
git including en/rebase-merge-on-sequencer so that -m gives the same
performance that you'd see with --keep-empty), I saw:

On Linux:

am-based rebase (without -m):

3400.2: rebase on top of a lot of unrelated changes             1.87(1.64+0.21)
3400.4: rebase a lot of unrelated changes without split-index   7.87(6.24+1.00)
3400.6: rebase a lot of unrelated changes with split-index      5.99(5.05+0.67)

interactive-machinery rebase (with -m):

3400.2: rebase on top of a lot of unrelated changes             1.80(1.60+0.19)
3400.4: rebase a lot of unrelated changes without split-index   6.78(5.70+0.91)
3400.6: rebase a lot of unrelated changes with split-index      6.92(5.70+0.89)


On Mac:

am-based rebase (without -m):

Test                                                            this tree
-------------------------------------------------------------------------------
3400.2: rebase on top of a lot of unrelated changes             2.68(1.68+0.68)
3400.4: rebase a lot of unrelated changes without split-index   8.89(5.86+2.94)
3400.6: rebase a lot of unrelated changes with split-index      7.87(5.35+2.51)


interactive-machinery rebase (with -m):

Test                                                            this tree
-------------------------------------------------------------------------------
3400.2: rebase on top of a lot of unrelated changes             1.99(1.61+0.77)
3400.4: rebase a lot of unrelated changes without split-index   8.63(5.38+3.38)
3400.6: rebase a lot of unrelated changes with split-index      9.36(5.53+3.95)

> I then changed it to -m to test the current scripted version, trying to
> let it run overnight, but my laptop eventually went to sleep and the tests
> were not even done. I'll let them continue and report back.
>
> My conclusion after seeing these numbers is: the interactive rebase is
> really close to the performance of the --am backend. So to me, it makes a
> whole lot of sense to switch --merge over to it, and to make --merge the
> default. We still should investigate why the split-index performance is so
> significantly worse, though.

Cool, I'll update my patches to make --merge the default (building on
top of en/rebase-merge-on-sequencer) and post it as an RFC.  But yeah,
we should also check into why the split-index performance becomes a
bit worse with such a change.

Thanks,
Elijah


* Re: Comparing rebase --am with --interactive via p3400
  2019-02-01  6:04 Comparing rebase --am with --interactive via p3400 Johannes Schindelin
  2019-02-01  7:22 ` Johannes Schindelin
  2019-02-01  9:26 ` Elijah Newren
@ 2019-12-27 21:11 ` Alban Gruin
  2019-12-27 22:45   ` Elijah Newren
  2020-01-31 21:23   ` Johannes Schindelin
  2 siblings, 2 replies; 11+ messages in thread
From: Alban Gruin @ 2019-12-27 21:11 UTC
  To: Johannes Schindelin, Elijah Newren; +Cc: git


Hi Johannes & Elijah,

On 01/02/2019 at 07:04, Johannes Schindelin wrote:
> Hi Elijah,
> 
> as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
> --am backend) and then with --keep-empty to force the interactive backend
> to be used. Here are the best of 10, on my relatively powerful Windows 10
> laptop, with current `master`.
> 
> With regular rebase --am:
> 
> 3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
> 3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
> 3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)
> 
> with --keep-empty to force the interactive backend:
> 
> 3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
> 3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
> 3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)
> 
> I then changed it to -m to test the current scripted version, trying to
> let it run overnight, but my laptop eventually went to sleep and the tests
> were not even done. I'll let them continue and report back.
> 
> My conclusion after seeing these numbers is: the interactive rebase is
> really close to the performance of the --am backend. So to me, it makes a
> whole lot of sense to switch --merge over to it, and to make --merge the
> default. We still should investigate why the split-index performance is so
> significantly worse, though.
> 
> Ciao,
> Dscho
> 

I investigated this a bit.  From a quick glance at a callgrind trace,
I can see that ce_write_entry() is called 20 601[1] times with `git am',
but 739 802 times with the sequencer when the split-index is enabled.

For reference, here are the timings, measured on my Linux machine, on a
tmpfs, with git.git as the repo:

`rebase --am':
> 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
> 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
> 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
`rebase --quiet':
> 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
> 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
> 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)

This comes from two things:

1. There are not enough shared entries in the index with the sequencer.

do_write_index() is called only by do_write_locked_index() with `--am',
but is also called by write_shared_index() with the sequencer once for
every other commit.  The latter is only called by
write_locked_index(), which means that too_many_not_shared_entries()
returns true for the sequencer, but never for `--am'.
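
To illustrate the threshold at play, here is a simplified Python model of that check. It is not git's C implementation: it assumes the documented default of 20 for `splitIndex.maxPercentChange`, and the function and parameter names are made up for this sketch.

```python
def too_many_not_shared(total_entries, not_shared, max_percent=20):
    """Simplified model of the decision to write a new shared index.

    Returns True when the entries not covered by the shared index exceed
    max_percent of all entries. max_percent=20 is assumed to match the
    default of splitIndex.maxPercentChange; 0 means a new shared index is
    always written, 100 means it never is.
    """
    if max_percent == 0:
        return True
    if max_percent == 100:
        return False
    return total_entries * max_percent < not_shared * 100


# Illustrative sizes only (roughly the cache sizes seen in this thread).
# Nearly every entry is still shared: no new shared index is needed.
print(too_many_not_shared(3700, 5))     # False
# Every entry has lost its link to the shared index: rewrite it.
print(too_many_not_shared(3700, 3700))  # True
```

In this model, an index whose entries no longer point into the shared index crosses the threshold on every write, which is consistent with write_shared_index() showing up for the sequencer but not for `--am'.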

Removing the call to discard_index() in do_pick_commit() (as in the
first attached patch) solves this particular issue, but this would
require a more thorough analysis to see if it is actually safe to do.

After this, ce_write() is still called much more often by the sequencer.

Here are the results of `rebase --quiet' without discarding the index:

> 3400.2: rebase on top of a lot of unrelated changes             0.23(0.19+0.04)
> 3400.4: rebase a lot of unrelated changes without split-index   5.14(4.95+0.18)
> 3400.6: rebase a lot of unrelated changes with split-index      5.02(4.87+0.15)

The performance of the rebase is better in both cases.


2. The base index is dropped by unpack_trees_start() and unpack_trees().

Now, write_shared_index() is no longer called and write_locked_index()
is less expensive than before according to callgrind.  But
ce_write_entry() is still called 749 302 times (which is even more than
before.)

The only place where ce_write_entry() is called is in a loop in
do_write_index().  The number of iterations is dictated by the size of
the cache, and there is a trace2 probe dumping this value.

For `--am', the value goes like this: 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5, 5, … up until 101.

For the sequencer, it goes like this: 1, 1, 3697, 3697, 3698, 3698,
3699, 3699, … up until 3796.

The size of the cache is set in prepare_to_write_split_index().  It
grows if a cache entry has no index (most of them should have one by
now), or if the split index has no base index (with `--am', the split
index always has a base.)  This comes from unpack_trees_start() -- it
creates a new index, and unpack_trees() does not carry the base index,
hence the size of the cache.
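
As a sanity check on the two sequences above, summing the cache sizes reproduces the ce_write_entry() call counts. This is a small Python sketch; it assumes the patterns continue regularly up to the last value (each size four times for `--am', twice for the sequencer), which the "…" in the sequences leaves implicit. The helper names are illustrative.

```python
def am_sizes(last=101):
    """Cache sizes for `--am`: 1, then each of 2..last four times."""
    return [1] + [size for size in range(2, last + 1) for _ in range(4)]


def sequencer_sizes(first=3697, last=3796):
    """Cache sizes for the sequencer: 1, 1, then each of first..last twice."""
    return [1, 1] + [size for size in range(first, last + 1) for _ in range(2)]


# One ce_write_entry() call per cache entry per index write.
print(sum(am_sizes()))         # 20601
print(sum(sequencer_sizes()))  # 749302
```

Under that assumption both sums land exactly on the callgrind counts quoted earlier: 20 601 for `git am', and 749 302 for the sequencer once the index is no longer discarded.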

The second attached patch (which is broken for the non-interactive
rebase case) demonstrates what we could expect for the split-index case
if we fix this:

> 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.03)
> 3400.4: rebase a lot of unrelated changes without split-index   5.81(5.62+0.17)
> 3400.6: rebase a lot of unrelated changes with split-index      4.76(4.54+0.20)

So, for everything related to the index, I think that’s it.

[1] Numbers may vary, but they should remain in the same order of magnitude.

Cheers,
Alban


[-- Attachment #2: sequencer-rebase-si.patch --]
[-- Type: text/x-patch, Size: 317 bytes --]

diff --git a/sequencer.c b/sequencer.c
index 1bee26ebd5..2831abd0fa 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -1863,7 +1863,6 @@ static int do_pick_commit(struct repository *r,
 				       NULL, 0))
 			return error_dirty_index(r, opts);
 	}
-	discard_index(r->index);
 
 	if (!commit->parents)
 		parent = NULL;

[-- Attachment #3: merge-recursive-rebase-si.patch --]
[-- Type: text/x-patch, Size: 1367 bytes --]

diff --git a/merge-recursive.c b/merge-recursive.c
index 11869ad81c..47f67079f3 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -421,7 +421,7 @@ static int unpack_trees_start(struct merge_options *opt,
 {
 	int rc;
 	struct tree_desc t[3];
-	struct index_state tmp_index = { NULL };
+	/* struct index_state tmp_index = { NULL }; */
 
 	memset(&opt->priv->unpack_opts, 0, sizeof(opt->priv->unpack_opts));
 	if (opt->priv->call_depth)
@@ -432,7 +432,7 @@ static int unpack_trees_start(struct merge_options *opt,
 	opt->priv->unpack_opts.head_idx = 2;
 	opt->priv->unpack_opts.fn = threeway_merge;
 	opt->priv->unpack_opts.src_index = opt->repo->index;
-	opt->priv->unpack_opts.dst_index = &tmp_index;
+	opt->priv->unpack_opts.dst_index = opt->repo->index;
 	opt->priv->unpack_opts.aggressive = !merge_detect_rename(opt);
 	setup_unpack_trees_porcelain(&opt->priv->unpack_opts, "merge");
 
@@ -449,8 +449,8 @@ static int unpack_trees_start(struct merge_options *opt,
 	 * saved copy.  (verify_uptodate() checks src_index, and the original
 	 * index is the one that had the necessary modification timestamps.)
 	 */
-	opt->priv->orig_index = *opt->repo->index;
-	*opt->repo->index = tmp_index;
+	/* opt->priv->orig_index = *opt->repo->index; */
+	/* *opt->repo->index = tmp_index; */
 	opt->priv->unpack_opts.src_index = &opt->priv->orig_index;
 
 	return rc;


* Re: Comparing rebase --am with --interactive via p3400
  2019-12-27 21:11 ` Alban Gruin
@ 2019-12-27 22:45   ` Elijah Newren
  2019-12-29 17:25     ` Alban Gruin
  2020-01-31 21:23   ` Johannes Schindelin
  1 sibling, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2019-12-27 22:45 UTC
  To: Alban Gruin; +Cc: Johannes Schindelin, Git Mailing List

Hi Alban,

On Fri, Dec 27, 2019 at 1:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
>
> I investigated this a bit.  From a quick glance at a callgrind trace,
> I can see that ce_write_entry() is called 20 601[1] times with `git am',
> but 739 802 times with the sequencer when the split-index is enabled.

Sweet, thanks for digging in and analyzing this.

> For reference, here are the timings, measured on my Linux machine, on a
> tmpfs, with git.git as the repo:
>
> `rebase --am':
> > 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
> `rebase --quiet':

--quiet?  Isn't that flag supposed to work with both backends and not
imply either one?  We previously used --keep-empty, though there's a
chance that flag means we're not doing a fair comparison (since 'am'
will drop empty commits and thus have less work to do).  Is there any
chance you actually ran a different command, but when you went to
summarize just typed the wrong flag name?  Anyway, the best would
probably be to use --merge here (at the time Johannes and I were
testing, that wouldn't have triggered the sequencer, but it does now),
after first applying the en/rebase-backend series just to make sure
we're doing an apples to apples comparison.  However, I suspect that
empty commits probably weren't much of a factor and you did find some
interesting things...

> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)
>
> This comes from two things:
>
> 1. There are not enough shared entries in the index with the sequencer.
>
> do_write_index() is called only by do_write_locked_index() with `--am',
> but is also called by write_shared_index() with the sequencer once for
> every other commit.  The latter is only called by
> write_locked_index(), which means that too_many_not_shared_entries()
> returns true for the sequencer, but never for `--am'.
>
> Removing the call to discard_index() in do_pick_commit() (as in the
> first attached patch) solves this particular issue, but this would
> require a more thorough analysis to see if it is actually safe to do.

I'm actually surprised the sequencer would call discard_index(); I
would have thought it would have relied on merge_recursive() to do the
necessary index changes and updates other than writing the new index
out.  But I'm not quite as familiar with the sequencer so perhaps
there's some reason I'm unaware of.  (Any chance this is a left-over
from when the sequencer invoked external scripts to do the work, and thus
the index was updated in another process's memory and on disk, and it
had to discard and re-read to get its own process updated?)

> After this, ce_write() is still called much more often by the sequencer.
>
> Here are the results of `rebase --quiet' without discarding the index:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.23(0.19+0.04)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.14(4.95+0.18)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.02(4.87+0.15)
>
> The performance of the rebase is better in both cases.

Nice.  :-)

> 2. The base index is dropped by unpack_trees_start() and unpack_trees().
>
> Now, write_shared_index() is no longer called and write_locked_index()
> is less expensive than before according to callgrind.  But
> ce_write_entry() is still called 749 302 times (which is even more than
> before.)
>
> The only place where ce_write_entry() is called is in a loop in
> do_write_index().  The number of iterations is dictated by the size of
> the cache, and there is a trace2 probe dumping this value.
>
> For `--am', the value goes like this: 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
> 4, 4, 5, 5, 5, 5, … up until 101.
>
> For the sequencer, it goes like this: 1, 1, 3697, 3697, 3698, 3698,
> 3699, 3699, … up until 3796.
>
> The size of the cache is set in prepare_to_write_split_index().  It
> grows if a cache entry has no index (most of them should have one by
> now), or if the split index has no base index (with `--am', the split
> index always has a base.)  This comes from unpack_trees_start() -- it
> creates a new index, and unpack_trees() does not carry the base index,
> hence the size of the cache.
>
> The second attached patch (which is broken for the non-interactive
> rebase case) demonstrates what we could expect for the split-index case
> if we fix this:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.81(5.62+0.17)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.76(4.54+0.20)
>
> So, for everything related to the index, I think that’s it.
>
> [1] Numbers may vary, but they should remain in the same order of magnitude.

Unfortunately, this patch as-is breaks some important things even if
it only shows up in a few testcases.  merge-recursive needs to know
both what the index looked like before the merge started, as well as
what it looks like after unpack-trees runs; see commits 1de70dbd1a
(merge-recursive: fix check for skipability of working tree updates,
2018-04-19) and a35edc84bd (merge-recursive: fix was_tracked() to quit
lying with some renamed paths, 2018-04-19), and maybe a few others
from that series.

But, noting that it comes from the differences in the index as
unpack_trees runs is useful info.  I might be restructuring this code
somewhat significantly but it helps to have this in mind; I may spot
opportunities to do something with it while I'm digging in...

Elijah


* Re: Comparing rebase --am with --interactive via p3400
  2019-12-27 22:45   ` Elijah Newren
@ 2019-12-29 17:25     ` Alban Gruin
  2020-01-02 20:17       ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Alban Gruin @ 2019-12-29 17:25 UTC
  To: Elijah Newren; +Cc: Johannes Schindelin, Git Mailing List

Hi Elijah,

On 27/12/2019 at 23:45, Elijah Newren wrote:
> Hi Alban,
> 
> On Fri, Dec 27, 2019 at 1:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
>>
>> I investigated this a bit.  From a quick glance at a callgrind trace,
>> I can see that ce_write_entry() is called 20 601[1] times with `git am',
>> but 739 802 times with the sequencer when the split-index is enabled.
> 
> Sweet, thanks for digging in and analyzing this.
> 
>> For reference, here are the timings, measured on my Linux machine, on a
>> tmpfs, with git.git as the repo:
>>
>> `rebase --am':
>>> 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
>>> 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
>>> 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
>> `rebase --quiet':
> 
> --quiet?  Isn't that flag supposed to work with both backends and not
> imply either one?  We previously used --keep-empty, though there's a
> chance that flag means we're not doing a fair comparison (since 'am'
> will drop empty commits and thus have less work to do).  Is there any
> chance you actually ran a different command, but when you went to
> summarize just typed the wrong flag name?  Anyway, the best would
> probably be to use --merge here (at the time Johannes and I were
> testing, that wouldn't have triggered the sequencer, but it does now),
> after first applying the en/rebase-backend series just to make sure
> we're doing an apples to apples comparison.  However, I suspect that
> empty commits probably weren't much of a factor and you did find some
> interesting things...
> 

Yes, I did use `--keep-empty' but misremembered it when writing this email…

>>> 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
>>> 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
>>> 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)
>>
>> This comes from two things:
>>
>> 1. There are not enough shared entries in the index with the sequencer.
>>
>> do_write_index() is called only by do_write_locked_index() with `--am',
>> but is also called by write_shared_index() with the sequencer once for
>> every other commit.  The latter is only called by
>> write_locked_index(), which means that too_many_not_shared_entries()
>> returns true for the sequencer, but never for `--am'.
>>
>> Removing the call to discard_index() in do_pick_commit() (as in the
>> first attached patch) solves this particular issue, but this would
>> require a more thorough analysis to see if it is actually safe to do.
> 
> I'm actually surprised the sequencer would call discard_index(); I
> would have thought it would have relied on merge_recursive() to do the
> necessary index changes and updates other than writing the new index
> out.  But I'm not quite as familiar with the sequencer so perhaps
> there's some reason I'm unaware of.  (Any chance this is a left-over
> from when the sequencer invoked external scripts to do the work, and thus
> the index was updated in another process's memory and on disk, and it
> had to discard and re-read to get its own process updated?)
> 

The sequencer re-reads the index after invoking an external command
(either `git checkout', `git merge' or an `exec' command from the todo
list), which makes sense.  But this one seems to come from 6eb1b437933
("cherry-pick/revert: make direct internal call to merge_tree()",
2008-09-02).  So, yes, quite old, and perhaps no longer justified.

I know I had to add another discard_cache() in rebase--interactive.c
because it broke something with the submodules, but it does not seem
all that useful now that rebase.c no longer has to fork to use the
sequencer.

Cheers,
Alban



* Re: Comparing rebase --am with --interactive via p3400
  2019-12-29 17:25     ` Alban Gruin
@ 2020-01-02 20:17       ` Johannes Schindelin
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2020-01-02 20:17 UTC
  To: Alban Gruin; +Cc: Elijah Newren, Git Mailing List


Hi Alban & Elijah,

On Sun, 29 Dec 2019, Alban Gruin wrote:

> Hi Elijah,
>
> On 27/12/2019 at 23:45, Elijah Newren wrote:
> > Hi Alban,
> >
> > On Fri, Dec 27, 2019 at 1:11 PM Alban Gruin <alban.gruin@gmail.com> wrote:
> >>
> >> I investigated this a bit.  From a quick glance at a callgrind trace,
> >> I can see that ce_write_entry() is called 20 601[1] times with `git am',
> >> but 739 802 times with the sequencer when the split-index is enabled.
> >
> > Sweet, thanks for digging in and analyzing this.
> >
> >> For reference, here are the timings, measured on my Linux machine, on a
> >> tmpfs, with git.git as the repo:
> >>
> >> `rebase --am':
> >>> 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
> >>> 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
> >>> 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
> >> `rebase --quiet':
> >
> > --quiet?  Isn't that flag supposed to work with both backends and not
> > imply either one?  We previously used --keep-empty, though there's a
> > chance that flag means we're not doing a fair comparison (since 'am'
> > will drop empty commits and thus have less work to do).  Is there any
> > chance you actually ran a different command, but when you went to
> > summarize just typed the wrong flag name?  Anyway, the best would
> > probably be to use --merge here (at the time Johannes and I were
> > testing, that wouldn't have triggered the sequencer, but it does now),
> > after first applying the en/rebase-backend series just to make sure
> > we're doing an apples to apples comparison.  However, I suspect that
> > empty commits probably weren't much of a factor and you did find some
> > interesting things...
> >
>
> Yes, I did use `--keep-empty' but misremembered it when writing this email…
>
> >>> 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
> >>> 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
> >>> 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)
> >>
> >> This comes from two things:
> >>
> >> 1. There are not enough shared entries in the index with the sequencer.
> >>
> >> do_write_index() is called only by do_write_locked_index() with `--am',
> >> but is also called by write_shared_index() with the sequencer once for
> >> every other commit.  The latter is only called by
> >> write_locked_index(), which means that too_many_not_shared_entries()
> >> returns true for the sequencer, but never for `--am'.
> >>
> >> Removing the call to discard_index() in do_pick_commit() (as in the
> >> first attached patch) solves this particular issue, but this would
> >> require a more thorough analysis to see if it is actually safe to do.
> >
> > I'm actually surprised the sequencer would call discard_index(); I
> > would have thought it would have relied on merge_recursive() to do the
> > necessary index changes and updates other than writing the new index
> > out.  But I'm not quite as familiar with the sequencer so perhaps
> > there's some reason I'm unaware of.  (Any chance this is a left-over
> > from when sequencer invoked external scripts to do the work, and thus
> > the index was updated in another process's memory and on disk, and it
> > had to discard and re-read to get its own process updated?)
> >
>
> The sequencer re-reads the index after invoking an external command
> (either `git checkout', `git merge' or an `exec' command from the todo
> list), which makes sense.  But this one seems to come from 6eb1b437933
> ("cherry-pick/revert: make direct internal call to merge_tree()",
> 2008-09-02).  So, yes, quite old, and perhaps no longer justified.

Right. This commit also moved the `discard_cache()` call out of the
`else` clause of the `if (no_commit)`.

That `else` clause goes all the way back to 9509af686bf (Make git-revert &
git-cherry-pick a builtin, 2007-03-01), and I admit freely that my memory
is no longer fresh on the specifics of this patch.

Looking at that patch, I think I simply discarded the index because a
subsequent code path would spawn the `git merge-recursive` process, which
would have changed the index externally.

> I know I had to add another discard_cache() in rebase--interactive.c
> because it broke something with the submodules, but it does not seem
> all that useful now that rebase.c no longer has to fork to use the
> sequencer.

FWIW I agree. The code is still quite complex at this point, but
infinitely more readable (thank you Elijah for taking point on simplifying
merge-recursive.c!). So I think that it might be the right point in time
to make sure that the index is not re-read and re-discarded over and over
again.

Thanks,
Dscho

* Re: Comparing rebase --am with --interactive via p3400
  2019-12-27 21:11 ` Alban Gruin
  2019-12-27 22:45   ` Elijah Newren
@ 2020-01-31 21:23   ` Johannes Schindelin
  2020-04-01 11:33     ` Alban Gruin
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2020-01-31 21:23 UTC (permalink / raw)
  To: Alban Gruin; +Cc: Elijah Newren, git

Hi Alban,

On Fri, 27 Dec 2019, Alban Gruin wrote:

> Le 01/02/2019 à 07:04, Johannes Schindelin a écrit :
>
> > as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
> > --am backend) and then with --keep-empty to force the interactive backend
> > to be used. Here are the best of 10, on my relatively powerful Windows 10
> > laptop, with current `master`.
> >
> > With regular rebase --am:
> >
> > 3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
> > 3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
> > 3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)
> >
> > with --keep-empty to force the interactive backend:
> >
> > 3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
> > 3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
> > 3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)
> >
> > I then changed it to -m to test the current scripted version, trying to
> > let it run overnight, but my laptop eventually went to sleep and the tests
> > were not even done. I'll let them continue and report back.
> >
> > My conclusion after seeing these numbers is: the interactive rebase is
> > really close to the performance of the --am backend. So to me, it makes a
> > total lot of sense to switch --merge over to it, and to make --merge the
> > default. We still should investigate why the split-index performance is so
> > significantly worse, though.
> >
> > Ciao,
> > Dscho
> >
>
> I investigated a bit on this.  From a quick glance at a callgrind trace,
> I can see that ce_write_entry() is called 20 601[1] times with `git am',
> but 739 802 times with the sequencer when the split-index is enabled.
>
> For reference, here are the timings, measured on my Linux machine, on a
> tmpfs, with git.git as the repo:
>
> `rebase --am':
> > 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
> `rebase --quiet':
> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)
>
> This comes from two things:
>
> 1. There are not enough shared entries in the index with the sequencer.
>
> do_write_index() is called only by do_write_locked_index() with `--am',
> but is also called by write_shared_index() with the sequencer once for
> every other commit.  The latter is only called by
> write_locked_index(), which means that too_many_not_shared_entries()
> returns true for the sequencer, but never for `--am'.
>
> Removing the call to discard_index() in do_pick_commit() (as in the
> first attached patch) solves this particular issue, but this would
> require a more thorough analysis to see if it is actually safe to do.

Indeed. I offered these insights in #git-devel (slightly edited):

This `discard_index()` is in an awfully central location. I am rather
certain that it would cause problems to just remove it.

Looking at `do_merge()`: it explicitly discards and re-reads the index if
we had to spawn a `git merge` process (which we do if a strategy other
than `recursive` was specified, or if it is an octopus merge). But I am
wary of other code paths that might not be as careful.

I see that `do_exec()` is similarly careful.

One thing I cannot fail to notice: we do not re-read a changed index
after running the `prepare-commit-msg` hook, or for that matter, any other
hook. That could even be an old regression from the conversion of the
interactive rebase to using the sequencer rather than a shell script.

Further, `reset_merge()` seems to spawn `git reset --merge` without
bothering to re-read the possibly modified index. Its callers are
`rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
none of which seem to be careful, either, about checking whether the index
was modified in the meantime.

Technically, the in-memory index should also be discarded
in `apply_autostash()`, but so far, I don't think we have any callers of
that function that want to do anything but release resources and exit.

The `run_git_checkout()` function discards, as intended. I
am not quite sure whether it needs to, though, unless the `.git/index`
file _was_ modified (it _is_ possible, after all, to run `git rebase -i
HEAD`, and I do have a use case for that where one of my scripts generates
a todo script, sort of a `git cherry-pick --rebase-merges`, because
`cherry-pick` does not support that mode).

The `continue_single_pick()` function spawns a `git
commit` which could potentially modify the index through a hook, but the
first call site does not care and the second one guards against that
(erroring out...).

My biggest concern is with the `run_git_commit()` function: it does not
re-read a potentially-modified index (think of hooks).

We will need to be very careful with this `discard_index()`, I think, and
in my opinion there is a great opportunity here for cleaning up a little:
discarding and re-reading the in-memory index without checking whether the
on-disk index has changed at all appears a bit wasteful to me.

This could be refactored into a function that only discards and re-reads
the index if the mtime of `.git/index` changed. That function could then
also be taught to detect when the in-memory index has unwritten changes:
that would constitute a bug.
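
A minimal sketch of that idea, written in Python rather than git's C purely
for illustration. The `IndexCache` class, its `dirty` flag and the `reads`
counter are hypothetical stand-ins, not git's `struct index_state` or any
existing sequencer helper. It re-reads the file only when the recorded mtime
differs, and treats unwritten in-memory changes at refresh time as a bug:

```python
import os
import tempfile


class IndexCache:
    """Hypothetical in-memory copy of an on-disk index file (illustration only)."""

    def __init__(self, path):
        self.path = path
        self.data = None
        self.mtime_ns = None  # mtime recorded when the file was last read
        self.dirty = False    # True while in-memory changes are not yet written
        self.reads = 0        # how often the file was actually (re-)read

    def _load(self):
        with open(self.path, "rb") as f:
            # Take the mtime from the open file so it matches what we read.
            self.mtime_ns = os.fstat(f.fileno()).st_mtime_ns
            self.data = f.read()
        self.reads += 1

    def refresh_if_changed(self):
        """Discard and re-read only if the on-disk mtime differs."""
        if self.dirty:
            # A re-read would silently drop unwritten changes: a caller bug.
            raise RuntimeError("BUG: in-memory index has unwritten changes")
        if self.mtime_ns is None or os.stat(self.path).st_mtime_ns != self.mtime_ns:
            self._load()
            return True
        return False


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index")
        with open(path, "wb") as f:
            f.write(b"v1")
        cache = IndexCache(path)
        first = cache.refresh_if_changed()   # initial read
        second = cache.refresh_if_changed()  # unchanged on disk: skipped
        # Simulate an external process rewriting the file two seconds later.
        with open(path, "wb") as f:
            f.write(b"v2")
        later = cache.mtime_ns + 2_000_000_000
        os.utime(path, ns=(later, later))
        third = cache.refresh_if_changed()   # mtime differs: re-read
        return first, second, third, cache.data, cache.reads
```

The explicit `os.utime()` call in the demo sidesteps timestamp granularity;
the sketch does not address changes that land within the same timestamp tick.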

Ciao,
Dscho

>
> After this, ce_write() is still called much more by the sequencer.
>
> Here are the results of `rebase --quiet' without discarding the index:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.23(0.19+0.04)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.14(4.95+0.18)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.02(4.87+0.15)
> The performance of the rebase is better in the two cases.
>
>
> 2. The base index is dropped by unpack_trees_start() and unpack_trees().
>
> Now, write_shared_index() is no longer called and write_locked_index()
> is less expensive than before according to callgrind.  But
> ce_write_entry() is still called 749 302 times (which is even more than
> before.)
>
> The only place where ce_write_entry() is called is in a loop in
> do_write_index().  The number of iterations is dictated by the size of
> the cache, and there is a trace2 probe dumping this value.
>
> For `--am', the value goes like this: 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
> 4, 4, 5, 5, 5, 5, … up until 101.
>
> For the sequencer, it goes like this: 1, 1, 3697, 3697, 3698, 3698,
> 3699, 3699, … up until 3796.
>
> The size of the cache is set in prepare_to_write_split_index().  It
> grows if a cache entry has no index (most of them should have one by
> now), or if the split index has no base index (with `--am', the split
> index always has a base.)  This comes from unpack_trees_start() -- it
> creates a new index, and unpack_trees() does not carry the base index,
> hence the size of the cache.
>
> The second attached patch (which is broken for the non-interactive
> rebase case) demonstrates what we could expect for the split-index case
> if we fix this:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.81(5.62+0.17)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.76(4.54+0.20)
> So, for everything related to the index, I think that’s it.
>
> [1] Numbers may vary, but they should remain in the same order of magnitude.
>
> Cheers,
> Alban
>
>

* Re: Comparing rebase --am with --interactive via p3400
  2020-01-31 21:23   ` Johannes Schindelin
@ 2020-04-01 11:33     ` Alban Gruin
  2020-04-01 14:00       ` Phillip Wood
  0 siblings, 1 reply; 11+ messages in thread
From: Alban Gruin @ 2020-04-01 11:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Elijah Newren, git

Hi Johannes,

Sorry for the late answer, I have been really busy these last few months.

Le 31/01/2020 à 22:23, Johannes Schindelin a écrit :
> Hi Alban,
>  -%<-
> 
> Indeed. I offered these insights in #git-devel (slightly edited):
> 
> This `discard_index()` is in an awfully central location. I am rather
> certain that it would cause problems to just remove it.
> 
> Looking at `do_merge()`: it explicitly discards and re-reads the index if
> we had to spawn a `git merge` process (which we do if a strategy other
> than `recursive` was specified, or if it is an octopus merge). But I am
> wary of other code paths that might not be as careful.
> 
> I see that `do_exec()` is similarly careful.
> 

I have to admit that the index is not my area of expertise in git, so
sorry if my question is stupid: isn't there a less heavy way to find
unstaged or uncommitted changes than discarding and then reloading the
index?

> One thing I cannot fail to notice: we do not re-read a changed index
> after running the `prepare-commit-msg` hook, or for that matter, any other
> hook. That could even be an old regression from the conversion of the
> interactive rebase to using the sequencer rather than a shell script.
> 
> Further, `reset_merge()` seems to spawn `git reset --merge` without
> bothering to re-read the possibly modified index. Its callers are
> `rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
> none of which seem to be careful, either, about checking whether the index
> was modified in the meantime.
> 
> Technically, the in-memory index should also be discarded
> in `apply_autostash()`, but so far, I don't think we have any callers of
> that function that want to do anything but release resources and exit.
> 
> The `run_git_checkout()` function discards, as intended. I
> am not quite sure whether it needs to, though, unless the `.git/index`
> file _was_ modified (it _is_ possible, after all, to run `git rebase -i
> HEAD`, and I do have a use case for that where one of my scripts generates
> a todo script, sort of a `git cherry-pick --rebase-merges`, because
> `cherry-pick` does not support that mode).
> 
> The `continue_single_pick()` function spawns a `git
> commit` which could potentially modify the index through a hook, but the
> first call site does not care and the second one guards against that
> (erroring out...).
> 
> My biggest concern is with the `run_git_commit()` function: it does not
> re-read a potentially-modified index (think of hooks).

Thank you for your analysis.

> 
> We will need to be very careful with this `discard_index()`, I think, and
> in my opinion there is a great opportunity here for cleaning up a little:
> discarding and re-reading the in-memory index without checking whether the
> on-disk index has changed at all appears a bit wasteful to me.
> 
> This could be refactored into a function that only discards and re-reads
> the index if the mtime of `.git/index` changed. That function could then
> also be taught to detect when the in-memory index has unwritten changes:
> that would constitute a bug.
> 

Hmm, isn't checking the mtime of the index to see if it changed racy?
Sub-second changes can happen, and to quote a comment in
is_racy_stat(), "nanosecond timestamped files can also be racy" -- even
if it should not really happen in the case of rebase…

> Ciao,
> Dscho
> 

Alban


* Re: Comparing rebase --am with --interactive via p3400
  2020-04-01 11:33     ` Alban Gruin
@ 2020-04-01 14:00       ` Phillip Wood
  2020-04-04 20:33         ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Phillip Wood @ 2020-04-01 14:00 UTC (permalink / raw)
  To: Alban Gruin, Johannes Schindelin; +Cc: Elijah Newren, git

Hi Alban and Johannes

On 01/04/2020 12:33, Alban Gruin wrote:
> Hi Johannes,
> 
> Sorry for the late answer, I have been really busy these last few months.
> 
> Le 31/01/2020 à 22:23, Johannes Schindelin a écrit :
>> Hi Alban,
>>   -%<-
>>
>> Indeed. I offered these insights in #git-devel (slightly edited):
>>
>> This `discard_index()` is in an awfully central location. I am rather
>> certain that it would cause problems to just remove it.
>>
>> Looking at `do_merge()`: it explicitly discards and re-reads the index if
>> we had to spawn a `git merge` process (which we do if a strategy other
>> than `recursive` was specified, or if it is an octopus merge). But I am
>> wary of other code paths that might not be as careful.
>>
>> I see that `do_exec()` is similarly careful.
>>
> 
> I have to admit that the index is not my area of expertise in git, so
> sorry if my question is stupid: isn't there a less heavy way to find
> unstaged or uncommitted changes than discarding and then reloading the
> index?
> 
>> One thing I cannot fail to notice: we do not re-read a changed index
>> after running the `prepare-commit-msg` hook, or for that matter, any other
>> hook. That could even be an old regression from the conversion of the
>> interactive rebase to using the sequencer rather than a shell script.
>>
>> Further, `reset_merge()` seems to spawn `git reset --merge` without
>> bothering to re-read the possibly modified index. Its callers are
>> `rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
>> none of which seem to be careful, either, about checking whether the index
>> was modified in the meantime.
>>
>> Technically, the in-memory index should also be discarded
>> in `apply_autostash()`, but so far, I don't think we have any callers of
>> that function that want to do anything but release resources and exit.
>>
>> The `run_git_checkout()` function discards, as intended. I
>> am not quite sure whether it needs to, though, unless the `.git/index`
>> file _was_ modified (it _is_ possible, after all, to run `git rebase -i
>> HEAD`, and I do have a use case for that where one of my scripts generates
>> a todo script, sort of a `git cherry-pick --rebase-merges`, because
>> `cherry-pick` does not support that mode).

I'm not sure it is worth optimizing the case where .git/index is not
changed, as we only do this once per rebase. In any case I hope that one
day we'll stop forking git checkout and use the code from
builtin/rebase.c to do it.

>> The `continue_single_pick()` function spawns a `git
>> commit` which could potentially modify the index through a hook, but the
>> first call site does not care and the second one guards against that
>> (erroring out...).
>>
>> My biggest concern is with the `run_git_commit()` function: it does not
>> re-read a potentially-modified index (think of hooks).

I agree that we should be re-reading the index after forking `git
commit` and also `git merge`. Most of the time we commit without forking,
so that should not impact the performance too much.

> Thank you for your analysis.
> 
>>
>> We will need to be very careful with this `discard_index()`, I think, and
>> in my opinion there is a great opportunity here for cleaning up a little:
>> discarding and re-reading the in-memory index without checking whether the
>> on-disk index has changed at all appears a bit wasteful to me.
>>
>> This could be refactored into a function that only discards and re-reads
>> the index if the mtime of `.git/index` changed. That function could then
>> also be taught to detect when the in-memory index has unwritten changes:
>> that would constitute a bug.
>>
> 
> Hmm, isn't checking the mtime of the index to see if it changed racy?
> Sub-second changes can happen, and to quote a comment in
> is_racy_stat(), "nanosecond timestamped files can also be racy" -- even
> if it should not really happen in the case of rebase…

I don't think relying on the index stat data is a good idea: git
defaults to one-second mtime resolution unless it is built with
-DUSE_NSEC, and we do way more than one commit a second. We tried to rely
on stat data to determine when to re-read the todo list after an exec,
and it is broken (both in the design, because it assumes ns mtime
resolution, and in the implementation, because we don't update the cached
mtime after we rewrite the todo list). There are not that many places
where we need to re-read the index, so I think we should just have
explicit re-reads where we need them. Hopefully over time we'll stop
forking other processes and the problem will go away.
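
To illustrate the resolution concern with a small model (Python, for
illustration only; the function names and sample timestamps are hypothetical
and nothing here is git code): when mtimes are compared at whole-second
granularity, two writes landing in the same second are indistinguishable,
whereas a nanosecond comparison tells them apart.

```python
NS_PER_SEC = 1_000_000_000


def changed_sec(cached_mtime_ns, current_mtime_ns):
    """Compare mtimes truncated to whole seconds (coarse-resolution model)."""
    return cached_mtime_ns // NS_PER_SEC != current_mtime_ns // NS_PER_SEC


def changed_nsec(cached_mtime_ns, current_mtime_ns):
    """Compare mtimes at full nanosecond resolution."""
    return cached_mtime_ns != current_mtime_ns


# Two writes 100 ms apart, both inside the same wall-clock second.
first_write = 1_585_749_600 * NS_PER_SEC + 200_000_000
second_write = first_write + 100_000_000

# A third write a full second after the first one.
third_write = first_write + NS_PER_SEC

missed = not changed_sec(first_write, second_write)  # coarse check sees no change
seen = changed_nsec(first_write, second_write)       # fine-grained check does
```

With more than one commit per second, the `missed` case is the common one
rather than a corner case.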

Best Wishes

Phillip

>> Ciao,
>> Dscho
>>
> 
> Alban
> 

* Re: Comparing rebase --am with --interactive via p3400
  2020-04-01 14:00       ` Phillip Wood
@ 2020-04-04 20:33         ` Johannes Schindelin
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2020-04-04 20:33 UTC (permalink / raw)
  To: phillip.wood; +Cc: Alban Gruin, Elijah Newren, git

Hi Phillip,

On Wed, 1 Apr 2020, Phillip Wood wrote:

> On 01/04/2020 12:33, Alban Gruin wrote:
> > Hi Johannes,
> >
> > Sorry for the late answer, I have been really busy these last few months.
> >
> > Le 31/01/2020 à 22:23, Johannes Schindelin a écrit :
> > > Hi Alban,
> > >   -%<-
> > >
> > > Indeed. I offered these insights in #git-devel (slightly edited):
> > >
> > > This `discard_index()` is in an awfully central location. I am rather
> > > certain that it would cause problems to just remove it.
> > >
> > > Looking at `do_merge()`: it explicitly discards and re-reads the index if
> > > we had to spawn a `git merge` process (which we do if a strategy other
> > > than `recursive` was specified, or if it is an octopus merge). But I am
> > > wary of other code paths that might not be as careful.
> > >
> > > I see that `do_exec()` is similarly careful.
> > >
> >
> > I have to admit that the index is not my area of expertise in git, so
> > sorry if my question is stupid: isn't there a less heavy way to find
> > unstaged or uncommitted changes than discarding and then reloading the
> > index?
> >
> > > One thing I cannot fail to notice: we do not re-read a changed index
> > > after running the `prepare-commit-msg` hook, or for that matter, any other
> > > hook. That could even be an old regression from the conversion of the
> > > interactive rebase to using the sequencer rather than a shell script.
> > >
> > > Further, `reset_merge()` seems to spawn `git reset --merge` without
> > > bothering to re-read the possibly modified index. Its callers are
> > > `rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
> > > none of which seem to be careful, either, about checking whether the index
> > > was modified in the meantime.
> > >
> > > Technically, the in-memory index should also be discarded
> > > in `apply_autostash()`, but so far, I don't think we have any callers of
> > > that function that want to do anything but release resources and exit.
> > >
> > > The `run_git_checkout()` function discards, as intended. I
> > > am not quite sure whether it needs to, though, unless the `.git/index`
> > > file _was_ modified (it _is_ possible, after all, to run `git rebase -i
> > > HEAD`, and I do have a use case for that where one of my scripts generates
> > > a todo script, sort of a `git cherry-pick --rebase-merges`, because
> > > `cherry-pick` does not support that mode).
>
> I'm not sure it is worth optimizing the case where .git/index is not changed
> as we only do this once per rebase. In any case I hope that one day we'll stop
> forking git checkout and use the code from builtin/rebase.c to do it
>
> > > The `continue_single_pick()` function spawns a `git
> > > commit` which could potentially modify the index through a hook, but the
> > > first call site does not care and the second one guards against that
> > > (erroring out...).
> > >
> > > My biggest concern is with the `run_git_commit()` function: it does not
> > > re-read a potentially-modified index (think of hooks).
>
> I agree that we should be re-reading the index after forking `git commit` and
> also `git merge`. Most of the time we commit without forking so that should
> not impact the performance too much
>
> > Thank you for your analysis.
> >
> > >
> > > We will need to be very careful with this `discard_index()`, I think, and
> > > in my opinion there is a great opportunity here for cleaning up a little:
> > > discarding and re-reading the in-memory index without checking whether the
> > > on-disk index has changed at all appears a bit wasteful to me.
> > >
> > > This could be refactored into a function that only discards and re-reads
> > > the index if the mtime of `.git/index` changed. That function could then
> > > also be taught to detect when the in-memory index has unwritten changes:
> > > that would constitute a bug.
> > >
> >
> > Hmm, isn't checking the mtime of the index to see if it changed racy?
> > Sub-second changes can happen, and to quote a comment in
> > is_racy_stat(), "nanosecond timestamped files can also be racy" -- even
> > if it should not really happen in the case of rebase…
>
> I don't think relying on the index stat data is a good idea, git defaults to
> one second mtime resolution unless it is built with -DUSE_NSEC and we do way
> more than one commit a second. We tried to rely on stat data to determine when
> to re-read the todo list after an exec and it is broken (both in the design
> because it assumes ns mtime resolution and the implementation because we don't
> update the cached mtime after we rewrite the todo list). There are not that
> many places where we need to re-read the index so I think we should just have
> explicit re-reads where we need them. Hopefully over time we'll stop forking
> other processes and the problem will go away.

Well. Even the 1-second granularity should buy us some performance if we
assume that `same mtime` == `racy`. That should still catch the majority
of the cases where the index was simply not changed, at least in the
`do_exec()` case.
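
One possible reading of that suggestion, sketched in Python for illustration
only. The function and parameter names are hypothetical, and the rule is
modeled on the racy-git idea rather than taken from existing sequencer code:
skip the re-read only when the file's mtime is unchanged and strictly older
than the second in which it was last read; in every other case, re-read to
be safe.

```python
NS_PER_SEC = 1_000_000_000


def must_reread(cached_mtime_ns, read_time_ns, current_mtime_ns):
    """Decide whether to re-read, using only whole-second timestamps.

    cached_mtime_ns:  mtime of the file when it was last read
    read_time_ns:     wall-clock time at which it was last read
    current_mtime_ns: mtime of the file now
    """
    cached_sec = cached_mtime_ns // NS_PER_SEC
    if current_mtime_ns // NS_PER_SEC != cached_sec:
        return True  # visibly changed
    if cached_sec >= read_time_ns // NS_PER_SEC:
        # Racy: the file was written in the same second it was read, so a
        # later write in that second would leave the mtime looking unchanged.
        return True
    return False  # same mtime, and old enough to be trusted


def at(seconds):
    """Helper for the examples: seconds (may be fractional) to nanoseconds."""
    return int(seconds * NS_PER_SEC)


old_and_unchanged = must_reread(at(100), at(105), at(100))
same_second_as_read = must_reread(at(100), at(100.5), at(100))
visibly_changed = must_reread(at(100), at(105), at(103))
```

Only the first case skips the re-read; the other two fall back to the
explicit discard-and-re-read.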

Ciao,
Dscho

>
> Best Wishes
>
> Phillip
>
> > > Ciao,
> > > Dscho
> > >
> >
> > Alban
> >
>

Thread overview: 11+ messages
2019-02-01  6:04 Comparing rebase --am with --interactive via p3400 Johannes Schindelin
2019-02-01  7:22 ` Johannes Schindelin
2019-02-01  9:26 ` Elijah Newren
2019-12-27 21:11 ` Alban Gruin
2019-12-27 22:45   ` Elijah Newren
2019-12-29 17:25     ` Alban Gruin
2020-01-02 20:17       ` Johannes Schindelin
2020-01-31 21:23   ` Johannes Schindelin
2020-04-01 11:33     ` Alban Gruin
2020-04-01 14:00       ` Phillip Wood
2020-04-04 20:33         ` Johannes Schindelin
