git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Strange behavior of git rev-list
@ 2017-09-07  9:20 Paweł Marczewski
  2017-09-07  9:47 ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Paweł Marczewski @ 2017-09-07  9:20 UTC (permalink / raw)
  To: git

Hi,
I have an interesting case. In my repository, there are two commits,
'one' and 'two'. 'one' is reachable from 'two' (as evidenced by 'git
rev-list two | grep $(giv rev-parse one)'). However, the output of
'git rev-list two..one' is not empty, as is 'git rev-list ^two one'.

Here is the repository: https://github.com/pwmarcz/git-wtf/

It seems that the commit dates influence this behavior, because when I
edit all the dates to be the same, the output of 'git rev-list
two..one' is empty. Pruning seemingly irrelevant parents also makes it
empty.

I verified the behavior on git versions 2.14.1, 2.11.0, and on the
'next' branch (2.14.1.586.g1a2e63c10).

Paweł Marczewski

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-07  9:20 Strange behavior of git rev-list Paweł Marczewski
@ 2017-09-07  9:47 ` Jeff King
  2017-09-07  9:50   ` Paweł Marczewski
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2017-09-07  9:47 UTC (permalink / raw)
  To: Paweł Marczewski; +Cc: git

On Thu, Sep 07, 2017 at 11:20:15AM +0200, Paweł Marczewski wrote:

> I have an interesting case. In my repository, there are two commits,
> 'one' and 'two'. 'one' is reachable from 'two' (as evidenced by 'git
> rev-list two | grep $(giv rev-parse one)'). However, the output of
> 'git rev-list two..one' is not empty, as is 'git rev-list ^two one'.
> 
> Here is the repository: https://github.com/pwmarcz/git-wtf/
> 
> It seems that the commit dates influence this behavior, because when I
> edit all the dates to be the same, the output of 'git rev-list
> two..one' is empty. Pruning seemingly irrelevant parents also makes it
> empty.
> 
> I verified the behavior on git versions 2.14.1, 2.11.0, and on the
> 'next' branch (2.14.1.586.g1a2e63c10).

Yes, this is known. The commit dates are used as a proxy for graph
height (or generation numbers, if you prefer) so that we avoid having to
walk all the way down to a merge base before producing any output. But
it can give the wrong answer in the face of clock skew.

We walk back five extra commits more than we need to in order to avoid
small runs of skewed commits, but obviously you can have arbitrary-sized
runs.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-07  9:47 ` Jeff King
@ 2017-09-07  9:50   ` Paweł Marczewski
  2017-09-07 10:11     ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Paweł Marczewski @ 2017-09-07  9:50 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Thanks. Any plans to fix that? Or is there a way to turn off this heuristic?

On 7 September 2017 at 11:47, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 07, 2017 at 11:20:15AM +0200, Paweł Marczewski wrote:
>
>> I have an interesting case. In my repository, there are two commits,
>> 'one' and 'two'. 'one' is reachable from 'two' (as evidenced by 'git
>> rev-list two | grep $(giv rev-parse one)'). However, the output of
>> 'git rev-list two..one' is not empty, as is 'git rev-list ^two one'.
>>
>> Here is the repository: https://github.com/pwmarcz/git-wtf/
>>
>> It seems that the commit dates influence this behavior, because when I
>> edit all the dates to be the same, the output of 'git rev-list
>> two..one' is empty. Pruning seemingly irrelevant parents also makes it
>> empty.
>>
>> I verified the behavior on git versions 2.14.1, 2.11.0, and on the
>> 'next' branch (2.14.1.586.g1a2e63c10).
>
> Yes, this is known. The commit dates are used as a proxy for graph
> height (or generation numbers, if you prefer) so that we avoid having to
> walk all the way down to a merge base before producing any output. But
> it can give the wrong answer in the face of clock skew.
>
> We walk back five extra commits more than we need to in order to avoid
> small runs of skewed commits, but obviously you can have arbitrary-sized
> runs.
>
> -Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-07  9:50   ` Paweł Marczewski
@ 2017-09-07 10:11     ` Jeff King
  2017-09-07 19:24       ` Stefan Beller
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2017-09-07 10:11 UTC (permalink / raw)
  To: Paweł Marczewski; +Cc: git

On Thu, Sep 07, 2017 at 11:50:25AM +0200, Paweł Marczewski wrote:

> Thanks. Any plans to fix that? Or is there a way to turn off this heuristic?

I don't think there's a way to turn it off for `rev-list`. Merge-base
computations are more careful, so you could determine the correct
endpoints that way. But when you fed them to `rev-list`, it would hit
the same run of skewed commits.

We've discussed storing true generation numbers in the past, which would
make this optimization more robust, as well as allow us to speed up many
other traversals (e.g., the "tag --contains"). It's something I'd like
to revisit, but it's not at the top of the pile.

In the meantime, it would probably be possible to write a patch that
conditionally disables the optimization. But it would mean all of your
rev-lists run a _lot_ slower.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-07 10:11     ` Jeff King
@ 2017-09-07 19:24       ` Stefan Beller
  2017-09-08  3:17         ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Stefan Beller @ 2017-09-07 19:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Paweł Marczewski, git

On Thu, Sep 7, 2017 at 3:11 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 07, 2017 at 11:50:25AM +0200, Paweł Marczewski wrote:
>
>> Thanks. Any plans to fix that? Or is there a way to turn off this heuristic?
>
> I don't think there's a way to turn it off for `rev-list`. Merge-base
> computations are more careful, so you could determine the correct
> endpoints that way. But when you fed them to `rev-list`, it would hit
> the same run of skewed commits.
>
> We've discussed storing true generation numbers in the past, which would
> make this optimization more robust, as well as allow us to speed up many
> other traversals (e.g., the "tag --contains"). It's something I'd like
> to revisit, but it's not at the top of the pile.

(We just had an office discussion if this is part of the hash transition plan,
i.e. have a field in the commit object to contain its generation number.
and as I think the generation numbers would lead to fast and correct
behavior unlike the current heuristic which is only fast, I would try
to make a strong point actual generation numbers inside commit objects)

>
> In the meantime, it would probably be possible to write a patch that
> conditionally disables the optimization. But it would mean all of your
> rev-lists run a _lot_ slower.
>
> -Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-07 19:24       ` Stefan Beller
@ 2017-09-08  3:17         ` Jeff King
  2017-09-08  3:37           ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2017-09-08  3:17 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Paweł Marczewski, git

On Thu, Sep 07, 2017 at 12:24:02PM -0700, Stefan Beller wrote:

> > We've discussed storing true generation numbers in the past, which would
> > make this optimization more robust, as well as allow us to speed up many
> > other traversals (e.g., the "tag --contains"). It's something I'd like
> > to revisit, but it's not at the top of the pile.
> 
> (We just had an office discussion if this is part of the hash transition plan,
> i.e. have a field in the commit object to contain its generation number.
> and as I think the generation numbers would lead to fast and correct
> behavior unlike the current heuristic which is only fast, I would try
> to make a strong point actual generation numbers inside commit objects)

I'm still moderately against storing generation numbers inside the
objects. They're redundant with the existing parent pointers, which
means it's an opportunity for the two sets of data to disagree. And as
we've seen, once errors are cemented in history it's very hard to fix
them, because you break any history built on top.

I'm much more in favor of building a local cache of generation numbers
(either on the fly or during repacks, where we can piggy-back on the
existing pack .idx for indexing).

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-08  3:17         ` Jeff King
@ 2017-09-08  3:37           ` Junio C Hamano
  2017-09-08  3:47             ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2017-09-08  3:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Paweł Marczewski, git

Jeff King <peff@peff.net> writes:

> On Thu, Sep 07, 2017 at 12:24:02PM -0700, Stefan Beller wrote:
>
>> > We've discussed storing true generation numbers in the past, which would
>> > make this optimization more robust, as well as allow us to speed up many
>> > other traversals (e.g., the "tag --contains"). It's something I'd like
>> > to revisit, but it's not at the top of the pile.
>> 
>> (We just had an office discussion if this is part of the hash transition plan,
>> i.e. have a field in the commit object to contain its generation number.
>> and as I think the generation numbers would lead to fast and correct
>> behavior unlike the current heuristic which is only fast, I would try
>> to make a strong point actual generation numbers inside commit objects)
>
> I'm still moderately against storing generation numbers inside the
> objects. They're redundant with the existing parent pointers, which
> means it's an opportunity for the two sets of data to disagree. And as
> we've seen, once errors are cemented in history it's very hard to fix
> them, because you break any history built on top.
>
> I'm much more in favor of building a local cache of generation numbers
> (either on the fly or during repacks, where we can piggy-back on the
> existing pack .idx for indexing).

I guess our mails crossed.  Yes, objects that are needlessly broken
only because they botch computation of derivable values are real
problem, as we need to accomodate them forever because histories can
and will be built on top of them.

On the other hand, seeing that the world did not stop even with some
projects have trees with entries whose mode are written with 0-padding
on the left in the ancient part of their histories, it might not be
such a big deal.  I dunno.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Strange behavior of git rev-list
  2017-09-08  3:37           ` Junio C Hamano
@ 2017-09-08  3:47             ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2017-09-08  3:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Stefan Beller, Paweł Marczewski, git

On Fri, Sep 08, 2017 at 12:37:28PM +0900, Junio C Hamano wrote:

> > I'm still moderately against storing generation numbers inside the
> > objects. They're redundant with the existing parent pointers, which
> > means it's an opportunity for the two sets of data to disagree. And as
> > we've seen, once errors are cemented in history it's very hard to fix
> > them, because you break any history built on top.
> >
> > I'm much more in favor of building a local cache of generation numbers
> > (either on the fly or during repacks, where we can piggy-back on the
> > existing pack .idx for indexing).
> 
> I guess our mails crossed.  Yes, objects that are needlessly broken
> only because they botch computation of derivable values are real
> problem, as we need to accomodate them forever because histories can
> and will be built on top of them.
> 
> On the other hand, seeing that the world did not stop even with some
> projects have trees with entries whose mode are written with 0-padding
> on the left in the ancient part of their histories, it might not be
> such a big deal.  I dunno.

True, but in counter-point:

  1. Tree problems generally only affect operations on that tree itself.
     Parent (or generation number) problems hit us any time we walk
     across that part of history, which may be much more frequent.

  2. There's an open question of what to do with existing commits
     without generation numbers.

     Per (1), "git tag --contains" is _always_ going to want to know the
     generation number of v1.0. Some problems are "local" to their block
     of history and as the project history marches forward, the bugs are
     there but you are less likely to make queries that hit them. But
     considering old tags for reachability will happen forever (and is
     the _most_ important use of generation numbers, because it lets us
     throw out that old history immediately).

     So if we assume we can't rewrite those objects, then we end up with
     some kind of local cache anyway.

  3. I think we should be moving more in the direction of keeping
     repo-local caches for optimizations. Reachability bitmaps have been
     a big performance win. I think we should be doing the same with our
     properties of commits. Not just generation numbers, but making it
     cheap to access the graph structure without zlib-inflating whole
     commit objects (i.e., packv4 or something like the "metapacks" I
     proposed a few years ago).

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-09-08  3:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-07  9:20 Strange behavior of git rev-list Paweł Marczewski
2017-09-07  9:47 ` Jeff King
2017-09-07  9:50   ` Paweł Marczewski
2017-09-07 10:11     ` Jeff King
2017-09-07 19:24       ` Stefan Beller
2017-09-08  3:17         ` Jeff King
2017-09-08  3:37           ` Junio C Hamano
2017-09-08  3:47             ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).