* Git is not scalable with too many refs/*
@ 2011-06-09  3:44 NAKAMURA Takumi
  2011-06-09  6:50 ` Sverre Rabbelier
  2011-06-09 11:18 ` Jakub Narebski
  0 siblings, 2 replies; 126+ messages in thread
From: NAKAMURA Takumi @ 2011-06-09  3:44 UTC (permalink / raw)
  To: git

Hello, Git. It is my 1st post here.

I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
repo locally. (over 100k refs/tags.)
Indeed, it made some operations extremely slow, even with packed-refs
and packed objects.
I then gave up on pushing the tags upstream. (it must be a terror) :p

I know it might be crazy by Git standards, but it would bring me
conveniences. (e.g. git log --oneline --decorate shows me each svn revision)
I would like to work on making Git live with many tags.

* Issues as far as I have investigated:

  - git show --decorate is always slow.
    In decorate.c, every commit is inspected.
  - git rev-list --quiet --objects $upstream --not --all spends a lot of
    time, even when it is expected to return 0.
    As you know, it is used in builtin/fetch.c.
  - git-upload-pack advertises "all" refs to me if upstream has too many refs.

I would like to work on the items below, if they would be valuable.

  - Stop inspecting commits from packed-refs in the decorate code.
  - Implement packed-refs sorted by hash (not sorted by name).
  - Implement more efficient pruning for --not --all in revision.c.
  - Think about protocol enhancements to transfer many refs more efficiently.

I would be happy to work on this issue. Thank you.

...Takumi

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-09  3:44 Git is not scalable with too many refs/* NAKAMURA Takumi
@ 2011-06-09  6:50 ` Sverre Rabbelier
  2011-06-09 15:23   ` Shawn Pearce
  2011-06-09 11:18 ` Jakub Narebski
  1 sibling, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-06-09  6:50 UTC (permalink / raw)
  To: NAKAMURA Takumi, Shawn O. Pearce; +Cc: git

Heya,

[+shawn, who runs into something similar with Gerrit]

On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi <geek4civic@gmail.com> wrote:
> Hello, Git. It is my 1st post here.
>
> I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
> repo locally. (over 100k refs/tags.)
> Indeed, it made some operations extremely slow, even with packed-refs
> and packed objects.
> I then gave up on pushing the tags upstream. (it must be a terror) :p
>
> I know it might be crazy by Git standards, but it would bring me
> conveniences. (e.g. git log --oneline --decorate shows me each svn revision)
> I would like to work on making Git live with many tags.
>
> * Issues as far as I have investigated:
>
>  - git show --decorate is always slow.
>    In decorate.c, every commit is inspected.
>  - git rev-list --quiet --objects $upstream --not --all spends a lot of
>    time, even when it is expected to return 0.
>    As you know, it is used in builtin/fetch.c.
>  - git-upload-pack advertises "all" refs to me if upstream has too many refs.
>
> I would like to work on the items below, if they would be valuable.
>
>  - Stop inspecting commits from packed-refs in the decorate code.
>  - Implement packed-refs sorted by hash (not sorted by name).
>  - Implement more efficient pruning for --not --all in revision.c.
>  - Think about protocol enhancements to transfer many refs more efficiently.
>
> I would be happy to work on this issue. Thank you.

-- 
Cheers,

Sverre Rabbelier


* Re: Git is not scalable with too many refs/*
  2011-06-09  3:44 Git is not scalable with too many refs/* NAKAMURA Takumi
  2011-06-09  6:50 ` Sverre Rabbelier
@ 2011-06-09 11:18 ` Jakub Narebski
  2011-06-09 15:42   ` Stephen Bash
  1 sibling, 1 reply; 126+ messages in thread
From: Jakub Narebski @ 2011-06-09 11:18 UTC (permalink / raw)
  To: NAKAMURA Takumi; +Cc: git

NAKAMURA Takumi <geek4civic@gmail.com> writes:

> Hello, Git. It is my 1st post here.
> 
> I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
> repo locally. (over 100k refs/tags.)
[...]

That's insane.  You would do much better to mark each commit with a
note.  Notes are designed to be scalable.  See e.g. this thread

  [RFD] Proposal for git-svn: storing SVN metadata (git-svn-id) in notes
  http://article.gmane.org/gmane.comp.version-control.git/174657

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Git is not scalable with too many refs/*
  2011-06-09  6:50 ` Sverre Rabbelier
@ 2011-06-09 15:23   ` Shawn Pearce
  2011-06-09 15:52     ` A Large Angry SCM
  0 siblings, 1 reply; 126+ messages in thread
From: Shawn Pearce @ 2011-06-09 15:23 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: NAKAMURA Takumi, git

On Wed, Jun 8, 2011 at 23:50, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> [+shawn, who runs into something similar with Gerrit]

> On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi <geek4civic@gmail.com> wrote:
>> Hello, Git. It is my 1st post here.
>>
>> I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
>> repo locally. (over 100k refs/tags.)

As Jakub pointed out, use git notes for this. They were designed to
scale to >100,000 annotations.

>> Indeed, it made some operations extremely slow, even with packed-refs
>> and packed objects.

Having a reference to every commit in the repository is horrifically
slow. We run into this with Gerrit Code Review and I need to find
another solution. Git just wasn't meant to process repositories like
this.

-- 
Shawn.


* Re: Git is not scalable with too many refs/*
  2011-06-09 11:18 ` Jakub Narebski
@ 2011-06-09 15:42   ` Stephen Bash
  0 siblings, 0 replies; 126+ messages in thread
From: Stephen Bash @ 2011-06-09 15:42 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, NAKAMURA Takumi

----- Original Message -----
> From: "Jakub Narebski" <jnareb@gmail.com>
> To: "NAKAMURA Takumi" <geek4civic@gmail.com>
> Cc: "git" <git@vger.kernel.org>
> Sent: Thursday, June 9, 2011 7:18:09 AM
> Subject: Re: Git is not scalable with too many refs/*
> NAKAMURA Takumi <geek4civic@gmail.com> writes:
> 
> > Hello, Git. It is my 1st post here.
> >
> > I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
> > repo locally. (over 100k refs/tags.)
> [...]
> 
> That's insane. You would do much better to mark each commit with
> note. Notes are designed to be scalable. See e.g. this thread
> 
> [RFD] Proposal for git-svn: storing SVN metadata (git-svn-id) in notes
> http://article.gmane.org/gmane.comp.version-control.git/174657

As a reformed SVN user (i.e. not using it anymore ;]) I agree that 100k tags seems crazy, but I was contemplating doing the exact same thing as Takumi.  Skimming that thread, I didn't see the key point (IMO): notes can map from commits to a "name" (or other information), tags map from a "name" to commits.

I've seen two different workflows develop:
  1) Hacking on some code in Git the programmer finds something wrong.  Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
  2) The programmer finds something wrong, asks coworker, coworker says "see bug XYZ", bug XYZ says "Fixed in r20356".

I agree notes is the right answer for (1), but for (2) you really want a cross reference table from Subversion rev number to Git commit.

In our office we created the cross reference table once by walking the Git tree and storing it as a file (we had some degenerate cases where one SVN rev mapped to multiple Git commits, but I don't remember the details), but it's not really usable from Git.  Lightweight tags would be an awesome solution (if they worked).  Perhaps a custom subcommand is a reasonable middle ground.

Thanks,
Stephen


* Re: Git is not scalable with too many refs/*
  2011-06-09 15:23   ` Shawn Pearce
@ 2011-06-09 15:52     ` A Large Angry SCM
  2011-06-09 15:56       ` Shawn Pearce
  0 siblings, 1 reply; 126+ messages in thread
From: A Large Angry SCM @ 2011-06-09 15:52 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Sverre Rabbelier, NAKAMURA Takumi, git

On 06/09/2011 11:23 AM, Shawn Pearce wrote:
> On Wed, Jun 8, 2011 at 23:50, Sverre Rabbelier<srabbelier@gmail.com>  wrote:
>> [+shawn, who runs into something similar with Gerrit]
>
>> On Thu, Jun 9, 2011 at 05:44, NAKAMURA Takumi<geek4civic@gmail.com>  wrote:
>>> Hello, Git. It is my 1st post here.
>>>
>>> I have tried tagging each commit as "refs/tags/rXXXXXX" on git-svn
>>> repo locally. (over 100k refs/tags.)
>
> As Jakub pointed out, use git notes for this. They were designed to
> scale to >100,000 annotations.
>
>>> Indeed, it made some operations extremely slow, even with packed-refs
>>> and packed objects.
>
> Having a reference to every commit in the repository is horrifically
> slow. We run into this with Gerrit Code Review and I need to find
> another solution. Git just wasn't meant to process repositories like
> this.

Assuming a very large number of refs, what is it that makes git so 
horrifically slow? Is there a design or implementation lesson here?


* Re: Git is not scalable with too many refs/*
  2011-06-09 15:52     ` A Large Angry SCM
@ 2011-06-09 15:56       ` Shawn Pearce
  2011-06-09 16:26         ` Jeff King
  2011-06-10  7:41         ` Andreas Ericsson
  0 siblings, 2 replies; 126+ messages in thread
From: Shawn Pearce @ 2011-06-09 15:56 UTC (permalink / raw)
  To: A Large Angry SCM; +Cc: Sverre Rabbelier, NAKAMURA Takumi, git

On Thu, Jun 9, 2011 at 08:52, A Large Angry SCM <gitzilla@gmail.com> wrote:
> On 06/09/2011 11:23 AM, Shawn Pearce wrote:
>> Having a reference to every commit in the repository is horrifically
>> slow. We run into this with Gerrit Code Review and I need to find
>> another solution. Git just wasn't meant to process repositories like
>> this.
>
> Assuming a very large number of refs, what is it that makes git so
> horrifically slow? Is there a design or implementation lesson here?

A few things.

Git does a sequential scan of all references when it first needs to
access references for an operation. This requires reading the entire
packed-refs file, and the recursive scan of the "refs/" subdirectory
for any loose refs that might override the packed-refs file.

A lot of operations toss every commit that a reference points at into
the revision walker's LRU queue. If you have a tag pointing to every
commit, then the entire project history enters the LRU queue at once,
up front. That queue is managed with O(N^2) insertion time. And the
entire queue has to be filled before anything can be output.
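The list behaviour described above can be sketched as a toy model (a
simplification for illustration, not git's actual commit_list code):

```c
#include <stdlib.h>

/* Simplified model of git's commit_list: a singly linked list kept
 * sorted by commit timestamp, newest first. */
struct commit {
    unsigned long date;   /* commit timestamp */
};

struct commit_list {
    struct commit *item;
    struct commit_list *next;
};

/* Walk from the head until the insertion point is found: O(n) per
 * insert, so filling the queue from n refs costs O(n^2) overall. */
struct commit_list *insert_by_date(struct commit *item,
                                   struct commit_list **list)
{
    struct commit_list **pp = list;
    while (*pp && (*pp)->item->date >= item->date)
        pp = &(*pp)->next;

    struct commit_list *node = malloc(sizeof(*node));
    node->item = item;
    node->next = *pp;
    *pp = node;
    return node;
}
```

With a tag on every commit, every commit in history goes through this
linear scan before the walker can emit anything.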

-- 
Shawn.


* Re: Git is not scalable with too many refs/*
  2011-06-09 15:56       ` Shawn Pearce
@ 2011-06-09 16:26         ` Jeff King
  2011-06-10  3:59           ` NAKAMURA Takumi
  2011-06-10  7:41         ` Andreas Ericsson
  1 sibling, 1 reply; 126+ messages in thread
From: Jeff King @ 2011-06-09 16:26 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git

On Thu, Jun 09, 2011 at 08:56:50AM -0700, Shawn O. Pearce wrote:

> A lot of operations toss every commit that a reference points at into
> the revision walker's LRU queue. If you have a tag pointing to every
> commit, then the entire project history enters the LRU queue at once,
> up front. That queue is managed with O(N^2) insertion time. And the
> entire queue has to be filled before anything can be output.

We ran into this recently at github. Since our many-refs repos were
mostly forks, we had a lot of duplicate commits, and were able to solve
it with ea5f220 (fetch: avoid repeated commits in mark_complete,
2011-05-19).

However, I also worked up a faster priority queue implementation that
would work in the general case:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

I suspect it would speed up the original poster's slow fetch. The
problem is that a fast priority queue doesn't have quite the same access
patterns as a linked list, so replacing all of the commit_lists in git
with the priority queue would be quite a painful undertaking. So we are
left with using the fast queue only in specific hot-spots.
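As a rough illustration of the alternative, a binary heap keyed on
commit timestamps gives O(log n) push and pop; this is a minimal
sketch, not the queue from the linked thread:

```c
#include <stdlib.h>

/* Minimal binary max-heap keyed on commit timestamp: O(log n) push
 * and pop instead of O(n) sorted-list insertion. Sketch only. */
struct prio_queue {
    unsigned long *dates;
    size_t nr, alloc;
};

static void swap_dates(unsigned long *a, unsigned long *b)
{
    unsigned long t = *a; *a = *b; *b = t;
}

void prio_push(struct prio_queue *q, unsigned long date)
{
    if (q->nr == q->alloc) {
        q->alloc = q->alloc ? 2 * q->alloc : 16;
        q->dates = realloc(q->dates, q->alloc * sizeof(*q->dates));
    }
    size_t i = q->nr++;
    q->dates[i] = date;
    while (i && q->dates[(i - 1) / 2] < q->dates[i]) {   /* sift up */
        swap_dates(&q->dates[(i - 1) / 2], &q->dates[i]);
        i = (i - 1) / 2;
    }
}

unsigned long prio_pop(struct prio_queue *q)
{
    unsigned long top = q->dates[0];
    q->dates[0] = q->dates[--q->nr];
    size_t i = 0;                                        /* sift down */
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;
        if (l < q->nr && q->dates[l] > q->dates[big]) big = l;
        if (r < q->nr && q->dates[r] > q->dates[big]) big = r;
        if (big == i) break;
        swap_dates(&q->dates[i], &q->dates[big]);
        i = big;
    }
    return top;
}
```

The catch described above still applies: callers that splice or iterate
a commit_list as a plain linked list cannot use this directly.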

-Peff


* Re: Git is not scalable with too many refs/*
  2011-06-09 16:26         ` Jeff King
@ 2011-06-10  3:59           ` NAKAMURA Takumi
  2011-06-13 22:27             ` Jeff King
  2011-06-14  0:17             ` Andreas Ericsson
  0 siblings, 2 replies; 126+ messages in thread
From: NAKAMURA Takumi @ 2011-06-10  3:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

Good afternoon, Git! Thank you all for the comments.

Jakub and Shawn,

Sure, notes should be used in this case, I agree.

> (eg. git log --oneline --decorate shows me each svn revision)

My example may have misled you. I meant that tags can show me a
pretty abbreviation everywhere in Git. I would be happier if tags
could act as a bi-directional alias, as Stephen mentions.

It would be better if git-svn could record its metadata in notes too,
I think. :D

Stephen,

2011/6/10 Stephen Bash <bash@genarts.com>:
> I've seen two different workflows develop:
>  1) Hacking on some code in Git the programmer finds something wrong.  Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
>  2) The programmer finds something wrong, asks coworker, coworker says "see bug XYZ", bug XYZ says "Fixed in r20356".
>
> I agree notes is the right answer for (1), but for (2) you really want a cross reference table from Subversion rev number to Git commit.

That is the point I wanted to make, thank you! I am working with svn
people. They often speak in svn revision numbers. (And then I have to
give them svn revs in return.)

> In our office we created the cross reference table once by walking the Git tree and storing it as a file (we had some degenerate cases where one SVN rev mapped to multiple Git commits, but I don't remember the details), but it's not really usable from Git.  Lightweight tags would be an awesome solution (if they worked).  Perhaps a custom subcommand is a reasonable middle ground.

Reconstructing the svn-rev-to-commit mapping can be done by git-svn
itself. Unfortunately, git-svn's .rev-map is sorted by revision number.
I think it would be useless to add subcommands unless they could be
plugged into Git as a "smart-tag resolver".

Peff,

First of all, thank you for your work at GitHub! Awesome!
I didn't know GitHub had refs issues. (yeah, I should not push 100k
tags to GitHub for now :p )

I work on both Linux and Windows. A many-refs repo can make Git
awfully slow on Windows (even slower than on Linux!). I hope I can
also work on improving various performance issues on Windows.

FYI, I have tweaked git-rev-list not to sort commits by date with
--quiet. It improves git-fetch (git-rev-list --not --all) performance
when the objects are well packed.


...Takumi


* Re: Git is not scalable with too many refs/*
  2011-06-09 15:56       ` Shawn Pearce
  2011-06-09 16:26         ` Jeff King
@ 2011-06-10  7:41         ` Andreas Ericsson
  2011-06-10 19:41           ` Shawn Pearce
  1 sibling, 1 reply; 126+ messages in thread
From: Andreas Ericsson @ 2011-06-10  7:41 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git

On 06/09/2011 05:56 PM, Shawn Pearce wrote:
> On Thu, Jun 9, 2011 at 08:52, A Large Angry SCM<gitzilla@gmail.com>  wrote:
>> On 06/09/2011 11:23 AM, Shawn Pearce wrote:
>>> Having a reference to every commit in the repository is horrifically
>>> slow. We run into this with Gerrit Code Review and I need to find
>>> another solution. Git just wasn't meant to process repositories like
>>> this.
>>
>> Assuming a very large number of refs, what is it that makes git so
>> horrifically slow? Is there a design or implementation lesson here?
> 
> A few things.
> 
> Git does a sequential scan of all references when it first needs to
> access references for an operation. This requires reading the entire
> packed-refs file, and the recursive scan of the "refs/" subdirectory
> for any loose refs that might override the packed-refs file.
> 
> A lot of operations toss every commit that a reference points at into
> the revision walker's LRU queue. If you have a tag pointing to every
> commit, then the entire project history enters the LRU queue at once,
> up front. That queue is managed with O(N^2) insertion time. And the
> entire queue has to be filled before anything can be output.
> 

Hmm. Since we're using pre-hashed data with an obvious lookup method
we should be able to do much, much better than O(n^2) for insertion
and better than O(n) for worst-case lookups. I'm thinking a 1-byte
trie, resulting in a depth, lookup and insertion complexity of 20. It
would waste some memory but it might be worth it for fixed asymptotic
complexity for both insertion and lookup.
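A toy version of the trie idea might look like this (illustration only;
it uses a shortened 4-byte key instead of the full 20-byte SHA-1, and
the node layout is an assumption):

```c
#include <stdlib.h>

/* 256-way trie keyed on raw hash bytes: insertion and lookup both
 * touch exactly KEY_LEN nodes (20 for SHA-1; 4 here to keep the
 * example small), independent of how many entries are stored. */
#define KEY_LEN 4

struct trie_node {
    struct trie_node *child[256];
    void *value;             /* set only on the final node of a key */
};

void trie_insert(struct trie_node *root, const unsigned char *key,
                 void *value)
{
    for (int i = 0; i < KEY_LEN; i++) {
        if (!root->child[key[i]])
            root->child[key[i]] = calloc(1, sizeof(struct trie_node));
        root = root->child[key[i]];
    }
    root->value = value;
}

void *trie_lookup(struct trie_node *root, const unsigned char *key)
{
    for (int i = 0; i < KEY_LEN; i++) {
        root = root->child[key[i]];
        if (!root)
            return NULL;
    }
    return root->value;
}
```

Each node holds 256 pointers (about 2 KB on 64-bit), which is the
memory cost Andreas alludes to.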

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: Git is not scalable with too many refs/*
  2011-06-10  7:41         ` Andreas Ericsson
@ 2011-06-10 19:41           ` Shawn Pearce
  2011-06-10 20:12             ` Jakub Narebski
                               ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Shawn Pearce @ 2011-06-10 19:41 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git

On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson <ae@op5.se> wrote:
> On 06/09/2011 05:56 PM, Shawn Pearce wrote:
>>
>> A lot of operations toss every commit that a reference points at into
>> the revision walker's LRU queue. If you have a tag pointing to every
>> commit, then the entire project history enters the LRU queue at once,
>> up front. That queue is managed with O(N^2) insertion time. And the
>> entire queue has to be filled before anything can be output.
>
> Hmm. Since we're using pre-hashed data with an obvious lookup method
> we should be able to do much, much better than O(n^2) for insertion
> and better than O(n) for worst-case lookups. I'm thinking a 1-byte
> trie, resulting in a depth, lookup and insertion complexity of 20. It
> would waste some memory but it might be worth it for fixed asymptotic
> complexity for both insertion and lookup.

Not really.

The queue isn't sorted by SHA-1. It's sorted by commit timestamp,
descending. Those aren't pre-hashed. The O(N^2) insertion is because
the code is trying to find where this commit belongs in the list of
commits as sorted by commit timestamp.

There are some priority queue datastructures designed for this sort of
work, e.g. a calendar queue might help. But its not as simple as a 1
byte trie.

-- 
Shawn.


* Re: Git is not scalable with too many refs/*
  2011-06-10 19:41           ` Shawn Pearce
@ 2011-06-10 20:12             ` Jakub Narebski
  2011-06-10 20:35             ` Jeff King
  2011-06-13  7:08             ` Andreas Ericsson
  2 siblings, 0 replies; 126+ messages in thread
From: Jakub Narebski @ 2011-06-10 20:12 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Andreas Ericsson, A Large Angry SCM, Sverre Rabbelier,
	NAKAMURA Takumi, git

Shawn Pearce <spearce@spearce.org> writes:
> On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson <ae@op5.se> wrote:
>> On 06/09/2011 05:56 PM, Shawn Pearce wrote:
>>>
>>> A lot of operations toss every commit that a reference points at into
>>> the revision walker's LRU queue. If you have a tag pointing to every
>>> commit, then the entire project history enters the LRU queue at once,
>>> up front. That queue is managed with O(N^2) insertion time. And the
>>> entire queue has to be filled before anything can be output.
>>
>> Hmm. Since we're using pre-hashed data with an obvious lookup method
>> we should be able to do much, much better than O(n^2) for insertion
>> and better than O(n) for worst-case lookups. I'm thinking a 1-byte
>> trie, resulting in a depth, lookup and insertion complexity of 20. It
>> would waste some memory but it might be worth it for fixed asymptotic
>> complexity for both insertion and lookup.
> 
> Not really.
> 
> The queue isn't sorting by SHA-1. Its sorting by commit timestamp,
> descending. Those aren't pre-hashed. The O(N^2) insertion is because
> the code is trying to find where this commit belongs in the list of
> commits as sorted by commit timestamp.
> 
> There are some priority queue datastructures designed for this sort of
> work, e.g. a calendar queue might help. But its not as simple as a 1
> byte trie.

In the case of Subversion numbers (a revision number to hash mapping),
sorted by name (at least in version order) means sorted by date.  I
wonder if there is a data structure for which this is the optimal
insertion order (like almost-sorted data is the best case for
insertion sort).

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Git is not scalable with too many refs/*
  2011-06-10 19:41           ` Shawn Pearce
  2011-06-10 20:12             ` Jakub Narebski
@ 2011-06-10 20:35             ` Jeff King
  2011-06-13  7:08             ` Andreas Ericsson
  2 siblings, 0 replies; 126+ messages in thread
From: Jeff King @ 2011-06-10 20:35 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Andreas Ericsson, A Large Angry SCM, Sverre Rabbelier,
	NAKAMURA Takumi, git

On Fri, Jun 10, 2011 at 12:41:39PM -0700, Shawn O. Pearce wrote:

> Not really.
> 
> The queue isn't sorting by SHA-1. Its sorting by commit timestamp,
> descending. Those aren't pre-hashed. The O(N^2) insertion is because
> the code is trying to find where this commit belongs in the list of
> commits as sorted by commit timestamp.
> 
> There are some priority queue datastructures designed for this sort of
> work, e.g. a calendar queue might help. But its not as simple as a 1
> byte trie.

All you really need is a heap-based priority queue, which gives O(lg n)
insertion and popping (and O(1) peeking at the top). I even wrote one
and posted it recently (I won't dig up the reference, but I posted it
elsewhere in this thread, I think).

The problem is that many parts of the code assume that commit_list is a
linked list and do fast iterations, or even splicing. It's nothing you
couldn't get around with some work, but it turns out to involve a lot
of code changes.

-Peff


* Re: Git is not scalable with too many refs/*
  2011-06-10 19:41           ` Shawn Pearce
  2011-06-10 20:12             ` Jakub Narebski
  2011-06-10 20:35             ` Jeff King
@ 2011-06-13  7:08             ` Andreas Ericsson
  2 siblings, 0 replies; 126+ messages in thread
From: Andreas Ericsson @ 2011-06-13  7:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: A Large Angry SCM, Sverre Rabbelier, NAKAMURA Takumi, git

On 06/10/2011 09:41 PM, Shawn Pearce wrote:
> On Fri, Jun 10, 2011 at 00:41, Andreas Ericsson<ae@op5.se>  wrote:
>> On 06/09/2011 05:56 PM, Shawn Pearce wrote:
>>>
>>> A lot of operations toss every commit that a reference points at into
>>> the revision walker's LRU queue. If you have a tag pointing to every
>>> commit, then the entire project history enters the LRU queue at once,
>>> up front. That queue is managed with O(N^2) insertion time. And the
>>> entire queue has to be filled before anything can be output.
>>
>> Hmm. Since we're using pre-hashed data with an obvious lookup method
>> we should be able to do much, much better than O(n^2) for insertion
>> and better than O(n) for worst-case lookups. I'm thinking a 1-byte
>> trie, resulting in a depth, lookup and insertion complexity of 20. It
>> would waste some memory but it might be worth it for fixed asymptotic
>> complexity for both insertion and lookup.
> 
> Not really.
> 
> The queue isn't sorting by SHA-1. Its sorting by commit timestamp,
> descending. Those aren't pre-hashed. The O(N^2) insertion is because
> the code is trying to find where this commit belongs in the list of
> commits as sorted by commit timestamp.
> 

Hmm. We should still be able to do better than that, and particularly
for the "tag-each-commit" workflow. Since it's most likely those tags
are generated using incrementing numbers, we could have a cut-off where
we first parse all the refs and make an optimistic assumption that an
alphabetical sort of the refs provides a map of insertion-points for
the commits. Since the best case behaviour is still O(1) for insertion
sort and it's unlikely that thousands of refs are in random order, that
should cause the vast majority of the refs we insert to follow the best
case scenario.

This will fall on its arse when people start doing hg-ref -> git-commit
tags of course, but that doesn't seem to be happening, or at least not
to the same extent as the svn-revision -> git-commit mapping.

We're still not improving the asymptotic complexity, but it's a pretty
safe bet that for the vast majority of cases we improve wallclock
runtime by a hefty amount with a relatively minor effort.
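The optimistic scheme could be sketched as a date-sorted list that
remembers the last insertion point (a hypothetical helper, not git
code; it assumes refs tend to arrive in date order):

```c
#include <stdlib.h>

/* Date-sorted (descending) singly linked list that first tries to
 * insert right after the previously inserted node. Input that is
 * already in date order hits the O(1) fast path on every insert;
 * out-of-order input falls back to the usual linear scan. */
struct node {
    unsigned long date;
    struct node *next;
};

struct dlist {
    struct node *head;
    struct node *hint;   /* last inserted node */
};

void hinted_insert(struct dlist *l, unsigned long date)
{
    struct node *n = malloc(sizeof(*n));
    n->date = date;

    struct node **pp = &l->head;
    /* Fast path: the new item belongs right after the hint. */
    if (l->hint && l->hint->date >= date &&
        (!l->hint->next || l->hint->next->date < date))
        pp = &l->hint->next;
    else                 /* slow path: scan from the head */
        while (*pp && (*pp)->date >= date)
            pp = &(*pp)->next;

    n->next = *pp;
    *pp = n;
    l->hint = n;
}
```

This keeps the worst case at O(n) per insert but makes the
"tag-each-commit" pattern mostly hit the constant-time path.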

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: Git is not scalable with too many refs/*
  2011-06-10  3:59           ` NAKAMURA Takumi
@ 2011-06-13 22:27             ` Jeff King
  2011-06-14  0:17             ` Andreas Ericsson
  1 sibling, 0 replies; 126+ messages in thread
From: Jeff King @ 2011-06-13 22:27 UTC (permalink / raw)
  To: NAKAMURA Takumi; +Cc: Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On Fri, Jun 10, 2011 at 12:59:47PM +0900, NAKAMURA Takumi wrote:

> 2011/6/10 Stephen Bash <bash@genarts.com>:
> > I've seen two different workflows develop:
> >  1) Hacking on some code in Git the programmer finds something wrong.  Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
> >  2) The programmer finds something wrong, asks coworker, coworker says "see bug XYZ", bug XYZ says "Fixed in r20356".
> >
> > I agree notes is the right answer for (1), but for (2) you really want a cross reference table from Subversion rev number to Git commit.
> 
> It is the point I wanted to say, thank you! I am working with svn-men.
> They often speak svn revision number. (And I have to tell them svn
> revs then)

Yeah, there is no simple way to do the bi-directional mapping in git.
If all you want are decorations on commits, notes are definitely the way
to go. They are optimized for lookup of commit -> data. But if you
want data -> commit, the only mapping we have is refs, and they are not
well optimized for the many-refs use case.

Packed-refs are better than loose refs, but I think right now we just
load them all in to an in-memory linked list. We could load them into a
more efficient in-memory data structure, or we could perhaps even mmap
the packed-refs file and binary search it in place.

But lookup is only part of the problem. There are algorithms that want
to look at all the refs (notably fetching and pushing), which are going
to be a bit slower. We don't have a way to tell those algorithms that
those refs are uninteresting for reachability analysis, because they are
just pointing to parts of the graph that are already reachable by
regular refs. Maybe there could be a part of the refs namespace that is
ignored by "--all". I dunno. That seems like a weird inconsistency.
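The binary-search idea can be sketched over an in-memory sorted array
(a toy model; a real implementation would parse or mmap the
packed-refs file itself):

```c
#include <string.h>
#include <stddef.h>

/* Toy model of binary-searching a refs listing in place: packed-refs
 * entries are sorted by refname, so a lookup costs O(log n)
 * comparisons instead of a linear scan of every entry. */
struct ref_entry {
    const char *name;
    const char *sha1;   /* hex object name */
};

const char *find_ref(const struct ref_entry *refs, size_t nr,
                     const char *name)
{
    size_t lo = 0, hi = nr;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(refs[mid].name, name);
        if (cmp == 0)
            return refs[mid].sha1;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;   /* not found */
}
```

This only helps single-ref lookup; as noted above, whole-ref-space
operations like fetch and push still have to touch every entry.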

> FYI, I have tweaked git-rev-list for commits not to sort by date with
> --quiet. It improves git-fetch (git-rev-list --not --all) performance
> when objects is well-packed.

I'm not sure that is a good solution. Even with --quiet, we will be
walking the commit graph to find merge bases to see if things are
connected. The walking code expects date-sorting; I'm not sure what
changing that assumption will do to the code.

-Peff


* Re: Git is not scalable with too many refs/*
  2011-06-10  3:59           ` NAKAMURA Takumi
  2011-06-13 22:27             ` Jeff King
@ 2011-06-14  0:17             ` Andreas Ericsson
  2011-06-14  0:30               ` Jeff King
  1 sibling, 1 reply; 126+ messages in thread
From: Andreas Ericsson @ 2011-06-14  0:17 UTC (permalink / raw)
  To: NAKAMURA Takumi
  Cc: Jeff King, Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On 06/10/2011 05:59 AM, NAKAMURA Takumi wrote:
> Good afternoon Git! Thank you guys to give me comments.
> 
> Jakub and Shawn,
> 
> Sure, Notes should be used at the case, I agree.
> 
>> (eg. git log --oneline --decorate shows me each svn revision)
> 
> My example might misunderstand you. I intended tags could show me
> pretty abbrev everywhere on Git. I would be happier if tags might be
> available bi-directional alias, as Stephen mentions.
> 
> It would be better git-svn could record metadata into notes, I think, too. :D
> 
> Stephen,
> 
> 2011/6/10 Stephen Bash<bash@genarts.com>:
>> I've seen two different workflows develop:
>>   1) Hacking on some code in Git the programmer finds something wrong.  Using Git tools he can pickaxe/bisect/etc. and find that the problem traces back to a commit imported from Subversion.
>>   2) The programmer finds something wrong, asks coworker, coworker says "see bug XYZ", bug XYZ says "Fixed in r20356".
>>
>> I agree notes is the right answer for (1), but for (2) you really want a cross reference table from Subversion rev number to Git commit.
> 

If you're using svn metadata in the commit text, you can always do
"git log -p --grep=@20356" to get the commits relevant to that one.
It's not as fast as "git show svn-20356", but it's not exactly
glacial either and would avoid the problems you're having now.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


* Re: Git is not scalable with too many refs/*
  2011-06-14  0:17             ` Andreas Ericsson
@ 2011-06-14  0:30               ` Jeff King
  2011-06-14  4:41                 ` Junio C Hamano
  0 siblings, 1 reply; 126+ messages in thread
From: Jeff King @ 2011-06-14  0:30 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM, Sverre Rabbelier, git

On Tue, Jun 14, 2011 at 02:17:58AM +0200, Andreas Ericsson wrote:

> If you're using svn metadata in the commit text, you can always do
> "git log -p --grep=@20356" to get the commits relevant to that one.
> It's not as fast as "git show svn-20356", but it's not exactly
> glacial either and would avoid the problems you're having now.

If we do end up putting this data into notes eventually (which I think
we _should_ do, because then you aren't locked into having this svn
cruft in your commit messages for all time, but can rather choose
whether or not to display it), it would be nice to have a --grep-notes
feature in git-log. Or maybe --grep should look in notes by default,
too, if we are showing them.

I suspect the feature would be really easy to implement, if somebody is
looking for a gentle introduction to git, or a fun way to spend an hour.
:)

-Peff

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14  0:30               ` Jeff King
@ 2011-06-14  4:41                 ` Junio C Hamano
  2011-06-14  7:26                   ` Sverre Rabbelier
  0 siblings, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-06-14  4:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce,
	A Large Angry SCM, Sverre Rabbelier, git

Jeff King <peff@peff.net> writes:

> I suspect the feature would be really easy to implement, if somebody is
> looking for a gentle introduction to git, or a fun way to spend an hour.

I would rather want to see if somebody can come up with a flexible reverse
mapping feature around notes. It does not have to be completely generic,
just being flexible enough is fine.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14  4:41                 ` Junio C Hamano
@ 2011-06-14  7:26                   ` Sverre Rabbelier
  2011-06-14 10:02                     ` Johan Herland
  0 siblings, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-06-14  7:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Andreas Ericsson, NAKAMURA Takumi, Shawn Pearce,
	A Large Angry SCM, git

Heya,

On Tue, Jun 14, 2011 at 06:41, Junio C Hamano <gitster@pobox.com> wrote:
> I would rather want to see if somebody can come up with a flexible reverse
> mapping feature around notes. It does not have to be completely generic,
> just being flexible enough is fine.

Wouldn't it be enough to simply create a note on 'r651235' with as
contents the git ref?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14  7:26                   ` Sverre Rabbelier
@ 2011-06-14 10:02                     ` Johan Herland
  2011-06-14 10:34                       ` Sverre Rabbelier
  2011-06-14 17:02                       ` Jeff King
  0 siblings, 2 replies; 126+ messages in thread
From: Johan Herland @ 2011-06-14 10:02 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: git, Junio C Hamano, Jeff King, Andreas Ericsson,
	NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

On Tuesday 14 June 2011, Sverre Rabbelier wrote:
> Heya,
> 
> On Tue, Jun 14, 2011 at 06:41, Junio C Hamano <gitster@pobox.com> wrote:
> > I would rather want to see if somebody can come up with a flexible
> > reverse mapping feature around notes. It does not have to be
> > completely generic, just being flexible enough is fine.
> 
> Wouldn't it be enough to simply create a note on 'r651235' with as
> contents the git ref?

Not quite sure what you mean by "create a note on 'r651235'". You could 
devise a scheme where you SHA1('r651235'), and then create a note on the 
resulting hash.

Notes are named by the SHA1 of the object they annotate, but there is no 
hard requirement (as long as you stay away from "git notes prune") that the 
SHA1 annotated actually exists as a valid Git object in your repo.

Hence, you can use notes to annotate _anything_ that can be uniquely reduced 
to a SHA1 hash.
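
A minimal sketch of that scheme in Python — hashing an arbitrary string the same way `git hash-object --stdin` hashes a blob, so the result can stand in for the "annotated object" name (the revision string is just the example from above; nothing here is an actual git API):

```python
import hashlib

def note_key(token: str) -> str:
    """Hash an arbitrary string git-blob-style, yielding a 40-hex
    name that a note could be attached to."""
    data = token.encode()
    header = f"blob {len(data)}\0".encode()  # git's blob object header
    return hashlib.sha1(header + data).hexdigest()

key = note_key("r651235")  # 40-hex key to hang the note on
```

One could then attach the note with something like `git notes add -m <commit-sha1> $key` — with the caveat, as noted, that "git notes prune" would eventually discard entries whose key does not name a real object.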


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 10:02                     ` Johan Herland
@ 2011-06-14 10:34                       ` Sverre Rabbelier
  2011-06-14 17:02                       ` Jeff King
  1 sibling, 0 replies; 126+ messages in thread
From: Sverre Rabbelier @ 2011-06-14 10:34 UTC (permalink / raw)
  To: Johan Herland
  Cc: git, Junio C Hamano, Jeff King, Andreas Ericsson,
	NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

Heya,

On Tue, Jun 14, 2011 at 12:02, Johan Herland <johan@herland.net> wrote:
> Not quite sure what you mean by "create a note on 'r651235'". You could
> devise a scheme where you SHA1('r651235'), and then create a note on the
> resulting hash.

I was thinking they could annotate anything, even non-sha's, but in
that case, yes, the sha of the revision would work just as well.


-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 10:02                     ` Johan Herland
  2011-06-14 10:34                       ` Sverre Rabbelier
@ 2011-06-14 17:02                       ` Jeff King
  2011-06-14 19:20                         ` Shawn Pearce
  1 sibling, 1 reply; 126+ messages in thread
From: Jeff King @ 2011-06-14 17:02 UTC (permalink / raw)
  To: Johan Herland
  Cc: Sverre Rabbelier, git, Junio C Hamano, Andreas Ericsson,
	NAKAMURA Takumi, Shawn Pearce, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:02:46PM +0200, Johan Herland wrote:

> > Wouldn't it be enough to simply create a note on 'r651235' with as
> > contents the git ref?
> 
> Not quite sure what you mean by "create a note on 'r651235'". You could 
> devise a scheme where you SHA1('r651235'), and then create a note on the 
> resulting hash.
> 
> Notes are named by the SHA1 of the object they annotate, but there is no 
> hard requirement (as long as you stay away from "git notes prune") that the 
> SHA1 annotated actually exists as a valid Git object in your repo.
> 
> Hence, you can use notes to annotate _anything_ that can be uniquely reduced 
> to a SHA1 hash.

I lean against that as a solution. I think "git gc" will probably
eventually learn to do "git notes prune", at which point we would start
losing people's data. So I think it is better to keep the definition of
notes a little tighter now, and say "the left-hand side of a notes
mapping must be a referenced object". We can always loosen it later.

On top of that, though, the sha1 solution is not all that pleasant. It
lets you do exact lookups, but you have no way of iterating over the
list of svn revisions.

I also think we can do something a little more lightweight. The user has
already created and is maintaining a mapping in one direction via the
notes. We just need the inverse mapping, which we can generate
programmatically. So it can be a straight cache, with the sha1 of the
notes tree determining the cache validity (i.e., if the forward mapping
in the notes tree changes, you regenerate the cache from scratch).

We would want to store the cache in an on-disk format that could be
searched easily. Possibly something like the packed-refs format would be
sufficient, if we mmap'd and binary searched it. It would be dirt simple
if we used an existing key/value store like gdbm or tokyocabinet, but we
usually try to avoid extra dependencies.
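
A rough sketch of such a cache in Python: a sorted one-entry-per-line file searched by binary search. The line format is an assumption modeled loosely on packed-refs; a real implementation would mmap the raw bytes rather than read the whole file:

```python
import bisect

def write_cache(path, mapping):
    # one "key SP value" line per entry, sorted by key
    with open(path, "w") as f:
        for key in sorted(mapping):
            f.write(f"{key} {mapping[key]}\n")

def lookup(path, key):
    # illustrative: reads everything; the real thing would mmap and
    # binary-search byte offsets instead
    with open(path) as f:
        lines = f.read().splitlines()
    keys = [line.split(" ", 1)[0] for line in lines]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return lines[i].split(" ", 1)[1]
    return None
```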

-Peff

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 17:02                       ` Jeff King
@ 2011-06-14 19:20                         ` Shawn Pearce
  2011-06-14 19:47                           ` Jeff King
  0 siblings, 1 reply; 126+ messages in thread
From: Shawn Pearce @ 2011-06-14 19:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano,
	Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 10:02, Jeff King <peff@peff.net> wrote:
> I also think we can do something a little more lightweight. The user has
> already created and is maintaining a mapping in one direction via the
> notes. We just need the inverse mapping, which we can generate
> programmatically. So it can be a straight cache, with the sha1 of the
> notes tree determining the cache validity (i.e., if the forward mapping
> in the notes tree changes, you regenerate the cache from scratch).
>
> We would want to store the cache in an on-disk format that could be
> searched easily. Possibly something like the packed-refs format would be
> sufficient, if we mmap'd and binary searched it. It would be dirt simple
> if we used an existing key/value store like gdbm or tokyocabinet, but we
> usually try to avoid extra dependencies.

Yea, not a bad idea. Use a series of SSTable like things, like Hadoop
uses. It doesn't need to be as complex as the Hadoop SSTable concept.
But a simple sorted string to string mapping file that is immutable,
with edits applied by creating an overlay file that contains
new/updated entries.

As you point out, we can use the notes tree to tell us the validity of
the cache, and do incremental updates. If the current cache doesn't
match the notes ref, compute the tree diff between the current cache's
source tree and the new tree, and create a new SSTable like thing that
has the relevant updates as an overlay of the existing tables. After
some time you will have many of these little overlay files, and a GC
can just merge them down to a single file.

The only problem is, you probably want this "reverse notes index" to
be indexing a portion of the note blob text, not all of it. That is,
we want the SVN note text to say something like "SVN Revision: r1828"
so `git log --notes=svn` shows us something more useful than just
"r1828". But in the reverse index, we may only want the key to be
"r1828". So you need some sort of small mapping function to decide
what to put into that reverse index.
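
That mapping function could be little more than a regular expression over the note text; a minimal sketch, where the "SVN Revision: r1828" note format is the hypothetical one mentioned above:

```python
import re

# pattern for pulling the reverse-index key out of a note blob
# (illustrative note format, not anything git defines)
KEY_RE = re.compile(r"SVN Revision: (r[0-9]+)")

def reverse_index_key(note_text):
    """Return the key to index this note under, or None to skip it."""
    m = KEY_RE.search(note_text)
    return m.group(1) if m else None
```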

-- 
Shawn.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 19:20                         ` Shawn Pearce
@ 2011-06-14 19:47                           ` Jeff King
  2011-06-14 20:12                             ` Shawn Pearce
  0 siblings, 1 reply; 126+ messages in thread
From: Jeff King @ 2011-06-14 19:47 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano,
	Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:20:29PM -0700, Shawn O. Pearce wrote:

> > We would want to store the cache in an on-disk format that could be
> > searched easily. Possibly something like the packed-refs format would be
> > sufficient, if we mmap'd and binary searched it. It would be dirt simple
> > if we used an existing key/value store like gdbm or tokyocabinet, but we
> > usually try to avoid extra dependencies.
> 
> Yea, not a bad idea. Use a series of SSTable like things, like Hadoop
> uses. It doesn't need to be as complex as the Hadoop SSTable concept.
> But a simple sorted string to string mapping file that is immutable,
> with edits applied by creating an overlay file that contains
> new/updated entries.
> 
> As you point out, we can use the notes tree to tell us the validity of
> the cache, and do incremental updates. If the current cache doesn't
> match the notes ref, compute the tree diff between the current cache's
> source tree and the new tree, and create a new SSTable like thing that
> has the relevant updates as an overlay of the existing tables. After
> some time you will have many of these little overlay files, and a GC
> can just merge them down to a single file.

I was really hoping that it would be fast enough that we could simply
blow away the old mapping and recreate it from scratch. That gets us out
of writing any journaling-type code with overlays. For something like
svn revisions, it's probably fine to take an extra second or two to
build the cache after we do a fetch. But it wouldn't scale to something
that was getting updated frequently.

If we're going to start doing clever database-y things, I'd much rather
use a proven key/value db solution like tokyocabinet. I'm just not sure
how to degrade gracefully when the db library isn't available. Don't
allow reverse mappings? Fallback to something slow?

> The only problem is, you probably want this "reverse notes index" to
> be indexing a portion of the note blob text, not all of it. That is,
> we want the SVN note text to say something like "SVN Revision: r1828"
> so `git log --notes=svn` shows us something more useful than just
> "r1828". But in the reverse index, we may only want the key to be
> "r1828". So you need some sort of small mapping function to decide
> what to put into that reverse index.

I had assumed that we would just be writing r1828 into the note. The
output via git log is actually pretty readable:

  $ git notes --ref=svn/revisions add -m r1828
  $ git show --notes=svn/revisions
  ...
  Notes (svn/revisions):
      r1828

Of course this is just one use case.

For that matter, we have to figure out how one would actually reference
the reverse mapping. If we have a simple, pure-reverse mapping, we can
just generate and cache them on the fly, and give a special syntax.
Like:

  $ git log notes/svn/revisions@{revnote:r1828}

which would invert the notes/svn/revisions tree, search for r1828, and
reference the resulting commit.
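
The inversion itself is straightforward; a sketch that treats the forward notes mapping as a dict from commit sha1 to note text and keeps the cache keyed on the notes tree, rebuilding from scratch whenever the tree changes (all names here are hypothetical, not git internals):

```python
class RevnoteCache:
    """Reverse mapping (note text -> commit), validated by the
    sha1 of the notes tree it was built from."""

    def __init__(self):
        self.tree_id = None
        self.reverse = {}

    def lookup(self, notes_tree_id, forward, key):
        if notes_tree_id != self.tree_id:
            # forward mapping changed: regenerate the whole cache
            self.reverse = {note: commit for commit, note in forward.items()}
            self.tree_id = notes_tree_id
        return self.reverse.get(key)
```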

If you had something more heavyweight that actually needed to parse
during the mapping, you might have something like:

  $ : set up the mapping
  $ git config revnote.svn.map 'SVN Revision: (r[0-9]+)'

  $ : do the reverse; we should be able to build the cache on the fly
  $ git notes reverse r1828
  346ab9aaa1cf7b1ed2dd2c0a67bccc5b8ec23f7c

  $ : so really you could have a similar ref syntax like, though
  $ : this would require some ref parser updates, as we currently
  $ : assume anything to the left of @{} is a real ref
  $ git log r1828@{revnote:svn}

The syntaxes are not as nice as having a real ref. In the last example,
we could probably look for the contents of "@{}" as a possible revnote
mapping (since we've already had to name it via the configuration), to
make it "r1828@{svn}". Or you could even come up with a default set of
revnotes to consider, so that if we lookup "r1828" and it isn't a real
ref, we fall back to trying r1828@{revnote:svn}.

I dunno. I'm just throwing ideas out at this point.

-Peff

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 19:47                           ` Jeff King
@ 2011-06-14 20:12                             ` Shawn Pearce
  2011-09-08 19:53                               ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Shawn Pearce @ 2011-06-14 20:12 UTC (permalink / raw)
  To: Jeff King
  Cc: Johan Herland, Sverre Rabbelier, git, Junio C Hamano,
	Andreas Ericsson, NAKAMURA Takumi, A Large Angry SCM

On Tue, Jun 14, 2011 at 12:47, Jeff King <peff@peff.net> wrote:
> On Tue, Jun 14, 2011 at 12:20:29PM -0700, Shawn O. Pearce wrote:
>
>> > We would want to store the cache in an on-disk format that could be
>> > searched easily. Possibly something like the packed-refs format would be
>> > sufficient, if we mmap'd and binary searched it. It would be dirt simple
>> > if we used an existing key/value store like gdbm or tokyocabinet, but we
>> > usually try to avoid extra dependencies.
>>
>> Yea, not a bad idea. Use a series of SSTable like things, like Hadoop
>> uses. It doesn't need to be as complex as the Hadoop SSTable concept.
>> But a simple sorted string to string mapping file that is immutable,
>> with edits applied by creating an overlay file that contains
>> new/updated entries.
>>
>> As you point out, we can use the notes tree to tell us the validity of
>> the cache, and do incremental updates. If the current cache doesn't
>> match the notes ref, compute the tree diff between the current cache's
>> source tree and the new tree, and create a new SSTable like thing that
>> has the relevant updates as an overlay of the existing tables. After
>> some time you will have many of these little overlay files, and a GC
>> can just merge them down to a single file.
>
> I was really hoping that it would be fast enough that we could simply
> blow away the old mapping and recreate it from scratch. That gets us out
> of writing any journaling-type code with overlays. For something like
> svn revisions, it's probably fine to take an extra second or two to
> build the cache after we do a fetch. But it wouldn't scale to something
> that was getting updated frequently.
>
> If we're going to start doing clever database-y things, I'd much rather
> use a proven key/value db solution like tokyocabinet. I'm just not sure
> how to degrade gracefully when the db library isn't available. Don't
> allow reverse mappings? Fallback to something slow?

This is why I would prefer to build the solution into Git.

It's not that bad to do a sorted string file. Take something simple
that is similar to a pack file:

  GSST | vers | rcnt | srctree | base
  [ klen key vlen value ]*
  [ roff ]*
  SHA1(all_of_above)

Where vers is a version number, rcnt is the number of records, srctree
is the SHA-1 of the notes tree this thing indexed from, and base is
the SHA-1 of the notes tree this "applies on top of". There are then
rcnt records in the file, each using a variable length key length and
value length field (klen, vlen), with variable length key and values.
At the end of the file are rcnt 4 byte offsets to the start of each
key.

When writing the file, write all of the above to a temporary file,
then rename it to $GIT_DIR/cache/db-$SHA1.db, as it's a unique name.
It's easy to prepare the list of entries in memory as an array of
structs of key/value pairs, sort them with qsort(), write them out and
update offsets as you go, then dump out the offset table at the end.
One could compress the offset table by only storing every N offsets,
readers perform binary search until they find the first key that is
before their desired key, then sequentially scan records until they
locate the correct entry... but I'm not sure the space savings is
really worthwhile here.

When reading, scan the directory and read the headers of each file. If
the file has your target srctree, your cache is current and you can
read it. If a key isn't in this file, you open the file named
$GIT_DIR/cache/db-$base.db and try again there, walking back along
that base chain until base is '0'x40. (or some other marker in the
header to denote there is no base file).

GC is just a matter of merging the sorted files together. Follow along
all of the base pointers, open all of them, scan through the records
and write out the first key that is defined. I guess we need a small
"delete" bit in the record to indicate a particular key/value was
removed from the database. Since this is a reverse mapping, duplicates
are possible, and readers that want all values need to scan back to
the base file, but skip base entries that were marked deleted in a
newer file.

Updating is just preparing a new file that uses the current srctree as
your base, and only inserting/sorting the paths that were different in
the notes.

We probably need to store these files keyed by their notes ref, so we
can find "svn/revisions" differently from "bugzilla" (another
hypothetical mapping of Bugzilla bug ids to commit SHA-1s, based on
notes that attached bug numbers to commits).

I don't think its that bad. Maybe its a bit too much complexity for
version 1 to have these incremental update files be supported, but it
shouldn't be that hard.
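
The read path of that overlay chain can be sketched in a few lines of Python, with in-memory dicts standing in for the GSST files (the delete marker and base chain follow the description above; everything else is simplified):

```python
DELETED = object()  # stands in for the per-record "delete" bit

class Overlay:
    """One immutable layer of the reverse-index cache."""

    def __init__(self, records, base=None):
        self.records = records  # key -> value, or DELETED
        self.base = base        # older Overlay, or None (the '0'x40 marker)

    def get(self, key):
        # newest layer wins; a delete in a newer file hides base entries
        layer = self
        while layer is not None:
            if key in layer.records:
                val = layer.records[key]
                return None if val is DELETED else val
            layer = layer.base
        return None
```

GC then corresponds to merging the chain down into a single Overlay, dropping the DELETED entries along the way.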

>> The only problem is, you probably want this "reverse notes index" to
>> be indexing a portion of the note blob text, not all of it. That is,
>> we want the SVN note text to say something like "SVN Revision: r1828"
>> so `git log --notes=svn` shows us something more useful than just
>> "r1828". But in the reverse index, we may only want the key to be
>> "r1828". So you need some sort of small mapping function to decide
>> what to put into that reverse index.
>
> I had assumed that we would just be writing r1828 into the note. The
> output via git log is actually pretty readable:
>
>  $ git notes --ref=svn/revisions add -m r1828
>  $ git show --notes=svn/revisions
>  ...
>  Notes (svn/revisions):
>      r1828
>
> Of course this is just one use case.

Thanks, I keep forgetting that the notes prints the note ref name out
before the text, so it's already got this annotation present. This
makes it much more likely that the bare "r1828" text is acceptable in
the note, and that the reverse index is just the entire content of the
blob as the key.  :-)

> For that matter, we have to figure out how one would actually reference
> the reverse mapping. If we have a simple, pure-reverse mapping, we can
> just generate and cache them on the fly, and give a special syntax.
> Like:
>
>  $ git log notes/svn/revisions@{revnote:r1828}

Uhm. Ick.

> The syntaxes are not as nice as having a real ref. In the last example,
> we could probably look for the contents of "@{}" as a possible revnote
> mapping (since we've already had to name it via the configuration), to
> make it "r1828@{svn}". Or you could even come up with a default set of
> revnotes to consider, so that if we lookup "r1828" and it isn't a real
> ref, we fall back to trying r1828@{revnote:svn}.

Or, what about setting up a fake ref namespace:

  git config ref.refs/remotes/svn/*.from refs/notes/svn/revisions

Then `git log svn/r1828` works. But these aren't real references. We
would only want to consider them if a request matched the glob, so
`git for-each-ref` and `git upload-pack` aren't reporting these things
by default, and neither is `git log --all` or `gitk --all`.

I agree a syntax that works out of the box without a configuration
file change would be nicer. But we are running out of operators to do
that with. `git log notes/svn/revisions@{revnote:r1828}` as you
propose above is at least workable...

-- 
Shawn.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-06-14 20:12                             ` Shawn Pearce
@ 2011-09-08 19:53                               ` Martin Fick
  2011-09-09  0:52                                 ` Martin Fick
  2011-09-09 13:50                                 ` Michael Haggerty
  0 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-08 19:53 UTC (permalink / raw)
  To: git

Just thought that I should add some numbers to this thread as it seems that
the later versions of git are worse off by several orders of magnitude on
this one.  

We have a Gerrit repo with just under 100K refs in refs/changes/*.  When I
fetch them all with git 1.7.6 it does not seem to complete.  Even after 5
days, it is just under halfway through the ref #s!  It appears, but I am
not sure, that it is getting slower with time also, so it may not even
complete after 10 days; I couldn't wait any longer.  However, the same
command works in under 8 mins with git 1.7.3.3 on the same machine!

Syncing 100K refs:

  git 1.7.6      > 8 days?
  git 1.7.3.3   ~8mins

That is quite a difference!  Have there been any obvious changes to git that
should cause this?  If needed, I can bisect git to find out where things go
sour, but I thought that perhaps there would be someone who already
understands the problem and why older versions aren't nearly as bad as
recent ones.

Some more things that I have tried:  after syncing the repo locally with all
100K refs under refs/changes, I cloned it locally again and tried fetching
locally with both git 1.7.6 and 1.7.3.3.  I got the same results as
remotely, so it does not appear to be related to round trips.

The original git remote syncing takes just a bit of time, and then it
outputs lines like these:
 ...
 * [new branch]      refs/changes/13/66713/2 -> refs/changes/13/66713/2
 * [new branch]      refs/changes/13/66713/3 -> refs/changes/13/66713/3
 * [new branch]      refs/changes/13/66713/4 -> refs/changes/13/66713/4
 * [new branch]      refs/changes/13/66713/5 -> refs/changes/13/66713/5
 ...

This is the part that takes forever.  The lines seem to scroll by slower and
slower (with git 1.7.6).  In the beginning, the lines might be a screen's
worth a minute; after 5 days, about one a minute.  My CPU is pegged at 100%
during this time (one core).  Since I have some good test data for this, let
me know if I should test anything specific.

Thanks,

-Martin

Employee of Qualcomm Innovation Center, Inc. which is a member of Code
Aurora Forum


--
View this message in context: http://git.661346.n2.nabble.com/Git-is-not-scalable-with-too-many-refs-tp6456443p6773496.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-08 19:53                               ` Martin Fick
@ 2011-09-09  0:52                                 ` Martin Fick
  2011-09-09  1:05                                   ` Thomas Rast
                                                     ` (2 more replies)
  2011-09-09 13:50                                 ` Michael Haggerty
  1 sibling, 3 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-09  0:52 UTC (permalink / raw)
  To: git

An update, I bisected it down to this commit:

  88a21979c5717e3f37b9691e90b6dbf2b94c751a

   fetch/pull: recurse into submodules when necessary

Since this can be disabled with the --no-recurse-submodules switch, I tried
that and indeed, even with the latest 1.7.7rc it becomes fast (~8mins)
again. The strange part about this is that the repository does not have any
submodules. Anyway, I hope that this can be useful to others since it is a
workaround which speeds things up enormously. Let me know if you have any
other tests that you want me to perform,

-Martin

--
View this message in context: http://git.661346.n2.nabble.com/Git-is-not-scalable-with-too-many-refs-tp6456443p6774328.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-09  0:52                                 ` Martin Fick
@ 2011-09-09  1:05                                   ` Thomas Rast
  2011-09-09  1:13                                     ` Thomas Rast
  2011-09-09 15:59                                   ` Jens Lehmann
  2011-09-25 20:43                                   ` Martin Fick
  2 siblings, 1 reply; 126+ messages in thread
From: Thomas Rast @ 2011-09-09  1:05 UTC (permalink / raw)
  To: Martin Fick, Jens Lehmann; +Cc: git

Martin Fick wrote:
> An update, I bisected it down to this commit:
> 
>   88a21979c5717e3f37b9691e90b6dbf2b94c751a
> 
>    fetch/pull: recurse into submodules when necessary
> 
> Since this can be disabled with the --no-recurse-submodules switch, I tried
> that and indeed, even with the latest 1.7.7rc it becomes fast (~8mins)
> again. The strange part about this is that the repository does not have any
> submodules. Anyway, I hope that this can be useful to others since it is a
> workaround which speeds things up enormously. Let me know if you have any
> other tests that you want me to perform,

Jens should know about this, so let's Cc him.

I took a quick look and I'm guessing that there's at least one
quadratic behaviour: in check_for_new_submodule_commits(), I see

+       const char *argv[] = {NULL, NULL, "--not", "--all", NULL};
+       int argc = ARRAY_SIZE(argv) - 1;
+
+       init_revisions(&rev, NULL);

which means that the --all needs to walk all commits reachable from
all refs and flag them as uninteresting.  But that function is called
for every ref update, so IIUC the time spent is on the order of
#ref updates*#commits.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-09  1:05                                   ` Thomas Rast
@ 2011-09-09  1:13                                     ` Thomas Rast
  0 siblings, 0 replies; 126+ messages in thread
From: Thomas Rast @ 2011-09-09  1:13 UTC (permalink / raw)
  To: Martin Fick, Jens Lehmann; +Cc: git

Thomas Rast wrote:
> +       const char *argv[] = {NULL, NULL, "--not", "--all", NULL};
> +       int argc = ARRAY_SIZE(argv) - 1;
> +
> +       init_revisions(&rev, NULL);
> 
> which means that the --all needs to walk all commits reachable from
> all refs and flag them as uninteresting.

Scratch that, it "only" needs to mark every tip commit and then walk
them back to about where the interesting commits end.

In any case, since the uninteresting set only gets larger, it should
be possible to reuse the same revision walker.
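
A toy cost model of the difference (the numbers are purely illustrative, not git's actual walk costs):

```python
def per_update_walk_cost(n_updates, n_commits):
    # naive: each ref update re-marks everything reachable as uninteresting
    return n_updates * n_commits

def reused_walker_cost(n_updates, n_commits):
    # reuse the walker: mark once up front, then each update only adds
    # (roughly constant) incremental work to the uninteresting set
    return n_commits + n_updates
```

With 100k ref updates over a 100k-commit history, that is the difference between ~10^10 units of work and ~2*10^5.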

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-08 19:53                               ` Martin Fick
  2011-09-09  0:52                                 ` Martin Fick
@ 2011-09-09 13:50                                 ` Michael Haggerty
  2011-09-09 15:51                                   ` Michael Haggerty
  2011-09-09 16:03                                   ` Jens Lehmann
  1 sibling, 2 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-09-09 13:50 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On 09/08/2011 09:53 PM, Martin Fick wrote:
> Just thought that I should add some numbers to this thread as it seems that
> the later versions of git are worse off by several orders of magnitude on
> this one.  
> 
> We have a Gerrit repo with just under 100K refs in refs/changes/*.  When I
> fetch them all with git 1.7.6 it does not seem to complete.  Even after 5
> days, it is just under half way through the ref #s! [...]

I recently reported very slow performance when doing a "git
filter-branch" involving only about 1000 tags, with hints of O(N^3)
scaling [1].  That could certainly explain enormous runtimes for 100k refs.

References are cached in git in a single linked list, so it is easy to
imagine O(N^2) all over the place (which is bad enough for 100k
references).  I am working on improving the situation by reorganizing
how the reference cache is stored in memory, but progress is slow.

I'm not sure whether your problem is related.  For example, it is not
obvious to me why the commit that you cite (88a21979) would make the
reference problem so dramatically worse.

I suggest the following experiments to characterize the problem:

1. Fetch the references in batches of a few hundred each, and see if
that dramatically decreases the total time.

2. Same as (1), except run "git pack-refs --all --prune" between the
batches.  In my experiments, packing references made a dramatic
difference in runtimes.

3. Try using the --no-replace-objects option (I assume that it can be
used like "git --no-replace-objects fetch ...").  In my case this option
made a dramatic improvement in the runtimes.

4. Try a test using a repository generated something like the test
script that I posted in [1].  If it also gives pathologically bad
performance, then it can serve as a test case to use while we debug the
problem.

Yours,
Michael

[1] http://comments.gmane.org/gmane.comp.version-control.git/177103

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-09 13:50                                 ` Michael Haggerty
@ 2011-09-09 15:51                                   ` Michael Haggerty
  2011-09-09 16:03                                   ` Jens Lehmann
  1 sibling, 0 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-09-09 15:51 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

I have answered some of my own questions:

On 09/09/2011 03:50 PM, Michael Haggerty wrote:
> 3. Try using the --no-replace-objects option (I assume that it can be
> used like "git --no-replace-objects fetch ...").  In my case this option
> made a dramatic improvement in the runtimes.

This does not seem to help much.

> 4. Try a test using a repository generated something like the test
> script that I posted in [1].  If it also gives pathologically bad
> performance, then it can serve as a test case to use while we debug the
> problem.

Yes, a simple test repo like that created by the script is enough to
reproduce the problem.  The slowdown becomes very obvious after only a
few hundred references.
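A minimal generator in the spirit of the script in [1] can be sketched as follows (this is not the original script; the 300-commit count is scaled down, and the refs/changes/* naming is illustrative). The refs are sharded so that lexical ref order does not match topological commit order:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp" && cd "$tmp"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@example.com
export GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@example.com

# One commit per ref; the shard prefix (i % 10) scrambles lexical
# ref order relative to topological commit order.
for i in $(seq 1 300); do
    git commit -q --allow-empty -m "commit $i"
    git update-ref "refs/changes/$((i % 10))/$i" HEAD
done
echo "created $(git for-each-ref refs/changes | wc -l) refs"
```

Fetching all refs from such a repository into a fresh one should reproduce the slowdown described above.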

Curiously, "git clone" is very fast under the same circumstances that
"git fetch" is excruciatingly slow.

According to strace, git seems to be repopulating the ref cache after
each new ref is created (it walks through the whole refs subdirectory
and reads every file).  Apparently the ref cache is being discarded
completely whenever a ref is added (which can and should be fixed) and
then being reloaded for some reason (though single refs can be inspected
much faster without reading the cache).  This situation should be
improved by the hierarchical refcache changes that I'm working on plus
smarter updating (rather than discarding) of the cache when a new
reference is created.

Some earlier speculation in this thread was that the slowdowns might be
caused by "pessimal" ordering of revisions in the walker queue.  But my
test repository shards the references in such a way that the lexical
order of the refnames does not correspond to the topological order of
the commits.  So that can't be the whole story.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: Git is not scalable with too many refs/*
  2011-09-09  0:52                                 ` Martin Fick
  2011-09-09  1:05                                   ` Thomas Rast
@ 2011-09-09 15:59                                   ` Jens Lehmann
  2011-09-25 20:43                                   ` Martin Fick
  2 siblings, 0 replies; 126+ messages in thread
From: Jens Lehmann @ 2011-09-09 15:59 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

Am 09.09.2011 02:52, schrieb Martin Fick:
> An update, I bisected it down to this commit:
> 
>   88a21979c5717e3f37b9691e90b6dbf2b94c751a
> 
>    fetch/pull: recurse into submodules when necessary
> 
> Since this can be disabled with the --no-recurse-submodules switch, I tried
> that and indeed, even with the latest 1.7.7rc it becomes fast (~8mins)
> again. The strange part about this is that the repository does not have any
> submodules. Anyway, I hope that this can be useful to others since it is a
workaround which speeds things up enormously. Let me know if you have any
> other tests that you want me to perform,

Thanks for nailing that one down. I'm currently looking into bringing back
decent performance here.


* Re: Git is not scalable with too many refs/*
  2011-09-09 13:50                                 ` Michael Haggerty
  2011-09-09 15:51                                   ` Michael Haggerty
@ 2011-09-09 16:03                                   ` Jens Lehmann
  1 sibling, 0 replies; 126+ messages in thread
From: Jens Lehmann @ 2011-09-09 16:03 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Martin Fick, git

Am 09.09.2011 15:50, schrieb Michael Haggerty:
> On 09/08/2011 09:53 PM, Martin Fick wrote:
>> Just thought that I should add some numbers to this thread as it seems that
>> the later versions of git are worse off by several orders of magnitude on
>> this one.  
>>
>> We have a Gerrit repo with just under 100K refs in refs/changes/*.  When I
>> fetch them all with git 1.7.6 it does not seem to complete.  Even after 5
>> days, it is just under half way through the ref #s! [...]
> 
> I recently reported very slow performance when doing a "git
> filter-branch" involving only about 1000 tags, with hints of O(N^3)
> scaling [1].  That could certainly explain enormous runtimes for 100k refs.
> 
> References are cached in git in a single linked list, so it is easy to
> imagine O(N^2) all over the place (which is bad enough for 100k
> references).  I am working on improving the situation by reorganizing
> how the reference cache is stored in memory, but progress is slow.
> 
> I'm not sure whether your problem is related.  For example, it is not
> obvious to me why the commit that you cite (88a21979) would make the
> reference problem so dramatically worse.

88a21979 is the reason, as since then a "git rev-list <sha1> --not --all" is
run for *every* updated ref to find out all new commits fetched for that ref.
And if you have 100K of them ...
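The per-ref cost described here can be seen in miniature (a scaled-down sketch; the 2000 flat refs/changes/N refs and the 20-iteration loop are illustrative, real Gerrit repos have ~100K refs and real history):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp" && cd "$tmp"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m base
head=$(git rev-parse HEAD)
seq 1 2000 | sed "s|.*|update refs/changes/& $head|" | git update-ref --stdin

# Approximately what post-88a21979 fetch does: for *each* updated ref,
# walk its history excluding everything reachable from --all.
for ref in $(git for-each-ref --format='%(objectname)' refs/changes | head -20); do
    git rev-list "$ref" --not --all >/dev/null
done
```

Each individual rev-list is cheap, but the negated `--all` side has to consider every ref, so repeating it once per updated ref makes the total work grow roughly quadratically in the ref count.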


* Re: Git is not scalable with too many refs/*
  2011-09-09  0:52                                 ` Martin Fick
  2011-09-09  1:05                                   ` Thomas Rast
  2011-09-09 15:59                                   ` Jens Lehmann
@ 2011-09-25 20:43                                   ` Martin Fick
  2011-09-26 12:41                                     ` Christian Couder
                                                       ` (2 more replies)
  2 siblings, 3 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-25 20:43 UTC (permalink / raw)
  To: git; +Cc: Christian Couder

A coworker of mine pointed out to me that a simple

  git checkout 

can also take rather long periods of time > 3 mins when run 
on a repo with ~100K refs.  

While this is not massive like the other problem I reported, 
it still seems like it is more than one would expect.  So, I 
tried an older version of git, and to my surprise/delight, 
it was much faster (.2s).  So, I bisected this issue also, 
and it seems that the "offending" commit is 
680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:


commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
Author: Christian Couder <chriscool@tuxfamily.org>
Date:   Fri Jan 23 10:06:53 2009 +0100

    replace_object: add mechanism to replace objects found in "refs/replace/"

    The code implementing this mechanism has been copied more-or-less
    from the commit graft code.

    This mechanism is used in "read_sha1_file". sha1 passed to this
    function that match a ref name in "refs/replace/" are replaced by
    the sha1 that has been read in the ref.

    We "die" if the replacement recursion depth is too high or if we
    can't read the replacement object.

    Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>



Now, I suspect this commit is desirable, but I was hoping 
that perhaps a look at it might inspire someone to find an 
obvious problem with it.  

Thanks,

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-25 20:43                                   ` Martin Fick
@ 2011-09-26 12:41                                     ` Christian Couder
  2011-09-26 17:47                                       ` Martin Fick
  2011-09-28 19:38                                       ` Martin Fick
  2011-09-26 15:15                                     ` Git is not scalable with too many refs/* Martin Fick
  2011-09-26 15:32                                     ` Michael Haggerty
  2 siblings, 2 replies; 126+ messages in thread
From: Christian Couder @ 2011-09-26 12:41 UTC (permalink / raw)
  To: Martin Fick; +Cc: git, Christian Couder

On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick <mfick@codeaurora.org> wrote:
> A coworker of mine pointed out to me that a simple
>
>  git checkout
>
> can also take rather long periods of time > 3 mins when run
> on a repo with ~100K refs.

Are all these refs packed?

> While this is not massive like the other problem I reported,
> it still seems like it is more than one would expect.  So, I
> tried an older version of git, and to my surprise/delight,
> it was much faster (.2s).  So, I bisected this issue also,
> and it seems that the "offending" commit is
> 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:
>
> commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
> Author: Christian Couder <chriscool@tuxfamily.org>
> Date:   Fri Jan 23 10:06:53 2009 +0100
>
>    replace_object: add mechanism to replace objects found
> in "refs/replace/"

[...]

> Now, I suspect this commit is desirable, but I was hoping
> that perhaps a look at it might inspire someone to find an
> obvious problem with it.

I don't think there is an obvious problem with it, but it would be
nice if you could dig a bit deeper.

The first thing that could take a lot of time is the call to
for_each_replace_ref() in this function:

+static void prepare_replace_object(void)
+{
+       static int replace_object_prepared;
+
+       if (replace_object_prepared)
+               return;
+
+       for_each_replace_ref(register_replace_ref, NULL);
+       replace_object_prepared = 1;
+}

Another thing is calling replace_object_pos() repeatedly in
lookup_replace_object().

Thanks,
Christian.


* Re: Git is not scalable with too many refs/*
  2011-09-25 20:43                                   ` Martin Fick
  2011-09-26 12:41                                     ` Christian Couder
@ 2011-09-26 15:15                                     ` Martin Fick
  2011-09-26 15:21                                       ` Sverre Rabbelier
  2011-09-26 18:07                                       ` Martin Fick
  2011-09-26 15:32                                     ` Michael Haggerty
  2 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-26 15:15 UTC (permalink / raw)
  To: git, Julian Phillips

OK, I have found what I believe is another performance 
regression for large ref counts (~100K).  

When I run git br on my repo which only has one branch, but 
has ~100K refs under ref/changes (a gerrit repo), it takes 
normally 3-6mins depending on whether my caches are fresh or 
not.  After bisecting some older changes, I noticed that 
this ref seems to be where things start to get slow: 
c774aab98ce6c5ef7aaacbef38da0a501eb671d4


commit c774aab98ce6c5ef7aaacbef38da0a501eb671d4
Author: Julian Phillips <julian@quantumfyre.co.uk>
Date:   Tue Apr 17 02:42:50 2007 +0100

    refs.c: add a function to sort a ref list, rather then sorting on add

    Rather than sorting the refs list while building it, sort in one
    go after it is built using a merge sort.  This has a large
    performance boost with large numbers of refs.

    It shouldn't happen that we read duplicate entries into the same
    list, but just in case sort_ref_list drops them if the SHA1s are
    the same, or dies, as we have no way of knowing which one is the
    correct one.

    Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Junio C Hamano <junkio@cox.net>



which is a bit strange since that commit's purpose was to 
actually speed things up in the case of many refs.  Just to 
verify, I reverted the commit on 1.7.7.rc0.73 and sure 
enough, things speed up, down to the 14-20s range depending 
on caching.

If this change does not actually speed things up, should it 
be reverted?  Or was there a bug in the change that makes it 
not do what it was supposed to do?

Thanks,

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:15                                     ` Git is not scalable with too many refs/* Martin Fick
@ 2011-09-26 15:21                                       ` Sverre Rabbelier
  2011-09-26 15:48                                         ` Martin Fick
  2011-09-26 18:07                                       ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-09-26 15:21 UTC (permalink / raw)
  To: Martin Fick; +Cc: git, Julian Phillips

Heya,

On Mon, Sep 26, 2011 at 17:15, Martin Fick <mfick@codeaurora.org> wrote:
> If this change does not actually speed things up, should it
> be reverted?  Or was there a bug in the change that makes it
> not do what it was supposed to do?

It probably looks at the refs in refs/changes while it shouldn't,
hence worsening your performance compared to not looking at those
refs. I assume that it does improve your situation if you have all
those refs under say refs/heads.

-- 
Cheers,

Sverre Rabbelier


* Re: Git is not scalable with too many refs/*
  2011-09-25 20:43                                   ` Martin Fick
  2011-09-26 12:41                                     ` Christian Couder
  2011-09-26 15:15                                     ` Git is not scalable with too many refs/* Martin Fick
@ 2011-09-26 15:32                                     ` Michael Haggerty
  2011-09-26 15:42                                       ` Martin Fick
  2 siblings, 1 reply; 126+ messages in thread
From: Michael Haggerty @ 2011-09-26 15:32 UTC (permalink / raw)
  To: Martin Fick; +Cc: git, Christian Couder

On 09/25/2011 10:43 PM, Martin Fick wrote:
> A coworker of mine pointed out to me that a simple
> 
>   git checkout 
> 
> can also take rather long periods of time > 3 mins when run 
> on a repo with ~100K refs.  
> 
> While this is not massive like the other problem I reported, 
> it still seems like it is more than one would expect.  So, I 
> tried an older version of git, and to my surprise/delight, 
> it was much faster (.2s).  So, I bisected this issue also, 
> and it seems that the "offending" commit is 
> 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:

I'm still working on changes to store references hierarchically in the
cache and read them lazily.  I hope that it will help some scaling
problems with large number of refs.

Unfortunately I keep getting tangled up in side issues, so it is taking
a lot longer than expected.  But there's still hope.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:32                                     ` Michael Haggerty
@ 2011-09-26 15:42                                       ` Martin Fick
  2011-09-26 16:25                                         ` Thomas Rast
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 15:42 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Christian Couder

On Monday, September 26, 2011 09:32:14 am Michael Haggerty 
wrote:
> On 09/25/2011 10:43 PM, Martin Fick wrote:
> > A coworker of mine pointed out to me that a simple
> > 
> >   git checkout
> > 
> > can also take rather long periods of time > 3 mins when
> > run on a repo with ~100K refs.
> > 
> > While this is not massive like the other problem I
> > reported, it still seems like it is more than one
> > would expect.  So, I tried an older version of git,
> > and to my surprise/delight, it was much faster (.2s). 
> > So, I bisected this issue also, and it seems that the
> > "offending" commit is
> 
> > 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07:
> I'm still working on changes to store references
> hierarchically in the cache and read them lazily.  I
> hope that it will help some scaling problems with large
> number of refs.
> 
> Unfortunately I keep getting tangled up in side issues,
> so it is taking a lot longer than expected.  But there's
> still hope.
> 
> Michael

Thanks Michael, I look forward to those changes.  In the 
meantime however, I will try to take advantage of the 
current inefficiencies of large ref counts to attempt to 
find places where there are obvious problems in the code 
paths.  I suspect that there are several commands in git 
which inadvertently scan all the refs when they probably 
shouldn't.  Since this is likely very slow now, it should be 
easy to find those; if it were faster, this might get 
overlooked.  I feel like git checkout is one of those cases; 
it does not seem like git checkout should be affected by the 
number of refs in a repo?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:21                                       ` Sverre Rabbelier
@ 2011-09-26 15:48                                         ` Martin Fick
  2011-09-26 15:56                                           ` Sverre Rabbelier
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 15:48 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: git, Julian Phillips

On Monday, September 26, 2011 09:21:30 am Sverre Rabbelier 
wrote:
> Heya,
> 
> On Mon, Sep 26, 2011 at 17:15, Martin Fick 
<mfick@codeaurora.org> wrote:
> > If this change does not actually speed things up,
> > should it be reverted?  Or was there a bug in the
> > change that makes it not do what it was supposed to
> > do?
> 
> It probably looks at the refs in refs/changes while it
> shouldn't, hence worsening your performance compared to
> not looking at those refs. I assume that it does improve
> your situation if you have all those refs under say
> refs/heads.

Hmm, I was thinking that too, and I just did a test. 

Instead of storing the changes under refs/changes, I fetched 
them under refs/heads/changes and then ran git 1.7.6 and it 
took about 3 mins.  Then, I ran the 1.7.7.rc0.73 with 
c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and it 
only took 13s!  So, if this indeed tests what you were 
suggesting, I think it shows that even in the intended case 
this change slowed things down?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:48                                         ` Martin Fick
@ 2011-09-26 15:56                                           ` Sverre Rabbelier
  2011-09-26 16:38                                             ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-09-26 15:56 UTC (permalink / raw)
  To: Martin Fick; +Cc: git, Julian Phillips

Heya,

On Mon, Sep 26, 2011 at 17:48, Martin Fick <mfick@codeaurora.org> wrote:
> Hmm, I was thinking that too, and I just did a test.
>
> Instead of storing the changes under refs/changes, I fetched
> them under refs/heads/changes and then ran git 1.7.6 and it
> took about 3 mins.  Then, I ran the 1.7.7.rc0.73 with
> c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and it
> only took 13s!  So, if this indeed tests what you were
> suggesting, I think it shows that even in the intended case
> this change slowed things down?

And if you run 1.7.7 without that commit reverted?

-- 
Cheers,

Sverre Rabbelier


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:42                                       ` Martin Fick
@ 2011-09-26 16:25                                         ` Thomas Rast
  0 siblings, 0 replies; 126+ messages in thread
From: Thomas Rast @ 2011-09-26 16:25 UTC (permalink / raw)
  To: Martin Fick; +Cc: Michael Haggerty, git, Christian Couder

Martin Fick wrote:
> 
> I suspect that there are several commands in git 
> which inadvertently scan all the refs when they probably 
> shouldn't. [...] I feel like git checkout is one of those cases, 
> it does not seem like git checkout should be affected by the 
> number of refs in a repo?

git-checkout checks whether you are leaving any unreferenced
(orphaned) commits behind when you leave a detached HEAD, which
requires that it scan the history of all refs for the commit you just
left.

So unless you disable that warning it'll be pretty expensive
regardless.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:56                                           ` Sverre Rabbelier
@ 2011-09-26 16:38                                             ` Martin Fick
  2011-09-26 16:49                                               ` Julian Phillips
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 16:38 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: git, Julian Phillips

On Monday, September 26, 2011 09:56:50 am Sverre Rabbelier 
wrote:
> Heya,
> 
> On Mon, Sep 26, 2011 at 17:48, Martin Fick 
<mfick@codeaurora.org> wrote:
> > Hmm, I was thinking that too, and I just did a test.
> > 
> > Instead of storing the changes under refs/changes, I
> > fetched them under refs/heads/changes and then ran git
> > 1.7.6 and it took about 3 mins.  Then, I ran the
> > 1.7.7.rc0.73 with
> > c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and
> > it only took 13s!  So, if this indeed tests what you
> > were suggesting, I think it shows that even in the
> > intended case this change slowed things down?
> 
> And if you run 1.7.7 without that commit reverted?

Sorry, I probably confused things by mentioning 1.7.6; the 
bad commit was way before that, back in the early 1.5 days...  

As for 1.7.7, I don't think that exists yet, so did you mean 
the 1.7.7.rc0.73 version that I mentioned above without the 
revert?  Strangely enough, that ends up being 
1.7.7.rc0.72.g4b5ea.  That is also slow with 
refs/heads/changes > 3mins.

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-26 16:38                                             ` Martin Fick
@ 2011-09-26 16:49                                               ` Julian Phillips
  0 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-26 16:49 UTC (permalink / raw)
  To: Martin Fick; +Cc: Sverre Rabbelier, git

On Mon, 26 Sep 2011 10:38:34 -0600, Martin Fick wrote:
> On Monday, September 26, 2011 09:56:50 am Sverre Rabbelier
> wrote:
>> Heya,
>>
>> On Mon, Sep 26, 2011 at 17:48, Martin Fick
> <mfick@codeaurora.org> wrote:
>> > Hmm, I was thinking that too, and I just did a test.
>> >
>> > Instead of storing the changes under refs/changes, I
>> > fetched them under refs/heads/changes and then ran git
>> > 1.7.6 and it took about 3 mins.  Then, I ran the
>> > 1.7.7.rc0.73 with
>> > c774aab98ce6c5ef7aaacbef38da0a501eb671d4 reverted and
>> > it only took 13s!  So, if this indeed tests what you
>> > were suggesting, I think it shows that even in the
>> > intended case this change slowed things down?
>>
>> And if you run 1.7.7 without that commit reverted?
>
> Sorry, I probably confused things by mentioning 1.7.6, the
> bad commit was way before that early 1.5 days...
>
> As for 1.7.7, I don't think that exists yet, so did you mean
> the 1.7.7.rc0.73 version that I mentioned above without the
> revert?  Strangely enough, that ends up being
> 1.7.7.rc0.72.g4b5ea.  That is also slow with
> refs/heads/changes > 3mins.

Hmm ... something interesting is going on.

I created a little test repo with ~100k unpacked refs.

I tried "time git branch" with three versions of git, and I got (hot 
cache times):

git version 1.7.6.1: ~1.2s
git version 1.7.7.rc3: ~1.2s
git version 1.7.7.rc3.1.gbc93f: ~40s

Where the third was with the commit reverted.  That was almost 40s of 
100% CPU - my poor laptop had to turn the fans up to noisy ...

> -Martin

-- 
Julian


* Re: Git is not scalable with too many refs/*
  2011-09-26 12:41                                     ` Christian Couder
@ 2011-09-26 17:47                                       ` Martin Fick
  2011-09-26 18:56                                         ` Christian Couder
  2011-09-28 19:38                                       ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 17:47 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Christian Couder

On Monday, September 26, 2011 06:41:04 am Christian Couder 
wrote:
> On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick 
<mfick@codeaurora.org> wrote:
> > A coworker of mine pointed out to me that a simple
> > 
> >  git checkout
> > 
> > can also take rather long periods of time > 3 mins when
> > run on a repo with ~100K refs.
> 
> Are all these refs packed?

I think so, is there a way to find out for sure?

-Martin


* Re: Git is not scalable with too many refs/*
  2011-09-26 15:15                                     ` Git is not scalable with too many refs/* Martin Fick
  2011-09-26 15:21                                       ` Sverre Rabbelier
@ 2011-09-26 18:07                                       ` Martin Fick
  2011-09-26 18:37                                         ` Julian Phillips
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 18:07 UTC (permalink / raw)
  To: git; +Cc: Julian Phillips

On Monday, September 26, 2011 09:15:29 am Martin Fick wrote:
> OK, I have found what I believe is another performance
> regression for large ref counts (~100K).
> 
> When I run git br on my repo which only has one branch,
> but has ~100K refs under ref/changes (a gerrit repo), it
> takes normally 3-6mins depending on whether my caches
> are fresh or not.  After bisecting some older changes, I
> noticed that this ref seems to be where things start to
> get slow: c774aab98ce6c5ef7aaacbef38da0a501eb671d4
> 
> 
> commit c774aab98ce6c5ef7aaacbef38da0a501eb671d4
> Author: Julian Phillips <julian@quantumfyre.co.uk>
> Date:   Tue Apr 17 02:42:50 2007 +0100
> 
>     refs.c: add a function to sort a ref list, rather then sorting on add
> 
>     Rather than sorting the refs list while building it, sort in one
>     go after it is built using a merge sort.  This has a large
>     performance boost with large numbers of refs.
> 
>     It shouldn't happen that we read duplicate entries into the same
>     list, but just in case sort_ref_list drops them if the SHA1s are
>     the same, or dies, as we have no way of knowing which one is the
>     correct one.
> 
>     Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
>     Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
>     Signed-off-by: Junio C Hamano <junkio@cox.net>
> 
> 
> 
> which is a bit strange since that commit's purpose was to
> actually speed things up in the case of many refs.  Just
> to verify, I reverted the commit on 1.7.7.rc0.73 and
> sure enough, things speed up down to the 14-20s range
> depending on caching.
> 
> If this change does not actually speed things up, should
> it be reverted?  Or was there a bug in the change that
> makes it not do what it was supposed to do?


Ahh, I think I have some more clues.  So while this change 
does not speed things up for me normally, I found a case 
where it does!  I  set my .git/config to have

  [core]
        compression = 0

and ran git-gc on my repo.  Now, with a modern git with this 
optimization in it (1.7.6, 1.7.7.rc0...), 'git branch' is 
almost instantaneous (.05s)!  But, if I revert c774aa it 
takes > ~15s.  

So, it appears that this optimization is foiled by 
compression?  In the case when this optimization helps, it 
saves about 15s; when it hurts (with compression), it seems 
to cost > 3mins.  I am not sure this optimization is worth 
it?  Would there be someway for it to adjust to the repo 
conditions?

 
Thanks,

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-26 18:07                                       ` Martin Fick
@ 2011-09-26 18:37                                         ` Julian Phillips
  2011-09-26 20:01                                           ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Julian Phillips @ 2011-09-26 18:37 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 12:07:52 -0600, Martin Fick wrote:
-- snip --
> Ahh, I think I have some more clues.  So while this change
> does not speed things up for me normally, I found a case
> where it does!  I  set my .git/config to have
>
>   [core]
>         compression = 0
>
> and ran git-gc on my repo.  Now, with a modern git with this
> optimization in it (1.7.6, 1.7.7.rc0...), 'git branch' is
> almost instantaneous (.05s)!  But, if I revert c774aa it
> takes > ~15s.

I don't understand this.  I don't see why core.compression should have 
anything to do with refs ...

> So, it appears that this optimization is foiled by
> compression?  In the case when this optimization helps, it
> save about 15s, when it hurts (with compression), it seems
> to cost > 3mins.  I am not sure this optimization is worth
> it?  Would there be someway for it to adjust to the repo
> conditions?

Well, in the case I tried it was 1.2s vs 40s.  It would seem that you 
have managed to find some corner case.  It doesn't seem right to punish 
everyone who has large numbers of refs by making their commands take 
orders of magnitude longer to save one person 3m.  Much better to find, 
understand and fix the actual cause.

I really can't see what effect core.compression can have on loading the 
ref_list.  Certainly the sort doesn't load anything from the object 
database.  It would be really good to profile and find out what is 
taking all the time - I am assuming that the CPU is at 100% for the 3+ 
minutes?

Random thought.  What happens to the with-compression case if you leave 
the commit in, but add a sleep(15) to the end of sort_refs_list?

-- 
Julian


* Re: Git is not scalable with too many refs/*
  2011-09-26 17:47                                       ` Martin Fick
@ 2011-09-26 18:56                                         ` Christian Couder
  2011-09-30 16:41                                           ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Christian Couder @ 2011-09-26 18:56 UTC (permalink / raw)
  To: Martin Fick; +Cc: Christian Couder, git

On Monday 26 September 2011 19:47:56 Martin Fick wrote:
> On Monday, September 26, 2011 06:41:04 am Christian Couder
> wrote:
> > 
> > Are all these refs packed?
> 
> I think so, is there a way to find out for sure?

After "git pack-refs --all" I get:

$ find .git/refs/ -type f
.git/refs/remotes/origin/HEAD
.git/refs/stash

So I suppose that if such a find gives you only a few files all (or most of) 
your refs are packed.
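The loose/packed split can also be counted directly on a throwaway repo (a sketch; the `grep` pattern counts object lines in `packed-refs` while skipping its header and any peeled lines):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp" && cd "$tmp"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m base
git update-ref refs/changes/1 HEAD
git update-ref refs/changes/2 HEAD

echo "loose before packing: $(find .git/refs -type f | wc -l)"
git pack-refs --all --prune
echo "loose after packing:  $(find .git/refs -type f | wc -l)"
echo "packed entries:       $(grep -c ' refs/' .git/packed-refs)"
```

After `git pack-refs --all --prune`, no loose ref files remain and all entries (the two refs/changes refs plus the default branch) live in `.git/packed-refs`.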

Best regards,
Christian.


* Re: Git is not scalable with too many refs/*
  2011-09-26 18:37                                         ` Julian Phillips
@ 2011-09-26 20:01                                           ` Martin Fick
  2011-09-26 20:07                                             ` Junio C Hamano
  2011-09-26 20:28                                             ` Julian Phillips
  0 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-26 20:01 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

On Monday, September 26, 2011 12:37:10 pm Julian Phillips 
wrote:
> On Mon, 26 Sep 2011 12:07:52 -0600, Martin Fick wrote:
> -- snip --
> 
> > Ahh, I think I have some more clues.  So while this
> > change does not speed things up for me normally, I
> > found a case where it does!  I  set my .git/config to
> > have
> > 
> >   [core]
> >   
> >         compression = 0
> > 
> > and ran git-gc on my repo.  Now, with a modern git with
> > this optimization in it (1.7.6, 1.7.7.rc0...), 'git
> > branch' is almost instantaneous (.05s)!  But, if I
> > revert c774aa it takes > ~15s.
> 
> I don't understand this.  I don't see why
> core.compression should have anything to do with refs
> ...
> 
> > So, it appears that this optimization is foiled by
> > compression?  In the case when this optimization helps,
> > it save about 15s, when it hurts (with compression),
> > it seems to cost > 3mins.  I am not sure this
> > optimization is worth it?  Would there be someway for
> > it to adjust to the repo conditions?
> 
> Well, in the case I tried it was 1.2s vs 40s.  It would
> seem that you have managed to find some corner case.  It
> doesn't seem right to punish everyone who has large
> numbers of refs by making their commands take orders of
> magnitude longer to save one person 3m.  Much better to
> find, understand and fix the actual cause.

I am not sure mine is the corner case; it is a real repo 
(albeit a Gerrit repo with strange refs/changes), while it 
sounds like yours is a test repo.  It seems likely that 
whatever you did to create the test repo makes it perform 
well?  I am also guessing that it is not the refs that are 
the problem but the objects, since the refs don't get 
compressed, do they?  Does your repo have real data in it 
(not just 100K refs)?  

My repo compressed is about ~2G and uncompressed is ~1.1G.
Yes, the compressed one is larger than the uncompressed one.
Since the compressed repo above was larger, I thought that I 
should at least gc it.  After git gc, it is ~1.1G, so it 
looks like the size difference was really because of not 
having gced it at first after fetching the 100K refs.

After a gc, the repo does perform similarly to the 
uncompressed one (which was achieved via gc).  After gc, it 
takes ~.05s to do a 'git branch' with 1.7.6 and 
git.1.7.7.rc0.72.g4b5ea.  It also takes a bit more than 15s 
with the patch reverted.  So it appears that compression is 
not likely the culprit, but rather the need to be gced.

So, maybe you are correct, maybe my repo is the corner case?  
Is a repo which needs to be gced considered a corner case?  
Should git be able to detect that the repo is in such 
desperate need of gcing?  Is it normal for git to need a gc 
right after a clone and then fetching ~100K refs?

I am not sure what is right here; if this patch makes a repo 
which needs gcing degrade by 5 to 10 times more than the 
benefit it provides, it still seems questionable to me.


 
> I really can't see what effect core.compression can have
> on loading the ref_list.  Certainly the sort doesn't
> load anything from the object database.  It would be
> really good to profile and find out what is taking all
> the time - I am assuming that the CPU is at 100% for the
> 3+ minutes?

Yes, 100% CPU (I mostly run the tests at least twice and 
have 8G of RAM, so I think the entire repo gets cached).


> Random thought.  What happens to the with compression
> case if you leave the commit in, but add a sleep(15) to
> the end of sort_refs_list?

Why, what are you thinking?  Hmm, I am trying this on the 
non-gced repo and it doesn't seem to be completing (no cpu 
usage)!  It appears that perhaps it is being called many 
times (the sleeping would explain no cpu usage)?!?  This
could be a real problem; this should only get called once, 
right?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 20:01                                           ` Martin Fick
@ 2011-09-26 20:07                                             ` Junio C Hamano
  2011-09-26 20:28                                             ` Julian Phillips
  1 sibling, 0 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-09-26 20:07 UTC (permalink / raw)
  To: Martin Fick; +Cc: Julian Phillips, git

Martin Fick <mfick@codeaurora.org> writes:

> After a gc, the repo does perform the similar to the 
> uncompressed one (which was achieved via gc).  After gc, it 
> takes ~.05s do to a 'git branch' with 1.7.6 and 
> git.1.7.7.rc0.72.g4b5ea.  It also takes a bit more than 15s 
> with the patch reverted.  So it appears that compression is 
> not likely the culprit, but rather the need to be gced.

Isn't packing refs part of "gc" these days?

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 20:01                                           ` Martin Fick
  2011-09-26 20:07                                             ` Junio C Hamano
@ 2011-09-26 20:28                                             ` Julian Phillips
  2011-09-26 21:39                                               ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Julian Phillips @ 2011-09-26 20:28 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote:
-- snip --
> So, maybe you are correct, maybe my repo is the corner case?
> Is a repo which needs to be gced considered a corner case?
> Should git be able to detect that the repo is so in
> desperate need of gcing?  Is it normal for git to need to gc
> right after a clone and then fetching ~100K refs?

Were your 100k refs packed before the gc?  If not, perhaps your refs are 
causing a lot of trouble for the merge sort?  They will be written out 
sorted to the packed-refs file, so the merge sort won't have to do any 
real work when loading them after that...
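
The sorted-output property can be sketched like this (an illustrative
check, not git code; packed-refs lines look like "<40-hex-sha1> <refname>",
with an optional "#" header line and "^" peeled lines):

```c
#include <assert.h>
#include <string.h>

/* Illustrative check, not git code: verify that the refnames in a
 * packed-refs buffer appear in sorted order.  "#" header lines and
 * "^" peeled lines are skipped; a ref line is "<40-hex-sha1> <refname>",
 * so the refname starts at offset 41. */
static int packed_refs_sorted(const char *buf)
{
	char prev[256] = "";
	const char *line = buf;

	while (line && *line) {
		const char *nl = strchr(line, '\n');
		size_t len = nl ? (size_t)(nl - line) : strlen(line);

		if (*line != '#' && *line != '^' && len > 41) {
			char name[256];
			size_t n = len - 41 < sizeof(name) - 1 ?
				   len - 41 : sizeof(name) - 1;

			memcpy(name, line + 41, n);
			name[n] = '\0';
			if (strcmp(prev, name) > 0)
				return 0;	/* out of order */
			strcpy(prev, name);
		}
		line = nl ? nl + 1 : NULL;
	}
	return 1;
}
```

Since pack-refs always writes the file in sorted order, loading packed
refs is a single sorted scan, which is why the merge sort has nothing
left to do in that case.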

> I am not sure what is right here, if this patch makes a repo
> which needs gcing degrade 5 to 10 times worse than the
> benefit of this patch, it still seems questionable to me.

Well - it does this _for your repo_, that doesn't automatically mean 
that it does generally, or frequently.  For instance, none of my normal 
repos that have a lot of refs are Gerrit ones, and I wouldn't be 
surprised if they benefitted from the merge sort (assuming that I am 
right that the merge sort is taking a long time on your gerrit refs).

Besides, you would be better off running gc, and thus getting the 
benefit too.

>> Random thought.  What happens to the with compression
>> case if you leave the commit in, but add a sleep(15) to
>> the end of sort_refs_list?
>
> Why, what are you thinking?  Hmm, I am trying this on the
> non gced repo and it doesn't seem to be completing (no cpu
> usage)!  It appears that perhaps it is being called many
> times (the sleeping would explain no cpu usage)?!?  This
> could be a real problem, this should only get called once
> right?

I was just wondering if the time taken to get the refs was changing the 
interaction with something else.  Not very likely, but ...

I added a print statement, and it was called four times when I had 
unpacked refs, and once with packed.  So, maybe you are hitting some 
nasty case with unpacked refs.  If you use a print statement instead of 
a sleep, how many times does sort_refs_lists get called in your unpacked 
case?  It may well also be worth calculating the time taken to do the 
sort.

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 20:28                                             ` Julian Phillips
@ 2011-09-26 21:39                                               ` Martin Fick
  2011-09-26 21:52                                                 ` Martin Fick
  2011-09-26 22:30                                                 ` Julian Phillips
  0 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-26 21:39 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

On Monday, September 26, 2011 02:28:53 pm Julian Phillips 
wrote:
> On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote:
> -- snip --
> 
> > So, maybe you are correct, maybe my repo is the corner
> > case? Is a repo which needs to be gced considered a
> > corner case? Should git be able to detect that the
> > repo is so in desperate need of gcing?  Is it normal
> > for git to need to gc right after a clone and then
> > fetching ~100K refs?
> 
> Were you 100k refs packed before the gc?  If not, perhaps
> your refs are causing a lot of trouble for the merge
> sort?  They will be written out sorted to the
> packed-refs file, so the merge sort won't have to do any
> real work when loading them after that...

I am not sure how to determine that (?), but I think they 
were packed.  Under .git/objects/pack there were 2 large 
files, both close to 500MB.  Those 2 files constituted most 
of the space in the repo (I was wrong about the repo sizes, 
that included the working dir, so think about half the 
quoted sizes for all of .git).  So does that mean it is 
mostly packed?  Aside from the pack and idx files, there was 
nothing else under the objects dir.  After gcing, it is down 
to just one ~500MB pack file.


> > I am not sure what is right here, if this patch makes a
> > repo which needs gcing degrade 5 to 10 times worse
> > than the benefit of this patch, it still seems
> > questionable to me.
> 
> Well - it does this _for your repo_, that doesn't
> automatically mean that it does generally, or
> frequently.  

Oh, def agreed! I just didn't want to discount it so quickly 
as being a corner case.


> For instance, none of my normal repos that
> have a lot of refs are Gerrit ones, and I wouldn't be
> surprised if they benefitted from the merge sort
> (assuming that I am right that the merge sort is taking
> a long time on your gerrit refs).
> 
> Besides, you would be better off running gc, and thus
> getting the benefit too.

Agreed, which is why I was asking whether git should have 
noticed my "degenerate" case and auto-gced.  But hopefully 
there is an actual bug here somewhere and we both will get 
to eat our cake. :)



> >> Random thought.  What happens to the with compression
> >> case if you leave the commit in, but add a sleep(15)
> >> to the end of sort_refs_list?
> > 
> > Why, what are you thinking?  Hmm, I am trying this on
> > the non gced repo and it doesn't seem to be completing
> > (no cpu usage)!  It appears that perhaps it is being
> > called many times (the sleeping would explain no cpu
> > usage)?!?  This could be a real problem, this should
> > only get called once right?
> 
> I was just wondering if the time taken to get the refs
> was changing the interaction with something else.  Not
> very likely, but ...
> 
> I added a print statement, and it was called four times
> when I had unpacked refs, and once with packed.  So,
> maybe you are hitting some nasty case with unpacked
> refs.  If you use a print statement instead of a sleep,
> how many times does sort_refs_lists get called in your
> unpacked case?  It may well also be worth calculating
> the time taken to do the sort.

In my case it was called 18785 times!  Any other tests I 
should run?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 21:39                                               ` Martin Fick
@ 2011-09-26 21:52                                                 ` Martin Fick
  2011-09-26 23:26                                                   ` Julian Phillips
  2011-09-26 22:30                                                 ` Julian Phillips
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-26 21:52 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote:
> On Monday, September 26, 2011 02:28:53 pm Julian Phillips
> wrote:
> > >> Random thought.  What happens to the with
> > >> compression case if you leave the commit in, but
> > >> add a sleep(15) to the end of sort_refs_list?
> > > 
> > > Why, what are you thinking?  Hmm, I am trying this on
> > > the non gced repo and it doesn't seem to be
> > > completing (no cpu usage)!  It appears that perhaps
> > > it is being called many times (the sleeping would
> > > explain no cpu usage)?!?  This could be a real
> > > problem, this should only get called once right?
> > 
> > I was just wondering if the time taken to get the refs
> > was changing the interaction with something else.  Not
> > very likely, but ...
> > 
> > I added a print statement, and it was called four times
> > when I had unpacked refs, and once with packed.  So,
> > maybe you are hitting some nasty case with unpacked
> > refs.  If you use a print statement instead of a sleep,
> > how many times does sort_refs_lists get called in your
> > unpacked case?  It may well also be worth calculating
> > the time taken to do the sort.
> 
> In my case it was called 18785 times!  Any other tests I
> should run?

Gerrit stores the changes in directories under refs/changes 
named after the last 2 digits of the change.  Then under 
each change it stores each patchset.  So it looks like this:   
refs/changes/dd/change_num/ps_num

I noticed that:

 ls refs/changes/* | wc -l 
 -> 18876

Somewhat close, but not super close, to 18785; I am not sure 
if that is a clue.  It's almost like each change is causing 
a re-sort.
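
The layout above can be sketched as follows (change_ref() is a
hypothetical helper for illustration, not Gerrit code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the sharding scheme described above: the shard directory
 * is the last two digits of the change number, so change 18876,
 * patchset 1 maps to refs/changes/76/18876/1.  change_ref() is a
 * hypothetical helper, not a real Gerrit function. */
static void change_ref(char *out, size_t n, int change, int patchset)
{
	snprintf(out, n, "refs/changes/%02d/%d/%d",
		 change % 100, change, patchset);
}
```

With ~19k changes this produces at most 100 "dd" shard directories,
each holding hundreds of per-change subdirectories: exactly the kind
of deep loose-ref tree that triggers one sort per directory.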

 
-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 21:39                                               ` Martin Fick
  2011-09-26 21:52                                                 ` Martin Fick
@ 2011-09-26 22:30                                                 ` Julian Phillips
  1 sibling, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-26 22:30 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 15:39:33 -0600, Martin Fick wrote:
> On Monday, September 26, 2011 02:28:53 pm Julian Phillips
> wrote:
>> On Mon, 26 Sep 2011 14:01:38 -0600, Martin Fick wrote:
>> -- snip --
>>
>> > So, maybe you are correct, maybe my repo is the corner
>> > case? Is a repo which needs to be gced considered a
>> > corner case? Should git be able to detect that the
>> > repo is so in desperate need of gcing?  Is it normal
>> > for git to need to gc right after a clone and then
>> > fetching ~100K refs?
>>
>> Were you 100k refs packed before the gc?  If not, perhaps
>> your refs are causing a lot of trouble for the merge
>> sort?  They will be written out sorted to the
>> packed-refs file, so the merge sort won't have to do any
>> real work when loading them after that...
>
> I am not sure how to determine that (?), but I think they
> were packed.  Under .git/objects/pack there were 2 large
> files, both close to 500MB.  Those 2 files constituted most
> of the space in the repo (I was wrong about the repo sizes,
> that included the working dir, so think about half the
> quoted sizes for all of .git).  So does that mean it is
> mostly packed?  Aside from the pack and idx files, there was
> nothing else under the objects dir.  After gcing, it is down
> to just one ~500MB pack file.

If refs are listed under .git/refs/... they are unpacked; if they are 
listed in .git/packed-refs they are packed.
They can be in both if they have been updated since the last pack.

>> > I am not sure what is right here, if this patch makes a
>> > repo which needs gcing degrade 5 to 10 times worse
>> > than the benefit of this patch, it still seems
>> > questionable to me.
>>
>> Well - it does this _for your repo_, that doesn't
>> automatically mean that it does generally, or
>> frequently.
>
> Oh, def agreed! I just didn't want to discount it so quickly
> as being a corner case.
>
>
>> For instance, none of my normal repos that
>> have a lot of refs are Gerrit ones, and I wouldn't be
>> surprised if they benefitted from the merge sort
>> (assuming that I am right that the merge sort is taking
>> a long time on your gerrit refs).
>>
>> Besides, you would be better off running gc, and thus
>> getting the benefit too.
>
> Agreed, which is why I was asking if git should have noticed
> my "degenerate" case and auto gced?  But hopefully, there is
> an actual bug here somewhere and we both will get to eat our
> cake. :)

I think automatic gc is currently only triggered by unpacked objects, 
not unpacked refs ... perhaps the auto-gc should cover refs too?
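
Such a check might look something like this (a hypothetical heuristic
for illustration, not git code; the threshold and the idea of counting
loose ref files are assumptions):

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical heuristic, not git code: walk a refs directory and
 * count loose ref files, so an auto-gc style check could suggest
 * running "git pack-refs --all" once the count passes a threshold. */
static long count_loose_refs(const char *path)
{
	long n = 0;
	struct dirent *de;
	DIR *dir = opendir(path);

	if (!dir)
		return 0;
	while ((de = readdir(dir)) != NULL) {
		char sub[4096];
		struct stat st;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
		if (stat(sub, &st))
			continue;
		if (S_ISDIR(st.st_mode))
			n += count_loose_refs(sub); /* e.g. refs/changes/dd */
		else
			n++;			    /* one loose ref file */
	}
	closedir(dir);
	return n;
}
```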

>> >> Random thought.  What happens to the with compression
>> >> case if you leave the commit in, but add a sleep(15)
>> >> to the end of sort_refs_list?
>> >
>> > Why, what are you thinking?  Hmm, I am trying this on
>> > the non gced repo and it doesn't seem to be completing
>> > (no cpu usage)!  It appears that perhaps it is being
>> > called many times (the sleeping would explain no cpu
>> > usage)?!?  This could be a real problem, this should
>> > only get called once right?
>>
>> I was just wondering if the time taken to get the refs
>> was changing the interaction with something else.  Not
>> very likely, but ...
>>
>> I added a print statement, and it was called four times
>> when I had unpacked refs, and once with packed.  So,
>> maybe you are hitting some nasty case with unpacked
>> refs.  If you use a print statement instead of a sleep,
>> how many times does sort_refs_lists get called in your
>> unpacked case?  It may well also be worth calculating
>> the time taken to do the sort.
>
> In my case it was called 18785 times!  Any other tests I
> should run?

That's a lot of sorts.  I really can't see why there would need to be 
more than one ...

I've created a new test repo, using a more complicated method to 
construct the 100k refs, and it took ~40m to run "git branch" instead of 
the 1.2s for the previous repo.  So, I think the ref naming pattern used 
by Gerrit is definitely triggering something odd.  However, progress is 
a bit slow - now that it takes over 1/2 an hour to try things out ...

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 21:52                                                 ` Martin Fick
@ 2011-09-26 23:26                                                   ` Julian Phillips
  2011-09-26 23:37                                                     ` David Michael Barr
                                                                       ` (3 more replies)
  0 siblings, 4 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-26 23:26 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote:
> On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote:
>> On Monday, September 26, 2011 02:28:53 pm Julian Phillips
>> wrote:
>> > >> Random thought.  What happens to the with
>> > >> compression case if you leave the commit in, but
>> > >> add a sleep(15) to the end of sort_refs_list?
>> > >
>> > > Why, what are you thinking?  Hmm, I am trying this on
>> > > the non gced repo and it doesn't seem to be
>> > > completing (no cpu usage)!  It appears that perhaps
>> > > it is being called many times (the sleeping would
>> > > explain no cpu usage)?!?  This could be a real
>> > > problem, this should only get called once right?
>> >
>> > I was just wondering if the time taken to get the refs
>> > was changing the interaction with something else.  Not
>> > very likely, but ...
>> >
>> > I added a print statement, and it was called four times
>> > when I had unpacked refs, and once with packed.  So,
>> > maybe you are hitting some nasty case with unpacked
>> > refs.  If you use a print statement instead of a sleep,
>> > how many times does sort_refs_lists get called in your
>> > unpacked case?  It may well also be worth calculating
>> > the time taken to do the sort.
>>
>> In my case it was called 18785 times!  Any other tests I
>> should run?
>
> Gerrit stores the changes in directories under refs/changes
> named after the last 2 digits of the change.  Then under
> each change it stores each patchset.  So it looks like this:
> refs/changes/dd/change_num/ps_num
>
> I noticed that:
>
>  ls refs/changes/* | wc -l
>  -> 18876
>
> somewhat close, but not super close to 18785,  I am not sure
> if that is a clue.  It's almost like each change is causing
> a re-sort,

basically, it is ...

Back when I made that change, I failed to notice that get_ref_dir was 
recursive for subdirectories ... sorry ...

Hopefully this should speed things up.  My test repo went from ~17m 
user time to ~2.5s.
Packing still makes things much faster, of course.

diff --git a/refs.c b/refs.c
index a615043..212e7ec 100644
--- a/refs.c
+++ b/refs.c
@@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, c
 		free(ref);
 		closedir(dir);
 	}
-	return sort_ref_list(list);
+	return list;
 }
 
 struct warn_if_dangling_data {
@@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodu
 	if (submodule) {
 		free_ref_list(submodule_refs.loose);
 		submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
+		submodule_refs.loose = sort_refs_list(submodule_refs.loose);
 		return submodule_refs.loose;
 	}
 
 	if (!cached_refs.did_loose) {
 		cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
+		cached_refs.loose = sort_refs_list(cached_refs.loose);
 		cached_refs.did_loose = 1;
 	}
 	return cached_refs.loose;
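
To see why sorting in every recursive call is so costly, compare how
many list elements the sort touches under each scheme (back-of-the-
envelope arithmetic only; the per-call ref count is an assumption):

```c
#include <assert.h>

/* Illustrative arithmetic, not git code: with d subdirectories of k
 * refs each, re-sorting the accumulated list after every directory
 * processes k + 2k + ... + dk = k * d * (d + 1) / 2 elements in total,
 * while a single sort at the top level processes just d * k. */
static long long elements_sorted(long long d, long long k, int per_directory)
{
	if (per_directory)
		return k * (d * (d + 1) / 2); /* re-sort the growing list d times */
	return d * k;			      /* one sort of the final list */
}
```

Plugging in the 18785 sort calls observed above with an assumed ~10
refs per directory gives roughly 1.76 billion elements handed to the
sort versus roughly 188 thousand, which fits the minutes-to-seconds
difference this patch produces.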



>
>
> -Martin

-- 
Julian

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 23:26                                                   ` Julian Phillips
@ 2011-09-26 23:37                                                     ` David Michael Barr
  2011-09-27  1:01                                                       ` [PATCH] refs.c: Fix slowness with numerous loose refs David Barr
  2011-09-26 23:38                                                     ` Git is not scalable with too many refs/* Junio C Hamano
                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 126+ messages in thread
From: David Michael Barr @ 2011-09-26 23:37 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Martin Fick, git

On Tue, Sep 27, 2011 at 9:26 AM, Julian Phillips
<julian@quantumfyre.co.uk> wrote:
>
> On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote:
>>
>> On Monday, September 26, 2011 03:39:33 pm Martin Fick wrote:
>>>
>>> On Monday, September 26, 2011 02:28:53 pm Julian Phillips
>>> wrote:
>>> > >> Random thought.  What happens to the with
>>> > >> compression case if you leave the commit in, but
>>> > >> add a sleep(15) to the end of sort_refs_list?
>>> > >
>>> > > Why, what are you thinking?  Hmm, I am trying this on
>>> > > the non gced repo and it doesn't seem to be
>>> > > completing (no cpu usage)!  It appears that perhaps
>>> > > it is being called many times (the sleeping would
>>> > > explain no cpu usage)?!?  This could be a real
>>> > > problem, this should only get called once right?
>>> >
>>> > I was just wondering if the time taken to get the refs
>>> > was changing the interaction with something else.  Not
>>> > very likely, but ...
>>> >
>>> > I added a print statement, and it was called four times
>>> > when I had unpacked refs, and once with packed.  So,
>>> > maybe you are hitting some nasty case with unpacked
>>> > refs.  If you use a print statement instead of a sleep,
>>> > how many times does sort_refs_lists get called in your
>>> > unpacked case?  It may well also be worth calculating
>>> > the time taken to do the sort.
>>>
>>> In my case it was called 18785 times!  Any other tests I
>>> should run?
>>
>> Gerrit stores the changes in directories under refs/changes
>> named after the last 2 digits of the change.  Then under
>> each change it stores each patchset.  So it looks like this:
>> refs/changes/dd/change_num/ps_num
>>
>> I noticed that:
>>
>>  ls refs/changes/* | wc -l
>>  -> 18876
>>
>> somewhat close, but not super close to 18785,  I am not sure
>> if that is a clue.  It's almost like each change is causing
>> a re-sort,
>
> basically, it is ...
>
> Back when I made that change, I failed to notice that get_ref_dir was recursive for subdirectories ... sorry ...
>
> Hopefully this should speed things up.  My test repo went from ~17m user time, to ~2.5s.
> Packing still make things much faster of course.
>
> diff --git a/refs.c b/refs.c
> index a615043..212e7ec 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, c
>                free(ref);
>                closedir(dir);
>        }
> -       return sort_ref_list(list);
> +       return list;
>  }
>
>  struct warn_if_dangling_data {
> @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodu
>        if (submodule) {
>                free_ref_list(submodule_refs.loose);
>                submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
> +               submodule_refs.loose = sort_refs_list(submodule_refs.loose);
>                return submodule_refs.loose;
>        }
>
>        if (!cached_refs.did_loose) {
>                cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
> +               cached_refs.loose = sort_refs_list(cached_refs.loose);
>                cached_refs.did_loose = 1;
>        }
>        return cached_refs.loose;
>
>
>
>>
>>
>> -Martin
>
> --
> Julian

Well done! I'll try to compose a patch attributed to Julian with the
information from this thread.

--
David Barr

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 23:26                                                   ` Julian Phillips
  2011-09-26 23:37                                                     ` David Michael Barr
@ 2011-09-26 23:38                                                     ` Junio C Hamano
  2011-09-27  0:00                                                       ` [PATCH] Don't sort ref_list too early Julian Phillips
  2011-09-27  0:12                                                     ` Git is not scalable with too many refs/* Martin Fick
  2011-09-27  8:20                                                     ` Sverre Rabbelier
  3 siblings, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-09-26 23:38 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Martin Fick, git

Julian Phillips <julian@quantumfyre.co.uk> writes:

> Back when I made that change, I failed to notice that get_ref_dir was
> recursive for subdirectories ... sorry ...

Aha, I also was blind while I was watching this discussion from the
sideline, and I thought I re-read the codepath involved X-<. Indeed
we were sorting the list way too early and the patch looks correct.

Thanks.

> Hopefully this should speed things up.  My test repo went from ~17m
> user time, to ~2.5s.
> Packing still make things much faster of course.
>
> diff --git a/refs.c b/refs.c
> index a615043..212e7ec 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, c
>                 free(ref);
>                 closedir(dir);
>         }
> -       return sort_ref_list(list);
> +       return list;
>  }
>
>  struct warn_if_dangling_data {
> @@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodu
>         if (submodule) {
>                 free_ref_list(submodule_refs.loose);
>                 submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
> +               submodule_refs.loose = sort_refs_list(submodule_refs.loose);
>                 return submodule_refs.loose;
>         }
>
>         if (!cached_refs.did_loose) {
>                 cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
> +               cached_refs.loose = sort_refs_list(cached_refs.loose);
>                 cached_refs.did_loose = 1;
>         }
>         return cached_refs.loose;
>
>
>
>>
>>
>> -Martin

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH] Don't sort ref_list too early
  2011-09-26 23:38                                                     ` Git is not scalable with too many refs/* Junio C Hamano
@ 2011-09-27  0:00                                                       ` Julian Phillips
  2011-10-02  4:58                                                         ` Michael Haggerty
  0 siblings, 1 reply; 126+ messages in thread
From: Julian Phillips @ 2011-09-27  0:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

get_ref_dir is called recursively for subdirectories, which means that
we were calling sort_ref_list for each directory of refs instead of
once for all the refs.  This is a massive waste of processing, so now
just call sort_ref_list on the result of the top-level get_ref_dir, so
that the sort is only done once.

In the common case of only a few different directories of refs the
difference isn't very noticeable, but it becomes very noticeable when
you have a large number of directories containing refs (e.g. as
created by Gerrit).

Reported by Martin Fick.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

This time the typos are fixed too ... perhaps I wrote the original commit at 1am
too ... :$

 refs.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/refs.c b/refs.c
index a615043..a49ff74 100644
--- a/refs.c
+++ b/refs.c
@@ -319,7 +319,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 		free(ref);
 		closedir(dir);
 	}
-	return sort_ref_list(list);
+	return list;
 }
 
 struct warn_if_dangling_data {
@@ -361,11 +361,13 @@ static struct ref_list *get_loose_refs(const char *submodule)
 	if (submodule) {
 		free_ref_list(submodule_refs.loose);
 		submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
+		submodule_refs.loose = sort_ref_list(submodule_refs.loose);
 		return submodule_refs.loose;
 	}
 
 	if (!cached_refs.did_loose) {
 		cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
+		cached_refs.loose = sort_ref_list(cached_refs.loose);
 		cached_refs.did_loose = 1;
 	}
 	return cached_refs.loose;
-- 
1.7.6.1

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 23:26                                                   ` Julian Phillips
  2011-09-26 23:37                                                     ` David Michael Barr
  2011-09-26 23:38                                                     ` Git is not scalable with too many refs/* Junio C Hamano
@ 2011-09-27  0:12                                                     ` Martin Fick
  2011-09-27  0:22                                                       ` Julian Phillips
  2011-09-27  8:20                                                     ` Sverre Rabbelier
  3 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-27  0:12 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git

On Monday, September 26, 2011 05:26:55 pm Julian Phillips 
wrote:
> On Mon, 26 Sep 2011 15:52:04 -0600, Martin Fick wrote:
> > On Monday, September 26, 2011 03:39:33 pm Martin Fick 
wrote:
> >> On Monday, September 26, 2011 02:28:53 pm Julian
> >> Phillips
> >> In my case it was called 18785 times!  Any other tests
> >> I should run?
> > 
> > Gerrit stores the changes in directories under
> > refs/changes named after the last 2 digits of the
> > change.  Then under each change it stores each
> > patchset.  So it looks like this:
> > refs/changes/dd/change_num/ps_num
> > 
> > I noticed that:
> >  ls refs/changes/* | wc -l
> >  -> 18876
> > 
> > somewhat close, but not super close to 18785,  I am not
> > sure if that is a clue.  It's almost like each change
> > is causing a re-sort,
> 
> basically, it is ...
> 
> Back when I made that change, I failed to notice that
> get_ref_dir was recursive for subdirectories ... sorry
> ...
> 
> Hopefully this should speed things up.  My test repo went
> from ~17m user time, to ~2.5s.
> Packing still makes things much faster of course.

Excellent!  This works (almost, in my refs.c it is called 
sort_ref_list, not sort_refs_list).  So, on the non garbage 
collected repo, git branch now takes ~.5s, and in the 
garbage collected one it takes only ~.05s!

Thanks way much!!!

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  0:12                                                     ` Git is not scalable with too many refs/* Martin Fick
@ 2011-09-27  0:22                                                       ` Julian Phillips
  2011-09-27  2:34                                                         ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Julian Phillips @ 2011-09-27  0:22 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote:
> On Monday, September 26, 2011 05:26:55 pm Julian Phillips
> wrote:
-- snip --
>> Back when I made that change, I failed to notice that
>> get_ref_dir was recursive for subdirectories ... sorry
>> ...
>>
>> Hopefully this should speed things up.  My test repo went
>> from ~17m user time, to ~2.5s.
>> Packing still makes things much faster of course.
>
> Excellent!  This works (almost, in my refs.c it is called
> sort_ref_list, not sort_refs_list).

Yeah, in mine too ;)  It's late and I got the compile/send mail 
sequence backwards. :$
It's fixed in the proper patch email.

>  So, on the non garbage
> collected repo, git branch now takes ~.5s, and in the
> garbage collected one it takes only ~.05s!

That sounds a lot better.  Hopefully other commands should be faster 
now too.

> Thanks way much!!!

No problem.  Thank you for all the time you've put in to help chase 
this down.  Makes it so much easier when the person with the original 
problem mucks in with the investigation.
Just think how much time you've saved for anyone with a large number of 
those Gerrit change refs ;)

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH] refs.c: Fix slowness with numerous loose refs
  2011-09-26 23:37                                                     ` David Michael Barr
@ 2011-09-27  1:01                                                       ` David Barr
  2011-09-27  2:04                                                         ` David Michael Barr
  0 siblings, 1 reply; 126+ messages in thread
From: David Barr @ 2011-09-27  1:01 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Julian Phillips, Martin Fick, Junio C Hamano, David Barr

Martin Fick reported:
 OK, I have found what I believe is another performance
 regression for large ref counts (~100K).

 When I run git br on my repo which only has one branch, but
 has ~100K refs under ref/changes (a gerrit repo), it takes
 normally 3-6mins depending on whether my caches are fresh or
 not.  After bisecting some older changes, I noticed that
 this ref seems to be where things start to get slow:
 v1.5.2-rc0~21^2 (refs.c: add a function to sort a ref list,
 rather then sorting on add) (Julian Phillips, Apr 17, 2007)

Martin Fick observed that sort_refs_lists() was called almost
as many times as there were loose refs.

Julian Phillips commented:
 Back when I made that change, I failed to notice that get_ref_dir
 was recursive for subdirectories ... sorry ...

 Hopefully this should speed things up. My test repo went from
 ~17m user time, to ~2.5s.
 Packing still makes things much faster of course.

Martin Fick acked:
 Excellent!  This works (almost, in my refs.c it is called
 sort_ref_list, not sort_refs_list).  So, on the non garbage
 collected repo, git branch now takes ~.5s, and in the
 garbage collected one it takes only ~.05s!

[db: summarised transcript, rewrote patch to fix callee not callers]

[attn jch: patch applies to maint]

Analyzed-by: Martin Fick <mfick@codeaurora.org>
Inspired-by: Julian Phillips <julian@quantumfyre.co.uk>
Acked-by: Martin Fick <mfick@codeaurora.org>
Signed-off-by: David Barr <davidbarr@google.com>
---
 refs.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/refs.c b/refs.c
index 4c1fd47..e40a09c 100644
--- a/refs.c
+++ b/refs.c
@@ -255,8 +255,8 @@ static struct ref_list *get_packed_refs(const char *submodule)
 	return refs->packed;
 }
 
-static struct ref_list *get_ref_dir(const char *submodule, const char *base,
-				    struct ref_list *list)
+static struct ref_list *walk_ref_dir(const char *submodule, const char *base,
+				     struct ref_list *list)
 {
 	DIR *dir;
 	const char *path;
@@ -299,7 +299,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 			if (stat(refdir, &st) < 0)
 				continue;
 			if (S_ISDIR(st.st_mode)) {
-				list = get_ref_dir(submodule, ref, list);
+				list = walk_ref_dir(submodule, ref, list);
 				continue;
 			}
 			if (submodule) {
@@ -319,7 +319,13 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 		free(ref);
 		closedir(dir);
 	}
-	return sort_ref_list(list);
+	return list;
+}
+
+static struct ref_list *get_ref_dir(const char *submodule, const char *base,
+				    struct ref_list *list)
+{
+	return sort_ref_list(walk_ref_dir(submodule, base, list));
 }
 
 struct warn_if_dangling_data {
-- 
1.7.5.75.g69330
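
The effect of moving the sort out of the recursion can be sketched with a toy cost model (illustrative Python, assuming an n*log2(n) comparison sort like git's merge-sort-based sort_ref_list; this is not git code):

```python
import math

def comparisons(n):
    # Rough comparison count for an n*log2(n) merge sort of n items.
    return n * math.log2(n) if n > 1 else 0

def sort_per_directory(num_dirs, refs_per_dir):
    # Old behaviour: get_ref_dir() re-sorted the accumulated list on
    # every recursive directory visit, so the cost grows with each of
    # the num_dirs visits.
    total, size = 0, 0
    for _ in range(num_dirs):
        size += refs_per_dir
        total += comparisons(size)
    return total

def sort_once(num_dirs, refs_per_dir):
    # New behaviour: walk_ref_dir() only accumulates entries;
    # get_ref_dir() sorts the complete list a single time.
    return comparisons(num_dirs * refs_per_dir)

# Example: ~10k refs spread over 100 directories.
old = sort_per_directory(100, 100)
new = sort_once(100, 100)
print(f"old/new comparison ratio: {old / new:.0f}")
```

With real I/O and memory effects on top of this, the measured speedup (minutes down to seconds) is plausibly even larger than the model suggests.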

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH] refs.c: Fix slowness with numerous loose refs
  2011-09-27  1:01                                                       ` [PATCH] refs.c: Fix slowness with numerous loose refs David Barr
@ 2011-09-27  2:04                                                         ` David Michael Barr
  0 siblings, 0 replies; 126+ messages in thread
From: David Michael Barr @ 2011-09-27  2:04 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Julian Phillips, Martin Fick, Junio C Hamano, David Barr,
	Shawn O. Pearce

+cc Shawn O. Pearce

I used the following to generate a test repo shaped like
a gerrit mirror with unpacked refs (10k, because life is too short for
100k tests):

git init test.git
cd test.git
touch empty
git add empty
git commit -m 'empty'
REV=`git rev-parse HEAD`
for ((d=0;d<100;++d)); do
 for ((n=0;n<100;++n)); do
  let r=n*100+d
  mkdir -p .git/refs/changes/$d/$r
  echo $REV > .git/refs/changes/$d/$r/1
 done
done
time git branch xyz

With warm caches...

Git 1.7.6.4:
real	0m8.232s
user	0m7.842s
sys	0m0.385s

Git 1.7.6.4, with patch below:
real	0m0.394s
user	0m0.069s
sys	0m0.324s

On Tue, Sep 27, 2011 at 11:01 AM, David Barr <davidbarr@google.com> wrote:
> Martin Fick reported:
>  OK, I have found what I believe is another performance
>  regression for large ref counts (~100K).
>
>  When I run git br on my repo which only has one branch, but
>  has ~100K refs under ref/changes (a gerrit repo), it takes
>  normally 3-6mins depending on whether my caches are fresh or
>  not.  After bisecting some older changes, I noticed that
>  this ref seems to be where things start to get slow:
>  v1.5.2-rc0~21^2 (refs.c: add a function to sort a ref list,
>  rather then sorting on add) (Julian Phillips, Apr 17, 2007)
>
> Martin Fick observed that sort_refs_lists() was called almost
> as many times as there were loose refs.
>
> Julian Phillips commented:
>  Back when I made that change, I failed to notice that get_ref_dir
>  was recursive for subdirectories ... sorry ...
>
>  Hopefully this should speed things up. My test repo went from
>  ~17m user time, to ~2.5s.
>  Packing still makes things much faster of course.
>
> Martin Fick acked:
>  Excellent!  This works (almost, in my refs.c it is called
>  sort_ref_list, not sort_refs_list).  So, on the non garbage
>  collected repo, git branch now takes ~.5s, and in the
>  garbage collected one it takes only ~.05s!
>
> [db: summarised transcript, rewrote patch to fix callee not callers]
>
> [attn jch: patch applies to maint]
>
> Analyzed-by: Martin Fick <mfick@codeaurora.org>
> Inspired-by: Julian Phillips <julian@quantumfyre.co.uk>
> Acked-by: Martin Fick <mfick@codeaurora.org>
> Signed-off-by: David Barr <davidbarr@google.com>
> ---
>  refs.c |   14 ++++++++++----
>  1 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/refs.c b/refs.c
> index 4c1fd47..e40a09c 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -255,8 +255,8 @@ static struct ref_list *get_packed_refs(const char *submodule)
>        return refs->packed;
>  }
>
> -static struct ref_list *get_ref_dir(const char *submodule, const char *base,
> -                                   struct ref_list *list)
> +static struct ref_list *walk_ref_dir(const char *submodule, const char *base,
> +                                    struct ref_list *list)
>  {
>        DIR *dir;
>        const char *path;
> @@ -299,7 +299,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
>                        if (stat(refdir, &st) < 0)
>                                continue;
>                        if (S_ISDIR(st.st_mode)) {
> -                               list = get_ref_dir(submodule, ref, list);
> +                               list = walk_ref_dir(submodule, ref, list);
>                                continue;
>                        }
>                        if (submodule) {
> @@ -319,7 +319,13 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
>                free(ref);
>                closedir(dir);
>        }
> -       return sort_ref_list(list);
> +       return list;
> +}
> +
> +static struct ref_list *get_ref_dir(const char *submodule, const char *base,
> +                                   struct ref_list *list)
> +{
> +       return sort_ref_list(walk_ref_dir(submodule, base, list));
>  }
>
>  struct warn_if_dangling_data {
> --
> 1.7.5.75.g69330
>
>



-- 

David Barr | Software Engineer | davidbarr@google.com | 614-3438-8348

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  0:22                                                       ` Julian Phillips
@ 2011-09-27  2:34                                                         ` Martin Fick
  2011-09-27  7:59                                                           ` Julian Phillips
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-27  2:34 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git



> Julian Phillips <julian@quantumfyre.co.uk> wrote:
>On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote:
>That sounds a lot better.  Hopefully other commands should be faster 
>now too.

Yeah, I will try this in a few other places to see.

>> Thanks way much!!!
>
>No problem.  Thank you for all the time you've put in to help chase 
>this down.  Makes it so much easier when the person with original 
>problem mucks in with the investigation.
>Just think how much time you've saved for anyone with a large number of
>
>those Gerrit change refs ;)

 Perhaps this is a naive question, but why are all these refs being
put into a list to be sorted, only to be discarded soon thereafter
anyway?  After all, git branch knows that it isn't going to print
these, and the refs are stored precategorized, so why not only grab
the refs which matter upfront?

-Martin 

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  2:34                                                         ` Martin Fick
@ 2011-09-27  7:59                                                           ` Julian Phillips
  0 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-27  7:59 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, 26 Sep 2011 20:34:02 -0600, Martin Fick wrote:
>> Julian Phillips <julian@quantumfyre.co.uk> wrote:
>>On Mon, 26 Sep 2011 18:12:31 -0600, Martin Fick wrote:
>>That sounds a lot better.  Hopefully other commands should be faster
>>now too.
>
> Yeah, I will try this in a few other places to see.
>
>>> Thanks way much!!!
>>
>>No problem.  Thank you for all the time you've put in to help chase
>>this down.  Makes it so much easier when the person with original
>>problem mucks in with the investigation.
>>Just think how much time you've saved for anyone with a large number 
>> of
>>
>>those Gerrit change refs ;)
>
>  Perhaps this is a naive question, but why are all these refs being
> put into a list to be sorted, only to be discarded soon thereafter
> anyway?  After all, git branch knows that it isn't going to print
> these, and the refs are stored precategorized, so why not only grab
> the refs which matter upfront?

I can't say that I am aware of a specific decision having been taken on 
the subject, but I'll have a guess at the reason:

The extra code it would take to have an API for getting a list of only 
a subset of the refs has never been considered worth the cost.  It would 
take effort to implement, test and maintain - and it would have to be 
done separately for packed and unpacked cases to avoid still loading and 
discarding unwanted refs.  All that to not do something that no-one has 
noticed taking any time?  Until now, I doubt anyone has considered it 
something that was a problem - and now that even with 100k refs it takes 
less than a second, I doubt anyone will feel all that inclined to have a 
crack at it now either.
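
The kind of subset API discussed above might look roughly like the following sketch (illustrative Python; the function name is hypothetical, modelled on git's for_each_ref_* naming convention, and not the real C API):

```python
def for_each_ref_in(prefix, refs, fn):
    # Hypothetical subset-iteration API: only refs under `prefix` are
    # collected and sorted; everything else (e.g. 100k refs/changes/*
    # refs) is skipped without ever being materialised in a list.
    for ref in sorted(r for r in refs if r.startswith(prefix)):
        fn(ref)

refs = ["refs/heads/master", "refs/changes/01/101/1",
        "refs/changes/01/201/1", "refs/tags/v1.0"]
branches = []
for_each_ref_in("refs/heads/", refs, branches.append)
print(branches)  # ['refs/heads/master']
```

As the sketch shows, the filtering itself is trivial; the real cost Julian describes is doing it separately for the packed and loose backends so that unwanted refs are never loaded at all.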

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 23:26                                                   ` Julian Phillips
                                                                       ` (2 preceding siblings ...)
  2011-09-27  0:12                                                     ` Git is not scalable with too many refs/* Martin Fick
@ 2011-09-27  8:20                                                     ` Sverre Rabbelier
  2011-09-27  9:01                                                       ` Julian Phillips
  3 siblings, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-09-27  8:20 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr

Heya,

On Tue, Sep 27, 2011 at 01:26, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> Back when I made that change, I failed to notice that get_ref_dir was
> recursive for subdirectories ... sorry ...
>
> Hopefully this should speed things up.  My test repo went from ~17m user
> time, to ~2.5s.
> Packing still makes things much faster of course.

Can we perhaps also have some tests to prevent this from happening again?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  8:20                                                     ` Sverre Rabbelier
@ 2011-09-27  9:01                                                       ` Julian Phillips
  2011-09-27 10:01                                                         ` Sverre Rabbelier
  2011-09-27 11:07                                                         ` Michael Haggerty
  0 siblings, 2 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-27  9:01 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr

On Tue, 27 Sep 2011 10:20:29 +0200, Sverre Rabbelier wrote:
> Heya,
>
> On Tue, Sep 27, 2011 at 01:26, Julian Phillips
> <julian@quantumfyre.co.uk> wrote:
>> Back when I made that change, I failed to notice that get_ref_dir 
>> was
>> recursive for subdirectories ... sorry ...
>>
>> Hopefully this should speed things up.  My test repo went from ~17m 
>> user
>> time, to ~2.5s.
>> Packing still makes things much faster of course.
>
> Can we perhaps also have some tests to prevent this from happening 
> again?

Um ... any suggestion what to test?

It has to be hot-cache, otherwise time taken to read the refs from disk 
will mean that it is always slow.  On my Mac it seems to _always_ be 
slow reading the refs from disk, so even the "fast" case still takes 
~17m.

Also, what counts as ok, and what as broken?

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  9:01                                                       ` Julian Phillips
@ 2011-09-27 10:01                                                         ` Sverre Rabbelier
  2011-09-27 10:25                                                           ` Nguyen Thai Ngoc Duy
  2011-09-27 11:07                                                         ` Michael Haggerty
  1 sibling, 1 reply; 126+ messages in thread
From: Sverre Rabbelier @ 2011-09-27 10:01 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Martin Fick, git, Junio C Hamano, David Michael Barr

Heya,

On Tue, Sep 27, 2011 at 11:01, Julian Phillips <julian@quantumfyre.co.uk> wrote:
> It has to be hot-cache, otherwise time taken to read the refs from disk will
> mean that it is always slow.  On my Mac it seems to _always_ be slow reading
> the refs from disk, so even the "fast" case still takes ~17m.

Ah, that seems unfortunate. Not sure how to test it then.

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27 10:01                                                         ` Sverre Rabbelier
@ 2011-09-27 10:25                                                           ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 126+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-09-27 10:25 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: Julian Phillips, Martin Fick, git, Junio C Hamano, David Michael Barr

On Tue, Sep 27, 2011 at 8:01 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> Heya,
>
> On Tue, Sep 27, 2011 at 11:01, Julian Phillips <julian@quantumfyre.co.uk> wrote:
>> It has to be hot-cache, otherwise time taken to read the refs from disk will
>> mean that it is always slow.  On my Mac it seems to _always_ be slow reading
>> the refs from disk, so even the "fast" case still takes ~17m.
>
> Ah, that seems unfortunate. Not sure how to test it then.

If you care about performance, a perf test suite could be made,
perhaps as a separate project. The output would be charts or
spreadsheets that interested parties can look at to point out
regressions. We may start with a set of commonly used operations.
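
A minimal sketch of such a harness (best-of-N wall-clock timing per command; the operation list below is an illustrative placeholder, not an agreed set):

```python
import subprocess
import time

def time_command(cmd, repo, runs=3):
    # Run `cmd` in `repo` `runs` times and keep the best wall-clock
    # time; a perf suite would record this per command per git build
    # and chart the results across versions.
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, cwd=repo, check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None or elapsed < best else best
    return best

# Candidate commonly used operations to track for regressions.
operations = [
    ["git", "branch"],
    ["git", "status"],
    ["git", "log", "--oneline", "-1"],
]
```

Best-of-N is deliberate: it discards cold-cache outliers, which sidesteps the hot/cold-cache problem raised earlier in the thread.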
-- 
Duy

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27  9:01                                                       ` Julian Phillips
  2011-09-27 10:01                                                         ` Sverre Rabbelier
@ 2011-09-27 11:07                                                         ` Michael Haggerty
  2011-09-27 12:10                                                           ` Julian Phillips
  1 sibling, 1 reply; 126+ messages in thread
From: Michael Haggerty @ 2011-09-27 11:07 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Sverre Rabbelier, Martin Fick, git, Junio C Hamano, David Michael Barr

On 09/27/2011 11:01 AM, Julian Phillips wrote:
> It has to be hot-cache, otherwise time taken to read the refs from disk
> will mean that it is always slow.  On my Mac it seems to _always_ be
> slow reading the refs from disk, so even the "fast" case still takes ~17m.

This case should be helped by lazy-loading of loose references, which I
am working on.  So if you develop some benchmarking code, it would help
me with my work.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-27 11:07                                                         ` Michael Haggerty
@ 2011-09-27 12:10                                                           ` Julian Phillips
  0 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-27 12:10 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Sverre Rabbelier, Martin Fick, git, Junio C Hamano, David Michael Barr

[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]

On Tue, 27 Sep 2011 13:07:15 +0200, Michael Haggerty wrote:
> On 09/27/2011 11:01 AM, Julian Phillips wrote:
>> It has to be hot-cache, otherwise time taken to read the refs from 
>> disk
>> will mean that it is always slow.  On my Mac it seems to _always_ be
>> slow reading the refs from disk, so even the "fast" case still takes 
>> ~17m.
>
> This case should be helped by lazy-loading of loose references, which 
> I
> am working on.  So if you develop some benchmarking code, it would 
> help
> me with my work.

The attached script creates the repo structure I was testing with ...

If you create a repo with 100k refs it takes quite a while to read the 
refs from disk.  If you are lazy-loading then it should take practically 
no time, since the only interesting ref is refs/heads/master.

The following is the hot-cache timing for "./refs-stress c 40000", with 
the sorting patch applied (wasn't prepared to wait for numbers with 100k 
refs).

jp3@rayne: refs>(cd c; time ~/misc/git/git/git branch)
* master

real    0m0.885s
user    0m0.161s
sys     0m0.722s

After doing "rm -rf c/.git/refs/changes/*", I get:

jp3@rayne: refs>(cd c; time ~/misc/git/git/git branch)
* master

real    0m0.004s
user    0m0.001s
sys     0m0.002s

-- 
Julian

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: refs-stress --]
[-- Type: text/x-java; name=refs-stress, Size: 1406 bytes --]

#!/usr/bin/env python

import os
import random
import subprocess
import sys

def die(msg):
    print >> sys.stderr, msg
    sys.exit(1)

def new_ref(a, b, commit):
    d = ".git/refs/changes/%d/%d" % (a, b)
    if not os.path.exists(d):
        os.makedirs(d)
    e = 1
    p = "%s/%d" % (d, e)
    while os.path.exists(p):
        e += 1
        p = "%s/%d" % (d, e)
    f = open(p, "w")
    f.write(commit)
    f.close()

def make_refs(count, commit):
    while count > 0:
        sys.stdout.write("left: %d%s\r" % (count, " " * 30))
        a = random.randrange(10, 30)
        b = random.randrange(10000, 50000)
        new_ref(a, b, commit)
        count -= 1
    print "refs complete"

def main():
    if len(sys.argv) != 3:
        die("usage: %s <name> <ref count>" % sys.argv[0])

    _, name, refs = sys.argv

    os.mkdir(name)
    os.chdir(name)

    if subprocess.call(["git", "init"]) != 0:
        die("failed to init repo")

    f = open("foobar.txt", "w")
    f.write("%s: %s refs\n" % (name, refs))
    f.close()

    if subprocess.call(["git", "add", "foobar.txt"]) != 0:
        die("failed to add foobar.txt")

    if subprocess.call(["git", "commit", "-m", "initial commit"]) != 0:
        die("failed to create initial commit")

    commit = subprocess.check_output(["git", "show-ref", "-s", "master"]).strip()

    make_refs(int(refs), commit)

if __name__ == "__main__":
    main()

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-26 12:41                                     ` Christian Couder
  2011-09-26 17:47                                       ` Martin Fick
@ 2011-09-28 19:38                                       ` Martin Fick
  2011-09-28 22:10                                         ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-28 19:38 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Christian Couder, Thomas Rast, Julian Phillips

On Monday, September 26, 2011 06:41:04 am Christian Couder 
wrote:
> On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick 
<mfick@codeaurora.org> wrote:
...
> >  git checkout
> > 
> > can also take rather long periods of time > 3 mins when
> > run on a repo with ~100K refs.
...
> >  So, I bisected this issue also, and it seems that the
> > "offending" commit is
...
> > commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
> > Author: Christian Couder <chriscool@tuxfamily.org>
> > 
> >    replace_object: add mechanism to replace objects
> > found in "refs/replace/"
...

> I don't think there is an obvious problem with it, but it
> would be nice if you could dig a bit deeper.
> 
> The first thing that could take a lot of time is the call
> to for_each_replace_ref() in this function:
> 
> +static void prepare_replace_object(void)
> +{
> +       static int replace_object_prepared;
> +
> +       if (replace_object_prepared)
> +               return;
> +
> +       for_each_replace_ref(register_replace_ref, NULL);
> +       replace_object_prepared = 1;
> +}

The time was actually spent in for_each_replace_ref()
which calls get_loose_refs() which has the recursive bug 
that Julian Phillips fixed 2 days ago.  Good to see that 
this fix helps other use cases too.

So with that bug fixed, the thing taking the most time now 
for a git checkout with ~100K refs seems to be the orphan 
check as Thomas predicted.  The strange part is 
that the orphan check seems to take only about ~20s in the 
repo where the refs aren't packed.  However, in the repo 
where they are packed, this check takes at least 5min!  This 
seems a bit unusual, doesn't it?  Is the filesystem that 
much better at indexing refs than git's pack mechanism?  
Seems unlikely, the unpacked refs take 312M in the FS, the 
packed ones only take about 4.3M.  I suspect there is 
something else unexpected going on here in the packed ref 
case.  

Any thoughts?  I will dig deeper...

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-28 19:38                                       ` Martin Fick
@ 2011-09-28 22:10                                         ` Martin Fick
  2011-09-29  0:54                                           ` Julian Phillips
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-28 22:10 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Christian Couder, Thomas Rast, Julian Phillips

On Wednesday, September 28, 2011 01:38:04 pm Martin Fick 
wrote:
> On Monday, September 26, 2011 06:41:04 am Christian
> Couder
> 
> wrote:
> > On Sun, Sep 25, 2011 at 10:43 PM, Martin Fick
> 
> <mfick@codeaurora.org> wrote:
> ...
> 
> > >  git checkout
> > > 
> > > can also take rather long periods of time > 3 mins
> > > when run on a repo with ~100K refs.
> 
> ...
> 
> > >  So, I bisected this issue also, and it seems that
> > >  the
> > > 
> > > "offending" commit is
> 
> ...
> 
> > > commit 680955702990c1d4bfb3c6feed6ae9c6cb5c3c07
> > > Author: Christian Couder <chriscool@tuxfamily.org>
> > > 
> > >    replace_object: add mechanism to replace objects
> > > 
> > > found in "refs/replace/"
> 
> ...
> 
> > I don't think there is an obvious problem with it, but
> > it would be nice if you could dig a bit deeper.
> > 
> > The first thing that could take a lot of time is the
> > call to for_each_replace_ref() in this function:
> > 
> > +static void prepare_replace_object(void)
> > +{
> > +       static int replace_object_prepared;
> > +
> > +       if (replace_object_prepared)
> > +               return;
> > +
> > +       for_each_replace_ref(register_replace_ref,
> > NULL); +       replace_object_prepared = 1;
> > +}
> 
> The time was actually spent in for_each_replace_ref()
> which calls get_loose_refs() which has the recursive bug
> that Julian Phillips fixed 2 days ago.  Good to see that
> this fix helps other use cases too.
> 
> So with that bug fixed, the thing taking the most time
> now for a git checkout with ~100K refs seems to be the
> orphan check as Thomas predicted.  The strange part with
> this, is that the orphan check seems to take only about
> ~20s in the repo where the refs aren't packed.  However,
> in the repo where they are packed, this check takes at
> least 5min!  This seems a bit unusual, doesn't it?  Is
> the filesystem that much better at indexing refs than
> git's pack mechanism? Seems unlikely, the unpacked refs
> take 312M in the FS, the packed ones only take about
> 4.3M.  I suspect there is something else unexpected 
> going on here in the packed ref case.
> 
> Any thoughts?  I will dig deeper...

I think the problem is that resolve_ref() walks a linked 
list searching for the packed ref.  Does this mean that 
packed refs are not indexed at all?
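
If so, the cost gap is easy to model: a linear walk of a 100k-entry list versus a binary search over the same sorted names (illustrative Python, not git's actual data structures):

```python
import bisect

# 100k unique ref names, kept sorted as in a packed-refs file.
names = sorted("refs/changes/%02d/%d/1" % (i % 100, i)
               for i in range(100000))

def linear_lookup(name):
    # What a linked-list walk costs: one step per ref until a match.
    steps = 0
    for n in names:
        steps += 1
        if n == name:
            return steps
    return steps

def indexed_lookup(name):
    # With the list already sorted, a binary search takes only
    # ~log2(n) comparisons (~17 for 100k entries).
    i = bisect.bisect_left(names, name)
    return i < len(names) and names[i] == name

target = names[-1]            # worst case: last entry in the list
print(linear_lookup(target))  # 100000 steps
print(indexed_lookup(target)) # True
```

A sha1-to-ref search (as in the orphan check) is worse still: the list is sorted by name, so there is nothing to binary-search on and every probe is a full scan.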

> 
> -Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-28 22:10                                         ` Martin Fick
@ 2011-09-29  0:54                                           ` Julian Phillips
  2011-09-29  1:37                                             ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Julian Phillips @ 2011-09-29  0:54 UTC (permalink / raw)
  To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast

On Wed, 28 Sep 2011 16:10:48 -0600, Martin Fick wrote:
> On Wednesday, September 28, 2011 01:38:04 pm Martin Fick
> wrote:
-- snip --
>> So with that bug fixed, the thing taking the most time
>> now for a git checkout with ~100K refs seems to be the
>> orphan check as Thomas predicted.  The strange part with
>> this, is that the orphan check seems to take only about
>> ~20s in the repo where the refs aren't packed.  However,
>> in the repo where they are packed, this check takes at
>> least 5min!  This seems a bit unusual, doesn't it?  Is
>> the filesystem that much better at indexing refs than
>> git's pack mechanism? Seems unlikely, the unpacked refs
>> take 312M in the FS, the packed ones only take about
>> 4.3M.  I suspect their is something else unexpected
>> going on here in the packed ref case.
>>
>> Any thoughts?  I will dig deeper...
>
> I think the problem is that resolve_ref() walks a linked
> list searching for the packed ref.  Does this mean that
> packed refs are not indexed at all?

Are you sure that it is walking the linked list that is the problem?  
I've created a test repo with ~100k refs/changes/... style refs, and 
~40000 refs/heads/... style refs, and checkout can walk the list of 
~140k refs seven times in 85ms user time including doing whatever other 
processing is needed for checkout.  The real time is only 114ms - but 
then my test repo has no real data in.

If resolve_ref() walking the linked list of refs was the problem, then 
I would expect my test repo to show the same problem.  It doesn't, a pre 
ref-packing checkout took minutes (~0.5s user time), whereas a 
ref-packed checkout takes ~0.1s.  So, I would suggest that the problem 
lies elsewhere.

Have you tried running a checkout whilst profiling?

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29  0:54                                           ` Julian Phillips
@ 2011-09-29  1:37                                             ` Martin Fick
  2011-09-29  2:19                                               ` Julian Phillips
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-29  1:37 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Christian Couder, git, Christian Couder, Thomas Rast

On Wednesday 28 September 2011 18:59:09 Martin Fick wrote: 
> Julian Phillips <julian@quantumfyre.co.uk> wrote: 
> > On Wed, 28 Sep 2011 16:10:48 -0600, Martin Fick wrote: 
> >> So with that bug fixed, the thing taking the most time 
> >> now for a git checkout with ~100K refs seems to be the 
> >> orphan check as Thomas predicted. The strange part with 
> >> this, is that the orphan check seems to take only about 
> >> ~20s in the repo where the refs aren't packed. However, 
> >> in the repo where they are packed, this check takes at 
> >> least 5min! This seems a bit unusual, doesn't it? Is 
> >> the filesystem that much better at indexing refs than 
> >> git's pack mechanism? Seems unlikely, the unpacked refs 
> >> take 312M in the FS, the packed ones only take about 
> >> 4.3M. I suspect there is something else unexpected 
> >> going on here in the packed ref case. 
> >> 
> >> Any thoughts? I will dig deeper... 
> > 
> > I think the problem is that resolve_ref() walks a linked 
> > list of refs searching for the packed ref. Does this mean that 
> > packed refs are not indexed at all? 
> Are you sure that it is walking the linked list that is the problem?

It sure seems like it.

> I've created a test repo with ~100k refs/changes/... style refs, and 
> ~40000 refs/heads/... style refs, and checkout can walk the list of 
> ~140k refs seven times in 85ms user time including doing whatever other 
> processing is needed for checkout. The real time is only 114ms - but 
> then my test repo has no real data in.

If I understand what you are saying, it sounds like you do not have a very good test case. The amount of time it takes for checkout depends on how long it takes to find a ref with the sha1 that you are on. If that sha1 is so early in the list of refs that it only took you 7 traversals to find it, then that is not a very good testcase. I think that you should probably try making an orphaned ref (checkout a detached head, commit to it), that is probably the worst testcase since it should then have to search all 140K refs to eventually give up.

Again, if I understand what you are saying, if it took 85ms for 7 traversals, then it takes approximately 10ms per traversal, that's only 100/s! If you have to traverse it 140K times, that should work out to 1400s ~ 23mins.

> If resolve_ref() walking the linked list of refs was the problem, then 
> I would expect my test repo to show the same problem. It doesn't, a pre 
> ref-packing checkout took minutes (~0.5s user time), whereas a 
> ref-packed checkout takes ~0.1s. So, I would suggest that the problem 
> lies elsewhere. 
> 
> Have you tried running a checkout whilst profiling?

No, to be honest, I am not familiar with any profiling tools.

-Martin

Employee of Qualcomm Innovation Center, Inc., which is a member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29  1:37                                             ` Martin Fick
@ 2011-09-29  2:19                                               ` Julian Phillips
  2011-09-29 16:38                                                 ` Martin Fick
  2011-09-29 18:27                                                 ` René Scharfe
  0 siblings, 2 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29  2:19 UTC (permalink / raw)
  To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast

On Wed, 28 Sep 2011 19:37:18 -0600, Martin Fick wrote:
> On Wednesday 28 September 2011 18:59:09 Martin Fick wrote:
>> Julian Phillips <julian@quantumfyre.co.uk> wrote:
-- snip --
>> I've created a test repo with ~100k refs/changes/... style refs, and
>> ~40000 refs/heads/... style refs, and checkout can walk the list of
>> ~140k refs seven times in 85ms user time including doing whatever 
>> other
>> processing is needed for checkout. The real time is only 114ms - but
>> then my test repo has no real data in.
>
> If I understand what you are saying, it sounds like you do not have a
> very good test case. The amount of time it takes for checkout depends
> on how long it takes to find a ref with the sha1 that you are on. If
> that sha1 is so early in the list of refs that it only took you 7
> traversals to find it, then that is not a very good testcase. I think
> that you should probably try making an orphaned ref (checkout a
> detached head, commit to it), that is probably the worst testcase
> since it should then have to search all 140K refs to eventually give
> up.
>
> Again, if I understand what you are saying, if it took 85ms for 7
> traversals, then it takes approximately 10ms per traversal, that's
> only 100/s! If you have to traverse it 140K times, that should work
> out to 1400s ~ 23mins.

Well, it's no more than 10ms per traversal - since the rest of the work 
presumably takes some time too ...

However, I had forgotten to make the orphaned commit as you suggest - 
and then _bang_ 7N^2, it tries seven different variants of each ref 
(which is silly as they are all fully qualified), and with packed refs 
it has to search for them each time, all to turn names into hashes that 
we already know to start with.

So, yes - it is that list traversal.

Does the following help?

diff --git a/builtin/checkout.c b/builtin/checkout.c
index 5e356a6..f0f4ca1 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const char 
*refname,
                                        int flags,
                                        void *cb_data)
  {
-       add_one_rev_list_arg(cb_data, refname);
+       add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1)));
         return 0;
  }

-- 
Julian

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH] refs: Use binary search to lookup refs faster
  2011-09-29 19:10                                                   ` Junio C Hamano
@ 2011-09-29  4:18                                                     ` Julian Phillips
  2011-09-29 21:57                                                       ` Junio C Hamano
                                                                         ` (2 more replies)
  2011-09-29 20:44                                                     ` Git is not scalable with too many refs/* Martin Fick
  1 sibling, 3 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29  4:18 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

Currently we linearly search through lists of refs when we need to
find a specific ref.  This can be very slow if we need to lookup a
large number of refs.  By changing to a binary search we can make this
faster.

In order to be able to use a binary search we need to change from
using linked lists to arrays, which we can manage using ALLOC_GROW.

We can now also use the standard library qsort function to sort the
refs arrays.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

Something like this?

 refs.c |  328 ++++++++++++++++++++++++++--------------------------------------
 1 files changed, 131 insertions(+), 197 deletions(-)

diff --git a/refs.c b/refs.c
index a49ff74..e411bea 100644
--- a/refs.c
+++ b/refs.c
@@ -8,14 +8,18 @@
 #define REF_KNOWS_PEELED 04
 #define REF_BROKEN 010
 
-struct ref_list {
-	struct ref_list *next;
+struct ref_entry {
 	unsigned char flag; /* ISSYMREF? ISPACKED? */
 	unsigned char sha1[20];
 	unsigned char peeled[20];
 	char name[FLEX_ARRAY];
 };
 
+struct ref_array {
+	int nr, alloc;
+	struct ref_entry **refs;
+};
+
 static const char *parse_ref_line(char *line, unsigned char *sha1)
 {
 	/*
@@ -44,108 +48,55 @@ static const char *parse_ref_line(char *line, unsigned char *sha1)
 	return line;
 }
 
-static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
-				int flag, struct ref_list *list,
-				struct ref_list **new_entry)
+static void add_ref(const char *name, const unsigned char *sha1,
+		    int flag, struct ref_array *refs,
+		    struct ref_entry **new_entry)
 {
 	int len;
-	struct ref_list *entry;
+	struct ref_entry *entry;
 
 	/* Allocate it and add it in.. */
 	len = strlen(name) + 1;
-	entry = xmalloc(sizeof(struct ref_list) + len);
+	entry = xmalloc(sizeof(struct ref_entry) + len);
 	hashcpy(entry->sha1, sha1);
 	hashclr(entry->peeled);
 	memcpy(entry->name, name, len);
 	entry->flag = flag;
-	entry->next = list;
 	if (new_entry)
 		*new_entry = entry;
-	return entry;
+	ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc);
+	refs->refs[refs->nr++] = entry;
 }
 
-/* merge sort the ref list */
-static struct ref_list *sort_ref_list(struct ref_list *list)
+static int ref_entry_cmp(const void *a, const void *b)
 {
-	int psize, qsize, last_merge_count, cmp;
-	struct ref_list *p, *q, *l, *e;
-	struct ref_list *new_list = list;
-	int k = 1;
-	int merge_count = 0;
-
-	if (!list)
-		return list;
-
-	do {
-		last_merge_count = merge_count;
-		merge_count = 0;
-
-		psize = 0;
-
-		p = new_list;
-		q = new_list;
-		new_list = NULL;
-		l = NULL;
+	struct ref_entry *one = *(struct ref_entry **)a;
+	struct ref_entry *two = *(struct ref_entry **)b;
+	return strcmp(one->name, two->name);
+}
 
-		while (p) {
-			merge_count++;
+static void sort_ref_array(struct ref_array *array)
+{
+	qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
+}
 
-			while (psize < k && q->next) {
-				q = q->next;
-				psize++;
-			}
-			qsize = k;
-
-			while ((psize > 0) || (qsize > 0 && q)) {
-				if (qsize == 0 || !q) {
-					e = p;
-					p = p->next;
-					psize--;
-				} else if (psize == 0) {
-					e = q;
-					q = q->next;
-					qsize--;
-				} else {
-					cmp = strcmp(q->name, p->name);
-					if (cmp < 0) {
-						e = q;
-						q = q->next;
-						qsize--;
-					} else if (cmp > 0) {
-						e = p;
-						p = p->next;
-						psize--;
-					} else {
-						if (hashcmp(q->sha1, p->sha1))
-							die("Duplicated ref, and SHA1s don't match: %s",
-							    q->name);
-						warning("Duplicated ref: %s", q->name);
-						e = q;
-						q = q->next;
-						qsize--;
-						free(e);
-						e = p;
-						p = p->next;
-						psize--;
-					}
-				}
+static struct ref_entry *search_ref_array(struct ref_array *array, const char *name)
+{
+	struct ref_entry *e, **r;
+	int len;
 
-				e->next = NULL;
+	len = strlen(name) + 1;
+	e = xmalloc(sizeof(struct ref_entry) + len);
+	memcpy(e->name, name, len);
 
-				if (l)
-					l->next = e;
-				if (!new_list)
-					new_list = e;
-				l = e;
-			}
+	r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
 
-			p = q;
-		};
+	free(e);
 
-		k = k * 2;
-	} while ((last_merge_count != merge_count) || (last_merge_count != 1));
+	if (r == NULL)
+		return NULL;
 
-	return new_list;
+	return *r;
 }
 
 /*
@@ -155,38 +106,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list)
 static struct cached_refs {
 	char did_loose;
 	char did_packed;
-	struct ref_list *loose;
-	struct ref_list *packed;
+	struct ref_array loose;
+	struct ref_array packed;
 } cached_refs, submodule_refs;
-static struct ref_list *current_ref;
+static struct ref_entry *current_ref;
 
-static struct ref_list *extra_refs;
+static struct ref_array extra_refs;
 
-static void free_ref_list(struct ref_list *list)
+static void free_ref_array(struct ref_array *array)
 {
-	struct ref_list *next;
-	for ( ; list; list = next) {
-		next = list->next;
-		free(list);
-	}
+	int i;
+	for (i = 0; i < array->nr; i++)
+		free(array->refs[i]);
+	free(array->refs);
+	array->nr = array->alloc = 0;
+	array->refs = NULL;
 }
 
 static void invalidate_cached_refs(void)
 {
 	struct cached_refs *ca = &cached_refs;
 
-	if (ca->did_loose && ca->loose)
-		free_ref_list(ca->loose);
-	if (ca->did_packed && ca->packed)
-		free_ref_list(ca->packed);
-	ca->loose = ca->packed = NULL;
+	if (ca->did_loose)
+		free_ref_array(&ca->loose);
+	if (ca->did_packed)
+		free_ref_array(&ca->packed);
 	ca->did_loose = ca->did_packed = 0;
 }
 
 static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 {
-	struct ref_list *list = NULL;
-	struct ref_list *last = NULL;
+	struct ref_entry *last = NULL;
 	char refline[PATH_MAX];
 	int flag = REF_ISPACKED;
 
@@ -205,7 +155,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 
 		name = parse_ref_line(refline, sha1);
 		if (name) {
-			list = add_ref(name, sha1, flag, list, &last);
+			add_ref(name, sha1, flag, &cached_refs->packed, &last);
 			continue;
 		}
 		if (last &&
@@ -215,21 +165,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 		    !get_sha1_hex(refline + 1, sha1))
 			hashcpy(last->peeled, sha1);
 	}
-	cached_refs->packed = sort_ref_list(list);
+	sort_ref_array(&cached_refs->packed);
 }
 
 void add_extra_ref(const char *name, const unsigned char *sha1, int flag)
 {
-	extra_refs = add_ref(name, sha1, flag, extra_refs, NULL);
+	add_ref(name, sha1, flag, &extra_refs, NULL);
 }
 
 void clear_extra_refs(void)
 {
-	free_ref_list(extra_refs);
-	extra_refs = NULL;
+	free_ref_array(&extra_refs);
 }
 
-static struct ref_list *get_packed_refs(const char *submodule)
+static struct ref_array *get_packed_refs(const char *submodule)
 {
 	const char *packed_refs_file;
 	struct cached_refs *refs;
@@ -237,7 +186,7 @@ static struct ref_list *get_packed_refs(const char *submodule)
 	if (submodule) {
 		packed_refs_file = git_path_submodule(submodule, "packed-refs");
 		refs = &submodule_refs;
-		free_ref_list(refs->packed);
+		free_ref_array(&refs->packed);
 	} else {
 		packed_refs_file = git_path("packed-refs");
 		refs = &cached_refs;
@@ -245,18 +194,17 @@ static struct ref_list *get_packed_refs(const char *submodule)
 
 	if (!refs->did_packed || submodule) {
 		FILE *f = fopen(packed_refs_file, "r");
-		refs->packed = NULL;
 		if (f) {
 			read_packed_refs(f, refs);
 			fclose(f);
 		}
 		refs->did_packed = 1;
 	}
-	return refs->packed;
+	return &refs->packed;
 }
 
-static struct ref_list *get_ref_dir(const char *submodule, const char *base,
-				    struct ref_list *list)
+static void get_ref_dir(const char *submodule, const char *base,
+			struct ref_array *array)
 {
 	DIR *dir;
 	const char *path;
@@ -299,7 +247,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 			if (stat(refdir, &st) < 0)
 				continue;
 			if (S_ISDIR(st.st_mode)) {
-				list = get_ref_dir(submodule, ref, list);
+				get_ref_dir(submodule, ref, array);
 				continue;
 			}
 			if (submodule) {
@@ -314,12 +262,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 					hashclr(sha1);
 					flag |= REF_BROKEN;
 				}
-			list = add_ref(ref, sha1, flag, list, NULL);
+			add_ref(ref, sha1, flag, array, NULL);
 		}
 		free(ref);
 		closedir(dir);
 	}
-	return list;
 }
 
 struct warn_if_dangling_data {
@@ -356,21 +303,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname)
 	for_each_rawref(warn_if_dangling_symref, &data);
 }
 
-static struct ref_list *get_loose_refs(const char *submodule)
+static struct ref_array *get_loose_refs(const char *submodule)
 {
 	if (submodule) {
-		free_ref_list(submodule_refs.loose);
-		submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
-		submodule_refs.loose = sort_ref_list(submodule_refs.loose);
-		return submodule_refs.loose;
+		free_ref_array(&submodule_refs.loose);
+		get_ref_dir(submodule, "refs", &submodule_refs.loose);
+		sort_ref_array(&submodule_refs.loose);
+		return &submodule_refs.loose;
 	}
 
 	if (!cached_refs.did_loose) {
-		cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
-		cached_refs.loose = sort_ref_list(cached_refs.loose);
+		get_ref_dir(NULL, "refs", &cached_refs.loose);
+		sort_ref_array(&cached_refs.loose);
 		cached_refs.did_loose = 1;
 	}
-	return cached_refs.loose;
+	return &cached_refs.loose;
 }
 
 /* We allow "recursive" symbolic refs. Only within reason, though */
@@ -381,8 +328,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 {
 	FILE *f;
 	struct cached_refs refs;
-	struct ref_list *ref;
-	int retval;
+	struct ref_entry *ref;
+	int retval = -1;
 
 	strcpy(name + pathlen, "packed-refs");
 	f = fopen(name, "r");
@@ -390,17 +337,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 		return -1;
 	read_packed_refs(f, &refs);
 	fclose(f);
-	ref = refs.packed;
-	retval = -1;
-	while (ref) {
-		if (!strcmp(ref->name, refname)) {
-			retval = 0;
-			memcpy(result, ref->sha1, 20);
-			break;
-		}
-		ref = ref->next;
+	ref = search_ref_array(&refs.packed, refname);
+	if (ref != NULL) {
+		memcpy(result, ref->sha1, 20);
+		retval = 0;
 	}
-	free_ref_list(refs.packed);
+	free_ref_array(&refs.packed);
 	return retval;
 }
 
@@ -501,15 +443,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int *
 		git_snpath(path, sizeof(path), "%s", ref);
 		/* Special case: non-existing file. */
 		if (lstat(path, &st) < 0) {
-			struct ref_list *list = get_packed_refs(NULL);
-			while (list) {
-				if (!strcmp(ref, list->name)) {
-					hashcpy(sha1, list->sha1);
-					if (flag)
-						*flag |= REF_ISPACKED;
-					return ref;
-				}
-				list = list->next;
+			struct ref_array *packed = get_packed_refs(NULL);
+			struct ref_entry *r = search_ref_array(packed, ref);
+			if (r != NULL) {
+				hashcpy(sha1, r->sha1);
+				if (flag)
+					*flag |= REF_ISPACKED;
+				return ref;
 			}
 			if (reading || errno != ENOENT)
 				return NULL;
@@ -584,7 +524,7 @@ int read_ref(const char *ref, unsigned char *sha1)
 
 #define DO_FOR_EACH_INCLUDE_BROKEN 01
 static int do_one_ref(const char *base, each_ref_fn fn, int trim,
-		      int flags, void *cb_data, struct ref_list *entry)
+		      int flags, void *cb_data, struct ref_entry *entry)
 {
 	if (prefixcmp(entry->name, base))
 		return 0;
@@ -630,18 +570,12 @@ int peel_ref(const char *ref, unsigned char *sha1)
 		return -1;
 
 	if ((flag & REF_ISPACKED)) {
-		struct ref_list *list = get_packed_refs(NULL);
+		struct ref_array *array = get_packed_refs(NULL);
+		struct ref_entry *r = search_ref_array(array, ref);
 
-		while (list) {
-			if (!strcmp(list->name, ref)) {
-				if (list->flag & REF_KNOWS_PEELED) {
-					hashcpy(sha1, list->peeled);
-					return 0;
-				}
-				/* older pack-refs did not leave peeled ones */
-				break;
-			}
-			list = list->next;
+		if (r != NULL && r->flag & REF_KNOWS_PEELED) {
+			hashcpy(sha1, r->peeled);
+			return 0;
 		}
 	}
 
@@ -660,36 +594,39 @@ fallback:
 static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn,
 			   int trim, int flags, void *cb_data)
 {
-	int retval = 0;
-	struct ref_list *packed = get_packed_refs(submodule);
-	struct ref_list *loose = get_loose_refs(submodule);
+	int retval = 0, i, p = 0, l = 0;
+	struct ref_array *packed = get_packed_refs(submodule);
+	struct ref_array *loose = get_loose_refs(submodule);
 
-	struct ref_list *extra;
+	struct ref_array *extra = &extra_refs;
 
-	for (extra = extra_refs; extra; extra = extra->next)
-		retval = do_one_ref(base, fn, trim, flags, cb_data, extra);
+	for (i = 0; i < extra->nr; i++)
+		retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]);
 
-	while (packed && loose) {
-		struct ref_list *entry;
-		int cmp = strcmp(packed->name, loose->name);
+	while (p < packed->nr && l < loose->nr) {
+		struct ref_entry *entry;
+		int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name);
 		if (!cmp) {
-			packed = packed->next;
+			p++;
 			continue;
 		}
 		if (cmp > 0) {
-			entry = loose;
-			loose = loose->next;
+			entry = loose->refs[l++];
 		} else {
-			entry = packed;
-			packed = packed->next;
+			entry = packed->refs[p++];
 		}
 		retval = do_one_ref(base, fn, trim, flags, cb_data, entry);
 		if (retval)
 			goto end_each;
 	}
 
-	for (packed = packed ? packed : loose; packed; packed = packed->next) {
-		retval = do_one_ref(base, fn, trim, flags, cb_data, packed);
+	if (l < loose->nr) {
+		p = l;
+		packed = loose;
+	}
+
+	for (; p < packed->nr; p++) {
+		retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]);
 		if (retval)
 			goto end_each;
 	}
@@ -1005,24 +942,24 @@ static int remove_empty_directories(const char *file)
 }
 
 static int is_refname_available(const char *ref, const char *oldref,
-				struct ref_list *list, int quiet)
-{
-	int namlen = strlen(ref); /* e.g. 'foo/bar' */
-	while (list) {
-		/* list->name could be 'foo' or 'foo/bar/baz' */
-		if (!oldref || strcmp(oldref, list->name)) {
-			int len = strlen(list->name);
+				struct ref_array *array, int quiet)
+{
+	int i, namlen = strlen(ref); /* e.g. 'foo/bar' */
+	for (i = 0; i < array->nr; i++) {
+		struct ref_entry *entry = array->refs[i];
+		/* entry->name could be 'foo' or 'foo/bar/baz' */
+		if (!oldref || strcmp(oldref, entry->name)) {
+			int len = strlen(entry->name);
 			int cmplen = (namlen < len) ? namlen : len;
-			const char *lead = (namlen < len) ? list->name : ref;
-			if (!strncmp(ref, list->name, cmplen) &&
+			const char *lead = (namlen < len) ? entry->name : ref;
+			if (!strncmp(ref, entry->name, cmplen) &&
 			    lead[cmplen] == '/') {
 				if (!quiet)
 					error("'%s' exists; cannot create '%s'",
-					      list->name, ref);
+					      entry->name, ref);
 				return 0;
 			}
 		}
-		list = list->next;
 	}
 	return 1;
 }
@@ -1129,18 +1066,13 @@ static struct lock_file packlock;
 
 static int repack_without_ref(const char *refname)
 {
-	struct ref_list *list, *packed_ref_list;
-	int fd;
-	int found = 0;
+	struct ref_array *packed;
+	struct ref_entry *ref;
+	int fd, i;
 
-	packed_ref_list = get_packed_refs(NULL);
-	for (list = packed_ref_list; list; list = list->next) {
-		if (!strcmp(refname, list->name)) {
-			found = 1;
-			break;
-		}
-	}
-	if (!found)
+	packed = get_packed_refs(NULL);
+	ref = search_ref_array(packed, refname);
+	if (ref == NULL)
 		return 0;
 	fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0);
 	if (fd < 0) {
@@ -1148,17 +1080,19 @@ static int repack_without_ref(const char *refname)
 		return error("cannot delete '%s' from packed refs", refname);
 	}
 
-	for (list = packed_ref_list; list; list = list->next) {
+	for (i = 0; i < packed->nr; i++) {
 		char line[PATH_MAX + 100];
 		int len;
 
-		if (!strcmp(refname, list->name))
+		ref = packed->refs[i];
+
+		if (!strcmp(refname, ref->name))
 			continue;
 		len = snprintf(line, sizeof(line), "%s %s\n",
-			       sha1_to_hex(list->sha1), list->name);
+			       sha1_to_hex(ref->sha1), ref->name);
 		/* this should not happen but just being defensive */
 		if (len > sizeof(line))
-			die("too long a refname '%s'", list->name);
+			die("too long a refname '%s'", ref->name);
 		write_or_die(fd, line, len);
 	}
 	return commit_lock_file(&packlock);
-- 
1.7.6.1

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29  2:19                                               ` Julian Phillips
@ 2011-09-29 16:38                                                 ` Martin Fick
  2011-09-29 18:26                                                   ` Julian Phillips
  2011-09-29 18:27                                                 ` René Scharfe
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-29 16:38 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Christian Couder, git, Christian Couder, Thomas Rast

On Wednesday, September 28, 2011 08:19:16 pm Julian Phillips 
wrote:
> On Wed, 28 Sep 2011 19:37:18 -0600, Martin Fick wrote:
> > On Wednesday 28 September 2011 18:59:09 Martin Fick 
wrote:
> >> Julian Phillips <julian@quantumfyre.co.uk> wrote:
> -- snip --
> 
> >> I've created a test repo with ~100k refs/changes/...
> >> style refs, and ~40000 refs/heads/... style refs, and
> >> checkout can walk the list of ~140k refs seven times
> >> in 85ms user time including doing whatever other
> >> processing is needed for checkout. The real time is
> >> only 114ms - but then my test repo has no real data
> >> in.
> > 
> > If I understand what you are saying, it sounds like you
> > do not have a very good test case. The amount of time
> > it takes for checkout depends on how long it takes to
> > find a ref with the sha1 that you are on. If that sha1
> > is so early in the list of refs that it only took you
> > 7 traversals to find it, then that is not a very good
> > testcase. I think that you should probably try making
> > an orphaned ref (checkout a detached head, commit to
> > it), that is probably the worst testcase since it
> > should then have to search all 140K refs to eventually
> > give up.
> > 
> > Again, if I understand what you are saying, if it took
> > 85ms for 7 traversals, then it takes approximately
> > 10ms per traversal, that's only 100/s! If you have to
> > traverse it 140K times, that should work out to 1400s
> > ~ 23mins.
> 
> Well, it's no more than 10ms per traversal - since the
> rest of the work presumably takes some time too ...
> 
> However, I had forgotten to make the orphaned commit as
> you suggest - and then _bang_ 7N^2, it tries seven
> different variants of each ref (which is silly as they
> are all fully qualified), and with packed refs it has to
> search for them each time, all to turn names into hashes
> that we already know to start with.
> 
> So, yes - it is that list traversal.
> 
> Does the following help?
> 
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 5e356a6..f0f4ca1 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -605,7 +605,7 @@ static int
> add_one_ref_to_rev_list_arg(const char *refname,
>                                         int flags,
>                                         void *cb_data)
>   {
> -       add_one_rev_list_arg(cb_data, refname);
> +       add_one_rev_list_arg(cb_data,
> strdup(sha1_to_hex(sha1))); return 0;
>   }


Yes, but in some strange ways. :)

First, let me clarify that all the tests here involve your 
"sort fix" from 2 days ago applied first.

In the packed ref repo, it brings the time down to about 
~10s (from > 5 mins).  In the unpacked ref repo, it brings 
it down to about the same thing ~10s, but it was only 
starting at about ~20s.

So, I have to ask: what does that change do?  I don't quite 
understand it.  Does it just do one lookup per ref by 
normalizing it?  Is the list still being traversed, just 
about 7 times less now?  Should the packed_ref list simply be 
put in an array which could be binary searched instead, it 
is a fixed list once loaded, no?  

I prototyped a packed_ref implementation using the hash.c 
provided in the git sources and it seemed to speed a 
checkout up to almost instantaneous, but I was getting a few 
collisions so the implementation was not good enough.  That 
is when I started to wonder if an array wouldn't be better 
in this case?



Now I also decided to go back and test a noop fetch (a 
refetch) of all the changes (since this use case is still 
taking way longer than I think it should, even with the 
submodule fix posted earlier).  Up until this point, even 
the sorting fix did not help.  So I tried it with this fix.  
In the unpackref case, it did not seem to change (2~4mins).  
However, in the packed ref case (which was previously also 
about 2-4mins), this now only takes about 10-15s!

Any clues as to why the unpacked refs would still be so slow 
on noop fetches and not be sped up by this?


-Martin


-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29 16:38                                                 ` Martin Fick
@ 2011-09-29 18:26                                                   ` Julian Phillips
  0 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29 18:26 UTC (permalink / raw)
  To: Martin Fick; +Cc: Christian Couder, git, Christian Couder, Thomas Rast

On Thu, 29 Sep 2011 10:38:44 -0600, Martin Fick wrote:
> On Wednesday, September 28, 2011 08:19:16 pm Julian Phillips
> wrote:
-- snip --
>> However, I had forgotten to make the orphaned commit as
>> you suggest - and then _bang_ 7N^2, it tries seven
>> different variants of each ref (which is silly as they
>> are all fully qualified), and with packed refs it has to
>> search for them each time, all to turn names into hashes
>> that we already know to start with.
>>
>> So, yes - it is that list traversal.
>>
>> Does the following help?
>>
>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>> index 5e356a6..f0f4ca1 100644
>> --- a/builtin/checkout.c
>> +++ b/builtin/checkout.c
>> @@ -605,7 +605,7 @@ static int
>> add_one_ref_to_rev_list_arg(const char *refname,
>>                                         int flags,
>>                                         void *cb_data)
>>   {
>> -       add_one_rev_list_arg(cb_data, refname);
>> +       add_one_rev_list_arg(cb_data,
>> strdup(sha1_to_hex(sha1))); return 0;
>>   }
>
>
> Yes, but in some strange ways. :)
>
> First, let me clarify that all the tests here involve your
> "sort fix" from 2 days ago applied first.
>
> In the packed ref repo, it brings the time down to about
> ~10s (from > 5 mins).  In the unpacked ref repo, it brings
> it down to about the same thing ~10s, but it was only
> starting at about ~20s.
>
> So, I have to ask, what does that change do, I don't quite
> understand it?  Does it just do only one lookup per ref by
> normalizing it?  Is the list still being traversed, just
> about 7 times less now?

In order to check for orphaned commits, checkout effectively calls 
rev-list passing it a list of the names of all the refs as input.  The 
rev-list code then has to go through this list and convert each entry 
into an actual hash that it can look up in the object database.  This is 
where the N^2 comes in for packed refs, as it calls resolve_ref() for 
each ref in the list (N), which then loops through the list of all refs 
(N) to find a match.  However, the code that creates the list of refs to 
pass to the rev-list code already knows the hash for each ref.  So the 
change above passes the hashes to rev-list, which then doesn't need to 
lookup the ref - it just converts the string form hash back to binary 
form, avoiding the N^2 work altogether.  This is why packed and unpacked 
are about the same speed, as they are now doing the same amount of work.

> Should the packed_ref list simply be
> put in an array which could be binary searched instead, it
> is a fixed list once loaded, no?

A quick look at the code suggests that probably both the list of loose 
refs and the list of packed refs could be stored as binary 
searchable arrays, or in an ordered hash table.  Though whether it is 
actually necessary I don't know.  So far, it seems to have been possible 
to fix performance issues whilst keeping the simple lists ...

> I prototyped a packed_ref implementation using the hash.c
> provided in the git sources and it seemed to speed a
> checkout up to almost instantaneous, but I was getting a few
> collisions so the implementation was not good enough.  That
> is when I started to wonder if an array wouldn't be better
> in this case?
>
>
>
> Now I also decided to go back and test a noop fetch (a
> refetch) of all the changes (since this use case is still
> taking way longer than I think it should, even with the
> submodule fix posted earlier).  Up until this point, even
> the sorting fix did not help.  So I tried it with this fix.
> In the unpackref case, it did not seem to change (2~4mins).
> However, in the packed ref change (which was previously also
> about 2-4mins), this now only takes about 10-15s!
>
> Any clues as to why the unpacked refs would still be so slow
> on noop fetches and not be sped up by this?

Not really.  I wouldn't expect this change to have any effect on fetch, 
but I haven't actually looked into it.

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29  2:19                                               ` Julian Phillips
  2011-09-29 16:38                                                 ` Martin Fick
@ 2011-09-29 18:27                                                 ` René Scharfe
  2011-09-29 19:10                                                   ` Junio C Hamano
                                                                     ` (2 more replies)
  1 sibling, 3 replies; 126+ messages in thread
From: René Scharfe @ 2011-09-29 18:27 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Martin Fick, Christian Couder, git, Christian Couder,
	Thomas Rast, Junio C Hamano

On 29.09.2011 04:19, Julian Phillips wrote:
> Does the following help?
> 
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 5e356a6..f0f4ca1 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const char
> *refname,
>                                        int flags,
>                                        void *cb_data)
>  {
> -       add_one_rev_list_arg(cb_data, refname);
> +       add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1)));
>         return 0;
>  }

Hmm.  Can we get rid of the multiple ref lookups fixed by the above
*and* the overhead of dealing with a textual argument list at the same
time by calling add_pending_object directly, like this?  (Factoring
out add_pending_sha1 should be a separate patch..)

René

---
 builtin/checkout.c |   39 ++++++++++++---------------------------
 revision.c         |   11 ++++++++---
 revision.h         |    1 +
 3 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/builtin/checkout.c b/builtin/checkout.c
index 5e356a6..84e0cdc 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -588,24 +588,11 @@ static void update_refs_for_switch(struct checkout_opts *opts,
 		report_tracking(new);
 }
 
-struct rev_list_args {
-	int argc;
-	int alloc;
-	const char **argv;
-};
-
-static void add_one_rev_list_arg(struct rev_list_args *args, const char *s)
-{
-	ALLOC_GROW(args->argv, args->argc + 1, args->alloc);
-	args->argv[args->argc++] = s;
-}
-
-static int add_one_ref_to_rev_list_arg(const char *refname,
-				       const unsigned char *sha1,
-				       int flags,
-				       void *cb_data)
+static int add_pending_uninteresting_ref(const char *refname,
+					 const unsigned char *sha1,
+					 int flags, void *cb_data)
 {
-	add_one_rev_list_arg(cb_data, refname);
+	add_pending_sha1(cb_data, refname, sha1, flags | UNINTERESTING);
 	return 0;
 }
 
@@ -685,19 +672,17 @@ static void suggest_reattach(struct commit *commit, struct rev_info *revs)
  */
 static void orphaned_commit_warning(struct commit *commit)
 {
-	struct rev_list_args args = { 0, 0, NULL };
 	struct rev_info revs;
-
-	add_one_rev_list_arg(&args, "(internal)");
-	add_one_rev_list_arg(&args, sha1_to_hex(commit->object.sha1));
-	add_one_rev_list_arg(&args, "--not");
-	for_each_ref(add_one_ref_to_rev_list_arg, &args);
-	add_one_rev_list_arg(&args, "--");
-	add_one_rev_list_arg(&args, NULL);
+	struct object *object = &commit->object;
 
 	init_revisions(&revs, NULL);
-	if (setup_revisions(args.argc - 1, args.argv, &revs, NULL) != 1)
-		die(_("internal error: only -- alone should have been left"));
+	setup_revisions(0, NULL, &revs, NULL);
+
+	object->flags &= ~UNINTERESTING;
+	add_pending_object(&revs, object, sha1_to_hex(object->sha1));
+
+	for_each_ref(add_pending_uninteresting_ref, &revs);
+
 	if (prepare_revision_walk(&revs))
 		die(_("internal error in revision walk"));
 	if (!(commit->object.flags & UNINTERESTING))
diff --git a/revision.c b/revision.c
index c46cfaa..2e8aa33 100644
--- a/revision.c
+++ b/revision.c
@@ -185,6 +185,13 @@ static struct object *get_reference(struct rev_info *revs, const char *name, con
 	return object;
 }
 
+void add_pending_sha1(struct rev_info *revs, const char *name,
+		      const unsigned char *sha1, unsigned int flags)
+{
+	struct object *object = get_reference(revs, name, sha1, flags);
+	add_pending_object(revs, object, name);
+}
+
 static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name)
 {
 	unsigned long flags = object->flags;
@@ -832,9 +839,7 @@ struct all_refs_cb {
 static int handle_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data)
 {
 	struct all_refs_cb *cb = cb_data;
-	struct object *object = get_reference(cb->all_revs, path, sha1,
-					      cb->all_flags);
-	add_pending_object(cb->all_revs, object, path);
+	add_pending_sha1(cb->all_revs, path, sha1, cb->all_flags);
 	return 0;
 }
 
diff --git a/revision.h b/revision.h
index 3d64ada..4541265 100644
--- a/revision.h
+++ b/revision.h
@@ -191,6 +191,7 @@ extern void add_object(struct object *obj,
 		       const char *name);
 
 extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name);
+extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags);
 
 extern void add_head_to_pending(struct rev_info *);
 
-- 
1.7.7.rc1

* Re: Git is not scalable with too many refs/*
  2011-09-29 18:27                                                 ` René Scharfe
@ 2011-09-29 19:10                                                   ` Junio C Hamano
  2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
  2011-09-29 20:44                                                     ` Git is not scalable with too many refs/* Martin Fick
  2011-09-29 19:10                                                   ` Julian Phillips
  2011-09-29 20:11                                                   ` Martin Fick
  2 siblings, 2 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-09-29 19:10 UTC (permalink / raw)
  To: René Scharfe
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> Hmm.  Can we get rid of the multiple ref lookups fixed by the above
> *and* the overhead of dealing with a textual argument list at the same
> time by calling add_pending_object directly, like this?  (Factoring
> out add_pending_sha1 should be a separate patch..)

I haven't tested it or thought it through, but it smells right ;-)

Also we would probably want to drop "next" field from "struct ref_list"
(i.e. making it not a linear list), introduce a new "struct ref_array"
that is a ALLOC_GROW() managed array of pointers to "struct ref_list",
make get_packed_refs() and get_loose_refs() return a pointer to "struct
ref_array" after sorting the array contents by "name". Then resolve_ref()
can do a bisection search in the packed refs array when it does not find a
loose ref.

* Re: Git is not scalable with too many refs/*
  2011-09-29 18:27                                                 ` René Scharfe
  2011-09-29 19:10                                                   ` Junio C Hamano
@ 2011-09-29 19:10                                                   ` Julian Phillips
  2011-09-29 20:11                                                   ` Martin Fick
  2 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29 19:10 UTC (permalink / raw)
  To: René Scharfe
  Cc: Martin Fick, Christian Couder, git, Christian Couder,
	Thomas Rast, Junio C Hamano

On Thu, 29 Sep 2011 20:27:43 +0200, René Scharfe wrote:
> On 29.09.2011 04:19, Julian Phillips wrote:
>> Does the following help?
>>
>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>> index 5e356a6..f0f4ca1 100644
>> --- a/builtin/checkout.c
>> +++ b/builtin/checkout.c
>> @@ -605,7 +605,7 @@ static int add_one_ref_to_rev_list_arg(const 
>> char
>> *refname,
>>                                        int flags,
>>                                        void *cb_data)
>>  {
>> -       add_one_rev_list_arg(cb_data, refname);
>> +       add_one_rev_list_arg(cb_data, strdup(sha1_to_hex(sha1)));
>>         return 0;
>>  }
>
> Hmm.  Can we get rid of the multiple ref lookups fixed by the above
> *and* the overhead of dealing with a textual argument list at the 
> same
> time by calling add_pending_object directly, like this?  (Factoring
> out add_pending_sha1 should be a separate patch..)

Seems like a good idea.  I get the same sort of times as with my patch, 
but it makes the code _feel_ much nicer (and slightly smaller).  Mine 
was definitely more of a "it's 2am, but I think the problem is here" 
type of patch ;)

-- 
Julian

* Re: Git is not scalable with too many refs/*
  2011-09-29 18:27                                                 ` René Scharfe
  2011-09-29 19:10                                                   ` Junio C Hamano
  2011-09-29 19:10                                                   ` Julian Phillips
@ 2011-09-29 20:11                                                   ` Martin Fick
  2011-09-30  9:12                                                     ` René Scharfe
  2 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-29 20:11 UTC (permalink / raw)
  To: René Scharfe
  Cc: Julian Phillips, Christian Couder, git, Christian Couder,
	Thomas Rast, Junio C Hamano

On Thursday, September 29, 2011 12:27:43 pm René Scharfe 
wrote:
> Hmm.  Can we get rid of the multiple ref lookups fixed by
> the above *and* the overhead of dealing with a textual
> argument list at the same time by calling
> add_pending_object directly, like this?  (Factoring out
> add_pending_sha1 should be a separate patch..)
 
René,

Your patch works well for me.  It achieves about the same 
gains as Julian's patch. Thanks!

Once all the performance fixes for large ref counts get 
merged, it sure should help the Gerrit community.  I wonder 
how it might impact Gerrit mirroring...

-Martin


Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

* Re: Git is not scalable with too many refs/*
  2011-09-29 19:10                                                   ` Junio C Hamano
  2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
@ 2011-09-29 20:44                                                     ` Martin Fick
  1 sibling, 0 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-29 20:44 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: René Scharfe, Julian Phillips, Christian Couder, git,
	Christian Couder, Thomas Rast

On Thursday, September 29, 2011 01:10:06 pm Junio C Hamano 
wrote:
> Also we would probably want to drop "next" field from
> "struct ref_list" (i.e. making it not a linear list),
> introduce a new "struct ref_array" that is a
> ALLOC_GROW() managed array of pointers to "struct
> ref_list", make get_packed_refs() and get_loose_refs()
> return a pointer to "struct ref_array" after sorting the
> array contents by "name". Then resolve_ref() can do a
> bisection search in the packed refs array when it does
> not find a loose ref.

That would be nice, and I suspect it would shave a bit more 
off the orphan check and possibly even a fetch.  If I 
understood all that, I might try.  But I might need some 
hand holding, my C is pretty rusty... Is there a bisection 
search library in git already to use?  Is there a git 
sorting library for the array also?

-Martin


-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

* Re: [PATCH] refs: Use binary search to lookup refs faster
  2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
@ 2011-09-29 21:57                                                       ` Junio C Hamano
  2011-09-29 22:04                                                       ` [PATCH v2] " Julian Phillips
  2011-09-29 22:06                                                       ` [PATCH] " Junio C Hamano
  2 siblings, 0 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-09-29 21:57 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast

Julian Phillips <julian@quantumfyre.co.uk> writes:

> Currently we linearly search through lists of refs when we need to
> find a specific ref.  This can be very slow if we need to lookup a
> large number of refs.  By changing to a binary search we can make this
> faster.
>
> In order to be able to use a binary search we need to change from
> using linked lists to arrays, which we can manage using ALLOC_GROW.
>
> We can now also use the standard library qsort function to sort the
> refs arrays.
>
> Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
> ---
>
> Something like this?
>
>  refs.c |  328 ++++++++++++++++++++++++++--------------------------------------
>  1 files changed, 131 insertions(+), 197 deletions(-)
>
> diff --git a/refs.c b/refs.c
> index a49ff74..e411bea 100644
> --- a/refs.c
> +++ b/refs.c
> @@ -8,14 +8,18 @@
>  #define REF_KNOWS_PEELED 04
>  #define REF_BROKEN 010
>  
> -struct ref_list {
> -	struct ref_list *next;
> +struct ref_entry {
>  	unsigned char flag; /* ISSYMREF? ISPACKED? */
>  	unsigned char sha1[20];
>  	unsigned char peeled[20];
>  	char name[FLEX_ARRAY];
>  };
>  
> +struct ref_array {
> +	int nr, alloc;
> +	struct ref_entry **refs;
> +};
> +

Yeah, I can say "something like that" without looking at the rest of the
patch ;-) The rest should naturally follow from the above data structures.

* [PATCH v2] refs: Use binary search to lookup refs faster
  2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
  2011-09-29 21:57                                                       ` Junio C Hamano
@ 2011-09-29 22:04                                                       ` Julian Phillips
  2011-09-29 22:06                                                       ` [PATCH] " Junio C Hamano
  2 siblings, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29 22:04 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

Currently we linearly search through lists of refs when we need to
find a specific ref.  This can be very slow if we need to lookup a
large number of refs.  By changing to a binary search we can make this
faster.

In order to be able to use a binary search we need to change from
using linked lists to arrays, which we can manage using ALLOC_GROW.

We can now also use the standard library qsort function to sort the
refs arrays.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

Previous version caused a regression in the test suite ... :$

 refs.c |  329 ++++++++++++++++++++++++++--------------------------------------
 1 files changed, 133 insertions(+), 196 deletions(-)

diff --git a/refs.c b/refs.c
index a49ff74..35bba97 100644
--- a/refs.c
+++ b/refs.c
@@ -8,14 +8,18 @@
 #define REF_KNOWS_PEELED 04
 #define REF_BROKEN 010
 
-struct ref_list {
-	struct ref_list *next;
+struct ref_entry {
 	unsigned char flag; /* ISSYMREF? ISPACKED? */
 	unsigned char sha1[20];
 	unsigned char peeled[20];
 	char name[FLEX_ARRAY];
 };
 
+struct ref_array {
+	int nr, alloc;
+	struct ref_entry **refs;
+};
+
 static const char *parse_ref_line(char *line, unsigned char *sha1)
 {
 	/*
@@ -44,108 +48,58 @@ static const char *parse_ref_line(char *line, unsigned char *sha1)
 	return line;
 }
 
-static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
-				int flag, struct ref_list *list,
-				struct ref_list **new_entry)
+static void add_ref(const char *name, const unsigned char *sha1,
+		    int flag, struct ref_array *refs,
+		    struct ref_entry **new_entry)
 {
 	int len;
-	struct ref_list *entry;
+	struct ref_entry *entry;
 
 	/* Allocate it and add it in.. */
 	len = strlen(name) + 1;
-	entry = xmalloc(sizeof(struct ref_list) + len);
+	entry = xmalloc(sizeof(struct ref) + len);
 	hashcpy(entry->sha1, sha1);
 	hashclr(entry->peeled);
 	memcpy(entry->name, name, len);
 	entry->flag = flag;
-	entry->next = list;
 	if (new_entry)
 		*new_entry = entry;
-	return entry;
+	ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc);
+	refs->refs[refs->nr++] = entry;
 }
 
-/* merge sort the ref list */
-static struct ref_list *sort_ref_list(struct ref_list *list)
+static int ref_entry_cmp(const void *a, const void *b)
 {
-	int psize, qsize, last_merge_count, cmp;
-	struct ref_list *p, *q, *l, *e;
-	struct ref_list *new_list = list;
-	int k = 1;
-	int merge_count = 0;
-
-	if (!list)
-		return list;
-
-	do {
-		last_merge_count = merge_count;
-		merge_count = 0;
-
-		psize = 0;
+	struct ref_entry *one = *(struct ref_entry **)a;
+	struct ref_entry *two = *(struct ref_entry **)b;
+	return strcmp(one->name, two->name);
+}
 
-		p = new_list;
-		q = new_list;
-		new_list = NULL;
-		l = NULL;
+static void sort_ref_array(struct ref_array *array)
+{
+	qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
+}
 
-		while (p) {
-			merge_count++;
+static struct ref_entry *search_ref_array(struct ref_array *array, const char *name)
+{
+	struct ref_entry *e, **r;
+	int len;
 
-			while (psize < k && q->next) {
-				q = q->next;
-				psize++;
-			}
-			qsize = k;
-
-			while ((psize > 0) || (qsize > 0 && q)) {
-				if (qsize == 0 || !q) {
-					e = p;
-					p = p->next;
-					psize--;
-				} else if (psize == 0) {
-					e = q;
-					q = q->next;
-					qsize--;
-				} else {
-					cmp = strcmp(q->name, p->name);
-					if (cmp < 0) {
-						e = q;
-						q = q->next;
-						qsize--;
-					} else if (cmp > 0) {
-						e = p;
-						p = p->next;
-						psize--;
-					} else {
-						if (hashcmp(q->sha1, p->sha1))
-							die("Duplicated ref, and SHA1s don't match: %s",
-							    q->name);
-						warning("Duplicated ref: %s", q->name);
-						e = q;
-						q = q->next;
-						qsize--;
-						free(e);
-						e = p;
-						p = p->next;
-						psize--;
-					}
-				}
+	if (name == NULL)
+		return NULL;
 
-				e->next = NULL;
+	len = strlen(name) + 1;
+	e = xmalloc(sizeof(struct ref) + len);
+	memcpy(e->name, name, len);
 
-				if (l)
-					l->next = e;
-				if (!new_list)
-					new_list = e;
-				l = e;
-			}
+	r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
 
-			p = q;
-		};
+	free(e);
 
-		k = k * 2;
-	} while ((last_merge_count != merge_count) || (last_merge_count != 1));
+	if (r == NULL)
+		return NULL;
 
-	return new_list;
+	return *r;
 }
 
 /*
@@ -155,38 +109,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list)
 static struct cached_refs {
 	char did_loose;
 	char did_packed;
-	struct ref_list *loose;
-	struct ref_list *packed;
+	struct ref_array loose;
+	struct ref_array packed;
 } cached_refs, submodule_refs;
-static struct ref_list *current_ref;
+static struct ref_entry *current_ref;
 
-static struct ref_list *extra_refs;
+static struct ref_array extra_refs;
 
-static void free_ref_list(struct ref_list *list)
+static void free_ref_array(struct ref_array *array)
 {
-	struct ref_list *next;
-	for ( ; list; list = next) {
-		next = list->next;
-		free(list);
-	}
+	int i;
+	for (i = 0; i < array->nr; i++)
+		free(array->refs[i]);
+	free(array->refs);
+	array->nr = array->alloc = 0;
+	array->refs = NULL;
 }
 
 static void invalidate_cached_refs(void)
 {
 	struct cached_refs *ca = &cached_refs;
 
-	if (ca->did_loose && ca->loose)
-		free_ref_list(ca->loose);
-	if (ca->did_packed && ca->packed)
-		free_ref_list(ca->packed);
-	ca->loose = ca->packed = NULL;
+	if (ca->did_loose)
+		free_ref_array(&ca->loose);
+	if (ca->did_packed)
+		free_ref_array(&ca->packed);
 	ca->did_loose = ca->did_packed = 0;
 }
 
 static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 {
-	struct ref_list *list = NULL;
-	struct ref_list *last = NULL;
+	struct ref_entry *last = NULL;
 	char refline[PATH_MAX];
 	int flag = REF_ISPACKED;
 
@@ -205,7 +158,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 
 		name = parse_ref_line(refline, sha1);
 		if (name) {
-			list = add_ref(name, sha1, flag, list, &last);
+			add_ref(name, sha1, flag, &cached_refs->packed, &last);
 			continue;
 		}
 		if (last &&
@@ -215,21 +168,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 		    !get_sha1_hex(refline + 1, sha1))
 			hashcpy(last->peeled, sha1);
 	}
-	cached_refs->packed = sort_ref_list(list);
+	sort_ref_array(&cached_refs->packed);
 }
 
 void add_extra_ref(const char *name, const unsigned char *sha1, int flag)
 {
-	extra_refs = add_ref(name, sha1, flag, extra_refs, NULL);
+	add_ref(name, sha1, flag, &extra_refs, NULL);
 }
 
 void clear_extra_refs(void)
 {
-	free_ref_list(extra_refs);
-	extra_refs = NULL;
+	free_ref_array(&extra_refs);
 }
 
-static struct ref_list *get_packed_refs(const char *submodule)
+static struct ref_array *get_packed_refs(const char *submodule)
 {
 	const char *packed_refs_file;
 	struct cached_refs *refs;
@@ -237,7 +189,7 @@ static struct ref_list *get_packed_refs(const char *submodule)
 	if (submodule) {
 		packed_refs_file = git_path_submodule(submodule, "packed-refs");
 		refs = &submodule_refs;
-		free_ref_list(refs->packed);
+		free_ref_array(&refs->packed);
 	} else {
 		packed_refs_file = git_path("packed-refs");
 		refs = &cached_refs;
@@ -245,18 +197,17 @@ static struct ref_list *get_packed_refs(const char *submodule)
 
 	if (!refs->did_packed || submodule) {
 		FILE *f = fopen(packed_refs_file, "r");
-		refs->packed = NULL;
 		if (f) {
 			read_packed_refs(f, refs);
 			fclose(f);
 		}
 		refs->did_packed = 1;
 	}
-	return refs->packed;
+	return &refs->packed;
 }
 
-static struct ref_list *get_ref_dir(const char *submodule, const char *base,
-				    struct ref_list *list)
+static void get_ref_dir(const char *submodule, const char *base,
+			struct ref_array *array)
 {
 	DIR *dir;
 	const char *path;
@@ -299,7 +250,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 			if (stat(refdir, &st) < 0)
 				continue;
 			if (S_ISDIR(st.st_mode)) {
-				list = get_ref_dir(submodule, ref, list);
+				get_ref_dir(submodule, ref, array);
 				continue;
 			}
 			if (submodule) {
@@ -314,12 +265,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 					hashclr(sha1);
 					flag |= REF_BROKEN;
 				}
-			list = add_ref(ref, sha1, flag, list, NULL);
+			add_ref(ref, sha1, flag, array, NULL);
 		}
 		free(ref);
 		closedir(dir);
 	}
-	return list;
 }
 
 struct warn_if_dangling_data {
@@ -356,21 +306,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname)
 	for_each_rawref(warn_if_dangling_symref, &data);
 }
 
-static struct ref_list *get_loose_refs(const char *submodule)
+static struct ref_array *get_loose_refs(const char *submodule)
 {
 	if (submodule) {
-		free_ref_list(submodule_refs.loose);
-		submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
-		submodule_refs.loose = sort_ref_list(submodule_refs.loose);
-		return submodule_refs.loose;
+		free_ref_array(&submodule_refs.loose);
+		get_ref_dir(submodule, "refs", &submodule_refs.loose);
+		sort_ref_array(&submodule_refs.loose);
+		return &submodule_refs.loose;
 	}
 
 	if (!cached_refs.did_loose) {
-		cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
-		cached_refs.loose = sort_ref_list(cached_refs.loose);
+		get_ref_dir(NULL, "refs", &cached_refs.loose);
+		sort_ref_array(&cached_refs.loose);
 		cached_refs.did_loose = 1;
 	}
-	return cached_refs.loose;
+	return &cached_refs.loose;
 }
 
 /* We allow "recursive" symbolic refs. Only within reason, though */
@@ -381,8 +331,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 {
 	FILE *f;
 	struct cached_refs refs;
-	struct ref_list *ref;
-	int retval;
+	struct ref_entry *ref;
+	int retval = -1;
 
 	strcpy(name + pathlen, "packed-refs");
 	f = fopen(name, "r");
@@ -390,17 +340,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 		return -1;
 	read_packed_refs(f, &refs);
 	fclose(f);
-	ref = refs.packed;
-	retval = -1;
-	while (ref) {
-		if (!strcmp(ref->name, refname)) {
-			retval = 0;
-			memcpy(result, ref->sha1, 20);
-			break;
-		}
-		ref = ref->next;
+	ref = search_ref_array(&refs.packed, refname);
+	if (ref != NULL) {
+		memcpy(result, ref->sha1, 20);
+		retval = 0;
 	}
-	free_ref_list(refs.packed);
+	free_ref_array(&refs.packed);
 	return retval;
 }
 
@@ -501,15 +446,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int *
 		git_snpath(path, sizeof(path), "%s", ref);
 		/* Special case: non-existing file. */
 		if (lstat(path, &st) < 0) {
-			struct ref_list *list = get_packed_refs(NULL);
-			while (list) {
-				if (!strcmp(ref, list->name)) {
-					hashcpy(sha1, list->sha1);
-					if (flag)
-						*flag |= REF_ISPACKED;
-					return ref;
-				}
-				list = list->next;
+			struct ref_array *packed = get_packed_refs(NULL);
+			struct ref_entry *r = search_ref_array(packed, ref);
+			if (r != NULL) {
+				hashcpy(sha1, r->sha1);
+				if (flag)
+					*flag |= REF_ISPACKED;
+				return ref;
 			}
 			if (reading || errno != ENOENT)
 				return NULL;
@@ -584,7 +527,7 @@ int read_ref(const char *ref, unsigned char *sha1)
 
 #define DO_FOR_EACH_INCLUDE_BROKEN 01
 static int do_one_ref(const char *base, each_ref_fn fn, int trim,
-		      int flags, void *cb_data, struct ref_list *entry)
+		      int flags, void *cb_data, struct ref_entry *entry)
 {
 	if (prefixcmp(entry->name, base))
 		return 0;
@@ -630,18 +573,12 @@ int peel_ref(const char *ref, unsigned char *sha1)
 		return -1;
 
 	if ((flag & REF_ISPACKED)) {
-		struct ref_list *list = get_packed_refs(NULL);
+		struct ref_array *array = get_packed_refs(NULL);
+		struct ref_entry *r = search_ref_array(array, ref);
 
-		while (list) {
-			if (!strcmp(list->name, ref)) {
-				if (list->flag & REF_KNOWS_PEELED) {
-					hashcpy(sha1, list->peeled);
-					return 0;
-				}
-				/* older pack-refs did not leave peeled ones */
-				break;
-			}
-			list = list->next;
+		if (r != NULL && r->flag & REF_KNOWS_PEELED) {
+			hashcpy(sha1, r->peeled);
+			return 0;
 		}
 	}
 
@@ -660,36 +597,39 @@ fallback:
 static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn,
 			   int trim, int flags, void *cb_data)
 {
-	int retval = 0;
-	struct ref_list *packed = get_packed_refs(submodule);
-	struct ref_list *loose = get_loose_refs(submodule);
+	int retval = 0, i, p = 0, l = 0;
+	struct ref_array *packed = get_packed_refs(submodule);
+	struct ref_array *loose = get_loose_refs(submodule);
 
-	struct ref_list *extra;
+	struct ref_array *extra = &extra_refs;
 
-	for (extra = extra_refs; extra; extra = extra->next)
-		retval = do_one_ref(base, fn, trim, flags, cb_data, extra);
+	for (i = 0; i < extra->nr; i++)
+		retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]);
 
-	while (packed && loose) {
-		struct ref_list *entry;
-		int cmp = strcmp(packed->name, loose->name);
+	while (p < packed->nr && l < loose->nr) {
+		struct ref_entry *entry;
+		int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name);
 		if (!cmp) {
-			packed = packed->next;
+			p++;
 			continue;
 		}
 		if (cmp > 0) {
-			entry = loose;
-			loose = loose->next;
+			entry = loose->refs[l++];
 		} else {
-			entry = packed;
-			packed = packed->next;
+			entry = packed->refs[p++];
 		}
 		retval = do_one_ref(base, fn, trim, flags, cb_data, entry);
 		if (retval)
 			goto end_each;
 	}
 
-	for (packed = packed ? packed : loose; packed; packed = packed->next) {
-		retval = do_one_ref(base, fn, trim, flags, cb_data, packed);
+	if (l < loose->nr) {
+		p = l;
+		packed = loose;
+	}
+
+	for (; p < packed->nr; p++) {
+		retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]);
 		if (retval)
 			goto end_each;
 	}
@@ -1005,24 +945,24 @@ static int remove_empty_directories(const char *file)
 }
 
 static int is_refname_available(const char *ref, const char *oldref,
-				struct ref_list *list, int quiet)
-{
-	int namlen = strlen(ref); /* e.g. 'foo/bar' */
-	while (list) {
-		/* list->name could be 'foo' or 'foo/bar/baz' */
-		if (!oldref || strcmp(oldref, list->name)) {
-			int len = strlen(list->name);
+				struct ref_array *array, int quiet)
+{
+	int i, namlen = strlen(ref); /* e.g. 'foo/bar' */
+	for (i = 0; i < array->nr; i++ ) {
+		struct ref_entry *entry = array->refs[i];
+		/* entry->name could be 'foo' or 'foo/bar/baz' */
+		if (!oldref || strcmp(oldref, entry->name)) {
+			int len = strlen(entry->name);
 			int cmplen = (namlen < len) ? namlen : len;
-			const char *lead = (namlen < len) ? list->name : ref;
-			if (!strncmp(ref, list->name, cmplen) &&
+			const char *lead = (namlen < len) ? entry->name : ref;
+			if (!strncmp(ref, entry->name, cmplen) &&
 			    lead[cmplen] == '/') {
 				if (!quiet)
 					error("'%s' exists; cannot create '%s'",
-					      list->name, ref);
+					      entry->name, ref);
 				return 0;
 			}
 		}
-		list = list->next;
 	}
 	return 1;
 }
@@ -1129,18 +1069,13 @@ static struct lock_file packlock;
 
 static int repack_without_ref(const char *refname)
 {
-	struct ref_list *list, *packed_ref_list;
-	int fd;
-	int found = 0;
+	struct ref_array *packed;
+	struct ref_entry *ref;
+	int fd, i;
 
-	packed_ref_list = get_packed_refs(NULL);
-	for (list = packed_ref_list; list; list = list->next) {
-		if (!strcmp(refname, list->name)) {
-			found = 1;
-			break;
-		}
-	}
-	if (!found)
+	packed = get_packed_refs(NULL);
+	ref = search_ref_array(packed, refname);
+	if (ref == NULL)
 		return 0;
 	fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0);
 	if (fd < 0) {
@@ -1148,17 +1083,19 @@ static int repack_without_ref(const char *refname)
 		return error("cannot delete '%s' from packed refs", refname);
 	}
 
-	for (list = packed_ref_list; list; list = list->next) {
+	for (i = 0; i < packed->nr; i++) {
 		char line[PATH_MAX + 100];
 		int len;
 
-		if (!strcmp(refname, list->name))
+		ref = packed->refs[i];
+
+		if (!strcmp(refname, ref->name))
 			continue;
 		len = snprintf(line, sizeof(line), "%s %s\n",
-			       sha1_to_hex(list->sha1), list->name);
+			       sha1_to_hex(ref->sha1), ref->name);
 		/* this should not happen but just being defensive */
 		if (len > sizeof(line))
-			die("too long a refname '%s'", list->name);
+			die("too long a refname '%s'", ref->name);
 		write_or_die(fd, line, len);
 	}
 	return commit_lock_file(&packlock);
-- 
1.7.6.1

* Re: [PATCH] refs: Use binary search to lookup refs faster
  2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
  2011-09-29 21:57                                                       ` Junio C Hamano
  2011-09-29 22:04                                                       ` [PATCH v2] " Julian Phillips
@ 2011-09-29 22:06                                                       ` Junio C Hamano
  2011-09-29 22:11                                                         ` [PATCH v3] " Julian Phillips
  2 siblings, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-09-29 22:06 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast

Julian Phillips <julian@quantumfyre.co.uk> writes:

> +static void add_ref(const char *name, const unsigned char *sha1,
> +		    int flag, struct ref_array *refs,
> +		    struct ref_entry **new_entry)
>  {
>  	int len;
> -	struct ref_list *entry;
> +	struct ref_entry *entry;
>  
>  	/* Allocate it and add it in.. */
>  	len = strlen(name) + 1;
> -	entry = xmalloc(sizeof(struct ref_list) + len);
> +	entry = xmalloc(sizeof(struct ref) + len);

This should be sizeof(struct ref_entry), no?  There is another such
misallocation in search_ref_array() where it prepares a temporary.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-29 22:06                                                       ` [PATCH] " Junio C Hamano
@ 2011-09-29 22:11                                                         ` Julian Phillips
  2011-09-29 23:48                                                           ` Junio C Hamano
  2011-09-30  1:13                                                           ` Martin Fick
  0 siblings, 2 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-29 22:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast

Currently we linearly search through lists of refs when we need to
find a specific ref.  This can be very slow if we need to lookup a
large number of refs.  By changing to a binary search we can make this
faster.

In order to be able to use a binary search we need to change from
using linked lists to arrays, which we can manage using ALLOC_GROW.

We can now also use the standard library qsort function to sort the
refs arrays.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

On Thu, 29 Sep 2011 15:06:03 -0700, Junio C Hamano wrote:
> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> +static void add_ref(const char *name, const unsigned char *sha1,
>> +		    int flag, struct ref_array *refs,
>> +		    struct ref_entry **new_entry)
>>  {
>>  	int len;
>> -	struct ref_list *entry;
>> +	struct ref_entry *entry;
>>
>>  	/* Allocate it and add it in.. */
>>  	len = strlen(name) + 1;
>> -	entry = xmalloc(sizeof(struct ref_list) + len);
>> +	entry = xmalloc(sizeof(struct ref) + len);
>
> This should be sizeof(struct ref_entry), no?  There is another such
> misallocation in search_ref_array() where it prepares a temporary.

Indeed, thanks.

Looks like two instances of not noticing that "struct ref" already existed
managed to survive.  Drat.  Of course since "struct ref" is bigger than "struct
ref_entry", everything worked fine ... so no failed tests to tip me off.

 refs.c |  329 ++++++++++++++++++++++++++--------------------------------------
 1 files changed, 133 insertions(+), 196 deletions(-)

diff --git a/refs.c b/refs.c
index a49ff74..4c01d79 100644
--- a/refs.c
+++ b/refs.c
@@ -8,14 +8,18 @@
 #define REF_KNOWS_PEELED 04
 #define REF_BROKEN 010
 
-struct ref_list {
-	struct ref_list *next;
+struct ref_entry {
 	unsigned char flag; /* ISSYMREF? ISPACKED? */
 	unsigned char sha1[20];
 	unsigned char peeled[20];
 	char name[FLEX_ARRAY];
 };
 
+struct ref_array {
+	int nr, alloc;
+	struct ref_entry **refs;
+};
+
 static const char *parse_ref_line(char *line, unsigned char *sha1)
 {
 	/*
@@ -44,108 +48,58 @@ static const char *parse_ref_line(char *line, unsigned char *sha1)
 	return line;
 }
 
-static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
-				int flag, struct ref_list *list,
-				struct ref_list **new_entry)
+static void add_ref(const char *name, const unsigned char *sha1,
+		    int flag, struct ref_array *refs,
+		    struct ref_entry **new_entry)
 {
 	int len;
-	struct ref_list *entry;
+	struct ref_entry *entry;
 
 	/* Allocate it and add it in.. */
 	len = strlen(name) + 1;
-	entry = xmalloc(sizeof(struct ref_list) + len);
+	entry = xmalloc(sizeof(struct ref_entry) + len);
 	hashcpy(entry->sha1, sha1);
 	hashclr(entry->peeled);
 	memcpy(entry->name, name, len);
 	entry->flag = flag;
-	entry->next = list;
 	if (new_entry)
 		*new_entry = entry;
-	return entry;
+	ALLOC_GROW(refs->refs, refs->nr + 1, refs->alloc);
+	refs->refs[refs->nr++] = entry;
 }
 
-/* merge sort the ref list */
-static struct ref_list *sort_ref_list(struct ref_list *list)
+static int ref_entry_cmp(const void *a, const void *b)
 {
-	int psize, qsize, last_merge_count, cmp;
-	struct ref_list *p, *q, *l, *e;
-	struct ref_list *new_list = list;
-	int k = 1;
-	int merge_count = 0;
-
-	if (!list)
-		return list;
-
-	do {
-		last_merge_count = merge_count;
-		merge_count = 0;
-
-		psize = 0;
+	struct ref_entry *one = *(struct ref_entry **)a;
+	struct ref_entry *two = *(struct ref_entry **)b;
+	return strcmp(one->name, two->name);
+}
 
-		p = new_list;
-		q = new_list;
-		new_list = NULL;
-		l = NULL;
+static void sort_ref_array(struct ref_array *array)
+{
+	qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
+}
 
-		while (p) {
-			merge_count++;
+static struct ref_entry *search_ref_array(struct ref_array *array, const char *name)
+{
+	struct ref_entry *e, **r;
+	int len;
 
-			while (psize < k && q->next) {
-				q = q->next;
-				psize++;
-			}
-			qsize = k;
-
-			while ((psize > 0) || (qsize > 0 && q)) {
-				if (qsize == 0 || !q) {
-					e = p;
-					p = p->next;
-					psize--;
-				} else if (psize == 0) {
-					e = q;
-					q = q->next;
-					qsize--;
-				} else {
-					cmp = strcmp(q->name, p->name);
-					if (cmp < 0) {
-						e = q;
-						q = q->next;
-						qsize--;
-					} else if (cmp > 0) {
-						e = p;
-						p = p->next;
-						psize--;
-					} else {
-						if (hashcmp(q->sha1, p->sha1))
-							die("Duplicated ref, and SHA1s don't match: %s",
-							    q->name);
-						warning("Duplicated ref: %s", q->name);
-						e = q;
-						q = q->next;
-						qsize--;
-						free(e);
-						e = p;
-						p = p->next;
-						psize--;
-					}
-				}
+	if (name == NULL)
+		return NULL;
 
-				e->next = NULL;
+	len = strlen(name) + 1;
+	e = xmalloc(sizeof(struct ref_entry) + len);
+	memcpy(e->name, name, len);
 
-				if (l)
-					l->next = e;
-				if (!new_list)
-					new_list = e;
-				l = e;
-			}
+	r = bsearch(&e, array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
 
-			p = q;
-		};
+	free(e);
 
-		k = k * 2;
-	} while ((last_merge_count != merge_count) || (last_merge_count != 1));
+	if (r == NULL)
+		return NULL;
 
-	return new_list;
+	return *r;
 }
 
 /*
@@ -155,38 +109,37 @@ static struct ref_list *sort_ref_list(struct ref_list *list)
 static struct cached_refs {
 	char did_loose;
 	char did_packed;
-	struct ref_list *loose;
-	struct ref_list *packed;
+	struct ref_array loose;
+	struct ref_array packed;
 } cached_refs, submodule_refs;
-static struct ref_list *current_ref;
+static struct ref_entry *current_ref;
 
-static struct ref_list *extra_refs;
+static struct ref_array extra_refs;
 
-static void free_ref_list(struct ref_list *list)
+static void free_ref_array(struct ref_array *array)
 {
-	struct ref_list *next;
-	for ( ; list; list = next) {
-		next = list->next;
-		free(list);
-	}
+	int i;
+	for (i = 0; i < array->nr; i++)
+		free(array->refs[i]);
+	free(array->refs);
+	array->nr = array->alloc = 0;
+	array->refs = NULL;
 }
 
 static void invalidate_cached_refs(void)
 {
 	struct cached_refs *ca = &cached_refs;
 
-	if (ca->did_loose && ca->loose)
-		free_ref_list(ca->loose);
-	if (ca->did_packed && ca->packed)
-		free_ref_list(ca->packed);
-	ca->loose = ca->packed = NULL;
+	if (ca->did_loose)
+		free_ref_array(&ca->loose);
+	if (ca->did_packed)
+		free_ref_array(&ca->packed);
 	ca->did_loose = ca->did_packed = 0;
 }
 
 static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 {
-	struct ref_list *list = NULL;
-	struct ref_list *last = NULL;
+	struct ref_entry *last = NULL;
 	char refline[PATH_MAX];
 	int flag = REF_ISPACKED;
 
@@ -205,7 +158,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 
 		name = parse_ref_line(refline, sha1);
 		if (name) {
-			list = add_ref(name, sha1, flag, list, &last);
+			add_ref(name, sha1, flag, &cached_refs->packed, &last);
 			continue;
 		}
 		if (last &&
@@ -215,21 +168,20 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 		    !get_sha1_hex(refline + 1, sha1))
 			hashcpy(last->peeled, sha1);
 	}
-	cached_refs->packed = sort_ref_list(list);
+	sort_ref_array(&cached_refs->packed);
 }
 
 void add_extra_ref(const char *name, const unsigned char *sha1, int flag)
 {
-	extra_refs = add_ref(name, sha1, flag, extra_refs, NULL);
+	add_ref(name, sha1, flag, &extra_refs, NULL);
 }
 
 void clear_extra_refs(void)
 {
-	free_ref_list(extra_refs);
-	extra_refs = NULL;
+	free_ref_array(&extra_refs);
 }
 
-static struct ref_list *get_packed_refs(const char *submodule)
+static struct ref_array *get_packed_refs(const char *submodule)
 {
 	const char *packed_refs_file;
 	struct cached_refs *refs;
@@ -237,7 +189,7 @@ static struct ref_list *get_packed_refs(const char *submodule)
 	if (submodule) {
 		packed_refs_file = git_path_submodule(submodule, "packed-refs");
 		refs = &submodule_refs;
-		free_ref_list(refs->packed);
+		free_ref_array(&refs->packed);
 	} else {
 		packed_refs_file = git_path("packed-refs");
 		refs = &cached_refs;
@@ -245,18 +197,17 @@ static struct ref_list *get_packed_refs(const char *submodule)
 
 	if (!refs->did_packed || submodule) {
 		FILE *f = fopen(packed_refs_file, "r");
-		refs->packed = NULL;
 		if (f) {
 			read_packed_refs(f, refs);
 			fclose(f);
 		}
 		refs->did_packed = 1;
 	}
-	return refs->packed;
+	return &refs->packed;
 }
 
-static struct ref_list *get_ref_dir(const char *submodule, const char *base,
-				    struct ref_list *list)
+static void get_ref_dir(const char *submodule, const char *base,
+			struct ref_array *array)
 {
 	DIR *dir;
 	const char *path;
@@ -299,7 +250,7 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 			if (stat(refdir, &st) < 0)
 				continue;
 			if (S_ISDIR(st.st_mode)) {
-				list = get_ref_dir(submodule, ref, list);
+				get_ref_dir(submodule, ref, array);
 				continue;
 			}
 			if (submodule) {
@@ -314,12 +265,11 @@ static struct ref_list *get_ref_dir(const char *submodule, const char *base,
 					hashclr(sha1);
 					flag |= REF_BROKEN;
 				}
-			list = add_ref(ref, sha1, flag, list, NULL);
+			add_ref(ref, sha1, flag, array, NULL);
 		}
 		free(ref);
 		closedir(dir);
 	}
-	return list;
 }
 
 struct warn_if_dangling_data {
@@ -356,21 +306,21 @@ void warn_dangling_symref(FILE *fp, const char *msg_fmt, const char *refname)
 	for_each_rawref(warn_if_dangling_symref, &data);
 }
 
-static struct ref_list *get_loose_refs(const char *submodule)
+static struct ref_array *get_loose_refs(const char *submodule)
 {
 	if (submodule) {
-		free_ref_list(submodule_refs.loose);
-		submodule_refs.loose = get_ref_dir(submodule, "refs", NULL);
-		submodule_refs.loose = sort_ref_list(submodule_refs.loose);
-		return submodule_refs.loose;
+		free_ref_array(&submodule_refs.loose);
+		get_ref_dir(submodule, "refs", &submodule_refs.loose);
+		sort_ref_array(&submodule_refs.loose);
+		return &submodule_refs.loose;
 	}
 
 	if (!cached_refs.did_loose) {
-		cached_refs.loose = get_ref_dir(NULL, "refs", NULL);
-		cached_refs.loose = sort_ref_list(cached_refs.loose);
+		get_ref_dir(NULL, "refs", &cached_refs.loose);
+		sort_ref_array(&cached_refs.loose);
 		cached_refs.did_loose = 1;
 	}
-	return cached_refs.loose;
+	return &cached_refs.loose;
 }
 
 /* We allow "recursive" symbolic refs. Only within reason, though */
@@ -381,8 +331,8 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 {
 	FILE *f;
 	struct cached_refs refs;
-	struct ref_list *ref;
-	int retval;
+	struct ref_entry *ref;
+	int retval = -1;
 
 	strcpy(name + pathlen, "packed-refs");
 	f = fopen(name, "r");
@@ -390,17 +340,12 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 		return -1;
 	read_packed_refs(f, &refs);
 	fclose(f);
-	ref = refs.packed;
-	retval = -1;
-	while (ref) {
-		if (!strcmp(ref->name, refname)) {
-			retval = 0;
-			memcpy(result, ref->sha1, 20);
-			break;
-		}
-		ref = ref->next;
+	ref = search_ref_array(&refs.packed, refname);
+	if (ref != NULL) {
+		memcpy(result, ref->sha1, 20);
+		retval = 0;
 	}
-	free_ref_list(refs.packed);
+	free_ref_array(&refs.packed);
 	return retval;
 }
 
@@ -501,15 +446,13 @@ const char *resolve_ref(const char *ref, unsigned char *sha1, int reading, int *
 		git_snpath(path, sizeof(path), "%s", ref);
 		/* Special case: non-existing file. */
 		if (lstat(path, &st) < 0) {
-			struct ref_list *list = get_packed_refs(NULL);
-			while (list) {
-				if (!strcmp(ref, list->name)) {
-					hashcpy(sha1, list->sha1);
-					if (flag)
-						*flag |= REF_ISPACKED;
-					return ref;
-				}
-				list = list->next;
+			struct ref_array *packed = get_packed_refs(NULL);
+			struct ref_entry *r = search_ref_array(packed, ref);
+			if (r != NULL) {
+				hashcpy(sha1, r->sha1);
+				if (flag)
+					*flag |= REF_ISPACKED;
+				return ref;
 			}
 			if (reading || errno != ENOENT)
 				return NULL;
@@ -584,7 +527,7 @@ int read_ref(const char *ref, unsigned char *sha1)
 
 #define DO_FOR_EACH_INCLUDE_BROKEN 01
 static int do_one_ref(const char *base, each_ref_fn fn, int trim,
-		      int flags, void *cb_data, struct ref_list *entry)
+		      int flags, void *cb_data, struct ref_entry *entry)
 {
 	if (prefixcmp(entry->name, base))
 		return 0;
@@ -630,18 +573,12 @@ int peel_ref(const char *ref, unsigned char *sha1)
 		return -1;
 
 	if ((flag & REF_ISPACKED)) {
-		struct ref_list *list = get_packed_refs(NULL);
+		struct ref_array *array = get_packed_refs(NULL);
+		struct ref_entry *r = search_ref_array(array, ref);
 
-		while (list) {
-			if (!strcmp(list->name, ref)) {
-				if (list->flag & REF_KNOWS_PEELED) {
-					hashcpy(sha1, list->peeled);
-					return 0;
-				}
-				/* older pack-refs did not leave peeled ones */
-				break;
-			}
-			list = list->next;
+		if (r != NULL && r->flag & REF_KNOWS_PEELED) {
+			hashcpy(sha1, r->peeled);
+			return 0;
 		}
 	}
 
@@ -660,36 +597,39 @@ fallback:
 static int do_for_each_ref(const char *submodule, const char *base, each_ref_fn fn,
 			   int trim, int flags, void *cb_data)
 {
-	int retval = 0;
-	struct ref_list *packed = get_packed_refs(submodule);
-	struct ref_list *loose = get_loose_refs(submodule);
+	int retval = 0, i, p = 0, l = 0;
+	struct ref_array *packed = get_packed_refs(submodule);
+	struct ref_array *loose = get_loose_refs(submodule);
 
-	struct ref_list *extra;
+	struct ref_array *extra = &extra_refs;
 
-	for (extra = extra_refs; extra; extra = extra->next)
-		retval = do_one_ref(base, fn, trim, flags, cb_data, extra);
+	for (i = 0; i < extra->nr; i++)
+		retval = do_one_ref(base, fn, trim, flags, cb_data, extra->refs[i]);
 
-	while (packed && loose) {
-		struct ref_list *entry;
-		int cmp = strcmp(packed->name, loose->name);
+	while (p < packed->nr && l < loose->nr) {
+		struct ref_entry *entry;
+		int cmp = strcmp(packed->refs[p]->name, loose->refs[l]->name);
 		if (!cmp) {
-			packed = packed->next;
+			p++;
 			continue;
 		}
 		if (cmp > 0) {
-			entry = loose;
-			loose = loose->next;
+			entry = loose->refs[l++];
 		} else {
-			entry = packed;
-			packed = packed->next;
+			entry = packed->refs[p++];
 		}
 		retval = do_one_ref(base, fn, trim, flags, cb_data, entry);
 		if (retval)
 			goto end_each;
 	}
 
-	for (packed = packed ? packed : loose; packed; packed = packed->next) {
-		retval = do_one_ref(base, fn, trim, flags, cb_data, packed);
+	if (l < loose->nr) {
+		p = l;
+		packed = loose;
+	}
+
+	for (; p < packed->nr; p++) {
+		retval = do_one_ref(base, fn, trim, flags, cb_data, packed->refs[p]);
 		if (retval)
 			goto end_each;
 	}
@@ -1005,24 +945,24 @@ static int remove_empty_directories(const char *file)
 }
 
 static int is_refname_available(const char *ref, const char *oldref,
-				struct ref_list *list, int quiet)
-{
-	int namlen = strlen(ref); /* e.g. 'foo/bar' */
-	while (list) {
-		/* list->name could be 'foo' or 'foo/bar/baz' */
-		if (!oldref || strcmp(oldref, list->name)) {
-			int len = strlen(list->name);
+				struct ref_array *array, int quiet)
+{
+	int i, namlen = strlen(ref); /* e.g. 'foo/bar' */
+	for (i = 0; i < array->nr; i++ ) {
+		struct ref_entry *entry = array->refs[i];
+		/* entry->name could be 'foo' or 'foo/bar/baz' */
+		if (!oldref || strcmp(oldref, entry->name)) {
+			int len = strlen(entry->name);
 			int cmplen = (namlen < len) ? namlen : len;
-			const char *lead = (namlen < len) ? list->name : ref;
-			if (!strncmp(ref, list->name, cmplen) &&
+			const char *lead = (namlen < len) ? entry->name : ref;
+			if (!strncmp(ref, entry->name, cmplen) &&
 			    lead[cmplen] == '/') {
 				if (!quiet)
 					error("'%s' exists; cannot create '%s'",
-					      list->name, ref);
+					      entry->name, ref);
 				return 0;
 			}
 		}
-		list = list->next;
 	}
 	return 1;
 }
@@ -1129,18 +1069,13 @@ static struct lock_file packlock;
 
 static int repack_without_ref(const char *refname)
 {
-	struct ref_list *list, *packed_ref_list;
-	int fd;
-	int found = 0;
+	struct ref_array *packed;
+	struct ref_entry *ref;
+	int fd, i;
 
-	packed_ref_list = get_packed_refs(NULL);
-	for (list = packed_ref_list; list; list = list->next) {
-		if (!strcmp(refname, list->name)) {
-			found = 1;
-			break;
-		}
-	}
-	if (!found)
+	packed = get_packed_refs(NULL);
+	ref = search_ref_array(packed, refname);
+	if (ref == NULL)
 		return 0;
 	fd = hold_lock_file_for_update(&packlock, git_path("packed-refs"), 0);
 	if (fd < 0) {
@@ -1148,17 +1083,19 @@ static int repack_without_ref(const char *refname)
 		return error("cannot delete '%s' from packed refs", refname);
 	}
 
-	for (list = packed_ref_list; list; list = list->next) {
+	for (i = 0; i < packed->nr; i++) {
 		char line[PATH_MAX + 100];
 		int len;
 
-		if (!strcmp(refname, list->name))
+		ref = packed->refs[i];
+
+		if (!strcmp(refname, ref->name))
 			continue;
 		len = snprintf(line, sizeof(line), "%s %s\n",
-			       sha1_to_hex(list->sha1), list->name);
+			       sha1_to_hex(ref->sha1), ref->name);
 		/* this should not happen but just being defensive */
 		if (len > sizeof(line))
-			die("too long a refname '%s'", list->name);
+			die("too long a refname '%s'", ref->name);
 		write_or_die(fd, line, len);
 	}
 	return commit_lock_file(&packlock);
-- 
1.7.6.1

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-29 22:11                                                         ` [PATCH v3] " Julian Phillips
@ 2011-09-29 23:48                                                           ` Junio C Hamano
  2011-09-30 15:30                                                             ` Michael Haggerty
  2011-09-30  1:13                                                           ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-09-29 23:48 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Michael Haggerty, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

This version looks sane, although I have a suspicion that it may have
some interaction with what Michael may be working on.

Thanks.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-29 22:11                                                         ` [PATCH v3] " Julian Phillips
  2011-09-29 23:48                                                           ` Junio C Hamano
@ 2011-09-30  1:13                                                           ` Martin Fick
  2011-09-30  3:44                                                             ` Junio C Hamano
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-30  1:13 UTC (permalink / raw)
  To: Julian Phillips
  Cc: Junio C Hamano, Christian Couder, git, Christian Couder, Thomas Rast

On Thursday, September 29, 2011 04:11:42 pm Julian Phillips 
wrote:
> Currently we linearly search through lists of refs when
> we need to find a specific ref.  This can be very slow
> if we need to lookup a large number of refs.  By
> changing to a binary search we can make this faster.
> 
> In order to be able to use a binary search we need to
> change from using linked lists to arrays, which we can
> manage using ALLOC_GROW.
> 
> We can now also use the standard library qsort function
> to sort the refs arrays.
> 

This works for me, however unfortunately, I cannot find any 
scenarios where it improves anything over the previous fix 
by René.  :(

I tested many things, clones, fetches, fetch noops, 
checkouts, garbage collection.  I am a bit surprised, 
because I thought that my hash-map hack did still improve 
checkouts on packed refs, but it could just be that my 
hack was buggy and did not actually do a full orphan 
check.

Thanks,

-Martin

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-30  1:13                                                           ` Martin Fick
@ 2011-09-30  3:44                                                             ` Junio C Hamano
  2011-09-30  8:04                                                               ` Julian Phillips
  2011-09-30 15:45                                                               ` Martin Fick
  0 siblings, 2 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-09-30  3:44 UTC (permalink / raw)
  To: Martin Fick
  Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast

Martin Fick <mfick@codeaurora.org> writes:

> This works for me, however unfortunately, I cannot find any 
> scenarios where it improves anything over the previous fix 
> by René.  :(

Nevertheless, I would appreciate it if you can try this _without_ René's
patch. This attempts to make resolve_ref() cheap for _any_ caller. René's
patch avoids calling it in one specific callchain.

They address different issues. René's patch is probably an independently
good change (I haven't thought about the interactions with the topics in
flight and its implications on the future direction), but would not help
other/new callers that make many calls to resolve_ref().

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-30  3:44                                                             ` Junio C Hamano
@ 2011-09-30  8:04                                                               ` Julian Phillips
  2011-09-30 15:45                                                               ` Martin Fick
  1 sibling, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-30  8:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Fick, Christian Couder, git, Christian Couder, Thomas Rast

On Thu, 29 Sep 2011 20:44:40 -0700, Junio C Hamano wrote:
> Martin Fick <mfick@codeaurora.org> writes:
>
>> This works for me, however unfortunately, I cannot find any
>> scenarios where it improves anything over the previous fix
>> by René.  :(
>
> Nevertheless, I would appreciate it if you can try this _without_ 
> René's
> patch. This attempts to make resolve_ref() cheap for _any_ caller. 
> René's
> patch avoids calling it in one specific callchain.
>
> They address different issues. René's patch is probably an 
> independently
> good change (I haven't thought about the interactions with the topics 
> in
> flight and its implications on the future direction), but would not 
> help
> other/new callers that make many calls to resolve_ref().

It certainly helps with my test repo (~140k refs, of which ~40k are 
branches).  User times for checkout starting from an orphaned commit 
are:

No fix          : ~16m8s
+ Binary Search : ~4s
+ René's patch  : ~2s

(The 2s includes both patches, though the timing is the same for René's 
patch alone)

-- 
Julian

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-29 20:11                                                   ` Martin Fick
@ 2011-09-30  9:12                                                     ` René Scharfe
  2011-09-30 16:09                                                       ` Martin Fick
  2011-09-30 16:52                                                       ` Junio C Hamano
  0 siblings, 2 replies; 126+ messages in thread
From: René Scharfe @ 2011-09-30  9:12 UTC (permalink / raw)
  To: Martin Fick
  Cc: Julian Phillips, Christian Couder, git, Christian Couder,
	Thomas Rast, Junio C Hamano

Hi Martin,

Am 29.09.2011 22:11, schrieb Martin Fick:
> Your patch works well for me.  It achieves about the same 
> gains as Julian's patch. Thanks!

OK, and what happens if you apply the following patch on top of my first
one?  It avoids going through all the refs a second time during cleanup,
at the cost of going through the list of all known objects.  I wonder if
that's any faster in your case.

Thanks,
René


diff --git a/builtin/checkout.c b/builtin/checkout.c
index 84e0cdc..a4b1003 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -596,15 +596,14 @@ static int add_pending_uninteresting_ref(const char *refname,
 	return 0;
 }
 
-static int clear_commit_marks_from_one_ref(const char *refname,
-				      const unsigned char *sha1,
-				      int flags,
-				      void *cb_data)
+static void clear_commit_marks_for_all(unsigned int mark)
 {
-	struct commit *commit = lookup_commit_reference_gently(sha1, 1);
-	if (commit)
-		clear_commit_marks(commit, -1);
-	return 0;
+	unsigned int i, max = get_max_object_index();
+	for (i = 0; i < max; i++) {
+		struct object *object = get_indexed_object(i);
+		if (object && object->type == OBJ_COMMIT)
+			object->flags &= ~mark;
+	}
 }
 
 static void describe_one_orphan(struct strbuf *sb, struct commit *commit)
@@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit)
 	else
 		describe_detached_head(_("Previous HEAD position was"), commit);
 
-	clear_commit_marks(commit, -1);
-	for_each_ref(clear_commit_marks_from_one_ref, NULL);
+	clear_commit_marks_for_all(ALL_REV_FLAGS);
 }
 
 static int switch_branches(struct checkout_opts *opts, struct branch_info *new)

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-29 23:48                                                           ` Junio C Hamano
@ 2011-09-30 15:30                                                             ` Michael Haggerty
  2011-09-30 16:38                                                               ` Junio C Hamano
  0 siblings, 1 reply; 126+ messages in thread
From: Michael Haggerty @ 2011-09-30 15:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

On 09/30/2011 01:48 AM, Junio C Hamano wrote:
> This version looks sane, although I have a suspicion that it may have
> some interaction with what Michael may be working on.

Indeed, I have almost equivalent changes in the giant patch series that
I am working on [1].  The branch is very experimental.  The tip
currently passes all the tests, but it has a known performance
regression when "git fetch" is used to fetch many commits.


But before comparing ref-related optimizations, we have an *urgent* need
for a decent performance test suite.  There are many slightly different
scenarios that have very different performance characteristics, and we
have to be sure that we are optimizing for the whole palette of
many-reference use cases.  So I made an attempt at a kludgey but
somewhat flexible performance-testing script [2].  I don't know whether
something like this should be integrated into the git project, and if so
where; suggestions are welcome.


To run the tests, from the root of the git source tree:

    make # make sure git is up-to-date
    t/make-refperf-repo --help
    t/make-refperf-repo [OPTIONS]
    t/refperf
    cat refperf.times # See the results

The default repo has 5k commits in a linear series with one reference on
each commit.  (These numbers can both be adjusted.)

The reference namespace can be laid out a few ways:

* Many references in a single "directory" vs. sharded over many
"directories"

* In lexicographic order by commit, in reverse order, or "shuffled".

By default, the repo is written to "refperf-repo".

The time it takes to create the test repository is itself also an
interesting benchmark.  For example, on the maint branch it is terribly
slow unless it is passed either the --pack-refs-interval=N (with N, say
100) or --no-replace-object option.  I also noticed that if it is run like

    t/make-refperf-repo --refs=5000 --commits=5000 \
            --pack-refs-interval=100

(one ref per commit), git-pack-refs becomes precipitously and
dramatically slower after the 2000th commit.

I haven't had time yet for systematic benchmarks of other git versions.

See the refperf script to see what sorts of benchmarks that I have built
into it so far.  The refperf test is non-destructive; it always copies
from "refperf-repo" to "refperf-repo-copy" and does its tests in the
copy; therefore a test repo can be reused.  The timing data are written
to "refperf.times" and other output to "refperf.log".

Here are my refperf results for the "maint" branch on my notebook with
the default "make-refperf-repo" arguments (times in seconds):

3.36 git branch (cold)
0.01 git branch (warm)
0.04 git for-each-ref
3.08 git checkout (cold)
0.01 git checkout (warm)
0.00 git checkout --orphan (warm)
0.15 git checkout from detached orphan
0.12 git pack-refs
1.17 git branch (cold)
0.00 git branch (warm)
0.17 git for-each-ref
0.95 git checkout (cold)
0.00 git checkout (warm)
0.00 git checkout --orphan (warm)
0.21 git checkout from detached orphan
0.18 git branch -a --contains
7.67 git clone
0.06 git fetch (nothing)
0.01 git pack-refs
0.05 git fetch (nothing, packed)
0.10 git clone of a ref-packed repo
0.63 git fetch (everything)

Probably we should test with even more references than this, but this
test already shows that some commands are quite sluggish.

There are some more things that could be added, like:

* Branches vs. annotated tags

* References on the tips of branches in a more typical "branchy" repository.

* git describe --all

* git log --decorate

* git gc

* git filter-branch
  (This has very different performance characteristics because it is a
script that invokes git many times.)

I suggest that we try to do systematic benchmarking of any changes that
we claim are performance optimizations and share before/after results in
the cover letter for the patch series.

Michael

[1] branch hierarchical-refs at git://github.com/mhagger/git.git
[2] branch refperf at git://github.com/mhagger/git.git

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-30  3:44                                                             ` Junio C Hamano
  2011-09-30  8:04                                                               ` Julian Phillips
@ 2011-09-30 15:45                                                               ` Martin Fick
  1 sibling, 0 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-30 15:45 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Julian Phillips, Christian Couder, git, Christian Couder, Thomas Rast

On Thursday, September 29, 2011 09:44:40 pm Junio C Hamano 
wrote:
> Martin Fick <mfick@codeaurora.org> writes:
> > This works for me, however unfortunately, I cannot find
> > any scenarios where it improves anything over the
> > previous fix by René.  :(
> 
> Nevertheless, I would appreciate it if you can try this
> _without_ René's patch. This attempts to make
> resolve_ref() cheap for _any_ caller. René's patch
> avoids calling it in one specific callchain.
> 
> They address different issues. René's patch is probably
> an independently good change (I haven't thought about
> the interactions with the topics in flight and its
> implications on the future direction), but would not
> help other/new callers that make many calls to
> resolve_ref().

Agreed.  Here is what I am seeing without René's patch.

Checkout in a NON packed ref repo takes about 20s; with patch 
v3 of the binary search it takes about 11s (1s slower than 
René's patch).

Checkout in a packed ref repo takes about 5:30min; with patch 
v3 of the binary search it takes about 10s (also 1s slower than 
René's patch).

I'd say that's not bad; the 1s difference seems to come from 
doing the search 60K+ times (my tests don't quite scan the 
full list), so the search scales well with patch v3.

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-30  9:12                                                     ` René Scharfe
@ 2011-09-30 16:09                                                       ` Martin Fick
  2011-09-30 16:52                                                       ` Junio C Hamano
  1 sibling, 0 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-30 16:09 UTC (permalink / raw)
  To: René Scharfe
  Cc: Julian Phillips, Christian Couder, git, Christian Couder,
	Thomas Rast, Junio C Hamano

On Friday, September 30, 2011 03:12:08 am René Scharfe wrote:
> OK, and what happens if you apply the following patch on
> top of my first one?  It avoids going through all the
> refs a second time during cleanup, at the cost of going
> through the list of all known objects.  I wonder if
> that's any faster in your case.


This patch helps a bit more.  It seems to shave about 
another .5s off in both the packed and non packed cases, 
with or without the binary search.

-Martin



> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 84e0cdc..a4b1003 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -596,15 +596,14 @@ static int add_pending_uninteresting_ref(const char *refname,
>  	return 0;
>  }
> 
> -static int clear_commit_marks_from_one_ref(const char *refname,
> -				      const unsigned char *sha1,
> -				      int flags,
> -				      void *cb_data)
> +static void clear_commit_marks_for_all(unsigned int mark)
>  {
> -	struct commit *commit = lookup_commit_reference_gently(sha1, 1);
> -	if (commit)
> -		clear_commit_marks(commit, -1);
> -	return 0;
> +	unsigned int i, max = get_max_object_index();
> +	for (i = 0; i < max; i++) {
> +		struct object *object = get_indexed_object(i);
> +		if (object && object->type == OBJ_COMMIT)
> +			object->flags &= ~mark;
> +	}
>  }
> 
>  static void describe_one_orphan(struct strbuf *sb, struct commit *commit)
> @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit)
>  	else
>  		describe_detached_head(_("Previous HEAD position was"), commit);
> 
> -	clear_commit_marks(commit, -1);
> -	for_each_ref(clear_commit_marks_from_one_ref, NULL);
> +	clear_commit_marks_for_all(ALL_REV_FLAGS);
>  }
> 
>  static int switch_branches(struct checkout_opts *opts,
> 			    struct branch_info *new)

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-30 15:30                                                             ` Michael Haggerty
@ 2011-09-30 16:38                                                               ` Junio C Hamano
  2011-09-30 17:56                                                                 ` [PATCH] refs: Remove duplicates after sorting with qsort Julian Phillips
  2011-10-02  5:15                                                                 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty
  0 siblings, 2 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-09-30 16:38 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

Michael Haggerty <mhagger@alum.mit.edu> writes:

> On 09/30/2011 01:48 AM, Junio C Hamano wrote:
>> This version looks sane, although I have a suspicion that it may have
>> some interaction with what Michael may be working on.
>
> Indeed, I have almost equivalent changes in the giant patch series that
> I am working on [1].

Good; that was the primary thing I wanted to know.  I want to take
Julian's patch early but if the approach and data structures were
drastically different from what you are cooking, that would force
unnecessary reroll on your part, which I wanted to avoid.

Thanks.


* Re: Git is not scalable with too many refs/*
  2011-09-26 18:56                                         ` Christian Couder
@ 2011-09-30 16:41                                           ` Martin Fick
  2011-09-30 19:26                                             ` Martin Fick
  2011-09-30 21:02                                             ` Martin Fick
  0 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-30 16:41 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Monday, September 26, 2011 12:56:04 pm Christian Couder wrote:
> After "git pack-refs --all" I get:

OK.   So many great improvements in ref scalability, thanks 
everyone!

It is getting so good, that I had to take a step back and 
re-evaluate what we consider good/bad.  On doing so, I can't 
help but think that fetches still need some improvement.

Fetches had the worst regression of all, > 8 days, so the 
massive fix to bring it down to 7.5 mins was awesome.  
7-8 mins sounded pretty good 2 weeks ago, especially when a 
checkout took 5+ mins!  But now that almost every other 
operation has been sped up, that is starting to feel a bit 
on the slow side.  My spidey sense tells me something 
is still not quite right in the fetch path.

Here is some more data to back up my spidey sense: after all 
the improvements, a noop fetch of all the changes (noop 
meaning they are all already uptodate) takes around 
3 mins in the non gced (non packed refs) case.  That same 
noop only takes ~12s in the gced (packed refs) case!

I dug into this a bit further.  I took a non gced and non 
packed refs repo and this time, instead of gcing it to get 
packed refs, I only ran the above git pack-refs --all so that 
objects did not get gced.  With this, the noop fetch was 
also only around 12s.  This confirmed that the non gced 
objects are not interfering with the noop fetch; the problem 
really is just the unpacked refs.  Just to confirm that the 
FS is not horribly slow, I did a "find .git/refs" and it 
only took about .4s for about 80K results!

So, while I understand that a full fetch will actually have 
to transfer quite a bit of data, the noop fetch seems like 
it is still suffering in the non gced (non packed refs) case.  
If that time were improved, I suspect that the full fetch 
will improve by at least an equivalent amount, if not more.

Any thoughts?

-Martin


-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-30  9:12                                                     ` René Scharfe
  2011-09-30 16:09                                                       ` Martin Fick
@ 2011-09-30 16:52                                                       ` Junio C Hamano
  2011-09-30 18:17                                                         ` René Scharfe
  1 sibling, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-09-30 16:52 UTC (permalink / raw)
  To: René Scharfe
  Cc: Martin Fick, Julian Phillips, Christian Couder, git,
	Christian Couder, Thomas Rast, Junio C Hamano

René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:

> Hi Martin,
>
> Am 29.09.2011 22:11, schrieb Martin Fick:
>> Your patch works well for me.  It achieves about the same 
>> gains as Julian's patch. Thanks!
>
> OK, and what happens if you apply the following patch on top of my first
> one?  It avoids going through all the refs a second time during cleanup,
> at the cost of going through the list of all known objects.  I wonder if
> that's any faster in your case.
> ...
>  static void describe_one_orphan(struct strbuf *sb, struct commit *commit)
> @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit)
>  	else
>  		describe_detached_head(_("Previous HEAD position was"), commit);
>  
> -	clear_commit_marks(commit, -1);
> -	for_each_ref(clear_commit_marks_from_one_ref, NULL);
> +	clear_commit_marks_for_all(ALL_REV_FLAGS);
>  }

The function already clears all the flag bits from commits near the tip of
all the refs (i.e. whatever commit it traverses until it gets to the fork
point), so it cannot be reused in other contexts where the caller

 - first marks commit objects with some flag bits for its own purpose,
   unrelated to the "orphaned"-ness check;
 - calls this function to issue a warning; and then
 - uses the flags it set earlier to do something useful,

which requires the "clean up after yourself, by clearing only the bits you
used without disturbing other bits that you do not use" pattern.

It might be a better solution to not bother to clear the marks at all;
would it break anything in this codepath?


* [PATCH] refs: Remove duplicates after sorting with qsort
  2011-09-30 16:38                                                               ` Junio C Hamano
@ 2011-09-30 17:56                                                                 ` Julian Phillips
  2011-10-02  5:15                                                                 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty
  1 sibling, 0 replies; 126+ messages in thread
From: Julian Phillips @ 2011-09-30 17:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast, Michael Haggerty

The previous custom merge sort would drop duplicate entries as part of
the sort.  It would also die if the duplicate entries had different
sha1 values.  The standard library qsort doesn't do this, so we have
to do it manually afterwards.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

On Fri, 30 Sep 2011 09:38:54 -0700, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>> On 09/30/2011 01:48 AM, Junio C Hamano wrote:
>>> This version looks sane, although I have a suspicion that it may 
>>> have
>>> some interaction with what Michael may be working on.
>>
>> Indeed, I have almost equivalent changes in the giant patch series 
>> that
>> I am working on [1].
>
> Good; that was the primary thing I wanted to know.  I want to take
> Julian's patch early but if the approach and data structures were
> drastically different from what you are cooking, that would force
> unnecessary reroll on your part, which I wanted to avoid.
>
> Thanks.

I had a quick look at Michael's code, and it reminded me that I had missed one
thing out.  If we want to keep the duplicate detection & removal from the
original merge sort then this patch is needed on top of v3 of the binary search.

Though I never could figure out how duplicate refs were supposed to appear ... I
tested by editing packed-refs, but I assume that isn't "supported".

 refs.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/refs.c b/refs.c
index 4c01d79..cf080ee 100644
--- a/refs.c
+++ b/refs.c
@@ -77,7 +77,29 @@ static int ref_entry_cmp(const void *a, const void *b)
 
 static void sort_ref_array(struct ref_array *array)
 {
+	int i = 0, j = 1;
+
+	/* Nothing to sort unless there are at least two entries */
+	if (array->nr < 2)
+		return;
+
 	qsort(array->refs, array->nr, sizeof(*array->refs), ref_entry_cmp);
+
+	/* Remove any duplicates from the ref_array */
+	for (; j < array->nr; j++) {
+		struct ref_entry *a = array->refs[i];
+		struct ref_entry *b = array->refs[j];
+		if (!strcmp(a->name, b->name)) {
+			if (hashcmp(a->sha1, b->sha1))
+				die("Duplicated ref, and SHA1s don't match: %s",
+				    a->name);
+			warning("Duplicated ref: %s", a->name);
+			continue;
+		}
+		i++;
+		array->refs[i] = array->refs[j];
+	}
+	array->nr = i + 1;
 }
 
 static struct ref_entry *search_ref_array(struct ref_array *array, const char *name)
-- 
1.7.6.1


* Re: Git is not scalable with too many refs/*
  2011-09-30 16:52                                                       ` Junio C Hamano
@ 2011-09-30 18:17                                                         ` René Scharfe
  2011-10-01 15:28                                                           ` René Scharfe
  0 siblings, 1 reply; 126+ messages in thread
From: René Scharfe @ 2011-09-30 18:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Fick, Julian Phillips, Christian Couder, git,
	Christian Couder, Thomas Rast

Am 30.09.2011 18:52, schrieb Junio C Hamano:
> René Scharfe <rene.scharfe@lsrfire.ath.cx> writes:
> 
>> Hi Martin,
>>
>> Am 29.09.2011 22:11, schrieb Martin Fick:
>>> Your patch works well for me.  It achieves about the same 
>>> gains as Julian's patch. Thanks!
>>
>> OK, and what happens if you apply the following patch on top of my first
>> one?  It avoids going through all the refs a second time during cleanup,
>> at the cost of going through the list of all known objects.  I wonder if
>> that's any faster in your case.
>> ...
>>  static void describe_one_orphan(struct strbuf *sb, struct commit *commit)
>> @@ -690,8 +689,7 @@ static void orphaned_commit_warning(struct commit *commit)
>>  	else
>>  		describe_detached_head(_("Previous HEAD position was"), commit);
>>  
>> -	clear_commit_marks(commit, -1);
>> -	for_each_ref(clear_commit_marks_from_one_ref, NULL);
>> +	clear_commit_marks_for_all(ALL_REV_FLAGS);
>>  }
> 
> The function already clears all the flag bits from commits near the tip of
> all the refs (i.e. whatever commit it traverses until it gets to the fork
> point), so it cannot be reused in other contexts where the caller
> 
>  - first marks commit objects with some flag bits for its own purpose,
>    unrelated to the "orphaned"-ness check;
>  - calls this function to issue a warning; and then
>  - uses the flags it set earlier to do something useful,
> 
> which requires the "clean up after yourself, by clearing only the bits you
> used without disturbing other bits that you do not use" pattern.

Yes, clear_commit_marks_for_all is a bit brutal.  Callers could clear
specific bits (e.g. SEEN|UNINTERESTING) instead of ALL_REV_FLAGS, though.

> It might be a better solution to not bother to clear the marks at all;
> would it break anything in this codepath?

Unfortunately, yes; the cleanup part was added by 5c08dc48 later, when
it became apparent that it's really needed.

However, since the patch only buys us a 5% speedup I'm not sure it's
worth it in its current form.

René


* Re: Git is not scalable with too many refs/*
  2011-09-30 16:41                                           ` Martin Fick
@ 2011-09-30 19:26                                             ` Martin Fick
  2011-09-30 21:02                                             ` Martin Fick
  1 sibling, 0 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-30 19:26 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> I dug into this a bit further.  I took a non gced and non
> packed refs repo and this time instead of gcing it to get
> packedrefs, I only ran the above git pack-refs --all so
> that objects did not get gced.  With this, the noop
> fetch was also only around 12s.  This confirmed that the
> non gced objects are not interfering with the noop
> fetch, the problem really is just the unpacked refs. 
> Just to confirm that the FS is not horribly slow, I did
> a "find .git/refs" and it only takes about .4s for about
> 80Kresults!

Is there a way I can force refs to always be packed?  I didn't 
see a config option for this.  I would like to try a fetch 
this way even if I have to make a small code tweak.

I tried simulating on-the-fly ref packing every now and then 
by running the pack from another repo during the fetch; it 
actually slowed things down (by more than the time it took 
to do the packs).


-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-30 16:41                                           ` Martin Fick
  2011-09-30 19:26                                             ` Martin Fick
@ 2011-09-30 21:02                                             ` Martin Fick
  2011-09-30 22:06                                               ` Martin Fick
  1 sibling, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-09-30 21:02 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> massive fix to bring it down to 7.5mins was awesome.
> 7-8mins sounded pretty good 2 weeks ago, especially when
> a checkout took 5+ mins!  but now that almost every
> other operation has been sped up, that is starting to
> feel a bit on the slow side still.  My spidey sense
> tells me something is still not quite right in the fetch
> path.

I guess I overlooked that there were 2 sides to this 
equation.  Even though I have been doing my fetches locally, 
I was using the file:// protocol, and it appears that the 
remote was running git 1.7.6, which was in my path the whole 
time.  So eliminating that from my path and pointing to 
the "best" binary with all the fixes for both remote and 
local, the full fetch does indeed speed up quite a bit: it 
goes from about 7.5 mins down to ~5 mins!  Previously the 
remote seemed to primarily spend the extra time after:

 remote: Counting objects: 316961

yet before:

 remote: Compressing objects


> Here is some more data to backup my spidey sense: after
> all the improvements, a noop fetch of all the changes
> (noop meaning they are all already uptodate) takes
> around 3mins with a non gced (non packed refs) case. 
> That same noop only takes ~12s in the gced (packed ref
> case)!

I believe (it is hard to go back and be sure) that this 
means that the timings above which gave me 3mins were 
because the remote was using git 1.7.6.  Now, with the good 
binary, in both repos (packed and unpacked), I get great 
warm cache times of about 11-13s for a noop fetch.  It is 
interesting to note that cold cache times are 20s for packed 
refs and 1m30s for unpacked refs.  I guess that makes some 
sense.  

But, this does leave me thinking that packed refs should 
become the default and that there should be a config option 
to disable it?  This might still help a fetch?

Since a full sync is now down to about 5 mins, I broke down 
the output a bit.  It appears that the longest part (2:45m) 
is now the time spent scrolling through each change.  
Each one of these takes about 2ms:
 * [new branch]      refs/changes/99/71199/1 -> 
refs/changes/99/71199/1

Seems fast, but at about 80K... So, are there any obvious N 
loops over the refs happening inside each of the [new 
branch] iterations?


-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-30 21:02                                             ` Martin Fick
@ 2011-09-30 22:06                                               ` Martin Fick
  2011-10-01 20:41                                                 ` Junio C Hamano
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 126+ messages in thread
From: Martin Fick @ 2011-09-30 22:06 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 03:02:30 pm Martin Fick wrote:
> On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> Since a full sync is now down to about 5 mins, I broke
> down the output a bit.  It appears that the longest part
> (2:45m) is now the time spent scrolling through each
> change.  Each one of these takes about 2ms:
>  * [new branch]      refs/changes/99/71199/1 ->
> refs/changes/99/71199/1
> 
> Seems fast, but at about 80K... So, are there any obvious
> N loops over the refs happening inside each of the
> [new branch] iterations?

OK, I narrowed it down I believe.  If I comment out the 
invalidate_cached_refs() line in write_ref_sha1(), it speeds 
through this section.  

I guess this makes sense: we invalidate the cache and have 
to rebuild it after every new ref that is added.  Perhaps a 
simple fix would be to move the invalidation to right after 
all the refs are updated?  Maybe write_ref_sha1 could take a 
flag telling it not to invalidate the cache, so that during 
iterative updates invalidation could be disabled and then run 
manually after the update?

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum


* Re: Git is not scalable with too many refs/*
  2011-09-30 18:17                                                         ` René Scharfe
@ 2011-10-01 15:28                                                           ` René Scharfe
  2011-10-01 15:38                                                             ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe
                                                                               ` (7 more replies)
  0 siblings, 8 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 15:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Fick, Julian Phillips, Christian Couder, git,
	Christian Couder, Thomas Rast

Am 30.09.2011 20:17, schrieb René Scharfe:
> Am 30.09.2011 18:52, schrieb Junio C Hamano:
>> It might be a better solution to not bother to clear the marks at
>> all; would it break anything in this codepath?
> 
> Unfortunately, yes; the cleanup part was added by 5c08dc48 later,
> when it became apparent that it's really needed.
> 
> However, since the patch only buys us a 5% speedup I'm not sure it's 
> worth it in its current form.

I found something better: A trick used by bisect and bundle.  They copy
the list of pending objects from rev_info before calling
prepare_revision_walk and then go through it to clean up the commit
marks without going through the refs again.  And I think we can even
improve it a little.

The following patches tighten some orphan/detached head tests 
a little; then comes a resend of my first patch on this topic, 
only split up into two; then four patches that introduce the 
trick mentioned above (which could perhaps be squashed 
together); and the last one is a bonus refactoring patch.

 bisect.c                   |   20 +++++++-------
 builtin/checkout.c         |   58 +++++++++++++------------------------------
 bundle.c                   |   11 +++-----
 commit.c                   |   14 ++++++++++
 commit.h                   |    1 +
 revision.c                 |   14 +++++++---
 revision.h                 |    2 +
 t/t2020-checkout-detach.sh |    7 ++++-
 8 files changed, 64 insertions(+), 63 deletions(-)

René


* [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020
  2011-10-01 15:28                                                           ` René Scharfe
@ 2011-10-01 15:38                                                             ` René Scharfe
  2011-10-01 19:02                                                               ` Sverre Rabbelier
  2011-10-01 15:43                                                             ` [PATCH 2/8] revision: factor out add_pending_sha1 René Scharfe
                                                                               ` (6 subsequent siblings)
  7 siblings, 1 reply; 126+ messages in thread
From: René Scharfe @ 2011-10-01 15:38 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

If we leave a detached head, exactly one of two things happens: either
checkout warns about it being an orphan or describes it as a courtesy.
Test t2020 already checked that the warning is shown as needed.  This
patch also checks for the description.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 t/t2020-checkout-detach.sh |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/t/t2020-checkout-detach.sh b/t/t2020-checkout-detach.sh
index 2366f0f..068fba4 100755
--- a/t/t2020-checkout-detach.sh
+++ b/t/t2020-checkout-detach.sh
@@ -12,11 +12,14 @@ check_not_detached () {
 }
 
 ORPHAN_WARNING='you are leaving .* commit.*behind'
+PREV_HEAD_DESC='Previous HEAD position was'
 check_orphan_warning() {
-	test_i18ngrep "$ORPHAN_WARNING" "$1"
+	test_i18ngrep "$ORPHAN_WARNING" "$1" &&
+	test_i18ngrep ! "$PREV_HEAD_DESC" "$1"
 }
 check_no_orphan_warning() {
-	test_i18ngrep ! "$ORPHAN_WARNING" "$1"
+	test_i18ngrep ! "$ORPHAN_WARNING" "$1" &&
+	test_i18ngrep "$PREV_HEAD_DESC" "$1"
 }
 
 reset () {
-- 
1.7.7


* [PATCH 2/8] revision: factor out add_pending_sha1
  2011-10-01 15:28                                                           ` René Scharfe
  2011-10-01 15:38                                                             ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe
@ 2011-10-01 15:43                                                             ` René Scharfe
  2011-10-01 15:51                                                             ` [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check René Scharfe
                                                                               ` (5 subsequent siblings)
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 15:43 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

This function is a combination of the static get_reference and
add_pending_object.  It can be used to easily queue objects by hash.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
The next patch is going to use it in checkout.

 revision.c |   11 ++++++++---
 revision.h |    1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/revision.c b/revision.c
index c46cfaa..2e8aa33 100644
--- a/revision.c
+++ b/revision.c
@@ -185,6 +185,13 @@ static struct object *get_reference(struct rev_info *revs, const char *name, con
 	return object;
 }
 
+void add_pending_sha1(struct rev_info *revs, const char *name,
+		      const unsigned char *sha1, unsigned int flags)
+{
+	struct object *object = get_reference(revs, name, sha1, flags);
+	add_pending_object(revs, object, name);
+}
+
 static struct commit *handle_commit(struct rev_info *revs, struct object *object, const char *name)
 {
 	unsigned long flags = object->flags;
@@ -832,9 +839,7 @@ struct all_refs_cb {
 static int handle_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data)
 {
 	struct all_refs_cb *cb = cb_data;
-	struct object *object = get_reference(cb->all_revs, path, sha1,
-					      cb->all_flags);
-	add_pending_object(cb->all_revs, object, path);
+	add_pending_sha1(cb->all_revs, path, sha1, cb->all_flags);
 	return 0;
 }
 
diff --git a/revision.h b/revision.h
index 3d64ada..4541265 100644
--- a/revision.h
+++ b/revision.h
@@ -191,6 +191,7 @@ extern void add_object(struct object *obj,
 		       const char *name);
 
 extern void add_pending_object(struct rev_info *revs, struct object *obj, const char *name);
+extern void add_pending_sha1(struct rev_info *revs, const char *name, const unsigned char *sha1, unsigned int flags);
 
 extern void add_head_to_pending(struct rev_info *);
 
-- 
1.7.7


* [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check
  2011-10-01 15:28                                                           ` René Scharfe
  2011-10-01 15:38                                                             ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe
  2011-10-01 15:43                                                             ` [PATCH 2/8] revision: factor out add_pending_sha1 René Scharfe
@ 2011-10-01 15:51                                                             ` René Scharfe
  2011-10-01 15:56                                                             ` [PATCH 4/8] revision: add leak_pending flag René Scharfe
                                                                               ` (4 subsequent siblings)
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 15:51 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Instead of building a list of textual arguments for setup_revisions, use
add_pending_object and add_pending_sha1 to queue the objects directly.
This is both faster and simpler.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 builtin/checkout.c |   39 ++++++++++++---------------------------
 1 files changed, 12 insertions(+), 27 deletions(-)

diff --git a/builtin/checkout.c b/builtin/checkout.c
index 5e356a6..84e0cdc 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -588,24 +588,11 @@ static void update_refs_for_switch(struct checkout_opts *opts,
 		report_tracking(new);
 }
 
-struct rev_list_args {
-	int argc;
-	int alloc;
-	const char **argv;
-};
-
-static void add_one_rev_list_arg(struct rev_list_args *args, const char *s)
-{
-	ALLOC_GROW(args->argv, args->argc + 1, args->alloc);
-	args->argv[args->argc++] = s;
-}
-
-static int add_one_ref_to_rev_list_arg(const char *refname,
-				       const unsigned char *sha1,
-				       int flags,
-				       void *cb_data)
+static int add_pending_uninteresting_ref(const char *refname,
+					 const unsigned char *sha1,
+					 int flags, void *cb_data)
 {
-	add_one_rev_list_arg(cb_data, refname);
+	add_pending_sha1(cb_data, refname, sha1, flags | UNINTERESTING);
 	return 0;
 }
 
@@ -685,19 +672,17 @@ static void suggest_reattach(struct commit *commit, struct rev_info *revs)
  */
 static void orphaned_commit_warning(struct commit *commit)
 {
-	struct rev_list_args args = { 0, 0, NULL };
 	struct rev_info revs;
-
-	add_one_rev_list_arg(&args, "(internal)");
-	add_one_rev_list_arg(&args, sha1_to_hex(commit->object.sha1));
-	add_one_rev_list_arg(&args, "--not");
-	for_each_ref(add_one_ref_to_rev_list_arg, &args);
-	add_one_rev_list_arg(&args, "--");
-	add_one_rev_list_arg(&args, NULL);
+	struct object *object = &commit->object;
 
 	init_revisions(&revs, NULL);
-	if (setup_revisions(args.argc - 1, args.argv, &revs, NULL) != 1)
-		die(_("internal error: only -- alone should have been left"));
+	setup_revisions(0, NULL, &revs, NULL);
+
+	object->flags &= ~UNINTERESTING;
+	add_pending_object(&revs, object, sha1_to_hex(object->sha1));
+
+	for_each_ref(add_pending_uninteresting_ref, &revs);
+
 	if (prepare_revision_walk(&revs))
 		die(_("internal error in revision walk"));
 	if (!(commit->object.flags & UNINTERESTING))
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 4/8] revision: add leak_pending flag
  2011-10-01 15:28                                                           ` René Scharfe
                                                                               ` (2 preceding siblings ...)
  2011-10-01 15:51                                                             ` [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check René Scharfe
@ 2011-10-01 15:56                                                             ` René Scharfe
  2011-10-01 16:01                                                             ` [PATCH 5/8] bisect: use " René Scharfe
                                                                               ` (3 subsequent siblings)
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 15:56 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

The new flag leak_pending in struct rev_info can be used to prevent
prepare_revision_walk from freeing the list of pending objects.  The
function will still forget about the list, so it really is leaked.  This
behaviour may look weird at first, but it can be useful if a pointer to
the list is saved before calling prepare_revision_walk.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
The next three patches are going to use this flag.

 revision.c |    3 ++-
 revision.h |    1 +
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/revision.c b/revision.c
index 2e8aa33..6d329b4 100644
--- a/revision.c
+++ b/revision.c
@@ -1974,7 +1974,8 @@ int prepare_revision_walk(struct rev_info *revs)
 		}
 		e++;
 	}
-	free(list);
+	if (!revs->leak_pending)
+		free(list);
 
 	if (revs->no_walk)
 		return 0;
diff --git a/revision.h b/revision.h
index 4541265..366a9b4 100644
--- a/revision.h
+++ b/revision.h
@@ -97,6 +97,7 @@ struct rev_info {
 			date_mode_explicit:1,
 			preserve_subject:1;
 	unsigned int	disable_stdin:1;
+	unsigned int	leak_pending:1;
 
 	enum date_mode date_mode;
 
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 5/8] bisect: use leak_pending flag
  2011-10-01 15:28                                                           ` René Scharfe
                                                                               ` (3 preceding siblings ...)
  2011-10-01 15:56                                                             ` [PATCH 4/8] revision: add leak_pending flag René Scharfe
@ 2011-10-01 16:01                                                             ` René Scharfe
  2011-10-01 16:02                                                             ` [PATCH 6/8] bundle: " René Scharfe
                                                                               ` (2 subsequent siblings)
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 16:01 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Instead of creating a copy of the list of pending objects, copy the
struct object_array that points to it, turn on leak_pending, and thus
cause prepare_revision_walk to leave it to us.  And free it once
we're done.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 bisect.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/bisect.c b/bisect.c
index c7b7d79..a05504f 100644
--- a/bisect.c
+++ b/bisect.c
@@ -831,12 +831,14 @@ static int check_ancestors(const char *prefix)
 	bisect_rev_setup(&revs, prefix, "^%s", "%s", 0);
 
 	/* Save pending objects, so they can be cleaned up later. */
-	memset(&pending_copy, 0, sizeof(pending_copy));
-	for (i = 0; i < revs.pending.nr; i++)
-		add_object_array(revs.pending.objects[i].item,
-				 revs.pending.objects[i].name,
-				 &pending_copy);
+	pending_copy = revs.pending;
+	revs.leak_pending = 1;
 
+	/*
+	 * bisect_common calls prepare_revision_walk right away, which
+	 * (together with .leak_pending = 1) makes us the sole owner of
+	 * the list of pending objects.
+	 */
 	bisect_common(&revs);
 	res = (revs.commits != NULL);
 
@@ -845,6 +847,7 @@ static int check_ancestors(const char *prefix)
 		struct object *o = pending_copy.objects[i].item;
 		clear_commit_marks((struct commit *)o, ALL_REV_FLAGS);
 	}
+	free(pending_copy.objects);
 
 	return res;
 }
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 6/8] bundle: use leak_pending flag
  2011-10-01 15:28                                                           ` René Scharfe
                                                                               ` (4 preceding siblings ...)
  2011-10-01 16:01                                                             ` [PATCH 5/8] bisect: use " René Scharfe
@ 2011-10-01 16:02                                                             ` René Scharfe
  2011-10-01 16:09                                                             ` [PATCH 7/8] checkout: " René Scharfe
  2011-10-01 16:16                                                             ` [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array René Scharfe
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 16:02 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Instead of creating a copy of the list of pending objects, copy the
struct object_array that points to it, turn on leak_pending, and thus
cause prepare_revision_walk to leave it to us.  And free it once
we're done.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 bundle.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/bundle.c b/bundle.c
index f48fd7d..26cc9ab 100644
--- a/bundle.c
+++ b/bundle.c
@@ -122,11 +122,8 @@ int verify_bundle(struct bundle_header *header, int verbose)
 	req_nr = revs.pending.nr;
 	setup_revisions(2, argv, &revs, NULL);
 
-	memset(&refs, 0, sizeof(struct object_array));
-	for (i = 0; i < revs.pending.nr; i++) {
-		struct object_array_entry *e = revs.pending.objects + i;
-		add_object_array(e->item, e->name, &refs);
-	}
+	refs = revs.pending;
+	revs.leak_pending = 1;
 
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
@@ -146,6 +143,7 @@ int verify_bundle(struct bundle_header *header, int verbose)
 
 	for (i = 0; i < refs.nr; i++)
 		clear_commit_marks((struct commit *)refs.objects[i].item, -1);
+	free(refs.objects);
 
 	if (verbose) {
 		struct ref_list *r;
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 7/8] checkout: use leak_pending flag
  2011-10-01 15:28                                                           ` René Scharfe
                                                                               ` (5 preceding siblings ...)
  2011-10-01 16:02                                                             ` [PATCH 6/8] bundle: " René Scharfe
@ 2011-10-01 16:09                                                             ` René Scharfe
  2011-10-01 16:16                                                             ` [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array René Scharfe
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 16:09 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Instead of going through all the references again when we clear the
commit marks, do it like bisect and bundle and gain ownership of the
list of pending objects which we constructed from those references.

We simply copy the struct object_array that points to the list and set
the flag leak_pending; then prepare_revision_walk won't destroy it and
it's ours.  We use it to clear the marks and free it at the end.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 builtin/checkout.c |   25 ++++++++++++-------------
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/builtin/checkout.c b/builtin/checkout.c
index 84e0cdc..cfd7e59 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -596,17 +596,6 @@ static int add_pending_uninteresting_ref(const char *refname,
 	return 0;
 }
 
-static int clear_commit_marks_from_one_ref(const char *refname,
-				      const unsigned char *sha1,
-				      int flags,
-				      void *cb_data)
-{
-	struct commit *commit = lookup_commit_reference_gently(sha1, 1);
-	if (commit)
-		clear_commit_marks(commit, -1);
-	return 0;
-}
-
 static void describe_one_orphan(struct strbuf *sb, struct commit *commit)
 {
 	parse_commit(commit);
@@ -674,6 +663,8 @@ static void orphaned_commit_warning(struct commit *commit)
 {
 	struct rev_info revs;
 	struct object *object = &commit->object;
+	struct object_array refs;
+	unsigned int i;
 
 	init_revisions(&revs, NULL);
 	setup_revisions(0, NULL, &revs, NULL);
@@ -683,6 +674,9 @@ static void orphaned_commit_warning(struct commit *commit)
 
 	for_each_ref(add_pending_uninteresting_ref, &revs);
 
+	refs = revs.pending;
+	revs.leak_pending = 1;
+
 	if (prepare_revision_walk(&revs))
 		die(_("internal error in revision walk"));
 	if (!(commit->object.flags & UNINTERESTING))
@@ -690,8 +684,13 @@ static void orphaned_commit_warning(struct commit *commit)
 	else
 		describe_detached_head(_("Previous HEAD position was"), commit);
 
-	clear_commit_marks(commit, -1);
-	for_each_ref(clear_commit_marks_from_one_ref, NULL);
+	for (i = 0; i < refs.nr; i++) {
+		struct object *o = refs.objects[i].item;
+		struct commit *c = lookup_commit_reference_gently(o->sha1, 1);
+		if (c)
+			clear_commit_marks(c, ALL_REV_FLAGS);
+	}
+	free(refs.objects);
 }
 
 static int switch_branches(struct checkout_opts *opts, struct branch_info *new)
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array
  2011-10-01 15:28                                                           ` René Scharfe
                                                                               ` (6 preceding siblings ...)
  2011-10-01 16:09                                                             ` [PATCH 7/8] checkout: " René Scharfe
@ 2011-10-01 16:16                                                             ` René Scharfe
  7 siblings, 0 replies; 126+ messages in thread
From: René Scharfe @ 2011-10-01 16:16 UTC (permalink / raw)
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Factor out the code to clear the commit marks for a whole struct
object_array from builtin/checkout.c into its own exported function
clear_commit_marks_for_object_array and use it in bisect and bundle
as well.  It handles tags and commits and ignores objects of any
other type.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
---
 bisect.c           |    7 ++-----
 builtin/checkout.c |    8 +-------
 bundle.c           |    3 +--
 commit.c           |   14 ++++++++++++++
 commit.h           |    1 +
 5 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/bisect.c b/bisect.c
index a05504f..b4547b9 100644
--- a/bisect.c
+++ b/bisect.c
@@ -826,7 +826,7 @@ static int check_ancestors(const char *prefix)
 {
 	struct rev_info revs;
 	struct object_array pending_copy;
-	int i, res;
+	int res;
 
 	bisect_rev_setup(&revs, prefix, "^%s", "%s", 0);
 
@@ -843,10 +843,7 @@ static int check_ancestors(const char *prefix)
 	res = (revs.commits != NULL);
 
 	/* Clean up objects used, as they will be reused. */
-	for (i = 0; i < pending_copy.nr; i++) {
-		struct object *o = pending_copy.objects[i].item;
-		clear_commit_marks((struct commit *)o, ALL_REV_FLAGS);
-	}
+	clear_commit_marks_for_object_array(&pending_copy, ALL_REV_FLAGS);
 	free(pending_copy.objects);
 
 	return res;
diff --git a/builtin/checkout.c b/builtin/checkout.c
index cfd7e59..683819b 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -664,7 +664,6 @@ static void orphaned_commit_warning(struct commit *commit)
 	struct rev_info revs;
 	struct object *object = &commit->object;
 	struct object_array refs;
-	unsigned int i;
 
 	init_revisions(&revs, NULL);
 	setup_revisions(0, NULL, &revs, NULL);
@@ -684,12 +683,7 @@ static void orphaned_commit_warning(struct commit *commit)
 	else
 		describe_detached_head(_("Previous HEAD position was"), commit);
 
-	for (i = 0; i < refs.nr; i++) {
-		struct object *o = refs.objects[i].item;
-		struct commit *c = lookup_commit_reference_gently(o->sha1, 1);
-		if (c)
-			clear_commit_marks(c, ALL_REV_FLAGS);
-	}
+	clear_commit_marks_for_object_array(&refs, ALL_REV_FLAGS);
 	free(refs.objects);
 }
 
diff --git a/bundle.c b/bundle.c
index 26cc9ab..a8ea918 100644
--- a/bundle.c
+++ b/bundle.c
@@ -141,8 +141,7 @@ int verify_bundle(struct bundle_header *header, int verbose)
 				refs.objects[i].name);
 		}
 
-	for (i = 0; i < refs.nr; i++)
-		clear_commit_marks((struct commit *)refs.objects[i].item, -1);
+	clear_commit_marks_for_object_array(&refs, ALL_REV_FLAGS);
 	free(refs.objects);
 
 	if (verbose) {
diff --git a/commit.c b/commit.c
index 97b4327..50af007 100644
--- a/commit.c
+++ b/commit.c
@@ -430,6 +430,20 @@ void clear_commit_marks(struct commit *commit, unsigned int mark)
 	}
 }
 
+void clear_commit_marks_for_object_array(struct object_array *a, unsigned mark)
+{
+	struct object *object;
+	struct commit *commit;
+	unsigned int i;
+
+	for (i = 0; i < a->nr; i++) {
+		object = a->objects[i].item;
+		commit = lookup_commit_reference_gently(object->sha1, 1);
+		if (commit)
+			clear_commit_marks(commit, mark);
+	}
+}
+
 struct commit *pop_commit(struct commit_list **stack)
 {
 	struct commit_list *top = *stack;
diff --git a/commit.h b/commit.h
index 12d100b..0a4c730 100644
--- a/commit.h
+++ b/commit.h
@@ -126,6 +126,7 @@ struct commit *pop_most_recent_commit(struct commit_list **list,
 struct commit *pop_commit(struct commit_list **stack);
 
 void clear_commit_marks(struct commit *commit, unsigned int mark);
+void clear_commit_marks_for_object_array(struct object_array *a, unsigned mark);
 
 /*
  * Performs an in-place topological sort of list supplied.
-- 
1.7.7

^ permalink raw reply related	[flat|nested] 126+ messages in thread

* Re: [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020
  2011-10-01 15:38                                                             ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe
@ 2011-10-01 19:02                                                               ` Sverre Rabbelier
  0 siblings, 0 replies; 126+ messages in thread
From: Sverre Rabbelier @ 2011-10-01 19:02 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, Martin Fick, Julian Phillips, Christian Couder,
	git, Christian Couder, Thomas Rast

Heya,

On Sat, Oct 1, 2011 at 17:38, René Scharfe <rene.scharfe@lsrfire.ath.cx> wrote:
> If we leave a detached head, exactly one of two things happens: either
> checkout warns about it being an orphan or describes it as a courtesy.
> Test t2020 already checked that the warning is shown as needed.  This
> patch also checks for the description.

A cover letter would have been nice for such a long series :).

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-30 22:06                                               ` Martin Fick
@ 2011-10-01 20:41                                                 ` Junio C Hamano
  2011-10-02  5:19                                                   ` Michael Haggerty
  2011-10-03 18:12                                                 ` Martin Fick
  2011-10-08 20:59                                                 ` Martin Fick
  2 siblings, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-10-01 20:41 UTC (permalink / raw)
  To: Martin Fick
  Cc: git, Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

Martin Fick <mfick@codeaurora.org> writes:

> I guess this makes sense, we invalidate the cache and have 
> to rebuild it after every new ref is added?  Perhaps a 
> simple fix would be to move the invalidation right after all 
> the refs are updated?  Maybe write_ref_sha1 could take in a 
> flag to tell it to not invalidate the cache so that during 
> iterative updates it could be disabled and then run manually 
> after the update?

It might make sense, on top of Julian's patch, to add a bit that says "the
contents of this ref-array are current but the array is not sorted", and
whenever somebody runs add_ref(), append it also to the ref-array (so that
the contents do not have to be re-read from the filesystem) but flip the
"unsorted" bit on. Then update look-up and iteration to sort the array
when "unsorted" bit is on without re-reading the contents from the
filesystem.
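A minimal sketch of that idea, with hypothetical names (the real ref cache in refs.c is structured differently): appending only flips an "unsorted" bit, and the array is re-sorted lazily on the next lookup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified ref-array with an "unsorted" bit. */
struct ref_entry { const char *name; };
struct ref_array {
	struct ref_entry *refs;
	int nr, alloc;
	int sorted;	/* cleared whenever add_ref() appends */
};

static int cmp_ref(const void *a, const void *b)
{
	return strcmp(((const struct ref_entry *)a)->name,
		      ((const struct ref_entry *)b)->name);
}

/* Append without re-reading anything; just flip the "unsorted" bit. */
static void add_ref(struct ref_array *a, const char *name)
{
	if (a->nr == a->alloc) {
		a->alloc = a->alloc ? a->alloc * 2 : 8;
		a->refs = realloc(a->refs, a->alloc * sizeof(*a->refs));
	}
	a->refs[a->nr++].name = name;
	a->sorted = 0;
}

/* Sort lazily, only when a lookup actually needs the order. */
static struct ref_entry *lookup_ref(struct ref_array *a, const char *name)
{
	struct ref_entry key = { name };
	if (!a->sorted) {
		qsort(a->refs, a->nr, sizeof(*a->refs), cmp_ref);
		a->sorted = 1;
	}
	return bsearch(&key, a->refs, a->nr, sizeof(*a->refs), cmp_ref);
}
```

Iterative updates then cost one append each instead of a cache rebuild, and the single sort is amortized over all subsequent lookups.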

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH] Don't sort ref_list too early
  2011-09-27  0:00                                                       ` [PATCH] Don't sort ref_list too early Julian Phillips
@ 2011-10-02  4:58                                                         ` Michael Haggerty
  0 siblings, 0 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-10-02  4:58 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Junio C Hamano, Martin Fick, git

On 09/27/2011 02:00 AM, Julian Phillips wrote:
> get_ref_dir is called recursively for subdirectories, which means that
> we were calling sort_ref_list for each directory of refs instead of
> once for all the refs.  This is a massive waste of processing, so now
> just call sort_ref_list on the result of the top-level get_ref_dir, so
> that the sort is only done once.

+1

I think this patch should also be considered for maint, since it is
noninvasive and fixes a bad performance regression.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-09-30 16:38                                                               ` Junio C Hamano
  2011-09-30 17:56                                                                 ` [PATCH] refs: Remove duplicates after sorting with qsort Julian Phillips
@ 2011-10-02  5:15                                                                 ` Michael Haggerty
  2011-10-02  5:45                                                                   ` Junio C Hamano
  1 sibling, 1 reply; 126+ messages in thread
From: Michael Haggerty @ 2011-10-02  5:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

On 09/30/2011 06:38 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> On 09/30/2011 01:48 AM, Junio C Hamano wrote:
>>> This version looks sane, although I have a suspicion that it may have
>>> some interaction with what Michael may be working on.
>>
>> Indeed, I have almost equivalent changes in the giant patch series that
>> I am working on [1].
> 
> Good; that was the primary thing I wanted to know.  I want to take
> Julian's patch early but if the approach and data structures were
> drastically different from what you are cooking, that would force
> unnecessary reroll on your part, which I wanted to avoid.

Um, well, my patch series includes the same changes that Julian's wants
to introduce, but following lots of other changes, cleanups,
documentation improvements, etc.  Moreover, my patch series builds on
mh/iterate-refs, with which Julian's patch conflicts.  In other words,
it would be a real mess to reroll my series on top of Julian's patch.
(That is of course not to imply that I hold a mutex on refs.c.)  Because
my series changes a data structure that is used throughout refs.c, it
touches a lot of lines of code.

I think that the switch from linked list + linear search to array plus
binary search is a pretty obvious win in terms of code complexity and
*potential* performance improvement, but empirically I haven't seen any
claims that it brings performance improvements beyond "René's patch".
(Though, honestly, I've lost track of which "René's patch" is being
discussed and I don't see anything relevant in Junio's tree.)

Intuitively, given that populating the reference cache involves O(N)
I/O, speeding up lookups can only help if there are very many ref
lookups within a single git invocation.  I think we will get a bigger
improvement by avoiding the reading of unneeded loose refs, loading
them one subdirectory at a time instead of always reading them en masse.
I wanted to reach that milestone before submitting my changes.
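To illustrate the subdirectory idea, here is a hypothetical sketch (names invented; the real loading code in refs.c operates on actual directories): each refs/ subdirectory gets its own lazily populated cache slot, so resolving a branch never forces reading refs/tags/.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: one cache slot per refs/ subdirectory,
 * populated on first access only. */
struct ref_subdir {
	const char *prefix;	/* e.g. "refs/heads/" */
	int loaded;
	int load_count;		/* how often we (would) hit the disk */
};

static void load_subdir(struct ref_subdir *d)
{
	if (d->loaded)
		return;		/* already cached, no I/O */
	/* ... the real code would readdir() the directory here ... */
	d->load_count++;
	d->loaded = 1;
}

/* Load only the subdirectory that can contain refname. */
static struct ref_subdir *subdir_for_ref(struct ref_subdir *dirs, int n,
					 const char *refname)
{
	int i;
	for (i = 0; i < n; i++)
		if (!strncmp(refname, dirs[i].prefix, strlen(dirs[i].prefix))) {
			load_subdir(&dirs[i]);
			return &dirs[i];
		}
	return NULL;
}
```

With 100k tags, looking up refs/heads/master would load refs/heads/ once and refs/tags/ not at all.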

My preference would be:

1. Merge jp/get-ref-dir-unsorted, perhaps even into maint.  It is a
simple, noninvasive, and obvious improvement and helps performance a lot
in an important use case.

2. Hold off on merging jp/get-ref-dir-unsorted for a while to give me a
chance to avoid conflict hell.

3. Evaluate René's patch on its own merits; if it makes sense regardless
of the binary search speedups, then it can be accepted independently to
give most of the performance benefit already.

Are there any other patches in this area that I've forgotten?

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-01 20:41                                                 ` Junio C Hamano
@ 2011-10-02  5:19                                                   ` Michael Haggerty
  2011-10-03  0:46                                                     ` Martin Fick
  0 siblings, 1 reply; 126+ messages in thread
From: Michael Haggerty @ 2011-10-02  5:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Fick, git, Christian Couder, Thomas Rast,
	René Scharfe, Julian Phillips

On 10/01/2011 10:41 PM, Junio C Hamano wrote:
> Martin Fick <mfick@codeaurora.org> writes:
>> I guess this makes sense, we invalidate the cache and have 
>> to rebuild it after every new ref is added?  Perhaps a 
>> simple fix would be to move the invalidation right after all 
>> the refs are updated?  Maybe write_ref_sha1 could take in a 
>> flag to tell it to not invalidate the cache so that during 
>> iterative updates it could be disabled and then run manually 
>> after the update?
> 
> It might make sense, on top of Julian's patch, to add a bit that says "the
> contents of this ref-array is current but the array is not sorted", and
> whenever somebody runs add_ref(), append it also to the ref-array (so that
> the contents do not have to be re-read from the filesystem) but flip the
> "unsorted" bit on. Then update look-up and iteration to sort the array
> when "unsorted" bit is on without re-reading the contents from the
> filesystem.

My WIP patch series does one better than this; it keeps track of what
part of the array is already sorted so that a reference can be found in
the sorted part of the array using binary search, and if it is not found
there a linear search is done through the unsorted part of the array.  I
also have some code (not pushed) that adds some intelligence to make the
use case

    repeat many times:
        check if reference exists
        add reference

efficient by picking optimal intervals to re-sort the array.  (This sort
can also be faster if most of the array is already sorted: sort the new
entries using qsort then merge sort them into the already-sorted part of
the list.)
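A sketch of such a lookup over a partially sorted array (hypothetical layout; the WIP code described above may differ): binary-search the sorted prefix, then scan the unsorted tail linearly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: names[0..sorted_nr) are sorted,
 * names[sorted_nr..nr) were appended later and are unsorted. */
struct ref_array {
	const char **names;
	int nr;
	int sorted_nr;
};

static int cmp_str(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Binary search the sorted prefix, fall back to a linear scan
 * of the unsorted tail; returns the index or -1 if absent. */
static int find_ref(const struct ref_array *a, const char *name)
{
	const char **hit = bsearch(&name, a->names, a->sorted_nr,
				   sizeof(*a->names), cmp_str);
	int i;

	if (hit)
		return (int)(hit - a->names);
	for (i = a->sorted_nr; i < a->nr; i++)
		if (!strcmp(a->names[i], name))
			return i;
	return -1;
}
```

Lookups stay O(log n) as long as the unsorted tail is kept short by occasional re-sorting.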

But there is another reason that we cannot currently update the
reference cache on the fly rather than invalidating it after each
change: symbolic references are stored *resolved* in the reference
cache, and no record is kept of the reference that they refer to.
Therefore it is possible that the addition or modification of an
arbitrary reference can affect how a symbolic reference is resolved, but
there is not enough information in the cache to track this.

IMO the correct solution is to store symbolic references un-resolved.
Given that lookup is going to become much faster, the slowdown in
reference resolution should not be a big performance penalty, whereas
reference updating could become *much* faster.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-10-02  5:15                                                                 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty
@ 2011-10-02  5:45                                                                   ` Junio C Hamano
  2011-10-04 20:58                                                                     ` Junio C Hamano
  0 siblings, 1 reply; 126+ messages in thread
From: Junio C Hamano @ 2011-10-02  5:45 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Um, well, my patch series includes the same changes that Julian's wants
> to introduce, but following lots of other changes, cleanups,
> documentation improvements, etc.  Moreover, my patch series builds on
> mh/iterate-refs, with which Julian's patch conflicts.  In other words,
> it would be a real mess to reroll my series on top of Julian's patch.

Conflicts during re-rolling were not something I was worried too much
about---that is just a fact of life. We cannot easily resolve two topics
that want to go in totally different direction, but we should be able to
converge two topics that want to take the same approach in the end,
especially one is a subset of the other.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-02  5:19                                                   ` Michael Haggerty
@ 2011-10-03  0:46                                                     ` Martin Fick
  2011-10-04  8:08                                                       ` Michael Haggerty
  0 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-10-03  0:46 UTC (permalink / raw)
  To: Michael Haggerty, Junio C Hamano
  Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips



Michael Haggerty <mhagger@alum.mit.edu> wrote:

>On 10/01/2011 10:41 PM, Junio C Hamano wrote:
>> Martin Fick <mfick@codeaurora.org> writes:
>>> I guess this makes sense, we invalidate the cache and have 
>>> to rebuild it after every new ref is added?  Perhaps a 
>>> simple fix would be to move the invalidation right after all 
>>> the refs are updated?  Maybe write_ref_sha1 could take in a 
>>> flag to tell it to not invalidate the cache so that during 
>>> iterative updates it could be disabled and then run manually 
>>> after the update?
>> 
>I
>also have some code (not pushed) that adds some intelligence to make
>the use case
>
>    repeat many times:
>        check if reference exists
>        add reference

Would it be possible to separate the two steps into separate loops somehow?  Could it instead look like this:
 
>    repeat many times:
>        check if reference exists
 
>    repeat many times:
>        add reference

It might be difficult with the current functions to achieve this, but it would allow the cache to be invalidated over and over in loop two without impacting performance, since all the lookups could be done in the first loop.  Of course, this would likely require checking for dups before running the first loop.

-Martin
Employee of Qualcomm Innovation Center, Inc., which is a member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-30 22:06                                               ` Martin Fick
  2011-10-01 20:41                                                 ` Junio C Hamano
@ 2011-10-03 18:12                                                 ` Martin Fick
  2011-10-03 19:42                                                   ` Junio C Hamano
  2011-10-04  8:16                                                   ` Michael Haggerty
  2011-10-08 20:59                                                 ` Martin Fick
  2 siblings, 2 replies; 126+ messages in thread
From: Martin Fick @ 2011-10-03 18:12 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote:
> 
> OK, I narrowed it down I believe.  If I comment out the
> invalidate_cached_refs() line in write_ref_sha1(), it
> speeds through this section.
> 
> I guess this makes sense, we invalidate the cache and
> have to rebuild it after every new ref is added? 
> Perhaps a simple fix would be to move the invalidation
> right after all the refs are updated?  Maybe
> write_ref_sha1 could take in a flag to tell it to not
> invalidate the cache so that during iterative updates it
> could be disabled and then run manually after the
> update?

Would this solution be acceptable if I submitted a patch to 
do it?  My test shows that this will make a full fetch of 
~80K changes go from 4:50min to 1:50min.

-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-03 18:12                                                 ` Martin Fick
@ 2011-10-03 19:42                                                   ` Junio C Hamano
  2011-10-04  8:16                                                   ` Michael Haggerty
  1 sibling, 0 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-10-03 19:42 UTC (permalink / raw)
  To: Martin Fick
  Cc: git, Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

Martin Fick <mfick@codeaurora.org> writes:

>> I guess this makes sense, we invalidate the cache and
>> have to rebuild it after every new ref is added? 
>> Perhaps a simple fix would be to move the invalidation
>> right after all the refs are updated?  Maybe
>> write_ref_sha1 could take in a flag to tell it to not
>> invalidate the cache so that during iterative updates it
>> could be disabled and then run manually after the
>> update?
>
> Would this solution be acceptable if I submitted a patch to 
> do it?  My test shows that this will make a full fetch of 
> ~80K changes go from 4:50min to 1:50min,

As long as the resulting code does not introduce new races with another
process updating refs while the bulk update is running, I wouldn't have an
issue with it.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-03  0:46                                                     ` Martin Fick
@ 2011-10-04  8:08                                                       ` Michael Haggerty
  0 siblings, 0 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-10-04  8:08 UTC (permalink / raw)
  To: Martin Fick
  Cc: Junio C Hamano, git, Christian Couder, Thomas Rast,
	René Scharfe, Julian Phillips

On 10/03/2011 02:46 AM, Martin Fick wrote:
> Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> I
>> also have some code (not pushed) that adds some intelligence to make
>> the use case
>>
>>    repeat many times:
>>        check if reference exists
>>        add reference
> 
> Would it be possible to separate the two steps into separate loops somehow?  Could it instead look like this:
>  
>>    repeat many times:
>>        check if reference exists
>  
>>    repeat many times:
>>        add reference

Undoubtedly this would be possible.  But I'd rather make the refs code
efficient and general enough that its users don't need to worry about
such things.

> [...] Of course, this would likely require checking for dups
> before running the first loop.

Yes, and this "checking for dups before running the first loop" is
approximately the same work that would have to be done within a smarter
version of the refs code.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-03 18:12                                                 ` Martin Fick
  2011-10-03 19:42                                                   ` Junio C Hamano
@ 2011-10-04  8:16                                                   ` Michael Haggerty
  1 sibling, 0 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-10-04  8:16 UTC (permalink / raw)
  To: Martin Fick
  Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips

On 10/03/2011 08:12 PM, Martin Fick wrote:
> On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote:
>> OK, I narrowed it down I believe.  If I comment out the
>> invalidate_cached_refs() line in write_ref_sha1(), it
>> speeds through this section.
>>
>> I guess this makes sense, we invalidate the cache and
>> have to rebuild it after every new ref is added? 
>> Perhaps a simple fix would be to move the invalidation
>> right after all the refs are updated?  Maybe
>> write_ref_sha1 could take in a flag to tell it to not
>> invalidate the cache so that during iterative updates it
>> could be disabled and then run manually after the
>> update?
> 
> Would this solution be acceptable if I submitted a patch to 
> do it?  My test shows that this will make a full fetch of 
> ~80K changes go from 4:50min to 1:50min,

No, no, no.  Let's fix up the refs cache once and for all and avoid
adding special case code all over the place.

* With minor changes, we can make it possible to invalidate single refs
instead of the whole refs cache.  And we can teach the refs code to
invalidate refs by itself when necessary, so that other code can become
stupider and more decoupled from the refs code.

* With other minor changes (mostly implemented), we can support a
partly-sorted refs list that decides intelligently when to resort
itself.  This will give most of the performance benefit of circumventing
the refs cache API with none of the chaos.
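
[Editor's note: the lazily-sorted list described above can be sketched in a few
lines of C.  The struct and function names below are illustrative only, not
git's actual refs.c API: appends are O(1), and the array is re-sorted only
when a lookup actually needs it, instead of after every insertion.]

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a partly-sorted ref list, not git's refs.c. */
struct ref_entry { const char *name; unsigned char sha1[20]; };

struct ref_list {
	struct ref_entry *entries;
	size_t nr, alloc;
	size_t sorted;	/* entries[0..sorted) are known to be sorted */
};

static int cmp_ref(const void *a, const void *b)
{
	return strcmp(((const struct ref_entry *)a)->name,
		      ((const struct ref_entry *)b)->name);
}

static void ref_list_add(struct ref_list *list, struct ref_entry entry)
{
	if (list->nr == list->alloc) {
		list->alloc = list->alloc ? list->alloc * 2 : 16;
		/* error handling omitted for brevity */
		list->entries = realloc(list->entries,
					list->alloc * sizeof(*list->entries));
	}
	list->entries[list->nr++] = entry;	/* do NOT sort here */
}

static struct ref_entry *ref_list_find(struct ref_list *list, const char *name)
{
	struct ref_entry key = { name };
	if (list->sorted < list->nr) {	/* resort only when a lookup needs it */
		qsort(list->entries, list->nr, sizeof(*list->entries), cmp_ref);
		list->sorted = list->nr;
	}
	return bsearch(&key, list->entries, list->nr,
		       sizeof(*list->entries), cmp_ref);
}
```

The point is that the qsort cost is paid once per batch of insertions rather
than once per insertion, while lookups still get O(log n) bsearch.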

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: [PATCH v3] refs: Use binary search to lookup refs faster
  2011-10-02  5:45                                                                   ` Junio C Hamano
@ 2011-10-04 20:58                                                                     ` Junio C Hamano
  0 siblings, 0 replies; 126+ messages in thread
From: Junio C Hamano @ 2011-10-04 20:58 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Julian Phillips, Martin Fick, Christian Couder, git,
	Christian Couder, Thomas Rast

Junio C Hamano <gitster@pobox.com> writes:

> Michael Haggerty <mhagger@alum.mit.edu> writes:
>
>> Um, well, my patch series includes the same changes that Julian's wants
>> to introduce, but following lots of other changes, cleanups,
>> documentation improvements, etc.  Moreover, my patch series builds on
>> mh/iterate-refs, with which Julian's patch conflicts.  In other words,
>> it would be a real mess to reroll my series on top of Julian's patch.
>
> Conflicts during re-rolling was not something I was worried too much
> about---that is just the fact of life. We cannot easily resolve two topics
> that want to go in totally different direction, but we should be able to
> converge two topics that want to take the same approach in the end,
> especially one is a subset of the other.

Ah, also I should have noted that I have a fix-up between mh/iterate-refs
and Julian's patch already queued on 'pu'.

I am planning to make mh/iterate-refs graduate to 'master' soonish, so
hopefully things will become simpler.

Thanks.

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-09-30 22:06                                               ` Martin Fick
  2011-10-01 20:41                                                 ` Junio C Hamano
  2011-10-03 18:12                                                 ` Martin Fick
@ 2011-10-08 20:59                                                 ` Martin Fick
  2011-10-09  5:43                                                   ` Michael Haggerty
  2 siblings, 1 reply; 126+ messages in thread
From: Martin Fick @ 2011-10-08 20:59 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Thomas Rast, René Scharfe,
	Julian Phillips, Michael Haggerty

On Friday, September 30, 2011 04:06:31 pm Martin Fick wrote:
> On Friday, September 30, 2011 03:02:30 pm Martin Fick wrote:
> > On Friday, September 30, 2011 10:41:13 am Martin Fick wrote:
> > Since a full sync is now done to about 5mins, I broke
> > down the output a bit.  It appears that the longest
> > part (2:45m) is now the time spent scrolling though
> > each
> > 
> > change still. Each one of these takes about 2ms:
> >  * [new branch]      refs/changes/99/71199/1 ->
> > 
> > refs/changes/99/71199/1
> > 
> > Seems fast, but at about 80K... So, are there any
> > obvious N loops over the refs happening inside each of
> > of the [new branch] iterations?
> 
> OK, I narrowed it down I believe.  If I comment out the
> invalidate_cached_refs() line in write_ref_sha1(), it
> speeds through this section.
> 
> I guess this makes sense, we invalidate the cache and
> have to rebuild it after every new ref is added? 
> Perhaps a simple fix would be to move the invalidation
> right after all the refs are updated?  Maybe
> write_ref_sha1 could take in a flag to tell it to not
> invalidate the cache so that during iterative updates it
> could be disabled and then run manually after the
> update?

OK, this thing has been bugging me...

I found some more surprising results; I hope you can follow, 
because there are corner cases here which have surprising 
impacts.


** Important fact: 
** ---------------
** When I clone my repo, it has about 4K tags which
** come in packed to the clone.
**

This fact has a heavy impact on how I test things. If I 
choose to delete these packed-refs from the cloned repo and 
then do a fetch of the changes, all of the tags are also 
fetched along with these changes.  This means that if I want 
to test the impact of having packed-refs vs no packed refs, 
on my change fetches, I need to first delete the packed-refs 
file, and second fetch all the tags again, so that when I 
fetch the changes, the repo only actually fetches changes, 
not all the tags!

So, with this in mind, I have discovered that the fetch 
performance degradation from invalidating the caches in 
write_ref_sha1() is actually due to the packed-refs being 
reloaded and resorted again on each ref insertion (not the 
loose refs)!!!

Remember the important fact above?  Yeah, those silly 4K 
refs (not a huge number, not 61K!) take a while to reread 
from the file and sort.  When this is done for 61K changes, 
it adds a lot of time to a fetch.  The sad part is that, of 
course, the packed-refs don't really need to be invalidated 
since we never add new refs as packed refs during a fetch 
(but apparently we do during a clone)!  Also noteworthy is 
that invalidating the loose refs, does not cause a big 
delay.


Some data:

1) A fetch of the changes in my series with all good 
external patches applied takes about 7:30min.


2) A fetch of the changes as in #1, but with invalidate_cached_refs() 
commented out in write_ref_sha1(), takes about 1:50min.


3) A fetch of the changes as in #1, but with the 
invalidate_cached_refs() call in write_ref_sha1() replaced with a 
call to my custom invalidate_loose_cache_refs(), takes about 
1:50min.


4) A fetch with #1 on a repo with packed-refs deleted after 
the clone, takes about ~5min.  

** This is a strange regression which threw me off.  In this 
case, all the tags are refetched in addition to the changes; 
this seems to cause some weird interaction that makes things 
take longer than they should (#5 + #6 = 2:10m  <<  #4 5min).


5) A fetch with #1 on a repo with packed-refs deleted after 
the clone, and then a fetch done to get all the tags (see 
#6), takes only 1:30m!!!!


6) A fetch to get all the **TAGS** with packed-refs deleted 
after the clone, takes about 40s.



---Additional side data/tests:

7) A fetch of the changes with #1 and a special flag causing 
the packed-refs to be read from the file, but not parsed or 
sorted, takes 2:34min.  So just the repeated reads add at 
least 40s.


8) A fetch of the changes with #1 and a special flag causing 
the packed-refs to be read from the file, parsed, but NOT 
sorted, takes 3:40min.  So the parsing appears to take an 
additional minute at least.




I think that all of this might explain why no matter how 
good Michael's intentions are with his patch series, his 
series isn't likely to fix this problem unless he does not 
invalidate the packed-refs after each insertion.  I tried 
preventing this invalidation in his series to prove this, 
but unfortunately, it appears that in his series it is no 
longer possible to only invalidate just the packed-refs? :(
Michael, I hope I am completely wrong about that...


Are there any good consistency reasons to invalidate the 
packed refs in write_ref_sha1()?  If not, would you accept a 
patch to simply skip this invalidation (to only invalidate 
the loose refs)?
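
[Editor's note: the split proposed here, invalidating only the loose half of
the cache on a ref write, can be modeled with a toy sketch.  All names below
are made up for illustration; git's real ref cache is more involved.  The
counters stand in for the expensive packed-refs reload-and-sort, showing how
the loose-only variant avoids re-reading packed-refs on every insertion.]

```c
/* Toy model of a refs cache with separately invalidatable halves. */
struct toy_ref_cache {
	int loose_valid, packed_valid;
	int loose_reloads, packed_reloads;
};

static void prime_cache(struct toy_ref_cache *c)
{
	if (!c->loose_valid)  { c->loose_reloads++;  c->loose_valid = 1; }
	if (!c->packed_valid) { c->packed_reloads++; c->packed_valid = 1; }
}

/* What write_ref_sha1() effectively did: throw away everything. */
static void invalidate_cached_refs_toy(struct toy_ref_cache *c)
{
	c->loose_valid = c->packed_valid = 0;
}

/* The proposal: writing a loose ref can't change any packed ref,
 * so only drop the loose half. */
static void invalidate_loose_cached_refs_toy(struct toy_ref_cache *c)
{
	c->loose_valid = 0;
}

static void fetch_many_refs(struct toy_ref_cache *c, int n, int loose_only)
{
	for (int i = 0; i < n; i++) {
		prime_cache(c);	/* each ref update consults the cache */
		if (loose_only)
			invalidate_loose_cached_refs_toy(c);
		else
			invalidate_cached_refs_toy(c);
	}
}
```

With full invalidation the packed-refs "file" is reloaded once per ref
written; with loose-only invalidation it is loaded exactly once for the whole
batch, which matches the 7:30min-to-1:50min difference reported above.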

Thanks,
 
-Martin

-- 
Employee of Qualcomm Innovation Center, Inc. which is a 
member of Code Aurora Forum

^ permalink raw reply	[flat|nested] 126+ messages in thread

* Re: Git is not scalable with too many refs/*
  2011-10-08 20:59                                                 ` Martin Fick
@ 2011-10-09  5:43                                                   ` Michael Haggerty
  0 siblings, 0 replies; 126+ messages in thread
From: Michael Haggerty @ 2011-10-09  5:43 UTC (permalink / raw)
  To: Martin Fick
  Cc: git, Christian Couder, Thomas Rast, René Scharfe, Julian Phillips

[-- Attachment #1: Type: text/plain, Size: 2441 bytes --]

On 10/08/2011 10:59 PM, Martin Fick wrote:
> [...]
> So, with this in mind, I have discovered, that the fetch 
> performance degradation by invalidating the caches in 
> write_ref_sha1() is actually due to the packed-refs being 
> reloaded and resorted again on each ref insertion (not the 
> loose refs)!!!

Good point.

> I think that all of this might explain why no matter how 
> good Michael's intentions are with his patch series, his 
> series isn't likely to fix this problem

I never claimed that my patch fixes all use cases, or cures cancer
either :-)  One step at a time.

>                                         unless he does not
> invalidate the packed-refs after each insertion.  I tried 
> preventing this invalidation in his series to prove this, 
> but unfortunately, it appears that in his series it is no 
> longer possible to only invalidate just the packed-refs? :(
> Michael, I hope I am completely wrong about that...

Yes, you are completely wrong.  I just implemented more selective cache
invalidation on top of the patch series.

I think your suggestion is safe because only non-symbolic references can
be stored in the packed refs; therefore the modification of a loose ref
can never affect the value of a packed ref.  Of course a loose ref can
*hide* the value of a packed ref, but in such cases the packed ref is
never read anyway.  And the *deletion* of a loose ref can expose a
previously-hidden packed ref, but this case is handled by delete_ref(),
which explicitly invalidates the packed-ref cache.
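
[Editor's note: that precedence argument can be illustrated with a toy
resolver; the helpers below are hypothetical, not git's code.  Loose refs are
consulted first, so a stale packed entry is never read while a loose ref of
the same name exists, and only deleting the loose ref exposes the packed one,
which is exactly the case delete_ref() handles.]

```c
#include <string.h>
#include <stddef.h>

/* Toy (name, value) ref stores; a linear scan is enough for illustration. */
struct toy_ref { const char *name; const char *value; };

static const char *lookup(const struct toy_ref *store, size_t n,
			  const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (!strcmp(store[i].name, name))
			return store[i].value;
	return NULL;
}

/* Loose refs take precedence; packed-refs is only a fallback. */
static const char *resolve_ref_toy(const struct toy_ref *loose, size_t nl,
				   const struct toy_ref *packed, size_t np,
				   const char *name)
{
	const char *v = lookup(loose, nl, name);
	return v ? v : lookup(packed, np, name);
}
```

Because the packed entry is only reachable once the loose ref is gone, a
write to a loose ref can leave the packed-ref cache intact without any lookup
ever seeing a wrong value.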

While I was at it, I also:

* In delete_ref(), only invalidate the packed reference cache if the
reference that is being deleted actually *is* among the packed references.

* Changed the code to stop invalidating the ref caches for submodules.
In the code paths where the cache invalidation was being done, only
main-module references were being changed.  However, I'm not familiar
enough with submodules to know if/when submodule references *can* be
changed.  It could be that the submodule reference caches have to be
invalidated under some circumstances; the current code might be buggy in
this area.

The changes are pushed to github.  They don't make any significant
difference to my "refperf" results (attached), so perhaps a new
benchmark should be added.  But I'm curious to see how they affect your
timings.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

[-- Attachment #2: refperf-summary.out --]
[-- Type: text/plain, Size: 5251 bytes --]

===================================  =======  =======  =======  =======  =======  =======  =======  =======  =======  =======
Test name                                [0]      [1]      [2]      [3]      [4]      [5]      [6]      [7]      [8]      [9]
===================================  =======  =======  =======  =======  =======  =======  =======  =======  =======  =======
branch-loose-cold                       3.19     3.15     3.10     3.19     3.25     0.70     0.61     0.74     0.66     0.56
branch-loose-warm                       0.19     0.19     0.20     0.19     0.19     0.00     0.00     0.00     0.00     0.00
for-each-ref-loose-cold                 3.73     3.45     3.55     3.39     3.44     3.40     3.50     3.52     3.70     3.51
for-each-ref-loose-warm                 0.44     0.44     0.44     0.43     0.43     0.43     0.43     0.43     0.43     0.43
checkout-loose-cold                     3.35     3.23     3.23     3.15     3.29     0.65     0.71     0.76     0.66     0.69
checkout-loose-warm                     0.19     0.19     0.20     0.18     0.19     0.01     0.01     0.01     0.01     0.00
checkout-orphan-loose                   0.19     0.19     0.19     0.18     0.19     0.00     0.00     0.00     0.00     0.00
checkout-from-detached-loose-cold       7.80     4.17     4.17     4.05     4.09     4.07     4.26     4.23     4.18     4.08
checkout-from-detached-loose-warm       1.01     1.01     1.02     1.02     1.04     1.03     1.04     1.04     1.02     1.04
branch-contains-loose-cold             35.76    35.80    36.15    36.67    35.13    36.29    36.37    36.03    36.70    36.01
branch-contains-loose-warm             33.01    33.62    33.52    33.51    32.41    33.51    33.71    32.10    33.70    31.99
pack-refs-loose                         4.19     4.20     4.25     4.21     4.20     4.21     4.20     4.19     4.24     4.21
branch-packed-cold                      0.79     0.62     0.60     0.66     0.65     0.58     0.68     0.72     0.60     0.61
branch-packed-warm                      0.02     0.02     0.02     0.02     0.02     0.02     0.02     0.02     0.02     0.02
for-each-ref-packed-cold                0.96     0.97     0.97     0.93     0.89     0.92     0.98     0.96     0.92     0.96
for-each-ref-packed-warm                0.26     0.26     0.26     0.26     0.26     0.26     0.26     0.27     0.27     0.27
checkout-packed-cold                   16.14    16.16    16.74     2.04     2.03     2.09     2.06     2.13     2.03     2.00
checkout-packed-warm                    0.17     0.17     0.18     0.19     0.18     0.17     0.27     0.18     0.19     0.18
checkout-orphan-packed                  0.02     0.01     0.02     0.02     0.02     0.02     0.02     0.02     0.02     0.02
checkout-from-detached-packed-cold     16.24    15.96    16.80     1.99     2.06     2.01     2.08     2.10     1.97     1.96
checkout-from-detached-packed-warm     15.04    14.96    15.76     0.77     0.81     0.79     0.83     0.80     0.79     0.80
branch-contains-packed-cold            36.18    36.98    36.92    35.19    34.97    35.09    33.34    33.87    34.27    34.51
branch-contains-packed-warm            35.27    35.12    36.20    33.52    32.76    33.49    33.65    32.96    33.68    32.34
clone-loose-cold                        9.09     9.22     9.15     9.10     9.19     9.03     9.09     9.25     8.96     9.03
clone-loose-warm                        5.57     5.85     5.65     5.55     5.61     5.64     5.65     5.61     5.74     5.59
fetch-nothing-loose                     1.43     1.43     1.44     1.44     1.45     1.45     1.46     1.44     1.44     1.44
pack-refs                               0.08     0.08     0.08     0.08     0.09     0.08     0.09     0.08     0.08     0.08
fetch-nothing-packed                    1.44     1.43     1.44     1.44     1.44     1.44     1.44     1.44     1.44     1.44
clone-packed-cold                       1.35     1.26     1.30     1.32     1.28     1.35     1.38     1.35     1.29     1.21
clone-packed-warm                       0.36     0.35     0.35     0.36     0.36     0.36     0.35     0.36     0.37     0.35
fetch-everything-cold                  30.29    30.01    29.79    29.04    29.84    29.25    29.30    29.26    29.76    29.30
fetch-everything-warm                  26.20    26.04    26.40    25.60    26.22    25.83    25.82    25.85    26.68    25.73
===================================  =======  =======  =======  =======  =======  =======  =======  =======  =======  =======


[0] f696543 (tag: v1.7.6) Git 1.7.6
[1] 703f05a (tag: v1.7.7) Git 1.7.7
[2] 27897d2 (origin/master) Merge remote-tracking branch 'gitster/mh/iterate-refs'
[3] 558b49c is_refname_available(): reimplement using do_for_each_ref_in_list()
[4] 1658397 Store references hierarchically
[5] 5f5a126 get_ref_dir(): add a recursive option
[6] a306af1 get_ref_dir(): read one whole directory before descending into subdirs
[7] fd53cf7 add_ref(): change to take a (struct ref_entry *) as second argument
[8] 9944c7f (origin/testing) read_packed_refs(): keep track of the directory being worked in
[9] cb75c57 (origin/ok, origin/hierarchical-refs, origin/HEAD) refs.c: call clear_cached_ref_cache() from repack_without_ref()


^ permalink raw reply	[flat|nested] 126+ messages in thread

end of thread, other threads:[~2011-10-09  5:44 UTC | newest]

Thread overview: 126+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-09  3:44 Git is not scalable with too many refs/* NAKAMURA Takumi
2011-06-09  6:50 ` Sverre Rabbelier
2011-06-09 15:23   ` Shawn Pearce
2011-06-09 15:52     ` A Large Angry SCM
2011-06-09 15:56       ` Shawn Pearce
2011-06-09 16:26         ` Jeff King
2011-06-10  3:59           ` NAKAMURA Takumi
2011-06-13 22:27             ` Jeff King
2011-06-14  0:17             ` Andreas Ericsson
2011-06-14  0:30               ` Jeff King
2011-06-14  4:41                 ` Junio C Hamano
2011-06-14  7:26                   ` Sverre Rabbelier
2011-06-14 10:02                     ` Johan Herland
2011-06-14 10:34                       ` Sverre Rabbelier
2011-06-14 17:02                       ` Jeff King
2011-06-14 19:20                         ` Shawn Pearce
2011-06-14 19:47                           ` Jeff King
2011-06-14 20:12                             ` Shawn Pearce
2011-09-08 19:53                               ` Martin Fick
2011-09-09  0:52                                 ` Martin Fick
2011-09-09  1:05                                   ` Thomas Rast
2011-09-09  1:13                                     ` Thomas Rast
2011-09-09 15:59                                   ` Jens Lehmann
2011-09-25 20:43                                   ` Martin Fick
2011-09-26 12:41                                     ` Christian Couder
2011-09-26 17:47                                       ` Martin Fick
2011-09-26 18:56                                         ` Christian Couder
2011-09-30 16:41                                           ` Martin Fick
2011-09-30 19:26                                             ` Martin Fick
2011-09-30 21:02                                             ` Martin Fick
2011-09-30 22:06                                               ` Martin Fick
2011-10-01 20:41                                                 ` Junio C Hamano
2011-10-02  5:19                                                   ` Michael Haggerty
2011-10-03  0:46                                                     ` Martin Fick
2011-10-04  8:08                                                       ` Michael Haggerty
2011-10-03 18:12                                                 ` Martin Fick
2011-10-03 19:42                                                   ` Junio C Hamano
2011-10-04  8:16                                                   ` Michael Haggerty
2011-10-08 20:59                                                 ` Martin Fick
2011-10-09  5:43                                                   ` Michael Haggerty
2011-09-28 19:38                                       ` Martin Fick
2011-09-28 22:10                                         ` Martin Fick
2011-09-29  0:54                                           ` Julian Phillips
2011-09-29  1:37                                             ` Martin Fick
2011-09-29  2:19                                               ` Julian Phillips
2011-09-29 16:38                                                 ` Martin Fick
2011-09-29 18:26                                                   ` Julian Phillips
2011-09-29 18:27                                                 ` René Scharfe
2011-09-29 19:10                                                   ` Junio C Hamano
2011-09-29  4:18                                                     ` [PATCH] refs: Use binary search to lookup refs faster Julian Phillips
2011-09-29 21:57                                                       ` Junio C Hamano
2011-09-29 22:04                                                       ` [PATCH v2] " Julian Phillips
2011-09-29 22:06                                                       ` [PATCH] " Junio C Hamano
2011-09-29 22:11                                                         ` [PATCH v3] " Julian Phillips
2011-09-29 23:48                                                           ` Junio C Hamano
2011-09-30 15:30                                                             ` Michael Haggerty
2011-09-30 16:38                                                               ` Junio C Hamano
2011-09-30 17:56                                                                 ` [PATCH] refs: Remove duplicates after sorting with qsort Julian Phillips
2011-10-02  5:15                                                                 ` [PATCH v3] refs: Use binary search to lookup refs faster Michael Haggerty
2011-10-02  5:45                                                                   ` Junio C Hamano
2011-10-04 20:58                                                                     ` Junio C Hamano
2011-09-30  1:13                                                           ` Martin Fick
2011-09-30  3:44                                                             ` Junio C Hamano
2011-09-30  8:04                                                               ` Julian Phillips
2011-09-30 15:45                                                               ` Martin Fick
2011-09-29 20:44                                                     ` Git is not scalable with too many refs/* Martin Fick
2011-09-29 19:10                                                   ` Julian Phillips
2011-09-29 20:11                                                   ` Martin Fick
2011-09-30  9:12                                                     ` René Scharfe
2011-09-30 16:09                                                       ` Martin Fick
2011-09-30 16:52                                                       ` Junio C Hamano
2011-09-30 18:17                                                         ` René Scharfe
2011-10-01 15:28                                                           ` René Scharfe
2011-10-01 15:38                                                             ` [PATCH 1/8] checkout: check for "Previous HEAD" notice in t2020 René Scharfe
2011-10-01 19:02                                                               ` Sverre Rabbelier
2011-10-01 15:43                                                             ` [PATCH 2/8] revision: factor out add_pending_sha1 René Scharfe
2011-10-01 15:51                                                             ` [PATCH 3/8] checkout: use add_pending_{object,sha1} in orphan check René Scharfe
2011-10-01 15:56                                                             ` [PATCH 4/8] revision: add leak_pending flag René Scharfe
2011-10-01 16:01                                                             ` [PATCH 5/8] bisect: use " René Scharfe
2011-10-01 16:02                                                             ` [PATCH 6/8] bundle: " René Scharfe
2011-10-01 16:09                                                             ` [PATCH 7/8] checkout: " René Scharfe
2011-10-01 16:16                                                             ` [PATCH 8/8] commit: factor out clear_commit_marks_for_object_array René Scharfe
2011-09-26 15:15                                     ` Git is not scalable with too many refs/* Martin Fick
2011-09-26 15:21                                       ` Sverre Rabbelier
2011-09-26 15:48                                         ` Martin Fick
2011-09-26 15:56                                           ` Sverre Rabbelier
2011-09-26 16:38                                             ` Martin Fick
2011-09-26 16:49                                               ` Julian Phillips
2011-09-26 18:07                                       ` Martin Fick
2011-09-26 18:37                                         ` Julian Phillips
2011-09-26 20:01                                           ` Martin Fick
2011-09-26 20:07                                             ` Junio C Hamano
2011-09-26 20:28                                             ` Julian Phillips
2011-09-26 21:39                                               ` Martin Fick
2011-09-26 21:52                                                 ` Martin Fick
2011-09-26 23:26                                                   ` Julian Phillips
2011-09-26 23:37                                                     ` David Michael Barr
2011-09-27  1:01                                                       ` [PATCH] refs.c: Fix slowness with numerous loose refs David Barr
2011-09-27  2:04                                                         ` David Michael Barr
2011-09-26 23:38                                                     ` Git is not scalable with too many refs/* Junio C Hamano
2011-09-27  0:00                                                       ` [PATCH] Don't sort ref_list too early Julian Phillips
2011-10-02  4:58                                                         ` Michael Haggerty
2011-09-27  0:12                                                     ` Git is not scalable with too many refs/* Martin Fick
2011-09-27  0:22                                                       ` Julian Phillips
2011-09-27  2:34                                                         ` Martin Fick
2011-09-27  7:59                                                           ` Julian Phillips
2011-09-27  8:20                                                     ` Sverre Rabbelier
2011-09-27  9:01                                                       ` Julian Phillips
2011-09-27 10:01                                                         ` Sverre Rabbelier
2011-09-27 10:25                                                           ` Nguyen Thai Ngoc Duy
2011-09-27 11:07                                                         ` Michael Haggerty
2011-09-27 12:10                                                           ` Julian Phillips
2011-09-26 22:30                                                 ` Julian Phillips
2011-09-26 15:32                                     ` Michael Haggerty
2011-09-26 15:42                                       ` Martin Fick
2011-09-26 16:25                                         ` Thomas Rast
2011-09-09 13:50                                 ` Michael Haggerty
2011-09-09 15:51                                   ` Michael Haggerty
2011-09-09 16:03                                   ` Jens Lehmann
2011-06-10  7:41         ` Andreas Ericsson
2011-06-10 19:41           ` Shawn Pearce
2011-06-10 20:12             ` Jakub Narebski
2011-06-10 20:35             ` Jeff King
2011-06-13  7:08             ` Andreas Ericsson
2011-06-09 11:18 ` Jakub Narebski
2011-06-09 15:42   ` Stephen Bash
