All of lore.kernel.org
 help / color / mirror / Atom feed
From: Julian Phillips <julian@quantumfyre.co.uk>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags
Date: Wed, 16 Sep 2009 23:32:52 +0100 (BST)	[thread overview]
Message-ID: <alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk> (raw)
In-Reply-To: <7vbplb2pi7.fsf@alter.siamese.dyndns.org>

On Wed, 16 Sep 2009, Junio C Hamano wrote:

> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> I have a repository at $dayjob where fetch was taking ~30s to tell me
>> that there were no updates.
>>
>> It turns out that I appear to have added a nasty linear search of all
>> remote refs for every commit (i.e. tag^{}) tag ref way back in the
>> original C implementation of fetch.  This doesn't scale well to large
>> numbers of refs, so this replaces it with a hash table based lookup
>> instead, which brings the time down to a few seconds even for very large
>> ref counts.
>>
>> I haven't tested it with non-native transports, but there is no reason
>> to believe that the code should be transport specific.
>
> Very interesting.
>
> A few questions (not criticisms).
>
> * 1m50s to 4.5s is quite impressive, even if it is only in a repository
>   with unusual refs-vs-commits ratio, but I personally think 10 refs per
>   every commit is already on the borderline of being insane, and the
>   normal ratio would be more like 1 refs per every 10-20 commits.

I noticed the problem with a real repository at $dayjob, and did enough 
anaylsis to identiy the problem.  I deliberately created a repository that 
should emphasise the problem so that it was easy to get a handle on.

Having applied the ref-dict patches fetch on my $dayjob repo has gone from 
~30s to ~4.5s, which may not be as impressive - but is much easier to live 
with.

>   What are possible downsides with the new code in repositories with more
>   reasonable refs-vs-commits ratio?  A hash table (with a sensible hash
>   function) would almost always outperform linear search in an randomly
>   ordered collection, so my gut tells me that there won't be performance
>   downsides, but are there other potential issues we should worry about?

I guess that main thing would be the extra memory usage.

> * In an insanely large refs-vs-commits case, perhaps not 50000:1 but more
>   like 100:1, but with a history with far more than one commit, what is
>   the memory consumption?  Judging from a cursory view, I think the way
>   ref-dict re-uses struct ref might be quite suboptimal, as you are using
>   only next (for hash-bucket link), old_sha1[] and its name field, and
>   also your ref_dict_add() calls alloc_ref() which calls one calloc() per
>   requested ref, instead of attempting any bulk allocation.

Yeah, I just reused struct for speed and convience of developing, to 
veryify that ref-dict would give me the speed I wanted.  A final patch 
would want a more optimised version.  Except that I've thrown the whole 
hash table thing away anyway.

> * The outer loop is walking the list of refs from a transport, and the
>   inner loop is walking a copy of the same list of refs from the same
>   transport, looking for each refs/tags/X^{} what record, if any, existed
>   for refs/tags/X.
>
>   Would it make sense to further specialize your optimization?  For
>   example, something like...

I actually arrived at somthing similar to this myself, after realising 
that I could use string_list as a basis.

> * It is tempting to use a hash table when you have to deal with an
>   unordered collection, but in this case, wouldn't the refs obtained from
>   the transport (it's essentially a ls-remote output, isn't it?) be
>   sorted?  Can't you take advantage of that fact to optimize the loop,
>   without adding a specialized hash table implementation?

I wasn't sure if we could rely on the refs list being sorted.  But I've 
got a new version that uses an extra string_list instead that is actually 
slightly faster.  I'll post that shortly.

>   We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{} in
>   the ls-remote output.  And the inefficient loop is about finding
>   refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember the
>   tag ref we saw in the previous round, we can check with that first to
>   make sure our "sorted" assumption holds true, and optimize the loop out
>   that way, no?

-- 
Julian

  ---
Bilbo's First Law:
 	You cannot count friends that are all packed up in barrels.

  reply	other threads:[~2009-09-16 22:36 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-16  7:53 [RFC/PATCH 0/2] Speed up fetch with large number of tags Julian Phillips
2009-09-16  7:53 ` [RFC/PATCH 1/2] ref-dict: Add a set of functions for working with a ref dictionary Julian Phillips
2009-09-16  7:53 ` [RFC/PATCH 2/2] fetch: Speed up fetch by using " Julian Phillips
2009-09-16  9:44 ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Junio C Hamano
2009-09-16 22:32   ` Julian Phillips [this message]
2009-09-16 22:42     ` Shawn O. Pearce
2009-09-16 22:52       ` Junio C Hamano
2009-09-16 23:03         ` Shawn O. Pearce
2009-09-16 23:19           ` Junio C Hamano
2009-09-16 22:53     ` [RFC/PATCH v2] fetch: Speed up fetch by rewriting find_non_local_tags Julian Phillips
2009-09-16 23:15       ` Junio C Hamano
2009-09-16 23:46         ` Julian Phillips
2009-09-17  1:30           ` Julian Phillips
2009-09-17  7:13             ` Johan Herland
2009-09-17  7:33               ` [RFC/PATCH v3] " Julian Phillips
2009-09-16 22:46   ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Shawn O. Pearce
2009-09-22 20:36     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk \
    --to=julian@quantumfyre.co.uk \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.