Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags

From: Julian Phillips <julian@quantumfyre.co.uk>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags
Date: Wed, 16 Sep 2009 23:32:52 +0100 (BST)	[thread overview]
Message-ID: <alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk> (raw)
In-Reply-To: <7vbplb2pi7.fsf@alter.siamese.dyndns.org>

On Wed, 16 Sep 2009, Junio C Hamano wrote:

> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> I have a repository at $dayjob where fetch was taking ~30s to tell me
>> that there were no updates.
>>
>> It turns out that I appear to have added a nasty linear search of all
>> remote refs for every commit (i.e. tag^{}) tag ref way back in the
>> original C implementation of fetch.  This doesn't scale well to large
>> numbers of refs, so this replaces it with a hash table based lookup
>> instead, which brings the time down to a few seconds even for very large
>> ref counts.
>>
>> I haven't tested it with non-native transports, but there is no reason
>> to believe that the code should be transport specific.
>
> Very interesting.
>
> A few questions (not criticisms).
>
> * 1m50s to 4.5s is quite impressive, even if it is only in a repository
>   with unusual refs-vs-commits ratio, but I personally think 10 refs per
>   every commit is already on the borderline of being insane, and the
>   normal ratio would be more like 1 refs per every 10-20 commits.

I noticed the problem with a real repository at $dayjob, and did enough 
anaylsis to identiy the problem.  I deliberately created a repository that 
should emphasise the problem so that it was easy to get a handle on.

Having applied the ref-dict patches fetch on my $dayjob repo has gone from 
~30s to ~4.5s, which may not be as impressive - but is much easier to live 
with.

>   What are possible downsides with the new code in repositories with more
>   reasonable refs-vs-commits ratio?  A hash table (with a sensible hash
>   function) would almost always outperform linear search in an randomly
>   ordered collection, so my gut tells me that there won't be performance
>   downsides, but are there other potential issues we should worry about?

I guess that main thing would be the extra memory usage.

> * In an insanely large refs-vs-commits case, perhaps not 50000:1 but more
>   like 100:1, but with a history with far more than one commit, what is
>   the memory consumption?  Judging from a cursory view, I think the way
>   ref-dict re-uses struct ref might be quite suboptimal, as you are using
>   only next (for hash-bucket link), old_sha1[] and its name field, and
>   also your ref_dict_add() calls alloc_ref() which calls one calloc() per
>   requested ref, instead of attempting any bulk allocation.

Yeah, I just reused struct for speed and convience of developing, to 
veryify that ref-dict would give me the speed I wanted.  A final patch 
would want a more optimised version.  Except that I've thrown the whole 
hash table thing away anyway.

> * The outer loop is walking the list of refs from a transport, and the
>   inner loop is walking a copy of the same list of refs from the same
>   transport, looking for each refs/tags/X^{} what record, if any, existed
>   for refs/tags/X.
>
>   Would it make sense to further specialize your optimization?  For
>   example, something like...

I actually arrived at somthing similar to this myself, after realising 
that I could use string_list as a basis.

> * It is tempting to use a hash table when you have to deal with an
>   unordered collection, but in this case, wouldn't the refs obtained from
>   the transport (it's essentially a ls-remote output, isn't it?) be
>   sorted?  Can't you take advantage of that fact to optimize the loop,
>   without adding a specialized hash table implementation?

I wasn't sure if we could rely on the refs list being sorted.  But I've 
got a new version that uses an extra string_list instead that is actually 
slightly faster.  I'll post that shortly.

>   We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{} in
>   the ls-remote output.  And the inefficient loop is about finding
>   refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember the
>   tag ref we saw in the previous round, we can check with that first to
>   make sure our "sorted" assumption holds true, and optimize the loop out
>   that way, no?

-- 
Julian

  ---
Bilbo's First Law:
 	You cannot count friends that are all packed up in barrels.