Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand

From: Ben Peart <peartben@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, gitster@pobox.com, benpeart@microsoft.com,
	pclouds@gmail.com, christian.couder@gmail.com,
	git@jeffhostetler.com
Subject: Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
Date: Mon, 17 Jul 2017 16:09:17 -0400	[thread overview]
Message-ID: <d3f1f884-7b8a-885f-47cb-eca2b8ef0ecf@gmail.com> (raw)
In-Reply-To: <20170717110602.6fac89ea@twelve2.svl.corp.google.com>

On 7/17/2017 2:06 PM, Jonathan Tan wrote:
> About the difference between this patch and my patch set [1], besides
> the fact that this patch does not spawn separate processes for each
> missing object, which does seem like an improvement to me, this patch
> (i) does not use a list of promised objects (but instead communicates
> with the hook for each missing object), and (ii) provides backwards
> compatibility with other Git code (that does not know about promised
> objects) in a different way.
> 
> The costs and benefits of (i) are being discussed here [2]. As for (ii),
> I still think that my approach is better - I have commented more about
> this below.
> 
> Maybe the best approach is a combination of both our approaches.

Yes, in the context of the promised objects model patch series, the 
value of this patch series is primarily as a sample of how to use the 
sub-process mechanism to create a versioned interface for retrieving 
objects.

> 
> [1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/
> 
> [2] https://public-inbox.org/git/20170713123951.5cab1adc@twelve2.svl.corp.google.com/
> 
> On Fri, 14 Jul 2017 09:26:51 -0400
> Ben Peart <peartben@gmail.com> wrote:
> 
>> +------------------------
>> +packet: git> command=get
>> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
>> +packet: git> 0000
>> +------------------------
> 
> It would be useful to have this command support more than one SHA-1, so
> that hooks that know how to batch can do so.
> 

I agree.  Since nothing was using that capability yet, I decided to keep 
it simple and not add support for a feature that wasn't being used. The 
reason the interface is versioned is so that if/when something does need 
that capability, it can be added.

>> +static int subprocess_map_initialized;
>> +static struct hashmap subprocess_map;
> 
> The documentation of "tablesize" in "struct hashmap" states that it can
> be used to check if the hashmap is initialized, so
> subprocess_map_initialized is probably unnecessary.
> 

Nice.  That will make things a little simpler.

>>   static int check_and_freshen(const unsigned char *sha1, int freshen)
>>   {
>> -	return check_and_freshen_local(sha1, freshen) ||
>> -	       check_and_freshen_nonlocal(sha1, freshen);
>> +	int ret;
>> +	int already_retried = 0;
>> +
>> +retry:
>> +	ret = check_and_freshen_local(sha1, freshen) ||
>> +		check_and_freshen_nonlocal(sha1, freshen);
>> +	if (!ret && core_virtualize_objects && !already_retried) {
>> +		already_retried = 1;
>> +		if (!read_object_process(sha1))
>> +			goto retry;
>> +	}
>> +
>> +	return ret;
>>   }
> 
> Is this change meant to ensure that Git code that operates on loose
> objects directly (bypassing storage-agnostic functions such as
> sha1_object_info_extended() and has_sha1_file()) still work? If yes,
> this patch appears incomplete (for example, read_loose_object() needs to
> be changed too), and this seems like a difficult task - in my patch set
> [1], I ended up deciding to create a separate type of storage and
> instead looked at the code that operates on *packed* objects directly
> (because there were fewer such methods) to ensure that they would work
> correctly in the presence of a separate type of storage.
> 

Yes, with this set of patches, we've been running successfully on 
completely sparse clones (no commits, trees, or blobs) for several 
months.  read_loose_object() is only called by fsck when it is 
enumerating existing loose objects so does not need to be updated.

We have a few thousand developers making ~100K commits per week so in 
our particular usage, I'm fairly confident it works correctly.  That 
said, it is possible there is some code path I've missed. :)

> [1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/
>