Re: git-clone --config order & fetching extra refs during initial clone

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>,
	"Robin H. Johnson" <robbat2@gentoo.org>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: git-clone --config order & fetching extra refs during initial clone
Date: Thu, 4 May 2017 09:28:12 +0200	[thread overview]
Message-ID: <CACBZZX5706NELxAOWRVAx-QFPtZ_rAsRkTX811+jfrN4u47XfA@mail.gmail.com> (raw)
In-Reply-To: <CAM0VKjnjMEThuMvLEQByxWvxVvdzMSVsFKKstKLMweEx5UwTcg@mail.gmail.com>

On Wed, May 3, 2017 at 4:42 PM, SZEDER Gábor <szeder.dev@gmail.com> wrote:
> Cc'ing Ævar because of his work on 'clone --no-tags'.
>
> On Wed, Mar 15, 2017 at 6:08 PM, Jeff King <peff@peff.net> wrote:
>> On Sat, Mar 11, 2017 at 01:41:34AM +0100, SZEDER Gábor wrote:
>>> > Though if I'm bikeshedding, I'd probably have written the whole thing
>>> > with an argv_array and avoided counting at all.
>>>
>>> Yeah, I did notice that you love argv_array :)  I had to raise an
>>> eyebrow recently while looking at send-pack and how it uses argv_array
>>> to read refspecs from stdin into an array.  I think argv_array feels a
>>> bit out of place in both cases.  Yes, it does exactly what's needed.
>>> However, it's called *argv*_array, indicating that its contents is
>>> destined to become the options of some command.  But that's not the
>>> case in these two cases, we are not dealing with arguments to a
>>> command, these are just arrays of strings.
>>
>> In my mind, "argv" is synonymous with "NULL-terminated array of
>> strings". If the name is the only thing keeping it from wider use, I'd
>> much prefer us to give it a more generic name. All I really care about
>> is simplifying memory management. :)
>
> Whether its name is the _only_ thing keeping it from wider use, I
> don't know :)
>
> All I can say is that I was aware of argv_array, but because of
> its name it didn't even occur to me.  And even if I had considered it,
> I still wouldn't have used it here.  Had it been called string_array,
> I think I would have used it.
>
> On a related note, we have string_list, which is not a list but an
> ALLOC_GROW()-able array, and not that of strings (i.e. plan char*),
> but of structs with a string and an additional data field.
> Oh, well :)
>
>
>>> > I do also notice that right _after_ this parsing, we use remote_get(),
>>> > which is supposed to give us this config anyway. Which makes me wonder
>>> > if we could just reorder this to put remote_get() first, and then read
>>> > the resulting refspecs from remote->fetch.
>>>
>>> Though get_remote() does give us this config, at this point the
>>> default fetch refspec has not been configured yet, therefore it's not
>>> included in the resulting remote->fetch array.  The default refspec is
>>> set in write_refspec_config(), but that's called only about two
>>> screenfulls later.  So there is a bit of extra work to do in order to
>>> leverage get_remote()'s parsing.
>>>
>>> I think the simplest way is to keep parsing the default fetch refspec
>>> manually, and then append it to the remote->fetch array.  Definitely
>>> shorter and simpler than that parsing in the current patch.
>>> Alternatively, we could set the default fetch refspec in the
>>> configuration temporarily, only for the duration of the get_remote()
>>> call, but it feels a bit too hackish...
>>
>> Yeah, I think manually combining the two here is fine. Though I won't
>> complain if you want to look into setting the config earlier. If the
>> refactoring isn't too bad, it would probably provide the nicest outcome.
>
> I did actually look into that, but don't think it's a good idea.
>
> write_refspec_config() nicely encapsulates writing the proper fetch
> refspec configuration according to the given command line options.  Of
> course these options are already known right at the start, so solely
> in this regard we could call this function earlier.  However, in some
> cases, e.g. '--single-branch', the refspec to be written to the config
> depends not only on the given options but on the refs in the remote
> repository, too, so it can only be written after we got the remote's
> refs.
>
>
> Unfortunately, there is more to this issue.  Earlier I though that the
> fetch refspec is the only config that is ignored during a clone.
> However, Ævar's 'clone --no-tags' patches[1] drew my attention to the
> 'remote.<name>.tagOpt' config variable, that I overlooked earlier...
> and apparently 'git clone' overlooks it as well, grabbing all tags
> even when it's set to '--no-tags'.
>
> The root issue is that 'git clone' calls directly into the fetch
> machinery instead of running 'git fetch' (either via run_command() or
> cmd_fetch()), and some of these "higher-level" config variables are
> only handled in 'builtin/fetch.c' but not in 'git clone'.  By
> "handle" I mean "parse and act accordingly": as it happens, these
> config values are parsed alright when 'git clone' calls remote_get(),
> but it never looks at the relevant fields in the resulting 'struct
> remote'.
>
> Luckily, many "lower-level" config variables are working properly even
> during 'git clone', because they are handled in the transport layer,
> e.g. 'git clone -c url.https://github.com/.insteadof=gh: gh:git/git'
> does the right thing.
>
>
> I'm not sure what the right way forward would be.
>
> My patch deals with 'remote.<name>.refspec', i.e. 'remote->fetch'.
> Apparently some extra care is necessary for 'remote.<name>.tagOpt' and
> 'remote->fetch_tags', too.  Perhaps there are more, I haven't checked
> again, and maybe we'll add similar config variables in the future.  So
> I don't think that dealing with such config variables one by one in
> 'git clone', too, is the right long-term solution...  but perhaps it's
> sufficient for the time being?
>
> Running a fully-fledged 'git fetch' seems to be simpler and safer
> conceptually, as it would naturally handle all fetch-related config
> variables, present and future.  However, it's not without drawbacks:
> 'git clone' must set the proper config before running 'git fetch' (or
> at least set equivalent cmdline options), which in some cases requires
> the refs in the remote repository, making an additional "list remote
> refs" step necessary (i.e. both 'clone' and 'fetch' would have to
> retrieve the refs from the remote, resulting in more network I/O).
>
> Or we should libify more of 'builtin/fetch.c', but with all those
> static variables and functions in there...  Ugh :)

Yes from my (limited) understanding of the code after hacking in
--no-tags this all seems correct. I.e. that a large part of my patch
wouldn't be needed if we were able to just set tagOpt earlier & then
call fetch.

But as you point out there's a big chicken & egg problem with the
likes of --single-branch where we'd either need to run upload-pack on
the remote side twice, once to get the branch and once to fetch (ew!).

The way to get around that that I can see would be to have some deep
hooking into the fetch machinery where first we set our config like
tagOpt=--no-tags for --no-tags, then we call `fetch`, but `fetch`
would need to have some hook where in the case of --single-branch it
would immediately write the branch name to the fetch spec in the
config, then do the actual fetching.