Re: [PATCH v2] cat-file: skip expanding default format

From: John Cai <johncai86@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: "John Cai via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v2] cat-file: skip expanding default format
Date: Tue, 08 Mar 2022 14:01:19 -0500	[thread overview]
Message-ID: <0D4FEF37-1652-4172-A266-9FF98DA2EE9C@gmail.com> (raw)
In-Reply-To: <xmqqmti0nzx7.fsf@gitster.g>

Hi Junio,

On 8 Mar 2022, at 11:59, Junio C Hamano wrote:

> "John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 7b3f42950ec..ab9a49e13a4 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -351,6 +351,14 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>>  	}
>>  }
>>
>> +static int print_default_format(char *buf, int len, struct expand_data *data)
>> +{
>> +	return xsnprintf(buf, len, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid),
>> +		 data->info.type_name->buf,
>> +		 (uintmax_t)*data->info.sizep);
>> +
>> +}
>
> OK.  We want size and type if we were to show the default output out
> of the object-info API.
>
>>  /*
>>   * If "pack" is non-NULL, then "offset" is the byte offset within the pack from
>>   * which the object may be accessed (though note that we may also rely on
>> @@ -363,6 +371,11 @@ static void batch_object_write(const char *obj_name,
>>  			       struct packed_git *pack,
>>  			       off_t offset)
>>  {
>> +	struct strbuf type_name = STRBUF_INIT;
>> +
>> +	if (!opt->format)
>> +		data->info.type_name = &type_name;
>
> And at this point, !opt->format means we would use the default
> format, so we cannot leave .type_name member NULL.  That is OK
> but puzzling.  Why didn't we need this before?
>
> If the caller is batch_objects(), there is the "mark_query" call to
> strbuf_expand() to learn which field in data->info are needed, so
> it seems that this new code should NOT be necessary.
>
>     Side note.  I briefly wondered if this expand is something you
>     would want to optimize when the default format is used, but this
>     is just "probe just once to ensure various members of data->info
>     are populated, to prepare for showing hundreds of objects in the
>     batch request", so it probably is not worth it.
>
> I am guessing that this is for callers that do not come via
> batch_objects() where the "mark_query" strbuf_expand() is not made?
> If so,
>
>  * why is it sufficient to fill .type_name and not .sizep for the
>    default format (i.e. when opt->format is NULL)?
>
>  * why is it OK not to do anything for non-default format?  If no
>    "mark_query" call has been made, we wouldn't be preparing the
>    .type_name field even if the user-supplied format calls for
>    %(objecttype), would we?
>
> Looking at the call graph:
>
>  - batch_object_write() is called by
>    - batch_one_object()
>    - batch_object_cb()
>    - batch_unordered_object()
>
>  - batch_one_object() is called only by batch_objects()
>  - batch_object_cb() is used only by batch_objects()
>
>  - batch_unordered_object() is called by
>    - batch_unordered_loose()
>    - batch_unordered_packed()
>    and these two are called only by batch_objects()
>
> And the "mark_query" strbuf_expand() to probe which members in
> expand_data are are necessary is done very early, before any of the
> calls batch_objects() makes that reach batch_object_write().
>
> OK, so my initial guess that the new "we need .type_name member to
> point at a strbuf" is because there are some code that bypasses the
> "mark_query" strbuf_expand() in batch_objects() is totally wrong.
> Everybody uses the "mark_query" thing.  Then why do we need to ask
> type_name?
>
> Going back to the new special case print_default_format() gives us
> the answer to the question.  It expects that data->info already
> knows the stringified typename in the type_name member.  The
> original slow code path in expand_atom() uses this, instead:
>
> 	} else if (is_atom("objecttype", atom, len)) {
> 		if (data->mark_query)
> 			data->info.typep = &data->type;
> 		else
> 			strbuf_addstr(sb, type_name(data->type));

Thanks for going through this analysis! so looks like I am relying on
oid_object_info_extended() which calls do_oid_object_info_extended(), which calls
type_name(co->type) if oi->type_name is not NULL.

This is a bit roundabout, so I like what you suggest below of just calling
type_name() in print_default_format() directly.

>
> Which makes me wonder:
>
>  * Is calling type_name(data->type) for many objects a lot less
>    efficient than asking the stringified type_name from the
>    object-info layer?
I'm not sure, but I imagine that if the # of calls to type_name remain the same,
eg: once per object that it wouldn't really matter much where in the stack it
happens. Also, I took a look at type_name() in object.c and it's just a lookup
in a constant array so that should be pretty fast.

>    If that is the case, would you gain
>    performance for all cases if you did this instead
>
> 	} else if (is_atom("objecttype", atom, len)) {
> -		if (data->mark_query)
> -			data->info.typep = &data->type;
> -		else
> -			strbuf_addstr(sb, type_name(data->type));
> +		if (data->mark_query) {
> +			data->info.typep = &data->type;
> +			data->info.type_name = &data->type_name;
> +		} else {
> +			strbuf_addstr(sb, data->type_name);
> +		}
>
>    in expand_atom()?

I don't quite follow here. Would we add a member type_name to
expand_data? Also where would the call to type_name() be to get the stringified
type_name?

Also I'm thinking this approach may not work well with the default format
optimization as we would be skipping the strbuf_expand() call altogether when
default format is used.

>
> 	Side note: I am keeping data->info.typep because a lot of
> 	existing code switches on data->type, which is an enum.
>
>    We may have to keep the strbuf_release() at the end of this
>    function this patch added, to release data->info.type_name, if we
>    go that route, but we wouldn't be dealing with an on-stack
>    type_name in this function.
>
>  * If it does not make any difference between calling type_name() on
>    our side in expand_atom() or asking object-info API to do so,
>    then would it make more sense to lose the local type_name strbuf
>    and print type_name(data->type) in print_default_format() instead?

I think this is the most intuitive solution.

>
> Other than that, this looks good to me.
>
> Thanks.

thanks!
John