Re: [Question] Can git cat-file have a type filtering option?

From: ZheNing Hu <adlternative@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jeff King <peff@peff.net>, Taylor Blau <me@ttaylorr.com>,
	Git List <git@vger.kernel.org>,
	johncai86@gmail.com,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Sun, 16 Apr 2023 19:15:47 +0800
Message-ID: <CAOLTT8QS8VzepLid7V4FXMfGJpiQL6P5_Bd+2=YygfNoZrPU7w@mail.gmail.com>
In-Reply-To: <xmqqh6titpzk.fsf@gitster.g>

Junio C Hamano <gitster@pobox.com> wrote on Fri, Apr 14, 2023 at 23:58:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > Oh, you are right, this could be to prevent conflicts between Git objects
> > with identical content but different types. However, I always associate
> > Git with the file system, where metadata such as file type and size is
> > stored in the inode, while the file data is stored in separate chunks.
>
> I am afraid the presentation order Peff used caused a bit of
> confusion.  The true reason is what Peff brought up as "Or worse".
> We need to be able to tell, given only the name of an object,
> everything that we need to know about the object, and for that, we
> need the type information when we ask for an object by its name.
> Having size embedded in the data that comes back to us when we
> consult object database with an object name helps the implementation
> to pre-allocate a buffer and then inflate into it--there is no
> fundamental reason why it should be there.
>

Yes, I think I understand the point now. Since Git addresses objects
based on their content, if type information were not included in the object,
we could not easily tell what type of Git object corresponds to
a given object ID. Moreover, if we did not include type and size information
in Git objects, we would need to maintain a large number of external tables
to record this information in order to inflate the objects and identify their types.
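
To make this concrete, here is a minimal sketch in Python of how the type
and size can be recovered from the object data alone. It assumes the loose
object layout "<type> <size>\0<payload>" compressed with zlib. The helper
names encode_object and read_header and the 64-byte prefix limit are
hypothetical choices for illustration, not Git's actual code.

```python
import zlib


def encode_object(obj_type: bytes, payload: bytes) -> bytes:
    """Build a loose-object-style stream: '<type> <size>\\0<payload>', deflated."""
    header = obj_type + b" " + str(len(payload)).encode("ascii") + b"\0"
    return zlib.compress(header + payload)


def read_header(compressed: bytes):
    """Inflate only a small prefix and return (type, size) from the header.

    The 64-byte limit is an assumption for this sketch; it only needs to be
    large enough to hold the type name, the decimal size, and the NUL byte.
    """
    inflater = zlib.decompressobj()
    prefix = inflater.decompress(compressed, 64)  # at most 64 bytes of output
    header, sep, _ = prefix.partition(b"\0")
    if not sep:
        raise ValueError("no NUL-terminated header in the first 64 bytes")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size)
```

With this layout, read_header(encode_object(b"blob", b"hello\n")) yields the
type and the payload size without an external table, and a caller could use
the size to allocate the output buffer before inflating the rest.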

> It is a secondary problem created by the design choice that we store
> type together with contents, that the object type recorded in a tree
> entry may contradict the actual type of the object recorded in the
> tree entry.  We could have declared that the object type found in a
> tree entry is to be trusted, if we didn't record the type in the
> object database together with the object contents.
>

Yes, that may not be crucial, but including type information
in Git objects makes it easier to validate the correctness of tree entries.

> I think your original question was not "why do we store type and
> size together with the contents?", but was "why do we include in the
> hash computation?", and all of the above discuss related tangent
> without touching the original question.
>

Yes, but I think these two questions are closely related.
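
As a rough illustration of what "included in the hash computation" means,
the sketch below computes an object ID over the header plus the payload,
assuming a SHA-1 repository and the "<type> <size>\0" header. The function
names git_object_id and content_only_id are hypothetical; content_only_id
shows a hash over the payload alone, which is not what Git does.

```python
import hashlib


def git_object_id(obj_type: bytes, payload: bytes) -> str:
    """Hash '<type> <size>\\0<payload>', so type and size affect the ID."""
    header = obj_type + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def content_only_id(payload: bytes) -> str:
    """Hypothetical alternative: hash the payload only, ignoring type and size."""
    return hashlib.sha1(payload).hexdigest()


if __name__ == "__main__":
    # Same (empty) payload, different types: the IDs differ because the
    # header is part of the hashed data.
    print(git_object_id(b"blob", b""))
    print(git_object_id(b"tree", b""))
    print(content_only_id(b""))
```

For the empty payload this gives the well-known empty blob ID
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 and empty tree ID
4b825dc642cb6eb9a060e54bf8d69288fbee4904, while the payload-only hash is the
same for both, so it could not tell the two objects apart.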

> The need to have type or size available when we ask the object
> database for data associated with the object does not necessarily
> mean they must be hashed together with the contents.  It was done
> merely because "why not? that way, we do not have to worry about
> catching corrupt values for type and size information we want to
> store together with the contents".  IOW, we could have checksummed
> these two pieces of information separately, but why bother?

Thank you. I think I roughly understand.