Re: [Question] Can git cat-file have a type filtering option?

From: Linus Torvalds <torvalds@linux-foundation.org>
To: ZheNing Hu <adlternative@gmail.com>
Cc: Jeff King <peff@peff.net>, Taylor Blau <me@ttaylorr.com>,
	Junio C Hamano <gitster@pobox.com>,
	Git List <git@vger.kernel.org>,
	johncai86@gmail.com
Subject: Re: [Question] Can git cat-file have a type filtering option?
Date: Fri, 14 Apr 2023 10:04:59 -0700
Message-ID: <CAHk-=wjr-CMLX2Jo2++rwcv0VNr+HmZqXEVXNsJGiPRUwNxzBQ@mail.gmail.com>
In-Reply-To: <CAOLTT8SEeY1tfU39xHPJ21F7o3dmgEFwNCny=Z2F4Y2HFR3DzA@mail.gmail.com>

On Fri, Apr 14, 2023 at 5:17 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Jeff King <peff@peff.net> wrote on Fri, Apr 14, 2023 at 15:30:
> >
> > On Wed, Apr 12, 2023 at 05:57:02PM +0800, ZheNing Hu wrote:
> > >
> > > I'm still puzzled why git calculated the object id based on {type, size, data}
> > >  together instead of just {data}?
> >
> > You'd have to ask Linus for the original reasoning. ;)

I originally thought of the git object store as "tagged pointers".

That actually caused confusion initially when I tried to explain this
to SCM people, because "tag" means something very different in an SCM
environment than it means in computer architecture.

And the implication of a tagged pointer is that you have two parts of
it - the "tag" and the "address". Both are relevant at all points.

This isn't quite as obvious in everyday modern git usage, because a lot
of uses end up _only_ using the "address" (aka SHA1), but it's very
much part of the object store design. Internally, the object layout
never uses just the SHA1, it's all "type:SHA1", even if sometimes the
types are implied (ie the tree object doesn't spell out "blob", but
it's still explicit in the mode bits).
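
To make the "implied by the mode bits" point concrete, here is a rough
sketch of how a tree entry's mode can be mapped to the type of the
object it points at. This is not git's actual code: the names
toy_entry_type and toy_type_from_mode are made up for illustration, and
the only thing assumed about git is that tree entries store the mode as
octal text (directories as "40000", regular files as "100644" or
"100755", symlinks as "120000", submodule links as "160000").

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type tags, for illustration only. */
enum toy_entry_type { TOY_ENTRY_BLOB, TOY_ENTRY_TREE, TOY_ENTRY_COMMIT };

/*
 * Derive the implied object type from a tree entry's octal mode string.
 * Only the file-type bits (mask 0170000) matter here.
 */
static enum toy_entry_type toy_type_from_mode(const char *octal_mode)
{
    assert(octal_mode != NULL);
    unsigned long mode = strtoul(octal_mode, NULL, 8);

    switch (mode & 0170000UL) {
    case 0040000UL:
        return TOY_ENTRY_TREE;    /* directory: entry points at a tree */
    case 0160000UL:
        return TOY_ENTRY_COMMIT;  /* submodule link: points at a commit */
    default:
        return TOY_ENTRY_BLOB;    /* regular files and symlinks are blobs */
    }
}
```

So the entry never spells out "blob" or "tree", but the tag can still be
recovered from what is stored next to the SHA1.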

This is very very obvious in "git cat-file", which was one of the
original scripts in the first commit (but even there the tag/type has
changed meaning over time: the very first version didn't use it as
input at all, then it started verifying it, and then later it got the
more subtle context of "peel the tags until you find this type").

You can also see this in the original README (again, go look at that
first git commit): the README talks about the "tag of their type".

Of course, in practice git then walked away from having to specify the
type all the time. It started even in that original release, in that
the HEAD file never contained the type - because it was implicit (a
HEAD is always a commit).

So we ended up having a lot of situations like that where the actual
tag part was implicit from context, and these days people basically
never refer to the "full" object name with tag, but only the SHA1
address.

So now we have situations where the type really has to be looked up
dynamically, because it's not explicitly encoded anywhere. While HEAD
is supposed to always be a commit, other refs can be pretty much
anything, and can point to a tag object, a commit, a tree or a blob.
So then you actually have to look up the type based on the address.

End result: these days people don't even think of git objects as
"tagged pointers".  Even internally in git, lots of code just passes
the "object name" along without any tag/type, just the raw SHA1 / OID.

So that originally "everything is a tagged pointer" is much less true
than it used to be, and now, instead of having tagged pointers, you
mostly end up with just "bare pointers" and look up the type
dynamically from there.

And that "look up the type in the object" is possible because even
originally, I did *not* want any kind of "object type aliasing".

So even when looking up the object with the full "tag:pointer", the
encoding of the object itself then also contains that object type, so
that you can cross-check that you used the right tag.
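
A minimal sketch of that cross-check, assuming the usual object
encoding where the data is preceded by a "<type> <size>" header and a
NUL byte. The helper names (toy_write_header, toy_check_type) are
hypothetical and this is not how git's own code is structured; it only
shows that the type sits inside the object, so a caller-supplied tag
can be compared against it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<type> <size>" followed by a NUL into buf.
 * Returns the header length including the NUL, or 0 if it does not fit.
 */
static size_t toy_write_header(char *buf, size_t cap,
                               const char *type, size_t size)
{
    assert(buf != NULL && type != NULL);
    int n = snprintf(buf, cap, "%s %zu", type, size);

    if (n < 0 || (size_t)n >= cap)
        return 0;
    return (size_t)n + 1;
}

/*
 * Return 1 if the type recorded in the header matches the tag the
 * caller expected, 0 otherwise.
 */
static int toy_check_type(const char *header, const char *expected)
{
    size_t len = strlen(expected);

    /* The type must match exactly and be followed by the space. */
    return strncmp(header, expected, len) == 0 && header[len] == ' ';
}
```

Looking up something as a "commit" and finding a header that starts
with "blob " is then a detectable error rather than a silent type
confusion.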

That said, you *can* see some of the effects of this "tagged pointers"
in how the internals do things like

    struct commit *commit = lookup_commit(repo, &oid);

which conceptually very much is about tagged pointers. And the fact
that two objects cannot alias is actually somewhat encoded in that: a
"struct commit" contains a "struct object" as a member. But so does
"struct blob" - and the two "struct object" cases are never the same
"object".

So there's never any worry about "could blob.object be the same object
as commit.object"?

That is actually inherent in the code, in how "lookup_commit()"
actually does lookup_object() and then does object_as_type(OBJ_COMMIT)
on the result.
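
Here is a toy model of that pattern, loosely following the shape
described above. Everything prefixed with toy_ is invented for this
sketch (including the fixed-size table and string object names); the
real git code differs in its allocation, hashing and error handling.
The point is only that one object name maps to one embedded "struct
object", and that the type check refuses to hand the same object out as
two different types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_type { TOY_NONE, TOY_COMMIT, TOY_BLOB };

/* Common part, embedded as the first member of every typed struct. */
struct toy_object { char oid[41]; enum toy_type type; };
struct toy_commit { struct toy_object object; /* commit fields here */ };
struct toy_blob   { struct toy_object object; /* blob fields here */ };

/* One slot per object name: a name can never be two objects. */
union toy_any {
    struct toy_object object;
    struct toy_commit commit;
    struct toy_blob blob;
};
static union toy_any toy_table[16];
static size_t toy_count;

static struct toy_object *toy_lookup_object(const char *oid)
{
    for (size_t i = 0; i < toy_count; i++)
        if (strcmp(toy_table[i].object.oid, oid) == 0)
            return &toy_table[i].object;
    return NULL;
}

static struct toy_object *toy_create_object(const char *oid)
{
    assert(toy_count < 16 && strlen(oid) < 41);
    struct toy_object *obj = &toy_table[toy_count++].object;

    strcpy(obj->oid, oid);
    obj->type = TOY_NONE;
    return obj;
}

/* Return obj if its type matches (or is not yet known), else NULL. */
static struct toy_object *toy_object_as_type(struct toy_object *obj,
                                             enum toy_type type)
{
    if (obj->type == TOY_NONE)
        obj->type = type;
    return obj->type == type ? obj : NULL;
}

static struct toy_commit *toy_lookup_commit(const char *oid)
{
    struct toy_object *obj = toy_lookup_object(oid);

    if (!obj)
        obj = toy_create_object(oid);
    /* The embedded object is the first member, so the cast is valid. */
    return (struct toy_commit *)toy_object_as_type(obj, TOY_COMMIT);
}

static struct toy_blob *toy_lookup_blob(const char *oid)
{
    struct toy_object *obj = toy_lookup_object(oid);

    if (!obj)
        obj = toy_create_object(oid);
    return (struct toy_blob *)toy_object_as_type(obj, TOY_BLOB);
}
```

In this sketch, once a name has been looked up as a commit, asking for
the same name as a blob returns NULL instead of a second view of the
same object.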

> Oh, you are right, this could be to prevent conflicts between Git objects
> with identical content but different types. However, I always associate
> Git with the file system, where metadata such as file type and size is
> stored in the inode, while the file data is stored in separate chunks.

See above: yes, git design was *also* influenced heavily by
filesystems, but that was mostly in the sense of "this is how to
encode these things without undue pain".

The object database being immutable was partly a security and safety
measure, but it was also very much partly a "rewriting files is going
to be a major pain from a filesystem consistency standpoint - don't do
it".

But even more than a filesystem design, it's a "computer
architecture" design. Think of the git object store as a very abstract
computer architecture that has tagged pointers, stable storage, and no
aliasing - and where the tag is actually verified at each lookup.

The "no aliasing" means that no two distinct pointers can point to the
same data. So a tagged pointer of type "commit" can not point to the
same object as a tagged pointer of type "blob". They are distinct
pointers, even if (maybe) the commit object encoding ends up then
being identical to a blob object.
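
As an illustration of why the pointers stay distinct, the sketch below
hashes the same (empty) contents once as a "blob" and once as a "tree".
It assumes the object name is the SHA-1 of "<type> <size>", a NUL byte,
and then the data; the small SHA-1 routine and the toy_object_id helper
are written just for this example (limited to tiny inputs) and are not
git's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t rol(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Mix one 64-byte block into the running SHA-1 state. */
static void sha1_block(uint32_t h[5], const unsigned char *p)
{
    uint32_t w[80], a, b, c, d, e, f, k, t;
    int i;

    for (i = 0; i < 16; i++)
        w[i] = (uint32_t)p[4 * i] << 24 | (uint32_t)p[4 * i + 1] << 16 |
               (uint32_t)p[4 * i + 2] << 8 | (uint32_t)p[4 * i + 3];
    for (i = 16; i < 80; i++)
        w[i] = rol(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);
    a = h[0]; b = h[1]; c = h[2]; d = h[3]; e = h[4];
    for (i = 0; i < 80; i++) {
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        t = rol(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rol(b, 30); b = a; a = t;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

static void sha1(const unsigned char *msg, size_t len, unsigned char out[20])
{
    uint32_t h[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE,
                      0x10325476, 0xC3D2E1F0 };
    unsigned char tail[128] = { 0 };
    size_t full = len / 64 * 64, rem = len - full, tail_len, i;
    uint64_t bits = (uint64_t)len * 8;

    for (i = 0; i < full; i += 64)
        sha1_block(h, msg + i);
    /* Padding: 0x80, zeros, then the bit length as 64-bit big-endian. */
    memcpy(tail, msg + full, rem);
    tail[rem] = 0x80;
    tail_len = rem < 56 ? 64 : 128;
    for (i = 0; i < 8; i++)
        tail[tail_len - 1 - i] = (unsigned char)(bits >> (8 * i));
    for (i = 0; i < tail_len; i += 64)
        sha1_block(h, tail + i);
    for (i = 0; i < 5; i++) {
        out[4 * i]     = (unsigned char)(h[i] >> 24);
        out[4 * i + 1] = (unsigned char)(h[i] >> 16);
        out[4 * i + 2] = (unsigned char)(h[i] >> 8);
        out[4 * i + 3] = (unsigned char)h[i];
    }
}

/* Hypothetical helper: hex name of "<type> <size>\0<data>" (small inputs). */
static void toy_object_id(const char *type, const void *data, size_t size,
                          char hex[41])
{
    unsigned char buf[256], digest[20];
    int hdr = snprintf((char *)buf, sizeof(buf), "%s %zu", type, size);

    assert(hdr > 0 && (size_t)hdr + 1 + size <= sizeof(buf));
    size_t hdr_len = (size_t)hdr + 1;   /* keep the NUL after the header */
    memcpy(buf + hdr_len, data, size);
    sha1(buf, hdr_len + size, digest);
    for (int i = 0; i < 20; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);
}
```

Calling toy_object_id("blob", "", 0, hex) and toy_object_id("tree", "",
0, hex) should produce the familiar empty-blob and empty-tree names,
which differ even though the contents are byte-for-byte the same.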

And as mentioned, that "verified at each lookup" has mostly gone away,
and "each lookup" has become more of a "can be verified by fsck", but
it's probably still a good thing to think that way.

You still have "lookup_object_by_type()" internally in git that takes
the full tagged pointer, but almost nobody uses it any more. The
closest you get is those "lookup_commit()" things (which are fairly
common, still).

              Linus