git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Options like hash-object --literally and cat-file --allow-unknown-type
@ 2021-05-18 14:08 Ævar Arnfjörð Bjarmason
  2021-05-18 15:16 ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-18 14:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Han-Wen Nienhuys, git, Jeff King, Han-Wen Nienhuys,
	karthik nayak, Eric Sunshine


[Breaking the In-Reply-To chain per recent discussions on list, would
have been sent out with [1] if not. See
https://lore.kernel.org/git/xmqqo8eem5hr.fsf@gitster.g/ for the
replied-to message in context].

Skip to "Get to the point" for the "tl;dr".

On Fri, Apr 16 2021, Junio C Hamano wrote:

> Han-Wen Nienhuys <hanwen@google.com> writes:
>
>>> I would think that a better approach here would be to start with some
>>> (per-se unrelated) series to teach update-ref some mode like
>>> hash-object's --literally, i.e. "YOLO this ref update".
>>
>> I disagree.  I think this would be a job better suited to a
>> test-helper. Then we don't put tools into users' hands that
>> potentially corrupt the repository. I don't understand why hash-object
>> --literally is not a test helper either.
>
> As the person who invented the "--literally" option, I'd have to
> agree with this assessment.  It does make life a little bit easier
> for those who hack on Git codebase and reimplementations of it, but
> little practical value for those who use Git every day [*].
>
> If it were a tool to _dump_ the contents of a possibly corrupt
> object that the "--literally" option would have produced to make it
> easier to see by humans, I might be pursuaded to say that such a
> feature may be better in end-user accessible subcommands.  But the
> reason why we invented "--literally" was specifically to create
> corrupt objects in test repository to see end-user accessible tools
> behave, and in hindsight, such a use case shouldn't have been a good
> justification to add the option.
>
> I wasn't following the discussion of this particular patch closely,
> so I do not know what is being discussed is a tool on the diagnosis
> side (for which I may say it is OK to give to end-usres) or on the
> currupting side (for which I would regret to add to end-user tools),
> but hopefully the above would help guiding the decision between the
> two.
>
> Thanks.
>
>
> [Footnote]
>
> * ... for any purpose other than creating a corrupt repository,
>   asking somebody who is or claims to be a Git expert to figure out
>   what is wrong in his or her repository, and either waste expert's
>   time or have fun by watching the expert fail and gets embarrased,
>   that is ;-)

I totally get where you're coming from. We're talking about handing
users a loaded shotgun without the safety pin with --literally.

Having had some time to think about it though (and encountering various
code around it) since this E-Mail I think that on balance we're better
off with it than without it, and should extend it to update-ref et al.

I'm very much cognitively biased on the topic since 100% of my
interaction with these --literally objects is in hacking on git.git
itself, which if anything should serve as an argument not to have this
sort of thing in public tooling.

But I think one thing that makes git great is that we don't have opaque
magic at any level. Sure, we've got complexity, but you can always
devolve thing down to "simple" manipulation of the object store in cases
such as writing refs / loose objects.

So if for nothing else having a tool like hash-object take your raw
input is good for the educational value, even if it does (and should) do
sanity checking by default.

I actually think that like "mktag" it should do even more sanity
checking, I just think there should always be a way to turn that sort of
thing off. At the very lowest level our plumbing shouldn't be caring
about object validation at all. That also neatly allows us to
e.g. support creating an OBJ_NEW with a git version that doesn't
understand such a thing. We're unlikely to add new object types, but I
don't see why it shouldn't be supported.

In any case, to get to the point what prompted me to write this E-Mail
is that I'm annoyed that "cat-file --allow-unknown-type" exists at all,
i.e. I don't see the need for something like a "--literally" (originally
it was "cat-file --literally", see [2]) if we're talking about *reading*
objects, not writing them (hash-object without -w not withstanding).

In re-rolling my fsck fixes for what it does on unknown types[3] I've
ended up with sillyness like replacing the current bad error message
cat-file emits:

    $ git cat-file -t 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: invalid object type

With:

    $ ~/g/git/git cat-file -t 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: object 7a52aafca353d2fd9f3faca575e9fe4fd48619d4 is of unknown type 'bogus', refusing to emit it without --allow-unknown-type

I.e. now the guts of the object-file.c code don't have a way to ferry
that information up. Now it does via a "struct object_info", but (for
compatibility with documented behavior) it still refuses to "really"
print the type without --allow-unknown-type.

I think it should just print it unconditionally, perhaps with a
--no-allow-unknown-type "strict" mode.

We also currently don't support:

    $ git cat-file -p 7a52aafca353d2fd9f3faca575e9fe4fd48619d4
    fatal: invalid object type

But I've added support for that, we don't need to know about the type
itself to dump the raw content after the header, which is more useful,
and means you can roundtrip hash-object --literally and cat-file.

But most annoying is:

    $ echo 7a52aafca353d2fd9f3faca575e9fe4fd48619d4 | git cat-file --batch
    fatal: invalid object type

Now, looking at the implementation that's really a bug. We document:

    objecttype: The type of the object (the same as cat-file -t reports).

So you might expect it to accept an --allow-unknown-type, but it
doesn't. I could add support for that, but I think it would be stupid.

Why should you need to restart a "cat-file --batch" just because you
encounter a bad object? Just .. print it, we can do that safely. I
really don't see the point of having --allow-unknown-type at all. Ditto
for the --batch-check mode.

I mean, having read all the code I think I know why it's there. I think
It's because there was no way to ferry the information up from
object-file.c before my yet-to-be-submitted patches, so the solution was
to pass down a flag saying "please don't die()".

But is it something that anyone thinks is a good idea in the abstract? I
don't see why I shouldn't just document something like:

    Older versions of "cat-file" used to require an --allow-unknown-type
    flag to emit information about objects of unknown types. This is no
    longer required or the default. If you'd like to die on anything
    except known types (e.g. to die instead of bothering with parsing a
    "bad" type that possibly has spaces in it in the --batch output)
    supply --no-allow-unknown-type.

What do you think?

1.

    > In-reply-to: <xmqqo8eem5hr.fsf@gitster.g>

    > References:
    > <pull.847.v5.git.git.1615580397.gitgitgadget@gmail.com>
    > <pull.847.v6.git.git.1618255552.gitgitgadget@gmail.com>
    > <2dc73bf2ec96acae12a17c719083d11401775bc3.1618255553.git.gitgitgadget@gmail.com>
    > <87im4qejpk.fsf@evledraar.gmail.com>
    > <CAFQ2z_OSbeciLQnoognG+Hh5S1tZTHO6WUviC9h5YMer766k6g@mail.gmail.com>
    > <xmqqo8eem5hr.fsf@gitster.g>

2. https://lore.kernel.org/git/CAPig+cTcAq_p3QXqcG+o1saWZyvDHCW=_JWYn6s7B1L4X5X1cQ@mail.gmail.com/

3. https://lore.kernel.org/git/87im493yos.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Options like hash-object --literally and cat-file --allow-unknown-type
  2021-05-18 14:08 Options like hash-object --literally and cat-file --allow-unknown-type Ævar Arnfjörð Bjarmason
@ 2021-05-18 15:16 ` Junio C Hamano
  2021-05-18 18:42   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2021-05-18 15:16 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Han-Wen Nienhuys, git, Jeff King, Han-Wen Nienhuys,
	karthik nayak, Eric Sunshine

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> So you might expect it to accept an --allow-unknown-type, but it
> doesn't. I could add support for that, but I think it would be stupid.
>
> Why should you need to restart a "cat-file --batch" just because you
> encounter a bad object? Just .. print it, we can do that safely. I
> really don't see the point of having --allow-unknown-type at all. Ditto
> for the --batch-check mode.
>
> I mean, having read all the code I think I know why it's there. I think
> It's because there was no way to ferry the information up from
> object-file.c before my yet-to-be-submitted patches, so the solution was
> to pass down a flag saying "please don't die()".
>
> But is it something that anyone thinks is a good idea in the abstract? I
> don't see why I shouldn't just document something like:
>
>     Older versions of "cat-file" used to require an --allow-unknown-type
>     flag to emit information about objects of unknown types. This is no
>     longer required or the default. If you'd like to die on anything
>     except known types (e.g. to die instead of bothering with parsing a
>     "bad" type that possibly has spaces in it in the --batch output)
>     supply --no-allow-unknown-type.
>
> What do you think?

Thanks for thinking things through.

My knee-jerk reaction is

 - As long as "cat-file -t $thing" exits with non-zero status for an
   invalid thing, which was crafted using hash-object --literally,
   reporting the typename it read from the object header to its
   standard output would be fine without "--allow-unknown-type".
   But scripts would be upset if it suddenly started to return with
   zero status when asked to check what type of object the $thing
   is.

 - For "cat-file --batch[-check]", I am on the fence.  A script may
   break because it is not prepared to see anything but four
   existing types (so it might even say "do X if it is a blob, do Y
   if it is a tree, do Z if it is a tag, and do W for everything
   else" and expect/assume that W will see only commits), so failing
   without --allow may still be a prudent thing to do.  We could
   declare that such a script is already broken, even though it
   would not change the fact that the user has been successfully
   using it reliably.

With your new "we can now bubble necessary information up from the
object-file.c layer without dying" change, it might make it easier
to arrange so that object-file.c layer would never die and I do not
think I have any objection against such a plan.  The implementation
of "--[no-]allow-unknown-type" would have to be migrated to the
caller at the higher level to decide what to do when it learns that
the object it asked the object_info() about is of an invalid type.
And the choice of the default would become easier to change later
with such a change to the lower layer.  But I am not sure if that
warrants switching of the default.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Options like hash-object --literally and cat-file --allow-unknown-type
  2021-05-18 15:16 ` Junio C Hamano
@ 2021-05-18 18:42   ` Ævar Arnfjörð Bjarmason
  2021-05-19  0:25     ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-18 18:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Han-Wen Nienhuys, git, Jeff King, Han-Wen Nienhuys,
	karthik nayak, Eric Sunshine


On Wed, May 19 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> So you might expect it to accept an --allow-unknown-type, but it
>> doesn't. I could add support for that, but I think it would be stupid.
>>
>> Why should you need to restart a "cat-file --batch" just because you
>> encounter a bad object? Just .. print it, we can do that safely. I
>> really don't see the point of having --allow-unknown-type at all. Ditto
>> for the --batch-check mode.
>>
>> I mean, having read all the code I think I know why it's there. I think
>> It's because there was no way to ferry the information up from
>> object-file.c before my yet-to-be-submitted patches, so the solution was
>> to pass down a flag saying "please don't die()".
>>
>> But is it something that anyone thinks is a good idea in the abstract? I
>> don't see why I shouldn't just document something like:
>>
>>     Older versions of "cat-file" used to require an --allow-unknown-type
>>     flag to emit information about objects of unknown types. This is no
>>     longer required or the default. If you'd like to die on anything
>>     except known types (e.g. to die instead of bothering with parsing a
>>     "bad" type that possibly has spaces in it in the --batch output)
>>     supply --no-allow-unknown-type.
>>
>> What do you think?
>
> Thanks for thinking things through.
>
> My knee-jerk reaction is
>
>  - As long as "cat-file -t $thing" exits with non-zero status for an
>    invalid thing, which was crafted using hash-object --literally,
>    reporting the typename it read from the object header to its
>    standard output would be fine without "--allow-unknown-type".
>    But scripts would be upset if it suddenly started to return with
>    zero status when asked to check what type of object the $thing
>    is.

That feels even weirder to me. Why conflate the "this is naughty object"
with "we successfully inspected it"? I don't think cat-file should be
exiting non-zero unless it actually has an error inspecting the object.

But in terms of implementation that's trivial, just look if the object
is on the whitelist of types before exiting. It's certainly less hassle
than what it's currently doing.

I was thinking something like an advise() to stderr at most when we
emitted one of these:

>  - For "cat-file --batch[-check]", I am on the fence.  A script may
>    break because it is not prepared to see anything but four
>    existing types (so it might even say "do X if it is a blob, do Y
>    if it is a tree, do Z if it is a tag, and do W for everything
>    else" and expect/assume that W will see only commits), so failing
>    without --allow may still be a prudent thing to do.  We could
>    declare that such a script is already broken, even though it
>    would not change the fact that the user has been successfully
>    using it reliably.

The main reason I even care about this was the fsck case, and it looking
up and down the callstacks from OBJECT_INFO_ALLOW_UNKNOWN_TYPE I think
we're needlessly worrying about this in the cat-file case.

I think even in the case you mentioned a script would be better off with
a default of --allow-unknown-type. Right now if you somehow fed it an
object with such a type it would just hard die, whereas after you could
proceed to feed it other objects.

But realistically (and I may be wrong here) I don't think we need to
worry about backwards compatibility with --allow-unknown-type at
all. I'm fairly certain nobody's dealing with these objects in the wild
except git's own test suite.

No server implementation will accept them, you can't even push/pull them
locally. It's just there for completeness with "hash-object
--literally": https://lore.kernel.org/git/54EDACC9.5080204@gmail.com/

So I think it's fine just to say it was a mistake to require opt-in
options to *inspect* such naughty objects, as opposed to creating them
(where we should have the opt-in --literally option).

> With your new "we can now bubble necessary information up from the
> object-file.c layer without dying" change, it might make it easier
> to arrange so that object-file.c layer would never die and I do not
> think I have any objection against such a plan.  The implementation
> of "--[no-]allow-unknown-type" would have to be migrated to the
> caller at the higher level to decide what to do when it learns that
> the object it asked the object_info() about is of an invalid type.
> And the choice of the default would become easier to change later
> with such a change to the lower layer.

For what it's worth the new API is just passing
OBJECT_INFO_ALLOW_UNKNOWN_TYPE to oid_object_info_extended() along with
the (existing) "typep". You then see if the typep is < 0.

> But I am not sure if that warrants switching of the default.

Hopefully either this thread or a re-roll of that series will convince
you, is this something you'd like to see on list in the next few days or
is it better to hold off until after the RC period?

Another thing I didn't mention as a reason for getting rid of
--allow-unknown-type is that part of the patches I was hacking up made
various read-only commants behave sensibly in the face of these badly
typed objects, where now they'll just die.

E.g. Jeff King's recent --disk-usage for rev-list will just die on it,
since we call "what is the type of this" without
OBJECT_INFO_ALLOW_UNKNOWN_TYPE somewhere.

I think it's a much better approach to just implicitly supply a
OBJECT_INFO_ALLOW_UNKNOWN_TYPE at a high level (you just want the total
disk use, who cares if there's one bad object there), as opposed to
teaching rev-list and others about --allow-unknown-type.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Options like hash-object --literally and cat-file --allow-unknown-type
  2021-05-18 18:42   ` Ævar Arnfjörð Bjarmason
@ 2021-05-19  0:25     ` Junio C Hamano
  0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2021-05-19  0:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Han-Wen Nienhuys, git, Jeff King, Han-Wen Nienhuys,
	karthik nayak, Eric Sunshine

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> But I am not sure if that warrants switching of the default.
>
> Hopefully either this thread or a re-roll of that series will convince
> you, is this something you'd like to see on list in the next few days or
> is it better to hold off until after the RC period?

You sent me a message, said that you wanted to change "cat-file" and
that you wanted my opinion on it, I gave a response, and in response
to that, you just repeated that you still want to change it, and I
did not see any new justification with it in the message I am
responding to, so sorry, but I fail to see where that "hopefully"
comes from.

The way I see it, if there is no practical difference, leaving the
existing behaviour of "cat-file" the same would be the most prudent
thing to do, especially when the primary goal of the topic is not
about improving "cat-file".  This is merely a fallout from the work
to improve lower layers so that "fsck" can work better with objects
with unexpected types, no?

I suspect that this is one of the reasons your topics take way too
more time than they should to mature: biting too much at one time.

Surely it will be about the same effort for you to change the
default, while updating the lower layer that results in making it
easier to have either default.  However, having to argue for and
justify the change of default of a command unrelated to the main
focus of the topic would make your series longer, and adds one more
thing on the plate of reviewers and future readers of the series
that they have to reason about.  If you keep the topic more focused
to "fsck" and the improvement of the lower layer to support it, you
may still need to update the callers that share the same lower layer
("cat-file" in this case) but you would reduce the burden of
reviewers if you can label the change as "this adjust another caller
to the updated lower layer, but no end-user visible change in its
behaviour", as there is one fewer thing for them to have to think.

> I think it's a much better approach to just implicitly supply a
> OBJECT_INFO_ALLOW_UNKNOWN_TYPE at a high level (you just want the total
> disk use, who cares if there's one bad object there), as opposed to
> teaching rev-list and others about --allow-unknown-type.

I actually was hoping that the update to the lower layer functions
would be to stop dying when they see an unknown type (instead they
can leave a bit of information in object_info to signal that the
object is bogus---perhaps the *typep having a value outside the four
real types may be a good enough implementation for that signal) and
leave the policy decision to the callers.  That is, there is no
option bit OBJECT_INFO_ALLOW_UNKNOWN_TYPE that is passed from the
caller to the lower layer; the caller can decide what to do with a
bogus object themselves, "cat-file" retains the current "we'd die
unless allow-unknown is passed" default to reduce the impact from
the entire series to reduce cognitive load from reviewers to make it
easier to review the series, and if you really feel like doing the
"while at it, cat-file requiring --allow does not make sense; if the
user wants the raw inflated content, just let them have it", do so
outside the series after the dust settles.

I dunno.  I may be agreeing with your "it's a much better approach
to just implicitly supply allow-unknown-type at a higher level".


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-05-19  0:26 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-18 14:08 Options like hash-object --literally and cat-file --allow-unknown-type Ævar Arnfjörð Bjarmason
2021-05-18 15:16 ` Junio C Hamano
2021-05-18 18:42   ` Ævar Arnfjörð Bjarmason
2021-05-19  0:25     ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).