Re: [RFC PATCH 0/2] MVP implementation of remote-suggested hooks

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, sandals@crustytoothpaste.net,
	emilyshaffer@google.com
Subject: Re: [RFC PATCH 0/2] MVP implementation of remote-suggested hooks
Date: Mon, 21 Jun 2021 21:35:06 +0200
Message-ID: <87k0mn2dd3.fsf@evledraar.gmail.com>
In-Reply-To: <20210621185858.1169385-1-jonathantanmy@google.com>

On Mon, Jun 21 2021, Jonathan Tan wrote:

>> On Wed, Jun 16 2021, Jonathan Tan wrote:
>> 
>> > This is a continuation of the work from [1]. That work described the
>> > reasons, possible features, and possible workflows, but not the
>> > implementation in detail. This patch set has an MVP implementation, and
>> > my hope is that having a concrete implementation to look at makes it
>> > easier to discuss matters of implementation.
>> 
>> My C on this RFC is:
>> 
>> 1) A request that someone reply (there or here would do) to my comments
>>    on the last iteration of this at:
>>    https://lore.kernel.org/git/874kghk906.fsf@evledraar.gmail.com/
>
> OK - I'll take a look at that.
>
>> 2) I think you'd get better feedback if you CC'd the people who've been
>>   actively discussing this in previous rounds.
>
> Good point.
>
>> > Design choices:
>> >
>> >  1. Where should the suggested hooks be in the remote repo? A branch,
>> >     a non-branch ref, a config? I think that a branch is best - it is
>> >     relatively well-understood and any hooks there can be
>> >     version-controlled (and its history is independent of the other
>> >     branches).
>> 
>> First, unlike brian I don't (I hope I'm fairly summarizing his view
>> here) disagree mostly or entirely with the existence of such a feature
>> at all. I mean, I get the viewpoint that git shouldn't bless what
>> amounts to an active RCE from the remote.
>> 
>> I just think that we could probably do a better job of it than what
>> people are doing in practice, and I've seen people do stuff like have
>> build systems setup permanent symlinks to git-hooks/<some-name> in the
>> tracked dir. We could at least envision a git-native implementation
>> asking the user "do you want this hook update? <shows diff>".
>> 
>> I just find this design approach completely bizarre as noted (probably
>> in less blunt words) in the linked E-Mail.
>
> That's fair. You suggest an alternative below (and maybe more in the
> linked e-mail) - let's look at your suggestion...
>
>> We have Emily's series to convert hooks to be config driven that we hope
>> to land in some form, at that point they won't be any more of a special
>> snowflake than any other config.
>> 
>> And then, instead of doing what I'd think would be the natural result of
>> that: Simply supporting an in-repo top-level ".gitconfig" file. We're
>> still going to seemingly forever have them be an even more special
>> snowflake with this facility, and the reason seems to be mostly/entirely
>> to do with working around some aspect or restriction of Google's
>> internal infrastructure.
>
> I don't think that this is "natural". In particular, I still don't think
> that hooks should be tied to code revision. E.g. if we make commits
> based on an old revision and push them, we still want them to follow the
> latest requirements.

Even in real-world centralized workflows where I've seen people think
they want that, at the end of the day they almost never actually want
it.

Code linting is a good example. To make it Google-specific, say for a
Go project: are you going to pin your linting tool/version to whatever
understood your YAML format for the linter as it was specced 10 years
ago when the project started? It's simply a giant hassle to have a
piece of code operate on every version of your project ever in a way
that doesn't break.

I think in practice the designers of this feature don't actually have
that in mind, but a "close to trunk" workflow, where you'd expect a hook
to only need to operate on revisions for the last few weeks or months,
because that'll be the oldest thing people create new topics from.

But I think the burden of proof is really on the other side here,
something that works entirely differently than the rest of git needs to
have a good reason. Our in-repo .gitattributes don't work like this, nor
.gitignore, .mailmap etc.

There's also real world uses of git where the "branches" are wildly
divergent, e.g. I've worked on a system automation repo where the
"master" was just a stub template, and every team had their own almost
entirely different "repo-like" branch. Probably a bad idea for various
reasons, but Git supports it just fine.

For the centralized use-case what's the problem with just having the
hook do a 'for-each-ref --format=' invocation or "cat-file --batch" on
the "origin", and eval what it finds there? I'd think that gives you
what you want for the more centralized workflow, while leaving git's
implementation working like the rest of our in-repo files.

>> I think it's just un-git-y to have a meta-branch that in some way drives
>> not only all other branches, but all other revisions of all branches,
>> ever.
>> 
>> It breaks expectations around git in lots of different ways, you can't
>> fetch a single branch and get its hooks,
>
> Are you saying that each branch should have its own hooks? That might be
> reasonable in certain projects, but I don't see how that is a Git
> expectation.

It's a git expectation now that I can add git.git as a remote, also
chromium.git, and linux.git, fetch them all, and happily switch in the
same repo between entirely different codebases that don't share a
history.

>> you can't locally alter, commit
>> and update your hooks while e.g. renaming a "t/" directory to "test/";
>> your hooks and code can't be atomically changed).
>
> I still think that hooks should work independent of code versions, so I
> wouldn't think that atomicity here is important.

Covered above.

>> I think I get why you want to do it that way, I just don't get why, as
>> mostly noted in those earlier rounds why it wouldn't be a better
>> approach / more straightforward / more git-y to:
>> 
>> 1. Work on getting hooks driven by config <this is happening with
>>    Emily's series / my split-out "base" topic>
>> 2. Have a facility to read an in-repo '.gitconfig'; have lots of safety
>>    valves etc. around this, I suggested starting with a whitelist of the
>>    N least dangerous config options, e.g. some diff viewing options, or
>>    a suggested sendemail.to or whatever.
>
> I've replied to this above.

Not really. Even if we went for this one-HEAD-version-to-rule-them-all
plan, wouldn't it make more sense to generalize it as a
refs/remotes/origin/magic-config, where we'd discover a ".gitconfig"
file under that commit/tree?

I.e. whether we generalize this to config in general is orthogonal to
whether such config lives in HEAD or in a magic ref.

With hooks as config I don't see how you'd make any of this
hook-specific. There's other config where the "every revision ever"
argument applies much more strongly, e.g. sendemail.to. If that changed
for this project tomorrow you wouldn't want a patch based on "maint" to
send things to a different ML.

>> 3. Work our way up to trusting that for more dangerous stuff, eventually
>>    hooks. Most of the legitimate concerns from others with this is
>>    having some UX where our users won't be trained to just blindly say
>>    "yes" to an alias/hook config that "rm -rf's /" or whatever.
>> 
>>    If we start experimenting with that with aliases or hooks that can
>>    run arbitrary code it's like handing a toddler a shotgun, let's at
>>    least start with a sharp fork or something (less dangerous config) :)
>> 
>> 4. People who want this "I want my hooks to apply to all revisions ever"
>>    could probably get 99% or 100% of what they want if their hook is
>>    just a stub that does the equivalent of:
>> 
>>        sh `curl https://git.google.com/$reponame/hooks/$hookname`
>> 
>>    You'd then simply forbid on your servers any changes to a .gitconfig
>>    that did anything with the hook.* namespace.
>
> This would work if set in .git/config (not version controlled), but not
> .gitconfig (version controlled).

Sorry, what wouldn't work? I meant you'd forbid pushes to your in-repo
.gitconfig in your "master" branch or whatever, just like you're
presumably planning some stronger ACLs for this magic hook branch.

>> With such an implementation you don't need a magic
>> "refs/remotes/origin/suggested-hooks" refs, just some state machine (I
>> suggested e.g. GPG signing chains as an eventual end-state, but "show a
>> diff every time" would also do) that keeps track of what config (and
>> hooks are just one such case) has been OK'd, and which has not.
>
> This sounds complicated.

On the contrary I think anything that leans into git's
content-addressable security model is way less complicated. You don't
care who you fetched Junio's v2.32.0 tag from, what matters is that the
signing chain validates.

The plan of having this magic branch means a whole new trust model for
git: you trust magical authorized remotes. If you trust signed content
chains instead, you can trust hooks if their last modification can be
traced to a signing authority you trust.

It's really just:

    if (hook_content_changed() && hook_content_same_as_in_okd_revision_from_upstream())
        trust_hooks();
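
To make the check above a bit more concrete, here is a minimal Python
sketch. It is not git code: the names hook_digest, TrustStore and
should_trust are made up for illustration, and it assumes trust is
tracked as a set of SHA-256 digests of hook content that was OK'd
earlier (via a diff prompt, a signature check, or whatever; that part
is out of scope here).

```python
import hashlib


def hook_digest(content: bytes) -> str:
    """Identify a hook purely by its content, not by where it came from."""
    return hashlib.sha256(content).hexdigest()


class TrustStore:
    """Hypothetical record of hook contents the user has already OK'd."""

    def __init__(self):
        self._approved = set()

    def approve(self, content: bytes) -> None:
        self._approved.add(hook_digest(content))

    def should_trust(self, content: bytes) -> bool:
        return hook_digest(content) in self._approved


store = TrustStore()
upstream_hook = b"#!/bin/sh\nmake lint\n"
store.approve(upstream_hook)

# Same bytes from a fork or a co-worker's clone: trusted without a prompt.
fork_hook = b"#!/bin/sh\nmake lint\n"
# Changed bytes: not trusted until they have been reviewed again.
tampered_hook = b"#!/bin/sh\nrm -rf /\n"
```

Here store.should_trust(fork_hook) is true and
store.should_trust(tampered_hook) is false, and nothing in the decision
depends on the URL the content was fetched from.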

But while we're on the subject, it seems like a very generous assumption
to think that just because you trust hooks at a given revision (or
always trust the latest), that you implicitly trust them when *combined
with* all past and future revisions from the same repository.

Even without a malicious actor that seems like it'll inevitably break in
all sorts of data-destroying ways. E.g. people commit stuff
accidentally. A hook run under a "git bisect" that naïvely does an "rm
*" will eat your data if you land on a revision that has an in-tree
"-rf" file.
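
The "-rf" file hazard can be simulated without deleting anything. This
is only a rough Python sketch: expand_star, option_like and safe_paths
are hypothetical helpers, the sort order only approximates what a shell
does for "*", and no rm is actually run.

```python
import os
import tempfile


def expand_star(directory):
    """Roughly what the shell does for `*`: sorted names, dotfiles skipped."""
    return sorted(n for n in os.listdir(directory) if not n.startswith("."))


def option_like(args):
    """Arguments a command such as rm would parse as options, not as files."""
    return [a for a in args if a.startswith("-")]


def safe_paths(args):
    """Prefix with ./ so that no file name can be mistaken for an option."""
    return ["./" + a for a in args]


with tempfile.TemporaryDirectory() as tree:
    # A checked-out revision that happens to contain a file named "-rf".
    for name in ("-rf", "build", "notes.txt"):
        open(os.path.join(tree, name), "w").close()
    naive_args = expand_star(tree)
    hijacked = option_like(naive_args)
    still_hijacked = option_like(safe_paths(naive_args))
```

With the naive expansion the "-rf" entry ends up in hijacked; with the
"./" prefix (or an "rm -- *" in the hook) still_hijacked stays empty.
The hook author has to get that right for every revision the hook may
ever run against.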

But once you get to a malicious actor who can, say, push a topic branch
but not hook updates: will your hooks deal with files with whitespace in
them, arbitrary crafted content etc?

So I'd think that's an even better reason to prefer the in-repo
per-revision atomically committed plan, and only trust hooks for the
revision they're shipped with, at least as a default git security model.

>> I'd think it would even work better in the Googleplex, you could clone a
>> co-worker's branch and execute their hooks, since they're the same as
>> what you've pre-approved,
>
> In the presence of .gitconfig, how would you know?

If it's the same config, or if you can automatically OK it. "Same" was
discussed above; for the automatic case you could trust any hook that
only does a wget from some trusted domain and pipes that to "sh".
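
A rough Python sketch of such an automatic OK, using the curl form of
the stub from earlier in this thread. Everything here is an assumption
for illustration: stub_is_trusted and TRUSTED_HOSTS are made-up names,
the host is just the one from the example URL, and the URL character
set is deliberately narrow so that nothing like a nested "$(...)" can
sneak through.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; the host comes from the example URL in this thread.
TRUSTED_HOSTS = {"git.google.com"}

# Deliberately narrow URL character set: no spaces, quotes, '$', '(' or '@'.
URL = r"[A-Za-z0-9._~:/-]+"

# Accept only the exact stub shape: sh $(curl <url>) or sh `curl <url>`
STUB_RE = re.compile(r"sh (?:\$\(curl (" + URL + r")\)|`curl (" + URL + r")`)")


def stub_is_trusted(command: str) -> bool:
    """True only for a bare fetch-and-run stub pointing at a trusted host."""
    m = STUB_RE.fullmatch(command.strip())
    if not m:
        return False
    url = urlparse(m.group(1) or m.group(2))
    return url.scheme == "https" and url.hostname in TRUSTED_HOSTS
```

Anything that is not exactly the stub, e.g. the same command with
"; rm -rf /" appended or a URL on another host, is rejected and would
fall back to whatever manual review there is.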

>> you could even clone some random person's fork
>> of a "blessed" project, because the hooks would be the same `sh $(curl
>> <url I already trust>)`. That validation could even be a system-level
>> in-config hook on your laptop, thus bringing the whole thing full
>> circle...
>
> Same here.
>
> In summary, I think your point of using hook configs + remote-suggested
> configs instead of remote-suggested hooks is a reasonable one, but I
> disagree with your reasons (or, at least, your reasons as I understand
> them).

You trust e.g. chromium.git's hooks, but I clone it, patch it, and
re-push it to some somegithost.com URL. If you go with trusting content
it becomes easy to install those trusted hooks for the common case, but
not if your entire trust model relies on what URL you git clone'd from.