Re: problem with not being able to enforce git content filters

From: Stas Bekman <stas@stason.org>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: problem with not being able to enforce git content filters
Date: Tue, 16 Oct 2018 15:12:58 -0700	[thread overview]
Message-ID: <a6dcfee6-595b-2375-e9cb-71da74ed80bb@stason.org> (raw)
In-Reply-To: <87y3axd21q.fsf@evledraar.gmail.com>

On 2018-10-16 02:17 PM, Ævar Arnfjörð Bjarmason wrote:
[...]
>> We can't use server-side hooks to enforce this because the project is on
>> github.
> 
> Ultimately git's a distributed system and we won't ever be able to
> enforce that users in their local copies use filters, and they might
> edit stuff e.g. in the GitHub UI directly, or with some other git
> implementation.

In this particular case github won't be a problem, since for the problem
to appear it has to be executed on the user side. Editing directly in
github UI is not a problem.

Just to give a little big more context to the issue in first place. A
jupyter notebook is a json file that contains the source code, the
outputs from executing that code and the unique user environment bits.
We want to "git" the source code but not the rest. Otherwise merging is
a hell. Hence the stripping.

There are other approaches to solve this problem, besides stripping, but
 they all require some kind of pre-processing before committing the file.

> So if you have such special needs maybe consider hosting your own setup
> where you can have pre-receive hooks, or within GitHub e.g. enforce that
> all things must go through merge requests, and have some robot that
> checks that the content to be merged has gone through filters before
> being approved.

Yes, that would be ideal. But I doubt this is going to happen, I'm just
a contributing developer, not the creator/owner of the project. And as I
said this affects anybody who collaborates on jupyter notebooks, not
just in our project. I think there are several millions of them on github.

> *Picks up flag*. For the purposes of us trying to understand this report
> it would be really useful to boil what's happening down to some
> step-by-step reproducible recipe.
> 
> I.e. with some dummy filter configured & not configured how does "git
> pull" end up breaking, and in this case you're alluding to git simply
> forgetting config, how would that happen?

The problem of 'git pull' and 'git status' and 'git stash' is presented
in details here:
https://stackoverflow.com/questions/51883227/git-pull-stash-conflicts-with-a-git-filter
and more here:
https://github.com/kynan/nbstripout/issues/65
https://github.com/jupyter/nbdime/issues/410#issuecomment-412999758

The problem of configuration disappearing I sadly have no input. It
doesn't happen to me, and all I get from others is that it first works,
and then it doesn't. Perhaps it has something to do with some of them
using windows. I don't know, sorry.

>> The bottom line it sucks and I hope that you can help with offering a
>> programmatic solution, rather than recommending creating a police
>> department.
> 
> I think it would be great to have .gitconfig in-repo support, but a lot
> of security & UI problems need to be surmounted before that would
> happen, here's some previous ramblings of mine on that issue:
> https://public-inbox.org/git/?q=87zi6eakkt.fsf%40evledraar.gmail.com
> 
> It now occurs to me that a rather minimal proof-of-concept version of
> that would be:
> 
>  1. On repository clone, detect if HEAD:.gitconfig exists
> 
>  2. If it does, and we trust $(git config -f <untrusted file> -l)
>     enough, emit some advice() output saying the project suggests
>     setting config so-and-so, and that you could run the following one
>     liner(s) to set it if you agree.
> 
>  3. Once we have that we could eventually nudge our way towards
>     something like what I suggested in the linked threads above,
>     i.e. consuming some subset of that config directly from the repo's
>     HEAD:.gitconfig

I like it, Ævar!

I feel doing the check and prompting the user on the first push/commit
after clone would be a better. It'd be too easy for the user to skip
that step on git clone. In our particular case we want it where the
problem is introduced, which is on commit, and not on clone. I hope it
makes sense.

-- 
________________________________________________
Stas Bekman       <'))))><       <'))))><
https://stasosphere.com  https://chestofbooks.com
https://experientialsexlab.com https://stason.org
https://stasosphere.com/experience-life/my-books