git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* problem with not being able to enforce git content filters
@ 2018-10-16 20:36 Stas Bekman
  2018-10-16 21:17 ` Ævar Arnfjörð Bjarmason
  2018-10-16 23:26 ` brian m. carlson
  0 siblings, 2 replies; 7+ messages in thread
From: Stas Bekman @ 2018-10-16 20:36 UTC (permalink / raw)
  To: git

Hi,

TL;DR

Our open source project dev team has a continuous problem with git
content filters, because developers don't always have them configured.

We need a way for git to support content filters w/o using user's
.gitconfig. Otherwise it leads to an inconsistent behavior and messed up
git checkouts.

=================================================

Full story:

We use a version of the nbstripout content filter
(https://github.com/kynan/nbstripout), which removes user-specific
information from the jupyter notebooks during commit. But the problem
would be the same with any one way clean filter.

First the setup:
https://github.com/kynan/nbstripout#manual-filter-installation

=================================================
Set up a git filter using nbstripout as follows:

git config filter.nbstripout.clean '/path/to/nbstripout'
git config filter.nbstripout.smudge cat
git config filter.nbstripout.required true

Create a file .gitattributes or .git/info/attributes with:

*.ipynb filter=nbstripout

Apply the filter for git diff of *.ipynb files:

git config diff.ipynb.textconv '/path/to/nbstripout -t'

In file .gitattributes or .git/info/attributes add:

*.ipynb diff=ipynb

=================================================

The problem is that it can't be enforced.

When it's not enforced, we end up with some devs using it and others
don't, or more often is the same dev sometimes doesn't have it configured.

When a person has a stripped out notebook checked out, when another
person commits un-stripped out notebook, it leads to: invalid `git
status` reports, `git pull` breaks, `git stash` doesn't work, since it
tries to stash using the filters, and `git pull' can never succeed
because it thinks that it'll overwrite the local changes, but `git diff`
returns no changes.

So the only solution when this happens is to disable the filters, clean
up the mess, re-enable the filters. Many people just make a new clone -
ouch!

And the biggest problem is that it affects all users who may have the
filters enabled, e.g. because they worked on a PR, and not just devs -
i.e. the repercussions are much bigger than just a few devs affected.

We can't use server-side hooks to enforce this because the project is on
github.

And the devs honestly try to do their best to remember to configure the
filters, but for some reason they disappear for them, don't ask me why,
I don't know. This is an open source project team, not a work place.

You can see some related complaints here:
https://github.com/kynan/nbstripout/issues/65#issuecomment-430346894
https://stackoverflow.com/questions/51883227/git-pull-stash-conflicts-with-a-git-filter
and I can find you a whole bunch more if you need more evidence.

==================================================

Proposed solution:

There needs to be a way for a project to define content filters w/o
going via user's .gitconfig configuration, since due to git's security
it can't be distributed to all users and must be enabled manually by
each user, which is just not the right solution in this case.

Of course, I'm open to other suggestions that may help this issue.

"Tell your developers they must configure the filters" is not it - I
tried it for a long time and in vain. If you look at our install
instructions: https://github.com/fastai/fastai#developer-install

  git clone https://github.com/fastai/fastai
  cd fastai
  tools/run-after-git-clone

It already includes an instruction to run a script which enables the
filters, but this doesn't seem to help (and no, it's not a problem with
the script). The devs report that the configuration is there for a
while, and then suddenly it is not there, I don't know why. Perhaps they
make a new clone and forget to re-enable the filters, perhaps they
disable them to clean up and forget to reenable them, I can't tell.

The bottom line it sucks and I hope that you can help with offering a
programmatic solution, rather than recommending creating a police
department.

Thank you for listening.

-- 
________________________________________________
Stas Bekman       <'))))><       <'))))><
https://stasosphere.com  https://chestofbooks.com
https://experientialsexlab.com https://stason.org
https://stasosphere.com/experience-life/my-books

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-16 20:36 problem with not being able to enforce git content filters Stas Bekman
@ 2018-10-16 21:17 ` Ævar Arnfjörð Bjarmason
  2018-10-16 22:12   ` Stas Bekman
  2018-10-16 23:26 ` brian m. carlson
  1 sibling, 1 reply; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-16 21:17 UTC (permalink / raw)
  To: Stas Bekman; +Cc: git


On Tue, Oct 16 2018, Stas Bekman wrote:

> When a person has a stripped out notebook checked out, when another
> person commits un-stripped out notebook, it leads to: invalid `git
> status` reports, `git pull` breaks, `git stash` doesn't work, since it
> tries to stash using the filters, and `git pull' can never succeed
> because it thinks that it'll overwrite the local changes, but `git diff`
> returns no changes.

Planting a flag here. Let's get to this later.

> So the only solution when this happens is to disable the filters, clean
> up the mess, re-enable the filters. Many people just make a new clone -
> ouch!
>
> And the biggest problem is that it affects all users who may have the
> filters enabled, e.g. because they worked on a PR, and not just devs -
> i.e. the repercussions are much bigger than just a few devs affected.
>
> We can't use server-side hooks to enforce this because the project is on
> github.

Ultimately git's a distributed system and we won't ever be able to
enforce that users in their local copies use filters, and they might
edit stuff e.g. in the GitHub UI directly, or with some other git
implementation.

So if you have such special needs maybe consider hosting your own setup
where you can have pre-receive hooks, or within GitHub e.g. enforce that
all things must go through merge requests, and have some robot that
checks that the content to be merged has gone through filters before
being approved.

> And the devs honestly try to do their best to remember to configure the
> filters, but for some reason they disappear for them, don't ask me why,
> I don't know. This is an open source project team, not a work place.

*Picks up flag*. For the purposes of us trying to understand this report
it would be really useful to boil what's happening down to some
step-by-step reproducible recipe.

I.e. with some dummy filter configured & not configured how does "git
pull" end up breaking, and in this case you're alluding to git simply
forgetting config, how would that happen?

> There needs to be a way for a project to define content filters w/o
> going via user's .gitconfig configuration, since due to git's security
> it can't be distributed to all users and must be enabled manually by
> each user, which is just not the right solution in this case.
>
> Of course, I'm open to other suggestions that may help this issue.
>
> "Tell your developers they must configure the filters" is not it - I
> tried it for a long time and in vain. If you look at our install
> instructions: https://github.com/fastai/fastai#developer-install
>
>   git clone https://github.com/fastai/fastai
>   cd fastai
>   tools/run-after-git-clone
>
> It already includes an instruction to run a script which enables the
> filters, but this doesn't seem to help (and no, it's not a problem with
> the script). The devs report that the configuration is there for a
> while, and then suddenly it is not there, I don't know why. Perhaps they
> make a new clone and forget to re-enable the filters, perhaps they
> disable them to clean up and forget to reenable them, I can't tell.

FWIW I tried following these on my Debian install and both on python2
and python3 I get either syntax errors or a combo of that and a missing
pathlib from that script. I don't know Python well enough to debug
it. Maybe that's part of the issue?

> The bottom line it sucks and I hope that you can help with offering a
> programmatic solution, rather than recommending creating a police
> department.

I think it would be great to have .gitconfig in-repo support, but a lot
of security & UI problems need to be surmounted before that would
happen, here's some previous ramblings of mine on that issue:
https://public-inbox.org/git/?q=87zi6eakkt.fsf%40evledraar.gmail.com

It now occurs to me that a rather minimal proof-of-concept version of
that would be:

 1. On repository clone, detect if HEAD:.gitconfig exists

 2. If it does, and we trust $(git config -f <untrusted file> -l)
    enough, emit some advice() output saying the project suggests
    setting config so-and-so, and that you could run the following one
    liner(s) to set it if you agree.

 3. Once we have that we could eventually nudge our way towards
    something like what I suggested in the linked threads above,
    i.e. consuming some subset of that config directly from the repo's
    HEAD:.gitconfig

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-16 21:17 ` Ævar Arnfjörð Bjarmason
@ 2018-10-16 22:12   ` Stas Bekman
  0 siblings, 0 replies; 7+ messages in thread
From: Stas Bekman @ 2018-10-16 22:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On 2018-10-16 02:17 PM, Ævar Arnfjörð Bjarmason wrote:
[...]
>> We can't use server-side hooks to enforce this because the project is on
>> github.
> 
> Ultimately git's a distributed system and we won't ever be able to
> enforce that users in their local copies use filters, and they might
> edit stuff e.g. in the GitHub UI directly, or with some other git
> implementation.

In this particular case github won't be a problem, since for the problem
to appear it has to be executed on the user side. Editing directly in
github UI is not a problem.

Just to give a little big more context to the issue in first place. A
jupyter notebook is a json file that contains the source code, the
outputs from executing that code and the unique user environment bits.
We want to "git" the source code but not the rest. Otherwise merging is
a hell. Hence the stripping.

There are other approaches to solve this problem, besides stripping, but
 they all require some kind of pre-processing before committing the file.

> So if you have such special needs maybe consider hosting your own setup
> where you can have pre-receive hooks, or within GitHub e.g. enforce that
> all things must go through merge requests, and have some robot that
> checks that the content to be merged has gone through filters before
> being approved.

Yes, that would be ideal. But I doubt this is going to happen, I'm just
a contributing developer, not the creator/owner of the project. And as I
said this affects anybody who collaborates on jupyter notebooks, not
just in our project. I think there are several millions of them on github.

> *Picks up flag*. For the purposes of us trying to understand this report
> it would be really useful to boil what's happening down to some
> step-by-step reproducible recipe.
> 
> I.e. with some dummy filter configured & not configured how does "git
> pull" end up breaking, and in this case you're alluding to git simply
> forgetting config, how would that happen?

The problem of 'git pull' and 'git status' and 'git stash' is presented
in details here:
https://stackoverflow.com/questions/51883227/git-pull-stash-conflicts-with-a-git-filter
and more here:
https://github.com/kynan/nbstripout/issues/65
https://github.com/jupyter/nbdime/issues/410#issuecomment-412999758

The problem of configuration disappearing I sadly have no input. It
doesn't happen to me, and all I get from others is that it first works,
and then it doesn't. Perhaps it has something to do with some of them
using windows. I don't know, sorry.

>> The bottom line it sucks and I hope that you can help with offering a
>> programmatic solution, rather than recommending creating a police
>> department.
> 
> I think it would be great to have .gitconfig in-repo support, but a lot
> of security & UI problems need to be surmounted before that would
> happen, here's some previous ramblings of mine on that issue:
> https://public-inbox.org/git/?q=87zi6eakkt.fsf%40evledraar.gmail.com
> 
> It now occurs to me that a rather minimal proof-of-concept version of
> that would be:
> 
>  1. On repository clone, detect if HEAD:.gitconfig exists
> 
>  2. If it does, and we trust $(git config -f <untrusted file> -l)
>     enough, emit some advice() output saying the project suggests
>     setting config so-and-so, and that you could run the following one
>     liner(s) to set it if you agree.
> 
>  3. Once we have that we could eventually nudge our way towards
>     something like what I suggested in the linked threads above,
>     i.e. consuming some subset of that config directly from the repo's
>     HEAD:.gitconfig

I like it, Ævar!

I feel doing the check and prompting the user on the first push/commit
after clone would be a better. It'd be too easy for the user to skip
that step on git clone. In our particular case we want it where the
problem is introduced, which is on commit, and not on clone. I hope it
makes sense.


-- 
________________________________________________
Stas Bekman       <'))))><       <'))))><
https://stasosphere.com  https://chestofbooks.com
https://experientialsexlab.com https://stason.org
https://stasosphere.com/experience-life/my-books

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-16 20:36 problem with not being able to enforce git content filters Stas Bekman
  2018-10-16 21:17 ` Ævar Arnfjörð Bjarmason
@ 2018-10-16 23:26 ` brian m. carlson
  2018-10-16 23:48   ` Stas Bekman
  1 sibling, 1 reply; 7+ messages in thread
From: brian m. carlson @ 2018-10-16 23:26 UTC (permalink / raw)
  To: Stas Bekman; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

On Tue, Oct 16, 2018 at 01:36:37PM -0700, Stas Bekman wrote:
> The problem is that it can't be enforced.
> 
> When it's not enforced, we end up with some devs using it and others
> don't, or more often is the same dev sometimes doesn't have it configured.
> 
> When a person has a stripped out notebook checked out, when another
> person commits un-stripped out notebook, it leads to: invalid `git
> status` reports, `git pull` breaks, `git stash` doesn't work, since it
> tries to stash using the filters, and `git pull' can never succeed
> because it thinks that it'll overwrite the local changes, but `git diff`
> returns no changes.
> 
> So the only solution when this happens is to disable the filters, clean
> up the mess, re-enable the filters. Many people just make a new clone -
> ouch!
> 
> And the biggest problem is that it affects all users who may have the
> filters enabled, e.g. because they worked on a PR, and not just devs -
> i.e. the repercussions are much bigger than just a few devs affected.
> 
> We can't use server-side hooks to enforce this because the project is on
> github.
> 
> And the devs honestly try to do their best to remember to configure the
> filters, but for some reason they disappear for them, don't ask me why,
> I don't know. This is an open source project team, not a work place.

This sounds like it could be easily solved by continuous integration.
You could set up a job on any of a variety of services that checks that
a pull request or other commit is clean when when the filter runs.  If
it doesn't pass, the code doesn't merge.

This is what other projects do for style-related and tidiness issues.
Similar approaches can be used in other situations to enforce that all
line endings are LF, or whatever your project desires.

I don't think it's a good idea to provide Git configuration to end
users, even with prompting, since there are many novice users who don't
know what the security implications of various config options are.  I
also personally never would want to be prompted for such a thing, so
even if that were a feature, people would turn if off, and you'd be no
better off than you were before.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-16 23:26 ` brian m. carlson
@ 2018-10-16 23:48   ` Stas Bekman
  2018-10-17  0:54     ` brian m. carlson
  0 siblings, 1 reply; 7+ messages in thread
From: Stas Bekman @ 2018-10-16 23:48 UTC (permalink / raw)
  To: brian m. carlson, git


>> And the devs honestly try to do their best to remember to configure the
>> filters, but for some reason they disappear for them, don't ask me why,
>> I don't know. This is an open source project team, not a work place.
> 
> This sounds like it could be easily solved by continuous integration.
> You could set up a job on any of a variety of services that checks that
> a pull request or other commit is clean when when the filter runs.  If
> it doesn't pass, the code doesn't merge.

This is an excellent idea wrt to PRs. Thank you, Brian! I will implement
that.

It doesn't help with direct commits to master, since CI would be
detecting it after it was committed. And when that happens we all know
that already because 'git pull' fails.


-- 
________________________________________________
Stas Bekman       <'))))><       <'))))><
https://stasosphere.com  https://chestofbooks.com
https://experientialsexlab.com https://stason.org
https://stasosphere.com/experience-life/my-books

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-16 23:48   ` Stas Bekman
@ 2018-10-17  0:54     ` brian m. carlson
  2018-10-17  1:26       ` Stas Bekman
  0 siblings, 1 reply; 7+ messages in thread
From: brian m. carlson @ 2018-10-17  0:54 UTC (permalink / raw)
  To: Stas Bekman; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1931 bytes --]

On Tue, Oct 16, 2018 at 04:48:13PM -0700, Stas Bekman wrote:
> 
> >> And the devs honestly try to do their best to remember to configure the
> >> filters, but for some reason they disappear for them, don't ask me why,
> >> I don't know. This is an open source project team, not a work place.
> > 
> > This sounds like it could be easily solved by continuous integration.
> > You could set up a job on any of a variety of services that checks that
> > a pull request or other commit is clean when when the filter runs.  If
> > it doesn't pass, the code doesn't merge.
> 
> This is an excellent idea wrt to PRs. Thank you, Brian! I will implement
> that.
> 
> It doesn't help with direct commits to master, since CI would be
> detecting it after it was committed. And when that happens we all know
> that already because 'git pull' fails.

Typically projects that have CI set up don't allow direct pushes to
master, since that tends to allow to bypass CI, or they allow it only in
exceptional circumstances (e.g., reverts).  Also, typically most
projects want to have some sort of review before code (resp. documents,
other contributions) are merged.  Most Git hosting platforms, including
GitHub, offer protected branches, so it's something to consider.

There is a possible alternative, and that's that if your project has
some sort of build file (e.g., a Makefile), you can set it up to
automatically insert hooks or git configuration into developers'
systems, optionally only if it's in a Git repository.  For example, you
could add a pre-commit hook that fails if the filters are disabled.

These are the approaches that tend to work well for most projects.  I
tend to prefer the CI approach with restricted branches because often I
want to customize my hooks, but your project will decide what works best
for it.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: problem with not being able to enforce git content filters
  2018-10-17  0:54     ` brian m. carlson
@ 2018-10-17  1:26       ` Stas Bekman
  0 siblings, 0 replies; 7+ messages in thread
From: Stas Bekman @ 2018-10-17  1:26 UTC (permalink / raw)
  To: brian m. carlson, git

On 2018-10-16 05:54 PM, brian m. carlson wrote:
[...]
>> It doesn't help with direct commits to master, since CI would be
>> detecting it after it was committed. And when that happens we all know
>> that already because 'git pull' fails.
> 
> Typically projects that have CI set up don't allow direct pushes to
> master, since that tends to allow to bypass CI, or they allow it only in
> exceptional circumstances (e.g., reverts).  Also, typically most
> projects want to have some sort of review before code (resp. documents,
> other contributions) are merged.  Most Git hosting platforms, including
> GitHub, offer protected branches, so it's something to consider.
> 
> There is a possible alternative, and that's that if your project has
> some sort of build file (e.g., a Makefile), you can set it up to
> automatically insert hooks or git configuration into developers'
> systems, optionally only if it's in a Git repository.  For example, you
> could add a pre-commit hook that fails if the filters are disabled.
> 
> These are the approaches that tend to work well for most projects.  I
> tend to prefer the CI approach with restricted branches because often I
> want to customize my hooks, but your project will decide what works best
> for it.

Yes, Brian, what you're sharing makes total sense when things are setup
this way, but this is not the case with the project I'm contributing to.
This one is setup, commit first, review later, due to having too few hands.

And I have already setup a CI to detect misconfigured systems. It'll
catch RPs in time, and everything else post-commit. Let's hope the
developers will watch the status of their commits and will react quickly
to fix their setup and mend the broken commit, when they see their
commit broke things. That's as good as it can get in this particular
situation I understand.

Thank you, Brian and Ævar for your support and very helpful suggestions.

-- 
________________________________________________
Stas Bekman       <'))))><       <'))))><
https://stasosphere.com  https://chestofbooks.com
https://experientialsexlab.com https://stason.org
https://stasosphere.com/experience-life/my-books

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-10-17  1:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-16 20:36 problem with not being able to enforce git content filters Stas Bekman
2018-10-16 21:17 ` Ævar Arnfjörð Bjarmason
2018-10-16 22:12   ` Stas Bekman
2018-10-16 23:26 ` brian m. carlson
2018-10-16 23:48   ` Stas Bekman
2018-10-17  0:54     ` brian m. carlson
2018-10-17  1:26       ` Stas Bekman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).