git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Possible regression for sparse-checkout in git 2.27.0
@ 2020-06-03  1:08 Shaun Case
  2020-06-03  4:37 ` Elijah Newren
  0 siblings, 1 reply; 8+ messages in thread
From: Shaun Case @ 2020-06-03  1:08 UTC (permalink / raw)
  To: git

I'm seeing an unexpected change in behavior of git sparse-checkout
between 2.26.2 and 2.27.0 on CentOS 7.8.  Problem manifests with both
github and multiple recent versions of gitlab.

Sources came from
https://mirrors.edge.kernel.org/pub/software/scm/git/, I verified the
sha256 hashes for both 2.26.2 and 2.27.0 matched.

Built with
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)

By doing:

./configure \
    --prefix=/usr/local \
    --with-python=python3
make -j 10
sudo make -j 10 install install-doc install-html

Problem detail:

In 2.26.2, git sparse-checkout commands...

  init --cone
  set
  add

... all bring in files.  In 2.27.0, no files appear except the .git subdir
until I do:

git read-tree -mu HEAD

However, according to my reading of this, git read-tree should not be
necessary in 2.27.0:
https://stackoverflow.com/questions/28760940/why-does-one-call-git-read-tree-after-a-sparse-checkout/

When I do export GIT_TRACE_PACKET=1, I see that the init --cone, set
and add sparse-checkout commands all talk to the server.  In 2.27.0,
these commands do not result in any packets sent to the server.  It
sounds like they should be calling update_sparsity() themselves, but
aren't.  Is that wrong?  I couldn't find a git CLI client command to
invoke it directly except for read-tree.

Below is a small bash script and its output that shows the problem
with a small random public repo on github.

Best regards,
Shaun

Script:
==============================================
#!/bin/bash -x

export GIT_TRACE_PACKET=0
git --version
git clone --filter=blob:none --no-checkout
https://github.com/r-spacex/launch-timeline.git
cd launch-timeline

# Note no server traffic at from here...
export GIT_TRACE_PACKET=1
git sparse-checkout init --cone
git sparse-checkout set README.md
git sparse-checkout add css
# ... to here.

# There is nothing here yet except .git/
ls -las

# This brings in the files as expected, uncomment to verify.
# git read-tree -mu HEAD
==============================================


Output:
==============================================
3sc3396:/tmp>demo-git-sparse-checkout-bug.sh
+ export GIT_TRACE_PACKET=0
+ GIT_TRACE_PACKET=0
+ git --version
git version 2.27.0
+ git clone --filter=blob:none --no-checkout
https://github.com/r-spacex/launch-timeline.git
Cloning into 'launch-timeline'...
remote: Enumerating objects: 529, done.
remote: Total 529 (delta 0), reused 0 (delta 0), pack-reused 529
Receiving objects: 100% (529/529), 67.49 KiB | 987.00 KiB/s, done.
Resolving deltas: 100% (168/168), done.
+ cd launch-timeline
+ export GIT_TRACE_PACKET=1
+ GIT_TRACE_PACKET=1
+ git sparse-checkout init --cone
+ git sparse-checkout set README.md
+ git sparse-checkout add css
+ ls -las
total 68
 4 drwxrwxr-x   3 scase scase  4096 Jun  2 17:46 .
60 drwxrwxrwt. 34 root  root  57344 Jun  2 17:46 ..
 4 drwxrwxr-x   8 scase scase  4096 Jun  2 17:46 .git
==============================================

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-03  1:08 Possible regression for sparse-checkout in git 2.27.0 Shaun Case
@ 2020-06-03  4:37 ` Elijah Newren
  2020-06-03 15:36   ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Elijah Newren @ 2020-06-03  4:37 UTC (permalink / raw)
  To: Shaun Case; +Cc: Git Mailing List, Derrick Stolee

Hi,

On Tue, Jun 2, 2020 at 6:12 PM Shaun Case <warmsocks@gmail.com> wrote:
>
> I'm seeing an unexpected change in behavior of git sparse-checkout
> between 2.26.2 and 2.27.0 on CentOS 7.8.  Problem manifests with both
> github and multiple recent versions of gitlab.
>
> Sources came from
> https://mirrors.edge.kernel.org/pub/software/scm/git/, I verified the
> sha256 hashes for both 2.26.2 and 2.27.0 matched.
>
> Built with
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
>
> By doing:
>
> ./configure \
>     --prefix=/usr/local \
>     --with-python=python3
> make -j 10
> sudo make -j 10 install install-doc install-html
>
> Problem detail:
>
> In 2.26.2, git sparse-checkout commands...
>
>   init --cone
>   set
>   add
>
> ... all bring in files.  In 2.27.0, no files appear except the .git subdir
> until I do:
>
> git read-tree -mu HEAD
>
> However, according to my reading of this, git read-tree should not be
> necessary in 2.27.0:
> https://stackoverflow.com/questions/28760940/why-does-one-call-git-read-tree-after-a-sparse-checkout/
>
> When I do export GIT_TRACE_PACKET=1, I see that the init --cone, set
> and add sparse-checkout commands all talk to the server.  In 2.27.0,
> these commands do not result in any packets sent to the server.  It
> sounds like they should be calling update_sparsity() themselves, but
> aren't.  Is that wrong?  I couldn't find a git CLI client command to
> invoke it directly except for read-tree.
>
> Below is a small bash script and its output that shows the problem
> with a small random public repo on github.
>
> Best regards,
> Shaun
>
> Script:
> ==============================================
> #!/bin/bash -x
>
> export GIT_TRACE_PACKET=0
> git --version
> git clone --filter=blob:none --no-checkout
> https://github.com/r-spacex/launch-timeline.git
> cd launch-timeline
>
> # Note no server traffic at from here...
> export GIT_TRACE_PACKET=1
> git sparse-checkout init --cone
> git sparse-checkout set README.md
> git sparse-checkout add css
> # ... to here.
>
> # There is nothing here yet except .git/
> ls -las
>
> # This brings in the files as expected, uncomment to verify.
> # git read-tree -mu HEAD
> ==============================================
>
>
> Output:
> ==============================================
> 3sc3396:/tmp>demo-git-sparse-checkout-bug.sh
> + export GIT_TRACE_PACKET=0
> + GIT_TRACE_PACKET=0
> + git --version
> git version 2.27.0
> + git clone --filter=blob:none --no-checkout
> https://github.com/r-spacex/launch-timeline.git
> Cloning into 'launch-timeline'...
> remote: Enumerating objects: 529, done.
> remote: Total 529 (delta 0), reused 0 (delta 0), pack-reused 529
> Receiving objects: 100% (529/529), 67.49 KiB | 987.00 KiB/s, done.
> Resolving deltas: 100% (168/168), done.
> + cd launch-timeline
> + export GIT_TRACE_PACKET=1
> + GIT_TRACE_PACKET=1
> + git sparse-checkout init --cone
> + git sparse-checkout set README.md
> + git sparse-checkout add css
> + ls -las
> total 68
>  4 drwxrwxr-x   3 scase scase  4096 Jun  2 17:46 .
> 60 drwxrwxrwt. 34 root  root  57344 Jun  2 17:46 ..
>  4 drwxrwxr-x   8 scase scase  4096 Jun  2 17:46 .git
> ==============================================

Thanks for the detailed report.  This change came from commit
f56f31af03 ("sparse-checkout: use new update_sparsity() function",
2020-03-27), made by me.  After playing with it for a bit, the
--filter=blob:none turns out to not be relevant to reproducing the
behavior of not getting any files in the working tree, other than
perhaps providing you a way to measure what was or wasn't happening.

I think it'd be more natural to run
  git clone --filter=blob:none --sparse
https://github.com/r-spacex/launch-timeline.git
in place of the combination of
  git clone --filter=blob:none --no-checkout
https://github.com/r-spacex/launch-timeline.git
  git sparse-checkout init --cone
since the --sparse flag was added just for this kind of case -- to do
a clone but start with only a few things checked out.  It's easier, is
the route we're moving towards, and as a bonus also happens to work.
;-)

A bit of a side note, or a few of them, but this command of yours is broken:
  git sparse-checkout set README.md
because --cone mode means you are specifying *directories* that should
be checked out.  Currently, this gives no error, it instead silently
drops you back to non-cone mode, which seems bad to me.
sparse-checkout should provide some kind of error -- or at very least
a warning -- when you make that mistake.

Now let's talk about the commit in question that changed behavior
here.  The point of sparse-checkout is never to switch branches or
checkout new commits; all it does is update which paths are in the
current working directory.  A related point to this is it should never
add or remove entries from the index and shouldn't change any hashes
of files in the index.  It used to violate this, at first via an
implementation that was literally invoking `git read-tree -mu HEAD` in
a subprocess, and then later using internal code equivalent to
invoking that command in a subprocess.  But by violating the
leave-index-entries-alone mandate, it left folks who were in the
middle of a rebase and wanted to update their sparse-checkout to
include some more directories in their working tree in a precarious
spot -- if they didn't update, then they didn't have the files
necessary to build, and if they did forcibly update via `git read-tree
-mu HEAD` then their staged changes would all get wiped out.  I spent
some quality time helping users recover their files and teaching them
about the git storage model.

So that brings us back to your original question.  When you said
--no-checkout, it means that there is no commit checked out and the
index is empty.  update_sparsity() is correctly toggling the
SKIP_WORKTREE bits for the existing index entries that don't match the
sparsity patterns, and it is correctly calling check_updates().
check_updates() is correctly checking for files currently in the index
which have toggled to being needed in the current worktree so that it
can issue downloads related to promisor packs.  The problem is just
that there aren't any index entries to begin with, so there are no
SKIP_WORKTREE bits to update, and thus no files that need to be
downloaded.

It seems a bit risky to make sparse-checkout start doing
checkout/switch behavior and adding entries to the index.  There's a
couple ways forward.  One, we could decide this is a special edge or
corner case where we allow it: if the index is completely empty, then
there's no data to lose and thus we could make `git sparse-checkout
init [--cone]` in that one case use the old 'read-tree -mu HEAD'
logic.  Alternatively, we could just require users to run 'git reset
--hard' at the end of your script.

Stolee: Thoughts?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-03  4:37 ` Elijah Newren
@ 2020-06-03 15:36   ` Derrick Stolee
  2020-06-04  7:27     ` Elijah Newren
  0 siblings, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2020-06-03 15:36 UTC (permalink / raw)
  To: Elijah Newren, Shaun Case; +Cc: Git Mailing List

On 6/3/2020 12:37 AM, Elijah Newren wrote:
> I think it'd be more natural to run

>   git clone --filter=blob:none --sparse
> https://github.com/r-spacex/launch-timeline.git

> in place of the combination of

>   git clone --filter=blob:none --no-checkout
> https://github.com/r-spacex/launch-timeline.git
>   git sparse-checkout init --cone

> since the --sparse flag was added just for this kind of case -- to do
> a clone but start with only a few things checked out.  It's easier, is
> the route we're moving towards, and as a bonus also happens to work.

Just one warning: the --sparse option in "git clone" does not currently
enable core.sparseCheckoutCone, so running "git sparse-checkout init --cone"
afterwards is a good idea, or else your "git sparse-checkout (set|add)"
commands will not behave the way you expect.

(I think that I will propose a change in behavior to make it do so during
this release cycle.)

> A bit of a side note, or a few of them, but this command of yours is broken:
>   git sparse-checkout set README.md
> because --cone mode means you are specifying *directories* that should
> be checked out.  Currently, this gives no error, it instead silently
> drops you back to non-cone mode, which seems bad to me.
> sparse-checkout should provide some kind of error -- or at very least
> a warning -- when you make that mistake.
 
> Now let's talk about the commit in question that changed behavior
> here.  The point of sparse-checkout is never to switch branches or
> checkout new commits; all it does is update which paths are in the
> current working directory.  A related point to this is it should never
> add or remove entries from the index and shouldn't change any hashes
> of files in the index.  It used to violate this, at first via an
> implementation that was literally invoking `git read-tree -mu HEAD` in
> a subprocess, and then later using internal code equivalent to
> invoking that command in a subprocess.  But by violating the
> leave-index-entries-alone mandate, it left folks who were in the
> middle of a rebase and wanted to update their sparse-checkout to
> include some more directories in their working tree in a precarious
> spot -- if they didn't update, then they didn't have the files
> necessary to build, and if they did forcibly update via `git read-tree
> -mu HEAD` then their staged changes would all get wiped out.  I spent
> some quality time helping users recover their files and teaching them
> about the git storage model.
> 
> So that brings us back to your original question.  When you said
> --no-checkout, it means that there is no commit checked out and the
> index is empty.  update_sparsity() is correctly toggling the
> SKIP_WORKTREE bits for the existing index entries that don't match the
> sparsity patterns, and it is correctly calling check_updates().
> check_updates() is correctly checking for files currently in the index
> which have toggled to being needed in the current worktree so that it
> can issue downloads related to promisor packs.  The problem is just
> that there aren't any index entries to begin with, so there are no
> SKIP_WORKTREE bits to update, and thus no files that need to be
> downloaded.
> 
> It seems a bit risky to make sparse-checkout start doing
> checkout/switch behavior and adding entries to the index.  There's a
> couple ways forward.  One, we could decide this is a special edge or
> corner case where we allow it: if the index is completely empty, then
> there's no data to lose and thus we could make `git sparse-checkout
> init [--cone]` in that one case use the old 'read-tree -mu HEAD'
> logic.  Alternatively, we could just require users to run 'git reset
> --hard' at the end of your script.
> 
> Stolee: Thoughts?

I agree that using "--sparse" instead of "--no-checkout" is the
best way forward for now, but I'll classify that as a workaround
and not necessarily the end of the conversation.

In general, the commit in question is doing something extremely
valuable for common situations, like rebase that you mention.
I also think that this change in behavior is warranted by the
clear warning placed at the top of the docs [1]:

	THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE
	BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-
	CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE.

[1] https://git-scm.com/docs/git-sparse-checkout#_description

Of course, that's just us covering our butts as we push the
feature forward. It doesn't stop users (especially those that
are bravely using the feature) from feeling pain that might
be avoidable.

To wrap up, it's unfortunate that the behavior changed. If it
is easy to detect and handle this special case, then it _may_
be worth doing to avoid disrupting users. Elijah: could you
spend at most an hour trying to see how difficult this would be?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-03 15:36   ` Derrick Stolee
@ 2020-06-04  7:27     ` Elijah Newren
  2020-06-04 20:50       ` Shaun Case
  0 siblings, 1 reply; 8+ messages in thread
From: Elijah Newren @ 2020-06-04  7:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaun Case, Git Mailing List

Hi,

On Wed, Jun 3, 2020 at 8:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/3/2020 12:37 AM, Elijah Newren wrote:
> > I think it'd be more natural to run
>
> >   git clone --filter=blob:none --sparse
> > https://github.com/r-spacex/launch-timeline.git
>
> > in place of the combination of
>
> >   git clone --filter=blob:none --no-checkout
> > https://github.com/r-spacex/launch-timeline.git
> >   git sparse-checkout init --cone
>
> > since the --sparse flag was added just for this kind of case -- to do
> > a clone but start with only a few things checked out.  It's easier, is
> > the route we're moving towards, and as a bonus also happens to work.
>
> Just one warning: the --sparse option in "git clone" does not currently
> enable core.sparseCheckoutCone, so running "git sparse-checkout init --cone"
> afterwards is a good idea, or else your "git sparse-checkout (set|add)"
> commands will not behave the way you expect.
>
> (I think that I will propose a change in behavior to make it do so during
> this release cycle.)
>
> > A bit of a side note, or a few of them, but this command of yours is broken:
> >   git sparse-checkout set README.md
> > because --cone mode means you are specifying *directories* that should
> > be checked out.  Currently, this gives no error, it instead silently
> > drops you back to non-cone mode, which seems bad to me.
> > sparse-checkout should provide some kind of error -- or at very least
> > a warning -- when you make that mistake.
>
> > Now let's talk about the commit in question that changed behavior
> > here.  The point of sparse-checkout is never to switch branches or
> > checkout new commits; all it does is update which paths are in the
> > current working directory.  A related point to this is it should never
> > add or remove entries from the index and shouldn't change any hashes
> > of files in the index.  It used to violate this, at first via an
> > implementation that was literally invoking `git read-tree -mu HEAD` in
> > a subprocess, and then later using internal code equivalent to
> > invoking that command in a subprocess.  But by violating the
> > leave-index-entries-alone mandate, it left folks who were in the
> > middle of a rebase and wanted to update their sparse-checkout to
> > include some more directories in their working tree in a precarious
> > spot -- if they didn't update, then they didn't have the files
> > necessary to build, and if they did forcibly update via `git read-tree
> > -mu HEAD` then their staged changes would all get wiped out.  I spent
> > some quality time helping users recover their files and teaching them
> > about the git storage model.
> >
> > So that brings us back to your original question.  When you said
> > --no-checkout, it means that there is no commit checked out and the
> > index is empty.  update_sparsity() is correctly toggling the
> > SKIP_WORKTREE bits for the existing index entries that don't match the
> > sparsity patterns, and it is correctly calling check_updates().
> > check_updates() is correctly checking for files currently in the index
> > which have toggled to being needed in the current worktree so that it
> > can issue downloads related to promisor packs.  The problem is just
> > that there aren't any index entries to begin with, so there are no
> > SKIP_WORKTREE bits to update, and thus no files that need to be
> > downloaded.
> >
> > It seems a bit risky to make sparse-checkout start doing
> > checkout/switch behavior and adding entries to the index.  There's a
> > couple ways forward.  One, we could decide this is a special edge or
> > corner case where we allow it: if the index is completely empty, then
> > there's no data to lose and thus we could make `git sparse-checkout
> > init [--cone]` in that one case use the old 'read-tree -mu HEAD'
> > logic.  Alternatively, we could just require users to run 'git reset
> > --hard' at the end of your script.
> >
> > Stolee: Thoughts?
>
> I agree that using "--sparse" instead of "--no-checkout" is the
> best way forward for now, but I'll classify that as a workaround
> and not necessarily the end of the conversation.

Agreed.

> In general, the commit in question is doing something extremely
> valuable for common situations, like rebase that you mention.
> I also think that this change in behavior is warranted by the
> clear warning placed at the top of the docs [1]:
>
>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE
>         BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-
>         CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE.
>
> [1] https://git-scm.com/docs/git-sparse-checkout#_description
>
> Of course, that's just us covering our butts as we push the
> feature forward. It doesn't stop users (especially those that
> are bravely using the feature) from feeling pain that might
> be avoidable.
>
> To wrap up, it's unfortunate that the behavior changed. If it
> is easy to detect and handle this special case, then it _may_
> be worth doing to avoid disrupting users. Elijah: could you
> spend at most an hour trying to see how difficult this would be?

I actually spent...um...I don't know, but quite a bit longer than an
hour on this today (er, um, yesterday now).  I kept feeling that we
would be painting ourselves into a bad corner, or violating some
important invariant, and was also worried for a bit that this was just
a canary pointing out other important bugs that were ready to spring
on us once partial clones and sparse-checkouts were used more heavily
together.  So I dug around a fair amount.  jrnieder's post today[1]
was very timely; it gave me a good way to frame what bothered me so
much.  It seems to explain why I tend to hate working in dir.c, but
don't mind merge-recursive.c quite as bad: merge-recursive.c is a bit
of a mess but it has intended invariants that can be discovered if you
dig far enough (which means, it should be fixable or at least
replaceable); dir.c by contrast is mostly just a pile of empirical
behavior bolted on as we went and attempting to modify or replace or
duplicate it is going to always be a big uphill battle.  I'm worried
about moving sparse-checkout in the direction of a pile of bolted on
behaviors and losing sight of high-level design principles.  In
particular, I think that restoring the old behavior would move us in
that direction, and make us more inconsistent with other aspects of
git.  However, that said, the new behavior is not good either; it's
just buggy in a new way (though it's limited to this unborn index
special case).

I've got a patch that I'll post soon; it's a simple two-line change,
accompanied by an 87-line commit message -- because it took a while to
explain the history, the intended invariants, how those invariants
were messed up in different ways historically, how we can restore the
desired properties that'll make it easier to reason about the code,
and how the new behavior after the patch also makes us more consistent
with other behavior already found in git.  I'll submit it as soon as
it passes testsuites on the relevant machines.


[1] https://lore.kernel.org/git/20200603212336.GD253041@google.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-04  7:27     ` Elijah Newren
@ 2020-06-04 20:50       ` Shaun Case
  2020-06-05  0:32         ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Shaun Case @ 2020-06-04 20:50 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee, Git Mailing List

Hi Elijah, Derrick,

On Thu, Jun 4, 2020 at 12:27 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi,
>
> On Wed, Jun 3, 2020 at 8:36 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 6/3/2020 12:37 AM, Elijah Newren wrote:
> > > I think it'd be more natural to run
> >
> > >   git clone --filter=blob:none --sparse
> > > https://github.com/r-spacex/launch-timeline.git
> >
> > > in place of the combination of
> >
> > >   git clone --filter=blob:none --no-checkout
> > > https://github.com/r-spacex/launch-timeline.git
> > >   git sparse-checkout init --cone
> >
> > > since the --sparse flag was added just for this kind of case -- to do
> > > a clone but start with only a few things checked out.  It's easier, is
> > > the route we're moving towards, and as a bonus also happens to work.
> >
> > Just one warning: the --sparse option in "git clone" does not currently
> > enable core.sparseCheckoutCone, so running "git sparse-checkout init --cone"
> > afterwards is a good idea, or else your "git sparse-checkout (set|add)"
> > commands will not behave the way you expect.

Ok, great.  My production script now does this:

git clone --filter=blob:none --sparse <URL>
git sparse-checkout init --cone
git sparse-checkout add <PATH1>
git sparse-checkout add <PATH2>
[ ... ]

... and everything works as expected in 2.26.2 and 2.27.0. Perfection!

See below for why I was doing it the other way.

> > (I think that I will propose a change in behavior to make it do so during
> > this release cycle.)

That sounds good to me.  Could you also consider adding an error
message for git sparse-checkout init, if something unexpected follows?  My
rationale is that my script initially contained this:

     git sparse-checkout init cone

Note the missing "--" in front of cone.  This failed silently, so I
thought I was in
cone mode, but wasn't.  An error message would have helped.  Your proposal,
having it happen automatically, would too.

> > > A bit of a side note, or a few of them, but this command of yours is broken:
> > >   git sparse-checkout set README.md
> > > because --cone mode means you are specifying *directories* that should
> > > be checked out.

Thank you for pointing this out.  This is a surprise.  I'm migrating some large
Subversion repos and workspace setup scripts to git (thus the interest
in partial
cloning and sparse checkouts), and svn has the ability to sparse checkout
individual files as well as directories, so I expected it to work the
same way in git.

Is this limitation a design decision, a technical limitation, a
planned feature?  It
seems to be fairly popular in svn, and  there is even more interest
for it in git:

https://stackoverflow.com/questions/122107/checkout-one-file-from-subversion
https://stackoverflow.com/questions/2466735/how-to-sparsely-checkout-only-one-single-file-from-a-git-repository

> > > Currently, this gives no error, it instead silently
> > > drops you back to non-cone mode, which seems bad to me.
> > > sparse-checkout should provide some kind of error -- or at very least
> > > a warning -- when you make that mistake.

Agreed.

I'm going to leave the following here for context, but please scroll down.

> > > Now let's talk about the commit in question that changed behavior
> > > here.  The point of sparse-checkout is never to switch branches or
> > > checkout new commits; all it does is update which paths are in the
> > > current working directory.  A related point to this is it should never
> > > add or remove entries from the index and shouldn't change any hashes
> > > of files in the index.  It used to violate this, at first via an
> > > implementation that was literally invoking `git read-tree -mu HEAD` in
> > > a subprocess, and then later using internal code equivalent to
> > > invoking that command in a subprocess.  But by violating the
> > > leave-index-entries-alone mandate, it left folks who were in the
> > > middle of a rebase and wanted to update their sparse-checkout to
> > > include some more directories in their working tree in a precarious
> > > spot -- if they didn't update, then they didn't have the files
> > > necessary to build, and if they did forcibly update via `git read-tree
> > > -mu HEAD` then their staged changes would all get wiped out.  I spent
> > > some quality time helping users recover their files and teaching them
> > > about the git storage model.
> > >
> > > So that brings us back to your original question.  When you said
> > > --no-checkout, it means that there is no commit checked out and the
> > > index is empty.  update_sparsity() is correctly toggling the
> > > SKIP_WORKTREE bits for the existing index entries that don't match the
> > > sparsity patterns, and it is correctly calling check_updates().
> > > check_updates() is correctly checking for files currently in the index
> > > which have toggled to being needed in the current worktree so that it
> > > can issue downloads related to promisor packs.  The problem is just
> > > that there aren't any index entries to begin with, so there are no
> > > SKIP_WORKTREE bits to update, and thus no files that need to be
> > > downloaded.
> > >
> > > It seems a bit risky to make sparse-checkout start doing
> > > checkout/switch behavior and adding entries to the index.  There's a
> > > couple ways forward.  One, we could decide this is a special edge or
> > > corner case where we allow it: if the index is completely empty, then
> > > there's no data to lose and thus we could make `git sparse-checkout
> > > init [--cone]` in that one case use the old 'read-tree -mu HEAD'
> > > logic.  Alternatively, we could just require users to run 'git reset
> > > --hard' at the end of your script.
> > >
> > > Stolee: Thoughts?
> >
> > I agree that using "--sparse" instead of "--no-checkout" is the
> > best way forward for now, but I'll classify that as a workaround
> > and not necessarily the end of the conversation.
>
> Agreed.

Agree too.  I also agree that Elijah's changes are necessary, and will save me
a ton of headaches going forward.  I don't want them backed out.

So, it seems --sparse instead of --no-checkout is what I should have been doing
from the beginning, I had to fiddle a bit to get what I had working,
by following
some online resources and experimenting, and what I ended up with working
satisfactorily was --no-checkout, with 2.26.2.

Why?

Perhaps the messaging has inadvertently become a little muddled.  I started
by working on getting partial cloning working; when I searched, these are the
resources I found and followed:

https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/
https://github.blog/2020-01-13-highlights-from-git-2-25/
https://stackoverflow.com/questions/4114887/is-it-possible-to-do-a-sparse-checkout-without-checking-out-the-whole-repository
https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
https://docs.gitlab.com/ce/topics/git/partial_clone.html

The first four mention --no-checkout, but do not mention --sparse.  The last one
actually gives proper guidance, but by the time I came across it, I had the
--no-checkout method working and I came away not even realizing --sparse
exists.  Perhaps I'm the only one who went down this path.  If not, the
--no-checkout case may be a bit more common than you might expect.

> > In general, the commit in question is doing something extremely
> > valuable for common situations, like rebase that you mention.
> > I also think that this change in behavior is warranted by the
> > clear warning placed at the top of the docs [1]:
> >
> >         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE
> >         BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-
> >         CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE.
> >
> > [1] https://git-scm.com/docs/git-sparse-checkout#_description

Got it.  Consider your butts covered.  :)

Thanks for the guidance and explanation.

Shaun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-04 20:50       ` Shaun Case
@ 2020-06-05  0:32         ` Derrick Stolee
  2020-06-05  0:56           ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2020-06-05  0:32 UTC (permalink / raw)
  To: Shaun Case, Elijah Newren; +Cc: Git Mailing List

On 6/4/2020 4:50 PM, Shaun Case wrote:
> Hi Elijah, Derrick,
> 
> On Thu, Jun 4, 2020 at 12:27 AM Elijah Newren <newren@gmail.com> wrote:
>>
>> Hi,
>>
>> On Wed, Jun 3, 2020 at 8:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> On 6/3/2020 12:37 AM, Elijah Newren wrote:
>>>> I think it'd be more natural to run
>>>
>>>>   git clone --filter=blob:none --sparse
>>>> https://github.com/r-spacex/launch-timeline.git
>>>
>>>> in place of the combination of
>>>
>>>>   git clone --filter=blob:none --no-checkout
>>>> https://github.com/r-spacex/launch-timeline.git
>>>>   git sparse-checkout init --cone
>>>
>>>> since the --sparse flag was added just for this kind of case -- to do
>>>> a clone but start with only a few things checked out.  It's easier, is
>>>> the route we're moving towards, and as a bonus also happens to work.
>>>
>>> Just one warning: the --sparse option in "git clone" does not currently
>>> enable core.sparseCheckoutCone, so running "git sparse-checkout init --cone"
>>> afterwards is a good idea, or else your "git sparse-checkout (set|add)"
>>> commands will not behave the way you expect.
> 
> Ok, great.  My production script now does this:
> 
> git clone --filter=blob:none --sparse <URL>
> git sparse-checkout init --cone
> git sparse-checkout add <PATH1>
> git sparse-checkout add <PATH2>
> [ ... ]
> 
> ... and everything works as expected in 2.26.2 and 2.27.0. Perfection!
> 
> See below for why I was doing it the other way.
> 
>>> (I think that I will propose a change in behavior to make it do so during
>>> this release cycle.)
> 
> That sounds good to me.  Could you also consider adding an error
> message for git sparse-checkout init, if something unexpected follows?  My
> rationale is that my script initially contained this:
> 
>      git sparse-checkout init cone
> 
> Note the missing "--" in front of cone.  This failed silently, so I
> thought I was in
> cone mode, but wasn't.  An error message would have helped.  Your proposal,
> having it happen automatically, would too.

Yes, this is probably a bug that's my fault. The option parser should
be more strict in the "init" case.

>>>> A bit of a side note, or a few of them, but this command of yours is broken:
>>>>   git sparse-checkout set README.md
>>>> because --cone mode means you are specifying *directories* that should
>>>> be checked out.
> 
> Thank you for pointing this out.  This is a surprise.  I'm migrating some large
> Subversion repos and workspace setup scripts to git (thus the interest
> in partial
> cloning and sparse checkouts), and svn has the ability to sparse checkout
> individual files as well as directories, so I expected it to work the
> same way in git.
> 
> Is this limitation a design decision, a technical limitation, a
> planned feature?  It
> seems to be fairly popular in svn, and  there is even more interest
> for it in git:
> 
> https://stackoverflow.com/questions/122107/checkout-one-file-from-subversion
> https://stackoverflow.com/questions/2466735/how-to-sparsely-checkout-only-one-single-file-from-a-git-repository

There are a couple reasons why this is tricky in Git.

The first is due to the .gitignore syntax. The syntax allows exact
matches for _directories_ using a trailing slash "/". For example,
we can match everything in A/B/C with the pattern

	A/B/C/

This would match the files in A/B/C/ and its subdirectories, but will
not match a file A/B/C.txt or A/B/C1/. There is no equivalent matching
for files, so A/B/C _will_ match a file A/B/C and A/B/C.txt. Whether this
matters to you or not depends on your file structure.

The second is how the pattern-matching works for cone mode. Everything
is based on prefix matches of directories, but also that if a directory
is included, then all of its parent directories are included, but not
recursively. All files contained directly in one of the parent directories
are included automatically. Thus, if A/B/C/ is recursively added, then
A/B/X.txt, A/Y.txt, and Z.txt all match in cone mode. This allows us some
significant performance gains by allowing us to stop walking trees when
we reach a directory that isn't one of these "parent" directories. We
know that A/D/ is not in our list of "parents" so we don't need to look
in there.

What does this have to do with file matches? If we include A/B/C.txt as
a file, then we will need to match all sibling files in A/B/! This is
very similar to matching A/B/* or even A/B/C/*.

This idea was brought up and debated shortly after 2.25.0 released [1]

[1] https://lore.kernel.org/git/CADSBhNbbO=aq-Oo2MpzDMN2VAX4m6f9Jb-eCtVVX1NfWKE9zJw@mail.gmail.com/

>>>> Currently, this gives no error, it instead silently
>>>> drops you back to non-cone mode, which seems bad to me.
>>>> sparse-checkout should provide some kind of error -- or at very least
>>>> a warning -- when you make that mistake.
> 
> Agreed.
> 
> I'm going to leave the following here for context, but please scroll down.
> 
>>>> Now let's talk about the commit in question that changed behavior
>>>> here.  The point of sparse-checkout is never to switch branches or
>>>> checkout new commits; all it does is update which paths are in the
>>>> current working directory.  A related point to this is it should never
>>>> add or remove entries from the index and shouldn't change any hashes
>>>> of files in the index.  It used to violate this, at first via an
>>>> implementation that was literally invoking `git read-tree -mu HEAD` in
>>>> a subprocess, and then later using internal code equivalent to
>>>> invoking that command in a subprocess.  But by violating the
>>>> leave-index-entries-alone mandate, it left folks who were in the
>>>> middle of a rebase and wanted to update their sparse-checkout to
>>>> include some more directories in their working tree in a precarious
>>>> spot -- if they didn't update, then they didn't have the files
>>>> necessary to build, and if they did forcibly update via `git read-tree
>>>> -mu HEAD` then their staged changes would all get wiped out.  I spent
>>>> some quality time helping users recover their files and teaching them
>>>> about the git storage model.
>>>>
>>>> So that brings us back to your original question.  When you said
>>>> --no-checkout, it means that there is no commit checked out and the
>>>> index is empty.  update_sparsity() is correctly toggling the
>>>> SKIP_WORKTREE bits for the existing index entries that don't match the
>>>> sparsity patterns, and it is correctly calling check_updates().
>>>> check_updates() is correctly checking for files currently in the index
>>>> which have toggled to being needed in the current worktree so that it
>>>> can issue downloads related to promisor packs.  The problem is just
>>>> that there aren't any index entries to begin with, so there are no
>>>> SKIP_WORKTREE bits to update, and thus no files that need to be
>>>> downloaded.
>>>>
>>>> It seems a bit risky to make sparse-checkout start doing
>>>> checkout/switch behavior and adding entries to the index.  There's a
>>>> couple ways forward.  One, we could decide this is a special edge or
>>>> corner case where we allow it: if the index is completely empty, then
>>>> there's no data to lose and thus we could make `git sparse-checkout
>>>> init [--cone]` in that one case use the old 'read-tree -mu HEAD'
>>>> logic.  Alternatively, we could just require users to run 'git reset
>>>> --hard' at the end of your script.
>>>>
>>>> Stolee: Thoughts?
>>>
>>> I agree that using "--sparse" instead of "--no-checkout" is the
>>> best way forward for now, but I'll classify that as a workaround
>>> and not necessarily the end of the conversation.
>>
>> Agreed.
> 
> Agree too.  I also agree that Elijah's changes are necessary, and will save me
> a ton of headaches going forward.  I don't want them backed out.
> 
> So, it seems --sparse instead of --no-checkout is what I should have been doing
> from the beginning, I had to fiddle a bit to get what I had working,
> by following
> some online resources and experimenting, and what I ended up with working
> satisfactorily was --no-checkout, with 2.26.2.
> 
> Why?
> 
> Perhaps the messaging has inadvertently become a little muddled.  I started
> by working on getting partial cloning working; when I searched, these are the
> resources I found and followed:
> 
> https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/
> https://github.blog/2020-01-13-highlights-from-git-2-25/
> https://stackoverflow.com/questions/4114887/is-it-possible-to-do-a-sparse-checkout-without-checking-out-the-whole-repository
> https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
> https://docs.gitlab.com/ce/topics/git/partial_clone.html
> 
> The first four mention --no-checkout, but do not mention --sparse.  The last one
> actually gives proper guidance, but by the time I came across it, I had the
> --no-checkout method working and I came away not even realizing --sparse
> exists.  Perhaps I'm the only one who went down this path.  If not, the
> --no-checkout case may be a bit more common than you might expect.

There's a good reason for this. In v2.25.0, the "--sparse" option was
present but broken. See [1] for the fix which was released in v2.26.0
as 47dbf10d8 (clone: fix --sparse option with URLs, 2020-01-24).

[1] https://lore.kernel.org/git/4991a51f6d5d840eaa3bb830e68f1530c2ee08e4.1579900782.git.gitgitgadget@gmail.com/

>>> In general, the commit in question is doing something extremely
>>> valuable for common situations, like rebase that you mention.
>>> I also think that this change in behavior is warranted by the
>>> clear warning placed at the top of the docs [1]:
>>>
>>>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE
>>>         BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-
>>>         CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE.
>>>
>>> [1] https://git-scm.com/docs/git-sparse-checkout#_description
> 
> Got it.  Consider your butts covered.  :)

:D

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-05  0:32         ` Derrick Stolee
@ 2020-06-05  0:56           ` Junio C Hamano
  2020-06-05 13:07             ` Derrick Stolee
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2020-06-05  0:56 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Shaun Case, Elijah Newren, Git Mailing List

Derrick Stolee <stolee@gmail.com> writes:

> The first is due to the .gitignore syntax. The syntax allows exact
> matches for _directories_ using a trailing slash "/". For example,
> we can match everything in A/B/C with the pattern
>
> 	A/B/C/
>
> This would match the files in A/B/C/ and its subdirectories, but will
> not match a file A/B/C.txt or A/B/C1/. There is no equivalent matching
> for files, so A/B/C _will_ match a file A/B/C and A/B/C.txt. Whether this
> matters to you or not depends on your file structure.

The pattern A/B/C _will_ match a file A/B/C and a directory A/B/C/,
but I do not think it matches a file A/B/C.txt (or a path with any
other suffix).

I suspect that for the purpose of your explanation, the pattern
A/B/C does not have to match A/B/C.txt to cause trouble; the fact
that it matches directory A/B/C would be sufficient, I guess.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Possible regression for sparse-checkout in git 2.27.0
  2020-06-05  0:56           ` Junio C Hamano
@ 2020-06-05 13:07             ` Derrick Stolee
  0 siblings, 0 replies; 8+ messages in thread
From: Derrick Stolee @ 2020-06-05 13:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shaun Case, Elijah Newren, Git Mailing List

On 6/4/2020 8:56 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> The first is due to the .gitignore syntax. The syntax allows exact
>> matches for _directories_ using a trailing slash "/". For example,
>> we can match everything in A/B/C with the pattern
>>
>> 	A/B/C/
>>
>> This would match the files in A/B/C/ and its subdirectories, but will
>> not match a file A/B/C.txt or A/B/C1/. There is no equivalent matching
>> for files, so A/B/C _will_ match a file A/B/C and A/B/C.txt. Whether this
>> matters to you or not depends on your file structure.
> 
> The pattern A/B/C _will_ match a file A/B/C and a directory A/B/C/,
> but I do not think it matches a file A/B/C.txt (or a path with any
> other suffix).

You are correct! I am mistaken. The directory/file confusion would
lead to some interesting logic in an implementation that allowed
file matches.

> I suspect that for the purpose of your explanation, the pattern
> A/B/C does not have to match A/B/C.txt to cause trouble; the fact
> that it matches directory A/B/C would be sufficient, I guess.

Yes. It certainly complicates things.

The thing I neglected to mention is that this file-based matching
is usually necessary when there is a directory containing _all_ of
the "big files" instead of organizing those files into directories
based on who needs those files. It is possible to reorganize the
directory structure to use cone mode and achieve similar goals.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-06-05 13:07 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-03  1:08 Possible regression for sparse-checkout in git 2.27.0 Shaun Case
2020-06-03  4:37 ` Elijah Newren
2020-06-03 15:36   ` Derrick Stolee
2020-06-04  7:27     ` Elijah Newren
2020-06-04 20:50       ` Shaun Case
2020-06-05  0:32         ` Derrick Stolee
2020-06-05  0:56           ` Junio C Hamano
2020-06-05 13:07             ` Derrick Stolee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).