From: Elijah Newren <newren@gmail.com>
To: Dian Xu <dianxudev@gmail.com>
Cc: Victoria Dye <vdye@github.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <derrickstolee@github.com>,
	Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Subject: Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
Date: Fri, 1 Jul 2022 14:52:48 -0700
Message-ID: <CABPp-BHgwaWNEJnSer0-jw8+53NDuRWLvtXp4U_JJ8T_t-bTpQ@mail.gmail.com>
In-Reply-To: <CAKSRnEx2seC41QCe8sQOPf0=VNqHB6GkZ3M_CpGmOZRS0FS1gA@mail.gmail.com>

Hi Dian,

As a heads up, note that on this list we don't top-post.

On Fri, Jul 1, 2022 at 1:24 PM Dian Xu <dianxudev@gmail.com> wrote:
>
> Hi Victoria, Elijah, Derrick,
>
> Thanks a lot for the detailed insight.
>
> (Btw our company’s email domain mathworks.com is blocked by
> git@vger.kernel.org, hope someone can help take a look)

Konstantin: Is this something you know how to look into?  (Or do you
know who to ask?)

> 1. We use a no-cone version of sparse-checkout to control the 'shape'
> (set of scm files) of our source code. In this case, the local sandbox
> is not necessarily 'sparse' (2m files), but it's very convenient that
> we can use git to check out the exact amount (shape) of files. To
> Victoria's question, all these 2m files are "H".

How many are "H", how many are "S", and how many files in total?  I'd
like to try to construct a way to reproduce your issue, and knowing
how many of each will help.
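
One way to get those counts, as a rough sketch: `git ls-files -t`
prefixes each index entry with a status tag, where "H" is a tracked
file present in the working tree and "S" is one marked skip-worktree.
The throwaway repository and the "kept"/"dropped" names below are made
up so the snippet is self-contained; in your repository only the
pipeline inside count_tags (a hypothetical helper name) matters.  It
assumes a Git new enough to accept --no-cone on `set`.

```shell
#!/bin/sh
# Sketch: tally index entries by status tag (H = present, S = skip-worktree).
set -e

count_tags() {
    # "git ls-files -t" prints "<tag> <path>"; keep only the tag and count.
    git ls-files -t | cut -c1 | sort | uniq -c
}

# Demo on a throwaway repository (hypothetical paths, not from your setup).
demo=$(mktemp -d)
cd "$demo"
git init -q .
mkdir -p kept dropped
echo a >kept/a.txt
echo b >dropped/b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Sparsify so that one file stays "H" and the other becomes "S".
git sparse-checkout set --no-cone '/kept/'

tags=$(count_tags)
echo "$tags"
total=$(git ls-files | wc -l)
echo "total: $total"
```

On a 2m-file repository the same pipeline should simply take longer;
the three numbers (H, S, total) are what I'm after.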

> 2. Below are the detailed steps to create the local repo (sparse-checkout
> was defined 'before' git checkout)
>       % git init
>       % git remote add origin <url>
>       % git config core.sparsecheckout true
>       % vi .git/info/sparse-checkout
>       % git fetch
>       % git checkout -b <SHA>
>     Do I still need to 'git sparse-checkout reapply' after checkout?
> (Thanks for pointing out that reapply should be run once
> .git/info/sparse-checkout has changed)

Why didn't you list 'git sparse-checkout reapply' after editing
.git/info/sparse-checkout?  You mention it later, so I'm hoping you
ran it at that point.

You should only need to run the sparse-checkout reapply command after
manually editing the .git/info/sparse-checkout file.  There are
special cases where it might be useful after other commands, but it's
pretty rare.  Most git commands, and particularly checkout, will keep
the sparsity of the working tree up-to-date with the sparse-checkout
file -- assuming it was up-to-date beforehand.  Basically, feel free
to use the rule that you only need to reapply after manual edits of
the $GIT_DIR/info/sparse-checkout file.

Also, with newer git, you can replace all three of
   git config core.sparsecheckout true
   vi .git/info/sparse-checkout
   git sparse-checkout reapply
with
   git sparse-checkout set --no-cone <space-separated list of patterns to insert into the .git/info/sparse-checkout file>

With older git, you can replace those three commands with two: `git
sparse-checkout init --no-cone && git sparse-checkout set <list of
patterns>`.  That's sometimes unwanted, though: the init command
sparsifies away everything except files in the toplevel directory, and
the set command then restores all the files.  That two-step approach
is really slow because it deletes and then restores a huge number of
files in the working directory.
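
To illustrate the single-command form, here is a minimal sketch on a
throwaway repository.  The directory names and the '/kept/' pattern
are invented for the example, and it assumes a Git new enough to
accept --no-cone on `set`.  It only shows that the one command writes
the pattern file, turns on core.sparsecheckout, and updates the
working tree without a separate reapply.

```shell
#!/bin/sh
# Sketch: enable non-cone sparse-checkout and write patterns in one command.
set -e

# Throwaway repository; "kept" and "dropped" are made-up directory names.
demo=$(mktemp -d)
cd "$demo"
git init -q .
mkdir -p kept dropped
echo a >kept/a.txt
echo b >dropped/b.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Stands in for: git config core.sparsecheckout true,
# editing .git/info/sparse-checkout, and git sparse-checkout reapply.
git sparse-checkout set --no-cone '/kept/'

# The pattern lands in the sparse-checkout file ...
sparse_file=$(git rev-parse --git-path info/sparse-checkout)
cat "$sparse_file"

# ... the config is switched on, and the working tree is already updated.
enabled=$(git config --get core.sparsecheckout)
echo "core.sparsecheckout=$enabled"
ls
```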

> 3. Unfortunately, after executing reapply (btw it is very slow on this
> 2m files * 16k patterns scenario: 30 mins), 'git add' and 'git add
> --sparse' still hang.

'git add --sparse' is still slow?  That sounds like a bug I'd like to
investigate.

What's the particular timing you get for each of 'git add' and 'git
add --sparse'?  Are you giving it individual files (if so, how many?),
or directories (how many files under those directories?), or globs?
(This information will be helpful in my attempts to get a synthetic
setup aiming to be similar to yours.)

> 4. --cone is a big topic for us now, since 2.37.0 deprecates
> --no-cone. We do have our own challenges to move away from --no-cone
> (E.g. we use lots of file specifiers and/or exclusion patterns to
> define our source code shape), which will be a huge amount of work, if
> feasible. We've established a set of workflows based on --no-cone,
> because of its merit of being capable of defining a fine-grained scm
> shape.

To be fair, --no-cone is deprecated in the sense of being discouraged,
due to various usability problems (including performance), but we
currently have no plans to remove it from Git.  I do heartily
recommend migrating to --cone since it solves so many problems, but
we'll still support --no-cone users as best we can.

> 5. Back to this case, what we've experimented on are:
>       - Remove all files/*/! patterns from our shape definition, which
> leaves us with 14k directories (Obviously the scm shape no longer
> matches, but this is just a proof of concept)
>       - 'git sparse-checkout set <14k directories>' finishes fast

Now I'm surprised.  You said in the previous email that you were using
git 2.34.2.  In that version, --no-cone is the default, so this would
still be using --no-cone mode.  That either suggests you switched to
v2.37 since your email and didn't include that detail here, or that
the performance issue is actually with certain specific patterns.
What version of git did you use here?  And did you have either an
explicit --cone or --no-cone when using the sparse-checkout set
command?

>       - 'git add' finishes fast

>     As Victoria mentioned, I hope this --no-cone 'git add' performance
> can be addressed because 'those performance gains can also be realized
> in cone mode', as we saw here.

Are we sure we saw that here?  Could you verify by reporting: (a) what
version of git were you using, and (b) does `git config --list | grep
-i sparse` show both core.sparsecheckout and core.sparsecheckoutcone
as being true after you do your sparse-checkout set?


Elijah
