git.vger.kernel.org archive mirror
From: Elijah Newren <newren@gmail.com>
To: Dian Xu <dianxudev@gmail.com>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>,
	Victoria Dye <vdye@github.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: git bug report: 'git add' hangs in a large repo which has sparse-checkout file with large number of patterns in it
Date: Thu, 7 Jul 2022 18:53:00 -0700
Message-ID: <CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com>
In-Reply-To: <CAKSRnEwda+WomBQbvjZ+hry+k2vGO4ukR42f66tHqxO7LdU_sA@mail.gmail.com>

On Tue, Jul 5, 2022 at 6:08 AM Dian Xu <dianxudev@gmail.com> wrote:
>
> Hi Elijah,

Hi Dian,

Please don't top post on this list.  It'd also help to respond to the
relevant email instead of picking a different email in the thread to
put your answers in.  Anyway, that aside...

> Please see answers below:
>
> 1.  H: 2.27m; S: 7.7k; Total: 2.28m
>
> 2.  Sure I will run 'reapply' after the sparse-checkout file has
> changed. Just curious, do I have to run 'reapply' if 'checkout' is the
> next immediate cmd? I thought 'checkout' does the updating index as
> well
>
> 3.  I simply added one file only, 'git add' and 'git add --sparse'
> still hang. Let me know if you need me to send you any debug info from
> pathspec.c/dir.c
>
> 4.  Good to know and we are investigating if we have a way out from --no-cone
>
> 5.  I should've been clearer: The experiment done here uses 2.37.0

Thanks for providing these details.  It was enough to at least get me
started, and from my experiments, it appears the arguments to `git
add` are important.  In particular, I could not trigger this when
passing actual filenames that existed.  I could when passing a fake
filename.  Here are the concrete steps I used to reproduce:

    git clone git@github.com:newren/gvfs-like-git-bomb
    cd gvfs-like-git-bomb

    git init attempt
    cd attempt
    ../make-a-git-bomb.sh

    time git checkout bomb

    echo "/*" >.git/info/sparse-checkout
    echo '!/bomb/j/j/' >>.git/info/sparse-checkout
    for i in $(seq 1 10000); do
        printf '!some/random/file/path-%05d\n' $i
    done >>.git/info/sparse-checkout
    git config core.sparseCheckout true
    time git sparse-checkout reapply

    echo hello >world
    time git add --sparse world nonexistent
    time git rm --cached --sparse world nonexistent
    time git add world nonexistent
    time git rm --cached world nonexistent

This sequence of steps will (1) clone a repo with 2 files, (2) create
another repository in subdirectory 'attempt' that has 1000001 files
(but only two unique files, and only six or so unique trees) in a
branch called 'bomb', (3) check it out, (4) create 10002 patterns for
the sparse-checkout file (only the first 2 of which match anything)
which will leave ~99% of files still present (990001 files checked out
and 10000 files sparse) and turn on sparsity, (5) measure how long it
takes to add and remove a file from the index, both with and without
the --sparse flag, but always listing an extra path that won't match
anything.
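
As a rough cross-check of the pattern count in step (4), the same
pattern list can be built in a scratch file and counted without touching
any repository.  This is only a sketch: the mktemp scratch file stands
in for .git/info/sparse-checkout, and the variable names are made up for
illustration.

```shell
# Build the pattern list from the steps above in a scratch file
# (hypothetical stand-in for .git/info/sparse-checkout).
patterns_file=$(mktemp)
{
    echo "/*"
    echo '!/bomb/j/j/'
    for i in $(seq 1 10000); do
        printf '!some/random/file/path-%05d\n' "$i"
    done
} >"$patterns_file"

# Count all patterns, and separately the filler patterns that are not
# expected to match any path in the repository.
pattern_count=$(wc -l <"$patterns_file" | tr -d ' ')
filler_count=$(grep -c '^!some/random/file/path-' "$patterns_file")
echo "patterns: $pattern_count (filler: $filler_count)"
```

The two counts should come out as 10002 and 10000, matching the
description above.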

The timings I see for the setup steps are:
    4m10.444s  checkout bomb
    1m0.380s   sparse-checkout reapply

And the timings for the add/rm steps are:
    4m43.353s  add --sparse world nonexistent
    9m25.666s  add world nonexistent
    0m0.129s   rm --cached --sparse world nonexistent
    9m23.601s  rm --cached world nonexistent

which shows that 'rm' also has a performance problem without the
'--sparse' flag (which seems like another bug).
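
To compare these timings numerically, the `time` output can be
converted to seconds.  The helper below is a small sketch (the names
to_seconds and ratio are made up here); it only does arithmetic on the
numbers listed above and says nothing about the cause of the slowdown.

```shell
# Convert `time`-style output such as 9m25.666s into seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of 'add world nonexistent' to 'add --sparse world nonexistent'.
ratio=$(LC_ALL=C awk -v a="$(to_seconds 9m25.666s)" \
                     -v b="$(to_seconds 4m43.353s)" \
                     'BEGIN { printf "%.2f\n", a / b }')
echo "add without --sparse took ${ratio}x as long as add --sparse"
```

With the figures above, 'add' without '--sparse' took roughly twice as
long as 'add --sparse' when given the fake path.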

Now, if I remove the 'nonexistent' argument from the commands, then
the timings drop to:
    0m0.236s   add --sparse world
    0m0.233s   add world
    0m0.175s   rm --cached --sparse world
    4m43.744s  rm --cached world

So, I can reproduce some slowness.  'rm' without --sparse seems
buggily slow for either set, whereas 'add' is only slow when given a
fake path.  You never mentioned anything about the arguments you were
passing to `git add`, so I don't know whether you are using specific
filenames that just don't exist (like I did above), or globs that
perhaps match some files, or something else.  That might be useful to
know.  But there appears to be something here for both 'add' and 'rm'
that we could look into optimizing.  I don't have time right now.  I'm
not sure if someone else has some time to look into it; if no one else
does, I'll eventually try to get back to it.
