All of lore.kernel.org
 help / color / mirror / Atom feed
* git stash takes excessively long when many untracked files present
@ 2013-08-10 21:44 Josh Triplett
  2013-08-13 10:11 ` Anders Darander
  0 siblings, 1 reply; 15+ messages in thread
From: Josh Triplett @ 2013-08-10 21:44 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Junio C Hamano

[CCing folks involved in the recent "stash-refuse-to-kill" merge.]

I keep portions of my home directory in git.  I tried to "git stash"
some local changes, and it ran for several minutes with no progress.  ps
showed that it was running "git ls-files --killed", which was taking
100% CPU, and occasionally reading the disk very slowly.

strace shows that git ls-files --killed is doing a full recursive
enumeration of my entire home directory.  That's a Really Bad Idea:

~$ find | wc -l
3248997
~$ find -type d | wc -l
350680

Not only that, but it also appears to be attempting to stat and open
several files in every single directory; for instance:

stat(".ccache/1/3/.git", 0x7fff254bc7a0) = -1 ENOENT (No such file or directory)
open(".ccache/1/3/.git/HEAD", O_RDONLY) = -1 ENOENT (No such file or directory)
stat(".ccache/1/3/.git", 0x7fff254bc770) = -1 ENOENT (No such file or directory)
open(".ccache/1/3/.git/packed-refs", O_RDONLY) = -1 ENOENT (No such file or directory)

(Yes, in that order.)

I see a lot of room for optimization here.  Most importantly, git
ls-files --killed really doesn't need to look at any directory entry
unless something in the index would conflict with it.

- Josh Triplett

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-10 21:44 git stash takes excessively long when many untracked files present Josh Triplett
@ 2013-08-13 10:11 ` Anders Darander
  2013-08-13 17:07   ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Anders Darander @ 2013-08-13 10:11 UTC (permalink / raw)
  To: git

Josh Triplett <josh <at> joshtriplett.org> writes:
> [CCing folks involved in the recent "stash-refuse-to-kill" merge.]
> 
> I keep portions of my home directory in git.  I tried to "git stash"
> some local changes, and it ran for several minutes with no progress.  ps
> showed that it was running "git ls-files --killed", which was taking
> 100% CPU, and occasionally reading the disk very slowly.

I've recently got the same problem, though in this case it's my 
openembedded directory that's giving me those problems. (Having an 
untracked build-directory of quiet a few GB takes some time).

I worked around it by locally patching git-stash:
-------------------------------------------
diff --git a/git-stash.sh b/git-stash.sh
index 85c9e2c..e5a2043 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -263,7 +263,7 @@ save_stash () {
                exit 0
        fi
        if test -z "$untracked$force" &&
-          test -n "$(git ls-files --killed | head -n 1)"
+          test -n "$(git ls-files --killed --directory | head -n 1)"
        then
                say "$(gettext "The following untracked files would NOT be 
saved
                test -n "$GIT_QUIET" || git ls-files --killed | sed 
's/^/\t/'
-------------------------------------------

It seems to work in my extremely limited testing. Though, I'm pretty sure 
that there'll be quite a few error cases... (Especially, as I just made
a naive attempt at patching git-stash, so I could go on with a few other 
things).

Do anyone have any better idea on how to approach this?

> strace shows that git ls-files --killed is doing a full recursive
> enumeration of my entire home directory.  That's a Really Bad Idea:
> 
> ~$ find | wc -l
> 3248997
> ~$ find -type d | wc -l
> 350680
> 
> Not only that, but it also appears to be attempting to stat and open
> several files in every single directory; for instance:
> 
> stat(".ccache/1/3/.git", 0x7fff254bc7a0) = -1 ENOENT (No such file or 
directory)
> open(".ccache/1/3/.git/HEAD", O_RDONLY) = -1 ENOENT (No such file or 
directory)
> stat(".ccache/1/3/.git", 0x7fff254bc770) = -1 ENOENT (No such file or 
directory)
> open(".ccache/1/3/.git/packed-refs", O_RDONLY) = -1 ENOENT (No such file 
or directory)
> 
> (Yes, in that order.)
> 
> I see a lot of room for optimization here.  Most importantly, git
> ls-files --killed really doesn't need to look at any directory entry
> unless something in the index would conflict with it.

I guess that this would be a good optimization. Or, are ls-files --killed 
used in other cases where the current behaviour would be requiered?

Cheers,
Anders

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-13 10:11 ` Anders Darander
@ 2013-08-13 17:07   ` Junio C Hamano
  2013-08-13 17:36     ` Anders Darander
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-08-13 17:07 UTC (permalink / raw)
  To: Anders Darander; +Cc: git

Anders Darander <anders.darander@gmail.com> writes:

> diff --git a/git-stash.sh b/git-stash.sh
> index 85c9e2c..e5a2043 100755
> --- a/git-stash.sh
> +++ b/git-stash.sh
> @@ -263,7 +263,7 @@ save_stash () {
>                 exit 0
>         fi
>         if test -z "$untracked$force" &&
> -          test -n "$(git ls-files --killed | head -n 1)"
> +          test -n "$(git ls-files --killed --directory | head -n 1)"
>         then
>                 say "$(gettext "The following untracked files would NOT be 
> saved
>                 test -n "$GIT_QUIET" || git ls-files --killed | sed 
> 's/^/\t/'
> -------------------------------------------
>
> It seems to work in my extremely limited testing. Though, I'm pretty sure 
> that there'll be quite a few error cases... (Especially, as I just made
> a naive attempt at patching git-stash, so I could go on with a few other 
> things).

I am not sure adding "--directory" there is safe.  Aren't there
cases where saving a stash and going back to the committed state
will involve killing no directories, but some files?  If your local
change is to remove a directory and files in it from your working
tree and then deposit a newly created file at the path where the
directory was in the HEAD, stashing that local change and then going
back to the HEAD will involve removing the new file from the working
tree, and unless you have "git add"ed the new file, it will be lost.

> Do anyone have any better idea on how to approach this?

Teaching "ls-files" to leave early once it seens even a single
output is probably a possibility.

>> I see a lot of room for optimization here.  Most importantly, git
>> ls-files --killed really doesn't need to look at any directory entry
>> unless something in the index would conflict with it.

This observation probably is correct, even though I didn't think
about it long enough.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-13 17:07   ` Junio C Hamano
@ 2013-08-13 17:36     ` Anders Darander
  2013-08-13 17:52       ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Anders Darander @ 2013-08-13 17:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



(I'm resending this as Gmail added some html parts...) 

Junio C Hamano <gitster@pobox.com> wrote:
>Anders Darander <anders.darander@gmail.com> writes:
>
>> diff --git a/git-stash.sh b/git-stash.sh
>> index 85c9e2c..e5a2043 100755
>> --- a/git-stash.sh
>> +++ b/git-stash.sh
>> @@ -263,7 +263,7 @@ save_stash () {
>>                 exit 0
>>         fi
>>         if test -z "$untracked$force" &&
>> -          test -n "$(git ls-files --killed | head -n 1)"
>> +          test -n "$(git ls-files --killed --directory | head -n 1)"
>>         then
>>                 say "$(gettext "The following untracked files would
>NOT be 
>> saved
>>                 test -n "$GIT_QUIET" || git ls-files --killed | sed 
>> 's/^/\t/'
>> -------------------------------------------
>>
>> It seems to work in my extremely limited testing. Though, I'm pretty
>sure 
>> that there'll be quite a few error cases... (Especially, as I just
>made
>> a naive attempt at patching git-stash, so I could go on with a few
>other 
>> things).
>
>I am not sure adding "--directory" there is safe.  Aren't there
>cases where saving a stash and going back to the committed state
>will involve killing no directories, but some files?  If your local
>change is to remove a directory and files in it from your working
>tree and then deposit a newly created file at the path where the
>directory was in the HEAD, stashing that local change and then going
>back to the HEAD will involve removing the new file from the working
>tree, and unless you have "git add"ed the new file, it will be lost.

Yes, it's more than likely that there are some real issues with adding - -directory here. I just realised that in the specific case I needed to run stash, I could do that by adding either of -u or -f as options. Obviously, 

>> Do anyone have any better idea on how to approach this?
>
>Teaching "ls-files" to leave early once it seens even a single
>output is probably a possibility.

Would that mean that we're able to fail early? That's certainly an improvement, but not a working situation.

In my case, running stash in an OpenEmbedded checkout (including untracked directories for builds and caches), I gave up waiting on stash running ls-files after running at 100% for more than 12 minutes. Git status returned after a couple of seconds.

>>> I see a lot of room for optimization here.  Most importantly, git
>>> ls-files --killed really doesn't need to look at any directory entry
>>> unless something in the index would conflict with it.
>
>This observation probably is correct, even though I didn't think
>about it long enough.

I'd think that something like this is needed. As the index should be rather small, compared to e.g. some large untracked directory.

Cheers, 
Anders 


.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-13 17:36     ` Anders Darander
@ 2013-08-13 17:52       ` Junio C Hamano
  2013-08-13 21:47         ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-08-13 17:52 UTC (permalink / raw)
  To: Anders Darander; +Cc: git

Anders Darander <anders.darander@gmail.com> writes:

>>> Do anyone have any better idea on how to approach this?
>>
>>Teaching "ls-files" to leave early once it seens even a single
>>output is probably a possibility.
>
> Would that mean that we're able to fail early?

Heh, good point.  "Leave once you find one path" does not help the
most common "sane" case where you do not kill any path, so it does
not help us at all.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-13 17:52       ` Junio C Hamano
@ 2013-08-13 21:47         ` Junio C Hamano
  2013-08-15 17:52           ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-08-13 21:47 UTC (permalink / raw)
  To: Anders Darander; +Cc: git, Petr Baudis, Josh Triplett

[administrivia: people on the original thread added back on CC]

Junio C Hamano <gitster@pobox.com> writes:

> Anders Darander <anders.darander@gmail.com> writes:
>
>>>> Do anyone have any better idea on how to approach this?
>>>
>>>Teaching "ls-files" to leave early once it seens even a single
>>>output is probably a possibility.
>>
>> Would that mean that we're able to fail early?
>
> Heh, good point.  "Leave once you find one path" does not help the
> most common "sane" case where you do not kill any path, so it does
> not help us at all.

In any case, this is a regression introduced in 'master' since the
last release, and the attempted fix was for an issue that has long
been with us, so I'll revert a7365313 (git stash: avoid data loss
when "git stash save" kills a directory, 2013-06-28) soon.  For
today's -rc3, I'm already deep into the integration cycle, so it is
too late to do the revert it and then redo everything.

Then we will plan to re-apply the patch once "ls-files --killed"
gets fixed not to waste too much cycles needlessly, after the coming
release.

Thanks for a report and discussion.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-13 21:47         ` Junio C Hamano
@ 2013-08-15 17:52           ` Junio C Hamano
  2013-08-15 18:07             ` Josh Triplett
  2013-08-15 19:47             ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 17:52 UTC (permalink / raw)
  To: git; +Cc: Anders Darander, Petr Baudis, Josh Triplett

Junio C Hamano <gitster@pobox.com> writes:

> In any case, this is a regression introduced in 'master' since the
> last release, and the attempted fix was for an issue that has long
> been with us, so I'll revert a7365313 (git stash: avoid data loss
> when "git stash save" kills a directory, 2013-06-28) soon.  For
> today's -rc3, I'm already deep into the integration cycle, so it is
> too late to do the revert it and then redo everything.
>
> Then we will plan to re-apply the patch once "ls-files --killed"
> gets fixed not to waste too much cycles needlessly, after the coming
> release.

I've already reverted the problematic patch to "git stash" and it
will not be part of the upcoming release.

Here is a quick attempt to see if we can do better in "ls-files -k".

We have an existing test t3010.3 that tries all the combinations of
directory turning into a regular file, symlink, etc. and vice versa,
and it seems to pass.  The test has a directory path6 in the working
tree without any paths in it in the index, and the added bypass code
seems to correctly trigger and prevents us from digging into that
directory, so this patch may be sufficient to improve "ls-files -k".

By the way, regarding the reverted commit, I do not think it is
enough to ask "ls-files -k" to see if the state recorded in the
current index is sufficient.  Imagine your HEAD records "path" as a
file and then you did this:

    $ git reset --hard ;# "path" is now a regular file
    $ mv path path.bak
    $ mkdir path
    $ mv path.bak path/file
    $ git add -A ;# "path/file" in the index and in the working tree
    $ >path/cruft ;# "path/cruft" in the working tree

Then call "save_stash" without saving untracked.  The resulting
stash will save the contents of "path/file" but "path/cruft" is not
recorded anywhere, and then we would need to bring the state in the
working tree and the index back to the state recorded in HEAD, hence
"path" needs to be turned back to a directory.

But "ls-files -k" is asked to check with the index, which has the
path as a directory, so this case is missed.

So instead of

	test -n "$(git ls-files --killed | head -n 1)"

in Pasky's patch, which probably is a right thing to do if you are
running "git stash save --keep-index", you would need something like
this if you are not running with "--keep-index":

	test -n "$(
        	GIT_INDEX_FILE=tmp_index
                export GIT_INDEX_FILE
                git read-tree HEAD
                git ls-files -k
	)"

in order to make sure that the result of going back to the state in
the HEAD will not clobber leftover "path/cruft".

 builtin/ls-files.c | 2 ++
 dir.c              | 9 +++++++++
 dir.h              | 3 ++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 5cf3e31..8500446 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -219,6 +219,8 @@ static void show_files(struct dir_struct *dir)
 
 	/* For cached/deleted files we don't need to even do the readdir */
 	if (show_others || show_killed) {
+		if (!show_others)
+			dir->flags |= DIR_COLLECT_KILLED_ONLY;
 		fill_directory(dir, pathspec);
 		if (show_others)
 			show_other_files(dir);
diff --git a/dir.c b/dir.c
index 910bfcd..02939e2 100644
--- a/dir.c
+++ b/dir.c
@@ -1183,6 +1183,15 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 	    cache_name_exists(path->buf, path->len, ignore_case))
 		return path_none;
 
+	/*
+	 * A directory can only contain killed files if the index
+	 * has a path that wants it to be a non-directory.
+	 */
+	if ((dir->flags & DIR_COLLECT_KILLED_ONLY) &&
+	    (dtype == DT_DIR) &&
+	    !cache_name_exists(path->buf, path->len, ignore_case))
+		return path_none;
+
 	exclude = is_excluded(dir, path->buf, &dtype);
 
 	/*
diff --git a/dir.h b/dir.h
index 3d6b80c..4677b86 100644
--- a/dir.h
+++ b/dir.h
@@ -80,7 +80,8 @@ struct dir_struct {
 		DIR_HIDE_EMPTY_DIRECTORIES = 1<<2,
 		DIR_NO_GITLINKS = 1<<3,
 		DIR_COLLECT_IGNORED = 1<<4,
-		DIR_SHOW_IGNORED_TOO = 1<<5
+		DIR_SHOW_IGNORED_TOO = 1<<5,
+		DIR_COLLECT_KILLED_ONLY = 1<<6
 	} flags;
 	struct dir_entry **entries;
 	struct dir_entry **ignored;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-15 17:52           ` Junio C Hamano
@ 2013-08-15 18:07             ` Josh Triplett
  2013-08-15 18:58               ` Junio C Hamano
  2013-08-15 19:47             ` Junio C Hamano
  1 sibling, 1 reply; 15+ messages in thread
From: Josh Triplett @ 2013-08-15 18:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Anders Darander, Petr Baudis

On Thu, Aug 15, 2013 at 10:52:39AM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > In any case, this is a regression introduced in 'master' since the
> > last release, and the attempted fix was for an issue that has long
> > been with us, so I'll revert a7365313 (git stash: avoid data loss
> > when "git stash save" kills a directory, 2013-06-28) soon.  For
> > today's -rc3, I'm already deep into the integration cycle, so it is
> > too late to do the revert it and then redo everything.
> >
> > Then we will plan to re-apply the patch once "ls-files --killed"
> > gets fixed not to waste too much cycles needlessly, after the coming
> > release.
> 
> I've already reverted the problematic patch to "git stash" and it
> will not be part of the upcoming release.

Thanks!

> Here is a quick attempt to see if we can do better in "ls-files -k".
> 
> We have an existing test t3010.3 that tries all the combinations of
> directory turning into a regular file, symlink, etc. and vice versa,
> and it seems to pass.  The test has a directory path6 in the working
> tree without any paths in it in the index, and the added bypass code
> seems to correctly trigger and prevents us from digging into that
> directory, so this patch may be sufficient to improve "ls-files -k".
> 
> By the way, regarding the reverted commit, I do not think it is
> enough to ask "ls-files -k" to see if the state recorded in the
> current index is sufficient.  Imagine your HEAD records "path" as a
> file and then you did this:
> 
>     $ git reset --hard ;# "path" is now a regular file
>     $ mv path path.bak
>     $ mkdir path
>     $ mv path.bak path/file
>     $ git add -A ;# "path/file" in the index and in the working tree
>     $ >path/cruft ;# "path/cruft" in the working tree
> 
> Then call "save_stash" without saving untracked.  The resulting
> stash will save the contents of "path/file" but "path/cruft" is not
> recorded anywhere, and then we would need to bring the state in the
> working tree and the index back to the state recorded in HEAD, hence
> "path" needs to be turned back to a directory.
> 
> But "ls-files -k" is asked to check with the index, which has the
> path as a directory, so this case is missed.

Since git stash resets to the state in HEAD, whatever --killed check it
does needs to check against HEAD, yes.  It still doesn't need to check
any path that doesn't exist in HEAD, though; it makes more sense to
drive this from the list of files in HEAD rather than from the list of
files in the working directory, even with a filter applied to the latter
to prune bits not in HEAD.

> So instead of
> 
> 	test -n "$(git ls-files --killed | head -n 1)"
> 
> in Pasky's patch, which probably is a right thing to do if you are
> running "git stash save --keep-index", you would need something like
> this if you are not running with "--keep-index":
> 
> 	test -n "$(
>         	GIT_INDEX_FILE=tmp_index
>                 export GIT_INDEX_FILE
>                 git read-tree HEAD
>                 git ls-files -k
> 	)"
> 
> in order to make sure that the result of going back to the state in
> the HEAD will not clobber leftover "path/cruft".

Sure, that works.  However, wouldn't it make sense to just directly let
git ls-files output to the screen, then test its return value (after
adding some ls-files option to set the return value)?  Since ls-files
--killed will have no output if git stash can proceed, and since git
stash should show the list of files that'd be killed before it fails,
using the output directly makes sense.

- Josh Triplett

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-15 18:07             ` Josh Triplett
@ 2013-08-15 18:58               ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 18:58 UTC (permalink / raw)
  To: Josh Triplett; +Cc: git, Anders Darander, Petr Baudis

Josh Triplett <josh@joshtriplett.org> writes:

>> I've already reverted the problematic patch to "git stash" and it
>> will not be part of the upcoming release.
>
> Thanks!
>
>> Here is a quick attempt to see if we can do better in "ls-files -k".

Having said that, I am curious if the result of applying the patch
you are responding to, without reverting the "git stash" patch, is
now usable in the working tree you earlier had trouble with.

> Sure, that works.  However, wouldn't it make sense to just directly let
> git ls-files output to the screen, then test its return value (after
> adding some ls-files option to set the return value)?

Not really.

We may want to add "exit early if we see even a single killed file"
option to the command so that we can simplify the "are we going to
abort" logic, but the error codepath that is executed after that
decision is made is not performance critical, and may need more
flexibility than always spewing everything that will be killed,
which could be thousands of crufts.  So I think using two separate
invocations to "ls-files --killed" is a necessity anyway.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: git stash takes excessively long when many untracked files present
  2013-08-15 17:52           ` Junio C Hamano
  2013-08-15 18:07             ` Josh Triplett
@ 2013-08-15 19:47             ` Junio C Hamano
  2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 19:47 UTC (permalink / raw)
  To: git; +Cc: Anders Darander, Petr Baudis, Josh Triplett

Junio C Hamano <gitster@pobox.com> writes:

> diff --git a/dir.c b/dir.c
> index 910bfcd..02939e2 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1183,6 +1183,15 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
>  	    cache_name_exists(path->buf, path->len, ignore_case))
>  		return path_none;
>  
> +	/*
> +	 * A directory can only contain killed files if the index
> +	 * has a path that wants it to be a non-directory.
> +	 */
> +	if ((dir->flags & DIR_COLLECT_KILLED_ONLY) &&
> +	    (dtype == DT_DIR) &&
> +	    !cache_name_exists(path->buf, path->len, ignore_case))
> +		return path_none;
> +

I think this is wrong.

When we are looking at a directory P in the working tree, there are
three cases:

 (1) P exists in the index.  Everything inside the directory P in
     the working tree needs to go when P is checked out from the
     index.

 (2) P does not exist in the index, but there is P/Q in the index.
     We know P will stay a directory when we check out the contents
     of the index, but we do not know yet if there is a directory
     P/Q in the working tree to be killed, so we need to recurse.

 (3) P does not exist in the index, and there is no P/Q in the index
     to require P to be a directory, either.  Only in this case, we
     know that everything inside P will not be killed without
     recursing.

The patch will break with the second case, I think.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 0/3] Optimizing "ls-files -k"
  2013-08-15 19:47             ` Junio C Hamano
@ 2013-08-15 21:28               ` Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 1/3] dir.c: use the cache_* macro to access the current index Junio C Hamano
                                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 21:28 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Josh Triplett

"ls-files -o" and "ls-files -k" both traverse the working tree down
to find either all untracked paths or those that will be "killed"
(removed from the working tree to make room) when the paths recorded
in the index are checked out.

It is necessary to traverse the working tree fully when enumerating
all the "other" paths, but when we are only interested in "killed"
paths, we can take advantage of the fact that paths that do not
overlap with entries in the index can never be killed.

The first one is an independent clean-up.  No public API in the
working tree traversal takes alternate in-core index, so there is no
reason to explicitly use the_index and index_* functions from the
in-core index API.

The second one is rerolled from the "something like this" patch I
sent earlier, but corrects the "we see a directory, it is not in the
index, but a file in it is" case.

And the third one adds a testcase that illustrates why the earlier
"something like this" patch is not sufficient.

These are designed to apply on top of v1.8.3, and needs a bit of
conflict resolution for the upcoming v1.8.4 codebase; I'll queue
them in 'pu' for now.

Note that t3010, especially after merged to 'pu', will use many
different ways to create a test file.  Some redirect "date" into it,
some redirect ":" into it, some "touch" it, and some just redirect
with no command.

	date >file1
	: >file2
	touch file3
	>file4

We should consolidate them all to just do ">file4" after making sure
the contents do not matter (we kind of know it already, as "date"
will output string that is not repeatable).  Use of "touch" for
anything other than updating the timestamp is especially bad, as it
is misleading.

Junio C Hamano (3):
  dir.c: use the cache_* macro to access the current index
  ls-files -k: a directory only can be killed if the index has a non-directory
  t3010: update to demonstrate "ls-files -k" optimization pitfalls

 builtin/ls-files.c                  |  2 ++
 dir.c                               | 40 +++++++++++++++++++++++++++++--------
 dir.h                               |  3 ++-
 t/t3010-ls-files-killed-modified.sh | 12 ++++++++---
 4 files changed, 45 insertions(+), 12 deletions(-)

-- 
1.8.4-rc3-232-ga8053f8

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] dir.c: use the cache_* macro to access the current index
  2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
@ 2013-08-15 21:28                 ` Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 2/3] ls-files -k: a directory only can be killed if the index has a non-directory Junio C Hamano
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 21:28 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Josh Triplett

These codepaths always start from the_index and use index_*
functions, but there is no reason to do so.  Use the compatibility
cache_* macro to access the current in-core index like everybody
else.

While at it, fix typo in the comment for a function to check if a
path within a directory appears in the index.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 dir.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/dir.c b/dir.c
index a5926fb..2f82cd1 100644
--- a/dir.c
+++ b/dir.c
@@ -472,15 +472,14 @@ static void *read_skip_worktree_file_from_index(const char *path, size_t *size)
 	unsigned long sz;
 	enum object_type type;
 	void *data;
-	struct index_state *istate = &the_index;
 
 	len = strlen(path);
-	pos = index_name_pos(istate, path, len);
+	pos = cache_name_pos(path, len);
 	if (pos < 0)
 		return NULL;
-	if (!ce_skip_worktree(istate->cache[pos]))
+	if (!ce_skip_worktree(active_cache[pos]))
 		return NULL;
-	data = read_sha1_file(istate->cache[pos]->sha1, &type, &sz);
+	data = read_sha1_file(active_cache[pos]->sha1, &type, &sz);
 	if (!data || type != OBJ_BLOB) {
 		free(data);
 		return NULL;
@@ -924,13 +923,13 @@ enum exist_status {
 };
 
 /*
- * Do not use the alphabetically stored index to look up
+ * Do not use the alphabetically sorted index to look up
  * the directory name; instead, use the case insensitive
  * name hash.
  */
 static enum exist_status directory_exists_in_index_icase(const char *dirname, int len)
 {
-	struct cache_entry *ce = index_name_exists(&the_index, dirname, len + 1, ignore_case);
+	struct cache_entry *ce = cache_name_exists(dirname, len + 1, ignore_case);
 	unsigned char endchar;
 
 	if (!ce)
-- 
1.8.4-rc3-232-ga8053f8

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] ls-files -k: a directory only can be killed if the index has a non-directory
  2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 1/3] dir.c: use the cache_* macro to access the current index Junio C Hamano
@ 2013-08-15 21:28                 ` Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 3/3] t3010: update to demonstrate "ls-files -k" optimization pitfalls Junio C Hamano
  2013-08-15 23:30                 ` [PATCH 4/3] git stash: avoid data loss when "git stash save" kills a directory Junio C Hamano
  3 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 21:28 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Josh Triplett

"ls-files -o" and "ls-files -k" both traverse the working tree down
to find either all untracked paths or those that will be "killed"
(removed from the working tree to make room) when the paths recorded
in the index are checked out.  It is necessary to traverse the
working tree fully when enumerating all the "other" paths, but when
we are only interested in "killed" paths, we can take advantage of
the fact that paths that do not overlap with entries in the index
can never be killed.

The treat_one_path() helper function, which is called during the
recursive traversal, is the ideal place to implement an
optimization.

When we are looking at a directory P in the working tree, there are
three cases:

 (1) P exists in the index.  Everything inside the directory P in
     the working tree needs to go when P is checked out from the
     index.

 (2) P does not exist in the index, but there is P/Q in the index.
     We know P will stay a directory when we check out the contents
     of the index, but we do not know yet if there is a directory
     P/Q in the working tree to be killed, so we need to recurse.

 (3) P does not exist in the index, and there is no P/Q in the index
     to require P to be a directory, either.  Only in this case, we
     know that everything inside P will not be killed without
     recursing.

Note that this helper is called by treat_leading_path() that decides
if we need to traverse only subdirectories of a single common
leading directory, which is essential for this optimization to be
correct.  This caller checks each level of the leading path
component from shallower directory to deeper ones, and that is what
allows us to only check if the path appears in the index.  If the
call to treat_one_path() weren't there, given a path P/Q/R, the real
traversal may start from directory P/Q/R, even when the index
records P as a regular file, and we would end up having to check if
any leading subpath in P/Q/R, e.g. P, appears in the index.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/ls-files.c |  2 ++
 dir.c              | 29 +++++++++++++++++++++++++++--
 dir.h              |  3 ++-
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 2202072..c7eb6f4 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -213,6 +213,8 @@ static void show_files(struct dir_struct *dir)
 
 	/* For cached/deleted files we don't need to even do the readdir */
 	if (show_others || show_killed) {
+		if (!show_others)
+			dir->flags |= DIR_COLLECT_KILLED_ONLY;
 		fill_directory(dir, pathspec);
 		if (show_others)
 			show_other_files(dir);
diff --git a/dir.c b/dir.c
index 2f82cd1..ff768f3 100644
--- a/dir.c
+++ b/dir.c
@@ -1173,12 +1173,37 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
 					  int dtype, struct dirent *de)
 {
 	int exclude;
+	int has_path_in_index = !!cache_name_exists(path->buf, path->len, ignore_case);
+
 	if (dtype == DT_UNKNOWN)
 		dtype = get_dtype(de, path->buf, path->len);
 
 	/* Always exclude indexed files */
-	if (dtype != DT_DIR &&
-	    cache_name_exists(path->buf, path->len, ignore_case))
+	if (dtype != DT_DIR && has_path_in_index)
+		return path_none;
+
+	/*
+	 * When we are looking at a directory P in the working tree,
+	 * there are three cases:
+	 *
+	 * (1) P exists in the index.  Everything inside the directory P in
+	 * the working tree needs to go when P is checked out from the
+	 * index.
+	 *
+	 * (2) P does not exist in the index, but there is P/Q in the index.
+	 * We know P will stay a directory when we check out the contents
+	 * of the index, but we do not know yet if there is a directory
+	 * P/Q in the working tree to be killed, so we need to recurse.
+	 *
+	 * (3) P does not exist in the index, and there is no P/Q in the index
+	 * to require P to be a directory, either.  Only in this case, we
+	 * know that everything inside P will not be killed without
+	 * recursing.
+	 */
+	if ((dir->flags & DIR_COLLECT_KILLED_ONLY) &&
+	    (dtype == DT_DIR) &&
+	    !has_path_in_index &&
+	    (directory_exists_in_index(path->buf, path->len) == index_nonexistent))
 		return path_none;
 
 	exclude = is_excluded(dir, path->buf, &dtype);
diff --git a/dir.h b/dir.h
index 3d6b80c..4677b86 100644
--- a/dir.h
+++ b/dir.h
@@ -80,7 +80,8 @@ struct dir_struct {
 		DIR_HIDE_EMPTY_DIRECTORIES = 1<<2,
 		DIR_NO_GITLINKS = 1<<3,
 		DIR_COLLECT_IGNORED = 1<<4,
-		DIR_SHOW_IGNORED_TOO = 1<<5
+		DIR_SHOW_IGNORED_TOO = 1<<5,
+		DIR_COLLECT_KILLED_ONLY = 1<<6
 	} flags;
 	struct dir_entry **entries;
 	struct dir_entry **ignored;
-- 
1.8.4-rc3-232-ga8053f8

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] t3010: update to demonstrate "ls-files -k" optimization pitfalls
  2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 1/3] dir.c: use the cache_* macro to access the current index Junio C Hamano
  2013-08-15 21:28                 ` [PATCH 2/3] ls-files -k: a directory only can be killed if the index has a non-directory Junio C Hamano
@ 2013-08-15 21:28                 ` Junio C Hamano
  2013-08-15 23:30                 ` [PATCH 4/3] git stash: avoid data loss when "git stash save" kills a directory Junio C Hamano
  3 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 21:28 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Josh Triplett

An earlier draft of the previous step used cache_name_exists() to
check the directory we were looking at, which missed the second case
described in its log message.  Demonstrate why it is not sufficient.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t3010-ls-files-killed-modified.sh | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/t/t3010-ls-files-killed-modified.sh b/t/t3010-ls-files-killed-modified.sh
index 95671c2..6ea7ca8 100755
--- a/t/t3010-ls-files-killed-modified.sh
+++ b/t/t3010-ls-files-killed-modified.sh
@@ -11,6 +11,7 @@ This test prepares the following in the cache:
     path1       - a symlink
     path2/file2 - a file in a directory
     path3/file3 - a file in a directory
+    pathx/ju    - a file in a directory
 
 and the following on the filesystem:
 
@@ -21,6 +22,7 @@ and the following on the filesystem:
     path4	- a file
     path5	- a symlink
     path6/file6 - a file in a directory
+    pathx/ju/nk - a file in a directory to be killed
 
 git ls-files -k should report that existing filesystem
 objects except path4, path5 and path6/file6 to be killed.
@@ -44,16 +46,17 @@ then
 else
 	date > path1
 fi
-mkdir path2 path3
+mkdir path2 path3 pathx
 date >path2/file2
 date >path3/file3
+>pathx/ju
 : >path7
 date >path8
 : >path9
 date >path10
 test_expect_success \
     'git update-index --add to add various paths.' \
-    "git update-index --add -- path0 path1 path?/file? path7 path8 path9 path10"
+    "git update-index --add -- path0 path1 path?/file? pathx/ju path7 path8 path9 path10"
 
 rm -fr path? ;# leave path10 alone
 date >path2
@@ -65,7 +68,7 @@ else
 	date > path3
 	date > path5
 fi
-mkdir path0 path1 path6
+mkdir -p path0 path1 path6 pathx/ju
 date >path0/file0
 date >path1/file1
 date >path6/file6
@@ -73,6 +76,7 @@ date >path7
 : >path8
 : >path9
 touch path10
+>pathx/ju/nk
 
 test_expect_success \
     'git ls-files -k to show killed files.' \
@@ -82,6 +86,7 @@ path0/file0
 path1/file1
 path2
 path3
+pathx/ju/nk
 EOF
 
 test_expect_success \
@@ -98,6 +103,7 @@ path2/file2
 path3/file3
 path7
 path8
+pathx/ju
 EOF
 
 test_expect_success \
-- 
1.8.4-rc3-232-ga8053f8

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/3] git stash: avoid data loss when "git stash save" kills a directory
  2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
                                   ` (2 preceding siblings ...)
  2013-08-15 21:28                 ` [PATCH 3/3] t3010: update to demonstrate "ls-files -k" optimization pitfalls Junio C Hamano
@ 2013-08-15 23:30                 ` Junio C Hamano
  3 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2013-08-15 23:30 UTC (permalink / raw)
  To: git; +Cc: Petr Baudis, Josh Triplett

From: Petr Baudis <pasky@ucw.cz>

"stash save" is about saving the local change to the working tree,
but also about restoring the state of the last commit to the working
tree.  When a local change is to turn a non-directory to a directory,
in order to restore the non-directory, everything in the directory
needs to be removed.

Which is fine when running "git stash save --include-untracked",
but without that option, untracked, newly created files in the
directory will have to be discarded, if the state you are restoring
to has a non-directory at the same path as the directory.

Introduce a safety valve to fail the operation in such case, using
the "ls-files --killed" which was designed for this exact purpose.

The "stash save" is stopped when untracked files need to be
discarded because their leading path ceased to be a directory, and
the user is required to pass --force to really have the data
removed.

Signed-off-by: Petr Baudis <pasky@ucw.cz>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * And this is the reverted patch ported on top of the "ls-files -k"
   miniseries I sent earlier.  The updates to the test in t3903
   compared to the original illustrates that the check implemented
   in the original did not protect once a path that was turned into
   a directory from a file gets added to the index, which this round
   also fixes by running "ls-files -k" against the state in the HEAD.

 Documentation/git-stash.txt | 11 +++++++++--
 git-stash.sh                | 20 ++++++++++++++++++++
 t/t3903-stash.sh            | 22 ++++++++++++++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-stash.txt b/Documentation/git-stash.txt
index 711ffe1..61fadc5 100644
--- a/Documentation/git-stash.txt
+++ b/Documentation/git-stash.txt
@@ -14,7 +14,7 @@ SYNOPSIS
 'git stash' ( pop | apply ) [--index] [-q|--quiet] [<stash>]
 'git stash' branch <branchname> [<stash>]
 'git stash' [save [--patch] [-k|--[no-]keep-index] [-q|--quiet]
-	     [-u|--include-untracked] [-a|--all] [<message>]]
+	     [-u|--include-untracked] [-a|--all] [-f|--force] [<message>]]
 'git stash' clear
 'git stash' create
 
@@ -43,7 +43,7 @@ is also possible).
 OPTIONS
 -------
 
-save [-p|--patch] [--[no-]keep-index] [-u|--include-untracked] [-a|--all] [-q|--quiet] [<message>]::
+save [-p|--patch] [--[no-]keep-index] [-u|--include-untracked] [-a|--all] [-q|--quiet] [-f|--force] [<message>]::
 
 	Save your local modifications to a new 'stash', and run `git reset
 	--hard` to revert them.  The <message> part is optional and gives
@@ -70,6 +70,13 @@ linkgit:git-add[1] to learn how to operate the `--patch` mode.
 +
 The `--patch` option implies `--keep-index`.  You can use
 `--no-keep-index` to override this.
++
+In some cases, saving a stash could mean irretrievably removing some
+data - if a directory with untracked files replaces a tracked file of
+the same name, the new untracked files are not saved (except in case
+of `--include-untracked`) but the original tracked file shall be restored.
+By default, `stash save` will abort in such a case; `--force` will allow
+it to remove the untracked files.
 
 list [<options>]::
 
diff --git a/git-stash.sh b/git-stash.sh
index bbefdf6..2d539f3 100755
--- a/git-stash.sh
+++ b/git-stash.sh
@@ -156,10 +156,19 @@ create_stash () {
 	die "$(gettext "Cannot record working tree state")"
 }
 
+# This helper MUST be run inside a subshell.
+list_killed_files () {
+	GIT_INDEX_FILE=$TMP-ls-files-k &&
+	export GIT_INDEX_FILE &&
+	git read-tree HEAD &&
+	git ls-files --killed
+}
+
 save_stash () {
 	keep_index=
 	patch_mode=
 	untracked=
+	force=
 	while test $# != 0
 	do
 		case "$1" in
@@ -180,6 +189,9 @@ save_stash () {
 		-u|--include-untracked)
 			untracked=untracked
 			;;
+		-f|--force)
+			force=t
+			;;
 		-a|--all)
 			untracked=all
 			;;
@@ -223,6 +235,14 @@ save_stash () {
 		say "$(gettext "No local changes to save")"
 		exit 0
 	fi
+	if test -z "$untracked$force" &&
+	   test -n "$(list_killed_files | head -n 1)"
+	then
+		say "$(gettext "The following untracked files would NOT be saved but need to be removed by stash save:")"
+		test -n "$GIT_QUIET" || (list_killed_files | sed 's/^/\t/')
+		say "$(gettext "Aborting. Consider using either the --force or --include-untracked option.")" >&2
+		exit 1
+	fi
 	test -f "$GIT_DIR/logs/$ref_stash" ||
 		clear_stash || die "$(gettext "Cannot initialize stash")"
 
diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
index 5dfbda7..08ce23b 100755
--- a/t/t3903-stash.sh
+++ b/t/t3903-stash.sh
@@ -637,4 +637,26 @@ test_expect_success 'stash where working directory contains "HEAD" file' '
 	test_cmp output expect
 '
 
+test_expect_success 'stash a change to turn a non-directory to a directory' '
+	git reset --hard &&
+	>testfile &&
+	git add testfile &&
+	git commit -m "add testfile as a regular file" &&
+	rm testfile &&
+	mkdir testfile &&
+	>testfile/file &&
+	test_must_fail git stash save "recover regular file" &&
+	test -f testfile/file &&
+
+	git add testfile/file &&
+	test_must_fail git stash save "recover regular file after adding" &&
+	test -f testfile/file
+'
+
+test_expect_success 'stash a change to turn a non-directory to a directory (forced)' '
+	git stash save --force "recover regular file (forced)" &&
+	! test -f testfile/file &&
+	test -f testfile
+'
+
 test_done
-- 
1.8.4-rc3-236-g903ae4b

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-08-15 23:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-10 21:44 git stash takes excessively long when many untracked files present Josh Triplett
2013-08-13 10:11 ` Anders Darander
2013-08-13 17:07   ` Junio C Hamano
2013-08-13 17:36     ` Anders Darander
2013-08-13 17:52       ` Junio C Hamano
2013-08-13 21:47         ` Junio C Hamano
2013-08-15 17:52           ` Junio C Hamano
2013-08-15 18:07             ` Josh Triplett
2013-08-15 18:58               ` Junio C Hamano
2013-08-15 19:47             ` Junio C Hamano
2013-08-15 21:28               ` [PATCH 0/3] Optimizing "ls-files -k" Junio C Hamano
2013-08-15 21:28                 ` [PATCH 1/3] dir.c: use the cache_* macro to access the current index Junio C Hamano
2013-08-15 21:28                 ` [PATCH 2/3] ls-files -k: a directory only can be killed if the index has a non-directory Junio C Hamano
2013-08-15 21:28                 ` [PATCH 3/3] t3010: update to demonstrate "ls-files -k" optimization pitfalls Junio C Hamano
2013-08-15 23:30                 ` [PATCH 4/3] git stash: avoid data loss when "git stash save" kills a directory Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.