All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Improving performance of git clean
@ 2015-04-06 11:48 Erik Elfström
  2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Erik Elfström @ 2015-04-06 11:48 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

This series addresses a performance issue of git clean previously
discussed here:
http://thread.gmane.org/gmane.comp.version-control.git/265560/focus=265560
and here:
http://thread.gmane.org/gmane.comp.version-control.git/266777/focus=266777

The issue manifests when trying to clean a large number of untracked
directories. In my case this scenario triggered by a test suite
running in the repository creating a directory for each test resulting
in a build directory with ~100000 sub directories that needs to be
cleaned. For some extreme cases, clean times of more than 1h have been
observed.

With this series, the time to clean an untracked directory containing
100000 sub directories goes from 61s to 1.7s.

The main change is to switch the repository check in
clean.c:remove_dirs from using refs.c:resolve_gitlink_ref to
setup.c:is_git_directory.

One potential issue that is_git_directory contains the following check:

	if (getenv(DB_ENVIRONMENT)) {
		if (access(getenv(DB_ENVIRONMENT), X_OK))
			return 0;
	}

I'm not sure how this will affect this usecase (checking for some
other nested git repo). Please give some thought to this when
reviewing.

Jeff King also expressed concerns that we may have similar performance
issues in other commands and that it could be good to unify these "is
this a repo?"-checks. This series only attempts to solve the git-clean
case.

Erik Elfström (3):
  t7300: add tests to document behavior of clean and nested git
  p7300: added performance tests for clean
  clean: improve performance when removing lots of directories

 builtin/clean.c       | 23 ++++++++++++---
 t/perf/p7300-clean.sh | 37 +++++++++++++++++++++++
 t/t7300-clean.sh      | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 138 insertions(+), 4 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.rc0.37.ga3b75b3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
  2015-04-06 11:48 [PATCH 0/3] Improving performance of git clean Erik Elfström
@ 2015-04-06 11:48 ` Erik Elfström
  2015-04-06 22:06   ` Eric Sunshine
  2015-04-07 19:40   ` Eric Sunshine
  2015-04-06 11:48 ` [PATCH 2/3] p7300: added performance tests for clean Erik Elfström
  2015-04-06 11:48 ` [PATCH 3/3] clean: improve performance when removing lots of directories Erik Elfström
  2 siblings, 2 replies; 14+ messages in thread
From: Erik Elfström @ 2015-04-06 11:48 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---

These tests were added so that I could understand the corner case
behaviors of clean and nested repositories and document the changes in
behavior that this series will cause. The ones marked as expect
failure will be changed to expect success later in the series.

 t/t7300-clean.sh | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..cfdf6d4 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
 	! test -d bar
 '
 
+test_expect_failure 'nested git (only init) should be kept' '
+	rm -fr foo bar &&
+	mkdir foo bar &&
+	(
+		cd foo &&
+		git init
+	) &&
+	(
+		cd bar &&
+		>goodbye.people
+	) &&
+	git clean -f -d &&
+	test -f foo/.git/HEAD &&
+	! test -d bar
+'
+
+test_expect_failure 'nested git (bare) should be kept' '
+	rm -fr foo bar &&
+	mkdir foo bar &&
+	(
+		cd foo &&
+		git init --bare
+	) &&
+	(
+		cd bar &&
+		>goodbye.people
+	) &&
+	git clean -f -d &&
+	test -f foo/HEAD &&
+	! test -d bar
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+	rm -fr foo &&
+	mkdir foo &&
+	(
+		cd foo &&
+		git init &&
+		mkdir -p bar/baz &&
+		cd bar/baz &&
+		>hello.world
+		git add . &&
+		git commit -a -m nested
+	) &&
+	git clean -f -d foo/bar/baz &&
+	test -f foo/.git/HEAD &&
+	test -d foo/bar/ &&
+	! test -d foo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+	rm -fr foo &&
+	mkdir foo bar &&
+	(
+		cd foo &&
+		git init &&
+		>hello.world
+		git add . &&
+		git commit -a -m nested
+	) &&
+	git clean -f -d foo/.git &&
+	test -f foo/.git/HEAD &&
+	test -d foo/.git/refs &&
+	test -d foo/.git/objects &&
+	test -d bar/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+	rm -fr foo bar &&
+	mkdir foo bar &&
+	(
+		cd foo &&
+		git init &&
+		>hello.world
+		git add . &&
+		git commit -a -m nested
+	) &&
+	git clean -f -d foo/.git/ &&
+	test 0 = $(ls -A foo/.git | wc -l) &&
+	test -d foo/.git
+'
+
 test_expect_success 'force removal of nested git work tree' '
 	rm -fr foo bar baz &&
 	mkdir -p foo bar baz/boo &&
-- 
2.4.0.rc0.37.ga3b75b3

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] p7300: added performance tests for clean
  2015-04-06 11:48 [PATCH 0/3] Improving performance of git clean Erik Elfström
  2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
@ 2015-04-06 11:48 ` Erik Elfström
  2015-04-06 20:40   ` Torsten Bögershausen
  2015-04-06 11:48 ` [PATCH 3/3] clean: improve performance when removing lots of directories Erik Elfström
  2 siblings, 1 reply; 14+ messages in thread
From: Erik Elfström @ 2015-04-06 11:48 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 t/perf/p7300-clean.sh | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 0000000..3f56fb2
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description="Test git-clean performance"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+	rm -rf 500_sub_dirs 50000_sub_dirs clean_test_dir &&
+	mkdir 500_sub_dirs 50000_sub_dirs clean_test_dir &&
+	for i in $(seq 1 500)
+	do
+		mkdir 500_sub_dirs/dir$i
+	done &&
+	for i in $(seq 1 100)
+	do
+		cp -r 500_sub_dirs 50000_sub_dirs/dir$i
+	done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+	rm -rf clean_test_dir/50000_sub_dirs_cpy &&
+	cp -r 50000_sub_dirs clean_test_dir/50000_sub_dirs_cpy &&
+	git clean -q -f -d  clean_test_dir/ &&
+	test 0 = $(ls -A clean_test_dir/ | wc -l)
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+	rm -rf clean_test_dir/50000_sub_dirs_cpy &&
+	cp -r 50000_sub_dirs clean_test_dir/50000_sub_dirs_cpy &&
+	git clean -q -f -f -d  clean_test_dir/ &&
+	test 0 = $(ls -A clean_test_dir/ | wc -l)
+'
+
+test_done
-- 
2.4.0.rc0.37.ga3b75b3

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] clean: improve performance when removing lots of directories
  2015-04-06 11:48 [PATCH 0/3] Improving performance of git clean Erik Elfström
  2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
  2015-04-06 11:48 ` [PATCH 2/3] p7300: added performance tests for clean Erik Elfström
@ 2015-04-06 11:48 ` Erik Elfström
  2015-04-06 22:10   ` Eric Sunshine
  2 siblings, 1 reply; 14+ messages in thread
From: Erik Elfström @ 2015-04-06 11:48 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Before this change, clean used resolve_gitlink_ref to check for the
presence of nested git repositories. This had the drawback of creating
a ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list caused a massive
performance hit for large number of directories.

Teach clean.c:remove_dirs to use setup.c:is_git_directory
instead. is_git_directory will actually open HEAD and parse the HEAD
ref but this implies a nested git repository and should be rare when
cleaning.

Using is_git_directory should give a more standardized check for what
is and what isn't a git repository but also gives a slight behavioral
change. We will now detect and respect bare and empty nested git
repositories (only init run). Update t7300 to reflect this.

The time to clean an untracked directory containing 100000 sub
directories went from 61s to 1.7s after this change.

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
Helped-by: Jeff King <peff@peff.net>
---
 builtin/clean.c  | 23 +++++++++++++++++++----
 t/t7300-clean.sh |  4 ++--
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..e951bd9 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include "cache.h"
 #include "dir.h"
 #include "parse-options.h"
-#include "refs.h"
 #include "string-list.h"
 #include "quote.h"
 #include "column.h"
@@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
 	return 0;
 }
 
+static int is_git_repository(struct strbuf *path)
+{
+	int ret = 0;
+	if (is_git_directory(path->buf))
+		ret = 1;
+	else {
+		int orig_path_len = path->len;
+		if (path->buf[orig_path_len - 1] != '/')
+			strbuf_addch(path, '/');
+		strbuf_addstr(path, ".git");
+		if (is_git_directory(path->buf))
+			ret = 1;
+		strbuf_setlen(path, orig_path_len);
+	}
+
+	return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 		int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +172,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 	struct strbuf quoted = STRBUF_INIT;
 	struct dirent *e;
 	int res = 0, ret = 0, gone = 1, original_len = path->len, len;
-	unsigned char submodule_head[20];
 	struct string_list dels = STRING_LIST_INIT_DUP;
 
 	*dir_gone = 1;
 
-	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
-			!resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
+	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && is_git_repository(path)) {
 		if (!quiet) {
 			quote_path_relative(path->buf, prefix, &quoted);
 			printf(dry_run ?  _(msg_would_skip_git_dir) : _(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index cfdf6d4..ada569d 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
 	! test -d bar
 '
 
-test_expect_failure 'nested git (only init) should be kept' '
+test_expect_success 'nested git (only init) should be kept' '
 	rm -fr foo bar &&
 	mkdir foo bar &&
 	(
@@ -471,7 +471,7 @@ test_expect_failure 'nested git (only init) should be kept' '
 	! test -d bar
 '
 
-test_expect_failure 'nested git (bare) should be kept' '
+test_expect_success 'nested git (bare) should be kept' '
 	rm -fr foo bar &&
 	mkdir foo bar &&
 	(
-- 
2.4.0.rc0.37.ga3b75b3

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] p7300: added performance tests for clean
  2015-04-06 11:48 ` [PATCH 2/3] p7300: added performance tests for clean Erik Elfström
@ 2015-04-06 20:40   ` Torsten Bögershausen
  2015-04-06 22:09     ` Eric Sunshine
  0 siblings, 1 reply; 14+ messages in thread
From: Torsten Bögershausen @ 2015-04-06 20:40 UTC (permalink / raw)
  To: Erik Elfström, git

On 2015-04-06 13.48, Erik Elfström wrote:
> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
> ---
>  t/perf/p7300-clean.sh | 37 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
>  create mode 100755 t/perf/p7300-clean.sh
> 
> diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
> new file mode 100755
> index 0000000..3f56fb2
> --- /dev/null
> +++ b/t/perf/p7300-clean.sh
> @@ -0,0 +1,37 @@
> +#!/bin/sh
> +
> +test_description="Test git-clean performance"
> +
> +. ./perf-lib.sh
> +
> +test_perf_large_repo
> +test_checkout_worktree
> +
> +test_expect_success 'setup untracked directory with many sub dirs' '
> +	rm -rf 500_sub_dirs 50000_sub_dirs clean_test_dir &&
> +	mkdir 500_sub_dirs 50000_sub_dirs clean_test_dir &&
> +	for i in $(seq 1 500)
I think "seq" is bash-only, and can be easily replaced by "test_seq" 

> +	do
> +		mkdir 500_sub_dirs/dir$i
> +	done &&
> +	for i in $(seq 1 100)
> +	do
> +		cp -r 500_sub_dirs 50000_sub_dirs/dir$i
> +	done
> +'
> +
> +test_perf 'clean many untracked sub dirs, check for nested git' '
> +	rm -rf clean_test_dir/50000_sub_dirs_cpy &&
> +	cp -r 50000_sub_dirs clean_test_dir/50000_sub_dirs_cpy &&
> +	git clean -q -f -d  clean_test_dir/ &&
> +	test 0 = $(ls -A clean_test_dir/ | wc -l)

Is the "ls -A" portable on all systems:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html

My feeling is that the "emptiness" of a directory can by tested by simply removing it:
> +	git clean -q -f -d  clean_test_dir/ &&
> +	rmdir clean_test_dir
(And similar below)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
  2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
@ 2015-04-06 22:06   ` Eric Sunshine
  2015-04-07 19:27     ` erik elfström
  2015-04-07 19:40   ` Eric Sunshine
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Sunshine @ 2015-04-06 22:06 UTC (permalink / raw)
  To: Erik Elfström; +Cc: Git List

On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
> ---
> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
> index 99be5d9..cfdf6d4 100755
> --- a/t/t7300-clean.sh
> +++ b/t/t7300-clean.sh
> @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
>         ! test -d bar
>  '
>
> +test_expect_failure 'nested git (only init) should be kept' '
> +       rm -fr foo bar &&
> +       mkdir foo bar &&
> +       (
> +               cd foo &&
> +               git init
> +       ) &&
> +       (
> +               cd bar &&
> +               >goodbye.people
> +       ) &&

(minor; ignore if desired) The above could be simplified to:

    rm -fr foo bar &&
    git init foo &&
    mkdir bar &&
    >bar/goodbye.people &&

> +       git clean -f -d &&
> +       test -f foo/.git/HEAD &&
> +       ! test -d bar

Here and elsewhere, you could instead use test_path_is_file() and
test_path_is_missing(), respectively.

> +'
> +
> +test_expect_failure 'nested git (bare) should be kept' '
> +       rm -fr foo bar &&
> +       mkdir foo bar &&
> +       (
> +               cd foo &&
> +               git init --bare
> +       ) &&
> +       (
> +               cd bar &&
> +               >goodbye.people
> +       ) &&

Simplified:

    rm -fr foo bar &&
    git init --bare foo &&
    mkdir bar &&
    >bar/goodbye.people &&

> +       git clean -f -d &&
> +       test -f foo/HEAD &&
> +       ! test -d bar
> +'
> +
> +test_expect_success 'giving path in nested git work tree will remove it' '
> +       rm -fr foo &&
> +       mkdir foo &&
> +       (
> +               cd foo &&
> +               git init &&
> +               mkdir -p bar/baz &&
> +               cd bar/baz &&
> +               >hello.world
> +               git add . &&
> +               git commit -a -m nested
> +       ) &&
> +       git clean -f -d foo/bar/baz &&
> +       test -f foo/.git/HEAD &&
> +       test -d foo/bar/ &&

Alternative, here and elsewhere: test_path_is_dir()

> +       ! test -d foo/bar/baz
> +'
> +
> +test_expect_success 'giving path to nested .git will not remove it' '
> +       rm -fr foo &&
> +       mkdir foo bar &&
> +       (
> +               cd foo &&
> +               git init &&
> +               >hello.world
> +               git add . &&
> +               git commit -a -m nested
> +       ) &&
> +       git clean -f -d foo/.git &&
> +       test -f foo/.git/HEAD &&
> +       test -d foo/.git/refs &&
> +       test -d foo/.git/objects &&
> +       test -d bar/
> +'
> +
> +test_expect_success 'giving path to nested .git/ will remove contents' '
> +       rm -fr foo bar &&
> +       mkdir foo bar &&
> +       (
> +               cd foo &&
> +               git init &&
> +               >hello.world
> +               git add . &&
> +               git commit -a -m nested
> +       ) &&
> +       git clean -f -d foo/.git/ &&
> +       test 0 = $(ls -A foo/.git | wc -l) &&

Although in the latest POSIX, -A may not be portable.

Alternative: test_dir_is_empty()

> +       test -d foo/.git
> +'
> +
>  test_expect_success 'force removal of nested git work tree' '
>         rm -fr foo bar baz &&
>         mkdir -p foo bar baz/boo &&
> --
> 2.4.0.rc0.37.ga3b75b3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] p7300: added performance tests for clean
  2015-04-06 20:40   ` Torsten Bögershausen
@ 2015-04-06 22:09     ` Eric Sunshine
  2015-04-07 19:35       ` erik elfström
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Sunshine @ 2015-04-06 22:09 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: Erik Elfström, Git List

On Mon, Apr 6, 2015 at 4:40 PM, Torsten Bögershausen <tboegi@web.de> wrote:
> On 2015-04-06 13.48, Erik Elfström wrote:
>> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
>> ---
>> diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
>> new file mode 100755
>> index 0000000..3f56fb2
>> --- /dev/null
>> +++ b/t/perf/p7300-clean.sh
>> @@ -0,0 +1,37 @@
>> +#!/bin/sh
>> +
>> +test_description="Test git-clean performance"
>> +
>> +. ./perf-lib.sh
>> +
>> +test_perf_large_repo
>> +test_checkout_worktree
>> +
>> +test_expect_success 'setup untracked directory with many sub dirs' '
>> +     rm -rf 500_sub_dirs 50000_sub_dirs clean_test_dir &&
>> +     mkdir 500_sub_dirs 50000_sub_dirs clean_test_dir &&
>> +     for i in $(seq 1 500)
> I think "seq" is bash-only, and can be easily replaced by "test_seq"

test_seq is definitely desirable. 'seq' is not present on some systems I use.

>
>> +     do
>> +             mkdir 500_sub_dirs/dir$i

You may want:

    mkdir 500_sub_dirs/dir$i || return $?

to catch failure of 'mkdir'.

>> +     done &&
>> +     for i in $(seq 1 100)
>> +     do
>> +             cp -r 500_sub_dirs 50000_sub_dirs/dir$i

Ditto:

    cp -r 500_sub_dirs 50000_sub_dirs/dir$i || return $?

>> +     done
>> +'
>> +
>> +test_perf 'clean many untracked sub dirs, check for nested git' '
>> +     rm -rf clean_test_dir/50000_sub_dirs_cpy &&
>> +     cp -r 50000_sub_dirs clean_test_dir/50000_sub_dirs_cpy &&
>> +     git clean -q -f -d  clean_test_dir/ &&
>> +     test 0 = $(ls -A clean_test_dir/ | wc -l)
>
> Is the "ls -A" portable on all systems:
> http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html
>
> My feeling is that the "emptiness" of a directory can by tested by simply removing it:
>> +     git clean -q -f -d  clean_test_dir/ &&
>> +     rmdir clean_test_dir
> (And similar below)

There's also the test_dir_is_empty() function which makes the intent
even clearer than 'rmdir'.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] clean: improve performance when removing lots of directories
  2015-04-06 11:48 ` [PATCH 3/3] clean: improve performance when removing lots of directories Erik Elfström
@ 2015-04-06 22:10   ` Eric Sunshine
  2015-04-07 19:55     ` erik elfström
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Sunshine @ 2015-04-06 22:10 UTC (permalink / raw)
  To: Erik Elfström; +Cc: Git List

On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
> Before this change, clean used resolve_gitlink_ref to check for the
> presence of nested git repositories. This had the drawback of creating
> a ref_cache entry for every directory that should potentially be
> cleaned. The linear search through the ref_cache list caused a massive
> performance hit for large number of directories.
>
> Teach clean.c:remove_dirs to use setup.c:is_git_directory
> instead. is_git_directory will actually open HEAD and parse the HEAD
> ref but this implies a nested git repository and should be rare when
> cleaning.
>
> Using is_git_directory should give a more standardized check for what
> is and what isn't a git repository but also gives a slight behavioral
> change. We will now detect and respect bare and empty nested git
> repositories (only init run). Update t7300 to reflect this.
>
> The time to clean an untracked directory containing 100000 sub
> directories went from 61s to 1.7s after this change.

Impressive.

> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
> Helped-by: Jeff King <peff@peff.net>

It is customary for your sign-off to be last.

More below...

> ---
> diff --git a/builtin/clean.c b/builtin/clean.c
> index 98c103f..e951bd9 100644
> --- a/builtin/clean.c
> +++ b/builtin/clean.c
> @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
>         return 0;
>  }
>
> +static int is_git_repository(struct strbuf *path)
> +{
> +       int ret = 0;
> +       if (is_git_directory(path->buf))
> +               ret = 1;
> +       else {
> +               int orig_path_len = path->len;
> +               if (path->buf[orig_path_len - 1] != '/')

Minor: I don't know how others feel about it, but I always find it a
bit disturbing to see a potential negative array access without a
safety check that orig_path_len is not 0, either directly in the
conditional or as a documenting assert().

> +                       strbuf_addch(path, '/');
> +               strbuf_addstr(path, ".git");
> +               if (is_git_directory(path->buf))
> +                       ret = 1;
> +               strbuf_setlen(path, orig_path_len);
> +       }
> +
> +       return ret;
> +}
> +
>  static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
>                 int dry_run, int quiet, int *dir_gone)
>  {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
  2015-04-06 22:06   ` Eric Sunshine
@ 2015-04-07 19:27     ` erik elfström
  0 siblings, 0 replies; 14+ messages in thread
From: erik elfström @ 2015-04-07 19:27 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List

will fix!

On Tue, Apr 7, 2015 at 12:06 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
>> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
>> ---
>> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
>> index 99be5d9..cfdf6d4 100755
>> --- a/t/t7300-clean.sh
>> +++ b/t/t7300-clean.sh
>> @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
>>         ! test -d bar
>>  '
>>
>> +test_expect_failure 'nested git (only init) should be kept' '
>> +       rm -fr foo bar &&
>> +       mkdir foo bar &&
>> +       (
>> +               cd foo &&
>> +               git init
>> +       ) &&
>> +       (
>> +               cd bar &&
>> +               >goodbye.people
>> +       ) &&
>
> (minor; ignore if desired) The above could be simplified to:
>
>     rm -fr foo bar &&
>     git init foo &&
>     mkdir bar &&
>     >bar/goodbye.people &&
>
>> +       git clean -f -d &&
>> +       test -f foo/.git/HEAD &&
>> +       ! test -d bar
>
> Here and elsewhere, you could instead use test_path_is_file() and
> test_path_is_missing(), respectively.
>
>> +'
>> +
>> +test_expect_failure 'nested git (bare) should be kept' '
>> +       rm -fr foo bar &&
>> +       mkdir foo bar &&
>> +       (
>> +               cd foo &&
>> +               git init --bare
>> +       ) &&
>> +       (
>> +               cd bar &&
>> +               >goodbye.people
>> +       ) &&
>
> Simplified:
>
>     rm -fr foo bar &&
>     git init --bare foo &&
>     mkdir bar &&
>     >bar/goodbye.people &&
>
>> +       git clean -f -d &&
>> +       test -f foo/HEAD &&
>> +       ! test -d bar
>> +'
>> +
>> +test_expect_success 'giving path in nested git work tree will remove it' '
>> +       rm -fr foo &&
>> +       mkdir foo &&
>> +       (
>> +               cd foo &&
>> +               git init &&
>> +               mkdir -p bar/baz &&
>> +               cd bar/baz &&
>> +               >hello.world
>> +               git add . &&
>> +               git commit -a -m nested
>> +       ) &&
>> +       git clean -f -d foo/bar/baz &&
>> +       test -f foo/.git/HEAD &&
>> +       test -d foo/bar/ &&
>
> Alternative, here and elsewhere: test_path_is_dir()
>
>> +       ! test -d foo/bar/baz
>> +'
>> +
>> +test_expect_success 'giving path to nested .git will not remove it' '
>> +       rm -fr foo &&
>> +       mkdir foo bar &&
>> +       (
>> +               cd foo &&
>> +               git init &&
>> +               >hello.world
>> +               git add . &&
>> +               git commit -a -m nested
>> +       ) &&
>> +       git clean -f -d foo/.git &&
>> +       test -f foo/.git/HEAD &&
>> +       test -d foo/.git/refs &&
>> +       test -d foo/.git/objects &&
>> +       test -d bar/
>> +'
>> +
>> +test_expect_success 'giving path to nested .git/ will remove contents' '
>> +       rm -fr foo bar &&
>> +       mkdir foo bar &&
>> +       (
>> +               cd foo &&
>> +               git init &&
>> +               >hello.world
>> +               git add . &&
>> +               git commit -a -m nested
>> +       ) &&
>> +       git clean -f -d foo/.git/ &&
>> +       test 0 = $(ls -A foo/.git | wc -l) &&
>
> Although in the latest POSIX, -A may not be portable.
>
> Alternative: test_dir_is_empty()
>
>> +       test -d foo/.git
>> +'
>> +
>>  test_expect_success 'force removal of nested git work tree' '
>>         rm -fr foo bar baz &&
>>         mkdir -p foo bar baz/boo &&
>> --
>> 2.4.0.rc0.37.ga3b75b3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] p7300: added performance tests for clean
  2015-04-06 22:09     ` Eric Sunshine
@ 2015-04-07 19:35       ` erik elfström
  0 siblings, 0 replies; 14+ messages in thread
From: erik elfström @ 2015-04-07 19:35 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Torsten Bögershausen, Git List

Will fix!

Also I forgot to ask, does anyone have a good way of moving the copy
out of the performance timing?

After the fix this test spends more time copying than cleaning and
that is not so good. I'm not very good at shell scripting and the only
way I could think of was to make multiple copies at the start and then
in the test check and clean the first non empty directory. That feels
kind of ugly and will fail if a different number of iterations per
test is used. Any suggestions?

I looked a bit at the framework code but I couldn't figure out if it
was easy to add a "setup" option to be called be before each
iteration.

On Tue, Apr 7, 2015 at 12:09 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Mon, Apr 6, 2015 at 4:40 PM, Torsten Bögershausen <tboegi@web.de> wrote:
>> On 2015-04-06 13.48, Erik Elfström wrote:
>>> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
>>> ---
>>> diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
>>> new file mode 100755
>>> index 0000000..3f56fb2
>>> --- /dev/null
>>> +++ b/t/perf/p7300-clean.sh
>>> @@ -0,0 +1,37 @@
>>> +#!/bin/sh
>>> +
>>> +test_description="Test git-clean performance"
>>> +
>>> +. ./perf-lib.sh
>>> +
>>> +test_perf_large_repo
>>> +test_checkout_worktree
>>> +
>>> +test_expect_success 'setup untracked directory with many sub dirs' '
>>> +     rm -rf 500_sub_dirs 50000_sub_dirs clean_test_dir &&
>>> +     mkdir 500_sub_dirs 50000_sub_dirs clean_test_dir &&
>>> +     for i in $(seq 1 500)
>> I think "seq" is bash-only, and can be easily replaced by "test_seq"
>
> test_seq is definitely desirable. 'seq' is not present on some systems I use.
>
>>
>>> +     do
>>> +             mkdir 500_sub_dirs/dir$i
>
> You may want:
>
>     mkdir 500_sub_dirs/dir$i || return $?
>
> to catch failure of 'mkdir'.
>
>>> +     done &&
>>> +     for i in $(seq 1 100)
>>> +     do
>>> +             cp -r 500_sub_dirs 50000_sub_dirs/dir$i
>
> Ditto:
>
>     cp -r 500_sub_dirs 50000_sub_dirs/dir$i || return $?
>
>>> +     done
>>> +'
>>> +
>>> +test_perf 'clean many untracked sub dirs, check for nested git' '
>>> +     rm -rf clean_test_dir/50000_sub_dirs_cpy &&
>>> +     cp -r 50000_sub_dirs clean_test_dir/50000_sub_dirs_cpy &&
>>> +     git clean -q -f -d  clean_test_dir/ &&
>>> +     test 0 = $(ls -A clean_test_dir/ | wc -l)
>>
>> Is the "ls -A" portable on all systems:
>> http://pubs.opengroup.org/onlinepubs/009695399/utilities/ls.html
>>
>> My feeling is that the "emptiness" of a directory can by tested by simply removing it:
>>> +     git clean -q -f -d  clean_test_dir/ &&
>>> +     rmdir clean_test_dir
>> (And similar below)
>
> There's also the test_dir_is_empty() function which makes the intent
> even clearer than 'rmdir'.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
  2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
  2015-04-06 22:06   ` Eric Sunshine
@ 2015-04-07 19:40   ` Eric Sunshine
  2015-04-07 19:53     ` Torsten Bögershausen
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Sunshine @ 2015-04-07 19:40 UTC (permalink / raw)
  To: Erik Elfström; +Cc: Git List

On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
> ---
> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
> index 99be5d9..cfdf6d4 100755
> --- a/t/t7300-clean.sh
> +++ b/t/t7300-clean.sh
> @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
> +test_expect_success 'giving path in nested git work tree will remove it' '
> +       rm -fr foo &&
> +       mkdir foo &&
> +       (
> +               cd foo &&
> +               git init &&
> +               mkdir -p bar/baz &&
> +               cd bar/baz &&
> +               >hello.world

In my earlier review, I utterly forgot to  mention the broken &&-chain
here and throughout the patch.

> +               git add . &&
> +               git commit -a -m nested
> +       ) &&
> +       git clean -f -d foo/bar/baz &&
> +       test -f foo/.git/HEAD &&
> +       test -d foo/bar/ &&
> +       ! test -d foo/bar/baz
> +'

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] t7300: add tests to document behavior of clean and nested git
  2015-04-07 19:40   ` Eric Sunshine
@ 2015-04-07 19:53     ` Torsten Bögershausen
  0 siblings, 0 replies; 14+ messages in thread
From: Torsten Bögershausen @ 2015-04-07 19:53 UTC (permalink / raw)
  To: Erik Elfström; +Cc: Git List

On 2015-04-07 21.40, Eric Sunshine wrote:
> On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
>> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
>> ---
>> diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
>> index 99be5d9..cfdf6d4 100755
>> --- a/t/t7300-clean.sh
>> +++ b/t/t7300-clean.sh
>> @@ -455,6 +455,88 @@ test_expect_success 'nested git work tree' '
>> +test_expect_success 'giving path in nested git work tree will remove it' '
>> +       rm -fr foo &&
>> +       mkdir foo &&
>> +       (
>> +               cd foo &&
>> +               git init &&
>> +               mkdir -p bar/baz &&
>> +               cd bar/baz &&
>> +               >hello.world
> 
> In my earlier review, I utterly forgot to  mention the broken &&-chain
> here and throughout the patch.
> 
>> +               git add . &&
>> +               git commit -a -m nested
>> +       ) &&
Beside that, all "cd" commands should be done within an own sub-shell.
In other words, something like this:
		mkdir -p bar/baz &&
		(
               		cd bar/baz &&
               		>hello.world &&
               		git add . &&
               		git commit -a -m nested
		)
Side note:
Needed to drop Eric:
An error occurred while sending mail. The mail server responded:  Requested action not taken: mailbox unavailable
invalid DNS MX or A/AAAA resource record. Please check the message recipient sunshine@sunshineco.com and try again.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] clean: improve performance when removing lots of directories
  2015-04-06 22:10   ` Eric Sunshine
@ 2015-04-07 19:55     ` erik elfström
  2015-04-08 21:29       ` Eric Sunshine
  0 siblings, 1 reply; 14+ messages in thread
From: erik elfström @ 2015-04-07 19:55 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List

On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
>> Before this change, clean used resolve_gitlink_ref to check for the
>> presence of nested git repositories. This had the drawback of creating
>> a ref_cache entry for every directory that should potentially be
>> cleaned. The linear search through the ref_cache list caused a massive
>> performance hit for large number of directories.
>>
>> Teach clean.c:remove_dirs to use setup.c:is_git_directory
>> instead. is_git_directory will actually open HEAD and parse the HEAD
>> ref but this implies a nested git repository and should be rare when
>> cleaning.
>>
>> Using is_git_directory should give a more standardized check for what
>> is and what isn't a git repository but also gives a slight behavioral
>> change. We will now detect and respect bare and empty nested git
>> repositories (only init run). Update t7300 to reflect this.
>>
>> The time to clean an untracked directory containing 100000 sub
>> directories went from 61s to 1.7s after this change.
>
> Impressive.
>
>> Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
>> Helped-by: Jeff King <peff@peff.net>
>
> It is customary for your sign-off to be last.
>
> More below...
>
>> ---
>> diff --git a/builtin/clean.c b/builtin/clean.c
>> index 98c103f..e951bd9 100644
>> --- a/builtin/clean.c
>> +++ b/builtin/clean.c
>> @@ -148,6 +147,24 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
>>         return 0;
>>  }
>>
>> +static int is_git_repository(struct strbuf *path)
>> +{
>> +       int ret = 0;
>> +       if (is_git_directory(path->buf))
>> +               ret = 1;
>> +       else {
>> +               int orig_path_len = path->len;
>> +               if (path->buf[orig_path_len - 1] != '/')
>
> Minor: I don't know how others feel about it, but I always find it a
> bit disturbing to see a potential negative array access without a
> safety check that orig_path_len is not 0, either directly in the
> conditional or as a documenting assert().
>


I think I would prefer to accept empty input and return false rather
than assert. What to you think about:

static int is_git_repository(struct strbuf *path)
{
    int ret = 0;
    size_t orig_path_len = path->len;
    if (orig_path_len == 0)
        ret = 0;
    else if (is_git_directory(path->buf))
        ret = 1;
    else {
        if (path->buf[orig_path_len - 1] != '/')
            strbuf_addch(path, '/');
        strbuf_addstr(path, ".git");
        if (is_git_directory(path->buf))
            ret = 1;
        strbuf_setlen(path, orig_path_len);
    }

    return ret;
}


Also I borrowed this pattern from remove_dirs and it has the same
problem. Should I add something like this as a separate commit?

diff --git a/builtin/clean.c b/builtin/clean.c
index ccffd8a..88850e3 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -173,7 +173,8 @@ static int remove_dirs(struct strbuf *path, const
char *prefix, int force_flag,
        DIR *dir;
        struct strbuf quoted = STRBUF_INIT;
        struct dirent *e;
-       int res = 0, ret = 0, gone = 1, original_len = path->len, len;
+       int res = 0, ret = 0, gone = 1;
+       size_t original_len = path->len, len;
        struct string_list dels = STRING_LIST_INIT_DUP;

        *dir_gone = 1;
@@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const
char *prefix, int force_flag,
                return res;
        }

+       assert(original_len > 0 && "expects non-empty path");
        if (path->buf[original_len - 1] != '/')
                strbuf_addch(path, '/');


>> +                       strbuf_addch(path, '/');
>> +               strbuf_addstr(path, ".git");
>> +               if (is_git_directory(path->buf))
>> +                       ret = 1;
>> +               strbuf_setlen(path, orig_path_len);
>> +       }
>> +
>> +       return ret;
>> +}
>> +
>>  static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
>>                 int dry_run, int quiet, int *dir_gone)
>>  {

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] clean: improve performance when removing lots of directories
  2015-04-07 19:55     ` erik elfström
@ 2015-04-08 21:29       ` Eric Sunshine
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Sunshine @ 2015-04-08 21:29 UTC (permalink / raw)
  To: erik elfström; +Cc: Git List

On Tue, Apr 7, 2015 at 3:55 PM, erik elfström <erik.elfstrom@gmail.com> wrote:
> On Tue, Apr 7, 2015 at 12:10 AM, Eric Sunshine <sunshine@sunshineco.com> wrote:
>> On Mon, Apr 6, 2015 at 7:48 AM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
>>> diff --git a/builtin/clean.c b/builtin/clean.c
>>> index 98c103f..e951bd9 100644
>>> --- a/builtin/clean.c
>>> +++ b/builtin/clean.c
>>> +static int is_git_repository(struct strbuf *path)
>>> +{
>>> +       int ret = 0;
>>> +       if (is_git_directory(path->buf))
>>> +               ret = 1;
>>> +       else {
>>> +               int orig_path_len = path->len;
>>> +               if (path->buf[orig_path_len - 1] != '/')
>>
>> Minor: I don't know how others feel about it, but I always find it a
>> bit disturbing to see a potential negative array access without a
>> safety check that orig_path_len is not 0, either directly in the
>> conditional or as a documenting assert().
>
> I think I would prefer to accept empty input and return false rather
> than assert. What to you think about:
>
> static int is_git_repository(struct strbuf *path)
> {
>     int ret = 0;
>     size_t orig_path_len = path->len;
>     if (orig_path_len == 0)
>         ret = 0;

My concern in raising the issue is that someone reviewing the patch or
reading the code later won't necessarily know whether you took the
potential negative array access into account and dismissed it as
"can't happen", or if you overlooked the possibility altogether. Had
there been an explicit check in the code (either assert() or other
special handling such as returning 'false'), a comment in the code, or
mention in the commit message, then it would have been clear that you
took the case into consideration, and I wouldn't have worried about
it.

As for the this proposed version of is_git_repository(), I don't have
strong feelings, and can formulate arguments either way. If it doesn't
make sense for is_git_repository() ever to be called with empty input,
then assert() may be the better choice for documenting that fact.
However, if you foresee some need for allowing empty input, or if you
audited the functionality and found that it can already be called with
empty input, then returning 'false' makes sense. Use your best
judgment.

>     else if (is_git_directory(path->buf))
>         ret = 1;
>     else {
>         if (path->buf[orig_path_len - 1] != '/')
>             strbuf_addch(path, '/');
>         strbuf_addstr(path, ".git");
>         if (is_git_directory(path->buf))
>             ret = 1;
>         strbuf_setlen(path, orig_path_len);
>     }
>
>     return ret;
> }
>
>
> Also I borrowed this pattern from remove_dirs and it has the same
> problem. Should I add something like this as a separate commit?
>
> diff --git a/builtin/clean.c b/builtin/clean.c
> index ccffd8a..88850e3 100644
> --- a/builtin/clean.c
> +++ b/builtin/clean.c
> @@ -201,6 +202,7 @@ static int remove_dirs(struct strbuf *path, const
> char *prefix, int force_flag,
>                 return res;
>         }
>
> +       assert(original_len > 0 && "expects non-empty path");
>         if (path->buf[original_len - 1] != '/')
>                 strbuf_addch(path, '/');

I personally wouldn't mind such a patch. (I'm not sure that the string
within the assert() adds much value, and it's a not-much-used idiom
within the Git source.)

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-04-08 21:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-06 11:48 [PATCH 0/3] Improving performance of git clean Erik Elfström
2015-04-06 11:48 ` [PATCH 1/3] t7300: add tests to document behavior of clean and nested git Erik Elfström
2015-04-06 22:06   ` Eric Sunshine
2015-04-07 19:27     ` erik elfström
2015-04-07 19:40   ` Eric Sunshine
2015-04-07 19:53     ` Torsten Bögershausen
2015-04-06 11:48 ` [PATCH 2/3] p7300: added performance tests for clean Erik Elfström
2015-04-06 20:40   ` Torsten Bögershausen
2015-04-06 22:09     ` Eric Sunshine
2015-04-07 19:35       ` erik elfström
2015-04-06 11:48 ` [PATCH 3/3] clean: improve performance when removing lots of directories Erik Elfström
2015-04-06 22:10   ` Eric Sunshine
2015-04-07 19:55     ` erik elfström
2015-04-08 21:29       ` Eric Sunshine

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.