* [PATCH v6 0/7] Improving performance of git clean
@ 2015-05-10 20:00 Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 1/7] setup: add gentle version of is_git_directory Erik Elfström
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Here is v6 of this series. v5 can be found at:

http://thread.gmane.org/gmane.comp.version-control.git/267823

Sorry for the slow progress on this, I've been busy with other things.

Changes in v6:
* added gentle version of is_git_directory and used it in
  read_gitfile_gently
* use 1MB as size limit for read_gitfile_gently instead of
  PATH_MAX*4
* fixed file descriptor leak in read_gitfile_gently
* avoid cleaning if we can't open, read or validate the path in a git
  file (we used to die on these cases).
* added one more test case to cover the behavior mentioned above.
* switched to default repo in performance test


Erik Elfström (7):
  setup: add gentle version of is_git_directory
  setup: add gentle version of read_gitfile
  setup: sanity check file size in read_gitfile_gently
  t7300: add tests to document behavior of clean and nested git
  p7300: add performance tests for clean
  clean: improve performance when removing lots of directories
  RFC: Change error handling scheme in read_gitfile_gently

 builtin/clean.c       |  32 ++++++++--
 cache.h               |  16 +++++
 setup.c               | 158 +++++++++++++++++++++++++++++++++++++++++++-------
 t/perf/p7300-clean.sh |  31 ++++++++++
 t/t7300-clean.sh      | 144 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 355 insertions(+), 26 deletions(-)
 create mode 100755 t/perf/p7300-clean.sh

-- 
2.4.0.60.gf7143f7

* [PATCH v6 1/7] setup: add gentle version of is_git_directory
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 2/7] setup: add gentle version of read_gitfile Erik Elfström
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

This is a prerequisite for implementing a gentle version of
read_gitfile.

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 cache.h |  4 ++++
 setup.c | 44 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index b34447f..dd67695 100644
--- a/cache.h
+++ b/cache.h
@@ -431,7 +431,11 @@ extern int is_inside_git_dir(void);
 extern char *git_work_tree_cfg;
 extern int is_inside_work_tree(void);
 extern const char *get_git_dir(void);
+
+#define IS_GIT_DIRECTORY_ERR_PATH_TOO_LONG 1
+extern int is_git_directory_gently(const char *path, int *return_err, struct strbuf *err_msg);
 extern int is_git_directory(const char *path);
+
 extern char *get_object_directory(void);
 extern char *get_index_file(void);
 extern char *get_graft_file(void);
diff --git a/setup.c b/setup.c
index 979b13f..62ee88c 100644
--- a/setup.c
+++ b/setup.c
@@ -224,6 +224,18 @@ void verify_non_filename(const char *prefix, const char *arg)
 	    "'git <command> [<revision>...] -- [<file>...]'", arg);
 }
 
+__attribute((format (printf,4,5)))
+static void set_error(int *return_err, struct strbuf *err_msg, int err,
+		      const char *msg, ...)
+{
+	va_list params;
+	va_start(params, msg);
+	if (err_msg)
+		strbuf_vaddf(err_msg, msg, params);
+	va_end(params);
+	if (return_err)
+		*return_err = err;
+}
 
 /*
  * Test if it looks like we're at a git directory.
@@ -235,14 +247,28 @@ void verify_non_filename(const char *prefix, const char *arg)
  *  - either a HEAD symlink or a HEAD file that is formatted as
  *    a proper "ref:", or a regular file HEAD that has a properly
  *    formatted sha1 object name.
+ *
+ * In the event of an error, return_err will be set to an error code
+ * and err_msg will be set to an error message describing the error
+ * and 0 will be returned. If no error reporting is required, pass
+ * NULL for return_err and/or err_msg.
  */
-int is_git_directory(const char *suspect)
+int is_git_directory_gently(const char *suspect, int *return_err,
+			    struct strbuf *err_msg)
 {
 	char path[PATH_MAX];
 	size_t len = strlen(suspect);
 
-	if (PATH_MAX <= len + strlen("/objects"))
-		die("Too long path: %.*s", 60, suspect);
+	if (return_err)
+		*return_err = 0;
+
+	if (PATH_MAX <= len + strlen("/objects")) {
+		set_error(return_err, err_msg,
+			  IS_GIT_DIRECTORY_ERR_PATH_TOO_LONG,
+			  "Too long path: %.*s", 60, suspect);
+		return 0;
+	}
+
 	strcpy(path, suspect);
 	if (getenv(DB_ENVIRONMENT)) {
 		if (access(getenv(DB_ENVIRONMENT), X_OK))
@@ -265,6 +291,18 @@ int is_git_directory(const char *suspect)
 	return 1;
 }
 
+int is_git_directory(const char *suspect)
+{
+	int err;
+	int ret;
+	struct strbuf err_msg = STRBUF_INIT;
+	ret = is_git_directory_gently(suspect, &err, &err_msg);
+	if (err)
+		die("%s", err_msg.buf);
+	/* No need to free err_msg, will only be touched in case of error */
+	return ret;
+}
+
 int is_inside_git_dir(void)
 {
 	if (inside_git_dir < 0)
-- 
2.4.0.60.gf7143f7

* [PATCH v6 2/7] setup: add gentle version of read_gitfile
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 1/7] setup: add gentle version of is_git_directory Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently Erik Elfström
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

read_gitfile will die on most error cases. This makes it unsuitable
for speculative calls. Extract the core logic and provide a gentle
version that returns NULL on failure.

The first use case of the new gentle version will be to probe for
submodules during git clean.
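
As a rough illustration of the "gentle" idea (not the code in this
patch), the sketch below splits a parser into a variant that reports
failure through an error code and a strict wrapper that exits. The
names parse_gitdir_line_gently, parse_gitdir_line and enum parse_err
are hypothetical and exist only for this example; it also skips the
file reading and repository checks that read_gitfile_gently performs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error codes, loosely modeled on READ_GITFILE_ERR_*. */
enum parse_err {
	PARSE_OK = 0,
	PARSE_ERR_INVALID_FORMAT = 1,
	PARSE_ERR_NO_PATH = 2
};

/*
 * Gentle variant: never exits. Returns a pointer to the text that
 * follows the "gitdir: " prefix, or NULL on failure. The reason is
 * stored in *err when err is not NULL.
 */
static const char *parse_gitdir_line_gently(const char *line,
					    enum parse_err *err)
{
	static const char prefix[] = "gitdir: ";
	size_t prefix_len = sizeof(prefix) - 1;
	enum parse_err code = PARSE_OK;
	const char *result = NULL;

	assert(line != NULL);
	if (strncmp(line, prefix, prefix_len) != 0)
		code = PARSE_ERR_INVALID_FORMAT;
	else if (line[prefix_len] == '\0' || line[prefix_len] == '\n')
		code = PARSE_ERR_NO_PATH;
	else
		result = line + prefix_len;

	if (err)
		*err = code;
	return result;
}

/* Strict variant: same parsing, but any error terminates the process. */
static const char *parse_gitdir_line(const char *line)
{
	enum parse_err err;
	const char *path = parse_gitdir_line_gently(line, &err);

	if (err != PARSE_OK) {
		fprintf(stderr, "fatal: bad gitdir line (error %d)\n", (int)err);
		exit(128);
	}
	return path;
}
```

A speculative caller would use the gentle variant and inspect the
error code, while callers that cannot continue on failure keep using
the strict one.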

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 cache.h | 12 ++++++++-
 setup.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/cache.h b/cache.h
index dd67695..54c902b 100644
--- a/cache.h
+++ b/cache.h
@@ -443,7 +443,17 @@ extern int set_git_dir(const char *path);
 extern const char *get_git_namespace(void);
 extern const char *strip_namespace(const char *namespaced_ref);
 extern const char *get_git_work_tree(void);
-extern const char *read_gitfile(const char *path);
+
+#define READ_GITFILE_ERR_STAT_FAILED 1
+#define READ_GITFILE_ERR_NOT_A_FILE 2
+#define READ_GITFILE_ERR_OPEN_FAILED 3
+#define READ_GITFILE_ERR_READ_FAILED 4
+#define READ_GITFILE_ERR_INVALID_FORMAT 5
+#define READ_GITFILE_ERR_NO_PATH 6
+#define READ_GITFILE_ERR_CANT_VERIFY_PATH 7
+#define READ_GITFILE_ERR_NOT_A_REPO 8
+extern const char *read_gitfile_gently(const char *path, int *return_error_code);
+#define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index 62ee88c..b919ea6 100644
--- a/setup.c
+++ b/setup.c
@@ -373,35 +373,55 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 /*
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
+ *
+ * On failure, if return_error_code is not NULL, return_error_code
+ * will be set to an error code and NULL will be returned. If
+ * return_error_code is NULL the function will die instead (for most
+ * cases).
  */
-const char *read_gitfile(const char *path)
+const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
-	char *buf;
-	char *dir;
+	int error_code = 0;
+	char *buf = NULL;
+	char *dir = NULL;
 	const char *slash;
 	struct stat st;
 	int fd;
 	ssize_t len;
+	int is_git_dir;
+	struct strbuf err_msg = STRBUF_INIT;
 
-	if (stat(path, &st))
-		return NULL;
-	if (!S_ISREG(st.st_mode))
-		return NULL;
+	if (stat(path, &st)) {
+		error_code = READ_GITFILE_ERR_STAT_FAILED;
+		goto cleanup_return;
+	}
+	if (!S_ISREG(st.st_mode)) {
+		error_code = READ_GITFILE_ERR_NOT_A_FILE;
+		goto cleanup_return;
+	}
 	fd = open(path, O_RDONLY);
-	if (fd < 0)
-		die_errno("Error opening '%s'", path);
+	if (fd < 0) {
+		error_code = READ_GITFILE_ERR_OPEN_FAILED;
+		goto cleanup_return;
+	}
 	buf = xmalloc(st.st_size + 1);
 	len = read_in_full(fd, buf, st.st_size);
 	close(fd);
-	if (len != st.st_size)
-		die("Error reading %s", path);
+	if (len != st.st_size) {
+		error_code = READ_GITFILE_ERR_READ_FAILED;
+		goto cleanup_return;
+	}
 	buf[len] = '\0';
-	if (!starts_with(buf, "gitdir: "))
-		die("Invalid gitfile format: %s", path);
+	if (!starts_with(buf, "gitdir: ")) {
+		error_code = READ_GITFILE_ERR_INVALID_FORMAT;
+		goto cleanup_return;
+	}
 	while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
 		len--;
-	if (len < 9)
-		die("No path in gitfile: %s", path);
+	if (len < 9) {
+		error_code = READ_GITFILE_ERR_NO_PATH;
+		goto cleanup_return;
+	}
 	buf[len] = '\0';
 	dir = buf + 8;
 
@@ -416,11 +436,49 @@ const char *read_gitfile(const char *path)
 		buf = dir;
 	}
 
-	if (!is_git_directory(dir))
-		die("Not a git repository: %s", dir);
+	is_git_dir = is_git_directory_gently(dir, &error_code, &err_msg);
+	if (error_code) {
+		error_code = READ_GITFILE_ERR_CANT_VERIFY_PATH;
+		goto cleanup_return;
+	}
+	if (!is_git_dir) {
+		error_code = READ_GITFILE_ERR_NOT_A_REPO;
+		goto cleanup_return;
+	}
 	path = real_path(dir);
 
+cleanup_return:
 	free(buf);
+
+	if (return_error_code)
+		*return_error_code = error_code;
+
+	if (error_code) {
+		if (return_error_code)
+			return NULL;
+
+		switch (error_code) {
+		case READ_GITFILE_ERR_STAT_FAILED:
+		case READ_GITFILE_ERR_NOT_A_FILE:
+			return NULL;
+		case READ_GITFILE_ERR_OPEN_FAILED:
+			die_errno("Error opening '%s'", path);
+		case READ_GITFILE_ERR_READ_FAILED:
+			die("Error reading %s", path);
+		case READ_GITFILE_ERR_INVALID_FORMAT:
+			die("Invalid gitfile format: %s", path);
+		case READ_GITFILE_ERR_NO_PATH:
+			die("No path in gitfile: %s", path);
+		case READ_GITFILE_ERR_CANT_VERIFY_PATH:
+			die("%s", err_msg.buf);
+		case READ_GITFILE_ERR_NOT_A_REPO:
+			die("Not a git repository: %s", dir);
+		default:
+			assert(0);
+		}
+	}
+
+	strbuf_release(&err_msg);
 	return path;
 }
 
-- 
2.4.0.60.gf7143f7

* [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 1/7] setup: add gentle version of is_git_directory Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 2/7] setup: add gentle version of read_gitfile Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-12  6:46   ` erik elfström
  2015-05-10 20:00 ` [PATCH v6 4/7] t7300: add tests to document behavior of clean and nested git Erik Elfström
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

read_gitfile_gently will allocate a buffer to fit the entire file that
should be read. Add a sanity check of the file size before opening to
avoid allocating a potentially huge amount of memory if we come across
a large file that someone happened to name ".git". The limit is set to
a sufficiently unreasonable size that should never be exceeded by a
genuine .git file.
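
The check described above can be sketched in isolation as follows.
This is a minimal standalone example, not the patch itself: the
helper read_small_file and the macro MAX_GITFILE_SIZE are
hypothetical names, and plain stdio is used here instead of git's
xmalloc and read_in_full.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define MAX_GITFILE_SIZE (1 << 20) /* 1 MB, the limit used in this series */

/*
 * Hypothetical helper: return the malloc'd, NUL-terminated contents
 * of path, or NULL if the file cannot be stat'ed, is not a regular
 * file, is larger than limit, or cannot be read in full.
 */
static char *read_small_file(const char *path, size_t limit)
{
	struct stat st;
	FILE *fp;
	char *buf;
	size_t size;

	assert(path != NULL);
	if (stat(path, &st) || !S_ISREG(st.st_mode))
		return NULL;
	if (st.st_size < 0 || (unsigned long long)st.st_size > limit)
		return NULL; /* refuse before allocating anything */
	size = (size_t)st.st_size;

	fp = fopen(path, "rb");
	if (!fp)
		return NULL;
	buf = malloc(size + 1);
	if (buf && fread(buf, 1, size, fp) != size) {
		free(buf);
		buf = NULL;
	}
	fclose(fp);
	if (buf)
		buf[size] = '\0';
	return buf;
}
```

The point of ordering the checks this way is that the allocation
size is bounded by the limit, no matter how large the file named
".git" turns out to be.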

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 cache.h | 1 +
 setup.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/cache.h b/cache.h
index 54c902b..7c8abcb 100644
--- a/cache.h
+++ b/cache.h
@@ -452,6 +452,7 @@ extern const char *get_git_work_tree(void);
 #define READ_GITFILE_ERR_NO_PATH 6
 #define READ_GITFILE_ERR_CANT_VERIFY_PATH 7
 #define READ_GITFILE_ERR_NOT_A_REPO 8
+#define READ_GITFILE_ERR_TOO_LARGE 9
 extern const char *read_gitfile_gently(const char *path, int *return_error_code);
 #define read_gitfile(path) read_gitfile_gently((path), NULL)
 extern const char *resolve_gitdir(const char *suspect);
diff --git a/setup.c b/setup.c
index b919ea6..bfaf4a6 100644
--- a/setup.c
+++ b/setup.c
@@ -381,6 +381,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
  */
 const char *read_gitfile_gently(const char *path, int *return_error_code)
 {
+	static const int one_MB = 1 << 20;
 	int error_code = 0;
 	char *buf = NULL;
 	char *dir = NULL;
@@ -404,6 +405,11 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 		error_code = READ_GITFILE_ERR_OPEN_FAILED;
 		goto cleanup_return;
 	}
+	if (st.st_size > one_MB) {
+		close(fd);
+		error_code = READ_GITFILE_ERR_TOO_LARGE;
+		goto cleanup_return;
+	}
 	buf = xmalloc(st.st_size + 1);
 	len = read_in_full(fd, buf, st.st_size);
 	close(fd);
@@ -463,6 +469,8 @@ cleanup_return:
 			return NULL;
 		case READ_GITFILE_ERR_OPEN_FAILED:
 			die_errno("Error opening '%s'", path);
+		case READ_GITFILE_ERR_TOO_LARGE:
+			die("Too large to be a .git file: '%s'", path);
 		case READ_GITFILE_ERR_READ_FAILED:
 			die("Error reading %s", path);
 		case READ_GITFILE_ERR_INVALID_FORMAT:
-- 
2.4.0.60.gf7143f7

* [PATCH v6 4/7] t7300: add tests to document behavior of clean and nested git
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
                   ` (2 preceding siblings ...)
  2015-05-10 20:00 ` [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 5/7] p7300: add performance tests for clean Erik Elfström
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 t/t7300-clean.sh | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 99be5d9..23962e4 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,6 +455,152 @@ test_expect_success 'nested git work tree' '
 	! test -d bar
 '
 
+test_expect_failure 'should clean things that almost look like git but are not' '
+	rm -fr almost_git almost_bare_git almost_submodule &&
+	mkdir -p almost_git/.git/objects &&
+	mkdir -p almost_git/.git/refs &&
+	cat >almost_git/.git/HEAD <<-\EOF &&
+	garbage
+	EOF
+	cp -r almost_git/.git/ almost_bare_git &&
+	mkdir almost_submodule/ &&
+	cat >almost_submodule/.git <<-\EOF &&
+	garbage
+	EOF
+	test_when_finished "rm -rf almost_*" &&
+	## This will fail due to die("Invalid gitfile format: %s", path); in
+	## setup.c:read_gitfile.
+	git clean -f -d &&
+	test_path_is_missing almost_git &&
+	test_path_is_missing almost_bare_git &&
+	test_path_is_missing almost_submodule
+'
+
+test_expect_success 'should not clean submodules' '
+	rm -fr repo to_clean sub1 sub2 &&
+	mkdir repo to_clean &&
+	(
+		cd repo &&
+		git init &&
+		test_commit msg hello.world
+	) &&
+	git submodule add ./repo/.git sub1 &&
+	git commit -m "sub1" &&
+	git branch before_sub2 &&
+	git submodule add ./repo/.git sub2 &&
+	git commit -m "sub2" &&
+	git checkout before_sub2 &&
+	>to_clean/should_clean.this &&
+	git clean -f -d &&
+	test_path_is_file repo/.git/index &&
+	test_path_is_file repo/hello.world &&
+	test_path_is_file sub1/.git &&
+	test_path_is_file sub1/hello.world &&
+	test_path_is_file sub2/.git &&
+	test_path_is_file sub2/hello.world &&
+	test_path_is_missing to_clean
+'
+
+test_expect_failure 'should avoid cleaning possible submodules' '
+	rm -fr to_clean possible_sub1 possible_sub2 &&
+	mkdir to_clean possible_sub1 &&
+	test_when_finished "rm -rf possible_sub*" &&
+	echo "gitdir: foo" > possible_sub1/.git &&
+	>possible_sub1/hello.world &&
+	cp -r possible_sub1 possible_sub2 &&
+	printf "%*s\n" 5000 | tr " " a >> possible_sub1/.git &&
+	chmod 0 possible_sub2/.git &&
+	>to_clean/should_clean.this &&
+	git clean -f -d &&
+	test_path_is_file possible_sub1/.git &&
+	test_path_is_file possible_sub1/hello.world &&
+	test_path_is_file possible_sub2/.git &&
+	test_path_is_file possible_sub2/hello.world &&
+	test_path_is_missing to_clean
+'
+
+test_expect_failure 'nested (empty) git should be kept' '
+	rm -fr empty_repo to_clean &&
+	git init empty_repo &&
+	mkdir to_clean &&
+	>to_clean/should_clean.this &&
+	git clean -f -d &&
+	test_path_is_file empty_repo/.git/HEAD &&
+	test_path_is_missing to_clean
+'
+
+test_expect_success 'nested bare repositories should be cleaned' '
+	rm -fr bare1 bare2 subdir &&
+	git init --bare bare1 &&
+	git clone --local --bare . bare2 &&
+	mkdir subdir &&
+	cp -r bare2 subdir/bare3 &&
+	git clean -f -d &&
+	test_path_is_missing bare1 &&
+	test_path_is_missing bare2 &&
+	test_path_is_missing subdir
+'
+
+test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' '
+	rm -fr strange_bare &&
+	mkdir strange_bare &&
+	git init --bare strange_bare/.git &&
+	git clean -f -d &&
+	test_path_is_missing strange_bare
+'
+
+test_expect_failure 'nested (non-empty) bare repositories should be cleaned even when in .git' '
+	rm -fr strange_bare &&
+	mkdir strange_bare &&
+	git clone --local --bare . strange_bare/.git &&
+	git clean -f -d &&
+	test_path_is_missing strange_bare
+'
+
+test_expect_success 'giving path in nested git work tree will remove it' '
+	rm -fr repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+		mkdir -p bar/baz &&
+		test_commit msg bar/baz/hello.world
+	) &&
+	git clean -f -d repo/bar/baz &&
+	test_path_is_file repo/.git/HEAD &&
+	test_path_is_dir repo/bar/ &&
+	test_path_is_missing repo/bar/baz
+'
+
+test_expect_success 'giving path to nested .git will not remove it' '
+	rm -fr repo &&
+	mkdir repo untracked &&
+	(
+		cd repo &&
+		git init &&
+		test_commit msg hello.world
+	) &&
+	git clean -f -d repo/.git &&
+	test_path_is_file repo/.git/HEAD &&
+	test_path_is_dir repo/.git/refs &&
+	test_path_is_dir repo/.git/objects &&
+	test_path_is_dir untracked/
+'
+
+test_expect_success 'giving path to nested .git/ will remove contents' '
+	rm -fr repo untracked &&
+	mkdir repo untracked &&
+	(
+		cd repo &&
+		git init &&
+		test_commit msg hello.world
+	) &&
+	git clean -f -d repo/.git/ &&
+	test_path_is_dir repo/.git &&
+	test_dir_is_empty repo/.git &&
+	test_path_is_dir untracked/
+'
+
 test_expect_success 'force removal of nested git work tree' '
 	rm -fr foo bar baz &&
 	mkdir -p foo bar baz/boo &&
-- 
2.4.0.60.gf7143f7

* [PATCH v6 5/7] p7300: add performance tests for clean
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
                   ` (3 preceding siblings ...)
  2015-05-10 20:00 ` [PATCH v6 4/7] t7300: add tests to document behavior of clean and nested git Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 6/7] clean: improve performance when removing lots of directories Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 7/7] RFC: Change error handling scheme in read_gitfile_gently Erik Elfström
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

The tests are run in dry-run mode to avoid having to restore the test
directories for each timed iteration. Using dry-run is an acceptable
compromise since we are mostly interested in the initial computation
of what to clean and not so much in the cleaning itself.

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 t/perf/p7300-clean.sh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100755 t/perf/p7300-clean.sh

diff --git a/t/perf/p7300-clean.sh b/t/perf/p7300-clean.sh
new file mode 100755
index 0000000..ec94cdd
--- /dev/null
+++ b/t/perf/p7300-clean.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description="Test git-clean performance"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+test_checkout_worktree
+
+test_expect_success 'setup untracked directory with many sub dirs' '
+	rm -rf 500_sub_dirs 100000_sub_dirs clean_test_dir &&
+	mkdir 500_sub_dirs 100000_sub_dirs clean_test_dir &&
+	for i in $(test_seq 1 500)
+	do
+		mkdir 500_sub_dirs/dir$i || return $?
+	done &&
+	for i in $(test_seq 1 200)
+	do
+		cp -r 500_sub_dirs 100000_sub_dirs/dir$i || return $?
+	done
+'
+
+test_perf 'clean many untracked sub dirs, check for nested git' '
+	git clean -n -q -f -d 100000_sub_dirs/
+'
+
+test_perf 'clean many untracked sub dirs, ignore nested git' '
+	git clean -n -q -f -f -d 100000_sub_dirs/
+'
+
+test_done
-- 
2.4.0.60.gf7143f7

* [PATCH v6 6/7] clean: improve performance when removing lots of directories
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
                   ` (4 preceding siblings ...)
  2015-05-10 20:00 ` [PATCH v6 5/7] p7300: add performance tests for clean Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  2015-05-10 20:00 ` [PATCH v6 7/7] RFC: Change error handling scheme in read_gitfile_gently Erik Elfström
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

"git clean" uses resolve_gitlink_ref() to check for the presence of
nested git repositories, but it has the drawback of creating a
ref_cache entry for every directory that should potentially be
cleaned. The linear search through the ref_cache list causes a massive
performance hit for a large number of directories.
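
To see why a per-directory cache entry combined with a linear lookup
hurts, consider the toy model below. It is not git's ref_cache code;
struct toy_cache, toy_get_cache and toy_probe_dirs are hypothetical
names used only to count string comparisons when every probed
directory adds one entry to a singly linked list.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a cache entry keyed by directory name. */
struct toy_cache {
	struct toy_cache *next;
	char name[32];
};

static struct toy_cache *toy_list;
static unsigned long toy_comparisons;

/* Look up name with a linear search; create the entry on a miss. */
static struct toy_cache *toy_get_cache(const char *name)
{
	struct toy_cache *c;

	assert(strlen(name) < sizeof(c->name));
	for (c = toy_list; c; c = c->next) {
		toy_comparisons++;
		if (!strcmp(c->name, name))
			return c;
	}
	c = calloc(1, sizeof(*c));
	if (!c)
		abort();
	strcpy(c->name, name);
	c->next = toy_list;
	toy_list = c;
	return c;
}

/*
 * Probe n distinct directory names and return the number of string
 * comparisons performed. Entries are never freed in this toy model.
 */
static unsigned long toy_probe_dirs(int n)
{
	char name[32];
	int i;

	toy_comparisons = 0;
	for (i = 0; i < n; i++) {
		snprintf(name, sizeof(name), "dir%d", i);
		toy_get_cache(name);
	}
	return toy_comparisons;
}
```

Starting from an empty list, probing n distinct names costs
n * (n - 1) / 2 comparisons in this model, so the work grows
quadratically with the number of directories.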

Modify clean.c:remove_dirs to use setup.c:is_git_directory and
setup.c:read_gitfile_gently instead.

Both these functions will open files and parse contents when they find
something that looks like a git repository. This is ok from a
performance standpoint since finding repository candidates should be
comparatively rare.

Using is_git_directory and read_gitfile_gently should give a more
standardized check for what is and what isn't a git repository, but
it also introduces three behavioral changes.

The first change is that we will now detect and avoid cleaning empty
nested git repositories (those where only "git init" has been run).
This is desirable.

Second, we will no longer die when cleaning a file named ".git" with
garbage content (it will be cleaned instead). This is also desirable.

The last change is that we will detect and avoid cleaning empty bare
repositories that have been placed in a directory named ".git". This
is not desirable but should have no real user impact since we already
fail to clean non-empty bare repositories in the same scenario. This
is thus deemed acceptable.

On top of this we add some extra precautions. If read_gitfile_gently
fails to open the git file, read the git file, or verify the path in
the git file, we assume that the path with the git file is a valid
repository and avoid cleaning.
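
The precaution can be expressed as a small classification of the
error codes. The sketch below reuses the READ_GITFILE_ERR_* values
from cache.h in this series; the helper name
gitfile_error_means_keep is hypothetical and does not appear in the
patch, which performs the same comparison inline in
is_git_repository.

```c
#include <assert.h>

/* Error codes as defined in cache.h by patches 2/7 and 3/7. */
#define READ_GITFILE_ERR_STAT_FAILED 1
#define READ_GITFILE_ERR_NOT_A_FILE 2
#define READ_GITFILE_ERR_OPEN_FAILED 3
#define READ_GITFILE_ERR_READ_FAILED 4
#define READ_GITFILE_ERR_INVALID_FORMAT 5
#define READ_GITFILE_ERR_NO_PATH 6
#define READ_GITFILE_ERR_CANT_VERIFY_PATH 7
#define READ_GITFILE_ERR_NOT_A_REPO 8
#define READ_GITFILE_ERR_TOO_LARGE 9

/*
 * Hypothetical helper: return 1 if the error means "we could not
 * tell whether this is a real .git file, so keep the directory",
 * and 0 if the error gives no reason to keep it.
 */
static int gitfile_error_means_keep(int gitfile_error)
{
	assert(gitfile_error >= 0 &&
	       gitfile_error <= READ_GITFILE_ERR_TOO_LARGE);
	switch (gitfile_error) {
	case READ_GITFILE_ERR_OPEN_FAILED:
	case READ_GITFILE_ERR_READ_FAILED:
	case READ_GITFILE_ERR_CANT_VERIFY_PATH:
		return 1;
	default:
		return 0;
	}
}
```

With this split, an unreadable ".git" file is kept, while a ".git"
file with garbage content (invalid format) is treated as ordinary
untracked content and cleaned.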

Update t7300 to reflect these changes in behavior.

The time to clean an untracked directory containing 100000 sub
directories went from 61s to 1.7s after this change.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---
 builtin/clean.c  | 31 +++++++++++++++++++++++++++----
 t/t7300-clean.sh | 10 ++++------
 2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index 98c103f..d739dcf 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -10,7 +10,6 @@
 #include "cache.h"
 #include "dir.h"
 #include "parse-options.h"
-#include "refs.h"
 #include "string-list.h"
 #include "quote.h"
 #include "column.h"
@@ -148,6 +147,32 @@ static int exclude_cb(const struct option *opt, const char *arg, int unset)
 	return 0;
 }
 
+/*
+ * Return 1 if the given path is the root of a git repository or
+ * submodule else 0. Will not return 1 for bare repositories with the
+ * exception of creating a bare repository in "foo/.git" and calling
+ * is_git_repository("foo").
+ */
+static int is_git_repository(struct strbuf *path)
+{
+	int ret = 0;
+	int gitfile_error;
+	size_t orig_path_len = path->len;
+	assert(orig_path_len != 0);
+	if (path->buf[orig_path_len - 1] != '/')
+		strbuf_addch(path, '/');
+	strbuf_addstr(path, ".git");
+	if (read_gitfile_gently(path->buf, &gitfile_error) || is_git_directory(path->buf))
+		ret = 1;
+	if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
+	    gitfile_error == READ_GITFILE_ERR_READ_FAILED ||
+	    gitfile_error == READ_GITFILE_ERR_CANT_VERIFY_PATH)
+		ret = 1;  /* This could be a real .git file, take the
+			   * safe option and avoid cleaning */
+	strbuf_setlen(path, orig_path_len);
+	return ret;
+}
+
 static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 		int dry_run, int quiet, int *dir_gone)
 {
@@ -155,13 +180,11 @@ static int remove_dirs(struct strbuf *path, const char *prefix, int force_flag,
 	struct strbuf quoted = STRBUF_INIT;
 	struct dirent *e;
 	int res = 0, ret = 0, gone = 1, original_len = path->len, len;
-	unsigned char submodule_head[20];
 	struct string_list dels = STRING_LIST_INIT_DUP;
 
 	*dir_gone = 1;
 
-	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
-			!resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
+	if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) && is_git_repository(path)) {
 		if (!quiet) {
 			quote_path_relative(path->buf, prefix, &quoted);
 			printf(dry_run ?  _(msg_would_skip_git_dir) : _(msg_skip_git_dir),
diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 23962e4..fbab888 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -455,7 +455,7 @@ test_expect_success 'nested git work tree' '
 	! test -d bar
 '
 
-test_expect_failure 'should clean things that almost look like git but are not' '
+test_expect_success 'should clean things that almost look like git but are not' '
 	rm -fr almost_git almost_bare_git almost_submodule &&
 	mkdir -p almost_git/.git/objects &&
 	mkdir -p almost_git/.git/refs &&
@@ -468,8 +468,6 @@ test_expect_failure 'should clean things that almost look like git but are not'
 	garbage
 	EOF
 	test_when_finished "rm -rf almost_*" &&
-	## This will fail due to die("Invalid gitfile format: %s", path); in
-	## setup.c:read_gitfile.
 	git clean -f -d &&
 	test_path_is_missing almost_git &&
 	test_path_is_missing almost_bare_git &&
@@ -501,7 +499,7 @@ test_expect_success 'should not clean submodules' '
 	test_path_is_missing to_clean
 '
 
-test_expect_failure 'should avoid cleaning possible submodules' '
+test_expect_success 'should avoid cleaning possible submodules' '
 	rm -fr to_clean possible_sub1 possible_sub2 &&
 	mkdir to_clean possible_sub1 &&
 	test_when_finished "rm -rf possible_sub*" &&
@@ -519,7 +517,7 @@ test_expect_failure 'should avoid cleaning possible submodules' '
 	test_path_is_missing to_clean
 '
 
-test_expect_failure 'nested (empty) git should be kept' '
+test_expect_success 'nested (empty) git should be kept' '
 	rm -fr empty_repo to_clean &&
 	git init empty_repo &&
 	mkdir to_clean &&
@@ -541,7 +539,7 @@ test_expect_success 'nested bare repositories should be cleaned' '
 	test_path_is_missing subdir
 '
 
-test_expect_success 'nested (empty) bare repositories should be cleaned even when in .git' '
+test_expect_failure 'nested (empty) bare repositories should be cleaned even when in .git' '
 	rm -fr strange_bare &&
 	mkdir strange_bare &&
 	git init --bare strange_bare/.git &&
-- 
2.4.0.60.gf7143f7

* [PATCH v6 7/7] RFC: Change error handling scheme in read_gitfile_gently
  2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
                   ` (5 preceding siblings ...)
  2015-05-10 20:00 ` [PATCH v6 6/7] clean: improve performance when removing lots of directories Erik Elfström
@ 2015-05-10 20:00 ` Erik Elfström
  6 siblings, 0 replies; 9+ messages in thread
From: Erik Elfström @ 2015-05-10 20:00 UTC (permalink / raw)
  To: git; +Cc: Erik Elfström

Signed-off-by: Erik Elfström <erik.elfstrom@gmail.com>
---

Since there was a lot of discussion on error reporting strategy on
the previous patch, I have done a quick prototype of the scheme
proposed by Jonathan Nieder.

I believe the conclusion was to NOT go this route but this way people
get to see an example of what it could look like to make the
discussion and decision a bit easier.

I will either drop this patch or split it up and squash it into the
appropriate commits (along with change requests if any) depending on
the outcome of the review discussion.

 builtin/clean.c |   3 +-
 cache.h         |   5 +--
 setup.c         | 106 +++++++++++++++++++++++++++++++-------------------------
 3 files changed, 63 insertions(+), 51 deletions(-)

diff --git a/builtin/clean.c b/builtin/clean.c
index d739dcf..7047d6e 100644
--- a/builtin/clean.c
+++ b/builtin/clean.c
@@ -162,7 +162,8 @@ static int is_git_repository(struct strbuf *path)
 	if (path->buf[orig_path_len - 1] != '/')
 		strbuf_addch(path, '/');
 	strbuf_addstr(path, ".git");
-	if (read_gitfile_gently(path->buf, &gitfile_error) || is_git_directory(path->buf))
+	if (read_gitfile_gently(path->buf, &gitfile_error, NULL) ||
+	    is_git_directory(path->buf))
 		ret = 1;
 	if (gitfile_error == READ_GITFILE_ERR_OPEN_FAILED ||
 	    gitfile_error == READ_GITFILE_ERR_READ_FAILED ||
diff --git a/cache.h b/cache.h
index 7c8abcb..76d311a 100644
--- a/cache.h
+++ b/cache.h
@@ -453,8 +453,9 @@ extern const char *get_git_work_tree(void);
 #define READ_GITFILE_ERR_CANT_VERIFY_PATH 7
 #define READ_GITFILE_ERR_NOT_A_REPO 8
 #define READ_GITFILE_ERR_TOO_LARGE 9
-extern const char *read_gitfile_gently(const char *path, int *return_error_code);
-#define read_gitfile(path) read_gitfile_gently((path), NULL)
+extern const char *read_gitfile_gently(const char *path, int *return_err, struct strbuf *err_msg);
+extern const char *read_gitfile(const char *path);
+
 extern const char *resolve_gitdir(const char *suspect);
 extern void set_git_work_tree(const char *tree);
 
diff --git a/setup.c b/setup.c
index bfaf4a6..49274b3 100644
--- a/setup.c
+++ b/setup.c
@@ -374,15 +374,16 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
  * Try to read the location of the git directory from the .git file,
  * return path to git directory if found.
  *
- * On failure, if return_error_code is not NULL, return_error_code
- * will be set to an error code and NULL will be returned. If
- * return_error_code is NULL the function will die instead (for most
- * cases).
+ * In the event of an error, return_err will be set to an error code
+ * and err_msg will be set to an error message describing the error
+ * and NULL will be returned. If no error reporting is required, pass
+ * NULL for return_err and/or err_msg.
  */
-const char *read_gitfile_gently(const char *path, int *return_error_code)
+const char *read_gitfile_gently(const char *path, int *return_err,
+				struct strbuf *err_msg)
 {
 	static const int one_MB = 1 << 20;
-	int error_code = 0;
+	const char *ret = NULL;
 	char *buf = NULL;
 	char *dir = NULL;
 	const char *slash;
@@ -390,42 +391,59 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 	int fd;
 	ssize_t len;
 	int is_git_dir;
-	struct strbuf err_msg = STRBUF_INIT;
+	int is_git_dir_err;
+
+	if (return_err)
+		*return_err = 0;
 
 	if (stat(path, &st)) {
-		error_code = READ_GITFILE_ERR_STAT_FAILED;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_STAT_FAILED,
+			  "Could not stat: '%s'", path);
 		goto cleanup_return;
 	}
 	if (!S_ISREG(st.st_mode)) {
-		error_code = READ_GITFILE_ERR_NOT_A_FILE;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_NOT_A_FILE,
+			  "Not a file: '%s'", path);
 		goto cleanup_return;
 	}
 	fd = open(path, O_RDONLY);
 	if (fd < 0) {
-		error_code = READ_GITFILE_ERR_OPEN_FAILED;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_OPEN_FAILED,
+			  "Error opening '%s'", path);
 		goto cleanup_return;
 	}
 	if (st.st_size > one_MB) {
 		close(fd);
-		error_code = READ_GITFILE_ERR_TOO_LARGE;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_TOO_LARGE,
+			  "Too large to be a .git file: '%s'", path);
 		goto cleanup_return;
 	}
 	buf = xmalloc(st.st_size + 1);
 	len = read_in_full(fd, buf, st.st_size);
 	close(fd);
 	if (len != st.st_size) {
-		error_code = READ_GITFILE_ERR_READ_FAILED;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_READ_FAILED,
+			  "Error reading %s", path);
 		goto cleanup_return;
 	}
 	buf[len] = '\0';
 	if (!starts_with(buf, "gitdir: ")) {
-		error_code = READ_GITFILE_ERR_INVALID_FORMAT;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_INVALID_FORMAT,
+			  "Invalid gitfile format: %s", path);
 		goto cleanup_return;
 	}
 	while (buf[len - 1] == '\n' || buf[len - 1] == '\r')
 		len--;
 	if (len < 9) {
-		error_code = READ_GITFILE_ERR_NO_PATH;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_NO_PATH,
+			  "No path in gitfile: %s", path);
 		goto cleanup_return;
 	}
 	buf[len] = '\0';
@@ -442,52 +460,44 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
 		buf = dir;
 	}
 
-	is_git_dir = is_git_directory_gently(dir, &error_code, &err_msg);
-	if (error_code) {
-		error_code = READ_GITFILE_ERR_CANT_VERIFY_PATH;
+	is_git_dir = is_git_directory_gently(dir, &is_git_dir_err, err_msg);
+	if (is_git_dir_err) {
+		if (return_err)
+			*return_err = READ_GITFILE_ERR_CANT_VERIFY_PATH;
 		goto cleanup_return;
 	}
 	if (!is_git_dir) {
-		error_code = READ_GITFILE_ERR_NOT_A_REPO;
+		set_error(return_err, err_msg,
+			  READ_GITFILE_ERR_NOT_A_REPO,
+			  "Not a git repository: %s", dir);
 		goto cleanup_return;
 	}
-	path = real_path(dir);
+	ret = real_path(dir);
 
 cleanup_return:
 	free(buf);
+	return ret;
+}
 
-	if (return_error_code)
-		*return_error_code = error_code;
+const char *read_gitfile(const char *path)
+{
+	int err;
+	const char *ret;
+	struct strbuf err_msg = STRBUF_INIT;
 
-	if (error_code) {
-		if (return_error_code)
-			return NULL;
+	ret = read_gitfile_gently(path, &err, &err_msg);
 
-		switch (error_code) {
-		case READ_GITFILE_ERR_STAT_FAILED:
-		case READ_GITFILE_ERR_NOT_A_FILE:
-			return NULL;
-		case READ_GITFILE_ERR_OPEN_FAILED:
-			die_errno("Error opening '%s'", path);
-		case READ_GITFILE_ERR_TOO_LARGE:
-			die("Too large to be a .git file: '%s'", path);
-		case READ_GITFILE_ERR_READ_FAILED:
-			die("Error reading %s", path);
-		case READ_GITFILE_ERR_INVALID_FORMAT:
-			die("Invalid gitfile format: %s", path);
-		case READ_GITFILE_ERR_NO_PATH:
-			die("No path in gitfile: %s", path);
-		case READ_GITFILE_ERR_CANT_VERIFY_PATH:
-			die("%s", err_msg.buf);
-		case READ_GITFILE_ERR_NOT_A_REPO:
-			die("Not a git repository: %s", dir);
-		default:
-			assert(0);
-		}
+	switch (err) {
+	case 0: /* No need to free err_msg, will only be
+		 * touched in case of error */
+		return ret;
+	case READ_GITFILE_ERR_STAT_FAILED:
+	case READ_GITFILE_ERR_NOT_A_FILE:
+		strbuf_release(&err_msg);
+		return NULL;
+	default:
+		die("%s", err_msg.buf);
 	}
-
-	strbuf_release(&err_msg);
-	return path;
 }
 
 static const char *setup_explicit_git_dir(const char *gitdirenv,
-- 
2.4.0.60.gf7143f7
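
The patch above calls set_error() at each failure site, but its definition is not part of this hunk. The following is only a sketch of how such a helper could behave, assuming it tolerates NULL for either output argument as the updated comment on read_gitfile_gently describes. The names set_error_sketch, struct msgbuf, parse_gitfile_line and the SKETCH_ERR_* codes are hypothetical, and a fixed-size buffer stands in for git's struct strbuf so the example compiles on its own.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for git's struct strbuf. */
struct msgbuf {
	char buf[256];
};

/* Hypothetical error codes; not the READ_GITFILE_ERR_* values. */
enum {
	SKETCH_ERR_INVALID_FORMAT = 1,
	SKETCH_ERR_NO_PATH
};

/*
 * Record an error code and a formatted message. Either output may be
 * NULL when the caller does not want that piece of information.
 */
static void set_error_sketch(int *return_err, struct msgbuf *err_msg,
			     int code, const char *fmt, ...)
{
	if (return_err)
		*return_err = code;
	if (err_msg) {
		va_list ap;
		va_start(ap, fmt);
		vsnprintf(err_msg->buf, sizeof(err_msg->buf), fmt, ap);
		va_end(ap);
	}
}

/*
 * Return the path that follows "gitdir: ", or NULL on error. The
 * error code is cleared up front so callers can test it afterwards.
 */
static const char *parse_gitfile_line(const char *line, int *return_err,
				      struct msgbuf *err_msg)
{
	if (return_err)
		*return_err = 0;
	if (strncmp(line, "gitdir: ", 8)) {
		set_error_sketch(return_err, err_msg,
				 SKETCH_ERR_INVALID_FORMAT,
				 "Invalid gitfile format: %s", line);
		return NULL;
	}
	if (line[8] == '\0') {
		set_error_sketch(return_err, err_msg,
				 SKETCH_ERR_NO_PATH,
				 "No path in gitfile: %s", line);
		return NULL;
	}
	return line + 8;
}
```

A caller that wants to die on failure can pass both outputs and print the message, while a caller such as clean can pass NULL for the message and branch on the code alone.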


* Re: [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently
  2015-05-10 20:00 ` [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently Erik Elfström
@ 2015-05-12  6:46   ` erik elfström
  0 siblings, 0 replies; 9+ messages in thread
From: erik elfström @ 2015-05-12  6:46 UTC (permalink / raw)
  To: Git List; +Cc: Erik Elfström

On Sun, May 10, 2015 at 10:00 PM, Erik Elfström <erik.elfstrom@gmail.com> wrote:
> @@ -404,6 +405,11 @@ const char *read_gitfile_gently(const char *path, int *return_error_code)
>                 error_code = READ_GITFILE_ERR_OPEN_FAILED;
>                 goto cleanup_return;
>         }
> +       if (st.st_size > one_MB) {
> +               close(fd);
> +               error_code = READ_GITFILE_ERR_TOO_LARGE;
> +               goto cleanup_return;
> +       }

Hmm... The order should probably be changed here. It would make more
sense to check the size before opening the file. That way the error
handling in clean would be more consistent if we can't open a large
.git file.

Right now we treat any file that we can't open as a potential
repository and avoid cleaning, but if we can open it and it turns out
to be larger than 1MB, we ignore it and clean. By switching the order
here we would always ignore files larger than 1MB, regardless of
whether we can open them, and I think that would make more sense. It
would also remove the need to close the file when erroring out due to
size, so it makes more sense from a purely structural point of view as
well.
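
The reordering described above could look roughly like the following. This is a standalone sketch, not the actual setup.c code: slurp_gitfile, enum gitfile_status and GITFILE_MAX_SIZE are hypothetical names, and plain read()/malloc() stand in for git's read_in_full()/xmalloc(). The point is only that the size comes from stat(), so the limit can be enforced before open() and the too-large path has no descriptor to close.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical status codes for this sketch. */
enum gitfile_status {
	GITFILE_OK = 0,
	GITFILE_STAT_FAILED,
	GITFILE_NOT_A_FILE,
	GITFILE_TOO_LARGE,
	GITFILE_OPEN_FAILED,
	GITFILE_READ_FAILED
};

#define GITFILE_MAX_SIZE (1 << 20) /* 1MB, as in the patch */

/* Read a candidate .git file into a malloc'd, NUL-terminated buffer. */
static enum gitfile_status slurp_gitfile(const char *path, char **out)
{
	struct stat st;
	size_t size, total = 0;
	char *buf;
	int fd;

	*out = NULL;
	if (stat(path, &st))
		return GITFILE_STAT_FAILED;
	if (!S_ISREG(st.st_mode))
		return GITFILE_NOT_A_FILE;
	/* Size check first: nothing is open yet, so nothing to close. */
	if (st.st_size > GITFILE_MAX_SIZE)
		return GITFILE_TOO_LARGE;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return GITFILE_OPEN_FAILED;

	size = (size_t)st.st_size;
	buf = malloc(size + 1);
	if (!buf) {
		close(fd);
		return GITFILE_READ_FAILED;
	}
	while (total < size) {
		ssize_t n = read(fd, buf + total, size - total);
		if (n < 0 && errno == EINTR)
			continue; /* interrupted, retry */
		if (n <= 0)
			break; /* error or unexpected EOF */
		total += (size_t)n;
	}
	close(fd);
	if (total != size) {
		free(buf);
		return GITFILE_READ_FAILED;
	}
	buf[total] = '\0';
	*out = buf;
	return GITFILE_OK;
}
```

With this order, an unreadable file over 1MB reports GITFILE_TOO_LARGE rather than GITFILE_OPEN_FAILED, so a caller like clean would handle oversized files the same way whether or not they can be opened.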

Sorry for not thinking of this earlier.

/Erik


Thread overview: 9+ messages
2015-05-10 20:00 [PATCH v6 0/7] Improving performance of git clean Erik Elfström
2015-05-10 20:00 ` [PATCH v6 1/7] setup: add gentle version of is_git_directory Erik Elfström
2015-05-10 20:00 ` [PATCH v6 2/7] setup: add gentle version of read_gitfile Erik Elfström
2015-05-10 20:00 ` [PATCH v6 3/7] setup: sanity check file size in read_gitfile_gently Erik Elfström
2015-05-12  6:46   ` erik elfström
2015-05-10 20:00 ` [PATCH v6 4/7] t7300: add tests to document behavior of clean and nested git Erik Elfström
2015-05-10 20:00 ` [PATCH v6 5/7] p7300: add performance tests for clean Erik Elfström
2015-05-10 20:00 ` [PATCH v6 6/7] clean: improve performance when removing lots of directories Erik Elfström
2015-05-10 20:00 ` [PATCH v6 7/7] RFC: Change error handling scheme in read_gitfile_gently Erik Elfström
