git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] Speed up connectivity checks via quarantine dir
@ 2021-05-19 19:13 Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY Patrick Steinhardt
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3413 bytes --]

Hi,

While analyzing push performance on gitlab.com, I've been at times
wondering what git-receive-pack(1) is doing for so long. For some repos
which have loads of references (~880k), even tiny pushes of less than 10
objects took dozens of seconds to get accepted.

One of the issues I've found is the object connectivity check, which may
run for a significant amount of time. The root cause here is that we're
computing connectivity via `git rev-list --not --all`: if we've got many
refs in the repository, computing `--not --all` is hugely expensive.

This commit series thus implements an alternative way of computing
reachability, which reuses information from the object quarantine
environment. Instead of doing a refwalk, we just look iterate over all
packed and loose quarantined objects any for each of them, we determine
whether their immediate references are all satisfied.

This reimplementation is paying out quite well for repos which have many
refs. The following benchmarks for git-receive-pack(1) (added in patch
2/8) have been performed in linux-stable.git:

Test                                     v2.32.0-rc0             HEAD
--------------------------------------------------------------------------------------------
5400.3: receive-pack clone create        1.27(1.11+0.16)         0.02(0.01+0.01) -98.4%
5400.5: receive-pack clone update        1.27(1.13+0.13)         0.02(0.02+0.00) -98.4%
5400.7: receive-pack clone reset         0.13(0.11+0.02)         0.02(0.01+0.01) -84.6%
5400.9: receive-pack clone delete        0.02(0.01+0.01)         0.03(0.02+0.01) +50.0%
5400.11: receive-pack extrarefs create   33.01(18.80+14.43)      9.00(4.30+4.65) -72.7%
5400.13: receive-pack extrarefs update   33.13(18.85+14.50)      9.01(4.28+4.67) -72.8%
5400.15: receive-pack extrarefs reset    32.90(18.82+14.32)      9.04(4.26+4.77) -72.5%
5400.17: receive-pack extrarefs delete   9.13(4.35+4.77)         8.94(4.29+4.64) -2.1%
5400.19: receive-pack empty create       223.35(640.63+127.74)   227.55(651.75+130.94) +1.9%

These rather clearly show that the previous rev-walk has been a major
bottleneck in the implementation.

Patrick

Patrick Steinhardt (8):
  perf: fix when running with TEST_OUTPUT_DIRECTORY
  p5400: add perf tests for git-receive-pack(1)
  tmp-objdir: expose function to retrieve path
  packfile: have `for_each_file_in_pack_dir()` return error codes
  object-file: allow reading loose objects without reading their
    contents
  connected: implement connectivity check via temporary object dirs
  receive-pack: skip connectivity checks on delete-only commands
  receive-pack: check connectivity via quarantined objects

 builtin/receive-pack.c       |  57 +++++++----
 connected.c                  | 192 +++++++++++++++++++++++++++++++++++
 connected.h                  |  19 ++++
 midx.c                       |  22 ++--
 object-file.c                |   9 +-
 packfile.c                   |  26 +++--
 packfile.h                   |  10 +-
 t/perf/aggregate.perl        |   8 +-
 t/perf/p5400-receive-pack.sh |  74 ++++++++++++++
 t/perf/perf-lib.sh           |   4 +-
 t/perf/run                   |  25 +++--
 tmp-objdir.c                 |   7 ++
 tmp-objdir.h                 |   5 +
 13 files changed, 401 insertions(+), 57 deletions(-)
 create mode 100755 t/perf/p5400-receive-pack.sh

-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-20  2:03   ` Chris Torek
  2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 5317 bytes --]

When the TEST_OUTPUT_DIRECTORY is defined, then all test data will be
written in that directory instead of the default directory located in
"t/". While this works as expected for our normal tests, performance
tests fail to locate and aggregate performance data because they don't
know to handle TEST_OUTPUT_DIRECTORY correctly and always look at the
default location.

Fix the issue by adding a `--results-dir` parameter to "aggregate.perl"
which identifies the directory where results are and by making the "run"
script awake of the TEST_OUTPUT_DIRECTORY variable.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/perf/aggregate.perl |  8 ++++++--
 t/perf/perf-lib.sh    |  4 ++--
 t/perf/run            | 25 ++++++++++++++++---------
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index 14e4cda287..5d4964c5c6 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -58,6 +58,7 @@ sub usage {
   Options:
     --codespeed          * Format output for Codespeed
     --reponame    <str>  * Send given reponame to codespeed
+    --results-dir <str>  * Directory where test results are located
     --sort-by     <str>  * Sort output (only "regression" criteria is supported)
     --subsection  <str>  * Use results from given subsection
 
@@ -90,12 +91,13 @@ sub sane_backticks {
 }
 
 my (@dirs, %dirnames, %dirabbrevs, %prefixes, @tests,
-    $codespeed, $sortby, $subsection, $reponame);
+    $codespeed, $sortby, $subsection, $reponame, $resultsdir);
 
 Getopt::Long::Configure qw/ require_order /;
 
 my $rc = GetOptions("codespeed"     => \$codespeed,
 		    "reponame=s"    => \$reponame,
+		    "results-dir=s" => \$resultsdir,
 		    "sort-by=s"     => \$sortby,
 		    "subsection=s"  => \$subsection);
 usage() unless $rc;
@@ -137,7 +139,9 @@ sub sane_backticks {
 	@tests = glob "p????-*.sh";
 }
 
-my $resultsdir = "test-results";
+if (not $resultsdir) {
+	$resultsdir = "test-results";
+}
 
 if (! $subsection and
     exists $ENV{GIT_PERF_SUBSECTION} and
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index 601d9f67dd..8ca6dd0297 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -45,7 +45,7 @@ export TEST_DIRECTORY TRASH_DIRECTORY GIT_BUILD_DIR GIT_TEST_CMP
 MODERN_GIT=$GIT_BUILD_DIR/bin-wrappers/git
 export MODERN_GIT
 
-perf_results_dir=$TEST_OUTPUT_DIRECTORY/test-results
+perf_results_dir=$TEST_RESULTS_DIR
 test -n "$GIT_PERF_SUBSECTION" && perf_results_dir="$perf_results_dir/$GIT_PERF_SUBSECTION"
 mkdir -p "$perf_results_dir"
 rm -f "$perf_results_dir"/$(basename "$0" .sh).subtests
@@ -253,7 +253,7 @@ test_size () {
 # and does it after running everything)
 test_at_end_hook_ () {
 	if test -z "$GIT_PERF_AGGREGATING_LATER"; then
-		( cd "$TEST_DIRECTORY"/perf && ./aggregate.perl $(basename "$0") )
+		( cd "$TEST_DIRECTORY"/perf && ./aggregate.perl --results-dir="$TEST_RESULTS_DIR" $(basename "$0") )
 	fi
 }
 
diff --git a/t/perf/run b/t/perf/run
index c7b86104e1..03128d440a 100755
--- a/t/perf/run
+++ b/t/perf/run
@@ -188,10 +188,10 @@ run_subsection () {
 
 	if test -z "$GIT_PERF_SEND_TO_CODESPEED"
 	then
-		./aggregate.perl $codespeed_opt "$@"
+		./aggregate.perl --results-dir="$TEST_RESULTS_DIR" $codespeed_opt "$@"
 	else
-		json_res_file="test-results/$GIT_PERF_SUBSECTION/aggregate.json"
-		./aggregate.perl --codespeed "$@" | tee "$json_res_file"
+		json_res_file=""$TEST_RESULTS_DIR"/$GIT_PERF_SUBSECTION/aggregate.json"
+		./aggregate.perl --results-dir="$TEST_RESULTS_DIR" --codespeed "$@" | tee "$json_res_file"
 		send_data_url="$GIT_PERF_SEND_TO_CODESPEED/result/add/json/"
 		curl -v --request POST --data-urlencode "json=$(cat "$json_res_file")" "$send_data_url"
 	fi
@@ -203,10 +203,17 @@ get_var_from_env_or_config "GIT_PERF_SEND_TO_CODESPEED" "perf" "sendToCodespeed"
 cd "$(dirname $0)"
 . ../../GIT-BUILD-OPTIONS
 
-mkdir -p test-results
-get_subsections "perf" >test-results/run_subsections.names
+if test -n "$TEST_OUTPUT_DIRECTORY"
+then
+    TEST_RESULTS_DIR="$TEST_OUTPUT_DIRECTORY/test-results"
+else
+    TEST_RESULTS_DIR=test-results
+fi
 
-if test $(wc -l <test-results/run_subsections.names) -eq 0
+mkdir -p "$TEST_RESULTS_DIR"
+get_subsections "perf" >"$TEST_RESULTS_DIR"/run_subsections.names
+
+if test $(wc -l <"$TEST_RESULTS_DIR"/run_subsections.names) -eq 0
 then
 	if test -n "$GIT_PERF_SUBSECTION"
 	then
@@ -222,10 +229,10 @@ then
 	)
 elif test -n "$GIT_PERF_SUBSECTION"
 then
-	egrep "^$GIT_PERF_SUBSECTION\$" test-results/run_subsections.names >/dev/null ||
+	egrep "^$GIT_PERF_SUBSECTION\$" "$TEST_RESULTS_DIR"/run_subsections.names >/dev/null ||
 		die "subsection '$GIT_PERF_SUBSECTION' not found in '$GIT_PERF_CONFIG_FILE'"
 
-	egrep "^$GIT_PERF_SUBSECTION\$" test-results/run_subsections.names | while read -r subsec
+	egrep "^$GIT_PERF_SUBSECTION\$" "$TEST_RESULTS_DIR"/run_subsections.names | while read -r subsec
 	do
 		(
 			GIT_PERF_SUBSECTION="$subsec"
@@ -243,5 +250,5 @@ else
 			echo "======== Run for subsection '$GIT_PERF_SUBSECTION' ========"
 			run_subsection "$@"
 		)
-	done <test-results/run_subsections.names
+	done <"$TEST_RESULTS_DIR"/run_subsections.names
 fi
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/8] p5400: add perf tests for git-receive-pack(1)
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-20  2:09   ` Chris Torek
                     ` (2 more replies)
  2021-05-19 19:13 ` [PATCH 3/8] tmp-objdir: expose function to retrieve path Patrick Steinhardt
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3079 bytes --]

We'll the connectivity check logic for git-receive-pack(1) in the
following commits to make it perform better. As a preparatory step, add
some benchmarks such that we can measure these changes.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 t/perf/p5400-receive-pack.sh | 74 ++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100755 t/perf/p5400-receive-pack.sh

diff --git a/t/perf/p5400-receive-pack.sh b/t/perf/p5400-receive-pack.sh
new file mode 100755
index 0000000000..2b0c89d977
--- /dev/null
+++ b/t/perf/p5400-receive-pack.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description="Tests performance of receive-pack"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_expect_success 'setup' '
+	# Create a main branch such that we do not have to rely on any specific
+	# branch to exist in the perf repository.
+	git switch --force-create main &&
+
+	TARGET_REPO_CLONE=$(pwd)/target-clone.git &&
+	git clone --bare --dissociate --branch main "$(pwd)" "$TARGET_REPO_CLONE" &&
+	TARGET_REPO_REFS=$(pwd)/target-refs.git &&
+	git clone --bare --dissociate --branch main "$(pwd)" "$TARGET_REPO_REFS" &&
+	TARGET_REPO_EMPTY=$(pwd)/target-empty.git &&
+	git init --bare "$TARGET_REPO_EMPTY" &&
+
+	# Set up a pre-receive hook such that no refs will ever be changed.
+	# This easily allows multiple perf runs, but still exercises
+	# server-side reference negotiation and checking for consistency.
+	mkdir hooks &&
+	write_script hooks/pre-receive <<-EOF &&
+		#!/bin/sh
+		echo "failed in pre-receive hook"
+		exit 1
+	EOF
+	cat >config <<-EOF &&
+		[core]
+			hooksPath=$(pwd)/hooks
+	EOF
+	GIT_CONFIG_GLOBAL="$(pwd)/config" &&
+	export GIT_CONFIG_GLOBAL &&
+
+	# Create a reference for each commit in the target repository with
+	# extra-refs. While this may be an atypical setup, biggish repositories
+	# easily end up with hundreds of thousands of refs, and this is a good
+	# enough approximation.
+	git -C "$TARGET_REPO_REFS" log --all --format="tformat:create refs/commit/%h %H" |
+		git -C "$TARGET_REPO_REFS" update-ref --stdin &&
+
+	git switch --create updated &&
+	test_commit --no-tag updated
+'
+
+while read name repo
+do
+	refs=("create updated:new")
+
+	# If the target repository is the empty one, then the only thing we can
+	# do is to create a new branch.
+	if test "$repo" != "$TARGET_REPO_EMPTY"
+	then
+		refs+=("update updated:main" "reset main~:main" "delete :main")
+	fi
+
+	while read desc ref
+	do
+		test_expect_success "setup $name $desc" "
+			test_must_fail git push --force '$repo' '$ref' \
+				--receive-pack='tee pack | git receive-pack' 2>err &&
+			grep 'failed in pre-receive hook' err
+		"
+
+		test_perf "receive-pack $name $desc" "
+			git receive-pack '$repo' <pack >negotiation &&
+			grep 'pre-receive hook declined' negotiation
+		"
+	done < <(printf "%s\n" "${refs[@]}")
+done < <(printf "%s\n" "clone $TARGET_REPO_CLONE" "extrarefs $TARGET_REPO_REFS" "empty $TARGET_REPO_EMPTY")
+
+test_done
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/8] tmp-objdir: expose function to retrieve path
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-20  0:16   ` Elijah Newren
  2021-05-19 19:13 ` [PATCH 4/8] packfile: have `for_each_file_in_pack_dir()` return error codes Patrick Steinhardt
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

It's currently impossible to tell the path of a temporary object
directory for outside users of `struct tmp_objdir`. We'll soon need that
information though so that we can reuse information from the quarantine
environment in git-receive-pack(1).

Provide a new function `tmp_objdir_path()` which returns the path of a
temporary object directory to prepare for this.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 tmp-objdir.c | 7 +++++++
 tmp-objdir.h | 5 +++++
 2 files changed, 12 insertions(+)

diff --git a/tmp-objdir.c b/tmp-objdir.c
index b8d880e362..6056917c63 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -288,6 +288,13 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
 	return t->env.v;
 }
 
+const char *tmp_objdir_path(const struct tmp_objdir *t)
+{
+	if (!t)
+		return NULL;
+	return t->path.buf;
+}
+
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
 {
 	add_to_alternates_memory(t->path.buf);
diff --git a/tmp-objdir.h b/tmp-objdir.h
index b1e45b4c75..da3ccb98bc 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -51,4 +51,9 @@ int tmp_objdir_destroy(struct tmp_objdir *);
  */
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *);
 
+/*
+ * Return the path of the temporary object directory.
+ */
+const char *tmp_objdir_path(const struct tmp_objdir *t);
+
 #endif /* TMP_OBJDIR_H */
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/8] packfile: have `for_each_file_in_pack_dir()` return error codes
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 3/8] tmp-objdir: expose function to retrieve path Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 5/8] object-file: allow reading loose objects without reading their contents Patrick Steinhardt
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 6380 bytes --]

The function `for_each_file_in_pack_dir()` doesn't ever return error
codes right now: any errors it hits are simply swallowed. A new user
we're about to add would like to know what's going on though such that
it can decide whether the call was successful or not.

Refactor the function to return an error code, and adapt the signature
of the callback function to also return an error code. This enables the
callback to abort iteration and have its error code passed through.
Existing callers are adapted to this change, but keep ignoring errors.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 midx.c     | 22 +++++++++++++---------
 packfile.c | 26 ++++++++++++++++----------
 packfile.h | 10 +++++-----
 3 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/midx.c b/midx.c
index 21d6a05e88..f87a3d36ab 100644
--- a/midx.c
+++ b/midx.c
@@ -470,15 +470,15 @@ struct write_midx_context {
 	int preferred_pack_idx;
 };
 
-static void add_pack_to_midx(const char *full_path, size_t full_path_len,
-			     const char *file_name, void *data)
+static int add_pack_to_midx(const char *full_path, size_t full_path_len,
+			    const char *file_name, void *data)
 {
 	struct write_midx_context *ctx = data;
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
-			return;
+			return 0;
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -489,7 +489,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		if (!ctx->info[ctx->nr].p) {
 			warning(_("failed to add packfile '%s'"),
 				full_path);
-			return;
+			return 0;
 		}
 
 		if (open_pack_index(ctx->info[ctx->nr].p)) {
@@ -497,7 +497,7 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 				full_path);
 			close_pack(ctx->info[ctx->nr].p);
 			FREE_AND_NULL(ctx->info[ctx->nr].p);
-			return;
+			return 0;
 		}
 
 		ctx->info[ctx->nr].pack_name = xstrdup(file_name);
@@ -505,6 +505,8 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		ctx->info[ctx->nr].expired = 0;
 		ctx->nr++;
 	}
+
+	return 0;
 }
 
 struct pack_midx_entry {
@@ -1110,19 +1112,21 @@ struct clear_midx_data {
 	const char *ext;
 };
 
-static void clear_midx_file_ext(const char *full_path, size_t full_path_len,
-				const char *file_name, void *_data)
+static int clear_midx_file_ext(const char *full_path, size_t full_path_len,
+			const char *file_name, void *_data)
 {
 	struct clear_midx_data *data = _data;
 
 	if (!(starts_with(file_name, "multi-pack-index-") &&
 	      ends_with(file_name, data->ext)))
-		return;
+		return 0;
 	if (data->keep && !strcmp(data->keep, file_name))
-		return;
+		return 0;
 
 	if (unlink(full_path))
 		die_errno(_("failed to remove %s"), full_path);
+
+	return 0;
 }
 
 static void clear_midx_files_ext(struct repository *r, const char *ext,
diff --git a/packfile.c b/packfile.c
index b79cbc8cd4..f488a214cf 100644
--- a/packfile.c
+++ b/packfile.c
@@ -792,14 +792,15 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
-void for_each_file_in_pack_dir(const char *objdir,
-			       each_file_in_pack_dir_fn fn,
-			       void *data)
+int for_each_file_in_pack_dir(const char *objdir,
+			      each_file_in_pack_dir_fn fn,
+			      void *data)
 {
 	struct strbuf path = STRBUF_INIT;
 	size_t dirnamelen;
 	DIR *dir;
 	struct dirent *de;
+	int ret = 0;
 
 	strbuf_addstr(&path, objdir);
 	strbuf_addstr(&path, "/pack");
@@ -809,7 +810,7 @@ void for_each_file_in_pack_dir(const char *objdir,
 			error_errno("unable to open object pack directory: %s",
 				    path.buf);
 		strbuf_release(&path);
-		return;
+		return -1;
 	}
 	strbuf_addch(&path, '/');
 	dirnamelen = path.len;
@@ -820,11 +821,14 @@ void for_each_file_in_pack_dir(const char *objdir,
 		strbuf_setlen(&path, dirnamelen);
 		strbuf_addstr(&path, de->d_name);
 
-		fn(path.buf, path.len, de->d_name, data);
+		ret = fn(path.buf, path.len, de->d_name, data);
+		if (ret)
+			break;
 	}
 
 	closedir(dir);
 	strbuf_release(&path);
+	return ret;
 }
 
 struct prepare_pack_data {
@@ -834,8 +838,8 @@ struct prepare_pack_data {
 	struct multi_pack_index *m;
 };
 
-static void prepare_pack(const char *full_name, size_t full_name_len,
-			 const char *file_name, void *_data)
+static int prepare_pack(const char *full_name, size_t full_name_len,
+			const char *file_name, void *_data)
 {
 	struct prepare_pack_data *data = (struct prepare_pack_data *)_data;
 	struct packed_git *p;
@@ -858,13 +862,13 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 	}
 
 	if (!report_garbage)
-		return;
+		return 0;
 
 	if (!strcmp(file_name, "multi-pack-index"))
-		return;
+		return 0;
 	if (starts_with(file_name, "multi-pack-index") &&
 	    ends_with(file_name, ".rev"))
-		return;
+		return 0;
 	if (ends_with(file_name, ".idx") ||
 	    ends_with(file_name, ".rev") ||
 	    ends_with(file_name, ".pack") ||
@@ -874,6 +878,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 		string_list_append(data->garbage, full_name);
 	else
 		report_garbage(PACKDIR_FILE_GARBAGE, full_name);
+
+	return 0;
 }
 
 static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
diff --git a/packfile.h b/packfile.h
index 3ae117a8ae..eac930cd27 100644
--- a/packfile.h
+++ b/packfile.h
@@ -39,11 +39,11 @@ const char *pack_basename(struct packed_git *p);
 
 struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
-typedef void each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
-				      const char *file_pach, void *data);
-void for_each_file_in_pack_dir(const char *objdir,
-			       each_file_in_pack_dir_fn fn,
-			       void *data);
+typedef int each_file_in_pack_dir_fn(const char *full_path, size_t full_path_len,
+				     const char *file_pach, void *data);
+int for_each_file_in_pack_dir(const char *objdir,
+			      each_file_in_pack_dir_fn fn,
+			      void *data);
 
 /* A hook to report invalid files in pack directory */
 #define PACKDIR_FILE_PACK 1
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 5/8] object-file: allow reading loose objects without reading their contents
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 4/8] packfile: have `for_each_file_in_pack_dir()` return error codes Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 6/8] connected: implement connectivity check via temporary object dirs Patrick Steinhardt
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1391 bytes --]

We're about to add a new implementation of the connectivity check via
the object quarantine directory, which will directly iterate over any
loose object via their paths. For optimization purposes, we'll want to
skip over any loose blobs, but right now there's no easy way to read a
loose object without also slurping in all its contents.

Fix this shortcoming by modifying `read_loose_object()` such that it can
handle the case where no pointer for `contents` was passed in. If so,
then we'll simply skip reading the loose object.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 object-file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/object-file.c b/object-file.c
index 884855b9df..ab695823fd 100644
--- a/object-file.c
+++ b/object-file.c
@@ -2552,7 +2552,8 @@ int read_loose_object(const char *path,
 	git_zstream stream;
 	char hdr[MAX_HEADER_LEN];
 
-	*contents = NULL;
+	if (contents)
+		*contents = NULL;
 
 	map = map_loose_object_1(the_repository, path, NULL, &mapsize);
 	if (!map) {
@@ -2572,6 +2573,12 @@ int read_loose_object(const char *path,
 		goto out;
 	}
 
+	if (!contents) {
+		git_inflate_end(&stream);
+		ret = 0;
+		goto out;
+	}
+
 	if (*type == OBJ_BLOB && *size > big_file_threshold) {
 		if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)
 			goto out;
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 6/8] connected: implement connectivity check via temporary object dirs
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 5/8] object-file: allow reading loose objects without reading their contents Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-19 19:13 ` [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 9296 bytes --]

The connectivity checks are currently implemented via git-rev-list(1):
we simply ignore any preexisting refs via `--not --all`, and pass all
new refs which are to be checked via its stdin. While this works well
enough for repositories which do not have a lot of references, it's
clear that `--not --all` will linearly scale with the number of refs
which do exist: for each reference, we'll walk through its commit as
well as its five parent commits (defined via `SLOP`). Given that many
major hosting sites which use a pull/merge request workflow keep refs to
the request's HEAD, this effectively means that `--not --all` will do a
nearly complete graph walk.

We can compute the set of pushed objects much more easily though. We
know exactly which objects were just received, which is the set of
objects which are in git-receive-pack(1)'s object quarantine directory.
While this may overshoot new objects because the client may have sent
objects which we already know, it lets us avoid doing a graph walk of
preexisting commits.

The implementation of this is quite simple: we iterate through all loose
and packed objects in the object directory, and for each object we check
whether its immediate referenced objects exist. There is no need to do a
walk beyond these direct links: if the receiving repository already has
the directly referenced object, then it's not a new object in the first
place and thus does not need a re-check. If the referenced object is a
new one and we can load it, then we know that it will eventually be
checked via the quarantined objects walk, too.

Finally, after we have concluded that all quarantined objects are indeed
fully connected, the only thing left to check is whether all reference
updates reference an existing object.

This new connectivity check will be wired up in a subsequent commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 connected.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 connected.h |  19 ++++++
 2 files changed, 211 insertions(+)

diff --git a/connected.c b/connected.c
index b18299fdf0..ab0eb9f974 100644
--- a/connected.c
+++ b/connected.c
@@ -6,6 +6,13 @@
 #include "transport.h"
 #include "packfile.h"
 #include "promisor-remote.h"
+#include "object.h"
+#include "tree-walk.h"
+#include "commit.h"
+#include "tag.h"
+#include "progress.h"
+#include "tmp-objdir.h"
+#include "oidset.h"
 
 /*
  * If we feed all the commits we want to verify to this command
@@ -151,3 +158,188 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 	sigchain_pop(SIGPIPE);
 	return finish_command(&rev_list) || err;
 }
+
+struct connected_payload {
+	struct repository *repo;
+	struct progress *progress;
+	struct oidset seen;
+	size_t checked_objects;
+	size_t missing_objects;
+};
+
+static void check_missing(struct connected_payload *payload, const struct object_id *oid)
+{
+	if (oidset_contains(&payload->seen, oid))
+		return;
+	if (oid_object_info(payload->repo, oid, NULL) <= 0)
+		payload->missing_objects++;
+	oidset_insert(&payload->seen, oid);
+	display_progress(payload->progress, ++payload->checked_objects);
+}
+
+static int check_quarantined_object(const struct object_id *oid, enum object_type type,
+				    void *content, unsigned long size, void *data)
+{
+	struct connected_payload *payload = data;
+	struct object *object;
+	int eaten;
+
+	object = parse_object_buffer(payload->repo, oid, type, size, content, &eaten);
+	if (!object) {
+		if (!eaten)
+			free(content);
+		error(_("could not parse quarantined object %s"), oid_to_hex(oid));
+		return -1;
+	}
+	if (!eaten)
+		free(content);
+
+	if (type == OBJ_TREE) {
+		struct tree *tree = (struct tree *)object;
+		struct tree_desc tree_desc;
+		struct name_entry entry;
+
+		if (init_tree_desc_gently(&tree_desc, tree->buffer, tree->size))
+			return -1;
+		while (tree_entry_gently(&tree_desc, &entry))
+			check_missing(payload, &entry.oid);
+
+		free_tree_buffer(tree);
+	} else if (type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) object;
+		struct commit_list *parents;
+
+		check_missing(payload, get_commit_tree_oid(commit));
+		for (parents = commit->parents; parents; parents = parents->next)
+			check_missing(payload, &parents->item->object.oid);
+	} else if (type == OBJ_TAG) {
+		check_missing(payload, get_tagged_oid((struct tag *) object));
+	} else {
+		error(_("unexpected object type"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int loose_object_iterator(const struct object_id *oid,
+				 const char *path, void *_data)
+{
+	struct connected_payload *payload = _data;
+	enum object_type type;
+	unsigned long size;
+	void *content;
+
+	if (read_loose_object(path, oid, &type, &size, NULL) < 0)
+		return -1;
+
+	oidset_insert(&payload->seen, oid);
+	display_progress(payload->progress, ++payload->checked_objects);
+
+	if (type == OBJ_BLOB)
+		return 0;
+	if (read_loose_object(path, oid, &type, &size, &content) < 0)
+		return -1;
+
+	return check_quarantined_object(oid, type, content, size, payload);
+}
+
+static int packed_object_iterator(const struct object_id *oid,
+				  struct packed_git *pack, uint32_t pos,
+				  void *_data)
+{
+	off_t off = nth_packed_object_offset(pack, pos);
+	struct connected_payload *payload = _data;
+	struct object_info oi = OBJECT_INFO_INIT;
+	enum object_type type;
+	unsigned long size;
+	void *content;
+
+	oi.typep = &type;
+	if (packed_object_info(payload->repo, pack, off, &oi) < 0)
+		return -1;
+
+	oidset_insert(&payload->seen, oid);
+	display_progress(payload->progress, ++payload->checked_objects);
+
+	if (type == OBJ_BLOB)
+		return 0;
+
+	oi.contentp = &content;
+	oi.sizep = &size;
+
+	if (packed_object_info(payload->repo, pack, off, &oi) < 0)
+		return -1;
+
+	return check_quarantined_object(oid, type, content, size, payload);
+}
+
+static int pack_dir_iterator(const char *full_path, size_t full_path_len,
+			     const char *file_path, void *data)
+{
+	if (ends_with(full_path, ".idx")) {
+		struct packed_git *pack = add_packed_git(full_path, full_path_len, 1);
+		if (!pack) {
+			error(_("unable to add quarantined pack"));
+			return -1;
+		}
+
+		if (for_each_object_in_pack(pack, packed_object_iterator, data,
+					    FOR_EACH_OBJECT_PACK_ORDER) < 0) {
+			error(_("unable to iterate packed objects"));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+int check_connected_quarantine(struct repository *repo,
+			       struct tmp_objdir *quarantine,
+			       oid_iterate_fn fn, void *cb_data)
+{
+	struct connected_payload payload = {
+		.repo = repo,
+		.seen = OIDSET_INIT,
+	};
+	struct object_id oid;
+	int ret;
+
+	payload.progress = start_delayed_progress("Checking connectivity", 0);
+
+	/*
+	 * Iterate through all loose and packed objects part of the object
+	 * quarantine and verify that for each of them, all directly referenced
+	 * objects exist.
+	 */
+	if (for_each_loose_file_in_objdir(tmp_objdir_path(quarantine),
+					  loose_object_iterator, NULL,
+					  NULL, &payload) < 0) {
+		error(_("unable to check quarantined loose objects"));
+		ret = -1;
+		goto out;
+	}
+
+	if (for_each_file_in_pack_dir(tmp_objdir_path(quarantine),
+				      pack_dir_iterator, &payload) < 0) {
+		error(_("unable to check quarantined packed objects"));
+		ret = -1;
+		goto out;
+	}
+
+	/*
+	 * After we've verified that all quarantined objects are indeed
+	 * connected, we need to verify that all new tips are contained in the
+	 * repository, too.
+	 */
+	while (!fn(cb_data, &oid)) {
+		if (oid_object_info(repo, &oid, NULL) <= 0)
+			payload.missing_objects++;
+		display_progress(payload.progress, ++payload.checked_objects);
+	}
+
+out:
+	stop_progress(&payload.progress);
+	oidset_clear(&payload.seen);
+	return ret || payload.missing_objects != 0;
+}
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..336b6f362e 100644
--- a/connected.h
+++ b/connected.h
@@ -3,6 +3,7 @@
 
 struct object_id;
 struct transport;
+struct tmp_objdir;
 
 /*
  * Take callback data, and return next object name in the buffer.
@@ -62,4 +63,22 @@ struct check_connected_options {
 int check_connected(oid_iterate_fn fn, void *cb_data,
 		    struct check_connected_options *opt);
 
+/*
+ * Make sure that all objects in the given quarantine directory are connected
+ * and that all OIDs returned by the given callback are backed by an existing
+ * object. The quarantine directory must have been made available to the
+ * repository via `add_to_alternates_memory()`.
+ *
+ * Given that objects are enumerated via the quarantine directory directly,
+ * this check should in general be more efficient than `check_connected()`
+ * while being more pedantic at the same time (e.g. quarantined objects which
+ * aren't referenced by anything but which do have missing links would get
+ * rejected).
+ *
+ * Return 0 if Ok, non zero otherwise (i.e. some missing objects)
+ */
+int check_connected_quarantine(struct repository *repo,
+			       struct tmp_objdir *quarantine,
+			       oid_iterate_fn fn, void *cb_data);
+
 #endif /* CONNECTED_H */
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 6/8] connected: implement connectivity check via temporary object dirs Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-21 18:53   ` Felipe Contreras
  2021-05-19 19:13 ` [PATCH 8/8] receive-pack: check connectivity via quarantined objects Patrick Steinhardt
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4064 bytes --]

In the case where git-receive-pack(1) receives only commands which
delete references, then per technical specification the client MUST NOT
send a packfile. As a result, we know that no new objects have been
received, which makes it a moot point to check whether all received
objects are fully connected.

Fix this by not doing a connectivity check in case there were no pushed
objects. Given that git-rev-walk(1) with only negative references will
not do any graph walk, no performance improvements are to be expected.
Conceptionally, it is still the right thing to do though.

The following tests were executed on linux.git and back up above
expectation:

Test                                     v2.32.0-rc0             HEAD
--------------------------------------------------------------------------------------------
5400.3: receive-pack clone create        1.27(1.11+0.16)         1.26(1.12+0.14) -0.8%
5400.5: receive-pack clone update        1.27(1.13+0.13)         1.27(1.11+0.16) +0.0%
5400.7: receive-pack clone reset         0.13(0.11+0.02)         0.14(0.11+0.02) +7.7%
5400.9: receive-pack clone delete        0.02(0.01+0.01)         0.02(0.00+0.01) +0.0%
5400.11: receive-pack extrarefs create   33.01(18.80+14.43)      32.63(18.52+14.24) -1.2%
5400.13: receive-pack extrarefs update   33.13(18.85+14.50)      32.82(18.85+14.29) -0.9%
5400.15: receive-pack extrarefs reset    32.90(18.82+14.32)      32.70(18.76+14.20) -0.6%
5400.17: receive-pack extrarefs delete   9.13(4.35+4.77)         8.99(4.28+4.70) -1.5%
5400.19: receive-pack empty create       223.35(640.63+127.74)   226.96(655.16+131.93) +1.6%

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/receive-pack.c | 49 ++++++++++++++++++++++++++----------------
 1 file changed, 30 insertions(+), 19 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index a34742513a..b9263cec15 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1918,11 +1918,8 @@ static void execute_commands(struct command *commands,
 			     struct shallow_info *si,
 			     const struct string_list *push_options)
 {
-	struct check_connected_options opt = CHECK_CONNECTED_INIT;
 	struct command *cmd;
 	struct iterate_data data;
-	struct async muxer;
-	int err_fd = 0;
 	int run_proc_receive = 0;
 
 	if (unpacker_error) {
@@ -1931,25 +1928,39 @@ static void execute_commands(struct command *commands,
 		return;
 	}
 
-	if (use_sideband) {
-		memset(&muxer, 0, sizeof(muxer));
-		muxer.proc = copy_to_sideband;
-		muxer.in = -1;
-		if (!start_async(&muxer))
-			err_fd = muxer.in;
-		/* ...else, continue without relaying sideband */
-	}
-
 	data.cmds = commands;
 	data.si = si;
-	opt.err_fd = err_fd;
-	opt.progress = err_fd && !quiet;
-	opt.env = tmp_objdir_env(tmp_objdir);
-	if (check_connected(iterate_receive_command_list, &data, &opt))
-		set_connectivity_errors(commands, si);
 
-	if (use_sideband)
-		finish_async(&muxer);
+	/*
+	 * If received commands only consist of deletions, then the client MUST
+	 * NOT send a packfile because there cannot be any new objects in the
+	 * first place. As a result, we do not set up a quarantine environment
+	 * because we know no new objects will be received. And that in turn
+	 * means that we can skip connectivity checks here.
+	 */
+	if (tmp_objdir) {
+		struct check_connected_options opt = CHECK_CONNECTED_INIT;
+		struct async muxer;
+		int err_fd = 0;
+
+		if (use_sideband) {
+			memset(&muxer, 0, sizeof(muxer));
+			muxer.proc = copy_to_sideband;
+			muxer.in = -1;
+			if (!start_async(&muxer))
+				err_fd = muxer.in;
+			/* ...else, continue without relaying sideband */
+		}
+
+		opt.err_fd = err_fd;
+		opt.progress = err_fd && !quiet;
+		opt.env = tmp_objdir_env(tmp_objdir);
+		if (check_connected(iterate_receive_command_list, &data, &opt))
+			set_connectivity_errors(commands, si);
+
+		if (use_sideband)
+			finish_async(&muxer);
+	}
 
 	reject_updates_to_hidden(commands);
 
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 8/8] receive-pack: check connectivity via quarantined objects
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (6 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
@ 2021-05-19 19:13 ` Patrick Steinhardt
  2021-05-20  2:19 ` [PATCH 0/8] Speed up connectivity checks via quarantine dir Chris Torek
  2021-05-20 16:50 ` Jeff King
  9 siblings, 0 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-19 19:13 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3293 bytes --]

Convert git-receive-pack(1) to use the new connectivity check which
uses the quarantine directory to enumerate new objects instead of using
references. This change significantly decreases the time required to
accept reference updates due to the fact that we don't need to do a
graph walk anymore. The following measurements have been executed on
linux.git:

Test                                     v2.32.0-rc0             HEAD
--------------------------------------------------------------------------------------------
5400.3: receive-pack clone create        1.27(1.11+0.16)         0.02(0.01+0.01) -98.4%
5400.5: receive-pack clone update        1.27(1.13+0.13)         0.02(0.02+0.00) -98.4%
5400.7: receive-pack clone reset         0.13(0.11+0.02)         0.02(0.01+0.01) -84.6%
5400.9: receive-pack clone delete        0.02(0.01+0.01)         0.03(0.02+0.01) +50.0%
5400.11: receive-pack extrarefs create   33.01(18.80+14.43)      9.00(4.30+4.65) -72.7%
5400.13: receive-pack extrarefs update   33.13(18.85+14.50)      9.01(4.28+4.67) -72.8%
5400.15: receive-pack extrarefs reset    32.90(18.82+14.32)      9.04(4.26+4.77) -72.5%
5400.17: receive-pack extrarefs delete   9.13(4.35+4.77)         8.94(4.29+4.64) -2.1%
5400.19: receive-pack empty create       223.35(640.63+127.74)   227.55(651.75+130.94) +1.9%

Interestingly enough, the normal "clone" repository shows a more extreme
improvement compared to the "extrarefs" repository, which has one
reference per commit. While I haven't yet dived into the root cause, I
expect that there is another location besides the connectivity check
which does scale with the number of references.

Notably, the connectivity check when pushing into an empty repository
shows comparable performance with the previous implementation. This is
because in both cases we're actually performing a complete graph walk to
determine whether there are any unreachable commits.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/receive-pack.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index b9263cec15..23087acad0 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1937,8 +1937,12 @@ static void execute_commands(struct command *commands,
 	 * first place. As a result, we do not set up a quarantine environment
 	 * because we know no new objects will be received. And that in turn
 	 * means that we can skip connectivity checks here.
+	 *
+	 * Using the quarantine directory to determine connectivity may not
+	 * work with shallow clones, so let's let's play it safe for now and
+	 * fall back to the old connectivity check.
 	 */
-	if (tmp_objdir) {
+	if (tmp_objdir && (si->nr_ours || si->nr_theirs)) {
 		struct check_connected_options opt = CHECK_CONNECTED_INIT;
 		struct async muxer;
 		int err_fd = 0;
@@ -1960,6 +1964,10 @@ static void execute_commands(struct command *commands,
 
 		if (use_sideband)
 			finish_async(&muxer);
+	} else if (tmp_objdir) {
+		if (check_connected_quarantine(the_repository, tmp_objdir,
+					       iterate_receive_command_list, &data))
+			set_connectivity_errors(commands, si);
 	}
 
 	reject_updates_to_hidden(commands);
-- 
2.31.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/8] tmp-objdir: expose function to retrieve path
  2021-05-19 19:13 ` [PATCH 3/8] tmp-objdir: expose function to retrieve path Patrick Steinhardt
@ 2021-05-20  0:16   ` Elijah Newren
  0 siblings, 0 replies; 22+ messages in thread
From: Elijah Newren @ 2021-05-20  0:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Git Mailing List

On Wed, May 19, 2021 at 1:22 PM Patrick Steinhardt <ps@pks.im> wrote:
>
> It's currently impossible to tell the path of a temporary object
> directory for outside users of `struct tmp_objdir`. We'll soon need that
> information though so that we can reuse information from the quarantine
> environment in git-receive-pack(1).
>
> Provide a new function `tmp_objdir_path()` which returns the path of a
> temporary object directory to prepare for this.

Oh, sweet, someone else wants this too.  I have a local patch
introducing the same function, which I needed for --remerge-diff.  I
hadn't submitted the patch yet, and if yours is accepted then I won't
need to.  :-)

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  tmp-objdir.c | 7 +++++++
>  tmp-objdir.h | 5 +++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index b8d880e362..6056917c63 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -288,6 +288,13 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
>         return t->env.v;
>  }
>
> +const char *tmp_objdir_path(const struct tmp_objdir *t)
> +{
> +       if (!t)
> +               return NULL;
> +       return t->path.buf;
> +}
> +
>  void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
>  {
>         add_to_alternates_memory(t->path.buf);
> diff --git a/tmp-objdir.h b/tmp-objdir.h
> index b1e45b4c75..da3ccb98bc 100644
> --- a/tmp-objdir.h
> +++ b/tmp-objdir.h
> @@ -51,4 +51,9 @@ int tmp_objdir_destroy(struct tmp_objdir *);
>   */
>  void tmp_objdir_add_as_alternate(const struct tmp_objdir *);
>
> +/*
> + * Return the path of the temporary object directory.
> + */
> +const char *tmp_objdir_path(const struct tmp_objdir *t);
> +
>  #endif /* TMP_OBJDIR_H */
> --
> 2.31.1

Looks good to me.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY
  2021-05-19 19:13 ` [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY Patrick Steinhardt
@ 2021-05-20  2:03   ` Chris Torek
  0 siblings, 0 replies; 22+ messages in thread
From: Chris Torek @ 2021-05-20  2:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Git List

On Wed, May 19, 2021 at 1:21 PM Patrick Steinhardt <ps@pks.im> wrote:
> When the TEST_OUTPUT_DIRECTORY is defined, then all test data will be
> written in that directory instead of the default directory located in
> "t/". While this works as expected for our normal tests, performance
> tests fail to locate and aggregate performance data because they don't
> know to handle TEST_OUTPUT_DIRECTORY correctly and always look at the
> default location.
>
> Fix the issue by adding a `--results-dir` parameter to "aggregate.perl"
> which identifies the directory where results are and by making the "run"
> script awake of the TEST_OUTPUT_DIRECTORY variable.

Trivial typo: s/awake/aware/

(The perl code looks OK to me but I'm no Perl programmer)

>+if test -n "$TEST_OUTPUT_DIRECTORY"
>+then
>+    TEST_RESULTS_DIR="$TEST_OUTPUT_DIRECTORY/test-results"
>+else
>+    TEST_RESULTS_DIR=test-results
>+fi

Optional shorter (one-liner) replacement:

TEST_RESULTS_DIR=${TEST_OUTPUT_DIRECTORY:+$TEST_OUTPUT_DIRECTORY/}test-results

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/8] p5400: add perf tests for git-receive-pack(1)
  2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
@ 2021-05-20  2:09   ` Chris Torek
  2021-05-20 17:04   ` Jeff King
  2021-05-21 15:03   ` SZEDER Gábor
  2 siblings, 0 replies; 22+ messages in thread
From: Chris Torek @ 2021-05-20  2:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Git List

On Wed, May 19, 2021 at 1:22 PM Patrick Steinhardt <ps@pks.im> wrote:
> We'll the connectivity check logic for git-receive-pack(1) in the

s/the conn/do the conn/

> diff --git a/t/perf/p5400-receive-pack.sh b/t/perf/p5400-receive-pack.sh
> new file mode 100755
> index 0000000000..2b0c89d977
> --- /dev/null
> +++ b/t/perf/p5400-receive-pack.sh
> @@ -0,0 +1,74 @@
> +#!/bin/sh

This code is, unfortunately, full of bash-isms:

> +       refs=("create updated:new")

Plain sh doesn't have arrays...

> +       done < <(printf "%s\n" "${refs[@]}")

This is another bash-ism.

> +done < <(printf "%s\n" "clone $TARGET_REPO_CLONE" "extrarefs $TARGET_REPO_REFS" "empty $TARGET_REPO_EMPTY")

I think these are mostly easily corrected but the `refs` probably should
just be dumped into a file, one line at a time, to be re-read from a file,
since `printf ... | while read ...` runs the whole loop inside a subshell.

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (7 preceding siblings ...)
  2021-05-19 19:13 ` [PATCH 8/8] receive-pack: check connectivity via quarantined objects Patrick Steinhardt
@ 2021-05-20  2:19 ` Chris Torek
  2021-05-20 16:50 ` Jeff King
  9 siblings, 0 replies; 22+ messages in thread
From: Chris Torek @ 2021-05-20  2:19 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Git List

On Wed, May 19, 2021 at 1:22 PM Patrick Steinhardt <ps@pks.im> wrote:
> Test                                     v2.32.0-rc0             HEAD
> --------------------------------------------------------------------------------------------
> 5400.3: receive-pack clone create        1.27(1.11+0.16)         0.02(0.01+0.01) -98.4%
> 5400.5: receive-pack clone update        1.27(1.13+0.13)         0.02(0.02+0.00) -98.4%
> 5400.7: receive-pack clone reset         0.13(0.11+0.02)         0.02(0.01+0.01) -84.6%
> 5400.9: receive-pack clone delete        0.02(0.01+0.01)         0.03(0.02+0.01) +50.0%
> 5400.11: receive-pack extrarefs create   33.01(18.80+14.43)      9.00(4.30+4.65) -72.7%
> 5400.13: receive-pack extrarefs update   33.13(18.85+14.50)      9.01(4.28+4.67) -72.8%
> 5400.15: receive-pack extrarefs reset    32.90(18.82+14.32)      9.04(4.26+4.77) -72.5%
> 5400.17: receive-pack extrarefs delete   9.13(4.35+4.77)         8.94(4.29+4.64) -2.1%
> 5400.19: receive-pack empty create       223.35(640.63+127.74)   227.55(651.75+130.94) +1.9%
>
> These rather clearly show that the previous rev-walk has been a major
> bottleneck in the implementation.

These are pretty impressive speedups! :-)

I didn't look too closely at the C code for the connectivity scan as I am
not very familiar with the background.  I did look at everything lightly
though, for whatever that's worth.

Chris

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
                   ` (8 preceding siblings ...)
  2021-05-20  2:19 ` [PATCH 0/8] Speed up connectivity checks via quarantine dir Chris Torek
@ 2021-05-20 16:50 ` Jeff King
  2021-05-20 21:45   ` Junio C Hamano
                     ` (2 more replies)
  9 siblings, 3 replies; 22+ messages in thread
From: Jeff King @ 2021-05-20 16:50 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Wed, May 19, 2021 at 09:13:18PM +0200, Patrick Steinhardt wrote:

> One of the issues I've found is the object connectivity check, which may
> run for a significant amount of time. The root cause here is that we're
> computing connectivity via `git rev-list --not --all`: if we've got many
> refs in the repository, computing `--not --all` is hugely expensive.
> 
> This commit series thus implements an alternative way of computing
> reachability, which reuses information from the object quarantine
> environment. Instead of doing a refwalk, we just look iterate over all
> packed and loose quarantined objects any for each of them, we determine
> whether their immediate references are all satisfied.

If I am reading the patches correctly, your definition of "satisfied"
is: the referenced object exists already on the receiving side.

But that's subtly different from the current rule, which is: the object
must be reachable from the current ref tips. The invariant that Git has
traditionally tried to maintain (for a repo not to be corrupt) is only
that we have the complete graph of objects reachable from the tips.

If we have an unreachable tree in the object database which references
blobs we don't have, that doesn't make the repository corrupt. And with
the current code, we would not accept a push that references that tree
(unless it also pushes the necessary blobs). But after your patch, we
would, and that would _make_ the repository corrupt.

I will say that:

  1. Modern versions of git-repack and git-prune try to keep even
     unreachable parts of the graph complete (if we are keeping object X
     that refers to Y, then we try to keep Y, too). But I don't know how
     foolproof it is (certainly the traversal we do there is "best
     effort"; if there's a missing reference that exists, we don't
     bail).

  2. This is not the only place that just checks object existence in the
     name of speed. When updating a ref, for example, we only check that
     the tip object exists.

So I suspect it might work OK in practice. But it is a pretty big
loosening of the current rules for pushes, and that makes me nervous.

There's another related change here that is actually a tightening of the
rules. The current code checks that the ref tips proposed by the sender
are valid.  If there are objects in the pack not needed for the ref
update, their connectivity isn't checked (though normal senders would
obviously avoid sending extra objects for no reason). Your "iterate over
all quarantined objects" makes that stricter.

I'm of two minds there:

  1. We could easily keep the original rule by just traversing the
     object graph starting from the ref tips, as we do now, but ending
     the traversal any time we hit an object that we already have
     outside the quarantine.

  2. This tightening is actually important if we want to avoid letting
     people _intentionally_ introduce the unreachable-but-incomplete
     scenario. Without it, an easy denial-of-service corruption against
     a repository you can push to is:

       - push an update to change a ref from X to Y. Include all objects
	 necessary for X..Y, but _also_ include a tree T which points to
	 a missing blob B. This will be accepted by the current rules
	 (but not by your patch).

       - push an update to change the ref from Y to C, where C is a
	 commit whose root tree is T. Your patch allows this (because we
	 already have T in the repository). But the resulting repository
	 is corrupt (the ref now points to an incomplete object graph).

If we wanted to keep the existing rule (requiring that any objects that
sender didn't provide are actually reachable from the current refs),
then we'd want to be able to check reachability quickly. And there I'd
probably turn to reachability bitmaps.

I suspect that "rev-list --use-bitmap-index" in the connectivity check
might help in some cases. Especially when you are looking at the union
of objects reachable from all refs, we can avoid a lot of fill-in
traversal (because if the bitmap for a recent ref includes the object in
an older ref, then we know the older ref is covered, even if it doesn't
have an on-disk bitmap at its tip). But I would not be at all surprised
if there are other slowdowns in the traversal code when you have a lot
of refs (e.g., I think we're pretty eager to parse all of the traversal
tips as part of the setup).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/8] p5400: add perf tests for git-receive-pack(1)
  2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
  2021-05-20  2:09   ` Chris Torek
@ 2021-05-20 17:04   ` Jeff King
  2021-05-21 15:03   ` SZEDER Gábor
  2 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2021-05-20 17:04 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Wed, May 19, 2021 at 09:13:27PM +0200, Patrick Steinhardt wrote:

> +while read name repo
> +do
> +	refs=("create updated:new")

This (and the other array manipulation below) is a bash-ism. Presumably
you've got TEST_SHELL_PATH pointed at bash. Without that, on a system
where /bin/sh is dash, the script chokes here.

For your purposes here, I think you can get by with just a single string
with newlines in it. Or even a file (see below).

> +	while read desc ref
> +	do
> +		test_expect_success "setup $name $desc" "
> +			test_must_fail git push --force '$repo' '$ref' \
> +				--receive-pack='tee pack | git receive-pack' 2>err &&
> +			grep 'failed in pre-receive hook' err
> +		"

This inverts the double- and single- quotes from our usual style. So if
$repo is "foo", you are creating a string that has:

  test_must_fail git push --force 'foo' ...

in it, and then the test harness will eval that string. That will fail
if $repo itself contains a single quote. Pretty unlikely, but I think it
contains the absolute path.

The usual style is:

  test_expect_success "setup $name $desc" '
	test_must_fail git push --force "$repo" "$ref" \
	...etc...
  '

where the variables are dereferenced inside the eval'd snippet. So no
quoting is necessary, no matter what's in the variables.

> +		test_perf "receive-pack $name $desc" "
> +			git receive-pack '$repo' <pack >negotiation &&
> +			grep 'pre-receive hook declined' negotiation
> +		"

Likewise here, but note that test_perf is tricky! It runs the snippet in
a sub-process. You have to export $repo to make it visible (you can use
test_export, but you don't need to; there's some discussing in
t/perf/README).

> +	done < <(printf "%s\n" "${refs[@]}")
> +done < <(printf "%s\n" "clone $TARGET_REPO_CLONE" "extrarefs $TARGET_REPO_REFS" "empty $TARGET_REPO_EMPTY")

These process substitutions are also bash-isms. I guess you're trying to
avoid putting the while on the right-hand side of a pipe like:

  printf "%s\n" ... |
  while read ...

which is good, because they set variables and those values don't
reliably make it out of the pipeline. If you stick the contents of $refs
into a file, then you can just do:

  while read ...
  do
     ...
  done <ref-descs

For the outer one, a here-doc is probably a bit simpler:

  while read ...
  do
     ...
  done <<-EOF
  clone $TARGET_REPO_CLONE
  extrarefs $TARGET_REPO_REFS
  empty $TARGET_REPO_EMPTY
  EOF

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-20 16:50 ` Jeff King
@ 2021-05-20 21:45   ` Junio C Hamano
  2021-05-21  9:30     ` Jeff King
  2021-05-21  9:42   ` Patrick Steinhardt
  2021-05-21 11:20   ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2021-05-20 21:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git

Jeff King <peff@peff.net> writes:

> If we have an unreachable tree in the object database which references
> blobs we don't have, that doesn't make the repository corrupt. And with
> the current code, we would not accept a push that references that tree
> (unless it also pushes the necessary blobs). But after your patch, we
> would, and that would _make_ the repository corrupt.
>
> I will say that:
>
>   1. Modern versions of git-repack and git-prune try to keep even
>      unreachable parts of the graph complete (if we are keeping object X
>      that refers to Y, then we try to keep Y, too). But I don't know how
>      foolproof it is (certainly the traversal we do there is "best
>      effort"; if there's a missing reference that exists, we don't
>      bail).
>
>   2. This is not the only place that just checks object existence in the
>      name of speed. When updating a ref, for example, we only check that
>      the tip object exists.

There might be already other ways to corrupt repositories, and a
corrupted repository to be left unnoticed, in other words.

But that does not make it OK to add more ways to corrupt
repositories.

>   1. We could easily keep the original rule by just traversing the
>      object graph starting from the ref tips, as we do now, but ending
>      the traversal any time we hit an object that we already have
>      outside the quarantine.
>
>   2. This tightening is actually important if we want to avoid letting
>      people _intentionally_ introduce the unreachable-but-incomplete
>      scenario. Without it, an easy denial-of-service corruption against
>      a repository you can push to is:
>
>        - push an update to change a ref from X to Y. Include all objects
> 	 necessary for X..Y, but _also_ include a tree T which points to
> 	 a missing blob B. This will be accepted by the current rules
> 	 (but not by your patch).
>
>        - push an update to change the ref from Y to C, where C is a
> 	 commit whose root tree is T. Your patch allows this (because we
> 	 already have T in the repository). But the resulting repository
> 	 is corrupt (the ref now points to an incomplete object graph).

Hmph, the last step of that attack would not work with our current
check; is this the same new hole the series brings in as you
explained earlier for a case where a newly pushed tree/commit starts
to reference a left-over dangling tree already in the repository
whose content blobs are missing?

> If we wanted to keep the existing rule (requiring that any objects that
> sender didn't provide are actually reachable from the current refs),
> then we'd want to be able to check reachability quickly. And there I'd
> probably turn to reachability bitmaps.

True.  As we are not "performance is king---a code that corrupts
repositories as quickly as possible is an improvement" kind of
project, we should keep the existing "an object can become part of
DAG referred by ref tips only when the objects it refers to all
exist in the object store, because we want to keep the invariant: an
object that is reachable from a ref is guaranteed to have everything
reachable from it in the object store" rule, and find a way to make
it fast to enforce that rule somehow.

Thank for a review.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-20 21:45   ` Junio C Hamano
@ 2021-05-21  9:30     ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2021-05-21  9:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git

On Fri, May 21, 2021 at 06:45:02AM +0900, Junio C Hamano wrote:

> >   2. This tightening is actually important if we want to avoid letting
> >      people _intentionally_ introduce the unreachable-but-incomplete
> >      scenario. Without it, an easy denial-of-service corruption against
> >      a repository you can push to is:
> >
> >        - push an update to change a ref from X to Y. Include all objects
> > 	 necessary for X..Y, but _also_ include a tree T which points to
> > 	 a missing blob B. This will be accepted by the current rules
> > 	 (but not by your patch).
> >
> >        - push an update to change the ref from Y to C, where C is a
> > 	 commit whose root tree is T. Your patch allows this (because we
> > 	 already have T in the repository). But the resulting repository
> > 	 is corrupt (the ref now points to an incomplete object graph).
> 
> Hmph, the last step of that attack would not work with our current
> check; is this the same new hole the series brings in as you
> explained earlier for a case where a newly pushed tree/commit starts
> to reference a left-over dangling tree already in the repository
> whose content blobs are missing?

Right. The current state is immune to this attack; the rule is "refs
must be complete, but nothing else has to be". The state after Patrick's
series is (I think) likewise immune; the rule is "all objects must be
complete".

I'm not sure if that was intentional, or a lucky save because the series
tries to optimize both parts (both which objects we decide to check, as
well as which we consider we "already have"). But it's quite subtle. :)

> > If we wanted to keep the existing rule (requiring that any objects that
> > sender didn't provide are actually reachable from the current refs),
> > then we'd want to be able to check reachability quickly. And there I'd
> > probably turn to reachability bitmaps.
> 
> True.  As we are not "performance is king---a code that corrupts
> repositories as quickly as possible is an improvement" kind of
> project, we should keep the existing "an object can become part of
> DAG referred by ref tips only when the objects it refers to all
> exist in the object store, because we want to keep the invariant: an
> object that is reachable from a ref is guaranteed to have everything
> reachable from it in the object store" rule, and find a way to make
> it fast to enforce that rule somehow.

That's my feeling, too. A rule of "you are not allowed to introduce
objects which directly reference a missing object" would be likewise
correct, if consistently enforced. But it makes me nervous to switch
from one to the other, especially in just one place.

And I guess in that sense, this series isn't immune as I said earlier.
It looks like a push from a shallow clone would use the old "reachable
from refs" check even after this series.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-20 16:50 ` Jeff King
  2021-05-20 21:45   ` Junio C Hamano
@ 2021-05-21  9:42   ` Patrick Steinhardt
  2021-05-21 11:20   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 22+ messages in thread
From: Patrick Steinhardt @ 2021-05-21  9:42 UTC (permalink / raw)
  To: Jeff King; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 7976 bytes --]

On Thu, May 20, 2021 at 12:50:24PM -0400, Jeff King wrote:
> On Wed, May 19, 2021 at 09:13:18PM +0200, Patrick Steinhardt wrote:
> 
> > One of the issues I've found is the object connectivity check, which may
> > run for a significant amount of time. The root cause here is that we're
> > computing connectivity via `git rev-list --not --all`: if we've got many
> > refs in the repository, computing `--not --all` is hugely expensive.
> > 
> > This commit series thus implements an alternative way of computing
> > reachability, which reuses information from the object quarantine
> > environment. Instead of doing a refwalk, we just look iterate over all
> > packed and loose quarantined objects any for each of them, we determine
> > whether their immediate references are all satisfied.
> 
> If I am reading the patches correctly, your definition of "satisfied"
> is: the referenced object exists already on the receiving side.

Indeed. Each referenced object must either be in the quarantine
directory or an already existing one in the main repo. If it's
quarantined, then we'll eventually check it, too, so we don't need to
traverse any further than immediate parents.

> But that's subtly different from the current rule, which is: the object
> must be reachable from the current ref tips. The invariant that Git has
> traditionally tried to maintain (for a repo not to be corrupt) is only
> that we have the complete graph of objects reachable from the tips.
> 
> If we have an unreachable tree in the object database which references
> blobs we don't have, that doesn't make the repository corrupt. And with
> the current code, we would not accept a push that references that tree
> (unless it also pushes the necessary blobs). But after your patch, we
> would, and that would _make_ the repository corrupt.

Right. The assumption basically is that if it's part of the repository's
ODB already, then it was checked at some point before and should be
fully connected. Also, we typically wouldn't delete any objects which
are referenced by anything else, so either we delete the unreachable
tree and its referenced blob together, or none of them.

> I will say that:
> 
>   1. Modern versions of git-repack and git-prune try to keep even
>      unreachable parts of the graph complete (if we are keeping object X
>      that refers to Y, then we try to keep Y, too). But I don't know how
>      foolproof it is (certainly the traversal we do there is "best
>      effort"; if there's a missing reference that exists, we don't
>      bail).
> 
>   2. This is not the only place that just checks object existence in the
>      name of speed. When updating a ref, for example, we only check that
>      the tip object exists.
> 
> So I suspect it might work OK in practice. But it is a pretty big
> loosening of the current rules for pushes, and that makes me nervous.
>
> There's another related change here that is actually a tightening of the
> rules. The current code checks that the ref tips proposed by the sender
> are valid.  If there are objects in the pack not needed for the ref
> update, their connectivity isn't checked (though normal senders would
> obviously avoid sending extra objects for no reason). Your "iterate over
> all quarantined objects" makes that stricter.
> 
> I'm of two minds there:
> 
>   1. We could easily keep the original rule by just traversing the
>      object graph starting from the ref tips, as we do now, but ending
>      the traversal any time we hit an object that we already have
>      outside the quarantine.

This should be doable, although it's a bit more complex compared to my
current proposal. As far as I know we don't have any APIs to say "Please
look up objects in this object directory, only", but we'd need to do
that or otherwise we're not able to tell where an object was looked up
from.

Alternatively, we can implement this via a combination of both
approaches: first, we enumerate all quarantined objects and mark them
so. Second, we do a refwalk on all tips and stop whenever we find a
non-quarantined object.

My guess is that this is still going to be a lot faster compared to the
current implementation, at least when updating existing branches. But in
the case where we're pushing into an empty repo (or when we're pushing
entirely unrelated history), we now essentially have to iterate over all
objects twice. I'm sure this is going to be noticable performance wise.

>   2. This tightening is actually important if we want to avoid letting
>      people _intentionally_ introduce the unreachable-but-incomplete
>      scenario. Without it, an easy denial-of-service corruption against
>      a repository you can push to is:
> 
>        - push an update to change a ref from X to Y. Include all objects
> 	 necessary for X..Y, but _also_ include a tree T which points to
> 	 a missing blob B. This will be accepted by the current rules
> 	 (but not by your patch).
> 
>        - push an update to change the ref from Y to C, where C is a
> 	 commit whose root tree is T. Your patch allows this (because we
> 	 already have T in the repository). But the resulting repository
> 	 is corrupt (the ref now points to an incomplete object graph).

So with my patch we catch it right when it's being introduced, while
right now we only detect it as soon as somebody actually wants to start
using it. Depending on one's viewpoint, one could probably argue that
the repository is already corrupt as soon as we accept the tree into our
repo because we're missing objects. But except for git-fsck(1), one
wouldn't typically notice.

This does surface a second issue: right now, if receive.fsckObjects is
set to `false` (which is the default), we'd also happily accept objects
into the repo which aren't even parseable in case they're not
referenced. This is another form of corruption we allow with the current
implementation, but which gets detected by the new check.

> If we wanted to keep the existing rule (requiring that any objects that
> sender didn't provide are actually reachable from the current refs),
> then we'd want to be able to check reachability quickly. And there I'd
> probably turn to reachability bitmaps.

Honestly, at GitLab we've only had problems with reachability bitmaps to
determine newly pushed objects. We tried to use them to detect all newly
pushed blobs which are below a certain size (to detect LFS pointers, we
had this discussion before). But computing the set of new objects with
bitmaps enabled was orders of magnitudes slower for some repos than with
bitmaps disabled and has caused quite some pain already.

One of these edge cases I've fixed with negative tags causing a complete
graph walk (540cdc11ad (pack-bitmap: avoid traversal of objects
referenced by uninteresting tag, 2021-03-22)), but seemingly there are
more. I haven't yet found the time to dig into this though.

> I suspect that "rev-list --use-bitmap-index" in the connectivity check
> might help in some cases. Especially when you are looking at the union
> of objects reachable from all refs, we can avoid a lot of fill-in
> traversal (because if the bitmap for a recent ref includes the object in
> an older ref, then we know the older ref is covered, even if it doesn't
> have an on-disk bitmap at its tip). But I would not be at all surprised
> if there are other slowdowns in the traversal code when you have a lot
> of refs (e.g., I think we're pretty eager to parse all of the traversal
> tips as part of the setup).

Does this help if you've got hundreds of thousands of refs though? As I
said, one of the benchmarks I'm using is a repo with 880k+ references.
And lots of objects referenced by them will never be covered by bitmaps
in the first place because they're not part of the usual user-visible
references but part of an alternate.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir
  2021-05-20 16:50 ` Jeff King
  2021-05-20 21:45   ` Junio C Hamano
  2021-05-21  9:42   ` Patrick Steinhardt
@ 2021-05-21 11:20   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 22+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-05-21 11:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git


On Thu, May 20 2021, Jeff King wrote:

> On Wed, May 19, 2021 at 09:13:18PM +0200, Patrick Steinhardt wrote:
>
>> One of the issues I've found is the object connectivity check, which may
>> run for a significant amount of time. The root cause here is that we're
>> computing connectivity via `git rev-list --not --all`: if we've got many
>> refs in the repository, computing `--not --all` is hugely expensive.
>> 
>> This commit series thus implements an alternative way of computing
>> reachability, which reuses information from the object quarantine
>> environment. Instead of doing a refwalk, we just look iterate over all
>> packed and loose quarantined objects any for each of them, we determine
>> whether their immediate references are all satisfied.
>
> If I am reading the patches correctly, your definition of "satisfied"
> is: the referenced object exists already on the receiving side.
>
> But that's subtly different from the current rule, which is: the object
> must be reachable from the current ref tips. The invariant that Git has
> traditionally tried to maintain (for a repo not to be corrupt) is only
> that we have the complete graph of objects reachable from the tips.
>
> If we have an unreachable tree in the object database which references
> blobs we don't have, that doesn't make the repository corrupt. And with
> the current code, we would not accept a push that references that tree
> (unless it also pushes the necessary blobs). But after your patch, we
> would, and that would _make_ the repository corrupt.
>
> I will say that:
>
>   1. Modern versions of git-repack and git-prune try to keep even
>      unreachable parts of the graph complete (if we are keeping object X
>      that refers to Y, then we try to keep Y, too). But I don't know how
>      foolproof it is (certainly the traversal we do there is "best
>      effort"; if there's a missing reference that exists, we don't
>      bail).
>
>   2. This is not the only place that just checks object existence in the
>      name of speed. When updating a ref, for example, we only check that
>      the tip object exists.

Hopefull you mean "when we update a ref locally", i.e. update-ref, not
receive-pack.

I think that's fine, and we should consider these corruption detections
to have two different classes, there's the local update-ref, mktag
etc. which typically only do skin-deep checking, and the full check we
want to do in receive-pack and other transport.fsckObjects.

It's fine in practice for the "local" case to be fast and loose, but
when you're accepting foreign objects over the network we should always
be as paranoid as possible, both to prevent accidental corruption and
deliberate attack.

None of that goes against what you're saying, just a bit of an
elaboration.

> [...]
> There's another related change here that is actually a tightening of the
> rules. The current code checks that the ref tips proposed by the sender
> are valid.  If there are objects in the pack not needed for the ref
> update, their connectivity isn't checked (though normal senders would
> obviously avoid sending extra objects for no reason). Your "iterate over
> all quarantined objects" makes that stricter.
>
> I'm of two minds there:
>
>   1. We could easily keep the original rule by just traversing the
>      object graph starting from the ref tips, as we do now, but ending
>      the traversal any time we hit an object that we already have
>      outside the quarantine.
>
>   2. This tightening is actually important if we want to avoid letting
>      people _intentionally_ introduce the unreachable-but-incomplete
>      scenario. Without it, an easy denial-of-service corruption against
>      a repository you can push to is:
>
>        - push an update to change a ref from X to Y. Include all objects
> 	 necessary for X..Y, but _also_ include a tree T which points to
> 	 a missing blob B. This will be accepted by the current rules
> 	 (but not by your patch).
>
>        - push an update to change the ref from Y to C, where C is a
> 	 commit whose root tree is T. Your patch allows this (because we
> 	 already have T in the repository). But the resulting repository
> 	 is corrupt (the ref now points to an incomplete object graph).

We should also consider not closing the door to some future
optimizations and features by being overly strict with #2 here. Maybe
I've misunderstood things, but I think tightening it would prevent
things like:

 A. I'm pushing a ref update for X..Y, the server is at X, but I happen
    to have a pack (e.g. from an earlier pull) that contains objects
    from W..Y. The server doesn't need W..X, but I just sent the whole
    W..Y set over saying "please update to Y".

 B. I got halfway with patches to make clients aid servers with
    server-side corruption of objects (the root cause was some NFS
    shenanigans + our non-fsync()-ing). A server would have an empty
    loose object, to recover I needed to manually scp it from a
    client->server. This happened a few times at odd hours.

    With the not-accepted core.checkCollisions patch I hacked up for
    related reasons[1] I found that we were actually quite close to
    learning a mode on the server-side where we'd just blindly accept
    such objects (the client would also need to learn to do a hail-mary
    push).

    Strictly speaking we could support such a recovery mode while still
    having the #2 under discussion here (only accepting such objects if
    our own repo is corrupt), but I thought it was rather neat that it
    would naturally fall out of the general rule that we didn't care
    about "redundant" objects + my tweaks to make the "there's a
    collision" check less anal (in that case it was a false alarm, our
    local object was corrupt, but not the one the remote end tried to
    send).

1. https://lore.kernel.org/git/20181113201910.11518-1-avarab@gmail.com/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/8] p5400: add perf tests for git-receive-pack(1)
  2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
  2021-05-20  2:09   ` Chris Torek
  2021-05-20 17:04   ` Jeff King
@ 2021-05-21 15:03   ` SZEDER Gábor
  2 siblings, 0 replies; 22+ messages in thread
From: SZEDER Gábor @ 2021-05-21 15:03 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Wed, May 19, 2021 at 09:13:27PM +0200, Patrick Steinhardt wrote:
> We'll the connectivity check logic for git-receive-pack(1) in the
> following commits to make it perform better. As a preparatory step, add
> some benchmarks such that we can measure these changes.
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  t/perf/p5400-receive-pack.sh | 74 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 74 insertions(+)
>  create mode 100755 t/perf/p5400-receive-pack.sh
> 
> diff --git a/t/perf/p5400-receive-pack.sh b/t/perf/p5400-receive-pack.sh
> new file mode 100755
> index 0000000000..2b0c89d977
> --- /dev/null
> +++ b/t/perf/p5400-receive-pack.sh
> @@ -0,0 +1,74 @@
> +#!/bin/sh
> +
> +test_description="Tests performance of receive-pack"
> +
> +. ./perf-lib.sh
> +
> +test_perf_large_repo
> +
> +test_expect_success 'setup' '
> +	# Create a main branch such that we do not have to rely on any specific
> +	# branch to exist in the perf repository.
> +	git switch --force-create main &&
> +
> +	TARGET_REPO_CLONE=$(pwd)/target-clone.git &&
> +	git clone --bare --dissociate --branch main "$(pwd)" "$TARGET_REPO_CLONE" &&
> +	TARGET_REPO_REFS=$(pwd)/target-refs.git &&
> +	git clone --bare --dissociate --branch main "$(pwd)" "$TARGET_REPO_REFS" &&
> +	TARGET_REPO_EMPTY=$(pwd)/target-empty.git &&
> +	git init --bare "$TARGET_REPO_EMPTY" &&
> +
> +	# Set up a pre-receive hook such that no refs will ever be changed.
> +	# This easily allows multiple perf runs, but still exercises
> +	# server-side reference negotiation and checking for consistency.
> +	mkdir hooks &&
> +	write_script hooks/pre-receive <<-EOF &&
> +		#!/bin/sh

Nit: the 'write_script' helper writes a shebang line, so this is
unnecessary.

> +		echo "failed in pre-receive hook"
> +		exit 1
> +	EOF

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands
  2021-05-19 19:13 ` [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
@ 2021-05-21 18:53   ` Felipe Contreras
  2021-05-27 14:38     ` Jeff King
  0 siblings, 1 reply; 22+ messages in thread
From: Felipe Contreras @ 2021-05-21 18:53 UTC (permalink / raw)
  To: Patrick Steinhardt, git

Patrick Steinhardt wrote:
> In the case where git-receive-pack(1) receives only commands which
> delete references, then per technical specification the client MUST NOT
> send a packfile. As a result, we know that no new objects have been
> received, which makes it a moot point to check whether all received
> objects are fully connected.

I don't know if this is related but yesterday I decided to delete a
bunch of refs from a forked repo in GitHub. I did it naively with a for
loop and so it was doing a bunch of `git push myrepo :ref`.

It was unbearably slow.

Sure, it was a stupid thing to do, but maybe it can help you do some
tests.

Cheers.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands
  2021-05-21 18:53   ` Felipe Contreras
@ 2021-05-27 14:38     ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2021-05-27 14:38 UTC (permalink / raw)
  To: Felipe Contreras; +Cc: Patrick Steinhardt, git

On Fri, May 21, 2021 at 01:53:49PM -0500, Felipe Contreras wrote:

> Patrick Steinhardt wrote:
> > In the case where git-receive-pack(1) receives only commands which
> > delete references, then per technical specification the client MUST NOT
> > send a packfile. As a result, we know that no new objects have been
> > received, which makes it a moot point to check whether all received
> > objects are fully connected.
> 
> I don't know if this is related but yesterday I decided to delete a
> bunch of refs from a forked repo in GitHub. I did it naively with a for
> loop and so it was doing a bunch of `git push myrepo :ref`.
> 
> It was unbearably slow.
> 
> Sure, it was a stupid thing to do, but maybe it can help you do some
> tests.

Patrick's patch might help some, as it would avoid calling rev-list at
all. But we wouldn't do any traversal in that command if there are no
positive tips anyway, so it is really just saving the startup overhead
of iterating the ref tips to add them to the traversal.

In the case of GitHub, the problem is much more likely outside of Git's
immediate control. Every push will run GitHub-specific hooks for things
like branch protections, etc, and there's a lot of overhead there.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-05-27 14:38 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 19:13 [PATCH 0/8] Speed up connectivity checks via quarantine dir Patrick Steinhardt
2021-05-19 19:13 ` [PATCH 1/8] perf: fix when running with TEST_OUTPUT_DIRECTORY Patrick Steinhardt
2021-05-20  2:03   ` Chris Torek
2021-05-19 19:13 ` [PATCH 2/8] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-05-20  2:09   ` Chris Torek
2021-05-20 17:04   ` Jeff King
2021-05-21 15:03   ` SZEDER Gábor
2021-05-19 19:13 ` [PATCH 3/8] tmp-objdir: expose function to retrieve path Patrick Steinhardt
2021-05-20  0:16   ` Elijah Newren
2021-05-19 19:13 ` [PATCH 4/8] packfile: have `for_each_file_in_pack_dir()` return error codes Patrick Steinhardt
2021-05-19 19:13 ` [PATCH 5/8] object-file: allow reading loose objects without reading their contents Patrick Steinhardt
2021-05-19 19:13 ` [PATCH 6/8] connected: implement connectivity check via temporary object dirs Patrick Steinhardt
2021-05-19 19:13 ` [PATCH 7/8] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-05-21 18:53   ` Felipe Contreras
2021-05-27 14:38     ` Jeff King
2021-05-19 19:13 ` [PATCH 8/8] receive-pack: check connectivity via quarantined objects Patrick Steinhardt
2021-05-20  2:19 ` [PATCH 0/8] Speed up connectivity checks via quarantine dir Chris Torek
2021-05-20 16:50 ` Jeff King
2021-05-20 21:45   ` Junio C Hamano
2021-05-21  9:30     ` Jeff King
2021-05-21  9:42   ` Patrick Steinhardt
2021-05-21 11:20   ` Ævar Arnfjörð Bjarmason

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).