* [PATCH 0/6] Speed up mirror-fetches with many refs
@ 2021-08-20 10:08 Patrick Steinhardt
  2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
                   ` (9 more replies)
  0 siblings, 10 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

Hi,

I've taken another look at fetches in the context of repos with a huge
number of refs. This time around, the focus is on mirror fetches: in our
notorious repo with 2.3M refs, these can take up to several minutes even
if there are no changes at all.

As it turns out, many of the issues are again caused by loading and
dereferencing refs. This patch series thus mostly focuses on optimizing
those patterns, where the biggest win is to opportunistically resolve
refs via the commit-graph. The following numbers were all measured for a
mirror-fetch of the above repository with 2.3M refs, served from local
disk:

    - Patch 1 uses the commit-graph to speed up looking up commits when
      appending to FETCH_HEAD, resulting in a ~40% speedup.

    - Patch 2 optimizes the way we check for object existence for a 7%
      speedup.

    - Patch 3 is a cleanup patch which changes the iterator functions
      passed to our connectivity checks. I was hoping for a speedup
      given that we can now avoid copying object IDs (which could add up
      with 2.3M of them), but unfortunately there was none. In any case,
      I still think that the end result is much cleaner.

    - Patch 4 optimizes git-fetch-pack(1) to use the commit-graph. This
      is a small win of about ~2%. It's debatable whether this patch is
      worth it.

    - Patch 5 is a preparatory commit which refactors `fetch_refs()` to
      be more readily extendable.

    - Patch 6 avoids a redundant second connectivity check in the case
      where the first check has already noticed that all objects are
      locally available and the fetch can be skipped. This brings a 15%
      speedup.

In combination with my previous optimizations for git-fetch-pack(1) and
the connectivity check [1], this improves performance from 71s
(ps/fetch-pack-load-refs-optim) to 54s (ps/connectivity-optim) to 26s
(this series).

Note that this series depends on ps/connectivity-optim and thus only
applies on top of next.

Patrick

[1]: <08519b8ab6f395cffbcd5e530bfba6aaf64241a2.1628085347.git.ps@pks.im>


Patrick Steinhardt (6):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++--
 builtin/fetch.c        | 84 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++-----
 connected.c            | 15 ++++----
 connected.h            |  2 +-
 fetch-pack.c           | 14 ++++---
 6 files changed, 74 insertions(+), 66 deletions(-)

-- 
2.33.0



* [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-20 14:27   ` Derrick Stolee
  2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

When updating our local refs based on the refs fetched from the remote,
we need to iterate through all requested refs and load their respective
commits such that we can determine whether they need to be appended to
FETCH_HEAD or not. In cases where we're fetching from a remote with
exceedingly many refs, resolving these refs can be quite expensive given
that we repeatedly need to unpack object headers for each of the
referenced objects.

Speed this up by opportunistically trying to resolve object IDs via the
commit graph: more likely than not, they're going to be a commit anyway,
and this lets us avoid having to unpack object headers completely in
case the object is a commit that is part of the commit-graph. This
significantly speeds up mirror-fetches in a real-world repository with
2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
      Range (min … max):   56.372 s … 57.533 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
      Range (min … max):   33.454 s … 33.844 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.69 ± 0.02 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 405afe9bdf..73f5b286d5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1131,11 +1131,14 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 				continue;
 			}
 
-			commit = lookup_commit_reference_gently(the_repository,
-								&rm->old_oid,
-								1);
-			if (!commit)
-				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
+			if (!commit) {
+				commit = lookup_commit_reference_gently(the_repository,
+									&rm->old_oid,
+									1);
+				if (!commit)
+					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			}
 
 			if (rm->fetch_head_status != want_status)
 				continue;
-- 
2.33.0



* [PATCH 2/6] fetch: avoid unpacking headers in object existence check
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
  2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-25 23:44   ` Ævar Arnfjörð Bjarmason
  2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

When updating local refs after the fetch has transferred all objects, we
do an object existence test as a safety guard to avoid updating a ref to
an object which we don't have. We do so via `oid_object_info()`: if it
returns an error, then we know the object does not exist.

One side effect of `oid_object_info()` is that it parses the object's
type, and to do so it must unpack the object header. This is completely
pointless: we don't care for the type, but only want to assert that the
object exists.

Refactor the code to use `repo_has_object_file()`, which both makes the
code's intent clearer and is also faster because it does not unpack
object headers. In a real-world repo with 2.3M refs, this results in a
small speedup when doing a mirror-fetch:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     33.686 s ±  0.176 s    [User: 30.119 s, System: 5.262 s]
      Range (min … max):   33.512 s … 33.944 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.247 s ±  0.195 s    [User: 28.135 s, System: 5.066 s]
      Range (min … max):   30.948 s … 31.472 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.08 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 73f5b286d5..5fd0f7c791 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -846,13 +846,11 @@ static int update_local_ref(struct ref *ref,
 			    int summary_width)
 {
 	struct commit *current = NULL, *updated;
-	enum object_type type;
 	struct branch *current_branch = branch_get(NULL);
 	const char *pretty_ref = prettify_refname(ref->name);
 	int fast_forward = 0;
 
-	type = oid_object_info(the_repository, &ref->new_oid, NULL);
-	if (type < 0)
+	if (!repo_has_object_file(the_repository, &ref->new_oid))
 		die(_("object %s not found"), oid_to_hex(&ref->new_oid));
 
 	if (oideq(&ref->old_oid, &ref->new_oid)) {
-- 
2.33.0



* [PATCH 3/6] connected: refactor iterator to return next object ID directly
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
  2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
  2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-20 14:32   ` Derrick Stolee
  2021-08-20 17:43   ` René Scharfe
  2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just retrun the next object ID directly and
use `NULL` pointers as indicator that the iterator got no items left.
Furthermore, this avoids a copy of the object ID.

Refactor the iterator and all its implementations to return object IDs
directly. While I was honestly hoping for a small speedup given that we
can now avoid a copy, both versions perform the same. Still, the end
result is easier to understand and thus it makes sense to keep this
refactoring regardless.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c        |  8 +++-----
 builtin/fetch.c        |  7 +++----
 builtin/receive-pack.c | 17 +++++++----------
 connected.c            | 15 ++++++++-------
 connected.h            |  2 +-
 fetch-pack.c           |  7 +++----
 6 files changed, 25 insertions(+), 31 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index a74558f30c..817a651936 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -544,7 +544,7 @@ static void write_followtags(const struct ref *refs, const char *msg)
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -555,13 +555,11 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	 */
 	while (ref && !ref->peer_ref)
 		ref = ref->next;
-	/* Returning -1 notes "end of list" to the caller. */
 	if (!ref)
-		return -1;
+		return NULL;
 
-	oidcpy(oid, &ref->old_oid);
 	*rm = ref->next;
-	return 0;
+	return &ref->old_oid;
 }
 
 static void update_remote_refs(const struct ref *refs,
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 5fd0f7c791..76ce73ccf5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -962,7 +962,7 @@ static int update_local_ref(struct ref *ref,
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -970,10 +970,9 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	while (ref && ref->status == REF_STATUS_REJECT_SHALLOW)
 		ref = ref->next;
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct fetch_head {
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index a419de5b38..0abda033bc 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1307,7 +1307,7 @@ static void refuse_unconfigured_deny_delete_current(void)
 	rp_error("%s", _(refuse_unconfigured_deny_delete_current_msg));
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid);
+static struct object_id *command_singleton_iterator(void *cb_data);
 static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
 {
 	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
@@ -1725,16 +1725,15 @@ static void check_aliased_updates(struct command *commands)
 	string_list_clear(&ref_list, 0);
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid)
+static struct object_id *command_singleton_iterator(void *cb_data)
 {
 	struct command **cmd_list = cb_data;
 	struct command *cmd = *cmd_list;
 
 	if (!cmd || is_null_oid(&cmd->new_oid))
-		return -1; /* end of list */
+		return NULL;
 	*cmd_list = NULL; /* this returns only one */
-	oidcpy(oid, &cmd->new_oid);
-	return 0;
+	return &cmd->new_oid;
 }
 
 static void set_connectivity_errors(struct command *commands,
@@ -1764,7 +1763,7 @@ struct iterate_data {
 	struct shallow_info *si;
 };
 
-static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
+static struct object_id *iterate_receive_command_list(void *cb_data)
 {
 	struct iterate_data *data = cb_data;
 	struct command **cmd_list = &data->cmds;
@@ -1775,13 +1774,11 @@ static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
 			/* to be checked in update_shallow_ref() */
 			continue;
 		if (!is_null_oid(&cmd->new_oid) && !cmd->skip_update) {
-			oidcpy(oid, &cmd->new_oid);
 			*cmd_list = cmd->next;
-			return 0;
+			return &cmd->new_oid;
 		}
 	}
-	*cmd_list = NULL;
-	return -1; /* end of list */
+	return NULL;
 }
 
 static void reject_updates_to_hidden(struct command *commands)
diff --git a/connected.c b/connected.c
index b5f9523a5f..374b145355 100644
--- a/connected.c
+++ b/connected.c
@@ -24,7 +24,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 	struct child_process rev_list = CHILD_PROCESS_INIT;
 	FILE *rev_list_in;
 	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
-	struct object_id oid;
+	struct object_id *oid;
 	int err = 0;
 	struct packed_git *new_pack = NULL;
 	struct transport *transport;
@@ -34,7 +34,8 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		opt = &defaults;
 	transport = opt->transport;
 
-	if (fn(cb_data, &oid)) {
+	oid = fn(cb_data);
+	if (!oid) {
 		if (opt->err_fd)
 			close(opt->err_fd);
 		return err;
@@ -73,7 +74,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			for (p = get_all_packs(the_repository); p; p = p->next) {
 				if (!p->pack_promisor)
 					continue;
-				if (find_pack_entry_one(oid.hash, p))
+				if (find_pack_entry_one(oid->hash, p))
 					goto promisor_pack_found;
 			}
 			/*
@@ -83,7 +84,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			goto no_promisor_pack_found;
 promisor_pack_found:
 			;
-		} while (!fn(cb_data, &oid));
+		} while ((oid = fn(cb_data)) != NULL);
 		return 0;
 	}
 
@@ -133,12 +134,12 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		 * are sure the ref is good and not sending it to
 		 * rev-list for verification.
 		 */
-		if (new_pack && find_pack_entry_one(oid.hash, new_pack))
+		if (new_pack && find_pack_entry_one(oid->hash, new_pack))
 			continue;
 
-		if (fprintf(rev_list_in, "%s\n", oid_to_hex(&oid)) < 0)
+		if (fprintf(rev_list_in, "%s\n", oid_to_hex(oid)) < 0)
 			break;
-	} while (!fn(cb_data, &oid));
+	} while ((oid = fn(cb_data)) != NULL);
 
 	if (ferror(rev_list_in) || fflush(rev_list_in)) {
 		if (errno != EPIPE && errno != EINVAL)
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..56cc95be2d 100644
--- a/connected.h
+++ b/connected.h
@@ -9,7 +9,7 @@ struct transport;
  * When called after returning the name for the last object, return -1
  * to signal EOF, otherwise return 0.
  */
-typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
+typedef struct object_id *(*oid_iterate_fn)(void *);
 
 /*
  * Named-arguments struct for check_connected. All arguments are
diff --git a/fetch-pack.c b/fetch-pack.c
index 0bf7ed7e47..1a6242cd71 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1912,16 +1912,15 @@ static void update_shallow(struct fetch_pack_args *args,
 	oid_array_clear(&ref);
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
 
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct ref *fetch_pack(struct fetch_pack_args *args,
-- 
2.33.0



* [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-20 14:37   ` Derrick Stolee
  2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

In order to negotiate a packfile, we need to dereference refs to see
which commits we have in common with the remote. To do so, we first look
up the object's type -- if it's a tag, we peel until we hit a non-tag
object. If we hit a commit eventually, then we return that commit.

In case the object ID points to a commit directly, we can avoid the
initial lookup of the object type by opportunistically looking up the
commit via the commit-graph, if available, which gives us a slight speed
bump of about 2% in a huge repository with about 2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.634 s ±  0.258 s    [User: 28.400 s, System: 5.090 s]
      Range (min … max):   31.280 s … 31.896 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.129 s ±  0.543 s    [User: 27.976 s, System: 5.056 s]
      Range (min … max):   30.172 s … 31.479 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.02 ± 0.02 times faster than 'HEAD~: git-fetch'

In case this fails, we fall back to the old code which peels the
objects to a commit.
---
 fetch-pack.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 1a6242cd71..c57faf278f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -119,6 +119,11 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 {
 	enum object_type type;
 	struct object_info info = { .typep = &type };
+	struct commit *commit;
+
+	commit = lookup_commit_in_graph(the_repository, oid);
+	if (commit)
+		return commit;
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
@@ -139,7 +144,7 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 	}
 
 	if (type == OBJ_COMMIT) {
-		struct commit *commit = lookup_commit(the_repository, oid);
+		commit = lookup_commit(the_repository, oid);
 		if (!commit || repo_parse_commit(the_repository, commit))
 			return NULL;
 		return commit;
-- 
2.33.0



* [PATCH 5/6] fetch: refactor fetch refs to be more extendable
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-20 14:41   ` Derrick Stolee
  2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

Refactor `fetch_refs()` code to make it more extendable by explicitly
handling error cases. The refactored code should behave the same.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 76ce73ccf5..20fcfe0f45 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1284,20 +1284,29 @@ static int check_exist_and_connected(struct ref *ref_map)
 
 static int fetch_refs(struct transport *transport, struct ref *ref_map)
 {
-	int ret = check_exist_and_connected(ref_map);
-	if (ret) {
-		trace2_region_enter("fetch", "fetch_refs", the_repository);
-		ret = transport_fetch_refs(transport, ref_map);
-		trace2_region_leave("fetch", "fetch_refs", the_repository);
-	}
+	int ret;
+
+	/*
+	 * We don't need to perform a fetch in case we can already satisfy all
+	 * refs.
+	 */
+	ret = check_exist_and_connected(ref_map);
 	if (!ret)
-		/*
-		 * Keep the new pack's ".keep" file around to allow the caller
-		 * time to update refs to reference the new objects.
-		 */
 		return 0;
-	transport_unlock_pack(transport);
-	return ret;
+
+	trace2_region_enter("fetch", "fetch_refs", the_repository);
+	ret = transport_fetch_refs(transport, ref_map);
+	trace2_region_leave("fetch", "fetch_refs", the_repository);
+	if (ret) {
+		transport_unlock_pack(transport);
+		return ret;
+	}
+
+	/*
+	 * Keep the new pack's ".keep" file around to allow the caller
+	 * time to update refs to reference the new objects.
+	 */
+	return 0;
 }
 
 /* Update local refs based on the ref values fetched from a remote */
-- 
2.33.0



* [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
@ 2021-08-20 10:08 ` Patrick Steinhardt
  2021-08-20 14:47   ` Derrick Stolee
  2021-08-20 14:50 ` [PATCH 0/6] Speed up mirror-fetches with many refs Derrick Stolee
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-20 10:08 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

When fetching refs, we are doing two connectivity checks:

    - The first one in `fetch_refs()` is done such that we can
      short-circuit the case where we already have all objects
      referenced by the updated set of refs.

    - The second one in `store_updated_refs()` does a sanity check that
      we have all objects after we have fetched the packfile.

We always execute both connectivity checks, but this is wasteful in case
the first connectivity check already notices that we have all objects
locally available.

Refactor the code to do both connectivity checks in `fetch_refs()`,
which allows us to easily skip the second connectivity check if we
already have all objects available. This refactoring is safe to do given
that we always call `fetch_refs()` followed by `consume_refs()`, which
is the only caller of `store_updated_refs()`.

This gives us a nice speedup when doing a mirror-fetch in a repository
with about 2.3M refs where the fetching repo already has all objects:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.232 s ±  0.082 s    [User: 27.901 s, System: 5.178 s]
      Range (min … max):   31.118 s … 31.301 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     26.616 s ±  0.100 s    [User: 23.675 s, System: 4.752 s]
      Range (min … max):   26.544 s … 26.788 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.17 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 20fcfe0f45..088a8af13b 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1068,7 +1068,7 @@ N_("It took %.2f seconds to check forced updates. You can use\n"
    " to avoid this check.\n");
 
 static int store_updated_refs(const char *raw_url, const char *remote_name,
-			      int connectivity_checked, struct ref *ref_map)
+			      struct ref *ref_map)
 {
 	struct fetch_head fetch_head;
 	struct commit *commit;
@@ -1090,16 +1090,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	else
 		url = xstrdup("foreign");
 
-	if (!connectivity_checked) {
-		struct check_connected_options opt = CHECK_CONNECTED_INIT;
-
-		rm = ref_map;
-		if (check_connected(iterate_ref_map, &rm, &opt)) {
-			rc = error(_("%s did not send all necessary objects\n"), url);
-			goto abort;
-		}
-	}
-
 	if (atomic_fetch) {
 		transaction = ref_transaction_begin(&err);
 		if (!transaction) {
@@ -1302,6 +1292,18 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 		return ret;
 	}
 
+	/*
+	 * If the transport didn't yet check for us, we need to verify
+	 * ourselves that we have obtained all missing objects now.
+	 */
+	if (!transport->smart_options || !transport->smart_options->connectivity_checked) {
+		if (check_connected(iterate_ref_map, &ref_map, NULL)) {
+			ret = error(_("remote did not send all necessary objects\n"));
+			transport_unlock_pack(transport);
+			return ret;
+		}
+	}
+
 	/*
 	 * Keep the new pack's ".keep" file around to allow the caller
 	 * time to update refs to reference the new objects.
@@ -1312,13 +1314,10 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 /* Update local refs based on the ref values fetched from a remote */
 static int consume_refs(struct transport *transport, struct ref *ref_map)
 {
-	int connectivity_checked = transport->smart_options
-		? transport->smart_options->connectivity_checked : 0;
 	int ret;
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
-				 connectivity_checked,
 				 ref_map);
 	transport_unlock_pack(transport);
 	trace2_region_leave("fetch", "consume_refs", the_repository);
-- 
2.33.0



* Re: [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph
  2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
@ 2021-08-20 14:27   ` Derrick Stolee
  2021-08-20 17:18     ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:27 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> When updating our local refs based on the refs fetched from the remote,
> we need to iterate through all requested refs and load their respective
> commits such that we can determine whether they need to be appended to
> FETCH_HEAD or not. In cases where we're fetching from a remote with
> exceedingly many refs, resolving these refs can be quite expensive given
> that we repeatedly need to unpack object headers for each of the
> referenced objects.
> 
> Speed this up by opportunistically trying to resolve object IDs via the
> commit graph: more likely than not, they're going to be a commit anyway,
> and this lets us avoid having to unpack object headers completely in
> case the object is a commit that is part of the commit-graph. This
> significantly speeds up mirror-fetches in a real-world repository with
> 2.3M refs:
> 
>     Benchmark #1: HEAD~: git-fetch
>       Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
>       Range (min … max):   56.372 s … 57.533 s    5 runs
> 
>     Benchmark #2: HEAD: git-fetch
>       Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
>       Range (min … max):   33.454 s … 33.844 s    5 runs
> 
>     Summary
>       'HEAD: git-fetch' ran
>         1.69 ± 0.02 times faster than 'HEAD~: git-fetch'

These numbers are impressive, and it makes sense that performing a
binary search on the OID lookup chunk of the commit-graph is faster
than doing a binary search on the OIDs across the pack-index(es).

I do worry about the case where annotated tags greatly outnumber
branches, so this binary search is extra overhead and the performance
may degrade. Would it be worth checking the ref to see if it lies
within "refs/heads/" (or even _not_ in "refs/tags/") before doing
this commit-graph check?
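
For illustration, a rough sketch of such a guard, reusing the names from
the hunk quoted below (just a strawman; whether to check for "refs/heads/"
or for anything outside "refs/tags/" is exactly the question):

        commit = NULL;
        if (!starts_with(rm->name, "refs/tags/"))
                commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
        if (!commit) {
                commit = lookup_commit_reference_gently(the_repository,
                                                        &rm->old_oid, 1);
                if (!commit)
                        rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
        }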

> -			commit = lookup_commit_reference_gently(the_repository,
> -								&rm->old_oid,
> -								1);
> -			if (!commit)
> -				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
> +			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
> +			if (!commit) {
> +				commit = lookup_commit_reference_gently(the_repository,
> +									&rm->old_oid,
> +									1);
> +				if (!commit)
> +					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;

nit: I wouldn't nest this last "if (!commit)".

Code will work as advertised.

Thanks,
-Stolee


* Re: [PATCH 3/6] connected: refactor iterator to return next object ID directly
  2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
@ 2021-08-20 14:32   ` Derrick Stolee
  2021-08-20 17:43     ` Junio C Hamano
  2021-08-20 17:43   ` René Scharfe
  1 sibling, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:32 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> The object ID iterator used by the connectivity checks returns the next
> object ID via an out-parameter and then uses a return code to indicate
> whether an item was found. This is a bit roundabout: instead of a
> separate error code, we can just retrun the next object ID directly and
> use `NULL` pointers as indicator that the iterator got no items left.
> Furthermore, this avoids a copy of the object ID.
> 
> Refactor the iterator and all its implementations to return object IDs
> directly. While I was honestly hoping for a small speedup given that we
> can now avoid a copy, both versions perform the same. Still, the end
> result is easier to understand and thus it makes sense to keep this
> refactoring regardless.

It's too bad about the lack of measurable performance gains, but the
new code _is_ doing less, it's just not enough.

I agree that the new code organization is better.

Thanks,
-Stolee


* Re: [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph
  2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
@ 2021-08-20 14:37   ` Derrick Stolee
  0 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:37 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> In order to negotiate a packfile, we need to dereference refs to see
> which commits we have in common with the remote. To do so, we first look
> up the object's type -- if it's a tag, we peel until we hit a non-tag
> object. If we hit a commit eventually, then we return that commit.
> 
> In case the object ID points to a commit directly, we can avoid the
> initial lookup of the object type by opportunistically looking up the
> commit via the commit-graph, if available, which gives us a slight speed
> bump of about 2% in a huge repository with about 2.3M refs:
> 
>     Benchmark #1: HEAD~: git-fetch
>       Time (mean ± σ):     31.634 s ±  0.258 s    [User: 28.400 s, System: 5.090 s]
>       Range (min … max):   31.280 s … 31.896 s    5 runs
> 
>     Benchmark #2: HEAD: git-fetch
>       Time (mean ± σ):     31.129 s ±  0.543 s    [User: 27.976 s, System: 5.056 s]
>       Range (min … max):   30.172 s … 31.479 s    5 runs
> 
>     Summary
>       'HEAD: git-fetch' ran
>         1.02 ± 0.02 times faster than 'HEAD~: git-fetch'

This 2% gain is nice, especially because you are measuring the
end-to-end scenario. If you use GIT_TRACE2_PERF=1 on a few runs,
then you could likely isolate some of the regions from
mark_complete_and_common_ref() and demonstrate a larger improvement
in that focused area.

> @@ -119,6 +119,11 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
>  {
>  	enum object_type type;
>  	struct object_info info = { .typep = &type };
> +	struct commit *commit;
> +
> +	commit = lookup_commit_in_graph(the_repository, oid);
> +	if (commit)
> +		return commit;

Obviously a correct thing to do.

>  	if (type == OBJ_COMMIT) {
> -		struct commit *commit = lookup_commit(the_repository, oid);
> +		commit = lookup_commit(the_repository, oid);

Re-using the local simplifies this. Good.

Thanks,
-Stolee


* Re: [PATCH 5/6] fetch: refactor fetch refs to be more extendable
  2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
@ 2021-08-20 14:41   ` Derrick Stolee
  0 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:41 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> Refactor `fetch_refs()` code to make it more extendable by explicitly
> handling error cases. The refactored code should behave the same.

It took unrolling this diff to understand that this code behaves the
same, and it's because of the previous code using "if (!ret) return 0;"
to handle two possible ways that 'ret' could become zero.
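
For reference, the pre-image unrolls to roughly this (annotations mine,
trace2 calls omitted):

        ret = check_exist_and_connected(ref_map);
        if (ret)        /* something is missing, so do the actual fetch */
                ret = transport_fetch_refs(transport, ref_map);
        if (!ret)       /* nothing was missing, or the fetch succeeded */
                return 0;
        transport_unlock_pack(transport);
        return ret;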

I agree that the new code makes it clear that we can leave early after
a successful call to check_exist_and_connected() and again after a
successful call to transport_fetch_refs().

Thanks
-Stolee


* Re: [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects
  2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
@ 2021-08-20 14:47   ` Derrick Stolee
  2021-08-23  6:52     ` Patrick Steinhardt
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:47 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> When fetching refs, we are doing two connectivity checks:
> 
>     - The first one in `fetch_refs()` is done such that we can
>       short-circuit the case where we already have all objects
>       referenced by the updated set of refs.
> 
>     - The second one in `store_updated_refs()` does a sanity check that
>       we have all objects after we have fetched the packfile.
> 
> We always execute both connectivity checks, but this is wasteful in case
> the first connectivity check already notices that we have all objects
> locally available.
> 
> Refactor the code to do both connectivity checks in `fetch_refs()`,
> which allows us to easily skip the second connectivity check if we
> already have all objects available. This refactoring is safe to do given
> that we always call `fetch_refs()` followed by `consume_refs()`, which
> is the only caller of `store_updated_refs()`.

Should we try to make it more clear that fetch_refs() must be followed
by consume_refs() via a comment above the fetch_refs(), or possibly even
its call sites?
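
Something as simple as the following above fetch_refs() might already do
(wording is just a strawman):

        /*
         * Note: fetch_refs() must always be followed by consume_refs().
         * The connectivity check for the fetched objects happens here,
         * and store_updated_refs() relies on it having been performed.
         */
        static int fetch_refs(struct transport *transport, struct ref *ref_map)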

Thanks,
-Stolee


* Re: [PATCH 0/6] Speed up mirror-fetches with many refs
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
@ 2021-08-20 14:50 ` Derrick Stolee
  2021-08-21  0:09 ` Junio C Hamano
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-20 14:50 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
...
> As it turns out, many of the issues are again caused by loading and
> dereferencing refs. This patch series thus mostly focusses on optimizing
> the patterns there, where the biggest win is to opportunistically load
> refs via commit-graphs.

You caught my attention at "commit-graph" and I found your use of them
to be interesting. You strike a balance in checking the commit-graph
when it is likely to be helpful, and skip the commit-graph when it is
not. (For example, PATCH 2 is unlikely to benefit from checking the
commit-graph at that point, because we are looking for objects that
were just downloaded.)

I read all the patches and checked the full context of the functions
to see if there were any issues, but found none. My only comments are
about the case of many annotated tags (do we slow down?) and some
nitpicks.

Thanks,
-Stolee


* Re: [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph
  2021-08-20 14:27   ` Derrick Stolee
@ 2021-08-20 17:18     ` Junio C Hamano
  2021-08-23  6:46       ` Patrick Steinhardt
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2021-08-20 17:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Patrick Steinhardt, git, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

> I do worry about the case where annotated tags greatly outnumber
> branches, so this binary search is extra overhead and the performance
> may degrade. Would it be worth checking the ref to see if it lies
> within "refs/heads/" (or even _not_ in "refs/tags/") before doing
> this commit-graph check?

Ah, clever.


* Re: [PATCH 3/6] connected: refactor iterator to return next object ID directly
  2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
  2021-08-20 14:32   ` Derrick Stolee
@ 2021-08-20 17:43   ` René Scharfe
  2021-08-23  6:47     ` Patrick Steinhardt
  1 sibling, 1 reply; 48+ messages in thread
From: René Scharfe @ 2021-08-20 17:43 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

Am 20.08.21 um 12:08 schrieb Patrick Steinhardt:
> The object ID iterator used by the connectivity checks returns the next
> object ID via an out-parameter and then uses a return code to indicate
> whether an item was found. This is a bit roundabout: instead of a
> separate error code, we can just retrun the next object ID directly and

s/retrun/return/

> use `NULL` pointers as indicator that the iterator got no items left.
> Furthermore, this avoids a copy of the object ID.
>
> Refactor the iterator and all its implementations to return object IDs
> directly. While I was honestly hoping for a small speedup given that we
> can now avoid a copy, both versions perform the same. Still, the end
> result is easier to understand and thus it makes sense to keep this
> refactoring regardless.

check_connected() calls find_pack_entry_one() on the object ID hash,
which copies it anyway.  Perhaps that and caching prevent the expected
speedup?

The private copy made sure check_connected() could not modify the
object IDs.  It still only reads them with this patch, but the compiler
no longer prevents writes.  The iterators could return const pointers
to restore that guarantee.

>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  builtin/clone.c        |  8 +++-----
>  builtin/fetch.c        |  7 +++----
>  builtin/receive-pack.c | 17 +++++++----------
>  connected.c            | 15 ++++++++-------
>  connected.h            |  2 +-
>  fetch-pack.c           |  7 +++----
>  6 files changed, 25 insertions(+), 31 deletions(-)
>
> diff --git a/builtin/clone.c b/builtin/clone.c
> index a74558f30c..817a651936 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -544,7 +544,7 @@ static void write_followtags(const struct ref *refs, const char *msg)
>  	}
>  }
>
> -static int iterate_ref_map(void *cb_data, struct object_id *oid)
> +static struct object_id *iterate_ref_map(void *cb_data)
>  {
>  	struct ref **rm = cb_data;
>  	struct ref *ref = *rm;
> @@ -555,13 +555,11 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
>  	 */
>  	while (ref && !ref->peer_ref)
>  		ref = ref->next;
> -	/* Returning -1 notes "end of list" to the caller. */
>  	if (!ref)
> -		return -1;
> +		return NULL;
>
> -	oidcpy(oid, &ref->old_oid);
>  	*rm = ref->next;
> -	return 0;
> +	return &ref->old_oid;
>  }
>
>  static void update_remote_refs(const struct ref *refs,
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 5fd0f7c791..76ce73ccf5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -962,7 +962,7 @@ static int update_local_ref(struct ref *ref,
>  	}
>  }
>
> -static int iterate_ref_map(void *cb_data, struct object_id *oid)
> +static struct object_id *iterate_ref_map(void *cb_data)
>  {
>  	struct ref **rm = cb_data;
>  	struct ref *ref = *rm;
> @@ -970,10 +970,9 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
>  	while (ref && ref->status == REF_STATUS_REJECT_SHALLOW)
>  		ref = ref->next;
>  	if (!ref)
> -		return -1; /* end of the list */
> +		return NULL;
>  	*rm = ref->next;
> -	oidcpy(oid, &ref->old_oid);
> -	return 0;
> +	return &ref->old_oid;
>  }
>
>  struct fetch_head {
> diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
> index a419de5b38..0abda033bc 100644
> --- a/builtin/receive-pack.c
> +++ b/builtin/receive-pack.c
> @@ -1307,7 +1307,7 @@ static void refuse_unconfigured_deny_delete_current(void)
>  	rp_error("%s", _(refuse_unconfigured_deny_delete_current_msg));
>  }
>
> -static int command_singleton_iterator(void *cb_data, struct object_id *oid);
> +static struct object_id *command_singleton_iterator(void *cb_data);
>  static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
>  {
>  	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
> @@ -1725,16 +1725,15 @@ static void check_aliased_updates(struct command *commands)
>  	string_list_clear(&ref_list, 0);
>  }
>
> -static int command_singleton_iterator(void *cb_data, struct object_id *oid)
> +static struct object_id *command_singleton_iterator(void *cb_data)
>  {
>  	struct command **cmd_list = cb_data;
>  	struct command *cmd = *cmd_list;
>
>  	if (!cmd || is_null_oid(&cmd->new_oid))
> -		return -1; /* end of list */
> +		return NULL;
>  	*cmd_list = NULL; /* this returns only one */
> -	oidcpy(oid, &cmd->new_oid);
> -	return 0;
> +	return &cmd->new_oid;
>  }
>
>  static void set_connectivity_errors(struct command *commands,
> @@ -1764,7 +1763,7 @@ struct iterate_data {
>  	struct shallow_info *si;
>  };
>
> -static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
> +static struct object_id *iterate_receive_command_list(void *cb_data)
>  {
>  	struct iterate_data *data = cb_data;
>  	struct command **cmd_list = &data->cmds;
> @@ -1775,13 +1774,11 @@ static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
>  			/* to be checked in update_shallow_ref() */
>  			continue;
>  		if (!is_null_oid(&cmd->new_oid) && !cmd->skip_update) {
> -			oidcpy(oid, &cmd->new_oid);
>  			*cmd_list = cmd->next;
> -			return 0;
> +			return &cmd->new_oid;
>  		}
>  	}
> -	*cmd_list = NULL;
> -	return -1; /* end of list */
> +	return NULL;
>  }
>
>  static void reject_updates_to_hidden(struct command *commands)
> diff --git a/connected.c b/connected.c
> index b5f9523a5f..374b145355 100644
> --- a/connected.c
> +++ b/connected.c
> @@ -24,7 +24,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  	struct child_process rev_list = CHILD_PROCESS_INIT;
>  	FILE *rev_list_in;
>  	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
> -	struct object_id oid;
> +	struct object_id *oid;
>  	int err = 0;
>  	struct packed_git *new_pack = NULL;
>  	struct transport *transport;
> @@ -34,7 +34,8 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  		opt = &defaults;
>  	transport = opt->transport;
>
> -	if (fn(cb_data, &oid)) {
> +	oid = fn(cb_data);
> +	if (!oid) {
>  		if (opt->err_fd)
>  			close(opt->err_fd);
>  		return err;
> @@ -73,7 +74,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  			for (p = get_all_packs(the_repository); p; p = p->next) {
>  				if (!p->pack_promisor)
>  					continue;
> -				if (find_pack_entry_one(oid.hash, p))
> +				if (find_pack_entry_one(oid->hash, p))
>  					goto promisor_pack_found;
>  			}
>  			/*
> @@ -83,7 +84,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  			goto no_promisor_pack_found;
>  promisor_pack_found:
>  			;
> -		} while (!fn(cb_data, &oid));
> +		} while ((oid = fn(cb_data)) != NULL);
>  		return 0;
>  	}
>
> @@ -133,12 +134,12 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  		 * are sure the ref is good and not sending it to
>  		 * rev-list for verification.
>  		 */
> -		if (new_pack && find_pack_entry_one(oid.hash, new_pack))
> +		if (new_pack && find_pack_entry_one(oid->hash, new_pack))
>  			continue;
>
> -		if (fprintf(rev_list_in, "%s\n", oid_to_hex(&oid)) < 0)
> +		if (fprintf(rev_list_in, "%s\n", oid_to_hex(oid)) < 0)
>  			break;
> -	} while (!fn(cb_data, &oid));
> +	} while ((oid = fn(cb_data)) != NULL);
>
>  	if (ferror(rev_list_in) || fflush(rev_list_in)) {
>  		if (errno != EPIPE && errno != EINVAL)
> diff --git a/connected.h b/connected.h
> index 8d5a6b3ad6..56cc95be2d 100644
> --- a/connected.h
> +++ b/connected.h
> @@ -9,7 +9,7 @@ struct transport;
>   * When called after returning the name for the last object, return -1
>   * to signal EOF, otherwise return 0.
>   */
> -typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
> +typedef struct object_id *(*oid_iterate_fn)(void *);

I.e. this would be safer (requires some more const modifiers throughout
the code):

typedef const struct object_id * (*oid_iterate_fn)(void *);

>
>  /*
>   * Named-arguments struct for check_connected. All arguments are
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 0bf7ed7e47..1a6242cd71 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -1912,16 +1912,15 @@ static void update_shallow(struct fetch_pack_args *args,
>  	oid_array_clear(&ref);
>  }
>
> -static int iterate_ref_map(void *cb_data, struct object_id *oid)
> +static struct object_id *iterate_ref_map(void *cb_data)
>  {
>  	struct ref **rm = cb_data;
>  	struct ref *ref = *rm;
>
>  	if (!ref)
> -		return -1; /* end of the list */
> +		return NULL;
>  	*rm = ref->next;
> -	oidcpy(oid, &ref->old_oid);
> -	return 0;
> +	return &ref->old_oid;
>  }
>
>  struct ref *fetch_pack(struct fetch_pack_args *args,
>



* Re: [PATCH 3/6] connected: refactor iterator to return next object ID directly
  2021-08-20 14:32   ` Derrick Stolee
@ 2021-08-20 17:43     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2021-08-20 17:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Patrick Steinhardt, git, Jeff King,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

> On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
>> The object ID iterator used by the connectivity checks returns the next
>> object ID via an out-parameter and then uses a return code to indicate
>> whether an item was found. This is a bit roundabout: instead of a
>> separate error code, we can just retrun the next object ID directly and
>> use `NULL` pointers as indicator that the iterator got no items left.
>> Furthermore, this avoids a copy of the object ID.
>> 
>> Refactor the iterator and all its implementations to return object IDs
>> directly. While I was honestly hoping for a small speedup given that we
>> can now avoid a copy, both versions perform the same. Still, the end
>> result is easier to understand and thus it makes sense to keep this
>> refactoring regardless.
>
> It's too bad about the lack of measurable performance gains, but the
> new code _is_ doing less, it's just not enough.
>
> I agree that the new code organization is better.

The only potential downside I can think of is losing the ability to
convey the reason why the iterator failed to return an item by adding
new return codes to the function. I do not immediately see that as an
all that useful future extension anyway, so I agree that "we find what
we found, or NULL if we don't find" is much more straightforward and
easier to understand.

Nicely done.

Thanks.


* Re: [PATCH 0/6] Speed up mirror-fetches with many refs
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (6 preceding siblings ...)
  2021-08-20 14:50 ` [PATCH 0/6] Speed up mirror-fetches with many refs Derrick Stolee
@ 2021-08-21  0:09 ` Junio C Hamano
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
  9 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2021-08-21  0:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Ævar Arnfjörð Bjarmason

Patrick Steinhardt <ps@pks.im> writes:

> I've taken another look at fetches in the context of repos with a huge
> amount of refs. This time around, I've taken a look at mirror fetches:
> in our notorious repo with 2.3M refs, these mirror fetches can take up
> to several minutes of time even if there are no changes at all.

I notice that 4/6 (and no other patch) is not signed-off and wonder
if there is a reason (e.g. you are not so comfortable with the idea
behind the step or the implementation) or a simple oversight?

> Note that this series depends on ps/connectivity-optim and thus only
> applies on top of next.

It seems that the dependency of this series is not just 'master'
plus 'ps/connectivity-optim' but some more stuff from 'next'.  I
expect that topics that have been cooking in 'next' during the
previous cycle will graduate to 'master' early next week, so perhaps
it is easier to handle this kind of "depends on some stuff in
'next'" topic after that happens.

Thanks.


* Re: [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph
  2021-08-20 17:18     ` Junio C Hamano
@ 2021-08-23  6:46       ` Patrick Steinhardt
  2021-08-25 14:12         ` Derrick Stolee
  0 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-23  6:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, git, Jeff King, Ævar Arnfjörð Bjarmason

On Fri, Aug 20, 2021 at 10:18:22AM -0700, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
> > I do worry about the case where annotated tags greatly outnumber
> > branches, so this binary search is extra overhead and the performance
> > may degrade. Would it be worth checking the ref to see if it lies
> > within "refs/heads/" (or even _not_ in "refs/tags/") before doing
> > this commit-graph check?
> 
> Ah, clever.

Good idea. Benchmarks for my test repository (which definitely isn't
representative, but it's at least some numbers) show that restricting to
"refs/heads/" diminishes almost all the gains, while restricting to
everything but "refs/tags/" performs almost the same (it's a tiny bit
slower, probably because of the added string comparisons):

    Benchmark #1: all refs: git-fetch
      Time (mean ± σ):     32.959 s ±  0.282 s    [User: 29.801 s, System: 5.137 s]
      Range (min … max):   32.760 s … 33.158 s    2 runs

    Benchmark #2: refs/heads: git-fetch
      Time (mean ± σ):     56.955 s ±  0.002 s    [User: 53.447 s, System: 5.362 s]
      Range (min … max):   56.953 s … 56.957 s    2 runs

    Benchmark #3: !refs/tags: git-fetch
      Time (mean ± σ):     33.447 s ±  0.003 s    [User: 30.160 s, System: 5.027 s]
      Range (min … max):   33.444 s … 33.449 s    2 runs

    Summary
      'all refs: git-fetch' ran
        1.01 ± 0.01 times faster than '!refs/tags: git-fetch'
        1.73 ± 0.01 times faster than 'refs/heads: git-fetch'

This is easily explained by the fact that the test repo has most of its
refs neither in "refs/tags/" nor in "refs/heads/", but rather in special
namespaces like "refs/merge-requests/", "refs/environments/" or
"refs/keep-around/".

I like the idea of excluding "refs/tags/" though: as you point out,
chances are high that these don't point to commits but to annotated tags
instead. So I'll go with that, thanks!

Patrick


* Re: [PATCH 3/6] connected: refactor iterator to return next object ID directly
  2021-08-20 17:43   ` René Scharfe
@ 2021-08-23  6:47     ` Patrick Steinhardt
  0 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-23  6:47 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

On Fri, Aug 20, 2021 at 07:43:06PM +0200, René Scharfe wrote:
> Am 20.08.21 um 12:08 schrieb Patrick Steinhardt:
> > The object ID iterator used by the connectivity checks returns the next
> > object ID via an out-parameter and then uses a return code to indicate
> > whether an item was found. This is a bit roundabout: instead of a
> > separate error code, we can just retrun the next object ID directly and
> 
> s/retrun/return/
> 
> > use `NULL` pointers as indicator that the iterator got no items left.
> > Furthermore, this avoids a copy of the object ID.
> >
> > Refactor the iterator and all its implementations to return object IDs
> > directly. While I was honestly hoping for a small speedup given that we
> > can now avoid a copy, both versions perform the same. Still, the end
> > result is easier to understand and thus it makes sense to keep this
> > refactoring regardless.
> 
> check_connected() calls find_pack_entry_one() on the object ID hash,
> which copies it anyway.  Perhaps that and caching prevent the expected
> speedup?
> 
> The private copy made sure check_connected() could not modify the
> object IDs.  It still only reads them with this patch, but the compiler
> no longer prevents writes.  The iterators could return const pointers
> to restore that guarantee.

Right, will change the signature and re-benchmark. I'd be surprised if
it significantly changed the picture, but let's see.
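
Concretely, the callback type would then become

    typedef const struct object_id *(*oid_iterate_fn)(void *);

with all implementations returning either a pointer to the stored
object ID (e.g. `&ref->old_oid`) or `NULL` at the end of the list, same
as before.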

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects
  2021-08-20 14:47   ` Derrick Stolee
@ 2021-08-23  6:52     ` Patrick Steinhardt
  0 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-23  6:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1528 bytes --]

On Fri, Aug 20, 2021 at 10:47:11AM -0400, Derrick Stolee wrote:
> On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> > When fetching refs, we are doing two connectivity checks:
> > 
> >     - The first one in `fetch_refs()` is done such that we can
> >       short-circuit the case where we already have all objects
> >       referenced by the updated set of refs.
> > 
> >     - The second one in `store_updated_refs()` does a sanity check that
> >       we have all objects after we have fetched the packfile.
> > 
> > We always execute both connectivity checks, but this is wasteful in case
> > the first connectivity check already notices that we have all objects
> > locally available.
> > 
> > Refactor the code to do both connectivity checks in `fetch_refs()`,
> > which allows us to easily skip the second connectivity check if we
> > already have all objects available. This refactoring is safe to do given
> > that we always call `fetch_refs()` followed by `consume_refs()`, which
> > is the only caller of `store_updated_refs()`.
> 
> Should we try to make it more clear that fetch_refs() must be followed
> by consume_refs() via a comment above the fetch_refs(), or possibly even
> its call sites?

I wasn't quite happy with this outcome, either. How about we instead
merge both functions into `fetch_and_consume_refs()`? Both are quite
short, and merging them guarantees they are always called together,
which makes the requirement explicit.

I'll add another patch to do this refactoring.
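
Sketched out, the merged helper would look roughly like this (trace2
regions omitted for brevity; details may still shift in the actual
patch):

    static int fetch_and_consume_refs(struct transport *transport,
                                      struct ref *ref_map)
    {
            int connectivity_checked;
            int ret;

            /* Skip the fetch if the first connectivity check is happy. */
            ret = check_exist_and_connected(ref_map);
            if (ret) {
                    ret = transport_fetch_refs(transport, ref_map);
                    if (ret)
                            goto out;
            }

            connectivity_checked = transport->smart_options ?
                    transport->smart_options->connectivity_checked : 0;

            ret = store_updated_refs(transport->url, transport->remote->name,
                                     connectivity_checked, ref_map);

    out:
            transport_unlock_pack(transport);
            return ret;
    }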

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 0/7] Speed up mirror-fetches with many refs
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (7 preceding siblings ...)
  2021-08-21  0:09 ` Junio C Hamano
@ 2021-08-24 10:36 ` Patrick Steinhardt
  2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
                     ` (8 more replies)
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
  9 siblings, 9 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:36 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 13857 bytes --]

Hi,

this is the second version of my patch series to speed up mirror-fetches
with many refs. This topic applies on top of Junio's 9d5700f60b (Merge
branch 'ps/connectivity-optim' into jch, 2021-08-23).

Changes compared to v1:

    - Patch 1/7: I've applied Stolee's proposal to only
      opportunistically load objects via the commit-graph in case the
      reference is not in refs/tags/ such that we don't regress repos
      with many annotated tags.

    - Patch 3/7: The return parameter of the iterator is now const to
      allow further optimizations by the compiler, as suggested by
      René. I've also re-benchmarked this, and one can now see a very
      slight performance improvement of ~1%.

    - Patch 4/7: Added my missing DCO, as pointed out by Junio.

    - Patch 5, 6, 7: I've redone these to make it clearer that the
      refactoring I'm doing doesn't cause us to miss any object
      connectivity checks. Most importantly, I've merged `fetch_refs()`
      and `consume_refs()` into `fetch_and_consume_refs()` in 6/7, which
      makes the optimization where we elide the second connectivity
      check in 7/7 trivial.

Thanks for your feedback!

Patrick

Patrick Steinhardt (7):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: merge fetching and consuming refs
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++---
 builtin/fetch.c        | 74 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++------
 connected.c            | 15 +++++----
 connected.h            |  2 +-
 fetch-pack.c           | 14 +++++---
 6 files changed, 68 insertions(+), 62 deletions(-)

Range-diff against v1:
1:  6872979c45 ! 1:  4a819a6830 fetch: speed up lookup of want refs via commit-graph
    @@ Commit message
         referenced objects.
     
         Speed this up by opportunistcally trying to resolve object IDs via the
    -    commit graph: more likely than not, they're going to be a commit anyway,
    -    and this lets us avoid having to unpack object headers completely in
    -    case the object is a commit that is part of the commit-graph. This
    -    significantly speeds up mirror-fetches in a real-world repository with
    +    commit graph. We only do so for any refs which are not in "refs/tags":
    +    more likely than not, these are going to be a commit anyway, and this
    +    lets us avoid having to unpack object headers completely in case the
    +    object is a commit that is part of the commit-graph. This significantly
    +    speeds up mirror-fetches in a real-world repository with
         2.3M refs:
     
             Benchmark #1: HEAD~: git-fetch
    -          Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
    -          Range (min … max):   56.372 s … 57.533 s    5 runs
    +          Time (mean ± σ):     56.482 s ±  0.384 s    [User: 53.340 s, System: 5.365 s]
    +          Range (min … max):   56.050 s … 57.045 s    5 runs
     
             Benchmark #2: HEAD: git-fetch
    -          Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
    -          Range (min … max):   33.454 s … 33.844 s    5 runs
    +          Time (mean ± σ):     33.727 s ±  0.170 s    [User: 30.252 s, System: 5.194 s]
    +          Range (min … max):   33.452 s … 33.871 s    5 runs
     
             Summary
               'HEAD: git-fetch' ran
    -            1.69 ± 0.02 times faster than 'HEAD~: git-fetch'
    +            1.67 ± 0.01 times faster than 'HEAD~: git-fetch'
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## builtin/fetch.c ##
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 			      int connectivity_checked, struct ref *ref_map)
    + {
    + 	struct fetch_head fetch_head;
    +-	struct commit *commit;
    + 	int url_len, i, rc = 0;
    + 	struct strbuf note = STRBUF_INIT, err = STRBUF_INIT;
    + 	struct ref_transaction *transaction = NULL;
    +@@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
    + 	     want_status <= FETCH_HEAD_IGNORE;
    + 	     want_status++) {
    + 		for (rm = ref_map; rm; rm = rm->next) {
    ++			struct commit *commit = NULL;
    + 			struct ref *ref = NULL;
    + 
    + 			if (rm->status == REF_STATUS_REJECT_SHALLOW) {
     @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *remote_name,
      				continue;
      			}
    @@ builtin/fetch.c: static int store_updated_refs(const char *raw_url, const char *
     -								1);
     -			if (!commit)
     -				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
    -+			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
    ++			/*
    ++			 * References in "refs/tags/" are often going to point
    ++			 * to annotated tags, which are not part of the
    ++			 * commit-graph. We thus only try to look up refs in
    ++			 * the graph which are not in that namespace to not
    ++			 * regress performance in repositories with many
    ++			 * annotated tags.
    ++			 */
    ++			if (!starts_with(rm->name, "refs/tags/"))
    ++				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
     +			if (!commit) {
     +				commit = lookup_commit_reference_gently(the_repository,
     +									&rm->old_oid,
2:  d3dac607f2 = 2:  81ebadabe8 fetch: avoid unpacking headers in object existence check
3:  3bdad7bc8b ! 3:  98e981ced9 connected: refactor iterator to return next object ID directly
    @@ Commit message
         The object ID iterator used by the connectivity checks returns the next
         object ID via an out-parameter and then uses a return code to indicate
         whether an item was found. This is a bit roundabout: instead of a
    -    separate error code, we can just retrun the next object ID directly and
    +    separate error code, we can just return the next object ID directly and
         use `NULL` pointers as indicator that the iterator got no items left.
         Furthermore, this avoids a copy of the object ID.
     
         Refactor the iterator and all its implementations to return object IDs
    -    directly. While I was honestly hoping for a small speedup given that we
    -    can now avoid a copy, both versions perform the same. Still, the end
    -    result is easier to understand and thus it makes sense to keep this
    -    refactoring regardless.
    +    directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:
    +
    +        Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
    +          Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
    +          Range (min … max):   29.934 s … 30.406 s    10 runs
    +
    +        Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
    +          Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
    +          Range (min … max):   29.696 s … 29.996 s    10 runs
    +
    +        Summary
    +          '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
    +            1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'
    +
    +    While this 1% speedup could be labelled as statistically insignificant,
    +    the speedup is consistent on my machine. Furthermore, this is an end to
    +    end test, so it is expected that the improvement in the connectivity
    +    check itself is more significant.
     
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
    @@ builtin/clone.c: static void write_followtags(const struct ref *refs, const char
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/fetch.c: static int update_local_ref(struct ref *ref,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
    @@ builtin/receive-pack.c: static void refuse_unconfigured_deny_delete_current(void
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid);
    -+static struct object_id *command_singleton_iterator(void *cb_data);
    ++static const struct object_id *command_singleton_iterator(void *cb_data);
      static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
      {
      	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
    @@ builtin/receive-pack.c: static void check_aliased_updates(struct command *comman
      }
      
     -static int command_singleton_iterator(void *cb_data, struct object_id *oid)
    -+static struct object_id *command_singleton_iterator(void *cb_data)
    ++static const struct object_id *command_singleton_iterator(void *cb_data)
      {
      	struct command **cmd_list = cb_data;
      	struct command *cmd = *cmd_list;
    @@ builtin/receive-pack.c: struct iterate_data {
      };
      
     -static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_receive_command_list(void *cb_data)
    ++static const struct object_id *iterate_receive_command_list(void *cb_data)
      {
      	struct iterate_data *data = cb_data;
      	struct command **cmd_list = &data->cmds;
    @@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
      	FILE *rev_list_in;
      	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
     -	struct object_id oid;
    -+	struct object_id *oid;
    ++	const struct object_id *oid;
      	int err = 0;
      	struct packed_git *new_pack = NULL;
      	struct transport *transport;
    @@ connected.h: struct transport;
       * to signal EOF, otherwise return 0.
       */
     -typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
    -+typedef struct object_id *(*oid_iterate_fn)(void *);
    ++typedef const struct object_id *(*oid_iterate_fn)(void *);
      
      /*
       * Named-arguments struct for check_connected. All arguments are
    @@ fetch-pack.c: static void update_shallow(struct fetch_pack_args *args,
      }
      
     -static int iterate_ref_map(void *cb_data, struct object_id *oid)
    -+static struct object_id *iterate_ref_map(void *cb_data)
    ++static const struct object_id *iterate_ref_map(void *cb_data)
      {
      	struct ref **rm = cb_data;
      	struct ref *ref = *rm;
4:  67917af7ce ! 4:  6311203f08 fetch-pack: optimize loading of refs via commit graph
    @@ Commit message
         In case this fails, we fall back to the old code which peels the
         objects to a commit.
     
    +    Signed-off-by: Patrick Steinhardt <ps@pks.im>
    +
      ## fetch-pack.c ##
     @@ fetch-pack.c: static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
      {
5:  7653f8eabc ! 5:  56a9158ac3 fetch: refactor fetch refs to be more extendable
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
      static int fetch_refs(struct transport *transport, struct ref *ref_map)
      {
     -	int ret = check_exist_and_connected(ref_map);
    --	if (ret) {
    --		trace2_region_enter("fetch", "fetch_refs", the_repository);
    --		ret = transport_fetch_refs(transport, ref_map);
    --		trace2_region_leave("fetch", "fetch_refs", the_repository);
    --	}
     +	int ret;
     +
     +	/*
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
     +	 * refs.
     +	 */
     +	ret = check_exist_and_connected(ref_map);
    - 	if (!ret)
    + 	if (ret) {
    + 		trace2_region_enter("fetch", "fetch_refs", the_repository);
    + 		ret = transport_fetch_refs(transport, ref_map);
    + 		trace2_region_leave("fetch", "fetch_refs", the_repository);
    ++		if (ret) {
    ++			transport_unlock_pack(transport);
    ++			return ret;
    ++		}
    + 	}
    +-	if (!ret)
     -		/*
     -		 * Keep the new pack's ".keep" file around to allow the caller
     -		 * time to update refs to reference the new objects.
     -		 */
    - 		return 0;
    +-		return 0;
     -	transport_unlock_pack(transport);
     -	return ret;
     +
    -+	trace2_region_enter("fetch", "fetch_refs", the_repository);
    -+	ret = transport_fetch_refs(transport, ref_map);
    -+	trace2_region_leave("fetch", "fetch_refs", the_repository);
    -+	if (ret) {
    -+		transport_unlock_pack(transport);
    -+		return ret;
    -+	}
    -+
     +	/*
     +	 * Keep the new pack's ".keep" file around to allow the caller
     +	 * time to update refs to reference the new objects.
6:  646ac90e62 < -:  ---------- fetch: avoid second connectivity check if we already have all objects
-:  ---------- > 6:  31d9f72edf fetch: merge fetching and consuming refs
-:  ---------- > 7:  84e39c847f fetch: avoid second connectivity check if we already have all objects
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
@ 2021-08-24 10:36   ` Patrick Steinhardt
  2021-08-25 14:16     ` Derrick Stolee
  2021-08-24 10:37   ` [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:36 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 3346 bytes --]

When updating our local refs based on the refs fetched from the remote,
we need to iterate through all requested refs and load their respective
commits such that we can determine whether they need to be appended to
FETCH_HEAD or not. In cases where we're fetching from a remote with
exceedingly many refs, resolving these refs can be quite expensive given
that we repeatedly need to unpack object headers for each of the
referenced objects.

Speed this up by opportunistcally trying to resolve object IDs via the
commit graph. We only do so for any refs which are not in "refs/tags":
more likely than not, these are going to be a commit anyway, and this
lets us avoid having to unpack object headers completely in case the
object is a commit that is part of the commit-graph. This significantly
speeds up mirror-fetches in a real-world repository with
2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     56.482 s ±  0.384 s    [User: 53.340 s, System: 5.365 s]
      Range (min … max):   56.050 s … 57.045 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     33.727 s ±  0.170 s    [User: 30.252 s, System: 5.194 s]
      Range (min … max):   33.452 s … 33.871 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.67 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index e064687dbd..91d1301613 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1074,7 +1074,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 			      int connectivity_checked, struct ref *ref_map)
 {
 	struct fetch_head fetch_head;
-	struct commit *commit;
 	int url_len, i, rc = 0;
 	struct strbuf note = STRBUF_INIT, err = STRBUF_INIT;
 	struct ref_transaction *transaction = NULL;
@@ -1122,6 +1121,7 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	     want_status <= FETCH_HEAD_IGNORE;
 	     want_status++) {
 		for (rm = ref_map; rm; rm = rm->next) {
+			struct commit *commit = NULL;
 			struct ref *ref = NULL;
 
 			if (rm->status == REF_STATUS_REJECT_SHALLOW) {
@@ -1131,11 +1131,23 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 				continue;
 			}
 
-			commit = lookup_commit_reference_gently(the_repository,
-								&rm->old_oid,
-								1);
-			if (!commit)
-				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			/*
+			 * References in "refs/tags/" are often going to point
+			 * to annotated tags, which are not part of the
+			 * commit-graph. We thus only try to look up refs in
+			 * the graph which are not in that namespace to not
+			 * regress performance in repositories with many
+			 * annotated tags.
+			 */
+			if (!starts_with(rm->name, "refs/tags/"))
+				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
+			if (!commit) {
+				commit = lookup_commit_reference_gently(the_repository,
+									&rm->old_oid,
+									1);
+				if (!commit)
+					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			}
 
 			if (rm->fetch_head_status != want_status)
 				continue;
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
  2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-24 10:37   ` [PATCH v2 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

When updating local refs after the fetch has transferred all objects, we
do an object existence test as a safety guard to avoid updating a ref to
an object which we don't have. We do so via `oid_object_info()`: if it
returns an error, then we know the object does not exist.

One side effect of `oid_object_info()` is that it parses the object's
type, and to do so it must unpack the object header. This is completely
pointless: we don't care for the type, but only want to assert that the
object exists.

Refactor the code to use `repo_has_object_file()`, which both makes the
code's intent clearer and is also faster because it does not unpack
object headers. In a real-world repo with 2.3M refs, this results in a
small speedup when doing a mirror-fetch:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     33.686 s ±  0.176 s    [User: 30.119 s, System: 5.262 s]
      Range (min … max):   33.512 s … 33.944 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.247 s ±  0.195 s    [User: 28.135 s, System: 5.066 s]
      Range (min … max):   30.948 s … 31.472 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.08 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 91d1301613..01513e6aea 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -846,13 +846,11 @@ static int update_local_ref(struct ref *ref,
 			    int summary_width)
 {
 	struct commit *current = NULL, *updated;
-	enum object_type type;
 	struct branch *current_branch = branch_get(NULL);
 	const char *pretty_ref = prettify_refname(ref->name);
 	int fast_forward = 0;
 
-	type = oid_object_info(the_repository, &ref->new_oid, NULL);
-	if (type < 0)
+	if (!repo_has_object_file(the_repository, &ref->new_oid))
 		die(_("object %s not found"), oid_to_hex(&ref->new_oid));
 
 	if (oideq(&ref->old_oid, &ref->new_oid)) {
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 3/7] connected: refactor iterator to return next object ID directly
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
  2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
  2021-08-24 10:37   ` [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-24 10:37   ` [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 8508 bytes --]

The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just return the next object ID directly and
use `NULL` pointers as indicator that the iterator got no items left.
Furthermore, this avoids a copy of the object ID.

Refactor the iterator and all its implementations to return object IDs
directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:

    Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
      Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
      Range (min … max):   29.934 s … 30.406 s    10 runs

    Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
      Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
      Range (min … max):   29.696 s … 29.996 s    10 runs

    Summary
      '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
        1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'

While this 1% speedup could be labelled as statistically insignificant,
the speedup is consistent on my machine. Furthermore, this is an end to
end test, so it is expected that the improvement in the connectivity
check itself is more significant.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c        |  8 +++-----
 builtin/fetch.c        |  7 +++----
 builtin/receive-pack.c | 17 +++++++----------
 connected.c            | 15 ++++++++-------
 connected.h            |  2 +-
 fetch-pack.c           |  7 +++----
 6 files changed, 25 insertions(+), 31 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 66fe66679c..4a1056fcc2 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,7 @@ static void write_followtags(const struct ref *refs, const char *msg)
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -668,13 +668,11 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	 */
 	while (ref && !ref->peer_ref)
 		ref = ref->next;
-	/* Returning -1 notes "end of list" to the caller. */
 	if (!ref)
-		return -1;
+		return NULL;
 
-	oidcpy(oid, &ref->old_oid);
 	*rm = ref->next;
-	return 0;
+	return &ref->old_oid;
 }
 
 static void update_remote_refs(const struct ref *refs,
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 01513e6aea..cdf0d0d671 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -962,7 +962,7 @@ static int update_local_ref(struct ref *ref,
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -970,10 +970,9 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	while (ref && ref->status == REF_STATUS_REJECT_SHALLOW)
 		ref = ref->next;
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct fetch_head {
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 2d1f97e1ca..041e915454 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1306,7 +1306,7 @@ static void refuse_unconfigured_deny_delete_current(void)
 	rp_error("%s", _(refuse_unconfigured_deny_delete_current_msg));
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid);
+static const struct object_id *command_singleton_iterator(void *cb_data);
 static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
 {
 	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
@@ -1731,16 +1731,15 @@ static void check_aliased_updates(struct command *commands)
 	string_list_clear(&ref_list, 0);
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid)
+static const struct object_id *command_singleton_iterator(void *cb_data)
 {
 	struct command **cmd_list = cb_data;
 	struct command *cmd = *cmd_list;
 
 	if (!cmd || is_null_oid(&cmd->new_oid))
-		return -1; /* end of list */
+		return NULL;
 	*cmd_list = NULL; /* this returns only one */
-	oidcpy(oid, &cmd->new_oid);
-	return 0;
+	return &cmd->new_oid;
 }
 
 static void set_connectivity_errors(struct command *commands,
@@ -1770,7 +1769,7 @@ struct iterate_data {
 	struct shallow_info *si;
 };
 
-static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_receive_command_list(void *cb_data)
 {
 	struct iterate_data *data = cb_data;
 	struct command **cmd_list = &data->cmds;
@@ -1781,13 +1780,11 @@ static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
 			/* to be checked in update_shallow_ref() */
 			continue;
 		if (!is_null_oid(&cmd->new_oid) && !cmd->skip_update) {
-			oidcpy(oid, &cmd->new_oid);
 			*cmd_list = cmd->next;
-			return 0;
+			return &cmd->new_oid;
 		}
 	}
-	*cmd_list = NULL;
-	return -1; /* end of list */
+	return NULL;
 }
 
 static void reject_updates_to_hidden(struct command *commands)
diff --git a/connected.c b/connected.c
index b5f9523a5f..cf68e37a97 100644
--- a/connected.c
+++ b/connected.c
@@ -24,7 +24,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 	struct child_process rev_list = CHILD_PROCESS_INIT;
 	FILE *rev_list_in;
 	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
-	struct object_id oid;
+	const struct object_id *oid;
 	int err = 0;
 	struct packed_git *new_pack = NULL;
 	struct transport *transport;
@@ -34,7 +34,8 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		opt = &defaults;
 	transport = opt->transport;
 
-	if (fn(cb_data, &oid)) {
+	oid = fn(cb_data);
+	if (!oid) {
 		if (opt->err_fd)
 			close(opt->err_fd);
 		return err;
@@ -73,7 +74,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			for (p = get_all_packs(the_repository); p; p = p->next) {
 				if (!p->pack_promisor)
 					continue;
-				if (find_pack_entry_one(oid.hash, p))
+				if (find_pack_entry_one(oid->hash, p))
 					goto promisor_pack_found;
 			}
 			/*
@@ -83,7 +84,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			goto no_promisor_pack_found;
 promisor_pack_found:
 			;
-		} while (!fn(cb_data, &oid));
+		} while ((oid = fn(cb_data)) != NULL);
 		return 0;
 	}
 
@@ -133,12 +134,12 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		 * are sure the ref is good and not sending it to
 		 * rev-list for verification.
 		 */
-		if (new_pack && find_pack_entry_one(oid.hash, new_pack))
+		if (new_pack && find_pack_entry_one(oid->hash, new_pack))
 			continue;
 
-		if (fprintf(rev_list_in, "%s\n", oid_to_hex(&oid)) < 0)
+		if (fprintf(rev_list_in, "%s\n", oid_to_hex(oid)) < 0)
 			break;
-	} while (!fn(cb_data, &oid));
+	} while ((oid = fn(cb_data)) != NULL);
 
 	if (ferror(rev_list_in) || fflush(rev_list_in)) {
 		if (errno != EPIPE && errno != EINVAL)
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..6e59c92aa3 100644
--- a/connected.h
+++ b/connected.h
@@ -9,7 +9,7 @@ struct transport;
  * When called after returning the name for the last object, return -1
  * to signal EOF, otherwise return 0.
  */
-typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
+typedef const struct object_id *(*oid_iterate_fn)(void *);
 
 /*
  * Named-arguments struct for check_connected. All arguments are
diff --git a/fetch-pack.c b/fetch-pack.c
index 0bf7ed7e47..e6ec79f81a 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1912,16 +1912,15 @@ static void update_shallow(struct fetch_pack_args *args,
 	oid_array_clear(&ref);
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
 
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct ref *fetch_pack(struct fetch_pack_args *args,
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2021-08-24 10:37   ` [PATCH v2 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 2116 bytes --]

In order to negotiate a packfile, we need to dereference refs to see
which commits we have in common with the remote. To do so, we first look
up the object's type -- if it's a tag, we peel until we hit a non-tag
object. If we hit a commit eventually, then we return that commit.

In case the object ID points to a commit directly, we can avoid the
initial lookup of the object type by opportunistically looking up the
commit via the commit-graph, if available, which gives us a slight speed
bump of about 2% in a huge repository with about 2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.634 s ±  0.258 s    [User: 28.400 s, System: 5.090 s]
      Range (min … max):   31.280 s … 31.896 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.129 s ±  0.543 s    [User: 27.976 s, System: 5.056 s]
      Range (min … max):   30.172 s … 31.479 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.02 ± 0.02 times faster than 'HEAD~: git-fetch'

In case this fails, we fall back to the old code which peels the
objects to a commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 fetch-pack.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index e6ec79f81a..dc800879cb 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -119,6 +119,11 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 {
 	enum object_type type;
 	struct object_info info = { .typep = &type };
+	struct commit *commit;
+
+	commit = lookup_commit_in_graph(the_repository, oid);
+	if (commit)
+		return commit;
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
@@ -139,7 +144,7 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 	}
 
 	if (type == OBJ_COMMIT) {
-		struct commit *commit = lookup_commit(the_repository, oid);
+		commit = lookup_commit(the_repository, oid);
 		if (!commit || repo_parse_commit(the_repository, commit))
 			return NULL;
 		return commit;
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2021-08-24 10:37   ` [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-25 14:19     ` Derrick Stolee
  2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

Refactor `fetch_refs()` code to make it more extendable by explicitly
handling error cases. The refactored code should behave the same.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index cdf0d0d671..da0e283288 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1293,20 +1293,28 @@ static int check_exist_and_connected(struct ref *ref_map)
 
 static int fetch_refs(struct transport *transport, struct ref *ref_map)
 {
-	int ret = check_exist_and_connected(ref_map);
+	int ret;
+
+	/*
+	 * We don't need to perform a fetch in case we can already satisfy all
+	 * refs.
+	 */
+	ret = check_exist_and_connected(ref_map);
 	if (ret) {
 		trace2_region_enter("fetch", "fetch_refs", the_repository);
 		ret = transport_fetch_refs(transport, ref_map);
 		trace2_region_leave("fetch", "fetch_refs", the_repository);
+		if (ret) {
+			transport_unlock_pack(transport);
+			return ret;
+		}
 	}
-	if (!ret)
-		/*
-		 * Keep the new pack's ".keep" file around to allow the caller
-		 * time to update refs to reference the new objects.
-		 */
-		return 0;
-	transport_unlock_pack(transport);
-	return ret;
+
+	/*
+	 * Keep the new pack's ".keep" file around to allow the caller
+	 * time to update refs to reference the new objects.
+	 */
+	return 0;
 }
 
 /* Update local refs based on the ref values fetched from a remote */
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 6/7] fetch: merge fetching and consuming refs
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-25 14:26     ` Derrick Stolee
  2021-08-24 10:37   ` [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 3220 bytes --]

The functions `fetch_refs()` and `consume_refs()` must always be called
together such that we first obtain all missing objects and then update
our local refs to match the remote refs. In a subsequent patch, we'll
further require that `fetch_refs()` must always be called before
`consume_refs()` such that it can correctly assert that we have all
objects after the fetch given that we're about to move the connectivity
check.

Make this requirement explicit by merging both functions into a single
`fetch_and_consume_refs()` function.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 32 +++++++++++---------------------
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index da0e283288..a1e17edd8b 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1291,8 +1291,9 @@ static int check_exist_and_connected(struct ref *ref_map)
 	return check_connected(iterate_ref_map, &rm, &opt);
 }
 
-static int fetch_refs(struct transport *transport, struct ref *ref_map)
+static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_map)
 {
+	int connectivity_checked;
 	int ret;
 
 	/*
@@ -1304,32 +1305,22 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 		trace2_region_enter("fetch", "fetch_refs", the_repository);
 		ret = transport_fetch_refs(transport, ref_map);
 		trace2_region_leave("fetch", "fetch_refs", the_repository);
-		if (ret) {
-			transport_unlock_pack(transport);
-			return ret;
-		}
+		if (ret)
+			goto out;
 	}
 
-	/*
-	 * Keep the new pack's ".keep" file around to allow the caller
-	 * time to update refs to reference the new objects.
-	 */
-	return 0;
-}
-
-/* Update local refs based on the ref values fetched from a remote */
-static int consume_refs(struct transport *transport, struct ref *ref_map)
-{
-	int connectivity_checked = transport->smart_options
+	connectivity_checked = transport->smart_options
 		? transport->smart_options->connectivity_checked : 0;
-	int ret;
+
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
 				 connectivity_checked,
 				 ref_map);
-	transport_unlock_pack(transport);
 	trace2_region_leave("fetch", "consume_refs", the_repository);
+
+out:
+	transport_unlock_pack(transport);
 	return ret;
 }
 
@@ -1518,8 +1509,7 @@ static void backfill_tags(struct transport *transport, struct ref *ref_map)
 	transport_set_option(transport, TRANS_OPT_FOLLOWTAGS, NULL);
 	transport_set_option(transport, TRANS_OPT_DEPTH, "0");
 	transport_set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, NULL);
-	if (!fetch_refs(transport, ref_map))
-		consume_refs(transport, ref_map);
+	fetch_and_consume_refs(transport, ref_map);
 
 	if (gsecondary) {
 		transport_disconnect(gsecondary);
@@ -1610,7 +1600,7 @@ static int do_fetch(struct transport *transport,
 				   transport->url);
 		}
 	}
-	if (fetch_refs(transport, ref_map) || consume_refs(transport, ref_map)) {
+	if (fetch_and_consume_refs(transport, ref_map)) {
 		free_refs(ref_map);
 		retcode = 1;
 		goto cleanup;
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
@ 2021-08-24 10:37   ` Patrick Steinhardt
  2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
  2021-08-25 14:27   ` Derrick Stolee
  8 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-24 10:37 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 2303 bytes --]

When fetching refs, we are doing two connectivity checks:

    - The first one is done such that we can skip fetching refs in the
      case where we already have all objects referenced by the updated
      set of refs.

    - The second one verifies that we have all objects after we have
      fetched objects.

We always execute both connectivity checks, but this is wasteful in case
the first connectivity check already notices that we have all objects
locally available.

Skip the second connectivity check in case we already had all objects
available. This gives us a nice speedup when doing a mirror-fetch in a
repository with about 2.3M refs where the fetching repo already has all
objects:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     30.025 s ±  0.081 s    [User: 27.070 s, System: 4.933 s]
      Range (min … max):   29.900 s … 30.111 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     25.574 s ±  0.177 s    [User: 22.855 s, System: 4.683 s]
      Range (min … max):   25.399 s … 25.765 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.17 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index a1e17edd8b..e2c952ec67 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1293,7 +1293,7 @@ static int check_exist_and_connected(struct ref *ref_map)
 
 static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_map)
 {
-	int connectivity_checked;
+	int connectivity_checked = 1;
 	int ret;
 
 	/*
@@ -1307,11 +1307,10 @@ static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_m
 		trace2_region_leave("fetch", "fetch_refs", the_repository);
 		if (ret)
 			goto out;
+		connectivity_checked = transport->smart_options ?
+			transport->smart_options->connectivity_checked : 0;
 	}
 
-	connectivity_checked = transport->smart_options
-		? transport->smart_options->connectivity_checked : 0;
-
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/7] Speed up mirror-fetches with many refs
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2021-08-24 10:37   ` [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
@ 2021-08-24 22:48   ` Junio C Hamano
  2021-08-25  6:04     ` Patrick Steinhardt
  2021-08-25 14:27   ` Derrick Stolee
  8 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2021-08-24 22:48 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, René Scharfe

Patrick Steinhardt <ps@pks.im> writes:

> this is the second version of my patch series to speed up mirror-fetches
> with many refs. This topic applies on top of Junio's 9d5700f60b (Merge
> branch 'ps/connectivity-optim' into jch, 2021-08-23).

It is a horrible commit to base anything on.  You are taking your
patches hostage to all of these other topics.

    9d5700f60b Merge branch 'ps/connectivity-optim' into jch
    7ad315de2f Merge branch 'js/log-protocol-version' into jch
    1726f748f5 Merge branch 'en/ort-becomes-the-default' into jch
    23aeecb099 Merge branch 'en/merge-strategy-docs' into jch
    568277d458 Merge branch 'en/pull-conflicting-options' into jch
    2b316bb006 ### match next
    4efa9ea0b6 Merge branch 'ps/fetch-pack-load-refs-optim' into jch
    b305842ee8 Merge branch 'jt/push-negotiation-fixes' into jch
    83b45616f1 Merge branch 'es/trace2-log-parent-process-name' into jch
    be89aa8c38 Merge branch 'hn/refs-test-cleanup' into jch
    256d56ed32 Merge branch 'en/ort-perf-batch-15' into jch
    7477fbf53a Merge branch 'js/expand-runtime-prefix' into jch
    b1453dfd30 Merge branch 'ab/bundle-doc' into jch
    1b66e8e89d Merge branch 'zh/ref-filter-raw-data' into jch
    1fbf27ddcd Merge branch 'ab/pack-stdin-packs-fix' into jch
    dcf57bfebb Merge branch 'ab/http-drop-old-curl' into jch
    93041f7c57 Merge branch 'ds/add-with-sparse-index' into jch
    814a016195 Merge branch 'jc/bisect-sans-show-branch' into jch

A better way to handle a situation like this is to limit your
dependencies more explicitly.  If you look at what I did to the last
round of this topic, you'll see that there is a merge of the
'ps/connectivity-optim' topic into v2.33 followed by application of
the patches, like this:

    1d576ca7b2 fetch: avoid second connectivity check if we already have all objects
    6768595f10 fetch: refactor fetch refs to be more extendable
    a615d7cf87 fetch-pack: optimize loading of refs via commit graph
    bfd04fc24c connected: refactor iterator to return next object ID directly
    1a387c9f3a fetch: avoid unpacking headers in object existence check
    f1a4367ec4 fetch: speed up lookup of want refs via commit-graph
    3628199d4d Merge branch 'ps/connectivity-optim' into ps/fetch-optim

What I did to your last round was to merge 'ps/connectivity-optim'
on top of v2.33 and then queue them.  You can do the same for this
round (you can tell people "apply these on top of the result of
merging topic X, Y and Z on tag V").

    df52ef2c3a fetch: avoid second connectivity check if we already have all objects
    c1721680e4 fetch: merge fetching and consuming refs
    5470cbe1be fetch: refactor fetch refs to be more extendable
    016a510428 fetch-pack: optimize loading of refs via commit graph
    f6c7e63cc7 connected: refactor iterator to return next object ID directly
    17c8e90df3 fetch: avoid unpacking headers in object existence check
    a54c245004 fetch: speed up lookup of want refs via commit-graph
    3628199d4d Merge branch 'ps/connectivity-optim' into ps/fetch-optim

I had to adjust [4/7] while applying them on top of the same
3628199d4d I created for queuing the previous round, and it would be
appreciated if you can double-check the result.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/7] Speed up mirror-fetches with many refs
  2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
@ 2021-08-25  6:04     ` Patrick Steinhardt
  0 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-08-25  6:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

On Tue, Aug 24, 2021 at 03:48:19PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
[snip]
> A better way to handle a situation like this is to limit your
> dependencies more explicitly.  If you look at what I did to the last
> round of this topic, you'll see that there is a merge of the
> 'ps/connectivity-optim' topic into v2.33 followed by application of
> the patches, like this:

I wasn't quite sure how to best handle this, but I'll keep this in mind
for future iterations/patch series. Thanks for the explanation.

[snip]
> I had to adjust [4/7] while applying them on top of the same
> 3628199d4d I created for queuing the previous round, and it would be
> appreciated if you can double-check the result.

The result looks good to me, thanks.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph
  2021-08-23  6:46       ` Patrick Steinhardt
@ 2021-08-25 14:12         ` Derrick Stolee
  0 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-25 14:12 UTC (permalink / raw)
  To: Patrick Steinhardt, Junio C Hamano
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason

On 8/23/2021 2:46 AM, Patrick Steinhardt wrote:
> On Fri, Aug 20, 2021 at 10:18:22AM -0700, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> I do worry about the case where annotated tags greatly outnumber
>>> branches, so this binary search is extra overhead and the performance
>>> may degrade. Would it be worth checking the ref to see if it lies
>>> within "refs/heads/" (or even _not_ in "refs/tags/") before doing
>>> this commit-graph check?
>>
>> Ah, clever.
> 
> Good idea. Benchmarks for my test repository (which definitely isn't
> representative, but it's at least some numbers) show that restricting to
> "refs/heads/" diminishes almost all the gains, while restricting to
> everything but "refs/tags/" performs almost the same (it's a tiny bit
> slower, probably because of the added string comparisons):
> 
>     Benchmark #1: all refs: git-fetch
>       Time (mean ± σ):     32.959 s ±  0.282 s    [User: 29.801 s, System: 5.137 s]
>       Range (min … max):   32.760 s … 33.158 s    2 runs
> 
>     Benchmark #2: refs/heads: git-fetch
>       Time (mean ± σ):     56.955 s ±  0.002 s    [User: 53.447 s, System: 5.362 s]
>       Range (min … max):   56.953 s … 56.957 s    2 runs
> 
>     Benchmark #3: !refs/tags: git-fetch
>       Time (mean ± σ):     33.447 s ±  0.003 s    [User: 30.160 s, System: 5.027 s]
>       Range (min … max):   33.444 s … 33.449 s    2 runs
> 
>     Summary
>       'all refs: git-fetch' ran
>         1.01 ± 0.01 times faster than '!refs/tags: git-fetch'
>         1.73 ± 0.01 times faster than 'refs/heads: git-fetch'

Thanks for testing both options.

> This is easily explained by the fact that the test repo has most of its
> refs neither in "refs/tags/" nor in "refs/heads/", but rather in special
> namespaces like "refs/merge-requests/", "refs/environments/" or
> "refs/keep-around/".

That makes sense to me. GitHub also stores refs like refs/pull/ so I can
understand not wanting to restrict to refs/heads/.

> I like the idea of excluding "refs/tags/" though: as you point out,
> chances are high that these don't point to commits but to annotated tags
> instead. So I'll go with that, thanks!

Yeah, that makes sense as a good way forward.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph
  2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
@ 2021-08-25 14:16     ` Derrick Stolee
  0 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-25 14:16 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

On 8/24/2021 6:36 AM, Patrick Steinhardt wrote:
> Speed this up by opportunistcally trying to resolve object IDs via the

s/opportunistcally/opportunistically/ 

> +			/*
> +			 * References in "refs/tags/" are often going to point
> +			 * to annotated tags, which are not part of the
> +			 * commit-graph. We thus only try to look up refs in
> +			 * the graph which are not in that namespace to not
> +			 * regress performance in repositories with many
> +			 * annotated tags.
> +			 */
> +			if (!starts_with(rm->name, "refs/tags/"))
> +				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);

This new logic looks good.

> +			if (!commit) {
> +				commit = lookup_commit_reference_gently(the_repository,
> +									&rm->old_oid,
> +									1);
> +				if (!commit)
> +					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
> +			}

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable
  2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
@ 2021-08-25 14:19     ` Derrick Stolee
  2021-09-01 12:48       ` Patrick Steinhardt
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2021-08-25 14:19 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

On 8/24/2021 6:37 AM, Patrick Steinhardt wrote:
> Refactor `fetch_refs()` code to make it more extendable by explicitly
> handling error cases. The refactored code should behave the same.
...
> +	/*
> +	 * We don't need to perform a fetch in case we can already satisfy all
> +	 * refs.
> +	 */
> +	ret = check_exist_and_connected(ref_map);
>  	if (ret) {
>  		trace2_region_enter("fetch", "fetch_refs", the_repository);
>  		ret = transport_fetch_refs(transport, ref_map);
>  		trace2_region_leave("fetch", "fetch_refs", the_repository);
> +		if (ret) {
> +			transport_unlock_pack(transport);
> +			return ret;
> +		}
>  	}

I see that this nested organization makes it more clear what cases
lead into this error state.

> -	if (!ret)
> -		/*
> -		 * Keep the new pack's ".keep" file around to allow the caller
> -		 * time to update refs to reference the new objects.
> -		 */
> -		return 0;
> -	transport_unlock_pack(transport);
> -	return ret;
> +
> +	/*
> +	 * Keep the new pack's ".keep" file around to allow the caller
> +	 * time to update refs to reference the new objects.
> +	 */
> +	return 0;

And it happens that 'ret' is zero here. Should we keep returning 'ret'
or perhaps add an "assert(!ret);" before the return? The assert()
doesn't do much, but at minimum would serve as an extra indicator to
anyone working in this method in the future.
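
I.e., the tail of the function would then read something like:

	/*
	 * Keep the new pack's ".keep" file around to allow the caller
	 * time to update refs to reference the new objects.
	 */
	assert(!ret);
	return 0;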

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/7] fetch: merge fetching and consuming refs
  2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
@ 2021-08-25 14:26     ` Derrick Stolee
  2021-09-01 12:49       ` Patrick Steinhardt
  0 siblings, 1 reply; 48+ messages in thread
From: Derrick Stolee @ 2021-08-25 14:26 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

On 8/24/2021 6:37 AM, Patrick Steinhardt wrote:
> -		if (ret) {
> -			transport_unlock_pack(transport);
> -			return ret;
> -		}
> +		if (ret)
> +			goto out;

You were just reorganizing this method in the previous patch.
This "goto out" trick could have been applied there instead, which
wouldn't complicate that patch and would simplify this one.

But perhaps it would look strange to have the following ending
to the method, even if for only one patch:

	return 0;

out:
	transport_unlock_pack(transport);
	return res;
}

So, feel free to ignore me here. Decide based on your taste.

>  	}
>  
> -	/*
> -	 * Keep the new pack's ".keep" file around to allow the caller
> -	 * time to update refs to reference the new objects.
> -	 */
> -	return 0;
> -}
> -
> -/* Update local refs based on the ref values fetched from a remote */
> -static int consume_refs(struct transport *transport, struct ref *ref_map)
> -{
> -	int connectivity_checked = transport->smart_options
> +	connectivity_checked = transport->smart_options
>  		? transport->smart_options->connectivity_checked : 0;
> -	int ret;
> +
>  	trace2_region_enter("fetch", "consume_refs", the_repository);
>  	ret = store_updated_refs(transport->url,
>  				 transport->remote->name,
>  				 connectivity_checked,
>  				 ref_map);
> -	transport_unlock_pack(transport);
>  	trace2_region_leave("fetch", "consume_refs", the_repository);

This transport_unlock_pack() call is being moved out of the trace2
region. I think it is unlikely that the loop of unlink_or_warn() calls
will take enough time to meaningfully affect this region, so it should
be fine to move it.

> +
> +out:
> +	transport_unlock_pack(transport);
>  	return ret;
>  }
...
> -	if (!fetch_refs(transport, ref_map))
> -		consume_refs(transport, ref_map);
> +	fetch_and_consume_refs(transport, ref_map);
>  
>  	if (gsecondary) {
>  		transport_disconnect(gsecondary);
> @@ -1610,7 +1600,7 @@ static int do_fetch(struct transport *transport,
>  				   transport->url);
>  		}
>  	}
> -	if (fetch_refs(transport, ref_map) || consume_refs(transport, ref_map)) {
> +	if (fetch_and_consume_refs(transport, ref_map)) {

The end goal of making these consumers simpler is worth it.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/7] Speed up mirror-fetches with many refs
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
                     ` (7 preceding siblings ...)
  2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
@ 2021-08-25 14:27   ` Derrick Stolee
  8 siblings, 0 replies; 48+ messages in thread
From: Derrick Stolee @ 2021-08-25 14:27 UTC (permalink / raw)
  To: Patrick Steinhardt, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

On 8/24/2021 6:36 AM, Patrick Steinhardt wrote:
> Changes compared to v1:
> 
>     - Patch 1/7: I've applied Stolee's proposal to only
>       opportunistically load objects via the commit-graph in case the
>       reference is not in refs/tags/ such that we don't regress repos
>       with many annotated tags.
> 
>     - Patch 3/7: The return parameter of the iterator is now const to
>       allow further optimizations by the compiler, as suggested by
>       René. I've also re-benchmarked this, and one can now see a very
>       slight performance improvement of ~1%.
> 
>     - Patch 4/7: Added my missing DCO, as pointed out by Junio.
> 
>     - Patch 5, 6, 7: I've redone these to make it clearer that the
>       refactoring I'm doing doesn't cause us to miss any object
>       connectivity checks. Most importantly, I've merged `fetch_refs()`
>       and `consume_refs()` into `fetch_and_consume_refs()` in 6/7, which
>       makes the optimization where we elide the second connectivity
>       check in 7/7 trivial.

These changes are positive. My read-through of this set of patches
turned up only a few nit-picks.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/6] fetch: avoid unpacking headers in object existence check
  2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
@ 2021-08-25 23:44   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 48+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-08-25 23:44 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Jeff King, Junio C Hamano


On Fri, Aug 20 2021, Patrick Steinhardt wrote:

> When updating local refs after the fetch has transferred all objects, we
> do an object existence test as a safety guard to avoid updating a ref to
> an object which we don't have. We do so via `oid_object_info()`: if it
> returns an error, then we know the object does not exist.
>
> One side effect of `oid_object_info()` is that it parses the object's
> type, and to do so it must unpack the object header. This is completely
> pointless: we don't care for the type, but only want to assert that the
> object exists.
>
> Refactor the code to use `repo_has_object_file()`, which both makes the
> code's intent clearer and is also faster because it does not unpack
> object headers. In a real-world repo with 2.3M refs, this results in a
> small speedup when doing a mirror-fetch:
>
>     Benchmark #1: HEAD~: git-fetch
>       Time (mean ± σ):     33.686 s ±  0.176 s    [User: 30.119 s, System: 5.262 s]
>       Range (min … max):   33.512 s … 33.944 s    5 runs
>
>     Benchmark #2: HEAD: git-fetch
>       Time (mean ± σ):     31.247 s ±  0.195 s    [User: 28.135 s, System: 5.066 s]
>       Range (min … max):   30.948 s … 31.472 s    5 runs
>
>     Summary
>       'HEAD: git-fetch' ran
>         1.08 ± 0.01 times faster than 'HEAD~: git-fetch'
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  builtin/fetch.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 73f5b286d5..5fd0f7c791 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -846,13 +846,11 @@ static int update_local_ref(struct ref *ref,
>  			    int summary_width)
>  {
>  	struct commit *current = NULL, *updated;
> -	enum object_type type;
>  	struct branch *current_branch = branch_get(NULL);
>  	const char *pretty_ref = prettify_refname(ref->name);
>  	int fast_forward = 0;
>  
> -	type = oid_object_info(the_repository, &ref->new_oid, NULL);
> -	if (type < 0)
> +	if (!repo_has_object_file(the_repository, &ref->new_oid))
>  		die(_("object %s not found"), oid_to_hex(&ref->new_oid));
>  
>  	if (oideq(&ref->old_oid, &ref->new_oid)) {

I tried grepping the source for any other candidates for a migration to
repo_has_object_file(), but this is the only "type = oid_object_info"
call I could find that didn't care about the type. Perhaps there are
some callers of *_extended() that could be moved over, but that's less
likely, and I didn't check...
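
For reference, the pattern in question side by side (just a sketch; both
calls are exactly as used in the hunk above):

	/* before: unpacks the object header only to learn a type we discard */
	type = oid_object_info(the_repository, &ref->new_oid, NULL);
	if (type < 0)
		die(_("object %s not found"), oid_to_hex(&ref->new_oid));

	/* after: a pure existence check, no header unpacking needed */
	if (!repo_has_object_file(the_repository, &ref->new_oid))
		die(_("object %s not found"), oid_to_hex(&ref->new_oid));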

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable
  2021-08-25 14:19     ` Derrick Stolee
@ 2021-09-01 12:48       ` Patrick Steinhardt
  0 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 12:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 1073 bytes --]

On Wed, Aug 25, 2021 at 10:19:27AM -0400, Derrick Stolee wrote:
> On 8/24/2021 6:37 AM, Patrick Steinhardt wrote:
> > Refactor `fetch_refs()` code to make it more extendable by explicitly
> > handling error cases. The refactored code should behave the same.
[snip]
> > -	if (!ret)
> > -		/*
> > -		 * Keep the new pack's ".keep" file around to allow the caller
> > -		 * time to update refs to reference the new objects.
> > -		 */
> > -		return 0;
> > -	transport_unlock_pack(transport);
> > -	return ret;
> > +
> > +	/*
> > +	 * Keep the new pack's ".keep" file around to allow the caller
> > +	 * time to update refs to reference the new objects.
> > +	 */
> > +	return 0;
> 
> And it happens that 'ret' is zero here. Should we keep returning 'ret'
> or perhaps add an "assert(!ret);" before the return? The assert()
> doesn't do much, but at minimum would serve as an extra indicator to
> anyone working in this method in the future.

The assert isn't really needed: in the subsequent patch, we always
unlock the packfile on exit.
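
That is, with 6/7 applied every exit from the combined function funnels
through a single label that unlocks the pack (shape as in the v3 patches
below):

	out:
		transport_unlock_pack(transport);
		return ret;
	}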

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/7] fetch: merge fetching and consuming refs
  2021-08-25 14:26     ` Derrick Stolee
@ 2021-09-01 12:49       ` Patrick Steinhardt
  0 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 12:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

On Wed, Aug 25, 2021 at 10:26:28AM -0400, Derrick Stolee wrote:
> On 8/24/2021 6:37 AM, Patrick Steinhardt wrote:
> > -		if (ret) {
> > -			transport_unlock_pack(transport);
> > -			return ret;
> > -		}
> > +		if (ret)
> > +			goto out;
> 
> You were just reorganizing this method in the previous patch.
> This "goto out" trick could have applied there instead, which
> wouldn't complicate that patch and would simplify this one.
> 
> But perhaps it would look strange to have the following ending
> to the method, even if for only one patch:
> 
> 	return 0;
> 
> out:
> 	transport_unlock_pack(transport);
> 	return res;
> }
> 
> So, feel free to ignore me here. Decide based on your taste.

I think you've got a point, I'll change this.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 0/7] Speed up mirror-fetches with many refs
  2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
                   ` (8 preceding siblings ...)
  2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
@ 2021-09-01 13:09 ` Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
                     ` (7 more replies)
  9 siblings, 8 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 5048 bytes --]

Hi,

this is the third version of my patch series to speed up mirror-fetches
with many refs. This patch series applies on top of master with
ps/connectivity-optim merged into it.

There are only some smallish changes based on Stolee's feedback (thanks
for that!):

    - A small typo in 1/7.

    - A conflict fix in 4/7 required now because it's based on master
      instead of directly on my merged topic.

    - I've adjusted patch 5/7 such that I don't have to re-touch the
      logic in 6/7.

Patrick

Patrick Steinhardt (7):
  fetch: speed up lookup of want refs via commit-graph
  fetch: avoid unpacking headers in object existence check
  connected: refactor iterator to return next object ID directly
  fetch-pack: optimize loading of refs via commit graph
  fetch: refactor fetch refs to be more extendable
  fetch: merge fetching and consuming refs
  fetch: avoid second connectivity check if we already have all objects

 builtin/clone.c        |  8 ++---
 builtin/fetch.c        | 74 +++++++++++++++++++++++-------------------
 builtin/receive-pack.c | 17 ++++------
 connected.c            | 15 +++++----
 connected.h            |  2 +-
 fetch-pack.c           | 12 ++++---
 6 files changed, 67 insertions(+), 61 deletions(-)

Range-diff against v2:
1:  4a819a6830 ! 1:  8214f04971 fetch: speed up lookup of want refs via commit-graph
    @@ Commit message
         that we repeatedly need to unpack object headers for each of the
         referenced objects.
     
    -    Speed this up by opportunistcally trying to resolve object IDs via the
    +    Speed this up by opportunistically trying to resolve object IDs via the
         commit graph. We only do so for any refs which are not in "refs/tags":
         more likely than not, these are going to be a commit anyway, and this
         lets us avoid having to unpack object headers completely in case the
2:  81ebadabe8 = 2:  991a27cb82 fetch: avoid unpacking headers in object existence check
3:  98e981ced9 = 3:  ba834803ab connected: refactor iterator to return next object ID directly
4:  6311203f08 ! 4:  99d3316d48 fetch-pack: optimize loading of refs via commit graph
    @@ fetch-pack.c: static struct commit *deref_without_lazy_fetch(const struct object
      
      	while (1) {
      		if (oid_object_info_extended(the_repository, oid, &info,
    -@@ fetch-pack.c: static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
    - 	}
    - 
    - 	if (type == OBJ_COMMIT) {
    --		struct commit *commit = lookup_commit(the_repository, oid);
    -+		commit = lookup_commit(the_repository, oid);
    - 		if (!commit || repo_parse_commit(the_repository, commit))
    - 			return NULL;
    - 		return commit;
5:  56a9158ac3 ! 5:  d64888e072 fetch: refactor fetch refs to be more extendable
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
      		trace2_region_enter("fetch", "fetch_refs", the_repository);
      		ret = transport_fetch_refs(transport, ref_map);
      		trace2_region_leave("fetch", "fetch_refs", the_repository);
    -+		if (ret) {
    -+			transport_unlock_pack(transport);
    -+			return ret;
    -+		}
    ++		if (ret)
    ++			goto out;
      	}
     -	if (!ret)
     -		/*
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
     -		 * time to update refs to reference the new objects.
     -		 */
     -		return 0;
    --	transport_unlock_pack(transport);
    --	return ret;
     +
     +	/*
     +	 * Keep the new pack's ".keep" file around to allow the caller
     +	 * time to update refs to reference the new objects.
     +	 */
    -+	return 0;
    ++	return ret;
    ++
    ++out:
    + 	transport_unlock_pack(transport);
    + 	return ret;
      }
    - 
    - /* Update local refs based on the ref values fetched from a remote */
6:  31d9f72edf ! 6:  56ecbfc9c3 fetch: merge fetching and consuming refs
    @@ builtin/fetch.c: static int check_exist_and_connected(struct ref *ref_map)
      
      	/*
     @@ builtin/fetch.c: static int fetch_refs(struct transport *transport, struct ref *ref_map)
    - 		trace2_region_enter("fetch", "fetch_refs", the_repository);
    - 		ret = transport_fetch_refs(transport, ref_map);
    - 		trace2_region_leave("fetch", "fetch_refs", the_repository);
    --		if (ret) {
    --			transport_unlock_pack(transport);
    --			return ret;
    --		}
    -+		if (ret)
    -+			goto out;
    + 			goto out;
      	}
      
     -	/*
     -	 * Keep the new pack's ".keep" file around to allow the caller
     -	 * time to update refs to reference the new objects.
     -	 */
    --	return 0;
    +-	return ret;
    +-
    +-out:
    +-	transport_unlock_pack(transport);
    +-	return ret;
     -}
     -
     -/* Update local refs based on the ref values fetched from a remote */
7:  84e39c847f = 7:  c342fc0c69 fetch: avoid second connectivity check if we already have all objects
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
@ 2021-09-01 13:09   ` Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 3347 bytes --]

When updating our local refs based on the refs fetched from the remote,
we need to iterate through all requested refs and load their respective
commits such that we can determine whether they need to be appended to
FETCH_HEAD or not. In cases where we're fetching from a remote with
exceedingly many refs, resolving these refs can be quite expensive given
that we repeatedly need to unpack object headers for each of the
referenced objects.

Speed this up by opportunistically trying to resolve object IDs via the
commit graph. We only do so for any refs which are not in "refs/tags":
more likely than not, these are going to be commits anyway, and this
lets us avoid having to unpack object headers completely in case the
object is a commit that is part of the commit-graph. This significantly
speeds up mirror-fetches in a real-world repository with
2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     56.482 s ±  0.384 s    [User: 53.340 s, System: 5.365 s]
      Range (min … max):   56.050 s … 57.045 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     33.727 s ±  0.170 s    [User: 30.252 s, System: 5.194 s]
      Range (min … max):   33.452 s … 33.871 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.67 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 25740c13df..bd7c0da232 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1074,7 +1074,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 			      int connectivity_checked, struct ref *ref_map)
 {
 	struct fetch_head fetch_head;
-	struct commit *commit;
 	int url_len, i, rc = 0;
 	struct strbuf note = STRBUF_INIT, err = STRBUF_INIT;
 	struct ref_transaction *transaction = NULL;
@@ -1122,6 +1121,7 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	     want_status <= FETCH_HEAD_IGNORE;
 	     want_status++) {
 		for (rm = ref_map; rm; rm = rm->next) {
+			struct commit *commit = NULL;
 			struct ref *ref = NULL;
 
 			if (rm->status == REF_STATUS_REJECT_SHALLOW) {
@@ -1131,11 +1131,23 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 				continue;
 			}
 
-			commit = lookup_commit_reference_gently(the_repository,
-								&rm->old_oid,
-								1);
-			if (!commit)
-				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			/*
+			 * References in "refs/tags/" are often going to point
+			 * to annotated tags, which are not part of the
+			 * commit-graph. We thus only try to look up refs in
+			 * the graph which are not in that namespace to not
+			 * regress performance in repositories with many
+			 * annotated tags.
+			 */
+			if (!starts_with(rm->name, "refs/tags/"))
+				commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
+			if (!commit) {
+				commit = lookup_commit_reference_gently(the_repository,
+									&rm->old_oid,
+									1);
+				if (!commit)
+					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			}
 
 			if (rm->fetch_head_status != want_status)
 				continue;
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
@ 2021-09-01 13:09   ` Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

When updating local refs after the fetch has transferred all objects, we
do an object existence test as a safety guard to avoid updating a ref to
an object which we don't have. We do so via `oid_object_info()`: if it
returns an error, then we know the object does not exist.

One side effect of `oid_object_info()` is that it parses the object's
type, and to do so it must unpack the object header. This is completely
pointless: we don't care for the type, but only want to assert that the
object exists.

Refactor the code to use `repo_has_object_file()`, which both makes the
code's intent clearer and is also faster because it does not unpack
object headers. In a real-world repo with 2.3M refs, this results in a
small speedup when doing a mirror-fetch:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     33.686 s ±  0.176 s    [User: 30.119 s, System: 5.262 s]
      Range (min … max):   33.512 s … 33.944 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.247 s ±  0.195 s    [User: 28.135 s, System: 5.066 s]
      Range (min … max):   30.948 s … 31.472 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.08 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index bd7c0da232..0b18c47732 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -846,13 +846,11 @@ static int update_local_ref(struct ref *ref,
 			    int summary_width)
 {
 	struct commit *current = NULL, *updated;
-	enum object_type type;
 	struct branch *current_branch = branch_get(NULL);
 	const char *pretty_ref = prettify_refname(ref->name);
 	int fast_forward = 0;
 
-	type = oid_object_info(the_repository, &ref->new_oid, NULL);
-	if (type < 0)
+	if (!repo_has_object_file(the_repository, &ref->new_oid))
 		die(_("object %s not found"), oid_to_hex(&ref->new_oid));
 
 	if (oideq(&ref->old_oid, &ref->new_oid)) {
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 3/7] connected: refactor iterator to return next object ID directly
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
@ 2021-09-01 13:09   ` Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 8508 bytes --]

The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just return the next object ID directly and
use a `NULL` pointer to indicate that the iterator has no items left.
Furthermore, this avoids a copy of the object ID.

Refactor the iterator and all its implementations to return object IDs
directly. This brings a tiny performance improvement when doing a
mirror-fetch of a repository with about 2.3M refs:

    Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
      Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
      Range (min … max):   29.934 s … 30.406 s    10 runs

    Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
      Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
      Range (min … max):   29.696 s … 29.996 s    10 runs

    Summary
      '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
        1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'

While this 1% speedup could be labelled as statistically insignificant,
the speedup is consistent on my machine. Furthermore, this is an
end-to-end test, so it is expected that the improvement in the connectivity
check itself is more significant.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c        |  8 +++-----
 builtin/fetch.c        |  7 +++----
 builtin/receive-pack.c | 17 +++++++----------
 connected.c            | 15 ++++++++-------
 connected.h            |  2 +-
 fetch-pack.c           |  7 +++----
 6 files changed, 25 insertions(+), 31 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 66fe66679c..4a1056fcc2 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,7 @@ static void write_followtags(const struct ref *refs, const char *msg)
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -668,13 +668,11 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	 */
 	while (ref && !ref->peer_ref)
 		ref = ref->next;
-	/* Returning -1 notes "end of list" to the caller. */
 	if (!ref)
-		return -1;
+		return NULL;
 
-	oidcpy(oid, &ref->old_oid);
 	*rm = ref->next;
-	return 0;
+	return &ref->old_oid;
 }
 
 static void update_remote_refs(const struct ref *refs,
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0b18c47732..f67fe543ad 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -962,7 +962,7 @@ static int update_local_ref(struct ref *ref,
 	}
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
@@ -970,10 +970,9 @@ static int iterate_ref_map(void *cb_data, struct object_id *oid)
 	while (ref && ref->status == REF_STATUS_REJECT_SHALLOW)
 		ref = ref->next;
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct fetch_head {
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index 2d1f97e1ca..041e915454 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1306,7 +1306,7 @@ static void refuse_unconfigured_deny_delete_current(void)
 	rp_error("%s", _(refuse_unconfigured_deny_delete_current_msg));
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid);
+static const struct object_id *command_singleton_iterator(void *cb_data);
 static int update_shallow_ref(struct command *cmd, struct shallow_info *si)
 {
 	struct shallow_lock shallow_lock = SHALLOW_LOCK_INIT;
@@ -1731,16 +1731,15 @@ static void check_aliased_updates(struct command *commands)
 	string_list_clear(&ref_list, 0);
 }
 
-static int command_singleton_iterator(void *cb_data, struct object_id *oid)
+static const struct object_id *command_singleton_iterator(void *cb_data)
 {
 	struct command **cmd_list = cb_data;
 	struct command *cmd = *cmd_list;
 
 	if (!cmd || is_null_oid(&cmd->new_oid))
-		return -1; /* end of list */
+		return NULL;
 	*cmd_list = NULL; /* this returns only one */
-	oidcpy(oid, &cmd->new_oid);
-	return 0;
+	return &cmd->new_oid;
 }
 
 static void set_connectivity_errors(struct command *commands,
@@ -1770,7 +1769,7 @@ struct iterate_data {
 	struct shallow_info *si;
 };
 
-static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_receive_command_list(void *cb_data)
 {
 	struct iterate_data *data = cb_data;
 	struct command **cmd_list = &data->cmds;
@@ -1781,13 +1780,11 @@ static int iterate_receive_command_list(void *cb_data, struct object_id *oid)
 			/* to be checked in update_shallow_ref() */
 			continue;
 		if (!is_null_oid(&cmd->new_oid) && !cmd->skip_update) {
-			oidcpy(oid, &cmd->new_oid);
 			*cmd_list = cmd->next;
-			return 0;
+			return &cmd->new_oid;
 		}
 	}
-	*cmd_list = NULL;
-	return -1; /* end of list */
+	return NULL;
 }
 
 static void reject_updates_to_hidden(struct command *commands)
diff --git a/connected.c b/connected.c
index b5f9523a5f..cf68e37a97 100644
--- a/connected.c
+++ b/connected.c
@@ -24,7 +24,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 	struct child_process rev_list = CHILD_PROCESS_INIT;
 	FILE *rev_list_in;
 	struct check_connected_options defaults = CHECK_CONNECTED_INIT;
-	struct object_id oid;
+	const struct object_id *oid;
 	int err = 0;
 	struct packed_git *new_pack = NULL;
 	struct transport *transport;
@@ -34,7 +34,8 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		opt = &defaults;
 	transport = opt->transport;
 
-	if (fn(cb_data, &oid)) {
+	oid = fn(cb_data);
+	if (!oid) {
 		if (opt->err_fd)
 			close(opt->err_fd);
 		return err;
@@ -73,7 +74,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			for (p = get_all_packs(the_repository); p; p = p->next) {
 				if (!p->pack_promisor)
 					continue;
-				if (find_pack_entry_one(oid.hash, p))
+				if (find_pack_entry_one(oid->hash, p))
 					goto promisor_pack_found;
 			}
 			/*
@@ -83,7 +84,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 			goto no_promisor_pack_found;
 promisor_pack_found:
 			;
-		} while (!fn(cb_data, &oid));
+		} while ((oid = fn(cb_data)) != NULL);
 		return 0;
 	}
 
@@ -133,12 +134,12 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		 * are sure the ref is good and not sending it to
 		 * rev-list for verification.
 		 */
-		if (new_pack && find_pack_entry_one(oid.hash, new_pack))
+		if (new_pack && find_pack_entry_one(oid->hash, new_pack))
 			continue;
 
-		if (fprintf(rev_list_in, "%s\n", oid_to_hex(&oid)) < 0)
+		if (fprintf(rev_list_in, "%s\n", oid_to_hex(oid)) < 0)
 			break;
-	} while (!fn(cb_data, &oid));
+	} while ((oid = fn(cb_data)) != NULL);
 
 	if (ferror(rev_list_in) || fflush(rev_list_in)) {
 		if (errno != EPIPE && errno != EINVAL)
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..6e59c92aa3 100644
--- a/connected.h
+++ b/connected.h
@@ -9,7 +9,7 @@ struct transport;
  * When called after returning the name for the last object, return -1
  * to signal EOF, otherwise return 0.
  */
-typedef int (*oid_iterate_fn)(void *, struct object_id *oid);
+typedef const struct object_id *(*oid_iterate_fn)(void *);
 
 /*
  * Named-arguments struct for check_connected. All arguments are
diff --git a/fetch-pack.c b/fetch-pack.c
index b0c7be717c..7b0e69884d 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1906,16 +1906,15 @@ static void update_shallow(struct fetch_pack_args *args,
 	oid_array_clear(&ref);
 }
 
-static int iterate_ref_map(void *cb_data, struct object_id *oid)
+static const struct object_id *iterate_ref_map(void *cb_data)
 {
 	struct ref **rm = cb_data;
 	struct ref *ref = *rm;
 
 	if (!ref)
-		return -1; /* end of the list */
+		return NULL;
 	*rm = ref->next;
-	oidcpy(oid, &ref->old_oid);
-	return 0;
+	return &ref->old_oid;
 }
 
 struct ref *fetch_pack(struct fetch_pack_args *args,
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2021-09-01 13:09   ` [PATCH v3 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
@ 2021-09-01 13:09   ` Patrick Steinhardt
  2021-09-01 13:09   ` [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 1753 bytes --]

In order to negotiate a packfile, we need to dereference refs to see
which commits we have in common with the remote. To do so, we first look
up the object's type -- if it's a tag, we peel until we hit a non-tag
object. If we hit a commit eventually, then we return that commit.

In case the object ID points to a commit directly, we can avoid the
initial lookup of the object type by opportunistically looking up the
commit via the commit-graph, if available, which gives us a slight speed
bump of about 2% in a huge repository with about 2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     31.634 s ±  0.258 s    [User: 28.400 s, System: 5.090 s]
      Range (min … max):   31.280 s … 31.896 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     31.129 s ±  0.543 s    [User: 27.976 s, System: 5.056 s]
      Range (min … max):   30.172 s … 31.479 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.02 ± 0.02 times faster than 'HEAD~: git-fetch'

In case this fails, we fall back to the old code, which peels the
object down to a commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 fetch-pack.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index 7b0e69884d..da92a5e474 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -119,6 +119,11 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 {
 	enum object_type type;
 	struct object_info info = { .typep = &type };
+	struct commit *commit;
+
+	commit = lookup_commit_in_graph(the_repository, oid);
+	if (commit)
+		return commit;
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2021-09-01 13:09   ` [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
@ 2021-09-01 13:09   ` Patrick Steinhardt
  2021-09-01 13:10   ` [PATCH v3 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:09 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

Refactor `fetch_refs()` code to make it more extendable by explicitly
handling error cases. The refactored code should behave the same.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index f67fe543ad..bafeb3d44b 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1293,18 +1293,28 @@ static int check_exist_and_connected(struct ref *ref_map)
 
 static int fetch_refs(struct transport *transport, struct ref *ref_map)
 {
-	int ret = check_exist_and_connected(ref_map);
+	int ret;
+
+	/*
+	 * We don't need to perform a fetch in case we can already satisfy all
+	 * refs.
+	 */
+	ret = check_exist_and_connected(ref_map);
 	if (ret) {
 		trace2_region_enter("fetch", "fetch_refs", the_repository);
 		ret = transport_fetch_refs(transport, ref_map);
 		trace2_region_leave("fetch", "fetch_refs", the_repository);
+		if (ret)
+			goto out;
 	}
-	if (!ret)
-		/*
-		 * Keep the new pack's ".keep" file around to allow the caller
-		 * time to update refs to reference the new objects.
-		 */
-		return 0;
+
+	/*
+	 * Keep the new pack's ".keep" file around to allow the caller
+	 * time to update refs to reference the new objects.
+	 */
+	return ret;
+
+out:
 	transport_unlock_pack(transport);
 	return ret;
 }
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 6/7] fetch: merge fetching and consuming refs
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2021-09-01 13:09   ` [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
@ 2021-09-01 13:10   ` Patrick Steinhardt
  2021-09-01 13:10   ` [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
  2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:10 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

The functions `fetch_refs()` and `consume_refs()` must always be called
together such that we first obtain all missing objects and then update
our local refs to match the remote refs. In a subsequent patch, we'll
further require that `fetch_refs()` must always be called before
`consume_refs()` such that it can correctly assert that we have all
objects after the fetch given that we're about to move the connectivity
check.

Make this requirement explicit by merging both functions into a single
`fetch_and_consume_refs()` function.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index bafeb3d44b..09f7742d0f 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1291,8 +1291,9 @@ static int check_exist_and_connected(struct ref *ref_map)
 	return check_connected(iterate_ref_map, &rm, &opt);
 }
 
-static int fetch_refs(struct transport *transport, struct ref *ref_map)
+static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_map)
 {
+	int connectivity_checked;
 	int ret;
 
 	/*
@@ -1308,30 +1309,18 @@ static int fetch_refs(struct transport *transport, struct ref *ref_map)
 			goto out;
 	}
 
-	/*
-	 * Keep the new pack's ".keep" file around to allow the caller
-	 * time to update refs to reference the new objects.
-	 */
-	return ret;
-
-out:
-	transport_unlock_pack(transport);
-	return ret;
-}
-
-/* Update local refs based on the ref values fetched from a remote */
-static int consume_refs(struct transport *transport, struct ref *ref_map)
-{
-	int connectivity_checked = transport->smart_options
+	connectivity_checked = transport->smart_options
 		? transport->smart_options->connectivity_checked : 0;
-	int ret;
+
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
 				 connectivity_checked,
 				 ref_map);
-	transport_unlock_pack(transport);
 	trace2_region_leave("fetch", "consume_refs", the_repository);
+
+out:
+	transport_unlock_pack(transport);
 	return ret;
 }
 
@@ -1518,8 +1507,7 @@ static void backfill_tags(struct transport *transport, struct ref *ref_map)
 	transport_set_option(transport, TRANS_OPT_FOLLOWTAGS, NULL);
 	transport_set_option(transport, TRANS_OPT_DEPTH, "0");
 	transport_set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, NULL);
-	if (!fetch_refs(transport, ref_map))
-		consume_refs(transport, ref_map);
+	fetch_and_consume_refs(transport, ref_map);
 
 	if (gsecondary) {
 		transport_disconnect(gsecondary);
@@ -1610,7 +1598,7 @@ static int do_fetch(struct transport *transport,
 				   transport->url);
 		}
 	}
-	if (fetch_refs(transport, ref_map) || consume_refs(transport, ref_map)) {
+	if (fetch_and_consume_refs(transport, ref_map)) {
 		free_refs(ref_map);
 		retcode = 1;
 		goto cleanup;
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2021-09-01 13:10   ` [PATCH v3 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
@ 2021-09-01 13:10   ` Patrick Steinhardt
  2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
  7 siblings, 0 replies; 48+ messages in thread
From: Patrick Steinhardt @ 2021-09-01 13:10 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Derrick Stolee, René Scharfe

[-- Attachment #1: Type: text/plain, Size: 2303 bytes --]

When fetching refs, we are doing two connectivity checks:

    - The first one is done such that we can skip fetching refs in the
      case where we already have all objects referenced by the updated
      set of refs.

    - The second one verifies that we have all objects after we have
      fetched objects.

We always execute both connectivity checks, but this is wasteful in case
the first connectivity check already notices that we have all objects
locally available.

Skip the second connectivity check in case we already had all objects
available. This gives us a nice speedup when doing a mirror-fetch in a
repository with about 2.3M refs where the fetching repo already has all
objects:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     30.025 s ±  0.081 s    [User: 27.070 s, System: 4.933 s]
      Range (min … max):   29.900 s … 30.111 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     25.574 s ±  0.177 s    [User: 22.855 s, System: 4.683 s]
      Range (min … max):   25.399 s … 25.765 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.17 ± 0.01 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 09f7742d0f..6fdd01fb01 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1293,7 +1293,7 @@ static int check_exist_and_connected(struct ref *ref_map)
 
 static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_map)
 {
-	int connectivity_checked;
+	int connectivity_checked = 1;
 	int ret;
 
 	/*
@@ -1307,11 +1307,10 @@ static int fetch_and_consume_refs(struct transport *transport, struct ref *ref_m
 		trace2_region_leave("fetch", "fetch_refs", the_repository);
 		if (ret)
 			goto out;
+		connectivity_checked = transport->smart_options ?
+			transport->smart_options->connectivity_checked : 0;
 	}
 
-	connectivity_checked = transport->smart_options
-		? transport->smart_options->connectivity_checked : 0;
-
 	trace2_region_enter("fetch", "consume_refs", the_repository);
 	ret = store_updated_refs(transport->url,
 				 transport->remote->name,
-- 
2.33.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/7] Speed up mirror-fetches with many refs
  2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2021-09-01 13:10   ` [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
@ 2021-09-01 19:58   ` Junio C Hamano
  2021-09-08  0:08     ` Junio C Hamano
  7 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2021-09-01 19:58 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, René Scharfe

Patrick Steinhardt <ps@pks.im> writes:

> There's only some smallish changes based on Stolee's feedback (thanks
> for that!):
>
>     - A small typo in 1/7.
>
>     - A conflict fix in 4/7 required now because it's based on master
>       instead of directly on my merged topic.
>
>     - I've adjusted patch 5/7 such that I don't have to re-touch the
>       logic in 6/7.

OK, as a sanity check on my end, I've tried to apply this on master
and then to merge the result with 'ps/connectivity-optim' topic
(i.e. expected endgame for this and the other topic if both of them
graduated today without any other topics), and then tried a merge on
master to the previous round (i.e. with the other topic already in),
and did not see any difference in the end result.

Will replace.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 0/7] Speed up mirror-fetches with many refs
  2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
@ 2021-09-08  0:08     ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2021-09-08  0:08 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, René Scharfe

Junio C Hamano <gitster@pobox.com> writes:

> Patrick Steinhardt <ps@pks.im> writes:
>
>> There's only some smallish changes based on Stolee's feedback (thanks
>> for that!):
>>
>>     - A small typo in 1/7.
>>
>>     - A conflict fix in 4/7 required now because it's based on master
>>       instead of directly on my merged topic.
>>
>>     - I've adjusted patch 5/7 such that I don't have to re-touch the
>>       logic in 6/7.
>
> OK, as a sanity check on my end, I've tried to apply this on master
> and then to merge the result with 'ps/connectivity-optim' topic
> (i.e. expected endgame for this and the other topic if both of them
> graduated today without any other topics), and then tried a merge on
> master to the previous round (i.e. has the other topic already in),
> and did not see any difference in the end result.
>
> Will replace.

OK.  Is everybody happy with this version?  I am about to mark it
for 'next' in the next issue of the "What's cooking" report, so please
holler if I should wait.

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-09-08  0:09 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-20 10:08 [PATCH 0/6] Speed up mirror-fetches with many refs Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 1/6] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-20 14:27   ` Derrick Stolee
2021-08-20 17:18     ` Junio C Hamano
2021-08-23  6:46       ` Patrick Steinhardt
2021-08-25 14:12         ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 2/6] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-25 23:44   ` Ævar Arnfjörð Bjarmason
2021-08-20 10:08 ` [PATCH 3/6] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-20 14:32   ` Derrick Stolee
2021-08-20 17:43     ` Junio C Hamano
2021-08-20 17:43   ` René Scharfe
2021-08-23  6:47     ` Patrick Steinhardt
2021-08-20 10:08 ` [PATCH 4/6] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-20 14:37   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 5/6] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-20 14:41   ` Derrick Stolee
2021-08-20 10:08 ` [PATCH 6/6] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-20 14:47   ` Derrick Stolee
2021-08-23  6:52     ` Patrick Steinhardt
2021-08-20 14:50 ` [PATCH 0/6] Speed up mirror-fetches with many refs Derrick Stolee
2021-08-21  0:09 ` Junio C Hamano
2021-08-24 10:36 ` [PATCH v2 0/7] " Patrick Steinhardt
2021-08-24 10:36   ` [PATCH v2 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-08-25 14:16     ` Derrick Stolee
2021-08-24 10:37   ` [PATCH v2 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-08-25 14:19     ` Derrick Stolee
2021-09-01 12:48       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-08-25 14:26     ` Derrick Stolee
2021-09-01 12:49       ` Patrick Steinhardt
2021-08-24 10:37   ` [PATCH v2 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-08-24 22:48   ` [PATCH v2 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-08-25  6:04     ` Patrick Steinhardt
2021-08-25 14:27   ` Derrick Stolee
2021-09-01 13:09 ` [PATCH v3 " Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 1/7] fetch: speed up lookup of want refs via commit-graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 2/7] fetch: avoid unpacking headers in object existence check Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 3/7] connected: refactor iterator to return next object ID directly Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 4/7] fetch-pack: optimize loading of refs via commit graph Patrick Steinhardt
2021-09-01 13:09   ` [PATCH v3 5/7] fetch: refactor fetch refs to be more extendable Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 6/7] fetch: merge fetching and consuming refs Patrick Steinhardt
2021-09-01 13:10   ` [PATCH v3 7/7] fetch: avoid second connectivity check if we already have all objects Patrick Steinhardt
2021-09-01 19:58   ` [PATCH v3 0/7] Speed up mirror-fetches with many refs Junio C Hamano
2021-09-08  0:08     ` Junio C Hamano
