Git Mailing List Archive on lore.kernel.org
* [PATCH] connected: always use partial clone optimization
@ 2020-03-20 22:00 Jonathan Tan
  2020-03-20 22:54 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jonathan Tan @ 2020-03-20 22:00 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

With 50033772d5 ("connected: verify promisor-ness of partial clone",
2020-01-30), the fast path (checking promisor packs) in
check_connected() now passes a subset of the slow path (rev-list) - if
all objects to be checked are found in promisor packs, both the fast
path and the slow path will pass; otherwise, the fast path will
definitely not pass. This means that we can always attempt the fast path
whenever we need to do the slow path.

The fast path is currently guarded by a flag; therefore, remove that
flag. Also, make the fast path fall back to the slow path - if the fast
path fails, the failing OID and all remaining OIDs will be passed to
rev-list.

The main user-visible benefit is the performance of fetch from a partial
clone - specifically, the speedup of the connectivity check done before
the fetch. In particular, a no-op fetch into a partial clone on my
computer was sped up from 7 seconds to 0.01 seconds. This is a
complement to the work in 2df1aa239c ("fetch: forgo full
connectivity check if --filter", 2020-01-30), which is the child of the
aforementioned 50033772d5. In that commit, the connectivity check
*after* the fetch was sped up.

The addition of the fast path might cause performance reductions in
these cases:

 - If a partial clone or a fetch into a partial clone fails, Git will
   fruitlessly run rev-list (it is expected that everything fetched
   would go into promisor packs, so if that didn't happen, it is most
   likely that rev-list will fail too).

 - Any connectivity checks done by receive-pack, in the (in my opinion,
   unlikely) event that a partial clone serves receive-pack.

I think that these cases are rare enough, and the performance reduction
in them minor enough (an additional object DB access), that the
benefit of avoiding a flag outweighs the cost.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is the second half of the work I did previously [1]. Quoting from
[1]:

> For example, a local fetch was sped up from 6.63s to 3.39s. The bulk of
> the remaining time is spent in yet another connectivity check
> (fetch_refs -> check_exist_and_connected) prior to the fetch - that will
> hopefully be done in a subsequent patch.

This is the subsequent patch. (Note that the timings were done on
another computer, so don't compare the timings from [1] and this patch
directly.)

[1] https://lore.kernel.org/git/be1d6aa4c4fd8868f3682b73c01a92d3830534ad.1578802317.git.jonathantanmy@google.com/
---
 builtin/clone.c | 7 ++-----
 builtin/fetch.c | 7 -------
 connected.c     | 9 +++++++--
 connected.h     | 9 ---------
 4 files changed, 9 insertions(+), 23 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 1ad26f4d8c..4b2b14ff61 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -672,8 +672,7 @@ static void update_remote_refs(const struct ref *refs,
 			       const char *branch_top,
 			       const char *msg,
 			       struct transport *transport,
-			       int check_connectivity,
-			       int check_refs_are_promisor_objects_only)
+			       int check_connectivity)
 {
 	const struct ref *rm = mapped_refs;
 
@@ -682,8 +681,6 @@ static void update_remote_refs(const struct ref *refs,
 
 		opt.transport = transport;
 		opt.progress = transport->progress;
-		opt.check_refs_are_promisor_objects_only =
-			!!check_refs_are_promisor_objects_only;
 
 		if (check_connected(iterate_ref_map, &rm, &opt))
 			die(_("remote did not send all necessary objects"));
@@ -1275,7 +1272,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
 			   branch_top.buf, reflog_msg.buf, transport,
-			   !is_local, filter_options.choice);
+			   !is_local);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
diff --git a/builtin/fetch.c b/builtin/fetch.c
index bf6bab80fa..1097e1e512 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -908,13 +908,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	if (!connectivity_checked) {
 		struct check_connected_options opt = CHECK_CONNECTED_INIT;
 
-		if (filter_options.choice)
-			/*
-			 * Since a filter is specified, objects indirectly
-			 * referenced by refs are allowed to be absent.
-			 */
-			opt.check_refs_are_promisor_objects_only = 1;
-
 		rm = ref_map;
 		if (check_connected(iterate_ref_map, &rm, &opt)) {
 			rc = error(_("%s did not send all necessary objects\n"), url);
diff --git a/connected.c b/connected.c
index 7e9bd1bc62..846f2e4eef 100644
--- a/connected.c
+++ b/connected.c
@@ -52,7 +52,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		strbuf_release(&idx_file);
 	}
 
-	if (opt->check_refs_are_promisor_objects_only) {
+	if (has_promisor_remote()) {
 		/*
 		 * For partial clones, we don't want to have to do a regular
 		 * connectivity check because we have to enumerate and exclude
@@ -71,13 +71,18 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 				if (find_pack_entry_one(oid.hash, p))
 					goto promisor_pack_found;
 			}
-			return 1;
+			/*
+			 * Fallback to rev-list with oid and the rest of the
+			 * object IDs provided by fn.
+			 */
+			goto no_promisor_pack_found;
 promisor_pack_found:
 			;
 		} while (!fn(cb_data, &oid));
 		return 0;
 	}
 
+no_promisor_pack_found:
 	if (opt->shallow_file) {
 		argv_array_push(&rev_list.args, "--shallow-file");
 		argv_array_push(&rev_list.args, opt->shallow_file);
diff --git a/connected.h b/connected.h
index eba5c261ba..8d5a6b3ad6 100644
--- a/connected.h
+++ b/connected.h
@@ -46,15 +46,6 @@ struct check_connected_options {
 	 * during a fetch.
 	 */
 	unsigned is_deepening_fetch : 1;
-
-	/*
-	 * If non-zero, only check that the top-level objects referenced by the
-	 * wanted refs (passed in as cb_data) are promisor objects. This is
-	 * useful for partial clones, where enumerating and excluding all
-	 * promisor objects is very slow and the commit-walk itself becomes a
-	 * no-op.
-	 */
-	unsigned check_refs_are_promisor_objects_only : 1;
 };
 
 #define CHECK_CONNECTED_INIT { 0 }
-- 
2.25.1.696.g5e7596f4ac-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] connected: always use partial clone optimization
  2020-03-20 22:00 [PATCH] connected: always use partial clone optimization Jonathan Tan
@ 2020-03-20 22:54 ` Junio C Hamano
  2020-03-26 19:01 ` Josh Steadmon
  2020-03-26 21:11 ` Emily Shaffer
  2 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2020-03-20 22:54 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> The addition of the fast path might cause performance reductions in
> these cases:
>
>  - If a partial clone or a fetch into a partial clone fails, Git will
>    fruitlessly run rev-list (it is expected that everything fetched
>    would go into promisor packs, so if that didn't happen, it is most
>    likely that rev-list will fail too).

I agree that it is reasonable not to optimize the system for the
failure case---it is pointless to fail as quickly as possible ;-)

>  - Any connectivity checks done by receive-pack, in the (in my opinion,
>    unlikely) event that a partial clone serves receive-pack.

Meaning "I created this repository by partially cloning, and now
I want to update it by pushing into it"?  I am not sure how rare
such a use case is, so I won't be a good judge on this one.

> diff --git a/builtin/clone.c b/builtin/clone.c
> index 1ad26f4d8c..4b2b14ff61 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -672,8 +672,7 @@ static void update_remote_refs(const struct ref *refs,
>  			       const char *branch_top,
>  			       const char *msg,
>  			       struct transport *transport,
> -			       int check_connectivity,
> -			       int check_refs_are_promisor_objects_only)
> +			       int check_connectivity)
>  {
>  	const struct ref *rm = mapped_refs;
>  
> @@ -682,8 +681,6 @@ static void update_remote_refs(const struct ref *refs,
>  
>  		opt.transport = transport;
>  		opt.progress = transport->progress;
> -		opt.check_refs_are_promisor_objects_only =
> -			!!check_refs_are_promisor_objects_only;
>  
>  		if (check_connected(iterate_ref_map, &rm, &opt))
>  			die(_("remote did not send all necessary objects"));
> @@ -1275,7 +1272,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  
>  	update_remote_refs(refs, mapped_refs, remote_head_points_at,
>  			   branch_top.buf, reflog_msg.buf, transport,
> -			   !is_local, filter_options.choice);
> +			   !is_local);
>  
>  	update_head(our_head_points_at, remote_head, reflog_msg.buf);
>  
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index bf6bab80fa..1097e1e512 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -908,13 +908,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
>  	if (!connectivity_checked) {
>  		struct check_connected_options opt = CHECK_CONNECTED_INIT;
>  
> -		if (filter_options.choice)
> -			/*
> -			 * Since a filter is specified, objects indirectly
> -			 * referenced by refs are allowed to be absent.
> -			 */
> -			opt.check_refs_are_promisor_objects_only = 1;
> -
>  		rm = ref_map;
>  		if (check_connected(iterate_ref_map, &rm, &opt)) {
>  			rc = error(_("%s did not send all necessary objects\n"), url);
> diff --git a/connected.c b/connected.c
> index 7e9bd1bc62..846f2e4eef 100644
> --- a/connected.c
> +++ b/connected.c
> @@ -52,7 +52,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  		strbuf_release(&idx_file);
>  	}
>  
> -	if (opt->check_refs_are_promisor_objects_only) {
> +	if (has_promisor_remote()) {

Earlier we would have this bit on only when filter_options.choice
was non-NULL but this allows us to take the branch as long as we
are lazily populated.  Makes sense.

> @@ -71,13 +71,18 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  				if (find_pack_entry_one(oid.hash, p))
>  					goto promisor_pack_found;
>  			}
> -			return 1;
> +			/*
> +			 * Fallback to rev-list with oid and the rest of the
> +			 * object IDs provided by fn.
> +			 */
> +			goto no_promisor_pack_found;

OK.  This is the "fallback" thing you mentioned in the log message.
Makes sense.

>  promisor_pack_found:
>  			;
>  		} while (!fn(cb_data, &oid));
>  		return 0;
>  	}
>  
> +no_promisor_pack_found:
>  	if (opt->shallow_file) {
>  		argv_array_push(&rev_list.args, "--shallow-file");
>  		argv_array_push(&rev_list.args, opt->shallow_file);




* Re: [PATCH] connected: always use partial clone optimization
  2020-03-20 22:00 [PATCH] connected: always use partial clone optimization Jonathan Tan
  2020-03-20 22:54 ` Junio C Hamano
@ 2020-03-26 19:01 ` Josh Steadmon
  2020-03-26 21:11 ` Emily Shaffer
  2 siblings, 0 replies; 11+ messages in thread
From: Josh Steadmon @ 2020-03-26 19:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On 2020.03.20 15:00, Jonathan Tan wrote:
[snip]
> The addition of the fast path might cause performance reductions in
> these cases:
> 
>  - If a partial clone or a fetch into a partial clone fails, Git will
>    fruitlessly run rev-list (it is expected that everything fetched
>    would go into promisor packs, so if that didn't happen, it is most
>    likely that rev-list will fail too).
> 
>  - Any connectivity checks done by receive-pack, in the (in my opinion,
>    unlikely) event that a partial clone serves receive-pack.

Yes, this setup doesn't match with my understanding of the usual partial
clone workflow.

> I think that these cases are rare enough, and the performance reduction
> in them minor enough (an additional object DB access), that the
> benefit of avoiding a flag outweighs the cost.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is the second half of the work I did previously [1]. Quoting from
> [1]:
> 
> > For example, a local fetch was sped up from 6.63s to 3.39s. The bulk of
> > the remaining time is spent in yet another connectivity check
> > (fetch_refs -> check_exist_and_connected) prior to the fetch - that will
> > hopefully be done in a subsequent patch.
> 
> This is the subsequent patch. (Note that the timings were done on
> another computer, so don't compare the timings from [1] and this patch
> directly.)
> 
> [1] https://lore.kernel.org/git/be1d6aa4c4fd8868f3682b73c01a92d3830534ad.1578802317.git.jonathantanmy@google.com/
> ---
>  builtin/clone.c | 7 ++-----
>  builtin/fetch.c | 7 -------
>  connected.c     | 9 +++++++--
>  connected.h     | 9 ---------
>  4 files changed, 9 insertions(+), 23 deletions(-)
> 
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 1ad26f4d8c..4b2b14ff61 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -672,8 +672,7 @@ static void update_remote_refs(const struct ref *refs,
>  			       const char *branch_top,
>  			       const char *msg,
>  			       struct transport *transport,
> -			       int check_connectivity,
> -			       int check_refs_are_promisor_objects_only)
> +			       int check_connectivity)
>  {
>  	const struct ref *rm = mapped_refs;
>  
> @@ -682,8 +681,6 @@ static void update_remote_refs(const struct ref *refs,
>  
>  		opt.transport = transport;
>  		opt.progress = transport->progress;
> -		opt.check_refs_are_promisor_objects_only =
> -			!!check_refs_are_promisor_objects_only;

I was curious if any other code uses this option; it appears not. And
you're removing the option from the struct later on, so that's good.

>  
>  		if (check_connected(iterate_ref_map, &rm, &opt))
>  			die(_("remote did not send all necessary objects"));
> @@ -1275,7 +1272,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  
>  	update_remote_refs(refs, mapped_refs, remote_head_points_at,
>  			   branch_top.buf, reflog_msg.buf, transport,
> -			   !is_local, filter_options.choice);
> +			   !is_local);
>  
>  	update_head(our_head_points_at, remote_head, reflog_msg.buf);
>  
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index bf6bab80fa..1097e1e512 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -908,13 +908,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
>  	if (!connectivity_checked) {
>  		struct check_connected_options opt = CHECK_CONNECTED_INIT;
>  
> -		if (filter_options.choice)
> -			/*
> -			 * Since a filter is specified, objects indirectly
> -			 * referenced by refs are allowed to be absent.
> -			 */
> -			opt.check_refs_are_promisor_objects_only = 1;
> -
>  		rm = ref_map;
>  		if (check_connected(iterate_ref_map, &rm, &opt)) {
>  			rc = error(_("%s did not send all necessary objects\n"), url);
> diff --git a/connected.c b/connected.c
> index 7e9bd1bc62..846f2e4eef 100644
> --- a/connected.c
> +++ b/connected.c
> @@ -52,7 +52,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  		strbuf_release(&idx_file);
>  	}
>  
> -	if (opt->check_refs_are_promisor_objects_only) {
> +	if (has_promisor_remote()) {
>  		/*
>  		 * For partial clones, we don't want to have to do a regular
>  		 * connectivity check because we have to enumerate and exclude
> @@ -71,13 +71,18 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  				if (find_pack_entry_one(oid.hash, p))
>  					goto promisor_pack_found;
>  			}
> -			return 1;
> +			/*
> +			 * Fallback to rev-list with oid and the rest of the
> +			 * object IDs provided by fn.
> +			 */
> +			goto no_promisor_pack_found;

Previously, we'd iterate until we fail to find an OID in a promisor pack
in which case we'd return 1; now, we instead jump to the
non-promisor-pack check, which uses rev-list.

>  promisor_pack_found:
>  			;
>  		} while (!fn(cb_data, &oid));
>  		return 0;
>  	}
>  
> +no_promisor_pack_found:
>  	if (opt->shallow_file) {
>  		argv_array_push(&rev_list.args, "--shallow-file");
>  		argv_array_push(&rev_list.args, opt->shallow_file);
> diff --git a/connected.h b/connected.h
> index eba5c261ba..8d5a6b3ad6 100644
> --- a/connected.h
> +++ b/connected.h
> @@ -46,15 +46,6 @@ struct check_connected_options {
>  	 * during a fetch.
>  	 */
>  	unsigned is_deepening_fetch : 1;
> -
> -	/*
> -	 * If non-zero, only check that the top-level objects referenced by the
> -	 * wanted refs (passed in as cb_data) are promisor objects. This is
> -	 * useful for partial clones, where enumerating and excluding all
> -	 * promisor objects is very slow and the commit-walk itself becomes a
> -	 * no-op.
> -	 */
> -	unsigned check_refs_are_promisor_objects_only : 1;
>  };

And here's the struct field cleanup.

>  
>  #define CHECK_CONNECTED_INIT { 0 }
> -- 
> 2.25.1.696.g5e7596f4ac-goog
> 

This all looks good to me.

Reviewed-by: Josh Steadmon <steadmon@google.com>


* Re: [PATCH] connected: always use partial clone optimization
  2020-03-20 22:00 [PATCH] connected: always use partial clone optimization Jonathan Tan
  2020-03-20 22:54 ` Junio C Hamano
  2020-03-26 19:01 ` Josh Steadmon
@ 2020-03-26 21:11 ` Emily Shaffer
  2020-03-26 23:14   ` Josh Steadmon
  2020-03-30 13:37   ` Jeff King
  2 siblings, 2 replies; 11+ messages in thread
From: Emily Shaffer @ 2020-03-26 21:11 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Fri, Mar 20, 2020 at 03:00:45PM -0700, Jonathan Tan wrote:
> With 50033772d5 ("connected: verify promisor-ness of partial clone",
> 2020-01-30), the fast path (checking promisor packs) in
> check_connected() now passes a subset of the slow path (rev-list) - if
> all objects to be checked are found in promisor packs, both the fast
> path and the slow path will pass; otherwise, the fast path will
> definitely not pass. This means that we can always attempt the fast path
> whenever we need to do the slow path.
> 
> The fast path is currently guarded by a flag; therefore, remove that
> flag. Also, make the fast path fall back to the slow path - if the fast
> path fails, the failing OID and all remaining OIDs will be passed to
> rev-list.

It looks like a pretty simple change. I had one probably-biased
complaint about gotos below, otherwise it looks reasonable to me.

> 
> The main user-visible benefit is the performance of fetch from a partial
> clone - specifically, the speedup of the connectivity check done before
> the fetch. In particular, a no-op fetch into a partial clone on my
> computer was sped up from 7 seconds to 0.01 seconds. This is a
> complement to the work in 2df1aa239c ("fetch: forgo full
> connectivity check if --filter", 2020-01-30), which is the child of the
> aforementioned 50033772d5. In that commit, the connectivity check
> *after* the fetch was sped up.
> 
> The addition of the fast path might cause performance reductions in
> these cases:
> 
>  - If a partial clone or a fetch into a partial clone fails, Git will
>    fruitlessly run rev-list (it is expected that everything fetched
>    would go into promisor packs, so if that didn't happen, it is most
>    likely that rev-list will fail too).
> 
>  - Any connectivity checks done by receive-pack, in the (in my opinion,
>    unlikely) event that a partial clone serves receive-pack.
> 
> I think that these cases are rare enough, and the performance reduction
> in them minor enough (an additional object DB access), that the
> benefit of avoiding a flag outweighs the cost.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is the second half of the work I did previously [1]. Quoting from
> [1]:
> 
> > For example, a local fetch was sped up from 6.63s to 3.39s. The bulk of
> > the remaining time is spent in yet another connectivity check
> > (fetch_refs -> check_exist_and_connected) prior to the fetch - that will
> > hopefully be done in a subsequent patch.
> 
> This is the subsequent patch. (Note that the timings were done on
> another computer, so don't compare the timings from [1] and this patch
> directly.)
> 
> [1] https://lore.kernel.org/git/be1d6aa4c4fd8868f3682b73c01a92d3830534ad.1578802317.git.jonathantanmy@google.com/
> ---
>  builtin/clone.c | 7 ++-----
>  builtin/fetch.c | 7 -------
>  connected.c     | 9 +++++++--
>  connected.h     | 9 ---------
>  4 files changed, 9 insertions(+), 23 deletions(-)
> 
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 1ad26f4d8c..4b2b14ff61 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -672,8 +672,7 @@ static void update_remote_refs(const struct ref *refs,
>  			       const char *branch_top,
>  			       const char *msg,
>  			       struct transport *transport,
> -			       int check_connectivity,
> -			       int check_refs_are_promisor_objects_only)
> +			       int check_connectivity)
>  {
>  	const struct ref *rm = mapped_refs;
>  
> @@ -682,8 +681,6 @@ static void update_remote_refs(const struct ref *refs,
>  
>  		opt.transport = transport;
>  		opt.progress = transport->progress;
> -		opt.check_refs_are_promisor_objects_only =
> -			!!check_refs_are_promisor_objects_only;
>  
>  		if (check_connected(iterate_ref_map, &rm, &opt))
>  			die(_("remote did not send all necessary objects"));
> @@ -1275,7 +1272,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  
>  	update_remote_refs(refs, mapped_refs, remote_head_points_at,
>  			   branch_top.buf, reflog_msg.buf, transport,
> -			   !is_local, filter_options.choice);
> +			   !is_local);
>  
>  	update_head(our_head_points_at, remote_head, reflog_msg.buf);
>  
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index bf6bab80fa..1097e1e512 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -908,13 +908,6 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
>  	if (!connectivity_checked) {
>  		struct check_connected_options opt = CHECK_CONNECTED_INIT;
>  
> -		if (filter_options.choice)
> -			/*
> -			 * Since a filter is specified, objects indirectly
> -			 * referenced by refs are allowed to be absent.
> -			 */
> -			opt.check_refs_are_promisor_objects_only = 1;
> -
>  		rm = ref_map;
>  		if (check_connected(iterate_ref_map, &rm, &opt)) {
>  			rc = error(_("%s did not send all necessary objects\n"), url);
> diff --git a/connected.c b/connected.c
> index 7e9bd1bc62..846f2e4eef 100644
> --- a/connected.c
> +++ b/connected.c
> @@ -52,7 +52,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  		strbuf_release(&idx_file);
>  	}
>  
> -	if (opt->check_refs_are_promisor_objects_only) {
> +	if (has_promisor_remote()) {
>  		/*
>  		 * For partial clones, we don't want to have to do a regular
>  		 * connectivity check because we have to enumerate and exclude
> @@ -71,13 +71,18 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
>  				if (find_pack_entry_one(oid.hash, p))
>  					goto promisor_pack_found;
>  			}
> -			return 1;
> +			/*
> +			 * Fallback to rev-list with oid and the rest of the
> +			 * object IDs provided by fn.
> +			 */
> +			goto no_promisor_pack_found;
>  promisor_pack_found:
>  			;
>  		} while (!fn(cb_data, &oid));
>  		return 0;
>  	}
>  
> +no_promisor_pack_found:

Having a look at the final structure of the loop with these gotos, I'm a
little confused. Could be this isn't C-idiomatic but I think the code
could be easier to read with helpers instead of gotos. I realize it's
longer but I have a hard time understanding that your gotos are used to
double-continue or double-break; nested loops tend to make me want to
use helpers. But - I'm a lowly barely-reformed C++ developer, so what do
I know ;)

  int oid_in_promisor(oid) {
    for (p = get_all_packs(the_repository); p; p = p->next) {
      if (!p->pack_promisor)
        continue;
      if (find_pack_entry_one(oid.hash, p))
        return 1;
    }
    return 0;
  }

  int all_oids_in_promisors(oid, fn, cb_data)
  {
    do {
      if (! oid_in_promisor(oid))
        return 0;
    } while (!fn(cb_data, &oid));
    return 1;
  }

  int check_connected(...)
  {
    ...
    if (has_promisor_remote()) {
      if (all_oids_in_promisors(oid, fn, cb_data))
        return 0;
      if (opt->shallow_file) {
       ...
  }

>  	if (opt->shallow_file) {
>  		argv_array_push(&rev_list.args, "--shallow-file");
>  		argv_array_push(&rev_list.args, opt->shallow_file);
> diff --git a/connected.h b/connected.h
> index eba5c261ba..8d5a6b3ad6 100644
> --- a/connected.h
> +++ b/connected.h
> @@ -46,15 +46,6 @@ struct check_connected_options {
>  	 * during a fetch.
>  	 */
>  	unsigned is_deepening_fetch : 1;
> -
> -	/*
> -	 * If non-zero, only check that the top-level objects referenced by the
> -	 * wanted refs (passed in as cb_data) are promisor objects. This is
> -	 * useful for partial clones, where enumerating and excluding all
> -	 * promisor objects is very slow and the commit-walk itself becomes a
> -	 * no-op.
> -	 */
> -	unsigned check_refs_are_promisor_objects_only : 1;
>  };
>  
>  #define CHECK_CONNECTED_INIT { 0 }
> -- 
> 2.25.1.696.g5e7596f4ac-goog
> 


* Re: [PATCH] connected: always use partial clone optimization
  2020-03-26 21:11 ` Emily Shaffer
@ 2020-03-26 23:14   ` Josh Steadmon
  2020-03-29 17:39     ` Junio C Hamano
  2020-03-30 13:37   ` Jeff King
  1 sibling, 1 reply; 11+ messages in thread
From: Josh Steadmon @ 2020-03-26 23:14 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Jonathan Tan, git

On 2020.03.26 14:11, Emily Shaffer wrote:
> On Fri, Mar 20, 2020 at 03:00:45PM -0700, Jonathan Tan wrote:
> > With 50033772d5 ("connected: verify promisor-ness of partial clone",
> > 2020-01-30), the fast path (checking promisor packs) in
> > check_connected() now passes a subset of the slow path (rev-list) - if
> > all objects to be checked are found in promisor packs, both the fast
> > path and the slow path will pass; otherwise, the fast path will
> > definitely not pass. This means that we can always attempt the fast path
> > whenever we need to do the slow path.
> > 
> > The fast path is currently guarded by a flag; therefore, remove that
> > flag. Also, make the fast path fall back to the slow path - if the fast
> > path fails, the failing OID and all remaining OIDs will be passed to
> > rev-list.
> 
> It looks like a pretty simple change. I had one probably-biased
> complaint about gotos below, otherwise it looks reasonable to me.

[snip]

> > diff --git a/connected.c b/connected.c
> > index 7e9bd1bc62..846f2e4eef 100644
> > --- a/connected.c
> > +++ b/connected.c
> > @@ -52,7 +52,7 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
> >  		strbuf_release(&idx_file);
> >  	}
> >  
> > -	if (opt->check_refs_are_promisor_objects_only) {
> > +	if (has_promisor_remote()) {
> >  		/*
> >  		 * For partial clones, we don't want to have to do a regular
> >  		 * connectivity check because we have to enumerate and exclude
> > @@ -71,13 +71,18 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
> >  				if (find_pack_entry_one(oid.hash, p))
> >  					goto promisor_pack_found;
> >  			}
> > -			return 1;
> > +			/*
> > +			 * Fallback to rev-list with oid and the rest of the
> > +			 * object IDs provided by fn.
> > +			 */
> > +			goto no_promisor_pack_found;
> >  promisor_pack_found:
> >  			;
> >  		} while (!fn(cb_data, &oid));
> >  		return 0;
> >  	}
> >  
> > +no_promisor_pack_found:
> 
> Having a look at the final structure of the loop with these gotos, I'm a
> little confused. Could be this isn't C-idiomatic but I think the code
> could be easier to read with helpers instead of gotos. I realize it's
> longer but I have a hard time understanding that your gotos are used to
> double-continue or double-break; nested loops tend to make me want to
> use helpers. But - I'm a lowly barely-reformed C++ developer, so what do
> I know ;)
> 
>   int oid_in_promisor(oid) {
>     for (p = get_all_packs(the_repository); p; p = p->next) {
>       if (!p->pack_promisor)
>         continue;
>       if (find_pack_entry_one(oid.hash, p))
>         return 1;
>     }
>     return 0;
>   }
> 
>   int all_oids_in_promisors(oid, fn, cb_data)
>   {
>     do {
>       if (! oid_in_promisor(oid))
>         return 0;
>     } while (!fn(cb_data, &oid));
>     return 1;
>   }
> 
>   int check_connected(...)
>   {
>     ...
>     if (has_promisor_remote()) {
>       if (all_oids_in_promisors(oid, fn, cb_data))
>         return 0;
>       if (opt->shallow_file) {
>        ...
>   }

I like this version better as well.


* Re: [PATCH] connected: always use partial clone optimization
  2020-03-26 23:14   ` Josh Steadmon
@ 2020-03-29 17:39     ` Junio C Hamano
  2020-03-30  3:32       ` Jonathan Tan
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2020-03-29 17:39 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Emily Shaffer, Josh Steadmon, git

Josh Steadmon <steadmon@google.com> writes:

>> Having a look at the final structure of the loop with these gotos, I'm a
>> little confused. Could be this isn't C-idiomatic but I think the code
>> could be easier to read with helpers instead of gotos. I realize it's
>> longer but I have a hard time understanding that your gotos are used to
>> double-continue or double-break; nested loops tend to make me want to
>> use helpers. But - I'm a lowly barely-reformed C++ developer, so what do
>> I know ;)
>> 
>>   int oid_in_promisor(oid) {
>>     for (p = get_all_packs(the_repository); p; p = p->next) {
>>       if (!p->pack_promisor)
>>         continue;
>>       if (find_pack_entry_one(oid.hash, p))
>>         return 1;
>>     }
>>     return 0;
>>   }
>> 
>>   int all_oids_in_promisors(oid, fn, cb_data)
>>   {
>>     do {
>>       if (! oid_in_promisor(oid))
>>         return 0;
>>     } while (!fn(cb_data, &oid));
>>     return 1;
>>   }
>> 
>>   int check_connected(...)
>>   {
>>     ...
>>     if (has_promisor_remote()) {
>>       if (all_oids_in_promisors(oid, fn, cb_data))
>>         return 0;
>>       if (opt->shallow_file) {
>>        ...
>>   }
>
> I like this version better as well.

Sounds good.  Jonathan?  I've squashed Josh's Reviewed-by, but I
will refrain from merging it to 'next' just yet to see if you too
like the proposed code structure.

Thanks, all.


* Re: [PATCH] connected: always use partial clone optimization
  2020-03-29 17:39     ` Junio C Hamano
@ 2020-03-30  3:32       ` Jonathan Tan
  2020-03-30  5:12         ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Tan @ 2020-03-30  3:32 UTC (permalink / raw)
  To: gitster; +Cc: jonathantanmy, emilyshaffer, steadmon, git

> Sounds good.  Jonathan?  I've squashed Josh's Reviewed-by, but I
> will refrain from merging it to 'next' just yet to see if you too
> like the proposed code structure.

I think that this is a local enough concern that going either way won't
paint us into a corner, so if what's in
jt/connectivity-check-optim-in-partial-clone is OK, I prefer using that
to reduce churn.


* Re: [PATCH] connected: always use partial clone optimization
  2020-03-30  3:32       ` Jonathan Tan
@ 2020-03-30  5:12         ` Junio C Hamano
  2020-03-30 16:04           ` Jonathan Tan
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2020-03-30  5:12 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: emilyshaffer, steadmon, git

Jonathan Tan <jonathantanmy@google.com> writes:

>> Sounds good.  Jonathan?  I've squashed Josh's Reviewed-by, but I
>> will refrain from merging it to 'next' just yet to see if you too
>> like the proposed code structure.
>
> I think that this is a local enough concern that going either way won't
> paint us into a corner, so if what's in
> jt/connectivity-check-optim-in-partial-clone is OK, I prefer using that
> to reduce churn.

If you do not think their improvement is much of an improvement,
then please say so.  On the other hand, if you also believe that the
resulting code is better, then adopting the improvement (in other
words, help from other people) and getting a better version of the
code queued before it hits 'next' is not "churn".  Passing up the
chance to get the code into a shape you think is better is a waste,
and risks the improvement being forgotten rather than done in a
follow-up patch.

Thanks.



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] connected: always use partial clone optimization
  2020-03-26 21:11 ` Emily Shaffer
  2020-03-26 23:14   ` Josh Steadmon
@ 2020-03-30 13:37   ` Jeff King
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff King @ 2020-03-30 13:37 UTC (permalink / raw)
  To: Emily Shaffer; +Cc: Jonathan Tan, git

On Thu, Mar 26, 2020 at 02:11:56PM -0700, Emily Shaffer wrote:

> Having a look at the final structure of the loop with these gotos, I'm a
> little confused. Could be this isn't C-idiomatic but I think the code
> could be easier to read with helpers instead of gotos. I realize it's
> longer but I have a hard time understanding that your gotos are used to
> double-continue or double-break; nested loops tend to make me want to
> use helpers. But - I'm a lowly barely-reformed C++ developer, so what do
> I know ;)
> 
>   int oid_in_promisor(oid) {
>     for (p = get_all_packs(the_repository); p; p = p->next) {
>       if (!p->pack_promisor)
>         continue;
>       if (find_pack_entry_one(oid.hash, p))
>         return 1;
>     }
>     return 0;
>   }
> 
>   int all_oids_in_promisors(oid, fn, cb_data)
>   {
>     do {
>       if (! oid_in_promisor(oid))
>         return 0;
>     } while (!fn(cb_data, &oid));
>     return 1;
>   }
> 
>   int check_connected(...)
>   {
>     ...
>     if (has_promisor_remote()) {
>       if (all_oids_in_promisors(oid, fn, cb_data))
>         return 0;
>       if (opt->shallow_file) {
>        ...
>   }

I agree the current code is a bit hard to follow, and I do like the
separation you gave here. But there's something subtle going on in the
current code with the fn() callback. It lets us iterate forward over the
list, but we can never seek back to the beginning to reset. Yet we have
multiple loops calling fn().

This works in the current code because we do something like:

  fn(&oid);
  do {
    /* check for oid in promisor packs; if not, jump to rev-list below */
  } while(!fn(&oid));

  start_command(&rev_list);
  do {
    /* feed oid to rev-list */
  } while(!fn(&oid));

The two subtle things are:

 - we keep the cursor in the same spot after breaking out of the
   promisor loop, so that we don't feed any promisor objects to rev-list

 - by using the same local object_id buffer, when we break out of the
   promisor-checking loop, our do-while will pick up where we left off,
   and not miss checking that first object

I'd worry that splitting the two loops across separate functions would
make it easier for that second thing to be forgotten. On the other hand,
maybe pulling out some of the details into helpers would make the
dependency more clear (because there'd be less intervening cruft).

-Peff

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] connected: always use partial clone optimization
  2020-03-30  5:12         ` Junio C Hamano
@ 2020-03-30 16:04           ` Jonathan Tan
  2020-03-30 18:09             ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Tan @ 2020-03-30 16:04 UTC (permalink / raw)
  To: gitster; +Cc: jonathantanmy, emilyshaffer, steadmon, git, peff

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> Sounds good.  Jonathan?  I've squashed Josh's Reviewed-by, but I
> >> will refrain from merging it to 'next' just yet to see if you too
> >> like the proposed code structure.
> >
> > I think that this is a local enough concern that going either way won't
> > paint us into a corner, so if what's in
> > jt/connectivity-check-optim-in-partial-clone is OK, I prefer using that
> > to reduce churn.
> 
> If you do not think their improvement is much of an improvement,
> then please say so.

Yes, I don't think that their improvement is much of an improvement. If
we were to split up the logic into functions, one of the functions would
need to be documented as "Return true if all objects returned by 'fn'
exist in promisor packs. If one of them does not, return false; the ID
of the failing object is then stored in 'oid', and 'fn' can be used to
obtain all untested objects." Peff said something similar in [1].

So I think it's better to inline the logic, even if we have to use
"goto", so that we can see all of this in one place.

[1] https://lore.kernel.org/git/20200330133714.GA2410648@coredump.intra.peff.net/

> On the other hand, if you also believe that the
> resulting code is better, then adopting the improvement (in other
> words, help from other people) and getting a better version of the
> code queued before it hits 'next' is not "churn".  Passing up the
> chance to get the code into a shape you think is better is a waste,
> and risks the improvement being forgotten rather than done in a
> follow-up patch.

I agree.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] connected: always use partial clone optimization
  2020-03-30 16:04           ` Jonathan Tan
@ 2020-03-30 18:09             ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2020-03-30 18:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: emilyshaffer, steadmon, git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

>> Jonathan Tan <jonathantanmy@google.com> writes:
>> 
>> >> Sounds good.  Jonathan?  I've squashed Josh's Reviewed-by, but I
>> >> will refrain from merging it to 'next' just yet to see if you too
>> >> like the proposed code structure.
>> >
>> > I think that this is a local enough concern that going either way won't
>> > paint us into a corner, so if what's in
>> > jt/connectivity-check-optim-in-partial-clone is OK, I prefer using that
>> > to reduce churn.
>> 
>> If you do not think their improvement is much of an improvement,
>> then please say so.
>
> Yes, I don't think that their improvement is much of an improvement. If
> we were to split up the logic into functions, one of the functions would
> need to be documented as "Return true if all objects returned by 'fn'
> exist in promisor packs.

So we have a stronger basis to reject the different code structure,
and I think it makes sense.  Which is a better reason to give than
"it is a local enough concern and we can do so later if we wanted
to".  We probably do not want to anyway, right?

Thanks.  Let's mark the topic as ready for 'next'.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-20 22:00 [PATCH] connected: always use partial clone optimization Jonathan Tan
2020-03-20 22:54 ` Junio C Hamano
2020-03-26 19:01 ` Josh Steadmon
2020-03-26 21:11 ` Emily Shaffer
2020-03-26 23:14   ` Josh Steadmon
2020-03-29 17:39     ` Junio C Hamano
2020-03-30  3:32       ` Jonathan Tan
2020-03-30  5:12         ` Junio C Hamano
2020-03-30 16:04           ` Jonathan Tan
2020-03-30 18:09             ` Junio C Hamano
2020-03-30 13:37   ` Jeff King
