From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] index-pack: prefetch missing REF_DELTA bases
Date: Wed, 15 May 2019 10:46:42 +0200 (DST)
Message-ID: <nycvar.QRO.7.76.6.1905151040240.44@tvgsbejvaqbjf.bet>
In-Reply-To: <4fcaa4481b5fd2a76aa21263f997e00913db0e0f.1557868134.git.jonathantanmy@google.com>

Hi Jonathan,

On Tue, 14 May 2019, Jonathan Tan wrote:

> When fetching, the client sends "have" commit IDs indicating that the
> server does not need to send any object referenced by those commits,
> reducing network I/O. When the client is a partial clone, the client
> still sends "have"s in this way, even if it does not have every object
> referenced by a commit it sent as "have".
>
> If a server omits such an object, it is fine: the client could lazily
> fetch that object before this fetch, and it can still do so after.
>
> The issue is when the server sends a thin pack containing an object that
> is a REF_DELTA against such a missing object: index-pack fails to fix
> the thin pack. When support for lazily fetching missing objects was
> added in 8b4c0103a9 ("sha1_file: support lazily fetching missing
> objects", 2017-12-08), support in index-pack was turned off in the
> belief that it accesses the repo only to do hash collision checks.
> However, this is not true: it also needs to access the repo to resolve
> REF_DELTA bases.
>
> Support for lazy fetching should still generally be turned off in
> index-pack because it is used as part of the lazy fetching process
> itself (if not, infinite loops may occur), but we do need to fetch the
> REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> those are REF_DELTA themselves, because we do not send "have" when
> making such fetches.)
>
> To resolve this, prefetch all missing REF_DELTA bases before attempting
> to resolve them. This both ensures that all bases are attempted to be
> fetched, and ensures that we make only one request per index-pack
> invocation, and not one request per missing object.

Hmm. I wonder whether this can lead to *really* undesirable behavior, e.g.
with deep delta chains. The client would possibly have to fetch the
REF_DELTA base, but that would also be delivered in a thin pack as
*another* REF_DELTA object, and the same over and over again, with plenty
of round trips that kill performance.
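
To make the concern concrete, here is a minimal sketch of the worst case.
It is a toy model and not Git code: `toy_object` and `count_round_trips`
are made-up names, and it assumes that every response delivers exactly one
object, again as a delta against a base the client lacks.

```c
#include <assert.h>

/*
 * Hypothetical toy model, not Git code: an object is either a full
 * object (base == -1) or a REF_DELTA against objs[base].
 */
struct toy_object {
	int base;    /* index of the delta base, or -1 for a full object */
	int present; /* nonzero if the client already has this object */
};

/*
 * Count the extra fetch requests needed to resolve objs[start], under
 * the assumption that each request returns only the requested base.
 */
int count_round_trips(const struct toy_object *objs, int start)
{
	int trips = 0;
	int cur;

	assert(start >= 0);
	cur = objs[start].base;
	while (cur >= 0 && !objs[cur].present) {
		trips++;              /* one request for the missing base */
		cur = objs[cur].base; /* which may itself be a delta */
	}
	return trips;
}
```

With a received delta whose chain of three bases is entirely missing, this
returns 3: the number of round trips grows with the depth of the chain.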

Wouldn't it make more sense to introduce a new term like `promised`
(instead of `have`)? Both client and server will have to know about this,
and it would be a new capability, of course, but that way the server could
know that it has to send the entire delta chain.

Of course, this would be quite a bit more involved than the current patch
:-(

Ciao,
Dscho

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/index-pack.c     | 26 +++++++++++++++--
>  t/t5616-partial-clone.sh | 61 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index ccf4eb7e9b..0d55f73b0b 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -14,6 +14,7 @@
>  #include "thread-utils.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "fetch-object.h"
>
>  static const char index_pack_usage[] =
>  "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
> @@ -1351,6 +1352,25 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  		sorted_by_pos[i] = &ref_deltas[i];
>  	QSORT(sorted_by_pos, nr_ref_deltas, delta_pos_compare);
>
> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the delta bases.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		for (i = 0; i < nr_ref_deltas; i++) {
> +			struct ref_delta_entry *d = sorted_by_pos[i];
> +			if (!oid_object_info_extended(the_repository, &d->oid,
> +						      NULL,
> +						      OBJECT_INFO_FOR_PREFETCH))
> +				continue;
> +			oid_array_append(&to_fetch, &d->oid);
> +		}
> +		if (to_fetch.nr)
> +			fetch_objects(repository_format_partial_clone,
> +				      to_fetch.oid, to_fetch.nr);
> +		oid_array_clear(&to_fetch);
> +	}
> +
>  	for (i = 0; i < nr_ref_deltas; i++) {
>  		struct ref_delta_entry *d = sorted_by_pos[i];
>  		enum object_type type;
> @@ -1650,8 +1670,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	int report_end_of_input = 0;
>
>  	/*
> -	 * index-pack never needs to fetch missing objects, since it only
> -	 * accesses the repo to do hash collision checks
> +	 * index-pack never needs to fetch missing objects except when
> +	 * REF_DELTA bases are missing (which are explicitly handled). It only
> +	 * accesses the repo to do hash collision checks and to check which
> +	 * REF_DELTA bases need to be fetched.
>  	 */
>  	fetch_if_missing = 0;
>
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 7cc0c71556..f1baf83502 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -339,4 +339,65 @@ test_expect_success 'when partial cloning, tolerate server not sending target of
>  	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
>  '
>
> +test_expect_success 'tolerate server sending REF_DELTA against missing promisor objects' '
> +	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
> +	rm -rf "$SERVER" repo &&
> +	test_create_repo "$SERVER" &&
> +	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
> +	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
> +
> +	# Create a commit with a blob to be used as a delta base.
> +	for i in $(test_seq 10)
> +	do
> +		echo "this is a line" >>"$SERVER/foo.txt"
> +	done &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m bar &&
> +	git -C "$SERVER" rev-parse HEAD:foo.txt >deltabase &&
> +
> +	git -c protocol.version=2 clone --no-checkout \
> +		--filter=blob:none $HTTPD_URL/one_time_sed/server repo &&
> +
> +	# Sanity check to ensure that the client does not have that blob.
> +	git -C repo rev-list --objects --exclude-promisor-objects \
> +		-- $(cat deltabase) >objlist &&
> +	test_line_count = 0 objlist &&
> +
> +	# Another commit. This commit will be fetched by the client.
> +	echo "abcdefghijklmnopqrstuvwxyz" >>"$SERVER/foo.txt" &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m baz &&
> +
> +	# Pack a thin pack containing, among other things, HEAD:foo.txt
> +	# delta-ed against HEAD^:foo.txt.
> +	printf "%s\n--not\n%s\n" \
> +		$(git -C "$SERVER" rev-parse HEAD) \
> +		$(git -C "$SERVER" rev-parse HEAD^) |
> +		git -C "$SERVER" pack-objects --thin --stdout >thin.pack &&
> +
> +	# Ensure that the pack contains one delta against HEAD^:foo.txt. Since
> +	# the delta contains at least 26 novel characters, the size cannot be
> +	# contained in 4 bits, so the object header will take up 2 bytes. The
> +	# most significant nybble of the first byte is 0b1111 (0b1 to indicate
> +	# that the header continues, and 0b111 to indicate REF_DELTA), followed
> +	# by any 3 nybbles, then the OID of the delta base.
> +	git -C "$SERVER" rev-parse HEAD^:foo.txt >deltabase &&
> +	printf "f.,..%s" $(intersperse "," <deltabase) >want &&
> +	hex_unpack <thin.pack | intersperse "," >have &&
> +	grep $(cat want) have &&
> +
> +	replace_packfile thin.pack &&
> +
> +	# Use protocol v2 because the sed command looks for the "packfile"
> +	# section header.
> +	test_config -C "$SERVER" protocol.version 2 &&
> +
> +	# Fetch the thin pack and ensure that index-pack is able to handle the
> +	# REF_DELTA object with a missing promisor delta base.
> +	git -C repo -c protocol.version=2 fetch &&
> +
> +	# Ensure that the one-time-sed script was used.
> +	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
> +'
> +
>  test_done
> --
> 2.21.0.1020.gf2820cf01a-goog
>
>
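
The test's comment about the object header (continuation bit, then three
type bits, then size bits) can be sketched as a small decoder. This is an
illustration only, not the parser in index-pack: `toy_header`,
`parse_entry_header` and `TOY_OBJ_REF_DELTA` are names made up here. For a
REF_DELTA entry, the OID of the delta base follows the header bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_OBJ_REF_DELTA = 7 }; /* type bits 0b111, as in the test comment */

struct toy_header {
	unsigned type; /* 3-bit object type */
	uint64_t size; /* inflated size of the entry */
	size_t len;    /* number of header bytes consumed */
};

/*
 * Decode a pack entry header from buf[0..n). Returns 0 on success and
 * -1 if the buffer ends mid-header or the size would overflow 64 bits.
 */
int parse_entry_header(const unsigned char *buf, size_t n,
		       struct toy_header *out)
{
	size_t i = 0;
	unsigned shift = 4;
	unsigned char c;

	assert(buf && out);
	if (!n)
		return -1;
	c = buf[i++];
	out->type = (c >> 4) & 7; /* bits 6..4 */
	out->size = c & 15;       /* bits 3..0: low 4 bits of the size */
	while (c & 0x80) {        /* bit 7: header continues */
		if (i >= n || shift > 57)
			return -1;
		c = buf[i++];
		out->size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	out->len = i;
	return 0;
}
```

For the two bytes 0xf5 0x02 this yields type 7 (REF_DELTA), size 37 and a
header length of 2, which is the "f" nybble followed by any 3 nybbles that
the grep pattern in the test looks for.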
