* [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
@ 2021-10-25 21:25 Ævar Arnfjörð Bjarmason
  2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
                   ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-25 21:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson,
	Ævar Arnfjörð Bjarmason

This implements a new "bundle-uri" protocol v2 extension, which allows
servers to advertise *.bundle files from which clients can pre-seed
their full clones or incremental fetches.

This is both an alternative to, and complementary to, the existing
"packfile-uri" mechanism, i.e. servers and/or clients can support one
or both, but would generally pick one over the other.

This "bundle-uri" mechanism has the advantage of being dumber, and
offloads more complexity from the server side to the client
side.

Unlike with packfile-uri a conforming server doesn't need to produce a
PACK that (hopefully, otherwise there's not much point) excludes OIDs
that it knows it'll provide via a packfile-uri.

To the server a "bundle-uri" negotiation is the same as a "normal"
one, the client just happens to provide OIDs it found in bundles as
"have" lines.
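
As a rough illustration of that last point, here is a minimal sketch of
turning a bundle header into "have" lines. It assumes the plain "v2"
bundle header layout (a "# v2 git bundle" signature line, "-<oid>"
prerequisite lines, "<oid> <refname>" tip lines, then an empty line);
bundle_header_to_haves() is a hypothetical helper for this example, not
something in this series or in my WIP client patches.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan the text header of a v2 bundle and print
 * one "have <oid>" line per reference tip. Prerequisite lines start
 * with "-" and are skipped. Returns the number of tips found, or -1
 * if the signature line is missing.
 */
static int bundle_header_to_haves(const char *header, FILE *out)
{
	static const char sig[] = "# v2 git bundle\n";
	const char *p = header;
	int tips = 0;

	if (strncmp(p, sig, strlen(sig)))
		return -1;
	p += strlen(sig);

	/* An empty line terminates the header; the PACK follows it. */
	while (*p && *p != '\n') {
		const char *eol = strchr(p, '\n');
		size_t len = eol ? (size_t)(eol - p) : strlen(p);

		if (*p != '-') {
			/* "<oid> SP <refname>": the OID ends at the first SP */
			size_t oidlen = strcspn(p, " \n");

			fprintf(out, "have %.*s\n", (int)oidlen, p);
			tips++;
		}
		p += len + (eol ? 1 : 0);
	}
	return tips;
}
```

A real client would of course validate the OIDs and handle other
bundle versions; the sketch only shows that the header alone is enough
to produce the "have" lines.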

In my WIP client patches I even have a (trivial to implement) mode
where a client can choose to pretend that a server reported that a
given set of bundle URIs can be used to pre-seed its "clone" or
"fetch".

A client can thus use a CDN it controls to optimistically pre-seed
a clone from a server that knows nothing about "bundle-uri", sort of
like a "git clone --reference <path> --dissociate", except with a
<uri> instead of a <path>.

Need to re-clone a bunch of large repositories on CI boxes from
git.example.com, but git.example.com doesn't support "bundle-uri", and
you've got a slow outbound connection? Just point to a pre-seeding CDN
you control.

This has disadvantages compared to packfile-uri. JGit has a mature
implementation of packfile-uri, and I doubt that e.g. Google will ever
want to use bundle-uri, since packfile-uri was tailor-made for their
use case.

E.g. a repository that has a *.pack sitting on disk can't re-use and
stream it out with sendfile() as it could with a "packfile-uri",
instead it would need to point to some duplicate of that data in
*.bundle form (or on-the-fly generate a header for the *.pack).

The goal of this feature isn't to win over packfile-uri users, but to
give access to CDN offloading to users who wouldn't consider
packfile-uri due to its tight coupling.

The optimistic error recovery of "bundle-uri" and the looser coupling
between server and CDN mean that it should be easy to use this where
the CDN is something like, say, Debian's mirror network.

We're coming up on 2.34.0-rc0, so this certainly won't be in 2.34.0,
but I'm submitting this now per discussion during '#git-devel' standup
today.

There was a discussion on the RFC version of the larger series of
patches to implement this "bundle-uri"[1].

I've updated the protocol-v2.txt changes in 2/3 a lot in response to
that, in particular I've specified and implemented early client
disconnection behavior, so bundle-uri SHOULD never cause
client<->server dialog to hang (at most we'll need to re-connect, if
we need to fall back from a failed bundle-uri).

1. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

Ævar Arnfjörð Bjarmason (3):
  leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
  protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  bundle-uri client: add "bundle-uri" parsing + tests

 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 Makefile                                |   2 +
 bundle-uri.c                            | 179 ++++++++++++++++++++
 bundle-uri.h                            |  30 ++++
 serve.c                                 |   6 +
 t/helper/test-bundle-uri.c              |  83 ++++++++++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5701-git-serve.sh                    | 125 +++++++++++++-
 t/t5750-bundle-uri-parse.sh             | 153 +++++++++++++++++
 10 files changed, 788 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100755 t/t5750-bundle-uri-parse.sh

-- 
2.33.1.1511.gd15d1b313a6



* [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
  2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
@ 2021-10-25 21:25 ` Ævar Arnfjörð Bjarmason
  2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-25 21:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson,
	Ævar Arnfjörð Bjarmason

The "t5701-git-serve.sh" test passes when run under a git compiled
with SANITIZE=leak, let's mark it as such to add it to the
"linux-leaks" CI job.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/t5701-git-serve.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index aa1827d841d..1896f671cb3 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -5,6 +5,7 @@ test_description='test protocol v2 server commands'
 GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
 export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
 
+TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
 
 test_expect_success 'test capability advertisement' '
-- 
2.33.1.1511.gd15d1b313a6



* [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
  2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
@ 2021-10-25 21:25 ` Ævar Arnfjörð Bjarmason
  2021-10-26 14:00   ` Derrick Stolee
  2021-10-27  2:01   ` Derrick Stolee
  2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-25 21:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson,
	Ævar Arnfjörð Bjarmason

Add a server-side implementation of a new "bundle-uri" command to
protocol v2. As discussed in the updated "protocol-v2.txt" this will
allow conforming clients to optionally seed their initial clones or
incremental fetches from URLs containing "*.bundle" files created with
"git bundle create".

The use-cases are similar to those of the existing "Packfile URIs",
and the two features can be combined within a single request, but
"bundle-uri" has a few advantages over packfile-uris in some
common scenarios, discussed below.

This change does not give us a working "bundle-uri" client. I have
those patches as a follow-up, but let's first establish what the
protocol for this should be like. The client implementation will
then implement this specification.

With this change, when the uploadpack.bundleURI config is set to a
URI (or URIs, if set more than once), we advertise a "bundle-uri"
command. Then when the client requests "bundle-uri" we emit those URIs
back to it.
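
To make the wire format concrete, here is a minimal standalone sketch
of that response: one pkt-line per URI followed by a flush-pkt. It does
not use git's pkt-line.h; pkt_line_encode() and bundle_uri_response()
are hypothetical names for this example, and the 65520-byte cap is my
reading of the pkt-line size limit.

```c
#include <stdio.h>
#include <string.h>

/*
 * Frame one payload as a pkt-line: four lowercase hex digits giving
 * the total length (payload plus the 4-byte length prefix), then the
 * payload itself. Returns the bytes written, or -1 if it won't fit.
 */
static int pkt_line_encode(char *out, size_t outsz, const char *payload)
{
	size_t total = strlen(payload) + 4;

	if (total > 65520 || total + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04x%s", (unsigned int)total, payload);
	return (int)total;
}

/*
 * Build a whole response: one pkt-line per URI, then a flush-pkt
 * ("0000"). The buffer is NUL-terminated for convenience.
 */
static int bundle_uri_response(char *out, size_t outsz,
			       const char *const *uris, size_t nr)
{
	size_t used = 0, i;

	for (i = 0; i < nr; i++) {
		int n = pkt_line_encode(out + used, outsz - used, uris[i]);

		if (n < 0)
			return -1;
		used += n;
	}
	if (used + 5 > outsz)
		return -1;
	memcpy(out + used, "0000", 5);
	return (int)(used + 4);
}
```

Like send_bundle_uris() in this patch, the sketch writes each URI
as-is, without appending a trailing LF to the payload.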

Differences between this and the existing packfile-uri facility:

 A. There is no "real" support for packfile-uri in git.git. The
    uploadpack.blobPackfileUri setting allows carving out a list of
    blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc:
    clarify contents of packfile sent as URI, 2021-01-20) the only
    "real" implementation is JGit based.

 B. The uploadpack.blobPackfileUri is a MUST where this is a
    "CAN". I.e. once a client says it supports packfile-uri for a given
    list of protocols, the server will send it a PACK response that
    assumes it has downloaded the URI it was sent. If the client
    doesn't do that it won't have a valid repository.

    Pointing at a bundle and having the client send us "have"
    lines (or not, maybe they couldn't fetch it, or decided they
    didn't want to) is more flexible, and can gracefully recover
    e.g. if the CDN isn't reachable (maybe you do support "https", but
    the CDN provider is down, or blocked your whole country).

 C. The client, after executing "ls-refs" will disconnect if it has
    also grabbed the "bundle-uris" and knows the server won't send it
    anything it doesn't already have (or expect to have, if it's
    downloading the bundles concurrent to an early disconnect).

    This is in (small) contrast to packfile-uri where a client would
    enter a negotiation dialog, which may or may not result in a
    packfile-uri and/or an inline PACK.

 D. Because of "C" clients can, if the bundles are up-to-date, get an
    up-to-date repository with just "bundle-uri" and "ls-refs" commands,
    with no need to enter a dialog with "git upload-pack".

    That small dialog is unlikely to matter for performance purposes,
    this section is just noting differences between "bundle-uri" and
    "packfile-uri".

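The early-disconnect decision in "C" and "D" can be sketched as a
simple containment check. This is a deliberately conservative
illustration, assuming the client has the OIDs it wants from "ls-refs"
and the tips from the bundle headers as plain strings;
can_disconnect_early() is a hypothetical name, and a real client could
also accept wanted OIDs that are merely reachable from a bundle tip.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `oid` appears in `set`, else 0. */
static int oid_in_set(const char *oid, const char *const *set, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(oid, set[i]))
			return 1;
	return 0;
}

/*
 * Hypothetical client-side check: can we skip "fetch" entirely?
 * `wanted` holds the OIDs the client wants (from "ls-refs"),
 * `bundle_tips` the tips advertised in the bundle header(s). Only if
 * every wanted OID is a bundle tip is there nothing left to negotiate.
 */
static int can_disconnect_early(const char *const *wanted, size_t nr_wanted,
				const char *const *bundle_tips, size_t nr_tips)
{
	size_t i;

	for (i = 0; i < nr_wanted; i++)
		if (!oid_in_set(wanted[i], bundle_tips, nr_tips))
			return 0;
	return 1;
}
```

If this returns 0 the client proceeds with a normal "fetch" dialog,
sending the bundle tips as "have" lines.
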
As noted above the features are compatible, a client that supports
"bundle-uri" and "packfile-uri" might download a bundle, and then
proceed with a "fetch" dialog, that dialog might then result in a
"packfile-uri" response.

In practice server operators are unlikely to want to mix the two,
since the main benefit of either approach is the ability to offload
large "clone" responses to CDNs. A server operator would have little
reason not to go with one approach or the other.

There was a suggestion of implementing a similar feature long ago[1]
by Jeff King. The main difference between it and this approach is that
we've since gained protocol v2, so we can add this as an optional path
in the dialog between client and server. The 2011 implementation
hooked into the transport mechanism to try to clone from a bundle
directly. See also [2] and [3] for some later mentions of that
approach.

See also [4] for the series that implemented
uploadpack.blobPackfileUri, and [5] for a series on top that did the
.gitmodules check in that context. See [6] for the "ls-refs unborn"
feature which modified code in similar areas of the request flow.

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/
   Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25)
5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/
   Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
   2021-03-01)
6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 Makefile                                |   1 +
 bundle-uri.c                            |  55 +++++++
 bundle-uri.h                            |  14 ++
 serve.c                                 |   6 +
 t/t5701-git-serve.sh                    | 124 +++++++++++++-
 6 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 21e8258ccf3..4bc15a976cd 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -566,3 +566,212 @@ and associated requested information, each separated by a single space.
 	attr = "size"
 
 	obj-info = obj-id SP obj-size
+
+bundle-uri
+~~~~~~~~~~
+
+If the 'bundle-uri' capability is advertised, the server supports the
+`bundle-uri` command.
+
+The capability is currently advertised with no value (i.e. not
+"bundle-uri=somevalue"), a value may be added in the future for
+supporting command-wide extensions. Clients MUST ignore any unknown
+capability values and proceed with the 'bundle-uri' dialog they
+support.
+
+The 'bundle-uri' command is intended to be issued before `fetch` to
+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
+inform the subsequent `fetch` command.
+
+The client CAN issue `bundle-uri` before or after any other valid
+command. To be useful to clients it's expected that it'll be issued
+after an `ls-refs` and before `fetch`, but CAN be issued at any time
+in the dialog.
+
+DISCUSSION of bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The intent of the feature is to optimize for server resource consumption
+in the common case by changing the common case of fetching a very
+large PACK during linkgit:git-clone[1] into a smaller incremental
+fetch.
+
+It also allows servers to achieve better caching in combination with
+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
+
+This works by making new clones or fetches a more predictable and
+common negotiation against the tips of recently produced *.bundle
+file(s). Servers might even pre-generate the results of such
+negotiations for the `uploadpack.packObjectsHook` as new pushes come in.
+
+I.e. the server would anticipate that fresh clones will download a
+known bundle, followed by catching up to the current state of the
+repository using ref tips found in that bundle (or bundles).
+
+PROTOCOL for bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A `bundle-uri` request takes no arguments, and as noted above does not
+currently advertise a capability value. Both may be added in the
+future.
+
+When the client issues a `command=bundle-uri` the response is a list
+of URIs the server would like the client to fetch out-of-band before
+proceeding with the `fetch` request in this format:
+
+	output = bundle-uri-line
+		 bundle-uri-line* flush-pkt
+
+	bundle-uri-line = PKT-LINE(bundle-uri)
+			  *(SP bundle-feature-key ["=" bundle-feature-val])
+			  LF
+
+	bundle-uri = A URI such as a https://, ssh:// etc. URI
+
+	bundle-feature-key = Any printable ASCII characters except SP or "="
+	bundle-feature-val = Any printable ASCII characters except SP or "="
+
+No `bundle-feature-key`=`bundle-feature-val` fields are currently
+defined. See the discussion of features below.
+
+Clients are still expected to fully parse the line according to the
+above format, lines that do not conform to the format SHOULD be
+discarded. The user MAY be warned in such a case.
+
+bundle-uri CLIENT AND SERVER EXPECTATIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+".bundle" FORMAT
+++++++++++++++++
+
+The advertised bundle(s) MUST be in a format that "git bundle verify"
+would accept. I.e. they MUST contain one or more reference tips for
+use by the client, MUST indicate prerequisites (if any) with standard
+"-" prefixes, and MUST indicate their "object-format", if
+applicable. Create "*.bundle" files with "git bundle create".
+
+bundle-uri CLIENT ERROR RECOVERY
+++++++++++++++++++++++++++++++++
+
+A client MUST above all gracefully degrade on errors, whether that
+error is because of bad/missing data in the bundle URI(s), because
+that client is too dumb to e.g. understand and fully parse out bundle
+headers and their prerequisite relationships, or something else.
+
+Server operators should feel confident in turning on "bundle-uri" and
+not worry if e.g. their CDN goes down that clones or fetches will run
+into hard failures. Even if the server's bundle(s) are
+incomplete, or bad in some way the client should still end up with a
+functioning repository, just as if it had chosen not to use this
+protocol extension.
+
+All subsequent discussion on client and server interaction MUST keep
+this in mind.
+
+bundle-uri SERVER TO CLIENT
++++++++++++++++++++++++++++
+
+The ordering of the returned bundle uris is not significant. Clients
+MUST parse their headers to discover their contained OIDs and
+prerequisites. A client MUST consider the content of the bundle(s)
+themselves and their header as the ultimate source of truth.
+
+A server MAY even return bundle(s) that don't have any direct
+relationship to the repository being cloned (either through accident,
+or intentional "clever" configuration), and expect a client to sort
+out what data they'd like from the bundle(s), if any.
+
+bundle-uri CLIENT TO SERVER
++++++++++++++++++++++++++++
+
+The client SHOULD provide reference tips found in the bundle header(s)
+as 'have' lines in any subsequent `fetch` request. A client MAY also
+ignore the bundle(s) entirely if doing so is deemed worse for some
+reason, e.g. if the bundles can't be downloaded, it doesn't like the
+tips it finds etc.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
+of the bundle(s) the client finds that the ref tips it wants can be
+retrieved in their entirety from advertised bundle(s), it MAY disconnect. The
+results of such a 'clone' or 'fetch' should be indistinguishable from
+the state attained without using bundle-uri.
+
+EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
+++++++++++++++++++++++++++++++++++++++++++++++
+
+A client MAY perform an early disconnect while still downloading the
+bundle(s) (having streamed and parsed their headers). In such a case
+the client MUST gracefully recover from any errors related to
+finishing the download and validation of the bundle(s).
+
+I.e. a client might need to re-connect and issue a 'fetch' command,
+and possibly fall back to not making use of 'bundle-uri' at all.
+
+This "MAY" behavior is specified as such (and not a "SHOULD") on the
+assumption that a server advertising bundle uris is more likely than
+not to be serving up a relatively large repository, and to be pointing
+to URIs that have a good chance of being in working order. A client
+MAY e.g. look at the payload size of the bundles as a heuristic to see
+if an early disconnect is worth it, should falling back on a full
+"fetch" dialog be necessary.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+A client SHOULD commence a negotiation of a PACK from the server via
+the "fetch" command using the OID tips found in advertised bundles,
+even if it's still in the process of downloading those bundle(s).
+
+This allows for aggressive early disconnects from any interactive
+server dialog. The client blindly trusts that the advertised OID tips
+are relevant, and issues them as 'have' lines, it then requests any
+tips it would like (usually from the "ls-refs" advertisement) via
+'want' lines. The server will then compute a (hopefully small) PACK
+with the expected difference between the tips from the bundle(s) and
+the data requested.
+
+The only connection the client then needs to keep active is to the
+concurrently downloading static bundle(s), when those and the
+incremental PACK are retrieved they should be inflated and
+validated. Any errors at this point should be gracefully recovered
+from, see above.
+
+bundle-uri PROTOCOL FEATURES
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above no `bundle-feature-key`=`bundle-feature-val` fields
+are currently defined.
+
+They are intended for future per-URI metadata which older clients MUST
+ignore and gracefully degrade on. Any fields they do recognize they
+CAN also ignore.
+
+Any backwards-incompatible addition of per-URI key-values will be
+guarded by a new value or values in 'bundle-uri' capability
+advertisement itself, and/or by new future `bundle-uri` request
+arguments.
+
+While no per-URI key-values are currently supported, they're
+intended to support future features such as:
+
+ * Add a "hash=<val>" or "size=<bytes>" to advertise the expected hash or
+   size of the bundle file.
+
+ * Advertise that one or more bundle files are the same (to e.g. have
+   clients round-robin or otherwise choose one of N possible files).
+
+ * A "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For
+   expressing the common case of a bundle with one tip and no
+   prerequisites, or one tip and one prerequisite.
++
+This would allow for optimizing the common case of servers who'd like
+to provide one "big bundle" containing only their "main" branch,
+and/or incremental updates thereof.
++
+A client receiving such a response MAY assume that they can skip
+retrieving the header from a bundle at the indicated URI, and thus
+save themselves and the server(s) the request(s) needed to inspect the
+headers of that bundle or bundles.
diff --git a/Makefile b/Makefile
index 381bed2c1d2..e41ac60829d 100644
--- a/Makefile
+++ b/Makefile
@@ -846,6 +846,7 @@ LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..ff054ddc690
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,55 @@
+#include "cache.h"
+#include "bundle-uri.h"
+#include "pkt-line.h"
+#include "config.h"
+
+static void send_bundle_uris(struct packet_writer *writer,
+			     struct string_list *uris)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, uris)
+		packet_writer_write(writer, "%s", item->string);
+}
+
+static int advertise_bundle_uri = -1;
+static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "uploadpack.bundleuri")) {
+		advertise_bundle_uri = 1;
+		string_list_append(&bundle_uris, value);
+	}
+
+	return 0;
+}
+
+int bundle_uri_advertise(struct repository *r, struct strbuf *value)
+{
+	if (advertise_bundle_uri != -1)
+		goto cached;
+
+	git_config(bundle_uri_config, NULL);
+	advertise_bundle_uri = !!bundle_uris.nr;
+
+cached:
+	return advertise_bundle_uri;
+}
+
+int bundle_uri_command(struct repository *r,
+		       struct packet_reader *request)
+{
+	struct packet_writer writer;
+	packet_writer_init(&writer, 1);
+
+	while (packet_reader_read(request) == PACKET_READ_NORMAL)
+		die(_("bundle-uri: unexpected argument: '%s'"), request->line);
+	if (request->status != PACKET_READ_FLUSH)
+		die(_("bundle-uri: expected flush after arguments"));
+
+	send_bundle_uris(&writer, &bundle_uris);
+
+	packet_writer_flush(&writer);
+
+	return 0;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..b8762e6a8e4
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,14 @@
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+
+struct repository;
+struct packet_reader;
+struct packet_writer;
+
+/**
+ * API used by serve.[ch].
+ */
+int bundle_uri_advertise(struct repository *r, struct strbuf *value);
+int bundle_uri_command(struct repository *r, struct packet_reader *request);
+
+#endif /* BUNDLE_URI_H */
diff --git a/serve.c b/serve.c
index b3fe9b5126a..f3e0203d2c6 100644
--- a/serve.c
+++ b/serve.c
@@ -8,6 +8,7 @@
 #include "protocol-caps.h"
 #include "serve.h"
 #include "upload-pack.h"
+#include "bundle-uri.h"
 
 static int advertise_sid = -1;
 static int client_hash_algo = GIT_HASH_SHA1;
@@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "bundle-uri",
+		.advertise = bundle_uri_advertise,
+		.command = bundle_uri_command,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 1896f671cb3..9d053f77a93 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -13,7 +13,7 @@ test_expect_success 'test capability advertisement' '
 	wrong_algo sha1:sha256
 	wrong_algo sha256:sha1
 	EOF
-	cat >expect <<-EOF &&
+	cat >expect.base <<-EOF &&
 	version 2
 	agent=git/$(git version | cut -d" " -f3)
 	ls-refs=unborn
@@ -21,8 +21,11 @@ test_expect_success 'test capability advertisement' '
 	server-option
 	object-format=$(test_oid algo)
 	object-info
+	EOF
+	cat >expect.trailer <<-EOF &&
 	0000
 	EOF
+	cat expect.base expect.trailer >expect &&
 
 	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
 		--advertise-capabilities >out &&
@@ -342,4 +345,123 @@ test_expect_success 'basics of object-info' '
 	test_cmp expect actual
 '
 
+# Test the basics of bundle-uri
+#
+test_expect_success 'test capability advertisement with uploadpack.bundleURI' '
+	test_config uploadpack.bundleURI FAKE &&
+
+	cat >expect.extra <<-EOF &&
+	bundle-uri
+	EOF
+	cat expect.base \
+	    expect.extra \
+	    expect.trailer >expect &&
+
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: dies if not enabled' '
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	cat >expect <<-\EOF &&
+	ERR serve: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with two URIs' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+	test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	https://cdn.example.com/recent.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: unknown future feature(s)' '
+	test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0001
+	some-feature
+	we-do-not
+	know=about
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
 test_done
-- 
2.33.1.1511.gd15d1b313a6



* [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests
  2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
  2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
  2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
@ 2021-10-25 21:25 ` Ævar Arnfjörð Bjarmason
  2021-10-26 14:05   ` Derrick Stolee
  2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
  4 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-25 21:25 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson,
	Ævar Arnfjörð Bjarmason

Add a "test-tool bundle-uri parse" which parses the format defined in
the newly specified "bundle-uri" command.

As noted in the "bundle-uri" section in protocol-v2.txt we haven't
specified any key-values yet, just URI lines, but we should parse
their format for conformity with the spec.

We need to make sure our future client doesn't die if this optional
data is ever provided by the server, and that we've covered all the
edge cases with these key-values in our specification. Let's add and
test a bundle_uri_parse_line() to do that.
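
For readers without the tree at hand, the rules being tested can be
sketched in standalone C as below. This is not the
bundle_uri_parse_line() from this patch (which uses string_list);
struct uri_line, parse_uri_line() and the fixed buffer sizes are
assumptions made only to keep the example self-contained.

```c
#include <string.h>

#define MAX_ATTRS 8

struct uri_line {
	char uri[256];
	char key[MAX_ATTRS][64];
	char val[MAX_ATTRS][64];	/* empty when the key has no value */
	int nr_attrs;
};

/*
 * Parse "<uri>[ SP <key>[=<value>]]...". Returns 0 on success and -1
 * on a malformed line: an empty line or URI, an empty attribute (e.g.
 * two consecutive SPs), more than one '=' in an attribute, or
 * anything exceeding the fixed buffers above.
 */
static int parse_uri_line(const char *line, struct uri_line *out)
{
	const char *p = line;
	size_t len = strcspn(p, " ");

	memset(out, 0, sizeof(*out));
	if (!len || len >= sizeof(out->uri))
		return -1;
	memcpy(out->uri, p, len);
	p += len;

	while (*p == ' ') {
		const char *eq;
		size_t klen, vlen;

		p++;	/* skip the separating SP */
		len = strcspn(p, " ");
		if (!len || out->nr_attrs == MAX_ATTRS)
			return -1;
		eq = memchr(p, '=', len);
		klen = eq ? (size_t)(eq - p) : len;
		vlen = eq ? len - klen - 1 : 0;
		if (eq && memchr(eq + 1, '=', vlen))
			return -1;
		if (klen >= sizeof(out->key[0]) || vlen >= sizeof(out->val[0]))
			return -1;
		memcpy(out->key[out->nr_attrs], p, klen);
		if (eq)
			memcpy(out->val[out->nr_attrs], eq + 1, vlen);
		out->nr_attrs++;
		p += len;
	}
	return 0;
}
```

The point is the same as in the patch: unknown keys are accepted and
carried along, while lines that don't fit the grammar are rejected as
a whole.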

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                    |   1 +
 bundle-uri.c                | 124 +++++++++++++++++++++++++++++
 bundle-uri.h                |  16 ++++
 t/helper/test-bundle-uri.c  |  83 +++++++++++++++++++
 t/helper/test-tool.c        |   1 +
 t/helper/test-tool.h        |   1 +
 t/t5750-bundle-uri-parse.sh | 153 ++++++++++++++++++++++++++++++++++++
 7 files changed, 379 insertions(+)
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100755 t/t5750-bundle-uri-parse.sh

diff --git a/Makefile b/Makefile
index e41ac60829d..de66a016c78 100644
--- a/Makefile
+++ b/Makefile
@@ -691,6 +691,7 @@ PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 TEST_BUILTINS_OBJS += test-advise.o
 TEST_BUILTINS_OBJS += test-bitmap.o
 TEST_BUILTINS_OBJS += test-bloom.o
+TEST_BUILTINS_OBJS += test-bundle-uri.o
 TEST_BUILTINS_OBJS += test-chmtime.o
 TEST_BUILTINS_OBJS += test-config.o
 TEST_BUILTINS_OBJS += test-crontab.o
diff --git a/bundle-uri.c b/bundle-uri.c
index ff054ddc690..9827fc5da17 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -53,3 +53,127 @@ int bundle_uri_command(struct repository *r,
 
 	return 0;
 }
+
+/**
+ * General API for {transport,connect}.c etc.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line)
+{
+	size_t i;
+	struct string_list columns = STRING_LIST_INIT_DUP;
+	const char *uri;
+	struct string_list *uri_columns = NULL;
+	int ret = 0;
+
+	if (!strlen(line))
+		return error(_("bundle-uri: got an empty line"));
+
+	/*
+	 * Right now we don't understand anything beyond the first SP,
+	 * but let's be tolerant and ignore any future unknown
+	 * fields. See the "MUST" note about "bundle-feature-key" in
+	 * Documentation/technical/protocol-v2.txt
+	 */
+	if (string_list_split(&columns, line, ' ', -1) < 1)
+		return error(_("bundle-uri: line not in SP-delimited format: %s"), line);
+
+	/*
+	 * We represent a "<uri>[ <key-values>...]" line with the URI
+	 * being the .string in a string list, and the .util being an
+	 * optional string list of key (.string) and values
+	 * (.util). If the top-level .util is NULL there's no
+	 * key-value pairs....
+	 */
+	uri = columns.items[0].string;
+	if (!strlen(uri)) {
+		ret = error(_("bundle-uri: got an empty URI component"));
+		goto cleanup;
+	}
+
+	/*
+	 * ... we're going to need that non-NULL .util .
+	 */
+	if (columns.nr > 1) {
+		uri_columns = xcalloc(1, sizeof(struct string_list));
+		string_list_init_dup(uri_columns);
+	}
+
+	/*
+	 * Let's parse the optional "kv" format, even if we don't
+	 * understand any of the keys or values yet.
+	 */
+	for (i = 1; i < columns.nr; i++) {
+		struct string_list kv = STRING_LIST_INIT_DUP;
+		const char *arg = columns.items[i].string;
+		int fields = string_list_split(&kv, arg, '=', 2);
+		int err = 0;
+
+		switch (fields) {
+		case 0:
+			BUG("should have no fields=0");
+		case 1:
+			if (!strlen(arg)) {
+				err = error("bundle-uri: column %"PRIuMAX": got an empty attribute (full line was '%s')",
+					    (uintmax_t)i, line);
+				break;
+			}
+			/*
+			 * We could dance around with
+			 * string_list_append_nodup() and skip
+			 * string_list_clear(&kv, 0) here, but let's
+			 * keep it simple.
+			 */
+			string_list_append(uri_columns, arg);
+			break;
+		case 2:
+		{
+			const char *k = kv.items[0].string;
+			const char *v = kv.items[1].string;
+
+			string_list_append(uri_columns, k)->util = xstrdup(v);
+			break;
+		}
+		default:
+			err = error("bundle-uri: column %"PRIuMAX": '%s' more than one '=' character (full line was '%s')",
+				    (uintmax_t)i, arg, line);
+			break;
+		}
+
+		string_list_clear(&kv, 0);
+		if (err) {
+			ret = err;
+			break;
+		}
+	}
+
+
+	/*
+	 * Per the spec we'll only consider bundle-uri lines OK if
+	 * there were no parsing problems, even if the problems were
+	 * with attributes whose content we don't understand.
+	 */
+	if (ret && uri_columns) {
+		string_list_clear(uri_columns, 1);
+		free(uri_columns);
+	} else if (!ret) {
+		string_list_append(bundle_uri, uri)->util = uri_columns;
+	}
+
+cleanup:
+	string_list_clear(&columns, 0);
+	return ret;
+}
+
+static void bundle_uri_string_list_clear_cb(void *util, const char *string)
+{
+	struct string_list *fields = util;
+	if (!fields)
+		return;
+	string_list_clear(fields, 1);
+	free(fields);
+}
+
+void bundle_uri_string_list_clear(struct string_list *bundle_uri)
+{
+	string_list_clear_func(bundle_uri, bundle_uri_string_list_clear_cb);
+}
diff --git a/bundle-uri.h b/bundle-uri.h
index b8762e6a8e4..c23d7316555 100644
--- a/bundle-uri.h
+++ b/bundle-uri.h
@@ -4,6 +4,7 @@
 struct repository;
 struct packet_reader;
 struct packet_writer;
+struct string_list;
 
 /**
  * API used by serve.[ch].
@@ -11,4 +12,19 @@ struct packet_writer;
 int bundle_uri_advertise(struct repository *r, struct strbuf *value);
 int bundle_uri_command(struct repository *r, struct packet_reader *request);
 
+/**
+ * General API for {transport,connect}.c etc.
+ */
+
+/**
+ * bundle_uri_parse_line() returns 0 when a valid bundle-uri has been
+ * added to `bundle_uri`, <0 on error.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line);
+
+/**
+ * Clear the `bundle_uri` list. Just a very thin wrapper on
+ * string_list_clear_func().
+ */
+void bundle_uri_string_list_clear(struct string_list *bundle_uri);
 #endif /* BUNDLE_URI_H */
diff --git a/t/helper/test-bundle-uri.c b/t/helper/test-bundle-uri.c
new file mode 100644
index 00000000000..805a86c0130
--- /dev/null
+++ b/t/helper/test-bundle-uri.c
@@ -0,0 +1,83 @@
+#include "test-tool.h"
+#include "parse-options.h"
+#include "bundle-uri.h"
+#include "strbuf.h"
+#include "string-list.h"
+
+static int cmd__bundle_uri_parse(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri parse <in",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+	struct strbuf sb = STRBUF_INIT;
+	struct string_list list = STRING_LIST_INIT_DUP;
+	int err = 0;
+	struct string_list_item *item;
+	size_t line_nr = 0;
+
+	argc = parse_options(argc, argv, NULL, options, usage, 0);
+	if (argc)
+		goto usage;
+
+	while (strbuf_getline(&sb, stdin) != EOF) {
+		line_nr++;
+		if (bundle_uri_parse_line(&list, sb.buf) < 0)
+			err = error("bad line: '%s'", sb.buf);
+	}
+
+	for_each_string_list_item(item, &list) {
+		struct string_list_item *kv_item;
+		struct string_list *kv = item->util;
+
+		fprintf(stdout, "%s", item->string);
+		if (!kv) {
+			fprintf(stdout, "\n");
+			continue;
+		}
+		for_each_string_list_item(kv_item, kv) {
+			const char *k = kv_item->string;
+			const char *v = kv_item->util;
+
+			if (v)
+				fprintf(stdout, " [kv: %s => %s]", k, v);
+			else
+				fprintf(stdout, " [attr: %s]", k);
+		}
+		fprintf(stdout, "\n");
+	}
+	strbuf_release(&sb);
+
+	bundle_uri_string_list_clear(&list);
+
+	return err < 0 ? 1 : 0;
+usage:
+	usage_with_options(usage, options);
+}
+
+int cmd__bundle_uri(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri <subcommand> [<options>]",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL, options, usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION |
+			     PARSE_OPT_KEEP_ARGV0);
+	if (argc == 1)
+		goto usage;
+
+	if (!strcmp(argv[1], "parse"))
+		return cmd__bundle_uri_parse(argc - 1, argv + 1);
+	error("there is no test-tool bundle-uri tool '%s'", argv[1]);
+
+usage:
+	usage_with_options(usage, options);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 3ce5585e53a..b6e1ee7b253 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -17,6 +17,7 @@ static struct test_cmd cmds[] = {
 	{ "advise", cmd__advise_if_enabled },
 	{ "bitmap", cmd__bitmap },
 	{ "bloom", cmd__bloom },
+	{ "bundle-uri", cmd__bundle_uri },
 	{ "chmtime", cmd__chmtime },
 	{ "config", cmd__config },
 	{ "crontab", cmd__crontab },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 9f0f5228508..ef839ac7262 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -7,6 +7,7 @@
 int cmd__advise_if_enabled(int argc, const char **argv);
 int cmd__bitmap(int argc, const char **argv);
 int cmd__bloom(int argc, const char **argv);
+int cmd__bundle_uri(int argc, const char **argv);
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
 int cmd__crontab(int argc, const char **argv);
diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
new file mode 100755
index 00000000000..70fd1b398e9
--- /dev/null
+++ b/t/t5750-bundle-uri-parse.sh
@@ -0,0 +1,153 @@
+#!/bin/sh
+
+test_description="Test bundle-uri bundle_uri_parse_line()"
+
+TEST_NO_CREATE_REPO=1
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'bundle_uri_parse_line() just URIs' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle.bdl
+	https://example.com/bundle.bdl
+	file:///usr/share/git/bundle.bdl
+	EOF
+
+	# For the simple case
+	cp in expect &&
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl attr
+	http://example.com/bundle2.bdl ibute
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: attr]
+	http://example.com/bundle2.bdl [attr: ibute]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes and key-value attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl x a=b y c=d z e=f a=b
+	EOF
+
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: x] [kv: a => b] [attr: y] [kv: c => d] [attr: z] [kv: e => f] [kv: a => b]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: extra SP' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl one-space
+	http://example.com/bundle2.bdl  two-space
+	http://example.com/bundle3.bdl   three-space
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle2.bdl  two-space'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl  two-space'\''
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle3.bdl   three-space'\'')
+	error: bad line: '\''http://example.com/bundle3.bdl   three-space'\''
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: one-space]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty lines' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+
+	http://example.com/bundle2.bdl a=b
+
+	http://example.com/bundle3.bdl
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle2.bdl [kv: a => b]
+	http://example.com/bundle3.bdl
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty URIs' '
+	sed "s/> //" >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+	>  a=b
+	http://example.com/bundle3.bdl a=b
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty URI component
+	error: bad line: '\'' a=b'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle3.bdl [kv: a => b]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: multiple = in key-values' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl k=v=extra
+	http://example.com/bundle2.bdl a=b k=v=extra c=d
+	http://example.com/bundle3.bdl kv=ok
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle1.bdl k=v=extra'\'')
+	error: bad line: '\''http://example.com/bundle1.bdl k=v=extra'\''
+	error: bundle-uri: column 2: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle3.bdl [kv: kv => ok]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.33.1.1511.gd15d1b313a6
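
For readers who want to poke at the line format without building test-tool,
the rules exercised by the tests above can be approximated in POSIX shell.
This is only a sketch under assumptions: the function name
parse_bundle_uri_line is hypothetical, the error messages are shortened, and
it merely mimics the output format of the test helper rather than reusing
any Git code.

```shell
#!/bin/sh
# Sketch of the "<uri>[ <attr>|<key>=<value>]..." rules checked by
# bundle_uri_parse_line(). parse_bundle_uri_line is a hypothetical
# name and is not part of Git.
parse_bundle_uri_line () {
	line=$1
	test -n "$line" || { echo "error: empty line" >&2; return 1; }

	# Everything before the first SP is the URI.
	uri=${line%% *}
	test -n "$uri" || { echo "error: empty URI component" >&2; return 1; }
	out=$uri

	# No SP at all means a bare URI with no attributes.
	case "$line" in
	*" "*) rest=${line#* } ;;
	*) printf '%s\n' "$out"; return 0 ;;
	esac

	col=0
	while :
	do
		col=$((col + 1))
		# Peel off one SP-delimited column at a time.
		case "$rest" in
		*" "*) field=${rest%% *}; rest=${rest#* }; last= ;;
		*) field=$rest; last=1 ;;
		esac

		case "$field" in
		"")
			echo "error: column $col: empty attribute" >&2
			return 1 ;;
		*=*=*)
			echo "error: column $col: more than one '='" >&2
			return 1 ;;
		*=*)
			out="$out [kv: ${field%%=*} => ${field#*=}]" ;;
		*)
			out="$out [attr: $field]" ;;
		esac
		test -z "$last" || break
	done
	printf '%s\n' "$out"
}
```

Calling it with a line such as "http://example.com/bundle1.bdl x a=b" prints
the URI followed by one bracketed entry per column, while a doubled SP or a
second "=" in a column makes it return non-zero, in line with the edge-case
tests in the patch.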


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
@ 2021-10-26 14:00   ` Derrick Stolee
  2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
  2021-10-27  2:01   ` Derrick Stolee
  1 sibling, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-26 14:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jeff King, Patrick Steinhardt, Christian Couder,
	Albert Cui, Jonathan Tan, Jonathan Nieder, brian m . carlson,
	Robin H . Johnson

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> +bundle-uri CLIENT ERROR RECOVERY
> +++++++++++++++++++++++++++++++++
> +
> +A client MUST above all gracefully degrade on errors, whether that
> +error is because of bad missing/data in the bundle URI(s), because
> +that client is too dumb to e.g. understand and fully parse out bundle
> +headers and their prerequisite relationships, or something else.

"too dumb" seems a bit informal to me, especially because you
immediately elaborate on its meaning. You could rewrite as follows:

  ...because
  that client can't understand or fully parse out bundle
  headers and their prerequisite relationships, or something else.

> +Server operators should feel confident in turning on "bundle-uri" and
> +not worry if e.g. their CDN goes down that clones or fetches will run
> +into hard failures. Even if the server bundle bundle(s) are
> +incomplete, or bad in some way the client should still end up with a
> +functioning repository, just as if it had chosen not to use this
> +protocol extension.

Also, insertions of "e.g." in the middle of a sentence don't flow well.

  Server operators should feel confident in turning on "bundle-uri" and
  not worry that failures such as the CDN being unavailable will cause
  clones or fetches to have hard failures. Even if the server bundle(s)
  are invalid, the client should still end up with a functioning
  repository, just as if it had chosen not to use this protocol extension.

(Note: I also removed a "bundle bundle(s)" that was split across a line
break.)

> +bundle-uri SERVER TO CLIENT
> ++++++++++++++++++++++++++++
> +
> +The ordering of the returned bundle uris is not significant. Clients

I'm late to noticing, but shouldn't "URI" be all-caps when not used in
the literal capability string "bundle-uri"?

> +bundle-uri CLIENT TO SERVER
> ++++++++++++++++++++++++++++
> +
> +The client SHOULD provide reference tips found in the bundle header(s)
> +as 'have' lines in any subsequent `fetch` request. A client MAY also
> +ignore the bundle(s) entirely if doing so is deemed worse for some
> +reason, e.g. if the bundles can't be downloaded, it doesn't like the
> +tips it finds etc.

I would just stop after "is deemed worse for some reason." because one
example is obvious and, for the other, it is unclear how the client would
detect that situation. (Maybe: tip commit timestamps are really old?)

> +
> +WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +
> +If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
> +of the bundle(s) the client finds that the ref tips it wants can be
> +retrieved entirety from advertised bundle(s), it MAY disconnect. The

s/entirety/entirely/

> +results of such a 'clone' or 'fetch' should be indistinguishable from
> +the state attained without using bundle-uri.
> +
> +EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
> +++++++++++++++++++++++++++++++++++++++++++++++
> +
> +A client MAY perform an early disconnect while still downloading the
> +bundle(s) (having streamed and parsed their headers). In such a case
> +the client MUST gracefully recover from any errors related to
> +finishing the download and validation of the bundle(s).
> +
> +I.e. a client might need to re-connect and issue a 'fetch' command,
> +and possibly fall back to not making use of 'bundle-uri' at all.

Use "For example," over starting a sentence with "i.e.". The examples
of "i.e." and "e.g." already in this document show proper use, which
involves parentheses.

> +This "MAY" behavior is specified as such (and not a "SHOULD") on the
> +assumption that a server advertising bundle uris is more likely than
> +not to be serving up a relatively large repository, and to be pointing
> +to URIs that have a good chance of being in working order. A client
> +MAY e.g. look at the payload size of the bundles as a heuristic to see

Again, here, the entire sentence is an example. This "e.g." can be
removed with no loss of meaning.

> +if an early disconnect is worth it, should falling back on a full
> +"fetch" dialog be necessary.


> +While no per-URI key-value are currently supported currently they're
> +intended to support future features such as:
> +
> + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
> +   size of the bundle file.

I suppose if one wanted to add this server-to-bundle coupling, then some
clients might appreciate it.

> + * Advertise that one or more bundle files are the same (to e.g. have
> +   clients round-robin or otherwise choose one of N possible files).

  * Advertise that one or more bundle files are the same, to allow for
    redundancy without causing duplicated effort.

> +static void send_bundle_uris(struct packet_writer *writer,
> +			     struct string_list *uris)
> +{
> +	struct string_list_item *item;
> +
> +	for_each_string_list_item(item, uris)
> +		packet_writer_write(writer, "%s", item->string);
> +}
> +
> +static int advertise_bundle_uri = -1;
> +static struct string_list bundle_uris = STRING_LIST_INIT_DUP;

I see you put send_bundle_uris() before the global bundle_uris so
it can be independent, but do you expect anyone to call send_bundle_uris()
via a different list?

Should we find a different place to store this data?

> +static int bundle_uri_config(const char *var, const char *value, void *data)
> +{
> +	if (!strcmp(var, "uploadpack.bundleuri")) {
> +		advertise_bundle_uri = 1;
> +		string_list_append(&bundle_uris, value);
> +	}
> +
> +	return 0;
> +}

Here, we are dictating that the URI list is available as a multi-valued
config "uploadpack.bundleuri".

1. Should this be updated in Documentation/config/uploadpack.txt?

2. This seems difficult to extend to your possible future features as
   listed in the protocol docs, mainly because this can only store the
   flat URI string. To add things like hash values, sizes, and prereqs,
   you would need more data included and grouped on a per-URI basis.
   What plans do you have to make extensions here while remaining
   somewhat compatible with downgrading Git versions?

> @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
>  		.advertise = always_advertise,
>  		.command = cap_object_info,
>  	},
> +	{
> +		.name = "bundle-uri",
> +		.advertise = bundle_uri_advertise,
> +		.command = bundle_uri_command,
> +	},
>  };

I really appreciate that it is this simple to extend protocol v2.

> +test_expect_success 'basics of bundle-uri: dies if not enabled' '
> +	test-tool pkt-line pack >in <<-EOF &&
> +	command=bundle-uri
> +	0000
> +	EOF
> +
> +	cat >err.expect <<-\EOF &&
> +	fatal: invalid command '"'"'bundle-uri'"'"'
> +	EOF
> +
> +	cat >expect <<-\EOF &&
> +	ERR serve: invalid command '"'"'bundle-uri'"'"'
> +	EOF
> +
> +	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
> +	test_cmp err.expect err.actual &&
> +	test_must_be_empty out
> +'
> +
> +

hyper-nit: double newline.

The implementation seems simple enough, which I like. I'm a bit
your current use of Git config as the back-end, only because it is
difficult to be future-proof. As the functionality stands today, the
current config design works just fine. Perhaps we don't need to
worry about the future, because we can design a new, complementary
storage for that extra data. It seems worth exploring for a little
while, though. Perhaps we should take a page out of 'git-remote'
and how it stores named remotes with sub-items for metadata. The
names probably don't need to ever be exposed to users, but it could
be beneficial to anyone implementing this scheme.

[bundle "main"]
	uri = https://example.com/my-bundle
	uri = https://redundant-cdn.com/my-bundle
	size = 120523
	sha256 = {64hexchars}

[bundle "fork"]
	uri = https://cdn.org/my-fork
	size = 334
	sha256 = {...}
	prereq = {oid}

This kind of layout has an immediate grouping of data that should
help any future plan. Notice how I included multiple "uri" lines
in the "main", which helps with your plan for duplicate URIs.
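
To get a feel for how such a grouped layout could be consumed, here is a
rough sketch that uses only existing "git config" options. The
"bundle.<name>.*" keys are the hypothetical ones from the layout above
(nothing in Git reads them), and the scratch file and variable names are
made up for illustration.

```shell
#!/bin/sh
# Sketch only: "bundle.<name>.*" is the hypothetical schema proposed
# above, written to a scratch file rather than a real repository.
cfg=$(mktemp)

git config --file "$cfg" --add bundle.main.uri https://example.com/my-bundle
git config --file "$cfg" --add bundle.main.uri https://redundant-cdn.com/my-bundle
git config --file "$cfg" bundle.main.size 120523
git config --file "$cfg" --add bundle.fork.uri https://cdn.org/my-fork
git config --file "$cfg" bundle.fork.size 334

# All URIs grouped under one logical bundle, one per line.
main_uris=$(git config --file "$cfg" --get-all bundle.main.uri)

# Every per-bundle setting as "<key> <value>" lines.
all_bundles=$(git config --file "$cfg" --get-regexp '^bundle\.')
```

The grouping falls out of the subsection name, so a reader of the config
only needs the bundle name to collect the redundant URIs and the metadata
that belongs to them.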

Thanks,
-Stolee

* Re: [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests
  2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
@ 2021-10-26 14:05   ` Derrick Stolee
  0 siblings, 0 replies; 77+ messages in thread
From: Derrick Stolee @ 2021-10-26 14:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jeff King, Patrick Steinhardt, Christian Couder,
	Albert Cui, Jonathan Tan, Jonathan Nieder, brian m . carlson,
	Robin H . Johnson

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> Add a "test-tool bundle-uri parse" which parses the format defined in
> the newly specified "bundle-uri" command.
> 
> As noted in the "bundle-uri" section in protocol-v2.txt we haven't
> specified any key-values yet, just URI lines, but we should parse
> their format for conformity with the spec.
> 
> We need to make sure our future client doesn't die if this optional
> data is ever provided by the server, and that we've covered all the
> edge cases with these key-values in our specification. Let's add and
> test a bundle_uri_parse_line() to do that.

While this implementation is interesting, and the tests available are
useful for validating the protocol, I would like to see how this
integrates into a full 'git clone' operation to be sure that the API
is correct. How much more work is it to implement the full end-to-end
scenario?

Thanks,
-Stolee

* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-26 14:00   ` Derrick Stolee
@ 2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
  2021-10-27  1:55       ` Derrick Stolee
  0 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-26 15:00 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
>> +bundle-uri CLIENT ERROR RECOVERY
>> +++++++++++++++++++++++++++++++++
>> +
>> +A client MUST above all gracefully degrade on errors, whether that
>> +error is because of bad missing/data in the bundle URI(s), because
>> +that client is too dumb to e.g. understand and fully parse out bundle
>> +headers and their prerequisite relationships, or something else.
>
> "too dumb" seems a bit informal to me, especially because you
> immediately elaborate on its meaning. You could rewrite as follows:
>
>   ...because
>   that client can't understand or fully parse out bundle
>   headers and their prerequisite relationships, or something else.

Thanks, I've snipped all your subsequent comments on
phrasing/clarifications etc, except insofar as they have questions I
need to address (as opposed to just my bad grammar/phrasing etc).

Thanks a lot for them, will go through them closely for any subsequent
re-roll & address them.

> [...]
>> +While no per-URI key-value are currently supported currently they're
>> +intended to support future features such as:
>> +
>> + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
>> +   size of the bundle file.
>
> I suppose if one wanted to add this server-to-bundle coupling, then some
> clients might appreciate it.

For packfile-uri there's a hard dependency on the server transferring
the hash of the PACK file.

I've intentionally omitted it, the reasons are covered in [1], which I
realize now should really be part of this early series.

Basically having it as a hard requirement isn't necessary for security
or payload validation. Any server who's worried about their transport
integrity would point to a https URI under their control, any
checksumming and validation we'll need we'll get from the transport
layer and the client's reachability check.

Having it would mean that you need closer cooperation by default between
server and CDN than I'm aiming for, i.e. a server should be able to
point to some URI somewhere updated by a dumb hourly cronjob, without
needing to pass information back & forth about what the "current" URL
is. The client will discover all that.

But I left that "hash=*" in because it could be optionally added, in
case someone really wants it for some reason...

1. https://lore.kernel.org/git/RFC-patch-13.13-1e657ed27a-20210805T150534Z-avarab@gmail.com/

>> + * Advertise that one or more bundle files are the same (to e.g. have
>> +   clients round-robin or otherwise choose one of N possible files).
>
>   * Advertise that one or more bundle files are the same, to allow for
>     redundancy without causing duplicated effort.

*nod*

>> +static void send_bundle_uris(struct packet_writer *writer,
>> +			     struct string_list *uris)
>> +{
>> +	struct string_list_item *item;
>> +
>> +	for_each_string_list_item(item, uris)
>> +		packet_writer_write(writer, "%s", item->string);
>> +}
>> +
>> +static int advertise_bundle_uri = -1;
>> +static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
>
> I see you put send_bundle_uris() before the global bundle_uris so
> it can be independent, but do you expect anyone to call send_bundle_uris()
> via a different list?

No, I'll move that around or rather fold it into bundle_uri_command()
directly.

I think I'd originally copied the structure of send_ref() and ls_refs()
from ls-refs.c, but it doesn't make much sense anymore here for this
2-line function. Thanks.

> Should we find a different place to store this data?
>
>> +static int bundle_uri_config(const char *var, const char *value, void *data)
>> +{
>> +	if (!strcmp(var, "uploadpack.bundleuri")) {
>> +		advertise_bundle_uri = 1;
>> +		string_list_append(&bundle_uris, value);
>> +	}
>> +
>> +	return 0;
>> +}
>
> Here, we are dictating that the URI list is available as a multi-valued
> config "uploadpack.bundleuri".
>
> 1. Should this be updated in Documentation/config/uploadpack.txt?

Definitely. I'll either incorporate that or re-structure this leading
series so that it's more design-doc/protocol focused, in any case all of
this ends up documented in the right places eventually...

> 2. This seems difficult to extend to your possible future features as
>    listed in the protocol docs, mainly because this can only store the
>    flat URI string. To add things like hash values, sizes, and prereqs,
>    you would need more data included and grouped on a per-URI basis.
>    What plans do you have to make extensions here while remaining
>    somewhat compatible with downgrading Git versions?

[...addressed below...]

>> @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
>>  		.advertise = always_advertise,
>>  		.command = cap_object_info,
>>  	},
>> +	{
>> +		.name = "bundle-uri",
>> +		.advertise = bundle_uri_advertise,
>> +		.command = bundle_uri_command,
>> +	},
>>  };
>
> I really appreciate that it is this simple to extend protocol v2.

Yeah! FWIW I've got some WIP patches to make it easier still, i.e. some
further simplification & validation in the serve.[ch] API.

>> +test_expect_success 'basics of bundle-uri: dies if not enabled' '
>> +	test-tool pkt-line pack >in <<-EOF &&
>> +	command=bundle-uri
>> +	0000
>> +	EOF
>> +
>> +	cat >err.expect <<-\EOF &&
>> +	fatal: invalid command '"'"'bundle-uri'"'"'
>> +	EOF
>> +
>> +	cat >expect <<-\EOF &&
>> +	ERR serve: invalid command '"'"'bundle-uri'"'"'
>> +	EOF
>> +
>> +	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
>> +	test_cmp err.expect err.actual &&
>> +	test_must_be_empty out
>> +'
>> +
>> +
>
> hyper-nit: double newline.
>
> The implementation seems simple enough, which I like. I'm a bit

I mentally inserted the missing "skeptical/uncertain" etc. here :)

> your current use of Git config as the back-end, only because it is
> difficult to be future-proof. As the functionality stands today, the
> current config design works just fine. Perhaps we don't need to
> worry about the future, because we can design a new, complementary
> storage for that extra data. It seems worth exploring for a little
> while, though. Perhaps we should take a page out of 'git-remote'
> and how it stores named remotes with sub-items for metadata. The
> names probably don't need to ever be exposed to users, but it could
> be beneficial to anyone implementing this scheme.
>
> [bundle "main"]
> 	uri = https://example.com/my-bundle
> 	uri = https://redundant-cdn.com/my-bundle
> 	size = 120523
> 	sha256 = {64hexchars}
>
> [bundle "fork"]
> 	uri = https://cdn.org/my-fork
> 	size = 334
> 	sha256 = {...}
> 	prereq = {oid}
>
> This kind of layout has an immediate grouping of data that should
> help any future plan. Notice how I included multiple "uri" lines
> in the "main", which helps with your plan for duplicate URIs.

At first sight I like that config schema much better than my current
one, in particular how it makes the future-proofed "these N urls are one
logical URL" case simpler.

But overall I'm a bit on the fence, and leaning towards keeping what I
have, not out of any laziness or wanting to just keep what I have, mind
you.

But rather that the main benefit of the current one is that it's a 1:1
mapping to the line-based protocol, and you can, say, update your URLs as:

    git config --replace-all uploadpack.bundleUri "$first_url" &&
    git config --add uploadpack.bundleUri "$second_url"

Usually you'd know the URL you'd like to replace, so you can use
the [value-pattern] of --replace-all; if it's a named section or other
split-out structure, that becomes a two-step lookup.
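
As a sketch of that mapping, the two commands above can be run against a
scratch config file and read back with --get-all, which prints one value
per line, much like the lines of a bundle-uri response. The scratch file,
the variable names and the example.com URLs are assumptions for
illustration.

```shell
#!/bin/sh
cfg=$(mktemp)
first_url=https://example.com/bundle-1.bdl
second_url=https://example.com/bundle-2.bdl

# Stale values that the update below should wipe out.
git config --file "$cfg" --add uploadpack.bundleUri https://example.com/old-1.bdl
git config --file "$cfg" --add uploadpack.bundleUri https://example.com/old-2.bdl

# Collapse every existing value into the first URL, then append the second.
git config --file "$cfg" --replace-all uploadpack.bundleUri "$first_url" &&
git config --file "$cfg" --add uploadpack.bundleUri "$second_url"

# One URI per line.
advertised=$(git config --file "$cfg" --get-all uploadpack.bundleUri)
```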

Also for testing I've got a (trivial) plumbing tool I'll submit called
"git ls-remote-bundle-uri" (could be folded into something else I guess)
to dump the server-side config, it's nice that you can pretty much
directly copy/paste it into config without needing to adjust it.

Having said all that I'm not sure I feel strongly about it either way,
what do you think given the above?

I think most "real" server operators will use this as
GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<0..n-1>, which can of course
work with any config schema, but if you've got code generating it on the
other side naming the targets is probably a slight hassle / confusing.

There's also the small matter of it being consistent with the
packfile-uri config in its current form, but that shouldn't be a reason
not to come up with something better. If anything any better suggestion
(if we go for that) could be supported by it too...

* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
@ 2021-10-27  1:55       ` Derrick Stolee
  2021-10-27 17:49         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-27  1:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Oct 26 2021, Derrick Stolee wrote:
>> The implementation seems simple enough, which I like. I'm a bit
> 
> I mentally inserted the missing "skeptical/uncertain" etc. here :)

More "uncertain" than "skeptical". The current plan works perfectly
for the current implementation, so there is an element of YAGNI that
could easily lead us to avoid overthinking this.
 
>> your current use of Git config as the back-end, only because it is
>> difficult to be future-proof. As the functionality stands today, the
>> current config design works just fine. Perhaps we don't need to
>> worry about the future, because we can design a new, complementary
>> storage for that extra data. It seems worth exploring for a little
>> while, though. Perhaps we should take a page out of 'git-remote'
>> and how it stores named remotes with sub-items for metadata. The
>> names probably don't need to ever be exposed to users, but it could
>> be beneficial to anyone implementing this scheme.
>>
>> [bundle "main"]
>> 	uri = https://example.com/my-bundle
>> 	uri = https://redundant-cdn.com/my-bundle
>> 	size = 120523
>> 	sha256 = {64hexchars}
>>
>> [bundle "fork"]
>> 	uri = https://cdn.org/my-fork
>> 	size = 334
>> 	sha256 = {...}
>> 	prereq = {oid}
>>
>> This kind of layout has an immediate grouping of data that should
>> help any future plan. Notice how I included multiple "uri" lines
>> in the "main", which helps with your plan for duplicate URIs.
> 
> At first sight I like that config schema much better than my current
> one, in particular how it makes the future-proofed "these N urls are one
> logical URL" case simpler.
> 
> But overall I'm a bit on the fence, and leaning towards keeping what I
> have, not out of any lazynes or wanting to just keep what I have mind
> you.
> 
> But rather that the main benefit of the current one is that it's a 1=1
> mapping to the line-based protocol, and you can say update your URLs as:
> 
>     git config --replace-all uploadpack.bundleUri "$first_url" &&
>     git config --add uploadpack.bundleUri "$second_url"
> 
> Having usually you'd know the URL you'd like to replace, so you can use
> the [value-pattern] of --replace-all, if it's a named section or other
> split-out structure that become a two-step lookup.

Don't forget to use --fixed-value for exact string matching instead of
regex matching!

> Also for testing I've got a (trivial) plumbing tool I'll submit called
> "git ls-remote-bundle-uri" (could be folded into something else I guess)
> to dump the server-side config, it's nice that you can pretty much
> directly copy/paste it into config without needing to adjust it.

With the appropriate helper structs and methods in the product code,
such helper tools will still be simple without being a second place
that is directly aware of how the values are stored to disk. I don't
judge your prototype work that helps you build the feature, but it's
simultaneously not a reason to stick to a design.
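
As a rough illustration of such helper structs (a sketch only, in Python
rather than git.git's C; the BundleInfo name and the
bundle.<name>.<field> key layout are assumptions based on the schema
sketched above, not an existing interface), flat key/value pairs could
be grouped like this:

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class BundleInfo:
    """One named bundle; a hypothetical struct, not part of git.git."""
    name: str
    uris: List[str] = field(default_factory=list)
    size: Optional[int] = None
    sha256: Optional[str] = None
    prereqs: List[str] = field(default_factory=list)


def group_bundles(pairs: Iterable[Tuple[str, str]]) -> Dict[str, BundleInfo]:
    """Group (key, value) config pairs of the form bundle.<name>.<field>."""
    bundles: Dict[str, BundleInfo] = {}
    for key, value in pairs:
        section, _, rest = key.partition(".")
        name, _, fieldname = rest.rpartition(".")
        if section.lower() != "bundle" or not name:
            continue  # not part of the sketched schema
        info = bundles.setdefault(name, BundleInfo(name))
        fieldname = fieldname.lower()
        if fieldname == "uri":
            info.uris.append(value)  # multi-valued: redundant URIs
        elif fieldname == "size":
            info.size = int(value)
        elif fieldname == "sha256":
            info.sha256 = value
        elif fieldname == "prereq":
            info.prereqs.append(value)
    return bundles
```

Any tool that dumps or edits the list would then go through the grouped
structure and never touch the on-disk layout directly.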

> Having said all that I'm not sure I feel strongly about it either way,
> what do you think given the above?

I don't feel too strongly about it right now. The current design
does not need anything extra, but it also purposefully leaves certain
things open for extension in the future.

The thing I worry about is that there will be two supported ways to
store a list of bundle URIs: a flat list of URIs in the multi-valued
uploadPack.bundleURI config value, but then also a second option that
allows the extensions that arise. It's a layer of complication that
would be nice to avoid if there was an easy way to do it, but the
schema I sketched earlier isn't simple enough to merit a switch right
now. Perhaps someone else will have an idea that accomplishes the
same goal, but also is less complicated?
 
> I think most "real" server operators will use this as
> GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<1..n>, which can of course
> work with any config schema, but if you've got code generating it on the
> other side naming the targets is probably a slight hassle / confusing.

You'd really overload the environment this way? That's not how I would
approach it, but maybe there is a benefit over writing to the repository's
config file. I suppose that you could store the data in a database and
link it to the repository at runtime instead.
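
For reference, a minimal sketch of what generating that environment
might look like (Python; the bundle_uri_env helper is hypothetical, and
the config key is the one proposed in this series). Note that git
numbers the pairs from 0, i.e. GIT_CONFIG_KEY_0 up to
GIT_CONFIG_KEY_<n-1>:

```python
from typing import Dict, List


def bundle_uri_env(uris: List[str],
                   key: str = "uploadpack.bundleUri") -> Dict[str, str]:
    """Build GIT_CONFIG_* variables that inject one config entry per URI."""
    env = {"GIT_CONFIG_COUNT": str(len(uris))}
    for i, uri in enumerate(uris):  # git reads pairs 0 .. COUNT-1
        env[f"GIT_CONFIG_KEY_{i}"] = key
        env[f"GIT_CONFIG_VALUE_{i}"] = uri
    return env
```

The result would be merged into the environment of whatever process
spawns git-upload-pack, with the URIs coming from a database or similar.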

> There's also the small matter of it being consistent with the
> packfile-uri config in its current form, but that shouldn't be a reason
> not to come up with something better. If anything any better suggestion
> (if we go for that) could be supported by it too...

What do you mean about being consistent with packfile-uri? This layer
that we care about isn't even implemented in git.git.

Thanks,
-Stolee


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
  2021-10-26 14:00   ` Derrick Stolee
@ 2021-10-27  2:01   ` Derrick Stolee
  2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-27  2:01 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jeff King, Patrick Steinhardt, Christian Couder,
	Albert Cui, Jonathan Tan, Jonathan Nieder, brian m . carlson,
	Robin H . Johnson

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> Add a server-side implementation of a new "bundle-uri" command to
> protocol v2. As discussed in the updated "protocol-v2.txt" this will
> allow conforming clients to optionally seed their initial clones or
> incremental fetches from URLs containing "*.bundle" files created with
> "git bundle create".

...

> +DISCUSSION of bundle-uri
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The intent of the feature is optimize for server resource consumption
> +in the common case by changing the common case of fetching a very
> +large PACK during linkgit:git-clone[1] into a smaller incremental
> +fetch.
> +
> +It also allows servers to achieve better caching in combination with
> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
> +
> +By having new clones or fetches be a more predictable and common
> +negotiation against the tips of recently produces *.bundle file(s).
> +Servers might even pre-generate the results of such negotiations for
> +the `uploadpack.packObjectsHook` as new pushes come in.
> +
> +I.e. the server would anticipate that fresh clones will download a
> +known bundle, followed by catching up to the current state of the
> +repository using ref tips found in that bundle (or bundles).
> +
> +PROTOCOL for bundle-uri
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +A `bundle-uri` request takes no arguments, and as noted above does not
> +currently advertise a capability value. Both may be added in the
> +future.

One thing I realized was missing from this proposal is any interaction
with partial clone. It would be disappointing if we could not advertise
bundles of commit-and-tree packfiles for blobless partial clones.

There is currently no way for the client to signal the filter type
during this command. Not having any way to extend to include that
seems like an oversight we should remedy before committing to a
protocol that can't be extended.

(This also seems like a good enough reason to group the URIs into a
struct-like storage, because the filter type could be stored next to
the URI.)

Thanks,
-Stolee


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27  2:01   ` Derrick Stolee
@ 2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
  2021-10-27 16:31       ` Derrick Stolee
  0 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-27  8:29 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> Add a server-side implementation of a new "bundle-uri" command to
>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>> allow conforming clients to optionally seed their initial clones or
>> incremental fetches from URLs containing "*.bundle" files created with
>> "git bundle create".
>
> ...
>
>> +DISCUSSION of bundle-uri
>> +^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +The intent of the feature is optimize for server resource consumption
>> +in the common case by changing the common case of fetching a very
>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>> +fetch.
>> +
>> +It also allows servers to achieve better caching in combination with
>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>> +
>> +By having new clones or fetches be a more predictable and common
>> +negotiation against the tips of recently produces *.bundle file(s).
>> +Servers might even pre-generate the results of such negotiations for
>> +the `uploadpack.packObjectsHook` as new pushes come in.
>> +
>> +I.e. the server would anticipate that fresh clones will download a
>> +known bundle, followed by catching up to the current state of the
>> +repository using ref tips found in that bundle (or bundles).
>> +
>> +PROTOCOL for bundle-uri
>> +^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +A `bundle-uri` request takes no arguments, and as noted above does not
>> +currently advertise a capability value. Both may be added in the
>> +future.
>
> One thing I realized was missing from this proposal is any interaction
> with partial clone. It would be disappointing if we could not advertise
> bundles of commit-and-tree packfiles for blobless partial clones.
>
> There is currently no way for the client to signal the filter type
> during this command. Not having any way to extend to include that
> seems like an oversight we should remedy before committing to a
> protocol that can't be extended.
>
> (This also seems like a good enough reason to group the URIs into a
> struct-like storage, because the filter type could be stored next to
> the URI.)

I'll update the docs to note that. I'd definitely like to leave out any
implementation of filter/shallow for an initial iteration of this for
simplicity, but the protocol keyword/behavior is open-ended enough to
permit any extension.

I.e. the server can start advertising "bundle-uri=shallow", and future
clients can request arbitrary key-value pairs in addition to just
"bundle-uri" now.

Having said that I think that *probably* this is something that'll never
be implemented, but maybe I'll eat my words there.

The reason is that once we're in the "fetch" dialog with the server, as
we are with "filter" and "shallow", I'd think that we'd be better off
just sending a packfile-uri, since that's tailor-made for that use-case.

But I suppose we could also advertise e.g.:

    <bundle-uri> tip=<oid> depth=1

Which a client that noticed that it wanted, say, a --single-branch clone
at <oid> but with depth=1 could use before it ever got to "fetch".

But (and I haven't looked into this really) I'd think that would quickly
get you into having a bundle with a PACK payload that wouldn't be
representable with the current bundle header format, and I think we'd
always want a 1:1 mapping between the two. I.e. you can specify a
prereq, but not leave out trees/blobs etc.

So thoughts on that most welcome, in particular how it could be made
future-proof.


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
@ 2021-10-27 16:31       ` Derrick Stolee
  2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-27 16:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Oct 26 2021, Derrick Stolee wrote:
> 
>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>> Add a server-side implementation of a new "bundle-uri" command to
>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>> allow conforming clients to optionally seed their initial clones or
>>> incremental fetches from URLs containing "*.bundle" files created with
>>> "git bundle create".
>>
>> ...
>>
>>> +DISCUSSION of bundle-uri
>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>> +
>>> +The intent of the feature is optimize for server resource consumption
>>> +in the common case by changing the common case of fetching a very
>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>> +fetch.
>>> +
>>> +It also allows servers to achieve better caching in combination with
>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>> +
>>> +By having new clones or fetches be a more predictable and common
>>> +negotiation against the tips of recently produces *.bundle file(s).
>>> +Servers might even pre-generate the results of such negotiations for
>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>> +
>>> +I.e. the server would anticipate that fresh clones will download a
>>> +known bundle, followed by catching up to the current state of the
>>> +repository using ref tips found in that bundle (or bundles).
>>> +
>>> +PROTOCOL for bundle-uri
>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>> +
>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>> +currently advertise a capability value. Both may be added in the
>>> +future.
>>
>> One thing I realized was missing from this proposal is any interaction
>> with partial clone. It would be disappointing if we could not advertise
>> bundles of commit-and-tree packfiles for blobless partial clones.
>>
>> There is currently no way for the client to signal the filter type
>> during this command. Not having any way to extend to include that
>> seems like an oversight we should remedy before committing to a
>> protocol that can't be extended.
>>
>> (This also seems like a good enough reason to group the URIs into a
>> struct-like storage, because the filter type could be stored next to
>> the URI.)
> 
> I'll update the docs to note that. I'd definitely like to leave out any
> implementation of filter/shallow for an initial iteration of this for
> simplicity, but the protocol keyword/behavior is open-ended enough to
> permit any extension.

It would be good to be explicit about how this would work. Looking at
it fresh, it seems that the server could send multiple bundle URIs with
the extra metadata to say which ones have a filter (and what that filter
is). The client could then check if a bundle matches the given filter.

But this is a bit inverted: the filter mechanism currently has the client
request a given filter and the server responds with _at least_ that much
data. This allows the server to ignore things like pathspec-filters or
certain size-based filters. If the client just ignores a bundle URI
because it doesn't match the exact filter, this could lead the client to
ask for the data without a bundle, even if it would be faster to just
download the advertised bundle.

For this reason, I think it would be valuable for the client to tell
the server the intended filter, and the server responds with bundle
URIs that contain a subset of the information that would be provided
by a later fetch request with that filter.

> I.e. the server can start advertising "bundle-uri=shallow", and future
> clients can request arbitrary key-value pairs in addition to just
> "bundle-uri" now.
> 
> Having said that I think that *probably* this is something that'll never
> be implemented, but maybe I'll eat my words there.

You continue focusing on the shallow option, which I agree is not
important. The filter option, specifically --filter=blob:none, seems
to be critical to have a short-term plan for implementing with this
in mind.

Thanks,
-Stolee


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27  1:55       ` Derrick Stolee
@ 2021-10-27 17:49         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-27 17:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson, Teng Long


On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote:

[I'll reply to the rest later, either here or in related threads. I.e. I
might end up entirely revamping the config etc. format]

>> There's also the small matter of it being consistent with the
>> packfile-uri config in its current form, but that shouldn't be a reason
>> not to come up with something better. If anything any better suggestion
>> (if we go for that) could be supported by it too...
>
> What do you mean about being consistent with packfile-uri? This layer
> that we care about isn't even implemented in git.git.

It's rather limited, but we do support an uploadpack.BlobPackFileUri as
a server-side feature for upload-pack. I.e.:

    uploadpack.BlobPackFileUri=<OID> <pack-hash> <packfile-uri>

See Documentation/technical/packfile-uri.txt.

The <pack-hash> is part of the protocol, but the <OID> is just an aid to
upload-pack to peel out that OID when it serves up the PACK, the <OID>
being what you get from the URI.

In terms of server implementation it's rather proof-of-concept-ish,
i.e. it's not really all that useful unless your use case is carving out
a small number of really big blobs. JGit's implementation is much more
mature, and there have been some patches on-list recently to make the
git.git one more practically useful[1].

1. https://lore.kernel.org/git/cover.1634634814.git.tenglong@alibaba-inc.com/



* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27 16:31       ` Derrick Stolee
@ 2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
  2021-10-27 19:23           ` Derrick Stolee
  2021-10-30 14:51           ` Philip Oakley
  0 siblings, 2 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-27 18:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Wed, Oct 27 2021, Derrick Stolee wrote:

> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>> 
>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>> allow conforming clients to optionally seed their initial clones or
>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>> "git bundle create".
>>>
>>> ...
>>>
>>>> +DISCUSSION of bundle-uri
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> +The intent of the feature is optimize for server resource consumption
>>>> +in the common case by changing the common case of fetching a very
>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>> +fetch.
>>>> +
>>>> +It also allows servers to achieve better caching in combination with
>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>> +
>>>> +By having new clones or fetches be a more predictable and common
>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>> +Servers might even pre-generate the results of such negotiations for
>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>> +
>>>> +I.e. the server would anticipate that fresh clones will download a
>>>> +known bundle, followed by catching up to the current state of the
>>>> +repository using ref tips found in that bundle (or bundles).
>>>> +
>>>> +PROTOCOL for bundle-uri
>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>> +currently advertise a capability value. Both may be added in the
>>>> +future.
>>>
>>> One thing I realized was missing from this proposal is any interaction
>>> with partial clone. It would be disappointing if we could not advertise
>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>
>>> There is currently no way for the client to signal the filter type
>>> during this command. Not having any way to extend to include that
>>> seems like an oversight we should remedy before committing to a
>>> protocol that can't be extended.
>>>
>>> (This also seems like a good enough reason to group the URIs into a
>>> struct-like storage, because the filter type could be stored next to
>>> the URI.)
>> 
>> I'll update the docs to note that. I'd definitely like to leave out any
>> implementation of filter/shallow for an initial iteration of this for
>> simplicity, but the protocol keyword/behavior is open-ended enough to
>> permit any extension.
>
> It would be good to be explicit about how this would work. Looking at
> it fresh, it seems that the server could send multiple bundle URIs with
> the extra metadata to say which ones have a filter (and what that filter
> is). The client could then check if a bundle matches the given filter.
>
> But this is a bit inverted: the filter mechanism currently has the client
> request a given filter and the server responds with _at least_ that much
> data. This allows the server to ignore things like pathspec-filters or
> certain size-based filters. If the client just ignores a bundle URI
> because it doesn't match the exact filter, this could lead the client to
> ask for the data without a bundle, even if it would be faster to just
> download the advertised bundle.
>
> For this reason, I think it would be valuable for the client to tell
> the server the intended filter, and the server responds with bundle
> URIs that contain a subset of the information that would be provided
> by a later fetch request with that filter.
>
>> I.e. the server can start advertising "bundle-uri=shallow", and future
>> clients can request arbitrary key-value pairs in addition to just
>> "bundle-uri" now.
>> 
>> Having said that I think that *probably* this is something that'll never
>> be implemented, but maybe I'll eat my words there.

I didn't mean to elide past "filter", but was just using "shallow" as a
short-hand for one thing in the "fetch" dialog that a client can mention
that'll impact PACK generation, just like filter.

Having thought about this a bit more, I think it should be an invariant
in any bundle-uri design that the server shouldn't communicate any
side-channel information whatsoever about a bundle it advertises, if
that information can't be discovered in the header of that bundle file.

Mind you, this means throwing out large parts of my current proposed
over-the-wire design, but I think for the better. I.e. the whole
response part where we communicate:

    (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
    flush-pkt

Would just become something like:

    (bundle-uri delim-pkt bundle-header? delim-pkt)*
    flush-pkt

I.e. we'd optionally transfer the content of the bundle header (content
up to the first "\n\n") to the client, but *only* ever as a shorthand
for saving the client a roundtrip.

The pointed-to bundle is still 100% the source of truth, and when
retrieving the bundle-uri we'd ignore whatever "bundle-header" we got
earlier (except insofar as we'd like to say emit a warning() if the two
don't match).
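
To make the "header is the source of truth" idea concrete, here is a
small sketch (Python, illustrative only; parse_bundle_header and
headers_match are hypothetical helpers, not git.git functions). It
assumes the existing v2/v3 bundle header layout: a signature line,
optional "@capability" lines (v3), "-<oid>" prerequisite lines and
"<oid> <refname>" tip lines, ended by a blank line:

```python
from typing import Dict, List, NamedTuple


class BundleHeader(NamedTuple):
    version: int
    capabilities: List[str]
    prerequisites: List[str]
    refs: Dict[str, str]


def parse_bundle_header(data: bytes) -> BundleHeader:
    """Parse a bundle's text header; anything after the blank line is ignored."""
    header = data.split(b"\n\n", 1)[0].rstrip(b"\n")
    lines = header.decode("utf-8").split("\n")
    signatures = {"# v2 git bundle": 2, "# v3 git bundle": 3}
    if lines[0] not in signatures:
        raise ValueError("unrecognized bundle signature")
    caps: List[str] = []
    prereqs: List[str] = []
    refs: Dict[str, str] = {}
    for line in lines[1:]:
        if line.startswith("@"):    # v3 capability, e.g. @object-format=sha256
            caps.append(line[1:])
        elif line.startswith("-"):  # prerequisite: "-<oid> <comment>"
            prereqs.append(line[1:].split(" ", 1)[0])
        else:                       # ref tip: "<oid> <refname>"
            oid, refname = line.split(" ", 1)
            refs[refname] = oid
    return BundleHeader(signatures[lines[0]], caps, prereqs, refs)


def headers_match(advertised: bytes, downloaded: bytes) -> bool:
    """True if the advertised header agrees with the downloaded bundle's own."""
    return parse_bundle_header(advertised) == parse_bundle_header(downloaded)
```

A client would only ever act on the header parsed from the downloaded
file; a False from headers_match() would at most trigger the warning.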

(I'd not thought too carefully about these shallow/filter etc. edge
cases, my main intended use-case has been pre-seeding full clones, and
having this feedback to make me think about it is very valuable).

> You continue focusing on the shallow option, which I agree is not
> important. The filter option, specifically --filter=blob:none, seems
> to be critical to have a short-term plan for implementing with this
> in mind.

Per the above this then just becomes a question of "how do we produce a
bundle with those attributes?".

I *think* that currently there isn't a way to do that, i.e. the PACK
payload of a bundle is the output of "git pack-objects", but the bundle
header can only describe it in terms of refs, tips and prerequisites.

I don't think you can say "this bundle has no blobs". The
"prerequisites" hard map to the same thing you could put on a
"want/have" line during PACK negotiation.

I think we could/should fix that, i.e. we can bump the bundle format
version and have it encode some extended prerequisites/filter/shallow
etc. information. You'd then have a 1:1 match between the features of
git-upload-pack and what you can transfer via the bundle side-channel.

But the more I think about it, the more strongly I feel that we should
always add that to the bundle *format*, and not as some side-channel
information in this "bundle-uri" protocol keyword.

To me *the* point of this feature is to have servers provide a shorthand
for something that's been a well-established trick you can do today, and
of which there are any number of pre-existing implementations.

I'm not trying to break any new ground here, just make "git
[fetch|clone]" support a well-known trick as a first-class feature via
protocol v2.

I'm not the first person to whip up some custom
"git-clone-via-bundle.sh" that takes bundle URI(s) and a repo URI,
wget's the bundle, calls "git bundle unbundle", updates ref tips, and
then does a "git fetch".
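
For readers who haven't seen one of those scripts, the steps above could
be sketched roughly as follows (Python instead of shell; the function
names and the refs/bundles/ namespace are arbitrary choices made for
this sketch, and error handling is omitted):

```python
import subprocess
import urllib.request
from typing import List, Tuple


def parse_unbundle_output(output: str) -> List[Tuple[str, str]]:
    """Parse the '<oid> <refname>' lines printed by 'git bundle unbundle'."""
    refs = []
    for line in output.splitlines():
        oid, _, refname = line.partition(" ")
        if oid and refname:
            refs.append((oid, refname))
    return refs


def clone_via_bundle(repo_uri: str, bundle_uri: str, dest: str) -> None:
    """Seed 'dest' from a bundle, then catch up with a normal fetch."""
    def git(*args: str) -> str:
        return subprocess.run(["git", "-C", dest, *args], check=True,
                              capture_output=True, text=True).stdout

    subprocess.run(["git", "init", dest], check=True)
    git("remote", "add", "origin", repo_uri)
    bundle_path, _ = urllib.request.urlretrieve(bundle_uri)  # stand-in for wget
    for oid, refname in parse_unbundle_output(git("bundle", "unbundle", bundle_path)):
        # Park the bundle tips under a private namespace so that the
        # following fetch can offer them as "have" lines.
        name = refname[len("refs/"):] if refname.startswith("refs/") else refname
        git("update-ref", "refs/bundles/" + name, oid)
    git("fetch", "origin")
```

The final "git fetch" is an ordinary negotiation; the server never needs
to know that the objects behind the "have" lines came from a bundle.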

The benefit of making that a first-class protocol feature over a full
negotiation is essentially synonymous with how it's easier in practice
to widely deploy static assets on CDNs vs. guaranteeing the same
network locality, caching etc. when serving up the same payload by
running a custom binary.

One reason not to add any side-channel information not found in the
bundle header(s) is that we can also guarantee that there won't be any
feature gap between the "transfer.injectBundleURI" config key I've
already got implemented (and is in the earlier RFC version of this
series). I.e. you can do:

    # You can specify this N number of times to inject N bundles
    git clone \
        -c transfer.injectBundleURI="https://something.local/some-repo.bundle" \
        https://example.com/some-repo.git

To inject CDN support to any remote server that doesn't know about
"bundle-uri", or add to the bundles of a server that does. That URI can
even be a file:// if you add "-c fetch.uriProtocols=file".

I realize that all of the above does *not* answer part of your question
about filters, which I think I can accurately rephrase as:

    Ok, so you can dump a static list of bundle URIs from config, but
    that's always going to be a small list, what about the combinatorial
    explosion of arbitrary upload-pack options? Filters, shallow,
    include-tag etc.

My main answer to that is YAGNI. If you need to spew out a URL for
a PACK after a client describes any of its arbitrary wants, needs,
filters etc. you've exactly re-invented what "packfile-uri" is today. I
think that feature is very useful, and I've got no intention of trying
to replace it.

I think the sweet spot for "bundle-uri" is to advertise a small number
of bundles that encompass common clone/fetch patterns. I.e. something
like a bundle for a full clone with the repo data up to today, and maybe
a couple of other bundles covering a time period that clients would be
likely to incrementally update within, e.g. 1 week ago..today &&
yesterday..now.

I agree that adding say "full clone, --depth=1" and "full clone, no
blobs" etc. to that might be *very* useful for some deployments, but per
the above I think we should really add that to the bundle format first,
not protocol v2.


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
@ 2021-10-27 19:23           ` Derrick Stolee
  2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason
  2021-10-30 14:51           ` Philip Oakley
  1 sibling, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-27 19:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 27 2021, Derrick Stolee wrote:
> 
>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>>>
>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>>> allow conforming clients to optionally seed their initial clones or
>>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>>> "git bundle create".
>>>>
>>>> ...
>>>>
>>>>> +DISCUSSION of bundle-uri
>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> +
>>>>> +The intent of the feature is optimize for server resource consumption
>>>>> +in the common case by changing the common case of fetching a very
>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>>> +fetch.
>>>>> +
>>>>> +It also allows servers to achieve better caching in combination with
>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>>> +
>>>>> +By having new clones or fetches be a more predictable and common
>>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>>> +Servers might even pre-generate the results of such negotiations for
>>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>>> +
>>>>> +I.e. the server would anticipate that fresh clones will download a
>>>>> +known bundle, followed by catching up to the current state of the
>>>>> +repository using ref tips found in that bundle (or bundles).
>>>>> +
>>>>> +PROTOCOL for bundle-uri
>>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>>> +
>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>>> +currently advertise a capability value. Both may be added in the
>>>>> +future.
>>>>
>>>> One thing I realized was missing from this proposal is any interaction
>>>> with partial clone. It would be disappointing if we could not advertise
>>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>>
>>>> There is currently no way for the client to signal the filter type
>>>> during this command. Not having any way to extend to include that
>>>> seems like an oversight we should remedy before committing to a
>>>> protocol that can't be extended.
>>>>
>>>> (This also seems like a good enough reason to group the URIs into a
>>>> struct-like storage, because the filter type could be stored next to
>>>> the URI.)
>>>
>>> I'll update the docs to note that. I'd definitely like to leave out any
>>> implementation of filter/shallow for an initial iteration of this for
>>> simplicity, but the protocol keyword/behavior is open-ended enough to
>>> permit any extension.
>>
>> It would be good to be explicit about how this would work. Looking at
>> it fresh, it seems that the server could send multiple bundle URIs with
>> the extra metadata to say which ones have a filter (and what that filter
>> is). The client could then check if a bundle matches the given filter.
>>
>> But this is a bit inverted: the filter mechanism currently has the client
>> request a given filter and the server responds with _at least_ that much
>> data. This allows the server to ignore things like pathspec-filters or
>> certain size-based filters. If the client just ignores a bundle URI
>> because it doesn't match the exact filter, this could lead the client to
>> ask for the data without a bundle, even if it would be faster to just
>> download the advertised bundle.
>>
>> For this reason, I think it would be valuable for the client to tell
>> the server the intended filter, and the server responds with bundle
>> URIs that contain a subset of the information that would be provided
>> by a later fetch request with that filter.
>>
>>> I.e. the server can start advertising "bundle-uri=shallow", and future
>>> clients can request arbitrary key-value pairs in addition to just
>>> "bundle-uri" now.
>>>
>>> Having said that I think that *probably* this is something that'll never
>>> be implemented, but maybe I'll eat my words there.
> 
> I didn't mean to elide past "filter", but was just using "shallow" as a
> short-hand for one thing in the "fetch" dialog that a client can mention
> that'll impact PACK generation, just like filter.
> 
> Having thought about this a bit more, I think it should be an invariant
> in any bundle-uri design that the server shouldn't communicate any
> side-channel information whatsoever about a bundle it advertises, if
> that information can't be discovered in the header of that bundle file.
> 
> Mind you, this means throwing out large parts of my current proposed
> over-the-wire design, but I think for the better. I.e. the whole
> response part where we communicate:
> 
>     (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
>     flush-pkt
> 
> Would just become something like:
> 
>     (bundle-uri delim-pkt bundle-header? delim-pkt)*
>     flush-pkt
> 
> I.e. we'd optionally transfer the content of the bundle header (content
> up to the first "\n\n") to the client, but *only* ever as a shorthand
> for saving the client a roundtrip.

It still seems like we're better off letting the client request a
filter and have the server present URIs that the client can use,
and the server can choose to ignore the filter or provide URIs that
are specific to that filter. Sending the full list and making the
client decide what it wants seems like it will be more complicated
than necessary.

However, I'll withhold complete judgement until I see a full proposal
in a v2.

> I think the sweet spot for "bundle-uri" is to advertise a small number
> of bundles that encompass common clone/fetch patterns. I.e. something
> like a bundle for a full clone with the repo data up to today, and maybe
> a couple of other bundles covering a time period that clients would be
> likely to incrementally update within, e.g. 1 week ago..today &&
> yesterday..now.
> 
> I agree that adding say "full clone, --depth=1" and "full clone, no
> blobs" etc. to that might be *very* useful for some deployments, but per
> the above I think we should really add that to the bundle format first,
> not protocol v2.
 
I'm focusing my interest in this topic not on "how can we make what we
already do faster?" but "how can we unlock scale not previously
possible?" Allowing blobless clones is an important part of this, in
my opinion, so it is my _default_ mode of operating.

Thanks,
-Stolee


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27 19:23           ` Derrick Stolee
@ 2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason
  2021-10-29 18:30               ` Derrick Stolee
  0 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-27 20:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Wed, Oct 27 2021, Derrick Stolee wrote:

> On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Oct 27 2021, Derrick Stolee wrote:
>> 
>>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>>>>
>>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>>>> allow conforming clients to optionally seed their initial clones or
>>>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>>>> "git bundle create".
>>>>>
>>>>> ...
>>>>>
>>>>>> +DISCUSSION of bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +The intent of the feature is optimize for server resource consumption
>>>>>> +in the common case by changing the common case of fetching a very
>>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>>>> +fetch.
>>>>>> +
>>>>>> +It also allows servers to achieve better caching in combination with
>>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>>>> +
>>>>>> +By having new clones or fetches be a more predictable and common
>>>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>>>> +Servers might even pre-generate the results of such negotiations for
>>>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>>>> +
>>>>>> +I.e. the server would anticipate that fresh clones will download a
>>>>>> +known bundle, followed by catching up to the current state of the
>>>>>> +repository using ref tips found in that bundle (or bundles).
>>>>>> +
>>>>>> +PROTOCOL for bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>>>> +currently advertise a capability value. Both may be added in the
>>>>>> +future.
>>>>>
>>>>> One thing I realized was missing from this proposal is any interaction
>>>>> with partial clone. It would be disappointing if we could not advertise
>>>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>>>
>>>>> There is currently no way for the client to signal the filter type
>>>>> during this command. Not having any way to extend to include that
>>>>> seems like an oversight we should remedy before committing to a
>>>>> protocol that can't be extended.
>>>>>
>>>>> (This also seems like a good enough reason to group the URIs into a
>>>>> struct-like storage, because the filter type could be stored next to
>>>>> the URI.)
>>>>
>>>> I'll update the docs to note that. I'd definitely like to leave out any
>>>> implementation of filter/shallow for an initial iteration of this for
>>>> simplicity, but the protocol keyword/behavior is open-ended enough to
>>>> permit any extension.
>>>
>>> It would be good to be explicit about how this would work. Looking at
>>> it fresh, it seems that the server could send multiple bundle URIs with
>>> the extra metadata to say which ones have a filter (and what that filter
>>> is). The client could then check if a bundle matches the given filter.
>>>
>>> But this is a bit inverted: the filter mechanism currently has the client
>>> request a given filter and the server responds with _at least_ that much
>>> data. This allows the server to ignore things like pathspec-filters or
>>> certain size-based filters. If the client just ignores a bundle URI
>>> because it doesn't match the exact filter, this could lead the client to
>>> ask for the data without a bundle, even if it would be faster to just
>>> download the advertised bundle.
>>>
>>> For this reason, I think it would be valuable for the client to tell
>>> the server the intended filter, and the server responds with bundle
>>> URIs that contain a subset of the information that would be provided
>>> by a later fetch request with that filter.
>>>
>>>> I.e. the server can start advertising "bundle-uri=shallow", and future
>>>> clients can request arbitrary key-value pairs in addition to just
>>>> "bundle-uri" now.
>>>>
>>>> Having said that I think that *probably* this is something that'll never
>>>> be implemented, but maybe I'll eat my words there.
>> 
>> I didn't mean to elide past "filter", but was just using "shallow" as a
>> short-hand for one thing in the "fetch" dialog that a client can mention
>> that'll impact PACK generation, just like filter.
>> 
>> Having thought about this a bit more, I think it should be an invariant
>> in any bundle-uri design that the server shouldn't communicate any
>> side-channel information whatsoever about a bundle it advertises, if
>> that information can't be discovered in the header of that bundle file.
>> 
>> Mind you, this means throwing out large parts of my current proposed
>> over-the-wire design, but I think for the better. I.e. the whole
>> response part where we communicate:
>> 
>>     (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
>>     flush-pkt
>> 
>> Would just become something like:
>> 
>>     (bundle-uri delim-pkt bundle-header? delim-pkt)*
>>     flush-pkt
>> 
>> I.e. we'd optionally transfer the content of the bundle header (content
>> up to the first "\n\n") to the client, but *only* ever as a shorthand
>> for saving the client a roundtrip.
>
> It still seems like we're better off letting the client request a
> filter and have the server present URIs that the client can use,
> and the server can choose to ignore the filter or provide URIs that
> are specific to that filter.[...]

I've tested this a bit now and think there's no way to create such a
bundle currently. I.e. try:

    git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git
    cd git
    git config --unset remote.origin.partialclonefilter
    git config --unset remote.origin.promisor

You'll get:
    
    $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master
    Enumerating objects: 306784, done.
    Counting objects: 100% (306784/306784), done.
    Compressing objects: 100% (69265/69265), done.
    fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6
    error: pack-objects died

If you didn't do that config munging we'd create the pack, but it would
be inflated with the blobs (after going back and getting them from the
server).

So aside from any questions of how you'd hypothetically communicate your
desire to get such a bundle from the server, I don't think it could serve
one up.

So I think this is moot until the bundle format itself could support
it. I'll need to "git bundle [verify|unbundle]" whatever I get on the
other end.

I really don't mean this in any way as dodging the desirability of this
feature. I'd really like to have it too. I think implementing it should
be relatively simple, and I've got an implementation in mind that makes
this future-proof for anything else we'd like to add.

I.e. if you look at that v3 format bundle you'll see:
    
    $ head -c 100 master.bdl
    # v3 git bundle
    @object-format=sha1
    e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master
    
    PACK

Wouldn't this just be a matter of including extra lines with:

    # I'm assuming that the promisor url can be assumed to be "the url
    # we cloned this from", but maybe we need @remoteURL=https://....
    @promisor=true
    @filter=blob:none

I.e. exactly corresponding to the .git/config we'd end up with. We'd
then (I think) create .git/objects/pack/*.promisor with the OIDs of each
of the inflated tips (I'm not familiar with .promisor files).
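
To make the header discussion concrete, here is a rough Python sketch
of reading such a header. It is not code from this series; the function
name and the returned dict layout are made up for illustration. It
treats "@key=value" lines as capabilities, "-<oid>" lines as
prerequisites and "<oid> <refname>" lines as tips, and would pick up a
proposed "@filter" line the same way as "@object-format".

```python
def parse_bundle_header(data: bytes) -> dict:
    """Parse a v2/v3 bundle header, i.e. everything before the first blank line."""
    header, sep, _pack = data.partition(b"\n\n")
    if not sep:
        raise ValueError("no blank line terminating the bundle header")
    lines = header.split(b"\n")
    signature = lines[0]
    if signature not in (b"# v2 git bundle", b"# v3 git bundle"):
        raise ValueError("not a git bundle")
    version = 3 if signature == b"# v3 git bundle" else 2

    capabilities, prerequisites, refs = {}, [], {}
    for line in lines[1:]:
        if line.startswith(b"@"):
            # Capability lines only exist in the v3 format.
            if version < 3:
                raise ValueError("capabilities require a v3 bundle")
            key, _, value = line[1:].partition(b"=")
            capabilities[key.decode()] = value.decode()
        elif line.startswith(b"-"):
            # "-<oid> <comment>": only the OID matters here.
            prerequisites.append(line[1:].split(b" ", 1)[0].decode())
        else:
            oid, _, refname = line.partition(b" ")
            # Refnames are kept as bytes, they need not be valid UTF-8.
            refs[refname] = oid.decode()
    return {
        "version": version,
        "capabilities": capabilities,
        "prerequisites": prerequisites,
        "refs": refs,
    }
```

A client could then decide from the "capabilities" dict alone whether
it understands the bundle, before touching the PACK data.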

And a thing I need to include in the bundle-uri protocol is that the
client should not just include a "bundle-uri" attribute, but have a
"value" describing the bundle format it accepts. I.e. now:

    bundle-uri=v3,object-format

And for supporting the above:

    bundle-uri=v3,object-format,promisor,filter

I.e. currently we die on any bundle capability except "object-format".
If we're going to discover what to send, we'd like a less crappy way than
parsing the version from the "agent" field.
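
As a hedged illustration of how a server might use such a value, the
Python sketch below splits it into a maximum bundle version and a set
of accepted capabilities, then checks whether a given bundle is safe to
advertise. Both function names are hypothetical, and the comma-separated
syntax is only the one floated above, not a settled protocol.

```python
def parse_bundle_uri_value(value):
    """Split e.g. "v3,object-format" into (3, {"object-format"})."""
    version = 2  # assumption: no "vN" token means plain v2 bundles
    capabilities = set()
    for token in value.split(","):
        if token.startswith("v") and token[1:].isdigit():
            version = int(token[1:])
        elif token:
            capabilities.add(token)
    return version, capabilities


def client_can_use(bundle_version, bundle_capabilities, client_value):
    """True if the client declared support for everything the bundle needs."""
    max_version, accepted = parse_bundle_uri_value(client_value)
    return bundle_version <= max_version and set(bundle_capabilities) <= accepted
```

With this, a bundle carrying "filter" and "promisor" capabilities would
be withheld from a client that only sent "v3,object-format".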

> Sending the full list and making the client decide what it wants seems
> like it will be more complicated than necessary. However, I'll
> withhold complete judgement until I see a full proposal in a v2.

I'm very thankful for the thorough review, and it's exciting that you'd
like to use this feature in some form, and I'll definitely do my best to
support (and if not, future-proof) any use-cases you have in mind.

But I really don't get how this wouldn't effectively be functionally
indistinguishable from packfile-uri, sans a nit here and there.

I can see the convenience of having say 100 bundles, advertising 5, and
then after a full negotiation dialog pointing the equivalent of a
packfile-uri at a *.bundle file, just because that's what you happen to
have around already. If bundle-uri is your main static file distribution
you don't want a duplicate *.pack (without the bundle header) just for
that.

I think a logical extension of the packfile-uri feature for those that
need extended negotiation before deciding on the static URL would be to
teach the packfile-uri downloader to ignore an optional bundle header of
any PACK it finds at a URL (which would not be the same as this
proposal), just to support that use-case.

But, isn't that essentially what you'd want in those cases?

Spewing a "here's my bundles" at a client gets it started quickly, and
also has the side-benefit of making those assets more cachable, as well
as creating a known base for the caching of any subsequent "...and a
PACK negotiation to make it fully up-to-date" request.

The bundles are also our de-facto sneakernet format, and can be used
for incremental replication. All of which is also a sweet spot for
bundle-uri, i.e. the combination of being able to re-use already
"replicated" files for CDN-ing, and providing wider access to CDN
features for "dumb" servers.

But once we're in dialog with a client to discuss its arbitrary filter
preferences before giving it a URL we're going to be most likely
implementing that as a mode of upload-pack.c anyway, and when it spews
out an optional URL at the end of that dialog....

>> I think the sweet spot for "bundle-uri" is to advertise a small number
>> of bundles that encompass common clone/fetch patterns. I.e. something
>> like a bundle for a full clone with the repo data up to today, and maybe
>> a couple of other bundles covering a time period that clients would be
>> likely to incrementally update within, e.g. 1 week ago..today &&
>> yesterday..now.
>> 
>> I agree that adding say "full clone, --depth=1" and "full clone, no
>> blobs" etc. to that might be *very* useful for some deployments, but per
>> the above I think we should really add that to the bundle format first,
>> not protocol v2.
>  
> I'm focusing my interest in this topic not on "how can we make what we
> already do faster?" but "how can we unlock scale not previously
> possible?" Allowing blobless clones is an important part of this, in
> my opinion, so it is my _default_ mode of operating.

*nod*, I'm keen to support it using something like what's described
above. I.e. stick it as new headers in the bundle format, then be able
to advertise those for common cases. I'd think most of these would be a
few combinations (usually something copy/pasted from a relevant dev
guide), e.g. "all history, no blobs", "main branch, no tags, no blobs"
etc.

Aside from this bundle-uri protocol proposal, being able to sneakernet a
repo you cloned like that around as-is seems highly desirable. It's just
a feature gap we introduced when those features were added to "git fetch"
and friends, but not to "git bundle".


* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason
@ 2021-10-29 18:30               ` Derrick Stolee
  0 siblings, 0 replies; 77+ messages in thread
From: Derrick Stolee @ 2021-10-29 18:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

On 10/27/2021 4:22 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 27 2021, Derrick Stolee wrote:
>> It still seems like we're better off letting the client request a
>> filter and have the server present URIs that the client can use,
>> and the server can choose to ignore the filter or provide URIs that
>> are specific to that filter.[...]
> 
> I've tested this a bit now and think there's no way to create such a
> bundle currently. I.e. try:
> 
>     git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git
>     cd git
>     git config --unset remote.origin.partialclonefilter
>     git config --unset remote.origin.promisor
> 
> You'll get:
>     
>     $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master
>     Enumerating objects: 306784, done.
>     Counting objects: 100% (306784/306784), done.
>     Compressing objects: 100% (69265/69265), done.
>     fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6
>     error: pack-objects died
> 
> If you didn't do that config munging we'd create the pack, but it would
> be inflated with the blobs (after going back and getting them from the
> server).
> 
> So aside from any questions of how you'd hypothetically communicate your
> desire to get such a bundle from the server, I don't think it could serve
> one up.
> 
> So I think this is moot until the bundle format itself could support
> it. I'll need to "git bundle [verify|unbundle]" whatever I get on the
> other end.

Thank you for demonstrating that bundles don't currently work with
filters. This would need to be remedied, but can be done independently.

For now, let's just make sure that there is a paved path for extending
whatever is decided here to work with filters in the future.

As for the overall design, I have some thoughts that I'm going to
break out into a new message responding to your cover letter.

Thanks,
-Stolee


* Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
  2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
                   ` (2 preceding siblings ...)
  2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
@ 2021-10-29 18:46 ` Derrick Stolee
  2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
  4 siblings, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-10-29 18:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jeff King, Patrick Steinhardt, Christian Couder,
	Albert Cui, Jonathan Tan, Jonathan Nieder, brian m . carlson,
	Robin H . Johnson

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> This implements a new "bundle-uri" protocol v2 extension, which allows
> servers to advertise *.bundle files which clients can pre-seed their
> full "clone"'s or incremental "fetch"'s from.
> 
> This is both an alternative to, and complimentary to the existing
> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
> both, but would generally pick one over the other.
> 
> This "bundle-uri" mechanism has the advantage of being dumber, and
> offloads more complexity from the server side to the client
> side.

Generally, I like that using bundles presents an easier way to serve
static content from an alternative source and then let Git's fetch
negotiation catch up with the remainder.

However, after inspecting your design and talking to some GitHub
engineers who know more about CDNs and general internet things than I
do, I want to propose an alternative design. I think this new design
is both more flexible and promotes further decoupling
of the origin Git server and the bundle contents.

Your proposed design extends protocol v2 to let the client request a
list of bundle URIs from the origin server. However, this still requires
the origin server to know about this list. Further, your implementation
focuses on the server side without integrating with the client.

I propose that we flip this around. The "bundle server" should know
which bundles are available at which URIs, and the client should contact
the bundle server directly for a "table of contents" that lists these
URIs, along with metadata related to each URI. The origin Git server
then would only need to store the list of bundle servers and the URIs
to their table of contents. The client could then pick from among those
bundle servers (probably by ping time, or randomly) to start the bundle
downloads.

To summarize, there are two pieces here, that can be implemented at
different times:

1. Create a specification for a "bundle server" that doesn't need to
   speak the Git protocol at all. This could be a REST API specification
   using well-established standards such as JSON for the table of
   contents.

2. Create a way for the origin Git server to advertise known bundle
   servers to clients so they can automatically benefit from faster
   downloads without needing to know about bundle servers.

There are a few key benefits to this approach:

 * Further decoupling. The origin Git server doesn't need to know how
   the bundle server organizes its bundles. This allows maximum flexibility
   depending on whether the bundles are stored in something like a CDN
   (where bundles can't be too big) or some kind of blob storage (where
   they can have arbitrarily large size).

 * The bundle servers could be run completely independently from the
   origin Git server. Organizations could run their own bundle servers to
   host data in the same building as their build farms. As long as they
   can configure the bundle location at clone/fetch time, the origin Git
   server doesn't need to be involved.

While I didn't go so far as to create a clear standard or implement a
prototype in the Git codebase, I created a very simple prototype [1] using
a python script that parses a JSON table of contents and downloads
bundles into the Git repository. Then, I made a 'clone.sh' script that
initializes a repository using the bundle fetcher and fetches the
remainder from the origin Git server. I even computed static bundles for
the git.git repository based on where 'master' has been over several days
in the past month, to give an example of incremental bundles. You can
test the approach all the way to including the fetch to github.com (note
how the GitHub servers were not modified in any way for this).

[1] https://github.com/derrickstolee/bundles

There are a lot of limitations to the prototype, but it hopefully
demonstrates the possibility of using something other than the Git protocol
to solve these problems.
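
For readers who want a feel for the client side, here is a minimal
Python sketch of consuming such a table of contents. It is not the
script from [1]; the JSON layout (a "bundles" list whose entries have
"uri" and "timestamp" keys) and the function name are assumptions made
for this example.

```python
import json


def bundles_to_download(toc_json, last_seen):
    """Return URIs of bundles newer than last_seen, oldest first."""
    toc = json.loads(toc_json)
    newer = [b for b in toc["bundles"] if b["timestamp"] > last_seen]
    return [b["uri"] for b in sorted(newer, key=lambda b: b["timestamp"])]
```

A first clone would pass 0 as last_seen and get every bundle; a later
fetch would pass the newest timestamp it has already applied.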

Let me know if you are interested in switching your approach to something
more like what I propose here. There are many more questions about what
information could/should be located in the table of contents and how it can
be extended in the future. I'm interested to explore that space with you.

Thanks,
-Stolee


* Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
  2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
@ 2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
  2021-11-01 21:00     ` Derrick Stolee
  0 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-30  7:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Fri, Oct 29 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> This implements a new "bundle-uri" protocol v2 extension, which allows
>> servers to advertise *.bundle files which clients can pre-seed their
>> full "clone"'s or incremental "fetch"'s from.
>> 
>> This is both an alternative to, and complimentary to the existing
>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>> both, but would generally pick one over the other.
>> 
>> This "bundle-uri" mechanism has the advantage of being dumber, and
>> offloads more complexity from the server side to the client
>> side.
>
> Generally, I like that using bundles presents an easier way to serve
> static content from an alternative source and then let Git's fetch
> negotiation catch up with the remainder.
>
> However, after inspecting your design and talking to some GitHub
> engineers who know more about CDNs and general internet things than I
> do, I want to propose an alternative design. I think this new design
> is both more flexible and promotes further decoupling
> of the origin Git server and the bundle contents.
>
> Your proposed design extends protocol v2 to let the client request a
> list of bundle URIs from the origin server. However, this still requires
> the origin server to know about this list. [...]

Interesting, more below...

> Further, your implementation focuses on the server side without
> integrating with the client.

Do you mean these 3 patches we're discussing now? Yes, that's the
server-side and protocol specification only, because I figured talking
about just the spec might be helpful.

But as noted in the CL and previously on-list I have a larger set of
patches to implement the client behavior, an old RFC version of that
here (I've since changed some things...):
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

I mean, you commented on those too, so I'm not sure if that's what you
meant, but just for context...

> I propose that we flip this around. The "bundle server" should know
> which bundles are available at which URIs, and the client should contact
> the bundle server directly for a "table of contents" that lists these
> URIs, along with metadata related to each URI. The origin Git server
> then would only need to store the list of bundle servers and the URIs
> to their table of contents. The client could then pick from among those
> bundle servers (probably by ping time, or randomly) to start the bundle
> downloads.

I hadn't considered the server not advertising the list, but pointing to
another URI that has the list. I was thinking that the server would be
close enough to whatever's generating the list that updating the list
there wouldn't be a meaningful limitation for anyone.

But you seem to have a use-case for it, I'd be curious to hear why
specifically, but in any case that's easy to support in the client
patches I have.

There's a point at which we get the list of URIs from the server, to
support your case the server would just advertise the one TOC URI.

Then similarly to the "packfile-uri" special-case of handling a *.bundle
instead of a PACK that I noted in [1], the downloader would just spot
"oh, this isn't a bundle but a list of URIs", fetch those (even
recursively), and eventually get to *.bundle files.
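
Roughly, and only as a sketch, that downloader logic could look like
the Python below. It assumes a list is plain newline-separated URIs
(the thread does not settle a list format) and takes the transport as a
caller-supplied fetch(uri) function returning bytes; all names are
hypothetical.

```python
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")


def resolve_bundles(uri, fetch, max_depth=5, _seen=None):
    """Return the URIs that hold actual bundles, following URI lists."""
    seen = set() if _seen is None else _seen
    if uri in seen or max_depth < 0:
        return []  # guard against loops and runaway nesting
    seen.add(uri)
    body = fetch(uri)
    if body.startswith(BUNDLE_SIGNATURES):
        return [uri]  # a real bundle, stop descending
    found = []
    for line in body.decode().splitlines():
        line = line.strip()
        if line:
            found.extend(resolve_bundles(line, fetch, max_depth - 1, seen))
    return found
```

The depth limit and the "seen" set are there because a recursive list
format otherwise lets a misconfigured server send a client in circles.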

> To summarize, there are two pieces here, that can be implemented at
> different times:
>
> 1. Create a specification for a "bundle server" that doesn't need to
>    speak the Git protocol at all. This could be a REST API specification
>    using well-established standards such as JSON for the table of
>    contents.
>
> 2. Create a way for the origin Git server to advertise known bundle
>    servers to clients so they can automatically benefit from faster
>    downloads without needing to know about bundle servers.
>
> There are a few key benefits to this approach:
>
>  * Further decoupling. The origin Git server doesn't need to know how
>    the bundle server organizes its bundles. This allows maximum flexibility
>    depending on whether the bundles are stored in something like a CDN
>    (where bundles can't be too big) or some kind of blob storage (where
>    they can have arbitrarily large size).
>
>  * The bundle servers could be run completely independently from the
>    origin Git server. Organizations could run their own bundle servers to
>    host data in the same building as their build farms. As long as they
>    can configure the bundle location at clone/fetch time, the origin Git
>    server doesn't need to be involved.
>
> While I didn't go so far as to create a clear standard or implement a
> prototype in the Git codebase, I created a very simple prototype [1] using
> a python script that parses a JSON table of contents and downloads
> bundles into the Git repository. Then, I made a 'clone.sh' script that
> initializes a repository using the bundle fetcher and fetches the
> remainder from the origin Git server. I even computed static bundles for
> the git.git repository based on where 'master' has been over several days
> in the past month, to give an example of incremental bundles. You can
> test the approach all the way to including the fetch to github.com (note
> how the GitHub servers were not modified in any way for this).
>
> [1] https://github.com/derrickstolee/bundles
>
> There are a lot of limitations to the prototype, but it hopefully
> demonstrates the possibility of using something other than the Git protocol
> to solve these problems.

In your proposal the TOC bundle itself doesn't need to speak the git
protocol.

But as soon as we specify such a thing all of that becomes a part of
the git protocol at large in any meaningful way, i.e. git.git's client
and any other client that wants to implement the full protocol at large
would now need to understand not only pkt-line but also ship a JSON
decoder etc.

I don't see an inherent problem with us wanting to support some nested
encoding format as part of the protocol, but JSON seems like a
particularly bad choice. It's specified as UTF-8 only (or rather, "a
Unicode encoding"), so you can't stick both valid UTF-8 and binary data
into it.

Our refs on the other hand don't conform to that, so having a JSON
format means you can never have something that refers to refnames, which
is a problem given that we're talking about bundles, whose own header
already has that information.

> Let me know if you are interested in switching your approach to something
> more like what I propose here. There are many more questions about what
> information could/should be located in the table of contents and how it can
> be extended in the future. I'm interested to explore that space with you.

As noted above, the TOC part of this seems interesting, and I don't see
a reason not to implement that.

But as noted in [1] I really don't see why it would be a good idea to
implement a side-format that's encoding a limited subset of what you'd
find in bundle headers.

Specifically on the meta-information you're proposing:

== requires

In your example you've added a monolithic "requires" relationship
between bundles, saying "This assumes that the bundles can be ordered".

But that's not something you can assume for actual bundle files,
i.e. the prerequisite relationship is per-reftip, it's not the case that
a given bundle requires another bundle, it's the case that tips found in
them may or may not depend on other prerequisites.

If you're creating bundles that contain only one tip there's a 1=1
mapping to what you're proposing with "requires", but otherwise there
isn't.

== timestamp

"This allows us to reexamine the table of contents and only download the
bundles that are newer than that timestamp."

We're usually going to be fetching these over http(s), why duplicate
what you can already get if the server just takes care to create unique
filenames (e.g. as a function of the SHA of their contents), and then
provides appropriate caching headers to a client so that they'll be
cached forever?

I think that gives you everything you'd like out of the "timestamp" and
more, the "more" being that since it's part of a protocol that's already
standard you'd have e.g. intermediate caching proxies understanding this
implicitly, in addition to the git client itself.

So on a network that's say locally unpacking https connections to a
public CDN you could have a local caching proxy for your N local
clients, as opposed to a custom "timestamp" value, which only each local
git client will understand.
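
A small Python sketch of that idea follows. The choice of SHA-256 and
both function names are assumptions for the example; the Cache-Control
directives themselves are standard HTTP.

```python
import hashlib


def publish_name(bundle_bytes):
    """Name a bundle after the hash of its contents (SHA-256 assumed here)."""
    return hashlib.sha256(bundle_bytes).hexdigest() + ".bundle"


def cache_headers():
    """Headers telling clients and proxies the file never changes."""
    return {"Cache-Control": "public, max-age=31536000, immutable"}
```

Because a changed bundle gets a new name, an already-downloaded URI
never needs re-checking, which is what the "timestamp" was meant to
achieve.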

== Generally

Sorry, I've got to run, so I haven't addressed all the things you
brought up, but generally I think that the TOC idea is a good one.

I don't see a reason for why most/all of the other bits shouldn't be
leaning into either the bundle header (and for any TOC shortcut, dump it
as-is, as noted in [1]), or in the case of "timestamp" lean into the
properties of the transport protocol.

And just generally on overall protocol complexity, wouldn't it be OK if
any such TOC is just in pkt-line format?

We could just provide a git plumbing tool to spew that out, and having
some static server job call that once and forever after serve up a
plain file doesn't seem like a big restriction, and would mean that any
git client code wouldn't need to deal with another encoding format.

1. https://lore.kernel.org/git/211027.86a6iuxk3x.gmgdl@evledraar.gmail.com/

* Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
  2021-10-27 19:23           ` Derrick Stolee
@ 2021-10-30 14:51           ` Philip Oakley
  1 sibling, 0 replies; 77+ messages in thread
From: Philip Oakley @ 2021-10-30 14:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

An aside:

On 27/10/2021 19:01, Ævar Arnfjörð Bjarmason wrote:
> I don't think you can say "this bundle has no blobs". The
> "prerequisites" hard map to the same thing you could put on a
> "want/have" line during PACK negotiation.
>
> I think we could/should fix that, i.e. we can bump the bundle format
> version and have it encode some extended prerequisites/filter/shallow
> etc information.

If the format is bumped, could we also include the
HEAD=<particular-branch> info within that format?
The `guess the HEAD` algorithm isn't ideal and shows up in user
questions every now and again.

>  You'd then have a 1=1 match between the features of
> git-upload-pack and what you can transfer via the bundle side-channel.
--
Philip

* Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
  2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
@ 2021-11-01 21:00     ` Derrick Stolee
  2021-11-01 23:18       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2021-11-01 21:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson

On 10/30/2021 3:21 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Oct 29 2021, Derrick Stolee wrote:
> 
>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>> This implements a new "bundle-uri" protocol v2 extension, which allows
>>> servers to advertise *.bundle files which clients can pre-seed their
>>> full "clone"'s or incremental "fetch"'s from.
>>>
>>> This is both an alternative to, and complimentary to the existing
>>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>>> both, but would generally pick one over the other.
>>>
>>> This "bundle-uri" mechanism has the advantage of being dumber, and
>>> offloads more complexity from the server side to the client
>>> side.
>>
>> Generally, I like that using bundles presents an easier way to serve
>> static content from an alternative source and then let Git's fetch
>> negotiation catch up with the remainder.
>>
>> However, after inspecting your design and talking to some GitHub
>> engineers who know more about CDNs and general internet things than I
>> do, I want to propose an alternative design. I think this new design
>> is simultaneously more flexible as well as promotes further decoupling
>> of the origin Git server and the bundle contents.
>>
>> Your proposed design extends protocol v2 to let the client request a
>> list of bundle URIs from the origin server. However, this still requires
>> the origin server to know about this list. [...]
> 
> Interesting, more below...
> 
>> Further, your implementation focuses on the server side without
>> integrating with the client.
> 
> Do you mean these 3 patches we're discussing now? Yes, that's the
> server-side and protocol specification only, because I figured talking
> about just the spec might be helpful.
> 
> But as noted in the CL and previously on-list I have a larger set of
> patches to implement the client behavior, an old RFC version of that
> here (I've since changed some things...):
> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> 
> I mean, you commented on those too, so I'm not sure if that's what you
> meant, but just for context...

Yeah, I'm not able to keep all of that in my head, and I focused on
what you presented in this thread.

>> I propose that we flip this around. The "bundle server" should know
>> which bundles are available at which URIs, and the client should contact
>> the bundle server directly for a "table of contents" that lists these
>> URIs, along with metadata related to each URI. The origin Git server
>> then would only need to store the list of bundle servers and the URIs
>> to their table of contents. The client could then pick from among those
>> bundle servers (probably by ping time, or randomly) to start the bundle
>> downloads.
> 
> I hadn't considered the server not advertising the list, but pointing to
> another URI that has the list. I was thinking that the server would be
> close enough to whatever's generating the list that updating the list
> there wouldn't be a meaningful limitation for anyone.
> 
> But you seem to have a use-case for it, I'd be curious to hear why
> specifically, but in any case that's easy to support in the client
> patches I have.

Show me the client patches and then I can determine if I think that
is sufficiently flexible.

In general, I want to expand the scope of this feature beyond "bundles
on a CDN" and towards "alternative sources of Git object data" which
_could_ be a CDN, but could also be geodistributed HTTP servers that
manage their own copy of the Git data (periodically fetching from the
origin). These could be self-hosted by organizations with needs for
low-latency, high-throughput downloads of object data.

For a concrete example, a group with a build farm could create their
own bundle server on the same LAN as the build machines, and they could
mirror whatever Git service they want. This requires the users setting
up the server and telling the machines about the URL for the table of
contents at clone/fetch time.

By having the origin Git server advertise the table of contents, hosts
such as GitHub, GitLab, and others could have their own CDN solutions
that clients discover automatically. This is clearly the environment
you are targeting. Allowing a redirection to another table of contents
further decouples what is responsible for the bundle organization away
from the origin Git server.

One thing neither of us has touched on is authentication, so we'll want
to find out how to access private information securely in this model.
Authentication doesn't matter for CDNs, but would matter for other
hosting models.

> There's a point at which we get the list of URIs from the server, to
> support your case the client would just advertise the one TOC URI.
> 
> Then similarly to the "packfile-uri" special-case of handling a *.bundle
> instead of a PACK that I noted in [1], the downloader would just spot
> "oh this isn't a bundle, but a list of URIs", and then fetch those (even
> recursively), and eventually get to *.bundle files.

This recursive "follow the contents" approach seems to be a nice
general approach. I would still want to have some kind of specification
about what could be seen at these URIs before modifying the protocol v2
specification.
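
As a rough illustration of the recursive "follow the contents" idea, and
not a proposal for the actual specification, the sketch below treats
anything that starts with a bundle signature as a bundle and anything
else as a newline-separated list of further URIs. That list format, the
function name `resolve_bundles`, and the caller-supplied `fetch`
callback are all assumptions; a real downloader would also avoid reading
a whole bundle just to inspect its first line.

```python
# First line of a bundle file in the v2 and v3 bundle formats.
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")


def resolve_bundles(uri, fetch, seen=None):
    """Return the bundle URIs reachable from `uri`, depth first.

    `fetch(uri)` must return the response body as bytes. A body that
    does not start with a bundle signature is assumed to be a list of
    URIs, one per line (hypothetical format).
    """
    seen = set() if seen is None else seen
    if uri in seen:
        # Guard against lists that point back at each other.
        return []
    seen.add(uri)

    body = fetch(uri)
    if body.startswith(BUNDLE_SIGNATURES):
        return [uri]

    found = []
    for line in body.decode().splitlines():
        line = line.strip()
        if line:
            found.extend(resolve_bundles(line, fetch, seen))
    return found
```

The `seen` set is the one piece a specification would probably need to
say something about, since recursion without a loop guard (or a depth
limit) lets a misconfigured server keep a client busy indefinitely.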

>> To summarize, there are two pieces here, that can be implemented at
>> different times:
>>
>> 1. Create a specification for a "bundle server" that doesn't need to
>>    speak the Git protocol at all. This could be a REST API specification
>>    using well-established standards such as JSON for the table of
>>    contents.
>>
>> 2. Create a way for the origin Git server to advertise known bundle
>>    servers to clients so they can automatically benefit from faster
>>    downloads without needing to know about bundle servers.
>>
>> There are a few key benefits to this approach:
>>
>>  * Further decoupling. The origin Git server doesn't need to know how
>>    the bundle server organizes its bundles. This allows maximum flexibility
>>    depending on whether the bundles are stored in something like a CDN
>>    (where bundles can't be too big) or some kind of blob storage (where
>>    they can have arbitrarily large size).
>>
>>  * The bundle servers could be run completely independently from the
>>    origin Git server. Organizations could run their own bundle servers to
>>    host data in the same building as their build farms. As long as they
>>    can configure the bundle location at clone/fetch time, the origin Git
>>    server doesn't need to be involved.
>>
>> While I didn't go so far as to create a clear standard or implement a
>> prototype in the Git codebase, I created a very simple prototype [1] using
>> a python script that parses a JSON table of contents and downloads
>> bundles into the Git repository. Then, I made a 'clone.sh' script that
>> initializes a repository using the bundle fetcher and fetching the
>> remainder from the origin Git server. I even computed static bundles for
>> the git.git repository based on where 'master' has been over several days
>> in the past month, to give an example of incremental bundles. You can
>> test the approach all the way to including the fetch to github.com (note
>> how the GitHub servers were not modified in any way for this).
>>
>> [1] https://github.com/derrickstolee/bundles
>>
>> There are a lot of limitations to the prototype, but it hopefully
>> demonstrates the possibility of using something other than the Git protocol
>> to solve these problems.
> 
> In your proposal the TOC bundle itself doesn't need to speak the git
> protocol.

Yes. I see this as a HUGE opportunity for flexibility.

> But as soon as we specify such a thing all of that becomes a part of
> the git protocol at large in any meaningful way, i.e. git.git's client
> and any other client that wants to implement the full protocol at large
> would now need to understand not only pkt-line but also ship a JSON
> decoder etc.

I use JSON as my example because it is easy to use in practically any
language other than C.

The reason to use something like JSON is that it already encodes a way
to include optional, structured information in the list of results. I
focused on using that instead of specifics of a pkt-line protocol.

You have some flexibility in your protocol, but it's unclear how the
optional data would work without careful testing and actually building
a way to store and communicate the optional data.

> I don't see an inherent problem with us wanting to support some nested
> encoding format as part of the protocol, but JSON seems like a
> particularly bad choice. It's specified as UTF-8 only (or rather, "a
> Unicode encoding"), so you can't stick both valid UTF-8 and binary data
> into it.
> 
> Our refs on the other hand don't conform to that, so having a JSON
> format means you can never have something that refers to refnames, which
> given that we're talking about bundles, whose own header already has
> that information.

As I imagine it, we won't want the bundles to store real refnames,
anyway, since we just need pointers into the commit history to start
the incremental fetch of the real refs after the bundles are downloaded.
The prototype I made doesn't even store the bundled refs in "refs/heads"
or "refs/remotes".

>> Let me know if you are interested in switching your approach to something
>> more like what I propose here. There are many more questions about what
>> information could/should be located in the table of contents and how it can
>> be extended in the future. I'm interested to explore that space with you.
> 
> As noted above, the TOC part of this seems interesting, and I don't see
> a reason not to implement that.
> 
> But as noted in [1] I really don't see why it would be a good idea to
> implement a side-format that's encoding a limited subset of what you'd
> find in bundle headers.

You say "bundle headers" a lot as if we are supposed to download and
examine the start of the bundle file before completing the download.
Is that what you mean? Or, do you mean that somehow you are communicating
the bundle header in your list of URIs and I just missed it?

If you do mean that the bundle header _must_ be included in the table
of contents, then I think that is a blocker for scaling this approach,
since the bundle header includes the 'prerequisite' and 'reference'
records, which could be substantial. At scale, info/refs already takes
too much data to want to do often, let alone multiple times in a single
request.

I think the table of contents should provide enough information for the
client to decide if they should initiate the download at all, but be
flexible to multiple strategies for organizing the data.

> Specifically on the meta-information you're proposing:
> 
> == requires
> == timestamp

Part of the point of my proposal was to show how things can work in a
different way, especially with a flexible format like JSON. One possible
way to organize bundles is in a linear list of files that form snapshots
of the history at a given timestamp, storing the new objects since the
previous bundle. A client could store a local timestamp as a heuristic
for whether they need to download a bundle, but if they are missing
reachable objects, then the 'requires' tag gives them a bundle that
could fill in those gaps (they might need to follow a list until closing
all the gaps, depending on many factors).

So, that's the high-level "one could organize bundles like this" plan.
It is _not_ intended as the only way to do it, but I also believe it is
the most efficient. It's also why things like "requires" and "timestamp"
are intended to be optional metadata.
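
A minimal sketch of that organization, under stated assumptions: the
JSON layout below (a "bundles" array whose entries carry "uri",
"timestamp" and "requires") is invented for illustration and is not the
format used by the prototype, and the function names are hypothetical.

```python
import json

# Hypothetical table of contents: a linear list of snapshot bundles.
TOC = json.loads("""
{"bundles": [
  {"uri": "b1.bundle", "timestamp": 100},
  {"uri": "b2.bundle", "timestamp": 200, "requires": "b1.bundle"},
  {"uri": "b3.bundle", "timestamp": 300, "requires": "b2.bundle"}
]}
""")


def newer_than(toc, last_seen):
    """URIs of bundles newer than the timestamp the client stored.

    "timestamp" is optional metadata; an entry without one is treated
    as always worth downloading.
    """
    return [
        b["uri"]
        for b in toc["bundles"]
        if b.get("timestamp", float("inf")) > last_seen
    ]


def fill_gaps(toc, uri, have):
    """Follow 'requires' links from `uri` until reaching a bundle in `have`.

    Returns the missing bundles oldest first, so that prerequisites are
    unbundled before the bundles that depend on them.
    """
    by_uri = {b["uri"]: b for b in toc["bundles"]}
    chain = []
    while uri is not None and uri not in have and uri not in chain:
        chain.append(uri)
        uri = by_uri[uri].get("requires")
    return list(reversed(chain))
```

The timestamp acts only as a heuristic here; `fill_gaps` is the fallback
for the case where the client turns out to be missing reachable objects.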

> == requires
> In your example you've added a monolithic "requires" relationship
> between bundles, saying "This assumes that the bundles can be ordered".
> 
> But that's not something you can assume for actual bundle files,
> i.e. the prerequisite relationship is per-reftip, it's not the case that
> a given bundle requires another bundle, it's the case that tips found in
> them may or may not depend on other prerequisites.

No, not all lists of bundles can be ordered. For instance, if you
use the "here is where the 'main' branch was for each month, minus
the previous month" as one list of bundles and then another "here is
where the 'dev' branch was for each month, minus the 'dev' for the
previous month and the 'main' branch for the current month" then you
get a partial order instead.
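
The 'main'/'dev' example can be written down as a small dependency
graph. This is only a sketch: the bundle names are made up, and using
Python's `graphlib` (3.9+) to pick a valid download order is my choice
for illustration, not something the prototype does.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# bundle -> bundles whose objects it was computed "minus" (hypothetical names)
DEPS = {
    "main-2021-09": set(),
    "main-2021-10": {"main-2021-09"},
    "dev-2021-09": {"main-2021-09"},
    "dev-2021-10": {"dev-2021-09", "main-2021-10"},
}


def download_order(deps):
    """Return one order in which every bundle follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

Note that "main-2021-10" and "dev-2021-09" do not depend on each other,
so more than one order is valid; that is what makes this a partial order
rather than a single linear list.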

I agree that it is important to allow for full generality of cases like
this, but for large repositories, users will probably just want a
snapshot of all the ref tips.

> If you're creating bundles that contain only one tip there's a 1=1
> mapping to what you're proposing with "requires", but otherwise there
> isn't.
> 
> 
> "This allows us to reexamine the table of contents and only download the
> bundles that are newer than that timestamp."
> 
> We're usually going to be fetching these over http(s), why duplicate
> what you can already get if the server just takes care to create unique
> filenames (e.g. as a function of the SHA of their contents), and then
> provides appropriate caching headers to a client so that they'll be
> cached forever?

This assumes that a bundle URI will always be available forever, and that
the table of contents will never shift with any future reorganization.
If the snapshot layout that I specified was always additive, then the URI
would be sufficient (although we would need to keep a full list of every
URI ever downloaded) but also a single timestamp would be sufficient.

The issue arises if the bundle server reorganizes the data somehow, or
worse, the back-end of the bundle server is completely replaced with a
different server that had a different view of the refs at these timestamps.

Now, the 'requires' links provide a way to reconcile missing objects
after downloading only the new bundles, without downloading the full list.

> I think that gives you everything you'd like out of the "timestamp" and
> more, the "more" being that since it's part of a protocol that's already
> standard you'd have e.g. intermediate caching proxies understanding this
> implicitly, in addition to the git client itself.
> 
> So on a network that's say locally unpacking https connections to a
> public CDN you could have a local caching proxy for your N local
> clients, as opposed to a custom "timestamp" value, which only each local
> git client will understand.

I don't understand most of what you are saying here, and perhaps that
is my lack of understanding of possible network services that you are
referencing.

What I'm trying to communicate is that a URI is not enough information
to make a decision about whether or not the Git client should start
downloading that data. Providing clues about the bundle content can
be helpful.

In talking with some others about this, they were thinking about
advertising the ref tips at the bundle boundaries. This would be a
list of "have"s and "have not"s as OIDs that were used to generate
the bundle. However, in my opinion, this only works if you are focused
on snapshots of a single ref instead of a large set of moving refs
(think thousands of refs). In that environment, timestamps are rather
effective so it's nice to have the option.

I'm also not saying that you need to implement an organization such
as the one I'm proposing. I am only strongly recommending that you
build it with enough generality that it is possible.

> == Generally
> 
> Sorry, I've got to run, so I haven't addressed all the things you
> brought up, but generally while I think that the TOC idea is a good one.
> 
> I don't see a reason for why most/all of the other bits shouldn't be
> leaning into either the bundle header (and for any TOC shortcut, dump it
> as-is, as noted in [1]), or in the case of "timestamp" lean into the
> properties of the transport protocol.
> 
> And just generally on overall protocol complexity, wouldn't it be OK if
> any such TOC is just in pkt-line format?

The complexity either lives in the code that parses a well-known format
or the design of a format that is sufficiently general to handle extra
metadata that we could use for the future. Pick your poison.

> We could just provide a git plumbing tool to spew that out, and having
> some static server job call that once and ever more serve up a
> plain-file doesn't seem like a big restriction, and would mean that any
> git client code wouldn't need to deal with another encoding format.

I agree that it would be good to have a Git command that prepares bundles
for publishing on a static webserver, complete with table of contents.
It would be especially good if this incrementally updated the table with
new bundles as time goes on.

Thanks,
-Stolee

* Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
  2021-11-01 21:00     ` Derrick Stolee
@ 2021-11-01 23:18       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-01 23:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jeff King, Patrick Steinhardt,
	Christian Couder, Albert Cui, Jonathan Tan, Jonathan Nieder,
	brian m . carlson, Robin H . Johnson


On Mon, Nov 01 2021, Derrick Stolee wrote:

> On 10/30/2021 3:21 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Fri, Oct 29 2021, Derrick Stolee wrote:
>> 
>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> This implements a new "bundle-uri" protocol v2 extension, which allows
>>>> servers to advertise *.bundle files which clients can pre-seed their
>>>> full "clone"'s or incremental "fetch"'s from.
>>>>
>>>> This is both an alternative to, and complimentary to the existing
>>>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>>>> both, but would generally pick one over the other.
>>>>
>>>> This "bundle-uri" mechanism has the advantage of being dumber, and
>>>> offloads more complexity from the server side to the client
>>>> side.
>>>
>>> Generally, I like that using bundles presents an easier way to serve
>>> static content from an alternative source and then let Git's fetch
>>> negotiation catch up with the remainder.
>>>
>>> However, after inspecting your design and talking to some GitHub
>>> engineers who know more about CDNs and general internet things than I
>>> do, I want to propose an alternative design. I think this new design
>>> is simultaneously more flexible as well as promotes further decoupling
>>> of the origin Git server and the bundle contents.
>>>
>>> Your proposed design extends protocol v2 to let the client request a
>>> list of bundle URIs from the origin server. However, this still requires
>>> the origin server to know about this list. [...]
>> 
>> Interesting, more below...
>> 
>>> Further, your implementation focuses on the server side without
>>> integrating with the client.
>> 
>> Do you mean these 3 patches we're discussing now? Yes, that's the
>> server-side and protocol specification only, because I figured talking
>> about just the spec might be helpful.
>> 
>> But as noted in the CL and previously on-list I have a larger set of
>> patches to implement the client behavior, an old RFC version of that
>> here (I've since changed some things...):
>> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
>> 
>> I mean, you commented on those too, so I'm not sure if that's what you
>> meant, but just for context...
>
> Yeah, I'm not able to keep all of that in my head, and I focused on
> what you presented in this thread.

[...]

>>> I propose that we flip this around. The "bundle server" should know
>>> which bundles are available at which URIs, and the client should contact
>>> the bundle server directly for a "table of contents" that lists these
>>> URIs, along with metadata related to each URI. The origin Git server
>>> then would only need to store the list of bundle servers and the URIs
>>> to their table of contents. The client could then pick from among those
>>> bundle servers (probably by ping time, or randomly) to start the bundle
>>> downloads.
>> 
>> I hadn't considered the server not advertising the list, but pointing to
>> another URI that has the list. I was thinking that the server would be
>> close enough to whatever's generating the list that updating the list
>> there wouldn't be a meaningful limitation for anyone.
>> 
>> But you seem to have a use-case for it, I'd be curious to hear why
>> specifically, but in any case that's easy to support in the client
>> patches I have.
>
> Show me the client patches and then I can determine if I think that
> is sufficiently flexible.

I'm working on that larger re-roll, hoping to have something to submit
by the end of the week. I've snipped out much of the below, since it's
probably better for me to reply to it with working code than with
continued discussion.

> In general, I want to expand the scope of this feature beyond "bundles
> on a CDN" and towards "alternative sources of Git object data" which
> _could_ be a CDN, but could also be geodistributed HTTP servers that
> manage their own copy of the Git data (periodically fetching from the
> origin). These could be self-hosted by organizations with needs for
> low-latency, high-throughput downloads of object data.

*Nod*. FWIW I share those goal[...]

> One thing neither of us have touched is authentication, so we'll want
> to find out how to access private information securely in this model.
> Authentication doesn't matter for CDNs, but would matter for other
> hosting models.

We've got authentication already for packfile-uri in the form of bearer
tokens (there are recent related patches on-list to strip those URLs out
during logging for that reason), and I was planning to handle bundle-uri
the same way, i.e. a server implementation would be responsible for
spewing out a bundle-uri that's authenticated/non-public.

In practice, I think something that just allows generating the
bundle-uri list via a hook would probably be flexible enough for both
this & some of the other use-cases you've had in mind, as long as we
pass the protocol-specific auth info down to it in some way.

>> There's a point at which we get the list of URIs from the server, to
>> support your case the client would just advertise the one TOC URI.
>> 
>> Then similarly to the "packfile-uri" special-case of handling a *.bundle
>> instead of a PACK that I noted in [1], the downloader would just spot
>> "oh this isn't a bundle, but a list of URIs", and then fetch those (even
>> recursively), and eventually get to *.bundle files.
>
> This recursive "follow the contents" approach seems to be a nice
> general approach. I would still want to have some kind of specification
> about what could be seen at these URIs before modifying the protocol v2
> specification.

*nod*, of course.

> [...]
> Yes. I see this as a HUGE opportunity for flexibility.
>
>> But as soon as we specify such a thing all of that becomes a part of
>> the git protocol at large in any meaningful way, i.e. git.git's client
>> and any other client that wants to implement the full protocol at large
>> would now need to understand not only pkt-line but also ship a JSON
>> decoder etc.
>
> I use JSON as my example because it is easy to implement in other
> languages. It's easy to use in practically any other language than C.
>
> The reason to use something like JSON is that it already encodes a way
> to include optional, structured information in the list of results. I
> focused on using that instead of specifics of a pkt-line protocol.
>
> You have some flexibility in your protocol, but it's unclear how the
> optional data would work without careful testing and actually building
> a way to store and communicate the optional data.

Yes, it's definitely not as out-of-the-box extensible as a nested
structure like JSON, but if we leave in space for arbitrary key-values
(which I'm planning) then we should be able to extend it in the future;
a value could even be serialized data of some sort...

>> I don't see an inherent problem with us wanting to support some nested
>> encoding format as part of the protocol, but JSON seems like a
>> particularly bad choice. It's specified as UTF-8 only (or rather, "a
>> Unicode encoding"), so you can't stick both valid UTF-8 and binary data
>> into it.
>> 
>> Our refs on the other hand don't conform to that, so having a JSON
>> format means you can never have something that refers to refnames, which
>> given that we're talking about bundles, whose own header already has
>> that information.
>
> As I imagine it, we won't want the bundles to store real refnames,
> anyway, since we just need pointers into the commit history to start
> the incremental fetch of the real refs after the bundles are downloaded.
> The prototype I made doesn't even store the bundled refs in "refs/heads"
> or "refs/remotes".

*nod*, FWIW I've got off-list patches to make that use-case much easier
with "git bundle", i.e. teaching it to take <oid>\t<refname>\n... on
--stdin, not just <oid>\n..., so you can pick arbitrary names.

>>> Let me know if you are interested in switching your approach to something
>>> more like what I propose here. There are many more questions about what
>>> information could/should be located in the table of contents and how it can
>>> be extended in the future. I'm interested to explore that space with you.
>> 
>> As noted above, the TOC part of this seems interesting, and I don't see
>> a reason not to implement that.
>> 
>> But as noted in [1] I really don't see why it would be a good idea to
>> implement a side-format that's encoding a limited subset of what you'd
>> find in bundle headers.
>
> You say "bundle headers" a lot as if we are supposed to download and
> examine the start of the bundle file before completing the download.
> Is that what you mean? Or, do you mean that somehow you are communicating
> the bundle header in your list of URIs and I just missed it?

It's something I've changed my mind on mid-way through this RFC, but
yes, as described in
https://lore.kernel.org/git/211027.86a6iuxk3x.gmgdl@evledraar.gmail.com/
I was originally thinking of a design like

    client: bundle-uri
    server: https://example.com/bundle2.bdl
    server: https://example.com/bundle1.bdl optional key=values

But am now thinking of/working on something like:

    client: command=bundle-uri
    client: v3
    client: object-format
    client: want=headers <pkt-line-flush>
    server: https://example.com/bundle2.bdl <pkt-line-delim>
    server: # v3 git bundle
    server: @object-format=sha1
    server: e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master <pkt-line-delim>
    server: https://example.com/bundle1.bdl <pkt-line-delim> [...]

I.e. a client requests bundle-uris with N arguments, those communicate
what sort of bundles it's able to understand, and that it would like the
server to give it headers as a shortcut, if it can.

The server replies with a list of URIs to bundles (or TOCs etc.), and
optionally, as a shortcut for the client, includes the headers of those
bundles.

It could also simply skip those, but then the client will need to go and
fetch the headers from the pointed-to resources, or decide it doesn't
need to.
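
For anyone unfamiliar with the framing used in the sketch above: a
pkt-line is a 4-digit hex length (which counts those four bytes too)
followed by the payload, with "0000" as the flush packet and "0001" as
the delimiter packet. The Python below is only an illustration of that
framing applied to the request lines from my example; the helper names
are made up, and the request itself is still a work in progress, so
don't read it as the final wire format.

```python
FLUSH_PKT = b"0000"   # <pkt-line-flush>
DELIM_PKT = b"0001"   # <pkt-line-delim>
MAX_PKT_LEN = 65520   # maximum total length of one pkt-line


def pkt_line(payload: str) -> bytes:
    """Frame one text line as a pkt-line, including the trailing LF."""
    data = payload.encode() + b"\n"
    length = len(data) + 4  # the 4 length digits count towards the length
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return b"%04x" % length + data


def bundle_uri_request() -> bytes:
    """Encode the (work-in-progress) request lines from the sketch above."""
    lines = ["command=bundle-uri", "v3", "object-format", "want=headers"]
    return b"".join(pkt_line(line) for line in lines) + FLUSH_PKT
```

A TOC served as a plain file in this format could be read by the same
code path that already parses protocol v2 responses.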

> If you do mean that the bundle header _must_ be included in the table
> of contents, then I think that is a blocker for scaling this approach,
> since the bundle header includes the 'prerequisite' and 'reference'
> records, which could be substantial. At scale, info/refs already takes
> too much data to want to do often, let alone multiple times in a single
> request.

We'll see when I submit the updated working patches for this, and in
coming up with testcases, but I think that you can attain a sweet spot
in most repositories by advertising some of your common/recent tips, as
opposed to a big dump of all the refs.

> [...]
> Part of the point of my proposal was to show how things can work in a
> different way, especially with a flexible format like JSON. One possible
> way to organize bundles is in a linear list of files that form snapshots
> of the history at a given timestamp, storing the new objects since the
> previous bundle. A client could store a local timestamp as a heuristic
> for whether they need to download a bundle, but if they are missing
> reachable objects, then the 'requires' tag gives them a bundle that
> could fill in those gaps (they might need to follow a list until closing
> all the gaps, depending on many factors).
>
> So, that's the high-level "one could organize bundles like this" plan.
> It is _not_ intended as the only way to do it, but I also believe it is
> the most efficient. It's also why things like "requires" and "timestamp"
> are intended to be optional metadata.

[more below]

> [...]
>> If you're creating bundles that contain only one tip there's a 1=1
>> mapping to what you're proposing with "requires", but otherwise there
>> isn't.
>> 
>> 
>> "This allows us to reexamine the table of contents and only download the
>> bundles that are newer than that timestamp."
>> 
>> We're usually going to be fetching these over http(s), why duplicate
>> what you can already get if the server just takes care to create unique
>> filenames (e.g. as a function of the SHA of their contents), and then
>> provides appropriate caching headers to a client so that they'll be
>> cached forever?
>
> This assumes that a bundle URI will always be available forever, and that
> the table of contents will never shift with any future reorganization.
> If the snapshot layout that I specified was always additive, then the URI
> would be sufficient (although we would need to keep a full list of every
> URI ever downloaded) but also a single timestamp would be sufficient.
>
> The issue arises if the bundle server reorganizes the data somehow, or
> worse, the back-end of the bundle server is completely replaced with a
> different server that had a different view of the refs at these timestamps.
>
> Now, the 'requires' links provide a way to reconcile missing objects
> after downloading only the new bundles, without downloading the full list.

[also on this]

>> I think that gives you everything you'd like out of the "timestamp" and
>> more, the "more" being that since it's part of a protocol that's already
>> standard you'd have e.g. intermediate caching proxies understanding this
>> implicitly, in addition to the git client itself.
>> 
>> So on a network that's say locally unpacking https connections to a
>> public CDN you could have a local caching proxy for your N local
>> clients, as opposed to a custom "timestamp" value, which only each local
>> git client will understand.
>
> I don't understand most of what you are saying here, and perhaps that
> is my lack of understanding of possible network services that you are
> referencing.
>
> What I'm trying to communicate is that a URI is not enough information
> to make a decision about whether or not the Git client should start
> downloading that data. Providing clues about the bundle content can
> be helpful.
>
> In talking with some others about this, they were thinking about
> advertising the ref tips at the bundle boundaries. This would be a
> list of "have"s and "have not"s as OIDs that were used to generate
> the bundle. However, in my opinion, this only works if you are focused
> on snapshots of a single ref instead of a large set of moving refs
> (think thousands of refs). In that environment, timestamps are rather
> effective so it's nice to have the option.
>
> I'm also not saying that you need to implement an organization such
> as the one I'm proposing. I am only strongly recommending that you
> build it with enough generality that it is possible.

On the "This assumes that a bundle URI will always be available forever"
& generally on piggy-backing on the HTTP protocol. I mean that you can
just advertise a:

    https://example.com/big-bundle.bdl

I.e. a URL that never changes, and if you serve it up with appropriate
caching headers a client may or may not request it, e.g. a "clone"
probably will every time, but we MAY also just ignore it based on some
other client heuristic.

But then let's say you serve up (opaque SHA1s of the content) with
appropriate Cache-Control headers[1][2]:

    https://example.com/18719eddecbdf01d6c4166402d62e178482d83d4.bdl
    https://example.com/9cfaf0ef69c3c3024ff5fe92ba84bf7f6caefa2a.bdl

Now a client only needs to grab those once, and if the server operator
has set Cache-Control appropriately we'll only need to request the
header for each resource once, e.g. for a use-case of having a "big
bundle" we update monthly, and some weekly updates etc.
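
As a rough sketch of that naming scheme: the helper names below and the
one-year max-age are made up for illustration, and it assumes coreutils'
sha1sum is available. It is not something the patches ship.

```shell
#!/bin/sh
# Hypothetical helper: name a bundle after the SHA-1 of its bytes, so
# the URL changes whenever the content does, and never otherwise.
# $1: base URL (an assumption); the bundle content is read from stdin.
bundle_url_for () {
	sum=$(sha1sum | cut -d' ' -f1) || return 1
	printf '%s/%s.bdl\n' "$1" "$sum"
}

# Header a server operator might attach to such a URL, see [1] and [2].
# "immutable" tells caches they never need to revalidate the resource.
bundle_cache_header () {
	printf 'Cache-Control: public, max-age=31536000, immutable\n'
}

# Demo: empty input hashes to the well-known SHA-1 of the empty string.
printf '' | bundle_url_for https://example.com
```

Since the name is a pure function of the content, a client (or a proxy
in front of it) that has seen a given URL once can treat it as done.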

One reason I highly prefer this sort of approach is that it works well
out of the box with other software.

Want to make your local git clients faster? Just pipe those URLs into
wget, and if you've got a squid/varnish or other http cache sitting in
front of your clients doing that will pre-seed your local cache.

I may also turn out to be wrong, but early tests with
pipelining/streaming the headers are very promising (see e.g. [3] for
the API). I.e. for the common case of N bundles on the same CDN you can
pipeline them all & stream them in parallel, and if you don't like what
you see in some of them you don't need to continue to download the PACK
payload itself.
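
To make the "if you don't like what you see" part concrete, here is one
possible client heuristic. It is only an assumption on my part about
what a reasonable check looks like, not what the patches implement:
continue the download only if we hold every prerequisite listed in the
header and lack at least one of its tips. The function name and the
"one OID per line" have-file are hypothetical.

```shell
#!/bin/sh
# $1: file holding a bundle header (everything up to the blank line)
# $2: file listing the OIDs we already have, one per line
# Returns 0 if the rest of the bundle looks worth downloading.
bundle_worth_downloading () {
	missing_tip=
	while IFS= read -r line
	do
		case "$line" in
		'') break ;;              # blank line: PACK data follows
		'#'*|@*) continue ;;      # signature or v3 capability line
		-*)
			# Prerequisite: useless to us unless we have it.
			oid=${line#-}
			oid=${oid%% *}
			grep -qxF -e "$oid" "$2" || return 1
			;;
		*)
			# Tip: interesting only if we lack it.
			grep -qxF -e "${line%% *}" "$2" || missing_tip=t
			;;
		esac
	done <"$1"
	test -n "$missing_tip"
}
```

A real client would ask the object store rather than grep a file, but
the decision itself only needs the header, which is the point.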

1. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
2. https://bitsup.blogspot.com/2016/05/cache-control-immutable.html
3. https://curl.se/libcurl/c/http2-download.html

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
  2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
                   ` (3 preceding siblings ...)
  2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
@ 2022-03-11 16:24 ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
                     ` (14 more replies)
  4 siblings, 15 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Per recent discussion[1] this is my not-quite-feature-complete version
of the bundle-uri capability.  This was sent to the list in some form
before in [2] and [3].

Recently Derrick Stolee has sent an alternate implementation of some
of the same ideas in [4]. Per [1] we're planning to work together on
getting a version of this into git that makes everyone happy, sending
what I've got here is the first step in that.

A high-level summary of the important differences in my approach &
Derrick's (which I hope I'm summarizing fairly here) is that his
approach optionally adds a bundle TOC format, that format allows you
to define topology relationships between bundles to guide a
(returning) client in what it needs to fetch.

Whereas the idea in this series is to lean entirely on the client
downloading bundles & inferring what needs to be done via the
tip/prereqs listed in the header format of the (existing, not changed
here) bundle format.

Both have pros & cons, I started trying to summarize those, but let's
leave that for later.

There's also some high-level "journey" differences in the
two. E.g. Derrick implemented the ability to have "git bundle" itself
fetch bundles remotely, I don't have that and instead lean entirely on
the protocol and "fetch". Those differences really aren't important,
and we can have our cake & eat it too on that front. I.e. end up with
some sensible intersection (or union) of the tooling.

I ran out of time to finish up some of what I had on this topic this
week, but figured (especially since I'd promised to get it done this
week) that I'd send what I have now for discussion.

Things missing & reader's notes:

 * I had some amendments to the protocol I meant to distill further
   into the protocol docs at [5]. Basically omitting the ability to
   transmit key-values and to have it just be a list of URIs with an
   optional <header> for each one, which is purely a server-to-client
   aid (i.e. those headers will be what you'll find in the pointed-to
   bundles).

* This series goes up to "clone", but I also have incremental fetch
  patches. I ran into an (easily solvable) bug in that & thought it
  was best to omit it for now. It'll be here soon.

  Basically for incremental fetch we'll in parallel get the headers
  for remote bundles, and then do an early abort of those downloads
  that we deem that we don't need.

  Clever (but all standard & boring) use of HTTP caching headers
  between client & servers then allows the client to not request the
  same thing again and again. I.e. want less server load on your CDN?
  Just have the bundles be unique URLs and set appropriate caching
  headers.

* 13/13 demonstrates some of those ideas at a high level.

* A problem with this implementation (and Derrick's, I believe) is
  that it keeps a server waiting while we twiddle our thumbs and wget
  (or equivalent) the bundle(s) locally. If you e.g. clone
  "chromium.git" the server will get tired of waiting, drop the
  connection, and unless the bundle is 100% up-to-date the "clone"
  will fail.

  The solution to this is to get the bundle headers in parallel, and
  as soon as we've got them present the OIDs in the headers as "HAVE"
  to the server, which'll then send us an incremental PACK (and even
  possibly a packfile-uri) for the difference between those bundle(s)
  and what its tips are.

  We can then simply disconnect, download/index-pack everything, and
  do a connectivity check at the end.

  This requires some careful error handling, e.g. if the resulting
  repo isn't valid we'll need to retry, and the current negotiation
  API isn't very happy to negotiate on the basis of OIDs we don't have
  in the object store (but the protocol handles it just fine).

  So while I have local patches for (some of) that I opted to leave it
  out for now. Once we have incremental fetch it's *mostly* an
  optimization anyway (except in cases of e.g. chromium.git), so for
  the initial protocol etc. discussion it's probably best to leave it
  out.

 * This currently fails CI on Windows & 32 bit Linux for the most
   trivial of reasons, need to adjust printf formats to use PRIuMAX
   etc., but I ran out of time, sorry.
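
For the "get the bundle headers in parallel and present their OIDs as
HAVE" idea above, the header-to-have step is small. A minimal sketch,
assuming the existing bundle header layout (a "# v2 git bundle" or
"# v3 git bundle" signature, optional "@" capability lines in v3,
"-<oid>" prerequisites, "<oid> <refname>" tips, then a blank line before
the PACK). The function name is hypothetical and the OIDs in the demo
are shortened placeholders:

```shell
#!/bin/sh
# Read a bundle header on stdin and print one "have" line per tip.
# Prerequisites are skipped: they are what the bundle needs, not what
# unpacking it gives us.
bundle_header_to_haves () {
	while IFS= read -r line
	do
		case "$line" in
		'') break ;;                  # end of header, PACK follows
		'#'*|@*|-*) continue ;;       # signature, capability, prereq
		*) printf 'have %s\n' "${line%% *}" ;;
		esac
	done
}

# Demo with a made-up header.
printf '# v2 git bundle\n-aaaa base\nbbbb refs/heads/main\n\n' |
bundle_header_to_haves
```

The caveat from the list above still applies: the negotiation code has
to be taught to offer OIDs that are not in the object store yet.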

1. https://lore.kernel.org/git/ddebc223-1e13-e758-f9b1-d3f23961e459@github.com/
2. https://lore.kernel.org/git/patch-3.3-64224ec2cba-20211025T211159Z-avarab@gmail.com/
3. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
4. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com/
5. https://lore.kernel.org/git/211027.86ilxixoxz.gmgdl@evledraar.gmail.com/

Ævar Arnfjörð Bjarmason (13):
  protocol v2: add server-side "bundle-uri" skeleton
  bundle-uri docs: add design notes
  bundle-uri client: add "bundle-uri" parsing + tests
  connect.c: refactor sending of agent & object-format
  bundle-uri client: add minimal NOOP client
  bundle-uri client: add "git ls-remote-bundle-uri"
  bundle-uri client: add transfer.injectBundleURI support
  bundle-uri client: add boolean transfer.bundleURI setting
  fetch-pack: add a deref_without_lazy_fetch_extended()
  fetch-pack: move --keep=* option filling to a function
  bundle.h: make "fd" version of read_bundle_header() public
  bundle-uri client: support for bundle-uri with "clone"
  bundle-uri: make the download program configurable

 Documentation/config/transfer.txt          |  33 ++
 Documentation/git-ls-remote-bundle-uri.txt |  62 +++
 Documentation/git-ls-remote.txt            |   1 +
 Documentation/technical/bundle-uri.txt     | 119 ++++++
 Documentation/technical/protocol-v2.txt    | 214 +++++++++++
 Makefile                                   |   3 +
 builtin.h                                  |   1 +
 builtin/clone.c                            |   7 +
 builtin/ls-remote-bundle-uri.c             |  90 +++++
 bundle-uri.c                               | 183 +++++++++
 bundle-uri.h                               |  29 ++
 bundle.c                                   |   8 +-
 bundle.h                                   |   2 +
 command-list.txt                           |   1 +
 connect.c                                  |  80 +++-
 fetch-pack.c                               | 306 ++++++++++++++-
 fetch-pack.h                               |   6 +
 git.c                                      |   1 +
 remote.h                                   |   4 +
 serve.c                                    |   6 +
 t/helper/test-bundle-uri.c                 |  83 ++++
 t/helper/test-tool.c                       |   1 +
 t/helper/test-tool.h                       |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 424 +++++++++++++++++++++
 t/t5701-git-serve.sh                       | 124 +++++-
 t/t5730-protocol-v2-bundle-uri-file.sh     |  36 ++
 t/t5731-protocol-v2-bundle-uri-git.sh      |  17 +
 t/t5732-protocol-v2-bundle-uri-http.sh     |  17 +
 t/t5750-bundle-uri-parse.sh                | 153 ++++++++
 transport-helper.c                         |  13 +
 transport-internal.h                       |   7 +
 transport.c                                | 120 ++++++
 transport.h                                |  22 ++
 33 files changed, 2141 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh
 create mode 100755 t/t5750-bundle-uri-parse.sh

Range-diff against v1:
 2:  3ac0539c053 !  1:  2fc87ce092b protocol v2: specify static seeding of clone/fetch via "bundle-uri"
    @@ Metadata
     Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## Commit message ##
    -    protocol v2: specify static seeding of clone/fetch via "bundle-uri"
    +    protocol v2: add server-side "bundle-uri" skeleton
     
    -    Add a server-side implementation of a new "bundle-uri" command to
    -    protocol v2. As discussed in the updated "protocol-v2.txt" this will
    -    allow conforming clients to optionally seed their initial clones or
    -    incremental fetches from URLs containing "*.bundle" files created with
    -    "git bundle create".
    +    Add a skeleton server-side implementation of a new "bundle-uri"
    +    command to protocol v2. This will allow conforming clients to
    +    optionally seed their initial clones or incremental fetches from URLs
    +    containing "*.bundle" files created with "git bundle create".
     
         The use-cases are similar to those of the existing "Packfile URIs",
         and the two feature can be combined within a single request, but
         "bundle-uri" has a few advantages over packfile-uris in some some
         common scenarios, discussed below.
     
    -    This change does not give us a working "bundle-uri" client. I have
    -    those patches as a follow-up, but let's first establish what the
    -    protocol for this should be like first. The client implementation will
    -    then implement this specification.
    +    This change does not give us a working "bundle-uri" client, subsequent
    +    commits will do that. Let's first establish what the protocol for this
    +    should be like first. The client implementation will then implement
    +    this specification.
     
         With this change when the uploadpack.bundleURI config is set to a
         URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then
    @@ Commit message
         .gitmodules check in that context. See [6] for the "ls-refs unborn"
         feature which modified code in similar areas of the request flow.
     
    +    Finally, there's currently a concurrent (submitted after the v1 of
    +    this commit, but before the subsequent client parts of this
    +    implementation) RFC of a somewhat similar "bundle-uri" facility at
    +    [7].
    +
         1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
         2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
         3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
    @@ Commit message
            Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
            2021-03-01)
         6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)
    +    7. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com/
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    @@ bundle-uri.h (new)
     @@
     +#ifndef BUNDLE_URI_H
     +#define BUNDLE_URI_H
    -+
    -+struct repository;
    -+struct packet_reader;
    -+struct packet_writer;
    ++#include "repository.h"
    ++#include "pkt-line.h"
    ++#include "strbuf.h"
     +
     +/**
     + * API used by serve.[ch].
 -:  ----------- >  2:  84c4036a510 bundle-uri docs: add design notes
 3:  64224ec2cba !  3:  3abfb2290fd bundle-uri client: add "bundle-uri" parsing + tests
    @@ bundle-uri.c: int bundle_uri_command(struct repository *r,
     
      ## bundle-uri.h ##
     @@
    - struct repository;
    - struct packet_reader;
    - struct packet_writer;
    -+struct string_list;
    + #include "repository.h"
    + #include "pkt-line.h"
    + #include "strbuf.h"
    ++#include "string-list.h"
      
      /**
       * API used by serve.[ch].
    -@@ bundle-uri.h: struct packet_writer;
    +@@
      int bundle_uri_advertise(struct repository *r, struct strbuf *value);
      int bundle_uri_command(struct repository *r, struct packet_reader *request);
      
 1:  7639b9bbac5 !  4:  f64aefa9ece leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
    @@ Metadata
     Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## Commit message ##
    -    leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
    +    connect.c: refactor sending of agent & object-format
     
    -    The "t5701-git-serve.sh" test passes when run under a git compiled
    -    with SANITIZE=leak, let's mark it as such to add it to the
    -    "linux-leaks" CI job.
    +    Refactor the sending of the "agent" and "object-format" capabilities
    +    into a function.
    +
    +    This was added in its current form in ab67235bc4 (connect: parse v2
    +    refs with correct hash algorithm, 2020-05-25). When we connect to a v2
    +    server we need to know about its object-format, and it needs to know
    +    about ours. Since most things in connect.c and transport.c piggy-back
    +    on the eager getting of remote refs via the handshake() those commands
    +    can make use of the just-sent-over object-format by ls-refs.
    +
    +    But I'm about to add a command that may come after ls-refs, and may
    +    not, but we need the server to know about our user-agent and
    +    object-format. So let's split this into a function.
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    - ## t/t5701-git-serve.sh ##
    -@@ t/t5701-git-serve.sh: test_description='test protocol v2 server commands'
    - GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
    - export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
    + ## connect.c ##
    +@@ connect.c: void check_stateless_delimiter(int stateless_rpc,
    + 		die("%s", error);
    + }
    + 
    ++static void send_capabilities(int fd_out, struct packet_reader *reader)
    ++{
    ++	const char *hash_name;
    ++
    ++	if (server_supports_v2("agent", 0))
    ++		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
    ++
    ++	if (server_feature_v2("object-format", &hash_name)) {
    ++		int hash_algo = hash_algo_by_name(hash_name);
    ++		if (hash_algo == GIT_HASH_UNKNOWN)
    ++			die(_("unknown object format '%s' specified by server"), hash_name);
    ++		reader->hash_algo = &hash_algos[hash_algo];
    ++		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
    ++	} else {
    ++		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
    ++	}
    ++}
    ++
    + struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 			     struct ref **list, int for_push,
    + 			     struct transport_ls_refs_options *transport_options,
    +@@ connect.c: struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 			     int stateless_rpc)
    + {
    + 	int i;
    +-	const char *hash_name;
    + 	struct strvec *ref_prefixes = transport_options ?
    + 		&transport_options->ref_prefixes : NULL;
    + 	const char **unborn_head_target = transport_options ?
    +@@ connect.c: struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
    + 	if (server_supports_v2("ls-refs", 1))
    + 		packet_write_fmt(fd_out, "command=ls-refs\n");
      
    -+TEST_PASSES_SANITIZE_LEAK=true
    - . ./test-lib.sh
    +-	if (server_supports_v2("agent", 0))
    +-		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
    +-
    +-	if (server_feature_v2("object-format", &hash_name)) {
    +-		int hash_algo = hash_algo_by_name(hash_name);
    +-		if (hash_algo == GIT_HASH_UNKNOWN)
    +-			die(_("unknown object format '%s' specified by server"), hash_name);
    +-		reader->hash_algo = &hash_algos[hash_algo];
    +-		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
    +-	} else {
    +-		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
    +-	}
    ++	/* Send capabilities */
    ++	send_capabilities(fd_out, reader);
      
    - test_expect_success 'test capability advertisement' '
    + 	if (server_options && server_options->nr &&
    + 	    server_supports_v2("server-option", 1))
 -:  ----------- >  5:  105ced66409 bundle-uri client: add minimal NOOP client
 -:  ----------- >  6:  617a6b16df8 bundle-uri client: add "git ls-remote-bundle-uri"
 -:  ----------- >  7:  b0ce379528e bundle-uri client: add transfer.injectBundleURI support
 -:  ----------- >  8:  44d96a0f5f8 bundle-uri client: add boolean transfer.bundleURI setting
 -:  ----------- >  9:  d9f5b486511 fetch-pack: add a deref_without_lazy_fetch_extended()
 -:  ----------- > 10:  31a22eb3bd4 fetch-pack: move --keep=* option filling to a function
 -:  ----------- > 11:  5ade9419454 bundle.h: make "fd" version of read_bundle_header() public
 -:  ----------- > 12:  bb0d681f5f0 bundle-uri client: support for bundle-uri with "clone"
 -:  ----------- > 13:  40f37c8b9d5 bundle-uri: make the download program configurable
-- 
2.35.1.1337.g7e32d794afe


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 02/13] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a skeleton server-side implementation of a new "bundle-uri"
command to protocol v2. This will allow conforming clients to
optionally seed their initial clones or incremental fetches from URLs
containing "*.bundle" files created with "git bundle create".

The use-cases are similar to those of the existing "Packfile URIs",
and the two features can be combined within a single request, but
"bundle-uri" has a few advantages over packfile-uris in some
common scenarios, discussed below.

This change does not give us a working "bundle-uri" client, subsequent
commits will do that. Let's first establish what the protocol for this
should be like. The client implementation will then implement
this specification.

With this change when the uploadpack.bundleURI config is set to a
URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then
when the client requests "bundle-uri" emit those URIs back at them.

Differences between this and the existing packfile-uri facility:

 A. There is no "real" support for packfile-uri in git.git. The
    uploadpack.blobPackfileUri setting allows carving out a list of
    blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc:
    clarify contents of packfile sent as URI, 2021-01-20) the only
    "real" implementation is JGit based.

 B. The uploadpack.blobPackfileUri is a MUST where this is a
    "CAN". I.e. once a client says they support packfile-uri of a given
    list of protocols the server will send them a PACK response
    assuming they've downloaded the URI the client was sent, if the
    client doesn't do that they don't have a valid repository.

    Pointing at a bundle and having the client send us "have"
    lines (or not, maybe they couldn't fetch it, or decided they
    didn't want to) is more flexible, and can gracefully recover
    e.g. if the CDN isn't reachable (maybe you do support "https", but
    the CDN provider is down, or blocked your whole country).

 C. The client, after executing "ls-refs" will disconnect if it has
    also grabbed the "bundle-uris" and knows the server won't send it
    anything it doesn't already have (or expect to have, if it's
    downloading the bundles concurrent to an early disconnect).

    This is in (small) contrast to packfile-uri where a client would
    enter a negotiation dialog, which may or may not result in a
    packfile-uri and/or an inline PACK.

 D. Because of "C" clients can, if the bundles are up-to-date, get an
    up-to-date repository with just "bundle-uri" and "ls-refs" commands,
    with no need to enter a dialog with "git upload-pack".

    That small dialog is unlikely to matter for performance purposes,
    this section is just noting differences between "bundle-uri" and
    "packfile-uri".

As noted above the features are compatible, a client that supports
"bundle-uri" and "packfile-uri" might download a bundle, and then
proceed with a "fetch" dialog, that dialog might then result in
"packfile-uri" response.

In practice server operators are unlikely to want to mix the two,
since the main benefit of either approach is the ability to offload
large "clone" responses to CDNs. A server operator would have little
reason not to go with one approach or the other.

There was a suggestion of implementing a similar feature long ago[1]
by Jeff King. The main difference between it and this approach is that
we've since gained protocol v2, so we can add this as an optional path
in the dialog between client and server. The 2011 implementation
hooked into the transport mechanism to try to clone from a bundle
directly. See also [2] and [3] for some later mentions of that
approach.

See also [4] for the series that implemented
uploadpack.blobPackfileUri, and [5] for a series on top that did the
.gitmodules check in that context. See [6] for the "ls-refs unborn"
feature which modified code in similar areas of the request flow.

Finally, there's currently a concurrent (submitted after the v1 of
this commit, but before the subsequent client parts of this
implementation) RFC of a somewhat similar "bundle-uri" facility at
[7].

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/
   Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25)
5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/
   Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
   2021-03-01)
6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)
7. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 Makefile                                |   1 +
 bundle-uri.c                            |  55 +++++++
 bundle-uri.h                            |  13 ++
 serve.c                                 |   6 +
 t/t5701-git-serve.sh                    | 124 +++++++++++++-
 6 files changed, 407 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 8a877d27e23..3ea96add398 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -566,3 +566,212 @@ and associated requested information, each separated by a single space.
 	attr = "size"
 
 	obj-info = obj-id SP obj-size
+
+bundle-uri
+~~~~~~~~~~
+
+If the 'bundle-uri' capability is advertised, the server supports the
+`bundle-uri` command.
+
+The capability is currently advertised with no value (i.e. not
+"bundle-uri=somevalue"), a value may be added in the future for
+supporting command-wide extensions. Clients MUST ignore any unknown
+capability values and proceed with the `bundle-uri` dialog they
+support.
+
+The 'bundle-uri' command is intended to be issued before `fetch` to
+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
+inform the subsequent `fetch` command.
+
+The client CAN issue `bundle-uri` before or after any other valid
+command. To be useful to clients it's expected that it'll be issued
+after an `ls-refs` and before `fetch`, but CAN be issued at any time
+in the dialog.
+
+DISCUSSION of bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The intent of the feature is to optimize for server resource consumption
+in the common case by changing the common case of fetching a very
+large PACK during linkgit:git-clone[1] into a smaller incremental
+fetch.
+
+It also allows servers to achieve better caching in combination with
+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
+
+New clones or fetches thereby become a more predictable and common
+negotiation against the tips of recently produced *.bundle file(s).
+Servers might even pre-generate the results of such negotiations for
+the `uploadpack.packObjectsHook` as new pushes come in.
+
+I.e. the server would anticipate that fresh clones will download a
+known bundle, followed by catching up to the current state of the
+repository using ref tips found in that bundle (or bundles).
+
+PROTOCOL for bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A `bundle-uri` request takes no arguments, and as noted above does not
+currently advertise a capability value. Both may be added in the
+future.
+
+When the client issues a `command=bundle-uri` the response is a list
+of URIs the server would like the client to fetch out-of-band before
+proceeding with the `fetch` request in this format:
+
+	output = bundle-uri-line
+		 bundle-uri-line* flush-pkt
+
+	bundle-uri-line = PKT-LINE(bundle-uri)
+			  *(SP bundle-feature-key *(=bundle-feature-val))
+			  LF
+
+	bundle-uri = A URI such as a https://, ssh:// etc. URI
+
+	bundle-feature-key = Any printable ASCII characters except SP or "="
+	bundle-feature-val = Any printable ASCII characters except SP or "="
+
+No `bundle-feature-key`=`bundle-feature-val` fields are currently
+defined. See the discussion of features below.
+
+Clients are still expected to fully parse the line according to the
+above format, lines that do not conform to the format SHOULD be
+discarded. The user MAY be warned in such a case.
+
+bundle-uri CLIENT AND SERVER EXPECTATIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+".bundle" FORMAT
+++++++++++++++++
+
+The advertised bundle(s) MUST be in a format that "git bundle verify"
+would accept. I.e. they MUST contain one or more reference tips for
+use by the client, MUST indicate prerequisites (if any) with standard
+"-" prefixes, and MUST indicate their "object-format", if
+applicable. Create "*.bundle" files with "git bundle create".
+
+bundle-uri CLIENT ERROR RECOVERY
+++++++++++++++++++++++++++++++++
+
+A client MUST above all gracefully degrade on errors, whether that
+error is because of bad/missing data in the bundle URI(s), because
+that client is too dumb to e.g. understand and fully parse out bundle
+headers and their prerequisite relationships, or something else.
+
+Server operators should feel confident in turning on "bundle-uri",
+without worrying that clones or fetches will run into hard failures
+if e.g. their CDN goes down. Even if the server's bundle(s) are
+incomplete or bad in some way, the client should still end up with a
+functioning repository, just as if it had chosen not to use this
+protocol extension.
+
+All subsequent discussion on client and server interaction MUST keep
+this in mind.
+
+bundle-uri SERVER TO CLIENT
++++++++++++++++++++++++++++
+
+The ordering of the returned bundle uris is not significant. Clients
+MUST parse their headers to discover their contained OIDs and
+prerequisites. A client MUST consider the content of the bundle(s)
+themselves and their header as the ultimate source of truth.
+
+A server MAY even return bundle(s) that don't have any direct
+relationship to the repository being cloned (either through accident,
+or intentional "clever" configuration), and expect a client to sort
+out what data they'd like from the bundle(s), if any.
+
+bundle-uri CLIENT TO SERVER
++++++++++++++++++++++++++++
+
+The client SHOULD provide reference tips found in the bundle header(s)
+as 'have' lines in any subsequent `fetch` request. A client MAY also
+ignore the bundle(s) entirely if doing so is deemed worse for some
+reason, e.g. if the bundles can't be downloaded, it doesn't like the
+tips it finds etc.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If, after issuing `bundle-uri` and `ls-refs` and getting the header(s)
+of the bundle(s), the client finds that the ref tips it wants can be
+retrieved entirely from advertised bundle(s), it MAY disconnect. The
+results of such a 'clone' or 'fetch' should be indistinguishable from
+the state attained without using bundle-uri.
+
+EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
+++++++++++++++++++++++++++++++++++++++++++++++
+
+A client MAY perform an early disconnect while still downloading the
+bundle(s) (having streamed and parsed their headers). In such a case
+the client MUST gracefully recover from any errors related to
+finishing the download and validation of the bundle(s).
+
+I.e. a client might need to re-connect and issue a 'fetch' command,
+and possibly fall back to not making use of 'bundle-uri' at all.
+
+This "MAY" behavior is specified as such (and not a "SHOULD") on the
+assumption that a server advertising bundle uris is more likely than
+not to be serving up a relatively large repository, and to be pointing
+to URIs that have a good chance of being in working order. A client
+MAY e.g. look at the payload size of the bundles as a heuristic to see
+if an early disconnect is worth it, should falling back on a full
+"fetch" dialog be necessary.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+A client SHOULD commence a negotiation of a PACK from the server via
+the "fetch" command using the OID tips found in advertised bundles,
+even if it's still in the process of downloading those bundle(s).
+
+This allows for aggressive early disconnects from any interactive
+server dialog. The client blindly trusts that the advertised OID tips
+are relevant, and issues them as 'have' lines. It then requests any
+tips it would like (usually from the "ls-refs" advertisement) via
+'want' lines. The server will then compute a (hopefully small) PACK
+with the expected difference between the tips from the bundle(s) and
+the data requested.
+
+The only connection the client then needs to keep active is to the
+concurrently downloading static bundle(s). When those and the
+incremental PACK are retrieved they should be inflated and
+validated. Any errors at this point should be gracefully recovered
+from, see above.
+
+bundle-uri PROTOCOL FEATURES
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above no `bundle-feature-key`=`bundle-feature-val` fields
+are currently defined.
+
+They are intended for future per-URI metadata which older clients MUST
+ignore and gracefully degrade on. Any fields they do recognize they
+MAY also ignore.
+
+Any backwards-incompatible addition of per-URI key-values will be
+guarded by a new value or values in the 'bundle-uri' capability
+advertisement itself, and/or by new future `bundle-uri` request
+arguments.
+
+While no per-URI key-values are currently supported, they're
+intended to support future features such as:
+
+ * Add a "hash=<val>" or "size=<bytes>" to advertise the expected hash or
+   size of the bundle file.
+
+ * Advertise that one or more bundle files are the same (to e.g. have
+   clients round-robin or otherwise choose one of N possible files).
+
+ * A "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For
+   expressing the common case of a bundle with one tip and no
+   prerequisites, or one tip and one prerequisite.
++
+This would allow for optimizing the common case of servers who'd like
+to provide one "big bundle" containing only their "main" branch,
+and/or incremental updates thereof.
++
+A client receiving such a response MAY assume that they can skip
+retrieving the header from a bundle at the indicated URI, and thus
+save themselves and the server(s) the request(s) needed to inspect the
+headers of that bundle or bundles.
diff --git a/Makefile b/Makefile
index 6f0b4b775fe..5a3d35109a1 100644
--- a/Makefile
+++ b/Makefile
@@ -855,6 +855,7 @@ LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..ff054ddc690
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,55 @@
+#include "cache.h"
+#include "bundle-uri.h"
+#include "pkt-line.h"
+#include "config.h"
+
+static void send_bundle_uris(struct packet_writer *writer,
+			     struct string_list *uris)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, uris)
+		packet_writer_write(writer, "%s", item->string);
+}
+
+static int advertise_bundle_uri = -1;
+static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "uploadpack.bundleuri")) {
+		advertise_bundle_uri = 1;
+		string_list_append(&bundle_uris, value);
+	}
+
+	return 0;
+}
+
+int bundle_uri_advertise(struct repository *r, struct strbuf *value)
+{
+	if (advertise_bundle_uri != -1)
+		goto cached;
+
+	git_config(bundle_uri_config, NULL);
+	advertise_bundle_uri = !!bundle_uris.nr;
+
+cached:
+	return advertise_bundle_uri;
+}
+
+int bundle_uri_command(struct repository *r,
+		       struct packet_reader *request)
+{
+	struct packet_writer writer;
+	packet_writer_init(&writer, 1);
+
+	while (packet_reader_read(request) == PACKET_READ_NORMAL)
+		die(_("bundle-uri: unexpected argument: '%s'"), request->line);
+	if (request->status != PACKET_READ_FLUSH)
+		die(_("bundle-uri: expected flush after arguments"));
+
+	send_bundle_uris(&writer, &bundle_uris);
+
+	packet_writer_flush(&writer);
+
+	return 0;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..5a7e556a0ba
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,13 @@
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+#include "repository.h"
+#include "pkt-line.h"
+#include "strbuf.h"
+
+/**
+ * API used by serve.[ch].
+ */
+int bundle_uri_advertise(struct repository *r, struct strbuf *value);
+int bundle_uri_command(struct repository *r, struct packet_reader *request);
+
+#endif /* BUNDLE_URI_H */
diff --git a/serve.c b/serve.c
index b3fe9b5126a..f3e0203d2c6 100644
--- a/serve.c
+++ b/serve.c
@@ -8,6 +8,7 @@
 #include "protocol-caps.h"
 #include "serve.h"
 #include "upload-pack.h"
+#include "bundle-uri.h"
 
 static int advertise_sid = -1;
 static int client_hash_algo = GIT_HASH_SHA1;
@@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "bundle-uri",
+		.advertise = bundle_uri_advertise,
+		.command = bundle_uri_command,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 1896f671cb3..9d053f77a93 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -13,7 +13,7 @@ test_expect_success 'test capability advertisement' '
 	wrong_algo sha1:sha256
 	wrong_algo sha256:sha1
 	EOF
-	cat >expect <<-EOF &&
+	cat >expect.base <<-EOF &&
 	version 2
 	agent=git/$(git version | cut -d" " -f3)
 	ls-refs=unborn
@@ -21,8 +21,11 @@ test_expect_success 'test capability advertisement' '
 	server-option
 	object-format=$(test_oid algo)
 	object-info
+	EOF
+	cat >expect.trailer <<-EOF &&
 	0000
 	EOF
+	cat expect.base expect.trailer >expect &&
 
 	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
 		--advertise-capabilities >out &&
@@ -342,4 +345,123 @@ test_expect_success 'basics of object-info' '
 	test_cmp expect actual
 '
 
+# Test the basics of bundle-uri
+#
+test_expect_success 'test capability advertisement with uploadpack.bundleURI' '
+	test_config uploadpack.bundleURI FAKE &&
+
+	cat >expect.extra <<-EOF &&
+	bundle-uri
+	EOF
+	cat expect.base \
+	    expect.extra \
+	    expect.trailer >expect &&
+
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: dies if not enabled' '
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	cat >expect <<-\EOF &&
+	ERR serve: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with two URIs' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+	test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	https://cdn.example.com/recent.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: unknown future feature(s)' '
+	test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0001
+	some-feature
+	we-do-not
+	know=about
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
 test_done
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 02/13] bundle-uri docs: add design notes
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a design doc for the bundle-uri protocol extension to go along
with the packfile-uri extension added in cd8402e0fd8 (Documentation:
add Packfile URIs design doc, 2020-06-10).

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/bundle-uri.txt  | 119 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt |   5 +
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..5ae9a15eafe
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,119 @@
+Bundle URI Design Notes
+=======================
+
+Protocol
+--------
+
+See `bundle-uri` in the link:protocol-v2.html[protocol-v2]
+documentation for a discussion of the bundle-uri command, and the
+expectations of clients and servers.
+
+This document is a more general discussion of how the `bundle-uri`
+command fits in with the rest of the git ecosystem, its design goals
+and non-goals, comparison to alternatives etc.
+
+Comparison with Packfile URIs
+-----------------------------
+
+There is a similar "Packfile URIs" facility, see the
+link:packfile-uri.html[packfile-uri] documentation for details.
+
+The Packfile URIs facility requires a much closer cooperation between
+CDN and server than the bundle URI facility.
+
+I.e. the server MUST know what objects exist in the packfile URI it's
+pointing to, as well as its pack checksum. Failure to do so will not
+only result in a client error (the packfile hash won't match), but
+even if it got past that would likely result in a corrupt repository
+with tips pointing to unreachable objects.
+
+By comparison the bundle URIs are meant to be a "dumb" solution
+friendly to e.g. having a weekly cronjob take a snapshot of a git
+repository, that snapshot being uploaded to a network of FTP mirrors
+(which may be inconsistent or out of date).
+
+The server does not need to know what state the side-channel download
+is at, because the client will first validate it, and then optionally
+negotiate with the server using what it discovers there.
+
+Using the local `transfer.injectBundleURI` configuration variable (see
+linkgit:git-config[1]) the `bundle-uri` mechanism doesn't even need
+the server to support it.
+
+Security
+--------
+
+The omission of something equivalent to the packfile <OID> in the
+Packfile URIs protocol is intentional, as having it would require
+closer server and CDN cooperation than some server operators are
+comfortable with.
+
+Furthermore, it is not needed for security. The server doesn't need to
+trust its CDN. If the CDN were to attempt to send harmful content
+to the client, the result would not validate against the server's
+provided ref tips gotten from ls-refs.
+
+The lack of such a hash does leave room for a malicious CDN
+operator to be annoying however. E.g. they could inject irrelevant
+objects into the bundles, which would enlarge the downloaded
+repository until a "gc" would eventually throw them away.
+
+In practice the lack of a hash is considered to be a non-issue. Anyone
+concerned about such security problems between their server and their
+CDN is going to be pointing to a "https" URL under their control. For
+a client the "threat" is the same as without bundle-uri, i.e. a server
+is free to be annoying today and send you garbage in the PACK that you
+won't need.
+
+Security issues peculiar to bundle-uri
+--------------------------------------
+
+Both packfile-uri and bundle-uri use the `fetch.uriProtocols`
+configuration variable (see linkgit:git-config[1]) to configure which
+protocols they support.
+
+By default this is set to "http,https" for both, but bundle-uri
+supports adding "file" to that list. The server can thus point to
+"file://" URIs it expects the client to have access to.
+
+This is primarily intended for use with the `transfer.injectBundleURI`
+mechanism, but can also be useful e.g. in a centralized environment
+where a server knows a "file:///mnt/bundles/big-repo.bdl" to be
+mounted on the local machine (e.g. a racked server), and
+points to it in its "bundle-uri" response.
+
+The client can then add "file" to the `fetch.uriProtocols` list to
+obey such responses. That does mean that a malicious server can point
+to any arbitrary file on the local machine. The threat of this is
+considered minimal, since anyone adding `file` to `fetch.uriProtocols`
+likely knows what they're doing and controls both ends, and the worst
+they can do is make a curl(1) pipe garbage into "index-pack" (which
+will likely promptly die on the non-PACK-file).
+
+Security comparison with packfile-uri
+-------------------------------------
+
+The initial implementation of packfile-uri needed special adjusting to
+run "git fsck" on incoming .gitmodules files. This was to deal with a
+general security issue in git, see CVE-2018-17456.
+
+The current packfile-uri mechanism requires special handling around
+"fsck" to do such cross-PACK fsck's. This is because it first indexes
+the "incremental" PACK, and then any PACK(s) provided via
+packfile-uri, before finally doing a full connectivity check.
+
+This is in effect doing the fsck one might do via "clone" and "fetch" in
+reverse, or the equivalent of starting with the incremental "fetch",
+followed by the "clone".
+
+Since the packfile-uri mechanism can result in the .gitmodules blob
+referenced by such a "fetch" to be in the pack for the "clone" the
+fetch-pack process needs to keep state between the indexing of
+multiple packs, to remember to fsck the blob (via the "clone") later
+after seeing it in a tree (from the "fetch").
+
+There are no known security issues with the way packfile-uri does
+this, but since bundle-uri effectively emulates what a client which doesn't
+support either "bundle-uri" or "packfile-uri" would do on clone/fetch,
+any future security issues peculiar to the packfile-uri approach are
+unlikely to be shared by it.
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 3ea96add398..3a51492049f 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -775,3 +775,8 @@ A client receiving such a response MAY assume that they can skip
 retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
+
+bundle-uri SEE ALSO
+^^^^^^^^^^^^^^^^^^^
+
+See the link:bundle-uri.html[Bundle URI Design Notes] for more.
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 02/13] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a "test-tool bundle-uri parse" which parses the format defined in
the newly specified "bundle-uri" command.

As noted in the "bundle-uri" section in protocol-v2.txt we haven't
specified any key-values yet, just URI lines, but we should parse
their format for conformity with the spec.

We need to make sure our future client doesn't die if this optional
data is ever provided by the server, and that we've covered all the
edge cases with these key-values in our specification. Let's add and
test a bundle_uri_parse_line() to do that.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                    |   1 +
 bundle-uri.c                | 124 +++++++++++++++++++++++++++++
 bundle-uri.h                |  16 ++++
 t/helper/test-bundle-uri.c  |  83 +++++++++++++++++++
 t/helper/test-tool.c        |   1 +
 t/helper/test-tool.h        |   1 +
 t/t5750-bundle-uri-parse.sh | 153 ++++++++++++++++++++++++++++++++++++
 7 files changed, 379 insertions(+)
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100755 t/t5750-bundle-uri-parse.sh

diff --git a/Makefile b/Makefile
index 5a3d35109a1..4fec8e5af09 100644
--- a/Makefile
+++ b/Makefile
@@ -696,6 +696,7 @@ PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 TEST_BUILTINS_OBJS += test-advise.o
 TEST_BUILTINS_OBJS += test-bitmap.o
 TEST_BUILTINS_OBJS += test-bloom.o
+TEST_BUILTINS_OBJS += test-bundle-uri.o
 TEST_BUILTINS_OBJS += test-chmtime.o
 TEST_BUILTINS_OBJS += test-config.o
 TEST_BUILTINS_OBJS += test-crontab.o
diff --git a/bundle-uri.c b/bundle-uri.c
index ff054ddc690..9827fc5da17 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -53,3 +53,127 @@ int bundle_uri_command(struct repository *r,
 
 	return 0;
 }
+
+/**
+ * General API for {transport,connect}.c etc.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line)
+{
+	size_t i;
+	struct string_list columns = STRING_LIST_INIT_DUP;
+	const char *uri;
+	struct string_list *uri_columns = NULL;
+	int ret = 0;
+
+	if (!strlen(line))
+		return error(_("bundle-uri: got an empty line"));
+
+	/*
+	 * Right now we don't understand anything beyond the first SP,
+	 * but let's be tolerant and ignore any future unknown
+	 * fields. See the "MUST" note about "bundle-feature-key" in
+	 * Documentation/technical/protocol-v2.txt
+	 */
+	if (string_list_split(&columns, line, ' ', -1) < 1)
+		return error(_("bundle-uri: line not in SP-delimited format: %s"), line);
+
+	/*
+	 * We represent a "<uri>[ <key-values>...]" line with the URI
+	 * being the .string in a string list, and the .util being an
+	 * optional string list of key (.string) and values
+	 * (.util). If the top-level .util is NULL there's no
+	 * key-value pairs....
+	 */
+	uri = columns.items[0].string;
+	if (!strlen(uri)) {
+		ret = error(_("bundle-uri: got an empty URI component"));
+		goto cleanup;
+	}
+
+	/*
+	 * ... we're going to need that non-NULL .util .
+	 */
+	if (columns.nr > 1) {
+		uri_columns = xcalloc(1, sizeof(struct string_list));
+		string_list_init_dup(uri_columns);
+	}
+
+	/*
+	 * Let's parse the optional "kv" format, even if we don't
+	 * understand any of the keys or values yet.
+	 */
+	for (i = 1; i < columns.nr; i++) {
+		struct string_list kv = STRING_LIST_INIT_DUP;
+		const char *arg = columns.items[i].string;
+		int fields = string_list_split(&kv, arg, '=', 2);
+		int err = 0;
+
+		switch (fields) {
+		case 0:
+			BUG("should have no fields=0");
+		case 1:
+			if (!strlen(arg)) {
+				err = error("bundle-uri: column %"PRIuMAX": got an empty attribute (full line was '%s')",
+					    (uintmax_t)i, line);
+				break;
+			}
+			/*
+			 * We could dance around with
+			 * string_list_append_nodup() and skip
+			 * string_list_clear(&kv, 0) here, but let's
+			 * keep it simple.
+			 */
+			string_list_append(uri_columns, arg);
+			break;
+		case 2:
+		{
+			const char *k = kv.items[0].string;
+			const char *v = kv.items[1].string;
+
+			string_list_append(uri_columns, k)->util = xstrdup(v);
+			break;
+		}
+		default:
+			err = error("bundle-uri: column %"PRIuMAX": '%s' more than one '=' character (full line was '%s')",
+				    (uintmax_t)i, arg, line);
+			break;
+		}
+
+		string_list_clear(&kv, 0);
+		if (err) {
+			ret = err;
+			break;
+		}
+	}
+
+
+	/*
+	 * Per the spec we'll only consider bundle-uri lines OK if
+	 * there were no parsing problems, even if the problems were
+	 * with attributes whose content we don't understand.
+	 */
+	if (ret && uri_columns) {
+		string_list_clear(uri_columns, 1);
+		free(uri_columns);
+	} else if (!ret) {
+		string_list_append(bundle_uri, uri)->util = uri_columns;
+	}
+
+cleanup:
+	string_list_clear(&columns, 0);
+	return ret;
+}
+
+static void bundle_uri_string_list_clear_cb(void *util, const char *string)
+{
+	struct string_list *fields = util;
+	if (!fields)
+		return;
+	string_list_clear(fields, 1);
+	free(fields);
+}
+
+void bundle_uri_string_list_clear(struct string_list *bundle_uri)
+{
+	string_list_clear_func(bundle_uri, bundle_uri_string_list_clear_cb);
+}
diff --git a/bundle-uri.h b/bundle-uri.h
index 5a7e556a0ba..be6d1df97ff 100644
--- a/bundle-uri.h
+++ b/bundle-uri.h
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "pkt-line.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 /**
  * API used by serve.[ch].
@@ -10,4 +11,19 @@
 int bundle_uri_advertise(struct repository *r, struct strbuf *value);
 int bundle_uri_command(struct repository *r, struct packet_reader *request);
 
+/**
+ * General API for {transport,connect}.c etc.
+ */
+
+/**
+ * bundle_uri_parse_line() returns 0 when a valid bundle-uri has been
+ * added to `bundle_uri`, <0 on error.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line);
+
+/**
+ * Clear the `bundle_uri` list. Just a very thin wrapper on
+ * string_list_clear_func().
+ */
+void bundle_uri_string_list_clear(struct string_list *bundle_uri);
 #endif /* BUNDLE_URI_H */
diff --git a/t/helper/test-bundle-uri.c b/t/helper/test-bundle-uri.c
new file mode 100644
index 00000000000..805a86c0130
--- /dev/null
+++ b/t/helper/test-bundle-uri.c
@@ -0,0 +1,83 @@
+#include "test-tool.h"
+#include "parse-options.h"
+#include "bundle-uri.h"
+#include "strbuf.h"
+#include "string-list.h"
+
+static int cmd__bundle_uri_parse(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri parse <in",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+	struct strbuf sb = STRBUF_INIT;
+	struct string_list list = STRING_LIST_INIT_DUP;
+	int err = 0;
+	struct string_list_item *item;
+	size_t line_nr = 0;
+
+	argc = parse_options(argc, argv, NULL, options, usage, 0);
+	if (argc)
+		goto usage;
+
+	while (strbuf_getline(&sb, stdin) != EOF) {
+		line_nr++;
+		if (bundle_uri_parse_line(&list, sb.buf) < 0)
+			err = error("bad line: '%s'", sb.buf);
+	}
+
+	for_each_string_list_item(item, &list) {
+		struct string_list_item *kv_item;
+		struct string_list *kv = item->util;
+
+		fprintf(stdout, "%s", item->string);
+		if (!kv) {
+			fprintf(stdout, "\n");
+			continue;
+		}
+		for_each_string_list_item(kv_item, kv) {
+			const char *k = kv_item->string;
+			const char *v = kv_item->util;
+
+			if (v)
+				fprintf(stdout, " [kv: %s => %s]", k, v);
+			else
+				fprintf(stdout, " [attr: %s]", k);
+		}
+		fprintf(stdout, "\n");
+	}
+	strbuf_release(&sb);
+
+	bundle_uri_string_list_clear(&list);
+
+	return err < 0 ? 1 : 0;
+usage:
+	usage_with_options(usage, options);
+}
+
+int cmd__bundle_uri(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri <subcommand> [<options>]",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL, options, usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION |
+			     PARSE_OPT_KEEP_ARGV0);
+	if (argc == 1)
+		goto usage;
+
+	if (!strcmp(argv[1], "parse"))
+		return cmd__bundle_uri_parse(argc - 1, argv + 1);
+	error("there is no test-tool bundle-uri tool '%s'", argv[1]);
+
+usage:
+	usage_with_options(usage, options);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index e6ec69cf326..dc73e68f329 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -17,6 +17,7 @@ static struct test_cmd cmds[] = {
 	{ "advise", cmd__advise_if_enabled },
 	{ "bitmap", cmd__bitmap },
 	{ "bloom", cmd__bloom },
+	{ "bundle-uri", cmd__bundle_uri },
 	{ "chmtime", cmd__chmtime },
 	{ "config", cmd__config },
 	{ "crontab", cmd__crontab },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 20756eefdda..927b6b418cd 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -7,6 +7,7 @@
 int cmd__advise_if_enabled(int argc, const char **argv);
 int cmd__bitmap(int argc, const char **argv);
 int cmd__bloom(int argc, const char **argv);
+int cmd__bundle_uri(int argc, const char **argv);
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
 int cmd__crontab(int argc, const char **argv);
diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
new file mode 100755
index 00000000000..70fd1b398e9
--- /dev/null
+++ b/t/t5750-bundle-uri-parse.sh
@@ -0,0 +1,153 @@
+#!/bin/sh
+
+test_description="Test bundle-uri bundle_uri_parse_line()"
+
+TEST_NO_CREATE_REPO=1
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'bundle_uri_parse_line() just URIs' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle.bdl
+	https://example.com/bundle.bdl
+	file:///usr/share/git/bundle.bdl
+	EOF
+
+	# For the simple case
+	cp in expect &&
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl attr
+	http://example.com/bundle2.bdl ibute
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: attr]
+	http://example.com/bundle2.bdl [attr: ibute]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes and key-value attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl x a=b y c=d z e=f a=b
+	EOF
+
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: x] [kv: a => b] [attr: y] [kv: c => d] [attr: z] [kv: e => f] [kv: a => b]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: extra SP' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl one-space
+	http://example.com/bundle2.bdl  two-space
+	http://example.com/bundle3.bdl   three-space
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle2.bdl  two-space'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl  two-space'\''
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle3.bdl   three-space'\'')
+	error: bad line: '\''http://example.com/bundle3.bdl   three-space'\''
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: one-space]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty lines' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+
+	http://example.com/bundle2.bdl a=b
+
+	http://example.com/bundle3.bdl
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle2.bdl [kv: a => b]
+	http://example.com/bundle3.bdl
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty URIs' '
+	sed "s/> //" >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+	>  a=b
+	http://example.com/bundle3.bdl a=b
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty URI component
+	error: bad line: '\'' a=b'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle3.bdl [kv: a => b]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: multiple = in key-values' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl k=v=extra
+	http://example.com/bundle2.bdl a=b k=v=extra c=d
+	http://example.com/bundle3.bdl kv=ok
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle1.bdl k=v=extra'\'')
+	error: bad line: '\''http://example.com/bundle1.bdl k=v=extra'\''
+	error: bundle-uri: column 2: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle3.bdl [kv: kv => ok]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (2 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Refactor the sending of the "agent" and "object-format" capabilities
into a function.

This was added in its current form in ab67235bc4 (connect: parse v2
refs with correct hash algorithm, 2020-05-25). When we connect to a v2
server we need to know about its object-format, and it needs to know
about ours. Since most things in connect.c and transport.c piggy-back
on the eager getting of remote refs via the handshake() those commands
can make use of the just-sent-over object-format by ls-refs.

But I'm about to add a command that may or may not come after
ls-refs, and in either case the server needs to know our user-agent
and object-format. So let's split this into a function.
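
To illustrate the object-format half of that logic in isolation, here
is a minimal standalone sketch. The enum and both function names are
hypothetical stand-ins, not git's hash_algos[] table or
hash_algo_by_name(); the only behavior taken from the patch is "fall
back to SHA-1 when the capability is absent, give up on an unknown
name".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for git's hash algorithm identifiers. */
enum sketch_hash {
	SKETCH_HASH_UNKNOWN,
	SKETCH_HASH_SHA1,
	SKETCH_HASH_SHA256
};

/* Map an advertised object-format name to an identifier. */
static enum sketch_hash sketch_hash_by_name(const char *name)
{
	if (!strcmp(name, "sha1"))
		return SKETCH_HASH_SHA1;
	if (!strcmp(name, "sha256"))
		return SKETCH_HASH_SHA256;
	return SKETCH_HASH_UNKNOWN;
}

/*
 * "advertised" is the value of the server's "object-format"
 * capability, or NULL if the server did not advertise one.
 * SKETCH_HASH_UNKNOWN tells the caller to give up (the patch
 * die()s at that point).
 */
static enum sketch_hash sketch_pick_hash(const char *advertised)
{
	if (!advertised)
		return SKETCH_HASH_SHA1;
	return sketch_hash_by_name(advertised);
}
```

A caller would then echo the chosen name back to the server as
"object-format=<name>", which is what send_capabilities() does with
packet_write_fmt().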

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 connect.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/connect.c b/connect.c
index afc79a6236e..e6d0b1d34bd 100644
--- a/connect.c
+++ b/connect.c
@@ -473,6 +473,24 @@ void check_stateless_delimiter(int stateless_rpc,
 		die("%s", error);
 }
 
+static void send_capabilities(int fd_out, struct packet_reader *reader)
+{
+	const char *hash_name;
+
+	if (server_supports_v2("agent", 0))
+		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
+
+	if (server_feature_v2("object-format", &hash_name)) {
+		int hash_algo = hash_algo_by_name(hash_name);
+		if (hash_algo == GIT_HASH_UNKNOWN)
+			die(_("unknown object format '%s' specified by server"), hash_name);
+		reader->hash_algo = &hash_algos[hash_algo];
+		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
+	} else {
+		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	}
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
@@ -480,7 +498,6 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     int stateless_rpc)
 {
 	int i;
-	const char *hash_name;
 	struct strvec *ref_prefixes = transport_options ?
 		&transport_options->ref_prefixes : NULL;
 	const char **unborn_head_target = transport_options ?
@@ -490,18 +507,8 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 	if (server_supports_v2("ls-refs", 1))
 		packet_write_fmt(fd_out, "command=ls-refs\n");
 
-	if (server_supports_v2("agent", 0))
-		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
-
-	if (server_feature_v2("object-format", &hash_name)) {
-		int hash_algo = hash_algo_by_name(hash_name);
-		if (hash_algo == GIT_HASH_UNKNOWN)
-			die(_("unknown object format '%s' specified by server"), hash_name);
-		reader->hash_algo = &hash_algos[hash_algo];
-		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
-	} else {
-		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
-	}
+	/* Send capabilities */
+	send_capabilities(fd_out, reader);
 
 	if (server_options && server_options->nr &&
 	    server_supports_v2("server-option", 1))
-- 
2.35.1.1337.g7e32d794afe


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (3 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Set up all the needed client parts of the "bundle-uri" protocol
extension, without actually doing anything with the bundle URIs.

I.e. if the server says it supports "bundle-uri" we'll issue a
command=bundle-uri after command=ls-refs when we're cloning. We'll
parse the returned output using the code already tested for in
t5750-bundle-uri-parse.sh.
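
As a rough illustration of what that request looks like on the wire,
the sketch below frames a bare command=bundle-uri request in pkt-line
format: four hex digits of total length, then the payload, followed by
a delimiter packet (0001) and a flush packet (0000). The function names
are hypothetical, and the "agent" and "object-format" capability lines
that the real client also sends are left out to keep it short.

```c
#include <stdio.h>
#include <string.h>

/*
 * Frame "payload" as one pkt-line: four hex digits giving the total
 * length (payload plus the 4-byte header), then the payload itself.
 * Returns the number of bytes written, or -1 if it does not fit.
 */
static int sketch_pkt_line(char *out, size_t outsz, const char *payload)
{
	size_t len = strlen(payload) + 4;

	/* Four hex digits cannot express more than 0xffff. */
	if (len > 0xffff || len + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04zx%s", len, payload);
	return (int)len;
}

/*
 * Build a capability-less request: the command line, a delimiter
 * packet, no arguments, and a flush packet. The result is
 * NUL-terminated; the return value excludes the NUL.
 */
static int sketch_bundle_uri_request(char *out, size_t outsz)
{
	int n = sketch_pkt_line(out, outsz, "command=bundle-uri\n");

	if (n < 0 || (size_t)n + 8 + 1 > outsz)
		return -1;
	memcpy(out + n, "00010000", 9);	/* delim + flush + NUL */
	return n + 8;
}
```

The server's reply is then read one pkt-line at a time until a flush
packet, which is the loop get_remote_bundle_uri() implements with
packet_reader_read().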

What we aren't doing is actually acting on that data, i.e. downloading
the bundle(s) before we get to doing the command=fetch, and adjusting
our negotiation dialog appropriately. I'll do that in subsequent
commits.

There's a question of what level of encapsulation we should use here.
I've opted to use connect.h in clone.c, but we could also e.g. make
transport_get_remote_refs() invoke this, i.e. make it implicitly get
the bundle-uri list for later steps.

This approach means that we don't "support" this in "git fetch" for
now. I'm starting with the case of initial clones, although as noted
in preceding commits to the protocol documentation nothing about this
approach precludes getting bundles on incremental fetches.

For t5732-protocol-v2-bundle-uri-http.sh it's not easy to set
environment variables for git-upload-pack (it's started by Apache), so
let's skip the "bad client" test under T5730_HTTP, and add unused
T5730_{FILE,GIT} prerequisites for consistency and future use.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/clone.c                        |   7 ++
 bundle-uri.c                           |   4 +
 connect.c                              |  47 +++++++++
 remote.h                               |   4 +
 t/lib-t5730-protocol-v2-bundle-uri.sh  | 141 +++++++++++++++++++++++++
 t/t5730-protocol-v2-bundle-uri-file.sh |  36 +++++++
 t/t5731-protocol-v2-bundle-uri-git.sh  |  17 +++
 t/t5732-protocol-v2-bundle-uri-http.sh |  17 +++
 transport-helper.c                     |  13 +++
 transport-internal.h                   |   7 ++
 transport.c                            |  48 +++++++++
 transport.h                            |  18 ++++
 12 files changed, 359 insertions(+)
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh

diff --git a/builtin/clone.c b/builtin/clone.c
index a572cda5030..b2c1b4142ee 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -27,6 +27,7 @@
 #include "iterator.h"
 #include "sigchain.h"
 #include "branch.h"
+#include "connect.h"
 #include "remote.h"
 #include "run-command.h"
 #include "connected.h"
@@ -1220,6 +1221,12 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (refs)
 		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
+	/*
+	 * Populate transport->got_remote_bundle_uri and
+	 * transport->bundle_uri. We might get nothing.
+	 */
+	transport_get_remote_bundle_uri(transport);
+
 	if (mapped_refs) {
 		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 
diff --git a/bundle-uri.c b/bundle-uri.c
index 9827fc5da17..c503ed51ca8 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -26,6 +26,10 @@ static int bundle_uri_config(const char *var, const char *value, void *data)
 
 int bundle_uri_advertise(struct repository *r, struct strbuf *value)
 {
+	if (value &&
+	    git_env_bool("GIT_TEST_BUNDLE_URI_UNKNOWN_CAPABILITY_VALUE", 0))
+		strbuf_addstr(value, "test-unknown-capability-value");
+
 	if (advertise_bundle_uri != -1)
 		goto cached;
 
diff --git a/connect.c b/connect.c
index e6d0b1d34bd..a8fdb5255f7 100644
--- a/connect.c
+++ b/connect.c
@@ -15,6 +15,7 @@
 #include "version.h"
 #include "protocol.h"
 #include "alias.h"
+#include "bundle-uri.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -491,6 +492,52 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	}
 }
 
+int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
+			  struct string_list *bundle_uri, int stateless_rpc)
+{
+	int line_nr = 1;
+
+	/* Assert bundle-uri support */
+	server_supports_v2("bundle-uri", 1);
+
+	/* (Re-)send capabilities */
+	send_capabilities(fd_out, reader);
+
+	/* Send command */
+	packet_write_fmt(fd_out, "command=bundle-uri\n");
+	packet_delim(fd_out);
+
+	/* Send options */
+	if (git_env_bool("GIT_TEST_PROTOCOL_BAD_BUNDLE_URI", 0))
+		packet_write_fmt(fd_out, "test-bad-client\n");
+	packet_flush(fd_out);
+
+	/* Process response from server */
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		const char *line = reader->line;
+		line_nr++;
+
+		if (!bundle_uri_parse_line(bundle_uri, line))
+			continue;
+
+		return error(_("error on bundle-uri response line %d: %s"),
+			     line_nr, line);
+	}
+
+	if (reader->status != PACKET_READ_FLUSH)
+		return error(_("expected flush after bundle-uri listing"));
+
+	/*
+	 * Might die(), but obscure enough that that's OK, e.g. in
+	 * serve.c we'll call BUG() on its equivalent (the
+	 * PACKET_READ_RESPONSE_END check).
+	 */
+	check_stateless_delimiter(stateless_rpc, reader,
+				  _("expected response end packet after bundle-uri listing"));
+
+	return 0;
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
diff --git a/remote.h b/remote.h
index 4a1209ae2c8..ca820b60948 100644
--- a/remote.h
+++ b/remote.h
@@ -236,6 +236,10 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     const struct string_list *server_options,
 			     int stateless_rpc);
 
+/* Used for protocol v2 in order to retrieve bundle URIs from a remote */
+int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
+			  struct string_list *bundle_uri, int stateless_rpc);
+
 int resolve_remote_symref(struct ref *ref, struct ref *list);
 
 /*
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
new file mode 100644
index 00000000000..724acecda66
--- /dev/null
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -0,0 +1,141 @@
+# Included from t573*-protocol-v2-bundle-uri-*.sh
+
+T5730_PARENT=
+T5730_URI=
+T5730_BUNDLE_URI=
+case "$T5730_PROTOCOL" in
+file)
+	T5730_PARENT=file_parent
+	T5730_URI="file://$PWD/file_parent"
+	T5730_BUNDLE_URI="$T5730_URI/fake.bdl"
+	test_set_prereq T5730_FILE
+	;;
+git)
+	. "$TEST_DIRECTORY"/lib-git-daemon.sh
+	start_git_daemon --export-all --enable=receive-pack
+	T5730_PARENT="$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
+	T5730_URI="$GIT_DAEMON_URL/parent"
+	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	test_set_prereq T5730_GIT
+	;;
+http)
+	. "$TEST_DIRECTORY"/lib-httpd.sh
+	start_httpd
+	T5730_PARENT="$HTTPD_DOCUMENT_ROOT_PATH/http_parent"
+	T5730_URI="$HTTPD_URL/smart/http_parent"
+	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	test_set_prereq T5730_HTTP
+	;;
+*)
+	BUG "Need to pass valid T5730_PROTOCOL (was $T5730_PROTOCOL)"
+	;;
+esac
+
+test_expect_success "setup protocol v2 $T5730_PROTOCOL:// tests" '
+	git init "$T5730_PARENT" &&
+	test_commit -C "$T5730_PARENT" one
+'
+
+# Poor man's URI escaping. Good enough for the test suite whose trash
+# directory has a space in it. See 93c3fcbe4d4 (git-svn: attempt to
+# mimic SVN 1.7 URL canonicalization, 2012-07-28) for prior art.
+test_uri_escape() {
+	sed 's/ /%20/g'
+}
+
+case "$T5730_PROTOCOL" in
+http)
+	test_expect_success "setup config for $T5730_PROTOCOL:// tests" '
+		git -C "$T5730_PARENT" config http.receivepack true
+	'
+	;;
+*)
+	;;
+esac
+T5730_BUNDLE_URI_ESCAPED=$(echo "$T5730_BUNDLE_URI" | test_uri_escape)
+
+test_expect_success "connect with $T5730_PROTOCOL:// using protocol v2: no bundle-uri" '
+	test_when_finished "rm -f log" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	! grep bundle-uri log
+'
+
+test_expect_success "connect with $T5730_PROTOCOL:// using protocol v2: have bundle-uri" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" \
+		uploadpack.bundleURI "$T5730_BUNDLE_URI_ESCAPED" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	# Server advertised bundle-uri capability
+	grep bundle-uri log
+'
+
+test_expect_success !T5730_HTTP "bad client with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >err.expect <<-\EOF &&
+	Cloning into '"'"'child'"'"'...
+	EOF
+	case "$T5730_PROTOCOL" in
+	file)
+		cat >fatal-bundle-uri.expect <<-\EOF
+		fatal: bundle-uri: unexpected argument: '"'"'test-bad-client'"'"'
+		EOF
+		;;
+	*)
+		cat >fatal.expect <<-\EOF
+		fatal: read error: Connection reset by peer
+		EOF
+		;;
+	esac &&
+
+	test_when_finished "rm -rf child" &&
+	test_must_fail ok=sigpipe env \
+		GIT_TRACE_PACKET="$PWD/log" \
+		GIT_TEST_PROTOCOL_BAD_BUNDLE_URI=true \
+		git -c protocol.version=2 \
+		clone "$T5730_URI" child \
+		>out 2>err &&
+	test_must_be_empty out &&
+
+	grep -v -e ^fatal: -e ^error: err >err.actual &&
+	test_cmp err.expect err.actual &&
+
+	case "$T5730_PROTOCOL" in
+	file)
+		# Due to general race conditions with client/server replies we
+		# may or may not get "fatal: the remote end hung up
+		# unexpectedly" here
+		grep "^fatal: bundle-uri:" err >fatal-bundle-uri.actual &&
+		test_cmp fatal-bundle-uri.expect fatal-bundle-uri.actual
+		;;
+	*)
+		grep "^fatal:" err >fatal.actual &&
+		test_cmp fatal.expect fatal.actual
+		;;
+	esac &&
+
+	grep "clone> test-bad-client$" log >sent-bad-request &&
+	test_file_not_empty sent-bad-request
+'
diff --git a/t/t5730-protocol-v2-bundle-uri-file.sh b/t/t5730-protocol-v2-bundle-uri-file.sh
new file mode 100755
index 00000000000..89203d3a23c
--- /dev/null
+++ b/t/t5730-protocol-v2-bundle-uri-file.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'file://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'file://' transport
+#
+T5730_PROTOCOL=file
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_expect_success "unknown capability value with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" \
+		uploadpack.bundleURI "$T5730_BUNDLE_URI_ESCAPED" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	GIT_TEST_BUNDLE_URI_UNKNOWN_CAPABILITY_VALUE=true \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	grep "> bundle-uri=test-unknown-capability-value" log
+'
+
+test_done
diff --git a/t/t5731-protocol-v2-bundle-uri-git.sh b/t/t5731-protocol-v2-bundle-uri-git.sh
new file mode 100755
index 00000000000..282847b311f
--- /dev/null
+++ b/t/t5731-protocol-v2-bundle-uri-git.sh
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'git://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'git://' transport
+#
+T5730_PROTOCOL=git
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_done
diff --git a/t/t5732-protocol-v2-bundle-uri-http.sh b/t/t5732-protocol-v2-bundle-uri-http.sh
new file mode 100755
index 00000000000..fcc1cf3faef
--- /dev/null
+++ b/t/t5732-protocol-v2-bundle-uri-http.sh
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'http://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'http://' transport
+#
+T5730_PROTOCOL=http
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_done
diff --git a/transport-helper.c b/transport-helper.c
index a0297b0986c..4ddf2a4be2a 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1264,9 +1264,22 @@ static struct ref *get_refs_list_using_list(struct transport *transport,
 	return ret;
 }
 
+static int get_bundle_uri(struct transport *transport)
+{
+	get_helper(transport);
+
+	if (process_connect(transport, 0)) {
+		do_take_over(transport);
+		return transport->vtable->get_bundle_uri(transport);
+	}
+
+	return -1;
+}
+
 static struct transport_vtable vtable = {
 	.set_option	= set_helper_option,
 	.get_refs_list	= get_refs_list,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs,
 	.push_refs	= push_refs,
 	.connect	= connect_helper,
diff --git a/transport-internal.h b/transport-internal.h
index c4ca0b733ac..90ea749e5cf 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -26,6 +26,13 @@ struct transport_vtable {
 	struct ref *(*get_refs_list)(struct transport *transport, int for_push,
 				     struct transport_ls_refs_options *transport_options);
 
+	/**
+	 * Populates the remote side's bundle-uri under protocol v2,
+	 * if the "bundle-uri" capability was advertised. Returns 0 if
+	 * OK, negative values on error.
+	 */
+	int (*get_bundle_uri)(struct transport *transport);
+
 	/**
 	 * Fetch the objects for the given refs. Note that this gets
 	 * an array, and should ignore the list structure.
diff --git a/transport.c b/transport.c
index 253d6671b1f..7c9a371ed7f 100644
--- a/transport.c
+++ b/transport.c
@@ -22,6 +22,7 @@
 #include "protocol.h"
 #include "object-store.h"
 #include "color.h"
+#include "bundle-uri.h"
 
 static int transport_use_color = -1;
 static char transport_colors[][COLOR_MAXLEN] = {
@@ -349,6 +350,21 @@ static struct ref *get_refs_via_connect(struct transport *transport, int for_pus
 	return handshake(transport, for_push, options, 1);
 }
 
+static int get_bundle_uri(struct transport *transport)
+{
+	struct git_transport_data *data = transport->data;
+	struct packet_reader reader;
+	int stateless_rpc = transport->stateless_rpc;
+	string_list_init_dup(&transport->bundle_uri);
+
+	packet_reader_init(&reader, data->fd[0], NULL, 0,
+			   PACKET_READ_CHOMP_NEWLINE |
+			   PACKET_READ_GENTLE_ON_EOF);
+
+	return get_remote_bundle_uri(data->fd[1], &reader,
+				     &transport->bundle_uri, stateless_rpc);
+}
+
 static int fetch_refs_via_pack(struct transport *transport,
 			       int nr_heads, struct ref **to_fetch)
 {
@@ -888,6 +904,7 @@ static int disconnect_git(struct transport *transport)
 
 static struct transport_vtable taken_over_vtable = {
 	.get_refs_list	= get_refs_via_connect,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
 	.disconnect	= disconnect_git
@@ -1041,6 +1058,7 @@ static struct transport_vtable bundle_vtable = {
 
 static struct transport_vtable builtin_smart_vtable = {
 	.get_refs_list	= get_refs_via_connect,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
 	.connect	= connect_git,
@@ -1054,6 +1072,7 @@ struct transport *transport_get(struct remote *remote, const char *url)
 
 	ret->progress = isatty(2);
 	string_list_init_dup(&ret->pack_lockfiles);
+	string_list_init_dup(&ret->bundle_uri);
 
 	if (!remote)
 		BUG("No remote provided to transport_get()");
@@ -1462,6 +1481,34 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
+int transport_get_remote_bundle_uri(struct transport *transport)
+{
+	const struct transport_vtable *vtable = transport->vtable;
+
+	/* Lazily configured */
+	if (transport->got_remote_bundle_uri++)
+		return 0;
+
+	/*
+	 * "Support" protocol v0 and v2 without bundle-uri support by
+	 * silently degrading to a NOOP.
+	 */
+	if (!server_supports_v2("bundle-uri", 0))
+		return 0;
+
+	/*
+	 * This is intentionally below the transfer.injectBundleURI,
+	 * we want to be able to inject into protocol v0, or into the
+	 * dialog of a server that doesn't support this.
+	 */
+	if (!vtable->get_bundle_uri)
+		return error(_("bundle-uri operation not supported by protocol"));
+
+	if (vtable->get_bundle_uri(transport) < 0)
+		return error(_("could not retrieve server-advertised bundle-uri list"));
+	return 0;
+}
+
 void transport_unlock_pack(struct transport *transport, unsigned int flags)
 {
 	int in_signal_handler = !!(flags & TRANSPORT_UNLOCK_PACK_IN_SIGNAL_HANDLER);
@@ -1492,6 +1539,7 @@ int transport_disconnect(struct transport *transport)
 		ret = transport->vtable->disconnect(transport);
 	if (transport->got_remote_refs)
 		free_refs((void *)transport->remote_refs);
+	bundle_uri_string_list_clear(&transport->bundle_uri);
 	free(transport);
 	return ret;
 }
diff --git a/transport.h b/transport.h
index a0bc6a1e9eb..7740306b850 100644
--- a/transport.h
+++ b/transport.h
@@ -75,6 +75,18 @@ struct transport {
 	 */
 	unsigned got_remote_refs : 1;
 
+	/**
+	 * Indicates whether we already called get_bundle_uri(); set by
+	 * transport.c::transport_get_remote_bundle_uri().
+	 */
+	unsigned got_remote_bundle_uri : 1;
+
+	/*
+	 * The results of "command=bundle-uri", if both sides support
+	 * the "bundle-uri" capability.
+	 */
+	struct string_list bundle_uri;
+
 	/*
 	 * Transports that call take-over destroys the data specific to
 	 * the transport type while doing so, and cannot be reused.
@@ -276,6 +288,12 @@ void transport_ls_refs_options_release(struct transport_ls_refs_options *opts);
 const struct ref *transport_get_remote_refs(struct transport *transport,
 					    struct transport_ls_refs_options *transport_options);
 
+/**
+ * Retrieve bundle URI(s) from a remote. Populates "struct
+ * transport"'s "bundle_uri" and "got_remote_bundle_uri".
+ */
+int transport_get_remote_bundle_uri(struct transport *transport);
+
 /*
  * Fetch the hash algorithm used by a remote.
  *
-- 
2.35.1.1337.g7e32d794afe


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri"
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (4 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a git-ls-remote-bundle-uri command. It is a thin wrapper around
issuing the protocol v2 "bundle-uri" command to a server and around
the parsing routines in bundle-uri.c.
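
For readers who want the line format in one place, here is a small
standalone sketch of the rules that t5750-bundle-uri-parse.sh
exercises: the URI is everything up to the first SP, each following
SP-separated token is either a bare attribute or a single key=value
pair, and empty lines, empty URIs, empty attributes and tokens with
more than one "=" are rejected. This is not bundle_uri_parse_line()
itself; the function name, the fixed-size buffer and the rendered
output (which mimics the test-tool's "[attr: ...]" and "[kv: ...]"
style) are assumptions made for illustration. How the real parser
treats cases the tests do not cover, such as "k=" with an empty
value, is not addressed here.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse "line" and render it into "out".
 * Returns 0 on success, -1 on a malformed (or oversized) line.
 */
static int sketch_parse_line(const char *line, char *out, size_t outsz)
{
	char copy[256];
	char *tok, *next;
	size_t used;

	if (!*line || strlen(line) >= sizeof(copy))
		return -1;		/* empty or oversized line */
	strcpy(copy, line);

	/* The URI is everything up to the first SP. */
	next = strchr(copy, ' ');
	if (next)
		*next++ = '\0';
	if (!*copy)
		return -1;		/* empty URI component */
	used = snprintf(out, outsz, "%s", copy);

	/* Every remaining SP-separated token is an attribute. */
	while ((tok = next)) {
		char *eq;

		next = strchr(tok, ' ');
		if (next)
			*next++ = '\0';
		if (!*tok)
			return -1;	/* empty attribute, e.g. two SP */

		eq = strchr(tok, '=');
		if (eq && strchr(eq + 1, '='))
			return -1;	/* more than one '=' */
		if (used >= outsz)
			return -1;	/* output truncated */

		if (eq) {
			*eq = '\0';
			used += snprintf(out + used, outsz - used,
					 " [kv: %s => %s]", tok, eq + 1);
		} else {
			used += snprintf(out + used, outsz - used,
					 " [attr: %s]", tok);
		}
	}
	return used < outsz ? 0 : -1;
}
```

Unlike this sketch, the command added here prints what it parsed back
in the wire syntax, i.e. "<uri> <attr> <key>=<value>", or just the URI
with --uri.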

Since in the "git clone" case we'll have already done the handshake(),
but not here, introduce a "got_advertisement" state along with
"got_remote_heads". It seems to me that the "got_remote_heads" is
badly named in the first place, and the whole logic of eagerly getting
ls-refs on handshake() or not could be refactored somewhat, but let's
not do that now, and instead just add another self-documenting state
variable.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-ls-remote-bundle-uri.txt |  62 ++++++++++
 Documentation/git-ls-remote.txt            |   1 +
 Makefile                                   |   1 +
 builtin.h                                  |   1 +
 builtin/clone.c                            |   2 +-
 builtin/ls-remote-bundle-uri.c             |  90 ++++++++++++++
 command-list.txt                           |   1 +
 git.c                                      |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 132 +++++++++++++++++++++
 transport.c                                |  43 +++++--
 transport.h                                |   6 +-
 11 files changed, 329 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c

diff --git a/Documentation/git-ls-remote-bundle-uri.txt b/Documentation/git-ls-remote-bundle-uri.txt
new file mode 100644
index 00000000000..793d7677f2f
--- /dev/null
+++ b/Documentation/git-ls-remote-bundle-uri.txt
@@ -0,0 +1,62 @@
+git-ls-remote-bundle-uri(1)
+===========================
+
+NAME
+----
+git-ls-remote-bundle-uri - List 'bundle-uri' in a remote repository
+
+SYNOPSIS
+--------
+[verse]
+'git ls-remote-bundle-uri' [-q | --quiet] [--uri] [--upload-pack=<exec>]
+			 [[-o | --server-option=]<option>] <repository>
+
+
+DESCRIPTION
+-----------
+
+Displays the `bundle-uri`s advertised by a remote repository. See
+`bundle-uri` in link:technical/protocol-v2.html[the Git Wire Protocol,
+Version 2] documentation for what the output format looks like.
+
+OPTIONS
+-------
+
+-q::
+--quiet::
+	Do not print remote URL to stderr in cases where the remote
+	name is inferred from config.
++
+When the remote name is not inferred (e.g. `git ls-remote-bundle-uri
+origin`, or `git ls-remote-bundle-uri https://[...]`) the remote URL
+is not printed in any case.
+
+--uri::
+	Print only the URIs, and not any of their optional attributes.
+
+--upload-pack=<exec>::
+	Specify the full path of 'git-upload-pack' on the remote
+	host. This allows listing bundle URIs from repositories accessed via
+	SSH and where the SSH daemon does not use the PATH configured by the
+	user.
+
+-o <option>::
+--server-option=<option>::
+	Transmit the given string to the server when communicating using
+	protocol version 2.  The given string must not contain a NUL or LF
+	character.
+	When multiple `--server-option=<option>` are given, they are all
+	sent to the other side in the order listed on the command line.
+
+<repository>::
+	The "remote" repository to query.  This parameter can be
+	either a URL or the name of a remote (see the GIT URLS and
+	REMOTES sections of linkgit:git-fetch[1]).
+
+SEE ALSO
+--------
+linkgit:git-ls-remote[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/git-ls-remote.txt b/Documentation/git-ls-remote.txt
index 492e573856f..86c07eff832 100644
--- a/Documentation/git-ls-remote.txt
+++ b/Documentation/git-ls-remote.txt
@@ -114,6 +114,7 @@ c5db5456ae3b0873fc659c19fafdde22313cc441	refs/tags/v0.99.2
 
 SEE ALSO
 --------
+linkgit:git-ls-remote-bundle-uri[1].
 linkgit:git-check-ref-format[1].
 
 GIT
diff --git a/Makefile b/Makefile
index 4fec8e5af09..bdebf16a23f 100644
--- a/Makefile
+++ b/Makefile
@@ -1125,6 +1125,7 @@ BUILTIN_OBJS += builtin/init-db.o
 BUILTIN_OBJS += builtin/interpret-trailers.o
 BUILTIN_OBJS += builtin/log.o
 BUILTIN_OBJS += builtin/ls-files.o
+BUILTIN_OBJS += builtin/ls-remote-bundle-uri.o
 BUILTIN_OBJS += builtin/ls-remote.o
 BUILTIN_OBJS += builtin/ls-tree.o
 BUILTIN_OBJS += builtin/mailinfo.o
diff --git a/builtin.h b/builtin.h
index 83379f3832c..d093a1e7ffe 100644
--- a/builtin.h
+++ b/builtin.h
@@ -172,6 +172,7 @@ int cmd_log(int argc, const char **argv, const char *prefix);
 int cmd_log_reflog(int argc, const char **argv, const char *prefix);
 int cmd_ls_files(int argc, const char **argv, const char *prefix);
 int cmd_ls_tree(int argc, const char **argv, const char *prefix);
+int cmd_ls_remote_bundle_uri(int argc, const char **argv, const char *prefix);
 int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 int cmd_mailsplit(int argc, const char **argv, const char *prefix);
diff --git a/builtin/clone.c b/builtin/clone.c
index b2c1b4142ee..007a59c6118 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1225,7 +1225,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	 * Populate transport->got_remote_bundle_uri and
 	 * transport->bundle_uri. We might get nothing.
 	 */
-	transport_get_remote_bundle_uri(transport);
+	transport_get_remote_bundle_uri(transport, 1);
 
 	if (mapped_refs) {
 		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
diff --git a/builtin/ls-remote-bundle-uri.c b/builtin/ls-remote-bundle-uri.c
new file mode 100644
index 00000000000..dadb21043c0
--- /dev/null
+++ b/builtin/ls-remote-bundle-uri.c
@@ -0,0 +1,90 @@
+#include "builtin.h"
+#include "cache.h"
+#include "transport.h"
+#include "ref-filter.h"
+#include "remote.h"
+#include "refs.h"
+
+static const char * const ls_remote_bundle_uri_usage[] = {
+	N_("git ls-remote-bundle-uri <repository>"),
+	NULL
+};
+
+int cmd_ls_remote_bundle_uri(int argc, const char **argv, const char *prefix)
+{
+	int quiet = 0;
+	int uri = 0;
+	const char *uploadpack = NULL;
+	struct string_list server_options = STRING_LIST_INIT_DUP;
+	struct option options[] = {
+		OPT__QUIET(&quiet, N_("do not print remote URL")),
+		OPT_BOOL(0, "uri", &uri, N_("limit to showing uri field")),
+		OPT_STRING(0, "upload-pack", &uploadpack, N_("exec"),
+			   N_("path of git-upload-pack on the remote host")),
+		OPT_STRING_LIST('o', "server-option", &server_options,
+				N_("server-specific"),
+				N_("option to transmit")),
+		OPT_END()
+	};
+	const char *dest = NULL;
+	struct remote *remote;
+	struct transport *transport;
+	int status = 0;
+	struct string_list_item *item;
+
+	argc = parse_options(argc, argv, prefix, options, ls_remote_bundle_uri_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+	dest = argv[0];
+
+	packet_trace_identity("ls-remote-bundle-uri");
+
+	remote = remote_get(dest);
+	if (!remote) {
+		if (dest)
+			die(_("bad repository '%s'"), dest);
+		die(_("no remote configured to get bundle URIs from"));
+	}
+	if (!remote->url_nr)
+		die(_("remote '%s' has no configured URL"), dest);
+
+	transport = transport_get(remote, NULL);
+	if (uploadpack)
+		transport_set_option(transport, TRANS_OPT_UPLOADPACK, uploadpack);
+	if (server_options.nr)
+		transport->server_options = &server_options;
+
+	if (!dest && !quiet)
+		fprintf(stderr, "From %s\n", *remote->url);
+
+	if (transport_get_remote_bundle_uri(transport, 0) < 0) {
+		error(_("could not get the bundle-uri list"));
+		status = 1;
+		goto cleanup;
+	}
+
+	for_each_string_list_item(item, &transport->bundle_uri) {
+		struct string_list_item *kv_item;
+		struct string_list *kv = item->util;
+
+		fprintf(stdout, "%s", item->string);
+		if (uri || !kv) {
+			fprintf(stdout, "\n");
+			continue;
+		}
+		for_each_string_list_item(kv_item, kv) {
+			const char *k = kv_item->string;
+			const char *v = kv_item->util;
+
+			if (v)
+				fprintf(stdout, " %s=%s", k, v);
+			else
+				fprintf(stdout, " %s", k);
+		}
+		fprintf(stdout, "\n");
+	}
+
+cleanup:
+	if (transport_disconnect(transport))
+		return 1;
+	return status;
+}
diff --git a/command-list.txt b/command-list.txt
index 9bd6f3c48f4..a50eebd4aa2 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -115,6 +115,7 @@ git-interpret-trailers                  purehelpers
 git-log                                 mainporcelain           info
 git-ls-files                            plumbinginterrogators
 git-ls-remote                           plumbinginterrogators
+git-ls-remote-bundle-uri                plumbinginterrogators
 git-ls-tree                             plumbinginterrogators
 git-mailinfo                            purehelpers
 git-mailsplit                           purehelpers
diff --git a/git.c b/git.c
index a25940d72e8..22554f2e5c5 100644
--- a/git.c
+++ b/git.c
@@ -550,6 +550,7 @@ static struct cmd_struct commands[] = {
 	{ "log", cmd_log, RUN_SETUP },
 	{ "ls-files", cmd_ls_files, RUN_SETUP },
 	{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
+	{ "ls-remote-bundle-uri", cmd_ls_remote_bundle_uri, RUN_SETUP_GENTLY },
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "mailsplit", cmd_mailsplit, NO_PARSEOPT },
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index 724acecda66..f0c41d60654 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -139,3 +139,135 @@ test_expect_success !T5730_HTTP "bad client with $T5730_PROTOCOL:// using protoc
 	grep "clone> test-bad-client$" log >sent-bad-request &&
 	test_file_not_empty sent-bad-request
 '
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	# All data about bundle URIs
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual &&
+
+	# Only the URIs
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri --uri \
+		"$T5730_URI" \
+		>actual2 &&
+	test_cmp actual actual2
+'
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	ATTR="foo bar=baz" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED $ATTR" &&
+
+	# All data about bundle URIs
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED $ATTR
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual
+'
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2: --uri" '
+	test_when_finished "rm -f log" &&
+
+	ATTR="foo bar=baz" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED $ATTR" &&
+
+	# Only the URIs
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual
+'
+
+test_expect_success "ls-remote-bundle-uri --[no-]quiet with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	cat >err.expect <<-\EOF &&
+	Cloning into '"'"'child'"'"'...
+	EOF
+
+	test_when_finished "rm -rf child" &&
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		 clone "$T5730_URI" child \
+		 >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	# Without --[no-]quiet
+	cat >out.expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	cat >err.expect <<-EOF &&
+	From $T5730_URI
+	EOF
+	git \
+		-C child \
+		 -c protocol.version=2 \
+		ls-remote-bundle-uri \
+		>out.actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp out.expect out.actual &&
+
+	# --no-quiet is the default
+	git \
+		-C child \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--no-quiet \
+		>out.actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp out.expect out.actual &&
+
+	# --quiet quiets the "From" line
+	git \
+		-C child \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--quiet \
+		>out.actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp out.expect out.actual &&
+
+	# --quiet is implicit if the remote is given explicitly
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>out.actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp out.expect out.actual
+'
diff --git a/transport.c b/transport.c
index 7c9a371ed7f..16147a170f8 100644
--- a/transport.c
+++ b/transport.c
@@ -191,6 +191,7 @@ struct git_transport_data {
 	struct git_transport_options options;
 	struct child_process *conn;
 	int fd[2];
+	unsigned got_advertisement : 1;
 	unsigned got_remote_heads : 1;
 	enum protocol_version version;
 	struct oid_array extra_have;
@@ -336,6 +337,7 @@ static struct ref *handshake(struct transport *transport, int for_push,
 		BUG("unknown protocol version");
 	}
 	data->got_remote_heads = 1;
+	data->got_advertisement = 1;
 	transport->hash_algo = reader.hash_algo;
 
 	if (reader.line_peeked)
@@ -357,6 +359,33 @@ static int get_bundle_uri(struct transport *transport)
 	int stateless_rpc = transport->stateless_rpc;
 	string_list_init_dup(&transport->bundle_uri);
 
+	if (!data->got_advertisement) {
+		struct ref *refs;
+		struct git_transport_data *data = transport->data;
+		enum protocol_version version;
+
+		refs = handshake(transport, 0, NULL, 0);
+		version = data->version;
+
+		switch (version) {
+		case protocol_v2:
+			assert(!refs);
+			break;
+		case protocol_v0:
+		case protocol_v1:
+		case protocol_unknown_version:
+			assert(refs);
+			break;
+		}
+	}
+
+	/*
+	 * "Support" protocol v0 and v2 without bundle-uri support by
+	 * silently degrading to a NOOP.
+	 */
+	if (!server_supports_v2("bundle-uri", 0))
+		return 0;
+
 	packet_reader_init(&reader, data->fd[0], NULL, 0,
 			   PACKET_READ_CHOMP_NEWLINE |
 			   PACKET_READ_GENTLE_ON_EOF);
@@ -1481,7 +1510,7 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
-int transport_get_remote_bundle_uri(struct transport *transport)
+int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
 
@@ -1489,20 +1518,16 @@ int transport_get_remote_bundle_uri(struct transport *transport)
 	if (transport->got_remote_bundle_uri++)
 		return 0;
 
-	/*
-	 * "Support" protocol v0 and v2 without bundle-uri support by
-	 * silently degrading to a NOOP.
-	 */
-	if (!server_supports_v2("bundle-uri", 0))
-		return 0;
-
 	/*
 	 * This is intentionally below the transport.injectBundleURI,
 	 * we want to be able to inject into protocol v0, or into the
 	 * dialog of a server who doesn't support this.
 	 */
-	if (!vtable->get_bundle_uri)
+	if (!vtable->get_bundle_uri) {
+		if (quiet)
+			return -1;
 		return error(_("bundle-uri operation not supported by protocol"));
+	}
 
 	if (vtable->get_bundle_uri(transport) < 0)
 		return error(_("could not retrieve server-advertised bundle-uri list"));
diff --git a/transport.h b/transport.h
index 7740306b850..4253eafd954 100644
--- a/transport.h
+++ b/transport.h
@@ -291,8 +291,12 @@ const struct ref *transport_get_remote_refs(struct transport *transport,
 /**
  * Retrieve bundle URI(s) from a remote. Populates "struct
  * transport"'s "bundle_uri" and "got_remote_bundle_uri".
+ *
+ * With `quiet=1` it will not complain if the server doesn't support
+ * the protocol. It will only complain if we discover that the server
+ * uses it and we then encounter issues.
  */
-int transport_get_remote_bundle_uri(struct transport *transport);
+int transport_get_remote_bundle_uri(struct transport *transport, int quiet);
 
 /*
  * Fetch the hash algorithm used by a remote.
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (5 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add the ability to inject "fake" bundle URIs into the newly supported
bundle-uri dialog. As discussed in the added documentation this allows
us to pretend as though the remote supports bundle URIs.

This will be useful both for ad-hoc testing, and for the real use-case
of retrofitting bundle URI support on-the-fly, i.e. to have:

	git -c transfer.injectBundleURI="file://$(pwd)/local.bdl" \
	clone https://example.com/git.git

Be similar in spirit to:

	git clone --reference local-clone.git --dissociate \
	https://example.com/git.git
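
Since the setting can be given more than once, a wrapper script could
build the "-c" arguments from a list of local bundle files. The
following is only a sketch: inject_bundle_args is a hypothetical
helper, not something this series adds, and it assumes the bundle
paths contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of this series): print one
# "-c transfer.injectBundleURI=file://<path>" pair per bundle path,
# one token per line. Relative paths are resolved against $PWD.
inject_bundle_args () {
	for path in "$@"
	do
		case "$path" in
		/*) ;;
		*) path="$PWD/$path" ;;
		esac
		printf '%s\n' "-c" "transfer.injectBundleURI=file://$path"
	done
}

# Possible usage, relying on word splitting (hence the no-whitespace
# assumption above):
#
#	git $(inject_bundle_args /tmp/a.bdl /tmp/b.bdl) \
#		clone https://example.com/git.git
```

Each path becomes one injected line, in the order given.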

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt     | 20 ++++++++++++
 t/lib-t5730-protocol-v2-bundle-uri.sh | 46 +++++++++++++++++++++++++++
 transport.c                           | 33 +++++++++++++++++++
 3 files changed, 99 insertions(+)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index b49429eb4db..71b9b8f29e6 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -77,3 +77,23 @@ transfer.unpackLimit::
 transfer.advertiseSID::
 	Boolean. When true, client and server processes will advertise their
 	unique session IDs to their remote counterpart. Defaults to false.
+
+transfer.injectBundleURI::
+	Allows for the injection of `bundle-uri` lines into the
+	protocol v2 transport dialog (see `protocol.version` in
+	linkgit:git-config[1]). See `bundle-uri` in
+	link:technical/protocol-v2.html[the Git Wire Protocol, Version
+	2] documentation for what the format looks like.
++
+Can be given more than once, each value being injected as one line into
+the dialog.
++
+This is useful for testing the `bundle-uri` facility, and to e.g. use
+linkgit:git-clone[1] to clone from a server which does not support
+`bundle-uri`, but where the clone can benefit from getting some or
+most of the data from a static bundle retrieved from elsewhere.
++
+Impacts any command that uses the transport to communicate with remote
+linkgit:git-upload-pack[1] processes, e.g. linkgit:git-clone[1],
+linkgit:git-fetch[1] and the linkgit:git-ls-remote-bundle-uri[1]
+inspection command. This includes the `file://` protocol.
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index f0c41d60654..3be47bacc5f 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -271,3 +271,49 @@ test_expect_success "ls-remote-bundle-uri --[no-]quiet with $T5730_PROTOCOL:// u
 	test_must_be_empty err &&
 	test_cmp out.expect out.actual
 '
+
+test_expect_success "ls-remote-bundle-uri with -c transfer.injectBundleURI with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >expect <<-\EOF &&
+	https://injected.example.com/fake-1.bdl
+	https://injected.example.com/fake-2.bdl
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c transfer.injectBundleURI="https://injected.example.com/fake-1.bdl" \
+		-c transfer.injectBundleURI="https://injected.example.com/fake-2.bdl" \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual 2>err &&
+	test_cmp expect actual &&
+	test_path_is_missing log
+'
+
+test_expect_success "ls-remote-bundle-uri with bad -c transfer.injectBundleURI protocol v2 with $T5730_PROTOCOL://" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >err.expect <<-\EOF &&
+	error: bad (empty) transfer.injectBundleURI
+	error: could not get the bundle-uri list
+	EOF
+
+	test_must_fail env \
+		GIT_TRACE_PACKET="$PWD/log" \
+		git \
+		-c protocol.version=2 \
+		-c transfer.injectBundleURI \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>out 2>err.actual &&
+	test_must_be_empty out &&
+	test_cmp err.expect err.actual &&
+	test_path_is_missing log
+'
diff --git a/transport.c b/transport.c
index 16147a170f8..16332f9d64a 100644
--- a/transport.c
+++ b/transport.c
@@ -1510,14 +1510,47 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
+struct config_cb {
+	struct transport *transport;
+	int configured;
+	int ret;
+};
+
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	struct config_cb *cb = data;
+	struct transport *transport = cb->transport;
+	struct string_list *uri = &transport->bundle_uri;
+
+	if (!strcmp(var, "transfer.injectbundleuri")) {
+		cb->configured = 1;
+		if (!value)
+			cb->ret = error(_("bad (empty) transfer.injectBundleURI"));
+		else if (bundle_uri_parse_line(uri, value) < 0)
+			cb->ret = error(_("bad transfer.injectBundleURI: '%s'"),
+					value);
+		return 0;
+	}
+	return 0;
+}
+
 int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
+	struct config_cb cb = {
+		.transport = transport,
+	};
 
 	/* Lazily configured */
 	if (transport->got_remote_bundle_uri++)
 		return 0;
 
+	git_config(bundle_uri_config, &cb);
+
+	/* Our own config can fake it up with transfer.injectBundleURI */
+	if (cb.configured)
+		return cb.ret;
+
 	/*
 	 * This is intentionally below the transport.injectBundleURI,
 	 * we want to be able to inject into protocol v0, or into the
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (6 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

The yet-to-be introduced client support for bundle-uri will always
fall back on a full clone, but we'd still like to be able to ignore a
server's bundle-uri advertisement entirely.

This is useful for testing, and for when a server points to bad
bundles, the bundles take a while to time out, etc.

Since we might see the config in any order we need to clear out any
accumulated bundle_uri list when we see a transfer.bundleURI=false
setting, and not add any more things to the list.
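
The ordering rule can be modelled outside of C. The sketch below is a
shell approximation of the callback's logic, not the patch itself:
effective_bundle_uris is a hypothetical name, input is "key=value"
lines in the order the config would be seen, and boolean parsing is
simplified to a few literal spellings of "false".

```shell
#!/bin/sh
# Model of the config callback: read "key=value" lines on stdin and
# print the injected URIs that would survive, one per line.
effective_bundle_uris () {
	disabled=0
	uris=
	while IFS='=' read -r key value
	do
		case "$key" in
		transfer.bundleuri)
			case "$value" in
			false|no|off|0)
				# Drop whatever was accumulated so far
				disabled=1
				uris=
				;;
			*)
				disabled=0
				;;
			esac
			;;
		transfer.injectbundleuri)
			# Only accumulate while not disabled
			if test "$disabled" = 0
			then
				uris="$uris$value
"
			fi
			;;
		esac
	done
	printf '%s' "$uris"
}
```

Feeding it an injected URI, then transfer.bundleuri=false, then
another injected URI prints nothing, whichever side of the boolean
the injected URIs appear on.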

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt |  6 ++++++
 transport.c                       | 21 +++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index 71b9b8f29e6..ae85ca5760b 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -78,6 +78,12 @@ transfer.advertiseSID::
 	Boolean. When true, client and server processes will advertise their
 	unique session IDs to their remote counterpart. Defaults to false.
 
+transfer.bundleURI::
+	When set to `false`, ignores any server advertisement of
+	`bundle-uri` and proceeds with a "normal" clone/fetch even if
+	using bundles to bootstrap is possible. Defaults to `true`,
+	i.e. bundle-uri is tried whenever a server offers it.
+
 transfer.injectBundleURI::
 	Allows for the injection of `bundle-uri` lines into the
 	protocol v2 transport dialog (see `protocol.version` in
diff --git a/transport.c b/transport.c
index 16332f9d64a..7085bfb3db8 100644
--- a/transport.c
+++ b/transport.c
@@ -1510,19 +1510,28 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
-struct config_cb {
+struct bundle_config_cb {
 	struct transport *transport;
 	int configured;
 	int ret;
+	int disabled;
 };
 
 static int bundle_uri_config(const char *var, const char *value, void *data)
 {
-	struct config_cb *cb = data;
+	struct bundle_config_cb *cb = data;
 	struct transport *transport = cb->transport;
 	struct string_list *uri = &transport->bundle_uri;
 
-	if (!strcmp(var, "transfer.injectbundleuri")) {
+	if (!strcmp(var, "transfer.bundleuri")) {
+		cb->disabled = !git_config_bool(var, value);
+		if (cb->disabled)
+			bundle_uri_string_list_clear(uri);
+		return 0;
+	}
+
+	if (!cb->disabled &&
+	    !strcmp(var, "transfer.injectbundleuri")) {
 		cb->configured = 1;
 		if (!value)
 			cb->ret = error(_("bad (empty) transfer.injectBundleURI"));
@@ -1537,7 +1546,7 @@ static int bundle_uri_config(const char *var, const char *value, void *data)
 int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
-	struct config_cb cb = {
+	struct bundle_config_cb cb = {
 		.transport = transport,
 	};
 
@@ -1547,6 +1556,10 @@ int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 
 	git_config(bundle_uri_config, &cb);
 
+	/* Don't use bundle-uri at all */
+	if (cb.disabled)
+		return 0;
+
 	/* Our own config can fake it up with transfer.injectBundleURI */
 	if (cb.configured)
 		return cb.ret;
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended()
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (7 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a version of the deref_without_lazy_fetch function which can be
called with custom oi_flags and to grab information about the
"object_type". This will be used for the bundle-uri client in a
subsequent commit.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 fetch-pack.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 87657907e78..a0558f70b0c 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -115,11 +115,12 @@ static void for_each_cached_alternate(struct fetch_negotiator *negotiator,
 		cb(negotiator, cache.items[i]);
 }
 
-static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
-					       int mark_tags_complete)
+static struct commit *deref_without_lazy_fetch_extended(const struct object_id *oid,
+							int mark_tags_complete,
+							enum object_type *type,
+							unsigned int oi_flags)
 {
-	enum object_type type;
-	struct object_info info = { .typep = &type };
+	struct object_info info = { .typep = type };
 	struct commit *commit;
 
 	commit = lookup_commit_in_graph(the_repository, oid);
@@ -128,9 +129,9 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
-					     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK))
+					     oi_flags))
 			return NULL;
-		if (type == OBJ_TAG) {
+		if (*type == OBJ_TAG) {
 			struct tag *tag = (struct tag *)
 				parse_object(the_repository, oid);
 
@@ -144,7 +145,7 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 		}
 	}
 
-	if (type == OBJ_COMMIT) {
+	if (*type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(the_repository, oid);
 		if (!commit || repo_parse_commit(the_repository, commit))
 			return NULL;
@@ -154,6 +155,16 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 	return NULL;
 }
 
+
+static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
+					       int mark_tags_complete)
+{
+	enum object_type type;
+	unsigned flags = OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK;
+	return deref_without_lazy_fetch_extended(oid, mark_tags_complete,
+						 &type, flags);
+}
+
 static int rev_list_insert_ref(struct fetch_negotiator *negotiator,
 			       const struct object_id *oid)
 {
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (8 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Move the populating of the --keep=* option argument to "index-pack" to
a static function, a subsequent commit will make use of it in another
function.
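
For reference, the string the new helper builds has the shape
"--keep=fetch-pack <pid> on <hostname>", falling back to "localhost"
when the hostname cannot be determined. A rough shell equivalent,
where index_pack_keep_option is a hypothetical name and "$$" (the
shell's PID) stands in for the getpid() of the fetch-pack process:

```shell
#!/bin/sh
# Shell approximation of the --keep=* argument passed to index-pack.
index_pack_keep_option () {
	# Fall back to "localhost" if hostname(1) fails or prints nothing
	host=$(hostname 2>/dev/null) || host=localhost
	test -n "$host" || host=localhost
	printf '%s\n' "--keep=fetch-pack $$ on $host"
}
```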

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 fetch-pack.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index a0558f70b0c..0010867e5f5 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -842,6 +842,16 @@ static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids)
 	} while (1);
 }
 
+static void add_index_pack_keep_option(struct strvec *args)
+{
+	char hostname[HOST_NAME_MAX + 1];
+
+	if (xgethostname(hostname, sizeof(hostname)))
+		xsnprintf(hostname, sizeof(hostname), "localhost");
+	strvec_pushf(args, "--keep=fetch-pack %"PRIuMAX " on %s",
+		     (uintmax_t)getpid(), hostname);
+}
+
 /*
  * If packfile URIs were provided, pass a non-NULL pointer to index_pack_args.
  * The strings to pass as the --index-pack-arg arguments to http-fetch will be
@@ -911,14 +921,8 @@ static int get_pack(struct fetch_pack_args *args,
 			strvec_push(&cmd.args, "-v");
 		if (args->use_thin_pack)
 			strvec_push(&cmd.args, "--fix-thin");
-		if ((do_keep || index_pack_args) && (args->lock_pack || unpack_limit)) {
-			char hostname[HOST_NAME_MAX + 1];
-			if (xgethostname(hostname, sizeof(hostname)))
-				xsnprintf(hostname, sizeof(hostname), "localhost");
-			strvec_pushf(&cmd.args,
-				     "--keep=fetch-pack %"PRIuMAX " on %s",
-				     (uintmax_t)getpid(), hostname);
-		}
+		if ((do_keep || index_pack_args) && (args->lock_pack || unpack_limit))
+			add_index_pack_keep_option(&cmd.args);
 		if (!index_pack_args && args->check_self_contained_and_connected)
 			strvec_push(&cmd.args, "--check-self-contained-and-connected");
 		else
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (9 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Change the parse_bundle_header() function to be non-static, and rename
it to read_bundle_header_fd(). The read_bundle_header() function is
already public, and it's a thin wrapper around this function. This
will be used by code that wants to pass a fd to the bundle API.
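
To illustrate what reading a header from an already-open descriptor
involves, here is a shell sketch that parses a version 2 bundle
header from stdin. It is not the C implementation:
read_bundle_header_stdin is a hypothetical name, and only the v2
layout is handled (signature line, "-<oid> <comment>" prerequisites,
"<oid> <refname>" references, then a blank line before the pack data).

```shell
#!/bin/sh
# Parse a v2 bundle header from stdin; print one line per
# prerequisite and reference, and stop at the blank line that
# separates the header from the pack data.
read_bundle_header_stdin () {
	IFS= read -r signature || return 1
	if test "$signature" != "# v2 git bundle"
	then
		echo "error: not a v2 bundle" >&2
		return 1
	fi
	while IFS= read -r line
	do
		case "$line" in
		"")
			# End of header
			return 0
			;;
		-*)
			echo "prerequisite ${line#-}"
			;;
		*)
			echo "reference $line"
			;;
		esac
	done
	# EOF before the blank line: the header is truncated
	return 1
}
```

A caller that already holds the data as a stream can use it as
"read_bundle_header_stdin <some.bdl" without the function ever seeing
a path, which is the property the "fd" variant provides.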

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 bundle.c | 8 ++++----
 bundle.h | 2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/bundle.c b/bundle.c
index a0bb687b0f4..bfe32f543a6 100644
--- a/bundle.c
+++ b/bundle.c
@@ -61,8 +61,8 @@ static int parse_bundle_signature(struct bundle_header *header, const char *line
 	return -1;
 }
 
-static int parse_bundle_header(int fd, struct bundle_header *header,
-			       const char *report_path)
+int read_bundle_header_fd(int fd, struct bundle_header *header,
+			  const char *report_path)
 {
 	struct strbuf buf = STRBUF_INIT;
 	int status = 0;
@@ -138,7 +138,7 @@ int read_bundle_header(const char *path, struct bundle_header *header)
 
 	if (fd < 0)
 		return error(_("could not open '%s'"), path);
-	return parse_bundle_header(fd, header, path);
+	return read_bundle_header_fd(fd, header, path);
 }
 
 int is_bundle(const char *path, int quiet)
@@ -148,7 +148,7 @@ int is_bundle(const char *path, int quiet)
 
 	if (fd < 0)
 		return 0;
-	fd = parse_bundle_header(fd, &header, quiet ? NULL : path);
+	fd = read_bundle_header_fd(fd, &header, quiet ? NULL : path);
 	if (fd >= 0)
 		close(fd);
 	bundle_header_release(&header);
diff --git a/bundle.h b/bundle.h
index 06009fe6b1f..2893defbc33 100644
--- a/bundle.h
+++ b/bundle.h
@@ -22,6 +22,8 @@ void bundle_header_release(struct bundle_header *header);
 
 int is_bundle(const char *path, int quiet);
 int read_bundle_header(const char *path, struct bundle_header *header);
+int read_bundle_header_fd(int fd, struct bundle_header *header,
+			  const char *report_path);
 int create_bundle(struct repository *r, const char *path,
 		  int argc, const char **argv, struct strvec *pack_options,
 		  int version);
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone"
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (10 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 16:24   ` [RFC PATCH v2 13/13] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

In an earlier commit ("bundle-uri client: add minimal NOOP client") a
transport_get_remote_bundle_uri() call was added to builtin/clone.c to
get any advertised bundle URIs from the server during cloning, but
nothing was being done with them yet.

This implements real support for bundle-uri during the "clone"
phase. It's not used at all by "fetch", but the code to support it is
mostly here already, and will be finished later.

Using the new transfer.injectBundleURI support it's easy to test this
method of cloning on a live server that doesn't support bundle-uri.

In a git.git checkout, first let's prepare two bundles:

    git bundle create /tmp/git-master-only.bdl origin/master
    git bundle create /tmp/git-master-to-next.bdl origin/master..origin/next

And next, let's do a "fake" clone where we bootstrap from these
bundles. The fetch.uriProtocols is needed because we'd otherwise
ignore "file://" URIs. This uses --no-tags --single-branch for
simplicity:

    rm -rf /tmp/git.git &&
    git \
        -c protocol.version=2 \
        -c fetch.uriProtocols=file \
        -c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
        -c transfer.injectBundleURI="file:///tmp/git-master-to-next.bdl" \
        clone --bare --no-tags --single-branch --branch next --template= \
        --verbose --verbose \
        https://github.com/git/git.git /tmp/git.git

We'll then get output like:

    Receiving bundle (1/2): 100% (300529/300529), 87.57 MiB | 32.70 MiB/s, done.
    Resolving deltas: 100% (226765/226765), done.
    have eb27b338a3e71c7c4079fbac8aeae3f8fbb5c687 commit via bundle-uri
    Receiving bundle (2/2): 100% (725/725), 221.11 KiB | 22.11 MiB/s, done.
    Resolving deltas: 100% (539/539), completed with 153 local objects.
    have e1b32706d8dd5db1dc2e13f8e391651214f1d987 commit via bundle-uri
    Marking e1b32706d8dd5db1dc2e13f8e391651214f1d987 as complete
    already have e1b32706d8dd5db1dc2e13f8e391651214f1d987 (refs/heads/next)
    Checking connectivity: 301210, done.

I.e. we did an ls-refs on connection to the server, then retrieved the
advertised bundles (faked up via config in this case).

We then got all the data leading up to the current "master" from
there, and also the commit that's currently on "next". In this case we
found that we didn't need to proceed further with the dialog.

I.e. other than an ls-refs and the server waiting until we downloaded
the bundles, the server didn't need to do any work creating a PACK for
us.
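
As a rough illustration of why fetch.uriProtocols=file was needed
above: a bundle URI is only used if its scheme is on an allow-list,
which falls back to "http" and "https" when fetch.uriProtocols is
unset. A minimal shell sketch of that check follows; the
uri_protocol_ok helper is hypothetical and not part of git:

```shell
# uri_protocol_ok URI [PROTOCOL...]
# Succeeds if URI starts with "<protocol>://" for one of the listed
# protocols. With no protocols given, assume "http" and "https".
uri_protocol_ok () {
	uri=$1
	shift
	if test $# -eq 0
	then
		set -- http https
	fi
	for proto in "$@"
	do
		case "$uri" in
		"$proto"://*)
			return 0
			;;
		esac
	done
	return 1
}
```

E.g. "uri_protocol_ok file:///tmp/git-master-only.bdl" fails, while
passing "file" as an extra argument makes it succeed.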

If we change "--branch next" into "--branch seen" in the above command
we'll get the same output at the start until the "want" line, then:

    [...]
    want 93021c12c9f91e0d750d3ca8750a62416f4ea81a (refs/heads/seen)
    POST git-upload-pack (212 bytes)
    remote: Enumerating objects: 2265, done.
    remote: Counting objects: 100% (1576/1576), done.
    remote: Compressing objects: 100% (233/233), done.
    remote: Total 2265 (delta 1378), reused 1480 (delta 1341), pack-reused 689
    Receiving objects: 100% (2265/2265), 2.17 MiB | 10.77 MiB/s, done.
    Resolving deltas: 100% (1673/1673), completed with 339 local objects.
    Checking connectivity: 303225, done.

I.e. the server needed to send us an incremental update on top after
we'd unpacked the bundles, but this was a fairly minimal set of ~2k
objects. It didn't need to service a full clone.

We can see the savings on the server by setting up a local server at
the tip of "next":

    rm -rf /tmp/git-server.git &&
    git init --bare /tmp/git-server.git &&
    git -C /tmp/git-server.git bundle unbundle /tmp/git-master-only.bdl &&
    git -C /tmp/git-server.git bundle unbundle /tmp/git-master-to-next.bdl &&
    git -C /tmp/git-server.git update-ref refs/heads/master $(git ls-remote /tmp/git-master-only.bdl | cut -f 1) &&
    git -C /tmp/git-server.git update-ref refs/heads/next $(git ls-remote /tmp/git-master-to-next.bdl | cut -f 1) &&
    git -C /tmp/git-server.git for-each-ref

Let's then clone from it, and record the time we spend.

    rm -rf /tmp/git.git /tmp/{client,server}.time &&
    /usr/bin/time -o /tmp/client.time -v git \
	-c protocol.version=2 \
        -c fetch.uriProtocols=file \
        -c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
	clone \
	--upload-pack '/usr/bin/time -o /tmp/server.time -v git-upload-pack' \
	--bare --no-tags --single-branch --branch next --template= \
	--verbose --verbose \
	file:///tmp/git-server.git /tmp/git.git &&
    for i in client server
    do
        echo $i: &&
        grep -e seconds -e wall -e Maximum -e context /tmp/$i.time
    done

This gives us something like these results:

    client:
        User time (seconds): 46.34
        System time (seconds): 0.67
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.67
        Maximum resident set size (kbytes): 207096
        Voluntary context switches: 116058
        Involuntary context switches: 220
    server:
        User time (seconds): 0.13
        System time (seconds): 0.00
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.08
        Maximum resident set size (kbytes): 53168
        Voluntary context switches: 255
        Involuntary context switches: 7

Whereas doing a normal "clone" (by e.g. adding "-c
transfer.bundleURI=false" to the above) will give something like:

    client:
        User time (seconds): 47.24
        System time (seconds): 0.92
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.55
        Maximum resident set size (kbytes): 288104
        Voluntary context switches: 136350
        Involuntary context switches: 296
    server:
        User time (seconds): 5.73
        System time (seconds): 0.24
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.45
        Maximum resident set size (kbytes): 288104
        Voluntary context switches: 26568
        Involuntary context switches: 111

I.e. we can see that the win on the client in this case is negative,
but we use just over 2% of the CPU time on the server, and around
20% of the memory. The client-visible wall-clock time is a bit slower,
by under 1%.
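
The ratios can be recomputed from the two sets of /usr/bin/time
figures quoted above. A small sketch (the pct helper name is made up
for illustration):

```shell
# pct A B: print A as a percentage of B, to one decimal place.
pct () {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

pct 0.13 5.73        # server user time, bundle-uri vs. normal clone
pct 53168 288104     # server maximum resident set size
pct 18.67 18.55      # client wall-clock time
```

Anything under 100 means the bundle-uri run used less of that
resource than the normal clone did.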

In practice I think this will be more of a win-win. These results are
on an unloaded local machine, and don't account for the benefit of the
server being more likely to have a network-local version of most of
the repository via dumb CDNs.

Real servers are also usually in a messier state of having various
loose objects and more fragmented pack collections, and needing to
spend CPU to assemble these. Frequent repacking and local caching,
e.g. via the uploadpack.packObjectsHook, help, but using this should
make it more accessible to run a high-performance git server.

This feature also makes things like resumable clones rather trivial to
implement; this approach was discussed in the past[1] as a means to
get that feature.

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 fetch-pack.c                          | 255 ++++++++++++++++++++++++++
 fetch-pack.h                          |   6 +
 t/lib-t5730-protocol-v2-bundle-uri.sh | 107 ++++++++++-
 transport.c                           |   1 +
 4 files changed, 368 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 0010867e5f5..4f1a7acb20d 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -26,6 +26,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "sigchain.h"
+#include "bundle.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -1020,6 +1021,133 @@ static int get_pack(struct fetch_pack_args *args,
 	return 0;
 }
 
+static int unbundle_bundle_uri(const char *bundle_uri, unsigned int nth,
+			       unsigned int total_nr, FILE *in, int in_fd,
+			       struct oid_array *bundle_oids,
+			       unsigned int use_thin_pack)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct bundle_header header = BUNDLE_HEADER_INIT;
+	int ret = 0;
+	struct string_list_item *item;
+	struct strbuf progress_title = STRBUF_INIT;
+	int code;
+
+	ret = read_bundle_header_fd(in_fd, &header, bundle_uri);
+	if (ret < 0) {
+		ret = error("could not read_bundle_header(%s)", bundle_uri);
+		goto cleanup;
+	}
+
+	for_each_string_list_item(item, &header.references) {
+		/*
+		 * The bundle's idea of the ref name is
+		 * item->string.
+		 *
+		 * Here's where we could do concurrent negotiation
+		 * with the server (and possibly start the fetch!)
+		 * before or while we unpack the bundle with
+		 * index-pack.
+		 *
+		 * The negotiator would need a small change to trust
+		 * arbitrary OIDs instead of assuming it has existing
+		 * in-repo "struct commit *", but ad-hoc testing
+		 * reveals that it'll work & speed up the fetch even
+		 * more, as we could proceed in parallel with the full
+		 * bundle fetching as soon as we get the headers.
+		 */
+		struct object_id *oid = item->util;
+
+		oid_array_append(bundle_oids, oid);
+	}
+
+	if (git_env_bool("GIT_TEST_BUNDLE_URI_FAIL_UNBUNDLE", 0))
+		lseek(in_fd, 0, SEEK_SET);
+
+	strbuf_addf(&progress_title, "Receiving bundle (%d/%d)", nth, total_nr);
+	strvec_pushl(&cmd.args, "index-pack", "--stdin", "-v",
+		     "--progress-title", progress_title.buf, NULL);
+
+	if (header.prerequisites.nr && use_thin_pack)
+		strvec_push(&cmd.args, "--fix-thin");
+	strvec_push(&cmd.args, "--check-self-contained-and-connected");
+	add_index_pack_keep_option(&cmd.args);
+
+	cmd.git_cmd = 1;
+	cmd.in = in_fd;
+	cmd.no_stdout = 1;
+	cmd.git_cmd = 1;
+
+	if (start_command(&cmd)) {
+		ret = error(_("fetch-pack: unable to spawn index-pack"));
+		goto cleanup;
+	}
+
+	code = finish_command(&cmd);
+
+	if (header.prerequisites.nr && code == 1)
+		/*
+		 * index-pack exits with 1 on
+		 * --check-self-contained-and-connected to indicate
+		 * that the pack was indeed not self contained and
+		 * connected. We know from the bundle header
+		 * prerequisites.
+		 */
+		code = 0;
+
+	if (code) {
+		ret = error(_("fetch-pack: unable to finish index-pack, exited with %d"), code);
+		goto cleanup;
+	}
+
+cleanup:
+	strbuf_release(&progress_title);
+	bundle_header_release(&header);
+	return ret;
+}
+
+static int get_bundle_uri(struct string_list_item *item, unsigned int nth,
+			  unsigned int total_nr, struct oid_array *bundle_oids,
+			  unsigned int use_thin_pack)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf tempfile = STRBUF_INIT;
+	int ret = 0;
+	const char *uri = item->string;
+	FILE *out;
+	int out_fd;
+
+	strvec_push(&cmd.args, "curl");
+	strvec_push(&cmd.args, "--silent");
+	strvec_push(&cmd.args, "--output");
+	strvec_push(&cmd.args, "-");
+	strvec_push(&cmd.args, "--");
+	strvec_push(&cmd.args, item->string);
+	cmd.git_cmd = 0;
+	cmd.no_stdin = 1;
+	cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		ret = error("fetch-pack: unable to spawn http-fetch");
+		goto cleanup;
+	}
+
+	out = xfdopen(cmd.out, "r");
+	out_fd = fileno(out);
+	ret = unbundle_bundle_uri(uri, nth, total_nr, out, out_fd,
+				  bundle_oids, use_thin_pack);
+
+	if (finish_command(&cmd)) {
+		ret = error("fetch-pack: unable to finish http-fetch");
+		goto cleanup;
+	}
+
+cleanup:
+	strbuf_release(&tempfile);
+
+	return ret;
+}
+
 static int cmp_ref_by_name(const void *a_, const void *b_)
 {
 	const struct ref *a = *((const struct ref **)a_);
@@ -1577,6 +1705,130 @@ static void do_check_stateless_delimiter(int stateless_rpc,
 				  _("git fetch-pack: expected response end packet"));
 }
 
+static int get_bundle_uri_add_known_common(struct string_list_item *item,
+					   unsigned int nth, unsigned int total_nr,
+					   struct fetch_negotiator *negotiator,
+					   struct fetch_pack_args *args,
+					   unsigned int use_thin_pack)
+{
+	int i;
+	struct oid_array bundle_oids = OID_ARRAY_INIT;
+
+	/*
+	 * We don't use OBJECT_INFO_QUICK here unlike in the rest of
+	 * the fetch routines, that's because the rest of them don't
+	 * need to consider a commit object that's just been
+	 * downloaded for further negotiation, but bundle-uri does for
+	 * adding newly downloaded OIDs to the negotiator.
+	 */
+	unsigned oi_flags = OBJECT_INFO_SKIP_FETCH_OBJECT;
+
+	if (get_bundle_uri(item, nth, total_nr, &bundle_oids, use_thin_pack) < 0)
+		return error(_("could not get the bundle URI #%d"), nth);
+
+	for (i = 0; i < bundle_oids.nr; i++) {
+		struct object_id *oid = &bundle_oids.oid[i];
+		enum object_type type = OBJ_NONE;
+		struct commit *c = deref_without_lazy_fetch_extended(oid, 0,
+								     &type,
+								     oi_flags);
+		if (!c) {
+			if (type == OBJ_BLOB || type == OBJ_TREE) {
+				print_verbose(args, "have %s %s via bundle-uri (ignoring due to type)",
+					      oid_to_hex(oid), type_name(type));
+				continue;
+			} else if (type) {
+				/*
+				 * OBJ_TAG should have been peeled,
+				 * and OBJ_COMMIT should have a
+				 * non-NULL "c".
+				 *
+				 * Should be a BUG() if we were not
+				 * bending over backwards to make
+				 * bundle-uri soft-fail.
+				 */
+				return error(_("bundle-uri says it has %s, got it at unexpected type %s"),
+					     oid_to_hex(oid), type_name(type));
+			}
+		}
+
+		print_verbose(args, "have %s %s via bundle-uri",
+			      oid_to_hex(oid), type_name(type));
+
+		negotiator->known_common(negotiator, c);
+		mark_complete(oid);
+	}
+	return 0;
+}
+
+static void do_fetch_pack_v2_bundle_uri(struct fetch_pack_args *args,
+					struct string_list  *bundle_uri,
+					struct fetch_negotiator *negotiator)
+{
+	struct string_list_item *item;
+	struct string_list list = STRING_LIST_INIT_NODUP;
+	struct string_list default_protocols = STRING_LIST_INIT_NODUP;
+	struct string_list *ok_protocols;
+
+	if (!bundle_uri)
+		return;
+
+	if (!bundle_uri->nr)
+		return;
+
+	if (uri_protocols.nr) {
+		ok_protocols = &uri_protocols;
+	} else {
+		string_list_append(&default_protocols, "http");
+		string_list_append(&default_protocols, "https");
+		ok_protocols = &default_protocols;
+	}
+
+	for_each_string_list_item(item, bundle_uri) {
+		const char *uri = item->string;
+		int protocol_ok = 0;
+		struct string_list_item *item2;
+
+		for_each_string_list_item(item2, ok_protocols) {
+			const char *s = item2->string;
+			const char *p;
+
+			if (skip_prefix(item->string, s, &p) &&
+			    starts_with(p, "://")) {
+				protocol_ok = 1;
+				break;
+			}
+		}
+
+		if (!protocol_ok) {
+			print_verbose(args, "skipping bundle-uri not on protocol whitelist: %s",
+				      item->string);
+			continue;
+		}
+
+		string_list_append(&list, uri)->util = item->util;
+	}
+
+	if (list.nr) {
+		int i;
+		unsigned int total_nr = list.nr;
+
+		trace2_region_enter("fetch-pack", "bundle-uri", the_repository);
+		for (i = 0; i < total_nr; i++) {
+			struct string_list_item item = list.items[i];
+			unsigned int nth = i + 1;
+
+			get_bundle_uri_add_known_common(&item, nth, total_nr,
+							negotiator, args,
+							args->use_thin_pack);
+		}
+		trace2_region_leave("fetch-pack", "bundle-uri", the_repository);
+	}
+
+	string_list_clear(&default_protocols, 0);
+}
+
+
 static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 				    int fd[2],
 				    const struct ref *orig_ref,
@@ -1600,10 +1852,13 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 	struct string_list packfile_uris = STRING_LIST_INIT_DUP;
 	int i;
 	struct strvec index_pack_args = STRVEC_INIT;
+	struct string_list *bundle_uri = args->bundle_uri;
 
 	negotiator = &negotiator_alloc;
 	fetch_negotiator_init(r, negotiator);
 
+	do_fetch_pack_v2_bundle_uri(args, bundle_uri, negotiator);
+
 	packet_reader_init(&reader, fd[0], NULL, 0,
 			   PACKET_READ_CHOMP_NEWLINE |
 			   PACKET_READ_DIE_ON_ERR_PACKET);
diff --git a/fetch-pack.h b/fetch-pack.h
index 7f94a2a5831..b19fd7d93be 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -24,6 +24,12 @@ struct fetch_pack_args {
 	 */
 	const struct oid_array *negotiation_tips;
 
+	/*
+	 * A pointer to the already populated transport.bundle_uri
+	 * struct.
+	 */
+	struct string_list *bundle_uri;
+
 	unsigned deepen_relative:1;
 	unsigned quiet:1;
 	unsigned keep_pack:1;
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index 3be47bacc5f..ab9b725f038 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -7,6 +7,8 @@ case "$T5730_PROTOCOL" in
 file)
 	T5730_PARENT=file_parent
 	T5730_URI="file://$PWD/file_parent"
+	T5730_URI_BDL_PROTO="file://"
+	T5730_URI_BDL="$T5730_URI_BDL_PROTO$PWD/file_parent"
 	T5730_BUNDLE_URI="$T5730_URI/fake.bdl"
 	test_set_prereq T5730_FILE
 	;;
@@ -15,6 +17,8 @@ git)
 	start_git_daemon --export-all --enable=receive-pack
 	T5730_PARENT="$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
 	T5730_URI="$GIT_DAEMON_URL/parent"
+	T5730_URI_BDL_PROTO="file://"
+	T5730_URI_BDL="$T5730_URI_BDL_PROTO$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
 	T5730_BUNDLE_URI="https://example.com/fake.bdl"
 	test_set_prereq T5730_GIT
 	;;
@@ -24,6 +28,8 @@ http)
 	T5730_PARENT="$HTTPD_DOCUMENT_ROOT_PATH/http_parent"
 	T5730_URI="$HTTPD_URL/smart/http_parent"
 	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	T5730_URI_BDL_PROTO="http://"
+	T5730_URI_BDL="$HTTPD_URL/dumb/http_parent"
 	test_set_prereq T5730_HTTP
 	;;
 *)
@@ -33,7 +39,20 @@ esac
 
 test_expect_success "setup protocol v2 $T5730_PROTOCOL:// tests" '
 	git init "$T5730_PARENT" &&
-	test_commit -C "$T5730_PARENT" one
+	test_commit -C "$T5730_PARENT" one &&
+	test_commit -C "$T5730_PARENT" two &&
+	test_commit -C "$T5730_PARENT" three &&
+	test_commit -C "$T5730_PARENT" four &&
+	test_commit -C "$T5730_PARENT" five &&
+	test_commit -C "$T5730_PARENT" six &&
+
+	mkdir "$T5730_PARENT"/bdl &&
+	git -C "$T5730_PARENT" bundle create bdl/1.bdl one &&
+	git -C "$T5730_PARENT" bundle create bdl/1-2.bdl one..two &&
+	git -C "$T5730_PARENT" bundle create bdl/2-3.bdl two..three &&
+	git -C "$T5730_PARENT" bundle create bdl/3-4.bdl three..four &&
+	git -C "$T5730_PARENT" bundle create bdl/4-5.bdl four..five &&
+	git -C "$T5730_PARENT" bundle create bdl/5-6.bdl five..six
 '
 
 # Poor man's URI escaping. Good enough for the test suite whose trash
@@ -317,3 +336,89 @@ test_expect_success "ls-remote-bundle-uri with bad -c transfer.injectBundleURI p
 	test_cmp err.expect err.actual &&
 	test_path_is_missing log
 '
+
+test_cmp_repo_refs() {
+	one="$1"
+	two="$2"
+	shift 2
+
+	git -C "$one" for-each-ref "$@" >expect &&
+	git -C "$two" for-each-ref "$@" >actual &&
+	test_cmp expect actual
+}
+
+test_expect_success "clone with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+
+	test_when_finished "rm -rf log child" &&
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child >out 2>err &&
+	grep -F "Receiving bundle (1/1)" err &&
+	grep "clone> want " log &&
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
+'
+
+test_expect_success "fetch with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+
+	test_when_finished "rm -rf log child" &&
+	git init --bare child &&
+	git -C child remote add --mirror=fetch origin "$T5730_URI" &&
+	GIT_TRACE_PACKET="$PWD/log" \
+	git -C child \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		fetch --verbose --verbose >out 2>err &&
+	# Fetch is not supported yet
+	! grep -F "Receiving bundle (1/1)" err &&
+	grep "fetch> want " log &&
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
+'
+
+test_expect_success "clone with bundle-uri protocol v2 with $T5730_PROTOCOL:// 1 + 1-2 + [...].bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1-2.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/2-3.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/3-4.bdl | test_uri_escape)" --add &&
+
+	test_when_finished "rm -rf log child" &&
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child >out 2>err &&
+	grep -F "Receiving bundle (4/4)" err &&
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags &&
+	grep "clone> want " log
+'
+
+test_expect_success "clone with bundle-uri protocol v2 with $T5730_PROTOCOL:// ALL.bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1-2.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/2-3.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/3-4.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/4-5.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/5-6.bdl | test_uri_escape)" --add &&
+
+	test_when_finished "rm -rf log child" &&
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child >out 2>err &&
+	grep -F "Receiving bundle (6/6)" err &&
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags &&
+	! grep "clone> want " log
+'
diff --git a/transport.c b/transport.c
index 7085bfb3db8..ebc3ed9b608 100644
--- a/transport.c
+++ b/transport.c
@@ -426,6 +426,7 @@ static int fetch_refs_via_pack(struct transport *transport,
 	args.server_options = transport->server_options;
 	args.negotiation_tips = data->options.negotiation_tips;
 	args.reject_shallow_remote = transport->smart_options->reject_shallow;
+	args.bundle_uri = &transport->bundle_uri;
 
 	if (!data->got_remote_heads) {
 		int i;
-- 
2.35.1.1337.g7e32d794afe



* [RFC PATCH v2 13/13] bundle-uri: make the download program configurable
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (11 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
@ 2022-03-11 16:24   ` Ævar Arnfjörð Bjarmason
  2022-03-11 21:28   ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Derrick Stolee
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
  14 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 16:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

As noted in a preceding commit we really should be using libcurl's C
API by default in get_bundle_uri(), but testing with a command-line
program can be very handy, and useful e.g. to implement custom or
ad-hoc caching.

E.g. using part of the recipe noted in a preceding commit to create
the "git-master-only.bdl" and "git-master-to-next.bdl" files, we can
implement a trivial caching shell script as:

	cat >get-bundle.sh <<-\EOF &&
	#!/bin/sh
	set -xe

	uri="$1"

	bundle_cache_key () {
		echo "Computing cache key for URI '$1' (only getting the header)" >&2

		curl --silent --output - -- "$1" |
		sed -n -e '/^$/q' -e 'p' |
		git hash-object --stdin
	}

	get_cached_bundle_uri() {
		cache_key=$(bundle_cache_key "$1")

		path="/tmp/bundle-cache-$cache_key.bdl"

		if test -e "$path"
		then
			echo "Using cache '$path' for URI '$1'" >&2
			cat "$path"
		else
			echo "Downloading bundle URI $1" >&2
			curl --silent --output - -- "$uri" | tee "$path"
		fi
	}

	get_cached_bundle_uri "$1"
	EOF
	chmod +x get-bundle.sh &&
	rm -rf /tmp/git.git &&
	./git \
		-c protocol.version=2 \
		-c fetch.uriProtocols=file \
		-c transfer.bundleURI.downloader=./get-bundle.sh \
		-c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
		-c transfer.injectBundleURI="file:///tmp/git-master-to-next.bdl" \
		clone --bare --no-tags --single-branch --branch next --template= \
		--verbose --verbose \
		https://github.com/git/git.git /tmp/git.git

Now, clearly that specific example is rather pointless. We're getting
a local file anyway, so "cat"-ing another local file doesn't make any
difference; it's even slightly slower & more redundant as we're having
to get it twice with "curl".

But the point is that this can be trivially improved for use in any
arbitrary custom caching strategy. E.g.:

 * A less dumb implementation that would stream the remote URL,
   check the header as we go, and disconnect if we've got that content
   locally.
 * Ditto, but using an ETag or other strategy.
 * N boxes could share a cache on NFS via a shared mount, or N
   disconnected git processes could use a common cache without the
   need for a front-line HTTP proxy server.

 * It would be trivial to extend this to guard against a "thundering
   herd" (e.g. concurrent CI) downloading the same bundle N times. As
   soon as we'd get the header we'd create a $cache_key.lock as we
   download the rest, and other concurrent clients spotting that would
   wait, then eventually cache "$cache_key".

   Still racy as N clients could download the header in parallel, but
   way less so (the header will be a tiny part of the payload).
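
A minimal sketch of that locking idea, using "mkdir" as the atomic
lock primitive. The fetch_once name, the BUNDLE_CACHE_DIR variable
and the cache layout are assumptions made up for this example, and a
lock left behind by a killed process is not handled:

```shell
cache_dir=${BUNDLE_CACHE_DIR:-/tmp/bundle-cache}

# fetch_once KEY CMD...: emit the cached bundle for KEY on stdout,
# running CMD (which must write the bundle to stdout) for at most one
# caller at a time per KEY.
fetch_once () {
	key=$1
	shift
	path="$cache_dir/$key.bdl"
	lock="$cache_dir/$key.lock"
	have_lock=
	mkdir -p "$cache_dir" || return 1

	# mkdir fails if the directory already exists, so only one of
	# several concurrent callers gets the lock; the rest poll.
	while ! test -e "$path"
	do
		if mkdir "$lock" 2>/dev/null
		then
			have_lock=1
			break
		fi
		sleep 1
	done

	if test -n "$have_lock"
	then
		# Re-check: the previous holder may have finished between
		# our "test -e" and our "mkdir".
		if ! test -e "$path"
		then
			# Write under a temporary name and rename, so that
			# waiters never see a partial file.
			if ! { "$@" >"$path.tmp" && mv "$path.tmp" "$path"; }
			then
				rm -f "$path.tmp"
				rmdir "$lock"
				return 1
			fi
		fi
		rmdir "$lock"
	fi
	cat "$path"
}
```

In the get-bundle.sh above this could be called as e.g.
'fetch_once "$cache_key" curl --silent --output - -- "$uri"'.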

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt | 7 +++++++
 fetch-pack.c                      | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index ae85ca5760b..5310cd96cb9 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -84,6 +84,13 @@ transfer.bundleURI::
 	using bundles to bootstap is possible. Defaults to `true`,
 	i.e. bundle-uri is tried whenever a server offers it.
 
+transfer.bundleURI.downloader::
+	When set, `<program>` will be invoked when
+	`transfer.bundleURI` is in effect to download URIs containing
+	bundles. Expected to take one `URI` as an argument, and to
+	emit the bundle on STDOUT. Defaults to "curl --silent --output
+	- --". I.e. we'll invoke "curl --silent --output - -- <URI>".
+
 transfer.injectBundleURI::
 	Allows for the injection of `bundle-uri` lines into the
 	protocol v2 transport dialog (see `protocol.version` in
diff --git a/fetch-pack.c b/fetch-pack.c
index 4f1a7acb20d..6e22605f06c 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1116,12 +1116,18 @@ static int get_bundle_uri(struct string_list_item *item, unsigned int nth,
 	const char *uri = item->string;
 	FILE *out;
 	int out_fd;
+	const char *tmp;
 
 	strvec_push(&cmd.args, "curl");
 	strvec_push(&cmd.args, "--silent");
 	strvec_push(&cmd.args, "--output");
 	strvec_push(&cmd.args, "-");
 	strvec_push(&cmd.args, "--");
+	if (!git_config_get_string_tmp("transfer.bundleURI.downloader", &tmp)) {
+		strvec_clear(&cmd.args);
+		strvec_push(&cmd.args, tmp);
+		cmd.use_shell = 1;
+	}
 	strvec_push(&cmd.args, item->string);
 	cmd.git_cmd = 0;
 	cmd.no_stdin = 1;
-- 
2.35.1.1337.g7e32d794afe



* Re: [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (12 preceding siblings ...)
  2022-03-11 16:24   ` [RFC PATCH v2 13/13] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
@ 2022-03-11 21:28   ` Derrick Stolee
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
  14 siblings, 0 replies; 77+ messages in thread
From: Derrick Stolee @ 2022-03-11 21:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jonathan Tan, Jonathan Nieder, Albert Cui,
	Robin H . Johnson, Teng Long

On 3/11/2022 11:24 AM, Ævar Arnfjörð Bjarmason wrote:
> Per recent discussion[1] this is my not-quite-feature-complete version
> of the bundle-uri capability.  This was sent to the list in some form
> before in [2] and [3].
> 
> Recently Derrick Stolee has sent an alternate implementation of some
> of the same ideas in [4]. Per [1] we're planning to work together on
> getting a version of this into git that makes everyone happy, sending
> what I've got here is the first step in that.

Thanks! It's good to see your intended end-to-end for comparison
before we start the combined effort. I look forward to the additional
details coming next week, because there are a lot of optimizations in
there that will inform our direction.
 
> A high-level summary of the important differences in my approach &
> Derrick's (which I hope I'm summarizing fairly here) is that his
> approach optionally adds a bundle TOC format, that format allows you
> to define topology relationships between bundles to guide a
> (returning) client in what it needs to fetch.
> 
> Whereas the idea in this series is to lean entirely on the client
> downloading bundles & inferring what needs to be done via the
> tip/prereqs listed in the header format of the (existing, not changed
> here) bundle format.
> 
> Both have pros & cons, I started trying to summarize those, but let's
> leave that for later.

I know you were in a rush to deliver this, so I'm going to assume
that "leave that for later" was "I didn't have time to write that
here". The very high-level comparison that I gathered from our
chat were these points:

 1. My version uses the TOC and its "timestamp" heuristic to
    minimize how much data the client needs to download on a "no-op"
    fetch. Your version requires that the client downloads some
    initial range of data from each advertised URI.

 2. My version lets the TOC sit at one well-known URI that can be
    advertised independently of the origin server. I don't see an
    equivalent in yours (so far).

 3. The TOC in my version allows the server advertisement to be an "OR"
    (download from any of these locations... some might be closer to
    you than others) and yours is an "AND" (this is the list of
    bundles... you probably need all of them for a full clone). This
    difference is something that can be worked into the advertisement
    to allow both modes, if that is a valuable mechanism.

I'm also interested to see how you allow for someone to create their
own local bundle server that is independent of the origin Git server
(so the bundle-uri advertisement does not help them see the bundles).
Perhaps that's just not part of your design, so will be part of the
combined effort.

> There's also some high-level "journey" differences in the
> two. E.g. Derrick implemented the ability to have "git bundle" itself
> fetch bundles remotely, I don't have that and instead lean entirely on
> the protocol and "fetch". Those differences really aren't important,
> and we can have our cake & eat it too on that front. I.e. end up with
> some sensible intersection (or union) of the tooling.

I agree that this is something to work out later. I think it is nice
to allow the user a way to download bundles from a specific URI if
they happen to have one, but that could easily be embedded into a
'git fetch <bundle-URI>' kind of command. Please redesign this as you
see fit when combining efforts.

> I ran out of time to finish up some of what I had on this topic this
> week, but figured (especially since I'd promised to get it done this
> week) to send what I have now for discussion.
> 
> Things missing & reader's notes:
> 
>  * I had some amendments to the protocol I meant to distill further
>    into the protocol docs at [5]. Basically omitting the ability to
>    transmit key-values and to have it just be a list of URIs with an
>    optional <header> for each one, which is purely a server-to-client
>    aid (i.e. those headers will be what you'll find in the pointed-to
>    bundles).

As long as early versions of the client can ignore the extra key-value
pairs advertised by later versions without issue, it makes sense to
avoid this in early versions.

It would be nice to delay the use of advertising these headers inline
with the advertisement until more of the idea is made concrete. In
particular, the more we can strip out things in early versions that
can be applied later as an optimization, the better. I'm thinking
specifically about how your incremental fetch story will download only
the headers of the bundles to discover which ones are necessary. That
can also be used in the TOC with a timestamp heuristic to discover
that the client already has all of the information from the latest
bundle, even though the server timestamp advanced. Showing the value
to that case _plus_ the "AND" case of bundle-uri advertisements
would be a nice justification for the complication involved there.

> * This series goes up to "clone", but I also have incremental fetch
>   patches. I ran into an (easily solvable) bug in that & thought it
>   was best to omit it for now. It'll be here soon.
> 
>   Basically for incremental fetch we'll in parallel get the headers
>   for remote bundles, and then do an early abort of those downloads
>   that we deem unnecessary.

This is the main thing I was missing in our earlier discussions
(in August and October): this feature of downloading the headers for
the remote bundles is critical for allowing incremental fetch to
work in your model. It's a clever way to solve the problem.

I'm interested to see how well it performs in real-world scenarios.
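
To make the header-only idea concrete, here is a minimal sketch of
what a client could do with the first bytes of a downloaded bundle. It
is not taken from either series. It assumes the v2 bundle format
("# v2 git bundle", then "-<oid>" prerequisite lines and
"<oid> <refname>" tip lines, then a blank line before the PACK data)
and 40-hex SHA-1 OIDs. The names bundle_header_sketch,
parse_bundle_header_sketch() and MAX_OIDS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OID_HEXSZ 40 /* assumes SHA-1; a SHA-256 repo would need 64 */
#define MAX_OIDS 64  /* arbitrary cap, for the sketch only */

struct bundle_header_sketch {
	char prereqs[MAX_OIDS][OID_HEXSZ + 1];
	size_t nr_prereqs;
	char tips[MAX_OIDS][OID_HEXSZ + 1];
	size_t nr_tips;
};

static int is_hex_oid(const char *s, size_t len)
{
	size_t i;

	if (len < OID_HEXSZ)
		return 0;
	for (i = 0; i < OID_HEXSZ; i++)
		if (!((s[i] >= '0' && s[i] <= '9') ||
		      (s[i] >= 'a' && s[i] <= 'f')))
			return 0;
	return 1;
}

/*
 * Parse a v2 bundle header from buf[0..len). Returns the offset at
 * which the PACK data starts, or -1 if the header is malformed or
 * the buffer ends before the blank line (i.e. more bytes are needed).
 */
static long parse_bundle_header_sketch(const char *buf, size_t len,
				       struct bundle_header_sketch *out)
{
	static const char sig[] = "# v2 git bundle\n";
	const size_t siglen = sizeof(sig) - 1;
	size_t pos;

	assert(buf && out);
	memset(out, 0, sizeof(*out));
	if (len < siglen || memcmp(buf, sig, siglen))
		return -1;
	pos = siglen;

	while (pos < len) {
		const char *line = buf + pos;
		const char *eol = memchr(line, '\n', len - pos);
		size_t linelen;
		int is_prereq;

		if (!eol)
			return -1; /* truncated, keep downloading */
		linelen = (size_t)(eol - line);
		pos += linelen + 1;
		if (!linelen)
			return (long)pos; /* blank line ends the header */

		is_prereq = line[0] == '-';
		if (is_prereq) {
			line++;
			linelen--;
		}
		if (!is_hex_oid(line, linelen))
			return -1;
		if (is_prereq && out->nr_prereqs < MAX_OIDS)
			memcpy(out->prereqs[out->nr_prereqs++], line, OID_HEXSZ);
		else if (!is_prereq && out->nr_tips < MAX_OIDS)
			memcpy(out->tips[out->nr_tips++], line, OID_HEXSZ);
		else
			return -1; /* more OIDs than the sketch can hold */
	}
	return -1;
}
```

Once this returns a non-negative offset the client knows the bundle's
prerequisites and tips, and could abort the transfer if it decides it
does not need the PACK that follows.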

I'm imagining a way to incrementally build things from simplest to
most complicated, and it goes in this order:

 0. Implement 'git clone --bundle-uri=<X> <url>' where we expect
    a bundle at the given uri <X>.

 1. Implement 'git clone <url>' to understand a bundle-uri
    advertisement with AND (get all bundles and unbundle them in
    some order before fetching) and OR (get any _one_ of these
    full bundles) logic.

 2. Extend the bundle downloading to understand a TOC, allowing
    the OR advertisement to advertise TOC (perhaps guarded with
    some metadata in the advertisement).

 3. For the TOC model, allow 'git fetch' to update from the TOC.

 4. Extend the AND advertisements to do parallel header-only
    downloads to integrate with 'git fetch'. The implementation
    of these pieces also improves performance of the TOC model.

This is me just spitballing a way that we can make incremental
progress towards this feature without needing to go super-deep
into one model or the other before we are able to test this
example against real-world bundle server prototypes.

>   Clever (but all standard & boring) use of HTTP caching headers
>   between client & servers then allows the client to not request the
>   same thing again and again. I.e. want less server load on your CDN?
>   Just have the bundles be unique URLs and set appropriate caching
>   headers.

I was hoping that the TOC model would avoid the need for cleverness
at all, but I'm interested to see what we can do with these tricks
in all cases.
 
> * A problem with this implementation (and Derrick's, I believe) is
>   that it keeps a server waiting while we twiddle our thumbs and wget
>   (or equivalent) the bundle(s) locally. If you e.g. clone
>   "chromium.git" the server will get tired of waiting, drop the
>   connection, and unless the bundle is 100% up-to-date the "clone"
>   will fail.

This is absolutely the case with my implementation. I call it out
with comments, but I didn't have a solution in mind other than
"disconnect if necessary, then reconnect."
 
>   The solution to this is to get the bundle headers in parallel, and
>   as soon as we've got them present the OIDs in the headers as "HAVE"
>   to the server, which'll then send us an incremental PACK (and even
>   possibly a packfile-uri) for the difference between those bundle(s)
>   and what its tips are.
> 
>   We can then simply disconnect, download/index-pack everything, and
>   do a connectivity check at the end.

This is a clever idea and is absolutely something that can be done
as a later step (even after step 4 from my outline above).
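
For illustration, presenting a bundle tip as a "have" comes down to
framing "have <oid>" as a pkt-line: four hex digits giving the total
length (including those four digits), then the payload. The helper
below is a hypothetical sketch and not part of either series. Its name
format_have_pkt_line() is made up, and the 65520-byte ceiling is my
understanding of the pkt-line maximum.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "have <oid>\n" as a pkt-line into out[0..outsz). Returns the
 * number of bytes written (excluding the trailing NUL), or -1 if the
 * line would not fit.
 */
static int format_have_pkt_line(const char *oid_hex, char *out, size_t outsz)
{
	size_t payload, total;
	int n;

	assert(oid_hex && out);
	payload = strlen("have ") + strlen(oid_hex) + 1; /* +1 for LF */
	total = payload + 4; /* the length prefix counts itself */
	if (total > 65520 || outsz < total + 1)
		return -1;
	n = snprintf(out, outsz, "%04x" "have %s\n", (unsigned)total, oid_hex);
	return n == (int)total ? n : -1;
}
```

With a 40-hex OID the payload is 46 bytes, so the prefix is "0032".
The OIDs fed to it would be the tips read from the bundle headers,
possibly before the objects themselves are in the object store.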

>   This requires some careful error handling, e.g. if the resulting
>   repo isn't valid we'll need to retry, and the current negotiation
>   API isn't very happy to negotiate on the basis of OIDs we don't have
>   in the object store (but the protocol handles it just fine).

This is exactly why we shouldn't over-index on that idea too early,
but definitely keep it in our back pockets for a future improvement.

Thanks for the detailed cover letter. I hope to look more closely
at the patches themselves for feedback early next week.

Thanks,
-Stolee


* [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format
  2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
                     ` (13 preceding siblings ...)
  2022-03-11 21:28   ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Derrick Stolee
@ 2022-04-18 17:23   ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
                       ` (36 more replies)
  14 siblings, 37 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

This RFC series is a start at trying to combine the two differing RFC
versions of bundle URIs I [1] and Derrick Stolee [2] were kicking
around.

= Layout

This series is arranged in the following way:

* 01-08: "Prep" patches from both [1] and [2] which in principle could
  graduate first to "master".

  I.e. they're prep fixes added for the two bundle-uri
  implementations, but which either justify themselves, or e.g. expose
  a now-static function via an API.

  I tried to move things into the "justify themselves" category
  whenever possible, but may have overdone it e.g. for 02/36
  (originally an idea/commit of Derrick's, but I changed the
  authorship as pretty much all of it at this point is something I
  changed).

  For the "prep" changes that are only needed for later changes in the
  series perhaps we should just squash them if they're small enough.

* 09-16: My RFC series at [1], minus things extracted to the below and
  docs (see below).

* 17-33: Derrick's RFC series at [2], minus <ditto above>

* 34-36: I peeled off the Documentation/technical/bundle-uri.txt from
  both to put at the end, and renamed Derrick's to
  "bundle-uri-TOC.txt" so most of the change wouldn't be a massive
  diff.

  These obviously need to be unified too, but I figured doing so was
  better once we shake out what features/interfaces we want to keep.

= Overall state

Derrick: Sorry about the long delay in submitting this. I spent a lot
of time trying to get further in semantically merging the two in terms
of getting some sensible user-usable end-result, but ultimately
wasn't happy with how opinionated that result was without looping you
in earlier.

So I figured that a better start was a version of this where we
could test the two against one another, see where feature
differences/parity etc. were, and depending on those discussions
eventually unify the config, tests, features etc. for the two.

= About the range-diff

The range-diff was produced by rebasing my [1] on "master", fixing
conflicts, then rebasing a version of Derrick's [2] on it where I'd
munged it in some minimal way to reduce conflicts between the two, but
without "really" changing anything (mainly moving added functions to
slightly different places in the same files to reduce textual
conflicts).

When in doubt, don't trust the range-diff. I figured it was better than
not including one at all, and it does a good job of pointing towards
the main areas of difference vs. [1] and [2]. But it's not "real",
and will e.g. omit changes I made while getting to the point where I
could run the range-diff at all...

= Outstanding issues

Aside from the large issue of needing to more sensibly combine these
two, there are:

 * A CI failure on Windows: https://github.com/avar/git/actions/runs/2184857264

   This is in my [1], the implementation currently shells out to
   "curl", which is failing somehow. Note that the tests are "failing"
   on e.g. linux-musl too in that there's no /usr/bin/curl there, but
   in a way where they'll recover and fall back to a non-bundle-uri
   clone. So the same tests are testing graceful recovery elsewhere.

 * I didn't include the "incremental" fetch in my version. As noted in
   "overall state" I figured starting with a smaller version in this
   already-huge 33-patch series was better, and while I have that
   locally adding it is even more code...

1. https://lore.kernel.org/git/RFC-cover-v2-00.13-00000000000-20220311T155841Z-avarab@gmail.com/
2. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com/

Derrick Stolee (21):
  http: make http_get_file() external
  remote: move relative_url()
  remote: allow relative_url() to return an absolute url
  remote-curl: add 'get' capability
  bundle: implement 'fetch' command for direct bundles
  bundle: parse table of contents during 'fetch'
  bundle: add --filter option to 'fetch'
  bundle: allow relative URLs in table of contents
  bundle: make it easy to call 'git bundle fetch'
  clone: add --bundle-uri option
  clone: --bundle-uri cannot be combined with --depth
  bundle: only fetch bundles if timestamp is new
  fetch: fetch bundles before fetching original data
  protocol-caps: implement cap_features()
  serve: understand but do not advertise 'features' capability
  serve: advertise 'features' when config exists
  connect: implement get_recommended_features()
  transport: add connections for 'features' capability
  clone: use server-recommended bundle URI
  t5601: basic bundle URI test
  docs: document bundle URI standard

Ævar Arnfjörð Bjarmason (15):
  connect.c: refactor sending of agent & object-format
  dir API: add a generalized path_match_flags() function
  fetch-pack: add a deref_without_lazy_fetch_extended()
  fetch-pack: move --keep=* option filling to a function
  bundle.h: make "fd" version of read_bundle_header() public
  protocol v2: add server-side "bundle-uri" skeleton
  bundle-uri client: add "bundle-uri" parsing + tests
  bundle-uri client: add minimal NOOP client
  bundle-uri client: add "git ls-remote-bundle-uri"
  bundle-uri client: add transfer.injectBundleURI support
  bundle-uri client: add boolean transfer.bundleURI setting
  bundle-uri client: support for bundle-uri with "clone"
  bundle-uri: make the download program configurable
  protocol v2: add server-side "bundle-uri" skeleton (docs)
  bundle-uri docs: add design notes

 Documentation/config/transfer.txt          |  33 ++
 Documentation/git-bundle.txt               |   1 +
 Documentation/git-ls-remote-bundle-uri.txt |  62 +++
 Documentation/git-ls-remote.txt            |   1 +
 Documentation/gitremote-helpers.txt        |   6 +
 Documentation/technical/bundle-uri-TOC.txt | 404 +++++++++++++++++
 Documentation/technical/bundle-uri.txt     | 119 +++++
 Documentation/technical/protocol-v2.txt    | 214 +++++++++
 Makefile                                   |   3 +
 builtin.h                                  |   1 +
 builtin/bundle.c                           | 481 +++++++++++++++++++++
 builtin/clone.c                            |  57 +++
 builtin/fetch.c                            |  17 +
 builtin/ls-remote-bundle-uri.c             |  90 ++++
 builtin/submodule--helper.c                | 141 +-----
 bundle-uri.c                               | 183 ++++++++
 bundle-uri.h                               |  29 ++
 bundle.c                                   |  29 +-
 bundle.h                                   |  11 +
 command-list.txt                           |   1 +
 compat/mingw.c                             |   2 +-
 compat/win32/path-utils.h                  |   6 +-
 connect.c                                  | 116 ++++-
 dir.c                                      |  29 ++
 dir.h                                      |  63 +++
 fetch-pack.c                               | 306 ++++++++++++-
 fetch-pack.h                               |   6 +
 fsck.c                                     |  23 +-
 git-compat-util.h                          |   8 +-
 git.c                                      |   1 +
 http.c                                     |   4 +-
 http.h                                     |   9 +
 path.c                                     |   2 +-
 protocol-caps.c                            |  66 +++
 protocol-caps.h                            |   1 +
 remote-curl.c                              |  32 ++
 remote.c                                   |  99 +++++
 remote.h                                   |  40 ++
 serve.c                                    |  29 ++
 submodule-config.c                         |   6 +-
 t/helper/test-bundle-uri.c                 |  83 ++++
 t/helper/test-tool.c                       |   1 +
 t/helper/test-tool.h                       |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 458 ++++++++++++++++++++
 t/t5601-clone.sh                           |  15 +
 t/t5701-git-serve.sh                       | 133 +++++-
 t/t5730-protocol-v2-bundle-uri-file.sh     |  36 ++
 t/t5731-protocol-v2-bundle-uri-git.sh      |  17 +
 t/t5732-protocol-v2-bundle-uri-http.sh     |  17 +
 t/t5750-bundle-uri-parse.sh                | 153 +++++++
 transport-helper.c                         |  26 ++
 transport-internal.h                       |  16 +
 transport.c                                | 158 +++++++
 transport.h                                |  27 ++
 54 files changed, 3680 insertions(+), 192 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 Documentation/technical/bundle-uri-TOC.txt
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh
 create mode 100755 t/t5750-bundle-uri-parse.sh

Range-diff against v1:
 4:  034d371472e =  1:  95c53a3e779 connect.c: refactor sending of agent & object-format
15:  02563939040 !  2:  8f6e4f12e8a dir: extract starts_with_dot[_dot]_slash()
    @@
      ## Metadata ##
    -Author: Derrick Stolee <derrickstolee@github.com>
    +Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## Commit message ##
    -    dir: extract starts_with_dot[_dot]_slash()
    +    dir API: add a generalized path_match_flags() function
     
    -    We will want to use this logic to assist checking if paths are absolute
    -    or relative, so extract it into a helpful place. This creates a
    -    collision with similar methods in builtin/fsck.c, but those methods have
    -    important differences. Prepend "fsck_" to those methods to emphasize
    -    that they are custom to the fsck builtin.
    +    Add a path_match_flags() function and have the two sets of
    +    starts_with_dot_{,dot_}slash() functions added in
    +    63e95beb085 (submodule: port resolve_relative_url from shell to C,
    +    2016-04-15) and a2b26ffb1a8 (fsck: convert gitmodules url to URL
    +    passed to curl, 2020-04-18) be thin wrappers for it.
    +
    +    As the latter of those notes the fsck version was copied from the
    +    initial builtin/submodule--helper.c version.
    +
    +    Since the code added in a2b26ffb1a8 was doing really doing the same as
    +    win32_is_dir_sep() added in 1cadad6f658 (git clone <url>
    +    C:\cygwin\home\USER\repo' is working (again), 2018-12-15) let's move
    +    the latter to git-compat-util.h is a is_xplatform_dir_sep(). We can
    +    then call either it or the platform-specific is_dir_sep() from this
    +    new function.
    +
    +    Let's likewise change code in various other places that was hardcoding
    +    checks for "'/' || '\\'" with the new is_xplatform_dir_sep(). As can
    +    be seen in those callers some of them still concern themselves with
    +    ':' (Mac OS classic?), but let's leave the question of whether that
    +    should be consolidated for some other time.
    +
    +    As we expect to make wider use of the "native" case in the future,
    +    define and use two starts_with_dot_{,dot_}slash_native() convenience
    +    wrappers. This makes the diff in builtin/submodule--helper.c much
    +    smaller.
     
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## builtin/submodule--helper.c ##
     @@ builtin/submodule--helper.c: static char *get_default_remote(void)
    @@ builtin/submodule--helper.c: static char *get_default_remote(void)
      /*
       * Returns 1 if it was the last chop before ':'.
       */
    +@@ builtin/submodule--helper.c: static int chop_last_dir(char **remoteurl, int is_relative)
    + 	return 0;
    + }
    + 
    ++static int starts_with_dot_slash(const char *const path)
    ++{
    ++	return starts_with_dot_slash_native(path);;
    ++}
    ++
    ++static int starts_with_dot_dot_slash(const char *const path)
    ++{
    ++	return starts_with_dot_dot_slash_native(path);
    ++}
    ++
    + /*
    +  * The `url` argument is the URL that navigates to the submodule origin
    +  * repo. When relative, this URL is relative to the superproject origin
    +
    + ## compat/mingw.c ##
    +@@ compat/mingw.c: int is_valid_win32_path(const char *path, int allow_literal_nul)
    + 			}
    + 
    + 			c = path[i];
    +-			if (c && c != '.' && c != ':' && c != '/' && c != '\\')
    ++			if (c && c != '.' && c != ':' && !is_xplatform_dir_sep(c))
    + 				goto not_a_reserved_name;
    + 
    + 			/* contains reserved name */
    +
    + ## compat/win32/path-utils.h ##
    +@@ compat/win32/path-utils.h: int win32_has_dos_drive_prefix(const char *path);
    + 
    + int win32_skip_dos_drive_prefix(char **path);
    + #define skip_dos_drive_prefix win32_skip_dos_drive_prefix
    +-static inline int win32_is_dir_sep(int c)
    +-{
    +-	return c == '/' || c == '\\';
    +-}
    +-#define is_dir_sep win32_is_dir_sep
    ++#define is_dir_sep is_xplatform_dir_sep
    + static inline char *win32_find_last_dir_sep(const char *path)
    + {
    + 	char *ret = NULL;
    +
    + ## dir.c ##
    +@@ dir.c: void relocate_gitdir(const char *path, const char *old_git_dir, const char *new_
    + 
    + 	connect_work_tree_and_git_dir(path, new_git_dir, 0);
    + }
    ++
    ++int path_match_flags(const char *const str, const enum path_match_flags flags)
    ++{
    ++	const char *p = str;
    ++
    ++	if (flags & PATH_MATCH_NATIVE &&
    ++	    flags & PATH_MATCH_XPLATFORM)
    ++		BUG("path_match_flags() must get one match kind, not multiple!");
    ++	else if (!(flags & PATH_MATCH_KINDS_MASK))
    ++		BUG("path_match_flags() must get at least one match kind!");
    ++
    ++	if (flags & PATH_MATCH_STARTS_WITH_DOT_SLASH &&
    ++	    flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH)
    ++		BUG("path_match_flags() must get one platform kind, not multiple!");
    ++	else if (!(flags & PATH_MATCH_PLATFORM_MASK))
    ++		BUG("path_match_flags() must get at least one platform kind!");
    ++
    ++	if (*p++ != '.')
    ++		return 0;
    ++	if (flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH &&
    ++	    *p++ != '.')
    ++		return 0;
    ++
    ++	if (flags & PATH_MATCH_NATIVE)
    ++		return is_dir_sep(*p);
    ++	else if (flags & PATH_MATCH_XPLATFORM)
    ++		return is_xplatform_dir_sep(*p);
    ++	BUG("unreachable");
    ++}
     
      ## dir.h ##
     @@ dir.h: void connect_work_tree_and_git_dir(const char *work_tree,
    @@ dir.h: void connect_work_tree_and_git_dir(const char *work_tree,
      		     const char *old_git_dir,
      		     const char *new_git_dir);
     +
    -+static inline int starts_with_dot_slash(const char *str)
    ++/**
    ++ * The "enum path_matches_kind" determines how path_match_flags() will
    ++ * behave. The flags come in sets, and one (and only one) must be
    ++ * provided out of each "set":
    ++ *
    ++ * PATH_MATCH_NATIVE:
    ++ *	Path separator is is_dir_sep()
    ++ * PATH_MATCH_XPLATFORM:
    ++ *	Path separator is is_xplatform_dir_sep()
    ++ *
    ++ * Do we use is_dir_sep() to check for a directory separator
    ++ * (*_NATIVE), or do we always check for '/' or '\' (*_XPLATFORM). The
    ++ * "*_NATIVE" version on Windows is the same as "*_XPLATFORM",
    ++ * everywhere else "*_NATIVE" means "only /".
    ++ *
    ++ * PATH_MATCH_STARTS_WITH_DOT_SLASH:
    ++ *	Match a path starting with "./"
    ++ * PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH:
    ++ *	Match a path starting with "../"
    ++ *
    ++ * The "/" in the above is adjusted based on the "*_NATIVE" and
    ++ * "*_XPLATFORM" flags.
    ++ */
    ++enum path_match_flags {
    ++	PATH_MATCH_NATIVE = 1 << 0,
    ++	PATH_MATCH_XPLATFORM = 1 << 1,
    ++	PATH_MATCH_STARTS_WITH_DOT_SLASH = 1 << 2,
    ++	PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH = 1 << 3,
    ++};
    ++#define PATH_MATCH_KINDS_MASK (PATH_MATCH_STARTS_WITH_DOT_SLASH | \
    ++	PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH)
    ++#define PATH_MATCH_PLATFORM_MASK (PATH_MATCH_NATIVE | PATH_MATCH_XPLATFORM)
    ++
    ++/**
    ++ * path_match_flags() checks if a given "path" matches a given "enum
    ++ * path_match_flags" criteria.
    ++ */
    ++int path_match_flags(const char *const path, const enum path_match_flags f);
    ++
    ++/**
    ++ * starts_with_dot_slash_native(): convenience wrapper for
    ++ * path_match_flags() with PATH_MATCH_STARTS_WITH_DOT_SLASH and
    ++ * PATH_MATCH_NATIVE.
    ++ */
    ++static inline int starts_with_dot_slash_native(const char *const path)
     +{
    -+	return str[0] == '.' && is_dir_sep(str[1]);
    ++	const enum path_match_flags what = PATH_MATCH_STARTS_WITH_DOT_SLASH;
    ++
    ++	return path_match_flags(path, what | PATH_MATCH_NATIVE);
     +}
     +
    -+static inline int starts_with_dot_dot_slash(const char *str)
    ++/**
    ++ * starts_with_dot_slash_native(): convenience wrapper for
    ++ * path_match_flags() with PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH and
    ++ * PATH_MATCH_NATIVE.
    ++ */
    ++static inline int starts_with_dot_dot_slash_native(const char *const path)
     +{
    -+	return str[0] == '.' && str[1] == '.' && is_dir_sep(str[2]);
    -+}
    ++	const enum path_match_flags what = PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH;
     +
    ++	return path_match_flags(path, what | PATH_MATCH_NATIVE);
    ++}
      #endif
     
      ## fsck.c ##
     @@ fsck.c: int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
    + 	return ret;
      }
      
    - /*
    +-/*
     - * Like builtin/submodule--helper.c's starts_with_dot_slash, but without
    -+ * Like dir.h's starts_with_dot_slash, but without
    -  * relying on the platform-dependent is_dir_sep helper.
    -  *
    -  * This is for use in checking whether a submodule URL is interpreted as
    -  * relative to the current directory on any platform, since \ is a
    -  * directory separator on Windows but not on other platforms.
    -  */
    +- * relying on the platform-dependent is_dir_sep helper.
    +- *
    +- * This is for use in checking whether a submodule URL is interpreted as
    +- * relative to the current directory on any platform, since \ is a
    +- * directory separator on Windows but not on other platforms.
    +- */
     -static int starts_with_dot_slash(const char *str)
    -+static int fsck_starts_with_dot_slash(const char *str)
    ++static int starts_with_dot_slash(const char *const path)
      {
    - 	return str[0] == '.' && (str[1] == '/' || str[1] == '\\');
    +-	return str[0] == '.' && (str[1] == '/' || str[1] == '\\');
    ++	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_SLASH |
    ++				PATH_MATCH_XPLATFORM);
      }
      
    - /*
    +-/*
     - * Like starts_with_dot_slash, this is a variant of submodule--helper's
     - * helper of the same name with the twist that it accepts backslash as a
    -+ * Like fsck_starts_with_dot_slash, this is a variant of dir.h's
    -+ * helper with the twist that it accepts backslash as a
    -  * directory separator even on non-Windows platforms.
    -  */
    +- * directory separator even on non-Windows platforms.
    +- */
     -static int starts_with_dot_dot_slash(const char *str)
    -+static int fsck_starts_with_dot_dot_slash(const char *str)
    ++static int starts_with_dot_dot_slash(const char *const path)
      {
     -	return str[0] == '.' && starts_with_dot_slash(str + 1);
    -+	return str[0] == '.' && fsck_starts_with_dot_slash(str + 1);
    ++	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH |
    ++				PATH_MATCH_XPLATFORM);
      }
      
      static int submodule_url_is_relative(const char *url)
    +
    + ## git-compat-util.h ##
    +@@
    + #include <sys/sysctl.h>
    + #endif
    + 
    ++/* Used by compat/win32/path-utils.h, and more */
    ++static inline int is_xplatform_dir_sep(int c)
    ++{
    ++	return c == '/' || c == '\\';
    ++}
    ++
    + #if defined(__CYGWIN__)
    + #include "compat/win32/path-utils.h"
    + #endif
    +@@ git-compat-util.h: static inline int git_skip_dos_drive_prefix(char **path)
    + #define skip_dos_drive_prefix git_skip_dos_drive_prefix
    + #endif
    + 
    +-#ifndef is_dir_sep
    + static inline int git_is_dir_sep(int c)
      {
    --	return starts_with_dot_slash(url) || starts_with_dot_dot_slash(url);
    -+	return fsck_starts_with_dot_slash(url) || fsck_starts_with_dot_dot_slash(url);
    + 	return c == '/';
      }
    ++#ifndef is_dir_sep
    + #define is_dir_sep git_is_dir_sep
    + #endif
      
    - /*
    +
    + ## path.c ##
    +@@ path.c: int is_ntfs_dotgit(const char *name)
    + 
    + 	for (;;) {
    + 		c = *(name++);
    +-		if (!c || c == '\\' || c == '/' || c == ':')
    ++		if (!c || is_xplatform_dir_sep(c) || c == ':')
    + 			return 1;
    + 		if (c != '.' && c != ' ')
    + 			return 0;
    +
    + ## submodule-config.c ##
    +@@ submodule-config.c: int check_submodule_name(const char *name)
    + 		return -1;
    + 
    + 	/*
    +-	 * Look for '..' as a path component. Check both '/' and '\\' as
    ++	 * Look for '..' as a path component. Check is_xplatform_dir_sep() as
    + 	 * separators rather than is_dir_sep(), because we want the name rules
    + 	 * to be consistent across platforms.
    + 	 */
    + 	goto in_component; /* always start inside component */
    + 	while (*name) {
    + 		char c = *name++;
    +-		if (c == '/' || c == '\\') {
    ++		if (is_xplatform_dir_sep(c)) {
    + in_component:
    + 			if (name[0] == '.' && name[1] == '.' &&
    +-			    (!name[2] || name[2] == '/' || name[2] == '\\'))
    ++			    (!name[2] || is_xplatform_dir_sep(name[2])))
    + 				return -1;
    + 		}
    + 	}
 9:  e93306308f9 =  3:  7823a177fd7 fetch-pack: add a deref_without_lazy_fetch_extended()
10:  d7633791083 =  4:  0315bda0dac fetch-pack: move --keep=* option filling to a function
18:  0bfc59ad308 =  5:  6e1f4296896 http: make http_get_file() external
16:  eea2816bc8f !  6:  97a9f38f08d remote: move relative_url()
    @@ Commit message
         similar functionality in the bundle URI feature, extract this to be
         available in remote.h.
     
    -    The code is exactly the same. The prototype is different only in
    -    whitespace. The documentation comment only adds explicit instructions on
    -    what happens when supplying two absolute URLs.
    +    The code is almost exactly the same, except for the following trivial
    +    differences:
    +
    +     * Fix whitespace and wrapping issues with the prototype and argument
    +       lists.
    +
    +     * Let's call starts_with_dot_{,dot_}slash_native() instead of the
    +       functionally identical "starts_with_dot_{,dot_}slash()" wrappers
    +       "builtin/submodule--helper.c".
     
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## builtin/submodule--helper.c ##
     @@ builtin/submodule--helper.c: static char *get_default_remote(void)
    @@ builtin/submodule--helper.c: static char *get_default_remote(void)
     -	return 0;
     -}
     -
    +-static int starts_with_dot_slash(const char *const path)
    +-{
    +-	return starts_with_dot_slash_native(path);;
    +-}
    +-
    +-static int starts_with_dot_dot_slash(const char *const path)
    +-{
    +-	return starts_with_dot_dot_slash_native(path);
    +-}
    +-
     -/*
     - * The `url` argument is the URL that navigates to the submodule origin
     - * repo. When relative, this URL is relative to the superproject origin
    @@ builtin/submodule--helper.c: static char *get_default_remote(void)
      static char *resolve_relative_url(const char *rel_url, const char *up_path, int quiet)
      {
      	char *remoteurl, *resolved_url;
    +@@ builtin/submodule--helper.c: static int module_foreach(int argc, const char **argv, const char *prefix)
    + 	return 0;
    + }
    + 
    ++static int starts_with_dot_slash(const char *const path)
    ++{
    ++	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_SLASH |
    ++				PATH_MATCH_XPLATFORM);
    ++}
    ++
    ++static int starts_with_dot_dot_slash(const char *const path)
    ++{
    ++	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH |
    ++				PATH_MATCH_XPLATFORM);
    ++}
    ++
    + struct init_cb {
    + 	const char *prefix;
    + 	const char *superprefix;
     
      ## remote.c ##
     @@
    @@ remote.c: void remote_state_clear(struct remote_state *remote_state)
     +	return 0;
     +}
     +
    -+/*
    -+ * NEEDSWORK: Given how chop_last_dir() works, this function is broken
    -+ * when a local part has a colon in its path component, too.
    -+ */
    -+char *relative_url(const char *remote_url,
    -+		   const char *url,
    ++char *relative_url(const char *remote_url, const char *url,
     +		   const char *up_path)
     +{
     +	int is_relative = 0;
    @@ remote.c: void remote_state_clear(struct remote_state *remote_state)
     +		 * Prepend a './' to ensure all relative
     +		 * remoteurls start with './' or '../'
     +		 */
    -+		if (!starts_with_dot_slash(remoteurl) &&
    -+		    !starts_with_dot_dot_slash(remoteurl)) {
    ++		if (!starts_with_dot_slash_native(remoteurl) &&
    ++		    !starts_with_dot_dot_slash_native(remoteurl)) {
     +			strbuf_reset(&sb);
     +			strbuf_addf(&sb, "./%s", remoteurl);
     +			free(remoteurl);
    @@ remote.c: void remote_state_clear(struct remote_state *remote_state)
     +	 * last directory in remoteurl.
     +	 */
     +	while (url) {
    -+		if (starts_with_dot_dot_slash(url)) {
    ++		if (starts_with_dot_dot_slash_native(url)) {
     +			url += 3;
     +			colonsep |= chop_last_dir(&remoteurl, is_relative);
    -+		} else if (starts_with_dot_slash(url))
    ++		} else if (starts_with_dot_slash_native(url))
     +			url += 2;
     +		else
     +			break;
    @@ remote.c: void remote_state_clear(struct remote_state *remote_state)
     +		strbuf_setlen(&sb, sb.len - 1);
     +	free(remoteurl);
     +
    -+	if (starts_with_dot_slash(sb.buf))
    ++	if (starts_with_dot_slash_native(sb.buf))
     +		out = xstrdup(sb.buf + 2);
     +	else
     +		out = xstrdup(sb.buf);
    @@ remote.h: int parseopt_push_cas_option(const struct option *, const char *arg, i
     + * http://a.com/b  ../../../c       http:/c          error out
     + * http://a.com/b  ../../../../c    http:c           error out
     + * http://a.com/b  ../../../../../c    .:c           error out
    ++ * NEEDSWORK: Given how chop_last_dir() works, this function is broken
    ++ * when a local part has a colon in its path component, too.
     + */
    -+char *relative_url(const char *remote_url,
    -+		   const char *url,
    ++char *relative_url(const char *remote_url, const char *url,
     +		   const char *up_path);
     +
      #endif
17:  68b10e64382 !  7:  2917cdd8277 remote: allow relative_url() to return an absolute url
    @@ Commit message
         concatenate 'remote_url' with 'url'. Instead, we want to return 'url' in
         this case.
     
    +    The documentation now discusses what happens when supplying two
    +    absolute URLs.
    +
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## remote.c ##
    -@@ remote.c: char *relative_url(const char *remote_url,
    +@@ remote.c: char *relative_url(const char *remote_url, const char *url,
      	int is_relative = 0;
      	int colonsep = 0;
      	char *out;
    @@ remote.h: void apply_push_cas(struct push_cas_option *, struct remote *, struct
       * http://a.com/b  ../../../../c    http:c           error out
       * http://a.com/b  ../../../../../c    .:c           error out
     + * http://a.com/b  http://d.org/e   http://d.org/e   as is
    +  * NEEDSWORK: Given how chop_last_dir() works, this function is broken
    +  * when a local part has a colon in its path component, too.
       */
    - char *relative_url(const char *remote_url,
    - 		   const char *url,
11:  1f5a48c712c =  8:  2b236af147b bundle.h: make "fd" version of read_bundle_header() public
 1:  3875bf2a294 !  9:  1496b89ea6a protocol v2: add server-side "bundle-uri" skeleton
    @@ Commit message
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    - ## Documentation/technical/protocol-v2.txt ##
    -@@ Documentation/technical/protocol-v2.txt: and associated requested information, each separated by a single space.
    - 	attr = "size"
    - 
    - 	obj-info = obj-id SP obj-size
    -+
    -+bundle-uri
    -+~~~~~~~~~~
    -+
    -+If the 'bundle-uri' capability is advertised, the server supports the
    -+`bundle-uri' command.
    -+
    -+The capability is currently advertised with no value (i.e. not
    -+"bundle-uri=somevalue"), a value may be added in the future for
    -+supporting command-wide extensions. Clients MUST ignore any unknown
    -+capability values and proceed with the 'bundle-uri` dialog they
    -+support.
    -+
    -+The 'bundle-uri' command is intended to be issued before `fetch` to
    -+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
    -+inform the subsequent `fetch` command.
    -+
    -+The client CAN issue `bundle-uri` before or after any other valid
    -+command. To be useful to clients it's expected that it'll be issued
    -+after an `ls-refs` and before `fetch`, but CAN be issued at any time
    -+in the dialog.
    -+
    -+DISCUSSION of bundle-uri
    -+^^^^^^^^^^^^^^^^^^^^^^^^
    -+
    -+The intent of the feature is optimize for server resource consumption
    -+in the common case by changing the common case of fetching a very
    -+large PACK during linkgit:git-clone[1] into a smaller incremental
    -+fetch.
    -+
    -+It also allows servers to achieve better caching in combination with
    -+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
    -+
    -+By having new clones or fetches be a more predictable and common
    -+negotiation against the tips of recently produces *.bundle file(s).
    -+Servers might even pre-generate the results of such negotiations for
    -+the `uploadpack.packObjectsHook` as new pushes come in.
    -+
    -+I.e. the server would anticipate that fresh clones will download a
    -+known bundle, followed by catching up to the current state of the
    -+repository using ref tips found in that bundle (or bundles).
    -+
    -+PROTOCOL for bundle-uri
    -+^^^^^^^^^^^^^^^^^^^^^^^
    -+
    -+A `bundle-uri` request takes no arguments, and as noted above does not
    -+currently advertise a capability value. Both may be added in the
    -+future.
    -+
    -+When the client issues a `command=bundle-uri` the response is a list
    -+of URIs the server would like the client to fetch out-of-bounds before
    -+proceeding with the `fetch` request in this format:
    -+
    -+	output = bundle-uri-line
    -+		 bundle-uri-line* flush-pkt
    -+
    -+	bundle-uri-line = PKT-LINE(bundle-uri)
    -+			  *(SP bundle-feature-key *(=bundle-feature-val))
    -+			  LF
    -+
    -+	bundle-uri = A URI such as a https://, ssh:// etc. URI
    -+
    -+	bundle-feature-key = Any printable ASCII characters except SP or "="
    -+	bundle-feature-val = Any printable ASCII characters except SP or "="
    -+
    -+No `bundle-feature-key`=`bundle-feature-value` fields are currently
    -+defined. See the discussion of features below.
    -+
    -+Clients are still expected to fully parse the line according to the
    -+above format, lines that do not conform to the format SHOULD be
    -+discarded. The user MAY be warned in such a case.
    -+
    -+bundle-uri CLIENT AND SERVER EXPECTATIONS
    -+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    -+
    -+".bundle" FORMAT
    -+++++++++++++++++
    -+
    -+The advertised bundle(s) MUST be in a format that "git bundle verify"
    -+would accept. I.e. they MUST contain one or more reference tips for
    -+use by the client, MUST indicate prerequisites (in any) with standard
    -+"-" prefixes, and MUST indicate their "object-format", if
    -+applicable. Create "*.bundle" files with "git bundle create".
    -+
    -+bundle-uri CLIENT ERROR RECOVERY
    -+++++++++++++++++++++++++++++++++
    -+
    -+A client MUST above all gracefully degrade on errors, whether that
    -+error is because of bad missing/data in the bundle URI(s), because
    -+that client is too dumb to e.g. understand and fully parse out bundle
    -+headers and their prerequisite relationships, or something else.
    -+
    -+Server operators should feel confident in turning on "bundle-uri" and
    -+not worry if e.g. their CDN goes down that clones or fetches will run
    -+into hard failures. Even if the server bundle bundle(s) are
    -+incomplete, or bad in some way the client should still end up with a
    -+functioning repository, just as if it had chosen not to use this
    -+protocol extension.
    -+
    -+All subsequent discussion on client and server interaction MUST keep
    -+this in mind.
    -+
    -+bundle-uri SERVER TO CLIENT
    -++++++++++++++++++++++++++++
    -+
    -+The ordering of the returned bundle uris is not significant. Clients
    -+MUST parse their headers to discover their contained OIDS and
    -+prerequisites. A client MUST consider the content of the bundle(s)
    -+themselves and their header as the ultimate source of truth.
    -+
    -+A server MAY even return bundle(s) that don't have any direct
    -+relationship to the repository being cloned (either through accident,
    -+or intentional "clever" configuration), and expect a client to sort
    -+out what data they'd like from the bundle(s), if any.
    -+
    -+bundle-uri CLIENT TO SERVER
    -++++++++++++++++++++++++++++
    -+
    -+The client SHOULD provide reference tips found in the bundle header(s)
    -+as 'have' lines in any subsequent `fetch` request. A client MAY also
    -+ignore the bundle(s) entirely if doing so is deemed worse for some
    -+reason, e.g. if the bundles can't be downloaded, it doesn't like the
    -+tips it finds etc.
    -+
    -+WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
    -+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    -+
    -+If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
    -+of the bundle(s) the client finds that the ref tips it wants can be
    -+retrieved entirety from advertised bundle(s), it MAY disconnect. The
    -+results of such a 'clone' or 'fetch' should be indistinguishable from
    -+the state attained without using bundle-uri.
    -+
    -+EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
    -+++++++++++++++++++++++++++++++++++++++++++++++
    -+
    -+A client MAY perform an early disconnect while still downloading the
    -+bundle(s) (having streamed and parsed their headers). In such a case
    -+the client MUST gracefully recover from any errors related to
    -+finishing the download and validation of the bundle(s).
    -+
    -+I.e. a client might need to re-connect and issue a 'fetch' command,
    -+and possibly fall back to not making use of 'bundle-uri' at all.
    -+
    -+This "MAY" behavior is specified as such (and not a "SHOULD") on the
    -+assumption that a server advertising bundle uris is more likely than
    -+not to be serving up a relatively large repository, and to be pointing
    -+to URIs that have a good chance of being in working order. A client
    -+MAY e.g. look at the payload size of the bundles as a heuristic to see
    -+if an early disconnect is worth it, should falling back on a full
    -+"fetch" dialog be necessary.
    -+
    -+WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION
    -++++++++++++++++++++++++++++++++++++++++++++++++++++++
    -+
    -+A client SHOULD commence a negotiation of a PACK from the server via
    -+the "fetch" command using the OID tips found in advertised bundles,
    -+even if's still in the process of downloading those bundle(s).
    -+
    -+This allows for aggressive early disconnects from any interactive
    -+server dialog. The client blindly trusts that the advertised OID tips
    -+are relevant, and issues them as 'have' lines, it then requests any
    -+tips it would like (usually from the "ls-refs" advertisement) via
    -+'want' lines. The server will then compute a (hopefully small) PACK
    -+with the expected difference between the tips from the bundle(s) and
    -+the data requested.
    -+
    -+The only connection the client then needs to keep active is to the
    -+concurrently downloading static bundle(s), when those and the
    -+incremental PACK are retrieved they should be inflated and
    -+validated. Any errors at this point should be gracefully recovered
    -+from, see above.
    -+
    -+bundle-uri PROTOCOL FEATURES
    -+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    -+
    -+As noted above no `bundle-feature-key`=`bundle-feature-value` fields
    -+are currently defined.
    -+
    -+They are intended for future per-URI metadata which older clients MUST
    -+ignore and gracefully degrade on. Any fields they do recognize they
    -+CAN also ignore.
    -+
    -+Any backwards-incompatible addition of pre-URI key-value will be
    -+guarded by a new value or values in 'bundle-uri' capability
    -+advertisement itself, and/or by new future `bundle-uri` request
    -+arguments.
    -+
    -+While no per-URI key-value are currently supported currently they're
    -+intended to support future features such as:
    -+
    -+ * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
    -+   size of the bundle file.
    -+
    -+ * Advertise that one or more bundle files are the same (to e.g. have
    -+   clients round-robin or otherwise choose one of N possible files).
    -+
    -+ * A "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For
    -+   expressing the common case of a bundle with one tip and no
    -+   prerequisites, or one tip and one prerequisite.
    -++
    -+This would allow for optimizing the common case of servers who'd like
    -+to provide one "big bundle" containing only their "main" branch,
    -+and/or incremental updates thereof.
    -++
    -+A client receiving such a a response MAY assume that they can skip
    -+retrieving the header from a bundle at the indicated URI, and thus
    -+save themselves and the server(s) the request(s) needed to inspect the
    -+headers of that bundle or bundles.
    -
      ## Makefile ##
     @@ Makefile: LIB_OBJS += blob.o
      LIB_OBJS += bloom.o
 3:  edff9c34c0f ! 10:  2a487981328 bundle-uri client: add "bundle-uri" parsing + tests
    @@ bundle-uri.c: int bundle_uri_command(struct repository *r,
     +			BUG("should have no fields=0");
     +		case 1:
     +			if (!strlen(arg)) {
    -+				err = error("bundle-uri: column %lu: got an empty attribute (full line was '%s')",
    -+					    i, line);
    ++				err = error("bundle-uri: column %"PRIuMAX": got an empty attribute (full line was '%s')",
    ++					    (uintmax_t)i, line);
     +				break;
     +			}
     +			/*
    @@ bundle-uri.c: int bundle_uri_command(struct repository *r,
     +			break;
     +		}
     +		default:
    -+			err = error("bundle-uri: column %lu: '%s' more than one '=' character (full line was '%s')",
    -+				    i, arg, line);
    ++			err = error("bundle-uri: column %"PRIuMAX": '%s' more than one '=' character (full line was '%s')",
    ++				    (uintmax_t)i, arg, line);
     +			break;
     +		}
     +
 5:  e3a235fb11a ! 11:  b85c2a1c0df bundle-uri client: add minimal NOOP client
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh (new)
     +		;;
     +	*)
     +		grep "^fatal:" err >fatal.actual &&
    -+		test_cmp fatal.expect fatal.actual
    ++		# Due to the same race conditions this might be
    ++		# "fatal: read error: Connection reset by peer", "fatal: the remote end
    ++		# hung up unexpectedly" etc.
    ++		cat fatal.actual &&
    ++		test_file_not_empty fatal.actual
     +		;;
     +	esac &&
     +
 6:  9d67cb640c1 ! 12:  54149dcb0aa bundle-uri client: add "git ls-remote-bundle-uri"
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh: test_expect_success !T5730_HTTP "bad clie
     +
     +test_expect_success "ls-remote-bundle-uri --[no-]quiet with $T5730_PROTOCOL:// using protocol v2" '
     +	test_when_finished "rm -f log" &&
    -+
    -+	cat >err.expect <<-\EOF &&
    -+	Cloning into '"'"'child'"'"'...
    -+	EOF
    -+
     +	test_when_finished "rm -rf child" &&
    -+	GIT_TRACE_PACKET="$PWD/log" \
    ++	env GIT_TRACE_PACKET="$PWD/log" \
     +	git \
     +		-c protocol.version=2 \
    -+		 clone "$T5730_URI" child \
    -+		 >out 2>err.actual &&
    -+	test_cmp err.expect err.actual &&
    -+	test_must_be_empty out &&
    ++		 clone "$T5730_URI" child &&
     +
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
     +		"$T5730_BUNDLE_URI_ESCAPED" &&
 7:  f78d970975b = 13:  5c50daa92bb bundle-uri client: add transfer.injectBundleURI support
 8:  716470488c5 = 14:  e66aa1f18b4 bundle-uri client: add boolean transfer.bundleURI setting
12:  6176c4554ce ! 15:  f19d2bcbc66 bundle-uri client: support for bundle-uri with "clone"
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh: test_expect_success "ls-remote-bundle-uri
     +	test_cmp expect actual
     +}
     +
    -+test_expect_success "clone with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
    ++show_cr () {
    ++	tr '\015' Q | sed -e "s/Q/<CR>\\$LF/g"
    ++}
    ++
    ++test_expect_success CURL "clone with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
     +
    -+	test_when_finished "rm -rf log child" &&
    ++	test_when_finished "rm -rf event log child" &&
    ++	GIT_TRACE2_EVENT="$PWD/event" \
     +	GIT_TRACE_PACKET="$PWD/log" \
     +	git \
     +		-c protocol.version=2 \
     +		-c fetch.uriProtocols=file,http \
     +		clone --verbose --verbose \
    -+		"$T5730_URI" child >out 2>err &&
    -+	grep -F "Receiving bundle (1/1)" err &&
    ++		"$T5730_URI" child &&
    ++	test_region progress "Receiving bundle (1/1)" event &&
     +	grep "clone> want " log &&
     +	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
     +'
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh: test_expect_success "ls-remote-bundle-uri
     +
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
     +
    -+	test_when_finished "rm -rf log child" &&
    ++	test_when_finished "rm -rf event log child" &&
     +	git init --bare child &&
     +	git -C child remote add --mirror=fetch origin "$T5730_URI" &&
    ++
    ++	GIT_TRACE2_EVENT="$PWD/event" \
     +	GIT_TRACE_PACKET="$PWD/log" \
     +	git -C child \
     +		-c protocol.version=2 \
     +		-c fetch.uriProtocols=file,http \
    -+		fetch --verbose --verbose >out 2>err &&
    -+	# Fetch is not supported yet
    -+	! grep -F "Receiving bundle (1/1)" err &&
    -+	grep "fetch> want " log &&
    ++		fetch --verbose --verbose &&
    ++
    ++	if test_have_prereq CURL
    ++	then
    ++		# Fetch is not supported yet
    ++		! test_region progress "Receiving bundle (1/1)" event &&
    ++		grep "fetch> want " log
    ++	else
    ++		! grep "fetch-pack: unable to spawn" event
    ++	fi &&
    ++
     +	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
     +'
     +
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh: test_expect_success "ls-remote-bundle-uri
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/2-3.bdl | test_uri_escape)" --add &&
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/3-4.bdl | test_uri_escape)" --add &&
     +
    -+	test_when_finished "rm -rf log child" &&
    ++	test_when_finished "rm -rf event log child" &&
    ++	GIT_TRACE2_EVENT="$PWD/event" \
     +	GIT_TRACE_PACKET="$PWD/log" \
     +	git \
     +		-c protocol.version=2 \
     +		-c fetch.uriProtocols=file,http \
     +		clone --verbose --verbose \
    -+		"$T5730_URI" child >out 2>err &&
    -+	grep -F "Receiving bundle (4/4)" err &&
    ++		"$T5730_URI" child &&
    ++
    ++	if test_have_prereq CURL
    ++	then
    ++		test_region progress "Receiving bundle (1/4)" event &&
    ++		test_region progress "Receiving bundle (2/4)" event &&
    ++		test_region progress "Receiving bundle (3/4)" event &&
    ++		test_region progress "Receiving bundle (4/4)" event
    ++	else
    ++		grep "fetch-pack: unable to spawn" event
    ++	fi &&
    ++
     +	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags &&
     +	grep "clone> want " log
     +'
    @@ t/lib-t5730-protocol-v2-bundle-uri.sh: test_expect_success "ls-remote-bundle-uri
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/4-5.bdl | test_uri_escape)" --add &&
     +	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/5-6.bdl | test_uri_escape)" --add &&
     +
    -+	test_when_finished "rm -rf log child" &&
    ++	test_when_finished "rm -rf event log child" &&
    ++	GIT_TRACE2_EVENT="$PWD/event" \
     +	GIT_TRACE_PACKET="$PWD/log" \
     +	git \
     +		-c protocol.version=2 \
     +		-c fetch.uriProtocols=file,http \
     +		clone --verbose --verbose \
    -+		"$T5730_URI" child >out 2>err &&
    -+	grep -F "Receiving bundle (6/6)" err &&
    -+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags &&
    -+	! grep "clone> want " log
    ++		"$T5730_URI" child &&
    ++
    ++	if test_have_prereq CURL
    ++	then
    ++		test_region progress "Receiving bundle (1/6)" event &&
    ++		test_region progress "Receiving bundle (2/6)" event &&
    ++		test_region progress "Receiving bundle (3/6)" event &&
    ++		test_region progress "Receiving bundle (4/6)" event &&
    ++		test_region progress "Receiving bundle (5/6)" event &&
    ++		test_region progress "Receiving bundle (6/6)" event &&
    ++		! grep "clone> want " log
    ++	else
    ++		grep "fetch-pack: unable to spawn" event
    ++	fi &&
    ++
    ++	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
     +'
     
      ## transport.c ##
13:  be59495d81b = 16:  632c68b224f bundle-uri: make the download program configurable
19:  4398efebcec = 17:  8ac5bfca545 remote-curl: add 'get' capability
20:  5cbaa40b365 ! 18:  ff9a7afaccd bundle: implement 'fetch' command for direct bundles
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
     
    + ## Documentation/git-bundle.txt ##
    +@@ Documentation/git-bundle.txt: SYNOPSIS
    + 'git bundle' create [-q | --quiet | --progress | --all-progress] [--all-progress-implied]
    + 		    [--version=<version>] <file> <git-rev-list-args>
    + 'git bundle' verify [-q | --quiet] <file>
    ++'git bundle' fetch [--filter=<spec>] <uri>
    + 'git bundle' list-heads <file> [<refname>...]
    + 'git bundle' unbundle [--progress] <file> [<refname>...]
    + 
    +
      ## builtin/bundle.c ##
     @@
      #include "parse-options.h"
21:  6c055bc2613 ! 19:  a5245a31a12 bundle: parse table of contents during 'fetch'
    @@ Commit message
         RFC-TODO: create tests that check these erroneous cases.
     
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## builtin/bundle.c ##
     @@
    @@ builtin/bundle.c: struct remote_bundle_info {
     +	 */
     +	unsigned pushed:1;
      };
    - 
    ++#define REMOTE_BUNDLE_INFO_INIT { \
    ++	.file = STRBUF_INIT, \
    ++}
    ++
     +static int remote_bundle_cmp(const void *unused_cmp_data,
     +			     const struct hashmap_entry *a,
     +			     const struct hashmap_entry *b,
    @@ builtin/bundle.c: struct remote_bundle_info {
     +	struct hashmap *toc = data;
     +	const char *key1, *key2, *id_end;
     +	struct strbuf id = STRBUF_INIT;
    -+	struct remote_bundle_info info_lookup;
    ++	struct remote_bundle_info info_lookup = REMOTE_BUNDLE_INFO_INIT;
     +	struct remote_bundle_info *info;
     +
     +	if (!skip_prefix(key, "bundle.", &key1))
    @@ builtin/bundle.c: struct remote_bundle_info {
     +	/* Return 0 here to ignore unknown options. */
     +	return 0;
     +}
    -+
    + 
      static void download_uri_to_file(const char *uri, const char *file)
      {
    - 	struct child_process cp = CHILD_PROCESS_INIT;
     @@ builtin/bundle.c: static void unbundle_fetched_bundle(struct remote_bundle_info *info)
      
      static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
    @@ builtin/bundle.c: static int cmd_bundle_fetch(int argc, const char **argv, const
      			 * push it onto the stack.
      			 */
     +			struct remote_bundle_info *info;
    -+			struct remote_bundle_info info_lookup = { 0 };
    ++			struct remote_bundle_info info_lookup = REMOTE_BUNDLE_INFO_INIT;
     +			info_lookup.id = stack->requires_id;
     +
     +			hashmap_entry_init(&info_lookup.ent, strhash(info_lookup.id));
22:  af61b29a571 = 20:  c3a60a9bc72 bundle: add --filter option to 'fetch'
23:  de4845ef1c0 = 21:  c08406cd9c2 bundle: allow relative URLs in table of contents
24:  19d25702355 = 22:  1350c19c3a1 bundle: make it easy to call 'git bundle fetch'
25:  a20e4a5b207 ! 23:  62623324d2f clone: add --bundle-uri option
    @@ builtin/clone.c: static struct option builtin_clone_options[] = {
      	OPT_BOOL(0, "sparse", &option_sparse_checkout,
      		    N_("initialize sparse-checkout file to include only files at root")),
     +	OPT_STRING(0, "bundle-uri", &bundle_uri,
    -+		   N_("uri"), N_("A URI for downloading bundles before fetching from origin remote")),
    ++		   N_("uri"), N_("a URI for downloading bundles before fetching from origin remote")),
      	OPT_END()
      };
      
26:  277091d5eeb = 24:  d79901dddb0 clone: --bundle-uri cannot be combined with --depth
28:  7b414176313 ! 25:  ab349b1cbea bundle: only fetch bundles if timestamp is new
    @@ Commit message
     
         RFC-TODO: Add 'fetch.bundleTimestamp' to Documentation/config/
     
    +    RFC-TODO @ Ævar: I replaced the git_config_get_timestamp() with
    +    parse_expiry_date(), but as noted perhaps we want *nix epochs here
    +    only, in that case we could add an "isdigit" loop here.
    +
         Signed-off-by: Derrick Stolee <derrickstolee@github.com>
     
      ## builtin/bundle.c ##
    @@ builtin/bundle.c: static int cmd_bundle_fetch(int argc, const char **argv, const
     +	const char *timestamp_key = "fetch.bundletimestamp";
     +	timestamp_t stored_time = 0;
     +	timestamp_t max_time = 0;
    ++	const char *value;
      
      	struct option options[] = {
      		OPT_BOOL(0, "progress", &progress,
    @@ builtin/bundle.c: static int cmd_bundle_fetch(int argc, const char **argv, const
      	if (!startup_info->have_repository)
      		die(_("'fetch' requires a repository"));
      
    -+	git_config_get_timestamp(timestamp_key, &stored_time);
    ++	/*
    ++	 * TODO: Is it important re
    ++	 * https://lore.kernel.org/git/220311.86pmmshahy.gmgdl@evledraar.gmail.com/
    ++	 * that we don't accept "2.days.ago" etc., and only *nix
    ++	 * epochs?
    ++	 */
    ++	if (!git_config_get_string_tmp(timestamp_key, &value) &&
    ++	    parse_expiry_date(value, &stored_time))
    ++		return error(_("'%s' for '%s' is not a valid timestamp"),
    ++			     value, timestamp_key);
     +
      	/*
      	 * Step 1: determine protocol for uri, and download contents to
29:  857f9be78e5 = 26:  0a238641247 fetch: fetch bundles before fetching original data
30:  85ebf44038e = 27:  5e8cec1e193 protocol-caps: implement cap_features()
31:  e30d9a9f95d = 28:  145c660ca52 serve: understand but do not advertise 'features' capability
32:  cf07392921d = 29:  2c9886c64ea serve: advertise 'features' when config exists
33:  1e8c52dbe47 = 30:  e834e633e84 connect: implement get_recommended_features()
34:  b8044bb09f0 = 31:  6611dd08f93 transport: add connections for 'features' capability
35:  3aa4d42d2ac = 32:  2b424bedfc5 clone: use server-recommended bundle URI
36:  6e4da9ccc85 ! 33:  52ee1e08dec t5601: basic bundle URI test
    @@ t/t5601-clone.sh: test_expect_success 'reject cloning shallow repository using H
     +	GIT_TRACE2_EVENT="$(pwd)/trace.txt" \
     +		git -c protocol.version=2 clone \
     +		$HTTPD_URL/smart/repo2.git repo &&
    -+	test_subcommand_inexact git bundle unbundle <trace.txt
    ++	cat >pat <<-\EOF &&
    ++	"event":"child_start".*"argv":\["git","bundle","unbundle"
    ++	EOF
    ++	grep -f pat trace.txt
     +'
     +
      # DO NOT add non-httpd-specific tests here, because the last part of this
 -:  ----------- > 34:  f872793cac2 protocol v2: add server-side "bundle-uri" skeleton (docs)
 2:  6bc2316d2fd = 35:  cfda9323aaa bundle-uri docs: add design notes
14:  54c4ccafd9a ! 36:  764f82a0c0a docs: document bundle URI standard
    @@ Documentation/technical/bundle-uri-TOC.txt (new)
     +
     +[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
     +    The GVFS Protocol
    - \ No newline at end of file
27:  1173ceeb08a <  -:  ----------- config: add git_config_get_timestamp()
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
                       ` (35 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Refactor the sending of the "agent" and "object-format" capabilities
into a function.

This was added in its current form in ab67235bc4 (connect: parse v2
refs with correct hash algorithm, 2020-05-25). When we connect to a v2
server we need to know about its object-format, and it needs to know
about ours. Since most things in connect.c and transport.c piggy-back
on the eager getting of remote refs via the handshake() those commands
can make use of the just-sent-over object-format by ls-refs.

But I'm about to add a command that may or may not come after
ls-refs, and the server still needs to know about our user-agent and
object-format. So let's split this into a function.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
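
The refactoring described in the commit message can be illustrated
outside of git with a minimal, self-contained sketch. Everything named
here (struct server_caps, pkt_line_append(), send_capabilities_sketch())
is hypothetical and is not git's API; it only assumes the pkt-line
framing of four hex digits giving the total length, prefix included.
Unlike the real code, it does not validate the hash name and does not
die() on an unknown one.

```c
#include <stdio.h>
#include <string.h>

/* What a (hypothetical) server advertised in its v2 capability list. */
struct server_caps {
	int has_agent;             /* non-zero if "agent" was advertised */
	const char *object_format; /* e.g. "sha256", or NULL if not advertised */
};

/*
 * Append one pkt-line to "out": four hex digits giving the total
 * length (payload plus the 4-byte prefix itself), then the payload.
 * Returns the number of bytes appended, or 0 if it does not fit.
 */
static size_t pkt_line_append(char *out, size_t outsz, size_t used,
			      const char *payload)
{
	size_t len = strlen(payload) + 4;

	if (len > 0xffff || used + len + 1 > outsz)
		return 0;
	snprintf(out + used, outsz - used, "%04zx%s", len, payload);
	return len;
}

/*
 * Illustrative stand-in for the refactored helper: emit "agent" and
 * "object-format" only if the server advertised them, and report the
 * hash name the caller should use (falling back to "sha1").
 */
static size_t send_capabilities_sketch(char *out, size_t outsz,
				       const struct server_caps *caps,
				       const char *agent,
				       const char **hash_name)
{
	char line[256];
	size_t used = 0;

	if (outsz)
		out[0] = '\0';

	if (caps->has_agent) {
		snprintf(line, sizeof(line), "agent=%s", agent);
		used += pkt_line_append(out, outsz, used, line);
	}

	if (caps->object_format) {
		*hash_name = caps->object_format;
		snprintf(line, sizeof(line), "object-format=%s", *hash_name);
		used += pkt_line_append(out, outsz, used, line);
	} else {
		*hash_name = "sha1";
	}
	return used;
}
```

Any command that needs the server to learn the client's agent and
object format can then call the one helper, instead of relying on
ls-refs having sent them earlier in the dialog.
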
 connect.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/connect.c b/connect.c
index afc79a6236e..e6d0b1d34bd 100644
--- a/connect.c
+++ b/connect.c
@@ -473,6 +473,24 @@ void check_stateless_delimiter(int stateless_rpc,
 		die("%s", error);
 }
 
+static void send_capabilities(int fd_out, struct packet_reader *reader)
+{
+	const char *hash_name;
+
+	if (server_supports_v2("agent", 0))
+		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
+
+	if (server_feature_v2("object-format", &hash_name)) {
+		int hash_algo = hash_algo_by_name(hash_name);
+		if (hash_algo == GIT_HASH_UNKNOWN)
+			die(_("unknown object format '%s' specified by server"), hash_name);
+		reader->hash_algo = &hash_algos[hash_algo];
+		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
+	} else {
+		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	}
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
@@ -480,7 +498,6 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     int stateless_rpc)
 {
 	int i;
-	const char *hash_name;
 	struct strvec *ref_prefixes = transport_options ?
 		&transport_options->ref_prefixes : NULL;
 	const char **unborn_head_target = transport_options ?
@@ -490,18 +507,8 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 	if (server_supports_v2("ls-refs", 1))
 		packet_write_fmt(fd_out, "command=ls-refs\n");
 
-	if (server_supports_v2("agent", 0))
-		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
-
-	if (server_feature_v2("object-format", &hash_name)) {
-		int hash_algo = hash_algo_by_name(hash_name);
-		if (hash_algo == GIT_HASH_UNKNOWN)
-			die(_("unknown object format '%s' specified by server"), hash_name);
-		reader->hash_algo = &hash_algos[hash_algo];
-		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
-	} else {
-		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
-	}
+	/* Send capabilities */
+	send_capabilities(fd_out, reader);
 
 	if (server_options && server_options->nr &&
 	    server_supports_v2("server-option", 1))
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-21 17:26       ` Derrick Stolee
  2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
                       ` (34 subsequent siblings)
  36 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a path_match_flags() function and have the two sets of
starts_with_dot_{,dot_}slash() functions added in
63e95beb085 (submodule: port resolve_relative_url from shell to C,
2016-04-15) and a2b26ffb1a8 (fsck: convert gitmodules url to URL
passed to curl, 2020-04-18) be thin wrappers for it.

As the latter of those notes, the fsck version was copied from the
initial builtin/submodule--helper.c version.

Since the code added in a2b26ffb1a8 was really doing the same as
win32_is_dir_sep() added in 1cadad6f658 (git clone <url>
C:\cygwin\home\USER\repo' is working (again), 2018-12-15) let's move
the latter to git-compat-util.h as is_xplatform_dir_sep(). We can
then call either it or the platform-specific is_dir_sep() from this
new function.

Let's likewise change code in various other places that was hardcoding
checks for "'/' || '\\'" with the new is_xplatform_dir_sep(). As can
be seen in those callers some of them still concern themselves with
':' (Mac OS classic?), but let's leave the question of whether that
should be consolidated for some other time.

As we expect to make wider use of the "native" case in the future,
define and use two starts_with_dot_{,dot_}slash_native() convenience
wrappers. This makes the diff in builtin/submodule--helper.c much
smaller.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
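
For readers who want the idea without reading the diff, the "native"
versus "cross-platform" split described in the commit message can be
sketched as below. This is a simplified illustration and not the code
in this patch: the enum and every sketch_* name are hypothetical, and
the native case is approximated with an _WIN32 check rather than git's
own is_dir_sep().

```c
/* Which set of directory separators a prefix check should accept. */
enum sketch_match_mode {
	SKETCH_NATIVE,    /* whatever the build platform treats as a separator */
	SKETCH_XPLATFORM  /* both '/' and '\\', regardless of platform */
};

/* Accept either separator, as is_xplatform_dir_sep() is described to do. */
static int sketch_is_xplatform_dir_sep(int c)
{
	return c == '/' || c == '\\';
}

/* Approximation of a platform-specific separator check. */
static int sketch_is_native_dir_sep(int c)
{
#ifdef _WIN32
	return c == '/' || c == '\\';
#else
	return c == '/';
#endif
}

static int sketch_is_sep(int c, enum sketch_match_mode mode)
{
	return mode == SKETCH_XPLATFORM ? sketch_is_xplatform_dir_sep(c)
					: sketch_is_native_dir_sep(c);
}

/* True if "path" begins with "./" (or ".\" where that is accepted). */
static int sketch_starts_with_dot_slash(const char *path,
					enum sketch_match_mode mode)
{
	/* Short-circuiting stops the read at the terminating NUL. */
	return path[0] == '.' && sketch_is_sep(path[1], mode);
}

/* True if "path" begins with "../" (or "..\" where that is accepted). */
static int sketch_starts_with_dot_dot_slash(const char *path,
					    enum sketch_match_mode mode)
{
	return path[0] == '.' && path[1] == '.' &&
	       sketch_is_sep(path[2], mode);
}
```

With one function taking the mode, the per-caller helpers shrink to
thin wrappers, which is the shape the commit message describes for the
submodule and fsck callers.
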
 builtin/submodule--helper.c | 20 ++++++------
 compat/mingw.c              |  2 +-
 compat/win32/path-utils.h   |  6 +---
 dir.c                       | 29 +++++++++++++++++
 dir.h                       | 63 +++++++++++++++++++++++++++++++++++++
 fsck.c                      | 23 ++++----------
 git-compat-util.h           |  8 ++++-
 path.c                      |  2 +-
 submodule-config.c          |  6 ++--
 9 files changed, 121 insertions(+), 38 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index 2c87ef9364f..b68102bb3ed 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -72,16 +72,6 @@ static char *get_default_remote(void)
 	return repo_get_default_remote(the_repository);
 }
 
-static int starts_with_dot_slash(const char *str)
-{
-	return str[0] == '.' && is_dir_sep(str[1]);
-}
-
-static int starts_with_dot_dot_slash(const char *str)
-{
-	return str[0] == '.' && str[1] == '.' && is_dir_sep(str[2]);
-}
-
 /*
  * Returns 1 if it was the last chop before ':'.
  */
@@ -108,6 +98,16 @@ static int chop_last_dir(char **remoteurl, int is_relative)
 	return 0;
 }
 
+static int starts_with_dot_slash(const char *const path)
+{
+	return starts_with_dot_slash_native(path);;
+}
+
+static int starts_with_dot_dot_slash(const char *const path)
+{
+	return starts_with_dot_dot_slash_native(path);
+}
+
 /*
  * The `url` argument is the URL that navigates to the submodule origin
  * repo. When relative, this URL is relative to the superproject origin
diff --git a/compat/mingw.c b/compat/mingw.c
index 6fe80fdf014..b94b473d978 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -2830,7 +2830,7 @@ int is_valid_win32_path(const char *path, int allow_literal_nul)
 			}
 
 			c = path[i];
-			if (c && c != '.' && c != ':' && c != '/' && c != '\\')
+			if (c && c != '.' && c != ':' && !is_xplatform_dir_sep(c))
 				goto not_a_reserved_name;
 
 			/* contains reserved name */
diff --git a/compat/win32/path-utils.h b/compat/win32/path-utils.h
index bba2b644080..65fa3b9263a 100644
--- a/compat/win32/path-utils.h
+++ b/compat/win32/path-utils.h
@@ -6,11 +6,7 @@ int win32_has_dos_drive_prefix(const char *path);
 
 int win32_skip_dos_drive_prefix(char **path);
 #define skip_dos_drive_prefix win32_skip_dos_drive_prefix
-static inline int win32_is_dir_sep(int c)
-{
-	return c == '/' || c == '\\';
-}
-#define is_dir_sep win32_is_dir_sep
+#define is_dir_sep is_xplatform_dir_sep
 static inline char *win32_find_last_dir_sep(const char *path)
 {
 	char *ret = NULL;
diff --git a/dir.c b/dir.c
index f2b0f242101..d25aa6ade55 100644
--- a/dir.c
+++ b/dir.c
@@ -3890,3 +3890,32 @@ void relocate_gitdir(const char *path, const char *old_git_dir, const char *new_
 
 	connect_work_tree_and_git_dir(path, new_git_dir, 0);
 }
+
+int path_match_flags(const char *const str, const enum path_match_flags flags)
+{
+	const char *p = str;
+
+	if (flags & PATH_MATCH_STARTS_WITH_DOT_SLASH &&
+	    flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH)
+		BUG("path_match_flags() must get one match kind, not multiple!");
+	else if (!(flags & PATH_MATCH_KINDS_MASK))
+		BUG("path_match_flags() must get at least one match kind!");
+
+	if (flags & PATH_MATCH_NATIVE &&
+	    flags & PATH_MATCH_XPLATFORM)
+		BUG("path_match_flags() must get one platform kind, not multiple!");
+	else if (!(flags & PATH_MATCH_PLATFORM_MASK))
+		BUG("path_match_flags() must get at least one platform kind!");
+
+	if (*p++ != '.')
+		return 0;
+	if (flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH &&
+	    *p++ != '.')
+		return 0;
+
+	if (flags & PATH_MATCH_NATIVE)
+		return is_dir_sep(*p);
+	else if (flags & PATH_MATCH_XPLATFORM)
+		return is_xplatform_dir_sep(*p);
+	BUG("unreachable");
+}
diff --git a/dir.h b/dir.h
index 8e02dfb505d..7bc862030cf 100644
--- a/dir.h
+++ b/dir.h
@@ -578,4 +578,67 @@ void connect_work_tree_and_git_dir(const char *work_tree,
 void relocate_gitdir(const char *path,
 		     const char *old_git_dir,
 		     const char *new_git_dir);
+
+/**
+ * The "enum path_match_flags" determines how path_match_flags() will
+ * behave. The flags come in sets, and one (and only one) must be
+ * provided out of each "set":
+ *
+ * PATH_MATCH_NATIVE:
+ *	Path separator is is_dir_sep()
+ * PATH_MATCH_XPLATFORM:
+ *	Path separator is is_xplatform_dir_sep()
+ *
+ * Do we use is_dir_sep() to check for a directory separator
+ * (*_NATIVE), or do we always check for '/' or '\' (*_XPLATFORM). The
+ * "*_NATIVE" version on Windows is the same as "*_XPLATFORM",
+ * everywhere else "*_NATIVE" means "only /".
+ *
+ * PATH_MATCH_STARTS_WITH_DOT_SLASH:
+ *	Match a path starting with "./"
+ * PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH:
+ *	Match a path starting with "../"
+ *
+ * The "/" in the above is adjusted based on the "*_NATIVE" and
+ * "*_XPLATFORM" flags.
+ */
+enum path_match_flags {
+	PATH_MATCH_NATIVE = 1 << 0,
+	PATH_MATCH_XPLATFORM = 1 << 1,
+	PATH_MATCH_STARTS_WITH_DOT_SLASH = 1 << 2,
+	PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH = 1 << 3,
+};
+#define PATH_MATCH_KINDS_MASK (PATH_MATCH_STARTS_WITH_DOT_SLASH | \
+	PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH)
+#define PATH_MATCH_PLATFORM_MASK (PATH_MATCH_NATIVE | PATH_MATCH_XPLATFORM)
+
+/**
+ * path_match_flags() checks if a given "path" matches a given "enum
+ * path_match_flags" criteria.
+ */
+int path_match_flags(const char *const path, const enum path_match_flags f);
+
+/**
+ * starts_with_dot_slash_native(): convenience wrapper for
+ * path_match_flags() with PATH_MATCH_STARTS_WITH_DOT_SLASH and
+ * PATH_MATCH_NATIVE.
+ */
+static inline int starts_with_dot_slash_native(const char *const path)
+{
+	const enum path_match_flags what = PATH_MATCH_STARTS_WITH_DOT_SLASH;
+
+	return path_match_flags(path, what | PATH_MATCH_NATIVE);
+}
+
+/**
+ * starts_with_dot_dot_slash_native(): convenience wrapper for
+ * path_match_flags() with PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH and
+ * PATH_MATCH_NATIVE.
+ */
+static inline int starts_with_dot_dot_slash_native(const char *const path)
+{
+	const enum path_match_flags what = PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH;
+
+	return path_match_flags(path, what | PATH_MATCH_NATIVE);
+}
 #endif
diff --git a/fsck.c b/fsck.c
index 3ec500d707a..dd4822ba1be 100644
--- a/fsck.c
+++ b/fsck.c
@@ -975,27 +975,16 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	return ret;
 }
 
-/*
- * Like builtin/submodule--helper.c's starts_with_dot_slash, but without
- * relying on the platform-dependent is_dir_sep helper.
- *
- * This is for use in checking whether a submodule URL is interpreted as
- * relative to the current directory on any platform, since \ is a
- * directory separator on Windows but not on other platforms.
- */
-static int starts_with_dot_slash(const char *str)
+static int starts_with_dot_slash(const char *const path)
 {
-	return str[0] == '.' && (str[1] == '/' || str[1] == '\\');
+	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_SLASH |
+				PATH_MATCH_XPLATFORM);
 }
 
-/*
- * Like starts_with_dot_slash, this is a variant of submodule--helper's
- * helper of the same name with the twist that it accepts backslash as a
- * directory separator even on non-Windows platforms.
- */
-static int starts_with_dot_dot_slash(const char *str)
+static int starts_with_dot_dot_slash(const char *const path)
 {
-	return str[0] == '.' && starts_with_dot_slash(str + 1);
+	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH |
+				PATH_MATCH_XPLATFORM);
 }
 
 static int submodule_url_is_relative(const char *url)
diff --git a/git-compat-util.h b/git-compat-util.h
index 58fd813bd01..ba3436db9a1 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -236,6 +236,12 @@
 #include <sys/sysctl.h>
 #endif
 
+/* Used by compat/win32/path-utils.h, and more */
+static inline int is_xplatform_dir_sep(int c)
+{
+	return c == '/' || c == '\\';
+}
+
 #if defined(__CYGWIN__)
 #include "compat/win32/path-utils.h"
 #endif
@@ -416,11 +422,11 @@ static inline int git_skip_dos_drive_prefix(char **path)
 #define skip_dos_drive_prefix git_skip_dos_drive_prefix
 #endif
 
-#ifndef is_dir_sep
 static inline int git_is_dir_sep(int c)
 {
 	return c == '/';
 }
+#ifndef is_dir_sep
 #define is_dir_sep git_is_dir_sep
 #endif
 
diff --git a/path.c b/path.c
index d73146b6cd2..2ab78278943 100644
--- a/path.c
+++ b/path.c
@@ -1413,7 +1413,7 @@ int is_ntfs_dotgit(const char *name)
 
 	for (;;) {
 		c = *(name++);
-		if (!c || c == '\\' || c == '/' || c == ':')
+		if (!c || is_xplatform_dir_sep(c) || c == ':')
 			return 1;
 		if (c != '.' && c != ' ')
 			return 0;
diff --git a/submodule-config.c b/submodule-config.c
index 29668b0620d..ce3beaf5d4f 100644
--- a/submodule-config.c
+++ b/submodule-config.c
@@ -204,17 +204,17 @@ int check_submodule_name(const char *name)
 		return -1;
 
 	/*
-	 * Look for '..' as a path component. Check both '/' and '\\' as
+	 * Look for '..' as a path component. Check is_xplatform_dir_sep() as
 	 * separators rather than is_dir_sep(), because we want the name rules
 	 * to be consistent across platforms.
 	 */
 	goto in_component; /* always start inside component */
 	while (*name) {
 		char c = *name++;
-		if (c == '/' || c == '\\') {
+		if (is_xplatform_dir_sep(c)) {
 in_component:
 			if (name[0] == '.' && name[1] == '.' &&
-			    (!name[2] || name[2] == '/' || name[2] == '\\'))
+			    (!name[2] || is_xplatform_dir_sep(name[2])))
 				return -1;
 		}
 	}
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended()
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-21 17:28       ` Derrick Stolee
  2022-04-18 17:23     ` [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
                       ` (33 subsequent siblings)
  36 siblings, 1 reply; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a version of the deref_without_lazy_fetch() function which can be
called with custom oi_flags, and which reports the "enum object_type"
of the object to the caller. This will be used for the bundle-uri
client in a subsequent commit.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 fetch-pack.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 4e1e88eea09..d0aa3a5c229 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -115,11 +115,12 @@ static void for_each_cached_alternate(struct fetch_negotiator *negotiator,
 		cb(negotiator, cache.items[i]);
 }
 
-static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
-					       int mark_tags_complete)
+static struct commit *deref_without_lazy_fetch_extended(const struct object_id *oid,
+							int mark_tags_complete,
+							enum object_type *type,
+							unsigned int oi_flags)
 {
-	enum object_type type;
-	struct object_info info = { .typep = &type };
+	struct object_info info = { .typep = type };
 	struct commit *commit;
 
 	commit = lookup_commit_in_graph(the_repository, oid);
@@ -128,9 +129,9 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 
 	while (1) {
 		if (oid_object_info_extended(the_repository, oid, &info,
-					     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK))
+					     oi_flags))
 			return NULL;
-		if (type == OBJ_TAG) {
+		if (*type == OBJ_TAG) {
 			struct tag *tag = (struct tag *)
 				parse_object(the_repository, oid);
 
@@ -144,7 +145,7 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 		}
 	}
 
-	if (type == OBJ_COMMIT) {
+	if (*type == OBJ_COMMIT) {
 		struct commit *commit = lookup_commit(the_repository, oid);
 		if (!commit || repo_parse_commit(the_repository, commit))
 			return NULL;
@@ -154,6 +155,16 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
 	return NULL;
 }
 
+
+static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
+					       int mark_tags_complete)
+{
+	enum object_type type;
+	unsigned flags = OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK;
+	return deref_without_lazy_fetch_extended(oid, mark_tags_complete,
+						 &type, flags);
+}
+
 static int rev_list_insert_ref(struct fetch_negotiator *negotiator,
 			       const struct object_id *oid)
 {
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (2 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 05/36] http: make http_get_file() external Ævar Arnfjörð Bjarmason
                       ` (32 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Move the populating of the --keep=* option argument to "index-pack" to
a static function; a subsequent commit will make use of it in another
function.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
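As a rough standalone illustration of the string the new helper builds, the
sketch below formats the same "--keep=" argument with plain snprintf(). The
function name and its pid/hostname parameters are hypothetical: the patch
itself obtains them via getpid() and xgethostname() and pushes the result
onto a strvec.

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Formats the "--keep=..." argument into buf. A NULL or empty hostname
 * falls back to "localhost", mirroring the fallback in the patch.
 * Returns the snprintf() result.
 */
static int sketch_format_keep_option(char *buf, size_t size,
				     uintmax_t pid, const char *hostname)
{
	if (!hostname || !*hostname)
		hostname = "localhost";
	return snprintf(buf, size, "--keep=fetch-pack %" PRIuMAX " on %s",
			pid, hostname);
}
```

A caller would pass (uintmax_t)getpid() and the result of a hostname lookup,
or NULL if that lookup failed.
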
 fetch-pack.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index d0aa3a5c229..b1d90d1914f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -847,6 +847,16 @@ static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids)
 	} while (1);
 }
 
+static void add_index_pack_keep_option(struct strvec *args)
+{
+	char hostname[HOST_NAME_MAX + 1];
+
+	if (xgethostname(hostname, sizeof(hostname)))
+		xsnprintf(hostname, sizeof(hostname), "localhost");
+	strvec_pushf(args, "--keep=fetch-pack %"PRIuMAX " on %s",
+		     (uintmax_t)getpid(), hostname);
+}
+
 /*
  * If packfile URIs were provided, pass a non-NULL pointer to index_pack_args.
  * The strings to pass as the --index-pack-arg arguments to http-fetch will be
@@ -916,14 +926,8 @@ static int get_pack(struct fetch_pack_args *args,
 			strvec_push(&cmd.args, "-v");
 		if (args->use_thin_pack)
 			strvec_push(&cmd.args, "--fix-thin");
-		if ((do_keep || index_pack_args) && (args->lock_pack || unpack_limit)) {
-			char hostname[HOST_NAME_MAX + 1];
-			if (xgethostname(hostname, sizeof(hostname)))
-				xsnprintf(hostname, sizeof(hostname), "localhost");
-			strvec_pushf(&cmd.args,
-				     "--keep=fetch-pack %"PRIuMAX " on %s",
-				     (uintmax_t)getpid(), hostname);
-		}
+		if ((do_keep || index_pack_args) && (args->lock_pack || unpack_limit))
+			add_index_pack_keep_option(&cmd.args);
 		if (!index_pack_args && args->check_self_contained_and_connected)
 			strvec_push(&cmd.args, "--check-self-contained-and-connected");
 		else
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 05/36] http: make http_get_file() external
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (3 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 06/36] remote: move relative_url() Ævar Arnfjörð Bjarmason
                       ` (31 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

This method will be used in an upcoming extension of git-remote-curl to
download a single file over HTTP(S) by request.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 http.c | 4 ++--
 http.h | 9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/http.c b/http.c
index 229da4d1488..04e73149357 100644
--- a/http.c
+++ b/http.c
@@ -1945,8 +1945,8 @@ int http_get_strbuf(const char *url,
  * If a previous interrupted download is detected (i.e. a previous temporary
  * file is still around) the download is resumed.
  */
-static int http_get_file(const char *url, const char *filename,
-			 struct http_get_options *options)
+int http_get_file(const char *url, const char *filename,
+		  struct http_get_options *options)
 {
 	int ret;
 	struct strbuf tmpfile = STRBUF_INIT;
diff --git a/http.h b/http.h
index df1590e53a4..ba303cfb372 100644
--- a/http.h
+++ b/http.h
@@ -163,6 +163,15 @@ struct http_get_options {
  */
 int http_get_strbuf(const char *url, struct strbuf *result, struct http_get_options *options);
 
+/*
+ * Downloads a URL and stores the result in the given file.
+ *
+ * If a previous interrupted download is detected (i.e. a previous temporary
+ * file is still around) the download is resumed.
+ */
+int http_get_file(const char *url, const char *filename,
+		  struct http_get_options *options);
+
 int http_fetch_ref(const char *base, struct ref *ref);
 
 /* Helpers for fetching packs */
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 06/36] remote: move relative_url()
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (4 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 05/36] http: make http_get_file() external Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url Ævar Arnfjörð Bjarmason
                       ` (30 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

From: Derrick Stolee <derrickstolee@github.com>

This method was initially written in 63e95beb0 (submodule: port
resolve_relative_url from shell to C, 2016-04-15). As we will need
similar functionality in the bundle URI feature, extract this to be
available in remote.h.

The code is almost exactly the same, except for the following trivial
differences:

 * Fix whitespace and wrapping issues with the prototype and argument
   lists.

 * Let's call starts_with_dot_{,dot_}slash_native() instead of the
   functionally identical "starts_with_dot_{,dot_}slash()" wrappers in
   "builtin/submodule--helper.c".

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
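For readers unfamiliar with relative_url(), the following heavily simplified
sketch shows the core idea: each leading "../" in 'url' chops the last path
component off the remote URL. It is not the moved function. The name
sketch_relative_url() is hypothetical, only '/' is treated as a separator,
and the ':' handling, 'up_path' and die() paths are all left out.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Resolves rel against base; returns a malloc()'d string the caller frees. */
static char *sketch_relative_url(const char *base, const char *rel)
{
	size_t len = strlen(base);
	size_t n;
	char *b = malloc(len + 1);
	char *out, *slash;

	if (!b)
		return NULL;
	memcpy(b, base, len + 1);

	/* Ignore a single trailing slash on the base URL. */
	if (len && b[len - 1] == '/')
		b[len - 1] = '\0';

	for (;;) {
		if (!strncmp(rel, "../", 3)) {
			/* Drop "../" and the last component of the base. */
			rel += 3;
			slash = strrchr(b, '/');
			if (slash)
				*slash = '\0';
		} else if (!strncmp(rel, "./", 2)) {
			rel += 2;
		} else {
			break;
		}
	}

	n = strlen(b) + 1 + strlen(rel) + 1;
	out = malloc(n);
	if (out)
		snprintf(out, n, "%s/%s", b, rel);
	free(b);
	return out;
}
```

Like the real function, this sketch happily chops into the scheme and host
part, which is the behavior the NEEDSWORK comment in remote.h documents.
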
 builtin/submodule--helper.c | 141 +++---------------------------------
 remote.c                    |  91 +++++++++++++++++++++++
 remote.h                    |  31 ++++++++
 3 files changed, 134 insertions(+), 129 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index b68102bb3ed..86f38e489b0 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -72,135 +72,6 @@ static char *get_default_remote(void)
 	return repo_get_default_remote(the_repository);
 }
 
-/*
- * Returns 1 if it was the last chop before ':'.
- */
-static int chop_last_dir(char **remoteurl, int is_relative)
-{
-	char *rfind = find_last_dir_sep(*remoteurl);
-	if (rfind) {
-		*rfind = '\0';
-		return 0;
-	}
-
-	rfind = strrchr(*remoteurl, ':');
-	if (rfind) {
-		*rfind = '\0';
-		return 1;
-	}
-
-	if (is_relative || !strcmp(".", *remoteurl))
-		die(_("cannot strip one component off url '%s'"),
-			*remoteurl);
-
-	free(*remoteurl);
-	*remoteurl = xstrdup(".");
-	return 0;
-}
-
-static int starts_with_dot_slash(const char *const path)
-{
-	return starts_with_dot_slash_native(path);;
-}
-
-static int starts_with_dot_dot_slash(const char *const path)
-{
-	return starts_with_dot_dot_slash_native(path);
-}
-
-/*
- * The `url` argument is the URL that navigates to the submodule origin
- * repo. When relative, this URL is relative to the superproject origin
- * URL repo. The `up_path` argument, if specified, is the relative
- * path that navigates from the submodule working tree to the superproject
- * working tree. Returns the origin URL of the submodule.
- *
- * Return either an absolute URL or filesystem path (if the superproject
- * origin URL is an absolute URL or filesystem path, respectively) or a
- * relative file system path (if the superproject origin URL is a relative
- * file system path).
- *
- * When the output is a relative file system path, the path is either
- * relative to the submodule working tree, if up_path is specified, or to
- * the superproject working tree otherwise.
- *
- * NEEDSWORK: This works incorrectly on the domain and protocol part.
- * remote_url      url              outcome          expectation
- * http://a.com/b  ../c             http://a.com/c   as is
- * http://a.com/b/ ../c             http://a.com/c   same as previous line, but
- *                                                   ignore trailing slash in url
- * http://a.com/b  ../../c          http://c         error out
- * http://a.com/b  ../../../c       http:/c          error out
- * http://a.com/b  ../../../../c    http:c           error out
- * http://a.com/b  ../../../../../c    .:c           error out
- * NEEDSWORK: Given how chop_last_dir() works, this function is broken
- * when a local part has a colon in its path component, too.
- */
-static char *relative_url(const char *remote_url,
-				const char *url,
-				const char *up_path)
-{
-	int is_relative = 0;
-	int colonsep = 0;
-	char *out;
-	char *remoteurl = xstrdup(remote_url);
-	struct strbuf sb = STRBUF_INIT;
-	size_t len = strlen(remoteurl);
-
-	if (is_dir_sep(remoteurl[len-1]))
-		remoteurl[len-1] = '\0';
-
-	if (!url_is_local_not_ssh(remoteurl) || is_absolute_path(remoteurl))
-		is_relative = 0;
-	else {
-		is_relative = 1;
-		/*
-		 * Prepend a './' to ensure all relative
-		 * remoteurls start with './' or '../'
-		 */
-		if (!starts_with_dot_slash(remoteurl) &&
-		    !starts_with_dot_dot_slash(remoteurl)) {
-			strbuf_reset(&sb);
-			strbuf_addf(&sb, "./%s", remoteurl);
-			free(remoteurl);
-			remoteurl = strbuf_detach(&sb, NULL);
-		}
-	}
-	/*
-	 * When the url starts with '../', remove that and the
-	 * last directory in remoteurl.
-	 */
-	while (url) {
-		if (starts_with_dot_dot_slash(url)) {
-			url += 3;
-			colonsep |= chop_last_dir(&remoteurl, is_relative);
-		} else if (starts_with_dot_slash(url))
-			url += 2;
-		else
-			break;
-	}
-	strbuf_reset(&sb);
-	strbuf_addf(&sb, "%s%s%s", remoteurl, colonsep ? ":" : "/", url);
-	if (ends_with(url, "/"))
-		strbuf_setlen(&sb, sb.len - 1);
-	free(remoteurl);
-
-	if (starts_with_dot_slash(sb.buf))
-		out = xstrdup(sb.buf + 2);
-	else
-		out = xstrdup(sb.buf);
-
-	if (!up_path || !is_relative) {
-		strbuf_release(&sb);
-		return out;
-	}
-
-	strbuf_reset(&sb);
-	strbuf_addf(&sb, "%s%s", up_path, out);
-	free(out);
-	return strbuf_detach(&sb, NULL);
-}
-
 static char *resolve_relative_url(const char *rel_url, const char *up_path, int quiet)
 {
 	char *remoteurl, *resolved_url;
@@ -592,6 +463,18 @@ static int module_foreach(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
+static int starts_with_dot_slash(const char *const path)
+{
+	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_SLASH |
+				PATH_MATCH_XPLATFORM);
+}
+
+static int starts_with_dot_dot_slash(const char *const path)
+{
+	return path_match_flags(path, PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH |
+				PATH_MATCH_XPLATFORM);
+}
+
 struct init_cb {
 	const char *prefix;
 	const char *superprefix;
diff --git a/remote.c b/remote.c
index 42a4e7106e1..87656138645 100644
--- a/remote.c
+++ b/remote.c
@@ -14,6 +14,7 @@
 #include "strvec.h"
 #include "commit-reach.h"
 #include "advice.h"
+#include "connect.h"
 
 enum map_direction { FROM_SRC, FROM_DST };
 
@@ -2727,3 +2728,93 @@ void remote_state_clear(struct remote_state *remote_state)
 	hashmap_clear_and_free(&remote_state->remotes_hash, struct remote, ent);
 	hashmap_clear_and_free(&remote_state->branches_hash, struct remote, ent);
 }
+
+/*
+ * Returns 1 if it was the last chop before ':'.
+ */
+static int chop_last_dir(char **remoteurl, int is_relative)
+{
+	char *rfind = find_last_dir_sep(*remoteurl);
+	if (rfind) {
+		*rfind = '\0';
+		return 0;
+	}
+
+	rfind = strrchr(*remoteurl, ':');
+	if (rfind) {
+		*rfind = '\0';
+		return 1;
+	}
+
+	if (is_relative || !strcmp(".", *remoteurl))
+		die(_("cannot strip one component off url '%s'"),
+			*remoteurl);
+
+	free(*remoteurl);
+	*remoteurl = xstrdup(".");
+	return 0;
+}
+
+char *relative_url(const char *remote_url, const char *url,
+		   const char *up_path)
+{
+	int is_relative = 0;
+	int colonsep = 0;
+	char *out;
+	char *remoteurl = xstrdup(remote_url);
+	struct strbuf sb = STRBUF_INIT;
+	size_t len = strlen(remoteurl);
+
+	if (is_dir_sep(remoteurl[len-1]))
+		remoteurl[len-1] = '\0';
+
+	if (!url_is_local_not_ssh(remoteurl) || is_absolute_path(remoteurl))
+		is_relative = 0;
+	else {
+		is_relative = 1;
+		/*
+		 * Prepend a './' to ensure all relative
+		 * remoteurls start with './' or '../'
+		 */
+		if (!starts_with_dot_slash_native(remoteurl) &&
+		    !starts_with_dot_dot_slash_native(remoteurl)) {
+			strbuf_reset(&sb);
+			strbuf_addf(&sb, "./%s", remoteurl);
+			free(remoteurl);
+			remoteurl = strbuf_detach(&sb, NULL);
+		}
+	}
+	/*
+	 * When the url starts with '../', remove that and the
+	 * last directory in remoteurl.
+	 */
+	while (url) {
+		if (starts_with_dot_dot_slash_native(url)) {
+			url += 3;
+			colonsep |= chop_last_dir(&remoteurl, is_relative);
+		} else if (starts_with_dot_slash_native(url))
+			url += 2;
+		else
+			break;
+	}
+	strbuf_reset(&sb);
+	strbuf_addf(&sb, "%s%s%s", remoteurl, colonsep ? ":" : "/", url);
+	if (ends_with(url, "/"))
+		strbuf_setlen(&sb, sb.len - 1);
+	free(remoteurl);
+
+	if (starts_with_dot_slash_native(sb.buf))
+		out = xstrdup(sb.buf + 2);
+	else
+		out = xstrdup(sb.buf);
+
+	if (!up_path || !is_relative) {
+		strbuf_release(&sb);
+		return out;
+	}
+
+	strbuf_reset(&sb);
+	strbuf_addf(&sb, "%s%s", up_path, out);
+	free(out);
+	return strbuf_detach(&sb, NULL);
+}
diff --git a/remote.h b/remote.h
index 4a1209ae2c8..f18fd27e530 100644
--- a/remote.h
+++ b/remote.h
@@ -409,4 +409,35 @@ int parseopt_push_cas_option(const struct option *, const char *arg, int unset);
 int is_empty_cas(const struct push_cas_option *);
 void apply_push_cas(struct push_cas_option *, struct remote *, struct ref *);
 
+/*
+ * The `url` argument is the URL that navigates to the submodule origin
+ * repo. When relative, this URL is relative to the superproject origin
+ * URL repo. The `up_path` argument, if specified, is the relative
+ * path that navigates from the submodule working tree to the superproject
+ * working tree. Returns the origin URL of the submodule.
+ *
+ * Return either an absolute URL or filesystem path (if the superproject
+ * origin URL is an absolute URL or filesystem path, respectively) or a
+ * relative file system path (if the superproject origin URL is a relative
+ * file system path).
+ *
+ * When the output is a relative file system path, the path is either
+ * relative to the submodule working tree, if up_path is specified, or to
+ * the superproject working tree otherwise.
+ *
+ * NEEDSWORK: This works incorrectly on the domain and protocol part.
+ * remote_url      url              outcome          expectation
+ * http://a.com/b  ../c             http://a.com/c   as is
+ * http://a.com/b/ ../c             http://a.com/c   same as previous line, but
+ *                                                   ignore trailing slash in url
+ * http://a.com/b  ../../c          http://c         error out
+ * http://a.com/b  ../../../c       http:/c          error out
+ * http://a.com/b  ../../../../c    http:c           error out
+ * http://a.com/b  ../../../../../c    .:c           error out
+ * NEEDSWORK: Given how chop_last_dir() works, this function is broken
+ * when a local part has a colon in its path component, too.
+ */
+char *relative_url(const char *remote_url, const char *url,
+		   const char *up_path);
+
 #endif
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (5 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 06/36] remote: move relative_url() Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
                       ` (29 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

From: Derrick Stolee <derrickstolee@github.com>

When the 'url' parameter was absolute, the previous implementation would
concatenate 'remote_url' with 'url'. Instead, we want to return 'url' in
this case.

The documentation now discusses what happens when supplying two
absolute URLs.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
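A minimal standalone sketch of the new early return, under simplifying
assumptions: sketch_url_is_absolute() is a hypothetical and much cruder test
than the url_is_local_not_ssh()/is_absolute_path() pair the patch uses, and
the relative branch is reduced to plain concatenation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Crude assumption: "absolute" means a leading '/' or a "scheme://". */
static int sketch_url_is_absolute(const char *url)
{
	return url[0] == '/' || strstr(url, "://") != NULL;
}

/* Returns a malloc()'d string the caller frees, or NULL on allocation failure. */
static char *sketch_resolve(const char *base, const char *url)
{
	size_t n;
	char *out;

	if (sketch_url_is_absolute(url)) {
		/* An absolute 'url' is returned as is; 'base' is ignored. */
		out = malloc(strlen(url) + 1);
		if (out)
			strcpy(out, url);
		return out;
	}

	n = strlen(base) + 1 + strlen(url) + 1;
	out = malloc(n);
	if (out)
		snprintf(out, n, "%s/%s", base, url);
	return out;
}
```

This corresponds to the row added to the table in remote.h: resolving
"http://d.org/e" against "http://a.com/b" yields "http://d.org/e".
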
 remote.c | 12 ++++++++++--
 remote.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/remote.c b/remote.c
index 87656138645..7576f673fcd 100644
--- a/remote.c
+++ b/remote.c
@@ -2761,10 +2761,18 @@ char *relative_url(const char *remote_url, const char *url,
 	int is_relative = 0;
 	int colonsep = 0;
 	char *out;
-	char *remoteurl = xstrdup(remote_url);
+	char *remoteurl;
 	struct strbuf sb = STRBUF_INIT;
-	size_t len = strlen(remoteurl);
+	size_t len;
+
+	if (!url_is_local_not_ssh(url) || is_absolute_path(url))
+		return xstrdup(url);
+
+	len = strlen(remote_url);
+	if (!len)
+		BUG("invalid empty remote_url");
 
+	remoteurl = xstrdup(remote_url);
 	if (is_dir_sep(remoteurl[len-1]))
 		remoteurl[len-1] = '\0';
 
diff --git a/remote.h b/remote.h
index f18fd27e530..dd4402436f1 100644
--- a/remote.h
+++ b/remote.h
@@ -434,6 +434,7 @@ void apply_push_cas(struct push_cas_option *, struct remote *, struct ref *);
  * http://a.com/b  ../../../c       http:/c          error out
  * http://a.com/b  ../../../../c    http:c           error out
  * http://a.com/b  ../../../../../c    .:c           error out
+ * http://a.com/b  http://d.org/e   http://d.org/e   as is
  * NEEDSWORK: Given how chop_last_dir() works, this function is broken
  * when a local part has a colon in its path component, too.
  */
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (6 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
                       ` (28 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Change the parse_bundle_header() function to be non-static, and rename
it to read_bundle_header_fd(). The read_bundle_header() function is
already public, and it's a thin wrapper around this function. This
will be used by code that wants to pass a fd to the bundle API.
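
The "path wrapper around an fd function" layering can be sketched as
follows. This is a simplified illustration rather than the bundle.c
code: sketch_bundle_version_fd() and sketch_bundle_version() are
hypothetical names, and the sketch only inspects the signature line
("# v2 git bundle" or "# v3 git bundle") instead of parsing a full
header. Unlike read_bundle_header_fd(), it leaves closing the fd to the
caller.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Read the first line from 'fd' and return 2 or 3 for a recognized
 * bundle signature, or -1 otherwise. The fd is not closed, so the
 * caller may hand in a pipe or socket as well as a regular file.
 */
int sketch_bundle_version_fd(int fd)
{
	char line[32];
	size_t len = 0;
	char ch;

	while (len < sizeof(line) - 1 && read(fd, &ch, 1) == 1) {
		line[len++] = ch;
		if (ch == '\n')
			break;
	}
	line[len] = '\0';

	if (!strcmp(line, "# v2 git bundle\n"))
		return 2;
	if (!strcmp(line, "# v3 git bundle\n"))
		return 3;
	return -1;
}

/* Thin path-based wrapper: open, delegate to the fd version, close. */
int sketch_bundle_version(const char *path)
{
	int fd = open(path, O_RDONLY);
	int version;

	if (fd < 0)
		return -1;
	version = sketch_bundle_version_fd(fd);
	close(fd);
	return version;
}
```

The benefit of exposing the fd variant is that a caller holding a
descriptor that was never a path on disk (for example the read end of a
pipe fed by a download) can reuse the same parsing code.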

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 bundle.c | 8 ++++----
 bundle.h | 2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/bundle.c b/bundle.c
index d50cfb5aa7e..5fa41a52f11 100644
--- a/bundle.c
+++ b/bundle.c
@@ -66,8 +66,8 @@ static int parse_bundle_signature(struct bundle_header *header, const char *line
 	return -1;
 }
 
-static int parse_bundle_header(int fd, struct bundle_header *header,
-			       const char *report_path)
+int read_bundle_header_fd(int fd, struct bundle_header *header,
+			  const char *report_path)
 {
 	struct strbuf buf = STRBUF_INIT;
 	int status = 0;
@@ -143,7 +143,7 @@ int read_bundle_header(const char *path, struct bundle_header *header)
 
 	if (fd < 0)
 		return error(_("could not open '%s'"), path);
-	return parse_bundle_header(fd, header, path);
+	return read_bundle_header_fd(fd, header, path);
 }
 
 int is_bundle(const char *path, int quiet)
@@ -153,7 +153,7 @@ int is_bundle(const char *path, int quiet)
 
 	if (fd < 0)
 		return 0;
-	fd = parse_bundle_header(fd, &header, quiet ? NULL : path);
+	fd = read_bundle_header_fd(fd, &header, quiet ? NULL : path);
 	if (fd >= 0)
 		close(fd);
 	bundle_header_release(&header);
diff --git a/bundle.h b/bundle.h
index 7fef2108f43..0c052f54964 100644
--- a/bundle.h
+++ b/bundle.h
@@ -24,6 +24,8 @@ void bundle_header_release(struct bundle_header *header);
 
 int is_bundle(const char *path, int quiet);
 int read_bundle_header(const char *path, struct bundle_header *header);
+int read_bundle_header_fd(int fd, struct bundle_header *header,
+			  const char *report_path);
 int create_bundle(struct repository *r, const char *path,
 		  int argc, const char **argv, struct strvec *pack_options,
 		  int version);
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (7 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
                       ` (27 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a skeleton server-side implementation of a new "bundle-uri"
command to protocol v2. This will allow conforming clients to
optionally seed their initial clones or incremental fetches from URLs
containing "*.bundle" files created with "git bundle create".

The use-cases are similar to those of the existing "Packfile URIs",
and the two features can be combined within a single request, but
"bundle-uri" has a few advantages over packfile-uris in some
common scenarios, discussed below.

This change does not give us a working "bundle-uri" client; subsequent
commits will do that. Let's first establish what the protocol for this
should be like. The client implementation will then implement
this specification.

With this change when the uploadpack.bundleURI config is set to a
URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then
when the client requests "bundle-uri" emit those URIs back at them.
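
As an illustration of what "emit those URIs back" looks like on the
wire, here is a minimal sketch of the pkt-line framing used for the
response: one pkt-line per URI, followed by a flush packet ("0000").
The function names and buffer handling are assumptions made for this
example; the patch itself uses packet_writer_write() and
packet_writer_flush().

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_FLUSH_PKT "0000"
#define SKETCH_PKT_MAX 65520 /* maximum total pkt-line length */

/*
 * Encode 'payload' as one pkt-line: four hex digits giving the total
 * length (payload plus the four length bytes), then the payload.
 * Returns the number of bytes written, or -1 if it does not fit.
 */
int sketch_pkt_line(char *out, size_t outsz, const char *payload)
{
	size_t total = strlen(payload) + 4;

	if (total > SKETCH_PKT_MAX || total + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04zx%s", total, payload);
	return (int)total;
}

/* Emit every URI as its own pkt-line, then terminate with a flush. */
int sketch_bundle_uri_response(char *out, size_t outsz,
			       const char **uris, size_t nr)
{
	size_t used = 0, i;

	for (i = 0; i < nr; i++) {
		int n = sketch_pkt_line(out + used, outsz - used, uris[i]);

		if (n < 0)
			return -1;
		used += (size_t)n;
	}
	if (used + 4 + 1 > outsz)
		return -1;
	memcpy(out + used, SKETCH_FLUSH_PKT, 5); /* includes the NUL */
	return (int)(used + 4);
}
```

The tests in t5701-git-serve.sh below show the same shape after
"test-tool pkt-line unpack": the configured URIs, one per line, and a
trailing "0000".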

Differences between this and the existing packfile-uri facility:

 A. There is no "real" support for packfile-uri in git.git. The
    uploadpack.blobPackfileUri setting allows carving out a list of
    blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc:
    clarify contents of packfile sent as URI, 2021-01-20) the only
    "real" implementation is JGit based.

 B. The uploadpack.blobPackfileUri is a MUST where this is a
    "CAN". I.e. once a client says it supports packfile-uri for a given
    list of protocols, the server will send it a PACK response that
    assumes it has downloaded the URI it was sent. If the client
    doesn't do that, it doesn't have a valid repository.

    Pointing at a bundle and having the client send us "have"
    lines (or not, maybe they couldn't fetch it, or decided they
    didn't want to) is more flexible, and can gracefully recover
    e.g. if the CDN isn't reachable (maybe you do support "https", but
    the CDN provider is down, or blocked your whole country).

 C. The client, after executing "ls-refs" will disconnect if it has
    also grabbed the "bundle-uris" and knows the server won't send it
    anything it doesn't already have (or expect to have, if it's
    downloading the bundles concurrent to an early disconnect).

    This is in (small) contrast to packfile-uri where a client would
    enter a negotiation dialog, which may or may not result in a
    packfile-uri and/or an inline PACK.

 D. Because of "C" clients can, if the bundles are up-to-date, get an
    up-to-date repository with just "bundle-uri" and "ls-refs" commands,
    with no need to enter a dialog with "git upload-pack".

    That small dialog is unlikely to matter for performance purposes,
    this section is just noting differences between "bundle-uri" and
    "packfile-uri".

As noted above, the features are compatible: a client that supports
"bundle-uri" and "packfile-uri" might download a bundle, and then
proceed with a "fetch" dialog. That dialog might then result in a
"packfile-uri" response.

In practice server operators are unlikely to want to mix the two,
since the main benefit of either approach is the ability to offload
large "clone" responses to CDNs. A server operator would have little
reason not to go with one approach or the other.

There was a suggestion of implementing a similar feature long ago[1]
by Jeff King. The main difference between it and this approach is that
we've since gained protocol v2, so we can add this as an optional path
in the dialog between client and server. The 2011 implementation
hooked into the transport mechanism to try to clone from a bundle
directly. See also [2] and [3] for some later mentions of that
approach.

See also [4] for the series that implemented
uploadpack.blobPackfileUri, and [5] for a series on top that did the
.gitmodules check in that context. See [6] for the "ls-refs unborn"
feature which modified code in similar areas of the request flow.

Finally, there's currently a concurrent (submitted after the v1 of
this commit, but before the subsequent client parts of this
implementation) RFC of a somewhat similar "bundle-uri" facility at
[7].

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/
   Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25)
5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/
   Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
   2021-03-01)
6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)
7. https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile             |   1 +
 bundle-uri.c         |  55 +++++++++++++++++++
 bundle-uri.h         |  13 +++++
 serve.c              |   6 +++
 t/t5701-git-serve.sh | 124 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

diff --git a/Makefile b/Makefile
index f8bccfab5e9..8f27310836d 100644
--- a/Makefile
+++ b/Makefile
@@ -887,6 +887,7 @@ LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..ff054ddc690
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,55 @@
+#include "cache.h"
+#include "bundle-uri.h"
+#include "pkt-line.h"
+#include "config.h"
+
+static void send_bundle_uris(struct packet_writer *writer,
+			     struct string_list *uris)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, uris)
+		packet_writer_write(writer, "%s", item->string);
+}
+
+static int advertise_bundle_uri = -1;
+static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "uploadpack.bundleuri")) {
+		advertise_bundle_uri = 1;
+		string_list_append(&bundle_uris, value);
+	}
+
+	return 0;
+}
+
+int bundle_uri_advertise(struct repository *r, struct strbuf *value)
+{
+	if (advertise_bundle_uri != -1)
+		goto cached;
+
+	git_config(bundle_uri_config, NULL);
+	advertise_bundle_uri = !!bundle_uris.nr;
+
+cached:
+	return advertise_bundle_uri;
+}
+
+int bundle_uri_command(struct repository *r,
+		       struct packet_reader *request)
+{
+	struct packet_writer writer;
+	packet_writer_init(&writer, 1);
+
+	while (packet_reader_read(request) == PACKET_READ_NORMAL)
+		die(_("bundle-uri: unexpected argument: '%s'"), request->line);
+	if (request->status != PACKET_READ_FLUSH)
+		die(_("bundle-uri: expected flush after arguments"));
+
+	send_bundle_uris(&writer, &bundle_uris);
+
+	packet_writer_flush(&writer);
+
+	return 0;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..5a7e556a0ba
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,13 @@
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+#include "repository.h"
+#include "pkt-line.h"
+#include "strbuf.h"
+
+/**
+ * API used by serve.[ch].
+ */
+int bundle_uri_advertise(struct repository *r, struct strbuf *value);
+int bundle_uri_command(struct repository *r, struct packet_reader *request);
+
+#endif /* BUNDLE_URI_H */
diff --git a/serve.c b/serve.c
index b3fe9b5126a..f3e0203d2c6 100644
--- a/serve.c
+++ b/serve.c
@@ -8,6 +8,7 @@
 #include "protocol-caps.h"
 #include "serve.h"
 #include "upload-pack.h"
+#include "bundle-uri.h"
 
 static int advertise_sid = -1;
 static int client_hash_algo = GIT_HASH_SHA1;
@@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "bundle-uri",
+		.advertise = bundle_uri_advertise,
+		.command = bundle_uri_command,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 1896f671cb3..9d053f77a93 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -13,7 +13,7 @@ test_expect_success 'test capability advertisement' '
 	wrong_algo sha1:sha256
 	wrong_algo sha256:sha1
 	EOF
-	cat >expect <<-EOF &&
+	cat >expect.base <<-EOF &&
 	version 2
 	agent=git/$(git version | cut -d" " -f3)
 	ls-refs=unborn
@@ -21,8 +21,11 @@ test_expect_success 'test capability advertisement' '
 	server-option
 	object-format=$(test_oid algo)
 	object-info
+	EOF
+	cat >expect.trailer <<-EOF &&
 	0000
 	EOF
+	cat expect.base expect.trailer >expect &&
 
 	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
 		--advertise-capabilities >out &&
@@ -342,4 +345,123 @@ test_expect_success 'basics of object-info' '
 	test_cmp expect actual
 '
 
+# Test the basics of bundle-uri
+#
+test_expect_success 'test capability advertisement with uploadpack.bundleURI' '
+	test_config uploadpack.bundleURI FAKE &&
+
+	cat >expect.extra <<-EOF &&
+	bundle-uri
+	EOF
+	cat expect.base \
+	    expect.extra \
+	    expect.trailer >expect &&
+
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: dies if not enabled' '
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	cat >expect <<-\EOF &&
+	ERR serve: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with two URIs' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+	test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	https://cdn.example.com/recent.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: unknown future feature(s)' '
+	test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0001
+	some-feature
+	we-do-not
+	know=about
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
 test_done
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (8 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
                       ` (26 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a "test-tool bundle-uri parse" which parses the format defined in
the newly specified "bundle-uri" command.

As noted in the "bundle-uri" section in protocol-v2.txt we haven't
specified any key-values yet, just URI lines, but we should parse
their format for conformity with the spec.

We need to make sure our future client doesn't die if this optional
data is ever provided by the server, and that we've covered all the
edge cases with these key-values in our specification. Let's add and
test a bundle_uri_parse_line() to do that.
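
The line format being parsed, "<uri>[ <key>[=<value>]...]", can be
sketched without the string_list machinery. This is a simplified
stand-in, not bundle_uri_parse_line() itself: the struct names,
sketch_parse_line() and the fixed-size buffers are assumptions made to
keep the example self-contained. It mirrors the same rejection rules:
an empty line, an empty URI, an empty column (two SPs in a row) and
more than one '=' in a column are all errors.

```c
#include <string.h>

#define SKETCH_MAX_ATTRS 8

struct sketch_attr {
	char key[64];
	char value[64]; /* only meaningful when has_value is set */
	int has_value;
};

struct sketch_bundle_uri {
	char uri[256];
	struct sketch_attr attrs[SKETCH_MAX_ATTRS];
	size_t nr;
};

/* Copy src[0..len) into dst as a C string; -1 if it does not fit. */
static int copy_token(char *dst, size_t dstsz, const char *src, size_t len)
{
	if (len >= dstsz)
		return -1;
	memcpy(dst, src, len);
	dst[len] = '\0';
	return 0;
}

/* Returns 0 on success, -1 on error ('out' may be partially filled). */
int sketch_parse_line(const char *line, struct sketch_bundle_uri *out)
{
	const char *p = line;
	size_t len = strcspn(p, " ");

	memset(out, 0, sizeof(*out));
	if (!*line)
		return -1; /* empty line */
	if (!len || copy_token(out->uri, sizeof(out->uri), p, len))
		return -1; /* empty (or oversized) URI component */
	p += len;

	while (*p == ' ') {
		struct sketch_attr *a;
		const char *eq;

		p++; /* skip exactly one SP separator */
		len = strcspn(p, " ");
		if (!len)
			return -1; /* empty column */
		if (out->nr == SKETCH_MAX_ATTRS)
			return -1; /* limit of this sketch only */
		a = &out->attrs[out->nr];
		eq = memchr(p, '=', len);
		if (!eq) {
			/* bare attribute without a value */
			if (copy_token(a->key, sizeof(a->key), p, len))
				return -1;
		} else {
			size_t klen = (size_t)(eq - p);
			size_t vlen = len - klen - 1;

			if (memchr(eq + 1, '=', vlen))
				return -1; /* more than one '=' */
			if (copy_token(a->key, sizeof(a->key), p, klen) ||
			    copy_token(a->value, sizeof(a->value), eq + 1, vlen))
				return -1;
			a->has_value = 1;
		}
		out->nr++;
		p += len;
	}
	return 0;
}
```

The inputs exercised in t5750-bundle-uri-parse.sh below (bare
attributes, key=value pairs, doubled SPs, "k=v=extra") map directly
onto the branches of this sketch.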

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                    |   1 +
 bundle-uri.c                | 124 +++++++++++++++++++++++++++++
 bundle-uri.h                |  16 ++++
 t/helper/test-bundle-uri.c  |  83 +++++++++++++++++++
 t/helper/test-tool.c        |   1 +
 t/helper/test-tool.h        |   1 +
 t/t5750-bundle-uri-parse.sh | 153 ++++++++++++++++++++++++++++++++++++
 7 files changed, 379 insertions(+)
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100755 t/t5750-bundle-uri-parse.sh

diff --git a/Makefile b/Makefile
index 8f27310836d..c8a14793005 100644
--- a/Makefile
+++ b/Makefile
@@ -706,6 +706,7 @@ PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 TEST_BUILTINS_OBJS += test-advise.o
 TEST_BUILTINS_OBJS += test-bitmap.o
 TEST_BUILTINS_OBJS += test-bloom.o
+TEST_BUILTINS_OBJS += test-bundle-uri.o
 TEST_BUILTINS_OBJS += test-chmtime.o
 TEST_BUILTINS_OBJS += test-config.o
 TEST_BUILTINS_OBJS += test-crontab.o
diff --git a/bundle-uri.c b/bundle-uri.c
index ff054ddc690..33386769f55 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -53,3 +53,127 @@ int bundle_uri_command(struct repository *r,
 
 	return 0;
 }
+
+/**
+ * General API for {transport,connect}.c etc.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line)
+{
+	size_t i;
+	struct string_list columns = STRING_LIST_INIT_DUP;
+	const char *uri;
+	struct string_list *uri_columns = NULL;
+	int ret = 0;
+
+	if (!strlen(line))
+		return error(_("bundle-uri: got an empty line"));
+
+	/*
+	 * Right now we don't understand anything beyond the first SP,
+	 * but let's be tolerant and ignore any future unknown
+	 * fields. See the "MUST" note about "bundle-feature-key" in
+	 * Documentation/technical/protocol-v2.txt
+	 */
+	if (string_list_split(&columns, line, ' ', -1) < 1)
+		return error(_("bundle-uri: line not in SP-delimited format: %s"), line);
+
+	/*
+	 * We represent a "<uri>[ <key-values>...]" line with the URI
+	 * being the .string in a string list, and the .util being an
+	 * optional string list of key (.string) and values
+	 * (.util). If the top-level .util is NULL there's no
+	 * key-value pairs....
+	 */
+	uri = columns.items[0].string;
+	if (!strlen(uri)) {
+		ret = error(_("bundle-uri: got an empty URI component"));
+		goto cleanup;
+	}
+
+	/*
+	 * ... we're going to need that non-NULL .util .
+	 */
+	if (columns.nr > 1) {
+		uri_columns = xcalloc(1, sizeof(struct string_list));
+		string_list_init_dup(uri_columns);
+	}
+
+	/*
+	 * Let's parse the optional "kv" format, even if we don't
+	 * understand any of the keys or values yet.
+	 */
+	for (i = 1; i < columns.nr; i++) {
+		struct string_list kv = STRING_LIST_INIT_DUP;
+		const char *arg = columns.items[i].string;
+		int fields = string_list_split(&kv, arg, '=', 2);
+		int err = 0;
+
+		switch (fields) {
+		case 0:
+			BUG("should have no fields=0");
+		case 1:
+			if (!strlen(arg)) {
+				err = error("bundle-uri: column %"PRIuMAX": got an empty attribute (full line was '%s')",
+					    (uintmax_t)i, line);
+				break;
+			}
+			/*
+			 * We could dance around with
+			 * string_list_append_nodup() and skip
+			 * string_list_clear(&kv, 0) here, but let's
+			 * keep it simple.
+			 */
+			string_list_append(uri_columns, arg);
+			break;
+		case 2:
+		{
+			const char *k = kv.items[0].string;
+			const char *v = kv.items[1].string;
+
+			string_list_append(uri_columns, k)->util = xstrdup(v);
+			break;
+		}
+		default:
+			err = error("bundle-uri: column %"PRIuMAX": '%s' more than one '=' character (full line was '%s')",
+				    (uintmax_t)i, arg, line);
+			break;
+		}
+
+		string_list_clear(&kv, 0);
+		if (err) {
+			ret = err;
+			break;
+		}
+	}
+
+
+	/*
+	 * Per the spec we'll only consider bundle-uri lines OK if
+	 * there were no parsing problems, even if the problems were
+	 * with attributes whose content we don't understand.
+	 */
+	if (ret && uri_columns) {
+		string_list_clear(uri_columns, 1);
+		free(uri_columns);
+	} else if (!ret) {
+		string_list_append(bundle_uri, uri)->util = uri_columns;
+	}
+
+cleanup:
+	string_list_clear(&columns, 0);
+	return ret;
+}
+
+static void bundle_uri_string_list_clear_cb(void *util, const char *string)
+{
+	struct string_list *fields = util;
+	if (!fields)
+		return;
+	string_list_clear(fields, 1);
+	free(fields);
+}
+
+void bundle_uri_string_list_clear(struct string_list *bundle_uri)
+{
+	string_list_clear_func(bundle_uri, bundle_uri_string_list_clear_cb);
+}
diff --git a/bundle-uri.h b/bundle-uri.h
index 5a7e556a0ba..be6d1df97ff 100644
--- a/bundle-uri.h
+++ b/bundle-uri.h
@@ -3,6 +3,7 @@
 #include "repository.h"
 #include "pkt-line.h"
 #include "strbuf.h"
+#include "string-list.h"
 
 /**
  * API used by serve.[ch].
@@ -10,4 +11,19 @@
 int bundle_uri_advertise(struct repository *r, struct strbuf *value);
 int bundle_uri_command(struct repository *r, struct packet_reader *request);
 
+/**
+ * General API for {transport,connect}.c etc.
+ */
+
+/**
+ * bundle_uri_parse_line() returns 0 when a valid bundle-uri has been
+ * added to `bundle_uri`, <0 on error.
+ */
+int bundle_uri_parse_line(struct string_list *bundle_uri, const char *line);
+
+/**
+ * Clear the `bundle_uri` list. Just a very thin wrapper on
+ * string_list_clear().
+ */
+void bundle_uri_string_list_clear(struct string_list *bundle_uri);
 #endif /* BUNDLE_URI_H */
diff --git a/t/helper/test-bundle-uri.c b/t/helper/test-bundle-uri.c
new file mode 100644
index 00000000000..805a86c0130
--- /dev/null
+++ b/t/helper/test-bundle-uri.c
@@ -0,0 +1,83 @@
+#include "test-tool.h"
+#include "parse-options.h"
+#include "bundle-uri.h"
+#include "strbuf.h"
+#include "string-list.h"
+
+static int cmd__bundle_uri_parse(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri parse <in",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+	struct strbuf sb = STRBUF_INIT;
+	struct string_list list = STRING_LIST_INIT_DUP;
+	int err = 0;
+	struct string_list_item *item;
+	size_t line_nr = 0;
+
+	argc = parse_options(argc, argv, NULL, options, usage, 0);
+	if (argc)
+		goto usage;
+
+	while (strbuf_getline(&sb, stdin) != EOF) {
+		line_nr++;
+		if (bundle_uri_parse_line(&list, sb.buf) < 0)
+			err = error("bad line: '%s'", sb.buf);
+	}
+
+	for_each_string_list_item(item, &list) {
+		struct string_list_item *kv_item;
+		struct string_list *kv = item->util;
+
+		fprintf(stdout, "%s", item->string);
+		if (!kv) {
+			fprintf(stdout, "\n");
+			continue;
+		}
+		for_each_string_list_item(kv_item, kv) {
+			const char *k = kv_item->string;
+			const char *v = kv_item->util;
+
+			if (v)
+				fprintf(stdout, " [kv: %s => %s]", k, v);
+			else
+				fprintf(stdout, " [attr: %s]", k);
+		}
+		fprintf(stdout, "\n");
+	}
+	strbuf_release(&sb);
+
+	bundle_uri_string_list_clear(&list);
+
+	return err < 0 ? 1 : 0;
+usage:
+	usage_with_options(usage, options);
+}
+
+int cmd__bundle_uri(int argc, const char **argv)
+{
+	const char *usage[] = {
+		"test-tool bundle-uri <subcommand> [<options>]",
+		NULL
+	};
+	struct option options[] = {
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL, options, usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION |
+			     PARSE_OPT_KEEP_ARGV0);
+	if (argc == 1)
+		goto usage;
+
+	if (!strcmp(argv[1], "parse"))
+		return cmd__bundle_uri_parse(argc - 1, argv + 1);
+	error("there is no test-tool bundle-uri tool '%s'", argv[1]);
+
+usage:
+	usage_with_options(usage, options);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 0424f7adf5d..bff823fbd3e 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -17,6 +17,7 @@ static struct test_cmd cmds[] = {
 	{ "advise", cmd__advise_if_enabled },
 	{ "bitmap", cmd__bitmap },
 	{ "bloom", cmd__bloom },
+	{ "bundle-uri", cmd__bundle_uri },
 	{ "chmtime", cmd__chmtime },
 	{ "config", cmd__config },
 	{ "crontab", cmd__crontab },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c876e8246fb..eb747e07dd1 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -7,6 +7,7 @@
 int cmd__advise_if_enabled(int argc, const char **argv);
 int cmd__bitmap(int argc, const char **argv);
 int cmd__bloom(int argc, const char **argv);
+int cmd__bundle_uri(int argc, const char **argv);
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
 int cmd__crontab(int argc, const char **argv);
diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
new file mode 100755
index 00000000000..70fd1b398e9
--- /dev/null
+++ b/t/t5750-bundle-uri-parse.sh
@@ -0,0 +1,153 @@
+#!/bin/sh
+
+test_description="Test bundle-uri bundle_uri_parse_line()"
+
+TEST_NO_CREATE_REPO=1
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'bundle_uri_parse_line() just URIs' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle.bdl
+	https://example.com/bundle.bdl
+	file:///usr/share/git/bundle.bdl
+	EOF
+
+	# For the simple case
+	cp in expect &&
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl attr
+	http://example.com/bundle2.bdl ibute
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: attr]
+	http://example.com/bundle2.bdl [attr: ibute]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() with attributes and key-value attributes' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl x a=b y c=d z e=f a=b
+	EOF
+
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: x] [kv: a => b] [attr: y] [kv: c => d] [attr: z] [kv: e => f] [kv: a => b]
+	EOF
+
+	test-tool bundle-uri parse <in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: extra SP' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl one-space
+	http://example.com/bundle2.bdl  two-space
+	http://example.com/bundle3.bdl   three-space
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle2.bdl  two-space'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl  two-space'\''
+	error: bundle-uri: column 1: got an empty attribute (full line was '\''http://example.com/bundle3.bdl   three-space'\'')
+	error: bad line: '\''http://example.com/bundle3.bdl   three-space'\''
+	EOF
+
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl [attr: one-space]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty lines' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+
+	http://example.com/bundle2.bdl a=b
+
+	http://example.com/bundle3.bdl
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	error: bundle-uri: got an empty line
+	error: bad line: '\'''\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle2.bdl [kv: a => b]
+	http://example.com/bundle3.bdl
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty URIs' '
+	sed "s/> //" >in <<-\EOF &&
+	http://example.com/bundle1.bdl
+	>  a=b
+	http://example.com/bundle3.bdl a=b
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: got an empty URI component
+	error: bad line: '\'' a=b'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle1.bdl
+	http://example.com/bundle3.bdl [kv: a => b]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'bundle_uri_parse_line() parsing edge cases: multiple = in key-values' '
+	cat >in <<-\EOF &&
+	http://example.com/bundle1.bdl k=v=extra
+	http://example.com/bundle2.bdl a=b k=v=extra c=d
+	http://example.com/bundle3.bdl kv=ok
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	error: bundle-uri: column 1: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle1.bdl k=v=extra'\'')
+	error: bad line: '\''http://example.com/bundle1.bdl k=v=extra'\''
+	error: bundle-uri: column 2: '\''k=v=extra'\'' more than one '\''='\'' character (full line was '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\'')
+	error: bad line: '\''http://example.com/bundle2.bdl a=b k=v=extra c=d'\''
+	EOF
+
+	# We fail, but try to continue parsing regardless
+	cat >expect <<-\EOF &&
+	http://example.com/bundle3.bdl [kv: kv => ok]
+	EOF
+
+	test_must_fail test-tool bundle-uri parse <in >actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (9 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
                       ` (25 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Set up all the needed client parts of the "bundle-uri" protocol
extension, without actually doing anything with the bundle URIs.

I.e. if the server says it supports "bundle-uri" we'll issue a
command=bundle-uri after command=ls-refs when we're cloning. We'll
parse the returned output using the code already tested for in
t5750-bundle-uri-parse.sh.
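
The decision described above can be sketched as a small planning
function: look for "bundle-uri" in the capability advertisement and, if
present, slot the command in after "ls-refs". All names here are
hypothetical and the sketch only models the ordering of commands, not
the transport code touched by this patch. It assumes a capability is
advertised either as a bare name or as "name=value", as in the
advertisement tested in t5701-git-serve.sh.

```c
#include <stddef.h>
#include <string.h>

/* Does the advertisement contain 'name', bare or as "name=..."? */
int sketch_server_supports(const char **caps, size_t nr, const char *name)
{
	size_t namelen = strlen(name), i;

	for (i = 0; i < nr; i++) {
		if (strncmp(caps[i], name, namelen))
			continue;
		if (caps[i][namelen] == '\0' || caps[i][namelen] == '=')
			return 1;
	}
	return 0;
}

enum sketch_step { SKETCH_LS_REFS, SKETCH_BUNDLE_URI, SKETCH_FETCH };

/*
 * Fill 'steps' (room for at least three entries) with the commands a
 * clone would issue, and return how many were planned.
 */
size_t sketch_plan_clone(const char **caps, size_t nr,
			 enum sketch_step *steps)
{
	size_t n = 0;

	steps[n++] = SKETCH_LS_REFS;
	if (sketch_server_supports(caps, nr, "bundle-uri"))
		steps[n++] = SKETCH_BUNDLE_URI; /* optional extra command */
	steps[n++] = SKETCH_FETCH;
	return n;
}
```

Against a server that does not advertise "bundle-uri" the plan is
unchanged (ls-refs, then fetch), which is what makes the extension
optional for both sides.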

What we aren't doing is actually acting on that data, i.e. downloading
the bundle(s) before we get to doing the command=fetch, and adjusting
our negotiation dialog appropriately. I'll do that in subsequent
commits.

There's a question of what level of encapsulation we should use here.
I've opted to use connect.h in clone.c, but we could also e.g. make
transport_get_remote_refs() invoke this, i.e. make it implicitly get
the bundle-uri list for later steps.

This approach means that we don't "support" this in "git fetch" for
now. I'm starting with the case of initial clones, although as noted
in preceding commits to the protocol documentation nothing about this
approach precludes getting bundles on incremental fetches.

For the t5732-protocol-v2-bundle-uri-http.sh it's not easy to set
environment variables for git-upload-pack (it's started by Apache), so
let's skip the test under T5730_HTTP, and add unused T5730_{FILE,GIT}
prerequisites for consistency and future use.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/clone.c                        |   7 ++
 bundle-uri.c                           |   4 +
 connect.c                              |  47 ++++++++
 remote.h                               |   4 +
 t/lib-t5730-protocol-v2-bundle-uri.sh  | 145 +++++++++++++++++++++++++
 t/t5730-protocol-v2-bundle-uri-file.sh |  36 ++++++
 t/t5731-protocol-v2-bundle-uri-git.sh  |  17 +++
 t/t5732-protocol-v2-bundle-uri-http.sh |  17 +++
 transport-helper.c                     |  13 +++
 transport-internal.h                   |   7 ++
 transport.c                            |  48 ++++++++
 transport.h                            |  18 +++
 12 files changed, 363 insertions(+)
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh

diff --git a/builtin/clone.c b/builtin/clone.c
index 52316563795..709f1502f91 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -27,6 +27,7 @@
 #include "iterator.h"
 #include "sigchain.h"
 #include "branch.h"
+#include "connect.h"
 #include "remote.h"
 #include "run-command.h"
 #include "connected.h"
@@ -1235,6 +1236,12 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (refs)
 		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
+	/*
+	 * Populate transport->got_remote_bundle_uri and
+	 * transport->bundle_uri. We might get nothing.
+	 */
+	transport_get_remote_bundle_uri(transport);
+
 	if (mapped_refs) {
 		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 
diff --git a/bundle-uri.c b/bundle-uri.c
index 33386769f55..09493140299 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -26,6 +26,10 @@ static int bundle_uri_config(const char *var, const char *value, void *data)
 
 int bundle_uri_advertise(struct repository *r, struct strbuf *value)
 {
+	if (value &&
+	    git_env_bool("GIT_TEST_BUNDLE_URI_UNKNOWN_CAPABILITY_VALUE", 0))
+		strbuf_addstr(value, "test-unknown-capability-value");
+
 	if (advertise_bundle_uri != -1)
 		goto cached;
 
diff --git a/connect.c b/connect.c
index e6d0b1d34bd..a8fdb5255f7 100644
--- a/connect.c
+++ b/connect.c
@@ -15,6 +15,7 @@
 #include "version.h"
 #include "protocol.h"
 #include "alias.h"
+#include "bundle-uri.h"
 
 static char *server_capabilities_v1;
 static struct strvec server_capabilities_v2 = STRVEC_INIT;
@@ -491,6 +492,52 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	}
 }
 
+int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
+			  struct string_list *bundle_uri, int stateless_rpc)
+{
+	int line_nr = 1;
+
+	/* Assert bundle-uri support */
+	server_supports_v2("bundle-uri", 1);
+
+	/* (Re-)send capabilities */
+	send_capabilities(fd_out, reader);
+
+	/* Send command */
+	packet_write_fmt(fd_out, "command=bundle-uri\n");
+	packet_delim(fd_out);
+
+	/* Send options */
+	if (git_env_bool("GIT_TEST_PROTOCOL_BAD_BUNDLE_URI", 0))
+		packet_write_fmt(fd_out, "test-bad-client\n");
+	packet_flush(fd_out);
+
+	/* Process response from server */
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		const char *line = reader->line;
+		line_nr++;
+
+		if (!bundle_uri_parse_line(bundle_uri, line))
+			continue;
+
+		return error(_("error on bundle-uri response line %d: %s"),
+			     line_nr, line);
+	}
+
+	if (reader->status != PACKET_READ_FLUSH)
+		return error(_("expected flush after bundle-uri listing"));
+
+	/*
+	 * Might die(), but obscure enough that that's OK, e.g. in
+	 * serve.c we'll call BUG() on its equivalent (the
+	 * PACKET_READ_RESPONSE_END check).
+	 */
+	check_stateless_delimiter(stateless_rpc, reader,
+				  _("expected response end packet after bundle-uri listing"));
+
+	return 0;
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
diff --git a/remote.h b/remote.h
index dd4402436f1..571338510a8 100644
--- a/remote.h
+++ b/remote.h
@@ -236,6 +236,10 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     const struct string_list *server_options,
 			     int stateless_rpc);
 
+/* Used for protocol v2 in order to retrieve bundle URIs from a remote */
+int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
+			  struct string_list *bundle_uri, int stateless_rpc);
+
 int resolve_remote_symref(struct ref *ref, struct ref *list);
 
 /*
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
new file mode 100644
index 00000000000..7a90c80f0b1
--- /dev/null
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -0,0 +1,145 @@
+# Included from t573*-protocol-v2-bundle-uri-*.sh
+
+T5730_PARENT=
+T5730_URI=
+T5730_BUNDLE_URI=
+case "$T5730_PROTOCOL" in
+file)
+	T5730_PARENT=file_parent
+	T5730_URI="file://$PWD/file_parent"
+	T5730_BUNDLE_URI="$T5730_URI/fake.bdl"
+	test_set_prereq T5730_FILE
+	;;
+git)
+	. "$TEST_DIRECTORY"/lib-git-daemon.sh
+	start_git_daemon --export-all --enable=receive-pack
+	T5730_PARENT="$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
+	T5730_URI="$GIT_DAEMON_URL/parent"
+	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	test_set_prereq T5730_GIT
+	;;
+http)
+	. "$TEST_DIRECTORY"/lib-httpd.sh
+	start_httpd
+	T5730_PARENT="$HTTPD_DOCUMENT_ROOT_PATH/http_parent"
+	T5730_URI="$HTTPD_URL/smart/http_parent"
+	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	test_set_prereq T5730_HTTP
+	;;
+*)
+	BUG "Need to pass valid T5730_PROTOCOL (was $T5730_PROTOCOL)"
+	;;
+esac
+
+test_expect_success "setup protocol v2 $T5730_PROTOCOL:// tests" '
+	git init "$T5730_PARENT" &&
+	test_commit -C "$T5730_PARENT" one
+'
+
+# Poor man's URI escaping. Good enough for the test suite whose trash
+# directory has a space in it. See 93c3fcbe4d4 (git-svn: attempt to
+# mimic SVN 1.7 URL canonicalization, 2012-07-28) for prior art.
+test_uri_escape() {
+	sed 's/ /%20/g'
+}
+
+case "$T5730_PROTOCOL" in
+http)
+	test_expect_success "setup config for $T5730_PROTOCOL:// tests" '
+		git -C "$T5730_PARENT" config http.receivepack true
+	'
+	;;
+*)
+	;;
+esac
+T5730_BUNDLE_URI_ESCAPED=$(echo "$T5730_BUNDLE_URI" | test_uri_escape)
+
+test_expect_success "connect with $T5730_PROTOCOL:// using protocol v2: no bundle-uri" '
+	test_when_finished "rm -f log" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	! grep bundle-uri log
+'
+
+test_expect_success "connect with $T5730_PROTOCOL:// using protocol v2: have bundle-uri" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" \
+		uploadpack.bundleURI "$T5730_BUNDLE_URI_ESCAPED" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	# Server advertised bundle-uri capability
+	grep bundle-uri log
+'
+
+test_expect_success !T5730_HTTP "bad client with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >err.expect <<-\EOF &&
+	Cloning into '"'"'child'"'"'...
+	EOF
+	case "$T5730_PROTOCOL" in
+	file)
+		cat >fatal-bundle-uri.expect <<-\EOF
+		fatal: bundle-uri: unexpected argument: '"'"'test-bad-client'"'"'
+		EOF
+		;;
+	*)
+		cat >fatal.expect <<-\EOF
+		fatal: read error: Connection reset by peer
+		EOF
+		;;
+	esac &&
+
+	test_when_finished "rm -rf child" &&
+	test_must_fail ok=sigpipe env \
+		GIT_TRACE_PACKET="$PWD/log" \
+		GIT_TEST_PROTOCOL_BAD_BUNDLE_URI=true \
+		git -c protocol.version=2 \
+		clone "$T5730_URI" child \
+		>out 2>err &&
+	test_must_be_empty out &&
+
+	grep -v -e ^fatal: -e ^error: err >err.actual &&
+	test_cmp err.expect err.actual &&
+
+	case "$T5730_PROTOCOL" in
+	file)
+		# Due to general race conditions with client/server replies we
+		# may or may not get "fatal: the remote end hung up
+		# unexpectedly" here
+		grep "^fatal: bundle-uri:" err >fatal-bundle-uri.actual &&
+		test_cmp fatal-bundle-uri.expect fatal-bundle-uri.actual
+		;;
+	*)
+		grep "^fatal:" err >fatal.actual &&
+		# Due to the same race conditions this might be
+		# "fatal: read error: Connection reset by peer", "fatal: the remote end
+		# hung up unexpectedly" etc.
+		cat fatal.actual &&
+		test_file_not_empty fatal.actual
+		;;
+	esac &&
+
+	grep "clone> test-bad-client$" log >sent-bad-request &&
+	test_file_not_empty sent-bad-request
+'
diff --git a/t/t5730-protocol-v2-bundle-uri-file.sh b/t/t5730-protocol-v2-bundle-uri-file.sh
new file mode 100755
index 00000000000..89203d3a23c
--- /dev/null
+++ b/t/t5730-protocol-v2-bundle-uri-file.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'file://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'file://' transport
+#
+T5730_PROTOCOL=file
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_expect_success "unknown capability value with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" \
+		uploadpack.bundleURI "$T5730_BUNDLE_URI_ESCAPED" &&
+
+	GIT_TRACE_PACKET="$PWD/log" \
+	GIT_TEST_BUNDLE_URI_UNKNOWN_CAPABILITY_VALUE=true \
+	git \
+		-c protocol.version=2 \
+		ls-remote --symref "$T5730_URI" \
+		>actual 2>err &&
+
+	# Server responded using protocol v2
+	grep "< version 2" log &&
+
+	grep "> bundle-uri=test-unknown-capability-value" log
+'
+
+test_done
diff --git a/t/t5731-protocol-v2-bundle-uri-git.sh b/t/t5731-protocol-v2-bundle-uri-git.sh
new file mode 100755
index 00000000000..282847b311f
--- /dev/null
+++ b/t/t5731-protocol-v2-bundle-uri-git.sh
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'git://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'git://' transport
+#
+T5730_PROTOCOL=git
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_done
diff --git a/t/t5732-protocol-v2-bundle-uri-http.sh b/t/t5732-protocol-v2-bundle-uri-http.sh
new file mode 100755
index 00000000000..fcc1cf3faef
--- /dev/null
+++ b/t/t5732-protocol-v2-bundle-uri-http.sh
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+test_description="Test bundle-uri with protocol v2 and 'http://' transport"
+
+TEST_NO_CREATE_REPO=1
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# Test protocol v2 with 'http://' transport
+#
+T5730_PROTOCOL=http
+. "$TEST_DIRECTORY"/lib-t5730-protocol-v2-bundle-uri.sh
+
+test_done
diff --git a/transport-helper.c b/transport-helper.c
index b4dbbabb0c2..398712c76f3 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1267,9 +1267,22 @@ static struct ref *get_refs_list_using_list(struct transport *transport,
 	return ret;
 }
 
+static int get_bundle_uri(struct transport *transport)
+{
+	get_helper(transport);
+
+	if (process_connect(transport, 0)) {
+		do_take_over(transport);
+		return transport->vtable->get_bundle_uri(transport);
+	}
+
+	return -1;
+}
+
 static struct transport_vtable vtable = {
 	.set_option	= set_helper_option,
 	.get_refs_list	= get_refs_list,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs,
 	.push_refs	= push_refs,
 	.connect	= connect_helper,
diff --git a/transport-internal.h b/transport-internal.h
index c4ca0b733ac..90ea749e5cf 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -26,6 +26,13 @@ struct transport_vtable {
 	struct ref *(*get_refs_list)(struct transport *transport, int for_push,
 				     struct transport_ls_refs_options *transport_options);
 
+	/**
+	 * Populates the remote side's bundle-uri under protocol v2,
+	 * if the "bundle-uri" capability was advertised. Returns 0 if
+	 * OK, negative values on error.
+	 */
+	int (*get_bundle_uri)(struct transport *transport);
+
 	/**
 	 * Fetch the objects for the given refs. Note that this gets
 	 * an array, and should ignore the list structure.
diff --git a/transport.c b/transport.c
index 3d64a43ab39..9a31e3f996b 100644
--- a/transport.c
+++ b/transport.c
@@ -22,6 +22,7 @@
 #include "protocol.h"
 #include "object-store.h"
 #include "color.h"
+#include "bundle-uri.h"
 
 static int transport_use_color = -1;
 static char transport_colors[][COLOR_MAXLEN] = {
@@ -359,6 +360,21 @@ static struct ref *get_refs_via_connect(struct transport *transport, int for_pus
 	return handshake(transport, for_push, options, 1);
 }
 
+static int get_bundle_uri(struct transport *transport)
+{
+	struct git_transport_data *data = transport->data;
+	struct packet_reader reader;
+	int stateless_rpc = transport->stateless_rpc;
+	string_list_init_dup(&transport->bundle_uri);
+
+	packet_reader_init(&reader, data->fd[0], NULL, 0,
+			   PACKET_READ_CHOMP_NEWLINE |
+			   PACKET_READ_GENTLE_ON_EOF);
+
+	return get_remote_bundle_uri(data->fd[1], &reader,
+				     &transport->bundle_uri, stateless_rpc);
+}
+
 static int fetch_refs_via_pack(struct transport *transport,
 			       int nr_heads, struct ref **to_fetch)
 {
@@ -899,6 +915,7 @@ static int disconnect_git(struct transport *transport)
 
 static struct transport_vtable taken_over_vtable = {
 	.get_refs_list	= get_refs_via_connect,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
 	.disconnect	= disconnect_git
@@ -1052,6 +1069,7 @@ static struct transport_vtable bundle_vtable = {
 
 static struct transport_vtable builtin_smart_vtable = {
 	.get_refs_list	= get_refs_via_connect,
+	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
 	.connect	= connect_git,
@@ -1065,6 +1083,7 @@ struct transport *transport_get(struct remote *remote, const char *url)
 
 	ret->progress = isatty(2);
 	string_list_init_dup(&ret->pack_lockfiles);
+	string_list_init_dup(&ret->bundle_uri);
 
 	if (!remote)
 		BUG("No remote provided to transport_get()");
@@ -1473,6 +1492,34 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
+int transport_get_remote_bundle_uri(struct transport *transport)
+{
+	const struct transport_vtable *vtable = transport->vtable;
+
+	/* Lazily configured */
+	if (transport->got_remote_bundle_uri++)
+		return 0;
+
+	/*
+	 * "Support" protocol v0 and v2 without bundle-uri support by
+	 * silently degrading to a NOOP.
+	 */
+	if (!server_supports_v2("bundle-uri", 0))
+		return 0;
+
+	/*
+	 * This is intentionally below the transport.injectBundleURI,
+	 * we want to be able to inject into protocol v0, or into the
+	 * dialog of a server who doesn't support this.
+	 */
+	if (!vtable->get_bundle_uri)
+		return error(_("bundle-uri operation not supported by protocol"));
+
+	if (vtable->get_bundle_uri(transport) < 0)
+		return error(_("could not retrieve server-advertised bundle-uri list"));
+	return 0;
+}
+
 void transport_unlock_pack(struct transport *transport, unsigned int flags)
 {
 	int in_signal_handler = !!(flags & TRANSPORT_UNLOCK_PACK_IN_SIGNAL_HANDLER);
@@ -1503,6 +1550,7 @@ int transport_disconnect(struct transport *transport)
 		ret = transport->vtable->disconnect(transport);
 	if (transport->got_remote_refs)
 		free_refs((void *)transport->remote_refs);
+	bundle_uri_string_list_clear(&transport->bundle_uri);
 	free(transport);
 	return ret;
 }
diff --git a/transport.h b/transport.h
index 12bc08fc339..90568845876 100644
--- a/transport.h
+++ b/transport.h
@@ -76,6 +76,18 @@ struct transport {
 	 */
 	unsigned got_remote_refs : 1;
 
+	/**
+	 * Indicates whether we already called get_bundle_uri(); set by
+	 * transport.c::transport_get_remote_bundle_uri().
+	 */
+	unsigned got_remote_bundle_uri : 1;
+
+	/*
+	 * The results of "command=bundle-uri", if both sides support
+	 * the "bundle-uri" capability.
+	 */
+	struct string_list bundle_uri;
+
 	/*
 	 * Transports that call take-over destroys the data specific to
 	 * the transport type while doing so, and cannot be reused.
@@ -280,6 +292,12 @@ void transport_ls_refs_options_release(struct transport_ls_refs_options *opts);
 const struct ref *transport_get_remote_refs(struct transport *transport,
 					    struct transport_ls_refs_options *transport_options);
 
+/**
+ * Retrieve bundle URI(s) from a remote. Populates "struct
+ * transport"'s "bundle_uri" and "got_remote_bundle_uri".
+ */
+int transport_get_remote_bundle_uri(struct transport *transport);
+
 /*
  * Fetch the hash algorithm used by a remote.
  *
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri"
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (10 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
                       ` (24 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a git-ls-remote-bundle-uri command. This is a thin wrapper around
issuing protocol v2 "bundle-uri" commands to a server, and around the
parsing routines in bundle-uri.c.
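
To illustrate the output this produces, here is a standalone sketch of
the per-line format: the URI, then each attribute as " key=value" or
as a bare " key", unless only the URI was asked for. It does not use
"struct string_list"; "struct attr" and format_bundle_uri_line() are
hypothetical stand-ins used only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one attribute; value is NULL for a bare key */
struct attr {
	const char *key;
	const char *value;
};

/*
 * Format one output line into buf. With uri_only set, attributes are
 * skipped (like --uri). Returns the line length, or -1 if it does not fit.
 */
static int format_bundle_uri_line(char *buf, size_t cap, const char *uri,
				  const struct attr *attrs, size_t nr,
				  int uri_only)
{
	size_t off, i;
	int n;

	assert(buf && uri);
	n = snprintf(buf, cap, "%s", uri);
	if (n < 0 || (size_t)n >= cap)
		return -1;
	off = n;

	for (i = 0; !uri_only && i < nr; i++) {
		if (attrs[i].value)
			n = snprintf(buf + off, cap - off, " %s=%s",
				     attrs[i].key, attrs[i].value);
		else
			n = snprintf(buf + off, cap - off, " %s",
				     attrs[i].key);
		if (n < 0 || (size_t)n >= cap - off)
			return -1;
		off += n;
	}

	if (off + 2 > cap)
		return -1;
	buf[off++] = '\n';
	buf[off] = '\0';
	return (int)off;
}
```

The attribute pair "foo" and "bar=baz" used here mirrors what the tests
in this patch configure via uploadpack.bundleURI.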

Since in the "git clone" case we'll have already done the handshake(),
but not here, introduce a "got_advertisement" state along with
"got_remote_heads". It seems to me that the "got_remote_heads" is
badly named in the first place, and the whole logic of eagerly getting
ls-refs on handshake() or not could be refactored somewhat, but let's
not do that now, and instead just add another self-documenting state
variable.
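
The idea behind the extra state variable can be sketched in isolation
as follows. This is only an illustration under simplifying assumptions,
not the transport.c code: "struct conn_state", fake_handshake() and
ensure_advertisement() are hypothetical names, and the "handshake" here
merely counts how often it ran.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-connection state */
struct conn_state {
	unsigned got_advertisement : 1;
	unsigned got_remote_heads : 1;
	int handshakes; /* how many times the fake handshake ran */
};

/* Stand-in for handshake(): records that it ran and sets both flags */
static void fake_handshake(struct conn_state *s)
{
	assert(s);
	s->handshakes++;
	s->got_remote_heads = 1;
	s->got_advertisement = 1;
}

/*
 * Callers that need the capability advertisement but may run before
 * any handshake (as in the ls-remote-bundle-uri case) go through this.
 */
static void ensure_advertisement(struct conn_state *s)
{
	if (!s->got_advertisement)
		fake_handshake(s);
}
```

A caller that arrives after a handshake already happened (the clone
case) finds the flag set and does nothing extra.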

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-ls-remote-bundle-uri.txt |  62 +++++++++++
 Documentation/git-ls-remote.txt            |   1 +
 Makefile                                   |   1 +
 builtin.h                                  |   1 +
 builtin/clone.c                            |   2 +-
 builtin/ls-remote-bundle-uri.c             |  90 +++++++++++++++
 command-list.txt                           |   1 +
 git.c                                      |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 124 +++++++++++++++++++++
 transport.c                                |  43 +++++--
 transport.h                                |   6 +-
 11 files changed, 321 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c

diff --git a/Documentation/git-ls-remote-bundle-uri.txt b/Documentation/git-ls-remote-bundle-uri.txt
new file mode 100644
index 00000000000..793d7677f2f
--- /dev/null
+++ b/Documentation/git-ls-remote-bundle-uri.txt
@@ -0,0 +1,62 @@
+git-ls-remote-bundle-uri(1)
+===========================
+
+NAME
+----
+git-ls-remote-bundle-uri - List 'bundle-uri' in a remote repository
+
+SYNOPSIS
+--------
+[verse]
+'git ls-remote-bundle-uri' [-q |--quiet] [--uri] [--upload-pack=<exec>]
+			 [[-o | --server-option=]<option>] <repository>
+
+
+DESCRIPTION
+-----------
+
+Displays the `bundle-uri`s advertised by a remote repository. See
+`bundle-uri` in link:technical/protocol-v2.html[the Git Wire Protocol,
+Version 2] documentation for what the output format looks like.
+
+OPTIONS
+-------
+
+-q::
+--quiet::
+	Do not print remote URL to stderr in cases where the remote
+	name is inferred from config.
++
+When the remote name is not inferred (e.g. `git ls-remote-bundle-uri
+origin`, or `git ls-remote-bundle-uri https://[...]`) the remote URL
+is not printed in any case.
+
+--uri::
+	Print only the URIs, and not any of their optional attributes.
+
+--upload-pack=<exec>::
+	Specify the full path of 'git-upload-pack' on the remote
+	host. This allows listing references from repositories accessed via
+	SSH and where the SSH daemon does not use the PATH configured by the
+	user.
+
+-o <option>::
+--server-option=<option>::
+	Transmit the given string to the server when communicating using
+	protocol version 2.  The given string must not contain a NUL or LF
+	character.
+	When multiple `--server-option=<option>` are given, they are all
+	sent to the other side in the order listed on the command line.
+
+<repository>::
+	The "remote" repository to query.  This parameter can be
+	either a URL or the name of a remote (see the GIT URLS and
+	REMOTES sections of linkgit:git-fetch[1]).
+
+SEE ALSO
+--------
+linkgit:git-ls-remote[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/git-ls-remote.txt b/Documentation/git-ls-remote.txt
index 492e573856f..86c07eff832 100644
--- a/Documentation/git-ls-remote.txt
+++ b/Documentation/git-ls-remote.txt
@@ -114,6 +114,7 @@ c5db5456ae3b0873fc659c19fafdde22313cc441	refs/tags/v0.99.2
 
 SEE ALSO
 --------
+linkgit:git-ls-remote-bundle-uri[1].
 linkgit:git-check-ref-format[1].
 
 GIT
diff --git a/Makefile b/Makefile
index c8a14793005..badb87bed78 100644
--- a/Makefile
+++ b/Makefile
@@ -1161,6 +1161,7 @@ BUILTIN_OBJS += builtin/init-db.o
 BUILTIN_OBJS += builtin/interpret-trailers.o
 BUILTIN_OBJS += builtin/log.o
 BUILTIN_OBJS += builtin/ls-files.o
+BUILTIN_OBJS += builtin/ls-remote-bundle-uri.o
 BUILTIN_OBJS += builtin/ls-remote.o
 BUILTIN_OBJS += builtin/ls-tree.o
 BUILTIN_OBJS += builtin/mailinfo.o
diff --git a/builtin.h b/builtin.h
index 40e9ecc8485..c80bec94abe 100644
--- a/builtin.h
+++ b/builtin.h
@@ -173,6 +173,7 @@ int cmd_log(int argc, const char **argv, const char *prefix);
 int cmd_log_reflog(int argc, const char **argv, const char *prefix);
 int cmd_ls_files(int argc, const char **argv, const char *prefix);
 int cmd_ls_tree(int argc, const char **argv, const char *prefix);
+int cmd_ls_remote_bundle_uri(int argc, const char **argv, const char *prefix);
 int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 int cmd_mailinfo(int argc, const char **argv, const char *prefix);
 int cmd_mailsplit(int argc, const char **argv, const char *prefix);
diff --git a/builtin/clone.c b/builtin/clone.c
index 709f1502f91..e11f4019b87 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1240,7 +1240,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	 * Populate transport->got_remote_bundle_uri and
 	 * transport->bundle_uri. We might get nothing.
 	 */
-	transport_get_remote_bundle_uri(transport);
+	transport_get_remote_bundle_uri(transport, 1);
 
 	if (mapped_refs) {
 		int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
diff --git a/builtin/ls-remote-bundle-uri.c b/builtin/ls-remote-bundle-uri.c
new file mode 100644
index 00000000000..dadb21043c0
--- /dev/null
+++ b/builtin/ls-remote-bundle-uri.c
@@ -0,0 +1,90 @@
+#include "builtin.h"
+#include "cache.h"
+#include "transport.h"
+#include "ref-filter.h"
+#include "remote.h"
+#include "refs.h"
+
+static const char * const ls_remote_bundle_uri_usage[] = {
+	N_("git ls-remote-bundle-uri <repository>"),
+	NULL
+};
+
+int cmd_ls_remote_bundle_uri(int argc, const char **argv, const char *prefix)
+{
+	int quiet = 0;
+	int uri = 0;
+	const char *uploadpack = NULL;
+	struct string_list server_options = STRING_LIST_INIT_DUP;
+	struct option options[] = {
+		OPT__QUIET(&quiet, N_("do not print remote URL")),
+		OPT_BOOL(0, "uri", &uri, N_("limit to showing uri field")),
+		OPT_STRING(0, "upload-pack", &uploadpack, N_("exec"),
+			   N_("path of git-upload-pack on the remote host")),
+		OPT_STRING_LIST('o', "server-option", &server_options,
+				N_("server-specific"),
+				N_("option to transmit")),
+		OPT_END()
+	};
+	const char *dest = NULL;
+	struct remote *remote;
+	struct transport *transport;
+	int status = 0;
+	struct string_list_item *item;
+
+	argc = parse_options(argc, argv, prefix, options, ls_remote_bundle_uri_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+	dest = argv[0];
+
+	packet_trace_identity("ls-remote-bundle-uri");
+
+	remote = remote_get(dest);
+	if (!remote) {
+		if (dest)
+			die(_("bad repository '%s'"), dest);
+		die(_("no remote configured to get bundle URIs from"));
+	}
+	if (!remote->url_nr)
+		die(_("remote '%s' has no configured URL"), dest);
+
+	transport = transport_get(remote, NULL);
+	if (uploadpack)
+		transport_set_option(transport, TRANS_OPT_UPLOADPACK, uploadpack);
+	if (server_options.nr)
+		transport->server_options = &server_options;
+
+	if (!dest && !quiet)
+		fprintf(stderr, "From %s\n", *remote->url);
+
+	if (transport_get_remote_bundle_uri(transport, 0) < 0) {
+		error(_("could not get the bundle-uri list"));
+		status = 1;
+		goto cleanup;
+	}
+
+	for_each_string_list_item(item, &transport->bundle_uri) {
+		struct string_list_item *kv_item;
+		struct string_list *kv = item->util;
+
+		fprintf(stdout, "%s", item->string);
+		if (uri || !kv) {
+			fprintf(stdout, "\n");
+			continue;
+		}
+		for_each_string_list_item(kv_item, kv) {
+			const char *k = kv_item->string;
+			const char *v = kv_item->util;
+
+			if (v)
+				fprintf(stdout, " %s=%s", k, v);
+			else
+				fprintf(stdout, " %s", k);
+		}
+		fprintf(stdout, "\n");
+	}
+
+cleanup:
+	if (transport_disconnect(transport))
+		return 1;
+	return status;
+}
diff --git a/command-list.txt b/command-list.txt
index 9bd6f3c48f4..a50eebd4aa2 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -115,6 +115,7 @@ git-interpret-trailers                  purehelpers
 git-log                                 mainporcelain           info
 git-ls-files                            plumbinginterrogators
 git-ls-remote                           plumbinginterrogators
+git-ls-remote-bundle-uri                plumbinginterrogators
 git-ls-tree                             plumbinginterrogators
 git-mailinfo                            purehelpers
 git-mailsplit                           purehelpers
diff --git a/git.c b/git.c
index 3d8e48cf555..352e76cedaf 100644
--- a/git.c
+++ b/git.c
@@ -551,6 +551,7 @@ static struct cmd_struct commands[] = {
 	{ "log", cmd_log, RUN_SETUP },
 	{ "ls-files", cmd_ls_files, RUN_SETUP },
 	{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
+	{ "ls-remote-bundle-uri", cmd_ls_remote_bundle_uri, RUN_SETUP_GENTLY },
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY | NO_PARSEOPT },
 	{ "mailsplit", cmd_mailsplit, NO_PARSEOPT },
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index 7a90c80f0b1..d0b15a47ec2 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -143,3 +143,127 @@ test_expect_success !T5730_HTTP "bad client with $T5730_PROTOCOL:// using protoc
 	grep "clone> test-bad-client$" log >sent-bad-request &&
 	test_file_not_empty sent-bad-request
 '
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	# All data about bundle URIs
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual &&
+
+	# Only the URIs
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri --uri \
+		"$T5730_URI" \
+		>actual2 &&
+	test_cmp actual actual2
+'
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2: attributes" '
+	test_when_finished "rm -f log" &&
+
+	ATTR="foo bar=baz" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED $ATTR" &&
+
+	# All data about bundle URIs
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED $ATTR
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual
+'
+
+test_expect_success "ls-remote-bundle-uri with $T5730_PROTOCOL:// using protocol v2: --uri" '
+	test_when_finished "rm -f log" &&
+
+	ATTR="foo bar=baz" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED $ATTR" &&
+
+	# Only the URIs, without their attributes
+	cat >expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--uri \
+		"$T5730_URI" \
+		>actual &&
+	test_cmp expect actual
+'
+
+test_expect_success "ls-remote-bundle-uri --[no-]quiet with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+	test_when_finished "rm -rf child" &&
+	env GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		 clone "$T5730_URI" child &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	# Without --[no-]quiet
+	cat >out.expect <<-EOF &&
+	$T5730_BUNDLE_URI_ESCAPED
+	EOF
+	cat >err.expect <<-EOF &&
+	From $T5730_URI
+	EOF
+	git \
+		-C child \
+		 -c protocol.version=2 \
+		ls-remote-bundle-uri \
+		>out.actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp out.expect out.actual &&
+
+	# --no-quiet is the default
+	git \
+		-C child \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--no-quiet \
+		>out.actual 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_cmp out.expect out.actual &&
+
+	# --quiet quiets the "From" line
+	git \
+		-C child \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		--quiet \
+		>out.actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp out.expect out.actual &&
+
+	# --quiet is implicit if the remote is given explicitly
+	git \
+		-c protocol.version=2 \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>out.actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp out.expect out.actual
+'
diff --git a/transport.c b/transport.c
index 9a31e3f996b..e648d3110bb 100644
--- a/transport.c
+++ b/transport.c
@@ -198,6 +198,7 @@ struct git_transport_data {
 	struct git_transport_options options;
 	struct child_process *conn;
 	int fd[2];
+	unsigned got_advertisement : 1;
 	unsigned got_remote_heads : 1;
 	enum protocol_version version;
 	struct oid_array extra_have;
@@ -346,6 +347,7 @@ static struct ref *handshake(struct transport *transport, int for_push,
 		BUG("unknown protocol version");
 	}
 	data->got_remote_heads = 1;
+	data->got_advertisement = 1;
 	transport->hash_algo = reader.hash_algo;
 
 	if (reader.line_peeked)
@@ -367,6 +369,33 @@ static int get_bundle_uri(struct transport *transport)
 	int stateless_rpc = transport->stateless_rpc;
 	string_list_init_dup(&transport->bundle_uri);
 
+	if (!data->got_advertisement) {
+		struct ref *refs;
+		struct git_transport_data *data = transport->data;
+		enum protocol_version version;
+
+		refs = handshake(transport, 0, NULL, 0);
+		version = data->version;
+
+		switch (version) {
+		case protocol_v2:
+			assert(!refs);
+			break;
+		case protocol_v0:
+		case protocol_v1:
+		case protocol_unknown_version:
+			assert(refs);
+			break;
+		}
+	}
+
+	/*
+	 * "Support" protocol v0 and v2 without bundle-uri support by
+	 * silently degrading to a NOOP.
+	 */
+	if (!server_supports_v2("bundle-uri", 0))
+		return 0;
+
 	packet_reader_init(&reader, data->fd[0], NULL, 0,
 			   PACKET_READ_CHOMP_NEWLINE |
 			   PACKET_READ_GENTLE_ON_EOF);
@@ -1492,7 +1521,7 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
-int transport_get_remote_bundle_uri(struct transport *transport)
+int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
 
@@ -1500,20 +1529,16 @@ int transport_get_remote_bundle_uri(struct transport *transport)
 	if (transport->got_remote_bundle_uri++)
 		return 0;
 
-	/*
-	 * "Support" protocol v0 and v2 without bundle-uri support by
-	 * silently degrading to a NOOP.
-	 */
-	if (!server_supports_v2("bundle-uri", 0))
-		return 0;
-
 	/*
 	 * This is intentionally below the transport.injectBundleURI,
 	 * we want to be able to inject into protocol v0, or into the
 	 * dialog of a server who doesn't support this.
 	 */
-	if (!vtable->get_bundle_uri)
+	if (!vtable->get_bundle_uri) {
+		if (quiet)
+			return -1;
 		return error(_("bundle-uri operation not supported by protocol"));
+	}
 
 	if (vtable->get_bundle_uri(transport) < 0)
 		return error(_("could not retrieve server-advertised bundle-uri list"));
diff --git a/transport.h b/transport.h
index 90568845876..ed5ebcf1466 100644
--- a/transport.h
+++ b/transport.h
@@ -295,8 +295,12 @@ const struct ref *transport_get_remote_refs(struct transport *transport,
 /**
  * Retrieve bundle URI(s) from a remote. Populates "struct
  * transport"'s "bundle_uri" and "got_remote_bundle_uri".
+ *
+ * With `quiet=1` it will not complain if the server doesn't support
+ * the protocol. It will only complain if we discover that the server
+ * uses it and we then encounter issues.
  */
-int transport_get_remote_bundle_uri(struct transport *transport);
+int transport_get_remote_bundle_uri(struct transport *transport, int quiet);
 
 /*
  * Fetch the hash algorithm used by a remote.
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (11 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
                       ` (23 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add the ability to inject "fake" bundle URIs into the newly supported
bundle-uri dialog. As discussed in the added documentation this allows
us to pretend as though the remote supports bundle URIs.

This will be useful both for ad-hoc testing, and for the real use-case
of retrofitting bundle URI support on-the-fly, i.e. to have:

	git -c transfer.injectBundleURI="file://$(pwd)/local.bdl" \
	clone https://example.com/git.git

Be similar in spirit to:

	git clone --reference local-clone.git --dissociate \
	https://example.com/git.git

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt     | 20 ++++++++++++
 t/lib-t5730-protocol-v2-bundle-uri.sh | 46 +++++++++++++++++++++++++++
 transport.c                           | 33 +++++++++++++++++++
 3 files changed, 99 insertions(+)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index b49429eb4db..71b9b8f29e6 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -77,3 +77,23 @@ transfer.unpackLimit::
 transfer.advertiseSID::
 	Boolean. When true, client and server processes will advertise their
 	unique session IDs to their remote counterpart. Defaults to false.
+
+transfer.injectBundleURI::
+	Allows for the injection of `bundle-uri` lines into the
+	protocol v2 transport dialog (see `protocol.version` in
+	linkgit:git-config[1]). See `bundle-uri` in
+	link:technical/protocol-v2.html[the Git Wire Protocol, Version
+	2] documentation for what the format looks like.
++
+Can be given more than once, each value being injected as one line into
+the dialog.
++
+This is useful for testing the `bundle-uri` facility, and to e.g. use
+linkgit:git-clone[1] to clone from a server which does not support
+`bundle-uri`, but where the clone can benefit from getting some or
+most of the data from a static bundle retrieved from elsewhere.
++
+Impacts any command that uses the transport to communicate with remote
+linkgit:git-upload-pack[1] processes, e.g. linkgit:git-clone[1],
+linkgit:git-fetch[1] and the linkgit:git-ls-remote-bundle-uri[1]
+inspection command. This includes the `file://` protocol.
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index d0b15a47ec2..28c095c1224 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -267,3 +267,49 @@ test_expect_success "ls-remote-bundle-uri --[no-]quiet with $T5730_PROTOCOL:// u
 	test_must_be_empty err &&
 	test_cmp out.expect out.actual
 '
+
+test_expect_success "ls-remote-bundle-uri with -c transfer.injectBundleURI with $T5730_PROTOCOL:// using protocol v2" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >expect <<-\EOF &&
+	https://injected.example.com/fake-1.bdl
+	https://injected.example.com/fake-2.bdl
+	EOF
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c transfer.injectBundleURI="https://injected.example.com/fake-1.bdl" \
+		-c transfer.injectBundleURI="https://injected.example.com/fake-2.bdl" \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>actual 2>err &&
+	test_cmp expect actual &&
+	test_path_is_missing log
+'
+
+test_expect_success "ls-remote-bundle-uri with bad -c transfer.injectBundleURI protocol v2 with $T5730_PROTOCOL://" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI \
+		"$T5730_BUNDLE_URI_ESCAPED" &&
+
+	cat >err.expect <<-\EOF &&
+	error: bad (empty) transfer.injectBundleURI
+	error: could not get the bundle-uri list
+	EOF
+
+	test_must_fail env \
+		GIT_TRACE_PACKET="$PWD/log" \
+		git \
+		-c protocol.version=2 \
+		-c transfer.injectBundleURI \
+		ls-remote-bundle-uri \
+		"$T5730_URI" \
+		>out 2>err.actual &&
+	test_must_be_empty out &&
+	test_cmp err.expect err.actual &&
+	test_path_is_missing log
+'
diff --git a/transport.c b/transport.c
index e648d3110bb..342e39d81f3 100644
--- a/transport.c
+++ b/transport.c
@@ -1521,14 +1521,47 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
+struct config_cb {
+	struct transport *transport;
+	int configured;
+	int ret;
+};
+
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	struct config_cb *cb = data;
+	struct transport *transport = cb->transport;
+	struct string_list *uri = &transport->bundle_uri;
+
+	if (!strcmp(var, "transfer.injectbundleuri")) {
+		cb->configured = 1;
+		if (!value)
+			cb->ret = error(_("bad (empty) transfer.injectBundleURI"));
+		else if (bundle_uri_parse_line(uri, value) < 0)
+			cb->ret = error(_("bad transfer.injectBundleURI: '%s'"),
+					value);
+		return 0;
+	}
+	return 0;
+}
+
 int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
+	struct config_cb cb = {
+		.transport = transport,
+	};
 
 	/* Lazily configured */
 	if (transport->got_remote_bundle_uri++)
 		return 0;
 
+	git_config(bundle_uri_config, &cb);
+
+	/* Our own config can fake it up with transport.injectBundleURI */
+	if (cb.configured)
+		return cb.ret;
+
 	/*
 	 * This is intentionally below the transport.injectBundleURI,
 	 * we want to be able to inject into protocol v0, or into the
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (12 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
                       ` (22 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

The yet-to-be introduced client support for bundle-uri will always
fall back on a full clone, but we'd still like to be able to ignore a
server's bundle-uri advertisement entirely.

This is useful for testing, and for cases where a server points to bad
bundles, e.g. ones that take a while to time out.

Since we might see the config in any order, we need to clear out any
accumulated bundle_uri list when we see a transfer.bundleURI=false
setting, and not add any more things to the list.
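
As a minimal sketch of that ordering rule (not the code in this
patch): the struct, the helper names and the simplified boolean
parsing below are made up for illustration. A single
transfer.bundleURI=false leaves the list empty whether it comes before
or after the injected URIs:

```c
#include <stddef.h>
#include <string.h>

#define MAX_URIS 8

/* Hypothetical stand-in for the state accumulated while reading config. */
struct bundle_config {
	const char *uris[MAX_URIS];
	size_t nr;
	int disabled;
};

/* Simplified boolean parser; git's own parser accepts more spellings. */
static int parse_bool(const char *value)
{
	return !strcmp(value, "true") || !strcmp(value, "1");
}

/* Called once per key/value pair, in whatever order the config is read. */
static void bundle_config_cb(struct bundle_config *cfg,
			     const char *key, const char *value)
{
	if (!strcmp(key, "transfer.bundleuri")) {
		cfg->disabled = !parse_bool(value);
		if (cfg->disabled)
			cfg->nr = 0; /* drop anything accumulated so far */
		return;
	}

	if (!cfg->disabled && cfg->nr < MAX_URIS &&
	    !strcmp(key, "transfer.injectbundleuri"))
		cfg->uris[cfg->nr++] = value;
}

/* Feed "nr" key/value pairs and report how many URIs survive. */
static size_t count_injected(const char *kv[][2], size_t nr)
{
	struct bundle_config cfg = { 0 };
	size_t i;

	for (i = 0; i < nr; i++)
		bundle_config_cb(&cfg, kv[i][0], kv[i][1]);
	return cfg.nr;
}
```

Feeding count_injected() an injected URI followed by
transfer.bundleuri=false, or the same two pairs in the opposite order,
yields zero in both cases.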

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt |  6 ++++++
 transport.c                       | 21 +++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index 71b9b8f29e6..ae85ca5760b 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -78,6 +78,12 @@ transfer.advertiseSID::
 	Boolean. When true, client and server processes will advertise their
 	unique session IDs to their remote counterpart. Defaults to false.
 
+transfer.bundleURI::
+	When set to `false`, ignore any server advertisement of
+	`bundle-uri` and proceed with a "normal" clone/fetch even if
+	using bundles to bootstrap is possible. Defaults to `true`,
+	i.e. bundle-uri is tried whenever a server offers it.
+
 transfer.injectBundleURI::
 	Allows for the injection of `bundle-uri` lines into the
 	protocol v2 transport dialog (see `protocol.version` in
diff --git a/transport.c b/transport.c
index 342e39d81f3..9e20b531215 100644
--- a/transport.c
+++ b/transport.c
@@ -1521,19 +1521,28 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
-struct config_cb {
+struct bundle_config_cb {
 	struct transport *transport;
 	int configured;
 	int ret;
+	int disabled;
 };
 
 static int bundle_uri_config(const char *var, const char *value, void *data)
 {
-	struct config_cb *cb = data;
+	struct bundle_config_cb *cb = data;
 	struct transport *transport = cb->transport;
 	struct string_list *uri = &transport->bundle_uri;
 
-	if (!strcmp(var, "transfer.injectbundleuri")) {
+	if (!strcmp(var, "transfer.bundleuri")) {
+		cb->disabled = !git_config_bool(var, value);
+		if (cb->disabled)
+			bundle_uri_string_list_clear(uri);
+		return 0;
+	}
+
+	if (!cb->disabled &&
+	    !strcmp(var, "transfer.injectbundleuri")) {
 		cb->configured = 1;
 		if (!value)
 			cb->ret = error(_("bad (empty) transfer.injectBundleURI"));
@@ -1548,7 +1557,7 @@ static int bundle_uri_config(const char *var, const char *value, void *data)
 int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 {
 	const struct transport_vtable *vtable = transport->vtable;
-	struct config_cb cb = {
+	struct bundle_config_cb cb = {
 		.transport = transport,
 	};
 
@@ -1558,6 +1567,10 @@ int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 
 	git_config(bundle_uri_config, &cb);
 
+	/* Don't use bundle-uri at all */
+	if (cb.disabled)
+		return 0;
+
 	/* Our own config can fake it up with transport.injectBundleURI */
 	if (cb.configured)
 		return cb.ret;
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone"
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (13 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 16/36] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
                       ` (21 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

In an earlier commit ("bundle-uri client: add minimal NOOP client") a
transport_get_remote_bundle_uri() call was added to builtin/clone.c to
get any advertised bundle URIs from the server during cloning, but
nothing was being done with them yet.

This implements real support for bundle-uri during the "clone"
phase. It's not used at all by "fetch", but the code to support it is
mostly here already, and will be finished later.

Using the new transfer.injectBundleURI support it's easy to test this
method of cloning on a live server that doesn't support bundle-uri.

In a git.git checkout, first let's prepare two bundles:

    git bundle create /tmp/git-master-only.bdl origin/master
    git bundle create /tmp/git-master-to-next.bdl origin/master..origin/next

And next, let's do a "fake" clone where we bootstrap from these
bundles. The fetch.uriProtocols is needed because we'd otherwise
ignore "file://" URIs. This uses --no-tags --single-branch for
simplicity:

    rm -rf /tmp/git.git &&
    git \
	-c protocol.version=2 \
        -c fetch.uriProtocols=file \
        -c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
	-c transfer.injectBundleURI="file:///tmp/git-master-to-next.bdl" \
	clone --bare --no-tags --single-branch --branch next --template= \
	--verbose --verbose \
	https://github.com/git/git.git /tmp/git.git

We'll then get output like:

    Receiving bundle (1/2): 100% (300529/300529), 87.57 MiB | 32.70 MiB/s, done.
    Resolving deltas: 100% (226765/226765), done.
    have eb27b338a3e71c7c4079fbac8aeae3f8fbb5c687 commit via bundle-uri
    Receiving bundle (2/2): 100% (725/725), 221.11 KiB | 22.11 MiB/s, done.
    Resolving deltas: 100% (539/539), completed with 153 local objects.
    have e1b32706d8dd5db1dc2e13f8e391651214f1d987 commit via bundle-uri
    Marking e1b32706d8dd5db1dc2e13f8e391651214f1d987 as complete
    already have e1b32706d8dd5db1dc2e13f8e391651214f1d987 (refs/heads/next)
    Checking connectivity: 301210, done.

I.e. we did an ls-refs on connection to the server, then retrieved the
advertised bundles (faked up via config in this case).
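
The scheme check that decides whether a bundle URI is retrieved at all
can be sketched as follows. This is a standalone approximation,
uri_scheme_allowed() is a hypothetical name, and the assumption (taken
from the description above) is that only http and https are accepted
unless fetch.uriProtocols says otherwise:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if "uri" starts with "<scheme>://" for one of the "nr"
 * schemes listed in "allowed", 0 otherwise.
 */
static int uri_scheme_allowed(const char *uri,
			      const char *const *allowed, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		size_t len = strlen(allowed[i]);

		/*
		 * The scheme must be followed directly by "://", so
		 * "http" does not match "https://..." by accident.
		 */
		if (!strncmp(uri, allowed[i], len) &&
		    !strncmp(uri + len, "://", 3))
			return 1;
	}
	return 0;
}
```

With the default list a file:///tmp/git-master-only.bdl URI is
rejected, which is why the example above passes
fetch.uriProtocols=file.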

We then got all the data leading up to the current "master" from
there, and also the commit that's currently on "next". In this case we
found that we didn't need to proceed further with the dialog.

I.e. other than an ls-refs and the server waiting until we downloaded
the bundles, the server didn't need to do any work creating a PACK for
us.
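
A rough sketch of why the dialog could stop here: once the tips found
in the bundles are treated as objects we have, a wanted tip that is
among them needs nothing further from the server. The helper names are
hypothetical, and comparing hex strings is a simplification; the real
check is made against the object store:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if "oid" is among the "nr" hex object IDs in "have". */
static int have_oid(const char *oid, const char *const *have, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(oid, have[i]))
			return 1;
	return 0;
}

/*
 * Count the wanted tips that the bundles did not already provide.
 * Zero means there is nothing left to request a PACK for.
 */
static size_t count_missing_wants(const char *const *want, size_t want_nr,
				  const char *const *have, size_t have_nr)
{
	size_t i, missing = 0;

	for (i = 0; i < want_nr; i++)
		if (!have_oid(want[i], have, have_nr))
			missing++;
	return missing;
}
```

With the two "have" lines from the output above, wanting the "next"
tip e1b32706d8dd5db1dc2e13f8e391651214f1d987 leaves nothing missing,
whereas a tip that is in neither bundle still has to be fetched.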

If we change "--branch next" into "--branch seen" in the above command
we'll get the same output at the start until the "want" line, then:

    [...]
    want 93021c12c9f91e0d750d3ca8750a62416f4ea81a (refs/heads/seen)
    POST git-upload-pack (212 bytes)
    remote: Enumerating objects: 2265, done.
    remote: Counting objects: 100% (1576/1576), done.
    remote: Compressing objects: 100% (233/233), done.
    remote: Total 2265 (delta 1378), reused 1480 (delta 1341), pack-reused 689
    Receiving objects: 100% (2265/2265), 2.17 MiB | 10.77 MiB/s, done.
    Resolving deltas: 100% (1673/1673), completed with 339 local objects.
    Checking connectivity: 303225, done.

I.e. the server needed to send us an incremental update on top after
we'd unpacked the bundles, but this was a fairly minimal set of ~2k
objects. It didn't need to service a full clone.

We can see the savings on the server by setting up a local server at
the tip of "next":

    rm -rf /tmp/git-server.git &&
    git init --bare /tmp/git-server.git &&
    git -C /tmp/git-server.git bundle unbundle /tmp/git-master-only.bdl &&
    git -C /tmp/git-server.git bundle unbundle /tmp/git-master-to-next.bdl &&
    git -C /tmp/git-server.git update-ref refs/heads/master $(git ls-remote /tmp/git-master-only.bdl | cut -f 1) &&
    git -C /tmp/git-server.git update-ref refs/heads/next $(git ls-remote /tmp/git-master-to-next.bdl | cut -f 1) &&
    git -C /tmp/git-server.git for-each-ref

Let's then clone from it, and record the time we spend.

    rm -rf /tmp/git.git /tmp/{client,server}.time &&
    /usr/bin/time -o /tmp/client.time -v git \
	-c protocol.version=2 \
        -c fetch.uriProtocols=file \
        -c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
	clone \
	--upload-pack '/usr/bin/time -o /tmp/server.time -v git-upload-pack' \
	--bare --no-tags --single-branch --branch next --template= \
	--verbose --verbose \
	file:///tmp/git-server.git /tmp/git.git &&
    for i in client server
    do
        echo $i: &&
        grep -e seconds -e wall -e Maximum -e context /tmp/$i.time
    done

This gives us something like these results:

    client:
        User time (seconds): 46.34
        System time (seconds): 0.67
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.67
        Maximum resident set size (kbytes): 207096
        Voluntary context switches: 116058
        Involuntary context switches: 220
    server:
        User time (seconds): 0.13
        System time (seconds): 0.00
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.08
        Maximum resident set size (kbytes): 53168
        Voluntary context switches: 255
        Involuntary context switches: 7

Whereas doing a normal "clone" (by e.g. adding "-c
transfer.bundleURI=false" to the above) will give something like:

    client:
        User time (seconds): 47.24
        System time (seconds): 0.92
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.55
        Maximum resident set size (kbytes): 288104
        Voluntary context switches: 136350
        Involuntary context switches: 296
    server:
        User time (seconds): 5.73
        System time (seconds): 0.24
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.45
        Maximum resident set size (kbytes): 288104
        Voluntary context switches: 26568
        Involuntary context switches: 111

I.e. we can see that the win on the client in this case is slightly
negative in wall-clock terms (18.67s vs. 18.55s, under 1% slower), but
we use a little over 2% of the CPU time on the server, and around 18%
of the memory.
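
The ratios can be recomputed from the figures quoted above. A small
sketch, where percent_of() and print_ratios() are hypothetical helpers
and only the server's user CPU time and maximum RSS and the client's
wall-clock time are compared:

```c
#include <stdio.h>

/* Express "part" as a percentage of "whole". */
static double percent_of(double part, double whole)
{
	return 100.0 * part / whole;
}

/* Bundle-uri clone relative to the normal clone, using the numbers above. */
static void print_ratios(void)
{
	/* Server user time: 0.13s with bundle-uri, 5.73s without. */
	printf("server CPU:    %.1f%%\n", percent_of(0.13, 5.73));
	/* Server maximum RSS: 53168 kB with bundle-uri, 288104 kB without. */
	printf("server memory: %.1f%%\n", percent_of(53168, 288104));
	/* Client wall clock: 18.67s with bundle-uri, 18.55s without. */
	printf("client wall:   %.1f%%\n", percent_of(18.67, 18.55));
}
```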

In practice I think this will be more of a win-win. These results are
on an unloaded local machine, and don't account for the benefit of the
server being more likely to have a network-local version of most of
the repository via dumb CDNs.

Real servers are also usually in a messier state of having various
loose objects and more fragmented pack collections, and needing to
spend CPU to assemble these. Frequent repacking and local caching
(e.g. via the uploadpack.packObjectsHook) help, but using this should
make it easier to run a high-performance git server.

This feature also makes things like resumable clones rather trivial to
implement. This approach was discussed in the past[1] as a means to
get that feature.

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 fetch-pack.c                          | 255 ++++++++++++++++++++++++++
 fetch-pack.h                          |   6 +
 t/lib-t5730-protocol-v2-bundle-uri.sh | 145 ++++++++++++++-
 transport.c                           |   1 +
 4 files changed, 406 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index b1d90d1914f..316fb2fd65d 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -26,6 +26,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "sigchain.h"
+#include "bundle.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -1025,6 +1026,133 @@ static int get_pack(struct fetch_pack_args *args,
 	return 0;
 }
 
+static int unbundle_bundle_uri(const char *bundle_uri, unsigned int nth,
+			       unsigned int total_nr, FILE *in, int in_fd,
+			       struct oid_array *bundle_oids,
+			       unsigned int use_thin_pack)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct bundle_header header = BUNDLE_HEADER_INIT;
+	int ret = 0;
+	struct string_list_item *item;
+	struct strbuf progress_title = STRBUF_INIT;
+	int code;
+
+	ret = read_bundle_header_fd(in_fd, &header, bundle_uri);
+	if (ret < 0) {
+		ret = error("could not read_bundle_header(%s)", bundle_uri);
+		goto cleanup;
+	}
+
+	for_each_string_list_item(item, &header.references) {
+		/*
+		 * The bundle's idea of the ref name is
+		 * item->string.
+		 *
+		 * Here's where we could do concurrent negotiation
+		 * with the server (and possibly start the fetch!)
+		 * before or while we unpack the bundle with
+		 * index-pack.
+		 *
+		 * The negotiator would need a small change to trust
+		 * arbitrary OIDs instead of assuming it has existing
+		 * in-repo "struct commit *", but ad-hoc testing
+		 * reveals that it'll work & speed up the fetch even
+		 * more, as we could proceed in parallel with the full
+		 * bundle fetching as soon as we get the headers.
+		 */
+		struct object_id *oid = item->util;
+
+		oid_array_append(bundle_oids, oid);
+	}
+
+	if (git_env_bool("GIT_TEST_BUNDLE_URI_FAIL_UNBUNDLE", 0))
+		lseek(in_fd, 0, SEEK_SET);
+
+	strbuf_addf(&progress_title, "Receiving bundle (%d/%d)", nth, total_nr);
+	strvec_pushl(&cmd.args, "index-pack", "--stdin", "-v",
+		     "--progress-title", progress_title.buf, NULL);
+
+	if (header.prerequisites.nr && use_thin_pack)
+		strvec_push(&cmd.args, "--fix-thin");
+	strvec_push(&cmd.args, "--check-self-contained-and-connected");
+	add_index_pack_keep_option(&cmd.args);
+
+	cmd.git_cmd = 1;
+	cmd.in = in_fd;
+	cmd.no_stdout = 1;
+	cmd.git_cmd = 1;
+
+	if (start_command(&cmd)) {
+		ret = error(_("fetch-pack: unable to spawn index-pack"));
+		goto cleanup;
+	}
+
+	code = finish_command(&cmd);
+
+	if (header.prerequisites.nr && code == 1)
+		/*
+		 * index-pack exits with 1 on
+		 * --check-self-contained-and-connected to indicate
+		 * that the pack was not self-contained and
+		 * connected. We already know that from the bundle
+		 * header's prerequisites.
+		 */
+		code = 0;
+
+	if (code) {
+		ret = error(_("fetch-pack: unable to finish index-pack, exited with %d"), code);
+		goto cleanup;
+	}
+
+cleanup:
+	strbuf_release(&progress_title);
+	bundle_header_release(&header);
+	return ret;
+}
+
+static int get_bundle_uri(struct string_list_item *item, unsigned int nth,
+			  unsigned int total_nr, struct oid_array *bundle_oids,
+			  unsigned int use_thin_pack)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct strbuf tempfile = STRBUF_INIT;
+	int ret = 0;
+	const char *uri = item->string;
+	FILE *out;
+	int out_fd;
+
+	strvec_push(&cmd.args, "curl");
+	strvec_push(&cmd.args, "--silent");
+	strvec_push(&cmd.args, "--output");
+	strvec_push(&cmd.args, "-");
+	strvec_push(&cmd.args, "--");
+	strvec_push(&cmd.args, item->string);
+	cmd.git_cmd = 0;
+	cmd.no_stdin = 1;
+	cmd.out = -1;
+
+	if (start_command(&cmd)) {
+		ret = error("fetch-pack: unable to spawn http-fetch");
+		goto cleanup;
+	}
+
+	out = xfdopen(cmd.out, "r");
+	out_fd = fileno(out);
+	ret = unbundle_bundle_uri(uri, nth, total_nr, out, out_fd,
+				  bundle_oids, use_thin_pack);
+
+	if (finish_command(&cmd)) {
+		ret = error("fetch-pack: unable to finish http-fetch");
+		goto cleanup;
+	}
+
+cleanup:
+	strbuf_release(&tempfile);
+
+	return ret;
+}
+
 static int cmp_ref_by_name(const void *a_, const void *b_)
 {
 	const struct ref *a = *((const struct ref **)a_);
@@ -1586,6 +1714,130 @@ static void do_check_stateless_delimiter(int stateless_rpc,
 				  _("git fetch-pack: expected response end packet"));
 }
 
+static int get_bundle_uri_add_known_common(struct string_list_item *item,
+					   unsigned int nth, unsigned int total_nr,
+					   struct fetch_negotiator *negotiator,
+					   struct fetch_pack_args *args,
+					   unsigned int use_thin_pack)
+{
+	int i;
+	struct oid_array bundle_oids = OID_ARRAY_INIT;
+
+	/*
+	 * We don't use OBJECT_INFO_QUICK here unlike in the rest of
+	 * the fetch routines, that's because the rest of them don't
+	 * need to consider a commit object that's just been
+	 * downloaded for further negotiation, but bundle-uri does for
+	 * adding newly downloaded OIDs to the negotiator.
+	 */
+	unsigned oi_flags = OBJECT_INFO_SKIP_FETCH_OBJECT;
+
+	if (get_bundle_uri(item, nth, total_nr, &bundle_oids, use_thin_pack) < 0)
+		return error(_("could not get the bundle URI #%d"), nth);
+
+	for (i = 0; i < bundle_oids.nr; i++) {
+		struct object_id *oid = &bundle_oids.oid[i];
+		enum object_type type = OBJ_NONE;
+		struct commit *c = deref_without_lazy_fetch_extended(oid, 0,
+								     &type,
+								     oi_flags);
+		if (!c) {
+			if (type == OBJ_BLOB || type == OBJ_TREE) {
+				print_verbose(args, "have %s %s via bundle-uri (ignoring due to type)",
+					      oid_to_hex(oid), type_name(type));
+				continue;
+			} else if (type) {
+				/*
+				 * OBJ_TAG should have been peeled,
+				 * and OBJ_COMMIT should have a
+				 * non-NULL "c".
+				 *
+				 * Should be a BUG() if we were not
+				 * bending over backwards to make
+				 * bundle-uri soft-fail.
+				 */
+				return error(_("bundle-uri says it has %s, got it at unexpected type %s"),
+					     oid_to_hex(oid), type_name(type));
+			}
+		}
+
+		print_verbose(args, "have %s %s via bundle-uri",
+			      oid_to_hex(oid), type_name(type));
+
+		negotiator->known_common(negotiator, c);
+		mark_complete(oid);
+	}
+	return 0;
+}
+
+static void do_fetch_pack_v2_bundle_uri(struct fetch_pack_args *args,
+					struct string_list  *bundle_uri,
+					struct fetch_negotiator *negotiator)
+{
+	struct string_list_item *item;
+	struct string_list list = STRING_LIST_INIT_NODUP;
+	struct string_list default_protocols = STRING_LIST_INIT_NODUP;
+	struct string_list *ok_protocols;
+
+	if (!bundle_uri)
+		return;
+
+	if (!bundle_uri->nr)
+		return;
+
+	if (uri_protocols.nr) {
+		ok_protocols = &uri_protocols;
+	} else {
+		string_list_append(&default_protocols, "http");
+		string_list_append(&default_protocols, "https");
+		ok_protocols = &default_protocols;
+	}
+
+	for_each_string_list_item(item, bundle_uri) {
+		const char *uri = item->string;
+		int protocol_ok = 0;
+		struct string_list_item *item2;
+
+		for_each_string_list_item(item2, ok_protocols) {
+			const char *s = item2->string;
+			const char *p;
+
+			if (skip_prefix(item->string, s, &p) &&
+			    starts_with(p, "://")) {
+				protocol_ok = 1;
+				break;
+			}
+		}
+
+		if (!protocol_ok) {
+			print_verbose(args, "skipping bundle-uri not on protocol whitelist: %s",
+				      item->string);
+			continue;
+		}
+
+		string_list_append(&list, uri)->util = item->util;
+	}
+
+	if (list.nr) {
+		int i;
+		unsigned int total_nr = list.nr;
+
+		trace2_region_enter("fetch-pack", "bundle-uri", the_repository);
+		for (i = 0; i < total_nr; i++) {
+			struct string_list_item item = list.items[i];
+			unsigned int nth = i + 1;
+
+			get_bundle_uri_add_known_common(&item, nth, total_nr,
+							negotiator, args,
+							args->use_thin_pack);
+		}
+		trace2_region_leave("fetch-pack", "bundle-uri", the_repository);
+	}
+
+	string_list_clear(&default_protocols, 0);
+}
+
+
 static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 				    int fd[2],
 				    const struct ref *orig_ref,
@@ -1609,6 +1861,7 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 	struct string_list packfile_uris = STRING_LIST_INIT_DUP;
 	int i;
 	struct strvec index_pack_args = STRVEC_INIT;
+	struct string_list *bundle_uri = args->bundle_uri;
 
 	negotiator = &negotiator_alloc;
 	if (args->refetch)
@@ -1616,6 +1869,8 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 	else
 		fetch_negotiator_init(r, negotiator);
 
+	do_fetch_pack_v2_bundle_uri(args, bundle_uri, negotiator);
+
 	packet_reader_init(&reader, fd[0], NULL, 0,
 			   PACKET_READ_CHOMP_NEWLINE |
 			   PACKET_READ_DIE_ON_ERR_PACKET);
diff --git a/fetch-pack.h b/fetch-pack.h
index 8c7752fc821..5d8c8b03e1f 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -24,6 +24,12 @@ struct fetch_pack_args {
 	 */
 	const struct oid_array *negotiation_tips;
 
+	/*
+	 * A pointer to the already populated transport.bundle_uri
+	 * struct.
+	 */
+	struct string_list *bundle_uri;
+
 	unsigned deepen_relative:1;
 	unsigned quiet:1;
 	unsigned keep_pack:1;
diff --git a/t/lib-t5730-protocol-v2-bundle-uri.sh b/t/lib-t5730-protocol-v2-bundle-uri.sh
index 28c095c1224..0235ba50d6f 100644
--- a/t/lib-t5730-protocol-v2-bundle-uri.sh
+++ b/t/lib-t5730-protocol-v2-bundle-uri.sh
@@ -7,6 +7,8 @@ case "$T5730_PROTOCOL" in
 file)
 	T5730_PARENT=file_parent
 	T5730_URI="file://$PWD/file_parent"
+	T5730_URI_BDL_PROTO="file://"
+	T5730_URI_BDL="$T5730_URI_BDL_PROTO$PWD/file_parent"
 	T5730_BUNDLE_URI="$T5730_URI/fake.bdl"
 	test_set_prereq T5730_FILE
 	;;
@@ -15,6 +17,8 @@ git)
 	start_git_daemon --export-all --enable=receive-pack
 	T5730_PARENT="$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
 	T5730_URI="$GIT_DAEMON_URL/parent"
+	T5730_URI_BDL_PROTO="file://"
+	T5730_URI_BDL="$T5730_URI_BDL_PROTO$GIT_DAEMON_DOCUMENT_ROOT_PATH/parent"
 	T5730_BUNDLE_URI="https://example.com/fake.bdl"
 	test_set_prereq T5730_GIT
 	;;
@@ -24,6 +28,8 @@ http)
 	T5730_PARENT="$HTTPD_DOCUMENT_ROOT_PATH/http_parent"
 	T5730_URI="$HTTPD_URL/smart/http_parent"
 	T5730_BUNDLE_URI="https://example.com/fake.bdl"
+	T5730_URI_BDL_PROTO="http://"
+	T5730_URI_BDL="$HTTPD_URL/dumb/http_parent"
 	test_set_prereq T5730_HTTP
 	;;
 *)
@@ -33,7 +39,20 @@ esac
 
 test_expect_success "setup protocol v2 $T5730_PROTOCOL:// tests" '
 	git init "$T5730_PARENT" &&
-	test_commit -C "$T5730_PARENT" one
+	test_commit -C "$T5730_PARENT" one &&
+	test_commit -C "$T5730_PARENT" two &&
+	test_commit -C "$T5730_PARENT" three &&
+	test_commit -C "$T5730_PARENT" four &&
+	test_commit -C "$T5730_PARENT" five &&
+	test_commit -C "$T5730_PARENT" six &&
+
+	mkdir "$T5730_PARENT"/bdl &&
+	git -C "$T5730_PARENT" bundle create bdl/1.bdl one &&
+	git -C "$T5730_PARENT" bundle create bdl/1-2.bdl one..two &&
+	git -C "$T5730_PARENT" bundle create bdl/2-3.bdl two..three &&
+	git -C "$T5730_PARENT" bundle create bdl/3-4.bdl three..four &&
+	git -C "$T5730_PARENT" bundle create bdl/4-5.bdl four..five &&
+	git -C "$T5730_PARENT" bundle create bdl/5-6.bdl five..six
 '
 
 # Poor man's URI escaping. Good enough for the test suite whose trash
@@ -313,3 +332,127 @@ test_expect_success "ls-remote-bundle-uri with bad -c transfer.injectBundleURI p
 	test_cmp err.expect err.actual &&
 	test_path_is_missing log
 '
+
+test_cmp_repo_refs() {
+	one="$1"
+	two="$2"
+	shift 2
+
+	git -C "$one" for-each-ref "$@" >expect &&
+	git -C "$two" for-each-ref "$@" >actual &&
+	test_cmp expect actual
+}
+
+show_cr () {
+	tr '\015' Q | sed -e "s/Q/<CR>\\$LF/g"
+}
+
+test_expect_success CURL "clone with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+
+	test_when_finished "rm -rf event log child" &&
+	GIT_TRACE2_EVENT="$PWD/event" \
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child &&
+	test_region progress "Receiving bundle (1/1)" event &&
+	grep "clone> want " log &&
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
+'
+
+test_expect_success "fetch with bundle-uri protocol v2 over $T5730_PROTOCOL:// 1.bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+
+	test_when_finished "rm -rf event log child" &&
+	git init --bare child &&
+	git -C child remote add --mirror=fetch origin "$T5730_URI" &&
+
+	GIT_TRACE2_EVENT="$PWD/event" \
+	GIT_TRACE_PACKET="$PWD/log" \
+	git -C child \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		fetch --verbose --verbose &&
+
+	if test_have_prereq CURL
+	then
+		# Fetch is not supported yet
+		! test_region progress "Receiving bundle (1/1)" event &&
+		grep "fetch> want " log
+	else
+		! grep "fetch-pack: unable to spawn" event
+	fi &&
+
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
+'
+
+test_expect_success "clone with bundle-uri protocol v2 with $T5730_PROTOCOL:// 1 + 1-2 + [...].bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1-2.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/2-3.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/3-4.bdl | test_uri_escape)" --add &&
+
+	test_when_finished "rm -rf event log child" &&
+	GIT_TRACE2_EVENT="$PWD/event" \
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child &&
+
+	if test_have_prereq CURL
+	then
+		test_region progress "Receiving bundle (1/4)" event &&
+		test_region progress "Receiving bundle (2/4)" event &&
+		test_region progress "Receiving bundle (3/4)" event &&
+		test_region progress "Receiving bundle (4/4)" event
+	else
+		grep "fetch-pack: unable to spawn" event
+	fi &&
+
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags &&
+	grep "clone> want " log
+'
+
+test_expect_success "clone with bundle-uri protocol v2 with $T5730_PROTOCOL:// ALL.bdl via $T5730_URI_BDL_PROTO" '
+	test_when_finished "rm -f log" &&
+
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1.bdl | test_uri_escape)" &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/1-2.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/2-3.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/3-4.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/4-5.bdl | test_uri_escape)" --add &&
+	test_config -C "$T5730_PARENT" uploadpack.bundleURI "$(echo $T5730_URI_BDL/bdl/5-6.bdl | test_uri_escape)" --add &&
+
+	test_when_finished "rm -rf event log child" &&
+	GIT_TRACE2_EVENT="$PWD/event" \
+	GIT_TRACE_PACKET="$PWD/log" \
+	git \
+		-c protocol.version=2 \
+		-c fetch.uriProtocols=file,http \
+		clone --verbose --verbose \
+		"$T5730_URI" child &&
+
+	if test_have_prereq CURL
+	then
+		test_region progress "Receiving bundle (1/6)" event &&
+		test_region progress "Receiving bundle (2/6)" event &&
+		test_region progress "Receiving bundle (3/6)" event &&
+		test_region progress "Receiving bundle (4/6)" event &&
+		test_region progress "Receiving bundle (5/6)" event &&
+		test_region progress "Receiving bundle (6/6)" event &&
+		! grep "clone> want " log
+	else
+		grep "fetch-pack: unable to spawn" event
+	fi &&
+
+	test_cmp_repo_refs "$T5730_PARENT" child refs/heads refs/tags
+'
diff --git a/transport.c b/transport.c
index 9e20b531215..7e5e1192f95 100644
--- a/transport.c
+++ b/transport.c
@@ -437,6 +437,7 @@ static int fetch_refs_via_pack(struct transport *transport,
 	args.server_options = transport->server_options;
 	args.negotiation_tips = data->options.negotiation_tips;
 	args.reject_shallow_remote = transport->smart_options->reject_shallow;
+	args.bundle_uri = &transport->bundle_uri;
 
 	if (!data->got_remote_heads) {
 		int i;
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 16/36] bundle-uri: make the download program configurable
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (14 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 17/36] remote-curl: add 'get' capability Ævar Arnfjörð Bjarmason
                       ` (20 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

As noted in a preceding commit we really should be using libcurl's C
API by default in get_bundle_uri(), but testing with a command-line
program can be very handy, and useful e.g. to implement custom or
ad-hoc caching.

E.g. using part of the recipe noted in a preceding commit to create
the "git-master-only.bdl" and "git-master-to-next.bdl" files, we can
implement a trivial caching shellscript as:

	cat >get-bundle.sh <<-\EOF &&
	#!/bin/sh
	set -xe

	uri="$1"

	bundle_cache_key () {
		echo "Computing cache key for URI '$1' (only getting the header)" >&2

		curl --silent --output - -- "$1" |
		sed -n -e '/^$/q' -e 'p' |
		git hash-object --stdin
	}

	get_cached_bundle_uri() {
		cache_key=$(bundle_cache_key "$1")

		path="/tmp/bundle-cache-$cache_key.bdl"

		if test -e "$path"
		then
			echo "Using cache '$path' for URI '$1'" >&2
			cat "$path"
		else
			echo "Downloading bundle URI $1" >&2
			curl --silent --output - -- "$1" | tee "$path"
		fi
	}

	get_cached_bundle_uri "$1"
	EOF
	chmod +x get-bundle.sh &&
	rm -rf /tmp/git.git &&
	./git \
		-c protocol.version=2 \
		-c fetch.uriProtocols=file \
		-c transfer.bundleURI.downloader=./get-bundle.sh \
		-c transfer.injectBundleURI="file:///tmp/git-master-only.bdl" \
		-c transfer.injectBundleURI="file:///tmp/git-master-to-next.bdl" \
		clone --bare --no-tags --single-branch --branch next --template= \
		--verbose --verbose \
		https://github.com/git/git.git /tmp/git.git

Now, clearly that specific example is rather pointless. We're getting
a local file anyway, so "cat"-ing another local file doesn't make any
difference; it's even slightly slower and more redundant, as we have
to fetch it twice with "curl".

But the point is that this can be trivially improved for use in any
arbitrary custom caching strategy. E.g.:

 * A less dumb implementation that would stream the remote URL,
   check the header as we go, and disconnect if we've got that content
   locally.
 * Ditto, but using an ETag or other strategy.
 * N boxes could share a cache on a shared NFS mount, or N
   disconnected git processes could use a common cache without the
   need for a front-line HTTP proxy server.

 * It would be trivial to extend this to guard against a "thundering
   herd" (e.g. concurrent CI) downloading the same bundle N times. As
   soon as we'd get the header we'd create a $cache_key.lock as we
   download the rest, and other concurrent clients spotting that would
   wait, then eventually cache "$cache_key".

   Still racy as N clients could download the header in parallel, but
   way less so (the header will be a tiny part of the payload).
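
As a rough illustration of the "$cache_key.lock" idea above (in C
rather than shell, and not part of this patch), here is a minimal
sketch. It assumes the cache directory lives on a filesystem where
O_CREAT|O_EXCL is atomic; the function name try_lock_cache_key() and
the "<dir>/<key>.lock" layout are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to become the one downloader for a bundle.
 * Returns 1 if we created "<cache_dir>/<cache_key>.lock" (we should
 * download), 0 if the lock already exists (someone else is
 * downloading, so we should wait), and -1 on any other error.
 */
int try_lock_cache_key(const char *cache_dir, const char *cache_key)
{
	char path[4096];
	int n, fd;

	assert(cache_dir && cache_key);

	n = snprintf(path, sizeof(path), "%s/%s.lock", cache_dir, cache_key);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	/* O_EXCL makes creation fail with EEXIST if the lock is held. */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
		return errno == EEXIST ? 0 : -1;

	close(fd);
	return 1;
}
```

The winner would remove the lock file once the cached bundle is fully
written. Waiters would poll for that removal and then read the cache.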

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/transfer.txt | 7 +++++++
 fetch-pack.c                      | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/config/transfer.txt b/Documentation/config/transfer.txt
index ae85ca5760b..5310cd96cb9 100644
--- a/Documentation/config/transfer.txt
+++ b/Documentation/config/transfer.txt
@@ -84,6 +84,13 @@ transfer.bundleURI::
 	using bundles to bootstap is possible. Defaults to `true`,
 	i.e. bundle-uri is tried whenever a server offers it.
 
+transfer.bundleURI.downloader::
+	When set, `<program>` will be invoked when
+	`transfer.bundleURI` is in effect to download URIs containing
+	bundles. Expected to take one `URI` as an argument, and to
+	emit the bundle on STDOUT. Defaults to "curl --silent --output
+	- --". I.e. we'll invoke "curl --silent --output - -- <URI>".
+
 transfer.injectBundleURI::
 	Allows for the injection of `bundle-uri` lines into the
 	protocol v2 transport dialog (see `protocol.version` in
diff --git a/fetch-pack.c b/fetch-pack.c
index 316fb2fd65d..7e696142c4d 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1121,12 +1121,18 @@ static int get_bundle_uri(struct string_list_item *item, unsigned int nth,
 	const char *uri = item->string;
 	FILE *out;
 	int out_fd;
+	const char *tmp;
 
 	strvec_push(&cmd.args, "curl");
 	strvec_push(&cmd.args, "--silent");
 	strvec_push(&cmd.args, "--output");
 	strvec_push(&cmd.args, "-");
 	strvec_push(&cmd.args, "--");
+	if (!git_config_get_string_tmp("transfer.bundleURI.downloader", &tmp)) {
+		strvec_clear(&cmd.args);
+		strvec_push(&cmd.args, tmp);
+		cmd.use_shell = 1;
+	}
 	strvec_push(&cmd.args, item->string);
 	cmd.git_cmd = 0;
 	cmd.no_stdin = 1;
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 17/36] remote-curl: add 'get' capability
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (15 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 16/36] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles Ævar Arnfjörð Bjarmason
                       ` (19 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

A future change will want a way to download a file over HTTP(S) using
the simplest of download mechanisms. We do not want to assume that the
server on the other side understands anything about the Git protocol;
it could be a simple static web server.

Create the new 'get' capability for the remote helpers which advertises
that the 'get' command is available. A caller can send a line containing
'get <url> <path>' to download the file at <url> into the file at
<path>.
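
To make the line format concrete, here is a standalone sketch of
splitting such a line at the first space after the command, which is
the same rule parse_get() in the diff applies. It uses plain buffers
instead of strbuf, and the name split_get_line() is hypothetical. Note
that under this rule the <url> cannot contain an unencoded space,
while the <path> can:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical standalone helper (not part of the patch): split a
 * "get <url> <path>" line at the first space after the command.
 * Returns 0 on success, -1 on a malformed line or a too-small buffer.
 */
int split_get_line(const char *line,
		   char *url, size_t url_sz,
		   char *path, size_t path_sz)
{
	const char *p, *space;
	size_t url_len, path_len;

	assert(line && url && path);

	if (strncmp(line, "get ", 4))
		return -1;
	p = line + 4;

	/* Everything up to the first space is the URL. */
	space = strchr(p, ' ');
	if (!space)
		return -1;

	url_len = (size_t)(space - p);
	path_len = strlen(space + 1);
	if (url_len >= url_sz || path_len >= path_sz)
		return -1;

	memcpy(url, p, url_len);
	url[url_len] = '\0';
	/* The rest of the line, including any spaces, is the path. */
	memcpy(path, space + 1, path_len + 1);
	return 0;
}
```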

RFC-TODO: This change requires tests directly on the remote helper.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitremote-helpers.txt |  6 ++++++
 remote-curl.c                       | 32 +++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/Documentation/gitremote-helpers.txt b/Documentation/gitremote-helpers.txt
index 6f1e269ae43..f82588601a9 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -168,6 +168,9 @@ Supported commands: 'list', 'import'.
 	Can guarantee that when a clone is requested, the received
 	pack is self contained and is connected.
 
+'get'::
+	Can use the 'get' command to download a file from a given URI.
+
 If a helper advertises 'connect', Git will use it if possible and
 fall back to another capability if the helper requests so when
 connecting (see the 'connect' command under COMMANDS).
@@ -418,6 +421,9 @@ Supported if the helper has the "connect" capability.
 +
 Supported if the helper has the "stateless-connect" capability.
 
+'get' <uri> <path>::
+	Downloads the file from the given `<uri>` to the given `<path>`.
+
 If a fatal error occurs, the program writes the error message to
 stderr and exits. The caller should expect that a suitable error
 message has been printed if the child closes the connection without
diff --git a/remote-curl.c b/remote-curl.c
index 67f178b1120..53750d88e76 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1276,6 +1276,33 @@ static void parse_fetch(struct strbuf *buf)
 	strbuf_reset(buf);
 }
 
+static void parse_get(struct strbuf *buf)
+{
+	struct http_get_options opts = { 0 };
+	struct strbuf url = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	const char *p, *space;
+
+	if (!skip_prefix(buf->buf, "get ", &p))
+		die(_("http transport does not support %s"), buf->buf);
+
+	space = strchr(p, ' ');
+
+	if (!space)
+		die(_("protocol error: expected '<url> <path>', missing space"));
+
+	strbuf_add(&url, p, space - p);
+	strbuf_addstr(&path, space + 1);
+
+	http_get_file(url.buf, path.buf, &opts);
+
+	strbuf_release(&url);
+	strbuf_release(&path);
+	printf("\n");
+	fflush(stdout);
+	strbuf_reset(buf);
+}
+
 static int push_dav(int nr_spec, const char **specs)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -1549,9 +1576,14 @@ int cmd_main(int argc, const char **argv)
 				printf("unsupported\n");
 			fflush(stdout);
 
+		} else if (skip_prefix(buf.buf, "get ", &arg)) {
+			parse_get(&buf);
+			fflush(stdout);
+
 		} else if (!strcmp(buf.buf, "capabilities")) {
 			printf("stateless-connect\n");
 			printf("fetch\n");
+			printf("get\n");
 			printf("option\n");
 			printf("push\n");
 			printf("check-connectivity\n");
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (16 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 17/36] remote-curl: add 'get' capability Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch' Ævar Arnfjörð Bjarmason
                       ` (18 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

The 'git bundle fetch <uri>' command will be used to download one or
more bundles from a specified '<uri>'. The implementation being added
here focuses only on downloading a file from '<uri>' and unbundling it
if it is a valid bundle file.

If it is not a bundle file, then we currently die(), but a later change
will attempt to interpret it as a table of contents with possibly
multiple bundles listed, along with other metadata for each bundle.

That explains a bit why cmd_bundle_fetch() has three steps carefully
commented, including a "stack" that currently can only hold one bundle.
We will later update this while loop to push onto the stack when
necessary.

RFC-TODO: Add documentation to Documentation/git-bundle.txt

RFC-TODO: Add direct tests of 'git bundle fetch' when the URI is a
bundle file.

RFC-TODO: Split out the docs and subcommand boilerplate into its own
commit.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/git-bundle.txt |   1 +
 builtin/bundle.c             | 261 +++++++++++++++++++++++++++++++++++
 2 files changed, 262 insertions(+)

diff --git a/Documentation/git-bundle.txt b/Documentation/git-bundle.txt
index 7685b570455..bf5cd90391c 100644
--- a/Documentation/git-bundle.txt
+++ b/Documentation/git-bundle.txt
@@ -12,6 +12,7 @@ SYNOPSIS
 'git bundle' create [-q | --quiet | --progress | --all-progress] [--all-progress-implied]
 		    [--version=<version>] <file> <git-rev-list-args>
 'git bundle' verify [-q | --quiet] <file>
+'git bundle' fetch [--filter=<spec>] <uri>
 'git bundle' list-heads <file> [<refname>...]
 'git bundle' unbundle [--progress] <file> [<refname>...]
 
diff --git a/builtin/bundle.c b/builtin/bundle.c
index 2adad545a2e..6b6107d83cf 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -3,6 +3,10 @@
 #include "parse-options.h"
 #include "cache.h"
 #include "bundle.h"
+#include "run-command.h"
+#include "hashmap.h"
+#include "object-store.h"
+#include "refs.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -14,6 +18,7 @@
 static const char * const builtin_bundle_usage[] = {
   N_("git bundle create [<options>] <file> <git-rev-list args>"),
   N_("git bundle verify [<options>] <file>"),
+  N_("git bundle fetch [<options>] <uri>"),
   N_("git bundle list-heads <file> [<refname>...]"),
   N_("git bundle unbundle <file> [<refname>...]"),
   NULL
@@ -29,6 +34,11 @@ static const char * const builtin_bundle_verify_usage[] = {
   NULL
 };
 
+static const char * const builtin_bundle_fetch_usage[] = {
+	N_("git bundle fetch [--filter=<spec>] <uri>"),
+	NULL
+};
+
 static const char * const builtin_bundle_list_heads_usage[] = {
   N_("git bundle list-heads <file> [<refname>...]"),
   NULL
@@ -132,6 +142,255 @@ static int cmd_bundle_verify(int argc, const char **argv, const char *prefix) {
 	return ret;
 }
 
+/**
+ * The remote_bundle_info struct contains the necessary data for
+ * the list of bundles advertised by a table of contents. If the
+ * bundle URI instead contains a single bundle, then this struct
+ * can represent a single bundle without a 'uri' but with a
+ * tempfile storing its current location on disk.
+ */
+struct remote_bundle_info {
+	struct hashmap_entry ent;
+
+	/**
+	 * The 'id' is a name given to the bundle for reference
+	 * by other bundle infos.
+	 */
+	char *id;
+
+	/**
+	 * The 'uri' is the location of the remote bundle so
+	 * it can be downloaded on-demand. This will be NULL
+	 * if there was no table of contents.
+	 */
+	char *uri;
+
+	/**
+	 * The 'next_id' string, if non-NULL, contains the 'id'
+	 * for a bundle that contains the prerequisites for this
+	 * bundle. Used by table of contents to allow fetching
+	 * a portion of a repository incrementally.
+	 */
+	char *next_id;
+
+	/**
+	 * A table of contents can include a timestamp for the
+	 * bundle as a heuristic for describing a list of bundles
+	 * in order of recency.
+	 */
+	timestamp_t timestamp;
+
+	/**
+	 * If the bundle has been downloaded, then 'file' is a
+	 * filename storing its contents. Otherwise, 'file' is
+	 * an empty string.
+	 */
+	struct strbuf file;
+
+	/**
+	 * The 'stack_next' pointer allows this struct to form
+	 * a stack.
+	 */
+	struct remote_bundle_info *stack_next;
+};
+
+static void download_uri_to_file(const char *uri, const char *file)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *child_in;
+
+	strvec_pushl(&cp.args, "git-remote-https", "origin", uri, NULL);
+	cp.in = -1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		die(_("failed to start remote helper"));
+
+	child_in = fdopen(cp.in, "w");
+	if (!child_in)
+		die(_("cannot write to child process"));
+
+	fprintf(child_in, "get %s %s\n\n", uri, file);
+	fclose(child_in);
+
+	if (finish_command(&cp))
+		die(_("remote helper failed"));
+}
+
+static void find_temp_filename(struct strbuf *name)
+{
+	int fd;
+	/*
+	 * Find a temporary filename that is available. This is briefly
+	 * racy, but unlikely to collide.
+	 */
+	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
+	if (fd < 0)
+		die(_("failed to create temporary file"));
+	close(fd);
+	unlink(name->buf);
+}
+
+static void unbundle_fetched_bundle(struct remote_bundle_info *info)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *f;
+	struct strbuf line = STRBUF_INIT;
+	struct strbuf bundle_ref = STRBUF_INIT;
+	size_t bundle_prefix_len;
+
+	strvec_pushl(&cp.args, "bundle", "unbundle",
+				info->file.buf, NULL);
+	cp.git_cmd = 1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		die(_("failed to start 'unbundle' process"));
+
+	strbuf_addstr(&bundle_ref, "refs/bundles/");
+	bundle_prefix_len = bundle_ref.len;
+
+	f = fdopen(cp.out, "r");
+	while (strbuf_getline(&line, f) != EOF) {
+		struct object_id oid, old_oid;
+		const char *refname, *branch_name, *end;
+		char *space;
+		int has_old;
+
+		strbuf_trim_trailing_newline(&line);
+
+		space = strchr(line.buf, ' ');
+
+		if (!space)
+			continue;
+
+		refname = space + 1;
+		*space = '\0';
+		parse_oid_hex(line.buf, &oid, &end);
+
+		if (!skip_prefix(refname, "refs/heads/", &branch_name))
+			continue;
+
+		strbuf_setlen(&bundle_ref, bundle_prefix_len);
+		strbuf_addstr(&bundle_ref, branch_name);
+
+		has_old = !read_ref(bundle_ref.buf, &old_oid);
+
+		update_ref("bundle fetch", bundle_ref.buf, &oid,
+				has_old ? &old_oid : NULL,
+				REF_SKIP_OID_VERIFICATION,
+				UPDATE_REFS_MSG_ON_ERR);
+	}
+
+	if (finish_command(&cp))
+		die(_("failed to unbundle bundle from '%s'"), info->uri);
+
+	unlink_or_warn(info->file.buf);
+}
+
+static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
+{
+	int ret = 0;
+	int progress = isatty(2);
+	char *bundle_uri;
+	struct remote_bundle_info first_file = {
+		.file = STRBUF_INIT,
+	};
+	struct remote_bundle_info *stack = NULL;
+
+	struct option options[] = {
+		OPT_BOOL(0, "progress", &progress,
+			 N_("show progress meter")),
+		OPT_END()
+	};
+
+	argc = parse_options_cmd_bundle(argc, argv, prefix,
+			builtin_bundle_fetch_usage, options, &bundle_uri);
+
+	if (!startup_info->have_repository)
+		die(_("'fetch' requires a repository"));
+
+	/*
+	 * Step 1: determine protocol for uri, and download contents to
+	 * a temporary location.
+	 */
+	first_file.uri = bundle_uri;
+	find_temp_filename(&first_file.file);
+	download_uri_to_file(bundle_uri, first_file.file.buf);
+
+	/*
+	 * Step 2: Check if the file is a bundle (if so, add it to the
+	 * stack and move to step 3).
+	 */
+
+	if (is_bundle(first_file.file.buf, 1)) {
+		/* The simple case: only one file, no stack to worry about. */
+		stack = &first_file;
+	} else {
+		/* TODO: Expect and parse a table of contents. */
+		die(_("unexpected data at bundle URI"));
+	}
+
+	/*
+	 * Step 3: For each bundle in the stack:
+	 * 	i. If not downloaded to a temporary file, download it.
+	 * 	ii. Once downloaded, check that its prerequisites are in
+	 * 	    the object database. If not, then push its dependent
+	 * 	    bundle onto the stack. (Fail if no such bundle exists.)
+	 * 	iii. If all prerequisites are present, then unbundle the
+	 * 	     temporary file and pop the bundle from the stack.
+	 */
+	while (stack) {
+		int valid = 1;
+		int bundle_fd;
+		struct string_list_item *prereq;
+		struct bundle_header header = BUNDLE_HEADER_INIT;
+
+		if (!stack->file.len) {
+			find_temp_filename(&stack->file);
+			download_uri_to_file(stack->uri, stack->file.buf);
+			if (!is_bundle(stack->file.buf, 1))
+				die(_("file downloaded from '%s' is not a bundle"), stack->uri);
+		}
+
+		bundle_header_init(&header);
+		bundle_fd = read_bundle_header(stack->file.buf, &header);
+		if (bundle_fd < 0)
+			die(_("failed to read bundle from '%s'"), stack->uri);
+
+		for_each_string_list_item(prereq, &header.prerequisites) {
+			struct object_info info = OBJECT_INFO_INIT;
+			struct object_id *oid = prereq->util;
+
+			if (oid_object_info_extended(the_repository, oid, &info,
+						     OBJECT_INFO_QUICK)) {
+				valid = 0;
+				break;
+			}
+		}
+
+		close(bundle_fd);
+		bundle_header_release(&header);
+
+		if (valid) {
+			unbundle_fetched_bundle(stack);
+		} else if (stack->next_id) {
+			/*
+			 * Load the next bundle from the hashtable and
+			 * push it onto the stack.
+			 */
+		} else {
+			die(_("bundle from '%s' has missing prerequisites and no dependent bundle"),
+			    stack->uri);
+		}
+
+		stack = stack->stack_next;
+	}
+
+	free(bundle_uri);
+	return ret;
+}
+
 static int cmd_bundle_list_heads(int argc, const char **argv, const char *prefix) {
 	struct bundle_header header = BUNDLE_HEADER_INIT;
 	int bundle_fd = -1;
@@ -212,6 +471,8 @@ int cmd_bundle(int argc, const char **argv, const char *prefix)
 		result = cmd_bundle_create(argc, argv, prefix);
 	else if (!strcmp(argv[0], "verify"))
 		result = cmd_bundle_verify(argc, argv, prefix);
+	else if (!strcmp(argv[0], "fetch"))
+		result = cmd_bundle_fetch(argc, argv, prefix);
 	else if (!strcmp(argv[0], "list-heads"))
 		result = cmd_bundle_list_heads(argc, argv, prefix);
 	else if (!strcmp(argv[0], "unbundle"))
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch'
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (17 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch' Ævar Arnfjörð Bjarmason
                       ` (17 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

From: Derrick Stolee <derrickstolee@github.com>

In order to support a flexible bundle URI feature, we allow the server
to return a "table of contents" file that is formatted according to Git
config file standards. These files can describe multiple bundles,
intended to assist with using bundle URIs for fetching or with partial
clone.

Here is an example table of contents file:

[bundle "tableofcontents"]
	version = 1

[bundle "2022-02-09-1644442601-daily"]
	uri = 2022-02-09-1644442601-daily.bundle
	timestamp = 1644442601
	requires = 2022-02-02-1643842562

[bundle "2022-02-02-1643842562"]
	uri = 2022-02-02-1643842562.bundle
	timestamp = 1643842562

[bundle "2022-02-09-1644442631-daily-blobless"]
	uri = 2022-02-09-1644442631-daily-blobless.bundle
	timestamp = 1644442631
	requires = 2022-02-02-1643842568-blobless
	filter = blob:none

[bundle "2022-02-02-1643842568-blobless"]
	uri = 2022-02-02-1643842568-blobless.bundle
	timestamp = 1643842568
	filter = blob:none

(End of example.)

This file contains some important fixed values, such as

 * bundle.tableofcontents.version = 1

Also, different bundles are referenced by <id>, using keys with names

 * bundle.<id>.uri: the URI to download this bundle. This could be an
   absolute URI or a URI relative to the bundle file's URI.
 * bundle.<id>.timestamp: the timestamp when this file was generated.
 * bundle.<id>.filter: the partial clone filter applied on this bundle.
 * bundle.<id>.requires: the ID for the previous bundle.
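
As an illustration of how a "bundle.<id>.<name>" key breaks down,
here is a standalone sketch. One assumption differs from the diff
below: parse_toc_config() splits at the first dot after "bundle.",
whereas this sketch splits at the last dot, so that an <id> which
itself contains dots (as Git config subsections may) stays intact.
The name split_bundle_key() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "bundle.<id>.<name>" into <id> (copied
 * into 'id') and <name> (returned as a pointer into 'key').
 * Returns 0 on success, -1 if the key does not have that shape.
 */
int split_bundle_key(const char *key, char *id, size_t id_sz,
		     const char **name)
{
	const char *rest, *last_dot;
	size_t id_len;

	assert(key && id && name);

	if (strncmp(key, "bundle.", 7))
		return -1;
	rest = key + 7;

	/* The variable name follows the last dot; the id precedes it. */
	last_dot = strrchr(rest, '.');
	if (!last_dot || last_dot == rest || !last_dot[1])
		return -1;

	id_len = (size_t)(last_dot - rest);
	if (id_len >= id_sz)
		return -1;

	memcpy(id, rest, id_len);
	id[id_len] = '\0';
	*name = last_dot + 1;
	return 0;
}
```

With this rule "bundle.tableofcontents.version" yields the id
"tableofcontents" and the name "version", so a caller would still
special-case that id as the diff does.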

The current change does not parse the '.filter' option, but does use the
'.requires' in the 'while (stack)' loop.

The process is that 'git bundle fetch' will parse the table of contents
and pick the most-recent bundle and download that one. That bundle's
header has a ref listing, including (possibly) a list of prerequisite
commits that the bundle itself does not contain. If any of those commits
are missing from the local repository, then Git downloads the bundle
specified by the '.requires' value and tries again.  Eventually, Git
should download a bundle whose prerequisite commits all exist in the
current repository, or a bundle with no prerequisites at all.

Of course, the server could be advertising incorrect information, so it
could advertise bundles that never satisfy the missing objects. It could
also create a directed cycle in its '.requires' specifications. In each
of these cases, Git will die with a "bundle '<id>' still invalid after
downloading required bundle" message or a "bundle from '<uri>' has
missing prerequisites and no dependent bundle" message.
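
The selection and failure cases above can be sketched in isolation as
follows. This is a simplified model, not the code in the diff: it uses
an array instead of a hashmap, ignores '.filter', and replaces the
real prerequisite check with a 'satisfied' flag. All names here
(toc_entry, toc_find, toc_most_recent, toc_resolve) are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toc_entry {
	const char *id;
	const char *requires_id; /* NULL if the bundle requires nothing */
	long timestamp;
	int satisfied;           /* stand-in: prerequisites already local */
	unsigned pushed:1;       /* set once visited, to catch cycles */
};

struct toc_entry *toc_find(struct toc_entry *toc, size_t nr, const char *id)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(toc[i].id, id))
			return &toc[i];
	return NULL;
}

struct toc_entry *toc_most_recent(struct toc_entry *toc, size_t nr)
{
	struct toc_entry *best = NULL;
	size_t i;

	for (i = 0; i < nr; i++)
		if (!best || toc[i].timestamp > best->timestamp)
			best = &toc[i];
	return best;
}

/*
 * Start at the most recent bundle and follow 'requires' until a
 * bundle's prerequisites are satisfied. Returns the number of bundles
 * that would be downloaded, or -1 on an empty table, a dangling
 * 'requires', a chain that never becomes satisfied, or a cycle.
 */
int toc_resolve(struct toc_entry *toc, size_t nr)
{
	struct toc_entry *cur;
	int depth = 0;

	assert(toc || !nr);

	cur = toc_most_recent(toc, nr);
	while (cur) {
		if (cur->pushed)
			return -1; /* already visited: directed cycle */
		cur->pushed = 1;
		depth++;

		if (cur->satisfied)
			return depth;
		if (!cur->requires_id)
			return -1; /* missing prerequisites, nothing to try */
		cur = toc_find(toc, nr, cur->requires_id);
	}
	return -1;
}
```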

RFC-TODO: add a direct test of table of contents parsing in this change.
RFC-TODO: create tests that check these erroneous cases.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/bundle.c | 172 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 165 insertions(+), 7 deletions(-)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 6b6107d83cf..255d7aa774b 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -7,6 +7,8 @@
 #include "hashmap.h"
 #include "object-store.h"
 #include "refs.h"
+#include "config.h"
+#include "packfile.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -166,12 +168,21 @@ struct remote_bundle_info {
 	char *uri;
 
 	/**
-	 * The 'next_id' string, if non-NULL, contains the 'id'
+	 * The 'requires_id' string, if non-NULL, contains the 'id'
 	 * for a bundle that contains the prerequisites for this
 	 * bundle. Used by table of contents to allow fetching
 	 * a portion of a repository incrementally.
 	 */
-	char *next_id;
+	char *requires_id;
+
+	/**
+	 * The 'filter_str' string, if non-NULL, specifies the
+	 * partial clone filter that was applied when creating this
+	 * bundle. Allows selecting bundles that match the client's
+	 * desired filter. If NULL, then no filter exists on the
+	 * bundle.
+	 */
+	char *filter_str;
 
 	/**
 	 * A table of contents can include a timestamp for the
@@ -192,7 +203,108 @@ struct remote_bundle_info {
 	 * a stack.
 	 */
 	struct remote_bundle_info *stack_next;
+
+	/**
+	 * 'pushed' is set when first pushing the required bundle
+	 * onto the stack. Used to error out when verifying the
+	 * prerequisites and avoiding an infinite loop.
+	 */
+	unsigned pushed:1;
 };
+#define REMOTE_BUNDLE_INFO_INIT { \
+	.file = STRBUF_INIT, \
+}
+
+static int remote_bundle_cmp(const void *unused_cmp_data,
+			     const struct hashmap_entry *a,
+			     const struct hashmap_entry *b,
+			     const void *key)
+{
+	const struct remote_bundle_info *ee1 =
+			container_of(a, struct remote_bundle_info, ent);
+	const struct remote_bundle_info *ee2 =
+			container_of(b, struct remote_bundle_info, ent);
+
+	return strcmp(ee1->id, ee2->id);
+}
+
+static int parse_toc_config(const char *key, const char *value, void *data)
+{
+	struct hashmap *toc = data;
+	const char *key1, *key2, *id_end;
+	struct strbuf id = STRBUF_INIT;
+	struct remote_bundle_info info_lookup = REMOTE_BUNDLE_INFO_INIT;
+	struct remote_bundle_info *info;
+
+	if (!skip_prefix(key, "bundle.", &key1))
+		return -1;
+
+	if (skip_prefix(key1, "tableofcontents.", &key2)) {
+		if (!strcmp(key2, "version")) {
+			int version = git_config_int(key, value);
+
+			if (version != 1) {
+				warning(_("table of contents version %d not understood"), version);
+				return -1;
+			}
+		}
+
+		return 0;
+	}
+
+	id_end = strchr(key1, '.');
+
+	/*
+	 * If this key is of the form "bundle.<x>" with no third item,
+	 * then we do not know about it. We should ignore it. Later versions
+	 * might start caring about this data on an optional basis. Increase
+	 * the version number to add keys that must be understood.
+	 */
+	if (!id_end)
+		return 0;
+
+	strbuf_add(&id, key1, id_end - key1);
+	key2 = id_end + 1;
+
+	info_lookup.id = id.buf;
+	hashmap_entry_init(&info_lookup.ent, strhash(info_lookup.id));
+	if (!(info = hashmap_get_entry(toc, &info_lookup, ent, NULL))) {
+		CALLOC_ARRAY(info, 1);
+		info->id = strbuf_detach(&id, NULL);
+		strbuf_init(&info->file, 0);
+		hashmap_entry_init(&info->ent, strhash(info->id));
+		hashmap_add(toc, &info->ent);
+	}
+
+	if (!strcmp(key2, "uri")) {
+		if (info->uri)
+			warning(_("duplicate 'uri' value for id '%s'"), info->id);
+		else
+			info->uri = xstrdup(value);
+		return 0;
+	} else if (!strcmp(key2, "timestamp")) {
+		if (info->timestamp)
+			warning(_("duplicate 'timestamp' value for id '%s'"), info->id);
+		else
+			info->timestamp = git_config_int64(key, value);
+		return 0;
+	} else if (!strcmp(key2, "requires")) {
+		if (info->requires_id)
+			warning(_("duplicate 'requires' value for id '%s'"), info->id);
+		else
+			info->requires_id = xstrdup(value);
+		return 0;
+	} else if (!strcmp(key2, "filter")) {
+		if (info->filter_str)
+			warning(_("duplicate 'filter' value for id '%s'"), info->id);
+		else
+			info->filter_str = xstrdup(value);
+		return 0;
+	}
+
+	/* Return 0 here to ignore unknown options. */
+	return 0;
+}
 
 static void download_uri_to_file(const char *uri, const char *file)
 {
@@ -290,13 +402,14 @@ static void unbundle_fetched_bundle(struct remote_bundle_info *info)
 
 static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 {
-	int ret = 0;
+	int ret = 0, used_hashmap = 0;
 	int progress = isatty(2);
 	char *bundle_uri;
 	struct remote_bundle_info first_file = {
 		.file = STRBUF_INIT,
 	};
 	struct remote_bundle_info *stack = NULL;
+	struct hashmap toc = { 0 };
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
@@ -320,15 +433,31 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 	/*
 	 * Step 2: Check if the file is a bundle (if so, add it to the
-	 * stack and move to step 3).
+	 * stack and move to step 3). Otherwise, expect it to be a table
+	 * of contents. Use the table to populate a hashtable of bundles
+	 * and push the most recent bundle to the stack.
 	 */
 
 	if (is_bundle(first_file.file.buf, 1)) {
 		/* The simple case: only one file, no stack to worry about. */
 		stack = &first_file;
 	} else {
-		/* TODO: Expect and parse a table of contents. */
-		die(_("unexpected data at bundle URI"));
+		struct hashmap_iter iter;
+		struct remote_bundle_info *info;
+		timestamp_t max_time = 0;
+
+		/* populate a hashtable with all relevant bundles. */
+		used_hashmap = 1;
+		hashmap_init(&toc, remote_bundle_cmp, NULL, 0);
+		git_config_from_file(parse_toc_config, first_file.file.buf, &toc);
+
+		/* initialize stack using timestamp heuristic. */
+		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			if (info->timestamp > max_time || !stack) {
+				stack = info;
+				max_time = info->timestamp;
+			}
+		}
 	}
 
 	/*
@@ -358,6 +487,7 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		if (bundle_fd < 0)
 			die(_("failed to read bundle from '%s'"), stack->uri);
 
+		reprepare_packed_git(the_repository);
 		for_each_string_list_item(prereq, &header.prerequisites) {
 			struct object_info info = OBJECT_INFO_INIT;
 			struct object_id *oid = prereq->util;
@@ -374,11 +504,28 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		if (valid) {
 			unbundle_fetched_bundle(stack);
-		} else if (stack->next_id) {
+		} else if (stack->pushed) {
+			die(_("bundle '%s' still invalid after downloading required bundle"), stack->id);
+		} else if (stack->requires_id) {
 			/*
 			 * Load the next bundle from the hashtable and
 			 * push it onto the stack.
 			 */
+			struct remote_bundle_info *info;
+			struct remote_bundle_info info_lookup = REMOTE_BUNDLE_INFO_INIT;
+			info_lookup.id = stack->requires_id;
+
+			hashmap_entry_init(&info_lookup.ent, strhash(info_lookup.id));
+			if ((info = hashmap_get_entry(&toc, &info_lookup, ent, NULL))) {
+				/* Push onto the stack */
+				stack->pushed = 1;
+				info->stack_next = stack;
+				stack = info;
+				continue;
+			} else {
+				die(_("unable to find bundle '%s' required by bundle '%s'"),
+				    stack->requires_id, stack->id);
+			}
 		} else {
 			die(_("bundle from '%s' has missing prerequisites and no dependent bundle"),
 			    stack->uri);
@@ -387,6 +534,17 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		stack = stack->stack_next;
 	}
 
+	if (used_hashmap) {
+		struct hashmap_iter iter;
+		struct remote_bundle_info *info;
+		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			free(info->id);
+			free(info->uri);
+			free(info->requires_id);
+			free(info->filter_str);
+		}
+		hashmap_clear_and_free(&toc, struct remote_bundle_info, ent);
+	}
 	free(bundle_uri);
 	return ret;
 }
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch'
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (18 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch' Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents Ævar Arnfjörð Bjarmason
                       ` (16 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

When a repository uses an object filter for partial clone, the 'git
bundle fetch' command should try to download bundles that match that
filter.

Teach 'git bundle fetch' to take a '--filter' option and then only
consider bundles that match that filter (or lack thereof). This allows
the bundle server to advertise different sets of bundles for different
filters.

Add some verification to be sure that the bundle we downloaded actually
uses that filter. This is especially important when no filter is
requested but the downloaded bundle _does_ have a filter.
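
The matching rule described above can be sketched as a small predicate.
This is only an illustration, not code from the patch:
bundle_filter_matches() is a hypothetical helper name, and the comparison
assumes the same case-insensitive strcasecmp() check that the diff uses.

```c
#include <stddef.h>
#include <strings.h>

/*
 * Hypothetical helper (not part of the patch): returns 1 when a bundle
 * advertised with filter 'advertised' may be used for a fetch that asked
 * for filter 'requested'. NULL means "no filter" on either side.
 */
int bundle_filter_matches(const char *requested, const char *advertised)
{
	/* No filter requested: only unfiltered bundles are acceptable. */
	if (!requested)
		return advertised == NULL;

	/* A filter was requested: the bundle must carry the same filter. */
	if (!advertised)
		return 0;
	return strcasecmp(requested, advertised) == 0;
}
```

The same predicate would apply twice: once to the 'filter' value listed in
the table of contents, and once to the filter recorded in the header of the
bundle that was actually downloaded.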

RFC-TODO: add tests for the happy path.

RFC-TODO: add tests for these validations.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 255d7aa774b..711e0863a16 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -9,6 +9,7 @@
 #include "refs.h"
 #include "config.h"
 #include "packfile.h"
+#include "list-objects-filter-options.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -410,10 +411,13 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	};
 	struct remote_bundle_info *stack = NULL;
 	struct hashmap toc = { 0 };
+	const char *filter = NULL;
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
 			 N_("show progress meter")),
+		OPT_STRING(0, "filter", &filter,
+			   N_("filter-spec"), N_("only install bundles matching this filter")),
 		OPT_END()
 	};
 
@@ -453,6 +457,17 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		/* initialize stack using timestamp heuristic. */
 		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			/* Skip if filter does not match. */
+			if (!filter && info->filter_str)
+				continue;
+			if (filter &&
+			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
+				continue;
+
+			/*
+			 * Now that the filter matches, start with the
+			 * bundle with largest timestamp.
+			 */
 			if (info->timestamp > max_time || !stack) {
 				stack = info;
 				max_time = info->timestamp;
@@ -472,6 +487,7 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	while (stack) {
 		int valid = 1;
 		int bundle_fd;
+		const char *filter_str = NULL;
 		struct string_list_item *prereq;
 		struct bundle_header header = BUNDLE_HEADER_INIT;
 
@@ -487,6 +503,16 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		if (bundle_fd < 0)
 			die(_("failed to read bundle from '%s'"), stack->uri);
 
+		if (header.filter.choice)
+			filter_str = list_objects_filter_spec(&header.filter);
+
+		if (filter && (!filter_str || strcasecmp(filter, filter_str)))
+			die(_("bundle from '%s' does not match expected filter"),
+			    stack->uri);
+		if (!filter && filter_str)
+			die(_("bundle from '%s' has an unexpected filter"),
+			    stack->uri);
+
 		reprepare_packed_git(the_repository);
 		for_each_string_list_item(prereq, &header.prerequisites) {
 			struct object_info info = OBJECT_INFO_INIT;
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (19 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch' Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch' Ævar Arnfjörð Bjarmason
                       ` (15 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

When hosting bundle data, it can be helpful to distribute that data
across multiple CDNs. This might require a change in the base URI, all
the way to the domain name. If all bundles require an absolute URI in
their 'uri' value, then every push to a CDN would require altering the
table of contents to match the expected domain and exact location within
it.

Allow the table of contents to specify a relative URI for the bundles.
This allows easier distribution of bundle data.
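
As a rough illustration of the idea, a relative 'uri' value can be
resolved against the location of the table of contents. The sketch below
is a deliberately simplified stand-in and is NOT git's relative_url();
resolve_bundle_uri() is a hypothetical name, and it assumes that any value
containing "://" is already absolute and that "../" segments do not need
special handling.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Simplified illustration only; this is not git's relative_url().
 * Absolute URIs (containing "://") are returned as a copy; anything
 * else is treated as a path next to the table of contents at 'toc_uri'.
 * The caller owns the returned string. Returns NULL if allocation fails.
 */
char *resolve_bundle_uri(const char *toc_uri, const char *uri)
{
	const char *slash;
	size_t dir_len, uri_len = strlen(uri);
	char *out;

	if (strstr(uri, "://")) {
		out = malloc(uri_len + 1);
		if (out)
			memcpy(out, uri, uri_len + 1);
		return out;
	}

	/* Keep everything up to and including the last '/' of the TOC URI. */
	slash = strrchr(toc_uri, '/');
	dir_len = slash ? (size_t)(slash - toc_uri) + 1 : 0;

	out = malloc(dir_len + uri_len + 1);
	if (!out)
		return NULL;
	memcpy(out, toc_uri, dir_len);
	memcpy(out + dir_len, uri, uri_len + 1);
	return out;
}
```

With this scheme the same table of contents can be copied to several CDNs
unchanged, since each bundle location is derived from wherever the table
itself was fetched.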

RFC-TODO: An earlier change referenced relative URLs, but it was not
implemented until this change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 711e0863a16..c55d5215181 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -10,6 +10,7 @@
 #include "config.h"
 #include "packfile.h"
 #include "list-objects-filter-options.h"
+#include "remote.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -457,6 +458,8 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		/* initialize stack using timestamp heuristic. */
 		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			char *old_uri;
+
 			/* Skip if filter does not match. */
 			if (!filter && info->filter_str)
 				continue;
@@ -464,6 +467,10 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
 				continue;
 
+			old_uri = info->uri;
+			info->uri = relative_url(bundle_uri, info->uri, NULL);
+			free(old_uri);
+
 			/*
 			 * Now that the filter matches, start with the
 			 * bundle with largest timestamp.
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch'
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (20 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 23/36] clone: add --bundle-uri option Ævar Arnfjörð Bjarmason
                       ` (14 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

Future changes will integrate 'git bundle fetch' into the 'git clone'
and 'git fetch' operations. Make it easy to fetch bundles via a helper
method.
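
For readers who want to see the shape of the subprocess call without
git's strvec API, here is a minimal sketch in plain C of how the argument
list could be assembled. build_bundle_fetch_argv() is a hypothetical name
used only for this illustration; the argument order assumed here is the
one in the diff below ("bundle", "fetch", optional "--filter=<spec>", then
the URI).

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Illustration only: assemble the arguments for a 'git bundle fetch'
 * subprocess. 'argv' must have room for 5 entries; 'filter_arg' is
 * scratch space for the "--filter=<spec>" string. Returns the argument
 * count, or -1 if the filter does not fit in the scratch buffer.
 */
int build_bundle_fetch_argv(const char *bundle_uri, const char *filter,
			    const char **argv, char *filter_arg,
			    size_t filter_arg_len)
{
	int argc = 0;

	argv[argc++] = "bundle";
	argv[argc++] = "fetch";
	if (filter) {
		int n = snprintf(filter_arg, filter_arg_len,
				 "--filter=%s", filter);
		if (n < 0 || (size_t)n >= filter_arg_len)
			return -1;
		argv[argc++] = filter_arg;
	}
	argv[argc++] = bundle_uri;
	argv[argc] = NULL;	/* NULL-terminate, as exec-style APIs expect */
	return argc;
}
```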

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle.c | 21 +++++++++++++++++++++
 bundle.h |  9 +++++++++
 2 files changed, 30 insertions(+)

diff --git a/bundle.c b/bundle.c
index 5fa41a52f11..7e88f5bc942 100644
--- a/bundle.c
+++ b/bundle.c
@@ -639,3 +639,24 @@ int unbundle(struct repository *r, struct bundle_header *header,
 		return error(_("index-pack died"));
 	return 0;
 }
+
+int fetch_bundle_uri(const char *bundle_uri,
+		     const char *filter)
+{
+	int res = 0;
+	struct strvec args = STRVEC_INIT;
+
+	strvec_pushl(&args, "bundle", "fetch", NULL);
+
+	if (filter)
+		strvec_pushf(&args, "--filter=%s", filter);
+	strvec_push(&args, bundle_uri);
+
+	if (run_command_v_opt(args.v, RUN_GIT_CMD)) {
+		warning(_("failed to download bundle from uri '%s'"), bundle_uri);
+		res = 1;
+	}
+
+	strvec_clear(&args);
+	return res;
+}
diff --git a/bundle.h b/bundle.h
index 0c052f54964..c647dec7c93 100644
--- a/bundle.h
+++ b/bundle.h
@@ -46,4 +46,13 @@ int unbundle(struct repository *r, struct bundle_header *header,
 int list_bundle_refs(struct bundle_header *header,
 		int argc, const char **argv);
 
+struct list_objects_filter_options;
+/**
+ * Fetch bundles from the given URI with the given filter.
+ *
+ * Uses 'git bundle fetch' as a subprocess.
+ */
+int fetch_bundle_uri(const char *bundle_uri,
+		     const char *filter);
+
 #endif
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 23/36] clone: add --bundle-uri option
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (21 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch' Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth Ævar Arnfjörð Bjarmason
                       ` (13 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

Cloning a remote repository is one of the most expensive operations in
Git. The server can spend a lot of CPU time generating a pack-file for
the client's request. The amount of data can clog the network for a long
time, and the Git protocol is not resumable. For users with poor network
connections or who are located far away from the origin server, this can be
especially painful.

The 'git bundle fetch' command allows users to bootstrap a repository
using a set of bundles. However, this would require them to use 'git
init' first, followed by the 'git bundle fetch', and finally add a
remote, fetch, and checkout the branch they want.

Instead, integrate this workflow directly into 'git clone' with the
'--bundle-uri' option. If the user is aware of a bundle server, then they
can tell Git to bootstrap the new repository with these bundles before
fetching the remaining objects from the origin server.

RFC-TODO: Document this option in git-clone.txt.

RFC-TODO: I added a comment about the location of this code being
necessary for the later step of auto-discovering the bundle URI from the
origin server. This is probably not actually a requirement, but rather a
pain point around how I implemented the feature. If a --bundle-uri
option is specified, but SSH is used for the clone, then the SSH
connection is left open while Git downloads bundles from another server.
This is sub-optimal and should be reconsidered when fully reviewed.

RFC-TODO: create tests for this option with a variety of URI types.

RFC-TODO: a simple end-to-end test is available at the end of the
series.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index e11f4019b87..51141c979fa 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -78,6 +78,7 @@ static int option_filter_submodules = -1;    /* unspecified */
 static int config_filter_submodules = -1;    /* unspecified */
 static struct string_list server_options = STRING_LIST_INIT_NODUP;
 static int option_remote_submodules;
+static const char *bundle_uri;
 
 static int recurse_submodules_cb(const struct option *opt,
 				 const char *arg, int unset)
@@ -161,6 +162,8 @@ static struct option builtin_clone_options[] = {
 		    N_("any cloned submodules will use their remote-tracking branch")),
 	OPT_BOOL(0, "sparse", &option_sparse_checkout,
 		    N_("initialize sparse-checkout file to include only files at root")),
+	OPT_STRING(0, "bundle-uri", &bundle_uri,
+		   N_("uri"), N_("a URI for downloading bundles before fetching from origin remote")),
 	OPT_END()
 };
 
@@ -1233,6 +1236,35 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
 
+	/*
+	 * NOTE: The bundle URI download takes place after transport_get_remote_refs()
+	 * because a later change will introduce a check for recommended features,
+	 * which might include a recommended bundle URI.
+	 */
+
+	/*
+	 * Before fetching from the remote, download and install bundle
+	 * data from the --bundle-uri option.
+	 */
+	if (bundle_uri) {
+		const char *filter = NULL;
+
+		if (filter_options.filter_spec.nr)
+			filter = expand_list_objects_filter_spec(&filter_options);
+		/*
+		 * Set the config for fetching from this bundle URI in the
+		 * future, but do it before fetch_bundle_uri() which might
+		 * un-set it (for instance, if there is no table of contents).
+		 */
+		git_config_set("fetch.bundleuri", bundle_uri);
+		if (filter)
+			git_config_set("fetch.bundlefilter", filter);
+
+		if (fetch_bundle_uri(bundle_uri, filter))
+			warning(_("failed to fetch objects from bundle URI '%s'"),
+				bundle_uri);
+	}
+
 	if (refs)
 		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (22 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 23/36] clone: add --bundle-uri option Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new Ævar Arnfjörð Bjarmason
                       ` (12 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

The previous change added the '--bundle-uri' option, but did not check
if the --depth parameter was included. Since bundles are not compatible
with shallow clones, provide an error message to the user who is
attempting this combination.

I am leaving this as its own change, separate from the one that
implements '--bundle-uri', because this is more of an advisory for the
user. There is nothing wrong with bootstrapping with bundles and then
fetching a shallow clone. However, that is likely going to involve too
much work for the client _and_ the server. The client will download all
of this bundle information containing the full history of the
repository only to ignore most of it. The server will get a shallow
fetch request, but with a list of haves that might cause a more painful
computation of that shallow pack-file.

RFC-TODO: add a test case for this error message.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 51141c979fa..af64bd273b7 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -926,6 +926,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		option_no_checkout = 1;
 	}
 
+	if (bundle_uri) {
+		if (deepen)
+			die(_("--bundle-uri is incompatible with --depth, --shallow-since, and --shallow-exclude"));
+	}
+
 	repo_name = argv[0];
 
 	path = get_repo_path(repo_name, &is_bundle);
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (23 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data Ævar Arnfjörð Bjarmason
                       ` (11 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

If a bundle server is providing a table of contents with timestamps for
the bundles, then we can store the most-recent timestamp and use that as
a test if the bundle server has any new information. Teach 'git bundle
fetch' to store the timestamp in the config file as
'fetch.bundleTimestamp' and compare the existing value to the
most-recent timestamp in the bundle server's table of contents. If the
new timestamp is at most the stored timestamp, then exit early (with
success). If the new timestamp is greater than the stored timestamp,
then continue with the normal fetch logic of downloading the most-recent
bundle until all missing objects are satisfied. Store that new timestamp
in the config for next time.
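
The comparison described above boils down to two small rules, sketched
here under a few assumptions: uint64_t stands in for git's timestamp_t, a
stored value of 0 means "nothing stored yet", and both function names are
hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/*
 * Illustration only. 'advertised_max' is the largest timestamp found in
 * the table of contents; 'stored' is the value saved by a previous run
 * (0 if there is none). Returns 1 if bundles should be downloaded.
 */
int bundle_list_is_new(uint64_t advertised_max, uint64_t stored)
{
	/* "At most the stored timestamp" means there is nothing new. */
	return advertised_max > stored;
}

/* Value to remember after the run; the stored time never moves backwards. */
uint64_t next_stored_timestamp(uint64_t advertised_max, uint64_t stored)
{
	return bundle_list_is_new(advertised_max, stored) ? advertised_max
							   : stored;
}
```

Note that under these rules a table of contents whose newest timestamp
equals the stored one is treated as already seen.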

RFC-TODO: Update documentation of 'git bundle fetch' to match this new
behavior.

RFC-TODO: Add 'fetch.bundleTimestamp' to Documentation/config/

RFC-TODO @ Ævar: I replaced the git_config_get_timestamp() with
parse_expiry_date(), but as noted perhaps we want *nix epochs here
only; in that case we could add an "isdigit" loop here.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index c55d5215181..4c51b014f0b 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -413,6 +413,10 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	struct remote_bundle_info *stack = NULL;
 	struct hashmap toc = { 0 };
 	const char *filter = NULL;
+	const char *timestamp_key = "fetch.bundletimestamp";
+	timestamp_t stored_time = 0;
+	timestamp_t max_time = 0;
+	const char *value;
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
@@ -428,6 +432,17 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	if (!startup_info->have_repository)
 		die(_("'fetch' requires a repository"));
 
+	/*
+	 * TODO: Is it important re
+	 * https://lore.kernel.org/git/220311.86pmmshahy.gmgdl@evledraar.gmail.com/
+	 * that we don't accept "2.days.ago" etc., and only *nix
+	 * epochs?
+	 */
+	if (!git_config_get_string_tmp(timestamp_key, &value) &&
+	    parse_expiry_date(value, &stored_time))
+		return error(_("'%s' for '%s' is not a valid timestamp"),
+			     value, timestamp_key);
+
 	/*
 	 * Step 1: determine protocol for uri, and download contents to
 	 * a temporary location.
@@ -449,7 +464,6 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	} else {
 		struct hashmap_iter iter;
 		struct remote_bundle_info *info;
-		timestamp_t max_time = 0;
 
 		/* populate a hashtable with all relevant bundles. */
 		used_hashmap = 1;
@@ -480,6 +494,13 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 				max_time = info->timestamp;
 			}
 		}
+
+		trace2_data_intmax("bundle", the_repository, "max_time", max_time);
+		trace2_data_intmax("bundle", the_repository, "stored_time", stored_time);
+
+		/* Skip fetching bundles if data isn't new enough. */
+		if (max_time <= stored_time)
+			goto cleanup;
 	}
 
 	/*
@@ -567,6 +588,14 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		stack = stack->stack_next;
 	}
 
+	if (max_time) {
+		struct strbuf tstr = STRBUF_INIT;
+		strbuf_addf(&tstr, "%"PRIuMAX"", max_time);
+		git_config_set_gently(timestamp_key, tstr.buf);
+		strbuf_release(&tstr);
+	}
+
+cleanup:
 	if (used_hashmap) {
 		struct hashmap_iter iter;
 		struct remote_bundle_info *info;
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (24 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 27/36] protocol-caps: implement cap_features() Ævar Arnfjörð Bjarmason
                       ` (10 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

If a user cloned using a bundle URI, then they might want to re-use it
to download new bundles during 'git fetch' before fetching the remaining
objects from the origin server. Use the 'fetch.bundleURI' config as the
indicator for whether this extra step should happen.

Do not fetch bundles if --dry-run is specified.

RFC-TODO: add tests.

RFC-TODO: update Documentation/git-fetch.txt

RFC-TODO: update Documentation/config/

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/fetch.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index e3791f09ed5..ac684bdf252 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "shallow.h"
 #include "worktree.h"
+#include "bundle.h"
 
 #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
 
@@ -2180,6 +2181,22 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	/* FETCH_HEAD never gets updated in --dry-run mode */
 	if (dry_run)
 		write_fetch_head = 0;
+	else {
+		/*
+		 * --dry-run mode skips bundle downloads, which might
+		 * update some refs.
+		 */
+		char *bundle_uri = NULL;
+		git_config_get_string("fetch.bundleuri", &bundle_uri);
+
+		if (bundle_uri) {
+			char *filter = NULL;
+			git_config_get_string("fetch.bundlefilter", &filter);
+			fetch_bundle_uri(bundle_uri, filter);
+			free(bundle_uri);
+			free(filter);
+		}
+	}
 
 	if (all) {
 		if (argc == 1)
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 27/36] protocol-caps: implement cap_features()
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (25 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability Ævar Arnfjörð Bjarmason
                       ` (9 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

The 'features' capability sends a list of "key=value" pairs from the
server. These are a set of fixed config values, all prefixed with
"serve." to avoid conflicting with other config values of similar names.

The initial set chosen here is:

* bundleURI: Allow advertising one or more bundle servers by URI.

* partialCloneFilter: Advertise one or more recommended partial clone
  filters.

* sparseCheckout: Advertise that this repository recommends using the
  sparse-checkout feature in cone mode.

The client will have the choice to enable these features.
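
To make the wire format concrete, here is a small sketch of how one
"serve.*" config entry maps to one "key=value" line, with the "serve."
prefix dropped as in the diff below. format_feature_line() is a
hypothetical helper written for this illustration; it does not use git's
config or pkt-line APIs.

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustration only: turn a server config entry such as
 * "serve.bundleuri" = "<uri>" into the line "bundleuri=<uri>".
 * Returns 0 on success, or -1 if the key lacks the "serve." prefix or
 * the line does not fit in 'out'.
 */
int format_feature_line(char *out, size_t out_len,
			const char *config_key, const char *value)
{
	static const char prefix[] = "serve.";
	size_t prefix_len = sizeof(prefix) - 1;
	int n;

	if (strncmp(config_key, prefix, prefix_len))
		return -1;

	n = snprintf(out, out_len, "%s=%s", config_key + prefix_len, value);
	if (n < 0 || (size_t)n >= out_len)
		return -1;
	return 0;
}
```

A multi-valued key, for example several bundle servers, would simply
produce one such line per value.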

RFC-TODO: Create Documentation/config/serve.txt

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 protocol-caps.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++
 protocol-caps.h |  1 +
 2 files changed, 67 insertions(+)

diff --git a/protocol-caps.c b/protocol-caps.c
index bbde91810ac..88b01c4133e 100644
--- a/protocol-caps.c
+++ b/protocol-caps.c
@@ -8,6 +8,7 @@
 #include "object-store.h"
 #include "string-list.h"
 #include "strbuf.h"
+#include "config.h"
 
 struct requested_info {
 	unsigned size : 1;
@@ -111,3 +112,68 @@ int cap_object_info(struct repository *r, struct packet_reader *request)
 
 	return 0;
 }
+
+static void send_lines(struct repository *r, struct packet_writer *writer,
+		       struct string_list *str_list)
+{
+	struct string_list_item *item;
+
+	if (!str_list->nr)
+		return;
+
+	for_each_string_list_item (item, str_list) {
+		packet_writer_write(writer, "%s", item->string);
+	}
+}
+
+int cap_features(struct repository *r, struct packet_reader *request)
+{
+	struct packet_writer writer;
+	struct string_list feature_list = STRING_LIST_INIT_DUP;
+	int i = 0;
+	const char *keys[] = {
+		"bundleuri",
+		"partialclonefilter",
+		"sparsecheckout",
+		NULL
+	};
+	struct strbuf serve_feature = STRBUF_INIT;
+	struct strbuf key_equals_value = STRBUF_INIT;
+	size_t len;
+	strbuf_add(&serve_feature, "serve.", 6);
+	len = serve_feature.len;
+
+	packet_writer_init(&writer, 1);
+
+	while (keys[i]) {
+		struct string_list_item *item;
+		const struct string_list *values = NULL;
+		strbuf_setlen(&serve_feature, len);
+		strbuf_addstr(&serve_feature, keys[i]);
+
+		values = repo_config_get_value_multi(r, serve_feature.buf);
+
+		if (values) {
+			for_each_string_list_item(item, values) {
+				strbuf_reset(&key_equals_value);
+				strbuf_addstr(&key_equals_value, keys[i]);
+				strbuf_addch(&key_equals_value, '=');
+				strbuf_addstr(&key_equals_value, item->string);
+
+				string_list_append(&feature_list, key_equals_value.buf);
+			}
+		}
+
+		i++;
+	}
+	strbuf_release(&serve_feature);
+	strbuf_release(&key_equals_value);
+
+	send_lines(r, &writer, &feature_list);
+
+	string_list_clear(&feature_list, 1);
+
+	packet_flush(1);
+
+	return 0;
+}
diff --git a/protocol-caps.h b/protocol-caps.h
index 15c4550360c..681d2106d88 100644
--- a/protocol-caps.h
+++ b/protocol-caps.h
@@ -4,5 +4,6 @@
 struct repository;
 struct packet_reader;
 int cap_object_info(struct repository *r, struct packet_reader *request);
+int cap_features(struct repository *r, struct packet_reader *request);
 
 #endif /* PROTOCOL_CAPS_H */
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (26 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 27/36] protocol-caps: implement cap_features() Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 29/36] serve: advertise 'features' when config exists Ævar Arnfjörð Bjarmason
                       ` (8 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

The previous change implemented cap_features() to return a set of
'key=value' pairs when this capability is run. Add the capability to our
list of understood capabilities.

This change does not advertise the capability. When deploying a new
capability to a distributed fleet of Git servers, it is important to
delay advertising the capability until all nodes understand it. A later
change will advertise it when appropriate, but as a separate change to
simplify this transition.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 serve.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/serve.c b/serve.c
index f3e0203d2c6..3368d16efda 100644
--- a/serve.c
+++ b/serve.c
@@ -19,6 +19,12 @@ static int always_advertise(struct repository *r,
 	return 1;
 }
 
+static int never_advertise(struct repository *r,
+			   struct strbuf *value)
+{
+	return 0;
+}
+
 static int agent_advertise(struct repository *r,
 			   struct strbuf *value)
 {
@@ -113,6 +119,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = ls_refs_advertise,
 		.command = ls_refs,
 	},
+	{
+		.name = "features",
+		.advertise = never_advertise,
+		.command = cap_features,
+	},
 	{
 		.name = "fetch",
 		.advertise = upload_pack_advertise,
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 29/36] serve: advertise 'features' when config exists
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (27 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 30/36] connect: implement get_recommended_features() Ævar Arnfjörð Bjarmason
                       ` (7 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

The 'features' capability allows a server to recommend some Git features
at a high level. Previous changes implemented the capability so servers
understand it, but it was never advertised.

Now, allow it to be advertised, but only when the capability will
actually _do_ something. That is, advertise if and only if a config
value exists with the prefix "serve.". This avoids unnecessary round
trips for an empty result.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
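As a standalone illustration of the advertise rule described above, the
check boils down to a prefix scan over config keys. This is only a
sketch: `has_prefixed_key()` and the plain key array are hypothetical
stand-ins, whereas the patch itself walks the config with repo_config()
and a callback.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if any of the `nr` keys starts with
 * `prefix`, 0 otherwise. This mirrors the "advertise only if a serve.*
 * key exists" rule, but over a plain array instead of git's config
 * machinery.
 */
static int has_prefixed_key(const char *const *keys, size_t nr,
			    const char *prefix)
{
	size_t prefix_len = strlen(prefix);
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!strncmp(keys[i], prefix, prefix_len))
			return 1; /* first match is enough; stop scanning */
	}
	return 0;
}
```

Note that the trailing "." in the prefix matters: a key such as
"server.foo" must not count as a "serve." key.
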
 serve.c              | 18 +++++++++++++++---
 t/t5701-git-serve.sh |  9 +++++++++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/serve.c b/serve.c
index 3368d16efda..6237bf63d60 100644
--- a/serve.c
+++ b/serve.c
@@ -19,12 +19,24 @@ static int always_advertise(struct repository *r,
 	return 1;
 }
 
-static int never_advertise(struct repository *r,
-			   struct strbuf *value)
+static int key_serve_prefix(const char *key, const char *value, void *data)
 {
+	int *signal = data;
+	if (!strncmp(key, "serve.", 6)) {
+		*signal = 1;
+		return 1;
+	}
 	return 0;
 }
 
+static int has_serve_config(struct repository *r,
+			    struct strbuf *value)
+{
+	int signal = 0;
+	repo_config(r, key_serve_prefix, &signal);
+	return signal;
+}
+
 static int agent_advertise(struct repository *r,
 			   struct strbuf *value)
 {
@@ -121,7 +133,7 @@ static struct protocol_capability capabilities[] = {
 	},
 	{
 		.name = "features",
-		.advertise = never_advertise,
+		.advertise = has_serve_config,
 		.command = cap_features,
 	},
 	{
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 9d053f77a93..befc800593e 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -33,6 +33,15 @@ test_expect_success 'test capability advertisement' '
 	test_cmp expect actual
 '
 
+test_expect_success 'test capability advertisement with serve config' '
+	test_when_finished git config --unset serve.bundleuri &&
+	git config serve.bundleuri "file://$(pwd)" &&
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	grep features actual
+'
+
 test_expect_success 'stateless-rpc flag does not list capabilities' '
 	# Empty request
 	test-tool pkt-line pack >in <<-EOF &&
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 30/36] connect: implement get_recommended_features()
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (28 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 29/36] serve: advertise 'features' when config exists Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 31/36] transport: add connections for 'features' capability Ævar Arnfjörð Bjarmason
                       ` (6 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

From: Derrick Stolee <derrickstolee@github.com>

This method allows a client to request and parse the 'features' capability
of protocol v2. The response is expected to be a list of 'key=value'
lines, but this implementation does no checking of the lines, expecting
a later parse of the lines to be careful of the existence of that '='
character.

This change is based on an earlier patch [1] written for a similar
capability.

[1] https://lore.kernel.org/git/RFC-patch-04.13-21caf01775-20210805T150534Z-avarab@gmail.com/

Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
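Since the lines are stored unchecked, whoever parses them later has to
cope with a missing '='. A minimal sketch of such a careful split could
look like the following; `split_feature_line()` is a hypothetical helper
for illustration, not something this series adds.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "key=value" in place at the first '='.
 * On success, stores pointers to the key and value and returns 0.
 * Returns -1 (leaving `line` untouched) if there is no '=' or the key
 * is empty.
 */
static int split_feature_line(char *line, const char **key, const char **value)
{
	char *equals = strchr(line, '=');

	if (!equals || equals == line)
		return -1;

	*equals = '\0';
	*key = line;
	*value = equals + 1;
	return 0;
}
```

Splitting at the first '=' only means a value may itself contain '=',
which is likely for URIs carrying a query string.
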
 connect.c | 36 ++++++++++++++++++++++++++++++++++++
 remote.h  |  4 ++++
 2 files changed, 40 insertions(+)

diff --git a/connect.c b/connect.c
index a8fdb5255f7..1739d1f8a5f 100644
--- a/connect.c
+++ b/connect.c
@@ -591,6 +591,42 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 	return list;
 }
 
+int get_recommended_features(int fd_out, struct packet_reader *reader,
+			     struct string_list *list, int stateless_rpc)
+{
+	int line_nr = 1;
+
+	server_supports_v2("features", 1);
+
+	/* (Re-)send capabilities */
+	send_capabilities(fd_out, reader);
+
+	/* Send command */
+	packet_write_fmt(fd_out, "command=features\n");
+	packet_delim(fd_out);
+	packet_flush(fd_out);
+
+	/* Process response from server */
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		const char *line = reader->line;
+		line_nr++;
+
+		string_list_append(list, line);
+	}
+
+	if (reader->status != PACKET_READ_FLUSH)
+		return error(_("expected flush after features listing"));
+
+	/*
+	 * Might die(), but obscure enough that that's OK, e.g. in
+	 * serve.c, we'll call BUG() on its equivalent (the
+	 * PACKET_READ_RESPONSE_END check).
+	 */
+	check_stateless_delimiter(stateless_rpc, reader,
+		_("expected response end packet after features listing"));
+	return 0;
+}
+
 const char *parse_feature_value(const char *feature_list, const char *feature, int *lenp, int *offset)
 {
 	int len;
diff --git a/remote.h b/remote.h
index 571338510a8..bccb8484dbd 100644
--- a/remote.h
+++ b/remote.h
@@ -242,6 +242,10 @@ int get_remote_bundle_uri(int fd_out, struct packet_reader *reader,
 
 int resolve_remote_symref(struct ref *ref, struct ref *list);
 
+/* Used for protocol v2 in order to retrieve recommended features */
+int get_recommended_features(int fd_out, struct packet_reader *reader,
+			     struct string_list *list, int stateless_rpc);
+
 /*
  * Remove and free all but the first of any entries in the input list
  * that map the same remote reference to the same local reference.  If
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 31/36] transport: add connections for 'features' capability
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (29 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 30/36] connect: implement get_recommended_features() Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 32/36] clone: use server-recommended bundle URI Ævar Arnfjörð Bjarmason
                       ` (5 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

To allow 'git clone' to check the 'features' capability, we need to fill
in some boilerplate methods that help detect if the capability exists
and then to execute the get_recommended_features() method with the
proper context. This involves jumping through some vtables.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 transport-helper.c   | 13 +++++++++++++
 transport-internal.h |  9 +++++++++
 transport.c          | 38 ++++++++++++++++++++++++++++++++++++++
 transport.h          |  5 +++++
 4 files changed, 65 insertions(+)

diff --git a/transport-helper.c b/transport-helper.c
index 398712c76f3..782aa1f43a2 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1160,6 +1160,18 @@ static int push_refs(struct transport *transport,
 	return -1;
 }
 
+static int get_features(struct transport *transport,
+		      struct string_list *list)
+{
+	get_helper(transport);
+
+	if (process_connect(transport, 0)) {
+		do_take_over(transport);
+		return transport->vtable->get_features(transport, list);
+	}
+
+	return -1;
+}
 
 static int has_attribute(const char *attrs, const char *attr)
 {
@@ -1285,6 +1297,7 @@ static struct transport_vtable vtable = {
 	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs,
 	.push_refs	= push_refs,
+	.get_features	= get_features,
 	.connect	= connect_helper,
 	.disconnect	= release_helper
 };
diff --git a/transport-internal.h b/transport-internal.h
index 90ea749e5cf..969cb30f510 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -5,6 +5,7 @@ struct ref;
 struct transport;
 struct strvec;
 struct transport_ls_refs_options;
+struct string_list;
 
 struct transport_vtable {
 	/**
@@ -58,6 +59,14 @@ struct transport_vtable {
 	 * process involved generating new commits.
 	 **/
 	int (*push_refs)(struct transport *transport, struct ref *refs, int flags);
+
+	/**
+	 * get_features() requests a list of recommended features and
+	 * populates the given string_list with those 'key=value' pairs.
+	 */
+	int (*get_features)(struct transport *transport,
+			    struct string_list *list);
+
 	int (*connect)(struct transport *connection, const char *name,
 		       const char *executable, int fd[2]);
 
diff --git a/transport.c b/transport.c
index 7e5e1192f95..42a726dc066 100644
--- a/transport.c
+++ b/transport.c
@@ -205,6 +205,20 @@ struct git_transport_data {
 	struct oid_array shallow;
 };
 
+static int get_features(struct transport *transport,
+		      struct string_list *list)
+{
+	struct git_transport_data *data = transport->data;
+	struct packet_reader reader;
+
+	packet_reader_init(&reader, data->fd[0], NULL, 0,
+			   PACKET_READ_CHOMP_NEWLINE |
+			   PACKET_READ_GENTLE_ON_EOF);
+
+	return get_recommended_features(data->fd[1], &reader, list,
+					transport->stateless_rpc);
+}
+
 static int set_git_option(struct git_transport_options *opts,
 			  const char *name, const char *value)
 {
@@ -948,6 +962,7 @@ static struct transport_vtable taken_over_vtable = {
 	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
+	.get_features	= get_features,
 	.disconnect	= disconnect_git
 };
 
@@ -1102,6 +1117,7 @@ static struct transport_vtable builtin_smart_vtable = {
 	.get_bundle_uri = get_bundle_uri,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
+	.get_features	= get_features,
 	.connect	= connect_git,
 	.disconnect	= disconnect_git
 };
@@ -1606,6 +1622,28 @@ void transport_unlock_pack(struct transport *transport, unsigned int flags)
 		string_list_clear(&transport->pack_lockfiles, 0);
 }
 
+struct string_list *transport_remote_features(struct transport *transport)
+{
+	const struct transport_vtable *vtable = transport->vtable;
+	struct string_list *list = NULL;
+
+	if (!server_supports_v2("features", 0))
+		return NULL;
+
+	if (!vtable->get_features) {
+		warning(_("'features' not supported by this remote"));
+		return NULL;
+	}
+
+	CALLOC_ARRAY(list, 1);
+	string_list_init_dup(list);
+
+	if (vtable->get_features(transport, list))
+		warning(_("failed to get recommended features from remote"));
+
+	return list;
+}
+
 int transport_connect(struct transport *transport, const char *name,
 		      const char *exec, int fd[2])
 {
diff --git a/transport.h b/transport.h
index ed5ebcf1466..7afc02eb683 100644
--- a/transport.h
+++ b/transport.h
@@ -322,6 +322,11 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs);
  */
 void transport_unlock_pack(struct transport *transport, unsigned int flags);
 
+/**
+ * Get recommended config from remote.
+ */
+struct string_list *transport_remote_features(struct transport *transport);
+
 int transport_disconnect(struct transport *transport);
 char *transport_anonymize_url(const char *url);
 void transport_take_over(struct transport *transport,
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 32/36] clone: use server-recommended bundle URI
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (30 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 31/36] transport: add connections for 'features' capability Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 33/36] t5601: basic bundle URI test Ævar Arnfjörð Bjarmason
                       ` (4 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

After the ref advertisement initializes the connection between the
client and the remote, use the 'features' capability (if available) to
get a list of recommended features from the server.

In this change, we only update the bundle URI setting. The bundles are
downloaded immediately afterwards if the bundle URI becomes non-null.

RFC-TODO: don't overwrite a given --bundle-uri option.
RFC-TODO: implement the other capabilities.
RFC-TODO: guard this entire request behind opt-in config.
RFC-TODO: prevent using an HTTP(S) URI when in an SSH clone.
RFC-TODO: prevent using a local path for the bundle URI.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index af64bd273b7..81c14a9f5d7 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -890,6 +890,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
 	int filter_submodules = 0;
+	struct string_list *feature_list = NULL;
 
 	struct transport_ls_refs_options transport_ls_refs_options =
 		TRANSPORT_LS_REFS_OPTIONS_INIT;
@@ -1241,11 +1242,23 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
 
-	/*
-	 * NOTE: The bundle URI download takes place after transport_get_remote_refs()
-	 * because a later change will introduce a check for recommended features,
-	 * which might include a recommended bundle URI.
-	 */
+	feature_list = transport_remote_features(transport);
+
+	if (feature_list) {
+		struct string_list_item *item;
+		for_each_string_list_item(item, feature_list) {
+			char *value;
+			char *equals = strchr(item->string, '=');
+
+			if (!equals)
+				continue;
+			*equals = '\0';
+			value = equals + 1;
+
+			if (!strcmp(item->string, "bundleuri"))
+				bundle_uri = value;
+		}
+	}
 
 	/*
 	 * Before fetching from the remote, download and install bundle
@@ -1265,7 +1278,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		if (filter)
 			git_config_set("fetch.bundlefilter", filter);
 
-		if (!fetch_bundle_uri(bundle_uri, filter))
+		if (fetch_bundle_uri(bundle_uri, filter))
 			warning(_("failed to fetch objects from bundle URI '%s'"),
 				bundle_uri);
 	}
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 33/36] t5601: basic bundle URI test
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (31 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 32/36] clone: use server-recommended bundle URI Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs) Ævar Arnfjörð Bjarmason
                       ` (3 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long

From: Derrick Stolee <derrickstolee@github.com>

This test demonstrates an end-to-end form of the bundle URI feature
given by an HTTP server advertising the 'features' capability with a
bundle URI that is a bundle file on that same HTTP server. We verify
that we unbundled a bundle, which could only have happened if we
successfully downloaded that file.

RFC-TODO: Create similar tests throughout the series, including
examples with table of contents and partial clones.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t5601-clone.sh | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 4a61f2c901e..e6119f78aea 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -767,6 +767,21 @@ test_expect_success 'reject cloning shallow repository using HTTP' '
 	git clone --no-reject-shallow $HTTPD_URL/smart/repo.git repo
 '
 
+test_expect_success 'auto-discover bundle URI from HTTP clone' '
+	test_when_finished rm -rf repo "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" &&
+	git -C src bundle create "$HTTPD_DOCUMENT_ROOT_PATH/everything.bundle" --all &&
+	git clone --bare --no-local src "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" config \
+		serve.bundleuri $HTTPD_URL/everything.bundle &&
+	GIT_TRACE2_EVENT="$(pwd)/trace.txt" \
+		git -c protocol.version=2 clone \
+		$HTTPD_URL/smart/repo2.git repo &&
+	cat >pat <<-\EOF &&
+	"event":"child_start".*"argv":\["git","bundle","unbundle"
+	EOF
+	grep -f pat trace.txt
+'
+
 # DO NOT add non-httpd-specific tests here, because the last part of this
 # test script is only executed when httpd is available and enabled.
 
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs)
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (32 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 33/36] t5601: basic bundle URI test Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 35/36] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
                       ` (2 subsequent siblings)
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
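A rough sketch of how a client might apply the bundle-uri-line grammar
documented in this patch, i.e. a URI followed by SP-separated
"key" or "key=val" tokens. `check_bundle_uri_line()` and
`is_feature_char()` are hypothetical names, and the sketch assumes at
most one "=val" per key and an already-stripped trailing LF; it is not
the client implementation from this series.

```c
#include <stddef.h>
#include <string.h>

/* Printable ASCII other than SP and "=", per the grammar's key/val rule. */
static int is_feature_char(unsigned char c)
{
	return c > 0x20 && c < 0x7f && c != '=';
}

/*
 * Hypothetical validator for one bundle-uri-line (trailing LF already
 * stripped). Returns the number of feature tokens after the URI, or -1
 * if the line does not conform and should be discarded.
 */
static int check_bundle_uri_line(const char *line)
{
	const char *p = line;
	int nr_features = 0;

	/* The URI is everything up to the first SP and must be non-empty. */
	while (*p && *p != ' ')
		p++;
	if (p == line)
		return -1;

	while (*p) {
		const char *start;

		p++; /* skip the single SP separator */

		start = p;
		while (is_feature_char((unsigned char)*p))
			p++;
		if (p == start)
			return -1; /* empty key, e.g. doubled or trailing SP */

		if (*p == '=') {
			start = ++p;
			while (is_feature_char((unsigned char)*p))
				p++;
			if (p == start)
				return -1; /* "key=" with no value */
		}

		if (*p && *p != ' ')
			return -1; /* stray byte, e.g. a second '=' */
		nr_features++;
	}
	return nr_features;
}
```

A caller would discard any line for which this returns -1 (optionally
warning the user) and ignore feature keys it does not recognize.
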
 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 1 file changed, 209 insertions(+)

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 8a877d27e23..3ea96add398 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -566,3 +566,212 @@ and associated requested information, each separated by a single space.
 	attr = "size"
 
 	obj-info = obj-id SP obj-size
+
+bundle-uri
+~~~~~~~~~~
+
+If the 'bundle-uri' capability is advertised, the server supports the
+`bundle-uri` command.
+
+The capability is currently advertised with no value (i.e. not
+"bundle-uri=somevalue"). A value may be added in the future for
+supporting command-wide extensions. Clients MUST ignore any unknown
+capability values and proceed with the `bundle-uri` dialog they
+support.
+
+The 'bundle-uri' command is intended to be issued before `fetch` to
+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
+inform the subsequent `fetch` command.
+
+The client MAY issue `bundle-uri` before or after any other valid
+command. To be useful to clients it's expected that it'll be issued
+after an `ls-refs` and before `fetch`, but it MAY be issued at any
+time in the dialog.
+
+DISCUSSION of bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The intent of the feature is to optimize for server resource
+consumption in the common case, by turning the fetch of a very
+large PACK during linkgit:git-clone[1] into a smaller incremental
+fetch.
+
+It also allows servers to achieve better caching in combination with
+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
+
+New clones or fetches become a more predictable and common
+negotiation against the tips of recently produced *.bundle file(s).
+Servers might even pre-generate the results of such negotiations for
+the `uploadpack.packObjectsHook` as new pushes come in.
+
+I.e. the server would anticipate that fresh clones will download a
+known bundle, followed by catching up to the current state of the
+repository using ref tips found in that bundle (or bundles).
+
+PROTOCOL for bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A `bundle-uri` request takes no arguments, and as noted above does not
+currently advertise a capability value. Both may be added in the
+future.
+
+When the client issues a `command=bundle-uri` the response is a list
+of URIs the server would like the client to fetch out-of-band before
+proceeding with the `fetch` request in this format:
+
+	output = bundle-uri-line
+		 bundle-uri-line* flush-pkt
+
+	bundle-uri-line = PKT-LINE(bundle-uri)
+			  *(SP bundle-feature-key *(=bundle-feature-val))
+			  LF
+
+	bundle-uri = A URI such as a https://, ssh:// etc. URI
+
+	bundle-feature-key = Any printable ASCII characters except SP or "="
+	bundle-feature-val = Any printable ASCII characters except SP or "="
+
+No `bundle-feature-key`=`bundle-feature-val` fields are currently
+defined. See the discussion of features below.
+
+Clients are still expected to fully parse the line according to the
+above format. Lines that do not conform to the format SHOULD be
+discarded. The user MAY be warned in such a case.
+
+bundle-uri CLIENT AND SERVER EXPECTATIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+".bundle" FORMAT
+++++++++++++++++
+
+The advertised bundle(s) MUST be in a format that "git bundle verify"
+would accept. I.e. they MUST contain one or more reference tips for
+use by the client, MUST indicate prerequisites (if any) with standard
+"-" prefixes, and MUST indicate their "object-format", if
+applicable. Create "*.bundle" files with "git bundle create".
+
+bundle-uri CLIENT ERROR RECOVERY
+++++++++++++++++++++++++++++++++
+
+A client MUST above all gracefully degrade on errors, whether that
+error is because of bad or missing data in the bundle URI(s), because
+that client is too dumb to e.g. understand and fully parse out bundle
+headers and their prerequisite relationships, or something else.
+
+Server operators should feel confident in turning on "bundle-uri",
+without worrying that clones or fetches will run into hard failures
+if e.g. their CDN goes down. Even if the server's bundle(s) are
+incomplete or bad in some way, the client should still end up with a
+functioning repository, just as if it had chosen not to use this
+protocol extension.
+
+All subsequent discussion on client and server interaction MUST keep
+this in mind.
+
+bundle-uri SERVER TO CLIENT
++++++++++++++++++++++++++++
+
+The ordering of the returned bundle URIs is not significant. Clients
+MUST parse their headers to discover their contained OIDs and
+prerequisites. A client MUST consider the content of the bundle(s)
+themselves and their header as the ultimate source of truth.
+
+A server MAY even return bundle(s) that don't have any direct
+relationship to the repository being cloned (either through accident,
+or intentional "clever" configuration), and expect a client to sort
+out what data they'd like from the bundle(s), if any.
+
+bundle-uri CLIENT TO SERVER
++++++++++++++++++++++++++++
+
+The client SHOULD provide reference tips found in the bundle header(s)
+as 'have' lines in any subsequent `fetch` request. A client MAY also
+ignore the bundle(s) entirely if doing so is deemed worse for some
+reason, e.g. if the bundles can't be downloaded, it doesn't like the
+tips it finds etc.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
+of the bundle(s) the client finds that the ref tips it wants can be
+retrieved in their entirety from the advertised bundle(s), it MAY
+disconnect. The results of such a 'clone' or 'fetch' should be
+indistinguishable from the state attained without using bundle-uri.
+
+EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
+++++++++++++++++++++++++++++++++++++++++++++++
+
+A client MAY perform an early disconnect while still downloading the
+bundle(s) (having streamed and parsed their headers). In such a case
+the client MUST gracefully recover from any errors related to
+finishing the download and validation of the bundle(s).
+
+I.e. a client might need to re-connect and issue a 'fetch' command,
+and possibly fall back to not making use of 'bundle-uri' at all.
+
+This "MAY" behavior is specified as such (and not a "SHOULD") on the
+assumption that a server advertising bundle URIs is more likely than
+not to be serving up a relatively large repository, and to be pointing
+to URIs that have a good chance of being in working order. A client
+MAY e.g. look at the payload size of the bundles as a heuristic to see
+if an early disconnect is worth it, should falling back on a full
+"fetch" dialog be necessary.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+A client SHOULD commence a negotiation of a PACK from the server via
+the "fetch" command using the OID tips found in advertised bundles,
+even if it's still in the process of downloading those bundle(s).
+
+This allows for aggressive early disconnects from any interactive
+server dialog. The client blindly trusts that the advertised OID tips
+are relevant, and issues them as 'have' lines. It then requests any
+tips it would like (usually from the "ls-refs" advertisement) via
+'want' lines. The server will then compute a (hopefully small) PACK
+with the expected difference between the tips from the bundle(s) and
+the data requested.
+
+The only connection the client then needs to keep active is to the
+concurrently downloading static bundle(s). When those and the
+incremental PACK are retrieved they should be inflated and
+validated. Any errors at this point should be gracefully recovered
+from, see above.
+
+bundle-uri PROTOCOL FEATURES
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above no `bundle-feature-key`=`bundle-feature-val` fields
+are currently defined.
+
+They are intended for future per-URI metadata which older clients MUST
+ignore and gracefully degrade on. Any fields they do recognize they
+MAY also ignore.
+
+Any backwards-incompatible addition of per-URI key-value pairs will be
+guarded by a new value or values in the 'bundle-uri' capability
+advertisement itself, and/or by new future `bundle-uri` request
+arguments.
+
+While no per-URI key-value pairs are currently supported, they're
+intended to support future features such as:
+
+ * Add a "hash=<val>" or "size=<bytes>" to advertise the expected hash
+   or size of the bundle file.
+
+ * Advertise that one or more bundle files are the same (to e.g. have
+   clients round-robin or otherwise choose one of N possible files).
+
+ * An "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For
+   expressing the common case of a bundle with one tip and no
+   prerequisites, or one tip and one prerequisite.
++
+This would allow for optimizing the common case of servers who'd like
+to provide one "big bundle" containing only their "main" branch,
+and/or incremental updates thereof.
++
+A client receiving such a response MAY assume that they can skip
+retrieving the header from a bundle at the indicated URI, and thus
+save themselves and the server(s) the request(s) needed to inspect the
+headers of that bundle or bundles.
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 35/36] bundle-uri docs: add design notes
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (33 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs) Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-18 17:23     ` [RFC PATCH v2 36/36] docs: document bundle URI standard Ævar Arnfjörð Bjarmason
  2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

Add a design doc for the bundle-uri protocol extension to go along
with the packfile-uri extension added in cd8402e0fd8 (Documentation:
add Packfile URIs design doc, 2020-06-10).

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/bundle-uri.txt  | 119 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt |   5 +
 2 files changed, 124 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..5ae9a15eafe
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,119 @@
+Bundle URI Design Notes
+=======================
+
+Protocol
+--------
+
+See `bundle-uri` in the link:protocol-v2.html[protocol-v2]
+documentation for a discussion of the bundle-uri command, and the
+expectations of clients and servers.
+
+This document is a more general discussion of how the `bundle-uri`
+command fits in with the rest of the git ecosystem, its design goals
+and non-goals, comparison to alternatives etc.
+
+Comparison with Packfile URIs
+-----------------------------
+
+There is a similar "Packfile URIs" facility, see the
+link:packfile-uri.html[packfile-uri] documentation for details.
+
+The Packfile URIs facility requires a much closer cooperation between
+CDN and server than the bundle URI facility.
+
+I.e. the server MUST know which objects exist in the packfile its URI
+points to, as well as its pack checksum. Failure to do so will not
+only result in a client error (the packfile hash won't match), but
+even if the client got past that it would likely end up with a corrupt
+repository with tips pointing to unreachable objects.
+
+By comparison the bundle URIs are meant to be a "dumb" solution
+friendly to e.g. having a weekly cronjob take a snapshot of a git
+repository, that snapshot being uploaded to a network of FTP mirrors
+(which may be inconsistent or out of date).
+
+The server does not need to know what state the side-channel download
+is at, because the client will first validate it, and then optionally
+negotiate with the server using what it discovers there.
+
+Using the local `transfer.injectBundleURI` configuration variable (see
+linkgit:git-config[1]) the `bundle-uri` mechanism doesn't even need
+the server to support it.
+
+Security
+--------
+
+The omission of something equivalent to the packfile <OID> in the
+Packfile URIs protocol is intentional, as having it would require
+closer server and CDN cooperation than some server operators are
+comfortable with.
+
+Furthermore, it is not needed for security. The server doesn't need to
+trust its CDN. If the CDN were to attempt to send harmful content
+to the client, the result would not validate against the ref tips the
+server provided via ls-refs.
+
+The lack of such a hash does leave room for a malicious CDN
+operator to be annoying, however. E.g. they could inject irrelevant
+objects into the bundles, which would enlarge the downloaded
+repository until a "gc" would eventually throw them away.
+
+In practice the lack of a hash is considered to be a non-issue. Anyone
+concerned about such security problems between their server and their
+CDN is going to be pointing to a "https" URL under their control. For
+a client the "threat" is the same as without bundle-uri, i.e. a server
+is free to be annoying today and send you garbage in the PACK that you
+won't need.
+
+Security issues peculiar to bundle-uri
+--------------------------------------
+
+Both packfile-uri and bundle-uri use the `fetch.uriProtocols`
+configuration variable (see linkgit:git-config[1]) to configure which
+protocols they support.
+
+By default this is set to "http,https" for both, but bundle-uri
+supports adding "file" to that list. The server can thus point to
+"file://" URIs it expects the client to have access to.
+
+This is primarily intended for use with the `transfer.injectBundleURI`
+mechanism, but can also be useful e.g. in a centralized environment
+where a server might point to a "file:///mnt/bundles/big-repo.bdl" it
+knows to be mounted on the local machine (e.g. a racked server) in
+its "bundle-uri" response.
+
+The client can then add "file" to the `fetch.uriProtocols` list to
+obey such responses. That does mean that a malicious server can point
+to any arbitrary file on the local machine. The threat of this is
+considered minimal, since anyone adding `file` to `fetch.uriProtocols`
+likely knows what they're doing and controls both ends, and the worst
+such a server can do is make curl(1) pipe garbage into "index-pack"
+(which will likely promptly die on the non-PACK-file).
+
+Security comparison with packfile-uri
+-------------------------------------
+
+The initial implementation of packfile-uri needed special adjustments to
+run "git fsck" on incoming .gitmodules files; this was to deal with a
+general security issue in git, see CVE-2018-17456.
+
+The current packfile-uri mechanism requires special handling around
+"fsck" to do such cross-PACK fsck's; this is because it first indexes
+the "incremental" PACK, and then any PACK(s) provided via
+packfile-uri, before finally doing a full connectivity check.
+
+This is in effect doing the fsck one might do via "clone" and "fetch" in
+reverse, or the equivalent of starting with the incremental "fetch",
+followed by the "clone".
+
+Since the packfile-uri mechanism can result in the .gitmodules blob
+referenced by such a "fetch" being in the pack for the "clone", the
+fetch-pack process needs to keep state between the indexing of
+multiple packs, to remember to fsck the blob (via the "clone") later
+after seeing it in a tree (from the "fetch").
+
+There are no known security issues with the way packfile-uri does
+this, but since bundle-uri effectively emulates what a client which
+doesn't support either "bundle-uri" or "packfile-uri" would do on
+clone/fetch, any future security issues peculiar to the packfile-uri
+approach are unlikely to be shared by it.
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 3ea96add398..3a51492049f 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -775,3 +775,8 @@ A client receiving such a response MAY assume that they can skip
 retrieving the header from a bundle at the indicated URI, and thus
 save themselves and the server(s) the request(s) needed to inspect the
 headers of that bundle or bundles.
+
+bundle-uri SEE ALSO
+^^^^^^^^^^^^^^^^^^^
+
+See the link:bundle-uri.html[Bundle URI Design Notes] for more.
-- 
2.36.0.rc2.902.g60576bbc845



* [RFC PATCH v2 36/36] docs: document bundle URI standard
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (34 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 35/36] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
@ 2022-04-18 17:23     ` Ævar Arnfjörð Bjarmason
  2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
  36 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-18 17:23 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, Jonathan Nieder,
	Albert Cui, Robin H . Johnson, Teng Long,
	Ævar Arnfjörð Bjarmason

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/bundle-uri-TOC.txt | 404 +++++++++++++++++++++
 1 file changed, 404 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri-TOC.txt

diff --git a/Documentation/technical/bundle-uri-TOC.txt b/Documentation/technical/bundle-uri-TOC.txt
new file mode 100644
index 00000000000..4763449e88b
--- /dev/null
+++ b/Documentation/technical/bundle-uri-TOC.txt
@@ -0,0 +1,404 @@
+Bundle URIs
+===========
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reduce the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access.
+
+Assuming a `200 OK` response from the server, the content at the URL is
+expected to be of one of two forms:
+
+1. Bundle: A Git bundle file of version 2 or higher.
+
+2. Table of Contents: A plain-text file that is parsable using Git's
+   config file parser. This file describes one or more bundles that are
+   accessible from other URIs.
+
+Any other data provided by the server is considered erroneous.
+
+Table of Contents Format
+------------------------
+
+If the content at a bundle URI is not a bundle, then it is expected to be
+a plaintext file that is parseable using Git's config parser. This file
+can contain any list of key/value pairs, but only a fixed set will be
+considered by Git.
+
+bundle.tableOfContents.version::
+	This value provides a version number for the table of contents. If
+	a future Git change enables a feature that needs the Git client to
+	react to a new key in the table of contents file, then this version
+	will increment. The only current version number is 1, and if any
+	other value is specified then Git will fail to use this file.
+
+bundle.tableOfContents.forFetch::
+	This boolean value is a signal to the Git client that the bundle
+	server has designed its bundle organization to assist `git fetch`
+	commands in addition to `git clone` commands. If this is missing,
+	Git should not use this table of contents for `git fetch` as it
+	may lead to excess data downloads.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
+
+bundle.<id>.uri::
+	This string value is the URI for downloading bundle `<id>`. If
+	the URI begins with a protocol (`http://` or `https://`) then the
+	URI is absolute. Otherwise, the URI is interpreted as relative to
+	the URI used for the table of contents. If the URI begins with `/`,
+	then that relative path is relative to the domain name used for
+	the table of contents. (This use of relative paths is intended to
+	make it easier to distribute a set of bundles across a large
+	number of servers or CDNs with different domain names.)
+
+bundle.<id>.timestamp::
+	(Optional) This value is the number of seconds since Unix epoch
+	(UTC) that this bundle was created. This is used as an approximation
+	of a point in time that the bundle matches the data available at
+	the origin server.
+
+bundle.<id>.requires::
+	(Optional) This string value represents the ID of another bundle.
+	When present, the server is indicating that this bundle contains a
+	thin packfile. If the client does not have all necessary objects
+	to unbundle this packfile, then the client can download the bundle
+	with the `requires` ID and try again. (Note: it may be beneficial
+	to allow the server to specify multiple `requires` bundles.)
+
+bundle.<id>.filter::
+	(Optional) This string value represents an object filter that
+	should also appear in the header of this bundle. The server uses
+	this value to differentiate different kinds of bundles from which
+	the client can choose those that match their object filters.
+
+Here is an example table of contents:
+
+```
+[bundle "tableOfContents"]
+	version = 1
+	forFetch = true
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+	requires = 2022-02-02-1643842562
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	requires = 2022-02-02-1643842568-blobless
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```
+
+This example uses all of the keys in the specification. Suppose that the
+table of contents was found at the URI
+`https://gitbundleserver.z13.web.core.windows.net/git/git/`. Then the
+two blobless bundles have the following fully-expanded URIs:
+
+* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then they
+can specify that URI manually through a command-line option. However, a
+Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+Note: The exact details of this section are not final. This is a possible
+way that Git could auto-discover bundle URIs, but is not a committed
+direction until that feature is implemented.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. One way to implement this is to create a new
+protocol v2 capability that advertises recommended features, including
+bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with lowest latency by some exploratory checks. It is up to the
+server operator to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle URI that was advertised by
+   the remote server.
+
+2. The client downloads the file at the bundle URI. If it is a bundle, then
+   it is unbundled with the refs being stored in `refs/bundle/*`.
+
+3. If the file is instead a table of contents, then the bundles with
+   matching `filter` settings are sorted by `timestamp` (if present),
+   and the most-recent bundle is downloaded.
+
+4. If the current bundle header mentions negative commit OIDs that are not
+   in the object database, then download the `requires` bundle and try
+   again.
+
+5. After inspecting a bundle with no negative commit OIDs (or whose
+   negative OIDs are all already in the object database), unbundle all
+   of the bundles in reverse order, placing references within `refs/bundle/*`.
+
+6. The client performs a fetch negotiation with the origin server, using
+   the `refs/bundle/*` references as `have`s and the server's ref
+   advertisement as `want`s. This results in a pack-file containing the
+   remaining objects requested by the clone but not in the bundles.
+
+Note that during a clone we expect that all bundles will be required. The
+client could be extended to download all bundles in parallel, though they
+need to be unbundled in the correct order.
+
+If a table of contents is used and it contains
+`bundle.tableOfContents.forFetch = true`, then the client can store a
+config value indicating to reuse this URI for later `git fetch` commands.
+In this case, the client will also want to store the maximum timestamp of
+a downloaded bundle.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via
+a command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+table of contents (although we do _not_ want to use parallel downloads
+here). We expect that the process will end because all negative commit
+OIDs in a thin bundle are already in the object database.
+
+A further optimization is that the client can avoid downloading any
+bundles if their timestamps are not larger than the stored timestamp.
+After fetching new bundles, this local timestamp value is updated.
+
+Choices for Bundle Server Organization
+--------------------------------------
+
+With this standard, there are many options available to the bundle server
+in how it organizes Git data into bundles.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* If the server intends to only serve full clones, then the advertised URI
+  could be a bundle file without a filter that is updated at some cadence.
+
+* If the server intends to serve clones, but wants clients to choose full
+  or blobless partial clones, then the server can use a table of contents
+  that lists two non-thin bundles and the client chooses between them only
+  by the `bundle.<id>.filter` values.
+
+* If the server intends to improve clones with parallel downloads, then it
+  can use a table of contents and split the repository into time intervals
+  of approximately similar-sized bundles. Using `bundle.<id>.timestamp`
+  and `bundle.<id>.requires` values helps the client decide the order to
+  unbundle the bundles.
+
+* If the server intends to serve fetches, then it can use a table of
+  contents to advertise a list of bundles that are updated regularly. The
+  most recent bundles could be generated on short intervals, such as hourly.
+  These small bundles could be merged together at some rate, such as 24
+  hourly bundles merging into a single daily bundle. At some point, it may
+  be beneficial to create a bundle that stores the majority of the history,
+  such as all data older than 30 days.
+
+These recommendations are intended only as suggestions. Each repository is
+different and every Git server has different needs. Hopefully the bundle
+URI feature and its table of contents is flexible enough to satisfy all
+needs. If not, then the format can be extended.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the table of contents found at
+that location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+  `401 Unauthorized`, or `500 Internal Server Error`).
+
+* The client receives data that is not parsable as a bundle or table of
+  contents.
+
+* The table of contents describes a directed cycle in the
+  `bundle.<id>.requires` links.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the negative commit OIDs
+  are not in the object database and there are no more
+  `bundle.<id>.requires` links to follow.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through the `bundle.tableOfContents.forFetch = true`
+  value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features for full review:
+
+1. Update the `git bundle create` command to take a `--filter` option,
+   allowing bundles to store packfiles restricted to an object filter.
+   This is necessary for using bundle URIs to benefit partial clones.
+
+2. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include the full understanding of a table of contents, but
+   will not integrate with `git fetch` or allow the server to advertise
+   URIs.
+
+3. Integrate bundle URIs into `git fetch`, triggered by config values that
+   are set during `git clone` if the server indicates that the bundle
+   strategy works for fetches.
+
+4. Create a new "recommended features" capability in protocol v2 where the
+   server can recommend features such as bundle URIs, partial clone, and
+   sparse-checkout. These features will be extremely limited in scope and
+   blocked by opt-in config options. The design for this portion could be
+   replaced by a "bundle-uri" capability that only advertises bundle URIs
+   and no other information.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to check out that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET /gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
-- 
2.36.0.rc2.902.g60576bbc845



* Re: [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function
  2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
@ 2022-04-21 17:26       ` Derrick Stolee
  0 siblings, 0 replies; 77+ messages in thread
From: Derrick Stolee @ 2022-04-21 17:26 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jonathan Tan, Jonathan Nieder, Albert Cui,
	Robin H . Johnson, Teng Long

On 4/18/2022 1:23 PM, Ævar Arnfjörð Bjarmason wrote:
> Add a path_match_flags() function and have the two sets of
> starts_with_dot_{,dot_}slash() functions added in
> 63e95beb085 (submodule: port resolve_relative_url from shell to C,
> 2016-04-15) and a2b26ffb1a8 (fsck: convert gitmodules url to URL
> passed to curl, 2020-04-18) be thin wrappers for it.
> 
> As the latter of those notes the fsck version was copied from the
> initial builtin/submodule--helper.c version.
> 
> Since the code added in a2b26ffb1a8 was doing really doing the same as

s/doing really doing/really doing/

> win32_is_dir_sep() added in 1cadad6f658 (git clone <url>
> C:\cygwin\home\USER\repo' is working (again), 2018-12-15) let's move
> the latter to git-compat-util.h is a is_xplatform_dir_sep(). We can
> then call either it or the platform-specific is_dir_sep() from this
> new function.
> 
> Let's likewise change code in various other places that was hardcoding
> checks for "'/' || '\\'" with the new is_xplatform_dir_sep(). As can
> be seen in those callers some of them still concern themselves with
> ':' (Mac OS classic?), but let's leave the question of whether that
> should be consolidated for some other time.

This feels like it could be its own change before the refactor
of the starts_with_dot_{,dot_}slash() functions. The diff is pretty
big and all over the place.

If you start with the addition of is_xplatform_dir_sep() (and maybe
the change of how is_dir_sep() is created) then the rest of the
change is more focused.
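
As a rough sketch of that split (not taken from the patch: the `_sketch`
names are made up here, and the only behavior assumed is the
"'/' || '\\'" check the commit message describes), the first commit could
add just the helper, with the later refactor building on it:

```c
/*
 * Hypothetical stand-in for is_xplatform_dir_sep(): accept both
 * separators regardless of the build platform, mirroring the
 * hardcoded "'/' || '\\'" checks mentioned in the commit message.
 */
static int is_xplatform_dir_sep_sketch(int c)
{
	return c == '/' || c == '\\';
}

/* Hypothetical cross-platform variant of starts_with_dot_slash(). */
static int starts_with_dot_slash_xplatform_sketch(const char *path)
{
	/* Short-circuiting keeps us from reading past the NUL terminator. */
	return path[0] == '.' &&
	       is_xplatform_dir_sep_sketch((unsigned char)path[1]);
}

/* Hypothetical cross-platform variant of starts_with_dot_dot_slash(). */
static int starts_with_dot_dot_slash_xplatform_sketch(const char *path)
{
	return path[0] == '.' && path[1] == '.' &&
	       is_xplatform_dir_sep_sketch((unsigned char)path[2]);
}
```

With the helper in place on its own, the replacement of the scattered
hardcoded checks and the starts_with_dot_*() refactor could each be
reviewed as a separate, smaller diff.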
 
> As we expect to make wider use of the "native" case in the future,
> define and use two starts_with_dot_{,dot_}slash_native() convenience
> wrappers. This makes the diff in builtin/submodule--helper.c much
> smaller.

> +static int starts_with_dot_slash(const char *const path)
> +{
> +	return starts_with_dot_slash_native(path);;

Double semi-colon.

> +int path_match_flags(const char *const str, const enum path_match_flags flags)

I feel like "path_match_flags()" is too generic of a name here.

Maybe something like "path_starts_with_dotslash_flags()" would be
sufficiently descriptive.

> +{
> +	const char *p = str;
> +
> +	if (flags & PATH_MATCH_NATIVE &&
> +	    flags & PATH_MATCH_XPLATFORM)
> +		BUG("path_match_flags() must get one match kind, not multiple!");
> +	else if (!(flags & PATH_MATCH_KINDS_MASK))
> +		BUG("path_match_flags() must get at least one match kind!");
> +
> +	if (flags & PATH_MATCH_STARTS_WITH_DOT_SLASH &&
> +	    flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH)
> +		BUG("path_match_flags() must get one platform kind, not multiple!");
> +	else if (!(flags & PATH_MATCH_PLATFORM_MASK))
> +		BUG("path_match_flags() must get at least one platform kind!");

These would be easier and more robust if we had a simple
popcount function. It's not worth extracting one out of
ewah/ewok.h just for this, though.
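
For what it's worth, a minimal sketch of the popcount idea, assuming the
flags are plain bitmasks (both helper names below are hypothetical;
nothing like them exists in the patch):

```c
/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned int popcount_sketch(unsigned int v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/* Nonzero iff exactly one of the bits selected by 'mask' is set in 'flags'. */
static int exactly_one_of_sketch(unsigned int flags, unsigned int mask)
{
	return popcount_sketch(flags & mask) == 1;
}
```

Each pair of BUG() checks would then collapse into a single test per
mask, at the cost of a less specific error message.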

> +	if (*p++ != '.')
> +		return 0;
> +	if (flags & PATH_MATCH_STARTS_WITH_DOT_DOT_SLASH &&
> +	    *p++ != '.')
> +		return 0;
> +
> +	if (flags & PATH_MATCH_NATIVE)
> +		return is_dir_sep(*p);
> +	else if (flags & PATH_MATCH_XPLATFORM)
> +		return is_xplatform_dir_sep(*p);
> +	BUG("unreachable");
> +}

Thanks,
-Stolee


* Re: [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended()
  2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
@ 2022-04-21 17:28       ` Derrick Stolee
  0 siblings, 0 replies; 77+ messages in thread
From: Derrick Stolee @ 2022-04-21 17:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jonathan Tan, Jonathan Nieder, Albert Cui,
	Robin H . Johnson, Teng Long

On 4/18/2022 1:23 PM, Ævar Arnfjörð Bjarmason wrote:
> -static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
> -					       int mark_tags_complete)
> +static struct commit *deref_without_lazy_fetch_extended(const struct object_id *oid,
> +							int mark_tags_complete,
> +							enum object_type *type,
> +							unsigned int oi_flags)
>  {
> -	enum object_type type;
> -	struct object_info info = { .typep = &type };
> +	struct object_info info = { .typep = type };
>  	struct commit *commit;

Since we now dereference 'type', should we have a BUG() statement here
if type is NULL?
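
To illustrate the kind of guard I have in mind, here is a minimal,
self-contained sketch. None of it is git's actual API: SKETCH_BUG() is
a hypothetical stand-in for BUG(), and the lookup and type names are
made up so the example compiles on its own.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for git's BUG(): report and abort. */
#define SKETCH_BUG(msg) \
	do { fprintf(stderr, "BUG: %s\n", (msg)); abort(); } while (0)

/* Hypothetical object types; names and values are illustrative only. */
enum sketch_type { SKETCH_TYPE_COMMIT = 1, SKETCH_TYPE_TAG = 2 };

/* Hypothetical lookup that reports the type through an out-parameter. */
static int sketch_lookup(int id, enum sketch_type *typep)
{
	if (id < 0)
		return -1; /* not found */
	*typep = (id % 2) ? SKETCH_TYPE_TAG : SKETCH_TYPE_COMMIT;
	return 0;
}

/* Returns 1 if "id" names a commit, 0 otherwise; "type" must not be NULL. */
static int sketch_is_commit(int id, enum sketch_type *type)
{
	if (!type)
		SKETCH_BUG("sketch_is_commit() requires a non-NULL 'type'");
	if (sketch_lookup(id, type))
		return 0;
	return *type == SKETCH_TYPE_COMMIT;
}
```

The point is only that the check happens before the first "*type"
dereference, so a caller passing NULL gets a clear message instead of
a segfault.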

>  
>  	commit = lookup_commit_in_graph(the_repository, oid);
> @@ -128,9 +129,9 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
>  
>  	while (1) {
>  		if (oid_object_info_extended(the_repository, oid, &info,
> -					     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK))
> +					     oi_flags))
>  			return NULL;
> -		if (type == OBJ_TAG) {
> +		if (*type == OBJ_TAG) {
>  			struct tag *tag = (struct tag *)
>  				parse_object(the_repository, oid);
>  
> @@ -144,7 +145,7 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
>  		}
>  	}
>  
> -	if (type == OBJ_COMMIT) {
> +	if (*type == OBJ_COMMIT) {
>  		struct commit *commit = lookup_commit(the_repository, oid);
>  		if (!commit || repo_parse_commit(the_repository, commit))
>  			return NULL;
> @@ -154,6 +155,16 @@ static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
>  	return NULL;
>  }
>  
> +

nit: extraneous newline.

> +static struct commit *deref_without_lazy_fetch(const struct object_id *oid,
> +					       int mark_tags_complete)

Thanks,
-Stolee


* Re: [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format
  2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
                       ` (35 preceding siblings ...)
  2022-04-18 17:23     ` [RFC PATCH v2 36/36] docs: document bundle URI standard Ævar Arnfjörð Bjarmason
@ 2022-04-21 19:54     ` Derrick Stolee
  2022-04-22  9:37       ` Ævar Arnfjörð Bjarmason
  36 siblings, 1 reply; 77+ messages in thread
From: Derrick Stolee @ 2022-04-21 19:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: Junio C Hamano, Jonathan Tan, Jonathan Nieder, Albert Cui,
	Robin H . Johnson, Teng Long

On 4/18/2022 1:23 PM, Ævar Arnfjörð Bjarmason wrote:
> This RFC series is a start at trying to combine the two differing RFC
> versions of bundle URIs I [1] and Derrick Stolee [2] were kicking
> around.
> 
> = Layout
> 
> This series arranged in the following way:
> 
> * 01-08: "Prep" patches from both [1] and [2] which in principle could
>   graduate first to "master".
> 
>   I.e. they're prep fixes added for the two bundle-uri
>   implementations, but which either justify themselves, or e.g. expose
>   a now-static function via an API.
> 
>   I tried to move things into the "justify themselves" category
>   whenever possible, but may have overdone it e.g. for 02/36
>   (originally an idea/commit of Derrick's, but I changed the
>   authorship as pretty much all of it at this point is something I
>   changed).
> 
>   For the "prep" changes that are only needed for later changes in the
>   series perhaps we should just squash them if they're small enough.

I focused today on reading these first 8 patches with the intention that
they can be submitted for full review and merging on their own. I think
they don't fully succeed in justifying themselves (since not all public
methods have callers) but it would be best to have these refactors
settled before getting into the nitty gritty of the bundle URI feature.

I mostly had a few nits here and there. I noticed that you did not always
add your sign-off after mine, so please correct that when you send the
next version (assuming you are planning to do so).

I'll now start digging into the bigger parts of the series, but I expect
it to take a lot longer to look at everything.

Thanks,
-Stolee


* Re: [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format
  2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
@ 2022-04-22  9:37       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 77+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-22  9:37 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Junio C Hamano, Jonathan Tan, Jonathan Nieder, Albert Cui,
	Robin H . Johnson, Teng Long


On Thu, Apr 21 2022, Derrick Stolee wrote:

> On 4/18/2022 1:23 PM, Ævar Arnfjörð Bjarmason wrote:
>> This RFC series is a start at trying to combine the two differing RFC
>> versions of bundle URIs I [1] and Derrick Stolee [2] were kicking
>> around.
>> 
>> = Layout
>> 
>> This series arranged in the following way:
>> 
>> * 01-08: "Prep" patches from both [1] and [2] which in principle could
>>   graduate first to "master".
>> 
>>   I.e. they're prep fixes added for the two bundle-uri
>>   implementations, but which either justify themselves, or e.g. expose
>>   a now-static function via an API.
>> 
>>   I tried to move things into the "justify themselves" category
>>   whenever possible, but may have overdone it e.g. for 02/36
>>   (originally an idea/commit of Derrick's, but I changed the
>>   authorship as pretty much all of it at this point is something I
>>   changed).
>> 
>>   For the "prep" changes that are only needed for later changes in the
>>   series perhaps we should just squash them if they're small enough.
>
> I focused today on reading these first 8 patches with the intention that
> they can be submitted for full review and merging on their own. I think
> they don't fully succeed in justifying themselves (since not all public
> methods have callers) but it would be best to have these refactors
> settled before getting into the nitty gritty of the bundle URI feature.
>
> I mostly had a few nits here and there. I noticed that you did not always
> add your sign-off after mine, so please correct that when you send the
> next version (assuming you are planning to do so).

Will do, sorry.

FWIW the ones with the missing sign-off are also those I didn't modify
(extensively), so while I should fix it, it might help as a marker for
stuff I changed right now...


end of thread, other threads:[~2022-04-22  9:37 UTC | newest]

Thread overview: 77+ messages
2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
2021-10-26 14:00   ` Derrick Stolee
2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
2021-10-27  1:55       ` Derrick Stolee
2021-10-27 17:49         ` Ævar Arnfjörð Bjarmason
2021-10-27  2:01   ` Derrick Stolee
2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
2021-10-27 16:31       ` Derrick Stolee
2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
2021-10-27 19:23           ` Derrick Stolee
2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason
2021-10-29 18:30               ` Derrick Stolee
2021-10-30 14:51           ` Philip Oakley
2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2021-10-26 14:05   ` Derrick Stolee
2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
2021-11-01 21:00     ` Derrick Stolee
2021-11-01 23:18       ` Ævar Arnfjörð Bjarmason
2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 02/13] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 13/13] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-03-11 21:28   ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Derrick Stolee
2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
2022-04-21 17:26       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-04-21 17:28       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 05/36] http: make http_get_file() external Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 06/36] remote: move relative_url() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 16/36] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 17/36] remote-curl: add 'get' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 23/36] clone: add --bundle-uri option Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 27/36] protocol-caps: implement cap_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 29/36] serve: advertise 'features' when config exists Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 30/36] connect: implement get_recommended_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 31/36] transport: add connections for 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 32/36] clone: use server-recommended bundle URI Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 33/36] t5601: basic bundle URI test Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs) Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 35/36] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 36/36] docs: document bundle URI standard Ævar Arnfjörð Bjarmason
2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
2022-04-22  9:37       ` Ævar Arnfjörð Bjarmason
