Proposal for "fetch-any-blob Git protocol" and server design

* Proposal for "fetch-any-blob Git protocol" and server design
@ 2017-03-14 22:57 Jonathan Tan
  2017-03-15 17:59 ` Junio C Hamano
  2017-03-28 23:19 ` Stefan Beller
  0 siblings, 2 replies; 10+ messages in thread
From: Jonathan Tan @ 2017-03-14 22:57 UTC (permalink / raw)
  To: git; +Cc: markbt, git

As described in "Background" below, there have been at least 2 patch 
sets to support "partial clones" and on-demand blob fetches, where the 
server part that supports on-demand blob fetches was treated at least in 
outline. Here is a proposal treating that server part in detail.

== Background

The desire for Git to support (i) missing blobs and (ii) fetching them 
as needed from a remote repository has surfaced on the mailing list a 
few times, most recently in the form of RFC patch sets [1] [2].

A local repository that supports (i) will be created by a "partial 
clone", that is, a clone with some special parameters (exact parameters 
are still being discussed) that does not download all blobs normally 
downloaded. Such a repository should support (ii), which is what this 
proposal describes.

== Design

A new endpoint "server" is created. The client will send a message in 
the following format:

----
fbp-request = PKT-LINE("fetch-blob-pack")
               1*want
               flush-pkt
want = PKT-LINE("want" SP obj-id)
----

The client may send one or more SHA-1s for which it wants blobs, then a 
flush-pkt.

The server will then reply:

----
server-reply = flush-pkt | PKT-LINE("ERR" SP message)
----

If there was no error, the server will then send them in a packfile, 
formatted like described in "Packfile Data" in pack-protocol.txt with 
"side-band-64k" enabled.

Any server that supports "partial clone" will also support this, and the 
client will automatically assume this. (How a client discovers "partial 
clone" is not covered by this proposal.)

The server will perform reachability checks on requested blobs through 
the equivalent of "git rev-list --use-bitmap-index" (like "git 
upload-pack" when using the allowreachablesha1inwant option), unless 
configured to suppress reachability checks through a config option. The 
server administrator is highly recommended to regularly regenerate the 
bitmap (or suppress reachability checks).

=== Endpoint support for forward compatibility

This "server" endpoint requires that the first line be understood, but 
will ignore any other lines starting with words that it does not 
understand. This allows new "commands" to be added (distinguished by 
their first lines) and existing commands to be "upgraded" with backwards 
compatibility.

=== Related improvements possible with new endpoint

Previous protocol upgrade suggestions have had to face the difficulty of 
allowing updated clients to discover the server support while not 
slowing down (for example, through extra network round-trips) any 
client, whether non-updated or updated. The introduction of "partial 
clone" allows clients to rely on the guarantee that any server that 
supports "partial clone" supports "fetch-blob-pack", and we can extend 
the guarantee to other protocol upgrades that such repos would want.

One such upgrade is "ref-in-want" [3]. The full details can be obtained 
from that email thread, but to summarize, the patch set eliminates the 
need for the initial ref advertisement and allows communication in ref 
name globs, making it much easier for multiple load-balanced servers to 
serve large repos to clients - this is something that would greatly 
benefit the Android project, for example, and possibly many others.

Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies 
matters for the client in that a client needs to only handle one 
"version" of server (a server that supports both). If "ref-in-want" were 
added later, instead of now, clients would need to be able to handle two 
"versions" (one with only "fetch-blob-pack" and one with both 
"fetch-blob-pack" and "ref-in-want").

As for its implementation, that email thread already contains a patch 
set that makes it work with the existing "upload-pack" endpoint; I can 
update that patch set to use the proposed "server" endpoint (with a 
"fetch-commit-pack" message) if need be.

== Client behavior

This proposal is concerned with server behavior only, but it is useful 
to envision how the client would use this to ensure that the server 
behavior is useful.

=== Indication to use the proposed endpoint

The client will probably already record that at least one of its remotes 
(the one that it successfully performed a "partial clone" from) supports 
this new endpoint (if not, it can’t determine whether a missing blob was 
caused by repo corruption or by the "partial clone"). This knowledge can 
be used both to know that the server supports "fetch-blob-pack" and 
"fetch-commit-pack" (for the latter, the client can fall back to 
"fetch-pack"/"upload-pack" when fetching from other servers).

=== Multiple remotes

Fetches of missing blobs should (at least by default?) go to the remote 
that sent the tree that points to them. This means that if there are 
multiple remotes, the client needs to remember which remote it learned 
about a given missing blob from.

== Alternatives considered

The "fetch-blob-pack" and "fetch-commit-pack" messages could be split 
into their own endpoints. It seemed more reasonable to combine them 
together since they serve similar use cases (large repos), and (for 
example) reduces the number of binaries in PATH, but I do not feel 
strongly about this.

The client could supply commit information about the blobs it wants (or 
other information that could help the reachability analysis). However, 
these lines wouldn’t be used by the proposed server design. And if we do 
discover that these lines are useful, the protocol could be extended 
with new lines that contain this information (since old servers will 
ignore all lines that they do not understand).

We could extend "upload-pack" to allow blobs in "want" lines instead of 
having a new endpoint. Due to a quirk in the Git implementation (but 
possibly not other implementations like JGit), this is already supported 
[4]. However, each invocation would require the server to generate an 
unnecessary ref list, and would require both the server and the client 
to undergo more network traffic.

Also, the new "server" endpoint might be made to be discovered through 
another mechanism (for example, a capability advertisement on another 
endpoint). It is probably simpler to tie it to the "partial clone" 
feature, though, since they are so likely to be used together.

[1] <20170304191901.9622-1-markbt@efaref.net>
[2] <1488999039-37631-1-git-send-email-git@jeffhostetler.com>
[3] <cover.1485381677.git.jonathantanmy@google.com>
[4] <20170309003547.6930-1-jonathantanmy@google.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread