* Proposal for "fetch-any-blob Git protocol" and server design
@ 2017-03-14 22:57 Jonathan Tan
  2017-03-15 17:59 ` Junio C Hamano
  2017-03-28 23:19 ` Stefan Beller
  0 siblings, 2 replies; 10+ messages in thread

From: Jonathan Tan @ 2017-03-14 22:57 UTC (permalink / raw)
To: git; +Cc: markbt, git

As described in "Background" below, there have been at least two patch sets to support "partial clones" and on-demand blob fetches, in which the server part that supports on-demand blob fetches was treated at least in outline. Here is a proposal treating that server part in detail.

== Background

The desire for Git to support (i) missing blobs and (ii) fetching them as needed from a remote repository has surfaced on the mailing list a few times, most recently in the form of RFC patch sets [1] [2].

A local repository that supports (i) will be created by a "partial clone", that is, a clone with some special parameters (exact parameters are still being discussed) that does not download all blobs normally downloaded. Such a repository should support (ii), which is what this proposal describes.

== Design

A new endpoint "server" is created. The client will send a message in the following format:

----
fbp-request = PKT-LINE("fetch-blob-pack")
              1*want
              flush-pkt
want = PKT-LINE("want" SP obj-id)
----

The client may send one or more SHA-1s for which it wants blobs, then a flush-pkt. The server will then reply:

----
server-reply = flush-pkt | PKT-LINE("ERR" SP message)
----

If there was no error, the server will then send the requested blobs in a packfile, formatted as described in "Packfile Data" in pack-protocol.txt with "side-band-64k" enabled.

Any server that supports "partial clone" will also support this, and the client will automatically assume this. (How a client discovers "partial clone" is not covered by this proposal.)
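For concreteness, here is a small illustrative sketch (Python, not part of the proposal itself) of how a client might frame an fbp-request using the pkt-line encoding from Git's protocol documentation: a 4-digit lowercase hex length prefix that includes its own four bytes, with "0000" serving as flush-pkt. The helper names are made up, and whether payloads carry a trailing LF is omitted for brevity.

```python
def pkt_line(payload: str) -> bytes:
    # A pkt-line is a 4-hex-digit length (counting the 4 header bytes
    # themselves) followed by the payload.
    data = payload.encode("ascii")
    return b"%04x" % (len(data) + 4) + data

FLUSH_PKT = b"0000"  # flush-pkt: a length of zero, no payload

def build_fbp_request(blob_shas):
    """Frame an fbp-request: the command line, 1*want, then flush-pkt."""
    req = pkt_line("fetch-blob-pack")
    for sha in blob_shas:
        req += pkt_line("want " + sha)
    return req + FLUSH_PKT
```

So a request for a single blob would start with "0013fetch-blob-pack", carry one "0031want <40-hex-sha>" line, and end with "0000".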
The server will perform reachability checks on requested blobs through the equivalent of "git rev-list --use-bitmap-index" (like "git upload-pack" when using the allowreachablesha1inwant option), unless configured to suppress reachability checks through a config option. It is highly recommended that the server administrator regularly regenerate the bitmap (or suppress reachability checks).

=== Endpoint support for forward compatibility

This "server" endpoint requires that the first line be understood, but will ignore any other lines starting with words that it does not understand. This allows new "commands" to be added (distinguished by their first lines) and existing commands to be "upgraded" with backwards compatibility.

=== Related improvements possible with new endpoint

Previous protocol upgrade suggestions have had to face the difficulty of allowing updated clients to discover the server support while not slowing down (for example, through extra network round-trips) any client, whether non-updated or updated. The introduction of "partial clone" allows clients to rely on the guarantee that any server that supports "partial clone" supports "fetch-blob-pack", and we can extend the guarantee to other protocol upgrades that such repos would want.

One such upgrade is "ref-in-want" [3]. The full details can be obtained from that email thread, but to summarize, the patch set eliminates the need for the initial ref advertisement and allows communication in ref name globs, making it much easier for multiple load-balanced servers to serve large repos to clients - this is something that would greatly benefit the Android project, for example, and possibly many others.

Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies matters for the client in that a client only needs to handle one "version" of server (a server that supports both).
If "ref-in-want" were added later, instead of now, clients would need to be able to handle two "versions" (one with only "fetch-blob-pack" and one with both "fetch-blob-pack" and "ref-in-want").

As for its implementation, that email thread already contains a patch set that makes it work with the existing "upload-pack" endpoint; I can update that patch set to use the proposed "server" endpoint (with a "fetch-commit-pack" message) if need be.

== Client behavior

This proposal is concerned with server behavior only, but it is useful to envision how the client would use this to ensure that the server behavior is useful.

=== Indication to use the proposed endpoint

The client will probably already record that at least one of its remotes (the one that it successfully performed a "partial clone" from) supports this new endpoint (if not, it can’t determine whether a missing blob was caused by repo corruption or by the "partial clone"). This knowledge can be used to know that the server supports both "fetch-blob-pack" and "fetch-commit-pack" (for the latter, the client can fall back to "fetch-pack"/"upload-pack" when fetching from other servers).

=== Multiple remotes

Fetches of missing blobs should (at least by default?) go to the remote that sent the tree that points to them. This means that if there are multiple remotes, the client needs to remember which remote it learned about a given missing blob from.

== Alternatives considered

The "fetch-blob-pack" and "fetch-commit-pack" messages could be split into their own endpoints. It seemed more reasonable to combine them, since they serve similar use cases (large repos) and combining them (for example) reduces the number of binaries in PATH, but I do not feel strongly about this.

The client could supply commit information about the blobs it wants (or other information that could help the reachability analysis). However, these lines wouldn’t be used by the proposed server design.
And if we do discover that these lines are useful, the protocol could be extended with new lines that contain this information (since old servers will ignore all lines that they do not understand).

We could extend "upload-pack" to allow blobs in "want" lines instead of having a new endpoint. Due to a quirk in the Git implementation (but possibly not in other implementations like JGit), this is already supported [4]. However, each invocation would require the server to generate an unnecessary ref list, and would require both the server and the client to incur more network traffic.

Also, the new "server" endpoint might be made discoverable through another mechanism (for example, a capability advertisement on another endpoint). It is probably simpler to tie it to the "partial clone" feature, though, since they are so likely to be used together.

[1] <20170304191901.9622-1-markbt@efaref.net>
[2] <1488999039-37631-1-git-send-email-git@jeffhostetler.com>
[3] <cover.1485381677.git.jonathantanmy@google.com>
[4] <20170309003547.6930-1-jonathantanmy@google.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Proposal for "fetch-any-blob Git protocol" and server design
  2017-03-14 22:57 Proposal for "fetch-any-blob Git protocol" and server design Jonathan Tan
@ 2017-03-15 17:59 ` Junio C Hamano
  2017-03-16 17:31   ` Jonathan Tan
  2017-03-28 23:19 ` Stefan Beller
  1 sibling, 1 reply; 10+ messages in thread

From: Junio C Hamano @ 2017-03-15 17:59 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, markbt, git

Jonathan Tan <jonathantanmy@google.com> writes:

> == Design
>
> A new endpoint "server" is created. The client will send a message in
> the following format:
>
> ----
> fbp-request = PKT-LINE("fetch-blob-pack")
>               1*want
>               flush-pkt
> want = PKT-LINE("want" SP obj-id)
> ----
>
> The client may send one or more SHA-1s for which it wants blobs, then
> a flush-pkt.

By "SHA-1s for which it wants blobs", you mean that "want" only allows one exact blob object name? I think it is necessary to support that mode of operation as a base case, and it is a good starting point.

When you know

 - you have a "partial" clone that initially asked to contain only
   blobs that are smaller than 10MB, and

 - you are now trying to do a "git checkout v1.0 -- this/directory"
   so that the directory is fully populated

instead of enumerating all the missing blobs from the output of "ls-tree -r v1.0 this/directory" on separate "want" requests, you may want to say "I want all the blobs that are not smaller than 10MB in this tree object $(git rev-parse v1.0:this/directory)".

I am not saying that you should add something like this right away, but I am wondering how you would extend the proposed system to do so. Would you add a "fetch-size-limited-blob-in-tree-pack" that runs parallel to the "fetch-blob-pack" request? Would you add a new type of request packet "want-blob-with-expression" for fbp-request, which is protected by some "protocol capability" exchange?
If the former, how does a client discover if a particular server already supports the new "fetch-size-limited-blob-in-tree-pack" request, so that it does not have to send a bunch of "want" requests by enumerating the blobs itself? If the latter, how does a client discover if a particular server's "fetch-blob-pack" already supports the new "want-blob-with-expression" request packet?

> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.
* Re: Proposal for "fetch-any-blob Git protocol" and server design
  2017-03-15 17:59 ` Junio C Hamano
@ 2017-03-16 17:31   ` Jonathan Tan
  2017-03-16 21:17     ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread

From: Jonathan Tan @ 2017-03-16 17:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, markbt, git

On 03/15/2017 10:59 AM, Junio C Hamano wrote:
> By "SHA-1s for which it wants blobs", you mean that "want" only
> allows one exact blob object name? I think it is necessary to
> support that mode of operation as a base case, and it is a good
> starting point.
>
> When you know
>
>  - you have a "partial" clone that initially asked to contain only
>    blobs that are smaller than 10MB, and
>
>  - you are now trying to do a "git checkout v1.0 -- this/directory"
>    so that the directory is fully populated
>
> instead of enumerating all the missing blobs from the output of
> "ls-tree -r v1.0 this/directory" on separate "want" requests, you
> may want to say "I want all the blobs that are not smaller than 10MB
> in this tree object $(git rev-parse v1.0:this/directory)".
>
> I am not saying that you should add something like this right away,
> but I am wondering how you would extend the proposed system to do
> so. Would you add "fetch-size-limited-blob-in-tree-pack" that runs
> parallel to "fetch-blob-pack" request? Would you add a new type of
> request packet "want-blob-with-expression" for fbp-request, which is
> protected by some "protocol capability" exchange?
>
> If the former, how does a client discover if a particular server
> already supports the new "fetch-size-limited-blob-in-tree-pack"
> request, so that it does not have to send a bunch of "want" request
> by enumerating the blobs itself? If the latter, how does a client
> discover if a particular server's "fetch-blob-pack" already supports
> the new "want-blob-with-expression" request packet?
I'm not sure if that use case is something we need to worry about (if you're downloading x * 10MB, uploading x * 50B shouldn't be a problem, I think), but if we want to handle that use case in the future, I agree that extending this system would be difficult.

The best way I can think of right now is for the client to send a fetch-blob-pack request with no "want" lines and at least one "want-tree" line, and then, if there is an error (which will happen if the server is old, and therefore sees that there is not at least one "want" line), to retry with the "want" lines. This allows us to add alternative ways of specifying blobs later (if we want to), but it also means that upgrading a client without upgrading the corresponding server incurs a round-trip penalty.

Alternatively, we could add rudimentary support for trees now and add filter-by-size later (so that such requests made to old servers will download extra blobs, but at least it works), but that still doesn't solve the general problem of specifying blobs by some rule other than their own SHA-1s or their trees' SHA-1s.
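The retry flow described above might look like the following from the client side. This is only an illustrative sketch: the "want-tree" line is hypothetical, the transport helper is a stand-in for real pkt-line framing, and error detection piggybacks on the proposed ERR reply.

```python
class FakeTransport:
    """Stand-in for a pkt-line connection; a real client would frame
    these payloads as pkt-lines over HTTP or SSH. An 'old' server
    rejects any request that carries no plain 'want' line."""
    def __init__(self, old_server):
        self.old_server = old_server
        self.round_trips = 0

    def request(self, lines):
        self.round_trips += 1
        if self.old_server and not any(l.startswith("want ") for l in lines):
            return ["ERR unrecognized request"]
        return ["<packfile>"]

def fetch_blobs(conn, tree_sha, blob_shas):
    """Try the hypothetical 'want-tree' form first; if the server
    answers ERR (it saw no 'want' line), retry with plain 'want'
    lines - costing one extra round trip against old servers."""
    reply = conn.request(["fetch-blob-pack", "want-tree %s" % tree_sha])
    if reply and reply[0].startswith("ERR"):
        reply = conn.request(
            ["fetch-blob-pack"] + ["want %s" % s for s in blob_shas])
    return reply
```

Against a new server the first request succeeds; against an old one, the fallback fires and the client pays the round-trip penalty mentioned above.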
* Re: Proposal for "fetch-any-blob Git protocol" and server design
  2017-03-16 17:31 ` Jonathan Tan
@ 2017-03-16 21:17   ` Junio C Hamano
  2017-03-16 22:48     ` Jonathan Tan
  0 siblings, 1 reply; 10+ messages in thread

From: Junio C Hamano @ 2017-03-16 21:17 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, markbt, git

Jonathan Tan <jonathantanmy@google.com> writes:

> On 03/15/2017 10:59 AM, Junio C Hamano wrote:
>> ...
>> but I am wondering how you would extend the proposed system to do
>> so. Would you add "fetch-size-limited-blob-in-tree-pack" that runs
>> parallel to "fetch-blob-pack" request? Would you add a new type of
>> request packet "want-blob-with-expression" for fbp-request, which is
>> protected by some "protocol capability" exchange?
>>
>> If the former, how does a client discover if a particular server
>> already supports the new "fetch-size-limited-blob-in-tree-pack"
>> request, so that it does not have to send a bunch of "want" request
>> by enumerating the blobs itself? If the latter, how does a client
>> discover if a particular server's "fetch-blob-pack" already supports
>> the new "want-blob-with-expression" request packet?
>
> I'm not sure if that use case is something we need to worry about (if
> you're downloading x * 10MB, uploading x * 50B shouldn't be a problem,
> I think), but if we want to handle that use case in the future, I
> agree that extending this system would be difficult.

Yeah, the example was solely to see how the system was to be extended, as one of the selling points of the proposal was:

> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.
> The best way I can think of right now is for the client to send a
> fetch-blob-pack request with no "want" lines and at least one
> "want-tree" line, ...

So it is not by adding a new type of "request" that sits next to the "fetch-blob-pack" request, but by adding a new way to drive that existing "fetch-blob-pack" request.

> and then if there is an error (which will happen if
> the server is old, and therefore sees that there is not at least
> one "want" line), to retry with the "want" lines. This allows us to
> add alternative ways of specifying blobs later (if we want to), but
> also means that upgrading a client without upgrading the
> corresponding server incurs a round-trip penalty.

And the lack of "capability negotiation" is substituted by "assume the better server, fall back to the lowest common denominator by detecting errors"?

> Alternatively we could add rudimentary support for trees now and add
> filter-by-size later ...

I am not particularly interested in the "blobs in this tree" request. It was merely an example to make it easier to discuss the main point, which is the bigger picture that the proposal was painting around "forward compatibility".

Thanks.
* Re: Proposal for "fetch-any-blob Git protocol" and server design
  2017-03-16 21:17 ` Junio C Hamano
@ 2017-03-16 22:48   ` Jonathan Tan
  0 siblings, 0 replies; 10+ messages in thread

From: Jonathan Tan @ 2017-03-16 22:48 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, markbt, git

On 03/16/2017 02:17 PM, Junio C Hamano wrote:
> Yeah, the example was solely to see how the system was to be
> extended, as one of the selling points of the proposal was:
>
> > === Endpoint support for forward compatibility
> >
> > This "server" endpoint requires that the first line be understood, but
> > will ignore any other lines starting with words that it does not
> > understand. This allows new "commands" to be added (distinguished by
> > their first lines) and existing commands to be "upgraded" with
> > backwards compatibility.

<snip>

> And the lack of "capability negotiation" is substituted by "assume
> the better server, fall back to the lowest common denominator by
> detecting errors"?

Yes. I probably should have mentioned that this "forward compatibility" is limited - it does not include any potential new feature intended to reduce the size of the request. I was thinking more of this being able to be extended to, for example, add "hint" lines indicating that certain blobs come from certain commits, or add "have" lines to present blobs that might be good delta bases (and even if the server doesn't understand these lines, its output is still correct).
* Re: Proposal for "fetch-any-blob Git protocol" and server design 2017-03-14 22:57 Proposal for "fetch-any-blob Git protocol" and server design Jonathan Tan 2017-03-15 17:59 ` Junio C Hamano @ 2017-03-28 23:19 ` Stefan Beller [not found] ` <00bf01d2aed7$b13492a0$139db7e0$@gmail.com> 1 sibling, 1 reply; 10+ messages in thread From: Stefan Beller @ 2017-03-28 23:19 UTC (permalink / raw) To: Jonathan Tan, Ben Peart; +Cc: git, Mark Thomas, Jeff Hostetler +cc Ben Peart, who sent "[RFC] Add support for downloading blobs on demand" to the list recently. This proposal here seems like it has the same goal, so maybe your review could go a long way here? Thanks, Stefan On Tue, Mar 14, 2017 at 3:57 PM, Jonathan Tan <jonathantanmy@google.com> wrote: > As described in "Background" below, there have been at least 2 patch sets to > support "partial clones" and on-demand blob fetches, where the server part > that supports on-demand blob fetches was treated at least in outline. Here > is a proposal treating that server part in detail. > > == Background > > The desire for Git to support (i) missing blobs and (ii) fetching them as > needed from a remote repository has surfaced on the mailing list a few > times, most recently in the form of RFC patch sets [1] [2]. > > A local repository that supports (i) will be created by a "partial clone", > that is, a clone with some special parameters (exact parameters are still > being discussed) that does not download all blobs normally downloaded. Such > a repository should support (ii), which is what this proposal describes. > > == Design > > A new endpoint "server" is created. The client will send a message in the > following format: > > ---- > fbp-request = PKT-LINE("fetch-blob-pack") > 1*want > flush-pkt > want = PKT-LINE("want" SP obj-id) > ---- > > The client may send one or more SHA-1s for which it wants blobs, then a > flush-pkt. 
> > The server will then reply: > > ---- > server-reply = flush-pkt | PKT-LINE("ERR" SP message) > ---- > > If there was no error, the server will then send them in a packfile, > formatted like described in "Packfile Data" in pack-protocol.txt with > "side-band-64k" enabled. > > Any server that supports "partial clone" will also support this, and the > client will automatically assume this. (How a client discovers "partial > clone" is not covered by this proposal.) > > The server will perform reachability checks on requested blobs through the > equivalent of "git rev-list --use-bitmap-index" (like "git upload-pack" when > using the allowreachablesha1inwant option), unless configured to suppress > reachability checks through a config option. The server administrator is > highly recommended to regularly regenerate the bitmap (or suppress > reachability checks). > > === Endpoint support for forward compatibility > > This "server" endpoint requires that the first line be understood, but will > ignore any other lines starting with words that it does not understand. This > allows new "commands" to be added (distinguished by their first lines) and > existing commands to be "upgraded" with backwards compatibility. > > === Related improvements possible with new endpoint > > Previous protocol upgrade suggestions have had to face the difficulty of > allowing updated clients to discover the server support while not slowing > down (for example, through extra network round-trips) any client, whether > non-updated or updated. The introduction of "partial clone" allows clients > to rely on the guarantee that any server that supports "partial clone" > supports "fetch-blob-pack", and we can extend the guarantee to other > protocol upgrades that such repos would want. > > One such upgrade is "ref-in-want" [3]. 
The full details can be obtained from > that email thread, but to summarize, the patch set eliminates the need for > the initial ref advertisement and allows communication in ref name globs, > making it much easier for multiple load-balanced servers to serve large > repos to clients - this is something that would greatly benefit the Android > project, for example, and possibly many others. > > Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies matters > for the client in that a client needs to only handle one "version" of server > (a server that supports both). If "ref-in-want" were added later, instead of > now, clients would need to be able to handle two "versions" (one with only > "fetch-blob-pack" and one with both "fetch-blob-pack" and "ref-in-want"). > > As for its implementation, that email thread already contains a patch set > that makes it work with the existing "upload-pack" endpoint; I can update > that patch set to use the proposed "server" endpoint (with a > "fetch-commit-pack" message) if need be. > > == Client behavior > > This proposal is concerned with server behavior only, but it is useful to > envision how the client would use this to ensure that the server behavior is > useful. > > === Indication to use the proposed endpoint > > The client will probably already record that at least one of its remotes > (the one that it successfully performed a "partial clone" from) supports > this new endpoint (if not, it can’t determine whether a missing blob was > caused by repo corruption or by the "partial clone"). This knowledge can be > used both to know that the server supports "fetch-blob-pack" and > "fetch-commit-pack" (for the latter, the client can fall back to > "fetch-pack"/"upload-pack" when fetching from other servers). > > === Multiple remotes > > Fetches of missing blobs should (at least by default?) go to the remote that > sent the tree that points to them. 
This means that if there are multiple > remotes, the client needs to remember which remote it learned about a given > missing blob from. > > == Alternatives considered > > The "fetch-blob-pack" and "fetch-commit-pack" messages could be split into > their own endpoints. It seemed more reasonable to combine them together > since they serve similar use cases (large repos), and (for example) reduces > the number of binaries in PATH, but I do not feel strongly about this. > > The client could supply commit information about the blobs it wants (or > other information that could help the reachability analysis). However, these > lines wouldn’t be used by the proposed server design. And if we do discover > that these lines are useful, the protocol could be extended with new lines > that contain this information (since old servers will ignore all lines that > they do not understand). > > We could extend "upload-pack" to allow blobs in "want" lines instead of > having a new endpoint. Due to a quirk in the Git implementation (but > possibly not other implementations like JGit), this is already supported > [4]. However, each invocation would require the server to generate an > unnecessary ref list, and would require both the server and the client to > undergo more network traffic. > > Also, the new "server" endpoint might be made to be discovered through > another mechanism (for example, a capability advertisement on another > endpoint). It is probably simpler to tie it to the "partial clone" feature, > though, since they are so likely to be used together. > > [1] <20170304191901.9622-1-markbt@efaref.net> > [2] <1488999039-37631-1-git-send-email-git@jeffhostetler.com> > [3] <cover.1485381677.git.jonathantanmy@google.com> > [4] <20170309003547.6930-1-jonathantanmy@google.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <00bf01d2aed7$b13492a0$139db7e0$@gmail.com>]
* RE: Proposal for "fetch-any-blob Git protocol" and server design [not found] ` <00bf01d2aed7$b13492a0$139db7e0$@gmail.com> @ 2017-04-12 22:02 ` Kevin David 2017-04-13 20:12 ` Jonathan Tan 0 siblings, 1 reply; 10+ messages in thread From: Kevin David @ 2017-04-12 22:02 UTC (permalink / raw) To: Jonathan Tan, Ben Peart, Stefan Beller; +Cc: git, Mark Thomas, Jeff Hostetler Hi Jonathan, I work on the network protocols for the GVFS project at Microsoft. I shared a couple thoughts and questions below. Thanks, Kevin -----Original Message----- From: Stefan Beller [mailto:sbeller@google.com] Sent: Tuesday, March 28, 2017 7:20 PM To: Jonathan Tan <jonathantanmy@google.com>; Ben Peart <peartben@gmail.com> Cc: git@vger.kernel.org; Mark Thomas <markbt@efaref.net>; Jeff Hostetler <git@jeffhostetler.com> Subject: Re: Proposal for "fetch-any-blob Git protocol" and server design +cc Ben Peart, who sent "[RFC] Add support for downloading blobs on demand" to the list recently. This proposal here seems like it has the same goal, so maybe your review could go a long way here? Thanks, Stefan On Tue, Mar 14, 2017 at 3:57 PM, Jonathan Tan <jonathantanmy@google.com> wrote: > As described in "Background" below, there have been at least 2 patch > sets to support "partial clones" and on-demand blob fetches, where the > server part that supports on-demand blob fetches was treated at least > in outline. Here is a proposal treating that server part in detail. > > == Background > > The desire for Git to support (i) missing blobs and (ii) fetching them > as needed from a remote repository has surfaced on the mailing list a > few times, most recently in the form of RFC patch sets [1] [2]. > > A local repository that supports (i) will be created by a "partial > clone", that is, a clone with some special parameters (exact > parameters are still being discussed) that does not download all blobs > normally downloaded. Such a repository should support (ii), which is what this proposal describes. 
>
> == Design
>
> A new endpoint "server" is created. The client will send a message in
> the following format:
>
> ----
> fbp-request = PKT-LINE("fetch-blob-pack")
>               1*want
>               flush-pkt
> want = PKT-LINE("want" SP obj-id)
> ----
>
> The client may send one or more SHA-1s for which it wants blobs, then
> a flush-pkt.

I know we're considering server behavior here, but how large do you generally expect these blob-want requests to be? I ask because we took an initial approach very similar to this; however, we had a hard time being clever about figuring out what set of blobs to request for those clients that didn't want the entire set, and ended up falling back to single-blob requests. Obviously, this could be due to the nature of our filesystem-virtualization-based client, but I also suspect that the teams attacking this problem are more often than not dealing with very large blob objects, so the cost of a round trip becomes lower relative to sending the object content itself.

>
> The server will then reply:
>
> ----
> server-reply = flush-pkt | PKT-LINE("ERR" SP message)
> ----
>
> If there was no error, the server will then send them in a packfile,
> formatted as described in "Packfile Data" in pack-protocol.txt with
> "side-band-64k" enabled.
>
> Any server that supports "partial clone" will also support this, and
> the client will automatically assume this. (How a client discovers
> "partial clone" is not covered by this proposal.)

Along the same lines as above, this is where we started and it worked well for low-volume requests. However, when we started ramping up the load, `pack-objects` operating on a very large packed repository (~150 GiB) became very computationally expensive, even with `--compression=1 --depth=0 --window=0`. Being a bit more clever about packing objects (e.g. splitting blobs out from commits and trees) improved this a bit, but we still hit a bottleneck from what appeared to be a large number of memory-mapping operations on a ~140 GiB packfile of blobs. Each `pack-objects` process would consume approximately one CPU core for the duration of the request.

It's possible that further splitting of these large blob packs would have improved performance in some scenarios, but that would increase the number of pack-index lookups necessary to find a single object.

>
> The server will perform reachability checks on requested blobs through
> the equivalent of "git rev-list --use-bitmap-index" (like "git
> upload-pack" when using the allowreachablesha1inwant option), unless
> configured to suppress reachability checks through a config option.
> It is highly recommended that the server administrator regularly
> regenerate the bitmap (or suppress reachability checks).
>
> === Endpoint support for forward compatibility
>
> This "server" endpoint requires that the first line be understood, but
> will ignore any other lines starting with words that it does not
> understand. This allows new "commands" to be added (distinguished by
> their first lines) and existing commands to be "upgraded" with
> backwards compatibility.

This seems like a clever way to avoid the canonical `/info/refs?service=git-upload-pack` capability negotiation on every call. However, using error handling to fall back seems slightly wonky to me. Hopefully users are incentivized to upgrade their clients.

> === Related improvements possible with new endpoint
>
> Previous protocol upgrade suggestions have had to face the difficulty
> of allowing updated clients to discover the server support while not
> slowing down (for example, through extra network round-trips) any
> client, whether non-updated or updated. The introduction of "partial
> clone" allows clients to rely on the guarantee that any server that
> supports "partial clone" supports "fetch-blob-pack", and we can extend
> the guarantee to other protocol upgrades that such repos would want.
>
> One such upgrade is "ref-in-want" [3]. The full details can be
> obtained from that email thread, but to summarize, the patch set
> eliminates the need for the initial ref advertisement and allows
> communication in ref name globs, making it much easier for multiple
> load-balanced servers to serve large repos to clients - this is
> something that would greatly benefit the Android project, for example,
> and possibly many others.
>
> Bundling support for "ref-in-want" with "fetch-blob-pack" simplifies
> matters for the client in that a client only needs to handle one
> "version" of server (a server that supports both). If "ref-in-want"
> were added later, instead of now, clients would need to be able to
> handle two "versions" (one with only "fetch-blob-pack" and one with
> both "fetch-blob-pack" and "ref-in-want").
>
> As for its implementation, that email thread already contains a patch
> set that makes it work with the existing "upload-pack" endpoint; I can
> update that patch set to use the proposed "server" endpoint (with a
> "fetch-commit-pack" message) if need be.
>
> == Client behavior
>
> This proposal is concerned with server behavior only, but it is useful
> to envision how the client would use this to ensure that the server
> behavior is useful.
>
> === Indication to use the proposed endpoint
>
> The client will probably already record that at least one of its
> remotes (the one that it successfully performed a "partial clone"
> from) supports this new endpoint (if not, it can’t determine whether a
> missing blob was caused by repo corruption or by the "partial clone").
> This knowledge can be used both to know that the server supports > "fetch-blob-pack" and "fetch-commit-pack" (for the latter, the client > can fall back to "fetch-pack"/"upload-pack" when fetching from other servers). This makes a lot of sense to me. When we built our caching proxy, we had to be careful when designing how we'd handle clients requesting objects missing from the proxy. For example, a client requests a single blob and the proxy doesn't have it - we can't simply download that object from the "authoritative" remote and stick it in the `.git\objects\xx\yyy...` directory, because the repository would be made corrupt. Having a way to specify that the repo is a "partial clone" and allowing "holes" would help a lot, I believe. I know there have been varying opinions on how these missing objects should be "marked" and I'm not ready to propose anything there - just agreeing the problem is important. > === Multiple remotes > > Fetches of missing blobs should (at least by default?) go to the > remote that sent the tree that points to them. This means that if > there are multiple remotes, the client needs to remember which remote > it learned about a given missing blob from. Definitely handy for someone writing a proxy, but some level of error fallback might be good. > == Alternatives considered > > The "fetch-blob-pack" and "fetch-commit-pack" messages could be split > into their own endpoints. It seemed more reasonable to combine them > together since they serve similar use cases (large repos), and (for > example) reduces the number of binaries in PATH, but I do not feel strongly about this. Not to derail us too far off blobs, but I wonder if we need a `fetch-commit-pack` endpoint, or could get away with introducing a new capability (e.g. `no-blobs`) to `upload-pack` instead. As a casual observer, this seems like it would be a much smaller change since the rest of the negotiation/reachability calculation would look the same, right?
Or would this `fetch-commit-pack` not return trees either? I only ask because, in our observations, when git wants to read commits it's usually followed by a lot of "related" trees - again caveated with the fact that we're intercepting many things at the filesystem layer. > The client could supply commit information about the blobs it wants > (or other information that could help the reachability analysis). > However, these lines wouldn’t be used by the proposed server design. > And if we do discover that these lines are useful, the protocol could > be extended with new lines that contain this information (since old > servers will ignore all lines that they do not understand). > > We could extend "upload-pack" to allow blobs in "want" lines instead > of having a new endpoint. Due to a quirk in the Git implementation > (but possibly not other implementations like JGit), this is already > supported [4]. However, each invocation would require the server to > generate an unnecessary ref list, and would require both the server > and the client to undergo more network traffic. > > Also, the new "server" endpoint might be made to be discovered through > another mechanism (for example, a capability advertisement on another > endpoint). It is probably simpler to tie it to the "partial clone" > feature, though, since they are so likely to be used together. Just to keep the discussion interesting, I'll throw an alternative out there that's worked well for us. As I understand it, the HTTP-based dumb transfer protocol supports returning objects in loose object format, but only if they already exist in loose format. Extending this to have the remote provide these objects via a "dumb" protocol when they are packed as well - i.e. the server would "loosen" them upon request - is basically what we do and it works quite well for low-latency clients.
To further improve performance at the cost of complexity, we've added caching at the memory and disk layer for these loose objects in the same format we send to the client. There's a clear tradeoff here - the servers must have adequate disk and/or memory to store these loose objects in optimal format. In addition, the higher the latency is to the remote, the worse this solution will perform. Fortunately, in our case, network topology allows us to put these caching proxies close enough to clients for it not to matter. > [1] <20170304191901.9622-1-markbt@efaref.net> > [2] <1488999039-37631-1-git-send-email-git@jeffhostetler.com> > [3] <cover.1485381677.git.jonathantanmy@google.com> > [4] <20170309003547.6930-1-jonathantanmy@google.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
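For concreteness, the fbp-request framing proposed at the top of this thread could be sketched as follows. This is only an illustrative Python sketch of the proposed pkt-line encoding; whether the command and "want" lines carry a trailing LF is left open by the proposal, so none is added here:

```python
def pkt_line(payload: str) -> bytes:
    # pkt-line framing: four hex digits giving the total length
    # (4-byte header included), followed by the payload.
    data = payload.encode()
    return b"%04x" % (len(data) + 4) + data

def fbp_request(want_ids):
    # fbp-request = PKT-LINE("fetch-blob-pack") 1*want flush-pkt
    lines = [pkt_line("fetch-blob-pack")]
    lines += [pkt_line("want " + oid) for oid in want_ids]
    return b"".join(lines) + b"0000"  # flush-pkt

# e.g. a request for the empty blob:
print(fbp_request(["e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"]))
```

On success the server would answer with a flush-pkt followed by the packfile data; on failure, with a single PKT-LINE("ERR" SP message).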
* Re: Proposal for "fetch-any-blob Git protocol" and server design 2017-04-12 22:02 ` Kevin David @ 2017-04-13 20:12 ` Jonathan Tan 2017-04-21 16:41 ` Kevin David 0 siblings, 1 reply; 10+ messages in thread From: Jonathan Tan @ 2017-04-13 20:12 UTC (permalink / raw) To: Kevin David, Ben Peart, Stefan Beller; +Cc: git, Mark Thomas, Jeff Hostetler On 04/12/2017 03:02 PM, Kevin David wrote: > Hi Jonathan, > > I work on the network protocols for the GVFS project at Microsoft. > I shared a couple thoughts and questions below. Thanks for your reply! > I know we're considering server behavior here, but how large do you generally > expect these blob-want requests to be? I ask because we took an initial approach > very similar to this, however, we had a hard time being clever about figuring out > what set of blobs to request for those clients that didn't want the entire set, and > ended up falling back to single-blob requests. > > Obviously, this could be due to the nature of our filesystem-virtualization-based client, > but I also suspect that the teams attacking this problem are more often than not dealing > with very large blob objects, so the cost of a round-trip becomes lower relative to sending > the object content itself. I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a pre-command or hook to identify needed blobs and pre-fetch them before allowing the actual command to start"), so a Git command would typically make a single request that contains all the blobs required, but my proposal can also handle (1c) ('"fault" them in as necessary in read_object() while the command is running and without any pre-fetch (either synchronously or asynchronously and with/without a helper process)'). Even if we decided to go with single-blob requests and responses, it is still important to send them as packfiles, so that the server can serve them directly from its compressed storage without first having to uncompress them.
[1] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ > Along the same lines as above, this is where we started and it worked well for > low-volume requests. However, when we started ramping up the load, > `pack-objects` operating on a very large packed repository (~150 GiB) became > very computationally expensive, even with `--compression=1 --depth=0 --window=0`. > > Being a bit more clever about packing objects (e.g. splitting blobs out from commits > and trees) improved this a bit, but we still hit bottlenecks from what appeared to > be a large number of memory-mapping operations on a ~140GiB packfile of blobs. > > Each `pack-objects` process would consume approximately one CPU core for the > duration of the request. It's possible that further splitting of these large blob packs > would have improved performance in some scenarios, but that would increase the > amount of pack-index lookups necessary to find a single object. I'm not very experienced with mmap, but I thought that memory-mapping a large file in itself does not incur much of a performance penalty (if any) - it is the accesses that count. I experimented with 15,000 and 150,000 MiB files and mmap and they seem to be handled quite well. Also, how many objects are "pack-objects" packing here? >> === Endpoint support for forward compatibility >> >> This "server" endpoint requires that the first line be understood, but >> will ignore any other lines starting with words that it does not >> understand. This allows new "commands" to be added (distinguished by >> their first lines) and existing commands to be "upgraded" with backwards compatibility. > > This seems like a clever way to avoid the canonical `/info/refs?service=git-upload-pack` > capability negotiation on every call. However, using error handling to fallback seems > slightly wonky to me. Hopefully users are incentivized to upgrade their clients.
By "error handling to fallback", do you mean in my proposal or in a possible future one (assuming my proposal is implemented)? I don't think my proposal requires any error handling to fallback (since only new clients can clone partially - old clients will just clone totally and obliviously), but I acknowledge that this proposal does not mean that any future proposal can be done without requiring error handling to fallback. >> === Indication to use the proposed endpoint >> >> The client will probably already record that at least one of its >> remotes (the one that it successfully performed a "partial clone" >> from) supports this new endpoint (if not, it can’t determine whether a >> missing blob was caused by repo corruption or by the "partial clone"). >> This knowledge can be used both to know that the server supports >> "fetch-blob-pack" and "fetch-commit-pack" (for the latter, the client >> can fall back to "fetch-pack"/"upload-pack" when fetching from other servers). > > This makes a lot of sense to me. When we built our caching proxy, we had to be careful > when designing how we'd handle clients requesting objects missing from the proxy. > > For example, a client requests a single blob and the proxy doesn't have it - we can't simply > download that object from the "authoritative" remote and stick it in the `.git\objects\xx\yyy...` > directory, because the repository would be made corrupt. By proxy, do you mean a Git repository? Sorry, I don't really understand this part. > Having a way to specify that the repo is a "partial clone" and allowing "holes" would help a lot, > I believe. I know there have been varying opinions on how these missing objects should be > "marked" and I'm not ready to propose anything there - just agreeing the problem is important. Agreed. > Not to derail us too far off blobs, but I wonder if we need a `fetch-commit-pack` endpoint, > or could get away with introducing a new capability (e.g. `no-blobs`) to `upload-pack` instead.
> As a casual observer, this seems like it would be a much smaller change since the rest of the > negotiation/reachability calculation would look the same, right? Or would this `fetch-commit-pack` > not return trees either? > > I only ask because, in our observations, when git wants to read commits it's > usually followed by a lot of "related" trees - again caveated with the fact that > we're intercepting many things at the filesystem layer. The main reason for this extra command is not to exclude blobs (which, as you said, can be done with a new capability - I suspect that we will need a capability or parameter of some sort anyway to indicate which size of blobs to filter out) but to eliminate the mandatory ref advertisement that is done whenever the client fetches. One of our use cases (internal Android) has large blobs and many (more than 700k) refs, so it would benefit greatly from blob filtering and elimination of the mandatory ref advertisement (tens of megabytes per fetch). As for the size of the change, I have a work in progress that implements this [2]. [2] https://public-inbox.org/git/cover.1491851452.git.jonathantanmy@google.com/ > Just to keep the discussion interesting, I'll throw an alternative out there that's > worked well for us. As I understand it, the HTTP-based dumb transfer protocol > supports returning objects in loose object format, but only if they already exist > in loose format. > > Extending this to have the remote provide these objects via a "dumb" protocol > when they are packed as well - i.e. the server would "loosen" them upon request - > is basically what we do and it works quite well for low-latency clients. To further improve > performance at the cost of complexity, we've added caching at the memory and disk layer > for these loose objects in the same format we send to the client. > > There's a clear tradeoff here - the servers must have adequate disk and/or memory to store > these loose objects in optimal format.
In addition, the higher the latency is to the remote, > the worse this solution will perform. Fortunately, in our case, network topology allows us to > put these caching proxies close enough to clients for it not to matter. This does make sense in the situation you describe, but (as you said) I don't think we can guarantee this in the majority of situations. I think some sort of batching (like the (1a) solution I talked about near the start of this e-mail) and serving packed data from packed storage should form the baseline, and any situation-specific optimizations (e.g. serving unpacked data from topologically-close servers) can be additional steps.
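The (1a) batching baseline Jonathan describes might look something like this on the client side. This is a hypothetical sketch: `is_missing` and `send_fbp_request` stand in for repository and transport plumbing that the proposal does not specify:

```python
def prefetch_blobs(needed_ids, is_missing, send_fbp_request, batch_size=1000):
    # (1a)-style pre-fetch: collect every blob the upcoming command will
    # need, then issue a small number of batched fetch-blob-pack requests
    # rather than one network round trip per blob.
    missing = [oid for oid in needed_ids if is_missing(oid)]
    for i in range(0, len(missing), batch_size):
        send_fbp_request(missing[i:i + batch_size])  # many "want"s per request
    return len(missing)
```

The point of the sketch is only that the request count scales with the batch size, not with the number of missing blobs.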
* RE: Proposal for "fetch-any-blob Git protocol" and server design 2017-04-13 20:12 ` Jonathan Tan @ 2017-04-21 16:41 ` Kevin David 2017-04-26 22:51 ` Jonathan Tan 0 siblings, 1 reply; 10+ messages in thread From: Kevin David @ 2017-04-21 16:41 UTC (permalink / raw) To: Jonathan Tan, Ben Peart, Stefan Beller; +Cc: git, Mark Thomas, Jeff Hostetler Hi Jonathan, Sorry for the delayed response - other work got in the way, unfortunately! Kevin -----Original Message----- From: Jonathan Tan [mailto:jonathantanmy@google.com] Sent: Thursday, April 13, 2017 4:13 PM >> I know we're considering server behavior here, but how large do you generally >> expect these blob-want requests to be? I ask because we took an initial approach >> very similar to this, however, we had a hard time being clever about figuring out >> what set of blobs to request for those clients that didn't want the entire set, and >> ended up falling back to single-blob requests. >> >> Obviously, this could be due to the nature of our filesystem-virtualization-based client, >> but I also suspect that the teams attacking this problem are more often than not dealing >> with very large blob objects, so the cost of a round-trip becomes lower relative to sending >> the object content itself. > > I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a > pre-command or hook to identify needed blobs and pre-fetch them before > allowing the actual command to start"), so a Git command would typically > make a single request that contains all the blobs required, but my > proposal can also handle (1c) ('"fault" them in as necessary in > read_object() while the command is running and without any pre-fetch > (either synchronously or asynchronously and with/without a helper > process)').
> > Even if we decided to go with single-blob requests and responses, it is > still important to send them as packfiles, so that the server can serve > them directly from its compressed storage without first having to > uncompress them. > > [1] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ Ah, I missed this. If we think we can build meaningfully-sized requests via (1a) and (1b), then I agree - packs are optimal. However, if single-blob requests/responses dominate, I have a few concerns, mostly from experience playing with something similar: * Regardless of the size of the object or number returned, the client will need to `index-pack` the result and create a corresponding `.idx` file, which requires decompression to construct (right?) * Unless the client's repository is repacked aggressively, we'll pollute the `.git\objects\pack` directory with little indexes (somewhere around 4KiB minimum) and packfiles rapidly, which would degrade performance. One rudimentary workaround would be to loosen these packs on the client if they were under a certain size/object count. I think fetch does this already? In either case, shifting the pack decompression/loose object recompression problem to the client instead of the server is probably a good principle, but in our case it simply wasn't fast enough to serve single blobs interactively (e.g. opening a source file you don't have locally). I'm hopeful that the proposed partial clone solutions you referenced would reduce the frequency of this being required. >> Being a bit more clever about packing objects (e.g. splitting blobs out from commits >> and trees) improved this a bit, but we still hit bottlenecks from what appeared to >> be a large number of memory-mapping operations on a ~140GiB packfile of blobs. >> >> Each `pack-objects` process would consume approximately one CPU core for the >> duration of the request.
It's possible that further splitting of these large blob packs >> would have improved performance in some scenarios, but that would increase the >> amount of pack-index lookups necessary to find a single object. > > I'm not very experienced with mmap, but I thought that memory-mapping a > large file in itself does not incur much of a performance penalty (if > any) - it is the accesses that count. I experimented with 15,000 and > 150,000 MiB files and mmap and they seem to be handled quite well. Also, > how many objects are "pack-objects" packing here? Back when we took this approach, it was ~4000 blob objects at a time. Perhaps we were being bitten by the Windows implementation of git_mmap[3]? When I profiled the 4k blob scenario, the majority of CPU and wall time was spent in MapViewOfFileEx, which looks like it could mean accesses as well. [3] https://github.com/git/git/blob/master/compat/win32mmap.c >> This seems like a clever way to avoid the canonical `/info/refs?service=git-upload-pack` >> capability negotiation on every call. However, using error handling to fallback seems >> slightly wonky to me. Hopefully users are incentivized to upgrade their clients. > > By "error handling to fallback", do you mean in my proposal or in a > possible future one (assuming my proposal is implemented)? I don't think > my proposal requires any error handling to fallback (since only new > clients can clone partially - old clients will just clone totally and > obliviously), but I acknowledge that this proposal does not mean that > any future proposal can be done without requiring error handling to > fallback. Right, I was talking about the possible future one - more around the concept of back-compat in the event of any protocol changes. I don't want to spend too much time focusing on what we might want in the future, but a thought I just had: what about versioning as a part of the URL? For example, `/server-endpoint?version=1.0`.
This could also enable breaking changes for existing commands. >> This makes a lot of sense to me. When we built our caching proxy, we had to be careful >> when designing how we'd handle clients requesting objects missing from the proxy. >> >> For example, a client requests a single blob and the proxy doesn't have it - we can't simply >> download that object from the "authoritative" remote and stick it in the `.git\objects\xx\yyy...` >> directory, because the repository would be made corrupt. > > By proxy, do you mean a Git repository? Sorry, I don't really understand > this part. Yes, apologies for not being clear. The proxy we created maintains a nearly-up-to-date mirror of the remote as a bare repo and exposes a few custom endpoints I've alluded to. What I'm trying to say is that updating the bare repo on disk via methods other than a full `fetch` is dangerous without the partial-clone-marking capability. >> Not to derail us too far off blobs, but I wonder if we need a `fetch-commit-pack` endpoint, >> or could get away with introducing a new capability (e.g. `no-blobs`) to `upload-pack` instead. >> As a casual observer, this seems like it would be a much smaller change since the rest of the >> negotiation/reachability calculation would look the same, right? Or would this `fetch-commit-pack` >> not return trees either? >> >> I only ask because, in our observations, when git wants to read commits it's >> usually followed by a lot of "related" trees - again caveated with the fact that >> we're intercepting many things at the filesystem layer. > > The main reason for this extra command is not to exclude blobs (which, > as you said, can be done with a new capability - I suspect that we will > need a capability or parameter of some sort anyway to indicate which > size of blobs to filter out) but to eliminate the mandatory ref > advertisement that is done whenever the client fetches.
One of our use > cases (internal Android) has large blobs and many (more than 700k) refs, > so it would benefit greatly from blob filtering and elimination of the > mandatory ref advertisement (tens of megabytes per fetch). > > As for the size of the change, I have a work in progress that implements > this [2]. > > [2] https://public-inbox.org/git/cover.1491851452.git.jonathantanmy@google.com/ Ah, interesting. We solved the many-refs problem using a different approach - basically, limiting what the server returns based on who the user is and preferences they set via our web interface or an API. I've been promised by a few colleagues that we'll have more to share here soon... With your proposal, how does the client choose which refs they want to fetch? I took a look at your patch series. I'm nowhere near qualified to give feedback given my lack of experience with the core git implementation and C in general, but it looks reasonable. Hoping we can come up with a better name than "server-endpoint" though :) >> Just to keep the discussion interesting, I'll throw an alternative out there that's >> worked well for us. As I understand it, the HTTP-based dumb transfer protocol >> supports returning objects in loose object format, but only if they already exist >> in loose format. >> >> Extending this to have the remote provide these objects via a "dumb" protocol >> when they are packed as well - i.e. the server would "loosen" them upon request - >> is basically what we do and it works quite well for low-latency clients. To further improve >> performance at the cost of complexity, we've added caching at the memory and disk layer >> for these loose objects in the same format we send to the client. >> >> There's a clear tradeoff here - the servers must have adequate disk and/or memory to store >> these loose objects in optimal format. In addition, the higher the latency is to the remote, >> the worse this solution will perform.
Fortunately, in our case, network topology allows us to >> put these caching proxies close enough to clients for it not to matter. > >This does make sense in the situation you describe, but (as you said) I >don't think we can guarantee this in the majority of situations. I think >some sort of batching (like the (1a) solution I talked about near the >start of this e-mail) and serving packed data from packed storage should >form the baseline, and any situation-specific optimizations (e.g. >serving unpacked data from topologically-close servers) can be >additional steps. Agreed - our choice of solution was largely driven by our lack of ability to batch, performance issues with `pack-objects` at scale, and ability to deploy the previously-mentioned proxies very close to clients. If we can solve the first problem (batching), optimizing for the hopefully-less-common single blob at a time scenario becomes less important. If we assume that happens, I think your proposal looks sound overall.
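The loosen-and-cache proxy Kevin describes could be sketched roughly like this. All names here are hypothetical; in particular `read_packed_object` stands in for however the proxy reads a (type, content) pair out of its bare mirror:

```python
import os
import zlib

class LooseObjectCache:
    # Sketch of the loosen-on-request idea: objects are served in loose
    # format ("<type> <size>\0<content>", zlib-deflated) and the deflated
    # bytes are cached on disk after the first request.
    def __init__(self, cache_dir, read_packed_object):
        self.cache_dir = cache_dir
        self.read_packed_object = read_packed_object  # hypothetical accessor

    def get_loose(self, sha):
        path = os.path.join(self.cache_dir, sha[:2], sha[2:])
        if os.path.exists(path):
            with open(path, "rb") as f:  # cache hit: no repack work at all
                return f.read()
        obj_type, content = self.read_packed_object(sha)  # e.g. (b"blob", ...)
        loose = zlib.compress(b"%s %d\x00" % (obj_type, len(content)) + content)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(loose)
        return loose
```

The tradeoff in the thread shows up directly here: the cache directory duplicates packed content in loose form, trading disk for per-request CPU.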
* Re: Proposal for "fetch-any-blob Git protocol" and server design 2017-04-21 16:41 ` Kevin David @ 2017-04-26 22:51 ` Jonathan Tan 0 siblings, 0 replies; 10+ messages in thread From: Jonathan Tan @ 2017-04-26 22:51 UTC (permalink / raw) To: Kevin David, Ben Peart, Stefan Beller; +Cc: git, Mark Thomas, Jeff Hostetler On 04/21/2017 09:41 AM, Kevin David wrote: > Hi Jonathan, > > Sorry for the delayed response - other work got in the way, unfortunately! No worries! >> I am envisioning (1a) as described in Jeff Hostetler's e-mail [1] ("a >> pre-command or hook to identify needed blobs and pre-fetch them before >> allowing the actual command to start"), so a Git command would typically >> make a single request that contains all the blobs required, but my >> proposal can also handle (1c) ('"fault" them in as necessary in >> read_object() while the command is running and without any pre-fetch >> (either synchronously or asynchronously and with/without a helper >> process)'). >> >> Even if we decided to go with single-blob requests and responses, it is >> still important to send them as packfiles, so that the server can serve >> them directly from its compressed storage without first having to >> uncompress them. >> >> [1] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ > > Ah, I missed this. If we think we can build meaningfully-sized requests via (1a) and (1b), > then I agree - packs are optimal. > > However, if single-blob requests/responses dominate, I have a few concerns, mostly from > experience playing with something similar: > * Regardless of the size of the object or number returned, the client will need to > `index-pack` the result and create a corresponding `.idx` file, which requires > decompression to construct (right?) Yes, although I think that loose objects would need to be decompressed too to verify the SHA-1. 
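To illustrate that point about verification: a loose object's name is the SHA-1 of its inflated stream, so checking it forces a full decompression either way. A minimal sketch:

```python
import hashlib
import zlib

def verify_loose(expected_sha, loose_bytes):
    # A loose object's name is the SHA-1 of the *decompressed* stream
    # ("<type> <size>\0<content>"), so verification requires inflating it.
    inflated = zlib.decompress(loose_bytes)
    return hashlib.sha1(inflated).hexdigest() == expected_sha

# The well-known empty blob: sha1(b"blob 0\x00")
empty_blob = zlib.compress(b"blob 0\x00")
print(verify_loose("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", empty_blob))  # True
```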
> * Unless the client's repository is repacked aggressively, we'll pollute the > `.git\objects\pack` directory with little indexes (somewhere around 4KiB minimum) > and packfiles rapidly, which would degrade performance. One rudimentary workaround > would be to loosen these packs on the client if they were under a certain > size/object count. I think fetch does this already? > > In either case, shifting the pack decompression/loose object recompression problem > to the client instead of the server is probably a good principle, but in our case it > simply wasn't fast enough to serve single blobs interactively (e.g. opening a source > file you don't have locally). I'm hopeful that the proposed partial clone solutions you > referenced would reduce the frequency of this being required. I wrote up a proposal for how Git should handle missing blobs, including a hook that, when invoked, is allowed to write packed or loose objects [1]. I was thinking along similar lines too about the many-packfile issue, and that was part of the reason why [1] was written the way it is. I think that, by default, the wire protocol should transmit packs because it causes less load on the server (as you describe) and it supports batch requests. Also, this protocol with single-blob fetching might be usable as-is for projects that only exclude large blobs (so there are relatively few of them), and will be more and more usable as we migrate the more commonly used Git commands to provide batch requests. However, the ability to customize the hook allows users to use another download scheme more suited to their use cases (for example, in yours, where you can provision servers that are topologically close to your clients). [1] <20170426221346.25337-1-jonathantanmy@google.com> >>> Being a bit more clever about packing objects (e.g. 
splitting blobs out from commits >>> and trees) improved this a bit, but we still hit bottlenecks from what appeared to >>> be a large number of memory-mapping operations on a ~140GiB packfile of blobs. >>> >>> Each `pack-objects` process would consume approximately one CPU core for the >>> duration of the request. It's possible that further splitting of these large blob packs >>> would have improved performance in some scenarios, but that would increase the >>> amount of pack-index lookups necessary to find a single object. >> >> I'm not very experienced with mmap, but I thought that memory-mapping a >> large file in itself does not incur much of a performance penalty (if >> any) - it is the accesses that count. I experimented with 15,000 and >> 150,000 MiB files and mmap and they seem to be handled quite well. Also, >> how many objects are "pack-objects" packing here? > > Back when we took this approach, it was ~4000 blob objects at a time. > Perhaps we were being bitten by the Windows implementation of git_mmap[3]? > When I profiled the 4k blob scenario, the majority of CPU and wall time was > spent in MapViewOfFileEx, which looks like it could mean accesses as well. > > [3] https://github.com/git/git/blob/master/compat/win32mmap.c Hmm...I don't have much experience with this, but this may be something to keep in mind. >>> This seems like a clever way to avoid the canonical `/info/refs?service=git-upload-pack` >>> capability negotiation on every call. However, using error handling to fallback seems >>> slightly wonky to me. Hopefully users are incentivized to upgrade their clients. >> >> By "error handling to fallback", do you mean in my proposal or in a >> possible future one (assuming my proposal is implemented)?
I don't think >> my proposal requires any error handling to fallback (since only new >> clients can clone partially - old clients will just clone totally and >> obliviously), but I acknowledge that this proposal does not mean that >> any future proposal can be done without requiring error handling to >> fallback. > > Right, I was talking about the possible future one - more around the > concept of back-compat in the event of any protocol changes. I don't want > to spend too much time focusing on what we might want in the future, but a > thought I just had: what about versioning as a part of the URL? For example, > `/server-endpoint?version=1.0`. This could also enable breaking changes for > existing commands. That could work, although we could put a version number in the command name (e.g. "fetch-any-blob-2") too. > Ah, interesting. We solved the many-refs problem using a different approach - > basically, limiting what the server returns based on who the user is and preferences > they set via our web interface or an API. I've been promised by a few colleagues that > we'll have more to share here soon... With your proposal, how does the client choose > which refs they want to fetch? The "want" lines can now take names (including glob characters). A list of refs (names and hashes) that correspond to those "want"s will be returned by the server right before the packfile. (So the client doesn't "choose" in the sense that it has a list of options and it picks a subset of them - it chooses by sending one or more strings that the server will match.) > I took a look at your patch series. I'm nowhere near qualified to give feedback > given my lack of experience with the core git implementation and C in general, but it > looks reasonable. Hoping we can come up with a better name than "server-endpoint" though :) Thanks, and I'm open to suggestions for the name.
:-) >>> Just to keep the discussion interesting, I'll throw an alternative out there that's >>> worked well for us. As I understand it, the HTTP-based dumb transfer protocol >>> supports returning objects in loose object format, but only if they already exist >>> in loose format. >>> >>> Extending this to have the remote provide these objects via a "dumb" protocol >>> when they are packed as well - i.e. the server would "loosen" them upon request - >>> is basically what we do and it works quite well for low-latency clients. To further improve >>> performance at the cost of complexity, we've added caching at the memory and disk layer >>> for these loose objects in the same format we send to the client. >>> >>> There's a clear tradeoff here - the servers must have adequate disk and/or memory to store >>> these loose objects in optimal format. In addition, the higher the latency is to the remote, >>> the worse this solution will perform. Fortunately, in our case, network topology allows us to >>> put these caching proxies close enough to clients for it not to matter. >> >> This does make sense in the situation you describe, but (as you said) I >> don't think we can guarantee this in the majority of situations. I think >> some sort of batching (like the (1a) solution I talked about near the >> start of this e-mail) and serving packed data from packed storage should >> form the baseline, and any situation-specific optimizations (e.g. >> serving unpacked data from topologically-close servers) can be >> additional steps. > > Agreed - our choice of solution was largely driven by our lack of ability to batch, > performance issues with `pack-objects` at scale, and ability to deploy the > previously-mentioned proxies very close to clients. If we can solve the first > problem (batching), optimizing for the hopefully-less-common single blob at a time > scenario becomes less important. If we assume that happens, I think your proposal > looks sound overall. Thanks.
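As a footnote to the "want"-with-globs behavior Jonathan describes earlier in this message, the server-side matching could be as simple as the following sketch (illustrative only; a real implementation would operate on the ref store and honor hidden refs):

```python
from fnmatch import fnmatchcase

def match_wants(refs, want_patterns):
    # `refs` maps ref name -> object ID. Each "want" may be an exact name
    # or a glob; the matching names and hashes are what the server reports
    # back right before the packfile. (Note that fnmatch's "*" also crosses
    # "/", which may or may not match Git's eventual glob semantics.)
    return {name: oid for name, oid in refs.items()
            if any(fnmatchcase(name, pat) for pat in want_patterns)}
```

So a client sending "want refs/tags/*" would get every tag's name and hash back without ever seeing the full 700k-ref advertisement.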
end of thread, other threads:[~2017-04-26 22:51 UTC | newest] Thread overview: 10+ messages -- 2017-03-14 22:57 Proposal for "fetch-any-blob Git protocol" and server design Jonathan Tan 2017-03-15 17:59 ` Junio C Hamano 2017-03-16 17:31 ` Jonathan Tan 2017-03-16 21:17 ` Junio C Hamano 2017-03-16 22:48 ` Jonathan Tan 2017-03-28 23:19 ` Stefan Beller [not found] ` <00bf01d2aed7$b13492a0$139db7e0$@gmail.com> 2017-04-12 22:02 ` Kevin David 2017-04-13 20:12 ` Jonathan Tan 2017-04-21 16:41 ` Kevin David 2017-04-26 22:51 ` Jonathan Tan