git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/25] [RFC] Bundle URIs
@ 2022-02-23 18:30 Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 01/25] docs: document bundle URI standard Derrick Stolee via GitGitGadget
                   ` (25 more replies)
  0 siblings, 26 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee

There have been several suggestions to improve Git clone speeds and
reliability by supplementing the Git protocol with static content. The
Packfile URI [0] feature lets the Git response include URIs that point to
packfiles that the client must download to complete the request.

Last year, Ævar suggested using bundles instead of packfiles [1] [2]. This
design has the same benefits to the packfile URI feature because it offloads
most object downloads to static content fetches. The main advantage over
packfile URIs is that the remote Git server does not need to know what is in
those bundles. The Git client tells the server what it downloaded during the
fetch negotiation afterwards. This includes any chance that the client did
not have access to those bundles or otherwise failed to access them. I
agreed that this was a much more desirable way to serve static content, but
had concerns about the flexibility of that design [3]. I have not heard more
on the topic since October, so I started investigating this idea myself in
December, resulting in this RFC.

I focused on maximizing flexibility for the service that organizes and
serves bundles. This includes:

 * Bundle URIs work for full and partial clones.

 * Bundle URIs can assist with git fetch in addition to git clone.

 * Users can set up bundle servers independent of the remote Git server if
   they specify the bundle URI via a --bundle-uri argument.

This series is based on the recently-submitted series that adds object
filters to bundles [4]. There is a slight adjacent-line-add conflict with
js/apply-partial-clone-filters-recursively, but that is in the last few
patches, so it will be easy to rebase by the time we have a fully-reviewable
patch series for those steps.

The general breakdown is as follows:

 * Patch 1 adds documentation for the feature in its entirety.

 * Patches 2-14 add the ability to run ‘git clone --bundle-uri=’

 * Patches 15-17 add bundle fetches to ‘git fetch’ calls

 * Patches 18-25 add a new ‘features’ capability that allows a server to
   advertise bundle URIs (and in the future, other features).

I consider the patches in their current form to be “RFC quality”. There are
multiple places where tests are missing or special cases are not checked.
The goal for this RFC is to seek feedback on the high-level ideas before
committing to the deep work of creating mergeable patches.


Testing with this series
========================

To get a full test of the features being proposed here, I created an MVP of
a bundle server by pushing bundle data to some publicly-readable Azure
Storage accounts. These bundle servers mirror the following public
repositories on GitHub:

 * git/git
 * git-for-windows/git
 * homebrew/homebrew-core
 * cocoapods/specs
 * torvalds/linux

In addition, the Azure Storage accounts are available in different regions:

 * East US: https://gitbundleserver.z13.web.core.windows.net
 * West US: https://gitbundleserverwestus.z22.web.core.windows.net
 * Europe: https://gitbundleservereurope.z6.web.core.windows.net
 * East Asia: https://gitbundleservereastasia.z7.web.core.windows.net
 * South Asia: https://gitbundleserversouthasia.z23.web.core.windows.net
 * Australia: https://gitbundleserveraustralia.z26.web.core.windows.net

To test this RFC against these servers, choose your $org/$repo to clone and
your region's bundle server $url and run

$ git clone --bundle-uri=$url/$org/$repo/ https://github.com/$org/$repo


Note that these servers are set up using "almost free" storage in Azure.
Network connectivity of this storage can be slower than that of GitHub data
centers, so your results may vary.

From my location in Raleigh, NC, USA, I am able to clone torvalds/linux from
the bundle servers quite a bit faster than from GitHub:

@derrickstolee With Bundles Without Bundles Full Clone 491.9s 964.9s Partial
Clone 132.3s 171.7s

I recruited GitHub employees from across the globe to test this experimental
deployment of bundle servers and found mixed results. Some users had better
performance with bundle servers, and many had very close results. Some had
much worse connections to bundle servers than the GitHub remote, because of
the use of the cheapest form of Azure Storage services compared to highly
optimized GitHub infrastructure.

I plan to keep these servers running for a while, so this test should be
possible throughout the review of this RFC and the patch series as they are
reviewed. I'd love to hear if anyone else does this experiment and has
anything to say about the results.


Implementation Plan
===================

The first patch contains a design document that is "aspirational": It
describes the feature as it should be when the implementation is complete.
Most of that implementation is included in this RFC, though there are some
minor tweaks to the functionality I want to change or add before the patches
are under full review. In particular, things are poorly tested and
undocumented the further you look in the RFC.

Here is a potential plan for splitting this RFC into digestible pieces that
can be reviewed in sequence:

 0. Update the git bundle create command to take a --filter option, allowing
    bundles to store packfiles restricted to an object filter. This is
    necessary for using bundle URIs to benefit partial clones. This step was
    already submitted for full review [4]. These patches are based on those.

 1. Integrate bundle URIs into git clone with a --bundle-uri option. This
    will include the full understanding of a table of contents, but will not
    integrate with git fetch or allow the server to advertise URIs.

 2. Integrate bundle URIs into git fetch, triggered by config values that
    are set during git clone if the server indicates that the bundle
    strategy works for fetches.

 3. Create a new "recommended features" capability in protocol v2 where the
    server can recommend features such as bundle URIs, partial clone, and
    sparse-checkout. These features will be extremely limited in scope and
    blocked by opt-in config options. The design for this portion could be
    replaced by a "bundle-uri" capability that only advertises bundle URIs
    and no other information.


Intended Focus of this RFC
==========================

This RFC is very large, and that's even with many patches not including full
documentation or tests. These commits are not intended to be reviewed as if
I intended to merge this as-is.

One thing this feature establishes is a new standard by which the Git client
will communicate with external servers. The goal of this RFC is to determine
if this standard is well designed, and whether we need to make it more
robust. Alternatively, the design might need to be changed for reasons I
cannot predict.

For that reason, hopefully most of the feedback is directly on the first
patch, which contains the design document. In particular, the design
document repeats the implementation plan, and I'd like extra eyes on that,
too.

[0]
https://github.com/git/git/blob/master/Documentation/technical/packfile-uri.txt
The packfile URI feature in Git (Created June 2020)

[1]
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
An earlier RFC for a bundle URI feature. (August 2021)

[2]
https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
An earlier patch series creating a 'bundle-uri' protocol v2 capability.
(October 2021)

[3]
https://lore.kernel.org/git/e7fe220b-2877-107e-8f7e-ea507a65feff@gmail.com/
My earlier thoughts on the previous RFCs, many of which are integrated into
this RFC (August 2021)

[4]
https://lore.kernel.org/git/pull.1159.git.1645638911.gitgitgadget@gmail.com/
Add object filters to bundles (February 2022)

Thanks, -Stolee

Derrick Stolee (24):
  docs: document bundle URI standard
  bundle: alphabetize subcommands better
  dir: extract starts_with_dot[_dot]_slash()
  remote: move relative_url()
  remote: allow relative_url() to return an absolute url
  http: make http_get_file() external
  remote-curl: add 'get' capability
  bundle: implement 'fetch' command for direct bundles
  bundle: parse table of contents during 'fetch'
  bundle: add --filter option to 'fetch'
  bundle: allow relative URLs in table of contents
  bundle: make it easy to call 'git bundle fetch'
  clone: add --bundle-uri option
  clone: --bundle-uri cannot be combined with --depth
  config: add git_config_get_timestamp()
  bundle: only fetch bundles if timestamp is new
  fetch: fetch bundles before fetching original data
  protocol-caps: implement cap_features()
  serve: understand but do not advertise 'features' capability
  serve: advertise 'features' when config exists
  connect: implement get_recommended_features()
  transport: add connections for 'features' capability
  clone: use server-recommended bundle URI
  t5601: basic bundle URI test

Ævar Arnfjörð Bjarmason (1):
  connect.c: refactor sending of agent & object-format

 Documentation/gitremote-helpers.txt    |   6 +
 Documentation/technical/bundle-uri.txt | 404 +++++++++++++++++++++
 builtin/bundle.c                       | 478 ++++++++++++++++++++++++-
 builtin/clone.c                        |  51 +++
 builtin/fetch.c                        |  17 +
 builtin/submodule--helper.c            | 129 -------
 bundle.c                               |  21 ++
 bundle.h                               |   9 +
 config.c                               |  39 ++
 config.h                               |  14 +
 connect.c                              |  69 +++-
 dir.h                                  |  11 +
 fsck.c                                 |  14 +-
 http.c                                 |   4 +-
 http.h                                 |   9 +
 protocol-caps.c                        |  66 ++++
 protocol-caps.h                        |   1 +
 remote-curl.c                          |  32 ++
 remote.c                               | 104 ++++++
 remote.h                               |  35 ++
 serve.c                                |  23 ++
 t/t5601-clone.sh                       |  12 +
 t/t5701-git-serve.sh                   |   9 +
 transport-helper.c                     |  14 +
 transport-internal.h                   |   9 +
 transport.c                            |  38 ++
 transport.h                            |   5 +
 27 files changed, 1467 insertions(+), 156 deletions(-)
 create mode 100644 Documentation/technical/bundle-uri.txt


base-commit: ec51d0a50e6e64ae37795d77f7d33204b9b71ecd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1160%2Fderrickstolee%2Fbundle%2Frfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1160/derrickstolee/bundle/rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1160
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 01/25] docs: document bundle URI standard
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-03-02  2:28   ` Elijah Newren
  2022-02-23 18:30 ` [PATCH 02/25] bundle: alphabetize subcommands better Derrick Stolee via GitGitGadget
                   ` (24 subsequent siblings)
  25 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 404 +++++++++++++++++++++++++
 1 file changed, 404 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..5c0b9e8e3ef
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,404 @@
+Bundle URIs
+===========
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access.
+
+Assuming a `200 OK` response from the server, the content at the URL is
+expected to be of one of two forms:
+
+1. Bundle: A Git bundle file of version 2 or higher.
+
+2. Table of Contents: A plain-text file that is parsable using Git's
+   config file parser. This file describes one or more bundles that are
+   accessible from other URIs.
+
+Any other data provided by the server is considered erroneous.
+
+Table of Contents Format
+------------------------
+
+If the content at a bundle URI is not a bundle, then it is expected to be
+a plaintext file that is parseable using Git's config parser. This file
+can contain any list of key/value pairs, but only a fixed set will be
+considered by Git.
+
+bundle.tableOfContents.version::
+	This value provides a version number for the table of contents. If
+	a future Git change enables a feature that needs the Git client to
+	react to a new key in the table of contents file, then this version
+	will increment. The only current version number is 1, and if any
+	other value is specified then Git will fail to use this file.
+
+bundle.tableOfContents.forFetch::
+	This boolean value is a signal to the Git client that the bundle
+	server has designed its bundle organization to assist `git fetch`
+	commands in addition to `git clone` commands. If this is missing,
+	Git should not use this table of contents for `git fetch` as it
+	may lead to excess data downloads.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
+
+bundle.<id>.uri::
+	This string value is the URI for downloading bundle `<id>`. If
+	the URI begins with a protocol (`http://` or `https://`) then the
+	URI is absolute. Otherwise, the URI is interpreted as relative to
+	the URI used for the table of contents. If the URI begins with `/`,
+	then that relative path is relative to the domain name used for
+	the table of contents. (This use of relative paths is intended to
+	make it easier to distribute a set of bundles across a large
+	number of servers or CDNs with different domain names.)
+
+bundle.<id>.timestamp::
+	(Optional) This value is the number of seconds since Unix epoch
+	(UTC) that this bundle was created. This is used as an approximation
+	of a point in time that the bundle matches the data available at
+	the origin server.
+
+bundle.<id>.requires::
+	(Optional) This string value represents the ID of another bundle.
+	When present, the server is indicating that this bundle contains a
+	thin packfile. If the client does not have all necessary objects
+	to unbundle this packfile, then the client can download the bundle
+	with the `requires` ID and try again. (Note: it may be beneficial
+	to allow the server to specify multiple `requires` bundles.)
+
+bundle.<id>.filter::
+	(Optional) This string value represents an object filter that
+	should also appear in the header of this bundle. The server uses
+	this value to differentiate different kinds of bundles from which
+	the client can choose those that match their object filters.
+
+Here is an example table of contents:
+
+```
+[bundle "tableofcontents"]
+	version = 1
+	forFetch = true
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+	requires = 2022-02-02-1643842562
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	requires = 2022-02-02-1643842568-blobless
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```
+
+This example uses all of the keys in the specification. Suppose that the
+table of contents was found at the URI
+`https://gitbundleserver.z13.web.core.windows.net/git/git/` and so the
+two blobless bundles have the following fully-expanded URIs:
+
+* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then they
+can specify that URI manually through a command-line option. However, a
+Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+Note: The exact details of this section are not final. This is a possible
+way that Git could auto-discover bundle URIs, but is not a committed
+direction until that feature is implemented.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. One way to implement this is to create a new
+protocol v2 capability that advertises recommended features, including
+bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with lowest latency by some exploratory checks. It is up to the
+server operator to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle URI that was advertised by
+   the remote server.
+
+2. The client downloads the file at the bundle URI. If it is a bundle, then
+   it is unbundled with the refs being stored in `refs/bundle/*`.
+
+3. If the file is instead a table of contents, then the bundles with
+   matching `filter` settings are sorted by `timestamp` (if present),
+   and the most-recent bundle is downloaded.
+
+4. If the current bundle header mentions negative commid OIDs that are not
+   in the object database, then download the `requires` bundle and try
+   again.
+
+5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
+   already in the object database somehow), then unbundle all of the
+   bundles in reverse order, placing references within `refs/bundle/*`.
+
+6. The client performs a fetch negotiation with the origin server, using
+   the `refs/bundle/*` references as `have`s and the server's ref
+   advertisement as `want`s. This results in a pack-file containing the
+   remaining objects requested by the clone but not in the bundles.
+
+Note that during a clone we expect that all bundles will be required. The
+client could be extended to download all bundles in parallel, though they
+need to be unbundled in the correct order.
+
+If a table of contents is used and it contains
+`bundle.tableOfContents.forFetch = true`, then the client can store a
+config value indicating to reuse this URI for later `git fetch` commands.
+In this case, the client will also want to store the maximum timestamp of
+a downloaded bundle.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via
+a command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+table of contents (although we do _not_ want to use parallel downloads
+here). We expect that the process will end because all negative commit
+OIDs in a thin bundle are already in the object database.
+
+A further optimization is that the client can avoid downloading any
+bundles if their timestamps are not larger than the stored timestamp.
+After fetching new bundles, this local timestamp value is updated.
+
+Choices for Bundle Server Organization
+--------------------------------------
+
+With this standard, there are many options available to the bundle server
+in how it organizes Git data into bundles.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* If the server intends to only serve full clones, then the advertised URI
+  could be a bundle file without a filter that is updated at some cadence.
+
+* If the server intends to serve clones, but wants clients to choose full
+  or blobless partial clones, then the server can use a table of contents
+  that lists two non-thin bundles and the client chooses between them only
+  by the `bundle.<id>.filter` values.
+
+* If the server intends to improve clones with parallel downloads, then it
+  can use a table of contents and split the repository into time intervals
+  of approximately similar-sized bundles. Using `bundle.<id>.timestamp`
+  and `bundle.<id>.requires` values helps the client decide the order to
+  unbundle the bundles.
+
+* If the server intends to serve fetches, then it can use a table of
+  contents to advertise a list of bundles that are updated regularly. The
+  most recent bundles could be generated on short intervals, such as hourly.
+  These small bundles could be merged together at some rate, such as 24
+  hourly bundles merging into a single daily bundle. At some point, it may
+  be beneficial to create a bundle that stores the majority of the history,
+  such as all data older than 30 days.
+
+These recommendations are intended only as suggestions. Each repository is
+different and every Git server has different needs. Hopefully the bundle
+URI feature and its table of contents is flexible enough to satisfy all
+needs. If not, then the format can be extended.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the table of contents found at
+that location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+  `401 Not Authorized`, or `500 Internal Server Error`).
+
+* The client receives data that is not parsable as a bundle or table of
+  contents.
+
+* The table of contents describes a directed cycle in the
+  `bundle.<id>.requires` links.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the negative commit OIDs
+  are not in the object database and there are no more
+  `bundle.<id>.requires` links to follow.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through the `bundle.tableOfContents.forFetch = true`
+  value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features for full review:
+
+1. Update the `git bundle create` command to take a `--filter` option,
+   allowing bundles to store packfiles restricted to an object filter.
+   This is necessary for using bundle URIs to benefit partial clones.
+
+2. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include the full understanding of a table of contents, but
+   will not integrate with `git fetch` or allow the server to advertise
+   URIs.
+
+3. Integrate bundle URIs into `git fetch`, triggered by config values that
+   are set during `git clone` if the server indicates that the bundle
+   strategy works for fetches.
+
+4. Create a new "recommended features" capability in protocol v2 where the
+   server can recommend features such as bundle URIs, partial clone, and
+   sparse-checkout. These features will be extremely limited in scope and
+   blocked by opt-in config options. The design for this portion could be
+   replaced by a "bundle-uri" capability that only advertises bundle URIs
+   and no other information.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
\ No newline at end of file
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 02/25] bundle: alphabetize subcommands better
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 01/25] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-03-11 13:47   ` Ævar Arnfjörð Bjarmason
  2022-02-23 18:30 ` [PATCH 03/25] dir: extract starts_with_dot[_dot]_slash() Derrick Stolee via GitGitGadget
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The usage strings for the 'git bundle' subcommands are not alphabetical.
This also applies to their inspection within cmd_bundle(). Fix this
ordering before we insert a new subcommand.

This change does not reorder the cmd_bundle_*() methods to avoid moving
lines that are more likely wanted in a future 'git blame' call. It is
fine that those longer methods are not ordered alphabetically.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 5a85d7cd0fe..8187b7df739 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -13,9 +13,9 @@
 
 static const char * const builtin_bundle_usage[] = {
   N_("git bundle create [<options>] <file> <git-rev-list args>"),
-  N_("git bundle verify [<options>] <file>"),
   N_("git bundle list-heads <file> [<refname>...]"),
   N_("git bundle unbundle <file> [<refname>...]"),
+  N_("git bundle verify [<options>] <file>"),
   NULL
 };
 
@@ -24,11 +24,6 @@ static const char * const builtin_bundle_create_usage[] = {
   NULL
 };
 
-static const char * const builtin_bundle_verify_usage[] = {
-  N_("git bundle verify [<options>] <file>"),
-  NULL
-};
-
 static const char * const builtin_bundle_list_heads_usage[] = {
   N_("git bundle list-heads <file> [<refname>...]"),
   NULL
@@ -39,6 +34,11 @@ static const char * const builtin_bundle_unbundle_usage[] = {
   NULL
 };
 
+static const char * const builtin_bundle_verify_usage[] = {
+  N_("git bundle verify [<options>] <file>"),
+  NULL
+};
+
 static int parse_options_cmd_bundle(int argc,
 		const char **argv,
 		const char* prefix,
@@ -209,12 +209,12 @@ int cmd_bundle(int argc, const char **argv, const char *prefix)
 
 	else if (!strcmp(argv[0], "create"))
 		result = cmd_bundle_create(argc, argv, prefix);
-	else if (!strcmp(argv[0], "verify"))
-		result = cmd_bundle_verify(argc, argv, prefix);
 	else if (!strcmp(argv[0], "list-heads"))
 		result = cmd_bundle_list_heads(argc, argv, prefix);
 	else if (!strcmp(argv[0], "unbundle"))
 		result = cmd_bundle_unbundle(argc, argv, prefix);
+	else if (!strcmp(argv[0], "verify"))
+		result = cmd_bundle_verify(argc, argv, prefix);
 	else {
 		error(_("Unknown subcommand: %s"), argv[0]);
 		usage_with_options(builtin_bundle_usage, options);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 03/25] dir: extract starts_with_dot[_dot]_slash()
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 01/25] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 02/25] bundle: alphabetize subcommands better Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 04/25] remote: move relative_url() Derrick Stolee via GitGitGadget
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

We will want to use this logic to assist checking if paths are absolute
or relative, so extract it into a helpful place. This creates a
collision with similar methods in builtin/fsck.c, but those methods have
important differences. Prepend "fsck_" to those methods to emphasize
that they are custom to the fsck builtin.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/submodule--helper.c | 10 ----------
 dir.h                       | 11 +++++++++++
 fsck.c                      | 14 +++++++-------
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index c5d3fc3817f..c17dde4170f 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -70,16 +70,6 @@ static int print_default_remote(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-static int starts_with_dot_slash(const char *str)
-{
-	return str[0] == '.' && is_dir_sep(str[1]);
-}
-
-static int starts_with_dot_dot_slash(const char *str)
-{
-	return str[0] == '.' && str[1] == '.' && is_dir_sep(str[2]);
-}
-
 /*
  * Returns 1 if it was the last chop before ':'.
  */
diff --git a/dir.h b/dir.h
index 8e02dfb505d..5e38d1ba536 100644
--- a/dir.h
+++ b/dir.h
@@ -578,4 +578,15 @@ void connect_work_tree_and_git_dir(const char *work_tree,
 void relocate_gitdir(const char *path,
 		     const char *old_git_dir,
 		     const char *new_git_dir);
+
+static inline int starts_with_dot_slash(const char *str)
+{
+	return str[0] == '.' && is_dir_sep(str[1]);
+}
+
+static inline int starts_with_dot_dot_slash(const char *str)
+{
+	return str[0] == '.' && str[1] == '.' && is_dir_sep(str[2]);
+}
+
 #endif
diff --git a/fsck.c b/fsck.c
index 3ec500d707a..32cd3bc081f 100644
--- a/fsck.c
+++ b/fsck.c
@@ -976,31 +976,31 @@ done:
 }
 
 /*
- * Like builtin/submodule--helper.c's starts_with_dot_slash, but without
+ * Like dir.h's starts_with_dot_slash, but without
  * relying on the platform-dependent is_dir_sep helper.
  *
  * This is for use in checking whether a submodule URL is interpreted as
  * relative to the current directory on any platform, since \ is a
  * directory separator on Windows but not on other platforms.
  */
-static int starts_with_dot_slash(const char *str)
+static int fsck_starts_with_dot_slash(const char *str)
 {
 	return str[0] == '.' && (str[1] == '/' || str[1] == '\\');
 }
 
 /*
- * Like starts_with_dot_slash, this is a variant of submodule--helper's
- * helper of the same name with the twist that it accepts backslash as a
+ * Like fsck_starts_with_dot_slash, this is a variant of dir.h's
+ * helper with the twist that it accepts backslash as a
  * directory separator even on non-Windows platforms.
  */
-static int starts_with_dot_dot_slash(const char *str)
+static int fsck_starts_with_dot_dot_slash(const char *str)
 {
-	return str[0] == '.' && starts_with_dot_slash(str + 1);
+	return str[0] == '.' && fsck_starts_with_dot_slash(str + 1);
 }
 
 static int submodule_url_is_relative(const char *url)
 {
-	return starts_with_dot_slash(url) || starts_with_dot_dot_slash(url);
+	return fsck_starts_with_dot_slash(url) || fsck_starts_with_dot_dot_slash(url);
 }
 
 /*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 04/25] remote: move relative_url()
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 03/25] dir: extract starts_with_dot[_dot]_slash() Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 05/25] remote: allow relative_url() to return an absolute url Derrick Stolee via GitGitGadget
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

This method was initially written in 63e95beb0 (submodule: port
resolve_relative_url from shell to C, 2016-05-15). As we will need
similar functionality in the bundle URI feature, extract this to be
available in remote.h.

The code is exactly the same. The prototype is different only in
whitespace. The documentation comment only adds explicit instructions on
what happens when supplying two absolute URLs.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/submodule--helper.c | 119 ------------------------------------
 remote.c                    |  96 +++++++++++++++++++++++++++++
 remote.h                    |  30 +++++++++
 3 files changed, 126 insertions(+), 119 deletions(-)

diff --git a/builtin/submodule--helper.c b/builtin/submodule--helper.c
index c17dde4170f..506ebd8e6bc 100644
--- a/builtin/submodule--helper.c
+++ b/builtin/submodule--helper.c
@@ -70,125 +70,6 @@ static int print_default_remote(int argc, const char **argv, const char *prefix)
 	return 0;
 }
 
-/*
- * Returns 1 if it was the last chop before ':'.
- */
-static int chop_last_dir(char **remoteurl, int is_relative)
-{
-	char *rfind = find_last_dir_sep(*remoteurl);
-	if (rfind) {
-		*rfind = '\0';
-		return 0;
-	}
-
-	rfind = strrchr(*remoteurl, ':');
-	if (rfind) {
-		*rfind = '\0';
-		return 1;
-	}
-
-	if (is_relative || !strcmp(".", *remoteurl))
-		die(_("cannot strip one component off url '%s'"),
-			*remoteurl);
-
-	free(*remoteurl);
-	*remoteurl = xstrdup(".");
-	return 0;
-}
-
-/*
- * The `url` argument is the URL that navigates to the submodule origin
- * repo. When relative, this URL is relative to the superproject origin
- * URL repo. The `up_path` argument, if specified, is the relative
- * path that navigates from the submodule working tree to the superproject
- * working tree. Returns the origin URL of the submodule.
- *
- * Return either an absolute URL or filesystem path (if the superproject
- * origin URL is an absolute URL or filesystem path, respectively) or a
- * relative file system path (if the superproject origin URL is a relative
- * file system path).
- *
- * When the output is a relative file system path, the path is either
- * relative to the submodule working tree, if up_path is specified, or to
- * the superproject working tree otherwise.
- *
- * NEEDSWORK: This works incorrectly on the domain and protocol part.
- * remote_url      url              outcome          expectation
- * http://a.com/b  ../c             http://a.com/c   as is
- * http://a.com/b/ ../c             http://a.com/c   same as previous line, but
- *                                                   ignore trailing slash in url
- * http://a.com/b  ../../c          http://c         error out
- * http://a.com/b  ../../../c       http:/c          error out
- * http://a.com/b  ../../../../c    http:c           error out
- * http://a.com/b  ../../../../../c    .:c           error out
- * NEEDSWORK: Given how chop_last_dir() works, this function is broken
- * when a local part has a colon in its path component, too.
- */
-static char *relative_url(const char *remote_url,
-				const char *url,
-				const char *up_path)
-{
-	int is_relative = 0;
-	int colonsep = 0;
-	char *out;
-	char *remoteurl = xstrdup(remote_url);
-	struct strbuf sb = STRBUF_INIT;
-	size_t len = strlen(remoteurl);
-
-	if (is_dir_sep(remoteurl[len-1]))
-		remoteurl[len-1] = '\0';
-
-	if (!url_is_local_not_ssh(remoteurl) || is_absolute_path(remoteurl))
-		is_relative = 0;
-	else {
-		is_relative = 1;
-		/*
-		 * Prepend a './' to ensure all relative
-		 * remoteurls start with './' or '../'
-		 */
-		if (!starts_with_dot_slash(remoteurl) &&
-		    !starts_with_dot_dot_slash(remoteurl)) {
-			strbuf_reset(&sb);
-			strbuf_addf(&sb, "./%s", remoteurl);
-			free(remoteurl);
-			remoteurl = strbuf_detach(&sb, NULL);
-		}
-	}
-	/*
-	 * When the url starts with '../', remove that and the
-	 * last directory in remoteurl.
-	 */
-	while (url) {
-		if (starts_with_dot_dot_slash(url)) {
-			url += 3;
-			colonsep |= chop_last_dir(&remoteurl, is_relative);
-		} else if (starts_with_dot_slash(url))
-			url += 2;
-		else
-			break;
-	}
-	strbuf_reset(&sb);
-	strbuf_addf(&sb, "%s%s%s", remoteurl, colonsep ? ":" : "/", url);
-	if (ends_with(url, "/"))
-		strbuf_setlen(&sb, sb.len - 1);
-	free(remoteurl);
-
-	if (starts_with_dot_slash(sb.buf))
-		out = xstrdup(sb.buf + 2);
-	else
-		out = xstrdup(sb.buf);
-
-	if (!up_path || !is_relative) {
-		strbuf_release(&sb);
-		return out;
-	}
-
-	strbuf_reset(&sb);
-	strbuf_addf(&sb, "%s%s", up_path, out);
-	free(out);
-	return strbuf_detach(&sb, NULL);
-}
-
 static char *resolve_relative_url(const char *rel_url, const char *up_path, int quiet)
 {
 	char *remoteurl, *resolved_url;
diff --git a/remote.c b/remote.c
index c97c626eaa8..c4a56749e85 100644
--- a/remote.c
+++ b/remote.c
@@ -14,6 +14,7 @@
 #include "strvec.h"
 #include "commit-reach.h"
 #include "advice.h"
+#include "connect.h"
 
 enum map_direction { FROM_SRC, FROM_DST };
 
@@ -2727,3 +2728,98 @@ void remote_state_clear(struct remote_state *remote_state)
 	hashmap_clear_and_free(&remote_state->remotes_hash, struct remote, ent);
 	hashmap_clear_and_free(&remote_state->branches_hash, struct remote, ent);
 }
+
+/*
+ * Returns 1 if it was the last chop before ':'.
+ */
+static int chop_last_dir(char **remoteurl, int is_relative)
+{
+	char *rfind = find_last_dir_sep(*remoteurl);
+	if (rfind) {
+		*rfind = '\0';
+		return 0;
+	}
+
+	rfind = strrchr(*remoteurl, ':');
+	if (rfind) {
+		*rfind = '\0';
+		return 1;
+	}
+
+	if (is_relative || !strcmp(".", *remoteurl))
+		die(_("cannot strip one component off url '%s'"),
+			*remoteurl);
+
+	free(*remoteurl);
+	*remoteurl = xstrdup(".");
+	return 0;
+}
+
+/*
+ * NEEDSWORK: Given how chop_last_dir() works, this function is broken
+ * when a local part has a colon in its path component, too.
+ */
+char *relative_url(const char *remote_url,
+		   const char *url,
+		   const char *up_path)
+{
+	int is_relative = 0;
+	int colonsep = 0;
+	char *out;
+	char *remoteurl = xstrdup(remote_url);
+	struct strbuf sb = STRBUF_INIT;
+	size_t len = strlen(remoteurl);
+
+	if (is_dir_sep(remoteurl[len-1]))
+		remoteurl[len-1] = '\0';
+
+	if (!url_is_local_not_ssh(remoteurl) || is_absolute_path(remoteurl))
+		is_relative = 0;
+	else {
+		is_relative = 1;
+		/*
+		 * Prepend a './' to ensure all relative
+		 * remoteurls start with './' or '../'
+		 */
+		if (!starts_with_dot_slash(remoteurl) &&
+		    !starts_with_dot_dot_slash(remoteurl)) {
+			strbuf_reset(&sb);
+			strbuf_addf(&sb, "./%s", remoteurl);
+			free(remoteurl);
+			remoteurl = strbuf_detach(&sb, NULL);
+		}
+	}
+	/*
+	 * When the url starts with '../', remove that and the
+	 * last directory in remoteurl.
+	 */
+	while (url) {
+		if (starts_with_dot_dot_slash(url)) {
+			url += 3;
+			colonsep |= chop_last_dir(&remoteurl, is_relative);
+		} else if (starts_with_dot_slash(url))
+			url += 2;
+		else
+			break;
+	}
+	strbuf_reset(&sb);
+	strbuf_addf(&sb, "%s%s%s", remoteurl, colonsep ? ":" : "/", url);
+	if (ends_with(url, "/"))
+		strbuf_setlen(&sb, sb.len - 1);
+	free(remoteurl);
+
+	if (starts_with_dot_slash(sb.buf))
+		out = xstrdup(sb.buf + 2);
+	else
+		out = xstrdup(sb.buf);
+
+	if (!up_path || !is_relative) {
+		strbuf_release(&sb);
+		return out;
+	}
+
+	strbuf_reset(&sb);
+	strbuf_addf(&sb, "%s%s", up_path, out);
+	free(out);
+	return strbuf_detach(&sb, NULL);
+}
diff --git a/remote.h b/remote.h
index 4a1209ae2c8..91c7f187863 100644
--- a/remote.h
+++ b/remote.h
@@ -409,4 +409,34 @@ int parseopt_push_cas_option(const struct option *, const char *arg, int unset);
 int is_empty_cas(const struct push_cas_option *);
 void apply_push_cas(struct push_cas_option *, struct remote *, struct ref *);
 
+/*
+ * The `url` argument is the URL that navigates to the submodule origin
+ * repo. When relative, this URL is relative to the superproject origin
+ * URL repo. The `up_path` argument, if specified, is the relative
+ * path that navigates from the submodule working tree to the superproject
+ * working tree. Returns the origin URL of the submodule.
+ *
+ * Return either an absolute URL or filesystem path (if the superproject
+ * origin URL is an absolute URL or filesystem path, respectively) or a
+ * relative file system path (if the superproject origin URL is a relative
+ * file system path).
+ *
+ * When the output is a relative file system path, the path is either
+ * relative to the submodule working tree, if up_path is specified, or to
+ * the superproject working tree otherwise.
+ *
+ * NEEDSWORK: This works incorrectly on the domain and protocol part.
+ * remote_url      url              outcome          expectation
+ * http://a.com/b  ../c             http://a.com/c   as is
+ * http://a.com/b/ ../c             http://a.com/c   same as previous line, but
+ *                                                   ignore trailing slash in url
+ * http://a.com/b  ../../c          http://c         error out
+ * http://a.com/b  ../../../c       http:/c          error out
+ * http://a.com/b  ../../../../c    http:c           error out
+ * http://a.com/b  ../../../../../c    .:c           error out
+ */
+char *relative_url(const char *remote_url,
+		   const char *url,
+		   const char *up_path);
+
 #endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 05/25] remote: allow relative_url() to return an absolute url
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 04/25] remote: move relative_url() Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 06/25] http: make http_get_file() external Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When the 'url' parameter was absolute, the previous implementation would
concatenate 'remote_url' with 'url'. Instead, we want to return 'url' in
this case.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 remote.c | 12 ++++++++++--
 remote.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/remote.c b/remote.c
index c4a56749e85..ac1d98ae922 100644
--- a/remote.c
+++ b/remote.c
@@ -2766,10 +2766,18 @@ char *relative_url(const char *remote_url,
 	int is_relative = 0;
 	int colonsep = 0;
 	char *out;
-	char *remoteurl = xstrdup(remote_url);
+	char *remoteurl;
 	struct strbuf sb = STRBUF_INIT;
-	size_t len = strlen(remoteurl);
+	size_t len;
+
+	if (!url_is_local_not_ssh(url) || is_absolute_path(url))
+		return xstrdup(url);
+
+	len = strlen(remote_url);
+	if (!len)
+		BUG("invalid empty remote_url");
 
+	remoteurl = xstrdup(remote_url);
 	if (is_dir_sep(remoteurl[len-1]))
 		remoteurl[len-1] = '\0';
 
diff --git a/remote.h b/remote.h
index 91c7f187863..438152ef562 100644
--- a/remote.h
+++ b/remote.h
@@ -434,6 +434,7 @@ void apply_push_cas(struct push_cas_option *, struct remote *, struct ref *);
  * http://a.com/b  ../../../c       http:/c          error out
  * http://a.com/b  ../../../../c    http:c           error out
  * http://a.com/b  ../../../../../c    .:c           error out
+ * http://a.com/b  http://d.org/e   http://d.org/e   as is
  */
 char *relative_url(const char *remote_url,
 		   const char *url,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 06/25] http: make http_get_file() external
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 05/25] remote: allow relative_url() to return an absolute url Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 07/25] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

This method will be used in an upcoming extension of git-remote-curl to
download a single file over HTTP(S) by request.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 http.c | 4 ++--
 http.h | 9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/http.c b/http.c
index 229da4d1488..04e73149357 100644
--- a/http.c
+++ b/http.c
@@ -1945,8 +1945,8 @@ int http_get_strbuf(const char *url,
  * If a previous interrupted download is detected (i.e. a previous temporary
  * file is still around) the download is resumed.
  */
-static int http_get_file(const char *url, const char *filename,
-			 struct http_get_options *options)
+int http_get_file(const char *url, const char *filename,
+		  struct http_get_options *options)
 {
 	int ret;
 	struct strbuf tmpfile = STRBUF_INIT;
diff --git a/http.h b/http.h
index df1590e53a4..ba303cfb372 100644
--- a/http.h
+++ b/http.h
@@ -163,6 +163,15 @@ struct http_get_options {
  */
 int http_get_strbuf(const char *url, struct strbuf *result, struct http_get_options *options);
 
+/*
+ * Downloads a URL and stores the result in the given file.
+ *
+ * If a previous interrupted download is detected (i.e. a previous temporary
+ * file is still around) the download is resumed.
+ */
+int http_get_file(const char *url, const char *filename,
+		  struct http_get_options *options);
+
 int http_fetch_ref(const char *base, struct ref *ref);
 
 /* Helpers for fetching packs */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 07/25] remote-curl: add 'get' capability
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 06/25] http: make http_get_file() external Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 08/25] bundle: implement 'fetch' command for direct bundles Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

A future change will want a way to download a file over HTTP(S) using
the simplest of download mechanisms. We do not want to assume that the
server on the other side understands anything about the Git protocol but
could be a simple static web server.

Create the new 'get' capability for the remote helpers which advertises
that the 'get' command is avalable. A caller can send a line containing
'get <url> <path>' to download the file at <url> into the file at
<path>.

RFC-TODO: This change requires tests directly on the remote helper.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitremote-helpers.txt |  6 ++++++
 remote-curl.c                       | 32 +++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/Documentation/gitremote-helpers.txt b/Documentation/gitremote-helpers.txt
index 6f1e269ae43..f82588601a9 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -168,6 +168,9 @@ Supported commands: 'list', 'import'.
 	Can guarantee that when a clone is requested, the received
 	pack is self contained and is connected.
 
+'get'::
+	Can use the 'get' command to download a file from a given URI.
+
 If a helper advertises 'connect', Git will use it if possible and
 fall back to another capability if the helper requests so when
 connecting (see the 'connect' command under COMMANDS).
@@ -418,6 +421,9 @@ Supported if the helper has the "connect" capability.
 +
 Supported if the helper has the "stateless-connect" capability.
 
+'get' <uri> <path>::
+	Downloads the file from the given `<uri>` to the given `<path>`.
+
 If a fatal error occurs, the program writes the error message to
 stderr and exits. The caller should expect that a suitable error
 message has been printed if the child closes the connection without
diff --git a/remote-curl.c b/remote-curl.c
index 0dabef2dd7c..92beb98631b 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1270,6 +1270,33 @@ static void parse_fetch(struct strbuf *buf)
 	strbuf_reset(buf);
 }
 
+static void parse_get(struct strbuf *buf)
+{
+	struct http_get_options opts = { 0 };
+	struct strbuf url = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	const char *p, *space;
+
+	if (!skip_prefix(buf->buf, "get ", &p))
+		die(_("http transport does not support %s"), buf->buf);
+
+	space = strchr(p, ' ');
+
+	if (!space)
+		die(_("protocol error: expected '<url> <path>', missing space"));
+
+	strbuf_add(&url, p, space - p);
+	strbuf_addstr(&path, space + 1);
+
+	http_get_file(url.buf, path.buf, &opts);
+
+	strbuf_release(&url);
+	strbuf_release(&path);
+	printf("\n");
+	fflush(stdout);
+	strbuf_reset(buf);
+}
+
 static int push_dav(int nr_spec, const char **specs)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -1542,9 +1569,14 @@ int cmd_main(int argc, const char **argv)
 				printf("unsupported\n");
 			fflush(stdout);
 
+		} else if (skip_prefix(buf.buf, "get ", &arg)) {
+			parse_get(&buf);
+			fflush(stdout);
+
 		} else if (!strcmp(buf.buf, "capabilities")) {
 			printf("stateless-connect\n");
 			printf("fetch\n");
+			printf("get\n");
 			printf("option\n");
 			printf("push\n");
 			printf("check-connectivity\n");
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 08/25] bundle: implement 'fetch' command for direct bundles
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 07/25] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 09/25] bundle: parse table of contents during 'fetch' Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'git bundle fetch <uri>' command will be used to download one or
more bundles from a specified '<uri>'. The implementation being added
here focuses only on downloading a file from '<uri>' and unbundling it
if it is a valid bundle file.

If it is not a bundle file, then we currently die(), but a later change
will attempt to interpret it as a table of contents with possibly
multiple bundles listed, along with other metadata for each bundle.

That explains a bit why cmd_bundle_fetch() has three steps carefully
commented, including a "stack" that currently can only hold one bundle.
We will later update this while loop to push onto the stack when
necessary.

RFC-TODO: Add documentation to Documentation/git-bundle.txt

RFC-TODO: Add direct tests of 'git bundle fetch' when the URI is a
bundle file.

RFC-TODO: Split out the docs and subcommand boilerplate into its own
commit.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 261 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 261 insertions(+)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 8187b7df739..0e06f1756d1 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -3,6 +3,10 @@
 #include "parse-options.h"
 #include "cache.h"
 #include "bundle.h"
+#include "run-command.h"
+#include "hashmap.h"
+#include "object-store.h"
+#include "refs.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -13,6 +17,7 @@
 
 static const char * const builtin_bundle_usage[] = {
   N_("git bundle create [<options>] <file> <git-rev-list args>"),
+  N_("git bundle fetch [<options>] <uri>"),
   N_("git bundle list-heads <file> [<refname>...]"),
   N_("git bundle unbundle <file> [<refname>...]"),
   N_("git bundle verify [<options>] <file>"),
@@ -24,6 +29,11 @@ static const char * const builtin_bundle_create_usage[] = {
   NULL
 };
 
+static const char * const builtin_bundle_fetch_usage[] = {
+  N_("git bundle fetch [--filter=<spec>] <uri>"),
+  NULL
+};
+
 static const char * const builtin_bundle_list_heads_usage[] = {
   N_("git bundle list-heads <file> [<refname>...]"),
   NULL
@@ -131,6 +141,255 @@ cleanup:
 	return ret;
 }
 
+/**
+ * The remote_bundle_info struct contains the necessary data for
+ * the list of bundles advertised by a table of contents. If the
+ * bundle URI instead contains a single bundle, then this struct
+ * can represent a single bundle without a 'uri' but with a
+ * tempfile storing its current location on disk.
+ */
+struct remote_bundle_info {
+	struct hashmap_entry ent;
+
+	/**
+	 * The 'id' is a name given to the bundle for reference
+	 * by other bundle infos.
+	 */
+	char *id;
+
+	/**
+	 * The 'uri' is the location of the remote bundle so
+	 * it can be downloaded on-demand. This will be NULL
+	 * if there was no table of contents.
+	 */
+	char *uri;
+
+	/**
+	 * The 'next_id' string, if non-NULL, contains the 'id'
+	 * for a bundle that contains the prerequisites for this
+	 * bundle. Used by table of contents to allow fetching
+	 * a portion of a repository incrementally.
+	 */
+	char *next_id;
+
+	/**
+	 * A table of contents can include a timestamp for the
+	 * bundle as a heuristic for describing a list of bundles
+	 * in order of recency.
+	 */
+	timestamp_t timestamp;
+
+	/**
+	 * If the bundle has been downloaded, then 'file' is a
+	 * filename storing its contents. Otherwise, 'file' is
+	 * an empty string.
+	 */
+	struct strbuf file;
+
+	/**
+	 * The 'stack_next' pointer allows this struct to form
+	 * a stack.
+	 */
+	struct remote_bundle_info *stack_next;
+};
+
+static void download_uri_to_file(const char *uri, const char *file)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *child_in;
+
+	strvec_pushl(&cp.args, "git-remote-https", "origin", uri, NULL);
+	cp.in = -1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		die(_("failed to start remote helper"));
+
+	child_in = fdopen(cp.in, "w");
+	if (!child_in)
+		die(_("cannot write to child process"));
+
+	fprintf(child_in, "get %s %s\n\n", uri, file);
+	fclose(child_in);
+
+	if (finish_command(&cp))
+		die(_("remote helper failed"));
+}
+
+static void find_temp_filename(struct strbuf *name)
+{
+	int fd;
+	/*
+	 * Find a temporray filename that is available. This is briefly
+	 * racy, but unlikely to collide.
+	 */
+	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
+	if (fd < 0)
+		die(_("failed to create temporary file"));
+	close(fd);
+	unlink(name->buf);
+}
+
+static void unbundle_fetched_bundle(struct remote_bundle_info *info)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *f;
+	struct strbuf line = STRBUF_INIT;
+	struct strbuf bundle_ref = STRBUF_INIT;
+	size_t bundle_prefix_len;
+
+	strvec_pushl(&cp.args, "bundle", "unbundle",
+				info->file.buf, NULL);
+	cp.git_cmd = 1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		die(_("failed to start 'unbundle' process"));
+
+	strbuf_addstr(&bundle_ref, "refs/bundles/");
+	bundle_prefix_len = bundle_ref.len;
+
+	f = fdopen(cp.out, "r");
+	while (strbuf_getline(&line, f) != EOF) {
+		struct object_id oid, old_oid;
+		const char *refname, *branch_name, *end;
+		char *space;
+		int has_old;
+
+		strbuf_trim_trailing_newline(&line);
+
+		space = strchr(line.buf, ' ');
+
+		if (!space)
+			continue;
+
+		refname = space + 1;
+		*space = '\0';
+		parse_oid_hex(line.buf, &oid, &end);
+
+		if (!skip_prefix(refname, "refs/heads/", &branch_name))
+			continue;
+
+		strbuf_setlen(&bundle_ref, bundle_prefix_len);
+		strbuf_addstr(&bundle_ref, branch_name);
+
+		has_old = !read_ref(bundle_ref.buf, &old_oid);
+
+		update_ref("bundle fetch", bundle_ref.buf, &oid,
+				has_old ? &old_oid : NULL,
+				REF_SKIP_OID_VERIFICATION,
+				UPDATE_REFS_MSG_ON_ERR);
+	}
+
+	if (finish_command(&cp))
+		die(_("failed to unbundle bundle from '%s'"), info->uri);
+
+	unlink_or_warn(info->file.buf);
+}
+
+static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
+{
+	int ret = 0;
+	int progress = isatty(2);
+	char *bundle_uri;
+	struct remote_bundle_info first_file = {
+		.file = STRBUF_INIT,
+	};
+	struct remote_bundle_info *stack = NULL;
+
+	struct option options[] = {
+		OPT_BOOL(0, "progress", &progress,
+			 N_("show progress meter")),
+		OPT_END()
+	};
+
+	argc = parse_options_cmd_bundle(argc, argv, prefix,
+			builtin_bundle_fetch_usage, options, &bundle_uri);
+
+	if (!startup_info->have_repository)
+		die(_("'fetch' requires a repository"));
+
+	/*
+	 * Step 1: determine protocol for uri, and download contents to
+	 * a temporary location.
+	 */
+	first_file.uri = bundle_uri;
+	find_temp_filename(&first_file.file);
+	download_uri_to_file(bundle_uri, first_file.file.buf);
+
+	/*
+	 * Step 2: Check if the file is a bundle (if so, add it to the
+	 * stack and move to step 3).
+	 */
+
+	if (is_bundle(first_file.file.buf, 1)) {
+		/* The simple case: only one file, no stack to worry about. */
+		stack = &first_file;
+	} else {
+		/* TODO: Expect and parse a table of contents. */
+		die(_("unexpected data at bundle URI"));
+	}
+
+	/*
+	 * Step 3: For each bundle in the stack:
+	 * 	i. If not downloaded to a temporary file, download it.
+	 * 	ii. Once downloaded, check that its prerequisites are in
+	 * 	    the object database. If not, then push its dependent
+	 * 	    bundle onto the stack. (Fail if no such bundle exists.)
+	 * 	iii. If all prerequisites are present, then unbundle the
+	 * 	     temporary file and pop the bundle from the stack.
+	 */
+	while (stack) {
+		int valid = 1;
+		int bundle_fd;
+		struct string_list_item *prereq;
+		struct bundle_header header = BUNDLE_HEADER_INIT;
+
+		if (!stack->file.len) {
+			find_temp_filename(&stack->file);
+			download_uri_to_file(stack->uri, stack->file.buf);
+			if (!is_bundle(stack->file.buf, 1))
+				die(_("file downloaded from '%s' is not a bundle"), stack->uri);
+		}
+
+		bundle_header_init(&header);
+		bundle_fd = read_bundle_header(stack->file.buf, &header);
+		if (bundle_fd < 0)
+			die(_("failed to read bundle from '%s'"), stack->uri);
+
+		for_each_string_list_item(prereq, &header.prerequisites) {
+			struct object_info info = OBJECT_INFO_INIT;
+			struct object_id *oid = prereq->util;
+
+			if (oid_object_info_extended(the_repository, oid, &info,
+						     OBJECT_INFO_QUICK)) {
+				valid = 0;
+				break;
+			}
+		}
+
+		close(bundle_fd);
+		bundle_header_release(&header);
+
+		if (valid) {
+			unbundle_fetched_bundle(stack);
+		} else if (stack->next_id) {
+			/*
+			 * Load the next bundle from the hashtable and
+			 * push it onto the stack.
+			 */
+		} else {
+			die(_("bundle from '%s' has missing prerequisites and no dependent bundle"),
+			    stack->uri);
+		}
+
+		stack = stack->stack_next;
+	}
+
+	free(bundle_uri);
+	return ret;
+}
+
 static int cmd_bundle_list_heads(int argc, const char **argv, const char *prefix) {
 	struct bundle_header header = BUNDLE_HEADER_INIT;
 	int bundle_fd = -1;
@@ -209,6 +468,8 @@ int cmd_bundle(int argc, const char **argv, const char *prefix)
 
 	else if (!strcmp(argv[0], "create"))
 		result = cmd_bundle_create(argc, argv, prefix);
+	else if (!strcmp(argv[0], "fetch"))
+		result = cmd_bundle_fetch(argc, argv, prefix);
 	else if (!strcmp(argv[0], "list-heads"))
 		result = cmd_bundle_list_heads(argc, argv, prefix);
 	else if (!strcmp(argv[0], "unbundle"))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 09/25] bundle: parse table of contents during 'fetch'
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 08/25] bundle: implement 'fetch' command for direct bundles Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 10/25] bundle: add --filter option to 'fetch' Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In order to support a flexible bundle URI feature, we allow the server
to return a "table of contents" file that is formatted according to Git
config file standards. These files can describe multiple bundles,
intended to assist with using bundle URIs for fetching or with partial
clone.

Here is an example table of contents file:

[bundle "tableofcontents"]
	version = 1

[bundle "2022-02-09-1644442601-daily"]
	uri = 2022-02-09-1644442601-daily.bundle
	timestamp = 1644442601
	requires = 2022-02-02-1643842562

[bundle "2022-02-02-1643842562"]
	uri = 2022-02-02-1643842562.bundle
	timestamp = 1643842562

[bundle "2022-02-09-1644442631-daily-blobless"]
	uri = 2022-02-09-1644442631-daily-blobless.bundle
	timestamp = 1644442631
	requires = 2022-02-02-1643842568-blobless
	filter = blob:none

[bundle "2022-02-02-1643842568-blobless"]
	uri = 2022-02-02-1643842568-blobless.bundle
	timestamp = 1643842568
	filter = blob:none

(End of example.)

This file contains some important fixed values, such as

 * bundle.tableofcontents.version = 1

Also, different bundles are referenced by <id>, using keys with names

 * bundle.<id>.uri: the URI to download this bundle. This could be an
   absolute URI or a URI relative to the bundle file's URI.
 * bundle.<id>.timestamp: the timestamp when this file was generated.
 * bundle.<id>.filter: the partial clone filter applied on this bundle.
 * bundle.<id>.requires: the ID for the previous bundle.

The current change does not parse the '.filter' option, but does use the
'.requires' in the 'while (stack)' loop.

The process is that 'git bundle fetch' will parse the table of contents
and pick the most-recent bundle and download that one. That bundle
header has a ref listing, including (possibly) a list of commits that
are missing from the bundle. If any of those commits are missing, then
Git downloads the bundle specified by the '.requires' value and tries
again.  Eventually, Git should download a bundle where all missing
commits actually exist in the current repository, or Git downloads a
bundle with no missing commits.

Of course, the server could be advertising incorrect information, so it
could advertise bundles that never satisfy the missing objects. It could
also create a directed cycle in its '.requires' specifications. In each
of these cases, Git will die with a "bundle '<id>' still invalid after
downloading required bundle" message or a "bundle from '<uri>' has
missing prerequisites and no dependent bundle" message.

RFC-TODO: add a direct test of table of contents parsing in this change.
RFC-TODO: create tests that check these erroneous cases.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 169 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 162 insertions(+), 7 deletions(-)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 0e06f1756d1..66f3b3c9376 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -7,6 +7,8 @@
 #include "hashmap.h"
 #include "object-store.h"
 #include "refs.h"
+#include "config.h"
+#include "packfile.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -165,12 +167,21 @@ struct remote_bundle_info {
 	char *uri;
 
 	/**
-	 * The 'next_id' string, if non-NULL, contains the 'id'
+	 * The 'requires_id' string, if non-NULL, contains the 'id'
 	 * for a bundle that contains the prerequisites for this
 	 * bundle. Used by table of contents to allow fetching
 	 * a portion of a repository incrementally.
 	 */
-	char *next_id;
+	char *requires_id;
+
+	/**
+	 * The 'filter_str' string, if non-NULL, specifies the
+	 * filter capability exists in this bundle with the given
+	 * specification. Allows selecting bundles that match the
+	 * client's desired filter. If NULL, then no filter exists
+	 * on the bundle.
+	 */
+	char *filter_str;
 
 	/**
 	 * A table of contents can include a timestamp for the
@@ -191,8 +202,106 @@ struct remote_bundle_info {
 	 * a stack.
 	 */
 	struct remote_bundle_info *stack_next;
+
+	/**
+	 * 'pushed' is set when first pushing the required bundle
+	 * onto the stack. Used to error out when verifying the
+	 * prerequisites and avoiding an infinite loop.
+	 */
+	unsigned pushed:1;
 };
 
+static int remote_bundle_cmp(const void *unused_cmp_data,
+			     const struct hashmap_entry *a,
+			     const struct hashmap_entry *b,
+			     const void *key)
+{
+	const struct remote_bundle_info *ee1 =
+			container_of(a, struct remote_bundle_info, ent);
+	const struct remote_bundle_info *ee2 =
+			container_of(b, struct remote_bundle_info, ent);
+
+	return strcmp(ee1->id, ee2->id);
+}
+
+static int parse_toc_config(const char *key, const char *value, void *data)
+{
+	struct hashmap *toc = data;
+	const char *key1, *key2, *id_end;
+	struct strbuf id = STRBUF_INIT;
+	struct remote_bundle_info info_lookup;
+	struct remote_bundle_info *info;
+
+	if (!skip_prefix(key, "bundle.", &key1))
+		return -1;
+
+	if (skip_prefix(key1, "tableofcontents.", &key2)) {
+		if (!strcmp(key2, "version")) {
+			int version = git_config_int(key, value);
+
+			if (version != 1) {
+				warning(_("table of contents version %d not understood"), version);
+				return -1;
+			}
+		}
+
+		return 0;
+	}
+
+	id_end = strchr(key1, '.');
+
+	/*
+	 * If this key is of the form "bundle.<x>" with no third item,
+	 * then we do not know about it. We should ignore it. Later versions
+	 * might start caring about this data on an optional basis. Increase
+	 * the version number to add keys that must be understood.
+	 */
+	if (!id_end)
+		return 0;
+
+	strbuf_add(&id, key1, id_end - key1);
+	key2 = id_end + 1;
+
+	info_lookup.id = id.buf;
+	hashmap_entry_init(&info_lookup.ent, strhash(info_lookup.id));
+	if (!(info = hashmap_get_entry(toc, &info_lookup, ent, NULL))) {
+		CALLOC_ARRAY(info, 1);
+		info->id = strbuf_detach(&id, NULL);
+		strbuf_init(&info->file, 0);
+		hashmap_entry_init(&info->ent, strhash(info->id));
+		hashmap_add(toc, &info->ent);
+	}
+
+	if (!strcmp(key2, "uri")) {
+		if (info->uri)
+			warning(_("duplicate 'uri' value for id '%s'"), info->id);
+		else
+			info->uri = xstrdup(value);
+		return 0;
+	} else if (!strcmp(key2, "timestamp")) {
+		if (info->timestamp)
+			warning(_("duplicate 'timestamp' value for id '%s'"), info->id);
+		else
+			info->timestamp = git_config_int64(key, value);
+		return 0;
+	} else if (!strcmp(key2, "requires")) {
+		if (info->requires_id)
+			warning(_("duplicate 'requires' value for id '%s'"), info->id);
+		else
+			info->requires_id = xstrdup(value);
+		return 0;
+	} else if (!strcmp(key2, "filter")) {
+		if (info->filter_str)
+			warning(_("duplicate 'filter' value for id '%s'"), info->id);
+		else
+			info->filter_str = xstrdup(value);
+		return 0;
+	}
+
+	/* Return 0 here to ignore unknown options. */
+	return 0;
+}
+
 static void download_uri_to_file(const char *uri, const char *file)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -289,13 +398,14 @@ static void unbundle_fetched_bundle(struct remote_bundle_info *info)
 
 static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 {
-	int ret = 0;
+	int ret = 0, used_hashmap = 0;
 	int progress = isatty(2);
 	char *bundle_uri;
 	struct remote_bundle_info first_file = {
 		.file = STRBUF_INIT,
 	};
 	struct remote_bundle_info *stack = NULL;
+	struct hashmap toc = { 0 };
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
@@ -319,15 +429,31 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 	/*
 	 * Step 2: Check if the file is a bundle (if so, add it to the
-	 * stack and move to step 3).
+	 * stack and move to step 3). Otherwise, expect it to be a table
+	 * of contents. Use the table to populate a hashtable of bundles
+	 * and push the most recent bundle to the stack.
 	 */
 
 	if (is_bundle(first_file.file.buf, 1)) {
 		/* The simple case: only one file, no stack to worry about. */
 		stack = &first_file;
 	} else {
-		/* TODO: Expect and parse a table of contents. */
-		die(_("unexpected data at bundle URI"));
+		struct hashmap_iter iter;
+		struct remote_bundle_info *info;
+		timestamp_t max_time = 0;
+
+		/* populate a hashtable with all relevant bundles. */
+		used_hashmap = 1;
+		hashmap_init(&toc, remote_bundle_cmp, NULL, 0);
+		git_config_from_file(parse_toc_config, first_file.file.buf, &toc);
+
+		/* initialize stack using timestamp heuristic. */
+		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			if (info->timestamp > max_time || !stack) {
+				stack = info;
+				max_time = info->timestamp;
+			}
+		}
 	}
 
 	/*
@@ -357,6 +483,7 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		if (bundle_fd < 0)
 			die(_("failed to read bundle from '%s'"), stack->uri);
 
+		reprepare_packed_git(the_repository);
 		for_each_string_list_item(prereq, &header.prerequisites) {
 			struct object_info info = OBJECT_INFO_INIT;
 			struct object_id *oid = prereq->util;
@@ -373,11 +500,28 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		if (valid) {
 			unbundle_fetched_bundle(stack);
-		} else if (stack->next_id) {
+		} else if (stack->pushed) {
+			die(_("bundle '%s' still invalid after downloading required bundle"), stack->id);
+		} else if (stack->requires_id) {
 			/*
 			 * Load the next bundle from the hashtable and
 			 * push it onto the stack.
 			 */
+			struct remote_bundle_info *info;
+			struct remote_bundle_info info_lookup = { 0 };
+			info_lookup.id = stack->requires_id;
+
+			hashmap_entry_init(&info_lookup.ent, strhash(info_lookup.id));
+			if ((info = hashmap_get_entry(&toc, &info_lookup, ent, NULL))) {
+				/* Push onto the stack */
+				stack->pushed = 1;
+				info->stack_next = stack;
+				stack = info;
+				continue;
+			} else {
+				die(_("unable to find bundle '%s' required by bundle '%s'"),
+				    stack->requires_id, stack->id);
+			}
 		} else {
 			die(_("bundle from '%s' has missing prerequisites and no dependent bundle"),
 			    stack->uri);
@@ -386,6 +530,17 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		stack = stack->stack_next;
 	}
 
+	if (used_hashmap) {
+		struct hashmap_iter iter;
+		struct remote_bundle_info *info;
+		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			free(info->id);
+			free(info->uri);
+			free(info->requires_id);
+			free(info->filter_str);
+		}
+		hashmap_clear_and_free(&toc, struct remote_bundle_info, ent);
+	}
 	free(bundle_uri);
 	return ret;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 10/25] bundle: add --filter option to 'fetch'
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 09/25] bundle: parse table of contents during 'fetch' Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-03-11 13:44   ` Ævar Arnfjörð Bjarmason
  2022-02-23 18:30 ` [PATCH 11/25] bundle: allow relative URLs in table of contents Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When a repository uses an object filter for partial clone, the 'git
bundle fetch' command should try to download bundles that match that
filter.

Teach 'git bundle fetch' to take a '--filter' option and then only
consider bundles that match that filter (or lack thereof). This allows
the bundle server to advertise different sets of bundles for different
filters.

Add some verification to be sure that the bundle we downloaded actually
uses that filter. This is especially important when no filter is
requested but the downloaded bundle _does_ have a filter.

RFC-TODO: add tests for the happy path.

RFC-TODO: add tests for these validations.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 66f3b3c9376..27da5e3737f 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -9,6 +9,7 @@
 #include "refs.h"
 #include "config.h"
 #include "packfile.h"
+#include "list-objects-filter-options.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -406,10 +407,13 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	};
 	struct remote_bundle_info *stack = NULL;
 	struct hashmap toc = { 0 };
+	const char *filter = NULL;
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
 			 N_("show progress meter")),
+		OPT_STRING(0, "filter", &filter,
+			   N_("filter-spec"), N_("only install bundles matching this filter")),
 		OPT_END()
 	};
 
@@ -449,6 +453,17 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		/* initialize stack using timestamp heuristic. */
 		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			/* Skip if filter does not match. */
+			if (!filter && info->filter_str)
+				continue;
+			if (filter &&
+			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
+				continue;
+
+			/*
+			 * Now that the filter matches, start with the
+			 * bundle with largest timestamp.
+			 */
 			if (info->timestamp > max_time || !stack) {
 				stack = info;
 				max_time = info->timestamp;
@@ -468,6 +483,7 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	while (stack) {
 		int valid = 1;
 		int bundle_fd;
+		const char *filter_str = NULL;
 		struct string_list_item *prereq;
 		struct bundle_header header = BUNDLE_HEADER_INIT;
 
@@ -483,6 +499,16 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		if (bundle_fd < 0)
 			die(_("failed to read bundle from '%s'"), stack->uri);
 
+		if (header.filter)
+			filter_str = list_objects_filter_spec(header.filter);
+
+		if (filter && (!filter_str || strcasecmp(filter, filter_str)))
+			die(_("bundle from '%s' does not match expected filter"),
+			    stack->uri);
+		if (!filter && filter_str)
+			die(_("bundle from '%s' has an unexpected filter"),
+			    stack->uri);
+
 		reprepare_packed_git(the_repository);
 		for_each_string_list_item(prereq, &header.prerequisites) {
 			struct object_info info = OBJECT_INFO_INIT;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 11/25] bundle: allow relative URLs in table of contents
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 10/25] bundle: add --filter option to 'fetch' Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-03-11 13:42   ` Ævar Arnfjörð Bjarmason
  2022-02-23 18:30 ` [PATCH 12/25] bundle: make it easy to call 'git bundle fetch' Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When hosting bundle data, it can be helpful to distribute that data
across multiple CDNs. This might require a change in the base URI, all
the way to the domain name. If all bundles require an absolute URI in
their 'uri' value, then every push to a CDN would require altering the
table of contents to match the expected domain and exact location within
it.

Allow the table of contents to specify a relative URI for the bundles.
This allows easier distribution of bundle data.

RFC-TODO: An earlier change referenced relative URLs, but it was not
implemented until this change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 27da5e3737f..ec969a62ae1 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -10,6 +10,7 @@
 #include "config.h"
 #include "packfile.h"
 #include "list-objects-filter-options.h"
+#include "remote.h"
 
 /*
  * Basic handler for bundle files to connect repositories via sneakernet.
@@ -453,6 +454,8 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 
 		/* initialize stack using timestamp heuristic. */
 		hashmap_for_each_entry(&toc, &iter, info, ent) {
+			char *old_uri;
+
 			/* Skip if filter does not match. */
 			if (!filter && info->filter_str)
 				continue;
@@ -460,6 +463,10 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
 				continue;
 
+			old_uri = info->uri;
+			info->uri = relative_url(bundle_uri, info->uri, NULL);
+			free(old_uri);
+
 			/*
 			 * Now that the filter matches, start with the
 			 * bundle with largest timestamp.
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 12/25] bundle: make it easy to call 'git bundle fetch'
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 11/25] bundle: allow relative URLs in table of contents Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 13/25] clone: add --bundle-uri option Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Future changes will integrate 'git bundle fetch' into the 'git clone'
and 'git fetch' operations. Make it easy to fetch bundles via a helper
method.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle.c | 21 +++++++++++++++++++++
 bundle.h |  9 +++++++++
 2 files changed, 30 insertions(+)

diff --git a/bundle.c b/bundle.c
index 3d97de40ef0..9e1b5300366 100644
--- a/bundle.c
+++ b/bundle.c
@@ -649,3 +649,24 @@ int unbundle(struct repository *r, struct bundle_header *header,
 		return error(_("index-pack died"));
 	return 0;
 }
+
+int fetch_bundle_uri(const char *bundle_uri,
+		     const char *filter)
+{
+	int res = 0;
+	struct strvec args = STRVEC_INIT;
+
+	strvec_pushl(&args, "bundle", "fetch", NULL);
+
+	if (filter)
+		strvec_pushf(&args, "--filter=%s", filter);
+	strvec_push(&args, bundle_uri);
+
+	if (run_command_v_opt(args.v, RUN_GIT_CMD)) {
+		warning(_("failed to download bundle from uri '%s'"), bundle_uri);
+		res = 1;
+	}
+
+	strvec_clear(&args);
+	return res;
+}
diff --git a/bundle.h b/bundle.h
index eb026153d56..bf865b19687 100644
--- a/bundle.h
+++ b/bundle.h
@@ -45,4 +45,13 @@ int unbundle(struct repository *r, struct bundle_header *header,
 int list_bundle_refs(struct bundle_header *header,
 		int argc, const char **argv);
 
+struct list_objects_filter_options;
+/**
+ * Fetch bundles from the given URI with the given filter.
+ *
+ * Uses 'git bundle fetch' as a subprocess.
+ */
+int fetch_bundle_uri(const char *bundle_uri,
+		     const char *filter);
+
 #endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 13/25] clone: add --bundle-uri option
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 12/25] bundle: make it easy to call 'git bundle fetch' Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 14/25] clone: --bundle-uri cannot be combined with --depth Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Cloning a remote repository is one of the most expensive operations in
Git. The server can spend a lot of CPU time generating a pack-file for
the client's request. The amount of data can clog the network for a long
time, and the Git protocol is not resumable. For users with poor network
connections or are located far away from the origin server, this can be
especially painful.

The 'git bundle fetch' command allows users to bootstrap a repository
using a set of bundles. However, this would require them to use 'git
init' first, followed by the 'git bundle fetch', and finally add a
remote, fetch, and checkout the branch they want.

Instead, integrate this workflow directly into 'git clone' with the
--bundle-uri' option. If the user is aware of a bundle server, then they
can tell Git to bootstrap the new repository with these bundles before
fetching the remaining objects from the origin server.

RFC-TODO: Document this option in git-clone.txt.

RFC-TODO: I added a comment about the location of this code being
necessary for the later step of auto-discovering the bundle URI from the
origin server. This is probably not actually a requirement, but rather a
pain point around how I implemented the feature. If a --bundle-uri
option is specified, but SSH is used for the clone, then the SSH
connection is left open while Git downloads bundles from another server.
This is sub-optimal and should be reconsidered when fully reviewed.

RFC-TODO: create tests for this option with a variety of URI types.

RFC-TODO: a simple end-to-end test is available at the end of the
series.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 9c29093b352..6df3d513dc4 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -33,6 +33,7 @@
 #include "packfile.h"
 #include "list-objects-filter-options.h"
 #include "hook.h"
+#include "bundle.h"
 
 /*
  * Overall FIXMEs:
@@ -74,6 +75,7 @@ static struct string_list option_recurse_submodules = STRING_LIST_INIT_NODUP;
 static struct list_objects_filter_options filter_options;
 static struct string_list server_options = STRING_LIST_INIT_NODUP;
 static int option_remote_submodules;
+static const char *bundle_uri;
 
 static int recurse_submodules_cb(const struct option *opt,
 				 const char *arg, int unset)
@@ -155,6 +157,8 @@ static struct option builtin_clone_options[] = {
 		    N_("any cloned submodules will use their remote-tracking branch")),
 	OPT_BOOL(0, "sparse", &option_sparse_checkout,
 		    N_("initialize sparse-checkout file to include only files at root")),
+	OPT_STRING(0, "bundle-uri", &bundle_uri,
+		   N_("uri"), N_("A URI for downloading bundles before fetching from origin remote")),
 	OPT_END()
 };
 
@@ -1185,6 +1189,35 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
 
+	/*
+	 * NOTE: The bundle URI download takes place after transport_get_remote_refs()
+	 * because a later change will introduce a check for recommended features,
+	 * which might include a recommended bundle URI.
+	 */
+
+	/*
+	 * Before fetching from the remote, download and install bundle
+	 * data from the --bundle-uri option.
+	 */
+	if (bundle_uri) {
+		const char *filter = NULL;
+
+		if (filter_options.filter_spec.nr)
+			filter = expand_list_objects_filter_spec(&filter_options);
+		/*
+		 * Set the config for fetching from this bundle URI in the
+		 * future, but do it before fetch_bundle_uri() which might
+		 * un-set it (for instance, if there is no table of contents).
+		 */
+		git_config_set("fetch.bundleuri", bundle_uri);
+		if (filter)
+			git_config_set("fetch.bundlefilter", filter);
+
+		if (!fetch_bundle_uri(bundle_uri, filter))
+			warning(_("failed to fetch objects from bundle URI '%s'"),
+				bundle_uri);
+	}
+
 	if (refs)
 		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 14/25] clone: --bundle-uri cannot be combined with --depth
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 13/25] clone: add --bundle-uri option Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 15/25] config: add git_config_get_timestamp() Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change added the '--bundle-uri' option, but did not check
if the --depth parameter was included. Since bundles are not compatible
with shallow clones, provide an error message to the user who is
attempting this combination.

I am leaving this as its own change, separate from the one that
implements '--bundle-uri', because this is more of an advisory for the
user. There is nothing wrong with bootstrapping with bundles and then
fetching a shallow clone. However, that is likely going to involve too
much work for the client _and_ the server. The client will download all
of this bundle information containing the full history of the
repository only to ignore most of it. The server will get a shallow
fetch request, but with a list of haves that might cause a more painful
computation of that shallow pack-file.

RFC-TODO: add a test case for this error message.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 6df3d513dc4..cfe3d96047a 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -912,6 +912,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		option_no_checkout = 1;
 	}
 
+	if (bundle_uri) {
+		if (deepen)
+			die(_("--bundle-uri is incompatible with --depth, --shallow-since, and --shallow-exclude"));
+	}
+
 	repo_name = argv[0];
 
 	path = get_repo_path(repo_name, &is_bundle);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 15/25] config: add git_config_get_timestamp()
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 14/25] clone: --bundle-uri cannot be combined with --depth Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-03-11 13:32   ` Ævar Arnfjörð Bjarmason
  2022-02-23 18:30 ` [PATCH 16/25] bundle: only fetch bundles if timestamp is new Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The existing config parsing methods do not include a way to consistently
parse timestamps across all platforms. Recall that "unsigned long" is
32 bits on 64-bit Windows, so git_config_get_ulong() is insufficient.

Adding a new type requires quite a bit of boilerplate to match the style
of other types.

RFC-QUESTION: Would this be better to use uintmax_t, which could be cast
to timestamp_t or other types more robust than "unsigned long"?

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 config.c | 39 +++++++++++++++++++++++++++++++++++++++
 config.h | 14 ++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/config.c b/config.c
index e0c03d154c9..84021b7d504 100644
--- a/config.c
+++ b/config.c
@@ -1228,6 +1228,15 @@ int git_parse_ulong(const char *value, unsigned long *ret)
 	return 1;
 }
 
+int git_parse_timestamp(const char *value, timestamp_t *ret)
+{
+	uintmax_t tmp;
+	if (!git_parse_unsigned(value, &tmp, maximum_unsigned_value_of_type(timestamp_t)))
+		return 0;
+	*ret = tmp;
+	return 1;
+}
+
 int git_parse_ssize_t(const char *value, ssize_t *ret)
 {
 	intmax_t tmp;
@@ -1296,6 +1305,14 @@ unsigned long git_config_ulong(const char *name, const char *value)
 	return ret;
 }
 
+timestamp_t git_config_timestamp(const char *name, const char *value)
+{
+	timestamp_t ret;
+	if (!git_parse_timestamp(value, &ret))
+		die_bad_number(name, value);
+	return ret;
+}
+
 ssize_t git_config_ssize_t(const char *name, const char *value)
 {
 	ssize_t ret;
@@ -2328,6 +2345,16 @@ int git_configset_get_ulong(struct config_set *cs, const char *key, unsigned lon
 		return 1;
 }
 
+int git_configset_get_timestamp(struct config_set *cs, const char *key, timestamp_t *dest)
+{
+	const char *value;
+	if (!git_configset_get_value(cs, key, &value)) {
+		*dest = git_config_timestamp(key, value);
+		return 0;
+	} else
+		return 1;
+}
+
 int git_configset_get_bool(struct config_set *cs, const char *key, int *dest)
 {
 	const char *value;
@@ -2471,6 +2498,13 @@ int repo_config_get_ulong(struct repository *repo,
 	return git_configset_get_ulong(repo->config, key, dest);
 }
 
+int repo_config_get_timestamp(struct repository *repo,
+			      const char *key, timestamp_t *dest)
+{
+	git_config_check_init(repo);
+	return git_configset_get_timestamp(repo->config, key, dest);
+}
+
 int repo_config_get_bool(struct repository *repo,
 			 const char *key, int *dest)
 {
@@ -2544,6 +2578,11 @@ int git_config_get_ulong(const char *key, unsigned long *dest)
 	return repo_config_get_ulong(the_repository, key, dest);
 }
 
+int git_config_get_timestamp(const char *key, timestamp_t *dest)
+{
+	return repo_config_get_timestamp(the_repository, key, dest);
+}
+
 int git_config_get_bool(const char *key, int *dest)
 {
 	return repo_config_get_bool(the_repository, key, dest);
diff --git a/config.h b/config.h
index ab0106d2875..a6e4d35da0a 100644
--- a/config.h
+++ b/config.h
@@ -206,6 +206,7 @@ int config_with_options(config_fn_t fn, void *,
 
 int git_parse_ssize_t(const char *, ssize_t *);
 int git_parse_ulong(const char *, unsigned long *);
+int git_parse_timestamp(const char *, timestamp_t *);
 
 /**
  * Same as `git_config_bool`, except that it returns -1 on error rather
@@ -226,6 +227,11 @@ int64_t git_config_int64(const char *, const char *);
  */
 unsigned long git_config_ulong(const char *, const char *);
 
+/**
+ * Identical to `git_config_int`, but for (unsigned) timestamps.
+ */
+timestamp_t git_config_timestamp(const char *name, const char *value);
+
 ssize_t git_config_ssize_t(const char *, const char *);
 
 /**
@@ -469,6 +475,7 @@ int git_configset_get_string(struct config_set *cs, const char *key, char **dest
 int git_configset_get_string_tmp(struct config_set *cs, const char *key, const char **dest);
 int git_configset_get_int(struct config_set *cs, const char *key, int *dest);
 int git_configset_get_ulong(struct config_set *cs, const char *key, unsigned long *dest);
+int git_configset_get_timestamp(struct config_set *cs, const char *key, timestamp_t *dest);
 int git_configset_get_bool(struct config_set *cs, const char *key, int *dest);
 int git_configset_get_bool_or_int(struct config_set *cs, const char *key, int *is_bool, int *dest);
 int git_configset_get_maybe_bool(struct config_set *cs, const char *key, int *dest);
@@ -489,6 +496,8 @@ int repo_config_get_int(struct repository *repo,
 			const char *key, int *dest);
 int repo_config_get_ulong(struct repository *repo,
 			  const char *key, unsigned long *dest);
+int repo_config_get_timestamp(struct repository *repo,
+			      const char *key, timestamp_t *dest);
 int repo_config_get_bool(struct repository *repo,
 			 const char *key, int *dest);
 int repo_config_get_bool_or_int(struct repository *repo,
@@ -558,6 +567,11 @@ int git_config_get_int(const char *key, int *dest);
  */
 int git_config_get_ulong(const char *key, unsigned long *dest);
 
+/**
+ * Similar to `git_config_get_int` but for (unsigned) timestamps.
+ */
+int git_config_get_timestamp(const char *key, timestamp_t *dest);
+
 /**
  * Finds and parses the value into a boolean value, for the configuration
  * variable `key` respecting keywords like "true" and "false". Integer
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 16/25] bundle: only fetch bundles if timestamp is new
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 15/25] config: add git_config_get_timestamp() Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 17/25] fetch: fetch bundles before fetching original data Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

If a bundle server is providing a table of contents with timestamps for
the bundles, then we can store the most-recent timestamp and use that as
a test if the bundle server has any new information. Teach 'git bundle
fetch' to store the timestamp in the config file as
'fetch.bundleTimestamp' and compare the existing value to the
most-recent timestamp in the bundle server's table of contents. If the
new timestamp is at most the stored timestamp, then exit early (with
success). If the new timestamp is greater than the stored timestamp,
then continue with the normal fetch logic of downloading the most-recent
bundle until all missing objects are satisfied. Store that new timestamp
in the config for next time.

RFC-TODO: Update documentation of 'git bundle fetch' to match his new
behavior.

RFC-TODO: Add 'fetch.bundleTimestamp' to Documentation/config/

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/builtin/bundle.c b/builtin/bundle.c
index ec969a62ae1..cab99ee2b15 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -409,6 +409,9 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	struct remote_bundle_info *stack = NULL;
 	struct hashmap toc = { 0 };
 	const char *filter = NULL;
+	const char *timestamp_key = "fetch.bundletimestamp";
+	timestamp_t stored_time = 0;
+	timestamp_t max_time = 0;
 
 	struct option options[] = {
 		OPT_BOOL(0, "progress", &progress,
@@ -424,6 +427,8 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	if (!startup_info->have_repository)
 		die(_("'fetch' requires a repository"));
 
+	git_config_get_timestamp(timestamp_key, &stored_time);
+
 	/*
 	 * Step 1: determine protocol for uri, and download contents to
 	 * a temporary location.
@@ -445,7 +450,6 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 	} else {
 		struct hashmap_iter iter;
 		struct remote_bundle_info *info;
-		timestamp_t max_time = 0;
 
 		/* populate a hashtable with all relevant bundles. */
 		used_hashmap = 1;
@@ -476,6 +480,13 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 				max_time = info->timestamp;
 			}
 		}
+
+		trace2_data_intmax("bundle", the_repository, "max_time", max_time);
+		trace2_data_intmax("bundle", the_repository, "stored_time", stored_time);
+
+		/* Skip fetching bundles if data isn't new enough. */
+		if (max_time <= stored_time)
+			goto cleanup;
 	}
 
 	/*
@@ -563,6 +574,14 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
 		stack = stack->stack_next;
 	}
 
+	if (max_time) {
+		struct strbuf tstr = STRBUF_INIT;
+		strbuf_addf(&tstr, "%"PRIuMAX"", max_time);
+		git_config_set_gently(timestamp_key, tstr.buf);
+		strbuf_release(&tstr);
+	}
+
+cleanup:
 	if (used_hashmap) {
 		struct hashmap_iter iter;
 		struct remote_bundle_info *info;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 17/25] fetch: fetch bundles before fetching original data
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 16/25] bundle: only fetch bundles if timestamp is new Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 18/25] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason via GitGitGadget
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

If a user cloned using a bundle URI, then they might want to re-use it
to download new bundles during 'git fetch' before fetching the remaining
objects from the origin server. Use the 'fetch.bundleURI' config as the
indicator for whether this extra step should happen.

Do not fetch bundles if --dry-run is specified.

RFC-TODO: add tests.

RFC-TODO: update Documentation/git-fetch.txt

RFC-TODO: update Documentation/config/

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/fetch.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 6f5e1578639..c0fece55632 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "shallow.h"
 #include "worktree.h"
+#include "bundle.h"
 
 #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
 
@@ -2081,6 +2082,22 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	/* FETCH_HEAD never gets updated in --dry-run mode */
 	if (dry_run)
 		write_fetch_head = 0;
+	else {
+		/*
+		 * --dry-run mode skips bundle downloads, which might
+		 * update some refs.
+		 */
+		char *bundle_uri = NULL;
+		git_config_get_string("fetch.bundleuri", &bundle_uri);
+
+		if (bundle_uri) {
+			char *filter = NULL;
+			git_config_get_string("fetch.bundlefilter", &filter);
+			fetch_bundle_uri(bundle_uri, filter);
+			free(bundle_uri);
+			free(filter);
+		}
+	}
 
 	if (all) {
 		if (argc == 1)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 18/25] connect.c: refactor sending of agent & object-format
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 17/25] fetch: fetch bundles before fetching original data Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Ævar Arnfjörð Bjarmason via GitGitGadget
  2022-02-23 18:30 ` [PATCH 19/25] protocol-caps: implement cap_features() Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git
  Cc: gitster, me, aevar, newren, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

From: =?UTF-8?q?=C3=86var=20Arnfj=C3=B6r=C3=B0=20Bjarmason?=
 <avarab@gmail.com>

Refactor the sending of the "agent" and "object-format" capabilities
into a function.

This was added in its current form in ab67235bc4 (connect: parse v2
refs with correct hash algorithm, 2020-05-25). When we connect to a v2
server we need to know about its object-format, and it needs to know
about ours. Since most things in connect.c and transport.c piggy-back
on the eager getting of remote refs via the handshake() those commands
can make use of the just-sent-over object-format by ls-refs.

But I'm about to add a command that may come after ls-refs, and may
not, but we need the server to know about our user-agent and
object-format. So let's split this into a function.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 connect.c | 33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/connect.c b/connect.c
index eaf7d6d2618..9d78d681e95 100644
--- a/connect.c
+++ b/connect.c
@@ -473,6 +473,24 @@ void check_stateless_delimiter(int stateless_rpc,
 		die("%s", error);
 }
 
+static void send_capabilities(int fd_out, struct packet_reader *reader)
+{
+	const char *hash_name;
+
+	if (server_supports_v2("agent", 0))
+		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
+
+	if (server_feature_v2("object-format", &hash_name)) {
+		int hash_algo = hash_algo_by_name(hash_name);
+		if (hash_algo == GIT_HASH_UNKNOWN)
+			die(_("unknown object format '%s' specified by server"), hash_name);
+		reader->hash_algo = &hash_algos[hash_algo];
+		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
+	} else {
+		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	}
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
@@ -480,7 +498,6 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     int stateless_rpc)
 {
 	int i;
-	const char *hash_name;
 	struct strvec *ref_prefixes = transport_options ?
 		&transport_options->ref_prefixes : NULL;
 	char **unborn_head_target = transport_options ?
@@ -490,18 +507,8 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 	if (server_supports_v2("ls-refs", 1))
 		packet_write_fmt(fd_out, "command=ls-refs\n");
 
-	if (server_supports_v2("agent", 0))
-		packet_write_fmt(fd_out, "agent=%s", git_user_agent_sanitized());
-
-	if (server_feature_v2("object-format", &hash_name)) {
-		int hash_algo = hash_algo_by_name(hash_name);
-		if (hash_algo == GIT_HASH_UNKNOWN)
-			die(_("unknown object format '%s' specified by server"), hash_name);
-		reader->hash_algo = &hash_algos[hash_algo];
-		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
-	} else {
-		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
-	}
+	/* Send capabilities */
+	send_capabilities(fd_out, reader);
 
 	if (server_options && server_options->nr &&
 	    server_supports_v2("server-option", 1))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 19/25] protocol-caps: implement cap_features()
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 18/25] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 20/25] serve: understand but do not advertise 'features' capability Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'features' capability sends a list of "key=value" pairs from the
server. These are a set of fixed config values, all prefixed with
"serve." to avoid conflicting with other config values of similar names.

The initial set chosen here are:

* bundleURI: Allow advertising one or more bundle servers by URI.

* partialCloneFilter: Advertise one or more recommended partial clone
  filters.

* sparseCheckout: Advertise that this repository recommends using the
  sparse-checkout feature in cone mode.

The client will have the choice to enable these features.

RFC-TODO: Create Documentation/config/serve.txt

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 protocol-caps.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++
 protocol-caps.h |  1 +
 2 files changed, 67 insertions(+)

diff --git a/protocol-caps.c b/protocol-caps.c
index bbde91810ac..88b01c4133e 100644
--- a/protocol-caps.c
+++ b/protocol-caps.c
@@ -8,6 +8,7 @@
 #include "object-store.h"
 #include "string-list.h"
 #include "strbuf.h"
+#include "config.h"
 
 struct requested_info {
 	unsigned size : 1;
@@ -111,3 +112,68 @@ int cap_object_info(struct repository *r, struct packet_reader *request)
 
 	return 0;
 }
+
+static void send_lines(struct repository *r, struct packet_writer *writer,
+		       struct string_list *str_list)
+{
+	struct string_list_item *item;
+
+	if (!str_list->nr)
+		return;
+
+	for_each_string_list_item (item, str_list) {
+		packet_writer_write(writer, "%s", item->string);
+	}
+}
+
+int cap_features(struct repository *r, struct packet_reader *request)
+{
+	struct packet_writer writer;
+	struct string_list feature_list = STRING_LIST_INIT_DUP;
+	int i = 0;
+	const char *keys[] = {
+		"bundleuri",
+		"partialclonefilter",
+		"sparsecheckout",
+		NULL
+	};
+	struct strbuf serve_feature = STRBUF_INIT;
+	struct strbuf key_equals_value = STRBUF_INIT;
+	size_t len;
+	strbuf_add(&serve_feature, "serve.", 6);
+	len = serve_feature.len;
+
+	packet_writer_init(&writer, 1);
+
+	while (keys[i]) {
+		struct string_list_item *item;
+		const struct string_list *values = NULL;
+		strbuf_setlen(&serve_feature, len);
+		strbuf_addstr(&serve_feature, keys[i]);
+
+		values = repo_config_get_value_multi(r, serve_feature.buf);
+
+		if (values) {
+			for_each_string_list_item(item, values) {
+				strbuf_reset(&key_equals_value);
+				strbuf_addstr(&key_equals_value, keys[i]);
+				strbuf_addch(&key_equals_value, '=');
+				strbuf_addstr(&key_equals_value, item->string);
+
+				string_list_append(&feature_list, key_equals_value.buf);
+			}
+		}
+
+		i++;
+	}
+	strbuf_release(&serve_feature);
+	strbuf_release(&key_equals_value);
+
+	send_lines(r, &writer, &feature_list);
+
+	string_list_clear(&feature_list, 1);
+
+	packet_flush(1);
+
+	return 0;
+}
diff --git a/protocol-caps.h b/protocol-caps.h
index 15c4550360c..681d2106d88 100644
--- a/protocol-caps.h
+++ b/protocol-caps.h
@@ -4,5 +4,6 @@
 struct repository;
 struct packet_reader;
 int cap_object_info(struct repository *r, struct packet_reader *request);
+int cap_features(struct repository *r, struct packet_reader *request);
 
 #endif /* PROTOCOL_CAPS_H */
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 20/25] serve: understand but do not advertise 'features' capability
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 19/25] protocol-caps: implement cap_features() Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:30 ` [PATCH 21/25] serve: advertise 'features' when config exists Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change implemented cap_features() to return a set of
'key=value' pairs when this capability is run. Add the capability to our
list of understood capabilities.

This change does not advertise the capability. When deploying a new
capability to a distributed fleet of Git servers, it is important to
delay advertising the capability until all nodes understand it. A later
change will advertise it when appropriate, but as a separate change to
simplify this transition.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 serve.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/serve.c b/serve.c
index b3fe9b5126a..a1c853dda1f 100644
--- a/serve.c
+++ b/serve.c
@@ -18,6 +18,12 @@ static int always_advertise(struct repository *r,
 	return 1;
 }
 
+static int never_advertise(struct repository *r,
+			   struct strbuf *value)
+{
+	return 0;
+}
+
 static int agent_advertise(struct repository *r,
 			   struct strbuf *value)
 {
@@ -136,6 +142,11 @@ static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "features",
+		.advertise = never_advertise,
+		.command = cap_features,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 21/25] serve: advertise 'features' when config exists
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 20/25] serve: understand but do not advertise 'features' capability Derrick Stolee via GitGitGadget
@ 2022-02-23 18:30 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:31 ` [PATCH 22/25] connect: implement get_recommended_features() Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:30 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The 'features' capability allows a server to recommend some Git features
at a high level. Previous changes implemented the capability so servers
understand it, but it was never advertised.

Now, allow it to be advertised, but only when the capability will
actually _do_ something. That is, advertise if and only if a config
value exists with the prefix "serve.". This avoids unnecessary round
trips for an empty result.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 serve.c              | 18 +++++++++++++++---
 t/t5701-git-serve.sh |  9 +++++++++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/serve.c b/serve.c
index a1c853dda1f..7dcabb68147 100644
--- a/serve.c
+++ b/serve.c
@@ -18,12 +18,24 @@ static int always_advertise(struct repository *r,
 	return 1;
 }
 
-static int never_advertise(struct repository *r,
-			   struct strbuf *value)
+static int key_serve_prefix(const char *key, const char *value, void *data)
 {
+	int *signal = data;
+	if (!strncmp(key, "serve.", 6)) {
+		*signal = 1;
+		return 1;
+	}
 	return 0;
 }
 
+static int has_serve_config(struct repository *r,
+			    struct strbuf *value)
+{
+	int signal = 0;
+	repo_config(r, key_serve_prefix, &signal);
+	return signal;
+}
+
 static int agent_advertise(struct repository *r,
 			   struct strbuf *value)
 {
@@ -144,7 +156,7 @@ static struct protocol_capability capabilities[] = {
 	},
 	{
 		.name = "features",
-		.advertise = never_advertise,
+		.advertise = has_serve_config,
 		.command = cap_features,
 	},
 };
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 1896f671cb3..6ef721c3f97 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -30,6 +30,15 @@ test_expect_success 'test capability advertisement' '
 	test_cmp expect actual
 '
 
+test_expect_success 'test capability advertisement' '
+	test_when_finished git config --unset serve.bundleuri &&
+	git config serve.bundleuri "file://$(pwd)" &&
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	grep features actual
+'
+
 test_expect_success 'stateless-rpc flag does not list capabilities' '
 	# Empty request
 	test-tool pkt-line pack >in <<-EOF &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 22/25] connect: implement get_recommended_features()
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2022-02-23 18:30 ` [PATCH 21/25] serve: advertise 'features' when config exists Derrick Stolee via GitGitGadget
@ 2022-02-23 18:31 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:31 ` [PATCH 23/25] transport: add connections for 'features' capability Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:31 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

This method allows a client to request and parse the 'features' capability
of protocol v2. The response is expected to be a list of 'key=value'
lines, but this implementation does no checking of the lines, expecting
a later parse of the lines to be careful of the existence of that '='
character.

This change is based on an earlier patch [1] written for a similar
capability.

[1] https://lore.kernel.org/git/RFC-patch-04.13-21caf01775-20210805T150534Z-avarab@gmail.com/

Co-authored-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 connect.c | 36 ++++++++++++++++++++++++++++++++++++
 remote.h  |  4 ++++
 2 files changed, 40 insertions(+)

diff --git a/connect.c b/connect.c
index 9d78d681e95..e1e6f4770dd 100644
--- a/connect.c
+++ b/connect.c
@@ -491,6 +491,42 @@ static void send_capabilities(int fd_out, struct packet_reader *reader)
 	}
 }
 
+int get_recommended_features(int fd_out, struct packet_reader *reader,
+			     struct string_list *list, int stateless_rpc)
+{
+	int line_nr = 1;
+
+	server_supports_v2("features", 1);
+
+	/* (Re-)send capabilities */
+	send_capabilities(fd_out, reader);
+
+	/* Send command */
+	packet_write_fmt(fd_out, "command=features\n");
+	packet_delim(fd_out);
+	packet_flush(fd_out);
+
+	/* Process response from server */
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		const char *line = reader->line;
+		line_nr++;
+
+		string_list_append(list, line);
+	}
+
+	if (reader->status != PACKET_READ_FLUSH)
+		return error(_("expected flush after features listing"));
+
+	/*
+	 * Might die(), but obscure enough that that's OK, e.g. in
+	 * serve.c, we'll call BUG() on its equivalent (the
+	 * PACKET_READ_RESPONSE_END check).
+	 */
+	check_stateless_delimiter(stateless_rpc, reader,
+		_("expected response end packet after features listing"));
+	return 0;
+}
+
 struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     struct ref **list, int for_push,
 			     struct transport_ls_refs_options *transport_options,
diff --git a/remote.h b/remote.h
index 438152ef562..268e8134f5e 100644
--- a/remote.h
+++ b/remote.h
@@ -236,6 +236,10 @@ struct ref **get_remote_refs(int fd_out, struct packet_reader *reader,
 			     const struct string_list *server_options,
 			     int stateless_rpc);
 
+/* Used for protocol v2 in order to retrieve recommended features */
+int get_recommended_features(int fd_out, struct packet_reader *reader,
+			     struct string_list *list, int stateless_rpc);
+
 int resolve_remote_symref(struct ref *ref, struct ref *list);
 
 /*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 23/25] transport: add connections for 'features' capability
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (21 preceding siblings ...)
  2022-02-23 18:31 ` [PATCH 22/25] connect: implement get_recommended_features() Derrick Stolee via GitGitGadget
@ 2022-02-23 18:31 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:31 ` [PATCH 24/25] clone: use server-recommended bundle URI Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:31 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

To allow 'git clone' to check the 'features' capability, we need to fill
in some boilerplate methods that help detect if the capability exists
and then to execute the get_recommended_features() method with the
proper context. This involves jumping through some vtables.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 transport-helper.c   | 14 ++++++++++++++
 transport-internal.h |  9 +++++++++
 transport.c          | 38 ++++++++++++++++++++++++++++++++++++++
 transport.h          |  5 +++++
 4 files changed, 66 insertions(+)

diff --git a/transport-helper.c b/transport-helper.c
index a0297b0986c..642472e2adb 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1264,11 +1264,25 @@ static struct ref *get_refs_list_using_list(struct transport *transport,
 	return ret;
 }
 
+static int get_features(struct transport *transport,
+		      struct string_list *list)
+{
+	get_helper(transport);
+
+	if (process_connect(transport, 0)) {
+		do_take_over(transport);
+		return transport->vtable->get_features(transport, list);
+	}
+
+	return -1;
+}
+
 static struct transport_vtable vtable = {
 	.set_option	= set_helper_option,
 	.get_refs_list	= get_refs_list,
 	.fetch_refs	= fetch_refs,
 	.push_refs	= push_refs,
+	.get_features	= get_features,
 	.connect	= connect_helper,
 	.disconnect	= release_helper
 };
diff --git a/transport-internal.h b/transport-internal.h
index c4ca0b733ac..759c79148db 100644
--- a/transport-internal.h
+++ b/transport-internal.h
@@ -5,6 +5,7 @@ struct ref;
 struct transport;
 struct strvec;
 struct transport_ls_refs_options;
+struct string_list;
 
 struct transport_vtable {
 	/**
@@ -51,6 +52,14 @@ struct transport_vtable {
 	 * process involved generating new commits.
 	 **/
 	int (*push_refs)(struct transport *transport, struct ref *refs, int flags);
+
+	/**
+	 * get_features() requests a list of recommended features and
+	 * populates the given string_list with those 'key=value' pairs.
+	 */
+	int (*get_features)(struct transport *transport,
+			    struct string_list *list);
+
 	int (*connect)(struct transport *connection, const char *name,
 		       const char *executable, int fd[2]);
 
diff --git a/transport.c b/transport.c
index 2a3e3241545..99d6b719f35 100644
--- a/transport.c
+++ b/transport.c
@@ -349,6 +349,20 @@ static struct ref *get_refs_via_connect(struct transport *transport, int for_pus
 	return handshake(transport, for_push, options, 1);
 }
 
+static int get_features(struct transport *transport,
+		      struct string_list *list)
+{
+	struct git_transport_data *data = transport->data;
+	struct packet_reader reader;
+
+	packet_reader_init(&reader, data->fd[0], NULL, 0,
+			   PACKET_READ_CHOMP_NEWLINE |
+			   PACKET_READ_GENTLE_ON_EOF);
+
+	return get_recommended_features(data->fd[1], &reader, list,
+					transport->stateless_rpc);
+}
+
 static int fetch_refs_via_pack(struct transport *transport,
 			       int nr_heads, struct ref **to_fetch)
 {
@@ -890,6 +904,7 @@ static struct transport_vtable taken_over_vtable = {
 	.get_refs_list	= get_refs_via_connect,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
+	.get_features	= get_features,
 	.disconnect	= disconnect_git
 };
 
@@ -1043,6 +1058,7 @@ static struct transport_vtable builtin_smart_vtable = {
 	.get_refs_list	= get_refs_via_connect,
 	.fetch_refs	= fetch_refs_via_pack,
 	.push_refs	= git_transport_push,
+	.get_features	= get_features,
 	.connect	= connect_git,
 	.disconnect	= disconnect_git
 };
@@ -1456,6 +1472,28 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 	return rc;
 }
 
+struct string_list *transport_remote_features(struct transport *transport)
+{
+	const struct transport_vtable *vtable = transport->vtable;
+	struct string_list *list = NULL;
+
+	if (!server_supports_v2("features", 0))
+		return NULL;
+
+	if (!vtable->get_features) {
+		warning(_("'features' not supported by this remote"));
+		return NULL;
+	}
+
+	CALLOC_ARRAY(list, 1);
+	string_list_init_dup(list);
+
+	if (vtable->get_features(transport, list))
+		warning(_("failed to get recommended features from remote"));
+
+	return list;
+}
+
 void transport_unlock_pack(struct transport *transport, unsigned int flags)
 {
 	int in_signal_handler = !!(flags & TRANSPORT_UNLOCK_PACK_IN_SIGNAL_HANDLER);
diff --git a/transport.h b/transport.h
index 3f16e50c196..bfa2dd48d85 100644
--- a/transport.h
+++ b/transport.h
@@ -272,6 +272,11 @@ struct transport_ls_refs_options {
 const struct ref *transport_get_remote_refs(struct transport *transport,
 					    struct transport_ls_refs_options *transport_options);
 
+/**
+ * Get recommended config from remote.
+ */
+struct string_list *transport_remote_features(struct transport *transport);
+
 /*
  * Fetch the hash algorithm used by a remote.
  *
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 24/25] clone: use server-recommended bundle URI
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (22 preceding siblings ...)
  2022-02-23 18:31 ` [PATCH 23/25] transport: add connections for 'features' capability Derrick Stolee via GitGitGadget
@ 2022-02-23 18:31 ` Derrick Stolee via GitGitGadget
  2022-02-23 18:31 ` [PATCH 25/25] t5601: basic bundle URI test Derrick Stolee via GitGitGadget
  2022-02-23 22:17 ` [PATCH 00/25] [RFC] Bundle URIs Ævar Arnfjörð Bjarmason
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:31 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

After the ref advertisement initializes the connection between the
client and the remote, use the 'features' capability (if available) to
get a list of recommended features from the server.

In this change, we only update the bundle URI setting. The bundles are
downloaded immediately afterwards if the bundle URI becomes non-null.

RFC-TODO: don't overwrite a given --bundle-uri option.
RFC-TODO: implement the other capabilities.
RFC-TODO: guard this entire request behind opt-in config.
RFC-TODO: prevent using an HTTP(S) URI when in an SSH clone.
RFC-TODO: prevent using a local path for the bundle URI.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/clone.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index cfe3d96047a..92b8727fc9d 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -876,6 +876,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct remote *remote;
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
+	struct string_list *feature_list = NULL;
 
 	struct transport_ls_refs_options transport_ls_refs_options =
 		TRANSPORT_LS_REFS_OPTIONS_INIT;
@@ -1194,11 +1195,23 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
 
-	/*
-	 * NOTE: The bundle URI download takes place after transport_get_remote_refs()
-	 * because a later change will introduce a check for recommended features,
-	 * which might include a recommended bundle URI.
-	 */
+	feature_list = transport_remote_features(transport);
+
+	if (feature_list) {
+		struct string_list_item *item;
+		for_each_string_list_item(item, feature_list) {
+			char *value;
+			char *equals = strchr(item->string, '=');
+
+			if (!equals)
+				continue;
+			*equals = '\0';
+			value = equals + 1;
+
+			if (!strcmp(item->string, "bundleuri"))
+				bundle_uri = value;
+		}
+	}
 
 	/*
 	 * Before fetching from the remote, download and install bundle
@@ -1218,7 +1231,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		if (filter)
 			git_config_set("fetch.bundlefilter", filter);
 
-		if (!fetch_bundle_uri(bundle_uri, filter))
+		if (fetch_bundle_uri(bundle_uri, filter))
 			warning(_("failed to fetch objects from bundle URI '%s'"),
 				bundle_uri);
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 25/25] t5601: basic bundle URI test
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (23 preceding siblings ...)
  2022-02-23 18:31 ` [PATCH 24/25] clone: use server-recommended bundle URI Derrick Stolee via GitGitGadget
@ 2022-02-23 18:31 ` Derrick Stolee via GitGitGadget
  2022-02-23 22:17 ` [PATCH 00/25] [RFC] Bundle URIs Ævar Arnfjörð Bjarmason
  25 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-02-23 18:31 UTC (permalink / raw)
  To: git; +Cc: gitster, me, aevar, newren, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

This test demonstrates an end-to-end form of the bundle URI feature
given by an HTTP server advertising the 'features' capability with a
bundle URI that is a bundle file on that same HTTP server. We verify
that we unbundled a bundle, which could only have happened if we
successfully downloaded that file.

RFC-TODO: Create similar tests throughout the series that perform
similar tests, including examples with table of contents and partial
clones.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 t/t5601-clone.sh | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 83c24fc97a7..b2409a4c04c 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -769,6 +769,18 @@ test_expect_success 'reject cloning shallow repository using HTTP' '
 	git clone --no-reject-shallow $HTTPD_URL/smart/repo.git repo
 '
 
+test_expect_success 'auto-discover bundle URI from HTTP clone' '
+	test_when_finished rm -rf repo "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" &&
+	git -C src bundle create "$HTTPD_DOCUMENT_ROOT_PATH/everything.bundle" --all &&
+	git clone --bare --no-local src "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/repo2.git" config \
+		serve.bundleuri $HTTPD_URL/everything.bundle &&
+	GIT_TRACE2_EVENT="$(pwd)/trace.txt" \
+		git -c protocol.version=2 clone \
+		$HTTPD_URL/smart/repo2.git repo &&
+	test_subcommand_inexact git bundle unbundle <trace.txt
+'
+
 # DO NOT add non-httpd-specific tests here, because the last part of this
 # test script is only executed when httpd is available and enabled.
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
                   ` (24 preceding siblings ...)
  2022-02-23 18:31 ` [PATCH 25/25] t5601: basic bundle URI test Derrick Stolee via GitGitGadget
@ 2022-02-23 22:17 ` Ævar Arnfjörð Bjarmason
  2022-02-24 14:11   ` Derrick Stolee
  2022-03-08  8:18   ` Teng Long
  25 siblings, 2 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-02-23 22:17 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, Derrick Stolee, Robin H . Johnson,
	Teng Long, Konstantin Ryabitsev


On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

[Note: The E-Mail address you CC'd for me (presumably, dropped in this
reply) is not my E-Mail address, this one is]

[Also CC-ing some people who have expressed interest in this are, and
would probably like to be kept in the loop going forward]

> There have been several suggestions to improve Git clone speeds and
> reliability by supplementing the Git protocol with static content. The
> Packfile URI [0] feature lets the Git response include URIs that point to
> packfiles that the client must download to complete the request.
>
> Last year, Ævar suggested using bundles instead of packfiles [1] [2]. This
> design has the same benefits to the packfile URI feature because it offloads
> most object downloads to static content fetches. The main advantage over
> packfile URIs is that the remote Git server does not need to know what is in
> those bundles. The Git client tells the server what it downloaded during the
> fetch negotiation afterwards. This includes any chance that the client did
> not have access to those bundles or otherwise failed to access them. I
> agreed that this was a much more desirable way to serve static content, but
> had concerns about the flexibility of that design [3]. I have not heard more
> on the topic since October, so I started investigating this idea myself in
> December, resulting in this RFC.

This timing is both quite fortunate & unfortunate for me, since I'd been
blocked / waiting on various things until very recently to submit a
non-RFC re-roll of (a larger version of) that series you mentioned from
October.

I guess the good news is that we'll have at least one guaranteed very
interested reviewer for each other's patches, and that the design that
makes it into git.git in the end will definitely be well hashed out :)

I won't be able to review this in any detail right at this hour, but
will be doing so. I'd also like to submit what I've got in some form
soon for hashing the two out.

It will be some 50+ patches on the ML in total though related to this
topic, so I think the two of us coming up with some way to manage all of
that for both ourselves & others would be nice. Perhaps we could also
have an off-list (video) chat in real time to clarify/discuss various
thing related to this.

Having said that, basically:

> I focused on maximizing flexibility for the service that organizes and
> serves bundles. This includes:
>
>  * Bundle URIs work for full and partial clones.
>
>  * Bundle URIs can assist with git fetch in addition to git clone.
>
>  * Users can set up bundle servers independent of the remote Git server if
>    they specify the bundle URI via a --bundle-uri argument.
>
> This series is based on the recently-submitted series that adds object
> filters to bundles [4]. There is a slight adjacent-line-add conflict with
> js/apply-partial-clone-filters-recursively, but that is in the last few
> patches, so it will be easy to rebase by the time we have a fully-reviewable
> patch series for those steps.
>
> The general breakdown is as follows:
>
>  * Patch 1 adds documentation for the feature in its entirety.
>
>  * Patches 2-14 add the ability to run ‘git clone --bundle-uri=’
>
>  * Patches 15-17 add bundle fetches to ‘git fetch’ calls
>
>  * Patches 18-25 add a new ‘features’ capability that allows a server to
>    advertise bundle URIs (and in the future, other features).
>
> I consider the patches in their current form to be “RFC quality”. There are
> multiple places where tests are missing or special cases are not checked.
> The goal for this RFC is to seek feedback on the high-level ideas before
> committing to the deep work of creating mergeable patches.

Having skimmed through all of this a *very rough* overview of what
you've got here & the direction I chose to go in is:

1. I didn't go for an initial step of teaching "git bundle" any direct
   remote operation, rather it's straight to  the protocol v2 bits etc.

   I don't think there's anything wrong with that, but didn't see much
   point in teaching  "git bundle" to do that when the eventual state is
   to have "git fetch" do so anyway.

   But in either case the "fetch" parts are either a thin wrapper for
   "git bundle fetch", or a "git bundle fetch/unbundle" is a thin
   equivalent to "init" "fetch" (with bundle-uri) + "unbundle".

2. By far the main difference is that you're heavily leaning on a TOC
   format which encodes certain assumptions that aren't true of
   clones/fetches in general (but probably are for most fetches), whereas
   my design (as we previously discussed) leans entirely on the client
   making sense of the bundle header & content itself.

   E.g. you have a "bundle.tableOfContents.forFetch", but e.g. if you've
   got a git.git clone of "master" and want to:

       git fetch origin refs/heads/todo:todo

   The assumption that we can cleanly separate "clone" from "fetch" breaks
   down.

   I.e. such a thing needs to assume that "clone" implies "you have
   most of the objects you need already" and that "fetch" means "..an
   incremental update thereof", doesn't it?

   Whereas I think (but we'll hash that out) that having a client fetch the
   bundle header and working that out via current reachability checks will
   be just as fast/faster, and such a thing is definitely more
   general/applicable to all sorts/types of fetches.

   (A TOC mechanism might still be good/valuable, but I hope it can be a
   cheap/discardable way to simply cache those bundle headers, or serve
   them up all at once)

3. Ditto "bundle.<id>.timestamp" in the design (presumably assumes not-rewound
   histories), and "requires" (can also currently be inferred from bundle
   headers).

4. I still need to go over your just-submitted "bundle filters"
   (https://lore.kernel.org/git/pull.1159.git.1645638911.gitgitgadget@gmail.com/)
   in detail but by adding a @filter to the file format (good!) presumably the
   "bundle.<id>.filter" amounts to a cache of the headers (which was 100% in line
   with any design I had for such extra information associated with a bundle).

In (partial) summary: I really want to lean more heavily into the
distributed nature of git in that a "bundle clone" be no more special
than the same operation performed locally where "clone/fetch" is
pointed-to a directory containing X number of local bundles, and has to
make sense of whether those help with the clone/fetch operation. I.e. by
parsing their headers & comparing that to the ref advertisement.

Maybe a meta-format TOC will be needed eventually, and I'm not against
such a thing.

I'd just like to make sure we wouldn't add such a thing as a premature
optimization or something that would needlessly complicate the
design. In particular (quoting from a part of 01/25:
    
    +A further optimization is that the client can avoid downloading any
    +bundles if their timestamps are not larger than the stored timestamp.
    +After fetching new bundles, this local timestamp value is updated.

Such caching seems sensible, but to me seems basically redundant to what
you'd get by doing the same with just:

 * A set of dumb bundle files in a directory on a webserver
 * Having unique names for each of those (e.g. naming them
   https://<host>/<hash-of-content>.bundle instead of
   https://<host>/weekly.bundle)
 * Since the content wouldn't change (HTTP headers indicating caching
   forever) a client would have downloaded say the last 6 of your set of
   7 "daily" rotating bundles already, and we'd locally cache their
   entire header, not just a timestamp.

I.e. I think you'd get the same reduction in requests and more from
that. I.e. (to go back to the earlier example) of:

    git fetch origin refs/heads/todo:todo

You'd get the tip of "ls-refs" for TODO, and locally discover that one
of the 6 "daily" bundles whose headers (but not necessarily content) you
already downloaded had that advertised OID, and grab it from there.

The critical difference being that such an arrangement would not be
assuming linear history/additive only (i.e. only fast-forward) which the
"forFetch" + "timestamp" surely does.

And, I think we'll be much better off both in the short and long term by
heavily leaning into HTTP caching features and things like request
pipelining + range requests than a custom meta-index format.

IOW is a TOC format needed if we assume for a moment for the sake of
argument that for a given repository the say 100 bundles you'd
potentially serve up aren't remote at all, but something you've got
mmap()'d and can inspect the bundle headers for and compare with the
remote "ls-refs"?

Because if that's the case we could basically get to the same place via
HTTP caching features, and doing it that way has the advantage of
piggy-backing on all existing caching infrastructure.

Have 1000 computers on your network that keep fetching torvalds/linux?
Stick a proxy configured to cache the first say 1MB of <bundle-base-url>
in front of them.

Now all their requests to discover if the bundles help will be local
(and it would probably make sense to cache the actual content
too). Whereas any type of custom caching strategy would be
per-git-client.

Just food for thought, and sorry that this E-Mail/braindump got so long
already...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-02-23 22:17 ` [PATCH 00/25] [RFC] Bundle URIs Ævar Arnfjörð Bjarmason
@ 2022-02-24 14:11   ` Derrick Stolee
  2022-03-04 13:30     ` Derrick Stolee
  2022-03-08  8:18   ` Teng Long
  1 sibling, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-02-24 14:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, Robin H . Johnson, Teng Long,
	Konstantin Ryabitsev

On 2/23/2022 5:17 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:
> 
> [Note: The E-Mail address you CC'd for me (presumably, dropped in this
> reply) is not my E-Mail address, this one is]

Sorry about that. I appear to have gotten it right on the partial
bundles series, but somehow had a brain fart here.
 
> [Also CC-ing some people who have expressed interest in this are, and
> would probably like to be kept in the loop going forward]

Thanks. The more eyes, the better.

>> There have been several suggestions to improve Git clone speeds and
>> reliability by supplementing the Git protocol with static content. The
>> Packfile URI [0] feature lets the Git response include URIs that point to
>> packfiles that the client must download to complete the request.
>>
>> Last year, Ævar suggested using bundles instead of packfiles [1] [2]. This
>> design has the same benefits to the packfile URI feature because it offloads
>> most object downloads to static content fetches. The main advantage over
>> packfile URIs is that the remote Git server does not need to know what is in
>> those bundles. The Git client tells the server what it downloaded during the
>> fetch negotiation afterwards. This includes any chance that the client did
>> not have access to those bundles or otherwise failed to access them. I
>> agreed that this was a much more desirable way to serve static content, but
>> had concerns about the flexibility of that design [3]. I have not heard more
>> on the topic since October, so I started investigating this idea myself in
>> December, resulting in this RFC.
> 
> This timing is both quite fortunate & unfortunate for me, since I'd been
> blocked / waiting on various things until very recently to submit a
> non-RFC re-roll of (a larger version of) that series you mentioned from
> October.
> 
> I guess the good news is that we'll have at least one guaranteed very
> interested reviewer for each other's patches, and that the design that
> makes it into git.git in the end will definitely be well hashed out :)
> 
> I won't be able to review this in any detail right at this hour, but
> will be doing so. I'd also like to submit what I've got in some form
> soon for hashing the two out.
> 
> It will be some 50+ patches on the ML in total though related to this
> topic, so I think the two of us coming up with some way to manage all of
> that for both ourselves & others would be nice. Perhaps we could also
> have an off-list (video) chat in real time to clarify/discuss various
> thing related to this.

I look forward to seeing your full implementation. There are many things
about your RFC that left me confused and not fully understanding your
vision. You reference a few of them further down, so I'll mention them
specifically in that context.

> Having said that, basically:
> 
>> I focused on maximizing flexibility for the service that organizes and
>> serves bundles. This includes:
>>
>>  * Bundle URIs work for full and partial clones.
>>
>>  * Bundle URIs can assist with git fetch in addition to git clone.
>>
>>  * Users can set up bundle servers independent of the remote Git server if
>>    they specify the bundle URI via a --bundle-uri argument.
>>
>> This series is based on the recently-submitted series that adds object
>> filters to bundles [4]. There is a slight adjacent-line-add conflict with
>> js/apply-partial-clone-filters-recursively, but that is in the last few
>> patches, so it will be easy to rebase by the time we have a fully-reviewable
>> patch series for those steps.
>>
>> The general breakdown is as follows:
>>
>>  * Patch 1 adds documentation for the feature in its entirety.
>>
>>  * Patches 2-14 add the ability to run ‘git clone --bundle-uri=’
>>
>>  * Patches 15-17 add bundle fetches to ‘git fetch’ calls
>>
>>  * Patches 18-25 add a new ‘features’ capability that allows a server to
>>    advertise bundle URIs (and in the future, other features).
>>
>> I consider the patches in their current form to be “RFC quality”. There are
>> multiple places where tests are missing or special cases are not checked.
>> The goal for this RFC is to seek feedback on the high-level ideas before
>> committing to the deep work of creating mergeable patches.
> 
> Having skimmed through all of this a *very rough* overview of what
> you've got here & the direction I chose to go in is:
> 
> 1. I didn't go for an initial step of teaching "git bundle" any direct
>    remote operation, rather it's straight to  the protocol v2 bits etc.
> 
>    I don't think there's anything wrong with that, but didn't see much
>    point in teaching  "git bundle" to do that when the eventual state is
>    to have "git fetch" do so anyway.
> 
>    But in either case the "fetch" parts are either a thin wrapper for
>    "git bundle fetch", or a "git bundle fetch/unbundle" is a thin
>    equivalent to "init" "fetch" (with bundle-uri) + "unbundle".

I'm not married to this specific implementation, although having the
bundle fetch be something a user could run independently of 'git fetch'
or 'git clone' might be desirable.

> 2. By far the main difference is that you're heavily leaning on a TOC
>    format which encodes certain assumptions that aren't true of
>    clones/fetches in general (but probably are for most fetches), whereas
>    my design (as we previously discussed) leans entirely on the client
>    making sense of the bundle header & content itself.
> 
>    E.g. you have a "bundle.tableOfContents.forFetch", but e.g. if you've
>    got a git.git clone of "master" and want to:
> 
>        git fetch origin refs/heads/todo:todo
> 
>    The assumption that we can cleanly separate "clone" from "fetch" breaks
>    down.
> 
>    I.e. such a thing needs to assume that "clone" implies "you have
>    most of the objects you need already" and that "fetch" means "..an
>    incremental update thereof", doesn't it?
> 
>    Whereas I think (but we'll hash that out) that having a client fetch the
>    bundle header and working that out via current reachability checks will
>    be just as fast/faster, and such a thing is definitely more
>    general/applicable to all sorts/types of fetches.
> 
>    (A TOC mechanism might still be good/valuable, but I hope it can be a
>    cheap/discardable way to simply cache those bundle headers, or serve
>    them up all at once)

Note that the TOC is completely optional, and you could serve a bundle from
the advertised URI. The biggest difference is that the TOC allows flexibility
that your design did not (or at least, I could not detect how to get that
flexibility out of it).

The biggest thing that I am trying to understand from your design is something
I'm going to make an educated guess about: if you _do_ break the repo into
multiple bundles, then the bundle URI advertisement sends multiple URIs that
must _all_ be inspected to get the full bundle content.

The change in the TOC model is that there can be multiple potential servers
hosting multiple bundles, and the bundle URI advertisement sending multiple
URIs means "any of these would work. Try one close to you, or use the others
as a fallback." This seems like the biggest incompatibility in our approaches
based on my understanding.

Finally, the biggest thing that is possible in my model that is not in yours
is the idea of an independent bundle server that is not known by the origin
server. It seems that everything relies on the bundle URI advertisement,
while this implementation allows a --bundle-uri=<X> at clone time or a
configured URI.

Again, the TOC is critical here, so there can be one well-known URI for the
TOC that doesn't change over time, and can be used to fetch multiple bundles.

And perhaps you are intending the bundle URI to point to a directory listing,
but that seems like it would need a format that Git understands. And you
mention parsing bundle headers, which I would really like to see how you
intend to implement that without downloading too much data from the remotes.
The TOC can be extended to include whatever header information you intend to
use for these decisions.

> 3. Ditto "bundle.<id>.timestamp" in the design (presumably assumes not-rewound
>    histories), and "requires" (can also currently be inferred from bundle
>    headers).
> 
> 4. I still need to go over your just-submitted "bundle filters"
>    (https://lore.kernel.org/git/pull.1159.git.1645638911.gitgitgadget@gmail.com/)
>    in detail but by adding a @filter to the file format (good!) presumably the
>    "bundle.<id>.filter" amounts to a cache of the headers (which was 100% in line
>    with any design I had for such extra information associated with a bundle).
> 
> In (partial) summary: I really want to lean more heavily into the
> distributed nature of git in that a "bundle clone" be no more special
> than the same operation performed locally where "clone/fetch" is
> pointed-to a directory containing X number of local bundles, and has to
> make sense of whether those help with the clone/fetch operation. I.e. by
> parsing their headers & comparing that to the ref advertisement.

Cloning and fetching from bundles is fundamentally different from the
dynamic fetch negotiation of the Git protocol, so I see this intent as
a drawback to your approach, not a strength.

> Maybe a meta-format TOC will be needed eventually, and I'm not against
> such a thing.
> 
> I'd just like to make sure we wouldn't add such a thing as a premature
> optimization or something that would needlessly complicate the
> design. In particular (quoting from a part of 01/25:
>     
>     +A further optimization is that the client can avoid downloading any
>     +bundles if their timestamps are not larger than the stored timestamp.
>     +After fetching new bundles, this local timestamp value is updated.
> 
> Such caching seems sensible, but to me seems basically redundant to what
> you'd get by doing the same with just:
> 
>  * A set of dumb bundle files in a directory on a webserver
>  * Having unique names for each of those (e.g. naming them
>    https://<host>/<hash-of-content>.bundle instead of
>    https://<host>/weekly.bundle)

The bundles on my prototype server do have unique names (the timestamp
is part of the name), even though they don't include a hash of the
contents. This is mostly for readability when a human looks at the
TOC, since it does not affect the correctness. A hash could be used in
the name if the bundle server wanted, but it also isn't required.

>  * Since the content wouldn't change (HTTP headers indicating caching
>    forever) a client would have downloaded say the last 6 of your set of
>    7 "daily" rotating bundles already, and we'd locally cache their
>    entire header, not just a timestamp.

This model either requires Git understanding how to walk that
directory of files on the webserver OR for the origin server to be
aware of the hash of every bundle that might be fetched. That coupling
is exactly the kind of thing I think is too difficult for serving with
packfile URIs.

> I.e. I think you'd get the same reduction in requests and more from
> that. I.e. (to go back to the earlier example) of:
> 
>     git fetch origin refs/heads/todo:todo
> 
> You'd get the tip of "ls-refs" for TODO, and locally discover that one
> of the 6 "daily" bundles whose headers (but not necessarily content) you
> already downloaded had that advertised OID, and grab it from there.

It would be easy to special-case custom refspecs as not wanting bundle
data. This goes back to the idea that serving from static content is
_not_ dynamic enough to handle these cases and therefore should not be
a major concern for the design.

Everything about using static content should be a _heuristic_ that
works for most users most-frequent operations. Most users have a
standard refs/heads/*:refs/remotes/origin/* refspec when fetching, so
they want as much data as they can get from that refspec from the
bundles.

> The critical difference being that such an arrangement would not be
> assuming linear history/additive only (i.e. only fast-forward) which the
> "forFetch" + "timestamp" surely does.

Generally, refs/heads/* does move forward. Yes, if someone force-pushes
then there is a chance that some extra data from their older version is
stored in a bundle somewhere.

> And, I think we'll be much better off both in the short and long term by
> heavily leaning into HTTP caching features and things like request
> pipelining + range requests than a custom meta-index format.

The point of TOC is not to split bundles small enough to avoid those
features. Those things are still possible, and I expect that anyone
organizing a bundle server would want one very large bundle to handle
the majority of the repo data. Range requests are critical for allowing
resumable clones for that large data.

> IOW is a TOC format needed if we assume for a moment for the sake of
> argument that for a given repository the say 100 bundles you'd
> potentially serve up aren't remote at all, but something you've got
> mmap()'d and can inspect the bundle headers for and compare with the
> remote "ls-refs"?
> 
> Because if that's the case we could basically get to the same place via
> HTTP caching features, and doing it that way has the advantage of
> piggy-backing on all existing caching infrastructure.
1. 100 bundles is probably at least triple the maximum I would
   recommend, unless you have a have a super-active monorepo that
   wants bundles computed hourly.

2. You keep saying "inspecting the headers" to make a decision, but
   I have yet to see you present a way that you would organize bundles
   such that that decision isn't just "get all bundles I haven't
   already downloaded" and then be equivalent to the timestamp
   heuristic.

> Have 1000 computers on your network that keep fetching torvalds/linux?
> Stick a proxy configured to cache the first say 1MB of <bundle-base-url>
> in front of them.
> 
> Now all their requests to discover if the bundles help will be local
> (and it would probably make sense to cache the actual content
> too). Whereas any type of custom caching strategy would be
> per-git-client.
> 
> Just food for thought, and sorry that this E-Mail/braindump got so long
> already...

I recommend that when you have the time you look carefully at Patch 1
which actually includes the design document for the feature. That doc
contains concrete descriptions of all the ways the presented design is
flexible for users and server administrators.

In particular, I think an extremely valuable aspect of this design is
the ability for someone to spin up their own bundle server in their
build lab and modify their build scripts to fetch from that bundle
server before going to the origin remote. If your design can handle
that scenario, then I'd love to see how.

Take your time preparing your patches, and I'll give them a careful
read then they are available. I expect that you have some great ideas
that could augment this design, and I'm sure that your code will have
some implementation details that are better than mine. We can find a
way to combine to get the best of both worlds.

That said, I do feel that I must not have done an adequate job of
describing how important I think it is to have a more flexible design
than what you presented before. You continue to push back that your
ideas are sufficient, but I disagree. Unfortunately, I cannot be
sure either way until I see your full implementation.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/25] docs: document bundle URI standard
  2022-02-23 18:30 ` [PATCH 01/25] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-03-02  2:28   ` Elijah Newren
  2022-03-02 14:06     ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Elijah Newren @ 2022-03-02  2:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Taylor Blau, aevar, Derrick Stolee

On Wed, Feb 23, 2022 at 10:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <derrickstolee@github.com>
>
> Introduce the idea of bundle URIs to the Git codebase through an
> aspirational design document. This document includes the full design
> intended to include the feature in its fully-implemented form. This will
> take several steps as detailed in the Implementation Plan section.
>
> By committing this document now, it can be used to motivate changes
> necessary to reach these final goals. The design can still be altered as
> new information is discovered.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/bundle-uri.txt | 404 +++++++++++++++++++++++++
>  1 file changed, 404 insertions(+)
>  create mode 100644 Documentation/technical/bundle-uri.txt
>
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..5c0b9e8e3ef
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt
> @@ -0,0 +1,404 @@
> +Bundle URIs
> +===========
> +
> +Bundle URIs are locations where Git can download one or more bundles in
> +order to bootstrap the object database in advance of fetching the remaining
> +objects from a remote.
> +
> +One goal is to speed up clones and fetches for users with poor network
> +connectivity to the origin server. Another benefit is to allow heavy users,
> +such as CI build farms, to use local resources for the majority of Git data
> +and thereby reducing the load on the origin server.
> +
> +To enable the bundle URI feature, users can specify a bundle URI using
> +command-line options or the origin server can advertise one or more URIs
> +via a protocol v2 capability.
> +
> +Server requirements
> +-------------------
> +
> +To provide a server-side implementation of bundle servers, no other parts
> +of the Git protocol are required. This allows server maintainers to use
> +static content solutions such as CDNs in order to serve the bundle files.
> +
> +At the current scope of the bundle URI feature, all URIs are expected to
> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
> +request to that URL. The server could include authentication requirements
> +to those requests with the aim of triggering the configured credential
> +helper for secure access.

So folks using ssh to clone, who have never configured a credential
helper before, might need to start doing so.  This makes sense and I
don't think I see a way around it, but we might want to call it out a
bit more prominently.  Cloning over https seems to be rare in the
various setups I've seen (I know there are others where it's common,
just noting that many users may never have had to use https for
cloning before), so this is a potentially big point for users to be
aware of in terms of setup.

> +
> +Assuming a `200 OK` response from the server, the content at the URL is
> +expected to be of one of two forms:
> +
> +1. Bundle: A Git bundle file of version 2 or higher.
> +
> +2. Table of Contents: A plain-text file that is parsable using Git's
> +   config file parser. This file describes one or more bundles that are
> +   accessible from other URIs.
> +
> +Any other data provided by the server is considered erroneous.
> +
> +Table of Contents Format
> +------------------------
> +
> +If the content at a bundle URI is not a bundle, then it is expected to be
> +a plaintext file that is parseable using Git's config parser. This file
> +can contain any list of key/value pairs, but only a fixed set will be
> +considered by Git.
> +
> +bundle.tableOfContents.version::
> +       This value provides a version number for the table of contents. If
> +       a future Git change enables a feature that needs the Git client to
> +       react to a new key in the table of contents file, then this version
> +       will increment. The only current version number is 1, and if any
> +       other value is specified then Git will fail to use this file.

What does "Git will fail to use this file" mean?  Does it mean Git
will throw an error?  clone/fetch without the aid of bundle uris but
show a warning?  something else?

> +
> +bundle.tableOfContents.forFetch::
> +       This boolean value is a signal to the Git client that the bundle
> +       server has designed its bundle organization to assist `git fetch`
> +       commands in addition to `git clone` commands. If this is missing,
> +       Git should not use this table of contents for `git fetch` as it
> +       may lead to excess data downloads.
> +
> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.
> +
> +bundle.<id>.uri::
> +       This string value is the URI for downloading bundle `<id>`. If
> +       the URI begins with a protocol (`http://` or `https://`) then the
> +       URI is absolute. Otherwise, the URI is interpreted as relative to
> +       the URI used for the table of contents. If the URI begins with `/`,
> +       then that relative path is relative to the domain name used for
> +       the table of contents. (This use of relative paths is intended to
> +       make it easier to distribute a set of bundles across a large
> +       number of servers or CDNs with different domain names.)
> +
> +bundle.<id>.timestamp::
> +       (Optional) This value is the number of seconds since Unix epoch
> +       (UTC) that this bundle was created. This is used as an approximation
> +       of a point in time that the bundle matches the data available at
> +       the origin server.

As an approximation, is there a risk of drift where the user has
timetamp A which is very close to B and makes decisions based upon it
which results in their not getting dependent objects they need?  Or is
it just an optimization for them to only download certain bundles and
look at them, and then they'll iteratively go back and download more
(as per the 'requires' field below) if they don't have enough objects
to unbundle what they previously downloaded?

> +
> +bundle.<id>.requires::
> +       (Optional) This string value represents the ID of another bundle.
> +       When present, the server is indicating that this bundle contains a
> +       thin packfile. If the client does not have all necessary objects
> +       to unbundle this packfile, then the client can download the bundle
> +       with the `requires` ID and try again. (Note: it may be beneficial
> +       to allow the server to specify multiple `requires` bundles.)
> +
> +bundle.<id>.filter::
> +       (Optional) This string value represents an object filter that
> +       should also appear in the header of this bundle. The server uses
> +       this value to differentiate different kinds of bundles from which
> +       the client can choose those that match their object filters.
> +
> +Here is an example table of contents:
> +
> +```
> +[bundle "tableofcontents"]
> +       version = 1
> +       forFetch = true
> +
> +[bundle "2022-02-09-1644442601-daily"]
> +       uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442601-daily.bundle
> +       timestamp = 1644442601
> +       requires = 2022-02-02-1643842562
> +
> +[bundle "2022-02-02-1643842562"]
> +       uri = https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842562.bundle
> +       timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +       uri = 2022-02-09-1644442631-daily-blobless.bundle
> +       timestamp = 1644442631
> +       requires = 2022-02-02-1643842568-blobless
> +       filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +       uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +       timestamp = 1643842568
> +       filter = blob:none
> +```
> +
> +This example uses all of the keys in the specification. Suppose that the
> +table of contents was found at the URI
> +`https://gitbundleserver.z13.web.core.windows.net/git/git/` and so the
> +two blobless bundles have the following fully-expanded URIs:
> +
> +* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-09-1644442631-daily-blobless.bundle`
> +* `https://gitbundleserver.z13.web.core.windows.net/git/git/2022-02-02-1643842568-blobless.bundle`
> +
> +Advertising Bundle URIs
> +-----------------------
> +
> +If a user knows a bundle URI for the repository they are cloning, then they
> +can specify that URI manually through a command-line option. However, a
> +Git host may want to advertise bundle URIs during the clone operation,
> +helping users unaware of the feature.
> +
> +Note: The exact details of this section are not final. This is a possible
> +way that Git could auto-discover bundle URIs, but is not a committed
> +direction until that feature is implemented.
> +
> +The only thing required for this feature is that the server can advertise
> +one or more bundle URIs. One way to implement this is to create a new
> +protocol v2 capability that advertises recommended features, including
> +bundle URIs.
> +
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with lowest latency by some exploratory checks. It is up to the
> +server operator to decide if having multiple URIs is preferable to a
> +single URI that is geodistributed through server-side infrastructure.
> +
> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle URI that was advertised by
> +   the remote server.
> +
> +2. The client downloads the file at the bundle URI. If it is a bundle, then
> +   it is unbundled with the refs being stored in `refs/bundle/*`.
> +
> +3. If the file is instead a table of contents, then the bundles with
> +   matching `filter` settings are sorted by `timestamp` (if present),
> +   and the most-recent bundle is downloaded.
> +
> +4. If the current bundle header mentions negative commid OIDs that are not
> +   in the object database, then download the `requires` bundle and try
> +   again.
> +
> +5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
> +   already in the object database somehow), then unbundle all of the
> +   bundles in reverse order, placing references within `refs/bundle/*`.
> +
> +6. The client performs a fetch negotiation with the origin server, using
> +   the `refs/bundle/*` references as `have`s and the server's ref
> +   advertisement as `want`s. This results in a pack-file containing the
> +   remaining objects requested by the clone but not in the bundles.

Does step 6 potentially involve a new, second connection to the origin
server?  I'm wondering about timeouts closing the original connection
while the client is downloading the bundle uris.  Will the client
handle that automatically, or will they potentially be forced to
re-issue the clone/fetch command?  I'm also wondering if we want to be
"nice" and pre-emptively close the original connection to the server
while we fetch the bundles -- for example, some servers have a
threadpool for processing fetch/clone requests and will only serve a
limited number; IIRC Gerrit operates this way.  I have no idea if
that's a good idea or a horrible one.  If a second connection is tried
automatically, will the user potentially be forced to re-enter
connection credentials again?  And is there a risk that after the
second connection, there are new bundle uris for the client to go
fetch (and/or a removal of the original ones, e.g. replacing the old
"daily" bundle with a new one)?  Does this possibly cause us some
timestamp confusion as I noted earlier?

> +Note that during a clone we expect that all bundles will be required. The
> +client could be extended to download all bundles in parallel, though they
> +need to be unbundled in the correct order.

What does required mean?  That the origin server can refuse to service
requests if the client does not have commits found in said bundles?
That new enough Git clients are expected to download all the bundles
(and no config option will be provided to users to just do traditional
negotation without first downloading them)?  Something else?

If users are doing a single-branch clone or a shallow clone, will the
requirements still hold?  (I'm not a fan of shallow clones, but they
are sadly used in a number of places and I'm curious how the features
interact or conflict.)

> +
> +If a table of contents is used and it contains
> +`bundle.tableOfContents.forFetch = true`, then the client can store a
> +config value indicating to reuse this URI for later `git fetch` commands.
> +In this case, the client will also want to store the maximum timestamp of
> +a downloaded bundle.
> +
> +Fetching with Bundle URIs
> +-------------------------
> +
> +When the client fetches new data, it can decide to fetch from bundle
> +servers before fetching from the origin remote. This could be done via
> +a command-line option, but it is more likely useful to use a config value
> +such as the one specified during the clone.
> +
> +The fetch operation follows the same procedure to download bundles from a
> +table of contents (although we do _not_ want to use parallel downloads
> +here). We expect that the process will end because all negative commit
> +OIDs in a thin bundle are already in the object database.
> +
> +A further optimization is that the client can avoid downloading any
> +bundles if their timestamps are not larger than the stored timestamp.
> +After fetching new bundles, this local timestamp value is updated.

What about the transition period where repositories were cloned before
bundle URIs became a thing (or were turned on within an organization),
and the user then goes to fetch?  Will git go and download a bunch of
useless large bundles (and maybe one small useful one) the day this
feature is turned on, making users think this is a bad feature?

Should git perhaps treat the already-cloned case without a stored
timestamp as a request to store a timestamp of "now", making it ignore
the current bundles?  (If so, are there races where it later goes to
grab a bundle slightly newer than "now" but which depends on an older
bundle that has some objects we are missing?)

> +
> +Choices for Bundle Server Organization
> +--------------------------------------
> +
> +With this standard, there are many options available to the bundle server
> +in how it organizes Git data into bundles.
> +
> +* Bundles can have whatever name the server desires. This name could refer
> +  to immutable data by using a hash of the bundle contents. However, this
> +  means that a new URI will be needed after every update of the content.
> +  This might be acceptable if the server is advertising the URI (and the
> +  server is aware of new bundles being generated) but would not be
> +  ergonomic for users using the command line option.
> +
> +* If the server intends to only serve full clones, then the advertised URI
> +  could be a bundle file without a filter that is updated at some cadence.
> +
> +* If the server intends to serve clones, but wants clients to choose full
> +  or blobless partial clones, then the server can use a table of contents
> +  that lists two non-thin bundles and the client chooses between them only
> +  by the `bundle.<id>.filter` values.
> +
> +* If the server intends to improve clones with parallel downloads, then it
> +  can use a table of contents and split the repository into time intervals
> +  of approximately similar-sized bundles. Using `bundle.<id>.timestamp`
> +  and `bundle.<id>.requires` values helps the client decide the order to
> +  unbundle the bundles.
> +
> +* If the server intends to serve fetches, then it can use a table of
> +  contents to advertise a list of bundles that are updated regularly. The
> +  most recent bundles could be generated on short intervals, such as hourly.
> +  These small bundles could be merged together at some rate, such as 24
> +  hourly bundles merging into a single daily bundle. At some point, it may
> +  be beneficial to create a bundle that stores the majority of the history,
> +  such as all data older than 30 days.
> +
> +These recommendations are intended only as suggestions. Each repository is
> +different and every Git server has different needs. Hopefully the bundle
> +URI feature and its table of contents is flexible enough to satisfy all
> +needs. If not, then the format can be extended.
> +
> +Error Conditions
> +----------------
> +
> +If the Git client discovers something unexpected while downloading
> +information according to a bundle URI or the table of contents found at
> +that location, then Git can ignore that data and continue as if it was not
> +given a bundle URI. The remote Git server is the ultimate source of truth,
> +not the bundle URI.

This seems to contradict the earlier statement that for clones all
bundle URIs would be "required".  I like the idea of bundle URIs only
being an optimization that can be ignored, just noting the potential
confusion.

> +
> +Here are a few example error conditions:
> +
> +* The client fails to connect with a server at the given URI or a connection
> +  is lost without any chance to recover.
> +
> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`).
> +
> +* The client receives data that is not parsable as a bundle or table of
> +  contents.
> +
> +* The table of contents describes a directed cycle in the
> +  `bundle.<id>.requires` links.
> +
> +* A bundle includes a filter that does not match expectations.
> +
> +* The client cannot unbundle the bundles because the negative commit OIDs
> +  are not in the object database and there are no more
> +  `bundle.<id>.requires` links to follow.

Should these result in warnings so that folks can diagnose slower
connections, or should they be squelched?  (I'm thinking particularly
of the `401 Not Authorized` case in combination with users never
having had to use a credential helper before.)

> +
> +There are also situations that could be seen as wasteful, but are not
> +error conditions:
> +
> +* The downloaded bundles contain more information than is requested by
> +  the clone or fetch request. A primary example is if the user requests
> +  a clone with `--single-branch` but downloads bundles that store every
> +  reachable commit from all `refs/heads/*` references. This might be
> +  initially wasteful, but perhaps these objects will become reachable by
> +  a later ref update that the client cares about.

Ah, this answers my --single-branch question.  Still curious about the
--shallow misfeature (yeah, I'm a bit opinionated) and how it
interacts, though.

> +* A bundle download during a `git fetch` contains objects already in the
> +  object database. This is probably unavoidable if we are using bundles
> +  for fetches, since the client will almost always be slightly ahead of
> +  the bundle servers after performing its "catch-up" fetch to the remote
> +  server. This extra work is most wasteful when the client is fetching
> +  much more frequently than the server is computing bundles, such as if
> +  the client is using hourly prefetches with background maintenance, but
> +  the server is computing bundles weekly. For this reason, the client
> +  should not use bundle URIs for fetch unless the server has explicitly
> +  recommended it through the `bundle.tableOfContents.forFetch = true`
> +  value.

Makes sense, and certainly reduces my worry about the "transition
period" where users have existing clones that pre-dated the
introduction of the bundle URI feature.  But I'm still kind of curious
about how we handle that transition for folks that have recommended
their bundleUris for fetches.


> +
> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features for full review:
> +
> +1. Update the `git bundle create` command to take a `--filter` option,
> +   allowing bundles to store packfiles restricted to an object filter.
> +   This is necessary for using bundle URIs to benefit partial clones.
> +
> +2. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include the full understanding of a table of contents, but
> +   will not integrate with `git fetch` or allow the server to advertise
> +   URIs.
> +
> +3. Integrate bundle URIs into `git fetch`, triggered by config values that
> +   are set during `git clone` if the server indicates that the bundle
> +   strategy works for fetches.
> +
> +4. Create a new "recommended features" capability in protocol v2 where the
> +   server can recommend features such as bundle URIs, partial clone, and
> +   sparse-checkout. These features will be extremely limited in scope and
> +   blocked by opt-in config options. The design for this portion could be
> +   replaced by a "bundle-uri" capability that only advertises bundle URIs
> +   and no other information.
> +
> +Related Work: Packfile URIs
> +---------------------------
> +
> +The Git protocol already has a capability where the Git server can list
> +a set of URLs along with the packfile response when serving a client
> +request. The client is then expected to download the packfiles at those
> +locations in order to have a complete understanding of the response.
> +
> +This mechanism is used by the Gerrit server (implemented with JGit) and
> +has been effective at reducing CPU load and improving user performance for
> +clones.
> +
> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.
> +
> +Further, this implementation is extremely hard to make work with fetches.
> +
> +Related Work: GVFS Cache Servers
> +--------------------------------
> +
> +The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
> +the Git project before Git's partial clone was created. One feature of this
> +protocol is the idea of a "cache server" which can be colocated with build
> +machines or developer offices to transfer Git data without overloading the
> +central server.
> +
> +The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
> +endpoint, which allows downloading an object on-demand. This is a critical
> +piece of the filesystem virtualization of that product.
> +
> +However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
> +endpoint. Given an optional timestamp, the cache server responds with a list
> +of precomputed packfiles containing the commits and trees that were introduced
> +in those time intervals.
> +
> +The cache server computes these "prefetch" packfiles using the following
> +strategy:
> +
> +1. Every hour, an "hourly" pack is generated with a given timestamp.
> +2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
> +3. Nightly, all prefetch packs more than 30 days old are rolled up into
> +   one pack.
> +
> +When a user runs `gvfs clone` or `scalar clone` against a repo with cache
> +servers, the client requests all prefetch packfiles, which is at most
> +`24 + 30 + 1` packfiles downloading only commits and trees. The client
> +then follows with a request to the origin server for the references, and
> +attempts to checkout that tip reference. (There is an extra endpoint that
> +helps get all reachable trees from a given commit, in case that commit
> +was not already in a prefetch packfile.)
> +
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded. Most
> +users fetch hourly, so they get at most one hourly prefetch pack. Users
> +whose machines have been off or otherwise have not fetched in over 30 days
> +might redownload all prefetch packfiles. This is rare.
> +
> +It is important to note that the clients always contact the origin server
> +for the refs advertisement, so the refs are frequently "ahead" of the
> +prefetched pack data. The missing objects are downloaded on-demand using
> +the `GET gvfs/objects/{oid}` requests, when needed by a command such as
> +`git checkout` or `git log`. Some Git optimizations disable checks that
> +would cause these on-demand downloads to be too aggressive.
> +
> +See Also
> +--------
> +
> +[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> +    An earlier RFC for a bundle URI feature.
> +
> +[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
> +    The GVFS Protocol
> \ No newline at end of file
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/25] docs: document bundle URI standard
  2022-03-02  2:28   ` Elijah Newren
@ 2022-03-02 14:06     ` Derrick Stolee
  2022-03-03  2:19       ` Elijah Newren
  0 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-03-02 14:06 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Taylor Blau,
	Ævar Arnfjörð Bjarmason

On 3/1/2022 9:28 PM, Elijah Newren wrote:
> On Wed, Feb 23, 2022 at 10:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> Introduce the idea of bundle URIs to the Git codebase through an
>> aspirational design document. This document includes the full design
>> intended to include the feature in its fully-implemented form. This will
>> take several steps as detailed in the Implementation Plan section.
>>
>> By committing this document now, it can be used to motivate changes
>> necessary to reach these final goals. The design can still be altered as
>> new information is discovered.
>>
>> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>> ---
>>  Documentation/technical/bundle-uri.txt | 404 +++++++++++++++++++++++++
>>  1 file changed, 404 insertions(+)
>>  create mode 100644 Documentation/technical/bundle-uri.txt
>>
>> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
>> new file mode 100644
>> index 00000000000..5c0b9e8e3ef
>> --- /dev/null
>> +++ b/Documentation/technical/bundle-uri.txt
>> @@ -0,0 +1,404 @@
>> +Bundle URIs
>> +===========
>> +
>> +Bundle URIs are locations where Git can download one or more bundles in
>> +order to bootstrap the object database in advance of fetching the remaining
>> +objects from a remote.
>> +
>> +One goal is to speed up clones and fetches for users with poor network
>> +connectivity to the origin server. Another benefit is to allow heavy users,
>> +such as CI build farms, to use local resources for the majority of Git data
>> +and thereby reducing the load on the origin server.
>> +
>> +To enable the bundle URI feature, users can specify a bundle URI using
>> +command-line options or the origin server can advertise one or more URIs
>> +via a protocol v2 capability.
>> +
>> +Server requirements
>> +-------------------
>> +
>> +To provide a server-side implementation of bundle servers, no other parts
>> +of the Git protocol are required. This allows server maintainers to use
>> +static content solutions such as CDNs in order to serve the bundle files.
>> +
>> +At the current scope of the bundle URI feature, all URIs are expected to
>> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
>> +request to that URL. The server could include authentication requirements
>> +to those requests with the aim of triggering the configured credential
>> +helper for secure access.
> 
> So folks using ssh to clone, who have never configured a credential
> helper before, might need to start doing so.  This makes sense and I
> don't think I see a way around it, but we might want to call it out a
> bit more prominently.  Cloning over https seems to be rare in the
> various setups I've seen (I know there are others where it's common,
> just noting that many users may never have had to use https for
> cloning before), so this is a potentially big point for users to be
> aware of in terms of setup.

We could even go so far as to skip the credential manager if the
Git remote is SSH, requiring the bundle URIs to work only if
unauthenticated. Likely, we will want clear knobs that the user
can toggle for how to behave when a bundle URI is advertised with
modes such as

* always attempt with authentication
* always attempt, but skip authentication when Git remote is not HTTP(S)
* attempt only when Git remote is HTTP(S)
* never attempt

These are all things that are separate from the bundle URI standard
being documented here, but instead would be saved for the last set of
patches that allow a server to advertise a bundle URI at clone time.

>> +bundle.tableOfContents.version::
>> +       This value provides a version number for the table of contents. If
>> +       a future Git change enables a feature that needs the Git client to
>> +       react to a new key in the table of contents file, then this version
>> +       will increment. The only current version number is 1, and if any
>> +       other value is specified then Git will fail to use this file.
> 
> What does "Git will fail to use this file" mean?  Does it mean Git
> will throw an error?  clone/fetch without the aid of bundle uris but
> show a warning?  something else?

I mean "Git will continue as if the bundle URI was not specified". It would
show a warning, probably. This could be converted into a failure if valuable
for the user, but I don't expect that will be the default behavior.

>> +bundle.<id>.timestamp::
>> +       (Optional) This value is the number of seconds since Unix epoch
>> +       (UTC) that this bundle was created. This is used as an approximation
>> +       of a point in time that the bundle matches the data available at
>> +       the origin server.
> 
> As an approximation, is there a risk of drift where the user has
> timetamp A which is very close to B and makes decisions based upon it
> which results in their not getting dependent objects they need?  Or is
> it just an optimization for them to only download certain bundles and
> look at them, and then they'll iteratively go back and download more
> (as per the 'requires' field below) if they don't have enough objects
> to unbundle what they previously downloaded?

The user doesn't ever generate the timestamp. It saves the timestamp
from the most-recent bundle it downloaded. The only risk of timestamp
drift is if the server has multiple machines generating different sets
of bundles, and places those machines behind a load balancer.

This is something the server can control, likely by having one job
generate the bundle set and then distributing them to various storage
locations.

>> +Cloning with Bundle URIs
>> +------------------------
>> +
>> +The primary need for bundle URIs is to speed up clones. The Git client
>> +will interact with bundle URIs according to the following flow:
>> +
>> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
>> +   option _or_ the client discovers a bundle URI that was advertised by
>> +   the remote server.
>> +
>> +2. The client downloads the file at the bundle URI. If it is a bundle, then
>> +   it is unbundled with the refs being stored in `refs/bundle/*`.
>> +
>> +3. If the file is instead a table of contents, then the bundles with
>> +   matching `filter` settings are sorted by `timestamp` (if present),
>> +   and the most-recent bundle is downloaded.
>> +
>> +4. If the current bundle header mentions negative commid OIDs that are not
>> +   in the object database, then download the `requires` bundle and try
>> +   again.
>> +
>> +5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
>> +   already in the object database somehow), then unbundle all of the
>> +   bundles in reverse order, placing references within `refs/bundle/*`.
>> +
>> +6. The client performs a fetch negotiation with the origin server, using
>> +   the `refs/bundle/*` references as `have`s and the server's ref
>> +   advertisement as `want`s. This results in a pack-file containing the
>> +   remaining objects requested by the clone but not in the bundles.
> 
> Does step 6 potentially involve a new, second connection to the origin
> server?  I'm wondering about timeouts closing the original connection
> while the client is downloading the bundle uris.  Will the client
> handle that automatically, or will they potentially be forced to
> re-issue the clone/fetch command?  I'm also wondering if we want to be
> "nice" and pre-emptively close the original connection to the server
> while we fetch the bundles -- for example, some servers have a
> threadpool for processing fetch/clone requests and will only serve a
> limited number; IIRC Gerrit operates this way.  I have no idea if
> that's a good idea or a horrible one.  If a second connection is tried
> automatically, will the user potentially be forced to re-enter
> connection credentials again?  And is there a risk that after the
> second connection, there are new bundle uris for the client to go
> fetch (and/or a removal of the original ones, e.g. replacing the old
> "daily" bundle with a new one)?  Does this possibly cause us some
> timestamp confusion as I noted earlier?

If the user is cloning over HTTPS, then the connections are stateless
and this is not any different than how it works today.

When using SSH, we will probably want to close the SSH connection on
the client and then reopen it to avoid keeping that connection open
during the download. The implementation in this RFC does _not_ do this,
but I think it would be valuable to do.

>> +Note that during a clone we expect that all bundles will be required. The
>> +client could be extended to download all bundles in parallel, though they
>> +need to be unbundled in the correct order.
> 
> What does required mean?  That the origin server can refuse to service
> requests if the client does not have commits found in said bundles?
> That new enough Git clients are expected to download all the bundles
> (and no config option will be provided to users to just do traditional
> negotation without first downloading them)?  Something else?

The assumption I'm making here is that all but on bundle in the table
of contents contains a thin pack, depending on an "earlier" bundle.
The client would be unsuccessful unbundling any single bundle except
the earliest one first.

The benefit of this assumption is that we could also implement parallel
downloads of all bundles in the future.

This assumes that there is no way to organize the bundles to communicate
that a user might want only the objects reachable from the default branch,
but also some users want every reachable object. Such an organization
would require extra information to describe two "parallel" lists of
bundles that could be selected for each of those categories. If such an
organization is valuable, then the table of contents can be extended with
information to communicate such an organization. The downside is that
clients with this "v1" version would download extra data based on this
assumption.

> If users are doing a single-branch clone or a shallow clone, will the
> requirements still hold?  (I'm not a fan of shallow clones, but they
> are sadly used in a number of places and I'm curious how the features
> interact or conflict.)

The current specification does not focus on shallow clones. The TOC
could be extended to say "this bundle is for a shallow clone of 
commit <X>" if that was valuable.

For single-branch clones, my expectation is that the bundles will
give the user more information than they need for that clone. The
negotiation will find out what they need from that branch that was
not in the bundles, but the bundles will also contain a lot of
objects that are not reachable from that ref. (This is also up to
the discretion of the bundle server operator, since they could
focus on only objects reachable from a subset of refs, minimizing
the bundle data while increasing the potential size of that
follow-up fetch.)

>> +If a table of contents is used and it contains
>> +`bundle.tableOfContents.forFetch = true`, then the client can store a
>> +config value indicating to reuse this URI for later `git fetch` commands.
>> +In this case, the client will also want to store the maximum timestamp of
>> +a downloaded bundle.
>> +
>> +Fetching with Bundle URIs
>> +-------------------------
>> +
>> +When the client fetches new data, it can decide to fetch from bundle
>> +servers before fetching from the origin remote. This could be done via
>> +a command-line option, but it is more likely useful to use a config value
>> +such as the one specified during the clone.
>> +
>> +The fetch operation follows the same procedure to download bundles from a
>> +table of contents (although we do _not_ want to use parallel downloads
>> +here). We expect that the process will end because all negative commit
>> +OIDs in a thin bundle are already in the object database.
>> +
>> +A further optimization is that the client can avoid downloading any
>> +bundles if their timestamps are not larger than the stored timestamp.
>> +After fetching new bundles, this local timestamp value is updated.
> 
> What about the transition period where repositories were cloned before
> bundle URIs became a thing (or were turned on within an organization),
> and the user then goes to fetch?  Will git go and download a bunch of
> useless large bundles (and maybe one small useful one) the day this
> feature is turned on, making users think this is a bad feature?
> 
> Should git perhaps treat the already-cloned case without a stored
> timestamp as a request to store a timestamp of "now", making it ignore
> the current bundles?  (If so, are there races where it later goes to
> grab a bundle slightly newer than "now" but which depends on an older
> bundle that has some objects we are missing?)

I expect that users who already cloned will never configure their
repositories to include a bundle server.

That said, if you run 'git bundle fetch <uri>' in an existing
repository, then it will fetch only the newest bundle and see if you
already have all of its negative refs. If so, then it stops and that
is the only bundle that is downloaded. Its timestamp is stored for
the next 'git bundle fetch'.

In the case where the server starts advertising a bundle URI, a
'git fetch' will not start using that URI. That check only happens
during 'git clone' (as currently designed).

>> +Error Conditions
>> +----------------
>> +
>> +If the Git client discovers something unexpected while downloading
>> +information according to a bundle URI or the table of contents found at
>> +that location, then Git can ignore that data and continue as if it was not
>> +given a bundle URI. The remote Git server is the ultimate source of truth,
>> +not the bundle URI.
> 
> This seems to contradict the earlier statement that for clones all
> bundle URIs would be "required".  I like the idea of bundle URIs only
> being an optimization that can be ignored, just noting the potential
> confusion.

Perhaps I misnamed this section. These are things that could go wrong with
a bundle server connection, and in such a case Git should recover by
transitioning to the normal Git protocol to fetch the objects.

>> +
>> +Here are a few example error conditions:
>> +
>> +* The client fails to connect with a server at the given URI or a connection
>> +  is lost without any chance to recover.
>> +
>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`).
>> +
>> +* The client receives data that is not parsable as a bundle or table of
>> +  contents.
>> +
>> +* The table of contents describes a directed cycle in the
>> +  `bundle.<id>.requires` links.
>> +
>> +* A bundle includes a filter that does not match expectations.
>> +
>> +* The client cannot unbundle the bundles because the negative commit OIDs
>> +  are not in the object database and there are no more
>> +  `bundle.<id>.requires` links to follow.
> 
> Should these result in warnings so that folks can diagnose slower
> connections, or should they be squelched?  (I'm thinking particularly
> of the `401 Not Authorized` case in combination with users never
> having had to use a credential helper before.)

There is a lot of work to be done around polishing the user ergonomics
here, and that is an interesting thing to consider for a second round
after the basic standard is established. I appreciate that you are
already thinking about the user experience in these corner cases.

>> +
>> +There are also situations that could be seen as wasteful, but are not
>> +error conditions:
>> +
>> +* The downloaded bundles contain more information than is requested by
>> +  the clone or fetch request. A primary example is if the user requests
>> +  a clone with `--single-branch` but downloads bundles that store every
>> +  reachable commit from all `refs/heads/*` references. This might be
>> +  initially wasteful, but perhaps these objects will become reachable by
>> +  a later ref update that the client cares about.
> 
> Ah, this answers my --single-branch question.  Still curious about the
> --shallow misfeature (yeah, I'm a bit opinionated) and how it
> interacts, though.

(Hopefully my previous reply to this topic is helpful.)

>> +* A bundle download during a `git fetch` contains objects already in the
>> +  object database. This is probably unavoidable if we are using bundles
>> +  for fetches, since the client will almost always be slightly ahead of
>> +  the bundle servers after performing its "catch-up" fetch to the remote
>> +  server. This extra work is most wasteful when the client is fetching
>> +  much more frequently than the server is computing bundles, such as if
>> +  the client is using hourly prefetches with background maintenance, but
>> +  the server is computing bundles weekly. For this reason, the client
>> +  should not use bundle URIs for fetch unless the server has explicitly
>> +  recommended it through the `bundle.tableOfContents.forFetch = true`
>> +  value.
> 
> Makes sense, and certainly reduces my worry about the "transition
> period" where users have existing clones that pre-dated the
> introduction of the bundle URI feature.  But I'm still kind of curious
> about how we handle that transition for folks that have recommended
> their bundleUris for fetches.

By "that transition" I believe you are talking about configuring bundle
URIs on an existing repository. My earlier reply on that hopefully
eases your concerns somewhat. Ævar also has some ideas about downloading
only the header of a bundle and examining it to see if we need the rest
of it, which would further reduce the amount of data necessary to do an
initial first fetch from a bundle URI.

Thanks for taking the time to read and give detailed thoughts on this
proposal!

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/25] docs: document bundle URI standard
  2022-03-02 14:06     ` Derrick Stolee
@ 2022-03-03  2:19       ` Elijah Newren
  2022-03-03  2:44         ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Elijah Newren @ 2022-03-03  2:19 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Taylor Blau,
	Ævar Arnfjörð Bjarmason

On Wed, Mar 2, 2022 at 6:06 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
[...]
> >> +bundle.<id>.timestamp::
> >> +       (Optional) This value is the number of seconds since Unix epoch
> >> +       (UTC) that this bundle was created. This is used as an approximation
> >> +       of a point in time that the bundle matches the data available at
> >> +       the origin server.
> >
> > As an approximation, is there a risk of drift where the user has
> > timetamp A which is very close to B and makes decisions based upon it
> > which results in their not getting dependent objects they need?  Or is
> > it just an optimization for them to only download certain bundles and
> > look at them, and then they'll iteratively go back and download more
> > (as per the 'requires' field below) if they don't have enough objects
> > to unbundle what they previously downloaded?
>
> The user doesn't ever generate the timestamp. It saves the timestamp
> from the most-recent bundle it downloaded. The only risk of timestamp
> drift is if the server has multiple machines generating different sets
> of bundles, and places those machines behind a load balancer.
>
> This is something the server can control, likely by having one job
> generate the bundle set and then distributing them to various storage
> locations.

My first thought here was "can, or must?"...and if it's not "must",
are there ramifications we have to consider in the design of how the
Git client behaves?  However, reading through the rest of the document
and your response, I don't think there is no problem here because
below you mention that when fetching, the client can grab the latest
bundle, and if it lacks some of the necessary pre-requisite objects,
it can just iteratively download previous bundles until they have
enough.

Also, the main thing that prompted my question was my wondering about
how to get pre-existing clones to start taking advantage of bundle
URIs for future fetches, and this timestamp approximation seemed
problematic.  But in addition to iteratively grabbing more bundles,
you also suggested that such users would just be forced to reclone,
which would also avoid timestamp approximation issues.

And, just now, I noticed the little "Optional" you put in the
description that I somehow missed previously.  I just can't read.
Anyway, it sounds like the timestamp isn't used for correctness for
the client to know what to grab, it's more just an aid for grabbing a
useful initial set.

> >> +Cloning with Bundle URIs
> >> +------------------------
> >> +
> >> +The primary need for bundle URIs is to speed up clones. The Git client
> >> +will interact with bundle URIs according to the following flow:
> >> +
> >> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> >> +   option _or_ the client discovers a bundle URI that was advertised by
> >> +   the remote server.
> >> +
> >> +2. The client downloads the file at the bundle URI. If it is a bundle, then
> >> +   it is unbundled with the refs being stored in `refs/bundle/*`.
> >> +
> >> +3. If the file is instead a table of contents, then the bundles with
> >> +   matching `filter` settings are sorted by `timestamp` (if present),
> >> +   and the most-recent bundle is downloaded.
> >> +
> >> +4. If the current bundle header mentions negative commid OIDs that are not
> >> +   in the object database, then download the `requires` bundle and try
> >> +   again.
> >> +
> >> +5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
> >> +   already in the object database somehow), then unbundle all of the
> >> +   bundles in reverse order, placing references within `refs/bundle/*`.
> >> +
> >> +6. The client performs a fetch negotiation with the origin server, using
> >> +   the `refs/bundle/*` references as `have`s and the server's ref
> >> +   advertisement as `want`s. This results in a pack-file containing the
> >> +   remaining objects requested by the clone but not in the bundles.
> >
> > Does step 6 potentially involve a new, second connection to the origin
> > server?  I'm wondering about timeouts closing the original connection
> > while the client is downloading the bundle uris.  Will the client
> > handle that automatically, or will they potentially be forced to
> > re-issue the clone/fetch command?  I'm also wondering if we want to be
> > "nice" and pre-emptively close the original connection to the server
> > while we fetch the bundles -- for example, some servers have a
> > threadpool for processing fetch/clone requests and will only serve a
> > limited number; IIRC Gerrit operates this way.  I have no idea if
> > that's a good idea or a horrible one.  If a second connection is tried
> > automatically, will the user potentially be forced to re-enter
> > connection credentials again?  And is there a risk that after the
> > second connection, there are new bundle uris for the client to go
> > fetch (and/or a removal of the original ones, e.g. replacing the old
> > "daily" bundle with a new one)?  Does this possibly cause us some
> > timestamp confusion as I noted earlier?
>
> If the user is cloning over HTTPS, then the connections are stateless
> and this is not any different than how it works today.
>
> When using SSH, we will probably want to close the SSH connection on
> the client and then reopen it to avoid keeping that connection open
> during the download. The implementation in this RFC does _not_ do this,
> but I think it would be valuable to do.

Right, I was thinking of ssh.

> >> +Note that during a clone we expect that all bundles will be required. The
> >> +client could be extended to download all bundles in parallel, though they
> >> +need to be unbundled in the correct order.
> >
> > What does required mean?  That the origin server can refuse to service
> > requests if the client does not have commits found in said bundles?
> > That new enough Git clients are expected to download all the bundles
> > (and no config option will be provided to users to just do traditional
> > negotation without first downloading them)?  Something else?
>
> The assumption I'm making here is that all but on bundle in the table
> of contents contains a thin pack, depending on an "earlier" bundle.
> The client would be unsuccessful unbundling any single bundle except
> the earliest one first.
>
> The benefit of this assumption is that we could also implement parallel
> downloads of all bundles in the future.
>
> This assumes that there is no way to organize the bundles to communicate
> that a user might want only the objects reachable from the default branch,
> but also some users want every reachable object. Such an organization
> would require extra information to describe two "parallel" lists of
> bundles that could be selected for each of those categories. If such an
> organization is valuable, then the table of contents can be extended with
> information to communicate such an organization. The downside is that
> clients with this "v1" version would download extra data based on this
> assumption.

That makes sense, and seems to suggest, as you clarify below, that no
bundles are required.  However, that makes your earlier sentence "Note
that during a clone we expect that all bundles will be required" a bit
jarring and perhaps in need of rewording to clarify the intent (e.g.
"Note that during a clone we expect that clients will download and use
all bundles, not just a subset of them.").

> > If users are doing a single-branch clone or a shallow clone, will the
> > requirements still hold?  (I'm not a fan of shallow clones, but they
> > are sadly used in a number of places and I'm curious how the features
> > interact or conflict.)
>
> The current specification does not focus on shallow clones. The TOC
> could be extended to say "this bundle is for a shallow clone of
> commit <X>" if that was valuable.

I guess it could, though I suspect the easier answer is that clients
ignore the bundle URIs for shallow clones, because doing otherwise is
just a nightmare for managing bundles with various depths.  (As a side
note, is the likely result of this series that shallow clones become
the bad citizens that put the heaviest load on the origin server?)

> For single-branch clones, my expectation is that the bundles will
> give the user more information than they need for that clone. The
> negotiation will find out what they need from that branch that was
> not in the bundles, but the bundles will also contain a lot of
> objects that are not reachable from that ref. (This is also up to
> the discretion of the bundle server operator, since they could
> focus on only objects reachable from a subset of refs, minimizing
> the bundle data while increasing the potential size of that
> follow-up fetch.)

Makes sense to me.

[...]
> I expect that users who already cloned will never configure their
> repositories to include a bundle server.
>
> That said, if you run 'git bundle fetch <uri>' in an existing
> repository, then it will fetch only the newest bundle and see if you
> already have all of its negative refs. If so, then it stops and that
> is the only bundle that is downloaded. Its timestamp is stored for
> the next 'git bundle fetch'.
>
> In the case where the server starts advertising a bundle URI, a
> 'git fetch' will not start using that URI. That check only happens
> during 'git clone' (as currently designed).

So people have to reclone to take advantage of this performance improvement?

It probably doesn't matter to me; our repository sizes are such that I
suspect we may only use this feature for clones anyway.  I'm just
curious.

[...]
> >> +Here are a few example error conditions:
> >> +
> >> +* The client fails to connect with a server at the given URI or a connection
> >> +  is lost without any chance to recover.
> >> +
> >> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> >> +  `401 Not Authorized`, or `500 Internal Server Error`).
[...]
> >
> > Should these result in warnings so that folks can diagnose slower
> > connections, or should they be squelched?  (I'm thinking particularly
> > of the `401 Not Authorized` case in combination with users never
> > having had to use a credential helper before.)
>
> There is a lot of work to be done around polishing the user ergonomics
> here, and that is an interesting thing to consider for a second round
> after the basic standard is established. I appreciate that you are
> already thinking about the user experience in these corner cases.

Thanks, but I think this particular situation (using ssh for clone and
possibly not having credential helpers setup before) isn't so much of
a corner case:
  * Setting up cloning via ssh is easy (on Linux and maybe Mac),
whereas https involves more hoops.  So, at my previous employer,
cloning was only available via ssh.  (No idea if they've since changed
or adopted a Git repository manager of some sort.)
  * At current employer, ssh is default, and has at various times been
the _only_ option:
    * Back when we used gitolite, gitolite was intrinsically tied to
ssh (probably still is, right?)
    * For years during part of the time we used Gerrit, https cloning
was disabled.  This was probably a configuration oversight at first,
but I inherited the system and didn't notice this for years, likely
because of having migrated from gitolite.  I think I enabled it, and
possibly disabled it again because Gerrit liked to default to https
for download links if it was a possibility.  Or maybe I disabled it
again due to the authentication issues that also affected GHE
(mentioned below).
    * For a few years during the early usage of GHE, https cloning was
disabled.  This was not an oversight, but an unfortunate fallout of
some kind of issues in our then-authentication scheme (salesforce,
duo, who knows what else).  Something in that stack didn't work
together in a way that somehow only adversely affected cloning (at
least to my knowledge), but since everyone was used to cloning solely
via ssh anyway, the folks running GHE simply disabled cloning via
https.  We've since changed authentication schemes, and it appears
https was re-enabled somewhere along the way (I just tried to verify).

So, I expect many users here have never cloned any of our internal
repositories with https, and thus likely have not set up a credential
helper.  And my previous employer would have been in the same boat and
it would not surprise me if they still are.  Combine the above with
the following:
  * The Infosec team here will not allow unauthenticated source code
access; so the bundle access would have to be authenticated here.
  * I'm pretty certain my former employer would have felt the same.

So, to me at least, this particular case (using ssh for clone and
possibly not having credential helpers setup) feels pretty common.
Not that you're doing anything wrong for this case, just noting that
we need to make sure such folks are aware of the authentication
considerations.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/25] docs: document bundle URI standard
  2022-03-03  2:19       ` Elijah Newren
@ 2022-03-03  2:44         ` Derrick Stolee
  0 siblings, 0 replies; 46+ messages in thread
From: Derrick Stolee @ 2022-03-03  2:44 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Taylor Blau,
	Ævar Arnfjörð Bjarmason

On 3/2/2022 9:19 PM, Elijah Newren wrote:
> On Wed, Mar 2, 2022 at 6:06 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>
> [...]
>>>> +bundle.<id>.timestamp::
>>>> +       (Optional) This value is the number of seconds since Unix epoch
>>>> +       (UTC) that this bundle was created. This is used as an approximation
>>>> +       of a point in time that the bundle matches the data available at
>>>> +       the origin server.
>>>
>>> As an approximation, is there a risk of drift where the user has
>>> timetamp A which is very close to B and makes decisions based upon it
>>> which results in their not getting dependent objects they need?  Or is
>>> it just an optimization for them to only download certain bundles and
>>> look at them, and then they'll iteratively go back and download more
>>> (as per the 'requires' field below) if they don't have enough objects
>>> to unbundle what they previously downloaded?
>>
>> The user doesn't ever generate the timestamp. It saves the timestamp
>> from the most-recent bundle it downloaded. The only risk of timestamp
>> drift is if the server has multiple machines generating different sets
>> of bundles, and places those machines behind a load balancer.
>>
>> This is something the server can control, likely by having one job
>> generate the bundle set and then distributing them to various storage
>> locations.
> 
> My first thought here was "can, or must?"...and if it's not "must",
> are there ramifications we have to consider in the design of how the
> Git client behaves?  However, reading through the rest of the document
> and your response, I don't think there is no problem here because
> below you mention that when fetching, the client can grab the latest
> bundle, and if it lacks some of the necessary pre-requisite objects,
> it can just iteratively download previous bundles until they have
> enough.
> 
> Also, the main thing that prompted my question was my wondering about
> how to get pre-existing clones to start taking advantage of bundle
> URIs for future fetches, and this timestamp approximation seemed
> problematic.  But in addition to iteratively grabbing more bundles,
> you also suggested that such users would just be forced to reclone,
> which would also avoid timestamp approximation issues.
> 
> And, just now, I noticed the little "Optional" you put in the
> description that I somehow missed previously.  I just can't read.
> Anyway, it sounds like the timestamp isn't used for correctness for
> the client to know what to grab, it's more just an aid for grabbing a
> useful initial set.

Exactly. It's a heuristic, but it is less noisy than it sounds.

We could easily replace it with "generation" to give an increasing
number (how many times has the bundle server has generated bundles?),
but using a timestamp provides some amount of human-understandable
meaning to the number.

>>>> +Cloning with Bundle URIs
>>>> +------------------------
>>>> +
>>>> +The primary need for bundle URIs is to speed up clones. The Git client
>>>> +will interact with bundle URIs according to the following flow:
>>>> +
>>>> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
>>>> +   option _or_ the client discovers a bundle URI that was advertised by
>>>> +   the remote server.
>>>> +
>>>> +2. The client downloads the file at the bundle URI. If it is a bundle, then
>>>> +   it is unbundled with the refs being stored in `refs/bundle/*`.
>>>> +
>>>> +3. If the file is instead a table of contents, then the bundles with
>>>> +   matching `filter` settings are sorted by `timestamp` (if present),
>>>> +   and the most-recent bundle is downloaded.
>>>> +
>>>> +4. If the current bundle header mentions negative commid OIDs that are not
>>>> +   in the object database, then download the `requires` bundle and try
>>>> +   again.
>>>> +
>>>> +5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
>>>> +   already in the object database somehow), then unbundle all of the
>>>> +   bundles in reverse order, placing references within `refs/bundle/*`.
>>>> +
>>>> +6. The client performs a fetch negotiation with the origin server, using
>>>> +   the `refs/bundle/*` references as `have`s and the server's ref
>>>> +   advertisement as `want`s. This results in a pack-file containing the
>>>> +   remaining objects requested by the clone but not in the bundles.
>>>
>>> Does step 6 potentially involve a new, second connection to the origin
>>> server?  I'm wondering about timeouts closing the original connection
>>> while the client is downloading the bundle uris.  Will the client
>>> handle that automatically, or will they potentially be forced to
>>> re-issue the clone/fetch command?  I'm also wondering if we want to be
>>> "nice" and pre-emptively close the original connection to the server
>>> while we fetch the bundles -- for example, some servers have a
>>> threadpool for processing fetch/clone requests and will only serve a
>>> limited number; IIRC Gerrit operates this way.  I have no idea if
>>> that's a good idea or a horrible one.  If a second connection is tried
>>> automatically, will the user potentially be forced to re-enter
>>> connection credentials again?  And is there a risk that after the
>>> second connection, there are new bundle uris for the client to go
>>> fetch (and/or a removal of the original ones, e.g. replacing the old
>>> "daily" bundle with a new one)?  Does this possibly cause us some
>>> timestamp confusion as I noted earlier?
>>
>> If the user is cloning over HTTPS, then the connections are stateless
>> and this is not any different than how it works today.
>>
>> When using SSH, we will probably want to close the SSH connection on
>> the client and then reopen it to avoid keeping that connection open
>> during the download. The implementation in this RFC does _not_ do this,
>> but I think it would be valuable to do.
> 
> Right, I was thinking of ssh.
> 
>>>> +Note that during a clone we expect that all bundles will be required. The
>>>> +client could be extended to download all bundles in parallel, though they
>>>> +need to be unbundled in the correct order.
>>>
>>> What does required mean?  That the origin server can refuse to service
>>> requests if the client does not have commits found in said bundles?
>>> That new enough Git clients are expected to download all the bundles
>>> (and no config option will be provided to users to just do traditional
>>> negotation without first downloading them)?  Something else?
>>
>> The assumption I'm making here is that all but on bundle in the table
>> of contents contains a thin pack, depending on an "earlier" bundle.
>> The client would be unsuccessful unbundling any single bundle except
>> the earliest one first.
>>
>> The benefit of this assumption is that we could also implement parallel
>> downloads of all bundles in the future.
>>
>> This assumes that there is no way to organize the bundles to communicate
>> that a user might want only the objects reachable from the default branch,
>> but also some users want every reachable object. Such an organization
>> would require extra information to describe two "parallel" lists of
>> bundles that could be selected for each of those categories. If such an
>> organization is valuable, then the table of contents can be extended with
>> information to communicate such an organization. The downside is that
>> clients with this "v1" version would download extra data based on this
>> assumption.
> 
> That makes sense, and seems to suggest, as you clarify below, that no
> bundles are required.  However, that makes your earlier sentence "Note
> that during a clone we expect that all bundles will be required" a bit
> jarring and perhaps in need of rewording to clarify the intent (e.g.
> "Note that during a clone we expect that clients will download and use
> all bundles, not just a subset of them.").

Your suggestion to reword is a good one.

>>> If users are doing a single-branch clone or a shallow clone, will the
>>> requirements still hold?  (I'm not a fan of shallow clones, but they
>>> are sadly used in a number of places and I'm curious how the features
>>> interact or conflict.)
>>
>> The current specification does not focus on shallow clones. The TOC
>> could be extended to say "this bundle is for a shallow clone of
>> commit <X>" if that was valuable.
> 
> I guess it could, though I suspect the easier answer is that clients
> ignore the bundle URIs for shallow clones, because doing otherwise is
> just a nightmare for managing bundles with various depths.  (As a side
> note, is the likely result of this series that shallow clones become
> the bad citizens that put the heaviest load on the origin server?)
> 
>> For single-branch clones, my expectation is that the bundles will
>> give the user more information than they need for that clone. The
>> negotiation will find out what they need from that branch that was
>> not in the bundles, but the bundles will also contain a lot of
>> objects that are not reachable from that ref. (This is also up to
>> the discretion of the bundle server operator, since they could
>> focus on only objects reachable from a subset of refs, minimizing
>> the bundle data while increasing the potential size of that
>> follow-up fetch.)
> 
> Makes sense to me.
> 
> [...]
>> I expect that users who already cloned will never configure their
>> repositories to include a bundle server.
>>
>> That said, if you run 'git bundle fetch <uri>' in an existing
>> repository, then it will fetch only the newest bundle and see if you
>> already have all of its negative refs. If so, then it stops and that
>> is the only bundle that is downloaded. Its timestamp is stored for
>> the next 'git bundle fetch'.
>>
>> In the case where the server starts advertising a bundle URI, a
>> 'git fetch' will not start using that URI. That check only happens
>> during 'git clone' (as currently designed).
> 
> So people have to reclone to take advantage of this performance improvement?

To auto-discover bundle URIs from the origin server, in v1, yes.

It is possible that we would want fetches to see that the
recommended features have been updated, and to have the client
"upgrade" to these features in appropriate cases. But I don't
anticipate putting that into the first version. This is especially
important for features like partial clone where it doesn't make a
lot of sense to change the filter (until the refiltering topic is
finalized) and maybe not even to toggle sparse-checkout after the
initial clone. These kinds of complications are solvable, but are
probably too complicated to work out completely in v1.

> It probably doesn't matter to me; our repository sizes are such that I
> suspect we may only use this feature for clones anyway.  I'm just
> curious.

I anticipate that this will primarily be used for clones except
for some very large repos where fetches longer than a day or two
behind the latest can be costly.

> [...]
>>>> +Here are a few example error conditions:
>>>> +
>>>> +* The client fails to connect with a server at the given URI or a connection
>>>> +  is lost without any chance to recover.
>>>> +
>>>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>>>> +  `401 Not Authorized`, or `500 Internal Server Error`).
> [...]
>>>
>>> Should these result in warnings so that folks can diagnose slower
>>> connections, or should they be squelched?  (I'm thinking particularly
>>> of the `401 Not Authorized` case in combination with users never
>>> having had to use a credential helper before.)
>>
>> There is a lot of work to be done around polishing the user ergonomics
>> here, and that is an interesting thing to consider for a second round
>> after the basic standard is established. I appreciate that you are
>> already thinking about the user experience in these corner cases.
> 
> Thanks, but I think this particular situation (using ssh for clone and
> possibly not having credential helpers setup before) isn't so much of
> a corner case:
>   * Setting up cloning via ssh is easy (on Linux and maybe Mac),
> whereas https involves more hoops.  So, at my previous employer,
> cloning was only available via ssh.  (No idea if they've since changed
> or adopted a Git repository manager of some sort.)
>   * At current employer, ssh is default, and has at various times been
> the _only_ option:
>     * Back when we used gitolite, gitolite was intrinsically tied to
> ssh (probably still is, right?)
>     * For years during part of the time we used Gerrit, https cloning
> was disabled.  This was probably a configuration oversight at first,
> but I inherited the system and didn't notice this for years, likely
> because of having migrated from gitolite.  I think I enabled it, and
> possibly disabled it again because Gerrit liked to default to https
> for download links if it was a possibility.  Or maybe I disabled it
> again due to the authentication issues that also affected GHE
> (mentioned below).
>     * For a few years during the early usage of GHE, https cloning was
> disabled.  This was not an oversight, but an unfortunate fallout of
> some kind of issues in our then-authentication scheme (salesforce,
> duo, who knows what else).  Something in that stack didn't work
> together in a way that somehow only adversely affected cloning (at
> least to my knowledge), but since everyone was used to cloning solely
> via ssh anyway, the folks running GHE simply disabled cloning via
> https.  We've since changed authentication schemes, and it appears
> https was re-enabled somewhere along the way (I just tried to verify).
> 
> So, I expect many users here have never cloned any of our internal
> repositories with https, and thus likely have not set up a credential
> helper.  And my previous employer would have been in the same boat and
> it would not surprise me if they still are.  Combine the above with
> the following:
>   * The Infosec team here will not allow unauthenticated source code
> access; so the bundle access would have to be authenticated here.
>   * I'm pretty certain my former employer would have felt the same.
> 
> So, to me at least, this particular case (using ssh for clone and
> possibly not having credential helpers setup) feels pretty common.
> Not that you're doing anything wrong for this case, just noting that
> we need to make sure such folks are aware of the authentication
> considerations.

I completely agree that SSH clones are _not_ a corner case and
should be front-and-center in our minds when moving to the server-
advertised bundles. If a user specifies a --bundle-uri argument
to the 'git clone', then we can expect that they will have proper
credential managers configured to handle the supplied URIs.

I was thinking more about the cases where a user is having other
kinds of network issues to the bundle URIs that are not due to
this auth issue, but you mention that as your main example, so I
can now see how my statement looks like I'm referring to SSH.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-02-24 14:11   ` Derrick Stolee
@ 2022-03-04 13:30     ` Derrick Stolee
  2022-03-04 14:49       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-03-04 13:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, Robin H . Johnson, Teng Long,
	Konstantin Ryabitsev

On 2/24/2022 9:11 AM, Derrick Stolee wrote:
> On 2/23/2022 5:17 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

>>> There have been several suggestions to improve Git clone speeds and
>>> reliability by supplementing the Git protocol with static content. The
>>> Packfile URI [0] feature lets the Git response include URIs that point to
>>> packfiles that the client must download to complete the request.
>>>
>>> Last year, Ævar suggested using bundles instead of packfiles [1] [2]. This
>>> design has the same benefits to the packfile URI feature because it offloads
>>> most object downloads to static content fetches. The main advantage over
>>> packfile URIs is that the remote Git server does not need to know what is in
>>> those bundles. The Git client tells the server what it downloaded during the
>>> fetch negotiation afterwards. This includes any chance that the client did
>>> not have access to those bundles or otherwise failed to access them. I
>>> agreed that this was a much more desirable way to serve static content, but
>>> had concerns about the flexibility of that design [3]. I have not heard more
>>> on the topic since October, so I started investigating this idea myself in
>>> December, resulting in this RFC.
>>
>> This timing is both quite fortunate & unfortunate for me, since I'd been
>> blocked / waiting on various things until very recently to submit a
>> non-RFC re-roll of (a larger version of) that series you mentioned from
>> October.
>>
>> I guess the good news is that we'll have at least one guaranteed very
>> interested reviewer for each other's patches, and that the design that
>> makes it into git.git in the end will definitely be well hashed out :)
>>
>> I won't be able to review this in any detail right at this hour, but
>> will be doing so. I'd also like to submit what I've got in some form
>> soon for hashing the two out.
>>
>> It will be some 50+ patches on the ML in total though related to this
>> topic, so I think the two of us coming up with some way to manage all of
>> that for both ourselves & others would be nice. Perhaps we could also
>> have an off-list (video) chat in real time to clarify/discuss various
>> thing related to this.
> 
> I look forward to seeing your full implementation. There are many things
> about your RFC that left me confused and not fully understanding your
> vision.

I am genuinely curious to see your full implementation of bundle URIs.
I've been having trouble joining the Git IRC chats, but I saw from the
logs that you are working on getting patches together.

Do you have an expected timeline on that progress?

I would like to move forward in getting bundle URIs submitted as a full
feature, but it is important to see your intended design so we can take
the best parts of both to create a version that satisfies us both.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-03-04 13:30     ` Derrick Stolee
@ 2022-03-04 14:49       ` Ævar Arnfjörð Bjarmason
  2022-03-04 15:12         ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-04 14:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev


On Fri, Mar 04 2022, Derrick Stolee wrote:

> On 2/24/2022 9:11 AM, Derrick Stolee wrote:
>> On 2/23/2022 5:17 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:
>
>>>> There have been several suggestions to improve Git clone speeds and
>>>> reliability by supplementing the Git protocol with static content. The
>>>> Packfile URI [0] feature lets the Git response include URIs that point to
>>>> packfiles that the client must download to complete the request.
>>>>
>>>> Last year, Ævar suggested using bundles instead of packfiles [1] [2]. This
>>>> design has the same benefits to the packfile URI feature because it offloads
>>>> most object downloads to static content fetches. The main advantage over
>>>> packfile URIs is that the remote Git server does not need to know what is in
>>>> those bundles. The Git client tells the server what it downloaded during the
>>>> fetch negotiation afterwards. This includes any chance that the client did
>>>> not have access to those bundles or otherwise failed to access them. I
>>>> agreed that this was a much more desirable way to serve static content, but
>>>> had concerns about the flexibility of that design [3]. I have not heard more
>>>> on the topic since October, so I started investigating this idea myself in
>>>> December, resulting in this RFC.
>>>
>>> This timing is both quite fortunate & unfortunate for me, since I'd been
>>> blocked / waiting on various things until very recently to submit a
>>> non-RFC re-roll of (a larger version of) that series you mentioned from
>>> October.
>>>
>>> I guess the good news is that we'll have at least one guaranteed very
>>> interested reviewer for each other's patches, and that the design that
>>> makes it into git.git in the end will definitely be well hashed out :)
>>>
>>> I won't be able to review this in any detail right at this hour, but
>>> will be doing so. I'd also like to submit what I've got in some form
>>> soon for hashing the two out.
>>>
>>> It will be some 50+ patches on the ML in total though related to this
>>> topic, so I think the two of us coming up with some way to manage all of
>>> that for both ourselves & others would be nice. Perhaps we could also
>>> have an off-list (video) chat in real time to clarify/discuss various
>>> thing related to this.
>> 
>> I look forward to seeing your full implementation. There are many things
>> about your RFC that left me confused and not fully understanding your
>> vision.
>
> I am genuinely curious to see your full implementation of bundle URIs.
> I've been having trouble joining the Git IRC chats, but I saw from the
> logs that you are working on getting patches together.
>
> Do you have an expected timeline on that progress?
>
> I would like to move forward in getting bundle URIs submitted as a full
> feature, but it is important to see your intended design so we can take
> the best parts of both to create a version that satisfies us both.

Hi. Very sorry about the late reply. I relly meant to have something
read to send at the end of this week, but it's been a bit of a
whirlwhind of random things coming up & distracting me too much. Sorry.

I'm also totally on board with that goal, are you OK with waiting until
the end of next week at the latest?

Also, as noted in the upthread
<220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
in a more voice/video medium in parallel (maybe mid-next-week) about the
high-level ideas & to get a feel for our goals, conflicts etc. Doing
that over very long E-Mail exchanges (and the fault of "long" there is
mostly on my side:) can be a bit harder...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-03-04 14:49       ` Ævar Arnfjörð Bjarmason
@ 2022-03-04 15:12         ` Derrick Stolee
  2022-03-08 17:15           ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-03-04 15:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

On 3/4/2022 9:49 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Mar 04 2022, Derrick Stolee wrote:
> 
>> On 2/24/2022 9:11 AM, Derrick Stolee wrote:
>>> I look forward to seeing your full implementation. There are many things
>>> about your RFC that left me confused and not fully understanding your
>>> vision.
>>
>> I am genuinely curious to see your full implementation of bundle URIs.
>> I've been having trouble joining the Git IRC chats, but I saw from the
>> logs that you are working on getting patches together.
>>
>> Do you have an expected timeline on that progress?
>>
>> I would like to move forward in getting bundle URIs submitted as a full
>> feature, but it is important to see your intended design so we can take
>> the best parts of both to create a version that satisfies us both.
> 
> Hi. Very sorry about the late reply. I relly meant to have something
> read to send at the end of this week, but it's been a bit of a
> whirlwhind of random things coming up & distracting me too much. Sorry.
> 
> I'm also totally on board with that goal, are you OK with waiting until
> the end of next week at the latest?

I'm OK with waiting, especially when I have a timeline in mind.
 
> Also, as noted in the upthread
> <220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
> in a more voice/video medium in parallel (maybe mid-next-week) about the
> high-level ideas & to get a feel for our goals, conflicts etc. Doing
> that over very long E-Mail exchanges (and the fault of "long" there is
> mostly on my side:) can be a bit harder...

I agree. I we can work out a time in a private thread and I can send
you a video call invite.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-02-23 22:17 ` [PATCH 00/25] [RFC] Bundle URIs Ævar Arnfjörð Bjarmason
  2022-02-24 14:11   ` Derrick Stolee
@ 2022-03-08  8:18   ` Teng Long
  1 sibling, 0 replies; 46+ messages in thread
From: Teng Long @ 2022-03-08  8:18 UTC (permalink / raw)
  To: avarab
  Cc: derrickstolee, dyroneteng, git, gitgitgadget, gitster,
	konstantin, me, newren, robbat2

Ævar Arnfjörð Bjarmason wrote on Wed, 23 Feb 2022 23:17:22 +0100:

>[Note: The E-Mail address you CC'd for me (presumably, dropped in this
>reply) is not my E-Mail address, this one is]
>
>[Also CC-ing some people who have expressed interest in this are, and
>would probably like to be kept in the loop going forward]

Appreciate for the CC. 

I'm attractive on this for a while. On a earlier time, I had posted a
patchset about to extend "packfile-uri" for common or similar reasons,
but after I saw the idea and RFC from Ævar Arnfjörð Bjarmason, I suspended
it.

Really looking forward the solution after you guys reach the consensuses and
I'm glad to attend and listen to your meetings, if possible (I looked at the
context a little bit late, maybe already missed it).

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-03-04 15:12         ` Derrick Stolee
@ 2022-03-08 17:15           ` Derrick Stolee
  2022-03-10 14:45             ` Johannes Schindelin
  2022-04-07 19:08             ` Derrick Stolee
  0 siblings, 2 replies; 46+ messages in thread
From: Derrick Stolee @ 2022-03-08 17:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

On 3/4/2022 10:12 AM, Derrick Stolee wrote:
> On 3/4/2022 9:49 AM, Ævar Arnfjörð Bjarmason wrote:
>> Also, as noted in the upthread
>> <220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
>> in a more voice/video medium in parallel (maybe mid-next-week) about the
>> high-level ideas & to get a feel for our goals, conflicts etc. Doing
>> that over very long E-Mail exchanges (and the fault of "long" there is
>> mostly on my side:) can be a bit harder...
> 
> I agree. I we can work out a time in a private thread and I can send
> you a video call invite.

Ævar and I just finished our chat and came away with these two
action items:

1. Ævar will finish prepping his RFC as-is and send it to the list.
   It contains several deeply technical optimizations that are
   critical to how his model works, but could also be used to
   improve scenarios in the table of contents model.

2. Ævar will then do a round of taking both series and combining
   them in a way that allows the union of possible functionality
   to work.

3. As these things come out, I will make it a priority to read the
   patches and provide feedback focusing on high-level concepts
   and ways we can split the future, non-RFC series into chunks
   that provide incremental functionality while keeping review
   easier than reading the whole series.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-03-08 17:15           ` Derrick Stolee
@ 2022-03-10 14:45             ` Johannes Schindelin
  2022-04-07 19:08             ` Derrick Stolee
  1 sibling, 0 replies; 46+ messages in thread
From: Johannes Schindelin @ 2022-03-10 14:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

[-- Attachment #1: Type: text/plain, Size: 1589 bytes --]

Hi Stolee & Ævar,

On Tue, 8 Mar 2022, Derrick Stolee wrote:

> On 3/4/2022 10:12 AM, Derrick Stolee wrote:
> > On 3/4/2022 9:49 AM, Ævar Arnfjörð Bjarmason wrote:
> >> Also, as noted in the upthread
> >> <220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
> >> in a more voice/video medium in parallel (maybe mid-next-week) about the
> >> high-level ideas & to get a feel for our goals, conflicts etc. Doing
> >> that over very long E-Mail exchanges (and the fault of "long" there is
> >> mostly on my side:) can be a bit harder...
> >
> > I agree. I we can work out a time in a private thread and I can send
> > you a video call invite.
>
> Ævar and I just finished our chat and came away with these two
> action items:
>
> 1. Ævar will finish prepping his RFC as-is and send it to the list.
>    It contains several deeply technical optimizations that are
>    critical to how his model works, but could also be used to
>    improve scenarios in the table of contents model.
>
> 2. Ævar will then do a round of taking both series and combining
>    them in a way that allows the union of possible functionality
>    to work.
>
> 3. As these things come out, I will make it a priority to read the
>    patches and provide feedback focusing on high-level concepts
>    and ways we can split the future, non-RFC series into chunks
>    that provide incremental functionality while keeping review
>    easier than reading the whole series.

I very much look forward to see the combined work soon!

Thank you, both,
Johannes

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 15/25] config: add git_config_get_timestamp()
  2022-02-23 18:30 ` [PATCH 15/25] config: add git_config_get_timestamp() Derrick Stolee via GitGitGadget
@ 2022-03-11 13:32   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 13:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, aevar, newren, Derrick Stolee


On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The existing config parsing methods do not include a way to consistently
> parse timestamps across all platforms. Recall that "unsigned long" is
> 32 bits on 64-bit Windows, so git_config_get_ulong() is insufficient.

They do. I think you've just missed the slightly non-obviously-named
expiry-date type:

    $ for t in $(date +%s) 2.days.ago 1.week.ago; do echo $t = $(git -c x.y=$t config --type expiry-date --get x.y); done
    1647005519 = 1647005519
    2.days.ago = 1646832719
    1.week.ago = 1646400719

AFAICT there's no need for these methods, except perhaps as some wrapper
for the existing git_config_expiry_date(), perhaps to have a flag passed
down to only allow epochs, and not the human-readable format?

But for the TOC proposal I don't see why it wouldn't allow the
human-readable format. You can arguably shoot yourself in the foot with
2.days.ago, but you can also provide absolute times in that format.

In any case, peeling out this patch from this series to fast-track it if
you do need it sounds like a good idea, it would need the relevant bits
of the CLI UX implemented & docs added.

There's also tests that you'll find by grepping for "expiry-date" to
extend.

> RFC-QUESTION: Would this be better to use uintmax_t, which could be cast
> to timestamp_t or other types more robust than "unsigned long"?

I think it's certainly legitimate to re-visit that when we're talking
about making it part of a network format.

But I think we should just go for timestamp_t, it will break 32 bit
clients around the year 2038, but I think that probably the intersection
between clients who'll find this useful and those still using 32 bit
systems then will be pretty much zero :)

So I think the only thing we need for bundle-uri if it has a TOC with a
timestamp is to have it die/fallback properly if it can't parse the
values.

> [...]
> +		*dest = git_config_timestamp(key, value);
> +		return 0;
> +	} else
> +		return 1;

Nit: missing {}

>[...]
> --- a/config.h
> +++ b/config.h
> @@ -206,6 +206,7 @@ int config_with_options(config_fn_t fn, void *,
>  
>  int git_parse_ssize_t(const char *, ssize_t *);
>  int git_parse_ulong(const char *, unsigned long *);
> +int git_parse_timestamp(const char *, timestamp_t *);
>  
>  /**
>   * Same as `git_config_bool`, except that it returns -1 on error rather
> @@ -226,6 +227,11 @@ int64_t git_config_int64(const char *, const char *);
>   */
>  unsigned long git_config_ulong(const char *, const char *);
>  
> +/**
> + * Identical to `git_config_int`, but for (unsigned) timestamps.
> + */
> +timestamp_t git_config_timestamp(const char *name, const char *value);
> +
>  ssize_t git_config_ssize_t(const char *, const char *);
>  
>  /**
> @@ -469,6 +475,7 @@ int git_configset_get_string(struct config_set *cs, const char *key, char **dest
>  int git_configset_get_string_tmp(struct config_set *cs, const char *key, const char **dest);
>  int git_configset_get_int(struct config_set *cs, const char *key, int *dest);
>  int git_configset_get_ulong(struct config_set *cs, const char *key, unsigned long *dest);
> +int git_configset_get_timestamp(struct config_set *cs, const char *key, timestamp_t *dest);
>  int git_configset_get_bool(struct config_set *cs, const char *key, int *dest);
>  int git_configset_get_bool_or_int(struct config_set *cs, const char *key, int *is_bool, int *dest);
>  int git_configset_get_maybe_bool(struct config_set *cs, const char *key, int *dest);
> @@ -489,6 +496,8 @@ int repo_config_get_int(struct repository *repo,
>  			const char *key, int *dest);
>  int repo_config_get_ulong(struct repository *repo,
>  			  const char *key, unsigned long *dest);
> +int repo_config_get_timestamp(struct repository *repo,
> +			      const char *key, timestamp_t *dest);
>  int repo_config_get_bool(struct repository *repo,
>  			 const char *key, int *dest);
>  int repo_config_get_bool_or_int(struct repository *repo,
> @@ -558,6 +567,11 @@ int git_config_get_int(const char *key, int *dest);
>   */
>  int git_config_get_ulong(const char *key, unsigned long *dest);
>  
> +/**
> + * Similar to `git_config_get_int` but for (unsigned) timestamps.
> + */
> +int git_config_get_timestamp(const char *key, timestamp_t *dest);
> +

Not your fault, but it's really tiresome that whenever we need to add
stuff to this API it's this mostly copy/pasting of related methods. I
had some recent-ish patches that did something similar & it was a chore
:)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 11/25] bundle: allow relative URLs in table of contents
  2022-02-23 18:30 ` [PATCH 11/25] bundle: allow relative URLs in table of contents Derrick Stolee via GitGitGadget
@ 2022-03-11 13:42   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 13:42 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, aevar, newren, Derrick Stolee


On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> @@ -460,6 +463,10 @@ static int cmd_bundle_fetch(int argc, const char **argv, const char *prefix)
>  			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
>  				continue;
>  
> +			old_uri = info->uri;
> +			info->uri = relative_url(bundle_uri, info->uri, NULL);
> +			free(old_uri);
> +

I had it on my TODO to look into how to do this, and hadn't dug yet,
it's really pleasing that we have an API to make it this simple.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/25] bundle: add --filter option to 'fetch'
  2022-02-23 18:30 ` [PATCH 10/25] bundle: add --filter option to 'fetch' Derrick Stolee via GitGitGadget
@ 2022-03-11 13:44   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 13:44 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, aevar, newren, Derrick Stolee


On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +			/* Skip if filter does not match. */
> +			if (!filter && info->filter_str)
> +				continue;
> +			if (filter &&
> +			    (!info->filter_str || strcasecmp(filter, info->filter_str)))
> +				continue;

General API comment: Do we have a need for strcasecmp() now anywhere
when dealing with filters?

    $ git clone --no-checkout --filter=BlOb:nOnE https://github.com/git/git
    fatal: invalid filter-spec 'BlOb:nOnE'

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/25] bundle: alphabetize subcommands better
  2022-02-23 18:30 ` [PATCH 02/25] bundle: alphabetize subcommands better Derrick Stolee via GitGitGadget
@ 2022-03-11 13:47   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-03-11 13:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, aevar, newren, Derrick Stolee


On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The usage strings for the 'git bundle' subcommands are not alphabetical.
> This also applies to their inspection within cmd_bundle(). Fix this
> ordering before we insert a new subcommand.
>
> This change does not reorder the cmd_bundle_*() methods to avoid moving
> lines that are more likely wanted in a future 'git blame' call. It is
> fine that those longer methods are not ordered alphabetically.

If we're moving these around it makes sense to make them use the pattern
of macros we use in commit-graph.c. I could have sworn I changed that
here already, but obviously not...

Anyway, I think as general UX I thought having these in alphabetical
order was a non-goal. Doesn't it make more for these commands in general
to list themost frequently used commands first.

In this case though there's pretty much a mapping between that and what
happens to be alphabetical order.

But I'd think if e.g. we implemented an "add-ref" or whatever to
in-place munge a bundle file to add more data to it such an obscure
thing would be better off last, before create/unbundle/list-heads/verify
etc.

Jut my 0.02, but for this change I really don't mind it, other than the
macro suggestion...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-03-08 17:15           ` Derrick Stolee
  2022-03-10 14:45             ` Johannes Schindelin
@ 2022-04-07 19:08             ` Derrick Stolee
  2022-04-08  9:15               ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-04-07 19:08 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

On 3/8/2022 12:15 PM, Derrick Stolee wrote:
> On 3/4/2022 10:12 AM, Derrick Stolee wrote:
>> On 3/4/2022 9:49 AM, Ævar Arnfjörð Bjarmason wrote:
>>> Also, as noted in the upthread
>>> <220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
>>> in a more voice/video medium in parallel (maybe mid-next-week) about the
>>> high-level ideas & to get a feel for our goals, conflicts etc. Doing
>>> that over very long E-Mail exchanges (and the fault of "long" there is
>>> mostly on my side:) can be a bit harder...
>>
>> I agree. I we can work out a time in a private thread and I can send
>> you a video call invite.
> 
> Ævar and I just finished our chat and came away with these two
> action items:
> 
> 1. Ævar will finish prepping his RFC as-is and send it to the list.
>    It contains several deeply technical optimizations that are
>    critical to how his model works, but could also be used to
>    improve scenarios in the table of contents model.

Ævar: I'm still waiting on the full version of this. While you
updated [1] your original RFC [2], it was incomplete. I am still
looking forward to seeing your full vision of how it works with
incremental fetch and how your optimizations to download only the
headers of the bundles will work.

[1] [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
    https://lore.kernel.org/git/RFC-cover-v2-00.13-00000000000-20220311T155841Z-avarab@gmail.com/

[2] [RFC PATCH 00/13] Add bundle-uri: resumably clones, static "dumb" CDN etc.
    https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

> 2. Ævar will then do a round of taking both series and combining
>    them in a way that allows the union of possible functionality
>    to work.

Or perhaps you are jumping straight to this part?

> 3. As these things come out, I will make it a priority to read the
>    patches and provide feedback focusing on high-level concepts
>    and ways we can split the future, non-RFC series into chunks
>    that provide incremental functionality while keeping review
>    easier than reading the whole series.

I'm still looking forward to seeing progress in this area. Please
let me know what your plan is here.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-04-07 19:08             ` Derrick Stolee
@ 2022-04-08  9:15               ` Ævar Arnfjörð Bjarmason
  2022-04-08 13:13                 ` Derrick Stolee
  0 siblings, 1 reply; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-04-08  9:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev


On Thu, Apr 07 2022, Derrick Stolee wrote:

> On 3/8/2022 12:15 PM, Derrick Stolee wrote:
>> On 3/4/2022 10:12 AM, Derrick Stolee wrote:
>>> On 3/4/2022 9:49 AM, Ævar Arnfjörð Bjarmason wrote:
>>>> Also, as noted in the upthread
>>>> <220224.86czjdb22l.gmgdl@evledraar.gmail.com> it might be useful to chat
>>>> in a more voice/video medium in parallel (maybe mid-next-week) about the
>>>> high-level ideas & to get a feel for our goals, conflicts etc. Doing
>>>> that over very long E-Mail exchanges (and the fault of "long" there is
>>>> mostly on my side:) can be a bit harder...
>>>
>>> I agree. I we can work out a time in a private thread and I can send
>>> you a video call invite.
>> 
>> Ævar and I just finished our chat and came away with these two
>> action items:
>> 
>> 1. Ævar will finish prepping his RFC as-is and send it to the list.
>>    It contains several deeply technical optimizations that are
>>    critical to how his model works, but could also be used to
>>    improve scenarios in the table of contents model.
>
> Ævar: I'm still waiting on the full version of this. While you
> updated [1] your original RFC [2], it was incomplete. I am still
> looking forward to seeing your full vision of how it works with
> incremental fetch and how your optimizations to download only the
> headers of the bundles will work.
>
> [1] [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
>     https://lore.kernel.org/git/RFC-cover-v2-00.13-00000000000-20220311T155841Z-avarab@gmail.com/
>
> [2] [RFC PATCH 00/13] Add bundle-uri: resumably clones, static "dumb" CDN etc.
>     https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
>
>> 2. Ævar will then do a round of taking both series and combining
>>    them in a way that allows the union of possible functionality
>>    to work.
>
> Or perhaps you are jumping straight to this part?

Yeah, that was part of it...

>> 3. As these things come out, I will make it a priority to read the
>>    patches and provide feedback focusing on high-level concepts
>>    and ways we can split the future, non-RFC series into chunks
>>    that provide incremental functionality while keeping review
>>    easier than reading the whole series.
>
> I'm still looking forward to seeing progress in this area. Please
> let me know what your plan is here.

Hi. I'm sorry about the delay, I ran into various life/software things,
and found that this topic required a lot of continuous "sit down for a
day and work on it" attention from me v.s. some other topics where I'd
deal with interruption better.

Then I was hoping that the merger of your bundle.c changes would come a
bit earlier before the rc, but they pretty much coincided, and since the
rc dropped I've been hesitant to send a very large topic to the list
(c.f. e.g. [1]).

Maybe I should just bite the bullet and submit it anyway, what do you
think?

1. https://lore.kernel.org/git/nycvar.QRO.7.76.6.2204071407160.347@tvgsbejvaqbjf.bet/

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-04-08  9:15               ` Ævar Arnfjörð Bjarmason
@ 2022-04-08 13:13                 ` Derrick Stolee
  2022-04-08 18:26                   ` Junio C Hamano
  0 siblings, 1 reply; 46+ messages in thread
From: Derrick Stolee @ 2022-04-08 13:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, gitster, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

On 4/8/2022 5:15 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Apr 07 2022, Derrick Stolee wrote:
>> Ævar: I'm still waiting on the full version of this. While you
>> updated [1] your original RFC [2], it was incomplete. I am still
>> looking forward to seeing your full vision of how it works with
>> incremental fetch and how your optimizations to download only the
>> headers of the bundles will work.
>>
>> [1] [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
>>     https://lore.kernel.org/git/RFC-cover-v2-00.13-00000000000-20220311T155841Z-avarab@gmail.com/
>>
>> [2] [RFC PATCH 00/13] Add bundle-uri: resumably clones, static "dumb" CDN etc.
>>     https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
>>
>>> 2. Ævar will then do a round of taking both series and combining
>>>    them in a way that allows the union of possible functionality
>>>    to work.
>>
>> Or perhaps you are jumping straight to this part?
> 
> Yeah, that was part of it...
> 
>>> 3. As these things come out, I will make it a priority to read the
>>>    patches and provide feedback focusing on high-level concepts
>>>    and ways we can split the future, non-RFC series into chunks
>>>    that provide incremental functionality while keeping review
>>>    easier than reading the whole series.
>>
>> I'm still looking forward to seeing progress in this area. Please
>> let me know what your plan is here.
> 
> Hi. I'm sorry about the delay, I ran into various life/software things,
> and found that this topic required a lot of continuous "sit down for a
> day and work on it" attention from me v.s. some other topics where I'd
> deal with interruption better.
> 
> Then I was hoping that the merger of your bundle.c changes would come a
> bit earlier before the rc, but they pretty much coincided, and since the
> rc dropped I've been hesitant to send a very large topic to the list
> (c.f. e.g. [1]).
> 
> Maybe I should just bite the bullet and submit it anyway, what do you
> think?
> 
> 1. https://lore.kernel.org/git/nycvar.QRO.7.76.6.2204071407160.347@tvgsbejvaqbjf.bet/

We could err on the side of not distracting from the list. I tend to
think the release phase can be a good time to talk about an RFC, since
Junio can ignore the thread until it is actually ready for full review.

At minimum, I appreciate knowing your progress. I can be patient for a
few more weeks as long as I have confidence that it will be shared as
soon as external blockers are cleared.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/25] [RFC] Bundle URIs
  2022-04-08 13:13                 ` Derrick Stolee
@ 2022-04-08 18:26                   ` Junio C Hamano
  0 siblings, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2022-04-08 18:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason,
	Derrick Stolee via GitGitGadget, git, me, newren,
	Robin H . Johnson, Teng Long, Konstantin Ryabitsev

Derrick Stolee <derrickstolee@github.com> writes:

> We could err on the side of not distracting from the list. I tend to
> think the release phase can be a good time to talk about an RFC, since

Yes, I think that it is a good thing to do.  I'd need to be careful
not to miss updates and issues critical to the upcoming release but
a large topic that clearly marks itself as RFC would not get in the
way, I would think.

> Junio can ignore the thread until it is actually ready for full review.

It is true for both during -rc freeze and outside ;-)

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-04-08 18:26 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 18:30 [PATCH 00/25] [RFC] Bundle URIs Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 01/25] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-03-02  2:28   ` Elijah Newren
2022-03-02 14:06     ` Derrick Stolee
2022-03-03  2:19       ` Elijah Newren
2022-03-03  2:44         ` Derrick Stolee
2022-02-23 18:30 ` [PATCH 02/25] bundle: alphabetize subcommands better Derrick Stolee via GitGitGadget
2022-03-11 13:47   ` Ævar Arnfjörð Bjarmason
2022-02-23 18:30 ` [PATCH 03/25] dir: extract starts_with_dot[_dot]_slash() Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 04/25] remote: move relative_url() Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 05/25] remote: allow relative_url() to return an absolute url Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 06/25] http: make http_get_file() external Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 07/25] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 08/25] bundle: implement 'fetch' command for direct bundles Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 09/25] bundle: parse table of contents during 'fetch' Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 10/25] bundle: add --filter option to 'fetch' Derrick Stolee via GitGitGadget
2022-03-11 13:44   ` Ævar Arnfjörð Bjarmason
2022-02-23 18:30 ` [PATCH 11/25] bundle: allow relative URLs in table of contents Derrick Stolee via GitGitGadget
2022-03-11 13:42   ` Ævar Arnfjörð Bjarmason
2022-02-23 18:30 ` [PATCH 12/25] bundle: make it easy to call 'git bundle fetch' Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 13/25] clone: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 14/25] clone: --bundle-uri cannot be combined with --depth Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 15/25] config: add git_config_get_timestamp() Derrick Stolee via GitGitGadget
2022-03-11 13:32   ` Ævar Arnfjörð Bjarmason
2022-02-23 18:30 ` [PATCH 16/25] bundle: only fetch bundles if timestamp is new Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 17/25] fetch: fetch bundles before fetching original data Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 18/25] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason via GitGitGadget
2022-02-23 18:30 ` [PATCH 19/25] protocol-caps: implement cap_features() Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 20/25] serve: understand but do not advertise 'features' capability Derrick Stolee via GitGitGadget
2022-02-23 18:30 ` [PATCH 21/25] serve: advertise 'features' when config exists Derrick Stolee via GitGitGadget
2022-02-23 18:31 ` [PATCH 22/25] connect: implement get_recommended_features() Derrick Stolee via GitGitGadget
2022-02-23 18:31 ` [PATCH 23/25] transport: add connections for 'features' capability Derrick Stolee via GitGitGadget
2022-02-23 18:31 ` [PATCH 24/25] clone: use server-recommended bundle URI Derrick Stolee via GitGitGadget
2022-02-23 18:31 ` [PATCH 25/25] t5601: basic bundle URI test Derrick Stolee via GitGitGadget
2022-02-23 22:17 ` [PATCH 00/25] [RFC] Bundle URIs Ævar Arnfjörð Bjarmason
2022-02-24 14:11   ` Derrick Stolee
2022-03-04 13:30     ` Derrick Stolee
2022-03-04 14:49       ` Ævar Arnfjörð Bjarmason
2022-03-04 15:12         ` Derrick Stolee
2022-03-08 17:15           ` Derrick Stolee
2022-03-10 14:45             ` Johannes Schindelin
2022-04-07 19:08             ` Derrick Stolee
2022-04-08  9:15               ` Ævar Arnfjörð Bjarmason
2022-04-08 13:13                 ` Derrick Stolee
2022-04-08 18:26                   ` Junio C Hamano
2022-03-08  8:18   ` Teng Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).