All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] clone: fix init of refdb with wrong object format
@ 2023-12-06 12:39 Patrick Steinhardt
  2023-12-06 12:39 ` [PATCH 1/7] setup: extract function to create the refdb Patrick Steinhardt
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2793 bytes --]

Hi,

when using git-clone(1), we initialize the complete repository before we
know about the object format used by the remote repository. This means
that we'll potentially create the refdb with the wrong object format in
case the local default object format and remote object format are not
the same.

This isn't much of a problem in the context of the files backend, which
never records the object format anyway. But it is a problem for the
reftable backend, which indeed records the object format in the on-disk
data structures. The result is thus a reftable with wrong object format.

This patch series aims to solve this issue by initializing the refdb at
a later point after we have learned about the remote object format. This
requires some careful reordering of code. Unfortunately, the end result
is not easily verifiable and thus I didn't add tests here. But it does
fix cloning of SHA256 repositories with the in-progress state of the
reftable backend.

While at it I noticed that this actually fixes a bug with bundle URIs
when the object formats diverge in this way.

The series is layed out as follows:

  - Patch 1 + 2: split out a function to create the refdb and make it
    possible to skip its initialization in `init_db()`.

  - Patch 3: allows git-remote-curl(1) to work with repos that get
    initialized during its lifetime.

  - Patch 4 - 6: address various corner cases where we access the refdb
    before we learned about the object format.

  - Patch 7: move initialization of the refdb to happen after we have
    learned about the object format.

This patch series is actually the last incompatibility for the reftable
backend that I have found. All tests except for the files-backend
specific ones pass now with the current state I have at [1], which is
currently at e6f2f592b7 (t: skip tests which are incompatible with
reftable, 2023-11-24)

Thanks in advance for your reviews!

Patrick

Patrick Steinhardt (7):
  setup: extract function to create the refdb
  setup: allow skipping creation of the refdb
  remote-curl: rediscover repository when fetching refs
  builtin/clone: fix bundle URIs with mismatching object formats
  builtin/clone: set up sparse checkout later
  builtin/clone: skip reading HEAD when retrieving remote
  builtin/clone: create the refdb with the correct object format

 builtin/clone.c             |  65 ++++++++++++----------
 remote-curl.c               |   7 ++-
 remote.c                    |  26 +++++----
 remote.h                    |   1 +
 setup.c                     | 106 +++++++++++++++++++++---------------
 setup.h                     |   6 +-
 t/t5558-clone-bundle-uri.sh |  18 ++++++
 7 files changed, 140 insertions(+), 89 deletions(-)

-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/7] setup: extract function to create the refdb
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
@ 2023-12-06 12:39 ` Patrick Steinhardt
  2023-12-06 21:10   ` Karthik Nayak
  2023-12-06 12:39 ` [PATCH 2/7] setup: allow skipping creation of " Patrick Steinhardt
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4260 bytes --]

We're about to let callers skip creation of the reference database when
calling `init_db()`. Extract the logic into a standalone function so
that it becomes easier to do this refactoring.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 95 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 38 deletions(-)

diff --git a/setup.c b/setup.c
index fc592dc6dd..9fcb64159f 100644
--- a/setup.c
+++ b/setup.c
@@ -1885,6 +1885,60 @@ void initialize_repository_version(int hash_algo, int reinit)
 		git_config_set_gently("extensions.objectformat", NULL);
 }
 
+static int is_reinit(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	char junk[2];
+	int ret;
+
+	git_path_buf(&buf, "HEAD");
+	ret = !access(buf.buf, R_OK) || readlink(buf.buf, junk, sizeof(junk) - 1) != -1;
+	strbuf_release(&buf);
+	return ret;
+}
+
+static void create_reference_database(const char *initial_branch, int quiet)
+{
+	struct strbuf err = STRBUF_INIT;
+	int reinit = is_reinit();
+
+	/*
+	 * We need to create a "refs" dir in any case so that older
+	 * versions of git can tell that this is a repository.
+	 */
+	safe_create_dir(git_path("refs"), 1);
+	adjust_shared_perm(git_path("refs"));
+
+	if (refs_init_db(&err))
+		die("failed to set up refs db: %s", err.buf);
+
+	/*
+	 * Point the HEAD symref to the initial branch with if HEAD does
+	 * not yet exist.
+	 */
+	if (!reinit) {
+		char *ref;
+
+		if (!initial_branch)
+			initial_branch = git_default_branch_name(quiet);
+
+		ref = xstrfmt("refs/heads/%s", initial_branch);
+		if (check_refname_format(ref, 0) < 0)
+			die(_("invalid initial branch name: '%s'"),
+			    initial_branch);
+
+		if (create_symref("HEAD", ref, NULL) < 0)
+			exit(1);
+		free(ref);
+	}
+
+	if (reinit && initial_branch)
+		warning(_("re-init: ignored --initial-branch=%s"),
+			initial_branch);
+
+	strbuf_release(&err);
+}
+
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
 				const char *initial_branch,
@@ -1896,10 +1950,8 @@ static int create_default_files(const char *template_path,
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
 	char *path;
-	char junk[2];
 	int reinit;
 	int filemode;
-	struct strbuf err = STRBUF_INIT;
 	const char *init_template_dir = NULL;
 	const char *work_tree = get_git_work_tree();
 
@@ -1919,6 +1971,8 @@ static int create_default_files(const char *template_path,
 	reset_shared_repository();
 	git_config(git_default_config, NULL);
 
+	reinit = is_reinit();
+
 	/*
 	 * We must make sure command-line options continue to override any
 	 * values we might have just re-read from the config.
@@ -1962,39 +2016,7 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	/*
-	 * We need to create a "refs" dir in any case so that older
-	 * versions of git can tell that this is a repository.
-	 */
-	safe_create_dir(git_path("refs"), 1);
-	adjust_shared_perm(git_path("refs"));
-
-	if (refs_init_db(&err))
-		die("failed to set up refs db: %s", err.buf);
-
-	/*
-	 * Point the HEAD symref to the initial branch with if HEAD does
-	 * not yet exist.
-	 */
-	path = git_path_buf(&buf, "HEAD");
-	reinit = (!access(path, R_OK)
-		  || readlink(path, junk, sizeof(junk)-1) != -1);
-	if (!reinit) {
-		char *ref;
-
-		if (!initial_branch)
-			initial_branch = git_default_branch_name(quiet);
-
-		ref = xstrfmt("refs/heads/%s", initial_branch);
-		if (check_refname_format(ref, 0) < 0)
-			die(_("invalid initial branch name: '%s'"),
-			    initial_branch);
-
-		if (create_symref("HEAD", ref, NULL) < 0)
-			exit(1);
-		free(ref);
-	}
-
+	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2158,9 +2180,6 @@ int init_db(const char *git_dir, const char *real_git_dir,
 				      prev_bare_repository,
 				      init_shared_repository,
 				      flags & INIT_DB_QUIET);
-	if (reinit && initial_branch)
-		warning(_("re-init: ignored --initial-branch=%s"),
-			initial_branch);
 
 	create_object_directory();
 
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/7] setup: allow skipping creation of the refdb
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
  2023-12-06 12:39 ` [PATCH 1/7] setup: extract function to create the refdb Patrick Steinhardt
@ 2023-12-06 12:39 ` Patrick Steinhardt
  2023-12-06 12:39 ` [PATCH 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2753 bytes --]

Allow callers to skip creation of the reference database via a new flag
`INIT_DB_SKIP_REFDB`, which is required for git-clone(1) so that we can
create it at a later point once the object format has been discovered
from the remote repository.

Note that we also uplift the call to `create_reference_database()` into
`init_db()`, which makes it easier to handle the new flag for us. This
changes the order in which we do initialization so that we now set up
the Git configuration before we create the reference database. In
practice this move should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 13 +++++--------
 setup.h |  5 +++--
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/setup.c b/setup.c
index 9fcb64159f..a80fc09b9c 100644
--- a/setup.c
+++ b/setup.c
@@ -1941,11 +1941,9 @@ static void create_reference_database(const char *initial_branch, int quiet)
 
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
-				const char *initial_branch,
 				const struct repository_format *fmt,
 				int prev_bare_repository,
-				int init_shared_repository,
-				int quiet)
+				int init_shared_repository)
 {
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
@@ -2016,7 +2014,6 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2176,11 +2173,11 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	validate_hash_algorithm(&repo_fmt, hash);
 
 	reinit = create_default_files(template_dir, original_git_dir,
-				      initial_branch, &repo_fmt,
-				      prev_bare_repository,
-				      init_shared_repository,
-				      flags & INIT_DB_QUIET);
+				      &repo_fmt, prev_bare_repository,
+				      init_shared_repository);
 
+	if (!(flags & INIT_DB_SKIP_REFDB))
+		create_reference_database(initial_branch, flags & INIT_DB_QUIET);
 	create_object_directory();
 
 	if (get_shared_repository()) {
diff --git a/setup.h b/setup.h
index b48cf1c43b..cbf538286b 100644
--- a/setup.h
+++ b/setup.h
@@ -169,8 +169,9 @@ int verify_repository_format(const struct repository_format *format,
  */
 void check_repository_format(struct repository_format *fmt);
 
-#define INIT_DB_QUIET 0x0001
-#define INIT_DB_EXIST_OK 0x0002
+#define INIT_DB_QUIET      (1 << 0)
+#define INIT_DB_EXIST_OK   (1 << 1)
+#define INIT_DB_SKIP_REFDB (1 << 2)
 
 int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *template_dir, int hash_algo,
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/7] remote-curl: rediscover repository when fetching refs
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
  2023-12-06 12:39 ` [PATCH 1/7] setup: extract function to create the refdb Patrick Steinhardt
  2023-12-06 12:39 ` [PATCH 2/7] setup: allow skipping creation of " Patrick Steinhardt
@ 2023-12-06 12:39 ` Patrick Steinhardt
  2023-12-08 23:09   ` Junio C Hamano
  2023-12-06 12:40 ` [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:39 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1997 bytes --]

We're about to change git-clone(1) so that we set up the reference
database at a later point. This change will cause git-remote-curl(1) to
not detect the repository anymore due to "HEAD" not having been created
yet at the time it spawns, and thus causes it to error out once it is
asked to fetch the references.

We can address this issue by trying to re-discover the Git repository in
case none was detected at startup time. With this change, the clone will
look as following:

  1. git-clone(1) sets up the initial repository, excluding the
     reference database.

  2. git-clone(1) spawns git-remote-curl(1), which will be unable to
     detect the repository due to a missing "HEAD".

  3. git-clone(1) asks git-remote-curl(1) to list remote references.
     This works just fine as this step does not require a local
     repository

  4. git-clone(1) creates the reference database as it has now learned
     about the object format.

  5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
     The latter notices that it doesn't have a repository available, but
     it now knows to try and re-discover it.

If the re-discovery succeeds in the last step we can continue with the
clone.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 remote-curl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index ef05752ca5..fc29757b65 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1564,8 +1564,11 @@ int cmd_main(int argc, const char **argv)
 		if (buf.len == 0)
 			break;
 		if (starts_with(buf.buf, "fetch ")) {
-			if (nongit)
-				die(_("remote-curl: fetch attempted without a local repo"));
+			if (nongit) {
+				setup_git_directory_gently(&nongit);
+				if (nongit)
+					die(_("remote-curl: fetch attempted without a local repo"));
+			}
 			parse_fetch(&buf);
 
 		} else if (!strcmp(buf.buf, "list") || starts_with(buf.buf, "list ")) {
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2023-12-06 12:39 ` [PATCH 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
@ 2023-12-06 12:40 ` Patrick Steinhardt
  2023-12-06 21:13   ` Karthik Nayak
  2023-12-08 23:11   ` Junio C Hamano
  2023-12-06 12:40 ` [PATCH 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:40 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 6666 bytes --]

We create the reference database in git-clone(1) quite early before
connecting to the remote repository. Given that we do not yet know about
the object format that the remote repository uses at that point in time
the consequence is that the refdb may be initialized with the wrong
object format.

This is not a problem in the context of the files backend as we do not
encode the object format anywhere, and furthermore the only reference
that we write between initializing the refdb and learning about the
object format is the "HEAD" symref. It will become a problem though once
we land the reftable backend, which indeed does require to know about
the proper object format at the time of creation. We thus need to
rearrange the logic in git-clone(1) so that we only initialize the refdb
once we have learned about the actual object format.

As a first step, move listing of remote references to happen earlier,
which also allow us to set up the hash algorithm of the repository
earlier now. While we aim to execute this logic as late as possible
until after most of the setup has happened already, detection of the
object format and thus later the setup of the reference database must
happen before any other logic that may spawn Git commands or otherwise
these Git commands may not recognize the repository as such.

The first Git step where we expect the repository to be fully initalized
is when we fetch bundles via bundle URIs. Funny enough, the comments
there also state that "the_repository must match the cloned repo", which
is indeed not necessarily the case for the hash algorithm right now. So
in practice it is the right thing to detect the remote's object format
before downloading bundle URIs anyway, and not doing so causes clones
with bundle URIS to fail when the local default object format does not
match the remote repository's format.

Unfortunately though, this creates a new issue: downloading bundles may
take a long time, so if we list refs beforehand they might've grown
stale meanwhile. It is not clear how to solve this issue except for a
second reference listing though after we have downloaded the bundles,
which may be an expensive thing to do.

Arguably though, it's preferable to have a staleness issue compared to
being unable to clone a repository altogether.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c             | 48 ++++++++++++++++++-------------------
 t/t5558-clone-bundle-uri.sh | 18 ++++++++++++++
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index c6357af949..d188650881 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1266,6 +1266,26 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (transport->smart_options && !deepen && !filter_options.choice)
 		transport->smart_options->check_self_contained_and_connected = 1;
 
+	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
+	refspec_ref_prefixes(&remote->fetch,
+			     &transport_ls_refs_options.ref_prefixes);
+	if (option_branch)
+		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
+				  option_branch);
+	if (!option_no_tags)
+		strvec_push(&transport_ls_refs_options.ref_prefixes,
+			    "refs/tags/");
+
+	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
+
+	/*
+	 * Now that we know what algorithm the remote side is using, let's set
+	 * ours to the same thing.
+	 */
+	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+	initialize_repository_version(hash_algo, 1);
+	repo_set_hash_algo(the_repository, hash_algo);
+
 	/*
 	 * Before fetching from the remote, download and install bundle
 	 * data from the --bundle-uri option.
@@ -1281,24 +1301,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 				bundle_uri);
 		else if (has_heuristic)
 			git_config_set_gently("fetch.bundleuri", bundle_uri);
-	}
-
-	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
-	refspec_ref_prefixes(&remote->fetch,
-			     &transport_ls_refs_options.ref_prefixes);
-	if (option_branch)
-		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
-				  option_branch);
-	if (!option_no_tags)
-		strvec_push(&transport_ls_refs_options.ref_prefixes,
-			    "refs/tags/");
-
-	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
-
-	if (refs)
-		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
-
-	if (!bundle_uri) {
+	} else {
 		/*
 		* Populate transport->got_remote_bundle_uri and
 		* transport->bundle_uri. We might get nothing.
@@ -1319,13 +1322,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-		/*
-		 * Now that we know what algorithm the remote side is using,
-		 * let's set ours to the same thing.
-		 */
-	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-	initialize_repository_version(hash_algo, 1);
-	repo_set_hash_algo(the_repository, hash_algo);
+	if (refs)
+		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
 	if (mapped_refs) {
 		/*
diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
index 996a08e90c..1ca5f745e7 100755
--- a/t/t5558-clone-bundle-uri.sh
+++ b/t/t5558-clone-bundle-uri.sh
@@ -33,6 +33,15 @@ test_expect_success 'clone with path bundle' '
 	test_cmp expect actual
 '
 
+test_expect_success 'clone with path bundle and non-default hash' '
+	test_when_finished "rm -rf clone-path-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="clone-from/B.bundle" \
+		clone-from clone-path-non-default-hash &&
+	git -C clone-path-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone with file:// bundle' '
 	git clone --bundle-uri="file://$(pwd)/clone-from/B.bundle" \
 		clone-from clone-file &&
@@ -284,6 +293,15 @@ test_expect_success 'clone HTTP bundle' '
 	test_config -C clone-http log.excludedecoration refs/bundle/
 '
 
+test_expect_success 'clone HTTP bundle with non-default hash' '
+	test_when_finished "rm -rf clone-http-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="$HTTPD_URL/B.bundle" \
+		"$HTTPD_URL/smart/fetch.git" clone-http-non-default-hash &&
+	git -C clone-http-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone bundle list (HTTP, no heuristic)' '
 	test_when_finished rm -f trace*.txt &&
 
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/7] builtin/clone: set up sparse checkout later
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2023-12-06 12:40 ` [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
@ 2023-12-06 12:40 ` Patrick Steinhardt
  2023-12-06 12:40 ` [PATCH 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:40 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1387 bytes --]

When asked to do a sparse checkout, then git-clone(1) will spawn
`git sparse-checkout set` to set up the configuration accordingly. This
requires a proper Git repository or otherwise the command will fail. But
as we are about to move creation of the reference database to a later
point, this prerequisite will not hold anymore.

Move the logic to a later point in time where we know to have created
the reference database already.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index d188650881..9c60923f31 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,9 +1185,6 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	if (option_sparse_checkout && git_sparse_checkout_init(dir))
-		return 1;
-
 	remote = remote_get(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
@@ -1426,6 +1423,9 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		dissociate_from_references();
 	}
 
+	if (option_sparse_checkout && git_sparse_checkout_init(dir))
+		return 1;
+
 	junk_mode = JUNK_LEAVE_REPO;
 	err = checkout(submodule_progress, filter_submodules);
 
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 6/7] builtin/clone: skip reading HEAD when retrieving remote
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2023-12-06 12:40 ` [PATCH 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
@ 2023-12-06 12:40 ` Patrick Steinhardt
  2023-12-06 12:40 ` [PATCH 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:40 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 5197 bytes --]

After we have set up the remote configuration in git-clone(1) we'll call
`remote_get()` to read the remote from the on-disk configuration. But
next to reading the on-disk configuration, `remote_get()` will also
cause us to try and read the repository's HEAD reference so that we can
figure out the current branch. Besides being pointless in git-clone(1)
because we're operating in an empty repository anyway, this will also
break once we move creation of the reference database to a later point
in time.

Refactor the code to introduce a new `remote_get_early()` function that
will skip reading the HEAD reference to address this issue.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c |  2 +-
 remote.c        | 26 ++++++++++++++++----------
 remote.h        |  1 +
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 9c60923f31..06966c5d4c 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,7 +1185,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	remote = remote_get(remote_name);
+	remote = remote_get_early(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
 			branch_top.buf);
diff --git a/remote.c b/remote.c
index abb24822be..051d0a64a0 100644
--- a/remote.c
+++ b/remote.c
@@ -509,7 +509,7 @@ static void alias_all_urls(struct remote_state *remote_state)
 	}
 }
 
-static void read_config(struct repository *repo)
+static void read_config(struct repository *repo, int early)
 {
 	int flag;
 
@@ -518,7 +518,7 @@ static void read_config(struct repository *repo)
 	repo->remote_state->initialized = 1;
 
 	repo->remote_state->current_branch = NULL;
-	if (startup_info->have_repository) {
+	if (startup_info->have_repository && !early) {
 		const char *head_ref = refs_resolve_ref_unsafe(
 			get_main_ref_store(repo), "HEAD", 0, NULL, &flag);
 		if (head_ref && (flag & REF_ISSYMREF) &&
@@ -561,7 +561,7 @@ static const char *remotes_remote_for_branch(struct remote_state *remote_state,
 
 const char *remote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_remote_for_branch(the_repository->remote_state, branch,
@@ -587,7 +587,7 @@ remotes_pushremote_for_branch(struct remote_state *remote_state,
 
 const char *pushremote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_pushremote_for_branch(the_repository->remote_state,
@@ -599,7 +599,7 @@ static struct remote *remotes_remote_get(struct remote_state *remote_state,
 
 const char *remote_ref_for_branch(struct branch *branch, int for_push)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (branch) {
@@ -709,7 +709,13 @@ remotes_remote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *remote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
+	return remotes_remote_get(the_repository->remote_state, name);
+}
+
+struct remote *remote_get_early(const char *name)
+{
+	read_config(the_repository, 1);
 	return remotes_remote_get(the_repository->remote_state, name);
 }
 
@@ -722,7 +728,7 @@ remotes_pushremote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *pushremote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	return remotes_pushremote_get(the_repository->remote_state, name);
 }
 
@@ -738,7 +744,7 @@ int remote_is_configured(struct remote *remote, int in_repo)
 int for_each_remote(each_remote_fn fn, void *priv)
 {
 	int i, result = 0;
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	for (i = 0; i < the_repository->remote_state->remotes_nr && !result;
 	     i++) {
 		struct remote *remote =
@@ -1831,7 +1837,7 @@ struct branch *branch_get(const char *name)
 {
 	struct branch *ret;
 
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	if (!name || !*name || !strcmp(name, "HEAD"))
 		ret = the_repository->remote_state->current_branch;
 	else
@@ -1973,7 +1979,7 @@ static const char *branch_get_push_1(struct remote_state *remote_state,
 
 const char *branch_get_push(struct branch *branch, struct strbuf *err)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (!branch)
diff --git a/remote.h b/remote.h
index cdc8b1db42..79353ba226 100644
--- a/remote.h
+++ b/remote.h
@@ -118,6 +118,7 @@ struct remote {
  * and configuration.
  */
 struct remote *remote_get(const char *name);
+struct remote *remote_get_early(const char *name);
 
 struct remote *pushremote_get(const char *name);
 int remote_is_configured(struct remote *remote, int in_repo);
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 7/7] builtin/clone: create the refdb with the correct object format
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2023-12-06 12:40 ` [PATCH 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
@ 2023-12-06 12:40 ` Patrick Steinhardt
  2023-12-06 13:09 ` [PATCH 0/7] clone: fix init of refdb with wrong " Patrick Steinhardt
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 12:40 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3121 bytes --]

We're currently creating the reference database with a potentially
incorrect object format when the remote repository's object format is
different from the local default object format. This works just fine for
now because the files backend never records the object format anywhere.
But this logic will fail with any new reference backend that encodes
this information in some form either on-disk or in-memory.

The preceding commits have reshuffled code in git-clone(1) so that there
is no code path that will access the reference database before we have
detected the remote's object format. With these refactorings we can now
defer initialization of the reference database until after we have
learned the remote's object format and thus initialize it with the
correct format from the get-go.

These refactorings are required to make git-clone(1) work with the
upcoming reftable backend when cloning repositories with the SHA256
object format.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c | 9 ++++++++-
 setup.c         | 2 +-
 setup.h         | 1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 06966c5d4c..fd052b8b54 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1097,8 +1097,14 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	/*
+	 * Initialize the repository, but skip initializing the reference
+	 * database. We do not yet know about the object format of the
+	 * repository, and reference backends may persist that information into
+	 * their on-disk data structures.
+	 */
 	init_db(git_dir, real_git_dir, option_template, GIT_HASH_UNKNOWN, NULL,
-		do_not_override_repo_unix_permissions, INIT_DB_QUIET);
+		do_not_override_repo_unix_permissions, INIT_DB_QUIET | INIT_DB_SKIP_REFDB);
 
 	if (real_git_dir) {
 		free((char *)git_dir);
@@ -1282,6 +1288,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 	initialize_repository_version(hash_algo, 1);
 	repo_set_hash_algo(the_repository, hash_algo);
+	create_reference_database(NULL, 1);
 
 	/*
 	 * Before fetching from the remote, download and install bundle
diff --git a/setup.c b/setup.c
index a80fc09b9c..e1d0ce29c6 100644
--- a/setup.c
+++ b/setup.c
@@ -1897,7 +1897,7 @@ static int is_reinit(void)
 	return ret;
 }
 
-static void create_reference_database(const char *initial_branch, int quiet)
+void create_reference_database(const char *initial_branch, int quiet)
 {
 	struct strbuf err = STRBUF_INIT;
 	int reinit = is_reinit();
diff --git a/setup.h b/setup.h
index cbf538286b..3f0f17c351 100644
--- a/setup.h
+++ b/setup.h
@@ -178,6 +178,7 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *initial_branch, int init_shared_repository,
 	    unsigned int flags);
 void initialize_repository_version(int hash_algo, int reinit);
+void create_reference_database(const char *initial_branch, int quiet);
 
 /*
  * NOTE NOTE NOTE!!
-- 
2.43.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (6 preceding siblings ...)
  2023-12-06 12:40 ` [PATCH 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
@ 2023-12-06 13:09 ` Patrick Steinhardt
  2023-12-10  3:16 ` Junio C Hamano
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
  9 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-06 13:09 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3208 bytes --]

On Wed, Dec 06, 2023 at 01:39:44PM +0100, Patrick Steinhardt wrote:
> Hi,
> 
> when using git-clone(1), we initialize the complete repository before we
> know about the object format used by the remote repository. This means
> that we'll potentially create the refdb with the wrong object format in
> case the local default object format and remote object format are not
> the same.
> 
> This isn't much of a problem in the context of the files backend, which
> never records the object format anyway. But it is a problem for the
> reftable backend, which indeed records the object format in the on-disk
> data structures. The result is thus a reftable with wrong object format.
> 
> This patch series aims to solve this issue by initializing the refdb at
> a later point after we have learned about the remote object format. This
> requires some careful reordering of code. Unfortunately, the end result
> is not easily verifiable and thus I didn't add tests here. But it does
> fix cloning of SHA256 repositories with the in-progress state of the
> reftable backend.
> 
> While at it I noticed that this actually fixes a bug with bundle URIs
> when the object formats diverge in this way.
> 
> The series is layed out as follows:
> 
>   - Patch 1 + 2: split out a function to create the refdb and make it
>     possible to skip its initialization in `init_db()`.
> 
>   - Patch 3: allows git-remote-curl(1) to work with repos that get
>     initialized during its lifetime.
> 
>   - Patch 4 - 6: address various corner cases where we access the refdb
>     before we learned about the object format.
> 
>   - Patch 7: move initialization of the refdb to happen after we have
>     learned about the object format.
> 
> This patch series is actually the last incompatibility for the reftable
> backend that I have found. All tests except for the files-backend
> specific ones pass now with the current state I have at [1], which is
> currently at e6f2f592b7 (t: skip tests which are incompatible with
> reftable, 2023-11-24)

I forgot to add the link to the merge request that contains the current
reftable backend's implementation in case anybody is interested:
https://gitlab.com/gitlab-org/git/-/merge_requests/58.

Patrick

> Thanks in advance for your reviews!
> 
> Patrick
> 
> Patrick Steinhardt (7):
>   setup: extract function to create the refdb
>   setup: allow skipping creation of the refdb
>   remote-curl: rediscover repository when fetching refs
>   builtin/clone: fix bundle URIs with mismatching object formats
>   builtin/clone: set up sparse checkout later
>   builtin/clone: skip reading HEAD when retrieving remote
>   builtin/clone: create the refdb with the correct object format
> 
>  builtin/clone.c             |  65 ++++++++++++----------
>  remote-curl.c               |   7 ++-
>  remote.c                    |  26 +++++----
>  remote.h                    |   1 +
>  setup.c                     | 106 +++++++++++++++++++++---------------
>  setup.h                     |   6 +-
>  t/t5558-clone-bundle-uri.sh |  18 ++++++
>  7 files changed, 140 insertions(+), 89 deletions(-)
> 
> -- 
> 2.43.0
> 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] setup: extract function to create the refdb
  2023-12-06 12:39 ` [PATCH 1/7] setup: extract function to create the refdb Patrick Steinhardt
@ 2023-12-06 21:10   ` Karthik Nayak
  2023-12-07  7:22     ` Patrick Steinhardt
  0 siblings, 1 reply; 33+ messages in thread
From: Karthik Nayak @ 2023-12-06 21:10 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> +static void create_reference_database(const char *initial_branch, int quiet)
> +{
> +       struct strbuf err = STRBUF_INIT;
> +       int reinit = is_reinit();
> +
> +       /*
> +        * We need to create a "refs" dir in any case so that older
> +        * versions of git can tell that this is a repository.
> +        */

How does this work though, even if an earlier version of git can tell
that this is a repository,
it still won't be able to read the reftable backend. In that sense,
what do we achieve here?

> +       safe_create_dir(git_path("refs"), 1);
> +       adjust_shared_perm(git_path("refs"));
> +

Not related to your commit per se, but we ignore the return value
here, shouldn't we die in this case?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats
  2023-12-06 12:40 ` [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
@ 2023-12-06 21:13   ` Karthik Nayak
  2023-12-07  7:22     ` Patrick Steinhardt
  2023-12-08 23:11   ` Junio C Hamano
  1 sibling, 1 reply; 33+ messages in thread
From: Karthik Nayak @ 2023-12-06 21:13 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> The first Git step where we expect the repository to be fully initalized
> is when we fetch bundles via bundle URIs. Funny enough, the comments
> there also state that "the_repository must match the cloned repo", which
> is indeed not necessarily the case for the hash algorithm right now. So
> in practice it is the right thing to detect the remote's object format
> before downloading bundle URIs anyway, and not doing so causes clones
> with bundle URIS to fail when the local default object format does not
> match the remote repository's format.
>

Nit: s/URIS/URIs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats
  2023-12-06 21:13   ` Karthik Nayak
@ 2023-12-07  7:22     ` Patrick Steinhardt
  0 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-07  7:22 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

On Wed, Dec 06, 2023 at 10:13:43PM +0100, Karthik Nayak wrote:
> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> > The first Git step where we expect the repository to be fully initalized
> > is when we fetch bundles via bundle URIs. Funny enough, the comments
> > there also state that "the_repository must match the cloned repo", which
> > is indeed not necessarily the case for the hash algorithm right now. So
> > in practice it is the right thing to detect the remote's object format
> > before downloading bundle URIs anyway, and not doing so causes clones
> > with bundle URIS to fail when the local default object format does not
> > match the remote repository's format.
> >
> 
> Nit: s/URIS/URIs

Thanks, fixed locally. Will wait with v2 though until there are more
review comments.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] setup: extract function to create the refdb
  2023-12-06 21:10   ` Karthik Nayak
@ 2023-12-07  7:22     ` Patrick Steinhardt
  2023-12-08 22:54       ` Junio C Hamano
  2023-12-11 15:50       ` Karthik Nayak
  0 siblings, 2 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-07  7:22 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2242 bytes --]

On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> > +static void create_reference_database(const char *initial_branch, int quiet)
> > +{
> > +       struct strbuf err = STRBUF_INIT;
> > +       int reinit = is_reinit();
> > +
> > +       /*
> > +        * We need to create a "refs" dir in any case so that older
> > +        * versions of git can tell that this is a repository.
> > +        */
> 
> How does this work though, even if an earlier version of git can tell
> that this is a repository, it still won't be able to read the reftable
> backend. In that sense, what do we achieve here?

This is a good question, and there is related ongoing discussion about
this topic in the thread starting at [1]. There are a few benefits to
letting clients discover such repos even if they don't understand the
new reference backend format:

  - They know to stop walking up the parent-directory chain. Otherwise a
    client might end up detecting a Git repository in the parent dir.

  - The user gets a proper error message why the repository cannot be
    accessed. Instead of failing to detect the repository altogether we
    instead say that we don't understand the "extensions.refFormat"
    extension.

Maybe there are other cases I can't think of right now.

> > +       safe_create_dir(git_path("refs"), 1);
> > +       adjust_shared_perm(git_path("refs"));
> > +
> 
> Not related to your commit per se, but we ignore the return value
> here, shouldn't we die in this case?

While the end result wouldn't be quite what the user asks for, the only
negative consequence is that the repository is inaccessible to others. I
think this failure mode is comparatively benign -- if it were the other
way round and we'd over-share the repository it would more severe.

So while I don't think that dying makes much sense here, I could
certainly see us adding a warning so that the user at least knows that
something went wrong. I'd rather want to keep this out of the current
patch series, but could certainly see such a warning added in a follow
up patch series.

Patrick

[1]: <ZWcOvjGPVS_CMUAk@tanuki>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] setup: extract function to create the refdb
  2023-12-07  7:22     ` Patrick Steinhardt
@ 2023-12-08 22:54       ` Junio C Hamano
  2023-12-11 11:34         ` Patrick Steinhardt
  2023-12-11 15:50       ` Karthik Nayak
  1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2023-12-08 22:54 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Karthik Nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
>> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
>> > +static void create_reference_database(const char *initial_branch, int quiet)
>> > +{
>> > +       struct strbuf err = STRBUF_INIT;
>> > +       int reinit = is_reinit();
>> > +
>> > +       /*
>> > +        * We need to create a "refs" dir in any case so that older
>> > +        * versions of git can tell that this is a repository.
>> > +        */
>> 
>> How does this work though, even if an earlier version of git can tell
>> that this is a repository, it still won't be able to read the reftable
>> backend. In that sense, what do we achieve here?
>
> This is a good question, and there is related ongoing discussion about
> this topic in the thread starting at [1]. There are a few benefits to
> letting clients discover such repos even if they don't understand the
> new reference backend format:
>
>   - They know to stop walking up the parent-directory chain. Otherwise a
>     client might end up detecting a Git repository in the parent dir.
>
>   - The user gets a proper error message why the repository cannot be
>     accessed. Instead of failing to detect the repository altogether we
>     instead say that we don't understand the "extensions.refFormat"
>     extension.

Yup, both are very good reasons.  Would it help to sneak a condensed
version of it in the in-code comment, perhaps?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] remote-curl: rediscover repository when fetching refs
  2023-12-06 12:39 ` [PATCH 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
@ 2023-12-08 23:09   ` Junio C Hamano
  2023-12-11 11:35     ` Patrick Steinhardt
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2023-12-08 23:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> We're about to change git-clone(1) so that we set up the reference
> database at a later point. This change will cause git-remote-curl(1) to
> not detect the repository anymore due to "HEAD" not having been created
> yet at the time it spawns, and thus causes it to error out once it is
> asked to fetch the references.
>
> We can address this issue by trying to re-discover the Git repository in
> case none was detected at startup time. With this change, the clone will
> look as following:
>
>   1. git-clone(1) sets up the initial repository, excluding the
>      reference database.
>
>   2. git-clone(1) spawns git-remote-curl(1), which will be unable to
>      detect the repository due to a missing "HEAD".
>
>   3. git-clone(1) asks git-remote-curl(1) to list remote references.
>      This works just fine as this step does not require a local
>      repository
>
>   4. git-clone(1) creates the reference database as it has now learned
>      about the object format.

Sorry, but I am not sure I understand this step.  I assume you mean
by "the object format" what hash function is used to index the
objects (which can be learned from the remote "origin" in step 2 and
we can choose to use the one they use), not what ref backend is used
(which is purely a local matter and we do not need to know what is
used at the "origin").  Why do we need to wait initializing ref
backend until we learn what hash is being in use?

>   5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
>      The latter notices that it doesn't have a repository available, but
>      it now knows to try and re-discover it.
>
> If the re-discovery succeeds in the last step we can continue with the
> clone.

OK.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats
  2023-12-06 12:40 ` [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
  2023-12-06 21:13   ` Karthik Nayak
@ 2023-12-08 23:11   ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2023-12-08 23:11 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> This is not a problem in the context of the files backend ...
> It will become a problem though once
> we land the reftable backend, which indeed does require to know about
> the proper object format at the time of creation.

OK.  That answers the question I had on the previous step.

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (7 preceding siblings ...)
  2023-12-06 13:09 ` [PATCH 0/7] clone: fix init of refdb with wrong " Patrick Steinhardt
@ 2023-12-10  3:16 ` Junio C Hamano
  2023-12-11 11:34   ` Patrick Steinhardt
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
  9 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2023-12-10  3:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> While at it I noticed that this actually fixes a bug with bundle URIs
> when the object formats diverge in this way.
> ...
> This patch series is actually the last incompatibility for the reftable
> backend that I have found. All tests except for the files-backend
> specific ones pass now with the current state I have at [1], which is
> currently at e6f2f592b7 (t: skip tests which are incompatible with
> reftable, 2023-11-24)

An existing test

    $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh

passes with vanilla Git 2.43, but with these patches applied, it
fails the "7 - empty dumb HTTP" step.

Thanks.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-10  3:16 ` Junio C Hamano
@ 2023-12-11 11:34   ` Patrick Steinhardt
  2023-12-11 14:57     ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-11 11:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, brian m. carlson

[-- Attachment #1: Type: text/plain, Size: 3953 bytes --]

On Sat, Dec 09, 2023 at 07:16:30PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > While at it I noticed that this actually fixes a bug with bundle URIs
> > when the object formats diverge in this way.
> > ...
> > This patch series is actually the last incompatibility for the reftable
> > backend that I have found. All tests except for the files-backend
> > specific ones pass now with the current state I have at [1], which is
> > currently at e6f2f592b7 (t: skip tests which are incompatible with
> > reftable, 2023-11-24)
> 
> An existing test
> 
>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> 
> passes with vanilla Git 2.43, but with these patches applied, it
> fails the "7 - empty dumb HTTP" step.

Indeed -- now that the GitLab CI definitions have landed in your master
branch I can also see this failure as part of our CI job [1]. And with
the NixOS httpd refactorings I'm actually able to run these tests in the
first place. Really happy to see that things come together like this, as
it means that we'll detect such issues before we send the series to the
mailing list from now on.

Anyway, regarding the test itself. I think the refactorings in this
patch series uncover a preexisting and already-known issue with empty
repositories when using the dumb HTTP protocol: there is no proper way
to know about the hash function that the remote repository uses [2].

Before my refactorings we used to fall back to the local default hash
format with which we've already initialized the repository, which is
wrong. Now we use the hash format we detected via the remote, which we
cannot detect because the remote is empty and does not advertise the
hash function, so we fall back to SHA1 and thus also do the wrong thing.
The only correct thing here would be to use the actual hash function
that the remote repository uses, but we have no to do so. So we're kind
of stuck here and can only choose between two different wrong ways to
pick the hash function.

We can restore the previous wrong behaviour by honoring GIT_DEFAULT_HASH
in git-remote-curl(1) in the case where we do not have a repository set
up yet. So something similar in spirit to the following:

```
diff --git a/remote-curl.c b/remote-curl.c
index fc29757b65..7e97c9c2e9 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -27,6 +27,7 @@
 static struct remote *remote;
 /* always ends with a trailing slash */
 static struct strbuf url = STRBUF_INIT;
+static int nongit;
 
 struct options {
 	int verbosity;
@@ -275,8 +276,30 @@ static const struct git_hash_algo *detect_hash_algo(struct discovery *heads)
 {
 	const char *p = memchr(heads->buf, '\t', heads->len);
 	int algo;
-	if (!p)
-		return the_hash_algo;
+
+	if (!p) {
+		const char *env;
+
+		if (!nongit)
+			return the_hash_algo;
+
+		/*
+		 * In case the remote neither advertises the hash format nor
+		 * any references we have no way to detect the correct hash
+		 * format. We can thus only guess what the remote is using,
+		 * where the best guess is to fall back to the default hash.
+		 */
+		env = getenv("GIT_DEFAULT_HASH");
+		if (env) {
+			algo = hash_algo_by_name(env);
+			if (algo == GIT_HASH_UNKNOWN)
+				die(_("unknown hash algorithm '%s'"), env);
+		} else {
+			algo = GIT_HASH_SHA1;
+		}
+
+		return &hash_algos[algo];
+	}
 
 	algo = hash_algo_by_length((p - heads->buf) / 2);
 	if (algo == GIT_HASH_UNKNOWN)
@@ -1521,7 +1544,6 @@ static int stateless_connect(const char *service_name)
 int cmd_main(int argc, const char **argv)
 {
 	struct strbuf buf = STRBUF_INIT;
-	int nongit;
 	int ret = 1;
 
 	setup_git_directory_gently(&nongit);
```

+Cc brian, as he's the author of [2].

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
[2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] setup: extract function to create the refdb
  2023-12-08 22:54       ` Junio C Hamano
@ 2023-12-11 11:34         ` Patrick Steinhardt
  0 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-11 11:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Karthik Nayak, git

[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]

On Sat, Dec 09, 2023 at 07:54:52AM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
> >> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
> >> > +static void create_reference_database(const char *initial_branch, int quiet)
> >> > +{
> >> > +       struct strbuf err = STRBUF_INIT;
> >> > +       int reinit = is_reinit();
> >> > +
> >> > +       /*
> >> > +        * We need to create a "refs" dir in any case so that older
> >> > +        * versions of git can tell that this is a repository.
> >> > +        */
> >> 
> >> How does this work though, even if an earlier version of git can tell
> >> that this is a repository, it still won't be able to read the reftable
> >> backend. In that sense, what do we achieve here?
> >
> > This is a good question, and there is related ongoing discussion about
> > this topic in the thread starting at [1]. There are a few benefits to
> > letting clients discover such repos even if they don't understand the
> > new reference backend format:
> >
> >   - They know to stop walking up the parent-directory chain. Otherwise a
> >     client might end up detecting a Git repository in the parent dir.
> >
> >   - The user gets a proper error message why the repository cannot be
> >     accessed. Instead of failing to detect the repository altogether we
> >     instead say that we don't understand the "extensions.refFormat"
> >     extension.
> 
> Yup, both are very good reasons.  Would it help to sneak a condensed
> version of it in the in-code comment, perhaps?

Sure, let's do so. I failed to condense this meaningfully, but hope that
the result will be okay regardless of that.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/7] remote-curl: rediscover repository when fetching refs
  2023-12-08 23:09   ` Junio C Hamano
@ 2023-12-11 11:35     ` Patrick Steinhardt
  0 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-11 11:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2173 bytes --]

On Sat, Dec 09, 2023 at 08:09:32AM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > We're about to change git-clone(1) so that we set up the reference
> > database at a later point. This change will cause git-remote-curl(1) to
> > not detect the repository anymore due to "HEAD" not having been created
> > yet at the time it spawns, and thus causes it to error out once it is
> > asked to fetch the references.
> >
> > We can address this issue by trying to re-discover the Git repository in
> > case none was detected at startup time. With this change, the clone will
> > look as following:
> >
> >   1. git-clone(1) sets up the initial repository, excluding the
> >      reference database.
> >
> >   2. git-clone(1) spawns git-remote-curl(1), which will be unable to
> >      detect the repository due to a missing "HEAD".
> >
> >   3. git-clone(1) asks git-remote-curl(1) to list remote references.
> >      This works just fine as this step does not require a local
> >      repository
> >
> >   4. git-clone(1) creates the reference database as it has now learned
> >      about the object format.
> 
> Sorry, but I am not sure I understand this step.  I assume you mean
> by "the object format" what hash function is used to index the
> objects (which can be learned from the remote "origin" in step 2 and
> we can choose to use the one they use), not what ref backend is used
> (which is purely a local matter and we do not need to know what is
> used at the "origin").

Yes, exactly. I'm never quite sure whether I should be saying "hash
function" or "object format". I'll convert the message to say "hash
function" instead to clarify.

> Why do we need to wait initializing ref backend until we learn what
> hash is being in use?

This is because of the reftable backend. With the files backend it never
mattered much, because we do not encode the object format anywhere. But
with the reftable backend we do indeed encode the object format in the
tables' header, so it's important to initialize it with the correct
format right from the start.

I'll amend the commit message.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-11 11:34   ` Patrick Steinhardt
@ 2023-12-11 14:57     ` Junio C Hamano
  2023-12-11 15:32       ` Patrick Steinhardt
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2023-12-11 14:57 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, brian m. carlson

Patrick Steinhardt <ps@pks.im> writes:

>> An existing test
>> 
>>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
>> 
>> passes with vanilla Git 2.43, but with these patches applied, it
>> fails the "7 - empty dumb HTTP" step.
> ...
> Before my refactorings we used to fall back to the local default hash
> format with which we've already initialized the repository, which is
> wrong. Now we use the hash format we detected via the remote, which we
> cannot detect because the remote is empty and does not advertise the
> hash function, so we fall back to SHA1 and thus also do the wrong thing.

Yeah, that is why I did *not* say "the series *breaks* existing
tests".  It triggers a failure, and in this case, a test failure
does not mean the behaviour is broken because there is no correct
answer.  ... oh, wait.  There isn't?

I wonder if the right thing to do is to advertise the hash function
even from an absolutely empty repository.  There are no objects in
such a repository, but it should already know what hash function to
use when it creates its first object (determined at the repository
creation time), so that _could_ be advertised.  

> The only correct thing here would be to use the actual hash function
> that the remote repository uses, but we have no to do so.

We have "no way to do so"?  We have "not done so"?

It is possible for the client side to download the $GIT_DIR/config
file from the remote to learn what value extensions.objectFormat is
in use over there instead, I think, but at the same time, I highly
suspect that dumb HTTP outlived its usefulness to warrant such an
additional investment of engineering resource.

The simplest "fix" might be to leave what happens in this narrow
case (i.e. cloning over dumb HTTP from an empty repository)
undefined by not testing (or not insisting on one particular
outcome), but ...

> +Cc brian, as he's the author of [2].

... of course I trust Brian more than I trust myself in this area ;-)

> Patrick
>
> [1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
> [2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-11 14:57     ` Junio C Hamano
@ 2023-12-11 15:32       ` Patrick Steinhardt
  2023-12-11 22:17         ` brian m. carlson
  0 siblings, 1 reply; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-11 15:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, brian m. carlson

[-- Attachment #1: Type: text/plain, Size: 4580 bytes --]

On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> >> An existing test
> >> 
> >>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> >> 
> >> passes with vanilla Git 2.43, but with these patches applied, it
> >> fails the "7 - empty dumb HTTP" step.
> > ...
> > Before my refactorings we used to fall back to the local default hash
> > format with which we've already initialized the repository, which is
> > wrong. Now we use the hash format we detected via the remote, which we
> > cannot detect because the remote is empty and does not advertise the
> > hash function, so we fall back to SHA1 and thus also do the wrong thing.
> 
> Yeah, that is why I did *not* say "the series *breaks* existing
> tests".  It triggers a failure, and in this case, a test failure
> does not mean the behaviour is broken because there is no correct
> answer.  ... oh, wait.  There isn't?
> 
> I wonder if the right thing to do is to advertise the hash function
> even from an absolutely empty repository.  There are no objects in
> such a repository, but it should already know what hash function to
> use when it creates its first object (determined at the repository
> creation time), so that _could_ be advertised.  

For the smart HTTP and SSH protocols definitely, and we already do. But
it's a different story for dumb HTTP, unfortunately, where there is no
CGI-like thing sitting between the client and the repository's data.

> > The only correct thing here would be to use the actual hash function
> > that the remote repository uses, but we have no to do so.
> 
> We have "no way to do so"?  We have "not done so"?

We have not done so until now, and we have no easy way to change this on
the server-side as the server is not controlled by us in the first
place. That leaves two options I can think of:

  - Try harder on the client-side, e.g. by trying to download the
    gitconfig as you propose further down. I wonder whether admins would
    typically block access to the config, but doubt they do.

  - Change the format of `info/refs` to include the hash format, as this
    file _is_ controlled by us on the server-side. Doesn't help though
    in an empty repository, where the file is likely to never have been
    generated in the first place.

So it seems like downloading the gitconfig is the only viable option
that I can think of right now.

It does increase the potential attack surface though because we would
start to unconditionally parse a config file from an untrusted source,
and we did hit issues in our config parser in the past already. You
could argue that we already parse untrusted configs via `.gitmodules`,
but these require opt-in to actually be used by anything if I'm not
mistaken.

So... I dunno.

> It is possible for the client side to download the $GIT_DIR/config
> file from the remote to learn what value extensions.objectFormat is
> in use over there instead, I think, but at the same time, I highly
> suspect that dumb HTTP outlived its usefulness to warrant such an
> additional investment of engineering resource.

Fair enough. All of this feels like an edge case (admin that uses dumb
HTTP) in an edge case (the cloned repository uses SHA256) in an edge
case (the remote repository is empty). Sure, SHA256 is likely to gain in
popularity eventually. But at the same time I'd expect that dumb HTTP
will become increasingly rare.

Taken together, chances for this to happen should be fairly low.

> The simplest "fix" might be to leave what happens in this narrow
> case (i.e. cloning over dumb HTTP from an empty repository)
> undefined by not testing (or not insisting on one particular
> outcome), but ...

I would be fine with that outcome, as well. It's not like the current
behaviour is correct in all cases either. The only benefit of that
behaviour is that a user can in fact work around the broken cases by
setting `GIT_HASH_DEFAULT` to the correct hash, and that benefit would
be retained by the diff I sent that made remote-curl.c aware of this
environment variable.

One additional solution would be to print a user-visible warning a la
"warning: failed to detect hash function of empty remote repository" and
then call it a day, potentially pointing out that a user can correct it
by re-cloning with `GIT_HASH_DEFAULT`. But the warning may not be
actionable by the user, because they may not know what hash function the
remote uses, either.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] setup: extract function to create the refdb
  2023-12-07  7:22     ` Patrick Steinhardt
  2023-12-08 22:54       ` Junio C Hamano
@ 2023-12-11 15:50       ` Karthik Nayak
  1 sibling, 0 replies; 33+ messages in thread
From: Karthik Nayak @ 2023-12-11 15:50 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> On Wed, Dec 06, 2023 at 10:10:37PM +0100, Karthik Nayak wrote:
>> On Wed, Dec 6, 2023 at 1:40 PM Patrick Steinhardt <ps@pks.im> wrote:
>> > +       /*
>> > +        * We need to create a "refs" dir in any case so that older
>> > +        * versions of git can tell that this is a repository.
>> > +        */
>>
>> How does this work though, even if an earlier version of git can tell
>> that this is a repository, it still won't be able to read the reftable
>> backend. In that sense, what do we achieve here?
>
> This is a good question, and there is related ongoing discussion about
> this topic in the thread starting at [1]. There are a few benefits to
> letting clients discover such repos even if they don't understand the
> new reference backend format:
>
>   - They know to stop walking up the parent-directory chain. Otherwise a
>     client might end up detecting a Git repository in the parent dir.
>
>   - The user gets a proper error message why the repository cannot be
>     accessed. Instead of failing to detect the repository altogether we
>     instead say that we don't understand the "extensions.refFormat"
>     extension.
>
> Maybe there are other cases I can't think of right now.
> [1]: <ZWcOvjGPVS_CMUAk@tanuki>

Thank Patrick, this does indeed make a lot of sense now. +1 that this
would be super useful as a comment here.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/7] clone: fix init of refdb with wrong object format
  2023-12-11 15:32       ` Patrick Steinhardt
@ 2023-12-11 22:17         ` brian m. carlson
  0 siblings, 0 replies; 33+ messages in thread
From: brian m. carlson @ 2023-12-11 22:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 2025 bytes --]

On 2023-12-11 at 15:32:52, Patrick Steinhardt wrote:
> On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> > We have "no way to do so"?  We have "not done so"?
> 
> We have not done so until now, and we have no easy way to change this on
> the server-side as the server is not controlled by us in the first
> place. That leaves two options I can think of:
> 
>   - Try harder on the client-side, e.g. by trying to download the
>     gitconfig as you propose further down. I wonder whether admins would
>     typically block access to the config, but doubt they do.
> 
>   - Change the format of `info/refs` to include the hash format, as this
>     file _is_ controlled by us on the server-side. Doesn't help though
>     in an empty repository, where the file is likely to never have been
>     generated in the first place.
> 
> So it seems like downloading the gitconfig is the only viable option
> that I can think of right now.

I mean, we can add an `info/capabilities` file with capabilities and
assume the repository is SHA-1 without it.  I'm fine with that approach
as well, and it can be implemented as part of `git update-server-info`
pretty easily.

But yes, absent that approach or parsing the config file, we'll have to
just use the default settings.

> > The simplest "fix" might be to leave what happens in this narrow
> > case (i.e. cloning over dumb HTTP from an empty repository)
> > undefined by not testing (or not insisting on one particular
> > outcome), but ...
> 
> I would be fine with that outcome, as well. It's not like the current
> behaviour is correct in all cases either. The only benefit of that
> behaviour is that a user can in fact work around the broken cases by
> setting `GIT_HASH_DEFAULT` to the correct hash, and that benefit would
> be retained by the diff I sent that made remote-curl.c aware of this
> environment variable.

That would also be fine with me.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 0/7] clone: fix init of refdb with wrong object format
  2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
                   ` (8 preceding siblings ...)
  2023-12-10  3:16 ` Junio C Hamano
@ 2023-12-12  7:00 ` Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 1/7] setup: extract function to create the refdb Patrick Steinhardt
                     ` (7 more replies)
  9 siblings, 8 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 8774 bytes --]

Hi,

this is the second version of my patch series to make git-clone(1)
initialize the reference database with the correct hash format. This is
a preparatory step for the reftable backend, which encodes the hash
format on-disk and thus requires us to get the hash format right at the
time of initialization.

Changes compared to v1:

  - Patch 1: Extend the comment explaining why we always create the
    "refs/" directory.

  - Patch 3: Explain better why the patch is required in the first
    place.

  - Patch 4: Fix a typo.

  - Patch 7: Adapt a failing test to now assert the new behaviour.

Thanks for your reviews!

Patrick

Patrick Steinhardt (7):
  setup: extract function to create the refdb
  setup: allow skipping creation of the refdb
  remote-curl: rediscover repository when fetching refs
  builtin/clone: fix bundle URIs with mismatching object formats
  builtin/clone: set up sparse checkout later
  builtin/clone: skip reading HEAD when retrieving remote
  builtin/clone: create the refdb with the correct object format

 builtin/clone.c             |  65 ++++++++++----------
 remote-curl.c               |   7 ++-
 remote.c                    |  26 ++++----
 remote.h                    |   1 +
 setup.c                     | 114 ++++++++++++++++++++++--------------
 setup.h                     |   6 +-
 t/t5550-http-fetch-dumb.sh  |   4 +-
 t/t5558-clone-bundle-uri.sh |  18 ++++++
 8 files changed, 150 insertions(+), 91 deletions(-)

Range-diff against v1:
1:  b69c57d272 ! 1:  2f34daa082 setup: extract function to create the refdb
    @@ Commit message
         calling `init_db()`. Extract the logic into a standalone function so
         that it becomes easier to do this refactoring.
     
    +    While at it, expand the comment that explains why we always create the
    +    "refs/" directory.
    +
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## setup.c ##
    @@ setup.c: void initialize_repository_version(int hash_algo, int reinit)
     +	int reinit = is_reinit();
     +
     +	/*
    -+	 * We need to create a "refs" dir in any case so that older
    -+	 * versions of git can tell that this is a repository.
    ++	 * We need to create a "refs" dir in any case so that older versions of
    ++	 * Git can tell that this is a repository. This serves two main purposes:
    ++	 *
    ++	 * - Clients will know to stop walking the parent-directory chain when
    ++	 *   detecting the Git repository. Otherwise they may end up detecting
    ++	 *   a Git repository in a parent directory instead.
    ++	 *
    ++	 * - Instead of failing to detect a repository with unknown reference
    ++	 *   format altogether, old clients will print an error saying that
    ++	 *   they do not understand the reference format extension.
     +	 */
     +	safe_create_dir(git_path("refs"), 1);
     +	adjust_shared_perm(git_path("refs"));
2:  090c52423e = 2:  40005ac1f1 setup: allow skipping creation of the refdb
3:  a1b86a0cbb ! 3:  374a1c514b remote-curl: rediscover repository when fetching refs
    @@ Metadata
      ## Commit message ##
         remote-curl: rediscover repository when fetching refs
     
    -    We're about to change git-clone(1) so that we set up the reference
    -    database at a later point. This change will cause git-remote-curl(1) to
    -    not detect the repository anymore due to "HEAD" not having been created
    -    yet at the time it spawns, and thus causes it to error out once it is
    -    asked to fetch the references.
    +    The reftable format encodes the hash function used by the repository
    +    inside of its tables. The reftable backend thus needs to be initialized
    +    with the correct hash function right from the start, or otherwise we may
    +    end up writing tables with the wrong hash function. But git-clone(1)
    +    initializes the reference database before learning about the hash
    +    function used by the remote repository, which has never been a problem
    +    with the reffiles backend.
    +
    +    To fix this, we'll have to change git-clone(1) to be more careful and
    +    only create the reference backend once it learned about the remote hash
    +    function. This creates a problem for git-remote-curl(1), which will then
    +    be spawned at a time where the repository is not yet fully-initialized.
    +    Consequentially, git-remote-curl(1) will fail to detect the repository,
    +    which eventually causes it to error out once it is asked to fetch remote
    +    objects.
     
         We can address this issue by trying to re-discover the Git repository in
         case none was detected at startup time. With this change, the clone will
    @@ Commit message
              repository
     
           4. git-clone(1) creates the reference database as it has now learned
    -         about the object format.
    +         about the hash function.
     
           5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
              The latter notices that it doesn't have a repository available, but
4:  c7a9d6ef74 ! 4:  3bef564b57 builtin/clone: fix bundle URIs with mismatching object formats
    @@ Commit message
         is indeed not necessarily the case for the hash algorithm right now. So
         in practice it is the right thing to detect the remote's object format
         before downloading bundle URIs anyway, and not doing so causes clones
    -    with bundle URIS to fail when the local default object format does not
    +    with bundle URIs to fail when the local default object format does not
         match the remote repository's format.
     
         Unfortunately though, this creates a new issue: downloading bundles may
5:  703ff77eea = 5:  917f15055f builtin/clone: set up sparse checkout later
6:  6c919fb19c = 6:  f3485a2708 builtin/clone: skip reading HEAD when retrieving remote
7:  eb5530e6a8 ! 7:  f062b11550 builtin/clone: create the refdb with the correct object format
    @@ Commit message
         upcoming reftable backend when cloning repositories with the SHA256
         object format.
     
    +    This change breaks a test in "t5550-http-fetch-dumb.sh" when cloning an
    +    empty repository with `GIT_TEST_DEFAULT_HASH=sha256`. The test expects
    +    the resulting hash format of the empty cloned repository to match the
    +    default hash, but now we always end up with a sha1 repository. The
    +    problem is that for dumb HTTP fetches, we have no easy way to figure out
    +    the remote's hash function except for deriving it based on the hash
    +    length of refs in `info/refs`. But as the remote repository is empty we
    +    cannot rely on this detection mechanism.
    +
    +    Before the change in this commit we already initialized the repository
    +    with the default hash function and then left it as-is. With this patch
    +    we always use the hash function detected via the remote, where we fall
    +    back to "sha1" in case we cannot detect it.
    +
    +    Neither the old nor the new behaviour are correct as we second-guess the
    +    remote hash function in both cases. But given that this is a rather
    +    unlikely edge case (we use the dumb HTTP protocol, the remote repository
    +    uses SHA256 and the remote repository is empty), let's simply adapt the
    +    test to assert the new behaviour. If we want to properly address this
    +    edge case in the future we will have to extend the dumb HTTP protocol so
    +    that we can properly detect the hash function for empty repositories.
    +
         Signed-off-by: Patrick Steinhardt <ps@pks.im>
     
      ## builtin/clone.c ##
    @@ setup.h: int init_db(const char *git_dir, const char *real_git_dir,
      
      /*
       * NOTE NOTE NOTE!!
    +
    + ## t/t5550-http-fetch-dumb.sh ##
    +@@ t/t5550-http-fetch-dumb.sh: test_expect_success 'create empty remote repository' '
    + 	setup_post_update_server_info_hook "$HTTPD_DOCUMENT_ROOT_PATH/empty.git"
    + '
    + 
    +-test_expect_success 'empty dumb HTTP repository has default hash algorithm' '
    ++test_expect_success 'empty dumb HTTP repository falls back to SHA1' '
    + 	test_when_finished "rm -fr clone-empty" &&
    + 	git clone $HTTPD_URL/dumb/empty.git clone-empty &&
    + 	git -C clone-empty rev-parse --show-object-format >empty-format &&
    +-	test "$(cat empty-format)" = "$(test_oid algo)"
    ++	test "$(cat empty-format)" = sha1
    + '
    + 
    + setup_askpass_helper

base-commit: 1a87c842ece327d03d08096395969aca5e0a6996
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/7] setup: extract function to create the refdb
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
@ 2023-12-12  7:00   ` Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 2/7] setup: allow skipping creation of " Patrick Steinhardt
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4810 bytes --]

We're about to let callers skip creation of the reference database when
calling `init_db()`. Extract the logic into a standalone function so
that it becomes easier to do this refactoring.

While at it, expand the comment that explains why we always create the
"refs/" directory.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 103 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 65 insertions(+), 38 deletions(-)

diff --git a/setup.c b/setup.c
index fc592dc6dd..865cfe6743 100644
--- a/setup.c
+++ b/setup.c
@@ -1885,6 +1885,68 @@ void initialize_repository_version(int hash_algo, int reinit)
 		git_config_set_gently("extensions.objectformat", NULL);
 }
 
+static int is_reinit(void)
+{
+	struct strbuf buf = STRBUF_INIT;
+	char junk[2];
+	int ret;
+
+	git_path_buf(&buf, "HEAD");
+	ret = !access(buf.buf, R_OK) || readlink(buf.buf, junk, sizeof(junk) - 1) != -1;
+	strbuf_release(&buf);
+	return ret;
+}
+
+static void create_reference_database(const char *initial_branch, int quiet)
+{
+	struct strbuf err = STRBUF_INIT;
+	int reinit = is_reinit();
+
+	/*
+	 * We need to create a "refs" dir in any case so that older versions of
+	 * Git can tell that this is a repository. This serves two main purposes:
+	 *
+	 * - Clients will know to stop walking the parent-directory chain when
+	 *   detecting the Git repository. Otherwise they may end up detecting
+	 *   a Git repository in a parent directory instead.
+	 *
+	 * - Instead of failing to detect a repository with unknown reference
+	 *   format altogether, old clients will print an error saying that
+	 *   they do not understand the reference format extension.
+	 */
+	safe_create_dir(git_path("refs"), 1);
+	adjust_shared_perm(git_path("refs"));
+
+	if (refs_init_db(&err))
+		die("failed to set up refs db: %s", err.buf);
+
+	/*
+	 * Point the HEAD symref to the initial branch with if HEAD does
+	 * not yet exist.
+	 */
+	if (!reinit) {
+		char *ref;
+
+		if (!initial_branch)
+			initial_branch = git_default_branch_name(quiet);
+
+		ref = xstrfmt("refs/heads/%s", initial_branch);
+		if (check_refname_format(ref, 0) < 0)
+			die(_("invalid initial branch name: '%s'"),
+			    initial_branch);
+
+		if (create_symref("HEAD", ref, NULL) < 0)
+			exit(1);
+		free(ref);
+	}
+
+	if (reinit && initial_branch)
+		warning(_("re-init: ignored --initial-branch=%s"),
+			initial_branch);
+
+	strbuf_release(&err);
+}
+
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
 				const char *initial_branch,
@@ -1896,10 +1958,8 @@ static int create_default_files(const char *template_path,
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
 	char *path;
-	char junk[2];
 	int reinit;
 	int filemode;
-	struct strbuf err = STRBUF_INIT;
 	const char *init_template_dir = NULL;
 	const char *work_tree = get_git_work_tree();
 
@@ -1919,6 +1979,8 @@ static int create_default_files(const char *template_path,
 	reset_shared_repository();
 	git_config(git_default_config, NULL);
 
+	reinit = is_reinit();
+
 	/*
 	 * We must make sure command-line options continue to override any
 	 * values we might have just re-read from the config.
@@ -1962,39 +2024,7 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	/*
-	 * We need to create a "refs" dir in any case so that older
-	 * versions of git can tell that this is a repository.
-	 */
-	safe_create_dir(git_path("refs"), 1);
-	adjust_shared_perm(git_path("refs"));
-
-	if (refs_init_db(&err))
-		die("failed to set up refs db: %s", err.buf);
-
-	/*
-	 * Point the HEAD symref to the initial branch with if HEAD does
-	 * not yet exist.
-	 */
-	path = git_path_buf(&buf, "HEAD");
-	reinit = (!access(path, R_OK)
-		  || readlink(path, junk, sizeof(junk)-1) != -1);
-	if (!reinit) {
-		char *ref;
-
-		if (!initial_branch)
-			initial_branch = git_default_branch_name(quiet);
-
-		ref = xstrfmt("refs/heads/%s", initial_branch);
-		if (check_refname_format(ref, 0) < 0)
-			die(_("invalid initial branch name: '%s'"),
-			    initial_branch);
-
-		if (create_symref("HEAD", ref, NULL) < 0)
-			exit(1);
-		free(ref);
-	}
-
+	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2158,9 +2188,6 @@ int init_db(const char *git_dir, const char *real_git_dir,
 				      prev_bare_repository,
 				      init_shared_repository,
 				      flags & INIT_DB_QUIET);
-	if (reinit && initial_branch)
-		warning(_("re-init: ignored --initial-branch=%s"),
-			initial_branch);
 
 	create_object_directory();
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 2/7] setup: allow skipping creation of the refdb
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 1/7] setup: extract function to create the refdb Patrick Steinhardt
@ 2023-12-12  7:00   ` Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2755 bytes --]

Allow callers to skip creation of the reference database via a new flag
`INIT_DB_SKIP_REFDB`, which is required for git-clone(1) so that we can
create it at a later point once the object format has been discovered
from the remote repository.

Note that we also uplift the call to `create_reference_database()` into
`init_db()`, which makes it easier to handle the new flag for us. This
changes the order in which we do initialization so that we now set up
the Git configuration before we create the reference database. In
practice this move should not result in any change in behaviour.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 setup.c | 13 +++++--------
 setup.h |  5 +++--
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/setup.c b/setup.c
index 865cfe6743..d6a1c59b7b 100644
--- a/setup.c
+++ b/setup.c
@@ -1949,11 +1949,9 @@ static void create_reference_database(const char *initial_branch, int quiet)
 
 static int create_default_files(const char *template_path,
 				const char *original_git_dir,
-				const char *initial_branch,
 				const struct repository_format *fmt,
 				int prev_bare_repository,
-				int init_shared_repository,
-				int quiet)
+				int init_shared_repository)
 {
 	struct stat st1;
 	struct strbuf buf = STRBUF_INIT;
@@ -2024,7 +2022,6 @@ static int create_default_files(const char *template_path,
 		adjust_shared_perm(get_git_dir());
 	}
 
-	create_reference_database(initial_branch, quiet);
 	initialize_repository_version(fmt->hash_algo, 0);
 
 	/* Check filemode trustability */
@@ -2184,11 +2181,11 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	validate_hash_algorithm(&repo_fmt, hash);
 
 	reinit = create_default_files(template_dir, original_git_dir,
-				      initial_branch, &repo_fmt,
-				      prev_bare_repository,
-				      init_shared_repository,
-				      flags & INIT_DB_QUIET);
+				      &repo_fmt, prev_bare_repository,
+				      init_shared_repository);
 
+	if (!(flags & INIT_DB_SKIP_REFDB))
+		create_reference_database(initial_branch, flags & INIT_DB_QUIET);
 	create_object_directory();
 
 	if (get_shared_repository()) {
diff --git a/setup.h b/setup.h
index b48cf1c43b..cbf538286b 100644
--- a/setup.h
+++ b/setup.h
@@ -169,8 +169,9 @@ int verify_repository_format(const struct repository_format *format,
  */
 void check_repository_format(struct repository_format *fmt);
 
-#define INIT_DB_QUIET 0x0001
-#define INIT_DB_EXIST_OK 0x0002
+#define INIT_DB_QUIET      (1 << 0)
+#define INIT_DB_EXIST_OK   (1 << 1)
+#define INIT_DB_SKIP_REFDB (1 << 2)
 
 int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *template_dir, int hash_algo,
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 3/7] remote-curl: rediscover repository when fetching refs
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 1/7] setup: extract function to create the refdb Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 2/7] setup: allow skipping creation of " Patrick Steinhardt
@ 2023-12-12  7:00   ` Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2586 bytes --]

The reftable format encodes the hash function used by the repository
inside of its tables. The reftable backend thus needs to be initialized
with the correct hash function right from the start, or otherwise we may
end up writing tables with the wrong hash function. But git-clone(1)
initializes the reference database before learning about the hash
function used by the remote repository, which has never been a problem
with the reffiles backend.

To fix this, we'll have to change git-clone(1) to be more careful and
only create the reference backend once it learned about the remote hash
function. This creates a problem for git-remote-curl(1), which will then
be spawned at a time where the repository is not yet fully-initialized.
Consequentially, git-remote-curl(1) will fail to detect the repository,
which eventually causes it to error out once it is asked to fetch remote
objects.

We can address this issue by trying to re-discover the Git repository in
case none was detected at startup time. With this change, the clone will
look as following:

  1. git-clone(1) sets up the initial repository, excluding the
     reference database.

  2. git-clone(1) spawns git-remote-curl(1), which will be unable to
     detect the repository due to a missing "HEAD".

  3. git-clone(1) asks git-remote-curl(1) to list remote references.
     This works just fine as this step does not require a local
     repository

  4. git-clone(1) creates the reference database as it has now learned
     about the hash function.

  5. git-clone(1) asks git-remote-curl(1) to fetch the remote packfile.
     The latter notices that it doesn't have a repository available, but
     it now knows to try and re-discover it.

If the re-discovery succeeds in the last step we can continue with the
clone.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 remote-curl.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/remote-curl.c b/remote-curl.c
index ef05752ca5..fc29757b65 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1564,8 +1564,11 @@ int cmd_main(int argc, const char **argv)
 		if (buf.len == 0)
 			break;
 		if (starts_with(buf.buf, "fetch ")) {
-			if (nongit)
-				die(_("remote-curl: fetch attempted without a local repo"));
+			if (nongit) {
+				setup_git_directory_gently(&nongit);
+				if (nongit)
+					die(_("remote-curl: fetch attempted without a local repo"));
+			}
 			parse_fetch(&buf);
 
 		} else if (!strcmp(buf.buf, "list") || starts_with(buf.buf, "list ")) {
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 4/7] builtin/clone: fix bundle URIs with mismatching object formats
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2023-12-12  7:00   ` [PATCH v2 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
@ 2023-12-12  7:00   ` Patrick Steinhardt
  2023-12-12  7:00   ` [PATCH v2 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 6668 bytes --]

We create the reference database in git-clone(1) quite early before
connecting to the remote repository. Given that we do not yet know about
the object format that the remote repository uses at that point in time
the consequence is that the refdb may be initialized with the wrong
object format.

This is not a problem in the context of the files backend as we do not
encode the object format anywhere, and furthermore the only reference
that we write between initializing the refdb and learning about the
object format is the "HEAD" symref. It will become a problem though once
we land the reftable backend, which indeed does require to know about
the proper object format at the time of creation. We thus need to
rearrange the logic in git-clone(1) so that we only initialize the refdb
once we have learned about the actual object format.

As a first step, move listing of remote references to happen earlier,
which also allow us to set up the hash algorithm of the repository
earlier now. While we aim to execute this logic as late as possible
until after most of the setup has happened already, detection of the
object format and thus later the setup of the reference database must
happen before any other logic that may spawn Git commands or otherwise
these Git commands may not recognize the repository as such.

The first Git step where we expect the repository to be fully initalized
is when we fetch bundles via bundle URIs. Funny enough, the comments
there also state that "the_repository must match the cloned repo", which
is indeed not necessarily the case for the hash algorithm right now. So
in practice it is the right thing to detect the remote's object format
before downloading bundle URIs anyway, and not doing so causes clones
with bundle URIs to fail when the local default object format does not
match the remote repository's format.

Unfortunately though, this creates a new issue: downloading bundles may
take a long time, so if we list refs beforehand they might've grown
stale meanwhile. It is not clear how to solve this issue except for a
second reference listing though after we have downloaded the bundles,
which may be an expensive thing to do.

Arguably though, it's preferable to have a staleness issue compared to
being unable to clone a repository altogether.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c             | 48 ++++++++++++++++++-------------------
 t/t5558-clone-bundle-uri.sh | 18 ++++++++++++++
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index c6357af949..d188650881 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1266,6 +1266,26 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (transport->smart_options && !deepen && !filter_options.choice)
 		transport->smart_options->check_self_contained_and_connected = 1;
 
+	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
+	refspec_ref_prefixes(&remote->fetch,
+			     &transport_ls_refs_options.ref_prefixes);
+	if (option_branch)
+		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
+				  option_branch);
+	if (!option_no_tags)
+		strvec_push(&transport_ls_refs_options.ref_prefixes,
+			    "refs/tags/");
+
+	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
+
+	/*
+	 * Now that we know what algorithm the remote side is using, let's set
+	 * ours to the same thing.
+	 */
+	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+	initialize_repository_version(hash_algo, 1);
+	repo_set_hash_algo(the_repository, hash_algo);
+
 	/*
 	 * Before fetching from the remote, download and install bundle
 	 * data from the --bundle-uri option.
@@ -1281,24 +1301,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 				bundle_uri);
 		else if (has_heuristic)
 			git_config_set_gently("fetch.bundleuri", bundle_uri);
-	}
-
-	strvec_push(&transport_ls_refs_options.ref_prefixes, "HEAD");
-	refspec_ref_prefixes(&remote->fetch,
-			     &transport_ls_refs_options.ref_prefixes);
-	if (option_branch)
-		expand_ref_prefix(&transport_ls_refs_options.ref_prefixes,
-				  option_branch);
-	if (!option_no_tags)
-		strvec_push(&transport_ls_refs_options.ref_prefixes,
-			    "refs/tags/");
-
-	refs = transport_get_remote_refs(transport, &transport_ls_refs_options);
-
-	if (refs)
-		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
-
-	if (!bundle_uri) {
+	} else {
 		/*
 		* Populate transport->got_remote_bundle_uri and
 		* transport->bundle_uri. We might get nothing.
@@ -1319,13 +1322,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-		/*
-		 * Now that we know what algorithm the remote side is using,
-		 * let's set ours to the same thing.
-		 */
-	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-	initialize_repository_version(hash_algo, 1);
-	repo_set_hash_algo(the_repository, hash_algo);
+	if (refs)
+		mapped_refs = wanted_peer_refs(refs, &remote->fetch);
 
 	if (mapped_refs) {
 		/*
diff --git a/t/t5558-clone-bundle-uri.sh b/t/t5558-clone-bundle-uri.sh
index 996a08e90c..1ca5f745e7 100755
--- a/t/t5558-clone-bundle-uri.sh
+++ b/t/t5558-clone-bundle-uri.sh
@@ -33,6 +33,15 @@ test_expect_success 'clone with path bundle' '
 	test_cmp expect actual
 '
 
+test_expect_success 'clone with path bundle and non-default hash' '
+	test_when_finished "rm -rf clone-path-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="clone-from/B.bundle" \
+		clone-from clone-path-non-default-hash &&
+	git -C clone-path-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone with file:// bundle' '
 	git clone --bundle-uri="file://$(pwd)/clone-from/B.bundle" \
 		clone-from clone-file &&
@@ -284,6 +293,15 @@ test_expect_success 'clone HTTP bundle' '
 	test_config -C clone-http log.excludedecoration refs/bundle/
 '
 
+test_expect_success 'clone HTTP bundle with non-default hash' '
+	test_when_finished "rm -rf clone-http-non-default-hash" &&
+	GIT_DEFAULT_HASH=sha256 git clone --bundle-uri="$HTTPD_URL/B.bundle" \
+		"$HTTPD_URL/smart/fetch.git" clone-http-non-default-hash &&
+	git -C clone-http-non-default-hash rev-parse refs/bundles/topic >actual &&
+	git -C clone-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
 test_expect_success 'clone bundle list (HTTP, no heuristic)' '
 	test_when_finished rm -f trace*.txt &&
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 5/7] builtin/clone: set up sparse checkout later
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2023-12-12  7:00   ` [PATCH v2 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
@ 2023-12-12  7:00   ` Patrick Steinhardt
  2023-12-12  7:01   ` [PATCH v2 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:00 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

When asked to do a sparse checkout, then git-clone(1) will spawn
`git sparse-checkout set` to set up the configuration accordingly. This
requires a proper Git repository or otherwise the command will fail. But
as we are about to move creation of the reference database to a later
point, this prerequisite will not hold anymore.

Move the logic to a later point in time where we know to have created
the reference database already.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index d188650881..9c60923f31 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,9 +1185,6 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	if (option_sparse_checkout && git_sparse_checkout_init(dir))
-		return 1;
-
 	remote = remote_get(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
@@ -1426,6 +1423,9 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		dissociate_from_references();
 	}
 
+	if (option_sparse_checkout && git_sparse_checkout_init(dir))
+		return 1;
+
 	junk_mode = JUNK_LEAVE_REPO;
 	err = checkout(submodule_progress, filter_submodules);
 
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 6/7] builtin/clone: skip reading HEAD when retrieving remote
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
                     ` (4 preceding siblings ...)
  2023-12-12  7:00   ` [PATCH v2 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
@ 2023-12-12  7:01   ` Patrick Steinhardt
  2023-12-12  7:01   ` [PATCH v2 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
  2023-12-12 19:20   ` [PATCH v2 0/7] clone: fix init of refdb with wrong " Junio C Hamano
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:01 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 5199 bytes --]

After we have set up the remote configuration in git-clone(1) we'll call
`remote_get()` to read the remote from the on-disk configuration. But
next to reading the on-disk configuration, `remote_get()` will also
cause us to try and read the repository's HEAD reference so that we can
figure out the current branch. Besides being pointless in git-clone(1)
because we're operating in an empty repository anyway, this will also
break once we move creation of the reference database to a later point
in time.

Refactor the code to introduce a new `remote_get_early()` function that
will skip reading the HEAD reference to address this issue.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c |  2 +-
 remote.c        | 26 ++++++++++++++++----------
 remote.h        |  1 +
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 9c60923f31..06966c5d4c 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1185,7 +1185,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	if (option_required_reference.nr || option_optional_reference.nr)
 		setup_reference();
 
-	remote = remote_get(remote_name);
+	remote = remote_get_early(remote_name);
 
 	refspec_appendf(&remote->fetch, "+%s*:%s*", src_ref_prefix,
 			branch_top.buf);
diff --git a/remote.c b/remote.c
index abb24822be..051d0a64a0 100644
--- a/remote.c
+++ b/remote.c
@@ -509,7 +509,7 @@ static void alias_all_urls(struct remote_state *remote_state)
 	}
 }
 
-static void read_config(struct repository *repo)
+static void read_config(struct repository *repo, int early)
 {
 	int flag;
 
@@ -518,7 +518,7 @@ static void read_config(struct repository *repo)
 	repo->remote_state->initialized = 1;
 
 	repo->remote_state->current_branch = NULL;
-	if (startup_info->have_repository) {
+	if (startup_info->have_repository && !early) {
 		const char *head_ref = refs_resolve_ref_unsafe(
 			get_main_ref_store(repo), "HEAD", 0, NULL, &flag);
 		if (head_ref && (flag & REF_ISSYMREF) &&
@@ -561,7 +561,7 @@ static const char *remotes_remote_for_branch(struct remote_state *remote_state,
 
 const char *remote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_remote_for_branch(the_repository->remote_state, branch,
@@ -587,7 +587,7 @@ remotes_pushremote_for_branch(struct remote_state *remote_state,
 
 const char *pushremote_for_branch(struct branch *branch, int *explicit)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	return remotes_pushremote_for_branch(the_repository->remote_state,
@@ -599,7 +599,7 @@ static struct remote *remotes_remote_get(struct remote_state *remote_state,
 
 const char *remote_ref_for_branch(struct branch *branch, int for_push)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (branch) {
@@ -709,7 +709,13 @@ remotes_remote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *remote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
+	return remotes_remote_get(the_repository->remote_state, name);
+}
+
+struct remote *remote_get_early(const char *name)
+{
+	read_config(the_repository, 1);
 	return remotes_remote_get(the_repository->remote_state, name);
 }
 
@@ -722,7 +728,7 @@ remotes_pushremote_get(struct remote_state *remote_state, const char *name)
 
 struct remote *pushremote_get(const char *name)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	return remotes_pushremote_get(the_repository->remote_state, name);
 }
 
@@ -738,7 +744,7 @@ int remote_is_configured(struct remote *remote, int in_repo)
 int for_each_remote(each_remote_fn fn, void *priv)
 {
 	int i, result = 0;
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	for (i = 0; i < the_repository->remote_state->remotes_nr && !result;
 	     i++) {
 		struct remote *remote =
@@ -1831,7 +1837,7 @@ struct branch *branch_get(const char *name)
 {
 	struct branch *ret;
 
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	if (!name || !*name || !strcmp(name, "HEAD"))
 		ret = the_repository->remote_state->current_branch;
 	else
@@ -1973,7 +1979,7 @@ static const char *branch_get_push_1(struct remote_state *remote_state,
 
 const char *branch_get_push(struct branch *branch, struct strbuf *err)
 {
-	read_config(the_repository);
+	read_config(the_repository, 0);
 	die_on_missing_branch(the_repository, branch);
 
 	if (!branch)
diff --git a/remote.h b/remote.h
index cdc8b1db42..79353ba226 100644
--- a/remote.h
+++ b/remote.h
@@ -118,6 +118,7 @@ struct remote {
  * and configuration.
  */
 struct remote *remote_get(const char *name);
+struct remote *remote_get_early(const char *name);
 
 struct remote *pushremote_get(const char *name);
 int remote_is_configured(struct remote *remote, int in_repo);
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 7/7] builtin/clone: create the refdb with the correct object format
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
                     ` (5 preceding siblings ...)
  2023-12-12  7:01   ` [PATCH v2 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
@ 2023-12-12  7:01   ` Patrick Steinhardt
  2023-12-12 19:20   ` [PATCH v2 0/7] clone: fix init of refdb with wrong " Junio C Hamano
  7 siblings, 0 replies; 33+ messages in thread
From: Patrick Steinhardt @ 2023-12-12  7:01 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 5283 bytes --]

We're currently creating the reference database with a potentially
incorrect object format when the remote repository's object format is
different from the local default object format. This works just fine for
now because the files backend never records the object format anywhere.
But this logic will fail with any new reference backend that encodes
this information in some form either on-disk or in-memory.

The preceding commits have reshuffled code in git-clone(1) so that there
is no code path that will access the reference database before we have
detected the remote's object format. With these refactorings we can now
defer initialization of the reference database until after we have
learned the remote's object format and thus initialize it with the
correct format from the get-go.

These refactorings are required to make git-clone(1) work with the
upcoming reftable backend when cloning repositories with the SHA256
object format.

This change breaks a test in "t5550-http-fetch-dumb.sh" when cloning an
empty repository with `GIT_TEST_DEFAULT_HASH=sha256`. The test expects
the resulting hash format of the empty cloned repository to match the
default hash, but now we always end up with a sha1 repository. The
problem is that for dumb HTTP fetches, we have no easy way to figure out
the remote's hash function except for deriving it based on the hash
length of refs in `info/refs`. But as the remote repository is empty we
cannot rely on this detection mechanism.

Before the change in this commit we already initialized the repository
with the default hash function and then left it as-is. With this patch
we always use the hash function detected via the remote, where we fall
back to "sha1" in case we cannot detect it.

Neither the old nor the new behaviour are correct as we second-guess the
remote hash function in both cases. But given that this is a rather
unlikely edge case (we use the dumb HTTP protocol, the remote repository
uses SHA256 and the remote repository is empty), let's simply adapt the
test to assert the new behaviour. If we want to properly address this
edge case in the future we will have to extend the dumb HTTP protocol so
that we can properly detect the hash function for empty repositories.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/clone.c            | 9 ++++++++-
 setup.c                    | 2 +-
 setup.h                    | 1 +
 t/t5550-http-fetch-dumb.sh | 4 ++--
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 06966c5d4c..fd052b8b54 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1097,8 +1097,14 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	/*
+	 * Initialize the repository, but skip initializing the reference
+	 * database. We do not yet know about the object format of the
+	 * repository, and reference backends may persist that information into
+	 * their on-disk data structures.
+	 */
 	init_db(git_dir, real_git_dir, option_template, GIT_HASH_UNKNOWN, NULL,
-		do_not_override_repo_unix_permissions, INIT_DB_QUIET);
+		do_not_override_repo_unix_permissions, INIT_DB_QUIET | INIT_DB_SKIP_REFDB);
 
 	if (real_git_dir) {
 		free((char *)git_dir);
@@ -1282,6 +1288,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
 	initialize_repository_version(hash_algo, 1);
 	repo_set_hash_algo(the_repository, hash_algo);
+	create_reference_database(NULL, 1);
 
 	/*
 	 * Before fetching from the remote, download and install bundle
diff --git a/setup.c b/setup.c
index d6a1c59b7b..155fe13f70 100644
--- a/setup.c
+++ b/setup.c
@@ -1897,7 +1897,7 @@ static int is_reinit(void)
 	return ret;
 }
 
-static void create_reference_database(const char *initial_branch, int quiet)
+void create_reference_database(const char *initial_branch, int quiet)
 {
 	struct strbuf err = STRBUF_INIT;
 	int reinit = is_reinit();
diff --git a/setup.h b/setup.h
index cbf538286b..3f0f17c351 100644
--- a/setup.h
+++ b/setup.h
@@ -178,6 +178,7 @@ int init_db(const char *git_dir, const char *real_git_dir,
 	    const char *initial_branch, int init_shared_repository,
 	    unsigned int flags);
 void initialize_repository_version(int hash_algo, int reinit);
+void create_reference_database(const char *initial_branch, int quiet);
 
 /*
  * NOTE NOTE NOTE!!
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index e444b30bf6..4c3b32785d 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -66,11 +66,11 @@ test_expect_success 'create empty remote repository' '
 	setup_post_update_server_info_hook "$HTTPD_DOCUMENT_ROOT_PATH/empty.git"
 '
 
-test_expect_success 'empty dumb HTTP repository has default hash algorithm' '
+test_expect_success 'empty dumb HTTP repository falls back to SHA1' '
 	test_when_finished "rm -fr clone-empty" &&
 	git clone $HTTPD_URL/dumb/empty.git clone-empty &&
 	git -C clone-empty rev-parse --show-object-format >empty-format &&
-	test "$(cat empty-format)" = "$(test_oid algo)"
+	test "$(cat empty-format)" = sha1
 '
 
 setup_askpass_helper
-- 
2.43.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 0/7] clone: fix init of refdb with wrong object format
  2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
                     ` (6 preceding siblings ...)
  2023-12-12  7:01   ` [PATCH v2 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
@ 2023-12-12 19:20   ` Junio C Hamano
  7 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2023-12-12 19:20 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> this is the second version of my patch series to make git-clone(1)
> initialize the reference database with the correct hash format. This is
> a preparatory step for the reftable backend, which encodes the hash
> format on-disk and thus requires us to get the hash format right at the
> time of initialization.
>
> Changes compared to v1:
>
>   - Patch 1: Extend the comment explaining why we always create the
>     "refs/" directory.
>
>   - Patch 3: Explain better why the patch is required in the first
>     place.
>
>   - Patch 4: Fix a typo.
>
>   - Patch 7: Adapt a failing test to now assert the new behaviour.
>
> Thanks for your reviews!

Everything I see in the range-diff looks very sensible.

Will replace.  Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-12-12 19:20 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-06 12:39 [PATCH 0/7] clone: fix init of refdb with wrong object format Patrick Steinhardt
2023-12-06 12:39 ` [PATCH 1/7] setup: extract function to create the refdb Patrick Steinhardt
2023-12-06 21:10   ` Karthik Nayak
2023-12-07  7:22     ` Patrick Steinhardt
2023-12-08 22:54       ` Junio C Hamano
2023-12-11 11:34         ` Patrick Steinhardt
2023-12-11 15:50       ` Karthik Nayak
2023-12-06 12:39 ` [PATCH 2/7] setup: allow skipping creation of " Patrick Steinhardt
2023-12-06 12:39 ` [PATCH 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
2023-12-08 23:09   ` Junio C Hamano
2023-12-11 11:35     ` Patrick Steinhardt
2023-12-06 12:40 ` [PATCH 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
2023-12-06 21:13   ` Karthik Nayak
2023-12-07  7:22     ` Patrick Steinhardt
2023-12-08 23:11   ` Junio C Hamano
2023-12-06 12:40 ` [PATCH 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
2023-12-06 12:40 ` [PATCH 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
2023-12-06 12:40 ` [PATCH 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
2023-12-06 13:09 ` [PATCH 0/7] clone: fix init of refdb with wrong " Patrick Steinhardt
2023-12-10  3:16 ` Junio C Hamano
2023-12-11 11:34   ` Patrick Steinhardt
2023-12-11 14:57     ` Junio C Hamano
2023-12-11 15:32       ` Patrick Steinhardt
2023-12-11 22:17         ` brian m. carlson
2023-12-12  7:00 ` [PATCH v2 " Patrick Steinhardt
2023-12-12  7:00   ` [PATCH v2 1/7] setup: extract function to create the refdb Patrick Steinhardt
2023-12-12  7:00   ` [PATCH v2 2/7] setup: allow skipping creation of " Patrick Steinhardt
2023-12-12  7:00   ` [PATCH v2 3/7] remote-curl: rediscover repository when fetching refs Patrick Steinhardt
2023-12-12  7:00   ` [PATCH v2 4/7] builtin/clone: fix bundle URIs with mismatching object formats Patrick Steinhardt
2023-12-12  7:00   ` [PATCH v2 5/7] builtin/clone: set up sparse checkout later Patrick Steinhardt
2023-12-12  7:01   ` [PATCH v2 6/7] builtin/clone: skip reading HEAD when retrieving remote Patrick Steinhardt
2023-12-12  7:01   ` [PATCH v2 7/7] builtin/clone: create the refdb with the correct object format Patrick Steinhardt
2023-12-12 19:20   ` [PATCH v2 0/7] clone: fix init of refdb with wrong " Junio C Hamano

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.