All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] rev-list: add support for commits in `--missing`
@ 2023-10-09 10:55 Karthik Nayak
  2023-10-09 10:55 ` [PATCH 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
                   ` (5 more replies)
  0 siblings, 6 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-09 10:55 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  2 +-
 object.h                    |  2 +-
 revision.c                  |  9 +++-
 revision.h                  | 20 ++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 7 files changed, 145 insertions(+), 57 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

-- 
2.41.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/3] revision: rename bit to `do_not_die_on_missing_objects`
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
@ 2023-10-09 10:55 ` Karthik Nayak
  2023-10-09 10:55 ` [PATCH 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-09 10:55 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak

The bit `do_not_die_on_missing_tree` is used in revision.h to ensure the
revision walker does not die when encountering a missing tree. This is
currently exclusively set within `builtin/rev-list.c` to ensure the
`--missing` option works with missing trees.

In the upcoming commits, we will extend `--missing` to also support
missing commits. So let's rename the bit to
`do_not_die_on_missing_objects`, which is object type agnostic and can
be used for both trees/commits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/reflog.c   |  2 +-
 builtin/rev-list.c |  2 +-
 list-objects.c     |  2 +-
 revision.h         | 17 +++++++++--------
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index df63a5892e..9e369a5977 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -298,7 +298,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		struct rev_info revs;
 
 		repo_init_revisions(the_repository, &revs, prefix);
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 		revs.ignore_missing = 1;
 		revs.ignore_missing_links = 1;
 		if (verbose)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ff715d6918..ea77489c38 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -561,7 +561,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	}
 
 	if (arg_missing_action)
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 
 	argc = setup_revisions(argc, argv, &revs, &s_r_opt);
 
diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..47296dff2f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -177,7 +177,7 @@ static void process_tree(struct traversal_context *ctx,
 		    is_promisor_object(&obj->oid))
 			return;
 
-		if (!revs->do_not_die_on_missing_tree)
+		if (!revs->do_not_die_on_missing_objects)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/revision.h b/revision.h
index 50091bbd13..c73c92ef40 100644
--- a/revision.h
+++ b/revision.h
@@ -212,18 +212,19 @@ struct rev_info {
 
 			/*
 			 * Blobs are shown without regard for their existence.
-			 * But not so for trees: unless exclude_promisor_objects
+			 * But not so for trees/commits: unless exclude_promisor_objects
 			 * is set and the tree in question is a promisor object;
 			 * OR ignore_missing_links is set, the revision walker
-			 * dies with a "bad tree object HASH" message when
-			 * encountering a missing tree. For callers that can
-			 * handle missing trees and want them to be filterable
+			 * dies with a "bad <type> object HASH" message when
+			 * encountering a missing object. For callers that can
+			 * handle missing trees/commits and want them to be filterable
 			 * and showable, set this to true. The revision walker
-			 * will filter and show such a missing tree as usual,
-			 * but will not attempt to recurse into this tree
-			 * object.
+			 * will filter and show such a missing object as usual,
+			 * but will not attempt to recurse into this tree/commit
+			 * object. The revision walker will also set the MISSING
+			 * flag for such objects.
 			 */
-			do_not_die_on_missing_tree:1,
+			do_not_die_on_missing_objects:1,
 
 			/* for internal use only */
 			exclude_promisor_objects:1;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 2/3] rev-list: move `show_commit()` to the bottom
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-09 10:55 ` [PATCH 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
@ 2023-10-09 10:55 ` Karthik Nayak
  2023-10-09 10:55 ` [PATCH 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-09 10:55 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak

The `show_commit()` function already depends on `finish_commit()`, and
in the upcoming commit, we'll also add a dependency on
`finish_object__ma()`. Since in C symbols must be declared before
they're used, let's move `show_commit()` below both `finish_commit()`
and `finish_object__ma()`, so the code is cleaner as a whole without the
need for declarations.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ea77489c38..98542e8b3c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -100,7 +100,48 @@ static off_t get_object_disk_usage(struct object *obj)
 	return size;
 }
 
-static void finish_commit(struct commit *commit);
+static inline void finish_object__ma(struct object *obj)
+{
+	/*
+	 * Whether or not we try to dynamically fetch missing objects
+	 * from the server, we currently DO NOT have the object.  We
+	 * can either print, allow (ignore), or conditionally allow
+	 * (ignore) them.
+	 */
+	switch (arg_missing_action) {
+	case MA_ERROR:
+		die("missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	case MA_ALLOW_ANY:
+		return;
+
+	case MA_PRINT:
+		oidset_insert(&missing_objects, &obj->oid);
+		return;
+
+	case MA_ALLOW_PROMISOR:
+		if (is_promisor_object(&obj->oid))
+			return;
+		die("unexpected missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	default:
+		BUG("unhandled missing_action");
+		return;
+	}
+}
+
+static void finish_commit(struct commit *commit)
+{
+	free_commit_list(commit->parents);
+	commit->parents = NULL;
+	free_commit_buffer(the_repository->parsed_objects,
+			   commit);
+}
+
 static void show_commit(struct commit *commit, void *data)
 {
 	struct rev_list_info *info = data;
@@ -219,48 +260,6 @@ static void show_commit(struct commit *commit, void *data)
 	finish_commit(commit);
 }
 
-static void finish_commit(struct commit *commit)
-{
-	free_commit_list(commit->parents);
-	commit->parents = NULL;
-	free_commit_buffer(the_repository->parsed_objects,
-			   commit);
-}
-
-static inline void finish_object__ma(struct object *obj)
-{
-	/*
-	 * Whether or not we try to dynamically fetch missing objects
-	 * from the server, we currently DO NOT have the object.  We
-	 * can either print, allow (ignore), or conditionally allow
-	 * (ignore) them.
-	 */
-	switch (arg_missing_action) {
-	case MA_ERROR:
-		die("missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	case MA_ALLOW_ANY:
-		return;
-
-	case MA_PRINT:
-		oidset_insert(&missing_objects, &obj->oid);
-		return;
-
-	case MA_ALLOW_PROMISOR:
-		if (is_promisor_object(&obj->oid))
-			return;
-		die("unexpected missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	default:
-		BUG("unhandled missing_action");
-		return;
-	}
-}
-
 static int finish_object(struct object *obj, const char *name UNUSED,
 			 void *cb_data)
 {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 3/3] rev-list: add commit object support in `--missing` option
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-09 10:55 ` [PATCH 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
  2023-10-09 10:55 ` [PATCH 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
@ 2023-10-09 10:55 ` Karthik Nayak
  2023-10-09 22:02 ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-09 10:55 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a new `MISSING` flag that the
revision walker sets whenever it encounters a missing tree/commit. The
revision walker will now continue the traversal and call `show_commit()`
even for missing commits. In rev-list we can then check for this flag
and call the existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 object.h                    |  2 +-
 revision.c                  |  9 ++++-
 revision.h                  |  3 ++
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 5 files changed, 91 insertions(+), 3 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..604ae01d0c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    commit->object.flags & MISSING) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/object.h b/object.h
index 114d45954d..12913e6116 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15             22------27
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
diff --git a/revision.c b/revision.c
index e789834dd1..615645d3d1 100644
--- a/revision.c
+++ b/revision.c
@@ -1168,7 +1168,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1177,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				p->object.flags |= MISSING;
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
diff --git a/revision.h b/revision.h
index c73c92ef40..07906bc3bc 100644
--- a/revision.h
+++ b/revision.h
@@ -41,6 +41,9 @@
 /* WARNING: This is also used as REACHABLE in commit-graph.c. */
 #define PULL_MERGE	(1u<<15)
 
+/* WARNING: This is also used as NEEDS_BITMAP in pack-bitmap.c. */
+#define MISSING         (1u<<22)
+
 #define TOPO_WALK_EXPLORED	(1u<<23)
 #define TOPO_WALK_INDEGREE	(1u<<24)
 
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..bbff66e4fc
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Since both the commits have the `1.t` blob, rev-list
+			# will print it since its reachable from either commit. Unless
+			# the blob itself is missing.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+
+test_done
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
                   ` (2 preceding siblings ...)
  2023-10-09 10:55 ` [PATCH 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-09 22:02 ` Junio C Hamano
  2023-10-10  6:19 ` Patrick Steinhardt
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
  5 siblings, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-09 22:02 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git

Karthik Nayak <karthik.188@gmail.com> writes:

> This series is an alternate to the patch series I had posted earlier:
> https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
> In that patch, we introduced an option `--ignore-missing-links` which
> was added to expose the `ignore_missing_links` bit to the user. The
> issue in that patch was that, the option `--ignore-missing-links` didn't
> play well the pre-existing `--missing` option. This series avoids that
> route and just extends the `--missing` option for commits to solve the
> same problem.

Thanks for exploring the problem space in the other direction.

I haven't really thought things through yet, but it looks to me that
the updated behaviour of "--missing" is more in line with what the
option would have wanted to be from day one.

> Karthik Nayak (3):
>   revision: rename bit to `do_not_die_on_missing_objects`
>   rev-list: move `show_commit()` to the bottom
>   rev-list: add commit object support in `--missing` option
>
>  builtin/reflog.c            |  2 +-
>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>  list-objects.c              |  2 +-
>  object.h                    |  2 +-
>  revision.c                  |  9 +++-
>  revision.h                  | 20 ++++----
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>  7 files changed, 145 insertions(+), 57 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
                   ` (3 preceding siblings ...)
  2023-10-09 22:02 ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
@ 2023-10-10  6:19 ` Patrick Steinhardt
  2023-10-10 17:09   ` Junio C Hamano
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
  5 siblings, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-10  6:19 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2054 bytes --]

On Mon, Oct 09, 2023 at 12:55:25PM +0200, Karthik Nayak wrote:
> The `--missing` option in git-rev-list(1) was introduced intitally
> to deal with missing blobs in the context of promissory notes.
> Eventually the option was extended to also support tree objects in
> 7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).
> 
> This patch series extends the `--missing` option to also add support for
> commit objects. We do this by introducing a new flag `MISSING` which is
> added whenever we encounter a missing commit during traversal. Then in
> `builtin/rev-list` we check for this flag and take the appropriate
> action based on the `--missing=*` option used.
> 
> This series is an alternate to the patch series I had posted earlier:
> https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
> In that patch, we introduced an option `--ignore-missing-links` which
> was added to expose the `ignore_missing_links` bit to the user. The
> issue in that patch was that, the option `--ignore-missing-links` didn't
> play well the pre-existing `--missing` option. This series avoids that
> route and just extends the `--missing` option for commits to solve the
> same problem.

I had already reviewed the patches internally at GitLab, so for what
it's worth please feel free to add my Reviewed-by.

Patrick

> Karthik Nayak (3):
>   revision: rename bit to `do_not_die_on_missing_objects`
>   rev-list: move `show_commit()` to the bottom
>   rev-list: add commit object support in `--missing` option
> 
>  builtin/reflog.c            |  2 +-
>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>  list-objects.c              |  2 +-
>  object.h                    |  2 +-
>  revision.c                  |  9 +++-
>  revision.h                  | 20 ++++----
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>  7 files changed, 145 insertions(+), 57 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh
> 
> -- 
> 2.41.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-10  6:19 ` Patrick Steinhardt
@ 2023-10-10 17:09   ` Junio C Hamano
  2023-10-11 10:37     ` Karthik Nayak
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-10 17:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Karthik Nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> I had already reviewed the patches internally at GitLab, so for what
> it's worth please feel free to add my Reviewed-by.

Great.  It seems that 'seen' with this series fails to pass the
tests, though.

https://github.com/git/git/actions/runs/6462854176/job/17545104753

Thanks.

>
> Patrick
>
>> Karthik Nayak (3):
>>   revision: rename bit to `do_not_die_on_missing_objects`
>>   rev-list: move `show_commit()` to the bottom
>>   rev-list: add commit object support in `--missing` option
>> 
>>  builtin/reflog.c            |  2 +-
>>  builtin/rev-list.c          | 93 +++++++++++++++++++------------------
>>  list-objects.c              |  2 +-
>>  object.h                    |  2 +-
>>  revision.c                  |  9 +++-
>>  revision.h                  | 20 ++++----
>>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
>>  7 files changed, 145 insertions(+), 57 deletions(-)
>>  create mode 100755 t/t6022-rev-list-missing.sh
>> 
>> -- 
>> 2.41.0
>> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-10 17:09   ` Junio C Hamano
@ 2023-10-11 10:37     ` Karthik Nayak
  2023-10-11 16:54       ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-11 10:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git

On Tue, Oct 10, 2023 at 7:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Patrick Steinhardt <ps@pks.im> writes:
>
> > I had already reviewed the patches internally at GitLab, so for what
> > it's worth please feel free to add my Reviewed-by.
>
> Great.  It seems that 'seen' with this series fails to pass the
> tests, though.
>
> https://github.com/git/git/actions/runs/6462854176/job/17545104753

Seems like this is because of commit-graph being enabled, I think the
best thing to do here
would be to disable the commit graph of these tests.

This should do:

    diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
    index bbff66e4fc..39a8402682 100755
    --- a/t/t6022-rev-list-missing.sh
    +++ b/t/t6022-rev-list-missing.sh
    @@ -5,6 +5,11 @@ test_description='handling of missing objects in rev-list'
     TEST_PASSES_SANITIZE_LEAK=true
     . ./test-lib.sh

    +# Disable writing the commit graph as the tests depend on making particular
    +# commits hidden, the graph is created before that and rev-list
would default
    +# to using the commit graph in such instances.
    +GIT_TEST_COMMIT_GRAPH=0
    +
     # We setup the repository with two commits, this way HEAD is always
     # available and we can hide commit 1.
     test_expect_success 'create repository and alternate directory' '

Thanks for reporting. I'll wait for a day or two (for reviews) and
will add this to the second version
of this series.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-11 10:37     ` Karthik Nayak
@ 2023-10-11 16:54       ` Junio C Hamano
  2023-10-12 10:44         ` Karthik Nayak
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-11 16:54 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: Patrick Steinhardt, git

Karthik Nayak <karthik.188@gmail.com> writes:

> Seems like this is because of commit-graph being enabled, I think
> the best thing to do here would be to disable the commit graph of
> these tests.

If the CI uncovered that the new code is broken and does not work
with commit-graph, wouldn't the above a totally wrong approach to
correct it?  If the updated logic cannot work correctly when
commit-graph covers the history you intentionally break, shouldn't
the code, when the feature that is incompatible with commit-graph is
triggered, disable the commit-graph?  I am assuming that the new
feature is meant to be used to recover from a corrupt repository,
and if it does not work well when commit-graph knows (now stale
after repository corruption) more about the objects that are corrupt
in the object store, we do want to disable commit-graph.  After all,
commit-graph is a secondary information that is supposed to be
recoverable from the primary data that is what is in the object
store.  Disabling commit-graph in the test means you are telling the
end-users "do not use commit-graph if you want to use this feature",
which sounds like a wrong thing to do.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-11 16:54       ` Junio C Hamano
@ 2023-10-12 10:44         ` Karthik Nayak
  2023-10-12 11:04           ` Patrick Steinhardt
  2023-10-12 16:26           ` Junio C Hamano
  0 siblings, 2 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-12 10:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git

On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Karthik Nayak <karthik.188@gmail.com> writes:
>
> > Seems like this is because of commit-graph being enabled, I think
> > the best thing to do here would be to disable the commit graph of
> > these tests.
>
> If the CI uncovered that the new code is broken and does not work
> with commit-graph, wouldn't the above a totally wrong approach to
> correct it?  If the updated logic cannot work correctly when
> commit-graph covers the history you intentionally break, shouldn't
> the code, when the feature that is incompatible with commit-graph is
> triggered, disable the commit-graph?  I am assuming that the new
> feature is meant to be used to recover from a corrupt repository,
> and if it does not work well when commit-graph knows (now stale
> after repository corruption) more about the objects that are corrupt
> in the object store, we do want to disable commit-graph.  After all,
> commit-graph is a secondary information that is supposed to be
> recoverable from the primary data that is what is in the object
> store.  Disabling commit-graph in the test means you are telling the
> end-users "do not use commit-graph if you want to use this feature",
> which sounds like a wrong thing to do.

I agree with what you're saying. Disabling writing the commit-graph for
only the test doesn't serve the real usage. To ensure that the feature should
work with corrupt repositories with stale commit-graph, I'm thinking of
disabling the commit-graph whenever the `--missing` option is used. The
commit graph already provides an API for this, so it should be simple to do.

Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-12 10:44         ` Karthik Nayak
@ 2023-10-12 11:04           ` Patrick Steinhardt
  2023-10-12 13:23             ` Karthik Nayak
  2023-10-12 16:17             ` Junio C Hamano
  2023-10-12 16:26           ` Junio C Hamano
  1 sibling, 2 replies; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-12 11:04 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 2669 bytes --]

On Thu, Oct 12, 2023 at 12:44:27PM +0200, Karthik Nayak wrote:
> On Wed, Oct 11, 2023 at 6:54 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Karthik Nayak <karthik.188@gmail.com> writes:
> >
> > > Seems like this is because of commit-graph being enabled, I think
> > > the best thing to do here would be to disable the commit graph of
> > > these tests.
> >
> > If the CI uncovered that the new code is broken and does not work
> > with commit-graph, wouldn't the above a totally wrong approach to
> > correct it?  If the updated logic cannot work correctly when
> > commit-graph covers the history you intentionally break, shouldn't
> > the code, when the feature that is incompatible with commit-graph is
> > triggered, disable the commit-graph?  I am assuming that the new
> > feature is meant to be used to recover from a corrupt repository,
> > and if it does not work well when commit-graph knows (now stale
> > after repository corruption) more about the objects that are corrupt
> > in the object store, we do want to disable commit-graph.  After all,
> > commit-graph is a secondary information that is supposed to be
> > recoverable from the primary data that is what is in the object
> > store.  Disabling commit-graph in the test means you are telling the
> > end-users "do not use commit-graph if you want to use this feature",
> > which sounds like a wrong thing to do.
> 
> I agree with what you're saying. Disabling writing the commit-graph for
> only the test doesn't serve the real usage. To ensure that the feature should
> work with corrupt repositories with stale commit-graph, I'm thinking of
> disabling the commit-graph whenever the `--missing` option is used. The
> commit graph already provides an API for this, so it should be simple to do.
> 
> Thanks!

Wouldn't this have the potential to significantly regress performance
for all those preexisting users of the `--missing` option? The commit
graph is quite an important optimization nowadays, and especially in
commands where we potentially walk a lot of commits (like we may do
here) it can result in queries that are orders of magnitudes faster.

If we were introducing a new option then I think this would be an
acceptable tradeoff as we could still fix the performance issue in
another iteration. But we don't and instead aim to change the meaning of
`--missing` which may already have users out there. Seen in that light
it does not seem sensible to me to just disable the graph for them.

Did you figure out what exactly the root cause of this functional
regression is? If so, can we address that root cause instead?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-12 11:04           ` Patrick Steinhardt
@ 2023-10-12 13:23             ` Karthik Nayak
  2023-10-12 16:17             ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-12 13:23 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Junio C Hamano, git

On Thu, Oct 12, 2023 at 1:04 PM Patrick Steinhardt <ps@pks.im> wrote:
> Wouldn't this have the potential to significantly regress performance
> for all those preexisting users of the `--missing` option? The commit
> graph is quite an important optimization nowadays, and especially in
> commands where we potentially walk a lot of commits (like we may do
> here) it can result in queries that are orders of magnitudes faster.
>
> If we were introducing a new option then I think this would be an
> acceptable tradeoff as we could still fix the performance issue in
> another iteration. But we don't and instead aim to change the meaning of
> `--missing` which may already have users out there. Seen in that light
> it does not seem sensible to me to just disable the graph for them.
>

Agreed, that there is a rather huge performance drop doing this. So either:
1. Add commit support for `--missing` and disable commit-graph. There is huge
performance drop here for commits.
2. Add commit support for `--missing` but disabling commit-graph is provided via
an additional `--disable-commit-graph` option. This expects the user to manually
disable commit-graph for `--missing` to work correctly with commits.

I was also thinking about this, but people who are using `--missing`
till date, also use
it with `--objects`. Using `--objects` is already slow and doesn't
benefit from the
commit-graph. So I don't know if there is much of a performance hit.

Some benchmarks on the Git repo:

With commit graphs:

$ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null
  Time (mean ± σ):      1.336 s ±  0.013 s    [User: 1.278 s, System: 0.054 s]
  Range (min … max):    1.323 s …  1.365 s    10 runs

$ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --missing=print HEAD > /dev/null
  Time (mean ± σ):      29.6 ms ±   1.5 ms    [User: 20.2 ms, System: 9.1 ms]
  Range (min … max):    27.5 ms …  35.1 ms    92 runs

Without commit graphs:

$ hyperfine --warmup 2 "git rev-list --objects --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --objects --missing=print HEAD > /dev/null
  Time (mean ± σ):      1.735 s ±  0.022 s    [User: 1.672 s, System: 0.057 s]
  Range (min … max):    1.703 s …  1.765 s    10 runs

$ hyperfine --warmup 2 "git rev-list --missing=print HEAD > /dev/null"
Benchmark 1: git rev-list --missing=print HEAD > /dev/null
  Time (mean ± σ):     426.6 ms ±  10.2 ms    [User: 410.1 ms, System: 15.1 ms]
  Range (min … max):   414.9 ms … 441.3 ms    10 runs

This shows that while there is definitely a performance loss when
disabling commit-graphs.
This loss is much lesser when used alongside the `--objects` flag.

Overall, I lean towards the option "1", but okay with either.

> Did you figure out what exactly the root cause of this functional
> regression is? If so, can we address that root cause instead?

Are you referring to `--missing` not working with commits? Looking at
the history
of `--missing` shows:

- caf3827e2f: This commits introduces the `--missing` option, it
basically states that
for the upcoming partial clone mechanism, we want to add support for identifying
missing objects.
- 7c0fe330d5: This commit adds support for trees in `--missing`. It
also goes on to
state that `--missing` assumed only blobs could be missing.

Both of these show that `--missing` started off support for only blobs
and eventually
added support to trees (similar to how we're adding support for
commits here). So
I guess the intention was to never work with commits. But the documentation and
the option seem generic enough. Considering that trees was an extension, commits
should be treated the same way.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-12 11:04           ` Patrick Steinhardt
  2023-10-12 13:23             ` Karthik Nayak
@ 2023-10-12 16:17             ` Junio C Hamano
  2023-10-13  5:53               ` Patrick Steinhardt
  1 sibling, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-12 16:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Karthik Nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> Wouldn't this have the potential to significantly regress performance
> for all those preexisting users of the `--missing` option? The commit
> graph is quite an important optimization nowadays, and especially in
> commands where we potentially walk a lot of commits (like we may do
> here) it can result in queries that are orders of magnitudes faster.

The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates
the commit-graph every time a commit is made via "git commit" or
"git merge".

I'd suggest stepping back and think a bit.

My assumption has been that the failing test emulates this scenario
that can happen in real life:

 * The user creates a new commit.

 * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH
   that is not realistic, but as part of "maintenance").

 * The repository loses some objects due to corruption.

 * Now, "--missing=print" is invoked so that the user can view what
   are missing.  Or "--missing=allow-primisor" to ensure that the
   repository does not have missing objects other than the ones that
   the promisor will give us if we asked again.

 * But because the connectivity of these objects appear in the
   commit graph file, we fail to notice that these objects are
   missing, producing wrong results.  If we disabled commit-graph
   while traversal (an earlier writing of it was perfectly OK), then
   "rev-list --missing" would have noticed and reported what the
   user wanted to know.

In other words, the "optimization" you value is working to quickly
produce a wrong result.  Is it "significantly regress"ing if we
disabled it to obtain the correct result?

My assumption also has been that there is no point in running
"rev-list --missing" if we know there is no repository corruption,
and those who run "rev-list --missing" wants to know if the objects
are really available, i.e. even if commit-graph that is out of sync
with reality says it exists, if it is not in the object store, they
would want to know that.

If you can show me that it is not the case, then I may be pursuaded
why producing a result that is out of sync with reality _quickly_,
instead of taking time to produce a result that matches reality, is
a worthy "optimization" to keep.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-12 10:44         ` Karthik Nayak
  2023-10-12 11:04           ` Patrick Steinhardt
@ 2023-10-12 16:26           ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-12 16:26 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: Patrick Steinhardt, git

Karthik Nayak <karthik.188@gmail.com> writes:

> ... To ensure that the feature should
> work with corrupt repositories with stale commit-graph, I'm thinking of
> disabling the commit-graph whenever the `--missing` option is used.

Just to avoid giving anybody a wrong guideline, for most of the
other features, we should assume that the commit-graph accurately
represents the reality in the object database and take advantage of
it, leaving it to "git fsck" to ensure that the commit-graph is
healthy.

But "rev-list --missing", especially when used with actions other
than allow-promisor or allow-any, is a feature that is only useful
when one wants to really notice objects that are available in the
object store, and it would be a bug for it to pretend as if objects
the commit-graph has heard of exists without checking.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-12 16:17             ` Junio C Hamano
@ 2023-10-13  5:53               ` Patrick Steinhardt
  2023-10-13  8:38                 ` Patrick Steinhardt
  0 siblings, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-13  5:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Karthik Nayak, git

[-- Attachment #1: Type: text/plain, Size: 3888 bytes --]

On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Wouldn't this have the potential to significantly regress performance
> > for all those preexisting users of the `--missing` option? The commit
> > graph is quite an important optimization nowadays, and especially in
> > commands where we potentially walk a lot of commits (like we may do
> > here) it can result in queries that are orders of magnitudes faster.
> 
> The test fails only when GIT_TEST_COMMIT_GRAPH is on, which updates
> the commit-graph every time a commit is made via "git commit" or
> "git merge".
> 
> I'd suggest stepping back and think a bit.
> 
> My assumption has been that the failing test emulates this scenario
> that can happen in real life:
> 
>  * The user creates a new commit.
> 
>  * A commit graph is written (not as part of GIT_TEST_COMMIT_GRAPH
>    that is not realistic, but as part of "maintenance").
> 
>  * The repository loses some objects due to corruption.
> 
>  * Now, "--missing=print" is invoked so that the user can view what
>    are missing.  Or "--missing=allow-primisor" to ensure that the
>    repository does not have missing objects other than the ones that
>    the promisor will give us if we asked again.
> 
>  * But because the connectivity of these objects appear in the
>    commit graph file, we fail to notice that these objects are
>    missing, producing wrong results.  If we disabled commit-graph
>    while traversal (an earlier writing of it was perfectly OK), then
>    "rev-list --missing" would have noticed and reported what the
>    user wanted to know.
> 
> In other words, the "optimization" you value is working to quickly
> produce a wrong result.  Is it "significantly regress"ing if we
> disabled it to obtain the correct result?

It depends, in my opinion. If:

    - Wrong results caused by the commit graph are only introduced with
      this patch series due to the changed behaviour of `--missing`.

    - We disable commit graphs proactively only because of the changed
      behaviour of `--missing`.

Then yes, it does feel wrong to me to disable commit graphs and regress
performance for usecases that perviously worked both correct and fast.

> My assumption also has been that there is no point in running
> "rev-list --missing" if we know there is no repository corruption,
> and those who run "rev-list --missing" wants to know if the objects
> are really available, i.e. even if commit-graph that is out of sync
> with reality says it exists, if it is not in the object store, they
> would want to know that.
> 
> If you can show me that it is not the case, then I may be pursuaded
> why producing a result that is out of sync with reality _quickly_,
> instead of taking time to produce a result that matches reality, is
> a worthy "optimization" to keep.

Note that I'm not saying that it's fine to return wrong results -- this
is of course a bug that needs to be addressed somehow. After all, things
working correctly should always trump things working fast. But until now
it felt more like we were going into the direction of disabling commit
graphs without checking whether there is an alternative solution that
allows us to get the best of both worlds, correctness and performance.

So what I'm looking for in this thread is a reason why we _can't_ have
that, or at least can't have it without unreasonable amounts of work. We
have helpers like `lookup_commit_in_graph()` that are designed to detect
stale commit graphs by double-checking whether a commit that has been
looked up via the commit graph actually exists in the repository. So I'm
wondering whether this could help us address the issue.

If there is a good reason why all of that is not possible then I'm happy
to carve in.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-13  5:53               ` Patrick Steinhardt
@ 2023-10-13  8:38                 ` Patrick Steinhardt
  2023-10-13 12:37                   ` [PATCH] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
  2023-10-13 17:07                   ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
  0 siblings, 2 replies; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-13  8:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Karthik Nayak, git

[-- Attachment #1: Type: text/plain, Size: 3996 bytes --]

On Fri, Oct 13, 2023 at 07:53:36AM +0200, Patrick Steinhardt wrote:
> On Thu, Oct 12, 2023 at 09:17:09AM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
[snip]
> > My assumption also has been that there is no point in running
> > "rev-list --missing" if we know there is no repository corruption,
> > and those who run "rev-list --missing" wants to know if the objects
> > are really available, i.e. even if commit-graph that is out of sync
> > with reality says it exists, if it is not in the object store, they
> > would want to know that.
> > 
> > If you can show me that it is not the case, then I may be pursuaded
> > why producing a result that is out of sync with reality _quickly_,
> > instead of taking time to produce a result that matches reality, is
> > a worthy "optimization" to keep.
> 
> Note that I'm not saying that it's fine to return wrong results -- this
> is of course a bug that needs to be addressed somehow. After all, things
> working correctly should always trump things working fast. But until now
> it felt more like we were going into the direction of disabling commit
> graphs without checking whether there is an alternative solution that
> allows us to get the best of both worlds, correctness and performance.
> 
> So what I'm looking for in this thread is a reason why we _can't_ have
> that, or at least can't have it without unreasonable amounts of work. We
> have helpers like `lookup_commit_in_graph()` that are designed to detect
> stale commit graphs by double-checking whether a commit that has been
> looked up via the commit graph actually exists in the repository. So I'm
> wondering whether this could help us address the issue.
> 
> If there is a good reason why all of that is not possible then I'm happy
> to carve in.

I've had a quick look at this problem so that I can solidify my own
train of thought a bit. The issue is `repo_parse_commit_internal()`,
which calls `parse_commit_in_graph()` without verifying that the object
actually exists in the object database. It's the only callsite of that
function outside of "commit-graph.c", as all other external callers
would call `lookup_commit_in_graph()` which _does_ perform the object
existence check.

So I think that the proper way to address the regression would be a
patch similar to the following:

diff --git a/commit.c b/commit.c
index b3223478bc..109e9217e3 100644
--- a/commit.c
+++ b/commit.c
@@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item)) {
+		if (!has_object(r, &item->object.oid, 0))
+			return quiet_on_missing ? -1 :
+				error(_("commit %s exists in commit-graph but not in the object database"),
+				      oid_to_hex(&item->object.oid));
 		return 0;
+	}
 
 	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
 		return quiet_on_missing ? -1 :

I wouldn't be surprised if there are other edge cases where this can
lead to buggy behaviour.

Also, this issue may not necessarily stem from repository corruption. It
could for example happen that commits are getting garbage collected
without the commit graph having been updated, whatever the reason may be
for this. In that case we would happily continue to return these commits
from the commit graph even though the underlying object has since been
deleted. The repository itself is not corrupt though, we merely look at
an out-of-date commit graph. And for what it's worth, I think that we
should always gracefully handle that case or otherwise the commit graph
becomes less useful overall.

I didn't dig much deeper yet. And while the above patch fixes some of
the test failures, it doesn't fix them all. If we agree that this is the
way to go then I'd be happy to turn this into a proper patch.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-13  8:38                 ` Patrick Steinhardt
@ 2023-10-13 12:37                   ` Patrick Steinhardt
  2023-10-13 18:21                     ` Junio C Hamano
  2023-10-13 17:07                   ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
  1 sibling, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-13 12:37 UTC (permalink / raw)
  To: git; +Cc: Karthik Nayak, Junio C Hamano, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 4246 bytes --]

Commit graphs can become stale and contain references to commits that do
not exist in the object database anymore. Theoretically, this can lead
to a scenario where we are able to successfully look up any such commit
via the commit graph even though such a lookup would fail if done via
the object database directly.

As the commit graph is mostly intended as a sort of cache to speed up
parsing of commits we do not want to have diverging behaviour in a
repository with and a repository without commit graphs, no matter
whether they are stale or not. As commits are otherwise immutable, the
only thing that we really need to care about is thus the presence or
absence of a commit.

To address potentially stale commit data that may exist in the graph,
our `lookup_commit_in_graph()` function will check for the commit's
existence in both the commit graph, but also in the object database. So
even if we were able to look up the commit's data in the graph, we would
still pretend as if the commit didn't exist if it is missing in the
object database.

We don't have the same safety net in `parse_commit_in_graph_one()`
though. This function is mostly used internally in "commit-graph.c"
itself to validate the commit graph, and this usage is fine. We do
expose its functionality via `parse_commit_in_graph()` though, which
gets called by `repo_parse_commit_internal()`, and that function is in
turn used in many places in our codebase.

For all I can see this function is never used to directly turn an object
ID into a commit object without additional safety checks before or after
this lookup. What it is being used for though is to walk history via the
parent chain of commits. So when commits in the parent chain of a graph
walk are missing it is possible that we wouldn't notice if that missing
commit was part of the commit graph. Thus, a query like `git rev-parse
HEAD~2` can succeed even if the intermittent commit is missing.

It's unclear whether there are additional ways in which such stale
commit graphs can lead to problems. In any case, it feels like this is a
bigger bug waiting to happen when we gain additional direct or indirect
callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
behaviour by checking for object existence via the object database, as
well.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit.c                |  7 ++++++-
 t/t5318-commit-graph.sh | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index b3223478bc..109e9217e3 100644
--- a/commit.c
+++ b/commit.c
@@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item)) {
+		if (!has_object(r, &item->object.oid, 0))
+			return quiet_on_missing ? -1 :
+				error(_("commit %s exists in commit-graph but not in the object database"),
+				      oid_to_hex(&item->object.oid));
 		return 0;
+	}
 
 	if (oid_object_info_extended(r, &item->object.oid, &oi, flags) < 0)
 		return quiet_on_missing ? -1 :
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index ba65f17dd9..25f8e9e2d3 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' '
 	)
 '
 
+test_expect_success 'commit exists in commit-graph but not in object database' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		test_commit A &&
+		test_commit B &&
+		test_commit C &&
+		git commit-graph write --reachable &&
+
+		# Corrupt the repository by deleting the intermittent commit
+		# object. Commands should notice that this object is absent and
+		# thus that the repository is corrupt even if the commit graph
+		# exists.
+		oid=$(git rev-parse B) &&
+		rm .git/objects/"$(test_oid_to_path "$oid")" &&
+
+		test_must_fail git rev-parse HEAD~2 2>error &&
+		grep "error: commit $oid exists in commit-graph but not in the object database" error
+	)
+'
+
 test_done
-- 
2.42.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 0/3] rev-list: add support for commits in `--missing`
  2023-10-13  8:38                 ` Patrick Steinhardt
  2023-10-13 12:37                   ` [PATCH] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
@ 2023-10-13 17:07                   ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-13 17:07 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Karthik Nayak, git

Patrick Steinhardt <ps@pks.im> writes:

> actually exists in the object database. It's the only callsite of that
> function outside of "commit-graph.c", as all other external callers
> would call `lookup_commit_in_graph()` which _does_ perform the object
> existence check.
>
> So I think that the proper way to address the regression would be a
> patch similar to the following:
>
> diff --git a/commit.c b/commit.c
> index b3223478bc..109e9217e3 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		if (!has_object(r, &item->object.oid, 0))
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
>  		return 0;
> +	}

It is (overly) conservative, which I generally should find pleasing,
but as I said, for secondary information like commit-graph that is
derived from the primary source only to precompute for optimization,
our general attitude should be to trust it and let the optimization
kick in, unless the operation being performed primarily cares about
the case where the result from using and not using the secondary
source differs.  An obvious example of such an operation is "fsck",
where we do care and want to notice when the underlying object graph
and what commit-graph records contradict with each other.  And my
suggestion to disable commit-graph while running the "rev-list"
command with the "--missing" option is because that usage would fall
into the same category (please correct me if that is not the case) [*].

So for the purpose of "rev-list --missing", the above change does
take us closer to the answer we would get from the primary source,
but isn't it pessimising other users unnecessarily?  Do we have a
rough idea (if we have a benchmark that would be even better) how
much this lack of has_object() check is contributing to the
performance benefit of using commit-graph?  Majority of callers of
this code should not have to pay the additional overhead here, so
unless it is small enough, this would be pessimising the generic
code for everybody to please "rev-list --missing", which is why I am
worried about going in this direction.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-13 12:37                   ` [PATCH] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
@ 2023-10-13 18:21                     ` Junio C Hamano
  2023-10-17  6:37                       ` Patrick Steinhardt
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-13 18:21 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(r, item))
> +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> +		if (!has_object(r, &item->object.oid, 0))
> +			return quiet_on_missing ? -1 :
> +				error(_("commit %s exists in commit-graph but not in the object database"),
> +				      oid_to_hex(&item->object.oid));
>  		return 0;
> +	}

Ever since this codepath was introduced by 177722b3 (commit:
integrate commit graph with commit parsing, 2018-04-10), we blindly
trusted what commit-graph file says.  This change is a strict
improvement in the correctness department, but there are two things
that are a bit worrying.

One.  The additional check should certainly be cheaper than a full
reading and parsing of an object, either from a loose object file or
from a pack entry.  It may not hurt performance too much, but it
still would give us more confidence if we know by how much we are
pessimising good cases where the commit-graph does match reality.
Our stance on these secondary files that store precomputed values
for optimization purposes is in general to use them blindly unless
in exceptional cases where the operation values the correctness even
when the validity of these secondary files is dubious (e.g., "fsck"),
and doing this extra check regardless of the caller at this low level
of the callchain is a bit worrying.

Another is that by the time parse_commit_in_graph() returns true and
we realize that the answer we got is bogus by asking has_object(),
item->object.parsed has already been toggled on, so the caller now
has a commit object that claimed it was already parsed and does not
match reality.  Hopefully the caller takes an early exit upon seeing
a failure from parse_commit_gently() and the .parsed bit does not
matter, but maybe I am missing a case where it does.  I dunno.

Other than that, sounds very sensible and the code change is clean.

Will queue.  Thanks.

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index ba65f17dd9..25f8e9e2d3 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' '
>  	)
>  '
>  
> +test_expect_success 'commit exists in commit-graph but not in object database' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit A &&
> +		test_commit B &&
> +		test_commit C &&
> +		git commit-graph write --reachable &&
> +
> +		# Corrupt the repository by deleting the intermittent commit
> +		# object. Commands should notice that this object is absent and
> +		# thus that the repository is corrupt even if the commit graph
> +		# exists.
> +		oid=$(git rev-parse B) &&
> +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> +
> +		test_must_fail git rev-parse HEAD~2 2>error &&
> +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> +	)
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 0/3] rev-list: add support for commits in `--missing`
  2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
                   ` (4 preceding siblings ...)
  2023-10-10  6:19 ` Patrick Steinhardt
@ 2023-10-16 10:38 ` Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
                     ` (4 more replies)
  5 siblings, 5 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-16 10:38 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

Changes from v1:
1. The patch series is now rebased on top of Patrick's work to make
commit-graphs more reliable. With this, we need to be more diligant
around which commits are missing and only parse required commits. So
we add some extra checks here, especially because we don't want to
allow missing commit's tree and parent to be parsed.
2. The tests were preivously testing a commit with no parents, add
an additional commit to ensure that the missing commit's parents aren't
parsed.

Range diff:

1:  be2ac0a331 = 1:  c1c892aa86 revision: rename bit to `do_not_die_on_missing_objects`
2:  b44983967f = 2:  67f6bfeaf7 rev-list: move `show_commit()` to the bottom
3:  306d918aef ! 3:  6d8e3c721f rev-list: add commit object support in `--missing` option
    @@ builtin/rev-list.c: static void show_commit(struct commit *commit, void *data)
      		total_disk_usage += get_object_disk_usage(&commit->object);
      
     
    + ## list-objects.c ##
    +@@ list-objects.c: static void do_traverse(struct traversal_context *ctx)
    + 		 * an uninteresting boundary commit may not have its tree
    + 		 * parsed yet, but we are not going to show them anyway
    + 		 */
    +-		if (!ctx->revs->tree_objects)
    ++		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
    + 			; /* do not bother loading tree */
    + 		else if (repo_get_commit_tree(the_repository, commit)) {
    + 			struct tree *tree = repo_get_commit_tree(the_repository,
    +
      ## object.h ##
     @@ object.h: void object_array_init(struct object_array *array);
      
    @@ object.h: void object_array_init(struct object_array *array);
       * walker.c:                 0-2
     
      ## revision.c ##
    +@@ revision.c: static int process_parents(struct rev_info *revs, struct commit *commit,
    + 	struct commit_list *parent = commit->parents;
    + 	unsigned pass_flags;
    + 
    +-	if (commit->object.flags & ADDED)
    ++	if (commit->object.flags & ADDED || commit->object.flags & MISSING)
    + 		return 0;
    + 	commit->object.flags |= ADDED;
    + 
     @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *commit,
      	for (parent = commit->parents; parent; parent = parent->next) {
      		struct commit *p = parent->item;
    @@ t/t6022-rev-list-missing.sh (new)
     +# available and we can hide commit 1.
     +test_expect_success 'create repository and alternate directory' '
     +	test_commit 1 &&
    -+	test_commit 2
    ++	test_commit 2 &&
    ++	test_commit 3
     +'
     +
     +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
    @@ t/t6022-rev-list-missing.sh (new)
     +			git rev-list --objects --no-object-names \
     +				HEAD ^$obj >expect.raw &&
     +
    -+			# Since both the commits have the `1.t` blob, rev-list
    -+			# will print it since its reachable from either commit. Unless
    -+			# the blob itself is missing.
    ++			# Blobs are shared by all commits, so evethough a commit/tree
    ++			# might be skipped, its blob must be accounted for.
     +			if [ $obj != "HEAD:1.t" ]; then
    -+				echo $(git rev-parse HEAD:1.t) >>expect.raw
    ++				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
    ++				echo $(git rev-parse HEAD:2.t) >>expect.raw
     +			fi &&
     +
     +			mv "$path" "$path.hidden" &&
    @@ t/t6022-rev-list-missing.sh (new)
     +	done
     +done
     +
    -+
     +test_done


Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  4 +-
 object.h                    |  2 +-
 revision.c                  | 11 +++--
 revision.h                  | 20 ++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 7 files changed, 147 insertions(+), 59 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

-- 
2.41.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 1/3] revision: rename bit to `do_not_die_on_missing_objects`
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
@ 2023-10-16 10:38   ` Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-16 10:38 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The bit `do_not_die_on_missing_tree` is used in revision.h to ensure the
revision walker does not die when encountering a missing tree. This is
currently exclusively set within `builtin/rev-list.c` to ensure the
`--missing` option works with missing trees.

In the upcoming commits, we will extend `--missing` to also support
missing commits. So let's rename the bit to
`do_not_die_on_missing_objects`, which is object type agnostic and can
be used for both trees/commits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/reflog.c   |  2 +-
 builtin/rev-list.c |  2 +-
 list-objects.c     |  2 +-
 revision.h         | 17 +++++++++--------
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index df63a5892e..9e369a5977 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -298,7 +298,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		struct rev_info revs;
 
 		repo_init_revisions(the_repository, &revs, prefix);
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 		revs.ignore_missing = 1;
 		revs.ignore_missing_links = 1;
 		if (verbose)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ff715d6918..ea77489c38 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -561,7 +561,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	}
 
 	if (arg_missing_action)
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 
 	argc = setup_revisions(argc, argv, &revs, &s_r_opt);
 
diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..47296dff2f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -177,7 +177,7 @@ static void process_tree(struct traversal_context *ctx,
 		    is_promisor_object(&obj->oid))
 			return;
 
-		if (!revs->do_not_die_on_missing_tree)
+		if (!revs->do_not_die_on_missing_objects)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/revision.h b/revision.h
index 50091bbd13..c73c92ef40 100644
--- a/revision.h
+++ b/revision.h
@@ -212,18 +212,19 @@ struct rev_info {
 
 			/*
 			 * Blobs are shown without regard for their existence.
-			 * But not so for trees: unless exclude_promisor_objects
+			 * But not so for trees/commits: unless exclude_promisor_objects
 			 * is set and the tree in question is a promisor object;
 			 * OR ignore_missing_links is set, the revision walker
-			 * dies with a "bad tree object HASH" message when
-			 * encountering a missing tree. For callers that can
-			 * handle missing trees and want them to be filterable
+			 * dies with a "bad <type> object HASH" message when
+			 * encountering a missing object. For callers that can
+			 * handle missing trees/commits and want them to be filterable
 			 * and showable, set this to true. The revision walker
-			 * will filter and show such a missing tree as usual,
-			 * but will not attempt to recurse into this tree
-			 * object.
+			 * will filter and show such a missing object as usual,
+			 * but will not attempt to recurse into this tree/commit
+			 * object. The revision walker will also set the MISSING
+			 * flag for such objects.
 			 */
-			do_not_die_on_missing_tree:1,
+			do_not_die_on_missing_objects:1,
 
 			/* for internal use only */
 			exclude_promisor_objects:1;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 2/3] rev-list: move `show_commit()` to the bottom
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
@ 2023-10-16 10:38   ` Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-16 10:38 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `show_commit()` function already depends on `finish_commit()`, and
in the upcoming commit, we'll also add a dependency on
`finish_object__ma()`. Since in C symbols must be declared before
they're used, let's move `show_commit()` below both `finish_commit()`
and `finish_object__ma()`, so the code is cleaner as a whole without the
need for declarations.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ea77489c38..98542e8b3c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -100,7 +100,48 @@ static off_t get_object_disk_usage(struct object *obj)
 	return size;
 }
 
-static void finish_commit(struct commit *commit);
+static inline void finish_object__ma(struct object *obj)
+{
+	/*
+	 * Whether or not we try to dynamically fetch missing objects
+	 * from the server, we currently DO NOT have the object.  We
+	 * can either print, allow (ignore), or conditionally allow
+	 * (ignore) them.
+	 */
+	switch (arg_missing_action) {
+	case MA_ERROR:
+		die("missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	case MA_ALLOW_ANY:
+		return;
+
+	case MA_PRINT:
+		oidset_insert(&missing_objects, &obj->oid);
+		return;
+
+	case MA_ALLOW_PROMISOR:
+		if (is_promisor_object(&obj->oid))
+			return;
+		die("unexpected missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	default:
+		BUG("unhandled missing_action");
+		return;
+	}
+}
+
+static void finish_commit(struct commit *commit)
+{
+	free_commit_list(commit->parents);
+	commit->parents = NULL;
+	free_commit_buffer(the_repository->parsed_objects,
+			   commit);
+}
+
 static void show_commit(struct commit *commit, void *data)
 {
 	struct rev_list_info *info = data;
@@ -219,48 +260,6 @@ static void show_commit(struct commit *commit, void *data)
 	finish_commit(commit);
 }
 
-static void finish_commit(struct commit *commit)
-{
-	free_commit_list(commit->parents);
-	commit->parents = NULL;
-	free_commit_buffer(the_repository->parsed_objects,
-			   commit);
-}
-
-static inline void finish_object__ma(struct object *obj)
-{
-	/*
-	 * Whether or not we try to dynamically fetch missing objects
-	 * from the server, we currently DO NOT have the object.  We
-	 * can either print, allow (ignore), or conditionally allow
-	 * (ignore) them.
-	 */
-	switch (arg_missing_action) {
-	case MA_ERROR:
-		die("missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	case MA_ALLOW_ANY:
-		return;
-
-	case MA_PRINT:
-		oidset_insert(&missing_objects, &obj->oid);
-		return;
-
-	case MA_ALLOW_PROMISOR:
-		if (is_promisor_object(&obj->oid))
-			return;
-		die("unexpected missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	default:
-		BUG("unhandled missing_action");
-		return;
-	}
-}
-
 static int finish_object(struct object *obj, const char *name UNUSED,
 			 void *cb_data)
 {
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v2 3/3] rev-list: add commit object support in `--missing` option
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
  2023-10-16 10:38   ` [PATCH v2 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
@ 2023-10-16 10:38   ` Karthik Nayak
  2023-10-16 16:24   ` [PATCH v2 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
  4 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-16 10:38 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a new `MISSING` flag that the
revision walker sets whenever it encounters a missing tree/commit. The
revision walker will now continue the traversal and call `show_commit()`
even for missing commits. In rev-list we can then check for this flag
and call the existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  2 +-
 object.h                    |  2 +-
 revision.c                  | 11 ++++--
 revision.h                  |  3 ++
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 6 files changed, 93 insertions(+), 5 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..604ae01d0c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    commit->object.flags & MISSING) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..60c5533721 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -387,7 +387,7 @@ static void do_traverse(struct traversal_context *ctx)
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
 		 */
-		if (!ctx->revs->tree_objects)
+		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
 			; /* do not bother loading tree */
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
diff --git a/object.h b/object.h
index 114d45954d..12913e6116 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15             22------27
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
diff --git a/revision.c b/revision.c
index e789834dd1..d147e0f611 100644
--- a/revision.c
+++ b/revision.c
@@ -1110,7 +1110,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	struct commit_list *parent = commit->parents;
 	unsigned pass_flags;
 
-	if (commit->object.flags & ADDED)
+	if (commit->object.flags & ADDED || commit->object.flags & MISSING)
 		return 0;
 	commit->object.flags |= ADDED;
 
@@ -1168,7 +1168,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1177,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				p->object.flags |= MISSING;
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
diff --git a/revision.h b/revision.h
index c73c92ef40..07906bc3bc 100644
--- a/revision.h
+++ b/revision.h
@@ -41,6 +41,9 @@
 /* WARNING: This is also used as REACHABLE in commit-graph.c. */
 #define PULL_MERGE	(1u<<15)
 
+/* WARNING: This is also used as NEEDS_BITMAP in pack-bitmap.c. */
+#define MISSING         (1u<<22)
+
 #define TOPO_WALK_EXPLORED	(1u<<23)
 #define TOPO_WALK_INDEGREE	(1u<<24)
 
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 0/3] rev-list: add support for commits in `--missing`
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
                     ` (2 preceding siblings ...)
  2023-10-16 10:38   ` [PATCH v2 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-16 16:24   ` Junio C Hamano
  2023-10-16 19:01     ` Karthik Nayak
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
  4 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-16 16:24 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Karthik Nayak <karthik.188@gmail.com> writes:

> ... some extra checks here, especially because we don't want to
> allow missing commit's tree and parent to be parsed.

Good changes relative to the previous round.

A bad news is that with this series (especially [PATCH 3/3]) applied
on top of Patrick's "always make sure the commit we found from the
commit-graph at least exists" change, t5310 and t5320 seem to
consistently fail for me.  They pass with the first two steps
applied without [3/3] but that is to be expected as they are both
no-ops.

The change in 3/3 to list-objects.c::do_traverse() seems to be the
culprit.  Reverting that single hunk makes t5310 and t5320 pass
again.  What I find alarming is that two tests that are touched by
this series, t5318 and t6022, do not seem to fail with the hunk
reverted, which means the hunk has no meaningful test coverage for
the purpose of this series.

I didn't even try to understand what is going on, so there may be a
very good reason that the change must be there for do_traverse() to
work correctly.  I dunno.

>     +-	if (commit->object.flags & ADDED)
>     ++	if (commit->object.flags & ADDED || commit->object.flags & MISSING)

This and others are syntactically correct C, but some folks may find
it more readable to see each of the bitwise operations enclosed in a
pair of parentheses when combined by logical operations, i.e.,

	if ((commit->object.flags & ADDED) || (commit->object.flags & MISSING))

In this particular case, we can even say

		if (commit->object.flags & (ADDED | MISSING))

meaning, "if we have either the ADDED or the MISSING bit set", which
may make it even clearer.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 0/3] rev-list: add support for commits in `--missing`
  2023-10-16 16:24   ` [PATCH v2 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
@ 2023-10-16 19:01     ` Karthik Nayak
  2023-10-16 20:33       ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-16 19:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, ps

On Mon, Oct 16, 2023 at 6:24 PM Junio C Hamano <gitster@pobox.com> wrote:
> Good changes relative to the previous round.
>
> A bad news is that with this series (especially [PATCH 3/3]) applied
> on top of Patrick's "always make sure the commit we found from the
> commit-graph at least exists" change, t5310 and t5320 seem to
> consistently fail for me.  They pass with the first two steps
> applied without [3/3] but that is to be expected as they are both
> no-ops.
>
> The change in 3/3 to list-objects.c::do_traverse() seems to be the
> culprit.  Reverting that single hunk makes t5310 and t5320 pass
> again.

It seems that the new chunk introduced now causes collision because of
the bit field of the MISSING flag being the same as the pre-existing
NEEDS_BITMAP flag. Earlier this wasn't a concern, but now, their paths
have converged at this change.

So changing the bit field for the MISSING flag from `22` to an unused `28`
fixes the issue. I should have run the whole test suite again. Apologies.

>  What I find alarming is that two tests that are touched by
> this series, t5318 and t6022, do not seem to fail with the hunk
> reverted, which means the hunk has no meaningful test coverage for
> the purpose of this series.
>

Well, actually the newly introduced tests t6022 require this block,
but this is specific
to when commit graph is enabled. So what you did see was correct, but
the test would
also fail with `GIT_TEST_COMMIT_GRAPH=1` if the block was removed. That's
exactly why in this series I increased the number of commits in the
test block from 2->3.

> >     +-        if (commit->object.flags & ADDED)
> >     ++        if (commit->object.flags & ADDED || commit->object.flags & MISSING)
>
> This and others are syntactically correct C, but some folks may find
> it more readable to see each of the bitwise operations enclosed in a
> pair of parentheses when combined by logical operations, i.e.,
>
>         if ((commit->object.flags & ADDED) || (commit->object.flags & MISSING))
>
> In this particular case, we can even say
>
>                 if (commit->object.flags & (ADDED | MISSING))
>
> meaning, "if we have either the ADDED or the MISSING bit set", which
> may make it even clearer.

I agree with this, I'll add it in the next block.

Thanks for the quick review, I'll wait a day/two and send v3 with the
changes to fix tests
and cleaner code.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 0/3] rev-list: add support for commits in `--missing`
  2023-10-16 19:01     ` Karthik Nayak
@ 2023-10-16 20:33       ` Junio C Hamano
  0 siblings, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-16 20:33 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Karthik Nayak <karthik.188@gmail.com> writes:

> Well, actually the newly introduced tests t6022 require this block,
> but this is specific
> to when commit graph is enabled.

Ah, of course.

> Thanks for the quick review, I'll wait a day/two and send v3 with the
> changes to fix tests
> and cleaner code.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-13 18:21                     ` Junio C Hamano
@ 2023-10-17  6:37                       ` Patrick Steinhardt
  2023-10-17 18:34                         ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-17  6:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Karthik Nayak, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 5942 bytes --]

On Fri, Oct 13, 2023 at 11:21:48AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > @@ -572,8 +572,13 @@ int repo_parse_commit_internal(struct repository *r,
> >  		return -1;
> >  	if (item->object.parsed)
> >  		return 0;
> > -	if (use_commit_graph && parse_commit_in_graph(r, item))
> > +	if (use_commit_graph && parse_commit_in_graph(r, item)) {
> > +		if (!has_object(r, &item->object.oid, 0))
> > +			return quiet_on_missing ? -1 :
> > +				error(_("commit %s exists in commit-graph but not in the object database"),
> > +				      oid_to_hex(&item->object.oid));
> >  		return 0;
> > +	}
> 
> Ever since this codepath was introduced by 177722b3 (commit:
> integrate commit graph with commit parsing, 2018-04-10), we blindly
> trusted what commit-graph file says.  This change is a strict
> improvement in the correctness department, but there are two things
> that are a bit worrying.
> 
> One.  The additional check should certainly be cheaper than a full
> reading and parsing of an object, either from a loose object file or
> from a pack entry.  It may not hurt performance too much, but it
> still would give us more confidence if we know by how much we are
> pessimising good cases where the commit-graph does match reality.
> Our stance on these secondary files that store precomputed values
> for optimization purposes is in general to use them blindly unless
> in exceptional cases where the operation values the correctness even
> when the validity of these secondary files is dubious (e.g., "fsck"),
> and doing this extra check regardless of the caller at this low level
> of the callchain is a bit worrying.

Fair point indeed. The following is a worst-case scenario benchmark of
of the change where we do a full topological walk of all reachable
commits in the graph, executed in linux.git. We parse commit parents via
`repo_parse_commit_gently()`, so the new code path now basically has to
check for object existence of every reachable commit:

Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
  Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
  Range (min … max):    2.894 s …  2.950 s    10 runs

Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
  Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
  Range (min … max):    3.780 s …  3.961 s    10 runs

Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
  Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
  Range (min … max):   13.714 s … 13.995 s    10 runs

Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
  Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
  Range (min … max):   13.645 s … 14.038 s    10 runs

Summary
  git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
    1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
    4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
    4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)

The added check does lead to a performance regression indeed, which is
not all that unexpected. That being said, the commit-graph still results
in a significant speedup compared to the case where we don't have it.

> Another is that by the time parse_commit_in_graph() returns true and
> we realize that the answer we got is bogus by asking has_object(),
> item->object.parsed has already been toggled on, so the caller now
> has a commit object that claimed it was already parsed and does not
> match reality.  Hopefully the caller takes an early exit upon seeing
> a failure from parse_commit_gently() and the .parsed bit does not
> matter, but maybe I am missing a case where it does.  I dunno.

We could also call `unparse_commit()` when we notice the stale commit
graph item. This would be in the same spirit as the rest of this patch
as it would lead to an overall safer end state.

In any case I'll wait for additional input before sending a v2, most
importantly to see whether we think that consistency trumps performance
in this case. Personally I'm still of the mind that it should, which
also comes from the fact that we were fighting with stale commit graphs
several times in production data.

Patrick

> Other than that, sounds very sensible and the code change is clean.
> 
> Will queue.  Thanks.
> 
> > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> > index ba65f17dd9..25f8e9e2d3 100755
> > --- a/t/t5318-commit-graph.sh
> > +++ b/t/t5318-commit-graph.sh
> > @@ -821,4 +821,27 @@ test_expect_success 'overflow during generation version upgrade' '
> >  	)
> >  '
> >  
> > +test_expect_success 'commit exists in commit-graph but not in object database' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit A &&
> > +		test_commit B &&
> > +		test_commit C &&
> > +		git commit-graph write --reachable &&
> > +
> > +		# Corrupt the repository by deleting the intermittent commit
> > +		# object. Commands should notice that this object is absent and
> > +		# thus that the repository is corrupt even if the commit graph
> > +		# exists.
> > +		oid=$(git rev-parse B) &&
> > +		rm .git/objects/"$(test_oid_to_path "$oid")" &&
> > +
> > +		test_must_fail git rev-parse HEAD~2 2>error &&
> > +		grep "error: commit $oid exists in commit-graph but not in the object database" error
> > +	)
> > +'
> > +
> >  test_done

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-17  6:37                       ` Patrick Steinhardt
@ 2023-10-17 18:34                         ` Junio C Hamano
  2023-10-19  6:45                           ` Patrick Steinhardt
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-17 18:34 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> Fair point indeed. The following is a worst-case scenario benchmark of
> of the change where we do a full topological walk of all reachable
> commits in the graph, executed in linux.git. We parse commit parents via
> `repo_parse_commit_gently()`, so the new code path now basically has to
> check for object existence of every reachable commit:
> ...
> The added check does lead to a performance regression indeed, which is
> not all that unexpected. That being said, the commit-graph still results
> in a significant speedup compared to the case where we don't have it.

Yeah, I agree that both points are expected.  An extra check that is
much cheaper than the full parsing is paying a small price to be a
bit more careful than before.  The question is if the price is small
enough.  I am still not sure if the extra carefulness is warranted
for all normal cases to spend 30% extra cycles


Yeah, I agree that both points are expected.  An extra check that is
much cheaper than the full parsing is paying a small price to be a
bit more careful than before.  The question is if the price is small
enough.  I am still not sure if the extra carefulness is warranted
for all normal cases to spend 30% extra cycles.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-17 18:34                         ` Junio C Hamano
@ 2023-10-19  6:45                           ` Patrick Steinhardt
  2023-10-19  8:25                             ` Patrick Steinhardt
  0 siblings, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-19  6:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Karthik Nayak, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 2627 bytes --]

On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Fair point indeed. The following is a worst-case scenario benchmark of
> > of the change where we do a full topological walk of all reachable
> > commits in the graph, executed in linux.git. We parse commit parents via
> > `repo_parse_commit_gently()`, so the new code path now basically has to
> > check for object existence of every reachable commit:
> > ...
> > The added check does lead to a performance regression indeed, which is
> > not all that unexpected. That being said, the commit-graph still results
> > in a significant speedup compared to the case where we don't have it.
> 
> Yeah, I agree that both points are expected.  An extra check that is
> much cheaper than the full parsing is paying a small price to be a
> bit more careful than before.  The question is if the price is small
> enough.  I am still not sure if the extra carefulness is warranted
> for all normal cases to spend 30% extra cycles.
> 
> Thanks.

Well, seen in contexts like the thread that spawned this discussion I
think it's preferable to take a relatively smaller price compared to
disabling the commit graph entirely in some situations. With that in
mind, personally I'd be happy with either of two outcomes here:

    - We take the patch I proposed as a hardening measure, which allows
      us to use the commit graph in basically any situation where we
      could parse objects from the ODB without any downside except for a
      performance hit.

    - We say that corrupt repositories do not need to be accounted for
      when using metadata caches like the commit-graph. That would mean
      in consequence that for the `--missing` flag we would not have to
      disable commit graphs.

The test failure that was exposed in Karthik's test only stems from the
fact that we manually corrupt the repository and is not specific to the
`--missing` flag. This is further stressed by the new test aded in my
own patch, as we can trigger a similar bug when not using `--missing`.

For end users, object pruning would happen either via git-gc(1) or via
git-maintenance(1), and both of them do know to update commit graphs
already. This should significantly decrease the chance that such stale
commit graphs would be found out there in the wild, but it's still not
impossible of course. Especially in cases where you use lower-level
commands like git-repack(1) with cruft packs or git-prune(1), which is
often the case for Git hosters, the risk is higher though.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-19  6:45                           ` Patrick Steinhardt
@ 2023-10-19  8:25                             ` Patrick Steinhardt
  2023-10-19 17:16                               ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-19  8:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Karthik Nayak, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 2654 bytes --]

On Thu, Oct 19, 2023 at 08:45:34AM +0200, Patrick Steinhardt wrote:
> On Tue, Oct 17, 2023 at 11:34:53AM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > Fair point indeed. The following is a worst-case scenario benchmark of
> > > of the change where we do a full topological walk of all reachable
> > > commits in the graph, executed in linux.git. We parse commit parents via
> > > `repo_parse_commit_gently()`, so the new code path now basically has to
> > > check for object existence of every reachable commit:
> > > ...
> > > The added check does lead to a performance regression indeed, which is
> > > not all that unexpected. That being said, the commit-graph still results
> > > in a significant speedup compared to the case where we don't have it.
> > 
> > Yeah, I agree that both points are expected.  An extra check that is
> > much cheaper than the full parsing is paying a small price to be a
> > bit more careful than before.  The question is if the price is small
> > enough.  I am still not sure if the extra carefulness is warranted
> > for all normal cases to spend 30% extra cycles.
> > 
> > Thanks.
> 
> Well, seen in contexts like the thread that spawned this discussion I
> think it's preferable to take a relatively smaller price compared to
> disabling the commit graph entirely in some situations. With that in
> mind, personally I'd be happy with either of two outcomes here:
> 
>     - We take the patch I proposed as a hardening measure, which allows
>       us to use the commit graph in basically any situation where we
>       could parse objects from the ODB without any downside except for a
>       performance hit.
> 
>     - We say that corrupt repositories do not need to be accounted for
>       when using metadata caches like the commit-graph. That would mean
>       in consequence that for the `--missing` flag we would not have to
>       disable commit graphs.
> 
> The test failure that was exposed in Karthik's test only stems from the
> fact that we manually corrupt the repository and is not specific to the
> `--missing` flag. This is further stressed by the new test aded in my
> own patch, as we can trigger a similar bug when not using `--missing`.

There's another way to handle this, which is to conditionally enable the
object existence check. This would be less of a performance hit compared
to disabling commit graphs altogether with `--missing`, but still ensure
that repository corruption was detected. Second, it would not regress
performance for all preexisting users of `repo_parse_commit_gently()`.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 0/3] rev-list: add support for commits in `--missing`
  2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
                     ` (3 preceding siblings ...)
  2023-10-16 16:24   ` [PATCH v2 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
@ 2023-10-19 12:10   ` Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
                       ` (3 more replies)
  4 siblings, 4 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-19 12:10 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

V2 of the series can be found here: http://public-inbox.org/git/ZSkCGS3JPEQ71dOF@tanuki/T/#m10b562ec02834d0b222835a24a18e6ca725f99d1

Changes from v2: 

* Change the flag MISSING's bit number to avoid collision with NEEDS_BITMAP
  which was causing t5310 and t532 to fail.

Range diff from v2:

1:  a7018a319a = 1:  fa35cd21e6 revision: rename bit to `do_not_die_on_missing_objects`
2:  b186a679d4 = 2:  31074df376 rev-list: move `show_commit()` to the bottom
3:  0c8c140c27 ! 3:  d9b0d6da0c rev-list: add commit object support in `--missing` option
    @@ object.h: void object_array_init(struct object_array *array);
      /*
       * object flag allocation:
     - * revision.h:               0---------10         15             23------27
    -+ * revision.h:               0---------10         15             22------27
    ++ * revision.h:               0---------10         15             22------28
       * fetch-pack.c:             01    67
       * negotiator/default.c:       2--5
       * walker.c:                 0-2
    +@@ object.h: void object_array_init(struct object_array *array);
    +  * builtin/show-branch.c:    0-------------------------------------------26
    +  * builtin/unpack-objects.c:                                 2021
    +  */
    +-#define FLAG_BITS  28
    ++#define FLAG_BITS  29
    + 
    + #define TYPE_BITS 3
    + 
     
      ## revision.c ##
     @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *commit,
    @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *com
      	unsigned pass_flags;
      
     -	if (commit->object.flags & ADDED)
    -+	if (commit->object.flags & ADDED || commit->object.flags & MISSING)
    ++	if (commit->object.flags & (ADDED | MISSING))
      		return 0;
      	commit->object.flags |= ADDED;
      
    @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *com
     
      ## revision.h ##
     @@
    - /* WARNING: This is also used as REACHABLE in commit-graph.c. */
    - #define PULL_MERGE	(1u<<15)
    + #define ANCESTRY_PATH	(1u<<27)
    + #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE)
      
    -+/* WARNING: This is also used as NEEDS_BITMAP in pack-bitmap.c. */
    -+#define MISSING         (1u<<22)
    ++#define MISSING (1u<<28)
     +
    - #define TOPO_WALK_EXPLORED	(1u<<23)
    - #define TOPO_WALK_INDEGREE	(1u<<24)
    + #define DECORATE_SHORT_REFS	1
    + #define DECORATE_FULL_REFS	2
      
     
      ## t/t6022-rev-list-missing.sh (new) ##


Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  4 +-
 object.h                    |  4 +-
 revision.c                  | 11 +++--
 revision.h                  | 19 ++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 7 files changed, 147 insertions(+), 60 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

-- 
2.42.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 1/3] revision: rename bit to `do_not_die_on_missing_objects`
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
@ 2023-10-19 12:10     ` Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-19 12:10 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The bit `do_not_die_on_missing_tree` is used in revision.h to ensure the
revision walker does not die when encountering a missing tree. This is
currently exclusively set within `builtin/rev-list.c` to ensure the
`--missing` option works with missing trees.

In the upcoming commits, we will extend `--missing` to also support
missing commits. So let's rename the bit to
`do_not_die_on_missing_objects`, which is object type agnostic and can
be used for both trees/commits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/reflog.c   |  2 +-
 builtin/rev-list.c |  2 +-
 list-objects.c     |  2 +-
 revision.h         | 17 +++++++++--------
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index df63a5892e..9e369a5977 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -298,7 +298,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		struct rev_info revs;
 
 		repo_init_revisions(the_repository, &revs, prefix);
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 		revs.ignore_missing = 1;
 		revs.ignore_missing_links = 1;
 		if (verbose)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ff715d6918..ea77489c38 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -561,7 +561,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	}
 
 	if (arg_missing_action)
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 
 	argc = setup_revisions(argc, argv, &revs, &s_r_opt);
 
diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..47296dff2f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -177,7 +177,7 @@ static void process_tree(struct traversal_context *ctx,
 		    is_promisor_object(&obj->oid))
 			return;
 
-		if (!revs->do_not_die_on_missing_tree)
+		if (!revs->do_not_die_on_missing_objects)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/revision.h b/revision.h
index 50091bbd13..c73c92ef40 100644
--- a/revision.h
+++ b/revision.h
@@ -212,18 +212,19 @@ struct rev_info {
 
 			/*
 			 * Blobs are shown without regard for their existence.
-			 * But not so for trees: unless exclude_promisor_objects
+			 * But not so for trees/commits: unless exclude_promisor_objects
 			 * is set and the tree in question is a promisor object;
 			 * OR ignore_missing_links is set, the revision walker
-			 * dies with a "bad tree object HASH" message when
-			 * encountering a missing tree. For callers that can
-			 * handle missing trees and want them to be filterable
+			 * dies with a "bad <type> object HASH" message when
+			 * encountering a missing object. For callers that can
+			 * handle missing trees/commits and want them to be filterable
 			 * and showable, set this to true. The revision walker
-			 * will filter and show such a missing tree as usual,
-			 * but will not attempt to recurse into this tree
-			 * object.
+			 * will filter and show such a missing object as usual,
+			 * but will not attempt to recurse into this tree/commit
+			 * object. The revision walker will also set the MISSING
+			 * flag for such objects.
 			 */
-			do_not_die_on_missing_tree:1,
+			do_not_die_on_missing_objects:1,
 
 			/* for internal use only */
 			exclude_promisor_objects:1;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 2/3] rev-list: move `show_commit()` to the bottom
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
@ 2023-10-19 12:10     ` Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  3 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-19 12:10 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `show_commit()` function already depends on `finish_commit()`, and
in the upcoming commit, we'll also add a dependency on
`finish_object__ma()`. Since in C symbols must be declared before
they're used, let's move `show_commit()` below both `finish_commit()`
and `finish_object__ma()`, so the code is cleaner as a whole without the
need for declarations.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ea77489c38..98542e8b3c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -100,7 +100,48 @@ static off_t get_object_disk_usage(struct object *obj)
 	return size;
 }
 
-static void finish_commit(struct commit *commit);
+static inline void finish_object__ma(struct object *obj)
+{
+	/*
+	 * Whether or not we try to dynamically fetch missing objects
+	 * from the server, we currently DO NOT have the object.  We
+	 * can either print, allow (ignore), or conditionally allow
+	 * (ignore) them.
+	 */
+	switch (arg_missing_action) {
+	case MA_ERROR:
+		die("missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	case MA_ALLOW_ANY:
+		return;
+
+	case MA_PRINT:
+		oidset_insert(&missing_objects, &obj->oid);
+		return;
+
+	case MA_ALLOW_PROMISOR:
+		if (is_promisor_object(&obj->oid))
+			return;
+		die("unexpected missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	default:
+		BUG("unhandled missing_action");
+		return;
+	}
+}
+
+static void finish_commit(struct commit *commit)
+{
+	free_commit_list(commit->parents);
+	commit->parents = NULL;
+	free_commit_buffer(the_repository->parsed_objects,
+			   commit);
+}
+
 static void show_commit(struct commit *commit, void *data)
 {
 	struct rev_list_info *info = data;
@@ -219,48 +260,6 @@ static void show_commit(struct commit *commit, void *data)
 	finish_commit(commit);
 }
 
-static void finish_commit(struct commit *commit)
-{
-	free_commit_list(commit->parents);
-	commit->parents = NULL;
-	free_commit_buffer(the_repository->parsed_objects,
-			   commit);
-}
-
-static inline void finish_object__ma(struct object *obj)
-{
-	/*
-	 * Whether or not we try to dynamically fetch missing objects
-	 * from the server, we currently DO NOT have the object.  We
-	 * can either print, allow (ignore), or conditionally allow
-	 * (ignore) them.
-	 */
-	switch (arg_missing_action) {
-	case MA_ERROR:
-		die("missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	case MA_ALLOW_ANY:
-		return;
-
-	case MA_PRINT:
-		oidset_insert(&missing_objects, &obj->oid);
-		return;
-
-	case MA_ALLOW_PROMISOR:
-		if (is_promisor_object(&obj->oid))
-			return;
-		die("unexpected missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	default:
-		BUG("unhandled missing_action");
-		return;
-	}
-}
-
 static int finish_object(struct object *obj, const char *name UNUSED,
 			 void *cb_data)
 {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
  2023-10-19 12:10     ` [PATCH v3 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
@ 2023-10-19 12:10     ` Karthik Nayak
  2023-10-19 22:05       ` Junio C Hamano
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  3 siblings, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-19 12:10 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a new `MISSING` flag that the
revision walker sets whenever it encounters a missing tree/commit. The
revision walker will now continue the traversal and call `show_commit()`
even for missing commits. In rev-list we can then check for this flag
and call the existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  2 +-
 object.h                    |  4 +-
 revision.c                  | 11 ++++--
 revision.h                  |  2 +
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 6 files changed, 93 insertions(+), 6 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..604ae01d0c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    commit->object.flags & MISSING) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..60c5533721 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -387,7 +387,7 @@ static void do_traverse(struct traversal_context *ctx)
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
 		 */
-		if (!ctx->revs->tree_objects)
+		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
 			; /* do not bother loading tree */
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
diff --git a/object.h b/object.h
index 114d45954d..b76830fce1 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
 
 /*
  * object flag allocation:
- * revision.h:               0---------10         15             23------27
+ * revision.h:               0---------10         15             22------28
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  28
+#define FLAG_BITS  29
 
 #define TYPE_BITS 3
 
diff --git a/revision.c b/revision.c
index 219dc76716..8100806068 100644
--- a/revision.c
+++ b/revision.c
@@ -1110,7 +1110,7 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	struct commit_list *parent = commit->parents;
 	unsigned pass_flags;
 
-	if (commit->object.flags & ADDED)
+	if (commit->object.flags & (ADDED | MISSING))
 		return 0;
 	commit->object.flags |= ADDED;
 
@@ -1168,7 +1168,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1177,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				p->object.flags |= MISSING;
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
diff --git a/revision.h b/revision.h
index c73c92ef40..d3c1ca0f4e 100644
--- a/revision.h
+++ b/revision.h
@@ -53,6 +53,8 @@
 #define ANCESTRY_PATH	(1u<<27)
 #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE)
 
+#define MISSING (1u<<28)
+
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2
 
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-19  8:25                             ` Patrick Steinhardt
@ 2023-10-19 17:16                               ` Junio C Hamano
  2023-10-20 10:00                                 ` Jeff King
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-19 17:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Karthik Nayak, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> There's another way to handle this, which is to conditionally enable the
> object existence check. This would be less of a performance hit compared
> to disabling commit graphs altogether with `--missing`, but still ensure
> that repository corruption was detected. Second, it would not regress
> performance for all preexisting users of `repo_parse_commit_gently()`.

The above was what I meant to suggest when you demonstrated that the
code with additional check is still much more performant than
running without the commit-graph optimization, yet has observable
impact on performance for normal codepaths that do not need the
extra carefulness.

But I wasn't sure if it is easy to plumb the "do we want to double
check?  in other words, are we running something like --missing that
care the correctness a bit more than usual cases?" bit down from the
caller, because this check is so deep in the callchain.

Thanks.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-19 12:10     ` [PATCH v3 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-19 22:05       ` Junio C Hamano
  2023-10-19 23:35         ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-19 22:05 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Karthik Nayak <karthik.188@gmail.com> writes:

> diff --git a/object.h b/object.h
> index 114d45954d..b76830fce1 100644
> --- a/object.h
> +++ b/object.h
> @@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);
>  
>  /*
>   * object flag allocation:
> - * revision.h:               0---------10         15             23------27
> + * revision.h:               0---------10         15             22------28
>   * fetch-pack.c:             01    67
>   * negotiator/default.c:       2--5
>   * walker.c:                 0-2
> @@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
>   * builtin/show-branch.c:    0-------------------------------------------26
>   * builtin/unpack-objects.c:                                 2021
>   */
> -#define FLAG_BITS  28
> +#define FLAG_BITS  29
>  
>  #define TYPE_BITS 3

I am afraid that this is not a good direction to go, given that the
way FLAG_BITS is used is like this:

    /*
     * The object type is stored in 3 bits.
     */
    struct object {
            unsigned parsed : 1;
            unsigned type : TYPE_BITS;
            unsigned flags : FLAG_BITS;
            struct object_id oid;
    };

28 was there, not as a random number of bits we happen to be using.
It was derived by (32 - 3 - 1), i.e. ensure the bitfields above are
stored within a single word.

sizeof(struct object) is 40 bytes on x86-64, with offsetof(oid)
being 4 (i.e. the bitfields fit in a single 4-byte word).  If we
make FLAG_BITS 29, we will add 4 bytes to the structure and waste
31-bit per each and every in-core objects.

Do we really need to allocate a new bit in the object flags, which
is already a scarce resource?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-19 22:05       ` Junio C Hamano
@ 2023-10-19 23:35         ` Junio C Hamano
  2023-10-20 11:14           ` Karthik Nayak
  2023-10-20 16:41           ` Junio C Hamano
  0 siblings, 2 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-19 23:35 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Junio C Hamano <gitster@pobox.com> writes:

> Do we really need to allocate a new bit in the object flags, which
> is already a scarce resource?

Clarification.  I was *not* wondering if we can steal and (re|ab)use
a bit that is used for other purposes, in order to avoid allocating
a new bit.

Rather, I was wondering if we need to use object flags to mark these
objects, or can do what we want to do without using any object flags
at all.  For the purpose of reporting "missing" objects, wouldn't it
be sufficient to walk the object graph and report our findings as we
go?  To avoid reporting the same object twice, as we reasonably can
expect that the missing objects are minority (compared to the total
number of objects), perhaps the codepath that makes such a report
can use a hashmap of object_ids or something, for example.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-19 17:16                               ` Junio C Hamano
@ 2023-10-20 10:00                                 ` Jeff King
  2023-10-20 17:35                                   ` Junio C Hamano
  2023-10-23 10:15                                   ` Patrick Steinhardt
  0 siblings, 2 replies; 61+ messages in thread
From: Jeff King @ 2023-10-20 10:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git, Karthik Nayak, Taylor Blau

On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote:

> Patrick Steinhardt <ps@pks.im> writes:
> 
> > There's another way to handle this, which is to conditionally enable the
> > object existence check. This would be less of a performance hit compared
> > to disabling commit graphs altogether with `--missing`, but still ensure
> > that repository corruption was detected. Second, it would not regress
> > performance for all preexisting users of `repo_parse_commit_gently()`.
> 
> The above was what I meant to suggest when you demonstrated that the
> code with additional check is still much more performant than
> running without the commit-graph optimization, yet has observable
> impact on performance for normal codepaths that do not need the
> extra carefulness.
> 
> But I wasn't sure if it is easy to plumb the "do we want to double
> check?  in other words, are we running something like --missing that
> care the correctness a bit more than usual cases?" bit down from the
> caller, because this check is so deep in the callchain.

I wonder if we would want a "be extra careful" flag that is read from
the environment? That is largely how GIT_REF_PARANOIA works, and then
particular operations set it (though actually it is the default these
days, so they no longer do so explicitly).

I guess that is really like a global variable but even more gross. ;)
But it is nice that it can cross process boundaries, because "how
careful do we want to be" may be in the eye of the caller, especially
for plumbing commands. E.g., even without --missing, you may want
"rev-list" to be extra careful about checking the existence of commits.
The caller in check_connected() could arguably turn that on by default
(the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
it became the default).

-Peff

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-19 23:35         ` Junio C Hamano
@ 2023-10-20 11:14           ` Karthik Nayak
  2023-10-20 14:47             ` Karthik Nayak
  2023-10-20 16:41           ` Junio C Hamano
  1 sibling, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-20 11:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, ps

On Fri, Oct 20, 2023 at 1:35 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Do we really need to allocate a new bit in the object flags, which
> > is already a scarce resource?
>
> Clarification.  I was *not* wondering if we can steal and (re|ab)use
> a bit that is used for other purposes, in order to avoid allocating
> a new bit.
>
> Rather, I was wondering if we need to use object flags to mark these
> objects, or can do what we want to do without using any object flags
> at all.  For the purpose of reporting "missing" objects, wouldn't it
> be sufficient to walk the object graph and report our findings as we
> go?  To avoid reporting the same object twice, as we reasonably can
> expect that the missing objects are minority (compared to the total
> number of objects), perhaps the codepath that makes such a report
> can use a hashmap of object_ids or something, for example.
>

This is kind of hard to do since the revision walking happens in revision.c
and list-objects.c but the missing commits are needed in builtin/rev-list.c.
So even if we build a list of missing commits in list-objects.c, there is no
nice way to send this back to rev-list.c.

The only way we communicate between these layers is through callbacks,
i.e. the show_commit() function in rev-list.c is called for every commit that
the revision walk traverses over.

I'm not entirely sure what the best path to take here to be honest. I will look
and see if there are any bits where overlapping doesn't cause any issues
in the meantime.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-20 11:14           ` Karthik Nayak
@ 2023-10-20 14:47             ` Karthik Nayak
  2023-10-20 17:45               ` Junio C Hamano
  0 siblings, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-20 14:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, ps

On Fri, Oct 20, 2023 at 1:14 PM Karthik Nayak <karthik.188@gmail.com> wrote:
> I'm not entirely sure what the best path to take here to be honest. I will look
> and see if there are any bits where overlapping doesn't cause any issues
> in the meantime.

Was trying to use bit number 12, which coincides METAINFO_SHOWN in
builtin/blame.c.
From skimming over the code, METAINFO_SHOWN is used only within
blame.c and there
should not be collisions here since blame.c doesn't set the
do_not_die_on_missing_objects bit
either.

The tests seem to pass here too [1]. This could be the potential
solution here, Junio, what do you think?
I could send another version if this approach looks good.

[1] https://gitlab.com/gitlab-org/git/-/jobs/5339132966

Diff against the tip of v3:

diff --git a/object.h b/object.h
index b76830fce1..0cb8c02b95 100644
--- a/object.h
+++ b/object.h
@@ -62,7 +62,7 @@ void object_array_init(struct object_array *array);

 /*
  * object flag allocation:
- * revision.h:               0---------10         15             22------28
+ * revision.h:               0---------10  12    15             22------27
  * fetch-pack.c:             01    67
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -82,7 +82,7 @@ void object_array_init(struct object_array *array);
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  29
+#define FLAG_BITS  28

 #define TYPE_BITS 3

diff --git a/revision.h b/revision.h
index d3c1ca0f4e..d2de9d65e4 100644
--- a/revision.h
+++ b/revision.h
@@ -38,6 +38,9 @@
 #define PATCHSAME    (1u<<9)
 #define BOTTOM        (1u<<10)

+/* WARNING: This is also used as METAINFO_SHOWN in builtin/blame.c. */
+#define MISSING     (1u<<12)
+
 /* WARNING: This is also used as REACHABLE in commit-graph.c. */
 #define PULL_MERGE    (1u<<15)

@@ -53,7 +56,6 @@
 #define ANCESTRY_PATH    (1u<<27)
 #define ALL_REV_FLAGS    (((1u<<11)-1) | NOT_USER_GIVEN |
TRACK_LINEAR | PULL_MERGE)

-#define MISSING (1u<<28)

 #define DECORATE_SHORT_REFS    1
 #define DECORATE_FULL_REFS    2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-19 23:35         ` Junio C Hamano
  2023-10-20 11:14           ` Karthik Nayak
@ 2023-10-20 16:41           ` Junio C Hamano
  2023-10-24 11:34             ` Karthik Nayak
  1 sibling, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2023-10-20 16:41 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Junio C Hamano <gitster@pobox.com> writes:

> Rather, I was wondering if we need to use object flags to mark these
> objects, or can do what we want to do without using any object flags
> at all.  For the purpose of reporting "missing" objects, wouldn't it
> be sufficient to walk the object graph and report our findings as we
> go?  To avoid reporting the same object twice, as we reasonably can
> expect that the missing objects are minority (compared to the total
> number of objects), perhaps the codepath that makes such a report
> can use a hashmap of object_ids or something, for example.

Digging from the bottom,

 * builtin/rev-list.c:show_commit() gets "struct rev_list_info *"
   that has "struct rev_info *" [*].

 * list-objects.c:do_traverse() calls revision.c:get_revision() to
   obtain commits, some of which may be missing ones, and things
   behind get_revision() are responsible for marking the commit as
   missing.  It has "struct traversal_context *", among whose
   members is the "revs" member that is the "struct rev_info *".

 * revision.c:get_revision() and machinery behind it ultimately
   discovers a missing commit in the revision.c:process_parents()
   that loops over the parents commit_list.  It of course has access
   to "struct rev_info *".

So, presumably, if we add a new member to "struct rev_info" that
optionally [*] points at an oidset that records the object names of
missing objects we discovered so far (i.e., the set of missing
objects), the location we set the MISSING bit of a commit can
instead add the object name of the commit to the set.  And we can
export a function that takes "struct rev_info *" and "struct object
*" (or "struct object_id *") to check for membership in the "set of
missing objects", which would be used where we checked the MISSING
bit of a commit.

I do not know the performance implications of going this route, but
if we do not find a suitable vacant bit, we do not have to use any
object flags bit to do this, if we go this route, I would think.  I
may be missing some details that breaks the above outline, though.


[Footnotes]

 * A potential #leftoverbits tangent.

   Why is "rev_list_info" structure declared in <bisect.h>?  I
   suspect that this is a fallout from recent header file shuffling,
   but given who uses it (among which is rev-list:show_commit() that
   has very little to do with bisection and uses the information in
   rev_list_info when doing its normal non-bisect things), it does
   not make much sense.

 * When .do_not_die_on_missing_objects is false, it can and should
   be left NULL, but presumably we use the "do not die" bit even
   when we are not necessarily collecting the missing objects?  So
   the new member cannot replace the "do not die" bit completely.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-20 10:00                                 ` Jeff King
@ 2023-10-20 17:35                                   ` Junio C Hamano
  2023-10-23 10:15                                   ` Patrick Steinhardt
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-20 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git, Karthik Nayak, Taylor Blau

Jeff King <peff@peff.net> writes:

> I guess that is really like a global variable but even more gross. ;)

Yeah, I tend to agree, but ...

> But it is nice that it can cross process boundaries, because "how
> careful do we want to be" may be in the eye of the caller, especially
> for plumbing commands. E.g., even without --missing, you may want
> "rev-list" to be extra careful about checking the existence of commits.

... these are all very good points.

> The caller in check_connected() could arguably turn that on by default
> (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
> it became the default).

True.

Thanks.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-20 14:47             ` Karthik Nayak
@ 2023-10-20 17:45               ` Junio C Hamano
  0 siblings, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-20 17:45 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Karthik Nayak <karthik.188@gmail.com> writes:

> Was trying to use bit number 12, which coincides METAINFO_SHOWN in
> builtin/blame.c.
> From skimming over the code, METAINFO_SHOWN is used only within
> blame.c and there
> should not be collisions here since blame.c doesn't set the
> do_not_die_on_missing_objects bit
> either.

Thanks for researching.  It sounds like it may be a better bit to
steal than the one used by the commit-graph, as long as there is no
reason to expect that blame may want to work in a corrupt repository
with missing objects, but when it happens, we may regret the
decision we are making here.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH] commit: detect commits that exist in commit-graph but not in the ODB
  2023-10-20 10:00                                 ` Jeff King
  2023-10-20 17:35                                   ` Junio C Hamano
@ 2023-10-23 10:15                                   ` Patrick Steinhardt
  1 sibling, 0 replies; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-23 10:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Karthik Nayak, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 2634 bytes --]

On Fri, Oct 20, 2023 at 06:00:24AM -0400, Jeff King wrote:
> On Thu, Oct 19, 2023 at 10:16:56AM -0700, Junio C Hamano wrote:
> 
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > There's another way to handle this, which is to conditionally enable the
> > > object existence check. This would be less of a performance hit compared
> > > to disabling commit graphs altogether with `--missing`, but still ensure
> > > that repository corruption was detected. Second, it would not regress
> > > performance for all preexisting users of `repo_parse_commit_gently()`.
> > 
> > The above was what I meant to suggest when you demonstrated that the
> > code with additional check is still much more performant than
> > running without the commit-graph optimization, yet has observable
> > impact on performance for normal codepaths that do not need the
> > extra carefulness.
> > 
> > But I wasn't sure if it is easy to plumb the "do we want to double
> > check?  in other words, are we running something like --missing that
> > care the correctness a bit more than usual cases?" bit down from the
> > caller, because this check is so deep in the callchain.
> 
> I wonder if we would want a "be extra careful" flag that is read from
> the environment? That is largely how GIT_REF_PARANOIA works, and then
> particular operations set it (though actually it is the default these
> days, so they no longer do so explicitly).
> 
> I guess that is really like a global variable but even more gross. ;)
> But it is nice that it can cross process boundaries, because "how
> careful do we want to be" may be in the eye of the caller, especially
> for plumbing commands. E.g., even without --missing, you may want
> "rev-list" to be extra careful about checking the existence of commits.
> The caller in check_connected() could arguably turn that on by default
> (the way we used to turn on GIT_REF_PARANOIA for pruning repacks before
> it became the default).

Good point indeed, I also started to ask myself whether we'd want to
have a config option. An environment variable `GIT_OBJECT_PARANOIA`
would work equally well though and be less "official".

What I like about this idea is that it would also allow us to easily
unify the behaviour of `lookup_commit_in_graph()` and the new sanity
check in `repo_parse_commit_gently()` that I'm introducing here. I'd
personally think that the default for any such toggle should be `true`,
also to keep the current behaviour of `lookup_commit_in_graph()`. But
changing it would be easy enough.

I'll send a v2 of this series with such a toggle.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v3 3/3] rev-list: add commit object support in `--missing` option
  2023-10-20 16:41           ` Junio C Hamano
@ 2023-10-24 11:34             ` Karthik Nayak
  0 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-24 11:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, ps

On Fri, Oct 20, 2023 at 6:41 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Rather, I was wondering if we need to use object flags to mark these
> > objects, or can do what we want to do without using any object flags
> > at all.  For the purpose of reporting "missing" objects, wouldn't it
> > be sufficient to walk the object graph and report our findings as we
> > go?  To avoid reporting the same object twice, as we reasonably can
> > expect that the missing objects are minority (compared to the total
> > number of objects), perhaps the codepath that makes such a report
> > can use a hashmap of object_ids or something, for example.
>
> Digging from the bottom,
>
>  * builtin/rev-list.c:show_commit() gets "struct rev_list_info *"
>    that has "struct rev_info *" [*].
>
>  * list-objects.c:do_traverse() calls revision.c:get_revision() to
>    obtain commits, some of which may be missing ones, and things
>    behind get_revision() are responsible for marking the commit as
>    missing.  It has "struct traversal_context *", among whose
>    members is the "revs" member that is the "struct rev_info *".
>
>  * revision.c:get_revision() and machinery behind it ultimately
>    discovers a missing commit in the revision.c:process_parents()
>    that loops over the parents commit_list.  It of course has access
>    to "struct rev_info *".
>
> So, presumably, if we add a new member to "struct rev_info" that
> optionally [*] points at an oidset that records the object names of
> missing objects we discovered so far (i.e., the set of missing
> objects), the location we set the MISSING bit of a commit can
> instead add the object name of the commit to the set.  And we can
> export a function that takes "struct rev_info *" and "struct object
> *" (or "struct object_id *") to check for membership in the "set of
> missing objects", which would be used where we checked the MISSING
> bit of a commit.
>
> I do not know the performance implications of going this route, but
> if we do not find a suitable vacant bit, we do not have to use any
> object flags bit to do this, if we go this route, I would think.  I
> may be missing some details that breaks the above outline, though.
>
>
> [Footnotes]
>
>  * A potential #leftoverbits tangent.
>
>    Why is "rev_list_info" structure declared in <bisect.h>?  I
>    suspect that this is a fallout from recent header file shuffling,
>    but given who uses it (among which is rev-list:show_commit() that
>    has very little to do with bisection and uses the information in
>    rev_list_info when doing its normal non-bisect things), it does
>    not make much sense.
>
>  * When .do_not_die_on_missing_objects is false, it can and should
>    be left NULL, but presumably we use the "do not die" bit even
>    when we are not necessarily collecting the missing objects?  So
>    the new member cannot replace the "do not die" bit completely.

Thanks for the suggestion, this does seem like a good way to go ahead without
using flags. The only performance issue being if there are too many commits
which are missing, then oidset would be large.

But I think that's okay though.

> Thanks for researching.  It sounds like it may be a better bit to
> steal than the one used by the commit-graph, as long as there is no
> reason to expect that blame may want to work in a corrupt repository
> with missing objects, but when it happens, we may regret the
> decision we are making here.
>

I don't see blame working with missing commits though, because it relies on
parsing commits to get information to show to the user. So I think it's a safe
bit to steal. Also, when the time comes we could always release the bit and
move to the solution you mentioned above.

Anyways on the whole I think keeping it future compatible makes a lot
more sense.
I'll send a patch series to implement an oidset instead of flags soon.

- Karthik

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v4 0/3] rev-list: add support for commits in `--missing`
  2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
                       ` (2 preceding siblings ...)
  2023-10-19 12:10     ` [PATCH v3 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-24 12:26     ` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
                         ` (3 more replies)
  3 siblings, 4 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-24 12:26 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

V3 of the series can be found here: https://lore.kernel.org/git/20231019121024.194317-1-karthik.188@gmail.com/T/#mf6a442e06f323a78a45af086ddd353998bab0052

Changes from v3:
- Instead of using flags on the object level as in the previous vesrions, we
add and use a missing_objects oidset in rev_info. This avoids flag bit-collision
since we were using an existing bit and doesn't require extending the flag size
as done in v3. 

Changelog vs v3:

1:  8c469cf479 = 1:  8c469cf479 revision: rename bit to `do_not_die_on_missing_objects`
2:  76ce43d973 = 2:  76ce43d973 rev-list: move `show_commit()` to the bottom
3:  4c640f9ab4 ! 3:  d892f0b82d rev-list: add commit object support in `--missing` option
    @@ Commit message
         a fatal error.
     
         Let's extend the functionality of `--missing` option to also support
    -    commit objects. This is done by adding a new `MISSING` flag that the
    -    revision walker sets whenever it encounters a missing tree/commit. The
    -    revision walker will now continue the traversal and call `show_commit()`
    -    even for missing commits. In rev-list we can then check for this flag
    -    and call the existing code for parsing `--missing` objects.
    +    commit objects. This is done by adding a `missing_objects` field to
    +    `rev_info`. This field is an `oidset` to which we'll add the missing
    +    commits as we encounter them. The revision walker will now continue the
    +    traversal and call `show_commit()` even for missing commits. In rev-list
    +    we can then check if the commit is a missing commit and call the
    +    existing code for parsing `--missing` objects.
     
         A scenario where this option would be used is to find the boundary
         objects between different object directories. Consider a repository with
    @@ Commit message
         alternate object directory allows us to find the boundary objects
         between the main and alternate object directory.
     
    +    Helped-by: Patrick Steinhardt <ps@pks.im>
         Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
     
      ## builtin/rev-list.c ##
    @@ builtin/rev-list.c: static void show_commit(struct commit *commit, void *data)
      	display_progress(progress, ++progress_counter);
      
     +	if (revs->do_not_die_on_missing_objects &&
    -+	    commit->object.flags & MISSING) {
    ++	    oidset_contains(&revs->missing_objects, &commit->object.oid)) {
     +		finish_object__ma(&commit->object);
     +		return;
     +	}
    @@ builtin/rev-list.c: static void show_commit(struct commit *commit, void *data)
     
      ## list-objects.c ##
     @@ list-objects.c: static void do_traverse(struct traversal_context *ctx)
    - 		 * an uninteresting boundary commit may not have its tree
    - 		 * parsed yet, but we are not going to show them anyway
      		 */
    --		if (!ctx->revs->tree_objects)
    -+		if (!ctx->revs->tree_objects || commit->object.flags & MISSING)
    + 		if (!ctx->revs->tree_objects)
      			; /* do not bother loading tree */
    ++		else if (ctx->revs->do_not_die_on_missing_objects &&
    ++			 oidset_contains(&ctx->revs->missing_objects, &commit->object.oid))
    ++			;
      		else if (repo_get_commit_tree(the_repository, commit)) {
      			struct tree *tree = repo_get_commit_tree(the_repository,
    -
    - ## object.h ##
    -@@ object.h: void object_array_init(struct object_array *array);
    - 
    - /*
    -  * object flag allocation:
    -- * revision.h:               0---------10         15             23------27
    -+ * revision.h:               0---------10         15             22------28
    -  * fetch-pack.c:             01    67
    -  * negotiator/default.c:       2--5
    -  * walker.c:                 0-2
    -@@ object.h: void object_array_init(struct object_array *array);
    -  * builtin/show-branch.c:    0-------------------------------------------26
    -  * builtin/unpack-objects.c:                                 2021
    -  */
    --#define FLAG_BITS  28
    -+#define FLAG_BITS  29
    - 
    - #define TYPE_BITS 3
    - 
    + 								 commit);
     
      ## revision.c ##
    +@@
    + #include "object-name.h"
    + #include "object-file.h"
    + #include "object-store-ll.h"
    ++#include "oidset.h"
    + #include "tag.h"
    + #include "blob.h"
    + #include "tree.h"
     @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *commit,
    - 	struct commit_list *parent = commit->parents;
    - 	unsigned pass_flags;
      
    --	if (commit->object.flags & ADDED)
    -+	if (commit->object.flags & (ADDED | MISSING))
    + 	if (commit->object.flags & ADDED)
      		return 0;
    ++	if (revs->do_not_die_on_missing_objects &&
    ++		oidset_contains(&revs->missing_objects, &commit->object.oid))
    ++		return 0;
      	commit->object.flags |= ADDED;
      
    + 	if (revs->include_check &&
     @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *commit,
      	for (parent = commit->parents; parent; parent = parent->next) {
      		struct commit *p = parent->item;
    @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *com
     +			if (!revs->do_not_die_on_missing_objects)
     +				return -1;
     +			else
    -+				p->object.flags |= MISSING;
    ++				oidset_insert(&revs->missing_objects, &p->object.oid);
      		}
      		if (revs->sources) {
      			char **slot = revision_sources_at(revs->sources, p);
    +@@ revision.c: int prepare_revision_walk(struct rev_info *revs)
    + 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
    + 	}
    + 
    ++	if (revs->do_not_die_on_missing_objects)
    ++		oidset_init(&revs->missing_objects, 0);
    ++
    + 	if (!revs->reflog_info)
    + 		prepare_to_use_bloom_filter(revs);
    + 	if (!revs->unsorted_input)
     
      ## revision.h ##
     @@
    - #define ANCESTRY_PATH	(1u<<27)
    - #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR | PULL_MERGE)
    + #include "commit.h"
    + #include "grep.h"
    + #include "notes.h"
    ++#include "oidset.h"
    + #include "pretty.h"
    + #include "diff.h"
    + #include "commit-slab-decl.h"
    +@@ revision.h: struct rev_info {
      
    -+#define MISSING (1u<<28)
    + 	/* Location where temporary objects for remerge-diff are written. */
    + 	struct tmp_objdir *remerge_objdir;
     +
    - #define DECORATE_SHORT_REFS	1
    - #define DECORATE_FULL_REFS	2
    ++	/* Missing objects to be tracked without failing traversal. */
    ++	struct oidset missing_objects;
    + };
      
    + /**
     
      ## t/t6022-rev-list-missing.sh (new) ##
     @@


Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  5 +-
 revision.c                  | 16 ++++++-
 revision.h                  | 21 +++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 6 files changed, 155 insertions(+), 56 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

-- 
2.42.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v4 1/3] revision: rename bit to `do_not_die_on_missing_objects`
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
@ 2023-10-24 12:26       ` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-24 12:26 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The bit `do_not_die_on_missing_tree` is used in revision.h to ensure the
revision walker does not die when encountering a missing tree. This is
currently exclusively set within `builtin/rev-list.c` to ensure the
`--missing` option works with missing trees.

In the upcoming commits, we will extend `--missing` to also support
missing commits. So let's rename the bit to
`do_not_die_on_missing_objects`, which is object type agnostic and can
be used for both trees/commits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/reflog.c   |  2 +-
 builtin/rev-list.c |  2 +-
 list-objects.c     |  2 +-
 revision.h         | 17 +++++++++--------
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index df63a5892e..9e369a5977 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -298,7 +298,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		struct rev_info revs;
 
 		repo_init_revisions(the_repository, &revs, prefix);
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 		revs.ignore_missing = 1;
 		revs.ignore_missing_links = 1;
 		if (verbose)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ff715d6918..ea77489c38 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -561,7 +561,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	}
 
 	if (arg_missing_action)
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 
 	argc = setup_revisions(argc, argv, &revs, &s_r_opt);
 
diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..47296dff2f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -177,7 +177,7 @@ static void process_tree(struct traversal_context *ctx,
 		    is_promisor_object(&obj->oid))
 			return;
 
-		if (!revs->do_not_die_on_missing_tree)
+		if (!revs->do_not_die_on_missing_objects)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/revision.h b/revision.h
index 50091bbd13..c73c92ef40 100644
--- a/revision.h
+++ b/revision.h
@@ -212,18 +212,19 @@ struct rev_info {
 
 			/*
 			 * Blobs are shown without regard for their existence.
-			 * But not so for trees: unless exclude_promisor_objects
+			 * But not so for trees/commits: unless exclude_promisor_objects
 			 * is set and the tree in question is a promisor object;
 			 * OR ignore_missing_links is set, the revision walker
-			 * dies with a "bad tree object HASH" message when
-			 * encountering a missing tree. For callers that can
-			 * handle missing trees and want them to be filterable
+			 * dies with a "bad <type> object HASH" message when
+			 * encountering a missing object. For callers that can
+			 * handle missing trees/commits and want them to be filterable
 			 * and showable, set this to true. The revision walker
-			 * will filter and show such a missing tree as usual,
-			 * but will not attempt to recurse into this tree
-			 * object.
+			 * will filter and show such a missing object as usual,
+			 * but will not attempt to recurse into this tree/commit
+			 * object. The revision walker will also set the MISSING
+			 * flag for such objects.
 			 */
-			do_not_die_on_missing_tree:1,
+			do_not_die_on_missing_objects:1,
 
 			/* for internal use only */
 			exclude_promisor_objects:1;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 2/3] rev-list: move `show_commit()` to the bottom
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
@ 2023-10-24 12:26       ` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
  2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  3 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-24 12:26 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `show_commit()` function already depends on `finish_commit()`, and
in the upcoming commit, we'll also add a dependency on
`finish_object__ma()`. Since in C symbols must be declared before
they're used, let's move `show_commit()` below both `finish_commit()`
and `finish_object__ma()`, so the code is cleaner as a whole without the
need for declarations.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ea77489c38..98542e8b3c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -100,7 +100,48 @@ static off_t get_object_disk_usage(struct object *obj)
 	return size;
 }
 
-static void finish_commit(struct commit *commit);
+static inline void finish_object__ma(struct object *obj)
+{
+	/*
+	 * Whether or not we try to dynamically fetch missing objects
+	 * from the server, we currently DO NOT have the object.  We
+	 * can either print, allow (ignore), or conditionally allow
+	 * (ignore) them.
+	 */
+	switch (arg_missing_action) {
+	case MA_ERROR:
+		die("missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	case MA_ALLOW_ANY:
+		return;
+
+	case MA_PRINT:
+		oidset_insert(&missing_objects, &obj->oid);
+		return;
+
+	case MA_ALLOW_PROMISOR:
+		if (is_promisor_object(&obj->oid))
+			return;
+		die("unexpected missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	default:
+		BUG("unhandled missing_action");
+		return;
+	}
+}
+
+static void finish_commit(struct commit *commit)
+{
+	free_commit_list(commit->parents);
+	commit->parents = NULL;
+	free_commit_buffer(the_repository->parsed_objects,
+			   commit);
+}
+
 static void show_commit(struct commit *commit, void *data)
 {
 	struct rev_list_info *info = data;
@@ -219,48 +260,6 @@ static void show_commit(struct commit *commit, void *data)
 	finish_commit(commit);
 }
 
-static void finish_commit(struct commit *commit)
-{
-	free_commit_list(commit->parents);
-	commit->parents = NULL;
-	free_commit_buffer(the_repository->parsed_objects,
-			   commit);
-}
-
-static inline void finish_object__ma(struct object *obj)
-{
-	/*
-	 * Whether or not we try to dynamically fetch missing objects
-	 * from the server, we currently DO NOT have the object.  We
-	 * can either print, allow (ignore), or conditionally allow
-	 * (ignore) them.
-	 */
-	switch (arg_missing_action) {
-	case MA_ERROR:
-		die("missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	case MA_ALLOW_ANY:
-		return;
-
-	case MA_PRINT:
-		oidset_insert(&missing_objects, &obj->oid);
-		return;
-
-	case MA_ALLOW_PROMISOR:
-		if (is_promisor_object(&obj->oid))
-			return;
-		die("unexpected missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	default:
-		BUG("unhandled missing_action");
-		return;
-	}
-}
-
 static int finish_object(struct object *obj, const char *name UNUSED,
 			 void *cb_data)
 {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
  2023-10-24 12:26       ` [PATCH v4 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
@ 2023-10-24 12:26       ` Karthik Nayak
  2023-10-24 17:45         ` Junio C Hamano
  2023-10-25  6:40         ` Patrick Steinhardt
  2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  3 siblings, 2 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-24 12:26 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a `missing_objects` field to
`rev_info`. This field is an `oidset` to which we'll add the missing
commits as we encounter them. The revision walker will now continue the
traversal and call `show_commit()` even for missing commits. In rev-list
we can then check if the commit is a missing commit and call the
existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  3 ++
 revision.c                  | 16 +++++++-
 revision.h                  |  4 ++
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..37b52520b5 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    oidset_contains(&revs->missing_objects, &commit->object.oid)) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..260089388c 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx)
 		 */
 		if (!ctx->revs->tree_objects)
 			; /* do not bother loading tree */
+		else if (ctx->revs->do_not_die_on_missing_objects &&
+			 oidset_contains(&ctx->revs->missing_objects, &commit->object.oid))
+			;
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
 								 commit);
diff --git a/revision.c b/revision.c
index 219dc76716..e60646c1a7 100644
--- a/revision.c
+++ b/revision.c
@@ -6,6 +6,7 @@
 #include "object-name.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "oidset.h"
 #include "tag.h"
 #include "blob.h"
 #include "tree.h"
@@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 
 	if (commit->object.flags & ADDED)
 		return 0;
+	if (revs->do_not_die_on_missing_objects &&
+		oidset_contains(&revs->missing_objects, &commit->object.oid))
+		return 0;
 	commit->object.flags |= ADDED;
 
 	if (revs->include_check &&
@@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (!revs->do_not_die_on_missing_objects)
+				return -1;
+			else
+				oidset_insert(&revs->missing_objects, &p->object.oid);
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
@@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}
 
+	if (revs->do_not_die_on_missing_objects)
+		oidset_init(&revs->missing_objects, 0);
+
 	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
 	if (!revs->unsorted_input)
diff --git a/revision.h b/revision.h
index c73c92ef40..f6bf422f0e 100644
--- a/revision.h
+++ b/revision.h
@@ -4,6 +4,7 @@
 #include "commit.h"
 #include "grep.h"
 #include "notes.h"
+#include "oidset.h"
 #include "pretty.h"
 #include "diff.h"
 #include "commit-slab-decl.h"
@@ -373,6 +374,9 @@ struct rev_info {
 
 	/* Location where temporary objects for remerge-diff are written. */
 	struct tmp_objdir *remerge_objdir;
+
+	/* Missing objects to be tracked without failing traversal. */
+	struct oidset missing_objects;
 };
 
 /**
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-24 12:26       ` [PATCH v4 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-24 17:45         ` Junio C Hamano
  2023-10-25  0:35           ` Junio C Hamano
  2023-10-25  9:34           ` Karthik Nayak
  2023-10-25  6:40         ` Patrick Steinhardt
  1 sibling, 2 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-24 17:45 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Karthik Nayak <karthik.188@gmail.com> writes:

> @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs,...
>  					break;
>  				continue;
>  			}
> -			return -1;
> +
> +			if (!revs->do_not_die_on_missing_objects)
> +				return -1;
> +			else
> +				oidset_insert(&revs->missing_objects, &p->object.oid);

I would suspect that swapping if/else would make it easier to
follow.  Everybody else in the patch guards the use of the oidset
with "were we told not to die on missing objects?", i.e.,

	if (revs->do_not_die_on_missing_objects)
		oidset_insert(&revs->missing_objects, &p->object.oid);
	else
		return -1; /* corrupt repository */

> @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->do_not_die_on_missing_objects)
> +		oidset_init(&revs->missing_objects, 0);

I read the patch to make sure that .missing_objects oidset is used
only when .do_not_die_on_missing_objects is set and the oidset is
untouched unless it is initialized.  Well done.

I know I floated "perhaps oidset can replace the object bits,
especially because the number of objects that need marking is
expected to be small", but I am curious what the performance
implication of this would be.  Is this something we can create
comparison easily?

I noticed that nobody releases the resource held by this new oidset.
Shouldn't we do so in revision.c:release_revisions()?

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-24 17:45         ` Junio C Hamano
@ 2023-10-25  0:35           ` Junio C Hamano
  2023-10-25  9:34           ` Karthik Nayak
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-25  0:35 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, ps

Junio C Hamano <gitster@pobox.com> writes:

> I noticed that nobody releases the resource held by this new oidset.
> Shouldn't we do so in revision.c:release_revisions()?

It seems that linux-leaks CI job noticed the same.

https://github.com/git/git/actions/runs/6633599458/job/18021612518#step:5:2949

I wonder if the following is sufficient?

 revision.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git c/revision.c w/revision.c
index 724a116401..7a67ff74dc 100644
--- c/revision.c
+++ w/revision.c
@@ -3136,6 +3136,8 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->merge_simplification, free);
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
+	if (revs->do_not_die_on_missing_objects)
+		oidset_clear(&revs->missing_objects);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-24 12:26       ` [PATCH v4 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
  2023-10-24 17:45         ` Junio C Hamano
@ 2023-10-25  6:40         ` Patrick Steinhardt
  2023-10-26 12:37           ` Junio C Hamano
  1 sibling, 1 reply; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-25  6:40 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 8550 bytes --]

On Tue, Oct 24, 2023 at 02:26:31PM +0200, Karthik Nayak wrote:
> The `--missing` object option in rev-list currently works only with
> missing blobs/trees. For missing commits the revision walker fails with
> a fatal error.
> 
> Let's extend the functionality of `--missing` option to also support
> commit objects. This is done by adding a `missing_objects` field to
> `rev_info`. This field is an `oidset` to which we'll add the missing
> commits as we encounter them. The revision walker will now continue the
> traversal and call `show_commit()` even for missing commits. In rev-list
> we can then check if the commit is a missing commit and call the
> existing code for parsing `--missing` objects.
> 
> A scenario where this option would be used is to find the boundary
> objects between different object directories. Consider a repository with
> a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
> object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
> repository, using the `--missing=print` option while disabling the
> alternate object directory allows us to find the boundary objects
> between the main and alternate object directory.
> 
> Helped-by: Patrick Steinhardt <ps@pks.im>
> Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
> ---
>  builtin/rev-list.c          |  6 +++
>  list-objects.c              |  3 ++
>  revision.c                  | 16 +++++++-
>  revision.h                  |  4 ++
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 101 insertions(+), 2 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh
> 
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index 98542e8b3c..37b52520b5 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
>  
>  	display_progress(progress, ++progress_counter);
>  
> +	if (revs->do_not_die_on_missing_objects &&
> +	    oidset_contains(&revs->missing_objects, &commit->object.oid)) {
> +		finish_object__ma(&commit->object);
> +		return;
> +	}
> +
>  	if (show_disk_usage)
>  		total_disk_usage += get_object_disk_usage(&commit->object);
>  
> diff --git a/list-objects.c b/list-objects.c
> index 47296dff2f..260089388c 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx)
>  		 */
>  		if (!ctx->revs->tree_objects)
>  			; /* do not bother loading tree */
> +		else if (ctx->revs->do_not_die_on_missing_objects &&
> +			 oidset_contains(&ctx->revs->missing_objects, &commit->object.oid))
> +			;
>  		else if (repo_get_commit_tree(the_repository, commit)) {
>  			struct tree *tree = repo_get_commit_tree(the_repository,
>  								 commit);
> diff --git a/revision.c b/revision.c
> index 219dc76716..e60646c1a7 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -6,6 +6,7 @@
>  #include "object-name.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "oidset.h"
>  #include "tag.h"
>  #include "blob.h"
>  #include "tree.h"
> @@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  
>  	if (commit->object.flags & ADDED)
>  		return 0;
> +	if (revs->do_not_die_on_missing_objects &&
> +		oidset_contains(&revs->missing_objects, &commit->object.oid))

Nit: indentation is off here.

> +		return 0;
>  	commit->object.flags |= ADDED;
>  
>  	if (revs->include_check &&
> @@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  	for (parent = commit->parents; parent; parent = parent->next) {
>  		struct commit *p = parent->item;
>  		int gently = revs->ignore_missing_links ||
> -			     revs->exclude_promisor_objects;
> +			     revs->exclude_promisor_objects ||
> +			     revs->do_not_die_on_missing_objects;
>  		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
>  			if (revs->exclude_promisor_objects &&
>  			    is_promisor_object(&p->object.oid)) {
> @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  					break;
>  				continue;
>  			}
> -			return -1;
> +
> +			if (!revs->do_not_die_on_missing_objects)
> +				return -1;
> +			else
> +				oidset_insert(&revs->missing_objects, &p->object.oid);
>  		}
>  		if (revs->sources) {
>  			char **slot = revision_sources_at(revs->sources, p);
> @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->do_not_die_on_missing_objects)
> +		oidset_init(&revs->missing_objects, 0);
> +

While we're initializing the new oidset, we never clear it. We should
probably call `oidset_clear()` in `release_revisions()`. And if we
unconditionally initialized the oidset here then we can also call
`oiadset_clear()` unconditionally there, which should be preferable
given that `oidset_init()` does not allocate memory when no initial size
was given.

>  	if (!revs->reflog_info)
>  		prepare_to_use_bloom_filter(revs);
>  	if (!revs->unsorted_input)
> diff --git a/revision.h b/revision.h
> index c73c92ef40..f6bf422f0e 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
>  #include "commit.h"
>  #include "grep.h"
>  #include "notes.h"
> +#include "oidset.h"
>  #include "pretty.h"
>  #include "diff.h"
>  #include "commit-slab-decl.h"
> @@ -373,6 +374,9 @@ struct rev_info {
>  
>  	/* Location where temporary objects for remerge-diff are written. */
>  	struct tmp_objdir *remerge_objdir;
> +
> +	/* Missing objects to be tracked without failing traversal. */
> +	struct oidset missing_objects;

As far as I can see we only use this set to track missing commits, but
none of the other objects. The name thus feels a bit misleading to me,
as a reader might rightfully assume that it contains _all_ missing
objects after the revwalk. Should we rename it to `missing_commits` to
clarify?

Patrick

>  };
>  
>  /**
> diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
> new file mode 100755
> index 0000000000..40265a4f66
> --- /dev/null
> +++ b/t/t6022-rev-list-missing.sh
> @@ -0,0 +1,74 @@
> +#!/bin/sh
> +
> +test_description='handling of missing objects in rev-list'
> +
> +TEST_PASSES_SANITIZE_LEAK=true
> +. ./test-lib.sh
> +
> +# We setup the repository with two commits, this way HEAD is always
> +# available and we can hide commit 1.
> +test_expect_success 'create repository and alternate directory' '
> +	test_commit 1 &&
> +	test_commit 2 &&
> +	test_commit 3
> +'
> +
> +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
> +do
> +	test_expect_success "rev-list --missing=error fails with missing object $obj" '
> +		oid="$(git rev-parse $obj)" &&
> +		path=".git/objects/$(test_oid_to_path $oid)" &&
> +
> +		mv "$path" "$path.hidden" &&
> +		test_when_finished "mv $path.hidden $path" &&
> +
> +		test_must_fail git rev-list --missing=error --objects \
> +			--no-object-names HEAD
> +	'
> +done
> +
> +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
> +do
> +	for action in "allow-any" "print"
> +	do
> +		test_expect_success "rev-list --missing=$action with missing $obj" '
> +			oid="$(git rev-parse $obj)" &&
> +			path=".git/objects/$(test_oid_to_path $oid)" &&
> +
> +			# Before the object is made missing, we use rev-list to
> +			# get the expected oids.
> +			git rev-list --objects --no-object-names \
> +				HEAD ^$obj >expect.raw &&
> +
> +			# Blobs are shared by all commits, so evethough a commit/tree
> +			# might be skipped, its blob must be accounted for.
> +			if [ $obj != "HEAD:1.t" ]; then
> +				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
> +				echo $(git rev-parse HEAD:2.t) >>expect.raw
> +			fi &&
> +
> +			mv "$path" "$path.hidden" &&
> +			test_when_finished "mv $path.hidden $path" &&
> +
> +			git rev-list --missing=$action --objects --no-object-names \
> +				HEAD >actual.raw &&
> +
> +			# When the action is to print, we should also add the missing
> +			# oid to the expect list.
> +			case $action in
> +			allow-any)
> +				;;
> +			print)
> +				grep ?$oid actual.raw &&
> +				echo ?$oid >>expect.raw
> +				;;
> +			esac &&
> +
> +			sort actual.raw >actual &&
> +			sort expect.raw >expect &&
> +			test_cmp expect actual
> +		'
> +	done
> +done
> +
> +test_done
> -- 
> 2.42.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-24 17:45         ` Junio C Hamano
  2023-10-25  0:35           ` Junio C Hamano
@ 2023-10-25  9:34           ` Karthik Nayak
  1 sibling, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-25  9:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, ps

On Tue, Oct 24, 2023 at 7:45 PM Junio C Hamano <gitster@pobox.com> wrote:
> I would suspect that swapping if/else would make it easier to
> follow.  Everybody else in the patch guards the use of the oidset
> with "were we told not to die on missing objects?", i.e.,
>
>         if (revs->do_not_die_on_missing_objects)
>                 oidset_insert(&revs->missing_objects, &p->object.oid);
>         else
>                 return -1; /* corrupt repository */
>

Makes sense. Will change.

> > @@ -3800,6 +3809,9 @@ int prepare_revision_walk(struct rev_info *revs)
> >                                      FOR_EACH_OBJECT_PROMISOR_ONLY);
> >       }
> >
> > +     if (revs->do_not_die_on_missing_objects)
> > +             oidset_init(&revs->missing_objects, 0);
>
> I read the patch to make sure that .missing_objects oidset is used
> only when .do_not_die_on_missing_objects is set and the oidset is
> untouched unless it is initialized.  Well done.
>
> I know I floated "perhaps oidset can replace the object bits,
> especially because the number of objects that need marking is
> expected to be small", but I am curious what the performance
> implication of this would be.  Is this something we can create
> comparison easily?

I did a simple comparision here, I randomly deleted commits which had
child commits with greater than 2 parents.

$ git rev-list --missing=print HEAD | grep "?" | wc -l
828

Using the flag bit:

$ hyperfine -w 3 "~/git/bin-wrappers/git rev-list --missing=allow-any HEAD"
Benchmark 1: ~/git/bin-wrappers/git rev-list --missing=allow-any HEAD
  Time (mean ± σ):     860.5 ms ±  15.2 ms    [User: 375.3 ms, System: 467.5 ms]
  Range (min … max):   835.2 ms … 881.0 ms    10 runs

Using the oidset:

$ hyperfine -w 3 "~/git/bin-wrappers/git rev-list --missing=allow-any HEAD"
Benchmark 1: ~/git/bin-wrappers/git rev-list --missing=allow-any HEAD
  Time (mean ± σ):     901.3 ms ±   9.6 ms    [User: 394.3 ms, System: 488.3 ms]
  Range (min … max):   885.0 ms … 913.2 ms    10 runs

Its definitely slower, but not by much.

>
> > I noticed that nobody releases the resource held by this new oidset.
> > Shouldn't we do so in revision.c:release_revisions()?
>
> It seems that linux-leaks CI job noticed the same.
>
> https://github.com/git/git/actions/runs/6633599458/job/18021612518#step:5:2949
>
> I wonder if the following is sufficient?
>
>  revision.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git c/revision.c w/revision.c
> index 724a116401..7a67ff74dc 100644
> --- c/revision.c
> +++ w/revision.c
> @@ -3136,6 +3136,8 @@ void release_revisions(struct rev_info *revs)
>         clear_decoration(&revs->merge_simplification, free);
>         clear_decoration(&revs->treesame, free);
>         line_log_free(revs);
> +       if (revs->do_not_die_on_missing_objects)
> +               oidset_clear(&revs->missing_objects);
>  }
>
>  static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
>

Yup, this seems right and was missed, will add.


On Wed, Oct 25, 2023 at 8:40 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> >
> >       if (commit->object.flags & ADDED)
> >               return 0;
> > +     if (revs->do_not_die_on_missing_objects &&
> > +             oidset_contains(&revs->missing_objects, &commit->object.oid))
>
> Nit: indentation is off here.
>

Thanks, will fix.

> > +
> > +     /* Missing objects to be tracked without failing traversal. */
> > +     struct oidset missing_objects;
>
> As far as I can see we only use this set to track missing commits, but
> none of the other objects. The name thus feels a bit misleading to me,
> as a reader might rightfully assume that it contains _all_ missing
> objects after the revwalk. Should we rename it to `missing_commits` to
> clarify?
>

Fair enough, I was thinking of being future compatible. But probably best to
be specific now and refactor as needed in the future. Will change.

Thanks both for the review!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v5 0/3] rev-list: add support for commits in `--missing`
  2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
                         ` (2 preceding siblings ...)
  2023-10-24 12:26       ` [PATCH v4 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-26 10:11       ` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
                           ` (2 more replies)
  3 siblings, 3 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-26 10:11 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` option in git-rev-list(1) was introduced intitally
to deal with missing blobs in the context of promissory notes.
Eventually the option was extended to also support tree objects in
7c0fe330d5 (rev-list: handle missing tree objects properly,2018-10-05).

This patch series extends the `--missing` option to also add support for
commit objects. We do this by introducing a new flag `MISSING` which is
added whenever we encounter a missing commit during traversal. Then in
`builtin/rev-list` we check for this flag and take the appropriate
action based on the `--missing=*` option used.

This series is an alternate to the patch series I had posted earlier:
https://lore.kernel.org/git/20230908174208.249184-1-karthik.188@gmail.com/.
In that patch, we introduced an option `--ignore-missing-links` which
was added to expose the `ignore_missing_links` bit to the user. The
issue in that patch was that, the option `--ignore-missing-links` didn't
play well the pre-existing `--missing` option. This series avoids that
route and just extends the `--missing` option for commits to solve the
same problem.

V4 of the series can be found here: https://lore.kernel.org/git/20231024122631.158415-1-karthik.188@gmail.com/T/#ma7f07bc637e20e2a9558b23e8081957af61f63ce

Changes from v4:
- Rename `missing_objects` to `missing_commits` since we only deal
with commits.
- Fix incorrect indentation
- clear oidset after traversal in `release_revisions()` 

Range diff:

1:  8c469cf479 = 1:  8c469cf479 revision: rename bit to `do_not_die_on_missing_objects`
2:  76ce43d973 = 2:  76ce43d973 rev-list: move `show_commit()` to the bottom
3:  d892f0b82d ! 3:  6f0c74f888 rev-list: add commit object support in `--missing` option
    @@ Commit message
         between the main and alternate object directory.
     
         Helped-by: Patrick Steinhardt <ps@pks.im>
    +    Helped-by: Junio C Hamano <gitster@pobox.com>
         Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
     
      ## builtin/rev-list.c ##
    @@ builtin/rev-list.c: static void show_commit(struct commit *commit, void *data)
      	display_progress(progress, ++progress_counter);
      
     +	if (revs->do_not_die_on_missing_objects &&
    -+	    oidset_contains(&revs->missing_objects, &commit->object.oid)) {
    ++	    oidset_contains(&revs->missing_commits, &commit->object.oid)) {
     +		finish_object__ma(&commit->object);
     +		return;
     +	}
    @@ list-objects.c: static void do_traverse(struct traversal_context *ctx)
      		if (!ctx->revs->tree_objects)
      			; /* do not bother loading tree */
     +		else if (ctx->revs->do_not_die_on_missing_objects &&
    -+			 oidset_contains(&ctx->revs->missing_objects, &commit->object.oid))
    ++			 oidset_contains(&ctx->revs->missing_commits, &commit->object.oid))
     +			;
      		else if (repo_get_commit_tree(the_repository, commit)) {
      			struct tree *tree = repo_get_commit_tree(the_repository,
    @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *com
      	if (commit->object.flags & ADDED)
      		return 0;
     +	if (revs->do_not_die_on_missing_objects &&
    -+		oidset_contains(&revs->missing_objects, &commit->object.oid))
    ++	    oidset_contains(&revs->missing_commits, &commit->object.oid))
     +		return 0;
      	commit->object.flags |= ADDED;
      
    @@ revision.c: static int process_parents(struct rev_info *revs, struct commit *com
      			}
     -			return -1;
     +
    -+			if (!revs->do_not_die_on_missing_objects)
    -+				return -1;
    ++			if (revs->do_not_die_on_missing_objects)
    ++				oidset_insert(&revs->missing_commits, &p->object.oid);
     +			else
    -+				oidset_insert(&revs->missing_objects, &p->object.oid);
    ++				return -1; /* corrupt repository */
      		}
      		if (revs->sources) {
      			char **slot = revision_sources_at(revs->sources, p);
    +@@ revision.c: void release_revisions(struct rev_info *revs)
    + 	clear_decoration(&revs->merge_simplification, free);
    + 	clear_decoration(&revs->treesame, free);
    + 	line_log_free(revs);
    ++	oidset_clear(&revs->missing_commits);
    + }
    + 
    + static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
     @@ revision.c: int prepare_revision_walk(struct rev_info *revs)
      				       FOR_EACH_OBJECT_PROMISOR_ONLY);
      	}
      
     +	if (revs->do_not_die_on_missing_objects)
    -+		oidset_init(&revs->missing_objects, 0);
    ++		oidset_init(&revs->missing_commits, 0);
     +
      	if (!revs->reflog_info)
      		prepare_to_use_bloom_filter(revs);
    @@ revision.h: struct rev_info {
      	/* Location where temporary objects for remerge-diff are written. */
      	struct tmp_objdir *remerge_objdir;
     +
    -+	/* Missing objects to be tracked without failing traversal. */
    -+	struct oidset missing_objects;
    ++	/* Missing commits to be tracked without failing traversal. */
    ++	struct oidset missing_commits;
      };
      
      /**


Karthik Nayak (3):
  revision: rename bit to `do_not_die_on_missing_objects`
  rev-list: move `show_commit()` to the bottom
  rev-list: add commit object support in `--missing` option

 builtin/reflog.c            |  2 +-
 builtin/rev-list.c          | 93 +++++++++++++++++++------------------
 list-objects.c              |  5 +-
 revision.c                  | 17 ++++++-
 revision.h                  | 21 +++++----
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++
 6 files changed, 156 insertions(+), 56 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

-- 
2.42.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v5 1/3] revision: rename bit to `do_not_die_on_missing_objects`
  2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
@ 2023-10-26 10:11         ` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
  2 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-26 10:11 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The bit `do_not_die_on_missing_tree` is used in revision.h to ensure the
revision walker does not die when encountering a missing tree. This is
currently exclusively set within `builtin/rev-list.c` to ensure the
`--missing` option works with missing trees.

In the upcoming commits, we will extend `--missing` to also support
missing commits. So let's rename the bit to
`do_not_die_on_missing_objects`, which is object type agnostic and can
be used for both trees/commits.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/reflog.c   |  2 +-
 builtin/rev-list.c |  2 +-
 list-objects.c     |  2 +-
 revision.h         | 17 +++++++++--------
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index df63a5892e..9e369a5977 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -298,7 +298,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		struct rev_info revs;
 
 		repo_init_revisions(the_repository, &revs, prefix);
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 		revs.ignore_missing = 1;
 		revs.ignore_missing_links = 1;
 		if (verbose)
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ff715d6918..ea77489c38 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -561,7 +561,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	}
 
 	if (arg_missing_action)
-		revs.do_not_die_on_missing_tree = 1;
+		revs.do_not_die_on_missing_objects = 1;
 
 	argc = setup_revisions(argc, argv, &revs, &s_r_opt);
 
diff --git a/list-objects.c b/list-objects.c
index c25c72b32c..47296dff2f 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -177,7 +177,7 @@ static void process_tree(struct traversal_context *ctx,
 		    is_promisor_object(&obj->oid))
 			return;
 
-		if (!revs->do_not_die_on_missing_tree)
+		if (!revs->do_not_die_on_missing_objects)
 			die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/revision.h b/revision.h
index 50091bbd13..c73c92ef40 100644
--- a/revision.h
+++ b/revision.h
@@ -212,18 +212,19 @@ struct rev_info {
 
 			/*
 			 * Blobs are shown without regard for their existence.
-			 * But not so for trees: unless exclude_promisor_objects
+			 * But not so for trees/commits: unless exclude_promisor_objects
 			 * is set and the tree in question is a promisor object;
 			 * OR ignore_missing_links is set, the revision walker
-			 * dies with a "bad tree object HASH" message when
-			 * encountering a missing tree. For callers that can
-			 * handle missing trees and want them to be filterable
+			 * dies with a "bad <type> object HASH" message when
+			 * encountering a missing object. For callers that can
+			 * handle missing trees/commits and want them to be filterable
 			 * and showable, set this to true. The revision walker
-			 * will filter and show such a missing tree as usual,
-			 * but will not attempt to recurse into this tree
-			 * object.
+			 * will filter and show such a missing object as usual,
+			 * but will not attempt to recurse into this tree/commit
+			 * object. The revision walker will also set the MISSING
+			 * flag for such objects.
 			 */
-			do_not_die_on_missing_tree:1,
+			do_not_die_on_missing_objects:1,
 
 			/* for internal use only */
 			exclude_promisor_objects:1;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 2/3] rev-list: move `show_commit()` to the bottom
  2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
@ 2023-10-26 10:11         ` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
  2 siblings, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-26 10:11 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `show_commit()` function already depends on `finish_commit()`, and
in the upcoming commit, we'll also add a dependency on
`finish_object__ma()`. Since in C symbols must be declared before
they're used, let's move `show_commit()` below both `finish_commit()`
and `finish_object__ma()`, so the code is cleaner as a whole without the
need for declarations.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c | 85 +++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 43 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index ea77489c38..98542e8b3c 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -100,7 +100,48 @@ static off_t get_object_disk_usage(struct object *obj)
 	return size;
 }
 
-static void finish_commit(struct commit *commit);
+static inline void finish_object__ma(struct object *obj)
+{
+	/*
+	 * Whether or not we try to dynamically fetch missing objects
+	 * from the server, we currently DO NOT have the object.  We
+	 * can either print, allow (ignore), or conditionally allow
+	 * (ignore) them.
+	 */
+	switch (arg_missing_action) {
+	case MA_ERROR:
+		die("missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	case MA_ALLOW_ANY:
+		return;
+
+	case MA_PRINT:
+		oidset_insert(&missing_objects, &obj->oid);
+		return;
+
+	case MA_ALLOW_PROMISOR:
+		if (is_promisor_object(&obj->oid))
+			return;
+		die("unexpected missing %s object '%s'",
+		    type_name(obj->type), oid_to_hex(&obj->oid));
+		return;
+
+	default:
+		BUG("unhandled missing_action");
+		return;
+	}
+}
+
+static void finish_commit(struct commit *commit)
+{
+	free_commit_list(commit->parents);
+	commit->parents = NULL;
+	free_commit_buffer(the_repository->parsed_objects,
+			   commit);
+}
+
 static void show_commit(struct commit *commit, void *data)
 {
 	struct rev_list_info *info = data;
@@ -219,48 +260,6 @@ static void show_commit(struct commit *commit, void *data)
 	finish_commit(commit);
 }
 
-static void finish_commit(struct commit *commit)
-{
-	free_commit_list(commit->parents);
-	commit->parents = NULL;
-	free_commit_buffer(the_repository->parsed_objects,
-			   commit);
-}
-
-static inline void finish_object__ma(struct object *obj)
-{
-	/*
-	 * Whether or not we try to dynamically fetch missing objects
-	 * from the server, we currently DO NOT have the object.  We
-	 * can either print, allow (ignore), or conditionally allow
-	 * (ignore) them.
-	 */
-	switch (arg_missing_action) {
-	case MA_ERROR:
-		die("missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	case MA_ALLOW_ANY:
-		return;
-
-	case MA_PRINT:
-		oidset_insert(&missing_objects, &obj->oid);
-		return;
-
-	case MA_ALLOW_PROMISOR:
-		if (is_promisor_object(&obj->oid))
-			return;
-		die("unexpected missing %s object '%s'",
-		    type_name(obj->type), oid_to_hex(&obj->oid));
-		return;
-
-	default:
-		BUG("unhandled missing_action");
-		return;
-	}
-}
-
 static int finish_object(struct object *obj, const char *name UNUSED,
 			 void *cb_data)
 {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v5 3/3] rev-list: add commit object support in `--missing` option
  2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
  2023-10-26 10:11         ` [PATCH v5 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
@ 2023-10-26 10:11         ` Karthik Nayak
  2023-10-27  6:25           ` Patrick Steinhardt
  2 siblings, 1 reply; 61+ messages in thread
From: Karthik Nayak @ 2023-10-26 10:11 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a `missing_objects` field to
`rev_info`. This field is an `oidset` to which we'll add the missing
commits as we encounter them. The revision walker will now continue the
traversal and call `show_commit()` even for missing commits. In rev-list
we can then check if the commit is a missing commit and call the
existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Helped-by: Patrick Steinhardt <ps@pks.im>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  3 ++
 revision.c                  | 17 ++++++++-
 revision.h                  |  4 ++
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 5 files changed, 102 insertions(+), 2 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..181353dcf5 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    oidset_contains(&revs->missing_commits, &commit->object.oid)) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..f4e1104b56 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx)
 		 */
 		if (!ctx->revs->tree_objects)
 			; /* do not bother loading tree */
+		else if (ctx->revs->do_not_die_on_missing_objects &&
+			 oidset_contains(&ctx->revs->missing_commits, &commit->object.oid))
+			;
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
 								 commit);
diff --git a/revision.c b/revision.c
index 219dc76716..738bacad08 100644
--- a/revision.c
+++ b/revision.c
@@ -6,6 +6,7 @@
 #include "object-name.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "oidset.h"
 #include "tag.h"
 #include "blob.h"
 #include "tree.h"
@@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 
 	if (commit->object.flags & ADDED)
 		return 0;
+	if (revs->do_not_die_on_missing_objects &&
+	    oidset_contains(&revs->missing_commits, &commit->object.oid))
+		return 0;
 	commit->object.flags |= ADDED;
 
 	if (revs->include_check &&
@@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (revs->do_not_die_on_missing_objects)
+				oidset_insert(&revs->missing_commits, &p->object.oid);
+			else
+				return -1; /* corrupt repository */
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
@@ -3109,6 +3118,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->merge_simplification, free);
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
+	oidset_clear(&revs->missing_commits);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
@@ -3800,6 +3810,9 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}
 
+	if (revs->do_not_die_on_missing_objects)
+		oidset_init(&revs->missing_commits, 0);
+
 	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
 	if (!revs->unsorted_input)
diff --git a/revision.h b/revision.h
index c73c92ef40..94c43138bc 100644
--- a/revision.h
+++ b/revision.h
@@ -4,6 +4,7 @@
 #include "commit.h"
 #include "grep.h"
 #include "notes.h"
+#include "oidset.h"
 #include "pretty.h"
 #include "diff.h"
 #include "commit-slab-decl.h"
@@ -373,6 +374,9 @@ struct rev_info {
 
 	/* Location where temporary objects for remerge-diff are written. */
 	struct tmp_objdir *remerge_objdir;
+
+	/* Missing commits to be tracked without failing traversal. */
+	struct oidset missing_commits;
 };
 
 /**
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v4 3/3] rev-list: add commit object support in `--missing` option
  2023-10-25  6:40         ` Patrick Steinhardt
@ 2023-10-26 12:37           ` Junio C Hamano
  0 siblings, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2023-10-26 12:37 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Karthik Nayak, git

Patrick Steinhardt <ps@pks.im> writes:

>> +	if (revs->do_not_die_on_missing_objects)
>> +		oidset_init(&revs->missing_objects, 0);
>> +
>
> While we're initializing the new oidset, we never clear it. We should
> probably call `oidset_clear()` in `release_revisions()`. And if we
> unconditionally initialized the oidset here then we can also call
> `oiadset_clear()` unconditionally there, which should be preferable
> given that `oidset_init()` does not allocate memory when no initial size
> was given.

Yup, I used the conditional one to match the above, but initializing
unused oidset is cheap and frees us from having to worry about
mistakes.  I like your idea much better.

>> +
>> +	/* Missing objects to be tracked without failing traversal. */
>> +	struct oidset missing_objects;
>
> As far as I can see we only use this set to track missing commits, but
> none of the other objects. The name thus feels a bit misleading to me,
> as a reader might rightfully assume that it contains _all_ missing
> objects after the revwalk. Should we rename it to `missing_commits` to
> clarify?

Again, very good suggestion.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v5 3/3] rev-list: add commit object support in `--missing` option
  2023-10-26 10:11         ` [PATCH v5 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
@ 2023-10-27  6:25           ` Patrick Steinhardt
  2023-10-27  7:54             ` Karthik Nayak
  2023-10-27  7:59             ` Karthik Nayak
  0 siblings, 2 replies; 61+ messages in thread
From: Patrick Steinhardt @ 2023-10-27  6:25 UTC (permalink / raw)
  To: Karthik Nayak; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 8377 bytes --]

On Thu, Oct 26, 2023 at 12:11:09PM +0200, Karthik Nayak wrote:
> The `--missing` object option in rev-list currently works only with
> missing blobs/trees. For missing commits the revision walker fails with
> a fatal error.
> 
> Let's extend the functionality of `--missing` option to also support
> commit objects. This is done by adding a `missing_objects` field to
> `rev_info`. This field is an `oidset` to which we'll add the missing
> commits as we encounter them. The revision walker will now continue the
> traversal and call `show_commit()` even for missing commits. In rev-list
> we can then check if the commit is a missing commit and call the
> existing code for parsing `--missing` objects.
> 
> A scenario where this option would be used is to find the boundary
> objects between different object directories. Consider a repository with
> a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
> object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
> repository, using the `--missing=print` option while disabling the
> alternate object directory allows us to find the boundary objects
> between the main and alternate object directory.
> 
> Helped-by: Patrick Steinhardt <ps@pks.im>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
> ---
>  builtin/rev-list.c          |  6 +++
>  list-objects.c              |  3 ++
>  revision.c                  | 17 ++++++++-
>  revision.h                  |  4 ++
>  t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 102 insertions(+), 2 deletions(-)
>  create mode 100755 t/t6022-rev-list-missing.sh
> 
> diff --git a/builtin/rev-list.c b/builtin/rev-list.c
> index 98542e8b3c..181353dcf5 100644
> --- a/builtin/rev-list.c
> +++ b/builtin/rev-list.c
> @@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
>  
>  	display_progress(progress, ++progress_counter);
>  
> +	if (revs->do_not_die_on_missing_objects &&
> +	    oidset_contains(&revs->missing_commits, &commit->object.oid)) {
> +		finish_object__ma(&commit->object);
> +		return;
> +	}
> +
>  	if (show_disk_usage)
>  		total_disk_usage += get_object_disk_usage(&commit->object);
>  
> diff --git a/list-objects.c b/list-objects.c
> index 47296dff2f..f4e1104b56 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx)
>  		 */
>  		if (!ctx->revs->tree_objects)
>  			; /* do not bother loading tree */
> +		else if (ctx->revs->do_not_die_on_missing_objects &&
> +			 oidset_contains(&ctx->revs->missing_commits, &commit->object.oid))
> +			;
>  		else if (repo_get_commit_tree(the_repository, commit)) {
>  			struct tree *tree = repo_get_commit_tree(the_repository,
>  								 commit);
> diff --git a/revision.c b/revision.c
> index 219dc76716..738bacad08 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -6,6 +6,7 @@
>  #include "object-name.h"
>  #include "object-file.h"
>  #include "object-store-ll.h"
> +#include "oidset.h"
>  #include "tag.h"
>  #include "blob.h"
>  #include "tree.h"
> @@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  
>  	if (commit->object.flags & ADDED)
>  		return 0;
> +	if (revs->do_not_die_on_missing_objects &&
> +	    oidset_contains(&revs->missing_commits, &commit->object.oid))
> +		return 0;
>  	commit->object.flags |= ADDED;
>  
>  	if (revs->include_check &&
> @@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  	for (parent = commit->parents; parent; parent = parent->next) {
>  		struct commit *p = parent->item;
>  		int gently = revs->ignore_missing_links ||
> -			     revs->exclude_promisor_objects;
> +			     revs->exclude_promisor_objects ||
> +			     revs->do_not_die_on_missing_objects;
>  		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
>  			if (revs->exclude_promisor_objects &&
>  			    is_promisor_object(&p->object.oid)) {
> @@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
>  					break;
>  				continue;
>  			}
> -			return -1;
> +
> +			if (revs->do_not_die_on_missing_objects)
> +				oidset_insert(&revs->missing_commits, &p->object.oid);
> +			else
> +				return -1; /* corrupt repository */
>  		}
>  		if (revs->sources) {
>  			char **slot = revision_sources_at(revs->sources, p);
> @@ -3109,6 +3118,7 @@ void release_revisions(struct rev_info *revs)
>  	clear_decoration(&revs->merge_simplification, free);
>  	clear_decoration(&revs->treesame, free);
>  	line_log_free(revs);
> +	oidset_clear(&revs->missing_commits);
>  }
>  
>  static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
> @@ -3800,6 +3810,9 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->do_not_die_on_missing_objects)
> +		oidset_init(&revs->missing_commits, 0);
> +

We unconditionally clear the oidset now, so shouldn't we also
unconditionally initialize it?

Patrick

>  	if (!revs->reflog_info)
>  		prepare_to_use_bloom_filter(revs);
>  	if (!revs->unsorted_input)
> diff --git a/revision.h b/revision.h
> index c73c92ef40..94c43138bc 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
>  #include "commit.h"
>  #include "grep.h"
>  #include "notes.h"
> +#include "oidset.h"
>  #include "pretty.h"
>  #include "diff.h"
>  #include "commit-slab-decl.h"
> @@ -373,6 +374,9 @@ struct rev_info {
>  
>  	/* Location where temporary objects for remerge-diff are written. */
>  	struct tmp_objdir *remerge_objdir;
> +
> +	/* Missing commits to be tracked without failing traversal. */
> +	struct oidset missing_commits;
>  };
>  
>  /**
> diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
> new file mode 100755
> index 0000000000..40265a4f66
> --- /dev/null
> +++ b/t/t6022-rev-list-missing.sh
> @@ -0,0 +1,74 @@
> +#!/bin/sh
> +
> +test_description='handling of missing objects in rev-list'
> +
> +TEST_PASSES_SANITIZE_LEAK=true
> +. ./test-lib.sh
> +
> +# We setup the repository with two commits, this way HEAD is always
> +# available and we can hide commit 1.
> +test_expect_success 'create repository and alternate directory' '
> +	test_commit 1 &&
> +	test_commit 2 &&
> +	test_commit 3
> +'
> +
> +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
> +do
> +	test_expect_success "rev-list --missing=error fails with missing object $obj" '
> +		oid="$(git rev-parse $obj)" &&
> +		path=".git/objects/$(test_oid_to_path $oid)" &&
> +
> +		mv "$path" "$path.hidden" &&
> +		test_when_finished "mv $path.hidden $path" &&
> +
> +		test_must_fail git rev-list --missing=error --objects \
> +			--no-object-names HEAD
> +	'
> +done
> +
> +for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
> +do
> +	for action in "allow-any" "print"
> +	do
> +		test_expect_success "rev-list --missing=$action with missing $obj" '
> +			oid="$(git rev-parse $obj)" &&
> +			path=".git/objects/$(test_oid_to_path $oid)" &&
> +
> +			# Before the object is made missing, we use rev-list to
> +			# get the expected oids.
> +			git rev-list --objects --no-object-names \
> +				HEAD ^$obj >expect.raw &&
> +
> +			# Blobs are shared by all commits, so evethough a commit/tree
> +			# might be skipped, its blob must be accounted for.
> +			if [ $obj != "HEAD:1.t" ]; then
> +				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
> +				echo $(git rev-parse HEAD:2.t) >>expect.raw
> +			fi &&
> +
> +			mv "$path" "$path.hidden" &&
> +			test_when_finished "mv $path.hidden $path" &&
> +
> +			git rev-list --missing=$action --objects --no-object-names \
> +				HEAD >actual.raw &&
> +
> +			# When the action is to print, we should also add the missing
> +			# oid to the expect list.
> +			case $action in
> +			allow-any)
> +				;;
> +			print)
> +				grep ?$oid actual.raw &&
> +				echo ?$oid >>expect.raw
> +				;;
> +			esac &&
> +
> +			sort actual.raw >actual &&
> +			sort expect.raw >expect &&
> +			test_cmp expect actual
> +		'
> +	done
> +done
> +
> +test_done
> -- 
> 2.42.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v5 3/3] rev-list: add commit object support in `--missing` option
  2023-10-27  6:25           ` Patrick Steinhardt
@ 2023-10-27  7:54             ` Karthik Nayak
  2023-10-27  7:59             ` Karthik Nayak
  1 sibling, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-27  7:54 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, gitster

On Fri, Oct 27, 2023 at 8:25 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> We unconditionally clear the oidset now, so shouldn't we also
> unconditionally initialize it?
>

You're right. I should have added that in.

Junio, I'm replying to this email with the amended commit for v5. If
you want me to reroll
v6, I could do that too. Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v5 3/3] rev-list: add commit object support in `--missing` option
  2023-10-27  6:25           ` Patrick Steinhardt
  2023-10-27  7:54             ` Karthik Nayak
@ 2023-10-27  7:59             ` Karthik Nayak
  1 sibling, 0 replies; 61+ messages in thread
From: Karthik Nayak @ 2023-10-27  7:59 UTC (permalink / raw)
  To: karthik.188; +Cc: git, gitster, ps

The `--missing` object option in rev-list currently works only with
missing blobs/trees. For missing commits the revision walker fails with
a fatal error.

Let's extend the functionality of `--missing` option to also support
commit objects. This is done by adding a `missing_objects` field to
`rev_info`. This field is an `oidset` to which we'll add the missing
commits as we encounter them. The revision walker will now continue the
traversal and call `show_commit()` even for missing commits. In rev-list
we can then check if the commit is a missing commit and call the
existing code for parsing `--missing` objects.

A scenario where this option would be used is to find the boundary
objects between different object directories. Consider a repository with
a main object directory (GIT_OBJECT_DIRECTORY) and one or more alternate
object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES). In such a
repository, using the `--missing=print` option while disabling the
alternate object directory allows us to find the boundary objects
between the main and alternate object directory.

Helped-by: Patrick Steinhardt <ps@pks.im>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 builtin/rev-list.c          |  6 +++
 list-objects.c              |  3 ++
 revision.c                  | 16 +++++++-
 revision.h                  |  4 ++
 t/t6022-rev-list-missing.sh | 74 +++++++++++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100755 t/t6022-rev-list-missing.sh

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 98542e8b3c..181353dcf5 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -149,6 +149,12 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (revs->do_not_die_on_missing_objects &&
+	    oidset_contains(&revs->missing_commits, &commit->object.oid)) {
+		finish_object__ma(&commit->object);
+		return;
+	}
+
 	if (show_disk_usage)
 		total_disk_usage += get_object_disk_usage(&commit->object);
 
diff --git a/list-objects.c b/list-objects.c
index 47296dff2f..f4e1104b56 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -389,6 +389,9 @@ static void do_traverse(struct traversal_context *ctx)
 		 */
 		if (!ctx->revs->tree_objects)
 			; /* do not bother loading tree */
+		else if (ctx->revs->do_not_die_on_missing_objects &&
+			 oidset_contains(&ctx->revs->missing_commits, &commit->object.oid))
+			;
 		else if (repo_get_commit_tree(the_repository, commit)) {
 			struct tree *tree = repo_get_commit_tree(the_repository,
 								 commit);
diff --git a/revision.c b/revision.c
index 219dc76716..00d5c29bfc 100644
--- a/revision.c
+++ b/revision.c
@@ -6,6 +6,7 @@
 #include "object-name.h"
 #include "object-file.h"
 #include "object-store-ll.h"
+#include "oidset.h"
 #include "tag.h"
 #include "blob.h"
 #include "tree.h"
@@ -1112,6 +1113,9 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 
 	if (commit->object.flags & ADDED)
 		return 0;
+	if (revs->do_not_die_on_missing_objects &&
+	    oidset_contains(&revs->missing_commits, &commit->object.oid))
+		return 0;
 	commit->object.flags |= ADDED;
 
 	if (revs->include_check &&
@@ -1168,7 +1172,8 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
 		int gently = revs->ignore_missing_links ||
-			     revs->exclude_promisor_objects;
+			     revs->exclude_promisor_objects ||
+			     revs->do_not_die_on_missing_objects;
 		if (repo_parse_commit_gently(revs->repo, p, gently) < 0) {
 			if (revs->exclude_promisor_objects &&
 			    is_promisor_object(&p->object.oid)) {
@@ -1176,7 +1181,11 @@ static int process_parents(struct rev_info *revs, struct commit *commit,
 					break;
 				continue;
 			}
-			return -1;
+
+			if (revs->do_not_die_on_missing_objects)
+				oidset_insert(&revs->missing_commits, &p->object.oid);
+			else
+				return -1; /* corrupt repository */
 		}
 		if (revs->sources) {
 			char **slot = revision_sources_at(revs->sources, p);
@@ -3109,6 +3118,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->merge_simplification, free);
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
+	oidset_clear(&revs->missing_commits);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
@@ -3800,6 +3810,8 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}
 
+	oidset_init(&revs->missing_commits, 0);
+
 	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
 	if (!revs->unsorted_input)
diff --git a/revision.h b/revision.h
index c73c92ef40..94c43138bc 100644
--- a/revision.h
+++ b/revision.h
@@ -4,6 +4,7 @@
 #include "commit.h"
 #include "grep.h"
 #include "notes.h"
+#include "oidset.h"
 #include "pretty.h"
 #include "diff.h"
 #include "commit-slab-decl.h"
@@ -373,6 +374,9 @@ struct rev_info {
 
 	/* Location where temporary objects for remerge-diff are written. */
 	struct tmp_objdir *remerge_objdir;
+
+	/* Missing commits to be tracked without failing traversal. */
+	struct oidset missing_commits;
 };
 
 /**
diff --git a/t/t6022-rev-list-missing.sh b/t/t6022-rev-list-missing.sh
new file mode 100755
index 0000000000..40265a4f66
--- /dev/null
+++ b/t/t6022-rev-list-missing.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='handling of missing objects in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We setup the repository with two commits, this way HEAD is always
+# available and we can hide commit 1.
+test_expect_success 'create repository and alternate directory' '
+	test_commit 1 &&
+	test_commit 2 &&
+	test_commit 3
+'
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	test_expect_success "rev-list --missing=error fails with missing object $obj" '
+		oid="$(git rev-parse $obj)" &&
+		path=".git/objects/$(test_oid_to_path $oid)" &&
+
+		mv "$path" "$path.hidden" &&
+		test_when_finished "mv $path.hidden $path" &&
+
+		test_must_fail git rev-list --missing=error --objects \
+			--no-object-names HEAD
+	'
+done
+
+for obj in "HEAD~1" "HEAD~1^{tree}" "HEAD:1.t"
+do
+	for action in "allow-any" "print"
+	do
+		test_expect_success "rev-list --missing=$action with missing $obj" '
+			oid="$(git rev-parse $obj)" &&
+			path=".git/objects/$(test_oid_to_path $oid)" &&
+
+			# Before the object is made missing, we use rev-list to
+			# get the expected oids.
+			git rev-list --objects --no-object-names \
+				HEAD ^$obj >expect.raw &&
+
+			# Blobs are shared by all commits, so evethough a commit/tree
+			# might be skipped, its blob must be accounted for.
+			if [ $obj != "HEAD:1.t" ]; then
+				echo $(git rev-parse HEAD:1.t) >>expect.raw &&
+				echo $(git rev-parse HEAD:2.t) >>expect.raw
+			fi &&
+
+			mv "$path" "$path.hidden" &&
+			test_when_finished "mv $path.hidden $path" &&
+
+			git rev-list --missing=$action --objects --no-object-names \
+				HEAD >actual.raw &&
+
+			# When the action is to print, we should also add the missing
+			# oid to the expect list.
+			case $action in
+			allow-any)
+				;;
+			print)
+				grep ?$oid actual.raw &&
+				echo ?$oid >>expect.raw
+				;;
+			esac &&
+
+			sort actual.raw >actual &&
+			sort expect.raw >expect &&
+			test_cmp expect actual
+		'
+	done
+done
+
+test_done
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2023-10-27  7:59 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-09 10:55 [PATCH 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
2023-10-09 10:55 ` [PATCH 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
2023-10-09 10:55 ` [PATCH 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
2023-10-09 10:55 ` [PATCH 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
2023-10-09 22:02 ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
2023-10-10  6:19 ` Patrick Steinhardt
2023-10-10 17:09   ` Junio C Hamano
2023-10-11 10:37     ` Karthik Nayak
2023-10-11 16:54       ` Junio C Hamano
2023-10-12 10:44         ` Karthik Nayak
2023-10-12 11:04           ` Patrick Steinhardt
2023-10-12 13:23             ` Karthik Nayak
2023-10-12 16:17             ` Junio C Hamano
2023-10-13  5:53               ` Patrick Steinhardt
2023-10-13  8:38                 ` Patrick Steinhardt
2023-10-13 12:37                   ` [PATCH] commit: detect commits that exist in commit-graph but not in the ODB Patrick Steinhardt
2023-10-13 18:21                     ` Junio C Hamano
2023-10-17  6:37                       ` Patrick Steinhardt
2023-10-17 18:34                         ` Junio C Hamano
2023-10-19  6:45                           ` Patrick Steinhardt
2023-10-19  8:25                             ` Patrick Steinhardt
2023-10-19 17:16                               ` Junio C Hamano
2023-10-20 10:00                                 ` Jeff King
2023-10-20 17:35                                   ` Junio C Hamano
2023-10-23 10:15                                   ` Patrick Steinhardt
2023-10-13 17:07                   ` [PATCH 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
2023-10-12 16:26           ` Junio C Hamano
2023-10-16 10:38 ` [PATCH v2 " Karthik Nayak
2023-10-16 10:38   ` [PATCH v2 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
2023-10-16 10:38   ` [PATCH v2 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
2023-10-16 10:38   ` [PATCH v2 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
2023-10-16 16:24   ` [PATCH v2 0/3] rev-list: add support for commits in `--missing` Junio C Hamano
2023-10-16 19:01     ` Karthik Nayak
2023-10-16 20:33       ` Junio C Hamano
2023-10-19 12:10   ` [PATCH v3 " Karthik Nayak
2023-10-19 12:10     ` [PATCH v3 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
2023-10-19 12:10     ` [PATCH v3 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
2023-10-19 12:10     ` [PATCH v3 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
2023-10-19 22:05       ` Junio C Hamano
2023-10-19 23:35         ` Junio C Hamano
2023-10-20 11:14           ` Karthik Nayak
2023-10-20 14:47             ` Karthik Nayak
2023-10-20 17:45               ` Junio C Hamano
2023-10-20 16:41           ` Junio C Hamano
2023-10-24 11:34             ` Karthik Nayak
2023-10-24 12:26     ` [PATCH v4 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
2023-10-24 12:26       ` [PATCH v4 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
2023-10-24 12:26       ` [PATCH v4 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
2023-10-24 12:26       ` [PATCH v4 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
2023-10-24 17:45         ` Junio C Hamano
2023-10-25  0:35           ` Junio C Hamano
2023-10-25  9:34           ` Karthik Nayak
2023-10-25  6:40         ` Patrick Steinhardt
2023-10-26 12:37           ` Junio C Hamano
2023-10-26 10:11       ` [PATCH v5 0/3] rev-list: add support for commits in `--missing` Karthik Nayak
2023-10-26 10:11         ` [PATCH v5 1/3] revision: rename bit to `do_not_die_on_missing_objects` Karthik Nayak
2023-10-26 10:11         ` [PATCH v5 2/3] rev-list: move `show_commit()` to the bottom Karthik Nayak
2023-10-26 10:11         ` [PATCH v5 3/3] rev-list: add commit object support in `--missing` option Karthik Nayak
2023-10-27  6:25           ` Patrick Steinhardt
2023-10-27  7:54             ` Karthik Nayak
2023-10-27  7:59             ` Karthik Nayak

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.