All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] Some patches for fsck for missing objects
@ 2017-07-26 23:29 Jonathan Tan
  2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
                   ` (10 more replies)
  0 siblings, 11 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:29 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, christian.couder

This is an updated version of my previous patch [1], but without the
list of promises. It is somewhat different now because fsck cannot just
mark all promised objects as HAS_OBJ.

This could be adapted and incorporated into patch sets that have objects
missing from the local repo.

I split this up into 4 patches so that you can see how the changes in
fsck are being tested, but eventually these should probably be combined
into 1 patch.

[1] https://public-inbox.org/git/3420d9ae9ef86b78af1abe721891233e3f5865a2.1500508695.git.jonathantanmy@google.com/

Jonathan Tan (4):
  environment, fsck: introduce lazyobject extension
  fsck: support refs pointing to lazy objects
  fsck: support referenced lazy objects
  fsck: support lazy objects as CLI argument

 Documentation/technical/repository-version.txt |  7 ++
 builtin/fsck.c                                 | 23 ++++++-
 cache.h                                        |  2 +
 environment.c                                  |  1 +
 setup.c                                        |  7 +-
 t/t0410-lazy-object.sh                         | 95 ++++++++++++++++++++++++++
 6 files changed, 133 insertions(+), 2 deletions(-)
 create mode 100755 t/t0410-lazy-object.sh

-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
@ 2017-07-26 23:29 ` Jonathan Tan
  2017-07-27 18:55   ` Junio C Hamano
  2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:29 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, christian.couder

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.

Introduce a new repository extension option "extensions.lazyobject", of
data type string. If this is set in a repository, Git will tolerate
objects being missing in that repository.  When Git needs those objects,
it will invoke the command in that option.

Teach fsck about the new state of affairs. In this commit, teach fsck
that missing objects referenced from the reflog are not an error case;
in future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt |  7 ++++++
 builtin/fsck.c                                 |  2 +-
 cache.h                                        |  2 ++
 environment.c                                  |  1 +
 setup.c                                        |  7 +++++-
 t/t0410-lazy-object.sh                         | 32 ++++++++++++++++++++++++++
 6 files changed, 49 insertions(+), 2 deletions(-)
 create mode 100755 t/t0410-lazy-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..39e362543 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,10 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`lazyObject`
+~~~~~~~~~~~~~~~~~
+
+When the config key `extensions.lazyObject` is set to a string, Git
+tolerates objects being missing in the repository. This string contains
+the command to be run whenever a missing object is needed.
diff --git a/builtin/fsck.c b/builtin/fsck.c
index fb0947009..1cfb8d98c 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
 			obj->flags |= USED;
 			mark_object_reachable(obj);
-		} else {
+		} else if (!repository_format_lazy_object) {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
 			errors_found |= ERROR_REACHABLE;
 		}
diff --git a/cache.h b/cache.h
index 6c8242340..9e6bc0a21 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_lazy_object;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *lazy_object;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
diff --git a/environment.c b/environment.c
index 3fd4b1084..cd8ef2897 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_lazy_object;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/setup.c b/setup.c
index 860507e1f..94cfde3cc 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "lazyobject")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->lazy_object = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_lazy_object = candidate.lazy_object;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
new file mode 100755
index 000000000..36442531f
--- /dev/null
+++ b/t/t0410-lazy-object.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+
+test_description='lazy object'
+
+. ./test-lib.sh
+
+delete_object () {
+	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
+}
+
+test_expect_success 'fsck fails on lazy object in reflog' '
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch mybranch "$A" &&
+	git -C repo branch -f mybranch "$B" &&
+	delete_object repo "$A" &&
+
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_done
-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 2/4] fsck: support refs pointing to lazy objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
  2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
@ 2017-07-26 23:30 ` Jonathan Tan
  2017-07-27 18:59   ` Junio C Hamano
  2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, christian.couder

Teach fsck to not treat refs with missing targets as an error when
extensions.lazyobject is set.

For the purposes of warning about no default refs, such refs are still
treated as legitimate refs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         |  8 ++++++++
 t/t0410-lazy-object.sh | 20 ++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 1cfb8d98c..e29ff760b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 
 	obj = parse_object(oid);
 	if (!obj) {
+		if (repository_format_lazy_object) {
+			/*
+			 * Increment default_refs anyway, because this is a
+			 * valid ref.
+			 */
+			default_refs++;
+			return 0;
+		}
 		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
 		errors_found |= ERROR_REACHABLE;
 		/* We'll continue with the rest despite the error.. */
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 36442531f..00e1b4a88 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -29,4 +29,24 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object pointed to by ref' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+
+	# Reference $A only from ref, and delete it
+	git -C repo branch mybranch "$A" &&
+	delete_object repo "$A" &&
+
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 3/4] fsck: support referenced lazy objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
  2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
  2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan
@ 2017-07-26 23:30 ` Jonathan Tan
  2017-07-27 19:17   ` Junio C Hamano
  2017-07-29 16:04   ` Junio C Hamano
  2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, christian.couder

Teach fsck to not treat missing objects indirectly pointed to by refs as
an error when extensions.lazyobject is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         | 11 +++++++++++
 t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index e29ff760b..238532cc2 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
 		return 0;
 	obj->flags |= REACHABLE;
 	if (!(obj->flags & HAS_OBJ)) {
+		if (repository_format_lazy_object)
+			/*
+			 * Return immediately; this is not an error, and further
+			 * recursion does not need to be performed on this
+			 * object since it is missing (so it does not need to be
+			 * added to "pending").
+			 */
+			return 0;
+
 		if (parent && !has_object_file(&obj->oid)) {
 			printf("broken link from %7s %s\n",
 				 printable_type(parent), describe_object(parent));
@@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj)
 	 * do a full fsck
 	 */
 	if (!(obj->flags & HAS_OBJ)) {
+		if (repository_format_lazy_object)
+			return;
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
 		printf("missing %s %s\n", printable_type(obj),
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 00e1b4a88..45f665a15 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -49,4 +49,31 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object indirectly pointed to by ref' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+	test_commit -C repo 2 &&
+	test_commit -C repo 3 &&
+	git -C repo tag -a annotated_tag -m "annotated tag" &&
+
+	C=$(git -C repo rev-parse 1) &&
+	T=$(git -C repo rev-parse 2^{tree}) &&
+	B=$(git hash-object repo/3.t) &&
+	AT=$(git -C repo rev-parse annotated_tag) &&
+
+	# missing commit, tree, blob, and tag
+	delete_object repo "$C" &&
+	delete_object repo "$T" &&
+	delete_object repo "$B" &&
+	delete_object repo "$AT" &&
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 4/4] fsck: support lazy objects as CLI argument
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (2 preceding siblings ...)
  2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan
@ 2017-07-26 23:30 ` Jonathan Tan
  2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, christian.couder

Teach fsck to not treat missing objects provided on the CLI as an error
when extensions.lazyobject is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         |  2 ++
 t/t0410-lazy-object.sh | 16 ++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 238532cc2..592e64172 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -755,6 +755,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			struct object *obj = lookup_object(oid.hash);
 
 			if (!obj || !(obj->flags & HAS_OBJ)) {
+				if (repository_format_lazy_object)
+					continue;
 				error("%s: object missing", oid_to_hex(&oid));
 				errors_found |= ERROR_OBJECT;
 				continue;
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 45f665a15..3ac61c1c5 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -76,4 +76,20 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object directly given in command-line' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+	HASH=$(git hash-object repo/1.t) &&
+	delete_object repo "$HASH" &&
+
+	test_must_fail git -C repo fsck "$HASH"
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck "$HASH"
+'
+
 test_done
-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan
@ 2017-07-26 23:42 ` brian m. carlson
  2017-07-27  0:24   ` Stefan Beller
  2017-07-27 17:25   ` Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: brian m. carlson @ 2017-07-26 23:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

[-- Attachment #1: Type: text/plain, Size: 1234 bytes --]

On Wed, Jul 26, 2017 at 04:29:58PM -0700, Jonathan Tan wrote:
> This is an updated version of my previous patch [1], but without the
> list of promises. It is somewhat different now because fsck cannot just
> mark all promised objects as HAS_OBJ.
> 
> This could be adapted and incorporated into patch sets that have objects
> missing from the local repo.
> 
> I split this up into 4 patches so that you can see how the changes in
> fsck are being tested, but eventually these should probably be combined
> into 1 patch.

I looked at this and I like the direction it's going.  It's pretty
simple and straightforward, which I also like.

What I'd recommend is that instead of making lazyObject a string, we
make it an integer representing a protocol version.  We then add a
different config setting that is the actual program to invoke, using the
given protocol version.  This lets us change the way we speak to the
tool without breaking backwards compatibility, and it also allows us to
simply check the lazyObject script for supported protocols up front.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects
  2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson
@ 2017-07-27  0:24   ` Stefan Beller
  2017-07-27 17:25   ` Jonathan Tan
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Beller @ 2017-07-27  0:24 UTC (permalink / raw)
  To: brian m. carlson, Jonathan Tan, git, Ben Peart, Christian Couder

On Wed, Jul 26, 2017 at 4:42 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Wed, Jul 26, 2017 at 04:29:58PM -0700, Jonathan Tan wrote:
>> This is an updated version of my previous patch [1], but without the
>> list of promises. It is somewhat different now because fsck cannot just
>> mark all promised objects as HAS_OBJ.
>>
>> This could be adapted and incorporated into patch sets that have objects
>> missing from the local repo.
>>
>> I split this up into 4 patches so that you can see how the changes in
>> fsck are being tested, but eventually these should probably be combined
>> into 1 patch.
>
> I looked at this and I like the direction it's going.  It's pretty
> simple and straightforward, which I also like.

I agree.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects
  2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson
  2017-07-27  0:24   ` Stefan Beller
@ 2017-07-27 17:25   ` Jonathan Tan
  2017-07-28 13:40     ` Ben Peart
  1 sibling, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-27 17:25 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, peartben, christian.couder

On Wed, 26 Jul 2017 23:42:38 +0000
"brian m. carlson" <sandals@crustytoothpaste.net> wrote:

> I looked at this and I like the direction it's going.  It's pretty
> simple and straightforward, which I also like.

Thanks.

> What I'd recommend is that instead of making lazyObject a string, we
> make it an integer representing a protocol version.  We then add a
> different config setting that is the actual program to invoke, using the
> given protocol version.  This lets us change the way we speak to the
> tool without breaking backwards compatibility, and it also allows us to
> simply check the lazyObject script for supported protocols up front.

That's possible too. As for version negotiation, I think we'll end up
using a protocol similar to the clean/smudge long-running process
protocol (as documented as gitattributes), so that does not need to be
taken care of here, but making lazyObject be the version integer is fine
too.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension
  2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
@ 2017-07-27 18:55   ` Junio C Hamano
  2017-07-28 13:20     ` Ben Peart
  2017-07-28 23:50     ` Jonathan Tan
  0 siblings, 2 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-07-27 18:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> Introduce a new repository extension option "extensions.lazyobject", of
> data type string. If this is set in a repository, Git will tolerate
> objects being missing in that repository.  When Git needs those objects,
> it will invoke the command in that option.

My reading hiccupped after the first sentence, as the problem
description made it sound like this was a boolean ("are we using
lazy object feature?"), after reading "data type string".  And then
"the command in that option" made me hiccup one more time, as it did
not "click" that "in that option" was trying to say that the string
is used as the command name (or is it a whole command line?  The
leading part of the command line to which some arguments are
appended before it gets invoked as a command? or what?).

Logically, I think it is more like

 - extensions.lazyobject can be set to tell Git to consider missing
   objects in certain cases are not errors;

 - the value of extensions.lazyobject variable must be a string,
   which is used to name the command to lazily make the object
   "appear" in the repository on demand.

> Teach fsck about the new state of affairs. In this commit, teach fsck
> that missing objects referenced from the reflog are not an error case;
> in future commits, fsck will be taught about other cases.

In any case, sounds like a small and good first step.

> +
> +`lazyObject`
> +~~~~~~~~~~~~~~~~~
> +
> +When the config key `extensions.lazyObject` is set to a string, Git
> +tolerates objects being missing in the repository. This string contains
> +the command to be run whenever a missing object is needed.

This has the same issue but to a lessor degree as there is "string
contains".  What the command will do (e.g. "makes the object
magically appear in the object store" or "emits the contents of the
object to its standard output, so that Git can hash it to make it
appear in the object store"), and how it is used (e.g. "this is a
single command name and nothing else", "this is a leading part of
command line and arguments are appended at the end, before the whole
thing is passed to the shell to be executed", etc.) will need to be
clarified in the final version of the series (not necessarily at
this step---the series can elaborate in the later patches).

> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index fb0947009..1cfb8d98c 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>  					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>  			obj->flags |= USED;
>  			mark_object_reachable(obj);
> -		} else {
> +		} else if (!repository_format_lazy_object) {
>  			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>  			errors_found |= ERROR_REACHABLE;
>  		}
> diff --git a/cache.h b/cache.h
> index 6c8242340..9e6bc0a21 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -853,10 +853,12 @@ extern int grafts_replace_parents;
>  #define GIT_REPO_VERSION 0
>  #define GIT_REPO_VERSION_READ 1
>  extern int repository_format_precious_objects;
> +extern char *repository_format_lazy_object;

This is not a new problem, but I think these two should be
called repository_extension_$NAME not repository_format_$NAME.

> diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
> new file mode 100755
> index 000000000..36442531f
> --- /dev/null
> +++ b/t/t0410-lazy-object.sh
> @@ -0,0 +1,32 @@
> +#!/bin/sh
> +
> +test_description='lazy object'
> +
> +. ./test-lib.sh
> +
> +delete_object () {
> +	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
> +}
> +
> +test_expect_success 'fsck fails on lazy object in reflog' '
> +	test_create_repo repo &&
> +	test_commit -C repo 1 &&
> +
> +	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
> +	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
> +
> +	# Reference $A only from reflog, and delete it
> +	git -C repo branch mybranch "$A" &&
> +	git -C repo branch -f mybranch "$B" &&
> +	delete_object repo "$A" &&
> +
> +	test_must_fail git -C repo fsck
> +'
> +test_expect_success '...but succeeds if lazyobject is set' '
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.lazyobject "arbitrary string" &&
> +	git -C repo fsck
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects
  2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan
@ 2017-07-27 18:59   ` Junio C Hamano
  2017-07-27 23:50     ` Jonathan Tan
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2017-07-27 18:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Teach fsck to not treat refs with missing targets as an error when
> extensions.lazyobject is set.
>
> For the purposes of warning about no default refs, such refs are still
> treated as legitimate refs.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/fsck.c         |  8 ++++++++
>  t/t0410-lazy-object.sh | 20 ++++++++++++++++++++
>  2 files changed, 28 insertions(+)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 1cfb8d98c..e29ff760b 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
>  
>  	obj = parse_object(oid);
>  	if (!obj) {
> +		if (repository_format_lazy_object) {
> +			/*
> +			 * Increment default_refs anyway, because this is a
> +			 * valid ref.
> +			 */
> +			default_refs++;
> +			return 0;
> +		}
>  		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
>  		errors_found |= ERROR_REACHABLE;

At this point, do we know (or can we tell) if this is a missing
object or a file exists as a loose object but is corrupt?  If we
could, it would be nice to do this only for the former to avoid
sweeping a real corruption that is unrelated to the lazy fetch under
the rug.

> +test_expect_success 'fsck fails on lazy object pointed to by ref' '
> +	rm -rf repo &&
> +	test_create_repo repo &&
> +	test_commit -C repo 1 &&
> +
> +	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
> +
> +	# Reference $A only from ref, and delete it
> +	git -C repo branch mybranch "$A" &&
> +	delete_object repo "$A" &&
> +
> +	test_must_fail git -C repo fsck
> +'

And a new test that uses a helper different from delete_object
(perhaps call it corrupt_object?) can be used to make sure that we
complain in that case here.

> +test_expect_success '...but succeeds if lazyobject is set' '
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.lazyobject "arbitrary string" &&
> +	git -C repo fsck
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects
  2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan
@ 2017-07-27 19:17   ` Junio C Hamano
  2017-07-27 23:50     ` Jonathan Tan
  2017-07-29 16:04   ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2017-07-27 19:17 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Teach fsck to not treat missing objects indirectly pointed to by refs as
> an error when extensions.lazyobject is set.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/fsck.c         | 11 +++++++++++
>  t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index e29ff760b..238532cc2 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
>  		return 0;
>  	obj->flags |= REACHABLE;
>  	if (!(obj->flags & HAS_OBJ)) {
> +		if (repository_format_lazy_object)
> +			/*
> +			 * Return immediately; this is not an error, and further
> +			 * recursion does not need to be performed on this
> +			 * object since it is missing (so it does not need to be
> +			 * added to "pending").
> +			 */
> +			return 0;
> +

The same comment as 2/4 applies here.

> @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj)
>  	 * do a full fsck
>  	 */
>  	if (!(obj->flags & HAS_OBJ)) {
> +		if (repository_format_lazy_object)
> +			return;
>  		if (has_sha1_pack(obj->oid.hash))
>  			return; /* it is in pack - forget about it */
>  		printf("missing %s %s\n", printable_type(obj),

Also this reminds as a related issue.  Imagine:

 - An object X was once retrieved, perhaps but not necessarily
   lazily, together with another object Y that is referred to by X
   (e.g. X is a tree, Y is a blob in the directory at path D, which
   is represented by X).

 - The same blob Y is added to the index in a different directory at
   path E.

 - The user decides to make this a slimmed-down "narrow clone" style
   repository and tells Git that path D is not interesting.  We lose
   X, but not Y because Y is still referenced from the index.

 - "git reset --hard" happens, and there no longer is any reference
   to Y.

Now, when we run fsck, should we diagnose Y as "unreachable and/or
dangling"?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects
  2017-07-27 19:17   ` Junio C Hamano
@ 2017-07-27 23:50     ` Jonathan Tan
  0 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-27 23:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peartben, christian.couder

On Thu, 27 Jul 2017 12:17:37 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> The same comment as 2/4 applies here.

Noted - whatever the resolution is, I'll apply it to all the patches.
> 
> > @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj)
> >  	 * do a full fsck
> >  	 */
> >  	if (!(obj->flags & HAS_OBJ)) {
> > +		if (repository_format_lazy_object)
> > +			return;
> >  		if (has_sha1_pack(obj->oid.hash))
> >  			return; /* it is in pack - forget about it */
> >  		printf("missing %s %s\n", printable_type(obj),
> 
> Also this reminds as a related issue.  Imagine:
> 
>  - An object X was once retrieved, perhaps but not necessarily
>    lazily, together with another object Y that is referred to by X
>    (e.g. X is a tree, Y is a blob in the directory at path D, which
>    is represented by X).
> 
>  - The same blob Y is added to the index in a different directory at
>    path E.
> 
>  - The user decides to make this a slimmed-down "narrow clone" style
>    repository and tells Git that path D is not interesting.  We lose
>    X, but not Y because Y is still referenced from the index.
> 
>  - "git reset --hard" happens, and there no longer is any reference
>    to Y.
> 
> Now, when we run fsck, should we diagnose Y as "unreachable and/or
> dangling"?

I would say yes (or whatever happens in the case where we re-fetch into
a shallow clone).

Come to think of it..."git reset --hard" always has the potential to
create unreachable objects, right (regardless of whether it's a "shallow
clone" or "narrow clone" or ordinary clone)?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects
  2017-07-27 18:59   ` Junio C Hamano
@ 2017-07-27 23:50     ` Jonathan Tan
  2017-07-28 13:29       ` Ben Peart
  0 siblings, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-27 23:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peartben, christian.couder

On Thu, 27 Jul 2017 11:59:47 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> > @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
> >  
> >  	obj = parse_object(oid);
> >  	if (!obj) {
> > +		if (repository_format_lazy_object) {
> > +			/*
> > +			 * Increment default_refs anyway, because this is a
> > +			 * valid ref.
> > +			 */
> > +			default_refs++;
> > +			return 0;
> > +		}
> >  		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
> >  		errors_found |= ERROR_REACHABLE;
> 
> At this point, do we know (or can we tell) if this is a missing
> object or a file exists as a loose object but is corrupt?  If we
> could, it would be nice to do this only for the former to avoid
> sweeping a real corruption that is unrelated to the lazy fetch under
> the rug.

Before all this is run, there is a check over all loose and packed
objects and I've verified that this check reports failure in
corrupt-object situations (see test below).

It is true that parse_object() cannot report the difference, but since
fsck has already verified non-corruptness, I don't think it needs to
know the difference at this point.

> > +test_expect_success 'fsck fails on lazy object pointed to by ref' '
> > +	rm -rf repo &&
> > +	test_create_repo repo &&
> > +	test_commit -C repo 1 &&
> > +
> > +	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
> > +
> > +	# Reference $A only from ref, and delete it
> > +	git -C repo branch mybranch "$A" &&
> > +	delete_object repo "$A" &&
> > +
> > +	test_must_fail git -C repo fsck
> > +'
> 
> And a new test that uses a helper different from delete_object
> (perhaps call it corrupt_object?) can be used to make sure that we
> complain in that case here.

I agree that object corruption can cause this specific part of the
production code to falsely work. But I think that this specific part of
the code can and should rely on object corruption being checked
elsewhere. (I usually don't like to assume that other components work
and will continue to work, but in this case, I think that fsck checking
for object corruption is very foundational and should be relied upon.)

But if we think that defense "in depth" is a good idea, I have no
problem adding such tests (like the one below).

---
delete_object () {
	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
}

corrupt_object () {
	chmod a+w $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) &&
	echo CORRUPT >$1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
}

setup_object_in_reflog () {
	rm -rf repo &&
	test_create_repo repo &&
	test_commit -C repo 1 &&

	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&

	# Reference $A only from reflog
	git -C repo branch mybranch "$A" &&
	git -C repo branch -f mybranch "$B"
}

test_expect_success 'lazy object in reflog' '
	setup_object_in_reflog &&
	delete_object repo "$A" &&
	test_must_fail git -C repo fsck &&
	git -C repo config core.repositoryformatversion 1 &&
	git -C repo config extensions.lazyobject "arbitrary string" &&
	git -C repo fsck
'

test_expect_success 'corrupt loose object in reflog' '
	setup_object_in_reflog &&
	corrupt_object repo "$A" &&
	test_must_fail git -C repo fsck &&
	git -C repo config core.repositoryformatversion 1 &&
	git -C repo config extensions.lazyobject "arbitrary string" &&
	test_must_fail git -C repo fsck
'

test_expect_success 'missing packed object in reflog' '
	setup_object_in_reflog &&
	git -C repo repack -a &&
	delete_object repo "$A" &&
	chmod a+w repo/.git/objects/pack/*.pack &&
	echo CORRUPT >"$(echo repo/.git/objects/pack/*.pack)" &&
	test_must_fail git -C repo fsck &&
	git -C repo config core.repositoryformatversion 1 &&
	git -C repo config extensions.lazyobject "arbitrary string" &&
	test_must_fail git -C repo fsck
'

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension
  2017-07-27 18:55   ` Junio C Hamano
@ 2017-07-28 13:20     ` Ben Peart
  2017-07-28 23:50     ` Jonathan Tan
  1 sibling, 0 replies; 40+ messages in thread
From: Ben Peart @ 2017-07-28 13:20 UTC (permalink / raw)
  To: Junio C Hamano, Jonathan Tan; +Cc: git, christian.couder



On 7/27/2017 2:55 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>> Currently, Git does not support repos with very large numbers of objects
>> or repos that wish to minimize manipulation of certain blobs (for
>> example, because they are very large) very well, even if the user
>> operates mostly on part of the repo, because Git is designed on the
>> assumption that every referenced object is available somewhere in the
>> repo storage.
>>

I very much like the direction this is taking.  Handling missing objects 
gracefully is an important part of the overall partial clone support.

>> Introduce a new repository extension option "extensions.lazyobject", of
>> data type string. If this is set in a repository, Git will tolerate
>> objects being missing in that repository.  When Git needs those objects,
>> it will invoke the command in that option.
> 

I'm very glad you are doing this.  Having never used precious objects I 
was unaware of the extensions concept but it looks like a great way to 
deal with this repo specific option.

> My reading hiccupped after the first sentence, as the problem
> description made it sound like this was a boolean ("are we using
> lazy object feature?"), after reading "data type string".  And then
> "the command in that option" made me hiccup one more time, as it did
> not "click" that "in that option" was trying to say that the string
> is used as the command name (or is it a whole command line?  The
> leading part of the command line to which some arguments are
> appended before it gets invoked as a command? or what?).
> 
> Logically, I think it is more like
> 
>   - extensions.lazyobject can be set to tell Git to consider missing
>     objects in certain cases are not errors;
> 
>   - the value of extensions.lazyobject variable must be a string,
>     which is used to name the command to lazily make the object
>     "appear" in the repository on demand.
> 
>> Teach fsck about the new state of affairs. In this commit, teach fsck
>> that missing objects referenced from the reflog are not an error case;
>> in future commits, fsck will be taught about other cases.
> 
> In any case, sounds like a small and good first step.
> 

I agree completely with the feedback on the description.  That is quite 
the run on sentence. :)

>> +
>> +`lazyObject`
>> +~~~~~~~~~~~~~~~~~
>> +
>> +When the config key `extensions.lazyObject` is set to a string, Git
>> +tolerates objects being missing in the repository. This string contains
>> +the command to be run whenever a missing object is needed.
> 
> This has the same issue but to a lessor degree as there is "string
> contains".  What the command will do (e.g. "makes the object
> magically appear in the object store" or "emits the contents of the
> object to its standard output, so that Git can hash it to make it
> appear in the object store"), and how it is used (e.g. "this is a
> single command name and nothing else", "this is a leading part of
> command line and arguments are appended at the end, before the whole
> thing is passed to the shell to be executed", etc.) will need to be
> clarified in the final version of the series (not necessarily at
> this step---the series can elaborate in the later patches).
> 
>> diff --git a/builtin/fsck.c b/builtin/fsck.c
>> index fb0947009..1cfb8d98c 100644
>> --- a/builtin/fsck.c
>> +++ b/builtin/fsck.c
>> @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>>   					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>>   			obj->flags |= USED;
>>   			mark_object_reachable(obj);
>> -		} else {
>> +		} else if (!repository_format_lazy_object) {
>>   			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>>   			errors_found |= ERROR_REACHABLE;
>>   		}
>> diff --git a/cache.h b/cache.h
>> index 6c8242340..9e6bc0a21 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -853,10 +853,12 @@ extern int grafts_replace_parents;
>>   #define GIT_REPO_VERSION 0
>>   #define GIT_REPO_VERSION_READ 1
>>   extern int repository_format_precious_objects;
>> +extern char *repository_format_lazy_object;
> 
> This is not a new problem, but I think these two should be
> called repository_extension_$NAME not repository_format_$NAME.
> 
>> diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
>> new file mode 100755
>> index 000000000..36442531f
>> --- /dev/null
>> +++ b/t/t0410-lazy-object.sh
>> @@ -0,0 +1,32 @@
>> +#!/bin/sh
>> +
>> +test_description='lazy object'
>> +
>> +. ./test-lib.sh
>> +
>> +delete_object () {
>> +	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
>> +}
>> +
>> +test_expect_success 'fsck fails on lazy object in reflog' '
>> +	test_create_repo repo &&
>> +	test_commit -C repo 1 &&
>> +
>> +	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
>> +	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
>> +
>> +	# Reference $A only from reflog, and delete it
>> +	git -C repo branch mybranch "$A" &&
>> +	git -C repo branch -f mybranch "$B" &&
>> +	delete_object repo "$A" &&
>> +
>> +	test_must_fail git -C repo fsck
>> +'
>> +test_expect_success '...but succeeds if lazyobject is set' '
>> +	git -C repo config core.repositoryformatversion 1 &&
>> +	git -C repo config extensions.lazyobject "arbitrary string" &&
>> +	git -C repo fsck
>> +'
>> +
>> +test_done

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects
  2017-07-27 23:50     ` Jonathan Tan
@ 2017-07-28 13:29       ` Ben Peart
  2017-07-28 20:08         ` [PATCH] tests: ensure fsck fails on corrupt packfiles Jonathan Tan
  0 siblings, 1 reply; 40+ messages in thread
From: Ben Peart @ 2017-07-28 13:29 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, christian.couder



On 7/27/2017 7:50 PM, Jonathan Tan wrote:
> On Thu, 27 Jul 2017 11:59:47 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>>> @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
>>>   
>>>   	obj = parse_object(oid);
>>>   	if (!obj) {
>>> +		if (repository_format_lazy_object) {
>>> +			/*
>>> +			 * Increment default_refs anyway, because this is a
>>> +			 * valid ref.
>>> +			 */
>>> +			default_refs++;
>>> +			return 0;
>>> +		}
>>>   		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
>>>   		errors_found |= ERROR_REACHABLE;
>>
>> At this point, do we know (or can we tell) if this is a missing
>> object or a file exists as a loose object but is corrupt?  If we
>> could, it would be nice to do this only for the former to avoid
>> sweeping a real corruption that is unrelated to the lazy fetch under
>> the rug.
> 
> Before all this is run, there is a check over all loose and packed
> objects and I've verified that this check reports failure in
> corrupt-object situations (see test below).
> 
> It is true that parse_object() cannot report the difference, but since
> fsck has already verified non-corruptness, I don't think it needs to
> know the difference at this point.
> 
>>> +test_expect_success 'fsck fails on lazy object pointed to by ref' '
>>> +	rm -rf repo &&
>>> +	test_create_repo repo &&
>>> +	test_commit -C repo 1 &&
>>> +
>>> +	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
>>> +
>>> +	# Reference $A only from ref, and delete it
>>> +	git -C repo branch mybranch "$A" &&
>>> +	delete_object repo "$A" &&
>>> +
>>> +	test_must_fail git -C repo fsck
>>> +'
>>
>> And a new test that uses a helper different from delete_object
>> (perhaps call it corrupt_object?) can be used to make sure that we
>> complain in that case here.
> 
> I agree that object corruption can cause this specific part of the
> production code to falsely work. But I think that this specific part of
> the code can and should rely on object corruption being checked
> elsewhere. (I usually don't like to assume that other components work
> and will continue to work, but in this case, I think that fsck checking
> for object corruption is very foundational and should be relied upon.)
> 
> But if we think that defense "in depth" is a good idea, I have no
> problem adding such tests (like the one below).
> 

It's good to know that object corruption is already being checked 
elsewhere in fsck.  I agree that you should be able to rely that as long 
as it is guaranteed to run.  I'd rather see a single location in the 
code that has the logic to detect corrupted objects rather than two 
copies (or worse, two different versions).

Are there also tests for validating the object corruption detection 
code?  If not, adding some (like below) would be great.


> ---
> delete_object () {
> 	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
> }
> 
> corrupt_object () {
> 	chmod a+w $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) &&
> 	echo CORRUPT >$1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
> }
> 
> setup_object_in_reflog () {
> 	rm -rf repo &&
> 	test_create_repo repo &&
> 	test_commit -C repo 1 &&
> 
> 	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
> 	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
> 
> 	# Reference $A only from reflog
> 	git -C repo branch mybranch "$A" &&
> 	git -C repo branch -f mybranch "$B"
> }
> 
> test_expect_success 'lazy object in reflog' '
> 	setup_object_in_reflog &&
> 	delete_object repo "$A" &&
> 	test_must_fail git -C repo fsck &&
> 	git -C repo config core.repositoryformatversion 1 &&
> 	git -C repo config extensions.lazyobject "arbitrary string" &&
> 	git -C repo fsck
> '
> 
> test_expect_success 'corrupt loose object in reflog' '
> 	setup_object_in_reflog &&
> 	corrupt_object repo "$A" &&
> 	test_must_fail git -C repo fsck &&
> 	git -C repo config core.repositoryformatversion 1 &&
> 	git -C repo config extensions.lazyobject "arbitrary string" &&
> 	test_must_fail git -C repo fsck
> '
> 
> test_expect_success 'missing packed object in reflog' '
> 	setup_object_in_reflog &&
> 	git -C repo repack -a &&
> 	delete_object repo "$A" &&
> 	chmod a+w repo/.git/objects/pack/*.pack &&
> 	echo CORRUPT >"$(echo repo/.git/objects/pack/*.pack)" &&
> 	test_must_fail git -C repo fsck &&
> 	git -C repo config core.repositoryformatversion 1 &&
> 	git -C repo config extensions.lazyobject "arbitrary string" &&
> 	test_must_fail git -C repo fsck
> '
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects
  2017-07-27 17:25   ` Jonathan Tan
@ 2017-07-28 13:40     ` Ben Peart
  0 siblings, 0 replies; 40+ messages in thread
From: Ben Peart @ 2017-07-28 13:40 UTC (permalink / raw)
  To: Jonathan Tan, brian m. carlson; +Cc: git, christian.couder



On 7/27/2017 1:25 PM, Jonathan Tan wrote:
> On Wed, 26 Jul 2017 23:42:38 +0000
> "brian m. carlson" <sandals@crustytoothpaste.net> wrote:
> 
>> I looked at this and I like the direction it's going.  It's pretty
>> simple and straightforward, which I also like.
> 

ditto, simple is good!

> Thanks.
> 
>> What I'd recommend is that instead of making lazyObject a string, we
>> make it an integer representing a protocol version.  We then add a
>> different config setting that is the actual program to invoke, using the
>> given protocol version.  This lets us change the way we speak to the
>> tool without breaking backwards compatibility, and it also allows us to
>> simply check the lazyObject script for supported protocols up front.
> 
> That's possible too. As for version negotiation, I think we'll end up
> using a protocol similar to the clean/smudge long-running process
> protocol (as documented as gitattributes), so that does not need to be
> taken care of here, but making lazyObject be the version integer is fine
> too.
> 

I was also thinking the way to retrieve the missing objects be a 
versioned negotiation via the long-running process protocol.

That said, I'm a fan of including versions to make our life easier later 
if we decide to adopt an entirely different model for retrieving the 
missing objects.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] tests: ensure fsck fails on corrupt packfiles
  2017-07-28 13:29       ` Ben Peart
@ 2017-07-28 20:08         ` Jonathan Tan
  0 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-28 20:08 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peartben, gitster

t1450-fsck.sh does not have a test that checks fsck's behavior when a
packfile is invalid. It does have a test for when an object in a
packfile is invalid, but in that test, the packfile itself is valid.

Add such a test.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I think the existing packfile corruption test (t5303) is good enough
that we would notice such things, but just in case we want to test fsck
specifically, here's a patch.
---
 t/t1450-fsck.sh | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index bb89e1a5d..4087150db 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -608,6 +608,22 @@ test_expect_success 'fsck errors in packed objects' '
 	! grep corrupt out
 '
 
+test_expect_success 'fsck fails on corrupt packfile' '
+	hsh=$(git commit-tree -m mycommit HEAD^{tree}) &&
+	pack=$(echo $hsh | git pack-objects .git/objects/pack/pack) &&
+
+	# Corrupt the first byte of the first object. (It contains 3 type bits,
+	# at least one of which is not zero, so setting the first byte to 0 is
+	# sufficient.)
+	chmod a+w .git/objects/pack/pack-$pack.pack &&
+	printf '\0' | dd of=.git/objects/pack/pack-$pack.pack bs=1 conv=notrunc seek=12 &&
+
+	test_when_finished "rm -f .git/objects/pack/pack-$pack.*" &&
+	remove_object $hsh &&
+	test_must_fail git fsck 2>out &&
+	test_i18ngrep "checksum mismatch" out
+'
+
 test_expect_success 'fsck finds problems in duplicate loose objects' '
 	rm -rf broken-duplicate &&
 	git init broken-duplicate &&
-- 
2.14.0.rc0.400.g1c36432dff-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension
  2017-07-27 18:55   ` Junio C Hamano
  2017-07-28 13:20     ` Ben Peart
@ 2017-07-28 23:50     ` Jonathan Tan
  2017-07-29  0:21       ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-28 23:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peartben, christian.couder

On Thu, 27 Jul 2017 11:55:46 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> My reading hiccupped after the first sentence, as the problem
> description made it sound like this was a boolean ("are we using
> lazy object feature?"), after reading "data type string".  And then
> "the command in that option" made me hiccup one more time, as it did
> not "click" that "in that option" was trying to say that the string
> is used as the command name (or is it a whole command line?  The
> leading part of the command line to which some arguments are
> appended before it gets invoked as a command? or what?).
> 
> Logically, I think it is more like
> 
>  - extensions.lazyobject can be set to tell Git to consider missing
>    objects in certain cases are not errors;
> 
>  - the value of extensions.lazyobject variable must be a string,
>    which is used to name the command to lazily make the object
>    "appear" in the repository on demand.

OK, I'll update the commit message in the next reroll.

> >  extern int repository_format_precious_objects;
> > +extern char *repository_format_lazy_object;
> 
> This is not a new problem, but I think these two should be
> called repository_extension_$NAME not repository_format_$NAME.

Looking at the original commit 067fbd4 ("introduce "preciousObjects"
repository extension", 2015-06-24), it seems that this was so named to
be analogous to the existing "struct repository_format { int version;
...}" => "int repository_format_version;". The existing
repository_format_$NAME thus seems reasonable to me.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension
  2017-07-28 23:50     ` Jonathan Tan
@ 2017-07-29  0:21       ` Junio C Hamano
  0 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-07-29  0:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

>> >  extern int repository_format_precious_objects;
>> > +extern char *repository_format_lazy_object;
>> 
>> This is not a new problem, but I think these two should be
>> called repository_extension_$NAME not repository_format_$NAME.
>
> Looking at the original commit 067fbd4 ("introduce "preciousObjects"
> repository extension", 2015-06-24), it seems that this was so named to
> be analogous to the existing "struct repository_format { int version;
> ...}" => "int repository_format_version;". The existing
> repository_format_$NAME thus seems reasonable to me.

OK.  They smell like "repository extension" to me, but probably the
fully spelled name of the concept is "repository format extension",
so using the word "format" out of that phrase sounds OK to me.

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects
  2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan
  2017-07-27 19:17   ` Junio C Hamano
@ 2017-07-29 16:04   ` Junio C Hamano
  1 sibling, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-07-29 16:04 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Teach fsck to not treat missing objects indirectly pointed to by refs as
> an error when extensions.lazyobject is set.

I forgot to mention a potential flaw in this approach in my previous
message.

If you are a pure sightseer, then this is perfectly fine.  The
object store in your local Git client working in that mode is purely
a cache, lazily populated while browsing the object store backed by
the source of what lazy-object "hook" talks with.  As long as that
cache does not give us a corrupt object, we are OK, because missing
objects do not matter.

But once you start using the repository as more than a sightseer,
you will have objects that only exist in your local "cache" and are
not yet in that backing store behind the lazy-object "hook".  You
need to notice it when any of them goes corrupt or missing, or your
next "git push" to send them over to a remote location will fail by
definition because you are the only one with these objects.

If we had the "promise" thing, then we could say that it is OK if
traversal terminated at a "promised but not fetched yet" boundary,
but we cannot afford the "promise", and more importantly, I do not
think "promise" has to be the only approach to ensure that the
objects that exist only in the local repository are all connected.

For example, if we know that the remote 'origin' is the actual
backing store lazy-object "hook" talks with, a validation rule to
ensure that we haven't lost any local commit is to ensure that a
traversal from our local branch tips down to remote-tracking
branches taken from 'origin' must not hit _any_ missing commit.

That covers only the commit objects.  I do not know offhand if we
can and how we extend this concept to protect the tags, trees and
blobs we have locally generated and haven't pushed out, but you and
Ben hopefully can come up with ways to cover them.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (4 preceding siblings ...)
  2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:21   ` Junio C Hamano
  2017-08-08 17:13   ` Ben Peart
  2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Besides review changes, this patch set now includes my rewritten
lazy-loading sha1_file patch, so you can now do this (excerpted from one
of the tests):

    test_create_repo server
    test_commit -C server 1 1.t abcdefgh
    HASH=$(git hash-object server/1.t)
    
    test_create_repo client
    test_must_fail git -C client cat-file -p "$HASH"
    git -C client config core.repositoryformatversion 1
    git -C client config extensions.lazyobject \
        "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
    git -C client cat-file -p "$HASH"

with fsck still working. Also, there is no need for a list of promised
blobs, and the long-running process protocol is being used.

Changes from v1:
 - added last patch that supports lazy loading
 - clarified documentation in "introduce lazyobject extension" patch
   (following Junio's comments [1])

As listed in the changes above, I have rewritten my lazy-loading
sha1_file patch to no longer use the list of promises. Also, I have
added documentation about the protocol used to (hopefully) the
appropriate places.

This is a minimal implementation, hopefully enough of a foundation to be
built upon. In particular, I haven't added the environment variable to
suppress lazy loading, and the lazy loading protocol only supports one
object at a time.

Other work
----------

This differs slightly from Ben Peart's patch [2] in that the
lazy-loading functionality is provided through a configured shell
command instead of a hook shell script. I envision commands like "git
clone", in the future, needing to pre-configure lazy loading, and I
think that it will be less surprising to the user if "git clone" wrote a
default configuration instead of a default hook.

This also differs from Christian Couder's patch set [3] that implement a
larger-scale object database, in that (i) my patch set does not support
putting objects into external databases, and (ii) my patch set requires
the lazy loader to make the objects available in the local repo, instead
of allowing the objects to only be stored in the external database.

[1] https://public-inbox.org/git/xmqqzibpn1zh.fsf@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20170714132651.170708-2-benpeart@microsoft.com/
[3] https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/

Jonathan Tan (5):
  environment, fsck: introduce lazyobject extension
  fsck: support refs pointing to lazy objects
  fsck: support referenced lazy objects
  fsck: support lazy objects as CLI argument
  sha1_file: support loading lazy objects

 Documentation/Makefile                             |   1 +
 Documentation/gitattributes.txt                    |  54 ++--------
 Documentation/gitrepository-layout.txt             |   3 +
 .../technical/long-running-process-protocol.txt    |  50 +++++++++
 Documentation/technical/repository-version.txt     |  23 +++++
 Makefile                                           |   1 +
 builtin/cat-file.c                                 |   2 +
 builtin/fsck.c                                     |  25 ++++-
 cache.h                                            |   4 +
 environment.c                                      |   1 +
 lazy-object.c                                      |  80 +++++++++++++++
 lazy-object.h                                      |  12 +++
 object.c                                           |   7 ++
 object.h                                           |  13 +++
 setup.c                                            |   7 +-
 sha1_file.c                                        |  44 +++++---
 t/t0410-lazy-object.sh                             | 113 +++++++++++++++++++++
 t/t0410/lazy-object                                | 102 +++++++++++++++++++
 18 files changed, 478 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/technical/long-running-process-protocol.txt
 create mode 100644 lazy-object.c
 create mode 100644 lazy-object.h
 create mode 100755 t/t0410-lazy-object.sh
 create mode 100755 t/t0410/lazy-object

-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 1/5] environment, fsck: introduce lazyobject extension
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (5 preceding siblings ...)
  2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.

Introduce a new repository extension option `extensions.lazyObject`.
When it is set, Git does not treat missing objects as errors. The value
of `extensions.lazyObject` must be a string, naming the command used to
make a missing object appear whenever it is needed.

Teach fsck about the new state of affairs. In this commit, teach fsck
that missing objects referenced from the reflog are not an error case;
in future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt |  7 ++++++
 builtin/fsck.c                                 |  2 +-
 cache.h                                        |  2 ++
 environment.c                                  |  1 +
 setup.c                                        |  7 +++++-
 t/t0410-lazy-object.sh                         | 32 ++++++++++++++++++++++++++
 6 files changed, 49 insertions(+), 2 deletions(-)
 create mode 100755 t/t0410-lazy-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..f0482cae4 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,10 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`lazyObject`
+~~~~~~~~~~~~~~~~~
+
+When the config key `extensions.lazyObject` is set, Git does not treat missing
+objects as errors. The value of `extensions.lazyObject` must be a string,
+naming the command used to make a missing object appear whenever it is needed.
diff --git a/builtin/fsck.c b/builtin/fsck.c
index a92f44818..b7e245654 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
 			obj->flags |= USED;
 			mark_object_reachable(obj);
-		} else {
+		} else if (!repository_format_lazy_object) {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
 			errors_found |= ERROR_REACHABLE;
 		}
diff --git a/cache.h b/cache.h
index 6c8242340..9e6bc0a21 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_lazy_object;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *lazy_object;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
diff --git a/environment.c b/environment.c
index 3fd4b1084..cd8ef2897 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_lazy_object;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/setup.c b/setup.c
index 860507e1f..94cfde3cc 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "lazyobject")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->lazy_object = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_lazy_object = candidate.lazy_object;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
new file mode 100755
index 000000000..36442531f
--- /dev/null
+++ b/t/t0410-lazy-object.sh
@@ -0,0 +1,32 @@
+#!/bin/sh
+
+test_description='lazy object'
+
+. ./test-lib.sh
+
+delete_object () {
+	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
+}
+
+test_expect_success 'fsck fails on lazy object in reflog' '
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch mybranch "$A" &&
+	git -C repo branch -f mybranch "$B" &&
+	delete_object repo "$A" &&
+
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_done
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 2/5] fsck: support refs pointing to lazy objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (6 preceding siblings ...)
  2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Teach fsck to not treat refs with missing targets as an error when
extensions.lazyobject is set.

For the purposes of warning about no default refs, such refs are still
treated as legitimate refs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         |  8 ++++++++
 t/t0410-lazy-object.sh | 20 ++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index b7e245654..38ed630d8 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 
 	obj = parse_object(oid);
 	if (!obj) {
+		if (repository_format_lazy_object) {
+			/*
+			 * Increment default_refs anyway, because this is a
+			 * valid ref.
+			 */
+			default_refs++;
+			return 0;
+		}
 		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
 		errors_found |= ERROR_REACHABLE;
 		/* We'll continue with the rest despite the error.. */
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 36442531f..00e1b4a88 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -29,4 +29,24 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object pointed to by ref' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+
+	# Reference $A only from ref, and delete it
+	git -C repo branch mybranch "$A" &&
+	delete_object repo "$A" &&
+
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 3/5] fsck: support referenced lazy objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (7 preceding siblings ...)
  2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan
  10 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Teach fsck to not treat missing objects indirectly pointed to by refs as
an error when extensions.lazyobject is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         | 11 +++++++++++
 t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 38ed630d8..19681c5b3 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
 		return 0;
 	obj->flags |= REACHABLE;
 	if (!(obj->flags & HAS_OBJ)) {
+		if (repository_format_lazy_object)
+			/*
+			 * Return immediately; this is not an error, and further
+			 * recursion does not need to be performed on this
+			 * object since it is missing (so it does not need to be
+			 * added to "pending").
+			 */
+			return 0;
+
 		if (parent && !has_object_file(&obj->oid)) {
 			printf("broken link from %7s %s\n",
 				 printable_type(parent), describe_object(parent));
@@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj)
 	 * do a full fsck
 	 */
 	if (!(obj->flags & HAS_OBJ)) {
+		if (repository_format_lazy_object)
+			return;
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
 		printf("missing %s %s\n", printable_type(obj),
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 00e1b4a88..45f665a15 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -49,4 +49,31 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object indirectly pointed to by ref' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+	test_commit -C repo 2 &&
+	test_commit -C repo 3 &&
+	git -C repo tag -a annotated_tag -m "annotated tag" &&
+
+	C=$(git -C repo rev-parse 1) &&
+	T=$(git -C repo rev-parse 2^{tree}) &&
+	B=$(git hash-object repo/3.t) &&
+	AT=$(git -C repo rev-parse annotated_tag) &&
+
+	# missing commit, tree, blob, and tag
+	delete_object repo "$C" &&
+	delete_object repo "$T" &&
+	delete_object repo "$B" &&
+	delete_object repo "$AT" &&
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 4/5] fsck: support lazy objects as CLI argument
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (8 preceding siblings ...)
  2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan
  10 siblings, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Teach fsck to not treat missing objects provided on the CLI as an error
when extensions.lazyobject is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c         |  2 ++
 t/t0410-lazy-object.sh | 16 ++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 19681c5b3..20415902f 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -754,6 +754,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			struct object *obj = lookup_object(oid.hash);
 
 			if (!obj || !(obj->flags & HAS_OBJ)) {
+				if (repository_format_lazy_object)
+					continue;
 				error("%s: object missing", oid_to_hex(&oid));
 				errors_found |= ERROR_OBJECT;
 				continue;
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 45f665a15..3ac61c1c5 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -76,4 +76,20 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck
 '
 
+test_expect_success 'fsck fails on lazy object directly given in command-line' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+	HASH=$(git hash-object repo/1.t) &&
+	delete_object repo "$HASH" &&
+
+	test_must_fail git -C repo fsck "$HASH"
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck "$HASH"
+'
+
 test_done
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v2 5/5] sha1_file: support loading lazy objects
  2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
                   ` (9 preceding siblings ...)
  2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan
@ 2017-07-31 21:02 ` Jonathan Tan
  2017-07-31 21:29   ` Junio C Hamano
  2017-08-08 20:20   ` Ben Peart
  10 siblings, 2 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder

Teach sha1_file to invoke the command configured in
extensions.lazyObject whenever an object is requested and unavailable.

The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate missing objects (without invoking the command) or be
more efficient in invoking this command.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
     regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about missing objects
 - builtin/cat-file - warning message added in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised objects
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed object is requested, not on any
   object.  That is, whenever we attempt to search the packfiles for an
   object, if it is missing (from the packfiles and from the loose
   object storage), to invoke the hook (which must then store it as a
   packfile), open the packfile the hook generated, and report that the
   object is found in that new packfile. This reduces the amount of
   analysis needed (in that we only need to look at how packed objects
   are handled), but requires that the hook generate packfiles (or for
   sha1_file to pack whatever loose objects are generated), creating one
   packfile for each missing object and potentially very many packfiles
   that must be linearly searched. This may be tolerable now for repos
   that only have a few missing objects (for example, repos that only
   want to exclude large blobs), and might be tolerable in the future if
   we have batching support for the most commonly used commands, but is
   not tolerable now for repos that exclude a large amount of objects.

Helped-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/Makefile                             |   1 +
 Documentation/gitattributes.txt                    |  54 ++---------
 Documentation/gitrepository-layout.txt             |   3 +
 .../technical/long-running-process-protocol.txt    |  50 ++++++++++
 Documentation/technical/repository-version.txt     |  16 ++++
 Makefile                                           |   1 +
 builtin/cat-file.c                                 |   2 +
 builtin/fsck.c                                     |   2 +-
 cache.h                                            |   2 +
 lazy-object.c                                      |  80 ++++++++++++++++
 lazy-object.h                                      |  12 +++
 object.c                                           |   7 ++
 object.h                                           |  13 +++
 sha1_file.c                                        |  44 ++++++---
 t/t0410-lazy-object.sh                             |  18 ++++
 t/t0410/lazy-object                                | 102 +++++++++++++++++++++
 16 files changed, 345 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/technical/long-running-process-protocol.txt
 create mode 100644 lazy-object.c
 create mode 100644 lazy-object.h
 create mode 100755 t/t0410/lazy-object

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2415e0d65..5657d49c2 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -69,6 +69,7 @@ SP_ARTICLES += $(API_DOCS)
 
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
+TECH_DOCS += technical/long-running-process-protocol
 TECH_DOCS += technical/pack-format
 TECH_DOCS += technical/pack-heuristics
 TECH_DOCS += technical/pack-protocol
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index c4f2be254..ed5e8204e 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -387,46 +387,14 @@ Long Running Filter Process
 If the filter command (a string value) is defined via
 `filter.<driver>.process` then Git can process all blobs with a
 single filter invocation for the entire life of a single Git
-command. This is achieved by using a packet format (pkt-line,
-see technical/protocol-common.txt) based protocol over standard
-input and standard output as follows. All packets, except for the
-"*CONTENT" packets and the "0000" flush packet, are considered
-text and therefore are terminated by a LF.
-
-Git starts the filter when it encounters the first file
-that needs to be cleaned or smudged. After the filter started
-Git sends a welcome message ("git-filter-client"), a list of supported
-protocol version numbers, and a flush packet. Git expects to read a welcome
-response message ("git-filter-server"), exactly one protocol version number
-from the previously sent list, and a flush packet. All further
-communication will be based on the selected version. The remaining
-protocol description below documents "version=2". Please note that
-"version=42" in the example below does not exist and is only there
-to illustrate how the protocol would look like with more than one
-version.
-
-After the version negotiation Git sends a list of all capabilities that
-it supports and a flush packet. Git expects to read a list of desired
-capabilities, which must be a subset of the supported capabilities list,
-and a flush packet as response:
-------------------------
-packet:          git> git-filter-client
-packet:          git> version=2
-packet:          git> version=42
-packet:          git> 0000
-packet:          git< git-filter-server
-packet:          git< version=2
-packet:          git< 0000
-packet:          git> capability=clean
-packet:          git> capability=smudge
-packet:          git> capability=not-yet-invented
-packet:          git> 0000
-packet:          git< capability=clean
-packet:          git< capability=smudge
-packet:          git< 0000
-------------------------
-Supported filter capabilities in version 2 are "clean", "smudge",
-and "delay".
+command. This is achieved by using the long-running process protocol
+(described in technical/long-running-process-protocol.txt).
+
+When Git encounters the first file that needs to be cleaned or smudged,
+it starts the filter and performs the handshake. In the handshake, the
+welcome message sent by Git is "git-filter-client", only version 2 is
+suppported, and the supported capabilities are "clean", "smudge", and
+"delay".
 
 Afterwards Git sends a list of "key=value" pairs terminated with
 a flush packet. The list will contain at least the filter command
@@ -512,12 +480,6 @@ the protocol then Git will stop the filter process and restart it
 with the next file that needs to be processed. Depending on the
 `filter.<driver>.required` flag Git will interpret that as error.
 
-After the filter has processed a command it is expected to wait for
-a "key=value" list containing the next command. Git will close
-the command pipe on exit. The filter is expected to detect EOF
-and exit gracefully on its own. Git will wait until the filter
-process has stopped.
-
 Delay
 ^^^^^
 
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index f51ed4e37..a690fdcd5 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -47,6 +47,9 @@ use with dumb transports but otherwise is OK as long as
 `objects/info/alternates` points at the object stores it
 borrows from.
 +
+. If `extensions.lazyObject` is set, some objects could be missing, to
+be lazily loaded by Git when required.
++
 This directory is ignored if $GIT_COMMON_DIR is set and
 "$GIT_COMMON_DIR/objects" will be used instead.
 
diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt
new file mode 100644
index 000000000..aa0aa9af1
--- /dev/null
+++ b/Documentation/technical/long-running-process-protocol.txt
@@ -0,0 +1,50 @@
+Long-running process protocol
+=============================
+
+This protocol is used when Git needs to communicate with an external
+process throughout the entire life of a single Git command. All
+communication is in pkt-line format (see technical/protocol-common.txt)
+over standard input and standard output.
+
+Handshake
+---------
+
+Git starts by sending a welcome message (for example,
+"git-filter-client"), a list of supported protocol version numbers, and
+a flush packet. Git expects to read the welcome message with "server"
+instead of "client" (for example, "git-filter-server"), exactly one
+protocol version number from the previously sent list, and a flush
+packet. All further communication will be based on the selected version.
+The remaining protocol description below documents "version=2". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look like with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet:          git> git-filter-client
+packet:          git> version=2
+packet:          git> version=42
+packet:          git> 0000
+packet:          git< git-filter-server
+packet:          git< version=2
+packet:          git< 0000
+packet:          git> capability=clean
+packet:          git> capability=smudge
+packet:          git> capability=not-yet-invented
+packet:          git> 0000
+packet:          git< capability=clean
+packet:          git< capability=smudge
+packet:          git< 0000
+------------------------
+
+Shutdown
+--------
+
+Git will close
+the command pipe on exit. The filter is expected to detect EOF
+and exit gracefully on its own. Git will wait until the filter
+process has stopped.
diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index f0482cae4..38c8f966e 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -93,3 +93,19 @@ objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 When the config key `extensions.lazyObject` is set, Git does not treat missing
 objects as errors. The value of `extensions.lazyObject` must be a string,
 naming the command used to make a missing object appear whenever it is needed.
+
+Git communicates with this command using the long-running process
+protocol described in technical/long-running-process-protocol.txt.
+During the handshake, the welcome message used is "git-lazy-object", the
+only supported version is 1, and the only supported capability is "get".
+After the handshake, for each object to be loaded, Git will send
+"command=get\n", "sha1=<sha1>\n", and a flush packet, and will expect
+"status=success\n" or "status=error\n" and a flush packet. For example:
+
+------------------------
+packet:          git> command=get
+packet:          git> sha1=1234567890123456789012345678901234567890
+packet:          git> 0000
+packet:          git< status=success
+packet:          git< 0000
+------------------------
diff --git a/Makefile b/Makefile
index cd7d36c7d..3cac24775 100644
--- a/Makefile
+++ b/Makefile
@@ -800,6 +800,7 @@ LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
 LIB_OBJS += kwset.o
+LIB_OBJS += lazy-object.o
 LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 62c8cf0eb..ef016b84e 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -473,6 +473,8 @@ static int batch_objects(struct batch_options *opt)
 
 		for_each_loose_object(batch_loose_object, &sa, 0);
 		for_each_packed_object(batch_packed_object, &sa, 0);
+		if (repository_format_lazy_object)
+			warning("This repository has extensions.lazyObject set. Some objects may not be loaded.");
 
 		cb.opt = opt;
 		cb.expand = &data;
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 20415902f..327b087c2 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -447,7 +447,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 {
 	struct object *obj;
 
-	obj = parse_object(oid);
+	obj = parse_maybe_missing_object(oid);
 	if (!obj) {
 		if (repository_format_lazy_object) {
 			/*
diff --git a/cache.h b/cache.h
index 9e6bc0a21..375f5b700 100644
--- a/cache.h
+++ b/cache.h
@@ -1837,6 +1837,8 @@ struct object_info {
 #define OBJECT_INFO_SKIP_CACHED 4
 /* Do not retry packed storage after checking packed and loose storage */
 #define OBJECT_INFO_QUICK 8
+/* Never invoke extensions.lazyObject, even if object is missing */
+#define OBJECT_INFO_NO_LAZY 16
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/lazy-object.c b/lazy-object.c
new file mode 100644
index 000000000..d4b271a4e
--- /dev/null
+++ b/lazy-object.c
@@ -0,0 +1,80 @@
+#include "cache.h"
+#include "lazy-object.h"
+#include "sha1-lookup.h"
+#include "strbuf.h"
+#include "run-command.h"
+#include "sha1-array.h"
+#include "config.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"
+
+#define CAP_GET    (1u<<0)
+
+static struct hashmap subprocess_map;
+
+int start_lazy_object_fn(struct subprocess_entry *subprocess)
+{
+	static int versions[] = {1, 0};
+	static struct subprocess_capability capabilities[] = {
+		{ "get", CAP_GET },
+		{ NULL, 0 }
+	};
+
+	unsigned int supported_capabilities = 0;
+	int ret = subprocess_handshake(subprocess, "git-lazy-object", versions,
+				       NULL, capabilities,
+				       &supported_capabilities);
+	if (ret)
+		return ret;
+	return supported_capabilities != CAP_GET;
+}
+
+void load_lazy_object(const unsigned char *sha1)
+{
+	struct subprocess_entry *entry;
+	struct child_process *process;
+	struct strbuf status = STRBUF_INIT;
+
+	if (!repository_format_lazy_object)
+		die("BUG: extensions.lazyObject is not set");
+
+	if (!subprocess_map.tablesize) {
+		hashmap_init(&subprocess_map,
+			     (hashmap_cmp_fn) cmd2process_cmp, NULL, 0);
+		entry = NULL;
+	} else {
+		entry = subprocess_find_entry(&subprocess_map,
+					      repository_format_lazy_object);
+	}
+	if (!entry) {
+		entry = xmalloc(sizeof(*entry));
+
+		if (subprocess_start(&subprocess_map, entry,
+				     repository_format_lazy_object,
+				     start_lazy_object_fn)) {
+			free(entry);
+			die("could not start '%s'", repository_format_lazy_object);
+		}
+	}
+	process = subprocess_get_child_process(entry);
+
+	packet_write_fmt(process->in, "command=get\n");
+	packet_write_fmt(process->in, "sha1=%s\n", sha1_to_hex(sha1));
+	packet_flush(process->in);
+	if (subprocess_read_status(process->out, &status))
+		die("Could not read status from '%s'", repository_format_lazy_object);
+	
+	if (!strcmp(status.buf, "success"))
+		return;
+	if (!strcmp(status.buf, "error"))
+		die("external process '%s' returned error",
+		    repository_format_lazy_object);
+	die("external process '%s' failed", repository_format_lazy_object);
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+}
diff --git a/lazy-object.h b/lazy-object.h
new file mode 100644
index 000000000..1ec6ff203
--- /dev/null
+++ b/lazy-object.h
@@ -0,0 +1,12 @@
+#ifndef LAZY_OBJECT_H
+#define LAZY_OBJECT_H
+
+#include "cache.h"
+#include "sha1-array.h"
+
+/*
+ * Invokes repository_format_lazy_object to load the given missing object.
+ */
+void load_lazy_object(const unsigned char *sha1);
+
+#endif
diff --git a/object.c b/object.c
index 321d7e920..e3e734f67 100644
--- a/object.c
+++ b/object.c
@@ -279,6 +279,13 @@ struct object *parse_object(const struct object_id *oid)
 	return NULL;
 }
 
+struct object *parse_maybe_missing_object(const struct object_id *oid)
+{
+	if (has_object_file_with_flags(oid, OBJECT_INFO_NO_LAZY))
+		return parse_object(oid);
+	return NULL;
+}
+
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p)
 {
diff --git a/object.h b/object.h
index 0a419ba8d..b98e73572 100644
--- a/object.h
+++ b/object.h
@@ -87,10 +87,23 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet);
 /*
  * Returns the object, having parsed it to find out what it is.
  *
+ * This function automatically invokes the command in extensions.lazyObject if
+ * the object is missing and extensions.lazyObject is set.
+ *
  * Returns NULL if the object is missing or corrupt.
  */
 struct object *parse_object(const struct object_id *oid);
 
+/*
+ * Returns the object, having parsed it to find out what it is.
+ *
+ * This function never invokes the command in extensions.lazyObject, unlike
+ * parse_object().
+ *
+ * Returns NULL if the object is missing or corrupt.
+ */
+struct object *parse_maybe_missing_object(const struct object_id *oid);
+
 /*
  * Like parse_object, but will die() instead of returning NULL. If the
  * "name" parameter is not NULL, it is included in the error message
diff --git a/sha1_file.c b/sha1_file.c
index b60ae15f7..1785c61d8 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,11 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
+#include "lazy-object.h"
+#include "sha1-array.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -2984,6 +2989,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
+	int already_retried = 0;
 
 	if (!oi)
 		oi = &blank_oi;
@@ -3008,30 +3014,38 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
-			oi->whence = OI_LOOSE;
-			return 0;
-		}
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
-			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
-		}
+	/* Most likely it's a loose object. */
+	if (!sha1_loose_object_info(real, oi, flags)) {
+		oi->whence = OI_LOOSE;
+		return 0;
 	}
 
+	/* Not a loose object; someone else may have just packed it. */
+	reprepare_packed_git();
+	if (find_pack_entry(real, &e))
+		goto found_packed;
+
+	/* Check if it is a missing object */
+	if (repository_format_lazy_object && !already_retried &&
+	    !(flags & OBJECT_INFO_NO_LAZY)) {
+		load_lazy_object(real);
+		already_retried = 1;
+		goto retry;
+	}
+
+	return -1;
+
+found_packed:
 	if (oi == &blank_oi)
 		/*
 		 * We know that the caller doesn't actually need the
 		 * information below, so return early.
 		 */
 		return 0;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
index 3ac61c1c5..d8888baa0 100755
--- a/t/t0410-lazy-object.sh
+++ b/t/t0410-lazy-object.sh
@@ -92,4 +92,22 @@ test_expect_success '...but succeeds if lazyobject is set' '
 	git -C repo fsck "$HASH"
 '
 
+test_expect_success 'sha1_object_info_extended (through git cat-file)' '
+	rm -rf repo client &&
+
+	test_create_repo server &&
+	test_commit -C server 1 1.t abcdefgh &&
+	HASH=$(git hash-object server/1.t) &&
+
+	test_create_repo client &&
+	test_must_fail git -C client cat-file -p "$HASH"
+'
+
+test_expect_success '...but succeeds if lazyobject is set' '
+	git -C client config core.repositoryformatversion 1 &&
+	git -C client config extensions.lazyobject \
+		"\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" &&
+	git -C client cat-file -p "$HASH"
+'
+
 test_done
diff --git a/t/t0410/lazy-object b/t/t0410/lazy-object
new file mode 100755
index 000000000..4f4a9c38a
--- /dev/null
+++ b/t/t0410/lazy-object
@@ -0,0 +1,102 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git lazyObject protocol version 1. See
+# the documentation for extensions.lazyObject in
+# Documentation/technical/repository-version.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = $ARGV[0];
+
+sub packet_bin_read {
+	my $buffer;
+	my $bytes_read = read STDIN, $buffer, 4;
+	if ( $bytes_read == 0 ) {
+
+		# EOF - Git stopped talking to us!
+		exit();
+	}
+	elsif ( $bytes_read != 4 ) {
+		die "invalid packet: '$buffer'";
+	}
+	my $pkt_size = hex($buffer);
+	if ( $pkt_size == 0 ) {
+		return ( 1, "" );
+	}
+	elsif ( $pkt_size > 4 ) {
+		my $content_size = $pkt_size - 4;
+		$bytes_read = read STDIN, $buffer, $content_size;
+		if ( $bytes_read != $content_size ) {
+			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
+		}
+		return ( 0, $buffer );
+	}
+	else {
+		die "invalid packet size: $pkt_size";
+	}
+}
+
+sub packet_txt_read {
+	my ( $res, $buf ) = packet_bin_read();
+	unless ( $buf =~ s/\n$// ) {
+		die "A non-binary line MUST be terminated by an LF.";
+	}
+	return ( $res, $buf );
+}
+
+sub packet_bin_write {
+	my $buf = shift;
+	print STDOUT sprintf( "%04x", length($buf) + 4 );
+	print STDOUT $buf;
+	STDOUT->flush();
+}
+
+sub packet_txt_write {
+	packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+	print STDOUT sprintf( "%04x", 0 );
+	STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-lazy-object-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )		 || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
+
+packet_txt_write("git-lazy-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+	if ( $command eq "get" ) {
+		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+		packet_bin_read();
+
+		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c extensions.lazyobject=false hash-object -w --stdin >/dev/null 2>&1');
+		packet_txt_write(($?) ? "status=error" : "status=success");
+		packet_flush();
+	} else {
+		die "bad command '$command'";
+	}
+}
-- 
2.14.0.rc1.383.gd1ce394fe2-goog


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan
@ 2017-07-31 21:21   ` Junio C Hamano
  2017-07-31 23:05     ` Jonathan Tan
  2017-08-08 17:13   ` Ben Peart
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2017-07-31 21:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Besides review changes, this patch set now includes my rewritten
> lazy-loading sha1_file patch, so you can now do this (excerpted from one
> of the tests):
>
>     test_create_repo server
>     test_commit -C server 1 1.t abcdefgh
>     HASH=$(git hash-object server/1.t)
>     
>     test_create_repo client
>     test_must_fail git -C client cat-file -p "$HASH"
>     git -C client config core.repositoryformatversion 1
>     git -C client config extensions.lazyobject \
>         "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
>     git -C client cat-file -p "$HASH"
>
> with fsck still working. Also, there is no need for a list of promised
> blobs, and the long-running process protocol is being used.

I do not think I read your response to my last comment on this
series, so I could be missing something large, but I am afraid that
the resulting fsck is only half as useful as the normal fsck.  I do
not see it any better than a hypothetical castrated version of fsck
that only checks the integrity of objects that appear in the local
object store, without doing any connectivity checks.

Don't get me wrong.  The integrity check on local objects you still
do is important---that is what allows us to make sure that the local
"cache" does not prevent us from going to the real source of the
remote object store by having a corrupt copy.  

But not being able to tell if a missing object is OK to be missing
(because we can get them if/as needed from elsewhere) or we lost the
sole copy of an object that we created and have not pushed out
(hence we are in deep yogurt) makes it pretty much pointless to run
"fsck", doesn't it?  It does not give us any guarantee that our
repository plus perfect network connectivity gives us an environment
to build further work on.

Or am I missing something fundamental?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 5/5] sha1_file: support loading lazy objects
  2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan
@ 2017-07-31 21:29   ` Junio C Hamano
  2017-08-08 20:20   ` Ben Peart
  1 sibling, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-07-31 21:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Teach sha1_file to invoke the command configured in
> extensions.lazyObject whenever an object is requested and unavailable.
>
> The usage of the hook can be suppressed through a flag when invoking
> has_object_file_with_flags() and other similar functions.
>
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate missing objects (without invoking the command) or be
> more efficient in invoking this command.
>
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>      regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>      need to check callers that know about the loose/packed distinction
>      and operate on both differently, and ensure that they can handle
>      the concept of objects that are neither loose nor packed)
>
> (1) is handled by the modification to sha1_object_info_extended().
>
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>  - reachable - only to find recent objects
>  - builtin/fsck - already knows about missing objects
>  - builtin/cat-file - warning message added in this commit
>
> Callers of the other functions do not need to be changed:
>  - parse_pack_index
>    - http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>    - this searches a single pack that is provided as an argument; the
>      caller already knows (through other means) that the sought object
>      is in a specific pack
>  - find_sha1_pack
>    - fast-import - appears to be an optimization to not store a
>      file if it is already in a pack
>    - http-walker - to search through a struct alt_base
>    - http-push - to search through remote packs
>  - has_sha1_pack
>    - builtin/fsck - already knows about promised objects
>    - builtin/count-objects - informational purposes only (check if loose
>      object is also packed)
>    - builtin/prune-packed - check if object to be pruned is packed (if
>      not, don't prune it)
>    - revision - used to exclude packed objects if requested by user
>    - diff - just for optimization
>
> An alternative design that I considered but rejected:
>
>  - Adding a hook whenever a packed object is requested, not on any
>    object.  That is, whenever we attempt to search the packfiles for an
>    object, if it is missing (from the packfiles and from the loose
>    object storage), to invoke the hook (which must then store it as a
>    packfile), open the packfile the hook generated, and report that the
>    object is found in that new packfile. This reduces the amount of
>    analysis needed (in that we only need to look at how packed objects
>    are handled), but requires that the hook generate packfiles (or for
>    sha1_file to pack whatever loose objects are generated), creating one
>    packfile for each missing object and potentially very many packfiles
>    that must be linearly searched. This may be tolerable now for repos
>    that only have a few missing objects (for example, repos that only
>    want to exclude large blobs), and might be tolerable in the future if
>    we have batching support for the most commonly used commands, but is
>    not tolerable now for repos that exclude a large amount of objects.
>
> Helped-by: Ben Peart <benpeart@microsoft.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---

Even though I said a hugely negative thing about the "missing
objects are always OK" butchering of fsck, I do like what this patch
does.  The interface is reasonably well isolated, and moving of the
long-running-process documentation to a standalone file is very
sensible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-07-31 21:21   ` Junio C Hamano
@ 2017-07-31 23:05     ` Jonathan Tan
  2017-08-01 17:11       ` Junio C Hamano
  0 siblings, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-07-31 23:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peartben, christian.couder

On Mon, 31 Jul 2017 14:21:56 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Besides review changes, this patch set now includes my rewritten
> > lazy-loading sha1_file patch, so you can now do this (excerpted from one
> > of the tests):
> >
> >     test_create_repo server
> >     test_commit -C server 1 1.t abcdefgh
> >     HASH=$(git hash-object server/1.t)
> >     
> >     test_create_repo client
> >     test_must_fail git -C client cat-file -p "$HASH"
> >     git -C client config core.repositoryformatversion 1
> >     git -C client config extensions.lazyobject \
> >         "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
> >     git -C client cat-file -p "$HASH"
> >
> > with fsck still working. Also, there is no need for a list of promised
> > blobs, and the long-running process protocol is being used.
> 
> I do not think I read your response to my last comment on this
> series, so I could be missing something large, but I am afraid that
> the resulting fsck is only half as useful as the normal fsck.  I do
> not see it any better than a hypothetical castrated version of fsck
> that only checks the integrity of objects that appear in the local
> object store, without doing any connectivity checks.

Sorry, I haven't replied to your last response [1]. That does sound like
a good idea, though, and probably can be extended to trees and blobs in
that we need to make sure that any object referenced from local-only
commits (calculated as you describe in [1]) can be obtained through an
object walk from a remote-tracking branch.

I haven't fully thought of the implications of things like building
commits on top of an arbitrary upstream commit (so since our upstream
commit is not a tip, the object walk from all remote-tracking branches
might not reach our upstream commit).

To try to solve that, we could use an alternate object store to store
remote objects in order to be able to find remote objects quickly
without doing a traversal, but that does not fully solve the problem,
because some information about remote object possession lies only in
their parents (for example, if we don't have a remote blob, sometimes
the only way to know that the remote has it is by having a tree
containing that blob).

In addition, this also couples the lazy object loading with either a
remote ref (or all remote refs, if we decide to consider objects from
all remote refs as potentially loadable).

I'll think about this further.

[1] https://public-inbox.org/git/xmqq379fkz4x.fsf@gitster.mtv.corp.google.com/

> Don't get me wrong.  The integrity check on local objects you still
> do is important---that is what allows us to make sure that the local
> "cache" does not prevent us from going to the real source of the
> remote object store by having a corrupt copy.  
> 
> But not being able to tell if a missing object is OK to be missing
> (because we can get them if/as needed from elsewhere) or we lost the
> sole copy of an object that we created and have not pushed out
> (hence we are in deep yogurt) makes it pretty much pointless to run
> "fsck", doesn't it?  It does not give us any guarantee that our
> repository plus perfect network connectivity gives us an environment
> to build further work on.
> 
> Or am I missing something fundamental?

Well, the fsck can still detect issues like corrupt objects (as you
mention above) and dangling heads, which might be real issues. But it is
true that it does not give you the guarantee you describe.

From a user standpoint, this might be able to be worked around by
providing a network-requiring object connectivity checking tool or by
just having the user running a build to ensure that all necessary files
are present.

Having said that, this feature will be very nice to have.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-07-31 23:05     ` Jonathan Tan
@ 2017-08-01 17:11       ` Junio C Hamano
  2017-08-01 17:45         ` Jonathan Nieder
  2017-08-02  0:19         ` Jonathan Tan
  0 siblings, 2 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-08-01 17:11 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Well, the fsck can still detect issues like corrupt objects (as you
> mention above) and dangling heads, which might be real issues. But it is
> true that it does not give you the guarantee you describe.

Which makes it pretty much useless.  The whole point of running
"fsck" is to make sure that we won't waste work by not finding
a corruption long after it was introduced and spent a lot of effort
building on top of a state that nobody can reproduce.

> From a user standpoint, this might be able to be worked around by
> providing a network-requiring object connectivity checking tool or by
> just having the user running a build to ensure that all necessary files
> are present.

I actually was hoping that you do not have to go to the network for
the checking.  And I have to say that "only the tip matters" is a
horrible cop-out that is not even a workaround.  Your users would be
served better if you honestly admit that your fsck will not be
useful when this feature is used---at least they won't be harmed by
a false expectation that "fsck" would give them some assurance,
which is not the case.

Let's step back a bit and think what already happens in the pre-
lazy-object world.  We record cut-off commits when a depth limited
clone is created in "shallow".  These essentially are promises,
saying something like:

    Rest assured that everything in the history behind these commits
    are on the other side and you can retrieve them by unshallowing.

    If you traverse from your local tips and find no missing objects
    before reaching one of these commits, then you do not have any
    local corruption you need to worry about.

the other end made to us, when the shallow clone was made.  And we
take this promise and build more commits on top, and then we adjust
these cut-off commits incrementally as we deepen our clone or make
it even shallower.  For this assurance to work, we of course need to
assume a bit more than what we assume for a complete clone, namely,
the "other side" will hold onto the history behind these, i.e. does
not remind the tips it already has shown to us, or even if it does,
the objects that are reachable from these cut-off points will
somehow always be available to us on demand.

Can we do something similar, i.e. maintain minimum set of cut-off
points and adjust that set incrementally, just sufficient to ensure
the integrity of objects locally created and not yet safely stored
away by pushing them the "other side"?

I haven't thought things through (and I know you, Ben and others
have thought much longer and harder), but I would imagine if we have
a commit object [*1*], some of whose parent commits, trees and blobs
are locally missing, and know that the commit exists on the "other
side", we know that all of these "missing" objects that are
referenced by the commit are also available from the "other side".
IOW, I suspect that the same principle "shallow" uses to give us the
integrity guarantee can be naturally extended to allow us to see if
a broken connectivity is OK.


[Footnote]

*1* The same can be said for a tag or a tree object that we know
    exist on the "other side"; they may refer, directly or
    indirectly through objects we locally have, to objects that that
    are missing locally, and as long as the starting point object
    are known to be available on the "other side", it is OK for them
    to be missing locally.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-01 17:11       ` Junio C Hamano
@ 2017-08-01 17:45         ` Jonathan Nieder
  2017-08-01 20:15           ` Junio C Hamano
  2017-08-02  0:19         ` Jonathan Tan
  1 sibling, 1 reply; 40+ messages in thread
From: Jonathan Nieder @ 2017-08-01 17:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder

Hi,

Junio C Hamano wrote:

> Can we do something similar, i.e. maintain minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them the "other side"?

This sounds like a variant on the "promises" idea (maintaining a list
of objects at the frontier) described before.  Instead of listing
blobs that the server promised, you are proposing listing trees that
the server has promised to handle all references from.

> I haven't thought things through (and I know you, Ben and others
> have thought much longer and harder), but I would imagine if we have
> a commit object [*1*], some of whose parent commits, trees and blobs
> are locally missing, and know that the commit exists on the "other
> side", we know that all of these "missing" objects that are
> referenced by the commit are also available from the "other side".
> IOW, I suspect that the same principle "shallow" uses to give us the
> integrity guarantee can be naturally extended to allow us to see if
> a broken connectivity is OK.

If we are deeply worried about this kind of broken connectivity, there
is another case to care about: the server can "promise" to serve
requests for some object (e.g., the tree pointed to by the server's
"master") and then decide it does not want to fulfill that promise
(e.g., that tree pointed to private key material and "master" was
rewound to avoid it).  In the promises model, how we do we get a fresh
understanding of what the server wants to promise now?

Earlier in this discussion of fsck, I thought you were proposing a
slightly different idea.  The idea I heard is that you want to check
connectivity for whatever you have built locally, while accepting a
relaxed guarantee for objects from upstream.  If an object is missing,
the idea would be that at least this way you know whose fault it is.
:)  (Not that there's much to do with that knowledge.)  Implementing
that by treating everything reachable from a remote-tracking branch as
"from upstream" seems natural.  But that implementation suffers from
the same problems: not all objects from upstream need be reachable
from a remote-tracking branch (e.g. after a fetch-by-object-id, or
because a remote branch can be rewound).

Both variants proposed of the promises idea also hold some promise,
but my understanding was that the cost of maintaining the promises
file (getting data to fill it, locking on update, merging in new
objects into it on update), for little benefit wasn't palatable to
Microsoft, who has been coping fine without such a file.

Jonathan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-01 17:45         ` Jonathan Nieder
@ 2017-08-01 20:15           ` Junio C Hamano
  0 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-08-01 20:15 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git, peartben, christian.couder

Jonathan Nieder <jrnieder@gmail.com> writes:

> If we are deeply worried about this kind of broken connectivity, there
> is another case to care about: the server can "promise" to serve
> requests for some object (e.g., the tree pointed to by the server's
> "master") and then decide it does not want to fulfill that promise
> (e.g., that tree pointed to private key material and "master" was
> rewound to avoid it).  

I think I've already covered that in my message (i.e. "we need to
assume more than the normal Git").  In short, it is not our problem,
but the "lazy-object" service's problem.  If developers cannot trust
the "central server", most likely owned by the organization that
employs them and forces them to offload the access to these objects
to the "central server", I think there is much larger problem there.

> In the promises model, how we do we get a fresh
> understanding of what the server wants to promise now?

Yes, that is one of the things that needs to be designed if we want
to officially support lazy-objects like structure.  We need a way to
incrementally adjust the cut-off point, below which it is the
responsibility of the "other side" to ensure that necessary objects
are available (on demand), and above which it is a local
repository's responsibility to notice corrupted and/or missing
objects.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-01 17:11       ` Junio C Hamano
  2017-08-01 17:45         ` Jonathan Nieder
@ 2017-08-02  0:19         ` Jonathan Tan
  2017-08-02 16:20           ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2017-08-02  0:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peartben, christian.couder

On Tue, 01 Aug 2017 10:11:38 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Let's step back a bit and think what already happens in the pre-
> lazy-object world.  We record cut-off commits when a depth limited
> clone is created in "shallow".  These essentially are promises,
> saying something like:
> 
>     Rest assured that everything in the history behind these commits
>     are on the other side and you can retrieve them by unshallowing.
> 
>     If you traverse from your local tips and find no missing objects
>     before reaching one of these commits, then you do not have any
>     local corruption you need to worry about.
> 
> the other end made to us, when the shallow clone was made.  And we
> take this promise and build more commits on top, and then we adjust
> these cut-off commits incrementally as we deepen our clone or make
> it even shallower.  For this assurance to work, we of course need to
> assume a bit more than what we assume for a complete clone, namely,
> the "other side" will hold onto the history behind these, i.e. does
> not remind the tips it already has shown to us, or even if it does,
> the objects that are reachable from these cut-off points will
> somehow always be available to us on demand.
> 
> Can we do something similar, i.e. maintain minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them the "other side"?

This suggestion (the "frontier" of what we have) does seem to incur less
overhead than the original promise suggestion (the "frontier" of what we
don't have), but after some in-office discussion, I'm convinced that it
might not be the case - for example, one tree (that we have) might
reference many blobs (that we don't have), but at the same time, many
trees (that we have) might have the same blob (that we don't have). And
the promise overhead was already decided to be too much - which is why
we moved away from it.

One possibility to conceptually have the same thing without the overhead
of the list is to put the obtained-from-elsewhere objects into its own
alternate object store, so that we can distinguish the two. I mentioned
this in my e-mail but rejected it, but after some more thought, this
might be sufficient - we might still need to iterate through every
object to know exactly what we can assume the remote to have, but the
"frontier" solution also needs this iteration, so we are no worse off.

Going back to the original use cases that motivated this (the monorepo
like Microsoft's repo and the large-blob repo like Android's repo), it
might be better just to disable the connectivity check when
extensions.lazyObject is set (as you mentioned). This does change the
meaning of fsck, but it may be fine since the "meaning" of the repo (a
view of another repo, and no longer a full repo) has changed too. Then
this patch set will be more about ensuring that the lazy object loader
is not inadvertently run. As future work, we could add diagnostics that,
for example, attempt a walk anyway and print a list of missing SHA-1s.

(I suspect that we will also need to disable the connectivity check for
things like "git fetch", which means that we won't be able to tell
locally if the server sent us all the objects that we requested for.
This might not be a problem, though, since the local repo already has
some measure of trust for the server.)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-02  0:19         ` Jonathan Tan
@ 2017-08-02 16:20           ` Junio C Hamano
  2017-08-02 17:38             ` Jonathan Nieder
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2017-08-02 16:20 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> One possibility to conceptually have the same thing without the overhead
> of the list is to put the obtained-from-elsewhere objects into its own
> alternate object store, so that we can distinguish the two.

Now you are talking.  Either a separate object store, or a packfile
that is specially marked as such, would work.  "Maintaining a list
of object names in a flat file is too costly" is not a valid excuse
to discard the integrity of locally created objects, without which
Git will no longer be a version control system, and your "We have to
trust the sender of objects on the other side anyway when operating
in lazy-objects mode, so devise a way that we can use to tell which
objects we _know_ the other side has" that leads to the above idea
is a good line of thought.

> I mentioned
> this in my e-mail but rejected it, but after some more thought, this
> might be sufficient - we might still need to iterate through every
> object to know exactly what we can assume the remote to have, but the
> "frontier" solution also needs this iteration, so we are no worse off.

Most importantly, this is allowed to be costly---we are doing this
not at runtime all the time, but when the user says "make sure that
I haven't lost objects and it is safe for me to build further on
what I created locally so far" by running "git fsck".

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-02 16:20           ` Junio C Hamano
@ 2017-08-02 17:38             ` Jonathan Nieder
  2017-08-02 20:51               ` Junio C Hamano
  0 siblings, 1 reply; 40+ messages in thread
From: Jonathan Nieder @ 2017-08-02 17:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder

Hi,

Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:

>> One possibility to conceptually have the same thing without the overhead
>> of the list is to put the obtained-from-elsewhere objects into its own
>> alternate object store, so that we can distinguish the two.
>
> Now you are talking.  Either a separate object store, or a packfile
> that is specially marked as such, would work.

Jonathan's not in today, so let me say a few more words about this
approach.

This approach implies a relaxed connectivity guarantee, by creating
two classes of objects:

 1. Objects that I made should satisfy the connectivity check.  They
    can point to other objects I made, objects I fetched, or (*) objects
    pointed to directly by objects I fetched.  More on (*) below.

 2. Objects that I fetched do not need to satisfy a connectivity
    check.  I can trust the server to provide objects that they point
    to when I ask for them, except in extraordinary cases like a
    credit card number that was accidentally pushed to the server and
    prompted a rewriting of history to remove it (**).

The guarantee (1) looks like it should be easy to satisfy (just like
the current connectivity guarantee where all objects are in class (1)).
I have to know about an object to point to it --- that means the
pointed-to object has to be in the object store or pointed to by
something in the object store.

The complication is in the "git gc" operation for the case (*).
Today, "git gc" uses a reachability walk to decide which objects to
remove --- an object referenced by no other object is fair game to
remove.  With (*), there is another kind of object that must not be
removed: if an object that I made, M, points to a missing/promised
object, O, pointed to by a an object I fetched, F, then I cannot prune
F unless there is another fetched object present to anchor O.

For example: suppose I have a sparse checkout and run

	git fetch origin refs/pulls/x
	git checkout -b topic FETCH_HEAD
	echo "Some great modification" >> README
	git add README
	git commit --amend

When I run "git gc", there is nothing pointing to the commit that was
pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
would naively also expect that the tree pointed to by that commit
could be pruned.  But pruning it means pruning the promise that made
it permissible to lack various blobs that my topic branch refers to
that are outside the sparse checkout area.  So "git gc" must notice
that it is not safe to prune that tree.

This feels hacky.  I prefer the promised object list over this
approach.

>                                                "Maintaining a list
> of object names in a flat file is too costly" is not a valid excuse
> to discard the integrity of locally created objects, without which
> Git will no longer be a version control system,

I am confused by this: I think that Git without a "git fsck" command
at all would still be a version control system, just not as good of
one.

Can you spell this out more?  To be clear, are you speaking as a
reviewer or as the project maintainer?  In other words, if other
reviewers are able to settle on a design that involves a relaxed
guarantee for fsck in this mode that they can agree on, does this
represent a veto meaning the patch can still not go through?

On one hand I'm grateful for the help exploring the design space, and
I think it has helped get a better understanding of the issues
involved.

On the other hand, if this is a veto then it feels very black and
white and a hard starting point to build a consensus from.  I am
worried.

[...]
>> I mentioned
>> this in my e-mail but rejected it, but after some more thought, this
>> might be sufficient - we might still need to iterate through every
>> object to know exactly what we can assume the remote to have, but the
>> "frontier" solution also needs this iteration, so we are no worse off.
>
> Most importantly, this is allowed to be costly---we are doing this
> not at runtime all the time, but when the user says "make sure that
> I haven't lost objects and it is safe for me to build further on
> what I created locally so far" by running "git fsck".

check_everything_connected is also used in some other circumstances:
e.g. when running a fetch, and when receiving a push in git
receive-pack.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-02 17:38             ` Jonathan Nieder
@ 2017-08-02 20:51               ` Junio C Hamano
  2017-08-02 22:13                 ` Jonathan Nieder
  2017-08-03 19:08                 ` Jonathan Tan
  0 siblings, 2 replies; 40+ messages in thread
From: Junio C Hamano @ 2017-08-02 20:51 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git, peartben, christian.couder

Jonathan Nieder <jrnieder@gmail.com> writes:

> Junio C Hamano wrote:
>> Jonathan Tan <jonathantanmy@google.com> writes:
>
>>> One possibility to conceptually have the same thing without the overhead
>>> of the list is to put the obtained-from-elsewhere objects into its own
>>> alternate object store, so that we can distinguish the two.
>>
>> Now you are talking.  Either a separate object store, or a packfile
>> that is specially marked as such, would work.
>
> Jonathan's not in today, so let me say a few more words about this
> approach.
>
> This approach implies a relaxed connectivity guarantee, by creating
> two classes of objects:
>
>  1. Objects that I made should satisfy the connectivity check.  They
>     can point to other objects I made, objects I fetched, or (*) objects
>     pointed to directly by objects I fetched.  More on (*) below.

Or objects that are referred to by objects I fetched.  If you
narrowly clone while omitting a subdirectory, updated a file
that is outside the subdirectory, and created a new commit, while
recording the same tree object name for the directory you do not
know its contents (becaues you didn't fetch), then it is OK for the
top-level tree of the resulting commit you created to be pointing
at the tree that represents the subdirectory you never touched.

> The complication is in the "git gc" operation for the case (*).
> Today, "git gc" uses a reachability walk to decide which objects to
> remove --- an object referenced by no other object is fair game to
> remove.  With (*), there is another kind of object that must not be
> removed: if an object that I made, M, points to a missing/promised
> object, O, pointed to by a an object I fetched, F, then I cannot prune
> F unless there is another fetched object present to anchor O.

Absolutely.  Lazy-objects support comes with certain cost and this
is one of them.  

But I do not think it is realistic to expect that you can prune
anything you fetched from the "other place" (i.e. the source
'lazy-objects' hook reads from).  After all, once they give out
objects to their clients (like us in this case), they cannot prune
it, if we take the "implicit promise" approach to avoid the cost to
transmit and maintain a separate "object list".

> For example: suppose I have a sparse checkout and run
>
> 	git fetch origin refs/pulls/x
> 	git checkout -b topic FETCH_HEAD
> 	echo "Some great modification" >> README
> 	git add README
> 	git commit --amend
>
> When I run "git gc", there is nothing pointing to the commit that was
> pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
> would naively also expect that the tree pointed to by that commit
> could be pruned.  But pruning it means pruning the promise that made
> it permissible to lack various blobs that my topic branch refers to
> that are outside the sparse checkout area.  So "git gc" must notice
> that it is not safe to prune that tree.
>
> This feels hacky.  I prefer the promised object list over this
> approach.

I think they are moral equivalents implemented differently with
different assumptions.  The example we are discussing makes an extra
assumption: In order to reduce the cost of transferring and
maintaining the list, we assume that all objects that came during
that transfer are implicitly "promised", i.e. everything behind each
of these objects will later be available on demand.  How these
objects are marked is up to the exact mechanism (my preference is to
mark the resulting packfile as special; Jon Tan's message to which
my message was a resopnse alluded to using an alternate object
store).  If you choose to maintain a separate "object list" and have
the "other side" explicitly give it, perhaps you can lift that
assumption and replace it with some other assumption that assumes
less.

> Can you spell this out more?  To be clear, are you speaking as a
> reviewer or as the project maintainer?  In other words, if other
> reviewers are able to settle on a design that involves a relaxed
> guarantee for fsck in this mode that they can agree on, does this
> represent a veto meaning the patch can still not go through?

Consider it a veto over punting without making sure that we can
later come up with a solution to give such a guarantee.  I am not
getting a feeling that "other reviewers" are even seeking a "relaxed
guarantee"---all I've seen in the thread is to give up any guarantee
and to hope for the best.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-02 20:51               ` Junio C Hamano
@ 2017-08-02 22:13                 ` Jonathan Nieder
  2017-08-03 19:08                 ` Jonathan Tan
  1 sibling, 0 replies; 40+ messages in thread
From: Jonathan Nieder @ 2017-08-02 22:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>> Can you spell this out more?  To be clear, are you speaking as a
>> reviewer or as the project maintainer?  In other words, if other
>> reviewers are able to settle on a design that involves a relaxed
>> guarantee for fsck in this mode that they can agree on, does this
>> represent a veto meaning the patch can still not go through?
>
> Consider it a veto over punting without making sure that we can
> later come up with a solution to give such a guarantee.  I am not
> getting a feeling that "other reviewers" are even seeking a "relaxed
> guarantee"---all I've seen in the thread is to give up any guarantee
> and to hope for the best.

Thank you.  That makes sense.

In my defense, one reason I had for being okay with dropping the
connectivity check in the "lazy object" setup (at a higher level than
this patch currently does it, to avoid wasted work) is that this patch
series does not include the required components to do it more properly
and previous discussions on list had pointed to some of those
components that will arrive later (the "object size cache", which
doubles as an incomplete list of promises).  But that doesn't put the
project in a good position because it isn't an explicitly spelled out
plan.

The set of other reviewers that I was hoping will weigh in at some
point is the GVFS team at Microsoft.

I'll write up a summary of the ideas discussed so far to try to get
this unblocked.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-08-02 20:51               ` Junio C Hamano
  2017-08-02 22:13                 ` Jonathan Nieder
@ 2017-08-03 19:08                 ` Jonathan Tan
  1 sibling, 0 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-08-03 19:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Nieder, git, peartben, christian.couder

On Wed, 02 Aug 2017 13:51:37 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> > The complication is in the "git gc" operation for the case (*).
> > Today, "git gc" uses a reachability walk to decide which objects to
> > remove --- an object referenced by no other object is fair game to
> > remove.  With (*), there is another kind of object that must not be
> > removed: if an object that I made, M, points to a missing/promised
> > object, O, pointed to by a an object I fetched, F, then I cannot prune
> > F unless there is another fetched object present to anchor O.
> 
> Absolutely.  Lazy-objects support comes with certain cost and this
> is one of them.  
> 
> But I do not think it is realistic to expect that you can prune
> anything you fetched from the "other place" (i.e. the source
> 'lazy-objects' hook reads from).  After all, once they give out
> objects to their clients (like us in this case), they cannot prune
> it, if we take the "implicit promise" approach to avoid the cost to
> transmit and maintain a separate "object list".

By this, do you mean that the client cannot prune anything lazily loaded
from the server? If yes, I understand that the server cannot prune
anything (for the reasons that you describe), but I don't see how that
applies to the client.

> > For example: suppose I have a sparse checkout and run
> >
> > 	git fetch origin refs/pulls/x
> > 	git checkout -b topic FETCH_HEAD
> > 	echo "Some great modification" >> README
> > 	git add README
> > 	git commit --amend
> >
> > When I run "git gc", there is nothing pointing to the commit that was
> > pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
> > would naively also expect that the tree pointed to by that commit
> > could be pruned.  But pruning it means pruning the promise that made
> > it permissible to lack various blobs that my topic branch refers to
> > that are outside the sparse checkout area.  So "git gc" must notice
> > that it is not safe to prune that tree.
> >
> > This feels hacky.  I prefer the promised object list over this
> > approach.
> 
> I think they are moral equivalents implemented differently with
> different assumptions.  The example we are discussing makes an extra
> assumption: In order to reduce the cost of transferring and
> maintaining the list, we assume that all objects that came during
> that transfer are implicitly "promised", i.e. everything behind each
> of these objects will later be available on demand.  How these
> objects are marked is up to the exact mechanism (my preference is to
> mark the resulting packfile as special; Jon Tan's message to which
> my message was a resopnse alluded to using an alternate object
> store).  If you choose to maintain a separate "object list" and have
> the "other side" explicitly give it, perhaps you can lift that
> assumption and replace it with some other assumption that assumes
> less.

Marking might be an issue if we expect the lazy loader to emit an object
after every hash, like in the current design. There would thus be one
mark per object, with overhead similar to the promise list. (Having said
that, batching is possible - I plan to optimize common cases like
checkout, and have such a patch online for an older "promised blob"
design [1].)

Overhead could be reduced by embedding the mark in both the packed and
loose objects, requiring a different format (instead of having a
separate "catalog" of marked files). This seems more complicated than
using an alternate object store, hence my suggestion.

[1] https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
  2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan
  2017-07-31 21:21   ` Junio C Hamano
@ 2017-08-08 17:13   ` Ben Peart
  1 sibling, 0 replies; 40+ messages in thread
From: Ben Peart @ 2017-08-08 17:13 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: gitster, christian.couder



On 7/31/2017 5:02 PM, Jonathan Tan wrote:
> Besides review changes, this patch set now includes my rewritten
> lazy-loading sha1_file patch, so you can now do this (excerpted from one
> of the tests):
> 
>      test_create_repo server
>      test_commit -C server 1 1.t abcdefgh
>      HASH=$(git hash-object server/1.t)
>      
>      test_create_repo client
>      test_must_fail git -C client cat-file -p "$HASH"
>      git -C client config core.repositoryformatversion 1
>      git -C client config extensions.lazyobject \
>          "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
>      git -C client cat-file -p "$HASH"
> 
> with fsck still working. Also, there is no need for a list of promised
> blobs, and the long-running process protocol is being used.
> 
> Changes from v1:
>   - added last patch that supports lazy loading
>   - clarified documentation in "introduce lazyobject extension" patch
>     (following Junio's comments [1])
> 
> As listed in the changes above, I have rewritten my lazy-loading
> sha1_file patch to no longer use the list of promises. Also, I have
> added documentation about the protocol used to (hopefully) the
> appropriate places.

Glad to see the removal of the promises.  Given the ongoing 
conversation, I'm interested to see how you are detecting locally create 
objects vs those downloaded from a server.

> 
> This is a minimal implementation, hopefully enough of a foundation to be
> built upon. In particular, I haven't added the environment variable to
> suppress lazy loading, and the lazy loading protocol only supports one
> object at a time.

We can add multiple object support to the protocol when we get to the 
point that we have code that will utilize it.

> 
> Other work
> ----------
> 
> This differs slightly from Ben Peart's patch [2] in that the
> lazy-loading functionality is provided through a configured shell
> command instead of a hook shell script. I envision commands like "git
> clone", in the future, needing to pre-configure lazy loading, and I
> think that it will be less surprising to the user if "git clone" wrote a
> default configuration instead of a default hook.

This was on my "todo" list to investigate as I've been told it can 
enable people to use taskset to set CPU affinity and get some 
significant performance wins. I'd be interested to see if it actually 
helps here at all.

> 
> This also differs from Christian Couder's patch set [3] that implement a
> larger-scale object database, in that (i) my patch set does not support
> putting objects into external databases, and (ii) my patch set requires
> the lazy loader to make the objects available in the local repo, instead
> of allowing the objects to only be stored in the external database.

This is the model we're using today so I'm confident it will meet our 
requirements.

> 
> [1] https://public-inbox.org/git/xmqqzibpn1zh.fsf@gitster.mtv.corp.google.com/
> [2] https://public-inbox.org/git/20170714132651.170708-2-benpeart@microsoft.com/
> [3] https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/
> 
> Jonathan Tan (5):
>    environment, fsck: introduce lazyobject extension
>    fsck: support refs pointing to lazy objects
>    fsck: support referenced lazy objects
>    fsck: support lazy objects as CLI argument
>    sha1_file: support loading lazy objects
> 
>   Documentation/Makefile                             |   1 +
>   Documentation/gitattributes.txt                    |  54 ++--------
>   Documentation/gitrepository-layout.txt             |   3 +
>   .../technical/long-running-process-protocol.txt    |  50 +++++++++
>   Documentation/technical/repository-version.txt     |  23 +++++
>   Makefile                                           |   1 +
>   builtin/cat-file.c                                 |   2 +
>   builtin/fsck.c                                     |  25 ++++-
>   cache.h                                            |   4 +
>   environment.c                                      |   1 +
>   lazy-object.c                                      |  80 +++++++++++++++
>   lazy-object.h                                      |  12 +++
>   object.c                                           |   7 ++
>   object.h                                           |  13 +++
>   setup.c                                            |   7 +-
>   sha1_file.c                                        |  44 +++++---
>   t/t0410-lazy-object.sh                             | 113 +++++++++++++++++++++
>   t/t0410/lazy-object                                | 102 +++++++++++++++++++
>   18 files changed, 478 insertions(+), 64 deletions(-)
>   create mode 100644 Documentation/technical/long-running-process-protocol.txt
>   create mode 100644 lazy-object.c
>   create mode 100644 lazy-object.h
>   create mode 100755 t/t0410-lazy-object.sh
>   create mode 100755 t/t0410/lazy-object
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 5/5] sha1_file: support loading lazy objects
  2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan
  2017-07-31 21:29   ` Junio C Hamano
@ 2017-08-08 20:20   ` Ben Peart
  1 sibling, 0 replies; 40+ messages in thread
From: Ben Peart @ 2017-08-08 20:20 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: gitster, christian.couder



On 7/31/2017 5:02 PM, Jonathan Tan wrote:
> Teach sha1_file to invoke the command configured in
> extensions.lazyObject whenever an object is requested and unavailable.
> 
> The usage of the hook can be suppressed through a flag when invoking
> has_object_file_with_flags() and other similar functions.
> 
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate missing objects (without invoking the command) or be
> more efficient in invoking this command.

To prevent fetch from downloading all missing objects, you will also 
need to add logic in check_connected.  The simplest model is to simply 
return 0 if repository_format_lazy_object is set.

/*
  * Running a with lazy_objects there will be objects that are
  * missing locally and we don't want to download a bunch of
  * commits, trees, and blobs just to make sure everything is
  * reachable locally so this option will skip reachablility
  * checks below that use rev-list.  This will stop the check
  * before uploadpack runs to determine if there is anything to
  * fetch.  Returning zero for the first check will also prevent the
  * uploadpack from happening.  It will also skip the check after
  * the fetch is finished to make sure all the objects where
  * downloaded in the pack file.  This will allow the fetch to
  * run and get all the latest tip commit ids for all the branches
  * in the fetch but not pull down commits, trees, or blobs via
  * upload pack.
  */
if (repository_format_lazy_object)
	return 0;

[...]
> diff --git a/sha1_file.c b/sha1_file.c
> index b60ae15f7..1785c61d8 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
>   #include "list.h"
>   #include "mergesort.h"
>   #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "lazy-object.h"
> +#include "sha1-array.h"
>   
>   #define SZ_FMT PRIuMAX
>   static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2984,6 +2989,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>   				    lookup_replace_object(sha1) :
>   				    sha1;
> +	int already_retried = 0;
>   
>   	if (!oi)
>   		oi = &blank_oi;
> @@ -3008,30 +3014,38 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   		}
>   	}
>   
> -	if (!find_pack_entry(real, &e)) {
> -		/* Most likely it's a loose object. */
> -		if (!sha1_loose_object_info(real, oi, flags)) {
> -			oi->whence = OI_LOOSE;
> -			return 0;
> -		}
> +retry:
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;
>   
> -		/* Not a loose object; someone else may have just packed it. */
> -		if (flags & OBJECT_INFO_QUICK) {
> -			return -1;
> -		} else {
> -			reprepare_packed_git();
> -			if (!find_pack_entry(real, &e))
> -				return -1;
> -		}
> +	/* Most likely it's a loose object. */
> +	if (!sha1_loose_object_info(real, oi, flags)) {
> +		oi->whence = OI_LOOSE;
> +		return 0;
>   	}
>   
> +	/* Not a loose object; someone else may have just packed it. */
> +	reprepare_packed_git();
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;

Same feedback as before.  I like to avoid using goto's as flow control 
other than in error handling.

Also, this patch looses the OBJECT_INFO_QUICK logic which could be restored.

[...]

> diff --git a/t/t0410/lazy-object b/t/t0410/lazy-object
> new file mode 100755
> index 000000000..4f4a9c38a
> --- /dev/null
> +++ b/t/t0410/lazy-object
> @@ -0,0 +1,102 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git lazyObject protocol version 1. See
> +# the documentation for extensions.lazyObject in
> +# Documentation/technical/repository-version.txt
> +#
> +# Allows you to test the ability for blobs to be pulled from a host git repo
> +# "on demand."  Called when git needs a blob it couldn't find locally due to
> +# a lazy clone that only cloned the commits and trees.
> +#
> +# Please note, this sample is a minimal skeleton. No proper error handling
> +# was implemented.
> +
> +use strict;
> +use warnings;
> +
> +#
> +# Point $DIR to the folder where your host git repo is located so we can pull
> +# missing objects from it
> +#
> +my $DIR = $ARGV[0];
> +

At some point, this should be based on the refactored pkt_* functions 
currently contained in the ObjectDB patch series.

> +sub packet_bin_read {
> +	my $buffer;
> +	my $bytes_read = read STDIN, $buffer, 4;
> +	if ( $bytes_read == 0 ) {
> +
> +		# EOF - Git stopped talking to us!
> +		exit();
> +	}
> +	elsif ( $bytes_read != 4 ) {
> +		die "invalid packet: '$buffer'";
> +	}
> +	my $pkt_size = hex($buffer);
> +	if ( $pkt_size == 0 ) {
> +		return ( 1, "" );
> +	}
> +	elsif ( $pkt_size > 4 ) {
> +		my $content_size = $pkt_size - 4;
> +		$bytes_read = read STDIN, $buffer, $content_size;
> +		if ( $bytes_read != $content_size ) {
> +			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +		}
> +		return ( 0, $buffer );
> +	}
> +	else {
> +		die "invalid packet size: $pkt_size";
> +	}
> +}
> +
> +sub packet_txt_read {
> +	my ( $res, $buf ) = packet_bin_read();
> +	unless ( $buf =~ s/\n$// ) {
> +		die "A non-binary line MUST be terminated by an LF.";
> +	}
> +	return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +	my $buf = shift;
> +	print STDOUT sprintf( "%04x", length($buf) + 4 );
> +	print STDOUT $buf;
> +	STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +	packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +	print STDOUT sprintf( "%04x", 0 );
> +	STDOUT->flush();
> +}
> +
> +( packet_txt_read() eq ( 0, "git-lazy-object-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=1" ) )		 || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
> +
> +packet_txt_write("git-lazy-object-server");
> +packet_txt_write("version=1");
> +packet_flush();
> +
> +( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
> +
> +packet_txt_write("capability=get");
> +packet_flush();
> +
> +while (1) {
> +	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
> +
> +	if ( $command eq "get" ) {
> +		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
> +		packet_bin_read();
> +
> +		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c extensions.lazyobject=false hash-object -w --stdin >/dev/null 2>&1');
> +		packet_txt_write(($?) ? "status=error" : "status=success");
> +		packet_flush();
> +	} else {
> +		die "bad command '$command'";
> +	}
> +}
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-08-08 20:20 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan
2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
2017-07-27 18:55   ` Junio C Hamano
2017-07-28 13:20     ` Ben Peart
2017-07-28 23:50     ` Jonathan Tan
2017-07-29  0:21       ` Junio C Hamano
2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan
2017-07-27 18:59   ` Junio C Hamano
2017-07-27 23:50     ` Jonathan Tan
2017-07-28 13:29       ` Ben Peart
2017-07-28 20:08         ` [PATCH] tests: ensure fsck fails on corrupt packfiles Jonathan Tan
2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan
2017-07-27 19:17   ` Junio C Hamano
2017-07-27 23:50     ` Jonathan Tan
2017-07-29 16:04   ` Junio C Hamano
2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan
2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson
2017-07-27  0:24   ` Stefan Beller
2017-07-27 17:25   ` Jonathan Tan
2017-07-28 13:40     ` Ben Peart
2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan
2017-07-31 21:21   ` Junio C Hamano
2017-07-31 23:05     ` Jonathan Tan
2017-08-01 17:11       ` Junio C Hamano
2017-08-01 17:45         ` Jonathan Nieder
2017-08-01 20:15           ` Junio C Hamano
2017-08-02  0:19         ` Jonathan Tan
2017-08-02 16:20           ` Junio C Hamano
2017-08-02 17:38             ` Jonathan Nieder
2017-08-02 20:51               ` Junio C Hamano
2017-08-02 22:13                 ` Jonathan Nieder
2017-08-03 19:08                 ` Jonathan Tan
2017-08-08 17:13   ` Ben Peart
2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan
2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan
2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan
2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan
2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan
2017-07-31 21:29   ` Junio C Hamano
2017-08-08 20:20   ` Ben Peart

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.