git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
@ 2017-07-11 19:48 Jonathan Tan
  2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
                   ` (8 more replies)
  0 siblings, 9 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-11 19:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

These patches are part of a set of patches implementing partial clone,
as you can see here:

https://github.com/jonathantanmy/git/tree/partialclone

In that branch, clone with batch checkout works, as you can see in the
README. The code and tests are generally done, but some patches are
still missing documentation and commit messages.

These 3 patches implement the foundational concept - formerly known as
"missing blobs" in the "missing blob manifest", I decided to call them
"promised blobs". The repo knows their object names and sizes. It also
does not have the blobs themselves, but can be configured to know how to
fetch them.

An older version of these patches was sent as a single demonstration
patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
only one file containing missing blob information. I have made that
suggested change in this version.

One thing remaining is to add a repository extension [2] so that older
versions of Git fail immediately instead of trying to read missing
blobs, but I thought I'd send these first in order to get some initial
feedback.

[1] https://public-inbox.org/git/cover.1497035376.git.jonathantanmy@google.com/
[2] Documentation/technical/repository-version.txt

Jonathan Tan (3):
  promised-blob, fsck: introduce promised blobs
  sha1-array: support appending unsigned char hash
  sha1_file: add promised blob hook support

 Documentation/config.txt               |   8 ++
 Documentation/gitrepository-layout.txt |   8 ++
 Makefile                               |   1 +
 builtin/cat-file.c                     |   9 ++
 builtin/fsck.c                         |  13 +++
 promised-blob.c                        | 170 +++++++++++++++++++++++++++++++++
 promised-blob.h                        |  27 ++++++
 sha1-array.c                           |   7 ++
 sha1-array.h                           |   1 +
 sha1_file.c                            |  44 ++++++---
 t/t3907-promised-blob.sh               |  65 +++++++++++++
 t/test-lib-functions.sh                |   6 ++
 12 files changed, 345 insertions(+), 14 deletions(-)
 create mode 100644 promised-blob.c
 create mode 100644 promised-blob.h
 create mode 100755 t/t3907-promised-blob.sh

-- 
2.13.2.932.g7449e964c-goog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
@ 2017-07-11 19:48 ` Jonathan Tan
  2017-07-11 22:02   ` Stefan Beller
  2017-07-12 17:29   ` Jeff Hostetler
  2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-11 19:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Currently, Git does not support repos with very large numbers of blobs
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, introduce the concept of
promised blobs. Each Git repo can contain a list of promised blobs and
their sizes at $GIT_DIR/objects/promisedblob. This patch contains
functions to query them; functions for creating and modifying that file
will be introduced in later patches.

A repository that is missing a blob but has that blob promised is not
considered to be in error, so also teach fsck this.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Makefile                 |  1 +
 builtin/fsck.c           | 13 +++++++
 promised-blob.c          | 95 ++++++++++++++++++++++++++++++++++++++++++++++++
 promised-blob.h          | 14 +++++++
 t/t3907-promised-blob.sh | 29 +++++++++++++++
 t/test-lib-functions.sh  |  6 +++
 6 files changed, 158 insertions(+)
 create mode 100644 promised-blob.c
 create mode 100644 promised-blob.h
 create mode 100755 t/t3907-promised-blob.sh

diff --git a/Makefile b/Makefile
index 9c9c42f8f..e96163269 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
+LIB_OBJS += promised-blob.o
 LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 99dea7adf..7454be7f1 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "promised-blob.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -223,6 +224,9 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
+		if (obj->type == OBJ_BLOB &&
+		    is_promised_blob(&obj->oid, NULL))
+			return;
 		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
@@ -642,6 +646,13 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
 	return 0;
 }
 
+static int mark_promised_blob_for_connectivity(const struct object_id *oid,
+					       void *data)
+{
+	mark_object_for_connectivity(oid);
+	return 0;
+}
+
 static char const * const fsck_usage[] = {
 	N_("git fsck [<options>] [<object>...]"),
 	NULL
@@ -701,6 +712,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	if (connectivity_only) {
 		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
 		for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
+		for_each_promised_blob(mark_promised_blob_for_connectivity,
+				       NULL);
 	} else {
 		fsck_object_dir(get_object_directory());
 
diff --git a/promised-blob.c b/promised-blob.c
new file mode 100644
index 000000000..493808ed2
--- /dev/null
+++ b/promised-blob.c
@@ -0,0 +1,95 @@
+#include "cache.h"
+#include "promised-blob.h"
+#include "sha1-lookup.h"
+#include "strbuf.h"
+
+#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 8)
+/*
+ * A mmap-ed byte array of size (missing_blob_nr * ENTRY_SIZE). Each
+ * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised blob and its
+ * 64-bit size in network byte order. The entries are sorted in ascending SHA-1
+ * order.
+ */
+static char *promised_blobs;
+static int64_t promised_blob_nr = -1;
+
+static void prepare_promised_blobs(void)
+{
+	char *filename;
+	int fd;
+	struct stat st;
+
+	if (promised_blob_nr >= 0)
+		return;
+
+	if (getenv("GIT_IGNORE_PROMISED_BLOBS")) {
+		promised_blob_nr = 0;
+		return;
+	}
+	
+	filename = xstrfmt("%s/promisedblob", get_object_directory());
+	fd = git_open(filename);
+	if (fd < 0) {
+		if (errno == ENOENT) {
+			promised_blob_nr = 0;
+			goto cleanup;
+		}
+		perror("prepare_promised_blobs");
+		die("Could not open %s", filename);
+	}
+	if (fstat(fd, &st)) {
+		perror("prepare_promised_blobs");
+		die("Could not stat %s", filename);
+	}
+	if (st.st_size == 0) {
+		promised_blob_nr = 0;
+		goto cleanup;
+	}
+	if (st.st_size % ENTRY_SIZE) {
+		die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
+	}
+
+	promised_blobs = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	promised_blob_nr = st.st_size / ENTRY_SIZE;
+
+cleanup:
+	free(filename);
+	if (fd >= 0)
+		close(fd);
+}
+
+int is_promised_blob(const struct object_id *oid, unsigned long *size)
+{
+	int result;
+
+	prepare_promised_blobs();
+	result = sha1_entry_pos(promised_blobs, ENTRY_SIZE, 0, 0,
+				promised_blob_nr, promised_blob_nr, oid->hash);
+	if (result >= 0) {
+		if (size) {
+			uint64_t size_nbo;
+			char *sizeptr = promised_blobs +
+					result * ENTRY_SIZE + GIT_SHA1_RAWSZ;
+			memcpy(&size_nbo, sizeptr, sizeof(size_nbo));
+			*size = ntohll(size_nbo);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+int for_each_promised_blob(each_promised_blob_fn cb, void *data)
+{
+	struct object_id oid;
+	int i, r;
+
+	prepare_promised_blobs();
+	for (i = 0; i < promised_blob_nr; i++) {
+		memcpy(oid.hash, &promised_blobs[i * ENTRY_SIZE],
+		       GIT_SHA1_RAWSZ);
+		r = cb(&oid, data);
+		if (r)
+			return r;
+	}
+	return 0;
+}
diff --git a/promised-blob.h b/promised-blob.h
new file mode 100644
index 000000000..a303ea1ff
--- /dev/null
+++ b/promised-blob.h
@@ -0,0 +1,14 @@
+#ifndef PROMISED_BLOB_H
+#define PROMISED_BLOB_H
+
+/*
+ * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
+ * returns its size.
+ */
+extern int is_promised_blob(const struct object_id *oid,
+			    unsigned long *size);
+
+typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
+int for_each_promised_blob(each_promised_blob_fn, void *);
+
+#endif
diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
new file mode 100755
index 000000000..827072004
--- /dev/null
+++ b/t/t3907-promised-blob.sh
@@ -0,0 +1,29 @@
+#!/bin/sh
+
+test_description='promised blobs'
+
+. ./test-lib.sh
+
+test_expect_success 'fsck fails on missing blobs' '
+	rm -rf repo &&
+
+	git init repo &&
+	test_commit -C repo 1 &&
+	HASH=$(git hash-object repo/1.t) &&
+
+	rm repo/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if it is a promised blob' '
+	printf "%s%016x" "$HASH" "$(wc -c <repo/1.t)" |
+		hex_pack >repo/.git/objects/promisedblob &&
+	git -C repo fsck
+'
+
+test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
+	GIT_IGNORE_PROMISED_BLOBS=1 test_must_fail git -C repo fsck &&
+	unset GIT_IGNORE_PROMISED_BLOBS
+'
+
+test_done
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index db622c355..1ebdd2d04 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1018,3 +1018,9 @@ nongit () {
 		"$@"
 	)
 }
+
+# Converts big-endian pairs of hexadecimal digits into bytes. For example,
+# "printf 61620d0a | hex_pack" results in "ab\r\n".
+hex_pack () {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
-- 
2.13.2.932.g7449e964c-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 2/3] sha1-array: support appending unsigned char hash
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
  2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
@ 2017-07-11 19:48 ` Jonathan Tan
  2017-07-11 22:06   ` Stefan Beller
  2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-11 19:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

In a subsequent patch, sha1_file will need to append object names in the
form of "unsigned char *" to oid arrays. Teach sha1-array support for
that.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1-array.c | 7 +++++++
 sha1-array.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf84..6e0e35391 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -9,6 +9,13 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
 	array->sorted = 0;
 }
 
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1)
+{
+	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
+	hashcpy(array->oid[array->nr++].hash, sha1);
+	array->sorted = 0;
+}
+
 static int void_hashcmp(const void *a, const void *b)
 {
 	return oidcmp(a, b);
diff --git a/sha1-array.h b/sha1-array.h
index 04b075633..3479959e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -11,6 +11,7 @@ struct oid_array {
 #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
 
 void oid_array_append(struct oid_array *array, const struct object_id *oid);
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1);
 int oid_array_lookup(struct oid_array *array, const struct object_id *oid);
 void oid_array_clear(struct oid_array *array);
 
-- 
2.13.2.932.g7449e964c-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 3/3] sha1_file: add promised blob hook support
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
  2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
  2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
@ 2017-07-11 19:48 ` Jonathan Tan
  2017-07-11 22:38   ` Stefan Beller
  2017-07-12 17:40   ` Ben Peart
  2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-11 19:48 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Teach sha1_file to invoke a hook whenever a blob is requested and
unavailable but is promised. The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate promised blobs (without invoking the hook) or be more
efficient in invoking the promised blob hook.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
     regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about promised blobs
 - builtin/cat-file - fixed in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised blobs
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed blob is requested, not on any blob.
   That is, whenever we attempt to search the packfiles for a blob, if
   it is missing (from the packfiles and from the loose object storage),
   to invoke the hook (which must then store it as a packfile), open the
   packfile the hook generated, and report that the blob is found in
   that new packfile. This reduces the amount of analysis needed (in
   that we only need to look at how packed blobs are handled), but
   requires that the hook generate packfiles (or for sha1_file to pack
   whatever loose objects are generated), creating one packfile for each
   missing blob and potentially very many packfiles that must be
   linearly searched. This may be tolerable now for repos that only have
   a few missing blobs (for example, repos that only want to exclude
   large blobs), and might be tolerable in the future if we have
   batching support for the most commonly used commands, but is not
   tolerable now for repos that exclude a large amount of blobs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt               |  8 ++++
 Documentation/gitrepository-layout.txt |  8 ++++
 builtin/cat-file.c                     |  9 ++++
 promised-blob.c                        | 75 ++++++++++++++++++++++++++++++++++
 promised-blob.h                        | 13 ++++++
 sha1_file.c                            | 44 +++++++++++++-------
 t/t3907-promised-blob.sh               | 36 ++++++++++++++++
 7 files changed, 179 insertions(+), 14 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab..c293ac921 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.promisedBlobCommand::
+	If set, whenever a blob in the local repo is attempted to be read, but
+	is both missing and a promised blob, invoke this shell command to
+	generate or obtain that blob before reporting an error. This shell
+	command should take one or more hashes, each terminated by a newline,
+	as standard input, and (if successful) should write the corresponding
+	objects to the local repo (packed or loose).
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index f51ed4e37..0ef9312fa 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
 `objects/info/alternates` points at the object stores it
 borrows from.
 +
+. You could have blobs that are merely promised by another source. When
+Git requires those blobs, it will invoke the command in the
+`core.promisedBlobCommand` configuration variable.
++
 This directory is ignored if $GIT_COMMON_DIR is set and
 "$GIT_COMMON_DIR/objects" will be used instead.
 
@@ -91,6 +95,10 @@ objects/info/http-alternates::
 	this object store borrows objects from, to be used when
 	the repository is fetched over HTTP.
 
+objects/promisedblob::
+	This file records the sha1 object names and sizes of promised
+	blobs.
+
 refs::
 	References are stored in subdirectories of this
 	directory.  The 'git prune' command knows to preserve
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 96b786e48..3df00bf91 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -12,6 +12,7 @@
 #include "streaming.h"
 #include "tree-walk.h"
 #include "sha1-array.h"
+#include "promised-blob.h"
 
 struct batch_options {
 	int enabled;
@@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
 	return 0;
 }
 
+static int batch_promised_blob(const struct object_id *oid,
+			       void *data)
+{
+	oid_array_append(data, oid);
+	return 0;
+}
+
 static int batch_objects(struct batch_options *opt)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
 
 		for_each_loose_object(batch_loose_object, &sa, 0);
 		for_each_packed_object(batch_packed_object, &sa, 0);
+		for_each_promised_blob(batch_promised_blob, &sa);
 
 		cb.opt = opt;
 		cb.expand = &data;
diff --git a/promised-blob.c b/promised-blob.c
index 493808ed2..0fe04daf2 100644
--- a/promised-blob.c
+++ b/promised-blob.c
@@ -2,6 +2,9 @@
 #include "promised-blob.h"
 #include "sha1-lookup.h"
 #include "strbuf.h"
+#include "run-command.h"
+#include "sha1-array.h"
+#include "config.h"
 
 #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 8)
 /*
@@ -93,3 +96,75 @@ int for_each_promised_blob(each_promised_blob_fn cb, void *data)
 	}
 	return 0;
 }
+
+static char *promised_blob_command;
+
+static int promised_blob_config(const char *conf_key, const char *value,
+				void *cb)
+{
+	if (!strcmp(conf_key, "core.promisedblobcommand")) {
+		promised_blob_command = xstrdup(value);
+	}
+	return 0;
+}
+
+static void ensure_promised_blob_configured(void)
+{
+	static int configured;
+	if (configured)
+		return;
+	git_config(promised_blob_config, NULL);
+	configured = 1;
+}
+
+int request_promised_blobs(const struct oid_array *blobs)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	const char *argv[] = {NULL, NULL};
+	const char *env[] = {"GIT_IGNORE_PROMISED_BLOBS=1", NULL};
+	int blobs_requested = 0;
+	int i;
+
+	for (i = 0; i < blobs->nr; i++) {
+		if (is_promised_blob(&blobs->oid[i], NULL))
+			break;
+	}
+
+	if (i == blobs->nr)
+		/* Nothing to fetch */
+		return 0;
+
+	ensure_promised_blob_configured();
+	if (!promised_blob_command)
+		die("some promised blobs need to be fetched, but\n"
+		    "no core.promisedblobcommand configured");
+
+	argv[0] = promised_blob_command;
+	cp.argv = argv;
+	cp.env = env;
+	cp.use_shell = 1;
+	cp.in = -1;
+
+	if (start_command(&cp))
+		die("failed to start <%s>", promised_blob_command);
+
+	for (; i < blobs->nr; i++) {
+		if (is_promised_blob(&blobs->oid[i], NULL)) {
+			write_in_full(cp.in, oid_to_hex(&blobs->oid[i]),
+				      GIT_SHA1_HEXSZ);
+			write_in_full(cp.in, "\n", 1);
+			blobs_requested++;
+		}
+	}
+	close(cp.in);
+
+	if (finish_command(&cp))
+		die("failed to finish <%s>", promised_blob_command);
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+	return blobs_requested;
+}
diff --git a/promised-blob.h b/promised-blob.h
index a303ea1ff..34b1d3f9e 100644
--- a/promised-blob.h
+++ b/promised-blob.h
@@ -1,6 +1,8 @@
 #ifndef PROMISED_BLOB_H
 #define PROMISED_BLOB_H
 
+#include "sha1-array.h"
+
 /*
  * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
  * returns its size.
@@ -11,4 +13,15 @@ extern int is_promised_blob(const struct object_id *oid,
 typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
 int for_each_promised_blob(each_promised_blob_fn, void *);
 
+/*
+ * If any of the given blobs are promised blobs, invokes
+ * core.promisedblobcommand with those blobs and returns the number of blobs
+ * requested. No check is made as to whether the invocation actually populated
+ * the repository with the promised blobs.
+ *
+ * If none of the given blobs are promised blobs, this function does not invoke
+ * anything and returns 0.
+ */
+int request_promised_blobs(const struct oid_array *blobs);
+
 #endif
diff --git a/sha1_file.c b/sha1_file.c
index 5862386cd..5778a6dbd 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,11 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
+#include "promised-blob.h"
+#include "sha1-array.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
+	int already_retried = 0;
 
 	if (!oi)
 		oi = &blank_oi;
@@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
-			oi->whence = OI_LOOSE;
-			return 0;
-		}
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
-			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
+	/* Most likely it's a loose object. */
+	if (!sha1_loose_object_info(real, oi, flags)) {
+		oi->whence = OI_LOOSE;
+		return 0;
+	}
+
+	/* Not a loose object; someone else may have just packed it. */
+	reprepare_packed_git();
+	if (find_pack_entry(real, &e))
+		goto found_packed;
+
+	/* Check if it is a promised blob */
+	if (!already_retried) {
+		struct oid_array promised = OID_ARRAY_INIT;
+		oid_array_append_sha1(&promised, real);
+		if (request_promised_blobs(&promised)) {
+			already_retried = 1;
+			goto retry;
 		}
 	}
 
+	return -1;
+
+found_packed:
 	if (oi == &blank_oi)
 		/*
 		 * We know that the caller doesn't actually need the
 		 * information below, so return early.
 		 */
 		return 0;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
index 827072004..778afd1f6 100755
--- a/t/t3907-promised-blob.sh
+++ b/t/t3907-promised-blob.sh
@@ -26,4 +26,40 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
 	unset GIT_IGNORE_PROMISED_BLOBS
 '
 
+test_expect_success 'sha1_object_info_extended (through git cat-file)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 1.t abcdefgh &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	git -C client config core.promisedblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout |
+		 git unpack-objects" &&
+	
+	test_must_fail git -C client cat-file -p "$HASH"
+'
+
+test_expect_success '...succeeds if it is a promised blob' '
+	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
+		hex_pack >client/.git/objects/promisedblob &&
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'cat-file --batch-all-objects with promised blobs' '
+	rm -rf client &&
+
+	git init client &&
+	git -C client config core.promisedblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout |
+		 git unpack-objects" &&
+	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
+		hex_pack >client/.git/objects/promisedblob &&
+
+	# Verify that the promised blob is printed
+	git -C client cat-file --batch --batch-all-objects | tee out |
+		grep abcdefgh
+'
+
 test_done
-- 
2.13.2.932.g7449e964c-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
@ 2017-07-11 22:02   ` Stefan Beller
  2017-07-19 23:37     ` Jonathan Tan
  2017-07-12 17:29   ` Jeff Hostetler
  1 sibling, 1 reply; 46+ messages in thread
From: Stefan Beller @ 2017-07-11 22:02 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Currently, Git does not support repos with very large numbers of blobs
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every blob referenced by a tree object is available
> somewhere in the repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised blobs. Each Git repo can contain a list of promised blobs and
> their sizes at $GIT_DIR/objects/promisedblob. This patch contains
> functions to query them; functions for creating and modifying that file
> will be introduced in later patches.
>
> A repository that is missing a blob but has that blob promised is not
> considered to be in error, so also teach fsck this.

Here I wondered what this file looks like, in a later patch you
add documentation:

  +objects/promisedblob::
  +       This file records the sha1 object names and sizes of promised
  +       blobs.
  +

but that leaves more fundamental questions:
* Is that a list of sha1s, separated by LF? (CRLF? How would Windows
  users interact with it, are they supposed to ever modify this file?)
* Similar to .git/packed-refs, would we want to have a first line
  that has some sort of comment?
* In the first question I assumed a linear list, will that be a sorted list,
  or do we want to have some fancy data structure here?
  (critbit tree, bloom filter)
* is the contents in ASCII or binary? (What are the separators?)
* In the future I presume we'd want to quickly answer "Is X in the
  promised blobs list?" so would it be possible (in the future) to
  improve the searching e.g. binary search?
* ... I'll read on to see my questions answered, but I'd guess
  others would prefer to see it in the docs. :)


> + * A mmap-ed byte array of size (missing_blob_nr * ENTRY_SIZE). Each
> + * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised blob and its
> + * 64-bit size in network byte order. The entries are sorted in ascending SHA-1
> + * order.

This seems to be the answer to the docs. :)
So binary representation, the size variable takes a fixed amount, such
that the n-th element in the file is at n * ENTRY_SIZE.


> +       if (st.st_size % ENTRY_SIZE) {
> +               die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
> +       }

So it looks as if the file format is an array of ENTRY_SIZE.

Similar to other files, would we want to prefix the file with
a 4 letter magic number and a version number?

> +       prepare_promised_blobs();
> +       result = sha1_entry_pos(promised_blobs, ENTRY_SIZE, 0, 0,
> +                               promised_blob_nr, promised_blob_nr, oid->hash);

IIRC, this is a binary search.

> +int for_each_promised_blob(each_promised_blob_fn cb, void *data)
> +{
> +       struct object_id oid;
> +       int i, r;
> +
> +       prepare_promised_blobs();
> +       for (i = 0; i < promised_blob_nr; i++) {
> +               memcpy(oid.hash, &promised_blobs[i * ENTRY_SIZE],
> +                      GIT_SHA1_RAWSZ);
> +               r = cb(&oid, data);
> +               if (r)
> +                       return r;
> +       }
> +       return 0;
> +}
> diff --git a/promised-blob.h b/promised-blob.h
> new file mode 100644
> index 000000000..a303ea1ff
> --- /dev/null
> +++ b/promised-blob.h
> @@ -0,0 +1,14 @@
> +#ifndef PROMISED_BLOB_H
> +#define PROMISED_BLOB_H
> +
> +/*
> + * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
> + * returns its size.
> + */
> +extern int is_promised_blob(const struct object_id *oid,
> +                           unsigned long *size);
> +
> +typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
> +int for_each_promised_blob(each_promised_blob_fn, void *);
> +
> +#endif
> diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
> new file mode 100755
> index 000000000..827072004
> --- /dev/null
> +++ b/t/t3907-promised-blob.sh
> @@ -0,0 +1,29 @@
> +#!/bin/sh
> +
> +test_description='promised blobs'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'fsck fails on missing blobs' '
> +       rm -rf repo &&

We should not need to delete the repo here, this is the first test?

> +       rm repo/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&

Later down the road, do we want to have a
(plumbing) command to move an object from
standard blob to promised blob (as of now I'd think
it would perform this rm call as well as an insert into
the promised blobs file) ?
(Well, we'd also have to think about how to get objects
out of a pack)

With such a command you can easily write your own custom
filter to free up blobs again.


> +       test_must_fail git -C repo fsck

    test_i18n_grep missing out ?

maybe, too? (Maybe that is already tested elsewhere,
so no need for it)


> +'
> +
> +test_expect_success '...but succeeds if it is a promised blob' '
> +       printf "%s%016x" "$HASH" "$(wc -c <repo/1.t)" |
> +               hex_pack >repo/.git/objects/promisedblob &&
> +       git -C repo fsck
> +'
> +
> +test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
> +       GIT_IGNORE_PROMISED_BLOBS=1 test_must_fail git -C repo fsck &&
> +       unset GIT_IGNORE_PROMISED_BLOBS
> +'

This patch makes sense, so far.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/3] sha1-array: support appending unsigned char hash
  2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
@ 2017-07-11 22:06   ` Stefan Beller
  2017-07-19 23:56     ` Jonathan Tan
  0 siblings, 1 reply; 46+ messages in thread
From: Stefan Beller @ 2017-07-11 22:06 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> In a subsequent patch, sha1_file will need to append object names in the
> form of "unsigned char *" to oid arrays. Teach sha1-array support for
> that.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>

This breaks the oid/sha1 barrier?

I would have expected the caller to do a

  oid_array_append_oid(&array, sha1_to_oid(sha1));

with sha1_to_oid working off some static memory, such that the
call to oid_array_append_oid (which we have as oid_array_append)
is just as cheap?

Could you say more in the commit message on why we collude
sha1 and oids here?

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/3] sha1_file: add promised blob hook support
  2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
@ 2017-07-11 22:38   ` Stefan Beller
  2017-07-12 17:40   ` Ben Peart
  1 sibling, 0 replies; 46+ messages in thread
From: Stefan Beller @ 2017-07-11 22:38 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Teach sha1_file to invoke a hook whenever a blob is requested and
> unavailable but is promised. The hook is a shell command that can be
> configured through "git config"; this hook takes in a list of hashes and
> writes (if successful) the corresponding objects to the repo's local
> storage.
>
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate promised blobs (without invoking the hook) or be more
> efficient in invoking the promised blob hook.
>
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>      regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>      need to check callers that know about the loose/packed distinction
>      and operate on both differently, and ensure that they can handle
>      the concept of objects that are neither loose nor packed)
>
> (1) is handled by the modification to sha1_object_info_extended().
>
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>  - reachable - only to find recent objects
>  - builtin/fsck - already knows about promised blobs
>  - builtin/cat-file - fixed in this commit
>
> Callers of the other functions do not need to be changed:
>  - parse_pack_index
>    - http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>    - this searches a single pack that is provided as an argument; the
>      caller already knows (through other means) that the sought object
>      is in a specific pack
>  - find_sha1_pack
>    - fast-import - appears to be an optimization to not store a
>      file if it is already in a pack
>    - http-walker - to search through a struct alt_base
>    - http-push - to search through remote packs
>  - has_sha1_pack
>    - builtin/fsck - already knows about promised blobs
>    - builtin/count-objects - informational purposes only (check if loose
>      object is also packed)
>    - builtin/prune-packed - check if object to be pruned is packed (if
>      not, don't prune it)
>    - revision - used to exclude packed objects if requested by user
>    - diff - just for optimization
>
> An alternative design that I considered but rejected:
>
>  - Adding a hook whenever a packed blob is requested, not on any blob.
>    That is, whenever we attempt to search the packfiles for a blob, if
>    it is missing (from the packfiles and from the loose object storage),
>    to invoke the hook (which must then store it as a packfile), open the
>    packfile the hook generated, and report that the blob is found in
>    that new packfile. This reduces the amount of analysis needed (in
>    that we only need to look at how packed blobs are handled), but
>    requires that the hook generate packfiles (or for sha1_file to pack
>    whatever loose objects are generated), creating one packfile for each
>    missing blob and potentially very many packfiles that must be
>    linearly searched. This may be tolerable now for repos that only have
>    a few missing blobs (for example, repos that only want to exclude
>    large blobs), and might be tolerable in the future if we have
>    batching support for the most commonly used commands, but is not
>    tolerable now for repos that exclude a large amount of blobs.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>

>
> +core.promisedBlobCommand::
> +       If set, whenever a blob in the local repo is attempted to be read, but
> +       is both missing and a promised blob, invoke this shell command to
> +       generate or obtain that blob before reporting an error. This shell
> +       command should take one or more hashes, each terminated by a newline,
> +       as standard input, and (if successful) should write the corresponding
> +       objects to the local repo (packed or loose).

Any git command invoked here behaves differently as
GIT_IGNORE_PROMISED_BLOBS is set.
GIT_IGNORE_PROMISED_BLOBS needs documentation.


> +int request_promised_blobs(const struct oid_array *blobs)

> +       if (i == blobs->nr)
> +               /* Nothing to fetch */

s/to fetch/to get/ ? Might be to nitpicky.

So this function takes an array of blobs and obtains any that
are promised.

> +
> +       argv[0] = promised_blob_command;
> +       cp.argv = argv;
> +       cp.env = env;

I had the impression that we'd favor .args and .env_array
nowdays, such that we'd not need to construct the NULL
terminated array ourselves. (e.g. whenever we'd change argv,
we'd need to make sure it contains enough NULL at the end)

    argv_array_push(&cp.args, promised_blob_command);
    argv_array_push(&cp.env_array, \
        "GIT_IGNORE_PROMISED_BLOBS=1");

would have the same effect.


> +/*
> + * If any of the given blobs are promised blobs, invokes
> + * core.promisedblobcommand with those blobs and returns the number of blobs
> + * requested. No check is made as to whether the invocation actually populated
> + * the repository with the promised blobs.
> + *
> + * If none of the given blobs are promised blobs, this function does not invoke
> + * anything and returns 0.

Thanks for the docs here.
  offtopic: As a general thing for our patches, can we configure
  (or even convince Git in general), that headers ought to be sent *before*
  its accompanying source? I think that would help reviewers like me, who
  tend to start reading linearly and then giving random thoughts, because the
  header prepares the reviewer for the source code with expectations. Also
  by having it the other way around, the review first focuses on design (Is this
  function signature sane; the docs said it would do X while not doing
Y, is that
  sane?) instead of code.

  IIRC Gerrit has an option (or default) that the header is before the source,
  despite the order violating the git sorting of files.



> + */
> +int request_promised_blobs(const struct oid_array *blobs);
> +
>  #endif
> diff --git a/sha1_file.c b/sha1_file.c
> index 5862386cd..5778a6dbd 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
>  #include "list.h"
>  #include "mergesort.h"
>  #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "promised-blob.h"
> +#include "sha1-array.h"
>
>  #define SZ_FMT PRIuMAX
>  static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>         const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>                                     lookup_replace_object(sha1) :
>                                     sha1;
> +       int already_retried = 0;
>
>         if (!oi)
>                 oi = &blank_oi;
> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>                 }
>         }
>
> -       if (!find_pack_entry(real, &e)) {
> -               /* Most likely it's a loose object. */
> -               if (!sha1_loose_object_info(real, oi, flags)) {
> -                       oi->whence = OI_LOOSE;
> -                       return 0;
> -               }
> +retry:
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
>
> -               /* Not a loose object; someone else may have just packed it. */
> -               if (flags & OBJECT_INFO_QUICK) {
> -                       return -1;
> -               } else {
> -                       reprepare_packed_git();
> -                       if (!find_pack_entry(real, &e))
> -                               return -1;
> +       /* Most likely it's a loose object. */
> +       if (!sha1_loose_object_info(real, oi, flags)) {
> +               oi->whence = OI_LOOSE;
> +               return 0;
> +       }
> +
> +       /* Not a loose object; someone else may have just packed it. */
> +       reprepare_packed_git();
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
> +
> +       /* Check if it is a promised blob */
> +       if (!already_retried) {
> +               struct oid_array promised = OID_ARRAY_INIT;
> +               oid_array_append_sha1(&promised, real);
> +               if (request_promised_blobs(&promised)) {
> +                       already_retried = 1;
> +                       goto retry;

Once we have the promised blob, do we want to
remove it from the list of promised blobs?
(Maybe check first if we have it actually?)

Or is the opposite true:
We rather want to keep the promised blobs list,
such that we can easily get rid of good candidates
to make space on the local fs?


> @@ -26,4 +26,40 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
>         unset GIT_IGNORE_PROMISED_BLOBS
>  '
>
> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
> +       rm -rf server client &&

Did you take these rm -rf as the first line of a test from any other tests?
(If so I'd call them out as bad examples)

I think it is better if we'd do

    test_when_finished "rm -rf ..." &&

in a test, as then each test is self contained and cleans up
after itself as opposed to cleaning up for the guy in front of you.

> +       git init server &&

test_create_repo maybe?

> +       test_commit -C server 1 1.t abcdefgh &&
> +       HASH=$(git hash-object server/1.t) &&
> +
> +       git init client &&
> +       git -C client config core.promisedblobcommand \
> +               "git -C \"$(pwd)/server\" pack-objects --stdout |
> +                git unpack-objects" &&

Nothing is wrong with absolute path, except relative paths
may be easier to read here. Is the core.promisedblobcommand
promised to run in a particular directory? Do we want to make that
promise (if so, we'd want the test with relative path IMHO).

I like the series! A pleasant read, though I am tossing back
a lot of questions.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
  2017-07-11 22:02   ` Stefan Beller
@ 2017-07-12 17:29   ` Jeff Hostetler
  2017-07-12 19:28     ` Jonathan Nieder
  2017-07-13 19:39     ` Jonathan Tan
  1 sibling, 2 replies; 46+ messages in thread
From: Jeff Hostetler @ 2017-07-12 17:29 UTC (permalink / raw)
  To: Jonathan Tan, git



On 7/11/2017 3:48 PM, Jonathan Tan wrote:
> Currently, Git does not support repos with very large numbers of blobs
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every blob referenced by a tree object is available
> somewhere in the repo storage.
> 
> As a first step to reducing this problem, introduce the concept of
> promised blobs. Each Git repo can contain a list of promised blobs and
> their sizes at $GIT_DIR/objects/promisedblob. This patch contains
> functions to query them; functions for creating and modifying that file
> will be introduced in later patches.


As part of my on-going effort on partial/narrow clone/fetch I've also
looked at how to represent the set of omitted objects and whether or
not we should even try.

My primary concern is scale and managing the list of objects over time.

My fear is that this list will be quite large.  If we only want to omit
the very large blobs, then maybe not.  But if we want to expand that
scope to also omit other objects (such as a clone synchronized with a
sparse checkout), then that list will get large on large repos.  For
example, on the Windows repo we have (conservatively) 100M+ blobs (and
growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.

If I understand your proposal, newly-omitted blobs would need to be
merged into the promised-blob list after each fetch.  The fetch itself
may not have that many new entries, but inserting them into the existing
list will be slow.  Also, mmap'ing and bsearch'ing will likely have
issues.  And there's likely to be a very expensive step to remove
entries from the list as new blobs are received (or locally created).

In such a "sparse clone", it would be nice to omit unneeded tree objects
in addition to just blobs.   I say that because we are finding with GVFS
on the Windows repo, that even with commits-and-trees-only filtering,
the number of tree objects is overwhelming.  So I'm also concerned about
limiting the list to just blobs.  If we need to have this list, it
should be able to contain any object.  (Suggesting having an object type
in the entry.)

I assume that we'll also need a promised-blob.lock file to control
access during list manipulation.  This is already a sore spot with the
index; I'd hate to create another one.

I also have to wonder about the need to have a complete list of omitted
blobs up front.  It may be better to just relax the consistency checks
and assume a missing blob is "intentionally missing" rather than
indicating a corruption somewhere.  And then let the client do a later
round-trip to either demand-load the object -or- demand-load the
existence/size info if/when it really matters.

Maybe we should add a verb to your new fetch-blob endpoint to just get
the size of one or more objects to help with this.

Thanks,
Jeff


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/3] sha1_file: add promised blob hook support
  2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
  2017-07-11 22:38   ` Stefan Beller
@ 2017-07-12 17:40   ` Ben Peart
  2017-07-12 20:38     ` Jonathan Nieder
  1 sibling, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-12 17:40 UTC (permalink / raw)
  To: Jonathan Tan, git



On 7/11/2017 3:48 PM, Jonathan Tan wrote:
> Teach sha1_file to invoke a hook whenever a blob is requested and
> unavailable but is promised. The hook is a shell command that can be
> configured through "git config"; this hook takes in a list of hashes and
> writes (if successful) the corresponding objects to the repo's local
> storage.
> 

This would seem to work well for promised blobs but does it currently 
support other object types?  In our large mono repos, we have to support 
missing commits and trees as well.  I'd rather not have 
separate/different ways to handle the various missing object types.

There are a couple of related patch series in flight that are playing in 
this area. We should definitely coordinate these various efforts.

[RFC] Add support for downloading blobs on demand
https://public-inbox.org/git/009d01d28deb$158563a0$40902ae0$@gmail.com/

This series adds read-object-process support to download on demand any 
missing object type.  This is based on the code we are using today. I'll 
update it to 1) take advantage of the refactored sub-process support 
(https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/) 
and 2) fix a bug we discovered with alternates support and resubmit it 
as an updated RFC.

[RFC/PATCH v4 00/49] Add initial experimental external ODB support
https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/

This is a broader patch series that enables you to register multiple 
handlers for "missing" objects. It also already incorporates the 
read-object-process patch series above.

The advantage of these other patch series is that they already support 
_any_ object type, not just blobs or promised blobs.

> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate promised blobs (without invoking the hook) or be more
> efficient in invoking the promised blob hook.
> 
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>   (1) functions in sha1_file that take in a hash, without the user
>       regarding how the object is stored (loose or packed)
>   (2) functions in sha1_file that operate on packed objects (because I
>       need to check callers that know about the loose/packed distinction
>       and operate on both differently, and ensure that they can handle
>       the concept of objects that are neither loose nor packed)
> 
> (1) is handled by the modification to sha1_object_info_extended().
> 
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>   - reachable - only to find recent objects
>   - builtin/fsck - already knows about promised blobs
>   - builtin/cat-file - fixed in this commit
> 
> Callers of the other functions do not need to be changed:
>   - parse_pack_index
>     - http - indirectly from http_get_info_packs
>   - find_pack_entry_one
>     - this searches a single pack that is provided as an argument; the
>       caller already knows (through other means) that the sought object
>       is in a specific pack
>   - find_sha1_pack
>     - fast-import - appears to be an optimization to not store a
>       file if it is already in a pack
>     - http-walker - to search through a struct alt_base
>     - http-push - to search through remote packs
>   - has_sha1_pack
>     - builtin/fsck - already knows about promised blobs
>     - builtin/count-objects - informational purposes only (check if loose
>       object is also packed)
>     - builtin/prune-packed - check if object to be pruned is packed (if
>       not, don't prune it)
>     - revision - used to exclude packed objects if requested by user
>     - diff - just for optimization
> 
> An alternative design that I considered but rejected:
> 
>   - Adding a hook whenever a packed blob is requested, not on any blob.
>     That is, whenever we attempt to search the packfiles for a blob, if
>     it is missing (from the packfiles and from the loose object storage),
>     to invoke the hook (which must then store it as a packfile), open the
>     packfile the hook generated, and report that the blob is found in
>     that new packfile. This reduces the amount of analysis needed (in
>     that we only need to look at how packed blobs are handled), but
>     requires that the hook generate packfiles (or for sha1_file to pack
>     whatever loose objects are generated), creating one packfile for each
>     missing blob and potentially very many packfiles that must be
>     linearly searched. This may be tolerable now for repos that only have
>     a few missing blobs (for example, repos that only want to exclude
>     large blobs), and might be tolerable in the future if we have
>     batching support for the most commonly used commands, but is not
>     tolerable now for repos that exclude a large amount of blobs.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>   Documentation/config.txt               |  8 ++++
>   Documentation/gitrepository-layout.txt |  8 ++++
>   builtin/cat-file.c                     |  9 ++++
>   promised-blob.c                        | 75 ++++++++++++++++++++++++++++++++++
>   promised-blob.h                        | 13 ++++++
>   sha1_file.c                            | 44 +++++++++++++-------
>   t/t3907-promised-blob.sh               | 36 ++++++++++++++++
>   7 files changed, 179 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d5c9c4cab..c293ac921 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
>   will probe and set core.ignoreCase true if appropriate when the repository
>   is created.
>   
> +core.promisedBlobCommand::
> +	If set, whenever a blob in the local repo is attempted to be read, but
> +	is both missing and a promised blob, invoke this shell command to
> +	generate or obtain that blob before reporting an error. This shell
> +	command should take one or more hashes, each terminated by a newline,
> +	as standard input, and (if successful) should write the corresponding
> +	objects to the local repo (packed or loose).
> +
>   core.precomposeUnicode::
>   	This option is only used by Mac OS implementation of Git.
>   	When core.precomposeUnicode=true, Git reverts the unicode decomposition
> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
> index f51ed4e37..0ef9312fa 100644
> --- a/Documentation/gitrepository-layout.txt
> +++ b/Documentation/gitrepository-layout.txt
> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
>   `objects/info/alternates` points at the object stores it
>   borrows from.
>   +
> +. You could have blobs that are merely promised by another source. When
> +Git requires those blobs, it will invoke the command in the
> +`core.promisedBlobCommand` configuration variable.
> ++
>   This directory is ignored if $GIT_COMMON_DIR is set and
>   "$GIT_COMMON_DIR/objects" will be used instead.
>   
> @@ -91,6 +95,10 @@ objects/info/http-alternates::
>   	this object store borrows objects from, to be used when
>   	the repository is fetched over HTTP.
>   
> +objects/promisedblob::
> +	This file records the sha1 object names and sizes of promised
> +	blobs.
> +
>   refs::
>   	References are stored in subdirectories of this
>   	directory.  The 'git prune' command knows to preserve
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 96b786e48..3df00bf91 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -12,6 +12,7 @@
>   #include "streaming.h"
>   #include "tree-walk.h"
>   #include "sha1-array.h"
> +#include "promised-blob.h"
>   
>   struct batch_options {
>   	int enabled;
> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
>   	return 0;
>   }
>   
> +static int batch_promised_blob(const struct object_id *oid,
> +			       void *data)
> +{
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
>   static int batch_objects(struct batch_options *opt)
>   {
>   	struct strbuf buf = STRBUF_INIT;
> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>   
>   		for_each_loose_object(batch_loose_object, &sa, 0);
>   		for_each_packed_object(batch_packed_object, &sa, 0);
> +		for_each_promised_blob(batch_promised_blob, &sa);
>   
>   		cb.opt = opt;
>   		cb.expand = &data;
> diff --git a/promised-blob.c b/promised-blob.c
> index 493808ed2..0fe04daf2 100644
> --- a/promised-blob.c
> +++ b/promised-blob.c
> @@ -2,6 +2,9 @@
>   #include "promised-blob.h"
>   #include "sha1-lookup.h"
>   #include "strbuf.h"
> +#include "run-command.h"
> +#include "sha1-array.h"
> +#include "config.h"
>   
>   #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 8)
>   /*
> @@ -93,3 +96,75 @@ int for_each_promised_blob(each_promised_blob_fn cb, void *data)
>   	}
>   	return 0;
>   }
> +
> +static char *promised_blob_command;
> +
> +static int promised_blob_config(const char *conf_key, const char *value,
> +				void *cb)
> +{
> +	if (!strcmp(conf_key, "core.promisedblobcommand")) {
> +		promised_blob_command = xstrdup(value);
> +	}
> +	return 0;
> +}
> +
> +static void ensure_promised_blob_configured(void)
> +{
> +	static int configured;
> +	if (configured)
> +		return;
> +	git_config(promised_blob_config, NULL);
> +	configured = 1;
> +}
> +
> +int request_promised_blobs(const struct oid_array *blobs)
> +{
> +	struct child_process cp = CHILD_PROCESS_INIT;
> +	const char *argv[] = {NULL, NULL};
> +	const char *env[] = {"GIT_IGNORE_PROMISED_BLOBS=1", NULL};
> +	int blobs_requested = 0;
> +	int i;
> +
> +	for (i = 0; i < blobs->nr; i++) {
> +		if (is_promised_blob(&blobs->oid[i], NULL))
> +			break;
> +	}
> +
> +	if (i == blobs->nr)
> +		/* Nothing to fetch */
> +		return 0;
> +
> +	ensure_promised_blob_configured();
> +	if (!promised_blob_command)
> +		die("some promised blobs need to be fetched, but\n"
> +		    "no core.promisedblobcommand configured");
> +
> +	argv[0] = promised_blob_command;
> +	cp.argv = argv;
> +	cp.env = env;
> +	cp.use_shell = 1;
> +	cp.in = -1;
> +
> +	if (start_command(&cp))
> +		die("failed to start <%s>", promised_blob_command);
> +
> +	for (; i < blobs->nr; i++) {
> +		if (is_promised_blob(&blobs->oid[i], NULL)) {
> +			write_in_full(cp.in, oid_to_hex(&blobs->oid[i]),
> +				      GIT_SHA1_HEXSZ);
> +			write_in_full(cp.in, "\n", 1);
> +			blobs_requested++;
> +		}
> +	}
> +	close(cp.in);
> +
> +	if (finish_command(&cp))
> +		die("failed to finish <%s>", promised_blob_command);
> +
> +	/*
> +	 * The command above may have updated packfiles, so update our record
> +	 * of them.
> +	 */
> +	reprepare_packed_git();
> +	return blobs_requested;
> +}
> diff --git a/promised-blob.h b/promised-blob.h
> index a303ea1ff..34b1d3f9e 100644
> --- a/promised-blob.h
> +++ b/promised-blob.h
> @@ -1,6 +1,8 @@
>   #ifndef PROMISED_BLOB_H
>   #define PROMISED_BLOB_H
>   
> +#include "sha1-array.h"
> +
>   /*
>    * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
>    * returns its size.
> @@ -11,4 +13,15 @@ extern int is_promised_blob(const struct object_id *oid,
>   typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
>   int for_each_promised_blob(each_promised_blob_fn, void *);
>   
> +/*
> + * If any of the given blobs are promised blobs, invokes
> + * core.promisedblobcommand with those blobs and returns the number of blobs
> + * requested. No check is made as to whether the invocation actually populated
> + * the repository with the promised blobs.
> + *
> + * If none of the given blobs are promised blobs, this function does not invoke
> + * anything and returns 0.
> + */
> +int request_promised_blobs(const struct oid_array *blobs);
> +
>   #endif
> diff --git a/sha1_file.c b/sha1_file.c
> index 5862386cd..5778a6dbd 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
>   #include "list.h"
>   #include "mergesort.h"
>   #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "promised-blob.h"
> +#include "sha1-array.h"
>   
>   #define SZ_FMT PRIuMAX
>   static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>   				    lookup_replace_object(sha1) :
>   				    sha1;
> +	int already_retried = 0;
>   
>   	if (!oi)
>   		oi = &blank_oi;
> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   		}
>   	}
>   
> -	if (!find_pack_entry(real, &e)) {
> -		/* Most likely it's a loose object. */
> -		if (!sha1_loose_object_info(real, oi, flags)) {
> -			oi->whence = OI_LOOSE;
> -			return 0;
> -		}
> +retry:
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;
>   
> -		/* Not a loose object; someone else may have just packed it. */
> -		if (flags & OBJECT_INFO_QUICK) {
> -			return -1;
> -		} else {
> -			reprepare_packed_git();
> -			if (!find_pack_entry(real, &e))
> -				return -1;
> +	/* Most likely it's a loose object. */
> +	if (!sha1_loose_object_info(real, oi, flags)) {
> +		oi->whence = OI_LOOSE;
> +		return 0;
> +	}
> +
> +	/* Not a loose object; someone else may have just packed it. */
> +	reprepare_packed_git();
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;
> +
> +	/* Check if it is a promised blob */
> +	if (!already_retried) {
> +		struct oid_array promised = OID_ARRAY_INIT;
> +		oid_array_append_sha1(&promised, real);
> +		if (request_promised_blobs(&promised)) {
> +			already_retried = 1;
> +			goto retry;
>   		}
>   	}
>   
> +	return -1;
> +
> +found_packed:
>   	if (oi == &blank_oi)
>   		/*
>   		 * We know that the caller doesn't actually need the
>   		 * information below, so return early.
>   		 */
>   		return 0;
> -
>   	rtype = packed_object_info(e.p, e.offset, oi);
>   	if (rtype < 0) {
>   		mark_bad_packed_object(e.p, real);
> diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
> index 827072004..778afd1f6 100755
> --- a/t/t3907-promised-blob.sh
> +++ b/t/t3907-promised-blob.sh
> @@ -26,4 +26,40 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
>   	unset GIT_IGNORE_PROMISED_BLOBS
>   '
>   
> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
> +	rm -rf server client &&
> +
> +	git init server &&
> +	test_commit -C server 1 1.t abcdefgh &&
> +	HASH=$(git hash-object server/1.t) &&
> +
> +	git init client &&
> +	git -C client config core.promisedblobcommand \
> +		"git -C \"$(pwd)/server\" pack-objects --stdout |
> +		 git unpack-objects" &&
> +	
> +	test_must_fail git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success '...succeeds if it is a promised blob' '
> +	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
> +		hex_pack >client/.git/objects/promisedblob &&
> +	git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success 'cat-file --batch-all-objects with promised blobs' '
> +	rm -rf client &&
> +
> +	git init client &&
> +	git -C client config core.promisedblobcommand \
> +		"git -C \"$(pwd)/server\" pack-objects --stdout |
> +		 git unpack-objects" &&
> +	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
> +		hex_pack >client/.git/objects/promisedblob &&
> +
> +	# Verify that the promised blob is printed
> +	git -C client cat-file --batch --batch-all-objects | tee out |
> +		grep abcdefgh
> +'
> +
>   test_done
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-12 17:29   ` Jeff Hostetler
@ 2017-07-12 19:28     ` Jonathan Nieder
  2017-07-13 14:48       ` Jeff Hostetler
  2017-07-13 19:39     ` Jonathan Tan
  1 sibling, 1 reply; 46+ messages in thread
From: Jonathan Nieder @ 2017-07-12 19:28 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Jonathan Tan, git

Hi,

Jeff Hostetler wrote:

> My primary concern is scale and managing the list of objects over time.
[...]
>                                                                  For
> example, on the Windows repo we have (conservatively) 100M+ blobs (and
> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
>
> If I understand your proposal, newly-omitted blobs would need to be
> merged into the promised-blob list after each fetch.  The fetch itself
> may not have that many new entries, but inserting them into the existing
> list will be slow.

This is a good point.  An alternative would be to avoid storing the
list and instead use a repository extension that treats all missing
blobs in the repository similarly to promised blobs (with a weaker
promise, where the server is allowed to return 404).  The downsides:

- blob sizes are not available without an additional request, e.g. for
  directory listings

- the semantics of has_object_file become more complicated.  If it
  is allowed to return false for omitted blobs, then callers have to
  be audited to tolerate that and try to look up the blob anyway.
  If it has to contact the server to find out whether an omitted blob
  is available, then callers have to be audited to skip this expensive
  operation when possible.

- similarly, the semantics of sha1_object_info{,_extended} become more
  complicated.  If they are allowed to return -1 for omitted blobs,
  then callers have to be audited to handle that. If they have to
  contact the server to find the object type and size, it becomes
  expensive in a way that affects callers.

- causes futile repeated requests to the server for objects that don't
  exist.  Caching negative lookups is fussy because a later push could
  cause those objects to exist --- though it should be possible for
  fetches to invalidate entries in such a cache using the list of
  promised blobs sent by the server.

[...]
> In such a "sparse clone", it would be nice to omit unneeded tree objects
> in addition to just blobs.   I say that because we are finding with GVFS
> on the Windows repo, that even with commits-and-trees-only filtering,
> the number of tree objects is overwhelming.  So I'm also concerned about
> limiting the list to just blobs.  If we need to have this list, it
> should be able to contain any object.  (Suggesting having an object type
> in the entry.)

Would it work to have a separate lists of blobs and trees (either in
separate files or in the same file)?

One option would be to add a version number / magic string to the start
of the file.  That would allow making format changes later without a
proliferation of distinct repository extensions.

[...]
> I assume that we'll also need a promised-blob.lock file to control
> access during list manipulation.  This is already a sore spot with the
> index; I'd hate to create another one.

Can you say more about this concern?  My experience with concurrent
fetches has already not been great (since one fetch process is not
aware of what the other has fetched) --- is your concern that the
promised-blob facility would affect pushes as well some day?

> I also have to wonder about the need to have a complete list of omitted
> blobs up front.  It may be better to just relax the consistency checks
> and assume a missing blob is "intentionally missing" rather than
> indicating a corruption somewhere.

We've discussed this previously on list and gone back and forth. :)

>                                     And then let the client do a later
> round-trip to either demand-load the object -or- demand-load the
> existence/size info if/when it really matters.

The cost of demand-loading this existence/size information is what
ultimately convinced me of this approach.

But I can see how the tradeoffs differ between the omit-large-blobs
case and the omit-all-blobs case.  We might end up having to support
both modes. :(

> Maybe we should add a verb to your new fetch-blob endpoint to just get
> the size of one or more objects to help with this.

No objections from me, though we don't need it yet.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 3/3] sha1_file: add promised blob hook support
  2017-07-12 17:40   ` Ben Peart
@ 2017-07-12 20:38     ` Jonathan Nieder
  0 siblings, 0 replies; 46+ messages in thread
From: Jonathan Nieder @ 2017-07-12 20:38 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jonathan Tan, git

Hi,

Ben Peart wrote:
> On 7/11/2017 3:48 PM, Jonathan Tan wrote:

>> Teach sha1_file to invoke a hook whenever a blob is requested and
>> unavailable but is promised. The hook is a shell command that can be
>> configured through "git config"; this hook takes in a list of hashes and
>> writes (if successful) the corresponding objects to the repo's local
>> storage.
>
> This would seem to work well for promised blobs but does it
> currently support other object types?  In our large mono repos, we
> have to support missing commits and trees as well.

Can you elaborate on this?  This is the first time I've heard about
it.

Are you omitting commits and trees today?  How does that work with
commits --- is it something different from shallow clone?

[...]
> There are a couple of related patch series in flight that are
> playing in this area. We should definitely coordinate these various
> efforts.

In that spirit, I've created
https://github.com/jrn/git-large-repositories so we have a shared
place to keep track of pointers to the mailing list, etc.  I'll push
the links mentioned to there now.

I'm also usually available on IRC (channel #git-devel on freenode,
username jrnieder) for real-time conversation.  Logs are available at
http://colabti.org/irclogger/irclogger_logs/git-devel.  You can
prepend a message with [off] if you want it not to be logged.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-12 19:28     ` Jonathan Nieder
@ 2017-07-13 14:48       ` Jeff Hostetler
  2017-07-13 15:05         ` Jeff Hostetler
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Hostetler @ 2017-07-13 14:48 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git



On 7/12/2017 3:28 PM, Jonathan Nieder wrote:
> Hi,
> 
> Jeff Hostetler wrote:
> 
>> My primary concern is scale and managing the list of objects over time.
> [...]
>>                                                                   For
>> example, on the Windows repo we have (conservatively) 100M+ blobs (and
>> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
>>
>> If I understand your proposal, newly-omitted blobs would need to be
>> merged into the promised-blob list after each fetch.  The fetch itself
>> may not have that many new entries, but inserting them into the existing
>> list will be slow.
> 
> This is a good point.  An alternative would be to avoid storing the
> list and instead use a repository extension that treats all missing
> blobs in the repository similarly to promised blobs (with a weaker
> promise, where the server is allowed to return 404).  The downsides:
> 
> - blob sizes are not available without an additional request, e.g. for
>    directory listings
> 
> - the semantics of has_object_file become more complicated.  If it
>    is allowed to return false for omitted blobs, then callers have to
>    be audited to tolerate that and try to look up the blob anyway.
>    If it has to contact the server to find out whether an omitted blob
>    is available, then callers have to be audited to skip this expensive
>    operation when possible.
> 
> - similarly, the semantics of sha1_object_info{,_extended} become more
>    complicated.  If they are allowed to return -1 for omitted blobs,
>    then callers have to be audited to handle that. If they have to
>    contact the server to find the object type and size, it becomes
>    expensive in a way that affects callers.
> 
> - causes futile repeated requests to the server for objects that don't
>    exist.  Caching negative lookups is fussy because a later push could
>    cause those objects to exist --- though it should be possible for
>    fetches to invalidate entries in such a cache using the list of
>    promised blobs sent by the server.

Also good points. :-)

Would it be better to:
[] let fetch-pack/index-pack on the client compute the set of missing
    objects based upon the commit range in the response.
[] build a "pack-<id>.promised" as a peer of the "pack-<id>.[pack,idx]"
    files.
[] Use Jonathan's binary format with (-1LL) for the sizes, but also add
    a flags/object-type word.  Since the local commit/object traversal
    usually knows the type of object (either by context or the mode field
    in the parent entry), index-pack can set it immediately.
[] Repack would "do the right thing" when building new packfiles from
    existing packfiles (and merge/prune/etc. a new promise file).

Conceptually, these "pack-<id>.promised" files would be read-only once
created, like the corresponding .pack/.idx files.  But since we have a
fixed record size, we could let the dynamic object size requests update
the sizes in-place.   (And if an update failed for some reason, the
client could just go on and use the result (as we do when trying to
update mtime info in the index).)

This gives us the full set of omitted objects without the network
overhead and without the overhead of merging them into a single list.
And we defer the cost of getting an object's size until we actually
need the info.


> [...]
>> In such a "sparse clone", it would be nice to omit unneeded tree objects
>> in addition to just blobs.   I say that because we are finding with GVFS
>> on the Windows repo, that even with commits-and-trees-only filtering,
>> the number of tree objects is overwhelming.  So I'm also concerned about
>> limiting the list to just blobs.  If we need to have this list, it
>> should be able to contain any object.  (Suggesting having an object type
>> in the entry.)
> 
> Would it work to have a separate lists of blobs and trees (either in
> separate files or in the same file)?
> 
> One option would be to add a version number / magic string to the start
> of the file.  That would allow making format changes later without a
> proliferation of distinct repository extensions.

As I suggested above, we could just put them in the same file with a
flags/object-type field.

Without getting too nerdy here, we could even say that the size doesn't
need to be a int:64 -- definitely bigger than int:32, but we don't need
the full int:64.  We could make the size a int:40 or int:48 and use the
rest of the QWORD for flags/object-type.


> [...]
>> I assume that we'll also need a promised-blob.lock file to control
>> access during list manipulation.  This is already a sore spot with the
>> index; I'd hate to create another one.
> 
> Can you say more about this concern?  My experience with concurrent
> fetches has already not been great (since one fetch process is not
> aware of what the other has fetched) --- is your concern that the
> promised-blob facility would affect pushes as well some day?

[I wrote this before I wrote the multi-peer-file suggestion above.]

I'm assuming that anyone adding a new promised-blob to the single file
will use the standard "read foo, write foo.lock, rename" trick.  If we
have a large number of omitted blobs, this file will be large and the
insert will take a non-trivial amount of time.

Concurrent fetches does sound a bit silly, so we only have to worry
about the time to copy the bits on disk.

If other operations (such as demand loading missing blobs in a log or
diff) need to remove promised-blobs from the list as they create
loose blobs, then they will get stuck on the lock.  I suppose we could
just say that the promised-blob list may have false-positives in it
(because lookups in it won't happen until the packed and loose objects
have been searched and failed) and just wait for GC to clean it.

And there's always that stale lock problem where something dies holding
the lock and doesn't delete it.

Some of these could be addressed with a DB or something that allows
incremental and atomic updates rather than a regular index-like
flat file, but I don't want to dive into that here.


>> I also have to wonder about the need to have a complete list of omitted
>> blobs up front.  It may be better to just relax the consistency checks
>> and assume a missing blob is "intentionally missing" rather than
>> indicating a corruption somewhere.
> 
> We've discussed this previously on list and gone back and forth. :)
> 
>>                                      And then let the client do a later
>> round-trip to either demand-load the object -or- demand-load the
>> existence/size info if/when it really matters.
> 
> The cost of demand-loading this existence/size information is what
> ultimately convinced me of this approach.
> 
> But I can see how the tradeoffs differ between the omit-large-blobs
> case and the omit-all-blobs case.  We might end up having to support
> both modes. :(

I think I've been in both camps myself.

I'm not sure it would be that hard to support both (other than being
unfortunate), assuming pack-objects could build the same binary data
as I suggested be computed by fetch-pack/index-pack above.  It could
then just be a run-time decision.

I could even see it being useful -- do the initial clone with
client-side computation (because you're getting mostly ancient data
that you don't care about) and then let fetches use server-side
computation (because you're more likely to care about the sizes of
objects near your branch heads).


>> Maybe we should add a verb to your new fetch-blob endpoint to just get
>> the size of one or more objects to help with this.
> 
> No objections from me, though we don't need it yet.
> 
> Thanks,
> Jonathan
> 

Thanks
Jeff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-13 14:48       ` Jeff Hostetler
@ 2017-07-13 15:05         ` Jeff Hostetler
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Hostetler @ 2017-07-13 15:05 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git



On 7/13/2017 10:48 AM, Jeff Hostetler wrote:
> 
> 
> On 7/12/2017 3:28 PM, Jonathan Nieder wrote:
>> Hi,
>>
>> Jeff Hostetler wrote:
>>
>>> My primary concern is scale and managing the list of objects over time.
>> [...]
>>>                                                                   For
>>> example, on the Windows repo we have (conservatively) 100M+ blobs (and
>>> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
>>>
>>> If I understand your proposal, newly-omitted blobs would need to be
>>> merged into the promised-blob list after each fetch.  The fetch itself
>>> may not have that many new entries, but inserting them into the existing
>>> list will be slow.
>>
>> This is a good point.  An alternative would be to avoid storing the
>> list and instead use a repository extension that treats all missing
>> blobs in the repository similarly to promised blobs (with a weaker
>> promise, where the server is allowed to return 404).  The downsides:
>>
>> - blob sizes are not available without an additional request, e.g. for
>>    directory listings
>>
>> - the semantics of has_object_file become more complicated.  If it
>>    is allowed to return false for omitted blobs, then callers have to
>>    be audited to tolerate that and try to look up the blob anyway.
>>    If it has to contact the server to find out whether an omitted blob
>>    is available, then callers have to be audited to skip this expensive
>>    operation when possible.
>>
>> - similarly, the semantics of sha1_object_info{,_extended} become more
>>    complicated.  If they are allowed to return -1 for omitted blobs,
>>    then callers have to be audited to handle that. If they have to
>>    contact the server to find the object type and size, it becomes
>>    expensive in a way that affects callers.
>>
>> - causes futile repeated requests to the server for objects that don't
>>    exist.  Caching negative lookups is fussy because a later push could
>>    cause those objects to exist --- though it should be possible for
>>    fetches to invalidate entries in such a cache using the list of
>>    promised blobs sent by the server.
> 
> Also good points. :-)
> 
> Would it be better to:
> [] let fetch-pack/index-pack on the client compute the set of missing
>     objects based upon the commit range in the response.
> [] build a "pack-<id>.promised" as a peer of the "pack-<id>.[pack,idx]"
>     files.
> [] Use Jonathan's binary format with (-1LL) for the sizes, but also add
>     a flags/object-type word.  Since the local commit/object traversal
>     usually knows the type of object (either by context or the mode field
>     in the parent entry), index-pack can set it immediately.
> [] Repack would "do the right thing" when building new packfiles from
>     existing packfiles (and merge/prune/etc. a new promise file).
> 
> Conceptually, these "pack-<id>.promised" files would be read-only once
> created, like the corresponding .pack/.idx files.  But since we have a
> fixed record size, we could let the dynamic object size requests update
> the sizes in-place.   (And if an update failed for some reason, the
> client could just go on and use the result (as we do when trying to
> update mtime info in the index).)
> 
> This gives us the full set of omitted objects without the network
> overhead and without the overhead of merging them into a single list.
> And we defer the cost of getting an object's size until we actually
> need the info.
> 
> 
>> [...]
>>> In such a "sparse clone", it would be nice to omit unneeded tree objects
>>> in addition to just blobs.   I say that because we are finding with GVFS
>>> on the Windows repo, that even with commits-and-trees-only filtering,
>>> the number of tree objects is overwhelming.  So I'm also concerned about
>>> limiting the list to just blobs.  If we need to have this list, it
>>> should be able to contain any object.  (Suggesting having an object type
>>> in the entry.)
>>
>> Would it work to have a separate lists of blobs and trees (either in
>> separate files or in the same file)?
>>
>> One option would be to add a version number / magic string to the start
>> of the file.  That would allow making format changes later without a
>> proliferation of distinct repository extensions.
> 
> As I suggested above, we could just put them in the same file with a
> flags/object-type field.
> 
> Without getting too nerdy here, we could even say that the size doesn't
> need to be a int:64 -- definitely bigger than int:32, but we don't need
> the full int:64.  We could make the size a int:40 or int:48 and use the
> rest of the QWORD for flags/object-type.
> 
> 
>> [...]
>>> I assume that we'll also need a promised-blob.lock file to control
>>> access during list manipulation.  This is already a sore spot with the
>>> index; I'd hate to create another one.
>>
>> Can you say more about this concern?  My experience with concurrent
>> fetches has already not been great (since one fetch process is not
>> aware of what the other has fetched) --- is your concern that the
>> promised-blob facility would affect pushes as well some day?
> 
> [I wrote this before I wrote the multi-peer-file suggestion above.]
> 
> I'm assuming that anyone adding a new promised-blob to the single file
> will use the standard "read foo, write foo.lock, rename" trick.  If we
> have a large number of omitted blobs, this file will be large and the
> insert will take a non-trivial amount of time.
> 
> Concurrent fetches does sound a bit silly, so we only have to worry
> about the time to copy the bits on disk.
> 
> If other operations (such as demand loading missing blobs in a log or
> diff) need to remove promised-blobs from the list as they create
> loose blobs, then they will get stuck on the lock.  I suppose we could
> just say that the promised-blob list may have false-positives in it
> (because lookups in it won't happen until the packed and loose objects
> have been searched and failed) and just wait for GC to clean it.
> 
> And there's always that stale lock problem where something dies holding
> the lock and doesn't delete it.
> 
> Some of these could be addressed with a DB or something that allows
> incremental and atomic updates rather than a regular index-like
> flat file, but I don't want to dive into that here.
> 
> 
>>> I also have to wonder about the need to have a complete list of omitted
>>> blobs up front.  It may be better to just relax the consistency checks
>>> and assume a missing blob is "intentionally missing" rather than
>>> indicating a corruption somewhere.
>>
>> We've discussed this previously on list and gone back and forth. :)
>>
>>>                                      And then let the client do a later
>>> round-trip to either demand-load the object -or- demand-load the
>>> existence/size info if/when it really matters.
>>
>> The cost of demand-loading this existence/size information is what
>> ultimately convinced me of this approach.
>>
>> But I can see how the tradeoffs differ between the omit-large-blobs
>> case and the omit-all-blobs case.  We might end up having to support
>> both modes. :(
> 
> I think I've been in both camps myself.
> 
> I'm not sure it would be that hard to support both (other than being
> unfortunate), assuming pack-objects could build the same binary data
> as I suggested be computed by fetch-pack/index-pack above.  It could
> then just be a run-time decision.
> 
> I could even see it being useful -- do the initial clone with
> client-side computation (because you're getting mostly ancient data
> that you don't care about) and then let fetches use server-side
> computation (because you're more likely to care about the sizes of
> objects near your branch heads).
> 
> 
>>> Maybe we should add a verb to your new fetch-blob endpoint to just get
>>> the size of one or more objects to help with this.
>>
>> No objections from me, though we don't need it yet.

I think it would also be helpful to consider more than one type of size
request.
[1] get the size of a single objects
[2] get the sizes of a set of objects
[3] get the sizes for all objects directly referenced by a tree.
     if both sides have a tree object, then we don't need to send a list
     blobs -- just ask for all the blobs referenced by the tree object.
     the server could just return an array of sizes ordered by blob
     position in the tree.
[4] recursive version of [3].
[5] recursive for a commit (basically just a convenience on [4] and the
     tree root).

This would let the clone do a client-side computation and then (in a
request to the new end-point) get the sizes of everything in HEAD, for
example.

Thanks
Jeff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-12 17:29   ` Jeff Hostetler
  2017-07-12 19:28     ` Jonathan Nieder
@ 2017-07-13 19:39     ` Jonathan Tan
  2017-07-14 20:03       ` Jeff Hostetler
  1 sibling, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-13 19:39 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, jrnieder

On Wed, 12 Jul 2017 13:29:11 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> My primary concern is scale and managing the list of objects over time.
> 
> My fear is that this list will be quite large.  If we only want to omit
> the very large blobs, then maybe not.  But if we want to expand that
> scope to also omit other objects (such as a clone synchronized with a
> sparse checkout), then that list will get large on large repos.  For
> example, on the Windows repo we have (conservatively) 100M+ blobs (and
> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
> 
> If I understand your proposal, newly-omitted blobs would need to be
> merged into the promised-blob list after each fetch.  The fetch itself
> may not have that many new entries, but inserting them into the existing
> list will be slow.  Also, mmap'ing and bsearch'ing will likely have
> issues.  And there's likely to be a very expensive step to remove
> entries from the list as new blobs are received (or locally created).
> 
> In such a "sparse clone", it would be nice to omit unneeded tree objects
> in addition to just blobs.   I say that because we are finding with GVFS
> on the Windows repo, that even with commits-and-trees-only filtering,
> the number of tree objects is overwhelming.

I know that discussion has shifted to the possibility of not having this
list at all, and not sending size information together with the fetch,
but going back to this...maybe omitting trees *is* the solution to both
the large local list and the large amount of size information needing to
be transferred.

So the large-blob (e.g. Android) and many-blob (e.g. Windows) cases
would look like this:

 * Large-blob repositories have no trees omitted and a few blobs
   omitted, and we have sizes for all of them.
 * Many-blob repositories have many trees omitted and either all
   blobs omitted (and we have size information for them, useful for FUSE
   or FUSE-like things, for example) or possibly no blobs omitted (for
   example, if shallow clones are going to be the norm, there won't be
   many blobs to begin with if trees are omitted).

This seems better than an intermediate solution for the many-blob
repository case in which we still keep all the trees but also try to
avoid sending and storing as much information about the blobs as
possible, because that doesn't seem to provide us with much savings
(because the trees as a whole are just as large, if not larger, than the
blob information).

> So I'm also concerned about
> limiting the list to just blobs.  If we need to have this list, it
> should be able to contain any object.  (Suggesting having an object type
> in the entry.)

This makes sense - I'll add it in.

> I also have to wonder about the need to have a complete list of omitted
> blobs up front.  It may be better to just relax the consistency checks
> and assume a missing blob is "intentionally missing" rather than
> indicating a corruption somewhere.  And then let the client do a later
> round-trip to either demand-load the object -or- demand-load the
> existence/size info if/when it really matters.
> 
> Maybe we should add a verb to your new fetch-blob endpoint to just get
> the size of one or more objects to help with this.

If we allow the omission of trees, I don't think the added complexity of
demand-loading sizes is worth it.

What do you think of doing this:
 * add a "type" field to the list of promised objects (formerly the list
   of promised blobs)
 * retain mandatory size for blobs
 * retain single file containing list of promised objects (I don't feel
   too strongly about this, but it has a slight simplicity and
   in-between-GC performance advantage)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-13 19:39     ` Jonathan Tan
@ 2017-07-14 20:03       ` Jeff Hostetler
  2017-07-14 21:30         ` Jonathan Nieder
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Hostetler @ 2017-07-14 20:03 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder



On 7/13/2017 3:39 PM, Jonathan Tan wrote:
> On Wed, 12 Jul 2017 13:29:11 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> My primary concern is scale and managing the list of objects over time.
>>
>> My fear is that this list will be quite large.  If we only want to omit
>> the very large blobs, then maybe not.  But if we want to expand that
>> scope to also omit other objects (such as a clone synchronized with a
>> sparse checkout), then that list will get large on large repos.  For
>> example, on the Windows repo we have (conservatively) 100M+ blobs (and
>> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
>>
>> If I understand your proposal, newly-omitted blobs would need to be
>> merged into the promised-blob list after each fetch.  The fetch itself
>> may not have that many new entries, but inserting them into the existing
>> list will be slow.  Also, mmap'ing and bsearch'ing will likely have
>> issues.  And there's likely to be a very expensive step to remove
>> entries from the list as new blobs are received (or locally created).
>>
>> In such a "sparse clone", it would be nice to omit unneeded tree objects
>> in addition to just blobs.   I say that because we are finding with GVFS
>> on the Windows repo, that even with commits-and-trees-only filtering,
>> the number of tree objects is overwhelming.
> 
> I know that discussion has shifted to the possibility of not having this
> list at all, and not sending size information together with the fetch,
> but going back to this...maybe omitting trees *is* the solution to both
> the large local list and the large amount of size information needing to
> be transferred.
> 
> So the large-blob (e.g. Android) and many-blob (e.g. Windows) cases
> would look like this:
> 
>   * Large-blob repositories have no trees omitted and a few blobs
>     omitted, and we have sizes for all of them.
>   * Many-blob repositories have many trees omitted and either all
>     blobs omitted (and we have size information for them, useful for FUSE
>     or FUSE-like things, for example) or possibly no blobs omitted (for
>     example, if shallow clones are going to be the norm, there won't be
>     many blobs to begin with if trees are omitted).

I'm not sure I understand what you're saying here.  Does omitting a tree
object change the set of blob sizes we receive?  Are you saying that if
we omit a tree, then we implicitly omit all the blobs it references and
don't send size info those blobs?  So that the local list only has
reachable objects?  So faulting-in a tree would also have to send size
info for the newly referenced blobs?

Would this make it more similar to a shallow clone (in that none of the
have_object tests work for items beyond the cut point) ?


> This seems better than an intermediate solution for the many-blob
> repository case in which we still keep all the trees but also try to
> avoid sending and storing as much information about the blobs as
> possible, because that doesn't seem to provide us with much savings
> (because the trees as a whole are just as large, if not larger, than the
> blob information).

Or are you saying that by omitting mostly-unused trees (and sending just
their sizes) we can shrink the packfile payload and therefore afford to
send the full set of blob sizes in the promises payload ?


>> So I'm also concerned about
>> limiting the list to just blobs.  If we need to have this list, it
>> should be able to contain any object.  (Suggesting having an object type
>> in the entry.)
> 
> This makes sense - I'll add it in.

Good.  That will solve some cases where the client is just asking if
we have an object and is it of type x.


>> I also have to wonder about the need to have a complete list of omitted
>> blobs up front.  It may be better to just relax the consistency checks
>> and assume a missing blob is "intentionally missing" rather than
>> indicating a corruption somewhere.  And then let the client do a later
>> round-trip to either demand-load the object -or- demand-load the
>> existence/size info if/when it really matters.
>>
>> Maybe we should add a verb to your new fetch-blob endpoint to just get
>> the size of one or more objects to help with this.
> 
> If we allow the omission of trees, I don't think the added complexity of
> demand-loading sizes is worth it.

As you suggested earlier, we're trying to solve 2 different problems.
The "few omitted very large blobs" and the "way too many unneeded blobs"
problems.

For the former case, if you just have a few omitted objects, then a
second round-trip to mget their sizes isn't that much work.

I think for the latter, forcing a full promise-list on clone is just
too much data to send -- data that we likely won't ever need.  If the
client can compute the promise-set from the received packfile, we can
do the "is an object and type" tests completely locally.  It's only
when the client asks for the size do we need to decide whether to fetch
just the size or fetch the actual object.  Either type of request
doesn't seem like that big of a deal.  But we save a lot of IO
(especially during the initial clone, when it hurts the most).

In both cases, we benefit by letting the client decide if/when it
wants them.  And whether it wants the full set or a subset.

I suppose we could make it an option for the upload-pack protocol
to send or not send the promise-set.  That feels a little like a
hack in that it is another knob we have to support, but would let
us work either way.

I think it would be better to have 2 round-trips.  We could also have
an fsck option to verify and back-fill missing sizes if you want to
be able to force them.


> What do you think of doing this:
>   * add a "type" field to the list of promised objects (formerly the list
>     of promised blobs)
>   * retain mandatory size for blobs
>   * retain single file containing list of promised objects (I don't feel
>     too strongly about this, but it has a slight simplicity and
>     in-between-GC performance advantage)
> 

The single promise-set is problematic.  I think it will grow too
large (in our case) and will need all the usual lock juggling
and merging.

I still prefer my suggestion for a per-packfile promise-set for all
of the reasons I stated the other day.  This can be computed quickly
during index-pack, is (nearly) read-only, and doesn't require the
whole file rewrite lock file.  It also has the benefit of being
portable -- in that I can also copy the .promise file if I copy the
.pack and .idx file to another repo.


So to summarize, I think we need:
* an object-list concept
      (that is, we believe that just assuming missing objects
       are "intentionally omitted" has issues.)
* an object-list-entry containing:
      {object-id,
       optional object-size,  (use -1 when unknown)
       object-type}
* server end-point for fetching 1 or more sizes
      (by sending 1 or more object-ids)
* server end-point for fetching 1 or more objects
      (by sending 1 or more object-ids)
      and receiving a packfile or loose object.
* per-packfile object-list (rather than single combined one).
* client-side (index-pack) compute of per-packfile object-list.
* repack to "do the right thing".

Optional:
* pack-protocol option.
* server-side (pack-objects) compute of per-packfile object-list.
* fsck to verify and back-fill all sizes using fetch-sizes.

Don't need:
* fetch-object or fetch-sizes end-points do not need to support
   requests for the set of objects referenced by a directory
   (recursively) from my earlier email.
       keep it simple and just explicitly list the object-ids wanted.
* the client side of fetch-object DOES NOT need to remove
   object-list-entries from the object-list.
       we must assume that the list will have false-positives
       in it as loose objects are created locally.


Thanks,
Jeff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-14 20:03       ` Jeff Hostetler
@ 2017-07-14 21:30         ` Jonathan Nieder
  0 siblings, 0 replies; 46+ messages in thread
From: Jonathan Nieder @ 2017-07-14 21:30 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Jonathan Tan, git

Jeff Hostetler wrote:
> On 7/13/2017 3:39 PM, Jonathan Tan wrote:

>> I know that discussion has shifted to the possibility of not having this
>> list at all, and not sending size information together with the fetch,
>> but going back to this...maybe omitting trees *is* the solution to both
>> the large local list and the large amount of size information needing to
>> be transferred.
>>
>> So the large-blob (e.g. Android) and many-blob (e.g. Windows) cases
>> would look like this:
>>
>>   * Large-blob repositories have no trees omitted and a few blobs
>>     omitted, and we have sizes for all of them.
>>   * Many-blob repositories have many trees omitted and either all
>>     blobs omitted (and we have size information for them, useful for FUSE
>>     or FUSE-like things, for example) or possibly no blobs omitted (for
>>     example, if shallow clones are going to be the norm, there won't be
>>     many blobs to begin with if trees are omitted).
>
> I'm not sure I understand what you're saying here.  Does omitting a tree
> object change the set of blob sizes we receive?  Are you saying that if
> we omit a tree, then we implicitly omit all the blobs it references and
> don't send size info those blobs?  So that the local list only has
> reachable objects?  So faulting-in a tree would also have to send size
> info for the newly referenced blobs?
>
> Would this make it more similar to a shallow clone (in that none of the
> have_object tests work for items beyond the cut point) ?

Correct.  After the server sends a promise instead of a tree object, the
client has no reason to try to access blobs pointed to by that tree, any
more than it has reason to try to access commits on a branch it has not
fetched.  This means the client does not have to be aware of those blobs
until it fetches the tree and associated blob promises.

[...]
> For the former case, if you just have a few omitted objects, then a
> second round-trip to mget their sizes isn't that much work.

For the client, that is true.  For the server, decreasing the number
of requests even when requests are small and fast can be valuable.

[...]
> I think for the latter, forcing a full promise-list on clone is just
> too much data to send -- data that we likely won't ever need.

What did you think of the suggestion to not send promises for objects
that are only referenced by objects that weren't sent?

[...]
>> What do you think of doing this:
>>   * add a "type" field to the list of promised objects (formerly the list
>>     of promised blobs)
>>   * retain mandatory size for blobs
>>   * retain single file containing list of promised objects (I don't feel
>>     too strongly about this, but it has a slight simplicity and
>>     in-between-GC performance advantage)
>
> The single promise-set is problematic.  I think it will grow too
> large (in our case) and will need all the usual lock juggling
> and merging.
>
> I still prefer my suggestion for a per-packfile promise-set for all
> of the reasons I stated the other day.  This can be computed quickly
> during index-pack, is (nearly) read-only, and doesn't require the
> whole file rewrite lock file.  It also has the benefit of being
> portable -- in that I can also copy the .promise file if I copy the
> .pack and .idx file to another repo.

Okay.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (2 preceding siblings ...)
  2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
@ 2017-07-16 15:23 ` Philip Oakley
  2017-07-17 17:43   ` Ben Peart
  2017-07-17 18:03   ` Jonathan Nieder
  2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Philip Oakley @ 2017-07-16 15:23 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: Jonathan Tan, Ben Peart

From: "Jonathan Tan" <jonathantanmy@google.com>
Sent: Tuesday, July 11, 2017 8:48 PM
> These patches are part of a set of patches implementing partial clone,
> as you can see here:
>
> https://github.com/jonathantanmy/git/tree/partialclone
>
> In that branch, clone with batch checkout works, as you can see in the
> README. The code and tests are generally done, but some patches are
> still missing documentation and commit messages.
>
> These 3 patches implement the foundational concept - formerly known as
> "missing blobs" in the "missing blob manifest", I decided to call them
> "promised blobs". The repo knows their object names and sizes. It also
> does not have the blobs themselves, but can be configured to know how to
> fetch them.
>
If I understand correctly, this method doesn't give any direct user 
visibility of missing blobs in the file system. Is that correct?

I was hoping that eventually the various 'on demand' approaches would still 
allow users to continue to work as they go off-line such that they can see 
directly (in the FS) where the missing blobs (and trees) are located, so 
that they can continue to commit new work on existing files.

I had felt that some sort of 'gitlink' should be present (huma readable) as 
a place holder for the missing blob/tree. e.g. 'gitblob: 1234abcd' (showing 
the missing oid, jsut like sub-modules can do - it's no different really.

I'm concerned that the various GVFS extensions haven't fully achieved a 
separation of concerns surrounding the DVCS capability for on-line/off-line 
conversion as comms drop in and out. The GVFS looks great for a fully 
networked, always on, environment, but it would be good to also have the 
sepration for those who (will) have shallow/narrow clones that may also need 
to work with a local upstream that is also shallow/narrow.

--
Philip
I wanted to at least get my thoughts into the discussion before it all 
passes by.

> An older version of these patches was sent as a single demonstration
> patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
> only one file containing missing blob information. I have made that
> suggested change in this version.
>
> One thing remaining is to add a repository extension [2] so that older
> versions of Git fail immediately instead of trying to read missing
> blobs, but I thought I'd send these first in order to get some initial
> feedback.
>
> [1] 
> https://public-inbox.org/git/cover.1497035376.git.jonathantanmy@google.com/
> [2] Documentation/technical/repository-version.txt
>
> Jonathan Tan (3):
>  promised-blob, fsck: introduce promised blobs
>  sha1-array: support appending unsigned char hash
>  sha1_file: add promised blob hook support
>
> Documentation/config.txt               |   8 ++
> Documentation/gitrepository-layout.txt |   8 ++
> Makefile                               |   1 +
> builtin/cat-file.c                     |   9 ++
> builtin/fsck.c                         |  13 +++
> promised-blob.c                        | 170 
> +++++++++++++++++++++++++++++++++
> promised-blob.h                        |  27 ++++++
> sha1-array.c                           |   7 ++
> sha1-array.h                           |   1 +
> sha1_file.c                            |  44 ++++++---
> t/t3907-promised-blob.sh               |  65 +++++++++++++
> t/test-lib-functions.sh                |   6 ++
> 12 files changed, 345 insertions(+), 14 deletions(-)
> create mode 100644 promised-blob.c
> create mode 100644 promised-blob.h
> create mode 100755 t/t3907-promised-blob.sh
>
> -- 
> 2.13.2.932.g7449e964c-goog
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
  2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
@ 2017-07-17 17:43   ` Ben Peart
  2017-07-25 20:48     ` Philip Oakley
  2017-07-17 18:03   ` Jonathan Nieder
  1 sibling, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-17 17:43 UTC (permalink / raw)
  To: Philip Oakley, Jonathan Tan, git



On 7/16/2017 11:23 AM, Philip Oakley wrote:
> From: "Jonathan Tan" <jonathantanmy@google.com>
> Sent: Tuesday, July 11, 2017 8:48 PM
>> These patches are part of a set of patches implementing partial clone,
>> as you can see here:
>>
>> https://github.com/jonathantanmy/git/tree/partialclone
>>
>> In that branch, clone with batch checkout works, as you can see in the
>> README. The code and tests are generally done, but some patches are
>> still missing documentation and commit messages.
>>
>> These 3 patches implement the foundational concept - formerly known as
>> "missing blobs" in the "missing blob manifest", I decided to call them
>> "promised blobs". The repo knows their object names and sizes. It also
>> does not have the blobs themselves, but can be configured to know how to
>> fetch them.
>>
> If I understand correctly, this method doesn't give any direct user 
> visibility of missing blobs in the file system. Is that correct?

That is correct

> 
> I was hoping that eventually the various 'on demand' approaches would 
> still allow users to continue to work as they go off-line such that they 
> can see directly (in the FS) where the missing blobs (and trees) are 
> located, so that they can continue to commit new work on existing files.
> 

This is a challenge as git assumes all objects are always available 
(that is a key design principal of a DVCS) so any missing object is 
considered a corruption that typically results in a call to "die."

The GVFS solution gets around this by ensuring any missing object is 
retrieved on behalf of git so that it never sees it as missing.  The 
obvious tradeoff is that this requires a network connection so the 
object can be retrieved.

> I had felt that some sort of 'gitlink' should be present (huma readable) 
> as a place holder for the missing blob/tree. e.g. 'gitblob: 1234abcd' 
> (showing the missing oid, jsut like sub-modules can do - it's no 
> different really.
> 

We explored that option briefly but when you have a large number of 
files, even writing out some sort of place holder can take a very long 
time.  In fact, since the typical source file is relatively small (a few 
kilobytes), writing out a placeholder doesn't save much time vs just 
writing out the actual file contents.

Another challenge is that even if there is a placeholder written to 
disk, you still need a network connection to retrieve the actual 
contents if/when it is needed.

> I'm concerned that the various GVFS extensions haven't fully achieved a 
> separation of concerns surrounding the DVCS capability for 
> on-line/off-line conversion as comms drop in and out. The GVFS looks 
> great for a fully networked, always on, environment, but it would be 
> good to also have the sepration for those who (will) have shallow/narrow 
> clones that may also need to work with a local upstream that is also 
> shallow/narrow.
> 

You are correct that this hasn't been tackled yet. It is a challenging 
problem. I can envision something along the lines of what was done for 
the shallow clone feature where there are distinct ways to change the 
set of objects that are available but that would hopefully come in some 
future patch series.

> -- 
> Philip
> I wanted to at least get my thoughts into the discussion before it all 
> passes by.
> 
>> An older version of these patches was sent as a single demonstration
>> patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
>> only one file containing missing blob information. I have made that
>> suggested change in this version.
>>
>> One thing remaining is to add a repository extension [2] so that older
>> versions of Git fail immediately instead of trying to read missing
>> blobs, but I thought I'd send these first in order to get some initial
>> feedback.
>>
>> [1] 
>> https://public-inbox.org/git/cover.1497035376.git.jonathantanmy@google.com/ 
>>
>> [2] Documentation/technical/repository-version.txt
>>
>> Jonathan Tan (3):
>>  promised-blob, fsck: introduce promised blobs
>>  sha1-array: support appending unsigned char hash
>>  sha1_file: add promised blob hook support
>>
>> Documentation/config.txt               |   8 ++
>> Documentation/gitrepository-layout.txt |   8 ++
>> Makefile                               |   1 +
>> builtin/cat-file.c                     |   9 ++
>> builtin/fsck.c                         |  13 +++
>> promised-blob.c                        | 170 
>> +++++++++++++++++++++++++++++++++
>> promised-blob.h                        |  27 ++++++
>> sha1-array.c                           |   7 ++
>> sha1-array.h                           |   1 +
>> sha1_file.c                            |  44 ++++++---
>> t/t3907-promised-blob.sh               |  65 +++++++++++++
>> t/test-lib-functions.sh                |   6 ++
>> 12 files changed, 345 insertions(+), 14 deletions(-)
>> create mode 100644 promised-blob.c
>> create mode 100644 promised-blob.h
>> create mode 100755 t/t3907-promised-blob.sh
>>
>> -- 
>> 2.13.2.932.g7449e964c-goog
>>
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
  2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
  2017-07-17 17:43   ` Ben Peart
@ 2017-07-17 18:03   ` Jonathan Nieder
  2017-07-29 12:51     ` Philip Oakley
  1 sibling, 1 reply; 46+ messages in thread
From: Jonathan Nieder @ 2017-07-17 18:03 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jonathan Tan, git, Ben Peart

Hi Philip,

Philip Oakley wrote:
> From: "Jonathan Tan" <jonathantanmy@google.com>

>> These patches are part of a set of patches implementing partial clone,
>> as you can see here:
>>
>> https://github.com/jonathantanmy/git/tree/partialclone
[...]
> If I understand correctly, this method doesn't give any direct user
> visibility of missing blobs in the file system. Is that correct?
>
> I was hoping that eventually the various 'on demand' approaches
> would still allow users to continue to work as they go off-line such
> that they can see directly (in the FS) where the missing blobs (and
> trees) are located, so that they can continue to commit new work on
> existing files.
>
> I had felt that some sort of 'gitlink' should be present (huma
> readable) as a place holder for the missing blob/tree. e.g.
> 'gitblob: 1234abcd' (showing the missing oid, jsut like sub-modules
> can do - it's no different really.

That's a reasonable thing to want, but it's a little different from
the use cases that partial clone work so far has aimed to support.
They are:

 A. Avoiding downloading all blobs (and likely trees as well) that are
    not needed in the current operation (e.g. checkout).  This blends
    well with the sparse checkout feature, which allows the current
    checkout to be fairly small in a large repository.

    GVFS uses a trick that makes it a little easier to widen a sparse
    checkout upon access of a directory.  But the same building blocks
    should work fine with a sparse checkout that has been set up
    explicitly.

 B. Avoiding downloading large blobs, except for those needed in the
    current operation (e.g. checkout).

    When not using sparse checkout, the main benefit out of the box is
    avoiding downloading *historical versions* of large blobs.

It sounds like you are looking for a sort of placeholder outside the
sparse checkout area.  In a way, that's orthogonal to these patches:
even if you have all relevant blobs, you may want to avoid inflating
them to check them out and reading them to compare to the index (i.e.
the usual benefits of sparse checkout).  In a sparse checkout, you
still might like to be able to get a listing of files outside the
sparse area (which you can get with "git ls-tree") and you may even
want to be able to get such a listing with plain "ls" (as with your
proposal).

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
  2017-07-11 22:02   ` Stefan Beller
@ 2017-07-19 23:37     ` Jonathan Tan
  0 siblings, 0 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-19 23:37 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

On Tue, 11 Jul 2017 15:02:09 -0700
Stefan Beller <sbeller@google.com> wrote:

> Here I wondered what this file looks like, in a later patch you
> add documentation:
> 
>   +objects/promisedblob::
>   +       This file records the sha1 object names and sizes of promised
>   +       blobs.
>   +
> 
> but that leaves more fundamental questions:
> * Is that a list of sha1s, separated by LF? (CRLF? How would Windows
>   users interact with it, are they supposed to ever modify this file?)
> * Similar to .git/packed-refs, would we want to have a first line
>   that has some sort of comment?
> * In the first question I assumed a linear list, will that be a sorted list,
>   or do we want to have some fancy data structure here?
>   (critbit tree, bloom filter)
> * is the contents in ASCII or binary? (What are the separators?)
> * In the future I presume we'd want to quickly answer "Is X in the
>   promised blobs list?" so would it be possible (in the future) to
>   improve the searching e.g. binary search?
> * ... I'll read on to see my questions answered, but I'd guess
>   others would prefer to see it in the docs. :)

I'll write more documentation once the file format is finalized. At the
time this patch was written, this was a sorted binary file, where each
entry consists of a 20-byte SHA-1 and an 8-byte size. Now each entry is
a 20-byte SHA-1, a 1-byte type, and an 8-byte size (I will send this
patch out soon).

> Similar to other files, would we want to prefix the file with
> a 4 letter magic number and a version number?

That's a good idea.

> Later down the road, do we want to have a
> (plumbing) command to move an object from
> standard blob to promised blob (as of now I'd think
> it would perform this rm call as well as an insert into
> the promised blobs file) ?
> (Well, we'd also have to think about how to get objects
> out of a pack)
> 
> With such a command you can easily write your own custom
> filter to free up blobs again.

This sounds reasonable, but probably not in the initial set of patches.

> > +       test_must_fail git -C repo fsck
> 
>     test_i18n_grep missing out ?
> 
> maybe, too? (Maybe that is already tested elsewhere,
> so no need for it)

This test is just meant to show that configuration can make fsck pass,
not the existing behavior of fsck when it fails (which is tested
elsewhere - I guess that was your question).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/3] sha1-array: support appending unsigned char hash
  2017-07-11 22:06   ` Stefan Beller
@ 2017-07-19 23:56     ` Jonathan Tan
  2017-07-20  0:06       ` Stefan Beller
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-19 23:56 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

On Tue, 11 Jul 2017 15:06:11 -0700
Stefan Beller <sbeller@google.com> wrote:

> On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> > In a subsequent patch, sha1_file will need to append object names in the
> > form of "unsigned char *" to oid arrays. Teach sha1-array support for
> > that.
> >
> > Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> 
> This breaks the oid/sha1 barrier?

Not sure what you mean by this. This patch is meant to be a change to
make "unsigned char *"-using code able to create/modify oid arrays,
while we migrate away from "unsigned char *" - I don't know of any
barrier.

> I would have expected the caller to do a
> 
>   oid_array_append_oid(&array, sha1_to_oid(sha1));
> 
> with sha1_to_oid working off some static memory, such that the
> call to oid_array_append_oid (which we have as oid_array_append)
> is just as cheap?

The solution you present would need 2 copies (one in sha1_to_oid, and
another in oid_array_append), but the solution in this patch only
requires one.

> Could you say more in the commit message on why we collude
> sha1 and oids here?

I also don't understand this, but the answer to this is probably the
same as the answer to your first question.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 2/3] sha1-array: support appending unsigned char hash
  2017-07-19 23:56     ` Jonathan Tan
@ 2017-07-20  0:06       ` Stefan Beller
  0 siblings, 0 replies; 46+ messages in thread
From: Stefan Beller @ 2017-07-20  0:06 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Wed, Jul 19, 2017 at 4:56 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Tue, 11 Jul 2017 15:06:11 -0700
> Stefan Beller <sbeller@google.com> wrote:
>
>> On Tue, Jul 11, 2017 at 12:48 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>> > In a subsequent patch, sha1_file will need to append object names in the
>> > form of "unsigned char *" to oid arrays. Teach sha1-array support for
>> > that.
>> >
>> > Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>>
>> This breaks the oid/sha1 barrier?
>
> Not sure what you mean by this. This patch is meant to be a change to
> make "unsigned char *"-using code able to create/modify oid arrays,
> while we migrate away from "unsigned char *" - I don't know of any
> barrier.

In my mind we have 2 states:
* the current state where we have "char *sha1"
* the future in which we have "struct object_id oid"

In both realms you can do things (and the current code
base has examples for both), but these 2 are connected
via only a couple well defined functions (sha1_to_hex,
oid_to_sha1) during the conversion phase.

So with that mindset I was slightly irritated to see a function
that uses both realms so interchangeably.

Revisiting that it is actually not that bad, but also a good
idea to have in the transition phase.

Sorry for the noise,
Stefan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs)
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
@ 2017-07-20  0:21 ` Jonathan Tan
  2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley

Thanks for all your comments on the earlier version. This is a
substantially different version. In particular:
 - Now supports all types (tag, commit, tree) of objects, not only blobs
 - fsck works
 - Incorporates Ben Peart's code that uses a long-living process
   (lifetime of the Git invocation) to obtain objects
 - Implemented as a repository extension

If anyone would like to comment on the overall direction of this
approach, that would be great. In particular, I'm not too sure on the
names of things. I know that I have some things to still work on
(documentation and coding style, more improvements to and tests for
fsck, maybe division of commits?) but I would like some early feedback
on the bigger picture first.

If you want to patch this in, this is built off my recent cleanup patch
[1].

About inability to scale if we have the list of promised blobs: In this
design, we will have a promised blob entry only if we have a concrete
tree, so in a repository in which many trees are omitted, there will not
be many promised blob entry objects. In fact, the minimal partial clone
will be one in which there will be one promised commit for each ref
(assuming no duplicates), no promised trees, and no promised blobs.
(I'm not planning to implement such a clone, but someone else could do
so.)

About having multiple promise lists: I have retained the single list,
but am still open to change. The support of all the object types might
be sufficient mitigation for the issues that caused us to investigate
multiple promise lists in the first place.

[1] https://public-inbox.org/git/20170718222848.1453-1-jonathantanmy@google.com/

Jonathan Tan (4):
  object: remove "used" field from struct object
  promised-object, fsck: introduce promised objects
  sha1-array: support appending unsigned char hash
  sha1_file: support promised object hook

 Documentation/config.txt                         |   8 +
 Documentation/gitrepository-layout.txt           |   8 +
 Documentation/technical/read-object-protocol.txt | 102 +++++++
 Documentation/technical/repository-version.txt   |   6 +
 Makefile                                         |   1 +
 builtin/cat-file.c                               |   9 +
 builtin/fsck.c                                   |  42 ++-
 cache.h                                          |   4 +
 environment.c                                    |   1 +
 fsck.c                                           |   6 +-
 object.c                                         |  21 +-
 object.h                                         |  21 +-
 promised-object.c                                | 324 +++++++++++++++++++++++
 promised-object.h                                |  34 +++
 setup.c                                          |   7 +-
 sha1-array.c                                     |   7 +
 sha1-array.h                                     |   1 +
 sha1_file.c                                      |  44 ++-
 t/t3907-promised-object.sh                       |  73 +++++
 t/t3907/read-object                              | 114 ++++++++
 t/test-lib-functions.sh                          |   6 +
 21 files changed, 808 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100644 promised-object.c
 create mode 100644 promised-object.h
 create mode 100755 t/t3907-promised-object.sh
 create mode 100755 t/t3907/read-object

-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC PATCH v2 1/4] object: remove "used" field from struct object
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (4 preceding siblings ...)
  2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
@ 2017-07-20  0:21 ` Jonathan Tan
  2017-07-20  0:36   ` Stefan Beller
  2017-07-20 21:20   ` Junio C Hamano
  2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley

The "used" field in struct object is only used by builtin/fsck. Remove
that field and modify builtin/fsck to use a flag instead.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c | 24 ++++++++++++++----------
 object.c       |  1 -
 object.h       |  2 +-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4ba311cda..462b8643b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -19,6 +19,8 @@
 #define REACHABLE 0x0001
 #define SEEN      0x0002
 #define HAS_OBJ   0x0004
+/* This flag is set if something points to this object. */
+#define USED      0x0008
 
 static int show_root;
 static int show_tags;
@@ -195,7 +197,7 @@ static int mark_used(struct object *obj, int type, void *data, struct fsck_optio
 {
 	if (!obj)
 		return 1;
-	obj->used = 1;
+	obj->flags |= USED;
 	return 0;
 }
 
@@ -244,7 +246,7 @@ static void check_unreachable_object(struct object *obj)
 	}
 
 	/*
-	 * "!used" means that nothing at all points to it, including
+	 * "!USED" means that nothing at all points to it, including
 	 * other unreachable objects. In other words, it's the "tip"
 	 * of some set of unreachable objects, usually a commit that
 	 * got dropped.
@@ -255,7 +257,7 @@ static void check_unreachable_object(struct object *obj)
 	 * deleted a branch by mistake, this is a prime candidate to
 	 * start looking at, for example.
 	 */
-	if (!obj->used) {
+	if (!(obj->flags & USED)) {
 		if (show_dangling)
 			printf("dangling %s %s\n", printable_type(obj),
 			       describe_object(obj));
@@ -379,7 +381,8 @@ static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
 		errors_found |= ERROR_OBJECT;
 		return error("%s: object corrupt or missing", oid_to_hex(oid));
 	}
-	obj->flags = HAS_OBJ;
+	obj->flags &= ~(REACHABLE | SEEN);
+	obj->flags |= HAS_OBJ;
 	return fsck_obj(obj);
 }
 
@@ -397,7 +400,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 				add_decoration(fsck_walk_options.object_names,
 					obj,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
-			obj->used = 1;
+			obj->flags |= USED;
 			mark_object_reachable(obj);
 		} else {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
@@ -445,7 +448,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 		errors_found |= ERROR_REFS;
 	}
 	default_refs++;
-	obj->used = 1;
+	obj->flags |= USED;
 	if (name_objects)
 		add_decoration(fsck_walk_options.object_names,
 			obj, xstrdup(refname));
@@ -513,7 +516,8 @@ static int fsck_loose(const struct object_id *oid, const char *path, void *data)
 		return 0; /* keep checking other objects */
 	}
 
-	obj->flags = HAS_OBJ;
+	obj->flags &= ~(REACHABLE | SEEN);
+	obj->flags |= HAS_OBJ;
 	if (fsck_obj(obj))
 		errors_found |= ERROR_OBJECT;
 	return 0;
@@ -595,7 +599,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 			errors_found |= ERROR_REFS;
 			return 1;
 		}
-		obj->used = 1;
+		obj->flags |= USED;
 		if (name_objects)
 			add_decoration(fsck_walk_options.object_names,
 				obj, xstrdup(":"));
@@ -737,7 +741,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 				continue;
 			}
 
-			obj->used = 1;
+			obj->flags |= USED;
 			if (name_objects)
 				add_decoration(fsck_walk_options.object_names,
 					obj, xstrdup(arg));
@@ -774,7 +778,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			if (!blob)
 				continue;
 			obj = &blob->object;
-			obj->used = 1;
+			obj->flags |= USED;
 			if (name_objects)
 				add_decoration(fsck_walk_options.object_names,
 					obj,
diff --git a/object.c b/object.c
index f81877741..321d7e920 100644
--- a/object.c
+++ b/object.c
@@ -141,7 +141,6 @@ void *create_object(const unsigned char *sha1, void *o)
 	struct object *obj = o;
 
 	obj->parsed = 0;
-	obj->used = 0;
 	obj->flags = 0;
 	hashcpy(obj->oid.hash, sha1);
 
diff --git a/object.h b/object.h
index 33e5cc994..0a419ba8d 100644
--- a/object.h
+++ b/object.h
@@ -38,6 +38,7 @@ struct object_array {
  * http-push.c:                            16-----19
  * commit.c:                               16-----19
  * sha1_name.c:                                     20
+ * builtin/fsck.c:  0--3
  */
 #define FLAG_BITS  27
 
@@ -46,7 +47,6 @@ struct object_array {
  */
 struct object {
 	unsigned parsed : 1;
-	unsigned used : 1;
 	unsigned type : TYPE_BITS;
 	unsigned flags : FLAG_BITS;
 	struct object_id oid;
-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (5 preceding siblings ...)
  2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
@ 2017-07-20  0:21 ` Jonathan Tan
  2017-07-20 18:07   ` Stefan Beller
  2017-07-20 19:58   ` Ben Peart
  2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
  2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
  8 siblings, 2 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.

As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.

A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt |   6 ++
 Makefile                                       |   1 +
 builtin/fsck.c                                 |  18 +++-
 cache.h                                        |   2 +
 environment.c                                  |   1 +
 fsck.c                                         |   6 +-
 object.c                                       |  19 ++++
 object.h                                       |  19 ++++
 promised-object.c                              | 130 +++++++++++++++++++++++++
 promised-object.h                              |  22 +++++
 setup.c                                        |   7 +-
 t/t3907-promised-object.sh                     |  41 ++++++++
 t/test-lib-functions.sh                        |   6 ++
 13 files changed, 273 insertions(+), 5 deletions(-)
 create mode 100644 promised-object.c
 create mode 100644 promised-object.h
 create mode 100755 t/t3907-promised-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..f8b82c1c7 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,9 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`promisedObjects`
+~~~~~~~~~~~~~~~~~
+
+(Explain this - basically a string containing a command to be run
+whenever a missing object needs to be fetched.)
diff --git a/Makefile b/Makefile
index 9c9c42f8f..c1446d5ef 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
 LIB_OBJS += pretty.o
 LIB_OBJS += prio-queue.o
 LIB_OBJS += progress.o
+LIB_OBJS += promised-object.o
 LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 462b8643b..49e21f361 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "promised-object.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -44,6 +45,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_PROMISED_OBJECT 011
 
 static const char *describe_object(struct object *obj)
 {
@@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 {
 	struct object *obj;
 
-	obj = parse_object(oid);
+	obj = parse_or_promise_object(oid);
 	if (!obj) {
 		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
 		errors_found |= ERROR_REACHABLE;
@@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 		fprintf(stderr, "Checking cache tree\n");
 
 	if (0 <= it->entry_count) {
-		struct object *obj = parse_object(&it->oid);
+		struct object *obj = parse_or_promise_object(&it->oid);
 		if (!obj) {
 			error("%s: invalid sha1 pointer in cache-tree",
 			      oid_to_hex(&it->oid));
@@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
 	return 0;
 }
 
+static int mark_have_promised_object(const struct object_id *oid, void *data)
+{
+	mark_object_for_connectivity(oid);
+	return 0;
+}
+
 static char const * const fsck_usage[] = {
 	N_("git fsck [<options>] [<object>...]"),
 	NULL
@@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 	git_config(fsck_config, NULL);
 
+	if (fsck_promised_objects()) {
+		error("Errors found in promised object list");
+		errors_found |= ERROR_PROMISED_OBJECT;
+	}
+
 	fsck_head_link();
 	if (connectivity_only) {
 		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
@@ -727,6 +740,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			stop_progress(&progress);
 		}
 	}
+	for_each_promised_object(mark_have_promised_object, NULL);
 
 	heads = 0;
 	for (i = 0; i < argc; i++) {
diff --git a/cache.h b/cache.h
index 71fe09264..dd94b5ffc 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_promised_objects;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *promised_objects;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
diff --git a/environment.c b/environment.c
index 3fd4b1084..82658470d 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_promised_objects;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/fsck.c b/fsck.c
index b4204d772..f08ff0675 100644
--- a/fsck.c
+++ b/fsck.c
@@ -460,8 +460,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
 	if (!obj)
 		return -1;
 
-	if (obj->type == OBJ_NONE)
-		parse_object(&obj->oid);
+	if (!obj->parsed && !obj->promised)
+		parse_or_promise_object(&obj->oid);
+	if (obj->promised)
+		return 0;
 
 	switch (obj->type) {
 	case OBJ_BLOB:
diff --git a/object.c b/object.c
index 321d7e920..0aeb95084 100644
--- a/object.c
+++ b/object.c
@@ -4,6 +4,7 @@
 #include "tree.h"
 #include "commit.h"
 #include "tag.h"
+#include "promised-object.h"
 
 static struct object **obj_hash;
 static int nr_objs, obj_hash_size;
@@ -141,6 +142,7 @@ void *create_object(const unsigned char *sha1, void *o)
 	struct object *obj = o;
 
 	obj->parsed = 0;
+	obj->promised = 0;
 	obj->flags = 0;
 	hashcpy(obj->oid.hash, sha1);
 
@@ -279,6 +281,23 @@ struct object *parse_object(const struct object_id *oid)
 	return NULL;
 }
 
+struct object *parse_or_promise_object(const struct object_id *oid)
+{
+	enum object_type type;
+
+	if (has_object_file(oid))
+		return parse_object(oid);
+
+	if (is_promised_object(oid, &type, NULL)) {
+		struct object *obj = lookup_unknown_object(oid->hash);
+		obj->promised = 1;
+		obj->type = type;
+		return obj;
+	}
+
+	return NULL;
+}
+
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p)
 {
diff --git a/object.h b/object.h
index 0a419ba8d..640c8bff1 100644
--- a/object.h
+++ b/object.h
@@ -46,7 +46,17 @@ struct object_array {
  * The object type is stored in 3 bits.
  */
 struct object {
+	/*
+	 * Set if this object is parsed. If set, "type" is populated and this
+	 * object can be casted to "struct commit" or an equivalent.
+	 */
 	unsigned parsed : 1;
+	/*
+	 * Set if this object is not in the repo but is promised. If set,
+	 * "type" is populated, but this object cannot be casted to "struct
+	 * commit" or an equivalent.
+	 */
+	unsigned promised : 1;
 	unsigned type : TYPE_BITS;
 	unsigned flags : FLAG_BITS;
 	struct object_id oid;
@@ -104,6 +114,15 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name
  */
 struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
 
+/*
+ * Returns the object, having parsed it if possible. In the returned object,
+ * either "parsed" or "promised" will be set.
+ *
+ * Returns NULL if the object is missing and not promised, or if the object is
+ * corrupt.
+ */
+struct object *parse_or_promise_object(const struct object_id *oid);
+
 /** Returns the object, with potentially excess memory allocated. **/
 struct object *lookup_unknown_object(const unsigned  char *sha1);
 
diff --git a/promised-object.c b/promised-object.c
new file mode 100644
index 000000000..487ade437
--- /dev/null
+++ b/promised-object.c
@@ -0,0 +1,130 @@
+#include "cache.h"
+#include "promised-object.h"
+#include "sha1-lookup.h"
+#include "strbuf.h"
+
+#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
+/*
+ * A mmap-ed byte array of size (promised_object_nr * ENTRY_SIZE). Each
+ * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised object, its
+ * 8-bit object type, and its 64-bit size in network byte order. The entries
+ * are sorted in ascending SHA-1 order.
+ */
+static char *promised_objects;
+static int64_t promised_object_nr = -1;
+
+static void prepare_promised_objects(void)
+{
+	char *filename;
+	int fd;
+	struct stat st;
+
+	if (promised_object_nr >= 0)
+		return;
+
+	if (!repository_format_promised_objects ||
+	    getenv("GIT_IGNORE_PROMISED_OBJECTS")) {
+		promised_object_nr = 0;
+		return;
+	}
+
+	filename = xstrfmt("%s/promised", get_object_directory());
+	fd = git_open(filename);
+	if (fd < 0) {
+		if (errno == ENOENT) {
+			promised_object_nr = 0;
+			goto cleanup;
+		}
+		perror("prepare_promised_objects");
+		die("Could not open %s", filename);
+	}
+	if (fstat(fd, &st)) {
+		perror("prepare_promised_objects");
+		die("Could not stat %s", filename);
+	}
+	if (st.st_size == 0) {
+		promised_object_nr = 0;
+		goto cleanup;
+	}
+	if (st.st_size % ENTRY_SIZE) {
+		die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
+	}
+
+	promised_objects = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	promised_object_nr = st.st_size / ENTRY_SIZE;
+
+cleanup:
+	free(filename);
+	if (fd >= 0)
+		close(fd);
+}
+
+int is_promised_object(const struct object_id *oid, enum object_type *type,
+		       unsigned long *size)
+{
+	int result;
+
+	prepare_promised_objects();
+	result = sha1_entry_pos(promised_objects, ENTRY_SIZE, 0, 0,
+				promised_object_nr, promised_object_nr,
+				oid->hash);
+	if (result >= 0) {
+		if (type) {
+			char *ptr = promised_objects +
+				    result * ENTRY_SIZE + GIT_SHA1_RAWSZ;
+			*type = *ptr;
+		}
+		if (size) {
+			uint64_t size_nbo;
+			char *ptr = promised_objects +
+				    result * ENTRY_SIZE + GIT_SHA1_RAWSZ + 1;
+			memcpy(&size_nbo, ptr, sizeof(size_nbo));
+			*size = ntohll(size_nbo);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+int for_each_promised_object(each_promised_object_fn cb, void *data)
+{
+	struct object_id oid;
+	int i, r;
+
+	prepare_promised_objects();
+	for (i = 0; i < promised_object_nr; i++) {
+		memcpy(oid.hash, &promised_objects[i * ENTRY_SIZE],
+		       GIT_SHA1_RAWSZ);
+		r = cb(&oid, data);
+		if (r)
+			return r;
+	}
+	return 0;
+}
+
+int fsck_promised_objects(void)
+{
+	int i;
+	prepare_promised_objects();
+	for (i = 0; i < promised_object_nr; i++) {
+		enum object_type type;
+		if (i != 0 && memcmp(&promised_objects[(i - 1) * ENTRY_SIZE],
+				     &promised_objects[i * ENTRY_SIZE],
+				     GIT_SHA1_RAWSZ) >= 0)
+			return error("Error in list of promised objects: not "
+				     "in ascending order of object name");
+		type = promised_objects[i * ENTRY_SIZE + GIT_SHA1_RAWSZ];
+		switch (type) {
+			case OBJ_BLOB:
+			case OBJ_TREE:
+			case OBJ_COMMIT:
+			case OBJ_TAG:
+				break;
+			default:
+				return error("Error in list of promised "
+					     "objects: found object of type %d",
+					     type);
+		}
+	}
+	return 0;
+}
diff --git a/promised-object.h b/promised-object.h
new file mode 100644
index 000000000..7eaedff17
--- /dev/null
+++ b/promised-object.h
@@ -0,0 +1,22 @@
+#ifndef PROMISED_OBJECT_H
+#define PROMISED_OBJECT_H
+
+#include "cache.h"
+
+/*
+ * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
+ * reported as their size.
+ */
+int is_promised_object(const struct object_id *oid, enum object_type *type,
+		       unsigned long *size);
+
+typedef int each_promised_object_fn(const struct object_id *oid, void *data);
+int for_each_promised_object(each_promised_object_fn, void *);
+
+/*
+ * Returns 0 if there is no list of promised objects or if the list of promised
+ * objects is valid.
+ */
+int fsck_promised_objects(void);
+
+#endif
diff --git a/setup.c b/setup.c
index 860507e1f..36022452f 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "promisedobjects")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->promised_objects = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_promised_objects = candidate.promised_objects;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
new file mode 100755
index 000000000..3e0caf4f9
--- /dev/null
+++ b/t/t3907-promised-object.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='promised objects'
+
+. ./test-lib.sh
+
+test_expect_success 'fsck fails on missing objects' '
+	test_create_repo repo &&
+
+	test_commit -C repo 1 &&
+	test_commit -C repo 2 &&
+	test_commit -C repo 3 &&
+	git -C repo tag -a annotated_tag -m "annotated tag" &&
+	C=$(git -C repo rev-parse 1) &&
+	T=$(git -C repo rev-parse 2^{tree}) &&
+	B=$(git hash-object repo/3.t) &&
+	AT=$(git -C repo rev-parse annotated_tag) &&
+
+	# missing commit, tree, blob, and tag
+	rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
+	rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
+	rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
+	rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
+	test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if they are promised objects' '
+	printf "%s01%016x\n%s02%016x\n%s03%016x\n%s04%016x" \
+		"$C" 0 "$T" 0 "$B" "$(wc -c <repo/3.t)" "$AT" 0 |
+		sort | tr -d "\n" | hex_pack >repo/.git/objects/promised &&
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.promisedobjects "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
+	GIT_IGNORE_PROMISED_OBJECTS=1 test_must_fail git -C repo fsck &&
+	unset GIT_IGNORE_PROMISED_OBJECTS
+'
+
+test_done
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index db622c355..1ebdd2d04 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1018,3 +1018,9 @@ nongit () {
 		"$@"
 	)
 }
+
+# Converts big-endian pairs of hexadecimal digits into bytes. For example,
+# "printf 61620d0a | hex_pack" results in "ab\r\n".
+hex_pack () {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (6 preceding siblings ...)
  2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
@ 2017-07-20  0:21 ` Jonathan Tan
  2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
  8 siblings, 0 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley

In a subsequent patch, sha1_file will need to append object names in the
form of "unsigned char *" to oid arrays. Teach sha1-array support for
that.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sha1-array.c | 7 +++++++
 sha1-array.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf84..6e0e35391 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -9,6 +9,13 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
 	array->sorted = 0;
 }
 
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1)
+{
+	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
+	hashcpy(array->oid[array->nr++].hash, sha1);
+	array->sorted = 0;
+}
+
 static int void_hashcmp(const void *a, const void *b)
 {
 	return oidcmp(a, b);
diff --git a/sha1-array.h b/sha1-array.h
index 04b075633..3479959e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -11,6 +11,7 @@ struct oid_array {
 #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
 
 void oid_array_append(struct oid_array *array, const struct object_id *oid);
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1);
 int oid_array_lookup(struct oid_array *array, const struct object_id *oid);
 void oid_array_clear(struct oid_array *array);
 
-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH v2 4/4] sha1_file: support promised object hook
  2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
                   ` (7 preceding siblings ...)
  2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
@ 2017-07-20  0:21 ` Jonathan Tan
  2017-07-20 18:23   ` Stefan Beller
  8 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley

Teach sha1_file to invoke a hook whenever an object is requested and
unavailable but is promised. The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.

The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.
parse_or_promise_object() in object.c requires this functionality, and
has been modified to use it.

This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate promised objects (without invoking the hook) or be more
efficient in invoking the promised objects hook.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file that take in a hash, without the user
     regarding how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about promised objects
 - builtin/cat-file - fixed in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promised objects
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

An alternative design that I considered but rejected:

 - Adding a hook whenever a packed object is requested, not on any
   object.  That is, whenever we attempt to search the packfiles for an
   object, if it is missing (from the packfiles and from the loose
   object storage), to invoke the hook (which must then store it as a
   packfile), open the packfile the hook generated, and report that the
   object is found in that new packfile. This reduces the amount of
   analysis needed (in that we only need to look at how packed objects
   are handled), but requires that the hook generate packfiles (or for
   sha1_file to pack whatever loose objects are generated), creating one
   packfile for each missing object and potentially very many packfiles
   that must be linearly searched. This may be tolerable now for repos
   that only have a few missing objects (for example, repos that only
   want to exclude large blobs), and might be tolerable in the future if
   we have batching support for the most commonly used commands, but is
   not tolerable now for repos that exclude a large amount of objects.

Helped-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt                         |   8 +
 Documentation/gitrepository-layout.txt           |   8 +
 Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
 builtin/cat-file.c                               |   9 ++
 cache.h                                          |   2 +
 object.c                                         |   3 +-
 promised-object.c                                | 194 +++++++++++++++++++++++
 promised-object.h                                |  12 ++
 sha1_file.c                                      |  44 +++--
 t/t3907-promised-object.sh                       |  32 ++++
 t/t3907/read-object                              | 114 +++++++++++++
 11 files changed, 513 insertions(+), 15 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 t/t3907/read-object

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab..c293ac921 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.promisedBlobCommand::
+	If set, whenever a blob in the local repo is attempted to be read, but
+	is both missing and a promised blob, invoke this shell command to
+	generate or obtain that blob before reporting an error. This shell
+	command should take one or more hashes, each terminated by a newline,
+	as standard input, and (if successful) should write the corresponding
+	objects to the local repo (packed or loose).
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index f51ed4e37..7dea7fe6b 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
 `objects/info/alternates` points at the object stores it
 borrows from.
 +
+. You could have objects that are merely promised by another source.
+When Git requires those objects, it will invoke the command in the
+`extensions.promisedObjects` configuration variable.
++
 This directory is ignored if $GIT_COMMON_DIR is set and
 "$GIT_COMMON_DIR/objects" will be used instead.
 
@@ -91,6 +95,10 @@ objects/info/http-alternates::
 	this object store borrows objects from, to be used when
 	the repository is fetched over HTTP.
 
+objects/promised::
+	This file records the sha1 object names, types, and sizes of
+	promised objects.
+
 refs::
 	References are stored in subdirectories of this
 	directory.  The 'git prune' command knows to preserve
diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 000000000..a893b46e7
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. All packets, except for the "*CONTENT" packets and
+the "0000" flush packet, are considered text and therefore are
+terminated by a LF.
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved. After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+------------------------
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note, that the process must not send any response before it received the
+final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+------------------------
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+------------------------
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+------------------------
+packet: git< status=success
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content, it
+is expected to respond with an "error" status.
+------------------------
+packet: git< status=error
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content as
+well as any future content for the lifetime of the Git process, then it
+is expected to respond with an "abort" status at any point in the
+protocol.
+------------------------
+packet: git< status=abort
+packet: git< 0000
+------------------------
+
+Git neither stops nor restarts the process in case the "error"/"abort"
+status is set.
+
+If the process dies during the communication or does not adhere to the
+protocol then Git will stop the process and restart it with the next
+object that needs to be processed.
+
+After the read-object process has processed an object it is expected to
+wait for the next "key=value" list containing a command. Git will close
+the command pipe on exit. The process is expected to detect EOF and exit
+gracefully on its own. Git will wait until the process has stopped.
+
+A long running read-object process demo implementation can be found in
+`contrib/long-running-read-object/example.pl` located in the Git core
+repository. If you develop your own long running process then the
+`GIT_TRACE_PACKET` environment variables can be very helpful for
+debugging (see linkgit:git[1]).
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 96b786e48..33f636926 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -12,6 +12,7 @@
 #include "streaming.h"
 #include "tree-walk.h"
 #include "sha1-array.h"
+#include "promised-object.h"
 
 struct batch_options {
 	int enabled;
@@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
 	return 0;
 }
 
+static int batch_promised_object(const struct object_id *oid,
+				 void *data)
+{
+	oid_array_append(data, oid);
+	return 0;
+}
+
 static int batch_objects(struct batch_options *opt)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
 
 		for_each_loose_object(batch_loose_object, &sa, 0);
 		for_each_packed_object(batch_packed_object, &sa, 0);
+		for_each_promised_object(batch_promised_object, &sa);
 
 		cb.opt = opt;
 		cb.expand = &data;
diff --git a/cache.h b/cache.h
index dd94b5ffc..75b71f38b 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,8 @@ struct object_info {
 #define OBJECT_INFO_SKIP_CACHED 4
 /* Do not retry packed storage after checking packed and loose storage */
 #define OBJECT_INFO_QUICK 8
+/* Ignore list of promised objects */
+#define OBJECT_INFO_IGNORE_PROMISES 16
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
 
diff --git a/object.c b/object.c
index 0aeb95084..23f2a6cbc 100644
--- a/object.c
+++ b/object.c
@@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
 {
 	enum object_type type;
 
-	if (has_object_file(oid))
+	if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
+					    OBJECT_INFO_IGNORE_PROMISES))
 		return parse_object(oid);
 
 	if (is_promised_object(oid, &type, NULL)) {
diff --git a/promised-object.c b/promised-object.c
index 487ade437..d8d95ebb2 100644
--- a/promised-object.c
+++ b/promised-object.c
@@ -2,6 +2,12 @@
 #include "promised-object.h"
 #include "sha1-lookup.h"
 #include "strbuf.h"
+#include "run-command.h"
+#include "sha1-array.h"
+#include "config.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"
 
 #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
 /*
@@ -128,3 +134,191 @@ int fsck_promised_objects(void)
 	}
 	return 0;
 }
+
+#define CAP_GET    (1u<<0)
+
+static int subprocess_map_initialized;
+static struct hashmap subprocess_map;
+
+struct read_object_process {
+	struct subprocess_entry subprocess;
+	unsigned int supported_capabilities;
+};
+
+int start_read_object_fn(struct subprocess_entry *subprocess)
+{
+	int err;
+	struct read_object_process *entry = (struct read_object_process *)subprocess;
+	struct child_process *process;
+	struct string_list cap_list = STRING_LIST_INIT_NODUP;
+	char *cap_buf;
+	const char *cap_name;
+
+	process = subprocess_get_child_process(&entry->subprocess);
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
+	if (err)
+		goto done;
+
+	err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
+	if (err) {
+		error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
+		goto done;
+	}
+	err = strcmp(packet_read_line(process->out, NULL), "version=1");
+	if (err)
+		goto done;
+	err = packet_read_line(process->out, NULL) != NULL;
+	if (err)
+		goto done;
+
+	err = packet_writel(process->in, "capability=get", NULL);
+	if (err)
+		goto done;
+
+	for (;;) {
+		cap_buf = packet_read_line(process->out, NULL);
+		if (!cap_buf)
+			break;
+		string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+		if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
+			continue;
+
+		cap_name = cap_list.items[1].string;
+		if (!strcmp(cap_name, "get")) {
+			entry->supported_capabilities |= CAP_GET;
+		}
+		else {
+			warning(
+				"external process '%s' requested unsupported read-object capability '%s'",
+				subprocess->cmd, cap_name
+			);
+		}
+
+		string_list_clear(&cap_list, 0);
+	}
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE)
+		return err ? err : errno;
+
+	return 0;
+}
+
+static int read_object_process(const unsigned char *sha1)
+{
+	int err;
+	struct read_object_process *entry;
+	struct child_process *process;
+	struct strbuf status = STRBUF_INIT;
+	uint64_t start;
+
+	start = getnanotime();
+
+	if (!repository_format_promised_objects)
+		die("BUG: if extensions.promisedObjects is not set, there "
+		    "should not be any promised objects");
+
+	if (!subprocess_map_initialized) {
+		subprocess_map_initialized = 1;
+		hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
+	}
+	if (!entry) {
+		entry = xmalloc(sizeof(*entry));
+		entry->supported_capabilities = 0;
+
+		if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
+			free(entry);
+			return -1;
+		}
+	}
+	process = subprocess_get_child_process(&entry->subprocess);
+
+	if (!(CAP_GET & entry->supported_capabilities))
+		return -1;
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_write_fmt_gently(process->in, "command=get\n");
+	if (err)
+		goto done;
+
+	err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
+	if (err)
+		goto done;
+
+	err = packet_flush_gently(process->in);
+	if (err)
+		goto done;
+
+	err = subprocess_read_status(process->out, &status);
+	err = err ? err : strcmp(status.buf, "success");
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		err = err ? err : errno;
+		if (!strcmp(status.buf, "error")) {
+			/* The process signaled a problem with the file. */
+		}
+		else if (!strcmp(status.buf, "abort")) {
+			/*
+			 * The process signaled a permanent problem. Don't try to read
+			 * objects with the same command for the lifetime of the current
+			 * Git process.
+			 */
+			entry->supported_capabilities &= ~CAP_GET;
+		}
+		else {
+			/*
+			 * Something went wrong with the read-object process.
+			 * Force shutdown and restart if needed.
+			 */
+			error("external process '%s' failed", repository_format_promised_objects);
+			subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
+			free(entry);
+		}
+	}
+
+	trace_performance_since(start, "read_object_process");
+
+	return err;
+}
+
+int request_promised_objects(const struct oid_array *oids)
+{
+	int oids_requested = 0;
+	int i;
+
+	for (i = 0; i < oids->nr; i++) {
+		if (is_promised_object(&oids->oid[i], NULL, NULL))
+			break;
+	}
+
+	if (i == oids->nr)
+		/* Nothing to fetch */
+		return 0;
+
+	for (; i < oids->nr; i++) {
+		if (is_promised_object(&oids->oid[i], NULL, NULL)) {
+			read_object_process(oids->oid[i].hash);
+			oids_requested++;
+		}
+	}
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+	return oids_requested;
+}
diff --git a/promised-object.h b/promised-object.h
index 7eaedff17..8ad47aa4c 100644
--- a/promised-object.h
+++ b/promised-object.h
@@ -2,6 +2,7 @@
 #define PROMISED_OBJECT_H
 
 #include "cache.h"
+#include "sha1-array.h"
 
 /*
  * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
@@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
  */
 int fsck_promised_objects(void);
 
+/*
+ * If any of the given objects are promised objects, invokes
+ * core.promisedobjectcommand with those objects and returns the number of
+ * objects requested. No check is made as to whether the invocation actually
+ * populated the repository with the promised objects.
+ *
+ * If none of the given objects are promised objects, this function does not
+ * invoke anything and returns 0.
+ */
+int request_promised_objects(const struct oid_array *oids);
+
 #endif
diff --git a/sha1_file.c b/sha1_file.c
index 5862386cd..ded0ef46b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,11 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
+#include "promised-object.h"
+#include "sha1-array.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
+	int already_retried = 0;
 
 	if (!oi)
 		oi = &blank_oi;
@@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags)) {
-			oi->whence = OI_LOOSE;
-			return 0;
-		}
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
-			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
+	/* Most likely it's a loose object. */
+	if (!sha1_loose_object_info(real, oi, flags)) {
+		oi->whence = OI_LOOSE;
+		return 0;
+	}
+
+	/* Not a loose object; someone else may have just packed it. */
+	reprepare_packed_git();
+	if (find_pack_entry(real, &e))
+		goto found_packed;
+
+	/* Check if it is a promised blob */
+	if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
+		struct oid_array promised = OID_ARRAY_INIT;
+		oid_array_append_sha1(&promised, real);
+		if (request_promised_objects(&promised)) {
+			already_retried = 1;
+			goto retry;
 		}
 	}
 
+	return -1;
+
+found_packed:
 	if (oi == &blank_oi)
 		/*
 		 * We know that the caller doesn't actually need the
 		 * information below, so return early.
 		 */
 		return 0;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
index 3e0caf4f9..d9e6a6486 100755
--- a/t/t3907-promised-object.sh
+++ b/t/t3907-promised-object.sh
@@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
 	unset GIT_IGNORE_PROMISED_OBJECTS
 '
 
+test_expect_success 'sha1_object_info_extended (through git cat-file)' '
+	test_create_repo server &&
+	test_commit -C server 1 1.t abcdefgh &&
+	HASH=$(git hash-object server/1.t) &&
+
+	test_create_repo client &&
+	test_must_fail git -C client cat-file -p "$HASH"
+'
+
+test_expect_success '...succeeds if it is a promised object' '
+	printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
+		hex_pack >client/.git/objects/promised &&
+	git -C client config core.repositoryformatversion 1 &&
+	git -C client config extensions.promisedobjects \
+		"\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'cat-file --batch-all-objects with promised objects' '
+	rm -rf client &&
+	test_create_repo client &&
+	git -C client config core.repositoryformatversion 1 &&
+	git -C client config extensions.promisedobjects \
+		"\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
+	printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
+		hex_pack >client/.git/objects/promised &&
+
+	# Verify that the promised object is printed
+	git -C client cat-file --batch --batch-all-objects | tee out |
+		grep abcdefgh
+'
+
 test_done
diff --git a/t/t3907/read-object b/t/t3907/read-object
new file mode 100755
index 000000000..9666ad597
--- /dev/null
+++ b/t/t3907/read-object
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+#	cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = $ARGV[0];
+
+sub packet_bin_read {
+	my $buffer;
+	my $bytes_read = read STDIN, $buffer, 4;
+	if ( $bytes_read == 0 ) {
+
+		# EOF - Git stopped talking to us!
+		exit();
+	}
+	elsif ( $bytes_read != 4 ) {
+		die "invalid packet: '$buffer'";
+	}
+	my $pkt_size = hex($buffer);
+	if ( $pkt_size == 0 ) {
+		return ( 1, "" );
+	}
+	elsif ( $pkt_size > 4 ) {
+		my $content_size = $pkt_size - 4;
+		$bytes_read = read STDIN, $buffer, $content_size;
+		if ( $bytes_read != $content_size ) {
+			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
+		}
+		return ( 0, $buffer );
+	}
+	else {
+		die "invalid packet size: $pkt_size";
+	}
+}
+
+sub packet_txt_read {
+	my ( $res, $buf ) = packet_bin_read();
+	unless ( $buf =~ s/\n$// ) {
+		die "A non-binary line MUST be terminated by an LF.";
+	}
+	return ( $res, $buf );
+}
+
+sub packet_bin_write {
+	my $buf = shift;
+	print STDOUT sprintf( "%04x", length($buf) + 4 );
+	print STDOUT $buf;
+	STDOUT->flush();
+}
+
+sub packet_txt_write {
+	packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+	print STDOUT sprintf( "%04x", 0 );
+	STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )				 || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+	if ( $command eq "get" ) {
+		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+		packet_bin_read();
+
+		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
+		packet_txt_write(($?) ? "status=error" : "status=success");
+		packet_flush();
+	} else {
+		die "bad command '$command'";
+	}
+}
-- 
2.14.0.rc0.284.gd933b75aa4-goog


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
  2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
@ 2017-07-20  0:36   ` Stefan Beller
  2017-07-20  0:55     ` Jonathan Tan
  2017-07-20 21:20   ` Junio C Hamano
  1 sibling, 1 reply; 46+ messages in thread
From: Stefan Beller @ 2017-07-20  0:36 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jonathan Nieder, Jeff Hostetler, Ben Peart, Philip Oakley

On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> The "used" field in struct object is only used by builtin/fsck. Remove
> that field and modify builtin/fsck to use a flag instead.

The patch looks good to me (I would even claim this could
go in as an independent cleanup, not tied to the RFCish nature
of the later patches), though I have a question:
How did you select 0x0008 for USED, i.e. does it
collide with other flags (theoretically?), and if so
how do we make sure to avoid the collusion in
the future?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
  2017-07-20  0:36   ` Stefan Beller
@ 2017-07-20  0:55     ` Jonathan Tan
  2017-07-20 17:44       ` Ben Peart
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20  0:55 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jonathan Nieder, Jeff Hostetler, Ben Peart, Philip Oakley

On Wed, 19 Jul 2017 17:36:39 -0700
Stefan Beller <sbeller@google.com> wrote:

> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> > The "used" field in struct object is only used by builtin/fsck. Remove
> > that field and modify builtin/fsck to use a flag instead.
> 
> The patch looks good to me (I would even claim this could
> go in as an independent cleanup, not tied to the RFCish nature
> of the later patches), though I have a question:
> How did you select 0x0008 for USED, i.e. does it
> collide with other flags (theoretically?), and if so
> how do we make sure to avoid the collusion in
> the future?

Thanks. 0x0008 was the next one in the series (as you can see in the
context). As for whether it collides with other flags, that is what the
chart in object.h is for (which I have added to in this patch), I
presume. As far as I can tell, each component must make sure not to
overlap with any other component running concurrently.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
  2017-07-20  0:55     ` Jonathan Tan
@ 2017-07-20 17:44       ` Ben Peart
  0 siblings, 0 replies; 46+ messages in thread
From: Ben Peart @ 2017-07-20 17:44 UTC (permalink / raw)
  To: Jonathan Tan, Stefan Beller
  Cc: git, Jonathan Nieder, Jeff Hostetler, Philip Oakley



On 7/19/2017 8:55 PM, Jonathan Tan wrote:
> On Wed, 19 Jul 2017 17:36:39 -0700
> Stefan Beller <sbeller@google.com> wrote:
> 
>> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>>> The "used" field in struct object is only used by builtin/fsck. Remove
>>> that field and modify builtin/fsck to use a flag instead.
>>
>> The patch looks good to me (I would even claim this could
>> go in as an independent cleanup, not tied to the RFCish nature
>> of the later patches), though I have a question:
>> How did you select 0x0008 for USED, i.e. does it
>> collide with other flags (theoretically?), and if so
>> how do we make sure to avoid the collusion in
>> the future?
> 
> Thanks. 0x0008 was the next one in the series (as you can see in the
> context). As for whether it collides with other flags, that is what the
> chart in object.h is for (which I have added to in this patch), I
> presume. As far as I can tell, each component must make sure not to
> overlap with any other component running concurrently.
> 

This patch seems reasonable to me.  I agree it could go in separately as 
a general cleanup.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
@ 2017-07-20 18:07   ` Stefan Beller
  2017-07-20 19:17     ` Jonathan Tan
  2017-07-20 19:58   ` Ben Peart
  1 sibling, 1 reply; 46+ messages in thread
From: Stefan Beller @ 2017-07-20 18:07 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jonathan Nieder, Jeff Hostetler, Ben Peart, Philip Oakley

On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/technical/repository-version.txt |   6 ++
>  Makefile                                       |   1 +
>  builtin/fsck.c                                 |  18 +++-
>  cache.h                                        |   2 +
>  environment.c                                  |   1 +
>  fsck.c                                         |   6 +-
>  object.c                                       |  19 ++++
>  object.h                                       |  19 ++++
>  promised-object.c                              | 130 +++++++++++++++++++++++++
>  promised-object.h                              |  22 +++++
>  setup.c                                        |   7 +-
>  t/t3907-promised-object.sh                     |  41 ++++++++
>  t/test-lib-functions.sh                        |   6 ++
>  13 files changed, 273 insertions(+), 5 deletions(-)
>  create mode 100644 promised-object.c
>  create mode 100644 promised-object.h
>  create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
>  When the config key `extensions.preciousObjects` is set to `true`,
>  objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
>  `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
>  LIB_OBJS += pretty.o
>  LIB_OBJS += prio-queue.o
>  LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
>  LIB_OBJS += prompt.o
>  LIB_OBJS += quote.o
>  LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
>  #include "progress.h"
>  #include "streaming.h"
>  #include "decorate.h"
> +#include "promised-object.h"
>
>  #define REACHABLE 0x0001
>  #define SEEN      0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
>  #define ERROR_REACHABLE 02
>  #define ERROR_PACK 04
>  #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
>  static const char *describe_object(struct object *obj)
>  {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
>  {
>         struct object *obj;
>
> -       obj = parse_object(oid);
> +       obj = parse_or_promise_object(oid);
>         if (!obj) {
>                 error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
>                 errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
>                 fprintf(stderr, "Checking cache tree\n");
>
>         if (0 <= it->entry_count) {
> -               struct object *obj = parse_object(&it->oid);
> +               struct object *obj = parse_or_promise_object(&it->oid);
>                 if (!obj) {
>                         error("%s: invalid sha1 pointer in cache-tree",
>                               oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
>         return 0;
>  }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> +       mark_object_for_connectivity(oid);
> +       return 0;
> +}
> +
>  static char const * const fsck_usage[] = {
>         N_("git fsck [<options>] [<object>...]"),
>         NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>
>         git_config(fsck_config, NULL);
>
> +       if (fsck_promised_objects()) {
> +               error("Errors found in promised object list");
> +               errors_found |= ERROR_PROMISED_OBJECT;
> +       }

This got me thinking: It is an error if we do not have an object
and also do not promise it, but what about the other case:
having and object and promising it, too?
That is not harmful to the operation, except that the promise
machinery may be slower due to its size. (Should that be a soft
warning then? Do we have warnings in fsck?)

>   * The object type is stored in 3 bits.
>   */

We may want to remove this comment while we're here as it
sounds stale despite being technically correct.
1974632c66 (Remove TYPE_* constant macros and use
object_type enums consistently., 2006-07-11)

>  struct object {
> +       /*
> +        * Set if this object is parsed. If set, "type" is populated and this
> +        * object can be casted to "struct commit" or an equivalent.
> +        */
>         unsigned parsed : 1;
> +       /*
> +        * Set if this object is not in the repo but is promised. If set,
> +        * "type" is populated, but this object cannot be casted to "struct
> +        * commit" or an equivalent.
> +        */
> +       unsigned promised : 1;

Would it make sense to have a bit field instead:

#define STATE_BITS 2
#define STATE_PARSED (1<<0)
#define STATE_PROMISED (1<<1)

    unsigned state : STATE_BITS

This would be similar to the types and flags?


> +test_expect_success 'fsck fails on missing objects' '
> +       test_create_repo repo &&
> +
> +       test_commit -C repo 1 &&
> +       test_commit -C repo 2 &&
> +       test_commit -C repo 3 &&
> +       git -C repo tag -a annotated_tag -m "annotated tag" &&
> +       C=$(git -C repo rev-parse 1) &&
> +       T=$(git -C repo rev-parse 2^{tree}) &&
> +       B=$(git hash-object repo/3.t) &&
> +       AT=$(git -C repo rev-parse annotated_tag) &&
> +
> +       # missing commit, tree, blob, and tag
> +       rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> +       rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> +       rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> +       rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&

This is a pretty cool test as it promises all sorts of objects
from different parts of the graph.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
  2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
@ 2017-07-20 18:23   ` Stefan Beller
  2017-07-20 20:58     ` Ben Peart
  0 siblings, 1 reply; 46+ messages in thread
From: Stefan Beller @ 2017-07-20 18:23 UTC (permalink / raw)
  To: Jonathan Tan, Lars Schneider
  Cc: git, Jonathan Nieder, Jeff Hostetler, Ben Peart, Philip Oakley

On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Teach sha1_file to invoke a hook whenever an object is requested and
> unavailable but is promised. The hook is a shell command that can be
> configured through "git config"; this hook takes in a list of hashes and
> writes (if successful) the corresponding objects to the repo's local
> storage.
>
> The usage of the hook can be suppressed through a flag when invoking
> has_object_file_with_flags() and other similar functions.
> parse_or_promise_object() in object.c requires this functionality, and
> has been modified to use it.
>
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate promised objects (without invoking the hook) or be more
> efficient in invoking the promised objects hook.
>
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>      regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>      need to check callers that know about the loose/packed distinction
>      and operate on both differently, and ensure that they can handle
>      the concept of objects that are neither loose nor packed)
>
> (1) is handled by the modification to sha1_object_info_extended().
>
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>  - reachable - only to find recent objects
>  - builtin/fsck - already knows about promised objects
>  - builtin/cat-file - fixed in this commit
>
> Callers of the other functions do not need to be changed:
>  - parse_pack_index
>    - http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>    - this searches a single pack that is provided as an argument; the
>      caller already knows (through other means) that the sought object
>      is in a specific pack
>  - find_sha1_pack
>    - fast-import - appears to be an optimization to not store a
>      file if it is already in a pack
>    - http-walker - to search through a struct alt_base
>    - http-push - to search through remote packs
>  - has_sha1_pack
>    - builtin/fsck - already knows about promised objects
>    - builtin/count-objects - informational purposes only (check if loose
>      object is also packed)
>    - builtin/prune-packed - check if object to be pruned is packed (if
>      not, don't prune it)
>    - revision - used to exclude packed objects if requested by user
>    - diff - just for optimization
>
> An alternative design that I considered but rejected:
>
>  - Adding a hook whenever a packed object is requested, not on any
>    object.  That is, whenever we attempt to search the packfiles for an
>    object, if it is missing (from the packfiles and from the loose
>    object storage), to invoke the hook (which must then store it as a
>    packfile), open the packfile the hook generated, and report that the
>    object is found in that new packfile. This reduces the amount of
>    analysis needed (in that we only need to look at how packed objects
>    are handled), but requires that the hook generate packfiles (or for
>    sha1_file to pack whatever loose objects are generated), creating one
>    packfile for each missing object and potentially very many packfiles
>    that must be linearly searched. This may be tolerable now for repos
>    that only have a few missing objects (for example, repos that only
>    want to exclude large blobs), and might be tolerable in the future if
>    we have batching support for the most commonly used commands, but is
>    not tolerable now for repos that exclude a large amount of objects.
>
> Helped-by: Ben Peart <benpeart@microsoft.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/config.txt                         |   8 +
>  Documentation/gitrepository-layout.txt           |   8 +
>  Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
>  builtin/cat-file.c                               |   9 ++
>  cache.h                                          |   2 +
>  object.c                                         |   3 +-
>  promised-object.c                                | 194 +++++++++++++++++++++++
>  promised-object.h                                |  12 ++
>  sha1_file.c                                      |  44 +++--
>  t/t3907-promised-object.sh                       |  32 ++++
>  t/t3907/read-object                              | 114 +++++++++++++
>  11 files changed, 513 insertions(+), 15 deletions(-)
>  create mode 100644 Documentation/technical/read-object-protocol.txt
>  create mode 100755 t/t3907/read-object
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d5c9c4cab..c293ac921 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
>  will probe and set core.ignoreCase true if appropriate when the repository
>  is created.
>
> +core.promisedBlobCommand::
> +       If set, whenever a blob in the local repo is attempted to be read, but
> +       is both missing and a promised blob, invoke this shell command to
> +       generate or obtain that blob before reporting an error. This shell
> +       command should take one or more hashes, each terminated by a newline,
> +       as standard input, and (if successful) should write the corresponding
> +       objects to the local repo (packed or loose).
> +
>  core.precomposeUnicode::
>         This option is only used by Mac OS implementation of Git.
>         When core.precomposeUnicode=true, Git reverts the unicode decomposition
> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
> index f51ed4e37..7dea7fe6b 100644
> --- a/Documentation/gitrepository-layout.txt
> +++ b/Documentation/gitrepository-layout.txt
> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
>  `objects/info/alternates` points at the object stores it
>  borrows from.
>  +
> +. You could have objects that are merely promised by another source.
> +When Git requires those objects, it will invoke the command in the
> +`extensions.promisedObjects` configuration variable.
> ++
>  This directory is ignored if $GIT_COMMON_DIR is set and
>  "$GIT_COMMON_DIR/objects" will be used instead.
>
> @@ -91,6 +95,10 @@ objects/info/http-alternates::
>         this object store borrows objects from, to be used when
>         the repository is fetched over HTTP.
>
> +objects/promised::
> +       This file records the sha1 object names, types, and sizes of
> +       promised objects.
> +
>  refs::
>         References are stored in subdirectories of this
>         directory.  The 'git prune' command knows to preserve
> diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
> new file mode 100644
> index 000000000..a893b46e7
> --- /dev/null
> +++ b/Documentation/technical/read-object-protocol.txt
> @@ -0,0 +1,102 @@
> +Read Object Process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^

This protocol reads very similar to the protocol that is used in
clean/smudge filtering, designed by Lars (cc'd)

> +
> +The read-object process enables Git to read all missing blobs with a
> +single process invocation for the entire life of a single Git command.
> +This is achieved by using a packet format (pkt-line, see technical/
> +protocol-common.txt) based protocol over standard input and standard
> +output as follows. All packets, except for the "*CONTENT" packets and
> +the "0000" flush packet, are considered text and therefore are
> +terminated by a LF.
> +
> +Git starts the process when it encounters the first missing object that
> +needs to be retrieved. After the process is started, Git sends a welcome
> +message ("git-read-object-client"), a list of supported protocol version
> +numbers, and a flush packet. Git expects to read a welcome response
> +message ("git-read-object-server"), exactly one protocol version number
> +from the previously sent list, and a flush packet. All further
> +communication will be based on the selected version.
> +
> +The remaining protocol description below documents "version=1". Please
> +note that "version=42" in the example below does not exist and is only
> +there to illustrate how the protocol would look with more than one
> +version.
> +
> +After the version negotiation Git sends a list of all capabilities that
> +it supports and a flush packet. Git expects to read a list of desired
> +capabilities, which must be a subset of the supported capabilities list,
> +and a flush packet as response:
> +------------------------
> +packet: git> git-read-object-client
> +packet: git> version=1
> +packet: git> version=42
> +packet: git> 0000
> +packet: git< git-read-object-server
> +packet: git< version=1
> +packet: git< 0000
> +packet: git> capability=get
> +packet: git> capability=have
> +packet: git> capability=put
> +packet: git> capability=not-yet-invented
> +packet: git> 0000
> +packet: git< capability=get
> +packet: git< 0000
> +------------------------
> +The only supported capability in version 1 is "get".
> +
> +Afterwards Git sends a list of "key=value" pairs terminated with a flush
> +packet. The list will contain at least the command (based on the
> +supported capabilities) and the sha1 of the object to retrieve. Please
> +note, that the process must not send any response before it received the
> +final flush packet.
> +
> +When the process receives the "get" command, it should make the requested
> +object available in the git object store and then return success. Git will
> +then check the object store again and this time find it and proceed.
> +------------------------
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git> 0000
> +------------------------
> +
> +The process is expected to respond with a list of "key=value" pairs
> +terminated with a flush packet. If the process does not experience
> +problems then the list must contain a "success" status.
> +------------------------
> +packet: git< status=success
> +packet: git< 0000
> +------------------------
> +
> +In case the process cannot or does not want to process the content, it
> +is expected to respond with an "error" status.
> +------------------------
> +packet: git< status=error
> +packet: git< 0000
> +------------------------
> +
> +In case the process cannot or does not want to process the content as
> +well as any future content for the lifetime of the Git process, then it
> +is expected to respond with an "abort" status at any point in the
> +protocol.
> +------------------------
> +packet: git< status=abort
> +packet: git< 0000
> +------------------------
> +
> +Git neither stops nor restarts the process in case the "error"/"abort"
> +status is set.
> +
> +If the process dies during the communication or does not adhere to the
> +protocol then Git will stop the process and restart it with the next
> +object that needs to be processed.
> +
> +After the read-object process has processed an object it is expected to
> +wait for the next "key=value" list containing a command. Git will close
> +the command pipe on exit. The process is expected to detect EOF and exit
> +gracefully on its own. Git will wait until the process has stopped.
> +
> +A long running read-object process demo implementation can be found in
> +`contrib/long-running-read-object/example.pl` located in the Git core
> +repository. If you develop your own long running process then the
> +`GIT_TRACE_PACKET` environment variables can be very helpful for
> +debugging (see linkgit:git[1]).
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 96b786e48..33f636926 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -12,6 +12,7 @@
>  #include "streaming.h"
>  #include "tree-walk.h"
>  #include "sha1-array.h"
> +#include "promised-object.h"
>
>  struct batch_options {
>         int enabled;
> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
>         return 0;
>  }
>
> +static int batch_promised_object(const struct object_id *oid,
> +                                void *data)
> +{
> +       oid_array_append(data, oid);
> +       return 0;
> +}
> +
>  static int batch_objects(struct batch_options *opt)
>  {
>         struct strbuf buf = STRBUF_INIT;
> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>
>                 for_each_loose_object(batch_loose_object, &sa, 0);
>                 for_each_packed_object(batch_packed_object, &sa, 0);
> +               for_each_promised_object(batch_promised_object, &sa);
>
>                 cb.opt = opt;
>                 cb.expand = &data;
> diff --git a/cache.h b/cache.h
> index dd94b5ffc..75b71f38b 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1835,6 +1835,8 @@ struct object_info {
>  #define OBJECT_INFO_SKIP_CACHED 4
>  /* Do not retry packed storage after checking packed and loose storage */
>  #define OBJECT_INFO_QUICK 8
> +/* Ignore list of promised objects */
> +#define OBJECT_INFO_IGNORE_PROMISES 16
>  extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
>  extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
>
> diff --git a/object.c b/object.c
> index 0aeb95084..23f2a6cbc 100644
> --- a/object.c
> +++ b/object.c
> @@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
>  {
>         enum object_type type;
>
> -       if (has_object_file(oid))
> +       if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
> +                                           OBJECT_INFO_IGNORE_PROMISES))
>                 return parse_object(oid);
>
>         if (is_promised_object(oid, &type, NULL)) {
> diff --git a/promised-object.c b/promised-object.c
> index 487ade437..d8d95ebb2 100644
> --- a/promised-object.c
> +++ b/promised-object.c
> @@ -2,6 +2,12 @@
>  #include "promised-object.h"
>  #include "sha1-lookup.h"
>  #include "strbuf.h"
> +#include "run-command.h"
> +#include "sha1-array.h"
> +#include "config.h"
> +#include "sigchain.h"
> +#include "sub-process.h"
> +#include "pkt-line.h"
>
>  #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
>  /*
> @@ -128,3 +134,191 @@ int fsck_promised_objects(void)
>         }
>         return 0;
>  }
> +
> +#define CAP_GET    (1u<<0)
> +
> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;
> +
> +struct read_object_process {
> +       struct subprocess_entry subprocess;
> +       unsigned int supported_capabilities;
> +};
> +
> +int start_read_object_fn(struct subprocess_entry *subprocess)
> +{
> +       int err;
> +       struct read_object_process *entry = (struct read_object_process *)subprocess;
> +       struct child_process *process;
> +       struct string_list cap_list = STRING_LIST_INIT_NODUP;
> +       char *cap_buf;
> +       const char *cap_name;
> +
> +       process = subprocess_get_child_process(&entry->subprocess);
> +
> +       sigchain_push(SIGPIPE, SIG_IGN);
> +
> +       err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
> +       if (err)
> +               goto done;
> +
> +       err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
> +       if (err) {
> +               error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
> +               goto done;
> +       }
> +       err = strcmp(packet_read_line(process->out, NULL), "version=1");
> +       if (err)
> +               goto done;
> +       err = packet_read_line(process->out, NULL) != NULL;
> +       if (err)
> +               goto done;
> +
> +       err = packet_writel(process->in, "capability=get", NULL);
> +       if (err)
> +               goto done;
> +
> +       for (;;) {
> +               cap_buf = packet_read_line(process->out, NULL);
> +               if (!cap_buf)
> +                       break;
> +               string_list_split_in_place(&cap_list, cap_buf, '=', 1);
> +
> +               if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
> +                       continue;
> +
> +               cap_name = cap_list.items[1].string;
> +               if (!strcmp(cap_name, "get")) {
> +                       entry->supported_capabilities |= CAP_GET;
> +               }
> +               else {
> +                       warning(
> +                               "external process '%s' requested unsupported read-object capability '%s'",
> +                               subprocess->cmd, cap_name
> +                       );
> +               }
> +
> +               string_list_clear(&cap_list, 0);
> +       }
> +
> +done:
> +       sigchain_pop(SIGPIPE);
> +
> +       if (err || errno == EPIPE)
> +               return err ? err : errno;
> +
> +       return 0;
> +}
> +
> +static int read_object_process(const unsigned char *sha1)
> +{
> +       int err;
> +       struct read_object_process *entry;
> +       struct child_process *process;
> +       struct strbuf status = STRBUF_INIT;
> +       uint64_t start;
> +
> +       start = getnanotime();
> +
> +       if (!repository_format_promised_objects)
> +               die("BUG: if extensions.promisedObjects is not set, there "
> +                   "should not be any promised objects");
> +
> +       if (!subprocess_map_initialized) {
> +               subprocess_map_initialized = 1;
> +               hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
> +               entry = NULL;
> +       } else {
> +               entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
> +       }
> +       if (!entry) {
> +               entry = xmalloc(sizeof(*entry));
> +               entry->supported_capabilities = 0;
> +
> +               if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
> +                       free(entry);
> +                       return -1;
> +               }
> +       }
> +       process = subprocess_get_child_process(&entry->subprocess);
> +
> +       if (!(CAP_GET & entry->supported_capabilities))
> +               return -1;
> +
> +       sigchain_push(SIGPIPE, SIG_IGN);
> +
> +       err = packet_write_fmt_gently(process->in, "command=get\n");
> +       if (err)
> +               goto done;
> +
> +       err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
> +       if (err)
> +               goto done;
> +
> +       err = packet_flush_gently(process->in);
> +       if (err)
> +               goto done;
> +
> +       err = subprocess_read_status(process->out, &status);
> +       err = err ? err : strcmp(status.buf, "success");
> +
> +done:
> +       sigchain_pop(SIGPIPE);
> +
> +       if (err || errno == EPIPE) {
> +               err = err ? err : errno;
> +               if (!strcmp(status.buf, "error")) {
> +                       /* The process signaled a problem with the file. */
> +               }
> +               else if (!strcmp(status.buf, "abort")) {
> +                       /*
> +                        * The process signaled a permanent problem. Don't try to read
> +                        * objects with the same command for the lifetime of the current
> +                        * Git process.
> +                        */
> +                       entry->supported_capabilities &= ~CAP_GET;
> +               }
> +               else {
> +                       /*
> +                        * Something went wrong with the read-object process.
> +                        * Force shutdown and restart if needed.
> +                        */
> +                       error("external process '%s' failed", repository_format_promised_objects);
> +                       subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
> +                       free(entry);
> +               }
> +       }
> +
> +       trace_performance_since(start, "read_object_process");
> +
> +       return err;
> +}
> +
> +int request_promised_objects(const struct oid_array *oids)
> +{
> +       int oids_requested = 0;
> +       int i;
> +
> +       for (i = 0; i < oids->nr; i++) {
> +               if (is_promised_object(&oids->oid[i], NULL, NULL))
> +                       break;
> +       }
> +
> +       if (i == oids->nr)
> +               /* Nothing to fetch */
> +               return 0;
> +
> +       for (; i < oids->nr; i++) {
> +               if (is_promised_object(&oids->oid[i], NULL, NULL)) {
> +                       read_object_process(oids->oid[i].hash);
> +                       oids_requested++;
> +               }
> +       }
> +
> +       /*
> +        * The command above may have updated packfiles, so update our record
> +        * of them.
> +        */
> +       reprepare_packed_git();
> +       return oids_requested;
> +}
> diff --git a/promised-object.h b/promised-object.h
> index 7eaedff17..8ad47aa4c 100644
> --- a/promised-object.h
> +++ b/promised-object.h
> @@ -2,6 +2,7 @@
>  #define PROMISED_OBJECT_H
>
>  #include "cache.h"
> +#include "sha1-array.h"
>
>  /*
>   * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
> @@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
>   */
>  int fsck_promised_objects(void);
>
> +/*
> + * If any of the given objects are promised objects, invokes
> + * core.promisedobjectcommand with those objects and returns the number of
> + * objects requested. No check is made as to whether the invocation actually
> + * populated the repository with the promised objects.
> + *
> + * If none of the given objects are promised objects, this function does not
> + * invoke anything and returns 0.
> + */
> +int request_promised_objects(const struct oid_array *oids);
> +
>  #endif
> diff --git a/sha1_file.c b/sha1_file.c
> index 5862386cd..ded0ef46b 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
>  #include "list.h"
>  #include "mergesort.h"
>  #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "promised-object.h"
> +#include "sha1-array.h"
>
>  #define SZ_FMT PRIuMAX
>  static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>         const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>                                     lookup_replace_object(sha1) :
>                                     sha1;
> +       int already_retried = 0;
>
>         if (!oi)
>                 oi = &blank_oi;
> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>                 }
>         }
>
> -       if (!find_pack_entry(real, &e)) {
> -               /* Most likely it's a loose object. */
> -               if (!sha1_loose_object_info(real, oi, flags)) {
> -                       oi->whence = OI_LOOSE;
> -                       return 0;
> -               }
> +retry:
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
>
> -               /* Not a loose object; someone else may have just packed it. */
> -               if (flags & OBJECT_INFO_QUICK) {
> -                       return -1;
> -               } else {
> -                       reprepare_packed_git();
> -                       if (!find_pack_entry(real, &e))
> -                               return -1;
> +       /* Most likely it's a loose object. */
> +       if (!sha1_loose_object_info(real, oi, flags)) {
> +               oi->whence = OI_LOOSE;
> +               return 0;
> +       }
> +
> +       /* Not a loose object; someone else may have just packed it. */
> +       reprepare_packed_git();
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
> +
> +       /* Check if it is a promised blob */
> +       if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
> +               struct oid_array promised = OID_ARRAY_INIT;
> +               oid_array_append_sha1(&promised, real);
> +               if (request_promised_objects(&promised)) {
> +                       already_retried = 1;
> +                       goto retry;
>                 }
>         }
>
> +       return -1;
> +
> +found_packed:
>         if (oi == &blank_oi)
>                 /*
>                  * We know that the caller doesn't actually need the
>                  * information below, so return early.
>                  */
>                 return 0;
> -
>         rtype = packed_object_info(e.p, e.offset, oi);
>         if (rtype < 0) {
>                 mark_bad_packed_object(e.p, real);
> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
> index 3e0caf4f9..d9e6a6486 100755
> --- a/t/t3907-promised-object.sh
> +++ b/t/t3907-promised-object.sh
> @@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
>         unset GIT_IGNORE_PROMISED_OBJECTS
>  '
>
> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
> +       test_create_repo server &&
> +       test_commit -C server 1 1.t abcdefgh &&
> +       HASH=$(git hash-object server/1.t) &&
> +
> +       test_create_repo client &&
> +       test_must_fail git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success '...succeeds if it is a promised object' '
> +       printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
> +               hex_pack >client/.git/objects/promised &&
> +       git -C client config core.repositoryformatversion 1 &&
> +       git -C client config extensions.promisedobjects \
> +               "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
> +       git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success 'cat-file --batch-all-objects with promised objects' '
> +       rm -rf client &&
> +       test_create_repo client &&
> +       git -C client config core.repositoryformatversion 1 &&
> +       git -C client config extensions.promisedobjects \
> +               "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
> +       printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
> +               hex_pack >client/.git/objects/promised &&
> +
> +       # Verify that the promised object is printed
> +       git -C client cat-file --batch --batch-all-objects | tee out |
> +               grep abcdefgh
> +'
> +
>  test_done
> diff --git a/t/t3907/read-object b/t/t3907/read-object
> new file mode 100755
> index 000000000..9666ad597
> --- /dev/null
> +++ b/t/t3907/read-object
> @@ -0,0 +1,114 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git read-object protocol version 1
> +# See Documentation/technical/read-object-protocol.txt
> +#
> +# Allows you to test the ability for blobs to be pulled from a host git repo
> +# "on demand."  Called when git needs a blob it couldn't find locally due to
> +# a lazy clone that only cloned the commits and trees.
> +#
> +# A lazy clone can be simulated via the following commands from the host repo
> +# you wish to create a lazy clone of:
> +#
> +# cd /host_repo
> +# git rev-parse HEAD
> +# git init /guest_repo
> +# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
> +#      cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
> +# cd /guest_repo
> +# git config core.virtualizeobjects true
> +# git reset --hard <sha from rev-parse call above>
> +#
> +# Please note, this sample is a minimal skeleton. No proper error handling
> +# was implemented.
> +#
> +
> +use strict;
> +use warnings;
> +
> +#
> +# Point $DIR to the folder where your host git repo is located so we can pull
> +# missing objects from it
> +#
> +my $DIR = $ARGV[0];
> +
> +sub packet_bin_read {
> +       my $buffer;
> +       my $bytes_read = read STDIN, $buffer, 4;
> +       if ( $bytes_read == 0 ) {
> +
> +               # EOF - Git stopped talking to us!
> +               exit();
> +       }
> +       elsif ( $bytes_read != 4 ) {
> +               die "invalid packet: '$buffer'";
> +       }
> +       my $pkt_size = hex($buffer);
> +       if ( $pkt_size == 0 ) {
> +               return ( 1, "" );
> +       }
> +       elsif ( $pkt_size > 4 ) {
> +               my $content_size = $pkt_size - 4;
> +               $bytes_read = read STDIN, $buffer, $content_size;
> +               if ( $bytes_read != $content_size ) {
> +                       die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +               }
> +               return ( 0, $buffer );
> +       }
> +       else {
> +               die "invalid packet size: $pkt_size";
> +       }
> +}
> +
> +sub packet_txt_read {
> +       my ( $res, $buf ) = packet_bin_read();
> +       unless ( $buf =~ s/\n$// ) {
> +               die "A non-binary line MUST be terminated by an LF.";
> +       }
> +       return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +       my $buf = shift;
> +       print STDOUT sprintf( "%04x", length($buf) + 4 );
> +       print STDOUT $buf;
> +       STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +       packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +       print STDOUT sprintf( "%04x", 0 );
> +       STDOUT->flush();
> +}
> +
> +( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=1" ) )                             || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
> +
> +packet_txt_write("git-read-object-server");
> +packet_txt_write("version=1");
> +packet_flush();
> +
> +( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
> +
> +packet_txt_write("capability=get");
> +packet_flush();
> +
> +while (1) {
> +       my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
> +
> +       if ( $command eq "get" ) {
> +               my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
> +               packet_bin_read();
> +
> +               system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
> +               packet_txt_write(($?) ? "status=error" : "status=success");
> +               packet_flush();
> +       } else {
> +               die "bad command '$command'";
> +       }
> +}
> --
> 2.14.0.rc0.284.gd933b75aa4-goog
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-20 18:07   ` Stefan Beller
@ 2017-07-20 19:17     ` Jonathan Tan
  0 siblings, 0 replies; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20 19:17 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jonathan Nieder, Jeff Hostetler, Ben Peart, Philip Oakley

On Thu, 20 Jul 2017 11:07:29 -0700
Stefan Beller <sbeller@google.com> wrote:

> > +       if (fsck_promised_objects()) {
> > +               error("Errors found in promised object list");
> > +               errors_found |= ERROR_PROMISED_OBJECT;
> > +       }
> 
> This got me thinking: It is an error if we do not have an object
> and also do not promise it, but what about the other case:
> having and object and promising it, too?
> That is not harmful to the operation, except that the promise
> machinery may be slower due to its size. (Should that be a soft
> warning then? Do we have warnings in fsck?)

Good question - having an object and also having it promised is not an
error condition (and I don't think it's a good idea to make it so,
because objects can appear quite easily from various sources). In the
future, I expect "git gc" to be extended to remove such redundant lines
from the "promised" list.

> >   * The object type is stored in 3 bits.
> >   */
> 
> We may want to remove this comment while we're here as it
> sounds stale despite being technically correct.
> 1974632c66 (Remove TYPE_* constant macros and use
> object_type enums consistently., 2006-07-11)

I agree that the comment is unnecessary, but in this commit I didn't
modify anything to do with the type, so I left it there.

> >  struct object {
> > +       /*
> > +        * Set if this object is parsed. If set, "type" is populated and this
> > +        * object can be casted to "struct commit" or an equivalent.
> > +        */
> >         unsigned parsed : 1;
> > +       /*
> > +        * Set if this object is not in the repo but is promised. If set,
> > +        * "type" is populated, but this object cannot be casted to "struct
> > +        * commit" or an equivalent.
> > +        */
> > +       unsigned promised : 1;
> 
> Would it make sense to have a bit field instead:
> 
> #define STATE_BITS 2
> #define STATE_PARSED (1<<0)
> #define STATE_PROMISED (1<<1)
> 
>     unsigned state : STATE_BITS
> 
> This would be similar to the types and flags?

Both type and flag have to be bit fields (for different reasons), but
since we don't need such a combined field for "parsed" and "promised", I
prefer separating them each into their own field.

> > +test_expect_success 'fsck fails on missing objects' '
> > +       test_create_repo repo &&
> > +
> > +       test_commit -C repo 1 &&
> > +       test_commit -C repo 2 &&
> > +       test_commit -C repo 3 &&
> > +       git -C repo tag -a annotated_tag -m "annotated tag" &&
> > +       C=$(git -C repo rev-parse 1) &&
> > +       T=$(git -C repo rev-parse 2^{tree}) &&
> > +       B=$(git hash-object repo/3.t) &&
> > +       AT=$(git -C repo rev-parse annotated_tag) &&
> > +
> > +       # missing commit, tree, blob, and tag
> > +       rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> > +       rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> > +       rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> > +       rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
> 
> This is a pretty cool test as it promises all sorts of objects
> from different parts of the graph.

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
  2017-07-20 18:07   ` Stefan Beller
@ 2017-07-20 19:58   ` Ben Peart
  2017-07-20 21:13     ` Jonathan Tan
  1 sibling, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-20 19:58 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: jrnieder, sbeller, git, philipoakley



On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
> 

Great to see this idea making progress. Making git able to gracefully 
handle partial clones (beyond the existing shallow clone support) is a 
key piece of dealing with very large objects and repos.

> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.

If I'm reading this patch correctly, for a repo to successfully pass 
"git fsck" either the object or a promise must exist for everything fsck 
checks.  From the documentation for fsck it says "git fsck defaults to 
using the index file, all SHA-1 references in refs namespace, and all 
reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
imply objects or promises must exist for all objects referenced by any 
of these locations?

We're currently in the hundreds of millions of objects on some of our 
repos so even downloading the promises for all the objects in the index 
is unreasonable as it is gigabytes of data and growing.

I think we should have a flag (off by default) that enables someone to 
say that promised objects are optional. If the flag is set, 
"is_promised_object" will return success and pass the OBJ_ANY type and a 
size of -1.

Nothing today is using the size and in the two places where the object 
type is being checked for consistency (fsck_cache_tree and 
fsck_handle_ref) the test can add a test for OBJ_ANY as well.

This will enable very large numbers of objects to be omitted from the 
clone without triggering a download of the corresponding number of 
promised objects.

> 
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.

In your work on this, did you investigate if there are other commands 
(ie repack/gc) that will need to learn about promised objects? Have you 
had a chance (or have plans) to hack up the test suite so that it runs 
all tests with promised objects and see what (if anything) breaks?

> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>   Documentation/technical/repository-version.txt |   6 ++
>   Makefile                                       |   1 +
>   builtin/fsck.c                                 |  18 +++-
>   cache.h                                        |   2 +
>   environment.c                                  |   1 +
>   fsck.c                                         |   6 +-
>   object.c                                       |  19 ++++
>   object.h                                       |  19 ++++
>   promised-object.c                              | 130 +++++++++++++++++++++++++
>   promised-object.h                              |  22 +++++
>   setup.c                                        |   7 +-
>   t/t3907-promised-object.sh                     |  41 ++++++++
>   t/test-lib-functions.sh                        |   6 ++
>   13 files changed, 273 insertions(+), 5 deletions(-)
>   create mode 100644 promised-object.c
>   create mode 100644 promised-object.h
>   create mode 100755 t/t3907-promised-object.sh
> 
> diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
>   When the config key `extensions.preciousObjects` is set to `true`,
>   objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
>   `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
>   LIB_OBJS += pretty.o
>   LIB_OBJS += prio-queue.o
>   LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
>   LIB_OBJS += prompt.o
>   LIB_OBJS += quote.o
>   LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
>   #include "progress.h"
>   #include "streaming.h"
>   #include "decorate.h"
> +#include "promised-object.h"
>   
>   #define REACHABLE 0x0001
>   #define SEEN      0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
>   #define ERROR_REACHABLE 02
>   #define ERROR_PACK 04
>   #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>   
>   static const char *describe_object(struct object *obj)
>   {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
>   {
>   	struct object *obj;
>   
> -	obj = parse_object(oid);
> +	obj = parse_or_promise_object(oid);
>   	if (!obj) {
>   		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
>   		errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
>   		fprintf(stderr, "Checking cache tree\n");
>   
>   	if (0 <= it->entry_count) {
> -		struct object *obj = parse_object(&it->oid);
> +		struct object *obj = parse_or_promise_object(&it->oid);
>   		if (!obj) {
>   			error("%s: invalid sha1 pointer in cache-tree",
>   			      oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
>   	return 0;
>   }
>   
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> +	mark_object_for_connectivity(oid);
> +	return 0;
> +}
> +
>   static char const * const fsck_usage[] = {
>   	N_("git fsck [<options>] [<object>...]"),
>   	NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   
>   	git_config(fsck_config, NULL);
>   
> +	if (fsck_promised_objects()) {
> +		error("Errors found in promised object list");
> +		errors_found |= ERROR_PROMISED_OBJECT;
> +	}
> +
>   	fsck_head_link();
>   	if (connectivity_only) {
>   		for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
> @@ -727,6 +740,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>   			stop_progress(&progress);
>   		}
>   	}
> +	for_each_promised_object(mark_have_promised_object, NULL);
>   
>   	heads = 0;
>   	for (i = 0; i < argc; i++) {
> diff --git a/cache.h b/cache.h
> index 71fe09264..dd94b5ffc 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -853,10 +853,12 @@ extern int grafts_replace_parents;
>   #define GIT_REPO_VERSION 0
>   #define GIT_REPO_VERSION_READ 1
>   extern int repository_format_precious_objects;
> +extern char *repository_format_promised_objects;
>   
>   struct repository_format {
>   	int version;
>   	int precious_objects;
> +	char *promised_objects;
>   	int is_bare;
>   	char *work_tree;
>   	struct string_list unknown_extensions;
> diff --git a/environment.c b/environment.c
> index 3fd4b1084..82658470d 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
>   int warn_on_object_refname_ambiguity = 1;
>   int ref_paranoia = -1;
>   int repository_format_precious_objects;
> +char *repository_format_promised_objects;
>   const char *git_commit_encoding;
>   const char *git_log_output_encoding;
>   const char *apply_default_whitespace;
> diff --git a/fsck.c b/fsck.c
> index b4204d772..f08ff0675 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -460,8 +460,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
>   	if (!obj)
>   		return -1;
>   
> -	if (obj->type == OBJ_NONE)
> -		parse_object(&obj->oid);
> +	if (!obj->parsed && !obj->promised)
> +		parse_or_promise_object(&obj->oid);
> +	if (obj->promised)
> +		return 0;
>   
>   	switch (obj->type) {
>   	case OBJ_BLOB:
> diff --git a/object.c b/object.c
> index 321d7e920..0aeb95084 100644
> --- a/object.c
> +++ b/object.c
> @@ -4,6 +4,7 @@
>   #include "tree.h"
>   #include "commit.h"
>   #include "tag.h"
> +#include "promised-object.h"
>   
>   static struct object **obj_hash;
>   static int nr_objs, obj_hash_size;
> @@ -141,6 +142,7 @@ void *create_object(const unsigned char *sha1, void *o)
>   	struct object *obj = o;
>   
>   	obj->parsed = 0;
> +	obj->promised = 0;
>   	obj->flags = 0;
>   	hashcpy(obj->oid.hash, sha1);
>   
> @@ -279,6 +281,23 @@ struct object *parse_object(const struct object_id *oid)
>   	return NULL;
>   }
>   
> +struct object *parse_or_promise_object(const struct object_id *oid)
> +{
> +	enum object_type type;
> +
> +	if (has_object_file(oid))
> +		return parse_object(oid);
> +
> +	if (is_promised_object(oid, &type, NULL)) {
> +		struct object *obj = lookup_unknown_object(oid->hash);
> +		obj->promised = 1;
> +		obj->type = type;
> +		return obj;
> +	}
> +
> +	return NULL;
> +}
> +
>   struct object_list *object_list_insert(struct object *item,
>   				       struct object_list **list_p)
>   {
> diff --git a/object.h b/object.h
> index 0a419ba8d..640c8bff1 100644
> --- a/object.h
> +++ b/object.h
> @@ -46,7 +46,17 @@ struct object_array {
>    * The object type is stored in 3 bits.
>    */
>   struct object {
> +	/*
> +	 * Set if this object is parsed. If set, "type" is populated and this
> +	 * object can be casted to "struct commit" or an equivalent.
> +	 */
>   	unsigned parsed : 1;
> +	/*
> +	 * Set if this object is not in the repo but is promised. If set,
> +	 * "type" is populated, but this object cannot be casted to "struct
> +	 * commit" or an equivalent.
> +	 */
> +	unsigned promised : 1;
>   	unsigned type : TYPE_BITS;
>   	unsigned flags : FLAG_BITS;
>   	struct object_id oid;
> @@ -104,6 +114,15 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name
>    */
>   struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
>   
> +/*
> + * Returns the object, having parsed it if possible. In the returned object,
> + * either "parsed" or "promised" will be set.
> + *
> + * Returns NULL if the object is missing and not promised, or if the object is
> + * corrupt.
> + */
> +struct object *parse_or_promise_object(const struct object_id *oid);
> +
>   /** Returns the object, with potentially excess memory allocated. **/
>   struct object *lookup_unknown_object(const unsigned  char *sha1);
>   
> diff --git a/promised-object.c b/promised-object.c
> new file mode 100644
> index 000000000..487ade437
> --- /dev/null
> +++ b/promised-object.c
> @@ -0,0 +1,130 @@
> +#include "cache.h"
> +#include "promised-object.h"
> +#include "sha1-lookup.h"
> +#include "strbuf.h"
> +
> +#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
> +/*
> + * A mmap-ed byte array of size (promised_object_nr * ENTRY_SIZE). Each
> + * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised object, its
> + * 8-bit object type, and its 64-bit size in network byte order. The entries
> + * are sorted in ascending SHA-1 order.
> + */
> +static char *promised_objects;
> +static int64_t promised_object_nr = -1;
> +
> +static void prepare_promised_objects(void)
> +{
> +	char *filename;
> +	int fd;
> +	struct stat st;
> +
> +	if (promised_object_nr >= 0)
> +		return;
> +
> +	if (!repository_format_promised_objects ||
> +	    getenv("GIT_IGNORE_PROMISED_OBJECTS")) {
> +		promised_object_nr = 0;
> +		return;
> +	}
> +
> +	filename = xstrfmt("%s/promised", get_object_directory());
> +	fd = git_open(filename);
> +	if (fd < 0) {
> +		if (errno == ENOENT) {
> +			promised_object_nr = 0;
> +			goto cleanup;
> +		}
> +		perror("prepare_promised_objects");
> +		die("Could not open %s", filename);
> +	}
> +	if (fstat(fd, &st)) {
> +		perror("prepare_promised_objects");
> +		die("Could not stat %s", filename);
> +	}
> +	if (st.st_size == 0) {
> +		promised_object_nr = 0;
> +		goto cleanup;
> +	}
> +	if (st.st_size % ENTRY_SIZE) {
> +		die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
> +	}
> +
> +	promised_objects = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	promised_object_nr = st.st_size / ENTRY_SIZE;
> +
> +cleanup:
> +	free(filename);
> +	if (fd >= 0)
> +		close(fd);
> +}
> +
> +int is_promised_object(const struct object_id *oid, enum object_type *type,
> +		       unsigned long *size)
> +{
> +	int result;
> +
> +	prepare_promised_objects();
> +	result = sha1_entry_pos(promised_objects, ENTRY_SIZE, 0, 0,
> +				promised_object_nr, promised_object_nr,
> +				oid->hash);
> +	if (result >= 0) {
> +		if (type) {
> +			char *ptr = promised_objects +
> +				    result * ENTRY_SIZE + GIT_SHA1_RAWSZ;
> +			*type = *ptr;
> +		}
> +		if (size) {
> +			uint64_t size_nbo;
> +			char *ptr = promised_objects +
> +				    result * ENTRY_SIZE + GIT_SHA1_RAWSZ + 1;
> +			memcpy(&size_nbo, ptr, sizeof(size_nbo));
> +			*size = ntohll(size_nbo);
> +		}
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +int for_each_promised_object(each_promised_object_fn cb, void *data)
> +{
> +	struct object_id oid;
> +	int i, r;
> +
> +	prepare_promised_objects();
> +	for (i = 0; i < promised_object_nr; i++) {
> +		memcpy(oid.hash, &promised_objects[i * ENTRY_SIZE],
> +		       GIT_SHA1_RAWSZ);
> +		r = cb(&oid, data);
> +		if (r)
> +			return r;
> +	}
> +	return 0;
> +}
> +
> +int fsck_promised_objects(void)
> +{
> +	int i;
> +	prepare_promised_objects();
> +	for (i = 0; i < promised_object_nr; i++) {
> +		enum object_type type;
> +		if (i != 0 && memcmp(&promised_objects[(i - 1) * ENTRY_SIZE],
> +				     &promised_objects[i * ENTRY_SIZE],
> +				     GIT_SHA1_RAWSZ) >= 0)
> +			return error("Error in list of promised objects: not "
> +				     "in ascending order of object name");
> +		type = promised_objects[i * ENTRY_SIZE + GIT_SHA1_RAWSZ];
> +		switch (type) {
> +			case OBJ_BLOB:
> +			case OBJ_TREE:
> +			case OBJ_COMMIT:
> +			case OBJ_TAG:
> +				break;
> +			default:
> +				return error("Error in list of promised "
> +					     "objects: found object of type %d",
> +					     type);
> +		}
> +	}
> +	return 0;
> +}
> diff --git a/promised-object.h b/promised-object.h
> new file mode 100644
> index 000000000..7eaedff17
> --- /dev/null
> +++ b/promised-object.h
> @@ -0,0 +1,22 @@
> +#ifndef PROMISED_OBJECT_H
> +#define PROMISED_OBJECT_H
> +
> +#include "cache.h"
> +
> +/*
> + * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
> + * reported as their size.
> + */
> +int is_promised_object(const struct object_id *oid, enum object_type *type,
> +		       unsigned long *size);
> +
> +typedef int each_promised_object_fn(const struct object_id *oid, void *data);
> +int for_each_promised_object(each_promised_object_fn, void *);
> +
> +/*
> + * Returns 0 if there is no list of promised objects or if the list of promised
> + * objects is valid.
> + */
> +int fsck_promised_objects(void);
> +
> +#endif
> diff --git a/setup.c b/setup.c
> index 860507e1f..36022452f 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
>   			;
>   		else if (!strcmp(ext, "preciousobjects"))
>   			data->precious_objects = git_config_bool(var, value);
> -		else
> +		else if (!strcmp(ext, "promisedobjects")) {
> +			if (!value)
> +				return config_error_nonbool(var);
> +			data->promised_objects = xstrdup(value);
> +		} else
>   			string_list_append(&data->unknown_extensions, ext);
>   	} else if (strcmp(var, "core.bare") == 0) {
>   		data->is_bare = git_config_bool(var, value);
> @@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
>   	}
>   
>   	repository_format_precious_objects = candidate.precious_objects;
> +	repository_format_promised_objects = candidate.promised_objects;
>   	string_list_clear(&candidate.unknown_extensions, 0);
>   	if (!has_common) {
>   		if (candidate.is_bare != -1) {
> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
> new file mode 100755
> index 000000000..3e0caf4f9
> --- /dev/null
> +++ b/t/t3907-promised-object.sh
> @@ -0,0 +1,41 @@
> +#!/bin/sh
> +
> +test_description='promised objects'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'fsck fails on missing objects' '
> +	test_create_repo repo &&
> +
> +	test_commit -C repo 1 &&
> +	test_commit -C repo 2 &&
> +	test_commit -C repo 3 &&
> +	git -C repo tag -a annotated_tag -m "annotated tag" &&
> +	C=$(git -C repo rev-parse 1) &&
> +	T=$(git -C repo rev-parse 2^{tree}) &&
> +	B=$(git hash-object repo/3.t) &&
> +	AT=$(git -C repo rev-parse annotated_tag) &&
> +
> +	# missing commit, tree, blob, and tag
> +	rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> +	rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> +	rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> +	rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
> +	test_must_fail git -C repo fsck
> +'
> +
> +test_expect_success '...but succeeds if they are promised objects' '
> +	printf "%s01%016x\n%s02%016x\n%s03%016x\n%s04%016x" \
> +		"$C" 0 "$T" 0 "$B" "$(wc -c <repo/3.t)" "$AT" 0 |
> +		sort | tr -d "\n" | hex_pack >repo/.git/objects/promised &&
> +	git -C repo config core.repositoryformatversion 1 &&
> +	git -C repo config extensions.promisedobjects "arbitrary string" &&
> +	git -C repo fsck
> +'
> +
> +test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
> +	GIT_IGNORE_PROMISED_OBJECTS=1 test_must_fail git -C repo fsck &&
> +	unset GIT_IGNORE_PROMISED_OBJECTS
> +'
> +
> +test_done
> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
> index db622c355..1ebdd2d04 100644
> --- a/t/test-lib-functions.sh
> +++ b/t/test-lib-functions.sh
> @@ -1018,3 +1018,9 @@ nongit () {
>   		"$@"
>   	)
>   }
> +
> +# Converts big-endian pairs of hexadecimal digits into bytes. For example,
> +# "printf 61620d0a | hex_pack" results in "ab\r\n".
> +hex_pack () {
> +	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
> +}
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
  2017-07-20 18:23   ` Stefan Beller
@ 2017-07-20 20:58     ` Ben Peart
  2017-07-20 21:18       ` Jonathan Tan
  0 siblings, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-20 20:58 UTC (permalink / raw)
  To: Stefan Beller, Jonathan Tan, Lars Schneider
  Cc: git, Jonathan Nieder, Jeff Hostetler, Philip Oakley



On 7/20/2017 2:23 PM, Stefan Beller wrote:
> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>> Teach sha1_file to invoke a hook whenever an object is requested and
>> unavailable but is promised. The hook is a shell command that can be
>> configured through "git config"; this hook takes in a list of hashes and
>> writes (if successful) the corresponding objects to the repo's local
>> storage.
>>
>> The usage of the hook can be suppressed through a flag when invoking
>> has_object_file_with_flags() and other similar functions.
>> parse_or_promise_object() in object.c requires this functionality, and
>> has been modified to use it.
>>
>> This is meant as a temporary measure to ensure that all Git commands
>> work in such a situation. Future patches will update some commands to
>> either tolerate promised objects (without invoking the hook) or be more
>> efficient in invoking the promised objects hook.

I agree that making git more tolerant of promised objects if possible 
and precomputing a list of promised objects required to complete a 
particular command and downloading them with a single request are good 
optimizations to add over time.

>>
>> In order to determine the code changes in sha1_file.c necessary, I
>> investigated the following:
>>   (1) functions in sha1_file that take in a hash, without the user
>>       regarding how the object is stored (loose or packed)
>>   (2) functions in sha1_file that operate on packed objects (because I
>>       need to check callers that know about the loose/packed distinction
>>       and operate on both differently, and ensure that they can handle
>>       the concept of objects that are neither loose nor packed)
>>
>> (1) is handled by the modification to sha1_object_info_extended().
>>
>> For (2), I looked at for_each_packed_object and at the packed-related
>> functions that take in a hash. For for_each_packed_object, the callers
>> either already work or are fixed in this patch:
>>   - reachable - only to find recent objects
>>   - builtin/fsck - already knows about promised objects
>>   - builtin/cat-file - fixed in this commit
>>
>> Callers of the other functions do not need to be changed:
>>   - parse_pack_index
>>     - http - indirectly from http_get_info_packs
>>   - find_pack_entry_one
>>     - this searches a single pack that is provided as an argument; the
>>       caller already knows (through other means) that the sought object
>>       is in a specific pack
>>   - find_sha1_pack
>>     - fast-import - appears to be an optimization to not store a
>>       file if it is already in a pack
>>     - http-walker - to search through a struct alt_base
>>     - http-push - to search through remote packs
>>   - has_sha1_pack
>>     - builtin/fsck - already knows about promised objects
>>     - builtin/count-objects - informational purposes only (check if loose
>>       object is also packed)
>>     - builtin/prune-packed - check if object to be pruned is packed (if
>>       not, don't prune it)
>>     - revision - used to exclude packed objects if requested by user
>>     - diff - just for optimization
>>

has_sha1_file also takes a hash "whether local or in an alternate object 
database, and whether packed or loose" but never calls 
sha1_object_info_extended.  As a result, we had to add support in 
check_and_freshen to download missing objects to get proper behavior in 
all cases.  I don't think this will work correctly without it.

>> An alternative design that I considered but rejected:
>>
>>   - Adding a hook whenever a packed object is requested, not on any
>>     object.  That is, whenever we attempt to search the packfiles for an
>>     object, if it is missing (from the packfiles and from the loose
>>     object storage), to invoke the hook (which must then store it as a
>>     packfile), open the packfile the hook generated, and report that the
>>     object is found in that new packfile. This reduces the amount of
>>     analysis needed (in that we only need to look at how packed objects
>>     are handled), but requires that the hook generate packfiles (or for
>>     sha1_file to pack whatever loose objects are generated), creating one
>>     packfile for each missing object and potentially very many packfiles
>>     that must be linearly searched. This may be tolerable now for repos
>>     that only have a few missing objects (for example, repos that only
>>     want to exclude large blobs), and might be tolerable in the future if
>>     we have batching support for the most commonly used commands, but is
>>     not tolerable now for repos that exclude a large amount of objects.
>>
>> Helped-by: Ben Peart <benpeart@microsoft.com>
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>>   Documentation/config.txt                         |   8 +
>>   Documentation/gitrepository-layout.txt           |   8 +
>>   Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
>>   builtin/cat-file.c                               |   9 ++
>>   cache.h                                          |   2 +
>>   object.c                                         |   3 +-
>>   promised-object.c                                | 194 +++++++++++++++++++++++
>>   promised-object.h                                |  12 ++
>>   sha1_file.c                                      |  44 +++--
>>   t/t3907-promised-object.sh                       |  32 ++++
>>   t/t3907/read-object                              | 114 +++++++++++++
>>   11 files changed, 513 insertions(+), 15 deletions(-)
>>   create mode 100644 Documentation/technical/read-object-protocol.txt
>>   create mode 100755 t/t3907/read-object
>>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index d5c9c4cab..c293ac921 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
>>   will probe and set core.ignoreCase true if appropriate when the repository
>>   is created.
>>
>> +core.promisedBlobCommand::
>> +       If set, whenever a blob in the local repo is attempted to be read, but
>> +       is both missing and a promised blob, invoke this shell command to
>> +       generate or obtain that blob before reporting an error. This shell
>> +       command should take one or more hashes, each terminated by a newline,
>> +       as standard input, and (if successful) should write the corresponding
>> +       objects to the local repo (packed or loose).
>> +
>>   core.precomposeUnicode::
>>          This option is only used by Mac OS implementation of Git.
>>          When core.precomposeUnicode=true, Git reverts the unicode decomposition
>> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
>> index f51ed4e37..7dea7fe6b 100644
>> --- a/Documentation/gitrepository-layout.txt
>> +++ b/Documentation/gitrepository-layout.txt
>> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
>>   `objects/info/alternates` points at the object stores it
>>   borrows from.
>>   +
>> +. You could have objects that are merely promised by another source.
>> +When Git requires those objects, it will invoke the command in the
>> +`extensions.promisedObjects` configuration variable.
>> ++
>>   This directory is ignored if $GIT_COMMON_DIR is set and
>>   "$GIT_COMMON_DIR/objects" will be used instead.
>>
>> @@ -91,6 +95,10 @@ objects/info/http-alternates::
>>          this object store borrows objects from, to be used when
>>          the repository is fetched over HTTP.
>>
>> +objects/promised::
>> +       This file records the sha1 object names, types, and sizes of
>> +       promised objects.
>> +
>>   refs::
>>          References are stored in subdirectories of this
>>          directory.  The 'git prune' command knows to preserve
>> diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
>> new file mode 100644
>> index 000000000..a893b46e7
>> --- /dev/null
>> +++ b/Documentation/technical/read-object-protocol.txt
>> @@ -0,0 +1,102 @@
>> +Read Object Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This protocol reads very similar to the protocol that is used in
> clean/smudge filtering, designed by Lars (cc'd)
> 
>> +
>> +The read-object process enables Git to read all missing blobs with a
>> +single process invocation for the entire life of a single Git command.
>> +This is achieved by using a packet format (pkt-line, see technical/
>> +protocol-common.txt) based protocol over standard input and standard
>> +output as follows. All packets, except for the "*CONTENT" packets and
>> +the "0000" flush packet, are considered text and therefore are
>> +terminated by a LF.
>> +
>> +Git starts the process when it encounters the first missing object that
>> +needs to be retrieved. After the process is started, Git sends a welcome
>> +message ("git-read-object-client"), a list of supported protocol version
>> +numbers, and a flush packet. Git expects to read a welcome response
>> +message ("git-read-object-server"), exactly one protocol version number
>> +from the previously sent list, and a flush packet. All further
>> +communication will be based on the selected version.
>> +
>> +The remaining protocol description below documents "version=1". Please
>> +note that "version=42" in the example below does not exist and is only
>> +there to illustrate how the protocol would look with more than one
>> +version.
>> +
>> +After the version negotiation Git sends a list of all capabilities that
>> +it supports and a flush packet. Git expects to read a list of desired
>> +capabilities, which must be a subset of the supported capabilities list,
>> +and a flush packet as response:
>> +------------------------
>> +packet: git> git-read-object-client
>> +packet: git> version=1
>> +packet: git> version=42
>> +packet: git> 0000
>> +packet: git< git-read-object-server
>> +packet: git< version=1
>> +packet: git< 0000
>> +packet: git> capability=get
>> +packet: git> capability=have
>> +packet: git> capability=put
>> +packet: git> capability=not-yet-invented
>> +packet: git> 0000
>> +packet: git< capability=get
>> +packet: git< 0000
>> +------------------------
>> +The only supported capability in version 1 is "get".
>> +
>> +Afterwards Git sends a list of "key=value" pairs terminated with a flush
>> +packet. The list will contain at least the command (based on the
>> +supported capabilities) and the sha1 of the object to retrieve. Please
>> +note, that the process must not send any response before it received the
>> +final flush packet.
>> +
>> +When the process receives the "get" command, it should make the requested
>> +object available in the git object store and then return success. Git will
>> +then check the object store again and this time find it and proceed.
>> +------------------------
>> +packet: git> command=get
>> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
>> +packet: git> 0000
>> +------------------------
>> +
>> +The process is expected to respond with a list of "key=value" pairs
>> +terminated with a flush packet. If the process does not experience
>> +problems then the list must contain a "success" status.
>> +------------------------
>> +packet: git< status=success
>> +packet: git< 0000
>> +------------------------
>> +
>> +In case the process cannot or does not want to process the content, it
>> +is expected to respond with an "error" status.
>> +------------------------
>> +packet: git< status=error
>> +packet: git< 0000
>> +------------------------
>> +
>> +In case the process cannot or does not want to process the content as
>> +well as any future content for the lifetime of the Git process, then it
>> +is expected to respond with an "abort" status at any point in the
>> +protocol.
>> +------------------------
>> +packet: git< status=abort
>> +packet: git< 0000
>> +------------------------
>> +
>> +Git neither stops nor restarts the process in case the "error"/"abort"
>> +status is set.
>> +
>> +If the process dies during the communication or does not adhere to the
>> +protocol then Git will stop the process and restart it with the next
>> +object that needs to be processed.
>> +
>> +After the read-object process has processed an object it is expected to
>> +wait for the next "key=value" list containing a command. Git will close
>> +the command pipe on exit. The process is expected to detect EOF and exit
>> +gracefully on its own. Git will wait until the process has stopped.
>> +
>> +A long running read-object process demo implementation can be found in
>> +`contrib/long-running-read-object/example.pl` located in the Git core
>> +repository. If you develop your own long running process then the
>> +`GIT_TRACE_PACKET` environment variables can be very helpful for
>> +debugging (see linkgit:git[1]).
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 96b786e48..33f636926 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -12,6 +12,7 @@
>>   #include "streaming.h"
>>   #include "tree-walk.h"
>>   #include "sha1-array.h"
>> +#include "promised-object.h"
>>
>>   struct batch_options {
>>          int enabled;
>> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
>>          return 0;
>>   }
>>
>> +static int batch_promised_object(const struct object_id *oid,
>> +                                void *data)
>> +{
>> +       oid_array_append(data, oid);
>> +       return 0;
>> +}
>> +
>>   static int batch_objects(struct batch_options *opt)
>>   {
>>          struct strbuf buf = STRBUF_INIT;
>> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>>
>>                  for_each_loose_object(batch_loose_object, &sa, 0);
>>                  for_each_packed_object(batch_packed_object, &sa, 0);
>> +               for_each_promised_object(batch_promised_object, &sa);
>>
>>                  cb.opt = opt;
>>                  cb.expand = &data;
>> diff --git a/cache.h b/cache.h
>> index dd94b5ffc..75b71f38b 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1835,6 +1835,8 @@ struct object_info {
>>   #define OBJECT_INFO_SKIP_CACHED 4
>>   /* Do not retry packed storage after checking packed and loose storage */
>>   #define OBJECT_INFO_QUICK 8
>> +/* Ignore list of promised objects */
>> +#define OBJECT_INFO_IGNORE_PROMISES 16
>>   extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
>>   extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
>>
>> diff --git a/object.c b/object.c
>> index 0aeb95084..23f2a6cbc 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
>>   {
>>          enum object_type type;
>>
>> -       if (has_object_file(oid))
>> +       if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
>> +                                           OBJECT_INFO_IGNORE_PROMISES))
>>                  return parse_object(oid);
>>
>>          if (is_promised_object(oid, &type, NULL)) {
>> diff --git a/promised-object.c b/promised-object.c
>> index 487ade437..d8d95ebb2 100644
>> --- a/promised-object.c
>> +++ b/promised-object.c
>> @@ -2,6 +2,12 @@
>>   #include "promised-object.h"
>>   #include "sha1-lookup.h"
>>   #include "strbuf.h"
>> +#include "run-command.h"
>> +#include "sha1-array.h"
>> +#include "config.h"
>> +#include "sigchain.h"
>> +#include "sub-process.h"
>> +#include "pkt-line.h"
>>
>>   #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
>>   /*
>> @@ -128,3 +134,191 @@ int fsck_promised_objects(void)
>>          }
>>          return 0;
>>   }
>> +
>> +#define CAP_GET    (1u<<0)
>> +
>> +static int subprocess_map_initialized;
>> +static struct hashmap subprocess_map;
>> +
>> +struct read_object_process {
>> +       struct subprocess_entry subprocess;
>> +       unsigned int supported_capabilities;
>> +};
>> +
>> +int start_read_object_fn(struct subprocess_entry *subprocess)
>> +{
>> +       int err;
>> +       struct read_object_process *entry = (struct read_object_process *)subprocess;
>> +       struct child_process *process;
>> +       struct string_list cap_list = STRING_LIST_INIT_NODUP;
>> +       char *cap_buf;
>> +       const char *cap_name;
>> +
>> +       process = subprocess_get_child_process(&entry->subprocess);
>> +
>> +       sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> +       err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
>> +       if (err)
>> +               goto done;
>> +
>> +       err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
>> +       if (err) {
>> +               error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
>> +               goto done;
>> +       }
>> +       err = strcmp(packet_read_line(process->out, NULL), "version=1");
>> +       if (err)
>> +               goto done;
>> +       err = packet_read_line(process->out, NULL) != NULL;
>> +       if (err)
>> +               goto done;
>> +
>> +       err = packet_writel(process->in, "capability=get", NULL);
>> +       if (err)
>> +               goto done;
>> +
>> +       for (;;) {
>> +               cap_buf = packet_read_line(process->out, NULL);
>> +               if (!cap_buf)
>> +                       break;
>> +               string_list_split_in_place(&cap_list, cap_buf, '=', 1);
>> +
>> +               if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
>> +                       continue;
>> +
>> +               cap_name = cap_list.items[1].string;
>> +               if (!strcmp(cap_name, "get")) {
>> +                       entry->supported_capabilities |= CAP_GET;
>> +               }
>> +               else {
>> +                       warning(
>> +                               "external process '%s' requested unsupported read-object capability '%s'",
>> +                               subprocess->cmd, cap_name
>> +                       );
>> +               }
>> +
>> +               string_list_clear(&cap_list, 0);
>> +       }
>> +
>> +done:
>> +       sigchain_pop(SIGPIPE);
>> +
>> +       if (err || errno == EPIPE)
>> +               return err ? err : errno;
>> +
>> +       return 0;
>> +}
>> +
>> +static int read_object_process(const unsigned char *sha1)
>> +{
>> +       int err;
>> +       struct read_object_process *entry;
>> +       struct child_process *process;
>> +       struct strbuf status = STRBUF_INIT;
>> +       uint64_t start;
>> +
>> +       start = getnanotime();
>> +
>> +       if (!repository_format_promised_objects)
>> +               die("BUG: if extensions.promisedObjects is not set, there "
>> +                   "should not be any promised objects");
>> +
>> +       if (!subprocess_map_initialized) {
>> +               subprocess_map_initialized = 1;
>> +               hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
>> +               entry = NULL;
>> +       } else {
>> +               entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
>> +       }
>> +       if (!entry) {
>> +               entry = xmalloc(sizeof(*entry));
>> +               entry->supported_capabilities = 0;
>> +
>> +               if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
>> +                       free(entry);
>> +                       return -1;
>> +               }
>> +       }
>> +       process = subprocess_get_child_process(&entry->subprocess);
>> +
>> +       if (!(CAP_GET & entry->supported_capabilities))
>> +               return -1;
>> +
>> +       sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> +       err = packet_write_fmt_gently(process->in, "command=get\n");
>> +       if (err)
>> +               goto done;
>> +
>> +       err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
>> +       if (err)
>> +               goto done;
>> +
>> +       err = packet_flush_gently(process->in);
>> +       if (err)
>> +               goto done;
>> +
>> +       err = subprocess_read_status(process->out, &status);
>> +       err = err ? err : strcmp(status.buf, "success");
>> +
>> +done:
>> +       sigchain_pop(SIGPIPE);
>> +
>> +       if (err || errno == EPIPE) {
>> +               err = err ? err : errno;
>> +               if (!strcmp(status.buf, "error")) {
>> +                       /* The process signaled a problem with the file. */
>> +               }
>> +               else if (!strcmp(status.buf, "abort")) {
>> +                       /*
>> +                        * The process signaled a permanent problem. Don't try to read
>> +                        * objects with the same command for the lifetime of the current
>> +                        * Git process.
>> +                        */
>> +                       entry->supported_capabilities &= ~CAP_GET;
>> +               }
>> +               else {
>> +                       /*
>> +                        * Something went wrong with the read-object process.
>> +                        * Force shutdown and restart if needed.
>> +                        */
>> +                       error("external process '%s' failed", repository_format_promised_objects);
>> +                       subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
>> +                       free(entry);
>> +               }
>> +       }
>> +
>> +       trace_performance_since(start, "read_object_process");
>> +
>> +       return err;
>> +}
>> +
>> +int request_promised_objects(const struct oid_array *oids)
>> +{
>> +       int oids_requested = 0;
>> +       int i;
>> +
>> +       for (i = 0; i < oids->nr; i++) {
>> +               if (is_promised_object(&oids->oid[i], NULL, NULL))
>> +                       break;
>> +       }
>> +
>> +       if (i == oids->nr)
>> +               /* Nothing to fetch */
>> +               return 0;
>> +
>> +       for (; i < oids->nr; i++) {
>> +               if (is_promised_object(&oids->oid[i], NULL, NULL)) {
>> +                       read_object_process(oids->oid[i].hash);
>> +                       oids_requested++;
>> +               }
>> +       }
>> +
>> +       /*
>> +        * The command above may have updated packfiles, so update our record
>> +        * of them.
>> +        */
>> +       reprepare_packed_git();
>> +       return oids_requested;
>> +}
>> diff --git a/promised-object.h b/promised-object.h
>> index 7eaedff17..8ad47aa4c 100644
>> --- a/promised-object.h
>> +++ b/promised-object.h
>> @@ -2,6 +2,7 @@
>>   #define PROMISED_OBJECT_H
>>
>>   #include "cache.h"
>> +#include "sha1-array.h"
>>
>>   /*
>>    * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
>> @@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
>>    */
>>   int fsck_promised_objects(void);
>>
>> +/*
>> + * If any of the given objects are promised objects, invokes
>> + * core.promisedobjectcommand with those objects and returns the number of
>> + * objects requested. No check is made as to whether the invocation actually
>> + * populated the repository with the promised objects.
>> + *
>> + * If none of the given objects are promised objects, this function does not
>> + * invoke anything and returns 0.
>> + */
>> +int request_promised_objects(const struct oid_array *oids);
>> +
>>   #endif
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 5862386cd..ded0ef46b 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -28,6 +28,11 @@
>>   #include "list.h"
>>   #include "mergesort.h"
>>   #include "quote.h"
>> +#include "iterator.h"
>> +#include "dir-iterator.h"
>> +#include "sha1-lookup.h"
>> +#include "promised-object.h"
>> +#include "sha1-array.h"
>>
>>   #define SZ_FMT PRIuMAX
>>   static inline uintmax_t sz_fmt(size_t s) { return s; }
>> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>>          const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>>                                      lookup_replace_object(sha1) :
>>                                      sha1;
>> +       int already_retried = 0;
>>
>>          if (!oi)
>>                  oi = &blank_oi;
>> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>>                  }
>>          }
>>
>> -       if (!find_pack_entry(real, &e)) {
>> -               /* Most likely it's a loose object. */
>> -               if (!sha1_loose_object_info(real, oi, flags)) {
>> -                       oi->whence = OI_LOOSE;
>> -                       return 0;
>> -               }
>> +retry:
>> +       if (find_pack_entry(real, &e))
>> +               goto found_packed;
>>
>> -               /* Not a loose object; someone else may have just packed it. */
>> -               if (flags & OBJECT_INFO_QUICK) {
>> -                       return -1;
>> -               } else {
>> -                       reprepare_packed_git();
>> -                       if (!find_pack_entry(real, &e))
>> -                               return -1;
>> +       /* Most likely it's a loose object. */
>> +       if (!sha1_loose_object_info(real, oi, flags)) {
>> +               oi->whence = OI_LOOSE;
>> +               return 0;
>> +       }
>> +
>> +       /* Not a loose object; someone else may have just packed it. */
>> +       reprepare_packed_git();
>> +       if (find_pack_entry(real, &e))
>> +               goto found_packed;
>> +
>> +       /* Check if it is a promised blob */
>> +       if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
>> +               struct oid_array promised = OID_ARRAY_INIT;
>> +               oid_array_append_sha1(&promised, real);
>> +               if (request_promised_objects(&promised)) {
>> +                       already_retried = 1;
>> +                       goto retry;
>>                  }
>>          }
>>
>> +       return -1;
>> +
>> +found_packed:
>>          if (oi == &blank_oi)
>>                  /*
>>                   * We know that the caller doesn't actually need the
>>                   * information below, so return early.
>>                   */
>>                  return 0;
>> -
>>          rtype = packed_object_info(e.p, e.offset, oi);
>>          if (rtype < 0) {
>>                  mark_bad_packed_object(e.p, real);
>> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
>> index 3e0caf4f9..d9e6a6486 100755
>> --- a/t/t3907-promised-object.sh
>> +++ b/t/t3907-promised-object.sh
>> @@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
>>          unset GIT_IGNORE_PROMISED_OBJECTS
>>   '
>>
>> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
>> +       test_create_repo server &&
>> +       test_commit -C server 1 1.t abcdefgh &&
>> +       HASH=$(git hash-object server/1.t) &&
>> +
>> +       test_create_repo client &&
>> +       test_must_fail git -C client cat-file -p "$HASH"
>> +'
>> +
>> +test_expect_success '...succeeds if it is a promised object' '
>> +       printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
>> +               hex_pack >client/.git/objects/promised &&
>> +       git -C client config core.repositoryformatversion 1 &&
>> +       git -C client config extensions.promisedobjects \
>> +               "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
>> +       git -C client cat-file -p "$HASH"
>> +'
>> +
>> +test_expect_success 'cat-file --batch-all-objects with promised objects' '
>> +       rm -rf client &&
>> +       test_create_repo client &&
>> +       git -C client config core.repositoryformatversion 1 &&
>> +       git -C client config extensions.promisedobjects \
>> +               "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
>> +       printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
>> +               hex_pack >client/.git/objects/promised &&
>> +
>> +       # Verify that the promised object is printed
>> +       git -C client cat-file --batch --batch-all-objects | tee out |
>> +               grep abcdefgh
>> +'
>> +
>>   test_done
>> diff --git a/t/t3907/read-object b/t/t3907/read-object
>> new file mode 100755
>> index 000000000..9666ad597
>> --- /dev/null
>> +++ b/t/t3907/read-object
>> @@ -0,0 +1,114 @@
>> +#!/usr/bin/perl
>> +#
>> +# Example implementation for the Git read-object protocol version 1
>> +# See Documentation/technical/read-object-protocol.txt
>> +#
>> +# Allows you to test the ability for blobs to be pulled from a host git repo
>> +# "on demand."  Called when git needs a blob it couldn't find locally due to
>> +# a lazy clone that only cloned the commits and trees.
>> +#
>> +# A lazy clone can be simulated via the following commands from the host repo
>> +# you wish to create a lazy clone of:
>> +#
>> +# cd /host_repo
>> +# git rev-parse HEAD
>> +# git init /guest_repo
>> +# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
>> +#      cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
>> +# cd /guest_repo
>> +# git config core.virtualizeobjects true
>> +# git reset --hard <sha from rev-parse call above>
>> +#
>> +# Please note, this sample is a minimal skeleton. No proper error handling
>> +# was implemented.
>> +#
>> +
>> +use strict;
>> +use warnings;
>> +
>> +#
>> +# Point $DIR to the folder where your host git repo is located so we can pull
>> +# missing objects from it
>> +#
>> +my $DIR = $ARGV[0];
>> +
>> +sub packet_bin_read {
>> +       my $buffer;
>> +       my $bytes_read = read STDIN, $buffer, 4;
>> +       if ( $bytes_read == 0 ) {
>> +
>> +               # EOF - Git stopped talking to us!
>> +               exit();
>> +       }
>> +       elsif ( $bytes_read != 4 ) {
>> +               die "invalid packet: '$buffer'";
>> +       }
>> +       my $pkt_size = hex($buffer);
>> +       if ( $pkt_size == 0 ) {
>> +               return ( 1, "" );
>> +       }
>> +       elsif ( $pkt_size > 4 ) {
>> +               my $content_size = $pkt_size - 4;
>> +               $bytes_read = read STDIN, $buffer, $content_size;
>> +               if ( $bytes_read != $content_size ) {
>> +                       die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
>> +               }
>> +               return ( 0, $buffer );
>> +       }
>> +       else {
>> +               die "invalid packet size: $pkt_size";
>> +       }
>> +}
>> +
>> +sub packet_txt_read {
>> +       my ( $res, $buf ) = packet_bin_read();
>> +       unless ( $buf =~ s/\n$// ) {
>> +               die "A non-binary line MUST be terminated by an LF.";
>> +       }
>> +       return ( $res, $buf );
>> +}
>> +
>> +sub packet_bin_write {
>> +       my $buf = shift;
>> +       print STDOUT sprintf( "%04x", length($buf) + 4 );
>> +       print STDOUT $buf;
>> +       STDOUT->flush();
>> +}
>> +
>> +sub packet_txt_write {
>> +       packet_bin_write( $_[0] . "\n" );
>> +}
>> +
>> +sub packet_flush {
>> +       print STDOUT sprintf( "%04x", 0 );
>> +       STDOUT->flush();
>> +}
>> +
>> +( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
>> +( packet_txt_read() eq ( 0, "version=1" ) )                             || die "bad version";
>> +( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
>> +
>> +packet_txt_write("git-read-object-server");
>> +packet_txt_write("version=1");
>> +packet_flush();
>> +
>> +( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
>> +( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
>> +
>> +packet_txt_write("capability=get");
>> +packet_flush();
>> +
>> +while (1) {
>> +       my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
>> +
>> +       if ( $command eq "get" ) {
>> +               my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
>> +               packet_bin_read();
>> +
>> +               system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
>> +               packet_txt_write(($?) ? "status=error" : "status=success");
>> +               packet_flush();
>> +       } else {
>> +               die "bad command '$command'";
>> +       }
>> +}
>> --
>> 2.14.0.rc0.284.gd933b75aa4-goog
>>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-20 19:58   ` Ben Peart
@ 2017-07-20 21:13     ` Jonathan Tan
  2017-07-21 16:24       ` Ben Peart
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20 21:13 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, jrnieder, sbeller, git, philipoakley

On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart <peartben@gmail.com> wrote:

> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> > Currently, Git does not support repos with very large numbers of objects
> > or repos that wish to minimize manipulation of certain blobs (for
> > example, because they are very large) very well, even if the user
> > operates mostly on part of the repo, because Git is designed on the
> > assumption that every referenced object is available somewhere in the
> > repo storage.
> > 
> 
> Great to see this idea making progress. Making git able to gracefully 
> handle partial clones (beyond the existing shallow clone support) is a 
> key piece of dealing with very large objects and repos.

Thanks.

> > As a first step to reducing this problem, introduce the concept of
> > promised objects. Each Git repo can contain a list of promised objects
> > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> > contains functions to query them; functions for creating and modifying
> > that file will be introduced in later patches.
> 
> If I'm reading this patch correctly, for a repo to successfully pass 
> "git fsck" either the object or a promise must exist for everything fsck 
> checks.  From the documentation for fsck it says "git fsck defaults to 
> using the index file, all SHA-1 references in refs namespace, and all 
> reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
> imply objects or promises must exist for all objects referenced by any 
> of these locations?
> 
> We're currently in the hundreds of millions of objects on some of our 
> repos so even downloading the promises for all the objects in the index 
> is unreasonable as it is gigabytes of data and growing.

For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
of what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?

Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.

> I think we should have a flag (off by default) that enables someone to 
> say that promised objects are optional. If the flag is set, 
> "is_promised_object" will return success and pass the OBJ_ANY type and a 
> size of -1.
> 
> Nothing today is using the size and in the two places where the object 
> type is being checked for consistency (fsck_cache_tree and 
> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> 
> This will enable very large numbers of objects to be omitted from the 
> clone without triggering a download of the corresponding number of 
> promised objects.

Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.

> > A repository that is missing an object but has that object promised is not
> > considered to be in error, so also teach fsck this. As part of doing
> > this, object.{h,c} has been modified to generate "struct object" based
> > on only the information available to promised objects, without requiring
> > the object itself.
> 
> In your work on this, did you investigate if there are other commands 
> (ie repack/gc) that will need to learn about promised objects? Have you 
> had a chance (or have plans) to hack up the test suite so that it runs 
> all tests with promised objects and see what (if anything) breaks?

In one of the subsequent patches, I tried to ensure that all
object-reading functions in sha1_file.c somewhat works (albeit slowly)
in the presence of promised objects - that would cover the functionality
of the other commands. As for hacking up the test suite to run with
promised objects, that would be ideal, but I haven't figured out how to
do that yet.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
  2017-07-20 20:58     ` Ben Peart
@ 2017-07-20 21:18       ` Jonathan Tan
  2017-07-21 16:27         ` Ben Peart
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-20 21:18 UTC (permalink / raw)
  To: Ben Peart
  Cc: Stefan Beller, Lars Schneider, git, Jonathan Nieder,
	Jeff Hostetler, Philip Oakley

On Thu, 20 Jul 2017 16:58:16 -0400
Ben Peart <peartben@gmail.com> wrote:

> >> This is meant as a temporary measure to ensure that all Git commands
> >> work in such a situation. Future patches will update some commands to
> >> either tolerate promised objects (without invoking the hook) or be more
> >> efficient in invoking the promised objects hook.
> 
> I agree that making git more tolerant of promised objects if possible 
> and precomputing a list of promised objects required to complete a 
> particular command and downloading them with a single request are good 
> optimizations to add over time.

That's good to know!

> has_sha1_file also takes a hash "whether local or in an alternate object 
> database, and whether packed or loose" but never calls 
> sha1_object_info_extended.  As a result, we had to add support in 
> check_and_freshen to download missing objects to get proper behavior in 
> all cases.  I don't think this will work correctly without it.

Thanks for the attention to detail. Is this before or after commit
e83e71c ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26)?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
  2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
  2017-07-20  0:36   ` Stefan Beller
@ 2017-07-20 21:20   ` Junio C Hamano
  1 sibling, 0 replies; 46+ messages in thread
From: Junio C Hamano @ 2017-07-20 21:20 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, peartben, philipoakley

Jonathan Tan <jonathantanmy@google.com> writes:

> The "used" field in struct object is only used by builtin/fsck. Remove
> that field and modify builtin/fsck to use a flag instead.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/fsck.c | 24 ++++++++++++++----------
>  object.c       |  1 -
>  object.h       |  2 +-
>  3 files changed, 15 insertions(+), 12 deletions(-)

I vaguely recall trying to do this myself a few years ago.  We can
easily spot places within this file that does things like this:

> -	obj->flags = HAS_OBJ;

and correctly update it to this:

> +	obj->flags &= ~(REACHABLE | SEEN);
> +	obj->flags |= HAS_OBJ;

but I didn't do so because I was hesitant having to validate, and
having to maintain the invariant forever, that anything called from
these codepaths is always careful not to clobber the USED bit.

Looks like a reasonable preparatory clean-up that probably should be
doable and should be done way before the main part of the series to
me.  Will queue, together with your other fsck change.

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-20 21:13     ` Jonathan Tan
@ 2017-07-21 16:24       ` Ben Peart
  2017-07-21 20:33         ` Jonathan Tan
  0 siblings, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-21 16:24 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, philipoakley



On 7/20/2017 5:13 PM, Jonathan Tan wrote:
> On Thu, 20 Jul 2017 15:58:51 -0400
> Ben Peart <peartben@gmail.com> wrote:
> 
>> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
>>> Currently, Git does not support repos with very large numbers of objects
>>> or repos that wish to minimize manipulation of certain blobs (for
>>> example, because they are very large) very well, even if the user
>>> operates mostly on part of the repo, because Git is designed on the
>>> assumption that every referenced object is available somewhere in the
>>> repo storage.
>>>
>>
>> Great to see this idea making progress. Making git able to gracefully
>> handle partial clones (beyond the existing shallow clone support) is a
>> key piece of dealing with very large objects and repos.
> 
> Thanks.
> 
>>> As a first step to reducing this problem, introduce the concept of
>>> promised objects. Each Git repo can contain a list of promised objects
>>> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
>>> contains functions to query them; functions for creating and modifying
>>> that file will be introduced in later patches.
>>
>> If I'm reading this patch correctly, for a repo to successfully pass
>> "git fsck" either the object or a promise must exist for everything fsck
>> checks.  From the documentation for fsck it says "git fsck defaults to
>> using the index file, all SHA-1 references in refs namespace, and all
>> reflogs (unless --no-reflogs is given) as heads." Doesn't this then
>> imply objects or promises must exist for all objects referenced by any
>> of these locations?
>>
>> We're currently in the hundreds of millions of objects on some of our
>> repos so even downloading the promises for all the objects in the index
>> is unreasonable as it is gigabytes of data and growing.
> 
> For the index to contain all the files, the repo must already have
> downloaded all the trees for HEAD (at least). The trees collectively
> contain entries for all the relevant blobs. We need one promise for each
> blob, and the size of a promise is comparable to the size of a tree
> entry, so the size (of download and storage) needed would be just double
> of what we would need if we didn't need promises. This is still only
> linear growth, unless you have found that the absolute numbers are too
> large?
> 

Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
promises. Given the average developer only hydrates 56K files (2 MB 
promises) that is 103 MB to download that no one will ever need. We 
would like to avoid that if possible as this would be a significant 
regression in clone times from where we are today.

I'm also concerned about the performance of merging in promises given we 
have 100M objects today and growing so the number of promises over time 
could get pretty large.

> Also, if the index is ever changed to not have one entry for every file,
> we also wouldn't need one promise for every file.
> 
>> I think we should have a flag (off by default) that enables someone to
>> say that promised objects are optional. If the flag is set,
>> "is_promised_object" will return success and pass the OBJ_ANY type and a
>> size of -1.
>>
>> Nothing today is using the size and in the two places where the object
>> type is being checked for consistency (fsck_cache_tree and
>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>
>> This will enable very large numbers of objects to be omitted from the
>> clone without triggering a download of the corresponding number of
>> promised objects.
> 
> Eventually I plan to use the size when implementing parameters for
> history-searching commands (e.g. "git log -S"), but it's true that
> that's in the future.
> 
> Allowing promised objects to be optional would indeed solve the issue of
> downloading too many promises. It would make the code more complicated,
> but I'm not sure by how much.
> 
> For example, in this fsck patch, the easiest way I could think of to
> have promised objects was to introduce a 3rd state, called "promised",
> of "struct object" - one in which the type is known, but we don't have
> access to the full "struct commit" or equivalent. And thus fsck could
> assume that if the "struct object" is "parsed" or "promised", the type
> is known. Having optional promised objects would require that we let
> this "promised" state have a type of OBJ_UNKNOWN (or something like
> that) - maybe that would be fine, but I haven't looked into this in
> detail.
> 

Caveats apply as I only did a quick look but I only found the two 
locations that were checking the object type for consistency.

>>> A repository that is missing an object but has that object promised is not
>>> considered to be in error, so also teach fsck this. As part of doing
>>> this, object.{h,c} has been modified to generate "struct object" based
>>> on only the information available to promised objects, without requiring
>>> the object itself.
>>
>> In your work on this, did you investigate if there are other commands
>> (ie repack/gc) that will need to learn about promised objects? Have you
>> had a chance (or have plans) to hack up the test suite so that it runs
>> all tests with promised objects and see what (if anything) breaks?
> 
> In one of the subsequent patches, I tried to ensure that all
> object-reading functions in sha1_file.c somewhat works (albeit slowly)
> in the presence of promised objects - that would cover the functionality
> of the other commands. As for hacking up the test suite to run with
> promised objects, that would be ideal, but I haven't figured out how to
> do that yet.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
  2017-07-20 21:18       ` Jonathan Tan
@ 2017-07-21 16:27         ` Ben Peart
  0 siblings, 0 replies; 46+ messages in thread
From: Ben Peart @ 2017-07-21 16:27 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Stefan Beller, Lars Schneider, git, Jonathan Nieder,
	Jeff Hostetler, Philip Oakley



On 7/20/2017 5:18 PM, Jonathan Tan wrote:
> On Thu, 20 Jul 2017 16:58:16 -0400
> Ben Peart <peartben@gmail.com> wrote:
> 
>>>> This is meant as a temporary measure to ensure that all Git commands
>>>> work in such a situation. Future patches will update some commands to
>>>> either tolerate promised objects (without invoking the hook) or be more
>>>> efficient in invoking the promised objects hook.
>>
>> I agree that making git more tolerant of promised objects if possible
>> and precomputing a list of promised objects required to complete a
>> particular command and downloading them with a single request are good
>> optimizations to add over time.
> 
> That's good to know!
> 
>> has_sha1_file also takes a hash "whether local or in an alternate object
>> database, and whether packed or loose" but never calls
>> sha1_object_info_extended.  As a result, we had to add support in
>> check_and_freshen to download missing objects to get proper behavior in
>> all cases.  I don't think this will work correctly without it.
> 
> Thanks for the attention to detail. Is this before or after commit
> e83e71c ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26)?
> 

Sorry, my bad.  I missed the comment in the cover letter that said this 
needed to be applied on top of your other patch series.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-21 16:24       ` Ben Peart
@ 2017-07-21 20:33         ` Jonathan Tan
  2017-07-25 15:10           ` Ben Peart
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Tan @ 2017-07-21 20:33 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, jrnieder, sbeller, git, philipoakley

On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart <peartben@gmail.com> wrote:

> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of 
> promises. Given the average developer only hydrates 56K files (2 MB 
> promises) that is 103 MB to download that no one will ever need. We 
> would like to avoid that if possible as this would be a significant 
> regression in clone times from where we are today.
> 
> I'm also concerned about the performance of merging in promises given we 
> have 100M objects today and growing so the number of promises over time 
> could get pretty large.

After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.

This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)

Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.

> >> I think we should have a flag (off by default) that enables someone to
> >> say that promised objects are optional. If the flag is set,
> >> "is_promised_object" will return success and pass the OBJ_ANY type and a
> >> size of -1.
> >>
> >> Nothing today is using the size and in the two places where the object
> >> type is being checked for consistency (fsck_cache_tree and
> >> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> >>
> >> This will enable very large numbers of objects to be omitted from the
> >> clone without triggering a download of the corresponding number of
> >> promised objects.
> > 
> > Eventually I plan to use the size when implementing parameters for
> > history-searching commands (e.g. "git log -S"), but it's true that
> > that's in the future.
> > 
> > Allowing promised objects to be optional would indeed solve the issue of
> > downloading too many promises. It would make the code more complicated,
> > but I'm not sure by how much.
> > 
> > For example, in this fsck patch, the easiest way I could think of to
> > have promised objects was to introduce a 3rd state, called "promised",
> > of "struct object" - one in which the type is known, but we don't have
> > access to the full "struct commit" or equivalent. And thus fsck could
> > assume that if the "struct object" is "parsed" or "promised", the type
> > is known. Having optional promised objects would require that we let
> > this "promised" state have a type of OBJ_UNKNOWN (or something like
> > that) - maybe that would be fine, but I haven't looked into this in
> > detail.
> > 
> 
> Caveats apply as I only did a quick look but I only found the two 
> locations that were checking the object type for consistency.

I haven't looked into detail, but you are probably right.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-21 20:33         ` Jonathan Tan
@ 2017-07-25 15:10           ` Ben Peart
  2017-07-29 13:26             ` Philip Oakley
  0 siblings, 1 reply; 46+ messages in thread
From: Ben Peart @ 2017-07-25 15:10 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, philipoakley



On 7/21/2017 4:33 PM, Jonathan Tan wrote:
> On Fri, 21 Jul 2017 12:24:52 -0400
> Ben Peart <peartben@gmail.com> wrote:
> 
>> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
>> promises. Given the average developer only hydrates 56K files (2 MB
>> promises) that is 103 MB to download that no one will ever need. We
>> would like to avoid that if possible as this would be a significant
>> regression in clone times from where we are today.
>>
>> I'm also concerned about the performance of merging in promises given we
>> have 100M objects today and growing so the number of promises over time
>> could get pretty large.
> 
> After some thought, maybe a hybrid solution is best, in which it is
> permissible but optional for some missing objects to have promises. In
> that case, it is more of a "size cache" (which stores the type as well)
> rather than a true promise. When fetching, the client can optionally
> request for the sizes and types of missing objects.
> 

In our GVFS solution today we do not download any size or object type 
information at clone as the number of objects and the resulting file 
would be too large.  Instead, we have a new sizes endpoint 
(https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables 
us to retrieve object sizes "on demand" much like we are enabling for 
the actual object content.

This protocol could easily be extended to return both size and type so 
that it could be used to retrieve "promise" data for objects as they are 
needed. Having a way to "cache" that data locally so that both git and 
other code could share it would be great.

At a minimum, we should ensure the data stream passed back is the same 
whether at clone time or when hitting a "promises" end point. I think it 
would also be helpful to enable promises to be downloaded on demand much 
like we are doing for the object data itself.

> This is good for the large-blob case, in which we can always have size
> information of missing blobs, and we can subsequently add blob-size
> filtering (as a parameter) to "git log -S" and friends to avoid needing
> to resolve a missing object. And this is, as far as I can tell, also
> good for the many-blob case - just have an empty size cache all the
> time. (And in the future, use cases could come up that desire non-empty
> but non-comprehensive caches - for example, a directory lister working
> on a partial clone that only needs to cache the sizes of frequently
> accessed directories.)
> 
> Another option is to have a repo-wide option that toggles between
> mandatory entries in the "size cache" and prohibited entries. Switching
> to mandatory provides stricter fsck and negative lookups, but I think
> it's not worth it for both the developers and users of Git to have to
> know about these two modes.
> 
>>>> I think we should have a flag (off by default) that enables someone to
>>>> say that promised objects are optional. If the flag is set,
>>>> "is_promised_object" will return success and pass the OBJ_ANY type and a
>>>> size of -1.
>>>>
>>>> Nothing today is using the size and in the two places where the object
>>>> type is being checked for consistency (fsck_cache_tree and
>>>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>>>
>>>> This will enable very large numbers of objects to be omitted from the
>>>> clone without triggering a download of the corresponding number of
>>>> promised objects.
>>>
>>> Eventually I plan to use the size when implementing parameters for
>>> history-searching commands (e.g. "git log -S"), but it's true that
>>> that's in the future.
>>>
>>> Allowing promised objects to be optional would indeed solve the issue of
>>> downloading too many promises. It would make the code more complicated,
>>> but I'm not sure by how much.
>>>
>>> For example, in this fsck patch, the easiest way I could think of to
>>> have promised objects was to introduce a 3rd state, called "promised",
>>> of "struct object" - one in which the type is known, but we don't have
>>> access to the full "struct commit" or equivalent. And thus fsck could
>>> assume that if the "struct object" is "parsed" or "promised", the type
>>> is known. Having optional promised objects would require that we let
>>> this "promised" state have a type of OBJ_UNKNOWN (or something like
>>> that) - maybe that would be fine, but I haven't looked into this in
>>> detail.
>>>
>>
>> Caveats apply as I only did a quick look but I only found the two
>> locations that were checking the object type for consistency.
> 
> I haven't looked into detail, but you are probably right.
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
  2017-07-17 17:43   ` Ben Peart
@ 2017-07-25 20:48     ` Philip Oakley
  0 siblings, 0 replies; 46+ messages in thread
From: Philip Oakley @ 2017-07-25 20:48 UTC (permalink / raw)
  To: Jonathan Tan, git, Ben Peart

Sorry for the delay - been away...

From: "Ben Peart" <peartben@gmail.com>
Sent: Monday, July 17, 2017 6:43 PM
>
> On 7/16/2017 11:23 AM, Philip Oakley wrote:
>> From: "Jonathan Tan" <jonathantanmy@google.com>
>> Sent: Tuesday, July 11, 2017 8:48 PM
>>> These patches are part of a set of patches implementing partial clone,
>>> as you can see here:
>>>
>>> https://github.com/jonathantanmy/git/tree/partialclone
>>>
>>> In that branch, clone with batch checkout works, as you can see in the
>>> README. The code and tests are generally done, but some patches are
>>> still missing documentation and commit messages.
>>>
>>> These 3 patches implement the foundational concept - formerly known as
>>> "missing blobs" in the "missing blob manifest", I decided to call them
>>> "promised blobs". The repo knows their object names and sizes. It also
>>> does not have the blobs themselves, but can be configured to know how to
>>> fetch them.
>>>
>> If I understand correctly, this method doesn't give any direct user 
>> visibility of missing blobs in the file system. Is that correct?
>
> That is correct
>
>>
>> I was hoping that eventually the various 'on demand' approaches would 
>> still allow users to continue to work as they go off-line such that they 
>> can see directly (in the FS) where the missing blobs (and trees) are 
>> located, so that they can continue to commit new work on existing files.
>>
>
> This is a challenge as git assumes all objects are always available (that 
> is a key design principal of a DVCS) so any missing object is considered a 
> corruption that typically results in a call to "die."

My view/concept was more based on the fact that Git is happy to have missing 
'trees', as long as they are submodules ;-), so I was hoping to massage that 
so git could carry on working as if the whole 'tree' (or blob when they were 
omitted) was still present in as 'unchanged', so the oid's would stay as 
they were.

I see that you don't omit the trees, which would be more common in my type 
of environment (defence/security). I expect in an idealised BigWin repo the 
same would also be true - user only gets /Office/Excel if that's what they 
are working on ;-)

>
> The GVFS solution gets around this by ensuring any missing object is 
> retrieved on behalf of git so that it never sees it as missing.  The 
> obvious tradeoff is that this requires a network connection so the object 
> can be retrieved.

In my concept, the user would not have the opportunity to fetch the 
tree/blob, but could replace it in its entirety (we'd still have the meta 
data of the tree/blob name and it's old oid, but couldn't do a diff)
>
>> I had felt that some sort of 'gitlink' should be present (huma readable) 
>> as a place holder for the missing blob/tree. e.g. 'gitblob: 1234abcd' 
>> (showing the missing oid, jsut like sub-modules can do - it's no 
>> different really.
>>
>
> We explored that option briefly but when you have a large number of files, 
> even writing out some sort of place holder can take a very long time.  In 
> fact, since the typical source file is relatively small (a few kilobytes), 
> writing out a placeholder doesn't save much time vs just writing out the 
> actual file contents.
>
> Another challenge is that even if there is a placeholder written to disk, 
> you still need a network connection to retrieve the actual contents 
> if/when it is needed.

I was viewing the 'missing' tree/blobs to be part of a narrow clone concept, 
so the user would need to explicitly widen the narrow clone to get missing 
trees/blobs (which could have been omitted by age, size, name, style of a 
.gitNarrowIgnore spec etc)

>
>> I'm concerned that the various GVFS extensions haven't fully achieved a 
>> separation of concerns surrounding the DVCS capability for 
>> on-line/off-line conversion as comms drop in and out. The GVFS looks 
>> great for a fully networked, always on, environment, but it would be good 
>> to also have the sepration for those who (will) have shallow/narrow 
>> clones that may also need to work with a local upstream that is also 
>> shallow/narrow.
>>
>
> You are correct that this hasn't been tackled yet. It is a challenging 
> problem. I can envision something along the lines of what was done for the 
> shallow clone feature where there are distinct ways to change the set of 
> objects that are available but that would hopefully come in some future 
> patch series.

OK. That's good to know. If the GFVS could be expanded to create a type of 
Narrow Clone capability  so that the 'going off-line' problem easily 
transitions between being just a neat VFS and then to being a neat narrow 
clone, and  that it may solve two problems in one.

I had it in my mind that the missing blobs/trees could be simply stubbed out 
within the repo itself, as just the oid ref, or maybe the oid ref plus the 
length (given that size is one of the common causes on not wanting the 
content just yet). The repo could still be packed etc, as long as the format 
is understood.

--
Philip

>
>> -- 
>> Philip
>> I wanted to at least get my thoughts into the discussion before it all 
>> passes by.
>>
>>> An older version of these patches was sent as a single demonstration
>>> patch in versions 1 to 3 of [1]. In there, Junio suggested that I have
>>> only one file containing missing blob information. I have made that
>>> suggested change in this version.
>>>
>>> One thing remaining is to add a repository extension [2] so that older
>>> versions of Git fail immediately instead of trying to read missing
>>> blobs, but I thought I'd send these first in order to get some initial
>>> feedback.
>>>
>>> [1] 
>>> https://public-inbox.org/git/cover.1497035376.git.jonathantanmy@google.com/
>>> [2] Documentation/technical/repository-version.txt
>>>
>>> Jonathan Tan (3):
>>>  promised-blob, fsck: introduce promised blobs
>>>  sha1-array: support appending unsigned char hash
>>>  sha1_file: add promised blob hook support
>>>
>>> Documentation/config.txt               |   8 ++
>>> Documentation/gitrepository-layout.txt |   8 ++
>>> Makefile                               |   1 +
>>> builtin/cat-file.c                     |   9 ++
>>> builtin/fsck.c                         |  13 +++
>>> promised-blob.c                        | 170 
>>> +++++++++++++++++++++++++++++++++
>>> promised-blob.h                        |  27 ++++++
>>> sha1-array.c                           |   7 ++
>>> sha1-array.h                           |   1 +
>>> sha1_file.c                            |  44 ++++++---
>>> t/t3907-promised-blob.sh               |  65 +++++++++++++
>>> t/test-lib-functions.sh                |   6 ++
>>> 12 files changed, 345 insertions(+), 14 deletions(-)
>>> create mode 100644 promised-blob.c
>>> create mode 100644 promised-blob.h
>>> create mode 100755 t/t3907-promised-blob.sh
>>>
>>> -- 
>>> 2.13.2.932.g7449e964c-goog
>>>
>> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
  2017-07-17 18:03   ` Jonathan Nieder
@ 2017-07-29 12:51     ` Philip Oakley
  0 siblings, 0 replies; 46+ messages in thread
From: Philip Oakley @ 2017-07-29 12:51 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, git, Ben Peart

From: "Jonathan Nieder" <jrnieder@gmail.com>
Sent: Monday, July 17, 2017 7:03 PM
> Hi Philip,
>
> Philip Oakley wrote:
>> From: "Jonathan Tan" <jonathantanmy@google.com>
>
>>> These patches are part of a set of patches implementing partial clone,
>>> as you can see here:
>>>
>>> https://github.com/jonathantanmy/git/tree/partialclone
> [...]
>> If I understand correctly, this method doesn't give any direct user
>> visibility of missing blobs in the file system. Is that correct?
>>
>> I was hoping that eventually the various 'on demand' approaches
>> would still allow users to continue to work as they go off-line such
>> that they can see directly (in the FS) where the missing blobs (and
>> trees) are located, so that they can continue to commit new work on
>> existing files.
>>
>> I had felt that some sort of 'gitlink' should be present (huma
>> readable) as a place holder for the missing blob/tree. e.g.
>> 'gitblob: 1234abcd' (showing the missing oid, jsut like sub-modules
>> can do - it's no different really.
>
> That's a reasonable thing to want, but it's a little different from
> the use cases that partial clone work so far has aimed to support.
> They are:
>
> A. Avoiding downloading all blobs (and likely trees as well) that are
>    not needed in the current operation (e.g. checkout).  This blends
>    well with the sparse checkout feature, which allows the current
>    checkout to be fairly small in a large repository.

True. In my case I was looking for a method that would allow a 'Narrow 
clone' such that the local repo would be smaller (have less content), but 
would feel as if all the usefull files/directories were available, and there 
would be place holders at the points where the trees were pruned, both in 
the object store, and in the user's work-tree.

As you say, in some ways its conceptually orthogonal to the original sparse 
checket (which has a full width object store / repo, and then omitted files 
from the checkout.
>
>    GVFS uses a trick that makes it a little easier to widen a sparse
>    checkout upon access of a directory.  But the same building blocks
>    should work fine with a sparse checkout that has been set up
>    explicitly.
>
> B. Avoiding downloading large blobs, except for those needed in the
>    current operation (e.g. checkout).
>
>    When not using sparse checkout, the main benefit out of the box is
>    avoiding downloading *historical versions* of large blobs.
>

> It sounds like you are looking for a sort of placeholder outside the
> sparse checkout area.

True.

> In a way, that's orthogonal to these patches:
> even if you have all relevant blobs, you may want to avoid inflating
> them to check them out and reading them to compare to the index (i.e.
> the usual benefits of sparse checkout).

In my concept, it should be possible to create the ('sparse'/narrow) index 
from the content of the local object store, without any network connection 
(though that content is determined by the prior fetch/clone;-). The proper 
git sparse checkout could proceed from there as being a further local 
restriction on what is omitted from the worktree.

Those missing from the narrow clone would still show as place holders with 
content ".gitnarrowtree 13a24b..<oid>" (so we know what the hash oid of the 
file/tree should be (so they can be moved/renamed etc!). The index would 
only know the content/structure as far as the place holders (just like 
sub-modules are a break point in the tracking, with identical caveats)


It would be interesting to know from Ben the level of sparseness/narrowness 
has been seen typically in the BigWin GVFS repo case.

>  In a sparse checkout, you
> still might like to be able to get a listing of files outside the
> sparse area (which you can get with "git ls-tree") and you may even
> want to be able to get such a listing with plain "ls" (as with your
> proposal).
>
> Thanks and hope that helps,
> Jonathan

Thanks, yes. It has help consolidate some of the parts of my concept that 
has been in the back of my mind for a while now.

Philip 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
  2017-07-25 15:10           ` Ben Peart
@ 2017-07-29 13:26             ` Philip Oakley
  0 siblings, 0 replies; 46+ messages in thread
From: Philip Oakley @ 2017-07-29 13:26 UTC (permalink / raw)
  To: Jonathan Tan, Ben Peart; +Cc: git, jrnieder, sbeller, git

From: "Ben Peart" <peartben@gmail.com>
Sent: Tuesday, July 25, 2017 4:10 PM
> On 7/21/2017 4:33 PM, Jonathan Tan wrote:
>> On Fri, 21 Jul 2017 12:24:52 -0400
>> Ben Peart <peartben@gmail.com> wrote:
>>
>>> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
>>> promises. Given the average developer only hydrates 56K files (2 MB
>>> promises) that is 103 MB to download that no one will ever need. We
>>> would like to avoid that if possible as this would be a significant
>>> regression in clone times from where we are today.
>>>

A question in the broader context of Narrow clones that I'd be interested 
in.

How narrow are the tree levels that contain the hydrated files? The question 
splits a couple of ways:

A. If one goes to the highest tree that contains all the 56K files, how many 
files and sub trees would be in that complete tree (i.e. what fraction of 
that 'top' tree is inflated).
A2. Is there some deep/wide metric that indicates how tightly together the 
inflated files tend to cluster?

B. If instead we look at just the trees in the paths of those inflated 
files, those trees will also reference other trees/blobs that are not 
inflated, how big is that list (it would indicate the size of a narrow 
repo's object store that holds the oid stubs)

I would quess / expect that the typical inflation only has a few clustered 
areas of interest, but it maybe that in such a big reality (*) the inflated 
files are actually spread very widely. (*) as per various blog posts saying 
there was no realistic way of partitioning the BigWin repo!

I'd be interested in any such sparsity metric (apologies if I've missed 
previous reports).
--
Philip

>>> I'm also concerned about the performance of merging in promises given we
>>> have 100M objects today and growing so the number of promises over time
>>> could get pretty large.
>>
>> After some thought, maybe a hybrid solution is best, in which it is
>> permissible but optional for some missing objects to have promises. In
>> that case, it is more of a "size cache" (which stores the type as well)
>> rather than a true promise. When fetching, the client can optionally
>> request for the sizes and types of missing objects.
>>
>
> In our GVFS solution today we do not download any size or object type 
> information at clone as the number of objects and the resulting file would 
> be too large.  Instead, we have a new sizes endpoint 
> (https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables 
> us to retrieve object sizes "on demand" much like we are enabling for the 
> actual object content.
>
> This protocol could easily be extended to return both size and type so 
> that it could be used to retrieve "promise" data for objects as they are 
> needed. Having a way to "cache" that data locally so that both git and 
> other code could share it would be great.
>
> At a minimum, we should ensure the data stream passed back is the same 
> whether at clone time or when hitting a "promises" end point. I think it 
> would also be helpful to enable promises to be downloaded on demand much 
> like we are doing for the object data itself.
>
>> This is good for the large-blob case, in which we can always have size
>> information of missing blobs, and we can subsequently add blob-size
>> filtering (as a parameter) to "git log -S" and friends to avoid needing
>> to resolve a missing object. And this is, as far as I can tell, also
>> good for the many-blob case - just have an empty size cache all the
>> time. (And in the future, use cases could come up that desire non-empty
>> but non-comprehensive caches - for example, a directory lister working
>> on a partial clone that only needs to cache the sizes of frequently
>> accessed directories.)
>>
>> Another option is to have a repo-wide option that toggles between
>> mandatory entries in the "size cache" and prohibited entries. Switching
>> to mandatory provides stricter fsck and negative lookups, but I think
>> it's not worth it for both the developers and users of Git to have to
>> know about these two modes.
>>
>>>>> I think we should have a flag (off by default) that enables someone to
>>>>> say that promised objects are optional. If the flag is set,
>>>>> "is_promised_object" will return success and pass the OBJ_ANY type and 
>>>>> a
>>>>> size of -1.
>>>>>
>>>>> Nothing today is using the size and in the two places where the object
>>>>> type is being checked for consistency (fsck_cache_tree and
>>>>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>>>>
>>>>> This will enable very large numbers of objects to be omitted from the
>>>>> clone without triggering a download of the corresponding number of
>>>>> promised objects.
>>>>
>>>> Eventually I plan to use the size when implementing parameters for
>>>> history-searching commands (e.g. "git log -S"), but it's true that
>>>> that's in the future.
>>>>
>>>> Allowing promised objects to be optional would indeed solve the issue 
>>>> of
>>>> downloading too many promises. It would make the code more complicated,
>>>> but I'm not sure by how much.
>>>>
>>>> For example, in this fsck patch, the easiest way I could think of to
>>>> have promised objects was to introduce a 3rd state, called "promised",
>>>> of "struct object" - one in which the type is known, but we don't have
>>>> access to the full "struct commit" or equivalent. And thus fsck could
>>>> assume that if the "struct object" is "parsed" or "promised", the type
>>>> is known. Having optional promised objects would require that we let
>>>> this "promised" state have a type of OBJ_UNKNOWN (or something like
>>>> that) - maybe that would be fine, but I haven't looked into this in
>>>> detail.
>>>>
>>>
>>> Caveats apply as I only did a quick look but I only found the two
>>> locations that were checking the object type for consistency.
>>
>> I haven't looked into detail, but you are probably right.
>> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
@ 2022-09-17 23:56 Вероника Кулешова
  0 siblings, 0 replies; 46+ messages in thread
From: Вероника Кулешова @ 2022-09-17 23:56 UTC (permalink / raw)
  To: jonathantanmy; +Cc: git



Отправлено с iPhone

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-09-17 23:57 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
2017-07-11 22:02   ` Stefan Beller
2017-07-19 23:37     ` Jonathan Tan
2017-07-12 17:29   ` Jeff Hostetler
2017-07-12 19:28     ` Jonathan Nieder
2017-07-13 14:48       ` Jeff Hostetler
2017-07-13 15:05         ` Jeff Hostetler
2017-07-13 19:39     ` Jonathan Tan
2017-07-14 20:03       ` Jeff Hostetler
2017-07-14 21:30         ` Jonathan Nieder
2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-11 22:06   ` Stefan Beller
2017-07-19 23:56     ` Jonathan Tan
2017-07-20  0:06       ` Stefan Beller
2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
2017-07-11 22:38   ` Stefan Beller
2017-07-12 17:40   ` Ben Peart
2017-07-12 20:38     ` Jonathan Nieder
2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
2017-07-17 17:43   ` Ben Peart
2017-07-25 20:48     ` Philip Oakley
2017-07-17 18:03   ` Jonathan Nieder
2017-07-29 12:51     ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20  0:36   ` Stefan Beller
2017-07-20  0:55     ` Jonathan Tan
2017-07-20 17:44       ` Ben Peart
2017-07-20 21:20   ` Junio C Hamano
2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
2017-07-20 18:07   ` Stefan Beller
2017-07-20 19:17     ` Jonathan Tan
2017-07-20 19:58   ` Ben Peart
2017-07-20 21:13     ` Jonathan Tan
2017-07-21 16:24       ` Ben Peart
2017-07-21 20:33         ` Jonathan Tan
2017-07-25 15:10           ` Ben Peart
2017-07-29 13:26             ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
2017-07-20 18:23   ` Stefan Beller
2017-07-20 20:58     ` Ben Peart
2017-07-20 21:18       ` Jonathan Tan
2017-07-21 16:27         ` Ben Peart
2022-09-17 23:56 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Вероника Кулешова

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).