git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] repack: introduce `--write-midx`
@ 2021-09-11  3:32 Taylor Blau
  2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
                   ` (8 more replies)
  0 siblings, 9 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

This series implements the final component of multi-pack reachability bitmaps:
the ability to write them from `git repack`.

The gist is that it is currently impossible to write a MIDX based on the result
of a `git repack` without the possibility that an existing MIDX may go away.
That's because `git repack` may have deleted a pack that was included in the
MIDX (either because it was rolled up when doing a `--geometric` repack, or the
repack was done all-into-one), and MIDXs cannot refer to deleted packs.

To solve this, `git repack` is taught a new option to write a MIDX between the
critical section: right after the new packs are written and moved into place,
but right before any old packs are deleted.

Unfortunately, there is quite a bit of auxiliary tooling we need to implement
before this is possible. The series is laid out like so:

  - The first two patches lay the ground work for and expose `git
    multi-pack-index --stdin-packs`, which allows callers to write a MIDX
    containing only certain packs.

  - The next patch implements preliminary support for `--refs-snapshot`, which
    allows the MIDX bitmap selection code to reference the prior state of
    reference tips, preventing a race.

  - The next two patches prepare the `repack` builtin to write a MIDX, and the
    following patch actually implements `--write-midx`.

  - The final two patches make sure the MIDX generated has an appropriately
    chosen preferred pack, and that the refs-snapshot is provided.

It is based on `tb/multi-pack-bitmaps`. Thanks in advance for your review!

Taylor Blau (8):
  midx: expose 'write_midx_file_only()' publicly
  builtin/multi-pack-index.c: support --stdin-packs mode
  midx: preliminary support for `--refs-snapshot`
  builtin/repack.c: keep track of existing packs unconditionally
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps

 Documentation/git-multi-pack-index.txt |  19 ++
 Documentation/git-repack.txt           |  18 +-
 builtin/multi-pack-index.c             |  35 +++-
 builtin/repack.c                       | 251 ++++++++++++++++++++++---
 midx.c                                 | 102 ++++++++--
 midx.h                                 |   4 +-
 pack-bitmap.c                          |   2 +-
 pack-bitmap.h                          |   1 +
 t/helper/test-read-midx.c              |  25 ++-
 t/lib-midx.sh                          |   8 +
 t/t5319-multi-pack-index.sh            |  15 ++
 t/t5326-multi-pack-bitmaps.sh          |  82 ++++++++
 t/t7700-repack.sh                      |  96 ++++++++++
 t/t7703-repack-geometric.sh            |  23 +++
 14 files changed, 630 insertions(+), 51 deletions(-)
 create mode 100644 t/lib-midx.sh

-- 
2.33.0.96.g73915697e6

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11  5:00   ` Junio C Hamano
  2021-09-11 10:07   ` Ævar Arnfjörð Bjarmason
  2021-09-11  3:32 ` [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode Taylor Blau
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 midx.h |  2 ++
 2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 864034a6ad..29d1d107b3 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,8 @@ struct write_midx_context {
 	uint32_t num_large_offsets;
 
 	int preferred_pack_idx;
+
+	struct string_list *to_include;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -484,8 +486,13 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
-		if (ctx->m && midx_contains_pack(ctx->m, file_name))
-			return;
+		if (ctx->m) {
+			if (midx_contains_pack(ctx->m, file_name))
+				return;
+		} else if (ctx->to_include) {
+			if (!string_list_has_string(ctx->to_include, file_name))
+				return;
+		}
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -1058,6 +1065,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 }
 
 static int write_midx_internal(const char *object_dir,
+			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -1082,10 +1090,17 @@ static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);
 
-	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
-		if (!strcmp(object_dir, cur->object_dir)) {
-			ctx.m = cur;
-			break;
+	if (!packs_to_include) {
+		/*
+		 * Only reference an existing MIDX when not filtering which
+		 * packs to include, since all packs and objects are copied
+		 * blindly from an existing MIDX if one is present.
+		 */
+		for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir)) {
+				ctx.m = cur;
+				break;
+			}
 		}
 	}
 
@@ -1136,10 +1151,13 @@ static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;
 
+	ctx.to_include = packs_to_include;
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
+	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
 		int bitmap_exists;
 		int want_bitmap = flags & MIDX_WRITE_BITMAP;
@@ -1237,7 +1255,7 @@ static int write_midx_internal(const char *object_dir,
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
-	if (packs_to_drop && packs_to_drop->nr) {
+	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
@@ -1380,7 +1398,17 @@ int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
+	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
+				   flags);
+}
+
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags)
+{
+	return write_midx_internal(object_dir, packs_to_include, NULL,
+				   preferred_pack_name, flags);
 }
 
 struct clear_midx_data {
@@ -1660,7 +1688,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
 		m = NULL;
 	}
 
@@ -1851,7 +1879,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index aa3da557bb..aefa371c90 100644
--- a/midx.h
+++ b/midx.h
@@ -2,6 +2,7 @@
 #define MIDX_H
 
 #include "repository.h"
+#include "string-list.h"
 
 struct object_id;
 struct pack_entry;
@@ -62,6 +63,7 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
  2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11 10:05   ` Ævar Arnfjörð Bjarmason
  2021-09-11  3:32 ` [PATCH 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

To power a new `--write-midx` mode, `git repack` will want to write a
multi-pack index containing a certain set of packs in the repository.

This new option will be used by `git repack` to write a MIDX which
contains only the packs which will survive after the repack (that is, it
will exclude any packs which are about to be deleted).

This patch effectively exposes the function implemented in the previous
commit via the `git multi-pack-index` builtin. An alternative approach
would have been to call that function from the `git repack` builtin
directly, but this introduces awkward problems around closing and
reopening the object store, so the MIDX will be written out-of-process.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  4 ++++
 builtin/multi-pack-index.c             | 26 ++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh            | 15 +++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index a9df3dbd32..009c989ef8 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -45,6 +45,10 @@ write::
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
+
+	--stdin-packs::
+		Write a multi-pack index containing only the set of
+		line-delimited pack index basenames provided over stdin.
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 73c0113b48..77488b6b7b 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -47,6 +47,7 @@ static struct opts_multi_pack_index {
 	const char *preferred_pack;
 	unsigned long batch_size;
 	unsigned flags;
+	int stdin_packs;
 } opts;
 
 static struct option common_opts[] = {
@@ -61,6 +62,15 @@ static struct option *add_common_options(struct option *prev)
 	return parse_options_concat(common_opts, prev);
 }
 
+static void read_packs_from_stdin(struct string_list *to)
+{
+	struct strbuf buf = STRBUF_INIT;
+	while (strbuf_getline(&buf, stdin) != EOF) {
+		string_list_append(to, strbuf_detach(&buf, NULL));
+	}
+	string_list_sort(to);
+}
+
 static int cmd_multi_pack_index_write(int argc, const char **argv)
 {
 	struct option *options;
@@ -70,6 +80,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			   N_("pack for reuse when computing a multi-pack bitmap")),
 		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
+		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
+			 N_("write multi-pack index containing only given indexes")),
 		OPT_END(),
 	};
 
@@ -86,6 +98,20 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	FREE_AND_NULL(options);
 
+	if (opts.stdin_packs) {
+		struct string_list packs = STRING_LIST_INIT_NODUP;
+		int ret;
+
+		read_packs_from_stdin(&packs);
+
+		ret = write_midx_file_only(opts.object_dir, &packs,
+					   opts.preferred_pack, opts.flags);
+
+		string_list_clear(&packs, 0);
+
+		return ret;
+
+	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
 			       opts.flags);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bb04f0f23b..385f0a3efd 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -168,6 +168,21 @@ test_expect_success 'write midx with two packs' '
 
 compare_results_with_midx "two packs"
 
+test_expect_success 'write midx with --stdin-packs' '
+	rm -fr $objdir/pack/multi-pack-index &&
+
+	idx="$(find $objdir/pack -name "test-2-*.idx")" &&
+	basename "$idx" >in &&
+
+	git multi-pack-index write --stdin-packs <in &&
+
+	test-tool read-midx $objdir | grep "\.idx$" >packs &&
+
+	test_cmp packs in
+'
+
+compare_results_with_midx "mixed mode (one pack + extra)"
+
 test_expect_success 'write progress off for redirected stderr' '
 	git multi-pack-index --object-dir=$objdir write 2>err &&
 	test_line_count = 0 err
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
  2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
  2021-09-11  3:32 ` [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11  3:32 ` [PATCH 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.

This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.

This can happen when a multi-pack index contains some pack which refers
to loose objects (which by definition aren't included in the multi-pack
index).

By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.

This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.

The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).

When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt | 15 +++++
 builtin/multi-pack-index.c             | 11 +++-
 builtin/repack.c                       |  2 +-
 midx.c                                 | 60 ++++++++++++++++---
 midx.h                                 |  4 +-
 t/t5326-multi-pack-bitmaps.sh          | 82 ++++++++++++++++++++++++++
 6 files changed, 160 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 009c989ef8..27f83932e4 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -49,6 +49,21 @@ write::
 	--stdin-packs::
 		Write a multi-pack index containing only the set of
 		line-delimited pack index basenames provided over stdin.
+
+	--refs-snapshot=<path>::
+		With `--bitmap`, optionally specify a file which
+		contains a "refs snapshot" taken prior to repacking.
++
+A reference snapshot is composed of line-delimited OIDs corresponding to
+the reference tips, usually taken by `git repack` prior to generating a
+new pack. A line may optionally start with a `+` character to indicate
+that the reference which corresponds to that OID is "preferred" (see
+linkgit:git-config[1]'s `pack.preferBitmapTips`.)
++
+The file given at `<path>` is expected to be readable, and can contain
+duplicates. (If a given OID is given more than once, it is marked as
+preferred if at least one instance of it begins with the special `+`
+marker).
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 77488b6b7b..65a242f5cf 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -7,7 +7,8 @@
 #include "object-store.h"
 
 #define BUILTIN_MIDX_WRITE_USAGE \
-	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]")
+	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]" \
+	   "[--refs-snapshot=<path>]")
 
 #define BUILTIN_MIDX_VERIFY_USAGE \
 	N_("git multi-pack-index [<options>] verify")
@@ -45,6 +46,7 @@ static char const * const builtin_multi_pack_index_usage[] = {
 static struct opts_multi_pack_index {
 	const char *object_dir;
 	const char *preferred_pack;
+	const char *refs_snapshot;
 	unsigned long batch_size;
 	unsigned flags;
 	int stdin_packs;
@@ -82,6 +84,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			 N_("write multi-pack index containing only given indexes")),
+		OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
+			     N_("refs snapshot for selecting bitmap commits")),
 		OPT_END(),
 	};
 
@@ -105,7 +109,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		read_packs_from_stdin(&packs);
 
 		ret = write_midx_file_only(opts.object_dir, &packs,
-					   opts.preferred_pack, opts.flags);
+					   opts.preferred_pack,
+					   opts.refs_snapshot, opts.flags);
 
 		string_list_clear(&packs, 0);
 
@@ -113,7 +118,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
-			       opts.flags);
+			       opts.refs_snapshot, opts.flags);
 }
 
 static int cmd_multi_pack_index_verify(int argc, const char **argv)
diff --git a/builtin/repack.c b/builtin/repack.c
index 82ab668272..27158a897b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -733,7 +733,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		unsigned flags = 0;
 		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
 			flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
-		write_midx_file(get_object_directory(), NULL, flags);
+		write_midx_file(get_object_directory(), NULL, NULL, flags);
 	}
 
 	string_list_clear(&names, 0);
diff --git a/midx.c b/midx.c
index 29d1d107b3..46ded7a4cf 100644
--- a/midx.c
+++ b/midx.c
@@ -970,7 +970,42 @@ static void bitmap_show_commit(struct commit *commit, void *_data)
 	data->commits[data->commits_nr++] = commit;
 }
 
+static int read_refs_snapshot(const char *refs_snapshot,
+			      struct rev_info *revs)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct object_id oid;
+	FILE *f = xfopen(refs_snapshot, "r");
+
+	while (strbuf_getline(&buf, f) != EOF) {
+		struct object *object;
+		int preferred = 0;
+		char *hex = buf.buf;
+		const char *end = NULL;
+
+		if (buf.len && *buf.buf == '+') {
+			preferred = 1;
+			hex = &buf.buf[1];
+		}
+
+		if (parse_oid_hex(hex, &oid, &end) < 0)
+			die(_("could not parse line: %s"), buf.buf);
+		if (*end)
+			die(_("malformed line: %s"), buf.buf);
+
+		object = parse_object_or_die(&oid, NULL);
+		if (preferred)
+			object->flags |= NEEDS_BITMAP;
+
+		add_pending_object(revs, object, "");
+	}
+
+	fclose(f);
+	return 0;
+}
+
 static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    const char *refs_snapshot,
 						    struct write_midx_context *ctx)
 {
 	struct rev_info revs;
@@ -979,8 +1014,12 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 	cb.ctx = ctx;
 
 	repo_init_revisions(the_repository, &revs, NULL);
-	setup_revisions(0, NULL, &revs, NULL);
-	for_each_ref(add_ref_to_pending, &revs);
+	if (refs_snapshot) {
+		read_refs_snapshot(refs_snapshot, &revs);
+	} else {
+		setup_revisions(0, NULL, &revs, NULL);
+		for_each_ref(add_ref_to_pending, &revs);
+	}
 
 	/*
 	 * Skipping promisor objects here is intentional, since it only excludes
@@ -1009,6 +1048,7 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 
 static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 			     struct write_midx_context *ctx,
+			     const char *refs_snapshot,
 			     unsigned flags)
 {
 	struct packing_data pdata;
@@ -1020,7 +1060,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 
 	prepare_midx_packing_data(&pdata, ctx);
 
-	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+	commits = find_commits_for_midx_bitmap(&commits_nr, refs_snapshot, ctx);
 
 	/*
 	 * Build the MIDX-order index based on pdata.objects (which is already
@@ -1068,6 +1108,7 @@ static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
+			       const char *refs_snapshot,
 			       unsigned flags)
 {
 	char *midx_name;
@@ -1361,7 +1402,8 @@ static int write_midx_internal(const char *object_dir,
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
 	if (flags & MIDX_WRITE_BITMAP) {
-		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
+		if (write_midx_bitmap(midx_name, midx_hash, &ctx,
+				      refs_snapshot, flags) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
 			goto cleanup;
@@ -1396,19 +1438,21 @@ static int write_midx_internal(const char *object_dir,
 
 int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
+		    const char *refs_snapshot,
 		    unsigned flags)
 {
 	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
-				   flags);
+				   refs_snapshot, flags);
 }
 
 int write_midx_file_only(const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
+			 const char *refs_snapshot,
 			 unsigned flags)
 {
 	return write_midx_internal(object_dir, packs_to_include, NULL,
-				   preferred_pack_name, flags);
+				   preferred_pack_name, refs_snapshot, flags);
 }
 
 struct clear_midx_data {
@@ -1688,7 +1732,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, NULL, flags);
 		m = NULL;
 	}
 
@@ -1879,7 +1923,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index aefa371c90..dc37b94ea1 100644
--- a/midx.h
+++ b/midx.h
@@ -62,8 +62,8 @@ int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pa
 int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
-int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
-int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
+int write_midx_file(const char *object_dir, const char *preferred_pack_name, const char *refs_snapshot, unsigned flags);
+int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, const char *refs_snapshot, unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 4ad7c2c969..069dab3e17 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -283,4 +283,86 @@ test_expect_success 'pack.preferBitmapTips' '
 	)
 '
 
+test_expect_success 'writing a bitmap with --refs-snapshot' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit one &&
+		test_commit two &&
+
+		git rev-parse one >snapshot &&
+
+		git repack -ad &&
+
+		# First, write a MIDX which see both refs/tags/one and
+		# refs/tags/two (causing both of those commits to receive
+		# bitmaps).
+		git multi-pack-index write --bitmap &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		grep "$(git rev-parse two)" bitmaps &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		# Then again, but with a refs snapshot which only sees
+		# refs/tags/one.
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		! grep "$(git rev-parse two)" bitmaps
+	)
+'
+
+test_expect_success 'write a bitmap with --refs-snapshot (preferred tips)' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git multi-pack-index write --bitmap &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		(
+			grep -vf before commits.raw &&
+			# mark missing commits as preferred
+			sed "s/^/+/" before
+		) >snapshot &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/8] builtin/repack.c: keep track of existing packs unconditionally
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (2 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11  3:32 ` [PATCH 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.

Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.

Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.

Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 49 ++++++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 27158a897b..e55a650de5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -98,8 +98,9 @@ static void remove_pack_on_signal(int signo)
  * have a corresponding .keep file. These packs are not to
  * be kept if we are going to pack everything into one file.
  */
-static void get_non_kept_pack_filenames(struct string_list *fname_list,
-					const struct string_list *extra_keep)
+static void collect_pack_filenames(struct string_list *fname_list,
+				   struct string_list *fname_kept_list,
+				   const struct string_list *extra_keep)
 {
 	DIR *dir;
 	struct dirent *e;
@@ -112,21 +113,20 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 		size_t len;
 		int i;
 
+		if (!strip_suffix(e->d_name, ".pack", &len))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(e->d_name, extra_keep->items[i].string))
 				break;
-		if (extra_keep->nr > 0 && i < extra_keep->nr)
-			continue;
-
-		if (!strip_suffix(e->d_name, ".pack", &len))
-			continue;
 
 		fname = xmemdupz(e->d_name, len);
 
-		if (!file_exists(mkpath("%s/%s.keep", packdir, fname)))
-			string_list_append_nodup(fname_list, fname);
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
+		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
+			string_list_append_nodup(fname_kept_list, fname);
 		else
-			free(fname);
+			string_list_append_nodup(fname_list, fname);
 	}
 	closedir(dir);
 }
@@ -440,6 +440,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
@@ -572,9 +573,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
+	collect_pack_filenames(&existing_packs, &existing_kept_packs,
+			       &keep_pack_list);
+
 	if (pack_everything & ALL_INTO_ONE) {
-		get_non_kept_pack_filenames(&existing_packs, &keep_pack_list);
-
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
@@ -683,17 +685,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
-		const int hexsz = the_hash_algo->hexsz;
 		int opts = 0;
-		string_list_sort(&names);
-		for_each_string_list_item(item, &existing_packs) {
-			char *sha1;
-			size_t len = strlen(item->string);
-			if (len < hexsz)
-				continue;
-			sha1 = item->string + len - hexsz;
-			if (!string_list_has_string(&names, sha1))
-				remove_redundant_pack(packdir, item->string);
+		if (pack_everything & ALL_INTO_ONE) {
+			const int hexsz = the_hash_algo->hexsz;
+			string_list_sort(&names);
+			for_each_string_list_item(item, &existing_packs) {
+				char *sha1;
+				size_t len = strlen(item->string);
+				if (len < hexsz)
+					continue;
+				sha1 = item->string + len - hexsz;
+				if (!string_list_has_string(&names, sha1))
+					remove_redundant_pack(packdir, item->string);
+			}
 		}
 
 		if (geometry) {
@@ -739,6 +743,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5/8] builtin/repack.c: extract showing progress to a variable
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (3 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11  3:32 ` [PATCH 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

We only ask whether stderr is a tty before calling
'prune_packed_objects()', but the subsequent patch will add another use.

Extract this check into a variable so that both can use it without
having to call 'isatty()' twice.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index e55a650de5..be470546e6 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -445,6 +445,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
+	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
 	int pack_everything = 0;
@@ -718,7 +719,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			}
 			strbuf_release(&buf);
 		}
-		if (!po_args.quiet && isatty(2))
+		if (!po_args.quiet && show_progress)
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 6/8] builtin/repack.c: support writing a MIDX while repacking
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (4 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11  3:32 ` [PATCH 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.

There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:

  - Call 'git multi-pack-index write' after running 'git repack', or

  - Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
    'git repack'.

The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).

Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.

Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.

This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").

There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  14 ++--
 builtin/repack.c             | 132 ++++++++++++++++++++++++++++++-----
 t/lib-midx.sh                |   8 +++
 t/t7700-repack.sh            |  96 +++++++++++++++++++++++++
 4 files changed, 227 insertions(+), 23 deletions(-)
 create mode 100644 t/lib-midx.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 24c00c9384..0f2d235ca5 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,7 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
 
 DESCRIPTION
 -----------
@@ -128,10 +128,11 @@ depth is 4095.
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
-	only makes sense when used with `-a` or `-A`, as the bitmaps
+	only makes sense when used with `-a`, `-A` or `-m`, as the bitmaps
 	must be able to refer to all reachable objects. This option
-	overrides the setting of `repack.writeBitmaps`.  This option
-	has no effect if multiple packfiles are created.
+	overrides the setting of `repack.writeBitmaps`. This option
+	has no effect if multiple packfiles are created, unless writing a
+	MIDX (in which case a multi-pack bitmap is created).
 
 --pack-kept-objects::
 	Include objects in `.keep` files when repacking.  Note that we
@@ -190,6 +191,11 @@ to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
 
+-m::
+--write-midx::
+	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
+	containing the non-redundant packs.
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index be470546e6..cb4292ab37 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -433,6 +433,76 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+static void midx_included_packs(struct string_list *include,
+				struct string_list *existing_packs,
+				struct string_list *existing_kept_packs,
+				struct string_list *names,
+				struct pack_geometry *geometry)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, existing_kept_packs) {
+		string_list_insert(include, xstrfmt("%s.idx", item->string));
+	}
+	for_each_string_list_item(item, names) {
+		string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
+	}
+	if (geometry) {
+		struct strbuf buf = STRBUF_INIT;
+		uint32_t i;
+		for (i = geometry->split; i < geometry->pack_nr; i++) {
+			struct packed_git *p = geometry->pack[i];
+
+			strbuf_addstr(&buf, pack_basename(p));
+			strbuf_strip_suffix(&buf, ".pack");
+			strbuf_addstr(&buf, ".idx");
+
+			string_list_insert(include, strbuf_detach(&buf, NULL));
+		}
+	} else {
+		for_each_string_list_item(item, existing_packs) {
+			if (item->util)
+				continue;
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
+	}
+}
+
+static int write_midx_included_packs(struct string_list *include,
+				     int show_progress, int write_bitmaps)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+
+	cmd.in = -1;
+	cmd.git_cmd = 1;
+
+	strvec_push(&cmd.args, "multi-pack-index");
+	strvec_pushl(&cmd.args, "write", "--stdin-packs", NULL);
+
+	if (show_progress)
+		strvec_push(&cmd.args, "--progress");
+	else
+		strvec_push(&cmd.args, "--no-progress");
+
+	if (write_bitmaps)
+		strvec_push(&cmd.args, "--bitmap");
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, include) {
+		fprintf(in, "%s\n", item->string);
+	}
+	fclose(in);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -456,6 +526,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
 	int geometric_factor = 0;
+	int write_midx = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -498,6 +569,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("do not repack this pack")),
 		OPT_INTEGER('g', "geometric", &geometric_factor,
 			    N_("find a geometric progression with factor <N>")),
+		OPT_BOOL('m', "write-midx", &write_midx,
+			   N_("write a multi-pack index of the resulting packs")),
 		OPT_END()
 	};
 
@@ -514,8 +587,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		die(_("--keep-unreachable and -A are incompatible"));
 
 	if (write_bitmaps < 0) {
-		if (!(pack_everything & ALL_INTO_ONE) ||
-		    !is_bare_repository())
+		if (!write_midx &&
+		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
 			write_bitmaps = 0;
 	} else if (write_bitmaps &&
 		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
@@ -525,7 +598,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps > 0;
 
-	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
 	if (geometric_factor) {
@@ -567,10 +640,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
-	if (write_bitmaps > 0)
-		strvec_push(&cmd.args, "--write-bitmap-index");
-	else if (write_bitmaps < 0)
-		strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	if (!write_midx) {
+		if (write_bitmaps > 0)
+			strvec_push(&cmd.args, "--write-bitmap-index");
+		else if (write_bitmaps < 0)
+			strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	}
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
@@ -683,22 +758,41 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	/* End of pack replacement. */
 
+	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
+		const int hexsz = the_hash_algo->hexsz;
+		string_list_sort(&names);
+		for_each_string_list_item(item, &existing_packs) {
+			char *sha1;
+			size_t len = strlen(item->string);
+			if (len < hexsz)
+				continue;
+			sha1 = item->string + len - hexsz;
+			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+		}
+	}
+
+	if (write_midx) {
+		struct string_list include = STRING_LIST_INIT_NODUP;
+		midx_included_packs(&include, &existing_packs,
+				    &existing_kept_packs, &names, geometry);
+
+		ret = write_midx_included_packs(&include,
+						show_progress, write_bitmaps > 0);
+
+		string_list_clear(&include, 0);
+
+		if (ret)
+			return ret;
+	}
+
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
 		int opts = 0;
-		if (pack_everything & ALL_INTO_ONE) {
-			const int hexsz = the_hash_algo->hexsz;
-			string_list_sort(&names);
-			for_each_string_list_item(item, &existing_packs) {
-				char *sha1;
-				size_t len = strlen(item->string);
-				if (len < hexsz)
-					continue;
-				sha1 = item->string + len - hexsz;
-				if (!string_list_has_string(&names, sha1))
-					remove_redundant_pack(packdir, item->string);
-			}
+		for_each_string_list_item(item, &existing_packs) {
+			if (!item->util)
+				continue;
+			remove_redundant_pack(packdir, item->string);
 		}
 
 		if (geometry) {
diff --git a/t/lib-midx.sh b/t/lib-midx.sh
new file mode 100644
index 0000000000..1261994744
--- /dev/null
+++ b/t/lib-midx.sh
@@ -0,0 +1,8 @@
+# test_midx_consistent <objdir>
+test_midx_consistent () {
+	ls $1/pack/pack-*.idx | xargs -n 1 basename | sort >expect &&
+	test-tool read-midx $1 | grep ^pack-.*\.idx$ | sort >actual &&
+
+	test_cmp expect actual &&
+	git multi-pack-index --object-dir=$1 verify
+}
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 98eda3bfeb..6792531dfd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -3,6 +3,8 @@
 test_description='git repack works correctly'
 
 . ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-bitmap.sh"
+. "${TEST_DIRECTORY}/lib-midx.sh"
 
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
@@ -234,4 +236,98 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success 'setup for --write-midx tests' '
+	git init midx &&
+	(
+		cd midx &&
+		git config core.multiPackIndex true &&
+
+		test_commit base
+	)
+'
+
+test_expect_success '--write-midx unchanged' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack &&
+		test_path_is_missing $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with a new pack' '
+	(
+		cd midx &&
+		test_commit loose &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -b' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -mb &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -d' '
+	(
+		cd midx &&
+		test_commit repack &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success 'cleans up MIDX when appropriate' '
+	(
+		cd midx &&
+
+		test_commit repack-2 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		checksum=$(midx_checksum $objdir) &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$checksum.bitmap &&
+		test_path_is_file $midx-$checksum.rev &&
+
+		test_commit repack-3 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-$checksum.bitmap &&
+		test_path_is_missing $midx-$checksum.rev &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+		test_path_is_file $midx-$(midx_checksum $objdir).rev &&
+
+		test_commit repack-4 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb &&
+
+		find $objdir/pack -type f -name "multi-pack-index*" >files &&
+		test_must_be_empty files
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 7/8] builtin/repack.c: make largest pack preferred
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (5 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11 10:17   ` Ævar Arnfjörð Bjarmason
  2021-09-11  3:32 ` [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
  8 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

When repacking into a geometric series and writing a multi-pack bitmap,
it is beneficial to have the largest resulting pack be the preferred
object source in the bitmap's MIDX, since selecting the large packs can
lead to fewer broken delta chains and better compression.

Teach 'git repack' to identify this pack and pass it to the MIDX write
machinery in order to mark it as preferred.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  4 ++++
 builtin/repack.c             | 23 ++++++++++++++++++++++-
 pack-bitmap.c                |  2 +-
 pack-bitmap.h                |  1 +
 t/helper/test-read-midx.c    | 25 ++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 23 +++++++++++++++++++++++
 6 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0f2d235ca5..7183fb498f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -190,6 +190,10 @@ this "roll-up", without respect to their reachability. This is subject
 to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
++
+When writing a multi-pack bitmap, `git repack` selects the largest resulting
+pack as the preferred pack for object selection by the MIDX (see
+linkgit:git-multi-pack-index[1]).
 
 -m::
 --write-midx::
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4292ab37..e958caff8b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -422,6 +422,13 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	geometry->split = split;
 }
 
+static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
+{
+	if (geometry->split == geometry->pack_nr)
+		return NULL;
+	return geometry->pack[geometry->pack_nr - 1];
+}
+
 static void clear_pack_geometry(struct pack_geometry *geometry)
 {
 	if (!geometry)
@@ -469,6 +476,7 @@ static void midx_included_packs(struct string_list *include,
 }
 
 static int write_midx_included_packs(struct string_list *include,
+				     struct pack_geometry *geometry,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -490,6 +498,19 @@ static int write_midx_included_packs(struct string_list *include,
 	if (write_bitmaps)
 		strvec_push(&cmd.args, "--bitmap");
 
+	if (geometry) {
+		struct packed_git *largest = get_largest_active_pack(geometry);
+		if (largest)
+			strvec_pushf(&cmd.args, "--preferred-pack=%s",
+				     pack_basename(largest));
+		else
+			/*
+			 * The largest pack was repacked, meaning only one pack
+			 * exists (and tautologically, it is the largest).
+			 */
+			;
+	}
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -776,7 +797,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		midx_included_packs(&include, &existing_packs,
 				    &existing_kept_packs, &names, geometry);
 
-		ret = write_midx_included_packs(&include,
+		ret = write_midx_included_packs(&include, geometry,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8504110a4d..67be9be9a6 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1418,7 +1418,7 @@ static int try_partial_reuse(struct packed_git *pack,
 	return 0;
 }
 
-static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 {
 	struct multi_pack_index *m = bitmap_git->midx;
 	if (!m)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 469090bad2..7d407c5a4c 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -55,6 +55,7 @@ int test_bitmap_commits(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 struct list_objects_filter_options *filter,
 					 int filter_provided_objects);
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
 				       uint32_t *entries,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index cb0d27049a..0038559129 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -3,6 +3,7 @@
 #include "midx.h"
 #include "repository.h"
 #include "object-store.h"
+#include "pack-bitmap.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -72,14 +73,36 @@ static int read_midx_checksum(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_preferred_pack(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmap_index *bitmap = NULL;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	bitmap = prepare_bitmap_git(the_repository);
+	if (!(bitmap && bitmap_is_midx(bitmap)))
+		return 1;
+
+
+	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
 	else if (!strcmp(argv[1], "--checksum"))
 		return read_midx_checksum(argv[2]);
+	else if (!strcmp(argv[1], "--preferred-pack"))
+		return read_midx_preferred_pack(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 5ccaa440e0..7bdeffa111 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -180,4 +180,27 @@ test_expect_success '--geometric ignores kept packs' '
 	)
 '
 
+test_expect_success '--geometric chooses largest MIDX preferred pack' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+		git config core.multiPackIndex true &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		ls $objdir/pack/pack-*.idx >before &&
+		test_commit_bulk --start=4 4 && # 12 objects
+		ls $objdir/pack/pack-*.idx >after &&
+
+		git repack --geometric 2 -dbm &&
+
+		comm -3 before after | xargs -n 1 basename >expect &&
+		test-tool read-midx --preferred-pack $objdir >actual &&
+
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (6 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
@ 2021-09-11  3:32 ` Taylor Blau
  2021-09-11 10:27   ` Ævar Arnfjörð Bjarmason
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
  8 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-11  3:32 UTC (permalink / raw)
  To: git; +Cc: peff

To prevent the race described in an earlier patch, generate and pass a
reference snapshot to the multi-pack bitmap code, if we are writing one
from `git repack`.

This patch is mostly limited to creating a temporary file, and then
calling for_each_ref(). Except we try to minimize duplicates, since
doing so can drastically reduce the size in network-of-forks style
repositories. In the kernel's fork network (the repository containing
all objects from the kernel and all its forks), deduplicating the
references drops the snapshot size from 934 MB to just 12 MB.

But since we're handling duplicates in this way, we have to make sure
that we preferred references (those listed in pack.preferBitmapTips)
before non-preferred ones (to avoid recording an object which is pointed
at by a preferred tip as non-preferred).

We accomplish this by doing separate passes over the references: first
visiting each prefix in pack.preferBitmapTips, and then over the rest of
the references.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index e958caff8b..3fda837cb5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -15,6 +15,8 @@
 #include "promisor-remote.h"
 #include "shallow.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -440,6 +442,61 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+struct midx_snapshot_ref_data {
+	struct tempfile *f;
+	struct oidset seen;
+	int preferred;
+};
+
+static int midx_snapshot_ref_one(const char *refname,
+				 const struct object_id *oid,
+				 int flag, void *_data)
+{
+	struct midx_snapshot_ref_data *data = _data;
+	struct object_id peeled;
+
+	if (!peel_iterated_oid(oid, &peeled))
+		oid = &peeled;
+
+	if (oidset_insert(&data->seen, oid))
+		return 0; /* already seen */
+
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
+		oid_to_hex(oid));
+
+	return 0;
+}
+
+static int midx_snapshot_refs(struct tempfile *f)
+{
+	struct midx_snapshot_ref_data data;
+	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
+
+	data.f = f;
+	oidset_init(&data.seen, 0);
+
+	if (!fdopen_tempfile(f, "w"))
+		 die(_("could not open tempfile %s for writing"),
+		     get_tempfile_path(f));
+
+	if (preferred) {
+		struct string_list_item *item;
+
+		data.preferred = 1;
+		for_each_string_list_item(item, preferred) {
+			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
+		}
+		data.preferred = 0;
+	}
+
+	for_each_ref(midx_snapshot_ref_one, &data);
+
+	return close_tempfile_gently(f);
+}
+
 static void midx_included_packs(struct string_list *include,
 				struct string_list *existing_packs,
 				struct string_list *existing_kept_packs,
@@ -477,6 +534,7 @@ static void midx_included_packs(struct string_list *include,
 
 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -511,6 +569,9 @@ static int write_midx_included_packs(struct string_list *include,
 			;
 	}
 
+	if (refs_snapshot)
+		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -534,6 +595,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
+	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
 	FILE *out;
 	int show_progress = isatty(2);
@@ -622,6 +684,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
+	if (write_midx && write_bitmaps) {
+		struct strbuf path = STRBUF_INIT;
+
+		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
+			    "bitmap-ref-tips");
+
+		refs_snapshot = xmks_tempfile(path.buf);
+		if (midx_snapshot_refs(refs_snapshot) < 0)
+			die(_("could not take a snapshot of references"));
+
+		strbuf_release(&path);
+	}
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("--geometric is incompatible with -A, -a"));
@@ -798,6 +873,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				    &existing_kept_packs, &names, geometry);
 
 		ret = write_midx_included_packs(&include, geometry,
+						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
-- 
2.33.0.96.g73915697e6

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly
  2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
@ 2021-09-11  5:00   ` Junio C Hamano
  2021-09-11 16:17     ` Taylor Blau
  2021-09-11 10:07   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 76+ messages in thread
From: Junio C Hamano @ 2021-09-11  5:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff

Taylor Blau <me@ttaylorr.com> writes:

>  	if (ends_with(file_name, ".idx")) {
>  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> -			return;
> +		if (ctx->m) {
> +			if (midx_contains_pack(ctx->m, file_name))
> +				return;
> +		} else if (ctx->to_include) {
> +			if (!string_list_has_string(ctx->to_include, file_name))
> +				return;

What's the expected number of elements on the to_include list?  I am
wondering about the performance implications of using linear search
over the string-list, of course.  Is it about the same order of the
number of packfiles in a repository (up to several dozens, or 1000
at most unless you are insane, or something like that)?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11  3:32 ` [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode Taylor Blau
@ 2021-09-11 10:05   ` Ævar Arnfjörð Bjarmason
  2021-09-11 16:25     ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-11 10:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff


On Fri, Sep 10 2021, Taylor Blau wrote:

> +	struct strbuf buf = STRBUF_INIT;
> +	while (strbuf_getline(&buf, stdin) != EOF) {
> +		string_list_append(to, strbuf_detach(&buf, NULL));

If you strbuf_detach() it...


> +		struct string_list packs = STRING_LIST_INIT_NODUP;

...init the list with NODUP...

> +		string_list_clear(&packs, 0);

...and call string_list_clear() you'll leak memory, since the
string_list will think that someone else is taking care of this memory.

I had some recent musings about how this API is bad, and I've got local
WIP patches to fix it, but in the meantime you need to:

    packs.strdup_strings = 1;

Before calling string_list_clear(). I.e. we didn't strdup(), but during
free() we pretend that we did, because we did, just not in
string_list_append().

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly
  2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
  2021-09-11  5:00   ` Junio C Hamano
@ 2021-09-11 10:07   ` Ævar Arnfjörð Bjarmason
  2021-09-11 16:21     ` Taylor Blau
  1 sibling, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-11 10:07 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff


On Fri, Sep 10 2021, Taylor Blau wrote:

> -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> -			return;
> +		if (ctx->m) {
> +			if (midx_contains_pack(ctx->m, file_name))
> +				return;
> +		} else if (ctx->to_include) {
> +			if (!string_list_has_string(ctx->to_include, file_name))
> +				return;
> +		}

I think this is equivalent to:

		if (ctx->m && midx_contains_pack(ctx->m, file_name))
			return;
		else if (!ctx->m && string_list_has_string(...))
			return;

Just a suggestion/tip to reduce the diff size/nested if/if blocks.


> +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);

Let's also line-wrap header changes like in the *.c file.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 7/8] builtin/repack.c: make largest pack preferred
  2021-09-11  3:32 ` [PATCH 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
@ 2021-09-11 10:17   ` Ævar Arnfjörð Bjarmason
  2021-09-11 16:35     ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-11 10:17 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff


On Fri, Sep 10 2021, Taylor Blau wrote:

> +	if (geometry) {
> +		struct packed_git *largest = get_largest_active_pack(geometry);
> +		if (largest)
> +			strvec_pushf(&cmd.args, "--preferred-pack=%s",
> +				     pack_basename(largest));
> +		else
> +			/*
> +			 * The largest pack was repacked, meaning only one pack
> +			 * exists (and tautologically, it is the largest).
> +			 */
> +			;
> +	}

Probably one of those cases where an assignment within an "if" is the
better of two evils, despite the CodingGuidelines warning against it
(but not forbidding it). The added get_largest_active_pack() could also
learn to punt on a NULL argument, in which case we wouldn't need an
assignment):

	struct packed_git *largest;
	[...]
        /* If ...[...] */
	if (geometry && (largest = get_largest_active_pack(geometry)))
		strvec_push(...);

I punted on re-phrasing the comment, because the current code invites
the reader to start reading this block, then we see that we may do
stuff, and then the comment applies to what we *didn't* do.

> -		usage("read-midx [--show-objects|--checksum] <object-dir>");
> +		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");

Just an aside, but I'm surprised to see this use the older
non-parse_options() usage API, which we're generally moving away
from. usage_msg_opt() is generally better.

> +		git config core.multiPackIndex true &&

I remember a similar pattern from another series, does this test pass
without the config being set? I didn't check...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11  3:32 ` [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
@ 2021-09-11 10:27   ` Ævar Arnfjörð Bjarmason
  2021-09-11 11:19     ` Ævar Arnfjörð Bjarmason
  2021-09-11 16:49     ` Taylor Blau
  0 siblings, 2 replies; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-11 10:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff


On Fri, Sep 10 2021, Taylor Blau wrote:

> +struct midx_snapshot_ref_data {
> +	struct tempfile *f;

Style/readability: Spare more than one byte for a variable name in a
struct, maybe file, or tmpfile?

> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> +		oid_to_hex(oid));

Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
"+OID\n"? Lends itself more naturally to extension, use with the likes
of string_list_split() etc.

> +		for_each_string_list_item(item, preferred) {
> +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> +		}

Cheap style commenst I know, but throughout this series there's an odd
mixture of sometimes adding braces for one-lines, sometimes not, or in
the case of that "else ;" I commented on not for something *much* bigger
:)

> +		if (midx_snapshot_refs(refs_snapshot) < 0)
> +			die(_("could not take a snapshot of references"));

Seems like this die() should be moved into the midx_snapshot_refs()
function and that function should use die_errno(), not die(), also it
uses close_tempfile_gently(), shouldn't save the errno on failure there,
clean up the tempfile, and then die?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11 10:27   ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 11:19     ` Ævar Arnfjörð Bjarmason
  2021-09-11 16:51       ` Taylor Blau
  2021-09-11 16:49     ` Taylor Blau
  1 sibling, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-11 11:19 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff


On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Sep 10 2021, Taylor Blau wrote:
>
>> +struct midx_snapshot_ref_data {
>> +	struct tempfile *f;
>
> Style/readability: Spare more than one byte for a variable name in a
> struct, maybe file, or tmpfile?
>
>> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
>> +		oid_to_hex(oid));
>
> Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> "+OID\n"? Lends itself more naturally to extension, use with the likes
> of string_list_split() etc.

Actually, even better a format like:

    "OID[\t+]\n"

Or

    "OID[\tpreferred=1]\n"

Then you can just use the strbuf API to eat the lines, and only parse
any optional \t-delimited "flags" if the length of the line you get
doesn't match the hash size length.

I don't think it matters for speed or whatever, just ease of parsing &
working with this new format.

The "+" v.s. "preferred=1" suggestion is just to make it future-proof
for comma-delimited key-value options, which we already use in the
protocol etc., so it's easy to parse, but I don't know if we're ever
likely to add a new flag here (and even if we are, we can use the next
"\t" positional...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly
  2021-09-11  5:00   ` Junio C Hamano
@ 2021-09-11 16:17     ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Fri, Sep 10, 2021 at 10:00:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> >  	if (ends_with(file_name, ".idx")) {
> >  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> > -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > -			return;
> > +		if (ctx->m) {
> > +			if (midx_contains_pack(ctx->m, file_name))
> > +				return;
> > +		} else if (ctx->to_include) {
> > +			if (!string_list_has_string(ctx->to_include, file_name))
> > +				return;
>
> What's the expected number of elements on the to_include list?  I am
> wondering about the performance implications of using linear search
> over the string-list, of course.  Is it about the same order of the
> number of packfiles in a repository (up to several dozens, or 1000
> at most unless you are insane, or something like that)?

You're definitely in the right ballpark. It depends on the repack
settings and size of repository, of course, but I imagine that roughly
1,000 entries would be the most anybody could ever pass (e.g., during a
`--geometric` repack, the biggest pack would have to contain 2^1000
times as many objects as the smallest pack).

Of course, you could just constantly be adding packs and doing
incremental `git repack -d --write-midx`. Seems unlikely to me, but if
it does become a problem we could easily read the values into a hashmap
and constant-ize the lookup.

But the scan is logarithmic, not linear, since the string list is
sorted.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly
  2021-09-11 10:07   ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 16:21     ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:21 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 12:07:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > -			return;
> > +		if (ctx->m) {
> > +			if (midx_contains_pack(ctx->m, file_name))
> > +				return;
> > +		} else if (ctx->to_include) {
> > +			if (!string_list_has_string(ctx->to_include, file_name))
> > +				return;
> > +		}
>
> I think this is equivalent to:
>
> 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> 			return;
> 		else if (!ctx->m && string_list_has_string(...))
> 			return;

They are equivalent, but it took me a minute to convince myself ;). They
are only equivalent because *either* ctx->m or ctx->to_include is
non-NULL.

So it wouldn't be the case that ctx->m being set and the subsequent
midx_contains_pack() call returning zero would cause us to check the
to_include list, because to_include will be NULL if ctx->m isn't.

Anyway, that makes them equivalent, so I'm happy to clean it up and
decrease the nested-ness.

> > +int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
>
> Let's also line-wrap header changes like in the *.c file.

Sure.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11 10:05   ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 16:25     ` Taylor Blau
  2021-09-11 16:28       ` Taylor Blau
                         ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > +	struct strbuf buf = STRBUF_INIT;
> > +	while (strbuf_getline(&buf, stdin) != EOF) {
> > +		string_list_append(to, strbuf_detach(&buf, NULL));
>
> If you strbuf_detach() it...
>
>
> > +		struct string_list packs = STRING_LIST_INIT_NODUP;
>
> ...init the list with NODUP...
>
> > +		string_list_clear(&packs, 0);
>
> ...and call string_list_clear() you'll leak memory, since the
> string_list will think that someone else is taking care of this memory.
>
> I had some recent musings about how this API is bad, and I've got local
> WIP patches to fix it, but in the meantime you need to:
>
>     packs.strdup_strings = 1;
>
> Before calling string_list_clear(). I.e. we didn't strdup(), but during
> free() we pretend that we did, because we did, just not in
> string_list_append().

Good catch. It's kind of gross, but the result is:

--- 8< ---

diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 77488b6b7b..e6cab975e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev)

 static void read_packs_from_stdin(struct string_list *to)
 {
-	struct strbuf buf = STRBUF_INIT;
+	struct strbuf buf = STRBUF_INIT_NODUP;
 	while (strbuf_getline(&buf, stdin) != EOF) {
 		string_list_append(to, strbuf_detach(&buf, NULL));
 	}
@@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		ret = write_midx_file_only(opts.object_dir, &packs,
 					   opts.preferred_pack, opts.flags);

+		/*
+		 * pretend strings are duplicated to free the memory allocated
+		 * by read_packs_from_stdin()
+		 */
+		packs.strdup_strings = 1;
 		string_list_clear(&packs, 0);

 		return ret;

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11 16:25     ` Taylor Blau
@ 2021-09-11 16:28       ` Taylor Blau
  2021-09-12  2:08       ` Eric Sunshine
  2021-09-14 19:02       ` Jeff King
  2 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote:
> Good catch. It's kind of gross, but the result is:
>
> --- 8< ---
>
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 77488b6b7b..e6cab975e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev)
>
>  static void read_packs_from_stdin(struct string_list *to)
>  {
> -	struct strbuf buf = STRBUF_INIT;
> +	struct strbuf buf = STRBUF_INIT_NODUP;

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 7/8] builtin/repack.c: make largest pack preferred
  2021-09-11 10:17   ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 16:35     ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 12:17:37PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > +	if (geometry) {
> > +		struct packed_git *largest = get_largest_active_pack(geometry);
> > +		if (largest)
> > +			strvec_pushf(&cmd.args, "--preferred-pack=%s",
> > +				     pack_basename(largest));
> > +		else
> > +			/*
> > +			 * The largest pack was repacked, meaning only one pack
> > +			 * exists (and tautologically, it is the largest).
> > +			 */
> > +			;
> > +	}
>
> Probably one of those cases where an assignment within an "if" is the
> better of two evils, despite the CodingGuidelines warning against it
> (but not forbidding it). The added get_largest_active_pack() could also
> learn to punt on a NULL argument, in which case we wouldn't need an
> assignment):
>
> 	struct packed_git *largest;
> 	[...]
>         /* If ...[...] */
> 	if (geometry && (largest = get_largest_active_pack(geometry)))
> 		strvec_push(...);
>
> I punted on re-phrasing the comment, because the current code invites
> the reader to start reading this block, then we see that we may do
> stuff, and then the comment applies to what we *didn't* do.

Of the two, I vastly prefer the form that doesn't require assignment
inside the conditional expression.

That comment could use an extra bit about how we don't handle
incremental repacks (since we don't bother looking at the size after
words in the case of `git repack -d` (which is neither all-into-one nor
geometric). Fixed both, thanks.

> > -		usage("read-midx [--show-objects|--checksum] <object-dir>");
> > +		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
>
> Just an aside, but I'm surprised to see this use the older
> non-parse_options() usage API, which we're generally moving away
> from. usage_msg_opt() is generally better.

Opportunity for future cleanup. I didn't see any point to change it in
the middle of an otherwise unrelated series.

> > +		git config core.multiPackIndex true &&
>
> I remember a similar pattern from another series, does this test pass
> without the config being set? I didn't check...

It does! This test was written long before 18e449f86b (midx: enable
core.multiPackIndex by default, 2020-09-25), but since 18e449f86b that
line is no longer necessary. Thanks for noticing!

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11 10:27   ` Ævar Arnfjörð Bjarmason
  2021-09-11 11:19     ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 16:49     ` Taylor Blau
  1 sibling, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 12:27:39PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Sep 10 2021, Taylor Blau wrote:
>
> > +struct midx_snapshot_ref_data {
> > +	struct tempfile *f;
>
> Style/readability: Spare more than one byte for a variable name in a
> struct, maybe file, or tmpfile?

Sure.

> > +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > +		oid_to_hex(oid));
>
> Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> "+OID\n"? Lends itself more naturally to extension, use with the likes
> of string_list_split() etc.
>
> > +		for_each_string_list_item(item, preferred) {
> > +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> > +		}
>
> Cheap style commenst I know, but throughout this series there's an odd
> mixture of sometimes adding braces for one-lines, sometimes not, or in
> the case of that "else ;" I commented on not for something *much* bigger
> :)

I've always been uncomfortable with brace-less for loops when the loop
is hidden behind a macro. I haven't bothered to check, but my gut
feeling is that it is pretty rare to see a for_each_something() macro
with a one-line body that doesn't have braces.

But my gut is wrong, because

    git grep 'for_each_.*($' **/*.c | wc -l

turns up 99 results! So I'm happy to be more consistent with the other
code (despite my general discomfort with it). An observation is that
even this file isn't consistent about braceless macro'd for loops, even
before this series.

So even cleaning it up will still leave this file in an inconsistent
state, but I don't think it's worth spending much time at all thinking
about cleaning it up.

>
> > +		if (midx_snapshot_refs(refs_snapshot) < 0)
> > +			die(_("could not take a snapshot of references"));
>
> Seems like this die() should be moved into the midx_snapshot_refs()
> function and that function should use die_errno(), not die(), also it
> uses close_tempfile_gently(), shouldn't save the errno on failure there,
> clean up the tempfile, and then die?

Yeah, makes sense.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11 11:19     ` Ævar Arnfjörð Bjarmason
@ 2021-09-11 16:51       ` Taylor Blau
  2021-09-14 18:55         ` Jeff King
  0 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-11 16:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, peff

On Sat, Sep 11, 2021 at 01:19:27PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Sat, Sep 11 2021, Ævar Arnfjörð Bjarmason wrote:
>
> > On Fri, Sep 10 2021, Taylor Blau wrote:
> >
> >> +struct midx_snapshot_ref_data {
> >> +	struct tempfile *f;
> >
> > Style/readability: Spare more than one byte for a variable name in a
> > struct, maybe file, or tmpfile?
> >
> >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> >> +		oid_to_hex(oid));
> >
> > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > of string_list_split() etc.
>
> Actually, even better a format like:
>
>     "OID[\t+]\n"
>
> Or
>
>     "OID[\tpreferred=1]\n"

Sure, but I admit that I'm a little torn on this suggestion. I don't
want to be naive and say that we're never going to change this format
and paint ourselves into a corner.

On the other hand, changing it does seem extremely unlikely to me, and
this tab-delimited thing feels like overkill compared to how simple the
'+' format is.

So, I don't know. It's certainly easy enough to change now before we
lock it in, so I guess we should.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11 16:25     ` Taylor Blau
  2021-09-11 16:28       ` Taylor Blau
@ 2021-09-12  2:08       ` Eric Sunshine
  2021-09-12  2:21         ` Taylor Blau
  2021-09-14 19:02       ` Jeff King
  2 siblings, 1 reply; 76+ messages in thread
From: Eric Sunshine @ 2021-09-12  2:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Ævar Arnfjörð Bjarmason, Git List, Jeff King

On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote:
> On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > Before calling string_list_clear(). I.e. we didn't strdup(), but during
> > free() we pretend that we did, because we did, just not in
> > string_list_append().
>
> Good catch. It's kind of gross, but the result is:
>
>  static void read_packs_from_stdin(struct string_list *to)
>  {
> -       struct strbuf buf = STRBUF_INIT;
> +       struct strbuf buf = STRBUF_INIT_NODUP;
>         while (strbuf_getline(&buf, stdin) != EOF) {
>                 string_list_append(to, strbuf_detach(&buf, NULL));
>         }
> @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
>                 ret = write_midx_file_only(opts.object_dir, &packs,
>                                            opts.preferred_pack, opts.flags);
>
> +               /*
> +                * pretend strings are duplicated to free the memory allocated
> +                * by read_packs_from_stdin()
> +                */
> +               packs.strdup_strings = 1;
>                 string_list_clear(&packs, 0);

An alternative is to do this:

    struct strbuf buf = STRBUF_INIT;
    ...
    while (...) {
        string_list_append_nodup(to, strbuf_detach(&buf, NULL));
        ...
    }
    ...
    string_list_clear(&packs, 0);

That is, use string_list_append_nodup(), which is documented as
existing precisely for this sort of use-case:

    Like string_list_append(), except string is never copied.  When
    list->strdup_strings is set, this function can be used to hand
    ownership of a malloc()ed string to list without making an extra
    copy.

(I mention this only for completeness but don't care strongly which
approach is used.)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-12  2:08       ` Eric Sunshine
@ 2021-09-12  2:21         ` Taylor Blau
  2021-09-12 15:15           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-12  2:21 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, Git List, Jeff King

On Sat, Sep 11, 2021 at 10:08:44PM -0400, Eric Sunshine wrote:
> On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote:
> > On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > > Before calling string_list_clear(). I.e. we didn't strdup(), but during
> > > free() we pretend that we did, because we did, just not in
> > > string_list_append().
> >
> > Good catch. It's kind of gross, but the result is:
> >
> >  static void read_packs_from_stdin(struct string_list *to)
> >  {
> > -       struct strbuf buf = STRBUF_INIT;
> > +       struct strbuf buf = STRBUF_INIT_NODUP;
> >         while (strbuf_getline(&buf, stdin) != EOF) {
> >                 string_list_append(to, strbuf_detach(&buf, NULL));
> >         }
> > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
> >                 ret = write_midx_file_only(opts.object_dir, &packs,
> >                                            opts.preferred_pack, opts.flags);
> >
> > +               /*
> > +                * pretend strings are duplicated to free the memory allocated
> > +                * by read_packs_from_stdin()
> > +                */
> > +               packs.strdup_strings = 1;
> >                 string_list_clear(&packs, 0);
>
> An alternative is to do this:
>
>     struct strbuf buf = STRBUF_INIT;
>     ...
>     while (...) {
>         string_list_append_nodup(to, strbuf_detach(&buf, NULL));
>         ...
>     }
>     ...
>     string_list_clear(&packs, 0);
>
> That is, use string_list_append_nodup(), which is documented as
> existing precisely for this sort of use-case:
>
>     Like string_list_append(), except string is never copied.  When
>     list->strdup_strings is set, this function can be used to hand
>     ownership of a malloc()ed string to list without making an extra
>     copy.
>
> (I mention this only for completeness but don't care strongly which
> approach is used.)

Thanks for a great suggestion; I do prefer using the existing APIs where
possible, and it seems like string_list_append_nodup() was designed
exactly for this case.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-12  2:21         ` Taylor Blau
@ 2021-09-12 15:15           ` Ævar Arnfjörð Bjarmason
  2021-09-12 22:30             ` Junio C Hamano
  0 siblings, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-12 15:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Eric Sunshine, Git List, Jeff King, Elijah Newren


On Sat, Sep 11 2021, Taylor Blau wrote:

> On Sat, Sep 11, 2021 at 10:08:44PM -0400, Eric Sunshine wrote:
>> On Sat, Sep 11, 2021 at 12:25 PM Taylor Blau <me@ttaylorr.com> wrote:
>> > On Sat, Sep 11, 2021 at 12:05:05PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> > > Before calling string_list_clear(). I.e. we didn't strdup(), but during
>> > > free() we pretend that we did, because we did, just not in
>> > > string_list_append().
>> >
>> > Good catch. It's kind of gross, but the result is:
>> >
>> >  static void read_packs_from_stdin(struct string_list *to)
>> >  {
>> > -       struct strbuf buf = STRBUF_INIT;
>> > +       struct strbuf buf = STRBUF_INIT_NODUP;
>> >         while (strbuf_getline(&buf, stdin) != EOF) {
>> >                 string_list_append(to, strbuf_detach(&buf, NULL));
>> >         }
>> > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
>> >                 ret = write_midx_file_only(opts.object_dir, &packs,
>> >                                            opts.preferred_pack, opts.flags);
>> >
>> > +               /*
>> > +                * pretend strings are duplicated to free the memory allocated
>> > +                * by read_packs_from_stdin()
>> > +                */
>> > +               packs.strdup_strings = 1;
>> >                 string_list_clear(&packs, 0);
>>
>> An alternative is to do this:
>>
>>     struct strbuf buf = STRBUF_INIT;
>>     ...
>>     while (...) {
>>         string_list_append_nodup(to, strbuf_detach(&buf, NULL));
>>         ...
>>     }
>>     ...
>>     string_list_clear(&packs, 0);
>>
>> That is, use string_list_append_nodup(), which is documented as
>> existing precisely for this sort of use-case:
>>
>>     Like string_list_append(), except string is never copied.  When
>>     list->strdup_strings is set, this function can be used to hand
>>     ownership of a malloc()ed string to list without making an extra
>>     copy.
>>
>> (I mention this only for completeness but don't care strongly which
>> approach is used.)
>
> Thanks for a great suggestion; I do prefer using the existing APIs where
> possible, and it seems like string_list_append_nodup() was designed
> exactly for this case.

I think Eric's suggestion is better in this case, maybe all cases.

For what it's worth the alternate WIP approach I was hinting at is some
version of this:
https://lore.kernel.org/git/87bl6kq631.fsf@evledraar.gmail.com/

I might change my mind, but I think ultimately running with move of the
string_list_append_nodup() approach Eric points out is probably the
right thing to do.

I.e. it's the difference between declaring that a string list is "dup'd"
upfront, and whenever we insert into it we make sure that's the case one
way or another. Then when we clear we don't need to pass any options.

It does mean that instead of extra clear functions we need to have
*_nodup() versions of all the utility functions for this to work,
e.g. in trying to convert this now there's no
string_list_insert_nodup(), which would be needed.

Or maybe the whole approch of the string_list API is just a dead-end in
API design, i.e. it shouldn't have any "dup" mode, just nodup, if you
need something dup'd you xstrdup()-it.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-12 15:15           ` Ævar Arnfjörð Bjarmason
@ 2021-09-12 22:30             ` Junio C Hamano
  2021-09-12 22:32               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 76+ messages in thread
From: Junio C Hamano @ 2021-09-12 22:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Eric Sunshine, Git List, Jeff King, Elijah Newren

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Or maybe the whole approch of the string_list API is just a dead-end in
> API design, i.e. it shouldn't have any "dup" mode, just nodup, if you
> need something dup'd you xstrdup()-it.

;-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-12 22:30             ` Junio C Hamano
@ 2021-09-12 22:32               ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-12 22:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Taylor Blau, Eric Sunshine, Git List, Jeff King, Elijah Newren


On Sun, Sep 12 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> Or maybe the whole approch of the string_list API is just a dead-end in
>> API design, i.e. it shouldn't have any "dup" mode, just nodup, if you
>> need something dup'd you xstrdup()-it.
>
> ;-)

I realize in practice that's a bit too much, i.e. it's very useful to be
able to use a string_list to account for a bunch of already-allocated
memory without strduping everything.

It's just unfortunate that you can't look at any code that works with
the API and understand whether it does an implicit xstrdup() or not
without tracking down its current state / where it got allocated at.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-11 16:51       ` Taylor Blau
@ 2021-09-14 18:55         ` Jeff King
  2021-09-14 23:34           ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Jeff King @ 2021-09-14 18:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Ævar Arnfjörð Bjarmason, git

On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:

> > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > >> +		oid_to_hex(oid));
> > >
> > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > > of string_list_split() etc.
> >
> > Actually, even better a format like:
> >
> >     "OID[\t+]\n"
> >
> > Or
> >
> >     "OID[\tpreferred=1]\n"
> 
> Sure, but I admit that I'm a little torn on this suggestion. I don't
> want to be naive and say that we're never going to change this format
> and paint ourselves into a corner.
> 
> On the other hand, changing it does seem extremely unlikely to me, and
> this tab-delimited thing feels like overkill compared to how simple the
> '+' format is.
> 
> So, I don't know. It's certainly easy enough to change now before we
> lock it in, so I guess we should.

I'm not sure I really see the point of making this infinitely
extensible. This is mostly an internal interface between two Git
programs. Sure, it's exposed to the user in the sense that they can use
--refs-snapshot themselves. But if some writer wants to add a "foo"
flag, do they really want to be able to do it in a way that they're
_syntactically_ compatible with the older versions of Git, yet have no
clue if their option was actually recognized and accepted?

I.e., if we later add such a flag, then:

  +OID\tfoo
  +\tOID\tfoo
  OID\t+\tfoo

does not seem to make a big difference. We will likely add a new
command-line option to say "by the way, I am passing you some extra
information" at which point the format can be whatever we like.

And unless it is really complicated, I'd just as soon use a
single-character marker again, because it's dead-simple to generate and
parse (and again, this is an internal interface and not something we
expect users to have to understand).

Unless we think the "+" is too ugly or confusing, but I don't. The "-"
prefix for pack-objects read_object_list_from_stdin() is similar and
works just fine. For that matter, the --stdin-packs feature there works
similarly. Everything more feels a bit like over-engineering at this
point.

-Peff

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-11 16:25     ` Taylor Blau
  2021-09-11 16:28       ` Taylor Blau
  2021-09-12  2:08       ` Eric Sunshine
@ 2021-09-14 19:02       ` Jeff King
  2021-09-14 23:48         ` Taylor Blau
  2 siblings, 1 reply; 76+ messages in thread
From: Jeff King @ 2021-09-14 19:02 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Ævar Arnfjörð Bjarmason, git

On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote:

> > Before calling string_list_clear(). I.e. we didn't strdup(), but during
> > free() we pretend that we did, because we did, just not in
> > string_list_append().
> 
> Good catch. It's kind of gross, but the result is:
> 
> --- 8< ---
> 
> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 77488b6b7b..e6cab975e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev)
> 
>  static void read_packs_from_stdin(struct string_list *to)
>  {
> -	struct strbuf buf = STRBUF_INIT;
> +	struct strbuf buf = STRBUF_INIT_NODUP;

Did you mean to use STRING_LIST_INIT_NODUP on the string-list
declaration?

>  	while (strbuf_getline(&buf, stdin) != EOF) {
>  		string_list_append(to, strbuf_detach(&buf, NULL));
>  	}
> @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
>  		ret = write_midx_file_only(opts.object_dir, &packs,
>  					   opts.preferred_pack, opts.flags);
> 
> +		/*
> +		 * pretend strings are duplicated to free the memory allocated
> +		 * by read_packs_from_stdin()
> +		 */
> +		packs.strdup_strings = 1;
>  		string_list_clear(&packs, 0);
> 
>  		return ret;

I think the root of the problem here is the non-idiomatic use of
strbuf_getline(). The usual thing (and in fact the thing done by the
quite-similar code in read_packs_list_from_stdin() in pack-objects.c)
is not to detach, because strbuf_getline() will reset the buffer each
time. I.e.:

  struct string_list to = STRING_LIST_INIT_DUP;
  ...
  struct strbuf buf = STRBUF
  while (strbuf_getline(&buf, stdin) != EOF)
	string_list_append(to, &buf);
  strbuf_release(&buf);

That avoids any clever string-list allocation games. The number of heap
allocations is about the same (one strbuf and N list items, versus N
strbufs and 0 list items). There's a little extra copying (from the
strbuf into the list items), but the strings themselves are more
efficiently allocated (strbuf may over-allocate, and we lock in
that choice forever by handing over the string).

Not that efficiency probably matters either way for this spot. I'd do it
this way because it's simple and idiomatic for our code base.

-Peff

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-14 18:55         ` Jeff King
@ 2021-09-14 23:34           ` Taylor Blau
  2021-09-14 23:56             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-14 23:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, git

On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote:
> On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:
>
> > > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
> > > >> +		oid_to_hex(oid));
> > > >
> > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
> > > > "+OID\n"? Lends itself more naturally to extension, use with the likes
> > > > of string_list_split() etc.
> > >
> > > Actually, even better a format like:
> > >
> > >     "OID[\t+]\n"
> > >
> > > Or
> > >
> > >     "OID[\tpreferred=1]\n"
> >
> > Sure, but I admit that I'm a little torn on this suggestion. I don't
> > want to be naive and say that we're never going to change this format
> > and paint ourselves into a corner.
> >
> > On the other hand, changing it does seem extremely unlikely to me, and
> > this tab-delimited thing feels like overkill compared to how simple the
> > '+' format is.
> >
> > So, I don't know. It's certainly easy enough to change now before we
> > lock it in, so I guess we should.
>
> I'm not sure I really see the point of making this infinitely
> extensible. This is mostly an internal interface between two Git
> programs. Sure, it's exposed to the user in the sense that they can use
> --refs-snapshot themselves. But if some writer wants to add a "foo"
> flag, do they really want to be able to do it in a way that they're
> _syntactically_ compatible with the older versions of Git, yet have no
> clue if their option was actually recognized and accepted?

I like this perspective, and tend to agree.

(I'm basically repeating what you're saying but) it seems to me that
trying to make the interface of `--refs-snapshot` compatible across
versions of Git is stretching what we consider to be our compatibility
guarantee given that:

  - the interface is basically expected to be private between `repack`
    and `multi-pack-index`, and

  - an infinitely-extensible version of `--refs-snapshot` would still
    have no way to tell the caller whether or not those new flags were
    accepted

So I tend to agree the existing format works fine and we shouldn't spend
a ton of time trying to over-engineer a solution.

(FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to
_exclude_ packs, but `+` here means "this one is special" not "exclude
this ref").

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-14 19:02       ` Jeff King
@ 2021-09-14 23:48         ` Taylor Blau
  2021-09-15  1:55           ` Eric Sunshine
  0 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-14 23:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, git

On Tue, Sep 14, 2021 at 03:02:38PM -0400, Jeff King wrote:
> On Sat, Sep 11, 2021 at 12:25:05PM -0400, Taylor Blau wrote:
>
> > > Before calling string_list_clear(). I.e. we didn't strdup(), but during
> > > free() we pretend that we did, because we did, just not in
> > > string_list_append().
> >
> > Good catch. It's kind of gross, but the result is:
> >
> > --- 8< ---
> >
> > diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> > index 77488b6b7b..e6cab975e3 100644
> > --- a/builtin/multi-pack-index.c
> > +++ b/builtin/multi-pack-index.c
> > @@ -64,7 +64,7 @@ static struct option *add_common_options(struct option *prev)
> >
> >  static void read_packs_from_stdin(struct string_list *to)
> >  {
> > -	struct strbuf buf = STRBUF_INIT;
> > +	struct strbuf buf = STRBUF_INIT_NODUP;
>
> Did you mean to use STRING_LIST_INIT_NODUP on the string-list
> declaration?

Yeah, and I thought that I even mentioned it in [1], but I apparently
sent the contents of my buffer before I saved it. So, I did notice it at
the time, but failed to tell anybody about it!

[1]: https://lore.kernel.org/git/YTzZPti%2FasQwZC%2FD@nand.local/

> >  	while (strbuf_getline(&buf, stdin) != EOF) {
> >  		string_list_append(to, strbuf_detach(&buf, NULL));
> >  	}
> > @@ -107,6 +107,11 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
> >  		ret = write_midx_file_only(opts.object_dir, &packs,
> >  					   opts.preferred_pack, opts.flags);
> >
> > +		/*
> > +		 * pretend strings are duplicated to free the memory allocated
> > +		 * by read_packs_from_stdin()
> > +		 */
> > +		packs.strdup_strings = 1;
> >  		string_list_clear(&packs, 0);
> >
> >  		return ret;
>
> I think the root of the problem here is the non-idiomatic use of
> strbuf_getline(). The usual thing (and in fact the thing done by the
> quite-similar code in read_packs_list_from_stdin() in pack-objects.c)
> is not to detach, because strbuf_getline() will reset the buffer each
> time. I.e.:
>
>   struct string_list to = STRING_LIST_INIT_DUP;
>   ...
>   struct strbuf buf = STRBUF
>   while (strbuf_getline(&buf, stdin) != EOF)
> 	string_list_append(to, &buf);

(Assuming that you meant `s/&buf/buf.buf/`, but otherwise this makes
sense). There is no need to call strbuf_reset() inside the body of the
loop, since strbuf_getline() calls it via strbuf_getwholeline().

>   strbuf_release(&buf);

...But we should still be careful to release the memory used by the
strbuf at the end (which is safe to do, since the whole point is that we
copy each buffer linewise into the string_list).

> That avoids any clever string-list allocation games. The number of heap
> allocations is about the same (one strbuf and N list items, versus N
> strbufs and 0 list items). There's a little extra copying (from the
> strbuf into the list items), but the strings themselves are more
> efficiently allocated (strbuf may over-allocate, and we lock in
> that choice forever by handing over the string).
>
> Not that efficiency probably matters either way for this spot. I'd do it
> this way because it's simple and idiomatic for our code base.

More than anything, I'm persuaded by that (though I was quite partial to
Eric's suggestion, which I found very clever).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-14 23:34           ` Taylor Blau
@ 2021-09-14 23:56             ` Ævar Arnfjörð Bjarmason
  2021-09-15  4:31               ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-09-14 23:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, git


On Tue, Sep 14 2021, Taylor Blau wrote:

> On Tue, Sep 14, 2021 at 02:55:03PM -0400, Jeff King wrote:
>> On Sat, Sep 11, 2021 at 12:51:24PM -0400, Taylor Blau wrote:
>>
>> > > >> +	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
>> > > >> +		oid_to_hex(oid));
>> > > >
>> > > > Just an idea: Maybe the file could be lines of "+\tOID\n" instead of
>> > > > "+OID\n"? Lends itself more naturally to extension, use with the likes
>> > > > of string_list_split() etc.
>> > >
>> > > Actually, even better a format like:
>> > >
>> > >     "OID[\t+]\n"
>> > >
>> > > Or
>> > >
>> > >     "OID[\tpreferred=1]\n"
>> >
>> > Sure, but I admit that I'm a little torn on this suggestion. I don't
>> > want to be naive and say that we're never going to change this format
>> > and paint ourselves into a corner.
>> >
>> > On the other hand, changing it does seem extremely unlikely to me, and
>> > this tab-delimited thing feels like overkill compared to how simple the
>> > '+' format is.
>> >
>> > So, I don't know. It's certainly easy enough to change now before we
>> > lock it in, so I guess we should.
>>
>> I'm not sure I really see the point of making this infinitely
>> extensible. This is mostly an internal interface between two Git
>> programs. Sure, it's exposed to the user in the sense that they can use
>> --refs-snapshot themselves. But if some writer wants to add a "foo"
>> flag, do they really want to be able to do it in a way that they're
>> _syntactically_ compatible with the older versions of Git, yet have no
>> clue if their option was actually recognized and accepted?
>
> I like this perspective, and tend to agree.
>
> (I'm basically repeating what you're saying but) it seems to me that
> trying to make the interface of `--refs-snapshot` compatible across
> versions of Git is stretching what we consider to be our compatibility
> guarantee given that:
>
>   - the interface is basically expected to be private between `repack`
>     and `multi-pack-index`, and
>
>   - an infinitely-extensible version of `--refs-snapshot` would still
>     have no way to tell the caller whether or not those new flags were
>     accepted
>
> So I tend to agree the existing format works fine and we shouldn't spend
> a ton of time trying to over-engineer a solution.
>
> (FWIW, the `+` vs `-` thing is intentional; `--stdin-packs` uses `-` to
> _exclude_ packs, but `+` here means "this one is special" not "exclude
> this ref").

The suggestion I had offhand was just meant as an offhand "interesting,
maybe easier like...", i.e. if you found it easier to split on \t, or
check the hash size in the loop or whatever.

I agree that this doesn't matter and should just be left to whatever
you've got a taste for, thanks. This thread became more of a digression
than I thought...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode
  2021-09-14 23:48         ` Taylor Blau
@ 2021-09-15  1:55           ` Eric Sunshine
  0 siblings, 0 replies; 76+ messages in thread
From: Eric Sunshine @ 2021-09-15  1:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Git List

On Tue, Sep 14, 2021 at 7:48 PM Taylor Blau <me@ttaylorr.com> wrote:
> On Tue, Sep 14, 2021 at 03:02:38PM -0400, Jeff King wrote:
> > I think the root of the problem here is the non-idiomatic use of
> > strbuf_getline(). The usual thing (and in fact the thing done by the
> > quite-similar code in read_packs_list_from_stdin() in pack-objects.c)
> > is not to detach, because strbuf_getline() will reset the buffer each
> > time. I.e.:
> >
> >   struct string_list to = STRING_LIST_INIT_DUP;
> >   ...
> >   struct strbuf buf = STRBUF
> >   while (strbuf_getline(&buf, stdin) != EOF)
> >       string_list_append(to, &buf);
>
> > That avoids any clever string-list allocation games. The number of heap
> > allocations is about the same (one strbuf and N list items, versus N
> > strbufs and 0 list items). There's a little extra copying (from the
> > strbuf into the list items), but the strings themselves are more
> > efficiently allocated (strbuf may over-allocate, and we lock in
> > that choice forever by handing over the string).
> >
> > Not that efficiency probably matters either way for this spot. I'd do it
> > this way because it's simple and idiomatic for our code base.
>
> More than anything, I'm persuaded by that (though I was quite partial to
> Eric's suggestion, which I found very clever).

For what it's worth, Peff's suggestion seems best of all those
presented thus far.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-14 23:56             ` Ævar Arnfjörð Bjarmason
@ 2021-09-15  4:31               ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15  4:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Taylor Blau, Jeff King, git

On Wed, Sep 15, 2021 at 01:56:00AM +0200, Ævar Arnfjörð Bjarmason wrote:
> The suggestion I had offhand was just meant as an offhand "interesting,
> maybe easier like...", i.e. if you found it easier to split on \t, or
> check the hash size in the loop or whatever.

:-). You did say "Just an idea", but I treated your suggestion as if you
were concerned about being able to extend the format in the future.

> I agree that this doesn't matter and should just be left to whatever
> you've got a taste for, thanks. This thread became more of a digression
> than I thought...

I think that all of the discussion was reasonable if I understood your
original opinion more clearly ;). It may help (me, at least) to say more
explicitly, "This may be easier like [...] but I don't have any concerns
about functionality" or similar.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v2 0/8] repack: introduce `--write-midx`
  2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
                   ` (7 preceding siblings ...)
  2021-09-11  3:32 ` [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
@ 2021-09-15 18:24 ` Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly Taylor Blau
                     ` (9 more replies)
  8 siblings, 10 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

Here is a small reroll of my series to implement the final component of
multi-pack reachability bitamps, which is to be able to write them from `git
repack`.

Nothing substantial has changed since last time. Mostly re-wrapping lines and
removing braces in macro'd for_each_xyz() loops. A larger (but still small)
change is how we read input to `--refs-snapshot`. This version contains a more
idiomatic implementation, but see the sub-thread beginning at [1] for a more
complete discussion.

Otherwise, this series is unchagned. It depends on tb/multi-pack-bitmaps, which
has graduated to master. Range-diff below:

[1]: https://lore.kernel.org/git/2a16f11790b79ab452233b6f28acac607c0abd28.1631331139.git.me@ttaylorr.com/

Taylor Blau (8):
  midx: expose `write_midx_file_only()` publicly
  builtin/multi-pack-index.c: support `--stdin-packs` mode
  midx: preliminary support for `--refs-snapshot`
  builtin/repack.c: keep track of existing packs unconditionally
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps

 Documentation/git-multi-pack-index.txt |  19 ++
 Documentation/git-repack.txt           |  18 +-
 builtin/multi-pack-index.c             |  36 +++-
 builtin/repack.c                       | 255 ++++++++++++++++++++++---
 midx.c                                 |  96 ++++++++--
 midx.h                                 |  11 +-
 pack-bitmap.c                          |   2 +-
 pack-bitmap.h                          |   1 +
 t/helper/test-read-midx.c              |  25 ++-
 t/lib-midx.sh                          |   8 +
 t/t5319-multi-pack-index.sh            |  15 ++
 t/t5326-multi-pack-bitmaps.sh          |  82 ++++++++
 t/t7700-repack.sh                      |  96 ++++++++++
 t/t7703-repack-geometric.sh            |  22 +++
 14 files changed, 637 insertions(+), 49 deletions(-)
 create mode 100644 t/lib-midx.sh

Range-diff against v1:
1:  6d50f46080 ! 1:  03c1b2c4d3 midx: expose 'write_midx_file_only()' publicly
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    midx: expose 'write_midx_file_only()' publicly
    +    midx: expose `write_midx_file_only()` publicly
     
         Expose a variant of the write_midx_file() function which ignores packs
         that aren't included in an explicit "allow" list.
    @@ midx.c: struct write_midx_context {
      
      static void add_pack_to_midx(const char *full_path, size_t full_path_len,
     @@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    - 
    - 	if (ends_with(file_name, ".idx")) {
      		display_progress(ctx->progress, ++ctx->pack_paths_checked);
    --		if (ctx->m && midx_contains_pack(ctx->m, file_name))
    --			return;
    -+		if (ctx->m) {
    -+			if (midx_contains_pack(ctx->m, file_name))
    -+				return;
    -+		} else if (ctx->to_include) {
    -+			if (!string_list_has_string(ctx->to_include, file_name))
    -+				return;
    -+		}
    + 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
    + 			return;
    ++		else if (ctx->to_include &&
    ++			 !string_list_has_string(ctx->to_include, file_name))
    ++			return;
      
      		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
      
    @@ midx.h: int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pa
      int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
      
      int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
    -+int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
    ++int write_midx_file_only(const char *object_dir,
    ++			 struct string_list *packs_to_include,
    ++			 const char *preferred_pack_name,
    ++			 unsigned flags);
      void clear_midx_file(struct repository *r);
      int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
      int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
2:  66aa7f4b48 ! 2:  59556e5545 builtin/multi-pack-index.c: support --stdin-packs mode
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>
     
      ## Commit message ##
    -    builtin/multi-pack-index.c: support --stdin-packs mode
    +    builtin/multi-pack-index.c: support `--stdin-packs` mode
     
         To power a new `--write-midx` mode, `git repack` will want to write a
         multi-pack index containing a certain set of packs in the repository.
    @@ builtin/multi-pack-index.c: static struct option *add_common_options(struct opti
     +static void read_packs_from_stdin(struct string_list *to)
     +{
     +	struct strbuf buf = STRBUF_INIT;
    -+	while (strbuf_getline(&buf, stdin) != EOF) {
    -+		string_list_append(to, strbuf_detach(&buf, NULL));
    -+	}
    ++	while (strbuf_getline(&buf, stdin) != EOF)
    ++		string_list_append(to, buf.buf);
     +	string_list_sort(to);
    ++
    ++	strbuf_release(&buf);
     +}
     +
      static int cmd_multi_pack_index_write(int argc, const char **argv)
    @@ builtin/multi-pack-index.c: static int cmd_multi_pack_index_write(int argc, cons
      	FREE_AND_NULL(options);
      
     +	if (opts.stdin_packs) {
    -+		struct string_list packs = STRING_LIST_INIT_NODUP;
    ++		struct string_list packs = STRING_LIST_INIT_DUP;
     +		int ret;
     +
     +		read_packs_from_stdin(&packs);
3:  f74a967be3 ! 3:  42f1ae9ede midx: preliminary support for `--refs-snapshot`
    @@ midx.h: int fill_midx_entry(struct repository *r, const struct object_id *oid, s
      int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
      
     -int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
    --int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, unsigned flags);
    -+int write_midx_file(const char *object_dir, const char *preferred_pack_name, const char *refs_snapshot, unsigned flags);
    -+int write_midx_file_only(const char *object_dir, struct string_list *packs_to_include, const char *preferred_pack_name, const char *refs_snapshot, unsigned flags);
    ++int write_midx_file(const char *object_dir,
    ++		    const char *preferred_pack_name,
    ++		    const char *refs_snapshot,
    ++		    unsigned flags);
    + int write_midx_file_only(const char *object_dir,
    + 			 struct string_list *packs_to_include,
    + 			 const char *preferred_pack_name,
    ++			 const char *refs_snapshot,
    + 			 unsigned flags);
      void clear_midx_file(struct repository *r);
      int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
    - int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
     
      ## t/t5326-multi-pack-bitmaps.sh ##
     @@ t/t5326-multi-pack-bitmaps.sh: test_expect_success 'pack.preferBitmapTips' '
4:  5067c2cc65 = 4:  c0d045a9de builtin/repack.c: keep track of existing packs unconditionally
5:  d5990e3b69 = 5:  09de03de47 builtin/repack.c: extract showing progress to a variable
6:  d32d800954 ! 6:  83dfdb8b12 builtin/repack.c: support writing a MIDX while repacking
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +{
     +	struct string_list_item *item;
     +
    -+	for_each_string_list_item(item, existing_kept_packs) {
    ++	for_each_string_list_item(item, existing_kept_packs)
     +		string_list_insert(include, xstrfmt("%s.idx", item->string));
    -+	}
    -+	for_each_string_list_item(item, names) {
    ++	for_each_string_list_item(item, names)
     +		string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
    -+	}
     +	if (geometry) {
     +		struct strbuf buf = STRBUF_INIT;
     +		uint32_t i;
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +		return ret;
     +
     +	in = xfdopen(cmd.in, "w");
    -+	for_each_string_list_item(item, include) {
    ++	for_each_string_list_item(item, include)
     +		fprintf(in, "%s\n", item->string);
    -+	}
     +	fclose(in);
     +
     +	return finish_command(&cmd);
7:  515e1d95a6 ! 7:  68bc49d8ae builtin/repack.c: make largest pack preferred
    @@ builtin/repack.c: static void split_pack_geometry(struct pack_geometry *geometry
      
     +static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
     +{
    ++	if (!geometry) {
    ++		/*
    ++		 * No geometry means either an all-into-one repack (in which
    ++		 * case there is only one pack left and it is the largest) or an
    ++		 * incremental one.
    ++		 *
    ++		 * If repacking incrementally, then we could check the size of
    ++		 * all packs to determine which should be preferred, but leave
    ++		 * this for later.
    ++		 */
    ++		return NULL;
    ++	}
     +	if (geometry->split == geometry->pack_nr)
     +		return NULL;
     +	return geometry->pack[geometry->pack_nr - 1];
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      				     int show_progress, int write_bitmaps)
      {
      	struct child_process cmd = CHILD_PROCESS_INIT;
    + 	struct string_list_item *item;
    ++	struct packed_git *largest = get_largest_active_pack(geometry);
    + 	FILE *in;
    + 	int ret;
    + 
     @@ builtin/repack.c: static int write_midx_included_packs(struct string_list *include,
      	if (write_bitmaps)
      		strvec_push(&cmd.args, "--bitmap");
      
    -+	if (geometry) {
    -+		struct packed_git *largest = get_largest_active_pack(geometry);
    -+		if (largest)
    -+			strvec_pushf(&cmd.args, "--preferred-pack=%s",
    -+				     pack_basename(largest));
    -+		else
    -+			/*
    -+			 * The largest pack was repacked, meaning only one pack
    -+			 * exists (and tautologically, it is the largest).
    -+			 */
    -+			;
    -+	}
    ++	if (largest)
    ++		strvec_pushf(&cmd.args, "--preferred-pack=%s",
    ++			     pack_basename(largest));
     +
      	ret = start_command(&cmd);
      	if (ret)
    @@ t/t7703-repack-geometric.sh: test_expect_success '--geometric ignores kept packs
     +	test_when_finished "rm -fr geometric" &&
     +	(
     +		cd geometric &&
    -+		git config core.multiPackIndex true &&
     +
     +		# These packs already form a geometric progression.
     +		test_commit_bulk --start=1 1 && # 3 objects
8:  74a9da0ef0 ! 8:  eb24b308ec builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +	return 0;
     +}
     +
    -+static int midx_snapshot_refs(struct tempfile *f)
    ++static void midx_snapshot_refs(struct tempfile *f)
     +{
     +	struct midx_snapshot_ref_data data;
     +	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +		struct string_list_item *item;
     +
     +		data.preferred = 1;
    -+		for_each_string_list_item(item, preferred) {
    ++		for_each_string_list_item(item, preferred)
     +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
    -+		}
     +		data.preferred = 0;
     +	}
     +
     +	for_each_ref(midx_snapshot_ref_one, &data);
     +
    -+	return close_tempfile_gently(f);
    ++	if (close_tempfile_gently(f)) {
    ++		int save_errno = errno;
    ++		delete_tempfile(&f);
    ++		errno = save_errno;
    ++		die_errno(_("could not close refs snapshot tempfile"));
    ++	}
     +}
     +
      static void midx_included_packs(struct string_list *include,
    @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      {
      	struct child_process cmd = CHILD_PROCESS_INIT;
     @@ builtin/repack.c: static int write_midx_included_packs(struct string_list *include,
    - 			;
    - 	}
    + 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
    + 			     pack_basename(largest));
      
     +	if (refs_snapshot)
     +		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +			    "bitmap-ref-tips");
     +
     +		refs_snapshot = xmks_tempfile(path.buf);
    -+		if (midx_snapshot_refs(refs_snapshot) < 0)
    -+			die(_("could not take a snapshot of references"));
    ++		midx_snapshot_refs(refs_snapshot);
     +
     +		strbuf_release(&path);
     +	}
-- 
2.33.0.96.g73915697e6

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-22 23:14     ` Jonathan Tan
  2021-09-15 18:24   ` [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 44 +++++++++++++++++++++++++++++++++++---------
 midx.h |  5 +++++
 2 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 864034a6ad..0330202fda 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,8 @@ struct write_midx_context {
 	uint32_t num_large_offsets;
 
 	int preferred_pack_idx;
+
+	struct string_list *to_include;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -486,6 +488,9 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
+		else if (ctx->to_include &&
+			 !string_list_has_string(ctx->to_include, file_name))
+			return;
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -1058,6 +1063,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 }
 
 static int write_midx_internal(const char *object_dir,
+			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -1082,10 +1088,17 @@ static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);
 
-	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
-		if (!strcmp(object_dir, cur->object_dir)) {
-			ctx.m = cur;
-			break;
+	if (!packs_to_include) {
+		/*
+		 * Only reference an existing MIDX when not filtering which
+		 * packs to include, since all packs and objects are copied
+		 * blindly from an existing MIDX if one is present.
+		 */
+		for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir)) {
+				ctx.m = cur;
+				break;
+			}
 		}
 	}
 
@@ -1136,10 +1149,13 @@ static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;
 
+	ctx.to_include = packs_to_include;
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
+	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
 		int bitmap_exists;
 		int want_bitmap = flags & MIDX_WRITE_BITMAP;
@@ -1237,7 +1253,7 @@ static int write_midx_internal(const char *object_dir,
 
 	QSORT(ctx.info, ctx.nr, pack_info_compare);
 
-	if (packs_to_drop && packs_to_drop->nr) {
+	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;
 
@@ -1380,7 +1396,17 @@ int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
+	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
+				   flags);
+}
+
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags)
+{
+	return write_midx_internal(object_dir, packs_to_include, NULL,
+				   preferred_pack_name, flags);
 }
 
 struct clear_midx_data {
@@ -1660,7 +1686,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
 		m = NULL;
 	}
 
@@ -1851,7 +1877,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index aa3da557bb..80f502d39b 100644
--- a/midx.h
+++ b/midx.h
@@ -2,6 +2,7 @@
 #define MIDX_H
 
 #include "repository.h"
+#include "string-list.h"
 
 struct object_id;
 struct pack_entry;
@@ -62,6 +63,10 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-22 22:29     ` Josh Steadmon
  2021-09-22 23:11     ` Jonathan Tan
  2021-09-15 18:24   ` [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
                     ` (7 subsequent siblings)
  9 siblings, 2 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

To power a new `--write-midx` mode, `git repack` will want to write a
multi-pack index containing a certain set of packs in the repository.

This new option will be used by `git repack` to write a MIDX which
contains only the packs which will survive after the repack (that is, it
will exclude any packs which are about to be deleted).

This patch effectively exposes the function implemented in the previous
commit via the `git multi-pack-index` builtin. An alternative approach
would have been to call that function from the `git repack` builtin
directly, but this introduces awkward problems around closing and
reopening the object store, so the MIDX will be written out-of-process.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  4 ++++
 builtin/multi-pack-index.c             | 27 ++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh            | 15 ++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index a9df3dbd32..009c989ef8 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -45,6 +45,10 @@ write::
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
+
+	--stdin-packs::
+		Write a multi-pack index containing only the set of
+		line-delimited pack index basenames provided over stdin.
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 73c0113b48..047647b5f2 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -47,6 +47,7 @@ static struct opts_multi_pack_index {
 	const char *preferred_pack;
 	unsigned long batch_size;
 	unsigned flags;
+	int stdin_packs;
 } opts;
 
 static struct option common_opts[] = {
@@ -61,6 +62,16 @@ static struct option *add_common_options(struct option *prev)
 	return parse_options_concat(common_opts, prev);
 }
 
+static void read_packs_from_stdin(struct string_list *to)
+{
+	struct strbuf buf = STRBUF_INIT;
+	while (strbuf_getline(&buf, stdin) != EOF)
+		string_list_append(to, buf.buf);
+	string_list_sort(to);
+
+	strbuf_release(&buf);
+}
+
 static int cmd_multi_pack_index_write(int argc, const char **argv)
 {
 	struct option *options;
@@ -70,6 +81,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			   N_("pack for reuse when computing a multi-pack bitmap")),
 		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
+		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
+			 N_("write multi-pack index containing only given indexes")),
 		OPT_END(),
 	};
 
@@ -86,6 +99,20 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	FREE_AND_NULL(options);
 
+	if (opts.stdin_packs) {
+		struct string_list packs = STRING_LIST_INIT_DUP;
+		int ret;
+
+		read_packs_from_stdin(&packs);
+
+		ret = write_midx_file_only(opts.object_dir, &packs,
+					   opts.preferred_pack, opts.flags);
+
+		string_list_clear(&packs, 0);
+
+		return ret;
+
+	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
 			       opts.flags);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bb04f0f23b..385f0a3efd 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -168,6 +168,21 @@ test_expect_success 'write midx with two packs' '
 
 compare_results_with_midx "two packs"
 
+test_expect_success 'write midx with --stdin-packs' '
+	rm -fr $objdir/pack/multi-pack-index &&
+
+	idx="$(find $objdir/pack -name "test-2-*.idx")" &&
+	basename "$idx" >in &&
+
+	git multi-pack-index write --stdin-packs <in &&
+
+	test-tool read-midx $objdir | grep "\.idx$" >packs &&
+
+	test_cmp packs in
+'
+
+compare_results_with_midx "mixed mode (one pack + extra)"
+
 test_expect_success 'write progress off for redirected stderr' '
 	git multi-pack-index --object-dir=$objdir write 2>err &&
 	test_line_count = 0 err
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-22 22:34     ` Josh Steadmon
  2021-09-22 23:00     ` Jonathan Tan
  2021-09-15 18:24   ` [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
                     ` (6 subsequent siblings)
  9 siblings, 2 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.

This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.

This can happen when a multi-pack index contains some pack which refers
to loose objects (which by definition aren't included in the multi-pack
index).

By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.

This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.

The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).

When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt | 15 +++++
 builtin/multi-pack-index.c             | 11 +++-
 builtin/repack.c                       |  2 +-
 midx.c                                 | 60 ++++++++++++++++---
 midx.h                                 |  6 +-
 t/t5326-multi-pack-bitmaps.sh          | 82 ++++++++++++++++++++++++++
 6 files changed, 163 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 009c989ef8..27f83932e4 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -49,6 +49,21 @@ write::
 	--stdin-packs::
 		Write a multi-pack index containing only the set of
 		line-delimited pack index basenames provided over stdin.
+
+	--refs-snapshot=<path>::
+		With `--bitmap`, optionally specify a file which
+		contains a "refs snapshot" taken prior to repacking.
++
+A reference snapshot is composed of line-delimited OIDs corresponding to
+the reference tips, usually taken by `git repack` prior to generating a
+new pack. A line may optionally start with a `+` character to indicate
+that the reference which corresponds to that OID is "preferred" (see
+linkgit:git-config[1]'s `pack.preferBitmapTips`.)
++
+The file given at `<path>` is expected to be readable, and can contain
+duplicates. (If a given OID is given more than once, it is marked as
+preferred if at least one instance of it begins with the special `+`
+marker).
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 047647b5f2..4b827a07c0 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -7,7 +7,8 @@
 #include "object-store.h"
 
 #define BUILTIN_MIDX_WRITE_USAGE \
-	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]")
+	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]" \
+	   "[--refs-snapshot=<path>]")
 
 #define BUILTIN_MIDX_VERIFY_USAGE \
 	N_("git multi-pack-index [<options>] verify")
@@ -45,6 +46,7 @@ static char const * const builtin_multi_pack_index_usage[] = {
 static struct opts_multi_pack_index {
 	const char *object_dir;
 	const char *preferred_pack;
+	const char *refs_snapshot;
 	unsigned long batch_size;
 	unsigned flags;
 	int stdin_packs;
@@ -83,6 +85,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			 N_("write multi-pack index containing only given indexes")),
+		OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
+			     N_("refs snapshot for selecting bitmap commits")),
 		OPT_END(),
 	};
 
@@ -106,7 +110,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		read_packs_from_stdin(&packs);
 
 		ret = write_midx_file_only(opts.object_dir, &packs,
-					   opts.preferred_pack, opts.flags);
+					   opts.preferred_pack,
+					   opts.refs_snapshot, opts.flags);
 
 		string_list_clear(&packs, 0);
 
@@ -114,7 +119,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
-			       opts.flags);
+			       opts.refs_snapshot, opts.flags);
 }
 
 static int cmd_multi_pack_index_verify(int argc, const char **argv)
diff --git a/builtin/repack.c b/builtin/repack.c
index 82ab668272..27158a897b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -733,7 +733,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		unsigned flags = 0;
 		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
 			flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
-		write_midx_file(get_object_directory(), NULL, flags);
+		write_midx_file(get_object_directory(), NULL, NULL, flags);
 	}
 
 	string_list_clear(&names, 0);
diff --git a/midx.c b/midx.c
index 0330202fda..97ba3421f2 100644
--- a/midx.c
+++ b/midx.c
@@ -968,7 +968,42 @@ static void bitmap_show_commit(struct commit *commit, void *_data)
 	data->commits[data->commits_nr++] = commit;
 }
 
+static int read_refs_snapshot(const char *refs_snapshot,
+			      struct rev_info *revs)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct object_id oid;
+	FILE *f = xfopen(refs_snapshot, "r");
+
+	while (strbuf_getline(&buf, f) != EOF) {
+		struct object *object;
+		int preferred = 0;
+		char *hex = buf.buf;
+		const char *end = NULL;
+
+		if (buf.len && *buf.buf == '+') {
+			preferred = 1;
+			hex = &buf.buf[1];
+		}
+
+		if (parse_oid_hex(hex, &oid, &end) < 0)
+			die(_("could not parse line: %s"), buf.buf);
+		if (*end)
+			die(_("malformed line: %s"), buf.buf);
+
+		object = parse_object_or_die(&oid, NULL);
+		if (preferred)
+			object->flags |= NEEDS_BITMAP;
+
+		add_pending_object(revs, object, "");
+	}
+
+	fclose(f);
+	return 0;
+}
+
 static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    const char *refs_snapshot,
 						    struct write_midx_context *ctx)
 {
 	struct rev_info revs;
@@ -977,8 +1012,12 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 	cb.ctx = ctx;
 
 	repo_init_revisions(the_repository, &revs, NULL);
-	setup_revisions(0, NULL, &revs, NULL);
-	for_each_ref(add_ref_to_pending, &revs);
+	if (refs_snapshot) {
+		read_refs_snapshot(refs_snapshot, &revs);
+	} else {
+		setup_revisions(0, NULL, &revs, NULL);
+		for_each_ref(add_ref_to_pending, &revs);
+	}
 
 	/*
 	 * Skipping promisor objects here is intentional, since it only excludes
@@ -1007,6 +1046,7 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 
 static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 			     struct write_midx_context *ctx,
+			     const char *refs_snapshot,
 			     unsigned flags)
 {
 	struct packing_data pdata;
@@ -1018,7 +1058,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 
 	prepare_midx_packing_data(&pdata, ctx);
 
-	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+	commits = find_commits_for_midx_bitmap(&commits_nr, refs_snapshot, ctx);
 
 	/*
 	 * Build the MIDX-order index based on pdata.objects (which is already
@@ -1066,6 +1106,7 @@ static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
+			       const char *refs_snapshot,
 			       unsigned flags)
 {
 	char *midx_name;
@@ -1359,7 +1400,8 @@ static int write_midx_internal(const char *object_dir,
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
 	if (flags & MIDX_WRITE_BITMAP) {
-		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
+		if (write_midx_bitmap(midx_name, midx_hash, &ctx,
+				      refs_snapshot, flags) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
 			goto cleanup;
@@ -1394,19 +1436,21 @@ static int write_midx_internal(const char *object_dir,
 
 int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
+		    const char *refs_snapshot,
 		    unsigned flags)
 {
 	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
-				   flags);
+				   refs_snapshot, flags);
 }
 
 int write_midx_file_only(const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
+			 const char *refs_snapshot,
 			 unsigned flags)
 {
 	return write_midx_internal(object_dir, packs_to_include, NULL,
-				   preferred_pack_name, flags);
+				   preferred_pack_name, refs_snapshot, flags);
 }
 
 struct clear_midx_data {
@@ -1686,7 +1730,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, NULL, flags);
 		m = NULL;
 	}
 
@@ -1877,7 +1921,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index 80f502d39b..c0b4e21df4 100644
--- a/midx.h
+++ b/midx.h
@@ -62,10 +62,14 @@ int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pa
 int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
-int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+int write_midx_file(const char *object_dir,
+		    const char *preferred_pack_name,
+		    const char *refs_snapshot,
+		    unsigned flags);
 int write_midx_file_only(const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
+			 const char *refs_snapshot,
 			 unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 4ad7c2c969..069dab3e17 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -283,4 +283,86 @@ test_expect_success 'pack.preferBitmapTips' '
 	)
 '
 
+test_expect_success 'writing a bitmap with --refs-snapshot' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit one &&
+		test_commit two &&
+
+		git rev-parse one >snapshot &&
+
+		git repack -ad &&
+
+		# First, write a MIDX which see both refs/tags/one and
+		# refs/tags/two (causing both of those commits to receive
+		# bitmaps).
+		git multi-pack-index write --bitmap &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		grep "$(git rev-parse two)" bitmaps &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		# Then again, but with a refs snapshot which only sees
+		# refs/tags/one.
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		! grep "$(git rev-parse two)" bitmaps
+	)
+'
+
+test_expect_success 'write a bitmap with --refs-snapshot (preferred tips)' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git multi-pack-index write --bitmap &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		(
+			grep -vf before commits.raw &&
+			# mark missing commits as preferred
+			sed "s/^/+/" before
+		) >snapshot &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (2 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-22 22:56     ` Jonathan Tan
  2021-09-15 18:24   ` [PATCH v2 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.

Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.

Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.

Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 49 ++++++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 27158a897b..e55a650de5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -98,8 +98,9 @@ static void remove_pack_on_signal(int signo)
  * have a corresponding .keep file. These packs are not to
  * be kept if we are going to pack everything into one file.
  */
-static void get_non_kept_pack_filenames(struct string_list *fname_list,
-					const struct string_list *extra_keep)
+static void collect_pack_filenames(struct string_list *fname_list,
+				   struct string_list *fname_kept_list,
+				   const struct string_list *extra_keep)
 {
 	DIR *dir;
 	struct dirent *e;
@@ -112,21 +113,20 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 		size_t len;
 		int i;
 
+		if (!strip_suffix(e->d_name, ".pack", &len))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(e->d_name, extra_keep->items[i].string))
 				break;
-		if (extra_keep->nr > 0 && i < extra_keep->nr)
-			continue;
-
-		if (!strip_suffix(e->d_name, ".pack", &len))
-			continue;
 
 		fname = xmemdupz(e->d_name, len);
 
-		if (!file_exists(mkpath("%s/%s.keep", packdir, fname)))
-			string_list_append_nodup(fname_list, fname);
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
+		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
+			string_list_append_nodup(fname_kept_list, fname);
 		else
-			free(fname);
+			string_list_append_nodup(fname_list, fname);
 	}
 	closedir(dir);
 }
@@ -440,6 +440,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
@@ -572,9 +573,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
+	collect_pack_filenames(&existing_packs, &existing_kept_packs,
+			       &keep_pack_list);
+
 	if (pack_everything & ALL_INTO_ONE) {
-		get_non_kept_pack_filenames(&existing_packs, &keep_pack_list);
-
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
@@ -683,17 +685,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
-		const int hexsz = the_hash_algo->hexsz;
 		int opts = 0;
-		string_list_sort(&names);
-		for_each_string_list_item(item, &existing_packs) {
-			char *sha1;
-			size_t len = strlen(item->string);
-			if (len < hexsz)
-				continue;
-			sha1 = item->string + len - hexsz;
-			if (!string_list_has_string(&names, sha1))
-				remove_redundant_pack(packdir, item->string);
+		if (pack_everything & ALL_INTO_ONE) {
+			const int hexsz = the_hash_algo->hexsz;
+			string_list_sort(&names);
+			for_each_string_list_item(item, &existing_packs) {
+				char *sha1;
+				size_t len = strlen(item->string);
+				if (len < hexsz)
+					continue;
+				sha1 = item->string + len - hexsz;
+				if (!string_list_has_string(&names, sha1))
+					remove_redundant_pack(packdir, item->string);
+			}
 		}
 
 		if (geometry) {
@@ -739,6 +743,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 5/8] builtin/repack.c: extract showing progress to a variable
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (3 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

We only ask whether stderr is a tty before calling
'prune_packed_objects()', but the subsequent patch will add another use.

Extract this check into a variable so that both can use it without
having to call 'isatty()' twice.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index e55a650de5..be470546e6 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -445,6 +445,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
+	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
 	int pack_everything = 0;
@@ -718,7 +719,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			}
 			strbuf_release(&buf);
 		}
-		if (!po_args.quiet && isatty(2))
+		if (!po_args.quiet && show_progress)
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (4 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-22 22:39     ` Jonathan Tan
  2021-09-15 18:24   ` [PATCH v2 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.

There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:

  - Call 'git multi-pack-index write' after running 'git repack', or

  - Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
    'git repack'.

The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).

Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.

Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.

This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").

There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  14 ++--
 builtin/repack.c             | 129 +++++++++++++++++++++++++++++------
 t/lib-midx.sh                |   8 +++
 t/t7700-repack.sh            |  96 ++++++++++++++++++++++++++
 4 files changed, 224 insertions(+), 23 deletions(-)
 create mode 100644 t/lib-midx.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 24c00c9384..0f2d235ca5 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,7 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
 
 DESCRIPTION
 -----------
@@ -128,10 +128,11 @@ depth is 4095.
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
-	only makes sense when used with `-a` or `-A`, as the bitmaps
+	only makes sense when used with `-a`, `-A` or `-m`, as the bitmaps
 	must be able to refer to all reachable objects. This option
-	overrides the setting of `repack.writeBitmaps`.  This option
-	has no effect if multiple packfiles are created.
+	overrides the setting of `repack.writeBitmaps`. This option
+	has no effect if multiple packfiles are created, unless writing a
+	MIDX (in which case a multi-pack bitmap is created).
 
 --pack-kept-objects::
 	Include objects in `.keep` files when repacking.  Note that we
@@ -190,6 +191,11 @@ to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
 
+-m::
+--write-midx::
+	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
+	containing the non-redundant packs.
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index be470546e6..ca18d3dd2e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -433,6 +433,73 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+static void midx_included_packs(struct string_list *include,
+				struct string_list *existing_packs,
+				struct string_list *existing_kept_packs,
+				struct string_list *names,
+				struct pack_geometry *geometry)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, existing_kept_packs)
+		string_list_insert(include, xstrfmt("%s.idx", item->string));
+	for_each_string_list_item(item, names)
+		string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
+	if (geometry) {
+		struct strbuf buf = STRBUF_INIT;
+		uint32_t i;
+		for (i = geometry->split; i < geometry->pack_nr; i++) {
+			struct packed_git *p = geometry->pack[i];
+
+			strbuf_addstr(&buf, pack_basename(p));
+			strbuf_strip_suffix(&buf, ".pack");
+			strbuf_addstr(&buf, ".idx");
+
+			string_list_insert(include, strbuf_detach(&buf, NULL));
+		}
+	} else {
+		for_each_string_list_item(item, existing_packs) {
+			if (item->util)
+				continue;
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
+	}
+}
+
+static int write_midx_included_packs(struct string_list *include,
+				     int show_progress, int write_bitmaps)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+
+	cmd.in = -1;
+	cmd.git_cmd = 1;
+
+	strvec_push(&cmd.args, "multi-pack-index");
+	strvec_pushl(&cmd.args, "write", "--stdin-packs", NULL);
+
+	if (show_progress)
+		strvec_push(&cmd.args, "--progress");
+	else
+		strvec_push(&cmd.args, "--no-progress");
+
+	if (write_bitmaps)
+		strvec_push(&cmd.args, "--bitmap");
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, include)
+		fprintf(in, "%s\n", item->string);
+	fclose(in);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -456,6 +523,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
 	int geometric_factor = 0;
+	int write_midx = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -498,6 +566,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("do not repack this pack")),
 		OPT_INTEGER('g', "geometric", &geometric_factor,
 			    N_("find a geometric progression with factor <N>")),
+		OPT_BOOL('m', "write-midx", &write_midx,
+			   N_("write a multi-pack index of the resulting packs")),
 		OPT_END()
 	};
 
@@ -514,8 +584,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		die(_("--keep-unreachable and -A are incompatible"));
 
 	if (write_bitmaps < 0) {
-		if (!(pack_everything & ALL_INTO_ONE) ||
-		    !is_bare_repository())
+		if (!write_midx &&
+		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
 			write_bitmaps = 0;
 	} else if (write_bitmaps &&
 		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
@@ -525,7 +595,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps > 0;
 
-	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
 	if (geometric_factor) {
@@ -567,10 +637,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
-	if (write_bitmaps > 0)
-		strvec_push(&cmd.args, "--write-bitmap-index");
-	else if (write_bitmaps < 0)
-		strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	if (!write_midx) {
+		if (write_bitmaps > 0)
+			strvec_push(&cmd.args, "--write-bitmap-index");
+		else if (write_bitmaps < 0)
+			strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	}
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
@@ -683,22 +755,41 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	/* End of pack replacement. */
 
+	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
+		const int hexsz = the_hash_algo->hexsz;
+		string_list_sort(&names);
+		for_each_string_list_item(item, &existing_packs) {
+			char *sha1;
+			size_t len = strlen(item->string);
+			if (len < hexsz)
+				continue;
+			sha1 = item->string + len - hexsz;
+			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+		}
+	}
+
+	if (write_midx) {
+		struct string_list include = STRING_LIST_INIT_NODUP;
+		midx_included_packs(&include, &existing_packs,
+				    &existing_kept_packs, &names, geometry);
+
+		ret = write_midx_included_packs(&include,
+						show_progress, write_bitmaps > 0);
+
+		string_list_clear(&include, 0);
+
+		if (ret)
+			return ret;
+	}
+
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
 		int opts = 0;
-		if (pack_everything & ALL_INTO_ONE) {
-			const int hexsz = the_hash_algo->hexsz;
-			string_list_sort(&names);
-			for_each_string_list_item(item, &existing_packs) {
-				char *sha1;
-				size_t len = strlen(item->string);
-				if (len < hexsz)
-					continue;
-				sha1 = item->string + len - hexsz;
-				if (!string_list_has_string(&names, sha1))
-					remove_redundant_pack(packdir, item->string);
-			}
+		for_each_string_list_item(item, &existing_packs) {
+			if (!item->util)
+				continue;
+			remove_redundant_pack(packdir, item->string);
 		}
 
 		if (geometry) {
diff --git a/t/lib-midx.sh b/t/lib-midx.sh
new file mode 100644
index 0000000000..1261994744
--- /dev/null
+++ b/t/lib-midx.sh
@@ -0,0 +1,8 @@
+# test_midx_consistent <objdir>
+test_midx_consistent () {
+	ls $1/pack/pack-*.idx | xargs -n 1 basename | sort >expect &&
+	test-tool read-midx $1 | grep ^pack-.*\.idx$ | sort >actual &&
+
+	test_cmp expect actual &&
+	git multi-pack-index --object-dir=$1 verify
+}
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 98eda3bfeb..6792531dfd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -3,6 +3,8 @@
 test_description='git repack works correctly'
 
 . ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-bitmap.sh"
+. "${TEST_DIRECTORY}/lib-midx.sh"
 
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
@@ -234,4 +236,98 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success 'setup for --write-midx tests' '
+	git init midx &&
+	(
+		cd midx &&
+		git config core.multiPackIndex true &&
+
+		test_commit base
+	)
+'
+
+test_expect_success '--write-midx unchanged' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack &&
+		test_path_is_missing $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with a new pack' '
+	(
+		cd midx &&
+		test_commit loose &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -b' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -mb &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -d' '
+	(
+		cd midx &&
+		test_commit repack &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success 'cleans up MIDX when appropriate' '
+	(
+		cd midx &&
+
+		test_commit repack-2 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		checksum=$(midx_checksum $objdir) &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$checksum.bitmap &&
+		test_path_is_file $midx-$checksum.rev &&
+
+		test_commit repack-3 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-$checksum.bitmap &&
+		test_path_is_missing $midx-$checksum.rev &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+		test_path_is_file $midx-$(midx_checksum $objdir).rev &&
+
+		test_commit repack-4 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb &&
+
+		find $objdir/pack -type f -name "multi-pack-index*" >files &&
+		test_must_be_empty files
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 7/8] builtin/repack.c: make largest pack preferred
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (5 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-15 18:24   ` [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

When repacking into a geometric series and writing a multi-pack bitmap,
it is beneficial to have the largest resulting pack be the preferred
object source in the bitmap's MIDX, since selecting the large packs can
lead to fewer broken delta chains and better compression.

Teach 'git repack' to identify this pack and pass it to the MIDX write
machinery in order to mark it as preferred.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  4 ++++
 builtin/repack.c             | 27 ++++++++++++++++++++++++++-
 pack-bitmap.c                |  2 +-
 pack-bitmap.h                |  1 +
 t/helper/test-read-midx.c    | 25 ++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 22 ++++++++++++++++++++++
 6 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0f2d235ca5..7183fb498f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -190,6 +190,10 @@ this "roll-up", without respect to their reachability. This is subject
 to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
++
+When writing a multi-pack bitmap, `git repack` selects the largest resulting
+pack as the preferred pack for object selection by the MIDX (see
+linkgit:git-multi-pack-index[1]).
 
 -m::
 --write-midx::
diff --git a/builtin/repack.c b/builtin/repack.c
index ca18d3dd2e..8c7bc4551e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -422,6 +422,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	geometry->split = split;
 }
 
+static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
+{
+	if (!geometry) {
+		/*
+		 * No geometry means either an all-into-one repack (in which
+		 * case there is only one pack left and it is the largest) or an
+		 * incremental one.
+		 *
+		 * If repacking incrementally, then we could check the size of
+		 * all packs to determine which should be preferred, but leave
+		 * this for later.
+		 */
+		return NULL;
+	}
+	if (geometry->split == geometry->pack_nr)
+		return NULL;
+	return geometry->pack[geometry->pack_nr - 1];
+}
+
 static void clear_pack_geometry(struct pack_geometry *geometry)
 {
 	if (!geometry)
@@ -467,10 +486,12 @@ static void midx_included_packs(struct string_list *include,
 }
 
 static int write_midx_included_packs(struct string_list *include,
+				     struct pack_geometry *geometry,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	struct string_list_item *item;
+	struct packed_git *largest = get_largest_active_pack(geometry);
 	FILE *in;
 	int ret;
 
@@ -488,6 +509,10 @@ static int write_midx_included_packs(struct string_list *include,
 	if (write_bitmaps)
 		strvec_push(&cmd.args, "--bitmap");
 
+	if (largest)
+		strvec_pushf(&cmd.args, "--preferred-pack=%s",
+			     pack_basename(largest));
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -773,7 +798,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		midx_included_packs(&include, &existing_packs,
 				    &existing_kept_packs, &names, geometry);
 
-		ret = write_midx_included_packs(&include,
+		ret = write_midx_included_packs(&include, geometry,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8504110a4d..67be9be9a6 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1418,7 +1418,7 @@ static int try_partial_reuse(struct packed_git *pack,
 	return 0;
 }
 
-static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 {
 	struct multi_pack_index *m = bitmap_git->midx;
 	if (!m)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 469090bad2..7d407c5a4c 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -55,6 +55,7 @@ int test_bitmap_commits(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 struct list_objects_filter_options *filter,
 					 int filter_provided_objects);
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
 				       uint32_t *entries,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index cb0d27049a..0038559129 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -3,6 +3,7 @@
 #include "midx.h"
 #include "repository.h"
 #include "object-store.h"
+#include "pack-bitmap.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -72,14 +73,36 @@ static int read_midx_checksum(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_preferred_pack(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmap_index *bitmap = NULL;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	bitmap = prepare_bitmap_git(the_repository);
+	if (!(bitmap && bitmap_is_midx(bitmap)))
+		return 1;
+
+
+	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
 	else if (!strcmp(argv[1], "--checksum"))
 		return read_midx_checksum(argv[2]);
+	else if (!strcmp(argv[1], "--preferred-pack"))
+		return read_midx_preferred_pack(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 5ccaa440e0..54360c9878 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -180,4 +180,26 @@ test_expect_success '--geometric ignores kept packs' '
 	)
 '
 
+test_expect_success '--geometric chooses largest MIDX preferred pack' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		ls $objdir/pack/pack-*.idx >before &&
+		test_commit_bulk --start=4 4 && # 12 objects
+		ls $objdir/pack/pack-*.idx >after &&
+
+		git repack --geometric 2 -dbm &&
+
+		comm -3 before after | xargs -n 1 basename >expect &&
+		test-tool read-midx --preferred-pack $objdir >actual &&
+
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (6 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
@ 2021-09-15 18:24   ` Taylor Blau
  2021-09-24 18:22     ` Jonathan Tan
  2021-09-15 19:22   ` [PATCH v2 0/8] repack: introduce `--write-midx` Junio C Hamano
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
  9 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 18:24 UTC (permalink / raw)
  To: git; +Cc: peff, gitster, avarab

To prevent the race described in an earlier patch, generate and pass a
reference snapshot to the multi-pack bitmap code, if we are writing one
from `git repack`.

This patch is mostly limited to creating a temporary file, and then
calling for_each_ref(). Except we try to minimize duplicates, since
doing so can drastically reduce the size in network-of-forks style
repositories. In the kernel's fork network (the repository containing
all objects from the kernel and all its forks), deduplicating the
references drops the snapshot size from 934 MB to just 12 MB.

But since we're handling duplicates in this way, we have to make sure
that we preferred references (those listed in pack.preferBitmapTips)
before non-preferred ones (to avoid recording an object which is pointed
at by a preferred tip as non-preferred).

We accomplish this by doing separate passes over the references: first
visiting each prefix in pack.preferBitmapTips, and then over the rest of
the references.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 8c7bc4551e..a0b515b139 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -15,6 +15,8 @@
 #include "promisor-remote.h"
 #include "shallow.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -452,6 +454,65 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+struct midx_snapshot_ref_data {
+	struct tempfile *f;
+	struct oidset seen;
+	int preferred;
+};
+
+static int midx_snapshot_ref_one(const char *refname,
+				 const struct object_id *oid,
+				 int flag, void *_data)
+{
+	struct midx_snapshot_ref_data *data = _data;
+	struct object_id peeled;
+
+	if (!peel_iterated_oid(oid, &peeled))
+		oid = &peeled;
+
+	if (oidset_insert(&data->seen, oid))
+		return 0; /* already seen */
+
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
+		oid_to_hex(oid));
+
+	return 0;
+}
+
+static void midx_snapshot_refs(struct tempfile *f)
+{
+	struct midx_snapshot_ref_data data;
+	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
+
+	data.f = f;
+	oidset_init(&data.seen, 0);
+
+	if (!fdopen_tempfile(f, "w"))
+		 die(_("could not open tempfile %s for writing"),
+		     get_tempfile_path(f));
+
+	if (preferred) {
+		struct string_list_item *item;
+
+		data.preferred = 1;
+		for_each_string_list_item(item, preferred)
+			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
+		data.preferred = 0;
+	}
+
+	for_each_ref(midx_snapshot_ref_one, &data);
+
+	if (close_tempfile_gently(f)) {
+		int save_errno = errno;
+		delete_tempfile(&f);
+		errno = save_errno;
+		die_errno(_("could not close refs snapshot tempfile"));
+	}
+}
+
 static void midx_included_packs(struct string_list *include,
 				struct string_list *existing_packs,
 				struct string_list *existing_kept_packs,
@@ -487,6 +548,7 @@ static void midx_included_packs(struct string_list *include,
 
 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -513,6 +575,9 @@ static int write_midx_included_packs(struct string_list *include,
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
 			     pack_basename(largest));
 
+	if (refs_snapshot)
+		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -535,6 +600,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
+	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
 	FILE *out;
 	int show_progress = isatty(2);
@@ -623,6 +689,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
+	if (write_midx && write_bitmaps) {
+		struct strbuf path = STRBUF_INIT;
+
+		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
+			    "bitmap-ref-tips");
+
+		refs_snapshot = xmks_tempfile(path.buf);
+		midx_snapshot_refs(refs_snapshot);
+
+		strbuf_release(&path);
+	}
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("--geometric is incompatible with -A, -a"));
@@ -799,6 +877,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				    &existing_kept_packs, &names, geometry);
 
 		ret = write_midx_included_packs(&include, geometry,
+						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
-- 
2.33.0.96.g73915697e6

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 0/8] repack: introduce `--write-midx`
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (7 preceding siblings ...)
  2021-09-15 18:24   ` [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
@ 2021-09-15 19:22   ` Junio C Hamano
  2021-09-15 19:29     ` Junio C Hamano
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
  9 siblings, 1 reply; 76+ messages in thread
From: Junio C Hamano @ 2021-09-15 19:22 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, avarab

Taylor Blau <me@ttaylorr.com> writes:

> Nothing substantial has changed since last time. Mostly re-wrapping lines and
> removing braces in macro'd for_each_xyz() loops....
> Otherwise, this series is unchagned. It depends on tb/multi-pack-bitmaps, which
> has graduated to master. Range-diff below:

Despite the explanation above ...

>      @@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>     - 
>     - 	if (ends_with(file_name, ".idx")) {
>       		display_progress(ctx->progress, ++ctx->pack_paths_checked);
>     --		if (ctx->m && midx_contains_pack(ctx->m, file_name))
>     --			return;
>     -+		if (ctx->m) {
>     -+			if (midx_contains_pack(ctx->m, file_name))
>     -+				return;
>     -+		} else if (ctx->to_include) {
>     -+			if (!string_list_has_string(ctx->to_include, file_name))
>     -+				return;
>     -+		}
>     + 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
>     + 			return;
>     ++		else if (ctx->to_include &&
>     ++			 !string_list_has_string(ctx->to_include, file_name))
>     ++			return;

... this part seems to change what the code does quite a bit.

The original used to skip .to_include checks when .m is set but it
did not pass midx_contains_pack().  Now .to_include check is done
in a context with .m that does not pass midex_contains_pack() check.

I suspect that this change since v1 is a bugfix.  We make an early
return based on midx_contains_pack() but make sure .m is not NULL,
in order to be able to run the check.  There is another early return
condition that is orthogonal, and we do string_list check, but that
is only doable if the .to_include is not NULL.  The logic in v2 
clearly expresses that, but v1 was simply buggy.

But if it is the case, I'd step back a bit and further question if
"else if" is a good construct to use here.  We'd return if .m passes
midx_contains_pack() check, and another check based on .to_include
gives us an orthogonal chance to return early, so two "if" statement
that are independent sitting next to each other may have avoided
such a bug from the beginning, perhaps?

	if (ctx->m && midx_contains...)
		return;
	if (ctx->to_include && !string_list_has...)
		return;

Of course you could

	if ((ctx->m && midx_contains...) ||
	    (ctx->to_include && !string_list_has...))
		return;

but that may not scale as well as two independent if statements.

Everything else in the range-diff looked cosmetic as explained in
the cover letter.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 0/8] repack: introduce `--write-midx`
  2021-09-15 19:22   ` [PATCH v2 0/8] repack: introduce `--write-midx` Junio C Hamano
@ 2021-09-15 19:29     ` Junio C Hamano
  2021-09-15 21:19       ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Junio C Hamano @ 2021-09-15 19:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, avarab

Junio C Hamano <gitster@pobox.com> writes:

> But if it is the case, I'd step back a bit and further question if
> "else if" is a good construct to use here.  We'd return if .m passes
> midx_contains_pack() check, and another check based on .to_include
> gives us an orthogonal chance to return early, so two "if" statement
> that are independent sitting next to each other may have avoided
> such a bug from the beginning, perhaps?

OK, I went back and checked your response to a review in an earlier
round.  If .m and .to_include cannot be turned on at the same time,
then I think "else if" would express the intention more clearly.

But if we go that route, the whole "if ... else if" may deserve a
comment that explains why .m and .to_include are fundamentally and
inherently mutually exclusive.  In other words, is it possible if
future enhancement may want to pass both .m and .to_include to allow
the code path to check both conditions and return early?

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 0/8] repack: introduce `--write-midx`
  2021-09-15 19:29     ` Junio C Hamano
@ 2021-09-15 21:19       ` Taylor Blau
  2021-09-16 22:16         ` Junio C Hamano
  0 siblings, 1 reply; 76+ messages in thread
From: Taylor Blau @ 2021-09-15 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, avarab

On Wed, Sep 15, 2021 at 12:29:20PM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > But if it is the case, I'd step back a bit and further question if
> > "else if" is a good construct to use here.  We'd return if .m passes
> > midx_contains_pack() check, and another check based on .to_include
> > gives us an orthogonal chance to return early, so two "if" statement
> > that are independent sitting next to each other may have avoided
> > such a bug from the beginning, perhaps?
>
> OK, I went back and checked your response to a review in an earlier
> round.  If .m and .to_include cannot be turned on at the same time,
> then I think "else if" would express the intention more clearly.

Yeah, exactly. I didn't think that this would be as interesting to
reviewers as it ended up being, hence I didn't put much thought into it.

> But if we go that route, the whole "if ... else if" may deserve a
> comment that explains why .m and .to_include are fundamentally and
> inherently mutually exclusive.  In other words, is it possible if
> future enhancement may want to pass both .m and .to_include to allow
> the code path to check both conditions and return early?

The gist is that they aren't inherently exclusive, but that using an
existing MIDX (which is the result of having `.m` point at a MIDX
struct) right now blindly reuses all of the existing packs in that MIDX,
which is what to_include is trying to avoid.

We could reuse an existing MIDX with to_include by filtering down its
packs before carrying them forward, but don't currently. So that would
cause this change to produce unexpected results if we ever changed that
behavior.

I think all of this is non-obvious enough that it warrants being written
down in a comment. Here's a replacement for that first patch that should
apply without conflict. But if you'd rather have a reroll or anything
else, I'm happy to do that, too.

--- >8 ---

Subject: [PATCH] midx: expose `write_midx_file_only()` publicly

Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 midx.h |  5 +++++
 2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/midx.c b/midx.c
index 864034a6ad..61226eaa9d 100644
--- a/midx.c
+++ b/midx.c
@@ -475,6 +475,8 @@ struct write_midx_context {
 	uint32_t num_large_offsets;

 	int preferred_pack_idx;
+
+	struct string_list *to_include;
 };

 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -484,8 +486,26 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,

 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		/*
+		 * Note that at most one of ctx->m and ctx->to_include are set,
+		 * so we are testing midx_contains_pack() and
+		 * string_list_has_string() independently (guarded by the
+		 * appropriate NULL checks).
+		 *
+		 * We could support passing to_include while reusing an existing
+		 * MIDX, but don't currently since the reuse process drags
+		 * forward all packs from an existing MIDX (without checking
+		 * whether or not they appear in the to_include list).
+		 *
+		 * If we added support for that, these next two conditional
+		 * should be performed independently (likely checking
+		 * to_include before the existing MIDX).
+		 */
 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
+		else if (ctx->to_include &&
+			 !string_list_has_string(ctx->to_include, file_name))
+			return;

 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);

@@ -1058,6 +1078,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 }

 static int write_midx_internal(const char *object_dir,
+			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -1082,10 +1103,17 @@ static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);

-	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
-		if (!strcmp(object_dir, cur->object_dir)) {
-			ctx.m = cur;
-			break;
+	if (!packs_to_include) {
+		/*
+		 * Only reference an existing MIDX when not filtering which
+		 * packs to include, since all packs and objects are copied
+		 * blindly from an existing MIDX if one is present.
+		 */
+		for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir)) {
+				ctx.m = cur;
+				break;
+			}
 		}
 	}

@@ -1136,10 +1164,13 @@ static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;

+	ctx.to_include = packs_to_include;
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);

-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
+	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
 		int bitmap_exists;
 		int want_bitmap = flags & MIDX_WRITE_BITMAP;
@@ -1237,7 +1268,7 @@ static int write_midx_internal(const char *object_dir,

 	QSORT(ctx.info, ctx.nr, pack_info_compare);

-	if (packs_to_drop && packs_to_drop->nr) {
+	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
 		int drop_index = 0;
 		int missing_drops = 0;

@@ -1380,7 +1411,17 @@ int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
+	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
+				   flags);
+}
+
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags)
+{
+	return write_midx_internal(object_dir, packs_to_include, NULL,
+				   preferred_pack_name, flags);
 }

 struct clear_midx_data {
@@ -1660,7 +1701,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);

 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
 		m = NULL;
 	}

@@ -1851,7 +1892,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}

-	result = write_midx_internal(object_dir, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
 	m = NULL;

 cleanup:
diff --git a/midx.h b/midx.h
index aa3da557bb..80f502d39b 100644
--- a/midx.h
+++ b/midx.h
@@ -2,6 +2,7 @@
 #define MIDX_H

 #include "repository.h"
+#include "string-list.h"

 struct object_id;
 struct pack_entry;
@@ -62,6 +63,10 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);

 int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
--
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 0/8] repack: introduce `--write-midx`
  2021-09-15 21:19       ` Taylor Blau
@ 2021-09-16 22:16         ` Junio C Hamano
  0 siblings, 0 replies; 76+ messages in thread
From: Junio C Hamano @ 2021-09-16 22:16 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, avarab

Taylor Blau <me@ttaylorr.com> writes:

> I think all of this is non-obvious enough that it warrants being written
> down in a comment. Here's a replacement for that first patch that should
> apply without conflict. But if you'd rather have a reroll or anything
> else, I'm happy to do that, too.

Thanks; will replace.  The comment reads well.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-15 18:24   ` [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
@ 2021-09-22 22:29     ` Josh Steadmon
  2021-09-23  2:03       ` Taylor Blau
  2021-09-22 23:11     ` Jonathan Tan
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Steadmon @ 2021-09-22 22:29 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, gitster, avarab

Thanks for the series! I have a couple of questions:


On 2021.09.15 14:24, Taylor Blau wrote:
> To power a new `--write-midx` mode, `git repack` will want to write a
> multi-pack index containing a certain set of packs in the repository.
> 
> This new option will be used by `git repack` to write a MIDX which
> contains only the packs which will survive after the repack (that is, it
> will exclude any packs which are about to be deleted).
> 
> This patch effectively exposes the function implemented in the previous
> commit via the `git multi-pack-index` builtin. An alternative approach
> would have been to call that function from the `git repack` builtin
> directly, but this introduces awkward problems around closing and
> reopening the object store, so the MIDX will be written out-of-process.

Could you elaborate a bit on the "awkward problems" here? I'm afraid I'm
missing the context here.


> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 73c0113b48..047647b5f2 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -47,6 +47,7 @@ static struct opts_multi_pack_index {
>  	const char *preferred_pack;
>  	unsigned long batch_size;
>  	unsigned flags;
> +	int stdin_packs;
>  } opts;
>  
>  static struct option common_opts[] = {
> @@ -61,6 +62,16 @@ static struct option *add_common_options(struct option *prev)
>  	return parse_options_concat(common_opts, prev);
>  }
>  
> +static void read_packs_from_stdin(struct string_list *to)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	while (strbuf_getline(&buf, stdin) != EOF)
> +		string_list_append(to, buf.buf);
> +	string_list_sort(to);
> +
> +	strbuf_release(&buf);
> +}
> +

I'm presuming that the packfile list is going to be generated
automatically, but what happens if that becomes corrupt somehow, and we
skip a packfile that should have been included? Will that cause
incorrect behavior, or will we just miss out on some of the bitmap
performance benefits?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-15 18:24   ` [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
@ 2021-09-22 22:34     ` Josh Steadmon
  2021-09-23  2:08       ` Taylor Blau
  2021-09-22 23:00     ` Jonathan Tan
  1 sibling, 1 reply; 76+ messages in thread
From: Josh Steadmon @ 2021-09-22 22:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, gitster, avarab

A small nitpick for this patch:

On 2021.09.15 14:24, Taylor Blau wrote:
> diff --git a/midx.c b/midx.c
> index 0330202fda..97ba3421f2 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -968,7 +968,42 @@ static void bitmap_show_commit(struct commit *commit, void *_data)
>  	data->commits[data->commits_nr++] = commit;
>  }
>  
> +static int read_refs_snapshot(const char *refs_snapshot,
> +			      struct rev_info *revs)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	struct object_id oid;
> +	FILE *f = xfopen(refs_snapshot, "r");
> +
> +	while (strbuf_getline(&buf, f) != EOF) {
> +		struct object *object;
> +		int preferred = 0;
> +		char *hex = buf.buf;
> +		const char *end = NULL;
> +
> +		if (buf.len && *buf.buf == '+') {
> +			preferred = 1;
> +			hex = &buf.buf[1];
> +		}
> +
> +		if (parse_oid_hex(hex, &oid, &end) < 0)
> +			die(_("could not parse line: %s"), buf.buf);
> +		if (*end)
> +			die(_("malformed line: %s"), buf.buf);
> +
> +		object = parse_object_or_die(&oid, NULL);
> +		if (preferred)
> +			object->flags |= NEEDS_BITMAP;
> +
> +		add_pending_object(revs, object, "");
> +	}
> +
> +	fclose(f);
> +	return 0;
> +}

`buf` needs to be released here.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking
  2021-09-15 18:24   ` [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
@ 2021-09-22 22:39     ` Jonathan Tan
  2021-09-23  2:40       ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-22 22:39 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> @@ -683,22 +755,41 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	}
>  	/* End of pack replacement. */
>  
> +	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
> +		const int hexsz = the_hash_algo->hexsz;
> +		string_list_sort(&names);
> +		for_each_string_list_item(item, &existing_packs) {
> +			char *sha1;
> +			size_t len = strlen(item->string);
> +			if (len < hexsz)
> +				continue;
> +			sha1 = item->string + len - hexsz;
> +			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);

OK, here is the tricky part. They are marked for deletion here...

> +		}
> +	}
> +
> +	if (write_midx) {
> +		struct string_list include = STRING_LIST_INIT_NODUP;
> +		midx_included_packs(&include, &existing_packs,
> +				    &existing_kept_packs, &names, geometry);

...the mark for deletion is taken into account during the execution of
midx_included_packs() (as can be seen by looking at that function)...

> +
> +		ret = write_midx_included_packs(&include,
> +						show_progress, write_bitmaps > 0);
> +
> +		string_list_clear(&include, 0);
> +
> +		if (ret)
> +			return ret;
> +	}
> +
>  	reprepare_packed_git(the_repository);
>  
>  	if (delete_redundant) {
>  		int opts = 0;
> -		if (pack_everything & ALL_INTO_ONE) {
> -			const int hexsz = the_hash_algo->hexsz;
> -			string_list_sort(&names);
> -			for_each_string_list_item(item, &existing_packs) {
> -				char *sha1;
> -				size_t len = strlen(item->string);
> -				if (len < hexsz)
> -					continue;
> -				sha1 = item->string + len - hexsz;
> -				if (!string_list_has_string(&names, sha1))
> -					remove_redundant_pack(packdir, item->string);
> -			}
> +		for_each_string_list_item(item, &existing_packs) {
> +			if (!item->util)
> +				continue;

...and the marks are also used here. I was at first confused about why
the functionality of midx_included_packs() depended on whether redundant
packs were marked for deletion - if they are redundant, shouldn't they
never be taken into account (regardless of whether we actually delete
them)? But I guess it makes sense as an overall design point that we
pass all packs that are to remain (so if they will be deleted, exclude
them, and if they will not be, include them).

I think a comment "mark this pack for deletion" at the point we write
the mark (so, where the cast to intptr_t is) is worthwhile. Other than
that, this patch looks good to me.

I (and the others in our review club) only managed to reach this patch.
I hope to get to the other 2 by the end of the week.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally
  2021-09-15 18:24   ` [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
@ 2021-09-22 22:56     ` Jonathan Tan
  2021-09-23  2:59       ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-22 22:56 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> @@ -98,8 +98,9 @@ static void remove_pack_on_signal(int signo)
>   * have a corresponding .keep file. These packs are not to
>   * be kept if we are going to pack everything into one file.
>   */
> -static void get_non_kept_pack_filenames(struct string_list *fname_list,
> -					const struct string_list *extra_keep)
> +static void collect_pack_filenames(struct string_list *fname_list,
> +				   struct string_list *fname_kept_list,
> +				   const struct string_list *extra_keep)
>  {
>  	DIR *dir;
>  	struct dirent *e;

The comment in the before-context of this hunk needs to be updated.

Also, I think that fname_list should be renamed to fname_nonkept_list.
It does have the same semantics as before, but the name of the function
has changed. And also, fname_list sounds like a superset of
fname_kept_list, but that is not the case.

> @@ -440,6 +440,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  	struct string_list names = STRING_LIST_INIT_DUP;
>  	struct string_list rollback = STRING_LIST_INIT_NODUP;
>  	struct string_list existing_packs = STRING_LIST_INIT_DUP;
> +	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
>  	struct pack_geometry *geometry = NULL;
>  	struct strbuf line = STRBUF_INIT;
>  	int i, ext, ret;

Likewise here - the introduction of a new similarly-named variable means
that existing_packs should be renamed, I think.

Other than that, this patch looks good.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-15 18:24   ` [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
  2021-09-22 22:34     ` Josh Steadmon
@ 2021-09-22 23:00     ` Jonathan Tan
  2021-09-23  2:18       ` Taylor Blau
  1 sibling, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-22 23:00 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> This approach will cause a problem when multi-pack bitmaps are able to
> be generated from `git repack`, since the reference tips can change
> during the repack. Even though we ignore commits that don't exist in
> the MIDX (when doing a scan of the ref tips), it's possible that a
> commit in the MIDX reaches something that isn't.
> 
> This can happen when a multi-pack index contains some pack which refers
> to loose objects (which by definition aren't included in the multi-pack
> index).
> 
> By taking a snapshot of the references before we start repacking, we can
> close that race window.

I can understand why we want the refs to remain the same both for the
MIDX generation and the MIDX bitmap generation (one reason that comes to
mind is how we select the commits for which to generate bitmaps), but I
don't understand what referring to loose objects has to do with it. I
think that using the same set of refs for MIDX generation and bitmap
generation is a good enough reason to do this, and we don't need to
mention loose objects.

The patch itself looks good.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-15 18:24   ` [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
  2021-09-22 22:29     ` Josh Steadmon
@ 2021-09-22 23:11     ` Jonathan Tan
  2021-09-23  2:06       ` Taylor Blau
  1 sibling, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-22 23:11 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> An alternative approach
> would have been to call that function from the `git repack` builtin
> directly, but this introduces awkward problems around closing and
> reopening the object store, so the MIDX will be written out-of-process.

I'm not sure if the implementation direction started by this patch
(eventually, running "git multi-pack-index write --stdin-packs" to
replace the midx of a repository while "git repack" is running) would
work on platforms in which mmap-ing a file means that other processes
can't delete it, but if it works on a Windows CI, this should be fine.
(I don't have a Windows CI handy to test it on, though.)

Assuming it works on platforms like Windows, this patch looks fine.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly
  2021-09-15 18:24   ` [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly Taylor Blau
@ 2021-09-22 23:14     ` Jonathan Tan
  2021-09-23  3:09       ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-22 23:14 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> @@ -1237,7 +1253,7 @@ static int write_midx_internal(const char *object_dir,
>  
>  	QSORT(ctx.info, ctx.nr, pack_info_compare);
>  
> -	if (packs_to_drop && packs_to_drop->nr) {
> +	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
>  		int drop_index = 0;
>  		int missing_drops = 0;
>  

I couldn't figure out why this requires ctx.m now.

> @@ -62,6 +63,10 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
>  int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
>  
>  int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
> +int write_midx_file_only(const char *object_dir,
> +			 struct string_list *packs_to_include,
> +			 const char *preferred_pack_name,
> +			 unsigned flags);

It took me a while to figure out that this function doesn't only write a
MIDX file, but writes an MIDX file only for certain packs. Maybe worth
adding a comment here (e.g. "Write an MIDX file only for the given
packs").

Other than that, this patch looks good.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-22 22:29     ` Josh Steadmon
@ 2021-09-23  2:03       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:03 UTC (permalink / raw)
  To: Josh Steadmon, Taylor Blau, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 03:29:53PM -0700, Josh Steadmon wrote:
> Thanks for the series! I have a couple of questions:
>
>
> On 2021.09.15 14:24, Taylor Blau wrote:
> > To power a new `--write-midx` mode, `git repack` will want to write a
> > multi-pack index containing a certain set of packs in the repository.
> >
> > This new option will be used by `git repack` to write a MIDX which
> > contains only the packs which will survive after the repack (that is, it
> > will exclude any packs which are about to be deleted).
> >
> > This patch effectively exposes the function implemented in the previous
> > commit via the `git multi-pack-index` builtin. An alternative approach
> > would have been to call that function from the `git repack` builtin
> > directly, but this introduces awkward problems around closing and
> > reopening the object store, so the MIDX will be written out-of-process.
>
> Could you elaborate a bit on the "awkward problems" here? I'm afraid I'm
> missing the context here.

A variety of things can go wrong when the object store is closed and
re-opened in the same process. Many of the symptoms are described
beginning at this message:

  https://lore.kernel.org/git/YPf1m01mcdJ3HNBt@coredump.intra.peff.net/

and further down in the sub-thread. Many of those problems have been
resolved, but I'm not convinced that there aren't others lurking.

> > +static void read_packs_from_stdin(struct string_list *to)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	while (strbuf_getline(&buf, stdin) != EOF)
> > +		string_list_append(to, buf.buf);
> > +	string_list_sort(to);
> > +
> > +	strbuf_release(&buf);
> > +}
> > +
>
> I'm presuming that the packfile list is going to be generated
> automatically, but what happens if that becomes corrupt somehow, and we
> skip a packfile that should have been included? Will that cause
> incorrect behavior, or will we just miss out on some of the bitmap
> performance benefits?

A multi-pack bitmap can only refer to objects that are in a pack which
the repository's MIDX includes. So if we left off a pack from this list,
we'd be unable to cover that pack's objects in the resulting bitmap.
We'd also be unable to cover any objects which are reachable from the
missing pack's objects, since the set of objects in a bitmap must be
closed under reachability.

If, on the other hand, we read a line which does not correspond to any
pack, we'll simply ignore it. That's because we loop over the results of
get_all_packs() and try to find a match in this list instead of the
other way around.

We could mark the packs we found by abusing the string_list_item's util
pointer, but it's probably not worth it since this is mostly an internal
interface.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-22 23:11     ` Jonathan Tan
@ 2021-09-23  2:06       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:06 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 04:11:37PM -0700, Jonathan Tan wrote:
> > An alternative approach
> > would have been to call that function from the `git repack` builtin
> > directly, but this introduces awkward problems around closing and
> > reopening the object store, so the MIDX will be written out-of-process.
>
> I'm not sure if the implementation direction started by this patch
> (eventually, running "git multi-pack-index write --stdin-packs" to
> replace the midx of a repository while "git repack" is running) would
> work on platforms in which mmap-ing a file means that other processes
> can't delete it, but if it works on a Windows CI, this should be fine.
> (I don't have a Windows CI handy to test it on, though.)
>
> Assuming it works on platforms like Windows, this patch looks fine.

It definitely passes CI :-). But special care is taken to handle this
case, and it works for a couple of reasons. Notably that we only call
`write_midx_included_packs()` (which in turn invokes the
multi-pack-index builtin as a sub-process) *after* repack has called
close_object_store(). That means that `repack` won't be holding the MIDX
open while the sub-process is trying to write a new one in its place.

The other reason is that the multi-pack-index process also makes sure to
close the old MIDX before writing the new one, so we can be certain that
neither of these spots are mapping the file or have it opened when
trying to write over it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-22 22:34     ` Josh Steadmon
@ 2021-09-23  2:08       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:08 UTC (permalink / raw)
  To: Josh Steadmon, Taylor Blau, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 03:34:53PM -0700, Josh Steadmon wrote:
> > +		object = parse_object_or_die(&oid, NULL);
> > +		if (preferred)
> > +			object->flags |= NEEDS_BITMAP;
> > +
> > +		add_pending_object(revs, object, "");
> > +	}
> > +
> > +	fclose(f);
> > +	return 0;
> > +}
>
> `buf` needs to be released here.

Good eyes; thanks for noticing :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot`
  2021-09-22 23:00     ` Jonathan Tan
@ 2021-09-23  2:18       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 04:00:12PM -0700, Jonathan Tan wrote:
> > This approach will cause a problem when multi-pack bitmaps are able to
> > be generated from `git repack`, since the reference tips can change
> > during the repack. Even though we ignore commits that don't exist in
> > the MIDX (when doing a scan of the ref tips), it's possible that a
> > commit in the MIDX reaches something that isn't.
> >
> > This can happen when a multi-pack index contains some pack which refers
> > to loose objects (which by definition aren't included in the multi-pack
> > index).
> >
> > By taking a snapshot of the references before we start repacking, we can
> > close that race window.
>
> I can understand why we want the refs to remain the same both for the
> MIDX generation and the MIDX bitmap generation (one reason that comes to
> mind is how we select the commits for which to generate bitmaps), but I
> don't understand what referring to loose objects has to do with it. I
> think that using the same set of refs for MIDX generation and bitmap
> generation is a good enough reason to do this, and we don't need to
> mention loose objects.

The point there is that a pack which contains objects that are ancestors
of loose objects can show up at any time, including just before we
select which packs to go into a MIDX.

This is particularly common at GitHub, where all of our test-merges and
objects created via the web interface are written loose. So we could
compute a test merge, store the result loose, and have somebody push a
pack up on top of it.

If that pack shows up after the repack, but before we write a MIDX, then
without the refs-snapshot code, we would include that pack, select its
commits as candidates for bitmap selection, and then ultimately discover
that the bitmap isn't closed, since the loose object won't be included.

But that may be a little into the weeds for the patch message. Anyway,
I did have to think about it for a minute, but I'm pretty sure that's
what I was thinking when I wrote this.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking
  2021-09-22 22:39     ` Jonathan Tan
@ 2021-09-23  2:40       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:40 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 03:39:35PM -0700, Jonathan Tan wrote:
> > @@ -683,22 +755,41 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  	}
> >  	/* End of pack replacement. */
> >
> > +	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
> > +		const int hexsz = the_hash_algo->hexsz;
> > +		string_list_sort(&names);
> > +		for_each_string_list_item(item, &existing_packs) {
> > +			char *sha1;
> > +			size_t len = strlen(item->string);
> > +			if (len < hexsz)
> > +				continue;
> > +			sha1 = item->string + len - hexsz;
> > +			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
>
> OK, here is the tricky part. They are marked for deletion here...

Yep, instead of deleting the packs immediately, which would cause an
existing MIDX referencing them to become corrupt, we split that into two
phases: one to figure out which packs need to be deleted (after the new
MIDX), and then another to actually delete those packs.

> > +		}
> > +	}
> > +
> > +	if (write_midx) {
> > +		struct string_list include = STRING_LIST_INIT_NODUP;
> > +		midx_included_packs(&include, &existing_packs,
> > +				    &existing_kept_packs, &names, geometry);
>
> ...the mark for deletion is taken into account during the execution of
> midx_included_packs() (as can be seen by looking at that function)...
>
> > +
> > +		ret = write_midx_included_packs(&include,
> > +						show_progress, write_bitmaps > 0);
> > +
> > +		string_list_clear(&include, 0);
> > +
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> >  	reprepare_packed_git(the_repository);
> >
> >  	if (delete_redundant) {
> >  		int opts = 0;
> > -		if (pack_everything & ALL_INTO_ONE) {
> > -			const int hexsz = the_hash_algo->hexsz;
> > -			string_list_sort(&names);
> > -			for_each_string_list_item(item, &existing_packs) {
> > -				char *sha1;
> > -				size_t len = strlen(item->string);
> > -				if (len < hexsz)
> > -					continue;
> > -				sha1 = item->string + len - hexsz;
> > -				if (!string_list_has_string(&names, sha1))
> > -					remove_redundant_pack(packdir, item->string);
> > -			}
> > +		for_each_string_list_item(item, &existing_packs) {
> > +			if (!item->util)
> > +				continue;
>
> ...and the marks are also used here. I was at first confused about why
> the functionality of midx_included_packs() depended on whether redundant
> packs were marked for deletion - if they are redundant, shouldn't they
> never be taken into account (regardless of whether we actually delete
> them)? But I guess it makes sense as an overall design point that we
> pass all packs that are to remain (so if they will be deleted, exclude
> them, and if they will not be, include them).

Right; midx_included_packs() is used to determine which packs are going
to be written into the new MIDX, and we need to make sure that we
exclude any packs which are about to be deleted.

> I think a comment "mark this pack for deletion" at the point we write
> the mark (so, where the cast to intptr_t is) is worthwhile. Other than
> that, this patch looks good to me.

Sure, that seems like it would be helpful to other readers.

> I (and the others in our review club) only managed to reach this patch.
> I hope to get to the other 2 by the end of the week.

Thanks for reviewing, I look forward to more of your feedback.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally
  2021-09-22 22:56     ` Jonathan Tan
@ 2021-09-23  2:59       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  2:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 03:56:05PM -0700, Jonathan Tan wrote:
> > @@ -98,8 +98,9 @@ static void remove_pack_on_signal(int signo)
> >   * have a corresponding .keep file. These packs are not to
> >   * be kept if we are going to pack everything into one file.
> >   */
> > -static void get_non_kept_pack_filenames(struct string_list *fname_list,
> > -					const struct string_list *extra_keep)
> > +static void collect_pack_filenames(struct string_list *fname_list,
> > +				   struct string_list *fname_kept_list,
> > +				   const struct string_list *extra_keep)
> >  {
> >  	DIR *dir;
> >  	struct dirent *e;
>
> The comment in the before-context of this hunk needs to be updated.

Thanks, updated. Now it reads like this:

  /*
   * Adds all packs hex strings to either the fname or fname_kept_list
   * list, based on whether each pack has a corresponding .keep file or
   * not. Packs without a .keep file are not to be kept if we are going
   * to pack everything into one file.
   */

> Also, I think that fname_list should be renamed to fname_nonkept_list.
> It does have the same semantics as before, but the name of the function
> has changed. And also, fname_list sounds like a superset of
> fname_kept_list, but that is not the case.

Yeah, I think that is a reasonable suggestion. I'll do it in a separate
patch, though, to keep the renaming self-contained.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly
  2021-09-22 23:14     ` Jonathan Tan
@ 2021-09-23  3:09       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-23  3:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: me, git, peff, gitster, avarab

On Wed, Sep 22, 2021 at 04:14:23PM -0700, Jonathan Tan wrote:
> > @@ -1237,7 +1253,7 @@ static int write_midx_internal(const char *object_dir,
> >
> >  	QSORT(ctx.info, ctx.nr, pack_info_compare);
> >
> > -	if (packs_to_drop && packs_to_drop->nr) {
> > +	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
> >  		int drop_index = 0;
> >  		int missing_drops = 0;
> >
>
> I couldn't figure out why this requires ctx.m now.

Me either; this must have been a stray change that got dragged along. I
dropped in -- thanks for pointing it out.

> > @@ -62,6 +63,10 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
> >  int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
> >
> >  int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
> > +int write_midx_file_only(const char *object_dir,
> > +			 struct string_list *packs_to_include,
> > +			 const char *preferred_pack_name,
> > +			 unsigned flags);
>
> It took me a while to figure out that this function doesn't only write a
> MIDX file, but writes an MIDX file only for certain packs. Maybe worth
> adding a comment here (e.g. "Write an MIDX file only for the given
> packs").

Nitpicking a little, it does still write a (single) MIDX file, but that
MIDX only includes the packs listed in packs_to_include. I can add a
comment to that effect.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-15 18:24   ` [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
@ 2021-09-24 18:22     ` Jonathan Tan
  2021-10-01 22:38       ` Taylor Blau
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-09-24 18:22 UTC (permalink / raw)
  To: me; +Cc: git, peff, gitster, avarab, Jonathan Tan

> +static void midx_snapshot_refs(struct tempfile *f)
> +{
> +	struct midx_snapshot_ref_data data;
> +	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
> +
> +	data.f = f;
> +	oidset_init(&data.seen, 0);
> +
> +	if (!fdopen_tempfile(f, "w"))
> +		 die(_("could not open tempfile %s for writing"),
> +		     get_tempfile_path(f));
> +
> +	if (preferred) {
> +		struct string_list_item *item;
> +
> +		data.preferred = 1;
> +		for_each_string_list_item(item, preferred)
> +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> +		data.preferred = 0;
> +	}
> +
> +	for_each_ref(midx_snapshot_ref_one, &data);

One small thing here - I think "data.preferred = 0;" needs to be moved
to right above this for_each_ref line. As it is, if preferred is NULL,
data.preferred wouldn't be initialized.

Other than that (and the minor changes I've suggested in previous
patches), this series looks good to me.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v3 0/9] repack: introduce `--write-midx`
  2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
                     ` (8 preceding siblings ...)
  2021-09-15 19:22   ` [PATCH v2 0/8] repack: introduce `--write-midx` Junio C Hamano
@ 2021-09-29  1:54   ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 1/9] midx: expose `write_midx_file_only()` publicly Taylor Blau
                       ` (10 more replies)
  9 siblings, 11 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:54 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

Here is another small reroll of my series to implement the final component of
multi-pack reachability bitamps, which is to be able to write them from `git
repack`.

This version incorporates feedback from Jonathan Tan and others at Google who
graciously provided review. A range-diff is included below, but the major
changes are:

  - A comment to explain the relationship between 'ctx->m' and 'ctx->to_include'
    in midx.c:add_pack_to_midx().

  - A comment to explain the meaning of write_midx_file_only().

  - Clean-up of stray hunks, a missing strbuf_release().

  - Replacing the name `existing_packs` with `existing_nonkept_packs` to
    emphasize the relationship between it and the similarly-named
    `existing_kept_packs`.

Instead of depending on tb/multi-pack-bitmaps, this series now depends on the
tip of master.

Taylor Blau (9):
  midx: expose `write_midx_file_only()` publicly
  builtin/multi-pack-index.c: support `--stdin-packs` mode
  midx: preliminary support for `--refs-snapshot`
  builtin/repack.c: keep track of existing packs unconditionally
  builtin/repack.c: rename variables that deal with non-kept packs
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps

 Documentation/git-multi-pack-index.txt |  19 ++
 Documentation/git-repack.txt           |  18 +-
 builtin/multi-pack-index.c             |  36 +++-
 builtin/repack.c                       | 279 ++++++++++++++++++++++---
 midx.c                                 | 110 ++++++++--
 midx.h                                 |  15 +-
 pack-bitmap.c                          |   2 +-
 pack-bitmap.h                          |   1 +
 t/helper/test-read-midx.c              |  25 ++-
 t/lib-midx.sh                          |   8 +
 t/t5319-multi-pack-index.sh            |  15 ++
 t/t5326-multi-pack-bitmaps.sh          |  82 ++++++++
 t/t7700-repack.sh                      |  96 +++++++++
 t/t7703-repack-geometric.sh            |  24 ++-
 14 files changed, 674 insertions(+), 56 deletions(-)
 create mode 100644 t/lib-midx.sh

Range-diff against v2:
 1:  03c1b2c4d3 !  1:  b7c72d0bf5 midx: expose `write_midx_file_only()` publicly
    @@ midx.c: struct write_midx_context {
      
      static void add_pack_to_midx(const char *full_path, size_t full_path_len,
     @@ midx.c: static void add_pack_to_midx(const char *full_path, size_t full_path_len,
    + 
    + 	if (ends_with(file_name, ".idx")) {
      		display_progress(ctx->progress, ++ctx->pack_paths_checked);
    ++		/*
    ++		 * Note that at most one of ctx->m and ctx->to_include are set,
    ++		 * so we are testing midx_contains_pack() and
    ++		 * string_list_has_string() independently (guarded by the
    ++		 * appropriate NULL checks).
    ++		 *
    ++		 * We could support passing to_include while reusing an existing
    ++		 * MIDX, but don't currently since the reuse process drags
    ++		 * forward all packs from an existing MIDX (without checking
    ++		 * whether or not they appear in the to_include list).
    ++		 *
    ++		 * If we added support for that, these next two conditional
    ++		 * should be performed independently (likely checking
    ++		 * to_include before the existing MIDX).
    ++		 */
      		if (ctx->m && midx_contains_pack(ctx->m, file_name))
      			return;
     +		else if (ctx->to_include &&
    @@ midx.c: static int write_midx_internal(const char *object_dir,
      		struct bitmap_index *bitmap_git;
      		int bitmap_exists;
      		int want_bitmap = flags & MIDX_WRITE_BITMAP;
    -@@ midx.c: static int write_midx_internal(const char *object_dir,
    - 
    - 	QSORT(ctx.info, ctx.nr, pack_info_compare);
    - 
    --	if (packs_to_drop && packs_to_drop->nr) {
    -+	if (ctx.m && packs_to_drop && packs_to_drop->nr) {
    - 		int drop_index = 0;
    - 		int missing_drops = 0;
    - 
     @@ midx.c: int write_midx_file(const char *object_dir,
      		    const char *preferred_pack_name,
      		    unsigned flags)
    @@ midx.h: int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pa
      int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
      
      int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
    ++/*
    ++ * Variant of write_midx_file which writes a MIDX containing only the packs
    ++ * specified in packs_to_include.
    ++ */
     +int write_midx_file_only(const char *object_dir,
     +			 struct string_list *packs_to_include,
     +			 const char *preferred_pack_name,
 2:  59556e5545 =  2:  986ef14f2a builtin/multi-pack-index.c: support `--stdin-packs` mode
 3:  42f1ae9ede !  3:  4e3769a4f3 midx: preliminary support for `--refs-snapshot`
    @@ Commit message
         commit in the MIDX reaches something that isn't.
     
         This can happen when a multi-pack index contains some pack which refers
    -    to loose objects (which by definition aren't included in the multi-pack
    -    index).
    +    to loose objects (e.g., if a pack was pushed after starting the repack
    +    but before generating the MIDX which depends on an object which is
    +    stored as loose in the repository, and by definition isn't included in
    +    the multi-pack index).
     
         By taking a snapshot of the references before we start repacking, we can
         close that race window. In the above scenario (where we have a packed
    @@ midx.c: static void bitmap_show_commit(struct commit *commit, void *_data)
     +	}
     +
     +	fclose(f);
    ++	strbuf_release(&buf);
     +	return 0;
     +}
     +
    @@ midx.h: int fill_midx_entry(struct repository *r, const struct object_id *oid, s
      int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
      
     -int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
    + /*
    +  * Variant of write_midx_file which writes a MIDX containing only the packs
    +  * specified in packs_to_include.
    +  */
     +int write_midx_file(const char *object_dir,
     +		    const char *preferred_pack_name,
     +		    const char *refs_snapshot,
 4:  c0d045a9de !  4:  1b3dd331ca builtin/repack.c: keep track of existing packs unconditionally
    @@ Commit message
     
      ## builtin/repack.c ##
     @@ builtin/repack.c: static void remove_pack_on_signal(int signo)
    -  * have a corresponding .keep file. These packs are not to
    -  * be kept if we are going to pack everything into one file.
    + }
    + 
    + /*
    +- * Adds all packs hex strings to the fname list, which do not
    +- * have a corresponding .keep file. These packs are not to
    +- * be kept if we are going to pack everything into one file.
    ++ * Adds all packs hex strings to either fname_list or fname_kept_list
    ++ * based on whether each pack has a corresponding .keep file or not.
    ++ * Packs without a .keep file are not to be kept if we are going to
    ++ * pack everything into one file.
       */
     -static void get_non_kept_pack_filenames(struct string_list *fname_list,
     -					const struct string_list *extra_keep)
 -:  ---------- >  5:  15831a201a builtin/repack.c: rename variables that deal with non-kept packs
 5:  09de03de47 =  6:  1a40161baf builtin/repack.c: extract showing progress to a variable
 6:  83dfdb8b12 !  7:  6854f0751d builtin/repack.c: support writing a MIDX while repacking
    @@ Commit message
         check we can make consistently when (1) telling the MIDX which packs we
         want to exclude, and (2) actually unlinking the redundant packs.
     
    +    There is also a tiny change to short-circuit the body of
    +    write_midx_included_packs() when no packs remain in the case of an empty
    +    repository. The MIDX code does not handle this, so avoid trying to
    +    generate a MIDX covering zero packs in the first place.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>
     
      ## Documentation/git-repack.txt ##
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
      }
      
     +static void midx_included_packs(struct string_list *include,
    -+				struct string_list *existing_packs,
    ++				struct string_list *existing_nonkept_packs,
     +				struct string_list *existing_kept_packs,
     +				struct string_list *names,
     +				struct pack_geometry *geometry)
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +			string_list_insert(include, strbuf_detach(&buf, NULL));
     +		}
     +	} else {
    -+		for_each_string_list_item(item, existing_packs) {
    ++		for_each_string_list_item(item, existing_nonkept_packs) {
     +			if (item->util)
     +				continue;
     +			string_list_insert(include, xstrfmt("%s.idx", item->string));
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +	FILE *in;
     +	int ret;
     +
    ++	if (!include->nr)
    ++		return 0;
    ++
     +	cmd.in = -1;
     +	cmd.git_cmd = 1;
     +
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
     +		const int hexsz = the_hash_algo->hexsz;
     +		string_list_sort(&names);
    -+		for_each_string_list_item(item, &existing_packs) {
    ++		for_each_string_list_item(item, &existing_nonkept_packs) {
     +			char *sha1;
     +			size_t len = strlen(item->string);
     +			if (len < hexsz)
     +				continue;
     +			sha1 = item->string + len - hexsz;
    ++			/*
    ++			 * Mark this pack for deletion, which ensures that this
    ++			 * pack won't be included in a MIDX (if `--write-midx`
    ++			 * was given) and that we will actually delete this pack
    ++			 * (if `-d` was given).
    ++			 */
     +			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
     +		}
     +	}
     +
     +	if (write_midx) {
     +		struct string_list include = STRING_LIST_INIT_NODUP;
    -+		midx_included_packs(&include, &existing_packs,
    ++		midx_included_packs(&include, &existing_nonkept_packs,
     +				    &existing_kept_packs, &names, geometry);
     +
     +		ret = write_midx_included_packs(&include,
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     -		if (pack_everything & ALL_INTO_ONE) {
     -			const int hexsz = the_hash_algo->hexsz;
     -			string_list_sort(&names);
    --			for_each_string_list_item(item, &existing_packs) {
    +-			for_each_string_list_item(item, &existing_nonkept_packs) {
     -				char *sha1;
     -				size_t len = strlen(item->string);
     -				if (len < hexsz)
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     -				if (!string_list_has_string(&names, sha1))
     -					remove_redundant_pack(packdir, item->string);
     -			}
    -+		for_each_string_list_item(item, &existing_packs) {
    ++		for_each_string_list_item(item, &existing_nonkept_packs) {
     +			if (!item->util)
     +				continue;
     +			remove_redundant_pack(packdir, item->string);
    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
     +'
     +
      test_done
    +
    + ## t/t7703-repack-geometric.sh ##
    +@@ t/t7703-repack-geometric.sh: test_expect_success '--geometric with no packs' '
    + 	(
    + 		cd geometric &&
    + 
    +-		git repack --geometric 2 >out &&
    ++		git repack --write-midx --geometric 2 >out &&
    + 		test_i18ngrep "Nothing new to pack" out
    + 	)
    + '
 7:  68bc49d8ae !  8:  3596c76daf builtin/repack.c: make largest pack preferred
    @@ builtin/repack.c: static int write_midx_included_packs(struct string_list *inclu
      	if (ret)
      		return ret;
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    - 		midx_included_packs(&include, &existing_packs,
    + 		midx_included_packs(&include, &existing_nonkept_packs,
      				    &existing_kept_packs, &names, geometry);
      
     -		ret = write_midx_included_packs(&include,
 8:  eb24b308ec !  9:  d99f075321 builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
    @@ builtin/repack.c: static void clear_pack_geometry(struct pack_geometry *geometry
     +}
     +
      static void midx_included_packs(struct string_list *include,
    - 				struct string_list *existing_packs,
    + 				struct string_list *existing_nonkept_packs,
      				struct string_list *existing_kept_packs,
     @@ builtin/repack.c: static void midx_included_packs(struct string_list *include,
      
-- 
2.33.0.96.g73915697e6

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v3 1/9] midx: expose `write_midx_file_only()` publicly
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 2/9] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
                       ` (9 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 midx.h |  9 +++++++++
 2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/midx.c b/midx.c
index f96fb2efee..7ac97e66e0 100644
--- a/midx.c
+++ b/midx.c
@@ -460,6 +460,8 @@ struct write_midx_context {
 	uint32_t num_large_offsets;
 
 	int preferred_pack_idx;
+
+	struct string_list *to_include;
 };
 
 static void add_pack_to_midx(const char *full_path, size_t full_path_len,
@@ -469,8 +471,26 @@ static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
+		/*
+		 * Note that at most one of ctx->m and ctx->to_include are set,
+		 * so we are testing midx_contains_pack() and
+		 * string_list_has_string() independently (guarded by the
+		 * appropriate NULL checks).
+		 *
+		 * We could support passing to_include while reusing an existing
+		 * MIDX, but don't currently since the reuse process drags
+		 * forward all packs from an existing MIDX (without checking
+		 * whether or not they appear in the to_include list).
+		 *
+		 * If we added support for that, these next two conditional
+		 * should be performed independently (likely checking
+		 * to_include before the existing MIDX).
+		 */
 		if (ctx->m && midx_contains_pack(ctx->m, file_name))
 			return;
+		else if (ctx->to_include &&
+			 !string_list_has_string(ctx->to_include, file_name))
+			return;
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
@@ -1043,6 +1063,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 }
 
 static int write_midx_internal(const char *object_dir,
+			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       unsigned flags)
@@ -1067,10 +1088,17 @@ static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name);
 
-	for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
-		if (!strcmp(object_dir, cur->object_dir)) {
-			ctx.m = cur;
-			break;
+	if (!packs_to_include) {
+		/*
+		 * Only reference an existing MIDX when not filtering which
+		 * packs to include, since all packs and objects are copied
+		 * blindly from an existing MIDX if one is present.
+		 */
+		for (cur = get_multi_pack_index(the_repository); cur; cur = cur->next) {
+			if (!strcmp(object_dir, cur->object_dir)) {
+				ctx.m = cur;
+				break;
+			}
 		}
 	}
 
@@ -1121,10 +1149,13 @@ static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;
 
+	ctx.to_include = packs_to_include;
+
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);
 
-	if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) {
+	if ((ctx.m && ctx.nr == ctx.m->num_packs) &&
+	    !(packs_to_include || packs_to_drop)) {
 		struct bitmap_index *bitmap_git;
 		int bitmap_exists;
 		int want_bitmap = flags & MIDX_WRITE_BITMAP;
@@ -1365,7 +1396,17 @@ int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
 		    unsigned flags)
 {
-	return write_midx_internal(object_dir, NULL, preferred_pack_name, flags);
+	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
+				   flags);
+}
+
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags)
+{
+	return write_midx_internal(object_dir, packs_to_include, NULL,
+				   preferred_pack_name, flags);
 }
 
 struct clear_midx_data {
@@ -1645,7 +1686,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
 		m = NULL;
 	}
 
@@ -1836,7 +1877,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index aa3da557bb..3545e327ea 100644
--- a/midx.h
+++ b/midx.h
@@ -2,6 +2,7 @@
 #define MIDX_H
 
 #include "repository.h"
+#include "string-list.h"
 
 struct object_id;
 struct pack_entry;
@@ -62,6 +63,14 @@ int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name)
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
 int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
+/*
+ * Variant of write_midx_file which writes a MIDX containing only the packs
+ * specified in packs_to_include.
+ */
+int write_midx_file_only(const char *object_dir,
+			 struct string_list *packs_to_include,
+			 const char *preferred_pack_name,
+			 unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
 int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 2/9] builtin/multi-pack-index.c: support `--stdin-packs` mode
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 1/9] midx: expose `write_midx_file_only()` publicly Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 3/9] midx: preliminary support for `--refs-snapshot` Taylor Blau
                       ` (8 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

To power a new `--write-midx` mode, `git repack` will want to write a
multi-pack index containing a certain set of packs in the repository.

This new option will be used by `git repack` to write a MIDX which
contains only the packs which will survive after the repack (that is, it
will exclude any packs which are about to be deleted).

This patch effectively exposes the function implemented in the previous
commit via the `git multi-pack-index` builtin. An alternative approach
would have been to call that function from the `git repack` builtin
directly, but this introduces awkward problems around closing and
reopening the object store, so the MIDX will be written out-of-process.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  4 ++++
 builtin/multi-pack-index.c             | 27 ++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh            | 15 ++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index a9df3dbd32..009c989ef8 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -45,6 +45,10 @@ write::
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
+
+	--stdin-packs::
+		Write a multi-pack index containing only the set of
+		line-delimited pack index basenames provided over stdin.
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 66de6efd41..03aaf8e7fb 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -47,6 +47,7 @@ static struct opts_multi_pack_index {
 	const char *preferred_pack;
 	unsigned long batch_size;
 	unsigned flags;
+	int stdin_packs;
 } opts;
 
 static struct option common_opts[] = {
@@ -61,6 +62,16 @@ static struct option *add_common_options(struct option *prev)
 	return parse_options_concat(common_opts, prev);
 }
 
+static void read_packs_from_stdin(struct string_list *to)
+{
+	struct strbuf buf = STRBUF_INIT;
+	while (strbuf_getline(&buf, stdin) != EOF)
+		string_list_append(to, buf.buf);
+	string_list_sort(to);
+
+	strbuf_release(&buf);
+}
+
 static int cmd_multi_pack_index_write(int argc, const char **argv)
 {
 	struct option *options;
@@ -70,6 +81,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			   N_("pack for reuse when computing a multi-pack bitmap")),
 		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
+		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
+			 N_("write multi-pack index containing only given indexes")),
 		OPT_END(),
 	};
 
@@ -86,6 +99,20 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	FREE_AND_NULL(options);
 
+	if (opts.stdin_packs) {
+		struct string_list packs = STRING_LIST_INIT_DUP;
+		int ret;
+
+		read_packs_from_stdin(&packs);
+
+		ret = write_midx_file_only(opts.object_dir, &packs,
+					   opts.preferred_pack, opts.flags);
+
+		string_list_clear(&packs, 0);
+
+		return ret;
+
+	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
 			       opts.flags);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bb04f0f23b..385f0a3efd 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -168,6 +168,21 @@ test_expect_success 'write midx with two packs' '
 
 compare_results_with_midx "two packs"
 
+test_expect_success 'write midx with --stdin-packs' '
+	rm -fr $objdir/pack/multi-pack-index &&
+
+	idx="$(find $objdir/pack -name "test-2-*.idx")" &&
+	basename "$idx" >in &&
+
+	git multi-pack-index write --stdin-packs <in &&
+
+	test-tool read-midx $objdir | grep "\.idx$" >packs &&
+
+	test_cmp packs in
+'
+
+compare_results_with_midx "mixed mode (one pack + extra)"
+
 test_expect_success 'write progress off for redirected stderr' '
 	git multi-pack-index --object-dir=$objdir write 2>err &&
 	test_line_count = 0 err
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 3/9] midx: preliminary support for `--refs-snapshot`
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 1/9] midx: expose `write_midx_file_only()` publicly Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 2/9] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 4/9] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
                       ` (7 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.

This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.

This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).

By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.

This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.

The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).

When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt | 15 +++++
 builtin/multi-pack-index.c             | 11 +++-
 builtin/repack.c                       |  2 +-
 midx.c                                 | 61 ++++++++++++++++---
 midx.h                                 |  6 +-
 t/t5326-multi-pack-bitmaps.sh          | 82 ++++++++++++++++++++++++++
 6 files changed, 164 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index 009c989ef8..27f83932e4 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -49,6 +49,21 @@ write::
 	--stdin-packs::
 		Write a multi-pack index containing only the set of
 		line-delimited pack index basenames provided over stdin.
+
+	--refs-snapshot=<path>::
+		With `--bitmap`, optionally specify a file which
+		contains a "refs snapshot" taken prior to repacking.
++
+A reference snapshot is composed of line-delimited OIDs corresponding to
+the reference tips, usually taken by `git repack` prior to generating a
+new pack. A line may optionally start with a `+` character to indicate
+that the reference which corresponds to that OID is "preferred" (see
+linkgit:git-config[1]'s `pack.preferBitmapTips`.)
++
+The file given at `<path>` is expected to be readable, and can contain
+duplicates. (If a given OID is given more than once, it is marked as
+preferred if at least one instance of it begins with the special `+`
+marker).
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 03aaf8e7fb..93869d58c5 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -7,7 +7,8 @@
 #include "object-store.h"
 
 #define BUILTIN_MIDX_WRITE_USAGE \
-	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]")
+	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]" \
+	   "[--refs-snapshot=<path>]")
 
 #define BUILTIN_MIDX_VERIFY_USAGE \
 	N_("git multi-pack-index [<options>] verify")
@@ -45,6 +46,7 @@ static char const * const builtin_multi_pack_index_usage[] = {
 static struct opts_multi_pack_index {
 	const char *object_dir;
 	const char *preferred_pack;
+	const char *refs_snapshot;
 	unsigned long batch_size;
 	unsigned flags;
 	int stdin_packs;
@@ -83,6 +85,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
 		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
 			 N_("write multi-pack index containing only given indexes")),
+		OPT_FILENAME(0, "refs-snapshot", &opts.refs_snapshot,
+			     N_("refs snapshot for selecting bitmap commits")),
 		OPT_END(),
 	};
 
@@ -106,7 +110,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 		read_packs_from_stdin(&packs);
 
 		ret = write_midx_file_only(opts.object_dir, &packs,
-					   opts.preferred_pack, opts.flags);
+					   opts.preferred_pack,
+					   opts.refs_snapshot, opts.flags);
 
 		string_list_clear(&packs, 0);
 
@@ -114,7 +119,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
-			       opts.flags);
+			       opts.refs_snapshot, opts.flags);
 }
 
 static int cmd_multi_pack_index_verify(int argc, const char **argv)
diff --git a/builtin/repack.c b/builtin/repack.c
index c1a209013b..dba83eede2 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -733,7 +733,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		unsigned flags = 0;
 		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
 			flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
-		write_midx_file(get_object_directory(), NULL, flags);
+		write_midx_file(get_object_directory(), NULL, NULL, flags);
 	}
 
 	string_list_clear(&names, 0);
diff --git a/midx.c b/midx.c
index 7ac97e66e0..984dea2dba 100644
--- a/midx.c
+++ b/midx.c
@@ -968,7 +968,43 @@ static void bitmap_show_commit(struct commit *commit, void *_data)
 	data->commits[data->commits_nr++] = commit;
 }
 
+static int read_refs_snapshot(const char *refs_snapshot,
+			      struct rev_info *revs)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct object_id oid;
+	FILE *f = xfopen(refs_snapshot, "r");
+
+	while (strbuf_getline(&buf, f) != EOF) {
+		struct object *object;
+		int preferred = 0;
+		char *hex = buf.buf;
+		const char *end = NULL;
+
+		if (buf.len && *buf.buf == '+') {
+			preferred = 1;
+			hex = &buf.buf[1];
+		}
+
+		if (parse_oid_hex(hex, &oid, &end) < 0)
+			die(_("could not parse line: %s"), buf.buf);
+		if (*end)
+			die(_("malformed line: %s"), buf.buf);
+
+		object = parse_object_or_die(&oid, NULL);
+		if (preferred)
+			object->flags |= NEEDS_BITMAP;
+
+		add_pending_object(revs, object, "");
+	}
+
+	fclose(f);
+	strbuf_release(&buf);
+	return 0;
+}
+
 static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr_p,
+						    const char *refs_snapshot,
 						    struct write_midx_context *ctx)
 {
 	struct rev_info revs;
@@ -977,8 +1013,12 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 	cb.ctx = ctx;
 
 	repo_init_revisions(the_repository, &revs, NULL);
-	setup_revisions(0, NULL, &revs, NULL);
-	for_each_ref(add_ref_to_pending, &revs);
+	if (refs_snapshot) {
+		read_refs_snapshot(refs_snapshot, &revs);
+	} else {
+		setup_revisions(0, NULL, &revs, NULL);
+		for_each_ref(add_ref_to_pending, &revs);
+	}
 
 	/*
 	 * Skipping promisor objects here is intentional, since it only excludes
@@ -1007,6 +1047,7 @@ static struct commit **find_commits_for_midx_bitmap(uint32_t *indexed_commits_nr
 
 static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 			     struct write_midx_context *ctx,
+			     const char *refs_snapshot,
 			     unsigned flags)
 {
 	struct packing_data pdata;
@@ -1018,7 +1059,7 @@ static int write_midx_bitmap(char *midx_name, unsigned char *midx_hash,
 
 	prepare_midx_packing_data(&pdata, ctx);
 
-	commits = find_commits_for_midx_bitmap(&commits_nr, ctx);
+	commits = find_commits_for_midx_bitmap(&commits_nr, refs_snapshot, ctx);
 
 	/*
 	 * Build the MIDX-order index based on pdata.objects (which is already
@@ -1066,6 +1107,7 @@ static int write_midx_internal(const char *object_dir,
 			       struct string_list *packs_to_include,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
+			       const char *refs_snapshot,
 			       unsigned flags)
 {
 	char *midx_name;
@@ -1359,7 +1401,8 @@ static int write_midx_internal(const char *object_dir,
 	if (flags & MIDX_WRITE_REV_INDEX)
 		write_midx_reverse_index(midx_name, midx_hash, &ctx);
 	if (flags & MIDX_WRITE_BITMAP) {
-		if (write_midx_bitmap(midx_name, midx_hash, &ctx, flags) < 0) {
+		if (write_midx_bitmap(midx_name, midx_hash, &ctx,
+				      refs_snapshot, flags) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
 			goto cleanup;
@@ -1394,19 +1437,21 @@ static int write_midx_internal(const char *object_dir,
 
 int write_midx_file(const char *object_dir,
 		    const char *preferred_pack_name,
+		    const char *refs_snapshot,
 		    unsigned flags)
 {
 	return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name,
-				   flags);
+				   refs_snapshot, flags);
 }
 
 int write_midx_file_only(const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
+			 const char *refs_snapshot,
 			 unsigned flags)
 {
 	return write_midx_internal(object_dir, packs_to_include, NULL,
-				   preferred_pack_name, flags);
+				   preferred_pack_name, refs_snapshot, flags);
 }
 
 struct clear_midx_data {
@@ -1686,7 +1731,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 	free(count);
 
 	if (packs_to_drop.nr) {
-		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, flags);
+		result = write_midx_internal(object_dir, NULL, &packs_to_drop, NULL, NULL, flags);
 		m = NULL;
 	}
 
@@ -1877,7 +1922,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 		goto cleanup;
 	}
 
-	result = write_midx_internal(object_dir, NULL, NULL, NULL, flags);
+	result = write_midx_internal(object_dir, NULL, NULL, NULL, NULL, flags);
 	m = NULL;
 
 cleanup:
diff --git a/midx.h b/midx.h
index 3545e327ea..11ff094a8c 100644
--- a/midx.h
+++ b/midx.h
@@ -62,14 +62,18 @@ int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pa
 int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name);
 int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local);
 
-int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags);
 /*
  * Variant of write_midx_file which writes a MIDX containing only the packs
  * specified in packs_to_include.
  */
+int write_midx_file(const char *object_dir,
+		    const char *preferred_pack_name,
+		    const char *refs_snapshot,
+		    unsigned flags);
 int write_midx_file_only(const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
+			 const char *refs_snapshot,
 			 unsigned flags);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index 4ad7c2c969..069dab3e17 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -283,4 +283,86 @@ test_expect_success 'pack.preferBitmapTips' '
 	)
 '
 
+test_expect_success 'writing a bitmap with --refs-snapshot' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit one &&
+		test_commit two &&
+
+		git rev-parse one >snapshot &&
+
+		git repack -ad &&
+
+		# First, write a MIDX which see both refs/tags/one and
+		# refs/tags/two (causing both of those commits to receive
+		# bitmaps).
+		git multi-pack-index write --bitmap &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		grep "$(git rev-parse two)" bitmaps &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		# Then again, but with a refs snapshot which only sees
+		# refs/tags/one.
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		grep "$(git rev-parse one)" bitmaps &&
+		! grep "$(git rev-parse two)" bitmaps
+	)
+'
+
+test_expect_success 'write a bitmap with --refs-snapshot (preferred tips)' '
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git multi-pack-index write --bitmap &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		(
+			grep -vf before commits.raw &&
+			# mark missing commits as preferred
+			sed "s/^/+/" before
+		) >snapshot &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		git multi-pack-index write --bitmap --refs-snapshot=snapshot &&
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 4/9] builtin/repack.c: keep track of existing packs unconditionally
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (2 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 3/9] midx: preliminary support for `--refs-snapshot` Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 5/9] builtin/repack.c: rename variables that deal with non-kept packs Taylor Blau
                       ` (6 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.

Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.

Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.

Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 56 +++++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index dba83eede2..39f11df675 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -94,12 +94,14 @@ static void remove_pack_on_signal(int signo)
 }
 
 /*
- * Adds all packs hex strings to the fname list, which do not
- * have a corresponding .keep file. These packs are not to
- * be kept if we are going to pack everything into one file.
+ * Adds all packs hex strings to either fname_list or fname_kept_list
+ * based on whether each pack has a corresponding .keep file or not.
+ * Packs without a .keep file are not to be kept if we are going to
+ * pack everything into one file.
  */
-static void get_non_kept_pack_filenames(struct string_list *fname_list,
-					const struct string_list *extra_keep)
+static void collect_pack_filenames(struct string_list *fname_list,
+				   struct string_list *fname_kept_list,
+				   const struct string_list *extra_keep)
 {
 	DIR *dir;
 	struct dirent *e;
@@ -112,21 +114,20 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list,
 		size_t len;
 		int i;
 
+		if (!strip_suffix(e->d_name, ".pack", &len))
+			continue;
+
 		for (i = 0; i < extra_keep->nr; i++)
 			if (!fspathcmp(e->d_name, extra_keep->items[i].string))
 				break;
-		if (extra_keep->nr > 0 && i < extra_keep->nr)
-			continue;
-
-		if (!strip_suffix(e->d_name, ".pack", &len))
-			continue;
 
 		fname = xmemdupz(e->d_name, len);
 
-		if (!file_exists(mkpath("%s/%s.keep", packdir, fname)))
-			string_list_append_nodup(fname_list, fname);
+		if ((extra_keep->nr > 0 && i < extra_keep->nr) ||
+		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
+			string_list_append_nodup(fname_kept_list, fname);
 		else
-			free(fname);
+			string_list_append_nodup(fname_list, fname);
 	}
 	closedir(dir);
 }
@@ -440,6 +441,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
 	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
@@ -572,9 +574,10 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
+	collect_pack_filenames(&existing_packs, &existing_kept_packs,
+			       &keep_pack_list);
+
 	if (pack_everything & ALL_INTO_ONE) {
-		get_non_kept_pack_filenames(&existing_packs, &keep_pack_list);
-
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
@@ -683,17 +686,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
-		const int hexsz = the_hash_algo->hexsz;
 		int opts = 0;
-		string_list_sort(&names);
-		for_each_string_list_item(item, &existing_packs) {
-			char *sha1;
-			size_t len = strlen(item->string);
-			if (len < hexsz)
-				continue;
-			sha1 = item->string + len - hexsz;
-			if (!string_list_has_string(&names, sha1))
-				remove_redundant_pack(packdir, item->string);
+		if (pack_everything & ALL_INTO_ONE) {
+			const int hexsz = the_hash_algo->hexsz;
+			string_list_sort(&names);
+			for_each_string_list_item(item, &existing_packs) {
+				char *sha1;
+				size_t len = strlen(item->string);
+				if (len < hexsz)
+					continue;
+				sha1 = item->string + len - hexsz;
+				if (!string_list_has_string(&names, sha1))
+					remove_redundant_pack(packdir, item->string);
+			}
 		}
 
 		if (geometry) {
@@ -739,6 +744,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
 	string_list_clear(&existing_packs, 0);
+	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
 	strbuf_release(&line);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 5/9] builtin/repack.c: rename variables that deal with non-kept packs
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (3 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 4/9] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 6/9] builtin/repack.c: extract showing progress to a variable Taylor Blau
                       ` (5 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

The new variable `existing_kept_packs` (and corresponding parameter
`fname_kept_list`) added by the previous patch make it seem like
`existing_packs` and `fname_list` are each subsets of the other two
respectively.

In reality, each pair is disjoint: one stores the packs without .keep
files, and the other stores the packs with .keep files. Rename each to
more clearly reflect this.

Suggested-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 39f11df675..5539ec7e89 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -94,12 +94,12 @@ static void remove_pack_on_signal(int signo)
 }
 
 /*
- * Adds all packs hex strings to either fname_list or fname_kept_list
- * based on whether each pack has a corresponding .keep file or not.
- * Packs without a .keep file are not to be kept if we are going to
- * pack everything into one file.
+ * Adds all packs hex strings to either fname_nonkept_list or
+ * fname_kept_list based on whether each pack has a corresponding
+ * .keep file or not.  Packs without a .keep file are not to be kept
+ * if we are going to pack everything into one file.
  */
-static void collect_pack_filenames(struct string_list *fname_list,
+static void collect_pack_filenames(struct string_list *fname_nonkept_list,
 				   struct string_list *fname_kept_list,
 				   const struct string_list *extra_keep)
 {
@@ -127,7 +127,7 @@ static void collect_pack_filenames(struct string_list *fname_list,
 		    (file_exists(mkpath("%s/%s.keep", packdir, fname))))
 			string_list_append_nodup(fname_kept_list, fname);
 		else
-			string_list_append_nodup(fname_list, fname);
+			string_list_append_nodup(fname_nonkept_list, fname);
 	}
 	closedir(dir);
 }
@@ -440,7 +440,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list_item *item;
 	struct string_list names = STRING_LIST_INIT_DUP;
 	struct string_list rollback = STRING_LIST_INIT_NODUP;
-	struct string_list existing_packs = STRING_LIST_INIT_DUP;
+	struct string_list existing_nonkept_packs = STRING_LIST_INIT_DUP;
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
@@ -574,13 +574,13 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
-	collect_pack_filenames(&existing_packs, &existing_kept_packs,
+	collect_pack_filenames(&existing_nonkept_packs, &existing_kept_packs,
 			       &keep_pack_list);
 
 	if (pack_everything & ALL_INTO_ONE) {
 		repack_promisor_objects(&po_args, &names);
 
-		if (existing_packs.nr && delete_redundant) {
+		if (existing_nonkept_packs.nr && delete_redundant) {
 			for_each_string_list_item(item, &names) {
 				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
 					     packtmp_name, item->string);
@@ -690,7 +690,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (pack_everything & ALL_INTO_ONE) {
 			const int hexsz = the_hash_algo->hexsz;
 			string_list_sort(&names);
-			for_each_string_list_item(item, &existing_packs) {
+			for_each_string_list_item(item, &existing_nonkept_packs) {
 				char *sha1;
 				size_t len = strlen(item->string);
 				if (len < hexsz)
@@ -743,7 +743,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
-	string_list_clear(&existing_packs, 0);
+	string_list_clear(&existing_nonkept_packs, 0);
 	string_list_clear(&existing_kept_packs, 0);
 	clear_pack_geometry(geometry);
 	strbuf_release(&line);
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 6/9] builtin/repack.c: extract showing progress to a variable
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (4 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 5/9] builtin/repack.c: rename variables that deal with non-kept packs Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 7/9] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
                       ` (4 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

We only ask whether stderr is a tty before calling
'prune_packed_objects()', but the subsequent patch will add another use.

Extract this check into a variable so that both can use it without
having to call 'isatty()' twice.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 5539ec7e89..475677b297 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -446,6 +446,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct strbuf line = STRBUF_INIT;
 	int i, ext, ret;
 	FILE *out;
+	int show_progress = isatty(2);
 
 	/* variables to be filled by option parsing */
 	int pack_everything = 0;
@@ -719,7 +720,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			}
 			strbuf_release(&buf);
 		}
-		if (!po_args.quiet && isatty(2))
+		if (!po_args.quiet && show_progress)
 			opts |= PRUNE_PACKED_VERBOSE;
 		prune_packed_objects(opts);
 
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 7/9] builtin/repack.c: support writing a MIDX while repacking
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (5 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 6/9] builtin/repack.c: extract showing progress to a variable Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 8/9] builtin/repack.c: make largest pack preferred Taylor Blau
                       ` (3 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.

There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:

  - Call 'git multi-pack-index write' after running 'git repack', or

  - Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
    'git repack'.

The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).

Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.

Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.

This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").

There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.

There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  14 +++-
 builtin/repack.c             | 138 ++++++++++++++++++++++++++++++-----
 t/lib-midx.sh                |   8 ++
 t/t7700-repack.sh            |  96 ++++++++++++++++++++++++
 t/t7703-repack-geometric.sh  |   2 +-
 5 files changed, 234 insertions(+), 24 deletions(-)
 create mode 100644 t/lib-midx.sh

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 24c00c9384..0f2d235ca5 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,7 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
 
 DESCRIPTION
 -----------
@@ -128,10 +128,11 @@ depth is 4095.
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
-	only makes sense when used with `-a` or `-A`, as the bitmaps
+	only makes sense when used with `-a`, `-A` or `-m`, as the bitmaps
 	must be able to refer to all reachable objects. This option
-	overrides the setting of `repack.writeBitmaps`.  This option
-	has no effect if multiple packfiles are created.
+	overrides the setting of `repack.writeBitmaps`. This option
+	has no effect if multiple packfiles are created, unless writing a
+	MIDX (in which case a multi-pack bitmap is created).
 
 --pack-kept-objects::
 	Include objects in `.keep` files when repacking.  Note that we
@@ -190,6 +191,11 @@ to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
 
+-m::
+--write-midx::
+	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
+	containing the non-redundant packs.
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 475677b297..abb30f89e8 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -434,6 +434,76 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+static void midx_included_packs(struct string_list *include,
+				struct string_list *existing_nonkept_packs,
+				struct string_list *existing_kept_packs,
+				struct string_list *names,
+				struct pack_geometry *geometry)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, existing_kept_packs)
+		string_list_insert(include, xstrfmt("%s.idx", item->string));
+	for_each_string_list_item(item, names)
+		string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
+	if (geometry) {
+		struct strbuf buf = STRBUF_INIT;
+		uint32_t i;
+		for (i = geometry->split; i < geometry->pack_nr; i++) {
+			struct packed_git *p = geometry->pack[i];
+
+			strbuf_addstr(&buf, pack_basename(p));
+			strbuf_strip_suffix(&buf, ".pack");
+			strbuf_addstr(&buf, ".idx");
+
+			string_list_insert(include, strbuf_detach(&buf, NULL));
+		}
+	} else {
+		for_each_string_list_item(item, existing_nonkept_packs) {
+			if (item->util)
+				continue;
+			string_list_insert(include, xstrfmt("%s.idx", item->string));
+		}
+	}
+}
+
+static int write_midx_included_packs(struct string_list *include,
+				     int show_progress, int write_bitmaps)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+
+	if (!include->nr)
+		return 0;
+
+	cmd.in = -1;
+	cmd.git_cmd = 1;
+
+	strvec_push(&cmd.args, "multi-pack-index");
+	strvec_pushl(&cmd.args, "write", "--stdin-packs", NULL);
+
+	if (show_progress)
+		strvec_push(&cmd.args, "--progress");
+	else
+		strvec_push(&cmd.args, "--no-progress");
+
+	if (write_bitmaps)
+		strvec_push(&cmd.args, "--bitmap");
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, include)
+		fprintf(in, "%s\n", item->string);
+	fclose(in);
+
+	return finish_command(&cmd);
+}
+
 int cmd_repack(int argc, const char **argv, const char *prefix)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -457,6 +527,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	int no_update_server_info = 0;
 	struct pack_objects_args po_args = {NULL};
 	int geometric_factor = 0;
+	int write_midx = 0;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -499,6 +570,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("do not repack this pack")),
 		OPT_INTEGER('g', "geometric", &geometric_factor,
 			    N_("find a geometric progression with factor <N>")),
+		OPT_BOOL('m', "write-midx", &write_midx,
+			   N_("write a multi-pack index of the resulting packs")),
 		OPT_END()
 	};
 
@@ -515,8 +588,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		die(_("--keep-unreachable and -A are incompatible"));
 
 	if (write_bitmaps < 0) {
-		if (!(pack_everything & ALL_INTO_ONE) ||
-		    !is_bare_repository())
+		if (!write_midx &&
+		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
 			write_bitmaps = 0;
 	} else if (write_bitmaps &&
 		   git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
@@ -526,7 +599,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (pack_kept_objects < 0)
 		pack_kept_objects = write_bitmaps > 0;
 
-	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE))
+	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
 	if (geometric_factor) {
@@ -568,10 +641,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	if (has_promisor_remote())
 		strvec_push(&cmd.args, "--exclude-promisor-objects");
-	if (write_bitmaps > 0)
-		strvec_push(&cmd.args, "--write-bitmap-index");
-	else if (write_bitmaps < 0)
-		strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	if (!write_midx) {
+		if (write_bitmaps > 0)
+			strvec_push(&cmd.args, "--write-bitmap-index");
+		else if (write_bitmaps < 0)
+			strvec_push(&cmd.args, "--write-bitmap-index-quiet");
+	}
 	if (use_delta_islands)
 		strvec_push(&cmd.args, "--delta-islands");
 
@@ -684,22 +759,47 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 	/* End of pack replacement. */
 
+	if (delete_redundant && pack_everything & ALL_INTO_ONE) {
+		const int hexsz = the_hash_algo->hexsz;
+		string_list_sort(&names);
+		for_each_string_list_item(item, &existing_nonkept_packs) {
+			char *sha1;
+			size_t len = strlen(item->string);
+			if (len < hexsz)
+				continue;
+			sha1 = item->string + len - hexsz;
+			/*
+			 * Mark this pack for deletion, which ensures that this
+			 * pack won't be included in a MIDX (if `--write-midx`
+			 * was given) and that we will actually delete this pack
+			 * (if `-d` was given).
+			 */
+			item->util = (void*)(intptr_t)!string_list_has_string(&names, sha1);
+		}
+	}
+
+	if (write_midx) {
+		struct string_list include = STRING_LIST_INIT_NODUP;
+		midx_included_packs(&include, &existing_nonkept_packs,
+				    &existing_kept_packs, &names, geometry);
+
+		ret = write_midx_included_packs(&include,
+						show_progress, write_bitmaps > 0);
+
+		string_list_clear(&include, 0);
+
+		if (ret)
+			return ret;
+	}
+
 	reprepare_packed_git(the_repository);
 
 	if (delete_redundant) {
 		int opts = 0;
-		if (pack_everything & ALL_INTO_ONE) {
-			const int hexsz = the_hash_algo->hexsz;
-			string_list_sort(&names);
-			for_each_string_list_item(item, &existing_nonkept_packs) {
-				char *sha1;
-				size_t len = strlen(item->string);
-				if (len < hexsz)
-					continue;
-				sha1 = item->string + len - hexsz;
-				if (!string_list_has_string(&names, sha1))
-					remove_redundant_pack(packdir, item->string);
-			}
+		for_each_string_list_item(item, &existing_nonkept_packs) {
+			if (!item->util)
+				continue;
+			remove_redundant_pack(packdir, item->string);
 		}
 
 		if (geometry) {
diff --git a/t/lib-midx.sh b/t/lib-midx.sh
new file mode 100644
index 0000000000..1261994744
--- /dev/null
+++ b/t/lib-midx.sh
@@ -0,0 +1,8 @@
+# test_midx_consistent <objdir>
+test_midx_consistent () {
+	ls $1/pack/pack-*.idx | xargs -n 1 basename | sort >expect &&
+	test-tool read-midx $1 | grep ^pack-.*\.idx$ | sort >actual &&
+
+	test_cmp expect actual &&
+	git multi-pack-index --object-dir=$1 verify
+}
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 98eda3bfeb..6792531dfd 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -3,6 +3,8 @@
 test_description='git repack works correctly'
 
 . ./test-lib.sh
+. "${TEST_DIRECTORY}/lib-bitmap.sh"
+. "${TEST_DIRECTORY}/lib-midx.sh"
 
 commit_and_pack () {
 	test_commit "$@" 1>&2 &&
@@ -234,4 +236,98 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+objdir=.git/objects
+midx=$objdir/pack/multi-pack-index
+
+test_expect_success 'setup for --write-midx tests' '
+	git init midx &&
+	(
+		cd midx &&
+		git config core.multiPackIndex true &&
+
+		test_commit base
+	)
+'
+
+test_expect_success '--write-midx unchanged' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack &&
+		test_path_is_missing $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with a new pack' '
+	(
+		cd midx &&
+		test_commit loose &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -b' '
+	(
+		cd midx &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -mb &&
+
+		test_path_is_file $midx &&
+		test_path_is_file $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success '--write-midx with -d' '
+	(
+		cd midx &&
+		test_commit repack &&
+
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Ad --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-*.bitmap &&
+		test_midx_consistent $objdir
+	)
+'
+
+test_expect_success 'cleans up MIDX when appropriate' '
+	(
+		cd midx &&
+
+		test_commit repack-2 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		checksum=$(midx_checksum $objdir) &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$checksum.bitmap &&
+		test_path_is_file $midx-$checksum.rev &&
+
+		test_commit repack-3 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb --write-midx &&
+
+		test_path_is_file $midx &&
+		test_path_is_missing $midx-$checksum.bitmap &&
+		test_path_is_missing $midx-$checksum.rev &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+		test_path_is_file $midx-$(midx_checksum $objdir).rev &&
+
+		test_commit repack-4 &&
+		GIT_TEST_MULTI_PACK_INDEX=0 git repack -Adb &&
+
+		find $objdir/pack -type f -name "multi-pack-index*" >files &&
+		test_must_be_empty files
+	)
+'
+
 test_done
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 5ccaa440e0..67049f7637 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -15,7 +15,7 @@ test_expect_success '--geometric with no packs' '
 	(
 		cd geometric &&
 
-		git repack --geometric 2 >out &&
+		git repack --write-midx --geometric 2 >out &&
 		test_i18ngrep "Nothing new to pack" out
 	)
 '
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 8/9] builtin/repack.c: make largest pack preferred
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (6 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 7/9] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  1:55     ` [PATCH v3 9/9] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
                       ` (2 subsequent siblings)
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

When repacking into a geometric series and writing a multi-pack bitmap,
it is beneficial to have the largest resulting pack be the preferred
object source in the bitmap's MIDX, since selecting the large packs can
lead to fewer broken delta chains and better compression.

Teach 'git repack' to identify this pack and pass it to the MIDX write
machinery in order to mark it as preferred.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-repack.txt |  4 ++++
 builtin/repack.c             | 27 ++++++++++++++++++++++++++-
 pack-bitmap.c                |  2 +-
 pack-bitmap.h                |  1 +
 t/helper/test-read-midx.c    | 25 ++++++++++++++++++++++++-
 t/t7703-repack-geometric.sh  | 22 ++++++++++++++++++++++
 6 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0f2d235ca5..7183fb498f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -190,6 +190,10 @@ this "roll-up", without respect to their reachability. This is subject
 to change in the future. This option (implying a drastically different
 repack mode) is not guaranteed to work with all other combinations of
 option to `git repack`.
++
+When writing a multi-pack bitmap, `git repack` selects the largest resulting
+pack as the preferred pack for object selection by the MIDX (see
+linkgit:git-multi-pack-index[1]).
 
 -m::
 --write-midx::
diff --git a/builtin/repack.c b/builtin/repack.c
index abb30f89e8..1577f0d59f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -423,6 +423,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	geometry->split = split;
 }
 
+static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
+{
+	if (!geometry) {
+		/*
+		 * No geometry means either an all-into-one repack (in which
+		 * case there is only one pack left and it is the largest) or an
+		 * incremental one.
+		 *
+		 * If repacking incrementally, then we could check the size of
+		 * all packs to determine which should be preferred, but leave
+		 * this for later.
+		 */
+		return NULL;
+	}
+	if (geometry->split == geometry->pack_nr)
+		return NULL;
+	return geometry->pack[geometry->pack_nr - 1];
+}
+
 static void clear_pack_geometry(struct pack_geometry *geometry)
 {
 	if (!geometry)
@@ -468,10 +487,12 @@ static void midx_included_packs(struct string_list *include,
 }
 
 static int write_midx_included_packs(struct string_list *include,
+				     struct pack_geometry *geometry,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	struct string_list_item *item;
+	struct packed_git *largest = get_largest_active_pack(geometry);
 	FILE *in;
 	int ret;
 
@@ -492,6 +513,10 @@ static int write_midx_included_packs(struct string_list *include,
 	if (write_bitmaps)
 		strvec_push(&cmd.args, "--bitmap");
 
+	if (largest)
+		strvec_pushf(&cmd.args, "--preferred-pack=%s",
+			     pack_basename(largest));
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -783,7 +808,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		midx_included_packs(&include, &existing_nonkept_packs,
 				    &existing_kept_packs, &names, geometry);
 
-		ret = write_midx_included_packs(&include,
+		ret = write_midx_included_packs(&include, geometry,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 8504110a4d..67be9be9a6 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1418,7 +1418,7 @@ static int try_partial_reuse(struct packed_git *pack,
 	return 0;
 }
 
-static uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git)
 {
 	struct multi_pack_index *m = bitmap_git->midx;
 	if (!m)
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 469090bad2..7d407c5a4c 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -55,6 +55,7 @@ int test_bitmap_commits(struct repository *r);
 struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs,
 					 struct list_objects_filter_options *filter,
 					 int filter_provided_objects);
+uint32_t midx_preferred_pack(struct bitmap_index *bitmap_git);
 int reuse_partial_packfile_from_bitmap(struct bitmap_index *,
 				       struct packed_git **packfile,
 				       uint32_t *entries,
diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c
index cb0d27049a..0038559129 100644
--- a/t/helper/test-read-midx.c
+++ b/t/helper/test-read-midx.c
@@ -3,6 +3,7 @@
 #include "midx.h"
 #include "repository.h"
 #include "object-store.h"
+#include "pack-bitmap.h"
 
 static int read_midx_file(const char *object_dir, int show_objects)
 {
@@ -72,14 +73,36 @@ static int read_midx_checksum(const char *object_dir)
 	return 0;
 }
 
+static int read_midx_preferred_pack(const char *object_dir)
+{
+	struct multi_pack_index *midx = NULL;
+	struct bitmap_index *bitmap = NULL;
+
+	setup_git_directory();
+
+	midx = load_multi_pack_index(object_dir, 1);
+	if (!midx)
+		return 1;
+
+	bitmap = prepare_bitmap_git(the_repository);
+	if (!(bitmap && bitmap_is_midx(bitmap)))
+		return 1;
+
+
+	printf("%s\n", midx->pack_names[midx_preferred_pack(bitmap)]);
+	return 0;
+}
+
 int cmd__read_midx(int argc, const char **argv)
 {
 	if (!(argc == 2 || argc == 3))
-		usage("read-midx [--show-objects|--checksum] <object-dir>");
+		usage("read-midx [--show-objects|--checksum|--preferred-pack] <object-dir>");
 
 	if (!strcmp(argv[1], "--show-objects"))
 		return read_midx_file(argv[2], 1);
 	else if (!strcmp(argv[1], "--checksum"))
 		return read_midx_checksum(argv[2]);
+	else if (!strcmp(argv[1], "--preferred-pack"))
+		return read_midx_preferred_pack(argv[2]);
 	return read_midx_file(argv[1], 0);
 }
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 67049f7637..bdbbcbf1ec 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -180,4 +180,26 @@ test_expect_success '--geometric ignores kept packs' '
 	)
 '
 
+test_expect_success '--geometric chooses largest MIDX preferred pack' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		ls $objdir/pack/pack-*.idx >before &&
+		test_commit_bulk --start=4 4 && # 12 objects
+		ls $objdir/pack/pack-*.idx >after &&
+
+		git repack --geometric 2 -dbm &&
+
+		comm -3 before after | xargs -n 1 basename >expect &&
+		test-tool read-midx --preferred-pack $objdir >actual &&
+
+		test_cmp expect actual
+	)
+'
+
 test_done
-- 
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v3 9/9] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (7 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 8/9] builtin/repack.c: make largest pack preferred Taylor Blau
@ 2021-09-29  1:55     ` Taylor Blau
  2021-09-29  4:24     ` [PATCH v3 0/9] repack: introduce `--write-midx` Junio C Hamano
  2021-10-01 20:01     ` Jonathan Tan
  10 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-09-29  1:55 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, gitster, jonathantanmy, steadmon

To prevent the race described in an earlier patch, generate and pass a
reference snapshot to the multi-pack bitmap code, if we are writing one
from `git repack`.

This patch is mostly limited to creating a temporary file, and then
calling for_each_ref(). Except we try to minimize duplicates, since
doing so can drastically reduce the size in network-of-forks style
repositories. In the kernel's fork network (the repository containing
all objects from the kernel and all its forks), deduplicating the
references drops the snapshot size from 934 MB to just 12 MB.

But since we're handling duplicates in this way, we have to make sure
that we preferred references (those listed in pack.preferBitmapTips)
before non-preferred ones (to avoid recording an object which is pointed
at by a preferred tip as non-preferred).

We accomplish this by doing separate passes over the references: first
visiting each prefix in pack.preferBitmapTips, and then over the rest of
the references.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 1577f0d59f..5cc0dff77c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -15,6 +15,8 @@
 #include "promisor-remote.h"
 #include "shallow.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -453,6 +455,65 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }
 
+struct midx_snapshot_ref_data {
+	struct tempfile *f;
+	struct oidset seen;
+	int preferred;
+};
+
+static int midx_snapshot_ref_one(const char *refname,
+				 const struct object_id *oid,
+				 int flag, void *_data)
+{
+	struct midx_snapshot_ref_data *data = _data;
+	struct object_id peeled;
+
+	if (!peel_iterated_oid(oid, &peeled))
+		oid = &peeled;
+
+	if (oidset_insert(&data->seen, oid))
+		return 0; /* already seen */
+
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
+		oid_to_hex(oid));
+
+	return 0;
+}
+
+static void midx_snapshot_refs(struct tempfile *f)
+{
+	struct midx_snapshot_ref_data data;
+	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
+
+	data.f = f;
+	oidset_init(&data.seen, 0);
+
+	if (!fdopen_tempfile(f, "w"))
+		 die(_("could not open tempfile %s for writing"),
+		     get_tempfile_path(f));
+
+	if (preferred) {
+		struct string_list_item *item;
+
+		data.preferred = 1;
+		for_each_string_list_item(item, preferred)
+			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
+		data.preferred = 0;
+	}
+
+	for_each_ref(midx_snapshot_ref_one, &data);
+
+	if (close_tempfile_gently(f)) {
+		int save_errno = errno;
+		delete_tempfile(&f);
+		errno = save_errno;
+		die_errno(_("could not close refs snapshot tempfile"));
+	}
+}
+
 static void midx_included_packs(struct string_list *include,
 				struct string_list *existing_nonkept_packs,
 				struct string_list *existing_kept_packs,
@@ -488,6 +549,7 @@ static void midx_included_packs(struct string_list *include,
 
 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -517,6 +579,9 @@ static int write_midx_included_packs(struct string_list *include,
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
 			     pack_basename(largest));
 
+	if (refs_snapshot)
+		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -539,6 +604,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
+	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
 	FILE *out;
 	int show_progress = isatty(2);
@@ -627,6 +693,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));
 
+	if (write_midx && write_bitmaps) {
+		struct strbuf path = STRBUF_INIT;
+
+		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
+			    "bitmap-ref-tips");
+
+		refs_snapshot = xmks_tempfile(path.buf);
+		midx_snapshot_refs(refs_snapshot);
+
+		strbuf_release(&path);
+	}
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("--geometric is incompatible with -A, -a"));
@@ -809,6 +887,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				    &existing_kept_packs, &names, geometry);
 
 		ret = write_midx_included_packs(&include, geometry,
+						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);
 
 		string_list_clear(&include, 0);
-- 
2.33.0.96.g73915697e6

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v3 0/9] repack: introduce `--write-midx`
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (8 preceding siblings ...)
  2021-09-29  1:55     ` [PATCH v3 9/9] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
@ 2021-09-29  4:24     ` Junio C Hamano
  2021-10-01 20:01     ` Jonathan Tan
  10 siblings, 0 replies; 76+ messages in thread
From: Junio C Hamano @ 2021-09-29  4:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, avarab, jonathantanmy, steadmon

Taylor Blau <me@ttaylorr.com> writes:

> Here is another small reroll of my series to implement the final component of
> multi-pack reachability bitamps, which is to be able to write them from `git
> repack`.
>
> This version incorporates feedback from Jonathan Tan and others at Google who
> graciously provided review. A range-diff is included below, but the major
> changes are:
>
>   - A comment to explain the relationship between 'ctx->m' and 'ctx->to_include'
>     in midx.c:add_pack_to_midx().
>
>   - A comment to explain the meaning of write_midx_file_only().
>
>   - Clean-up of stray hunks, a missing strbuf_release().
>
>   - Replacing the name `existing_packs` with `existing_nonkept_packs` to
>     emphasize the relationship between it and the similarly-named
>     `existing_kept_packs`.
>
> Instead of depending on tb/multi-pack-bitmaps, this series now depends on the
> tip of master.

... as the topic already graduated to 'master' ;-)

But applying them on the same base as the previous round and merging
the result to 'master', and applying them on 'master', will give us
identical trees.

Will replace; thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v3 0/9] repack: introduce `--write-midx`
  2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
                       ` (9 preceding siblings ...)
  2021-09-29  4:24     ` [PATCH v3 0/9] repack: introduce `--write-midx` Junio C Hamano
@ 2021-10-01 20:01     ` Jonathan Tan
  2021-10-01 22:40       ` Taylor Blau
  10 siblings, 1 reply; 76+ messages in thread
From: Jonathan Tan @ 2021-10-01 20:01 UTC (permalink / raw)
  To: me; +Cc: git, peff, avarab, gitster, jonathantanmy, steadmon

> Here is another small reroll of my series to implement the final component of
> multi-pack reachability bitamps, which is to be able to write them from `git
> repack`.

I see that all my comments up to patch 6 (of v2) have been addressed.
There is one outstanding one involving a potentially uninitialized
variable [1] from patch 8 (of v2) that you (Taylor) might have missed
since I sent it after I reviewed the first 6.

Other than that,

Reviewed-by: Jonathan Tan <jonathantanmy@google.com> (conditional on
that being resolved)

[1] https://lore.kernel.org/git/20210924182247.2922561-1-jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  2021-09-24 18:22     ` Jonathan Tan
@ 2021-10-01 22:38       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-10-01 22:38 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, gitster, avarab

On Fri, Sep 24, 2021 at 11:22:47AM -0700, Jonathan Tan wrote:
> > +static void midx_snapshot_refs(struct tempfile *f)
> > +{
> > +	struct midx_snapshot_ref_data data;
> > +	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
> > +
> > +	data.f = f;
> > +	oidset_init(&data.seen, 0);
> > +
> > +	if (!fdopen_tempfile(f, "w"))
> > +		 die(_("could not open tempfile %s for writing"),
> > +		     get_tempfile_path(f));
> > +
> > +	if (preferred) {
> > +		struct string_list_item *item;
> > +
> > +		data.preferred = 1;
> > +		for_each_string_list_item(item, preferred)
> > +			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
> > +		data.preferred = 0;
> > +	}
> > +
> > +	for_each_ref(midx_snapshot_ref_one, &data);
>
> One small thing here - I think "data.preferred = 0;" needs to be moved
> to right above this for_each_ref line. As it is, if preferred is NULL,
> data.preferred wouldn't be initialized.
>
> Other than that (and the minor changes I've suggested in previous
> patches), this series looks good to me.

Eek. Great catch. I was wondering why this didn't bite us at GitHub by
marking all refs as preferred, but it doesn't because we always get back
a non-NULL string list from bitmap_preferred_tips(), which causes us to
set data.preferred back to 0.

Here's a replacement patch which adds the missing `data.preferred = 0`
while initializing data in this function, as well as calls
oidset_clear() and adds a test. Unfortunately, the new test in t7700 is
basically a smoke test, because of the way that bitmap selection works

If Junio would rather I send a reroll of the topic instead, I'm happy to
do that, too.

--- 8< ---

Subject: [PATCH] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps

To prevent the race described in an earlier patch, generate and pass a
reference snapshot to the multi-pack bitmap code, if we are writing one
from `git repack`.

This patch is mostly limited to creating a temporary file, and then
calling for_each_ref(). Except we try to minimize duplicates, since
doing so can drastically reduce the size in network-of-forks style
repositories. In the kernel's fork network (the repository containing
all objects from the kernel and all its forks), deduplicating the
references drops the snapshot size from 934 MB to just 12 MB.

But since we're handling duplicates in this way, we have to make sure
that we preferred references (those listed in pack.preferBitmapTips)
before non-preferred ones (to avoid recording an object which is pointed
at by a preferred tip as non-preferred).

We accomplish this by doing separate passes over the references: first
visiting each prefix in pack.preferBitmapTips, and then over the rest of
the references.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c  | 82 +++++++++++++++++++++++++++++++++++++++++++++++
 t/t7700-repack.sh | 42 ++++++++++++++++++++++++
 2 files changed, 124 insertions(+)

diff --git a/builtin/repack.c b/builtin/repack.c
index 1577f0d59f..9354b2bd93 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -15,6 +15,8 @@
 #include "promisor-remote.h"
 #include "shallow.h"
 #include "pack.h"
+#include "pack-bitmap.h"
+#include "refs.h"

 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
@@ -453,6 +455,68 @@ static void clear_pack_geometry(struct pack_geometry *geometry)
 	geometry->split = 0;
 }

+struct midx_snapshot_ref_data {
+	struct tempfile *f;
+	struct oidset seen;
+	int preferred;
+};
+
+static int midx_snapshot_ref_one(const char *refname,
+				 const struct object_id *oid,
+				 int flag, void *_data)
+{
+	struct midx_snapshot_ref_data *data = _data;
+	struct object_id peeled;
+
+	if (!peel_iterated_oid(oid, &peeled))
+		oid = &peeled;
+
+	if (oidset_insert(&data->seen, oid))
+		return 0; /* already seen */
+
+	if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
+		return 0;
+
+	fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
+		oid_to_hex(oid));
+
+	return 0;
+}
+
+static void midx_snapshot_refs(struct tempfile *f)
+{
+	struct midx_snapshot_ref_data data;
+	const struct string_list *preferred = bitmap_preferred_tips(the_repository);
+
+	data.f = f;
+	data.preferred = 0;
+	oidset_init(&data.seen, 0);
+
+	if (!fdopen_tempfile(f, "w"))
+		 die(_("could not open tempfile %s for writing"),
+		     get_tempfile_path(f));
+
+	if (preferred) {
+		struct string_list_item *item;
+
+		data.preferred = 1;
+		for_each_string_list_item(item, preferred)
+			for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
+		data.preferred = 0;
+	}
+
+	for_each_ref(midx_snapshot_ref_one, &data);
+
+	if (close_tempfile_gently(f)) {
+		int save_errno = errno;
+		delete_tempfile(&f);
+		errno = save_errno;
+		die_errno(_("could not close refs snapshot tempfile"));
+	}
+
+	oidset_clear(&data.seen);
+}
+
 static void midx_included_packs(struct string_list *include,
 				struct string_list *existing_nonkept_packs,
 				struct string_list *existing_kept_packs,
@@ -488,6 +552,7 @@ static void midx_included_packs(struct string_list *include,

 static int write_midx_included_packs(struct string_list *include,
 				     struct pack_geometry *geometry,
+				     const char *refs_snapshot,
 				     int show_progress, int write_bitmaps)
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
@@ -517,6 +582,9 @@ static int write_midx_included_packs(struct string_list *include,
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
 			     pack_basename(largest));

+	if (refs_snapshot)
+		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
+
 	ret = start_command(&cmd);
 	if (ret)
 		return ret;
@@ -539,6 +607,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	struct string_list existing_kept_packs = STRING_LIST_INIT_DUP;
 	struct pack_geometry *geometry = NULL;
 	struct strbuf line = STRBUF_INIT;
+	struct tempfile *refs_snapshot = NULL;
 	int i, ext, ret;
 	FILE *out;
 	int show_progress = isatty(2);
@@ -627,6 +696,18 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
 		die(_(incremental_bitmap_conflict_error));

+	if (write_midx && write_bitmaps) {
+		struct strbuf path = STRBUF_INIT;
+
+		strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
+			    "bitmap-ref-tips");
+
+		refs_snapshot = xmks_tempfile(path.buf);
+		midx_snapshot_refs(refs_snapshot);
+
+		strbuf_release(&path);
+	}
+
 	if (geometric_factor) {
 		if (pack_everything)
 			die(_("--geometric is incompatible with -A, -a"));
@@ -809,6 +890,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				    &existing_kept_packs, &names, geometry);

 		ret = write_midx_included_packs(&include, geometry,
+						refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
 						show_progress, write_bitmaps > 0);

 		string_list_clear(&include, 0);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 6792531dfd..0260ad6f0e 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -330,4 +330,46 @@ test_expect_success 'cleans up MIDX when appropriate' '
 	)
 '

+test_expect_success '--write-midx with preferred bitmap tips' '
+	git init midx-preferred-tips &&
+	test_when_finished "rm -fr midx-preferred-tips" &&
+	(
+		cd midx-preferred-tips &&
+
+		test_commit_bulk --message="%s" 103 &&
+
+		git log --format="%H" >commits.raw &&
+		sort <commits.raw >commits &&
+
+		git log --format="create refs/tags/%s/%s %H" HEAD >refs &&
+		git update-ref --stdin <refs &&
+
+		git repack --write-midx --write-bitmap-index &&
+		test_path_is_file $midx &&
+		test_path_is_file $midx-$(midx_checksum $objdir).bitmap &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >before &&
+		test_line_count = 1 before &&
+
+		rm -fr $midx-$(midx_checksum $objdir).bitmap &&
+		rm -fr $midx-$(midx_checksum $objdir).rev &&
+		rm -fr $midx &&
+
+		# instead of constructing the snapshot ourselves (c.f., the test
+		# "write a bitmap with --refs-snapshot (preferred tips)" in
+		# t5326), mark the missing commit as preferred by adding it to
+		# the pack.preferBitmapTips configuration.
+		git for-each-ref --format="%(refname:rstrip=1)" \
+			--points-at="$(cat before)" >missing &&
+		git config pack.preferBitmapTips "$(cat missing)" &&
+		git repack --write-midx --write-bitmap-index &&
+
+		test-tool bitmap list-commits | sort >bitmaps &&
+		comm -13 bitmaps commits >after &&
+
+		! test_cmp before after
+	)
+'
+
 test_done
--
2.33.0.96.g73915697e6


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v3 0/9] repack: introduce `--write-midx`
  2021-10-01 20:01     ` Jonathan Tan
@ 2021-10-01 22:40       ` Taylor Blau
  0 siblings, 0 replies; 76+ messages in thread
From: Taylor Blau @ 2021-10-01 22:40 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, avarab, gitster, steadmon

On Fri, Oct 01, 2021 at 01:01:35PM -0700, Jonathan Tan wrote:
> > Here is another small reroll of my series to implement the final component of
> > multi-pack reachability bitamps, which is to be able to write them from `git
> > repack`.
>
> I see that all my comments up to patch 6 (of v2) have been addressed.
> There is one outstanding one involving a potentially uninitialized
> variable [1] from patch 8 (of v2) that you (Taylor) might have missed
> since I sent it after I reviewed the first 6.

Thanks! I must have missed your review of v2 when I sent v3 (indeed, my
reroll came 5 days after you sent [1], so I missed it).

I sent a replacement patch here:

    https://lore.kernel.org/git/YVeN0mXqYvTHtNB+@nand.local/

which Junio can take instead of [v3 8/8]. But I'm happy to reroll the
whole series, too, whatever is easier.

> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> (conditional on
> that being resolved)

Thanks again :-).

> [1] https://lore.kernel.org/git/20210924182247.2922561-1-jonathantanmy@google.com/

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2021-10-01 22:40 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-11  3:32 [PATCH 0/8] repack: introduce `--write-midx` Taylor Blau
2021-09-11  3:32 ` [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly Taylor Blau
2021-09-11  5:00   ` Junio C Hamano
2021-09-11 16:17     ` Taylor Blau
2021-09-11 10:07   ` Ævar Arnfjörð Bjarmason
2021-09-11 16:21     ` Taylor Blau
2021-09-11  3:32 ` [PATCH 2/8] builtin/multi-pack-index.c: support --stdin-packs mode Taylor Blau
2021-09-11 10:05   ` Ævar Arnfjörð Bjarmason
2021-09-11 16:25     ` Taylor Blau
2021-09-11 16:28       ` Taylor Blau
2021-09-12  2:08       ` Eric Sunshine
2021-09-12  2:21         ` Taylor Blau
2021-09-12 15:15           ` Ævar Arnfjörð Bjarmason
2021-09-12 22:30             ` Junio C Hamano
2021-09-12 22:32               ` Ævar Arnfjörð Bjarmason
2021-09-14 19:02       ` Jeff King
2021-09-14 23:48         ` Taylor Blau
2021-09-15  1:55           ` Eric Sunshine
2021-09-11  3:32 ` [PATCH 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
2021-09-11  3:32 ` [PATCH 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
2021-09-11  3:32 ` [PATCH 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
2021-09-11  3:32 ` [PATCH 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
2021-09-11  3:32 ` [PATCH 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
2021-09-11 10:17   ` Ævar Arnfjörð Bjarmason
2021-09-11 16:35     ` Taylor Blau
2021-09-11  3:32 ` [PATCH 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
2021-09-11 10:27   ` Ævar Arnfjörð Bjarmason
2021-09-11 11:19     ` Ævar Arnfjörð Bjarmason
2021-09-11 16:51       ` Taylor Blau
2021-09-14 18:55         ` Jeff King
2021-09-14 23:34           ` Taylor Blau
2021-09-14 23:56             ` Ævar Arnfjörð Bjarmason
2021-09-15  4:31               ` Taylor Blau
2021-09-11 16:49     ` Taylor Blau
2021-09-15 18:24 ` [PATCH v2 0/8] repack: introduce `--write-midx` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 1/8] midx: expose `write_midx_file_only()` publicly Taylor Blau
2021-09-22 23:14     ` Jonathan Tan
2021-09-23  3:09       ` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
2021-09-22 22:29     ` Josh Steadmon
2021-09-23  2:03       ` Taylor Blau
2021-09-22 23:11     ` Jonathan Tan
2021-09-23  2:06       ` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 3/8] midx: preliminary support for `--refs-snapshot` Taylor Blau
2021-09-22 22:34     ` Josh Steadmon
2021-09-23  2:08       ` Taylor Blau
2021-09-22 23:00     ` Jonathan Tan
2021-09-23  2:18       ` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 4/8] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
2021-09-22 22:56     ` Jonathan Tan
2021-09-23  2:59       ` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 5/8] builtin/repack.c: extract showing progress to a variable Taylor Blau
2021-09-15 18:24   ` [PATCH v2 6/8] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
2021-09-22 22:39     ` Jonathan Tan
2021-09-23  2:40       ` Taylor Blau
2021-09-15 18:24   ` [PATCH v2 7/8] builtin/repack.c: make largest pack preferred Taylor Blau
2021-09-15 18:24   ` [PATCH v2 8/8] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
2021-09-24 18:22     ` Jonathan Tan
2021-10-01 22:38       ` Taylor Blau
2021-09-15 19:22   ` [PATCH v2 0/8] repack: introduce `--write-midx` Junio C Hamano
2021-09-15 19:29     ` Junio C Hamano
2021-09-15 21:19       ` Taylor Blau
2021-09-16 22:16         ` Junio C Hamano
2021-09-29  1:54   ` [PATCH v3 0/9] " Taylor Blau
2021-09-29  1:55     ` [PATCH v3 1/9] midx: expose `write_midx_file_only()` publicly Taylor Blau
2021-09-29  1:55     ` [PATCH v3 2/9] builtin/multi-pack-index.c: support `--stdin-packs` mode Taylor Blau
2021-09-29  1:55     ` [PATCH v3 3/9] midx: preliminary support for `--refs-snapshot` Taylor Blau
2021-09-29  1:55     ` [PATCH v3 4/9] builtin/repack.c: keep track of existing packs unconditionally Taylor Blau
2021-09-29  1:55     ` [PATCH v3 5/9] builtin/repack.c: rename variables that deal with non-kept packs Taylor Blau
2021-09-29  1:55     ` [PATCH v3 6/9] builtin/repack.c: extract showing progress to a variable Taylor Blau
2021-09-29  1:55     ` [PATCH v3 7/9] builtin/repack.c: support writing a MIDX while repacking Taylor Blau
2021-09-29  1:55     ` [PATCH v3 8/9] builtin/repack.c: make largest pack preferred Taylor Blau
2021-09-29  1:55     ` [PATCH v3 9/9] builtin/repack.c: pass `--refs-snapshot` when writing bitmaps Taylor Blau
2021-09-29  4:24     ` [PATCH v3 0/9] repack: introduce `--write-midx` Junio C Hamano
2021-10-01 20:01     ` Jonathan Tan
2021-10-01 22:40       ` Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).