All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] clean-ups to geometric repacking
@ 2021-03-05 15:21 Taylor Blau
  2021-03-05 15:21 ` [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric Taylor Blau
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:21 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

I was reading Junio's comments on my patch to implement a '--geometric' option
for 'git repack' here:

    https://lore.kernel.org/git/xmqqv9ahxddp.fsf@gitster.g/

and felt that all of the suggestions therein would make for good clean-up. Since
the original series is already on next, this follow-up series is sent separately
(and can be cleanly applied on top of tb/geometric-repack).

These are all fairly straightforward clean-ups, although I could do with or
without the fourth patch. It adds a lot of verbosity in exchanged for checking
unsigned overflows, but I'm not sure how likely it is that we'd run into it. So,
I would be fine if that patch were dropped.

I applied Junio's sign-off to the last patch, since it was written by him in
the above mail, and I merely added a patch message. I hope that that's OK; if it
isn't, please don't hesitate to drop it (or I can resend it with my authorship
and sign-off).

Junio C Hamano (1):
  builtin/repack.c: reword comment around pack-objects flags

Taylor Blau (4):
  builtin/repack.c: do not repack single packs with --geometric
  t7703: test --geometric repack with loose objects
  builtin/repack.c: assign pack split later
  builtin/repack.c: be more conservative with unsigned overflows

 builtin/repack.c            | 46 ++++++++++++++++++++++++++-----------
 t/t7703-repack-geometric.sh | 46 +++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 13 deletions(-)

-- 
2.30.0.667.g81c0cbc6fd

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric
  2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
@ 2021-03-05 15:21 ` Taylor Blau
  2021-03-05 19:15   ` Junio C Hamano
  2021-03-05 15:21 ` [PATCH 2/5] t7703: test --geometric repack with loose objects Taylor Blau
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:21 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

In 0fabafd0b9 (builtin/repack.c: add '--geometric' option, 2021-02-22),
the 'git repack --geometric' code aborts early when there is zero or one
pack.

When there are no packs, this code does the right thing by placing the
split at "0". But when there is exactly one pack, the split is placed at
"1", which means that "git repack --geometric" (with any factor)
repacks all of the objects in a single pack.

This is wasteful, and the remaining code in split_pack_geometry() does
the right thing (not repacking the objects in a single pack) even when
only one pack is present.

Loosen the guard to only stop when there aren't any packs, and let the
rest of the code do the right thing. Add a test to ensure that this is
the case.

Noticed-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c            |  2 +-
 t/t7703-repack-geometric.sh | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index bcf280b10d..4ca2f647b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -351,7 +351,7 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	uint32_t split;
 	off_t total_size = 0;
 
-	if (geometry->pack_nr <= 1) {
+	if (!geometry->pack_nr) {
 		geometry->split = geometry->pack_nr;
 		return;
 	}
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 96917fc163..4a1952a054 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -20,6 +20,21 @@ test_expect_success '--geometric with no packs' '
 	)
 '
 
+test_expect_success '--geometric with one pack' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		test_commit "base" &&
+		git repack -d &&
+
+		git repack --geometric 2 >out &&
+
+		test_i18ngrep "Nothing new to pack" out
+	)
+'
+
 test_expect_success '--geometric with an intact progression' '
 	git init geometric &&
 	test_when_finished "rm -fr geometric" &&
-- 
2.30.0.667.g81c0cbc6fd


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/5] t7703: test --geometric repack with loose objects
  2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
  2021-03-05 15:21 ` [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric Taylor Blau
@ 2021-03-05 15:21 ` Taylor Blau
  2021-03-05 15:21 ` [PATCH 3/5] builtin/repack.c: assign pack split later Taylor Blau
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:21 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

We don't currently have a test that demonstrates the non-idempotent
behavior of 'git repack --geometric' with loose objects, so add one here
to make sure we don't regress in this area.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t7703-repack-geometric.sh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 4a1952a054..5ccaa440e0 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -54,6 +54,37 @@ test_expect_success '--geometric with an intact progression' '
 	)
 '
 
+test_expect_success '--geometric with loose objects' '
+	git init geometric &&
+	test_when_finished "rm -fr geometric" &&
+	(
+		cd geometric &&
+
+		# These packs already form a geometric progression.
+		test_commit_bulk --start=1 1 && # 3 objects
+		test_commit_bulk --start=2 2 && # 6 objects
+		# The loose objects are packed together, breaking the
+		# progression.
+		test_commit loose && # 3 objects
+
+		find $objdir/pack -name "*.pack" | sort >before &&
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >after &&
+
+		comm -13 before after >new &&
+		comm -23 before after >removed &&
+
+		test_line_count = 1 new &&
+		test_must_be_empty removed &&
+
+		git repack --geometric 2 -d &&
+		find $objdir/pack -name "*.pack" | sort >after &&
+
+		# The progression (3, 3, 6) is combined into one new pack.
+		test_line_count = 1 after
+	)
+'
+
 test_expect_success '--geometric with small-pack rollup' '
 	git init geometric &&
 	test_when_finished "rm -fr geometric" &&
-- 
2.30.0.667.g81c0cbc6fd


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/5] builtin/repack.c: assign pack split later
  2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
  2021-03-05 15:21 ` [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric Taylor Blau
  2021-03-05 15:21 ` [PATCH 2/5] t7703: test --geometric repack with loose objects Taylor Blau
@ 2021-03-05 15:21 ` Taylor Blau
  2021-03-05 15:21 ` [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows Taylor Blau
  2021-03-05 15:22 ` [PATCH 5/5] builtin/repack.c: reword comment around pack-objects flags Taylor Blau
  4 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:21 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

To determine the where to place the split when repacking with the
'--geometric' option, split_pack_geometry() assigns the "split" variable
and then decrements it in a loop.

It would be equivalent (and more readable) to assign the split to the
loop position after exiting the loop, so do that instead.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 4ca2f647b4..21a5778e73 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -356,8 +356,6 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 		return;
 	}
 
-	split = geometry->pack_nr - 1;
-
 	/*
 	 * First, count the number of packs (in descending order of size) which
 	 * already form a geometric progression.
@@ -365,12 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	for (i = geometry->pack_nr - 1; i > 0; i--) {
 		struct packed_git *ours = geometry->pack[i];
 		struct packed_git *prev = geometry->pack[i - 1];
-		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
-			split--;
-		else
+		if (geometry_pack_weight(ours) < factor * geometry_pack_weight(prev))
 			break;
 	}
 
+	split = i;
+
 	if (split) {
 		/*
 		 * Move the split one to the right, since the top element in the
-- 
2.30.0.667.g81c0cbc6fd


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows
  2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
                   ` (2 preceding siblings ...)
  2021-03-05 15:21 ` [PATCH 3/5] builtin/repack.c: assign pack split later Taylor Blau
@ 2021-03-05 15:21 ` Taylor Blau
  2021-03-05 19:32   ` Junio C Hamano
  2021-03-10 21:00   ` Jeff King
  2021-03-05 15:22 ` [PATCH 5/5] builtin/repack.c: reword comment around pack-objects flags Taylor Blau
  4 siblings, 2 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:21 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

There are a number of places in the geometric repack code where we
multiply the number of objects in a pack by another unsigned value. We
trust that the number of objects in a pack is always representable by a
uint32_t, but we don't necessarily trust that that number can be
multiplied without overflow.

Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
split_pack_geometry() to check that we never overflow any unsigned types
when adding or multiplying them.

Arguably these checks are a little too conservative, and certainly they
do not help the readability of this function. But they are serving a
useful purpose, so I think they are worthwhile overall.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 21a5778e73..677c6b75c1 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -363,6 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	for (i = geometry->pack_nr - 1; i > 0; i--) {
 		struct packed_git *ours = geometry->pack[i];
 		struct packed_git *prev = geometry->pack[i - 1];
+
+		if (unsigned_mult_overflows(factor, geometry_pack_weight(prev)))
+			die(_("pack %s too large to consider in geometric "
+			      "progression"),
+			    prev->pack_name);
+
 		if (geometry_pack_weight(ours) < factor * geometry_pack_weight(prev))
 			break;
 	}
@@ -388,11 +394,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	 * packs in the heavy half need to be joined into it (if any) to restore
 	 * the geometric progression.
 	 */
-	for (i = 0; i < split; i++)
-		total_size += geometry_pack_weight(geometry->pack[i]);
+	for (i = 0; i < split; i++) {
+		struct packed_git *p = geometry->pack[i];
+
+		if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
+			die(_("pack %s too large to roll up"), p->pack_name);
+		total_size += geometry_pack_weight(p);
+	}
 	for (i = split; i < geometry->pack_nr; i++) {
 		struct packed_git *ours = geometry->pack[i];
+
+		if (unsigned_mult_overflows(factor, total_size))
+			die(_("pack %s too large to roll up"), ours->pack_name);
+
 		if (geometry_pack_weight(ours) < factor * total_size) {
+			if (unsigned_add_overflows(total_size,
+						   geometry_pack_weight(ours)))
+				die(_("pack %s too large to roll up"),
+				    ours->pack_name);
+
 			split++;
 			total_size += geometry_pack_weight(ours);
 		} else
-- 
2.30.0.667.g81c0cbc6fd


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 5/5] builtin/repack.c: reword comment around pack-objects flags
  2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
                   ` (3 preceding siblings ...)
  2021-03-05 15:21 ` [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows Taylor Blau
@ 2021-03-05 15:22 ` Taylor Blau
  4 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 15:22 UTC (permalink / raw)
  To: git; +Cc: gitster, peff

From: Junio C Hamano <gitster@pobox.com>

The comment in this block is meant to indicate that passing '--all',
'--reflog', and so on aren't necessary when repacking with the
'--geometric' option.

But, it has two problems: first, it is factually incorrect ('--all' is
*not* incompatible with '--stdin-packs' as the comment suggests);
second, it is quite focused on the geometric case for a block that is
guarding against it.

Reword this comment to address both issues.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/repack.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/builtin/repack.c b/builtin/repack.c
index 677c6b75c1..6ce2556c9e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -545,12 +545,14 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	strvec_push(&cmd.args, "--non-empty");
 	if (!geometry) {
 		/*
-		 * 'git pack-objects' will up all objects loose or packed
-		 * (either rolling them up or leaving them alone), so don't pass
-		 * these options.
+		 * We need to grab all reachable objects, including those that
+		 * are reachable from reflogs and the index.
 		 *
-		 * The implementation of 'git pack-objects --stdin-packs'
-		 * makes them redundant (and the two are incompatible).
+		 * When repacking into a geometric progression of packs,
+		 * however, we ask 'git pack-objects --stdin-packs', and it is
+		 * not about packing objects based on reachability but about
+		 * repacking all the objects in specified packs and loose ones
+		 * (indeed, --stdin-packs is incompatible with these options).
 		 */
 		strvec_push(&cmd.args, "--all");
 		strvec_push(&cmd.args, "--reflog");
-- 
2.30.0.667.g81c0cbc6fd

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric
  2021-03-05 15:21 ` [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric Taylor Blau
@ 2021-03-05 19:15   ` Junio C Hamano
  2021-03-05 19:27     ` Taylor Blau
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2021-03-05 19:15 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff

Taylor Blau <me@ttaylorr.com> writes:

> Loosen the guard to only stop when there aren't any packs, and let the
> rest of the code do the right thing. Add a test to ensure that this is
> the case.
>
> Noticed-by: Junio C Hamano <gitster@pobox.com>

I do not think I "noticed" anything, though.

> -	if (geometry->pack_nr <= 1) {
> +	if (!geometry->pack_nr) {
>  		geometry->split = geometry->pack_nr;
>  		return;
>  	}

When pack_nr is 0, split is set to 0.  Otherwise we compute the
split the usual way in the new code.  Let's see the post-context of
the above code and figure out what happens when pack_nr is 1.

	split = geometry->pack_nr - 1;

split set to 0 here.

	/*
	 * First, count the number of packs (in descending order of size) which
	 * already form a geometric progression.
	 */
	for (i = geometry->pack_nr - 1; i > 0; i--) {

Iterate from i = 0 while i is positive, which means this entire loop
is skipped.

		struct packed_git *ours = geometry->pack[i];
		struct packed_git *prev = geometry->pack[i - 1];
		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
			split--;
		else
			break;
	}

	if (split) {
		/*
		 * Move the split one to the right, since the top element in the
		 * last-compared pair can't be in the progression. Only do this
		 * when we split in the middle of the array (otherwise if we got
		 * to the end, then the split is in the right place).
		 */
		split++;
	}

And split is still 0.


	/*
	 * Then, anything to the left of 'split' must be in a new pack. But,
	 * creating that new pack may cause packs in the heavy half to no longer
	 * form a geometric progression.
	 *
	 * Compute an expected size of the new pack, and then determine how many
	 * packs in the heavy half need to be joined into it (if any) to restore
	 * the geometric progression.
	 */
	for (i = 0; i < split; i++)
		total_size += geometry_pack_weight(geometry->pack[i]);

The loop is skipped, as split is 0.

	for (i = split; i < geometry->pack_nr; i++) {

Iterate from i = 0 and for just once.

		struct packed_git *ours = geometry->pack[i];
		if (geometry_pack_weight(ours) < factor * total_size) {

But total_size is 0 so split is not incremented.

			split++;
			total_size += geometry_pack_weight(ours);
		} else
			break;
	}

	geometry->split = split;

Hence we assign 0 to .split member.

I however wonder if it expresses the intent more clearly if you did
this upfront, instead of forcing the readers to go through the code.

	if (geometry->pack_nr <= 1) {
-		geometry->split = geometry->pack_nr;
+		geometry->split = 0;
 		return;
 	}

That is, "when there is no existing packs, or just one pack, we
split at 0"

The code that gets affected by the setting of "split" is this piece
in the caller, cmd_repack():

	if (geometry) {
		FILE *in = xfdopen(cmd.in, "w");
		for (i = 0; i < geometry->split; i++)
			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
		for (i = geometry->split; i < geometry->pack_nr; i++)
			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
		fclose(in);
	}

When split == 0, we end up feeding no positive packs and all
negative packs, which results in nothing to pack.  I wonder if we
can optimize out the spawning of the pack-object process, but that
is probalby optimizing for a wrong case.

> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index 96917fc163..4a1952a054 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -20,6 +20,21 @@ test_expect_success '--geometric with no packs' '
>  	)
>  '
>  
> +test_expect_success '--geometric with one pack' '
> +	git init geometric &&
> +	test_when_finished "rm -fr geometric" &&
> +	(
> +		cd geometric &&
> +
> +		test_commit "base" &&
> +		git repack -d &&
> +
> +		git repack --geometric 2 >out &&
> +
> +		test_i18ngrep "Nothing new to pack" out
> +	)
> +'
> +
>  test_expect_success '--geometric with an intact progression' '
>  	git init geometric &&
>  	test_when_finished "rm -fr geometric" &&

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric
  2021-03-05 19:15   ` Junio C Hamano
@ 2021-03-05 19:27     ` Taylor Blau
  2021-03-05 19:30       ` Taylor Blau
  0 siblings, 1 reply; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 19:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Fri, Mar 05, 2021 at 11:15:58AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Loosen the guard to only stop when there aren't any packs, and let the
> > rest of the code do the right thing. Add a test to ensure that this is
> > the case.
> >
> > Noticed-by: Junio C Hamano <gitster@pobox.com>
>
> I do not think I "noticed" anything, though.

Well, I clearly didn't notice it, so I'm happy to pass the buck to you.

> > -	if (geometry->pack_nr <= 1) {
> > +	if (!geometry->pack_nr) {
> >  		geometry->split = geometry->pack_nr;
> >  		return;
> >  	}
>
> When pack_nr is 0, split is set to 0.  Otherwise we compute the
> split the usual way in the new code.  Let's see the post-context of
> the above code and figure out what happens when pack_nr is 1.
>
> [snip]
>
> I however wonder if it expresses the intent more clearly if you did
> this upfront, instead of forcing the readers to go through the code.
>
> 	if (geometry->pack_nr <= 1) {
> -		geometry->split = geometry->pack_nr;
> +		geometry->split = 0;
>  		return;
>  	}
>
> That is, "when there is no existing packs, or just one pack, we
> split at 0"

Hmm. I originally wrote the patch as:

    if (geometry->pack_nr <= 1) {
      geometry->split = 0;
      return;
    }

instead of only when geometry->pack_nr == 0. But I was pretty sure that
the code below was doing the right thing even for geometry->pack_nr ==
1, and so I decided to avoid making this non-special case "special" by
returning early.

I could see arguments in both directions. But I may be biased as the
author, so I'd rather defer to your judgement instead.

> The code that gets affected by the setting of "split" is this piece
> in the caller, cmd_repack():
>
> 	if (geometry) {
> 		FILE *in = xfdopen(cmd.in, "w");
> 		for (i = 0; i < geometry->split; i++)
> 			fprintf(in, "%s\n", pack_basename(geometry->pack[i]));
> 		for (i = geometry->split; i < geometry->pack_nr; i++)
> 			fprintf(in, "^%s\n", pack_basename(geometry->pack[i]));
> 		fclose(in);
> 	}
>
> When split == 0, we end up feeding no positive packs and all
> negative packs, which results in nothing to pack.  I wonder if we
> can optimize out the spawning of the pack-object process, but that
> is probalby optimizing for a wrong case.

Yeah, I think the earlier optimization (avoiding repacking the contents
of a single pack) is more important than not opening pack objects here.

But the next patch demonstrates why we can't do this: we care about
loose objects, which we may still pick up even if split == 0.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric
  2021-03-05 19:27     ` Taylor Blau
@ 2021-03-05 19:30       ` Taylor Blau
  0 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 19:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Fri, Mar 05, 2021 at 02:27:41PM -0500, Taylor Blau wrote:
> instead of only when geometry->pack_nr == 0. But I was pretty sure that
> the code below was doing the right thing even for geometry->pack_nr ==
> 1, and so I decided to avoid making this non-special case "special" by
> returning early.

After "I was pretty sure that the code was doing the right", I should
note that I did the same thing you did in my head, and then confirmed
my thinking by running the tests in t5326.

Re-reading my response, I am worried that I gave the impression of being
lackadaisical about testing.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows
  2021-03-05 15:21 ` [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows Taylor Blau
@ 2021-03-05 19:32   ` Junio C Hamano
  2021-03-05 19:41     ` Taylor Blau
  2021-03-10 21:00   ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2021-03-05 19:32 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff

Taylor Blau <me@ttaylorr.com> writes:

> There are a number of places in the geometric repack code where we
> multiply the number of objects in a pack by another unsigned value. We
> trust that the number of objects in a pack is always representable by a
> uint32_t, but we don't necessarily trust that that number can be
> multiplied without overflow.
>
> Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
> split_pack_geometry() to check that we never overflow any unsigned types
> when adding or multiplying them.
>
> Arguably these checks are a little too conservative, and certainly they
> do not help the readability of this function. But they are serving a
> useful purpose, so I think they are worthwhile overall.
>
> Suggested-by: Junio C Hamano <gitster@pobox.com>

I do not deserve this for my idle thinking-out-aloud speculation.
You did all the thinking and work.

I do not know if just dying is a useful thing to do, though.  These
all happen in the "discover where the split point is, to figure out
which packs to coalesce" phase where we haven't touched anything on
disk, so in that sense it is safe to die, but what recourse do users
have after seeing these errors?

For example, in the first overflow check in hunk -388,11, presumably
when total_size for a given set of packfiles does not fit in
int32_t, it won't be fruitful to attempt to coalesce such a set of
packs into one, so perhaps a better fix may be to shift the split
point earlier to find a point that lets us to pack as many as we
could?

I do not offhand know what a sensible and gentler fallback position
would be for the factor multiplication, but presumably, if the right
hand side of this ...

>  		if (geometry_pack_weight(ours) < factor * total_size) {

... overflows, we can say it would definitely be larger than our
weight and continue, instead of "no, we cannot multiply total with
factor, as that would give us too big a number", I guess.

I am OK to either (1) leave the code as-is without this patch, because
the overflow would only affect the largest of packs and would be rare,
and people who would notice when they hit it would probably are the ones
with enough resource to diagnose, report and even give us a fix ;-)
or (2) apply this patch as-is, because the only people who will see
the die() would be the same ones affected by (1) above anyway.

And applying this patch as-is would give better chance than (1) for
the overflow to be noticed.

So, let's queue the patch as-is.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows
  2021-03-05 19:32   ` Junio C Hamano
@ 2021-03-05 19:41     ` Taylor Blau
  0 siblings, 0 replies; 12+ messages in thread
From: Taylor Blau @ 2021-03-05 19:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On Fri, Mar 05, 2021 at 11:32:57AM -0800, Junio C Hamano wrote:
> > Suggested-by: Junio C Hamano <gitster@pobox.com>
>
> I do not deserve this for my idle thinking-out-aloud speculation.
> You did all the thinking and work.

You're selling yourself short ;-).

> I do not offhand know what a sensible and gentler fallback position
> would be for the factor multiplication, but presumably, if the right
> hand side of this ...
>
> >  		if (geometry_pack_weight(ours) < factor * total_size) {
>
> ... overflows, we can say it would definitely be larger than our
> weight and continue, instead of "no, we cannot multiply total with
> factor, as that would give us too big a number", I guess.

To answer your question about whether or not dying is sensible, I think
that there are a couple of options. You could say, "I can no longer
multiply total_size by factor without overflowing, so I'm not even going
to bother considering additional packs". I guess that makes some
progress in condensing packs, but that's as good as it would get, since
the subsequent repack would also overflow instead of making further
progress, so it would just get stuck.

Which makes me think that the other option, dying, is more sensible. But
I think that this is all a moot point as you seem to indicate anyway,
because having a large enough pack that we are verging on overflowing is
likely not a real-world problem that we are going to deal with (at least
right away).

> I am OK to either (1) leave the code as-is without this patch, because
> the overflow would only affect the largest of packs and would be rare,
> and people who would notice when they hit it would probably are the ones
> with enough resource to diagnose, report and even give us a fix ;-)
> or (2) apply this patch as-is, because the only people who will see
> the die() would be the same ones affected by (1) above anyway.
>
> And applying this patch as-is would give better chance than (1) for
> the overflow to be noticed.
>
> So, let's queue the patch as-is.

Great, I'm fine with that. Thanks.

> Thanks.

Taylor

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows
  2021-03-05 15:21 ` [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows Taylor Blau
  2021-03-05 19:32   ` Junio C Hamano
@ 2021-03-10 21:00   ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2021-03-10 21:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster

On Fri, Mar 05, 2021 at 10:21:56AM -0500, Taylor Blau wrote:

> There are a number of places in the geometric repack code where we
> multiply the number of objects in a pack by another unsigned value. We
> trust that the number of objects in a pack is always representable by a
> uint32_t, but we don't necessarily trust that that number can be
> multiplied without overflow.
> 
> Sprinkle some unsigned_add_overflows() and unsigned_mult_overflows() in
> split_pack_geometry() to check that we never overflow any unsigned types
> when adding or multiplying them.
> 
> Arguably these checks are a little too conservative, and certainly they
> do not help the readability of this function. But they are serving a
> useful purpose, so I think they are worthwhile overall.

Hmm. My initial reaction was: how close might we reasonably come to the
limit? Packfiles are limited to uint32_t, and:

  - you'd have a pretty hard time approaching that; pack-objects needs
    at least 100 bytes of heap per object for its internal book-keeping.
    So you're looking at 200GB-400GB of RAM to generate such a packfile.

  - I'm pretty sure the rest of the repack code would die horribly and
    unpredictably if it were allowed to have that much RAM.

Which isn't an argument against such protections, but I wonder if it
might give people a false sense that we are in any way prepared for
repositories of this scale.

However, at least one of these looks to be multiplying the user-provided
scale factor. I can't imagine a scale factor beyond "2" is all that
useful, but conceptually somebody could provide a big number there.

Looking at the code, though...

> diff --git a/builtin/repack.c b/builtin/repack.c
> index 21a5778e73..677c6b75c1 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -363,6 +363,12 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
>  	for (i = geometry->pack_nr - 1; i > 0; i--) {
>  		struct packed_git *ours = geometry->pack[i];
>  		struct packed_git *prev = geometry->pack[i - 1];
> +
> +		if (unsigned_mult_overflows(factor, geometry_pack_weight(prev)))
> +			die(_("pack %s too large to consider in geometric "
> +			      "progression"),
> +			    prev->pack_name);

This says "unsigned_mult_overflows", but "factor" is a signed int. This
will generally be cast to unsigned in the actual multiplication, but it
depends on the size of the operands.

If int is larger than uint32_t, we'd do the multiplication as a signed
int. But the overflow check would be against an unsigned int (since our
macro only looks at the type of the first argument). So it would be
overly permissive with the final bit. I suspect it would also be wrong
if "factor" is negative.

If int is smaller than 32 bits, then we'd be too conservative (it would
get promoted to uint32_t for the actual multiplication). Also, you
should get a better computer in that case.

It's probably OK in practice, as int tends to just be 32 bits. But if
the point is to be careful, we should probably just take "factor" as a
uint32_t in the first place.

> @@ -388,11 +394,25 @@ static void split_pack_geometry(struct pack_geometry *geometry, int factor)
>  	 * packs in the heavy half need to be joined into it (if any) to restore
>  	 * the geometric progression.
>  	 */
> -	for (i = 0; i < split; i++)
> -		total_size += geometry_pack_weight(geometry->pack[i]);
> +	for (i = 0; i < split; i++) {
> +		struct packed_git *p = geometry->pack[i];
> +
> +		if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
> +			die(_("pack %s too large to roll up"), p->pack_name);
> +		total_size += geometry_pack_weight(p);
> +	}

This one seems even less likely to overflow. total_size is an off_t, so
unless you're on a really lame system, we should be able to fit a lot of
uint32_t's in there.

(It actually feels a little weird for it to be an off_t in the first
place; we're still dealing in units of "number of objects", which the
rest of Git generally considers to be in the realm of a uint32_t).

>  	for (i = split; i < geometry->pack_nr; i++) {
>  		struct packed_git *ours = geometry->pack[i];
> +
> +		if (unsigned_mult_overflows(factor, total_size))
> +			die(_("pack %s too large to roll up"), ours->pack_name);

This one is wrong in the same way as the earlier multiplication, except
this time we're pretty sure that total_size actually is much bigger. So
we'll complain about overflowing an int, but the multiplication will
actually be done as an off_t. And of course it has the same problems
with negative values.

Should total_size just be a uint32_t? Or perhaps any of these factor
multiplication results should just be uint64_t, which would be the
obviously-large-enough type. Then you wouldn't have these ugly overflow
checks. :)

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-03-10 21:01 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-05 15:21 [PATCH 0/5] clean-ups to geometric repacking Taylor Blau
2021-03-05 15:21 ` [PATCH 1/5] builtin/repack.c: do not repack single packs with --geometric Taylor Blau
2021-03-05 19:15   ` Junio C Hamano
2021-03-05 19:27     ` Taylor Blau
2021-03-05 19:30       ` Taylor Blau
2021-03-05 15:21 ` [PATCH 2/5] t7703: test --geometric repack with loose objects Taylor Blau
2021-03-05 15:21 ` [PATCH 3/5] builtin/repack.c: assign pack split later Taylor Blau
2021-03-05 15:21 ` [PATCH 4/5] builtin/repack.c: be more conservative with unsigned overflows Taylor Blau
2021-03-05 19:32   ` Junio C Hamano
2021-03-05 19:41     ` Taylor Blau
2021-03-10 21:00   ` Jeff King
2021-03-05 15:22 ` [PATCH 5/5] builtin/repack.c: reword comment around pack-objects flags Taylor Blau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.