git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] rev-list --disk-usage
@ 2021-01-27 22:11 Jeff King
  2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Jeff King @ 2021-01-27 22:11 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

This series teaches rev-list to compute the on-disk size used by a set
of objects. You can do the same thing with cat-file, but this is much
faster (see the timings in the second commit).

We've been running it for about 5 years at GitHub. I hesitated sending
it upstream because it's a bit weird and special-purpose. But it does
come in handy for debugging, analyzing repos, etc. So maybe others will
find it useful.

The first patch is just a test-script enhancement to let test_commit
avoid creating tags. During some recent refactoring, we actually broke
the --disk-usage feature but the test script didn't catch it because the
tags were being picked up by "--all". Since this is at least the third
time I've run into that in our test suite, I thought I'd make it a
little more convenient to avoid. :)

  [1/2]: t: add --no-tag option to test_commit
  [2/2]: rev-list: add --disk-usage option for calculating disk usage

 Documentation/rev-list-options.txt |  9 ++++++
 builtin/rev-list.c                 | 49 ++++++++++++++++++++++++++++
 pack-bitmap.c                      | 50 +++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 ++
 t/t4208-log-magic-pathspec.sh      |  9 ++----
 t/t6114-rev-list-du.sh             | 51 ++++++++++++++++++++++++++++++
 t/test-lib-functions.sh            |  9 +++++-
 7 files changed, 171 insertions(+), 8 deletions(-)
 create mode 100755 t/t6114-rev-list-du.sh

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/2] t: add --no-tag option to test_commit
  2021-01-27 22:11 [PATCH 0/2] rev-list --disk-usage Jeff King
@ 2021-01-27 22:12 ` Jeff King
  2021-01-27 22:48   ` Taylor Blau
  2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2021-01-27 22:12 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

One of the conveniences that test_commit offers is making a tag for each
commit. This makes it easy to refer to the commits in subsequent
commands. But it can also be a pain if you care about reachability,
because those tags keep the commits reachable even if they are rewound
from the branch they're made on.

The alternative is that scripts have to call test_tick, git-add, and
git-commit themselves. Let's add a --no-tag option to give them the
one-liner convenience of using test_commit.

This is in preparation for the next patch, which will add some more
calls. But I cleaned up an existing site to show off the feature. There
are probably more cleanups possible.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t4208-log-magic-pathspec.sh | 9 ++-------
 t/test-lib-functions.sh       | 9 ++++++++-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/t/t4208-log-magic-pathspec.sh b/t/t4208-log-magic-pathspec.sh
index 5e10136e9a..7f0c1dcc0f 100755
--- a/t/t4208-log-magic-pathspec.sh
+++ b/t/t4208-log-magic-pathspec.sh
@@ -31,13 +31,8 @@ test_expect_success '"git log :/a -- " should not be ambiguous' '
 test_expect_success '"git log :/detached -- " should find a commit only in HEAD' '
 	test_when_finished "git checkout main" &&
 	git checkout --detach &&
-	# Must manually call `test_tick` instead of using `test_commit`,
-	# because the latter additionally creates a tag, which would make
-	# the commit reachable not only via HEAD.
-	test_tick &&
-	git commit --allow-empty -m detached &&
-	test_tick &&
-	git commit --allow-empty -m something-else &&
+	test_commit --no-tag detached &&
+	test_commit --no-tag something-else &&
 	git log :/detached --
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 6bca002316..1587241ba0 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,7 @@ test_commit () {
 	author= &&
 	signoff= &&
 	indir= &&
+	no_tag= &&
 	while test $# != 0
 	do
 		case "$1" in
@@ -222,6 +223,9 @@ test_commit () {
 			indir="$2"
 			shift
 			;;
+		--no-tag)
+			no_tag=yes
+			;;
 		*)
 			break
 			;;
@@ -244,7 +248,10 @@ test_commit () {
 	git ${indir:+ -C "$indir"} commit \
 	    ${author:+ --author "$author"} \
 	    $signoff -m "$1" &&
-	git ${indir:+ -C "$indir"} tag "${4:-$1}"
+	if test -z "$no_tag"
+	then
+		git ${indir:+ -C "$indir"} tag "${4:-$1}"
+	fi
 }
 
 # Call test_merge with the arguments "<message> <commit>", where <commit>
-- 
2.30.0.758.g9692d13bf2


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 22:11 [PATCH 0/2] rev-list --disk-usage Jeff King
  2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
@ 2021-01-27 22:17 ` Jeff King
  2021-01-27 22:57   ` Taylor Blau
                     ` (2 more replies)
  2021-01-27 22:46 ` [PATCH 0/2] rev-list --disk-usage Taylor Blau
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
  3 siblings, 3 replies; 30+ messages in thread
From: Jeff King @ 2021-01-27 22:17 UTC (permalink / raw)
  To: git; +Cc: Taylor Blau

It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).

You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:

    git rev-list --objects main..branch
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.

But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.

  2. If bitmaps are in use, for objects that are in the
     bitmapped packfile we can skip the oid_object_info()
     lookup entirely, and just ask the revindex for the
     on-disk size.

This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:

  [rev-list piped to cat-file, no bitmaps]
  $ time git rev-list --objects --all |
    cut -d' ' -f1 |
    git cat-file --buffer --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1455691059
  real	0m34.336s
  user	0m46.533s
  sys	0m2.953s

  [internal, no bitmaps]
  $ time git rev-list --disk-usage --all
  1455691059
  real	0m32.662s
  user	0m32.306s
  sys	0m0.353s

The wall-clock times aren't that different because of parallelism, but
notice the CPU savings between the two. We saved 35% of the CPU just by
avoiding the pipes.

But the real win is with bitmaps. If we use them without the new option:

  [rev-list piped to cat-file, bitmaps]
  $ time git rev-list --objects --all --use-bitmap-index |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  real	0m9.954s
  user	0m11.234s
  sys	0m8.522s

then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:

  [internal, bitmaps]
  $ time git rev-list --disk-usage --all --use-bitmap-index
  1455691059
  real	0m0.235s
  user	0m0.186s
  sys	0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:

  $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10
  374798628
  real	0m0.429s
  user	0m0.356s
  sys	0m0.072s

Signed-off-by: Jeff King <peff@peff.net>
---
This _could_ be made more flexible, but I didn't think it was worth the
complexity. Some obvious things one might want are:

  - not counting up all reachable objects (i.e., requiring --objects for
    this output, and omitting it just counts up commits). This could be
    handled in the bitmap case with some extra code (OR-ing with the
    type bitmaps).

    But after 5 years of this patch, I've never wanted that once. The
    disk usage of just some of the objects isn't really that useful (and
    of course you can still get it by piping to cat-file).

  - an option to output the sizes of specific objects along with their
    oids. But if you want to get to this level of flexibility, I think
    you're better off just using cat-file (and if we are concerned about
    the pipe costs, we should teach rev-list to understand cat-file's
    custom formats).

 Documentation/rev-list-options.txt |  9 ++++++
 builtin/rev-list.c                 | 49 ++++++++++++++++++++++++++++
 pack-bitmap.c                      | 50 +++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 ++
 t/t6114-rev-list-du.sh             | 51 ++++++++++++++++++++++++++++++
 5 files changed, 161 insertions(+)
 create mode 100755 t/t6114-rev-list-du.sh

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 002379056a..1e5826f26d 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -222,6 +222,15 @@ ifdef::git-rev-list[]
 	test the exit status to see if a range of objects is fully
 	connected (or not).  It is faster than redirecting stdout
 	to `/dev/null` as the output does not have to be formatted.
+
+--disk-usage::
+	Suppress normal output; instead, print the sum of the bytes used
+	for on-disk storage by the selected objects. This is equivalent
+	to piping the output of `rev-list --objects` into
+	`git cat-file --batch-check='%(objectsize:disk)', except that it
+	runs much faster (especially with `--use-bitmap-index`). See the
+	`CAVEATS` section in linkgit:git-cat-file[1] for the limitations
+	of what "on-disk storage" means.
 endif::git-rev-list[]
 
 --cherry-mark::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 25c6c3b38d..2262b613dd 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -80,6 +80,19 @@ static int arg_show_object_names = 1;
 
 #define DEFAULT_OIDSET_SIZE     (16*1024)
 
+static int show_disk_usage;
+static off_t total_disk_usage;
+
+static off_t get_object_disk_usage(struct object *obj)
+{
+	off_t size;
+	struct object_info oi = OBJECT_INFO_INIT;
+	oi.disk_sizep = &size;
+	if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
+		die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid));
+	return size;
+}
+
 static void finish_commit(struct commit *commit);
 static void show_commit(struct commit *commit, void *data)
 {
@@ -88,6 +101,9 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (show_disk_usage)
+		total_disk_usage += get_object_disk_usage(&commit->object);
+
 	if (info->flags & REV_LIST_QUIET) {
 		finish_commit(commit);
 		return;
@@ -258,6 +274,8 @@ static void show_object(struct object *obj, const char *name, void *cb_data)
 	if (finish_object(obj, name, cb_data))
 		return;
 	display_progress(progress, ++progress_counter);
+	if (show_disk_usage)
+		total_disk_usage += get_object_disk_usage(obj);
 	if (info->flags & REV_LIST_QUIET)
 		return;
 
@@ -452,6 +470,23 @@ static int try_bitmap_traversal(struct rev_info *revs,
 	return 0;
 }
 
+static int try_bitmap_disk_usage(struct rev_info *revs,
+				 struct list_objects_filter_options *filter)
+{
+	struct bitmap_index *bitmap_git;
+
+	if (!show_disk_usage)
+		return -1;
+
+	bitmap_git = prepare_bitmap_walk(revs, filter);
+	if (!bitmap_git)
+		return -1;
+
+	printf("%"PRIuMAX"\n",
+	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git));
+	return 0;
+}
+
 int cmd_rev_list(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
@@ -584,6 +619,15 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			continue;
 		}
 
+		if (!strcmp(arg, "--disk-usage")) {
+			show_disk_usage = 1;
+			revs.tag_objects = 1;
+			revs.tree_objects = 1;
+			revs.blob_objects = 1;
+			info.flags |= REV_LIST_QUIET;
+			continue;
+		}
+
 		usage(rev_list_usage);
 
 	}
@@ -626,6 +670,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (use_bitmap_index) {
 		if (!try_bitmap_count(&revs, &filter_options))
 			return 0;
+		if (!try_bitmap_disk_usage(&revs, &filter_options))
+			return 0;
 		if (!try_bitmap_traversal(&revs, &filter_options))
 			return 0;
 	}
@@ -690,5 +736,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			printf("%d\n", revs.count_left + revs.count_right);
 	}
 
+	if (show_disk_usage)
+		printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage);
+
 	return 0;
 }
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 60fe20fb87..ba36b9c6a0 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1430,3 +1430,53 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 	return bitmap_git &&
 		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
 }
+
+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git)
+{
+	struct bitmap *result = bitmap_git->result;
+	struct packed_git *pack = bitmap_git->pack;
+	struct eindex *eindex = &bitmap_git->ext_index;
+	struct object_info oi = OBJECT_INFO_INIT;
+	off_t object_size;
+	off_t total = 0;
+	size_t i;
+
+	oi.disk_sizep = &object_size;
+
+	for (i = 0; i < result->word_alloc; i++) {
+		eword_t word = result->words[i];
+		size_t base = (i * BITS_IN_EWORD);
+		unsigned offset;
+
+		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+			size_t pos;
+
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			pos = base + offset;
+
+			/*
+			 * If it's in the pack, we can use the fast path
+			 * and just check the revindex. Otherwise, we
+			 * fall back to looking it up.
+			 */
+			if (pos < pack->num_objects) {
+				object_size =
+					pack_pos_to_offset(pack, pos + 1) -
+					pack_pos_to_offset(pack, pos);
+			} else {
+				struct object *obj;
+				obj = eindex->objects[pos - pack->num_objects];
+				if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
+					die(_("unable to get disk usage of %s"),
+					      oid_to_hex(&obj->oid));
+			}
+
+			total += object_size;
+		}
+	}
+
+	return total;
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 25dfcf5615..c8070606b7 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -68,6 +68,8 @@ int bitmap_walk_contains(struct bitmap_index *,
  */
 int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
 
+off_t get_disk_usage_from_bitmap(struct bitmap_index *);
+
 void bitmap_writer_show_progress(int show);
 void bitmap_writer_set_checksum(unsigned char *sha1);
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
diff --git a/t/t6114-rev-list-du.sh b/t/t6114-rev-list-du.sh
new file mode 100755
index 0000000000..1fadbcaded
--- /dev/null
+++ b/t/t6114-rev-list-du.sh
@@ -0,0 +1,51 @@
+#!/bin/sh
+
+test_description='basic tests of rev-list --disk-usage'
+. ./test-lib.sh
+
+# we want a mix of reachable and unreachable, as well as
+# objects in the bitmapped pack and some outside of it
+test_expect_success 'set up repository' '
+	test_commit --no-tag one &&
+	test_commit --no-tag two &&
+	git repack -adb &&
+	git reset --hard HEAD^ &&
+	test_commit --no-tag three &&
+	test_commit --no-tag four &&
+	git reset --hard HEAD^
+'
+
+# We don't want to hardcode sizes, because they depend on the exact details of
+# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
+# machinery works and compare the --disk-usage output to that.
+disk_usage_slow () {
+	git rev-list --objects "$@" |
+	cut -d' ' -f1 |
+	git cat-file --batch-check="%(objectsize:disk)" |
+	perl -lne '$total += $_; END { print $total}'
+}
+
+# check behavior with given rev-list options; note that
+# whitespace is not preserved in args
+check_du () {
+	args=$*
+
+	test_expect_success "generate expected size ($args)" "
+		disk_usage_slow $args >expect
+	"
+
+	test_expect_success "rev-list --disk-usage without bitmaps ($args)" "
+		git rev-list --disk-usage $args >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success "rev-list --disk-usage with bitmaps ($args)" "
+		git rev-list --disk-usage --use-bitmap-index $args >actual &&
+		test_cmp expect actual
+	"
+}
+
+check_du HEAD
+check_du HEAD^..HEAD
+
+test_done
-- 
2.30.0.758.g9692d13bf2

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] rev-list --disk-usage
  2021-01-27 22:11 [PATCH 0/2] rev-list --disk-usage Jeff King
  2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
  2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
@ 2021-01-27 22:46 ` Taylor Blau
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
  3 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2021-01-27 22:46 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

On Wed, Jan 27, 2021 at 05:11:36PM -0500, Jeff King wrote:
> The first patch is just a test-script enhancement to let test_commit
> avoid creating tags. During some recent refactoring, we actually broke
> the --disk-usage feature but the test script didn't catch it because the
> tags were being picked up by "--all". Since this is at least the third
> time I've run into that in our test suite, I thought I'd make it a
> little more convenient to avoid. :)

I appreciate the non-incriminating "we", but the person who caused the
regression was most certainly me ;-).

This happened while cherry-picking Junio's recent merge of
tb/revindex-api, which obviously did not cause a merge conflict with
this new caller. The remaining details are boring, but they definitely
weren't Peff's fault :-).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] t: add --no-tag option to test_commit
  2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
@ 2021-01-27 22:48   ` Taylor Blau
  0 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2021-01-27 22:48 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

On Wed, Jan 27, 2021 at 05:12:25PM -0500, Jeff King wrote:
> The alternative is that scripts have to call test_tick, git-add, and
> git-commit themselves. Let's add a --no-tag option to give them the
> one-liner convenience of using test_commit.

Thanks for finding a spot that does this and making it more readable
with the new --no-tag option.

This patch looks obviously correct. I'm sure that (as you note) there
are more cleanups possible, but I'm happy to just grab an easy one and
let future refactorings clean up the remaining ones.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
@ 2021-01-27 22:57   ` Taylor Blau
  2021-01-27 23:34     ` Jeff King
  2021-01-27 23:01   ` Kyle Meyer
  2021-01-27 23:07   ` Eric Sunshine
  2 siblings, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2021-01-27 22:57 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Taylor Blau

On Wed, Jan 27, 2021 at 05:17:07PM -0500, Jeff King wrote:
> It can sometimes be useful to see which refs are contributing to the
> overall repository size (e.g., does some branch have a bunch of objects
> not found elsewhere in history, which indicates that deleting it would
> shrink the size of a clone).
>
> You can find that out by generating a list of objects, getting their
> sizes from cat-file, and then summing them, like:
>
>     git rev-list --objects main..branch
>     cut -d' ' -f1 |

I suspect that this is from the original commit message that you wrote a
half-decade ago. Not that it really means much, but you could shave one
process off of this example by passing '--no-object-names' to 'git
rev-list'.

The whole point is that we can avoid having to do this, so I don't think
it really matters, anyway.

> [...]
> then we're faster to generate the list of objects, but we still spend a
> lot of time piping and looking things up. But if we do both together:
>
>   [internal, bitmaps]
>   $ time git rev-list --disk-usage --all --use-bitmap-index
>   1455691059
>   real	0m0.235s
>   user	0m0.186s
>   sys	0m0.049s
>
> then we get the same answer much faster.

Very nice.

> This _could_ be made more flexible, but I didn't think it was worth the
> complexity. Some obvious things one might want are:
>
>   - not counting up all reachable objects (i.e., requiring --objects for
>     this output, and omitting it just counts up commits). This could be
>     handled in the bitmap case with some extra code (OR-ing with the
>     type bitmaps).
>
>     But after 5 years of this patch, I've never wanted that once. The
>     disk usage of just some of the objects isn't really that useful (and
>     of course you can still get it by piping to cat-file).

Yeah. I think it's trivial to support it, but I'm in favor of a simpler
interface.

That said, I worry about painting ourselves into a corner if the default
implies --objects. If we wanted to change that, I'm pretty sure you'd
have to write a rule that says "imply objects, unless --tags, --blobs or
etc. are specified, and then only do that".

Maybe we'll never have to address that, but it's worth thinking about
before committing to implying '--objects'.

>   - an option to output the sizes of specific objects along with their
>     oids. But if you want to get to this level of flexibility, I think
>     you're better off just using cat-file (and if we are concerned about
>     the pipe costs, we should teach rev-list to understand cat-file's
>     custom formats).

This I agree with completely. Any caller who wants that level of
flexibility shouldn't mind the piping.

I have no comments on the patch itself, which looks fine to me (and I
have seen over and over again as it seems to regularly cause conflicts
when merging new releases into GitHub's fork :-)).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
  2021-01-27 22:57   ` Taylor Blau
@ 2021-01-27 23:01   ` Kyle Meyer
  2021-01-27 23:36     ` Jeff King
  2021-01-27 23:07   ` Eric Sunshine
  2 siblings, 1 reply; 30+ messages in thread
From: Kyle Meyer @ 2021-01-27 23:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git

Jeff King writes:

> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index 002379056a..1e5826f26d 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -222,6 +222,15 @@ ifdef::git-rev-list[]
>  	test the exit status to see if a range of objects is fully
>  	connected (or not).  It is faster than redirecting stdout
>  	to `/dev/null` as the output does not have to be formatted.
> +
> +--disk-usage::
> +	Suppress normal output; instead, print the sum of the bytes used
> +	for on-disk storage by the selected objects. This is equivalent
> +	to piping the output of `rev-list --objects` into
> +	`git cat-file --batch-check='%(objectsize:disk)', except that it

[ Just a drive-by typo comment from a reader not knowledgeable enough to
  review the code change :) ]

The cat-file command is missing its closing quote.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
  2021-01-27 22:57   ` Taylor Blau
  2021-01-27 23:01   ` Kyle Meyer
@ 2021-01-27 23:07   ` Eric Sunshine
  2021-01-27 23:39     ` Jeff King
  2 siblings, 1 reply; 30+ messages in thread
From: Eric Sunshine @ 2021-01-27 23:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Git List, Taylor Blau

On Wed, Jan 27, 2021 at 5:20 PM Jeff King <peff@peff.net> wrote:
> This patch implements a --disk-usage option which produces the same
> answer in a fraction of the time. Here are some timings using a clone of
> torvalds/linux:
>
>   [rev-list piped to cat-file, no bitmaps]
>   $ time git rev-list --objects --all |
>     cut -d' ' -f1 |
>     git cat-file --buffer --batch-check='%(objectsize:disk)' |
>     perl -lne '$total += $_; END { print $total }'
>   1455691059
>   real  0m34.336s
>   user  0m46.533s
>   sys   0m2.953s

This example shows the computed size (1455691059)...

> But the real win is with bitmaps. If we use them without the new option:
>
>   [rev-list piped to cat-file, bitmaps]
>   $ time git rev-list --objects --all --use-bitmap-index |
>     cut -d' ' -f1 |
>     git cat-file --batch-check='%(objectsize:disk)' |
>     perl -lne '$total += $_; END { print $total }'
>   real  0m9.954s
>   user  0m11.234s
>   sys   0m8.522s

...however, this example does not (but all the others do). Simple
copy/paste error?

Not worth a re-roll, of course.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 22:57   ` Taylor Blau
@ 2021-01-27 23:34     ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-01-27 23:34 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git

On Wed, Jan 27, 2021 at 05:57:21PM -0500, Taylor Blau wrote:

> > You can find that out by generating a list of objects, getting their
> > sizes from cat-file, and then summing them, like:
> >
> >     git rev-list --objects main..branch
> >     cut -d' ' -f1 |
> 
> I suspect that this is from the original commit message that you wrote a
> half-decade ago. Not that it really means much, but you could shave one
> process off of this example by passing '--no-object-names' to 'git
> rev-list'.

That, plus my muscle memory to do the cut. We should probably model the
better form here, and use it in the test, though (not worth a re-roll on
its own, but it looks like there are a few other minor bits).

> >   - not counting up all reachable objects (i.e., requiring --objects for
> >     this output, and omitting it just counts up commits). This could be
> >     handled in the bitmap case with some extra code (OR-ing with the
> >     type bitmaps).
> >
> >     But after 5 years of this patch, I've never wanted that once. The
> >     disk usage of just some of the objects isn't really that useful (and
> >     of course you can still get it by piping to cat-file).
> 
> Yeah. I think it's trivial to support it, but I'm in favor of a simpler
> interface.
> 
> That said, I worry about painting ourselves into a corner if the default
> implies --objects. If we wanted to change that, I'm pretty sure you'd
> have to write a rule that says "imply objects, unless --tags, --blobs or
> etc. are specified, and then only do that".
> 
> Maybe we'll never have to address that, but it's worth thinking about
> before committing to implying '--objects'.

Yeah, the one thing that gives me pause is that it would be hard to undo
later. I didn't write the code to handle it in the bitmap case, but I
don't think it would be _too_ bad. It is slightly annoying for the
all-objects case, because the existing code isn't set up well to iterate
either a specific type, or all types.

> I have no comments on the patch itself, which looks fine to me (and I
> have seen over and over again as it seems to regularly cause conflicts
> when merging new releases into GitHub's fork :-)).

You are exposing my ulterior motive. :)

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 23:01   ` Kyle Meyer
@ 2021-01-27 23:36     ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-01-27 23:36 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: Taylor Blau, git

On Wed, Jan 27, 2021 at 06:01:51PM -0500, Kyle Meyer wrote:

> > +--disk-usage::
> > +	Suppress normal output; instead, print the sum of the bytes used
> > +	for on-disk storage by the selected objects. This is equivalent
> > +	to piping the output of `rev-list --objects` into
> > +	`git cat-file --batch-check='%(objectsize:disk)', except that it
> 
> [ Just a drive-by typo comment from a reader not knowledgeable enough to
>   review the code change :) ]
> 
> The cat-file command is missing its closing quote.

Thanks for catching that. I should have looked at the output of
doc-diff, which does reveal it.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-01-27 23:07   ` Eric Sunshine
@ 2021-01-27 23:39     ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-01-27 23:39 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Taylor Blau

On Wed, Jan 27, 2021 at 06:07:57PM -0500, Eric Sunshine wrote:

> This example shows the computed size (1455691059)...
> [...]
> ...however, this example does not (but all the others do). Simple
> copy/paste error?

Yep, thanks for catching. (Of course I have since repacked my linux.git,
so now it produces a different answer! It does match the current value
of the other techniques, though).

> Not worth a re-roll, of course.

Agreed, but it looks like there are a few other minor bits, so I'll
definitely fix it up at the same time. I'll give a little more time
before re-rolling in case there are any other comments.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2] rev-list --disk-usage
  2021-01-27 22:11 [PATCH 0/2] rev-list --disk-usage Jeff King
                   ` (2 preceding siblings ...)
  2021-01-27 22:46 ` [PATCH 0/2] rev-list --disk-usage Taylor Blau
@ 2021-02-09 10:52 ` Jeff King
  2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
                     ` (3 more replies)
  3 siblings, 4 replies; 30+ messages in thread
From: Jeff King @ 2021-02-09 10:52 UTC (permalink / raw)
  To: git; +Cc: Kyle Meyer, Eric Sunshine, Taylor Blau

Here's a re-roll of my series to add "rev-list --disk-usage", for
counting up object storage used for various slices of history.

This fixes the minor bits mentioned in review for v1, but the big change
is that "--disk-usage" no longer implies "--objects". I think you
generally would want to use it with that option, but it really seemed to
violate the principle of least surprise for the user.

That requires handling each object type independently, but the code for
that turned out to be not too bad (and is modeled after the similar
logic in traverse_bitmap_commit_list()). I was slightly concerned that
it would slow things down to walk over the bitmap multiple times, but it
doesn't seem to make much of a difference in practice.

There's a range-diff below, but it's not really worth looking at. All of
the interesting parts were rewritten completely, so you're better off to
just read patch 2 again (and patch 1 did not change at all).

  [1/2]: t: add --no-tag option to test_commit
  [2/2]: rev-list: add --disk-usage option for calculating disk usage

 Documentation/rev-list-options.txt |  9 ++++
 builtin/rev-list.c                 | 46 +++++++++++++++++
 pack-bitmap.c                      | 81 ++++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 +
 t/t4208-log-magic-pathspec.sh      |  9 +---
 t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
 t/test-lib-functions.sh            |  9 +++-
 7 files changed, 199 insertions(+), 8 deletions(-)
 create mode 100755 t/t6114-rev-list-du.sh

1:  20f8edeff1 = 1:  6365cd94bd t: add --no-tag option to test_commit
2:  64e28cb6c9 ! 2:  8a93583dee rev-list: add --disk-usage option for calculating disk usage
    @@ Commit message
         You can find that out by generating a list of objects, getting their
         sizes from cat-file, and then summing them, like:
     
    -        git rev-list --objects main..branch
    -        cut -d' ' -f1 |
    +        git rev-list --objects --no-object-names main..branch
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
     
    @@ Commit message
         torvalds/linux:
     
           [rev-list piped to cat-file, no bitmaps]
    -      $ time git rev-list --objects --all |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all |
             git cat-file --buffer --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      1455691059
    -      real  0m34.336s
    -      user  0m46.533s
    -      sys   0m2.953s
    +      1459938510
    +      real  0m29.635s
    +      user  0m38.003s
    +      sys   0m1.093s
     
           [internal, no bitmaps]
    -      $ time git rev-list --disk-usage --all
    -      1455691059
    -      real  0m32.662s
    -      user  0m32.306s
    -      sys   0m0.353s
    +      $ time git rev-list --disk-usage --objects --all
    +      1459938510
    +      real  0m31.262s
    +      user  0m30.885s
    +      sys   0m0.376s
     
    -    The wall-clock times aren't that different because of parallelism, but
    -    notice the CPU savings between the two. We saved 35% of the CPU just by
    +    Even though the wall-clock time is slightly worse due to parallelism,
    +    notice the CPU savings between the two. We saved 21% of the CPU just by
         avoiding the pipes.
     
         But the real win is with bitmaps. If we use them without the new option:
     
           [rev-list piped to cat-file, bitmaps]
    -      $ time git rev-list --objects --all --use-bitmap-index |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      real  0m9.954s
    -      user  0m11.234s
    -      sys   0m8.522s
    +      1459938510
    +      real  0m6.244s
    +      user  0m8.452s
    +      sys   0m0.311s
     
         then we're faster to generate the list of objects, but we still spend a
         lot of time piping and looking things up. But if we do both together:
     
           [internal, bitmaps]
    -      $ time git rev-list --disk-usage --all --use-bitmap-index
    -      1455691059
    -      real  0m0.235s
    -      user  0m0.186s
    +      $ time git rev-list --disk-usage --objects --all --use-bitmap-index
    +      1459938510
    +      real  0m0.219s
    +      user  0m0.169s
           sys   0m0.049s
     
         then we get the same answer much faster.
    @@ Commit message
         of course. But we're actually checking reachability here, so we're still
         fast when we ask for more interesting things:
     
    -      $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10
    +      $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
           374798628
           real  0m0.429s
           user  0m0.356s
    @@ Documentation/rev-list-options.txt: ifdef::git-rev-list[]
     +
     +--disk-usage::
     +	Suppress normal output; instead, print the sum of the bytes used
    -+	for on-disk storage by the selected objects. This is equivalent
    -+	to piping the output of `rev-list --objects` into
    -+	`git cat-file --batch-check='%(objectsize:disk)', except that it
    -+	runs much faster (especially with `--use-bitmap-index`). See the
    -+	`CAVEATS` section in linkgit:git-cat-file[1] for the limitations
    -+	of what "on-disk storage" means.
    ++	for on-disk storage by the selected commits or objects. This is
    ++	equivalent to piping the output into `git cat-file
    ++	--batch-check='%(objectsize:disk)'`, except that it runs much
    ++	faster (especially with `--use-bitmap-index`). See the `CAVEATS`
    ++	section in linkgit:git-cat-file[1] for the limitations of what
    ++	"on-disk storage" means.
      endif::git-rev-list[]
      
      --cherry-mark::
    @@ builtin/rev-list.c: static int try_bitmap_traversal(struct rev_info *revs,
     +		return -1;
     +
     +	printf("%"PRIuMAX"\n",
    -+	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git));
    ++	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
     +	return 0;
     +}
     +
    @@ builtin/rev-list.c: int cmd_rev_list(int argc, const char **argv, const char *pr
      
     +		if (!strcmp(arg, "--disk-usage")) {
     +			show_disk_usage = 1;
    -+			revs.tag_objects = 1;
    -+			revs.tree_objects = 1;
    -+			revs.blob_objects = 1;
     +			info.flags |= REV_LIST_QUIET;
     +			continue;
     +		}
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
      		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
      }
     +
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git)
    ++static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
    ++				     enum object_type object_type)
     +{
     +	struct bitmap *result = bitmap_git->result;
     +	struct packed_git *pack = bitmap_git->pack;
    -+	struct eindex *eindex = &bitmap_git->ext_index;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    -+	off_t object_size;
     +	off_t total = 0;
    ++	struct ewah_iterator it;
    ++	eword_t filter;
     +	size_t i;
     +
    -+	oi.disk_sizep = &object_size;
    -+
    -+	for (i = 0; i < result->word_alloc; i++) {
    -+		eword_t word = result->words[i];
    ++	init_type_iterator(&it, bitmap_git, object_type);
    ++	for (i = 0; i < result->word_alloc &&
    ++			ewah_iterator_next(&filter, &it); i++) {
    ++		eword_t word = result->words[i] & filter;
     +		size_t base = (i * BITS_IN_EWORD);
     +		unsigned offset;
     +
    ++		if (!word)
    ++			continue;
    ++
     +		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
     +			size_t pos;
     +
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
     +
     +			offset += ewah_bit_ctz64(word >> offset);
     +			pos = base + offset;
    -+
    -+			/*
    -+			 * If it's in the pack, we can use the fast path
    -+			 * and just check the revindex. Otherwise, we
    -+			 * fall back to looking it up.
    -+			 */
    -+			if (pos < pack->num_objects) {
    -+				object_size =
    -+					pack_pos_to_offset(pack, pos + 1) -
    -+					pack_pos_to_offset(pack, pos);
    -+			} else {
    -+				struct object *obj;
    -+				obj = eindex->objects[pos - pack->num_objects];
    -+				if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    -+					die(_("unable to get disk usage of %s"),
    -+					      oid_to_hex(&obj->oid));
    -+			}
    -+
    -+			total += object_size;
    ++			total += pack_pos_to_offset(pack, pos + 1) -
    ++				 pack_pos_to_offset(pack, pos);
     +		}
     +	}
     +
     +	return total;
    ++}
    ++
    ++static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
    ++{
    ++	struct bitmap *result = bitmap_git->result;
    ++	struct packed_git *pack = bitmap_git->pack;
    ++	struct eindex *eindex = &bitmap_git->ext_index;
    ++	off_t total = 0;
    ++	struct object_info oi = OBJECT_INFO_INIT;
    ++	off_t object_size;
    ++	size_t i;
    ++
    ++	oi.disk_sizep = &object_size;
    ++
    ++	for (i = 0; i < eindex->count; i++) {
    ++		struct object *obj = eindex->objects[i];
    ++
    ++		if (!bitmap_get(result, pack->num_objects + i))
    ++			continue;
    ++
    ++		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    ++			die(_("unable to get disk usage of %s"),
    ++			    oid_to_hex(&obj->oid));
    ++
    ++		total += object_size;
    ++	}
    ++	return total;
    ++}
    ++
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
    ++				 struct rev_info *revs)
    ++{
    ++	off_t total = 0;
    ++
    ++	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
    ++	if (revs->tree_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
    ++	if (revs->blob_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
    ++	if (revs->tag_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
    ++
    ++	total += get_disk_usage_for_extended(bitmap_git);
    ++
    ++	return total;
     +}
     
      ## pack-bitmap.h ##
     @@ pack-bitmap.h: int bitmap_walk_contains(struct bitmap_index *,
       */
      int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
      
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *);
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *);
     +
      void bitmap_writer_show_progress(int show);
      void bitmap_writer_set_checksum(unsigned char *sha1);
    @@ t/t6114-rev-list-du.sh (new)
     +# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
     +# machinery works and compare the --disk-usage output to that.
     +disk_usage_slow () {
    -+	git rev-list --objects "$@" |
    -+	cut -d' ' -f1 |
    ++	git rev-list --no-object-names "$@" |
     +	git cat-file --batch-check="%(objectsize:disk)" |
     +	perl -lne '$total += $_; END { print $total}'
     +}
    @@ t/t6114-rev-list-du.sh (new)
     +}
     +
     +check_du HEAD
    -+check_du HEAD^..HEAD
    ++check_du --objects HEAD
    ++check_du --objects HEAD^..HEAD
     +
     +test_done

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/2] t: add --no-tag option to test_commit
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
@ 2021-02-09 10:52   ` Jeff King
  2021-02-09 10:53   ` [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-02-09 10:52 UTC (permalink / raw)
  To: git; +Cc: Kyle Meyer, Eric Sunshine, Taylor Blau

One of the conveniences that test_commit offers is making a tag for each
commit. This makes it easy to refer to the commits in subsequent
commands. But it can also be a pain if you care about reachability,
because those tags keep the commits reachable even if they are rewound
from the branch they're made on.

The alternative is that scripts have to call test_tick, git-add, and
git-commit themselves. Let's add a --no-tag option to give them the
one-liner convenience of using test_commit.

This is in preparation for the next patch, which will add some more
calls. But I cleaned up an existing site to show off the feature. There
are probably more cleanups possible.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t4208-log-magic-pathspec.sh | 9 ++-------
 t/test-lib-functions.sh       | 9 ++++++++-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/t/t4208-log-magic-pathspec.sh b/t/t4208-log-magic-pathspec.sh
index 5e10136e9a..7f0c1dcc0f 100755
--- a/t/t4208-log-magic-pathspec.sh
+++ b/t/t4208-log-magic-pathspec.sh
@@ -31,13 +31,8 @@ test_expect_success '"git log :/a -- " should not be ambiguous' '
 test_expect_success '"git log :/detached -- " should find a commit only in HEAD' '
 	test_when_finished "git checkout main" &&
 	git checkout --detach &&
-	# Must manually call `test_tick` instead of using `test_commit`,
-	# because the latter additionally creates a tag, which would make
-	# the commit reachable not only via HEAD.
-	test_tick &&
-	git commit --allow-empty -m detached &&
-	test_tick &&
-	git commit --allow-empty -m something-else &&
+	test_commit --no-tag detached &&
+	test_commit --no-tag something-else &&
 	git log :/detached --
 '
 
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 6bca002316..1587241ba0 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -202,6 +202,7 @@ test_commit () {
 	author= &&
 	signoff= &&
 	indir= &&
+	no_tag= &&
 	while test $# != 0
 	do
 		case "$1" in
@@ -222,6 +223,9 @@ test_commit () {
 			indir="$2"
 			shift
 			;;
+		--no-tag)
+			no_tag=yes
+			;;
 		*)
 			break
 			;;
@@ -244,7 +248,10 @@ test_commit () {
 	git ${indir:+ -C "$indir"} commit \
 	    ${author:+ --author "$author"} \
 	    $signoff -m "$1" &&
-	git ${indir:+ -C "$indir"} tag "${4:-$1}"
+	if test -z "$no_tag"
+	then
+		git ${indir:+ -C "$indir"} tag "${4:-$1}"
+	fi
 }
 
 # Call test_merge with the arguments "<message> <commit>", where <commit>
-- 
2.30.1.887.ge7d57fcab0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
  2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
@ 2021-02-09 10:53   ` Jeff King
  2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
  2021-02-10  0:44   ` Junio C Hamano
  3 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-02-09 10:53 UTC (permalink / raw)
  To: git; +Cc: Kyle Meyer, Eric Sunshine, Taylor Blau

It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).

You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:

    git rev-list --objects --no-object-names main..branch
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.

But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.

  2. If bitmaps are in use, for objects that are in the
     bitmapped packfile we can skip the oid_object_info()
     lookup entirely, and just ask the revindex for the
     on-disk size.

This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:

  [rev-list piped to cat-file, no bitmaps]
  $ time git rev-list --objects --no-object-names --all |
    git cat-file --buffer --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real	0m29.635s
  user	0m38.003s
  sys	0m1.093s

  [internal, no bitmaps]
  $ time git rev-list --disk-usage --objects --all
  1459938510
  real	0m31.262s
  user	0m30.885s
  sys	0m0.376s

Even though the wall-clock time is slightly worse due to parallelism,
notice the CPU savings between the two. We saved 21% of the CPU just by
avoiding the pipes.

But the real win is with bitmaps. If we use them without the new option:

  [rev-list piped to cat-file, bitmaps]
  $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real	0m6.244s
  user	0m8.452s
  sys	0m0.311s

then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:

  [internal, bitmaps]
  $ time git rev-list --disk-usage --objects --all --use-bitmap-index
  1459938510
  real	0m0.219s
  user	0m0.169s
  sys	0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:

  $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
  374798628
  real	0m0.429s
  user	0m0.356s
  sys	0m0.072s

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/rev-list-options.txt |  9 ++++
 builtin/rev-list.c                 | 46 +++++++++++++++++
 pack-bitmap.c                      | 81 ++++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 +
 t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
 5 files changed, 189 insertions(+)
 create mode 100755 t/t6114-rev-list-du.sh

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 96cc89d157..1238bfd915 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -227,6 +227,15 @@ ifdef::git-rev-list[]
 	test the exit status to see if a range of objects is fully
 	connected (or not).  It is faster than redirecting stdout
 	to `/dev/null` as the output does not have to be formatted.
+
+--disk-usage::
+	Suppress normal output; instead, print the sum of the bytes used
+	for on-disk storage by the selected commits or objects. This is
+	equivalent to piping the output into `git cat-file
+	--batch-check='%(objectsize:disk)'`, except that it runs much
+	faster (especially with `--use-bitmap-index`). See the `CAVEATS`
+	section in linkgit:git-cat-file[1] for the limitations of what
+	"on-disk storage" means.
 endif::git-rev-list[]
 
 --cherry-mark::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 25c6c3b38d..b4d8ea0a35 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -80,6 +80,19 @@ static int arg_show_object_names = 1;
 
 #define DEFAULT_OIDSET_SIZE     (16*1024)
 
+static int show_disk_usage;
+static off_t total_disk_usage;
+
+static off_t get_object_disk_usage(struct object *obj)
+{
+	off_t size;
+	struct object_info oi = OBJECT_INFO_INIT;
+	oi.disk_sizep = &size;
+	if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
+		die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid));
+	return size;
+}
+
 static void finish_commit(struct commit *commit);
 static void show_commit(struct commit *commit, void *data)
 {
@@ -88,6 +101,9 @@ static void show_commit(struct commit *commit, void *data)
 
 	display_progress(progress, ++progress_counter);
 
+	if (show_disk_usage)
+		total_disk_usage += get_object_disk_usage(&commit->object);
+
 	if (info->flags & REV_LIST_QUIET) {
 		finish_commit(commit);
 		return;
@@ -258,6 +274,8 @@ static void show_object(struct object *obj, const char *name, void *cb_data)
 	if (finish_object(obj, name, cb_data))
 		return;
 	display_progress(progress, ++progress_counter);
+	if (show_disk_usage)
+		total_disk_usage += get_object_disk_usage(obj);
 	if (info->flags & REV_LIST_QUIET)
 		return;
 
@@ -452,6 +470,23 @@ static int try_bitmap_traversal(struct rev_info *revs,
 	return 0;
 }
 
+static int try_bitmap_disk_usage(struct rev_info *revs,
+				 struct list_objects_filter_options *filter)
+{
+	struct bitmap_index *bitmap_git;
+
+	if (!show_disk_usage)
+		return -1;
+
+	bitmap_git = prepare_bitmap_walk(revs, filter);
+	if (!bitmap_git)
+		return -1;
+
+	printf("%"PRIuMAX"\n",
+	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
+	return 0;
+}
+
 int cmd_rev_list(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
@@ -584,6 +619,12 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			continue;
 		}
 
+		if (!strcmp(arg, "--disk-usage")) {
+			show_disk_usage = 1;
+			info.flags |= REV_LIST_QUIET;
+			continue;
+		}
+
 		usage(rev_list_usage);
 
 	}
@@ -626,6 +667,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (use_bitmap_index) {
 		if (!try_bitmap_count(&revs, &filter_options))
 			return 0;
+		if (!try_bitmap_disk_usage(&revs, &filter_options))
+			return 0;
 		if (!try_bitmap_traversal(&revs, &filter_options))
 			return 0;
 	}
@@ -690,5 +733,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			printf("%d\n", revs.count_left + revs.count_right);
 	}
 
+	if (show_disk_usage)
+		printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage);
+
 	return 0;
 }
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 60fe20fb87..1f69b5fa85 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -1430,3 +1430,84 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 	return bitmap_git &&
 		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
 }
+
+static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
+				     enum object_type object_type)
+{
+	struct bitmap *result = bitmap_git->result;
+	struct packed_git *pack = bitmap_git->pack;
+	off_t total = 0;
+	struct ewah_iterator it;
+	eword_t filter;
+	size_t i;
+
+	init_type_iterator(&it, bitmap_git, object_type);
+	for (i = 0; i < result->word_alloc &&
+			ewah_iterator_next(&filter, &it); i++) {
+		eword_t word = result->words[i] & filter;
+		size_t base = (i * BITS_IN_EWORD);
+		unsigned offset;
+
+		if (!word)
+			continue;
+
+		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+			size_t pos;
+
+			if ((word >> offset) == 0)
+				break;
+
+			offset += ewah_bit_ctz64(word >> offset);
+			pos = base + offset;
+			total += pack_pos_to_offset(pack, pos + 1) -
+				 pack_pos_to_offset(pack, pos);
+		}
+	}
+
+	return total;
+}
+
+static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
+{
+	struct bitmap *result = bitmap_git->result;
+	struct packed_git *pack = bitmap_git->pack;
+	struct eindex *eindex = &bitmap_git->ext_index;
+	off_t total = 0;
+	struct object_info oi = OBJECT_INFO_INIT;
+	off_t object_size;
+	size_t i;
+
+	oi.disk_sizep = &object_size;
+
+	for (i = 0; i < eindex->count; i++) {
+		struct object *obj = eindex->objects[i];
+
+		if (!bitmap_get(result, pack->num_objects + i))
+			continue;
+
+		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
+			die(_("unable to get disk usage of %s"),
+			    oid_to_hex(&obj->oid));
+
+		total += object_size;
+	}
+	return total;
+}
+
+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
+				 struct rev_info *revs)
+{
+	off_t total = 0;
+
+	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
+	if (revs->tree_objects)
+		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
+	if (revs->blob_objects)
+		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
+	if (revs->tag_objects)
+		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
+
+	total += get_disk_usage_for_extended(bitmap_git);
+
+	return total;
+}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 25dfcf5615..36d99930d8 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -68,6 +68,8 @@ int bitmap_walk_contains(struct bitmap_index *,
  */
 int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
 
+off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *);
+
 void bitmap_writer_show_progress(int show);
 void bitmap_writer_set_checksum(unsigned char *sha1);
 void bitmap_writer_build_type_index(struct packing_data *to_pack,
diff --git a/t/t6114-rev-list-du.sh b/t/t6114-rev-list-du.sh
new file mode 100755
index 0000000000..b4aef32b71
--- /dev/null
+++ b/t/t6114-rev-list-du.sh
@@ -0,0 +1,51 @@
+#!/bin/sh
+
+test_description='basic tests of rev-list --disk-usage'
+. ./test-lib.sh
+
+# we want a mix of reachable and unreachable, as well as
+# objects in the bitmapped pack and some outside of it
+test_expect_success 'set up repository' '
+	test_commit --no-tag one &&
+	test_commit --no-tag two &&
+	git repack -adb &&
+	git reset --hard HEAD^ &&
+	test_commit --no-tag three &&
+	test_commit --no-tag four &&
+	git reset --hard HEAD^
+'
+
+# We don't want to hardcode sizes, because they depend on the exact details of
+# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
+# machinery works and compare the --disk-usage output to that.
+disk_usage_slow () {
+	git rev-list --no-object-names "$@" |
+	git cat-file --batch-check="%(objectsize:disk)" |
+	perl -lne '$total += $_; END { print $total}'
+}
+
+# check behavior with given rev-list options; note that
+# whitespace is not preserved in args
+check_du () {
+	args=$*
+
+	test_expect_success "generate expected size ($args)" "
+		disk_usage_slow $args >expect
+	"
+
+	test_expect_success "rev-list --disk-usage without bitmaps ($args)" "
+		git rev-list --disk-usage $args >actual &&
+		test_cmp expect actual
+	"
+
+	test_expect_success "rev-list --disk-usage with bitmaps ($args)" "
+		git rev-list --disk-usage --use-bitmap-index $args >actual &&
+		test_cmp expect actual
+	"
+}
+
+check_du HEAD
+check_du --objects HEAD
+check_du --objects HEAD^..HEAD
+
+test_done
-- 
2.30.1.887.ge7d57fcab0

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
  2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
  2021-02-09 10:53   ` [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
@ 2021-02-09 11:09   ` Jeff King
  2021-02-09 21:14     ` Junio C Hamano
  2021-02-10  0:44   ` Junio C Hamano
  3 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2021-02-09 11:09 UTC (permalink / raw)
  To: git; +Cc: Kyle Meyer, Eric Sunshine, Taylor Blau

On Tue, Feb 09, 2021 at 05:52:28AM -0500, Jeff King wrote:

> This fixes the minor bits mentioned in review for v1, but the big change
> is that "--disk-usage" no longer implies "--objects". I think you
> generally would want to use it with that option, but it really seemed to
> violate the principle of least surprise for the user.
> 
> That requires handling each object type independently, but the code for
> that turned out to be not too bad (and is modeled after the similar
> logic in traverse_bitmap_commit_list()). I was slightly concerned that
> it would slow things down to walk over the bitmap multiple times, but it
> doesn't seem to make much of a difference in practice.

You might reasonably ask whether we could just directly use
traverse_bitmap_commit_list(), since after all it takes a callback. And
indeed, doing so reduces the size of the code (see the patch below).

But it's shockingly slower! It takes consistently 2-3x longer to produce
the same answer on linux.git with bitmaps. The problem is that we give
more information to the callback than the disk-usage computation needs.

In particular, finding nth_packed_object_id() is a big killer. Which
kind of makes sense. We memcpy() the oids out of the .idx file into a
"struct object_id" on the stack. And linux.git has ~200MB of oids to
copy (and I'm sure doing it 20 bytes at a time isn't quite optimal).
That adds several hundred milliseconds. Not a lot in absolute terms, but
we're able to do the whole computation in ~200ms to start with, so it's
relatively a big change.

This could be solved by having a more "bare" callback that just passes
the pack position, and not the oid (and then the callback is responsible
for looking it up if they care). But it gets pretty awkward when we have
to complete the bitmap traversal with non-bitmap objects (for those we
_do_ have an oid to pass, but no pack position). I think the
implementation in my 2/2 isn't so bad in comparison (and we can always
swap it out later; these are all just implementation details).

I did find it a bit interesting, though. When we moved to "struct
object_id" and started copying bits out with nth_packed_object_id(),
rather than just pointing to the mmap'd .idx bytes, we wondered whether
there would be any measurable difference. Likewise when we extended it
to handle the oid size changing at runtime. At the time, I wasn't able
to measure any impact for real operations, but I guess we just needed a
case that highlighted it more.

I don't know that it's really worth digging into that much, though it's
quite possible there may be some easy wins by optimizing those memcpy
calls. E.g., I'm not sure if the compiler ends up inlining them or not.
If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
"32", we could perhaps help it along with specialized versions of
hashcpy(). If somebody does want to play with it, this patch may make a
good testbed. :)

-- >8 --
 builtin/pack-objects.c |  3 +-
 builtin/rev-list.c     | 40 +++++++++++------------
 pack-bitmap.c          | 86 ++------------------------------------------------
 pack-bitmap.h          |  1 +
 reachable.c            |  1 +
 5 files changed, 25 insertions(+), 106 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 13cde5896a..33f7d19eb3 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1388,7 +1388,8 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
 static int add_object_entry_from_bitmap(const struct object_id *oid,
 					enum object_type type,
 					int flags, uint32_t name_hash,
-					struct packed_git *pack, off_t offset)
+					struct packed_git *pack,
+					uint32_t pack_pos, off_t offset)
 {
 	display_progress(progress_state, ++nr_seen);
 
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index b4d8ea0a35..cc96b4c854 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -83,13 +83,13 @@ static int arg_show_object_names = 1;
 static int show_disk_usage;
 static off_t total_disk_usage;
 
-static off_t get_object_disk_usage(struct object *obj)
+static off_t get_object_disk_usage(const struct object_id *oid)
 {
 	off_t size;
 	struct object_info oi = OBJECT_INFO_INIT;
 	oi.disk_sizep = &size;
-	if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-		die(_("unable to get disk usage of %s"), oid_to_hex(&obj->oid));
+	if (oid_object_info_extended(the_repository, oid, &oi, 0) < 0)
+		die(_("unable to get disk usage of %s"), oid_to_hex(oid));
 	return size;
 }
 
@@ -102,7 +102,7 @@ static void show_commit(struct commit *commit, void *data)
 	display_progress(progress, ++progress_counter);
 
 	if (show_disk_usage)
-		total_disk_usage += get_object_disk_usage(&commit->object);
+		total_disk_usage += get_object_disk_usage(&commit->object.oid);
 
 	if (info->flags & REV_LIST_QUIET) {
 		finish_commit(commit);
@@ -275,7 +275,7 @@ static void show_object(struct object *obj, const char *name, void *cb_data)
 		return;
 	display_progress(progress, ++progress_counter);
 	if (show_disk_usage)
-		total_disk_usage += get_object_disk_usage(obj);
+		total_disk_usage += get_object_disk_usage(&obj->oid);
 	if (info->flags & REV_LIST_QUIET)
 		return;
 
@@ -363,8 +363,19 @@ static int show_object_fast(
 	int exclude,
 	uint32_t name_hash,
 	struct packed_git *found_pack,
+	uint32_t pack_pos,
 	off_t found_offset)
 {
+	if (show_disk_usage) {
+		if (found_pack) {
+			total_disk_usage +=
+				pack_pos_to_offset(found_pack, pack_pos + 1) -
+				found_offset;
+		} else {
+			total_disk_usage += get_object_disk_usage(oid);
+		}
+		return 1;
+	}
 	fprintf(stdout, "%s\n", oid_to_hex(oid));
 	return 1;
 }
@@ -467,23 +478,10 @@ static int try_bitmap_traversal(struct rev_info *revs,
 
 	traverse_bitmap_commit_list(bitmap_git, revs, &show_object_fast);
 	free_bitmap_index(bitmap_git);
-	return 0;
-}
-
-static int try_bitmap_disk_usage(struct rev_info *revs,
-				 struct list_objects_filter_options *filter)
-{
-	struct bitmap_index *bitmap_git;
 
-	if (!show_disk_usage)
-		return -1;
-
-	bitmap_git = prepare_bitmap_walk(revs, filter);
-	if (!bitmap_git)
-		return -1;
+	if (show_disk_usage)
+		printf("%"PRIuMAX"\n", (uintmax_t)total_disk_usage);
 
-	printf("%"PRIuMAX"\n",
-	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
 	return 0;
 }
 
@@ -667,8 +665,6 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (use_bitmap_index) {
 		if (!try_bitmap_count(&revs, &filter_options))
 			return 0;
-		if (!try_bitmap_disk_usage(&revs, &filter_options))
-			return 0;
 		if (!try_bitmap_traversal(&revs, &filter_options))
 			return 0;
 	}
diff --git a/pack-bitmap.c b/pack-bitmap.c
index 1f69b5fa85..f118a365e1 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -655,7 +655,7 @@ static void show_extended_objects(struct bitmap_index *bitmap_git,
 		    (obj->type == OBJ_TAG && !revs->tag_objects))
 			continue;
 
-		show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0);
+		show_reach(&obj->oid, obj->type, 0, eindex->hashes[i], NULL, 0, 0);
 	}
 }
 
@@ -726,7 +726,8 @@ static void show_objects_for_type(
 			if (bitmap_git->hashes)
 				hash = get_be32(bitmap_git->hashes + index_pos);
 
-			show_reach(&oid, object_type, 0, hash, bitmap_git->pack, ofs);
+			show_reach(&oid, object_type, 0, hash,
+				   bitmap_git->pack, pos + offset, ofs);
 		}
 	}
 }
@@ -1430,84 +1431,3 @@ int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_git,
 	return bitmap_git &&
 		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
 }
-
-static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
-				     enum object_type object_type)
-{
-	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
-	off_t total = 0;
-	struct ewah_iterator it;
-	eword_t filter;
-	size_t i;
-
-	init_type_iterator(&it, bitmap_git, object_type);
-	for (i = 0; i < result->word_alloc &&
-			ewah_iterator_next(&filter, &it); i++) {
-		eword_t word = result->words[i] & filter;
-		size_t base = (i * BITS_IN_EWORD);
-		unsigned offset;
-
-		if (!word)
-			continue;
-
-		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
-			size_t pos;
-
-			if ((word >> offset) == 0)
-				break;
-
-			offset += ewah_bit_ctz64(word >> offset);
-			pos = base + offset;
-			total += pack_pos_to_offset(pack, pos + 1) -
-				 pack_pos_to_offset(pack, pos);
-		}
-	}
-
-	return total;
-}
-
-static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
-{
-	struct bitmap *result = bitmap_git->result;
-	struct packed_git *pack = bitmap_git->pack;
-	struct eindex *eindex = &bitmap_git->ext_index;
-	off_t total = 0;
-	struct object_info oi = OBJECT_INFO_INIT;
-	off_t object_size;
-	size_t i;
-
-	oi.disk_sizep = &object_size;
-
-	for (i = 0; i < eindex->count; i++) {
-		struct object *obj = eindex->objects[i];
-
-		if (!bitmap_get(result, pack->num_objects + i))
-			continue;
-
-		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-			die(_("unable to get disk usage of %s"),
-			    oid_to_hex(&obj->oid));
-
-		total += object_size;
-	}
-	return total;
-}
-
-off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
-				 struct rev_info *revs)
-{
-	off_t total = 0;
-
-	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
-	if (revs->tree_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
-	if (revs->blob_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
-	if (revs->tag_objects)
-		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
-
-	total += get_disk_usage_for_extended(bitmap_git);
-
-	return total;
-}
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 36d99930d8..ba71a9f5c6 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -38,6 +38,7 @@ typedef int (*show_reachable_fn)(
 	int flags,
 	uint32_t hash,
 	struct packed_git *found_pack,
+	uint32_t pack_pos,
 	off_t found_offset);
 
 struct bitmap_index;
diff --git a/reachable.c b/reachable.c
index 77a60c70a5..79ebe8f940 100644
--- a/reachable.c
+++ b/reachable.c
@@ -182,6 +182,7 @@ static int mark_object_seen(const struct object_id *oid,
 			     int exclude,
 			     uint32_t name_hash,
 			     struct packed_git *found_pack,
+			     uint32_t pack_pos,
 			     off_t found_offset)
 {
 	struct object *obj = lookup_object_by_type(the_repository, oid, type);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
@ 2021-02-09 21:14     ` Junio C Hamano
  2021-02-10  9:38       ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2021-02-09 21:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

Jeff King <peff@peff.net> writes:

> I don't know that it's really worth digging into that much, though it's
> quite possible there may be some easy wins by optimizing those memcpy
> calls. E.g., I'm not sure if the compiler ends up inlining them or not.
> If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
> "32", we could perhaps help it along with specialized versions of
> hashcpy(). If somebody does want to play with it, this patch may make a
> good testbed. :)

Yuck.  That reminds me of the adventure Shawn he made in the Java
land benchmarking which one among int[5], int a,b,c,d,e, char[40] is
the most efficient way (both storage-wise and performance-wise) to
store SHA-1 hash.  I wish we didn't have to go there.

It indeed is an interesting, despite a bit sad, observation that
even with a good precomputed information, an overly heavy interface
can kill potential performance benefit.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-09 10:52 ` [PATCH v2] " Jeff King
                     ` (2 preceding siblings ...)
  2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
@ 2021-02-10  0:44   ` Junio C Hamano
  2021-02-10  1:49     ` Taylor Blau
  2021-02-10 10:01     ` Jeff King
  3 siblings, 2 replies; 30+ messages in thread
From: Junio C Hamano @ 2021-02-10  0:44 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

Jeff King <peff@peff.net> writes:

> Here's a re-roll of my series to add "rev-list --disk-usage", for
> counting up object storage used for various slices of history.
> ...
>  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
>  t/test-lib-functions.sh            |  9 +++-
>  7 files changed, 199 insertions(+), 8 deletions(-)
>  create mode 100755 t/t6114-rev-list-du.sh

I relocated 6114 to 6115 to avoid tests sharing the same number.

I am getting these numbers from random ranges I am interested in,
but do they say what I think they mean?  Was the development effort
went into the v2.28 release almost half the size of v2.29, and have
we already done about the same amont of work for this cycle?

: gitster git.git/seen; rungit seen rev-list --disk-usage master..next
83105
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
183463
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
231640
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
334355
: gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
182298

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10  0:44   ` Junio C Hamano
@ 2021-02-10  1:49     ` Taylor Blau
  2021-02-10 10:01     ` Jeff King
  1 sibling, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2021-02-10  1:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, Kyle Meyer, Eric Sunshine

On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh            |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
>
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks.

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean?  Was the development effort
> went into the v2.28 release almost half the size of v2.29, and have
> we already done about the same amont of work for this cycle?
>
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

I think you are surprised by these numbers because you're only counting
disk usage of commit objects in those ranges. v1 of this series implied
--objects by default, but this changed in v2 due to my suggestion.

Passing --objects to count the disk-usage of all objects in those ranges
gives more reasonable numbers (and match my rough guesses, i.e., that
2.29 was busier than 2.30, and so on):

    $ for range in origin/master..origin/next v2.30.0..origin/master \
        v2.29.0..v2.30.0 v2.28.0..v2.29.0 v2.27.0..v2.28.0
    do
      printf "%s %d vs. %d\n" $range \
        "$(git rev-list --objects --no-object-names $range |
           git cat-file --batch-check='%(objectsize:disk)' |
           paste -sd+ | bc)" \
        "$(git.seen rev-list --objects --disk-usage $range)"
    done
    origin/master..origin/next 671380 vs. 671380
    v2.30.0..origin/master 1618815 vs. 1618815
    v2.29.0..v2.30.0 3308295 vs. 3308295
    v2.28.0..v2.29.0 4080789 vs. 4080789
    v2.27.0..v2.28.0 2846196 vs. 2846196

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-09 21:14     ` Junio C Hamano
@ 2021-02-10  9:38       ` Jeff King
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-02-10  9:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

On Tue, Feb 09, 2021 at 01:14:17PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I don't know that it's really worth digging into that much, though it's
> > quite possible there may be some easy wins by optimizing those memcpy
> > calls. E.g., I'm not sure if the compiler ends up inlining them or not.
> > If it doesn't realize that the_hash_algo->rawsz is only ever "20" or
> > "32", we could perhaps help it along with specialized versions of
> > hashcpy(). If somebody does want to play with it, this patch may make a
> > good testbed. :)
> 
> Yuck.  That reminds me of the adventure Shawn he made in the Java
> land benchmarking which one among int[5], int a,b,c,d,e, char[40] is
> the most efficient way (both storage-wise and performance-wise) to
> store SHA-1 hash.  I wish we didn't have to go there.
> 
> It indeed is an interesting, despite a bit sad, observation that
> even with a good precomputed information, an overly heavy interface
> can kill potential performance benefit.

Agreed. But I'm hoping we can continue to mostly ignore it. I suspect
this finding means we are wasting a few hundred milliseconds copying
oids around during a clone of torvalds/linux. But overall that is a
pretty heavy-weight operation, and I doubt anybody really notices. And
for something as lightweight as --disk-usage, it was easy enough to
optimize around it.

It probably does have a more measurable impact in something like:

  git rev-list --use-bitmap-index --objects HEAD >/dev/null

where we really do need those oids, and the extra copying might add up.
I guess if somebody is interested in micro-optimizing, that is probably
a good command to look at.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10  0:44   ` Junio C Hamano
  2021-02-10  1:49     ` Taylor Blau
@ 2021-02-10 10:01     ` Jeff King
  2021-02-10 16:31       ` Junio C Hamano
  1 sibling, 1 reply; 30+ messages in thread
From: Jeff King @ 2021-02-10 10:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh            |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
> 
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks. I wondered why I didn't notice, but it's because the other 6114
also just made it into "seen". :)

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean?  Was the development effort
> went into the v2.28 release almost half the size of v2.29, and have
> we already done about the same amont of work for this cycle?
> 
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

As Taylor mentioned, this is only hitting the commits. So you might as
well just be looking at commit counts as a measure of work, I'd think
(and indeed v2.28 has about half as many commits as v2.29!).

Adding --objects gets you a rougher estimate of "bytes changed", which
helps accounts for commits of different sizes. But there I think you'd
do just as well to look at the actual number of lines changed with "git
diff --numstat".

I'd expect the number of on-disk bytes to _roughly_ correspond to the
size of the changes. But you are working against the heuristics of the
delta chains there. It may well be that we would store a base object in
the v2.28..v2.29 range, and a delta against it in v2.27..v2.28. And that
would attribute most of the bytes to v2.29, even though they should be
shared roughly with v2.28.

I'm sure one could devise a scheme for "sharing" the bytes from a delta
family across all of its objects. That might even be worth implementing
on top (I don't even think it would be too expensive; you just have to
collect the delta chains for any objects you're reporting, and then
average the total size among a chain).

But in practice, we've found this kind of naive --disk-usage useful for
answering questions like:

  - do I need all of these objects? Comparing "rev-list --disk-usage
    --objects --all", "rev-list --disk-usage --objects --all --reflog",
    and "du objects/pack/*.pack" will tell you if a prune/repack might
    help, and whether expiring reflogs makes a difference.

  - the size of the shared alternates repo for a set of forks has
    jumped. Comparing "rev-list --disk-usage --objects --remotes=$base
    --not --remotes=$fork" will tell you what's reachable from a fork
    but not from the base (we use "refs/remotes/$id/*" to keep track of
    fork refs in our alternates repo). This can be junk like somebody
    forking git/git and then uploading a bunch of pirated video files.
    :)

  - likewise, the size of cloning a single repo may jump. Comparing
    "rev-list --disk-usage --objects HEAD..$branch" for each branch
    might show that one branch is an outlier (e.g., because somebody
    accidentally committed a bunch of build artifacts).

In those kinds of cases, it's not usually "oh, this version is twice as
big as this other one". It's more like "wow, this branch is 100x as big
as the other branches", and little decisions like delta direction are
just noise. I imagine that in those cases the uncompressed object sizes
would probably produce similar patterns and answers. But it's actually
faster to produce the on-disk sizes. :)

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10 10:01     ` Jeff King
@ 2021-02-10 16:31       ` Junio C Hamano
  2021-02-10 20:38         ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2021-02-10 16:31 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

Jeff King <peff@peff.net> writes:

> But in practice, we've found this kind of naive --disk-usage useful for
> answering questions like:
>
>   - do I need all of these objects? Comparing "rev-list --disk-usage
>     --objects --all", "rev-list --disk-usage --objects --all --reflog",
>     and "du objects/pack/*.pack" will tell you if a prune/repack might
>     help, and whether expiring reflogs makes a difference.
>
>   - the size of the shared alternates repo for a set of forks has
>     jumped. Comparing "rev-list --disk-usage --objects --remotes=$base
>     --not --remotes=$fork" will tell you what's reachable from a fork
>     but not from the base (we use "refs/remotes/$id/*" to keep track of
>     fork refs in our alternates repo). This can be junk like somebody
>     forking git/git and then uploading a bunch of pirated video files.
>     :)
>
>   - likewise, the size of cloning a single repo may jump. Comparing
>     "rev-list --disk-usage --objects HEAD..$branch" for each branch
>     might show that one branch is an outlier (e.g., because somebody
>     accidentally committed a bunch of build artifacts).
>
> In those kinds of cases, it's not usually "oh, this version is twice as
> big as this other one". It's more like "wow, this branch is 100x as big
> as the other branches", and little decisions like delta direction are
> just noise. I imagine that in those cases the uncompressed object sizes
> would probably produce similar patterns and answers. But it's actually
> faster to produce the on-disk sizes. :)

Thanks.

I kind of feel sad to have a nice write-up like this only in the
list archive.  Is there a section in our documentation set to keep
collection of such a real-life use cases?  Perhaps the examples
section of manpages is the closest thing, but it looks a bit too
narrowly scoped for the example section of "rev-list" manpage.

THanks.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10 16:31       ` Junio C Hamano
@ 2021-02-10 20:38         ` Jeff King
  2021-02-10 23:15           ` Taylor Blau
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2021-02-10 20:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kyle Meyer, Eric Sunshine, Taylor Blau

On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > But in practice, we've found this kind of naive --disk-usage useful for
> > answering questions like:
> [...]
>
> I kind of feel sad to have a nice write-up like this only in the
> list archive.  Is there a section in our documentation set to keep
> collection of such a real-life use cases?  Perhaps the examples
> section of manpages is the closest thing, but it looks a bit too
> narrowly scoped for the example section of "rev-list" manpage.

Agreed on both counts. If this gets put into a release, I suspect Taylor
would cover it in a release blog post. That is not quite the same thing
as having it in the documentation, but it may provide more search engine
boost than the list archive. I dunno.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10 20:38         ` Jeff King
@ 2021-02-10 23:15           ` Taylor Blau
  2021-02-11 11:00             ` Jeff King
  0 siblings, 1 reply; 30+ messages in thread
From: Taylor Blau @ 2021-02-10 23:15 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Kyle Meyer, Eric Sunshine, Taylor Blau

On Wed, Feb 10, 2021 at 03:38:58PM -0500, Jeff King wrote:
> On Wed, Feb 10, 2021 at 08:31:08AM -0800, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > > But in practice, we've found this kind of naive --disk-usage useful for
> > > answering questions like:
> > [...]
> >
> > I kind of feel sad to have a nice write-up like this only in the
> > list archive.  Is there a section in our documentation set to keep
> > collection of such a real-life use cases?  Perhaps the examples
> > section of manpages is the closest thing, but it looks a bit too
> > narrowly scoped for the example section of "rev-list" manpage.
>
> Agreed on both counts. If this gets put into a release, I suspect Taylor
> would cover it in a release blog post. That is not quite the same thing
> as having it in the documentation, but it may provide more search engine
> boost than the list archive. I dunno.

Yeah, this is the perfect sort of thing for those blog posts.

But it makes sense to include some of these examples in our own
documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
section, but maybe it should.

> -Peff

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-10 23:15           ` Taylor Blau
@ 2021-02-11 11:00             ` Jeff King
  2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff King @ 2021-02-11 11:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, Kyle Meyer, Eric Sunshine

On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote:

> > > I kind of feel sad to have a nice write-up like this only in the
> > > list archive.  Is there a section in our documentation set to keep
> > > collection of such a real-life use cases?  Perhaps the examples
> > > section of manpages is the closest thing, but it looks a bit too
> > > narrowly scoped for the example section of "rev-list" manpage.
> >
> > Agreed on both counts. If this gets put into a release, I suspect Taylor
> > would cover it in a release blog post. That is not quite the same thing
> > as having it in the documentation, but it may provide more search engine
> > boost than the list archive. I dunno.
> 
> Yeah, this is the perfect sort of thing for those blog posts.
> 
> But it makes sense to include some of these examples in our own
> documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
> section, but maybe it should.

I think this is the "narrowly scoped" bit from Junio's response above.
It would be a bit weird to have an examples section for rev-list that
only mentions this rather obscure feature.

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-11 11:00             ` Jeff King
@ 2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
  2021-02-11 17:57                 ` Junio C Hamano
  2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
  0 siblings, 2 replies; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-11 12:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, git, Kyle Meyer, Eric Sunshine


On Thu, Feb 11 2021, Jeff King wrote:

> On Wed, Feb 10, 2021 at 06:15:16PM -0500, Taylor Blau wrote:
>
>> > > I kind of feel sad to have a nice write-up like this only in the
>> > > list archive.  Is there a section in our documentation set to keep
>> > > collection of such a real-life use cases?  Perhaps the examples
>> > > section of manpages is the closest thing, but it looks a bit too
>> > > narrowly scoped for the example section of "rev-list" manpage.
>> >
>> > Agreed on both counts. If this gets put into a release, I suspect Taylor
>> > would cover it in a release blog post. That is not quite the same thing
>> > as having it in the documentation, but it may provide more search engine
>> > boost than the list archive. I dunno.
>> 
>> Yeah, this is the perfect sort of thing for those blog posts.
>> 
>> But it makes sense to include some of these examples in our own
>> documentation here, too. git-rev-list(1) doesn't have an EXAMPLES
>> section, but maybe it should.
>
> I think this is the "narrowly scoped" bit from Junio's response above.
> It would be a bit weird to have an examples section for rev-list that
> only mentions this rather obscure feature.

I don't think the lack of an EXAMPLES section or the relative obscurity
of the switch should preclude us from adding useful documentation.

Yes it would feel a bit out of place, but we can always have a
sub-section of EXAMPLES, and we've got to start somewhere.

In this case I don't see why it couldn't be added to OPTIONS, we've got
some very long discussion there already, and as long as there's a clear
separation in prose from an initial brief discussion of the switch and
further prose it won't be confusing for readers, they can just page past
the details.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] rev-list --disk-usage
  2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
@ 2021-02-11 17:57                 ` Junio C Hamano
  2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2021-02-11 17:57 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Taylor Blau, git, Kyle Meyer, Eric Sunshine

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> I think this is the "narrowly scoped" bit from Junio's response above.
>> It would be a bit weird to have an examples section for rev-list that
>> only mentions this rather obscure feature.
>
> I don't think the lack of an EXAMPLES section or the relative obscurity
> of the switch should preclude us from adding useful documentation.
>
> Yes it would feel a bit out of place, but we can always have a
> sub-section of EXAMPLES, and we've got to start somewhere.
>
> In this case I don't see why it couldn't be added to OPTIONS, we've got
> some very long discussion there already, and as long as there's a clear
> separation in prose from an initial brief discussion of the switch and
> further prose it won't be confusing for readers, they can just page past
> the details.

OK.

In any case, [v2] as we have it (with test number relocation) should
be good as-is, so I'd start preparing to merge it down to 'next'
soonish.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 0/2] rev-list --disk-usage example docs
  2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
  2021-02-11 17:57                 ` Junio C Hamano
@ 2021-02-17 23:31                 ` Jeff King
  2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
                                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Jeff King @ 2021-02-17 23:31 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Junio C Hamano, git, Kyle Meyer, Eric Sunshine

On Thu, Feb 11, 2021 at 01:04:26PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > I think this is the "narrowly scoped" bit from Junio's response above.
> > It would be a bit weird to have an examples section for rev-list that
> > only mentions this rather obscure feature.
> 
> I don't think the lack of an EXAMPLES section or the relative obscurity
> of the switch should preclude us from adding useful documentation.
> 
> Yes it would feel a bit out of place, but we can always have a
> sub-section of EXAMPLES, and we've got to start somewhere.

Fair enough. Here are some patches (to go on top of jk/rev-list-disk-usage,
though obviously the first one could be applied independently).

> In this case I don't see why it couldn't be added to OPTIONS, we've got
> some very long discussion there already, and as long as there's a clear
> separation in prose from an initial brief discussion of the switch and
> further prose it won't be confusing for readers, they can just page past
> the details.

It's already big and scary enough that I prefer starting an EXAMPLES
section. :)

By the way, there's one other finishing touch we might consider:
enabling --use-bitmap-index automatically when bitmaps are present, for
requests that produce the identical answer (so _not_ a regular
traversal, because the output order and presence of pathnames are
different there). I'd prefer to do that as a separate series, though,
since there are multiple arguments that might benefit (like --count).

  [1/2]: docs/rev-list: add an examples section
  [2/2]: docs/rev-list: add some examples of --disk-usage

 Documentation/git-rev-list.txt | 93 ++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

-Peff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/2] docs/rev-list: add an examples section
  2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
@ 2021-02-17 23:34                   ` Jeff King
  2021-02-17 23:35                   ` [PATCH 2/2] docs/rev-list: add some examples of --disk-usage Jeff King
  2021-02-17 23:44                   ` [PATCH 0/2] rev-list --disk-usage example docs Taylor Blau
  2 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-02-17 23:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Junio C Hamano, git, Kyle Meyer, Eric Sunshine

We currently don't show any examples of using git-rev-list at all. Let's
add some pretty elementary examples. They likely seem obvious to anybody
who has worked with the tool for a while, but my purpose here is
two-fold:

  - they may be enlightening to people who haven't used the tool a lot
    to give a general flavor of how it is meant to be used

  - they can serve as a starting point for adding more interesting
    examples (we can do that without the basic ones, of course, but I
    think it makes sense to show off the building blocks)

This set is far from exhaustive, but again, the purpose is to be a
starting point for further additions.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm open to feedback on these. But please, if you have suggestions for
adding more, do it in the form of a patch on top. :)

 Documentation/git-rev-list.txt | 52 ++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index 5da66232dc..d7ff519b90 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -31,6 +31,58 @@ include::rev-list-options.txt[]
 
 include::pretty-formats.txt[]
 
+EXAMPLES
+--------
+
+* Print the list of commits reachable from the current branch.
++
+----------
+git rev-list HEAD
+----------
+
+* Print the list of commits on this branch, but not present in the
+  upstream branch.
++
+----------
+git rev-list @{upstream}..HEAD
+----------
+
+* Format commits with their author and commit message (see also the
+  porcelain linkgit:git-log[1]).
++
+----------
+git rev-list --format=medium HEAD
+----------
+
+* Format commits along with their diffs (see also the porcelain
+  linkgit:git-log[1], which can do this in a single process).
++
+----------
+git rev-list HEAD |
+git diff-tree --stdin --format=medium -p
+----------
+
+* Print the list of commits on the current branch that touched any
+  file in the `Documentation` directory.
++
+----------
+git rev-list HEAD -- Documentation/
+----------
+
+* Print the list of commits authored by you in the past year, on
+  any branch, tag, or other ref.
++
+----------
+git rev-list --author=you@example.com --since=1.year.ago --all
+----------
+
+* Print the list of objects reachable from the current branch (i.e., all
+  commits and the blobs and trees they contain).
++
+----------
+git rev-list --objects HEAD
+----------
+
 GIT
 ---
 Part of the linkgit:git[1] suite
-- 
2.30.1.989.g5e01c2f281


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/2] docs/rev-list: add some examples of --disk-usage
  2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
  2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
@ 2021-02-17 23:35                   ` Jeff King
  2021-02-17 23:44                   ` [PATCH 0/2] rev-list --disk-usage example docs Taylor Blau
  2 siblings, 0 replies; 30+ messages in thread
From: Jeff King @ 2021-02-17 23:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, Junio C Hamano, git, Kyle Meyer, Eric Sunshine

It's not immediately obvious why --disk-usage might be a useful thing.
These examples show off a few of the real-world cases I've used it for.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-rev-list.txt | 41 ++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index d7ff519b90..20bb8e8217 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -83,6 +83,47 @@ git rev-list --author=you@example.com --since=1.year.ago --all
 git rev-list --objects HEAD
 ----------
 
+* Compare the disk size of all reachable objects, versus those
+  reachable from reflogs, versus the total packed size. This can tell
+  you whether running `git repack -ad` might reduce the repository size
+  (by dropping unreachable objects), and whether expiring reflogs might
+  help.
++
+----------
+# reachable objects
+git rev-list --disk-usage --objects --all
+# plus reflogs
+git rev-list --disk-usage --objects --all --reflog
+# total disk size used
+du -c .git/objects/pack/*.pack .git/objects/??/*
+# alternative to du: add up "size" and "size-pack" fields
+git count-objects -v
+----------
+
+* Report the disk size of each branch, not including objects used by the
+  current branch. This can find outliers that are contributing to a
+  bloated repository size (e.g., because somebody accidentally committed
+  large build artifacts).
++
+----------
+git for-each-ref --format='%(refname)' |
+while read branch
+do
+	size=$(git rev-list --disk-usage --objects HEAD..$branch)
+	echo "$size $branch"
+done |
+sort -n
+----------
+
+* Compare the on-disk size of branches in one group of refs, excluding
+  another. If you co-mingle objects from multiple remotes in a single
+  repository, this can show which remotes are contributing to the
+  repository size (taking the size of `origin` as a baseline).
++
+----------
+git rev-list --disk-usage --objects --remotes=$suspect --not --remotes=origin
+----------
+
 GIT
 ---
 Part of the linkgit:git[1] suite
-- 
2.30.1.989.g5e01c2f281

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] rev-list --disk-usage example docs
  2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
  2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
  2021-02-17 23:35                   ` [PATCH 2/2] docs/rev-list: add some examples of --disk-usage Jeff King
@ 2021-02-17 23:44                   ` Taylor Blau
  2 siblings, 0 replies; 30+ messages in thread
From: Taylor Blau @ 2021-02-17 23:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Taylor Blau,
	Junio C Hamano, git, Kyle Meyer, Eric Sunshine

On Wed, Feb 17, 2021 at 06:31:07PM -0500, Jeff King wrote:
> It's already big and scary enough that I prefer starting an EXAMPLES
> section. :)

The patches you sent below are great. I think that it's easy to nitpick
and say "oh, you should have added this or that example, too", but I
think you gave a great set of starting examples.

I'd be happy to see this merged so that others can add more examples on
top.

> By the way, there's one other finishing touch we might consider:
> enabling --use-bitmap-index automatically when bitmaps are present, for
> requests that produce the identical answer (so _not_ a regular
> traversal, because the output order and presence of pathnames are
> different there). I'd prefer to do that as a separate series, though,
> since there are multiple arguments that might benefit (like --count).

This would be really neat. I look forward to it.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2021-02-17 23:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-27 22:11 [PATCH 0/2] rev-list --disk-usage Jeff King
2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
2021-01-27 22:48   ` Taylor Blau
2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-01-27 22:57   ` Taylor Blau
2021-01-27 23:34     ` Jeff King
2021-01-27 23:01   ` Kyle Meyer
2021-01-27 23:36     ` Jeff King
2021-01-27 23:07   ` Eric Sunshine
2021-01-27 23:39     ` Jeff King
2021-01-27 22:46 ` [PATCH 0/2] rev-list --disk-usage Taylor Blau
2021-02-09 10:52 ` [PATCH v2] " Jeff King
2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
2021-02-09 10:53   ` [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
2021-02-09 21:14     ` Junio C Hamano
2021-02-10  9:38       ` Jeff King
2021-02-10  0:44   ` Junio C Hamano
2021-02-10  1:49     ` Taylor Blau
2021-02-10 10:01     ` Jeff King
2021-02-10 16:31       ` Junio C Hamano
2021-02-10 20:38         ` Jeff King
2021-02-10 23:15           ` Taylor Blau
2021-02-11 11:00             ` Jeff King
2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
2021-02-11 17:57                 ` Junio C Hamano
2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
2021-02-17 23:35                   ` [PATCH 2/2] docs/rev-list: add some examples of --disk-usage Jeff King
2021-02-17 23:44                   ` [PATCH 0/2] rev-list --disk-usage example docs Taylor Blau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).