git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Kyle Meyer <kyle@kyleam.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Taylor Blau <me@ttaylorr.com>
Subject: [PATCH v2] rev-list --disk-usage
Date: Tue, 9 Feb 2021 05:52:28 -0500	[thread overview]
Message-ID: <YCJpbPIlSpCAKSBF@coredump.intra.peff.net> (raw)
In-Reply-To: <YBHlGPBSJC++CnPy@coredump.intra.peff.net>

Here's a re-roll of my series to add "rev-list --disk-usage", for
counting up object storage used for various slices of history.

This fixes the minor bits mentioned in review for v1, but the big change
is that "--disk-usage" no longer implies "--objects". I think you
generally would want to use it with that option, but it really seemed to
violate the principle of least surprise for the user.

That requires handling each object type independently, but the code for
that turned out to be not too bad (and is modeled after the similar
logic in traverse_bitmap_commit_list()). I was slightly concerned that
it would slow things down to walk over the bitmap multiple times, but it
doesn't seem to make much of a difference in practice.

There's a range-diff below, but it's not really worth looking at. All of
the interesting parts were rewritten completely, so you're better off to
just read patch 2 again (and patch 1 did not change at all).

  [1/2]: t: add --no-tag option to test_commit
  [2/2]: rev-list: add --disk-usage option for calculating disk usage

 Documentation/rev-list-options.txt |  9 ++++
 builtin/rev-list.c                 | 46 +++++++++++++++++
 pack-bitmap.c                      | 81 ++++++++++++++++++++++++++++++
 pack-bitmap.h                      |  2 +
 t/t4208-log-magic-pathspec.sh      |  9 +---
 t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
 t/test-lib-functions.sh            |  9 +++-
 7 files changed, 199 insertions(+), 8 deletions(-)
 create mode 100755 t/t6114-rev-list-du.sh

1:  20f8edeff1 = 1:  6365cd94bd t: add --no-tag option to test_commit
2:  64e28cb6c9 ! 2:  8a93583dee rev-list: add --disk-usage option for calculating disk usage
    @@ Commit message
         You can find that out by generating a list of objects, getting their
         sizes from cat-file, and then summing them, like:
     
    -        git rev-list --objects main..branch
    -        cut -d' ' -f1 |
    +        git rev-list --objects --no-object-names main..branch
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
     
    @@ Commit message
         torvalds/linux:
     
           [rev-list piped to cat-file, no bitmaps]
    -      $ time git rev-list --objects --all |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all |
             git cat-file --buffer --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      1455691059
    -      real  0m34.336s
    -      user  0m46.533s
    -      sys   0m2.953s
    +      1459938510
    +      real  0m29.635s
    +      user  0m38.003s
    +      sys   0m1.093s
     
           [internal, no bitmaps]
    -      $ time git rev-list --disk-usage --all
    -      1455691059
    -      real  0m32.662s
    -      user  0m32.306s
    -      sys   0m0.353s
    +      $ time git rev-list --disk-usage --objects --all
    +      1459938510
    +      real  0m31.262s
    +      user  0m30.885s
    +      sys   0m0.376s
     
    -    The wall-clock times aren't that different because of parallelism, but
    -    notice the CPU savings between the two. We saved 35% of the CPU just by
    +    Even though the wall-clock time is slightly worse due to parallelism,
    +    notice the CPU savings between the two. We saved 21% of the CPU just by
         avoiding the pipes.
     
         But the real win is with bitmaps. If we use them without the new option:
     
           [rev-list piped to cat-file, bitmaps]
    -      $ time git rev-list --objects --all --use-bitmap-index |
    -        cut -d' ' -f1 |
    +      $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
             git cat-file --batch-check='%(objectsize:disk)' |
             perl -lne '$total += $_; END { print $total }'
    -      real  0m9.954s
    -      user  0m11.234s
    -      sys   0m8.522s
    +      1459938510
    +      real  0m6.244s
    +      user  0m8.452s
    +      sys   0m0.311s
     
         then we're faster to generate the list of objects, but we still spend a
         lot of time piping and looking things up. But if we do both together:
     
           [internal, bitmaps]
    -      $ time git rev-list --disk-usage --all --use-bitmap-index
    -      1455691059
    -      real  0m0.235s
    -      user  0m0.186s
    +      $ time git rev-list --disk-usage --objects --all --use-bitmap-index
    +      1459938510
    +      real  0m0.219s
    +      user  0m0.169s
           sys   0m0.049s
     
         then we get the same answer much faster.
    @@ Commit message
         of course. But we're actually checking reachability here, so we're still
         fast when we ask for more interesting things:
     
    -      $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10
    +      $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
           374798628
           real  0m0.429s
           user  0m0.356s
    @@ Documentation/rev-list-options.txt: ifdef::git-rev-list[]
     +
     +--disk-usage::
     +	Suppress normal output; instead, print the sum of the bytes used
    -+	for on-disk storage by the selected objects. This is equivalent
    -+	to piping the output of `rev-list --objects` into
    -+	`git cat-file --batch-check='%(objectsize:disk)', except that it
    -+	runs much faster (especially with `--use-bitmap-index`). See the
    -+	`CAVEATS` section in linkgit:git-cat-file[1] for the limitations
    -+	of what "on-disk storage" means.
    ++	for on-disk storage by the selected commits or objects. This is
    ++	equivalent to piping the output into `git cat-file
    ++	--batch-check='%(objectsize:disk)'`, except that it runs much
    ++	faster (especially with `--use-bitmap-index`). See the `CAVEATS`
    ++	section in linkgit:git-cat-file[1] for the limitations of what
    ++	"on-disk storage" means.
      endif::git-rev-list[]
      
      --cherry-mark::
    @@ builtin/rev-list.c: static int try_bitmap_traversal(struct rev_info *revs,
     +		return -1;
     +
     +	printf("%"PRIuMAX"\n",
    -+	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git));
    ++	       (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
     +	return 0;
     +}
     +
    @@ builtin/rev-list.c: int cmd_rev_list(int argc, const char **argv, const char *pr
      
     +		if (!strcmp(arg, "--disk-usage")) {
     +			show_disk_usage = 1;
    -+			revs.tag_objects = 1;
    -+			revs.tree_objects = 1;
    -+			revs.blob_objects = 1;
     +			info.flags |= REV_LIST_QUIET;
     +			continue;
     +		}
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
      		bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
      }
     +
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git)
    ++static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
    ++				     enum object_type object_type)
     +{
     +	struct bitmap *result = bitmap_git->result;
     +	struct packed_git *pack = bitmap_git->pack;
    -+	struct eindex *eindex = &bitmap_git->ext_index;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    -+	off_t object_size;
     +	off_t total = 0;
    ++	struct ewah_iterator it;
    ++	eword_t filter;
     +	size_t i;
     +
    -+	oi.disk_sizep = &object_size;
    -+
    -+	for (i = 0; i < result->word_alloc; i++) {
    -+		eword_t word = result->words[i];
    ++	init_type_iterator(&it, bitmap_git, object_type);
    ++	for (i = 0; i < result->word_alloc &&
    ++			ewah_iterator_next(&filter, &it); i++) {
    ++		eword_t word = result->words[i] & filter;
     +		size_t base = (i * BITS_IN_EWORD);
     +		unsigned offset;
     +
    ++		if (!word)
    ++			continue;
    ++
     +		for (offset = 0; offset < BITS_IN_EWORD; offset++) {
     +			size_t pos;
     +
    @@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
     +
     +			offset += ewah_bit_ctz64(word >> offset);
     +			pos = base + offset;
    -+
    -+			/*
    -+			 * If it's in the pack, we can use the fast path
    -+			 * and just check the revindex. Otherwise, we
    -+			 * fall back to looking it up.
    -+			 */
    -+			if (pos < pack->num_objects) {
    -+				object_size =
    -+					pack_pos_to_offset(pack, pos + 1) -
    -+					pack_pos_to_offset(pack, pos);
    -+			} else {
    -+				struct object *obj;
    -+				obj = eindex->objects[pos - pack->num_objects];
    -+				if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    -+					die(_("unable to get disk usage of %s"),
    -+					      oid_to_hex(&obj->oid));
    -+			}
    -+
    -+			total += object_size;
    ++			total += pack_pos_to_offset(pack, pos + 1) -
    ++				 pack_pos_to_offset(pack, pos);
     +		}
     +	}
     +
     +	return total;
    ++}
    ++
    ++static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
    ++{
    ++	struct bitmap *result = bitmap_git->result;
    ++	struct packed_git *pack = bitmap_git->pack;
    ++	struct eindex *eindex = &bitmap_git->ext_index;
    ++	off_t total = 0;
    ++	struct object_info oi = OBJECT_INFO_INIT;
    ++	off_t object_size;
    ++	size_t i;
    ++
    ++	oi.disk_sizep = &object_size;
    ++
    ++	for (i = 0; i < eindex->count; i++) {
    ++		struct object *obj = eindex->objects[i];
    ++
    ++		if (!bitmap_get(result, pack->num_objects + i))
    ++			continue;
    ++
    ++		if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
    ++			die(_("unable to get disk usage of %s"),
    ++			    oid_to_hex(&obj->oid));
    ++
    ++		total += object_size;
    ++	}
    ++	return total;
    ++}
    ++
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
    ++				 struct rev_info *revs)
    ++{
    ++	off_t total = 0;
    ++
    ++	total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
    ++	if (revs->tree_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
    ++	if (revs->blob_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
    ++	if (revs->tag_objects)
    ++		total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
    ++
    ++	total += get_disk_usage_for_extended(bitmap_git);
    ++
    ++	return total;
     +}
     
      ## pack-bitmap.h ##
     @@ pack-bitmap.h: int bitmap_walk_contains(struct bitmap_index *,
       */
      int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
      
    -+off_t get_disk_usage_from_bitmap(struct bitmap_index *);
    ++off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *);
     +
      void bitmap_writer_show_progress(int show);
      void bitmap_writer_set_checksum(unsigned char *sha1);
    @@ t/t6114-rev-list-du.sh (new)
     +# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
     +# machinery works and compare the --disk-usage output to that.
     +disk_usage_slow () {
    -+	git rev-list --objects "$@" |
    -+	cut -d' ' -f1 |
    ++	git rev-list --no-object-names "$@" |
     +	git cat-file --batch-check="%(objectsize:disk)" |
     +	perl -lne '$total += $_; END { print $total}'
     +}
    @@ t/t6114-rev-list-du.sh (new)
     +}
     +
     +check_du HEAD
    -+check_du HEAD^..HEAD
    ++check_du --objects HEAD
    ++check_du --objects HEAD^..HEAD
     +
     +test_done

  parent reply	other threads:[~2021-02-09 10:55 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-27 22:11 [PATCH 0/2] " Jeff King
2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
2021-01-27 22:48   ` Taylor Blau
2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-01-27 22:57   ` Taylor Blau
2021-01-27 23:34     ` Jeff King
2021-01-27 23:01   ` Kyle Meyer
2021-01-27 23:36     ` Jeff King
2021-01-27 23:07   ` Eric Sunshine
2021-01-27 23:39     ` Jeff King
2021-01-27 22:46 ` [PATCH 0/2] rev-list --disk-usage Taylor Blau
2021-02-09 10:52 ` Jeff King [this message]
2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
2021-02-09 10:53   ` [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
2021-02-09 21:14     ` Junio C Hamano
2021-02-10  9:38       ` Jeff King
2021-02-10  0:44   ` Junio C Hamano
2021-02-10  1:49     ` Taylor Blau
2021-02-10 10:01     ` Jeff King
2021-02-10 16:31       ` Junio C Hamano
2021-02-10 20:38         ` Jeff King
2021-02-10 23:15           ` Taylor Blau
2021-02-11 11:00             ` Jeff King
2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
2021-02-11 17:57                 ` Junio C Hamano
2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
2021-02-17 23:35                   ` [PATCH 2/2] docs/rev-list: add some examples of --disk-usage Jeff King
2021-02-17 23:44                   ` [PATCH 0/2] rev-list --disk-usage example docs Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YCJpbPIlSpCAKSBF@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=kyle@kyleam.com \
    --cc=me@ttaylorr.com \
    --cc=sunshine@sunshineco.com \
    --subject='Re: [PATCH v2] rev-list --disk-usage' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).