git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Kyle Meyer <kyle@kyleam.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH v2] rev-list --disk-usage
Date: Wed, 10 Feb 2021 05:01:19 -0500	[thread overview]
Message-ID: <YCOu70m5SKU7L4CS@coredump.intra.peff.net> (raw)
In-Reply-To: <xmqqh7mkycno.fsf@gitster.c.googlers.com>

On Tue, Feb 09, 2021 at 04:44:27PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Here's a re-roll of my series to add "rev-list --disk-usage", for
> > counting up object storage used for various slices of history.
> > ...
> >  t/t6114-rev-list-du.sh             | 51 +++++++++++++++++++
> >  t/test-lib-functions.sh            |  9 +++-
> >  7 files changed, 199 insertions(+), 8 deletions(-)
> >  create mode 100755 t/t6114-rev-list-du.sh
> 
> I relocated 6114 to 6115 to avoid tests sharing the same number.

Thanks. I wondered why I didn't notice, but it's because the other 6114
also just made it into "seen". :)

> I am getting these numbers from random ranges I am interested in,
> but do they say what I think they mean?  Was the development effort
> went into the v2.28 release almost half the size of v2.29, and have
> we already done about the same amont of work for this cycle?
> 
> : gitster git.git/seen; rungit seen rev-list --disk-usage master..next
> 83105
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.30.0..master
> 183463
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.29.0..v2.30.0
> 231640
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.28.0..v2.29.0
> 334355
> : gitster git.git/seen; rungit seen rev-list --disk-usage v2.27.0..v2.28.0
> 182298

As Taylor mentioned, this is only hitting the commits. So you might as
well just be looking at commit counts as a measure of work, I'd think
(and indeed v2.28 has about half as many commits as v2.29!).

Adding --objects gets you a rougher estimate of "bytes changed", which
helps accounts for commits of different sizes. But there I think you'd
do just as well to look at the actual number of lines changed with "git
diff --numstat".

I'd expect the number of on-disk bytes to _roughly_ correspond to the
size of the changes. But you are working against the heuristics of the
delta chains there. It may well be that we would store a base object in
the v2.28..v2.29 range, and a delta against it in v2.27..v2.28. And that
would attribute most of the bytes to v2.29, even though they should be
shared roughly with v2.28.

I'm sure one could devise a scheme for "sharing" the bytes from a delta
family across all of its objects. That might even be worth implementing
on top (I don't even think it would be too expensive; you just have to
collect the delta chains for any objects you're reporting, and then
average the total size among a chain).

But in practice, we've found this kind of naive --disk-usage useful for
answering questions like:

  - do I need all of these objects? Comparing "rev-list --disk-usage
    --objects --all", "rev-list --disk-usage --objects --all --reflog",
    and "du objects/pack/*.pack" will tell you if a prune/repack might
    help, and whether expiring reflogs makes a difference.

  - the size of the shared alternates repo for a set of forks has
    jumped. Comparing "rev-list --disk-usage --objects --remotes=$base
    --not --remotes=$fork" will tell you what's reachable from a fork
    but not from the base (we use "refs/remotes/$id/*" to keep track of
    fork refs in our alternates repo). This can be junk like somebody
    forking git/git and then uploading a bunch of pirated video files.
    :)

  - likewise, the size of cloning a single repo may jump. Comparing
    "rev-list --disk-usage --objects HEAD..$branch" for each branch
    might show that one branch is an outlier (e.g., because somebody
    accidentally committed a bunch of build artifacts).

In those kinds of cases, it's not usually "oh, this version is twice as
big as this other one". It's more like "wow, this branch is 100x as big
as the other branches", and little decisions like delta direction are
just noise. I imagine that in those cases the uncompressed object sizes
would probably produce similar patterns and answers. But it's actually
faster to produce the on-disk sizes. :)

-Peff

  parent reply	other threads:[~2021-02-10 10:04 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-27 22:11 [PATCH 0/2] " Jeff King
2021-01-27 22:12 ` [PATCH 1/2] t: add --no-tag option to test_commit Jeff King
2021-01-27 22:48   ` Taylor Blau
2021-01-27 22:17 ` [PATCH 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-01-27 22:57   ` Taylor Blau
2021-01-27 23:34     ` Jeff King
2021-01-27 23:01   ` Kyle Meyer
2021-01-27 23:36     ` Jeff King
2021-01-27 23:07   ` Eric Sunshine
2021-01-27 23:39     ` Jeff King
2021-01-27 22:46 ` [PATCH 0/2] rev-list --disk-usage Taylor Blau
2021-02-09 10:52 ` [PATCH v2] " Jeff King
2021-02-09 10:52   ` [PATCH v2 1/2] t: add --no-tag option to test_commit Jeff King
2021-02-09 10:53   ` [PATCH v2 2/2] rev-list: add --disk-usage option for calculating disk usage Jeff King
2021-02-09 11:09   ` [PATCH v2] rev-list --disk-usage Jeff King
2021-02-09 21:14     ` Junio C Hamano
2021-02-10  9:38       ` Jeff King
2021-02-10  0:44   ` Junio C Hamano
2021-02-10  1:49     ` Taylor Blau
2021-02-10 10:01     ` Jeff King [this message]
2021-02-10 16:31       ` Junio C Hamano
2021-02-10 20:38         ` Jeff King
2021-02-10 23:15           ` Taylor Blau
2021-02-11 11:00             ` Jeff King
2021-02-11 12:04               ` Ævar Arnfjörð Bjarmason
2021-02-11 17:57                 ` Junio C Hamano
2021-02-17 23:31                 ` [PATCH 0/2] rev-list --disk-usage example docs Jeff King
2021-02-17 23:34                   ` [PATCH 1/2] docs/rev-list: add an examples section Jeff King
2021-02-17 23:35                   ` [PATCH 2/2] docs/rev-list: add some examples of --disk-usage Jeff King
2021-02-17 23:44                   ` [PATCH 0/2] rev-list --disk-usage example docs Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YCOu70m5SKU7L4CS@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=kyle@kyleam.com \
    --cc=me@ttaylorr.com \
    --cc=sunshine@sunshineco.com \
    --subject='Re: [PATCH v2] rev-list --disk-usage' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
on how to clone and mirror all data and code used for this inbox