From: Felipe Contreras <felipe.contreras@gmail.com>
To: Elijah Newren <newren@gmail.com>, Jeff King <peff@peff.net>
Cc: "Git Mailing List" <git@vger.kernel.org>,
"Derrick Stolee" <stolee@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Linus Torvalds" <torvalds@linux-foundation.org>,
"Jonathan Tan" <jonathantanmy@google.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Bump {diff,merge}.renameLimit ?
Date: Mon, 12 Jul 2021 15:58:53 -0500
Message-ID: <60ecad0dadf2c_a68ed208e7@natae.notmuch>
In-Reply-To: <CABPp-BEdUmxXVCx=5pb0=LN-0YBtrEB-wngPC5vys6fjVctgaQ@mail.gmail.com>
Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:
> > > * I think the median file size is a better predictor of rename
> > > performance than mean file size, and median file size is ~2.5x smaller
> > > than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/delete/renamed? Who
> > knows).
>
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files. In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed files sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the files sizes in the v2.6.25 tree.
>
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
>
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides
> pretty good first order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
>
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
>
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:
That is almost always the case (except in symmetric distributions, where
the mean is equal to the median).

Another option, instead of an opaque configuration like 'renamelimit'
--which is almost entirely arbitrary for most users--would be to have
'renamelevel'. A renamelevel of 5 would be the median, so that's already
more meaningful than any value of renamelimit.

A renamelevel of 9 would be the equivalent of the 9th decile, so that
would catch 90% of renames.

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.
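
To illustrate how simple the decile formula is, here is a minimal sketch,
assuming the number of rename candidates follows a Pareto distribution with
scale x_min and shape alpha. The quantile function of that distribution is
Q(p) = x_min * (1 - p)^(-1/alpha). The function names and the example
parameters below are hypothetical; real values of x_min and alpha would have
to be fitted to measurements, which this thread does not provide.

```python
import math


def pareto_quantile(x_min: float, alpha: float, p: float) -> float:
    """Return the value below which a fraction p of a Pareto population falls.

    x_min is the scale (smallest possible value) and alpha is the shape;
    both are assumptions that would need to be fitted to real data.
    """
    if x_min <= 0 or alpha <= 0:
        raise ValueError("x_min and alpha must be positive")
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def rename_level_to_limit(level: int, x_min: float, alpha: float) -> int:
    """Map a hypothetical 'renamelevel' (1..9, a decile) to a count threshold.

    Level 5 is the median and level 9 is the 9th decile. The ratio
    10 / (10 - level) equals 1 / (1 - level/10), written this way to
    avoid rounding error in the subtraction.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    ratio = 10.0 / (10 - level)
    return math.ceil(x_min * ratio ** (1.0 / alpha))


if __name__ == "__main__":
    # Purely illustrative parameters, not derived from any repository.
    for level in range(1, 10):
        print(level, rename_level_to_limit(level, x_min=100.0, alpha=1.0))
```

Hard-coding the levels would amount to storing the nine values this loop
prints for whatever parameters are eventually chosen.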
--
Felipe Contreras