Re: [RFC] Bump {diff,merge}.renameLimit ?

From: Felipe Contreras <felipe.contreras@gmail.com>
To: Elijah Newren <newren@gmail.com>, Jeff King <peff@peff.net>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Jonathan Tan" <jonathantanmy@google.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Bump {diff,merge}.renameLimit ?
Date: Mon, 12 Jul 2021 15:58:53 -0500
Message-ID: <60ecad0dadf2c_a68ed208e7@natae.notmuch>
In-Reply-To: <CABPp-BEdUmxXVCx=5pb0=LN-0YBtrEB-wngPC5vys6fjVctgaQ@mail.gmail.com>

Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:

> > > * I think the median file size is a better predictor of rename
> > >   performance than mean file size, and median file size is ~2.5x smaller
> > >   than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/deleted/renamed? Who
> > knows).
> 
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files.  In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed file sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the file sizes in the v2.6.25 tree.
> 
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
> 
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides
> pretty good first order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
> 
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
> 
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:

That is almost always the case (except in unskewed distributions where
the mean is equal to the median).

Another option instead of an opaque configuration like 'renamelimit'
--which is almost entirely arbitrary for most users--would be to have
'renamelevel'. A renamelevel of 5 would correspond to the median (the
5th decile), so that's already more meaningful than any value of
renamelimit.

A renamelevel of 9 would be the equivalent of the 9th decile, so that
would catch 90% of renames.
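
To make the level-to-decile mapping concrete, here is a minimal sketch in
Python. Nothing in it exists in git: 'rename_level_limits' is a hypothetical
helper, the sample is made up, and I'm assuming the quantity being measured
is the number of rename candidates seen per diff or merge.

```python
import statistics

def rename_level_limits(rename_counts):
    """Map hypothetical levels 1..9 to the matching decile of the
    observed rename counts (level 5 is the median)."""
    # n=10 yields the nine cut points between the ten deciles.
    cuts = statistics.quantiles(rename_counts, n=10, method="inclusive")
    return {level: cut for level, cut in enumerate(cuts, start=1)}

# Made-up sample of rename counts, for illustration only.
sample = [1, 2, 2, 3, 5, 8, 13, 40, 120, 900, 4000]
limits = rename_level_limits(sample)
```

With a table like this, 'limits[5]' is the median of the sample and
'limits[9]' is the value that covers 90% of the observed cases.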

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.
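
For reference, the quantile function of a Pareto distribution with minimum
x_min and shape alpha is Q(p) = x_min * (1 - p)^(-1/alpha). A rough sketch
of deriving the levels from it follows; the function names and the
parameter values are assumptions for illustration, not measurements, and
whether rename counts actually follow this distribution would need to be
checked.

```python
def pareto_quantile(p, x_min, alpha):
    """Value below which a fraction p of a Pareto(x_min, alpha)
    distribution falls."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    if x_min <= 0 or alpha <= 0:
        raise ValueError("x_min and alpha must be positive")
    return x_min * (1 - p) ** (-1 / alpha)

def pareto_levels(x_min, alpha):
    """Hypothetical renamelevel table: level k maps to the k-th decile."""
    return {level: pareto_quantile(level / 10, x_min, alpha)
            for level in range(1, 10)}

# Illustrative parameters only.
levels = pareto_levels(x_min=1.0, alpha=1.0)
```

With x_min = 1 and alpha = 1 the median (level 5) is 2 and level 9 is about
10, which shows how quickly the upper deciles grow in a heavy-tailed
distribution.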

-- 
Felipe Contreras