All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Bagas Sanjaya" <bagasdotme@gmail.com>,
	"Elijah Newren" <newren@gmail.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	"Derrick Stolee" <stolee@gmail.com>, "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Elijah Newren" <newren@gmail.com>,
	"Elijah Newren" <newren@gmail.com>
Subject: [PATCH v3 4/4] Bump rename limit defaults (yet again)
Date: Thu, 15 Jul 2021 00:45:24 +0000	[thread overview]
Message-ID: <b41278b6680f8b1f59e73c4eab24595a33f2f3c5.1626309924.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.1044.v3.git.git.1626309924.gitgitgadget@gmail.com>

From: Elijah Newren <newren@gmail.com>

These were last bumped in commit 92c57e5c1d29 (bump rename limit
defaults (again), 2011-02-19), and were bumped both because processors
had gotten faster, and because people were getting ugly merges that
caused problems and reporting it to the mailing list (suggesting that
folks were willing to spend more time waiting).

Since that time:
  * Linus has continued recommending kernel folks to set
    diff.renameLimit=0 (maps to 32767, currently)
  * Folks with repositories with lots of renames were happy to set
    merge.renameLimit above 32767, once the code supported that, to
    get correct cherry-picks
  * Processors have gotten faster
  * It has been discovered that the timing methodology used last time
    probably used too large example files.

The last point is probably worth explaining a bit more:

  * The "average" file size used appears to have been average blob size
    in the linux kernel history at the time (probably v2.6.25 or
    something close to it).
  * Since bigger files are modified more frequently, such a computation
    weights towards larger files.
  * Larger files may be more likely to be modified over time, but are
    not more likely to be renamed -- the mean and median blob size
    within a tree are a bit higher than the mean and median of blob
    sizes in the history leading up to that version for the linux
    kernel.
  * The mean blob size in v2.6.25 was half the average blob size in
    history leading to that point
  * The median blob size in v2.6.25 was about 40% of the mean blob size
    in v2.6.25.
  * Since the mean blob size is more than double the median blob size,
    any file as big as the mean will not be compared to any files of
    median size or less (because they'd be more than 50% dissimilar).
  * Since it is the number of files compared that provides the O(n^2)
    behavior, median-sized files should matter more than mean-sized
    ones.

The combined effect of the above is that the file size used in past
calculations was likely about 5x too large.  Combine that with a CPU
performance improvement of ~30%, and we can increase the limits by
a factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated
time limits.

Keeping the same approximate time limit probably makes sense for
diff.renameLimit (there is no progress feedback in e.g. git log -p),
but the experience above suggests merge.renameLimit could be extended
significantly.  In fact, it probably would make sense to have an
unlimited default setting for merge.renameLimit, but that would
likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/
for details in that area.)  For now, let's just bump the approximate
time limit from 10s to 1m.

(Note: We do not want to use actual time limits, because getting results
that depend on how loaded your system is that day feels bad, and because
we don't discover that we won't get all the renames until after we've
put in a lot of work rather than just upfront telling the user there are
too many files involved.)

Using the original time limit of 2s for diff.renameLimit, and bumping
merge.renameLimit from 10s to 60s, I found the following timings using
the simple script at the end of this commit message (on an AWS c5.xlarge
which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):

      N   Timing
   1300    1.995s
   7100   59.973s

So let's round down to nice even numbers and bump the limits from
400->1000, and from 1000->7000.

Here is the measure_rename_perf script (adapted from
https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/
in particular to avoid triggering the linear handling from
basename-guided rename detection):

    #!/bin/bash

    n=$1; shift

    rm -rf repo
    mkdir repo && cd repo
    git init -q -b main

    mkdata() {
      mkdir $1
      for i in `seq 1 $2`; do
        (sed "s/^/$i /" <../sample
         echo tag: $1
        ) >$1/$i
      done
    }

    mkdata initial $n
    git add .
    git commit -q -m initial

    mkdata new $n
    git add .
    cd new
    for i in *; do git mv $i $i.renamed; done
    cd ..
    git rm -q -rf initial
    git commit -q -m new

    time git diff-tree -M -l0 --summary HEAD^ HEAD

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/config/diff.txt  | 2 +-
 Documentation/config/merge.txt | 2 +-
 diff.c                         | 2 +-
 merge-ort.c                    | 2 +-
 merge-recursive.c              | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/config/diff.txt b/Documentation/config/diff.txt
index d1b5cfa3542..32f84838ac1 100644
--- a/Documentation/config/diff.txt
+++ b/Documentation/config/diff.txt
@@ -120,7 +120,7 @@ diff.orderFile::
 diff.renameLimit::
 	The number of files to consider in the exhaustive portion of
 	copy/rename detection; equivalent to the 'git diff' option
-	`-l`.  If not set, the default value is currently 400.  This
+	`-l`.  If not set, the default value is currently 1000.  This
 	setting has no effect if rename detection is turned off.
 
 diff.renames::
diff --git a/Documentation/config/merge.txt b/Documentation/config/merge.txt
index 7cd6d7883b6..e27cc639447 100644
--- a/Documentation/config/merge.txt
+++ b/Documentation/config/merge.txt
@@ -37,7 +37,7 @@ merge.renameLimit::
 	rename detection during a merge.  If not specified, defaults
 	to the value of diff.renameLimit.  If neither
 	merge.renameLimit nor diff.renameLimit are specified,
-	currently defaults to 1000.  This setting has no effect if
+	currently defaults to 7000.  This setting has no effect if
 	rename detection is turned off.
 
 merge.renames::
diff --git a/diff.c b/diff.c
index 2454e34cf6d..0244a371d32 100644
--- a/diff.c
+++ b/diff.c
@@ -35,7 +35,7 @@
 
 static int diff_detect_rename_default;
 static int diff_indent_heuristic = 1;
-static int diff_rename_limit_default = 400;
+static int diff_rename_limit_default = 1000;
 static int diff_suppress_blank_empty;
 static int diff_use_color_default = -1;
 static int diff_color_moved_default;
diff --git a/merge-ort.c b/merge-ort.c
index b954f7184a5..8a84375e940 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2558,7 +2558,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_opts.detect_rename = DIFF_DETECT_RENAME;
 	diff_opts.rename_limit = opt->rename_limit;
 	if (opt->rename_limit <= 0)
-		diff_opts.rename_limit = 1000;
+		diff_opts.rename_limit = 7000;
 	diff_opts.rename_score = opt->rename_score;
 	diff_opts.show_rename_progress = opt->show_rename_progress;
 	diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT;
diff --git a/merge-recursive.c b/merge-recursive.c
index 4327e0cfa33..f19f8cc37bd 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1879,7 +1879,7 @@ static struct diff_queue_struct *get_diffpairs(struct merge_options *opt,
 	 */
 	if (opts.detect_rename > DIFF_DETECT_RENAME)
 		opts.detect_rename = DIFF_DETECT_RENAME;
-	opts.rename_limit = (opt->rename_limit >= 0) ? opt->rename_limit : 1000;
+	opts.rename_limit = (opt->rename_limit >= 0) ? opt->rename_limit : 7000;
 	opts.rename_score = opt->rename_score;
 	opts.show_rename_progress = opt->show_rename_progress;
 	opts.output_format = DIFF_FORMAT_NO_OUTPUT;
-- 
gitgitgadget

  parent reply	other threads:[~2021-07-15  0:45 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-11  0:46 [PATCH 0/3] Improve the documentation and warnings dealing with rename/copy limits Elijah Newren via GitGitGadget
2021-07-11  0:46 ` [PATCH 1/3] doc: clarify documentation for " Elijah Newren via GitGitGadget
2021-07-11  4:37   ` Bagas Sanjaya
2021-07-11  4:52     ` Elijah Newren
2021-07-12 15:03   ` Derrick Stolee
2021-07-12 21:27     ` Junio C Hamano
2021-07-11  0:46 ` [PATCH 2/3] doc: document the special handling of -l0 Elijah Newren via GitGitGadget
2021-07-11  4:54   ` Eric Sunshine
2021-07-11  4:54     ` Elijah Newren
2021-07-11  0:46 ` [PATCH 3/3] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-12 15:09   ` Derrick Stolee
2021-07-12 18:13     ` Elijah Newren
2021-07-14  0:47       ` Junio C Hamano
2021-07-14  1:06         ` Elijah Newren
2021-07-14  1:10           ` Junio C Hamano
2021-07-14  1:22             ` Elijah Newren
2021-07-14  5:17               ` Junio C Hamano
2021-07-14 15:09                 ` Elijah Newren
2021-07-14  1:12 ` [PATCH v2 0/4] Rename/copy limits -- docs, warnings, and new defaults Elijah Newren via GitGitGadget
2021-07-14  1:12   ` [PATCH v2 1/4] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-14  1:12   ` [PATCH v2 2/4] doc: clarify documentation for rename/copy limits Elijah Newren via GitGitGadget
2021-07-14  7:37     ` Ævar Arnfjörð Bjarmason
2021-07-14 16:30       ` Elijah Newren
2021-07-14 22:08         ` Ævar Arnfjörð Bjarmason
2021-07-14 22:56           ` Elijah Newren
2021-07-14  1:12   ` [PATCH v2 3/4] doc: document the special handling of -l0 Elijah Newren via GitGitGadget
2021-07-14 16:45     ` Jeff King
2021-07-14 17:17       ` Elijah Newren
2021-07-14 17:33         ` Jeff King
2021-07-14 19:32           ` Elijah Newren
2021-07-14  1:12   ` [PATCH v2 4/4] Bump rename limit defaults (yet again) Elijah Newren via GitGitGadget
2021-07-14 16:43     ` Jeff King
2021-07-14 17:32       ` Elijah Newren
2021-07-14 17:57         ` Jeff King
2021-07-14 20:03           ` Elijah Newren
2021-07-14 20:47             ` Jeff King
2021-07-15  0:45   ` [PATCH v3 0/4] Rename/copy limits -- docs, warnings, and new defaults Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 1/4] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 2/4] doc: clarify documentation for rename/copy limits Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 3/4] diffcore-rename: treat a rename_limit of 0 as unlimited Elijah Newren via GitGitGadget
2021-07-15 23:17       ` Junio C Hamano
2021-07-15  0:45     ` Elijah Newren via GitGitGadget [this message]
2021-07-15 13:36     ` [PATCH v3 0/4] Rename/copy limits -- docs, warnings, and new defaults Derrick Stolee
2021-07-15 23:20     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b41278b6680f8b1f59e73c4eab24595a33f2f3c5.1626309924.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.