All of lore.kernel.org
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Thomas Rast" <tr@thomasrast.ch>,
	"Derrick Stolee" <stolee@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 2/2] line-log: avoid unnecessary full tree diffs
Date: Wed, 21 Aug 2019 13:04:24 +0200	[thread overview]
Message-ID: <20190821110424.18184-3-szeder.dev@gmail.com> (raw)
In-Reply-To: <20190821110424.18184-1-szeder.dev@gmail.com>

With rename detection enabled the line-level log is able to trace the
evolution of line ranges across whole-file renames [1].  Alas, to
achieve that it uses the diff machinery very inefficiently, making the
operation very slow [2].  And since rename detection is enabled by
default, the line-level log is very slow by default.

When the line-level log processes a commit with rename detection
enabled, it currently does the following (see queue_diffs()):

  1. Computes a full tree diff between the commit and (one of) its
     parent(s), i.e. invokes diff_tree_oid() with an empty
     'diffopt->pathspec'.
  2. Checks whether any paths in the line ranges were modified.
  3. Checks whether any modified paths in the line ranges are missing
     in the parent commit's tree.
  4. If there is such a missing path, then calls diffcore_std() to
     figure out whether the path was indeed renamed based on the
     previously computed full tree diff.
  5. Continues doing stuff that are unrelated to the slowness.

So basically the line-level log computes a full tree diff for each
commit-parent pair in step (1) to be used for rename detection in step
(4) in the off chance that an interesting path is missing from the
parent.

Avoid these expensive and mostly unnecessary full tree diffs by
limiting the diffs to paths in the line ranges.  This is much cheaper,
and makes step (2) unnecessary.  If it turns out that an interesting
path is missing from the parent, then fall back and compute a full
tree diff, so the rename detection will still work.

Care must be taken when to update the pathspec used to limit the diff
in case of renames.  A path might be renamed on one branch and
modified on several parallel running branches, and while processing
commits on these branches the line-level log might have to alternate
between looking at a path's new and old name.  However, at any one
time there is only a single 'diffopt->pathspec'.

So add a step (0) to the above to ensure that the paths in the
pathspec match the paths in the line ranges associated with the
currently processed commit, and re-parse the pathspec from the paths
in the line ranges if they differ.

The new test cases include a specially crafted piece of history with
two merged branches and two files, where each branch modifies both
files, renames on of them, and then modifies both again.  Then two
separate 'git log -L' invocations check the line-level log of each of
those two files, which ensures that at least one of those invocations
have to do that back-and-forth between the file's old and new name (no
matter which branch is traversed first).  't/t4211-line-log.sh'
already contains two tests involving renames, they don't don't trigger
this back-and-forth.

Avoiding these unnecessary full tree diffs can have huge impact on
performance, especially in big repositories with big trees and mergy
history.  Tracing the evolution of a function through the whole
history:

  # git.git
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0

  Before:

    real    0m8.874s
    user    0m8.816s
    sys     0m0.057s

  After:

    real    0m2.516s
    user    0m2.456s
    sys     0m0.060s

  # linux.git
  $ time ~/src/git/git --no-pager log \
    -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2

  Before:

    real    3m50.033s
    user    3m48.041s
    sys     0m0.300s

  After:

    real    0m2.599s
    user    0m2.466s
    sys     0m0.157s

That's just over 88x speedup.

[1] Line-level log's rename following is quite similar to 'git log
    --follow path', with the notable differences that it does handle
    multiple paths at once as well, and that it doesn't show the
    commit performing the rename if it's an exact rename.

[2] This slowness might not have been apparent initially, because back
    when the line-level log feature was introduced rename detection
    was not yet enabled by default; 12da1d1f6f (Implement line-history
    search (git log -L), 2013-03-28) and 5404c116aa (diff: activate
    diff.renames by default, 2016-02-25).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 line-log.c          | 43 ++++++++++++++++++++----
 t/t4211-line-log.sh | 82 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+), 7 deletions(-)

diff --git a/line-log.c b/line-log.c
index fddd91f060..9010e00950 100644
--- a/line-log.c
+++ b/line-log.c
@@ -737,6 +737,22 @@ static struct line_log_data *lookup_line_range(struct rev_info *revs,
 	return ret;
 }
 
+static int same_paths_in_pathspec_and_range(struct pathspec *pathspec,
+					    struct line_log_data *range)
+{
+	int i;
+	struct line_log_data *r;
+
+	for (i = 0, r = range; i < pathspec->nr && r; i++, r = r->next)
+		if (strcmp(pathspec->items[i].match, r->path))
+			return 0;
+	if (i < pathspec->nr || r)
+		/* different number of pathspec items and ranges */
+		return 0;
+
+	return 1;
+}
+
 static void parse_pathspec_from_ranges(struct pathspec *pathspec,
 				       struct line_log_data *range)
 {
@@ -762,8 +778,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list
 	range = parse_lines(rev->diffopt.repo, commit, prefix, args);
 	add_line_range(rev, commit, range);
 
-	if (!rev->diffopt.detect_rename)
-		parse_pathspec_from_ranges(&rev->diffopt.pathspec, range);
+	parse_pathspec_from_ranges(&rev->diffopt.pathspec, range);
 }
 
 static void move_diff_queue(struct diff_queue_struct *dst,
@@ -821,15 +836,29 @@ static void queue_diffs(struct line_log_data *range,
 			struct diff_queue_struct *queue,
 			struct commit *commit, struct commit *parent)
 {
+	struct object_id *tree_oid, *parent_tree_oid;
+
 	assert(commit);
 
+	tree_oid = get_commit_tree_oid(commit);
+	parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL;
+
+	if (opt->detect_rename &&
+	    !same_paths_in_pathspec_and_range(&opt->pathspec, range)) {
+		clear_pathspec(&opt->pathspec);
+		parse_pathspec_from_ranges(&opt->pathspec, range);
+	}
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
-	diff_tree_oid(parent ? get_commit_tree_oid(parent) : NULL,
-		      get_commit_tree_oid(commit), "", opt);
-	if (opt->detect_rename) {
+	diff_tree_oid(parent_tree_oid, tree_oid, "", opt);
+	if (opt->detect_rename && diff_might_be_rename()) {
+		/* must look at the full tree diff to detect renames */
+		clear_pathspec(&opt->pathspec);
+		DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+		diff_tree_oid(parent_tree_oid, tree_oid, "", opt);
+
 		filter_diffs_for_paths(range, 1);
-		if (diff_might_be_rename())
-			diffcore_std(opt);
+		diffcore_std(opt);
 		filter_diffs_for_paths(range, 0);
 	}
 	move_diff_queue(queue, &diff_queued_diff);
diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
index 1db7bd0f59..8319163744 100755
--- a/t/t4211-line-log.sh
+++ b/t/t4211-line-log.sh
@@ -132,4 +132,86 @@ test_expect_success '--raw is forbidden' '
 	test_must_fail git log -L1,24:b.c --raw
 '
 
+test_expect_success 'setup for checking fancy rename following' '
+	git checkout --orphan moves-start &&
+	git reset --hard &&
+
+	printf "%s\n"    12 13 14 15      b c d e   >file-1 &&
+	printf "%s\n"    22 23 24 25      B C D E   >file-2 &&
+	git add file-1 file-2 &&
+	test_tick &&
+	git commit -m "Add file-1 and file-2" &&
+	oid_add_f1_f2=$(git rev-parse --short HEAD) &&
+
+	git checkout -b moves-main &&
+	printf "%s\n" 11 12 13 14 15      b c d e   >file-1 &&
+	git commit -a -m "Modify file-1 on main" &&
+	oid_mod_f1_main=$(git rev-parse --short HEAD) &&
+
+	printf "%s\n" 21 22 23 24 25      B C D E   >file-2 &&
+	git commit -a -m "Modify file-2 on main #1" &&
+	oid_mod_f2_main_1=$(git rev-parse --short HEAD) &&
+
+	git mv file-1 renamed-1 &&
+	git commit -m "Rename file-1 to renamed-1 on main" &&
+
+	printf "%s\n" 11 12 13 14 15      b c d e f >renamed-1 &&
+	git commit -a -m "Modify renamed-1 on main" &&
+	oid_mod_r1_main=$(git rev-parse --short HEAD) &&
+
+	printf "%s\n" 21 22 23 24 25      B C D E F >file-2 &&
+	git commit -a -m "Modify file-2 on main #2" &&
+	oid_mod_f2_main_2=$(git rev-parse --short HEAD) &&
+
+	git checkout -b moves-side moves-start &&
+	printf "%s\n"    12 13 14 15 16   b c d e   >file-1 &&
+	git commit -a -m "Modify file-1 on side #1" &&
+	oid_mod_f1_side_1=$(git rev-parse --short HEAD) &&
+
+	printf "%s\n"    22 23 24 25 26   B C D E   >file-2 &&
+	git commit -a -m "Modify file-2 on side" &&
+	oid_mod_f2_side=$(git rev-parse --short HEAD) &&
+
+	git mv file-2 renamed-2 &&
+	git commit -m "Rename file-2 to renamed-2 on side" &&
+
+	printf "%s\n"    12 13 14 15 16 a b c d e   >file-1 &&
+	git commit -a -m "Modify file-1 on side #2" &&
+	oid_mod_f1_side_2=$(git rev-parse --short HEAD) &&
+
+	printf "%s\n"    22 23 24 25 26 A B C D E   >renamed-2 &&
+	git commit -a -m "Modify renamed-2 on side" &&
+	oid_mod_r2_side=$(git rev-parse --short HEAD) &&
+
+	git checkout moves-main &&
+	git merge moves-side &&
+	oid_merge=$(git rev-parse --short HEAD)
+'
+
+test_expect_success 'fancy rename following #1' '
+	cat >expect <<-EOF &&
+	$oid_merge Merge branch '\''moves-side'\'' into moves-main
+	$oid_mod_f1_side_2 Modify file-1 on side #2
+	$oid_mod_f1_side_1 Modify file-1 on side #1
+	$oid_mod_r1_main Modify renamed-1 on main
+	$oid_mod_f1_main Modify file-1 on main
+	$oid_add_f1_f2 Add file-1 and file-2
+	EOF
+	git log -L1:renamed-1 --oneline --no-patch >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'fancy rename following #2' '
+	cat >expect <<-EOF &&
+	$oid_merge Merge branch '\''moves-side'\'' into moves-main
+	$oid_mod_r2_side Modify renamed-2 on side
+	$oid_mod_f2_side Modify file-2 on side
+	$oid_mod_f2_main_2 Modify file-2 on main #2
+	$oid_mod_f2_main_1 Modify file-2 on main #1
+	$oid_add_f1_f2 Add file-1 and file-2
+	EOF
+	git log -L1:renamed-2 --oneline --no-patch >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.23.0.352.gebb2b55eae


  parent reply	other threads:[~2019-08-21 11:04 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-21 11:04 [PATCH 0/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor
2019-08-21 11:04 ` [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function SZEDER Gábor
2019-08-21 11:04 ` SZEDER Gábor [this message]
2019-08-21 15:53   ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs Derrick Stolee
2019-08-21 17:35     ` SZEDER Gábor
2019-08-21 18:12       ` Derrick Stolee
2019-08-22  8:41       ` SZEDER Gábor
2019-08-22 14:53         ` Derrick Stolee
2019-08-22 16:01         ` Junio C Hamano
2019-08-22 16:26           ` SZEDER Gábor
2019-08-22 16:51             ` Derrick Stolee
2019-08-23 10:04         ` SZEDER Gábor
2019-08-21 17:29   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190821110424.18184-3-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=stolee@gmail.com \
    --cc=tr@thomasrast.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.