From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>,
	Derrick Stolee <dstolee@gmail.com>, Taylor Blau <me@ttaylorr.com>,
	Derrick Stolee <stolee@gmail.com>,
	Elijah Newren <newren@gmail.com>
Subject: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
Date: Tue, 22 Jun 2021 08:04:36 +0000
Message-ID: <pull.969.v3.git.1624349082.gitgitgadget@gmail.com>
In-Reply-To: <pull.969.v2.git.1623796907.gitgitgadget@gmail.com>

This series optimizes blob downloading in merges for partial clones. It
applies on top of master and is independent of ort-perf-batch-12.

Changes since v2:

 * Incorporated the suggestions from Junio on patch 2.

Changes since v1:

 * Incorporated the suggestions from Stolee on patch 2.

=== High level summary ===

 1. diffcore-rename.c has had a prefetch() to get data needed for inexact
    renames for a while.
 2. find_basename_matches() only requires a small subset of what prefetch()
    provides.
 3. I added a basename_prefetch() for find_basename_matches().

In the worst case, the above means:

 * We download the same number of objects, in 2 steps instead of 1.

However, in practice, since rename detection can usually quit after
find_basename_matches() (usually due to the irrelevant check that cannot be
performed until after find_basename_matches()):

 * We download far fewer objects, and use barely more download steps than
   before.

Adding some prefetching to merge-ort.c allows us to also drop the number of
downloads overall.

=== Modified performance measurement method ===

The testcases I've been using so far to measure performance were not run in
a partial clone, so they aren't directly usable for comparison. Further,
partial clone performance depends on network speed which can be highly
variable. So I want to modify one of the existing testcases slightly and
focus on two different but more stable metrics:

 1. Number of git fetch operations during rebase
 2. Number of objects fetched during rebase

The first of these should already be decent due to Jonathan Tan's work to
batch fetching of missing blobs during rename detection (see commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)), so we are
mostly looking to optimize the second but would like to also decrease the
first if possible.
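
As a minimal sketch of how the two metrics relate, assume the per-fetch
statistics have already been extracted into a file with one
"fetch_count:N" line per fetch operation. That layout and the sample
values below are assumptions made for illustration, not output copied
from a real run. The number of lines then gives the first metric and the
sum of the counts gives the second:

```shell
# Hypothetical, pre-extracted statistics: one "fetch_count:N" line per fetch.
tmp=$(mktemp -d)
printf '%s\n' "fetch_count:2" "fetch_count:1" >"$tmp/actual"

# Metric 1: number of fetch operations (one line each).
fetch_ops=$(grep -c '^fetch_count:' "$tmp/actual")

# Metric 2: total number of objects fetched (sum of the per-fetch counts).
objects=$(awk -F: '/^fetch_count:/ { sum += $2 } END { print sum + 0 }' "$tmp/actual")

echo "fetch operations: $fetch_ops, objects fetched: $objects"
```

With the sample data this reports 2 fetch operations and 3 objects, the
same shape as the "2 + 1 = 3" case checked in the new test script.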

The testcase we will look at will be a modification of the mega-renames
testcase from commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2020-10-28). In particular, we
change

$ git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git


to

$ git clone --sparse --filter=blob:none \
    https://github.com/github/linux


(The change in clone URL is just to get a server that supports the filter
predicate.)

We otherwise keep the test the same (in particular, we do not add any calls
to "git sparse-checkout {set,add}", which means that the resulting repository
will only have 7 total blobs, from files in the toplevel directory, before
starting the rebase).

=== Results ===

For the mega-renames testcase noted above (which rebases 35 commits across
an upstream with ~26K renames in a partial clone), I found the following
results for our metrics of interest:

     Number of `git fetch` ops during rebase

                     Before Series   After Series
merge-recursive:          62              63
merge-ort:                30              20


     Number of objects fetched during rebase

                     Before Series   After Series
merge-recursive:         11423          11423
merge-ort:               11391             63


So, we have a significant reduction (factor of ~3 relative to
merge-recursive) in the number of git fetch operations that have to be
performed in a partial clone to complete the rebase, and a dramatic
reduction (factor of ~180) in the number of objects that need to be fetched.
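
The factors quoted above can be rechecked from the two tables. This is
just the arithmetic on the reported numbers, nothing is re-measured; the
variable names are only for illustration:

```shell
# Fetch operations after the series: merge-recursive (63) vs merge-ort (20).
ops_ratio=$(awk 'BEGIN { printf "%.2f", 63 / 20 }')

# Fetch operations for merge-ort alone: before (30) vs after (20).
ort_ops_ratio=$(awk 'BEGIN { printf "%.2f", 30 / 20 }')

# Objects fetched by merge-ort: before (11391) vs after (63).
obj_ratio=$(awk 'BEGIN { printf "%.1f", 11391 / 63 }')

echo "fetch ops vs merge-recursive: ${ops_ratio}x"
echo "fetch ops vs merge-ort before: ${ort_ops_ratio}x"
echo "objects fetched vs merge-ort before: ${obj_ratio}x"
```

Note that the ~3x figure compares against merge-recursive; measured
against merge-ort before the series, the drop in fetch operations is
30 to 20.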

=== Summary ===

It's worth pointing out that merge-ort after the series needs only ~1.8
blobs per commit being transplanted to complete this particular rebase.
Essentially, this reinforces the fact that the optimization work so far has
taken rename detection from often being an overwhelmingly costly portion of
a merge (leading many to just capitulate on it), to what I have observed in
my experience so far as being just a minor cost for merges.

Elijah Newren (5):
  promisor-remote: output trace2 statistics for number of objects
    fetched
  t6421: add tests checking for excessive object downloads during merge
  diffcore-rename: allow different missing_object_cb functions
  diffcore-rename: use a different prefetch for basename comparisons
  merge-ort: add prefetching for content merges

 diffcore-rename.c              | 149 ++++++++---
 merge-ort.c                    |  50 ++++
 promisor-remote.c              |   7 +-
 t/t6421-merge-partial-clone.sh | 440 +++++++++++++++++++++++++++++++++
 4 files changed, 612 insertions(+), 34 deletions(-)
 create mode 100755 t/t6421-merge-partial-clone.sh


base-commit: 6de569e6ac492213e81321ca35f1f1b365ba31e3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-969%2Fnewren%2Fort-perf-batch-13-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-969/newren/ort-perf-batch-13-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/969

Range-diff vs v2:

 1:  04f5ebdabe14 = 1:  04f5ebdabe14 promisor-remote: output trace2 statistics for number of objects fetched
 2:  0f786cfb4c95 ! 2:  4796e096fdb4 t6421: add tests checking for excessive object downloads during merge
     @@ t/t6421-merge-partial-clone.sh (new)
      +		echo g >dir/subdir/tweaked/g &&
      +		echo h >dir/subdir/tweaked/h &&
      +		echo subdirectory makefile >dir/subdir/tweaked/Makefile &&
     -+		for i in `test_seq 1 88`; do
     ++		for i in $(test_seq 1 88)
     ++		do
      +			echo content $i >dir/unchanged/file_$i
      +		done &&
      +		git add . &&
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-single &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 2 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
     -+		test_line_count = 3 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 2 + 1 = 3 objects
     ++		test_line_count = 3 old
      +	)
      +'
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-dir &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 1 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 6 objects, so should have 6 fewer missing objects
     -+		test_line_count = 6 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 6 objects
     ++		test_line_count = 6 old
      +	)
      +'
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-many &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 4 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 12 + 5 + 3 + 2 == 22 objects
     -+		test_line_count = 22 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 12 + 5 + 3 + 2 = 22 objects
     ++		test_line_count = 22 old
      +	)
      +'
      +
 3:  9f2a8ed8d61f = 3:  7ed0162cdb4e diffcore-rename: allow different missing_object_cb functions
 4:  f753f8035564 = 4:  c9b55241d831 diffcore-rename: use a different prefetch for basename comparisons
 5:  317bcc7f56cb = 5:  69011cfe9fae merge-ort: add prefetching for content merges
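
For readers skimming the range-diff, the comparison that patch 2 now uses
can be tried outside the test suite. The sketch below uses made-up object
names in place of real "git rev-list --objects --all --missing=print"
output (the "?aaa1" style lines are placeholders, not real object IDs);
only the sort + comm part mirrors the test:

```shell
# Hypothetical fixture data standing in for the "?<oid>" lines that
# rev-list prints for missing objects.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' "?ddd4" "?aaa1" "?ccc3" "?bbb2" "?eee5" | sort >missing-objects-before
printf '%s\n' "?eee5" "?bbb2" | sort >missing-objects-after

# Lines only in "before": objects that were missing and have since been fetched.
comm -2 -3 missing-objects-before missing-objects-after >old
# Lines only in "after": objects that are newly missing (should be none).
comm -1 -3 missing-objects-before missing-objects-after >new

fetched=$(wc -l <old | tr -d ' ')
newly_missing=$(wc -l <new | tr -d ' ')
echo "fetched=$fetched newly_missing=$newly_missing"
```

Both inputs have to be sorted for comm to work, which is why the v3 tests
pipe the grep output through sort before writing the files.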

-- 
gitgitgadget
