From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>,
"Felipe Contreras" <felipe.contreras@gmail.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>,
"Chris Torek" <chris.torek@gmail.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Taylor Blau" <me@ttaylorr.com>
Subject: [PATCH v4 0/6] Speed up connectivity checks
Date: Thu, 5 Aug 2021 13:25:19 +0200 [thread overview]
Message-ID: <cover.1628162156.git.ps@pks.im> (raw)
In-Reply-To: <cover.1627896460.git.ps@pks.im>
[-- Attachment #1: Type: text/plain, Size: 17702 bytes --]
Hi,
this is version 4 of my series to speed up connectivity checks. Given
that v3 has received positive feedback, I finally stuck to the approach
and only have a bunch of iterative changes based on your feedback.
Changes compared to v3:
- Patch 1/6 is new and splits up revs->no_walk into revs->no_walk
which now only indicates whether to walk and revs->unserted_input,
which indicates whether to sort input.
- Patch 2/6 got some documentation for the new `--unsorted-input`
option. Furthermore, we refuse `--no-walk`/`--no-walk=sorted` and
`--unsorted-input` if used together. I've also added some tests
for the new option.
- Patch 3/6 has an updated commit message, detailing that
`add_pending_oid()` only is a thin wrapper around
`add_pending_object()`.
- Patch 4/6 has an update commit message, stating that it's a
prerequisite for 6/6.
- Patch 5/6 is new, splitting out the logic to find a commit ID in
the commit graph as a prerequisite for 6/6.
- Patch 6/6 now also verifies that commits parsed only via the
commit-graph exist in the ODB. I've also renamed
`find_object_in_graph()` to `parse_commit_in_graph_gently()` to
better reflect what the function does.
With the added existence check in 6/6, the speedup is not as big as
before (1.47x faster instead of 1.55x). But it's still very much worth
it. In total, this patch series decreases `git rev-list --objects
--unsorted --not --all --not $newrev` from 7.6s to 3.0s in my test
repository.
Thanks for all your feedback!
Patrick
Patrick Steinhardt (6):
revision: separate walk and unsorted flags
connected: do not sort input revisions
revision: stop retrieving reference twice
revision: avoid loading object headers multiple times
commit-graph: split out function to search commit position
revision: avoid hitting packfiles when commits are in commit-graph
Documentation/rev-list-options.txt | 8 +++-
builtin/log.c | 2 +-
builtin/revert.c | 3 +-
commit-graph.c | 75 +++++++++++++++++++++---------
commit-graph.h | 7 +++
connected.c | 1 +
revision.c | 52 +++++++++++++++++----
revision.h | 7 +--
t/t6000-rev-list-misc.sh | 38 +++++++++++++++
9 files changed, 153 insertions(+), 40 deletions(-)
Range-diff against v3:
-: ---------- > 1: 67232910ac revision: separate walk and unsorted flags
1: 1fd83f726a ! 2: 9d7f484907 connected: do not sort input revisions
@@ Commit message
Signed-off-by: Patrick Steinhardt <ps@pks.im>
+ ## Documentation/rev-list-options.txt ##
+@@ Documentation/rev-list-options.txt: list of the missing objects. Object IDs are prefixed with a ``?'' character.
+ objects.
+ endif::git-rev-list[]
+
++--unsorted-input::
++ Show commits in the order they were given on the command line instead
++ of sorting them in reverse chronological order by commit time. Cannot
++ be combined with `--no-walk` or `--no-walk=sorted`.
++
+ --no-walk[=(sorted|unsorted)]::
+ Only show the given commits, but do not traverse their ancestors.
+ This has no effect if a range is specified. If the argument
+@@ Documentation/rev-list-options.txt: endif::git-rev-list[]
+ given on the command line. Otherwise (if `sorted` or no argument
+ was given), the commits are shown in reverse chronological order
+ by commit time.
+- Cannot be combined with `--graph`.
++ Cannot be combined with `--graph`. Cannot be combined with
++ `--unsorted-input` if `sorted` or no argument was given.
+
+ --do-walk::
+ Overrides a previous `--no-walk`.
+
## connected.c ##
@@ connected.c: int check_connected(oid_iterate_fn fn, void *cb_data,
if (opt->progress)
@@ revision.c: static int handle_revision_opt(struct rev_info *revs, int argc, cons
revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
revs->topo_order = 1;
+ } else if (!strcmp(arg, "--unsorted-input")) {
++ if (revs->no_walk && !revs->unsorted_input)
++ die(_("--unsorted-input is incompatible with --no-walk and --no-walk=sorted"));
+ revs->unsorted_input = 1;
} else if (!strcmp(arg, "--early-output")) {
revs->early_output = 100;
revs->topo_order = 1;
-@@ revision.c: int prepare_revision_walk(struct rev_info *revs)
-
- if (!revs->reflog_info)
- prepare_to_use_bloom_filter(revs);
-- if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
-+ if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
- commit_list_sort_by_date(&revs->commits);
- if (revs->no_walk)
- return 0;
+@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
+ } else if (!strcmp(arg, "--not")) {
+ *flags ^= UNINTERESTING | BOTTOM;
+ } else if (!strcmp(arg, "--no-walk")) {
++ if (revs->unsorted_input)
++ die(_("--no-walk is incompatible with --no-walk=unsorted and --unsorted-input"));
+ revs->no_walk = 1;
+ } else if (skip_prefix(arg, "--no-walk=", &optarg)) {
+ /*
+@@ revision.c: static int handle_revision_pseudo_opt(const char *submodule,
+ * not allowed, since the argument is optional.
+ */
+ revs->no_walk = 1;
+- if (!strcmp(optarg, "sorted"))
++ if (!strcmp(optarg, "sorted")) {
++ if (revs->unsorted_input)
++ die(_("--no-walk=sorted is incompatible with --no-walk=unsorted "
++ "and --unsorted-input"));
+ revs->unsorted_input = 0;
+- else if (!strcmp(optarg, "unsorted"))
++ } else if (!strcmp(optarg, "unsorted"))
+ revs->unsorted_input = 1;
+ else
+ return error("invalid argument to --no-walk");
- ## revision.h ##
-@@ revision.h: struct rev_info {
- simplify_history:1,
- show_pulls:1,
- topo_order:1,
-+ unsorted_input:1,
- simplify_merges:1,
- simplify_by_decoration:1,
- single_worktree:1,
+ ## t/t6000-rev-list-misc.sh ##
+@@ t/t6000-rev-list-misc.sh: test_expect_success 'rev-list --count --objects' '
+ test_line_count = $count actual
+ '
+
++test_expect_success 'rev-list --unsorted-input results in different sorting' '
++ git rev-list --unsorted-input HEAD HEAD~ >first &&
++ git rev-list --unsorted-input HEAD~ HEAD >second &&
++ ! test_cmp first second &&
++ sort first >first.sorted &&
++ sort second >second.sorted &&
++ test_cmp first.sorted second.sorted
++'
++
++test_expect_success 'rev-list --unsorted-input compatible with --no-walk=unsorted' '
++ git rev-list --unsorted-input --no-walk=unsorted HEAD HEAD~ >actual &&
++ git rev-parse HEAD >expect &&
++ git rev-parse HEAD~ >>expect &&
++ test_cmp expect actual
++'
++
++test_expect_success 'rev-list --unsorted-input incompatible with --no-walk=sorted' '
++ cat >expect <<-EOF &&
++ fatal: --no-walk is incompatible with --no-walk=unsorted and --unsorted-input
++ EOF
++ test_must_fail git rev-list --unsorted-input --no-walk HEAD 2>error &&
++ test_cmp expect error &&
++
++ cat >expect <<-EOF &&
++ fatal: --no-walk=sorted is incompatible with --no-walk=unsorted and --unsorted-input
++ EOF
++ test_must_fail git rev-list --unsorted-input --no-walk=sorted HEAD 2>error &&
++ test_cmp expect error &&
++
++ cat >expect <<-EOF &&
++ fatal: --unsorted-input is incompatible with --no-walk and --no-walk=sorted
++ EOF
++ test_must_fail git rev-list --no-walk --unsorted-input HEAD 2>error &&
++ test_cmp expect error &&
++ test_must_fail git rev-list --no-walk=sorted --unsorted-input HEAD 2>error &&
++ test_cmp expect error
++'
++
+ test_done
2: db85480649 ! 3: d8e63d0943 revision: stop retrieving reference twice
@@ Commit message
revision: stop retrieving reference twice
When queueing up references for the revision walk, `handle_one_ref()`
- will resolve the reference's object ID and then queue the ID as pending
- object via `add_pending_oid()`. But `add_pending_oid()` will again try
- to resolve the object ID to an object, effectively duplicating the work
- its caller already did before.
+ will resolve the reference's object ID via `get_reference()` and then
+ queue the ID as pending object via `add_pending_oid()`. But given that
+ `add_pending_oid()` is only a thin wrapper around `add_pending_object()`
+ which fist calls `get_reference()`, we effectively resolve the reference
+ twice and thus duplicate some of the work.
- Fix the issue by instead calling `add_pending_object()`, which takes the
- already-resolved object as input. In a repository with lots of refs,
- this translates in a nearly 10% speedup:
+ Fix the issue by instead calling `add_pending_object()` directly, which
+ takes the already-resolved object as input. In a repository with lots of
+ refs, this translates into a near 10% speedup:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 5.015 s ± 0.038 s [User: 4.698 s, System: 0.316 s]
3: b9897e102a ! 4: ba8df5cad0 revision: avoid loading object headers multiple times
@@ Commit message
either a tag or a commit, we'd have a modest increase in memory
consumption of about 12.5% here.
+ Note that on its own, this patch may not seem like a clear win. But it
+ is a prerequisite for the following patch, which will result in another
+ 37% speedup.
+
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## revision.c ##
-: ---------- > 5: e33cd51ebf commit-graph: split out function to search commit position
4: f6fc2a5e6d ! 6: 900c5a9c60 revision: avoid hitting packfiles when commits are in commit-graph
@@ Commit message
directly fill in the commit object, otherwise we can still hit the disk
to determine the object's type.
- Expose a new function `find_object_in_graph()`, which fills in an object
- of unknown type in case we find its object ID in the graph. This
- provides a big performance win in cases where there is a commit-graph
- available in the repository in case we load lots of references. The
- following has been executed in a real-world repository with about 2.2
- million refs:
+ Expose a new function `parse_commit_in_graph_gently()`, which fills in
+ an object of unknown type in case we find its object ID in the graph.
+ This provides a big performance win in cases where there is a
+ commit-graph available in the repository in case we load lots of
+ references. The following has been executed in a real-world repository
+ with about 2.2 million refs:
Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
- Time (mean ± σ): 4.465 s ± 0.037 s [User: 4.144 s, System: 0.320 s]
- Range (min … max): 4.411 s … 4.514 s 10 runs
+ Time (mean ± σ): 4.508 s ± 0.039 s [User: 4.131 s, System: 0.377 s]
+ Range (min … max): 4.455 s … 4.576 s 10 runs
Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
- Time (mean ± σ): 2.886 s ± 0.032 s [User: 2.570 s, System: 0.316 s]
- Range (min … max): 2.826 s … 2.933 s 10 runs
+ Time (mean ± σ): 3.072 s ± 0.031 s [User: 2.707 s, System: 0.365 s]
+ Range (min … max): 3.040 s … 3.144 s 10 runs
Summary
'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
- 1.55 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
+ 1.47 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
## commit-graph.c ##
-@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
- return 1;
+@@ commit-graph.c: static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
+ }
}
-+static int find_object_id_in_graph(const struct object_id *id, struct commit_graph *g, uint32_t *pos)
-+{
-+ struct commit_graph *cur_g = g;
-+ uint32_t lex_index;
-+
-+ while (cur_g && !bsearch_graph(cur_g, (struct object_id *)id, &lex_index))
-+ cur_g = cur_g->base_graph;
-+
-+ if (cur_g) {
-+ *pos = lex_index + cur_g->num_commits_in_base;
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+int find_object_in_graph(struct repository *repo, struct object *object)
++int parse_commit_in_graph_gently(struct repository *repo, struct object *object)
+{
+ struct commit *commit;
+ uint32_t pos;
@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
+ if (!repo->objects->commit_graph)
+ return -1;
+
-+ if (!find_object_id_in_graph(&object->oid, repo->objects->commit_graph, &pos))
++ if (!search_commit_pos_in_graph(&object->oid, repo->objects->commit_graph, &pos))
+ return -1;
+
+ commit = object_as_type(object, OBJ_COMMIT, 1);
@@ commit-graph.c: static int fill_commit_in_graph(struct repository *r,
+ return 0;
+}
+
- static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
- {
- uint32_t graph_pos = commit_graph_position(item);
-@@ commit-graph.c: static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
- *pos = graph_pos;
- return 1;
- } else {
-- struct commit_graph *cur_g = g;
-- uint32_t lex_index;
--
-- while (cur_g && !bsearch_graph(cur_g, &(item->object.oid), &lex_index))
-- cur_g = cur_g->base_graph;
--
-- if (cur_g) {
-- *pos = lex_index + cur_g->num_commits_in_base;
-- return 1;
-- }
--
-- return 0;
-+ return find_object_id_in_graph(&item->object.oid, g, pos);
- }
- }
-
+ static int parse_commit_in_graph_one(struct repository *r,
+ struct commit_graph *g,
+ struct commit *item)
## commit-graph.h ##
-@@ commit-graph.h: int write_commit_graph(struct object_directory *odb,
- enum commit_graph_write_flags flags,
- const struct commit_graph_opts *opts);
+@@ commit-graph.h: int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
+ */
+ int parse_commit_in_graph(struct repository *r, struct commit *item);
-+int find_object_in_graph(struct repository *repo, struct object *object);
++/*
++ * Given an object of unknown type, try to fill in the object in case it is a
++ * commit part of the commit-graph. Returns 0 if the object is a parsed commit
++ * or if it could be filled in via the commit graph, otherwise it returns -1.
++ */
++int parse_commit_in_graph_gently(struct repository *repo, struct object *object);
+
- #define COMMIT_GRAPH_VERIFY_SHALLOW (1 << 0)
-
- int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags);
+ /*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
## revision.c ##
@@ revision.c: static struct object *get_reference(struct rev_info *revs, const char *name,
@@ revision.c: static struct object *get_reference(struct rev_info *revs, const cha
if (object->type == OBJ_NONE) {
- int type = oid_object_info(revs->repo, oid, NULL);
- if (type < 0 || !object_as_type(object, type, 1)) {
-- object = NULL;
-- goto out;
-+ if (find_object_in_graph(revs->repo, object) < 0) {
++ /*
++ * It's likely that the reference points to a commit, so we
++ * first try to look it up via the commit-graph. If successful,
++ * then we know it's a commit and don't have to unpack the
++ * object header. We still need to assert that the object
++ * exists, but given that we don't request any info about the
++ * object this is a lot faster than `oid_object_info()`.
++ */
++ if (parse_commit_in_graph_gently(revs->repo, object) < 0) {
+ int type = oid_object_info(revs->repo, oid, NULL);
+ if (type < 0 || !object_as_type(object, type, 1)) {
+ object = NULL;
+ goto out;
+ }
++ } else if (!repo_has_object_file(revs->repo, oid)) {
+ object = NULL;
+ goto out;
}
- }
-
--
2.32.0
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2021-08-05 11:25 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-28 5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28 5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28 7:49 ` Ævar Arnfjörð Bjarmason
2021-06-29 6:18 ` Patrick Steinhardt
2021-06-29 12:09 ` Ævar Arnfjörð Bjarmason
2021-06-28 5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28 8:00 ` Ævar Arnfjörð Bjarmason
2021-06-28 8:06 ` Ævar Arnfjörð Bjarmason
2021-06-29 6:26 ` Patrick Steinhardt
2021-06-30 1:31 ` Jeff King
2021-06-30 1:35 ` Jeff King
2021-06-30 13:52 ` Patrick Steinhardt
2021-06-28 5:33 ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Patrick Steinhardt
2021-06-28 20:23 ` Taylor Blau
2021-06-29 22:44 ` Taylor Blau
2021-06-30 2:04 ` Jeff King
2021-06-30 3:07 ` Taylor Blau
2021-06-30 5:45 ` Jeff King
2021-07-02 17:44 ` Taylor Blau
2021-07-02 21:21 ` Jeff King
2021-06-30 1:51 ` Jeff King
2021-07-20 14:26 ` Patrick Steinhardt
2021-08-02 9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02 9:38 ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49 ` Ævar Arnfjörð Bjarmason
2021-08-03 8:50 ` Patrick Steinhardt
2021-08-04 11:01 ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00 ` Junio C Hamano
2021-08-03 8:55 ` Patrick Steinhardt
2021-08-03 21:47 ` Junio C Hamano
2021-08-02 9:38 ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53 ` Ævar Arnfjörð Bjarmason
2021-08-02 9:38 ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55 ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12 ` Patrick Steinhardt
2021-08-02 19:40 ` Junio C Hamano
2021-08-03 9:07 ` Patrick Steinhardt
2021-08-06 14:17 ` Patrick Steinhardt
2021-08-02 9:38 ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01 ` Junio C Hamano
2021-08-03 9:16 ` Patrick Steinhardt
2021-08-03 21:56 ` Junio C Hamano
2021-08-05 11:01 ` Patrick Steinhardt
2021-08-05 16:16 ` Junio C Hamano
2021-08-04 10:51 ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25 ` Patrick Steinhardt [this message]
2021-08-05 11:25 ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47 ` Junio C Hamano
2021-08-05 11:25 ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44 ` Junio C Hamano
2021-08-06 6:00 ` Patrick Steinhardt
2021-08-06 16:50 ` Junio C Hamano
2021-08-05 11:25 ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25 ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09 8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09 8:02 ` Patrick Steinhardt
2021-08-09 8:11 ` Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09 8:11 ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09 8:12 ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cover.1628162156.git.ps@pks.im \
--to=ps@pks.im \
--cc=avarab@gmail.com \
--cc=chris.torek@gmail.com \
--cc=felipe.contreras@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).