* [PATCH] packfile.c: speed up loading lots of packfiles.
From: Colin Stolley @ 2019-11-27 22:24 UTC
To: git
(3 replies; 15+ messages in thread)

When loading packfiles on start-up, we traverse the internal packfile
list once per file to avoid reloading packfiles that have already
been loaded. This check runs in quadratic time, so for poorly
maintained repos with a large number of packfiles, it can be pretty
slow.

Add a hashmap containing the packfile names as we load them so that
the average runtime cost of checking for already-loaded packs becomes
constant.

Add a perf test to p5303 to show speed-up.

The existing p5303 test runtimes are dominated by other factors and do
not show an appreciable speed-up. The new test in p5303 clearly exposes
a speed-up in bad cases. In this test we create 10,000 packfiles and
measure the start-up time of git rev-parse, which does little else
besides load in the packs.

Here are the numbers for the new p5303 test:

Test                                        HEAD^             HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs  1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%

Thanks-to: Jeff King <peff@peff.net>
Signed-off-by: Colin Stolley <cstolley@runbox.com>
---
 object-store.h             | 21 +++++++++++++++++++++
 object.c                   |  3 +++
 packfile.c                 | 21 +++++++++++----------
 t/perf/p5303-many-packs.sh | 18 ++++++++++++++++++
 4 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/object-store.h b/object-store.h
index 7f7b3cdd80..55ee639350 100644
--- a/object-store.h
+++ b/object-store.h
@@ -60,6 +60,7 @@ struct oid_array *odb_loose_cache(struct object_directory *odb,
 void odb_clear_loose_cache(struct object_directory *odb);
 
 struct packed_git {
+	struct hashmap_entry packmap_ent;
 	struct packed_git *next;
 	struct list_head mru;
 	struct pack_window *windows;
@@ -88,6 +89,20 @@ struct packed_git {
 
 struct multi_pack_index;
 
+static inline int pack_map_entry_cmp(const void *unused_cmp_data,
+				     const struct hashmap_entry *entry,
+				     const struct hashmap_entry *entry2,
+				     const void *keydata)
+{
+	const char *key = keydata;
+	const struct packed_git *pg1, *pg2;
+
+	pg1 = container_of(entry, const struct packed_git, packmap_ent);
+	pg2 = container_of(entry2, const struct packed_git, packmap_ent);
+
+	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
+}
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -131,6 +146,12 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	/*
+	 * A map of packfiles to packed_git structs for tracking which
+	 * packs have been loaded already.
+	 */
+	struct hashmap pack_map;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use

diff --git a/object.c b/object.c
index 3b8b8c55c9..142ef69399 100644
--- a/object.c
+++ b/object.c
@@ -479,6 +479,7 @@ struct raw_object_store *raw_object_store_new(void)
 
 	memset(o, 0, sizeof(*o));
 	INIT_LIST_HEAD(&o->packed_git_mru);
+	hashmap_init(&o->pack_map, pack_map_entry_cmp, NULL, 0);
 	return o;
 }
 
@@ -518,6 +519,8 @@ void raw_object_store_clear(struct raw_object_store *o)
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_object_store(o);
 	o->packed_git = NULL;
+
+	hashmap_free(&o->pack_map);
 }
 
 void parsed_object_pool_clear(struct parsed_object_pool *o)
diff --git a/packfile.c b/packfile.c
index 355066de17..253559fa87 100644
--- a/packfile.c
+++ b/packfile.c
@@ -856,20 +856,21 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx") &&
 	    !(data->m && midx_contains_pack(data->m, file_name))) {
-		/* Don't reopen a pack we already have. */
-		for (p = data->r->objects->packed_git; p; p = p->next) {
-			size_t len;
-			if (strip_suffix(p->pack_name, ".pack", &len) &&
-			    len == base_len &&
-			    !memcmp(p->pack_name, full_name, len))
-				break;
-		}
+		struct hashmap_entry hent;
+		char *pack_name = xstrfmt("%.*s.pack", (int)base_len, full_name);
+		unsigned int hash = strhash(pack_name);
+		hashmap_entry_init(&hent, hash);
 
-		if (!p) {
+		/* Don't reopen a pack we already have. */
+		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
-			if (p)
+			if (p) {
+				hashmap_entry_init(&p->packmap_ent, hash);
+				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
 				install_packed_git(data->r, p);
+			}
 		}
+		free(pack_name);
 	}
 
 	if (!report_garbage)
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 3779851941..ede78e19e2 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -84,4 +84,22 @@ do
 	'
 done
 
+# Measure pack loading with 10,000 packs.
+test_expect_success 'generate lots of packs' '
+	for i in $(test_seq 10000); do
+		echo "blob"
+		echo "data <<EOF"
+		echo "blob $i"
+		echo "EOF"
+		echo "checkpoint"
+	done |
+	git -c fastimport.unpackLimit=0 fast-import
+'
+
+# The purpose of this test is to evaluate load time for a large number
+# of packs while doing as little other work as possible.
+test_perf "load 10,000 packs" '
+	git rev-parse --verify "HEAD^{commit}"
+'
+
 test_done
-- 
2.23.0
* hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Eric Wong @ 2019-11-28 0:42 UTC
To: Colin Stolley
Cc: git

Colin Stolley <cstolley@runbox.com> wrote:
> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.

Cool! Thanks for looking into this; I've been having trouble in that
department with big alternates files.

> Add a hashmap containing the packfile names as we load them so that
> the average runtime cost of checking for already-loaded packs becomes
> constant.

Btw, would you have time to do a comparison against khash?

AFAIK hashmap predates khash in git; and hashmap was optimized for
removal. Removals don't seem to be a problem for pack loading.

I'm interested in exploring the removal of hashmap entirely in favor
of khash to keep our codebase smaller and easier to learn. khash shows
up more in other projects, and ought to have better cache locality.

Thanks again.
* Re: hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Junio C Hamano @ 2019-11-30 17:36 UTC
To: Eric Wong
Cc: Colin Stolley, git

Eric Wong <e@80x24.org> writes:

> Colin Stolley <cstolley@runbox.com> wrote:
>> When loading packfiles on start-up, we traverse the internal packfile
>> list once per file to avoid reloading packfiles that have already
>> been loaded. This check runs in quadratic time, so for poorly
>> maintained repos with a large number of packfiles, it can be pretty
>> slow.
>
> Cool! Thanks for looking into this; I've been having trouble in that
> department with big alternates files.
>
>> Add a hashmap containing the packfile names as we load them so that
>> the average runtime cost of checking for already-loaded packs becomes
>> constant.
>
> Btw, would you have time to do a comparison against khash?
>
> AFAIK hashmap predates khash in git; and hashmap was optimized for
> removal. Removals don't seem to be a problem for pack loading.
>
> I'm interested in exploring the removal of hashmap entirely in favor
> of khash to keep our codebase smaller and easier to learn. khash shows
> up more in other projects, and ought to have better cache locality.

Sounds fun ;-) and useful.
* Re: hashmap vs khash? Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Jeff King @ 2019-12-02 14:39 UTC
To: Eric Wong
Cc: Colin Stolley, git

On Thu, Nov 28, 2019 at 12:42:02AM +0000, Eric Wong wrote:

> > Add a hashmap containing the packfile names as we load them so that
> > the average runtime cost of checking for already-loaded packs becomes
> > constant.
>
> Btw, would you have time to do a comparison against khash?
>
> AFAIK hashmap predates khash in git; and hashmap was optimized for
> removal. Removals don't seem to be a problem for pack loading.

Actually, they came around simultaneously. I think hashmap.[ch] was
mostly a response to our open-coded hashes, like the one in object.c
(which still uses neither of the reusable forms!). Those didn't handle
removal at all. khash does handle removal, though you pay a price in
tombstone entries until the next resize operation.

> I'm interested in exploring the removal of hashmap entirely in favor
> of khash to keep our codebase smaller and easier to learn. khash shows
> up more in other projects, and ought to have better cache locality.

I have been tempted to push for that, too. Every timing I have ever
done shows khash as faster (though for a trivial use like this one, I
would be quite surprised if it mattered either way).

My hesitation is that khash can be harder to debug because of the
macro implementation. But I have rarely needed to look beneath its
API.

-Peff
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: SZEDER Gábor @ 2019-12-02 17:40 UTC
To: Colin Stolley
Cc: git

On Wed, Nov 27, 2019 at 04:24:53PM -0600, Colin Stolley wrote:
> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.
>
> Add a hashmap containing the packfile names as we load them so that
> the average runtime cost of checking for already-loaded packs becomes
> constant.
>
> Add a perf test to p5303 to show speed-up.
>
> The existing p5303 test runtimes are dominated by other factors and do
> not show an appreciable speed-up. The new test in p5303 clearly exposes
> a speed-up in bad cases. In this test we create 10,000 packfiles and
> measure the start-up time of git rev-parse, which does little else
> besides load in the packs.
>
> Here are the numbers for the new p5303 test:
>
> Test                                        HEAD^             HEAD
> ---------------------------------------------------------------------
> 5303.12: load 10,000 packs  1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%
>
> Thanks-to: Jeff King <peff@peff.net>
> Signed-off-by: Colin Stolley <cstolley@runbox.com>
> ---

This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
entry in '.git/objects/info/packs':

expecting success of 6500.7 'gc --keep-largest-pack':
	test_create_repo keep-pack &&
	(
		cd keep-pack &&
		test_commit one &&
		test_commit two &&
		test_commit three &&
		git gc &&
		( cd .git/objects/pack && ls *.pack ) >pack-list &&
		test_line_count = 1 pack-list &&
		BASE_PACK=.git/objects/pack/pack-*.pack &&
		test_commit four &&
		git repack -d &&
		test_commit five &&
		git repack -d &&
		( cd .git/objects/pack && ls *.pack ) >pack-list &&
		test_line_count = 3 pack-list &&
		git gc --keep-largest-pack &&
		( cd .git/objects/pack && ls *.pack ) >pack-list &&
		test_line_count = 2 pack-list &&
		awk "/^P /{print \$2}" <.git/objects/info/packs >pack-info &&
		test_line_count = 2 pack-info &&
		test_path_is_file $BASE_PACK &&
		git fsck
	)

+ test_create_repo keep-pack
Initialized empty Git repository in /home/szeder/src/git/t/trash directory.t6500-gc/keep-pack/.git/
+ cd keep-pack
+ test_commit one
[master (root-commit) d79ce16] one
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 one.t
+ test_commit two
[master 139b20d] two
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 two.t
+ test_commit three
[master 7c7cd71] three
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 three.t
+ git gc
Computing commit graph generation numbers: 100% (3/3), done.
+ cd .git/objects/pack
+ ls pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
+ test_line_count = 1 pack-list
+ BASE_PACK=.git/objects/pack/pack-*.pack
+ test_commit four
[master fd8d77e] four
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 four.t
+ git repack -d
+ test_commit five
[master a383792] five
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 five.t
+ git repack -d
+ cd .git/objects/pack
+ ls pack-057d7f493a7c26d58090f4777ff66d4c226c4408.pack pack-54feec766fc7d2d204b03879d96f4595d7e48c37.pack pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
+ test_line_count = 3 pack-list
+ git gc --keep-largest-pack
Computing commit graph generation numbers: 100% (5/5), done.
+ cd .git/objects/pack
+ ls pack-390dbbb8e27c014b080c08dfc482d4982d4c6644.pack pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
+ test_line_count = 2 pack-list
+ awk /^P /{print $2}
+ test_line_count = 2 pack-info
test_line_count: line count for pack-info != 2
pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
pack-a4b37b9b5458e8116b1c1840185b39fb5e6b8726.pack
pack-390dbbb8e27c014b080c08dfc482d4982d4c6644.pack
error: last command exited with $?=1
not ok 7 - gc --keep-largest-pack
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Jeff King @ 2019-12-02 19:42 UTC
To: SZEDER Gábor
Cc: Colin Stolley, git

On Mon, Dec 02, 2019 at 06:40:35PM +0100, SZEDER Gábor wrote:

> > When loading packfiles on start-up, we traverse the internal packfile
> > list once per file to avoid reloading packfiles that have already
> > been loaded. This check runs in quadratic time, so for poorly
> > maintained repos with a large number of packfiles, it can be pretty
> > slow.
> >
> > Add a hashmap containing the packfile names as we load them so that
> > the average runtime cost of checking for already-loaded packs becomes
> > constant.
> [...]
> This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
> run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
> entry in '.git/objects/info/packs':

Good catch. The issue is that we only add entries to the hashmap in
prepare_packed_git(), but they may be added to the pack list by other
callers of install_packed_git(). It probably makes sense to just push
the hashmap maintenance down into that function, like below. That
requires an extra strhash() when inserting a new pack, but I don't
think that's a big deal.

diff --git a/packfile.c b/packfile.c
index 253559fa87..f0dc63e92f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
 
 	pack->next = r->objects->packed_git;
 	r->objects->packed_git = pack;
+
+	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
+	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
 }
 
 void (*report_garbage)(unsigned seen_bits, const char *path);
@@ -864,11 +867,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 		/* Don't reopen a pack we already have. */
 		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
-			if (p) {
-				hashmap_entry_init(&p->packmap_ent, hash);
-				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
+			if (p)
 				install_packed_git(data->r, p);
-			}
 		}
 		free(pack_name);
 	}

-Peff
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Taylor Blau @ 2019-12-03 6:17 UTC
To: Jeff King
Cc: SZEDER Gábor, Colin Stolley, git

On Mon, Dec 02, 2019 at 02:42:31PM -0500, Jeff King wrote:
> On Mon, Dec 02, 2019 at 06:40:35PM +0100, SZEDER Gábor wrote:
>
> > > When loading packfiles on start-up, we traverse the internal packfile
> > > list once per file to avoid reloading packfiles that have already
> > > been loaded. This check runs in quadratic time, so for poorly
> > > maintained repos with a large number of packfiles, it can be pretty
> > > slow.
> > >
> > > Add a hashmap containing the packfile names as we load them so that
> > > the average runtime cost of checking for already-loaded packs becomes
> > > constant.
> > [...]
> > This patch breaks test 'gc --keep-largest-pack' in 't6500-gc.sh' when
> > run with GIT_TEST_MULTI_PACK_INDEX=1, because there is a duplicate
> > entry in '.git/objects/info/packs':
>
> Good catch. The issue is that we only add entries to the hashmap in
> prepare_packed_git(), but they may be added to the pack list by other
> callers of install_packed_git(). It probably makes sense to just push
> the hashmap maintenance down into that function, like below. That
> requires an extra strhash() when inserting a new pack, but I don't
> think that's a big deal.

Ah, great catch, and thanks for pointing it out. We have been running
this patch in production at GitHub for a few weeks now, but didn't
notice this because we never run tests with
'GIT_TEST_MULTI_PACK_INDEX=1' in the environment.

Perhaps in the future that might change, but I think that for now that
can explain why the failure wasn't noticed earlier.

> diff --git a/packfile.c b/packfile.c
> index 253559fa87..f0dc63e92f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
>  
>  	pack->next = r->objects->packed_git;
>  	r->objects->packed_git = pack;
> +
> +	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
> +	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
>  }
>  
>  void (*report_garbage)(unsigned seen_bits, const char *path);
> @@ -864,11 +867,8 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
>  		/* Don't reopen a pack we already have. */
>  		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
>  			p = add_packed_git(full_name, full_name_len, data->local);
> -			if (p) {
> -				hashmap_entry_init(&p->packmap_ent, hash);
> -				hashmap_add(&data->r->objects->pack_map, &p->packmap_ent);
> +			if (p)
>  				install_packed_git(data->r, p);
> -			}
>  		}
>  		free(pack_name);
>  	}
>
> -Peff

Thanks,
Taylor
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Jeff King @ 2019-12-03 15:34 UTC
To: Taylor Blau
Cc: SZEDER Gábor, Colin Stolley, git

On Mon, Dec 02, 2019 at 10:17:44PM -0800, Taylor Blau wrote:

> > Good catch. The issue is that we only add entries to the hashmap in
> > prepare_packed_git(), but they may be added to the pack list by other
> > callers of install_packed_git(). It probably makes sense to just push
> > the hashmap maintenance down into that function, like below. That
> > requires an extra strhash() when inserting a new pack, but I don't
> > think that's a big deal.
>
> Ah, great catch, and thanks for pointing it out. We have been running
> this patch in production at GitHub for a few weeks now, but didn't
> notice this because we never run tests with
> 'GIT_TEST_MULTI_PACK_INDEX=1' in the environment.
>
> Perhaps in the future that might change, but I think that for now that
> can explain why the failure wasn't noticed earlier.

It would trigger in other contexts, too: anywhere we call
install_packed_git() outside of prepare_packed_git(). I think the main
reason we didn't notice in the tests or in production is that:

  - it would generally require re-scanning the pack directory, which
    implies an unexpected missing object (or a racy repack), which is
    not likely to happen during the tests

  - in most cases duplicates don't impact the outcome of the command
    at all. A duplicate entry in the packed_git list would just mean
    an extra pack to check for lookups. Most objects would be found in
    other packs (including its doppelganger entry), so the duplicate
    would float to the bottom of the MRU order. It's only lookups for
    objects we don't have that would pay a penalty (and it would be
    relatively small).

But obviously it's still worth fixing.

-Peff
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Junio C Hamano @ 2019-12-03 16:04 UTC
To: Jeff King
Cc: SZEDER Gábor, Colin Stolley, git

Jeff King <peff@peff.net> writes:

> Good catch. The issue is that we only add entries to the hashmap in
> prepare_packed_git(), but they may be added to the pack list by other
> callers of install_packed_git(). It probably makes sense to just push
> the hashmap maintenance down into that function, like below.

Makes sense to me.

Let me locally squash your fix in and credit you with a helped-by
footer in the amended log message. Strictly speaking, this may
invalidate the perf numbers, but I do not think the scenario p5303
sets up alone is all that interesting anyway---if you have 10,000
packs, not just registering them (which is improved with this patch)
but using objects from them would be slower than necessary X-<.

Thanks.

-- >8 --
From: Colin Stolley <cstolley@runbox.com>
Date: Wed, 27 Nov 2019 16:24:53 -0600
Subject: [PATCH] packfile.c: speed up loading lots of packfiles

When loading packfiles on start-up, we traverse the internal packfile
list once per file to avoid reloading packfiles that have already
been loaded. This check runs in quadratic time, so for poorly
maintained repos with a large number of packfiles, it can be pretty
slow.

Add a hashmap containing the packfile names as we load them so that
the average runtime cost of checking for already-loaded packs becomes
constant.

Add a perf test to p5303 to show speed-up.

The existing p5303 test runtimes are dominated by other factors and do
not show an appreciable speed-up. The new test in p5303 clearly exposes
a speed-up in bad cases. In this test we create 10,000 packfiles and
measure the start-up time of git rev-parse, which does little else
besides load in the packs.

Here are the numbers for the new p5303 test:

Test                                        HEAD^             HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs  1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%

Signed-off-by: Colin Stolley <cstolley@runbox.com>
Helped-by: Jeff King <peff@peff.net>
[jc: squashed the change to call hashmap in install_packed_git() by peff]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 object-store.h             | 21 +++++++++++++++++++++
 object.c                   |  3 +++
 packfile.c                 | 19 ++++++++++---------
 t/perf/p5303-many-packs.sh | 18 ++++++++++++++++++
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/object-store.h b/object-store.h
index 7f7b3cdd80..55ee639350 100644
--- a/object-store.h
+++ b/object-store.h
@@ -60,6 +60,7 @@ struct oid_array *odb_loose_cache(struct object_directory *odb,
 void odb_clear_loose_cache(struct object_directory *odb);
 
 struct packed_git {
+	struct hashmap_entry packmap_ent;
 	struct packed_git *next;
 	struct list_head mru;
 	struct pack_window *windows;
@@ -88,6 +89,20 @@ struct packed_git {
 
 struct multi_pack_index;
 
+static inline int pack_map_entry_cmp(const void *unused_cmp_data,
+				     const struct hashmap_entry *entry,
+				     const struct hashmap_entry *entry2,
+				     const void *keydata)
+{
+	const char *key = keydata;
+	const struct packed_git *pg1, *pg2;
+
+	pg1 = container_of(entry, const struct packed_git, packmap_ent);
+	pg2 = container_of(entry2, const struct packed_git, packmap_ent);
+
+	return strcmp(pg1->pack_name, key ? key : pg2->pack_name);
+}
+
 struct raw_object_store {
 	/*
 	 * Set of all object directories; the main directory is first (and
@@ -131,6 +146,12 @@ struct raw_object_store {
 	/* A most-recently-used ordered version of the packed_git list. */
 	struct list_head packed_git_mru;
 
+	/*
+	 * A map of packfiles to packed_git structs for tracking which
+	 * packs have been loaded already.
+	 */
+	struct hashmap pack_map;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use

diff --git a/object.c b/object.c
index 3b8b8c55c9..142ef69399 100644
--- a/object.c
+++ b/object.c
@@ -479,6 +479,7 @@ struct raw_object_store *raw_object_store_new(void)
 
 	memset(o, 0, sizeof(*o));
 	INIT_LIST_HEAD(&o->packed_git_mru);
+	hashmap_init(&o->pack_map, pack_map_entry_cmp, NULL, 0);
 	return o;
 }
 
@@ -518,6 +519,8 @@ void raw_object_store_clear(struct raw_object_store *o)
 	INIT_LIST_HEAD(&o->packed_git_mru);
 	close_object_store(o);
 	o->packed_git = NULL;
+
+	hashmap_free(&o->pack_map);
 }
 
 void parsed_object_pool_clear(struct parsed_object_pool *o)
diff --git a/packfile.c b/packfile.c
index 355066de17..f0dc63e92f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -757,6 +757,9 @@ void install_packed_git(struct repository *r, struct packed_git *pack)
 
 	pack->next = r->objects->packed_git;
 	r->objects->packed_git = pack;
+
+	hashmap_entry_init(&pack->packmap_ent, strhash(pack->pack_name));
+	hashmap_add(&r->objects->pack_map, &pack->packmap_ent);
 }
 
 void (*report_garbage)(unsigned seen_bits, const char *path);
@@ -856,20 +859,18 @@ static void prepare_pack(const char *full_name, size_t full_name_len,
 
 	if (strip_suffix_mem(full_name, &base_len, ".idx") &&
 	    !(data->m && midx_contains_pack(data->m, file_name))) {
-		/* Don't reopen a pack we already have. */
-		for (p = data->r->objects->packed_git; p; p = p->next) {
-			size_t len;
-			if (strip_suffix(p->pack_name, ".pack", &len) &&
-			    len == base_len &&
-			    !memcmp(p->pack_name, full_name, len))
-				break;
-		}
+		struct hashmap_entry hent;
+		char *pack_name = xstrfmt("%.*s.pack", (int)base_len, full_name);
+		unsigned int hash = strhash(pack_name);
+		hashmap_entry_init(&hent, hash);
 
-		if (!p) {
+		/* Don't reopen a pack we already have. */
+		if (!hashmap_get(&data->r->objects->pack_map, &hent, pack_name)) {
 			p = add_packed_git(full_name, full_name_len, data->local);
 			if (p)
 				install_packed_git(data->r, p);
 		}
+		free(pack_name);
 	}
 
 	if (!report_garbage)
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
index 3779851941..ede78e19e2 100755
--- a/t/perf/p5303-many-packs.sh
+++ b/t/perf/p5303-many-packs.sh
@@ -84,4 +84,22 @@ do
 	'
 done
 
+# Measure pack loading with 10,000 packs.
+test_expect_success 'generate lots of packs' '
+	for i in $(test_seq 10000); do
+		echo "blob"
+		echo "data <<EOF"
+		echo "blob $i"
+		echo "EOF"
+		echo "checkpoint"
+	done |
+	git -c fastimport.unpackLimit=0 fast-import
+'
+
+# The purpose of this test is to evaluate load time for a large number
+# of packs while doing as little other work as possible.
+test_perf "load 10,000 packs" '
+	git rev-parse --verify "HEAD^{commit}"
+'
+
 test_done
-- 
2.24.0-578-g4820254054
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Colin Stolley @ 2019-12-03 17:33 UTC
To: Junio C Hamano
Cc: Jeff King, SZEDER Gábor, git

Junio C Hamano wrote:

> Let me locally squash your fix in and credit you with a helped-by
> footer in the amended log message. Strictly speaking, this may

I'm also happy to provide a khash version if that is still desired,
perhaps as a separate patch.

Thanks for spotting that bug and providing a fix.
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Jeff King @ 2019-12-03 22:18 UTC
To: Colin Stolley
Cc: Junio C Hamano, SZEDER Gábor, git

On Tue, Dec 03, 2019 at 11:33:34AM -0600, Colin Stolley wrote:

> Junio C Hamano wrote:
>
> > Let me locally squash your fix in and credit you with a helped-by
> > footer in the amended log message. Strictly speaking, this may
>
> I'm also happy to provide a khash version if that is still desired,
> perhaps as a separate patch.

Personally I'd be OK with it as-is, or with a conversion to khash (as
I said earlier, I prefer khash myself, but it's not like there aren't
tons of other hashmap.c users in the code).

-Peff
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
From: Junio C Hamano @ 2019-12-04 18:15 UTC
To: Jeff King
Cc: Colin Stolley, SZEDER Gábor, git

Jeff King <peff@peff.net> writes:

> On Tue, Dec 03, 2019 at 11:33:34AM -0600, Colin Stolley wrote:
>
>> Junio C Hamano wrote:
>>
>> > Let me locally squash your fix in and credit you with a helped-by
>> > footer in the amended log message. Strictly speaking, this may
>>
>> I'm also happy to provide a khash version if that is still desired,
>> perhaps as a separate patch.
>
> Personally I'd be OK with it as-is, or with a conversion to khash (as
> I said earlier, I prefer khash myself, but it's not like there aren't
> tons of other hashmap.c users in the code).

Exactly my feeling. Thanks.
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  From: Jeff King
  Date: 2019-12-03 22:17 UTC
  To: Junio C Hamano; Cc: SZEDER Gábor, Colin Stolley, git

On Tue, Dec 03, 2019 at 08:04:15AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > Good catch. The issue is that we only add entries to the hashmap in
> > prepare_packed_git(), but they may be added to the pack list by other
> > callers of install_packed_git(). It probably makes sense to just push
> > the hashmap maintenance down into that function, like below.
>
> Makes sense to me.
>
> Let me locally squash your fix in and credit you with helped-by
> footer in the amended log message. Strictly speaking, this may
> invalidate the perf numbers, but I do not think the scenario p5303
> sets up alone is all that interesting anyway---if you have 10,000
> packs, not just registering them (which is improved with this patch)
> but using objects from them would be slower than necessary X-<.

Thanks, that sounds good. I actually re-checked the perf numbers (mostly
to make sure I didn't screw anything up) and got similar results.

I agree that 10,000 packs is ridiculous, but we do see it (and worse)
occasionally from people pushing in a loop before our scheduled
maintenance kicks in. And it's quadratic, so if you hit 30k packs, it's
another factor of 9 worse. It makes even diagnosing the repository
pretty painful. :)

Also a fun fact: Linux actually has a limit on the number of
simultaneous mmaps that a process can have, which defaults to ~64k. But
if you have 32k packs, then we'll map both the packs and the idx files.
Plus whatever you need for mapping the binary, libraries, etc., plus any
maps opened by malloc() for large requests.

I have occasionally run into this trying to repack some very
out-of-control cases (the magic fix is to double the limit with
`sysctl -w vm.max_map_count=131060`, if you are curious).

I also wondered if this might be made worse by the recent change to drop
release_pack_memory(). But I ran into it even before that change,
because zlib calls malloc() directly. We're also pretty aggressive about
dying when mmap() returns an error (rather than closing packs and trying
again).

I think Git _could_ handle this more gracefully by just trying to keep
fewer packs open at one time (the way we similarly try not to use up all
of the file descriptors). But I have a hard time caring too much, since
it's such a ridiculous situation in the first place. Bumping the limits
is an easy operational fix.

-Peff
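[Editor's note: the quadratic cost discussed above comes from scanning the entire in-memory pack list once per pack being installed; the patch replaces that scan with a hashmap keyed on the pack name. A toy Python sketch of the two strategies (hypothetical stand-ins, not git's actual C code):]

```python
def install_packs_list(pack_names):
    """Before the patch: scan the whole list for each new pack, O(n^2) total."""
    packs = []
    for name in pack_names:
        if name not in packs:      # linear scan of every already-loaded pack
            packs.append(name)
    return packs

def install_packs_map(pack_names):
    """After the patch: one hash lookup per pack, O(n) total on average."""
    packs = []
    seen = set()                   # plays the role of raw_object_store.pack_map
    for name in pack_names:
        if name not in seen:       # constant-time membership check
            seen.add(name)
            packs.append(name)
    return packs

# Both produce the same pack list; only the lookup cost differs.
names = [f"pack-{i:04x}.pack" for i in range(1000)] * 2  # duplicates included
assert install_packs_list(names) == install_packs_map(names)
```

Timing the two over 10,000 names shows the same ~order-of-magnitude gap as the p5303 numbers in the patch, since `git rev-parse` start-up is dominated by exactly this loop.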
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  From: Jonathan Nieder
  Date: 2019-12-04 4:23 UTC
  To: Jeff King; Cc: Junio C Hamano, SZEDER Gábor, Colin Stolley, git, Martin Fick

Jeff King wrote:

> I agree that 10,000 packs is ridiculous, but we do see it (and worse)
> occasionally from people pushing in a loop before our scheduled
> maintenance kicks in.

On that subject: one thing Martin Fick (cc-ed) has suggested is moving
more of the garbage collection work "inline" into the push path. That
is, instead of letting someone push 10,000 packs in a loop, build
concatenated packs ("exponential rollup") in the push path and don't
return success and commit the packs into the object store until we're
done. That way a reasonably small amortized cost is paid up front by
the pusher instead of later by everyone.

Midx changes things a little: it might make sense to build concatenated
idxes instead of packs, which would still avoid the same quadratic
behavior.

Just a random thought,
Jonathan
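[Editor's note: the "exponential rollup" idea above can be illustrated with a toy simulation. The merge rule below (roll the newest pack into its neighbor until each pack is at least twice the size of the next) is a hypothetical sketch of one way to get the amortized behavior Jonathan describes, not anything git implements:]

```python
def push(packs, new_pack_size):
    """Simulate one push: append a pack, then roll up from the small end.

    packs is a list of pack sizes; the invariant maintained is that each
    pack is at least twice the size of the next, so n pushed objects
    leave only O(log n) packs behind instead of n.
    """
    packs.append(new_pack_size)
    while len(packs) >= 2 and packs[-2] < 2 * packs[-1]:
        packs[-2] += packs[-1]     # "concatenate" the two smallest packs
        packs.pop()
    return packs

packs = []
for _ in range(10_000):            # 10,000 single-object pushes in a loop
    packs = push(packs, 1)

# Instead of 10,000 packs on disk, only a logarithmic handful remain,
# and no objects are lost along the way.
assert sum(packs) == 10_000
assert len(packs) <= 14            # bounded by log2(10_000) + 1
```

For unit-size pushes this behaves like a binary counter: each rollup is a carry, so the pack count tracks the popcount of the total. The real cost trade-off is that the pusher pays for the merge inline, which is the amortization Jonathan is suggesting.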
* Re: [PATCH] packfile.c: speed up loading lots of packfiles.
  From: Taylor Blau
  Date: 2019-12-03 6:19 UTC
  To: Colin Stolley; Cc: git

Hi Colin,

On Wed, Nov 27, 2019 at 04:24:53PM -0600, Colin Stolley wrote:

> When loading packfiles on start-up, we traverse the internal packfile
> list once per file to avoid reloading packfiles that have already
> been loaded. This check runs in quadratic time, so for poorly
> maintained repos with a large number of packfiles, it can be pretty
> slow.

Thanks for sending your patches to the list, and for working on this! I
have been most excited to see this develop internally, and I'm glad that
the new paths have been exercised enough to send the patches.

So, I'm quite interested to see what you find in terms of wiring this
code up to work with khash. If that's something that you don't have time
for, please do let me know and I'd be happy to take it up myself.

Other than that, this has been thoroughly reviewed internally, and so I
don't have much more to add to this patch than what we have already
discussed.

Thanks,
Taylor
Thread overview: 15+ messages

2019-11-27 22:24 [PATCH] packfile.c: speed up loading lots of packfiles -- Colin Stolley
2019-11-28  0:42 ` hashmap vs khash? -- Eric Wong
2019-11-30 17:36   ` Junio C Hamano
2019-12-02 14:39     ` Jeff King
2019-12-02 17:40 ` SZEDER Gábor
2019-12-02 19:42   ` Jeff King
2019-12-03  6:17     ` Taylor Blau
2019-12-03 15:34       ` Jeff King
2019-12-03 16:04         ` Junio C Hamano
2019-12-03 17:33           ` Colin Stolley
2019-12-03 22:18             ` Jeff King
2019-12-04 18:15               ` Junio C Hamano
2019-12-03 22:17           ` Jeff King
2019-12-04  4:23             ` Jonathan Nieder
2019-12-03  6:19 ` Taylor Blau