From: Derrick Stolee <stolee@gmail.com> To: "SZEDER Gábor" <szeder.dev@gmail.com> Cc: "Jeff King" <peff@peff.net>, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>, "Stefan Beller" <sbeller@google.com>, git <git@vger.kernel.org>, "Duy Nguyen" <pclouds@gmail.com> Subject: Re: We should add a "git gc --auto" after "git clone" due to commit graph Date: Mon, 8 Oct 2018 14:29:47 -0400 [thread overview] Message-ID: <9ad5f166-f7c5-de79-0f86-f1f952cd33d2@gmail.com> (raw) In-Reply-To: <20181008181015.GA23446@szeder.dev> On 10/8/2018 2:10 PM, SZEDER Gábor wrote: > On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote: >> Nice! These numbers make sense to me, in terms of how many TREESAME queries >> we actually need to perform for such a query. > Yeah... because you didn't notice that I deliberately cheated :) > > As it turned out, it's not just about the number of diff queries that > we can spare, but, for the speedup _ratio_, it's more about how > expensive those diff queries are. > > git.git has a rather flat hierarchy, and 't/' is the 372th entry in > the current root tree object, while 'valgrind/' is the 923th entry, > and the diff machinery spends considerable time wading through the > previous entries. Notice the "carefully chosen path" remark in my > previous email; I think this particular path has the highest number of > preceeding tree entries, and, in addition, 't/' changes rather > frequently, so the diff machinery often has to scan two relatively big > tree objects. Had I chosen 'Documentation/RelNotes/1.5.0.1.txt' > instead, i.e. another path two directories deep, but whose leading > path components are both near the beginning of the tree objects, the > speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only" > ~5.7x instead of ~24.8x. This is expected. The performance ratio is better when the path is any of the following: 1. A very deep path (need to walk multiple trees to answer TREESAME) 2. An entry is late in a very wide tree (need to spend extra time parsing tree object) 3. The path doesn't change very often (need to inspect many TREESAME pairs before finding enough interesting commits) 4. Some sub-path changes often (so the TREESAME comparison needs to parse beyond that sub-path often) Our standard examples (Git and Linux repos) don't have many paths that have these properties. But: they do exist. In other projects, this is actually typical. Think about Java projects that frequently have ~5 levels of folders before actually touching a code file. When I was implementing the Bloom filter feature for Azure Repos, I ran performance tests on the Linux repo using a random sampling of paths. The typical speedup was 5x while some outliers were in the 25x range. > >>> But I'm afraid it will take a while until I get around to turn it into >>> something presentable... >> Do you have the code pushed somewhere public where one could take a look? I >> Do you have the code pushed somewhere public where one could take a >> look? I could provide some early feedback. > Nah, definitely not... I know full well how embarassingly broken this > implementation is, I don't need others to tell me that ;) There are two questions that I was hoping to answer by looking at your code: 1. How do you store your Bloom filter? Is it connected to the commit-graph and split on a commit-by-commit basis (storing "$path" as a key), or is it one huge Bloom filter (storing "$commitid:$path" as key)? 2. Where does your Bloom filter check plug into the TREESAME logic? I haven't investigated this part, but hopefully it isn't too complicated. Thanks, -Stolee
next prev parent reply other threads:[~2018-10-08 18:29 UTC|newest] Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-10-03 13:23 Ævar Arnfjörð Bjarmason 2018-10-03 13:36 ` SZEDER Gábor 2018-10-03 13:42 ` Derrick Stolee 2018-10-03 14:18 ` Ævar Arnfjörð Bjarmason 2018-10-03 14:01 ` Ævar Arnfjörð Bjarmason 2018-10-03 14:17 ` SZEDER Gábor 2018-10-03 14:22 ` Ævar Arnfjörð Bjarmason 2018-10-03 14:53 ` SZEDER Gábor 2018-10-03 15:19 ` Ævar Arnfjörð Bjarmason 2018-10-03 16:59 ` SZEDER Gábor 2018-10-05 6:09 ` Junio C Hamano 2018-10-10 22:07 ` SZEDER Gábor 2018-10-10 23:01 ` Ævar Arnfjörð Bjarmason 2018-10-03 19:08 ` Stefan Beller 2018-10-03 19:21 ` Jeff King 2018-10-03 20:35 ` Ævar Arnfjörð Bjarmason 2018-10-03 17:47 ` Stefan Beller 2018-10-03 18:47 ` Ævar Arnfjörð Bjarmason 2018-10-03 18:51 ` Jeff King 2018-10-03 18:59 ` Derrick Stolee 2018-10-03 19:18 ` Jeff King 2018-10-08 16:41 ` SZEDER Gábor 2018-10-08 16:57 ` Derrick Stolee 2018-10-08 18:10 ` SZEDER Gábor 2018-10-08 18:29 ` Derrick Stolee [this message] 2018-10-09 3:08 ` Jeff King 2018-10-09 13:48 ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee 2018-10-09 18:45 ` Ævar Arnfjörð Bjarmason 2018-10-09 18:46 ` Jeff King 2018-10-09 19:03 ` Derrick Stolee 2018-10-09 21:14 ` Jeff King 2018-10-09 23:12 ` Bloom Filters Jeff King 2018-10-09 23:13 ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King 2018-10-09 23:14 ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King 2018-10-10 0:48 ` Junio C Hamano 2018-10-11 3:13 ` Jeff King 2018-10-09 23:14 ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King 2018-10-10 0:58 ` Junio C Hamano 2018-10-11 3:20 ` Jeff King 2018-10-11 12:33 ` Bloom Filters Derrick Stolee 2018-10-11 13:43 ` Jeff King 2018-10-09 21:30 ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor 2018-10-09 19:34 ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor 2018-10-09 19:34 ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor 2018-10-09 19:34 ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor 2018-10-09 21:06 ` Jeff King 2018-10-09 21:37 ` SZEDER Gábor 2018-10-09 19:34 ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor 2018-10-09 19:34 ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor 2018-10-09 19:47 ` [PATCH 0/4] Bloom filter experiment Derrick Stolee 2018-10-11 1:21 ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan 2018-10-11 1:21 ` [PATCH 1/2] One filter per commit Jonathan Tan 2018-10-11 12:49 ` Derrick Stolee 2018-10-11 19:11 ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan 2018-10-11 1:21 ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan 2018-10-11 7:37 ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason 2018-10-15 14:39 ` [PATCH 0/4] Bloom filter experiment Derrick Stolee 2018-10-16 4:45 ` Junio C Hamano 2018-10-16 11:13 ` Derrick Stolee 2018-10-16 12:57 ` Ævar Arnfjörð Bjarmason 2018-10-16 13:03 ` Derrick Stolee 2018-10-18 2:00 ` Junio C Hamano 2018-10-16 23:41 ` Jonathan Tan 2018-10-08 23:02 ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano 2018-10-03 14:32 ` Duy Nguyen 2018-10-03 16:45 ` Duy Nguyen 2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason 2018-10-05 12:05 ` Derrick Stolee 2018-10-05 13:05 ` Ævar Arnfjörð Bjarmason 2018-10-05 13:45 ` Derrick Stolee 2018-10-05 14:04 ` Ævar Arnfjörð Bjarmason 2018-10-05 19:21 ` Jeff King 2018-10-05 19:41 ` Derrick Stolee 2018-10-05 19:47 ` Jeff King 2018-10-05 20:00 ` Derrick Stolee 2018-10-05 20:02 ` Jeff King 2018-10-05 20:01 ` Ævar Arnfjörð Bjarmason 2018-10-05 20:09 ` Jeff King
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=9ad5f166-f7c5-de79-0f86-f1f952cd33d2@gmail.com \ --to=stolee@gmail.com \ --cc=avarab@gmail.com \ --cc=git@vger.kernel.org \ --cc=pclouds@gmail.com \ --cc=peff@peff.net \ --cc=sbeller@google.com \ --cc=szeder.dev@gmail.com \ --subject='Re: We should add a "git gc --auto" after "git clone" due to commit graph' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).