* How to reduce pickaxe times for a particular repo?
@ 2022-06-28 10:50 Pavel Rappo
  2022-06-28 11:35 ` Ævar Arnfjörð Bjarmason
  2022-06-28 13:01 ` Derrick Stolee
  0 siblings, 2 replies; 6+ messages in thread
From: Pavel Rappo @ 2022-06-28 10:50 UTC (permalink / raw)
  To: Git mailing list

I have a repo of the following characteristics:

  * 1 branch
  * 100,000 commits
  * 1TB in size
  * The tip of the branch has 55,000 files
  * No new commits are expected: the repo is abandoned and kept for
    archaeological purposes.

Typically, a `git log -S/-G` lookup takes around a minute to complete.
I would like to significantly reduce that time. How can I do that? I
can spend up to 10x more disk space, if required. The machine has 10
cores and 32GB of RAM.

Thanks,
-Pavel


* Re: How to reduce pickaxe times for a particular repo?
  2022-06-28 10:50 How to reduce pickaxe times for a particular repo? Pavel Rappo
@ 2022-06-28 11:35 ` Ævar Arnfjörð Bjarmason
  2022-06-28 12:35   ` Pavel Rappo
  2022-06-28 13:01 ` Derrick Stolee
  1 sibling, 1 reply; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-28 11:35 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list


On Tue, Jun 28 2022, Pavel Rappo wrote:

> I have a repo of the following characteristics:
>
>   * 1 branch
>   * 100,000 commits
>   * 1TB in size
>   * The tip of the branch has 55,000 files
>   * No new commits are expected: the repo is abandoned and kept for
> archaeological purposes.
>
> Typically, a `git log -S/-G` lookup takes around a minute to complete.
> I would like to significantly reduce that time. How can I do that? I
> can spend up to 10x more disk space, if required. The machine has 10
> cores and 32GB of RAM.

In git as it stands now, the main thing you can do is to limit your
search by paths. If you use the commit-graph and have a git that's using
"commitGraph.readChangedPaths" (defaults to true), doing e.g.:

    git log -p -G<rx> -- tests/

can really help, as can any other filter, such as --author.
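A minimal sketch of that setup, assuming a git new enough to write
changed-path Bloom filters (the --changed-paths option of "git
commit-graph write"). The helper names, the "frotz" pattern and the
repository path are made up for illustration; Python is only used to
build and run the command lines.

```python
import subprocess

def commit_graph_argv():
    """Command that (re)writes the commit-graph with changed-path Bloom filters."""
    return ["git", "commit-graph", "write", "--reachable", "--changed-paths"]

def pickaxe_argv(regex, pathspecs):
    """`git log -p -G<regex> -- <pathspec>...`, limited to the given paths."""
    return ["git", "log", "-p", "-G" + regex, "--", *pathspecs]

def run(argv, repo="."):
    """Run a git command inside `repo` and return its standard output."""
    done = subprocess.run(argv, cwd=repo, check=True, capture_output=True,
                          text=True, errors="replace")
    return done.stdout

# Hypothetical usage; "frotz", "tests/" and the repo path are placeholders:
#   run(commit_graph_argv(), repo="/path/to/repo")  # one-off for a frozen repo
#   print(run(pickaxe_argv("frotz", ["tests/"]), repo="/path/to/repo"))
```

Since the repository no longer receives commits, the commit-graph would
only need to be written once.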

But eventually you'll simply run into the regex engine being slow. If
you're feeling very adventurous, I have a very WIP branch to make this a
lot faster by making -S and -G use PCREv2 as a backend:
http://github.com/avar/git/tree/avar/pcre2-conversion-of-diffcore-pickaxe

Benchmark results (made sometime last year) were:

    Test                                                                      origin/next       HEAD
    ------------------------------------------------------------------------------------------------------------------
    4209.1: git log -S'int main' <limit-rev>..                                0.38(0.36+0.01)   0.37(0.33+0.04) -2.6%
    4209.2: git log -S'æ' <limit-rev>..                                       0.51(0.47+0.04)   0.32(0.27+0.05) -37.3%
    4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>..         0.72(0.68+0.03)   0.57(0.54+0.03) -20.8%
    4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..          0.60(0.55+0.02)   0.39(0.34+0.05) -35.0%
    4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..       0.43(0.40+0.03)   0.50(0.44+0.06) +16.3%
    4209.6: git log -G'(int|void|null)' <limit-rev>..                         0.64(0.55+0.09)   0.63(0.56+0.05) -1.6%
    4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>..                          0.64(0.59+0.05)   0.63(0.56+0.06) -1.6%
    4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                       0.63(0.54+0.08)   0.62(0.55+0.06) -1.6%
    4209.9: git log -i -S'int main' <limit-rev>..                             0.39(0.35+0.03)   0.38(0.35+0.02) -2.6%
    4209.10: git log -i -S'æ' <limit-rev>..                                   0.39(0.33+0.06)   0.32(0.28+0.04) -17.9%
    4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>..     0.90(0.84+0.05)   0.58(0.53+0.04) -35.6%
    4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..      0.71(0.64+0.06)   0.40(0.37+0.03) -43.7%
    4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..   0.43(0.40+0.03)   0.50(0.46+0.04) +16.3%
    4209.14: git log -i -G'(int|void|null)' <limit-rev>..                     0.64(0.57+0.06)   0.62(0.56+0.05) -3.1%
    4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>..                      0.65(0.59+0.06)   0.63(0.54+0.08) -3.1%
    4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                   0.63(0.55+0.08)   0.62(0.56+0.05) -1.6%

So it's much faster on some queries in particular (and a bit slower on
a couple). I don't think that code is ready for git.git in its current
form, but if you're desperate for performance and need to run ad-hoc
queries...

I don't know the full shape of your repo but 1TB in size probably means
some very big files? I think you might want to experiment with e.g. a
filtered repo to filter out big blobs or something else you may be
needlessly searching through (binaries?).

I.e. I think you're probably getting a lot of OS cache churn, where we
can't have the working data in memory for your whole search, so you're
mainly I/O bound.

I did want to (as a future infinite time project) create a search index
for regexes in git for -S and -G, i.e. we'd store something like
trigrams of potentially matchable content, so we could skip commits &
trees quickly if the diff e.g. didn't contain the fixed string "int" or
whatever.

But that's a much bigger project...

If you're really desperate for performance & willing to hack on
something custom you could emulate that with a hacky solution, e.g.:

 1. Create a COMMIT=DIFF pair for all commits in your repo, or e.g.
    PATH=DIFF (so one concat'd diff with all modifications ever to a
    given path)

 2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
    might make this easy. Make sure not to "store documents" in the
    index, you just want the reverse index from say "int" to "documents"
    that contain it.

 3. Do a two-step search, where a search like "foo.*bar" is first run
    against that index, where you find say all commits that have "foo" in
    the diff OR "bar" in the diff, ditto changed paths.

 4. Feed that list into the "real" git log -S or -G search, either
    limiting by commits, or by paths (taking advantage of the
    commit-graph path index).
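The four steps above can be sketched in a few lines, with a plain
in-memory dictionary standing in for Lucene/ElasticSearch. Everything
here is an assumption for illustration: the function names, the toy
diffs, and the choice to feed candidate commits to "git log --no-walk
--stdin" in step 4. The index only ever yields a superset of the real
matches, which is fine because git re-checks the candidates.

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(diffs):
    """Steps 1+2: map each trigram to the commits whose diff contains it.

    `diffs` maps a commit id to its diff text; only the reverse index is
    kept, the diffs themselves are not stored.
    """
    index = defaultdict(set)
    for commit, diff in diffs.items():
        for tri in trigrams(diff):
            index[tri].add(commit)
    return index

def commits_with_literal(index, literal, all_commits):
    """Commits whose diff contains every trigram of `literal`.

    Literals shorter than three characters cannot be filtered, so all
    commits are returned for them.
    """
    result = set(all_commits)
    for tri in trigrams(literal):
        result &= index.get(tri, set())
    return result

def candidates(index, literals, all_commits):
    """Step 3: commits matching any of the literals pulled out of the regex."""
    found = set()
    for literal in literals:
        found |= commits_with_literal(index, literal, all_commits)
    return found

def pickaxe_argv(regex):
    """Step 4: re-check only the candidates; commit ids are fed on stdin."""
    return ["git", "log", "--no-walk", "-G" + regex, "--stdin"]

# Toy data; real input would come from something like `git log -p`.
diffs = {"c1": "+int main(void)", "c2": "-return foo;", "c3": "+bar = 1;"}
index = build_index(diffs)
print(sorted(candidates(index, ["foo", "bar"], diffs)))
```

For a regex like "foo.*bar", where both literals must be present,
intersecting the per-literal sets instead of taking their union would
narrow the candidate list further.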

For someone familiar with the tools involved that should be about a day
to get to a rough hacky solution, it's mostly gluing existing OTS
software together.

You should be able to get your searches down to the tens of milliseconds
range with that if you also carefully manage which parts are in cache,
but it depends a lot on the exact shape of data in your repo, how much
memory you have etc.


* Re: How to reduce pickaxe times for a particular repo?
  2022-06-28 11:35 ` Ævar Arnfjörð Bjarmason
@ 2022-06-28 12:35   ` Pavel Rappo
  2022-06-29 12:31     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 6+ messages in thread
From: Pavel Rappo @ 2022-06-28 12:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git mailing list

On Tue, Jun 28, 2022 at 12:58 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:

<snip>

> But eventually you'll simply run into the regex engine being slow

Since I know very little about git internals, I was under the naive
impression that a significant portion of pickaxe's time, perhaps
comparable to the time spent in the regex engine, goes to computing
diffs between revisions. So I assumed that there was a way to
pre-compute those diffs.

<snip>

>  2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
>     might make this easy.

<snip>

> For someone familiar with the tools involved that should be about a day
> to get to a rough hacky solution, it's mostly gluing existing OTS
> software together.

<snip>

I'll see what I can do with external systems. You see, I initially
came from a similar repository exposed through OpenGrok. But I think
that something was wrong with the index or query syntax because I
couldn't find the things that I knew were there. I was able to secure
a git repo that was close to that of OpenGrok as I found pickaxe to be
a robust, albeit slow, alternative for my searches.

Thanks for the suggestion.


* Re: How to reduce pickaxe times for a particular repo?
  2022-06-28 10:50 How to reduce pickaxe times for a particular repo? Pavel Rappo
  2022-06-28 11:35 ` Ævar Arnfjörð Bjarmason
@ 2022-06-28 13:01 ` Derrick Stolee
  2022-07-01 18:21   ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2022-06-28 13:01 UTC (permalink / raw)
  To: Pavel Rappo, Git mailing list

On 6/28/2022 6:50 AM, Pavel Rappo wrote:

Hi Pavel! Welcome.

> I have a repo of the following characteristics:
> 
>   * 1 branch
>   * 100,000 commits

This is not too large.

>   * 1TB in size

This _is_ large.

>   * The tip of the branch has 55,000 files

And again, this is not large.

This means you have some very large files in your repo, perhaps
even binary files that you don't intend to search.

>   * No new commits are expected: the repo is abandoned and kept for
> archaeological purposes.
> 
> Typically, a `git log -S/-G` lookup takes around a minute to complete.
> I would like to significantly reduce that time. How can I do that? I
> can spend up to 10x more disk space, if required. The machine has 10
> cores and 32GB of RAM.

You are using -S<string> or -G<regex> to see which commits change the
number of matches of that <string> or <regex>. If you don't provide a
pathspec, then Git will search every changed file, including those
very large binary files.

Perhaps you'd like to start by providing a pathspec that limits the
search to only the meaningful code files?

As far as I know, Git doesn't have any data structures that can speed
up content-based matches like this. The commit-graph's content-changed
Bloom filters only help Git with questions like "did this specific file
change?" which is not going to be a critical code path in what you're
describing.

I'm not sure what you're actually trying to ask with -S or -G, so maybe
it is worth considering other types of queries, such as -L<n>,<m>:<file>
or something. This is just a shot in the dark, as you might be doing the
only thing you _can_ do to solve your problem.

Thanks,
-Stolee


* Re: How to reduce pickaxe times for a particular repo?
  2022-06-28 12:35   ` Pavel Rappo
@ 2022-06-29 12:31     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-29 12:31 UTC (permalink / raw)
  To: Pavel Rappo; +Cc: Git mailing list


On Tue, Jun 28 2022, Pavel Rappo wrote:

> On Tue, Jun 28, 2022 at 12:58 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>
> <snip>
>
>> But eventually you'll simply run into the regex engine being slow
>
> Since I know very little about git internals, I was under a naive
> impression that a significant, if not comparable to that of regex,
> portion of pickaxe's time is spent on computing diffs between
> revisions. So I assumed that there was a way to pre-compute those
> diffs.

Yes and no, maybe sort of :)

Firstly, -S doesn't involve a diff: it compares the raw pre- and
post-images and counts how many times we match in each.

-G does involve computing the diff.
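
To make the difference concrete, here is a rough model of the two
checks for a single changed file. It is a sketch of the documented
semantics, not of git's implementation: Python's difflib and re stand
in for git's diff machinery and regex engine, and the function names
are invented.

```python
import difflib
import re

def pickaxe_s(pre, post, needle):
    """-S: does the number of occurrences of `needle` differ between images?"""
    return pre.count(needle) != post.count(needle)

def pickaxe_g(pre, post, pattern):
    """-G: does any added or removed line of the diff match `pattern`?"""
    rx = re.compile(pattern)
    diff = difflib.ndiff(pre.splitlines(), post.splitlines())
    return any(line[:2] in ("+ ", "- ") and rx.search(line[2:]) for line in diff)

# A line that is rewritten but keeps the searched-for call:
pre = "hit = frotz(nitfol, mf2.ptr, 1, 0);\n"
post = "return frotz(nitfol, two->ptr, 1, 0);\n"
print(pickaxe_s(pre, post, "frotz(nitfol"))    # occurrence count is 1 on both sides
print(pickaxe_g(pre, post, r"frotz\(nitfol"))  # both the "-" and "+" line match
```

So -S stays silent for this change while -G reports it.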

On the one hand we're fast at making diffs, so that really shouldn't
be significant compared to the time spent in the regex engine.

The other side of this is that we're really stupid about how we invoke
the regex engine (historical reasons, backwards compatibility & all
that). We:

 * Aren't compiling the regex once, and using it N times in some cases
   (I have some local patches to fix this)
 * Are computing matches one line at a time, when we could e.g. point
   PCRE to an entire diff with the right line-split options.
 * Are often doing needless work, e.g. in v2.33 I solved an issue with
   us continuing to create diffs when we could abort early (see
   f97fe358576 (pickaxe -G: don't special-case create/delete,
   2021-04-12)), which resulted in some speed-up.

Some of these are tricky to fix.
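
As a rough illustration of the first two points, compare a matcher that
compiles the pattern for every line with one that compiles once and
scans the whole buffer. This is only a structural sketch: Python's re
module stands in for PCRE, the function names are invented, and since
re caches compiled patterns internally the timing difference will not
show up here the way it would in C.

```python
import re

def match_line_by_line(diff_text, pattern):
    """Compile inside the loop and test one line at a time."""
    for line in diff_text.splitlines():
        if re.compile(pattern).search(line):
            return True
    return False

def match_whole_buffer(diff_text, compiled):
    """One search over the whole buffer with a pattern compiled once."""
    return compiled.search(diff_text) is not None

text = "+int x;\n+if (flags & MASK)\n"
# "\n" is excluded from the character class so a match cannot span lines.
pattern = r"if *\([^ \n]+ & "
compiled = re.compile(pattern, re.MULTILINE)  # ^ and $ anchor at each line

print(match_line_by_line(text, pattern), match_whole_buffer(text, compiled))
```

The caveat in the comment is the "right line-split options" part: a
class like [^ ] also matches a newline, so without care the
whole-buffer variant can match across lines where the per-line variant
cannot.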
> <snip>
>
>>  2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
>>     might make this easy.
>
> <snip>
>
>> For someone familiar with the tools involved that should be about a day
>> to get to a rough hacky solution, it's mostly gluing existing OTS
>> software together.
>
> <snip>
>
> I'll see what I can do with external systems. You see, I initially
> came from a similar repository exposed through OpenGrok. But I think
> that something was wrong with the index or query syntax because I
> couldn't find the things that I knew were there. I was able to secure
> a git repo that was close to that of OpenGrok as I found pickaxe to be
> a robust, albeit slow, alternative for my searches.

This is the first time I've heard of OpenGrok, so no idea, sorry.

One common pitfall with search indexes is that they tend to have a
blacklist of words (stop words) that are dropped at index and query
time, e.g. Lucene will have "for", "or" and other common English words
as part of its defaults, so if you're trying to e.g. find when you
altered a for-loop you might silently be getting no results.
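
A toy analyzer shows the effect. The stop-word set below is an
illustrative subset chosen for this example, not a copy of any
particular analyzer's defaults, and the tokenizer is a deliberately
crude stand-in.

```python
import re

# Illustrative subset of an English stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "for", "if", "in", "is", "not", "of", "or", "the", "to"}

def analyze(text):
    """Lowercase, split on non-word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(analyze("for (i = 0; i < n; i++)"))  # the "for" token is gone
print(analyze("for or"))                   # nothing left to search for
```

A query made up only of such words ends up empty, which looks the same
as "no results".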


* Re: How to reduce pickaxe times for a particular repo?
  2022-06-28 13:01 ` Derrick Stolee
@ 2022-07-01 18:21   ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2022-07-01 18:21 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Pavel Rappo, Git mailing list

On Tue, Jun 28, 2022 at 09:01:17AM -0400, Derrick Stolee wrote:

> > Typically, a `git log -S/-G` lookup takes around a minute to complete.
> > I would like to significantly reduce that time. How can I do that? I
> > can spend up to 10x more disk space, if required. The machine has 10
> > cores and 32GB of RAM.
> 
> You are using -S<string> or -G<regex> to see which commits change the
> number of matches of that <string> or <regex>. If you don't provide a
> pathspec, then Git will search every changed file, including those
> very large binary files.
> 
> Perhaps you'd like to start by providing a pathspec that limits the
> search to only the meaningful code files?

I think "-S" will search every file, since it's just counting instances
of the token in each file. But "-G" does a diff first, so it skips
binary files. So you could probably speed it up in general with a
.gitattributes that marks large binary files as such. Sort of the same
concept as your pathspec suggestion (which is a good one), but you don't
have to remember to add it to each invocation. :)
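
A small sketch of that idea, assuming a non-bare repository and using
the untracked .git/info/attributes file so that no commit is needed in
a frozen repo. The helper names and the example patterns are invented;
the patterns would have to match whatever the large binaries in the
repo are actually called.

```python
from pathlib import Path

def binary_attribute_lines(patterns):
    """One attributes line per pattern, using the built-in `binary` macro."""
    return [f"{pattern} binary" for pattern in patterns]

def append_attributes(repo, patterns):
    """Append the lines to <repo>/.git/info/attributes and return that path."""
    info_dir = Path(repo) / ".git" / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    target = info_dir / "attributes"
    with open(target, "a", encoding="utf-8") as handle:
        for line in binary_attribute_lines(patterns):
            handle.write(line + "\n")
    return target

# Hypothetical usage; the patterns are placeholders:
#   append_attributes("/path/to/repo", ["*.iso", "*.tar.gz", "assets/**"])
```

As noted above, this should only help -G; -S would still read those
files.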

-Peff

