* [PATCH/RFC] gitperformance: add new documentation about git performance tuning
@ 2017-04-03 21:16 Ævar Arnfjörð Bjarmason
  2017-04-03 21:34 ` Eric Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-03 21:16 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

Add a new manpage that gives an overview of how to tweak git's
performance.

There's currently no good single resource for things a git site
administrator might want to look into to improve performance for his
site & his users. This unfinished documentation aims to be the first
thing someone might want to look at when investigating ways to improve
git performance.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

I've been wanting to get something like this started for a while. It's
obviously woefully incomplete. Pointers about what to include would be
great & whether including something like this makes sense.

Things I have on my TODO list:

 - Add a section discussing how refs impact performance, suggest
   e.g. archiving old tags if possible, or at least run "git remote
   prune origin" regularly on clients.

 - Discuss split index a bit, although I'm not very confident in
   describing what its pros & cons are.

 - Should we be covering good practices for your repo going forward to
   maintain good performance? E.g. don't have some huge tree all in
   one directory (use subdirs), don't add binary (rather
   un-delta-able) content if you can help it etc.

 - The new core.checksumIndex option being discussed on-list. Which
   actually drove me to finally write this up (hrm, this sounds useful,
   but unless I was watching the list I'd probably never see it...).
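
As a rough sketch of the ref housekeeping mentioned in the first item
of the list: the script below builds a throwaway upstream and clone
(the names "upstream", "client" and "topic" are made up for
illustration), deletes a branch upstream, and shows that the clone
keeps the stale remote-tracking ref until "git remote prune origin" is
run.

```shell
#!/bin/sh
# Illustration only: every repository and branch name here is made up.
set -e
work=$(mktemp -d)
cd "$work"

# A tiny upstream with one commit and an extra branch.
git init -q upstream
git -C upstream -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial"
git -C upstream branch topic

# The clone picks up refs/remotes/origin/topic.
git clone -q upstream client

# Upstream deletes the branch; the clone's copy of the ref is now stale.
git -C upstream branch -q -D topic
before=$(git -C client for-each-ref refs/remotes/origin | wc -l)

# Drop remote-tracking refs that no longer exist upstream.
git -C client remote prune origin
after=$(git -C client for-each-ref refs/remotes/origin | wc -l)

echo "remote-tracking refs before prune: $before, after: $after"
```

On a real client the stale refs accumulate over months, so the
difference between the two counts can be much larger than in this toy
setup.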


 Documentation/Makefile           |   1 +
 Documentation/gitperformance.txt | 107 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/gitperformance.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index b5be2e2d3f..528aa22354 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -23,6 +23,7 @@ MAN5_TXT += gitrepository-layout.txt
 MAN5_TXT += gitweb.conf.txt
 
 MAN7_TXT += gitcli.txt
+MAN7_TXT += gitperformance.txt
 MAN7_TXT += gitcore-tutorial.txt
 MAN7_TXT += gitcredentials.txt
 MAN7_TXT += gitcvs-migration.txt
diff --git a/Documentation/gitperformance.txt b/Documentation/gitperformance.txt
new file mode 100644
index 0000000000..0548d1e721
--- /dev/null
+++ b/Documentation/gitperformance.txt
@@ -0,0 +1,107 @@
+giteveryday(7)
+==============
+
+NAME
+----
+gitperformance - How to improve Git's performance
+
+SYNOPSIS
+--------
+
+A guide to improving Git's performance beyond the defaults.
+
+DESCRIPTION
+-----------
+
+Git is mostly performant by default, but ships with various
+configuration options, command-line options, etc. that might improve
+performance, but for various reasons aren't on by default.
+
+This document provides a brief overview of these features.
+
+The reader should not assume that turning on all of these features
+will increase performance. Depending on the repository, workload &
+use-case, turning some of them on might severely harm performance.
+
+This document serves as a starting point for things to look into when
+it comes to improving performance, not as a checklist for things to
+enable or disable.
+
+Performance by topic
+--------------------
+
+It can be hard to divide the performance features into topics, but
+most of them fall into various well-defined buckets. E.g. there are
+features that help with the performance of "git status", and couldn't
+possibly impact repositories without working copies, and then some
+that only impact the performance of cloning from a server, or help the
+server itself etc.
+
+git status
+~~~~~~~~~~
+
+Running "git status" requires traversing the working tree & comparing
+it with the index. Several configuration options can help with its
+performance, with some trade-offs.
+
+- config: "core.untrackedCache=true" (see linkgit:git-config[1]) can
+  save on `stat(2)` calls by caching the mtime of filesystem
+  directories and, if a directory didn't change, avoiding recursing
+  into it to `stat(2)` every file it contains.
++
+pros: Can drastically speed up "git status".
++
+cons: There's a speed hit for initially populating & maintaining the
+cache. Doesn't work on all filesystems (see `--test-untracked-cache`
+in linkgit:git-update-index[1]).
+
+- config: "status.showUntrackedFiles=no" (see
+  linkgit:git-config[1]). Skips looking for files in the working tree
+  git doesn't already know about.
++
+pros: Speeds up "git status" by making it do a lot less work.
++
+cons: If there are any new & untracked files anywhere in the working
+tree, they won't be noticed by git. Makes it easy to accidentally miss
+files to "git add" before committing, or files which might impact the
+code in the working tree, but which git won't know exist.
+
+git grep
+~~~~~~~~
+
+- config: "grep.patternType=perl" (see linkgit:git-config[1]) will use
+  the PCRE library when "git grep" is invoked by default. This can be
+  faster than POSIX regular expressions in many cases.
++
+pros: Can, depending on the use-case, be faster than default "git grep".
++
+cons: Can also be slower, and in some edge cases produce different
+results.
+
+- config: "grep.threads=*" (see linkgit:git-config[1] &
+  linkgit:git-grep[1]). Tunes the number of "git grep" worker threads.
++
+pros: Giving this a more optimal value might result in a faster grep.
++
+cons: It might not.
+
+Server options to help clients
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These features can be enabled on git servers, they won't help the
+performance of the servers themselves, but will help clients that need
+to talk to those servers.
+
+- config: "repack.writeBitmaps=true" (see
+  linkgit:git-config[1]). Spend more time during repack to produce
+  bitmap index, helps clients with "fetch" & "clone" performance.
++
+pros: Once enabled & run regularly as part of "git repack" speeds up
+"clone" and "fetch".
++
+cons: Takes extra time during repack, requires doing full
+non-incremental repacks with `-A` or `-a`.
+
+GIT
+---
+Part of the linkgit:git[1] suite
-- 
2.11.0
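
For concreteness, the "git status" and "git grep" options described in
the patch can be tried in a scratch repository. This is a minimal
sketch, not a recommended configuration: the thread count of 4 and the
file name are arbitrary, and grep.patternType=perl only takes effect
at grep time on a git built with PCRE support.

```shell
#!/bin/sh
# Illustration only: a scratch repository, so no real configuration is touched.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# git status: cache directory mtimes so unchanged directories are skipped.
# "git update-index --test-untracked-cache" checks filesystem support first.
git config core.untrackedCache true

# git grep: PCRE by default (needs a git built with libpcre), and an
# explicit worker thread count; 4 is an arbitrary value, not a recommendation.
git config grep.patternType perl
git config grep.threads 4

# git status: stop looking for untracked files altogether.
git config status.showUntrackedFiles no

touch new-file.txt
hidden=$(git status --porcelain)        # empty: the new file goes unreported
shown=$(git status --porcelain -uall)   # the command-line flag overrides the config
echo "with showUntrackedFiles=no: '$hidden'"
echo "with -uall: '$shown'"
```

The last two commands illustrate the "cons" of
status.showUntrackedFiles=no: the new file is only reported when
untracked files are requested explicitly.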



* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:16 [PATCH/RFC] gitperformance: add new documentation about git performance tuning Ævar Arnfjörð Bjarmason
@ 2017-04-03 21:34 ` Eric Wong
  2017-04-03 21:57   ` Ævar Arnfjörð Bjarmason
  2017-04-04 15:07 ` Jeff Hostetler
  2017-04-05 12:56 ` Duy Nguyen
  2 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2017-04-03 21:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Junio C Hamano, Jeff King

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> Add a new manpage that gives an overview of how to tweak git's
> performance.
> 
> There's currently no good single resource for things a git site
> administrator might want to look into to improve performance for his
> site & his users. This unfinished documentation aims to be the first
> thing someone might want to look at when investigating ways to improve
> git performance.
> 
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
> 
> I've been wanting to get something like this started for a while. It's
> obviously woefully incomplete. Pointers about what to include would be
> great & whether including something like this makes sense.

Thanks for doing this.  I hope something like this can give
server operators more confidence to host their own git servers.

> Things I have on my TODO list:

<snip>

>  - Should we be covering good practices for your repo going forward to
>    maintain good performance? E.g. don't have some huge tree all in
>    one directory (use subdirs), don't add binary (rather
>    un-delta-able) content if you can help it etc.

Yes, I think so.

I think avoiding ever growing ChangeLog-type files should also
be added to things to avoid.

> --- /dev/null
> +++ b/Documentation/gitperformance.txt
> @@ -0,0 +1,107 @@
> +giteveryday(7)

gitperformance(7)

> +Server options to help clients
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +These features can be enabled on git servers, they won't help the
> +performance of the servers themselves,

Is that true for bitmaps?  I thought they reduced CPU usage on
the server side...

A sidenote: I wonder if bitmaps should be the default for bare
repos, since bare repos are likely used on servers.

> but will help clients that need
> +to talk to those servers.
> +
> +- config: "repack.writeBitmaps=true" (see
> +  linkgit:git-config[1]). Spend more time during repack to produce
> +  bitmap index, helps clients with "fetch" & "clone" performance.


* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:34 ` Eric Wong
@ 2017-04-03 21:57   ` Ævar Arnfjörð Bjarmason
  2017-04-03 22:39     ` Eric Wong
  2017-04-04  2:19     ` Jeff King
  0 siblings, 2 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-03 21:57 UTC (permalink / raw)
  To: Eric Wong; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Vicent Marti

On Mon, Apr 3, 2017 at 11:34 PM, Eric Wong <e@80x24.org> wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> Add a new manpage that gives an overview of how to tweak git's
>> performance.
>>
>> There's currently no good single resource for things a git site
>> administrator might want to look into to improve performance for his
>> site & his users. This unfinished documentation aims to be the first
>> thing someone might want to look at when investigating ways to improve
>> git performance.
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>
>> I've been wanting to get something like this started for a while. It's
>> obviously woefully incomplete. Pointers about what to include would be
>> great & whether including something like this makes sense.
>
> Thanks for doing this.  I hope something like this can give
> server operators more confidence to host their own git servers.
>
>> Things I have on my TODO list:
>
> <snip>
>
>>  - Should we be covering good practices for your repo going forward to
>>    maintain good performance? E.g. don't have some huge tree all in
>>    one directory (use subdirs), don't add binary (rather
>>    un-delta-able) content if you can help it etc.
>
> Yes, I think so.

I'll try to write something up.

> I think avoiding ever growing ChangeLog-type files should also
> be added to things to avoid.

How are those bad, specifically? They should delta quite well. It's
expensive to commit large files, but no more so because they're
ever-growing.

One issue with e.g. storing logs (I keep my IRC logs in git) is that
if you're constantly committing large (text) files without repack your
.git grows by a *lot* in a very short amount of time until a very
expensive repack, so now I split my IRC logs by month.

But I'm probably forgetting some obvious case where the ChangeLog
use-case is bad.

>> --- /dev/null
>> +++ b/Documentation/gitperformance.txt
>> @@ -0,0 +1,107 @@
>> +giteveryday(7)
>
> gitperformance(7)

Oops, thanks.

>> +Server options to help clients
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +These features can be enabled on git servers, they won't help the
>> +performance of the servers themselves,
>
> Is that true for bitmaps?  I thought they reduced CPU usage on
> the server side...

I'm not sure, JK? From my reading of the repack.writeBitmaps docs it
seems to only help clone/fetch for the client, but maybe they do more
than that.

I also see we should mention pack.writeBitmapHashCache, which
according to my reading of v2.0.0-rc0~13^2~8 only helps clone/fetch.

> A sidenote: I wonder if bitmaps should be the default for bare
> repos, since bare repos are likely used on servers.
>
>> but will help clients that need
>> +to talk to those servers.
>> +
>> +- config: "repack.writeBitmaps=true" (see
>> +  linkgit:git-config[1]). Spend more time during repack to produce
>> +  bitmap index, helps clients with "fetch" & "clone" performance.


* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:57   ` Ævar Arnfjörð Bjarmason
@ 2017-04-03 22:39     ` Eric Wong
  2017-04-04 21:12       ` Ævar Arnfjörð Bjarmason
  2017-04-04  2:19     ` Jeff King
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Wong @ 2017-04-03 22:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Vicent Marti

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Mon, Apr 3, 2017 at 11:34 PM, Eric Wong <e@80x24.org> wrote:
> > Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>  - Should we be covering good practices for your repo going forward to
> >>    maintain good performance? E.g. don't have some huge tree all in
> >>    one directory (use subdirs), don't add binary (rather
> >>    un-delta-able) content if you can help it etc.
> >
> > Yes, I think so.
> 
> I'll try to write something up.
> 
> > I think avoiding ever growing ChangeLog-type files should also
> > be added to things to avoid.
> 
> How are those bad, specifically? They should delta quite well. It's
> expensive to commit large files, but no more so because they're
> ever-growing.

It might be blame/annotate specifically, I was remembering this
thread from a decade ago:

  https://public-inbox.org/git/4aca3dc20712110933i636342fbifb15171d3e3cafb3@mail.gmail.com/T/

> One issue with e.g. storing logs (I keep my IRC logs in git) is that
> if you're constantly committing large (text) files without repack your
> .git grows by a *lot* in a very short amount of time until a very
> expensive repack, so now I split my IRC logs by month.

Yep, that too; as auto GC is triggered by the number of loose
objects, not the size/packability of them.
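
A small sketch of that effect, with an arbitrary file name and line
counts: a growing log file is committed five times, each commit adds
three loose objects (blob, tree, commit), and "git count-objects -v"
shows the loose object count, which is what auto GC compares against
gc.auto (6700 by default), alongside the size those objects take on
disk.

```shell
#!/bin/sh
# Illustration only: file name, line counts and identity are arbitrary.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example"
git config user.email "example@example.com"

# Append to one log file and commit it five times. Every commit stores the
# whole file again as a new loose blob, plus a tree and a commit object.
for i in 1 2 3 4 5; do
    yes "log line from round $i" | head -n 20000 >>irc.log
    git add irc.log
    git commit -q -m "log update $i"
done

# "count" is the number of loose objects; "size" is their disk usage in KiB.
git count-objects -v
loose_before=$(git count-objects -v | sed -n 's/^count: //p')

# A manual gc moves the loose objects into one pack, where the successive
# versions of the file can be stored as deltas.
git gc -q
git count-objects -v
loose_after=$(git count-objects -v | sed -n 's/^count: //p')

echo "loose objects before gc: $loose_before, after: $loose_after"
```

With real logs the loose blobs are far larger than in this toy case,
while the object count stays just as low, so auto GC does not kick in
for a long time.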


* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:57   ` Ævar Arnfjörð Bjarmason
  2017-04-03 22:39     ` Eric Wong
@ 2017-04-04  2:19     ` Jeff King
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff King @ 2017-04-04  2:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Eric Wong, Git Mailing List, Junio C Hamano, Vicent Marti

On Mon, Apr 03, 2017 at 11:57:51PM +0200, Ævar Arnfjörð Bjarmason wrote:

> >> +These features can be enabled on git servers, they won't help the
> >> +performance of the servers themselves,
> >
> > Is that true for bitmaps?  I thought they reduced CPU usage on
> > the server side...
> 
> I'm not sure, JK? From my reading of the repack.writeBitmaps docs it
> seems to only help clone/fetch for the client, but maybe they do more
> than that.

Bitmaps reduce the CPU required to do the "Counting" phase of
pack-objects. For serving a fetch or clone, the server side is happy
because they use less CPU, but the client is happy because the server
moves to the "Writing" phase more quickly.

Bitmaps also help with pushes, but this is usually less interesting. You
don't tend to push all of history over and over (whereas people _do_
tend to clone all of history over and over).

They don't speed up the counting portion of a regular repack. In theory
they could, but the resulting packs may grow less optimal over time
(e.g., we can't compute the same history-based write order, so over time
your objects would get jumbled, leading to worse cold-cache behavior).

You can also use bitmaps for other reachability computations, but we
don't do so currently. I have patches that I need to clean up to use
them for "git prune", doing ahead/behind checks, --contains, etc.

> I also see we should mention pack.writeBitmapHashCache, which
> according to my reading of v2.0.0-rc0~13^2~8 only helps clone/fetch.

Yes, it helps the delta search heuristic, so only pack-objects would
ever benefit. This should basically be turned on all the time, as
without it fetches from partially-bitmapped repos (i.e., when you've
gotten some pushes but haven't repacked yet) do a really bad job of
finding deltas (they waste too much time and deliver sub-optimal packs).

Arguably it should be the default. The initial patches made it optional
for strict JGit compatibility (I don't know if JGit ever implemented the
extension). We've had it on at GitHub since day one, so I don't have any
operational experience with turning it off (aside from the simulated
numbers in that commit message).
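
For reference, a minimal sketch of turning both settings on and
checking that a bitmap was actually written. A one-commit scratch
repository stands in for a server-side repository here; the file name
and identity are made up.

```shell
#!/bin/sh
# Illustration only: a one-commit scratch repository stands in for a server repo.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example"
git config user.email "example@example.com"
echo "hello" >file.txt
git add file.txt
git commit -q -m "initial"

# Write a reachability bitmap next to the pack on every full repack,
# and include the name-hash cache used by the delta search heuristic.
git config repack.writeBitmaps true
git config pack.writeBitmapHashCache true

# Bitmaps need one pack that contains everything, hence -a
# (with -d to drop the packs made redundant by it).
git repack -a -d -q

# A pack-*.bitmap file should now sit next to the .pack and .idx files.
ls .git/objects/pack/
bitmaps=$(ls .git/objects/pack/*.bitmap)
```

The extra repack cost only pays off when the repository serves many
fetches and clones, as discussed below for personal bare repos.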

> > A sidenote: I wonder if bitmaps should be the default for bare
> > repos, since bare repos are likely used on servers.

That's an interesting notion. It's a net loss if you don't serve a lot
of fetches, because it's trying to amortize the extra CPU during the
repack with faster fetches and clones. So it makes sense for a hosting
site, but less for somebody pushing to a personal bare repo.

-Peff


* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:16 [PATCH/RFC] gitperformance: add new documentation about git performance tuning Ævar Arnfjörð Bjarmason
  2017-04-03 21:34 ` Eric Wong
@ 2017-04-04 15:07 ` Jeff Hostetler
  2017-04-04 15:18   ` Ævar Arnfjörð Bjarmason
  2017-04-05 12:56 ` Duy Nguyen
  2 siblings, 1 reply; 10+ messages in thread
From: Jeff Hostetler @ 2017-04-04 15:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git; +Cc: Junio C Hamano, Jeff King



On 4/3/2017 5:16 PM, Ævar Arnfjörð Bjarmason wrote:
> Add a new manpage that gives an overview of how to tweak git's
> performance.
>
> There's currently no good single resource for things a git site
> administrator might want to look into to improve performance for his
> site & his users. This unfinished documentation aims to be the first
> thing someone might want to look at when investigating ways to improve
> git performance.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>
> I've been wanting to get something like this started for a while. It's
> obviously woefully incomplete. Pointers about what to include would be
> great & whether including something like this makes sense.
>
> Things I have on my TODO list:
>
>  - Add a section discussing how refs impact performance, suggest
>    e.g. archiving old tags if possible, or at least run "git remote
>    prune origin" regularly on clients.
>
>  - Discuss split index a bit, although I'm not very confident in
>    describing what its pros & cons are.
>
>  - Should we be covering good practices for your repo going forward to
>    maintain good performance? E.g. don't have some huge tree all in
>    one directory (use subdirs), don't add binary (rather
>    un-delta-able) content if you can help it etc.
>
> - The new core.checksumIndex option being discussed on-list. Which
>   actually drove me to finally write this up (hrm, this sounds useful,
>   but unless I was watching the list I'd probably never see it...).

You might also consider core.preloadIndex.

For people with very large trees, talk about sparse-checkout.

And (on Windows) core.fscache.  Or leave a place for
an addendum for Windows that we can fill in later.
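
A sketch of the first two suggestions in a scratch repository, using
the core.sparseCheckout setting and the .git/info/sparse-checkout
pattern file. The directory and file names are invented for the
example; core.fscache is left out because it only exists in Git for
Windows.

```shell
#!/bin/sh
# Illustration only: directory and file names are invented.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example"
git config user.email "example@example.com"
mkdir docs src
echo "documentation" >docs/readme.txt
echo "int main(void) { return 0; }" >src/main.c
git add docs src
git commit -q -m "initial"

# Refresh index entries in parallel; enabled by default in current git.
git config core.preloadIndex true

# Sparse checkout: only paths matching the patterns in
# .git/info/sparse-checkout are kept in the working tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo "/docs/" >.git/info/sparse-checkout
git read-tree -mu HEAD

# Only docs/ remains on disk; src/ is still tracked but not checked out.
ls
```

Commands such as "git status" then only have to look at the part of
the tree that is checked out, which is where the saving for very large
trees comes from.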



* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-04 15:07 ` Jeff Hostetler
@ 2017-04-04 15:18   ` Ævar Arnfjörð Bjarmason
  2017-04-04 18:25     ` Jeff Hostetler
  0 siblings, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-04 15:18 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Git Mailing List, Junio C Hamano, Jeff King

On Tue, Apr 4, 2017 at 5:07 PM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>
> On 4/3/2017 5:16 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> Add a new manpage that gives an overview of how to tweak git's
>> performance.
>>
>> There's currently no good single resource for things a git site
>> administrator might want to look into to improve performance for his
>> site & his users. This unfinished documentation aims to be the first
>> thing someone might want to look at when investigating ways to improve
>> git performance.
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>
>> I've been wanting to get something like this started for a while. It's
>> obviously woefully incomplete. Pointers about what to include would be
>> great & whether including something like this makes sense.
>>
>> Things I have on my TODO list:
>>
>>  - Add a section discussing how refs impact performance, suggest
>>    e.g. archiving old tags if possible, or at least run "git remote
>>    prune origin" regularly on clients.
>>
>>  - Discuss split index a bit, although I'm not very confident in
>>    describing what its pros & cons are.
>>
>>  - Should we be covering good practices for your repo going forward to
>>    maintain good performance? E.g. don't have some huge tree all in
>>    one directory (use subdirs), don't add binary (rather
>>    un-delta-able) content if you can help it etc.
>>
>> - The new core.checksumIndex option being discussed on-list. Which
>>   actually drove me to finally write this up (hrm, this sounds useful,
>>   but unless I was watching the list I'd probably never see it...).
>
>
> You might also consider core.preloadIndex.

It's been enabled by default since 2.1.0 (299e29870b), or do you mean
talk about disabling it, or "this is a perf option we have on by
default"?

I don't know the pros of disabling that, haven't used it myself & it's
not clear from the docs.

> For people with very large trees, talk about sparse-checkout.

*nod*

> And (on Windows) core.fscache.  Or leave a place for
> an addendum for Windows that we can fill in later.

I have no core.fscache in my git.git, did you mean something else?



* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-04 15:18   ` Ævar Arnfjörð Bjarmason
@ 2017-04-04 18:25     ` Jeff Hostetler
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Hostetler @ 2017-04-04 18:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, Jeff King



On 4/4/2017 11:18 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Apr 4, 2017 at 5:07 PM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>>
>> On 4/3/2017 5:16 PM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> Add a new manpage that gives an overview of how to tweak git's
>>> performance.
>>>
>>> There's currently no good single resource for things a git site
>>> administrator might want to look into to improve performance for his
>>> site & his users. This unfinished documentation aims to be the first
>>> thing someone might want to look at when investigating ways to improve
>>> git performance.
>>>
>>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>>> ---
>>>
>>> I've been wanting to get something like this started for a while. It's
>>> obviously woefully incomplete. Pointers about what to include would be
>>> great & whether including something like this makes sense.
>>>
>>> Things I have on my TODO list:
>>>
>>>  - Add a section discussing how refs impact performance, suggest
>>>    e.g. archiving old tags if possible, or at least run "git remote
>>>    prune origin" regularly on clients.
>>>
>>>  - Discuss split index a bit, although I'm not very confident in
>>>    describing what its pros & cons are.
>>>
>>>  - Should we be covering good practices for your repo going forward to
>>>    maintain good performance? E.g. don't have some huge tree all in
>>>    one directory (use subdirs), don't add binary (rather
>>>    un-delta-able) content if you can help it etc.
>>>
>>> - The new core.checksumIndex option being discussed on-list. Which
>>>   actually drove my to finally write this up (hrm, this sounds useful,
>>>   but unless I was watching the list I'd probably never see it...).
>>
>>
>> You might also consider core.preloadIndex.
>
> It's been enabled by default since 2.1.0 (299e29870b), or do you mean
> talk about disabling it, or "this is a perf option we have on by
> default"?
>
> I don't know the pros of disabling that, haven't used it myself & it's
> not clear from the docs.

Sorry, no, don't disable it.  Maybe just add a note acknowledging
that it should stay on.


>
>> For people with very large trees, talk about sparse-checkout.
>
> *nod*
>
>> And (on Windows) core.fscache.  Or leave a place for
>> an addendum for Windows that we can fill in later.
>
> I have no core.fscache in my git.git, did you mean something else?

This is only in the Git for Windows tree.  It hasn't
made it upstream yet.

https://github.com/git-for-windows/git/commits/master/compat/win32/fscache.c

Ignore this for now if you want and we can fill in the details
later for you.



* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 22:39     ` Eric Wong
@ 2017-04-04 21:12       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-04-04 21:12 UTC (permalink / raw)
  To: Eric Wong; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Vicent Marti

On Tue, Apr 4, 2017 at 12:39 AM, Eric Wong <e@80x24.org> wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> On Mon, Apr 3, 2017 at 11:34 PM, Eric Wong <e@80x24.org> wrote:
>> > Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> >>  - Should we be covering good practices for your repo going forward to
>> >>    maintain good performance? E.g. don't have some huge tree all in
>> >>    one directory (use subdirs), don't add binary (rather
>> >>    un-delta-able) content if you can help it etc.
>> >
>> > Yes, I think so.
>>
>> I'll try to write something up.
>>
>> > I think avoiding ever growing ChangeLog-type files should also
>> > be added to things to avoid.
>>
>> How were those bad specifically? They should delta quite well, it's
>> expensive to commit large files but no more because they're
>> ever-growing.
>
> It might be blame/annotate specifically, I was remembering this
> thread from a decade ago:
>
>   https://public-inbox.org/git/4aca3dc20712110933i636342fbifb15171d3e3cafb3@mail.gmail.com/T/

I did some basic testing on this, and I don't think advice about
ChangeLog-style files is worth including. On gcc.git, blame on
ChangeLog still takes a few hundred MB of RAM, but finishes in about
2s on my machine. That gcc/fold-const.c file takes ~10s for me,
although that thread seems to have resulted in some patches to
git-blame.

Running this:

    parallel '/usr/bin/time -f %E git blame {} 2>&1 >/dev/null | tr "\n" "\t" && git log --oneline {} | wc -l | tr "\n" "\t" && wc -l {} | tr "\n" "\t" && echo {}' ::: $(git ls-files) | tee /tmp/git-blame-times.txt

On git.git shows that the slowest blames are just files with either
lots of commits, or lots of lines, or some combination of the two. The
gcc.git repo has some more pathological cases, top 10 on that repo:

$ parallel '/usr/bin/time -f %E git blame {} 2>&1 >/dev/null | tr "\n" "\t" && git log --oneline {} | wc -l | tr "\n" "\t" && wc -l {} | tr "\n" "\t" && echo {}' ::: $(git ls-files|grep -e ^gcc/ -e ChangeLog|grep -v '/.*/') | tee /tmp/gcc-blame-times.txt
$ sort -nr /tmp/gcc-blame-times.txt |head -n 10
0:18.12   1513  14517 gcc/tree.c       gcc/tree.c
0:17.35  66336   7435 gcc/ChangeLog    gcc/ChangeLog
0:16.87   1634  30455 gcc/dwarf2out.c  gcc/dwarf2out.c
0:16.76   1160   7937 gcc/varasm.c     gcc/varasm.c
0:16.36   1692   5491 gcc/tree.h       gcc/tree.h
0:15.34     94    493 gcc/xcoffout.c   gcc/xcoffout.c
0:15.22     54    194 gcc/xcoffout.h   gcc/xcoffout.h
0:15.12    964   9224 gcc/reload1.c    gcc/reload1.c
0:14.90   1593   2202 gcc/toplev.c     gcc/toplev.c
0:14.66     11     43 gcc/typeclass.h  gcc/typeclass.h

Which makes it pretty clear that blame is slow where you'd expect, not
with files that are prepended or appended to.
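
For readers without GNU parallel or GNU time, the same measurement can
be sketched as a plain POSIX shell function. This is only an
illustration, not the command used for the numbers above: the name
blame_stats is hypothetical, and timing falls back to whole seconds
from date(1), so it is only meaningful for files that take a while to
blame.

```shell
#!/bin/sh
# Hypothetical helper (not part of git): print, tab-separated, the
# elapsed whole seconds of "git blame", the number of commits touching
# the file, the number of lines in the file, and the path itself.
blame_stats() {
    file=$1
    start=$(date +%s)
    git blame -- "$file" >/dev/null 2>&1
    end=$(date +%s)
    # tr -d ' ' strips the padding some wc implementations add.
    commits=$(git log --oneline -- "$file" | wc -l | tr -d ' ')
    lines=$(wc -l <"$file" | tr -d ' ')
    printf '%s\t%s\t%s\t%s\n' "$((end - start))" "$commits" "$lines" "$file"
}

# Usage, from inside a work tree (slowest files first):
#   git ls-files | while IFS= read -r f; do blame_stats "$f"; done |
#       sort -nr | head -n 10
```

Putting the elapsed time first as a plain integer lets "sort -nr" rank
the files directly, and the commit and line counts next to it make it
easy to see which of the two is driving a slow blame.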


>> One issue with e.g. storing logs (I keep my IRC logs in git) is that
>> if you're constantly committing large (text) files without repack your
>> .git grows by a *lot* in a very short amount of time until a very
>> expensive repack, so now I split my IRC logs by month.
>
> Yep, that too; as auto GC is triggered by the number of loose
> objects, not the size/packability of them.
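
To see how far a repository is from that trigger, one can compare the
loose object count with the on-disk size those objects take. A minimal
sketch, assuming "git count-objects -v" and the gc.auto setting (6700
is the documented default when it is unset); the function name
loose_report is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: report the number of loose objects, their
# on-disk size in KiB, and the gc.auto threshold they are compared to.
loose_report() {
    stats=$(git count-objects -v)
    count=$(printf '%s\n' "$stats" | sed -n 's/^count: //p')
    size=$(printf '%s\n' "$stats" | sed -n 's/^size: //p')
    # gc.auto prints nothing when unset; fall back to git's documented
    # default of 6700 in that case.
    threshold=$(git config --get gc.auto || echo 6700)
    printf 'loose=%s size_kib=%s gc_auto=%s\n' "$count" "$size" "$threshold"
}

# Usage, from inside a repository:
#   loose_report
```

A small "loose" value next to a large "size_kib" is the situation
described above: a few big text files that auto GC will not pack for a
long time.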


* Re: [PATCH/RFC] gitperformance: add new documentation about git performance tuning
  2017-04-03 21:16 [PATCH/RFC] gitperformance: add new documentation about git performance tuning Ævar Arnfjörð Bjarmason
  2017-04-03 21:34 ` Eric Wong
  2017-04-04 15:07 ` Jeff Hostetler
@ 2017-04-05 12:56 ` Duy Nguyen
  2 siblings, 0 replies; 10+ messages in thread
From: Duy Nguyen @ 2017-04-05 12:56 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, Jeff King

On Tue, Apr 4, 2017 at 4:16 AM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> Things I have on my TODO list:

Always keep cache-tree valid. I think there are some changes in "git
checkout" to rebuild cache-tree often, so it's probably not as bad as
before. I don't know if there's a command to manually repair
cache-tree after it's been damaged too much (or even better, to
repair cache-tree automatically once the damage goes over a limit,
similar to how the index is split automatically).

> +git status
> +~~~~~~~~~~
> +
> +Running "git status" requires traversing the working tree & comparing
> +it with the index. Several configuration options can help with its
> +performance, with some trade-offs.

Another option, if you know you only make changes in one (preferably
deep) subdirectory and your whole worktree is very large, then you
could do something like "git status .". This speeds git-status up a
bit because it won't need to look outside (or speeds up a lot if
"outside" is very large). The con is, changes outside "." will not be
seen.
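
A minimal sketch of that pathspec-limited status, wrapped in a
hypothetical helper named status_in so the trade-off is explicit in
one place:

```shell
#!/bin/sh
# Hypothetical helper: show changes under one directory only.
# Anything modified outside "$1" is deliberately not reported.
status_in() {
    git status --short -- "$1"
}

# Usage, from the top of a large work tree:
#   status_in some/deep/subdir
# or simply, from inside that subdirectory:
#   git status .
```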
-- 
Duy


Thread overview (10+ messages):
2017-04-03 21:16 [PATCH/RFC] gitperformance: add new documentation about git performance tuning Ævar Arnfjörð Bjarmason
2017-04-03 21:34 ` Eric Wong
2017-04-03 21:57   ` Ævar Arnfjörð Bjarmason
2017-04-03 22:39     ` Eric Wong
2017-04-04 21:12       ` Ævar Arnfjörð Bjarmason
2017-04-04  2:19     ` Jeff King
2017-04-04 15:07 ` Jeff Hostetler
2017-04-04 15:18   ` Ævar Arnfjörð Bjarmason
2017-04-04 18:25     ` Jeff Hostetler
2017-04-05 12:56 ` Duy Nguyen
