* Git Scaling: What factors most affect Git performance for a large repo?
@ 2015-02-19 21:26 Stephen Morton
  2015-02-19 22:21 ` Stefan Beller
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Stephen Morton @ 2015-02-19 21:26 UTC (permalink / raw)
  To: git

I posted this to comp.version-control.git.user and didn't get any response. I
think the question is plumbing-related enough that I can ask it here.

I'm evaluating the feasibility of moving my team from SVN to git. We have a very
large repo. [1] We will have a central repo using GitLab (or similar) that
everybody works with. Forks, code sharing, pull requests etc. will be done
through this central server.

By 'performance', I guess I mean speed of day to day operations for devs.

   * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
   * Will a few simultaneous clones from the central server also slow down
     other concurrent operations for other users?
   * Will 'git pull' be slow?
   * 'git push'?
   * 'git commit'? (It is listed as slow in reference [3].)
   * 'git status'? (Slow again in reference [3], though I don't see it myself.)
   * Some operations might not seem to be day-to-day but if they are called
     frequently by the web front-end to GitLab/Stash/GitHub etc then
     they can become bottlenecks. (e.g. 'git branch --contains' seems terribly
     adversely affected by large numbers of branches.)
   * Others?
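
(Any single operation is easy enough to time on a test conversion; a
rough sketch of the kind of measurement I mean -- the commands are real,
the repo is hypothetical:

    $ time git status
    $ time git branch --contains HEAD~100

It's the properly methodical comparison that's costly; see [2].)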


Assuming I can put lots of resources into a central server with lots of CPU,
RAM, fast SSD, fast networking, what aspects of the repo are most likely to
affect devs' experience?
   * Number of commits
   * Sheer disk space occupied by the repo
   * Number of tags.
   * Number of branches.
   * Binary objects in the repo that cause it to bloat in size [1]
   * Other factors?

Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
networking-- which is most critical here?

(Stash recommends available RAM of 1.5 x repo_size x number of
concurrent clones. I assume that is good advice in general.)
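
(Worked through with the exaggerated numbers below: 1.5 x 15 GB x, say,
4 concurrent clones would mean roughly 90 GB of available RAM.)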

Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
branches" which are just one little dangling commit required to change the code
a little bit between a commit and a tag that was not quite made from it.)

While there's lots of information online, much of it is old [3] and with git
constantly evolving I don't know how valid it still is. Then there's anecdotal
evidence that is of questionable value.[2]
    Are many/all of the issues Facebook identified [3] resolved? (Yes, I
understand Facebook went with Mercurial. But I imagine the git team nevertheless
took their analysis to heart.)


Thanks,
Steve


[1] (Yes, I'm investigating ways to make our repo not so large etc. That's
    beyond the scope of the discussion I'd like to have with this
    question. Thanks.)
[2] The large amounts of anecdotal evidence relate to the "why don't you try it
    yourself?" response to my question. I will if I have to, but setting up a
    properly methodical study is time consuming and difficult --I don't want to
    produce poor anecdotal numbers that don't really hold up-- and if somebody's
    already done the work, then I should leverage it.
[3] http://thread.gmane.org/gmane.comp.version-control.git/189776

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
@ 2015-02-19 22:21 ` Stefan Beller
  2015-02-19 23:06   ` Stephen Morton
  2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Stefan Beller @ 2015-02-19 22:21 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git

On Thu, Feb 19, 2015 at 1:26 PM, Stephen Morton
<stephen.c.morton@gmail.com> wrote:
> I posted this to comp.version-control.git.user and didn't get any response. I
> think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
> large repo. [1]
>
> [1] (Yes, I'm investigating ways to make our repo not so large etc. That's
>     beyond the scope of the discussion I'd like to have with this
>     question. Thanks.)

What do you mean by large?
* lots of files
* large files
* or even large binary files (bad to diff/merge)
* long history (i.e. lots of small changes)
* impactful history (changes which rewrite nearly everything from scratch)

For reference, the linux repo:
* has 48414 files, in 3128 directories
* the largest file is 1.1M, the whole repo is 600M
* no really large binary files
* more than 500051 changes/commits including merges
* started in 2004 (when git was invented, essentially)
* the .git folder is 1.4G compared to the 600M of files,
   indicating it may have been rewritten 3 times (well, this
   metric is bogus, there is lots of compression
   going on in .git)

and linux seems to be doing ok with git.
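
(For comparison, numbers like these can be gathered with stock
commands; a rough sketch, run from the top of a work tree:

    $ git ls-files | wc -l                        # tracked files
    $ git ls-tree -r -d --name-only HEAD | wc -l  # directories
    $ git rev-list --count --all                  # commits, incl. merges
    $ du -sh .git                                 # repository size
)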

So unless you can pinpoint exactly what you are concerned about, I guess
there will be no helpful answer.

linux is by no means a really large project; there are other projects
way larger than that (I am thinking of the KDE project, for example),
and they do fine as well.

Thanks,
Stefan

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 22:21 ` Stefan Beller
@ 2015-02-19 23:06   ` Stephen Morton
  2015-02-19 23:15     ` Stefan Beller
  0 siblings, 1 reply; 33+ messages in thread
From: Stephen Morton @ 2015-02-19 23:06 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

On Thu, Feb 19, 2015 at 5:21 PM, Stefan Beller <sbeller@google.com> wrote:
> On Thu, Feb 19, 2015 at 1:26 PM, Stephen Morton
> <stephen.c.morton@gmail.com> wrote:
>> I posted this to comp.version-control.git.user and didn't get any response. I
>> think the question is plumbing-related enough that I can ask it here.
>>
>> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
>> large repo. [1]
>>
>> [1] (Yes, I'm investigating ways to make our repo not so large etc. That's
>>     beyond the scope of the discussion I'd like to have with this
>>     question. Thanks.)
>
> What do you mean by large?
> * lots of files
> * large files
> * or even large binary files (bad to diff/merge)
> * long history (i.e. lots of small changes)
> * impactful history (changes which rewrite nearly everything from scratch)
>
> For reference, the linux repo:
> * has 48414 files, in 3128 directories
> * the largest file is 1.1M, the whole repo is 600M
> * no really large binary files
> * more than 500051 changes/commits including merges
> * started in 2004 (when git was invented, essentially)
> * the .git folder is 1.4G compared to the 600M of files,
>    indicating it may have been rewritten 3 times (well, this
>    metric is bogus, there is lots of compression
>    going on in .git)
>
> and linux seems to be doing ok with git.
>
> So unless you can pinpoint exactly what you are concerned about, I guess
> there will be no helpful answer.
>
> linux is by no means a really large project; there are other projects
> way larger than that (I am thinking of the KDE project, for example),
> and they do fine as well.
>
> Thanks,
> Stefan

Hi Stefan,

I think I addressed most of this in my original post with the paragraph

 "Assume ridiculous numbers. Let me exaggerate: say 1 million commits,
15 GB repo,
  50k tags, 1,000 branches. (Due to historical code fixups, another
5,000 "fix-up
  branches" which are just one little dangling commit required to
change the code
  a little bit between a commit and a tag that was not quite made from it.)"

To that I'd add 25k files,
no major rewrites,
no huge binary files, but lots of few-MB binary files with many revisions.

But even without details of my specific concerns, I thought that
perhaps the git developers know what limits git's performance even if
large projects like the kernel are not hitting these limits.

Steve

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 23:06   ` Stephen Morton
@ 2015-02-19 23:15     ` Stefan Beller
  0 siblings, 0 replies; 33+ messages in thread
From: Stefan Beller @ 2015-02-19 23:15 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git

On Thu, Feb 19, 2015 at 3:06 PM, Stephen Morton
<stephen.c.morton@gmail.com> wrote:
>
> I think I addressed most of this in my original post with the paragraph
>
>  "Assume ridiculous numbers. Let me exaggerate: say 1 million commits,
> 15 GB repo,
>   50k tags, 1,000 branches. (Due to historical code fixups, another
> 5,000 "fix-up
>   branches" which are just one little dangling commit required to
> change the code
>   a little bit between a commit and a tag that was not quite made from it.)"
>
> To that I'd add 25k files,
> no major rewrites,
> no huge binary files, but lots of few-MB binary files with many revisions.
>
> But even without details of my specific concerns, I thought that
> perhaps the git developers know what limits git's performance even if
> large projects like the kernel are not hitting these limits.
>
> Steve

I did not realize you gave numbers below, as I started answering after
reading the first paragraphs. Sorry about that.

I think lots of files in the small-MB range, organized in a hierarchical
fashion, is not a huge deal. History is a non-issue, too.

The problem arises with having lots of branches.
"640 git branches ought to be enough for everybody -- Linus" (just kidding)
Git doesn't really scale efficiently with lots of branches (second-hand
information, except for fetch/pull where I did some patches on another
topic recently).

Thanks,
Stefan

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
  2015-02-19 22:21 ` Stefan Beller
@ 2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
  2015-02-20  0:04   ` Duy Nguyen
  2015-02-19 23:38 ` Duy Nguyen
  2015-02-20  0:03 ` brian m. carlson
  3 siblings, 1 reply; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-02-19 23:29 UTC (permalink / raw)
  To: Stephen Morton; +Cc: Git Mailing List

On Thu, Feb 19, 2015 at 10:26 PM, Stephen Morton
<stephen.c.morton@gmail.com> wrote:
> I posted this to comp.version-control.git.user and didn't get any response. I
> think the question is plumbing-related enough that I can ask it here.
>
> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
> large repo. [1] We will have a central repo using GitLab (or similar) that
> everybody works with. Forks, code sharing, pull requests etc. will be done
> through this central server.
>
> By 'performance', I guess I mean speed of day to day operations for devs.
>
>    * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>    * Will a few simultaneous clones from the central server also slow down
>      other concurrent operations for other users?
>    * Will 'git pull' be slow?
>    * 'git push'?
>    * 'git commit'? (It is listed as slow in reference [3].)
>    * 'git status'? (Slow again in reference [3], though I don't see it myself.)
>    * Some operations might not seem to be day-to-day but if they are called
>      frequently by the web front-end to GitLab/Stash/GitHub etc then
>      they can become bottlenecks. (e.g. 'git branch --contains' seems terribly
>      adversely affected by large numbers of branches.)
>    * Others?
>
>
> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
> affect devs' experience?
>    * Number of commits
>    * Sheer disk space occupied by the repo
>    * Number of tags.
>    * Number of branches.
>    * Binary objects in the repo that cause it to bloat in size [1]
>    * Other factors?
>
> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?
>
> (Stash recommends available RAM of 1.5 x repo_size x number of
> concurrent clones. I assume that is good advice in general.)
>
> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
> branches" which are just one little dangling commit required to change the code
> a little bit between a commit and a tag that was not quite made from it.)
>
> While there's lots of information online, much of it is old [3] and with git
> constantly evolving I don't know how valid it still is. Then there's anecdotal
> evidence that is of questionable value.[2]
>     Are many/all of the issues Facebook identified [3] resolved? (Yes, I
> understand Facebook went with Mercurial. But I imagine the git team nevertheless
> took their analysis to heart.)

Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:

 * Around 500k commits
 * Around 100k tags
 * Around 5k branches
 * Around 500 commits/day, almost entirely to the same branch
 * 1.5 GB .git checkout.
 * Mostly text source, but some binaries (we're trying to cut down[1] on those)

The main scaling issues we have with Git are:

 * "git pull" takes around 10 seconds or so
 * Operations like "git status" are much slower because they scale
   with the size of the work tree
 * Similarly "git rebase" takes a much longer time for each applied
   commit, I think because it does the equivalent of "git status" for
   every applied commit. Each commit applied takes around 1-2 seconds.
 * We have a lot of contention on pushes because we're mostly pushing
   to one branch.
 * History spelunking (e.g. git log --reverse -p -G<str>) is taking
   longer by the day

The obvious reason "git pull" is slow is that git-upload-pack spews
the complete set of refs at you each time. The output from that
command is around 10MB in size for us now. It takes around 300 ms to
run locally from a hot cache, a bit more to send it over the network.
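
(A rough way to see the size of that advertisement for any repo:

    $ git ls-remote origin | wc -c
)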

But actually most of "git fetch" is spent in the reachability check
subsequently done by "git-rev-list", which takes several seconds. I
haven't looked into it, but there's got to be room for optimization
there; surely it only has to do reachability checks for new refs, or
it could run in some completely trusting "I trust this remote not to
send me corrupt data" mode (which would make sense within a company
where you can trust your main Git box).

The "git status" operations could be made faster by having something
like watchman, there's been some effort on getting that done in Git,
but I haven't tried it. This seems to have been the main focus of
Facebook's Mercurial optimization effort.

Some of this you can "solve" mostly by doing e.g. "git status -uno";
having support for such unsafe operations (e.g. teaching rebase and
pals to use it) would be nice at the cost of some safety, but having
something that feeds off inotify would be even better.
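
(A way to make that the default, assuming you accept the trade-off;
the config key is real:

    $ git config status.showUntrackedFiles no
)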

It takes around 3 minutes to re-clone our repo; we really don't care
(we rarely re-clone). But I thought I'd mention it because for some
reason this was important to Facebook, and along with inotify it was
one of the two major things they focused on.

As far as I know, everyday Git operations don't scale all that badly
with a huge history. They will a bit, since everything will live in the
same pack file, and this'll become especially noticeable when your
packfiles are being evicted out of the page cache.

However operations like "git repack" seem to be quite bad at handling
these sort of repos. It already takes us many GB of RAM to repack
ours. I'd hate to do the same if it was 10x as big.

Overall I'd say Git would work for you for a repo like that, I'd
certainly still take it over SVN any day. The main thing you might
want to try out is partitioning out any binary assets you may have.
Usually that's much easier than splitting up the source tree itself.

I haven't yet done this, but I was planning on writing something to
start archiving our tags (mostly created by [2]) along with
aggressively deleting branches in the repo. I haven't benchmarked that
but I think that'll make the "pull" operations much faster, which in
turn will make the push contention (lots of lemmings pushing to the
same ref) better since the pull && push window is reduced.

I do ask myself what we're going to do if we just keep growing and all
the numbers I cited get multiplied by 10-50x. With the current
limitations of the git implementation I think we'd need to split the
repo. The main reason we don't do so is because we like being able to
atomically change a library and its users.

However there's nothing in the basic Git repository format that
inherently limits Git from being smarter about large repos, it just
seems to be hard to implement with the way the current client is
structured. In particular nothing would stop a Git client from:

 * Partially cloning a history but still being able to push upstream.
   You could just get a partial commit/tree/blob graph and fetch the
   rest on-demand as needed.
 * Scaling up to multi-TB or PB repos. We'd just have to treat blobs
   as something fetched on-demand, sort of like what git-annex does,
   but built-in. We'd also have to be less stupid about how we pack
   big blobs (or not pack them at all).
 * Partially cloning a Git working tree. You could ask the server for
   the last N commit objects and what you need for some subdirectory
   in the repo. Then when you commit, you ask the server what the
   other top-level tree objects you need to make a commit are.
 * Working without filesystem access; nothing in the Git format itself
   actually requires it. Not having to deal with external things
   modifying the tree would be another approach to what the inotify
   effort is trying to solve.

Of course changes like that will require a major overhaul of the current
codebase, or another implementation. Some of those require much more
active client/server interaction than what we have now, but they are
possible, which gives me some hope for the future.

Finally, I'd like to mention that if someone here on-list is
interested in doing work on these scalability topics in Git we'd be
open to funding that effort on some contract basis. Obviously the
details would have to be worked out blah blah blah, but that came up
the last time we had discussions about this internally. Myself and a
bunch of other people at work /could/ work on this ourselves, but
we're busy with other stuff and would much prefer just to pay someone
to fix them.

1. https://github.com/avar/pre-receive-reject-binaries
2. https://github.com/git-deploy/git-deploy

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
  2015-02-19 22:21 ` Stefan Beller
  2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
@ 2015-02-19 23:38 ` Duy Nguyen
  2015-02-20  0:42   ` David Turner
  2015-02-20  0:03 ` brian m. carlson
  3 siblings, 1 reply; 33+ messages in thread
From: Duy Nguyen @ 2015-02-19 23:38 UTC (permalink / raw)
  To: Stephen Morton; +Cc: Git Mailing List

On Fri, Feb 20, 2015 at 4:26 AM, Stephen Morton
<stephen.c.morton@gmail.com> wrote:
> By 'performance', I guess I mean speed of day to day operations for devs.
>
>    * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>    * Will a few simultaneous clones from the central server also slow down
>      other concurrent operations for other users?

There are no locks on the server when cloning, so in theory cloning does
not affect other operations. Cloning can use lots of memory though
(and a lot of cpu, unless you turn on the reachability bitmap feature,
which you should).
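
(A minimal sketch of enabling it on the server, assuming a reasonably
recent git; the config key is real:

    $ git config repack.writeBitmaps true
    $ git repack -A -d    # bitmap is written on the next full repack
)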

>    * Will 'git pull' be slow?

If we exclude the server side, the size of your tree is the main
factor, but your 25k files should be fine (linux has 48k files).

>    * 'git push'?

This one is not affected by how deep your repo's history is, or how
wide your tree is, so it should be quick.

Ah, the number of refs may affect both git-push and git-pull. I think
Stefan knows better than I do in this area.

>    * 'git commit'? (It is listed as slow in reference [3].)
>    * 'git status'? (Slow again in reference [3], though I don't see it myself.)
(also git-add)

Again, the size of your tree. I'm trying to address problems in [3],
but at your repo's size, I don't think you need to worry about it.

>    * Some operations might not seem to be day-to-day but if they are called
>      frequently by the web front-end to GitLab/Stash/GitHub etc then
>      they can become bottlenecks. (e.g. 'git branch --contains' seems terribly
>      adversely affected by large numbers of branches.)
>    * Others?

git-blame could be slow when a file is modified a lot.
-- 
Duy

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
                   ` (2 preceding siblings ...)
  2015-02-19 23:38 ` Duy Nguyen
@ 2015-02-20  0:03 ` brian m. carlson
  2015-02-20 16:06   ` Stephen Morton
  2015-02-20 22:08   ` Sebastian Schuberth
  3 siblings, 2 replies; 33+ messages in thread
From: brian m. carlson @ 2015-02-20  0:03 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git

On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
> I posted this to comp.version-control.git.user and didn't get any response. I
> think the question is plumbing-related enough that I can ask it here.
> 
> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
> large repo. [1] We will have a central repo using GitLab (or similar) that
> everybody works with. Forks, code sharing, pull requests etc. will be done
> through this central server.
> 
> By 'performance', I guess I mean speed of day to day operations for devs.
> 
>    * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>    * Will a few simultaneous clones from the central server also slow down
>      other concurrent operations for other users?

This hasn't been a problem for us at $DAYJOB.  Git doesn't lock anything 
on fetches, so each process is independent.  We probably have about 
sixty developers (and maybe twenty other occasional users) that manage 
to interact with our Git server all day long.  We also have probably 
twenty smoker (CI) systems pulling at two hour intervals, or, when 
there's nothing to do, every two minutes, plus probably fifteen to 
twenty build systems pulling hourly.

I assume you will provide adequate resources for your server.

>    * Will 'git pull' be slow?
>    * 'git push'?

The most pathological case I've seen for git push is a branch with a 
single commit merged into the main development branch.  As of Git 2.3.0, 
the performance regression here is fixed.

Obviously, the speed of your network connection will affect this.  Even 
at 30 MB/s, cloning several gigabytes of data takes time.  Git tries 
hard to eliminate sending a lot of data, so if your developers keep 
reasonably up-to-date, the cost of establishing the connection will tend 
to dominate.

I see pull and push times that are less than 2 seconds in most cases.

>    * 'git commit'? (It is listed as slow in reference [3].)
>    * 'git status'? (Slow again in reference [3], though I don't see it myself.)

These can be slow with slow disks or over remote file systems.  I 
recommend not doing that.  I've heard rumbles that disk performance is 
better on Unix, but I don't use Windows so I can't say.

You should keep your .gitignore files up-to-date to avoid enumerating 
untracked files.  There's some work towards making this less of an 
issue.

git blame can be somewhat slow, but it's not something I use more than 
about once a day, so it doesn't bother me that much.

> Assuming I can put lots of resources into a central server with lots of CPU,
> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
> affect devs' experience?
>    * Number of commits
>    * Sheer disk space occupied by the repo

The number of files can impact performance due to the number of stat()s 
required.

>    * Number of tags.
>    * Number of branches.

The number of tags and branches individually is really less relevant 
than the total number of refs (tags, branches, remote branches, etc). 
Very large numbers of refs can impact performance on pushes and pulls 
due to the need to enumerate them all.

>    * Binary objects in the repo that cause it to bloat in size [1]
>    * Other factors?

If you want good performance, I'd recommend the latest version of Git 
both client- and server-side.  Newer versions of Git provide pack 
bitmaps, which can dramatically speed up clones and fetches, and Git 
2.3.0 fixes a performance regression with large numbers of refs in 
non-shallow repositories.

It is totally worth it to roll your own packages of git if your vendor 
provides old versions.

> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
> networking-- which is most critical here?

I generally find that having a good disk cache is important with large 
repositories.  It may be advantageous to make sure the developer 
machines have adequate memory.  Performance is notably better on 
development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.

I can't speak to the server side, as I'm not directly involved with its 
deployment.

> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
> branches" which are just one little dangling commit required to change the code
> a little bit between a commit and a tag that was not quite made from it.)

I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly 
growing) refs.  Other developers work on a repo that's 9 GB packed, with 
somewhat fewer refs.  We don't tend to have problems with this.

Obviously, performance is better on some of our smaller repos, but it's 
not unacceptable on the larger ones.  I generally find that the 940 KB 
repo with huge numbers of files performs worse than the 1.9 GB repo with 
somewhat fewer.  If you can split your repository into multiple logical 
repositories, that will certainly improve performance.

If you end up having pain points, we're certainly interested in 
working through those.  I've brought up performance problems and people 
are generally responsive.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
@ 2015-02-20  0:04   ` Duy Nguyen
  2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 33+ messages in thread
From: Duy Nguyen @ 2015-02-20  0:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>
>  * Around 500k commits
>  * Around 100k tags
>  * Around 5k branches
>  * Around 500 commits/day, almost entirely to the same branch
>  * 1.5 GB .git checkout.
>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)

Would be nice if you could make an anonymized version of this repo
public. Working on a "real" large repo is better than an artificial
one.

> But actually most of "git fetch" is spent in the reachability check
> subsequently done by "git-rev-list" which takes several seconds. I

I wonder if reachability bitmaps could help here...

> haven't looked into it but there's got to be room for optimization
> there, surely it only has to do reachability checks for new refs, or
> could run in some "I trust this remote not to send me corrupt data"
> completely mode (which would make sense within a company where you can
> trust your main Git box).

No, it's not just about trusting the server side, it's about catching
data corruption on the wire as well. We have a trick to avoid the
reachability check in the clone case, which is much more expensive than
a fetch. Maybe we could do something further to help the fetch case _if_
reachability bitmaps don't help.
-- 
Duy

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-19 23:38 ` Duy Nguyen
@ 2015-02-20  0:42   ` David Turner
  2015-02-20 20:59     ` Junio C Hamano
  2015-02-21  4:01     ` Duy Nguyen
  0 siblings, 2 replies; 33+ messages in thread
From: David Turner @ 2015-02-20  0:42 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stephen Morton, Git Mailing List

On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
> >    * 'git push'?
> 
> This one is not affected by how deep your repo's history is, or how
> wide your tree is, so should be quick..
> 
> Ah the number of refs may affect both git-push and git-pull. I think
> Stefan knows better than I in this area.

I can tell you that this is a bit of a problem for us at Twitter.  We
have over 100k refs, which adds ~20MiB of downstream traffic to every
push.

I added a hack to improve this locally inside Twitter: The client sends
a bloom filter of shas that it believes the server knows about; the
server sends only the sha of master and any refs that are not in the
bloom filter. The client uses its local version of the server's refs
as if they had just been sent. This means that some packs will be
suboptimal, due to false positives in the bloom filter leading some new
refs to not be sent. Also, if there were a repack between the pull and
the push, some refs might have been deleted on the server; we repack
rarely enough and pull frequently enough that this is hopefully not an
issue.

We're still testing to see if this works.  But due to the number of
assumptions it makes, it's probably not that great an idea for general
use.

There are probably more complex schemes to compute minimal (or
small-enough) packs; in particular, if the patch is just a few megs off
of master, it's better to just send the whole pack.  That doesn't work
for us because we've got a log-based replication scheme that the pack
appends to, and we don't want the log to get too big; we want
more-minimal packs than that.  But it might work for others.

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  0:04   ` Duy Nguyen
@ 2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
  2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
                         ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-02-20 12:09 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>
>>  * Around 500k commits
>>  * Around 100k tags
>>  * Around 5k branches
>>  * Around 500 commits/day, almost entirely to the same branch
>>  * 1.5 GB .git checkout.
>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>
> Would be nice if you could make an anonymized version of this repo
> public. Working on a "real" large repo is better than an artificial
> one.

Yeah, I'll try to do that.

>> But actually most of "git fetch" is spent in the reachability check
>> subsequently done by "git-rev-list" which takes several seconds. I
>
> I wonder if reachability bitmap could help here..

I could have sworn I had that enabled already but evidently not. I did
test it and it cut down on clone times a bit. Now our daily repacking
is:

        git --git-dir={} gc &&
        git --git-dir={} pack-refs --all --prune &&
        git --git-dir={} repack -Ad --window=250 --depth=100 \
            --write-bitmap-index --pack-kept-objects &&

It's not clear to me from the documentation whether this should just
be enabled on the server, or the clients too. In any case I've enabled
it on both.

Even then with it enabled on both a "git pull" that pulls down just
one commit on one branch is 13s. Trace attached at the end of the
mail.

>> haven't looked into it but there's got to be room for optimization
>> there, surely it only has to do reachability checks for new refs, or
>> could run in some "I trust this remote not to send me corrupt data"
>> completely mode (which would make sense within a company where you can
>> trust your main Git box).
>
> No, it's not just about trusting the server side, it's about catching
> data corruption on the wire as well. We have a trick to avoid
> reachability check in clone case, which is much more expensive than a
> fetch. Maybe we could do something further to help the fetch case _if_
> reachability bitmaps don't help.

Still, if that's indeed a big bottleneck what's the worst-case
scenario here? That the local repository gets hosed? The server will
still recursively validate the objects it gets sent, right?

I wonder if a better trade-off in that case would be to skip this in
some situations and instead put something like "git fsck" in a
cronjob.

Here's a "git pull" trace mentioned above:

$ time GIT_TRACE=1 git pull
13:06:13.603781 git.c:555               trace: exec: 'git-pull'
13:06:13.603936 run-command.c:351       trace: run_command: 'git-pull'
13:06:13.620615 git.c:349               trace: built-in: git
'rev-parse' '--git-dir'
13:06:13.631602 git.c:349               trace: built-in: git
'rev-parse' '--is-bare-repository'
13:06:13.636103 git.c:349               trace: built-in: git
'rev-parse' '--show-toplevel'
13:06:13.641491 git.c:349               trace: built-in: git 'ls-files' '-u'
13:06:13.719923 git.c:349               trace: built-in: git
'symbolic-ref' '-q' 'HEAD'
13:06:13.728085 git.c:349               trace: built-in: git 'config'
'branch.trunk.rebase'
13:06:13.738160 git.c:349               trace: built-in: git 'config' 'pull.ff'
13:06:13.743286 git.c:349               trace: built-in: git
'rev-parse' '-q' '--verify' 'HEAD'
13:06:13.972091 git.c:349               trace: built-in: git
'rev-parse' '--verify' 'HEAD'
13:06:14.149420 git.c:349               trace: built-in: git
'update-index' '-q' '--ignore-submodules' '--refresh'
13:06:14.294098 git.c:349               trace: built-in: git
'diff-files' '--quiet' '--ignore-submodules'
13:06:14.467711 git.c:349               trace: built-in: git
'diff-index' '--cached' '--quiet' '--ignore-submodules' 'HEAD' '--'
13:06:14.683419 git.c:349               trace: built-in: git
'rev-parse' '-q' '--git-dir'
13:06:15.189707 git.c:349               trace: built-in: git
'rev-parse' '-q' '--verify' 'HEAD'
13:06:15.335948 git.c:349               trace: built-in: git 'fetch'
'--update-head-ok'
13:06:15.691303 run-command.c:351       trace: run_command: 'ssh'
'git.example.com' 'git-upload-pack '\''/gitrepos/core.git'\'''
13:06:17.095662 run-command.c:351       trace: run_command: 'rev-list'
'--objects' '--stdin' '--not' '--all' '--quiet'
remote: Counting objects: 6, done.
remote: Compressing objects: 100% (6/6), done.
13:06:20.426346 run-command.c:351       trace: run_command:
'unpack-objects' '--pack_header=2,6'
13:06:20.431806 exec_cmd.c:130          trace: exec: 'git'
'unpack-objects' '--pack_header=2,6'
13:06:20.437343 git.c:349               trace: built-in: git
'unpack-objects' '--pack_header=2,6'
remote: Total 6 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (6/6), done.
13:06:20.444196 run-command.c:351       trace: run_command: 'rev-list'
'--objects' '--stdin' '--not' '--all'
13:06:20.447135 exec_cmd.c:130          trace: exec: 'git' 'rev-list'
'--objects' '--stdin' '--not' '--all'
13:06:20.451283 git.c:349               trace: built-in: git
'rev-list' '--objects' '--stdin' '--not' '--all'
From ssh://git.example.com/gitrepos/core
   02d33d2..41e72c4  core      -> origin/core
13:06:22.559609 run-command.c:351       trace: run_command: 'gc' '--auto'
13:06:22.562176 exec_cmd.c:130          trace: exec: 'git' 'gc' '--auto'
13:06:22.565661 git.c:349               trace: built-in: git 'gc' '--auto'
13:06:22.594980 git.c:349               trace: built-in: git
'rev-parse' '-q' '--verify' 'HEAD'
13:06:22.845728 git.c:349               trace: built-in: git
'show-branch' '--merge-base' 'refs/heads/core'
'41e72c42addc5075e8009a3eebe914fa0ce98b27'
'02d33d2be7f8601c3502fdd89b0946447d7cdf15'
13:06:23.087586 git.c:349               trace: built-in: git 'fmt-merge-msg'
13:06:23.341451 git.c:349               trace: built-in: git
'rev-parse' '--parseopt' '--stuck-long' '--' '--onto'
'41e72c42addc5075e8009a3eebe914fa0ce98b27'
'41e72c42addc5075e8009a3eebe914fa0ce98b27'
13:06:23.350513 git.c:349               trace: built-in: git
'rev-parse' '--git-dir'
13:06:23.362011 git.c:349               trace: built-in: git
'rev-parse' '--is-bare-repository'
13:06:23.365282 git.c:349               trace: built-in: git
'rev-parse' '--show-toplevel'
13:06:23.372589 git.c:349               trace: built-in: git 'config'
'--bool' 'rebase.stat'
13:06:23.377056 git.c:349               trace: built-in: git 'config'
'--bool' 'rebase.autostash'
13:06:23.382102 git.c:349               trace: built-in: git 'config'
'--bool' 'rebase.autosquash'
13:06:23.389458 git.c:349               trace: built-in: git
'rev-parse' '--verify' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
13:06:23.608894 git.c:349               trace: built-in: git
'rev-parse' '--verify' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
13:06:23.894026 git.c:349               trace: built-in: git
'symbolic-ref' '-q' 'HEAD'
13:06:23.898918 git.c:349               trace: built-in: git
'rev-parse' '--verify' 'HEAD'
13:06:24.102269 git.c:349               trace: built-in: git
'rev-parse' '--verify' 'HEAD'
13:06:24.338636 git.c:349               trace: built-in: git
'update-index' '-q' '--ignore-submodules' '--refresh'
13:06:24.539912 git.c:349               trace: built-in: git
'diff-files' '--quiet' '--ignore-submodules'
13:06:24.729362 git.c:349               trace: built-in: git
'diff-index' '--cached' '--quiet' '--ignore-submodules' 'HEAD' '--'
13:06:24.938533 git.c:349               trace: built-in: git
'merge-base' '41e72c42addc5075e8009a3eebe914fa0ce98b27'
'02d33d2be7f8601c3502fdd89b0946447d7cdf15'
13:06:25.197791 git.c:349               trace: built-in: git 'diff'
'--stat' '--summary' '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
'41e72c42addc5075e8009a3eebe914fa0ce98b27'
[details on updated files]
13:06:25.488275 git.c:349               trace: built-in: git
'checkout' '-q' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
13:06:26.467413 git.c:349               trace: built-in: git
'update-ref' 'ORIG_HEAD' '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
Fast-forwarded trunk to 41e72c42addc5075e8009a3eebe914fa0ce98b27.
13:06:26.716256 git.c:349               trace: built-in: git 'rev-parse' 'HEAD'
13:06:26.958595 git.c:349               trace: built-in: git
'update-ref' '-m' 'rebase finished: refs/heads/core onto
41e72c42addc5075e8009a3eebe914fa0ce98b27' 'refs/heads/core'
'41e72c42addc5075e8009a3eebe914fa0ce98b27'
'02d33d2be7f8601c3502fdd89b0946447d7cdf15'
13:06:27.205320 git.c:349               trace: built-in: git
'symbolic-ref' '-m' 'rebase finished: returning to refs/heads/core'
'HEAD' 'refs/heads/core'
13:06:27.208748 git.c:349               trace: built-in: git 'gc' '--auto'

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
@ 2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
  2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
  2015-02-21  3:51       ` Duy Nguyen
  2 siblings, 0 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-02-20 12:11 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 1:09 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>>
>>>  * Around 500k commits
>>>  * Around 100k tags
>>>  * Around 5k branches
>>>  * Around 500 commits/day, almost entirely to the same branch
>>>  * 1.5 GB .git checkout.
>>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>>
>> Would be nice if you could make an anonymized version of this repo
>> public. Working on a "real" large repo is better than an artificial
>> one.
>
> Yeah, I'll try to do that.
>
>>> But actually most of "git fetch" is spent in the reachability check
>>> subsequently done by "git-rev-list" which takes several seconds. I
>>
>> I wonder if reachability bitmap could help here..
>
> I could have sworn I had that enabled already but evidently not. I did
> test it and it cut down on clone times a bit. Now our daily repacking
> is:
>
>         git --git-dir={} gc &&
>         git --git-dir={} pack-refs --all --prune &&
>         git --git-dir={} repack -Ad --window=250 --depth=100 \
>             --write-bitmap-index --pack-kept-objects &&
>
> It's not clear to me from the documentation whether this should just
> be enabled on the server, or the clients too. In any case I've enabled
> it on both.
>
> Even then with it enabled on both a "git pull" that pulls down just
> one commit on one branch is 13s. Trace attached at the end of the
> mail.
>
>>> haven't looked into it but there's got to be room for optimization
>>> there, surely it only has to do reachability checks for new refs, or
>>> could run in some "I trust this remote not to send me corrupt data"
>>> completely mode (which would make sense within a company where you can
>>> trust your main Git box).
>>
>> No, it's not just about trusting the server side, it's about catching
>> data corruption on the wire as well. We have a trick to avoid
>> reachability check in clone case, which is much more expensive than a
>> fetch. Maybe we could do something further to help the fetch case _if_
>> reachability bitmaps don't help.
>
> Still, if that's indeed a big bottleneck what's the worst-case
> scenario here? That the local repository gets hosed? The server will
> still recursively validate the objects it gets sent, right?
>
> I wonder if a better trade-off in that case would be to skip this in
> some situations and instead put something like "git fsck" in a
> cronjob.
>
> Here's a "git pull" trace mentioned above:
>
> $ time GIT_TRACE=1 git pull
> 13:06:13.603781 git.c:555               trace: exec: 'git-pull'
> 13:06:13.603936 run-command.c:351       trace: run_command: 'git-pull'
> 13:06:13.620615 git.c:349               trace: built-in: git
> 'rev-parse' '--git-dir'
> 13:06:13.631602 git.c:349               trace: built-in: git
> 'rev-parse' '--is-bare-repository'
> 13:06:13.636103 git.c:349               trace: built-in: git
> 'rev-parse' '--show-toplevel'
> 13:06:13.641491 git.c:349               trace: built-in: git 'ls-files' '-u'
> 13:06:13.719923 git.c:349               trace: built-in: git
> 'symbolic-ref' '-q' 'HEAD'
> 13:06:13.728085 git.c:349               trace: built-in: git 'config'
> 'branch.trunk.rebase'
> 13:06:13.738160 git.c:349               trace: built-in: git 'config' 'pull.ff'
> 13:06:13.743286 git.c:349               trace: built-in: git
> 'rev-parse' '-q' '--verify' 'HEAD'
> 13:06:13.972091 git.c:349               trace: built-in: git
> 'rev-parse' '--verify' 'HEAD'
> 13:06:14.149420 git.c:349               trace: built-in: git
> 'update-index' '-q' '--ignore-submodules' '--refresh'
> 13:06:14.294098 git.c:349               trace: built-in: git
> 'diff-files' '--quiet' '--ignore-submodules'
> 13:06:14.467711 git.c:349               trace: built-in: git
> 'diff-index' '--cached' '--quiet' '--ignore-submodules' 'HEAD' '--'
> 13:06:14.683419 git.c:349               trace: built-in: git
> 'rev-parse' '-q' '--git-dir'
> 13:06:15.189707 git.c:349               trace: built-in: git
> 'rev-parse' '-q' '--verify' 'HEAD'
> 13:06:15.335948 git.c:349               trace: built-in: git 'fetch'
> '--update-head-ok'
> 13:06:15.691303 run-command.c:351       trace: run_command: 'ssh'
> 'git.example.com' 'git-upload-pack '\''/gitrepos/core.git'\'''
> 13:06:17.095662 run-command.c:351       trace: run_command: 'rev-list'
> '--objects' '--stdin' '--not' '--all' '--quiet'
> remote: Counting objects: 6, done.
> remote: Compressing objects: 100% (6/6), done.
> 13:06:20.426346 run-command.c:351       trace: run_command:
> 'unpack-objects' '--pack_header=2,6'
> 13:06:20.431806 exec_cmd.c:130          trace: exec: 'git'
> 'unpack-objects' '--pack_header=2,6'
> 13:06:20.437343 git.c:349               trace: built-in: git
> 'unpack-objects' '--pack_header=2,6'
> remote: Total 6 (delta 0), reused 0 (delta 0)
> Unpacking objects: 100% (6/6), done.
> 13:06:20.444196 run-command.c:351       trace: run_command: 'rev-list'
> '--objects' '--stdin' '--not' '--all'
> 13:06:20.447135 exec_cmd.c:130          trace: exec: 'git' 'rev-list'
> '--objects' '--stdin' '--not' '--all'
> 13:06:20.451283 git.c:349               trace: built-in: git
> 'rev-list' '--objects' '--stdin' '--not' '--all'
> From ssh://git.example.com/gitrepos/core
>    02d33d2..41e72c4  core      -> origin/core
> 13:06:22.559609 run-command.c:351       trace: run_command: 'gc' '--auto'
> 13:06:22.562176 exec_cmd.c:130          trace: exec: 'git' 'gc' '--auto'
> 13:06:22.565661 git.c:349               trace: built-in: git 'gc' '--auto'
> 13:06:22.594980 git.c:349               trace: built-in: git
> 'rev-parse' '-q' '--verify' 'HEAD'
> 13:06:22.845728 git.c:349               trace: built-in: git
> 'show-branch' '--merge-base' 'refs/heads/core'
> '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
> 13:06:23.087586 git.c:349               trace: built-in: git 'fmt-merge-msg'
> 13:06:23.341451 git.c:349               trace: built-in: git
> 'rev-parse' '--parseopt' '--stuck-long' '--' '--onto'
> '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> 13:06:23.350513 git.c:349               trace: built-in: git
> 'rev-parse' '--git-dir'
> 13:06:23.362011 git.c:349               trace: built-in: git
> 'rev-parse' '--is-bare-repository'
> 13:06:23.365282 git.c:349               trace: built-in: git
> 'rev-parse' '--show-toplevel'
> 13:06:23.372589 git.c:349               trace: built-in: git 'config'
> '--bool' 'rebase.stat'
> 13:06:23.377056 git.c:349               trace: built-in: git 'config'
> '--bool' 'rebase.autostash'
> 13:06:23.382102 git.c:349               trace: built-in: git 'config'
> '--bool' 'rebase.autosquash'
> 13:06:23.389458 git.c:349               trace: built-in: git
> 'rev-parse' '--verify' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
> 13:06:23.608894 git.c:349               trace: built-in: git
> 'rev-parse' '--verify' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
> 13:06:23.894026 git.c:349               trace: built-in: git
> 'symbolic-ref' '-q' 'HEAD'
> 13:06:23.898918 git.c:349               trace: built-in: git
> 'rev-parse' '--verify' 'HEAD'
> 13:06:24.102269 git.c:349               trace: built-in: git
> 'rev-parse' '--verify' 'HEAD'
> 13:06:24.338636 git.c:349               trace: built-in: git
> 'update-index' '-q' '--ignore-submodules' '--refresh'
> 13:06:24.539912 git.c:349               trace: built-in: git
> 'diff-files' '--quiet' '--ignore-submodules'
> 13:06:24.729362 git.c:349               trace: built-in: git
> 'diff-index' '--cached' '--quiet' '--ignore-submodules' 'HEAD' '--'
> 13:06:24.938533 git.c:349               trace: built-in: git
> 'merge-base' '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
> 13:06:25.197791 git.c:349               trace: built-in: git 'diff'
> '--stat' '--summary' '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
> '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> [details on updated files]
> 13:06:25.488275 git.c:349               trace: built-in: git
> 'checkout' '-q' '41e72c42addc5075e8009a3eebe914fa0ce98b27^0'
> 13:06:26.467413 git.c:349               trace: built-in: git
> 'update-ref' 'ORIG_HEAD' '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
> Fast-forwarded trunk to 41e72c42addc5075e8009a3eebe914fa0ce98b27.
> 13:06:26.716256 git.c:349               trace: built-in: git 'rev-parse' 'HEAD'
> 13:06:26.958595 git.c:349               trace: built-in: git
> 'update-ref' '-m' 'rebase finished: refs/heads/core onto
> 41e72c42addc5075e8009a3eebe914fa0ce98b27' 'refs/heads/core'
> '41e72c42addc5075e8009a3eebe914fa0ce98b27'
> '02d33d2be7f8601c3502fdd89b0946447d7cdf15'
> 13:06:27.205320 git.c:349               trace: built-in: git
> 'symbolic-ref' '-m' 'rebase finished: returning to refs/heads/core'
> 'HEAD' 'refs/heads/core'
> 13:06:27.208748 git.c:349               trace: built-in: git 'gc' '--auto'

I forgot to include that this took:

real    0m13.630s
user    0m10.739s
sys     0m4.064s

on my local laptop with a ssd + hot cache it was:

real    0m7.513s
user    0m3.796s
sys     0m0.624s

So some of that we could speed up with faster systems, but we still
have quite a bit of Git overhead.

Even with the hot cache on the ssd I get on this repo:

$ time (git log -1 >/dev/null)

real    0m0.938s
user    0m0.916s
sys     0m0.020s

v.s. the same on linux.git:

$ time (git log -1 >/dev/null)

real    0m0.016s
user    0m0.008s
sys     0m0.004s

Which I suspect is a function of the high ref count, but it could be
something else...
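
(One way to dig into where that second goes, assuming git 2.2 or later:

    $ GIT_TRACE_PERFORMANCE=1 git log -1 >/dev/null
)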

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
  2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
@ 2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
  2015-02-20 21:04         ` Junio C Hamano
                           ` (2 more replies)
  2015-02-21  3:51       ` Duy Nguyen
  2 siblings, 3 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-02-20 14:25 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 1:09 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>>
>>>  * Around 500k commits
>>>  * Around 100k tags
>>>  * Around 5k branches
>>>  * Around 500 commits/day, almost entirely to the same branch
>>>  * 1.5 GB .git checkout.
>>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>>
>> Would be nice if you could make an anonymized version of this repo
>> public. Working on a "real" large repo is better than an artificial
>> one.
>
> Yeah, I'll try to do that.

tl;dr: After some more testing it turns out the performance issues we
have are almost entirely due to the number of refs. Some of these I
knew about and were obvious (e.g. git pull), but some aren't so
obvious (why does "git log" without "--all" slow down as a function of
the overall number of refs?).

Rather than getting an anonymized version of the repo we have, a
simpler isolated test case is just doing this on linux.git:

    $ git rev-list --all | perl -e 'my $cnt; while (<>) {
          s<([a-f0-9]+)><git tag -a -m"Test" TAG $1>gm;
          next unless int rand 10 == 1;
          $cnt++; s/TAG/tagnr-$cnt/; print }' | sh -x

That'll create a tag for every 10th commit or so, which is around 50k
tags for linux.git.
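
(The same effect with a plain bash loop, deterministic every-10th-commit
instead of random; a sketch:

    $ n=0; git rev-list --all | awk 'NR % 10 == 0' |
      while read -r sha; do
          n=$((n + 1))
          git tag -a -m "Test" "tagnr-$n" "$sha"
      done
)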

I actually ran this a few times while testing it, so this is a before
and after on a hot cache of linux.git with 406 tags vs. ~140k. I ran
the gc + repack + bitmaps for both repos noted in an earlier reply of
mine, and took the fastest run out of 3:

    $ time (git log master -100 >/dev/null)
    Before: real    0m0.021s
    After: real    0m2.929s
    $ time (git status >/dev/null)
    # Around 150ms, no noticeable difference
    $ time git fetch
    # I'm fetching from git@github.com:torvalds/linux.git here, the
    # cache is hot but upstream has *no* changes
    Before: real    0m1.826s
    After: real    0m8.458s

Details on why "git fetch" is slow in this situation:

    $ time GIT_TRACE=1 git fetch
    15:15:00.435420 git.c:349               trace: built-in: git 'fetch'
    15:15:00.654428 run-command.c:341       trace: run_command: 'ssh'
'git@github.com' 'git-upload-pack '\''torvalds/linux.git'\'''
    15:15:02.426121 run-command.c:341       trace: run_command:
'rev-list' '--objects' '--stdin' '--not' '--all' '--quiet'
    15:15:05.507327 run-command.c:341       trace: run_command:
'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:05.508329 exec_cmd.c:134          trace: exec: 'git'
'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:05.510490 git.c:349               trace: built-in: git
'rev-list' '--objects' '--stdin' '--not' '--all'
    15:15:08.874116 run-command.c:341       trace: run_command: 'gc' '--auto'
    15:15:08.879570 exec_cmd.c:134          trace: exec: 'git' 'gc' '--auto'
    15:15:08.882495 git.c:349               trace: built-in: git 'gc' '--auto'
    real    0m8.458s
    user    0m6.548s
    sys     0m0.204s

Even things you'd expect to not be impacted are, like a reverse log
search on the master branch:

    $ time (git log --reverse -p --grep=arm64 origin/master >/dev/null)
    Before: real    0m4.473s
    After: real    0m6.194s

Or doing 10 commits and rebasing on the upstream:

    $ time (git checkout origin/master~ && for i in {1..10}; do
          echo $i > file && git add file && git commit -m"moo" $file;
      done && git rebase origin/master)
    Before: real    0m6.798s
    After: real    0m12.340s

The remaining slowdown comes from the size of the tree, which we can
deal with by either reducing it in size (we have some copied JS
libraries and whatnot) or trying the inotify-powered git-status.

In our case there's no good reason why we have this many refs in
the repository everyone uses. We basically just have a bunch of dated
rollout tags that have been accumulating for years, and a bunch of
mostly unused branches people just haven't cleaned up.

So I'm going to:

 1. Write a hook that rejects tags that aren't new (i.e. forbid
re-pushes of old tags)
 2. Create an archive repository that contains all the old tags (i.e.
just run "git fetch" on the main one from cron)
 3. Run a script to regularly delete tags from the main repo
 4. Run the same script on the clients that clone the repo

The branches are slightly harder: deleting those that are fully merged
into the main branch is easy, deleting those whose contents 100% match
patch-ids already in the main branch is another thing we can do, and we
can just clean up branches unconditionally after they've reached a
certain age (they'll still be archived). A rough sketch of the tag and
branch cleanup follows.
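
(Sketch of the archiving and cleanup; repository paths and tag patterns
are hypothetical, the git commands are real, xargs -r is GNU:

    # archive repo pulls in everything on cron
    $ git --git-dir=/srv/core-archive.git fetch origin '+refs/*:refs/*'

    # delete old dated tags from the main repo, in batches
    $ git tag -l 'rollout-2013*' | xargs -r -n 50 git push --delete origin

    # delete remote branches fully merged into master
    $ git branch -r --merged origin/master | grep -v master |
      sed 's|origin/||' | xargs -r -n 50 git push --delete origin
)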

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  0:03 ` brian m. carlson
@ 2015-02-20 16:06   ` Stephen Morton
  2015-02-20 16:38     ` Matthieu Moy
  2015-02-20 17:16     ` brian m. carlson
  2015-02-20 22:08   ` Sebastian Schuberth
  1 sibling, 2 replies; 33+ messages in thread
From: Stephen Morton @ 2015-02-20 16:06 UTC (permalink / raw)
  To: git

This is fantastic. I really appreciate all the answers. And it's great
that I seem to have sparked some general discussion that could lead
somewhere too.

Notes:

I'm currently using 2.1.3; I'll move to 2.3.x.

I'm experimenting with git-annex to reduce repo size on disk. We'll see.

I could remove all tags older than /n/ years from the active repo
and just maintain them in the historical repo. (We have quite a lot of
CI-generated tags.) It sounds like that might improve performance.
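
Something along these lines is what I'd try (untested; assumes GNU
date, with /n/ = 2):

    cutoff=$(date -d '2 years ago' +%s)
    git for-each-ref --format='%(refname:short)' refs/tags |
    while read -r tag; do
        # committer date of the tagged commit; skip non-commit tags
        t=$(git log -1 --format=%ct "$tag" 2>/dev/null) || continue
        [ "$t" -lt "$cutoff" ] && git tag -d "$tag"
    done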

Questions:

1. Ævar: I'm a bit concerned by your statement that git rebases take
about 1-2 s per commit. Does that mean that a "git pull --rebase", if
it is picking up say 120 commits (not at all unrealistic), could
potentially take 4 minutes to complete? Or have I misinterpreted your
comment?

2. I'd not heard about bitmap indexes before this thread but it sounds
like they should help me. In limited searching I can't find much
useful documentation about them. It is also not clear to me if I have
to explicitly run "git repack --write-bitmap-index" or if git will
automatically detect when they're needed; first experiments seem to
indicate that I need to explicitly generate them. I assume that once
the index is there, git will just use it automatically.
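
(For the record, the explicit generation I experimented with was along
the lines of:)

    git repack -a -d --write-bitmap-index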


Steve


On Thu, Feb 19, 2015 at 7:03 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
>> I posted this to comp.version-control.git.user and didn't get any response. I
>> think the question is plumbing-related enough that I can ask it here.
>>
>> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
>> large repo. [1] We will have a central repo using GitLab (or similar) that
>> everybody works with. Forks, code sharing, pull requests etc. will be done
>> through this central server.
>>
>> By 'performance', I guess I mean speed of day to day operations for devs.
>>
>>    * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>>    * Will a few simultaneous clones from the central server also slow down
>>      other concurrent operations for other users?
>
> This hasn't been a problem for us at $DAYJOB.  Git doesn't lock anything
> on fetches, so each process is independent.  We probably have about
> sixty developers (and maybe twenty other occasional users) that manage
> to interact with our Git server all day long.  We also have probably
> twenty smoker (CI) systems pulling at two hour intervals, or, when
> there's nothing to do, every two minutes, plus probably fifteen to
> twenty build systems pulling hourly.
>
> I assume you will provide adequate resources for your server.
>
>>    * Will 'git pull' be slow?
>>    * 'git push'?
>
> The most pathological case I've seen for git push is a branch with a
> single commit merged into the main development branch.  As of Git 2.3.0,
> the performance regression here is fixed.
>
> Obviously, the speed of your network connection will affect this.  Even
> at 30 MB/s, cloning several gigabytes of data takes time.  Git tries
> hard to eliminate sending a lot of data, so if your developers keep
> reasonably up-to-date, the cost of establishing the connection will tend
> to dominate.
>
> I see pull and push times that are less than 2 seconds in most cases.
>
>>    * 'git commit'? (It is listed as slow in reference [3].)
>>    * 'git stautus'? (Slow again in reference 3 though I don't see it.)
>
> These can be slow with slow disks or over remote file systems.  I
> recommend not doing that.  I've heard rumbles that disk performance is
> better on Unix, but I don't use Windows so I can't say.
>
> You should keep your .gitignore files up-to-date to avoid enumerating
> untracked files.  There's some work towards making this less of an
> issue.
>
> git blame can be somewhat slow, but it's not something I use more than
> about once a day, so it doesn't bother me that much.
>
>> Assuming I can put lots of resources into a central server with lots of CPU,
>> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
>> affect devs' experience?
>>    * Number of commits
>>    * Sheer disk space occupied by the repo
>
> The number of files can impact performance due to the number of stat()s
> required.
>
>>    * Number of tags.
>>    * Number of branches.
>
> The number of tags and branches individually is really less relevant
> than the total number of refs (tags, branches, remote branches, etc).
> Very large numbers of refs can impact performance on pushes and pulls
> due to the need to enumerate them all.
>
>>    * Binary objects in the repo that cause it to bloat in size [1]
>>    * Other factors?
>
> If you want good performance, I'd recommend the latest version of Git
> both client- and server-side.  Newer versions of Git provide pack
> bitmaps, which can dramatically speed up clones and fetches, and Git
> 2.3.0 fixes a performance regression with large numbers of refs in
> non-shallow repositories.
>
> It is totally worth it to roll your own packages of git if your vendor
> provides old versions.
>
>> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
>> networking-- which is most critical here?
>
> I generally find that having a good disk cache is important with large
> repositories.  It may be advantageous to make sure the developer
> machines have adequate memory.  Performance is notably better on
> development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.
>
> I can't speak to the server side, as I'm not directly involved with its
> deployment.
>
>> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
>> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
>> branches" which are just one little dangling commit required to change the code
>> a little bit between a commit a tag that was not quite made from it.)
>
> I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly
> growing) refs.  Other developers work on a repo that's 9 GB packed, with
> somewhat fewer refs.  We don't tend to have problems with this.
>
> Obviously, performance is better on some of our smaller repos, but it's
> not unacceptable on the larger ones.  I generally find that the 940 KB
> repo with huge numbers of files performs worse than the 1.9 GB repo with
> somewhat fewer.  If you can split your repository into multiple logical
> repositories, that will certainly improve performance.
>
> If you end up having pain points, we're certainly interested in
> working through those.  I've brought up performance problems and people
> are generally responsive.
> --
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 16:06   ` Stephen Morton
@ 2015-02-20 16:38     ` Matthieu Moy
  2015-02-20 17:16     ` brian m. carlson
  1 sibling, 0 replies; 33+ messages in thread
From: Matthieu Moy @ 2015-02-20 16:38 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git

Stephen Morton <stephen.c.morton@gmail.com> writes:

> 1. Ævar: I'm a bit concerned by your statement that git rebases take
> about 1-2 s per commit. Does that mean that a "git pull --rebase", if
> it is picking up say 120 commits (not at all unrealistic), could
> potentially take 4 minutes to complete? Or have I misinterpreted your
> comment?

Ævar talked about "applied commits" during rebase. When you "git pull
--rebase", you fast-forward the history you just fetched, which is
almost instantaneous, and then you reapply your local history on top of
it.

So, the performance depends on how long your local history is, not on
how many commits you're fetching.
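
In command terms, "git pull --rebase" is roughly equivalent to this
(assuming remote "origin" and branch "master"):

    git fetch origin
    git rebase origin/master   # reapplies only your local commits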

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 16:06   ` Stephen Morton
  2015-02-20 16:38     ` Matthieu Moy
@ 2015-02-20 17:16     ` brian m. carlson
  1 sibling, 0 replies; 33+ messages in thread
From: brian m. carlson @ 2015-02-20 17:16 UTC (permalink / raw)
  To: Stephen Morton; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1250 bytes --]

On Fri, Feb 20, 2015 at 11:06:44AM -0500, Stephen Morton wrote:
>2. I'd not heard about bitmap indexes before this thread but it sounds
>like they should help me. In limited searching I can't find much
>useful documentation about them. It is also not clear to me if I have
>to explicitly run "git repack --write-bitmap-index" or if git will
>automatically detect when they're needed; first experiments seem to
>indicate that I need to explicitly generate them. I assume that once
>the index is there, git will just use it automatically.

Pack bitmaps are a way of speeding up clones and fetches by precomputing 
reachability information.  Practically, this means that the initial 
"Counting objects" phase is instantaneous for clones and much faster for 
fetches.

The way I've done it in the past is to set repack.writeBitmaps = true in 
/etc/gitconfig on the server.  (I highly recommend enabling reflogs in 
the same place.)  Then you'll need to ensure that git gc runs 
periodically so that bitmaps are generated.
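
Concretely, something like this (the layout is just a sketch):

    # /etc/gitconfig on the server
    [repack]
            writeBitmaps = true
    # reflogs, as recommended above
    [core]
            logAllRefUpdates = true
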
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  0:42   ` David Turner
@ 2015-02-20 20:59     ` Junio C Hamano
  2015-02-23 20:23       ` David Turner
  2015-02-21  4:01     ` Duy Nguyen
  1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2015-02-20 20:59 UTC (permalink / raw)
  To: David Turner; +Cc: Duy Nguyen, Stephen Morton, Git Mailing List

David Turner <dturner@twopensource.com> writes:

> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
>> >    * 'git push'?
>> 
>> This one is not affected by how deep your repo's history is, or how
>> wide your tree is, so should be quick..
>> 
>> Ah the number of refs may affect both git-push and git-pull. I think
>> Stefan knows better than I in this area.
>
> I can tell you that this is a bit of a problem for us at Twitter.  We
> have over 100k refs, which adds ~20MiB of downstream traffic to every
> push.
>
> I added a hack to improve this locally inside Twitter: The client sends
> a bloom filter of shas that it believes that the server knows about; the
> server sends only the sha of master and any refs that are not in the
> bloom filter.  The client  uses its local version of the servers' refs
> as if they had just been sent....

Interesting.

Care to extend the discussion to improve the protocol exchange,
which starts at $gmane/263932 [*1*], where I list the known issues
around the current protocol (and a possible way to correct them in
footnotes)?


[Footnote]

*1* http://thread.gmane.org/gmane.comp.version-control.git/263898/focus=263932

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
@ 2015-02-20 21:04         ` Junio C Hamano
  2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
  2015-02-20 22:02         ` Sebastian Schuberth
  2015-02-24 12:44         ` Michael Haggerty
  2 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2015-02-20 21:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, Stephen Morton, Git Mailing List

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I actually ran this a few times while testing it, so this is a before
> and after on a hot cache of linux.git with 406 tags v.s. ~140k. I ran
> the gc + repack + bitmaps for both repos noted in an earlier reply of
> mine, and took the fastest run out of 3:
>
>     $ time (git log master -100 >/dev/null)
>     Before: real    0m0.021s
>     After: real    0m2.929s

Do you force --decorate with some config?  Or do you see similar
performance difference with "git rev-parse master", too?

>     $ time (git status >/dev/null)
>     # Around 150ms, no noticeable difference

This is understandable, as it will not look at any ref other than
HEAD.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
  2015-02-20 21:04         ` Junio C Hamano
@ 2015-02-20 22:02         ` Sebastian Schuberth
  2015-02-24 12:44         ` Michael Haggerty
  2 siblings, 0 replies; 33+ messages in thread
From: Sebastian Schuberth @ 2015-02-20 22:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Duy Nguyen
  Cc: Stephen Morton, Git Mailing List

On 20.02.2015 15:25, Ævar Arnfjörð Bjarmason wrote:

> tl;dr: After some more testing it turns out the performance issues we
> have are almost entirely due to the number of refs. Some of these I

Interesting. We currently have similar performance issues when pushing 
to a Git repo hosted on Gerrit. The only difference between our repo and 
others on the same server that do not have any performance issues is 
the large number of refs (about 40k) in refs/changes.

I still wonder why that would slow down a push as we have not configured 
our clients to fetch / push refs/changes.

-- 
Sebastian Schuberth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  0:03 ` brian m. carlson
  2015-02-20 16:06   ` Stephen Morton
@ 2015-02-20 22:08   ` Sebastian Schuberth
  2015-02-20 22:58     ` brian m. carlson
  1 sibling, 1 reply; 33+ messages in thread
From: Sebastian Schuberth @ 2015-02-20 22:08 UTC (permalink / raw)
  To: brian m. carlson, Stephen Morton, git

On 20.02.2015 01:03, brian m. carlson wrote:

> If you want good performance, I'd recommend the latest version of Git
> both client- and server-side.  Newer versions of Git provide pack
> bitmaps, which can dramatically speed up clones and fetches, and Git

Do you happen to know which version, if any, of JGit and Gerrit support 
pack bitmaps?

> 2.3.0 fixes a performance regression with large numbers of refs in
> non-shallow repositories.

Do you also know in what Git version the regression was introduced?

-- 
Sebastian Schuberth

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 22:08   ` Sebastian Schuberth
@ 2015-02-20 22:58     ` brian m. carlson
  0 siblings, 0 replies; 33+ messages in thread
From: brian m. carlson @ 2015-02-20 22:58 UTC (permalink / raw)
  To: Sebastian Schuberth; +Cc: Stephen Morton, git

[-- Attachment #1: Type: text/plain, Size: 1017 bytes --]

On Fri, Feb 20, 2015 at 11:08:55PM +0100, Sebastian Schuberth wrote:
>On 20.02.2015 01:03, brian m. carlson wrote:
>
>>If you want good performance, I'd recommend the latest version of Git
>>both client- and server-side.  Newer versions of Git provide pack
>>bitmaps, which can dramatically speed up clones and fetches, and Git
>
>Do you happen to know which version, if any, of JGit and Gerrit support 
>pack bitmaps?

They were originally implemented in JGit, but I don't know what version, 
sorry.  Some googling tells me that it's probably version 3.0.

>>2.3.0 fixes a performance regression with large numbers of refs in
>>non-shallow repositories.
>
>Do you also know in what Git version the regression was introduced?

v1.8.4-rc3-8-gfbd4a70.  It was fixed in v2.2.1-65-g2dacf26.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
  2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
  2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
@ 2015-02-21  3:51       ` Duy Nguyen
  2 siblings, 0 replies; 33+ messages in thread
From: Duy Nguyen @ 2015-02-21  3:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 7:09 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>>> But actually most of "git fetch" is spent in the reachability check
>>> subsequently done by "git-rev-list" which takes several seconds. I
>>
>> I wonder if reachability bitmap could help here..
>
> I could have sworn I had that enabled already but evidently not. I did
> test it and it cut down on clone times a bit. Now our daily repacking
> is:
>
>         git --git-dir={} gc &&
>         git --git-dir={} pack-refs --all --prune &&
>         git --git-dir={} repack -Ad --window=250 --depth=100
> --write-bitmap-index --pack-kept-objects &&
>
> It's not clear to me from the documentation whether this should just
> be enabled on the server, or the clients too. In any case I've enabled
> it on both.

Pack bitmaps matter most on the server side. What I was not sure about
was whether they help the client side as well, because you do rev-list
on the client side for the reachability test. But thinking again, I
don't think enabling pack bitmaps on the client side helps much. The
"--not --all" part in rev-list basically just traverses commits, not
trees and other objects (where pack bitmaps shine). The big problem
here is "--all", which will go examine all refs. So big ref number
problem again..

> Even then with it enabled on both a "git pull" that pulls down just
> one commit on one branch is 13s. Trace attached at the end of the
> mail.
>
>>> haven't looked into it but there's got to be room for optimization
>>> there, surely it only has to do reachability checks for new refs, or
>>> could run in some "I trust this remote not to send me corrupt data"
>>> completely mode (which would make sense within a company where you can
>>> trust your main Git box).
>>
>> No, it's not just about trusting the server side, it's about catching
>> data corruption on the wire as well. We have a trick to avoid
>> reachability check in clone case, which is much more expensive than a
>> fetch. Maybe we could do something further to help the fetch case _if_
>> reachability bitmaps don't help.
>
> Still, if that's indeed a big bottleneck what's the worst-case
> scenario here? That the local repository gets hosed? The server will
> still recursively validate the objects it gets sent, right?

The server is under pressure to pack and send data fast so it does not
validate as heavily as the client. When deltas are reused, only crc32
is verified. When deltas are generated, the server must unpack some
objects for deltification, but I don't think it rehashes the content
to see if it produces the same SHA-1. Single bit flips could go
unnoticed..

> I wonder if a better trade-off in that case would be to skip this in
> some situations and instead put something like "git fsck" in a
> cronjob.

Either that or be optimistic, accept the pack (i.e. git-fetch returns
quickly) and validate it in the background. If the pack is indeed
good, you don't have to wait until validation is done. If the pack is
bad, you would know after a minute or two, hopefully you can still
recover from that point.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  0:42   ` David Turner
  2015-02-20 20:59     ` Junio C Hamano
@ 2015-02-21  4:01     ` Duy Nguyen
  2015-02-25 12:02       ` Duy Nguyen
  1 sibling, 1 reply; 33+ messages in thread
From: Duy Nguyen @ 2015-02-21  4:01 UTC (permalink / raw)
  To: David Turner; +Cc: Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 7:42 AM, David Turner <dturner@twopensource.com> wrote:
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
>> >    * 'git push'?
>>
>> This one is not affected by how deep your repo's history is, or how
>> wide your tree is, so should be quick..
>>
>> Ah the number of refs may affect both git-push and git-pull. I think
>> Stefan knows better than I in this area.
>
> I can tell you that this is a bit of a problem for us at Twitter.  We
> have over 100k refs, which adds ~20MiB of downstream traffic to every
> push.
>
> I added a hack to improve this locally inside Twitter: The client sends
> a bloom filter of shas that it believes that the server knows about; the
> server sends only the sha of master and any refs that are not in the
> bloom filter.  The client  uses its local version of the servers' refs
> as if they had just been sent.  This means that some packs will be
> suboptimal, due to false positives in the bloom filter leading some new
> refs to not be sent.  Also, if there were a repack between the pull and
> the push, some refs might have been deleted on the server; we repack
> rarely enough and pull frequently enough that this is hopefully not an
> issue.

I wonder how efficient rsync is for transferring these refs: the
client generates a "file" containing all refs, the server does the
same with its refs, then the client rsyncs its file to the server.
The changes between the server and the client files are usually small,
so I'm hoping rsync can take advantage of that.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 20:59     ` Junio C Hamano
@ 2015-02-23 20:23       ` David Turner
  0 siblings, 0 replies; 33+ messages in thread
From: David Turner @ 2015-02-23 20:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Stephen Morton, Git Mailing List


On Fri, 2015-02-20 at 12:59 -0800, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
> >> >    * 'git push'?
> >> 
> >> This one is not affected by how deep your repo's history is, or how
> >> wide your tree is, so should be quick..
> >> 
> >> Ah the number of refs may affect both git-push and git-pull. I think
> >> Stefan knows better than I in this area.
> >
> > I can tell you that this is a bit of a problem for us at Twitter.  We
> > have over 100k refs, which adds ~20MiB of downstream traffic to every
> > push.
> >
> > I added a hack to improve this locally inside Twitter: The client sends
> > a bloom filter of shas that it believes that the server knows about; the
> > server sends only the sha of master and any refs that are not in the
> > bloom filter.  The client  uses its local version of the servers' refs
> > as if they had just been sent....
> 
> Interesting.
> 
> Care to extend the discussion to improve the protocol exchange,
> which starts at $gmane/263932 [*1*], where I list the known issues
> around the current protocol (and a possible way to correct them in
> footnotes)?

At Twitter, we changed to an entirely different clone strategy for our
largest repo: instead of using git clone, we use bittorrent (on a
tarball of the repo).  For git pull, we maintain a journal of all pushes
ever made to the server (data and ref updates); each client keeps track
of their location in that journal.  So now pull does not require any
computation on the server; the client just requests the segment of the
journal that they don't have.  Then the client replays the journal.
This scheme isn't perfect: clients end up with data about even
transitory and long-dead branches, and there is presently no way to
redact data (although that would be possible to add).  And of course
shallow and sparse clones are impossible.  But it works quite well for
Twitter's needs.  As I understand it, the hope is to implement redaction
and then submit patches upstream.

I say "we", but I personally did not do any of the above work.  Because
I haven't looked into most of these issues personally, I'm reluctant to
say too much on protocol improvements.  I would want to better
understand the constraints.  I do think there is value in having a
diversity of possible protocols to handle different use cases.  As
repositories grow, traditional full-repo clones become less viable.
Network transfer and client-side performance both suffer.  In a repo the
size of (say) WebKit, the traditional model works.  In a repo the size
of Facebook's monorepo, it starts to break down.  So Facebook does
entirely shallow clones (using hg, but the problems are similar in git).
Commands like log and blame instead call out to a server to gather
history data.  At Google, whose repo is I think two or three orders of
magnitude larger than WebKit, all local copies are both shallow and
sparse; there is also support for "sparse commits" -- so that a commit
that affects (say) ten thousand files across the entire tree can be kept
to a reasonable size. 

<end digression>

Twitter's journal scheme explains why I implemented bloom filter pushes
-- the number of refs does not significantly affect pull performance,
but pushes still go through the normal git machinery, so we wanted an
optimization to reduce latency there.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
  2015-02-20 21:04         ` Junio C Hamano
  2015-02-20 22:02         ` Sebastian Schuberth
@ 2015-02-24 12:44         ` Michael Haggerty
  2015-03-02 19:42           ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 33+ messages in thread
From: Michael Haggerty @ 2015-02-24 12:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Duy Nguyen
  Cc: Stephen Morton, Git Mailing List

On 02/20/2015 03:25 PM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Feb 20, 2015 at 1:09 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>>> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
>>> <avarab@gmail.com> wrote:
>>>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>>>
>>>>  * Around 500k commits
>>>>  * Around 100k tags
>>>>  * Around 5k branches
>>>>  * Around 500 commits/day, almost entirely to the same branch
>>>>  * 1.5 GB .git checkout.
>>>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>>>
>>> Would be nice if you could make an anonymized version of this repo
>>> public. Working on a "real" large repo is better than an artificial
>>> one.
>>
>> Yeah, I'll try to do that.
> 
> tl;dr: After some more testing it turns out the performance issues we
> have are almost entirely due to the number of refs. Some of these I
> knew about and were obvious (e.g. git pull), but some aren't so
> obvious (why does "git log" without "--all" slow down as a function of
> the overall number of refs?).

I'm assuming that you pack your references periodically. (If not, you
should, because reading lots of loose references is very expensive for
the commands that need to iterate over all references!)

On the other hand, packed refs also have a downside, namely that
whenever even a single packed reference has to be read, the whole
packed-refs file has to be read and parsed. One way that this can bite
you, even with innocuous-seeming commands, is if you haven't disabled
the use of replace references (i.e., using "git --no-replace-objects
<CMD>" or GIT_NO_REPLACE_OBJECTS). In that case, almost any Git command
has to read the "refs/replace/*" namespace, which, in turn, forces the
whole packed-refs file to be read and parsed. This can take a
significant amount of time if you have a very large number of references.

So try your experiments with replace references disabled. If that helps,
consider disabling them on your server if you don't need them.
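
For example, either of these forms disables them for a single command:

    time git --no-replace-objects log -100 master >/dev/null
    time env GIT_NO_REPLACE_OBJECTS=1 git log -100 master >/dev/null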

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-21  4:01     ` Duy Nguyen
@ 2015-02-25 12:02       ` Duy Nguyen
  0 siblings, 0 replies; 33+ messages in thread
From: Duy Nguyen @ 2015-02-25 12:02 UTC (permalink / raw)
  To: David Turner; +Cc: Stephen Morton, Git Mailing List

On Sat, Feb 21, 2015 at 11:01 AM, Duy Nguyen <pclouds@gmail.com> wrote:
> I wonder how efficient rsync is for transferring these refs: the
> client generates a "file" containing all refs, the server does the
> same with its refs, then the client rsyncs its file to the server.
> The changes between the server and the client files are usually small,
> so I'm hoping rsync can take advantage of that.

Some numbers without any actual coding. After the initial clone, we
store the server's refs in a file called base-file at client. At the
next push or pull, the server saves its refs in 'new-file'. Using
rsync to avoid initial ref advertisement would involve these steps
(rdiff command is from librsync)

client> rdiff signature base-file signature
(client sends "signature" file to server)
server> rdiff delta signature new-file delta
(server sends "delta" file back to client)
client> rdiff patch base-file delta new-file

The exchanged files over network are "signature" and "delta". I used
my git.git's packed-refs as the base-file (1416 refs, 78789 bytes) and
modified three lines to create new-file. That produced a signature
file of 480 bytes and delta file of 6163 bytes. That's 7% the size of
the new file. Good.

When I modified more lines in new-file (15 lines), the delta file grew
to 26644 bytes ("sig" file remains the same because it only depends on
base-file). Total transferred bytes were 60% the size of new-file.
Less impressive. Maybe there are some tuning options for better
results...

The same process could be used to transfer the whole client ref list
to the server instead of sending lots of "have" lines. I suspect there
will be more changes between the client's "have" file and the server's
ref list. If the changes spread out and cause a lot of blocks to be
sent, the savings would not be as high as I'd want. I guess that's it
for the rsync idea.
-- 
Duy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 21:04         ` Junio C Hamano
@ 2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
  2015-03-02 20:15             ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-03-02 19:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Stephen Morton, Git Mailing List

On Fri, Feb 20, 2015 at 10:04 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> I actually ran this a few times while testing it, so this is a before
>> and after on a hot cache of linux.git with 406 tags v.s. ~140k. I ran
>> the gc + repack + bitmaps for both repos noted in an earlier reply of
>> mine, and took the fastest run out of 3:
>>
>>     $ time (git log master -100 >/dev/null)
>>     Before: real    0m0.021s
>>     After: real    0m2.929s
>
> Do you force --decorate with some config?  Or do you see similar
> performance difference with "git rev-parse master", too?

Yes, I had log.decorate=short set in my config. With --no-decorate:

    $ time (git log --no-decorate -100 >/dev/null)
    # Before: real    0m0.010s
    # After: real    0m0.065s

>>     $ time (git status >/dev/null)
>>     # Around 150ms, no noticeable difference
>
> This is understandable, as it will not look at any ref other than
> HEAD.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-24 12:44         ` Michael Haggerty
@ 2015-03-02 19:42           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2015-03-02 19:42 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Duy Nguyen, Stephen Morton, Git Mailing List

On Tue, Feb 24, 2015 at 1:44 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 02/20/2015 03:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Feb 20, 2015 at 1:09 PM, Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>>> On Fri, Feb 20, 2015 at 1:04 AM, Duy Nguyen <pclouds@gmail.com> wrote:
>>>> On Fri, Feb 20, 2015 at 6:29 AM, Ævar Arnfjörð Bjarmason
>>>> <avarab@gmail.com> wrote:
>>>>> Anecdotally I work on a repo at work (where I'm mostly "the Git guy") that's:
>>>>>
>>>>>  * Around 500k commits
>>>>>  * Around 100k tags
>>>>>  * Around 5k branches
>>>>>  * Around 500 commits/day, almost entirely to the same branch
>>>>>  * 1.5 GB .git checkout.
>>>>>  * Mostly text source, but some binaries (we're trying to cut down[1] on those)
>>>>
>>>> Would be nice if you could make an anonymized version of this repo
>>>> public. Working on a "real" large repo is better than an artificial
>>>> one.
>>>
>>> Yeah, I'll try to do that.
>>
>> tl;dr: After some more testing it turns out the performance issues we
>> have are almost entirely due to the number of refs. Some of these I
>> knew about and were obvious (e.g. git pull), but some aren't so
>> obvious (why does "git log" without "--all" slow down as a function of
>> the overall number of refs?).
>
> I'm assuming that you pack your references periodically. (If not, you
> should, because reading lots of loose references is very expensive for
> the commands that need to iterate over all references!)

Yes, as mentioned in another reply of mine, like this:

    git --git-dir={} gc &&
    git --git-dir={} pack-refs --all --prune &&
    git --git-dir={} repack -Ad --window=250 --depth=100
--write-bitmap-index --pack-kept-objects &&

> On the other hand, packed refs also have a downside, namely that
> whenever even a single packed reference has to be read, the whole
> packed-refs file has to be read and parsed. One way that this can bite
> you, even with innocuous-seeming commands, is if you haven't disabled
> the use of replace references (i.e., using "git --no-replace-objects
> <CMD>" or GIT_NO_REPLACE_OBJECTS). In that case, almost any Git command
> has to read the "refs/replace/*" namespace, which, in turn, forces the
> whole packed-refs file to be read and parsed. This can take a
> significant amount of time if you have a very large number of references.

Interesting. I tried the rough benchmarks I posted above with
GIT_NO_REPLACE_OBJECTS=1 and couldn't see any differences, although as
mentioned in another reply --no-decorate had a big effect on git-log.

> So try your experiments with replace references disabled. If that helps,
> consider disabling them on your server if you don't need them.
>
> Michael
>
> --
> Michael Haggerty
> mhagger@alum.mit.edu
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
@ 2015-03-02 20:15             ` Junio C Hamano
  0 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2015-03-02 20:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, Stephen Morton, Git Mailing List

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Fri, Feb 20, 2015 at 10:04 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>>> I actually ran this a few times while testing it, so this is a before
>>> and after on a hot cache of linux.git with 406 tags v.s. ~140k. I ran
>>> the gc + repack + bitmaps for both repos noted in an earlier reply of
>>> mine, and took the fastest run out of 3:
>>>
>>>     $ time (git log master -100 >/dev/null)
>>>     Before: real    0m0.021s
>>>     After: real    0m2.929s
>>
>> Do you force --decorate with some config?  Or do you see similar
>> performance difference with "git rev-parse master", too?
>
> Yes, I had log.decorate=short set in my config. With --no-decorate:
>
>     $ time (git log --no-decorate -100 >/dev/null)
>     # Before: real    0m0.010s
>     # After: real    0m0.065s

There you have the answer to your earlier question, then, which was:

>> tl;dr: After some more testing it turns out the performance issues we
>> have are almost entirely due to the number of refs. Some of these I
>> knew about and were obvious (e.g. git pull), but some aren't so
>> obvious (why does "git log" without "--all" slow down as a function of
>> the overall number of refs?).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 20:37   ` Martin Fick
@ 2015-02-21  0:41     ` David Turner
  0 siblings, 0 replies; 33+ messages in thread
From: David Turner @ 2015-02-21  0:41 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List, Stephen Morton, Duy Nguyen

On Fri, 2015-02-20 at 13:37 -0700, Martin Fick wrote:
> On Friday, February 20, 2015 01:29:12 PM David Turner wrote:
> >...
> > For a more general solution, perhaps a log of ref updates
> > could be used. Every time a ref is updated on the server,
> > that ref would be written into an append-only log.  Every
> > time a client pulls, their pull data includes an index
> > into that log.  Then on push, the client could say, "I
> > have refs as-of $index", and the server could read the
> > log (or do something more-optimized) and send only refs
> > updated since that index.
> 
> Interesting idea, I like it.
> 
> How would you make this reliable?  It relies on updates 
> being reliably recorded which would mean that you would have 
> to ensure that any tool which touches the repo follows this 
> convention.  That is unfortunately a tough thing to enforce 
> for most people.

I think it only truly relies on the server reliably updating its state
on ref updates. Which of course the server will do because why would you
let arbitrary processes write to your central git repo?  (That is, most
people use git in a roughly-centralized way, and if you turn on this
config option, you promise to only do ref updates that write to the
log).

If the client fails to update its state (on a fetch), it will send
larger-than-necessary packs but not otherwise fail.  And this situation
is sometimes detectable on the client side -- if
mtime(.git/refs/remotes/$remote) > mtime(.git/server-ref-log-index/$remote),
then we know our server-ref-log-index is out-of-date.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20 18:29 ` David Turner
@ 2015-02-20 20:37   ` Martin Fick
  2015-02-21  0:41     ` David Turner
  0 siblings, 1 reply; 33+ messages in thread
From: Martin Fick @ 2015-02-20 20:37 UTC (permalink / raw)
  To: David Turner; +Cc: Git Mailing List, Stephen Morton, Duy Nguyen

On Friday, February 20, 2015 01:29:12 PM David Turner wrote:
>...
> For a more general solution, perhaps a log of ref updates
> could be used. Every time a ref is updated on the server,
> that ref would be written into an append-only log.  Every
> time a client pulls, their pull data includes an index
> into that log.  Then on push, the client could say, "I
> have refs as-of $index", and the server could read the
> log (or do something more-optimized) and send only refs
> updated since that index.

Interesting idea, I like it.

How would you make this reliable?  It relies on updates 
being reliably recorded which would mean that you would have 
to ensure that any tool which touches the repo follows this 
convention.  That is unfortunately a tough thing to enforce 
for most people.

But perhaps, instead of logging updates, the server could 
log snapshots of all refs using an atomically increasing 
sequence number.  Then missed updates do not matter: a 
sequence number is simply an opaque handle to some full ref 
state that can be diffed against.  The snapshots need not 
even be taken inline with the client connection, or with 
every update for this to work.  It might mean that some 
extra updates are sent when they don't need to be, but at 
least they will be accurate.
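
A minimal sketch of what I mean (untested; run from cron or a 
post-receive hook):

    dir="$(git rev-parse --git-dir)/ref-snapshots"
    mkdir -p "$dir"
    # SEQ is just a counter file; each snapshot is a full ref listing
    seq=$(( $(cat "$dir/SEQ" 2>/dev/null || echo 0) + 1 ))
    git for-each-ref --format='%(objectname) %(refname)' > "$dir/$seq"
    echo "$seq" > "$dir/SEQ"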

I know in the past similar ideas have been passed around, 
but they typically relied on the server keeping track of the 
state of each client.  Instead, here we are talking about 
clients keeping track of state for a particular server.  
Clients already store info about remotes.

A very neat idea indeed, thanks!

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  6:57 Martin Fick
  2015-02-20 18:29 ` David Turner
@ 2015-02-20 19:27 ` Randall S. Becker
  1 sibling, 0 replies; 33+ messages in thread
From: Randall S. Becker @ 2015-02-20 19:27 UTC (permalink / raw)
  To: 'Martin Fick', 'David Turner'
  Cc: 'Git Mailing List', 'Stephen Morton',
	'Duy Nguyen', 'Joachim Schmitz'

-----Original Message-----
On Feb 20, 2015 1:58AM Martin Fick wrote:
>On Feb 19, 2015 5:42 PM, David Turner <dturner@twopensource.com> wrote:
> > This one is not affected by how deep your repo's history is, or how 
> > wide your tree is, so should be quick.. 
>Good to hear that others are starting to experiment with solutions to this problem!  I hope to hear more updates on this.

<snip-snip>

Now that Jojo and I have git 2.3.0 ported to the HP NonStop platform, there are some very large code bases out there that may start being managed using git. These will initially tend to have shallow histories (100's not 1000's of commits, and fairly linear) but large source and binaries - I know of a few where just the distributed set of sources is above 1 GB and is unlikely to be managed in multiple repos despite my previous best efforts to change that. Fortunately, it is a relatively simple matter to profile the code on the platform for various operations, so data on where to improve may be available - I hope. 

With that said, the NonStop file system tends to be heavier-weight than on Linux (many more moving parts by virtue of the MPP nature of the OS and hardware). Packing up changes seems pretty good, but any operation involving creating a large number of small files does hurt a bunch.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
  2015-02-20  6:57 Martin Fick
@ 2015-02-20 18:29 ` David Turner
  2015-02-20 20:37   ` Martin Fick
  2015-02-20 19:27 ` Randall S. Becker
  1 sibling, 1 reply; 33+ messages in thread
From: David Turner @ 2015-02-20 18:29 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List, Stephen Morton, Duy Nguyen

On Thu, 2015-02-19 at 23:57 -0700, Martin Fick wrote:
> On Feb 19, 2015 5:42 PM, David Turner <dturner@twopensource.com> wrote:
> >
> > On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > > >    * 'git push'? 
> > > 
> > > This one is not affected by how deep your repo's history is, or how 
> > > wide your tree is, so should be quick.. 
> > > 
> > > Ah the number of refs may affect both git-push and git-pull. I think 
> > > Stefan knows better than I in this area. 
> >
> > I can tell you that this is a bit of a problem for us at Twitter.  We 
> > have over 100k refs, which adds ~20MiB of downstream traffic to every 
> > push. 
> >
> > I added a hack to improve this locally inside Twitter: The client sends 
> > a bloom filter of shas that it believes that the server knows about; the 
> > server sends only the sha of master and any refs that are not in the 
> > bloom filter.  The client  uses its local version of the servers' refs 
> > as if they had just been sent.  This means that some packs will be 
> > suboptimal, due to false positives in the bloom filter leading some new 
> > refs to not be sent.  Also, if there were a repack between the pull and 
> > the push, some refs might have been deleted on the server; we repack 
> > rarely enough and pull frequently enough that this is hopefully not an 
> > issue. 
> >
> > We're still testing to see if this works.  But due to the number of 
> > assumptions it makes, it's probably not that great an idea for general 
> > use. 
> 
> Good to hear that others are starting to experiment with solutions to this problem!  I hope to hear more updates on this.
> 
> I have a prototype of a simpler, and
> I believe more robust solution, but aimed at a smaller use case I think.  On connecting, the client sends a sha of all its refs/shas as defined by a refspec, which it also sends to the server, which it believes the server might have the same refs/shas values for.  The server can then calculate the value of its refs/shas which meet the same refspec, and then omit sending those refs if the "verification" sha matches, and instead send only a confirmation that they matched (along with any refs outside of the refspec).  On a match, the client can inject the local values of the refs which met the refspec and be guaranteed that they match the server's values.
> 
> This optimization is aimed at the worst case scenario (and is thus the potentially best case "compression"), when the client and server match for all refs (a refs/* refspec).  This is something that happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date.  One reason I chose this as a starting optimization is that I think it is one use case which will actually not benefit from "fixing" the git protocol to only send relevant refs, since all the refs are in fact relevant here! So something like this will likely be needed in any future git protocol in order for it to be efficient for this use case.  And I believe this use case is likely to stick around.
> 
> With a minor tweak, this optimization should also work when replicating actual expected updates, by excluding the expected updating refs from the verification so that the server always sends their values, since they will likely not match and would wreck the optimization.  However, for this use case it is not clear whether it is actually even worth caring about the non-updating refs.  In theory the knowledge of the non-updating refs can potentially reduce the amount of data transmitted, but I suspect that as the ref count increases, this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic.

For a more general solution, perhaps a log of ref updates could be used.
Every time a ref is updated on the server, that ref would be written
into an append-only log.  Every time a client pulls, their pull data
includes an index into that log.  Then on push, the client could say, "I
have refs as-of $index", and the server could read the log (or do
something more-optimized) and send only refs updated since that index.
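
A sketch of the server-side logging half (hypothetical; a real version
would need locking and log rotation):

    #!/bin/sh
    # post-receive hook: append every accepted ref update to an
    # append-only log; clients remember an offset into this file
    # as their $index.
    log="$(git rev-parse --git-dir)/ref-update.log"
    while read -r old new ref; do
        printf '%s %s %s %s\n' "$(date +%s)" "$ref" "$old" "$new" >> "$log"
    done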

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Git Scaling: What factors most affect Git performance for a large repo?
@ 2015-02-20  6:57 Martin Fick
  2015-02-20 18:29 ` David Turner
  2015-02-20 19:27 ` Randall S. Becker
  0 siblings, 2 replies; 33+ messages in thread
From: Martin Fick @ 2015-02-20  6:57 UTC (permalink / raw)
  To: David Turner; +Cc: Git Mailing List, Stephen Morton, Duy Nguyen

On Feb 19, 2015 5:42 PM, David Turner <dturner@twopensource.com> wrote:
>
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > >    * 'git push'? 
> > 
> > This one is not affected by how deep your repo's history is, or how 
> > wide your tree is, so should be quick.. 
> > 
> > Ah the number of refs may affect both git-push and git-pull. I think 
> > Stefan knows better than I in this area. 
>
> I can tell you that this is a bit of a problem for us at Twitter.  We 
> have over 100k refs, which adds ~20MiB of downstream traffic to every 
> push. 
>
> I added a hack to improve this locally inside Twitter: The client sends 
> a bloom filter of shas that it believes that the server knows about; the 
> server sends only the sha of master and any refs that are not in the 
> bloom filter.  The client  uses its local version of the servers' refs 
> as if they had just been sent.  This means that some packs will be 
> suboptimal, due to false positives in the bloom filter leading some new 
> refs to not be sent.  Also, if there were a repack between the pull and 
> the push, some refs might have been deleted on the server; we repack 
> rarely enough and pull frequently enough that this is hopefully not an 
> issue. 
>
> We're still testing to see if this works.  But due to the number of 
> assumptions it makes, it's probably not that great an idea for general 
> use. 

Good to hear that others are starting to experiment with solutions to this problem!  I hope to hear more updates on this.

I have a prototype of a simpler, and
I believe more robust solution, but aimed at a smaller use case I think.  On connecting, the client sends a sha of all its refs/shas as defined by a refspec, which it also sends to the server, which it believes the server might have the same refs/shas values for.  The server can then calculate the value of its refs/shas which meet the same refspec, and then omit sending those refs if the "verification" sha matches, and instead send only a confirmation that they matched (along with any refs outside of the refspec).  On a match, the client can inject the local values of the refs which met the refspec and be guaranteed that they match the server's values.

This optimization is aimed at the worst case scenario (and is thus the potentially best case "compression"), when the client and server match for all refs (a refs/* refspec).  This is something that happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date.  One reason I chose this as a starting optimization is that I think it is one use case which will actually not benefit from "fixing" the git protocol to only send relevant refs, since all the refs are in fact relevant here! So something like this will likely be needed in any future git protocol in order for it to be efficient for this use case.  And I believe this use case is likely to stick around.

With a minor tweak, this optimization should also work when replicating actual expected updates, by excluding the expected updating refs from the verification so that the server always sends their values, since they will likely not match and would wreck the optimization.  However, for this use case it is not clear whether it is actually even worth caring about the non-updating refs.  In theory the knowledge of the non-updating refs can potentially reduce the amount of data transmitted, but I suspect that as the ref count increases, this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic.
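
As a rough illustration of the verification sha (my sketch, not the
actual prototype):

    # both ends hash their ref listing for the agreed refspec and
    # compare; matching hashes mean the full list need not be sent
    git for-each-ref --format='%(objectname) %(refname)' | sha1sum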

Please do keep us up-to-date on your results,

-Martin


Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-03-02 20:16 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
2015-02-19 22:21 ` Stefan Beller
2015-02-19 23:06   ` Stephen Morton
2015-02-19 23:15     ` Stefan Beller
2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
2015-02-20  0:04   ` Duy Nguyen
2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
2015-02-20 21:04         ` Junio C Hamano
2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
2015-03-02 20:15             ` Junio C Hamano
2015-02-20 22:02         ` Sebastian Schuberth
2015-02-24 12:44         ` Michael Haggerty
2015-03-02 19:42           ` Ævar Arnfjörð Bjarmason
2015-02-21  3:51       ` Duy Nguyen
2015-02-19 23:38 ` Duy Nguyen
2015-02-20  0:42   ` David Turner
2015-02-20 20:59     ` Junio C Hamano
2015-02-23 20:23       ` David Turner
2015-02-21  4:01     ` Duy Nguyen
2015-02-25 12:02       ` Duy Nguyen
2015-02-20  0:03 ` brian m. carlson
2015-02-20 16:06   ` Stephen Morton
2015-02-20 16:38     ` Matthieu Moy
2015-02-20 17:16     ` brian m. carlson
2015-02-20 22:08   ` Sebastian Schuberth
2015-02-20 22:58     ` brian m. carlson
2015-02-20  6:57 Martin Fick
2015-02-20 18:29 ` David Turner
2015-02-20 20:37   ` Martin Fick
2015-02-21  0:41     ` David Turner
2015-02-20 19:27 ` Randall S. Becker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.