* Performance issue: initial git clone causes massive repack
@ 2009-04-04 22:07 Robin H. Johnson
  2009-04-05  0:05 ` Nicolas Sebrecht
                   ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-04 22:07 UTC (permalink / raw)
  To: Git Mailing List


Hi,

This is a first in my series of mails over the next few days, on issues
that we've run into planning a potential migration for Gentoo's
repository into Git.

Our full repository conversion is large: even after tuning the
repacking, the packed repository is between 1.4 and 1.6GiB. As of
February 4th, 2009, it contained 4886949 objects. Unfortunately, it is
not suitable for splitting into submodules either - we have a lot of
directory moves that would cause submodule bloat.

During an initial clone, I see that git-upload-pack invokes
pack-objects, despite the ENTIRE repository already being packed - no
loose objects whatsoever. git-upload-pack then seems to buffer in
memory.

In a small repository, this wouldn't be a problem, as the entire
repository can fit in memory very easily. However, with our large
repository, git-upload-pack and git-pack-objects grows in memory to well
more than the size of the packed repository, and are usually killed by
the OOM.

During 'remote: Counting objects: 4886949, done.', git-upload-pack peaks at
2474216KB VSZ and 1143048KB RSS. 
Shortly thereafter, we get 'remote: Compressing objects:   0%
(1328/1994284)', git-pack-objects with ~2.8GB VSZ and ~1.8GB RSS. Here,
the CPU burn also starts. On our test server machine (w/ git 1.6.0.6),
it takes about 200 minutes walltime to finish the pack, IFF the OOM
doesn't kick in.

Given that the repo is entirely packed already, I see no point in doing
this.
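
As a rough way to confirm that fully-packed state on the server (a
sketch; the repository path is illustrative):

  $ cd gentoo-x86.git
  $ git count-objects -v    # 'count 0' means no loose objects at all
  $ ls objects/pack/        # everything lives in pre-built pack(s)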

For the initial clone, can the git-upload-pack algorithm please send
existing packs, and only generate a pack containing the non-packed
items?

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-04 22:07 Performance issue: initial git clone causes massive repack Robin H. Johnson
@ 2009-04-05  0:05 ` Nicolas Sebrecht
  2009-04-05  0:37   ` Robin H. Johnson
  2009-04-05 19:57 ` Jeff King
  2009-04-11 17:24 ` Mark Levedahl
  2 siblings, 1 reply; 97+ messages in thread
From: Nicolas Sebrecht @ 2009-04-05  0:05 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Sat, Apr 04, 2009 at 03:07:43PM -0700, Robin H. Johnson wrote:

> Our full repository conversion is large: even after tuning the
> repacking, the packed repository is between 1.4 and 1.6GiB. As of
> February 4th, 2009, it contained 4886949 objects. Unfortunately, it is
> not suitable for splitting into submodules either - we have a lot of
> directory moves that would cause submodule bloat.

Actually, I'm not sure that a full portage tree repository would be the
best thing to do. It would not be suitable in the long term, and working
on the repository/history would be a big mess. Why provide such a repo?
Or at least, why provide such a readable repo?

IMHO, you should provide a repository per upstream package on the main
server.


PS: what about cc'ing gentoo-scm list ?

-- 
Nicolas Sebrecht


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05  0:05 ` Nicolas Sebrecht
@ 2009-04-05  0:37   ` Robin H. Johnson
  2009-04-05  3:54     ` Nicolas Sebrecht
  0 siblings, 1 reply; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05  0:37 UTC (permalink / raw)
  To: Git Mailing List


On Sun, Apr 05, 2009 at 02:05:36AM +0200, Nicolas Sebrecht wrote:
> > Our full repository conversion is large: even after tuning the
> > repacking, the packed repository is between 1.4 and 1.6GiB. As of
> > February 4th, 2009, it contained 4886949 objects. Unfortunately, it is
> > not suitable for splitting into submodules either - we have a lot of
> > directory moves that would cause submodule bloat.
> Actually, I'm not sure that a full portage tree repository would be the
> best thing to do. It would not be suitable in the long term, and working
> on the repository/history would be a big mess. Why provide such a repo?
> Or at least, why provide such a readable repo?
> 
> IMHO, you should provide a repository per upstream package on the main
> server.
That causes incredible bloat, unfortunately.

I'll summarize why here for the git mailing list. Most of our developers
have the entire tree checked out and, in informal surveys, would like to
continue to do so. There are ~13500 packages right now (I'm excluding
eclasses/, profiles/, scripts/), growing by 15-25 new packages/week.
(~45% of packages also have a files/ directory).

For each package, the .git directory, assuming a single pack, consumes
at least 36 inodes.  Tail-packing is limited to Reiserfs3 and JFS, and
isn't widely used other than that, so assuming 4KiB per inode, that's an
overhead of at least 144KiB per package. Multiply by the number of
packages, and we get an overhead of 2GiB, before we've added ANY
content.

Without tail packing, the Gentoo tree is presently around 520MiB (you
can fit it into ~190MiB with tail packing). This means that
repo-per-package would have an overhead in the range of 400%.
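
A rough sanity check of that arithmetic in shell (numbers from above):

  $ echo $(( 13500 * 36 * 4 ))   # packages * inodes * KiB per inode
  1944000                        # ~1.9GiB of pure filesystem overhead

Against ~520MiB of content, that's nearly a factor of four - hence the
~400% figure above.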

Additionally, there's a lot of commonality between ebuilds and packages,
and having repo-per-package means that the compression algorithms can't
make use of it - dictionary algorithms are effective at compression for
a reason.

Overhead is the reason that we refused to migrate to SVN as well.
- CVS, for each directory of data, has a constant overhead of 4 inodes
  (CVS/ CVS/Root CVS/Repository CVS/Entries)
- SVN, for each data directory, has another complete copy of the data,
  plus a minimum of 10 other inodes.
- Git costs a minimum of 36 inodes per repository. In a fully packed repo,
  the number of inodes tends to stay below 50 in all cases.

> PS: what about cc'ing gentoo-scm list ?
It's not an open-posting list, so anybody here on the git list simply
replying would not get their post through. The issue has been raised
there, and this thread is mainly meant to find a resolution to that
problem.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05  0:37   ` Robin H. Johnson
@ 2009-04-05  3:54     ` Nicolas Sebrecht
  2009-04-05  4:08       ` Nicolas Sebrecht
  2009-04-05  7:04       ` Robin H. Johnson
  0 siblings, 2 replies; 97+ messages in thread
From: Nicolas Sebrecht @ 2009-04-05  3:54 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Sat, Apr 04, 2009 at 05:37:53PM -0700, Robin H. Johnson wrote:

> That causes incredible bloat, unfortunately.
> 
> I'll summarize why here for the git mailing list. Most of our developers
> have the entire tree checked out and, in informal surveys, would like to
> continue to do so. There are ~13500 packages right now 

Each developer doesn't work on that many packages, right? From my point
of view, checking out the entire tree is the wrong way to do things.

Also, you could keep an entire tree repo assuming it's _not_
"fetch-able".

> For each package, the .git directory, assuming a single pack, consumes
> at least 36 inodes.  Tail-packing is limited to Reiserfs3 and JFS, and
> isn't widely used other than that, so assuming 4KiB per inode, that's an
> overhead of at least 144KiB per package. Multiply by the number of
> packages, and we get an overhead of 2GiB, before we've added ANY
> content.

> Without tail packing, the Gentoo tree is presently around 520MiB (you
> can fit it into ~190MiB with tail packing). This means that
> repo-per-package would have an overhead in the range of 400%.

Don't know about the business side for Gentoo, but HDD space is cheap.
Also, I'd like to know how much space you will gain with the CVS to Git
migration. How much bigger is a CVS repo than a Git one?

One repo per category could be a good compromise assuming one separate
branch per ebuild, then.

> Additionally, there's a lot of commonality between ebuilds and packages,
> and having repo-per-package means that the compression algorithms can't
> make use of it - dictionary algorithms are effective at compression for
> a reason.

Please, no. We are talking about long-term issues here. Compression will
be efficient: it's all about the content of the files, and dictionary
algorithms will certainly do a good job over the ebuild revisions.

-- 
Nicolas Sebrecht


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05  3:54     ` Nicolas Sebrecht
@ 2009-04-05  4:08       ` Nicolas Sebrecht
  2009-04-05  7:04       ` Robin H. Johnson
  1 sibling, 0 replies; 97+ messages in thread
From: Nicolas Sebrecht @ 2009-04-05  4:08 UTC (permalink / raw)
  To: Nicolas Sebrecht; +Cc: Robin H. Johnson, Git Mailing List

On Sun, Apr 05, 2009 at 05:54:53AM +0200, Nicolas Sebrecht wrote:

> One repo per category could be a good compromise assuming one separate
> branch per ebuild, then.

s/ebuild/package/

-- 
Nicolas Sebrecht


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05  3:54     ` Nicolas Sebrecht
  2009-04-05  4:08       ` Nicolas Sebrecht
@ 2009-04-05  7:04       ` Robin H. Johnson
  2009-04-05 19:02         ` Nicolas Sebrecht
  1 sibling, 1 reply; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05  7:04 UTC (permalink / raw)
  To: Git Mailing List


Before I answer the rest of your post, I'd like to note that the choice
between single-repo, repo-per-package, and repo-per-category has been
flogged to death within Gentoo.

I did not come to the Git mailing list to rehash those choices. I came
here to find a solution to the performance problem. While it shows up
with our repo, I'm certain that we're not the only people with the
problem. The GSoC 2009 ideas contain a potential project for caching the
generated packs, which, while having value in itself, could be partially
avoided by sending suitable pre-built packs (if they exist) without any
repacking.

On Sun, Apr 05, 2009 at 05:54:53AM +0200, Nicolas Sebrecht wrote:
> > That causes incredible bloat, unfortunately.
> > 
> > I'll summarize why here for the git mailing list. Most of our developers
> > have the entire tree checked out and, in informal surveys, would like to
> > continue to do so. There are ~13500 packages right now 
> Each developer doesn't work on that many packages, right? From my point
> of view, checking out the entire tree is the wrong way to do things.
Also, I should note that working on the tree isn't the only reason to
have the tree checked out. While the great majority of Gentoo users get
their trees purely from rsync, there is nothing stopping you from using
a tree from CVS (anonCVS for the users, the master CVS server for the
developers).

A quick stats run shows that while some developers only touch a few
packages, there are at least 200 developers that have made a major
change to 100 or more packages.

> > Without tail packing, the Gentoo tree is presently around 520MiB (you
> > can fit it into ~190MiB with tail packing). This means that
> > repo-per-package would have an overhead in the range of 400%.
> Don't know about the business side for Gentoo, but HDD space is cheap.
There's no reason to accept bloat just to change the layout.

> Also, I'd like to know how much space you will gain with the CVS to Git
> migration. How much bigger is a CVS repo than a Git one?
For the CVS checkouts right now: 
- ~410MiB of content (w/ 4kb inodes)
- ~240MiB of CVS overhead (w/ 4kb inodes)
(sorry about the earlier 520MiB number, I forgot to exclude a local dir
of stats data on my box when I ran du quickly).
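
(A rough sketch of how such numbers can be pulled out with GNU du; the
checkout path is illustrative:)

  $ du -sh gentoo-x86    # content plus SCM overhead
  $ find gentoo-x86 -type d -name CVS -exec du -sk {} + \
      | awk '{t+=$1} END {print t, "KiB of CVS overhead"}'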

Our experimental Git, with only a single repo for gentoo-x86:
- ~410MiB of content (w/ 4kb inodes)
- 80MiB - 1.6GiB of Git total overhead.

80MiB of overhead is the total overhead with a shallow clone at depth 1.
1.6GiB is with the full history.
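
(The 80MiB figure comes from an ordinary depth-1 clone - a sketch, with
an illustrative URL:)

  $ git clone --depth 1 git://git.overlays.gentoo.org/exp/gentoo-x86.git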

And per-package numbers, because we DID do an experimental conversion
last year, although the packs might not have been optimal:
- ~410MiB of content (w/ 4kb inodes)
- 4.7GiB of Git total overhead, with a breakdown:
  - 1.9GiB in inode waste
  - 2.8GiB in packs

> One repo per category could be a good compromise assuming one separate
> branch per package, then.
Other downsides to repo-per-category and repo-per-package:
- Raises difficulty in adding a new package/category. 
  You cannot just do 'mkdir && vi ... && git add && git commit' anymore.
- The names of the directories for both the category AND the package are
  not specified in the ebuild; as such, unless they are checked out to
  the right location, you will get breakage (definitely in the package
  name, and about 10% of the time with categories).
- You cannot use git-cvsserver with them cleanly and have the correct
  behavior (we DO have developers that want to use the CVS emulation
  layer) - adding a category or a package would NOT trigger the
  addition of a new repo on the server when needed.
- Does NOT present a good base for anybody wanting to branch the entire
  tree themselves.
  

> > Additionally, there's a lot of commonality between ebuilds and packages,
> > and having repo-per-package means that the compression algorithms can't
> > make use of it - dictionary algorithms are effective at compression for
> > a reason.
> Please, no. We are talking about long-term issues here. Compression will
> be efficient: it's all about the content of the files, and dictionary
> algorithms will certainly do a good job over the ebuild revisions.
We're already on track to drop the CVS $Header$, and thereafter some of
the ebuilds will get smaller. Here's our prototype
dev-perl/Sub-Name-0.04:
====
# Copyright 1999-2009 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
MODULE_AUTHOR=XMATH
inherit perl-module
DESCRIPTION="(re)name a sub"
LICENSE="|| ( Artistic GPL-2 )"
SLOT="0"
KEYWORDS="~amd64 ~x86"
IUSE=""
SRC_TEST=do
====

We can have all the CPAN packages from CPAN author XMATH by changing
only the DESCRIPTION string. KEYWORDS then just changes over the package
lifespan.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05  7:04       ` Robin H. Johnson
@ 2009-04-05 19:02         ` Nicolas Sebrecht
  2009-04-05 19:17           ` Shawn O. Pearce
                             ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Nicolas Sebrecht @ 2009-04-05 19:02 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Sun, Apr 05, 2009 at 12:04:12AM -0700, Robin H. Johnson wrote:

> Before I answer the rest of your post, I'd like to note that the choice
> between single-repo, repo-per-package, and repo-per-category has been
> flogged to death within Gentoo.
> 
> I did not come to the Git mailing list to rehash those choices. I came
> here to find a solution to the performance problem.

I understand. I know two ways to resolve this:
- by resolving the performance problem itself,
- by changing the workflow to something more accurate and more suitable
  to the facts.

My point is that going from a centralized to a decentralized SCM
involves strongly breaking how developers and maintainers work. What
you're currently suggesting is a way to work with Git in a centralized
way. This sucks. To get things right with Git I would avoid shared and
global repositories. Gnome is doing it this way:
http://gitorious.org/projects/gnome-svn-hooks/repos/mainline/trees/master

>          The GSoC 2009 ideas contain a potential project for caching the
> generated packs, which, while having value in itself, could be partially
> avoided by sending suitable pre-built packs (if they exist) without any
> repacking.

Right. It could be an option to wait and see if the GSoC gives
something.

> Also, I should note that working on the tree isn't the only reason to
> have the tree checked out. While the great majority of Gentoo users get
> their trees purely from rsync, there is nothing stopping you from using
> a tree from CVS (anonCVS for the users, the master CVS server for the
> developers).
> 
> A quick stats run shows that while some developers only touch a few
> packages, there are at least 200 developers that have made a major
> change to 100 or more packages.

That's a point that has to be reconsidered. Not the fact that at least
200 developers work on over 100 packages (this is really not an issue)¹,
but the fact that they do that directly on the main repo/server. The
right way to achieve this is for each developer to send his work to the
maintainer². The main gain is better code review.

1. Some or all repo-per-category repos can be tracked with a simple
script.
2. Maintainers could be - or not be - the same developers as today.
Adding a layer of maintainers in charge of EAPI review (for example)
above the package maintainers could help in fixing a lot of portage
issues and would keep "simple developers" from doing crap on the main
repo(s) that users download.

> And per-package numbers, because we DID do an experimental conversion
> last year, although the packs might not have been optimal:
> - ~410MiB of content (w/ 4kb inodes)
> - 4.7GiB of Git total overhead, with a breakdown:
>   - 1.9GiB in inode waste
>   - 2.8GiB in packs

Ok.

> > One repo per category could be a good compromise assuming one separate
> > branch per package, then.
> Other downsides to repo-per-category and repo-per-package:

Let's forget a repo-per-package.

> - Raises difficulty in adding a new package/category. 
>   You cannot just do 'mkdir && vi ... && git add && git commit' anymore.

Right, but categories are not evolving that much.

> - The names of the directories for both the category AND the package
>   are not specified in the ebuild; as such, unless they are checked out
>   to the right location, you will get breakage (definitely in the
>   package name, and about 10% of the time with categories).

Of course. Quite frankly, it's recoverable without pain.

A repo-per-category local workflow would be:
$ git branch
  master
* next
  package_one
  package_two
  [...]
$ tree -a
|-- .git
|   |-- [...]
|   [...]
|-- package_one
|   |-- ChangeLog
|   |-- Manifest
|   |-- metadata.xml
|   |-- package_one-0.4.ebuild
|   `-- package_one-0.5.ebuild
|-- package_two
|   |-- ChangeLog
|   |-- Manifest
|   |-- files
|   |   |-- package_two.confd
|   |   `-- package_two.rc
|   |-- metadata.xml
|   `-- package_two-0.7-r3.ebuild
[...]

$ git checkout package_one
$ tree -a
|-- .git
|   |-- [...]
|   [...]
`-- package_one
    |-- ChangeLog
    |-- Manifest
    |-- metadata.xml
    |-- package_one-0.4.ebuild
    `-- package_one-0.5.ebuild
$ <hack, hack, hack>
$ git checkout next
$ git merge package_one 

> - Does NOT present a good base for anybody wanting to branch the entire
>   tree themselves.

Scriptable.

> We're already on track to drop the CVS $Header$, and thereafter some of
> the ebuilds will get smaller. Here's our prototype
> dev-perl/Sub-Name-0.04:
> ====
> # Copyright 1999-2009 Gentoo Foundation
> # Distributed under the terms of the GNU General Public License v2
> MODULE_AUTHOR=XMATH
> inherit perl-module
> DESCRIPTION="(re)name a sub"
> LICENSE="|| ( Artistic GPL-2 )"
> SLOT="0"
> KEYWORDS="~amd64 ~x86"
> IUSE=""
> SRC_TEST=do
> ====
> 
> We can have all the CPAN packages from CPAN author XMATH by changing
> only the DESCRIPTION string. KEYWORDS then just changes over the package
> lifespan.

Sounds good.

-- 
Nicolas Sebrecht


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 19:02         ` Nicolas Sebrecht
@ 2009-04-05 19:17           ` Shawn O. Pearce
  2009-04-05 23:02             ` Robin H. Johnson
  2009-04-05 20:43           ` Robin H. Johnson
  2009-04-05 21:28           ` david
  2 siblings, 1 reply; 97+ messages in thread
From: Shawn O. Pearce @ 2009-04-05 19:17 UTC (permalink / raw)
  To: Nicolas Sebrecht; +Cc: Robin H. Johnson, Git Mailing List

Nicolas Sebrecht <nicolas.s-dev@laposte.net> wrote:
> On Sun, Apr 05, 2009 at 12:04:12AM -0700, Robin H. Johnson wrote:
> >          The GSoC 2009 ideas contain a potential project for caching the
> > generated packs, which, while having value in itself, could be partially
> > avoided by sending suitable pre-built packs (if they exist) without any
> > repacking.
> 
> Right. It could be an option to wait and see if the GSoC gives
> something.

Another option is to use rsync:// for initial clones.
 
Tell new developers that their initial command sequence to
(efficiently) get the base tree is:

  git clone rsync://git.gentoo.org/tree.git
  cd tree
  git config remote.origin.url git://git.gentoo.org/tree.git

rsync should be more efficient at dragging 1.6GiB over the network,
as it's only streaming the files.  But it may fall over if the server
has a lot of loose objects; many more small files to create.

One way around that would be to use two repositories on the server;
a historical repository that is fully packed and contains the full
history, and a bleeding edge repository that users would normally
work against:

  git clone rsync://git.gentoo.org/fully-packed-tree.git tree
  cd tree
  git config remote.origin.url git://git.gentoo.org/tree.git
  git pull

Then every so often (e.g. once a Gentoo release cycle, so once
a year) pull the bleeding edge repository into the fully packed
repository.  That will introduce a single new pack file, so the
fully packed repository grows at a rate of 2 inodes/year, and is
still very efficient to rsync on initial clones.
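
(A sketch of that periodic merge, assuming the repository names above;
the refspec is illustrative:)

  cd fully-packed-tree.git
  git fetch ../tree.git '+refs/*:refs/*'
  # the fetched objects arrive as a single new pack plus its index -
  # the 2 inodes/year of growth mentioned above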


That caching GSoC project may help, but didn't I see earlier in
this thread that you have >4.8 million objects in your repository?
Any proposals on that project would still have Git malloc()'ing
data per object; it's ~80 bytes per object needed, so that's a data
segment of 384+ MiB per concurrent clone client.

-- 
Shawn.


* Re: Performance issue: initial git clone causes massive repack
  2009-04-04 22:07 Performance issue: initial git clone causes massive repack Robin H. Johnson
  2009-04-05  0:05 ` Nicolas Sebrecht
@ 2009-04-05 19:57 ` Jeff King
  2009-04-05 23:38   ` Robin H. Johnson
  2009-04-11 17:24 ` Mark Levedahl
  2 siblings, 1 reply; 97+ messages in thread
From: Jeff King @ 2009-04-05 19:57 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Sat, Apr 04, 2009 at 03:07:43PM -0700, Robin H. Johnson wrote:

> During an initial clone, I see that git-upload-pack invokes
> pack-objects, despite the ENTIRE repository already being packed - no
> loose objects whatsoever. git-upload-pack then seems to buffer in
> memory.

We need to run pack-objects even if the repo is fully packed because we
don't know what's _in_ the existing pack (or packs). In particular we
want to:

  - combine multiple packs into a single pack; this is more efficient on
    the network, because you can find more deltas, and I believe is
    required because the protocol sends only a single pack.

  - cull any objects which are not actually part of the reachability
    chain from the refs we are sending

If no work needs to be done for either case, then pack-objects should
basically just figure that out and then send the existing pack (the
expensive bit is doing deltas, and we don't consider objects in the same
pack for deltas, as we know we have already considered that during the
last repack). It does mmap the whole pack, so you will see your virtual
memory jump, but nothing should require the whole pack being in memory
at once.
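
(A rough way to see what pack-objects has to work with, using standard
plumbing inside the bare repository:)

  $ git count-objects -v                  # loose vs. packed object counts
  $ git verify-pack -v objects/pack/pack-*.idx | tail
                                          # ends with delta-chain statistics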

pack-objects streams the output to upload-pack, which should only ever
have an 8K buffer of it in memory at any given time.

At least that is how it is all supposed to work, according to my
understanding. So if you are seeing very high memory usage, I wonder if
there is a bug in pack-objects or upload-pack that can be fixed.

Maybe somebody more knowledgeable than me about packing can comment.

> During 'remote: Counting objects: 4886949, done.', git-upload-pack peaks at
> 2474216KB VSZ and 1143048KB RSS. 
> Shortly thereafter, we get 'remote: Compressing objects:   0%
> (1328/1994284)', git-pack-objects with ~2.8GB VSZ and ~1.8GB RSS. Here,
> the CPU burn also starts. On our test server machine (w/ git 1.6.0.6),
> it takes about 200 minutes walltime to finish the pack, IFF the OOM
> doesn't kick in.

Have you tried with a more recent git to see if it is any better? There
have been a number of changes since 1.6.0.6, although it looks like
mostly dealing with better recovery from corrupted packs.

> Given that the repo is entirely packed already, I see no point in doing
> this.
> 
> For the initial clone, can the git-upload-pack algorithm please send
> existing packs, and only generate a pack containing the non-packed
> items?

I believe that would require a change to the protocol to allow multiple
packs. However, it may be possible to munge the pack header in such a
way that you basically concatenate multiple packs. You would still want
to peek in the big pack to try deltas from the non-packed items, though.
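
(For concreteness: a pack file starts with a 12-byte header - the
literal "PACK", a version, and a big-endian object count - and ends with
a SHA-1 over the whole pack, so concatenating packs means rewriting that
count and recomputing the trailer. A peek at the header; the pack name
is illustrative:)

  $ od -A d -t x1 -N 12 objects/pack/pack-<sha1>.pack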

I think all of this falls into the realm of the GSOC pack caching project.
There have been other discussions on the list, so you might want to look
through those for something useful.

-Peff


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 19:02         ` Nicolas Sebrecht
  2009-04-05 19:17           ` Shawn O. Pearce
@ 2009-04-05 20:43           ` Robin H. Johnson
  2009-04-05 21:08             ` Shawn O. Pearce
  2009-04-05 21:28           ` david
  2 siblings, 1 reply; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05 20:43 UTC (permalink / raw)
  To: Git Mailing List


On Sun, Apr 05, 2009 at 09:02:13PM +0200, Nicolas Sebrecht wrote:
> > Before I answer the rest of your post, I'd like to note that the matter
> > of which choice between single-repo, repo-per-package, repo-per-category
> > has been flogged to death within Gentoo.
> > 
> > I did not come to the Git mailing list to rehash those choices. I came
> > here to find a solution to the performance problem.
> I understand. I know two ways to resolve this:
> - by resolving the performance problem itself,
> - by changing the workflow to something more accurate and more suitable
>   to the facts.
> 
> My point is that going from a centralized to a decentralized SCM
> involves strongly breaking how developers and maintainers work. What
> you're currently suggesting is a way to work with Git in a centralized
> way. This sucks. To get things right with Git I would avoid shared and
> global repositories. Gnome is doing it this way:
> http://gitorious.org/projects/gnome-svn-hooks/repos/mainline/trees/master
The entire matter of splitting the repository comes down to what should
be considered an atomic unit. For GNOME, KDE and all of the other large
Git consumers that I'm aware of, their atomic units are individual
packages - specifically because they make sense to be consumed without
having all the rest of the packages. The gentoo tree is an atomic unit
in itself: changes to the profiles/ directory (for package masks, USE
keys) are frequently related to, and need to be committed and received
atomically with, changes to one or more packages.

> >          The GSoC 2009 ideas contain a potential project for caching the
> > generated packs, which, while having value in itself, could be partially
> > avoided by sending suitable pre-built packs (if they exist) without any
> > repacking.
> Right. It could be an option to wait and see if the GSoC gives
> something.
How hard is it to just look at the git-upload-pack code and make it
realize that it doesn't need to repack at all for this case?

> > A quick stats run shows that while some developers only touch a few
> > packages, there are at least 200 developers that have made a major
> > change to 100 or more packages.
> That's a point that has to be reconsidered. Not the fact that at least
> 200 developers work on over 100 packages (this is really not an issue)¹,
> but the fact that they do that directly on the main repo/server. The
> right way to achieve this is for each developer to send his work to the
> maintainer². The main gain is better code review.
This has been shot down by our developer base. One of the grounds is
that there is no developer with sufficient time to take a merge-master
role on a regular basis like that.

> 1. Some or all repo-per-category repos can be tracked with a simple
> script.
> 2. Maintainers could be - or not be - the same developers as today.
> Adding a layer of maintainers in charge of EAPI review (for example)
> above the package maintainers could help in fixing a lot of portage
> issues and would keep "simple developers" from doing crap on the main
> repo(s) that users download.
You imply that there is a problem in that field already, which I
disagree with.

> > > One repo per category could be a good compromise assuming one separate
> > > branch per package, then.
> > Other downsides to repo-per-category and repo-per-package:
> Let's forget a repo-per-package.
One downside unique to repo-per-category is that when a package moves
cross-category, you end up with it consuming space in packs on both
sides.

> > - Raises difficulty in adding a new package/category. 
> >   You cannot just do 'mkdir && vi ... && git add && git commit' anymore.
> Right, but categories are not evolving that much.
There's demand to evolve them, but bulk package moves are painful with
CVS, so it's been waiting for Git.

> A repo-per-category local workflow would be:
> [...]
> $ git checkout package_one
> $ tree -a
> |-- .git
> |   |-- [...]
> |   [...]
> `-- package_one
>     |-- ChangeLog
>     |-- Manifest
>     |-- metadata.xml
>     |-- package_one-0.4.ebuild
>     `-- package_one-0.5.ebuild
Umm, why does package_two not exist in the other branch?
If package_one depends on package_two, you're in for a world of fail the
moment you change branches here.

> > - Does NOT present a good base for anybody wanting to branch the entire
> >   tree themselves.
> Scriptable.
You dropped my cvsserver list item.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 20:43           ` Robin H. Johnson
@ 2009-04-05 21:08             ` Shawn O. Pearce
  0 siblings, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2009-04-05 21:08 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

"Robin H. Johnson" <robbat2@gentoo.org> wrote:
> > >          The GSoC 2009 ideas contain a potential project for caching the
> > > generated packs, which, while having value in itself, could be partially
> > > avoided by sending suitable pre-built packs (if they exist) without any
> > > repacking.
> > Right. It could be an option to wait and see if the GSoC gives
> > something.
>
> How hard is it to just look at the git-upload-pack code and make it
> realize that it doesn't need to repack at all for this case?

I don't need to go look.  I know that code.

It's harder than you think.

I'll tell you what, *you* go look at the git-upload-pack code and
come back with a patch that doesn't need to repack at all for this
case, *and* which Junio will actually apply.  If it's any good,
Junio would apply it pretty quickly.

Nobody else has managed to create such a patch just yet, because it's
extremely non-trivial.  It's pretty much never the case that an active
repository is fully repacked, so we always have to enumerate some
number of loose objects and put them into a single outgoing pack
for the network.  It's also considered to be a security feature of
Git that we only transmit reachable objects.
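
(A rough illustration of the distinction, using standard commands:)

  $ git rev-list --objects --all | wc -l   # reachable objects - what we send
  $ git count-objects -v                   # everything on disk, reachable or not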

-- 
Shawn.


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 19:02         ` Nicolas Sebrecht
  2009-04-05 19:17           ` Shawn O. Pearce
  2009-04-05 20:43           ` Robin H. Johnson
@ 2009-04-05 21:28           ` david
  2009-04-05 21:36             ` Sverre Rabbelier
  2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
  2 siblings, 2 replies; 97+ messages in thread
From: david @ 2009-04-05 21:28 UTC (permalink / raw)
  To: Nicolas Sebrecht; +Cc: Robin H. Johnson, Git Mailing List

On Sun, 5 Apr 2009, Nicolas Sebrecht wrote:

> On Sun, Apr 05, 2009 at 12:04:12AM -0700, Robin H. Johnson wrote:
>
>> Before I answer the rest of your post, I'd like to note that the choice
>> between single-repo, repo-per-package, and repo-per-category has been
>> flogged to death within Gentoo.
>>
>> I did not come to the Git mailing list to rehash those choices. I came
>> here to find a solution to the performance problem.
>
> I understand. I know two ways to resolve this:
> - by resolving the performance problem itself,
> - by changing the workflow to something more accurate and more suitable
>  to the facts.
>
> My point is that going from a centralized to a decentralized SCM
> involves strongly breaking how developers and maintainers work. What
> you're currently suggesting is a way to work with Git in a centralized
> way. This sucks. To get things right with Git I would avoid shared and
> global repositories. Gnome is doing it this way:
> http://gitorious.org/projects/gnome-svn-hooks/repos/mainline/trees/master

guys, back off a little on telling the gentoo people to change. the kernel
developers don't split the kernel into 'core', 'drivers', etc. pieces just
because some people only work on one area. I see the gentoo desire to keep
things in one repo as being something very similar.

the problem here is a real one: if you have a large repo, git send-pack
will always generate a new pack, even if it doesn't need to (with the
extreme case being that the repo is fully packed)

>>          The GSoC 2009 ideas contain a potential project for caching the
>> generated packs, which, while having value in itself, could be partially
>> avoided by sending suitable pre-built packs (if they exist) without any
>> repacking.
>
> Right. It could be an option to wait and see if the GSoC gives
> something.

the GSOC project is not the same thing. in this case the packs are
already 'cached' (they are stored on disk); what is needed is some option
to let git send existing pack(s) if they exist rather than taking the
time to try and generate an 'optimal' pack.

I'm actually surprised that this is happening. I thought that the
recommendation was that the public repository should do a very aggressive
pack (one that takes a lot of resources) for the old content, so that
people cloning from it get the advantage of the tight packing without
having to do it themselves.

if the server _always_ re-generates the pack from scratch then this is a 
waste of time (except for people who clone via the dumb, unsafe 
mechanisms)

David Lang


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 21:28           ` david
@ 2009-04-05 21:36             ` Sverre Rabbelier
  2009-04-06  3:24               ` Nicolas Pitre
  2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
  1 sibling, 1 reply; 97+ messages in thread
From: Sverre Rabbelier @ 2009-04-05 21:36 UTC (permalink / raw)
  To: david, Junio C Hamano
  Cc: Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

Heya,

On Sun, Apr 5, 2009 at 23:28,  <david@lang.hm> wrote:
> Guys, back off a little on telling the gentoo people to change.

I agree here, we should either say "look, we don't really support big
repositories because [explanation here], unless you [workarounds
here]" OR we should work to improve the support we do have. Of course,
the latter option does not magically create developer time to work on
that, but if we do go that way we should at least tell people that we
are aware of the problems and that it's on the global TODO list (not
necessarily on anyone's personal TODO list though).
Of course, the problem can sometimes be solved by splitting the
repository, but I think it is important to have an official policy
here: do we want Git to support huge repositories, or do we not?

-- 
Cheers,

Sverre Rabbelier


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 21:28           ` david
  2009-04-05 21:36             ` Sverre Rabbelier
@ 2009-04-05 22:59             ` Nicolas Sebrecht
  2009-04-05 23:20               ` david
  2009-04-07 10:11               ` Martin Langhoff
  1 sibling, 2 replies; 97+ messages in thread
From: Nicolas Sebrecht @ 2009-04-05 22:59 UTC (permalink / raw)
  To: david; +Cc: Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Sun, Apr 05, 2009 at 02:28:35PM -0700, david@lang.hm wrote:

> guys, back off a little on telling the gentoo people to change.

Don't blame the Git people, please. I am currently the only one here
arguing this way, and I see painful work coming for Gentoo. The Git
people didn't argue these issues.

>                                                                 the
> kernel developers don't split the kernel into 'core', 'drivers', etc.
> pieces just because some people only work on one area.

And you might notice that they don't provide CVS access and don't
actually work around a unique shared repo. Also, you might notice that
keeping the history clean, to make work on the kernel easier, is not a
trivial issue.

> just because some people only work on one area. I see the gentoo desire 
> to keep things in one repo as being something very similar.

That's why I think the gentoo desire is not very clean (no offense).
What I see is that on the one hand you want a DSCM, and on the other
hand you want to keep a central shared repo.

> the problem here is a real one: if you have a large repo, git send-pack
> will always generate a new pack, even if it doesn't need to (with the
> extreme case being that the repo is fully packed)

What about the rsync solution given in this thread?

-- 
Nicolas Sebrecht


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 19:17           ` Shawn O. Pearce
@ 2009-04-05 23:02             ` Robin H. Johnson
  0 siblings, 0 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05 23:02 UTC (permalink / raw)
  To: Git Mailing List


On Sun, Apr 05, 2009 at 12:17:03PM -0700, Shawn O. Pearce wrote:
> Another option is to use rsync:// for initial clones.
>   git clone rsync://git.gentoo.org/tree.git
> rsync should be more efficient at dragging 1.6GiB over the network,
> as it's only streaming the files.  But it may fall over if the server
> has a lot of loose objects; many more small files to create.
I just tried this, and ran into a segfault.

Original command:
# git clone rsync://git.overlays.gentoo.org/vcs-public-gitroot/exp/gentoo-x86.git

At a glance, it looks like the linked list hits a null value during the
internal while loop - 'list' isn't checked before using 'list->next'.

gdb> bt
#0  strcmp () at ../sysdeps/x86_64/strcmp.S:30
#1  0x000000000049474c in get_refs_via_rsync (transport=<value optimized out>, for_push=<value optimized out>) at transport.c:123
#2  0x000000000049234c in transport_get_remote_refs (transport=0x725fc9) at transport.c:1045
#3  0x000000000041620a in cmd_clone (argc=<value optimized out>, argv=0x7fff908c8550, prefix=<value optimized out>) at builtin-clone.c:487
#4  0x0000000000404f59 in handle_internal_command (argc=0x2, argv=0x7fff908c8550) at git.c:244
#5  0x0000000000405167 in main (argc=0x2, argv=0x7fff908c8550) at git.c:434
gdb> up
#1  0x000000000049474c in get_refs_via_rsync (transport=<value optimized out>, for_push=<value optimized out>) at transport.c:123
123					(cmp = strcmp(buffer + 41,
gdb> print list
$1 = {nr = 0x0, alloc = 0x0, name = 0x0}

If I go into the repo thereafter and manually run git-fetch again, it does work
fine.

> One way around that would be to use two repositories on the server;
> a historical repository that is fully packed and contains the full
> history, and a bleeding edge repository that users would normally
> work against:
Yup, we've been considering something similar. We do have one specific
need with that, however: to prevent resource abuse, we would like to
DENY the ability to do the initial clone with git:// - just so that
nobody tries to DoS our servers by running a couple of hungry initial
clones at once.

> That caching GSoC project may help, but didn't I see earlier in
> this thread that you have >4.8 million objects in your repository?
> Any proposals on that project would still have Git malloc()'ing
> data per object; it's ~80 bytes per object needed, so that's a data
> segment of 384+ MiB per concurrent clone client.
384MiB or even 512MiB I can cover. It's the 200+ wallclock minutes of cpu burn
with no download that aren't acceptable.

P.S.
The -v output of the rsync-mode git-fetch is almost devoid of output.
Can we maybe pipe the rsync progress back?


-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
@ 2009-04-05 23:20               ` david
  2009-04-05 23:28                 ` Robin Rosenberg
  2009-04-06  3:34                 ` Nicolas Pitre
  2009-04-07 10:11               ` Martin Langhoff
  1 sibling, 2 replies; 97+ messages in thread
From: david @ 2009-04-05 23:20 UTC (permalink / raw)
  To: Nicolas Sebrecht; +Cc: Robin H. Johnson, Git Mailing List

On Mon, 6 Apr 2009, Nicolas Sebrecht wrote:

> On Sun, Apr 05, 2009 at 02:28:35PM -0700, david@lang.hm wrote:
>
>> guys, back off a little on telling the gentoo people to change.
>
> Don't blame the Git people, please. I am currently the only one here
> arguing this way, and I see painful work coming for Gentoo. The Git
> people didn't argue these issues.
>
>>                                                                 the
>> kernel developers don't split the kernel into 'core', 'drivers', etc.
>> pieces just because some people only work on one area.
>
> And you might notice that they don't provide CVS access and don't
> actually work around a unique shared repo. Also, you might notice that
> keeping the history clean, to make work on the kernel easier, is not a
> trivial issue.

these issues are completely separate from the issue that the original
poster asked about, which is that when someone tries to clone the
repository, the system wastes a lot of time creating a new pack.

the kernel has a central public repo; they could run the CVS server on
that and still keep the rest of the kernel development exactly the way
it is.

if they are currently planning for one central repo with everyone pushing 
to it, I expect that they will change their workflow as they get used to 
git, but that isn't going to address the problem in the tool.

>> just because some people only work on one area. I see the gentoo desire
>> to keep things in one repo as being something very similar.
>
> That's why I think the gentoo desire is not very clean (no offense).
> What I see is that on the one hand you want a DSCM, and on the other
> hand you want to keep a central shared repo.

don't worry about this part of things, worry about why the server wastes 
so many resources.

if this is really what's happening, other projects will suffer as well 
(including the kernel, which has a very distributed workflow)

>> the problem here is a real one: if you have a large repo, git send-pack
>> will always generate a new pack, even if it doesn't need to (with the
>> extreme case being that the repo is fully packed)
>
> What about the rsync solution given in this thread?

that may be a work-around for a situation where git just doesn't work, but 
how do they prevent users from killing their server by trying to do a 
normal git clone?

David Lang


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 23:20               ` david
@ 2009-04-05 23:28                 ` Robin Rosenberg
  2009-04-06  3:34                 ` Nicolas Pitre
  1 sibling, 0 replies; 97+ messages in thread
From: Robin Rosenberg @ 2009-04-05 23:28 UTC (permalink / raw)
  To: david; +Cc: Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Monday, 06 April 2009 01:20:07, david@lang.hm wrote:
> >> the problem here is a real one: if you have a large repo, git send-pack
> >> will always generate a new pack, even if it doesn't need to (with the
> >> extreme case being that the repo is fully packed)
> >
> > What about the rsync solution given in this thread?
> 
> that may be a work-around for a situation where git just doesn't work, but 
> how do they prevent users from killing their server by trying to do a 
> normal git clone?

Is there no way of telling git not to work so hard on packing?

If not, you could try JGit and compare. It's still too stupid to pack
much, so it shouldn't spend much CPU time (for that reason at least). I
haven't tried JGit's daemon with large amounts of data yet.
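
(There are per-repo knobs that at least reduce the delta-search effort -
a sketch, not a recommendation, since they also hurt pack quality:)

  $ git config pack.window 0   # don't search for new deltas at all
  $ git config pack.depth 1    # or: cap the delta chain length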

-- robin


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 19:57 ` Jeff King
@ 2009-04-05 23:38   ` Robin H. Johnson
  2009-04-05 23:42     ` Robin H. Johnson
                       ` (3 more replies)
  0 siblings, 4 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05 23:38 UTC (permalink / raw)
  To: Git Mailing List


On Sun, Apr 05, 2009 at 03:57:14PM -0400, Jeff King wrote:
> > During an initial clone, I see that git-upload-pack invokes
> > pack-objects, despite the ENTIRE repository already being packed - no
> > loose objects whatsoever. git-upload-pack then seems to buffer in
> > memory.
> We need to run pack-objects even if the repo is fully packed because we
> don't know what's _in_ the existing pack (or packs). In particular we
> want to:
>   - combine multiple packs into a single pack; this is more efficient on
>     the network, because you can find more deltas, and I believe is
>     required because the protocol sends only a single pack.
> 
>   - cull any objects which are not actually part of the reachability
>     chain from the refs we are sending
> 
> If no work needs to be done for either case, then pack-objects should
> basically just figure that out and then send the existing pack (the
> expensive bit is doing deltas, and we don't consider objects in the same
> pack for deltas, as we know we have already considered that during the
> last repack). It does mmap the whole pack, so you will see your virtual
> memory jump, but nothing should require the whole pack being in memory
> at once.
While my current pack setup has multiple packs of not more than 100MiB
each, that was simply for ease of resume with rsync+http tests. Even
when I already had a single pack, with every object reachable,
pack-objects was redoing the packing.
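
(Those capped packs came from an ordinary repack with a size limit - a
sketch, assuming the pack.packSizeLimit knob:)

  $ git config pack.packSizeLimit 100m
  $ git repack -a -d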

> pack-objects streams the output to upload-pack, which should only ever
> have an 8K buffer of it in memory at any given time.
> 
> At least that is how it is all supposed to work, according to my
> understanding. So if you are seeing very high memory usage, I wonder if
> there is a bug in pack-objects or upload-pack that can be fixed.
> 
> Maybe somebody more knowledgeable than me about packing can comment.
Looking at the source, I agree that it should be buffering; however,
top and ps seem to disagree: 3GiB VSZ and 2.5GiB RSS here now.

%CPU %MEM     VSZ     RSS STAT START   TIME COMMAND
 0.0  0.0  140932    1040 Ss   16:09   0:00 \_ git-upload-pack /code/gentoo/gentoo-git/gentoo-x86.git 
32.2  0.0       0       0 Z    16:09   1:50     \_ [git-upload-pack] <defunct>
80.8 44.2 3018484 2545700 Sl   16:09   4:36     \_ git pack-objects --stdout --progress --delta-base-offset 

Also, I did another trace, using some other hardware, in a LAN setting, and
noticed that git-upload-pack/pack-objects only seems to start output to the
network after it reaches 100% in 'remote: Compressing objects:'.

Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at the server
in this case cut the 200 wallclock minutes before any sending took place down
to 5 minutes.


> > During 'remote: Counting objects: 4886949, done.', git-upload-pack peaks at
> > 2474216KB VSZ and 1143048KB RSS. 
> > Shortly thereafter, we get 'remote: Compressing objects:   0%
> > (1328/1994284)', git-pack-objects with ~2.8GB VSZ and ~1.8GB RSS. Here,
> > the CPU burn also starts. On our test server machine (w/ git 1.6.0.6),
> > it takes about 200 minutes walltime to finish the pack, IFF the OOM
> > doesn't kick in.
> Have you tried with a more recent git to see if it is any better? There
> have been a number of changes since 1.6.0.6, although it looks like
> mostly dealing with better recovery from corrupted packs.
Testing right now, the above on the LAN setup was w/ current git HEAD.

> > For the initial clone, can the git-upload-pack algorithm please send
> > existing packs, and only generate a pack containing the non-packed
> > items?
> 
> I believe that would require a change to the protocol to allow multiple
> packs. However, it may be possible to munge the pack header in such a
> way that you basically concatenate multiple packs. You would still want
> to peek in the big pack to try deltas from the non-packed items, though.
> 
> I think all of this falls into the realm of the GSOC pack caching project.
> There have been other discussions on the list, so you might want to look
> through those for something useful.
Yes, both changing the protocol and recognizing that existing packs may
be suitable to send could be considered part of the caching project, as
they fall under the aegis of making good use of what's already stored in
the cache.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 23:38   ` Robin H. Johnson
@ 2009-04-05 23:42     ` Robin H. Johnson
       [not found]     ` <0015174c150e49b5740466d7d2c2@google.com>
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-05 23:42 UTC (permalink / raw)
  To: Git Mailing List


On Sun, Apr 05, 2009 at 04:38:31PM -0700, Robin H. Johnson wrote:
> > Have you tried with a more recent git to see if it is any better? There
> > have been a number of changes since 1.6.0.6, although it looks like
> > mostly dealing with better recovery from corrupted packs.
> Testing right now, the above on the LAN setup was w/ current git HEAD.
Just following up, there seems to be no significant change in results
with v1.6.2.2 over v1.6.0.6.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Re: Performance issue: initial git clone causes massive repack
       [not found]     ` <0015174c150e49b5740466d7d2c2@google.com>
@ 2009-04-06  0:29       ` Robin H. Johnson
  0 siblings, 0 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-06  0:29 UTC (permalink / raw)
  To: Git List


On Mon, Apr 06, 2009 at 12:17:18AM +0000, SRabbelier@gmail.com wrote:
> Heya,
>
On Mon, Apr 6, 2009 at 01:38, Robin H. Johnson <robbat2@gentoo.org> wrote:
>> Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at the
>> server in this case cut the 200 wallclock minutes before any sending
>> took place down to 5 minutes.
> I'm curious what kind of hardware changes you made to achieve such an 
> enormous effect? Was it just added RAM on the same machine?
No, see the paragraph previous to that, showing it was a different machine that
just happened to have 6GiB of RAM.

The key difference is that having 6GiB of RAM was enough to stop the
swap/OOM-killing of git-pack-objects/git-upload-pack that happened on the slow
server, which I considered to be entirely unwarranted since the pack was
already generated and perfect for use.

"Slow" server:
- deadline scheduler
- AMD Opteron 1210, single socket, 2 cores @ 1.8GHZ
- 2GB Reg ECC RAM
- 2x ST3250620AS, RAID1
- 100Mbit internet feed, co-located in Texas.

"Fast" server:
- deadline scheduler
- Intel Core2 Q6600, single socket, 4 cores @ 2.4GHz
- ~5.7GiB of cheap RAM (6GiB in the box, 256MiB not usable due to BIOS MTRR brokenness)
- 7x ST3320620AS, RAID-5.
- Sitting on my home LAN, with a very crappy upload bandwidth to the Internet.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 23:38   ` Robin H. Johnson
  2009-04-05 23:42     ` Robin H. Johnson
       [not found]     ` <0015174c150e49b5740466d7d2c2@google.com>
@ 2009-04-06  3:10     ` Nguyen Thai Ngoc Duy
  2009-04-06  4:09       ` Nicolas Pitre
  2009-04-06  4:06     ` Nicolas Pitre
  3 siblings, 1 reply; 97+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-04-06  3:10 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Mon, Apr 6, 2009 at 9:38 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> Looking at the source, I agree that it should be buffering; however,
> top and ps seem to disagree: 3GiB VSZ and 2.5GiB RSS here now.
>
> %CPU %MEM     VSZ     RSS STAT START   TIME COMMAND
>  0.0  0.0  140932    1040 Ss   16:09   0:00 \_ git-upload-pack /code/gentoo/gentoo-git/gentoo-x86.git
> 32.2  0.0       0       0 Z    16:09   1:50     \_ [git-upload-pack] <defunct>
> 80.8 44.2 3018484 2545700 Sl   16:09   4:36     \_ git pack-objects --stdout --progress --delta-base-offset
>
> Also, I did another trace, using some other hardware, in a LAN setting, and
> noticed that git-upload-pack/pack-objects only seems to start output to the
> network after it reaches 100% in 'remote: Compressing objects:'.
>
> Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at the server
> in this case cut the 200 wallclock minutes before any sending took place down
> to 5 minutes.

Searching back through the archive, there was a memory fragmentation issue
with the gcc repo. I wonder if it is happening again. Maybe you should try
the Google allocator. BTW, did you try turning off THREADED_DELTA_SEARCH?
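
Something like this should approximate both suggestions without rebuilding
git (assuming pack.threads is honored by your 1.6.x build, and that a
tcmalloc shared library is installed at that path):

	git config pack.threads 1
	LD_PRELOAD=/usr/lib/libtcmalloc.so git upload-pack /path/to/repo.git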
-- 
Duy

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 21:36             ` Sverre Rabbelier
@ 2009-04-06  3:24               ` Nicolas Pitre
  2009-04-07  8:10                 ` Björn Steinbrink
  0 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06  3:24 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: david, Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On Sun, 5 Apr 2009, Sverre Rabbelier wrote:

> Heya,
> 
> On Sun, Apr 5, 2009 at 23:28,  <david@lang.hm> wrote:
> > Guys, back off a little on telling the gentoo people to change.
> 
> I agree here, we should either say "look, we don't really support big
> repositories because [explanation here], unless you [workarounds
> here]" OR we should work to improve the support we do have. Of course,
> the latter option does not magically create developer time to work on
> that, but if we do go that way we should at least tell people that we
> are aware of the problems and that it's on the global TODO list (not
> necessarily on anyone's personal TODO list though).

For the record... I at least am aware of the problem and it is indeed on 
my personal git todo list.  Not that I have a clear solution yet (I've 
been pondering on some git packing issues for almost 4 years now).

Still, in this particular case, the problem appears to be unclear to me, 
like "this shouldn't be so bad".

> Of course, the problem can sometimes be solved by splitting the
> repository, but I think it is important to have an official policy
> here, do we want Git to support huge repositories, or do we not?

I do.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 23:20               ` david
  2009-04-05 23:28                 ` Robin Rosenberg
@ 2009-04-06  3:34                 ` Nicolas Pitre
  2009-04-06  5:15                   ` Junio C Hamano
  2009-04-06 11:22                   ` Matthieu Moy
  1 sibling, 2 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06  3:34 UTC (permalink / raw)
  To: david; +Cc: Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Sun, 5 Apr 2009, david@lang.hm wrote:

> On Mon, 6 Apr 2009, Nicolas Sebrecht wrote:
> 
> > On Sun, Apr 05, 2009 at 02:28:35PM -0700, david@lang.hm wrote:
> > 
> > > guys, back off a little on telling the gentoo people to change.
> > 
> > Don't blame Git people, please. I currently am the only one here arguing
> > that way and seeing painful work coming for Gentoo. The Git people didn't
> > bring up those issues.
> > 
> > >                                                                 the
> > > kernel developers don't split the kernel into 'core' 'drivers' etc pieces
> > > just because some people only work on one area.
> > 
> > And you might notice that they don't provide CVS access and don't
> > actually work around a unique shared repo. Also, you might notice that
> > keeping the history clean to make work on the kernel easier is not a
> > trivial issue.
> 
these issues are completely separate from the issue that the initial poster
asked about, which is that when someone tries to clone the repository the
system wastes a lot of time creating a new pack.

And this shouldn't be, by design.  Especially if your repo serving clone 
requests is already well packed.

What git-pack-objects does in this case is not a full repack.  It 
instead _reuses_ as much of the existing packs as possible, and only does 
the heavy packing processing for loose objects and/or inter-pack 
boundaries when gluing everything together for streaming over the net.  
If for example you have a single pack because your repo is already fully 
packed, then the "packing operation" involved during a clone should 
merely copy the existing pack over with no further attempt at delta 
compression.

> don't worry about this part of things, worry about why the server wastes so
> many resources.

Indeed.  And since a significant amount of code involved happens to be 
mine, I do wonder.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 23:38   ` Robin H. Johnson
                       ` (2 preceding siblings ...)
  2009-04-06  3:10     ` Nguyen Thai Ngoc Duy
@ 2009-04-06  4:06     ` Nicolas Pitre
  2009-04-06 14:20       ` Robin H. Johnson
  3 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06  4:06 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Sun, 5 Apr 2009, Robin H. Johnson wrote:

> On Sun, Apr 05, 2009 at 03:57:14PM -0400, Jeff King wrote:
> > > During an initial clone, I see that git-upload-pack invokes
> > > pack-objects, despite the ENTIRE repository already being packed - no
> > > loose objects whatsoever. git-upload-pack then seems to buffer in
> > > memory.
> > We need to run pack-objects even if the repo is fully packed because we
> > don't know what's _in_ the existing pack (or packs). In particular we
> > want to:
> >   - combine multiple packs into a single pack; this is more efficient on
> >     the network, because you can find more deltas, and I believe is
> >     required because the protocol sends only a single pack.
> > 
> >   - cull any objects which are not actually part of the reachability
> >     chain from the refs we are sending
> > 
> > If no work needs to be done for either case, then pack-objects should
> > basically just figure that out and then send the existing pack (the
> > expensive bit is doing deltas, and we don't consider objects in the same
> > pack for deltas, as we know we have already considered that during the
> > last repack). It does mmap the whole pack, so you will see your virtual
> > memory jump, but nothing should require the whole pack being in memory
> > at once.

Actually the pack is mapped with a (configurable) window.  See the
core.packedGitWindowSize and core.packedGitLimit config options for 
details.
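
If memory pressure is the issue, those can be shrunk for testing, e.g. 
(values purely illustrative, not a recommendation):

	git config core.packedGitWindowSize 16m
	git config core.packedGitLimit 128m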

> While my current pack setup has multiple packs of not more than 100MiB
> each, that was simply for ease of resume with rsync+http tests. Even
> when I already had a single pack, with every object reachable,
> pack-objects was redoing the packing.

In that case it shouldn't have.

> > pack-objects streams the output to upload-pack, which should only ever
> > have an 8K buffer of it in memory at any given time.
> > 
> > At least that is how it is all supposed to work, according to my
> > understanding. So if you are seeing very high memory usage, I wonder if
> > there is a bug in pack-objects or upload-pack that can be fixed.
> > 
> > Maybe somebody more knowledgeable than me about packing can comment.
> Looking at the source, I agree that it should be buffering; however, top and ps
> seem to disagree. 3GiB VSZ and 2.5GiB RSS here now.
> 
> %CPU %MEM     VSZ     RSS STAT START   TIME COMMAND
>  0.0  0.0  140932    1040 Ss   16:09   0:00 \_ git-upload-pack /code/gentoo/gentoo-git/gentoo-x86.git 
> 32.2  0.0       0       0 Z    16:09   1:50     \_ [git-upload-pack] <defunct>
> 80.8 44.2 3018484 2545700 Sl   16:09   4:36     \_ git pack-objects --stdout --progress --delta-base-offset 
> 
> Also, I did another trace, using some other hardware, in a LAN setting, and
> noticed that git-upload-pack/pack-objects only seems to start output to the
> network after it reaches 100% in 'remote: Compressing objects:'.

That's to be expected.  Delta compression matches objects which are not 
in the stream order at all.  Therefore it is not possible to start 
outputting pack data until this pass is done.  Still, this pass should 
not be invoked if your repository is already fully packed into one pack.  
Can you confirm this is actually the case?
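
Something like:

	git count-objects -v
	ls objects/pack/

should show zero loose objects and a single *.pack file if so.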

> Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at 
> the server in this case cut the 200 wallclock minutes before any 
> sending took place down to 5 minutes.

Well... here's a wild guess.  In the source repository serving clone 
requests, please do:

	git config pack.deltaCacheSize 1
	git config pack.deltaCacheLimit 0

and try cloning again with a fully packed repository.

> > > For the initial clone, can the git-upload-pack algorithm please send
> > > existing packs, and only generate a pack containing the non-packed
> > > items?
> > 
> > I believe that would require a change to the protocol to allow multiple
> > packs. However, it may be possible to munge the pack header in such a
> > way that you basically concatenate multiple packs. You would still want
> > to peek in the big pack to try deltas from the non-packed items, though.

As explained already, even if the protocol requires a single pack to be 
created, it is still made up of unmodified data segments from existing 
packs as much as possible.  So you should see it more or less as the 
concatenation of those packs already, plus some munging over the edges.

> > I think all of this falls into the realm of the GSOC pack caching project.
> > There have been other discussions on the list, so you might want to look
> > through those for something useful.
Yes, both changing the protocol and recognizing that existing packs may be
suitable to send could be considered part of the caching project, as they
fall under the aegis of making good use of what's already stored in the
cache.

The pack caching project is meant to address a different issue: mainly to 
bypass the object enumeration cost.  In other words, it could allow for 
skipping the "Counting objects" pass, and a tiny bit more.  At least in 
theory that's about the main difference.  This has many drawbacks as 
well though.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  3:10     ` Nguyen Thai Ngoc Duy
@ 2009-04-06  4:09       ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06  4:09 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1462 bytes --]

On Mon, 6 Apr 2009, Nguyen Thai Ngoc Duy wrote:

> On Mon, Apr 6, 2009 at 9:38 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> > Looking at the source, I agree that it should be buffering; however, top and ps
> > seem to disagree. 3GiB VSZ and 2.5GiB RSS here now.
> >
> > %CPU %MEM     VSZ     RSS STAT START   TIME COMMAND
> >  0.0  0.0  140932    1040 Ss   16:09   0:00 \_ git-upload-pack /code/gentoo/gentoo-git/gentoo-x86.git
> > 32.2  0.0       0       0 Z    16:09   1:50     \_ [git-upload-pack] <defunct>
> > 80.8 44.2 3018484 2545700 Sl   16:09   4:36     \_ git pack-objects --stdout --progress --delta-base-offset
> >
> > Also, I did another trace, using some other hardware, in a LAN setting, and
> > noticed that git-upload-pack/pack-objects only seems to start output to the
> > network after it reaches 100% in 'remote: Compressing objects:'.
> >
> > Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at the server
> > in this case cut the 200 wallclock minutes before any sending took place down to
> > 5 minutes.
> 
> Searching back through the archive, there was a memory fragmentation issue
> with the gcc repo. I wonder if it is happening again. Maybe you should try
> the Google allocator. BTW, did you try turning off THREADED_DELTA_SEARCH?

That was for a _full_ repack, i.e. 'git repack -a -f'.  That never 
happens on a fetch/clone such as this one, unless you have all your 
objects in loose form.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  3:34                 ` Nicolas Pitre
@ 2009-04-06  5:15                   ` Junio C Hamano
  2009-04-06 13:12                     ` Nicolas Pitre
  2009-04-06 13:52                     ` Jon Smirl
  2009-04-06 11:22                   ` Matthieu Moy
  1 sibling, 2 replies; 97+ messages in thread
From: Junio C Hamano @ 2009-04-06  5:15 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: david, Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> What git-pack-objects does in this case is not a full repack.  It 
> instead _reuses_ as much of the existing packs as possible, and only does 
> the heavy packing processing for loose objects and/or inter-pack 
> boundaries when gluing everything together for streaming over the net.  
> If for example you have a single pack because your repo is already fully 
> packed, then the "packing operation" involved during a clone should 
> merely copy the existing pack over with no further attempt at delta 
> compression.

One possible scenario where you still need to spend memory and cycles is if
the cloned repository was packed to an excessive depth, causing many of
its objects to be in deltified form on insanely deep chains, while cloning
send-pack uses a depth that is more reasonable.  Then pack-objects invoked
by send-pack is not allowed to reuse most of the objects and would end up
redoing the delta on them.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  3:34                 ` Nicolas Pitre
  2009-04-06  5:15                   ` Junio C Hamano
@ 2009-04-06 11:22                   ` Matthieu Moy
  2009-04-06 13:29                     ` Nicolas Pitre
  1 sibling, 1 reply; 97+ messages in thread
From: Matthieu Moy @ 2009-04-06 11:22 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: david, Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> If for example you have a single pack because your repo is already fully 
> packed, then the "packing operation" involved during a clone should 
> merely copy the existing pack over with no further attempt at delta 
> compression.

There's still the question of what happens if your repository has too many
objects (for example, a branch that you deleted without garbage-collecting
it). Then, sending the whole pack sends data that one may have
considered as "secret".

To me, this is a non-issue (if the contents of these objects are
secret, then why are they here at all on a public server?), but I
think there were discussions here about it (can't find the right
keywords to dig the archives though), and other people may think
differently.

Jeff King's answer in <20090405195714.GA4716@coredump.intra.peff.net>
tackles this problem too.

-- 
Matthieu

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  5:15                   ` Junio C Hamano
@ 2009-04-06 13:12                     ` Nicolas Pitre
  2009-04-06 13:52                     ` Jon Smirl
  1 sibling, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 13:12 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: david, Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Sun, 5 Apr 2009, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > What git-pack-objects does in this case is not a full repack.  It 
> > instead _reuses_ as much of the existing packs as possible, and only does 
> > the heavy packing processing for loose objects and/or inter-pack 
> > boundaries when gluing everything together for streaming over the net.  
> > If for example you have a single pack because your repo is already fully 
> > packed, then the "packing operation" involved during a clone should 
> > merely copy the existing pack over with no further attempt at delta 
> > compression.
> 
> One possible scenario where you still need to spend memory and cycles is if
> the cloned repository was packed to an excessive depth, causing many of
> its objects to be in deltified form on insanely deep chains, while cloning
> send-pack uses a depth that is more reasonable.  Then pack-objects invoked
> by send-pack is not allowed to reuse most of the objects and would end up
> redoing the delta on them.

Nope.  When pack data is reused, there is simply no consideration 
whatsoever for the actual delta depth limit.  Only when an object already 
being used as a delta base for reused deltas is itself subject to delta 
compression is the real depth of the concerned delta chain 
evaluated, in order to not purposely bust the specified delta depth limit 
(otherwise a delta chain could grow unbounded).
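
Note that the depth/window tradeoff is decided at repack time on the 
serving side anyway; for example (numbers purely illustrative):

	git repack -a -d -f --depth=50 --window=250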


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 11:22                   ` Matthieu Moy
@ 2009-04-06 13:29                     ` Nicolas Pitre
  2009-04-06 14:03                       ` Robin H. Johnson
  0 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 13:29 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: david, Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Mon, 6 Apr 2009, Matthieu Moy wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > If for example you have a single pack because your repo is already fully 
> > packed, then the "packing operation" involved during a clone should 
> > merely copy the existing pack over with no further attempt at delta 
> > compression.
> 
> There's still the question of what happens if your repository has too
> many objects (for example, a branch that you deleted without
> garbage-collecting it). Then, sending the whole pack sends data that
> one may have considered as "secret".

I said "merely copy", which is not a straight copy.  In this case, only 
the relevant objects from the existing pack will be copied into the 
streamed pack, and objects from the unused branch will be left behind.  
In that case, deltas whose base object is left behind will automatically 
be considered for alternative delta matching of course, but that is 
normally a relatively small set of objects.  And if that set gets really 
big, that means that an even bigger set of objects was left behind, 
making the actual repacking smaller in scope.

> To me, this is a non-issue (if the contents of these objects are
> secret, then why are they here at all on a public server?), but I
> think there were discussions here about it (can't find the right
> keywords to dig the archives though), and other people may think
> differently.

Guess who was involved in that discussion...

I may allow you to pull certain branches directly from my own PC through 
the git native protocol.  That doesn't mean you have direct access to 
the whole of any of the packs I have on my disk.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  5:15                   ` Junio C Hamano
  2009-04-06 13:12                     ` Nicolas Pitre
@ 2009-04-06 13:52                     ` Jon Smirl
  2009-04-06 14:19                       ` Nicolas Pitre
  1 sibling, 1 reply; 97+ messages in thread
From: Jon Smirl @ 2009-04-06 13:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

On Mon, Apr 6, 2009 at 1:15 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
>> What git-pack-objects does in this case is not a full repack.  It
>> instead _reuses_ as much of the existing packs as possible, and only does
>> the heavy packing processing for loose objects and/or inter-pack
>> boundaries when gluing everything together for streaming over the net.
>> If for example you have a single pack because your repo is already fully
>> packed, then the "packing operation" involved during a clone should
>> merely copy the existing pack over with no further attempt at delta
>> compression.
>
> One possible scenario where you still need to spend memory and cycles is if
> the cloned repository was packed to an excessive depth, causing many of
> its objects to be in deltified form on insanely deep chains, while cloning
> send-pack uses a depth that is more reasonable.  Then pack-objects invoked
> by send-pack is not allowed to reuse most of the objects and would end up
> redoing the delta on them.

That seems broken. You went through all of the trouble to make the
pack file smaller to reduce transmission time, and then clone undoes
the work.

What about making a very simple special case for an initial clone?
First thing an initial clone does is copy all of the pack files from
the server to the client without even looking at them. Some of these
packs will probably be marked 'keep' because they are old history and
have been densely packed. Once the packs are down, start over and do a
fetch taking these packs into account.
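
The marking mechanism already exists, if I understand it right: repack
leaves alone any pack that has a matching .keep file, e.g. (pack name is
a placeholder):

	touch objects/pack/pack-<sha1>.keep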

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 13:29                     ` Nicolas Pitre
@ 2009-04-06 14:03                       ` Robin H. Johnson
  2009-04-06 14:14                         ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-06 14:03 UTC (permalink / raw)
  To: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1034 bytes --]

I haven't read all of this morning's submissions to the thread yet, but I
wanted to make two posts before I leave on a trip (in ~20 minutes), and
I'll be back late on Thursday.

On Mon, Apr 06, 2009 at 09:29:04AM -0400, Nicolas Pitre wrote:
> > To me, this is a non-issue (if the contents of these objects are
> > secret, then why are they here at all on a public server?), but I
> > think there were discussions here about it (can't find the right
> > keywords to dig the archives though), and other people may think
> > differently.
> Guess who was involved in that discussion...
> I may allow you to pull certain branches directly from my own PC through 
> the git native protocol.  That doesn't mean you have direct access to 
> the whole of any of the packs I have on my disk.
If the native rsync protocol is allowed to the repo, then that argument
is moot.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85

[-- Attachment #2: Type: application/pgp-signature, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 14:03                       ` Robin H. Johnson
@ 2009-04-06 14:14                         ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 14:14 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Mon, 6 Apr 2009, Robin H. Johnson wrote:

> I haven't read all this morning submissions to the thread yet, but I
> wanted to make two posts before I leave on a trip (in ~20 minutes), and
> I'll be back late on Thursday.
> 
> On Mon, Apr 06, 2009 at 09:29:04AM -0400, Nicolas Pitre wrote:
> > > To me, this is a non-issue (if the contents of these objects are
> > > secret, then why are they here at all on a public server?), but I
> > > think there were discussions here about it (can't find the right
> > > keywords to dig the archives though), and other people may think
> > > differently.
> > Guess who was involved in that discussion...
> > I may allow you to pull certain branches directly from my own PC through 
> > the git native protocol.  That doesn't mean you have direct access to 
> > the whole of any of the packs I have on my disk.
> If the native rsync protocol is allowed to the repo, then that argument
> is moot.

The rsync protocol is _not_ the native git protocol.  And I personally 
don't encourage its usage either, except as a _temporary_ workaround for 
unresolved issues.  You will never see this protocol available from any 
git server I maintain.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 13:52                     ` Jon Smirl
@ 2009-04-06 14:19                       ` Nicolas Pitre
  2009-04-06 14:37                         ` Jon Smirl
  0 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 14:19 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Junio C Hamano, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2515 bytes --]

On Mon, 6 Apr 2009, Jon Smirl wrote:

> On Mon, Apr 6, 2009 at 1:15 AM, Junio C Hamano <gitster@pobox.com> wrote:
> > Nicolas Pitre <nico@cam.org> writes:
> >
> >> What git-pack-objects does in this case is not a full repack.  It
> >> instead _reuses_ as much of the existing packs as possible, and only does
> >> the heavy packing processing for loose objects and/or inter-pack
> >> boundaries when gluing everything together for streaming over the net.
> >> If for example you have a single pack because your repo is already fully
> >> packed, then the "packing operation" involved during a clone should
> >> merely copy the existing pack over with no further attempt at delta
> >> compression.
> >
> > One possible scenario where you still need to spend memory and cycles is if
> > the cloned repository was packed to an excessive depth, causing many of
> > its objects to be in deltified form on insanely deep chains, while cloning
> > send-pack uses a depth that is more reasonable.  Then pack-objects invoked
> > by send-pack is not allowed to reuse most of the objects and would end up
> > redoing the delta on them.
> 
> That seems broken. You went through all of the trouble to make the
> pack file smaller to reduce transmission time, and then clone undoes
> the work.

And as I already explained, this is indeed not what happens.

> What about making a very simple special case for an initial clone?

There should not be any need for initial clone hacks.

> First thing an initial clone does is copy all of the pack files from
> the server to the client without even looking at them.

This is a no-go for reasons already stated many times.  There are 
security implications (those packs might contain stuff that you didn't 
intend to be publicly accessible) and there might be efficiency 
reasons as well (you might have a shared object store with lots of stuff 
unrelated to the particular clone).

The biggest cost right now when cloning a big packed repo is object 
enumeration.  Any other issue involving memory costs in the GB range 
simply has no justification, and is mostly due to misconfigurations or 
bugs that have to be fixed.  Trying to work around the issue with all 
sorts of hacks is simply counterproductive.

In the case that started this very thread, I suspect that a small 
misfeature of some delta caching might be the culprit.  I asked Robin H. 
Johnson to perform a really simple config addition to his repo and 
retest, but we haven't seen any results yet.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  4:06     ` Nicolas Pitre
@ 2009-04-06 14:20       ` Robin H. Johnson
  0 siblings, 0 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-06 14:20 UTC (permalink / raw)
  To: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 2560 bytes --]

Again, I'm about to leave on a trip for a few days (back late Thursday),
but just wanted to chime in on the thread.

On Mon, Apr 06, 2009 at 12:06:00AM -0400, Nicolas Pitre wrote:
> > While my current pack setup has multiple packs of not more than 100MiB
> > each, that was simply for ease of resume with rsync+http tests. Even
> > when I already had a single pack, with every object reachable,
> > pack-objects was redoing the packing.
> In that case it shouldn't have.
I'll retest that part on my return, but I'm pretty sure I did see the
same excess CPU time usage.

> > Also, I did another trace, using some other hardware, in a LAN setting, and
> > noticed that git-upload-pack/pack-objects only seems to start output to the
> > network after it reaches 100% in 'remote: Compressing objects:'.
> That's to be expected.  Delta compression matches objects which are not 
> in the stream order at all.  Therefore it is not possible to start 
> outputting pack data until this pass is done.  Still, this pass should 
> not be invoked if your repository is already fully packed into one pack.  
So it's seeking around the existing packs before sending?

> Can you confirm this is actually the case?
The most recent tests were with the 15 (plus one partial) packs limited to a
max of 100MiB each, because that made resume for rsync/http during the
tests much cleaner.

> > Relatedly, throwing more RAM (6GiB total, vs. the previous 2GiB) at 
> > the server in this case cut the 200 wallclock minutes before any 
> > sending took place down to 5 minutes.
> Well... here's a wild guess.  In the source repository serving clone 
> requests, please do:
> 	git config pack.deltaCacheSize 1
> 	git config pack.deltaCacheLimit 0
> and try cloning again with a fully packed repository.
I did the multiple pack case quickly, and found that it does still take
a long time in the low memory case. I'll do the test with a single pack
on my return.

> The pack caching project is meant to address a different issue: mainly to 
> bypass the object enumeration cost.  In other words, it could allow for 
> skipping the "Counting objects" pass, and a tiny bit more.  At least in 
> theory that's about the main difference.  This has many drawbacks as 
> well though.
Relatedly, would it be possible to keep a cache of enumerated objects
that was trivially updatable during pushes?

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85

[-- Attachment #2: Type: application/pgp-signature, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 14:19                       ` Nicolas Pitre
@ 2009-04-06 14:37                         ` Jon Smirl
  2009-04-06 14:48                           ` Shawn O. Pearce
  2009-04-06 15:14                           ` Nicolas Pitre
  0 siblings, 2 replies; 97+ messages in thread
From: Jon Smirl @ 2009-04-06 14:37 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 6 Apr 2009, Jon Smirl wrote:
>
>> On Mon, Apr 6, 2009 at 1:15 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> > Nicolas Pitre <nico@cam.org> writes:
>> >
>> >> What git-pack-objects does in this case is not a full repack.  It
>> >> instead _reuses_ as much of the existing packs as possible, and only does
>> >> the heavy packing processing for loose objects and/or inter-pack
>> >> boundaries when gluing everything together for streaming over the net.
>> >> If for example you have a single pack because your repo is already fully
>> >> packed, then the "packing operation" involved during a clone should
>> >> merely copy the existing pack over with no further attempt at delta
>> >> compression.
>> >
>> > One possible scenario where you still need to spend memory and cycles is if
>> > the cloned repository was packed to an excessive depth, causing many of
>> > its objects to be in deltified form on insanely deep chains, while cloning
>> > send-pack uses a depth that is more reasonable.  Then pack-objects invoked
>> > by send-pack is not allowed to reuse most of the objects and would end up
>> > redoing the delta on them.
>>
>> That seems broken. You went through all of the trouble to make the
>> pack file smaller to reduce transmission time, and then clone undoes
>> the work.
>
> And as I already explained, this is indeed not what happens.
>
>> What about making a very simple special case for an initial clone?
>
> There should not be any need for initial clone hacks.
>
>> First thing an initial clone does is copy all of the pack files from
>> the server to the client without even looking at them.
>
> This is a no-go for reasons already stated many times.  There are
> security implications (those packs might contain stuff that you didn't
> intend to be publicly accessible) and there might be efficiency
> reasons as well (you might have a shared object store with lots of stuff
> unrelated to the particular clone).

How do you deal with dense history packs? These packs take many hours
to make (on a server class machine) and can be half the size of a
regular pack. Shouldn't there be a way to copy these packs intact on
an initial clone? It's ok if these packs are specially marked as being
ok to copy.

>
> The biggest cost right now when cloning a big packed repo is object
> enumeration.  Any other issue involving memory costs in the GB range
> simply has no justification, and is mostly due to misconfigurations or
> bugs that have to be fixed.  Trying to work around the issue with all
> sorts of hacks is simply counterproductive.
>
> In the case that started this very thread, I suspect that a small
> misfeature of some delta caching might be the culprit.  I asked Robin H.
> Johnson to perform a really simple config addition to his repo and
> retest, but we haven't seen any results yet.
>
>
> Nicolas
>



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 14:37                         ` Jon Smirl
@ 2009-04-06 14:48                           ` Shawn O. Pearce
  2009-04-06 15:14                           ` Nicolas Pitre
  1 sibling, 0 replies; 97+ messages in thread
From: Shawn O. Pearce @ 2009-04-06 14:48 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Nicolas Pitre, Junio C Hamano, david, Nicolas Sebrecht,
	Robin H. Johnson, Git Mailing List

Jon Smirl <jonsmirl@gmail.com> wrote:
> 
> How do you deal with dense history packs? These packs take many hours
> to make (on a server class machine) and can be half the size of a
> regular pack. Shouldn't there be a way to copy these packs intact on
> an initial clone? It's ok if these packs are specially marked as being
> ok to copy.

These should be copied as-is.

Basically, object enumeration lists every reachable object, which
should include every object in this pack if it's a "dense history
pack".  We then start to write out each object.  As each object
is written we look to see if it already exists in a pack.  It does
(in your dense history pack), so we then look to see if its delta
base is also in the output list (it is), so we send the data as-is.
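
(To get an idea of how deep the chains in such a pack really are,
something like

	git verify-pack -v objects/pack/pack-<sha1>.idx | tail

prints a histogram of delta chain lengths at the end; the pack name is
of course a placeholder.)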


One of the bigger costs with such clones is building that huge list
of objects needed to send.  The primary cost appears to be unpacking
the trees from the "dense history pack", where delta chains are
usually quite long.  The GSoC 2009 pack caching project idea is
based on the theory that we should be able to save a list of objects
that are reachable from some fixed point (e.g. a very well known,
stable tag), and avoid needing to read these ancient trees.

But it's just a theory.  Caching always costs you management 
overhead.  And it may not save us that much time.  And most of 
the theory here is based on JGit's performance during packing, 
*not* git-core.

I came up with the object list caching idea because JGit's object
enumeration is just pitiful.  (It's Java, what do you want? If you
wanted fast, you'd use portable assembler... like git-core does.)
Whether or not it's worth applying to git-core is another story
entirely.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 14:37                         ` Jon Smirl
  2009-04-06 14:48                           ` Shawn O. Pearce
@ 2009-04-06 15:14                           ` Nicolas Pitre
  2009-04-06 15:28                             ` Jon Smirl
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 15:14 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Junio C Hamano, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3205 bytes --]

On Mon, 6 Apr 2009, Jon Smirl wrote:

> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@cam.org> wrote:
> > On Mon, 6 Apr 2009, Jon Smirl wrote:
> >
> >> First thing an initial clone does is copy all of the pack files from
> >> the server to the client without even looking at them.
> >
> > This is a no-go for reasons already stated many times.  There are
> > security implications (those packs might contain stuff that you didn't
> > intend to be publicly accessible) and there might be efficiency
> > reasons as well (you might have a shared object store with lots of stuff
> > unrelated to the particular clone).
> 
> How do you deal with dense history packs? These packs take many hours
> to make (on a server class machine) and can be half the size of a
> regular pack. Shouldn't there be a way to copy these packs intact on
> an initial clone? It's ok if these packs are specially marked as being
> ok to copy.

[sigh]

Let me explain it all again.

There are basically two ways to create a new pack: the intelligent way, 
and the bruteforce way.

When creating a new pack the intelligent way, what we do is enumerate 
all the needed objects and look them up in the object store.  When a 
particular object is found, we create a record for that object and note 
in which pack it is located, at what offset in that pack, how much space 
it occupies in its compressed form within that pack, and whether it 
is a delta or not.  When that object is indeed a delta (the majority of 
objects usually are) then we also keep a pointer in the record to the 
base object for that delta.

Next, for all objects in delta form whose base object is also part of 
the object enumeration and obviously part of the same pack, we simply 
flag those objects as directly reusable without any further processing.  
This means that, when those objects are about to be stored in the new 
pack, their raw data is simply copied straight from the original pack 
using the offset and size noted above.  In other words, those objects 
are simply never redeltified nor redeflated at all, and all the work 
that was previously done to find the best delta match is preserved with 
no extra cost.

Of course, when your repository is tightly packed into a single pack, 
then all enumerated objects fall into the reusable category and 
therefore a copy of the original pack is indeed sent over the wire.  
One exception is with older git clients which don't support the delta 
base offset encoding, in which case the delta reference encoding is 
substituted on the fly with almost no cost (this is btw another reason 
why a dumb copy of an existing pack may not work universally either).  But 
in the common case, you might see the above as just the same as if git 
did copy the pack file because it really only reads some data from a 
pack and immediately writes that data out.
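
You can even see this in the statistics a client receives at the end of a
clone; from a fully packed repo the final pack-objects line should look
something like (delta counts made up, object count from this repo):

	remote: Total 4886949 (delta 3800000), reused 4886949 (delta 3800000)

i.e. everything reused, nothing recomputed.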

The bruteforce repacking is different because it simply doesn't concern 
itself with existing deltas at all.  It instead starts everything from 
scratch and performs the whole delta search all over for all objects.  
This is what takes lots of resources and CPU cycles, and as you may 
guess, is never used for fetch/clone requests.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 15:14                           ` Nicolas Pitre
@ 2009-04-06 15:28                             ` Jon Smirl
  2009-04-06 16:14                               ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Jon Smirl @ 2009-04-06 15:28 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

On Mon, Apr 6, 2009 at 11:14 AM, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 6 Apr 2009, Jon Smirl wrote:
>
>> On Mon, Apr 6, 2009 at 10:19 AM, Nicolas Pitre <nico@cam.org> wrote:
>> > On Mon, 6 Apr 2009, Jon Smirl wrote:
>> >
>> >> First thing an initial clone does is copy all of the pack files from
>> >> the server to the client without even looking at them.
>> >
>> > This is a no-go for reasons already stated many times.  There are
>> > security implications (those packs might contain stuff that you didn't
>> > intend to be publicly accessible) and there might be efficiency
>> > reasons as well (you might have a shared object store with lots of stuff
>> > unrelated to the particular clone).
>>
>> How do you deal with dense history packs? These packs take many hours
>> to make (on a server class machine) and can be half the size of a
>> regular pack. Shouldn't there be a way to copy these packs intact on
>> an initial clone? It's ok if these packs are specially marked as being
>> ok to copy.
>
> [sigh]
>
> Let me explain it all again.
>
> There are basically two ways to create a new pack: the intelligent way,
> and the bruteforce way.
>
> When creating a new pack the intelligent way, what we do is enumerate
> all the needed objects and look them up in the object store.  When a
> particular object is found, we create a record for that object and note
> in which pack it is located, at what offset in that pack, how much space
> it occupies in its compressed form within that pack, and whether it
> is a delta or not.  When that object is indeed a delta (the majority of
> objects usually are) then we also keep a pointer in the record to the
> base object for that delta.
>
> Next, for all objects in delta form whose base object is also part of
> the object enumeration and obviously part of the same pack, we simply
> flag those objects as directly reusable without any further processing.
> This means that, when those objects are about to be stored in the new
> pack, their raw data is simply copied straight from the original pack
> using the offset and size noted above.  In other words, those objects
> are simply never redeltified nor redeflated at all, and all the work
> that was previously done to find the best delta match is preserved with
> no extra cost.

Does this process cause random reads all over a 2GB pack file? Busy
servers can't keep a 2GB pack in memory.
sendfile()ing the 2GB pack to the client is way more efficient (assuming
the pack is marked as being ok to send).

>
> Of course, when your repository is tightly packed into a single pack,
> then all enumerated objects fall into the reusable category and
> therefore a copy of the original pack is indeed sent over the wire.
> One exception is with older git clients which don't support the delta
> base offset encoding, in which case the delta reference encoding is
> substituted on the fly with almost no cost (this is btw another reason
> why a dumb copy of an existing pack may not work universally either).  But
> in the common case, you might see the above as just the same as if git
> did copy the pack file because it really only reads some data from a
> pack and immediately writes that data out.
>
> The bruteforce repacking is different because it simply doesn't concern
> itself with existing deltas at all.  It instead starts everything from
> scratch and performs the whole delta search all over for all objects.
> This is what takes lots of resources and CPU cycles, and as you may
> guess, is never used for fetch/clone requests.
>
>
> Nicolas
>



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06 15:28                             ` Jon Smirl
@ 2009-04-06 16:14                               ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-06 16:14 UTC (permalink / raw)
  To: Jon Smirl
  Cc: Junio C Hamano, david, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List, Shawn O. Pearce

[-- Attachment #1: Type: TEXT/PLAIN, Size: 970 bytes --]

On Mon, 6 Apr 2009, Jon Smirl wrote:

> On Mon, Apr 6, 2009 at 11:14 AM, Nicolas Pitre <nico@cam.org> wrote:
> > This means that, when those objects are about to be stored in the new
> > pack, their raw data is simply copied straight from the original pack
> > using the offset and size noted above.  In other words, those objects
> > are simply never redeltified nor redeflated at all, and all the work
> > that was previously done to find the best delta match is preserved with
> > no extra cost.
> 
> Does this process cause random reads all over a 2GB pack file? Busy
> servers can't keep a 2GB pack in memory.

The creation of a new pack follows the same object recency rule as the 
packs it copies from, so the various reads should be perfectly 
sequential.

> sendfile()ing the 2GB pack to the client is way more efficient (assuming
> the pack is marked as being ok to send).

Git is not an FTP server.  Otherwise we would have stayed with the rsync 
protocol.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-06  3:24               ` Nicolas Pitre
@ 2009-04-07  8:10                 ` Björn Steinbrink
  2009-04-07  9:45                   ` Jakub Narebski
  0 siblings, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-07  8:10 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sverre Rabbelier, david, Junio C Hamano, Nicolas Sebrecht,
	Robin H. Johnson, Git Mailing List

On 2009.04.05 23:24:27 -0400, Nicolas Pitre wrote:
> On Sun, 5 Apr 2009, Sverre Rabbelier wrote:
> 
> > Heya,
> > 
> > On Sun, Apr 5, 2009 at 23:28,  <david@lang.hm> wrote:
> > > Guys, back off a little on telling the gentoo people to change.
> > 
> > I agree here, we should either say "look, we don't really support big
> > repositories because [explanation here], unless you [workarounds
> > here]" OR we should work to improve the support we do have. Of course,
> > the latter option does not magically create developer time to work on
> > that, but if we do go that way we should at least tell people that we
> > are aware of the problems and that it's on the global TODO list (not
> > necessarily on anyone's personal TODO list though).
> 
> For the record... I at least am aware of the problem and it is indeed on 
> my personal git todo list.  Not that I have a clear solution yet (I've 
> been pondering on some git packing issues for almost 4 years now).
> 
> Still, in this particular case, the problem appears to be unclear to me, 
> like "this shouldn't be so bad".

It's not primarily pack-objects, I think. It's the rev-list that's run
by upload-pack.  Running "git rev-list --objects --all" on that repo
eats about 2G RSS, easily killing the system's cache on a small box,
leading to swapping and a painful time reading the packfile contents
afterwards to send them to the client.
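
(Easy to measure, e.g. with GNU time:

	/usr/bin/time -v git rev-list --objects --all > /dev/null

and looking at the "Maximum resident set size" line.)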

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07  8:10                 ` Björn Steinbrink
@ 2009-04-07  9:45                   ` Jakub Narebski
  2009-04-07 13:13                     ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Jakub Narebski @ 2009-04-07  9:45 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

Björn Steinbrink <B.Steinbrink@gmx.de> writes:
> On 2009.04.05 23:24:27 -0400, Nicolas Pitre wrote:
> > On Sun, 5 Apr 2009, Sverre Rabbelier wrote:
> > > 
> > > I agree here, we should either say "look, we don't really support big
> > > repositories because [explanation here], unless you [workarounds
> > > here]" OR we should work to improve the support we do have. Of course,
> > > the latter option does not magically create developer time to work on
> > > that, but if we do go that way we should at least tell people that we
> > > are aware of the problems and that it's on the global TODO list (not
> > > necessarily on anyone's personal TODO list though).
> > 
> > For the record... I at least am aware of the problem and it is indeed on 
> > my personal git todo list.  Not that I have a clear solution yet (I've 
> > been pondering on some git packing issues for almost 4 years now).
> > 
> > Still, in this particular case, the problem appears to be unclear to me, 
> > like "this shouldn't be so bad".
> 
> It's not primarily pack-objects, I think. It's the rev-list that's run
> by upload-pack.  Running "git rev-list --objects --all" on that repo
> eats about 2G RSS, easily killing the system's cache on a small box,
> leading to swapping and a painful time reading the packfile contents
> afterwards to send them to the client.

Then I think that the "packfile caching" GSoC project (which is IIRC
"object enumeration caching", or at least includes it) should help
here.  You would, from what I understand, run "git rev-list --objects
--all --not <tops of cache>" plus a sequential read of the object
enumeration cache...
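
Something along these lines, where $cached_tip is whatever commit the
cache was generated at:

	# walk only what is newer than the cached enumeration point
	git rev-list --objects --all --not $cached_tip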

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
  2009-04-05 23:20               ` david
@ 2009-04-07 10:11               ` Martin Langhoff
  1 sibling, 0 replies; 97+ messages in thread
From: Martin Langhoff @ 2009-04-07 10:11 UTC (permalink / raw)
  To: Nicolas Sebrecht; +Cc: david, Robin H. Johnson, Git Mailing List

On Mon, Apr 6, 2009 at 12:59 AM, Nicolas Sebrecht
<nicolas.s-dev@laposte.net> wrote:
> What about the rsync solution given in this thread?

Also, HTTP is excellent for initial clones, possibly better than rsync
in some cases.

The Gentoo team has good reasons to do things their way, and it's IMHO
a wart in git that initial clones of large repos are so expensive. But
we do have valid workarounds (as above), so they can use them.

cheers,



martin
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07  9:45                   ` Jakub Narebski
@ 2009-04-07 13:13                     ` Nicolas Pitre
  2009-04-07 13:37                       ` Jakub Narebski
                                         ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-07 13:13 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Björn Steinbrink, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2102 bytes --]

On Tue, 7 Apr 2009, Jakub Narebski wrote:

> Björn Steinbrink <B.Steinbrink@gmx.de> writes:
> > On 2009.04.05 23:24:27 -0400, Nicolas Pitre wrote:
> > > On Sun, 5 Apr 2009, Sverre Rabbelier wrote:
> > > > 
> > > > I agree here, we should either say "look, we don't really support big
> > > > repositories because [explanation here], unless you [workarounds
> > > > here]" OR we should work to improve the support we do have. Of course,
> > > > the latter option does not magically create developer time to work on
> > > > that, but if we do go that way we should at least tell people that we
> > > > are aware of the problems and that it's on the global TODO list (not
> > > > necessarily on anyone's personal TODO list though).
> > > 
> > > For the record... I at least am aware of the problem and it is indeed on 
> > > my personal git todo list.  Not that I have a clear solution yet (I've 
> > > been pondering on some git packing issues for almost 4 years now).
> > > 
> > > Still, in this particular case, the problem appears to be unclear to me, 
> > > like "this shouldn't be so bad".
> > 
> > It's not primarily pack-objects, I think. It's the rev-list that's run
> > by upload-pack.  Running "git rev-list --objects --all" on that repo
> > eats about 2G RSS, easily killing the system's cache on a small box,
> > leading to swapping and a painful time reading the packfile contents
> > afterwards to send them to the client.
> 
> Then I think that the "packfile caching" GSoC project (which is IIRC
> "object enumeration caching", or at least includes it) should help
> here.

NO!

Please, people, stop being so creative with all sorts of ways to simply 
avoid the real issue instead of focusing on a real fix.  Git has not become 
what it is today by the accumulation of workarounds and ignorance of 
fundamental issues.

Having git-rev-list consume about 2G RSS for the enumeration of 4M 
objects is simply unacceptable, period.  This is the equivalent of 500 
bytes per object pinned in memory on average, just for listing objects, 
which is completely silly.  We ought to do better than that.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 13:13                     ` Nicolas Pitre
@ 2009-04-07 13:37                       ` Jakub Narebski
  2009-04-07 14:03                         ` Jon Smirl
  2009-04-07 17:59                         ` Nicolas Pitre
  2009-04-07 14:21                       ` Björn Steinbrink
  2009-04-08 11:28                       ` [PATCH] process_{tree,blob}: Remove useless xstrdup calls Björn Steinbrink
  2 siblings, 2 replies; 97+ messages in thread
From: Jakub Narebski @ 2009-04-07 13:37 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Björn Steinbrink, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Tue, 7 Apr 2009, Nicolas Pitre wrote:
> On Tue, 7 Apr 2009, Jakub Narebski wrote:
>> Björn Steinbrink <B.Steinbrink@gmx.de> writes:

[...]
>>> It's not primarily pack-objects, I think. It's the rev-list that's run
>>> by upload-pack.  Running "git rev-list --objects --all" on that repo
>>> eats about 2G RSS, easily killing the system's cache on a small box,
>>> leading to swapping and a painful time reading the packfile contents
>>> afterwards to send them to the client.
>> 
>> Then I think that the "packfile caching" GSoC project (which is IIRC
>> "object enumeration caching", or at least includes it) should help
>> here.
> 
> NO!
> 
> Please, people, stop being so creative with all sorts of ways to simply 
> avoid the real issue instead of focusing on a real fix.  Git has not become 
> what it is today by the accumulation of workarounds and ignorance of 
> fundamental issues.
> 
> Having git-rev-list consume about 2G RSS for the enumeration of 4M 
> objects is simply unacceptable, period.  This is the equivalent of 500 
> bytes per object pinned in memory on average, just for listing objects, 
> which is completely silly.  We ought to do better than that.

I had thought that the large amount of memory consumed by git-rev-list
was caused by not-so-sequential access to the very large packfile (1.5GB+
if I remember correctly), which I assumed causes the whole packfile to be
mmapped rather than only a window, plus a large amount of object data in
the 300MB+ range or something; together those would account for around 2GB.

Besides, even if git-rev-list didn't take so much memory, object
enumeration caching would still help with CPU load... admittedly less.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 13:37                       ` Jakub Narebski
@ 2009-04-07 14:03                         ` Jon Smirl
  2009-04-07 17:59                         ` Nicolas Pitre
  1 sibling, 0 replies; 97+ messages in thread
From: Jon Smirl @ 2009-04-07 14:03 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Nicolas Pitre, Björn Steinbrink, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

2009/4/7 Jakub Narebski <jnareb@gmail.com>:
> On Tue, 7 Apr 2009, Nicolas Pitre wrote:
>> On Tue, 7 Apr 2009, Jakub Narebski wrote:
>>> Björn Steinbrink <B.Steinbrink@gmx.de> writes:
>
> [...]
>>>> It's not primarily pack-objects, I think. It's the rev-list that's run
>>>> by upload-pack.  Running "git rev-list --objects --all" on that repo
>>>> eats about 2G RSS, easily killing the system's cache on a small box,
>>>> leading to swapping and a painful time reading the packfile contents
>>>> afterwards to send them to the client.
>>>
>>> Then I think that the "packfile caching" GSoC project (which is IIRC
>>> "object enumeration caching", or at least includes it) should help
>>> here.
>>
>> NO!
>>
>> Please people stop being so creative with all sorts of ways to simply
>> avoid the real issue instead of focusing on a real fix.  Git has not become
>> what it is today by the accumulation of workarounds and ignorance of
>> fundamental issues.
>>
>> Having git-rev-list consume about 2G RSS for the enumeration of 4M
>> objects is simply unacceptable, period.  This is the equivalent of 500
>> bytes per object pinned in memory on average, just for listing objects,
>> which is completely silly. We ought to do better than that.
>
> I had thought that the large amount of memory consumed by git-rev-list
> was caused by not-so-sequential access to the very large packfile
> (1.5GB+ if I remember correctly), which I assumed would cause the whole
> packfile to be mmapped rather than just a window, plus a large amount
> of objects taking 300MB+ of memory or something; together those would
> account for around 2GB.

I don't know all of the finer details of chasing revision lists, but
would it help if pack files recorded the root IDs of their object
trees at creation time and stored them at the front of the pack?


>
> Besides, even if git-rev-list didn't take so much memory, object
> enumeration caching would still help with CPU load... admittedly less.
>
> --
> Jakub Narebski
> Poland



-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 13:13                     ` Nicolas Pitre
  2009-04-07 13:37                       ` Jakub Narebski
@ 2009-04-07 14:21                       ` Björn Steinbrink
  2009-04-07 17:48                         ` Nicolas Pitre
  2009-04-08 11:28                       ` [PATCH] process_{tree,blob}: Remove useless xstrdup calls Björn Steinbrink
  2 siblings, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-07 14:21 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On 2009.04.07 09:13:45 -0400, Nicolas Pitre wrote:
> On Tue, 7 Apr 2009, Jakub Narebski wrote:
> 
> > Björn Steinbrink <B.Steinbrink@gmx.de> writes:
> > > On 2009.04.05 23:24:27 -0400, Nicolas Pitre wrote:
> > > > On Sun, 5 Apr 2009, Sverre Rabbelier wrote:
> > > > > 
> > > > > I agree here, we should either say "look, we don't really support big
> > > > > repositories because [explanation here], unless you [workarounds
> > > > > here]" OR we should work to improve the support we do have. Of course,
> > > > > the latter option does not magically create developer time to work on
> > > > > that, but if we do go that way we should at least tell people that we
> > > > > are aware of the problems and that it's on the global TODO list (not
> > > > > necessarily on anyone's personal TODO list though).
> > > > 
> > > > For the record... I at least am aware of the problem and it is indeed on 
> > > > my personal git todo list.  Not that I have a clear solution yet (I've 
> > > > been pondering on some git packing issues for almost 4 years now).
> > > > 
> > > > Still, in this particular case, the problem appears to be unclear to me, 
> > > > like "this shouldn't be so bad".
> > > 
> > > It's not primarily pack-objects, I think. It's the rev-list that's run
> > > by upload-pack.  Running "git rev-list --objects --all" on that repo
> > > eats about 2G RSS, easily killing the system's cache on a small box,
> > > leading to swapping and a painful time reading the packfile contents
> > > afterwards to send them to the client.
> > 
> > Then I think that the "packfile caching" GSoC project (which is IIRC
> > "object enumeration caching", or at least includes it) should help
> > here.
> 
> NO!
> 
> Please people stop being so creative with all sorts of ways to simply 
> avoid the real issue instead of focusing on a real fix.  Git has not become 
> what it is today by the accumulation of workarounds and ignorance of 
> fundamental issues.
> 
> Having git-rev-list consume about 2G RSS for the enumeration of 4M 
> objects is simply unacceptable, period.  This is the equivalent of 500 
> bytes per object pinned in memory on average, just for listing objects, 
> which is completely silly. We ought to do better than that.

Ah, crap, I might have been fooled by "ps aux"; top actually shows about
1.3G being shared, likely the mmapped pack files. And that will be
reused, assuming the box has enough memory to keep all that stuff.

But that's still 700MB or about 150 bytes per object on average.

A "struct tree" is 40 bytes here, adding the average path length (19 in
this repo) that's 59 byte, leaving about 90 bytes of "overhead" per
object, as end the end we seem to care only about the sha1 and the path
name.
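
Just to sanity-check that arithmetic, here's a trivial calculation with
all inputs taken from the numbers in this thread, nothing re-measured:

	#include <stdio.h>

	int main(void)
	{
		long objects = 4886949;          /* objects in the gentoo repo */
		long rss = 700L * 1024 * 1024;   /* the non-shared ~700MB */
		long per_object = rss / objects; /* ~150 bytes */
		long accounted = 40 + 19;        /* struct tree + avg. path */

		printf("%ld bytes/object, %ld unaccounted\n",
		       per_object, per_object - accounted);
		return 0;
	}

which prints "150 bytes/object, 91 unaccounted".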

And in the upload-pack case, there's also pack-objects running
concurrently, already going up to 950M RSS/100M shared _while_ the
rev-list is still running. So that's 3G of memory usage (2G if you
ignore the shared stuff) before the "Compressing objects" part even
starts. And of course, pack-objects will apparently start to mmap the
pack files only after the rev-list finished, so a "smart" OS might have
removed a lot of the mmapped stuff from memory again, causing it to be
re-read. :-/

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 14:21                       ` Björn Steinbrink
@ 2009-04-07 17:48                         ` Nicolas Pitre
  2009-04-07 18:12                           ` Björn Steinbrink
  0 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-07 17:48 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2250 bytes --]

On Tue, 7 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.07 09:13:45 -0400, Nicolas Pitre wrote:
> > Having git-rev-list consume about 2G RSS for the enumeration of 4M 
> > objects is simply unacceptable, period.  This is the equivalent of 500 
> > bytes per object pinned in memory on average, just for listing objects, 
> > which is completely silly. We ought to do better than that.
> 
> Ah, crap, I might have been fooled by "ps aux"; top actually shows about
> 1.3G being shared, likely the mmapped pack files. And that will be
> reused, assuming the box has enough memory to keep all that stuff.

Right.  And since the pack is mapped read-only, it can be paged out 
easily by the OS.  And if that doesn't help, we already have 
core.packedGitWindowSize and core.packedGitLimit config options to play 
with.
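
For example (values picked purely for illustration):

	git config core.packedGitWindowSize 32m
	git config core.packedGitLimit 256m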

> But that's still 700MB or about 150 bytes per object on average.
> 
> A "struct tree" is 40 bytes here, adding the average path length (19 in
> this repo) that's 59 byte, leaving about 90 bytes of "overhead" per
> object, as end the end we seem to care only about the sha1 and the path
> name.

I'm starting to think more seriously about pack v4 again, where each 
path component is indexed in a table.  Because most tree objects are 
different revisions of the same path, this could represent a significant 
saving in memory as well.
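
Very roughly, the idea is this (just a sketch of the concept, not an
actual format proposal):

	/*
	 * pack v4 sketch: each distinct path component is stored once
	 * in a shared table, and tree entries refer to it by index
	 * instead of repeating the string in every tree object.
	 */
	struct path_table {
		const char **comp;	/* "Makefile", "files", ... */
		unsigned int nr;
	};

	struct tree_entry_v4 {
		unsigned int mode;
		unsigned int path_idx;	/* index into the shared table */
		unsigned char sha1[20];
	};

That way each path component string exists once per pack instead of
once per tree that contains it.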

> And in the upload-pack case, there's also pack-objects running
> concurrently, already going up to 950M RSS/100M shared _while_ the
> rev-list is still running. So that's 3G of memory usage (2G if you
> ignore the shared stuff) before the "Compressing objects" part even
> starts. And of course, pack-objects will apparently start to mmap the
> pack files only after the rev-list finished, so a "smart" OS might have
> removed a lot of the mmapped stuff from memory again, causing it to be
> re-read. :-/

The first low-hanging fruit to help this case is to make upload-pack use 
the --revs argument with pack-objects to let it do the object enumeration 
itself directly, instead of relying on the rev-list output through a 
pipe.  This is what 'git repack' does already.  pack-objects has to 
access the pack anyway, so this would eliminate an extra access from a 
different process.
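
In other words, instead of the moral equivalent of this (a hand-wavy
sketch, the actual plumbing differs a bit):

	git rev-list --objects $revs | git pack-objects --stdout

upload-pack would hand the rev parameters straight to pack-objects:

	echo "$revs" | git pack-objects --revs --stdout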


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 13:37                       ` Jakub Narebski
  2009-04-07 14:03                         ` Jon Smirl
@ 2009-04-07 17:59                         ` Nicolas Pitre
  1 sibling, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-07 17:59 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Björn Steinbrink, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On Tue, 7 Apr 2009, Jakub Narebski wrote:

> On Tue, 7 Apr 2009, Nicolas Pitre wrote:
> > Having git-rev-list consume about 2G RSS for the enumeration of 4M 
> > objects is simply unacceptable, period.  This is the equivalent of 500 
> > bytes per object pinned in memory on average, just for listing objects, 
> > which is completely silly. We ought to do better than that.
> 
> I had thought that the large amount of memory consumed by git-rev-list
> was caused by not-so-sequential access to the very large packfile
> (1.5GB+ if I remember correctly), which I assumed would cause the whole
> packfile to be mmapped rather than just a window, plus a large amount
> of objects taking 300MB+ of memory or something; together those would
> account for around 2GB.

The pack doesn't have to be mapped all at once.  On 32-bit machines 
the total pack mappings cannot exceed 256MB by default.  On 64-bit 
machines the default is 8GB, which might not work very well if the 
total amount of RAM is lower than that.

Another consideration is the object layout in a pack.  Currently we have 
tree and blob objects mixed together so as to have sequential pack access 
when performing a checkout.  Maybe packing trees together would help a 
lot with object enumeration, as the blobs would not have to be mapped at 
all.  It remains to be seen how that might impact other operations, 
though.

> Besides, even if git-rev-list didn't take so much memory, object
> enumeration caching would still help with CPU load... admittedly less.

Yes, but let's not lose sight of all the inconveniences associated with 
extra caching.  If we can get away without it, then all the better.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 17:48                         ` Nicolas Pitre
@ 2009-04-07 18:12                           ` Björn Steinbrink
  2009-04-07 18:56                             ` Nicolas Pitre
  2009-04-07 20:29                             ` Jeff King
  0 siblings, 2 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-07 18:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On 2009.04.07 13:48:02 -0400, Nicolas Pitre wrote:
> On Tue, 7 Apr 2009, Björn Steinbrink wrote:
> > And in the upload-pack case, there's also pack-objects running
> > concurrently, already going up to 950M RSS/100M shared _while_ the
> > rev-list is still running. So that's 3G of memory usage (2G if you
> > ignore the shared stuff) before the "Compressing objects" part even
> > starts. And of course, pack-objects will apparently start to mmap the
> > pack files only after the rev-list finished, so a "smart" OS might have
> > removed a lot of the mmapped stuff from memory again, causing it to be
> > re-read. :-/
> 
> The first low-hanging fruit to help this case is to make upload-pack use 
> the --revs argument with pack-objects to let it do the object enumeration 
> itself directly, instead of relying on the rev-list output through a 
> pipe.  This is what 'git repack' does already.  pack-objects has to 
> access the pack anyway, so this would eliminate an extra access from a 
> different process.

Hm, for an initial clone that would end up as:
git pack-objects --stdout --all
right?

If so, it doesn't look like it's going to work out as easily as one
would hope. Robin said that both processes, git-upload-pack (which does
the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
includes the mmapped packs). But the above pack-objects with --all peaks
at 3.1G here, so it basically seems to keep all the stuff in memory that
the individual processes had. But this way, it's all at once, not 2G
first and then 2G in a second process, after the first one exited.

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 18:12                           ` Björn Steinbrink
@ 2009-04-07 18:56                             ` Nicolas Pitre
  2009-04-07 20:27                               ` Björn Steinbrink
  2009-04-07 20:29                             ` Jeff King
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-07 18:56 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1201 bytes --]

On Tue, 7 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.07 13:48:02 -0400, Nicolas Pitre wrote:
> > The first low-hanging fruit to help this case is to make upload-pack use 
> > the --revs argument with pack-objects to let it do the object enumeration 
> > itself directly, instead of relying on the rev-list output through a 
> > pipe.  This is what 'git repack' does already.  pack-objects has to 
> > access the pack anyway, so this would eliminate an extra access from a 
> > different process.
> 
> Hm, for an initial clone that would end up as:
> git pack-objects --stdout --all
> right?
> 
> If so, it doesn't look like it's going to work out as easily as one
> would hope. Robin said that both processes, git-upload-pack (which does
> the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
> includes the mmapped packs). But the above pack-objects with --all peaks
> at 3.1G here, so it basically seems to keep all the stuff in memory that
> the individual processes had. But this way, it's all at once, not 2G
> first and then 2G in a second process, after the first one exited.

Right, and it is probably faster too.

Can I get a copy of that repository somewhere?


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 18:56                             ` Nicolas Pitre
@ 2009-04-07 20:27                               ` Björn Steinbrink
  2009-04-08  4:52                                 ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-07 20:27 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

On 2009.04.07 14:56:41 -0400, Nicolas Pitre wrote:
> On Tue, 7 Apr 2009, Björn Steinbrink wrote:
> 
> > On 2009.04.07 13:48:02 -0400, Nicolas Pitre wrote:
> > > The first low-hanging fruit to help this case is to make upload-pack use 
> > > the --revs argument with pack-objects to let it do the object enumeration 
> > > itself directly, instead of relying on the rev-list output through a 
> > > pipe.  This is what 'git repack' does already.  pack-objects has to 
> > > access the pack anyway, so this would eliminate an extra access from a 
> > > different process.
> > 
> > Hm, for an initial clone that would end up as:
> > git pack-objects --stdout --all
> > right?
> > 
> > If so, it doesn't look like it's going to work out as easily as one
> > would hope. Robin said that both processes, git-upload-pack (which does
> > the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
> > includes the mmapped packs). But the above pack-objects with --all peaks
> > at 3.1G here, so it basically seems to keep all the stuff in memory that
> > the individual processes had. But this way, it's all at once, not 2G
> > first and then 2G in a second process, after the first one exited.
> 
> Right, and it is probably faster too.
> 
> Can I get a copy of that repository somewhere?

http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary

At least that's what I cloned ;-) I hope it's the right one, but it fits
the description...

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 18:12                           ` Björn Steinbrink
  2009-04-07 18:56                             ` Nicolas Pitre
@ 2009-04-07 20:29                             ` Jeff King
  2009-04-07 20:35                               ` Björn Steinbrink
  1 sibling, 1 reply; 97+ messages in thread
From: Jeff King @ 2009-04-07 20:29 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On Tue, Apr 07, 2009 at 08:12:59PM +0200, Björn Steinbrink wrote:

> If so, it doesn't look like it's going to work out as easily as one
> would hope. Robin said that both processes, git-upload-pack (which does
> the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
> includes the mmapped packs). But the above pack-objects with --all peaks

I thought he said this, too, but look at the ps output he posted here:

  http://article.gmane.org/gmane.comp.version-control.git/115739

It clearly shows upload-pack with a tiny RSS, and pack-objects doing all
of the damage.

-Peff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 20:29                             ` Jeff King
@ 2009-04-07 20:35                               ` Björn Steinbrink
  0 siblings, 0 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-07 20:35 UTC (permalink / raw)
  To: Jeff King
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.07 16:29:54 -0400, Jeff King wrote:
> On Tue, Apr 07, 2009 at 08:12:59PM +0200, Björn Steinbrink wrote:
> 
> > If so, it doesn't look like it's going to work out as easily as one
> > would hope. Robin said that both processes, git-upload-pack (which does
> > the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
> > includes the mmapped packs). But the above pack-objects with --all peaks
> 
> I thought he said this, too, but look at the ps output he posted here:
> 
>   http://article.gmane.org/gmane.comp.version-control.git/115739
> 
> It clearly shows upload-pack with a tiny RSS, and pack-objects doing all
> of the damage.

That second git-upload-pack is the interesting one. upload-pack forks to
do the rev-list stuff, without changing its process name, so it keeps
being listed as upload-pack. And as the process already died, its
RSS/VSZ dropped to zero.

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-07 20:27                               ` Björn Steinbrink
@ 2009-04-08  4:52                                 ` Nicolas Pitre
  2009-04-10 20:38                                   ` Robin H. Johnson
  0 siblings, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-08  4:52 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1714 bytes --]

On Tue, 7 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.07 14:56:41 -0400, Nicolas Pitre wrote:
> > On Tue, 7 Apr 2009, Björn Steinbrink wrote:
> > 
> > > On 2009.04.07 13:48:02 -0400, Nicolas Pitre wrote:
> > > > The first low-hanging fruit to help this case is to make upload-pack use 
> > > > the --revs argument with pack-objects to let it do the object enumeration 
> > > > itself directly, instead of relying on the rev-list output through a 
> > > > pipe.  This is what 'git repack' does already.  pack-objects has to 
> > > > access the pack anyway, so this would eliminate an extra access from a 
> > > > different process.
> > > 
> > > Hm, for an initial clone that would end up as:
> > > git pack-objects --stdout --all
> > > right?
> > > 
> > > If so, it doesn't look like it's going to work out as easily as one
> > > would hope. Robin said that both processes, git-upload-pack (which does
> > > the rev-list) and pack-objects peaked at ~2GB of RSS (which probably
> > > includes the mmapped packs). But the above pack-objects with --all peaks
> > > at 3.1G here, so it basically seems to keep all the stuff in memory that
> > > the individual processes had. But this way, it's all at once, not 2G
> > > first and then 2G in a second process, after the first one exited.
> > 
> > Right, and it is probably faster too.
> > 
> > Can I get a copy of that repository somewhere?
> 
> http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> 
> At least that's what I cloned ;-) I hope it's the right one, but it fits
> the description...

OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
725MB pack file.  So that's about half the originally reported size.
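
For the record, that was a plain repack along the lines of:

	git repack -a -d -f --window=250 --depth=250

with no other tuning.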


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-07 13:13                     ` Nicolas Pitre
  2009-04-07 13:37                       ` Jakub Narebski
  2009-04-07 14:21                       ` Björn Steinbrink
@ 2009-04-08 11:28                       ` Björn Steinbrink
  2009-04-10 22:20                         ` Linus Torvalds
  2 siblings, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-08 11:28 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Sverre Rabbelier, david, Junio C Hamano,
	Nicolas Sebrecht, Robin H. Johnson, Git Mailing List

The name of the processed object was duplicated for passing it to
add_object(), but that already calls path_name, which allocates a new
string anyway. So the memory allocated by the xstrdup calls just went
nowhere, leaking memory.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
This reduces the RSS usage for a "rev-list --all --objects" by about 10% on
the gentoo repo (fully packed) as well as linux-2.6.git:

gentoo:
                 |      old |      new
-----------------|----------|----------
RSS              |  1537284 |  1388408
VSZ              |  1816852 |  1667952
time elapsed     |  1:49.62 |  1:48.99
min. page faults |   417178 |   379919

linux-2.6.git:
                 |      old |      new
-----------------|----------|----------
RSS              |   324452 |   292996
VSZ              |   491792 |   460376
time elapsed     |  0:14.53 |  0:14.28
min. page faults |    89360 |    81613
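
(These are peak values; something along these lines reproduces that
kind of measurement, though the exact invocation may have differed:

	/usr/bin/time -v git rev-list --all --objects >/dev/null

with VSZ sampled separately via ps.)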

 list-objects.c |    2 --
 reachable.c    |    1 -
 2 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index c8b8375..dd243c7 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -23,7 +23,6 @@ static void process_blob(struct rev_info *revs,
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
 	obj->flags |= SEEN;
-	name = xstrdup(name);
 	add_object(obj, p, path, name);
 }
 
@@ -78,7 +77,6 @@ static void process_tree(struct rev_info *revs,
 	if (parse_tree(tree) < 0)
 		die("bad tree object %s", sha1_to_hex(obj->sha1));
 	obj->flags |= SEEN;
-	name = xstrdup(name);
 	add_object(obj, p, path, name);
 	me.up = path;
 	me.elem = name;
diff --git a/reachable.c b/reachable.c
index 3b1c18f..b515fa2 100644
--- a/reachable.c
+++ b/reachable.c
@@ -48,7 +48,6 @@ static void process_tree(struct tree *tree,
 	obj->flags |= SEEN;
 	if (parse_tree(tree) < 0)
 		die("bad tree object %s", sha1_to_hex(obj->sha1));
-	name = xstrdup(name);
 	add_object(obj, p, path, name);
 	me.up = path;
 	me.elem = name;
-- 
1.6.2.2.446.gfbdc0.dirty

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-08  4:52                                 ` Nicolas Pitre
@ 2009-04-10 20:38                                   ` Robin H. Johnson
  2009-04-11  1:58                                     ` Nicolas Pitre
  2009-04-14 15:52                                     ` Johannes Schindelin
  0 siblings, 2 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-10 20:38 UTC (permalink / raw)
  To: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > the description...
> OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> 725MB pack file.  So that's about half the originally reported size.
The one problem with having the single large packfile is that Git
doesn't have a trivial way to resume downloading it when the git://
protocol is used.

For our developers cursed with bad internet connections (a fair number
of firewalls that don't seem to respect keepalive properly), I suppose
I can probably just maintain a separate repo for their initial clones,
which leaves a large overall download, but more chances to resume.

PS #1: B.Steinbrink's memory improvement patch seems to work nicely too,
but more memory improvements in that realm are still needed.

PS #2: We finally got some newer hardware to run the large repo; I'm
working on the install now, but until the memory issue is better
resolved, I'm still worried we might run short if there are too many
concurrent clones.

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85

[-- Attachment #2: Type: application/pgp-signature, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-08 11:28                       ` [PATCH] process_{tree,blob}: Remove useless xstrdup calls Björn Steinbrink
@ 2009-04-10 22:20                         ` Linus Torvalds
  2009-04-11  0:27                           ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2009-04-10 22:20 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Wed, 8 Apr 2009, Björn Steinbrink wrote:
>
> The name of the processed object was duplicated for passing it to
> add_object(), but that already calls path_name, which allocates a new
> string anyway. So the memory allocated by the xstrdup calls just went
> nowhere, leaking memory.

Ack, ack.

There's another easy 5% or so for the built-in object walker: once we've 
created the hash from the name, the name isn't interesting any more, and 
so something trivial like this can help a bit.

Does it matter? Probably not on its own. But a few more memory saving 
tricks and it might all make a difference.

		Linus

---
 builtin-pack-objects.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 9fc3b35..d00eabe 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1912,6 +1912,8 @@ static void show_object(struct object_array_entry *p)
 	add_preferred_base_object(p->name);
 	add_object_entry(p->item->sha1, p->item->type, p->name, 0);
 	p->item->flags |= OBJECT_ADDED;
+	free(p->name);
+	p->name = NULL;
 }
 
 static void show_edge(struct commit *commit)

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-10 22:20                         ` Linus Torvalds
@ 2009-04-11  0:27                           ` Linus Torvalds
  2009-04-11  1:15                             ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11  0:27 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Fri, 10 Apr 2009, Linus Torvalds wrote:
> 
> There's another easy 5% or so for the built-in object walker: once we've 
> created the hash from the name, the name isn't interesting any more, and 
> so something trivial like this can help a bit.

Hmm.

Here's a less trivial thing, and slightly more dubious one.

I was looking at that "struct object_array objects", and wondering why we 
do that. I have honestly totally forgotten. Why not just call the "show()" 
function as we encounter the objects? Rather than add the objects to the 
object_array, and then at the very end going through the array and doing a 
'show' on all, just do things more incrementally.

Now, there are possible downsides to this:

 - the "buffer using object_array" _can_ in theory result in at least 
   better I-cache usage (two tight loops rather than one more spread out 
   one). I don't think this is a real issue, but in theory..

 - this _does_ change the order of the objects printed. Instead of doing a 
   "process_tree(revs, commit->tree, &objects, NULL, "");" in the loop 
   over the commits (which puts all the root trees _first_ in the object 
   list), this patch just adds them to the list of pending objects, and 
   then we'll traverse them in that order (and thus show each root tree 
   object together with the objects we discover under it).

   I _think_ the new ordering actually makes more sense, but the object 
   ordering is actually a subtle thing when it comes to packing 
   efficiency, so any change in order is going to have implications for 
   packing. Good or bad, I dunno.

 - There may be some reason why we did it that odd way with the object 
   array, that I have simply forgotten.

Anyway, this includes the "free(name)" in builtin-pack-objects.c: 
show_object() logic, and now that we don't buffer up the objects before 
showing them that may actually result in lower memory usage during that 
whole traverse_commit_list() phase.

This is seriously not very deeply tested. It makes sense to me, it seems 
to pass all the tests, it looks ok, but...

Does anybody remember why we did that "object_array" thing? It used to be 
an "object_list" a long long time ago, but got changed into the array due 
to better memory usage patterns (those linked lists of objects are 
horrible from a memory allocation standpoint). But I wonder why we didn't 
do this back then. Maybe there's a reason for it.

Or maybe there _used_ to be a reason, and no longer is. 

			Linus

---
 builtin-pack-objects.c |   14 ++++++++++----
 builtin-rev-list.c     |   20 ++++++++++----------
 list-objects.c         |   35 ++++++++++++++++++-----------------
 list-objects.h         |    2 +-
 revision.c             |    2 +-
 revision.h             |    2 ++
 upload-pack.c          |   12 ++++++------
 7 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 9fc3b35..e028a02 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1907,11 +1907,17 @@ static void show_commit(struct commit *commit)
 	commit->object.flags |= OBJECT_ADDED;
 }
 
-static void show_object(struct object_array_entry *p)
+static void show_object(struct object *obj, const char *name)
 {
-	add_preferred_base_object(p->name);
-	add_object_entry(p->item->sha1, p->item->type, p->name, 0);
-	p->item->flags |= OBJECT_ADDED;
+	add_preferred_base_object(name);
+	add_object_entry(obj->sha1, obj->type, name, 0);
+	obj->flags |= OBJECT_ADDED;
+
+	/*
+	 * We will have generated the hash from the name,
+	 * but not saved a pointer to it - we can free it
+	 */
+	free(name);
 }
 
 static void show_edge(struct commit *commit)
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 40d5fcb..0815cf3 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -168,27 +168,27 @@ static void finish_commit(struct commit *commit)
 	commit->buffer = NULL;
 }
 
-static void finish_object(struct object_array_entry *p)
+static void finish_object(struct object *obj, const char *name)
 {
-	if (p->item->type == OBJ_BLOB && !has_sha1_file(p->item->sha1))
-		die("missing blob object '%s'", sha1_to_hex(p->item->sha1));
+	if (obj->type == OBJ_BLOB && !has_sha1_file(obj->sha1))
+		die("missing blob object '%s'", sha1_to_hex(obj->sha1));
 }
 
-static void show_object(struct object_array_entry *p)
+static void show_object(struct object *obj, const char *name)
 {
 	/* An object with name "foo\n0000000..." can be used to
 	 * confuse downstream "git pack-objects" very badly.
 	 */
-	const char *ep = strchr(p->name, '\n');
+	const char *ep = strchr(name, '\n');
 
-	finish_object(p);
+	finish_object(obj, name);
 	if (ep) {
-		printf("%s %.*s\n", sha1_to_hex(p->item->sha1),
-		       (int) (ep - p->name),
-		       p->name);
+		printf("%s %.*s\n", sha1_to_hex(obj->sha1),
+		       (int) (ep - name),
+		       name);
 	}
 	else
-		printf("%s %s\n", sha1_to_hex(p->item->sha1), p->name);
+		printf("%s %s\n", sha1_to_hex(obj->sha1), name);
 }
 
 static void show_edge(struct commit *commit)
diff --git a/list-objects.c b/list-objects.c
index dd243c7..5a4af62 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -10,7 +10,7 @@
 
 static void process_blob(struct rev_info *revs,
 			 struct blob *blob,
-			 struct object_array *p,
+			 show_object_fn show,
 			 struct name_path *path,
 			 const char *name)
 {
@@ -23,7 +23,7 @@ static void process_blob(struct rev_info *revs,
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
 	obj->flags |= SEEN;
-	add_object(obj, p, path, name);
+	show(obj, path_name(path, name));
 }
 
 /*
@@ -50,7 +50,7 @@ static void process_blob(struct rev_info *revs,
  */
 static void process_gitlink(struct rev_info *revs,
 			    const unsigned char *sha1,
-			    struct object_array *p,
+			    show_object_fn show,
 			    struct name_path *path,
 			    const char *name)
 {
@@ -59,7 +59,7 @@ static void process_gitlink(struct rev_info *revs,
 
 static void process_tree(struct rev_info *revs,
 			 struct tree *tree,
-			 struct object_array *p,
+			 show_object_fn show,
 			 struct name_path *path,
 			 const char *name)
 {
@@ -77,7 +77,7 @@ static void process_tree(struct rev_info *revs,
 	if (parse_tree(tree) < 0)
 		die("bad tree object %s", sha1_to_hex(obj->sha1));
 	obj->flags |= SEEN;
-	add_object(obj, p, path, name);
+	show(obj, path_name(path, name));
 	me.up = path;
 	me.elem = name;
 	me.elem_len = strlen(name);
@@ -88,14 +88,14 @@ static void process_tree(struct rev_info *revs,
 		if (S_ISDIR(entry.mode))
 			process_tree(revs,
 				     lookup_tree(entry.sha1),
-				     p, &me, entry.path);
+				     show, &me, entry.path);
 		else if (S_ISGITLINK(entry.mode))
 			process_gitlink(revs, entry.sha1,
-					p, &me, entry.path);
+					show, &me, entry.path);
 		else
 			process_blob(revs,
 				     lookup_blob(entry.sha1),
-				     p, &me, entry.path);
+				     show, &me, entry.path);
 	}
 	free(tree->buffer);
 	tree->buffer = NULL;
@@ -134,16 +134,20 @@ void mark_edges_uninteresting(struct commit_list *list,
 	}
 }
 
+static void add_pending_tree(struct rev_info *revs, struct tree *tree)
+{
+	add_pending_object(revs, &tree->object, "");
+}
+
 void traverse_commit_list(struct rev_info *revs,
 			  void (*show_commit)(struct commit *),
-			  void (*show_object)(struct object_array_entry *))
+			  void (*show_object)(struct object *, const char *))
 {
 	int i;
 	struct commit *commit;
-	struct object_array objects = { 0, 0, NULL };
 
 	while ((commit = get_revision(revs)) != NULL) {
-		process_tree(revs, commit->tree, &objects, NULL, "");
+		add_pending_tree(revs, commit->tree);
 		show_commit(commit);
 	}
 	for (i = 0; i < revs->pending.nr; i++) {
@@ -154,25 +158,22 @@ void traverse_commit_list(struct rev_info *revs,
 			continue;
 		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
-			add_object_array(obj, name, &objects);
+			show_object(obj, name);
 			continue;
 		}
 		if (obj->type == OBJ_TREE) {
-			process_tree(revs, (struct tree *)obj, &objects,
+			process_tree(revs, (struct tree *)obj, show_object,
 				     NULL, name);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
-			process_blob(revs, (struct blob *)obj, &objects,
+			process_blob(revs, (struct blob *)obj, show_object,
 				     NULL, name);
 			continue;
 		}
 		die("unknown pending object %s (%s)",
 		    sha1_to_hex(obj->sha1), name);
 	}
-	for (i = 0; i < objects.nr; i++)
-		show_object(&objects.objects[i]);
-	free(objects.objects);
 	if (revs->pending.nr) {
 		free(revs->pending.objects);
 		revs->pending.nr = 0;
diff --git a/list-objects.h b/list-objects.h
index 0f41391..13b0dd9 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -2,7 +2,7 @@
 #define LIST_OBJECTS_H
 
 typedef void (*show_commit_fn)(struct commit *);
-typedef void (*show_object_fn)(struct object_array_entry *);
+typedef void (*show_object_fn)(struct object *, const char *);
 typedef void (*show_edge_fn)(struct commit *);
 
 void traverse_commit_list(struct rev_info *revs, show_commit_fn, show_object_fn);
diff --git a/revision.c b/revision.c
index b6215cc..44a9ce2 100644
--- a/revision.c
+++ b/revision.c
@@ -15,7 +15,7 @@
 
 volatile show_early_output_fn_t show_early_output;
 
-static char *path_name(struct name_path *path, const char *name)
+char *path_name(struct name_path *path, const char *name)
 {
 	struct name_path *p;
 	char *n, *m;
diff --git a/revision.h b/revision.h
index 5adfc91..c89e8ff 100644
--- a/revision.h
+++ b/revision.h
@@ -146,6 +146,8 @@ struct name_path {
 	const char *elem;
 };
 
+char *path_name(struct name_path *path, const char *name);
+
 extern void add_object(struct object *obj,
 		       struct object_array *p,
 		       struct name_path *path,
diff --git a/upload-pack.c b/upload-pack.c
index a49d872..5524ac4 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -78,20 +78,20 @@ static void show_commit(struct commit *commit)
 	commit->buffer = NULL;
 }
 
-static void show_object(struct object_array_entry *p)
+static void show_object(struct object *obj, const char *name)
 {
 	/* An object with name "foo\n0000000..." can be used to
 	 * confuse downstream git-pack-objects very badly.
 	 */
-	const char *ep = strchr(p->name, '\n');
+	const char *ep = strchr(name, '\n');
 	if (ep) {
-		fprintf(pack_pipe, "%s %.*s\n", sha1_to_hex(p->item->sha1),
-		       (int) (ep - p->name),
-		       p->name);
+		fprintf(pack_pipe, "%s %.*s\n", sha1_to_hex(obj->sha1),
+		       (int) (ep - name),
+		       name);
 	}
 	else
 		fprintf(pack_pipe, "%s %s\n",
-				sha1_to_hex(p->item->sha1), p->name);
+				sha1_to_hex(obj->sha1), name);
 }
 
 static void show_edge(struct commit *commit)

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11  0:27                           ` Linus Torvalds
@ 2009-04-11  1:15                             ` Linus Torvalds
  2009-04-11  1:34                               ` Nicolas Pitre
  2009-04-11 13:41                               ` Björn Steinbrink
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11  1:15 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Fri, 10 Apr 2009, Linus Torvalds wrote:
> 
> Here's a less trivial thing, and slightly more dubious one.

I'm starting to like it more.

In particular, pushing the "path_name()" call _into_ the show() function 
would seem to allow

 - more clarity into who "owns" the name (ie now when we free the name in 
   the show_object callback, it's because we generated it ourselves by 
   calling path_name())

 - not calling path_name() at all, either because we don't care about the 
   name in the first place, or because we are actually happy walking the 
   linked list of "struct name_path *" and the last component.

Now, I didn't do that latter optimization, because it would require some 
more coding, but especially looking at "builtin-pack-objects.c", we really 
don't even want the whole pathname, we really would be better off with the 
list of path components.

Why? We use that name for two things:
 - add_preferred_base_object(), which actually _wants_ to traverse the 
   path, and now does it by looking for '/' characters!
 - for 'name_hash()', which only cares about the last 16 characters of a 
   name, so again, generating the full name seems to be just unnecessary 
   work (see the sketch below).
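
For reference, name_hash() is conceptually just this (quoted from
memory, so treat it as a sketch; details may be off):

	#include <ctype.h>

	static unsigned name_hash(const char *name)
	{
		unsigned c, hash = 0;

		if (!name)
			return 0;
		/*
		 * Each new character shifts the previous contributions
		 * down and eventually out, so only a bounded suffix of
		 * the name can influence the result at all.
		 */
		while ((c = (unsigned char)*name++) != 0) {
			if (isspace(c))
				continue;
			hash = (hash >> 2) + (c << 24);
		}
		return hash;
	}

Building the full leading path just to throw most of it away is
pointless.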

Anyway, so I didn't look any closer at those things, but it did convince 
me that the "show_object()" calling convention was crazy, and we're 
actually better off doing _less_ in list-objects.c, and giving people 
access to the internal data structures so that they can decide whether 
they want to generate a path-name or not.

This patch does that, and then for people who did use the name (even if 
they might do something more clever in the future), it just does the 
straightforward "name = path_name(path, component); .. free(name);" thing.

It obviously goes on top of my previous patch.


		Linus

---
 builtin-pack-objects.c |    9 +++------
 builtin-rev-list.c     |    8 +++++---
 list-objects.c         |   10 +++++-----
 list-objects.h         |    2 +-
 revision.c             |    4 ++--
 revision.h             |    2 +-
 upload-pack.c          |    4 +++-
 7 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index e028a02..d74d8a4 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1907,16 +1907,13 @@ static void show_commit(struct commit *commit)
 	commit->object.flags |= OBJECT_ADDED;
 }
 
-static void show_object(struct object *obj, const char *name)
+static void show_object(struct object *obj, const struct name_path *path, const char *last)
 {
+	char *name = path_name(path, last);
+
 	add_preferred_base_object(name);
 	add_object_entry(obj->sha1, obj->type, name, 0);
 	obj->flags |= OBJECT_ADDED;
-
-	/*
-	 * We will have generated the hash from the name,
-	 * but not saved a pointer to it - we can free it
-	 */
 	free(name);
 }
 
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 0815cf3..aba8a6f 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -168,20 +168,21 @@ static void finish_commit(struct commit *commit)
 	commit->buffer = NULL;
 }
 
-static void finish_object(struct object *obj, const char *name)
+static void finish_object(struct object *obj, const struct name_path *path, const char *name)
 {
 	if (obj->type == OBJ_BLOB && !has_sha1_file(obj->sha1))
 		die("missing blob object '%s'", sha1_to_hex(obj->sha1));
 }
 
-static void show_object(struct object *obj, const char *name)
+static void show_object(struct object *obj, const struct name_path *path, const char *component)
 {
+	char *name = path_name(path, component);
 	/* An object with name "foo\n0000000..." can be used to
 	 * confuse downstream "git pack-objects" very badly.
 	 */
 	const char *ep = strchr(name, '\n');
 
-	finish_object(obj, name);
+	finish_object(obj, path, name);
 	if (ep) {
 		printf("%s %.*s\n", sha1_to_hex(obj->sha1),
 		       (int) (ep - name),
@@ -189,6 +190,7 @@ static void show_object(struct object *obj, const char *name)
 	}
 	else
 		printf("%s %s\n", sha1_to_hex(obj->sha1), name);
+	free(name);
 }
 
 static void show_edge(struct commit *commit)
diff --git a/list-objects.c b/list-objects.c
index 5a4af62..30ded3d 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -23,7 +23,7 @@ static void process_blob(struct rev_info *revs,
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
 	obj->flags |= SEEN;
-	show(obj, path_name(path, name));
+	show(obj, path, name);
 }
 
 /*
@@ -77,7 +77,7 @@ static void process_tree(struct rev_info *revs,
 	if (parse_tree(tree) < 0)
 		die("bad tree object %s", sha1_to_hex(obj->sha1));
 	obj->flags |= SEEN;
-	show(obj, path_name(path, name));
+	show(obj, path, name);
 	me.up = path;
 	me.elem = name;
 	me.elem_len = strlen(name);
@@ -140,8 +140,8 @@ static void add_pending_tree(struct rev_info *revs, struct tree *tree)
 }
 
 void traverse_commit_list(struct rev_info *revs,
-			  void (*show_commit)(struct commit *),
-			  void (*show_object)(struct object *, const char *))
+			  show_commit_fn show_commit,
+			  show_object_fn show_object)
 {
 	int i;
 	struct commit *commit;
@@ -158,7 +158,7 @@ void traverse_commit_list(struct rev_info *revs,
 			continue;
 		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
-			show_object(obj, name);
+			show_object(obj, NULL, name);
 			continue;
 		}
 		if (obj->type == OBJ_TREE) {
diff --git a/list-objects.h b/list-objects.h
index 13b0dd9..0b2de64 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -2,7 +2,7 @@
 #define LIST_OBJECTS_H
 
 typedef void (*show_commit_fn)(struct commit *);
-typedef void (*show_object_fn)(struct object *, const char *);
+typedef void (*show_object_fn)(struct object *, const struct name_path *, const char *);
 typedef void (*show_edge_fn)(struct commit *);
 
 void traverse_commit_list(struct rev_info *revs, show_commit_fn, show_object_fn);
diff --git a/revision.c b/revision.c
index 44a9ce2..bd0ea34 100644
--- a/revision.c
+++ b/revision.c
@@ -15,9 +15,9 @@
 
 volatile show_early_output_fn_t show_early_output;
 
-char *path_name(struct name_path *path, const char *name)
+char *path_name(const struct name_path *path, const char *name)
 {
-	struct name_path *p;
+	const struct name_path *p;
 	char *n, *m;
 	int nlen = strlen(name);
 	int len = nlen + 1;
diff --git a/revision.h b/revision.h
index c89e8ff..be39e7d 100644
--- a/revision.h
+++ b/revision.h
@@ -146,7 +146,7 @@ struct name_path {
 	const char *elem;
 };
 
-char *path_name(struct name_path *path, const char *name);
+char *path_name(const struct name_path *path, const char *name);
 
 extern void add_object(struct object *obj,
 		       struct object_array *p,
diff --git a/upload-pack.c b/upload-pack.c
index 5524ac4..536efbb 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -78,11 +78,12 @@ static void show_commit(struct commit *commit)
 	commit->buffer = NULL;
 }
 
-static void show_object(struct object *obj, const char *name)
+static void show_object(struct object *obj, const struct name_path *path, const char *component)
 {
 	/* An object with name "foo\n0000000..." can be used to
 	 * confuse downstream git-pack-objects very badly.
 	 */
+	char *name = path_name(path, component);
 	const char *ep = strchr(name, '\n');
 	if (ep) {
 		fprintf(pack_pipe, "%s %.*s\n", sha1_to_hex(obj->sha1),
@@ -92,6 +93,7 @@ static void show_object(struct object *obj, const char *name)
 	else
 		fprintf(pack_pipe, "%s %s\n",
 				sha1_to_hex(obj->sha1), name);
+	free(name);
 }
 
 static void show_edge(struct commit *commit)

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11  1:15                             ` Linus Torvalds
@ 2009-04-11  1:34                               ` Nicolas Pitre
  2009-04-11 13:41                               ` Björn Steinbrink
  1 sibling, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-11  1:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Björn Steinbrink, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On Fri, 10 Apr 2009, Linus Torvalds wrote:

> 
> 
> On Fri, 10 Apr 2009, Linus Torvalds wrote:
> > 
> > Here's a less trivial thing, and slightly more dubious one.
> 
> I'm starting to like it more.
> 
> In particular, pushing the "path_name()" call _into_ the show() function 
> would seem to allow
> 
>  - more clarity into who "owns" the name (ie now when we free the name in 
>    the show_object callback, it's because we generated it ourselves by 
>    calling path_name())
> 
>  - not calling path_name() at all, either because we don't care about the 
>    name in the first place, or because we are actually happy walking the 
>    linked list of "struct name_path *" and the last component.
> 
> Now, I didn't do that latter optimization, because it would require some 
> more coding, but especially looking at "builtin-pack-objects.c", we really 
> don't even want the whole pathname, we really would be better off with the 
> list of path components.
> 
> Why? We use that name for two things:
>  - add_preferred_base_object(), which actually _wants_ to traverse the 
>    path, and now does it by looking for '/' characters!
>  - for 'name_hash()', which only cares about the last 16 characters of a 
>    name, so again, generating the full name seems to be just unnecessary 
>    work.
> 
> Anyway, so I didn't look any closer at those things, but it did convince 
> me that the "show_object()" calling convention was crazy, and we're 
> actually better off doing _less_ in list-objects.c, and giving people 
> access to the internal data structures so that they can decide whether 
> they want to generate a path-name or not.

YES!

I didn't look at the patch really closely, but this fits pretty well 
with the philosophy behind pack v4 where path components are stored in a 
separate table (instead of being duplicated in every tree object for 
the same path), hence generating path names only on demand would be a 
real win for those cases where the full name is not needed.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-10 20:38                                   ` Robin H. Johnson
@ 2009-04-11  1:58                                     ` Nicolas Pitre
  2009-04-11  7:06                                       ` Mike Hommey
  2009-04-14 15:52                                     ` Johannes Schindelin
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-11  1:58 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

On Fri, 10 Apr 2009, Robin H. Johnson wrote:

> On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > > the description...
> > OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> > 725MB pack file.  So that's about half the originally reported size.
> The one problem with having the single large packfile is that Git
> doesn't have a trivial way to resume downloading it when the git://
> protocol is used.

Having multiple packs won't help the git:// protocol at all in that 
regard.  In fact it'll just make things a bit harder on the server in 
all cases, as it has to generate a single pack for streaming anyway 
from multiple source packs, performing extra work in attempting delta 
compression across pack boundaries.

> For our developers cursed with bad internet connections (a fair number
> of firewalls that don't seem to respect keepalive properly), I suppose
> I can probably just maintain a separate repo for their initial clones,
> which leaves a large overall download, but more chances to resume.

I don't know much about git's http protocol implementation, but I guess 
it should be able to resume the transfer of a pack file which might have 
been interrupted in the middle?  If not, then this should be considered.

> PS #1: B.Steinbrink's memory improvement patch seems to work nicely too,
> but more memory improvements in that realm are still needed.

Good.

> PS #2: We finally got some newer hardware to run the large repo, I'm
> working on the install now, but until the memory issue is better
> resolved, I'm still worried we might run short if there are too many
> concurrent clones.

Right.


Nicolas (who wishes he could dedicate more time to git hacking)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-11  1:58                                     ` Nicolas Pitre
@ 2009-04-11  7:06                                       ` Mike Hommey
  0 siblings, 0 replies; 97+ messages in thread
From: Mike Hommey @ 2009-04-11  7:06 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Robin H. Johnson, Git Mailing List

On Fri, Apr 10, 2009 at 09:58:11PM -0400, Nicolas Pitre wrote:
> On Fri, 10 Apr 2009, Robin H. Johnson wrote:
> 
> > On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > > > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > > > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > > > the description...
> > > OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> > > 725MB pack file.  So that's about half the originally reported size.
> > The one problem with having the single large packfile is that Git
> > doesn't have a trivial way to resume downloading it when the git://
> > protocol is used.
> 
> Having multiple packs won't help the git:// protocol at all in that 
> regard.  In fact it'll just make it a bit harder on the server for all 
> cases, which has to generate a single pack for streaming anyway by using 
> multiple source ones and perform extra work in attempting delta 
> compression across pack boundaries.
> 
> > For our developers cursed with bad internet connections (a fair number
> > of firewalls that don't seem to respect keepalive properly), I suppose
> > I can probably just maintain a separate repo for their initial clones,
> > which leaves a large overall download, but more chances to resume.
> 
> I don't know much about git's http protocol implementation, but I guess 
> it should be able to resume the transfer of a pack file which might have 
> been interrupted in the middle?  If not, then this should be considered.

It can.

Mike

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11  1:15                             ` Linus Torvalds
  2009-04-11  1:34                               ` Nicolas Pitre
@ 2009-04-11 13:41                               ` Björn Steinbrink
  2009-04-11 14:07                                 ` Björn Steinbrink
  1 sibling, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 13:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.10 18:15:26 -0700, Linus Torvalds wrote:
> It obviously goes on top of my previous patch.

Gives some nice results for the "rev-list --all --objects" test on the
gentoo repo (with the old pack):

     | With my patch | With your patch on top
-----|---------------|-----------------------
VSZ  |       1667952 | 1319324
RSS  |       1388408 | 1126080
time |       1:48.99 | 1:42.24

Testing a full repack, it feels slower during the "Compressing objects"
part, but I don't have any hard numbers on that, and maybe I was just
more patient last week, when I did the first repack on that repo. I can
only say that it took about 13 minutes for the "Compressing objects"
part, and 18 minutes in total, on my Core 2 Quad 2.83GHz with 4G of RAM.

The new pack is slightly worse than the old one (--window=250 --depth=250):
Old: 759662467
New: 759720234

But that seems totally negligible, and at least the performance of the
(stupid) rev-list test is not affected by the different pack layout.

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 13:41                               ` Björn Steinbrink
@ 2009-04-11 14:07                                 ` Björn Steinbrink
  2009-04-11 18:06                                   ` Linus Torvalds
  2009-04-11 18:19                                   ` Linus Torvalds
  0 siblings, 2 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 14:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.11 15:41:12 +0200, Björn Steinbrink wrote:
> On 2009.04.10 18:15:26 -0700, Linus Torvalds wrote:
> > It obviously goes on top of my previous patch.
> 
> Gives some nice results for the "rev-list --all --objects" test on the
> gentoo repo (with the old pack):

And for completeness, here are the results for linux-2.6.git

     | With my patch | With your patch on top
-----|---------------|-----------------------
VSZ  |        460376 | 407900
RSS  |        292996 | 239760
time |       0:14.28 | 0:14.66

And again, the new pack is slightly worse than the old one
 (window=250, --depth=250).
Old: 240238406
New: 240280452

But again, it's negligible.

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-04 22:07 Performance issue: initial git clone causes massive repack Robin H. Johnson
  2009-04-05  0:05 ` Nicolas Sebrecht
  2009-04-05 19:57 ` Jeff King
@ 2009-04-11 17:24 ` Mark Levedahl
  2 siblings, 0 replies; 97+ messages in thread
From: Mark Levedahl @ 2009-04-11 17:24 UTC (permalink / raw)
  To: git

Robin H. Johnson wrote:

> Hi,
> 
> This is a first in my series of mails over the next few days, on issues
> that we've run into planning a potential migration for Gentoo's
> repository into Git.
> 
> Our full repository conversion is large, even after tuning the
> repacking, the packed repository is between 1.4 and 1.6GiB. As of February
> 4th, 2009, it contained 4886949 objects. It is not suitable for
> splitting into submodules either unfortunately - we have a lot of
> directory moves that would cause submodule bloat.
> 
> During an initial clone, I see that git-upload-pack invokes
> pack-objects, despite the ENTIRE repository already being packed - no
> loose objects whatsoever. git-upload-pack then seems to buffer in
> memory.
> 

Have you considered using a bundle as part of the initial clone process? The 
idea would be to periodically create a bundle

	git bundle create <somename>.bundle [list of refs]

and publish that on your website. A new user would then do

	wget $uri-of-bundle
	git clone <somename>.bundle
	cd $somename
	git config remote.origin.url $origin
	git fetch

and they have the current repo. As the bundle is a file, it can be 
distributed by torrent or other method. The expense of creating the pack in 
the bundle is paid exactly once when the bundle is created.
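
A recipient can also sanity-check the downloaded file first; this uses
only the stock verify subcommand:

	git bundle verify <somename>.bundle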

Mark

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 14:07                                 ` Björn Steinbrink
@ 2009-04-11 18:06                                   ` Linus Torvalds
  2009-04-11 18:22                                     ` Linus Torvalds
  2009-04-11 20:50                                     ` Björn Steinbrink
  2009-04-11 18:19                                   ` Linus Torvalds
  1 sibling, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11 18:06 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> 
> And for completeness, here are the results for linux-2.6.git
> 
>      | With my patch | With your patch on top
> -----|---------------|-----------------------
> VSZ  |        460376 | 407900
> RSS  |        292996 | 239760
> time |       0:14.28 | 0:14.66

Ok, it uses less memory, but more CPU time. That's reasonable - we "waste" 
CPU time on doing the extra free's, and since the memory use isn't a huge 
constraining factor and cache behavior is bad anyway, it's then actually 
slightly slower.

> And again, the new pack is slightly worse than the old one
>  (window=250, --depth=250).
> Old: 240238406
> New: 240280452
> 
> But again, it's negligible.

Well, it's sad that it's consistently a bit worse, even if we're talking 
just small small fractions of a percent (looks like 0.02% bigger ;). 

And I think I can see why. The new code actually does a _better_ job of 
the resulting list being in "recency" order, whereas the old code used to 
output the root trees all together. Now they're spread out according to 
how soon they are reached.

The object sorting code _should_ sort them by type, name and size (and 
thus the pack generation should generate the same deltas), but the name 
hashing is probably weak enough that it doesn't always do a perfect job, 
and then we likely get a slightly worse pack.

But it would be good to really understand that part. It's a _small_ 
downside, but it's a downside.

But it's interesting to note how the bigger gentoo case actually improved 
in performance, probably because by then the denser memory use actually 
meant that we had noticeably better cache and TLB behavior. So the patch 
helps the bad case, at least.

			Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 14:07                                 ` Björn Steinbrink
  2009-04-11 18:06                                   ` Linus Torvalds
@ 2009-04-11 18:19                                   ` Linus Torvalds
  2009-04-11 19:40                                     ` Björn Steinbrink
  1 sibling, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11 18:19 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Sat, 11 Apr 2009, Björn Steinbrink wrote:

> On 2009.04.11 15:41:12 +0200, Björn Steinbrink wrote:
> > On 2009.04.10 18:15:26 -0700, Linus Torvalds wrote:
> > > It obviously goes on top of my previous patch.
> > 
> > Gives some nice results for the "rev-list --all --objects" test on the
> > gentoo repo (with the old pack):
> >      | With my patch | With your patch on top
> > -----|---------------|-----------------------
> > VSZ  |       1667952 | 1319324
> > RSS  |       1388408 | 1126080
> 
> linux-2.6.git:
> 
>      | With my patch | With your patch on top
> -----|---------------|-----------------------
> VSZ  |        460376 | 407900
> RSS  |        292996 | 239760

Interesting. That's an 18+% reduction in RSS in both cases. Much bigger 
than I expected, or what I saw in my limited testing. Is this in 32-bit 
mode, where the pointers are cheaper, and thus the non-pointer data 
relatively more expensive and a bigger percentage of the total? We really 
wasted a _lot_ of memory on those names.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 18:06                                   ` Linus Torvalds
@ 2009-04-11 18:22                                     ` Linus Torvalds
  2009-04-11 19:22                                       ` Björn Steinbrink
  2009-04-11 20:50                                     ` Björn Steinbrink
  1 sibling, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11 18:22 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Sat, 11 Apr 2009, Linus Torvalds wrote:
> On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> 
> > And again, the new pack is slightly worse than the old one
> >  (window=250, --depth=250).
> > Old: 240238406
> > New: 240280452
> > 
> > But again, it's negligible.
> 
> Well, it's sad that it's consistently a bit worse, even if we're talking 
> just small small fractions of a percent (looks like 0.02% bigger ;). 

Oh, just wondering: that 0.02% is negligible, but did you use "-f" (or 
--no-reuse-delta if you're testing with 'git pack-objects') to see that 
it's actually re-computing the deltas?

The 0.02% difference might be just because of differences in pack layout. 
If you force all deltas to be recomputed, maybe the difference is much 
bigger?

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 18:22                                     ` Linus Torvalds
@ 2009-04-11 19:22                                       ` Björn Steinbrink
  0 siblings, 0 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 19:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.11 11:22:09 -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 11 Apr 2009, Linus Torvalds wrote:
> > On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> > 
> > > And again, the new pack is slightly worse than the old one
> > >  (window=250, --depth=250).
> > > Old: 240238406
> > > New: 240280452
> > > 
> > > But again, it's negligible.
> > 
> > Well, it's sad that it's consistently a bit worse, even if we're talking 
> > just small small fractions of a percent (looks like 0.02% bigger ;). 
> 
> Oh, just wondering: that 0.02% is negligible, but did you use "-f" (or 
> --no-reuse-delta if you're testing with 'git pack-objects') to see that 
> it's actually re-computing the deltas?
> 
> The 0.02% difference might be just because of differences in pack layout. 
> If you force all deltas to be recomputed, maybe the difference is much 
> bigger?

Yep, that was "git repack -adf --window=250 --depth=250".

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 18:19                                   ` Linus Torvalds
@ 2009-04-11 19:40                                     ` Björn Steinbrink
  2009-04-11 19:58                                       ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.11 11:19:01 -0700, Linus Torvalds wrote:
> On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> > On 2009.04.11 15:41:12 +0200, Björn Steinbrink wrote:
> > > On 2009.04.10 18:15:26 -0700, Linus Torvalds wrote:
> > > > It obviously goes on top of my previous patch.
> > > 
> > > Gives some nice results for the "rev-list --all --objects" test on the
> > > gentoo repo (with the old pack):
> > >      | With my patch | With your patch on top
> > > -----|---------------|-----------------------
> > > VSZ  |       1667952 | 1319324
> > > RSS  |       1388408 | 1126080
> > 
> > linux-2.6.git:
> > 
> >      | With my patch | With your patch on top
> > -----|---------------|-----------------------
> > VSZ  |        460376 | 407900
> > RSS  |        292996 | 239760
> 
> Interesting. That's a 18+% reduction in RSS in both cases. Much bigger 
> than I expected, or what I saw in my limited testing. Is this in 32-bit 
> mode, where the pointers are cheaper, and thus the non-pointer data 
> relatively more expensive and a bigger percentage of the total? We really 
> wasted a _lot_ of memory on those names.

No, this is x86-64, 8 byte pointers. But the savings are trivially
explained I think. The struct object_array things are 20 bytes here (per
object overhead!), so that's about 5M * 20 = 100M. And the average name
length for the objects was 19 bytes, which means about another 100M.
Both the object_array stuff and the path names were allocated
and never freed. Your patch removed the object_array stuff, and it made
the memory allocations for the names temporary. Right?

Had you moved just the path_name() calls, that would have meant that we
would have needed to keep the name_path stuff around, which is also 20
bytes here (two pointers, one int). And that would have meant that
anything with a leading-up path shorter than 20 bytes would have seen
increased memory usage with 64-bit pointers; with 32-bit pointers, the
limit would have been 12 bytes.

So for the "just move path_name() call" solution, 32bit vs. 64bit would
have made a difference, but with your actual patches, you just turned
everything into temporary allocations. So the 4byte overhead on 64bit
platforms is just once linear with the directory-depth of the current
object, instead of with the number of objects in total.

Right?

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 19:40                                     ` Björn Steinbrink
@ 2009-04-11 19:58                                       ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11 19:58 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> 
> No, this is x86-64, 8 byte pointers. But the savings are trivially
> explained I think. The struct object_array things are 20 bytes here (per
> object overhead!), so that's about 5M * 20 = 100M. And the average name
> length for the objects was 19 bytes, which means about another 100M.
> Both the object_array stuff and the path names were allocated
> and never freed. Your patch removed the object_array stuff, and it made
> the memory allocations for the names temporary. Right?

Right.

My original one-liner patch just did the name freeing part, but it did so 
only at the _end_ (when actually calling show_object()), so it probably 
didn't help RSS very much - because you still had one point in time where 
you had all the names allocated. It probably helped packing (since it 
allocates more _afterwards_), but likely didn't make much of a difference 
for just "git rev-list".

So that was the impetus for trying to just avoid the "keep all objects 
around on the 'object_array' thing" patch, and then cleaning up the 
show_object() call semantics.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 18:06                                   ` Linus Torvalds
  2009-04-11 18:22                                     ` Linus Torvalds
@ 2009-04-11 20:50                                     ` Björn Steinbrink
  2009-04-11 21:43                                       ` Linus Torvalds
  1 sibling, 1 reply; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 20:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.11 11:06:17 -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> > 
> > And for completeness, here are the results for linux-2.6.git
> > 
> >      | With my patch | With your patch on top
> > -----|---------------|-----------------------
> > VSZ  |        460376 | 407900
> > RSS  |        292996 | 239760
> > time |       0:14.28 | 0:14.66
> 
> Ok, it uses less memory, but more CPU time. That's reasonable - we "waste" 
> CPU time on doing the extra free's, and since the memory use isn't a huge 
> constraining factor and cache behavior is bad anyway, it's then actually 
> slightly slower.
> 
> > And again, the new pack is slightly worse than the old one
> >  (window=250, --depth=250).
> > Old: 240238406
> > New: 240280452
> > 
> > But again, it's negligible.
> 
> Well, it's sad that it's consistently a bit worse, even if we're talking 
> just small small fractions of a percent (looks like 0.02% bigger ;). 
> 
> And I think I can see why. The new code actually does a _better_ job of 
> the resulting list being in "recency" order, whereas the old code used to 
> output the root trees all together. Now they're spread out according to 
> how soon they are reached.

Hm, I don't think that was the case. When iterating over the commits,
process_tree was called with commit->tree, and that added the root tree
to the objects array as well as walking it to add all referenced objects.

And yep, the 'old' "rev-list --all --objects" shows for example:
ebace34d059216b3573cd67a83068d2eafe2f2e7 read-cache.c
a869cb0789d8ad87f04d28dd9b703f3ff343a4a7 
497a05b8fa8e9aa3a5db9b42e5c50392f352d2b4 cache.h
91b2628e3c18e7f75e477c24197d9ef2eca14125 read-cache.c
6862d1012681cd6812ab9bfe1a866446f92a7c91 read-tree.c

a869cb0 being a root tree, in between two blobs.

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 20:50                                     ` Björn Steinbrink
@ 2009-04-11 21:43                                       ` Linus Torvalds
  2009-04-11 23:24                                         ` Björn Steinbrink
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2009-04-11 21:43 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List



On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> > 
> > And I think I can see why. The new code actually does a _better_ job of 
> > the resulting list being in "recency" order, whereas the old code used to 
> > output the root trees all together. Now they're spread out according to 
> > how soon they are reached.
> 
> Hm, I don't think that was the case. When iterating over the commits,
> process_tree was called with commit->tree, and that added the root tree
> to the objects array as well as walking it to add all referenced objects.

Oh, you're right. We actually ended up walking the trees at that point, 
so recency should be the same. 

Hmm. Where does the difference in ordering come from, then? 

			Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH] process_{tree,blob}: Remove useless xstrdup calls
  2009-04-11 21:43                                       ` Linus Torvalds
@ 2009-04-11 23:24                                         ` Björn Steinbrink
  0 siblings, 0 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-11 23:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Jakub Narebski, Sverre Rabbelier, david,
	Junio C Hamano, Nicolas Sebrecht, Robin H. Johnson,
	Git Mailing List

On 2009.04.11 14:43:14 -0700, Linus Torvalds wrote:
> On Sat, 11 Apr 2009, Björn Steinbrink wrote:
> > > 
> > > And I think I can see why. The new code actually does a _better_ job of 
> > > the resulting list being in "recency" order, whereas the old code used to 
> > > output the root trees all together. Now they're spread out according to 
> > > how soon they are reached.
> > 
> > Hm, I don't think that was the case. When iterating over the commits,
> > process_tree was called with commit->tree, and that added the root tree
> > to the objects array as well as walking it to add all referenced objects.
> 
> Oh, you're right. We actually ended up walking the trees at that point, 
> so recency should be the same. 
> 
> Hmm. Where does the difference in ordering come from, then? 

Ah! The tag objects. Previously, they were added to the end of the
objects array, after all the objects from the process_tree() calls. But
now, the pending array is directly processed, causing the tags to show
up earlier. The same is of course true for any other pending object, but
verifying that for the tag objects was easier :-)

Björn

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-10 20:38                                   ` Robin H. Johnson
  2009-04-11  1:58                                     ` Nicolas Pitre
@ 2009-04-14 15:52                                     ` Johannes Schindelin
  2009-04-14 20:17                                       ` Nicolas Pitre
  1 sibling, 1 reply; 97+ messages in thread
From: Johannes Schindelin @ 2009-04-14 15:52 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Git Mailing List

Hi,

On Fri, 10 Apr 2009, Robin H. Johnson wrote:

> On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > > the description...
> > OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> > 725MB pack file.  So that's about half the originally reported size.
> The one problem with having the single large packfile is that Git
> doesn't have a trivial way to resume downloading it when the git://
> protocol is used.
> 
> For our developers cursed with bad internet connections (a fair number
> of firewalls that don't seem to respect keepalive properly), I suppose
> I can probably just maintain a separate repo for their initial clones,
> which leaves a large overall download, but more chances to resume.

IMO the best we could do under these circumstances is to use fsck 
--lost-found to find those commits which have a complete history (i.e. no 
"broken links") -- this probably needs to be implemented as a special mode 
of --lost-found -- and store them in a temporary to-be-removed 
namespace, say refs/heads/incomplete-refs/$number, which will be sent to 
the server when fetching the next time.  (Might need some iterations to 
get everything, though.)
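
A rough sketch of the client-side recovery this implies, with today's
tools (the "complete history only" filtering described above does not
exist yet, so this is purely hypothetical):

	# dump dangling commits into .git/lost-found/commit/
	git fsck --lost-found
	# park each one under a temporary ref so that the next fetch can
	# advertise it as a "have"
	n=0
	for c in .git/lost-found/commit/*; do
		git update-ref refs/heads/incomplete-refs/$n $(basename $c)
		n=$((n+1))
	done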

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 15:52                                     ` Johannes Schindelin
@ 2009-04-14 20:17                                       ` Nicolas Pitre
  2009-04-14 20:27                                         ` Robin H. Johnson
  2009-04-14 20:30                                         ` Johannes Schindelin
  0 siblings, 2 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-14 20:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Robin H. Johnson, Git Mailing List

On Tue, 14 Apr 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Fri, 10 Apr 2009, Robin H. Johnson wrote:
> 
> > On Wed, Apr 08, 2009 at 12:52:54AM -0400, Nicolas Pitre wrote:
> > > > http://git.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary
> > > > At least that's what I cloned ;-) I hope it's the right one, but it fits
> > > > the description...
> > > OK.  FWIW, I repacked it with --window=250 --depth=250 and obtained a 
> > > 725MB pack file.  So that's about half the originally reported size.
> > The one problem with having the single large packfile is that Git
> > doesn't have a trivial way to resume downloading it when the git://
> > protocol is used.
> > 
> > For our developers cursed with bad internet connections (a fair number
> > of firewalls that don't seem to respect keepalive properly), I suppose
> > I can probably just maintain a separate repo for their initial clones,
> > which leaves a large overall download, but more chances to resume.
> 
> IMO the best we could do under these circumstances is to use fsck 
> --lost-found to find those commits which have a complete history (i.e. no 
> "broken links") -- this probably needs to be implemented as a special mode 
> of --lost-found -- and store them in a temporary to-be-removed 
> namespace, say refs/heads/incomplete-refs/$number, which will be sent to 
> the server when fetching the next time.  (Might need some iterations to 
> get everything, though.)

Well, although this might seem a good idea, this would help only in 
those cases where there is at least one complete revision available, 
i.e. no delta needed. This is usually true for the top commit after a 
repack, whose objects are all stored at the front of the pack and serve 
as base objects for deltas from subsequent (older) commits.  Thing is, 
that first revision is likely to occupy a significant portion of the 
whole pack, like no less than the size of the equivalent .tar.gz for the 
content of that commit.  To see what this represents, just try a shallow 
clone with depth=1.  For the Linux kernel, this is more than 80MB while 
the whole repo is in the 200MB range.  So if your connection isn't 
reliable enough to transfer at least that amount then you're screwed 
anyway.
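
To get a feel for that lower bound directly (repository URL
illustrative):

	# fetches only the objects reachable from the single top commit
	git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git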

Independently of this, I think there is quite a lot of confusion here.  
According to Robin, the reason for splitting the large Gentoo repo into 
multiple packs is apparently to help with the resuming of a clone.  We 
know that the git:// protocol is currently not resumable, and having 
multiple packs on the remote server won't change the outcome in any way 
as the client still receives a single big pack anyway.

WRT the HTTP protocol, I was questioning git's ability to resume the 
transfer of a pack that was interrupted in the middle, without 
redownloading it all.  And Mike Hommey says this is actually the case.

Meaning there is simply no reason to split a big pack into multiple 
ones.  If anything, it'll only make a clone over the native git protocol 
more costly for the server, which has to pack everything back together.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 20:17                                       ` Nicolas Pitre
@ 2009-04-14 20:27                                         ` Robin H. Johnson
  2009-04-14 21:02                                           ` Nicolas Pitre
                                                             ` (2 more replies)
  2009-04-14 20:30                                         ` Johannes Schindelin
  1 sibling, 3 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-14 20:27 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Robin H. Johnson, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1187 bytes --]

On Tue, Apr 14, 2009 at 04:17:55PM -0400, Nicolas Pitre wrote:
> WRT the HTTP protocol, I was questioning git's ability to resume the 
> transfer of a pack that was interrupted in the middle, without 
> redownloading it all.  And Mike Hommey says this is actually the case.
With rsync:// it was helpful to split the pack, and resume there worked
reasonably (see my other mail about the segfault that turns up
sometimes).

More recent discussions raised the possibility of using git-bundle to
provide a more ideal initial download that they CAN resume easily, as
well as being able to move on from it.

So, from the Gentoo side right now, we're looking at this:
1. Setup git-bundle for initial downloads.
2. Disallow initial clones over git:// (allow updates ONLY)
3. Disallow git-over-http, git-over-rsync.

This also avoids the wait time with the initial clone. Just grab the
bundle with your choice of rsync or http, check its integrity, throw it
into your repo, and update to the latest tree.
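
As a concrete sketch of that flow from the user's side (both URLs are
hypothetical placeholders):

	wget -c http://example.org/gentoo-x86.bundle    # resumable download
	git bundle verify gentoo-x86.bundle             # integrity check
	git clone gentoo-x86.bundle gentoo-x86
	cd gentoo-x86
	# repoint origin from the bundle file to the live repository
	git config remote.origin.url git://git.example.org/gentoo-x86.git
	git fetch origin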

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85

[-- Attachment #2: Type: application/pgp-signature, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 20:17                                       ` Nicolas Pitre
  2009-04-14 20:27                                         ` Robin H. Johnson
@ 2009-04-14 20:30                                         ` Johannes Schindelin
  1 sibling, 0 replies; 97+ messages in thread
From: Johannes Schindelin @ 2009-04-14 20:30 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Robin H. Johnson, Git Mailing List

Hi,

On Tue, 14 Apr 2009, Nicolas Pitre wrote:

> On Tue, 14 Apr 2009, Johannes Schindelin wrote:
> 
> > IMO the best we could do under these circumstances [unreliable 
> > network] is to use fsck --lost-found to find those commits which have 
> > a complete history (i.e. no "broken links") -- this probably needs to 
> > be implemented as a special mode of --lost-found -- and store them in 
> > a temporary to-be-removed namespace, say 
> > refs/heads/incomplete-refs/$number, which will be sent to the server 
> > when fetching the next time.  (Might need some iterations to get 
> > everything, though.)
> 
> Well, although this might seem a good idea, this would help only in 
> those cases where there is at least one complete revision available, 
> i.e. no delta needed. This is usually true for the top commit after a 
> repack, whose objects are all stored at the front of the pack and serve 
> as base objects for deltas from subsequent (older) commits.  Thing is, 
> that first revision is likely to occupy a significant portion of the 
> whole pack, like no less than the size of the equivalent .tar.gz for the 
> content of that commit.  To see what this represents, just try a shallow 
> clone with depth=1.  For the Linux kernel, this is more than 80MB while 
> the whole repo is in the 200MB range.  So if your connection isn't 
> reliable enough to transfer at least that amount then you're screwed 
> anyway.

Good point.

Sorry for not thinking it through,
Dscho

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 20:27                                         ` Robin H. Johnson
@ 2009-04-14 21:02                                           ` Nicolas Pitre
  2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
  2009-04-22  1:15                                           ` Sam Vilain
  2 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-14 21:02 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Johannes Schindelin, Git Mailing List

On Tue, 14 Apr 2009, Robin H. Johnson wrote:

> More recent discussions raised the possibility of using git-bundle to
> provide a more ideal initial download that they CAN resume easily, as
> well as being able to move on from it.
> 
> So, from the Gentoo side right now, we're looking at this:
> 1. Setup git-bundle for initial downloads.
> 2. Disallow initial clones over git:// (allow updates ONLY)
> 3. Disallow git-over-http, git-over-rsync.
> 
> This also avoids the wait time with the initial clone. Just grab the
> bundle with your choice of rsync or http, check its integrity, throw it
> into your repo, and update to the latest tree.

This certainly makes lots of sense until we overcome the current clone 
bottleneck.  You should tightly repack your repository first, like with
"git repack -a -f -d --depth=100 --window=500".  Use a fast machine with 
enough RAM, of course.  Then you'll have a nice and small bundle.

Of course any git pack/bundle has full self-integrity built in.  So you 
should not need to do a separate check.

And don't forget to delete the bundle once it has been fetched into a 
full repository, otherwise it'll only waste disk space.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 20:27                                         ` Robin H. Johnson
  2009-04-14 21:02                                           ` Nicolas Pitre
@ 2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
  2009-04-15  5:53                                             ` Robin H. Johnson
  2009-04-15  5:54                                             ` Junio C Hamano
  2009-04-22  1:15                                           ` Sam Vilain
  2 siblings, 2 replies; 97+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2009-04-15  3:09 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Nicolas Pitre, Johannes Schindelin, Git Mailing List

On Wed, Apr 15, 2009 at 6:27 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> So, from the Gentoo side right now, we're looking at this:
> 1. Setup git-bundle for initial downloads.
> 2. Disallow initial clones over git:// (allow updates ONLY)

How can you do that? If I understand git protocol correctly, there is
no difference between a fetch request and a clone one.

> 3. Disallow git-over-http, git-over-rsync.
-- 
Duy

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
@ 2009-04-15  5:53                                             ` Robin H. Johnson
  2009-04-15  5:54                                             ` Junio C Hamano
  1 sibling, 0 replies; 97+ messages in thread
From: Robin H. Johnson @ 2009-04-15  5:53 UTC (permalink / raw)
  To: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1381 bytes --]

On Wed, Apr 15, 2009 at 01:09:43PM +1000, Nguyen Thai Ngoc Duy wrote:
> On Wed, Apr 15, 2009 at 6:27 AM, Robin H. Johnson <robbat2@gentoo.org> wrote:
> > So, from the Gentoo side right now, we're looking at this:
> > 1. Setup git-bundle for initial downloads.
> > 2. Disallow initial clones over git:// (allow updates ONLY)
> How can you do that? If I understand git protocol correctly, there is
> no difference between a fetch request and a clone one.
I'm planning on adding a new hook, in upload-pack.
Inputs: want_obj, have_obj

Not sure of the best way to pass them yet, probably stdin, 'want ....',
'have ....'.

Probably best to run right before git-rev-list.

For the Gentoo-specific content of the hook, I'm after this design:
- you don't send ANY have => you get the error
- you have is too old => you get the error
- you ask for something non-existent => you get the error

The error will be a message instructing you to use the bundle, and
pointing to a URL with detailed instructions.

The 'too old' case is to enable better DoS prevention, stopping somebody
malicious from finding the first commit in the bundle, pretending they
have it, and asking for a pack from that to HEAD.
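
For concreteness, such a hook might look roughly like this (the hook
interface, the stdin format, and the bundle URL are all assumptions;
nothing like this exists in upload-pack today):

	#!/bin/sh
	# hypothetical upload-pack hook: reject requests with no 'have's,
	# reading 'want <sha1>' / 'have <sha1>' lines on stdin
	haves=0
	while read verb rest; do
		test "$verb" = have && haves=$((haves+1))
	done
	if test $haves -eq 0; then
		echo >&2 "Initial clones are disabled; please start from the bundle:"
		echo >&2 "  http://example.org/gentoo-x86.bundle"
		exit 1
	fi
	exit 0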

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85

[-- Attachment #2: Type: application/pgp-signature, Size: 330 bytes --]

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
  2009-04-15  5:53                                             ` Robin H. Johnson
@ 2009-04-15  5:54                                             ` Junio C Hamano
  2009-04-15 11:51                                               ` Nicolas Pitre
  1 sibling, 1 reply; 97+ messages in thread
From: Junio C Hamano @ 2009-04-15  5:54 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Robin H. Johnson, Nicolas Pitre, Johannes Schindelin, Git Mailing List

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

>> 2. Disallow initial clones over git:// (allow updates ONLY)
>
> How can you do that? If I understand git protocol correctly, there is
> no difference between a fetch request and a clone one.

At the protocol level, you can tell a clone request by noticing that the
downloading side does not have any "have" lines, but it is a different
matter what the software does out of the box.

You can patch upload-pack to reject such requests.  I am sure gentoo folks
are capable of doing that ;-)

Also a rogue client can send a bogus "have" to fool that logic, and that
is the primary reason why we do not have such a patch to upload-pack.  It
is not worth it as a protection against determined people who want to DoS.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-15  5:54                                             ` Junio C Hamano
@ 2009-04-15 11:51                                               ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-15 11:51 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nguyen Thai Ngoc Duy, Robin H. Johnson, Johannes Schindelin,
	Git Mailing List

On Tue, 14 Apr 2009, Junio C Hamano wrote:

> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> 
> > How can you do that? If I understand git protocol correctly, there is
> > no difference between a fetch request and a clone one.
> 
> At the protocol level, you can tell a clone request by noticing that the
> downloading side does not have any "have" lines, but it is a different
> matter what the software does out of the box.
> 
> You can patch upload-pack to reject such requests.  I am sure gentoo folks
> are capable of doing that ;-)
> 
> Also a rogue client can send a bogus "have" to fool that logic, and that
> is the primary reason why we do not have such a patch to upload-pack.  It
> is not worth it as a protection against determined people who want to DoS.

Implementing a minimum threshold with merge-base to ensure that the 
client has at least commit X should be easy to do.  Unfortunately we 
don't have any hook for such a purpose yet.
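
For illustration, the check itself is easy to script around plain
merge-base ($oldest and $have being placeholders for the threshold
commit and the client's claimed "have"):

	# $oldest is an ancestor of $have iff their merge base is $oldest
	if test "$(git merge-base $oldest $have)" = "$(git rev-parse $oldest)"
	then
		echo "have is recent enough"
	fi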


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-14 20:27                                         ` Robin H. Johnson
  2009-04-14 21:02                                           ` Nicolas Pitre
  2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
@ 2009-04-22  1:15                                           ` Sam Vilain
  2009-04-22  9:55                                             ` Mike Ralphson
  2009-04-22 14:14                                             ` Nicolas Pitre
  2 siblings, 2 replies; 97+ messages in thread
From: Sam Vilain @ 2009-04-22  1:15 UTC (permalink / raw)
  To: Robin H. Johnson; +Cc: Nicolas Pitre, Johannes Schindelin, Git Mailing List

On Tue, 2009-04-14 at 13:27 -0700, Robin H. Johnson wrote:
> On Tue, Apr 14, 2009 at 04:17:55PM -0400, Nicolas Pitre wrote:
> > WRT the HTTP protocol, I was questioning git's ability to resume the 
> > transfer of a pack that was interrupted in the middle, without 
> > redownloading it all.  And Mike Hommey says this is actually the case.
> With rsync:// it was helpful to split the pack, and resume there worked
> reasonably (see my other mail about the segfault that turns up
> sometimes).
> 
> More recent discussions raised the possibility of using git-bundle to
> provide a more ideal initial download that they CAN resume easily, as
> well as being able to move on from it.

Hey Robin,

Now that the GSoC projects have been announced I can give you the good
news that one of our two projects is to optimise this stage in
git-daemon; I'm hoping we can get it down to being almost as cheap as
the workaround you described in your post.  I'll certainly be using your
repository as a test case :-)

So stay tuned!
Sam.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22  1:15                                           ` Sam Vilain
@ 2009-04-22  9:55                                             ` Mike Ralphson
  2009-04-22 11:24                                               ` Pieter de Bie
                                                                 ` (2 more replies)
  2009-04-22 14:14                                             ` Nicolas Pitre
  1 sibling, 3 replies; 97+ messages in thread
From: Mike Ralphson @ 2009-04-22  9:55 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Robin H. Johnson, Nicolas Pitre, Johannes Schindelin,
	Git Mailing List, Matt Enright

2009/4/22 Sam Vilain <sam@vilain.net>
> Now that the GSoC projects have been announced I can give you the good
> news that one of our two projects...

It's sort of three, really...

http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105

Mike

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22  9:55                                             ` Mike Ralphson
@ 2009-04-22 11:24                                               ` Pieter de Bie
  2009-04-22 13:19                                               ` Johannes Schindelin
  2009-04-23 19:30                                               ` Christian Couder
  2 siblings, 0 replies; 97+ messages in thread
From: Pieter de Bie @ 2009-04-22 11:24 UTC (permalink / raw)
  To: Mike Ralphson
  Cc: Sam Vilain, Robin H. Johnson, Nicolas Pitre, Johannes Schindelin,
	Git Mailing List, Matt Enright


On 22 apr 2009, at 10:55, Mike Ralphson wrote:

> It's sort of three, really...
>
> http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105

That same project was also done by two(!) students last
year, but I don't think that worked out. I wonder how it'll
play out this year.

- Pieter

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22  9:55                                             ` Mike Ralphson
  2009-04-22 11:24                                               ` Pieter de Bie
@ 2009-04-22 13:19                                               ` Johannes Schindelin
  2009-04-22 14:35                                                 ` Shawn O. Pearce
  2009-04-23 19:30                                               ` Christian Couder
  2 siblings, 1 reply; 97+ messages in thread
From: Johannes Schindelin @ 2009-04-22 13:19 UTC (permalink / raw)
  To: Mike Ralphson
  Cc: Sam Vilain, Robin H. Johnson, Nicolas Pitre, Git Mailing List,
	Matt Enright

Hi,

On Wed, 22 Apr 2009, Mike Ralphson wrote:

> 2009/4/22 Sam Vilain <sam@vilain.net>
> > Now that the GSoC projects have been announced I can give you the good
> > news that one of our two projects...
> 
> It's sort of three, really...
> 
> http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105

OMG!  That's the third time they are wasting Google's money: AFAICT they 
haven't learnt from the past two years' failures.  At least I am not aware 
of any of the Mono guys trying to collaborate with us.

Oh well, maybe I should drop them a mail that they may get valuable input 
here _iff_ they just ask.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22  1:15                                           ` Sam Vilain
  2009-04-22  9:55                                             ` Mike Ralphson
@ 2009-04-22 14:14                                             ` Nicolas Pitre
  2009-04-22 22:01                                               ` Sam Vilain
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-22 14:14 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Robin H. Johnson, Johannes Schindelin, Git Mailing List

On Wed, 22 Apr 2009, Sam Vilain wrote:

> On Tue, 2009-04-14 at 13:27 -0700, Robin H. Johnson wrote:
> > On Tue, Apr 14, 2009 at 04:17:55PM -0400, Nicolas Pitre wrote:
> > > WRT the HTTP protocol, I was questioning git's ability to resume the 
> > > transfer of a pack that was interrupted in the middle, without 
> > > redownloading it all.  And Mike Hommey says this is actually the case.
> > With rsync:// it was helpful to split the pack, and resume there worked
> > reasonably (see my other mail about the segfault that turns up
> > sometimes).
> > 
> > More recent discussions raised the possibility of using git-bundle to
> > provide a more ideal initial download that they CAN resume easily, as
> > well as being able to move on from it.
> 
> Hey Robin,
> 
> Now that the GSoC projects have been announced I can give you the good
> news that one of our two projects is to optimise this stage in
> git-daemon; I'm hoping we can get it down to being almost as cheap as
> the workaround you described in your post.  I'll certainly be using your
> repository as a test case :-)

Please keep me in the loop as much as possible.  I'd prefer we're not in 
disagreement over the implementation only after final patches are posted 
to the list.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 13:19                                               ` Johannes Schindelin
@ 2009-04-22 14:35                                                 ` Shawn O. Pearce
  2009-04-22 16:40                                                   ` Andreas Ericsson
  0 siblings, 1 reply; 97+ messages in thread
From: Shawn O. Pearce @ 2009-04-22 14:35 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Mike Ralphson, Sam Vilain, Robin H. Johnson, Nicolas Pitre,
	Git Mailing List, Matt Enright

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 22 Apr 2009, Mike Ralphson wrote:
> 
> > 2009/4/22 Sam Vilain <sam@vilain.net>
> > > Now that the GSoC projects have been announced I can give you the good
> > > news that one of our two projects...
> > 
> > It's sort of three, really...
> > 
> > http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105
> 
> OMG!  That's the third time they are wasting Google's money: AFAICT they 
> haven't learnt from the past two years' failures.  At least I am not aware 
> of any of them Mono guys trying to collaborate with us.

Yikes!

Wearing my Google hat, I have to cry a little.  I think it's such
a waste.  But we don't tell the orgs what projects they should or
should not do, its at each org's individual discretion.  Clearly the
Mono folks feel they should "spend" a *fourth* slot on this project.
That, or Mono was granted one too many slots in the program.  *sigh*

Wearing my JGit maintainer hat, I have to cry a little.  The mentor
for this project should realize... we've spent over 3 years now
on JGit (it turned 3 on Mar 6 2009) and it *still* doesn't provide
a full replacement for git.git.

I'd like to think that I'm not a moron, and that it really does
take 3 years of R&D work to find a suitable implementation of Git
in a sandboxed language like Java.  Or, maybe I am a moron.  Linus,
Junio and crew had git.git implemented in less time.
 
> Oh well, maybe I should drop them a mail that they may get valuable input 
> here _iff_ they just ask.

I've tried that in the past two years.  I've given up.  The JGit
code is available.  Its license is quite liberal.  They can look
at it if they want.  My guess is, they won't.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 14:35                                                 ` Shawn O. Pearce
@ 2009-04-22 16:40                                                   ` Andreas Ericsson
  2009-04-22 17:06                                                     ` Johannes Schindelin
  0 siblings, 1 reply; 97+ messages in thread
From: Andreas Ericsson @ 2009-04-22 16:40 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Johannes Schindelin, Mike Ralphson, Sam Vilain, Robin H. Johnson,
	Nicolas Pitre, Git Mailing List, Matt Enright

Shawn O. Pearce wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> On Wed, 22 Apr 2009, Mike Ralphson wrote:
>>
>>> 2009/4/22 Sam Vilain <sam@vilain.net>
>>>> Now that the GSoC projects have been announced I can give you the good
>>>> news that one of our two projects...
>>> It's sort of three, really...
>>>
>>> http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105
>> OMG!  That's the third time they are wasting Google's money: AFAICT they 
>> haven't learnt from the past two years' failures.  At least I am not aware 
>> of any of them Mono guys trying to collaborate with us.
> 

I offered to assist with reviewing patches or explaining technicalia to them
last year (while I was learning a bit of C# myself), but got no patches or
requests from them at all.

>> Oh well, maybe I should drop them a mail that they may get valuable input 
>> here _iff_ they just ask.
> 
> I've tried that in the past two years.  I've given up.  The JGit
> code is available.  Its license is quite liberal.  They can look
> at it if they want.  My guess is, they won't.
> 

I'm with Shawn here. They refuse to look at unmanaged code (that is, non-C#
code), and since there is none yet, they're in a sort of catch-22 when it
comes to reference implementations. Ah well. I'll join the mono-develop mailing
list again and see what I can do to help.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 16:40                                                   ` Andreas Ericsson
@ 2009-04-22 17:06                                                     ` Johannes Schindelin
  0 siblings, 0 replies; 97+ messages in thread
From: Johannes Schindelin @ 2009-04-22 17:06 UTC (permalink / raw)
  To: Andreas Ericsson
  Cc: Shawn O. Pearce, Mike Ralphson, Sam Vilain, Robin H. Johnson,
	Nicolas Pitre, Git Mailing List, Matt Enright

Hi,

On Wed, 22 Apr 2009, Andreas Ericsson wrote:

> Ah well. I'll join mono-develop mailing list again and see what I can do 
> to help.

Thanks.  I think these guys are in serious need of help, not only in terms 
of Git, but also in managing GSoC.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 14:14                                             ` Nicolas Pitre
@ 2009-04-22 22:01                                               ` Sam Vilain
  2009-04-22 22:50                                                 ` Björn Steinbrink
  2009-04-22 23:07                                                 ` Nicolas Pitre
  0 siblings, 2 replies; 97+ messages in thread
From: Sam Vilain @ 2009-04-22 22:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Robin H. Johnson, Johannes Schindelin, Git Mailing List, Nick Edelen

Nicolas Pitre wrote:
>> Now that the GSoC projects have been announced I can give you the good
>> news that one of our two projects is to optimise this stage in
>> git-daemon; I'm hoping we can get it down to being almost as cheap as
>> the workaround you described in your post. I'll certainly be using your
>> repository as a test case :-)
>
> Please keep me in the loop as much as possible. I'd prefer we're not in
> disagreement over the implementation only after final patches are posted
> to the list.

Thanks Nico, given your close working knowledge of the pack-objects
code this will be very much appreciated. Perhaps you can first help
out by telling me what you have to say about moving object enumeration
from upload-pack to pack-objects?

Cheers!
Sam.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 22:01                                               ` Sam Vilain
@ 2009-04-22 22:50                                                 ` Björn Steinbrink
  2009-04-22 23:07                                                 ` Nicolas Pitre
  1 sibling, 0 replies; 97+ messages in thread
From: Björn Steinbrink @ 2009-04-22 22:50 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Nicolas Pitre, Robin H. Johnson, Johannes Schindelin,
	Git Mailing List, Nick Edelen

On 2009.04.23 10:01:46 +1200, Sam Vilain wrote:
> Nicolas Pitre wrote:
> >> Now that the GSoC projects have been announced I can give you the good
> >> news that one of our two projects is to optimise this stage in
> >> git-daemon; I'm hoping we can get it down to being almost as cheap as
> >> the workaround you described in your post. I'll certainly be using your
> >> repository as a test case :-)
> >
> > Please keep me in the loop as much as possible. I'd prefer we're not in
> > disagreement over the implementation only after final patches are posted
> > to the list.
> 
> Thanks Nico, given your close working knowledge of the pack-objects
> code this will be very much appreciated. Perhaps you can first help
> out by telling me what you have to say about moving object enumeration
> from upload-pack to pack-objects?

Here's a bit about that:
http://article.gmane.org/gmane.comp.version-control.git/116032

Note that my RSS measurement should be invalid by now. Linus's
patches(*) should have improved the memory usage for that scenario by
quite a bit, since we used to keep a lot of the stuff that the revision
enumeration required in memory, even after that processed finished,
which should no longer be the case IIRC. And the peak memory usage for
that process was also improved on its own, as the whole buffering is
gone.

Björn

(*) These commits:
8d2dfc49b1     process_{tree,blob}: show objects without buffering
cf2ab916af     show_object(): push path_name() call further down
213152688c     process_{tree,blob}: Remove useless xstrdup calls

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 22:01                                               ` Sam Vilain
  2009-04-22 22:50                                                 ` Björn Steinbrink
@ 2009-04-22 23:07                                                 ` Nicolas Pitre
  2009-04-22 23:30                                                   ` Johannes Schindelin
  1 sibling, 1 reply; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-22 23:07 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Robin H. Johnson, Johannes Schindelin, Git Mailing List, Nick Edelen

On Thu, 23 Apr 2009, Sam Vilain wrote:

> Nicolas Pitre wrote:
> >> Now that the GSoC projects have been announced I can give you the good
> >> news that one of our two projects is to optimise this stage in
> >> git-daemon; I'm hoping we can get it down to being almost as cheap as
> >> the workaround you described in your post. I'll certainly be using your
> >> repository as a test case :-)
> >
> > Please keep me in the loop as much as possible. I'd prefer we're not in
> > disagreement over the implementation only after final patches are posted
> > to the list.
> 
> Thanks Nico, given your close working knowledge of the pack-objects
> code this will be very much appreciated. Perhaps you can first help
> out by telling me what you have to say about moving object enumeration
> from upload-pack to pack-objects?

It is like a 25-line patch or so.  I did it once, although the shallow 
clone support was missing from it.  And somehow I managed to lose the 
patch while doing some reshuffling of unrelated bigger changes.

Basically, you can pass the revision arguments to pack-objects directly 
instead of passing them to rev-list and piping rev-list's output to 
pack-objects.
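
For illustration, pack-objects already understands --revs (and --all,
which implies it), so the two pipelines would look roughly like:

	# enumeration in a separate rev-list process, piped in:
	git rev-list --objects --all | git pack-objects --stdout > all.pack

	# enumeration done inside pack-objects itself:
	git pack-objects --all --stdout </dev/null > all.pack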

Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 23:07                                                 ` Nicolas Pitre
@ 2009-04-22 23:30                                                   ` Johannes Schindelin
  2009-04-23  3:16                                                     ` Nicolas Pitre
  0 siblings, 1 reply; 97+ messages in thread
From: Johannes Schindelin @ 2009-04-22 23:30 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Sam Vilain, Robin H. Johnson, Git Mailing List, Nick Edelen

Hi,

On Wed, 22 Apr 2009, Nicolas Pitre wrote:

> On Thu, 23 Apr 2009, Sam Vilain wrote:
> 
> > Nicolas Pitre wrote:
> > >> Now that the GSoC projects have been announced I can give you the good
> > >> news that one of our two projects is to optimise this stage in
> > >> git-daemon; I'm hoping we can get it down to being almost as cheap as
> > >> the workaround you described in your post. I'll certainly be using your
> > >> repository as a test case :-)
> > >
> > > Please keep me in the loop as much as possible. I'd prefer we're not in
> > > disagreement over the implementation only after final patches are posted
> > > to the list.
> > 
> > Thanks Nico, given your close working knowledge of the pack-objects
> > code this will be very much appreciated. Perhaps you can first help
> > out by telling me what you have to say about moving object enumeration
> > from upload-pack to pack-objects?
> 
> It is a 25-line patch or so.  I did it once, although shallow clone 
> support was missing from it.  And somehow I managed to lose the patch 
> while reshuffling some unrelated bigger changes.
> 
> Basically, you can pass the revision arguments to pack-objects directly 
> instead of passing them to rev-list and piping rev-list's output to 
> pack-objects.

I seem to remember that somebody sent a patch within the last two weeks 
implementing that, and if my memory does not fail me, in response to one 
of your mails mentioning this wish.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22 23:30                                                   ` Johannes Schindelin
@ 2009-04-23  3:16                                                     ` Nicolas Pitre
  0 siblings, 0 replies; 97+ messages in thread
From: Nicolas Pitre @ 2009-04-23  3:16 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Robin H. Johnson, Git Mailing List, Nick Edelen

On Thu, 23 Apr 2009, Johannes Schindelin wrote:

> Hi,
> 
> On Wed, 22 Apr 2009, Nicolas Pitre wrote:
> 
> > On Thu, 23 Apr 2009, Sam Vilain wrote:
> > 
> > > Perhaps you can first help out by telling me what you have to say 
> > > about moving object enumeration from upload-pack to pack-objects?
> > 
> > It is a 25-line patch or so.  I did it once, although shallow clone 
> > support was missing from it.  And somehow I managed to lose the patch 
> > while reshuffling some unrelated bigger changes.
> > 
> > Basically, you can pass the revision arguments to pack-objects directly 
> > instead of passing them to rev-list and piping rev-list's output to 
> > pack-objects.
> 
> I seem to remember that somebody sent a patch within the last two weeks 
> implementing that, and if my memory does not fail me, in response to one 
> of your mails mentioning this wish.

Well, if so, I wasn't CC'd on the post, and my periodic scan of the git 
list missed it as well.


Nicolas

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Performance issue: initial git clone causes massive repack
  2009-04-22  9:55                                             ` Mike Ralphson
  2009-04-22 11:24                                               ` Pieter de Bie
  2009-04-22 13:19                                               ` Johannes Schindelin
@ 2009-04-23 19:30                                               ` Christian Couder
  2 siblings, 0 replies; 97+ messages in thread
From: Christian Couder @ 2009-04-23 19:30 UTC (permalink / raw)
  To: Mike Ralphson
  Cc: Sam Vilain, Robin H. Johnson, Nicolas Pitre, Johannes Schindelin,
	Git Mailing List, Matt Enright

On Wednesday, 22 April 2009, Mike Ralphson wrote:
> 2009/4/22 Sam Vilain <sam@vilain.net>
>
> > Now that the GSoC projects have been announced I can give you the good
> > news that one of our two projects...
>
> It's sort of three, really...
>
> http://socghop.appspot.com/student_project/show/google/gsoc2009/mono/t124022708105

There is also this one:

http://socghop.appspot.com/student_project/show/google/gsoc2009/hg/t124022472367

Regards,
Christian.

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2009-04-23 19:34 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-04 22:07 Performance issue: initial git clone causes massive repack Robin H. Johnson
2009-04-05  0:05 ` Nicolas Sebrecht
2009-04-05  0:37   ` Robin H. Johnson
2009-04-05  3:54     ` Nicolas Sebrecht
2009-04-05  4:08       ` Nicolas Sebrecht
2009-04-05  7:04       ` Robin H. Johnson
2009-04-05 19:02         ` Nicolas Sebrecht
2009-04-05 19:17           ` Shawn O. Pearce
2009-04-05 23:02             ` Robin H. Johnson
2009-04-05 20:43           ` Robin H. Johnson
2009-04-05 21:08             ` Shawn O. Pearce
2009-04-05 21:28           ` david
2009-04-05 21:36             ` Sverre Rabbelier
2009-04-06  3:24               ` Nicolas Pitre
2009-04-07  8:10                 ` Björn Steinbrink
2009-04-07  9:45                   ` Jakub Narebski
2009-04-07 13:13                     ` Nicolas Pitre
2009-04-07 13:37                       ` Jakub Narebski
2009-04-07 14:03                         ` Jon Smirl
2009-04-07 17:59                         ` Nicolas Pitre
2009-04-07 14:21                       ` Björn Steinbrink
2009-04-07 17:48                         ` Nicolas Pitre
2009-04-07 18:12                           ` Björn Steinbrink
2009-04-07 18:56                             ` Nicolas Pitre
2009-04-07 20:27                               ` Björn Steinbrink
2009-04-08  4:52                                 ` Nicolas Pitre
2009-04-10 20:38                                   ` Robin H. Johnson
2009-04-11  1:58                                     ` Nicolas Pitre
2009-04-11  7:06                                       ` Mike Hommey
2009-04-14 15:52                                     ` Johannes Schindelin
2009-04-14 20:17                                       ` Nicolas Pitre
2009-04-14 20:27                                         ` Robin H. Johnson
2009-04-14 21:02                                           ` Nicolas Pitre
2009-04-15  3:09                                           ` Nguyen Thai Ngoc Duy
2009-04-15  5:53                                             ` Robin H. Johnson
2009-04-15  5:54                                             ` Junio C Hamano
2009-04-15 11:51                                               ` Nicolas Pitre
2009-04-22  1:15                                           ` Sam Vilain
2009-04-22  9:55                                             ` Mike Ralphson
2009-04-22 11:24                                               ` Pieter de Bie
2009-04-22 13:19                                               ` Johannes Schindelin
2009-04-22 14:35                                                 ` Shawn O. Pearce
2009-04-22 16:40                                                   ` Andreas Ericsson
2009-04-22 17:06                                                     ` Johannes Schindelin
2009-04-23 19:30                                               ` Christian Couder
2009-04-22 14:14                                             ` Nicolas Pitre
2009-04-22 22:01                                               ` Sam Vilain
2009-04-22 22:50                                                 ` Björn Steinbrink
2009-04-22 23:07                                                 ` Nicolas Pitre
2009-04-22 23:30                                                   ` Johannes Schindelin
2009-04-23  3:16                                                     ` Nicolas Pitre
2009-04-14 20:30                                         ` Johannes Schindelin
2009-04-07 20:29                             ` Jeff King
2009-04-07 20:35                               ` Björn Steinbrink
2009-04-08 11:28                       ` [PATCH] process_{tree,blob}: Remove useless xstrdup calls Björn Steinbrink
2009-04-10 22:20                         ` Linus Torvalds
2009-04-11  0:27                           ` Linus Torvalds
2009-04-11  1:15                             ` Linus Torvalds
2009-04-11  1:34                               ` Nicolas Pitre
2009-04-11 13:41                               ` Björn Steinbrink
2009-04-11 14:07                                 ` Björn Steinbrink
2009-04-11 18:06                                   ` Linus Torvalds
2009-04-11 18:22                                     ` Linus Torvalds
2009-04-11 19:22                                       ` Björn Steinbrink
2009-04-11 20:50                                     ` Björn Steinbrink
2009-04-11 21:43                                       ` Linus Torvalds
2009-04-11 23:24                                         ` Björn Steinbrink
2009-04-11 18:19                                   ` Linus Torvalds
2009-04-11 19:40                                     ` Björn Steinbrink
2009-04-11 19:58                                       ` Linus Torvalds
2009-04-05 22:59             ` Performance issue: initial git clone causes massive repack Nicolas Sebrecht
2009-04-05 23:20               ` david
2009-04-05 23:28                 ` Robin Rosenberg
2009-04-06  3:34                 ` Nicolas Pitre
2009-04-06  5:15                   ` Junio C Hamano
2009-04-06 13:12                     ` Nicolas Pitre
2009-04-06 13:52                     ` Jon Smirl
2009-04-06 14:19                       ` Nicolas Pitre
2009-04-06 14:37                         ` Jon Smirl
2009-04-06 14:48                           ` Shawn O. Pearce
2009-04-06 15:14                           ` Nicolas Pitre
2009-04-06 15:28                             ` Jon Smirl
2009-04-06 16:14                               ` Nicolas Pitre
2009-04-06 11:22                   ` Matthieu Moy
2009-04-06 13:29                     ` Nicolas Pitre
2009-04-06 14:03                       ` Robin H. Johnson
2009-04-06 14:14                         ` Nicolas Pitre
2009-04-07 10:11               ` Martin Langhoff
2009-04-05 19:57 ` Jeff King
2009-04-05 23:38   ` Robin H. Johnson
2009-04-05 23:42     ` Robin H. Johnson
     [not found]     ` <0015174c150e49b5740466d7d2c2@google.com>
2009-04-06  0:29       ` Robin H. Johnson
2009-04-06  3:10     ` Nguyen Thai Ngoc Duy
2009-04-06  4:09       ` Nicolas Pitre
2009-04-06  4:06     ` Nicolas Pitre
2009-04-06 14:20       ` Robin H. Johnson
2009-04-11 17:24 ` Mark Levedahl

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.