git.vger.kernel.org archive mirror
* Eric Sink's blog - notes on git, dscms and a "whole product" approach
@ 2009-04-27  8:55 Martin Langhoff
  2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski
  0 siblings, 2 replies; 39+ messages in thread
From: Martin Langhoff @ 2009-04-27  8:55 UTC (permalink / raw)
  To: Git Mailing List

Eric Sink has been working on the (commercial, proprietary) centralised
SCM Vault for a while. He's written recently about his explorations
around the new crop of DSCMs, and I think it's quite interesting. A
quick search of the list archives makes me think it wasn't discussed
before.

The guy is knowledgeable, and writes quite witty posts -- naturally,
there's plenty to disagree on, but I'd like to encourage readers not
to nitpick or focus on where Eric is wrong. It is interesting to read
where he thinks git and other DSCMs are missing the mark.

   Maybe he's right, maybe he's wrong, but damn he's interesting :-)

So here's the blog -  http://www.ericsink.com/

These are the best entry points
  http://www.ericsink.com/entries/quirky.html
  http://www.ericsink.com/entries/hg_denzel.html

To be frank, I think he's wrong in some details (as he's admittedly
only spent limited time with it) but right on the larger picture
(large userbases want it integrated and foolproof, bugtracking needs
to go distributed alongside the code, git is as powerful^Wdangerous as
C).

cheers,



martin
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-27  8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff
@ 2009-04-28 11:24 ` Jakub Narebski
  2009-04-28 21:00   ` Robin Rosenberg
  2009-04-29  6:55   ` Martin Langhoff
  2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski
  1 sibling, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-04-28 11:24 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com> writes:

> Eric Sink has been working on the (commercial, proprietary) centralised
> SCM Vault for a while. He's written recently about his explorations
> around the new crop of DSCMs, and I think it's quite interesting. A
> quick search of the list archives makes me think it wasn't discussed
> before.
> 
> The guy is knowledgeable, and writes quite witty posts -- naturally,
> there's plenty to disagree on, but I'd like to encourage readers not
> to nitpick or focus on where Eric is wrong. It is interesting to read
> where he thinks git and other DSCMs are missing the mark.
> 
>    Maybe he's right, maybe he's wrong, but damn he's interesting :-)
> 
> So here's the blog -  http://www.ericsink.com/

"Here's a blog"... and therefore my dilemma. Should I post my reply
as a comment to this blog, or should I reply here on git mailing list?
 
> These are the best entry points

Because those two entries are quite different, I'll reply to each
separately.

1.  "Ten Quirky Issues with Cross-Platform Version Control"
>   http://www.ericsink.com/entries/quirky.html

which is a general comment about (mainly) using version control in a
heterogeneous environment, where different machines have different
filesystem limitations.  I'll concentrate on that issue here.

2.  "Mercurial, Subversion, and Wesley Snipes"
>   http://www.ericsink.com/entries/hg_denzel.html

where, paraphrasing, Eric Sink says that he doesn't write about
Mercurial and Subversion because they are perfect.  Or at least not
as controversial (and controversial means interesting).

> 
> To be frank, I think he's wrong in some details (as he's admittedly
> only spent limited time with it) but right on the larger picture
> (large userbases want it integrated and foolproof, bugtracking needs
> to go distributed alongside the code, git is as powerful^Wdangerous as
> C).

Neither of the blog posts mentioned above touches those issues, BTW...

----------------------------------------------------------------------
Ad 1. "Ten Quirky Issues with Cross-Platform Version Control"

Actually these come down to two issues: the different limitations of
different filesystems, and the different handling of line endings in
text files on different platforms.


Line endings (issue 8.) are, in theory and in practice (at least for
Git), a non-issue.

In theory you should use the project's convention for the end-of-line
character in text files, and use a smart editor that can deal (or can
be configured to deal) with this issue correctly.

In practice this is a matter of correctly setting up core.autocrlf
(and, in more complicated cases, which for git are very very rare,
configuring which files are text and which are not).
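
For example, a minimal sketch (pick one of the two autocrlf values
depending on the platform; the -crlf attribute opts a file out of
conversion):

  $ git config --global core.autocrlf input  # Unix: convert CRLF only on commit
  $ git config --global core.autocrlf true   # Windows: CRLF in worktree, LF in repo
  $ echo '*.png -crlf' >> .gitattributes     # never convert these files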


There are a few classes of trouble with filesystems (more precisely,
with filenames).

1. Different limitations on file names (e.g. pathname length),
   different special characters, different special filenames (if any).
   Those are issues 2. (special basename PRN on MS Windows), 
   issue 3. (trailing dot, trailing whitespace), issue 4. (pathname
   and filename length limit), issue 6. (special characters, in this
   case colon being path element delimiter on MacOS, but it is also
   about special characters like colon, asterisk and question mark
   on MS Windows) and also issue 7. (name that begins with a dash)
   in Eric Sink's article.

   The answer is a convention for filenames in a project: simply DON'T
   use filenames which can cause problems.  There is no way to simply
   solve this problem in the version control system, although I think
   if you really, really, really need it you should be able to cobble
   something together using low-level git tools to give a file a
   different name in the working directory from the one used in the
   repository (and index); a rough sketch follows after this item.

   See also David A. Wheeler essay "Fixing Unix/Linux/POSIX Filenames:
   Control Characters (such as Newline), Leading Dashes, and Other Problems" 
   http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

   DON'T DO THAT.
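
   A rough sketch of that low-level hack (using 'prn', one of the
   reserved basenames on MS Windows, as the problematic repository
   name; the local filename is made up):

     $ sha=$(git hash-object -w some-local-file)
     $ git update-index --add --cacheinfo 100644 $sha prn

   This records the path 'prn' in the index (and hence in the next
   commit) without ever creating such a file in the working directory.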


2. "Case-insensitive" but "case-preserving" filesystems; the case
   where some different filenames are equivalent (like 'README' and
   'readme' on case-insensitive filesystem), but are returned as you
   created them (so if you created 'README', you would get 'README' in
   directory listing, but filesystem would return that 'readme' exists
   too).  This is issue 1. ('README' and 'readme' in the same
   directory) in Eric Sink article.

   The answer is the same as for the previous issue: don't.  Simply DO
   NOT create files with filenames which differ only in case (like the
   unfortunate xt_connmark.h and xt_CONNMARK.h, or similar, in the
   Linux kernel).

   But I think that even in the case where such an unfortunate
   incident (two filenames differing only in case) occurs, you can
   deal with it in Git by using lower level tools (and editing only
   one of the two files at a time).  You would get spurious info about
   modified files in git-status, though...  perhaps that could be
   improved using the infrastructure created (IIRC) by Linus for
   dealing with 'insane' filesystems.

   DON'T DO THAT, SOLVABLE.


3. Non "Case-preserving" filesystems, where filename as sequence of
   bytes differ between what you created, and what you get from
   filesystem.  An example here is MacOS X filesystem, which accepts
   filenames in NFC composed normalized form of Unicode, but stores
   them internally and returns them in NFD decomposed form.  This is
   issue 9. (Español being "Espa\u00f1ol" in NFC, but "Espan\u0303ol"
   in NFD).

   In this case 'don't do that' might not be an acceptable answer.
   Perhaps you need non-ASCII characters in filenames, and you cannot
   always choose the filesystem or specify a mount option that makes
   the problem go away.

   I remember that this issue was discussed extensively on the git
   mailing list, but I don't remember what the conclusion was (besides
   agreeing that a filesystem that is not "*-preserving" is not a sane
   filesystem ;)).  In particular I do not remember whether Git can
   deal with this issue sanely (I remember Linus adding infrastructure
   for that, but did it solve this problem...).

   PROBABLY SOLVED.


4. Filesystems which cannot store all SCM-sane metainfo, for example
   filesystems without support for symbolic links, or without support
   for the executable permission (executable bit).  This is an
   extension of issue 10. (which is limited to symbolic links) in Eric
   Sink's article.

   In Git you have core.fileMode to ignore executable bit differences
   (you would need to use SCM tools rather than filesystem tools to
   manipulate it), and core.symlinks to be able to check out symlinks
   as plain text files (again using SCM tools to manipulate them); see
   the sketch after this item.

   SOLVED.
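
   For example (both configuration variables are real; the filename is
   just an illustration):

     $ git config core.fileMode false        # ignore executable bit changes
     $ git config core.symlinks false        # check out symlinks as plain files
     $ git update-index --chmod=+x build.sh  # flip the bit via the SCM instead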


There is also the mistaken implicit assumption that version control
systems do (and should) preserve all metadata.

5. The issue of extra metadata that is not SCM-sane, and which
   different filesystems may or may not be able to store.  Examples
   include full Unix permissions, Unix ownership (and the group a file
   belongs to), other permission-related metadata such as ACLs, and
   extra resources tied to a file such as EAs (extended attributes) on
   some Linux filesystems or the (in)famous resource fork on MacOS.
   This is issue 5. (resource fork on MacOS vs. xattrs on Linux) in
   Eric Sink's article.

   This is not an issue for an SCM: _source_ code management system
   to solve.  Preserving extra metadata indiscriminately can cause
   problems, as e.g. full permissions and ownership do.  Therefore
   SCMs preserve only a limited, SCM-sane subset of metadata.  If you
   need to preserve extra metadata, you can use (in good SCMs) hooks
   for that, as e.g. etckeeper uses metastore (in Git); a sketch
   follows after this item.

   NOT A PROBLEM.
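
   A minimal sketch of the hook approach (the metastore options and
   its .metadata file are quoted from memory, so treat them as
   assumptions):

     #!/bin/sh
     # .git/hooks/pre-commit: save ownership/permissions/xattrs
     metastore --save && git add .metadata

     #!/bin/sh
     # .git/hooks/post-checkout: re-apply the saved metadata
     metastore --apply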

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach
  2009-04-27  8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff
  2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
@ 2009-04-28 18:16 ` Jakub Narebski
  2009-04-29  7:54   ` Sitaram Chamarty
  2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  1 sibling, 2 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-04-28 18:16 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Martin Langhoff <martin.langhoff@gmail.com> writes:

> Eric Sink has been working on the (commercial, proprietary) centralised
> SCM Vault for a while. He's written recently about his explorations
> around the new crop of DSCMs, and I think it's quite interesting. A
> quick search of the list archives makes me think it wasn't discussed
> before.
> 
> The guy is knowledgeable, and writes quite witty posts -- naturally,
> there's plenty to disagree on, but I'd like to encourage readers not
> to nitpick or focus on where Eric is wrong. It is interesting to read
> where he thinks git and other DSCMs are missing the mark.
> 
>    Maybe he's right, maybe he's wrong, but damn he's interesting :-)
> 
> So here's the blog -  http://www.ericsink.com/

"Here's a blog"... and therefore my dilemma. Should I post my reply
as a comment to this blog, or should I reply here on git mailing list?

I think I will just add a link to this thread in the GMane archive of
the git mailing list...
 
> These are the best entry points
*  "Ten Quirky Issues with Cross-Platform Version Control"
>   http://www.ericsink.com/entries/quirky.html

which I have answered in a separate post in this thread.

*  "Mercurial, Subversion, and Wesley Snipes"
>   http://www.ericsink.com/entries/hg_denzel.html

which I will comment on now.  The 'ES>' prefix marks quotations from
the blog post above.


First there is a list of earlier blog posts, with links, which makes
the article in question a good starting point.

ES> As part of that effort, I have undertaken an exploration of the
ES> DVCS world.  Several weeks ago I started writing one blog entry
ES> every week, mostly focused on DVCS topics.  In chronological
ES> order, here they are:
ES>
ES> * The one where I gripe about Git's index

where Eric complains that "git add -p" allows for committing untested
changes... not knowing about "git stash --keep-index", and not
understanding that committing is (usually) separate from publishing in
distributed version control systems (so you can test after the commit,
and amend the commit if it does not pass the test).
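
The usual recipe looks like this ("git stash --keep-index" is real;
the rest is just one way to use it):

  $ git add -p              # stage only the changes meant for this commit
  $ git stash --keep-index  # stash away everything that is not staged
  $ make test               # test exactly the state being committed
  $ git commit
  $ git stash pop           # bring back the rest of the work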

ES> * The one where I whine about the way Git allows developers to
ES>   rearrange the DAG

where Eric seems not to notice that you are strongly encouraged to do
'rearranging of the DAG' (rewriting history) _only_ in the unpublished
(not yet made public) part of history.

ES> * The one where it looks like I am against DAG-based version
ES>   control but I'm really not

where Eric conflates linear versus merge-based workflows with the
update-before-commit versus commit-then-merge paradigm, not noticing
that you can have linear history using the sane commit-update-rebase
sequence rather than unsafe update-before-commit.
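
That sequence, sketched (branch names are only an example):

  $ git commit                 # commit locally first; nothing can be lost now
  $ git fetch origin           # then fetch the upstream changes
  $ git rebase origin/master   # replay local commits on top: linear history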

ES> * The one where I fuss about DVCSes that try to act like
ES>   centralized tools

where the DVCS in question that behaves this way is Bazaar (if I
understood it correctly).

ES> * The one where I complain that DVCSes have a lousy story when it
ES>   comes to bug-tracking

where Eric correctly notices that distributed version control does not
help much if you use a centralized bugtracker, and speculates about
the features that a distributed bugtracker would need.  A very nice
post, in my opinion.

ES> * The one where I lament that I want to like Darcs but I can't

where Eric talks about the difference between parentage in a merge
commit (which is needed for good merging) and the "parentage"/weak
link in a cherry-picked commit; Git uses weak link = no link.

ES> * The one where I speculate cluelessly about why Git is so fast

where Eric guesses instead of asking on the git mailing list or the
#git channel... ;-)

ES> Along the way, I've been spending some time getting hands-on
ES> experience with these tools.  I've been using Bazaar for several
ES> months.  I don't like it very much.  I am currently in the process
ES> of switching to Git, but I don't expect to like it very much
ES> either.

Aaaargh... if you expect not to like it very much, I would be very
surprised if you found it to your liking...

ES> So why don't I write about Mercurial?  Because I'm pretty sure I
ES> would like it.
ES>
ES> I chose Bazaar and Git for the experience.  But if I were choosing
ES> a DVCS as a regular user, I would choose Mercurial.  I've used it
ES> some, and found it to be incredibly pleasant.  It seems like the
ES> DVCS that got everything just about right.  That's great if you're
ES> a user, but for a writer, what's interesting about that?

Well, Mercurial IMHO didn't get everything right.  Leaving aside
implementation issues, like dealing with copies, binary files, and
large files, it got these wrong, IMHO:
 * branching, in the sense of multiple branches per repository
 * tags, which should be transferable but non-versioned

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
@ 2009-04-28 21:00   ` Robin Rosenberg
  2009-04-29  6:55   ` Martin Langhoff
  1 sibling, 0 replies; 39+ messages in thread
From: Robin Rosenberg @ 2009-04-28 21:00 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List

On Tuesday 28 April 2009 13:24:31, Jakub Narebski <jnareb@gmail.com> wrote:
> Line endings (issue 8.) are, in theory and in practice (at least for
> Git), a non-issue.
> 
> In theory you should use the project's convention for the end-of-line
> character in text files, and use a smart editor that can deal (or can
> be configured to deal) with this issue correctly.

Windows people will disagree.

> In practice this is a matter of correctly setting up core.autocrlf
> (and, in more complicated cases, which for git are very very rare,
> configuring which files are text and which are not).

Which proves it is an issue, or we wouldn't need to tune settings
to make it work right.  A non-issue is something that "just works"
without turning knobs.  I have had to think more than once about
what the issue was and the right way to solve it.  It can get weird:
Eclipse on Linux generated files with CRLF, which I happily committed,
and Git on Windows happily converted to LF and determined that HEAD
and the index were out of sync, but refused to commit the CRLF->LF
change because there was no "diff"...  You know the fix, but don't
tell me it's not an issue.

-- robin


* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on  git, dscms and a "whole product" approach)
  2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  2009-04-28 21:00   ` Robin Rosenberg
@ 2009-04-29  6:55   ` Martin Langhoff
  2009-04-29  7:21     ` Jeff King
  2009-04-29  7:52     ` Cross-Platform Version Control Jakub Narebski
  1 sibling, 2 replies; 39+ messages in thread
From: Martin Langhoff @ 2009-04-29  6:55 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Git Mailing List

On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com> wrote:
>   DON'T DO THAT.
>   DON'T DO THAT, SOLVABLE.

As I mentioned, Eric is taking the perspective of offering a supported
SCM to a large and diverse audience.  As such, his notes are
interesting regardless of whether he's right or wrong.

We can be "right" and say "don't do that" if we shrink our audience so
that it looks a lot like us. There, fixed.

But something tells me that successful tools are -- by definition --
tools that grow past their creators' use.

So from Eric's perspective, it is worthwhile to work on all those
issues, and get them right for the end user -- support things we don't
like, offer foolproof catches and warnings that prevent the user from
shooting their lovely toes off to mars, etc.

His perspective is one of commercial licensing, but even if we aren't
driven by the "each new user is a new dollar" bit, the long term hopes
for git might also be to be widely used and to improve the version
control life of many unsuspecting users.

To get there, I suspect we have to understand more of Eric's perspective.

that's my 2c.



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-29  6:55   ` Martin Langhoff
@ 2009-04-29  7:21     ` Jeff King
  2009-04-29 20:05       ` Markus Heidelberg
  2009-04-29  7:52     ` Cross-Platform Version Control Jakub Narebski
  1 sibling, 1 reply; 39+ messages in thread
From: Jeff King @ 2009-04-29  7:21 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jakub Narebski, Git Mailing List

On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote:

> So from Eric's perspective, it is worthwhile to work on all those
> issues, and get them right for the end user -- support things we don't
> like, offer foolproof catches and warnings that prevent the user from
> shooting their lovely toes off to mars, etc.

I read a few of his blog postings. He kept complaining about the
features of git that I like the most. :)

So one thing I took away from it is that there probably isn't _one_
interface that works for everybody. I can see his arguments about how
"add -p" can be dangerous, and how history rewriting can be dangerous.
So for some users, blocking those features makes sense.

But for other users (myself included), those are critical features that
make me _way_ more productive. And I manage the risk that comes from
using them as part of my workflow, and it isn't a problem in practice.

While part of me is happy that cogito is now dead (not because I didn't
think it was good, but because having two sets of tools just seemed to
create maintenance and staleness headaches), I do sometimes wonder if we
would be better off with several "from scratch" git interfaces based
around the plumbing (or even a C library). And I don't just mean simple
wrappers around git commands, but whole new interfaces which make
decisions like "no history rewriting at all", and try to provide a safer
interface based on that.

Of course, _I_ wouldn't want to use such an interface. But in theory I
could seamlessly interoperate with people who did.

-Peff


* Re: Cross-Platform Version Control
  2009-04-29  6:55   ` Martin Langhoff
  2009-04-29  7:21     ` Jeff King
@ 2009-04-29  7:52     ` Jakub Narebski
  2009-04-29  8:25       ` Martin Langhoff
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-04-29  7:52 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

On Wed, 29 April 2009, Martin Langhoff wrote:
> On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com>
> wrote: 

[I think you cut out a bit too much. Here I resurrected it]

JN> 1. Different limitations on file names (e.g. pathname length),
JN>   different special characters, different special filenames
JN>   (if any).
[...]
JN>   The answer is a convention for filenames in a project: simply
JN>   DON'T use filenames which can cause problems.
[...]

> >   DON'T DO THAT.

What would be a proper solution to that, if you do not accept a
social rather than a technical restriction?  We could have a
pre-commit hook that checks filenames for portability (which is
deployment specific, and perhaps shouldn't be part of the SCM, except
as an example hook), but it wouldn't help with non-portable filenames
that are already in the repository, on a filesystem that cannot
represent them.
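
Such a hook might look like this (a rough sketch, by no means
exhaustive):

  #!/bin/sh
  # reject staged paths containing characters that are special on
  # MS Windows, or ending in a dot or a space
  bad=$(git diff --cached --name-only | grep -E '[<>:"\|?*]|[. ]$')
  if test -n "$bad"; then
      echo "non-portable filename(s):" >&2
      echo "$bad" >&2
      exit 1
  fi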

If I remember correctly, Git has for some time had a layer which can
translate between filenames in the repository and filenames on the
filesystem, but I'm not sure it is generic enough to be a solution to
this problem, and currently there is no way to manipulate this
mapping, I think.


JN> 2. "Case-insensitive" but "case-preserving" filesystems. [...]
JN>
JN>     The answer is the same as for the previous issue: don't.
JN>     Simply DO NOT create files with filenames which differ only
JN>     in case [...]

> >   DON'T DO THAT, SOLVABLE.

By 'solvable' here I mean that you should be able to modify only one
of the clashing files at a time (check out 'README', modify, add to
the index, remove from the filesystem, check out 'readme', modify,
etc.), and put up with the annoyances in git-status output.  It can
be done in Git, with a medium amount of hacking.  I don't think any
other SCM can do even this, and I cannot think of a better, automatic
solution that would somehow deal with case-clashing.
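
A rough sketch of that dance with plumbing commands (assuming
clashing paths 'README' and 'readme' both exist in the repository):

  $ git checkout-index -f -- README  # materialize one of the pair
  $ $EDITOR README
  $ git update-index README          # record the change in the index
  $ rm README                        # make room for the clashing twin
  $ git checkout-index -f -- readme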

Note that all bets are off on a filesystem that is case-insensitive
but not case-preserving.

By the way, wouldn't it be a better solution to use a sane filesystem,
rather than to complicate the SCM? ;-)

> 
> As I mentioned, Eric is taking the perspective of offering a supported
> SCM to a large and diverse audience.  As such, his notes are
> interesting regardless of whether he's right or wrong.
> 
> We can be "right" and say "don't do that" if we shrink our audience so
> that it looks a lot like us. There, fixed.

<quote source="Dune by Frank Herbert">
  [...] the attitude of the knife — chopping off what's incomplete and
  saying: "Now it's complete because it's ended here."
</quote>

I could not resist posting this quote :-P

> 
> But something tells me that successful tools are -- by definition --
> tools that grow past their creators' use.
> 
> So from Eric's perspective, it is worthwhile to work on all those
> issues, and get them right for the end user -- support things we don't
> like, offer foolproof catches and warnings that prevent the user from
> shooting their lovely toes off to mars, etc.

Warnings and catches I can accept; adding complications and corner
cases for situations which can be trivially avoided with a bit of
social engineering, a.k.a. project guidelines... not so much.

I simply cannot see a situation where you _must_ have dangerously
unportable file names (trailing dot, trailing whitespace) or
case-clashing files...

> 
> His perspective is one of commercial licensing, but even if we aren't
> driven by the "each new user is a new dollar" bit, the long term hopes
> for git might also be to be widely used and to improve the version
> control life of many unsuspecting users.
> 
> To get there, I suspect we have to understand more of Eric's
> perspective. 
> 
> that's my 2c.

By the way, I think that the article on cross-platform version control
(version control in a heterogeneous environment) is quite a good
article.  I don't much like the "10 Issues"/"Top 10" style of writing,
but the article does examine the different ways that a heterogeneous
environment can trip up an SCM.

In my opinion Git does quite well here, where it can, and where the
issue is one for the SCM to solve rather than otherwise (extra
metadata like resource forks).

-- 
Jakub Narebski
Poland


* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach
  2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski
@ 2009-04-29  7:54   ` Sitaram Chamarty
  2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  1 sibling, 0 replies; 39+ messages in thread
From: Sitaram Chamarty @ 2009-04-29  7:54 UTC (permalink / raw)
  To: git

On 2009-04-28, Jakub Narebski <jnareb@gmail.com> wrote:

> ES> * The one where I lament that I want to like Darcs but I can't
>
> where Eric talks about the difference between parentage in a merge
> commit (which is needed for good merging) and the "parentage"/weak
> link in a cherry-picked commit; Git uses weak link = no link.

Well, the patch-id is a sort of "compute on demand" link, so
it would qualify as a weak link, especially because git
manages to use it during a rebase.
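
(Something like

  $ git show HEAD | git patch-id

computes it on demand from the diff text, ignoring line numbers and
whitespace.)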

I wanted to point that out but I didn't see a link to post
comments so I didn't bother.


* Re: Cross-Platform Version Control
  2009-04-29  7:52     ` Cross-Platform Version Control Jakub Narebski
@ 2009-04-29  8:25       ` Martin Langhoff
  0 siblings, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2009-04-29  8:25 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Git Mailing List

On Wed, Apr 29, 2009 at 9:52 AM, Jakub Narebski <jnareb@gmail.com> wrote:
>> >   DON'T DO THAT.
>
> What would be a proper solution to that, if you do not accept a
> social rather than a technical restriction?

Let's say: strong checks for case-sensitivity clashes, leading and
trailing dots, utf-8 encoding maladies, etc., switched on by default.
And note that to be user-friendly you want most of those checks at
'add' time.

If we don't like a particular FS, or we think it is messing up our
utf-8 filenames, say so up-front, at clone and checkout time. For
example, if the checkout has files with interesting utf-8 names, it
would be reasonable to check for filename mangling.

Some things are hard or impossible to prevent - the utf-8 encoding
maladies of OSX, for example. But they may be detectable on checkout.

In short, play on the defensive, for the benefit of users who are not
kernel developers.

It will piss off kernel & git developers and slow some operations
somewhat. It will piss off oldtimers like me. But I'll say git config
--global core.trainingwheels no and life will be good.

It may be - as Jeff King points out - a matter of a polished git
porcelain. We've seen lots of porcelains, but no smooth user-targeted
porcelain yet.

cheers,



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff


* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-29  7:21     ` Jeff King
@ 2009-04-29 20:05       ` Markus Heidelberg
  0 siblings, 0 replies; 39+ messages in thread
From: Markus Heidelberg @ 2009-04-29 20:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Langhoff, Jakub Narebski, Git Mailing List

Jeff King, 29.04.2009:
> On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote:
> 
> > So from Eric's perspective, it is worthwhile to work on all those
> > issues, and get them right for the end user -- support things we don't
> > like, offer foolproof catches and warnings that prevent the user from
> > shooting their lovely toes off to mars, etc.
> 
> I read a few of his blog postings. He kept complaining about the
> features of git that I like the most. :)
> 
> I can see his arguments about how
> "add -p" can be dangerous

Actually, I don't see a very special case here in committing a never
compiled/tested worktree state. You can do this with every VCS (even
one without an index like git's) by just selectively committing files
instead of the whole current worktree.

Markus


* Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski
  2009-04-29  7:54   ` Sitaram Chamarty
@ 2009-04-30 12:17   ` Jakub Narebski
  2009-04-30 12:56     ` Michael Witten
                       ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-04-30 12:17 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Jakub Narebski <jnareb@gmail.com> writes:

> Martin Langhoff <martin.langhoff@gmail.com> writes:
> 
> > Eric Sink has been working on the (commercial, proprietary) centralised
> > SCM Vault for a while. He's written recently about his explorations
> > around the new crop of DSCMs, and I think it's quite interesting.

[...]

> > So here's the blog -  http://www.ericsink.com/

[...]
> *  "Mercurial, Subversion, and Wesley Snipes"
> >   http://www.ericsink.com/entries/hg_denzel.html
> 
> which I will comment on now.  The 'ES>' prefix marks quotations from
> the blog post above.

[...]
> ES> * The one where I speculate cluelessly about why Git is so fast
> 
> where Eric guesses instead of asking on the git mailing list or the
> #git channel... ;-)

This issue is interesting: what features and what design decisions
make Git fast?  One of the goals of Git was good performance; are
we there?

All quotes marked 'es> ' below are from the "Why is Git so Fast?" post
http://www.ericsink.com/entries/why_is_git_fast.html

es> One:  Maybe Git is fast simply because it's a DVCS.
es>
es> There's probably some truth here.  One of the main benefits touted
es> by the DVCS fanatics is the extra performance you get when
es> everything is "local".

This is I think quite obvious.  Accessing memory is faster than
acessing disk, which in turn is faster than accessing network.  So if
commit and (change)log does not require access to server via network,
they are so much faster.

BTW, that is why Subversion stores 'pristine' versions of files
alongside the working copy: to make status and diff fast enough to be
usable.  Which in turn might make an SVN checkout larger than a full
Git clone ;-)
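
Easy enough to check on a project you have in both systems (directory
names here are placeholders; sizes will of course vary):

  $ du -sh project-svn-checkout project-git-clone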

es>
es> But this answer isn't enough.  Maybe it explains why Git is faster
es> than Subversion, but it doesn't explain why Git is so often
es> described as being faster than the other DVCSs.

Not only described; see http://git.or.cz/gitwiki/GitBenchmarks
(although some, if not most, of those benchmarks are dated, and
e.g. Bazaar claims to have much better performance now).

es>
es> Two:  Maybe Git is fast because Linus Torvalds is so smart.

[non answer; the details are important]

es> Three: Maybe Git is fast because it's written in C instead of one
es> of those newfangled higher-level languages.
es>
es> Nah, probably not.  Lots of people have written fast software in
es> C#, Java or Python.
es>
es> And lots of people have written really slow software in
es> traditional native languages like C/C++. [...]

Well, I guess that access to low-level optimization techniques like
mmap is important for performance.  But here I am guessing and
speculating like Eric did; well, at least I am asking on the proper
forum ;-)

We have some anecdotal evidence supporting this possibility (which
Eric dismisses), namely the fact that pure-Python Bazaar is the
slowest of the three most common open source DVCSs (Git, Mercurial,
Bazaar), and the fact that parts of Mercurial were written in C for
better performance.

We can also compare implementations of Git in other, higher level
languages with the reference implementation in C (and shell scripts,
and Perl ;-)).  The most complete such implementation, I think,
though still not a fully complete one, is JGit, in Java.  I hope that
the JGit developers can tell us whether using a higher level language
affects performance, by how much, and which features of the
higher-level language cause the decrease in performance.  Of course
we have to take into account the possibility that JGit simply isn't
as well optimized because of less manpower.

es>
es> Four: Maybe Git is fast because being fast is the primary goal for
es> Git.

[non answer; the details are important]

es>
es> Five:  Maybe Git is fast because it does less.
es>
es> One of my favorite recent blog entries is this piece[1] which
es> claims that the way to make code faster is to have it do less.
es>
es> [1] "How to write fast code" by Kas Thomas
es>     http://asserttrue.blogspot.com/2009/03/how-to-write-fast-code.html
[...]

es>
es> For example, the way you get something in the Git index is you use
es> the "git add" command.  Git doesn't scan your working copy for
es> changed files unless you explicitly tell it to.  This can be a
es> pretty big performance win for huge trees.  Even when you use the
es> "remember the timestamp" trick, detecting modified files in a
es> really big tree can take a noticeable amount of time.

That of course depends on how you compare the performance of different
version control systems (so as not to compare apples with oranges).
But if you compare e.g. "<scm> commit" with the Git equivalent
"git commit -a", the above is simply not true.

BTW, when doing a comparison you have to take care of the reverse as
well, e.g. git doing more, like calculating and displaying a diffstat
by default for merges/pulls.

es>
es> Or maybe Git's shortcut for handling renames is faster than doing
es> them more correctly[2] like Bazaar does.
es>
es> [2] "Renaming is the killer app of distributed version control"
es>     http://www.markshuttleworth.com/archives/123

Errr... what?


es> Six:  Maybe Git is fast because it doesn't use much external code.
es>
es> Very often, when you are facing a decision to use somebody else's
es> code or write it yourself, there is a performance tradeoff.  Not
es> always, but often.  Maybe the third party code is just slower than
es> the code you could write yourself if you had time to do it.  Or
es> maybe there is an impedance mismatch between the API of the
es> external library and your own architecture.
es>
es> This can happen even when the library is very high quality.  For
es> example, consider libcurl.  This is a great library.  Tons of
es> people use it.  But it does have one problem that will cause
es> performance problems for some users: When using libcurl to fetch
es> an object, it wants to own the buffer.  In some situations, this
es> can end up forcing you to use extra memcpys or temporary files.
es> The reason all the low level calls like send() and recv() allow
es> the caller to own the loop and the buffer is because this is the
es> best way to avoid the need to make extra copies of the data on
es> disk or in memory.
[...]

es>
es> Maybe Git is fast because every time they faced one of these "buy
es> vs. build" choices, they decided to just write it themselves.

I don't think so.  Rather the opposite is true.  Git uses libcurl for
HTTP transport.  Git uses zlib for compression.  Git uses SHA-1 code
from OpenSSL or from Mozilla.  Git uses a (modified, internal)
LibXDiff for (binary) deltification, for diffs and for merges.

OTOH Git includes several micro-libraries of its own: parseopt,
strbuf, ALLOC_GROW, etc.  NIH syndrome?  I don't think so; rather,
avoiding extra dependencies (bstring vs strbuf) and existing solutions
not fitting all needs (popt/argp/getopt vs parse-options).

es> Seven:  Maybe Git isn't really that fast.
es>
es> If there is one thing I've learned about version control it's that
es> everybody's situation is different.  It is quite likely that Git
es> is a lot faster for some scenarios than it is for others.
es>
es> How does Git handle really large trees?  Git was designed primary
es> to support the efforts of the Linux kernel developers.  A lot of
es> people think the Linux kernel is a large tree, but it's really
es> not.  Many enterprise configuration management repositories are
es> FAR bigger than the Linux kernel.

c.f. "Why Perforce is more scalable than Git" by Steve Hanov
     http://gandolf.homelinux.org/blog/index.php?id=50

I don't really know about this.

But there is one issue Eric Sink didn't think about:

Eight: Git seems fast.
======================

Here I mean concentrating on low _latency_: when git produces more
than one page of output (for example "git log"), it tries to output
the first page as fast as possible.  That means it is
"git <sth> | head -25 >/dev/null" that has to be fast, and not
necessarily "git <sth> >/dev/null" itself.

Having a progress indicator appear whenever there is a longer wait
(quite a fresh feature) also helps the impression of being fast...


And what do you think about this?

-- 
Jakub Narebski
Poland
ShadeHawk on #git


* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git,  dscms and a "whole product" approach)
  2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
@ 2009-04-30 12:56     ` Michael Witten
  2009-04-30 15:28       ` Why Git is so fast Jakub Narebski
  2009-04-30 18:43       ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce
  2009-04-30 14:22     ` Jeff King
  2009-04-30 18:56     ` Nicolas Pitre
  2 siblings, 2 replies; 39+ messages in thread
From: Michael Witten @ 2009-04-30 12:56 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List

On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote:
> I hope that
> the JGit developers can tell us whether using a higher level language
> affects performance, by how much, and which features of the
> higher-level language cause the decrease in performance.

Java is definitely higher-level than C, but you can do some pretty low-level
operations on bits and bytes and the like, not to mention the presence
of a JIT.

My point: I don't think that Java can tell us anything special in this regard.


* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  2009-04-30 12:56     ` Michael Witten
@ 2009-04-30 14:22     ` Jeff King
  2009-05-01 18:43       ` Linus Torvalds
  2009-04-30 18:56     ` Nicolas Pitre
  2 siblings, 1 reply; 39+ messages in thread
From: Jeff King @ 2009-04-30 14:22 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List

On Thu, Apr 30, 2009 at 05:17:58AM -0700, Jakub Narebski wrote:

> This is, I think, quite obvious.  Accessing memory is faster than
> accessing disk, which in turn is faster than accessing the network.
> So if commit and (change)log do not require access to a server over
> the network, they are that much faster.

Like all generalizations, this is only mostly true. Fast network
servers with big caches can outperform disks for some loads. And in
many cases with a VCS, you are performing a query that might look over
the whole dataset, but return only a small fraction of the data.

So I wouldn't rule out the possibility of a pleasant VCS experience on a
network-optimized system backed by beefy servers on a local network. I
have never used perforce, but I get the impression that it is more
optimized for such a situation. Git is really optimized for open source
projects: slow servers across high-latency, low-bandwidth links.

> es> Nah, probably not.  Lots of people have written fast software in
> es> C#, Java or Python.
> es>
> es> And lots of people have written really slow software in
> es> traditional native languages like C/C++. [...]
> 
> Well, I guess that access to low-level optimization techniques like
> mmap is important for performance.  But here I am guessing and
> speculating like Eric did; well, at least I am asking on the proper
> forum ;-)

Certainly there's algorithmic fastness that you can do in any language,
and I think git does well at that. Most operations are independent of
the total size of history (e.g., branching is O(1) and commit is
O(changed files), diff looks only at endpoints, etc). Operations which
deal only with history are independent of the size of the tree (e.g.,
"git log" and the history graph in gitk look only at commits, never at
the tree).  And when we do have to look at the tree, we can drastically
reduce our I/O by comparing hashes instead of full files.
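
For instance, a branch is just a small file containing one commit id,
so creating it never touches the rest of the repository (assuming the
ref is loose, not packed):

  $ git branch topic
  $ cat .git/refs/heads/topic   # a single 40-hex commit id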

But there are also some micro-optimizations that make a big difference
in practice. Some of them can be done in any language. For example, the
packfiles are ordered by type so that all of the commits have a nice I/O
pattern when doing a history walk.

Some other micro-optimizations are really language-specific, though. I
don't recall the numbers, but I think Linus got measurable speedups from
cutting the memory footprint of the object and commit structs (which
gave better cache usage patterns).  Git uses some variable-length fields
inside structs instead of a pointer to a separate allocated string to
give better memory access patterns. Tricks like that won't give the
order-of-magnitude speedups that algorithmic optimizations will, but 10%
here and 20% there means you can get a system that is a few times faster
than the competition. For an operation that takes 0.1s anyway, that
doesn't matter. But with current hardware and current project size, you
are often talking about dropping a 3-second operation down to 1s or
0.5s, which just feels a lot snappier.

And finally, git tries to do as little work as possible when starting a
new command, and streams output as soon as possible. Which means that in
a command-line setting, git can _feel_ snappier, because it starts
output immediately. Higher-level languages can often have a much longer
startup time, especially if they have a lot of modules to load. E.g.,:

  # does enough work to easily fill your pager
  $ time git log -100 >/dev/null
  real    0m0.011s
  user    0m0.008s
  sys     0m0.004s

  # does nothing, just starts perl and aborts with usage
  $ time git send-email >/dev/null
  real    0m0.150s
  user    0m0.104s
  sys     0m0.048s

Both are warm-cache times. C git gives you output almost
instantaneously, whereas just loading perl with a modest set of
modules introduces a noticeable pause before any work is actually
done. In the grand scheme of things, 0.1s probably isn't relevant, but
I think avoiding that delay adds to the perception of git as fast.

> es> Or maybe Git's shortcut for handling renames is faster than doing
> es> them more correctly[2] like Bazaar does.
> es>
> es> [2] "Renaming is the killer app of distributed version control"
> es>     http://www.markshuttleworth.com/archives/123
> 
> Errr... what?

Yeah, I had the same thought. Git's rename handling is _much_ more
computationally intensive than other systems. In fact, it is one of only
two places where I have ever wanted git to be any faster (the other
being repacking of large repos).

> Eight: Git seems fast.
> ======================
> 
> Here I mean concentrating on low _latency_: when git produces more

I do think this helps (see above), but I wanted to note that it is more
than just "streaming"; I think other systems stream, as well. For
example, I am pretty sure that "cvs log" streamed (but thank god it has
been so long since I touched CVS that I can't really remember), but it
_still_ felt awfully slow.

So it is also about keeping start times low and having your data in a
format that is ready to use.

-Peff


* Re: Why Git is so fast
  2009-04-30 12:56     ` Michael Witten
@ 2009-04-30 15:28       ` Jakub Narebski
  2009-04-30 18:52         ` Shawn O. Pearce
  2009-04-30 18:43       ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Narebski @ 2009-04-30 15:28 UTC (permalink / raw)
  To: Michael Witten; +Cc: Martin Langhoff, Git Mailing List

On Thu, 30 Apr 2009, Michael Witten wrote:
> On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote:

> > I hope that
> > the JGit developers can tell us whether using a higher level language
> > affects performance, by how much, and which features of the
> > higher-level language cause the decrease in performance.
> 
> Java is definitely higher-level than C, but you can do some pretty low-level
> operations on bits and bytes and the like, not to mention the presence
> of a JIT.
> 
> My point: I don't think that Java can tell us anything special in this regard.

Let's rephrase the question a bit, then: what low-level operations
were needed for good performance in JGit?

-- 
Jakub Narebski
Poland


* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-30 12:56     ` Michael Witten
  2009-04-30 15:28       ` Why Git is so fast Jakub Narebski
@ 2009-04-30 18:43       ` Shawn O. Pearce
  1 sibling, 0 replies; 39+ messages in thread
From: Shawn O. Pearce @ 2009-04-30 18:43 UTC (permalink / raw)
  To: Michael Witten; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List

Michael Witten <mfwitten@gmail.com> wrote:
> On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote:
> > I hope that
> > the JGit developers can tell us whether using a higher level language
> > affects performance, by how much, and which features of the
> > higher-level language cause the decrease in performance.
> 
> Java is definitely higher-level than C, but you can do some pretty low-level
> operations on bits and bytes and the like, not to mention the presence
> of a JIT.

But it's still costly compared to C.
 
> My point: I don't think that Java can tell us anything special in this regard.

Sure it can.

Peff I think made a good point here, that we rely on a lot of small
tweaks in the C git code to get *really* good performance.  5% here,
10% there, and suddenly you are 60% faster than you were before.
Nico, Linus, Junio, they have all spent some time over the past
3 or 4 years trying to tune various parts of Git to just flat out
run fast.

Higher level languages hide enough of the machine that we can't
make all of these optimizations.

JGit struggles with not having mmap(): even when you do use Java NIO's
MappedByteBuffer, we still have to copy to a temporary byte[] in
order to do any real processing.  C Git avoids that copy.  Sure,
other higher level languages may offer a better mmap facility,
but they also tend to offer garbage collection, and most try to tie
the mmap management into the GC "for safety and ease of use".

JGit struggles with not having unsigned types in Java.  There are
many locations in JGit where we really need "unsigned int32_t" or
"unsigned long" (largest machine word available) or "unsigned char"
but these types just don't exist in Java.  Converting a byte up to
an int just to treat it as an unsigned requires an extra " & 0xFF"
operation to remove the sign extension.

JGit struggles with not having an efficient way to represent a SHA-1.
C can just say "unsigned char[20]" and have it inlined into the
container's memory allocation.  A byte[20] in Java will cost an
*additional* 16 bytes of memory, and be slower to access because
the bytes themselves are in a different area of memory from the
container object.  We try to work around it by converting from a
byte[20] to 5 ints, but that costs us machine instructions.

C Git takes for granted that memcpy(a, b, 20) is dirt cheap when
doing a copy from an inflated tree into a struct object.  JGit has
to pay a huge penalty to copy that 20 byte region out into 5 ints,
because later on, those 5 ints are cheaper.

Other higher level languages also lack the ability to mark a
type unsigned, or face similar penalties when storing a 20-byte
binary region.

Native Java collection types have been a snare for us in JGit.
We've used java.util.* types when they seem to be handy and already
solve the data structure problem at hand, but they tend to perform
a lot worse than writing a specialized data structure.

For example, we have ObjectIdSubclassMap for what should be
Map<ObjectId,Object>.  Only it requires that the Object type you
use as the "value" entry in the map extend from ObjectId, as the
instance serves as both key *and* value.  But it screams when
compared to HashMap<ObjectId,Object>.  (For those who don't know,
ObjectId is JGit's "unsigned char[20]" for a SHA-1.)

Just a day or so ago I wrote LongMap, a faster HashMap<Long,Object>,
for hashing objects by indexes in a pack file.  Again, the boxing
costs in Java to convert a "long" (largest integer type) into an
Object that the standard HashMap type would accept were rather high.

Right now, JGit is still paying dearly when it comes to ripping
apart a commit or a tree object to follow the object links.  Or when
invoking inflate().  We spend a lot more time doing this sort of work
than C git does, and yet we're trying to be as close to the machine
as we can go by using byte[] whenever possible, by avoiding copying
whenever possible, and avoiding memory allocation when possible.

Notably, `rev-list --objects --all` takes about 2x as long in
JGit as it does in C Git on a project like the linux kernel, and
`index-pack` for the full ~270M pack file takes about 2x as long.

Both parts of JGit are about as good as I know how to make them,
but we're really at the mercy of the JIT, and changes in the JIT
can cause us to perform worse (or better) than before.  Unlike in
C Git where Linus has done assembler dumps of sections of code and
tried to determine better approaches.  :-)

So, yes, it's practical to build Git in a higher level language, but
you just can't get the same performance, or tight memory utilization,
that C Git gets.  That's what that higher level language abstraction
costs you.  But JGit performs reasonably well; well enough that
we use it internally at Google as a git server.

-- 
Shawn.


* Re: Why Git is so fast
  2009-04-30 15:28       ` Why Git is so fast Jakub Narebski
@ 2009-04-30 18:52         ` Shawn O. Pearce
  2009-04-30 20:36           ` Kjetil Barvik
  0 siblings, 1 reply; 39+ messages in thread
From: Shawn O. Pearce @ 2009-04-30 18:52 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Michael Witten, Martin Langhoff, Git Mailing List

Jakub Narebski <jnareb@gmail.com> wrote:
> Let's rephrase the question a bit, then: what low-level operations
> were needed for good performance in JGit?

Aside from the message I just posted:

- Avoid String, it's too expensive most of the time.  Stick with
  byte[], and better, stick with data that is a triplet of (byte[],
  int start, int end) to define a region of data.  Yes, it's annoying,
  as it's 3 values you need to pass around instead of just 1, but
  it makes a big difference in running time.

- Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
  which can be inlined into an object allocation.

- Subclass instead of containing references.  We extend ObjectId to
  attach application data, rather than contain a reference to an
  ObjectId.  Classical Java programming technique would say this
  is a violation of encapsulation.  But it gets us the same memory
  impact that C Git gets by saying:

    struct appdata {
      unsigned char sha1[20];
      ...
    };

- We're hurting dearly for not having more efficient access to the
  pack-*.pack file data.  mmap in Java is crap.  We implement our
  own page buffer, reading in blocks of 8192 bytes at a time and
  holding them in our own cache.

  Really, we should write our own mmap library as an optional JNI
  thing, and tie it into libz so we can efficiently run inflate()
  off the pack data directly.

- We're hurting dearly for not having more efficient access to the
  pack-*.idx files.  Again, with no mmap we read the entire bloody
  index into memory.  But since you won't touch most of it we keep
  it in large byte[], but since you are searching with an ObjectId
  (5 ints) we pay a conversion price on every search step where
  we have to copy from the large byte[] to 5 local variable ints,
  and then compare to the ObjectId.  It's an overhead C git doesn't
  have to deal with.

Anyway.

I'm still just amazed at how well JGit runs given these limitations.
I guess that's Moore's Law for you.  10 years ago, JGit wouldn't
have been practical.

-- 
Shawn.


* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
  2009-04-30 12:56     ` Michael Witten
  2009-04-30 14:22     ` Jeff King
@ 2009-04-30 18:56     ` Nicolas Pitre
  2009-04-30 19:16       ` Alex Riesen
  2009-04-30 19:33       ` Jakub Narebski
  2 siblings, 2 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-04-30 18:56 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List

On Thu, 30 Apr 2009, Jakub Narebski wrote:

> Jakub Narebski <jnareb@gmail.com> writes:
> 
> es> Two:  Maybe Git is fast because Linus Torvalds is so smart.
> 
> [non answer; the details are important]

I think Linus is certainly responsible for a big part of Git's speed.
He came up with the basic data structure used by git, which has a lot
to do with it.  Also, he designed Git specifically to fulfill a need
for which none of the alternatives were fast enough.  Hence Git was
designed from the ground up with speed as one of the primary design
goals, such as being able to create multiple commits per second
instead of the other way around (several seconds per commit).  And
yes, Linus is usually smart enough, with the proper mindset, to
achieve such goals.

> es> Three: Maybe Git is fast because it's written in C instead of one
> es> of those newfangled higher-level languages.
> es>
> es> Nah, probably not.  Lots of people have written fast software in
> es> C#, Java or Python.
> es>
> es> And lots of people have written really slow software in
> es> traditional native languages like C/C++. [...]
> 
> Well, I guess that access to low-level optimization techniques like
> mmap is important for performance.  But here I am guessing and
> speculating like Eric did; well, at least I am asking on the proper
> forum ;-)
> 
> We have some anecdotal evidence supporting this possibility (which
> Eric dismisses), namely the fact that the pure-Python Bazaar is the
> slowest of the three most common open source DVCSes (Git, Mercurial,
> Bazaar), and the
> fact that parts of Mercurial were written in C for better performance.
> 
> We can also compare implementations of Git in other, higher-level
> languages with the reference implementation in C (and shell scripts, and
> Perl ;-)).  For example the most complete (though still not fully
> complete) Java implementation: JGit.  I hope that the JGit developers can
> tell us whether using a higher-level language affects performance, by how
> much, and which features of the higher-level language cause the decrease
> in performance.  Of course we have to take into account the
> possibility that JGit simply isn't as well optimized because of less
> manpower.

One of the main JGit developers is Shawn Pearce.  If you look at Shawn's 
contributions to C git, they are mostly all related to performance 
issues.  Amongst other things, he is the author of git-fast-import, he 
contributed the pack access windowing code, and he was also involved in 
the initial design of pack v4.  Hence Shawn is a smart guy who certainly 
knows a thing or two about performance optimization.  Yet he reported 
on this list that his efforts to make JGit faster were no longer very 
successful, most probably due to the language overhead.

> es> Four: Maybe Git is fast because being fast is the primary goal for
> es> Git.
> 
> [non-answer; the details are important]

Still, this is actually true (see the point about Linus above).  Without 
such a goal, you quickly lose sight of performance regressions.

> es> Maybe Git is fast because every time they faced one of these "buy
> es> vs. build" choices, they decided to just write it themselves.
> 
> I don't think so.  Rather the opposite is true.  Git uses libcurl for
> HTTP transport.  Git uses zlib for compression.  Git uses SHA-1 from
> OpenSSL or from Mozilla.  Git uses (modified, internal) LibXDiff for
> (binary) deltaifying, for diffs and for merges.

Well, I think he's right on this point as well.  libcurl is not so 
relevant since it is rarely the bottleneck (the network bandwidth itself 
usually is).  zlib is already as fast as it can be, as multiple attempts 
to make it faster didn't succeed.  Git already carries its own version 
of SHA-1 code for ARM and PPC because the alternatives were slower.  
The fact that libxdiff was made internal is indeed to have a better 
impedance match with the core code; otherwise it could have remained 
fully external just like zlib.  And the binary delta code is not 
libxdiff anymore but a much smaller, straightforward, optimized-to-death 
version that favors speed over versatility (no need to be versatile 
when dealing strictly with Git's needs only).

> es> Seven:  Maybe Git isn't really that fast.
> es>
> es> If there is one thing I've learned about version control it's that
> es> everybody's situation is different.  It is quite likely that Git
> es> is a lot faster for some scenarios than it is for others.
> es>
> es> How does Git handle really large trees?  Git was designed primarily
> es> to support the efforts of the Linux kernel developers.  A lot of
> es> people think the Linux kernel is a large tree, but it's really
> es> not.  Many enterprise configuration management repositories are
> es> FAR bigger than the Linux kernel.
> 
> c.f. "Why Perforce is more scalable than Git" by Steve Hanov
>      http://gandolf.homelinux.org/blog/index.php?id=50
> 
> I don't really know about this.

Git certainly sucks big time with large files.

Git also sucks to a lesser extent (but still) with very large 
repositories.

But large trees?  I don't think Git is worse than anything out there 
with a large tree of average-size files.

Yet, this point is misleading, because when people give Git the 
reputation of being faster, it is certainly from comparisons of 
operations performed on the same source tree.  Who cares about scenarios 
for which the tool was not designed?  Those "enterprise configuration 
management repositories" are indeed not what Git was designed for, but 
neither were Mercurial or Bazaar, nor any other contender to which Git is 
usually compared.


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git,  dscms and a "whole product" approach)
  2009-04-30 18:56     ` Nicolas Pitre
@ 2009-04-30 19:16       ` Alex Riesen
  2009-05-04  8:01         ` Why Git is so fast Andreas Ericsson
  2009-04-30 19:33       ` Jakub Narebski
  1 sibling, 1 reply; 39+ messages in thread
From: Alex Riesen @ 2009-04-30 19:16 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List

2009/4/30 Nicolas Pitre <nico@cam.org>:
> Yet, this point is misleading, because when people give Git the
> reputation of being faster, it is certainly from comparisons of
> operations performed on the same source tree.  Who cares about scenarios
> for which the tool was not designed?  Those "enterprise configuration
> management repositories" are indeed not what Git was designed for, but

Especially when no sane developer will put in his repository the
toolchain (pre-compiled, for all supported platforms!), all the
supporting tools (like grep, find, etc., pre-compiled _and_ source), the
in-house framework (pre-compiled and source, again), firmware
(pre-compiled and put in the repository weekly), and operating system
code (pre-compiled, with firmware-specific drivers, updated, you guessed
it, weekly), and, well, the project itself (Java or C++, and
documentation in .doc and .xls)...

Now, what kind of self-hating idiot will design a system for that kind
of abuse?  (And if someone says that is not true of most enterprise
f$%cking configurations, he definitely hasn't had to live through a big
enough number of them.)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 18:56     ` Nicolas Pitre
  2009-04-30 19:16       ` Alex Riesen
@ 2009-04-30 19:33       ` Jakub Narebski
  1 sibling, 0 replies; 39+ messages in thread
From: Jakub Narebski @ 2009-04-30 19:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Martin Langhoff, Git Mailing List

On Thu, 30 Apr 2009, Nicolas Pitre wrote:
> On Thu, 30 Apr 2009, Jakub Narebski wrote:
> > Jakub Narebski <jnareb@gmail.com> writes:

> > es> Maybe Git is fast because every time they faced one of these "buy
> > es> vs. build" choices, they decided to just write it themselves.
> > 
> > I don't think so.  Rather the opposite is true.  Git uses libcurl for
> > HTTP transport.  Git uses zlib for compression.  Git uses SHA-1 from
> > OpenSSL or from Mozilla.  Git uses (modified, internal) LibXDiff for
> > (binary) deltaifying, for diffs and for merges.
> 
> Well, I think he's right on this point as well.  [...]
> The fact that libxdiff was made internal is indeed to have a better
> impedance match with the core code; otherwise it could have remained
> fully external just like zlib.  And the binary delta code is not
> libxdiff anymore but a much smaller, straightforward, optimized-to-death
> version that favors speed over versatility (no need to be versatile
> when dealing strictly with Git's needs only).

Hrmmmm... I had thought that LibXDiff was internalized mainly for ease
of modification, as my impression is that LibXDiff is a single-developer
effort, while Git has had many contributors from the beginning (and
submodules didn't exist back then).  If I remember correctly, the
rcsmerge/diff3 algorithm was first added in git's internalized xdiff...
was it ever added to LibXDiff proper, anyway?

BTW, I wonder what the other F/OSS version control systems (Bazaar,
Mercurial, Darcs, Monotone) use for binary deltas, for their diff engine,
and for their textual three-way merge engine.  Hmmm... perhaps I'll ask
on #revctrl.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 18:52         ` Shawn O. Pearce
@ 2009-04-30 20:36           ` Kjetil Barvik
  2009-04-30 20:40             ` Shawn O. Pearce
  2009-05-01  5:24             ` Dmitry Potapov
  0 siblings, 2 replies; 39+ messages in thread
From: Kjetil Barvik @ 2009-04-30 20:36 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

* "Shawn O. Pearce" <spearce@spearce.org> writes:
 <snipp>
| - Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
|   which can be inlined into an object allocation.

  What do people think about doing something similar in C git?

  That is, convert the current internal representation of the SHA-1 from
  "unsigned char sha1[20]" to "unsigned long sha1[5]"?

  Ok, I currently see 2 problems with it:

     1) Will the type "unsigned long" always be an unsigned 32-bit type
        on all platforms on all computers?  Do we need a "uint32_t" thing?

     2) Can we get in trouble because of differences between little- and
        big-endian machines?

  And, similarly, I can see or guess that the following would be positive
  with this change:

     3) From a SHA1 library I worked with some time ago, I noticed that
        it internally used the type "unsigned long arr[5]", so it may
        be possible to get some shortcuts or maybe speedups here,
        if we want to do it.

     4) The "static inline void hashcpy(....)" in cache.h could then
        maybe be written like this:

  static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
  {
       sha_dst[0] = sha_src[0];
       sha_dst[1] = sha_src[1];
       sha_dst[2] = sha_src[2];
       sha_dst[3] = sha_src[3];
       sha_dst[4] = sha_src[4];
  }

        And it will hopefully be compiled to just 5 store/move
        instructions, or at least hopefully be faster than the current
        memcpy() call. But maybe we get more compiled instructions compared
        to a single call to memcpy()?

     5) Similarly as in 4) for the other SHA1-related hash functions near
        hashcpy() in cache.h

  OK, just some thoughts.  Sorry if this has already been discussed,
  but I could not find anything about it with a simple google search.

  -- kjetil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 20:36           ` Kjetil Barvik
@ 2009-04-30 20:40             ` Shawn O. Pearce
  2009-04-30 21:36               ` Kjetil Barvik
  2009-05-01  5:24             ` Dmitry Potapov
  1 sibling, 1 reply; 39+ messages in thread
From: Shawn O. Pearce @ 2009-04-30 20:40 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: git

Kjetil Barvik <barvik@broadpark.no> wrote:
> * "Shawn O. Pearce" <spearce@spearce.org> writes:
>  <snipp>
> | - Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
> |   which can be inlined into an object allocation.
> 
>   What do people think about doing something similar in C git?
> 
>   That is, convert the current internal representation of the SHA-1 from
>   "unsigned char sha1[20]" to "unsigned long sha1[5]"?

It's not worth the code churn.
 
>   Ok, I currently see 2 problems with it:
> 
>      1) Will the type "unsigned long" always be an unsigned 32-bit type
>         on all platforms on all computers?  Do we need a "uint32_t" thing?

Yea, "unsigned long" isn't always 32 bits.  So we'd need to use
uint32_t.  Which we already use elsewhere, but still.
 
>      2) Can we get in trouble because of differences between little- and
>         big-endian machines?

Yes, especially if the compare was implemented using native uint32_t
compares and the processor was little-endian.
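
For instance (a toy demo, not git code; it assumes a little-endian host
-- the name is made up):

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>

	int main(void)
	{
		unsigned char a[4] = { 0x00, 0x00, 0x00, 0x02 };
		unsigned char b[4] = { 0x01, 0x00, 0x00, 0x00 };
		uint32_t ia, ib;

		memcpy(&ia, a, 4);
		memcpy(&ib, b, 4);

		/* byte-wise order, as the sorted .idx uses: a < b */
		printf("memcmp says %d\n", memcmp(a, b, 4));

		/* on little-endian, ia = 0x02000000 and ib = 0x00000001,
		 * so the native compare gives the opposite answer: a > b */
		printf("uint32 says %d\n", ia < ib ? -1 : ia > ib ? 1 : 0);
		return 0;
	}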

>      4) The "static inline void hashcpy(....)" in cache.h could then
>         maybe be written like this:

It's already done as "memcpy(a, b, 20)", which most compilers will 
inline and probably reduce to 5 word moves anyway.  That's why
hashcpy() itself is inline.
 
-- 
Shawn.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 20:40             ` Shawn O. Pearce
@ 2009-04-30 21:36               ` Kjetil Barvik
  2009-05-01  0:23                 ` Steven Noonan
  2009-05-01 17:42                 ` Tony Finch
  0 siblings, 2 replies; 39+ messages in thread
From: Kjetil Barvik @ 2009-04-30 21:36 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

* "Shawn O. Pearce" <spearce@spearce.org> writes:
|>      4) The "static inline void hashcpy(....)" in cache.h could then
|>         maybe be written like this:
|
| It's already done as "memcpy(a, b, 20)", which most compilers will
| inline and probably reduce to 5 word moves anyway.  That's why
| hashcpy() itself is inline.

  But would the compiler be able to trust that the hashcpy() is always
  called with correct word alignment on variables a and b?

  I made a test and compiled git with:

    make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all

  compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3
  CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel

  Then used gdb to get the following:

(gdb) disassemble write_sha1_file
Dump of assembler code for function write_sha1_file:
0x080e3830 <write_sha1_file+0>:	push   %ebp
0x080e3831 <write_sha1_file+1>:	mov    %esp,%ebp
0x080e3833 <write_sha1_file+3>:	sub    $0x58,%esp
0x080e3836 <write_sha1_file+6>:	lea    -0x10(%ebp),%eax
0x080e3839 <write_sha1_file+9>:	mov    %ebx,-0xc(%ebp)
0x080e383c <write_sha1_file+12>:	mov    %esi,-0x8(%ebp)
0x080e383f <write_sha1_file+15>:	mov    %edi,-0x4(%ebp)
0x080e3842 <write_sha1_file+18>:	mov    0x14(%ebp),%ebx
0x080e3845 <write_sha1_file+21>:	mov    %eax,0x8(%esp)
0x080e3849 <write_sha1_file+25>:	lea    -0x44(%ebp),%edi
0x080e384c <write_sha1_file+28>:	lea    -0x24(%ebp),%esi
0x080e384f <write_sha1_file+31>:	mov    %edi,0x4(%esp)
0x080e3853 <write_sha1_file+35>:	mov    %esi,(%esp)
0x080e3856 <write_sha1_file+38>:	mov    0x10(%ebp),%ecx
0x080e3859 <write_sha1_file+41>:	mov    0xc(%ebp),%edx
0x080e385c <write_sha1_file+44>:	mov    0x8(%ebp),%eax
0x080e385f <write_sha1_file+47>:	call   0x80e0350 <write_sha1_file_prepare>
0x080e3864 <write_sha1_file+52>:	test   %ebx,%ebx
0x080e3866 <write_sha1_file+54>:	je     0x80e3885 <write_sha1_file+85>

0x080e3868 <write_sha1_file+56>:	mov    -0x24(%ebp),%eax
0x080e386b <write_sha1_file+59>:	mov    %eax,(%ebx)
0x080e386d <write_sha1_file+61>:	mov    -0x20(%ebp),%eax
0x080e3870 <write_sha1_file+64>:	mov    %eax,0x4(%ebx)
0x080e3873 <write_sha1_file+67>:	mov    -0x1c(%ebp),%eax
0x080e3876 <write_sha1_file+70>:	mov    %eax,0x8(%ebx)
0x080e3879 <write_sha1_file+73>:	mov    -0x18(%ebp),%eax
0x080e387c <write_sha1_file+76>:	mov    %eax,0xc(%ebx)
0x080e387f <write_sha1_file+79>:	mov    -0x14(%ebp),%eax
0x080e3882 <write_sha1_file+82>:	mov    %eax,0x10(%ebx)

  I admit that I am not particularly familiar with Intel machine
  instructions, but I guess that the above 10 mov instructions are the
  result of the compiler inlining hashcpy() in the write_sha1_file()
  function in sha1_file.c.

  Question: would it be possible for the compiler to compile it down to
  just 5 mov instructions if we had used an unsigned 32-bit type?  Or is
  this the best we can reasonably hope for inside the write_sha1_file()
  function?

  I checked the output of "disassemble function_foo" for 3 other
  functions, and as far as I can tell each of them also got 10 mov
  instructions for the inlined hashcpy().

0x080e3885 <write_sha1_file+85>:	mov    %esi,(%esp)
0x080e3888 <write_sha1_file+88>:	call   0x80e3800 <has_sha1_file>
0x080e388d <write_sha1_file+93>:	xor    %edx,%edx
0x080e388f <write_sha1_file+95>:	test   %eax,%eax
0x080e3891 <write_sha1_file+97>:	jne    0x80e38b6 <write_sha1_file+134>
0x080e3893 <write_sha1_file+99>:	mov    0xc(%ebp),%eax
0x080e3896 <write_sha1_file+102>:	mov    %edi,%edx
0x080e3898 <write_sha1_file+104>:	mov    %eax,0x4(%esp)
0x080e389c <write_sha1_file+108>:	mov    -0x10(%ebp),%ecx
0x080e389f <write_sha1_file+111>:	mov    0x8(%ebp),%eax
0x080e38a2 <write_sha1_file+114>:	movl   $0x0,0x8(%esp)
0x080e38aa <write_sha1_file+122>:	mov    %eax,(%esp)
0x080e38ad <write_sha1_file+125>:	mov    %esi,%eax
0x080e38af <write_sha1_file+127>:	call   0x80e1e40 <write_loose_object>
0x080e38b4 <write_sha1_file+132>:	mov    %eax,%edx
0x080e38b6 <write_sha1_file+134>:	mov    %edx,%eax
0x080e38b8 <write_sha1_file+136>:	mov    -0xc(%ebp),%ebx
0x080e38bb <write_sha1_file+139>:	mov    -0x8(%ebp),%esi
0x080e38be <write_sha1_file+142>:	mov    -0x4(%ebp),%edi
0x080e38c1 <write_sha1_file+145>:	leave  
0x080e38c2 <write_sha1_file+146>:	ret    
End of assembler dump.
(gdb) 

  So, maybe the compiler is doing the right thing after all?

  -- kjetil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 21:36               ` Kjetil Barvik
@ 2009-05-01  0:23                 ` Steven Noonan
  2009-05-01  1:25                   ` James Pickens
  2009-05-01  9:19                   ` Kjetil Barvik
  2009-05-01 17:42                 ` Tony Finch
  1 sibling, 2 replies; 39+ messages in thread
From: Steven Noonan @ 2009-05-01  0:23 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: Shawn O. Pearce, git

On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote:
> * "Shawn O. Pearce" <spearce@spearce.org> writes:
> |>      4) The "static inline void hashcpy(....)" in cache.h could then
> |>         maybe be written like this:
> |
> | Its already done as "memcpy(a, b, 20)" which most compilers will
> | inline and probably reduce to 5 word moves anyway.  That's why
> | hashcpy() itself is inline.
>
>  But would the compiler be able to trust that the hashcpy() is always
>  called with correct word alignment on variables a and b?
>
>  I made a test and compiled git with:
>
>    make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all
>
>  compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3
>  CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel
>
>  Then used gdb to get the following:
>
> <snip: disassembly of write_sha1_file; 10 mov instructions for the
> inlined hashcpy()>
>
>  So, maybe the compiler is doing the right thing after all?
>

Well, I just tested this with GCC myself. I used this segment of code:

        #include <memory.h>
        void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
        {
                memcpy(sha_dst, sha_src, 20);
        }

I compiled using Apple's GCC 4.0.1 (note that GCC 4.3 and 4.4 vanilla
yield the same code) with these parameters to get Intel assembly:

        gcc -O2 -arch i386 -march=pentium3 -mtune=pentium3 \
            -fomit-frame-pointer -fno-strict-aliasing -S test.c

and these parameters to get the equivalent PowerPC code:

        gcc -O2 -mcpu=G5 -arch ppc -fomit-frame-pointer \
            -fno-strict-aliasing -S test.c

Intel code:
        .text
        .align 4,0x90
.globl _hashcpy
_hashcpy:
        subl    $12, %esp
        movl    20(%esp), %edx
        movl    16(%esp), %ecx
        movl    (%edx), %eax
        movl    %eax, (%ecx)
        movl    4(%edx), %eax
        movl    %eax, 4(%ecx)
        movl    8(%edx), %eax
        movl    %eax, 8(%ecx)
        movl    12(%edx), %eax
        movl    %eax, 12(%ecx)
        movl    16(%edx), %eax
        movl    %eax, 16(%ecx)
        addl    $12, %esp
        ret
        .subsections_via_symbols


and the PowerPC code:

        .section __TEXT,__text,regular,pure_instructions
        .section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions,32
        .machine ppc970
        .text
        .align 2
        .p2align 4,,15
        .globl _hashcpy
_hashcpy:
        lwz r0,0(r4)
        lwz r2,4(r4)
        lwz r9,8(r4)
        lwz r11,12(r4)
        stw r0,0(r3)
        stw r2,4(r3)
        stw r9,8(r3)
        stw r11,12(r3)
        lwz r0,16(r4)
        stw r0,16(r3)
        blr
        .subsections_via_symbols


So it does look like GCC does what it should and it inlines the memcpy.

A bit off topic, but the results are rather interesting to me, and I
think I see a weakness in how GCC is doing this on Intel. Someone
please correct me if I'm wrong, but the PowerPC code seems much better
because it can yield very high instruction-level parallelism. It does
5 loads and then 5 stores, using 4 registers for temporary storage and
2 registers for pointers.

I realize the Intel x86 architecture is quite constrained in that it
has so few general purpose registers, but there has to be better code
than what GCC emitted above. It seems like the processor would stall
because of the quantity of sequential inter-dependent instructions
that can't be done in parallel (mov to memory that depends on a mov to
eax, etc).

I suppose the code might not be stalling if it's using the maximum
number of registers and doing as many memory accesses as it can per
clock, but based on known details about the architecture, does it seem
to be doing that?

- Steven

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  0:23                 ` Steven Noonan
@ 2009-05-01  1:25                   ` James Pickens
  2009-05-01  9:19                   ` Kjetil Barvik
  1 sibling, 0 replies; 39+ messages in thread
From: James Pickens @ 2009-05-01  1:25 UTC (permalink / raw)
  To: Steven Noonan; +Cc: Kjetil Barvik, Shawn O. Pearce, git

On Thu, Apr 30, 2009, Steven Noonan <steven@uplinklabs.net> wrote:
> A bit off topic, but the results are rather interesting to me, and I
> think I see a weakness in how GCC is doing this on Intel. Someone
> please correct me if I'm wrong, but the PowerPC code seems much better
> because it can yield very high instruction-level parallelism. It does
> 5 loads and then 5 stores, using 4 registers for temporary storage and
> 2 registers for pointers.
>
> I realize the Intel x86 architecture is quite constrained in that it
> has so few general purpose registers, but there has to be better code
> than what GCC emitted above. It seems like the processor would stall
> because of the quantity of sequential inter-dependent instructions
> that can't be done in parallel (mov to memory that depends on a mov to
> eax, etc).

There aren't any unnecessary dependencies.  Take this sequence:

1:        movl    (%edx), %eax
2:        movl    %eax, (%ecx)
3:        movl    4(%edx), %eax
4:        movl    %eax, 4(%ecx)

There are two unavoidable dependencies - #2 depends on #1, and #4
depends on #3.  #3 does not depend on #2, even though they both
use %eax, because #3 is a write to %eax.  So whatever was in %eax
before #3 is irrelevant.  The processor knows this and will use
register renaming to execute #1 and #3 in parallel, and #2 and #4
in parallel.

James

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 20:36           ` Kjetil Barvik
  2009-04-30 20:40             ` Shawn O. Pearce
@ 2009-05-01  5:24             ` Dmitry Potapov
  2009-05-01  9:42               ` Mike Hommey
  1 sibling, 1 reply; 39+ messages in thread
From: Dmitry Potapov @ 2009-05-01  5:24 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: Shawn O. Pearce, git

On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote:
>      4) The "static inline void hashcpy(....)" in cache.h could then
>         maybe be written like this:
> 
>   static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
>   {
>        sha_dst[0] = sha_src[0];
>        sha_dst[1] = sha_src[1];
>        sha_dst[2] = sha_src[2];
>        sha_dst[3] = sha_src[3];
>        sha_dst[4] = sha_src[4];
>   }
> 
> >         And it will hopefully be compiled to just 5 store/move
> >         instructions, or at least hopefully be faster than the current
> >         memcpy() call. But maybe we get more compiled instructions compared
> >         to a single call to memcpy()?

Good compilers can inline memcpy and should produce more efficient code
for the target architecture, which can be faster than manually written
code.  On x86_64, memcpy() requires only 3 load/store operations to copy
a SHA-1, while the above code requires 5 operations.

Dmitry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  0:23                 ` Steven Noonan
  2009-05-01  1:25                   ` James Pickens
@ 2009-05-01  9:19                   ` Kjetil Barvik
  2009-05-01  9:34                     ` Mike Hommey
  1 sibling, 1 reply; 39+ messages in thread
From: Kjetil Barvik @ 2009-05-01  9:19 UTC (permalink / raw)
  To: Steven Noonan; +Cc: Shawn O. Pearce, git

* Steven Noonan <steven@uplinklabs.net> writes:
| On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote:
|> * "Shawn O. Pearce" <spearce@spearce.org> writes:
|> |>      4) The "static inline void hashcpy(....)" in cache.h could then
|> |>         maybe be written like this:
|> |
|> | It's already done as "memcpy(a, b, 20)", which most compilers will
|> | inline and probably reduce to 5 word moves anyway.  That's why
|> | hashcpy() itself is inline.
|>
|>  But would the compiler be able to trust that the hashcpy() is always
|>  called with correct word alignment on variables a and b?

 <snipp>

| Well, I just tested this with GCC myself. I used this segment of code:
|
|         #include <memory.h>
|         void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
|         {
|                 memcpy(sha_dst, sha_src, 20);
|         }

  OK, here is a small test, which maybe shows at least one difference
  between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
  Given the following file, memcpy_test.c:

#include <string.h>
extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src);
void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
{
        memcpy(sha_dst, sha_src, 20);
}
extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src);
void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
{
        memcpy(sha_dst, sha_src, 5);
}

  And, compiled with the following:

    gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c

  It produced the following memcpy_test.s file:

        .file   "memcpy_test.c"
        .text
        .p2align 4,,15
.globl hashcpy_ulong
        .type   hashcpy_ulong, @function
hashcpy_ulong:
        movl    8(%esp), %edx
        movl    4(%esp), %ecx
        movl    (%edx), %eax
        movl    %eax, (%ecx)
        movzbl  4(%edx), %eax
        movb    %al, 4(%ecx)
        ret
        .size   hashcpy_ulong, .-hashcpy_ulong
        .p2align 4,,15
.globl hashcpy_uchar
        .type   hashcpy_uchar, @function
hashcpy_uchar:
        movl    8(%esp), %edx
        movl    4(%esp), %ecx
        movl    (%edx), %eax
        movl    %eax, (%ecx)
        movl    4(%edx), %eax
        movl    %eax, 4(%ecx)
        movl    8(%edx), %eax
        movl    %eax, 8(%ecx)
        movl    12(%edx), %eax
        movl    %eax, 12(%ecx)
        movl    16(%edx), %eax
        movl    %eax, 16(%ecx)
        ret
        .size   hashcpy_uchar, .-hashcpy_uchar
        .ident  "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3"
        .section        .note.GNU-stack,"",@progbits

  So, the "unsigned long" type hashcpy() used 7 instructions, compared
  to 13 for the "unsigned char" type hashcpy().

  Would I be correct in guessing that the hashcpy_ulong() function also
  uses fewer CPU cycles, and so would be faster than hashcpy_uchar()?

  -- kjetil

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  9:19                   ` Kjetil Barvik
@ 2009-05-01  9:34                     ` Mike Hommey
  2009-05-01  9:42                       ` Kjetil Barvik
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Hommey @ 2009-05-01  9:34 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: Steven Noonan, Shawn O. Pearce, git

On Fri, May 01, 2009 at 11:19:04AM +0200, Kjetil Barvik wrote:
> * Steven Noonan <steven@uplinklabs.net> writes:
> | On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote:
> |> * "Shawn O. Pearce" <spearce@spearce.org> writes:
> |> |>      4) The "static inline void hashcpy(....)" in cache.h could then
> |> |>         maybe be written like this:
> |> |
> |> | It's already done as "memcpy(a, b, 20)", which most compilers will
> |> | inline and probably reduce to 5 word moves anyway.  That's why
> |> | hashcpy() itself is inline.
> |>
> |>  But would the compiler be able to trust that the hashcpy() is always
> |>  called with correct word alignment on variables a and b?
> 
>  <snipp>
> 
> | Well, I just tested this with GCC myself. I used this segment of code:
> |
> |         #include <memory.h>
> |         void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
> |         {
> |                 memcpy(sha_dst, sha_src, 20);
> |         }
> 
>   OK, here is a small test, which maybe shows at least one difference
>   between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
>   Given the following file, memcpy_test.c:
> 
> #include <string.h>
> extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src);
> void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
> {
>         memcpy(sha_dst, sha_src, 20);
> }
> extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src);
> void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
> {
>         memcpy(sha_dst, sha_src, 5);
> }
> 
>   And, compiled with the following:
> 
>     gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c
> 
>   It produced the following memcpy_test.s file:
> 
>         <snip: assembly; 7 instructions for hashcpy_ulong(),
>         13 for hashcpy_uchar()>
> 
>   So, the "unsigned long" type hashcpy() used 7 instructions, compared
>   to 13 for the "unsigned char" type hashcpy().

But your "unsigned long" version only copies 5 bytes...

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  9:34                     ` Mike Hommey
@ 2009-05-01  9:42                       ` Kjetil Barvik
  0 siblings, 0 replies; 39+ messages in thread
From: Kjetil Barvik @ 2009-05-01  9:42 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Steven Noonan, Shawn O. Pearce, git

* Mike Hommey <mh@glandium.org> writes:
 <snipp>
| But your "unsigned long" version only copies 5 bytes...

  Yes, that is true...  OK, same result for hashcpy_uchar() and
  hashcpy_ulong() when corrected for this.

  --kjetil, with a brown paper bag

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  5:24             ` Dmitry Potapov
@ 2009-05-01  9:42               ` Mike Hommey
  2009-05-01 10:46                 ` Dmitry Potapov
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Hommey @ 2009-05-01  9:42 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Kjetil Barvik, Shawn O. Pearce, git

On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote:
> On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote:
> >      4) The "static inline void hashcpy(....)" in cache.h could then
> >         maybe be written like this:
> > 
> >   static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
> >   {
> >        sha_dst[0] = sha_src[0];
> >        sha_dst[1] = sha_src[1];
> >        sha_dst[2] = sha_src[2];
> >        sha_dst[3] = sha_src[3];
> >        sha_dst[4] = sha_src[4];
> >   }
> > 
> >         And it will hopefully be compiled to just 5 store/move
> >         instructions, or at least hopefully be faster than the current
> >         memcpy() call. But maybe we get more compiled instructions compared
> >         to a single call to memcpy()?
> 
> Good compilers can inline memcpy and should produce more efficient code
> for the target architecture, which can be faster than manually written
> code.  On x86_64, memcpy() requires only 3 load/store operations to copy
> a SHA-1, while the above code requires 5 operations.

I guess, though, that some enforced alignment could help produce
slightly more efficient code on some architectures (most notably sparc,
which really doesn't like to deal with unaligned words).
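
E.g. something like this (GCC-specific, and just a sketch -- the struct
name is made up):

	/* guarantees the id is word-aligned, so word-sized loads,
	 * stores and compares are safe even on sparc */
	struct aligned_sha1 {
		unsigned char sha1[20] __attribute__((aligned(4)));
	};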

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-05-01  9:42               ` Mike Hommey
@ 2009-05-01 10:46                 ` Dmitry Potapov
  0 siblings, 0 replies; 39+ messages in thread
From: Dmitry Potapov @ 2009-05-01 10:46 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Kjetil Barvik, Shawn O. Pearce, git

On Fri, May 01, 2009 at 11:42:21AM +0200, Mike Hommey wrote:
> On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote:
> > 
> > Good compilers can inline memcpy and should produce more efficient code
> > for the target architecture, which can be faster than manually written
> > code.  On x86_64, memcpy() requires only 3 load/store operations to copy
> > a SHA-1, while the above code requires 5 operations.
> 
> I guess, though, that some enforced alignment could help produce
> slightly more efficient code on some architectures (most notably sparc,
> which really doesn't like to deal with unaligned words).

Agreed. Enforcing good alignment may be useful. My point was that avoiding
memcpy with modern compilers is rather pointless or even harmful, because the
compiler knows more about the target architecture than the author of the code.

Dmitry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 21:36               ` Kjetil Barvik
  2009-05-01  0:23                 ` Steven Noonan
@ 2009-05-01 17:42                 ` Tony Finch
  1 sibling, 0 replies; 39+ messages in thread
From: Tony Finch @ 2009-05-01 17:42 UTC (permalink / raw)
  To: Kjetil Barvik; +Cc: Shawn O. Pearce, git

On Thu, 30 Apr 2009, Kjetil Barvik wrote:
>
>   I admit that I am not particularly familiar with Intel machine
>   instructions, but I guess that the above 10 mov instructions are the
>   result of the compiler inlining hashcpy() in the write_sha1_file()
>   function in sha1_file.c.
>
>   Question: would it be possible for the compiler to compile it down to
>   just 5 mov instructions if we had used an unsigned 32-bit type?

No, because the x86 can't do direct memory-to-memory moves.

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-04-30 14:22     ` Jeff King
@ 2009-05-01 18:43       ` Linus Torvalds
  2009-05-01 19:08         ` Jeff King
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2009-05-01 18:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List



On Thu, 30 Apr 2009, Jeff King wrote:
> 
> Like all generalizations, this is only mostly true. Fast network servers
> with big caches can outperform disks for some loads.

That's _very_ few loads.

It doesn't matter how good a server you have: network filesystems 
invariably suck.

Why? It's not that the network or the server sucks - you can easily find 
beefy NAS setups that have big raids etc and are much faster than most 
local disks.

And they _still_ suck.

Simple reason: caching. It's a lot easier to cache local filesystems. Even 
modern networked filesystems (ie NFSv4) that do a pretty good job on a 
file-per-file basis with delegations etc still tend to suck horribly at 
metadata.

In contrast, a workstation with local filesystems and enough memory to 
cache it well will just be a lot nicer.

> So I wouldn't rule out the possibility of a pleasant VCS experience on a
> network-optimized system backed by beefy servers on a local network.

Hey, you can always throw resources at it.

But no:

> I have never used perforce, but I get the impression that it is more 
> optimized for such a situation.

I doubt it. I suspect git will outperform pretty much anything else in 
that kind of situation too.

One thing that git does - and some other VCS's avoid - is to actually 
stat() the whole working tree in order to not need special per-file "I use 
this file" locking semantics. That can in theory make git slower over a 
network filesystem than such (very broken) alternatives.

If your VCS requires that you mark all files for editing somehow (ie you 
can't just use your favourite editor or scripting to modify files, but 
have to use "p4 edit" to say that you're going to write to the file, and 
the file is otherwise read-only), then such a VCS can - by being annoying 
and in your way - do some things faster than git can.

And yes, perforce does that (the "p4 edit" command is real, and exists).

And yes, in theory that can probably mean that perforce doesn't care so 
much about the metadata caching problem on network filesystems - because 
p4 will maintain some file of its own that contains the metadata.

But I suspect that the git "async stat" ("core.preloadindex") thing means 
that git will kick p4 *ss even on that benchmark, and be a whole lot more 
pleasant to use. Even on networked filesystems.
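
For anyone who wants to try it, it's a single knob:

	$ git config core.preloadindex true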

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 18:43       ` Linus Torvalds
@ 2009-05-01 19:08         ` Jeff King
  2009-05-01 19:13           ` david
                             ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jeff King @ 2009-05-01 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List

On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote:

> > Like all generalizations, this is only mostly true. Fast network servers
> > with big caches can outperform disks for some loads.
> [...]
> In contrast, a workstation with local filesystems and enough memory to 
> cache it well will just be a lot nicer.
> [...]
> > I have never used perforce, but I get the impression that it is more 
> > optimized for such a situation.
> 
> I doubt it. I suspect git will outperform pretty much anything else in 
> that kind of situation too.

Thanks for the analysis; what you said makes sense to me. However, there
is at least one case of somebody complaining that git doesn't scale as
well as perforce for their load:

  http://gandolf.homelinux.org/blog/index.php?id=50

Part of his issue is with git-p4 sucking, which it probably does. But
part of it sounds like he has a gigantic workload (the description of
which sounds silly to me, but I respect the fact that he is probably
describing standard practice among some companies), and that workload is
just a little too gigantic for the workstations to handle. I.e., by
throwing resources at the central server they can avoid throwing as many
at each workstation.

But there are so few details it's hard to say whether he's doing
something else wrong or suboptimally. He does mention Windows, which
IIRC has horrific stat performance.

-Peff

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 19:08         ` Jeff King
@ 2009-05-01 19:13           ` david
  2009-05-01 19:32             ` Nicolas Pitre
  2009-05-01 21:17           ` Daniel Barkalow
  2009-05-01 21:37           ` Linus Torvalds
  2 siblings, 1 reply; 39+ messages in thread
From: david @ 2009-05-01 19:13 UTC (permalink / raw)
  To: Jeff King
  Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List

On Fri, 1 May 2009, Jeff King wrote:

> On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote:
>
>>> Like all generalizations, this is only mostly true. Fast network servers
>>> with big caches can outperform disks for some loads.
>> [...]
>> In contrast, a workstation with local filesystems and enough memory to
>> cache it well will just be a lot nicer.
>> [...]
>>> I have never used perforce, but I get the impression that it is more
>>> optimized for such a situation.
>>
>> I doubt it. I suspect git will outperform pretty much anything else in
>> that kind of situation too.
>
> Thanks for the analysis; what you said makes sense to me. However, there
> is at least one case of somebody complaining that git doesn't scale as
> well as perforce for their load:
>
>  http://gandolf.homelinux.org/blog/index.php?id=50
>
> Part of his issue is with git-p4 sucking, which it probably does. But
> part of it sounds like he has a gigantic workload (the description of
> which sounds silly to me, but I respect the fact that he is probably
> describing standard practice among some companies), and that workload is
> just a little too gigantic for the workstations to handle. I.e., by
> throwing resources at the central server they can avoid throwing as many
> at each workstation.
>
> But there are so few details it's hard to say whether he's doing
> something else wrong or suboptimally. He does mention Windows, which
> IIRC has horrific stat performance.

the key thing for his problem is the support for large binary objects. 
there was discussion here a few weeks ago about ways to handle such things 
without trying to pull them into packs. I suspect that solving those sorts 
of issues would go a long way towards closing the gap on this workload.

there may be issues in doing a clone for repositories that large; I don't 
remember exactly what happens when you have something larger than 4G to 
send in a clone.

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 19:13           ` david
@ 2009-05-01 19:32             ` Nicolas Pitre
  0 siblings, 0 replies; 39+ messages in thread
From: Nicolas Pitre @ 2009-05-01 19:32 UTC (permalink / raw)
  To: david
  Cc: Jeff King, Linus Torvalds, Jakub Narebski, Martin Langhoff,
	Git Mailing List

On Fri, 1 May 2009, david@lang.hm wrote:

> the key thing for his problem is the support for large binary objects. there
> was discussion here a few weeks ago about ways to handle such things without
> trying to pull them into packs. I suspect that solving those sorts of issues
> would go a long way towards closing the gap on this workload.
> 
> there may be issues in doing a clone for repositories that large; I don't
> remember exactly what happens when you have something larger than 4G to send
> in a clone.

If you have files larger than 4G then you definitely need a 64-bit 
machine with plenty of RAM for git to at least be able to cope at the 
moment.

It should be easy to add a config option to determine how big a big 
file is, and store those big files directly in a pack of their own 
instead of as loose objects (for easy pack reuse during a further 
repack), and never attempt to deltify them, etc. etc.  At which point 
git will handle big files just fine even on a 32-bit machine, but it 
won't do more than copy them in and out, and possibly deflate/inflate 
them while at it, but nothing fancier.
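
Something like this, say (the option name is made up here, just
sketching the idea):

	[core]
		# anything bigger than this goes undeltified into its
		# own pack, and is only ever copied and deflated or
		# inflated, never diffed
		bigFileThreshold = 512m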


Nicolas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 19:08         ` Jeff King
  2009-05-01 19:13           ` david
@ 2009-05-01 21:17           ` Daniel Barkalow
  2009-05-01 21:37           ` Linus Torvalds
  2 siblings, 0 replies; 39+ messages in thread
From: Daniel Barkalow @ 2009-05-01 21:17 UTC (permalink / raw)
  To: Jeff King
  Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List

On Fri, 1 May 2009, Jeff King wrote:

> On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote:
> 
> > > Like all generalizations, this is only mostly true. Fast network servers
> > > with big caches can outperform disks for some loads.
> > [...]
> > In contrast, a workstation with local filesystems and enough memory to 
> > cache it well will just be a lot nicer.
> > [...]
> > > I have never used perforce, but I get the impression that it is more 
> > > optimized for such a situation.
> > 
> > I doubt it. I suspect git will outperform pretty much anything else in 
> > that kind of situation too.
> 
> Thanks for the analysis; what you said makes sense to me. However, there
> is at least one case of somebody complaining that git doesn't scale as
> well as perforce for their load:
> 
>   http://gandolf.homelinux.org/blog/index.php?id=50
> 
> Part of his issue is with git-p4 sucking, which it probably does. But
> part of it sounds like he has a gigantic workload (the description of
> which sounds silly to me, but I respect the fact that he is probably
> describing standard practice among some companies), and that workload is
> just a little too gigantic for the workstations to handle. I.e., by
> throwing resources at the central server they can avoid throwing as many
> at each workstation.

I think his problem is that he's trying to replace his p4 repository with 
a git repository, which is a bit like trying to download github, rather 
than a project from github. Perforce is good at dealing with the case 
where people check in a vast quantity of junk that you don't check out.

That is, you can back up your workstation into Perforce, and it won't 
affect anyone's performance if you use a path that's not in the range that 
anybody else checks out. And people actually do that. And Perforce doesn't 
make a distinction between different projects and different branches of 
the same project and different subdirectories of a branch of the same 
project, so it's impossible to tease apart except by company policy.

Git doesn't scale in that it can't do the extremely narrow checkouts you 
need if your repository root directory contains thousands of completely 
unrelated projects, with each branch of each project getting its own 
subdirectory. On the other hand, it does a great job when the data is 
already partitioned into useful repositories.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 19:08         ` Jeff King
  2009-05-01 19:13           ` david
  2009-05-01 21:17           ` Daniel Barkalow
@ 2009-05-01 21:37           ` Linus Torvalds
  2009-05-01 22:11             ` david
  2 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2009-05-01 21:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List



On Fri, 1 May 2009, Jeff King wrote:
> 
> Thanks for the analysis; what you said makes sense to me. However, there
> is at least one case of somebody complaining that git doesn't scale as
> well as perforce for their load:

So we definitely do have scaling issues, there's no question about that. I 
just don't think they are about enterprise network servers vs the more 
workstation-oriented OSS world..

I think they're likely about the whole git mentality of looking at the big 
picture, and then getting swamped by just how _huge_ that picture can be 
if somebody just put the whole world in a single repository..

With perforce, repository maintenance is such a central issue that the 
whole p4 mentality seems to _encourage_ everybody to put everything into 
basically one single p4 repository. And afaik, p4 basically works mostly 
like CVS, ie it really ends up being pretty much oriented to a "one file 
at a time" model.

Which is nice in that you can have a million files, and then only check 
out a few of them - you'll never even _see_ the impact of the other 
999,995 files.

And git obviously doesn't have that kind of model at all. Git 
fundamentally never really looks at less than the whole repo. Even if you 
limit things a bit (ie check out just a portion, or have the history go 
back just a bit), git ends up still always caring about the whole thing, 
and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one 
_huge_ repository. I don't think that part is really fixable, although we 
can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to 
do about huge files. We suck at them, I know. There are work-arounds (like 
not deltaing big objects at all), but they aren't necessarily that great 
either.

I bet we could probably improve git large-file behavior for many common 
cases. Do we have a good test-case of some particular suckiness that is 
actually relevant enough that people might decide to look at it? (And by 
"people" I do mean myself too - but I'd need to be somewhat motivated by 
it: a usage case that we suck at and that is available and relevant.)

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach)
  2009-05-01 21:37           ` Linus Torvalds
@ 2009-05-01 22:11             ` david
  0 siblings, 0 replies; 39+ messages in thread
From: david @ 2009-05-01 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Jakub Narebski, Martin Langhoff, Git Mailing List

On Fri, 1 May 2009, Linus Torvalds wrote:

> I bet we could probably improve git large-file behavior for many common
> cases. Do we have a good test-case of some particular suckiness that is
> actually relevant enough that people might decide to look at it (and by
> "people", I do mean myself too - but I'd need to be somewhat motivated by
> it. A usage case that we suck at and that is available and relevant).

I think that a sane use case that would make sense to people is based on 
the 'game developer' example

they have source code, but they also have large images (and sometimes 
movie clips), where a particular release of the game needs a particular 
set of the images. during development you may change images frequently 
(although most changesets probably change only a few of the images, 
if any)

the images can be large (movies can be very large), and since they are 
already compressed they don't diff or compress well.

David Lang

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Why Git is so fast
  2009-04-30 19:16       ` Alex Riesen
@ 2009-05-04  8:01         ` Andreas Ericsson
  0 siblings, 0 replies; 39+ messages in thread
From: Andreas Ericsson @ 2009-05-04  8:01 UTC (permalink / raw)
  To: Alex Riesen
  Cc: Nicolas Pitre, Jakub Narebski, Martin Langhoff, Git Mailing List

Alex Riesen wrote:
> 2009/4/30 Nicolas Pitre <nico@cam.org>:
>> Yet, this point is misleading, because when people give Git the
>> reputation of being faster, it is certainly from comparisons of
>> operations performed on the same source tree.  Who cares about scenarios
>> for which the tool was not designed?  Those "enterprise configuration
>> management repositories" are indeed not what Git was designed for, but
> 
> Especially when no sane developer will put in his repository the
> toolchain (pre-compiled, for all supported platforms!), all the
> supporting tools (like grep, find, etc., pre-compiled _and_ source), the
> in-house framework (pre-compiled and source, again), firmware
> (pre-compiled and put in the repository weekly), and operating system
> code (pre-compiled, with firmware-specific drivers, updated, you guessed
> it, weekly), and, well, the project itself (Java or C++, and
> documentation in .doc and .xls)...

Well, git could actually handle that just fine if the toolchain were in a
submodule, or even in a separate repository that developers never had to
worry about. Then you'd design a little tool that said "re-create build 8149"
and it would pull the tools used to do that, and the code and the artwork,
and then set to work. It'd be an overnight (or over-weekend) job, but no
man-hours would be spent on it. That's how I'd do it anyway, probably
with the "build" repository as a master repo with "tools", "code" and
"artwork" as submodules to it.

> Now, what kind of self-hating idiot will design a system for that kind of abuse?

No one, naturally, but one might design a system where each folder
in the repository root is considered a repository in its own right,
and then get that more or less for free.

The problem with git for such scenarios is that you have to think
*before* creating the repository, or play silly buggers when importing,
which makes it hard to see how the pieces fit together afterwards.

A tool that could take a repository from a different SCM and create a
master repository and several submodule repositories from it would
probably solve many of the issues gaming companies have if they want
to switch to git - not least because it would open their eyes
to how that sort of separation can be done in git, and why it's
useful. The binary repos can then turn off delta compression (and
zlib compression) for all their blobs using a .gitattributes file,
and things would be several orders of magnitude faster.
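
One caveat: stock git only exposes the delta side of that per path; the
zlib level is a repository-wide setting, not an attribute. For the artwork
repo it would look roughly like:

    # in the artwork repository:
    echo '* -delta' > .gitattributes    # no delta search for any blob
    git config core.compression 0       # zlib level 0, but repo-wide
    git add .gitattributes
    git commit -m 'store artwork blobs verbatim'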

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-05-04  8:02 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-27  8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff
2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
2009-04-28 21:00   ` Robin Rosenberg
2009-04-29  6:55   ` Martin Langhoff
2009-04-29  7:21     ` Jeff King
2009-04-29 20:05       ` Markus Heidelberg
2009-04-29  7:52     ` Cross-Platform Version Control Jakub Narebski
2009-04-29  8:25       ` Martin Langhoff
2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski
2009-04-29  7:54   ` Sitaram Chamarty
2009-04-30 12:17   ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
2009-04-30 12:56     ` Michael Witten
2009-04-30 15:28       ` Why Git is so fast Jakub Narebski
2009-04-30 18:52         ` Shawn O. Pearce
2009-04-30 20:36           ` Kjetil Barvik
2009-04-30 20:40             ` Shawn O. Pearce
2009-04-30 21:36               ` Kjetil Barvik
2009-05-01  0:23                 ` Steven Noonan
2009-05-01  1:25                   ` James Pickens
2009-05-01  9:19                   ` Kjetil Barvik
2009-05-01  9:34                     ` Mike Hommey
2009-05-01  9:42                       ` Kjetil Barvik
2009-05-01 17:42                 ` Tony Finch
2009-05-01  5:24             ` Dmitry Potapov
2009-05-01  9:42               ` Mike Hommey
2009-05-01 10:46                 ` Dmitry Potapov
2009-04-30 18:43       ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce
2009-04-30 14:22     ` Jeff King
2009-05-01 18:43       ` Linus Torvalds
2009-05-01 19:08         ` Jeff King
2009-05-01 19:13           ` david
2009-05-01 19:32             ` Nicolas Pitre
2009-05-01 21:17           ` Daniel Barkalow
2009-05-01 21:37           ` Linus Torvalds
2009-05-01 22:11             ` david
2009-04-30 18:56     ` Nicolas Pitre
2009-04-30 19:16       ` Alex Riesen
2009-05-04  8:01         ` Why Git is so fast Andreas Ericsson
2009-04-30 19:33       ` Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).