* Eric Sink's blog - notes on git, dscms and a "whole product" approach @ 2009-04-27 8:55 Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 0 siblings, 2 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-27 8:55 UTC (permalink / raw) To: Git Mailing List Eric Sink has been working on the (commercial, proprietary) centralised SCM Vault for a while. He's written recently about his explorations around the new crop of DSCMs, and I think it's quite interesting. A quick search of the list archives makes me think it wasn't discussed before. The guy is knowledgeable, and writes quite witty posts -- naturally, there's plenty to disagree on, but I'd like to encourage readers not to nitpick or focus on where Eric is wrong. It is interesting to read where he thinks git and other DSCMs are missing the mark. Maybe he's right, maybe he's wrong, but damn he's interesting :-) So here's the blog - http://www.ericsink.com/ These are the best entry points http://www.ericsink.com/entries/quirky.html http://www.ericsink.com/entries/hg_denzel.html To be frank, I think he's wrong in some details (as he's admittedly only spent limited time with it) but right on the larger picture (large userbases want it integrated and foolproof, bugtracking needs to go distributed alongside the code, git is as powerful^Wdangerous as C). cheers, martin -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff @ 2009-04-28 11:24 ` Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-28 11:24 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Martin Langhoff <martin.langhoff@gmail.com> writes: > Eric Sink has been working on the (commercial, proprietary) centralised > SCM Vault for a while. He's written recently about his explorations > around the new crop of DSCMs, and I think it's quite interesting. A > quick search of the list archives makes me think it wasn't discussed > before. > > The guy is knowledgeable, and writes quite witty posts -- naturally, > there's plenty to disagree on, but I'd like to encourage readers not > to nitpick or focus on where Eric is wrong. It is interesting to read > where he thinks git and other DSCMs are missing the mark. > > Maybe he's right, maybe he's wrong, but damn he's interesting :-) > > So here's the blog - http://www.ericsink.com/ "Here's a blog"... and therefore my dilemma: should I post my reply as a comment on this blog, or should I reply here on the git mailing list? > These are the best entry points Because those two entries are quite different, I'll reply to them separately. 1. "Ten Quirky Issues with Cross-Platform Version Control" > http://www.ericsink.com/entries/quirky.html which is a generic comment about (mainly) using version control in a heterogeneous environment, where different machines have different filesystem limitations. I'll concentrate here on that issue. 2. 
"Mercurial, Subversion, and Wesley Snipes" > http://www.ericsink.com/entries/hg_denzel.html where, paraphrasing, Eric Sink says that he doesn't write about Mercurial and Subversion because they are perfect. Or at least not as controversial (and controversial means interesting). > > To be frank, I think he's wrong in some details (as he's admittedly > only spent limited time with it) but right on the larger picture > (large userbases want it integrated and foolproof, bugtracking needs > to go distributed alongside the code, git is as powerful^Wdangerous as > C). Neither of the blog posts mentioned above touches those issues, BTW... ---------------------------------------------------------------------- Ad 1. "Ten Quirky Issues with Cross-Platform Version Control" Actually those are two issues: troubles with the different limitations of different filesystems, and different handling of line endings in text files on different platforms. Line endings (issue 8.) is in theory and in practice (at least for Git) a non-issue. In theory you should use the project's convention for the end-of-line character in text files, and use a smart editor that can deal (or can be configured to deal) with this issue correctly. In practice this is a matter of correctly setting up core.autocrlf (and in more complicated cases, where more complicated means for git very, very rare, configuring which files are text and which are not). There are a few classes of troubles with filesystems (with filenames). 1. Different limitations on file names (e.g. pathname length), different special characters, different special filenames (if any). Those are issue 2. (special basename PRN on MS Windows), issue 3. (trailing dot, trailing whitespace), issue 4. (pathname and filename length limit), issue 6. (special characters, in this case the colon being the path element delimiter on MacOS, but it is also about special characters like colon, asterisk and question mark on MS Windows) and also issue 7. 
(name that begins with a dash) in Eric Sink's article. The answer is a convention for filenames in a project. Simply DON'T use filenames which can cause problems. There is no way to simply solve this problem in a version control system, although I think if you really, really, really need it you should be able to cobble something together using low-level git tools to have a different filename in the working directory from the one used in the repository (and index). See also David A. Wheeler's essay "Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems" http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html DON'T DO THAT. 2. "Case-insensitive" but "case-preserving" filesystems; the case where some different filenames are equivalent (like 'README' and 'readme' on a case-insensitive filesystem), but are returned as you created them (so if you created 'README', you would get 'README' in a directory listing, but the filesystem would report that 'readme' exists too). This is issue 1. ('README' and 'readme' in the same directory) in Eric Sink's article. The answer is like for the previous issue: don't. Simply DO NOT create files with filenames which differ only in case (like the unfortunate ct_conntrack.h and cn_CONNTRACK.h or similar in the Linux kernel). But I think that even in the case where such an unfortunate incident (two filenames differing only in case) occurs, you can deal with it in Git by using lower-level tools (and editing only one of two such files at once). You would get spurious info about modified files in git-status, though... perhaps that could be improved using the infrastructure created (IIRC) by Linus for dealing with 'insane' filesystems. DON'T DO THAT, SOLVABLE. 3. Non-"case-preserving" filesystems, where the filename as a sequence of bytes differs between what you created and what you get back from the filesystem. 
An example here is the MacOS X filesystem, which accepts filenames in the NFC composed normalized form of Unicode, but stores them internally and returns them in the NFD decomposed form. This is issue 9. (Español being "Espa\u00f1ol" in NFC, but "Espan\u0303ol" in NFD). In this case 'don't do this' might not be an acceptable answer. Perhaps you need non-ASCII characters in filenames. You cannot always choose a filesystem, or specify a mount point option, that makes it a non-problem. I remember that this issue was discussed extensively on the git mailing list, but I don't remember what the conclusion was (besides agreeing that a filesystem that is not "*-preserving" is not a sane filesystem ;). In particular I do not remember if Git can deal with this issue sanely (I remember Linus adding infrastructure for that, but did it solve this problem...). PROBABLY SOLVED. 4. Filesystems which cannot store all SCM-sane metainfo, for example filesystems without support for symbolic links, or without support for the executable permission (executable bit). This is an extension of issue 10. (which is limited to symbolic links) in Eric Sink's article. In Git you have core.fileMode to ignore executable bit differences (you would need to use SCM tools and not filesystem tools to manipulate it), and core.symlinks to be able to check out symlinks as plain text files (again using SCM tools to manipulate them). SOLVED. There is also a mistaken implicit assumption that version control systems do (and should) preserve all metadata. 5. The issue of extra metadata that is not SCM-sane, and which different filesystems can or cannot store. Examples include full Unix permissions, Unix ownership (and the groups a file belongs to), other permission-related metadata such as ACLs, and extra resources tied to a file, such as EAs (extended attributes) on some Linux filesystems or the (in)famous resource fork on MacOS. This is issue 5. (resource fork on MacOS vs. xattrs on Linux) in Eric Sink's article. 
This is not an issue for an SCM (a _source_ code management system) to solve. Preserving extra metadata indiscriminately can cause problems, e.g. with full permissions and ownership. Therefore SCMs preserve only a limited, SCM-sane subset of metadata. If you need to preserve extra metadata, you can use (in good SCMs) hooks for that, as e.g. etckeeper does with metastore (in Git). NOT A PROBLEM. -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
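Point 3 above (NFC vs. NFD) is easy to reproduce without a Mac at hand: Python's unicodedata module shows how the two normalization forms of the same name differ as code-point sequences. This is only an illustration of the Unicode behaviour, not tied to any particular filesystem:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "Espa\u00f1ol")  # precomposed: one code point for the ñ
nfd = unicodedata.normalize("NFD", nfc)             # decomposed: 'n' plus combining tilde

assert nfc != nfd                        # different byte sequences on disk...
assert len(nfc) == 7 and len(nfd) == 8   # NFD is one code point longer
assert unicodedata.normalize("NFC", nfd) == nfc  # ...yet the same text once renormalized
```

A tool comparing a name it wrote (NFC) against what the filesystem returns (NFD) byte-for-byte will see a phantom rename unless it renormalizes first.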
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 1 sibling, 0 replies; 82+ messages in thread From: Robin Rosenberg @ 2009-04-28 21:00 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List Tuesday 28 April 2009 13:24:31, Jakub Narebski <jnareb@gmail.com> wrote: > Line endings (issue 8.) is in theory and in practice (at least for > Git) a non-issue. > > In theory you should use the project's convention for the end-of-line > character in text files, and use a smart editor that can deal (or can be > configured to deal) with this issue correctly. Windows people will disagree. > In practice this is a matter of correctly setting up core.autocrlf > (and in more complicated cases, where more complicated means for git > very, very rare, configuring which files are text and which are not). Which proves it is an issue, or we wouldn't need to tune settings to make it work right. A non-issue is something that "just works" without turning knobs. I have had to think more than once about what the issue was and the right way to solve it. It can get weird: Eclipse on Linux generated files with CRLF, which I happily committed; Git on Windows then happily converted them to LF and determined that HEAD and the index were out of sync, but refused to commit the CRLF->LF change because there was no "diff". You know the fix, but don't tell me it's not an issue. -- robin ^ permalink raw reply [flat|nested] 82+ messages in thread
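For readers unsure what core.autocrlf actually does to content, here is a rough sketch of the two conversions (hypothetical helper names, not Git's real implementation; it assumes repository content is LF-only, which the inbound conversion is there to guarantee):

```python
def to_repository(data: bytes) -> bytes:
    """Roughly what autocrlf does on the way in: normalize CRLF to LF."""
    return data.replace(b"\r\n", b"\n")

def to_worktree(data: bytes) -> bytes:
    """And on checkout (autocrlf=true, e.g. on Windows): expand LF to CRLF.

    Assumes the stored content is LF-only, as to_repository() ensures."""
    return data.replace(b"\n", b"\r\n")

blob = to_repository(b"hello\r\nworld\r\n")
assert blob == b"hello\nworld\n"
assert to_worktree(blob) == b"hello\r\nworld\r\n"
```

Robin's symptom comes from content that entered the repository as CRLF without the inbound normalization: the Windows side then keeps seeing a difference between the stored bytes and the renormalized ones.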
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg @ 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-29 6:55 UTC (permalink / raw) To: Jakub Narebski; +Cc: Git Mailing List On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com> wrote: > DON'T DO THAT. > DON'T DO THAT, SOLVABLE. As I mentioned, Eric is taking the perspective of offering a supported SCM to a large and diverse audience. As such, his notes are interesting not because he's right or because he's wrong. We can be "right" and say "don't do that" if we shrink our audience so that it looks a lot like us. There, fixed. But something tells me that successful tools are -- by definition -- tools that grow past their creators' use. So from Eric's perspective, it is worthwhile to work on all those issues, and get it right for the end user -- support things we don't like, offer foolproof catches and warnings that prevent the user from shooting their lovely toes off to Mars, etc. His perspective is one of commercial licensing, but even if we aren't driven by the "each new user is a new dollar" bit, the long-term hope for git might also be to be widely used and to improve the version-control life of many unsuspecting users. To get there, I suspect we have to understand more of Eric's perspective. That's my 2c. m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-29 6:55 ` Martin Langhoff @ 2009-04-29 7:21 ` Jeff King 2009-04-29 20:05 ` Markus Heidelberg 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Jeff King @ 2009-04-29 7:21 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jakub Narebski, Git Mailing List On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote: > So from Eric's perspective, it is worthwhile to work on all those > issues, and get it right for the end user -- support things we don't > like, offer foolproof catches and warnings that prevent the user from > shooting their lovely toes off to Mars, etc. I read a few of his blog postings. He kept complaining about the features of git that I like the most. :) So one thing I took away from it is that there probably isn't _one_ interface that works for everybody. I can see his arguments about how "add -p" can be dangerous, and how history rewriting can be dangerous. So for some users, blocking those features makes sense. But for other users (myself included), those are critical features that make me _way_ more productive. And I manage the risk that comes from using them as part of my workflow, and it isn't a problem in practice. While part of me is happy that cogito is now dead (not because I didn't think it was good, but because having two sets of tools just seemed to create maintenance and staleness headaches), I do sometimes wonder if we would be better off with several "from scratch" git interfaces based around the plumbing (or even a C library). And I don't just mean simple wrappers around git commands, but whole new interfaces which make decisions like "no history rewriting at all", and try to provide a safer interface based on that. Of course, _I_ wouldn't want to use such an interface. But in theory I could seamlessly interoperate with people who did. 
-Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-29 7:21 ` Jeff King @ 2009-04-29 20:05 ` Markus Heidelberg 0 siblings, 0 replies; 82+ messages in thread From: Markus Heidelberg @ 2009-04-29 20:05 UTC (permalink / raw) To: Jeff King; +Cc: Martin Langhoff, Jakub Narebski, Git Mailing List Jeff King, 29.04.2009: > On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote: > > > So from Eric's perspective, it is worthwhile to work on all those > > issues, and get it right for the end user -- support things we don't > > like, offer foolproof catches and warnings that prevent the user from > > shooting their lovely toes off to Mars, etc. > > I read a few of his blog postings. He kept complaining about the > features of git that I like the most. :) > > I can see his arguments about how > "add -p" can be dangerous Actually, I don't see a special case here in committing a never-compiled/never-tested worktree state. You can do this with every VCS (even one without an index like git's) by just selectively committing files instead of the whole current worktree. Markus ^ permalink raw reply [flat|nested] 82+ messages in thread
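The usual answer to the "add -p commits an untested state" concern is `git stash --keep-index`, which makes the worktree match the index so the test suite runs against exactly what will be committed. A toy model of that dance, with plain dicts standing in for the worktree, index and stash (an illustration of the workflow, not of Git's data structures):

```python
# Plain dicts stand in for Git's real states: path -> content.
worktree = {"f.c": "tested hunk\nuntested hunk\n"}
index = {}

# `git add -p`: stage only the hunk you believe in.
index["f.c"] = "tested hunk\n"

# `git stash --keep-index`: set the unstaged rest aside and make the
# worktree match the index, so tests run on exactly what will be committed.
stash = dict(worktree)
worktree = dict(index)
assert worktree == index  # what you test is what you commit

# After the tests pass and the commit is made, `git stash pop`
# restores the remaining, still-untested changes.
worktree = dict(stash)
assert "untested hunk" in worktree["f.c"]
```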
* Re: Cross-Platform Version Control 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King @ 2009-04-29 7:52 ` Jakub Narebski 2009-04-29 8:25 ` Martin Langhoff 1 sibling, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2009-04-29 7:52 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List On Wed, 29 April 2009, Martin Langhoff wrote: > On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com> > wrote: [I think you cut out a bit too much. Here I resurrected it] JN> 1. Different limitations on file names (e.g. pathname length), JN> different special characters, different special filenames JN> (if any). [...] JN> The answer is a convention for filenames in a project. Simply JN> DON'T use filenames which can cause problems. [...] > > DON'T DO THAT. What could be a proper solution to that, if you do not accept a social rather than a technical restriction? We can have a pre-commit hook that checks filenames for portability (which is deployment-specific, and shouldn't be part of the SCM, except perhaps as an example hook), but it wouldn't help in dealing with non-portable filenames already present, on a filesystem that cannot represent them. If I remember correctly, Git has for some time had a layer which can translate between filenames in the repository and filenames on the filesystem, but I'm not sure if it is generic enough to be a solution to this problem, and currently there is no way to manipulate this mapping, I think. JN> 2. "Case-insensitive" but "case-preserving" filesystems. [...] JN> JN> The answer is like for the previous issue: don't. Simply DO NOT JN> create files with filenames which differ only in case [...] > > DON'T DO THAT, SOLVABLE. By 'solvable' here I mean that you should be able to modify only one of the clashing files at once (checkout 'README', modify, add to index, remove from filesystem, checkout 'readme', modify, etc.), and deal with the annoyances in git-status output. It can be done in Git, with a medium amount of hacking. 
I don't think any other SCM can do even this, and I cannot think of a better, automatic solution that would somehow deal with case-clashing. Note that all bets are off on a case-insensitive and non-preserving filesystem. By the way, wouldn't it be a better solution to use a sane filesystem, rather than to complicate the SCM? ;-) > > As I mentioned, Eric is taking the perspective of offering a supported > SCM to a large and diverse audience. As such, his notes are > interesting not because he's right or because he's wrong. > > We can be "right" and say "don't do that" if we shrink our audience so > that it looks a lot like us. There, fixed. <quote source="Dune by Frank Herbert"> [...] the attitude of the knife — chopping off what's incomplete and saying: "Now it's complete because it's ended here." </quote> I could not resist posting this quote :-P > > But something tells me that successful tools are -- by definition -- > tools that grow past their creators' use. > > So from Eric's perspective, it is worthwhile to work on all those > issues, and get it right for the end user -- support things we don't > like, offer foolproof catches and warnings that prevent the user from > shooting their lovely toes off to Mars, etc. Warnings and catches I can accept; adding complications and corner cases for situations which can be trivially avoided with a bit of social engineering, a.k.a. project guidelines... not so much. I simply cannot see a situation where you _must_ have dangerously unportable file names (trailing dot, trailing whitespace) and case-clashing files... > > His perspective is one of commercial licensing, but even if we aren't > driven by the "each new user is a new dollar" bit, the long-term hope > for git might also be to be widely used and to improve the version > control life of many unsuspecting users. > > To get there, I suspect we have to understand more of Eric's > perspective. > > That's my 2c. 
By the way, I think that the article on cross-platform version control (version control in a heterogeneous environment) is quite a good article. I don't much like the "10 Issues"/"Top 10" style of writing, but the article examines the different ways a heterogeneous environment can trip up an SCM. In my opinion Git does quite well here, where it can, and where the issue is one to be solved by the SCM and not otherwise (extra metadata like the resource fork). -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
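The hook-plus-tool approach mentioned earlier for extra metadata (etckeeper with metastore) boils down to recording the attributes the SCM deliberately ignores into a file that *is* versioned. A toy sketch of the idea, with a hypothetical helper (metastore's actual format is its own binary one, and it also covers xattrs):

```python
import os
import stat

def snapshot_metadata(root):
    """Record owner/group/permissions for files under root, keyed by relative path.

    An SCM rightly ignores these; a pre-commit hook can dump this dict to a
    versioned file, and a post-checkout hook can reapply it."""
    meta = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            meta[os.path.relpath(path, root)] = {
                "mode": oct(stat.S_IMODE(st.st_mode)),
                "uid": st.st_uid,
                "gid": st.st_gid,
            }
    return meta
```

This keeps the "SCM-sane subset" rule intact: the tracked content is just another text file, and only the hooks know it describes permissions.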
* Re: Cross-Platform Version Control 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski @ 2009-04-29 8:25 ` Martin Langhoff 0 siblings, 0 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-29 8:25 UTC (permalink / raw) To: Jakub Narebski; +Cc: Git Mailing List On Wed, Apr 29, 2009 at 9:52 AM, Jakub Narebski <jnareb@gmail.com> wrote: >> > DON'T DO THAT. > > What could be a proper solution to that, if you do not accept a social > rather than a technical restriction? Let's say strong checks for case-sensitivity clashes, leading/trailing dots, utf-8 encoding maladies, etc., switched on by default. And note that to be user-friendly you want most of those checks at 'add' time. If we don't like a particular FS, or we think it is messing up our utf-8 filenames, say it up-front, at clone and checkout time. For example, if the checkout has files with interesting utf-8 names, it'd be reasonable to check for filename mangling. Some things are hard or impossible to prevent - the utf-8 encoding maladies of OSX for example. But it may be detectable on checkout. In short, play on the defensive, for the benefit of users who are not kernel developers. It will piss off kernel & git developers and slow some operations somewhat. It will piss off oldtimers like me. But I'll say git config --global core.trainingwheels no and life will be good. It may be - as Jeff King points out - a matter of a polished git porcelain. We've seen lots of porcelains, but no smooth user-targeted porcelain yet. cheers, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
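The "strong checks at 'add' time" Martin asks for are easy to prototype. A hypothetical portability check (not an existing Git hook) covering the Windows reserved names, trailing dots/spaces, invalid characters and case clashes discussed in this thread might look like:

```python
import re

# Basenames reserved on Windows regardless of extension (PRN.txt is still PRN).
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                    *(f"COM{i}" for i in range(1, 10)),
                    *(f"LPT{i}" for i in range(1, 10))}

def portability_problems(paths):
    """Return (path, reason) pairs for names that will misbehave somewhere."""
    problems = []
    seen_lower = {}
    for p in paths:
        base = p.rsplit("/", 1)[-1]
        stem = base.split(".", 1)[0].upper()
        if stem in WINDOWS_RESERVED:
            problems.append((p, "reserved name on Windows"))
        if base != base.rstrip(". "):
            problems.append((p, "trailing dot or space"))
        if re.search(r'[<>:"\\|?*]', base):
            problems.append((p, "character invalid on Windows"))
        low = p.lower()
        if low in seen_lower and seen_lower[low] != p:
            problems.append((p, f"case-clashes with {seen_lower[low]}"))
        seen_lower.setdefault(low, p)
    return problems
```

Run over the paths being staged, this is exactly the kind of default-on "training wheels" check the message above describes: cheap, deployment-independent, and silent for portable trees.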
* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-28 18:16 ` Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-28 18:16 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Martin Langhoff <martin.langhoff@gmail.com> writes: > Eric Sink has been working on the (commercial, proprietary) centralised > SCM Vault for a while. He's written recently about his explorations > around the new crop of DSCMs, and I think it's quite interesting. A > quick search of the list archives makes me think it wasn't discussed > before. > > The guy is knowledgeable, and writes quite witty posts -- naturally, > there's plenty to disagree on, but I'd like to encourage readers not > to nitpick or focus on where Eric is wrong. It is interesting to read > where he thinks git and other DSCMs are missing the mark. > > Maybe he's right, maybe he's wrong, but damn he's interesting :-) > > So here's the blog - http://www.ericsink.com/ "Here's a blog"... and therefore my dilemma: should I post my reply as a comment on this blog, or should I reply here on the git mailing list? I think I will just add a link to this thread in the GMane mailing list archive for the git mailing list... > These are the best entry points * "Ten Quirky Issues with Cross-Platform Version Control" > http://www.ericsink.com/entries/quirky.html which I have answered in a separate post in this thread * "Mercurial, Subversion, and Wesley Snipes" > http://www.ericsink.com/entries/hg_denzel.html which I will comment on now. 
The 'ES>' prefix means quoting from the above blog post. First there is a list of earlier blog posts, with links, which makes the article in question a good starting point. ES> As part of that effort, I have undertaken an exploration of the ES> DVCS world. Several weeks ago I started writing one blog entry ES> every week, mostly focused on DVCS topics. In chronological ES> order, here they are: ES> ES> * The one where I gripe about Git's index where Eric complains that "git add -p" allows for committing untested changes... not knowing about "git stash --keep-index", and not understanding that committing is (usually) separate from publishing in distributed version control systems (so you can check after committing, and amend the commit if it does not pass the tests). ES> * The one where I whine about the way Git allows developers to ES> rearrange the DAG where Eric seems not to notice that you are strongly encouraged to do 'rearranging the DAG' (rewriting history) _only_ in the unpublished (not made public) part of history. ES> * The one where it looks like I am against DAG-based version ES> control but I'm really not where Eric conflates linear versus merge workflows with the update-before-commit versus commit-then-merge paradigm, not noticing that you can have linear history using the sane commit-update-rebase rather than the unsafe update-before-commit. ES> * The one where I fuss about DVCSes that try to act like ES> centralized tools where the DVCS in question that behaves this way is Bazaar (if I understood this correctly). ES> * The one where I complain that DVCSes have a lousy story when it ES> comes to bug-tracking where Eric correctly notices that distributed version control would not help much if you use a centralized bugtracker, and speculates about the features that a distributed bugtracker would need. A very nice post, in my opinion. 
ES> * The one where I lament that I want to like Darcs but I can't where Eric talks about the difference between parentage in a merge commit (which is needed for good merging) and the "parentage"/weak link in a cherry-picked commit; Git uses a weak link = no link. ES> * The one where I speculate cluelessly about why Git is so fast where Eric guesses instead of asking on the git mailing list or the #git channel... ;-) ES> Along the way, I've been spending some time getting hands-on ES> experience with these tools. I've been using Bazaar for several ES> months. I don't like it very much. I am currently in the process ES> of switching to Git, but I don't expect to like it very much ES> either. Aaaargh... if you expect not to like it very much, I would be very surprised if you find it to your liking... ES> So why don't I write about Mercurial? Because I'm pretty sure I ES> would like it. ES> ES> I chose Bazaar and Git for the experience. But if I were choosing ES> a DVCS as a regular user, I would choose Mercurial. I've used it ES> some, and found it to be incredibly pleasant. It seems like the ES> DVCS that got everything just about right. That's great if you're ES> a user, but for a writer, what's interesting about that? Well, Mercurial IMHO didn't get everything right. Leaving aside implementation issues, like dealing with copies, binary files, and large files, IMHO it got these wrong: * branching (multiple branches per repository) * tags (which should be transferable but non-versioned) -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski @ 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 1 sibling, 0 replies; 82+ messages in thread From: Sitaram Chamarty @ 2009-04-29 7:54 UTC (permalink / raw) To: git On 2009-04-28, Jakub Narebski <jnareb@gmail.com> wrote: > ES> * The one where I lament that I want to like Darcs but I can't > > where Eric talks about the difference between parentage in a merge commit > (which is needed for good merging) and the "parentage"/weak link in > a cherry-picked commit; Git uses a weak link = no link. Well, the patch-id is a sort of "compute on demand" link, so it would qualify as a weak link, especially because git manages to use it during a rebase. I wanted to point that out but I didn't see a link to post comments, so I didn't bother. ^ permalink raw reply [flat|nested] 82+ messages in thread
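The patch-id idea Sitaram refers to can be illustrated with a toy version: hash a diff after stripping the parts (hunk offsets, blob hashes) that change when a patch moves, so the "same" change gets the same id wherever it lands. This is only a sketch of the concept, not `git patch-id`'s exact algorithm:

```python
import hashlib

def toy_patch_id(diff: str) -> str:
    """Hash a diff, ignoring hunk line numbers and blob hashes."""
    lines = []
    for line in diff.splitlines():
        if line.startswith("index "):
            continue          # blob hashes change every time the patch moves
        if line.startswith("@@"):
            line = "@@"       # drop the offsets, keep only the hunk marker
        lines.append(line)
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()

a = "diff --git a/f b/f\nindex 111..222 100644\n@@ -10,2 +10,3 @@\n+new line\n"
b = "diff --git a/f b/f\nindex 333..444 100644\n@@ -50,2 +50,3 @@\n+new line\n"
assert toy_patch_id(a) == toy_patch_id(b)  # same change at different positions
```

This is why the link is "weak": it is recomputed from content on demand, rather than recorded in the commit graph like a merge parent.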
* Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty @ 2009-04-30 12:17 ` Jakub Narebski 2009-04-30 12:56 ` Michael Witten ` (2 more replies) 1 sibling, 3 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 12:17 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Jakub Narebski <jnareb@gmail.com> writes: > Martin Langhoff <martin.langhoff@gmail.com> writes: > > > Eric Sink has been working on the (commercial, proprietary) centralised > > SCM Vault for a while. He's written recently about his explorations > > around the new crop of DSCMs, and I think it's quite interesting. [...] > > So here's the blog - http://www.ericsink.com/ [...] > * "Mercurial, Subversion, and Wesley Snipes" > > http://www.ericsink.com/entries/hg_denzel.html > > which I will comment on now. The 'ES>' prefix means quoting the above > blog post. [...] > ES> * The one where I speculate cluelessly about why Git is so fast > > where Eric guesses instead of asking on the git mailing list or the #git > channel... ;-) This issue is interesting: what features and what design decisions make Git fast? One of the goals of Git was good performance; are we there? All quotes marked 'es> ' below are from the "Why is Git so Fast?" post http://www.ericsink.com/entries/why_is_git_fast.html es> One: Maybe Git is fast simply because it's a DVCS. es> es> There's probably some truth here. One of the main benefits touted es> by the DVCS fanatics is the extra performance you get when es> everything is "local". This is, I think, quite obvious. Accessing memory is faster than accessing disk, which in turn is faster than accessing the network. So if commit and (change)log do not require access to a server via the network, they are that much faster. BTW. 
that is why Subversion stores 'pristine' versions of files alongside the working copy: to make status and diff fast enough to be usable. Which in turn might make an SVN checkout larger than a full Git clone ;-) es> es> But this answer isn't enough. Maybe it explains why Git is faster es> than Subversion, but it doesn't explain why Git is so often es> described as being faster than the other DVCSs. Not only described; see http://git.or.cz/gitwiki/GitBenchmarks (although some, if not most, of those benchmarks are dated, and e.g. Bazaar claims to have much better performance now). es> es> Two: Maybe Git is fast because Linus Torvalds is so smart. [non answer; the details are important] es> Three: Maybe Git is fast because it's written in C instead of one es> of those newfangled higher-level languages. es> es> Nah, probably not. Lots of people have written fast software in es> C#, Java or Python. es> es> And lots of people have written really slow software in es> traditional native languages like C/C++. [...] Well, I guess that access to low-level optimization techniques like mmap is important for performance. But here I am guessing and speculating like Eric did; well, at least I am asking on the proper forum ;-) We have some anecdotal evidence supporting this possibility (which Eric dismisses), namely the fact that pure-Python Bazaar is the slowest of the three most common open source DVCSs (Git, Mercurial, Bazaar), and the fact that parts of Mercurial were written in C for better performance. We can also compare implementations of Git in other, higher-level languages with the reference implementation in C (and shell scripts, and Perl ;-)). For example the most complete, though still not fully complete, Java implementation: JGit. I hope that JGit developers can tell us whether using a higher-level language affects performance, by how much, and which features of a higher-level language cause the decrease in performance. 
Of course we have to take into account the possibility that JGit simply isn't as well optimized because of less manpower. es> es> Four: Maybe Git is fast because being fast is the primary goal for es> Git. [non-answer; the details are important] es> es> Five: Maybe Git is fast because it does less. es> es> One of my favorite recent blog entries is this piece[1] which es> claims that the way to make code faster is to have it do less. es> es> [1] "How to write fast code" by Kas Thomas es> http://asserttrue.blogspot.com/2009/03/how-to-write-fast-code.html [...] es> es> For example, the way you get something in the Git index is you use es> the "git add" command. Git doesn't scan your working copy for es> changed files unless you explicitly tell it to. This can be a es> pretty big performance win for huge trees. Even when you use the es> "remember the timestamp" trick, detecting modified files in a es> really big tree can take a noticeable amount of time. That of course depends on how you compare the performance of different version control systems (so as not to compare apples with oranges). But if you compare e.g. "<scm> commit" with the Git equivalent "git commit -a", the above is simply not true. BTW, when doing a comparison you also have to take care of the reverse, e.g. Git doing more, like calculating and displaying a diffstat by default for merges/pulls. es> es> Or maybe Git's shortcut for handling renames is faster than doing es> them more correctly[2] like Bazaar does. es> es> [2] "Renaming is the killer app of distributed version control" es> http://www.markshuttleworth.com/archives/123 Errr... what? es> Six: Maybe Git is fast because it doesn't use much external code. es> es> Very often, when you are facing a decision to use somebody else's es> code or write it yourself, there is a performance tradeoff. Not es> always, but often. Maybe the third party code is just slower than es> the code you could write yourself if you had time to do it.
Or es> maybe there is an impedance mismatch between the API of the es> external library and your own architecture. es> es> This can happen even when the library is very high quality. For es> example, consider libcurl. This is a great library. Tons of es> people use it. But it does have one problem that will cause es> performance problems for some users: When using libcurl to fetch es> an object, it wants to own the buffer. In some situations, this es> can end up forcing you to use extra memcpys or temporary files. es> The reason all the low level calls like send() and recv() allow es> the caller to own the loop and the buffer is because this is the es> best way to avoid the need to make extra copies of the data on es> disk or in memory. [...] es> es> Maybe Git is fast because every time they faced one of these "buy es> vs. build" choices, they decided to just write it themselves. I don't think so. Rather, the opposite is true. Git uses libcurl for HTTP transport. Git uses zlib for compression. Git uses SHA-1 from OpenSSL or from Mozilla. Git uses a (modified, internal) LibXDiff for (binary) deltifying, for diffs and for merges. OTOH Git includes several micro-libraries of its own: parseopt, strbuf, ALLOC_GROW, etc. NIH syndrome? I don't think so; rather avoiding extra dependencies (bstring vs strbuf), and existing solutions not fitting all needs (popt/argp/getopt vs parse-options). es> Seven: Maybe Git isn't really that fast. es> es> If there is one thing I've learned about version control it's that es> everybody's situation is different. It is quite likely that Git es> is a lot faster for some scenarios than it is for others. es> es> How does Git handle really large trees? Git was designed primarily es> to support the efforts of the Linux kernel developers. A lot of es> people think the Linux kernel is a large tree, but it's really es> not. Many enterprise configuration management repositories are es> FAR bigger than the Linux kernel. cf.
"Why Perforce is more scalable than Git" by Steve Hanov http://gandolf.homelinux.org/blog/index.php?id=50 I don't really know about this. But there is one issue Eric Sink didn't think about: Eight: Git seems fast. ====================== Here I mean concentrating on low _latency_: when git produces more than one page of output (for example "git log"), it tries to output the first page as fast as possible. That means it is the first page, e.g. "git <sth> | head -25 >/dev/null", that has to be fast, and not "git <sth> >/dev/null" itself. Having a progress indicator appear whenever there is a longer wait (quite a fresh feature) also helps the impression of being fast... And what do you think about this? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce 2009-04-30 14:22 ` Jeff King 2009-04-30 18:56 ` Nicolas Pitre 2 siblings, 2 replies; 82+ messages in thread From: Michael Witten @ 2009-04-30 12:56 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > I hope that JGit developers can > tell us whether using a higher-level language affects performance, by how > much, and which features of the higher-level language cause the decrease > in performance. Java is definitely higher-level than C, but you can do some pretty low-level operations on bits and bytes and the like, not to mention the presence of a JIT. My point: I don't think that Java can tell us anything special in this regard. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 12:56 ` Michael Witten @ 2009-04-30 15:28 ` Jakub Narebski 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce 1 sibling, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 15:28 UTC (permalink / raw) To: Michael Witten; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Michael Witten wrote: > On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > > I hope that JGit developers can > > tell us whether using a higher-level language affects performance, by how > > much, and which features of the higher-level language cause the decrease > > in performance. > > Java is definitely higher-level than C, but you can do some pretty low-level > operations on bits and bytes and the like, not to mention the presence > of a JIT. > > My point: I don't think that Java can tell us anything special in this regard. Let's rephrase the question a bit, then: what low-level operations were needed for good performance in JGit? -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski @ 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 20:36 ` Kjetil Barvik 0 siblings, 1 reply; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 18:52 UTC (permalink / raw) To: Jakub Narebski; +Cc: Michael Witten, Martin Langhoff, Git Mailing List Jakub Narebski <jnareb@gmail.com> wrote: > Let's rephrase the question a bit, then: what low-level operations were needed > for good performance in JGit? Aside from the message I just posted: - Avoid String, it's too expensive most of the time. Stick with byte[], and better, stick with data that is a triplet of (byte[], int start, int end) to define a region of data. Yes, it's annoying, as it's 3 values you need to pass around instead of just 1, but it makes a big difference in running time. - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, which can be inlined into an object allocation. - Subclass instead of containing references. We extend ObjectId to attach application data, rather than contain a reference to an ObjectId. Classical Java programming techniques would say this is a violation of encapsulation. But it gets us the same memory impact that C Git gets by saying: struct appdata { unsigned char sha1[20]; .... } - We're hurting dearly for not having more efficient access to the pack-*.pack file data. mmap in Java is crap. We implement our own page buffer, reading in blocks of 8192 bytes at a time and holding them in our own cache. Really, we should write our own mmap library as an optional JNI thing, and tie it into libz so we can efficiently run inflate() off the pack data directly. - We're hurting dearly for not having more efficient access to the pack-*.idx files. Again, with no mmap we read the entire bloody index into memory.
But since you won't touch most of it, we keep it in a large byte[]; but since you are searching with an ObjectId (5 ints), we pay a conversion price on every search step, where we have to copy from the large byte[] into 5 local ints and then compare with the ObjectId. It's an overhead C Git doesn't have to deal with. Anyway. I'm still just amazed at how well JGit runs given these limitations. I guess that's Moore's Law for you. 10 years ago, JGit wouldn't have been practical. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
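[Editorial note: the conversion overhead Shawn describes is easiest to see from the C side. Below is a simplified sketch, not Git's actual pack-index code: with the index bytes directly addressable (e.g. via mmap), C can binary-search a sorted table of 20-byte object ids using plain memcmp, with no per-step unpacking into ints.]

```c
#include <string.h>

#define SHA1_RAWSZ 20

/* Simplified sketch (not Git's actual code): binary search for a
 * 20-byte object id in a sorted table of raw ids.  Each probe is a
 * plain memcmp over the bytes -- the conversion-free comparison that
 * a 5-int ObjectId representation has to pay extra for. */
static int find_sha1(const unsigned char *table, int nr,
                     const unsigned char *sha1)
{
    int lo = 0, hi = nr;
    while (lo < hi) {
        int mi = lo + (hi - lo) / 2;
        int cmp = memcmp(table + (size_t)mi * SHA1_RAWSZ, sha1, SHA1_RAWSZ);
        if (cmp == 0)
            return mi;          /* index of the matching entry */
        if (cmp < 0)
            lo = mi + 1;
        else
            hi = mi;
    }
    return -1;                  /* not present */
}
```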
* Re: Why Git is so fast 2009-04-30 18:52 ` Shawn O. Pearce @ 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce 2009-05-01 5:24 ` Dmitry Potapov 0 siblings, 2 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-04-30 20:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git * "Shawn O. Pearce" <spearce@spearce.org> writes: <snipp> | - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, | which can be inlined into an object allocation. What do people think about doing something similar in C Git? That is, convert the current internal representation of the SHA-1 from "unsigned char sha1[20]" to "unsigned long sha1[5]"? OK, I currently see 2 problems with it: 1) Will the type "unsigned long" always be unsigned 32 bit on all platforms on all computers? Do we need a "uint32_t" thing? 2) Can we get in trouble because of differences between little- and big-endian machines? And, similarly, I can see or guess the following would be positive with this change: 3) From a SHA-1 library I worked with some time ago, I noticed that it internally used the type "unsigned long arr[5]", so it may be possible to get some shortcuts or maybe speedups here, if we want to do it. 4) The "static inline void hashcpy(....)" in cache.h could then maybe be written like this: static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) { sha_dst[0] = sha_src[0]; sha_dst[1] = sha_src[1]; sha_dst[2] = sha_src[2]; sha_dst[3] = sha_src[3]; sha_dst[4] = sha_src[4]; } And hopefully it will be compiled to just 5 store/move instructions, or at least hopefully be faster than the current memcpy() call. But maybe we get more compiled instructions compared to a single call to memcpy()? 5) Similarly, as in 4), for the other SHA-1 related hash functions near hashcpy() in cache.h. OK, just some thoughts. Sorry if this has already been discussed, but I could not find anything about it after a simple Google search.
-- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
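[Editorial note: Kjetil's question 1) is what <stdint.h> answers: "unsigned long" is not guaranteed to be 32 bits (it is 64 on most 64-bit Unix ABIs), but uint32_t is exact. A minimal sketch of the proposal using a fixed-width type follows; the names are hypothetical, not Git's code.]

```c
#include <stdint.h>

/* Hypothetical sketch of the proposal above (not Git's actual code):
 * hold a SHA-1 as five fixed-width 32-bit words.  uint32_t settles
 * question 1); question 2) -- endianness -- still matters the moment
 * the words are compared or ordered against the on-disk byte order. */
typedef struct {
    uint32_t w[5];
} sha1_words;

static inline void hashcpy_words(sha1_words *dst, const sha1_words *src)
{
    dst->w[0] = src->w[0];
    dst->w[1] = src->w[1];
    dst->w[2] = src->w[2];
    dst->w[3] = src->w[3];
    dst->w[4] = src->w[4];
}
```

Wrapping the array in a struct also lets plain assignment (`*dst = *src;`) do the copy, which a compiler handles at least as well as the explicit five stores.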
* Re: Why Git is so fast 2009-04-30 20:36 ` Kjetil Barvik @ 2009-04-30 20:40 ` Shawn O. Pearce 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 5:24 ` Dmitry Potapov 1 sibling, 1 reply; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 20:40 UTC (permalink / raw) To: Kjetil Barvik; +Cc: git Kjetil Barvik <barvik@broadpark.no> wrote: > * "Shawn O. Pearce" <spearce@spearce.org> writes: > <snipp> > | - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, > | which can be inlined into an object allocation. > > What do people think about doing something similar in C Git? > > That is, convert the current internal representation of the SHA-1 from > "unsigned char sha1[20]" to "unsigned long sha1[5]"? It's not worth the code churn. > OK, I currently see 2 problems with it: > > 1) Will the type "unsigned long" always be unsigned 32 bit on all > platforms on all computers? Do we need a "uint32_t" thing? Yeah, "unsigned long" isn't always 32 bits. So we'd need to use uint32_t. Which we already use elsewhere, but still. > 2) Can we get in trouble because of differences between little- and > big-endian machines? Yes, especially if compare were implemented using native uint32_t compares and the processor was little-endian. > 4) The "static inline void hashcpy(....)" in cache.h could then > maybe be written like this: It's already done as "memcpy(a, b, 20)", which most compilers will inline and probably reduce to 5 word moves anyway. That's why hashcpy() itself is inline. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 20:40 ` Shawn O. Pearce @ 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan 2009-05-01 17:42 ` Tony Finch 0 siblings, 2 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-04-30 21:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git * "Shawn O. Pearce" <spearce@spearce.org> writes: |> 4) The "static inline void hashcpy(....)" in cache.h could then |> maybe be written like this: | | Its already done as "memcpy(a, b, 20)" which most compilers will | inline and probably reduce to 5 word moves anyway. That's why | hashcpy() itself is inline. But would the compiler be able to trust that the hashcpy() is always called with correct word alignment on variables a and b? I made a test and compiled git with: make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3 CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel Then used gdb to get the following: (gdb) disassemble write_sha1_file Dump of assembler code for function write_sha1_file: 0x080e3830 <write_sha1_file+0>: push %ebp 0x080e3831 <write_sha1_file+1>: mov %esp,%ebp 0x080e3833 <write_sha1_file+3>: sub $0x58,%esp 0x080e3836 <write_sha1_file+6>: lea -0x10(%ebp),%eax 0x080e3839 <write_sha1_file+9>: mov %ebx,-0xc(%ebp) 0x080e383c <write_sha1_file+12>: mov %esi,-0x8(%ebp) 0x080e383f <write_sha1_file+15>: mov %edi,-0x4(%ebp) 0x080e3842 <write_sha1_file+18>: mov 0x14(%ebp),%ebx 0x080e3845 <write_sha1_file+21>: mov %eax,0x8(%esp) 0x080e3849 <write_sha1_file+25>: lea -0x44(%ebp),%edi 0x080e384c <write_sha1_file+28>: lea -0x24(%ebp),%esi 0x080e384f <write_sha1_file+31>: mov %edi,0x4(%esp) 0x080e3853 <write_sha1_file+35>: mov %esi,(%esp) 0x080e3856 <write_sha1_file+38>: mov 0x10(%ebp),%ecx 0x080e3859 <write_sha1_file+41>: mov 0xc(%ebp),%edx 0x080e385c <write_sha1_file+44>: mov 0x8(%ebp),%eax 0x080e385f <write_sha1_file+47>: call 0x80e0350 <write_sha1_file_prepare> 0x080e3864 
<write_sha1_file+52>: test %ebx,%ebx 0x080e3866 <write_sha1_file+54>: je 0x80e3885 <write_sha1_file+85> 0x080e3868 <write_sha1_file+56>: mov -0x24(%ebp),%eax 0x080e386b <write_sha1_file+59>: mov %eax,(%ebx) 0x080e386d <write_sha1_file+61>: mov -0x20(%ebp),%eax 0x080e3870 <write_sha1_file+64>: mov %eax,0x4(%ebx) 0x080e3873 <write_sha1_file+67>: mov -0x1c(%ebp),%eax 0x080e3876 <write_sha1_file+70>: mov %eax,0x8(%ebx) 0x080e3879 <write_sha1_file+73>: mov -0x18(%ebp),%eax 0x080e387c <write_sha1_file+76>: mov %eax,0xc(%ebx) 0x080e387f <write_sha1_file+79>: mov -0x14(%ebp),%eax 0x080e3882 <write_sha1_file+82>: mov %eax,0x10(%ebx) I admit that I am not particularly familiar with Intel machine instructions, but I guess that the above 10 mov instructions are the result of the compiled inline hashcpy() in the write_sha1_file() function in sha1_file.c. Question: would it be possible for the compiler to compile it down to just 5 mov instructions if we had used an unsigned 32-bit type? Or is this the best we can reasonably hope for inside the write_sha1_file() function? I checked the output of "disassemble function_foo" for 3 other functions, and it seems that those 3 functions also got 10 mov instructions for the inline hashcpy(), as far as I can tell.
0x080e3885 <write_sha1_file+85>: mov %esi,(%esp) 0x080e3888 <write_sha1_file+88>: call 0x80e3800 <has_sha1_file> 0x080e388d <write_sha1_file+93>: xor %edx,%edx 0x080e388f <write_sha1_file+95>: test %eax,%eax 0x080e3891 <write_sha1_file+97>: jne 0x80e38b6 <write_sha1_file+134> 0x080e3893 <write_sha1_file+99>: mov 0xc(%ebp),%eax 0x080e3896 <write_sha1_file+102>: mov %edi,%edx 0x080e3898 <write_sha1_file+104>: mov %eax,0x4(%esp) 0x080e389c <write_sha1_file+108>: mov -0x10(%ebp),%ecx 0x080e389f <write_sha1_file+111>: mov 0x8(%ebp),%eax 0x080e38a2 <write_sha1_file+114>: movl $0x0,0x8(%esp) 0x080e38aa <write_sha1_file+122>: mov %eax,(%esp) 0x080e38ad <write_sha1_file+125>: mov %esi,%eax 0x080e38af <write_sha1_file+127>: call 0x80e1e40 <write_loose_object> 0x080e38b4 <write_sha1_file+132>: mov %eax,%edx 0x080e38b6 <write_sha1_file+134>: mov %edx,%eax 0x080e38b8 <write_sha1_file+136>: mov -0xc(%ebp),%ebx 0x080e38bb <write_sha1_file+139>: mov -0x8(%ebp),%esi 0x080e38be <write_sha1_file+142>: mov -0x4(%ebp),%edi 0x080e38c1 <write_sha1_file+145>: leave 0x080e38c2 <write_sha1_file+146>: ret End of assembler dump. (gdb) So, maybe the compiler is doing the right thing after all? -- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 21:36 ` Kjetil Barvik @ 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 17:42 ` Tony Finch 1 sibling, 2 replies; 82+ messages in thread From: Steven Noonan @ 2009-05-01 0:23 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: > * "Shawn O. Pearce" <spearce@spearce.org> writes: > |> 4) The "static inline void hashcpy(....)" in cache.h could then > |> maybe be written like this: > | > | Its already done as "memcpy(a, b, 20)" which most compilers will > | inline and probably reduce to 5 word moves anyway. That's why > | hashcpy() itself is inline. > > But would the compiler be able to trust that the hashcpy() is always > called with correct word alignment on variables a and b? > > I made a test and compiled git with: > > make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all > > compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3 > CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel > > Then used gdb to get the following: > > (gdb) disassemble write_sha1_file > Dump of assembler code for function write_sha1_file: > 0x080e3830 <write_sha1_file+0>: push %ebp > 0x080e3831 <write_sha1_file+1>: mov %esp,%ebp > 0x080e3833 <write_sha1_file+3>: sub $0x58,%esp > 0x080e3836 <write_sha1_file+6>: lea -0x10(%ebp),%eax > 0x080e3839 <write_sha1_file+9>: mov %ebx,-0xc(%ebp) > 0x080e383c <write_sha1_file+12>: mov %esi,-0x8(%ebp) > 0x080e383f <write_sha1_file+15>: mov %edi,-0x4(%ebp) > 0x080e3842 <write_sha1_file+18>: mov 0x14(%ebp),%ebx > 0x080e3845 <write_sha1_file+21>: mov %eax,0x8(%esp) > 0x080e3849 <write_sha1_file+25>: lea -0x44(%ebp),%edi > 0x080e384c <write_sha1_file+28>: lea -0x24(%ebp),%esi > 0x080e384f <write_sha1_file+31>: mov %edi,0x4(%esp) > 0x080e3853 <write_sha1_file+35>: mov %esi,(%esp) > 0x080e3856 <write_sha1_file+38>: mov 0x10(%ebp),%ecx > 
0x080e3859 <write_sha1_file+41>: mov 0xc(%ebp),%edx > 0x080e385c <write_sha1_file+44>: mov 0x8(%ebp),%eax > 0x080e385f <write_sha1_file+47>: call 0x80e0350 <write_sha1_file_prepare> > 0x080e3864 <write_sha1_file+52>: test %ebx,%ebx > 0x080e3866 <write_sha1_file+54>: je 0x80e3885 <write_sha1_file+85> > > 0x080e3868 <write_sha1_file+56>: mov -0x24(%ebp),%eax > 0x080e386b <write_sha1_file+59>: mov %eax,(%ebx) > 0x080e386d <write_sha1_file+61>: mov -0x20(%ebp),%eax > 0x080e3870 <write_sha1_file+64>: mov %eax,0x4(%ebx) > 0x080e3873 <write_sha1_file+67>: mov -0x1c(%ebp),%eax > 0x080e3876 <write_sha1_file+70>: mov %eax,0x8(%ebx) > 0x080e3879 <write_sha1_file+73>: mov -0x18(%ebp),%eax > 0x080e387c <write_sha1_file+76>: mov %eax,0xc(%ebx) > 0x080e387f <write_sha1_file+79>: mov -0x14(%ebp),%eax > 0x080e3882 <write_sha1_file+82>: mov %eax,0x10(%ebx) > > I admit that I am not particular familar with intel machine > instructions, but I guess that the above 10 mov instructions is the > result for the compiled inline hashcpy() in the write_sha1_file() > function in sha1_file.c > > Question: would it be possible for the compiler to compile it down to > just 5 mov instructions if we had used unsigned 32 bits type? Or is > this the best we can reasonable hope for inside the write_sha1_file() > function? > > I checked 3 other output of "disassemble function_foo", and it seems > that those 3 functions I checked got 10 mov instructions for the > inline hashcpy(), as far as I can tell. 
> > 0x080e3885 <write_sha1_file+85>: mov %esi,(%esp) > 0x080e3888 <write_sha1_file+88>: call 0x80e3800 <has_sha1_file> > 0x080e388d <write_sha1_file+93>: xor %edx,%edx > 0x080e388f <write_sha1_file+95>: test %eax,%eax > 0x080e3891 <write_sha1_file+97>: jne 0x80e38b6 <write_sha1_file+134> > 0x080e3893 <write_sha1_file+99>: mov 0xc(%ebp),%eax > 0x080e3896 <write_sha1_file+102>: mov %edi,%edx > 0x080e3898 <write_sha1_file+104>: mov %eax,0x4(%esp) > 0x080e389c <write_sha1_file+108>: mov -0x10(%ebp),%ecx > 0x080e389f <write_sha1_file+111>: mov 0x8(%ebp),%eax > 0x080e38a2 <write_sha1_file+114>: movl $0x0,0x8(%esp) > 0x080e38aa <write_sha1_file+122>: mov %eax,(%esp) > 0x080e38ad <write_sha1_file+125>: mov %esi,%eax > 0x080e38af <write_sha1_file+127>: call 0x80e1e40 <write_loose_object> > 0x080e38b4 <write_sha1_file+132>: mov %eax,%edx > 0x080e38b6 <write_sha1_file+134>: mov %edx,%eax > 0x080e38b8 <write_sha1_file+136>: mov -0xc(%ebp),%ebx > 0x080e38bb <write_sha1_file+139>: mov -0x8(%ebp),%esi > 0x080e38be <write_sha1_file+142>: mov -0x4(%ebp),%edi > 0x080e38c1 <write_sha1_file+145>: leave > 0x080e38c2 <write_sha1_file+146>: ret > End of assembler dump. > (gdb) > > So, maybe the compiler is doing the right thing after all? > Well, I just tested this with GCC myself. 
I used this segment of code: #include <memory.h> void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) { memcpy(sha_dst, sha_src, 20); } I compiled using Apple's GCC 4.0.1 (note that GCC 4.3 and 4.4 vanilla yield the same code) with these parameters to get Intel assembly: gcc -O2 -arch i386 -march=pentium3 -mtune=pentium3 -fomit-frame-pointer -fno-strict-aliasing -S test.c and these parameters to get the equivalent PowerPC code: gcc -O2 -mcpu=G5 -arch ppc -fomit-frame-pointer -fno-strict-aliasing -S test.c Intel code: .text .align 4,0x90 .globl _hashcpy _hashcpy: subl $12, %esp movl 20(%esp), %edx movl 16(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movl 4(%edx), %eax movl %eax, 4(%ecx) movl 8(%edx), %eax movl %eax, 8(%ecx) movl 12(%edx), %eax movl %eax, 12(%ecx) movl 16(%edx), %eax movl %eax, 16(%ecx) addl $12, %esp ret .subsections_via_symbols and the PowerPC code: .section __TEXT,__text,regular,pure_instructions .section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions,32 .machine ppc970 .text .align 2 .p2align 4,,15 .globl _hashcpy _hashcpy: lwz r0,0(r4) lwz r2,4(r4) lwz r9,8(r4) lwz r11,12(r4) stw r0,0(r3) stw r2,4(r3) stw r9,8(r3) stw r11,12(r3) lwz r0,16(r4) stw r0,16(r3) blr .subsections_via_symbols So it does look like GCC does what it should and it inlines the memcpy. A bit off topic, but the results are rather interesting to me, and I think I see a weakness in how GCC is doing this on Intel. Someone please correct me if I'm wrong, but the PowerPC code seems much better because it can yield very high instruction-level parallelism. It does 5 loads and then 5 stores, using 4 registers for temporary storage and 2 registers for pointers. I realize the Intel x86 architecture is quite constrained in that it has so few general purpose registers, but there has to be better code than what GCC emitted above. 
It seems like the processor would stall because of the quantity of sequential inter-dependent instructions that can't be done in parallel (a mov to memory that depends on a mov to %eax, etc). I suppose the code might not be stalling if it's using the maximum number of registers and doing as many memory accesses as it can per clock, but based on known details about the architecture, does it seem to be doing that? - Steven ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 0:23 ` Steven Noonan @ 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 1 sibling, 0 replies; 82+ messages in thread From: James Pickens @ 2009-05-01 1:25 UTC (permalink / raw) To: Steven Noonan; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Thu, Apr 30, 2009, Steven Noonan <steven@uplinklabs.net> wrote: > A bit off topic, but the results are rather interesting to me, and I > think I see a weakness in how GCC is doing this on Intel. Someone > please correct me if I'm wrong, but the PowerPC code seems much better > because it can yield very high instruction-level parallelism. It does > 5 loads and then 5 stores, using 4 registers for temporary storage and > 2 registers for pointers. > > I realize the Intel x86 architecture is quite constrained in that it > has so few general purpose registers, but there has to be better code > than what GCC emitted above. It seems like the processor would stall > because of the quantity of sequential inter-dependent instructions > that can't be done in parallel (mov to memory that depends on a mov to > eax, etc). There aren't any unnecessary dependencies. Take this sequence: 1: movl (%edx), %eax 2: movl %eax, (%ecx) 3: movl 4(%edx), %eax 4: movl %eax, 4(%ecx) There are two unavoidable dependencies - #2 depends on #1, and #4 depends on #3. #3 does not depend on #2, even though they both use %eax, because #3 is a write to %eax. So whatever was in %eax before #3 is irrelevant. The processor knows this and will use register renaming to execute #1 and #3 in parallel, and #2 and #4 in parallel. James ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens @ 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 9:34 ` Mike Hommey 1 sibling, 1 reply; 82+ messages in thread From: Kjetil Barvik @ 2009-05-01 9:19 UTC (permalink / raw) To: Steven Noonan; +Cc: Shawn O. Pearce, git * Steven Noonan <steven@uplinklabs.net> writes: | On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: |> * "Shawn O. Pearce" <spearce@spearce.org> writes: |> |> 4) The "static inline void hashcpy(....)" in cache.h could then |> |> maybe be written like this: |> | |> | It's already done as "memcpy(a, b, 20)", which most compilers will |> | inline and probably reduce to 5 word moves anyway. That's why |> | hashcpy() itself is inline. |> |> But would the compiler be able to trust that hashcpy() is always |> called with correct word alignment on variables a and b? <snipp> | Well, I just tested this with GCC myself. I used this segment of code: | | #include <memory.h> | void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) | { | memcpy(sha_dst, sha_src, 20); | } OK, here is a small test, which maybe shows at least one difference between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
Given the following file, memcpy_test.c: #include <string.h> extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src); void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src) { memcpy(sha_dst, sha_src, 20); } extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src); void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src) { memcpy(sha_dst, sha_src, 5); } And, compiled with the following: gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c It produced the following memcpy_test.s file: .file "memcpy_test.c" .text .p2align 4,,15 .globl hashcpy_ulong .type hashcpy_ulong, @function hashcpy_ulong: movl 8(%esp), %edx movl 4(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movzbl 4(%edx), %eax movb %al, 4(%ecx) ret .size hashcpy_ulong, .-hashcpy_ulong .p2align 4,,15 .globl hashcpy_uchar .type hashcpy_uchar, @function hashcpy_uchar: movl 8(%esp), %edx movl 4(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movl 4(%edx), %eax movl %eax, 4(%ecx) movl 8(%edx), %eax movl %eax, 8(%ecx) movl 12(%edx), %eax movl %eax, 12(%ecx) movl 16(%edx), %eax movl %eax, 16(%ecx) ret .size hashcpy_uchar, .-hashcpy_uchar .ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3" .section .note.GNU-stack,"",@progbits So, the "unsigned long" type hashcpy() used 7 instructions, compared to 13 for the "unsigned char" type hashcpy(). Would I be guessing correctly that the hashcpy_ulong() function will also use fewer CPU cycles, and thus be faster than hashcpy_uchar()? -- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:19 ` Kjetil Barvik @ 2009-05-01 9:34 ` Mike Hommey 2009-05-01 9:42 ` Kjetil Barvik 0 siblings, 1 reply; 82+ messages in thread From: Mike Hommey @ 2009-05-01 9:34 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Steven Noonan, Shawn O. Pearce, git On Fri, May 01, 2009 at 11:19:04AM +0200, Kjetil Barvik wrote: > * Steven Noonan <steven@uplinklabs.net> writes: > | On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: > |> * "Shawn O. Pearce" <spearce@spearce.org> writes: > |> |> 4) The "static inline void hashcpy(....)" in cache.h could then > |> |> maybe be written like this: > |> | > |> | Its already done as "memcpy(a, b, 20)" which most compilers will > |> | inline and probably reduce to 5 word moves anyway. That's why > |> | hashcpy() itself is inline. > |> > |> But would the compiler be able to trust that the hashcpy() is always > |> called with correct word alignment on variables a and b? > > <snipp> > > | Well, I just tested this with GCC myself. I used this segment of code: > | > | #include <memory.h> > | void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) > | { > | memcpy(sha_dst, sha_src, 20); > | } > > OK, here is a smal test, which maybe shows at least one difference > between using "unsigned char sha1[20]" and "unsigned long sha1[5]". 
> Given the following file, memcpy_test.c: > > #include <string.h> > extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src); > void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src) > { > memcpy(sha_dst, sha_src, 20); > } > extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src); > void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src) > { > memcpy(sha_dst, sha_src, 5); > } > > And, compiled with the following: > > gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c > > It produced the following memcpy_test.s file: > > .file "memcpy_test.c" > .text > .p2align 4,,15 > .globl hashcpy_ulong > .type hashcpy_ulong, @function > hashcpy_ulong: > movl 8(%esp), %edx > movl 4(%esp), %ecx > movl (%edx), %eax > movl %eax, (%ecx) > movzbl 4(%edx), %eax > movb %al, 4(%ecx) > ret > .size hashcpy_ulong, .-hashcpy_ulong > .p2align 4,,15 > .globl hashcpy_uchar > .type hashcpy_uchar, @function > hashcpy_uchar: > movl 8(%esp), %edx > movl 4(%esp), %ecx > movl (%edx), %eax > movl %eax, (%ecx) > movl 4(%edx), %eax > movl %eax, 4(%ecx) > movl 8(%edx), %eax > movl %eax, 8(%ecx) > movl 12(%edx), %eax > movl %eax, 12(%ecx) > movl 16(%edx), %eax > movl %eax, 16(%ecx) > ret > .size hashcpy_uchar, .-hashcpy_uchar > .ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3" > .section .note.GNU-stack,"",@progbits > > So, the "unsigned long" type hashcpy() used 7 instructions, compared > to 13 for the "unsigned char" type hascpy(). But your "unsigned long" version only copies 5 bytes... Mike ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:34 ` Mike Hommey @ 2009-05-01 9:42 ` Kjetil Barvik 0 siblings, 0 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-05-01 9:42 UTC (permalink / raw) To: Mike Hommey; +Cc: Steven Noonan, Shawn O. Pearce, git * Mike Hommey <mh@glandium.org> writes: <snipp> | But your "unsigned long" version only copies 5 bytes... Yes, that is true... OK, same result for hashcpy_uchar() and hashcpy_ulong() when corrected for this. --kjetil, with a brown paper bag ^ permalink raw reply [flat|nested] 82+ messages in thread
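[Editorial note: for reference, the corrected test case from the exchange above would read as follows — a minimal sketch reusing Kjetil's function names (git itself only has the unsigned char form). memcpy's length argument is in bytes, so both variants must pass 20; the original "ulong" variant passed 5 and therefore copied only 5 bytes.]

```c
#include <string.h>

/* Corrected version of the memcpy_test.c above: both variants now copy
 * the full 20-byte SHA-1. Function names follow Kjetil's example. */
void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
{
    memcpy(sha_dst, sha_src, 20);
}

void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
{
    memcpy(sha_dst, sha_src, 20);   /* 20 bytes, not 5 */
}
```

With the length corrected, both variants copy identical data, matching Kjetil's follow-up that the generated code comes out the same.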
* Re: Why Git is so fast 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan @ 2009-05-01 17:42 ` Tony Finch 1 sibling, 0 replies; 82+ messages in thread From: Tony Finch @ 2009-05-01 17:42 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, 30 Apr 2009, Kjetil Barvik wrote: > > I admit that I am not particularly familiar with intel machine > instructions, but I guess that the above 10 mov instructions is the > result for the compiled inline hashcpy() in the write_sha1_file() > function in sha1_file.c > > Question: would it be possible for the compiler to compile it down to > just 5 mov instructions if we had used unsigned 32 bits type? No, because the x86 can't do direct memory-to-memory moves. Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS. MODERATE OR GOOD. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce @ 2009-05-01 5:24 ` Dmitry Potapov 2009-05-01 9:42 ` Mike Hommey 1 sibling, 1 reply; 82+ messages in thread From: Dmitry Potapov @ 2009-05-01 5:24 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote: > 4) The "static inline void hashcpy(....)" in cache.h could then > maybe be written like this: > > static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) > { > sha_dst[0] = sha_src[0]; > sha_dst[1] = sha_src[1]; > sha_dst[2] = sha_src[2]; > sha_dst[3] = sha_src[3]; > sha_dst[4] = sha_src[4]; > } > > And hopefully will be compiled to just 5 store/move > instructions, or at least hopefully be faster than the current > memcpy() call. But maybe we get more compiled instructions compared > to a single call to memcpy()? Good compilers can inline memcpy and should produce more efficient code for the target architecture, which can be faster than manually written code. On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 while the above code requires 5 operations. Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 5:24 ` Dmitry Potapov @ 2009-05-01 9:42 ` Mike Hommey 2009-05-01 10:46 ` Dmitry Potapov 0 siblings, 1 reply; 82+ messages in thread From: Mike Hommey @ 2009-05-01 9:42 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote: > On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote: > > 4) The "static inline void hashcpy(....)" in cache.h could then > > maybe be written like this: > > > > static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) > > { > > sha_dst[0] = sha_src[0]; > > sha_dst[1] = sha_src[1]; > > sha_dst[2] = sha_src[2]; > > sha_dst[3] = sha_src[3]; > > sha_dst[4] = sha_src[4]; > > } > > > > And hopefully will be compiled to just 5 store/move > > instructions, or at least hopefully be faster than the current > > memcpy() call. But maybe we get more compiled instructions compared > > to a single call to memcpy()? > > Good compilers can inline memcpy and should produce more efficient code > for the target architecture, which can be faster than manually written. > On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 > while the above code requires 5 operations. I guess, though, that some enforced alignment could help produce slightly more efficient code on some architectures (most notably sparc, which really doesn't like to deal with unaligned words). Mike ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:42 ` Mike Hommey @ 2009-05-01 10:46 ` Dmitry Potapov 0 siblings, 0 replies; 82+ messages in thread From: Dmitry Potapov @ 2009-05-01 10:46 UTC (permalink / raw) To: Mike Hommey; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Fri, May 01, 2009 at 11:42:21AM +0200, Mike Hommey wrote: > On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote: > > > > Good compilers can inline memcpy and should produce more efficient code > > for the target architecture, which can be faster than manually written. > > On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 > > while the above code requires 5 operations. > > I guess, though, that some enforced alignment could help produce > slightly more efficient code on some architectures (most notably sparc, > which really doesn't like to deal with unaligned words). Agreed. Enforcing good alignment may be useful. My point was that avoiding memcpy with modern compilers is rather pointless or even harmful because the compiler knows more about the target architecture than the author of the code. Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
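[Editorial note: one way to realize the alignment idea Mike and Dmitry discuss above is sketched below. The union type is hypothetical, not from the git source: overlaying the 20 raw bytes with unsigned longs forces the whole object to word alignment, so the compiler may emit aligned word moves for the copy while the code still calls plain memcpy.]

```c
#include <string.h>

/* Hypothetical type: the words[] member exists only to raise the
 * union's alignment to that of unsigned long. */
union aligned_sha1 {
    unsigned char bytes[20];
    unsigned long words[5];
};

void sha1_copy(union aligned_sha1 *dst, const union aligned_sha1 *src)
{
    /* still a plain 20-byte memcpy; the alignment lives in the type,
     * so the compiler can assume word-aligned source and destination */
    memcpy(dst->bytes, src->bytes, 20);
}
```

This keeps memcpy as the copy primitive (per Dmitry's point that the compiler knows the target best) while giving it the alignment guarantee that worried Kjetil.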
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski @ 2009-04-30 18:43 ` Shawn O. Pearce 1 sibling, 0 replies; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 18:43 UTC (permalink / raw) To: Michael Witten; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List Michael Witten <mfwitten@gmail.com> wrote: > On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > > I hope that JGit developers can > > tell us whether using higher level language affects performance, how > > much, and what features of higher-level language are causing decrease > > in performance. > > Java is definitely higher than C, but you can do some pretty low-level > operations on bits and bytes and the like, not to mention the presence > of a JIT. But it's still costly compared to C. > My point: I don't think that Java can tell us anything special in this regard. Sure it can. Peff I think made a good point here, that we rely on a lot of small tweaks in the C git code to get *really* good performance. 5% here, 10% there, and suddenly you are 60% faster than you were before. Nico, Linus, Junio, they have all spent some time over the past 3 or 4 years trying to tune various parts of Git to just flat out run fast. Higher level languages hide enough of the machine that we can't make all of these optimizations. JGit struggles with not having mmap(), or when you do use Java NIO MappedByteBuffer, we still have to copy to a temporary byte[] in order to do any real processing. C Git avoids that copy. Sure, other higher level languages may offer a better mmap facility, but they also tend to offer garbage collection and most try to tie the mmap management into the GC "for safety and ease of use". JGit struggles with not having unsigned types in Java. 
There are many locations in JGit where we really need "unsigned int32_t" or "unsigned long" (largest machine word available) or "unsigned char" but these types just don't exist in Java. Converting a byte up to an int just to treat it as an unsigned requires an extra " & 0xFF" operation to remove the sign extension. JGit struggles with not having an efficient way to represent a SHA-1. C can just say "unsigned char[20]" and have it inline into the container's memory allocation. A byte[20] in Java will cost an *additional* 16 bytes of memory, and be slower to access because the bytes themselves are in a different area of memory from the container object. We try to work around it by converting from a byte[20] to 5 ints, but that costs us machine instructions. C Git takes for granted that memcpy(a, b, 20) is dirt cheap when doing a copy from an inflated tree into a struct object. JGit has to pay a huge penalty to copy that 20 byte region out into 5 ints, because later on, those 5 ints are cheaper. Other higher level languages also lack the ability to mark a type unsigned. Or face similar penalties with storing a 20 byte binary region. Native Java collection types have been a snare for us in JGit. We've used java.util.* types when they seem to be handy and already solve the data structure problem at hand, but they tend to perform a lot worse than writing a specialized data structure. For example, we have ObjectIdSubclassMap for what should be Map<ObjectId,Object>. Only it requires that the Object type you use as the "value" entry in the map extend from ObjectId, as the instance serves as both key *and* value. But it screams when compared to HashMap<ObjectId,Object>. (For those who don't know, ObjectId is JGit's "unsigned char[20]" for a SHA-1.) Just a day or so ago I wrote LongMap, a faster HashMap<Long,Object>, for hashing objects by indexes in a pack file. 
Again, the boxing costs in Java to convert a "long" (largest integer type) into an Object that the standard HashMap type would accept were rather high. Right now, JGit is still paying dearly when it comes to ripping apart a commit or a tree object to follow the object links. Or when invoking inflate(). We spend a lot more time doing this sort of work than C git does, and yet we're trying to be as close to the machine as we can go by using byte[] whenever possible, by avoiding copying whenever possible, and avoiding memory allocation when possible. Notably, `rev-list --objects --all` takes about 2x as long in JGit as it does in C Git on a project like the linux kernel, and `index-pack` for the full ~270M pack file takes about 2x as long. Both parts of JGit are about as good as I know how to make them, but we're really at the mercy of the JIT, and changes in the JIT can cause us to perform worse (or better) than before. Unlike in C Git where Linus has done assembler dumps of sections of code and tried to determine better approaches. :-) So. Yes, it's practical to build Git in a higher level language, but you just can't get the same performance, or tight memory utilization, that C Git gets. That's what that higher level language abstraction costs you. But, JGit performs reasonably well; well enough that we use it internally at Google as a git server. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
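[Editorial note: the memory-layout point Shawn makes about "unsigned char[20]" can be sketched in C as follows. The struct names are made up, not from the git source: C git's layout corresponds to the inline struct, where the SHA-1 lives inside the containing object's own allocation, while Java's byte[20] field behaves like the indirect struct — a pointer to a separately allocated array, costing an extra allocation, an array header, and worse cache locality.]

```c
#include <stdlib.h>

/* What C git effectively does: one allocation holds both the object's
 * fields and its SHA-1 bytes, so they are adjacent in memory. */
struct obj_inline {
    unsigned flags;
    unsigned char sha1[20];     /* inline, no extra allocation */
};

/* Closest C analogue of a Java byte[20] field: a pointer to a
 * separately allocated buffer in a different area of memory. */
struct obj_indirect {
    unsigned flags;
    unsigned char *sha1;
};

struct obj_indirect *obj_indirect_new(void)
{
    struct obj_indirect *o = calloc(1, sizeof(*o));
    if (!o)
        return NULL;
    o->sha1 = calloc(1, 20);    /* the second allocation Java always pays */
    if (!o->sha1) {
        free(o);
        return NULL;
    }
    return o;
}
```

Every access through obj_indirect chases a pointer to a possibly cold cache line; obj_inline gets the hash for free once the object itself is loaded.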
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten @ 2009-04-30 14:22 ` Jeff King 2009-05-01 18:43 ` Linus Torvalds 2009-04-30 18:56 ` Nicolas Pitre 2 siblings, 1 reply; 82+ messages in thread From: Jeff King @ 2009-04-30 14:22 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, Apr 30, 2009 at 05:17:58AM -0700, Jakub Narebski wrote: > This is I think quite obvious. Accessing memory is faster than > accessing disk, which in turn is faster than accessing network. So if > commit and (change)log does not require access to server via network, > they are so much faster. Like all generalizations, this is only mostly true. Fast network servers with big caches can outperform disks for some loads. And in many cases with a VCS, you are performing a query that might look over the whole dataset, but return only a small fraction of data. So I wouldn't rule out the possibility of a pleasant VCS experience on a network-optimized system backed by beefy servers on a local network. I have never used perforce, but I get the impression that it is more optimized for such a situation. Git is really optimized for open source projects: slow servers across high-latency, low-bandwidth links. > es> Nah, probably not. Lots of people have written fast software in > es> C#, Java or Python. > es> > es> And lots of people have written really slow software in > es> traditional native languages like C/C++. [...] > > Well, I guess that access to low-level optimization techniques like > mmap are important for performance. But here I am guessing and > speculating like Eric did; well, I am asking on a proper forum ;-) Certainly there's algorithmic fastness that you can do in any language, and I think git does well at that. 
Most operations are independent of the total size of history (e.g., branching is O(1) and commit is O(changed files), diff looks only at endpoints, etc). Operations which deal only with history are independent of the size of the tree (e.g., "git log" and the history graph in gitk look only at commits, never at the tree). And when we do have to look at the tree, we can drastically reduce our I/O by comparing hashes instead of full files. But there are also some micro-optimizations that make a big difference in practice. Some of them can be done in any language. For example, the packfiles are ordered by type so that all of the commits have a nice I/O pattern when doing a history walk. Some other micro-optimizations are really language-specific, though. I don't recall the numbers, but I think Linus got measurable speedups from cutting the memory footprint of the object and commit structs (which gave better cache usage patterns). Git uses some variable-length fields inside structs instead of a pointer to a separately allocated string to give better memory access patterns. Tricks like that won't give the order-of-magnitude speedups that algorithmic optimizations will, but 10% here and 20% there means you can get a system that is a few times faster than the competition. For an operation that takes 0.1s anyway, that doesn't matter. But with current hardware and current project size, you are often talking about dropping a 3-second operation down to 1s or 0.5s, which just feels a lot snappier. And finally, git tries to do as little work as possible when starting a new command, and streams output as soon as possible. Which means that in a command-line setting, git can _feel_ snappier, because it starts output immediately. Higher-level languages can often have a much longer startup time, especially if they have a lot of modules to load. 
E.g.,: # does enough work to easily fill your pager $ time git log -100 >/dev/null real 0m0.011s user 0m0.008s sys 0m0.004s # does nothing, just starts perl and aborts with usage $ time git send-email >/dev/null real 0m0.150s user 0m0.104s sys 0m0.048s Both are warm-cache times. C git gives you output almost instantaneously, whereas just loading perl with a modest set of modules introduces a noticeable pause before any work is actually done. In the grand scheme of things, .1s probably isn't relevant, but I think avoiding that delay adds to the perception of git as fast. > es> Or maybe Git's shortcut for handling renames is faster than doing > es> them more correctly[2] like Bazaar does. > es> > es> [2] "Renaming is the killer app of distributed version control" > es> http://www.markshuttleworth.com/archives/123 > > Errr... what? Yeah, I had the same thought. Git's rename handling is _much_ more computationally intensive than other systems. In fact, it is one of only two places where I have ever wanted git to be any faster (the other being repacking of large repos). > Eight: Git seems fast. > ====================== > > Here I mean concentrating on low _latency_, which means that when git I do think this helps (see above), but I wanted to note that it is more than just "streaming"; I think other systems stream, as well. For example, I am pretty sure that "cvs log" streamed (but thank god it has been so long since I touched CVS that I can't really remember), but it _still_ felt awfully slow. So it is also about keeping start times low and having your data in a format that is ready to use. -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 14:22 ` Jeff King @ 2009-05-01 18:43 ` Linus Torvalds 2009-05-01 19:08 ` Jeff King 0 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-01 18:43 UTC (permalink / raw) To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Jeff King wrote: > > Like all generalizations, this is only mostly true. Fast network servers > with big caches can outperform disks for some loads. That's _very_ few loads. It doesn't matter how good a server you have, network filesystems invariably suck. Why? It's not that the network or the server sucks - you can easily find beefy NAS setups that have big raids etc and are much faster than most local disks. And they _still_ suck. Simple reason: caching. It's a lot easier to cache local filesystems. Even modern networked filesystems (ie NFSv4) that do a pretty good job on a file-per-file basis with delegations etc still tend to suck horribly at metadata. In contrast, a workstation with local filesystems and enough memory to cache it well will just be a lot nicer. > So I wouldn't rule out the possibility of a pleasant VCS experience on a > network-optimized system backed by beefy servers on a local network. Hey, you can always throw resources at it. But no: > I have never used perforce, but I get the impression that it is more > optimized for such a situation. I doubt it. I suspect git will outperform pretty much anything else in that kind of situation too. One thing that git does - and some other VCS's avoid - is to actually stat() the whole working tree in order to not need special per-file "I use this file" locking semantics. That can in theory make git slower over a network filesystem than such (very broken) alternatives. 
If your VCS requires that you mark all files for editing somehow (ie you can't just use your favourite editor or scripting to modify files, but have to use "p4 edit" to say that you're going to write to the file, and the file is otherwise read-only), then such a VCS can - by being annoying and in your way - do some things faster than git can. And yes, perforce does that (the "p4 edit" command is real, and exists). And yes, in theory that can probably mean that perforce doesn't care so much about the metadata caching problem on network filesystems - because p4 will maintain some file of its own that contains the metadata. But I suspect that the git "async stat" ("core.preloadindex") thing means that git will kick p4 *ss even on that benchmark, and be a whole lot more pleasant to use. Even on networked filesystems. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 18:43 ` Linus Torvalds @ 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Jeff King @ 2009-05-01 19:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > > Like all generalizations, this is only mostly true. Fast network servers > > with big caches can outperform disks for some loads. > [...] > In contrast, a workstation with local filesystems and enough memory to > cache it well will just be a lot nicer. > [...] > > I have never used perforce, but I get the impression that it is more > > optimized for such a situation. > > I doubt it. I suspect git will outperform pretty much anything else in > that kind of situation too. Thanks for the analysis; what you said makes sense to me. However, there is at least one case of somebody complaining that git doesn't scale as well as perforce for their load: http://gandolf.homelinux.org/blog/index.php?id=50 Part of his issue is with git-p4 sucking, which it probably does. But part of it sounds like he has a gigantic workload (the description of which sounds silly to me, but I respect the fact that he is probably describing standard practice among some companies), and that workload is just a little too gigantic for the workstations to handle. I.e., by throwing resources at the central server they can avoid throwing as many at each workstation. But there are so few details it's hard to say whether he's doing something else wrong or suboptimally. He does mention Windows, which IIRC has horrific stat performance. -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King @ 2009-05-01 19:13 ` david 2009-05-01 19:32 ` Nicolas Pitre 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2 siblings, 1 reply; 82+ messages in thread From: david @ 2009-05-01 19:13 UTC (permalink / raw) To: Jeff King Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > >>> Like all generalizations, this is only mostly true. Fast network servers >>> with big caches can outperform disks for some loads. >> [...] >> In contrast, a workstation with local filesystems and enough memory to >> cache it well will just be a lot nicer. >> [...] >>> I have never used perforce, but I get the impression that it is more >>> optimized for such a situation. >> >> I doubt it. I suspect git will outperform pretty much anything else in >> that kind of situation too. > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: > > http://gandolf.homelinux.org/blog/index.php?id=50 > > Part of his issue is with git-p4 sucking, which it probably does. But > part of it sounds like he has a gigantic workload (the description of > which sounds silly to me, but I respect the fact that he is probably > describing standard practice among some companies), and that workload is > just a little too gigantic for the workstations to handle. I.e., by > throwing resources at the central server they can avoid throwing as many > at each workstation. > > But there are so few details it's hard to say whether he's doing > something else wrong or suboptimally. He does mention Windows, which > IIRC has horrific stat performance. the key thing for his problem is the support for large binary objects. 
there was discussion here a few weeks ago about ways to handle such things without trying to pull them into packs. I suspect that solving those sorts of issues would go a long way towards closing the gap on this workload. there may be issues in doing a clone for repositories that large, I don't remember exactly what happens when you have something larger than 4G to send in a clone. David Lang ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:13 ` david @ 2009-05-01 19:32 ` Nicolas Pitre 0 siblings, 0 replies; 82+ messages in thread From: Nicolas Pitre @ 2009-05-01 19:32 UTC (permalink / raw) To: david Cc: Jeff King, Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, david@lang.hm wrote: > the key thing for his problem is the support for large binary objects. there > was discussion here a few weeks ago about ways to handle such things without > trying to pull them into packs. I suspect that solving those sorts of issues > would go a long way towards closing the gap on this workload. > > there may be issues in doing a clone for repositories that large, I don't > remember exactly what happens when you have something larger than 4G to send > in a clone. If you have files larger than 4G then you definitely need a 64-bit machine with plenty of RAM for git to at least be able to cope at the moment. It should be easy to add a config option to determine how big a big file is, and store those big files directly in a pack of their own instead of a loose object (for easy pack reuse during a further repack), and never attempt to deltify them, etc. etc. At which point git will handle big files just fine even on a 32-bit machine but it won't do more than copying them in and out, and possibly deflating/inflating them while at it, but nothing fancier. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david @ 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2 siblings, 0 replies; 82+ messages in thread From: Daniel Barkalow @ 2009-05-01 21:17 UTC (permalink / raw) To: Jeff King Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > > > > Like all generalizations, this is only mostly true. Fast network servers > > > with big caches can outperform disks for some loads. > > [...] > > In contrast, a workstation with local filesystems and enough memory to > > cache it well will just be a lot nicer. > > [...] > > > I have never used perforce, but I get the impression that it is more > > > optimized for such a situation. > > > > I doubt it. I suspect git will outperform pretty much anything else in > > that kind of situation too. > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: > > http://gandolf.homelinux.org/blog/index.php?id=50 > > Part of his issue is with git-p4 sucking, which it probably does. But > part of it sounds like he has a gigantic workload (the description of > which sounds silly to me, but I respect the fact that he is probably > describing standard practice among some companies), and that workload is > just a little too gigantic for the workstations to handle. I.e., by > throwing resources at the central server they can avoid throwing as many > at each workstation. I think his problem is that he's trying to replace his p4 repository with a git repository, which is a bit like trying to download github, rather than a project from github. 
Perforce is good at dealing with the case where people check in a vast quantity of junk that you don't check out. That is, you can back up your workstation into Perforce, and it won't affect anyone's performance if you use a path that's not in the range that anybody else checks out. And people actually do that. And Perforce doesn't make a distinction between different projects and different branches of the same project and different subdirectories of a branch of the same project, so it's impossible to tease apart except by company policy. Git doesn't scale in that it can't do the extremely narrow checkouts you need if your repository root directory contains thousands of complete unrelated projects with each branch of each project getting a subdirectory. On the other hand, it does a great job when the data is already partitioned into useful repositories. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david 2009-05-01 21:17 ` Daniel Barkalow @ 2009-05-01 21:37 ` Linus Torvalds 2009-05-01 22:11 ` david 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-01 21:37 UTC (permalink / raw) To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: So we definitely do have scaling issues, there's no question about that. I just don't think they are about enterprise network servers vs the more workstation-oriented OSS world.. I think they're likely about the whole git mentality of looking at the big picture, and then getting swamped by just how _huge_ that picture can be if somebody just put the whole world in a single repository.. With perforce, repository maintenance is such a central issue that the whole p4 mentality seems to _encourage_ everybody to put everything into basically one single p4 repository. And afaik, p4 basically works mostly like CVS, ie it really ends up being pretty much oriented to a "one file at a time" model. Which is nice in that you can have a million files, and then only check out a few of them - you'll never even _see_ the impact of the other 999,995 files. And git obviously doesn't have that kind of model at all. Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around. So git scales really badly if you force it to look at everything as one _huge_ repository. I don't think that part is really fixable, although we can probably improve on it. 
And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know. There are work-arounds (like not deltaing big objects at all), but they aren't necessarily that great either. I bet we could probably improve git large-file behavior for many common cases. Do we have a good test-case of some particular suckiness that is actually relevant enough that people might decide to look at it (and by "people", I do mean myself too - but I'd need to be somewhat motivated by it. A usage case that we suck at and that is available and relevant). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 21:37 ` Linus Torvalds @ 2009-05-01 22:11 ` david 0 siblings, 0 replies; 82+ messages in thread From: david @ 2009-05-01 22:11 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Linus Torvalds wrote: > I bet we could probably improve git large-file behavior for many common > cases. Do we have a good test-case of some particular suckiness that is > actually relevant enough that people might decide to look at it (and by > "people", I do mean myself too - but I'd need to be somewhat motivated by > it. A usage case that we suck at and that is available and relevant). I think that a sane use case that would make sense to people is based on the 'game developer' example they have source code, but they also have large images (and sometimes movie clips), where a particular release of the game needs a particular set of the images. during development you may change images frequently (although most changesets probably only change a few, if any of the images) the images can be large (movies can be very large), and since they are already compressed they don't diff or compress well. David Lang ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten 2009-04-30 14:22 ` Jeff King @ 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen 2009-04-30 19:33 ` Jakub Narebski 2 siblings, 2 replies; 82+ messages in thread From: Nicolas Pitre @ 2009-04-30 18:56 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Jakub Narebski wrote: > Jakub Narebski <jnareb@gmail.com> writes: > > es> Two: Maybe Git is fast because Linus Torvalds is so smart. > > [non answer; the details are important] I think Linus is certainly responsible for a big part of Git's speed. He came up with the basic data structure used by git which has lots to do with that. Also, he designed Git specifically to fulfill a need for which none of the alternatives were fast enough. Hence Git was designed from the ground up with speed as one of the primary design goals, such as being able to create multiple commits per second instead of the other way around (several seconds per commit). And yes, Linus is usually smart enough with the proper mindset to achieve such goals. > es> Three: Maybe Git is fast because it's written in C instead of one > es> of those newfangled higher-level languages. > es> > es> Nah, probably not. Lots of people have written fast software in > es> C#, Java or Python. > es> > es> And lots of people have written really slow software in > es> traditional native languages like C/C++. [...] > > Well, I guess that access to low-level optimization techniques like > mmap are important for performance. 
But here I am guessing and > speculating like Eric did; well, I am asking on a proper forum ;-) > > We have some anecdotal evidence supporting this possibility (which > Eric dismisses), namely the fact that pure-Python Bazaar is the slowest of > the three most common open source DVCSs (Git, Mercurial, Bazaar) and the > fact that parts of Mercurial were written in C for better performance. > > We can also compare implementations of Git in other, higher-level > languages, with the reference implementation in C (and shell scripts, and > Perl ;-)). For example the most complete (though I think still not fully > complete) Java implementation: JGit. I hope that JGit developers can > tell us whether using a higher-level language affects performance, how > much, and what features of the higher-level language cause the decrease > in performance. Of course we have to take into account the > possibility that JGit simply isn't as well optimized because of less > manpower. One of the main JGit developers is Shawn Pearce. If you look at Shawn's contributions to C git, they are almost all related to performance issues. Amongst other things, he is the author of git-fast-import, he contributed the pack access windowing code, and he was also involved in the initial design of pack v4. Hence Shawn is a smart guy who certainly knows a thing or two about performance optimization. Yet he reported on this list that his efforts to make JGit faster were no longer very successful, most probably due to the language overhead. > es> Four: Maybe Git is fast because being fast is the primary goal for > es> Git. > > [non answer; the details are important] Still, this is actually true (see about Linus above). Without such a goal, you quickly lose sight of performance regressions. > es> Maybe Git is fast because every time they faced one of these "buy > es> vs. build" choices, they decided to just write it themselves. > > I don't think so. Rather the opposite is true. Git uses libcurl for > HTTP transport. 
Git uses zlib for compression. Git uses SHA-1 from > OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for > (binary) deltaifying, for diffs and for merges. Well, I think he's right on this point as well. libcurl is not so relevant since it is rarely the bottleneck (the network bandwidth itself usually is). zlib is already as fast as it can be; multiple attempts to make it faster didn't succeed. Git already carries its own version of SHA-1 code for ARM and PPC because the alternatives were slower. The fact that libxdiff was made internal is indeed to have a better impedance matching with the core code, otherwise it could have remained fully external just like zlib. And the binary delta code is not libxdiff anymore but a much smaller, straightforward, optimized-to-death version that achieves speed over versatility (no need to be versatile when strictly dealing with Git's needs only). > es> Seven: Maybe Git isn't really that fast. > es> > es> If there is one thing I've learned about version control it's that > es> everybody's situation is different. It is quite likely that Git > es> is a lot faster for some scenarios than it is for others. > es> > es> How does Git handle really large trees? Git was designed primarily > es> to support the efforts of the Linux kernel developers. A lot of > es> people think the Linux kernel is a large tree, but it's really > es> not. Many enterprise configuration management repositories are > es> FAR bigger than the Linux kernel. > > c.f. "Why Perforce is more scalable than Git" by Steve Hanov > http://gandolf.homelinux.org/blog/index.php?id=50 > > I don't really know about this. Git certainly sucks big time with large files. Git also sucks to a lesser extent (but still) with very large repositories. But large trees? I don't think Git is worse than anything out there with a large tree of average-size files. 
Yet, this point is misleading because when people give Git the reputation of being faster, it is certainly from comparisons of operations performed on the same source tree. Who cares about scenarios for which the tool was not designed? Those "enterprise configuration management repositories" are not what Git was designed for indeed, but neither were Mercurial and Bazaar, nor any other contender to which Git is usually compared. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 18:56 ` Nicolas Pitre @ 2009-04-30 19:16 ` Alex Riesen 2009-05-04 8:01 ` Why Git is so fast Andreas Ericsson 2009-04-30 19:33 ` Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Alex Riesen @ 2009-04-30 19:16 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List 2009/4/30 Nicolas Pitre <nico@cam.org>: > Yet, this point is misleading because when people give Git the > reputation of being faster, it is certainly from comparisons of > operations performed on the same source tree. Who cares about scenarios > for which the tool was not designed? Those "enterprise configuration > management repositories" are not what Git was designed for indeed, but Especially when no sane developer will put in his repository the toolchain (pre-compiled. For all supported platforms!), all the supporting tools (like grep, find, etc. Pre-compiled _and_ source), the in-house framework (pre-compiled and source, again), firmware (pre-compiled and put in the repository weekly), and operating system code (pre-compiled, with firmware-specific drivers, updated, you guessed it, weekly), and, well, there is the project itself (Java or C++, and documentation in .doc and .xls)... Now, what kind of self-hating idiot will design a system for that kind of abuse? (And if someone says that's not true in most enterprise f$%cking configurations, he definitely hasn't had to live through a big enough number of them). ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 19:16 ` Alex Riesen @ 2009-05-04 8:01 ` Andreas Ericsson 0 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-04 8:01 UTC (permalink / raw) To: Alex Riesen Cc: Nicolas Pitre, Jakub Narebski, Martin Langhoff, Git Mailing List Alex Riesen wrote: > 2009/4/30 Nicolas Pitre <nico@cam.org>: >> Yet, this point is misleading because when people give Git the >> reputation of being faster, it is certainly from comparisons of >> operations performed on the same source tree. Who cares about scenarios >> for which the tool was not designed? Those "enterprise configuration >> management repositories" are not what Git was designed for indeed, but > > Especially when no sane developer will put in his repository the toolchain > (pre-compiled. For all supported platforms!), all the supporting tools > (like grep, > find, etc. Pre-compiled _and_ source), the in-house framework (pre-compiled > and source, again), firmware (pre-compiled and put in the repository weekly), > and operating system code (pre-compiled, with firmware-specific drivers, > updated, you guessed it, weekly), and well, there is the project itself (Java or > C++, and documentation in .doc and .xls)... Well, git could actually handle that just fine if the toolchain was in a submodule or even in a separate repository that developers never had to worry about. Then you'd design a little tool that said "re-create build 8149" and it would pull the tools used to do that, and the code and the artwork, and then set to work. It'd be an overnight (or over-weekend) job, but no man-hours would be spent on it. That's how I'd do it anyway, probably with the "build" repository as a master repo with "tools", "code" and "artwork" as submodules to it. > Now, what kind of self-hating idiot will design a system for that kind of abuse? 
No one, naturally, but one might design a system where each folder in the repository root is considered a repository in its own right, and then get that more or less for free. The problem with git for such scenarios is that you have to think *before* creating the repository, or play silly buggers when importing, which makes it hard to see how the pieces fit together afterwards. A tool that could take a repository from a different SCM and create a master repository and several submodule repositories from it would probably solve many of the issues gaming companies have if they want to switch to using git. Not least because it would open their eyes to how that sort of separation can be done in git, and why it's useful. The binary repos can then turn off delta compression (and zlib compression) for all their blobs using a .gitattributes file, and things would be several orders of magnitude faster. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
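A rough sketch of the layout Andreas describes, using throwaway local repositories. All names, paths and identity settings here are hypothetical; the `delta` attribute is real git, but note that per-path zlib compression is not actually switchable from .gitattributes, so only delta search is disabled below:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# three repos standing in for the tools, code and artwork trees
for r in tools code artwork; do
    git init -q "$r"
    git -C "$r" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "initial import"
done

# disable delta search for binary assets in the artwork repo
printf '*.png -delta\n*.mov -delta\n' > artwork/.gitattributes
git -C artwork add .gitattributes
git -C artwork -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "no delta search for binary assets"

# the "build" master repo ties the three together as submodules
git init -q build && cd build
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial import"
for r in tools code artwork; do
    # newer git needs protocol.file.allow for local-path submodules
    git -c protocol.file.allow=always submodule --quiet add "$tmp/$r" "$r"
done
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add tools, code and artwork as submodules"
git submodule status
```

Re-creating "build 8149" is then just a checkout of the right superproject commit followed by `git submodule update`.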
* Re: Why Git is so fast 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen @ 2009-04-30 19:33 ` Jakub Narebski 1 sibling, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 19:33 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Nicolas Pitre wrote: > On Thu, 30 Apr 2009, Jakub Narebski wrote: > > Jakub Narebski <jnareb@gmail.com> writes: > > es> Maybe Git is fast because every time they faced one of these "buy > > es> vs. build" choices, they decided to just write it themselves. > > > > I don't think so. Rather the opposite is true. Git uses libcurl for > > HTTP transport. Git uses zlib for compression. Git uses SHA-1 from > > OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for > > (binary) deltaifying, for diffs and for merges. > > Well, I think he's right on this point as well. [...] > The fact that libxdiff was made internal is indeed to have a better > impedance matching with the core code, otherwise it could have remained > fully external just like zlib. And the binary delta code is not > libxdiff anymore but a much smaller, straightforward, optimized-to-death > version that achieves speed over versatility (no need to be versatile > when strictly dealing with Git's needs only). Hrmmmm... I had thought that LibXDiff was internalized mainly for ease of modification, as my impression is that LibXDiff is a single-developer effort, while Git has had many contributors from the beginning (and submodules didn't exist then). If I remember correctly, the rcsmerge/diff3 algorithm was first added in git's internalized xdiff... was it added to LibXDiff proper, anyway? BTW, I wonder what the other F/OSS version control systems (Bazaar, Mercurial, Darcs, Monotone) use for binary deltas, for a diff engine, and for a textual three-way merge engine. Hmmm... perhaps I'll ask on #revctrl -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control @ 2009-05-12 15:06 Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-12 15:06 UTC (permalink / raw) To: git A good start for making Git cross-platform would be storing the text encoding of every file name and commit message together with the commit. Currently, because Git is oblivious to the encodings and just treats them as a series of bytes, there is no way to make them cross-platform. It's as http://www.joelonsoftware.com/articles/Unicode.html says: "It does not make sense to have a string without knowing what encoding it uses." Without explicit encoding information, making a system that works even on the three main platforms, let alone in all countries and languages, is simply not possible. On the other hand, if the encoding is explicitly stated in the repository, then it is possible for platform- and locale-aware Git clients to handle the file names and commit messages in whatever way makes most sense for the platform (for example, converting the file names to the platform's encoding if it differs from the committer's platform encoding). Then it would also be possible to create a Mac version of Git which compensates for Mac OS X's file system's file name encoding peculiarities. Also, the system could then warn (on "git add") if the data does not look like it has been encoded with the stated encoding. If the platform's and the repository's encoding happen to be the same (which in reality might be possible only inside a small company where everybody is forced to use the same OS, configured by a single sysadmin), then no conversions need to be done. 
Also Git purists, who think that the byte sequence representing a file name is more important than the human-readable version of the file name, may use some configuration switch that disables all conversions - but even then the current encoding should be stored together with the commit. Are there any plans on storing the encoding information of file names and commit messages in the Git repository? How much time would implementing it take? Any ideas on how to maintain backwards compatibility (for old commits that do not have the encoding information)? - Esko ^ permalink raw reply [flat|nested] 82+ messages in thread
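Esko's point that a byte string is ambiguous without a declared encoding is easy to demonstrate; the file name below is an invented example, and iconv charset names may vary slightly between platforms:

```shell
# One and the same byte sequence for a file name, read under two
# different assumed encodings, yields two different names.
cd "$(mktemp -d)"
printf 'tiedosto-\344.txt' > name-bytes       # 0xE4, written on a Latin-1 box

iconv -f ISO-8859-1 -t UTF-8 < name-bytes; echo   # tiedosto-ä.txt (intended)
iconv -f ISO-8859-7 -t UTF-8 < name-bytes; echo   # tiedosto-δ.txt (Greek box)
```

Without a recorded encoding, a client has no way to tell which of these the committer meant.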
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola @ 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 16:16 ` Jeff King 2009-05-12 18:28 ` Dmitry Potapov 2009-05-14 13:48 ` Peter Krefting 2 siblings, 2 replies; 82+ messages in thread From: Shawn O. Pearce @ 2009-05-12 15:14 UTC (permalink / raw) To: Esko Luontola; +Cc: git Esko Luontola <esko.luontola@gmail.com> wrote: > Are there any plans on storing the encoding information of file names > and commit messages in the Git repository? Commit messages already store their encoding in an optional "encoding" header if the message isn't stored in UTF-8, or US-ASCII, which is a strict subset of UTF-8. As for file names, no plans; it's a sequence of bytes, but I think a lot of people wind up using some subset of US-ASCII for their file names, especially if their project is going to be cross-platform. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
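The "encoding" header Shawn mentions can be seen in a raw commit object. The throwaway repo and identity settings below are just for illustration; `i18n.commitEncoding` is the real config knob that declares the encoding of incoming commit messages:

```shell
set -e
cd "$(mktemp -d)" && git init -q
# declare that commit messages are supplied as ISO-8859-1, not UTF-8,
# so git records an "encoding" header in the commit object
git -c user.name=demo -c user.email=demo@example.com \
    -c i18n.commitEncoding=ISO-8859-1 \
    commit -q --allow-empty -m "initial commit"
git cat-file commit HEAD    # note the "encoding ISO-8859-1" header line
```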
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 17:56 ` Esko Luontola 2009-05-12 16:16 ` Jeff King 1 sibling, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:13 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git Hi, On Tue, 12 May 2009, Shawn O. Pearce wrote: > Esko Luontola <esko.luontola@gmail.com> wrote: > > Are there any plans on storing the encoding information of file names > > and commit messages in the Git repository? > > Commit messages already store their encoding in an optional "encoding" > header if the message isn't stored in UTF-8, or US-ASCII, which is a > strict subset of UTF-8. > > As for file names, no plans; it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross-platform. Some context: this issue cropped up in msysGit, of course. As to storing all file names in UTF-8, my point about Unicode being not necessarily appropriate for everyone still stands. UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO we should not take away the freedom for everybody to decide what they want their file names to be encoded as. However, I see that there might be a need to be able to encode the file names differently, such as on Windows. IMHO the best solution would be a config variable controlling the reencoding of file names. For some time, it looked as if two people were interested in implementing something like that (Peter and Robin IIRC), but efforts have stalled. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 17:56 ` Esko Luontola 2009-05-12 20:38 ` Johannes Schindelin 0 siblings, 1 reply; 82+ messages in thread From: Esko Luontola @ 2009-05-12 17:56 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Shawn O. Pearce, git On 12.5.2009, at 19:13, Johannes Schindelin wrote: > As to storing all file names in UTF-8, my point about Unicode being > not > necessarily appropriate for everyone still stands. > > UTF-8 _might_ be the de-facto standard for Linux filesystems, but > IMHO we should not take away the freedom for everybody to decide > what they > want their file names to be encoded as. > > However, I see that there might be a need to be able to encode the > file > names differently, such as on Windows. IMHO the best solution would > be > a config variable controlling the reencoding of file names. Exactly. The system should not force the use of a specific encoding. It should only offer a recommendation, but also be fully compatible if the user uses some other encoding. That's why it's best to always store the information about what encoding was used. It shouldn't matter whether the data is encoded with ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it is explicitly stated what the encoding is. Then the reader of the data can best decide how to show that data on the current platform. A config variable defining what encoding should be used when committing the file names would make sense. Git should also try to autodetect what encoding is used in its current environment. In the case of UTF-8, you should also be able to specify which normalization form is used (http://www.unicode.org/unicode/reports/tr15/), or whether it is normalized at all. 
For example, it should be possible to configure Git so that, when a file is checked out on a Mac, its file name is converted to the current file system's encoding (UTF-8 NFD, I think), and when the file is committed on a Mac, the file name is normalized back to the same UTF-8 form as is used on Linux (UTF-8 NFC). It would be nice to have config variables for saying that all file names in this repository must use UTF-8 NFC, and all commit messages must use UTF-8 NFC (with Unix newlines). Then the Git client would autodetect the current environment's encoding, and convert the text, if necessary, to match the repository's encoding. - Esko ^ permalink raw reply [flat|nested] 82+ messages in thread
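The NFC/NFD mismatch Esko describes is visible at the byte level; 'é' is just an example character, and the scratch files are hypothetical:

```shell
# "é.txt" spelled two ways: NFC stores one code point (U+00E9), NFD
# stores 'e' plus a combining accent (U+0301). A human sees the same
# name; byte-comparing tools (and git's index) do not.
cd "$(mktemp -d)"
printf '\303\251.txt'  > nfc-name   # NFC bytes: c3 a9 2e 74 78 74
printf 'e\314\201.txt' > nfd-name   # NFD bytes: 65 cc 81 2e 74 78 74
od -An -tx1 nfc-name
od -An -tx1 nfd-name
cmp -s nfc-name nfd-name || echo "same name to a human, different bytes"
```

Without the normalization-aware conversion Esko asks for, a file committed as NFC on Linux and checked out on an NFD filesystem round-trips as a "different" file name.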
* Re: Cross-Platform Version Control 2009-05-12 17:56 ` Esko Luontola @ 2009-05-12 20:38 ` Johannes Schindelin 2009-05-12 21:16 ` Esko Luontola 0 siblings, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 20:38 UTC (permalink / raw) To: Esko Luontola; +Cc: Shawn O. Pearce, git Hi, On Tue, 12 May 2009, Esko Luontola wrote: > On 12.5.2009, at 19:13, Johannes Schindelin wrote: > >As to storing all file names in UTF-8, my point about Unicode being not > >necessarily appropriate for everyone still stands. > > > >UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO > >we should not take away the freedom for everybody to decide what they > >want their file names to be encoded as. > > > >However, I see that there might be a need to be able to encode the file > >names differently, such as on Windows. IMHO the best solution would be > >a config variable controlling the reencoding of file names. > > Exactly. The system should not force the use of a specific encoding. It > should only offer a recommendation, but also be fully compatible if the > user uses some other encoding. > > That's why it's best to always store the information about what encoding > was used. It shouldn't matter whether the data is encoded with > ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it > is explicitly stated what the encoding is. Then the reader of the > data can best decide how to show that data on the current platform. > > A config variable defining what encoding should be used when > committing the file names would make sense. Git should also try to > autodetect what encoding is used in its current environment. In > the case of UTF-8, you should also be able to specify which > normalization form is used > (http://www.unicode.org/unicode/reports/tr15/), or whether it is > normalized at all. 
> > For example, it should be possible to configure Git so that, when a file > is checked out on a Mac, its file name is converted to the current file > system's encoding (UTF-8 NFD, I think), and when the file is committed > on a Mac, the file name is normalized back to the same UTF-8 form as is > used on Linux (UTF-8 NFC). > > It would be nice to have config variables for saying that all file > names in this repository must use UTF-8 NFC, and all commit messages > must use UTF-8 NFC (with Unix newlines). Then the Git client would > autodetect the current environment's encoding, and convert the text, if > necessary, to match the repository's encoding. That is a nice analysis. How about implementing it? Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 20:38 ` Johannes Schindelin @ 2009-05-12 21:16 ` Esko Luontola 2009-05-13 0:23 ` Johannes Schindelin 0 siblings, 1 reply; 82+ messages in thread From: Esko Luontola @ 2009-05-12 21:16 UTC (permalink / raw) To: git; +Cc: Johannes Schindelin, Shawn O. Pearce Johannes Schindelin wrote on 12.5.2009 23:38: > That is a nice analysis. How about implementing it? > Do we have here somebody who knows Git's code well and is motivated to implement this? I don't think that I would be capable, because of not having used C much, being new to Git's codebase, and having too little time. But I can help with the requirements specification, interaction design and system testing. -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 21:16 ` Esko Luontola @ 2009-05-13 0:23 ` Johannes Schindelin 2009-05-13 5:34 ` Esko Luontola 0 siblings, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 0:23 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 12.5.2009 23:38: > > That is a nice analysis. How about implementing it? > > > > Do we have here somebody who knows Git's code well and is motivated to > implement this? > > I don't think that I would be capable, because of not having used C > much, being new to Git's codebase, and having too little time. Well, that rather settles things, no? Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 0:23 ` Johannes Schindelin @ 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin 0 siblings, 2 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-13 5:34 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, Shawn O. Pearce Johannes Schindelin wrote on 13.5.2009 3:23: > Well, that rather settles things, no? > There is a need for the feature, but it's unfortunate that the Git developers do not see its value. There are many users for whom using non-ASCII names is necessary (for example all of Asia and most of Europe), but now it seems that Bazaar is the only DVCS that handles encodings correctly: http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Let's see if I have time later this year or next to work on it. At least it would be good practice in getting acquainted with a new codebase and learning C. But it would be better for someone else to do it, to get it done within a reasonable amount of time. I see that there are some tests in the /t directory. Which command will run all of them? How good is the tests' coverage, how reproducible and isolated are they, and how many seconds does it take to run all the tests? Is there some high-level documentation for new developers? -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola @ 2009-05-13 6:49 ` Alex Riesen 0 siblings, 0 replies; 82+ messages in thread From: Alex Riesen @ 2009-05-13 6:49 UTC (permalink / raw) To: Esko Luontola; +Cc: Johannes Schindelin, git, Shawn O. Pearce 2009/5/13 Esko Luontola <esko.luontola@gmail.com>: > Johannes Schindelin wrote on 13.5.2009 3:23: >> >> Well, that rather settles things, no? >> > > There is a need for the feature, but it's unfortunate that the Git developers > do not see its value. There are many users for whom using non-ASCII names is > necessary (for example all of Asia and most of Europe), but now it seems > that Bazaar is the only DVCS that handles encodings correctly: > http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Many Git developers just use systems which don't care about the file name encoding at all and just keep the names as they were, so the interoperability problem does not exist for them. They either don't need the feature, or can trivially avoid or work around any problems. > I see that there are some tests in the /t directory. Which command will run > all of them? How good is the tests' coverage, how reproducible and > isolated are they, and how many seconds does it take to run all the tests? Is > there some high-level documentation for new developers? make test. See also t/README. We like them. I always run the test suite before deployment and sometimes run it just for fun (unless I have to run it on Windows). ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen @ 2009-05-13 10:15 ` Johannes Schindelin [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> 1 sibling, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 10:15 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 13.5.2009 3:23: > > Well, that rather settles things, no? > > There is a need for the feature, but it's unfortunate that the Git > developers do not see its value. I see a value. But it is not my itch. And since it is your itch and you said that you will not do anything about it (I don't count writing emails here ;-), I concluded that it settles the issue. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
[parent not found: <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>]
* Cross-Platform Version Control [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> @ 2009-05-13 10:41 ` John Tapsell 2009-05-13 13:42 ` Jay Soffian 0 siblings, 1 reply; 82+ messages in thread From: John Tapsell @ 2009-05-13 10:41 UTC (permalink / raw) To: git 2009/5/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>: > Hi, > > On Wed, 13 May 2009, Esko Luontola wrote: > >> Johannes Schindelin wrote on 13.5.2009 3:23: >> > Well, that rather settles things, no? >> >> There is a need for the feature, but it's unfortunate that the Git >> developers do not see its value. > > I see a value. But it is not my itch. And since it is your itch and you > said that you will not do anything about it (I don't count writing emails > here ;-), I concluded that it settles the issue. I don't know why the git developers are being so hostile/dismissive, but I also hope that somebody volunteers to fix this. Esko, you have my moral support :-) John ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 10:41 ` John Tapsell @ 2009-05-13 13:42 ` Jay Soffian 2009-05-13 13:44 ` Alex Riesen 0 siblings, 1 reply; 82+ messages in thread From: Jay Soffian @ 2009-05-13 13:42 UTC (permalink / raw) To: John Tapsell; +Cc: git On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: > I don't know why the git developers are being so hostile/dismissive, Are you serious? j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:42 ` Jay Soffian @ 2009-05-13 13:44 ` Alex Riesen 2009-05-13 13:50 ` Jay Soffian 0 siblings, 1 reply; 82+ messages in thread From: Alex Riesen @ 2009-05-13 13:44 UTC (permalink / raw) To: Jay Soffian; +Cc: John Tapsell, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >> I don't know why the git developers are being so hostile/dismissive, > > Are you serious? > ...because we'll kill you if you aren't >:-E ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:44 ` Alex Riesen @ 2009-05-13 13:50 ` Jay Soffian 2009-05-13 13:57 ` John Tapsell 0 siblings, 1 reply; 82+ messages in thread From: Jay Soffian @ 2009-05-13 13:50 UTC (permalink / raw) To: Alex Riesen; +Cc: John Tapsell, git On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>> I don't know why the git developers are being so hostile/dismissive, >> >> Are you serious? >> > > ...because we'll kill you if you aren't >:-E I'm just flabbergasted by some people's expectations. Perhaps John doesn't realize the git developers are all volunteers, and that it is never appropriate to criticize a volunteer. A "thank you for all your hard work on git" would have done nicely. j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:50 ` Jay Soffian @ 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: John Tapsell @ 2009-05-13 13:57 UTC (permalink / raw) To: Jay Soffian; +Cc: Alex Riesen, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>> I don't know why the git developers are being so hostile/dismissive, >>> >>> Are you serious? >>> >> >> ...because we'll kill you if you aren't >:-E > > I'm just flabbergasted by some people's expectations. Perhaps John > doesn't realize the git developers are all volunteers, and that it is > never appropriate to criticize a volunteer. A "thank you for all your > hard work on git" would have done nicely. I'm as much of an open source developer as anyone else here. I spend a huge amount of my time programming for KDE. But I've never told a user "well that settles it" because they won't code it themselves :-/ I certainly get a huge number of bugs/wishes that I can't/won't code myself, but I try to be a bit more diplomatic about it. But then the kernel mailing lists tend to be a lot more.. direct.. than the kde mailing lists, so I guess it comes from that. Requiring people to have a thick skin and all that. John ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell @ 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 16:22 ` Johannes Schindelin 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 1 reply; 82+ messages in thread From: Nicolas Pitre @ 2009-05-13 15:27 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git On Wed, 13 May 2009, John Tapsell wrote: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bugs/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more.. direct.. > than the kde mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. This is not the kernel mailing list. In fact this list is quite a bit friendlier and more accommodating than the kernel list. The remark alluded to above comes from _one_ of the git developers. And Dscho is apparently in a rather sad mood these days. While the substance of Dscho's remark is entirely pertinent, it would be wrong to use its form and style as a characterization of git developers in general. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 16:22 ` Johannes Schindelin 0 siblings, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 16:22 UTC (permalink / raw) To: Nicolas Pitre; +Cc: John Tapsell, Jay Soffian, Alex Riesen, git Hi, On Wed, 13 May 2009, Nicolas Pitre wrote: > On Wed, 13 May 2009, John Tapsell wrote: > > > I'm as much of an open source developer as anyone else here. I spend > > a huge amount of my time programming for KDE. But I've never told a > > user "well that settles it" because they won't code it themselves :-/ > > I certainly get a huge number of bugs/wishes that I can't/won't code > > myself, but I try to be a bit more diplomatic about it. > > > > But then the kernel mailing lists tend to be a lot more... direct... > > than the KDE mailing lists, so I guess it comes from that. Requiring > > people to have a thick skin and all that. > > This is not the kernel mailing list. In fact this list is quite a bit > friendlier and more accommodating than the kernel list. > > The remark alluded to above comes from _one_ of the git developers. And > Dscho is apparently in a rather sad mood these days. While the substance > of Dscho's remark is entirely pertinent, it would be wrong to use its > form and style as a characterization of git developers in general. Even if I were in a better mood, the whole thread has a back story on an msysGit issue, and this led me to try to stop what I feared would become a rather long mail thread without much of an outcome, such as that infamous thread about MacOSX UTF-8 filename handling. Alas, it seems that Robin is willing to work on the issues, so my fears have been totally and completely unfounded. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:24 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>>> I don't know why the git developers are being so hostile/dismissive, >>>> Are you serious? >>>> >>> ...because we'll kill you if you aren't >:-E >> I'm just flabbergasted by some people's expectations. Perhaps John >> doesn't realize the git developers are all volunteers, and that it is >> never appropriate to criticize a volunteer. A "thank you for all your >> hard work on git" would have done nicely. > > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bugs/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more... direct... > than the KDE mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. > I think much of the perceived malignancy stems from the fact that the git list has a high ratio of developer-to-luser mailings on it, being by nature a developer tool most of the time. When the unaware user appears on the list with demands rather than polite requests, they're treated that much harder. Especially by the developer who happens to be, as it were, the butt of the request. 
Personally, I've only rarely found Dscho being anything but friendly on this list, and even then, I really didn't find it offensive. If viewed in a happy mood, it matches quite nicely with a Swedish sketch whose theme is "men ja ente bitter". It's often quite funny, really :-) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 17:24 ` Andreas Ericsson @ 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 82+ messages in thread From: Miles Bader @ 2009-05-14 1:49 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell <johnflux@gmail.com> writes: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ FWIW, Johannes' use of "Well, that rather settles things, no?" in this thread didn't strike me as being rude or truly dismissive (even though it's literally so). It seemed more just a timely and to-the-point reminder that however fun it is to talk about random feature X, someone's gotta do the work if it's going to actually be implemented, and that the direction of git development very much follows the whims of those doing the actual hacking (perhaps more so than other projects). [and I don't even have particularly thick skin, I think -- I'm often very annoyed by the brusqueness one sees on many developer mailing lists...] -Miles -- Acquaintance, n. A person whom we know well enough to borrow from, but not well enough to lend to. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 2 replies; 82+ messages in thread From: Jeff King @ 2009-05-12 16:16 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > As for file names, no plans, it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross platform. Or they use a single encoding like utf8 so that there are no surprises. You can still run into normalization problems with filenames on some filesystems, though. Linus's name_hash code sets up the framework to handle "these two names are actually equivalent", but right now I think there is just code for handling case-sensitivity, not utf8 normalization (but I just skimmed the code, so I might be wrong). -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
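The equivalence framework Peff describes can be pictured with a small sketch. To be clear, this is illustrative only and is not git's actual name_hash code: the idea is to run each filename through a folding step before hashing, so that names the filesystem may treat as identical land in the same hash bucket. Here the fold is plain ASCII case-folding; the utf8-normalizing variant Peff mentions would fold each name to NFC before hashing instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Illustrative sketch, NOT git's actual name-hash code: fold every
 * byte before mixing it into an FNV-1a hash, so that two names the
 * filesystem treats as "the same" fall into the same bucket.  A
 * lookup would then confirm equivalence with a folded comparison
 * rather than a plain byte comparison.
 */
static uint32_t folded_name_hash(const char *name)
{
	uint32_t hash = 2166136261u;		/* FNV-1a offset basis */
	while (*name) {
		unsigned char c = tolower((unsigned char)*name++);
		hash = (hash ^ c) * 16777619u;	/* FNV-1a prime */
	}
	return hash;
}
```

With a fold like this, "Makefile" and "MAKEFILE" hash identically while genuinely different names still separate; swapping the `tolower()` for an NFC-normalizing fold would give the utf8 behavior discussed in this thread.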
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King @ 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:57 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git Hi, On Tue, 12 May 2009, Jeff King wrote: > On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > > > As for file names, no plans, it's a sequence of bytes, but I think a > > lot of people wind up using some subset of US-ASCII for their file > > names, especially if their project is going to be cross platform. > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). Back then I actually started on a patch to make Git capable of determining UTF-8 equivalence, but at the same time somebody started such an annoying mail thread that I stopped working on the issue completely. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin @ 2009-05-13 16:26 ` Linus Torvalds 2009-05-13 17:12 ` Linus Torvalds 1 sibling, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 16:26 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Tue, 12 May 2009, Jeff King wrote: > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But quite frankly, the index is only part of it, and probably not the worst part. The real pain of filename handling is all the "read tree recursively with readdir()" issues. Along with just an absolute sh*t-load of issues about what to do when people ended up using different versions of the "same" name in different branches. There's also the issue that "cross-platform" really can be a pretty damn big pain. What do you do for platforms that simply are pure shit? I realize that OS X people have a hard time accepting it, but OS X filesystems are generally total and utter crap - even more so than Windows. Yes, yes, you can tell OS X that case matters, but that's not the normal case - and what do you do with projects that simply _do_ care about case. The kernel is one such project. Sure, you can "encode" the filenames on such broken filesystems in a way that they'd be different - but that won't really help the project, since makefiles etc won't work anyway. So one reason I didn't bother with utf-8 is that the much more fundamental issues are simply in plain old 7-bit US-ASCII. 
That said, if the only issue is that you want to encode regular utf-8 in a coherent way (and ignore the case issues), then we could probably do that part fairly easily with a "convert_to_internal()" and "convert_to_filename()" thing that acts very much like the CRLF conversion (except on filenames, not data). And yes, it's probably worth doing, since we'd need that for fuller case support anyway. It's just a fair amount of churn - not fundamentally _hard_, but not trivial either. And it needs a _lot_ of care, and a fair amount of testing that is probably hard to do on sane filesystems (ie the case where the filesystem actually _changes_ the name is going to be hard to test on anything sane). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 16:26 ` Linus Torvalds @ 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 17:12 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But > quite frankly, the index is only part of it, and probably not the worst > part. > > The real pain of filename handling is all the "read tree recursively with > readdir()" issues. Along with just an absolute sh*t-load of issues about > what to do when people ended up using different versions of the "same" > name in different branches. Btw, if people care mainly just about OS X, and don't worry so much about case, but about the idiotic and insane OS X behavior of turning UTF-8 filenames into that crazy NFD format, here's a simple patch that may be useful for that. There _will_ certainly be other places, but this handles the one big case of "read_directory_recursive()", and can turn NFD into the sane NFC format. Since OS X will then accept NFC (and internally turn it back to NFD) when you pass them as filenames, that means that converting the other way is not necessary. NOTE NOTE NOTE! This really just handles one case, and is not enough for any kind of general case. For example, it does NOT handle the case where you do git add filename_with_åäö explicitly, because if the "filename_with_åäö" is done using NFD (tab-completion etc), now git won't _match_ it with the filename it reads using readdir() any more (which got converted to NFC), so at a minimum we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. See "get_pathspec()" in setup.c for that latter case. But with that, and this crazy thing, OS X users might be already a lot better off. Totally untested, of course. 
Oh, and somebody needs to fill in that convert_name_from_nfd_to_nfc() implementation. It's designed so that if it notices that the string is just plain US-ASCII, it can return 0 and no extra work is done. That, in turn, can easily be done by some simple and efficient pre-processing that checks that there are no high bits set (on a 64-bit platform, do it 8 characters at a time with a "& 0x8080808080808080"), so that the common case has barely any overhead at all. Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do the actual normalization if you find characters with the high bit set. And since I know that the OS X filesystems are so buggy as to not even do that whole NFD thing right, there is probably some OS X-specific "use this for filesystem names" conversion function. Hmm. Anybody want to take this on? It really shouldn't be too complex to get it working for the common case on just OS X. It's really the case sensitivity that is the biggest problem; if you ignore that for now, the problem space is _much_ smaller. In other words, I think we can reasonably easily support a subset of _common_ issues with some trivial patches like this. But getting it right in _all_ the cases is going to be much more work (there are lots of other uses of "readdir()" too, this one just happens to be one of the more central ones). Of course, it probably makes sense to have a whole "git_readdir()" that does this thing in general. That "create_full_path()" thing makes sense regardless, though, in that it also simplifies a lot of "baselen+len" usage in just "len". 
Linus

---
 dir.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
 }
 
 /*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+	char temp[256], nlen;
+	nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+	if (nlen) {
+		len = nlen;
+		d_name = temp;
+	}
+#endif
+	memcpy(fullname + baselen, d_name, len + 1);
+	return baselen + len;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		/* Ignore overly long pathnames! */
 		if (len + baselen + 8 > sizeof(fullname))
 			continue;
-		memcpy(fullname + baselen, de->d_name, len+1);
-		if (simplify_away(fullname, baselen + len, simplify))
+		len = create_full_path(fullname, baselen, de->d_name, len);
+		if (simplify_away(fullname, len, simplify))
 			continue;
 
 		dtype = DTYPE(de);
 		exclude = excluded(dir, fullname, &dtype);
 		if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
-		    && in_pathspec(fullname, baselen + len, simplify))
-			dir_add_ignored(dir, fullname, baselen + len);
+		    && in_pathspec(fullname, len, simplify))
+			dir_add_ignored(dir, fullname, len);
 
 		/*
 		 * Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		default:
 			continue;
 		case DT_DIR:
-			memcpy(fullname + baselen + len, "/", 2);
+			memcpy(fullname + len, "/", 2);
 			len++;
-			switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+			switch (treat_directory(dir, fullname, len, simplify)) {
 			case show_directory:
 				if (exclude != !!(dir->flags
 						& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 				break;
 			case recurse_into_directory:
 				contents += read_directory_recursive(dir,
-					fullname, fullname, baselen + len, 0, simplify);
+					fullname, fullname, len, 0, simplify);
 				continue;
 			case ignore_directory:
 				continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		if (check_only)
 			goto exit_early;
 		else
-			dir_add_name(dir, fullname, baselen + len);
+			dir_add_name(dir, fullname, len);
 	}
 exit_early:
 	closedir(fdir);

^ permalink raw reply related [flat|nested] 82+ messages in thread
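The fast pre-processing step Linus suggests above (scan eight characters at a time against the mask 0x8080808080808080, and only fall into the normalization code when a high bit is found) might look roughly like this. This is a sketch with a made-up helper name, not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the suggested fast path: return 1 only if some byte of
 * the name has its high bit set, i.e. the name is not plain US-ASCII
 * and may need NFD->NFC normalization.  The memcpy() into a uint64_t
 * keeps the 8-bytes-at-a-time test alignment-safe; a real caller
 * would invoke the normalization routine only when this returns 1.
 */
static int needs_normalization(const char *name, size_t len)
{
	while (len >= 8) {
		uint64_t chunk;
		memcpy(&chunk, name, 8);	/* 8 characters at a time */
		if (chunk & 0x8080808080808080ull)
			return 1;
		name += 8;
		len -= 8;
	}
	while (len--)
		if (*name++ & 0x80)
			return 1;
	return 0;	/* pure US-ASCII: no conversion work needed */
}
```

The common all-ASCII case thus touches each byte once with word-sized loads and never calls into any Unicode library at all.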
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds @ 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 20:57 ` Matthias Andree 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. That "create_full_path()" thing makes sense > regardless, though, in that it also simplifies a lot of "baselen+len" > usage in just "len". > In a flash of premonitory insight, libgit2 has gitfo_foreach_dirent(path, callback) which would probably be well suited for this kind of thing. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson @ 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 20:57 ` Matthias Andree 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 17:46 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. Actually, the more I think about that, the less true I think it is. It _sounds_ like a nice simplification ("just do it once in readdir, and forget about it everywhere else"), but it's in fact a stupid thing to do. Why? If we _ever_ want to fix this in the general case, then the code that does the readdir() will actually have to remember both the "raw filesystem" form _and_ the "cleaned-up utf-8 form". Why? Because when we do readdir(), we'll also do 'lstat()' on the end result to check the types, and opendir() in case it's a directory and we then want to do things recursively etc. And that happens to work on OS X (because we can use our "fixed" filename for lstat too), but it does not work in the general case. And you can say "well, just do the stat inside the wrapped readdir()", but that doesn't work _either_, since - we don't want to do the lstat() if it's unnecessary. Even if we don't have "de->d_type" information, we can often avoid the need for it, if we can tell that the name isn't interesting (due to being ignored). Avoiding the lstat is a huge performance issue for cold-cache cases. It's basically a seek. So we really want to do the lstat() later, which implies that the caller needs to know _both_ the original "real" filesystem name _and_ the converted one. 
- it doesn't handle the opendir() case anyway - so the end result is that a real implementation will _always_ need to carry around both the "filesystem view" filename _and_ the "what we've converted it into". Now, the point of the patch I sent out was that for the specific case of OS X, which does UTF-8 conversions (wrong) but also is happy to get our properly normalized name, we don't care. So my patch is "correct" for that special case - and so would a plain readdir() wrapper be. But my patch is _also_ correct for the case where a readdir() wrapper would do the wrong thing. My patch doesn't _handle_ it (since it doesn't change the code to pass both "filesystem view" and "cleaned-up view" pathnames), but the patch I sent out also doesn't make it any harder to do right. In contrast, doing a readdir() wrapper makes it much harder to do right later, because it's just doing the conversion at the wrong level (you could make that "wrapper" return both the original and the fixed filename, but at that point the wrapper doesn't really help - you might as well just have the "convert" function, and it would be a hell of a lot more obvious what is really going on). So I take it back. A readdir() wrapper is not a good idea. It gets us a tiny bit of the way, but it would actually take us a step back from the "real" solution. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 18:37 ` Linus Torvalds 0 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2009-05-13 18:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 7:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I take it back. A readdir() wrapper is not a good idea. It gets us a > tiny bit of the way, but it would actually take us a step back from the > "real" solution. Do we need to take the real solution to the core of git? What I am wondering is whether we can keep this simple in git internals and catch problem filenames at git-add time. This would allow git to keep treating filenames as a bag of bytes, and it does a better thing for users. In cross platform projects, most users don't even know that there are problems, and even if they do, they don't know what the problems are. If git add can be told to warn & refuse to add a path with portability problems, then we educate our users, prevent them from committing filenames that will later cause trouble to others in their projects, etc. from-the-keep-it-simple-and-informative-dept, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
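The git-add-time check Martin proposes could be prototyped along these lines. Everything here is hypothetical (the helper name and the exact rule set are not part of git); it only illustrates the kind of portability warning such a check could raise:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of a "git add"-time portability check: flag
 * paths that are likely to misbehave on some platform.  It rejects
 * non-ASCII bytes (whose normalization differs across filesystems)
 * and characters Windows forbids in filenames.  A real check would
 * also cover reserved device names (CON, NUL, ...), trailing dots
 * and spaces, and case-only collisions with already-tracked paths.
 */
static int portability_problem(const char *path)
{
	for (; *path; path++) {
		unsigned char c = *path;
		if (c & 0x80)
			return 1;	/* non-ASCII: normalization hazards */
		if (strchr("<>:\"\\|?*", c))
			return 1;	/* forbidden on Windows */
	}
	return 0;
}
```

A warn-and-refuse mode built on something like this would educate users at the moment they create the problem filename, exactly as suggested above, while leaving git's internal bag-of-bytes model untouched.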
* Re: Cross-Platform Version Control 2009-05-13 18:26 ` Martin Langhoff @ 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:08 ` Daniel Barkalow 0 siblings, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 18:37 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Martin Langhoff wrote: > > Do we need to take the real solution to the core of git? Well, I suspect that if we really want to support it, then we'd better. > What I am wondering is whether we can keep this simple in git > internals and catch problem filenames at git-add time. I can almost guarantee that it will just cause more problems than it solves, and generate some nasty cases that just aren't solvable. Because it really isn't just "git add". It's every single thing that does a lstat() on a filename inside of git. Now, the simple OS X case is not a huge problem, since the lstat will succeed with the fixed-up filename too. But as mentioned, the OS X case is the thing that doesn't need a lot of infrastructure _anyway_ - I can almost guarantee that my posted patch (with the added setup.c stuff for get_pathspec()) is going to be _fewer_ lines than some wrapper logic. Note: in all of the above, I assume that people care more about just plain UTF characters (and the insane NFD form OS X uses) than about worrying about the _really_ subtle issues of case-independence. Those are a major pain, but they will need even more "internal" support, because there simply isn't any sane wrapping method. (You could wrap everything to force lower-casing of all filesystem ops or something, but that would not be acceptable to any sane environment. So in reality you need to accept mixed-case things, and then there is no way to know from the "outside" whether one external mixed-case thing matches some internal index mixed-case thing). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds @ 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:20 ` Linus Torvalds 2009-05-13 21:08 ` Daniel Barkalow 1 sibling, 1 reply; 82+ messages in thread From: Theodore Tso @ 2009-05-13 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 11:37:28AM -0700, Linus Torvalds wrote: > Note: in all of the above, I assume that people care more about just plain > UTF characters (and the insane NFD form OS X uses) than about worrying > about the _really_ subtle issues of case-independence. Those are a major > pain, but they will need even more "internal" support, because there > simply isn't any sane wrapping method. Stupid question --- if we get something that works for Windows and MacOS X, is there any reason why we need to solve the general problem of case-insensitive filesystems? It's really backwards compatibility with legacy OSes that's most important, right? Are there any other systems other than Windows and Mac OS X which (a) perpetrate case insensitivity on application programmers, and (b) which current or future git users are likely to care about? - Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:20 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:20 UTC (permalink / raw) To: Theodore Tso Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Theodore Tso wrote: > > Stupid question --- if we get something that works for Windows and > MacOS X, is there any reason why we need to solve the general problem > of case-insensitive filesystems? Quite frankly, I don't think we're even very close to getting anything that works for Windows or OS X. Case-insensitivity is _hard_. The "easy" case is to just handle the OS X crazy pseudo-NFD format, and at least turn that into NFC (and perhaps add a config option to do latin1 and EUC-JP to utf-8 too). At that point, we at least handle regular utf-8 the same way. Doing the latin1/EUC-JP thing would actually to some degree be more interesting than the OS X NFD case, because that really does require two-way conversion, and we can "test" that even on sane filesystems (ie play at having a Latin1 filesystem). That said, I suspect there aren't that many people who care about latin1 filesystems. I dunno about EUC-JP (and variants - for all I know, shift-JIS and other cases may be the more common ones). Of course, if we do everything right, maybe the Windows people would actually like us to keep the filesystem-native representation in UTF-16LE or whatever the crazy format is that Windows really uses deep down. My point being that all of these things happen even without the added worry about case. And in many ways, not worrying about case should probably be the first step. We do have some support for worrying about case, but trying to solve both things at the same time isn't going to be workable, I suspect. 
Case insensitivity should never ever involve a _conversion_ (if it does, you get all kinds of crazy behavior), it's just purely a _comparison_ issue, so the two really are fundamentally different. Of course, the reason OS-X seems to be so messed up is exactly that the morons at Apple didn't understand the difference between conversion and comparison, and mixed them up. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:08 ` Daniel Barkalow 2009-05-13 21:29 ` Linus Torvalds 1 sibling, 1 reply; 82+ messages in thread From: Daniel Barkalow @ 2009-05-13 21:08 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > On Wed, 13 May 2009, Martin Langhoff wrote: > > > > Do we need to take the real solution to the core of git? > > Well, I suspect that if we really want to support it, then we'd better. > > > What I am wondering is whether we can keep this simple in git > > internals and catch problem filenames at git-add time. > > I can almost guarantee that it will just cause more problems than it > solves, and generate some nasty cases that just aren't solvable. > > Because it really isn't just "git add". It's every single thing that does > a lstat() on a filename inside of git. > > Now, the simple OS X case is not a huge problem, since the lstat will > succeed with the fixed-up filename too. I'm not seeing what the general case is, and how it could possibly behave. There's the "insensitive" behavior: if you create "foo" and look for "FOO", it's there, but readdir() reports "foo". There's the "converting" behavior: if you create "foo", readdir() reports "FOO", but lstat("foo") returns it. The obvious general case is: if you create "foo", readdir() reports "FOO", and lstat("foo") doesn't find a match. But if you create "foo" again... it doesn't find "foo", so it creates a new file, which it also calls "FOO", and the filesystem now has two files with identical names? It seems to me that the limits of minimally functional, non-inode-losing filesystems are: lstat() might take a filename and return the data for a non-byte-identical filename; open(name, O_CREAT|O_EXCL) might replace the given name with a non-byte-identical filename. 
But surely open(name) and lstat(name) (with the same name) must find the same file, even if readdir() would report it with a different name. And I assume that a filesystem that rejected any non-NFD filenames or any non-NFC filenames would be totally unusable, in that users will manage to get unnormalized filenames into programs and find that the filesystem just doesn't work. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:08 ` Daniel Barkalow @ 2009-05-13 21:29 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:29 UTC (permalink / raw) To: Daniel Barkalow Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Daniel Barkalow wrote: > > > > Now, the simple OS X case is not a huge problem, since the lstat will > > succeed with the fixed-up filename too. > > I'm not seeing what the general case is, and how it could possibly behave. Here's a simple example. Let's say that your company uses Latin1 internally for your filesystems, because your tools really aren't utf-8 ready. This is NOT AT ALL unnatural - it's how lots of people used to work with Linux over the years, and it's largely how people still use FAT, I suspect (except it's not latin1, it's some windows-specific 8-bits-per-character mapping). IOW, if you have a file called 'åäö', it literally is encoded as '\xe5\xe4\xf6' (if you wonder why I picked those three letters, it's because they are the regular extra letters in Swedish - Swedish has 29 letters in its alphabet, and those three letters really are letters in their own right, they are NOT 'a' and 'o' with some dots/rings on top). IOW, if you open such a file, you need to use those three bytes. Now, even if you happen to have an OS and use Latin1 on disk, you may realize that you'd like to interact with others that use UTF-8, and would want to have your git archive that you export use nice portable UTF-8. But you absolutely MUST NOT just do a conversion at "readdir()" time. If you do that, then your three-byte filename turns into a six-byte utf-8 sequence of '\xc3\xa5\xc3\xa4\xc3\xb6' and the thing is, now "lstat()" won't work on that sequence. So obviously you could always turn things _back_ for lstat(), but quite frankly, that's (a) insane (b) incompetent and (c) not even always well-defined. 
> There's the "insensitive" behavior: if you create "foo" and look for > "FOO", it's there, but readdir() reports "foo". > > There's the "converting" behavior: if you create "foo", readdir() reports > "FOO", but lstat("foo") returns it. Then there's the behaviour above: you want your git repository to have utf-8, but your filesystem doesn't convert anything at all, and all your regular tools (think editors etc) are all Latin1. Latin1 is going away, I hope, but I bet EUC-JP etc still exist. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 20:57 ` Matthias Andree 2009-05-13 21:10 ` Linus Torvalds 2 siblings, 1 reply; 82+ messages in thread From: Matthias Andree @ 2009-05-13 20:57 UTC (permalink / raw) To: Linus Torvalds, Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On 13.05.2009 19:12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to > do the actual normalization if you find characters with the high bit > set. And since I know that the OS X filesystems are so buggy as to not > even do that whole NFD thing right, there is probably some OS-X specific > "use this for > filesystem names" conversion function. Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, rather than canonical, normalization) for anything except normalizing temporary variables inside strcasecmp(3) or similar. Probably not even that. The normalizations done are often irreversible and also surprising. You don't want to turn 2³.c into 23.c, do you? -- Matthias Andree ^ permalink raw reply [flat|nested] 82+ messages in thread
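Matthias's 2³.c example is reproducible with Python's `unicodedata` (a sketch showing why NFKC is the wrong tool here):

```python
import unicodedata

name = '2\N{SUPERSCRIPT THREE}.c'   # '2³.c'

# NFKC applies compatibility mappings and folds the superscript away:
assert unicodedata.normalize('NFKC', name) == '23.c'

# Plain NFC is canonical-only and leaves the filename alone:
assert unicodedata.normalize('NFC', name) == name
```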
* Re: Cross-Platform Version Control 2009-05-13 20:57 ` Matthias Andree @ 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 0 siblings, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:10 UTC (permalink / raw) To: Matthias Andree; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Matthias Andree wrote: > Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds > <torvalds@linux-foundation.org>: > > > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do > > the actual normalization if you find characters with the high bit set. And > > since I know that the OS X filesystems are so buggy as to not even do that > > whole NFD thing right, there is probably some OS-X specific "use this for > > filesystem names" conversion function. > > Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, > rather than canonical, normalization) for anything except normalizing > temporary variables inside strcasecmp(3) or similar. Probably not even that. > The normalizations done are often irreversible and also surprising. You don't > want to turn 2³.c into 23.c, do you? No, you're right. We want just plain NFC. I just googled for how some other projects handled this, and found the stringprep thing in a post about rsync, and didn't look any closer. But yes, you're absolutely right, stringprep is total crap, and nfkc is horrible. I have no idea of what library to use, though. For perl, there's Unicode::Normalize, but that's likely still subtly incorrect for the OS-X case due to the filesystem not using _strict_ NFD. I have this dim memory of somebody actually pointing to the documentation of exactly which characters OS X ends up decomposing. Maybe we could just do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the OS X case is the only one we need to care about? 
Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds @ 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 82+ messages in thread From: Jay Soffian @ 2009-05-13 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Matthias Andree, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 5:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > I have this dim memory of somebody actually pointing to the documentation > of exactly which characters OS X ends up decomposing. http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties http://developer.apple.com/technotes/tn/tn1150table.html j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian @ 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 82+ messages in thread From: Matthias Andree @ 2009-05-13 21:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On 13.05.2009 23:10, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 13 May 2009, Matthias Andree wrote: > >> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds >> <torvalds@linux-foundation.org>: >> >> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something >> to do >> > the actual normalization if you find characters with the high bit >> set. And >> > since I know that the OS X filesystems are so buggy as to not even do >> that >> > whole NFD thing right, there is probably some OS-X specific "use this >> for >> > filesystem names" conversion function. >> >> Sorry for interrupting, but NF_K_C? You don't want that (K for >> compatibility, >> rather than canonical, normalization) for anything except normalizing >> temporary variables inside strcasecmp(3) or similar. Probably not even >> that. >> The normalizations done are often irreversible and also surprising. You >> don't >> want to turn 2³.c into 23.c, do you? > > No, you're right. We want just plain NFC. I just googled for how some > other projects handled this, and found the stringprep thing in a post > about rsync, and didn't look any closer. > > But yes, you're absolutely right, stringprep is total crap, and nfkc is > horrible. Crap? It's just beside the purpose, and some limited form of fuzzy match. Anyways... > I have no idea of what library to use, though. For perl, there's > Unicode::Normalize, but that's likely still subtly incorrect for the OS-X > case due to the filesystem not using _strict_ NFD. Perhaps ICU (ICU4C), from http://site.icu-project.org/ -- Matthias Andree ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 18:28 ` Dmitry Potapov 2009-05-12 18:40 ` Martin Langhoff 2009-05-14 13:48 ` Peter Krefting 2 siblings, 1 reply; 82+ messages in thread From: Dmitry Potapov @ 2009-05-12 18:28 UTC (permalink / raw) To: Esko Luontola; +Cc: git On Tue, May 12, 2009 at 06:06:05PM +0300, Esko Luontola wrote: > A good start for making Git cross-platform, would be storing the text > encoding of every file name and commit message together with the commit. > Currently, because Git is oblivious to the encodings and just considers > them as a series of bytes, there is no way to make them cross-platform. 1. Git already stores the encoding for all commit messages that are not in UTF-8. 2. If you really want to be cross-platform portable, you should not use any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable Filename Character Set) http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
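The Portable Filename Character Set Dmitry references is small enough that a conformance check is one regular expression (a hypothetical helper; the function name is illustrative):

```python
import re

# POSIX Portable Filename Character Set: letters, digits, '.', '_', '-'.
_PORTABLE = re.compile(r'\A[A-Za-z0-9._-]+\Z')

def is_portable(name: str) -> bool:
    """True if a single path component stays inside [A-Za-z0-9._-]."""
    return bool(_PORTABLE.match(name))

assert is_portable('Makefile')
assert is_portable('read-me_v2.txt')
assert not is_portable('\u00e5\u00e4\u00f6.c')   # 'åäö.c'
assert not is_portable('has space.txt')
```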
* Re: Cross-Platform Version Control 2009-05-12 18:28 ` Dmitry Potapov @ 2009-05-12 18:40 ` Martin Langhoff 2009-05-12 18:55 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2009-05-12 18:40 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > 2. If you really want to be cross-platform portable, you should not use > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > Filename Character Set) > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Would it make sense to have warnings at 'git add' time about - filenames outside of that charset (as the strictest mode, perhaps even default) - filenames that have a potential conflict wrt case-sensitivity - filenames that have potential conflict in the same tree due to utf-8 encoding vagaries MHO is that a strict "start your project portable from day one" mode is best as a default. But I'd be happy with any default, actually ;-) m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
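Martin's three proposed 'git add'-time warnings can be sketched as a single pass over the candidate names. This is a hypothetical checker; the function name, messages, and the choice of NFC-plus-casefold as the collision key are mine, not git's:

```python
import re
import unicodedata

_PORTABLE = re.compile(r'\A[A-Za-z0-9._-]+\Z')

def filename_warnings(names):
    """Hypothetical checks for the three cases Martin lists."""
    warnings = []
    seen = {}  # case-folded, NFC-normalized spelling -> first original name
    for name in names:
        # Case 1: outside the portable character set.
        if not _PORTABLE.match(name):
            warnings.append('%s: outside the portable character set' % name)
        # Cases 2 and 3: collisions on case-insensitive or normalizing
        # filesystems (e.g. NFC vs NFD spellings of the same name).
        key = unicodedata.normalize('NFC', name).casefold()
        if key in seen and seen[key] != name:
            warnings.append('%s: may collide with %s on case-insensitive '
                            'or normalizing filesystems' % (name, seen[key]))
        else:
            seen.setdefault(key, name)
    return warnings

assert filename_warnings(['README', 'readme']) != []
assert filename_warnings(['main.c', 'util.c']) == []
```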
* Re: Cross-Platform Version Control 2009-05-12 18:40 ` Martin Langhoff @ 2009-05-12 18:55 ` Jakub Narebski 0 siblings, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2009-05-12 18:55 UTC (permalink / raw) To: Martin Langhoff; +Cc: Dmitry Potapov, Esko Luontola, git Martin Langhoff <martin.langhoff@gmail.com> writes: > On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > > 2. If you really want to be cross-platform portable, you should not use > > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > > Filename Character Set) > > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 > > Would it make sense to have warnings at 'git add' time about > > - filenames outside of that charset (as the strictest mode, perhaps > even default) > - filenames that have a potential conflict wrt case-sensitivity > - filenames that have potential conflict in the same tree due to > utf-8 encoding vagaries > > MHO is that a strict "start your project portable from day one" mode > is best as a default. But I'd be happy with any default, actually ;-) Somebody asked for a pre-add hook in the past; it would be a good place to put such a check. But in the meantime you can do it using a pre-commit hook instead, can't you? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 18:28 ` Dmitry Potapov @ 2009-05-14 13:48 ` Peter Krefting 2009-05-14 19:58 ` Esko Luontola 2 siblings, 1 reply; 82+ messages in thread From: Peter Krefting @ 2009-05-14 13:48 UTC (permalink / raw) To: Esko Luontola; +Cc: git Esko Luontola: > A good start for making Git cross-platform, would be storing the text > encoding of every file name and commit message together with the commit. Is it really necessary to store the encoding for every single file name, should it not be enough to just store encoding information for all file names at once (i.e., for the object that contains the list of file names and their associated blobs)? I did publish, as a request for comments, the beginnings of a patch that would change the Windows version of Git to expect file names to be UTF-8 encoded. There were some comments about it, especially that I could not just assume that UTF-8 was the right thing to assume. Perhaps if we added some meta-data, maybe using the same fall-back mechanism as for commit messages (i.e., assume UTF-8 unless otherwise specified), it would be easier to do. On Windows, the file APIs allow you to use Unicode (UTF-16) to specify file names, and the file systems will handle any necessary conversion to whatever byte sequences are used to store the file names. UTF-16 and UTF-8 are trivial to convert between, and Windows does contain APIs to convert between other character encodings and UTF-16. On Mac OS X, I believe the file system APIs assume you use some kind of normalized UTF-8. That should also be possible to create, possibly converting back and forth between different normalization forms, if necessary. On Linux and other Unixes we could just use iconv() to convert from the repository file name encoding to whatever the current locale has set up. The trick here is to handle file names outside the current encoding. 
Some kind of escaping mechanism will probably need to be introduced. The best way would be to define this in the Git core once and for all, and add support to it for all the platforms in the same go, instead of trying to hack around the issue whenever it pops up on the various platforms. My main use-case for Git on Windows has disappeared as my $dayjob went bankrupt, but I am happy to assist with whatever insight I may be able to bring. -- \\// Peter - http://www.softwolves.pp.se/ ^ permalink raw reply [flat|nested] 82+ messages in thread
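The UTF-8/UTF-16 conversion Peter calls trivial really is mechanical, as a Python sketch shows (git itself would do this in C against the Windows wide-character APIs):

```python
name = '\u00e5\u00e4\u00f6.c'      # 'åäö.c'

utf8 = name.encode('utf-8')        # what a git tree object could store
utf16 = name.encode('utf-16-le')   # what the Windows wide-char APIs take

# The round trip is lossless in both directions:
assert utf8.decode('utf-8') == name
assert utf16.decode('utf-16-le') == name
assert utf8.decode('utf-8') == utf16.decode('utf-16-le')
```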
* Re: Cross-Platform Version Control 2009-05-14 13:48 ` Peter Krefting @ 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-14 19:58 UTC (permalink / raw) To: Peter Krefting; +Cc: git Peter Krefting wrote on 14.5.2009 16:48: > Is it really necessary to store the encoding for every single file name, > should it not be enough to just store encoding information for all file > names at once (i.e., for the object that contains the list of file names > and their associated blobs)? What about if some disorganized project has people committing with many different encodings? Should we allow it, that a directory has the names of some files using one encoding, and the names of other files using another encoding? Or should we force the whole repository to use the same encoding? > The best way would be to define this in the Git core once and for all, > and add support to it for all the platforms in the same go, instead of > trying to hack around the issue whenever it pops up on the various > platforms. +1 -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola @ 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-14 20:21 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git Esko Luontola wrote: > Peter Krefting wrote on 14.5.2009 16:48: >> Is it really necessary to store the encoding for every single file >> name, should it not be enough to just store encoding information for >> all file names at once (i.e., for the object that contains the list of >> file names and their associated blobs)? > > What about if some disorganized project has people committing with many > different encodings? Should we allow it, that a directory has the names > of some files using one encoding, and the names of other files using > another encoding? Or should we force the whole repository to use the > same encoding? > If encodings are on a per-tree basis, we could add a special mode-flag for it without breaking backward compatibility (I think, anyways). Older gits just won't know how to handle it and will treat it as a byte-stream. >> The best way would be to define this in the Git core once and for all, >> and add support to it for all the platforms in the same go, instead of >> trying to hack around the issue whenever it pops up on the various >> platforms. > > +1 > There's still the problem that no one's stepped forward to do all that work yet, so apparently this isn't important enough for people to put their patches where their mouths are. Often when issues generate long discussions and no code, it's of high academic interest and of little real-world value. I believe the "little real-world value" here comes from the fact that cross-platform projects often enforce 7-bit ascii compatible filenames from the start, because they *know* they may run into problems on other filesystems otherwise. 
Remember it's not only git that has to get things right. It's also build-systems and compilers that have to locate the correct files (the Makefile and the filesystem may use different encodings), so in the real world, people really do stay away from filenames with åäö or other non-ascii chars in them. It's fun to discuss, but I won't spend any time on it. Good luck to those who do though. I'd quite like to see if someone could pull it off without breaking backwards compatibility or impacting performance too much. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson @ 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-14 22:25 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git Hi, On Thu, 14 May 2009, Esko Luontola wrote: > Peter Krefting wrote on 14.5.2009 16:48: > > > The best way would be to define this in the Git core once and for all, > > and add support to it for all the platforms in the same go, instead of > > trying to hack around the issue whenever it pops up on the various > > platforms. > > +1 You might be enthusiastic about this cunning idea. However, if it costs me performance on Linux, and all the benefits go to Windows users, then I will remove this "solution" from my personal Git tree _right away_, and I'd expect a lot of other people, too. I repeat this just once more: if you add complexity, you'll have to have a compelling reason to do so. If there is no benefit for Linux users, why should they bear the cost? But as Andreas remarked, I sincerely think that there has been enough talk about the issue. It's time to see some patches, or to stop the discussion. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin @ 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Dmitry Potapov @ 2009-05-15 11:18 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git On Thu, May 14, 2009 at 10:58:17PM +0300, Esko Luontola wrote: > > What about if some disorganized project has people committing with many > different encodings? Should we allow it, that a directory has the names > of some files using one encoding, and the names of other files using > another encoding? Or should we force the whole repository to use the > same encoding? The whole repository should have the same encoding internally. Anything else will be too complex and too slow... Have you seen any file system where file names would be stored in different encodings? And Git does far more operations on file names than a file system does. So, it is clear to me that the whole repository should have a single encoding. Now, I don't think that you will find many open source projects that use non-ASCII in file names. Moreover, most Linux users either use UTF-8 already or will switch to it in the near future. Mac OS X uses UTF-8 (though there is a problem with decomposed characters, but Linus posted a possible solution). So, the only platform where non-ASCII characters may be interesting to Git users and that does not support UTF-8 is Windows. AFAIK, Cygwin 1.7 has UTF-8 support. So, it is mostly a problem for msysGit... Though adding support for legacy encodings can help to some degree, it means that every system call involving a file name will go through UTF-8 <-> LEGACY_ENC <-> UTF-16LE conversion. IMHO, having a legacy encoding involved is far from the best possible solution; but to avoid that, you need to change MSYS to be able to work with UTF-8. (I have never looked at MSYS myself, but I suspect it may not be easy). 
Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
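The UTF-8 <-> LEGACY_ENC <-> UTF-16LE chain Dmitry describes is lossy whenever a name falls outside the legacy codepage. A sketch, with cp1252 standing in for the legacy encoding (my assumption, purely for illustration):

```python
name = '\u03b1\u03b2\u03b3.c'       # 'αβγ.c' -- fine in UTF-8 and UTF-16

# The UTF-8 and UTF-16LE ends of the chain round-trip losslessly:
assert name.encode('utf-8').decode('utf-8') == name
assert name.encode('utf-16-le').decode('utf-16-le') == name

# ...but the legacy hop in the middle cannot represent the name at all:
try:
    name.encode('cp1252')           # hypothetical LEGACY_ENC step
    representable = True
except UnicodeEncodeError:
    representable = False

assert not representable            # the name cannot survive the legacy hop
```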
end of thread, other threads:[~2009-05-15 11:20 UTC | newest] Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King 2009-04-29 20:05 ` Markus Heidelberg 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 2009-04-29 8:25 ` Martin Langhoff 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 9:34 ` Mike Hommey 2009-05-01 9:42 ` Kjetil Barvik 2009-05-01 17:42 ` Tony Finch 2009-05-01 5:24 ` Dmitry Potapov 2009-05-01 9:42 ` Mike Hommey 2009-05-01 10:46 ` Dmitry Potapov 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. 
Pearce 2009-04-30 14:22 ` Jeff King 2009-05-01 18:43 ` Linus Torvalds 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david 2009-05-01 19:32 ` Nicolas Pitre 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2009-05-01 22:11 ` david 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen 2009-05-04 8:01 ` Why Git is so fast Andreas Ericsson 2009-04-30 19:33 ` Jakub Narebski 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 17:56 ` Esko Luontola 2009-05-12 20:38 ` Johannes Schindelin 2009-05-12 21:16 ` Esko Luontola 2009-05-13 0:23 ` Johannes Schindelin 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> 2009-05-13 10:41 ` John Tapsell 2009-05-13 13:42 ` Jay Soffian 2009-05-13 13:44 ` Alex Riesen 2009-05-13 13:50 ` Jay Soffian 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 16:22 ` Johannes Schindelin 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:20 ` Linus Torvalds 2009-05-13 21:08 ` Daniel Barkalow 2009-05-13 21:29 ` Linus Torvalds 2009-05-13 20:57 ` Matthias Andree 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 2009-05-12 18:28 ` Dmitry Potapov 2009-05-12 18:40 ` Martin Langhoff 2009-05-12 18:55 ` Jakub Narebski 2009-05-14 13:48 ` Peter Krefting 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov