* Eric Sink's blog - notes on git, dscms and a "whole product" approach @ 2009-04-27 8:55 Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 0 siblings, 2 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-27 8:55 UTC (permalink / raw) To: Git Mailing List Eric Sink has been working on the (commercial, proprietary) centralised SCM Vault for a while. He's written recently about his explorations around the new crop of DSCMs, and I think it's quite interesting. A quick search of the list archives makes me think it wasn't discussed before. The guy is knowledgeable, and writes quite witty posts -- naturally, there's plenty to disagree on, but I'd like to encourage readers not to nitpick or focus on where Eric is wrong. It is interesting to read where he thinks git and other DSCMs are missing the mark. Maybe he's right, maybe he's wrong, but damn he's interesting :-) So here's the blog - http://www.ericsink.com/ These are the best entry points http://www.ericsink.com/entries/quirky.html http://www.ericsink.com/entries/hg_denzel.html To be frank, I think he's wrong in some details (as he's admittedly only spent limited time with it) but right on the larger picture (large userbases want it integrated and foolproof, bugtracking needs to go distributed alongside the code, git is as powerful^Wdangerous as C). cheers, martin -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff @ 2009-04-28 11:24 ` Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-28 11:24 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Martin Langhoff <martin.langhoff@gmail.com> writes: > Eric Sink has been working on the (commercial, proprietary) centralised > SCM Vault for a while. He's written recently about his explorations > around the new crop of DSCMs, and I think it's quite interesting. A > quick search of the list archives makes me think it wasn't discussed > before. > > The guy is knowledgeable, and writes quite witty posts -- naturally, > there's plenty to disagree on, but I'd like to encourage readers not > to nitpick or focus on where Eric is wrong. It is interesting to read > where he thinks git and other DSCMs are missing the mark. > > Maybe he's right, maybe he's wrong, but damn he's interesting :-) > > So here's the blog - http://www.ericsink.com/ "Here's a blog"... and therefore my dilemma: should I post my reply as a comment on this blog, or should I reply here on the git mailing list? > These are the best entry points Because those two entries are quite different, I'll reply to them separately. 1. "Ten Quirky Issues with Cross-Platform Version Control" > http://www.ericsink.com/entries/quirky.html which is a generic comment about (mainly) using version control in a heterogeneous environment, where different machines have different filesystem limitations. I'll concentrate here on that issue. 2. 
"Mercurial, Subversion, and Wesley Snipes" > http://www.ericsink.com/entries/hg_denzel.html where, paraphrasing, Eric Sink says that he doesn't write about Mercurial and Subversion because they are perfect. Or at least not as controversial (and controversial means interesting). > > To be frank, I think he's wrong in some details (as he's admittedly > only spent limited time with it) but right on the larger picture > (large userbases want it integrated and foolproof, bugtracking needs > to go distributed alongside the code, git is as powerful^Wdangerous as > C). Neither of the blog posts mentioned above touches those issues, BTW... ---------------------------------------------------------------------- Ad 1. "Ten Quirky Issues with Cross-Platform Version Control" Actually those are two issues: troubles with the different limitations of different filesystems, and different handling of line endings in text files on different platforms. Line endings (issue 8.) is in theory and in practice (at least for Git) a non-issue. In theory you should use the project's convention for the end-of-line character in text files, and use a smart editor that can deal (or can be configured to deal) with this issue correctly. In practice this is a matter of correctly setting up core.autocrlf (and in more complicated cases, where more complicated means for git very, very rare, configuring which files are text and which are not). There are a few classes of troubles with filesystems (with filenames). 1. Different limitations on file names (e.g. pathname length), different special characters, different special filenames (if any). Those are issue 2. (special basename PRN on MS Windows), issue 3. (trailing dot, trailing whitespace), issue 4. (pathname and filename length limit), issue 6. (special characters, in this case the colon being the path element delimiter on MacOS, but it is also about special characters like colon, asterisk and question mark on MS Windows) and also issue 7. 
(name that begins with a dash) in Eric Sink's article. The answer is a convention for filenames in a project. Simply DON'T use filenames which can cause problems. There is no way to simply solve this problem in a version control system, although I think if you really, really, really need it you should be able to cobble something together using low-level git tools to have a different filename in the working directory from the one used in the repository (and index). See also David A. Wheeler's essay "Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems" http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html DON'T DO THAT. 2. "Case-insensitive" but "case-preserving" filesystems; the case where some different filenames are equivalent (like 'README' and 'readme' on a case-insensitive filesystem), but are returned as you created them (so if you created 'README', you would get 'README' in a directory listing, but the filesystem would report that 'readme' exists too). This is issue 1. ('README' and 'readme' in the same directory) in Eric Sink's article. The answer is like for the previous issue: don't. Simply DO NOT create files with filenames which differ only in case (like the unfortunate ct_conntrack.h and cn_CONNTRACK.h or similar in the Linux kernel). But I think that even in the case where such an unfortunate incident (two filenames differing only in case) occurs, you can deal with it in Git by using lower-level tools (and editing only one of two such files at once). You would get spurious info about modified files in git-status, though... perhaps that could be improved using the infrastructure created (IIRC) by Linus for dealing with 'insane' filesystems. DON'T DO THAT, SOLVABLE. 3. Non-"case-preserving" filesystems, where the filename as a sequence of bytes differs between what you created and what you get back from the filesystem. 
An example here is the MacOS X filesystem, which accepts filenames in the NFC composed normalized form of Unicode, but stores them internally and returns them in the NFD decomposed form. This is issue 9. (Español being "Espa\u00f1ol" in NFC, but "Espan\u0303ol" in NFD). In this case 'don't do this' might not be an acceptable answer. Perhaps you need non-ASCII characters in filenames. You cannot always choose a filesystem, or specify a mount point option, that makes it a non-problem. I remember that this issue was discussed extensively on the git mailing list, but I don't remember what the conclusion was (besides agreeing that a filesystem that is not "*-preserving" is not a sane filesystem ;). In particular I do not remember if Git can deal with this issue sanely (I remember Linus adding infrastructure for that, but did it solve this problem...). PROBABLY SOLVED. 4. Filesystems which cannot store all SCM-sane metainfo, for example filesystems without support for symbolic links, or without support for the executable permission (executable bit). This is an extension of issue 10. (which is limited to symbolic links) in Eric Sink's article. In Git you have core.fileMode to ignore executable bit differences (you would need to use SCM tools and not filesystem tools to manipulate it), and core.symlinks to be able to check out symlinks as plain text files (again using SCM tools to manipulate them). SOLVED. There is also a mistaken implicit assumption that version control systems do (and should) preserve all metadata. 5. The issue of extra metadata that is not SCM-sane, and which different filesystems can or cannot store. Examples include full Unix permissions, Unix ownership (and the groups a file belongs to), other permission-related metadata such as ACLs, and extra resources tied to a file, such as EAs (extended attributes) on some Linux filesystems or the (in)famous resource fork on MacOS. This is issue 5. (resource fork on MacOS vs. xattrs on Linux) in Eric Sink's article. 
This is not an issue for an SCM (a _source_ code management system) to solve. Preserving extra metadata indiscriminately can cause problems, e.g. with full permissions and ownership. Therefore SCMs preserve only a limited, SCM-sane subset of metadata. If you need to preserve extra metadata, you can use (in good SCMs) hooks for that, as e.g. etckeeper does with metastore (in Git). NOT A PROBLEM. -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
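Point 3 above (NFC vs. NFD) is easy to reproduce without a Mac at hand: Python's unicodedata module shows how the two normalization forms of the same name differ as code-point sequences. This is only an illustration of the Unicode behaviour, not tied to any particular filesystem:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "Espa\u00f1ol")  # precomposed: one code point for the ñ
nfd = unicodedata.normalize("NFD", nfc)             # decomposed: 'n' plus combining tilde

assert nfc != nfd                        # different byte sequences on disk...
assert len(nfc) == 7 and len(nfd) == 8   # NFD is one code point longer
assert unicodedata.normalize("NFC", nfd) == nfc  # ...yet the same text once renormalized
```

A tool comparing a name it wrote (NFC) against what the filesystem returns (NFD) byte-for-byte will see a phantom rename unless it renormalizes first.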
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 1 sibling, 0 replies; 82+ messages in thread From: Robin Rosenberg @ 2009-04-28 21:00 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List Tuesday 28 April 2009 13:24:31, Jakub Narebski <jnareb@gmail.com> wrote: > Line endings (issue 8.) is in theory and in practice (at least for > Git) a non-issue. > > In theory you should use the project's convention for the end-of-line > character in text files, and use a smart editor that can deal (or can be > configured to deal) with this issue correctly. Windows people will disagree. > In practice this is a matter of correctly setting up core.autocrlf > (and in more complicated cases, where more complicated means for git > very, very rare, configuring which files are text and which are not). Which proves it is an issue, or we wouldn't need to tune settings to make it work right. A non-issue is something that "just works" without turning knobs. I have had to think more than once about what the issue was and the right way to solve it. It can get weird: Eclipse on Linux generated files with CRLF, which I happily committed; Git on Windows then happily converted them to LF and determined that HEAD and the index were out of sync, but refused to commit the CRLF->LF change because there was no "diff". You know the fix, but don't tell me it's not an issue. -- robin ^ permalink raw reply [flat|nested] 82+ messages in thread
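For readers unsure what core.autocrlf actually does to content, here is a rough sketch of the two conversions (hypothetical helper names, not Git's real implementation; it assumes repository content is LF-only, which the inbound conversion is there to guarantee):

```python
def to_repository(data: bytes) -> bytes:
    """Roughly what autocrlf does on the way in: normalize CRLF to LF."""
    return data.replace(b"\r\n", b"\n")

def to_worktree(data: bytes) -> bytes:
    """And on checkout (autocrlf=true, e.g. on Windows): expand LF to CRLF.

    Assumes the stored content is LF-only, as to_repository() ensures."""
    return data.replace(b"\n", b"\r\n")

blob = to_repository(b"hello\r\nworld\r\n")
assert blob == b"hello\nworld\n"
assert to_worktree(blob) == b"hello\r\nworld\r\n"
```

Robin's symptom comes from content that entered the repository as CRLF without the inbound normalization: the Windows side then keeps seeing a difference between the stored bytes and the renormalized ones.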
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg @ 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-29 6:55 UTC (permalink / raw) To: Jakub Narebski; +Cc: Git Mailing List On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com> wrote: > DON'T DO THAT. > DON'T DO THAT, SOLVABLE. As I mentioned, Eric is taking the perspective of offering a supported SCM to a large and diverse audience. As such, his notes are interesting not because he's right or because he's wrong. We can be "right" and say "don't do that" if we shrink our audience so that it looks a lot like us. There, fixed. But something tells me that successful tools are -- by definition -- tools that grow past their creators' use. So from Eric's perspective, it is worthwhile to work on all those issues, and get it right for the end user -- support things we don't like, offer foolproof catches and warnings that prevent the user from shooting their lovely toes off to Mars, etc. His perspective is one of commercial licensing, but even if we aren't driven by the "each new user is a new dollar" bit, the long-term hope for git might also be to be widely used and to improve the version-control life of many unsuspecting users. To get there, I suspect we have to understand more of Eric's perspective. That's my 2c. m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-29 6:55 ` Martin Langhoff @ 2009-04-29 7:21 ` Jeff King 2009-04-29 20:05 ` Markus Heidelberg 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Jeff King @ 2009-04-29 7:21 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jakub Narebski, Git Mailing List On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote: > So from Eric's perspective, it is worthwhile to work on all those > issues, and get it right for the end user -- support things we don't > like, offer foolproof catches and warnings that prevent the user from > shooting their lovely toes off to Mars, etc. I read a few of his blog postings. He kept complaining about the features of git that I like the most. :) So one thing I took away from it is that there probably isn't _one_ interface that works for everybody. I can see his arguments about how "add -p" can be dangerous, and how history rewriting can be dangerous. So for some users, blocking those features makes sense. But for other users (myself included), those are critical features that make me _way_ more productive. And I manage the risk that comes from using them as part of my workflow, and it isn't a problem in practice. While part of me is happy that cogito is now dead (not because I didn't think it was good, but because having two sets of tools just seemed to create maintenance and staleness headaches), I do sometimes wonder if we would be better off with several "from scratch" git interfaces based around the plumbing (or even a C library). And I don't just mean simple wrappers around git commands, but whole new interfaces which make decisions like "no history rewriting at all", and try to provide a safer interface based on that. Of course, _I_ wouldn't want to use such an interface. But in theory I could seamlessly interoperate with people who did. 
-Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-29 7:21 ` Jeff King @ 2009-04-29 20:05 ` Markus Heidelberg 0 siblings, 0 replies; 82+ messages in thread From: Markus Heidelberg @ 2009-04-29 20:05 UTC (permalink / raw) To: Jeff King; +Cc: Martin Langhoff, Jakub Narebski, Git Mailing List Jeff King, 29.04.2009: > On Wed, Apr 29, 2009 at 08:55:29AM +0200, Martin Langhoff wrote: > > > So from Eric's perspective, it is worthwhile to work on all those > > issues, and get it right for the end user -- support things we don't > > like, offer foolproof catches and warnings that prevent the user from > > shooting their lovely toes off to Mars, etc. > > I read a few of his blog postings. He kept complaining about the > features of git that I like the most. :) > > I can see his arguments about how > "add -p" can be dangerous Actually, I don't see a special case here in committing a never-compiled/never-tested worktree state. You can do this with every VCS (even one without an index like git's) by just selectively committing files instead of the whole current worktree. Markus ^ permalink raw reply [flat|nested] 82+ messages in thread
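The usual answer to the "add -p commits an untested state" concern is `git stash --keep-index`, which makes the worktree match the index so the test suite runs against exactly what will be committed. A toy model of that dance, with plain dicts standing in for the worktree, index and stash (an illustration of the workflow, not of Git's data structures):

```python
# Plain dicts stand in for Git's real states: path -> content.
worktree = {"f.c": "tested hunk\nuntested hunk\n"}
index = {}

# `git add -p`: stage only the hunk you believe in.
index["f.c"] = "tested hunk\n"

# `git stash --keep-index`: set the unstaged rest aside and make the
# worktree match the index, so tests run on exactly what will be committed.
stash = dict(worktree)
worktree = dict(index)
assert worktree == index  # what you test is what you commit

# After the tests pass and the commit is made, `git stash pop`
# restores the remaining, still-untested changes.
worktree = dict(stash)
assert "untested hunk" in worktree["f.c"]
```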
* Re: Cross-Platform Version Control 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King @ 2009-04-29 7:52 ` Jakub Narebski 2009-04-29 8:25 ` Martin Langhoff 1 sibling, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2009-04-29 7:52 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List On Wed, 29 April 2009, Martin Langhoff wrote: > On Tue, Apr 28, 2009 at 1:24 PM, Jakub Narebski <jnareb@gmail.com> > wrote: [I think you cut out a bit too much. Here I resurrected it] JN> 1. Different limitations on file names (e.g. pathname length), JN> different special characters, different special filenames JN> (if any). [...] JN> The answer is a convention for filenames in a project. Simply JN> DON'T use filenames which can cause problems. [...] > > DON'T DO THAT. What could be a proper solution to that, if you do not accept a social rather than a technical restriction? We can have a pre-commit hook that checks filenames for portability (which is deployment-specific, and shouldn't be part of the SCM, except perhaps as an example hook), but it wouldn't help in dealing with non-portable filenames already present, on a filesystem that cannot represent them. If I remember correctly, Git has for some time had a layer which can translate between filenames in the repository and filenames on the filesystem, but I'm not sure if it is generic enough to be a solution to this problem, and currently there is no way to manipulate this mapping, I think. JN> 2. "Case-insensitive" but "case-preserving" filesystems. [...] JN> JN> The answer is like for the previous issue: don't. Simply DO NOT JN> create files with filenames which differ only in case [...] > > DON'T DO THAT, SOLVABLE. By 'solvable' here I mean that you should be able to modify only one of the clashing files at once (checkout 'README', modify, add to index, remove from filesystem, checkout 'readme', modify, etc.), and deal with the annoyances in git-status output. It can be done in Git, with a medium amount of hacking. 
I don't think any other SCM can do even this, and I cannot think of a better, automatic solution that would somehow deal with case-clashing. Note that all bets are off on a case-insensitive and non-preserving filesystem. By the way, wouldn't it be a better solution to use a sane filesystem, rather than to complicate the SCM? ;-) > > As I mentioned, Eric is taking the perspective of offering a supported > SCM to a large and diverse audience. As such, his notes are > interesting not because he's right or because he's wrong. > > We can be "right" and say "don't do that" if we shrink our audience so > that it looks a lot like us. There, fixed. <quote source="Dune by Frank Herbert"> [...] the attitude of the knife — chopping off what's incomplete and saying: "Now it's complete because it's ended here." </quote> I could not resist posting this quote :-P > > But something tells me that successful tools are -- by definition -- > tools that grow past their creators' use. > > So from Eric's perspective, it is worthwhile to work on all those > issues, and get it right for the end user -- support things we don't > like, offer foolproof catches and warnings that prevent the user from > shooting their lovely toes off to Mars, etc. Warnings and catches I can accept; adding complications and corner cases for situations which can be trivially avoided with a bit of social engineering, a.k.a. project guidelines... not so much. I simply cannot see a situation where you _must_ have dangerously unportable file names (trailing dot, trailing whitespace) and case-clashing files... > > His perspective is one of commercial licensing, but even if we aren't > driven by the "each new user is a new dollar" bit, the long-term hope > for git might also be to be widely used and to improve the version > control life of many unsuspecting users. > > To get there, I suspect we have to understand more of Eric's > perspective. > > That's my 2c. 
By the way, I think that the article on cross-platform version control (version control in a heterogeneous environment) is quite a good article. I don't much like the "10 Issues"/"Top 10" style of writing, but the article examines the different ways a heterogeneous environment can trip up an SCM. In my opinion Git does quite well here, where it can, and where the issue is one to be solved by the SCM and not otherwise (extra metadata like the resource fork). -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
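The hook-plus-tool approach mentioned earlier for extra metadata (etckeeper with metastore) boils down to recording the attributes the SCM deliberately ignores into a file that *is* versioned. A toy sketch of the idea, with a hypothetical helper (metastore's actual format is its own binary one, and it also covers xattrs):

```python
import os
import stat

def snapshot_metadata(root):
    """Record owner/group/permissions for files under root, keyed by relative path.

    An SCM rightly ignores these; a pre-commit hook can dump this dict to a
    versioned file, and a post-checkout hook can reapply it."""
    meta = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            meta[os.path.relpath(path, root)] = {
                "mode": oct(stat.S_IMODE(st.st_mode)),
                "uid": st.st_uid,
                "gid": st.st_gid,
            }
    return meta
```

This keeps the "SCM-sane subset" rule intact: the tracked content is just another text file, and only the hooks know it describes permissions.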
* Re: Cross-Platform Version Control 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski @ 2009-04-29 8:25 ` Martin Langhoff 0 siblings, 0 replies; 82+ messages in thread From: Martin Langhoff @ 2009-04-29 8:25 UTC (permalink / raw) To: Jakub Narebski; +Cc: Git Mailing List On Wed, Apr 29, 2009 at 9:52 AM, Jakub Narebski <jnareb@gmail.com> wrote: >> > DON'T DO THAT. > > What could be a proper solution to that, if you do not accept a social > rather than a technical restriction? Let's say strong checks for case-sensitivity clashes, leading/trailing dots, utf-8 encoding maladies, etc., switched on by default. And note that to be user-friendly you want most of those checks at 'add' time. If we don't like a particular FS, or we think it is messing up our utf-8 filenames, say it up-front, at clone and checkout time. For example, if the checkout has files with interesting utf-8 names, it'd be reasonable to check for filename mangling. Some things are hard or impossible to prevent - the utf-8 encoding maladies of OSX for example. But it may be detectable on checkout. In short, play on the defensive, for the benefit of users who are not kernel developers. It will piss off kernel & git developers and slow some operations somewhat. It will piss off oldtimers like me. But I'll say git config --global core.trainingwheels no and life will be good. It may be - as Jeff King points out - a matter of a polished git porcelain. We've seen lots of porcelains, but no smooth user-targeted porcelain yet. cheers, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
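The "strong checks at 'add' time" Martin asks for are easy to prototype. A hypothetical portability check (not an existing Git hook) covering the Windows reserved names, trailing dots/spaces, invalid characters and case clashes discussed in this thread might look like:

```python
import re

# Basenames reserved on Windows regardless of extension (PRN.txt is still PRN).
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                    *(f"COM{i}" for i in range(1, 10)),
                    *(f"LPT{i}" for i in range(1, 10))}

def portability_problems(paths):
    """Return (path, reason) pairs for names that will misbehave somewhere."""
    problems = []
    seen_lower = {}
    for p in paths:
        base = p.rsplit("/", 1)[-1]
        stem = base.split(".", 1)[0].upper()
        if stem in WINDOWS_RESERVED:
            problems.append((p, "reserved name on Windows"))
        if base != base.rstrip(". "):
            problems.append((p, "trailing dot or space"))
        if re.search(r'[<>:"\\|?*]', base):
            problems.append((p, "character invalid on Windows"))
        low = p.lower()
        if low in seen_lower and seen_lower[low] != p:
            problems.append((p, f"case-clashes with {seen_lower[low]}"))
        seen_lower.setdefault(low, p)
    return problems
```

Run over the paths being staged, this is exactly the kind of default-on "training wheels" check the message above describes: cheap, deployment-independent, and silent for portable trees.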
* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-28 18:16 ` Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 1 sibling, 2 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-28 18:16 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Martin Langhoff <martin.langhoff@gmail.com> writes: > Eric Sink has been working on the (commercial, proprietary) centralised > SCM Vault for a while. He's written recently about his explorations > around the new crop of DSCMs, and I think it's quite interesting. A > quick search of the list archives makes me think it wasn't discussed > before. > > The guy is knowledgeable, and writes quite witty posts -- naturally, > there's plenty to disagree on, but I'd like to encourage readers not > to nitpick or focus on where Eric is wrong. It is interesting to read > where he thinks git and other DSCMs are missing the mark. > > Maybe he's right, maybe he's wrong, but damn he's interesting :-) > > So here's the blog - http://www.ericsink.com/ "Here's a blog"... and therefore my dilemma: should I post my reply as a comment on this blog, or should I reply here on the git mailing list? I think I will just add a link to this thread in the GMane mailing list archive for the git mailing list... > These are the best entry points * "Ten Quirky Issues with Cross-Platform Version Control" > http://www.ericsink.com/entries/quirky.html which I have answered in a separate post in this thread * "Mercurial, Subversion, and Wesley Snipes" > http://www.ericsink.com/entries/hg_denzel.html which I will comment on now. 
The 'ES>' prefix means quoting from the above blog post. First there is a list of earlier blog posts, with links, which makes the article in question a good starting point. ES> As part of that effort, I have undertaken an exploration of the ES> DVCS world. Several weeks ago I started writing one blog entry ES> every week, mostly focused on DVCS topics. In chronological ES> order, here they are: ES> ES> * The one where I gripe about Git's index where Eric complains that "git add -p" allows for committing untested changes... not knowing about "git stash --keep-index", and not understanding that committing is (usually) separate from publishing in distributed version control systems (so you can check after committing, and amend the commit if it does not pass the tests). ES> * The one where I whine about the way Git allows developers to ES> rearrange the DAG where Eric seems not to notice that you are strongly encouraged to do 'rearranging the DAG' (rewriting history) _only_ in the unpublished (not made public) part of history. ES> * The one where it looks like I am against DAG-based version ES> control but I'm really not where Eric conflates linear versus merge workflows with the update-before-commit versus commit-then-merge paradigm, not noticing that you can have linear history using the sane commit-update-rebase rather than the unsafe update-before-commit. ES> * The one where I fuss about DVCSes that try to act like ES> centralized tools where the DVCS in question that behaves this way is Bazaar (if I understood this correctly). ES> * The one where I complain that DVCSes have a lousy story when it ES> comes to bug-tracking where Eric correctly notices that distributed version control would not help much if you use a centralized bugtracker, and speculates about the features that a distributed bugtracker would need. A very nice post, in my opinion. 
ES> * The one where I lament that I want to like Darcs but I can't where Eric talks about the difference between parentage in a merge commit (which is needed for good merging) and the "parentage"/weak link in a cherry-picked commit; Git uses a weak link = no link. ES> * The one where I speculate cluelessly about why Git is so fast where Eric guesses instead of asking on the git mailing list or the #git channel... ;-) ES> Along the way, I've been spending some time getting hands-on ES> experience with these tools. I've been using Bazaar for several ES> months. I don't like it very much. I am currently in the process ES> of switching to Git, but I don't expect to like it very much ES> either. Aaaargh... if you expect not to like it very much, I would be very surprised if you find it to your liking... ES> So why don't I write about Mercurial? Because I'm pretty sure I ES> would like it. ES> ES> I chose Bazaar and Git for the experience. But if I were choosing ES> a DVCS as a regular user, I would choose Mercurial. I've used it ES> some, and found it to be incredibly pleasant. It seems like the ES> DVCS that got everything just about right. That's great if you're ES> a user, but for a writer, what's interesting about that? Well, Mercurial IMHO didn't get everything right. Leaving aside implementation issues, like dealing with copies, binary files, and large files, IMHO it got these wrong: * branching (multiple branches per repository) * tags (which should be transferable but non-versioned) -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski @ 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 1 sibling, 0 replies; 82+ messages in thread From: Sitaram Chamarty @ 2009-04-29 7:54 UTC (permalink / raw) To: git On 2009-04-28, Jakub Narebski <jnareb@gmail.com> wrote: > ES> * The one where I lament that I want to like Darcs but I can't > > where Eric talks about the difference between parentage in a merge commit > (which is needed for good merging) and the "parentage"/weak link in > a cherry-picked commit; Git uses a weak link = no link. Well, the patch-id is a sort of "compute on demand" link, so it would qualify as a weak link, especially because git manages to use it during a rebase. I wanted to point that out but I didn't see a link to post comments, so I didn't bother. ^ permalink raw reply [flat|nested] 82+ messages in thread
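The patch-id idea Sitaram refers to can be illustrated with a toy version: hash a diff after stripping the parts (hunk offsets, blob hashes) that change when a patch moves, so the "same" change gets the same id wherever it lands. This is only a sketch of the concept, not `git patch-id`'s exact algorithm:

```python
import hashlib

def toy_patch_id(diff: str) -> str:
    """Hash a diff, ignoring hunk line numbers and blob hashes."""
    lines = []
    for line in diff.splitlines():
        if line.startswith("index "):
            continue          # blob hashes change every time the patch moves
        if line.startswith("@@"):
            line = "@@"       # drop the offsets, keep only the hunk marker
        lines.append(line)
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()

a = "diff --git a/f b/f\nindex 111..222 100644\n@@ -10,2 +10,3 @@\n+new line\n"
b = "diff --git a/f b/f\nindex 333..444 100644\n@@ -50,2 +50,3 @@\n+new line\n"
assert toy_patch_id(a) == toy_patch_id(b)  # same change at different positions
```

This is why the link is "weak": it is recomputed from content on demand, rather than recorded in the commit graph like a merge parent.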
* Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty @ 2009-04-30 12:17 ` Jakub Narebski 2009-04-30 12:56 ` Michael Witten ` (2 more replies) 1 sibling, 3 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 12:17 UTC (permalink / raw) To: Martin Langhoff; +Cc: Git Mailing List Jakub Narebski <jnareb@gmail.com> writes: > Martin Langhoff <martin.langhoff@gmail.com> writes: > > > Eric Sink has been working on the (commercial, proprietary) centralised > > SCM Vault for a while. He's written recently about his explorations > > around the new crop of DSCMs, and I think it's quite interesting. [...] > > So here's the blog - http://www.ericsink.com/ [...] > * "Mercurial, Subversion, and Wesley Snipes" > > http://www.ericsink.com/entries/hg_denzel.html > > which I will comment on now. The 'ES>' prefix means quoting the above > blog post. [...] > ES> * The one where I speculate cluelessly about why Git is so fast > > where Eric guesses instead of asking on the git mailing list or the #git > channel... ;-) This issue is interesting: what features and what design decisions make Git fast? One of the goals of Git was good performance; are we there? All quotes marked 'es> ' below are from the "Why is Git so Fast?" post http://www.ericsink.com/entries/why_is_git_fast.html es> One: Maybe Git is fast simply because it's a DVCS. es> es> There's probably some truth here. One of the main benefits touted es> by the DVCS fanatics is the extra performance you get when es> everything is "local". This is, I think, quite obvious. Accessing memory is faster than accessing disk, which in turn is faster than accessing the network. So if commit and (change)log do not require access to a server via the network, they are that much faster. BTW. 
that is why Subversion stores 'pristine' versions of files alongside the working copy: to make status and diff fast enough to be usable. Which in turn might make an SVN checkout larger than a full Git clone ;-) es> es> But this answer isn't enough. Maybe it explains why Git is faster es> than Subversion, but it doesn't explain why Git is so often es> described as being faster than the other DVCSs. Not only described; see http://git.or.cz/gitwiki/GitBenchmarks (although some, if not most, of those benchmarks are dated, and e.g. Bazaar claims to have much better performance now). es> es> Two: Maybe Git is fast because Linus Torvalds is so smart. [non answer; the details are important] es> Three: Maybe Git is fast because it's written in C instead of one es> of those newfangled higher-level languages. es> es> Nah, probably not. Lots of people have written fast software in es> C#, Java or Python. es> es> And lots of people have written really slow software in es> traditional native languages like C/C++. [...] Well, I guess that access to low-level optimization techniques like mmap is important for performance. But here I am guessing and speculating like Eric did; well, at least I am asking on the proper forum ;-) We have some anecdotal evidence supporting this possibility (which Eric dismisses), namely the fact that pure-Python Bazaar is the slowest of the three most common open source DVCSs (Git, Mercurial, Bazaar), and the fact that parts of Mercurial were written in C for better performance. We can also compare implementations of Git in other, higher-level languages with the reference implementation in C (and shell scripts, and Perl ;-)). For example the most complete, though still not fully complete, Java implementation: JGit. I hope that JGit developers can tell us whether using a higher-level language affects performance, by how much, and which features of a higher-level language cause the decrease in performance. 
Of course we have to take into account the possibility that JGit simply isn't as well optimized because of less manpower. es> es> Four: Maybe Git is fast because being fast is the primary goal for es> Git. [non-answer; the details are important] es> es> Five: Maybe Git is fast because it does less. es> es> One of my favorite recent blog entries is this piece[1] which es> claims that the way to make code faster is to have it do less. es> es> [1] "How to write fast code" by Kas Thomas es> http://asserttrue.blogspot.com/2009/03/how-to-write-fast-code.html [...] es> es> For example, the way you get something in the Git index is you use es> the "git add" command. Git doesn't scan your working copy for es> changed files unless you explicitly tell it to. This can be a es> pretty big performance win for huge trees. Even when you use the es> "remember the timestamp" trick, detecting modified files in a es> really big tree can take a noticeable amount of time. That of course depends on how you compare the performance of different version control systems (so as not to compare apples with oranges). But if you compare e.g. "<scm> commit" with the Git equivalent "git commit -a", the above is simply not true. BTW, when doing a comparison you also have to take care of the reverse, e.g. Git doing more, like calculating and displaying a diffstat by default for merges/pulls. es> es> Or maybe Git's shortcut for handling renames is faster than doing es> them more correctly[2] like Bazaar does. es> es> [2] "Renaming is the killer app of distributed version control" es> http://www.markshuttleworth.com/archives/123 Errr... what? es> Six: Maybe Git is fast because it doesn't use much external code. es> es> Very often, when you are facing a decision to use somebody else's es> code or write it yourself, there is a performance tradeoff. Not es> always, but often. Maybe the third party code is just slower than es> the code you could write yourself if you had time to do it.
Or es> maybe there is an impedance mismatch between the API of the es> external library and your own architecture. es> es> This can happen even when the library is very high quality. For es> example, consider libcurl. This is a great library. Tons of es> people use it. But it does have one problem that will cause es> performance problems for some users: When using libcurl to fetch es> an object, it wants to own the buffer. In some situations, this es> can end up forcing you to use extra memcpys or temporary files. es> The reason all the low level calls like send() and recv() allow es> the caller to own the loop and the buffer is because this is the es> best way to avoid the need to make extra copies of the data on es> disk or in memory. [...] es> es> Maybe Git is fast because every time they faced one of these "buy es> vs. build" choices, they decided to just write it themselves. I don't think so. Rather, the opposite is true. Git uses libcurl for HTTP transport. Git uses zlib for compression. Git uses SHA-1 from OpenSSL or from Mozilla. Git uses a (modified, internal) LibXDiff for (binary) deltifying, for diffs and for merges. OTOH Git includes several micro-libraries of its own: parseopt, strbuf, ALLOC_GROW, etc. NIH syndrome? I don't think so; rather avoiding extra dependencies (bstring vs strbuf), and existing solutions not fitting all needs (popt/argp/getopt vs parse-options). es> Seven: Maybe Git isn't really that fast. es> es> If there is one thing I've learned about version control it's that es> everybody's situation is different. It is quite likely that Git es> is a lot faster for some scenarios than it is for others. es> es> How does Git handle really large trees? Git was designed primarily es> to support the efforts of the Linux kernel developers. A lot of es> people think the Linux kernel is a large tree, but it's really es> not. Many enterprise configuration management repositories are es> FAR bigger than the Linux kernel. cf.
"Why Perforce is more scalable than Git" by Steve Hanov http://gandolf.homelinux.org/blog/index.php?id=50 I don't really know about this. But there is one issue Eric Sink didn't think about: Eight: Git seems fast. ====================== Here I mean concentrating on low _latency_: when git produces more than one page of output (for example "git log"), it tries to output the first page as fast as possible. That means it is the first page, e.g. "git <sth> | head -25 >/dev/null", that has to be fast, and not "git <sth> >/dev/null" itself. Having a progress indicator appear whenever there is a longer wait (quite a fresh feature) also helps the impression of being fast... And what do you think about this? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski @ 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce 2009-04-30 14:22 ` Jeff King 2009-04-30 18:56 ` Nicolas Pitre 2 siblings, 2 replies; 82+ messages in thread From: Michael Witten @ 2009-04-30 12:56 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > I hope that JGit developers can > tell us whether using a higher-level language affects performance, by how > much, and which features of the higher-level language cause the decrease > in performance. Java is definitely higher-level than C, but you can do some pretty low-level operations on bits and bytes and the like, not to mention the presence of a JIT. My point: I don't think that Java can tell us anything special in this regard. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 12:56 ` Michael Witten @ 2009-04-30 15:28 ` Jakub Narebski 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. Pearce 1 sibling, 1 reply; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 15:28 UTC (permalink / raw) To: Michael Witten; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Michael Witten wrote: > On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > > I hope that JGit developers can > > tell us whether using a higher-level language affects performance, by how > > much, and which features of the higher-level language cause the decrease > > in performance. > > Java is definitely higher-level than C, but you can do some pretty low-level > operations on bits and bytes and the like, not to mention the presence > of a JIT. > > My point: I don't think that Java can tell us anything special in this regard. Let's rephrase the question a bit, then: what low-level operations were needed for good performance in JGit? -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski @ 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 20:36 ` Kjetil Barvik 0 siblings, 1 reply; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 18:52 UTC (permalink / raw) To: Jakub Narebski; +Cc: Michael Witten, Martin Langhoff, Git Mailing List Jakub Narebski <jnareb@gmail.com> wrote: > Let's rephrase the question a bit, then: what low-level operations were needed > for good performance in JGit? Aside from the message I just posted: - Avoid String, it's too expensive most of the time. Stick with byte[], and better, stick with data that is a triplet of (byte[], int start, int end) to define a region of data. Yes, it's annoying, as it's 3 values you need to pass around instead of just 1, but it makes a big difference in running time. - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, which can be inlined into an object allocation. - Subclass instead of containing references. We extend ObjectId to attach application data, rather than contain a reference to an ObjectId. Classical Java programming techniques would say this is a violation of encapsulation. But it gets us the same memory impact that C Git gets by saying: struct appdata { unsigned char sha1[20]; .... } - We're hurting dearly for not having more efficient access to the pack-*.pack file data. mmap in Java is crap. We implement our own page buffer, reading in blocks of 8192 bytes at a time and holding them in our own cache. Really, we should write our own mmap library as an optional JNI thing, and tie it into libz so we can efficiently run inflate() off the pack data directly. - We're hurting dearly for not having more efficient access to the pack-*.idx files. Again, with no mmap we read the entire bloody index into memory.
But since you won't touch most of it, we keep it in a large byte[]; but since you are searching with an ObjectId (5 ints), we pay a conversion price on every search step, where we have to copy from the large byte[] into 5 local ints and then compare with the ObjectId. It's an overhead C Git doesn't have to deal with. Anyway. I'm still just amazed at how well JGit runs given these limitations. I guess that's Moore's Law for you. 10 years ago, JGit wouldn't have been practical. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
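[Editorial note: the conversion overhead Shawn describes is easiest to see from the C side. Below is a simplified sketch, not Git's actual pack-index code: with the index bytes directly addressable (e.g. via mmap), C can binary-search a sorted table of 20-byte object ids using plain memcmp, with no per-step unpacking into ints.]

```c
#include <string.h>

#define SHA1_RAWSZ 20

/* Simplified sketch (not Git's actual code): binary search for a
 * 20-byte object id in a sorted table of raw ids.  Each probe is a
 * plain memcmp over the bytes -- the conversion-free comparison that
 * a 5-int ObjectId representation has to pay extra for. */
static int find_sha1(const unsigned char *table, int nr,
                     const unsigned char *sha1)
{
    int lo = 0, hi = nr;
    while (lo < hi) {
        int mi = lo + (hi - lo) / 2;
        int cmp = memcmp(table + (size_t)mi * SHA1_RAWSZ, sha1, SHA1_RAWSZ);
        if (cmp == 0)
            return mi;          /* index of the matching entry */
        if (cmp < 0)
            lo = mi + 1;
        else
            hi = mi;
    }
    return -1;                  /* not present */
}
```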
* Re: Why Git is so fast 2009-04-30 18:52 ` Shawn O. Pearce @ 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce 2009-05-01 5:24 ` Dmitry Potapov 0 siblings, 2 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-04-30 20:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git * "Shawn O. Pearce" <spearce@spearce.org> writes: <snipp> | - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, | which can be inlined into an object allocation. What do people think about doing something similar in C Git? That is, convert the current internal representation of the SHA-1 from "unsigned char sha1[20]" to "unsigned long sha1[5]"? OK, I currently see 2 problems with it: 1) Will the type "unsigned long" always be unsigned 32 bit on all platforms on all computers? Do we need a "uint32_t" thing? 2) Can we get in trouble because of differences between little- and big-endian machines? And, similarly, I can see or guess the following would be positive with this change: 3) From a SHA-1 library I worked with some time ago, I noticed that it internally used the type "unsigned long arr[5]", so it may be possible to get some shortcuts or maybe speedups here, if we want to do it. 4) The "static inline void hashcpy(....)" in cache.h could then maybe be written like this: static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) { sha_dst[0] = sha_src[0]; sha_dst[1] = sha_src[1]; sha_dst[2] = sha_src[2]; sha_dst[3] = sha_src[3]; sha_dst[4] = sha_src[4]; } And hopefully it will be compiled to just 5 store/move instructions, or at least hopefully be faster than the current memcpy() call. But maybe we get more compiled instructions compared to a single call to memcpy()? 5) Similarly, as in 4), for the other SHA-1 related hash functions near hashcpy() in cache.h. OK, just some thoughts. Sorry if this has already been discussed, but I could not find anything about it after a simple Google search.
-- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
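[Editorial note: Kjetil's question 1) is what <stdint.h> answers: "unsigned long" is not guaranteed to be 32 bits (it is 64 on most 64-bit Unix ABIs), but uint32_t is exact. A minimal sketch of the proposal using a fixed-width type follows; the names are hypothetical, not Git's code.]

```c
#include <stdint.h>

/* Hypothetical sketch of the proposal above (not Git's actual code):
 * hold a SHA-1 as five fixed-width 32-bit words.  uint32_t settles
 * question 1); question 2) -- endianness -- still matters the moment
 * the words are compared or ordered against the on-disk byte order. */
typedef struct {
    uint32_t w[5];
} sha1_words;

static inline void hashcpy_words(sha1_words *dst, const sha1_words *src)
{
    dst->w[0] = src->w[0];
    dst->w[1] = src->w[1];
    dst->w[2] = src->w[2];
    dst->w[3] = src->w[3];
    dst->w[4] = src->w[4];
}
```

Wrapping the array in a struct also lets plain assignment (`*dst = *src;`) do the copy, which a compiler handles at least as well as the explicit five stores.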
* Re: Why Git is so fast 2009-04-30 20:36 ` Kjetil Barvik @ 2009-04-30 20:40 ` Shawn O. Pearce 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 5:24 ` Dmitry Potapov 1 sibling, 1 reply; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 20:40 UTC (permalink / raw) To: Kjetil Barvik; +Cc: git Kjetil Barvik <barvik@broadpark.no> wrote: > * "Shawn O. Pearce" <spearce@spearce.org> writes: > <snipp> > | - Avoid allocating byte[] for SHA-1s; instead we convert to 5 ints, > | which can be inlined into an object allocation. > > What do people think about doing something similar in C Git? > > That is, convert the current internal representation of the SHA-1 from > "unsigned char sha1[20]" to "unsigned long sha1[5]"? It's not worth the code churn. > OK, I currently see 2 problems with it: > > 1) Will the type "unsigned long" always be unsigned 32 bit on all > platforms on all computers? Do we need a "uint32_t" thing? Yeah, "unsigned long" isn't always 32 bits. So we'd need to use uint32_t. Which we already use elsewhere, but still. > 2) Can we get in trouble because of differences between little- and > big-endian machines? Yes, especially if compare were implemented using native uint32_t compares and the processor was little-endian. > 4) The "static inline void hashcpy(....)" in cache.h could then > maybe be written like this: It's already done as "memcpy(a, b, 20)", which most compilers will inline and probably reduce to 5 word moves anyway. That's why hashcpy() itself is inline. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 20:40 ` Shawn O. Pearce @ 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan 2009-05-01 17:42 ` Tony Finch 0 siblings, 2 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-04-30 21:36 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: git * "Shawn O. Pearce" <spearce@spearce.org> writes: |> 4) The "static inline void hashcpy(....)" in cache.h could then |> maybe be written like this: | | Its already done as "memcpy(a, b, 20)" which most compilers will | inline and probably reduce to 5 word moves anyway. That's why | hashcpy() itself is inline. But would the compiler be able to trust that the hashcpy() is always called with correct word alignment on variables a and b? I made a test and compiled git with: make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3 CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel Then used gdb to get the following: (gdb) disassemble write_sha1_file Dump of assembler code for function write_sha1_file: 0x080e3830 <write_sha1_file+0>: push %ebp 0x080e3831 <write_sha1_file+1>: mov %esp,%ebp 0x080e3833 <write_sha1_file+3>: sub $0x58,%esp 0x080e3836 <write_sha1_file+6>: lea -0x10(%ebp),%eax 0x080e3839 <write_sha1_file+9>: mov %ebx,-0xc(%ebp) 0x080e383c <write_sha1_file+12>: mov %esi,-0x8(%ebp) 0x080e383f <write_sha1_file+15>: mov %edi,-0x4(%ebp) 0x080e3842 <write_sha1_file+18>: mov 0x14(%ebp),%ebx 0x080e3845 <write_sha1_file+21>: mov %eax,0x8(%esp) 0x080e3849 <write_sha1_file+25>: lea -0x44(%ebp),%edi 0x080e384c <write_sha1_file+28>: lea -0x24(%ebp),%esi 0x080e384f <write_sha1_file+31>: mov %edi,0x4(%esp) 0x080e3853 <write_sha1_file+35>: mov %esi,(%esp) 0x080e3856 <write_sha1_file+38>: mov 0x10(%ebp),%ecx 0x080e3859 <write_sha1_file+41>: mov 0xc(%ebp),%edx 0x080e385c <write_sha1_file+44>: mov 0x8(%ebp),%eax 0x080e385f <write_sha1_file+47>: call 0x80e0350 <write_sha1_file_prepare> 0x080e3864 
<write_sha1_file+52>: test %ebx,%ebx 0x080e3866 <write_sha1_file+54>: je 0x80e3885 <write_sha1_file+85> 0x080e3868 <write_sha1_file+56>: mov -0x24(%ebp),%eax 0x080e386b <write_sha1_file+59>: mov %eax,(%ebx) 0x080e386d <write_sha1_file+61>: mov -0x20(%ebp),%eax 0x080e3870 <write_sha1_file+64>: mov %eax,0x4(%ebx) 0x080e3873 <write_sha1_file+67>: mov -0x1c(%ebp),%eax 0x080e3876 <write_sha1_file+70>: mov %eax,0x8(%ebx) 0x080e3879 <write_sha1_file+73>: mov -0x18(%ebp),%eax 0x080e387c <write_sha1_file+76>: mov %eax,0xc(%ebx) 0x080e387f <write_sha1_file+79>: mov -0x14(%ebp),%eax 0x080e3882 <write_sha1_file+82>: mov %eax,0x10(%ebx) I admit that I am not particularly familiar with Intel machine instructions, but I guess that the above 10 mov instructions are the result of the compiled inline hashcpy() in the write_sha1_file() function in sha1_file.c. Question: would it be possible for the compiler to compile it down to just 5 mov instructions if we had used an unsigned 32-bit type? Or is this the best we can reasonably hope for inside the write_sha1_file() function? I checked the output of "disassemble function_foo" for 3 other functions, and it seems that those 3 functions also got 10 mov instructions for the inline hashcpy(), as far as I can tell.
0x080e3885 <write_sha1_file+85>: mov %esi,(%esp) 0x080e3888 <write_sha1_file+88>: call 0x80e3800 <has_sha1_file> 0x080e388d <write_sha1_file+93>: xor %edx,%edx 0x080e388f <write_sha1_file+95>: test %eax,%eax 0x080e3891 <write_sha1_file+97>: jne 0x80e38b6 <write_sha1_file+134> 0x080e3893 <write_sha1_file+99>: mov 0xc(%ebp),%eax 0x080e3896 <write_sha1_file+102>: mov %edi,%edx 0x080e3898 <write_sha1_file+104>: mov %eax,0x4(%esp) 0x080e389c <write_sha1_file+108>: mov -0x10(%ebp),%ecx 0x080e389f <write_sha1_file+111>: mov 0x8(%ebp),%eax 0x080e38a2 <write_sha1_file+114>: movl $0x0,0x8(%esp) 0x080e38aa <write_sha1_file+122>: mov %eax,(%esp) 0x080e38ad <write_sha1_file+125>: mov %esi,%eax 0x080e38af <write_sha1_file+127>: call 0x80e1e40 <write_loose_object> 0x080e38b4 <write_sha1_file+132>: mov %eax,%edx 0x080e38b6 <write_sha1_file+134>: mov %edx,%eax 0x080e38b8 <write_sha1_file+136>: mov -0xc(%ebp),%ebx 0x080e38bb <write_sha1_file+139>: mov -0x8(%ebp),%esi 0x080e38be <write_sha1_file+142>: mov -0x4(%ebp),%edi 0x080e38c1 <write_sha1_file+145>: leave 0x080e38c2 <write_sha1_file+146>: ret End of assembler dump. (gdb) So, maybe the compiler is doing the right thing after all? -- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 21:36 ` Kjetil Barvik @ 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 17:42 ` Tony Finch 1 sibling, 2 replies; 82+ messages in thread From: Steven Noonan @ 2009-05-01 0:23 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: > * "Shawn O. Pearce" <spearce@spearce.org> writes: > |> 4) The "static inline void hashcpy(....)" in cache.h could then > |> maybe be written like this: > | > | Its already done as "memcpy(a, b, 20)" which most compilers will > | inline and probably reduce to 5 word moves anyway. That's why > | hashcpy() itself is inline. > > But would the compiler be able to trust that the hashcpy() is always > called with correct word alignment on variables a and b? > > I made a test and compiled git with: > > make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all > > compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3 > CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel > > Then used gdb to get the following: > > (gdb) disassemble write_sha1_file > Dump of assembler code for function write_sha1_file: > 0x080e3830 <write_sha1_file+0>: push %ebp > 0x080e3831 <write_sha1_file+1>: mov %esp,%ebp > 0x080e3833 <write_sha1_file+3>: sub $0x58,%esp > 0x080e3836 <write_sha1_file+6>: lea -0x10(%ebp),%eax > 0x080e3839 <write_sha1_file+9>: mov %ebx,-0xc(%ebp) > 0x080e383c <write_sha1_file+12>: mov %esi,-0x8(%ebp) > 0x080e383f <write_sha1_file+15>: mov %edi,-0x4(%ebp) > 0x080e3842 <write_sha1_file+18>: mov 0x14(%ebp),%ebx > 0x080e3845 <write_sha1_file+21>: mov %eax,0x8(%esp) > 0x080e3849 <write_sha1_file+25>: lea -0x44(%ebp),%edi > 0x080e384c <write_sha1_file+28>: lea -0x24(%ebp),%esi > 0x080e384f <write_sha1_file+31>: mov %edi,0x4(%esp) > 0x080e3853 <write_sha1_file+35>: mov %esi,(%esp) > 0x080e3856 <write_sha1_file+38>: mov 0x10(%ebp),%ecx > 
0x080e3859 <write_sha1_file+41>: mov 0xc(%ebp),%edx > 0x080e385c <write_sha1_file+44>: mov 0x8(%ebp),%eax > 0x080e385f <write_sha1_file+47>: call 0x80e0350 <write_sha1_file_prepare> > 0x080e3864 <write_sha1_file+52>: test %ebx,%ebx > 0x080e3866 <write_sha1_file+54>: je 0x80e3885 <write_sha1_file+85> > > 0x080e3868 <write_sha1_file+56>: mov -0x24(%ebp),%eax > 0x080e386b <write_sha1_file+59>: mov %eax,(%ebx) > 0x080e386d <write_sha1_file+61>: mov -0x20(%ebp),%eax > 0x080e3870 <write_sha1_file+64>: mov %eax,0x4(%ebx) > 0x080e3873 <write_sha1_file+67>: mov -0x1c(%ebp),%eax > 0x080e3876 <write_sha1_file+70>: mov %eax,0x8(%ebx) > 0x080e3879 <write_sha1_file+73>: mov -0x18(%ebp),%eax > 0x080e387c <write_sha1_file+76>: mov %eax,0xc(%ebx) > 0x080e387f <write_sha1_file+79>: mov -0x14(%ebp),%eax > 0x080e3882 <write_sha1_file+82>: mov %eax,0x10(%ebx) > > I admit that I am not particular familar with intel machine > instructions, but I guess that the above 10 mov instructions is the > result for the compiled inline hashcpy() in the write_sha1_file() > function in sha1_file.c > > Question: would it be possible for the compiler to compile it down to > just 5 mov instructions if we had used unsigned 32 bits type? Or is > this the best we can reasonable hope for inside the write_sha1_file() > function? > > I checked 3 other output of "disassemble function_foo", and it seems > that those 3 functions I checked got 10 mov instructions for the > inline hashcpy(), as far as I can tell. 
> > 0x080e3885 <write_sha1_file+85>: mov %esi,(%esp) > 0x080e3888 <write_sha1_file+88>: call 0x80e3800 <has_sha1_file> > 0x080e388d <write_sha1_file+93>: xor %edx,%edx > 0x080e388f <write_sha1_file+95>: test %eax,%eax > 0x080e3891 <write_sha1_file+97>: jne 0x80e38b6 <write_sha1_file+134> > 0x080e3893 <write_sha1_file+99>: mov 0xc(%ebp),%eax > 0x080e3896 <write_sha1_file+102>: mov %edi,%edx > 0x080e3898 <write_sha1_file+104>: mov %eax,0x4(%esp) > 0x080e389c <write_sha1_file+108>: mov -0x10(%ebp),%ecx > 0x080e389f <write_sha1_file+111>: mov 0x8(%ebp),%eax > 0x080e38a2 <write_sha1_file+114>: movl $0x0,0x8(%esp) > 0x080e38aa <write_sha1_file+122>: mov %eax,(%esp) > 0x080e38ad <write_sha1_file+125>: mov %esi,%eax > 0x080e38af <write_sha1_file+127>: call 0x80e1e40 <write_loose_object> > 0x080e38b4 <write_sha1_file+132>: mov %eax,%edx > 0x080e38b6 <write_sha1_file+134>: mov %edx,%eax > 0x080e38b8 <write_sha1_file+136>: mov -0xc(%ebp),%ebx > 0x080e38bb <write_sha1_file+139>: mov -0x8(%ebp),%esi > 0x080e38be <write_sha1_file+142>: mov -0x4(%ebp),%edi > 0x080e38c1 <write_sha1_file+145>: leave > 0x080e38c2 <write_sha1_file+146>: ret > End of assembler dump. > (gdb) > > So, maybe the compiler is doing the right thing after all? > Well, I just tested this with GCC myself. 
I used this segment of code: #include <memory.h> void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) { memcpy(sha_dst, sha_src, 20); } I compiled using Apple's GCC 4.0.1 (note that GCC 4.3 and 4.4 vanilla yield the same code) with these parameters to get Intel assembly: gcc -O2 -arch i386 -march=pentium3 -mtune=pentium3 -fomit-frame-pointer -fno-strict-aliasing -S test.c and these parameters to get the equivalent PowerPC code: gcc -O2 -mcpu=G5 -arch ppc -fomit-frame-pointer -fno-strict-aliasing -S test.c Intel code: .text .align 4,0x90 .globl _hashcpy _hashcpy: subl $12, %esp movl 20(%esp), %edx movl 16(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movl 4(%edx), %eax movl %eax, 4(%ecx) movl 8(%edx), %eax movl %eax, 8(%ecx) movl 12(%edx), %eax movl %eax, 12(%ecx) movl 16(%edx), %eax movl %eax, 16(%ecx) addl $12, %esp ret .subsections_via_symbols and the PowerPC code: .section __TEXT,__text,regular,pure_instructions .section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions,32 .machine ppc970 .text .align 2 .p2align 4,,15 .globl _hashcpy _hashcpy: lwz r0,0(r4) lwz r2,4(r4) lwz r9,8(r4) lwz r11,12(r4) stw r0,0(r3) stw r2,4(r3) stw r9,8(r3) stw r11,12(r3) lwz r0,16(r4) stw r0,16(r3) blr .subsections_via_symbols So it does look like GCC does what it should and it inlines the memcpy. A bit off topic, but the results are rather interesting to me, and I think I see a weakness in how GCC is doing this on Intel. Someone please correct me if I'm wrong, but the PowerPC code seems much better because it can yield very high instruction-level parallelism. It does 5 loads and then 5 stores, using 4 registers for temporary storage and 2 registers for pointers. I realize the Intel x86 architecture is quite constrained in that it has so few general purpose registers, but there has to be better code than what GCC emitted above. 
It seems like the processor would stall because of the quantity of sequential inter-dependent instructions that can't be done in parallel (a mov to memory that depends on a mov to %eax, etc). I suppose the code might not be stalling if it's using the maximum number of registers and doing as many memory accesses as it can per clock, but based on known details about the architecture, does it seem to be doing that? - Steven ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 0:23 ` Steven Noonan @ 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 1 sibling, 0 replies; 82+ messages in thread From: James Pickens @ 2009-05-01 1:25 UTC (permalink / raw) To: Steven Noonan; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Thu, Apr 30, 2009, Steven Noonan <steven@uplinklabs.net> wrote: > A bit off topic, but the results are rather interesting to me, and I > think I see a weakness in how GCC is doing this on Intel. Someone > please correct me if I'm wrong, but the PowerPC code seems much better > because it can yield very high instruction-level parallelism. It does > 5 loads and then 5 stores, using 4 registers for temporary storage and > 2 registers for pointers. > > I realize the Intel x86 architecture is quite constrained in that it > has so few general purpose registers, but there has to be better code > than what GCC emitted above. It seems like the processor would stall > because of the quantity of sequential inter-dependent instructions > that can't be done in parallel (mov to memory that depends on a mov to > eax, etc). There aren't any unnecessary dependencies. Take this sequence: 1: movl (%edx), %eax 2: movl %eax, (%ecx) 3: movl 4(%edx), %eax 4: movl %eax, 4(%ecx) There are two unavoidable dependencies - #2 depends on #1, and #4 depends on #3. #3 does not depend on #2, even though they both use %eax, because #3 is a write to %eax. So whatever was in %eax before #3 is irrelevant. The processor knows this and will use register renaming to execute #1 and #3 in parallel, and #2 and #4 in parallel. James ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens @ 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 9:34 ` Mike Hommey 1 sibling, 1 reply; 82+ messages in thread From: Kjetil Barvik @ 2009-05-01 9:19 UTC (permalink / raw) To: Steven Noonan; +Cc: Shawn O. Pearce, git * Steven Noonan <steven@uplinklabs.net> writes: | On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: |> * "Shawn O. Pearce" <spearce@spearce.org> writes: |> |> 4) The "static inline void hashcpy(....)" in cache.h could then |> |> maybe be written like this: |> | |> | It's already done as "memcpy(a, b, 20)", which most compilers will |> | inline and probably reduce to 5 word moves anyway. That's why |> | hashcpy() itself is inline. |> |> But would the compiler be able to trust that hashcpy() is always |> called with correct word alignment on variables a and b? <snipp> | Well, I just tested this with GCC myself. I used this segment of code: | | #include <memory.h> | void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) | { | memcpy(sha_dst, sha_src, 20); | } OK, here is a small test, which maybe shows at least one difference between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
Given the following file, memcpy_test.c: #include <string.h> extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src); void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src) { memcpy(sha_dst, sha_src, 20); } extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src); void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src) { memcpy(sha_dst, sha_src, 5); } And, compiled with the following: gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c It produced the following memcpy_test.s file: .file "memcpy_test.c" .text .p2align 4,,15 .globl hashcpy_ulong .type hashcpy_ulong, @function hashcpy_ulong: movl 8(%esp), %edx movl 4(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movzbl 4(%edx), %eax movb %al, 4(%ecx) ret .size hashcpy_ulong, .-hashcpy_ulong .p2align 4,,15 .globl hashcpy_uchar .type hashcpy_uchar, @function hashcpy_uchar: movl 8(%esp), %edx movl 4(%esp), %ecx movl (%edx), %eax movl %eax, (%ecx) movl 4(%edx), %eax movl %eax, 4(%ecx) movl 8(%edx), %eax movl %eax, 8(%ecx) movl 12(%edx), %eax movl %eax, 12(%ecx) movl 16(%edx), %eax movl %eax, 16(%ecx) ret .size hashcpy_uchar, .-hashcpy_uchar .ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3" .section .note.GNU-stack,"",@progbits So, the "unsigned long" type hashcpy() used 7 instructions, compared to 13 for the "unsigned char" type hashcpy(). Would I be guessing correctly that the hashcpy_ulong() function will also use fewer CPU cycles, and thus be faster than hashcpy_uchar()? -- kjetil ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:19 ` Kjetil Barvik @ 2009-05-01 9:34 ` Mike Hommey 2009-05-01 9:42 ` Kjetil Barvik 0 siblings, 1 reply; 82+ messages in thread From: Mike Hommey @ 2009-05-01 9:34 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Steven Noonan, Shawn O. Pearce, git On Fri, May 01, 2009 at 11:19:04AM +0200, Kjetil Barvik wrote: > * Steven Noonan <steven@uplinklabs.net> writes: > | On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <barvik@broadpark.no> wrote: > |> * "Shawn O. Pearce" <spearce@spearce.org> writes: > |> |> 4) The "static inline void hashcpy(....)" in cache.h could then > |> |> maybe be written like this: > |> | > |> | Its already done as "memcpy(a, b, 20)" which most compilers will > |> | inline and probably reduce to 5 word moves anyway. That's why > |> | hashcpy() itself is inline. > |> > |> But would the compiler be able to trust that the hashcpy() is always > |> called with correct word alignment on variables a and b? > > <snipp> > > | Well, I just tested this with GCC myself. I used this segment of code: > | > | #include <memory.h> > | void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src) > | { > | memcpy(sha_dst, sha_src, 20); > | } > > OK, here is a smal test, which maybe shows at least one difference > between using "unsigned char sha1[20]" and "unsigned long sha1[5]". 
> Given the following file, memcpy_test.c: > > #include <string.h> > extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src); > void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src) > { > memcpy(sha_dst, sha_src, 20); > } > extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src); > void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src) > { > memcpy(sha_dst, sha_src, 5); > } > > And, compiled with the following: > > gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c > > It produced the following memcpy_test.s file: > > .file "memcpy_test.c" > .text > .p2align 4,,15 > .globl hashcpy_ulong > .type hashcpy_ulong, @function > hashcpy_ulong: > movl 8(%esp), %edx > movl 4(%esp), %ecx > movl (%edx), %eax > movl %eax, (%ecx) > movzbl 4(%edx), %eax > movb %al, 4(%ecx) > ret > .size hashcpy_ulong, .-hashcpy_ulong > .p2align 4,,15 > .globl hashcpy_uchar > .type hashcpy_uchar, @function > hashcpy_uchar: > movl 8(%esp), %edx > movl 4(%esp), %ecx > movl (%edx), %eax > movl %eax, (%ecx) > movl 4(%edx), %eax > movl %eax, 4(%ecx) > movl 8(%edx), %eax > movl %eax, 8(%ecx) > movl 12(%edx), %eax > movl %eax, 12(%ecx) > movl 16(%edx), %eax > movl %eax, 16(%ecx) > ret > .size hashcpy_uchar, .-hashcpy_uchar > .ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3" > .section .note.GNU-stack,"",@progbits > > So, the "unsigned long" type hashcpy() used 7 instructions, compared > to 13 for the "unsigned char" type hascpy(). But your "unsigned long" version only copies 5 bytes... Mike ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:34 ` Mike Hommey @ 2009-05-01 9:42 ` Kjetil Barvik 0 siblings, 0 replies; 82+ messages in thread From: Kjetil Barvik @ 2009-05-01 9:42 UTC (permalink / raw) To: Mike Hommey; +Cc: Steven Noonan, Shawn O. Pearce, git * Mike Hommey <mh@glandium.org> writes: <snipp> | But your "unsigned long" version only copies 5 bytes... Yes, that is true... OK, same result for hashcpy_uchar() and hashcpy_ulong() when corrected for this. --kjetil, with a brown paper bag ^ permalink raw reply [flat|nested] 82+ messages in thread
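[Editorial note: for reference, the corrected test case from the exchange above would read as follows — a minimal sketch reusing Kjetil's function names (git itself only has the unsigned char form). memcpy's length argument is in bytes, so both variants must pass 20; the original "ulong" variant passed 5 and therefore copied only 5 bytes.]

```c
#include <string.h>

/* Corrected version of the memcpy_test.c above: both variants now copy
 * the full 20-byte SHA-1. Function names follow Kjetil's example. */
void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
{
    memcpy(sha_dst, sha_src, 20);
}

void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
{
    memcpy(sha_dst, sha_src, 20);   /* 20 bytes, not 5 */
}
```

With the length corrected, both variants copy identical data, matching Kjetil's follow-up that the generated code comes out the same.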
* Re: Why Git is so fast 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan @ 2009-05-01 17:42 ` Tony Finch 1 sibling, 0 replies; 82+ messages in thread From: Tony Finch @ 2009-05-01 17:42 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, 30 Apr 2009, Kjetil Barvik wrote: > > I admit that I am not particularly familiar with intel machine > instructions, but I guess that the above 10 mov instructions is the > result for the compiled inline hashcpy() in the write_sha1_file() > function in sha1_file.c > > Question: would it be possible for the compiler to compile it down to > just 5 mov instructions if we had used unsigned 32 bits type? No, because the x86 can't do direct memory-to-memory moves. Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS. MODERATE OR GOOD. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce @ 2009-05-01 5:24 ` Dmitry Potapov 2009-05-01 9:42 ` Mike Hommey 1 sibling, 1 reply; 82+ messages in thread From: Dmitry Potapov @ 2009-05-01 5:24 UTC (permalink / raw) To: Kjetil Barvik; +Cc: Shawn O. Pearce, git On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote: > 4) The "static inline void hashcpy(....)" in cache.h could then > maybe be written like this: > > static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) > { > sha_dst[0] = sha_src[0]; > sha_dst[1] = sha_src[1]; > sha_dst[2] = sha_src[2]; > sha_dst[3] = sha_src[3]; > sha_dst[4] = sha_src[4]; > } > > And hopefully will be compiled to just 5 store/move > instructions, or at least hopefully be faster than the current > memcpy() call. But maybe we get more compiled instructions compared > to a single call to memcpy()? Good compilers can inline memcpy and should produce more efficient code for the target architecture, which can be faster than manually written code. On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 while the above code requires 5 operations. Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 5:24 ` Dmitry Potapov @ 2009-05-01 9:42 ` Mike Hommey 2009-05-01 10:46 ` Dmitry Potapov 0 siblings, 1 reply; 82+ messages in thread From: Mike Hommey @ 2009-05-01 9:42 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote: > On Thu, Apr 30, 2009 at 10:36:03PM +0200, Kjetil Barvik wrote: > > 4) The "static inline void hashcpy(....)" in cache.h could then > > maybe be written like this: > > > > static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5]) > > { > > sha_dst[0] = sha_src[0]; > > sha_dst[1] = sha_src[1]; > > sha_dst[2] = sha_src[2]; > > sha_dst[3] = sha_src[3]; > > sha_dst[4] = sha_src[4]; > > } > > > > And hopefully will be compiled to just 5 store/move > > instructions, or at least hopefully be faster than the current > > memcpy() call. But maybe we get more compiled instructions compared > > to a single call to memcpy()? > > Good compilers can inline memcpy and should produce more efficient code > for the target architecture, which can be faster than manually written. > On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 > while the above code requires 5 operations. I guess, though, that some enforced alignment could help produce slightly more efficient code on some architectures (most notably sparc, which really doesn't like to deal with unaligned words). Mike ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-05-01 9:42 ` Mike Hommey @ 2009-05-01 10:46 ` Dmitry Potapov 0 siblings, 0 replies; 82+ messages in thread From: Dmitry Potapov @ 2009-05-01 10:46 UTC (permalink / raw) To: Mike Hommey; +Cc: Kjetil Barvik, Shawn O. Pearce, git On Fri, May 01, 2009 at 11:42:21AM +0200, Mike Hommey wrote: > On Fri, May 01, 2009 at 09:24:34AM +0400, Dmitry Potapov wrote: > > > > Good compilers can inline memcpy and should produce more efficient code > > for the target architecture, which can be faster than manually written. > > On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1 > > while the above code requires 5 operations. > > I guess, though, that some enforced alignment could help produce > slightly more efficient code on some architectures (most notably sparc, > which really doesn't like to deal with unaligned words). Agreed. Enforcing good alignment may be useful. My point was that avoiding memcpy with modern compilers is rather pointless or even harmful because the compiler knows more about the target architecture than the author of the code. Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
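[Editorial note: one way to realize the alignment idea Mike and Dmitry discuss above is sketched below. The union type is hypothetical, not from the git source: overlaying the 20 raw bytes with unsigned longs forces the whole object to word alignment, so the compiler may emit aligned word moves for the copy while the code still calls plain memcpy.]

```c
#include <string.h>

/* Hypothetical type: the words[] member exists only to raise the
 * union's alignment to that of unsigned long. */
union aligned_sha1 {
    unsigned char bytes[20];
    unsigned long words[5];
};

void sha1_copy(union aligned_sha1 *dst, const union aligned_sha1 *src)
{
    /* still a plain 20-byte memcpy; the alignment lives in the type,
     * so the compiler can assume word-aligned source and destination */
    memcpy(dst->bytes, src->bytes, 20);
}
```

This keeps memcpy as the copy primitive (per Dmitry's point that the compiler knows the target best) while giving it the alignment guarantee that worried Kjetil.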
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski @ 2009-04-30 18:43 ` Shawn O. Pearce 1 sibling, 0 replies; 82+ messages in thread From: Shawn O. Pearce @ 2009-04-30 18:43 UTC (permalink / raw) To: Michael Witten; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List Michael Witten <mfwitten@gmail.com> wrote: > On Thu, Apr 30, 2009 at 07:17, Jakub Narebski <jnareb@gmail.com> wrote: > > I hope that JGit developers can > > tell us whether using higher level language affects performance, how > > much, and what features of higher-level language are causing decrease > > in performance. > > Java is definitely higher than C, but you can do some pretty low-level > operations on bits and bytes and the like, not to mention the presence > of a JIT. But it's still costly compared to C. > My point: I don't think that Java can tell us anything special in this regard. Sure it can. Peff I think made a good point here, that we rely on a lot of small tweaks in the C git code to get *really* good performance. 5% here, 10% there, and suddenly you are 60% faster than you were before. Nico, Linus, Junio, they have all spent some time over the past 3 or 4 years trying to tune various parts of Git to just flat out run fast. Higher level languages hide enough of the machine that we can't make all of these optimizations. JGit struggles with not having mmap(), or when you do use Java NIO MappedByteBuffer, we still have to copy to a temporary byte[] in order to do any real processing. C Git avoids that copy. Sure, other higher level languages may offer a better mmap facility, but they also tend to offer garbage collection and most try to tie the mmap management into the GC "for safety and ease of use". JGit struggles with not having unsigned types in Java. 
There are many locations in JGit where we really need "unsigned int32_t" or "unsigned long" (largest machine word available) or "unsigned char" but these types just don't exist in Java. Converting a byte up to an int just to treat it as an unsigned requires an extra " & 0xFF" operation to remove the sign extension. JGit struggles with not having an efficient way to represent a SHA-1. C can just say "unsigned char[20]" and have it inline into the container's memory allocation. A byte[20] in Java will cost an *additional* 16 bytes of memory, and be slower to access because the bytes themselves are in a different area of memory from the container object. We try to work around it by converting from a byte[20] to 5 ints, but that costs us machine instructions. C Git takes for granted that memcpy(a, b, 20) is dirt cheap when doing a copy from an inflated tree into a struct object. JGit has to pay a huge penalty to copy that 20 byte region out into 5 ints, because later on, those 5 ints are cheaper. Other higher level languages also lack the ability to mark a type unsigned. Or face similar penalties with storing a 20 byte binary region. Native Java collection types have been a snare for us in JGit. We've used java.util.* types when they seem to be handy and already solve the data structure problem at hand, but they tend to perform a lot worse than writing a specialized data structure. For example, we have ObjectIdSubclassMap for what should be Map<ObjectId,Object>. Only it requires that the Object type you use as the "value" entry in the map extend from ObjectId, as the instance serves as both key *and* value. But it screams when compared to HashMap<ObjectId,Object>. (For those who don't know, ObjectId is JGit's "unsigned char[20]" for a SHA-1.) Just a day or so ago I wrote LongMap, a faster HashMap<Long,Object>, for hashing objects by indexes in a pack file. 
Again, the boxing costs in Java to convert a "long" (largest integer type) into an Object that the standard HashMap type would accept were rather high. Right now, JGit is still paying dearly when it comes to ripping apart a commit or a tree object to follow the object links. Or when invoking inflate(). We spend a lot more time doing this sort of work than C git does, and yet we're trying to be as close to the machine as we can go by using byte[] whenever possible, by avoiding copying whenever possible, and avoiding memory allocation when possible. Notably, `rev-list --objects --all` takes about 2x as long in JGit as it does in C Git on a project like the linux kernel, and `index-pack` for the full ~270M pack file takes about 2x as long. Both parts of JGit are about as good as I know how to make them, but we're really at the mercy of the JIT, and changes in the JIT can cause us to perform worse (or better) than before. Unlike in C Git where Linus has done assembler dumps of sections of code and tried to determine better approaches. :-) So. Yes, it's practical to build Git in a higher level language, but you just can't get the same performance, or tight memory utilization, that C Git gets. That's what that higher level language abstraction costs you. But, JGit performs reasonably well; well enough that we use it internally at Google as a git server. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
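[Editorial note: the memory-layout point Shawn makes about "unsigned char[20]" can be sketched in C as follows. The struct names are made up, not from the git source: C git's layout corresponds to the inline struct, where the SHA-1 lives inside the containing object's own allocation, while Java's byte[20] field behaves like the indirect struct — a pointer to a separately allocated array, costing an extra allocation, an array header, and worse cache locality.]

```c
#include <stdlib.h>

/* What C git effectively does: one allocation holds both the object's
 * fields and its SHA-1 bytes, so they are adjacent in memory. */
struct obj_inline {
    unsigned flags;
    unsigned char sha1[20];     /* inline, no extra allocation */
};

/* Closest C analogue of a Java byte[20] field: a pointer to a
 * separately allocated buffer in a different area of memory. */
struct obj_indirect {
    unsigned flags;
    unsigned char *sha1;
};

struct obj_indirect *obj_indirect_new(void)
{
    struct obj_indirect *o = calloc(1, sizeof(*o));
    if (!o)
        return NULL;
    o->sha1 = calloc(1, 20);    /* the second allocation Java always pays */
    if (!o->sha1) {
        free(o);
        return NULL;
    }
    return o;
}
```

Every access through obj_indirect chases a pointer to a possibly cold cache line; obj_inline gets the hash for free once the object itself is loaded.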
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten @ 2009-04-30 14:22 ` Jeff King 2009-05-01 18:43 ` Linus Torvalds 2009-04-30 18:56 ` Nicolas Pitre 2 siblings, 1 reply; 82+ messages in thread From: Jeff King @ 2009-04-30 14:22 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, Apr 30, 2009 at 05:17:58AM -0700, Jakub Narebski wrote: > This is I think quite obvious. Accessing memory is faster than > accessing disk, which in turn is faster than accessing network. So if > commit and (change)log does not require access to server via network, > they are so much faster. Like all generalizations, this is only mostly true. Fast network servers with big caches can outperform disks for some loads. And in many cases with a VCS, you are performing a query that might look over the whole dataset, but return only a small fraction of data. So I wouldn't rule out the possibility of a pleasant VCS experience on a network-optimized system backed by beefy servers on a local network. I have never used perforce, but I get the impression that it is more optimized for such a situation. Git is really optimized for open source projects: slow servers across high-latency, low-bandwidth links. > es> Nah, probably not. Lots of people have written fast software in > es> C#, Java or Python. > es> > es> And lots of people have written really slow software in > es> traditional native languages like C/C++. [...] > > Well, I guess that access to low-level optimization techniques like > mmap are important for performance. But here I am guessing and > speculating like Eric did; well, I am asking on a proper forum ;-) Certainly there's algorithmic fastness that you can do in any language, and I think git does well at that. 
Most operations are independent of the total size of history (e.g., branching is O(1) and commit is O(changed files), diff looks only at endpoints, etc). Operations which deal only with history are independent of the size of the tree (e.g., "git log" and the history graph in gitk look only at commits, never at the tree). And when we do have to look at the tree, we can drastically reduce our I/O by comparing hashes instead of full files. But there are also some micro-optimizations that make a big difference in practice. Some of them can be done in any language. For example, the packfiles are ordered by type so that all of the commits have a nice I/O pattern when doing a history walk. Some other micro-optimizations are really language-specific, though. I don't recall the numbers, but I think Linus got measurable speedups from cutting the memory footprint of the object and commit structs (which gave better cache usage patterns). Git uses some variable-length fields inside structs instead of a pointer to a separately allocated string to give better memory access patterns. Tricks like that won't give the order-of-magnitude speedups that algorithmic optimizations will, but 10% here and 20% there means you can get a system that is a few times faster than the competition. For an operation that takes 0.1s anyway, that doesn't matter. But with current hardware and current project size, you are often talking about dropping a 3-second operation down to 1s or 0.5s, which just feels a lot snappier. And finally, git tries to do as little work as possible when starting a new command, and streams output as soon as possible. Which means that in a command-line setting, git can _feel_ snappier, because it starts output immediately. Higher-level languages can often have a much longer startup time, especially if they have a lot of modules to load. 
E.g.,: # does enough work to easily fill your pager $ time git log -100 >/dev/null real 0m0.011s user 0m0.008s sys 0m0.004s # does nothing, just starts perl and aborts with usage $ time git send-email >/dev/null real 0m0.150s user 0m0.104s sys 0m0.048s Both are warm-cache times. C git gives you output almost instantaneously, whereas just loading perl with a modest set of modules introduces a noticeable pause before any work is actually done. In the grand scheme of things, .1s probably isn't relevant, but I think avoiding that delay adds to the perception of git as fast. > es> Or maybe Git's shortcut for handling renames is faster than doing > es> them more correctly[2] like Bazaar does. > es> > es> [2] "Renaming is the killer app of distributed version control" > es> http://www.markshuttleworth.com/archives/123 > > Errr... what? Yeah, I had the same thought. Git's rename handling is _much_ more computationally intensive than other systems. In fact, it is one of only two places where I have ever wanted git to be any faster (the other being repacking of large repos). > Eight: Git seems fast. > ====================== > > Here I mean concentrating on low _latency_, which means that when git I do think this helps (see above), but I wanted to note that it is more than just "streaming"; I think other systems stream, as well. For example, I am pretty sure that "cvs log" streamed (but thank god it has been so long since I touched CVS that I can't really remember), but it _still_ felt awfully slow. So it is also about keeping start times low and having your data in a format that is ready to use. -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 14:22 ` Jeff King @ 2009-05-01 18:43 ` Linus Torvalds 2009-05-01 19:08 ` Jeff King 0 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-01 18:43 UTC (permalink / raw) To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Jeff King wrote: > > Like all generalizations, this is only mostly true. Fast network servers > with big caches can outperform disks for some loads. That's _very_ few loads. It doesn't matter how good a server you have, network filesystems invariably suck. Why? It's not that the network or the server sucks - you can easily find beefy NAS setups that have big raids etc and are much faster than most local disks. And they _still_ suck. Simple reason: caching. It's a lot easier to cache local filesystems. Even modern networked filesystems (ie NFSv4) that do a pretty good job on a file-per-file basis with delegations etc still tend to suck horribly at metadata. In contrast, a workstation with local filesystems and enough memory to cache it well will just be a lot nicer. > So I wouldn't rule out the possibility of a pleasant VCS experience on a > network-optimized system backed by beefy servers on a local network. Hey, you can always throw resources at it. But no: > I have never used perforce, but I get the impression that it is more > optimized for such a situation. I doubt it. I suspect git will outperform pretty much anything else in that kind of situation too. One thing that git does - and some other VCS's avoid - is to actually stat() the whole working tree in order to not need special per-file "I use this file" locking semantics. That can in theory make git slower over a network filesystem than such (very broken) alternatives. 
If your VCS requires that you mark all files for editing somehow (ie you can't just use your favourite editor or scripting to modify files, but have to use "p4 edit" to say that you're going to write to the file, and the file is otherwise read-only), then such a VCS can - by being annoying and in your way - do some things faster than git can. And yes, perforce does that (the "p4 edit" command is real, and exists). And yes, in theory that can probably mean that perforce doesn't care so much about the metadata caching problem on network filesystems - because p4 will maintain some file of its own that contains the metadata. But I suspect that the git "async stat" ("core.preloadindex") thing means that git will kick p4 *ss even on that benchmark, and be a whole lot more pleasant to use. Even on networked filesystems. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 18:43 ` Linus Torvalds @ 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Jeff King @ 2009-05-01 19:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > > Like all generalizations, this is only mostly true. Fast network servers > > with big caches can outperform disks for some loads. > [...] > In contrast, a workstation with local filesystems and enough memory to > cache it well will just be a lot nicer. > [...] > > I have never used perforce, but I get the impression that it is more > > optimized for such a situation. > > I doubt it. I suspect git will outperform pretty much anything else in > that kind of situation too. Thanks for the analysis; what you said makes sense to me. However, there is at least one case of somebody complaining that git doesn't scale as well as perforce for their load: http://gandolf.homelinux.org/blog/index.php?id=50 Part of his issue is with git-p4 sucking, which it probably does. But part of it sounds like he has a gigantic workload (the description of which sounds silly to me, but I respect the fact that he is probably describing standard practice among some companies), and that workload is just a little too gigantic for the workstations to handle. I.e., by throwing resources at the central server they can avoid throwing as many at each workstation. But there are so few details it's hard to say whether he's doing something else wrong or suboptimally. He does mention Windows, which IIRC has horrific stat performance. -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King @ 2009-05-01 19:13 ` david 2009-05-01 19:32 ` Nicolas Pitre 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2 siblings, 1 reply; 82+ messages in thread From: david @ 2009-05-01 19:13 UTC (permalink / raw) To: Jeff King Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > >>> Like all generalizations, this is only mostly true. Fast network servers >>> with big caches can outperform disks for some loads. >> [...] >> In contrast, a workstation with local filesystems and enough memory to >> cache it well will just be a lot nicer. >> [...] >>> I have never used perforce, but I get the impression that it is more >>> optimized for such a situation. >> >> I doubt it. I suspect git will outperform pretty much anything else in >> that kind of situation too. > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: > > http://gandolf.homelinux.org/blog/index.php?id=50 > > Part of his issue is with git-p4 sucking, which it probably does. But > part of it sounds like he has a gigantic workload (the description of > which sounds silly to me, but I respect the fact that he is probably > describing standard practice among some companies), and that workload is > just a little too gigantic for the workstations to handle. I.e., by > throwing resources at the central server they can avoid throwing as many > at each workstation. > > But there are so few details it's hard to say whether he's doing > something else wrong or suboptimally. He does mention Windows, which > IIRC has horrific stat performance. the key thing for his problem is the support for large binary objects. 
there was discussion here a few weeks ago about ways to handle such things without trying to pull them into packs. I suspect that solving those sorts of issues would go a long way towards closing the gap on this workload. there may be issues in doing a clone for repositories that large, I don't remember exactly what happens when you have something larger than 4G to send in a clone. David Lang ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:13 ` david @ 2009-05-01 19:32 ` Nicolas Pitre 0 siblings, 0 replies; 82+ messages in thread From: Nicolas Pitre @ 2009-05-01 19:32 UTC (permalink / raw) To: david Cc: Jeff King, Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, david@lang.hm wrote: > the key thing for his problem is the support for large binary objects. there > was discussion here a few weeks ago about ways to handle such things without > trying to pull them into packs. I suspect that solving those sorts of issues > would go a long way towards closing the gap on this workload. > > there may be issues in doing a clone for repositories that large, I don't > remember exactly what happens when you have something larger than 4G to send > in a clone. If you have files larger than 4G then you definitely need a 64-bit machine with plenty of RAM for git to at least be able to cope at the moment. It should be easy to add a config option to determine how big a big file is, and store those big files directly in a pack of their own instead of a loose object (for easy pack reuse during a further repack), and never attempt to deltify them, etc. etc. At which point git will handle big files just fine even on a 32-bit machine but it won't do more than copying them in and out, and possibly deflating/inflating them while at it, but nothing fancier. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david @ 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2 siblings, 0 replies; 82+ messages in thread From: Daniel Barkalow @ 2009-05-01 21:17 UTC (permalink / raw) To: Jeff King Cc: Linus Torvalds, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > On Fri, May 01, 2009 at 02:43:49PM -0400, Linus Torvalds wrote: > > > > Like all generalizations, this is only mostly true. Fast network servers > > > with big caches can outperform disks for some loads. > > [...] > > In contrast, a workstation with local filesystems and enough memory to > > cache it well will just be a lot nicer. > > [...] > > > I have never used perforce, but I get the impression that it is more > > > optimized for such a situation. > > > > I doubt it. I suspect git will outperform pretty much anything else in > > that kind of situation too. > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: > > http://gandolf.homelinux.org/blog/index.php?id=50 > > Part of his issue is with git-p4 sucking, which it probably does. But > part of it sounds like he has a gigantic workload (the description of > which sounds silly to me, but I respect the fact that he is probably > describing standard practice among some companies), and that workload is > just a little too gigantic for the workstations to handle. I.e., by > throwing resources at the central server they can avoid throwing as many > at each workstation. I think his problem is that he's trying to replace his p4 repository with a git repository, which is a bit like trying to download github, rather than a project from github. 
Perforce is good at dealing with the case where people check in a vast quantity of junk that you don't check out. That is, you can back up your workstation into Perforce, and it won't affect anyone's performance if you use a path that's not in the range that anybody else checks out. And people actually do that. And Perforce doesn't make a distinction between different projects and different branches of the same project and different subdirectories of a branch of the same project, so it's impossible to tease apart except by company policy. Git doesn't scale in that it can't do the extremely narrow checkouts you need if your repository root directory contains thousands of complete unrelated projects with each branch of each project getting a subdirectory. On the other hand, it does a great job when the data is already partitioned into useful repositories. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david 2009-05-01 21:17 ` Daniel Barkalow @ 2009-05-01 21:37 ` Linus Torvalds 2009-05-01 22:11 ` david 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-01 21:37 UTC (permalink / raw) To: Jeff King; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Jeff King wrote: > > Thanks for the analysis; what you said makes sense to me. However, there > is at least one case of somebody complaining that git doesn't scale as > well as perforce for their load: So we definitely do have scaling issues, there's no question about that. I just don't think they are about enterprise network servers vs the more workstation-oriented OSS world.. I think they're likely about the whole git mentality of looking at the big picture, and then getting swamped by just how _huge_ that picture can be if somebody just put the whole world in a single repository.. With perforce, repository maintenance is such a central issue that the whole p4 mentality seems to _encourage_ everybody to put everything into basically one single p4 repository. And afaik, p4 basically works mostly like CVS, ie it really ends up being pretty much oriented to a "one file at a time" model. Which is nice in that you can have a million files, and then only check out a few of them - you'll never even _see_ the impact of the other 999,995 files. And git obviously doesn't have that kind of model at all. Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around. So git scales really badly if you force it to look at everything as one _huge_ repository. I don't think that part is really fixable, although we can probably improve on it. 
And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know. There are work-arounds (like not deltaing big objects at all), but they aren't necessarily that great either. I bet we could probably improve git large-file behavior for many common cases. Do we have a good test-case of some particular suckiness that is actually relevant enough that people might decide to look at it (and by "people", I do mean myself too - but I'd need to be somewhat motivated by it. A usage case that we suck at and that is available and relevant). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-05-01 21:37 ` Linus Torvalds @ 2009-05-01 22:11 ` david 0 siblings, 0 replies; 82+ messages in thread From: david @ 2009-05-01 22:11 UTC (permalink / raw) To: Linus Torvalds Cc: Jeff King, Jakub Narebski, Martin Langhoff, Git Mailing List On Fri, 1 May 2009, Linus Torvalds wrote: > I bet we could probably improve git large-file behavior for many common > cases. Do we have a good test-case of some particular suckiness that is > actually relevant enough that people might decide to look at it (and by > "people", I do mean myself too - but I'd need to be somewhat motivated by > it. A usage case that we suck at and that is available and relevant). I think that a sane use case that would make sense to people is based on the 'game developer' example they have source code, but they also have large images (and sometimes movie clips), where a particular release of the game needs a particular set of the images. during development you may change images frequently (although most changesets probably only change a few, if any of the images) the images can be large (movies can be very large), and since they are already compressed they don't diff or compress well. David Lang ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten 2009-04-30 14:22 ` Jeff King @ 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen 2009-04-30 19:33 ` Jakub Narebski 2 siblings, 2 replies; 82+ messages in thread From: Nicolas Pitre @ 2009-04-30 18:56 UTC (permalink / raw) To: Jakub Narebski; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Jakub Narebski wrote: > Jakub Narebski <jnareb@gmail.com> writes: > > es> Two: Maybe Git is fast because Linus Torvalds is so smart. > > [non answer; the details are important] I think Linus is certainly responsible for a big part of Git's speed. He came up with the basic data structure used by git which has lots to do with that. Also, he designed Git specifically to fulfill a need for which none of the alternatives were fast enough. Hence Git was designed from the ground up with speed as one of the primary design goals, such as being able to create multiple commits per second instead of the other way around (several seconds per commit). And yes, Linus is usually smart enough with the proper mindset to achieve such goals. > es> Three: Maybe Git is fast because it's written in C instead of one > es> of those newfangled higher-level languages. > es> > es> Nah, probably not. Lots of people have written fast software in > es> C#, Java or Python. > es> > es> And lots of people have written really slow software in > es> traditional native languages like C/C++. [...] > > Well, I guess that access to low-level optimization techniques like > mmap are important for performance. 
But here I am guessing and > speculating like Eric did; well, I am asking on a proper forum ;-) > > We have some anecdotal evidence supporting this possibility (which > Eric dismisses), namely the fact that pure-Python Bazaar is the slowest of > the three most common open source DVCSs (Git, Mercurial, Bazaar) and the > fact that parts of Mercurial were written in C for better performance. > > We can also compare implementations of Git in other, higher-level > languages, with the reference implementation in C (and shell scripts, and > Perl ;-)). For example the most complete (though I think still not fully > complete) Java implementation: JGit. I hope that JGit developers can > tell us whether using a higher-level language affects performance, how > much, and what features of the higher-level language cause the decrease > in performance. Of course we have to take into account the > possibility that JGit simply isn't as well optimized because of less > manpower. One of the main JGit developers is Shawn Pearce. If you look at Shawn's contributions to C git, they are almost all related to performance issues. Amongst other things, he is the author of git-fast-import, he contributed the pack access windowing code, and he was also involved in the initial design of pack v4. Hence Shawn is a smart guy who certainly knows a thing or two about performance optimization. Yet he reported on this list that his efforts to make JGit faster were no longer very successful, most probably due to the language overhead. > es> Four: Maybe Git is fast because being fast is the primary goal for > es> Git. > > [non answer; the details are important] Still, this is actually true (see about Linus above). Without such a goal, you quickly lose sight of performance regressions. > es> Maybe Git is fast because every time they faced one of these "buy > es> vs. build" choices, they decided to just write it themselves. > > I don't think so. Rather the opposite is true. Git uses libcurl for > HTTP transport. 
Git uses zlib for compression. Git uses SHA-1 from > OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for > (binary) deltaifying, for diffs and for merges. Well, I think he's right on this point as well. libcurl is not so relevant since it is rarely the bottleneck (the network bandwidth itself usually is). zlib is already as fast as it can be; multiple attempts to make it faster didn't succeed. Git already carries its own version of SHA-1 code for ARM and PPC because the alternatives were slower. The fact that libxdiff was made internal is indeed to have a better impedance matching with the core code, otherwise it could have remained fully external just like zlib. And the binary delta code is not libxdiff anymore but a much smaller, straightforward, optimized-to-death version that achieves speed over versatility (no need to be versatile when strictly dealing with Git's needs only). > es> Seven: Maybe Git isn't really that fast. > es> > es> If there is one thing I've learned about version control it's that > es> everybody's situation is different. It is quite likely that Git > es> is a lot faster for some scenarios than it is for others. > es> > es> How does Git handle really large trees? Git was designed primarily > es> to support the efforts of the Linux kernel developers. A lot of > es> people think the Linux kernel is a large tree, but it's really > es> not. Many enterprise configuration management repositories are > es> FAR bigger than the Linux kernel. > > c.f. "Why Perforce is more scalable than Git" by Steve Hanov > http://gandolf.homelinux.org/blog/index.php?id=50 > > I don't really know about this. Git certainly sucks big time with large files. Git also sucks to a lesser extent (but still) with very large repositories. But large trees? I don't think Git is worse than anything out there with a large tree of average-size files. 
Yet, this point is misleading because when people give Git the reputation of being faster, it is certainly from comparisons of operations performed on the same source tree. Who cares about scenarios for which the tool was not designed? Those "enterprise configuration management repositories" are not what Git was designed for indeed, but neither were Mercurial and Bazaar, nor any other contender to which Git is usually compared. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) 2009-04-30 18:56 ` Nicolas Pitre @ 2009-04-30 19:16 ` Alex Riesen 2009-05-04 8:01 ` Why Git is so fast Andreas Ericsson 2009-04-30 19:33 ` Jakub Narebski 1 sibling, 1 reply; 82+ messages in thread From: Alex Riesen @ 2009-04-30 19:16 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jakub Narebski, Martin Langhoff, Git Mailing List 2009/4/30 Nicolas Pitre <nico@cam.org>: > Yet, this point is misleading because when people give Git the > reputation of being faster, it is certainly from comparisons of > operations performed on the same source tree. Who cares about scenarios > for which the tool was not designed? Those "enterprise configuration > management repositories" are not what Git was designed for indeed, but Especially when no sane developer will put in his repository the toolchain (pre-compiled. For all supported platforms!), all the supporting tools (like grep, find, etc. Pre-compiled _and_ source), the in-house framework (pre-compiled and source, again), firmware (pre-compiled and put in the repository weekly), and operating system code (pre-compiled, with firmware-specific drivers, updated, you guessed it, weekly), and, well, there is the project itself (Java or C++, and documentation in .doc and .xls)... Now, what kind of self-hating idiot will design a system for that kind of abuse? (And if someone says that's not true in most enterprise f$%cking configurations, he definitely hasn't had to live through a big enough number of them). ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Why Git is so fast 2009-04-30 19:16 ` Alex Riesen @ 2009-05-04 8:01 ` Andreas Ericsson 0 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-04 8:01 UTC (permalink / raw) To: Alex Riesen Cc: Nicolas Pitre, Jakub Narebski, Martin Langhoff, Git Mailing List Alex Riesen wrote: > 2009/4/30 Nicolas Pitre <nico@cam.org>: >> Yet, this point is misleading because when people give Git the >> reputation of being faster, it is certainly from comparisons of >> operations performed on the same source tree. Who cares about scenarios >> for which the tool was not designed? Those "enterprise configuration >> management repositories" are not what Git was designed for indeed, but > > Especially when no sane developer will put in his repository the toolchain > (pre-compiled. For all supported platforms!), all the supporting tools > (like grep, > find, etc. Pre-compiled _and_ source), the in-house framework (pre-compiled > and source, again), firmware (pre-compiled and put in the repository weekly), > and operating system code (pre-compiled, with firmware-specific drivers, > updated, you guessed it, weekly), and well, there is the project itself (Java or > C++, and documentation in .doc and .xls)... Well, git could actually handle that just fine if the toolchain was in a submodule or even in a separate repository that developers never had to worry about. Then you'd design a little tool that said "re-create build 8149" and it would pull the tools used to do that, and the code and the artwork, and then set to work. It'd be an overnight (or over-weekend) job, but no man-hours would be spent on it. That's how I'd do it anyway, probably with the "build" repository as a master repo with "tools", "code" and "artwork" as submodules to it. > Now, what kind of self-hating idiot will design a system for that kind of abuse? 
No one, naturally, but one might design a system where each folder in the repository root is considered a repository in its own right, and then get that more or less for free. The problem with git for such scenarios is that you have to think *before* creating the repository, or play silly buggers when importing, which makes it hard to see how the pieces fit together afterwards. A tool that could take a repository from a different SCM and create a master repository and several submodule repositories from it would probably solve many of the issues gaming companies have if they want to switch to using git. Not least because it would open their eyes to how that sort of separation can be done in git, and why it's useful. The binary repos can then turn off delta compression (and zlib compression) for all their blobs using a .gitattributes file, and things would be several orders of magnitude faster. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
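A rough sketch of the layout Andreas describes, using throwaway local repositories. All names, paths and identity settings here are hypothetical; the `delta` attribute is real git, but note that per-path zlib compression is not actually switchable from .gitattributes, so only delta search is disabled below:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# three repos standing in for the tools, code and artwork trees
for r in tools code artwork; do
    git init -q "$r"
    git -C "$r" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "initial import"
done

# disable delta search for binary assets in the artwork repo
printf '*.png -delta\n*.mov -delta\n' > artwork/.gitattributes
git -C artwork add .gitattributes
git -C artwork -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "no delta search for binary assets"

# the "build" master repo ties the three together as submodules
git init -q build && cd build
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial import"
for r in tools code artwork; do
    # newer git needs protocol.file.allow for local-path submodules
    git -c protocol.file.allow=always submodule --quiet add "$tmp/$r" "$r"
done
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add tools, code and artwork as submodules"
git submodule status
```

Re-creating "build 8149" is then just a checkout of the right superproject commit followed by `git submodule update`.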
* Re: Why Git is so fast 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen @ 2009-04-30 19:33 ` Jakub Narebski 1 sibling, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2009-04-30 19:33 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Martin Langhoff, Git Mailing List On Thu, 30 Apr 2009, Nicolas Pitre wrote: > On Thu, 30 Apr 2009, Jakub Narebski wrote: > > Jakub Narebski <jnareb@gmail.com> writes: > > es> Maybe Git is fast because every time they faced one of these "buy > > es> vs. build" choices, they decided to just write it themselves. > > > > I don't think so. Rather the opposite is true. Git uses libcurl for > > HTTP transport. Git uses zlib for compression. Git uses SHA-1 from > > OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for > > (binary) deltaifying, for diffs and for merges. > > Well, I think he's right on this point as well. [...] > The fact that libxdiff was made internal is indeed to have a better > impedance matching with the core code, otherwise it could have remained > fully external just like zlib. And the binary delta code is not > libxdiff anymore but a much smaller, straightforward, optimized-to-death > version that achieves speed over versatility (no need to be versatile > when strictly dealing with Git's needs only). Hrmmmm... I had thought that LibXDiff was internalized mainly for ease of modification, as my impression is that LibXDiff is a single-developer effort, while Git has had many contributors from the beginning (and submodules didn't exist then). If I remember correctly, the rcsmerge/diff3 algorithm was first added in git's internalized xdiff... was it added to LibXDiff proper, anyway? BTW, I wonder what the other F/OSS version control systems (Bazaar, Mercurial, Darcs, Monotone) use for binary deltas, for a diff engine, and for a textual three-way merge engine. Hmmm... perhaps I'll ask on #revctrl -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control @ 2009-05-12 15:06 Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-12 15:06 UTC (permalink / raw) To: git A good start for making Git cross-platform would be storing the text encoding of every file name and commit message together with the commit. Currently, because Git is oblivious to the encodings and just treats them as a series of bytes, there is no way to make them cross-platform. It's as http://www.joelonsoftware.com/articles/Unicode.html says: "It does not make sense to have a string without knowing what encoding it uses." Without explicit encoding information, making a system that works even on the three main platforms, let alone in all countries and languages, is simply not possible. On the other hand, if the encoding is explicitly stated in the repository, then it is possible for platform- and locale-aware Git clients to handle the file names and commit messages in whatever way makes most sense for the platform (for example, converting the file names to the platform's encoding if it differs from the committer's platform encoding). Then it would also be possible to create a Mac version of Git which compensates for Mac OS X's file system's file name encoding peculiarities. Also, the system could then warn (on "git add") if the data does not look like it has been encoded with the stated encoding. If the platform's and the repository's encoding happen to be the same (which in reality might be possible only inside a small company where everybody is forced to use the same OS, configured by a single sysadmin), then no conversions need to be done. 
Also Git purists, who think that the byte sequence representing a file name is more important than the human-readable version of the file name, may use some configuration switch that disables all conversions - but even then the current encoding should be stored together with the commit. Are there any plans on storing the encoding information of file names and commit messages in the Git repository? How much time would implementing it take? Any ideas on how to maintain backwards compatibility (for old commits that do not have the encoding information)? - Esko ^ permalink raw reply [flat|nested] 82+ messages in thread
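Esko's point that a byte string is ambiguous without a declared encoding is easy to demonstrate; the file name below is an invented example, and iconv charset names may vary slightly between platforms:

```shell
# One and the same byte sequence for a file name, read under two
# different assumed encodings, yields two different names.
cd "$(mktemp -d)"
printf 'tiedosto-\344.txt' > name-bytes       # 0xE4, written on a Latin-1 box

iconv -f ISO-8859-1 -t UTF-8 < name-bytes; echo   # tiedosto-ä.txt (intended)
iconv -f ISO-8859-7 -t UTF-8 < name-bytes; echo   # tiedosto-δ.txt (Greek box)
```

Without a recorded encoding, a client has no way to tell which of these the committer meant.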
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola @ 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 16:16 ` Jeff King 2009-05-12 18:28 ` Dmitry Potapov 2009-05-14 13:48 ` Peter Krefting 2 siblings, 2 replies; 82+ messages in thread From: Shawn O. Pearce @ 2009-05-12 15:14 UTC (permalink / raw) To: Esko Luontola; +Cc: git Esko Luontola <esko.luontola@gmail.com> wrote: > Are there any plans on storing the encoding information of file names > and commit messages in the Git repository? Commit messages already store their encoding in an optional "encoding" header if the message isn't stored in UTF-8, or US-ASCII, which is a strict subset of UTF-8. As for file names, no plans; it's a sequence of bytes, but I think a lot of people wind up using some subset of US-ASCII for their file names, especially if their project is going to be cross-platform. -- Shawn. ^ permalink raw reply [flat|nested] 82+ messages in thread
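The "encoding" header Shawn mentions can be seen in a raw commit object. The throwaway repo and identity settings below are just for illustration; `i18n.commitEncoding` is the real config knob that declares the encoding of incoming commit messages:

```shell
set -e
cd "$(mktemp -d)" && git init -q
# declare that commit messages are supplied as ISO-8859-1, not UTF-8,
# so git records an "encoding" header in the commit object
git -c user.name=demo -c user.email=demo@example.com \
    -c i18n.commitEncoding=ISO-8859-1 \
    commit -q --allow-empty -m "initial commit"
git cat-file commit HEAD    # note the "encoding ISO-8859-1" header line
```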
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 17:56 ` Esko Luontola 2009-05-12 16:16 ` Jeff King 1 sibling, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:13 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git Hi, On Tue, 12 May 2009, Shawn O. Pearce wrote: > Esko Luontola <esko.luontola@gmail.com> wrote: > > Are there any plans on storing the encoding information of file names > > and commit messages in the Git repository? > > Commit messages already store their encoding in an optional "encoding" > header if the message isn't stored in UTF-8, or US-ASCII, which is a > strict subset of UTF-8. > > As for file names, no plans; it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross-platform. Some context: this issue cropped up in msysGit, of course. As to storing all file names in UTF-8, my point about Unicode being not necessarily appropriate for everyone still stands. UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO we should not take away the freedom for everybody to decide what they want their file names to be encoded as. However, I see that there might be a need to be able to encode the file names differently, such as on Windows. IMHO the best solution would be a config variable controlling the reencoding of file names. For some time, it looked as if two people were interested in implementing something like that (Peter and Robin IIRC), but efforts have stalled. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 17:56 ` Esko Luontola 2009-05-12 20:38 ` Johannes Schindelin 0 siblings, 1 reply; 82+ messages in thread From: Esko Luontola @ 2009-05-12 17:56 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Shawn O. Pearce, git On 12.5.2009, at 19:13, Johannes Schindelin wrote: > As to storing all file names in UTF-8, my point about Unicode being > not > necessarily appropriate for everyone still stands. > > UTF-8 _might_ be the de-facto standard for Linux filesystems, but > IMHO we should not take away the freedom for everybody to decide > what they > want their file names to be encoded as. > > However, I see that there might be a need to be able to encode the > file > names differently, such as on Windows. IMHO the best solution would > be > a config variable controlling the reencoding of file names. Exactly. The system should not force the use of a specific encoding. It should only offer a recommendation, but also be fully compatible if the user uses some other encoding. That's why it's best to always store the information about what encoding was used. It shouldn't matter whether the data is encoded with ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it is explicitly stated what the encoding is. Then the reader of the data can best decide how to show that data on the current platform. A config variable defining what encoding should be used when committing the file names would make sense. Git should also try to autodetect what encoding is used in its current environment. In the case of UTF-8, you should also be able to specify which normalization form is used (http://www.unicode.org/unicode/reports/tr15/), or whether it is normalized at all. 
For example, it should be possible to configure Git so that, when a file is checked out on a Mac, its file name is converted to the current file system's encoding (UTF-8 NFD, I think), and when the file is committed on a Mac, the file name is normalized back to the same UTF-8 form as is used on Linux (UTF-8 NFC). It would be nice to have config variables for saying that all file names in this repository must use UTF-8 NFC, and all commit messages must use UTF-8 NFC (with Unix newlines). Then the Git client would autodetect the current environment's encoding, and convert the text, if necessary, to match the repository's encoding. - Esko ^ permalink raw reply [flat|nested] 82+ messages in thread
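The NFC/NFD mismatch Esko describes is visible at the byte level; 'é' is just an example character, and the scratch files are hypothetical:

```shell
# "é.txt" spelled two ways: NFC stores one code point (U+00E9), NFD
# stores 'e' plus a combining accent (U+0301). A human sees the same
# name; byte-comparing tools (and git's index) do not.
cd "$(mktemp -d)"
printf '\303\251.txt'  > nfc-name   # NFC bytes: c3 a9 2e 74 78 74
printf 'e\314\201.txt' > nfd-name   # NFD bytes: 65 cc 81 2e 74 78 74
od -An -tx1 nfc-name
od -An -tx1 nfd-name
cmp -s nfc-name nfd-name || echo "same name to a human, different bytes"
```

Without the normalization-aware conversion Esko asks for, a file committed as NFC on Linux and checked out on an NFD filesystem round-trips as a "different" file name.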
* Re: Cross-Platform Version Control 2009-05-12 17:56 ` Esko Luontola @ 2009-05-12 20:38 ` Johannes Schindelin 2009-05-12 21:16 ` Esko Luontola 0 siblings, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 20:38 UTC (permalink / raw) To: Esko Luontola; +Cc: Shawn O. Pearce, git Hi, On Tue, 12 May 2009, Esko Luontola wrote: > On 12.5.2009, at 19:13, Johannes Schindelin wrote: > >As to storing all file names in UTF-8, my point about Unicode being not > >necessarily appropriate for everyone still stands. > > > >UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO > >we should not take away the freedom for everybody to decide what they > >want their file names to be encoded as. > > > >However, I see that there might be a need to be able to encode the file > >names differently, such as on Windows. IMHO the best solution would be > >a config variable controlling the reencoding of file names. > > Exactly. The system should not force the use of a specific encoding. It > should only offer a recommendation, but also be fully compatible if the > user uses some other encoding. > > That's why it's best to always store the information about what encoding > was used. It shouldn't matter whether the data is encoded with > ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it > is explicitly stated what the encoding is. Then the reader of the > data can best decide how to show that data on the current platform. > > A config variable defining what encoding should be used when > committing the file names would make sense. Git should also try to > autodetect what encoding is used in its current environment. In > the case of UTF-8, you should also be able to specify which > normalization form is used > (http://www.unicode.org/unicode/reports/tr15/), or whether it is > normalized at all. 
> > For example, it should be possible to configure Git so that, when a file > is checked out on a Mac, its file name is converted to the current file > system's encoding (UTF-8 NFD, I think), and when the file is committed > on a Mac, the file name is normalized back to the same UTF-8 form as is > used on Linux (UTF-8 NFC). > > It would be nice to have config variables for saying that all file > names in this repository must use UTF-8 NFC, and all commit messages > must use UTF-8 NFC (with Unix newlines). Then the Git client would > autodetect the current environment's encoding, and convert the text, if > necessary, to match the repository's encoding. That is a nice analysis. How about implementing it? Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 20:38 ` Johannes Schindelin @ 2009-05-12 21:16 ` Esko Luontola 2009-05-13 0:23 ` Johannes Schindelin 0 siblings, 1 reply; 82+ messages in thread From: Esko Luontola @ 2009-05-12 21:16 UTC (permalink / raw) To: git; +Cc: Johannes Schindelin, Shawn O. Pearce Johannes Schindelin wrote on 12.5.2009 23:38: > That is a nice analysis. How about implementing it? > Do we have here somebody who knows Git's code well and is motivated to implement this? I don't think that I would be capable, because of not having used C much, being new to Git's codebase, and having too little time. But I can help with the requirements specification, interaction design and system testing. -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 21:16 ` Esko Luontola @ 2009-05-13 0:23 ` Johannes Schindelin 2009-05-13 5:34 ` Esko Luontola 0 siblings, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 0:23 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 12.5.2009 23:38: > > That is a nice analysis. How about implementing it? > > > > Do we have here somebody who knows Git's code well and is motivated to > implement this? > > I don't think that I would be capable, because of not having used C > much, being new to Git's codebase, and having too little time. Well, that rather settles things, no? Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 0:23 ` Johannes Schindelin @ 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin 0 siblings, 2 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-13 5:34 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, Shawn O. Pearce Johannes Schindelin wrote on 13.5.2009 3:23: > Well, that rather settles things, no? > There is a need for the feature, but it's unfortunate that the Git developers do not see its value. There are many users for whom using non-ASCII names is necessary (for example all of Asia and most of Europe), but now it seems that Bazaar is the only DVCS that handles encodings correctly: http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Let's see if I have time later this year or next to work on it. At least it would be good practice in getting acquainted with a new codebase and learning C. But it would be better for someone else to do it, to get it done within a reasonable amount of time. I see that there are some tests in the /t directory. Which command will run all of them? How good is the tests' coverage, how reproducible and isolated are they, and how many seconds does it take to run all the tests? Is there some high-level documentation for new developers? -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola @ 2009-05-13 6:49 ` Alex Riesen 0 siblings, 0 replies; 82+ messages in thread From: Alex Riesen @ 2009-05-13 6:49 UTC (permalink / raw) To: Esko Luontola; +Cc: Johannes Schindelin, git, Shawn O. Pearce 2009/5/13 Esko Luontola <esko.luontola@gmail.com>: > Johannes Schindelin wrote on 13.5.2009 3:23: >> >> Well, that rather settles things, no? >> > > There is a need for the feature, but it's unfortunate that the Git developers > do not see its value. There are many users for whom using non-ASCII names is > necessary (for example all of Asia and most of Europe), but now it seems > that Bazaar is the only DVCS that handles encodings correctly: > http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Many Git developers just use systems which don't care about the file name encoding at all and just keep the names as they were, so the interoperability problem does not exist for them. They either don't need the feature, or can trivially avoid or work around any problems. > I see that there are some tests in the /t directory. Which command will run > all of them? How good is the tests' coverage, how reproducible and > isolated are they, and how many seconds does it take to run all the tests? Is > there some high-level documentation for new developers? make test. See also t/README. We like them. I always run the test suite before deployment and sometimes run it just for fun (unless I have to run it on Windows). ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen @ 2009-05-13 10:15 ` Johannes Schindelin [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> 1 sibling, 1 reply; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 10:15 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 13.5.2009 3:23: > > Well, that rather settles things, no? > > There is a need for the feature, but it's unfortunate that the Git > developers do not see its value. I see a value. But it is not my itch. And since it is your itch and you said that you will not do anything about it (I don't count writing emails here ;-), I concluded that it settles the issue. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
[parent not found: <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>]
* Cross-Platform Version Control [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> @ 2009-05-13 10:41 ` John Tapsell 2009-05-13 13:42 ` Jay Soffian 0 siblings, 1 reply; 82+ messages in thread From: John Tapsell @ 2009-05-13 10:41 UTC (permalink / raw) To: git 2009/5/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>: > Hi, > > On Wed, 13 May 2009, Esko Luontola wrote: > >> Johannes Schindelin wrote on 13.5.2009 3:23: >> > Well, that rather settles things, no? >> >> There is a need for the feature, but it's unfortunate that the Git >> developers do not see its value. > > I see a value. But it is not my itch. And since it is your itch and you > said that you will not do anything about it (I don't count writing emails > here ;-), I concluded that it settles the issue. I don't know why the git developers are being so hostile/dismissive, but I also hope that somebody volunteers to fix this. Esko, you have my moral support :-) John ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 10:41 ` John Tapsell @ 2009-05-13 13:42 ` Jay Soffian 2009-05-13 13:44 ` Alex Riesen 0 siblings, 1 reply; 82+ messages in thread From: Jay Soffian @ 2009-05-13 13:42 UTC (permalink / raw) To: John Tapsell; +Cc: git On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: > I don't know why the git developers are being so hostile/dismissive, Are you serious? j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:42 ` Jay Soffian @ 2009-05-13 13:44 ` Alex Riesen 2009-05-13 13:50 ` Jay Soffian 0 siblings, 1 reply; 82+ messages in thread From: Alex Riesen @ 2009-05-13 13:44 UTC (permalink / raw) To: Jay Soffian; +Cc: John Tapsell, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >> I don't know why the git developers are being so hostile/dismissive, > > Are you serious? > ...because we'll kill you if you aren't >:-E ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:44 ` Alex Riesen @ 2009-05-13 13:50 ` Jay Soffian 2009-05-13 13:57 ` John Tapsell 0 siblings, 1 reply; 82+ messages in thread From: Jay Soffian @ 2009-05-13 13:50 UTC (permalink / raw) To: Alex Riesen; +Cc: John Tapsell, git On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>> I don't know why the git developers are being so hostile/dismissive, >> >> Are you serious? >> > > ...because we'll kill you if you aren't >:-E I'm just flabbergasted by some people's expectations. Perhaps John doesn't realize the git developers are all volunteers, and that it is never appropriate to criticize a volunteer. A "thank you for all your hard work on git" would have done nicely. j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:50 ` Jay Soffian @ 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: John Tapsell @ 2009-05-13 13:57 UTC (permalink / raw) To: Jay Soffian; +Cc: Alex Riesen, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>> I don't know why the git developers are being so hostile/dismissive, >>> >>> Are you serious? >>> >> >> ...because we'll kill you if you aren't >:-E > > I'm just flabbergasted by some people's expectations. Perhaps John > doesn't realize the git developers are all volunteers, and that it is > never appropriate to criticize a volunteer. A "thank you for all your > hard work on git" would have done nicely. I'm as much of an open source developer as anyone else here. I spend a huge amount of my time programming for KDE. But I've never told a user "well that settles it" because they won't code it themselves :-/ I certainly get a huge number of bugs/wishes that I can't/won't code myself, but I try to be a bit more diplomatic about it. But then the kernel mailing lists tend to be a lot more.. direct.. than the kde mailing lists, so I guess it comes from that. Requiring people to have a thick skin and all that. John ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell @ 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 16:22 ` Johannes Schindelin 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 1 reply; 82+ messages in thread From: Nicolas Pitre @ 2009-05-13 15:27 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git On Wed, 13 May 2009, John Tapsell wrote: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bugs/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more.. direct.. > than the kde mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. This is not the kernel mailing list. In fact this list is quite a bit friendlier and more accommodating than the kernel list. The remark alluded to above comes from _one_ of the git developers. And Dscho is apparently in a rather sad mood these days. While the substance of Dscho's remark is entirely pertinent, it would be wrong to use its form and style as a characterization of git developers in general. Nicolas ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 16:22 ` Johannes Schindelin 0 siblings, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-13 16:22 UTC (permalink / raw) To: Nicolas Pitre; +Cc: John Tapsell, Jay Soffian, Alex Riesen, git Hi, On Wed, 13 May 2009, Nicolas Pitre wrote: > On Wed, 13 May 2009, John Tapsell wrote: > > > I'm as much of an open source developer as anyone else here. I spend > > a huge amount of my time programming for KDE. But I've never told a > > user "well that settles it" because they won't code it themselves :-/ > > I certainly get a huge number of bugs/wishes that I can't/won't code > > myself, but I try to be a bit more diplomatic about it. > > > > But then the kernel mailing lists tend to be a lot more... direct... > > than the KDE mailing lists, so I guess it comes from that. Requiring > > people to have a thick skin and all that. > > This is not the kernel mailing list. In fact this list is quite a bit > friendlier and more accommodating than the kernel list. > > The remark alluded to above comes from _one_ of the git developers. And > Dscho is apparently in a rather sad mood these days. While the substance > of Dscho's remark is entirely pertinent, it would be wrong to use its > form and style as a characterization of git developers in general. Even if I were in a better mood, the whole thread has a back story on an msysGit issue, and this led me to try to stop what I feared would become a rather long mail thread without much of an outcome, such as that infamous thread about MacOSX UTF-8 filename handling. Alas, it seems that Robin is willing to work on the issues, so my fears have been totally and completely unfounded. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:24 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>>> I don't know why the git developers are being so hostile/dismissive, >>>> Are you serious? >>>> >>> ...because we'll kill you if you aren't >:-E >> I'm just flabbergasted by some people's expectations. Perhaps John >> doesn't realize the git developers are all volunteers, and that it is >> never appropriate to criticize a volunteer. A "thank you for all your >> hard work on git" would have done nicely. > > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bugs/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more... direct... > than the KDE mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. > I think much of the perceived malignancy stems from the fact that the git list has a high ratio of developer-to-luser mailings on it, being by nature a developer tool most of the time. When the unaware user appears on the list with demands rather than polite requests, they're treated that much harder. Especially by the developer who happens to be, as it were, the butt of the request. 
Personally, I've only rarely found Dscho being anything but friendly on this list, and even then, I really didn't find it offensive. If viewed in a happy mood, it matches quite nicely with a Swedish sketch whose theme is "men ja ente bitter". It's often quite funny, really :-) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 17:24 ` Andreas Ericsson @ 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 82+ messages in thread From: Miles Bader @ 2009-05-14 1:49 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell <johnflux@gmail.com> writes: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ FWIW, Johannes' use of "Well, that rather settles things, no?" in this thread didn't strike me as being rude or truly dismissive (even though it's literally so). It seemed more just a timely and to-the-point reminder that however fun it is to talk about random feature X, someone's gotta do the work if it's going to actually be implemented, and that the direction of git development very much follows the whims of those doing the actual hacking (perhaps more so than other projects). [and I don't even have particularly thick skin, I think -- I'm often very annoyed by the brusqueness one sees on many developer mailing lists...] -Miles -- Acquaintance, n. A person whom we know well enough to borrow from, but not well enough to lend to. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 2 replies; 82+ messages in thread From: Jeff King @ 2009-05-12 16:16 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > As for file names, no plans, it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross platform. Or they use a single encoding like utf8 so that there are no surprises. You can still run into normalization problems with filenames on some filesystems, though. Linus's name_hash code sets up the framework to handle "these two names are actually equivalent", but right now I think there is just code for handling case-sensitivity, not utf8 normalization (but I just skimmed the code, so I might be wrong). -Peff ^ permalink raw reply [flat|nested] 82+ messages in thread
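The equivalence framework Peff describes can be pictured with a small sketch. To be clear, this is illustrative only and is not git's actual name_hash code: the idea is to run each filename through a folding step before hashing, so that names the filesystem may treat as identical land in the same hash bucket. Here the fold is plain ASCII case-folding; the utf8-normalizing variant Peff mentions would fold each name to NFC before hashing instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Illustrative sketch, NOT git's actual name-hash code: fold every
 * byte before mixing it into an FNV-1a hash, so that two names the
 * filesystem treats as "the same" fall into the same bucket.  A
 * lookup would then confirm equivalence with a folded comparison
 * rather than a plain byte comparison.
 */
static uint32_t folded_name_hash(const char *name)
{
	uint32_t hash = 2166136261u;		/* FNV-1a offset basis */
	while (*name) {
		unsigned char c = tolower((unsigned char)*name++);
		hash = (hash ^ c) * 16777619u;	/* FNV-1a prime */
	}
	return hash;
}
```

With a fold like this, "Makefile" and "MAKEFILE" hash identically while genuinely different names still separate; swapping the `tolower()` for an NFC-normalizing fold would give the utf8 behavior discussed in this thread.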
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King @ 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:57 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git Hi, On Tue, 12 May 2009, Jeff King wrote: > On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > > > As for file names, no plans, it's a sequence of bytes, but I think a > > lot of people wind up using some subset of US-ASCII for their file > > names, especially if their project is going to be cross platform. > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). Back then I actually started on a patch to make Git capable of determining UTF-8 equivalence, but at the same time somebody started such an annoying mail thread that I stopped working on the issue completely. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin @ 2009-05-13 16:26 ` Linus Torvalds 2009-05-13 17:12 ` Linus Torvalds 1 sibling, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 16:26 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Tue, 12 May 2009, Jeff King wrote: > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But quite frankly, the index is only part of it, and probably not the worst part. The real pain of filename handling is all the "read tree recursively with readdir()" issues. Along with just an absolute sh*t-load of issues about what to do when people ended up using different versions of the "same" name in different branches. There's also the issue that "cross-platform" really can be a pretty damn big pain. What do you do for platforms that simply are pure shit? I realize that OS X people have a hard time accepting it, but OS X filesystems are generally total and utter crap - even more so than Windows. Yes, yes, you can tell OS X that case matters, but that's not the normal case - and what do you do with projects that simply _do_ care about case. The kernel is one such project. Sure, you can "encode" the filenames on such broken filesystems in a way that they'd be different - but that won't really help the project, since makefiles etc won't work anyway. So one reason I didn't bother with utf-8 is that the much more fundamental issues are simply in plain old 7-bit US-ASCII. 
That said, if the only issue is that you want to encode regular utf-8 in a coherent way (and ignore the case issues), then we could probably do that part fairly easily with a "convert_to_internal()" and "convert_to_filename()" thing that acts very much like the CRLF conversion (except on filenames, not data). And yes, it's probably worth doing, since we'd need that for fuller case support anyway. It's just a fair amount of churn - not fundamentally _hard_, but not trivial either. And it needs a _lot_ of care, and a fair amount of testing that is probably hard to do on sane filesystems (ie the case where the filesystem actually _changes_ the name is going to be hard to test on anything sane). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 16:26 ` Linus Torvalds @ 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 17:12 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But > quite frankly, the index is only part of it, and probably not the worst > part. > > The real pain of filename handling is all the "read tree recursively with > readdir()" issues. Along with just an absolute sh*t-load of issues about > what to do when people ended up using different versions of the "same" > name in different branches. Btw, if people care mainly just about OS X, and don't worry so much about case, but about the idiotic and insane OS X behavior of turning UTF-8 filenames into that crazy NFD format, here's a simple patch that may be useful for that. There _will_ certainly be other places, but this handles the one big case of "read_directory_recursive()", and can turn NFD into the sane NFC format. Since OS X will then accept NFC (and internally turn it back to NFD) when you pass them as filenames, that means that converting the other way is not necessary. NOTE NOTE NOTE! This really just handles one case, and is not enough for any kind of general case. For example, it does NOT handle the case where you do git add filename_with_åäö explicitly, because if the "filename_with_åäö" is done using NFD (tab-completion etc), now git won't _match_ it with the filename it reads using readdir() any more (which got converted to NFC), so at a minimum we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. See "get_pathspec()" in setup.c for that latter case. But with that, and this crazy thing, OS X users might be already a lot better off. Totally untested, of course. 
Oh, and somebody needs to fill in that convert_name_from_nfd_to_nfc() implementation. It's designed so that if it notices that the string is just plain US-ASCII, it can return 0 and no extra work is done. That, in turn, can easily be done by some simple and efficient pre-processing that checks that there are no high bits set (on a 64-bit platform, do it 8 characters at a time with a "& 0x8080808080808080"), so that the common case has barely any overhead at all. Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do the actual normalization if you find characters with the high bit set. And since I know that the OS X filesystems are so buggy as to not even do that whole NFD thing right, there is probably some OS X-specific "use this for filesystem names" conversion function. Hmm. Anybody want to take this on? It really shouldn't be too complex to get it working for the common case on just OS X. It's really the case sensitivity that is the biggest problem; if you ignore that for now, the problem space is _much_ smaller. In other words, I think we can reasonably easily support a subset of _common_ issues with some trivial patches like this. But getting it right in _all_ the cases is going to be much more work (there are lots of other uses of "readdir()" too, this one just happens to be one of the more central ones). Of course, it probably makes sense to have a whole "git_readdir()" that does this thing in general. That "create_full_path()" thing makes sense regardless, though, in that it also simplifies a lot of "baselen+len" usage in just "len". 
Linus

---
 dir.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
 }
 
 /*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+	char temp[256], nlen;
+	nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+	if (nlen) {
+		len = nlen;
+		d_name = temp;
+	}
+#endif
+	memcpy(fullname + baselen, d_name, len + 1);
+	return baselen + len;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		/* Ignore overly long pathnames! */
 		if (len + baselen + 8 > sizeof(fullname))
 			continue;
-		memcpy(fullname + baselen, de->d_name, len+1);
-		if (simplify_away(fullname, baselen + len, simplify))
+		len = create_full_path(fullname, baselen, de->d_name, len);
+		if (simplify_away(fullname, len, simplify))
 			continue;
 
 		dtype = DTYPE(de);
 		exclude = excluded(dir, fullname, &dtype);
 		if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
-		    && in_pathspec(fullname, baselen + len, simplify))
-			dir_add_ignored(dir, fullname, baselen + len);
+		    && in_pathspec(fullname, len, simplify))
+			dir_add_ignored(dir, fullname, len);
 
 		/*
 		 * Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		default:
 			continue;
 		case DT_DIR:
-			memcpy(fullname + baselen + len, "/", 2);
+			memcpy(fullname + len, "/", 2);
 			len++;
-			switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+			switch (treat_directory(dir, fullname, len, simplify)) {
 			case show_directory:
 				if (exclude != !!(dir->flags
 						& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 				break;
 			case recurse_into_directory:
 				contents += read_directory_recursive(dir,
-					fullname, fullname, baselen + len, 0, simplify);
+					fullname, fullname, len, 0, simplify);
 				continue;
 			case ignore_directory:
 				continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 		if (check_only)
 			goto exit_early;
 		else
-			dir_add_name(dir, fullname, baselen + len);
+			dir_add_name(dir, fullname, len);
 	}
 exit_early:
 	closedir(fdir);

^ permalink raw reply related [flat|nested] 82+ messages in thread
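The fast pre-processing step Linus suggests above (scan eight characters at a time against the mask 0x8080808080808080, and only fall into the normalization code when a high bit is found) might look roughly like this. This is a sketch with a made-up helper name, not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of the suggested fast path: return 1 only if some byte of
 * the name has its high bit set, i.e. the name is not plain US-ASCII
 * and may need NFD->NFC normalization.  The memcpy() into a uint64_t
 * keeps the 8-bytes-at-a-time test alignment-safe; a real caller
 * would invoke the normalization routine only when this returns 1.
 */
static int needs_normalization(const char *name, size_t len)
{
	while (len >= 8) {
		uint64_t chunk;
		memcpy(&chunk, name, 8);	/* 8 characters at a time */
		if (chunk & 0x8080808080808080ull)
			return 1;
		name += 8;
		len -= 8;
	}
	while (len--)
		if (*name++ & 0x80)
			return 1;
	return 0;	/* pure US-ASCII: no conversion work needed */
}
```

The common all-ASCII case thus touches each byte once with word-sized loads and never calls into any Unicode library at all.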
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds @ 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 20:57 ` Matthias Andree 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. That "create_full_path()" thing makes sense > regardless, though, in that it also simplifies a lot of "baselen+len" > usage in just "len". > In a flash of premonitory insight, libgit2 has gitfo_foreach_dirent(path, callback) which would probably be well suited for this kind of thing. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson @ 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 20:57 ` Matthias Andree 2 siblings, 1 reply; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 17:46 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. Actually, the more I think about that, the less true I think it is. It _sounds_ like a nice simplification ("just do it once in readdir, and forget about it everywhere else"), but it's in fact a stupid thing to do. Why? If we _ever_ want to fix this in the general case, then the code that does the readdir() will actually have to remember both the "raw filesystem" form _and_ the "cleaned-up utf-8 form". Why? Because when we do readdir(), we'll also do 'lstat()' on the end result to check the types, and opendir() in case it's a directory and we then want to do things recursively etc. And that happens to work on OS X (because we can use our "fixed" filename for lstat too), but it does not work in the general case. And you can say "well, just do the stat inside the wrapped readdir()", but that doesn't work _either_, since - we don't want to do the lstat() if it's unnecessary. Even if we don't have "de->d_type" information, we can often avoid the need for it, if we can tell that the name isn't interesting (due to being ignored). Avoiding the lstat is a huge performance issue for cold-cache cases. It's basically a seek. So we really want to do the lstat() later, which implies that the caller needs to know _both_ the original "real" filesystem name _and_ the converted one. 
- it doesn't handle the opendir() case anyway - so the end result is that a real implementation will _always_ need to carry around both the "filesystem view" filename _and_ the "what we've converted it into". Now, the point of the patch I sent out was that for the specific case of OS X, which does UTF-8 conversions (wrong) but also is happy to get our properly normalized name, we don't care. So my patch is "correct" for that special case - and so would a plain readdir() wrapper be. But my patch is _also_ correct for the case where a readdir() wrapper would do the wrong thing. My patch doesn't _handle_ it (since it doesn't change the code to pass both "filesystem view" and "cleaned-up view" pathnames), but the patch I sent out also doesn't make it any harder to do right. In contrast, doing a readdir() wrapper makes it much harder to do right later, because it's just doing the conversion at the wrong level (you could make that "wrapper" return both the original and the fixed filename, but at that point the wrapper doesn't really help - you might as well just have the "convert" function, and it would be a hell of a lot more obvious what is really going on). So I take it back. A readdir() wrapper is not a good idea. It gets us a tiny bit of the way, but it would actually take us a step back from the "real" solution. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 18:37 ` Linus Torvalds 0 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2009-05-13 18:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 7:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I take it back. A readdir() wrapper is not a good idea. It gets us a > tiny bit of the way, but it would actually take us a step back from the > "real" solution. Do we need to take the real solution to the core of git? What I am wondering is whether we can keep this simple in git internals and catch problem filenames at git-add time. This would allow git to keep treating filenames as a bag of bytes, and it does a better thing for users. In cross platform projects, most users don't even know that there are problems, and even if they do, they don't know what the problems are. If git add can be told to warn & refuse to add a path with portability problems, then we educate our users, prevent them from committing filenames that will later cause trouble to others in their projects, etc. from-the-keep-it-simple-and-informative-dept, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
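The git-add-time check Martin proposes could be prototyped along these lines. Everything here is hypothetical (the helper name and the exact rule set are not part of git); it only illustrates the kind of portability warning such a check could raise:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of a "git add"-time portability check: flag
 * paths that are likely to misbehave on some platform.  It rejects
 * non-ASCII bytes (whose normalization differs across filesystems)
 * and characters Windows forbids in filenames.  A real check would
 * also cover reserved device names (CON, NUL, ...), trailing dots
 * and spaces, and case-only collisions with already-tracked paths.
 */
static int portability_problem(const char *path)
{
	for (; *path; path++) {
		unsigned char c = *path;
		if (c & 0x80)
			return 1;	/* non-ASCII: normalization hazards */
		if (strchr("<>:\"\\|?*", c))
			return 1;	/* forbidden on Windows */
	}
	return 0;
}
```

A warn-and-refuse mode built on something like this would educate users at the moment they create the problem filename, exactly as suggested above, while leaving git's internal bag-of-bytes model untouched.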
* Re: Cross-Platform Version Control 2009-05-13 18:26 ` Martin Langhoff @ 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:08 ` Daniel Barkalow 0 siblings, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 18:37 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Martin Langhoff wrote: > > Do we need to take the real solution to the core of git? Well, I suspect that if we really want to support it, then we'd better. > What I am wondering is whether we can keep this simple in git > internals and catch problem filenames at git-add time. I can almost guarantee that it will just cause more problems than it solves, and generate some nasty cases that just aren't solvable. Because it really isn't just "git add". It's every single thing that does a lstat() on a filename inside of git. Now, the simple OS X case is not a huge problem, since the lstat will succeed with the fixed-up filename too. But as mentioned, the OS X case is the thing that doesn't need a lot of infrastructure _anyway_ - I can almost guarantee that my posted patch (with the added setup.c stuff for get_pathspec()) is going to be _fewer_ lines than some wrapper logic. Note: in all of the above, I assume that people care more about just plain UTF characters (and the insane NFD form OS X uses) than about worrying about the _really_ subtle issues of case-independence. Those are a major pain, but they will need even more "internal" support, because there simply isn't any sane wrapping method. (You could wrap everything to force lower-casing of all filesystem ops or something, but that would not be acceptable to any sane environment. So in reality you need to accept mixed-case things, and then there is no way to know from the "outside" whether one external mixed-case thing matches some internal index mixed-case thing). Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds @ 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:20 ` Linus Torvalds 2009-05-13 21:08 ` Daniel Barkalow 1 sibling, 1 reply; 82+ messages in thread From: Theodore Tso @ 2009-05-13 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 11:37:28AM -0700, Linus Torvalds wrote: > Note: in all of the above, I assume that people care more about just plain > UTF characters (and the insane NFD form OS X uses) than about worrying > about the _really_ subtle issues of case-independence. Those are a major > pain, but they will need even more "internal" support, because there > simply isn't any sane wrapping method. Stupid question --- if we get something that works for Windows and MacOS X, is there any reason why we need to solve the general problem of case-insensitive filesystems? It's really backwards compatibility with legacy OSes that's most important, right? Are there any other systems other than Windows and Mac OS X which (a) perpetrate case insensitivity on application programmers, and (b) which current or future git users are likely to care about? - Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:20 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:20 UTC (permalink / raw) To: Theodore Tso Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Theodore Tso wrote: > > Stupid question --- if we get something that works for Windows and > MacOS X, is there any reason why we need to solve the general problem > of case-insensitive filesystems? Quite frankly, I don't think we're even very close to getting anything that works for Windows or OS X. Case-insensitivity is _hard_. The "easy" case is to just handle the OS X crazy pseudo-NFD format, and at least turn that into NFC (and perhaps add a config option to do latin1 and EUC-JP to utf-8 too). At that point, we at least handle regular utf-8 the same way. Doing the latin1/EUC-JP thing would actually to some degree be more interesting than the OS X NFD case, because that really does require two-way conversion, and we can "test" that even on sane filesystems (ie play at having a Latin1 filesystem). That said, I suspect there aren't that many people who care about latin1 filesystems. I dunno about EUC-JP (and variants - for all I know, shift-JIS and other cases may be the more common ones). Of course, if we do everything right, maybe the Windows people would actually like us to keep the filesystem-native representation in UTF-16LE or whatever the crazy format is that Windows really uses deep down. My point being that all of these things happen even without the added worry about case. And in many ways, not worrying about case should probably be the first step. We do have some support for worrying about case, but trying to solve both things at the same time isn't going to be workable, I suspect. 
Case insensitivity should never ever involve a _conversion_ (if it does, you get all kinds of crazy behavior), it's just purely a _comparison_ issue, so the two really are fundamentally different. Of course, the reason OS-X seems to be so messed up is exactly that the morons at Apple didn't understand the difference between conversion and comparison, and mixed them up. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:08 ` Daniel Barkalow 2009-05-13 21:29 ` Linus Torvalds 1 sibling, 1 reply; 82+ messages in thread From: Daniel Barkalow @ 2009-05-13 21:08 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > On Wed, 13 May 2009, Martin Langhoff wrote: > > > > Do we need to take the real solution to the core of git? > > Well, I suspect that if we really want to support it, then we'd better. > > > What I am wondering is whether we can keep this simple in git > > internals and catch problem filenames at git-add time. > > I can almost guarantee that it will just cause more problems than it > solves, and generate some nasty cases that just aren't solvable. > > Because it really isn't just "git add". It's every single thing that does > a lstat() on a filename inside of git. > > Now, the simple OS X case is not a huge problem, since the lstat will > succeed with the fixed-up filename too. I'm not seeing what the general case is, and how it could possibly behave. There's the "insensitive" behavior: if you create "foo" and look for "FOO", it's there, but readdir() reports "foo". There's the "converting" behavior: if you create "foo", readdir() reports "FOO", but lstat("foo") returns it. The obvious general case is: if you create "foo", readdir() reports "FOO", and lstat("foo") doesn't find a match. But if you create "foo" again... it doesn't find "foo", so it creates a new file, which it also calls "FOO", and the filesystem now has two files with identical names? It seems to me that the limits of minimally functional, non-inode-losing filesystems are: lstat() might take a filename and return the data for a non-byte-identical filename; open(name, O_CREAT|O_EXCL) might replace the given name with a non-byte-identical filename. 
But surely open(name) and lstat(name) (with the same name) must find the same file, even if readdir() would report it with a different name. And I assume that a filesystem that rejected any non-NFD filenames or any non-NFC filenames would be totally unusable, in that users will manage to get unnormalized filenames into programs and find that the filesystem just doesn't work. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:08 ` Daniel Barkalow @ 2009-05-13 21:29 ` Linus Torvalds 0 siblings, 0 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:29 UTC (permalink / raw) To: Daniel Barkalow Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Daniel Barkalow wrote: > > > > Now, the simple OS X case is not a huge problem, since the lstat will > > succeed with the fixed-up filename too. > > I'm not seeing what the general case is, and how it could possibly behave. Here's a simple example. Let's say that your company uses Latin1 internally for your filesystems, because your tools really aren't utf-8 ready. This is NOT AT ALL unnatural - it's how lots of people used to work with Linux over the years, and it's largely how people still use FAT, I suspect (except it's not latin1, it's some windows-specific 8-bits-per-character mapping). IOW, if you have a file called 'åäö', it literally is encoded as '\xe5\xe4\xf6' (if you wonder why I picked those three letters, it's because they are the regular extra letters in Swedish - Swedish has 29 letters in its alphabet, and those three letters really are letters in their own right, they are NOT 'a' and 'o' with some dots/rings on top). IOW, if you open such a file, you need to use those three bytes. Now, even if you happen to have an OS and use Latin1 on disk, you may realize that you'd like to interact with others that use UTF-8, and would want to have your git archive that you export use nice portable UTF-8. But you absolutely MUST NOT just do a conversion at "readdir()" time. If you do that, then your three-byte filename turns into a six-byte utf-8 sequence of '\xc3\xa5\xc3\xa4\xc3\xb6' and the thing is, now "lstat()" won't work on that sequence. So obviously you could always turn things _back_ for lstat(), but quite frankly, that's (a) insane (b) incompetent and (c) not even always well-defined. 
> There's the "insensitive" behavior: if you create "foo" and look for > "FOO", it's there, but readdir() reports "foo". > > There's the "converting" behavior: if you create "foo", readdir() reports > "FOO", but lstat("foo") returns it. Then there's the behaviour above: you want your git repository to have utf-8, but your filesystem doesn't convert anything at all, and all your regular tools (think editors etc) are all Latin1. Latin1 is going away, I hope, but I bet EUC-JP etc still exist. Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 20:57 ` Matthias Andree 2009-05-13 21:10 ` Linus Torvalds 2 siblings, 1 reply; 82+ messages in thread From: Matthias Andree @ 2009-05-13 20:57 UTC (permalink / raw) To: Linus Torvalds, Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On 13.05.2009 19:12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to > do the actual normalization if you find characters with the high bit > set. And since I know that the OS X filesystems are so buggy as to not > even do that whole NFD thing right, there is probably some OS-X specific > "use this for > filesystem names" conversion function. Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, rather than canonical, normalization) for anything except normalizing temporary variables inside strcasecmp(3) or similar. Probably not even that. The normalizations done are often irreversible and also surprising. You don't want to turn 2³.c into 23.c, do you? -- Matthias Andree ^ permalink raw reply [flat|nested] 82+ messages in thread
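Matthias's 2³.c example is reproducible with Python's `unicodedata` (a sketch showing why NFKC is the wrong tool here):

```python
import unicodedata

name = '2\N{SUPERSCRIPT THREE}.c'   # '2³.c'

# NFKC applies compatibility mappings and folds the superscript away:
assert unicodedata.normalize('NFKC', name) == '23.c'

# Plain NFC is canonical-only and leaves the filename alone:
assert unicodedata.normalize('NFC', name) == name
```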
* Re: Cross-Platform Version Control 2009-05-13 20:57 ` Matthias Andree @ 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 0 siblings, 2 replies; 82+ messages in thread From: Linus Torvalds @ 2009-05-13 21:10 UTC (permalink / raw) To: Matthias Andree; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Matthias Andree wrote: > Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds > <torvalds@linux-foundation.org>: > > > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do > > the actual normalization if you find characters with the high bit set. And > > since I know that the OS X filesystems are so buggy as to not even do that > > whole NFD thing right, there is probably some OS-X specific "use this for > > filesystem names" conversion function. > > Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, > rather than canonical, normalization) for anything except normalizing > temporary variables inside strcasecmp(3) or similar. Probably not even that. > The normalizations done are often irreversible and also surprising. You don't > want to turn 2³.c into 23.c, do you? No, you're right. We want just plain NFC. I just googled for how some other projects handled this, and found the stringprep thing in a post about rsync, and didn't look any closer. But yes, you're absolutely right, stringprep is total crap, and nfkc is horrible. I have no idea of what library to use, though. For perl, there's Unicode::Normalize, but that's likely still subtly incorrect for the OS-X case due to the filesystem not using _strict_ NFD. I have this dim memory of somebody actually pointing to the documentation of exactly which characters OS X ends up decomposing. Maybe we could just do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the OS X case is the only one we need to care about? 
Linus ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds @ 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 82+ messages in thread From: Jay Soffian @ 2009-05-13 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Matthias Andree, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 5:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > I have this dim memory of somebody actually pointing to the documentation > of exactly which characters OS X ends up decomposing. http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties http://developer.apple.com/technotes/tn/tn1150table.html j. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian @ 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 82+ messages in thread From: Matthias Andree @ 2009-05-13 21:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On 13.05.2009 23:10, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 13 May 2009, Matthias Andree wrote: > >> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds >> <torvalds@linux-foundation.org>: >> >> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something >> to do >> > the actual normalization if you find characters with the high bit >> set. And >> > since I know that the OS X filesystems are so buggy as to not even do >> that >> > whole NFD thing right, there is probably some OS-X specific "use this >> for >> > filesystem names" conversion function. >> >> Sorry for interrupting, but NF_K_C? You don't want that (K for >> compatibility, >> rather than canonical, normalization) for anything except normalizing >> temporary variables inside strcasecmp(3) or similar. Probably not even >> that. >> The normalizations done are often irreversible and also surprising. You >> don't >> want to turn 2³.c into 23.c, do you? > > No, you're right. We want just plain NFC. I just googled for how some > other projects handled this, and found the stringprep thing in a post > about rsync, and didn't look any closer. > > But yes, you're absolutely right, stringprep is total crap, and nfkc is > horrible. Crap? It's just beside the purpose, and some limited form of fuzzy match. Anyways... > I have no idea of what library to use, though. For perl, there's > Unicode::Normalize, but that's likely still subtly incorrect for the OS-X > case due to the filesystem not using _strict_ NFD. Perhaps ICU (ICU4C), from http://site.icu-project.org/ -- Matthias Andree ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 18:28 ` Dmitry Potapov 2009-05-12 18:40 ` Martin Langhoff 2009-05-14 13:48 ` Peter Krefting 2 siblings, 1 reply; 82+ messages in thread From: Dmitry Potapov @ 2009-05-12 18:28 UTC (permalink / raw) To: Esko Luontola; +Cc: git On Tue, May 12, 2009 at 06:06:05PM +0300, Esko Luontola wrote: > A good start for making Git cross-platform, would be storing the text > encoding of every file name and commit message together with the commit. > Currently, because Git is oblivious to the encodings and just considers > them as a series of bytes, there is no way to make them cross-platform. 1. Git already stores the encoding for all commit messages that are not in UTF-8. 2. If you really want to be cross-platform portable, you should not use any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable Filename Character Set) http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
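The Portable Filename Character Set Dmitry references is small enough that a conformance check is one regular expression (a hypothetical helper; the function name is illustrative):

```python
import re

# POSIX Portable Filename Character Set: letters, digits, '.', '_', '-'.
_PORTABLE = re.compile(r'\A[A-Za-z0-9._-]+\Z')

def is_portable(name: str) -> bool:
    """True if a single path component stays inside [A-Za-z0-9._-]."""
    return bool(_PORTABLE.match(name))

assert is_portable('Makefile')
assert is_portable('read-me_v2.txt')
assert not is_portable('\u00e5\u00e4\u00f6.c')   # 'åäö.c'
assert not is_portable('has space.txt')
```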
* Re: Cross-Platform Version Control 2009-05-12 18:28 ` Dmitry Potapov @ 2009-05-12 18:40 ` Martin Langhoff 2009-05-12 18:55 ` Jakub Narebski 0 siblings, 1 reply; 82+ messages in thread From: Martin Langhoff @ 2009-05-12 18:40 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > 2. If you really want to be cross-platform portable, you should not use > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > Filename Character Set) > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Would it make sense to have warnings at 'git add' time about - filenames outside of that charset (as the strictest mode, perhaps even default) - filenames that have a potential conflict wrt case-sensitivity - filenames that have potential conflict in the same tree due to utf-8 encoding vagaries MHO is that a strict "start your project portable from day one" mode is best as a default. But I'd be happy with any default, actually ;-) m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 82+ messages in thread
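Martin's three proposed 'git add'-time warnings can be sketched as a single pass over the candidate names. This is a hypothetical checker; the function name, messages, and the choice of NFC-plus-casefold as the collision key are mine, not git's:

```python
import re
import unicodedata

_PORTABLE = re.compile(r'\A[A-Za-z0-9._-]+\Z')

def filename_warnings(names):
    """Hypothetical checks for the three cases Martin lists."""
    warnings = []
    seen = {}  # case-folded, NFC-normalized spelling -> first original name
    for name in names:
        # Case 1: outside the portable character set.
        if not _PORTABLE.match(name):
            warnings.append('%s: outside the portable character set' % name)
        # Cases 2 and 3: collisions on case-insensitive or normalizing
        # filesystems (e.g. NFC vs NFD spellings of the same name).
        key = unicodedata.normalize('NFC', name).casefold()
        if key in seen and seen[key] != name:
            warnings.append('%s: may collide with %s on case-insensitive '
                            'or normalizing filesystems' % (name, seen[key]))
        else:
            seen.setdefault(key, name)
    return warnings

assert filename_warnings(['README', 'readme']) != []
assert filename_warnings(['main.c', 'util.c']) == []
```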
* Re: Cross-Platform Version Control 2009-05-12 18:40 ` Martin Langhoff @ 2009-05-12 18:55 ` Jakub Narebski 0 siblings, 0 replies; 82+ messages in thread From: Jakub Narebski @ 2009-05-12 18:55 UTC (permalink / raw) To: Martin Langhoff; +Cc: Dmitry Potapov, Esko Luontola, git Martin Langhoff <martin.langhoff@gmail.com> writes: > On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > > 2. If you really want to be cross-platform portable, you should not use > > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > > Filename Character Set) > > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 > > Would it make sense to have warnings at 'git add' time about > > - filenames outside of that charset (as the strictest mode, perhaps > even default) > - filenames that have a potential conflict wrt case-sensitivity > - filenames that have potential conflict in the same tree due to > utf-8 encoding vagaries > > MHO is that a strict "start your project portable from day one" mode > is best as a default. But I'd be happy with any default, actually ;-) Somebody asked for a pre-add hook in the past; it would be a good place to put such a check. But in the meantime you can do it using a pre-commit hook instead, can't you? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 18:28 ` Dmitry Potapov @ 2009-05-14 13:48 ` Peter Krefting 2009-05-14 19:58 ` Esko Luontola 2 siblings, 1 reply; 82+ messages in thread From: Peter Krefting @ 2009-05-14 13:48 UTC (permalink / raw) To: Esko Luontola; +Cc: git Esko Luontola: > A good start for making Git cross-platform, would be storing the text > encoding of every file name and commit message together with the commit. Is it really necessary to store the encoding for every single file name, should it not be enough to just store encoding information for all file names at once (i.e., for the object that contains the list of file names and their associated blobs)? I did publish, as a request for comments, the beginnings of a patch that would change the Windows version of Git to expect file names to be UTF-8 encoded. There were some comments about it, especially that I could not just assume that UTF-8 was the right thing to assume. Perhaps if we added some meta-data, maybe using the same fall-back mechanism as for commit messages (i.e., assume UTF-8 unless otherwise specified), it would be easier to do. On Windows, the file APIs allow you to use Unicode (UTF-16) to specify file names, and the file systems will handle any necessary conversion to whatever byte sequences are used to store the file names. UTF-16 and UTF-8 are trivial to convert between, and Windows does contain APIs to convert between other character encodings and UTF-16. On Mac OS X, I believe the file system APIs assume you use some kind of normalized UTF-8. That should also be possible to create, possibly converting back and forth between different normalization forms, if necessary. On Linux and other Unixes we could just use iconv() to convert from the repository file name encoding to whatever the current locale has set up. The trick here is to handle file names outside the current encoding. 
Some kind of escaping mechanism will probably need to be introduced. The best way would be to define this in the Git core once and for all, and add support to it for all the platforms in the same go, instead of trying to hack around the issue whenever it pops up on the various platforms. My main use-case for Git on Windows has disappeared as my $dayjob went bankrupt, but I am happy to assist with whatever insight I may be able to bring. -- \\// Peter - http://www.softwolves.pp.se/ ^ permalink raw reply [flat|nested] 82+ messages in thread
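The UTF-8/UTF-16 conversion Peter calls trivial really is mechanical, as a Python sketch shows (git itself would do this in C against the Windows wide-character APIs):

```python
name = '\u00e5\u00e4\u00f6.c'      # 'åäö.c'

utf8 = name.encode('utf-8')        # what a git tree object could store
utf16 = name.encode('utf-16-le')   # what the Windows wide-char APIs take

# The round trip is lossless in both directions:
assert utf8.decode('utf-8') == name
assert utf16.decode('utf-16-le') == name
assert utf8.decode('utf-8') == utf16.decode('utf-16-le')
```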
* Re: Cross-Platform Version Control 2009-05-14 13:48 ` Peter Krefting @ 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Esko Luontola @ 2009-05-14 19:58 UTC (permalink / raw) To: Peter Krefting; +Cc: git Peter Krefting wrote on 14.5.2009 16:48: > Is it really necessary to store the encoding for every single file name, > should it not be enough to just store encoding information for all file > names at once (i.e., for the object that contains the list of file names > and their associated blobs)? What about if some disorganized project has people committing with many different encodings? Should we allow it, that a directory has the names of some files using one encoding, and the names of other files using another encoding? Or should we force the whole repository to use the same encoding? > The best way would be to define this in the Git core once and for all, > and add support to it for all the platforms in the same go, instead of > trying to hack around the issue whenever it pops up on the various > platforms. +1 -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola @ 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Andreas Ericsson @ 2009-05-14 20:21 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git Esko Luontola wrote: > Peter Krefting wrote on 14.5.2009 16:48: >> Is it really necessary to store the encoding for every single file >> name, should it not be enough to just store encoding information for >> all file names at once (i.e., for the object that contains the list of >> file names and their associated blobs)? > > What about if some disorganized project has people committing with many > different encodings? Should we allow it, that a directory has the names > of some files using one encoding, and the names of other files using > another encoding? Or should we force the whole repository to use the > same encoding? > If encodings are on a per-tree basis, we could add a special mode-flag for it without breaking backward compatibility (I think, anyways). Older gits just won't know how to handle it and will treat it as a byte-stream. >> The best way would be to define this in the Git core once and for all, >> and add support to it for all the platforms in the same go, instead of >> trying to hack around the issue whenever it pops up on the various >> platforms. > > +1 > There's still the problem that no one's stepped forward to do all that work yet, so apparently this isn't important enough for people to put their patches where their mouths are. Often when issues generate long discussions and no code, it's of high academic interest and of little real-world value. I believe the "little real-world value" here comes from the fact that cross-platform projects often enforce 7-bit ascii compatible filenames from the start, because they *know* they may run into problems on other filesystems otherwise. 
Remember it's not only git that has to get things right. It's also build-systems and compilers that have to locate the correct files (the Makefile and the filesystem may use different encodings), so in the real world, people really do stay away from filenames with åäö or other non-ascii chars in them. It's fun to discuss, but I won't spend any time on it. Good luck to those who do though. I'd quite like to see if someone could pull it off without breaking backwards compatibility or impacting performance too much. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson @ 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Johannes Schindelin @ 2009-05-14 22:25 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git Hi, On Thu, 14 May 2009, Esko Luontola wrote: > Peter Krefting wrote on 14.5.2009 16:48: > > > The best way would be to define this in the Git core once and for all, > > and add support to it for all the platforms in the same go, instead of > > trying to hack around the issue whenever it pops up on the various > > platforms. > > +1 You might be enthusiastic about this cunning idea. However, if it costs me performance on Linux, and all the benefits go to Windows users, then I will remove this "solution" from my personal Git tree _right away_, and I'd expect a lot of other people, too. I repeat this just once more: if you add complexity, you'll have to have a compelling reason to do so. If there is no benefit for Linux users, why should they bear the cost? But as Andreas remarked, I sincerely think that there has been enough talk about the issue. It's time to see some patches, or to stop the discussion. Ciao, Dscho ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: Cross-Platform Version Control 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin @ 2009-05-15 11:18 ` Dmitry Potapov 2 siblings, 0 replies; 82+ messages in thread From: Dmitry Potapov @ 2009-05-15 11:18 UTC (permalink / raw) To: Esko Luontola; +Cc: Peter Krefting, git On Thu, May 14, 2009 at 10:58:17PM +0300, Esko Luontola wrote: > > What about if some disorganized project has people committing with many > different encodings? Should we allow it, that a directory has the names > of some files using one encoding, and the names of other files using > another encoding? Or should we force the whole repository to use the > same encoding? The whole repository should have the same encoding internally. Anything else will be too complex and too slow... Have you seen any file system where file names would be stored in different encodings? And Git does far more operations on file names than a file system does. So, it is clear to me that the whole repository should have a single encoding. Now, I don't think that you will find many open source projects that use non-ASCII in file names. Moreover, most Linux users either use UTF-8 already or will switch to it in the near future. Mac OS X uses UTF-8 (though there is a problem with decomposed characters, but Linus posted a possible solution). So, the only platform where non-ASCII characters may be interesting to Git users and that does not support UTF-8 is Windows. AFAIK, Cygwin 1.7 has UTF-8 support. So, it is mostly a problem for msysGit... Though adding support for legacy encodings can help to some degree, it means that every system call involving a file name will go through UTF-8 <-> LEGACY_ENC <-> UTF-16LE conversion. IMHO, having a legacy encoding involved is far from the best possible solution; but to avoid that, you need to change MSYS to be able to work with UTF-8. (I have never looked at MSYS myself, but I suspect it may not be easy). 
Dmitry ^ permalink raw reply [flat|nested] 82+ messages in thread
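The UTF-8 <-> LEGACY_ENC <-> UTF-16LE chain Dmitry describes is lossy whenever a name falls outside the legacy codepage. A sketch, with cp1252 standing in for the legacy encoding (my assumption, purely for illustration):

```python
name = '\u03b1\u03b2\u03b3.c'       # 'αβγ.c' -- fine in UTF-8 and UTF-16

# The UTF-8 and UTF-16LE ends of the chain round-trip losslessly:
assert name.encode('utf-8').decode('utf-8') == name
assert name.encode('utf-16-le').decode('utf-16-le') == name

# ...but the legacy hop in the middle cannot represent the name at all:
try:
    name.encode('cp1252')           # hypothetical LEGACY_ENC step
    representable = True
except UnicodeEncodeError:
    representable = False

assert not representable            # the name cannot survive the legacy hop
```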
end of thread, other threads:[~2009-05-15 11:20 UTC | newest] Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-04-27 8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff 2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-28 21:00 ` Robin Rosenberg 2009-04-29 6:55 ` Martin Langhoff 2009-04-29 7:21 ` Jeff King 2009-04-29 20:05 ` Markus Heidelberg 2009-04-29 7:52 ` Cross-Platform Version Control Jakub Narebski 2009-04-29 8:25 ` Martin Langhoff 2009-04-28 18:16 ` Eric Sink's blog - notes on git, dscms and a "whole product" approach Jakub Narebski 2009-04-29 7:54 ` Sitaram Chamarty 2009-04-30 12:17 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski 2009-04-30 12:56 ` Michael Witten 2009-04-30 15:28 ` Why Git is so fast Jakub Narebski 2009-04-30 18:52 ` Shawn O. Pearce 2009-04-30 20:36 ` Kjetil Barvik 2009-04-30 20:40 ` Shawn O. Pearce 2009-04-30 21:36 ` Kjetil Barvik 2009-05-01 0:23 ` Steven Noonan 2009-05-01 1:25 ` James Pickens 2009-05-01 9:19 ` Kjetil Barvik 2009-05-01 9:34 ` Mike Hommey 2009-05-01 9:42 ` Kjetil Barvik 2009-05-01 17:42 ` Tony Finch 2009-05-01 5:24 ` Dmitry Potapov 2009-05-01 9:42 ` Mike Hommey 2009-05-01 10:46 ` Dmitry Potapov 2009-04-30 18:43 ` Why Git is so fast (was: Re: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Shawn O. 
Pearce 2009-04-30 14:22 ` Jeff King 2009-05-01 18:43 ` Linus Torvalds 2009-05-01 19:08 ` Jeff King 2009-05-01 19:13 ` david 2009-05-01 19:32 ` Nicolas Pitre 2009-05-01 21:17 ` Daniel Barkalow 2009-05-01 21:37 ` Linus Torvalds 2009-05-01 22:11 ` david 2009-04-30 18:56 ` Nicolas Pitre 2009-04-30 19:16 ` Alex Riesen 2009-05-04 8:01 ` Why Git is so fast Andreas Ericsson 2009-04-30 19:33 ` Jakub Narebski 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 17:56 ` Esko Luontola 2009-05-12 20:38 ` Johannes Schindelin 2009-05-12 21:16 ` Esko Luontola 2009-05-13 0:23 ` Johannes Schindelin 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> 2009-05-13 10:41 ` John Tapsell 2009-05-13 13:42 ` Jay Soffian 2009-05-13 13:44 ` Alex Riesen 2009-05-13 13:50 ` Jay Soffian 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 16:22 ` Johannes Schindelin 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:20 ` Linus Torvalds 2009-05-13 21:08 ` Daniel Barkalow 2009-05-13 21:29 ` Linus Torvalds 2009-05-13 20:57 ` Matthias Andree 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 2009-05-12 18:28 ` Dmitry Potapov 2009-05-12 18:40 ` Martin Langhoff 2009-05-12 18:55 ` Jakub Narebski 2009-05-14 13:48 ` Peter Krefting 2009-05-14 19:58 ` Esko Luontola 2009-05-14 20:21 ` Andreas Ericsson 2009-05-14 22:25 ` Johannes Schindelin 2009-05-15 11:18 ` Dmitry Potapov