* Re: Cross-Platform Version Control @ 2009-05-12 15:06 Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Esko Luontola @ 2009-05-12 15:06 UTC (permalink / raw) To: git A good start for making Git cross-platform, would be storing the text encoding of every file name and commit message together with the commit. Currently, because Git is oblivious to the encodings and just considers them as a series of bytes, there is no way to make them cross-platform. It's as http://www.joelonsoftware.com/articles/Unicode.html says, "It does not make sense to have a string without knowing what encoding it uses." Without explicit encoding information, making a system that works even on the three main platforms, let alone in all countries and languages, is simply not possible. On the other hand, if the encoding is explicitly stated in the repository, then it is possible for platform and locale aware Git clients to handle the file names and commit messages in whatever way makes most sense for the platform (for example convert the file names to the platform's encoding, if it differs from the committer's platform encoding). Then it would also be possible to create a Mac version of Git, which compensates for Mac OS X's file system's file name encoding peculiarities. Also the system could then warn (on "git add") if the data does not look like it has been encoded with the said encoding. If the platform's and the repository's encoding happen to be the same (which in reality might be possible only inside a small company where everybody is forced to use the same OS and is configured by a single sysadmin), then no conversions need to be done. 
Also Git purists, who think that the byte sequence representing a file name is more important than the human readable version of the file name, may use some configuration switch that disables all conversions - but even then the current encoding should be stored together with the commit. Are there any plans on storing the encoding information of file names and commit messages in the Git repository? How much time would implementing it take? Any ideas on how to maintain backwards compatibility (for old commits that do not have the encoding information)? - Esko ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola @ 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 16:16 ` Jeff King 2009-05-12 18:28 ` Dmitry Potapov 2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting 2 siblings, 2 replies; 59+ messages in thread From: Shawn O. Pearce @ 2009-05-12 15:14 UTC (permalink / raw) To: Esko Luontola; +Cc: git Esko Luontola <esko.luontola@gmail.com> wrote: > Are there any plans on storing the encoding information of file names > and commit messages in the Git repository? Commit messages already store their encoding in an optional "encoding" header if the message isn't stored in UTF-8, or US-ASCII, which is a strict subset of UTF-8. As for file names, no plans, it's a sequence of bytes, but I think a lot of people wind up using some subset of US-ASCII for their file names, especially if their project is going to be cross platform. -- Shawn. ^ permalink raw reply [flat|nested] 59+ messages in thread
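Editorial sketch of the optional "encoding" header Shawn describes. It assumes the usual layout of a raw commit object body (header lines, a blank line, then the message); the function name `decode_commit_message` and the sample commit are made up for illustration and are not part of Git.

```python
def decode_commit_message(raw_commit: bytes) -> str:
    """Decode the message of a raw commit object body.

    Header lines come first, then a blank line, then the message.
    An optional "encoding" header names the message encoding; when it
    is absent the message is treated as UTF-8.
    """
    headers, _, message = raw_commit.partition(b"\n\n")
    encoding = "utf-8"
    for line in headers.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii")
    # errors="replace" keeps this sketch from raising on malformed input.
    return message.decode(encoding, errors="replace")


# Hypothetical commit whose message was written in ISO-8859-1.
sample = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1242140760 +0000\n"
    b"committer A U Thor <author@example.com> 1242140760 +0000\n"
    b"encoding ISO-8859-1\n"
    b"\n"
    b"Lis\xe4\xe4 tiedosto\n"
)
print(decode_commit_message(sample))
```

Nothing comparable exists for file names in tree objects, which is the gap the rest of the thread is about.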
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 16:13 ` Johannes Schindelin 2009-05-12 17:56 ` Esko Luontola 2009-05-12 16:16 ` Jeff King 1 sibling, 1 reply; 59+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:13 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git Hi, On Tue, 12 May 2009, Shawn O. Pearce wrote: > Esko Luontola <esko.luontola@gmail.com> wrote: > > Are there any plans on storing the encoding information of file names > > and commit messages in the Git repository? > > Commit messages already store their encoding in an optional "encoding" > header if the message isn't stored in UTF-8, or US-ASCII, which is a > strict subset of UTF-8. > > As for file names, no plans, it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross platform. Some context: this issue cropped up in msysGit, of course. As to storing all file names in UTF-8, my point about Unicode being not necessarily appropriate for everyone still stands. UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO we should not take away the freedom for everybody to decide what they want their file names to be encoded as. However, I see that there might be a need to be able to encode the file names differently, such as on Windows. IMHO the best solution would be a config variable controlling the reencoding of file names. For some time, it looked as if two people were interested in implementing something like that (Peter and Robin IIRC), but efforts have stalled. Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 17:56 ` Esko Luontola 2009-05-12 20:38 ` Johannes Schindelin 0 siblings, 1 reply; 59+ messages in thread From: Esko Luontola @ 2009-05-12 17:56 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Shawn O. Pearce, git On 12.5.2009, at 19:13, Johannes Schindelin wrote: > As to storing all file names in UTF-8, my point about Unicode being not > necessarily appropriate for everyone still stands. > > UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO > we should not take away the freedom for everybody to decide what they > want their file names to be encoded as. > > However, I see that there might be a need to be able to encode the file > names differently, such as on Windows. IMHO the best solution would be > a config variable controlling the reencoding of file names. Exactly. The system should not force the use of a specific encoding. It should only offer a recommendation, but be also fully compatible if the user uses some other encoding. That's why it's best to always store the information about what encoding was used. It shouldn't matter, whether the data is encoded with ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it is explicitly said that what the encoding is. Then the reader of the data can best decide, how to show that data on the current platform. A config variable for defining, that what encoding should be used when committing the file names, would make sense. Git should also try to autodetect, that what encoding is used in its current environment. In the case of UTF-8, you should also be able to specify which normalization form is used (http://www.unicode.org/unicode/reports/tr15/), or whether it is normalized at all. 
For example, it should be possible to configure Git so, that when a file is checked out on Mac, its file name is converted to the current file system's encoding (UTF-8 NFD, I think), and when the file is committed on Mac, the file name is normalized back to the same UTF-8 form as is used on Linux (UTF-8 NFC). It would be nice to have config variables for saying, that all file names in this repository must use UTF-8 NFC, and all commit messages must use UTF-8 NFC (with Unix newlines). Then the Git client would autodetect the current environment's encoding, and convert the text, if necessary, to match the repository's encoding. - Esko ^ permalink raw reply [flat|nested] 59+ messages in thread
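Editorial sketch of the NFC/NFD round trip Esko describes, using Python's standard `unicodedata` module. The helper names are made up for illustration, and plain NFD is only a stand-in for what Mac OS X actually stores (Esko himself is unsure, and a later message in the thread notes the filesystem's variant differs).

```python
import unicodedata


def to_repository_form(name: str) -> str:
    """Normalize a file name to NFC, the form proposed for the repository."""
    return unicodedata.normalize("NFC", name)


def to_decomposed_form(name: str) -> str:
    """Normalize to NFD, used here as a stand-in for the Mac OS X form."""
    return unicodedata.normalize("NFD", name)


composed = "\u00e4.txt"      # one code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308.txt"   # 'a' followed by COMBINING DIAERESIS

# Same name to a human, different bytes to a byte-oriented tool.
print(composed.encode("utf-8"))    # b'\xc3\xa4.txt'
print(decomposed.encode("utf-8"))  # b'a\xcc\x88.txt'
print(composed == decomposed)      # False

# Normalizing on the way in and out makes the two forms interchangeable.
print(to_repository_form(decomposed) == composed)   # True
print(to_decomposed_form(composed) == decomposed)   # True
```

The byte-level difference in the first two prints is why a tool that treats names as opaque bytes sees two different files.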
* Re: Cross-Platform Version Control 2009-05-12 17:56 ` Esko Luontola @ 2009-05-12 20:38 ` Johannes Schindelin 2009-05-12 21:16 ` Esko Luontola 0 siblings, 1 reply; 59+ messages in thread From: Johannes Schindelin @ 2009-05-12 20:38 UTC (permalink / raw) To: Esko Luontola; +Cc: Shawn O. Pearce, git Hi, On Tue, 12 May 2009, Esko Luontola wrote: > On 12.5.2009, at 19:13, Johannes Schindelin wrote: > >As to storing all file names in UTF-8, my point about Unicode being not > >necessarily appropriate for everyone still stands. > > > >UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO > >we should not take away the freedom for everybody to decide what they > >want their file names to be encoded as. > > > >However, I see that there might be a need to be able to encode the file > >names differently, such as on Windows. IMHO the best solution would be > >a config variable controlling the reencoding of file names. > > Exactly. The system should not force the use of a specific encoding. It > should only offer a recommendation, but be also fully compatible if the > user uses some other encoding. > > That's why it's best to always store the information about what encoding > was used. It shouldn't matter, whether the data is encoded with > ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it > is explicitly said that what the encoding is. Then the reader of the > data can best decide, how to show that data on the current platform. > > A config variable for defining, that what encoding should be used when > committing the file names, would make sense. Git should also try to > autodetect, that what encoding is used in its current environment. In > the case of UTF-8, you should also be able to specify which > normalization form is used > (http://www.unicode.org/unicode/reports/tr15/), or whether it is > normalized at all. 
> > For example, it should be possible to configure Git so, that when a file > is checked out on Mac, its file name is converted to the current file > system's encoding (UTF-8 NFD, I think), and when the file is committed > on Mac, the file name is normalized back to the same UTF-8 form as is > used on Linux (UTF-8 NFC). > > It would be nice to have config variables for saying, that all file > names in this repository must use UTF-8 NFC, and all commit messages > must use UTF-8 NFC (with Unix newlines). Then the Git client would > autodetect the current environment's encoding, and convert the text, if > necessary, to match the repository's encoding. That is a nice analysis. How about implementing it? Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 20:38 ` Johannes Schindelin @ 2009-05-12 21:16 ` Esko Luontola 2009-05-13 0:23 ` Johannes Schindelin 0 siblings, 1 reply; 59+ messages in thread From: Esko Luontola @ 2009-05-12 21:16 UTC (permalink / raw) To: git; +Cc: Johannes Schindelin, Shawn O. Pearce Johannes Schindelin wrote on 12.5.2009 23:38: > That is a nice analysis. How about implementing it? > Do we have here somebody, who knows Git's code well and is motivated to implement this? I don't think that I would be capable, because of not having used C much, being new to Git's codebase and having too little time. But I can help with the requirements specification, interaction design and system testing. -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 21:16 ` Esko Luontola @ 2009-05-13 0:23 ` Johannes Schindelin 2009-05-13 5:34 ` Esko Luontola 0 siblings, 1 reply; 59+ messages in thread From: Johannes Schindelin @ 2009-05-13 0:23 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 12.5.2009 23:38: > > That is a nice analysis. How about implementing it? > > > > Do we have here somebody, who knows Git's code well and is motivated to > implement this? > > I don't think that I would be capable, because of not having used C > much, being new to Git's codebase and having too little time. Well, that rather settles things, no? Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 0:23 ` Johannes Schindelin @ 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin 0 siblings, 2 replies; 59+ messages in thread From: Esko Luontola @ 2009-05-13 5:34 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git, Shawn O. Pearce Johannes Schindelin wrote on 13.5.2009 3:23: > Well, that rather settles things, no? > There is need for the feature, but it's unfortunate that the Git developers do not see its value. There are many users for whom using non-ASCII names is necessary (for example all of Asia and most of Europe), but now it seems that Bazaar is the only DVCS that handles encodings correctly: http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Let's see if I have time later this or next year to work on it. At least it would be good practice in getting acquainted with a new codebase and learning C. But it would be better for someone else to do it, to get it done within a reasonable amount of time. I see that there are some tests in the /t directory. Which command will run all of them, how good coverage do the tests have, how reproducible and isolated they are, how many seconds does it take to run all the tests? Is there some high-level documentation for new developers? -- Esko Luontola www.orfjackal.net ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola @ 2009-05-13 6:49 ` Alex Riesen 2009-05-13 10:15 ` Johannes Schindelin 1 sibling, 0 replies; 59+ messages in thread From: Alex Riesen @ 2009-05-13 6:49 UTC (permalink / raw) To: Esko Luontola; +Cc: Johannes Schindelin, git, Shawn O. Pearce 2009/5/13 Esko Luontola <esko.luontola@gmail.com>: > Johannes Schindelin wrote on 13.5.2009 3:23: >> >> Well, that rather settles things, no? >> > > There is need for the feature, but it's unfortunate that the Git developers > do not see its value. There are many users for whom using non-ASCII names is > necessary (for example all of Asia and most of Europe), but now it seems > that Bazaar is the only DVCS that handles encodings correctly: > http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames Many Git developers just use systems which don't care about the file name encoding at all and just keep the names as they were. So the interoperability problem does not exist for them. So, they either don't need the feature, or can trivially avoid or work around any problems. > I see that there are some tests in the /t directory. Which command will run > all of them, how good coverage do the tests have, how reproducible and > isolated they are, how many seconds does it take to run all the tests? Is > there some high-level documentation for new developers? make test. See also t/README. We like them. I always run the test suite before deployment and sometimes run it just for fun (unless I have to run it on Windows). ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 5:34 ` Esko Luontola 2009-05-13 6:49 ` Alex Riesen @ 2009-05-13 10:15 ` Johannes Schindelin [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> 1 sibling, 1 reply; 59+ messages in thread From: Johannes Schindelin @ 2009-05-13 10:15 UTC (permalink / raw) To: Esko Luontola; +Cc: git, Shawn O. Pearce Hi, On Wed, 13 May 2009, Esko Luontola wrote: > Johannes Schindelin wrote on 13.5.2009 3:23: > > Well, that rather settles things, no? > > There is need for the feature, but it's unfortunate that the Git > developers do not see its value. I see a value. But it is not my itch. And since it is your itch and you said that you will not do anything about it (I don't count writing emails here ;-), I concluded that it settles the issue. Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
[parent not found: <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>]
* Cross-Platform Version Control [not found] ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com> @ 2009-05-13 10:41 ` John Tapsell 2009-05-13 13:42 ` Jay Soffian 0 siblings, 1 reply; 59+ messages in thread From: John Tapsell @ 2009-05-13 10:41 UTC (permalink / raw) To: git 2009/5/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>: > Hi, > > On Wed, 13 May 2009, Esko Luontola wrote: > >> Johannes Schindelin wrote on 13.5.2009 3:23: >> > Well, that rather settles things, no? >> >> There is need for the feature, but it's unfortunate that the Git >> developers do not see its value. > > I see a value. But it is not my itch. And since it is your itch and you > said that you will not do anything about it (I don't count writing emails > here ;-), I concluded that it settles the issue. I don't know why the git developers are being so hostile/dismissive, but I also hope that somebody volunteers to fix this. Esko, you have my moral support :-) John ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 10:41 ` John Tapsell @ 2009-05-13 13:42 ` Jay Soffian 2009-05-13 13:44 ` Alex Riesen 0 siblings, 1 reply; 59+ messages in thread From: Jay Soffian @ 2009-05-13 13:42 UTC (permalink / raw) To: John Tapsell; +Cc: git On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: > I don't know why the git developers are being so hostile/dismissive, Are you serious? j. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:42 ` Jay Soffian @ 2009-05-13 13:44 ` Alex Riesen 2009-05-13 13:50 ` Jay Soffian 0 siblings, 1 reply; 59+ messages in thread From: Alex Riesen @ 2009-05-13 13:44 UTC (permalink / raw) To: Jay Soffian; +Cc: John Tapsell, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >> I don't know why the git developers are being so hostile/dismissive, > > Are you serious? > ...because we'll kill you if you aren't >:-E ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:44 ` Alex Riesen @ 2009-05-13 13:50 ` Jay Soffian 2009-05-13 13:57 ` John Tapsell 0 siblings, 1 reply; 59+ messages in thread From: Jay Soffian @ 2009-05-13 13:50 UTC (permalink / raw) To: Alex Riesen; +Cc: John Tapsell, git On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>> I don't know why the git developers are being so hostile/dismissive, >> >> Are you serious? >> > > ...because we'll kill you if you aren't >:-E I'm just flabbergasted by some people's expectations. Perhaps John doesn't realize the git developers are all volunteers, and that it is never appropriate to criticize a volunteer. A "thank you for all your hard work on git" would have done nicely. j. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:50 ` Jay Soffian @ 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: John Tapsell @ 2009-05-13 13:57 UTC (permalink / raw) To: Jay Soffian; +Cc: Alex Riesen, git 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: > On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>> I don't know why the git developers are being so hostile/dismissive, >>> >>> Are you serious? >>> >> >> ...because we'll kill you if you aren't >:-E > > I'm just flabbergasted by some people's expectations. Perhaps John > doesn't realize the git developers are all volunteers, and that it is > never appropriate to criticize a volunteer. A "thank you for all your > hard work on git" would have done nicely. I'm as much of an open source developer as anyone else here. I spend a huge amount of my time programming for KDE. But I've never told a user "well that settles it" because they won't code it themselves :-/ I certainly get a huge number of bug/wishes that I can't/won't code myself, but I try to be a bit more diplomatic about it. But then the kernel mailing lists tend to be a lot more.. direct.. than the kde mailing lists, so I guess it comes from that. Requiring people to have a thick skin and all that. John ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell @ 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 16:22 ` Johannes Schindelin 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 1 reply; 59+ messages in thread From: Nicolas Pitre @ 2009-05-13 15:27 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git On Wed, 13 May 2009, John Tapsell wrote: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bug/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more.. direct.. > than the kde mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. This is not the kernel mailing list. In fact this list is quite friendlier and more accommodating than the kernel list. The remark alluded above comes from _one_ of the git developers. And Dscho is apparently in a rather sad mood these days. While the substance of Dscho's remark is entirely pertinent, it would be wrong to use its form and style as a characterization of git developers in general. Nicolas ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 16:22 ` Johannes Schindelin 0 siblings, 0 replies; 59+ messages in thread From: Johannes Schindelin @ 2009-05-13 16:22 UTC (permalink / raw) To: Nicolas Pitre; +Cc: John Tapsell, Jay Soffian, Alex Riesen, git Hi, On Wed, 13 May 2009, Nicolas Pitre wrote: > On Wed, 13 May 2009, John Tapsell wrote: > > > I'm as much of an open source developer as anyone else here. I spend > > a huge amount of my time programming for KDE. But I've never told a > > user "well that settles it" because they won't code it themselves :-/ > > I certainly get a huge number of bug/wishes that I can't/won't code > > myself, but I try to be a bit more diplomatic about it. > > > > But then the kernel mailing lists tend to be a lot more.. direct.. > > than the kde mailing lists, so I guess it comes from that. Requiring > > people to have a thick skin and all that. > > This is not the kernel mailing list. In fact this list is quite > friendlier and more accommodating than the kernel list. > > The remark alluded above comes from _one_ of the git developers. And > Dscho is apparently in a rather sad mood these days. While the substance > of Dscho's remark is entirely pertinent, it would be wrong to use its > form and style as a characterization of git developers in general. Even if I were in a better mood, the whole thread has a back story on an msysGit issue, and this led me to try to stop what I feared would become a rather long mail thread without much of an outcome, such as that infamous thread about MacOSX UTF-8 filename handling. Alas, it seems that Robin is willing to work on the issues, so my fears have been totally and completely unfounded. Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre @ 2009-05-13 17:24 ` Andreas Ericsson 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 59+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:24 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell wrote: > 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote: >>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>: >>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote: >>>>> I don't know why the git developers are being so hostile/dismissive, >>>> Are you serious? >>>> >>> ...because we'll kill you if you aren't >:-E >> I'm just flabbergasted by some people's expectations. Perhaps John >> doesn't realize the git developers are all volunteers, and that it is >> never appropriate to criticize a volunteer. A "thank you for all your >> hard work on git" would have done nicely. > > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ > I certainly get a huge number of bug/wishes that I can't/won't code > myself, but I try to be a bit more diplomatic about it. > But then the kernel mailing lists tend to be a lot more.. direct.. > than the kde mailing lists, so I guess it comes from that. Requiring > people to have a thick skin and all that. > I think much of the perceived malignancy stems from the fact that the git list has a high ratio of developer-to-luser mailings on it, being by nature a developer tool most of the time. When the unaware user appears on the list with demands rather than polite requests, they're treated that much harder. Especially by the developer who happens to be, as it were, the butt of the request. 
Personally, I've only ever found Dscho being anything but friendly on this list, and even then, I really didn't find it offensive. If viewed in a happy mood, it matches quite nicely with a Swedish sketch whose theme is "men ja ente bitter". It's often quite funny, really :-) -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 13:57 ` John Tapsell 2009-05-13 15:27 ` Nicolas Pitre 2009-05-13 17:24 ` Andreas Ericsson @ 2009-05-14 1:49 ` Miles Bader 2 siblings, 0 replies; 59+ messages in thread From: Miles Bader @ 2009-05-14 1:49 UTC (permalink / raw) To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git John Tapsell <johnflux@gmail.com> writes: > I'm as much of an open source developer as anyone else here. I spend > a huge amount of my time programming for KDE. But I've never told a > user "well that settles it" because they won't code it themselves :-/ FWIW, Johannes' use of "Well, that rather settles things, no?" in this thread didn't strike me as being rude or truly dismissive (even though it's literally so). It seemed more just a timely and to the point reminder that however fun it is to talk about random feature X, someone's gotta do the work if it's going to actually be implemented, and that the direction of git development very much follows the whims of those doing the actual hacking (perhaps more so than other projects). [and I don't even have particularly thick skin, I think -- I'm often very annoyed by brusqueness one sees on many developer mailing lists...] -Miles -- Acquaintance, n. A person whom we know well enough to borrow from, but not well enough to lend to. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:14 ` Shawn O. Pearce 2009-05-12 16:13 ` Johannes Schindelin @ 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 2 replies; 59+ messages in thread From: Jeff King @ 2009-05-12 16:16 UTC (permalink / raw) To: Shawn O. Pearce; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > As for file names, no plans, it's a sequence of bytes, but I think a > lot of people wind up using some subset of US-ASCII for their file > names, especially if their project is going to be cross platform. Or they use a single encoding like utf8 so that there are no surprises. You can still run into normalization problems with filenames on some filesystems, though. Linus's name_hash code sets up the framework to handle "these two names are actually equivalent", but right now I think there is just code for handling case-sensitivity, not utf8 normalization (but I just skimmed the code, so I might be wrong). -Peff ^ permalink raw reply [flat|nested] 59+ messages in thread
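Editorial sketch of the "these two names are actually equivalent" idea Jeff mentions. This is not Git's name_hash code; it is a hypothetical illustration that groups names under a key made by NFC normalization followed by case folding, which is a simplification of full Unicode caseless matching.

```python
import unicodedata


def equivalence_key(name: str) -> str:
    """Key under which names differing only by case or normalization collide."""
    return unicodedata.normalize("NFC", name).casefold()


def find_collisions(names):
    """Return groups of names that share an equivalence key."""
    groups = {}
    for name in names:
        groups.setdefault(equivalence_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["Makefile", "makefile", "a\u0308.txt", "\u00e4.txt", "README"]
for group in find_collisions(names):
    # ascii() makes the otherwise invisible normalization difference visible.
    print([ascii(name) for name in group])
```

Dropping the `casefold()` call would give a key that handles normalization only, which is the narrower problem discussed for OS X later in the thread.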
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King @ 2009-05-12 16:57 ` Johannes Schindelin 2009-05-13 16:26 ` Linus Torvalds 1 sibling, 0 replies; 59+ messages in thread From: Johannes Schindelin @ 2009-05-12 16:57 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git Hi, On Tue, 12 May 2009, Jeff King wrote: > On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote: > > > As for file names, no plans, it's a sequence of bytes, but I think a > > lot of people wind up using some subset of US-ASCII for their file > > names, especially if their project is going to be cross platform. > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). Back then I actually started on a patch to make Git capable of determining UTF-8 equivalence, but at the same time somebody started such an annoying mail thread that I stopped working on the issue completely. Ciao, Dscho ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 16:16 ` Jeff King 2009-05-12 16:57 ` Johannes Schindelin @ 2009-05-13 16:26 ` Linus Torvalds 2009-05-13 17:12 ` Linus Torvalds 1 sibling, 1 reply; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 16:26 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Tue, 12 May 2009, Jeff King wrote: > > Or they use a single encoding like utf8 so that there are no surprises. > You can still run into normalization problems with filenames on some > filesystems, though. Linus's name_hash code sets up the framework to > handle "these two names are actually equivalent", but right now I think > there is just code for handling case-sensitivity, not utf8 normalization > (but I just skimmed the code, so I might be wrong). utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But quite frankly, the index is only part of it, and probably not the worst part. The real pain of filename handling is all the "read tree recursively with readdir()" issues. Along with just an absolute sh*t-load of issues about what to do when people ended up using different versions of the "same" name in different branches. There's also the issue that "cross-platform" really can be a pretty damn big pain. What do you do for platforms that simply are pure shit? I realize that OS X people have a hard time accepting it, but OS X filesystems are generally total and utter crap - even more so than Windows. Yes, yes, you can tell OS X that case matters, but that's not the normal case - and what do you do with projects that simply _do_ care about case. The kernel is one such project. Sure, you can "encode" the filenames on such broken filesystems in a way that they'd be different - but that won't really help the project, since makefiles etc won't work anyway. So one reason I didn't bother with utf-8 is that the much more fundamental issues are simply in plain old 7-bit US-ASCII. 
That said, if the only issue is that you want to encode regular utf-8 in a coherent way (and ignore the case issues), then we could probably do that part fairly easily with a "convert_to_internal()" and "convert_to_filename()" thing that acts very much like the CRLF conversion (except on filenames, not data). And yes, it's probably worth doing, since we'd need that for fuller case support anyway. It's just a fair amount of churn - not fundamentally _hard_, but not trivial either. And it needs a _lot_ of care, and a fair amount of testing that is probably hard to do on sane filesystems (ie the case where the filesystem actually _changes_ the name is going to be hard to test on anything sane). Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
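Editorial sketch of the conversion pair Linus names. The names `convert_to_internal()` and `convert_to_filename()` come from his message, but the signatures, the choice of UTF-8 NFC as the internal form, and the per-filesystem parameters are assumptions made for illustration; no such functions are shown in the thread.

```python
import unicodedata

# Assumed internal form for this sketch: UTF-8 bytes in NFC.
REPO_ENCODING = "utf-8"
REPO_FORM = "NFC"


def convert_to_internal(fs_name: bytes, fs_encoding: str) -> bytes:
    """Turn a name as the filesystem reports it into the internal form.

    Raises UnicodeDecodeError if fs_name is not valid in fs_encoding.
    """
    text = fs_name.decode(fs_encoding)
    return unicodedata.normalize(REPO_FORM, text).encode(REPO_ENCODING)


def convert_to_filename(repo_name: bytes, fs_encoding: str,
                        fs_form: str = "NFC") -> bytes:
    """Turn an internal name into what the local filesystem expects.

    Raises UnicodeEncodeError if fs_encoding cannot represent the name.
    """
    text = repo_name.decode(REPO_ENCODING)
    return unicodedata.normalize(fs_form, text).encode(fs_encoding)


latin1_name = b"\xe5\xe4\xf6.txt"   # the letters a-ring, a-diaeresis, o-diaeresis in ISO-8859-1
internal = convert_to_internal(latin1_name, "iso-8859-1")
print(internal)                                     # b'\xc3\xa5\xc3\xa4\xc3\xb6.txt'
print(convert_to_filename(internal, "iso-8859-1"))  # b'\xe5\xe4\xf6.txt'
print(convert_to_filename(internal, "utf-8", "NFD"))
```

As with the CRLF conversion he compares it to, the point is that the two directions are applied at the boundary between the working tree and the repository, so everything in between sees a single form.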
* Re: Cross-Platform Version Control 2009-05-13 16:26 ` Linus Torvalds @ 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 17:12 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But > quite frankly, the index is only part of it, and probably not the worst > part. > > The real pain of filename handling is all the "read tree recursively with > readdir()" issues. Along with just an absolute sh*t-load of issues about > what to do when people ended up using different versions of the "same" > name in different branches. Btw, if people care mainly just about OS X, and don't worry so much about case, but about the idiotic and insane OS X behavior of turning UTF-8 filenames into that crazy NFD format, here's a simple patch that may be useful for that. There _will_ certainly be other places, but this handles the one big case of "read_directory_recursive()", and can turn NFD into the sane NFC format. Since OS X will then accept NFC (and internally turn it back to NFD) when you pass them as filenames, that means that converting the other way is not necessary. NOTE NOTE NOTE! This really just handles one case, and is not enough for any kind of general case. For example, it does NOT handle the case where you do git add filename_with_åäö explicitly, because if the "filename_with_åäö" is done using NFD (tab-completion etc), now git won't _match_ it with the filename it reads using readdir() any more (which got converted to NFC), so at a minimum we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. See "get_pathspec()" in setup.c for that latter case. But with that, and this crazy thing, OS X users might be already a lot better off. Totally untested, of course. 
Oh, and somebody needs to fill in that convert_name_from_nfd_to_nfc() implementation. It's designed so that if it notices that the string is just plain US-ASCII, it can return 0 and no extra work is done. That, in turn, can easily be done by some simple and efficient pre-processing that checks that there are no high bits set (on a 64-bit platform, do it 8 characters at a time with a "& 0x8080808080808080"), so that the common case has barely any overhead at all. Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do the actual normalization if you find characters with the high bit set. And since I know that the OS X filesystems are so buggy as to not even do that whole NFD thing right, there is probably some OS-X specific "use this for filesystem names" conversion function. Hmm. Anybody want to take this on? It really shouldn't be too complex to get it working for the common case on just OS X. It's really the case sensitivity that is the biggest problem; if you ignore that for now, the problem space is _much_ smaller. In other words, I think we can reasonably easily support a subset of _common_ issues with some trivial patches like this. But getting it right in _all_ the cases is going to be much more work (there are lots of other uses of "readdir()" too, this one just happens to be one of the more central ones). Of course, it probably makes sense to have a whole "git_readdir()" that does this thing in general. That "create_full_path()" thing makes sense regardless, though, in that it also simplifies a lot of "baselen+len" usage in just "len". 
Linus --- dir.c | 40 ++++++++++++++++++++++++++++++++-------- 1 files changed, 32 insertions(+), 8 deletions(-) diff --git a/dir.c b/dir.c index 6aae09a..4cbfc24 100644 --- a/dir.c +++ b/dir.c @@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path) } /* + * Take the readdir output, in (d_name,len), and append it to + * our base name in (fullname,baselen) with any required + * readdir fs->internal translation. + * + * Put the result in 'fullname', and return the final length. + * + * Right now we have no translation, and just do a memcpy() + * (the +1 is to copy the final NUL character too). + */ +static int create_full_path(char *fullname, int baselen, const char *d_name, int len) +{ +#ifdef OS_X_IS_SOME_CRAZY_SHxAT + char temp[256], nlen; + nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp)); + if (nlen) { + len = nlen; + d_name = temp; + } +#endif + memcpy(fullname + baselen, d_name, len + 1); + return baselen + len; +} + +/* * Read a directory tree. We currently ignore anything but * directories, regular files and symlinks. That's because git * doesn't handle them at all yet. Maybe that will change some @@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co /* Ignore overly long pathnames! */ if (len + baselen + 8 > sizeof(fullname)) continue; - memcpy(fullname + baselen, de->d_name, len+1); - if (simplify_away(fullname, baselen + len, simplify)) + len = create_full_path(fullname, baselen, de->d_name, len); + if (simplify_away(fullname, len, simplify)) continue; dtype = DTYPE(de); exclude = excluded(dir, fullname, &dtype); if (exclude && (dir->flags & DIR_COLLECT_IGNORED) - && in_pathspec(fullname, baselen + len, simplify)) - dir_add_ignored(dir, fullname, baselen + len); + && in_pathspec(fullname, len, simplify)) + dir_add_ignored(dir, fullname, len); /* * Excluded? 
If we don't explicitly want to show @@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co default: continue; case DT_DIR: - memcpy(fullname + baselen + len, "/", 2); + memcpy(fullname + len, "/", 2); len++; - switch (treat_directory(dir, fullname, baselen + len, simplify)) { + switch (treat_directory(dir, fullname, len, simplify)) { case show_directory: if (exclude != !!(dir->flags & DIR_SHOW_IGNORED)) @@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co break; case recurse_into_directory: contents += read_directory_recursive(dir, - fullname, fullname, baselen + len, 0, simplify); + fullname, fullname, len, 0, simplify); continue; case ignore_directory: continue; @@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co if (check_only) goto exit_early; else - dir_add_name(dir, fullname, baselen + len); + dir_add_name(dir, fullname, len); } exit_early: closedir(fdir); ^ permalink raw reply related [flat|nested] 59+ messages in thread
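Note on the ASCII fast path described in the message above: the "no high bits set" pre-check could be sketched roughly as follows. This is only an illustration, not part of the posted patch; the helper name has_high_bit() is hypothetical, and the sketch assumes 8-bit bytes and a compiler that provides uint64_t. A convert_name_from_nfd_to_nfc() implementation might call something like it first and return 0 when it reports no high bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not in the patch): return 1 if any byte in
 * name[0..len) has its high bit set, 0 if the name is plain US-ASCII.
 */
static int has_high_bit(const char *name, size_t len)
{
	size_t i = 0;

	/* Check 8 bytes at a time; memcpy() avoids unaligned loads. */
	for (; i + 8 <= len; i += 8) {
		uint64_t word;
		memcpy(&word, name + i, sizeof(word));
		if (word & UINT64_C(0x8080808080808080))
			return 1;
	}
	/* Check the remaining 0..7 bytes one at a time. */
	for (; i < len; i++)
		if ((unsigned char)name[i] & 0x80)
			return 1;
	return 0;
}
```

Because the mask tests the same bit in every byte, the result does not depend on the byte order of the platform.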
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds @ 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 20:57 ` Matthias Andree 2 siblings, 0 replies; 59+ messages in thread From: Andreas Ericsson @ 2009-05-13 17:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. That "create_full_path()" thing makes sense > regardless, though, in that it also simplifies a lot of "baselen+len" > usage in just "len". > In a flash of premonitory insight, libgit2 has gitfo_foreach_dirent(path, callback) which would probably be well suited for this kind of thing. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Register now for Nordic Meet on Nagios, June 3-4 in Stockholm http://nordicmeetonnagios.op5.org/ Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson @ 2009-05-13 17:46 ` Linus Torvalds 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 20:57 ` Matthias Andree 2 siblings, 1 reply; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 17:46 UTC (permalink / raw) To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > > Of course, it probably makes sense to have a whole "git_readdir()" that > does this thing in general. Actually, the more I think about that, the less true I think it is. It _sounds_ like a nice simplification ("just do it once in readdir, and forget about it everywhere else"), but it's in fact a stupid thing to do. Why? If we _ever_ want to fix this in the general case, then the code that does the readdir() will actually have to remember both the "raw filesystem" form _and_ the "cleaned-up utf-8 form". Why? Because when we do readdir(), we'll also do 'lstat()' on the end result to check the types, and opendir() in case it's a directory and we then want to do things recursively etc. And that happens to work on OS X (because we can use our "fixed" filename for lstat too), but it does not work in the general case. And you can say "well, just do the stat inside the wrapped readdir()", but that doesn't work _either_, since - we don't want to do the lstat() if it's unnecessary. Even if we don't have "de->d_type" information, we can often avoid the need for it, if we can tell that the name isn't interesting (due to being ignored). Avoiding the lstat is a huge performance issue for cold-cache cases. It's basically a seek. So we really want to do the lstat() later, which implies that the caller needs to know _both_ the original "real" filesystem name _and_ the converted one. 
- it doesn't handle the opendir() case anyway - so the end result is that a real implementation will _always_ need to carry around both the "filesystem view" filename _and_ the "what we've converted it into". Now, the point of the patch I sent out was that for the specific case of OS X, which does UTF-8 conversions (wrong) but also is happy to get our properly normalized name, we don't care. So my patch is "correct" for that special case - and so would a plain readdir() wrapper be. But my patch is _also_ correct for the case where a readdir() wrapper would do the wrong thing. My patch doesn't _handle_ it (since it doesn't change the code to pass both "filesystem view" and "cleaned-up view" pathnames), but the patch I sent out also doesn't make it any harder to do right. In contrast, doing a readdir() wrapper makes it much harder to do right later, because it's just doing the conversion at the wrong level (you could make that "wrapper" return both the original and the fixed filename, but at that point the wrapper doesn't really help - you might as well just have the "convert" function, and it would be a hell of a lot more obvious what is really going on). So I take it back. A readdir() wrapper is not a good idea. It gets us a tiny bit of the way, but it would actually take us a step back from the "real" solution. Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 18:26 ` Martin Langhoff 2009-05-13 18:37 ` Linus Torvalds 0 siblings, 1 reply; 59+ messages in thread From: Martin Langhoff @ 2009-05-13 18:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 7:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I take it back. A readdir() wrapper is not a good idea. It gets us a > tiny bit of the way, but it would actually take us a step back from the > "real" solution. Do we need to take the real solution to the core of git? What I am wondering is whether we can keep this simple in git internals and catch problem filenames at git-add time. This would allow git to keep treating filenames as a bag of bytes, and it does a better thing for users. In cross platform projects, most users don't even know that there are problems, and even if they do, they don't know what the problems are. If git add can be told to warn & refuse to add a path with portability problems, then we educate our users, prevent them from committing filenames that will later cause trouble to others in their projects, etc. from-the-keep-it-simple-and-informative-dept, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:26 ` Martin Langhoff @ 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:08 ` Daniel Barkalow 0 siblings, 2 replies; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 18:37 UTC (permalink / raw) To: Martin Langhoff; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Martin Langhoff wrote: > > Do we need to take the real solution to the core of git? Well, I suspect that if we really want to support it, then we'd better. > What I am wondering is whether we can keep this simple in git > internals and catch problem filenames at git-add time. I can almost guarantee that it will just cause more problems than it solves, and generate some nasty cases that just aren't solvable. Because it really isn't just "git add". It's every single thing that does a lstat() on a filename inside of git. Now, the simple OS X case is not a huge problem, since the lstat will succeed with the fixed-up filename too. But as mentioned, the OS X case is the thing that doesn't need a lot of infrastructure _anyway_ - I can almost guarantee that my posted patch (with the added setup.c stuff for get_pathspec()) is going to be _fewer_ lines than some wrapper logic. Note: in all of the above, I assume that people care more about just plain UTF characters (and the insane NFD form OS X uses) than about worrying about the _really_ subtle issues of case-independence. Those are a major pain, but they will need even more "internal" support, because there simply isn't any sane wrapping method. (You could wrap everything to force lower-casing of all filesystem ops or something, but that would not be acceptable to any sane environment. So in reality you need to accept mixed-case things, and then there is no way to know from the "outside" whether one external mixed-case thing matches some internal index mixed-case thing). Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds @ 2009-05-13 21:04 ` Theodore Tso 2009-05-13 21:20 ` Linus Torvalds 2009-05-13 21:08 ` Daniel Barkalow 1 sibling, 1 reply; 59+ messages in thread From: Theodore Tso @ 2009-05-13 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 11:37:28AM -0700, Linus Torvalds wrote: > Note: in all of the above, I assume that people care more about just plain > UTF characters (and the insane NFD form OS X uses) than about worrying > about the _really_ subtle issues of case-independence. Those are a major > pain, but they will need even more "internal" support, because there > simply isn't any sane wrapping method. Stupid question --- if we get something that works for Windows and MacOS X, is there any reason why we need to solve the general problem of case-insensitive filesystems? It's really backwards compatibility with Legacy OS's that's most important, right? Are there any other systems other than Windows and Mac OS X which (a) perpetrate case insensitivity on application programmers, and (b) which current or future git users are likely to care about? - Ted ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:20 ` Linus Torvalds 0 siblings, 0 replies; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 21:20 UTC (permalink / raw) To: Theodore Tso Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Theodore Tso wrote: > > Stupid question --- if we get something that works for Windows and > MacOS X, is there any reason why we need to solve the general problem > of case-insensitive filesystems? Quite frankly, I don't think we're even very close to getting anything that works for Windows or OS X. Case-insensitivity is _hard_. The "easy" case is to just handle the OS X crazy pseudo-NFD format, and at least turn that into NFC (and perhaps add a config option to do latin1 and EUC-JP to utf-8 too). At that point, we at least handle regular utf-8 the same way. Doing the latin1/EUC-JP thing would actually to some degree be more interesting than the OS X NFD case, because that really does require two-way conversion, and we can "test" that even on sane filesystems (ie play at having a Latin1 filesystem). That said, I suspect there aren't that many people who care about latin1 filesystems. I dunno about EUC-JP (and variants - for all I know, shift-JIS and other cases may be the more common ones). Of course, if we do everything right, maybe the windows people would actually like us to keep the filesystem-native representation in UTF-16LE or whatever the crazy format is that Windows really uses deep down. My point being that all of these things happen even without the added worry about case. And in many ways, not worrying about case should probably be the first step. We do have some support for worrying about case, but trying to solve both things at the same time isn't going to be workable, I suspect. 
Case insensitivity should never ever involve a _conversion_ (if it does, you get all kinds of crazy behavior), it's just purely a _comparison_ issue, so the two really are fundamentally different. Of course, the reason OS-X seems to be so messed up is exactly that the morons at Apple didn't understand the difference between conversion and comparison, and mixed them up. Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
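Note on "comparison, not conversion" in the message above: one way to picture the distinction is a matching function that treats two names as equal regardless of case while leaving both stored byte strings untouched. The sketch below is an illustration only. The name same_name_icase() is hypothetical, it folds ASCII letters only (assuming the default "C" locale for tolower()), and it is not how git's name_hash code is implemented.

```c
#include <ctype.h>

/*
 * Hypothetical helper: return 1 if a and b are equal when ASCII case is
 * ignored, 0 otherwise. Neither string is modified; the caller keeps
 * whatever spelling it already had.
 */
static int same_name_icase(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	/* Equal only if both strings ended at the same point. */
	return *a == *b;
}
```

A converting approach would instead rewrite "Makefile" to "makefile" before storing it, which loses the spelling the user chose and is the behavior the message argues against.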
* Re: Cross-Platform Version Control 2009-05-13 18:37 ` Linus Torvalds 2009-05-13 21:04 ` Theodore Tso @ 2009-05-13 21:08 ` Daniel Barkalow 2009-05-13 21:29 ` Linus Torvalds 1 sibling, 1 reply; 59+ messages in thread From: Daniel Barkalow @ 2009-05-13 21:08 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Linus Torvalds wrote: > On Wed, 13 May 2009, Martin Langhoff wrote: > > > > Do we need to take the real solution to the core of git? > > Well, I suspect that if we really want to support it, then we'd better. > > > What I am wondering is whether we can keep this simple in git > > internals and catch problem filenames at git-add time. > > I can almost guarantee that it will just cause more problems than it > solves, and generate some nasty cases that just aren't solvable. > > Because it really isn't just "git add". It's every single thing that does > a lstat() on a filename inside of git. > > Now, the simple OS X case is not a huge problem, since the lstat will > succeed with the fixed-up filename too. I'm not seeing what the general case is, and how it could possibly behave. There's the "insensitive" behavior: if you create "foo" and look for "FOO", it's there, but readdir() reports "foo". There's the "converting" behavior: if you create "foo", readdir() reports "FOO", but lstat("foo") returns it. The obvious general case is: if you create "foo", readdir() reports "FOO", and lstat("foo") doesn't find a match. But if you create "foo" again... it doesn't find "foo", so it creates a new file, which it also calls "FOO", and the filesystem now has two files with identical names? It seems to me that the limits of minimally functional, non-inode-losing filesystems are: lstat() might take a filename and return the data for a non-byte-identical filename; open(name, O_CREAT|O_EXCL) might replace the given name with a non-byte-identical filename. 
But surely open(name) and lstat(name) (with the same name) must find the same file, even if readdir() would report it with a different name. And I assume that a filesystem that rejected any non-NFD filenames or any non-NFC filenames would be totally unusable, in that users will manage to get unnormalized filenames into programs and find that the filesystem just doesn't work. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:08 ` Daniel Barkalow @ 2009-05-13 21:29 ` Linus Torvalds 0 siblings, 0 replies; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 21:29 UTC (permalink / raw) To: Daniel Barkalow Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Daniel Barkalow wrote: > > > > Now, the simple OS X case is not a huge problem, since the lstat will > > succeed with the fixed-up filename too. > > I'm not seeing what the general case is, and how it could possibly behave. Here's a simple example. Let's say that your company uses Latin1 internally for your filesystems, because your tools really aren't utf-8 ready. This is NOT AT ALL unnatural - it's how lots of people used to work with Linux over the years, and it's largely how people still use FAT, I suspect (except it's not latin1, it's some windows-specific 8-bits-per-character mapping). IOW, if you have a file called 'åäö', it literally is encoded as '\xe5\xe4\xf6' (if you wonder why I picked those three letters, it's because they are the regular extra letters in Swedish - Swedish has 29 letters in its alphabet, and those three letters really are letters in their own right, they are NOT 'a' and 'o' with some dots/rings on top). IOW, if you open such a file, you need to use those three bytes. Now, even if you happen to have an OS and use Latin1 on disk, you may realize that you'd like to interact with others that use UTF-8, and would want to have your git archive that you export use nice portable UTF-8. But you absolutely MUST NOT just do a conversion at "readdir()" time. If you do that, then your three-byte filename turns into a six-byte utf-8 sequence of '\xc3\xa5\xc3\xa4\xc3\xb6' and the thing is, now "lstat()" won't work on that sequence. So obviously you could always turn things _back_ for lstat(), but quite frankly, that's (a) insane (b) incompetent and (c) not even always well-defined. 
> There's the "insensitive" behavior: if you create "foo" and look for > "FOO", it's there, but readdir() reports "foo". > > There's the "converting" behavior: if you create "foo", readdir() reports > "FOO", but lstat("foo") returns it. Then there's the behaviour above: you want your git repository to have utf-8, but your filesystem doesn't convert anything at all, and all your regular tools (think editors etc) are all Latin1. Latin1 is going away, I hope, but I bet EUC-JP etc still exist. Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
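Note on the Latin1 example above: the byte sequences quoted in the message can be reproduced with a small conversion routine. This is a sketch for illustration only; latin1_to_utf8() is a hypothetical name, not a git function.

```c
#include <stddef.h>

/*
 * Hypothetical helper: convert len bytes of Latin-1 in 'in' to
 * NUL-terminated UTF-8 in 'out'. Returns the number of bytes written
 * (not counting the NUL), or -1 if 'out' is too small.
 */
static int latin1_to_utf8(const char *in, size_t len, char *out, size_t outsz)
{
	size_t i, n = 0;

	if (outsz == 0)
		return -1;
	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)in[i];
		if (c < 0x80) {
			/* ASCII maps to itself; keep room for the NUL. */
			if (n + 1 >= outsz)
				return -1;
			out[n++] = (char)c;
		} else {
			/* 0x80..0xff becomes a two-byte UTF-8 sequence. */
			if (n + 2 >= outsz)
				return -1;
			out[n++] = (char)(0xc0 | (c >> 6));
			out[n++] = (char)(0x80 | (c & 0x3f));
		}
	}
	out[n] = '\0';
	return (int)n;
}
```

Feeding it the three bytes '\xe5\xe4\xf6' yields the six bytes '\xc3\xa5\xc3\xa4\xc3\xb6'. On a filesystem that stores the Latin-1 name, lstat() on the six-byte result would not find the file, which is why the caller has to keep the original name alongside the converted one.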
* Re: Cross-Platform Version Control 2009-05-13 17:12 ` Linus Torvalds 2009-05-13 17:31 ` Andreas Ericsson 2009-05-13 17:46 ` Linus Torvalds @ 2009-05-13 20:57 ` Matthias Andree 2009-05-13 21:10 ` Linus Torvalds 2 siblings, 1 reply; 59+ messages in thread From: Matthias Andree @ 2009-05-13 20:57 UTC (permalink / raw) To: Linus Torvalds, Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds <torvalds@linux-foundation.org>: > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to > do the actual normalization if you find characters with the high bit > set. And since I know that the OS X filesystems are so buggy as to not > even do that whole NFD thing right, there is probably some OS-X specific > "use this for > filesystem names" conversion function. Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, rather than canonical, normalization) for anything except normalizing temporary variables inside strcasecmp(3) or similar. Probably not even that. The normalizations done are often irreversible and also surprising. You don't want to turn 2³.c into 23.c, do you? -- Matthias Andree ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 20:57 ` Matthias Andree @ 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 0 siblings, 2 replies; 59+ messages in thread From: Linus Torvalds @ 2009-05-13 21:10 UTC (permalink / raw) To: Matthias Andree; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, 13 May 2009, Matthias Andree wrote: > Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds > <torvalds@linux-foundation.org>: > > > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do > > the actual normalization if you find characters with the high bit set. And > > since I know that the OS X filesystems are so buggy as to not even do that > > whole NFD thing right, there is probably some OS-X specific "use this for > > filesystem names" conversion function. > > Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility, > rather than canonical, normalization) for anything except normalizing > temporary variables inside strcasecmp(3) or similar. Probably not even that. > The normalizations done are often irreversible and also surprising. You don't > want to turn 2³.c into 23.c, do you? No, you're right. We want just plain NFC. I just googled for how some other projects handled this, and found the stringprep thing in a post about rsync, and didn't look any closer. But yes, you're absolutely right, stringprep is total crap, and nfkc is horrible. I have no idea of what library to use, though. For perl, there's Unicode::Normalize, but that's likely still subtly incorrect for the OS-X case due to the filesystem not using _strict_ NFD. I have this dim memory of somebody actually pointing to the documentation of exactly which characters OS X ends up decomposing. Maybe we could just do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the OS X case is the only one we need to care about? 
Linus ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds @ 2009-05-13 21:30 ` Jay Soffian 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 59+ messages in thread From: Jay Soffian @ 2009-05-13 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Matthias Andree, Jeff King, Shawn O. Pearce, Esko Luontola, git On Wed, May 13, 2009 at 5:10 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > I have this dim memory of somebody actually pointing to the documentation > of exactly which characters OS X ends up decomposing. http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties http://developer.apple.com/technotes/tn/tn1150table.html j. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-13 21:10 ` Linus Torvalds 2009-05-13 21:30 ` Jay Soffian @ 2009-05-13 21:47 ` Matthias Andree 1 sibling, 0 replies; 59+ messages in thread From: Matthias Andree @ 2009-05-13 21:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git Am 13.05.2009, 23:10 Uhr, schrieb Linus Torvalds <torvalds@linux-foundation.org>: > > > On Wed, 13 May 2009, Matthias Andree wrote: > >> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds >> <torvalds@linux-foundation.org>: >> >> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something >> to do >> > the actual normalization if you find characters with the high bit >> set. And >> > since I know that the OS X filesystems are so buggy as to not even do >> that >> > whole NFD thing right, there is probably some OS-X specific "use this >> for >> > filesystem names" conversion function. >> >> Sorry for interrupting, but NF_K_C? You don't want that (K for >> compatibility, >> rather than canonical, normalization) for anything except normalizing >> temporary variables inside strcasecmp(3) or similar. Probably not even >> that. >> The normalizations done are often irreversible and also surprising. You >> don't >> want to turn 2³.c into 23.c, do you? > > No, you're right. We want just plain NFC. I just googled for how some > other projects handled this, and found the stringprep thing in a post > about rsync, and didn't look any closer. > > But yes, you're absolutely right, stringprep is total crap, and nfkc is > horrible. Crap? It's just besides the purpose and some limited form of fuzzy match. Anyways... > I have no idea of what library to use, though. For perl, there's > Unicode::Normalize, but that's likely still subtly incorrect for the OS-X > case due to the filesystem not using _strict_ NFD. Perhaps ICU (ICU4C), from http://site.icu-project.org/ -- Matthias Andree ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 15:06 Cross-Platform Version Control Esko Luontola 2009-05-12 15:14 ` Shawn O. Pearce @ 2009-05-12 18:28 ` Dmitry Potapov 2009-05-12 18:40 ` Martin Langhoff 2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting 2 siblings, 1 reply; 59+ messages in thread From: Dmitry Potapov @ 2009-05-12 18:28 UTC (permalink / raw) To: Esko Luontola; +Cc: git On Tue, May 12, 2009 at 06:06:05PM +0300, Esko Luontola wrote: > A good start for making Git cross-platform, would be storing the text > encoding of every file name and commit message together with the commit. > Currently, because Git is oblivious to the encodings and just considers > them as a series of bytes, there is no way to make them cross-platform. 1. Git already stores the encoding for all commit messages that are not in UTF-8. 2. If you really want to be cross-platform portable, you should not use any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable Filename Character Set) http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Dmitry ^ permalink raw reply [flat|nested] 59+ messages in thread
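Note on the Portable Filename Character Set mentioned above: a check against [A-Za-z0-9._-] could look roughly like the sketch below. The function name is_portable_filename() is hypothetical, the check applies to a single path component (so '/' is rejected), and the range comparisons assume an ASCII-compatible execution character set.

```c
/*
 * Hypothetical helper: return 1 if 'name' is non-empty and consists only
 * of characters from [A-Za-z0-9._-], 0 otherwise.
 */
static int is_portable_filename(const char *name)
{
	const char *p;

	if (!*name)
		return 0;
	for (p = name; *p; p++) {
		char c = *p;
		if ((c >= 'A' && c <= 'Z') ||
		    (c >= 'a' && c <= 'z') ||
		    (c >= '0' && c <= '9') ||
		    c == '.' || c == '_' || c == '-')
			continue;
		return 0;
	}
	return 1;
}
```

Any byte with the high bit set fails the check, so UTF-8 and Latin-1 names are both rejected, as are spaces.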
* Re: Cross-Platform Version Control 2009-05-12 18:28 ` Dmitry Potapov @ 2009-05-12 18:40 ` Martin Langhoff 2009-05-12 18:55 ` Jakub Narebski 0 siblings, 1 reply; 59+ messages in thread From: Martin Langhoff @ 2009-05-12 18:40 UTC (permalink / raw) To: Dmitry Potapov; +Cc: Esko Luontola, git On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > 2. If you really want to be cross-platform portable, you should not use > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > Filename Character Set) > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 Would it make sense to have warnings at 'git add' time about - filenames outside of that charset (as the strictest mode, perhaps even default) - filenames that have a potential conflict wrt case-sensitivity - filenames that have potential conflict in the same tree due to utf-8 encoding vagaries MHO is that a strict "start your project portable from day one" mode is best as a default. But I'd be happy with any default, actually ;-) m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control 2009-05-12 18:40 ` Martin Langhoff @ 2009-05-12 18:55 ` Jakub Narebski 2009-05-12 21:43 ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt 0 siblings, 1 reply; 59+ messages in thread From: Jakub Narebski @ 2009-05-12 18:55 UTC (permalink / raw) To: Martin Langhoff; +Cc: Dmitry Potapov, Esko Luontola, git Martin Langhoff <martin.langhoff@gmail.com> writes: > On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote: > > 2. If you really want to be cross-platform portable, you should not use > > any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable > > Filename Character Set) > > http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276 > > Would it make sense to have warnings at 'git add' time about > > - filenames outside of that charset (as the strictest mode, perhaps > even default) > - filenames that have a potential conflict wrt case-sensitivity > - filenames that have potential conflict in the same tree due to > utf-8 encoding vagaries > > MHO is that a strict "start your project portable from day one" mode > is best as a default. But I'd be happy with any default, actually ;-) Somebody asked for a pre-add hook in the past; it would be good place to put such check. But in meantime you can do it using pre-commit hook instead, isn't it? -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames 2009-05-12 18:55 ` Jakub Narebski @ 2009-05-12 21:43 ` Heiko Voigt 2009-05-12 21:55 ` Jakub Narebski 0 siblings, 1 reply; 59+ messages in thread From: Heiko Voigt @ 2009-05-12 21:43 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano At the moment non-ascii encodings of file/usernames are not very well supported by git. This will most likely change in the future but to allow repositories to be portable among different file/operating systems this check is enabled by default. Signed-off-by: Heiko Voigt <heiko.voigt@mahr.de> --- On Tue, May 12, 2009 at 11:55:39AM -0700, Jakub Narebski wrote: > Somebody asked for a pre-add hook in the past; it would be good place > to put such check. But in meantime you can do it using pre-commit > hook instead, isn't it? I actually had this in my queue to be submitted... templates/hooks--pre-commit.sample | 33 +++++++++++++++++++++++++++++++++ 1 files changed, 33 insertions(+), 0 deletions(-) diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample index 0e49279..83ff873 100755 --- a/templates/hooks--pre-commit.sample +++ b/templates/hooks--pre-commit.sample @@ -7,6 +7,39 @@ # # To enable this hook, rename this file to "pre-commit". +# If you want to allow non-ascii filenames or usernames set +# this variable to true. +allownonascii=$(git config hooks.allownonascii) + +function is_ascii () { + test -z "$(cat | sed -e "s/[\ -~]*//g")" + return $? +} + +if [ "$allownonascii" != "true" ] +then + # until git can handle non-ascii filenames gracefully + # prevent them to be added into the repository + if ! git diff --cached --name-only --diff-filter=A -z \ + | tr "\0" "\n" | is_ascii; then + echo "Non-ascii filenames are not allowed !" + echo "Please rename the file ..." + exit 1 + fi + + # non-ascii username issue a warning in git gui so tell the + # user to change it + if ! 
git config user.name | is_ascii; then + echo "Please only use ascii characters in your username!" + exit 1 + fi + + if ! git config user.email | is_ascii; then + echo "Please only use ascii characters in your email!" + exit 1 + fi +fi + if git-rev-parse --verify HEAD 2>/dev/null then against=HEAD -- 1.6.3 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames 2009-05-12 21:43 ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt @ 2009-05-12 21:55 ` Jakub Narebski 2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt 0 siblings, 1 reply; 59+ messages in thread From: Jakub Narebski @ 2009-05-12 21:55 UTC (permalink / raw) To: Heiko Voigt Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano On Tue, 12 May 2009, Heiko Voigt wrote: > At the moment non-ascii encodings of file/usernames are not very well > supported by git. This will most likely change in the future but to > allow repositories to be portable among different file/operating systems > this check is enabled by default. > + # non-ascii username issue a warning in git gui so tell the > + # user to change it > + if ! git config user.name | is_ascii; then > + echo "Please only use ascii characters in your username!" > + exit 1 > + fi > + > + if ! git config user.email | is_ascii; then > + echo "Please only use ascii characters in your email!" > + exit 1 > + fi Actually 1.) there is no easy way to avoid non-ASCII names at least in user.name (I think they are not allowed in email), but 2.) there is no trouble with non-ASCII encoding of commits, as they have an 'encoding' header if it is not utf-8 (see *encoding* config variables). -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames 2009-05-12 21:55 ` Jakub Narebski @ 2009-05-14 17:59 ` Heiko Voigt 2009-05-15 10:52 ` Martin Langhoff ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Heiko Voigt @ 2009-05-14 17:59 UTC (permalink / raw) To: Jakub Narebski Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano At the moment non-ascii encodings of filenames are not portably converted between different filesystems by git. This will most likely change in the future but to allow repositories to be portable among different file/operating systems this check is enabled by default. Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net> --- On Tue, May 12, 2009 at 11:55:59PM +0200, Jakub Narebski wrote: > On Tue, 12 May 2009, Heiko Voigt wrote: > > > At the moment non-ascii encodings of file/usernames are not very well > > supported by git. This will most likely change in the future but to > > allow repositories to be portable among different file/operating systems > > this check is enabled by default. > > > + # non-ascii username issue a warning in git gui so tell the > > + # user to change it > > + if ! git config user.name | is_ascii; then > > + echo "Please only use ascii characters in your username!" > > + exit 1 > > + fi > > + > > + if ! git config user.email | is_ascii; then > > + echo "Please only use ascii characters in your email!" > > + exit 1 > > + fi > > Actually 1.) there is no easy way to avoid non-ASCII names at least > in user.name (I think they are not allowed in email), but 2.) there > is no trouble with non-ASCII encoding of commits, as they have > 'encoding' header if it is not utf-8 (see *encoding* config variables). I tried it and indeed it seems to work now. This hook originated from a Windows installation where having non-ascii characters resulted in a strange warning from git gui each time you commit. So here is the corrected patch. 
templates/hooks--pre-commit.sample | 20 ++++++++++++++++++++ 1 files changed, 20 insertions(+), 0 deletions(-) diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample index 0e49279..3083735 100755 --- a/templates/hooks--pre-commit.sample +++ b/templates/hooks--pre-commit.sample @@ -7,6 +7,26 @@ # # To enable this hook, rename this file to "pre-commit". +# If you want to allow non-ascii filenames set this variable to true. +allownonascii=$(git config hooks.allownonascii) + +function is_ascii () { + test -z "$(cat | sed -e "s/[\ -~]*//g")" + return $? +} + +if [ "$allownonascii" != "true" ] +then + # until git can handle non-ascii filenames gracefully + # prevent them to be added into the repository + if ! git diff --cached --name-only --diff-filter=A -z \ + | tr "\0" "\n" | is_ascii; then + echo "Non-ascii filenames are not allowed !" + echo "Please rename the file ..." + exit 1 + fi +fi + if git-rev-parse --verify HEAD 2>/dev/null then against=HEAD -- 1.6.3 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames 2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt @ 2009-05-15 10:52 ` Martin Langhoff 2009-05-18 9:37 ` Heiko Voigt 2009-06-20 12:14 ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt 2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski 2009-05-15 18:11 ` [PATCH v2] " Junio C Hamano 2 siblings, 2 replies; 59+ messages in thread From: Martin Langhoff @ 2009-05-15 10:52 UTC (permalink / raw) To: Heiko Voigt Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote: > At the moment non-ascii encodings of filenames are not portably converted > between different filesystems by git. This will most likely change in the > future but to allow repositories to be portable among different file/operating > systems this check is enabled by default. Nice! - It'd be a good idea to add to the mix a check for filenames that are equivalent in case-insensitive FSs. - Should all of this be a general "portablefilenames" setting? cheers, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames 2009-05-15 10:52 ` Martin Langhoff @ 2009-05-18 9:37 ` Heiko Voigt 2009-05-18 22:26 ` Jakub Narebski 2009-06-20 12:14 ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt 1 sibling, 1 reply; 59+ messages in thread From: Heiko Voigt @ 2009-05-18 9:37 UTC (permalink / raw) To: Martin Langhoff Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote: > On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote: > > At the moment non-ascii encodings of filenames are not portably converted > > between different filesystems by git. This will most likely change in the > > future but to allow repositories to be portable among different file/operating > > systems this check is enabled by default. > > Nice! > > - It'd be a good idea to add to the mix a check for filenames that > are equivalent in case-insensitive FSs. I agree, but that will be an extension in another patch. BTW, if anyone has a good idea how to efficiently do that kind of check in a hook I'd cook up a patch on top of this. > - Should all of this be a general "portablefilenames" setting? Well, if you can specify what general portable filenames would have as properties. Questions like: * What is the portable maximum path length? * How long may a filename be (DOS 8.3 ?) * Are windows keywords (PRN, ...) allowed? * ... So I think this should be on a per property basis providing sensible defaults to support the most standard case. cheers Heiko ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 9:37 ` Heiko Voigt @ 2009-05-18 22:26 ` Jakub Narebski 0 siblings, 0 replies; 59+ messages in thread From: Jakub Narebski @ 2009-05-18 22:26 UTC (permalink / raw) To: Heiko Voigt Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano On Mon, 18 May 2009, Heiko Voigt wrote: > On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote: > > - Should all of this be a general "portablefilenames" setting? > > Well, if you can specify what general portable filenames would have as > properties. "Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems" by David A. Wheeler http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 59+ messages in thread
* [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook 2009-05-15 10:52 ` Martin Langhoff 2009-05-18 9:37 ` Heiko Voigt @ 2009-06-20 12:14 ` Heiko Voigt 1 sibling, 0 replies; 59+ messages in thread From: Heiko Voigt @ 2009-06-20 12:14 UTC (permalink / raw) To: Martin Langhoff Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano This helps cross-platform projects on the case-sensitive filename side of operating systems to use filenames that are nice for the case-insensitive side --- On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote: > On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote: > > At the moment non-ascii encodings of filenames are not portably converted > > between different filesystems by git. This will most likely change in the > > future but to allow repositories to be portable among different file/operating > > systems this check is enabled by default. > - It'd be a good idea to add to the mix a check for filenames that > are equivalent in case-insensitive FSs. Totally untested. Just to get feedback if someone has ideas how this can be solved more efficiently. I suspect that processing all files will yield an unbearable performance degradation on large projects. Let me know what you think. The wording of the error message is not yet final. templates/hooks--pre-commit.sample | 21 +++++++++++++++++++++ 1 files changed, 21 insertions(+), 0 deletions(-) diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample index b11ad6a..32d1809 100755 --- a/templates/hooks--pre-commit.sample +++ b/templates/hooks--pre-commit.sample @@ -9,6 +9,10 @@ # If you want to allow non-ascii filenames set this variable to true. allownonascii=$(git config hooks.allownonascii) +# If you want to allow filenames that only differ in case set this +# variable to true. 
NOTE: This can degrade performance on projects with +# lots of files +allowcaseonly=$(git config hooks.allowcaseonly) # Cross platform projects tend to avoid non-ascii filenames; prevent # them from being added to the repository. We exploit the fact that the @@ -32,6 +36,23 @@ then exit 1 fi +# check for names that already exist but only differ in case +# which can be problematic on case-insensitive filesystems +if [ "$allowcaseonly" != "true" ] && + test -n "$(git ls-files | LC_ALL=C tr '[A-Z]' '[a-z]' | sort | uniq -d)" +then + echo "Error: Attempt to add file which already exists in different case" + echo + echo "If you know what you are doing you can disable this" + echo "check using:" + echo + echo " git config hooks.allowcaseonly true" + echo + exit 1 +fi + if git-rev-parse --verify HEAD >/dev/null 2>&1 then against=HEAD -- 1.6.3.2.203.g9a122 ^ permalink raw reply related [flat|nested] 59+ messages in thread
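For experimenting with a case-collision check outside the hook, a minimal sketch follows. It is an illustration under stated assumptions rather than part of the patch: the function name case_collisions is hypothetical, paths are assumed to contain no newlines, and only ASCII letters are folded. Two details matter for correctness: uniq -d only detects adjacent duplicates, so the folded names have to be sorted first, and a collision is signalled by non-empty output.

```shell
# Hypothetical helper: reads newline-separated paths on stdin and
# prints, in lower case, each name that occurs more than once once
# ASCII case differences are ignored.
case_collisions() {
	LC_ALL=C tr 'A-Z' 'a-z' |	# fold ASCII upper case to lower case
	LC_ALL=C sort |			# uniq -d needs duplicates to be adjacent
	uniq -d				# keep only names seen more than once
}
```

In a hook this would be used along the lines of `test -n "$(git ls-files | case_collisions)"`, failing the commit when the result is non-empty. It still walks every tracked file, so the performance concern raised above remains.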
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames 2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt 2009-05-15 10:52 ` Martin Langhoff @ 2009-05-15 14:57 ` Jakub Narebski 2009-05-18 9:50 ` [PATCH] " Heiko Voigt 2009-05-15 18:11 ` [PATCH v2] " Junio C Hamano 2 siblings, 1 reply; 59+ messages in thread From: Jakub Narebski @ 2009-05-15 14:57 UTC (permalink / raw) To: Heiko Voigt Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano <Insert standard Dscho disclaimer here...> ;-) In short: good idea, don't be discouraged by criticism... On Thu, 14 May 2009, Heiko Voigt wrote: > At the moment non-ascii encodings of filenames are not portably converted > between different filesystems by git. This will most likely change in the > future but to allow repositories to be portable among different file/operating > systems this check is enabled by default. By the way, you might consider choosing shorter line length for commits, something around 70-76 characters per line; otherwise it is harder to reply to without linewrapping. 80 characters that you used is, IMHO, absolute maximum, and it is good that you kept to it. > > Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net> > --- > +# If you want to allow non-ascii filenames set this variable to true. > +allownonascii=$(git config hooks.allownonascii) > + > +function is_ascii () { > + test -z "$(cat | sed -e "s/[\ -~]*//g")" > + return $? > +} From CodingGuidelines for shell scripts: - We do not write the noiseword "function" in front of shell functions. (in short: do not use bash-specific features... unless, of course, you are modifying bash-completion script). Second, it would be nice to have comment about how to use this function (as it does not check file given by its argument, but rather its standard input). 
And perhaps also a comment that it works because ASCII printable characters begin with ' ' space (does it have to be escaped?) and end with '~' tilde[2]. Third, isn't it useless use of 'cat'[3]? And wouldn't it be better to use 'tr' to either delete printable characters and check for anything left (as you do; BTW. wouldn't "return test ..." be simpler?), or use 'tr' to count non portable characters? [1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html [2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters [3] http://partmaps.org/era/unix/award.html#cat > + > +if [ "$allownonascii" != "true" ] > +then > + # until git can handle non-ascii filenames gracefully > + # prevent them to be added into the repository > + if ! git diff --cached --name-only --diff-filter=A -z \ > + | tr "\0" "\n" | is_ascii; then > + echo "Non-ascii filenames are not allowed !" > + echo "Please rename the file ..." > + exit 1 > + fi > +fi > + > if git-rev-parse --verify HEAD 2>/dev/null > then > against=HEAD > -- > 1.6.3 > > > > -- Jakub Narebski Poland ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] Extend sample pre-commit hook to check for non ascii filenames 2009-05-15 14:57 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski @ 2009-05-18 9:50 ` Heiko Voigt 2009-05-18 10:40 ` Johannes Sixt ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Heiko Voigt @ 2009-05-18 9:50 UTC (permalink / raw) To: Jakub Narebski, Junio C Hamano Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git At the moment non-ascii encodings of filenames are not portably converted between different filesystems by git. This will most likely change in the future but to allow repositories to be portable among different file/operating systems this check is enabled by default. Signed-off-by: Heiko <hvoigt@hvoigt.net> --- so here is a third version ... On Fri, May 15, 2009 at 04:57:45PM +0200, Jakub Narebski wrote: > On Thu, 14 May 2009, Heiko Voigt wrote: > > > At the moment non-ascii encodings of filenames are not portably converted > > between different filesystems by git. This will most likely change in the > > future but to allow repositories to be portable among different file/operating > > systems this check is enabled by default. > > By the way, you might consider choosing shorter line length for commits, > something around 70-76 characters per line; otherwise it is harder to > reply to without linewrapping. 80 characters that you used is, IMHO, > absolute maximum, and it is good that you kept to it. Yeah, I admit they were a little bit long. > > +function is_ascii () { > > + test -z "$(cat | sed -e "s/[\ -~]*//g")" > > + return $? > > +} > > From CodingGuidelines for shell scripts: > - We do not write the noiseword "function" in front of shell > functions. > > (in short: do not use bash-specific features... unless, of course, > you are modifying bash-completion script). Addressed. 
> Second, it would be nice to have comment about how to use this > function (as it does not check file given by its argument, but > rather its standard input). And perhaps also a comment that it > works because ASCII printable characters begin with ' ' space > (does it have to be escaped?) and end with '~' tilde[2]. Done > > Third, isn't it useless use of 'cat'[3]? And wouldn't it be better > to use 'tr' to either delete printable characters and check for > anything left (as you do; BTW. wouldn't "return test ..." be simpler?), > or use 'tr' to count non portable characters? Yes indeed it was useless. I also switched from sed to tr. > > [1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html > [2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters > [3] http://partmaps.org/era/unix/award.html#cat On Fri, May 15, 2009 at 11:11:12AM -0700, Junio C Hamano wrote: > Heiko Voigt <hvoigt@hvoigt.net> writes: > > +function is_ascii () { > > We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say > "function " here. See above. > > + test -z "$(cat | sed -e "s/[\ -~]*//g")" > > Do you need "cat | "? Also above. > Does this script run under LC_ALL=C? Can an i18n'ized sed interfere with > what you are trying to do? I now explicitly set LC_ALL=C for the tr call, which should now be robust against such cases. > > > + return $? > > Do you need this, or does the function return the result of the last > statement anyway? I wasn't aware of that. Removed the return. > > + echo "Non-ascii filenames are not allowed !" > > + echo "Please rename the file ..." > > Can we make this sound more like a _sample_ project policy? It's not like > we enforce that policy to other people's projects. I've polished this so we are now more user friendly as well. 
templates/hooks--pre-commit.sample | 32 ++++++++++++++++++++++++++++++++ 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample index 0e49279..91ab563 100755 --- a/templates/hooks--pre-commit.sample +++ b/templates/hooks--pre-commit.sample @@ -7,6 +7,38 @@ # # To enable this hook, rename this file to "pre-commit". +# If you want to allow non-ascii filenames set this variable to true. +allownonascii=$(git config hooks.allownonascii) + +# is_ascii() Tests the string given given on standard input for +# printable ascii conformance. We exploit the fact that the printable +# range starts at the space character and ends with tilde. +is_ascii() { + test -z "$(LC_ALL=C tr -d \ -~)" +} + +if [ "$allownonascii" != "true" ] +then + # until git can handle non-ascii filenames gracefully + # prevent them to be added into the repository + if ! git diff --cached --name-only --diff-filter=A -z \ + | tr "\0" "\n" | is_ascii; then + echo "Error: Preventing to add a non-ascii filename." + echo + echo "This can cause problems if you want to work together" + echo "with people on other platforms than you." + echo + echo "To be portable it is adviseable to rename the file ..." + echo + echo "If you know what you are doing you can disable this" + echo "check using:" + echo + echo " git config hooks.allownonascii true" + echo + exit 1 + fi +fi + if git-rev-parse --verify HEAD 2>/dev/null then against=HEAD -- 1.6.3 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 9:50 ` [PATCH] " Heiko Voigt @ 2009-05-18 10:40 ` Johannes Sixt 2009-05-18 11:50 ` Heiko Voigt 2009-05-19 20:01 ` [PATCH v4] " Heiko Voigt 2009-05-18 14:42 ` [PATCH] " Junio C Hamano 2009-05-18 20:35 ` Julian Phillips 2 siblings, 2 replies; 59+ messages in thread From: Johannes Sixt @ 2009-05-18 10:40 UTC (permalink / raw) To: Heiko Voigt Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov, Esko Luontola, git Heiko Voigt schrieb: > +# is_ascii() Tests the string given given on standard input for > +# printable ascii conformance. We exploit the fact that the printable > +# range starts at the space character and ends with tilde. > +is_ascii() { > + test -z "$(LC_ALL=C tr -d \ -~)" > +} > + > +if [ "$allownonascii" != "true" ] > +then > + # until git can handle non-ascii filenames gracefully > + # prevent them to be added into the repository > + if ! git diff --cached --name-only --diff-filter=A -z \ > + | tr "\0" "\n" | is_ascii; then Will this not fail to add more than one file with allowed names? The \n is not removed in is_ascii(), and so the resulting string will not be empty. BTW, not all tr work well with NULs. See the commit message of e85fe4d8, for example. Otherwise, I would have suggested to convert the NUL to some allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and '\n' (single-quotes) to guarantee that the shell does not ignore the backslash. -- Hannes ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 10:40 ` Johannes Sixt @ 2009-05-18 11:50 ` Heiko Voigt 2009-05-18 12:04 ` Johannes Sixt 2009-05-19 20:01 ` [PATCH v4] " Heiko Voigt 1 sibling, 1 reply; 59+ messages in thread From: Heiko Voigt @ 2009-05-18 11:50 UTC (permalink / raw) To: Johannes Sixt Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov, Esko Luontola, git On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote: > Heiko Voigt schrieb: > > +# is_ascii() Tests the string given given on standard input for > > +# printable ascii conformance. We exploit the fact that the printable > > +# range starts at the space character and ends with tilde. > > +is_ascii() { > > + test -z "$(LC_ALL=C tr -d \ -~)" > > +} > > + > > +if [ "$allownonascii" != "true" ] > > +then > > + # until git can handle non-ascii filenames gracefully > > + # prevent them to be added into the repository > > + if ! git diff --cached --name-only --diff-filter=A -z \ > > + | tr "\0" "\n" | is_ascii; then > > Will this not fail to add more than one file with allowed names? The \n is > not removed in is_ascii(), and so the resulting string will not be empty. No currently it does not. At least on my system, but good point. > BTW, not all tr work well with NULs. See the commit message of e85fe4d8, > for example. Otherwise, I would have suggested to convert the NUL to some > allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and > '\n' (single-quotes) to guarantee that the shell does not ignore the > backslash. Are there any problems with '\0' and tr other than swallowing of it. In case not I would just change tr "\0" "\n" to tr -d '\0' That way there are no '\n's left over and it doesn't matter if tr swallows the '\0'. Waiting for further comments before sending the cleanup. cheers Heiko ^ permalink raw reply [flat|nested] 59+ messages in thread
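The reason the check does not fail with several files is most likely command substitution itself: the shell strips all trailing newlines from the result of "$(...)", and when every name is printable ASCII the only bytes left after the tr -d step are newlines. A small sketch of that pipeline, using the made-up name v3_leftover and a quoted ' -~' in place of the patch's unquoted \ -~ (both are assumptions for illustration only):

```shell
# Hypothetical stand-in for the two tr steps of the hook: turn NUL
# separators into newlines, then delete all printable ASCII bytes.
v3_leftover() {
	tr '\0' '\n' | LC_ALL=C tr -d ' -~'
}

# Two plain ASCII names: only newlines survive, and $(...) drops them.
test -z "$(printf 'a.c\0b.c\0' | v3_leftover)" && echo "ascii only: passes"

# A UTF-8 o-umlaut (bytes 0xC3 0xB6) in the middle name survives.
test -n "$(printf 'a.c\0\303\266.c\0b.c\0' | v3_leftover)" && echo "non-ascii: caught"
```

This depends on a tr that copes with NUL bytes, which, as pointed out in the quoted message, not every implementation does.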
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 11:50 ` Heiko Voigt @ 2009-05-18 12:04 ` Johannes Sixt 0 siblings, 0 replies; 59+ messages in thread From: Johannes Sixt @ 2009-05-18 12:04 UTC (permalink / raw) To: Heiko Voigt Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov, Esko Luontola, git Heiko Voigt wrote: > Are there any problems with '\0' and tr other than swallowing of it. I can't tell. But the commits ae90e16..aab0abf are interesting to study w.r.t. portability. > In > case not I would just change > > tr "\0" "\n" > to > tr -d '\0' In which case I'd suggest that you call tr only once, in is_ascii(): tr -d '[ -~]\0' -- Hannes ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH v4] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 10:40 ` Johannes Sixt 2009-05-18 11:50 ` Heiko Voigt @ 2009-05-19 20:01 ` Heiko Voigt 1 sibling, 0 replies; 59+ messages in thread From: Heiko Voigt @ 2009-05-19 20:01 UTC (permalink / raw) To: Johannes Sixt, Junio C Hamano, Julian Phillips Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola, git At the moment non-ascii encodings of filenames are not portably converted between different filesystems by git. This will most likely change in the future but to allow repositories to be portable among different file/operating systems this check is enabled by default. Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net> --- Thanks for all comments. I now hopefully have a satisfying patch. On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote: > Heiko Voigt schrieb: > > + if ! git diff --cached --name-only --diff-filter=A -z \ > > + | tr "\0" "\n" | is_ascii; then > > Will this not fail to add more than one file with allowed names? The \n is > not removed in is_ascii(), and so the resulting string will not be empty. > > BTW, not all tr work well with NULs. See the commit message of e85fe4d8, > for example. Otherwise, I would have suggested to convert the NUL to some > allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and > '\n' (single-quotes) to guarantee that the shell does not ignore the > backslash. I removed all \0 characters and hopefully use the correct platform independent syntax as described in the commits you send. On Mon, May 18, 2009 at 02:04:08PM +0200, Johannes Sixt wrote: > Heiko Voigt schrieb: > > Are there any problems with '\0' and tr other than swallowing of it. > > I can't tell. But the commits ae90e16..aab0abf are interesting to study in > w.r.t. portability. 
> > > In > > case not I would just change > > > > tr "\0" "\n" > > to > > tr -d '\0' > > In which case I'd suggest that you call tr only once, in isascii(): > > tr -d '[ -~]\0' After reading a little about the portability things. This seems to be the right way and is now included. On Mon, May 18, 2009 at 07:42:31AM -0700, Junio C Hamano wrote: > Heiko Voigt <hvoigt@hvoigt.net> writes: > > > +if [ "$allownonascii" != "true" ] > > +then > > + # until git can handle non-ascii filenames gracefully > > + # prevent them to be added into the repository > > I think you can inline your is_ascii shell function in the pipeline below. > You made it a separate function and I agree that it has a very good > documentation value, but the mention of "non-ascii filenames" in this > comment here is enough clue to let anybody know what is going on. I agree. I thought it would probably be useful in other places but we just need it once so its inlined now. > > Side note: I am not sure "Until ... can ... gracefully" is a good > description of the general problem. It probably is more neutral > to say "Cross platform projects tend to avoid non-ascii filenames; > prevent them from being added to the repository." Changed that. > > > + if ! git diff --cached --name-only --diff-filter=A -z \ > > + | tr "\0" "\n" | is_ascii; then > > A standard trick while writing a long pipeline in shell is to change line > after a pipe, like: > > cmd1 | > cmd2 | > cmd3 > > which allows you to lose the BS-before-LF sequence. Wasn't aware of that. Changed it accordingly. On Mon, May 18, 2009 at 09:35:19PM +0100, Julian Phillips wrote: > On Mon, 18 May 2009, Heiko Voigt wrote: >> + echo "Error: Preventing to add a non-ascii filename." > > This would read better as: > > + echo "Error: Attempt to add a non-ascii filename." > > (after all the prevention itself is a result of the error, not the cause > of it) That really sounds better. Thanks. 
templates/hooks--pre-commit.sample | 25 +++++++++++++++++++++++++ 1 files changed, 25 insertions(+), 0 deletions(-) diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample index 0e49279..ad892a2 100755 --- a/templates/hooks--pre-commit.sample +++ b/templates/hooks--pre-commit.sample @@ -7,6 +7,31 @@ # # To enable this hook, rename this file to "pre-commit". +# If you want to allow non-ascii filenames set this variable to true. +allownonascii=$(git config hooks.allownonascii) + +# Cross platform projects tend to avoid non-ascii filenames; prevent +# them from being added to the repository. We exploit the fact that the +# printable range starts at the space character and ends with tilde. +if [ "$allownonascii" != "true" ] && + test "$(git diff --cached --name-only --diff-filter=A -z | + LC_ALL=C tr -d '[ -~]\0')" +then + echo "Error: Attempt to add a non-ascii filename." + echo + echo "This can cause problems if you want to work together" + echo "with people on other platforms than you." + echo + echo "To be portable it is advisable to rename the file ..." + echo + echo "If you know what you are doing you can disable this" + echo "check using:" + echo + echo " git config hooks.allownonascii true" + echo + exit 1 +fi + if git-rev-parse --verify HEAD 2>/dev/null then against=HEAD -- 1.6.3 ^ permalink raw reply related [flat|nested] 59+ messages in thread
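The test at the core of this version can be tried outside a hook. A minimal sketch, wrapping the same tr invocation in a function that reads NUL-separated names on standard input; the name has_non_ascii is hypothetical and the function is not part of the patch:

```shell
# Hypothetical wrapper around the tr call used in the hook: succeeds
# when the NUL-separated names on stdin contain any byte outside the
# printable ASCII range (space through tilde).
has_non_ascii() {
	# Delete printable ASCII and the NUL separators; whatever is
	# left is a control character or a non-ASCII byte.
	test -n "$(LC_ALL=C tr -d '[ -~]\0')"
}
```

For example, `git diff --cached --name-only --diff-filter=A -z | has_non_ascii` would report whether any newly added path needs attention. Since newlines are not deleted, a name containing a newline is flagged as well.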
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames 2009-05-18 9:50 ` [PATCH] " Heiko Voigt 2009-05-18 10:40 ` Johannes Sixt @ 2009-05-18 14:42 ` Junio C Hamano 2009-05-18 20:35 ` Julian Phillips 2 siblings, 0 replies; 59+ messages in thread From: Junio C Hamano @ 2009-05-18 14:42 UTC (permalink / raw) To: Heiko Voigt Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov, Esko Luontola, git Heiko Voigt <hvoigt@hvoigt.net> writes: > +if [ "$allownonascii" != "true" ] > +then > + # until git can handle non-ascii filenames gracefully > + # prevent them to be added into the repository I think you can inline your is_ascii shell function in the pipeline below. You made it a separate function and I agree that it has a very good documentation value, but the mention of "non-ascii filenames" in this comment here is enough clue to let anybody know what is going on. Side note: I am not sure "Until ... can ... gracefully" is a good description of the general problem. It probably is more neutral to say "Cross platform projects tend to avoid non-ascii filenames; prevent them from being added to the repository." > + if ! git diff --cached --name-only --diff-filter=A -z \ > + | tr "\0" "\n" | is_ascii; then A standard trick while writing a long pipeline in shell is to change line after a pipe, like: cmd1 | cmd2 | cmd3 which allows you to lose the BS-before-LF sequence. I think comments from J6t and others are valuable but clear enough that I wouldn't have to repeat them. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18  9:50 ` [PATCH] " Heiko Voigt
  2009-05-18 10:40   ` Johannes Sixt
  2009-05-18 14:42   ` [PATCH] " Junio C Hamano
@ 2009-05-18 20:35   ` Julian Phillips
  2 siblings, 0 replies; 59+ messages in thread
From: Julian Phillips @ 2009-05-18 20:35 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
      Esko Luontola, git

On Mon, 18 May 2009, Heiko Voigt wrote:

> +if [ "$allownonascii" != "true" ]
> +then
> +    # until git can handle non-ascii filenames gracefully
> +    # prevent them to be added into the repository
> +    if ! git diff --cached --name-only --diff-filter=A -z \
> +        | tr "\0" "\n" | is_ascii; then
> +        echo "Error: Preventing to add a non-ascii filename."

This would read better as:

+        echo "Error: Attempt to add a non-ascii filename."

(after all, the prevention itself is a result of the error, not the cause
of it)

If you want to keep the "preventing", then you need to at least correct the
English:

> +        echo "Error: Preventing addition of a non-ascii filename."

-- 
Julian

---
QOTD:
    Money isn't everything, but at least it keeps the kids in touch.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-14 17:59 ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
  2009-05-15 10:52   ` Martin Langhoff
  2009-05-15 14:57   ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
@ 2009-05-15 18:11   ` Junio C Hamano
  2 siblings, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2009-05-15 18:11 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola, git,
      Junio C Hamano

Heiko Voigt <hvoigt@hvoigt.net> writes:

> diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
> index 0e49279..3083735 100755
> --- a/templates/hooks--pre-commit.sample
> +++ b/templates/hooks--pre-commit.sample
> @@ -7,6 +7,26 @@
>  #
>  # To enable this hook, rename this file to "pre-commit".
>
> +# If you want to allow non-ascii filenames set this variable to true.
> +allownonascii=$(git config hooks.allownonascii)
> +
> +function is_ascii () {

We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say
"function " here.

> +    test -z "$(cat | sed -e "s/[\ -~]*//g")"

Do you need "cat | "?

Does this script run under LC_ALL=C? Can an i18n'ized sed interfere with
what you are trying to do?

> +    return $?

Do you need this, or does the function return the result of the last
statement anyway?

> +        echo "Non-ascii filenames are not allowed !"
> +        echo "Please rename the file ..."

Can we make this sound more like a _sample_ project policy? It's not like
we enforce that policy on other people's projects.

> +        exit 1
> +    fi
> +fi
> +
>  if git-rev-parse --verify HEAD 2>/dev/null
>  then
>      against=HEAD
> --
> 1.6.3

^ permalink raw reply [flat|nested] 59+ messages in thread
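The review points in the message above could be addressed along the following lines. This is a sketch of one possible shape for the helper, not the version that was later submitted. It uses the portable `name ()` definition syntax instead of the bash-only `function` keyword, lets `sed` read standard input directly instead of going through `cat`, pins the locale with `LC_ALL=C`, and relies on a shell function returning the status of its last command. The sample file names are made up.

```shell
#!/bin/sh
# Portable definition: no bash-only "function" keyword.
is_ascii () {
    # sed reads stdin directly (no "cat |"). LC_ALL=C makes the range
    # " -~" mean the bytes 0x20..0x7e regardless of the user's locale.
    # The function's exit status is that of "test", so no "return $?".
    test -z "$(LC_ALL=C sed -e 's/[ -~]*//g')"
}

# Sample data: newline-separated names, all printable ASCII.
if printf 'Makefile\nsrc/main.c\n' | is_ascii
then
    echo "all names are ascii"
fi

# Sample data: a name containing UTF-8 encoded e-acute.
if ! printf 'caf\303\251.txt\n' | is_ascii
then
    echo "found a non-ascii name"
fi
```

Note that this variant works on newline-separated input, as the v2 patch did after `tr "\0" "\n"`, so a file name that itself contains a newline is not handled precisely.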
* Re: Cross-Platform Version Control
  2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
  2009-05-12 15:14 ` Shawn O. Pearce
  2009-05-12 18:28 ` Dmitry Potapov
@ 2009-05-14 13:48 ` Peter Krefting
  2009-05-14 19:58   ` Esko Luontola
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Krefting @ 2009-05-14 13:48 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git

Esko Luontola:

> A good start for making Git cross-platform, would be storing the text
> encoding of every file name and commit message together with the commit.

Is it really necessary to store the encoding for every single file name,
should it not be enough to just store encoding information for all file
names at once (i.e., for the object that contains the list of file names
and their associated blobs)?

I did publish, as a request for comments, the beginnings of a patch that
would change the Windows version of Git to expect file names to be UTF-8
encoded. There were some comments about it, especially that I could not
just assume that UTF-8 was the right thing to assume. Perhaps if we added
some meta-data, maybe using the same fall-back mechanism as for commit
messages (i.e., assume UTF-8 unless otherwise specified), it would be
easier to do.

On Windows, the file APIs allow you to use Unicode (UTF-16) to specify
file names, and the file systems will handle any necessary conversion to
whatever byte sequences are used to store the file names. UTF-16 and UTF-8
are trivial to convert between, and Windows does contain APIs to convert
between other character encodings and UTF-16.

On Mac OS X, I believe the file system APIs assume you use some kind of
normalized UTF-8. That should also be possible to create, possibly
converting back and forth between different normalization forms, if
necessary.

On Linux and other Unixes we could just use iconv() to convert from the
repository file name encoding to whatever the current locale has set up.
The trick here is to handle file names outside the current encoding. Some
kind of escaping mechanism will probably need to be introduced.

The best way would be to define this in the Git core once and for all, and
add support to it for all the platforms in the same go, instead of trying
to hack around the issue whenever it pops up on the various platforms.

My main use-case for Git on Windows has disappeared as my $dayjob went
bankrupt, but I am happy to assist with whatever insight I may be able to
bring.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply [flat|nested] 59+ messages in thread
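The Unix-side conversion described in the message above can be tried from the shell with the `iconv` command-line tool, which is the command counterpart of the iconv() library call. A minimal sketch, assuming purely for illustration that the repository stores names in ISO-8859-1 and the local locale wants UTF-8; the helper name `repo_to_local` is hypothetical and nothing like it exists in Git.

```shell
#!/bin/sh
# Hypothetical helper: convert a file name read from stdin from an assumed
# repository encoding (ISO-8859-1) to an assumed local encoding (UTF-8).
repo_to_local () {
    iconv -f ISO-8859-1 -t UTF-8
}

# "\344" is a-umlaut in ISO-8859-1. Dump the converted name as hex bytes;
# the single byte e4 should come out as the UTF-8 pair c3 a4.
printf 'b\344r.txt' | repo_to_local | od -An -tx1
```

This direction always succeeds because every ISO-8859-1 character exists in Unicode. The opposite direction is where the escaping question raised above comes in: a name containing characters that the local encoding cannot represent has no direct conversion.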
* Re: Cross-Platform Version Control
  2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
@ 2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
                        ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-14 19:58 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

Peter Krefting wrote on 14.5.2009 16:48:
> Is it really necessary to store the encoding for every single file name,
> should it not be enough to just store encoding information for all file
> names at once (i.e., for the object that contains the list of file names
> and their associated blobs)?

What about if some disorganized project has people committing with many
different encodings? Should we allow it, that a directory has the names of
some files using one encoding, and the names of other files using another
encoding? Or should we force the whole repository to use the same
encoding?

> The best way would be to define this in the Git core once and for all,
> and add support to it for all the platforms in the same go, instead of
> trying to hack around the issue whenever it pops up on the various
> platforms.

+1

-- 
Esko Luontola
www.orfjackal.net

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
@ 2009-05-14 20:21     ` Andreas Ericsson
  2009-05-14 22:25     ` Johannes Schindelin
  2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-14 20:21 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

Esko Luontola wrote:
> Peter Krefting wrote on 14.5.2009 16:48:
>> Is it really necessary to store the encoding for every single file
>> name, should it not be enough to just store encoding information for
>> all file names at once (i.e., for the object that contains the list of
>> file names and their associated blobs)?
>
> What about if some disorganized project has people committing with many
> different encodings? Should we allow it, that a directory has the names
> of some files using one encoding, and the names of other files using
> another encoding? Or should we force the whole repository to use the
> same encoding?
>

If encodings are on a per-tree basis, we could add a special mode-flag for
it without breaking backwards compatibility (I think, anyways). Older gits
just won't know how to handle it and will treat it as a byte-stream.

>> The best way would be to define this in the Git core once and for all,
>> and add support to it for all the platforms in the same go, instead of
>> trying to hack around the issue whenever it pops up on the various
>> platforms.
>
> +1
>

There's still the problem that no one has stepped forward to do all that
work yet, so apparently this isn't important enough for people to put
their patches where their mouths are. Often when issues generate long
discussions and no code, it's of high academic interest and of little
real-world value.

I believe the "little real-world value" here comes from the fact that
cross-platform projects often enforce 7-bit ascii compatible filenames
from the start, because they *know* they may run into problems on other
filesystems otherwise. Remember it's not only git that has to get things
right. It's also build-systems and compilers that have to locate the
correct files (the Makefile and the filesystem may use different
encodings), so in the real world, people really do stay away from
filenames with åäö or other non-ascii chars in them.

It's fun to discuss, but I won't spend any time on it. Good luck to those
who do though. I'd quite like to see if someone could pull it off without
breaking backwards compatibility or impacting performance too much.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
@ 2009-05-14 22:25     ` Johannes Schindelin
  2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-14 22:25 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

Hi,

On Thu, 14 May 2009, Esko Luontola wrote:

> Peter Krefting wrote on 14.5.2009 16:48:
>
> > The best way would be to define this in the Git core once and for all,
> > and add support to it for all the platforms in the same go, instead of
> > trying to hack around the issue whenever it pops up on the various
> > platforms.
>
> +1

You might be enthusiastic about this cunning idea. However, if it costs me
performance on Linux, and all the benefits go to Windows users, then I
will remove this "solution" from my personal Git tree _right away_, and
I'd expect a lot of other people, too.

I repeat this just once more: if you add complexity, you'll have to have a
compelling reason to do so. If there is no benefit for Linux users, why
should they bear the cost?

But as Andreas remarked, I sincerely think that there has been enough talk
about the issue. It's time to see some patches, or to stop the discussion.

Ciao,
Dscho

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
  2009-05-14 22:25     ` Johannes Schindelin
@ 2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Dmitry Potapov @ 2009-05-15 11:18 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

On Thu, May 14, 2009 at 10:58:17PM +0300, Esko Luontola wrote:
>
> What about if some disorganized project has people committing with many
> different encodings? Should we allow it, that a directory has the names
> of some files using one encoding, and the names of other files using
> another encoding? Or should we force the whole repository to use the
> same encoding?

The whole repository should have the same encoding internally. Anything
else will be too complex and too slow... Have you seen any file system
where file names would be stored in different encodings? And Git does far
more operations on file names than a file system does. So, it is clear to
me that the whole repository should have a single encoding.

Now, I don't think that you will find many open source projects that use
non-ASCII in file names. Moreover, most Linux users either use UTF-8
already or will switch to it in the near future. Mac OS X uses UTF-8
(though there is a problem with decomposed characters, but Linus posted a
possible solution). So, the only platform where non-ASCII characters may
be interesting to Git users and that does not support UTF-8 is Windows.
AFAIK, Cygwin 1.7 has UTF-8 support. So, it is mostly a problem for
msysGit...

Though adding support for legacy encodings can help to some degree, it
means that every system call involving a file name will go through
UTF-8 <-> LEGACY_ENC <-> UTF-16LE conversion. IMHO, having a legacy
encoding involved is far from the best possible solution; but to avoid
that, you need to change MSYS to be able to work with UTF-8. (I have never
looked at MSYS myself, but I suspect it may not be easy).

Dmitry

^ permalink raw reply [flat|nested] 59+ messages in thread