git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Cross-Platform Version Control
@ 2009-05-12 15:06 Esko Luontola
  2009-05-12 15:14 ` Shawn O. Pearce
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 15:06 UTC (permalink / raw)
  To: git

A good start for making Git cross-platform, would be storing the text  
encoding of every file name and commit message together with the  
commit. Currently, because Git is oblivious to the encodings and just  
considers them as a series of bytes, there is no way to make them  
cross-platform. It's as http://www.joelonsoftware.com/articles/Unicode.html 
  says, "It does not make sense to have a string without knowing what  
encoding it uses." Without explicit encoding information, making a  
system that works even on the three main platforms, let alone in all  
countries and languages, is simply not possible.

On the other hand, if the encoding is explicitly stated in the  
repository, then it is possible for platform and locale aware Git  
clients to handle the file names and commit messages in whatever way  
makes most sense for the platform (for example convert the file names  
to the platform's encoding, if it differs from the committer's  
platform encoding). Then it would also be possible to create a Mac  
version of Git, which compensates for Mac OS X's file system's file  
name encoding peculiarities. Also the system could then warn (on "git  
add") if the data does not look like it has been encoded with the said  
encoding.

If the platform's and the repository's encoding happen to be the same  
(which in reality might be possible only inside a small company where  
everybody is forced to use the same OS and is configured by a single  
sysadmin), then no conversions need to be done. Also Git purists, who  
think that the byte sequence representing a file name are more  
important than the human readable version of the file name, may use  
some configuration switch that disables all conversions - but even  
then the current encoding should be stored together with the commit.

Are there any plans on storing the encoding information of file names  
and commit messages in the Git repository? How much time would  
implementing it take? Any ideas on how to maintain backwards  
compatibility (for old commits that do not have the encoding  
information)?

- Esko

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
@ 2009-05-12 15:14 ` Shawn O. Pearce
  2009-05-12 16:13   ` Johannes Schindelin
  2009-05-12 16:16   ` Jeff King
  2009-05-12 18:28 ` Dmitry Potapov
  2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
  2 siblings, 2 replies; 59+ messages in thread
From: Shawn O. Pearce @ 2009-05-12 15:14 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git

Esko Luontola <esko.luontola@gmail.com> wrote:
> Are there any plans on storing the encoding information of file names  
> and commit messages in the Git repository?

Commit messages already store their encoding in an optional
"encoding" header if the message isn't stored in UTF-8, or
US-ASCII, which is a strict subset of UTF-8.

As for file names, no plans, its a sequence of bytes, but I think a
lot of people wind up using some subset of US-ASCII for their file
names, especially if their project is going to be cross platform.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 15:14 ` Shawn O. Pearce
@ 2009-05-12 16:13   ` Johannes Schindelin
  2009-05-12 17:56     ` Esko Luontola
  2009-05-12 16:16   ` Jeff King
  1 sibling, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 16:13 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Esko Luontola, git

Hi,

On Tue, 12 May 2009, Shawn O. Pearce wrote:

> Esko Luontola <esko.luontola@gmail.com> wrote:
> > Are there any plans on storing the encoding information of file names 
> > and commit messages in the Git repository?
> 
> Commit messages already store their encoding in an optional "encoding" 
> header if the message isn't stored in UTF-8, or US-ASCII, which is a 
> strict subset of UTF-8.
> 
> As for file names, no plans, its a sequence of bytes, but I think a
> lot of people wind up using some subset of US-ASCII for their file
> names, especially if their project is going to be cross platform.

Some context: this issue cropped up in msysGit, of course.

As to storing all file names in UTF-8, my point about Unicode being not 
necessarily appropriate for everyone still stands.

UTF-8 _might_ be the de-facto standard for Linux filesystems, but 
IMHO we should not take away the freedom for everybody to decide what they 
want their file names to be encoded as.

However, I see that there might be a need to be able to encode the file 
names differently, such as on Windows.  IMHO the best solution would be 
a config variable controlling the reencoding of file names.

For some time, it looked as if two people were interested in implementing 
something like that (Peter and Robin IIRC), but efforts have stalled.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 15:14 ` Shawn O. Pearce
  2009-05-12 16:13   ` Johannes Schindelin
@ 2009-05-12 16:16   ` Jeff King
  2009-05-12 16:57     ` Johannes Schindelin
  2009-05-13 16:26     ` Linus Torvalds
  1 sibling, 2 replies; 59+ messages in thread
From: Jeff King @ 2009-05-12 16:16 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Esko Luontola, git

On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote:

> As for file names, no plans, its a sequence of bytes, but I think a
> lot of people wind up using some subset of US-ASCII for their file
> names, especially if their project is going to be cross platform.

Or they use a single encoding like utf8 so that there are no surprises.
You can still run into normalization problems with filenames on some
filesystems, though.  Linus's name_hash code sets up the framework to
handle "these two names are actually equivalent", but right now I think
there is just code for handling case-sensitivity, not utf8 normalization
(but I just skimmed the code, so I might be wrong).

-Peff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 16:16   ` Jeff King
@ 2009-05-12 16:57     ` Johannes Schindelin
  2009-05-13 16:26     ` Linus Torvalds
  1 sibling, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 16:57 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git

Hi,

On Tue, 12 May 2009, Jeff King wrote:

> On Tue, May 12, 2009 at 08:14:03AM -0700, Shawn O. Pearce wrote:
> 
> > As for file names, no plans, its a sequence of bytes, but I think a 
> > lot of people wind up using some subset of US-ASCII for their file 
> > names, especially if their project is going to be cross platform.
> 
> Or they use a single encoding like utf8 so that there are no surprises. 
> You can still run into normalization problems with filenames on some 
> filesystems, though.  Linus's name_hash code sets up the framework to 
> handle "these two names are actually equivalent", but right now I think 
> there is just code for handling case-sensitivity, not utf8 normalization 
> (but I just skimmed the code, so I might be wrong).

Back then I actually started on a patch to make Git capable of determining 
UTF-8 equivalence, but at the same time somebody started such an annoying 
mail thread that I stopped working on the issue completely.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 16:13   ` Johannes Schindelin
@ 2009-05-12 17:56     ` Esko Luontola
  2009-05-12 20:38       ` Johannes Schindelin
  0 siblings, 1 reply; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 17:56 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn O. Pearce, git

On 12.5.2009, at 19:13, Johannes Schindelin wrote:
> As to storing all file names in UTF-8, my point about Unicode being  
> not
> necessarily appropriate for everyone still stands.
>
> UTF-8 _might_ be the de-facto standard for Linux filesystems, but
> IMHO we should not take away the freedom for everybody to decide  
> what they
> want their file names to be encoded as.
>
> However, I see that there might be a need to be able to encode the  
> file
> names differently, such as on Windows.  IMHO the best solution would  
> be
> a config variable controlling the reencoding of file names.

Exactly. The system should not force the use of a specific encoding.  
It should only offer a recommendation, but be also fully compatible if  
the user uses some other encoding.

That's why it's best to always store the information about what  
encoding was used. It shouldn't matter, whether the data is encoded  
with ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as  
long as it is explicitly said that what the encoding is. Then the  
reader of the data can best decide, how to show that data on the  
current platform.

A config variable for defining, that what encoding should be used when  
committing the file names, would make sense. Git should also try to  
autodetect, that what encoding is used in its current environment. In  
the case of UTF-8, you should also be able to specify which  
normalization form is used (http://www.unicode.org/unicode/reports/ 
tr15/), or whether it is normalized at all.

For example, it should be possible to configure Git so, that when a  
file is checked out on Mac, its file name is converted to the current  
file system's encoding (UTF-8 NFD, I think), and when the file is  
committed on Mac, the file name is normalized back to the same UTF-8  
form as is used on Linux (UTF-8 NFC).

It would be nice to have config variables for saying, that all file  
names in this repository must use UTF-8 NFC, and all commit messages  
must use UTF-8 NFC (with Unix newlines). Then the Git client would  
autodetect the current environment's encoding, and convert the text,  
if necessary, to match the repository's encoding.

- Esko

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
  2009-05-12 15:14 ` Shawn O. Pearce
@ 2009-05-12 18:28 ` Dmitry Potapov
  2009-05-12 18:40   ` Martin Langhoff
  2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
  2 siblings, 1 reply; 59+ messages in thread
From: Dmitry Potapov @ 2009-05-12 18:28 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git

On Tue, May 12, 2009 at 06:06:05PM +0300, Esko Luontola wrote:
> A good start for making Git cross-platform, would be storing the text  
> encoding of every file name and commit message together with the commit. 
> Currently, because Git is oblivious to the encodings and just considers 
> them as a series of bytes, there is no way to make them cross-platform. 

1. Git already stores the endcoding for all commit messages that are not
   in UTF-8.

2. If you really want to be cross-platform portable, you should not use
   any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
   Filename Character Set)
   http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276


Dmitry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 18:28 ` Dmitry Potapov
@ 2009-05-12 18:40   ` Martin Langhoff
  2009-05-12 18:55     ` Jakub Narebski
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Langhoff @ 2009-05-12 18:40 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Esko Luontola, git

On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> 2. If you really want to be cross-platform portable, you should not use
>   any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
>   Filename Character Set)
>   http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276

Would it make sense to have warnings at 'git add' time about

 - filenames outside of that charset (as the strictest mode, perhaps
even default)
 - filenames that have a potential conflict wrt case-sensitivity
 - filenames that have potential conflict in the same tree due to
utf-8 encoding vagaries

MHO is that a strict "start your project portable from day one" mode
is best as a default. But I'd be happy with any default, actually ;-)



m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 18:40   ` Martin Langhoff
@ 2009-05-12 18:55     ` Jakub Narebski
  2009-05-12 21:43       ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
  0 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-12 18:55 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Dmitry Potapov, Esko Luontola, git

Martin Langhoff <martin.langhoff@gmail.com> writes:
> On Tue, May 12, 2009 at 8:28 PM, Dmitry Potapov <dpotapov@gmail.com> wrote:

> > 2. If you really want to be cross-platform portable, you should not use
> >   any characters in filenames outside of [A-Za-z0-9._-] (i.e. Portable
> >   Filename Character Set)
> >   http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html#tag_03_276
> 
> Would it make sense to have warnings at 'git add' time about
> 
>  - filenames outside of that charset (as the strictest mode, perhaps
> even default)
>  - filenames that have a potential conflict wrt case-sensitivity
>  - filenames that have potential conflict in the same tree due to
> utf-8 encoding vagaries
> 
> MHO is that a strict "start your project portable from day one" mode
> is best as a default. But I'd be happy with any default, actually ;-)

Somebody asked for a pre-add hook in the past; it would be good place
to put such check.  But in meantime you can do it using pre-commit
hook instead, isn't it?

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 17:56     ` Esko Luontola
@ 2009-05-12 20:38       ` Johannes Schindelin
  2009-05-12 21:16         ` Esko Luontola
  0 siblings, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-12 20:38 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Shawn O. Pearce, git

Hi,

On Tue, 12 May 2009, Esko Luontola wrote:

> On 12.5.2009, at 19:13, Johannes Schindelin wrote:
> >As to storing all file names in UTF-8, my point about Unicode being not 
> >necessarily appropriate for everyone still stands.
> >
> >UTF-8 _might_ be the de-facto standard for Linux filesystems, but IMHO 
> >we should not take away the freedom for everybody to decide what they 
> >want their file names to be encoded as.
> >
> >However, I see that there might be a need to be able to encode the file 
> >names differently, such as on Windows.  IMHO the best solution would be 
> >a config variable controlling the reencoding of file names.
> 
> Exactly. The system should not force the use of a specific encoding. It 
> should only offer a recommendation, but be also fully compatible if the 
> user uses some other encoding.
> 
> That's why it's best to always store the information about what encoding 
> was used. It shouldn't matter, whether the data is encoded with 
> ISO-8859-1, UTF-8, Shift_JIS, Big5 or some other encoding, as long as it 
> is explicitly said that what the encoding is. Then the reader of the 
> data can best decide, how to show that data on the current platform.
> 
> A config variable for defining, that what encoding should be used when 
> committing the file names, would make sense. Git should also try to 
> autodetect, that what encoding is used in its current environment. In 
> the case of UTF-8, you should also be able to specify which 
> normalization form is used 
> (http://www.unicode.org/unicode/reports/tr15/), or whether it is 
> normalized at all.
> 
> For example, it should be possible to configure Git so, that when a file 
> is checked out on Mac, its file name is converted to the current file 
> system's encoding (UTF-8 NFD, I think), and when the file is committed 
> on Mac, the file name is normalized back to the same UTF-8 form as is 
> used on Linux (UTF-8 NFC).
> 
> It would be nice to have config variables for saying, that all file 
> names in this repository must use UTF-8 NFC, and all commit messages 
> must use UTF-8 NFC (with Unix newlines). Then the Git client would 
> autodetect the current environment's encoding, and convert the text, if 
> necessary, to match the repository's encoding.

That is a nice analysis.  How about implementing it?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 20:38       ` Johannes Schindelin
@ 2009-05-12 21:16         ` Esko Luontola
  2009-05-13  0:23           ` Johannes Schindelin
  0 siblings, 1 reply; 59+ messages in thread
From: Esko Luontola @ 2009-05-12 21:16 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Shawn O. Pearce

Johannes Schindelin wrote on 12.5.2009 23:38:
> That is a nice analysis.  How about implementing it?
> 

Do we have here somebody, who knows Git's code well and is motivated to 
implement this?

I don't think that I would be capable, because of not having used C 
much, being new to Git's codebase and having too little time. But I can 
help with the requirements specification, interaction design and system 
testing.

-- 
Esko Luontola
www.orfjackal.net

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames
  2009-05-12 18:55     ` Jakub Narebski
@ 2009-05-12 21:43       ` Heiko Voigt
  2009-05-12 21:55         ` Jakub Narebski
  0 siblings, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-12 21:43 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

At the moment non-ascii encodings of file/usernames are not very well
supported by git. This will most likely change in the future but to
allow repositories to be portable among different file/operating systems
this check is enabled by default.

Signed-off-by: Heiko Voigt <heiko.voigt@mahr.de>
---
On Tue, May 12, 2009 at 11:55:39AM -0700, Jakub Narebski wrote:
> Somebody asked for a pre-add hook in the past; it would be good place
> to put such check.  But in meantime you can do it using pre-commit
> hook instead, isn't it?

I actually had this in my queue to be submitted...

 templates/hooks--pre-commit.sample |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..83ff873 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,39 @@
 #
 # To enable this hook, rename this file to "pre-commit".
 
+# If you want to allow non-ascii filenames or usernames set
+# this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+function is_ascii () {
+    test -z "$(cat | sed -e "s/[\ -~]*//g")"
+    return $?
+}
+
+if [ "$allownonascii" != "true" ]
+then
+	# until git can handle non-ascii filenames gracefully
+	# prevent them to be added into the repository
+	if ! git diff --cached --name-only --diff-filter=A -z \
+			| tr "\0" "\n" | is_ascii; then
+		echo "Non-ascii filenames are not allowed !"
+		echo "Please rename the file ..."
+		exit 1
+	fi
+
+	# non-ascii username issue a warning in git gui so tell the
+	# user to change it
+	if ! git config user.name | is_ascii; then
+		echo "Please only use ascii characters in your username!"
+		exit 1
+	fi
+
+	if ! git config user.email | is_ascii; then
+		echo "Please only use ascii characters in your email!"
+		exit 1
+	fi
+fi
+
 if git-rev-parse --verify HEAD 2>/dev/null
 then
 	against=HEAD
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames
  2009-05-12 21:43       ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
@ 2009-05-12 21:55         ` Jakub Narebski
  2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
  0 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-12 21:55 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

On Tue, 12 May 2009, Heiko Voigt wrote:

> At the moment non-ascii encodings of file/usernames are not very well
> supported by git. This will most likely change in the future but to
> allow repositories to be portable among different file/operating systems
> this check is enabled by default.

> +	# non-ascii username issue a warning in git gui so tell the
> +	# user to change it
> +	if ! git config user.name | is_ascii; then
> +		echo "Please only use ascii characters in your username!"
> +		exit 1
> +	fi
> +
> +	if ! git config user.email | is_ascii; then
> +		echo "Please only use ascii characters in your email!"
> +		exit 1
> +	fi

Actually 1.) there is no easy way to avoid non-ASCII names at least
in user.name (I think they are not allowed in email), but 2.) there
is no trouble with non-ASCII encoding of commits, as they have 
'encoding' header if it is not uft-8 (see *encoding* config variables).

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 21:16         ` Esko Luontola
@ 2009-05-13  0:23           ` Johannes Schindelin
  2009-05-13  5:34             ` Esko Luontola
  0 siblings, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13  0:23 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git, Shawn O. Pearce

Hi,

On Wed, 13 May 2009, Esko Luontola wrote:

> Johannes Schindelin wrote on 12.5.2009 23:38:
> > That is a nice analysis.  How about implementing it?
> > 
> 
> Do we have here somebody, who knows Git's code well and is motivated to
> implement this?
> 
> I don't think that I would be capable, because of not having used C 
> much, being new to Git's codebase and having too little time.

Well, that rather settles things, no?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13  0:23           ` Johannes Schindelin
@ 2009-05-13  5:34             ` Esko Luontola
  2009-05-13  6:49               ` Alex Riesen
  2009-05-13 10:15               ` Johannes Schindelin
  0 siblings, 2 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-13  5:34 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Shawn O. Pearce

Johannes Schindelin wrote on 13.5.2009 3:23:
> Well, that rather settles things, no?
> 

There is need for the feature, but it's unfortunate that the Git 
developers do not see its value. There are many users for whom using 
non-ASCII names is necessary (for example all of Asia and most of 
Europe), but now it seems that Bazaar is the only DVCS that handles 
encodings correctly: 
http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames

Let's see if I have time later this or next year to work on it. At least 
it would be good practise in getting acquainted with a new codebase and 
learning C. But it would be better for someone else do it, to get it 
done within a reasonable amount of time.

I see that there are some tests in the /t directory. Which command will 
run all of them, how good coverage do the tests have, how reproducable 
and isolated they are, how many seconds does it take to run all the 
tests? Is there some high-level documentation for new developers?

-- 
Esko Luontola
www.orfjackal.net

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13  5:34             ` Esko Luontola
@ 2009-05-13  6:49               ` Alex Riesen
  2009-05-13 10:15               ` Johannes Schindelin
  1 sibling, 0 replies; 59+ messages in thread
From: Alex Riesen @ 2009-05-13  6:49 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Johannes Schindelin, git, Shawn O. Pearce

2009/5/13 Esko Luontola <esko.luontola@gmail.com>:
> Johannes Schindelin wrote on 13.5.2009 3:23:
>>
>> Well, that rather settles things, no?
>>
>
> There is need for the feature, but it's unfortunate that the Git developers
> do not see its value. There are many users for whom using non-ASCII names is
> necessary (for example all of Asia and most of Europe), but now it seems
> that Bazaar is the only DVCS that handles encodings correctly:
> http://stackoverflow.com/questions/829682/what-dvcs-support-unicode-filenames

Many Git developers just use systems which don't care about the file names
encoding at all and just keep the names as they were. So interoperability
problem does not exist for them. So, they either don't need the feature,
or can trivially avoid or workaround any problems.

> I see that there are some tests in the /t directory. Which command will run
> all of them, how good coverage do the tests have, how reproducable and
> isolated they are, how many seconds does it take to run all the tests? Is
> there some high-level documentation for new developers?

make test. See also t/README. We like them. I always run test suite before
deployment and sometimes run it just for fun (unless I have to run it
on Windows).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13  5:34             ` Esko Luontola
  2009-05-13  6:49               ` Alex Riesen
@ 2009-05-13 10:15               ` Johannes Schindelin
       [not found]                 ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
  1 sibling, 1 reply; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13 10:15 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git, Shawn O. Pearce

Hi,

On Wed, 13 May 2009, Esko Luontola wrote:

> Johannes Schindelin wrote on 13.5.2009 3:23:
> > Well, that rather settles things, no?
> 
> There is need for the feature, but it's unfortunate that the Git 
> developers do not see its value.

I see a value.  But it is not my itch.  And since it is your itch and you 
said that you will not do anything about it (I don't count writing emails 
here ;-), I concluded that it settles the issue.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Cross-Platform Version Control
       [not found]                 ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
@ 2009-05-13 10:41                   ` John Tapsell
  2009-05-13 13:42                     ` Jay Soffian
  0 siblings, 1 reply; 59+ messages in thread
From: John Tapsell @ 2009-05-13 10:41 UTC (permalink / raw)
  To: git

2009/5/13 Johannes Schindelin <Johannes.Schindelin@gmx.de>:
> Hi,
>
> On Wed, 13 May 2009, Esko Luontola wrote:
>
>> Johannes Schindelin wrote on 13.5.2009 3:23:
>> > Well, that rather settles things, no?
>>
>> There is need for the feature, but it's unfortunate that the Git
>> developers do not see its value.
>
> I see a value.  But it is not my itch.  And since it is your itch and you
> said that you will not do anything about it (I don't count writing emails
> here ;-), I concluded that it settles the issue.

I don't know why the git developers are being so hostile/dismisisve,
but I also hope that somebody volunteers to fix this.
Esko, you have my moral support :-)

John

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 10:41                   ` John Tapsell
@ 2009-05-13 13:42                     ` Jay Soffian
  2009-05-13 13:44                       ` Alex Riesen
  0 siblings, 1 reply; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 13:42 UTC (permalink / raw)
  To: John Tapsell; +Cc: git

On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
> I don't know why the git developers are being so hostile/dismisisve,

Are you serious?

j.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:42                     ` Jay Soffian
@ 2009-05-13 13:44                       ` Alex Riesen
  2009-05-13 13:50                         ` Jay Soffian
  0 siblings, 1 reply; 59+ messages in thread
From: Alex Riesen @ 2009-05-13 13:44 UTC (permalink / raw)
  To: Jay Soffian; +Cc: John Tapsell, git

2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>> I don't know why the git developers are being so hostile/dismisisve,
>
> Are you serious?
>

...because we'll kill you if aren't >:-E

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:44                       ` Alex Riesen
@ 2009-05-13 13:50                         ` Jay Soffian
  2009-05-13 13:57                           ` John Tapsell
  0 siblings, 1 reply; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 13:50 UTC (permalink / raw)
  To: Alex Riesen; +Cc: John Tapsell, git

On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>> I don't know why the git developers are being so hostile/dismisisve,
>>
>> Are you serious?
>>
>
> ...because we'll kill you if aren't >:-E

I'm just flabbergasted by some people's expectations. Perhaps John
doesn't realize the git developers are all volunteers, and that it is
never appropriate to criticize a volunteer. A "thank you for all your
hard work on git" would have done nicely.

j.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:50                         ` Jay Soffian
@ 2009-05-13 13:57                           ` John Tapsell
  2009-05-13 15:27                             ` Nicolas Pitre
                                               ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: John Tapsell @ 2009-05-13 13:57 UTC (permalink / raw)
  To: Jay Soffian; +Cc: Alex Riesen, git

2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>>> I don't know why the git developers are being so hostile/dismisisve,
>>>
>>> Are you serious?
>>>
>>
>> ...because we'll kill you if aren't >:-E
>
> I'm just flabbergasted by some people's expectations. Perhaps John
> doesn't realize the git developers are all volunteers, and that it is
> never appropriate to criticize a volunteer. A "thank you for all your
> hard work on git" would have done nicely.

I'm as much of an open source developer as anyone else here.  I spend
a huge amount of my time programming for KDE.  But I've never told a
user "well that settles it" because they won't code it themselves :-/
I certaintly get a huge number of bug/wishes that I can't/won't code
myself, but I try to be a bit more diplomatic about it.
But then the kernel mailing lists tend to be a lot more.. direct..
than the kde mailing lists, so I guess it comes from that.  Requiring
people to have a thick skin and all that.


John

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:57                           ` John Tapsell
@ 2009-05-13 15:27                             ` Nicolas Pitre
  2009-05-13 16:22                               ` Johannes Schindelin
  2009-05-13 17:24                             ` Andreas Ericsson
  2009-05-14  1:49                             ` Miles Bader
  2 siblings, 1 reply; 59+ messages in thread
From: Nicolas Pitre @ 2009-05-13 15:27 UTC (permalink / raw)
  To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git

On Wed, 13 May 2009, John Tapsell wrote:

> I'm as much of an open source developer as anyone else here.  I spend
> a huge amount of my time programming for KDE.  But I've never told a
> user "well that settles it" because they won't code it themselves :-/
> I certaintly get a huge number of bug/wishes that I can't/won't code
> myself, but I try to be a bit more diplomatic about it.
> But then the kernel mailing lists tend to be a lot more.. direct..
> than the kde mailing lists, so I guess it comes from that.  Requiring
> people to have a thick skin and all that.

This is not the kernel mailing list.  In fact this list is quite 
friendlier and accommodating that the kernel list.

The remark alluded above comes from _one_ of the git developers.  And 
Dscho is apparently in a rather sad mood these days. While the substance 
of Dscho's remark is entirely pertinent, it would be wrong to use its 
form and style as a characterization of git developers in general.


Nicolas

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 15:27                             ` Nicolas Pitre
@ 2009-05-13 16:22                               ` Johannes Schindelin
  0 siblings, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-13 16:22 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: John Tapsell, Jay Soffian, Alex Riesen, git

Hi,

On Wed, 13 May 2009, Nicolas Pitre wrote:

> On Wed, 13 May 2009, John Tapsell wrote:
> 
> > I'm as much of an open source developer as anyone else here.  I spend 
> > a huge amount of my time programming for KDE.  But I've never told a 
> > user "well that settles it" because they won't code it themselves :-/ 
> > I certaintly get a huge number of bug/wishes that I can't/won't code 
> > myself, but I try to be a bit more diplomatic about it.
> >
> > But then the kernel mailing lists tend to be a lot more.. direct.. 
> > than the kde mailing lists, so I guess it comes from that.  Requiring 
> > people to have a thick skin and all that.
> 
> This is not the kernel mailing list.  In fact this list is quite 
> friendlier and accommodating that the kernel list.
> 
> The remark alluded above comes from _one_ of the git developers.  And 
> Dscho is apparently in a rather sad mood these days. While the substance 
> of Dscho's remark is entirely pertinent, it would be wrong to use its 
> form and style as a characterization of git developers in general.

Even if I were in a better mood, the whole thread has a back story on an 
msysGit issue, and this led me to try to stop what I feared would become a 
rather long mail thread without much of an outcome, such as that infamous 
thread about MacOSX UTF-8 filename handling.

Alas, it seems that Robin is willing to work on the issues, so my fears 
have been totally and completely unfounded.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 16:16   ` Jeff King
  2009-05-12 16:57     ` Johannes Schindelin
@ 2009-05-13 16:26     ` Linus Torvalds
  2009-05-13 17:12       ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 16:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git



On Tue, 12 May 2009, Jeff King wrote:
>
> Or they use a single encoding like utf8 so that there are no surprises.
> You can still run into normalization problems with filenames on some
> filesystems, though.  Linus's name_hash code sets up the framework to
> handle "these two names are actually equivalent", but right now I think
> there is just code for handling case-sensitivity, not utf8 normalization
> (but I just skimmed the code, so I might be wrong).

utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But 
quite frankly, the index is only part of it, and probably not the worst 
part.

The real pain of filename handling is all the "read tree recursively with 
readdir()" issues. Along with just an absolute sh*t-load of issues about 
what to do when people ended up using different versions of the "same" 
name in different branches.

There's also the issue that "cross-platform" really can be a pretty damn 
big pain. What do you do for platforms that simply are pure shit? I 
realize that OS X people have a hard time accepting it, but OS X 
filesystems are generally total and utter crap - even more so than 
Windows.

Yes, yes, you can tell OS X that case matters, but that's not the normal 
case - and what do you do with projects that simply _do_ care about case. 
The kernel is one such project.

Sure, you can "encode" the filenames on such broken filesystems in a way 
that they'd be different - but that won't really help the project, since 
makefiles etc won't work anyway.

So one reason I didn't bother with utf-8 is that the much more fundamental 
issues are simply in plain old 7-bit US-ASCII. 

That said, if the only issue is that you want to encode regular utf-8 in a 
coherent way (and ignore the case issues), then we could probably do that 
part fairly easily with a "convert_to_internal()" and 
"convert_to_filename()" thing that acts very much like the CRLF conversion 
(except on filenames, not data).

And yes, it's probably worth doing, since we'd need that for fuller case 
support anyway.

It's just a fair amount of churn - not fundamentally _hard_, but not 
trivial either. And it needs a _lot_ of care, and a fair amount of 
testing that is probably hard to do on sane filesystems (ie the case where 
the filesystem actually _changes_ the name is going to be hard to test on 
anything sane).

			Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 16:26     ` Linus Torvalds
@ 2009-05-13 17:12       ` Linus Torvalds
  2009-05-13 17:31         ` Andreas Ericsson
                           ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 17:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Linus Torvalds wrote:
> 
> utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But 
> quite frankly, the index is only part of it, and probably not the worst 
> part.
> 
> The real pain of filename handling is all the "read tree recursively with 
> readdir()" issues. Along with just an absolute sh*t-load of issues about 
> what to do when people ended up using different versions of the "same" 
> name in different branches.

Btw, if people care mainly just about OS X, and don't worry so much about 
case, but about the idiotic and insane OS X behavior of turning UTF-8 
filenames into that crazy NFD format, here's a simple patch that may be 
useful for that.

There _will_ certainly be other places, but this handles the one big case 
of "read_directory_recursive()", and can turn NFD into the sane NFC 
format.

Since OS X will then accept NFC (and internally turn it back to NFD) when 
you pass them as filenames, that means that converting the other way is 
not necessary.

NOTE NOTE NOTE! This really just handles one case, and is not enough for 
any kind of general case. For example, it does NOT handle the case where 
you do

	git add filename_with_åäö

explicitly, because if the "filename_with_åäö" is done using NFD 
(tab-completion etc), now git won't _match_ it with the filename it reads 
using readdir() any more (which got converted to NFC), so at a minimum 
we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. 

See "get_pathspec()" in setup.c for that latter case.

But with that, and this crazy thing, OS X users might be already a lot 
better off. Totally untested, of course. 

Oh, and somebody needs to fill in that 

	convert_name_from_nfd_to_nfc()

implementation. It's designed so that if it notices that the string is 
just plain US-ASCII, it can return 0 and no extra work is done. That, in 
turn, can easily be done by some simple and efficient pre-processign that 
checks that there are no high bits set (on a 64-bit platform, do it 8 
characters at a time with a "& 0x8080808080808080"), so that the common 
case doesn't need to have barely any overhead at all.

Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do 
the actual normalization if you find characters with the high bit set. And 
since I know that the OS X filesystems are so buggy as to not even do that 
whole NFD thing right, there is probably some OS-X specific "use this for 
filesystem names" conversion function.

Hmm. Anybody want to take this on? It really shouldn't be too complex to 
get it working for the common case on just OS X. It's really the case 
sensitivity that is the biggest problem, if you ignore that for now, the 
problem space is _much_ smaller.

In other words, I think we can reasonably easily support a subset of 
_common_ issues with some trivial patches like this. But getting it right 
in _all_ the cases is going to be much more work (there are lots of other 
uses of "readdir()" too, this one just happens to be one of the more 
central ones).

Of course, it probably makes sense to have a whole "git_readdir()" that 
does this thing in general. That "create_full_path()" thing makes sense 
regardless, though, in that it also simplifies a lot of "baselen+len" 
usage in just "len".

		Linus

---
 dir.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
 }
 
 /*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+	char temp[256], nlen;
+	nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+	if (nlen) {
+		len = nlen;
+		d_name = temp;
+	}
+#endif
+	memcpy(fullname + baselen, d_name, len + 1);
+	return baselen + len;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			/* Ignore overly long pathnames! */
 			if (len + baselen + 8 > sizeof(fullname))
 				continue;
-			memcpy(fullname + baselen, de->d_name, len+1);
-			if (simplify_away(fullname, baselen + len, simplify))
+			len = create_full_path(fullname, baselen, de->d_name, len);
+			if (simplify_away(fullname, len, simplify))
 				continue;
 
 			dtype = DTYPE(de);
 			exclude = excluded(dir, fullname, &dtype);
 			if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
-			    && in_pathspec(fullname, baselen + len, simplify))
-				dir_add_ignored(dir, fullname, baselen + len);
+			    && in_pathspec(fullname, len, simplify))
+				dir_add_ignored(dir, fullname, len);
 
 			/*
 			 * Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			default:
 				continue;
 			case DT_DIR:
-				memcpy(fullname + baselen + len, "/", 2);
+				memcpy(fullname + len, "/", 2);
 				len++;
-				switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+				switch (treat_directory(dir, fullname, len, simplify)) {
 				case show_directory:
 					if (exclude != !!(dir->flags
 							& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 					break;
 				case recurse_into_directory:
 					contents += read_directory_recursive(dir,
-						fullname, fullname, baselen + len, 0, simplify);
+						fullname, fullname, len, 0, simplify);
 					continue;
 				case ignore_directory:
 					continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			if (check_only)
 				goto exit_early;
 			else
-				dir_add_name(dir, fullname, baselen + len);
+				dir_add_name(dir, fullname, len);
 		}
 exit_early:
 		closedir(fdir);

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:57                           ` John Tapsell
  2009-05-13 15:27                             ` Nicolas Pitre
@ 2009-05-13 17:24                             ` Andreas Ericsson
  2009-05-14  1:49                             ` Miles Bader
  2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-13 17:24 UTC (permalink / raw)
  To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git

John Tapsell wrote:
> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>> On Wed, May 13, 2009 at 9:44 AM, Alex Riesen <raa.lkml@gmail.com> wrote:
>>> 2009/5/13 Jay Soffian <jaysoffian@gmail.com>:
>>>> On Wed, May 13, 2009 at 6:41 AM, John Tapsell <johnflux@gmail.com> wrote:
>>>>> I don't know why the git developers are being so hostile/dismisisve,
>>>> Are you serious?
>>>>
>>> ...because we'll kill you if aren't >:-E
>> I'm just flabbergasted by some people's expectations. Perhaps John
>> doesn't realize the git developers are all volunteers, and that it is
>> never appropriate to criticize a volunteer. A "thank you for all your
>> hard work on git" would have done nicely.
> 
> I'm as much of an open source developer as anyone else here.  I spend
> a huge amount of my time programming for KDE.  But I've never told a
> user "well that settles it" because they won't code it themselves :-/
> I certaintly get a huge number of bug/wishes that I can't/won't code
> myself, but I try to be a bit more diplomatic about it.
> But then the kernel mailing lists tend to be a lot more.. direct..
> than the kde mailing lists, so I guess it comes from that.  Requiring
> people to have a thick skin and all that.
> 

I think much of the perceived malignancy stems from the fact that the
git list has a high ratio of developer-to-luser mailings on it, being
by nature a developer tool most of the time. When the unaware user
appears on the list with demands rather than polite requests, they're
treated that much harder. Especially by the developer who happens to
be, as it were, the butt of the request.

Personally, I've only ever found Dscho being anything but friendly on
this list, and even then, I really didn't find it offensive. If viewed
in a happy mood, it matches quite nicely with a swedish sketch whose
theme is "men ja ente bitter". It's often quite funny, really :-)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 17:12       ` Linus Torvalds
@ 2009-05-13 17:31         ` Andreas Ericsson
  2009-05-13 17:46         ` Linus Torvalds
  2009-05-13 20:57         ` Matthias Andree
  2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-13 17:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git

Linus Torvalds wrote:
> 
> Of course, it probably makes sense to have a whole "git_readdir()" that 
> does this thing in general. That "create_full_path()" thing makes sense 
> regardless, though, in that it also simplifies a lot of "baselen+len" 
> usage in just "len".
> 

In a flash of premonitory insight, libgit2 has 

	gitfo_foreach_dirent(path, callback)

which would probably be well suited for this kind of thing.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 17:12       ` Linus Torvalds
  2009-05-13 17:31         ` Andreas Ericsson
@ 2009-05-13 17:46         ` Linus Torvalds
  2009-05-13 18:26           ` Martin Langhoff
  2009-05-13 20:57         ` Matthias Andree
  2 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 17:46 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Linus Torvalds wrote:
>
> Of course, it probably makes sense to have a whole "git_readdir()" that 
> does this thing in general.

Actually, the more I think about that, the less true I think it is.

It _sounds_ like a nice simplification ("just do it once in readdir, and 
forget about it everywhere else"), but it's in fact a stupid thing to do.

Why?

If we _ever_ want to fix this in the general case, then the code that does 
the readdir() will actually have to remember both the "raw filesystem" 
form _and_ the "cleaned-up utf-8 form".

Why? Because when we do readdir(), we'll also do 'lstat()' on the end 
result to check the types, and opendir() in case it's a directory and we 
then want to do things recursively etc. And that happens to work on OS X 
(because we can use our "fixed" filename for lstat too), but it does not 
work in the general case.

And you can say "well, just do the stat inside the wrapped readdir()", but 
that doesn't work _either_, since

 - we don't want to do the lstat() if it's unnecessary. Even if we don't 
   have "de->d_type" information, we can often avoid the need for it, if 
   we can tell that the name isn't interestign (due to being ignored).

   Avoiding the lstat is a huge performance issue for cold-cache cases. 
   It's basically a seek.

   So we really want to do the lstat() later, which implies that the 
   caller needs to know _both_ the original "real" filesystem name _and_ 
   the converted one.

 - it doesn't handle the opendir() case anyway - so the end result is that 
   a real implementation will _always_ need to carry around both the 
   "filesystem view" filename _and_ the "what we've converted it into".

Now, the point of the patch I sent out was that for the specific case of 
OS X, which does UTF-8 conversions (wrong) but also is happy to get our 
properly normalized name, we don't care. So my patch is "correct" for that 
special case - and so would a plain readdir() wrapper be.

But my patch is _also_ correct for the case where a readdir() wrapper 
would do the wrong thing. My patch doesn't _handle_ it (since it doesn't 
change the code to pass both "filesystem view" and "cleaned-up view" 
pathnames), but the patch I sent out also doesn't make it any harder to do 
right.

In contrast, doing a readdir() wrapper makes it much harder to do right 
later, because it's just doing the conversion at the wrong level (you 
could make that "wrapper" return both the original and the fixed 
filename, but at that point the wrapper doesn't really help - you might 
as well just have the "convert" function, and it would be a hell of a lot 
more obvious what is really going on).

So I take it back. A readdir() wrapper is not a good idea. It gets us a 
tiny bit of the way, but it would actually take us a step back from the 
"real" solution.

			Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 17:46         ` Linus Torvalds
@ 2009-05-13 18:26           ` Martin Langhoff
  2009-05-13 18:37             ` Linus Torvalds
  0 siblings, 1 reply; 59+ messages in thread
From: Martin Langhoff @ 2009-05-13 18:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git

On Wed, May 13, 2009 at 7:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I take it back. A readdir() wrapper is not a good idea. It gets us a
> tiny bit of the way, but it would actually take us a step back from the
> "real" solution.

Do we need to take the real solution to the core of git?

What I am wondering is whether we can keep this simple in git
internals and catch problem filenames at git-add time. This would
allow git to keep treating filenames as a bag of bytes, and it does a
better thing for users.

In cross platform projects, most users don't even know that there are
problems, and even if they do, they don't know what the problems are.

If git add can be told to warn & refuse to add a path with portability
problems, then we educate our users, prevent them from committing
filenames that will later cause trouble to others in their projects,
etc.

from-the-keep-it-simple-and-informative-dept,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 18:26           ` Martin Langhoff
@ 2009-05-13 18:37             ` Linus Torvalds
  2009-05-13 21:04               ` Theodore Tso
  2009-05-13 21:08               ` Daniel Barkalow
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 18:37 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Martin Langhoff wrote:
> 
> Do we need to take the real solution to the core of git?

Well, I suspect that if we really want to support it, then we'd better.

> What I am wondering is whether we can keep this simple in git
> internals and catch problem filenames at git-add time.

I can almost guarantee that it will just cause more problems than it 
solves, and generate some nasty cases that just aren't solvable.

Because it really isn't just "git add". It's every single thing that does 
a lstat() on a filename inside of git.

Now, the simple OS X case is not a huge problem, since the lstat will 
succeed with the fixed-up filename too. But as mentioned, the OS X case is 
the thing that doesn't need a lot of infrastructure _anyway_ - I can 
almost guarantee that my posted patch (with the added setup.c stuff for 
get_pathspec()) is going to be _fewer_ lines than some wrapper logic.

Note: in all of the above, I assume that people care more about just plain 
UTF characters (and the insane NFD form OS X uses) than about worrying 
about the _really_ subtle issues of case-independence. Those are a major 
pain, but they will need even more "internal" support, because there 
simply isn't any sane wrapping method.

(You could wrap everything to force lower-casing of all filesystem ops or 
something, but that would not be acceptable to any sane environment. So in 
reality you need to accept mixed-case things, and then there is no way to 
know from the "outside" whether one external mixed-case thing matches some 
internal index mixed-case thing).

			Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 17:12       ` Linus Torvalds
  2009-05-13 17:31         ` Andreas Ericsson
  2009-05-13 17:46         ` Linus Torvalds
@ 2009-05-13 20:57         ` Matthias Andree
  2009-05-13 21:10           ` Linus Torvalds
  2 siblings, 1 reply; 59+ messages in thread
From: Matthias Andree @ 2009-05-13 20:57 UTC (permalink / raw)
  To: Linus Torvalds, Jeff King; +Cc: Shawn O. Pearce, Esko Luontola, git

Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds  
<torvalds@linux-foundation.org>:

> Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to  
> do the actual normalization if you find characters with the high bit  
> set. And since I know that the OS X filesystems are so buggy as to not  
> even do that whole NFD thing right, there is probably some OS-X specific  
> "use this for
> filesystem names" conversion function.

Sorry for interrupting, but NF_K_C? You don't want that (K for  
compatibility, rather than canonical, normalization) for anything except  
normalizing temporary variables inside strcasecmp(3) or similar. Probably  
not even that. The normalizations done are often irreversible and also  
surprising. You don't want to turn 2³.c into 23.c, do you?

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 18:37             ` Linus Torvalds
@ 2009-05-13 21:04               ` Theodore Tso
  2009-05-13 21:20                 ` Linus Torvalds
  2009-05-13 21:08               ` Daniel Barkalow
  1 sibling, 1 reply; 59+ messages in thread
From: Theodore Tso @ 2009-05-13 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git

On Wed, May 13, 2009 at 11:37:28AM -0700, Linus Torvalds wrote:
> Note: in all of the above, I assume that people care more about just plain 
> UTF characters (and the insane NFD form OS X uses) than about worrying 
> about the _really_ subtle issues of case-independence. Those are a major 
> pain, but they will need even more "internal" support, because there 
> simply isn't any sane wrapping method.

Stupid question --- if we get something that works for Windows and
MacOS X, is there any reason why we need to solve the general problem
of case-insentive filesystems?  It's really backwards compatibility
with Legacy OS's that most important, right?  Are there any other
systems other than Windows and Mac OS X which (a) perpetrate case
insensitivity on application programmers, and (b) which current or
future git users are likely to care about?

						- Ted

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 18:37             ` Linus Torvalds
  2009-05-13 21:04               ` Theodore Tso
@ 2009-05-13 21:08               ` Daniel Barkalow
  2009-05-13 21:29                 ` Linus Torvalds
  1 sibling, 1 reply; 59+ messages in thread
From: Daniel Barkalow @ 2009-05-13 21:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git

On Wed, 13 May 2009, Linus Torvalds wrote:

> On Wed, 13 May 2009, Martin Langhoff wrote:
> > 
> > Do we need to take the real solution to the core of git?
> 
> Well, I suspect that if we really want to support it, then we'd better.
> 
> > What I am wondering is whether we can keep this simple in git
> > internals and catch problem filenames at git-add time.
> 
> I can almost guarantee that it will just cause more problems than it 
> solves, and generate some nasty cases that just aren't solvable.
> 
> Because it really isn't just "git add". It's every single thing that does 
> a lstat() on a filename inside of git.
> 
> Now, the simple OS X case is not a huge problem, since the lstat will 
> succeed with the fixed-up filename too.

I'm not seeing what the general case is, and how it could possibly behave.

There's the "insensitive" behavior: if you create "foo" and look for 
"FOO", it's there, but readdir() reports "foo".

There's the "converting" behavior: if you create "foo", readdir() reports 
"FOO", but lstat("foo") returns it.

The obvious general case is: if you create "foo", readdir() reports "FOO", 
and lstat("foo") doesn't find a match. But if you create "foo" again... it 
doesn't find "foo", so it creates a new file, which it also calls "FOO", 
and the filesystem now has two files with identical names?

It seems to me that the limits of minimally functional, non-inode-losing 
filesystems are: lstat() might take a filename and return the data for a 
non-byte-identical filename; open(name, O_CREAT|O_EXCL) might replace the 
given name with a non-byte-identical filename. But surely open(name) and 
lstat(name) (with the same name) must find the same file, even if 
readdir() would report it with a different name.

And I assume that a filesystem that rejected any non-NFD filenames or any 
non-NFC filenames would be totally unusable, in that users will manage to 
get unnormalized filenames into programs and find that the filesystem just 
doesn't work.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 20:57         ` Matthias Andree
@ 2009-05-13 21:10           ` Linus Torvalds
  2009-05-13 21:30             ` Jay Soffian
  2009-05-13 21:47             ` Matthias Andree
  0 siblings, 2 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:10 UTC (permalink / raw)
  To: Matthias Andree; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Matthias Andree wrote:

> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds
> <torvalds@linux-foundation.org>:
> 
> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do
> > the actual normalization if you find characters with the high bit set. And
> > since I know that the OS X filesystems are so buggy as to not even do that
> > whole NFD thing right, there is probably some OS-X specific "use this for
> > filesystem names" conversion function.
> 
> Sorry for interrupting, but NF_K_C? You don't want that (K for compatibility,
> rather than canonical, normalization) for anything except normalizing
> temporary variables inside strcasecmp(3) or similar. Probably not even that.
> The normalizations done are often irreversible and also surprising. You don't
> want to turn 2³.c into 23.c, do you?

No, you're right. We want just plain NFC. I just googled for how some 
other projects handled this, and found the stringprep thing in a post 
about rsync, and didn't look any closer.

But yes, you're absolutely right, stringprep is total crap, and nfkc is 
horrible.

I have no idea of what library to use, though. For perl, there's 
Unicode::Normalize, but that's likely still subtly incorrect for the OS-X 
case due to the filesystem not using _strict_ NFD.

I have this dim memory of somebody actually pointing to the documentation 
of exactly which characters OS X ends up decomposing. Maybe we could just 
do a git-specific inverse of that, knowing that NOBODY ELSE IN THE WHOLE 
UNIVERSE IS SO TERMINALLY STUPID AS TO DO THAT DECOMPOSITION, and thus the 
OS X case is the only one we need to care about?

			Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 21:04               ` Theodore Tso
@ 2009-05-13 21:20                 ` Linus Torvalds
  0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:20 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Theodore Tso wrote:
> 
> Stupid question --- if we get something that works for Windows and
> MacOS X, is there any reason why we need to solve the general problem
> of case-insentive filesystems?

Qutie frankly, I don't think we're even very close to getting anything 
that works for Windows of OS X.

Case-insensitivity is _hard_.

The "easy" case is to just handle the OS X craxy pseudo-NFD format, and at 
least turn that into NFC (and perhaps add a config option to do latin1 and 
EUC-JP to utf-8 too) and. At that point, we at least handle regular utf-8 
the same way.

Doing the latin1/EUC-JP thing would actually to some degree be more 
interesting than the OS X NFD case, because that really does require 
two-way conversion, and we can "test" that even on sane filesystems (ie 
play at having a Latin1 filesystem).

That said, I suspect there aren't that many people who care about latin1 
filesystems. I dunno about EUC-JP (and variants - for all I know, 
shift-JIS and other cases may be the more common ones).

Of course, if we do everything right, maybe the windows people would 
actually like us to keep the filesystem-native representation in UTF-16LE 
or whatever the crazy format is that Windows really uses deep down.

My point being that all of these things happen even without the added 
worry about case. And in many ways, not worrying about case should 
probably be the first step. We do have some support for worrying about 
case, but trying to solve both things at the same time isn't going to be 
workable, I suspect.

Case insensitivity should never ever involve a _conversion_ (if it does, 
you get all kinds of crazy behavior), it's just purely a _comparison_ 
issue, so the two really are fundamentally different.

Of course, the reason OS-X seems to be so messed up is exactly that the 
morons at Apple didn't understand the difference between conversion and 
comparison, and mixed them up.

		Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 21:08               ` Daniel Barkalow
@ 2009-05-13 21:29                 ` Linus Torvalds
  0 siblings, 0 replies; 59+ messages in thread
From: Linus Torvalds @ 2009-05-13 21:29 UTC (permalink / raw)
  To: Daniel Barkalow
  Cc: Martin Langhoff, Jeff King, Shawn O. Pearce, Esko Luontola, git



On Wed, 13 May 2009, Daniel Barkalow wrote:
> > 
> > Now, the simple OS X case is not a huge problem, since the lstat will 
> > succeed with the fixed-up filename too.
> 
> I'm not seeing what the general case is, and how it could possibly behave.

Here's a simple example.

Let's say that your company uses Latin1 internally for your filesystems, 
because your tools really aren't utf-8 ready. 

This is NOT AT ALL unnatural - it's how lots of people used to work with 
Linux over the years, and it's largely how people still use FAT, I suspect 
(except it's not latin1, it's some windows-specific 8-bits-per-character 
mapping).


IOW, if you have a file called 'åäö', it literally is encoded as 
'\xe5\xe4\xf6' (if you wonder why I picked those three letters, it's 
because they are the regular extra letters in Swedish - Swedish has 29 
letters in its alphabet, and those three letters really are letters in 
their own right, they are NOT 'a' and 'o' with some dots/rings on top).

IOW, if you open such a file, you need to use those three bytes.

Now, even if you happen to have an OS and use Latin1 on disk, you may 
realize that you'd like to interact with others that use UTF-8, and would 
want to have your git archive that you export use nice portable UTF-8.

But you absolutely MUST NOT just do a conversion at "readdir()" time. If 
you do that, then your three-byte filename turns into a six-byte utf-8 
sequence of '\xc3\xa5\xc3\xa4\xc3\xb6' and the thing is, now "lstat()" 
won't work on that sequence.

So obviously you could always turn things _back_ for lstat(), but quite 
frankly, that's (a) insane (b) incompetent and (c) not even always 
well-defined.

> There's the "insensitive" behavior: if you create "foo" and look for 
> "FOO", it's there, but readdir() reports "foo".
> 
> There's the "converting" behavior: if you create "foo", readdir() reports 
> "FOO", but lstat("foo") returns it.

Then there's the behaviour above: you want your git repository to have 
utf-8, but your filesystem doesn't convert anything at all, and all your 
regular tools (think editors etc) are all Latin1.

Latin1 is going away, I hope, but I bet EUC-JP etc still exist. 

		Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 21:10           ` Linus Torvalds
@ 2009-05-13 21:30             ` Jay Soffian
  2009-05-13 21:47             ` Matthias Andree
  1 sibling, 0 replies; 59+ messages in thread
From: Jay Soffian @ 2009-05-13 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthias Andree, Jeff King, Shawn O. Pearce, Esko Luontola, git

On Wed, May 13, 2009 at 5:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I have this dim memory of somebody actually pointing to the documentation
> of exactly which characters OS X ends up decomposing.

http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
http://developer.apple.com/technotes/tn/tn1150table.html

j.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 21:10           ` Linus Torvalds
  2009-05-13 21:30             ` Jay Soffian
@ 2009-05-13 21:47             ` Matthias Andree
  1 sibling, 0 replies; 59+ messages in thread
From: Matthias Andree @ 2009-05-13 21:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Shawn O. Pearce, Esko Luontola, git

Am 13.05.2009, 23:10 Uhr, schrieb Linus Torvalds  
<torvalds@linux-foundation.org>:

>
>
> On Wed, 13 May 2009, Matthias Andree wrote:
>
>> Am 13.05.2009, 19:12 Uhr, schrieb Linus Torvalds
>> <torvalds@linux-foundation.org>:
>>
>> > Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something  
>> to do
>> > the actual normalization if you find characters with the high bit  
>> set. And
>> > since I know that the OS X filesystems are so buggy as to not even do  
>> that
>> > whole NFD thing right, there is probably some OS-X specific "use this  
>> for
>> > filesystem names" conversion function.
>>
>> Sorry for interrupting, but NF_K_C? You don't want that (K for  
>> compatibility,
>> rather than canonical, normalization) for anything except normalizing
>> temporary variables inside strcasecmp(3) or similar. Probably not even  
>> that.
>> The normalizations done are often irreversible and also surprising. You  
>> don't
>> want to turn 2³.c into 23.c, do you?
>
> No, you're right. We want just plain NFC. I just googled for how some
> other projects handled this, and found the stringprep thing in a post
> about rsync, and didn't look any closer.
>
> But yes, you're absolutely right, stringprep is total crap, and nfkc is
> horrible.

Crap? It's just besides the purpose and some limited form of fuzzy match.  
Anyways...

> I have no idea of what library to use, though. For perl, there's
> Unicode::Normalize, but that's likely still subtly incorrect for the OS-X
> case due to the filesystem not using _strict_ NFD.

Perhaps ICU (ICU4C), from http://site.icu-project.org/

-- 
Matthias Andree

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-13 13:57                           ` John Tapsell
  2009-05-13 15:27                             ` Nicolas Pitre
  2009-05-13 17:24                             ` Andreas Ericsson
@ 2009-05-14  1:49                             ` Miles Bader
  2 siblings, 0 replies; 59+ messages in thread
From: Miles Bader @ 2009-05-14  1:49 UTC (permalink / raw)
  To: John Tapsell; +Cc: Jay Soffian, Alex Riesen, git

John Tapsell <johnflux@gmail.com> writes:
> I'm as much of an open source developer as anyone else here.  I spend
> a huge amount of my time programming for KDE.  But I've never told a
> user "well that settles it" because they won't code it themselves :-/

FWIW, Johannes' use of "Well, that rather settles things, no?" in this
thread this didn't strike me as being rude or truly dismissive (even
though it's literally so).

It seemed more just a timely and to the point reminder that however fun
it is to talk about random feature X, someone's gotta do the work if
it's going to actually be implemented, and that the direction of git
development very much follows the whims of those doing the actual
hacking (perhaps more so than other projects).

[and I don't even have particularly thick skin, I think -- I'm often
very annoyed by brusqueness one sees on many developer mailing lists...]

-Miles

-- 
Acquaintance, n. A person whom we know well enough to borrow from, but not
well enough to lend to.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
  2009-05-12 15:14 ` Shawn O. Pearce
  2009-05-12 18:28 ` Dmitry Potapov
@ 2009-05-14 13:48 ` Peter Krefting
  2009-05-14 19:58   ` Esko Luontola
  2 siblings, 1 reply; 59+ messages in thread
From: Peter Krefting @ 2009-05-14 13:48 UTC (permalink / raw)
  To: Esko Luontola; +Cc: git

Esko Luontola:

> A good start for making Git cross-platform, would be storing the text 
> encoding of every file name and commit message together with the commit.

Is it really necessary to store the encoding for every single file name, 
should it not be enough to just store encoding information for all file 
names at once (i.e., for the object that contains the list of file names and 
their associated blobs)?

I did publish, as a request for comments, the beginnings of a patch that 
would change the Windows version of Git to expect file names to be UTF-8 
encoded. There were some comments about it, especially that I could not just 
assume that UTF-8 was the right thing to assume.

Perhaps if we added some meta-data, maybe using the same fall-back mechanism 
as for commit messages (i.e., assume UTF-8 unless otherwise specified), it 
would be easier to do.

On Windows, the file APIs allow you to use Unicode (UTF-16) to specify file 
names, and the file systems will handle any necessary conversion to whatever 
byte sequences are used to store the file names. UTF-16 and UTF-8 are 
trivial to convert between, and Windows does contain APIs to convert between 
other character encodings and UTF-16.

On Mac OS X, I believe the file system APIs assume you use some kind of 
normalized UTF-8. That should also be possible to create, possibly 
converting back and forth between different normalization forms, if necessary.

On Linux and other Unixes we could just use iconv() to convert from the 
repository file name encoding to whatever the current locale has set up. The 
trick here is to handle file names outside the current encoding. Some kind 
of escaping mechanism will probably need to be introduced.

The best way would be to define this in the Git core once and for all, and 
add support to it for all the platforms in the same go, instead of trying to 
hack around the issue whenever it pops up on the various platforms.

My main use-case for Git on Windows has disappeared as my $dayjob went 
bankrupt, but I am happy to assist with whatever insight I may be able to 
bring.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-12 21:55         ` Jakub Narebski
@ 2009-05-14 17:59           ` Heiko Voigt
  2009-05-15 10:52             ` Martin Langhoff
                               ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-14 17:59 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

At the moment non-ascii encodings of filenames are not portably converted
between different filesystems by git. This will most likely change in the
future but to allow repositories to be portable among different file/operating
systems this check is enabled by default.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---
On Tue, May 12, 2009 at 11:55:59PM +0200, Jakub Narebski wrote:
> On Tue, 12 May 2009, Heiko Voigt wrote:
> 
> > At the moment non-ascii encodings of file/usernames are not very well
> > supported by git. This will most likely change in the future but to
> > allow repositories to be portable among different file/operating systems
> > this check is enabled by default.
> 
> > +	# non-ascii username issue a warning in git gui so tell the
> > +	# user to change it
> > +	if ! git config user.name | is_ascii; then
> > +		echo "Please only use ascii characters in your username!"
> > +		exit 1
> > +	fi
> > +
> > +	if ! git config user.email | is_ascii; then
> > +		echo "Please only use ascii characters in your email!"
> > +		exit 1
> > +	fi
> 
> Actually 1.) there is no easy way to avoid non-ASCII names at least
> in user.name (I think they are not allowed in email), but 2.) there
> is no trouble with non-ASCII encoding of commits, as they have 
> 'encoding' header if it is not uft-8 (see *encoding* config variables).

I tried it and indeed it seems to work now. This hook originated from a
windows installation were having non-ascii characters resulted in a
strange warning from git gui each time you commit. So here is the
corrected patch.

 templates/hooks--pre-commit.sample |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..3083735 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,26 @@
 #
 # To enable this hook, rename this file to "pre-commit".
 
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+function is_ascii () {
+    test -z "$(cat | sed -e "s/[\ -~]*//g")"
+    return $?
+}
+
+if [ "$allownonascii" != "true" ]
+then
+	# until git can handle non-ascii filenames gracefully
+	# prevent them to be added into the repository
+	if ! git diff --cached --name-only --diff-filter=A -z \
+			| tr "\0" "\n" | is_ascii; then
+		echo "Non-ascii filenames are not allowed !"
+		echo "Please rename the file ..."
+		exit 1
+	fi
+fi
+
 if git-rev-parse --verify HEAD 2>/dev/null
 then
 	against=HEAD
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
@ 2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Esko Luontola @ 2009-05-14 19:58 UTC (permalink / raw)
  To: Peter Krefting; +Cc: git

Peter Krefting wrote on 14.5.2009 16:48:
> Is it really necessary to store the encoding for every single file name, 
> should it not be enough to just store encoding information for all file 
> names at once (i.e., for the object that contains the list of file names 
> and their associated blobs)?

What about if some disorganized project has people committing with many 
different encodings? Should we allow it, that a directory has the names 
of some files using one encoding, and the names of other files using 
another encoding? Or should we force the whole repository to use the 
same encoding?

> The best way would be to define this in the Git core once and for all, 
> and add support to it for all the platforms in the same go, instead of 
> trying to hack around the issue whenever it pops up on the various 
> platforms.

+1

-- 
Esko Luontola
www.orfjackal.net

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
@ 2009-05-14 20:21     ` Andreas Ericsson
  2009-05-14 22:25     ` Johannes Schindelin
  2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Andreas Ericsson @ 2009-05-14 20:21 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

Esko Luontola wrote:
> Peter Krefting wrote on 14.5.2009 16:48:
>> Is it really necessary to store the encoding for every single file 
>> name, should it not be enough to just store encoding information for 
>> all file names at once (i.e., for the object that contains the list of 
>> file names and their associated blobs)?
> 
> What about if some disorganized project has people committing with many 
> different encodings? Should we allow it, that a directory has the names 
> of some files using one encoding, and the names of other files using 
> another encoding? Or should we force the whole repository to use the 
> same encoding?
> 

If encodings are on a per-tree basis, we could add a special mode-flag for
it without breaking backwards incompatibility (I think, anyways). Older
gits just won't know how to handle it and will treat it as a byte-stream.

>> The best way would be to define this in the Git core once and for all, 
>> and add support to it for all the platforms in the same go, instead of 
>> trying to hack around the issue whenever it pops up on the various 
>> platforms.
> 
> +1
> 

There's still the problem that noone's stepped forward to do all that
work yet, so apparently this isn't important enough for people to put
their patches where their mouths are. Often when issues generate long
discussions and no code, it's of high academic interest and of little
real-world value.

I believe the "little real-world value" here comes from the fact that
cross-platform projects often enforce 7-bit ascii compatible filenames
from the start, because they *know* they may run into problems on other
filesystems otherwise. Remember it's not only git that has to get
things right. It's also build-systems and compilers that have to locate
the correct files (the Makefile and the filesystem may use different
encodings), so in the real world, people really do stay away from
filenames with åäö or other non-ascii chars in them.

It's fun to discuss, but I won't spend any time on it. Good luck to
those who do though. I'd quite like to see if someone could pull it
off without breaking backwards compatibility or impacting performance
too much.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
 http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
@ 2009-05-14 22:25     ` Johannes Schindelin
  2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Johannes Schindelin @ 2009-05-14 22:25 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

Hi,

On Thu, 14 May 2009, Esko Luontola wrote:

> Peter Krefting wrote on 14.5.2009 16:48:
> 
> > The best way would be to define this in the Git core once and for all, 
> > and add support to it for all the platforms in the same go, instead of 
> > trying to hack around the issue whenever it pops up on the various 
> > platforms.
> 
> +1

You might be enthusiastic about this cunning idea.  However, if it costs 
me performance on Linux, and all the benefits go to Windows users, then I 
will remove this "solution" from my personal Git tree _right away_, and 
I'd expect a lot of other people, too.

I repeat this just once more: if you add complexity, you'll have to have a 
compelling reason to do so.  If there is no benefit for Linux users, why 
should they bear the cost?

But as Andreas remarked, I sincerely think that there has been enough talk 
about the issue.  It's time to see some patches, or to stop the 
discussion.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii  filenames
  2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
@ 2009-05-15 10:52             ` Martin Langhoff
  2009-05-18  9:37               ` Heiko Voigt
  2009-06-20 12:14               ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
  2009-05-15 14:57             ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
  2009-05-15 18:11             ` [PATCH v2] " Junio C Hamano
  2 siblings, 2 replies; 59+ messages in thread
From: Martin Langhoff @ 2009-05-15 10:52 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> At the moment non-ascii encodings of filenames are not portably converted
> between different filesystems by git. This will most likely change in the
> future but to allow repositories to be portable among different file/operating
> systems this check is enabled by default.

Nice!

 - It'd be a good idea to add to the mix a check for filenames that
are equivalent in case-insensitive FSs.

 - Should all of this be a general "portablefilenames" setting?

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Cross-Platform Version Control
  2009-05-14 19:58   ` Esko Luontola
  2009-05-14 20:21     ` Andreas Ericsson
  2009-05-14 22:25     ` Johannes Schindelin
@ 2009-05-15 11:18     ` Dmitry Potapov
  2 siblings, 0 replies; 59+ messages in thread
From: Dmitry Potapov @ 2009-05-15 11:18 UTC (permalink / raw)
  To: Esko Luontola; +Cc: Peter Krefting, git

On Thu, May 14, 2009 at 10:58:17PM +0300, Esko Luontola wrote:
>
> What about if some disorganized project has people committing with many  
> different encodings? Should we allow it, that a directory has the names  
> of some files using one encoding, and the names of other files using  
> another encoding? Or should we force the whole repository to use the  
> same encoding?

The whole repository should have the same encoding internally. Anything
else will be too complex and too slow... Have you seen any file system
where file names would be stored in different encodings? And Git does
far more operation on file names than a file system does. So, it is
clearly to me that the whole repository should have a single encoding.

Now, I don't think that you will find many open source projects that use
non-ASCII in file names. Moreover, most Linux users are either use UTF-8
already or switch to it in the near future. Mac OS X uses UTF-8 (though
there is a problem with decomposed characters, but Linus posted a
possible solution). So, the only platform were non-ASCII characters may
be interesting to Git users and that does not support UTF-8 is Windows.
AFAIK, Cygwin 1.7 has UTF-8 support. So, it is mostly a problem for
msysGit... Though adding support for legacy encodings can help to some
degree, it means that every system call involving a file name will go
through UTF-8 <-> LEGACY_ENC <-> UTF-16LE conversion. IMHO, having a
legacy encoding involved is far from the best possible solution; but
to avoid that, you need to change MSYS to be able to work with UTF-8.
(I have never looked at MSYS myself, but I suspect it may be not easy).


Dmitry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
  2009-05-15 10:52             ` Martin Langhoff
@ 2009-05-15 14:57             ` Jakub Narebski
  2009-05-18  9:50               ` [PATCH] " Heiko Voigt
  2009-05-15 18:11             ` [PATCH v2] " Junio C Hamano
  2 siblings, 1 reply; 59+ messages in thread
From: Jakub Narebski @ 2009-05-15 14:57 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

<Insert standard Dscho disclaimer here...> ;-)

In short: good idea, don't be discouraged by criticism...

On Thu, 14 May 2009, Heiko Voigt wrote:

> At the moment non-ascii encodings of filenames are not portably converted
> between different filesystems by git. This will most likely change in the
> future but to allow repositories to be portable among different file/operating
> systems this check is enabled by default.

By the way, you might consider choosing shorter line length for commits,
something around 70-76 characters per line; otherwise it is harder to
reply to without linewrapping. 80 characters that you used is, IMHO,
absolute maximum, and it is good that you kept to it.

> 
> Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
> ---

> +# If you want to allow non-ascii filenames set this variable to true.
> +allownonascii=$(git config hooks.allownonascii)
> +
> +function is_ascii () {
> +    test -z "$(cat | sed -e "s/[\ -~]*//g")"
> +    return $?
> +}

>From CodingGuidelines for shell scripts:
 - We do not write the noiseword "function" in front of shell
   functions.

(in short: do not use bash-specific features... unless, of course,
you are modifying bash-completion script).

Second, it would be nice to have comment about how to use this
function (as it does not check file given by its argument, but
rather its standard input). And perhaps also a comment that it
works because ASCII printable characters begin with ' ' space
(does it have to be escaped?) and end with '~' tilde[2].

Third, isn't it useless use of 'cat'[3]? And wouldn't it be better
to use 'tr' to either delete printable characters and check for
anything left (as you do; BTW. wouldn't "return test ..." be simpler?),
or use 'tr' to count non portable characters?

[1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
[3] http://partmaps.org/era/unix/award.html#cat

> +
> +if [ "$allownonascii" != "true" ]
> +then
> +	# until git can handle non-ascii filenames gracefully
> +	# prevent them to be added into the repository
> +	if ! git diff --cached --name-only --diff-filter=A -z \
> +			| tr "\0" "\n" | is_ascii; then
> +		echo "Non-ascii filenames are not allowed !"
> +		echo "Please rename the file ..."
> +		exit 1
> +	fi
> +fi
> +
>  if git-rev-parse --verify HEAD 2>/dev/null
>  then
>  	against=HEAD
> -- 
> 1.6.3
> 
> 
> 
> 

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
  2009-05-15 10:52             ` Martin Langhoff
  2009-05-15 14:57             ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
@ 2009-05-15 18:11             ` Junio C Hamano
  2 siblings, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2009-05-15 18:11 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola,
	git, Junio C Hamano

Heiko Voigt <hvoigt@hvoigt.net> writes:

> diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
> index 0e49279..3083735 100755
> --- a/templates/hooks--pre-commit.sample
> +++ b/templates/hooks--pre-commit.sample
> @@ -7,6 +7,26 @@
>  #
>  # To enable this hook, rename this file to "pre-commit".
>  
> +# If you want to allow non-ascii filenames set this variable to true.
> +allownonascii=$(git config hooks.allownonascii)
> +
> +function is_ascii () {

We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say
"function " here.

> +    test -z "$(cat | sed -e "s/[\ -~]*//g")"

Do you need "cat | "?

Does this script run under LC_ALL=C?  Can an i18n'ized sed interfere with
what you are trying to do?

> +    return $?

Do you need this, or does the function return the result of the last
statment anyway?

> +		echo "Non-ascii filenames are not allowed !"
> +		echo "Please rename the file ..."

Can we make this sound more like a _sample_ project policy?  It's not like
we enforce that policy to other people's projects.

> +		exit 1
> +	fi
> +fi
> +
>  if git-rev-parse --verify HEAD 2>/dev/null
>  then
>  	against=HEAD
> -- 
> 1.6.3

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-15 10:52             ` Martin Langhoff
@ 2009-05-18  9:37               ` Heiko Voigt
  2009-05-18 22:26                 ` Jakub Narebski
  2009-06-20 12:14               ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
  1 sibling, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18  9:37 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:
> On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
> 
> Nice!
> 
>  - It'd be a good idea to add to the mix a check for filenames that
> are equivalent in case-insensitive FSs.

I agree, but that will be an extension in another patch. BTW, if anyone
has a good idea how to efficiently do that kind of check in a hook I'd
cook up a patch on top of this.

>  - Should all of this be a general "portablefilenames" setting?

Well, if you can specify what general portable filenames would have as
properties.

Questions like:

 * What is the portable maximum path length?
 * How long may a filename be (DOS 8.3 ?)
 * Are windows keywords (PRN, ...) allowed?
 * ...

So I think this should be on a per property basis providing sensible
defaults to support the most standard case.

cheers Heiko

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-15 14:57             ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
@ 2009-05-18  9:50               ` Heiko Voigt
  2009-05-18 10:40                 ` Johannes Sixt
                                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18  9:50 UTC (permalink / raw)
  To: Jakub Narebski, Junio C Hamano
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git

At the moment non-ascii encodings of filenames are not portably converted
between different filesystems by git. This will most likely change in the
future but to allow repositories to be portable among different file/operating
systems this check is enabled by default.

Signed-off-by: Heiko <hvoigt@hvoigt.net>
---
so here is a third version ...

On Fri, May 15, 2009 at 04:57:45PM +0200, Jakub Narebski wrote:
> On Thu, 14 May 2009, Heiko Voigt wrote:
> 
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
> 
> By the way, you might consider choosing shorter line length for commits,
> something around 70-76 characters per line; otherwise it is harder to
> reply to without linewrapping. 80 characters that you used is, IMHO,
> absolute maximum, and it is good that you kept to it.

Yeah, I admit they were a little bit long.

> > +function is_ascii () {
> > +    test -z "$(cat | sed -e "s/[\ -~]*//g")"
> > +    return $?
> > +}
> 
> From CodingGuidelines for shell scripts:
>  - We do not write the noiseword "function" in front of shell
>    functions.
> 
> (in short: do not use bash-specific features... unless, of course,
> you are modifying bash-completion script).

Addressed.

> Second, it would be nice to have comment about how to use this
> function (as it does not check file given by its argument, but
> rather its standard input). And perhaps also a comment that it
> works because ASCII printable characters begin with ' ' space
> (does it have to be escaped?) and end with '~' tilde[2].

Done

> 
> Third, isn't it useless use of 'cat'[3]? And wouldn't it be better
> to use 'tr' to either delete printable characters and check for
> anything left (as you do; BTW. wouldn't "return test ..." be simpler?),
> or use 'tr' to count non portable characters?

Yes indeed it was useless. I also switched from sed to tr.

> 
> [1] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
> [2] http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
> [3] http://partmaps.org/era/unix/award.html#cat

On Fri, May 15, 2009 at 11:11:12AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> > +function is_ascii () {
> 
> We do not say "#!/bin/bash" at the beginning (hopefully), so let's not say
> "function " here.

See above.

> > +    test -z "$(cat | sed -e "s/[\ -~]*//g")"
> 
> Do you need "cat | "?

Also above.

> Does this script run under LC_ALL=C?  Can an i18n'ized sed interfere with
> what you are trying to do?

I now explicitely set LC_ALL=C for the tr call which should now be robust
against such cases.

> 
> > +    return $?
> 
> Do you need this, or does the function return the result of the last
> statment anyway?

I wasn't aware of that. Removed the return.

> > +		echo "Non-ascii filenames are not allowed !"
> > +		echo "Please rename the file ..."
> 
> Can we make this sound more like a _sample_ project policy?  It's not like
> we enforce that policy to other people's projects.

I've polished this so we are now more user friendly as well.

 templates/hooks--pre-commit.sample |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..91ab563 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,38 @@
 #
 # To enable this hook, rename this file to "pre-commit".
 
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+# is_ascii() Tests the string given given on standard input for
+# printable ascii conformance. We exploit the fact that the printable
+# range starts at the space character and ends with tilde.
+is_ascii() {
+    test -z "$(LC_ALL=C tr -d \ -~)"
+}
+
+if [ "$allownonascii" != "true" ]
+then
+	# until git can handle non-ascii filenames gracefully
+	# prevent them to be added into the repository
+	if ! git diff --cached --name-only --diff-filter=A -z \
+	   | tr "\0" "\n" | is_ascii; then
+		echo "Error: Preventing to add a non-ascii filename."
+		echo
+		echo "This can cause problems if you want to work together"
+		echo "with people on other platforms than you."
+		echo
+		echo "To be portable it is adviseable to rename the file ..."
+		echo
+		echo "If you know what you are doing you can disable this"
+		echo "check using:"
+		echo
+		echo "  git config hooks.allownonascii true"
+		echo
+		exit 1
+	fi
+fi
+
 if git-rev-parse --verify HEAD 2>/dev/null
 then
 	against=HEAD
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18  9:50               ` [PATCH] " Heiko Voigt
@ 2009-05-18 10:40                 ` Johannes Sixt
  2009-05-18 11:50                   ` Heiko Voigt
  2009-05-19 20:01                   ` [PATCH v4] " Heiko Voigt
  2009-05-18 14:42                 ` [PATCH] " Junio C Hamano
  2009-05-18 20:35                 ` Julian Phillips
  2 siblings, 2 replies; 59+ messages in thread
From: Johannes Sixt @ 2009-05-18 10:40 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
	Esko Luontola, git

Heiko Voigt schrieb:
> +# is_ascii() Tests the string given given on standard input for
> +# printable ascii conformance. We exploit the fact that the printable
> +# range starts at the space character and ends with tilde.
> +is_ascii() {
> +    test -z "$(LC_ALL=C tr -d \ -~)"
> +}
> +
> +if [ "$allownonascii" != "true" ]
> +then
> +	# until git can handle non-ascii filenames gracefully
> +	# prevent them to be added into the repository
> +	if ! git diff --cached --name-only --diff-filter=A -z \
> +	   | tr "\0" "\n" | is_ascii; then

Will this not fail to add more than one file with allowed names? The \n is
not removed in is_ascii(), and so the resulting string will not be empty.

BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
for example. Otherwise, I would have suggested to convert the NUL to some
allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
'\n' (single-quotes) to guarantee that the shell does not ignore the
backslash.

-- Hannes

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18 10:40                 ` Johannes Sixt
@ 2009-05-18 11:50                   ` Heiko Voigt
  2009-05-18 12:04                     ` Johannes Sixt
  2009-05-19 20:01                   ` [PATCH v4] " Heiko Voigt
  1 sibling, 1 reply; 59+ messages in thread
From: Heiko Voigt @ 2009-05-18 11:50 UTC (permalink / raw)
  To: Johannes Sixt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
	Esko Luontola, git

On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > +# is_ascii() Tests the string given given on standard input for
> > +# printable ascii conformance. We exploit the fact that the printable
> > +# range starts at the space character and ends with tilde.
> > +is_ascii() {
> > +    test -z "$(LC_ALL=C tr -d \ -~)"
> > +}
> > +
> > +if [ "$allownonascii" != "true" ]
> > +then
> > +	# until git can handle non-ascii filenames gracefully
> > +	# prevent them to be added into the repository
> > +	if ! git diff --cached --name-only --diff-filter=A -z \
> > +	   | tr "\0" "\n" | is_ascii; then
> 
> Will this not fail to add more than one file with allowed names? The \n is
> not removed in is_ascii(), and so the resulting string will not be empty.

No currently it does not. At least on my system, but good point.

> BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
> for example. Otherwise, I would have suggested to convert the NUL to some
> allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
> '\n' (single-quotes) to guarantee that the shell does not ignore the
> backslash.

Are there any problems with '\0' and tr other than swallowing of it. In
case not I would just change

	tr "\0" "\n"
to
  	tr -d '\0'

That way there are no '\n's left over and it doesn't matter if tr
swallows the '\0'.

Waiting for further comments before sending the cleanup.

cheers Heiko

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18 11:50                   ` Heiko Voigt
@ 2009-05-18 12:04                     ` Johannes Sixt
  0 siblings, 0 replies; 59+ messages in thread
From: Johannes Sixt @ 2009-05-18 12:04 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
	Esko Luontola, git

Heiko Voigt schrieb:
> Are there any problems with '\0' and tr other than swallowing of it.

I can't tell. But the commits ae90e16..aab0abf are interesting to study in
w.r.t. portability.

> In
> case not I would just change
> 
> 	tr "\0" "\n"
> to
>   	tr -d '\0'

In which case I'd suggest that you call tr only once, in isascii():

     tr -d '[ -~]\0'

-- Hannes

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18  9:50               ` [PATCH] " Heiko Voigt
  2009-05-18 10:40                 ` Johannes Sixt
@ 2009-05-18 14:42                 ` Junio C Hamano
  2009-05-18 20:35                 ` Julian Phillips
  2 siblings, 0 replies; 59+ messages in thread
From: Junio C Hamano @ 2009-05-18 14:42 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
	Esko Luontola, git

Heiko Voigt <hvoigt@hvoigt.net> writes:

> +if [ "$allownonascii" != "true" ]
> +then
> +	# until git can handle non-ascii filenames gracefully
> +	# prevent them to be added into the repository

I think you can inline your is_ascii shell function in the pipeline below.
You made it a separate function and I agree that it has a very good
documentation value, but the mention of "non-ascii filenames" in this
comment here is enough clue to let anybody know what is going on.

	Side note: I am not sure "Until ... can ... gracefully" is a good
	description of the general problem.  It probably is more neutral
	to say "Cross platform projects tend to avoid non-ascii filenames;
        prevent them from being added to the repository."

> +	if ! git diff --cached --name-only --diff-filter=A -z \
> +	   | tr "\0" "\n" | is_ascii; then

A standard trick while writing a long pipeline in shell is to change line
after a pipe, like:

	cmd1 |
        cmd2 |
        cmd3

which allows you to lose the BS-before-LF sequence.

I think comments from J6t and others are valuable but clear enough that I
wouldn't have to repeat them.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18  9:50               ` [PATCH] " Heiko Voigt
  2009-05-18 10:40                 ` Johannes Sixt
  2009-05-18 14:42                 ` [PATCH] " Junio C Hamano
@ 2009-05-18 20:35                 ` Julian Phillips
  2 siblings, 0 replies; 59+ messages in thread
From: Julian Phillips @ 2009-05-18 20:35 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Jakub Narebski, Junio C Hamano, Martin Langhoff, Dmitry Potapov,
	Esko Luontola, git

On Mon, 18 May 2009, Heiko Voigt wrote:

> +if [ "$allownonascii" != "true" ]
> +then
> +	# until git can handle non-ascii filenames gracefully
> +	# prevent them to be added into the repository
> +	if ! git diff --cached --name-only --diff-filter=A -z \
> +	   | tr "\0" "\n" | is_ascii; then
> +		echo "Error: Preventing to add a non-ascii filename."

This would read better as:

+		echo "Error: Attempt to add a non-ascii filename."

(after all the prevention itself is a result of the error, not the cause 
of it)

If you want to keep the preventing, then you need to at least correct the 
english:

> +		echo "Error: Preventing addition of a non-ascii filename."

-- 
Julian

  ---
QOTD:
 	Money isn't everything, but at least it keeps the kids in touch.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18  9:37               ` Heiko Voigt
@ 2009-05-18 22:26                 ` Jakub Narebski
  0 siblings, 0 replies; 59+ messages in thread
From: Jakub Narebski @ 2009-05-18 22:26 UTC (permalink / raw)
  To: Heiko Voigt
  Cc: Martin Langhoff, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

On Mon, 18 May 2009, Heiko Voigt wrote:
> On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:

> >  - Should all of this be a general "portablefilenames" setting?
> 
> Well, if you can specify what general portable filenames would have as
> properties.

"Fixing Unix/Linux/POSIX Filenames: Control Characters (such as 
Newline), Leading Dashes, and Other Problems" by David A. Wheeler
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4] Extend sample pre-commit hook to check for non ascii filenames
  2009-05-18 10:40                 ` Johannes Sixt
  2009-05-18 11:50                   ` Heiko Voigt
@ 2009-05-19 20:01                   ` Heiko Voigt
  1 sibling, 0 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-05-19 20:01 UTC (permalink / raw)
  To: Johannes Sixt, Junio C Hamano, Julian Phillips
  Cc: Jakub Narebski, Martin Langhoff, Dmitry Potapov, Esko Luontola, git

At the moment non-ascii encodings of filenames are not portably
converted between different filesystems by git. This will most likely
change in the future but to allow repositories to be portable among
different file/operating systems this check is enabled by default.

Signed-off-by: Heiko Voigt <hvoigt@hvoigt.net>
---

Thanks for all comments. I now hopefully have a satisfying patch.


On Mon, May 18, 2009 at 12:40:09PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > +	if ! git diff --cached --name-only --diff-filter=A -z \
> > +	   | tr "\0" "\n" | is_ascii; then
> 
> Will this not fail to add more than one file with allowed names? The \n is
> not removed in is_ascii(), and so the resulting string will not be empty.
> 
> BTW, not all tr work well with NULs. See the commit message of e85fe4d8,
> for example. Otherwise, I would have suggested to convert the NUL to some
> allowed ASCII character, e.g. 'A'. BTW, you should really use '\0' and
> '\n' (single-quotes) to guarantee that the shell does not ignore the
> backslash.

I removed all \0 characters and hopefully use the correct platform
independent syntax as described in the commits you send.


On Mon, May 18, 2009 at 02:04:08PM +0200, Johannes Sixt wrote:
> Heiko Voigt schrieb:
> > Are there any problems with '\0' and tr other than swallowing of it.
> 
> I can't tell. But the commits ae90e16..aab0abf are interesting to study in
> w.r.t. portability.
> 
> > In
> > case not I would just change
> > 
> > 	tr "\0" "\n"
> > to
> >   	tr -d '\0'
> 
> In which case I'd suggest that you call tr only once, in isascii():
> 
>      tr -d '[ -~]\0'

After reading a little about the portability things. This seems to be
the right way and is now included.


On Mon, May 18, 2009 at 07:42:31AM -0700, Junio C Hamano wrote:
> Heiko Voigt <hvoigt@hvoigt.net> writes:
> 
> > +if [ "$allownonascii" != "true" ]
> > +then
> > +	# until git can handle non-ascii filenames gracefully
> > +	# prevent them to be added into the repository
> 
> I think you can inline your is_ascii shell function in the pipeline below.
> You made it a separate function and I agree that it has a very good
> documentation value, but the mention of "non-ascii filenames" in this
> comment here is enough clue to let anybody know what is going on.

I agree. I thought it would probably be useful in other places but we
just need it once so its inlined now.

> 
> 	Side note: I am not sure "Until ... can ... gracefully" is a good
> 	description of the general problem.  It probably is more neutral
> 	to say "Cross platform projects tend to avoid non-ascii filenames;
>         prevent them from being added to the repository."

Changed that.

> 
> > +	if ! git diff --cached --name-only --diff-filter=A -z \
> > +	   | tr "\0" "\n" | is_ascii; then
> 
> A standard trick while writing a long pipeline in shell is to change line
> after a pipe, like:
> 
> 	cmd1 |
>         cmd2 |
>         cmd3
> 
> which allows you to lose the BS-before-LF sequence.

Wasn't aware of that. Changed it accordingly.


On Mon, May 18, 2009 at 09:35:19PM +0100, Julian Phillips wrote:
> On Mon, 18 May 2009, Heiko Voigt wrote:
>> +		echo "Error: Preventing to add a non-ascii filename."
>
> This would read better as:
>
> +		echo "Error: Attempt to add a non-ascii filename."
>
> (after all the prevention itself is a result of the error, not the cause  
> of it)

That really sounds better. Thanks.

 templates/hooks--pre-commit.sample |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index 0e49279..ad892a2 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -7,6 +7,31 @@
 #
 # To enable this hook, rename this file to "pre-commit".
 
+# If you want to allow non-ascii filenames set this variable to true.
+allownonascii=$(git config hooks.allownonascii)
+
+# Cross platform projects tend to avoid non-ascii filenames; prevent
+# them from being added to the repository. We exploit the fact that the
+# printable range starts at the space character and ends with tilde.
+if [ "$allownonascii" != "true" ] &&
+	test "$(git diff --cached --name-only --diff-filter=A -z |
+	  LC_ALL=C tr -d '[ -~]\0')"
+then
+	echo "Error: Attempt to add a non-ascii filename."
+	echo
+	echo "This can cause problems if you want to work together"
+	echo "with people on other platforms than you."
+	echo
+	echo "To be portable it is adviseable to rename the file ..."
+	echo
+	echo "If you know what you are doing you can disable this"
+	echo "check using:"
+	echo
+	echo "  git config hooks.allownonascii true"
+	echo
+	exit 1
+fi
+
 if git-rev-parse --verify HEAD 2>/dev/null
 then
 	against=HEAD
-- 
1.6.3

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook
  2009-05-15 10:52             ` Martin Langhoff
  2009-05-18  9:37               ` Heiko Voigt
@ 2009-06-20 12:14               ` Heiko Voigt
  1 sibling, 0 replies; 59+ messages in thread
From: Heiko Voigt @ 2009-06-20 12:14 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Jakub Narebski, Dmitry Potapov, Esko Luontola, git, Junio C Hamano

This helps cross-platform projects on the case-sensitive filename side
of operating systems to use filenames that are nice for the
case-insensitive side

---
On Fri, May 15, 2009 at 12:52:41PM +0200, Martin Langhoff wrote:
> On Thu, May 14, 2009 at 7:59 PM, Heiko Voigt <hvoigt@hvoigt.net> wrote:
> > At the moment non-ascii encodings of filenames are not portably converted
> > between different filesystems by git. This will most likely change in the
> > future but to allow repositories to be portable among different file/operating
> > systems this check is enabled by default.
>  - It'd be a good idea to add to the mix a check for filenames that
> are equivalent in case-insensitive FSs.

Totally untested. Just to get feedback if someone has ideas how this can
be solved more efficiently. I suspect that processing all files will
yield an unbearable performance degradation on large projects.

Let me know what you think. The wording of the error message is not yet
final.

 templates/hooks--pre-commit.sample |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/templates/hooks--pre-commit.sample b/templates/hooks--pre-commit.sample
index b11ad6a..32d1809 100755
--- a/templates/hooks--pre-commit.sample
+++ b/templates/hooks--pre-commit.sample
@@ -9,6 +9,10 @@
 
 # If you want to allow non-ascii filenames set this variable to true.
 allownonascii=$(git config hooks.allownonascii)
+# If you want to allow filenames that only differ in case set this
+# variable to true. NOTE: This can degrade performance on project with
+# lots of files
+allowcaseonly=$(git config hooks.allowcaseonly)
 
 # Cross platform projects tend to avoid non-ascii filenames; prevent
 # them from being added to the repository. We exploit the fact that the
@@ -32,6 +36,23 @@ then
 	exit 1
 fi
 
+# check for names that already exist but only differ in case
+# which can be problematic on non-casesensitive filesystems
+if [ "$allowcaseonly" != "true" ] &&
+	test -z "$(git ls-files | LC_ALL=C tr -s [A-Z] [a-z] | uniq -d)"
+then
+	echo "Error: Attempt to add file which already exists in different case"
+	echo
+	echo "If you know what you are doing you can disable this"
+	echo "check using:"
+	echo
+	echo "  git config hooks.allowcaseonly true"
+	echo
+	exit 1
+fi
+
 if git-rev-parse --verify HEAD >/dev/null 2>&1
 then
 	against=HEAD
-- 
1.6.3.2.203.g9a122

^ permalink raw reply related	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2009-06-20 12:14 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 16:13   ` Johannes Schindelin
2009-05-12 17:56     ` Esko Luontola
2009-05-12 20:38       ` Johannes Schindelin
2009-05-12 21:16         ` Esko Luontola
2009-05-13  0:23           ` Johannes Schindelin
2009-05-13  5:34             ` Esko Luontola
2009-05-13  6:49               ` Alex Riesen
2009-05-13 10:15               ` Johannes Schindelin
     [not found]                 ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
2009-05-13 10:41                   ` John Tapsell
2009-05-13 13:42                     ` Jay Soffian
2009-05-13 13:44                       ` Alex Riesen
2009-05-13 13:50                         ` Jay Soffian
2009-05-13 13:57                           ` John Tapsell
2009-05-13 15:27                             ` Nicolas Pitre
2009-05-13 16:22                               ` Johannes Schindelin
2009-05-13 17:24                             ` Andreas Ericsson
2009-05-14  1:49                             ` Miles Bader
2009-05-12 16:16   ` Jeff King
2009-05-12 16:57     ` Johannes Schindelin
2009-05-13 16:26     ` Linus Torvalds
2009-05-13 17:12       ` Linus Torvalds
2009-05-13 17:31         ` Andreas Ericsson
2009-05-13 17:46         ` Linus Torvalds
2009-05-13 18:26           ` Martin Langhoff
2009-05-13 18:37             ` Linus Torvalds
2009-05-13 21:04               ` Theodore Tso
2009-05-13 21:20                 ` Linus Torvalds
2009-05-13 21:08               ` Daniel Barkalow
2009-05-13 21:29                 ` Linus Torvalds
2009-05-13 20:57         ` Matthias Andree
2009-05-13 21:10           ` Linus Torvalds
2009-05-13 21:30             ` Jay Soffian
2009-05-13 21:47             ` Matthias Andree
2009-05-12 18:28 ` Dmitry Potapov
2009-05-12 18:40   ` Martin Langhoff
2009-05-12 18:55     ` Jakub Narebski
2009-05-12 21:43       ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
2009-05-12 21:55         ` Jakub Narebski
2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
2009-05-15 10:52             ` Martin Langhoff
2009-05-18  9:37               ` Heiko Voigt
2009-05-18 22:26                 ` Jakub Narebski
2009-06-20 12:14               ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
2009-05-15 14:57             ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
2009-05-18  9:50               ` [PATCH] " Heiko Voigt
2009-05-18 10:40                 ` Johannes Sixt
2009-05-18 11:50                   ` Heiko Voigt
2009-05-18 12:04                     ` Johannes Sixt
2009-05-19 20:01                   ` [PATCH v4] " Heiko Voigt
2009-05-18 14:42                 ` [PATCH] " Junio C Hamano
2009-05-18 20:35                 ` Julian Phillips
2009-05-15 18:11             ` [PATCH v2] " Junio C Hamano
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
2009-05-14 19:58   ` Esko Luontola
2009-05-14 20:21     ` Andreas Ericsson
2009-05-14 22:25     ` Johannes Schindelin
2009-05-15 11:18     ` Dmitry Potapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).