All of lore.kernel.org
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Jeff King <peff@peff.net>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	Esko Luontola <esko.luontola@gmail.com>,
	git@vger.kernel.org
Subject: Re: Cross-Platform Version Control
Date: Wed, 13 May 2009 10:12:45 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LFD.2.01.0905130951100.3343@localhost.localdomain> (raw)
In-Reply-To: <alpine.LFD.2.01.0905130915540.3343@localhost.localdomain>



On Wed, 13 May 2009, Linus Torvalds wrote:
> 
> utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But 
> quite frankly, the index is only part of it, and probably not the worst 
> part.
> 
> The real pain of filename handling is all the "read tree recursively with 
> readdir()" issues. Along with just an absolute sh*t-load of issues about 
> what to do when people ended up using different versions of the "same" 
> name in different branches.

Btw, if people care mainly just about OS X, and don't worry so much about 
case, but about the idiotic and insane OS X behavior of turning UTF-8 
filenames into that crazy NFD format, here's a simple patch that may be 
useful for that.

There _will_ certainly be other places, but this handles the one big case 
of "read_directory_recursive()", and can turn NFD into the sane NFC 
format.

Since OS X will then accept NFC (and internally turn it back to NFD) when 
you pass them as filenames, that means that converting the other way is 
not necessary.

NOTE NOTE NOTE! This really just handles one case, and is not enough for 
any kind of general case. For example, it does NOT handle the case where 
you do

	git add filename_with_åäö

explicitly, because if the "filename_with_åäö" is done using NFD 
(tab-completion etc), now git won't _match_ it with the filename it reads 
using readdir() any more (which got converted to NFC), so at a minimum 
we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. 

See "get_pathspec()" in setup.c for that latter case.

But with that, and this crazy thing, OS X users might be already a lot 
better off. Totally untested, of course. 

Oh, and somebody needs to fill in that 

	convert_name_from_nfd_to_nfc()

implementation. It's designed so that if it notices that the string is 
just plain US-ASCII, it can return 0 and no extra work is done. That, in 
turn, can easily be done by some simple and efficient pre-processign that 
checks that there are no high bits set (on a 64-bit platform, do it 8 
characters at a time with a "& 0x8080808080808080"), so that the common 
case doesn't need to have barely any overhead at all.

Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do 
the actual normalization if you find characters with the high bit set. And 
since I know that the OS X filesystems are so buggy as to not even do that 
whole NFD thing right, there is probably some OS-X specific "use this for 
filesystem names" conversion function.

Hmm. Anybody want to take this on? It really shouldn't be too complex to 
get it working for the common case on just OS X. It's really the case 
sensitivity that is the biggest problem, if you ignore that for now, the 
problem space is _much_ smaller.

In other words, I think we can reasonably easily support a subset of 
_common_ issues with some trivial patches like this. But getting it right 
in _all_ the cases is going to be much more work (there are lots of other 
uses of "readdir()" too, this one just happens to be one of the more 
central ones).

Of course, it probably makes sense to have a whole "git_readdir()" that 
does this thing in general. That "create_full_path()" thing makes sense 
regardless, though, in that it also simplifies a lot of "baselen+len" 
usage in just "len".

		Linus

---
 dir.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
 }
 
 /*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+	char temp[256], nlen;
+	nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+	if (nlen) {
+		len = nlen;
+		d_name = temp;
+	}
+#endif
+	memcpy(fullname + baselen, d_name, len + 1);
+	return baselen + len;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			/* Ignore overly long pathnames! */
 			if (len + baselen + 8 > sizeof(fullname))
 				continue;
-			memcpy(fullname + baselen, de->d_name, len+1);
-			if (simplify_away(fullname, baselen + len, simplify))
+			len = create_full_path(fullname, baselen, de->d_name, len);
+			if (simplify_away(fullname, len, simplify))
 				continue;
 
 			dtype = DTYPE(de);
 			exclude = excluded(dir, fullname, &dtype);
 			if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
-			    && in_pathspec(fullname, baselen + len, simplify))
-				dir_add_ignored(dir, fullname, baselen + len);
+			    && in_pathspec(fullname, len, simplify))
+				dir_add_ignored(dir, fullname, len);
 
 			/*
 			 * Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			default:
 				continue;
 			case DT_DIR:
-				memcpy(fullname + baselen + len, "/", 2);
+				memcpy(fullname + len, "/", 2);
 				len++;
-				switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+				switch (treat_directory(dir, fullname, len, simplify)) {
 				case show_directory:
 					if (exclude != !!(dir->flags
 							& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 					break;
 				case recurse_into_directory:
 					contents += read_directory_recursive(dir,
-						fullname, fullname, baselen + len, 0, simplify);
+						fullname, fullname, len, 0, simplify);
 					continue;
 				case ignore_directory:
 					continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			if (check_only)
 				goto exit_early;
 			else
-				dir_add_name(dir, fullname, baselen + len);
+				dir_add_name(dir, fullname, len);
 		}
 exit_early:
 		closedir(fdir);

  reply	other threads:[~2009-05-13 17:15 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-12 15:06 Cross-Platform Version Control Esko Luontola
2009-05-12 15:14 ` Shawn O. Pearce
2009-05-12 16:13   ` Johannes Schindelin
2009-05-12 17:56     ` Esko Luontola
2009-05-12 20:38       ` Johannes Schindelin
2009-05-12 21:16         ` Esko Luontola
2009-05-13  0:23           ` Johannes Schindelin
2009-05-13  5:34             ` Esko Luontola
2009-05-13  6:49               ` Alex Riesen
2009-05-13 10:15               ` Johannes Schindelin
     [not found]                 ` <43d8ce650905130340q596043d5g45b342b62fe20e8d@mail.gmail.com>
2009-05-13 10:41                   ` John Tapsell
2009-05-13 13:42                     ` Jay Soffian
2009-05-13 13:44                       ` Alex Riesen
2009-05-13 13:50                         ` Jay Soffian
2009-05-13 13:57                           ` John Tapsell
2009-05-13 15:27                             ` Nicolas Pitre
2009-05-13 16:22                               ` Johannes Schindelin
2009-05-13 17:24                             ` Andreas Ericsson
2009-05-14  1:49                             ` Miles Bader
2009-05-12 16:16   ` Jeff King
2009-05-12 16:57     ` Johannes Schindelin
2009-05-13 16:26     ` Linus Torvalds
2009-05-13 17:12       ` Linus Torvalds [this message]
2009-05-13 17:31         ` Andreas Ericsson
2009-05-13 17:46         ` Linus Torvalds
2009-05-13 18:26           ` Martin Langhoff
2009-05-13 18:37             ` Linus Torvalds
2009-05-13 21:04               ` Theodore Tso
2009-05-13 21:20                 ` Linus Torvalds
2009-05-13 21:08               ` Daniel Barkalow
2009-05-13 21:29                 ` Linus Torvalds
2009-05-13 20:57         ` Matthias Andree
2009-05-13 21:10           ` Linus Torvalds
2009-05-13 21:30             ` Jay Soffian
2009-05-13 21:47             ` Matthias Andree
2009-05-12 18:28 ` Dmitry Potapov
2009-05-12 18:40   ` Martin Langhoff
2009-05-12 18:55     ` Jakub Narebski
2009-05-12 21:43       ` [PATCH] Extend sample pre-commit hook to check for non ascii file/usernames Heiko Voigt
2009-05-12 21:55         ` Jakub Narebski
2009-05-14 17:59           ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Heiko Voigt
2009-05-15 10:52             ` Martin Langhoff
2009-05-18  9:37               ` Heiko Voigt
2009-05-18 22:26                 ` Jakub Narebski
2009-06-20 12:14               ` [RFC PATCH] check for filenames that only differ in case to sample pre-commit hook Heiko Voigt
2009-05-15 14:57             ` [PATCH v2] Extend sample pre-commit hook to check for non ascii filenames Jakub Narebski
2009-05-18  9:50               ` [PATCH] " Heiko Voigt
2009-05-18 10:40                 ` Johannes Sixt
2009-05-18 11:50                   ` Heiko Voigt
2009-05-18 12:04                     ` Johannes Sixt
2009-05-19 20:01                   ` [PATCH v4] " Heiko Voigt
2009-05-18 14:42                 ` [PATCH] " Junio C Hamano
2009-05-18 20:35                 ` Julian Phillips
2009-05-15 18:11             ` [PATCH v2] " Junio C Hamano
2009-05-14 13:48 ` Cross-Platform Version Control Peter Krefting
2009-05-14 19:58   ` Esko Luontola
2009-05-14 20:21     ` Andreas Ericsson
2009-05-14 22:25     ` Johannes Schindelin
2009-05-15 11:18     ` Dmitry Potapov
  -- strict thread matches above, loose matches on Subject: below --
2009-04-27  8:55 Eric Sink's blog - notes on git, dscms and a "whole product" approach Martin Langhoff
2009-04-28 11:24 ` Cross-Platform Version Control (was: Eric Sink's blog - notes on git, dscms and a "whole product" approach) Jakub Narebski
2009-04-29  6:55   ` Martin Langhoff
2009-04-29  7:52     ` Cross-Platform Version Control Jakub Narebski
2009-04-29  8:25       ` Martin Langhoff

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.2.01.0905130951100.3343@localhost.localdomain \
    --to=torvalds@linux-foundation.org \
    --cc=esko.luontola@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.