* [RFC] embedded TAB and LF in pathnames
@ 2005-10-07 19:35 Junio C Hamano
  2005-10-07 23:29 ` Alex Riesen
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-07 19:35 UTC (permalink / raw)
  To: git; +Cc: Kai Ruemmler

While I was reviewing the git-status fix by Kai Ruemmler, it struck
me that our barebone Porcelain-ish layer got a bit sloppier over
time.

The core layer does not care about any metacharacters in the
pathname, and it has provisions, primarily in the form of the '-z'
flag, for carefully written Porcelain layers to handle pathnames
with embedded metacharacters correctly.

One exception, however, is the interaction between the git-diff
family output and git-apply. We needed to be compatible with other
people's diff, which meant that we should not have to worry too much
about pathnames with embedded TABs and LFs because GNU diff would
not produce a usable diff for such things anyway. But
'git-diff --names' barfing if a pathname contained these characters
when run without the '-z' flag was too much. This still breaks
'git-status'.

So I am considering the following changes:

 - 'raw' output format without '-z', upon finding a TAB or LF,
   would not die, but just issue a warning. However, the paths are
   "munged" in a way described later.

 - '--name-only' and '--name-status' formats issue the same warning
   when finding these characters and run without '-z'. And the
   paths are "munged" as well.

 - 'patch' output format also issues a warning. The paths are
   "munged" but in a slightly different manner from the above.

 - 'git-apply' is taught about the path munging in the diff input
   for git diffs (i.e. 'diff --git') and does sensible things.

One possible way for path munging goes like this. We could take
advantage of the fact that we do not ever output '//' ourselves, and
'//' never appears in valid diffs by other people's tools, unless
done deliberately by hand ("diff -u a//foo.c b//foo.c" from the
command line). So we could use '//' as if it is a backslash.
Examples.

    "foo/bar.c" --> "foo/bar.c"   (no funny letters - as before)
    "foo\nbar"  --> "foo//0Abar"  (double slash followed by 2 hex)
    "foo\tbar"  --> "foo//09bar"  (double slash followed by 2 hex)

So a diff output to rename "foo/bar.c" to "foo\nbar.c" would become:

    diff --git a/foo/bar.c b/foo//0Abar.c
    similarity index 100%
    rename from foo/bar.c
    rename to foo//0Abar.c

The byte values subject to this munging are LF for patch output
(because git-apply seems to grok TABs in pathnames just fine), and
TAB and LF for 'raw', '--name-only', '--name-status' without '-z'.

I have not made up my mind on the exact choice of the quoting
convention. We could say '///' instead of '//', for example, or
even '//{LF}//' instead of '//0A' proposed above. One thing I
am trying to avoid is "foo\nbar", which I suspect would be
unfriendly to the Cygwin folks.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 19:35 [RFC] embedded TAB and LF in pathnames Junio C Hamano
@ 2005-10-07 23:29 ` Alex Riesen
  2005-10-07 23:44   ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Riesen @ 2005-10-07 23:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kai Ruemmler

Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
> I have not made up my mind on the exact choice of the quoting
> convention. We could say '///' instead of '//', for example, or
> even '//{LF}//' instead of '//0A' proposed above. One thing I
> am trying to avoid is "foo\nbar", which I suspect would be
> unfriendly to the Cygwin folks.

Being unhappy one of them, I think I'd better manage (even if by
postprocessing the output).

Please, don't make the common case ugly just because of that platform
(insanely broken anyway).

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 23:29 ` Alex Riesen
@ 2005-10-07 23:44   ` Junio C Hamano
  2005-10-08  6:45     ` Alex Riesen
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-07 23:44 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Kai Ruemmler

Alex Riesen <raa.lkml@gmail.com> writes:

> Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
>> I have not made up my mind on the exact choice of the quoting
>> convention. We could say '///' instead of '//', for example, or
>> even '//{LF}//' instead of '//0A' proposed above. One thing I
>> am trying to avoid is "foo\nbar", which I suspect would be
>> unfriendly to the Cygwin folks.
>
> Being unhappy one of them, I think I'd better manage (even if by
> postprocessing the output).
>
> Please, don't make the common case ugly just because of that platform
> (insanely broken anyway).

You really have to realize that having LF and TAB in filenames
are *NOT* the common case, no matter which platform you are
talking about.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 23:44   ` Junio C Hamano
@ 2005-10-08  6:45     ` Alex Riesen
  2005-10-08  9:10       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Riesen @ 2005-10-08 6:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kai Ruemmler

Junio C Hamano, Sat, Oct 08, 2005 01:44:48 +0200:
> > Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
> >> I have not made up my mind on the exact choice of the quoting
> >> convention. We could say '///' instead of '//', for example, or
> >> even '//{LF}//' instead of '//0A' proposed above. One thing I
> >> am trying to avoid is "foo\nbar", which I suspect would be
> >> unfriendly to the Cygwin folks.
> >
> > Being unhappy one of them, I think I'd better manage (even if by
> > postprocessing the output).
> >
> > Please, don't make the common case ugly just because of that platform
> > (insanely broken anyway).
>
> You really have to realize that having LF and TAB in filenames
> are *NOT* the common case, no matter which platform you are
> talking about.
>

Yes, but "//" in a path is quite common. Even "///" is not uncommon.

How about copying ls's approach where possible?

    -b, --escape, --quoting-style=escape
        Quote nongraphic characters in file names using alphabetic
        and octal backslash sequences like those used in C. This
        option is the same as -Q except that filenames are not
        surrounded by double-quotes.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-08  6:45     ` Alex Riesen
@ 2005-10-08  9:10       ` Junio C Hamano
  2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08 9:10 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Kai Ruemmler

Alex Riesen <raa.lkml@gmail.com> writes:

        Quote nongraphic characters in file names using alphabetic
        and octal backslash sequences like those used in C. This
        option is the same as -Q except that filenames are not
        surrounded by double-quotes.

If you have a file whose name is 'foo' + LF + 'bar', and if you
use the backslash convention, your diff would start like this:

    diff --git a/foo\nbar b/foo\nbar
    @@ 1,2 3,4 @@
     context
    -deleted
    ...

which looks quite natural. I would, however, prefer this kind of
funny pathname to *stand* *out* more than usual, to make it really
obvious that there is something really funky going on. In that
sense, the above is a bit too innocuous-looking to my taste.

But this "embedded LF and TAB" is a corner case. I would not be
using such paths that would trigger the quoting myself anyway, and
I do not particularly care as long as the tools do the right
thing -- any quoting rule would do, as long as the generating side
(git-diff) is consistent with the accepting side (git-apply), and
as long as there is no new ambiguity introduced.

The backslash proposal is introducing a small ambiguity. You cannot
tell if the file had an embedded LF between 'foo' and 'bar' (and
was generated with your git-diff) or had an embedded backslash
between 'foo' and 'nbar' (and was generated with the existing
git-diff). Since we never had a version of git-diff that outputs
double-slashes '//' in paths, there is no ambiguity if we use it as
a quoting mechanism.

Just as a concrete demonstration, here is what the git-status
output and git-diff output would look like for a file 'pqr' in a
directory whose name is 'def' + LF + 'ghi', using the version of
git-diff from the proposed updates branch:

    # Changed but not updated:
    #   (use git-update-index to mark for commit)
    #
    #	modified: def//{LF}//ghi/pqr

    diff --git a/def//{LF}//ghi/pqr b/def//{LF}//ghi/pqr
    index 9ee055c..47dbc3f 100644
    --- a/def//{LF}//ghi/pqr
    +++ b/def//{LF}//ghi/pqr
    @@ -1 +1,2 @@
     Fri Oct 7 23:19:04 PDT 2005
    +foo

I am not married to this quoting syntax -- I think it *is* ugly,
but as I said before, I'd prefer to have something ugly here.

I would easily be persuaded otherwise, though. A working patch
would probably be the most effective way of persuasion, but a
mock output without the code to produce and/or parse it would
also be fine as a starting point for discussion.

^ permalink raw reply [flat|nested] 33+ messages in thread
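The '//{LF}//' munging demonstrated in the message above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the code from the proposed updates branch: the name `munge_path_demo` is hypothetical, and the sketch always quotes both TAB and LF rather than choosing per output format.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not git's actual code): copy `path`, replacing
 * each TAB with "//{TAB}//" and each LF with "//{LF}//".
 * Returns a malloc'ed string the caller must free, or NULL on failure.
 */
char *munge_path_demo(const char *path)
{
	/* "//{TAB}//" is the longest replacement: 9 bytes per input byte. */
	char *out = malloc(strlen(path) * 9 + 1);
	char *dp = out;

	if (!out)
		return NULL;
	for (; *path; path++) {
		if (*path == '\t') {
			memcpy(dp, "//{TAB}//", 9);
			dp += 9;
		} else if (*path == '\n') {
			memcpy(dp, "//{LF}//", 8);
			dp += 8;
		} else {
			*dp++ = *path;
		}
	}
	*dp = '\0';
	return out;
}
```

Because '//' is assumed never to occur in paths emitted by the tools, a reader of this output can treat any '//{' as the start of a quoted byte.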
* [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08  9:10       ` Junio C Hamano
@ 2005-10-08 13:30         ` Robert Fitzsimons
  2005-10-08 18:30           ` Junio C Hamano
  2005-10-09 10:42           ` Junio C Hamano
  0 siblings, 2 replies; 33+ messages in thread
From: Robert Fitzsimons @ 2005-10-08 13:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alex Riesen, git, Kai Ruemmler

Instead of using //{LF}// and //{TAB}// to quote embedded tab and
linefeed characters in pathnames use URI quoting.

'\t' becomes %09
'\n' becomes %10
'%' becomes %25

Signed-off-by: Robert Fitzsimons <robfitz@273k.net>

---

> I am not married to this quoting syntax -- I think it *is* ugly,
> but as I said before, I'd prefer to have something ugly here.
>
> I would easily be persuaded otherwise, though. A working patch
> would probably be the most effective way of persuasion, but a
> mock output without the code to produce and/or parse it would
> also be fine as a starting point for discussion.

Using URI encoding might be an option; it's not as ugly, and more
people should understand what it means. Here's a possible patch
against pu.

Robert

 apply.c       |   19 ++++++++++++-------
 diff.c        |   26 +++++++++++++++++---------
 git-status.sh |   10 ++++++----
 3 files changed, 35 insertions(+), 20 deletions(-)

applies-to: a9332b0c2bd80a182f946d22d4ec7511c32c55f4
8029a957cab1a912562696fdce8beea5fc2c11c4
diff --git a/apply.c b/apply.c
--- a/apply.c
+++ b/apply.c
@@ -75,21 +75,26 @@ static char *unmunge_name(char *name)
 	if (!name)
 		return name;
-	cp = strstr(name, "//");
+	cp = strstr(name, "%");
 	if (!cp)
 		return name;
 
 	ret_name = strdup(name);
 	for (cp = dp = ret_name; (ch = *cp); cp++) {
-		if (ch == '/' && cp[1] == '/' && cp[2] == '{') {
-			/* //{TAB}// or //{LF}// */
-			if (!strncmp(cp + 3, "TAB}//", 6)) {
+		if (ch == '%') {
+			/* %09 or %10 or %25 */
+			if (!strncmp(cp + 1, "09", 2)) {
 				*dp++ = '\t';
-				cp += 8;
+				cp += 2;
 				continue;
 			}
-			else if (!strncmp(cp + 3, "LF}//", 5)) {
+			else if (!strncmp(cp + 1, "10", 2)) {
 				*dp++ = '\n';
-				cp += 7;
+				cp += 2;
+				continue;
+			}
+			else if (!strncmp(cp + 1, "25", 2)) {
+				*dp++ = '%';
+				cp += 2;
 				continue;
 			}
 			error("malformed munged name '%s' (looking at %s)",
diff --git a/diff.c b/diff.c
--- a/diff.c
+++ b/diff.c
@@ -13,7 +13,7 @@ static const char *path_munge(const char
 {
 	const char *cp;
 	char *retpath, *dp;
-	int ch, munge_inter_name = 0, munge_line_term = 0;
+	int ch, munge_inter_name = 0, munge_line_term = 0, munge_quote = 0;
 
 	if (!path)
 		return path;
@@ -23,23 +23,31 @@ static const char *path_munge(const char
 			munge_inter_name++;
 		if (line_term && ch == '\n')
 			munge_line_term++;
+		if (ch == '%')
+			munge_quote++;
 	}
-	if (!(munge_inter_name + munge_line_term))
+	if (!(munge_inter_name + munge_line_term + munge_quote))
 		return path;
 
-	/* need //{TAB}// and //{LF}// */
+	/* need %09 and %10 and %25 */
 	retpath = xmalloc(cp - path +
-			  munge_inter_name * 8 +
-			  munge_line_term * 7 + 1);
+			  munge_inter_name * 3 +
+			  munge_line_term * 3 +
+			  munge_quote * 3 + 1);
 	for (cp = path, dp = retpath; (ch = *cp); cp++, dp++) {
 		if (inter_name && ch == '\t') {
-			memcpy(dp, "//{TAB}//", 9);
-			dp += 8;
+			memcpy(dp, "%09", 3);
+			dp += 2;
 			continue;
 		}
 		if (line_term && ch == '\n') {
-			memcpy(dp, "//{LF}//", 8);
-			dp += 7;
+			memcpy(dp, "%10", 3);
+			dp += 2;
+			continue;
+		}
+		if (ch == '%') {
+			memcpy(dp, "%25", 3);
+			dp += 2;
 			continue;
 		}
 		*dp = ch;
diff --git a/git-status.sh b/git-status.sh
--- a/git-status.sh
+++ b/git-status.sh
@@ -54,8 +54,9 @@ else
 	perl -e '$/ = "\0";
 	while (<>) {
 		chomp;
-		s|\t|//{TAB}//|g;
-		s|\n|//{LF}//|g;
+		s|%([^021][^059])|%25\1|g;
+		s|\t|%09|g;
+		s|\n|%10|g;
 		s/ /\\ /g;
 		s/^/A /;
 		print "$_\n";
@@ -84,8 +85,9 @@ perl -e '$/ = "\0";
 	my $shown = 0;
 	while (<>) {
 		chomp;
-		s|\t|//{TAB}//|g;
-		s|\n|//{LF}//|g;
+		s|%([^01][^09])|%25\1|g;
+		s|\t|%09|g;
+		s|\n|%10|g;
 		s/^/# /;
 		if (!$shown) {
 			print "#\n# Ignored files:\n";
---
0.99.8.GIT

^ permalink raw reply [flat|nested] 33+ messages in thread
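For comparison, here is a minimal sketch of percent-quoting in the style the patch above aims for. Note that in standard URI percent-encoding the two digits are hexadecimal, so LF (byte 0x0A) is written %0A, whereas the patch writes %10. The sketch uses the hexadecimal form; the names `pct_quote` and `pct_unquote` are hypothetical, and this is not the code from the patch.

```c
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if `c` is not a hex digit. */
static int hexval(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Quote TAB, LF and '%' itself as %XX (hex).  Caller frees the result. */
char *pct_quote(const char *path)
{
	static const char hex[] = "0123456789ABCDEF";
	char *out = malloc(strlen(path) * 3 + 1);
	char *dp = out;

	if (!out)
		return NULL;
	for (; *path; path++) {
		unsigned char ch = (unsigned char)*path;
		if (ch == '\t' || ch == '\n' || ch == '%') {
			*dp++ = '%';
			*dp++ = hex[ch >> 4];
			*dp++ = hex[ch & 0x0f];
		} else {
			*dp++ = (char)ch;
		}
	}
	*dp = '\0';
	return out;
}

/* Undo pct_quote().  Returns NULL on a malformed or NUL-producing escape. */
char *pct_unquote(const char *quoted)
{
	char *out = malloc(strlen(quoted) + 1);
	char *dp = out;

	if (!out)
		return NULL;
	while (*quoted) {
		int hi, lo;

		if (*quoted != '%') {
			*dp++ = *quoted++;
			continue;
		}
		/* Check the first digit before touching the second one. */
		hi = hexval((unsigned char)quoted[1]);
		lo = hi < 0 ? -1 : hexval((unsigned char)quoted[2]);
		if (lo < 0 || (hi == 0 && lo == 0)) {
			free(out);
			return NULL;
		}
		*dp++ = (char)(hi * 16 + lo);
		quoted += 3;
	}
	*dp = '\0';
	return out;
}
```

Quoting '%' itself is what keeps the scheme unambiguous for new output; it does not help with paths containing '%' that were emitted unquoted by older tools.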
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
@ 2005-10-08 18:30           ` Junio C Hamano
  2005-10-08 20:19             ` Junio C Hamano
  2005-10-09 10:42           ` Junio C Hamano
  1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08 18:30 UTC (permalink / raw)
  To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler

Robert Fitzsimons <robfitz@273k.net> writes:

> Instead of using //{LF}// and //{TAB}// to quote embedded tab and
> linefeed characters in pathnames use URI quoting.
>
> '\t' becomes %09
> '\n' becomes %10
> '%' becomes %25
>
> Signed-off-by: Robert Fitzsimons <robfitz@273k.net>

This would break existing setups where people *have* a percent
character in their pathnames -- which I think is worse than the
backslash proposal.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 18:30           ` Junio C Hamano
@ 2005-10-08 20:19             ` Junio C Hamano
  2005-10-11  6:20               ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08 20:19 UTC (permalink / raw)
  To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler, eggert

Junio C Hamano <junkio@cox.net> writes:

> Robert Fitzsimons <robfitz@273k.net> writes:
>
>> '\t' becomes %09
>> '\n' becomes %10
>> '%' becomes %25
>>
>> Signed-off-by: Robert Fitzsimons <robfitz@273k.net>
>
> This would break existing setup where people *has* per-cent
> letter in their pathname -- which I think is worse than the
> backslash proposal.

Having said that, I think something along the lines of backslash or
URI encoding is the cleanest way to go in the long run, with one
condition: diffs generated with git-diff should be applicable with
'GNU patch', especially if there are no funnies like renames and
the recipient does not mind losing mode information.

Although 'GNU patch' has a --quoting-style flag, it seems to be
used only on its output side (i.e. reporting which file it is
patching, etc.). If we can sell changes to teach the filename
encoding convention to its util.c::fetchname() upstream, we could
tell people that 'diff --git' can be applied with newer 'GNU patch'
when the patch is about a file whose name contains a '%' character
(which is not that unusual, compared to TAB and LF). While we are
selling those changes to 'GNU patch', we might even be able to sell
the other extended 'diff --git' metainformation support.

The same filename quoting rules change should probably be sold to
'GNU diff' as well, so that plain diff can natively quote funny
characters in its output without forcing us to fake it by using the
-L flag.

If all of the above is what we aim for, I would say that is a good
direction to go in the longer term.

The double-slash hack was just to avoid all these hassles of having
to muck with other people's tools.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 20:19             ` Junio C Hamano
@ 2005-10-11  6:20               ` Paul Eggert
  2005-10-11  7:37                 ` Junio C Hamano
  2005-10-11 15:17                 ` Linus Torvalds
  0 siblings, 2 replies; 33+ messages in thread
From: Paul Eggert @ 2005-10-11 6:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Junio C Hamano <junkio@cox.net> writes:

> Although 'GNU patch' has --quoting-style flag, it seems to be
> used only on its output side

Yes, that's right.

The convention I had been thinking of adding is to have GNU diff
use shell-quoting style, e.g.,

    'three
    o'\''clock'

to represent a file name with a newline and an apostrophe in it.
This sort of file name can be cut and pasted into the shell.
The quoting could be used with any file name containing a
troublesome character.

Perhaps another quoting style would be better.

An issue I hadn't really had time to think about is the character
encoding of file names. E.g., suppose one file system uses UTF-8
encoding for Japanese file names, and another file system uses
EUC-JP. I suppose it would be nice to handle this problem too.
Perhaps GNU 'diff' could standardize on using UTF-8 in its file
names, regardless of what the underlying file system uses.

Another option is to pass the bytes of the file name through, no
matter what. This might require a runtime flag to diff, or to
patch, or both.

^ permalink raw reply [flat|nested] 33+ messages in thread
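The shell-quoting style described in the message above can be sketched as follows. This is an illustrative helper only (the name `sh_quote` is hypothetical, and it is not taken from GNU diff): the whole name is wrapped in single quotes, and each embedded apostrophe is written as the four bytes `'\''`, while a newline is passed through literally.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: wrap `s` in single quotes for a POSIX shell.
 * An embedded apostrophe closes the quote, emits an escaped
 * apostrophe, and reopens the quote.  Caller frees the result.
 */
char *sh_quote(const char *s)
{
	/* Each input byte grows to at most 4; add 2 quotes and the NUL. */
	char *out = malloc(strlen(s) * 4 + 3);
	char *dp = out;

	if (!out)
		return NULL;
	*dp++ = '\'';
	for (; *s; s++) {
		if (*s == '\'') {
			memcpy(dp, "'\\''", 4);
			dp += 4;
		} else {
			*dp++ = *s;
		}
	}
	*dp++ = '\'';
	*dp = '\0';
	return out;
}
```

Since the newline stays a real newline, a quoted name of this kind can span more than one line of the diff header.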
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11  6:20               ` Paul Eggert
@ 2005-10-11  7:37                 ` Junio C Hamano
  2005-10-11 15:17                 ` Linus Torvalds
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-11 7:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: git

Paul Eggert <eggert@CS.UCLA.EDU> writes:

> The convention I had been thinking of adding is to have GNU diff
> use shell-quoting style, e.g.,
>
> 'three
> o'\''clock'
>
> to represent a file name with a newline and an apostrophe in it.
> This sort of file name can be cut and pasted into the shell.
> The quoting could be used with any file name containing a
> troublesome character.
>
> Perhaps another quoting style would be better.

A patch header (both the "diff --git" line and the ---/+++ lines)
I've been considering, and have in the proposed updates branch,
looks something like this:

    diff --git a/def\nghi/pqr b/dee/pqr
    similarity index 72%
    rename from def\nghi/pqr
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- a/def\nghi/pqr
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct 7 23:19:04 PDT 2005
    +foo
    +foo

If we can keep things on one line, that would make parsing the
stuff very simple, but more importantly, it is easier to see what's
happening. The pattern is the same whether you have funny pathnames
or not, and that helps the human consumer.

Adjusting the "git diff" output to the style that GNU diff with
your shell quoting would use produces something like this:

    diff --git 'a/def
    ghi/pqr' b/dee/pqr
    similarity index 72%
    rename from 'def
    ghi/pqr'
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- 'a/def
    ghi/pqr'
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct 7 23:19:04 PDT 2005
    +foo
    +foo

Which, while it is possible to make tools parse them, is very
distracting for humans to read and review. Yes, LF is quoted, but
it still breaks the line, disrupting the pattern we are used to
seeing. If you are talking about a funny file, whose name is
"a\ndiff --git a/b/c", your diff would look like this:

    diff --git 'a/
    diff --git a/b/c' 'b/
    diff --git a/b/c'
    index 9ee055c..243fbbc 100644
    --- 'a/
    diff --git a/b/c'
    +++ 'b/
    diff --git a/b/c'
    @@ -1 +1,3 @@
     Fri Oct 7 23:19:04 PDT 2005
    +foo
    +foo

We are used to telling the "less" command to do "/^diff --git .*"
while reviewing patches. The shell quoting, while I admit I learned
its beauty from you, is a disaster for human consumption.

For diff output quoting purposes, LF is the only thing that
matters, as you mentioned in another message to me. Our parsing
side (the "GNU patch" counterpart) checks the two pathnames on the
"diff --git" line and makes sure what follows a/ and b/ is
consistent (that is, they should be identical, or each is the same
as "rename from" and "rename to"), so there is no ambiguity. But
again for human consumption purposes, we cannot easily tell SP and
TAB apart by just reading, and a TAB is such an unusual character
to have in a pathname (as opposed to SP, which is not that
uncommon) that we may be better off making it visible.

Quoting TAB incidentally has an added benefit, which you as the GNU
diff/patch person would probably not care too much about. Our other
tools sometimes need to show two paths in one record, and TAB is
used as the field separator between two paths (LF is the record
separator). The tools do have a '-z' mode to let us use anything
but NUL in the pathname, and carefully written scripts tend to run
them with the '-z' flag and use Perl or Python to parse paths out,
but it would be nicer if we did not always have to.

For example, the 'git commit' command prepares the log editor with
the status information about changes being committed, and needs to
mention paths. This is purely for human consumption, and showing
something like:

    # Type commit message to this file.  Lines that start
    # with '#' are ignored.
    #
    # Updated but not checked in:
    #   (will commit)
    #
    #	new file: ab\n\tc/mno
    #	modified: abc/mno
    #	renamed: def\nghi/pqr -> dee/pqr
    ...

is perfectly readable for human users, and can be done without
running the tool in '-z' mode, if the tool output is quoted with
the '\n' and '\t' convention -- the parsing and formatting side can
just split the fields at TAB and show them, without worrying about
an embedded LF making the rest of the pathname spill over to the
next line.

And once we start teaching the user that we represent funny
characters in their paths this way, it becomes nicer to be
consistent in the diff output as well.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11  6:20               ` Paul Eggert
  2005-10-11  7:37                 ` Junio C Hamano
@ 2005-10-11 15:17                 ` Linus Torvalds
  2005-10-11 18:03                   ` Paul Eggert
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-11 15:17 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

On Mon, 10 Oct 2005, Paul Eggert wrote:
>
> An issue I hadn't really had time to think about is the character
> encoding of file names.

Please don't.

Use filenames as if they are just binary blobs of data, that's the
only thing that has a high chance of success.

Yes, it too can break in the presence of something _else_ doing
character translation and/or people moving a patch from one
encoding to another, but that's just true of anything.

Eventually everybody will hopefully use UTF-8, and nothing else
really matters, but the thing is, if you see filenames as just
blobs of data, that works with UTF-8 too, so it's not "wrong" even
in the long run.

And until everybody has one single encoding, you simply won't be
able to tell, and the likelihood that you'd screw up is pretty
high.

The happy part of the "binary blob" approach is that users
_understand_ it. People who actively use different encoding formats
are (painfully) aware of conversions, and they may curse you for
not doing the random encoding format of the day, but they will be
able to handle it.

In contrast, if you start doing conversions, I guarantee you that
people will _not_ be able to handle it when you do something
strange - you've changed the data.

Personally, I'd like the normal C quoting the best. Leave space
as-is, and quote TAB/NL as \t and \n respectively. It's pretty
universally understood in programming circles even outside of C,
and it's not like a very uncommon patch format like that really
needs to be well-understood outside of those circles.

It also has a very obvious and ASCII-safe format for other
characters (ie just the normal octal escapes: \377 etc).

That said, I personally don't think it's necessarily even worth it.
If somebody wants to use names with tabs and newlines, is he really
going to work with diffs? Or is it just a driver error?

		Linus

^ permalink raw reply [flat|nested] 33+ messages in thread
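A minimal sketch of the C-style quoting described in the message above, treating the name as a plain byte string. The function name `c_quote_path` is hypothetical, and which bytes get an octal escape is an assumption made here for illustration: TAB, LF and backslash get their usual C escapes, the remaining ASCII control bytes (below 0x20, and 0x7f) become three-digit octal, and everything else, including space and all bytes at or above 0x80, is passed through untouched.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: quote `path` byte by byte, C style.
 * No character-set interpretation is done; high bytes pass through.
 * Returns a malloc'ed string the caller must free, or NULL on failure.
 */
char *c_quote_path(const char *path)
{
	/* Worst case is a 4-byte octal escape per input byte, plus NUL. */
	char *out = malloc(strlen(path) * 4 + 1);
	char *dp = out;

	if (!out)
		return NULL;
	for (; *path; path++) {
		unsigned char ch = (unsigned char)*path;

		if (ch == '\t') {
			*dp++ = '\\';
			*dp++ = 't';
		} else if (ch == '\n') {
			*dp++ = '\\';
			*dp++ = 'n';
		} else if (ch == '\\') {
			*dp++ = '\\';
			*dp++ = '\\';
		} else if (ch < 0x20 || ch == 0x7f) {
			/* Assumption: other control bytes use octal escapes. */
			dp += sprintf(dp, "\\%03o", (unsigned)ch);
		} else {
			*dp++ = (char)ch;
		}
	}
	*dp = '\0';
	return out;
}
```

Quoting the backslash itself is what lets a reader of new output tell an embedded LF from a literal backslash followed by 'n'.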
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 15:17                 ` Linus Torvalds
@ 2005-10-11 18:03                   ` Paul Eggert
  2005-10-11 18:37                     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-11 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> Personally, I'd like the normal C quoting the best.

That would be fine with me too. How about if we use the equivalent
of --quoting-style="c" for file names that contain funny bytes, and
no quoting for other file names? So, for example, something like
this:

    diff --git "space tab\tnewline\nquote\"backslash\\" b/dee/pqr
    similarity index 72%
    rename from "space tab\tnewline\nquote\"backslash\\"
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- "space tab\tnewline\nquote\"backslash\\"
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct 7 23:19:04 PDT 2005
    +foo
    +foo

The surrounding double-quotes are an extra indication to the human
reader that there is something weird about the quoted file name.

> Use filenames as if they are just binary blobs of data,
> that's the only thing that has a high chance of success.

Thanks for thinking those things through. I agree mostly, but
there's still a technical problem, in that we have to decide what a
"funny byte" is if we are using C-style quoting.

For example, the simplest approach is to say a byte is funny if it
is space, backslash, quote, an ASCII control character, or is
non-ASCII. But this will cause perfectly-reasonable UTF-8 file
names to be presented in git format using unreadable strings like
"a\293\203\257b" or whatever.

Perhaps it would be better to say that a byte is "funny" if it is
space, backslash, quote, an ASCII control character, or a byte that
is not part of a valid UTF-8 encoding. This will let UTF-8 file
names through unscathed, while still warning the reader when funny
business is going on. File names with other encodings (e.g.,
Shift-JIS) will contain lots of backslashes, but that's OK: we
don't mind making nonstandard encodings hard-to-read, so long as we
preserve the bytes correctly.

We could implement this in other GNU applications by having a new
quoting style that supports this quoting behavior. I can arrange
for that.

> If somebody wants to use names with tabs and newlines, is he really
> going to work with diffs? Or is it just a driver error?

The currently supported scheme with 'diff' and 'patch' should work
for everything but newlines. I like the idea of getting it to work
even with newlines, and I am willing to sacrifice old patches with
file names starting with '"' (extremely rare, if any) to get
newlines to work. Among other things I worry about people
submitting purposely-malformed patches in non-git environments.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 18:03                   ` Paul Eggert
@ 2005-10-11 18:37                     ` Linus Torvalds
  2005-10-11 19:42                       ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-11 18:37 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is
> space, backslash, quote, an ASCII control character, or is non-ASCII.
> But this will cause perfectly-reasonable UTF-8 file names to be
> presented in git format using unreadable strings like "a\293\203\257b"
> or whatever.

I think the simplest question to ask is "what are we protecting
against?"

There's only two characters that are _really_ special to diff
itself: \n and \t. The former is obvious, the latter just because
the regular gnu diff format puts a tab between the name and the
date (and if you _knew_ the date was always there you could just
work backwards, but since not all diffs even put a date, \t ends up
being special in practice).

So what else would you want to protect against? I hope not 8-bit
cleanness: if some stupid protocol still isn't 8-bit clean, it
should be fixed.

And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything else than \n and \t
(and that obviously means you have to quote \ itself).

That means that any filename always shows "sanely" in its own byte
locale, and everything is readable, regardless of whether it's
UTF-8 or just plain byte-encoded Latin1, or anything else.

So I don't think you should quote invalid UTF-8: it's invalid UTF-8
whether it is quoted or not.

		Linus

PS. There _is_ something you may want to quote, namely the standard
CSI terminal escapes. Not because they wouldn't pass through, but
because some people might just "cat" a patch. This is debatable.

Now, they are all in the range 0x00-0x1f and 0x80-0x9f, and since
UTF-8 encoding is supposed to happen before it (but you don't know
how many get that right), if you want to quote those characters,
you need to do so _both_ for the "raw" format and for the UTF-8
format.

Now, the UTF-8 format for that high range is actually the same
character, except preceded by a 0xc2 (I think), so the simplest
thing is to do quoting _purely_ on a byte-stream level (ignore any
UTF-8 stuff), and screw the fact that you end up with a non-UTF-8
sequence (character 0x0080 is UTF-8 sequence 0xC2 0x80, and would
be quoted as 0xC2 + "\200", which is no longer valid in UTF-8).

It gets quite nasty. For any UTF-8 quoting scheme you come up with,
I'll point out something that it does wrong or looks horrible for a
Latin1 filename ;)

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 18:37                     ` Linus Torvalds
@ 2005-10-11 19:42                       ` Paul Eggert
  2005-10-11 20:56                         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-11 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> the simplest question to ask is "what are we protecting against?"

I'd like to protect against:

1. File names that cannot be handled correctly with the current
   formats. Newline is the obvious problem here, along with
   (arguably) tab and space.

2. Common transliterations of patches. Many programs (and mailers,
   alas) expand tabs to spaces, append CR to lines, prepend spaces
   to lines, break lines at spaces, etc. 'patch' already deals with
   this to some extent, but it'd be nice if the format resisted
   these transliterations better.

3. Humans misreading patches. The patch format is intended to be
   human-readable, after all.

4. Reencoded patches. Programs like Emacs can and will convert
   patches from UTF-8 to EUC-JP, for example.

You convinced me that (4) is not worth the hassle, but I'd still
like to address (1)-(3) when it's easy.

> invalid UTF-8 [is] invalid UTF-8

Yes, but (2) and (3) can lose information about invalid UTF-8 if we
don't suitably protect the encoding errors. I daresay that many
mailers will mishandle invalid UTF-8, for example.

> There _is_ something you may want to quote, namely the standard CSI
> terminal escapes.

If I understand you aright, we could do that by modifying my
previous proposal to escape all bytes in the UTF-8 representation
of a control character. In Unicode, the characters 0080 through
009F are control characters, so that should suffice to quote the
terminal escapes you mentioned. (Perhaps we should also escape
unassigned Unicode characters too, on the theory that they might
become control characters in the future.)

> For any UTF-8 quoting scheme you come up with, I'll point out
> something that it does wrong or looks horrible for a Latin1 filename
> ;)

Yes, quite true. But we don't have to come up with something that's
perfect in all cases, just something that's good enough to handle
cases that we expect will be common in practice, in a world where
UTF-8 is the preferred encoding for non-ASCII characters.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-11 19:42 ` Paul Eggert @ 2005-10-11 20:56 ` Linus Torvalds 2005-10-12 6:51 ` Paul Eggert 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2005-10-11 20:56 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Tue, 11 Oct 2005, Paul Eggert wrote: > > Yes, quite true. But we don't have to come up with something that's > perfect in all cases, just something that's good enough to handle > cases that we expect will be common in practice, in a world where > UTF-8 is the preferred encoding for non-ASCII characters. The thing is, I can almost guarantee you that any quoting in the high characters is going to be _worse_ than no quoting at all. Exactly because quoting as UTF-8 is the wrong thing when it isn't actually UTF-8, and quoting as non-UTF-8 is the wrong thing when it _is_. Not quoting at all, on the other hand, is unambiguous. If you have a mailer that corrupts your text stream (which-ever type it is), then it's clearly the mailer's problem. The _mailer_ at least has a chance in hell to know what character set it is getting mailed as. The other alternative is to quote _everything_ non-ASCII. That's definitely reliable, but it's also unquestionably ugly as hell, especially in the long run. Yes, there are some complex quoting approaches you can do, which quote things "correctly" (ie at a byte stream level) _and_ keep it valid UTF-8 at the same time. For example, you can read it as a UTF-8 stream, but then quote things at a byte level (ie if you quote one "character", you quote _all_ bytes in that character).
And you quote if: - the UTF-8 _character_ is in the 0x80-0x9f control range - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8) - any _raw_byte_ is 0xfe-0xff (illegal UTF-8 character) - misformed UTF-8 (non-shortest sequence, or just generally invalid sequences with missing or wrong high bits) but quite frankly, that's a pretty painful thing to write. The upside is that it's easy to decode: you can _unquote_ it just as a byte stream. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
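Editorial illustration (not part of the original message): the four quoting conditions listed above can be sketched in a few lines of Python. This is only one possible reading of the rules, not code from git. The function names `utf8_char_len`, `quote_bytes` and `unquote_bytes` are hypothetical, and the choice of `\ooo` octal escapes (with the backslash itself also escaped, so that decoding stays unambiguous) is an assumption made for the example.

```python
import re


def utf8_char_len(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 character starting at data[i], or 0."""
    lead = data[i]
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0  # stray continuation byte, 0xC0/0xC1, or 0xF5-0xFF
    try:
        # The strict decoder rejects truncated and non-shortest sequences.
        data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def quote_bytes(data: bytes) -> bytes:
    """Read as UTF-8, but quote whole characters at the byte level."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = utf8_char_len(data, i)
        if n == 0:
            chunk, quote = data[i:i + 1], True  # malformed, incl. 0xFE/0xFF
        else:
            chunk = data[i:i + n]
            code = ord(chunk.decode("utf-8"))
            quote = (
                0x80 <= code <= 0x9F                    # C1 control character
                or any(0x80 <= b <= 0x9F for b in chunk)  # raw byte in C1 range
                or chunk == b"\\"                       # assumption: escape the escape
            )
        if quote:
            for b in chunk:  # quote _all_ bytes of the character
                out += b"\\%03o" % b
        else:
            out += chunk
        i += len(chunk)
    return bytes(out)


def unquote_bytes(quoted: bytes) -> bytes:
    """Decoding needs no UTF-8 knowledge: it is a plain byte-stream pass."""
    return re.sub(rb"\\([0-7]{3})",
                  lambda m: bytes([int(m.group(1).decode("ascii"), 8)]),
                  quoted)
```

Note that the raw-byte rule also quotes some ordinary characters (for example U+00C0, whose second byte is 0x80), which is the cost Paul Eggert objects to in his reply.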
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-11 20:56 ` Linus Torvalds @ 2005-10-12 6:51 ` Paul Eggert 2005-10-12 14:59 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2005-10-12 6:51 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Linus Torvalds <torvalds@osdl.org> writes: > you can read it as a UTF-8 stream, but then quote things at a byte > level (ie if you quote one "character", you quote _all_ bytes in > that character). Yes, that's what I had in mind. > And you quote if: > > - the UTF-8 _character_ is in the 0x80-0x9f control range Yes. Or more generally, if it's any UTF-8 control character. > - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8) Why quote the raw bytes? Is this for terminal escapes on older xterm (or xterm-like) implementations that don't understand UTF-8? If so, I'm not sure I'd bother, as it would introduce a lot of annoying quoting with perfectly reasonable UTF-8, and (if we assume the world is moving to UTF-8) it addresses a problem that is going away. > - any _raw_byte_ is 0xfe-0xff (illegal UTF-8 character) > - misformed UTF-8 (non-shortest sequence, or just generally invalid > sequences with missing or wrong high bits) Yes, that makes sense. > quite frankly, that's a pretty painful thing to write. It's not trivially short, yes. But it shouldn't be that hard. Also, I guess we don't have to write it, at least not at first. As long as we specify something like the C quoted-string format mentioned earlier, we can encode into that format using a naive algorithm (e.g., quote any non-ASCII byte or ASCII control character), and beautify the encoding method later. > The upside is that it's easy to decode: you can _unquote_ it just as > a byte stream. Yes, that's the idea. Also, the interchange format is the most important thing. 
We have to decode anything that is in the format, and we must encode into the format. Encoding prettily is nice, but not necessary. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 6:51 ` Paul Eggert @ 2005-10-12 14:59 ` Linus Torvalds 2005-10-12 19:07 ` Daniel Barkalow [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu> 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2005-10-12 14:59 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler [-- Attachment #1: Type: TEXT/PLAIN, Size: 1949 bytes --] On Tue, 11 Oct 2005, Paul Eggert wrote: > > > - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8) > > Why quote the raw bytes? Is this for terminal escapes on older xterm > (or xterm-like) implementations that don't understand UTF-8? It's not about "understanding" UTF-8. Even a perfectly modern xterm may simply not be in UTF-8 mode: if it wasn't in an UTF-8 locale, then it won't do UTF-8 decoding. > If so, I'm not sure I'd bother, as it would introduce a lot of annoying > quoting with perfectly reasonable UTF-8, and (if we assume the world > is moving to UTF-8) it addresses a problem that is going away. UTF-8 is only _now_ getting really widespread, and I think it's because RedHat bit the bullet and made UTF-8 the default locale a few years ago. These things take _decades_. I don't know if you realize it, but it's only within the last couple of years that the old 7-bit "finnish ASCII" went away. Finnish and Swedish have three extra characters: åäö (latin1) and åäö (utf-8). But only within the last few years has the really _old_ ASCII representation really gone away so much that I don't see it at all (the characters '{' '}' and '|' were taken over, so that if you had a Finnish ASCII font, programming in C was really funky - but it was common enough that I could do it without thinking much about it ;) So lots of people still use the byte-wide encodings. 
Whether really old ASCII only or some special locale-dependent one (of which latin1 and the "win-latin1" thing are obviously the most common by far). And in that locale, it's not the UTF-8 control characters that matter, it's the _byte_ control characters that do. So if you want to support any other locale than UTF-8, you need to escape them. Assuming you want to escape control characters at all, of course (I still think it's perfectly fine to just let the raw mess through and depend on escaping at higher levels) Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 14:59 ` Linus Torvalds @ 2005-10-12 19:07 ` Daniel Barkalow 2005-10-12 19:52 ` Linus Torvalds [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu> 1 sibling, 1 reply; 33+ messages in thread From: Daniel Barkalow @ 2005-10-12 19:07 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Wed, 12 Oct 2005, Linus Torvalds wrote: > So if you want to support any other locale than UTF-8, you need to escape > them. Assuming you want to escape control characters at all, of course (I > still think it's perfectly fine to just let the raw mess through and > depend on escaping at higher levels) I think it's actually sufficient to escape 0x00-0x1f and 0x7f; those ranges are both easy and, as far as I can tell, include all of the control characters that do annoying things. I think escape, backspace, delete, and bell are the only ones we'd rather the terminal not get; beyond that, patches with screwy filenames look screwy, but don't screw up anything outside of the filename. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 19:07 ` Daniel Barkalow @ 2005-10-12 19:52 ` Linus Torvalds 2005-10-12 20:21 ` H. Peter Anvin 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2005-10-12 19:52 UTC (permalink / raw) To: Daniel Barkalow Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Wed, 12 Oct 2005, Daniel Barkalow wrote: > > I think it's actually sufficient to escape 0x00-0x1f and 0x7f; those > ranges are both easy They are indeed easy. > and, as far as I can tell, include all of the control > characters that do annoying things. Nope. The traditional vt100 escape sequence is "ESC" followed by a character to indicate the type of sequence (the most common one is '['). That's all 7-bit and fine. HOWEVER, they made the 8-bit extension be such that any of these vt100 begin sequences where the second character is in the appropriate range can be instead shortened by one character, by instead using a single 8-bit character of "0x80+(char-0x40)". Ie the traditional "ESC + '['" (\x1b\x5b) can also be written as a single '\x9b' character, aka CSI. In other words, 0x80-0x9f are _all_ just vt100 shorthand for ESC+'@' through ESC+'_'. (I guess it's not strictly "vt100" any more - it's the extended vt220 format). > I think escape, backspace, delete, and > bell are the only ones we'd rather the terminal not get; beyond that, > patches with screwy filenames look screwy, but don't screw up anything > outside of the filename. Try this on a (non-UTF-8) xterm: echo -en '\x9b5B---\x9b1A---\x9b4A\r' and it should do: - move cursor 5 lines down - print "---" - move cursor 1 line up - print "---" - move cursor 4 lines up - return carriage to beginning. In other words, your screen should end up looking something like this: [torvalds@g5 ~]$ echo -en '\x9b5B---\x9b1A---\x9b4A\r' [torvalds@g5 ~]$ --- --- where that "staircase" of two "---" things was done with cursor movements. 
And that's a _benign_ sequence. You can do all kinds of funky stuff that really screws up the user experience. Including have the thing echo keys to you that you didn't type: echo -en '\x9b5n' or lock the keyboard (I don't think any of the terminal emulators implement the latter, or some of the other stranger sequences - things to do double-wide characters etc). Linus PS. You can do all the same in UTF-8 one, but then you'll have to add a \xc2 before the \x9b: echo -en '\xc2\x9b5B---\xc2\x9b1A---\xc2\x9b4A\r' etc.. ^ permalink raw reply [flat|nested] 33+ messages in thread
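Editorial illustration (not part of the original message): the relationship between the 7-bit `ESC + char` sequences and the single-byte 0x80-0x9f forms described above is simple arithmetic. The sketch below uses hypothetical helper names (`c1_from_escape`, `expand_c1`) and assumes the input is a raw byte stream that is not being interpreted as UTF-8.

```python
ESC = 0x1B


def c1_from_escape(second: int) -> int:
    """8-bit shorthand for ESC followed by a byte in '@' (0x40) .. '_' (0x5F)."""
    if not 0x40 <= second <= 0x5F:
        raise ValueError("no single-byte form for this escape")
    return 0x80 + (second - 0x40)


def expand_c1(data: bytes) -> bytes:
    """Rewrite every 0x80-0x9F byte as its two-byte ESC + char equivalent."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += bytes([ESC, b - 0x80 + 0x40])
        else:
            out.append(b)
    return bytes(out)


# ESC '[' collapses to CSI, and the "cursor down 5" example expands back.
assert c1_from_escape(ord("[")) == 0x9B
assert expand_c1(b"\x9b5B") == b"\x1b[5B"
# In UTF-8 the same control character is the two bytes C2 9B.
assert "\u009b".encode("utf-8") == b"\xc2\x9b"
```

This is why escaping only 0x00-0x1f and 0x7f does not keep a terminal in a byte-wide locale safe: the byte 0x9b alone starts a control sequence.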
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 19:52 ` Linus Torvalds @ 2005-10-12 20:21 ` H. Peter Anvin 0 siblings, 0 replies; 33+ messages in thread From: H. Peter Anvin @ 2005-10-12 20:21 UTC (permalink / raw) To: Linus Torvalds Cc: Daniel Barkalow, Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Linus Torvalds wrote: > > Nope. The traditional vt100 escape sequence is "ESC" followed by a > character to indicate the type of sequence (the most common one is '['). > That's all 7-bit and fine. > > HOWEVER, they made the 8-bit extension be such that any of these vt100 > begin sequences where the second character is in the appropriate range can > be instead shortened by one character, by instead using a single 8-bit > character of "0x80+(char-0x40)". Ie the traditional "ESC + '['" (\x1b\x5b) > can also be written as a single '\x9b' character, aka CSI. > > In other words, 0x80-0x9f are _all_ just vt100 shorthand for ESC+'@' > through ESC+'_'. > > (I guess it's not strictly "vt100" any more - it's the extended vt220 > format). > Actually, it's even trickier than that. CSI is character 0x1b of control code set C1; there are two "windows" for control codes -- CL (0x00-0x1f) and CR (0x80-0x9f). Normally CL is mapped to C0 and CR is mapped to C1, but ESC will temporarily map C1 into CL. VT1xx didn't support this since they didn't support 8-bit anything. Anyway, a *lot* of character sets -- not just UTF-8 -- use the CR range of bytes for printables. -hpa ^ permalink raw reply [flat|nested] 33+ messages in thread
[parent not found: <87vf02qy79.fsf@penguin.cs.ucla.edu>]
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu> @ 2005-10-12 21:02 ` Junio C Hamano 2005-10-12 21:05 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2005-10-12 21:02 UTC (permalink / raw) To: Paul Eggert Cc: Linus Torvalds, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Paul Eggert <eggert@CS.UCLA.EDU> writes: > Linus Torvalds <torvalds@osdl.org> writes: > >> I don't know if you realize it, but it's only within the last couple of >> years that the old 7-bit "finnish ASCII" went away. > > Aach! Those Finns! Always on the trailing edge of technology! Nah, Japanese are much worse. We are so used to see Yen signs at the end of multi-line CPP macro definitions (backslashes are taken over by it) and I do not foresee it going away anytime soon. I think windows people believe Yen signs are path component separators ;-). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu> 2005-10-12 21:02 ` Junio C Hamano @ 2005-10-12 21:05 ` Linus Torvalds 2005-10-12 21:09 ` H. Peter Anvin ` (3 more replies) 2005-10-12 21:24 ` Linus Torvalds 2005-10-14 6:59 ` Junio C Hamano 3 siblings, 4 replies; 33+ messages in thread From: Linus Torvalds @ 2005-10-12 21:05 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Wed, 12 Oct 2005, Paul Eggert wrote: > > Your email message suggests that we need to be cautious here. > That message contained UTF-8 text but its header said "Content-Type: > TEXT/PLAIN; charset=ISO-8859-1". Well, my email message was wrong and evil, because it _mixed_ two different encodings in the same text. No sane client could have shown them both at the same time - but especially with a stupid client, you could have changed your terminal to show either one or the other by switching from utf-8 to latin1 encoding and doing a refresh. In other words, my email really was a nasty case of not one or the other, but both. Now, I believe patches can actually be that way - it's not at all impossible to have a diff where the _filename_ is utf-8, but the content of the patch itself is some byte-encoding like latin1. Or the other way around. > If we're still having problems like this in 2005 then I guess we need > to deal with them. This suggests we should be escaping every > non-ASCII byte, at least for patches designed to be emailed robustly. I find that email is very robust - it's basically 8-bit clean. No character encoding, no crap. Just a byte stream. It really _is_ the most reliable format. Now, a lot of email clients are really weak in _showing_ it, and as mentioned, the email that mixed both is fundamentally not something you really even _can_ show sanely. But who cares? What matters is not what it looks like, but what it _saves_ as. 
If you save the email message, it should come out as the same reliable 8-bit byte stream, or your client is actively corrupting messages rather than just showing them. This is really what my argument boils down to: character set encoding should _not_ EVER affect the _transfer_ of the data. It doesn't matter if something is latin1 or utf-8, the only thing that matters is the byte sequence. Only when you _display_ it should you try to figure out what the byte sequence possibly means. So I repeat: - escape as little as possible - make the _viewer_ decide how to view it. Yes, if people use "cat" to view patches, it can be dangerous. But that's _their_ problem. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 21:05 ` Linus Torvalds @ 2005-10-12 21:09 ` H. Peter Anvin 2005-10-12 21:15 ` Johannes Schindelin ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: H. Peter Anvin @ 2005-10-12 21:09 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Linus Torvalds wrote: > > Now, I believe patches can actually be that way - it's not at all > impossible to have a diff where the _filename_ is utf-8, but the content > of the patch itself is some byte-encoding like latin1. Or the other way > around. > Or both. Trivial example: a patch to change names in comments from ISO 8859-1 to UTF-8. -hpa ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 21:05 ` Linus Torvalds 2005-10-12 21:09 ` H. Peter Anvin @ 2005-10-12 21:15 ` Johannes Schindelin 2005-10-12 21:33 ` Junio C Hamano 2005-10-14 0:57 ` Paul Eggert 3 siblings, 0 replies; 33+ messages in thread From: Johannes Schindelin @ 2005-10-12 21:15 UTC (permalink / raw) To: Linus Torvalds Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Hi, On Wed, 12 Oct 2005, Linus Torvalds wrote: > Yes, if people use "cat" to view patches, it can be dangerous. But that's > _their_ problem. No, that is the cat's problem. Sorry, couldn't resist. Ciao, Dscho ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 21:05 ` Linus Torvalds 2005-10-12 21:09 ` H. Peter Anvin 2005-10-12 21:15 ` Johannes Schindelin @ 2005-10-12 21:33 ` Junio C Hamano 2005-10-14 0:57 ` Paul Eggert 3 siblings, 0 replies; 33+ messages in thread From: Junio C Hamano @ 2005-10-12 21:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > This is really what my argument boils down to: character set encoding > should _not_ EVER affect the _transfer_ of the data. It doesn't matter if > something is latin1 or utf-8, the only thing that matters is the byte > sequence. Only when you _display_ it should you try to figure out what the > byte sequence possibly means. > > So I repeat: > - escape as little as possible > - make the _viewer_ decide how to view it. I think the same argument can be made about patch application, although strictly speaking it is not "viewing". Let the patch program decide (or the user to tell her decision to the patch program) what the unescaped byte sequence in the patch that represents the path being affected is encoded in, and do something sensible while taking into account that the pathname encoding on the working tree may be different from what is recorded in the patch. For example, one of my partitions is ntfs mounted with nls=euc-jp, and I expect the tool to help me apply patches to a Japanese-named file when the patch is from a system with UTF-8 encoded filenames. ^ permalink raw reply [flat|nested] 33+ messages in thread
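Editorial illustration (not part of the original message): the kind of help asked for above, applying a patch whose pathnames are UTF-8 to a working tree whose filesystem uses EUC-JP, amounts to re-encoding the unescaped path bytes at apply time. The sketch below is hypothetical: `transcode_path` is not a git function, and falling back to the untouched bytes when the path cannot be converted is an assumption chosen to match the "bit-for-bit unless told otherwise" position in this thread.

```python
def transcode_path(path: bytes, patch_enc: str, tree_enc: str) -> bytes:
    """Re-encode a pathname from the patch's encoding to the working tree's.

    If the bytes are not valid in patch_enc, or cannot be represented in
    tree_enc, return them unchanged rather than guessing.
    """
    try:
        return path.decode(patch_enc).encode(tree_enc)
    except UnicodeError:
        return path


name = "日本語.txt"
in_patch = name.encode("utf-8")
on_disk = transcode_path(in_patch, "utf-8", "euc_jp")

# Same characters, different byte sequences on each side.
assert on_disk.decode("euc_jp") == name
assert on_disk != in_patch
```

The conversion happens only at the point where the user (or the tool, on the user's instruction) knows both encodings; the patch itself still carries the original bytes.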
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 21:05 ` Linus Torvalds ` (2 preceding siblings ...) 2005-10-12 21:33 ` Junio C Hamano @ 2005-10-14 0:57 ` Paul Eggert 2005-10-14 5:43 ` Linus Torvalds 3 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2005-10-14 0:57 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Linus Torvalds <torvalds@osdl.org> writes: > I find that email is very robust - it's basically 8-bit clean. No > character encoding, no crap. Just a byte stream. It really _is_ the most > reliable format. I found another amusing bit of info that tends to undercut this claim. This discussion thread is archived at <http://marc.theaimsgroup.com/?t=112877773400002&r=1&w=2&n=22>. But there's an item missing from the archive: my message with Message-ID <87vf02qy79.fsf@penguin.cs.ucla.edu>. This is the message with the joke "Aach! Those Finns! Always on the trailing edge of technology!". All my other messages are archived. What was special about this one? Surely there's not a joke filter at theaimsgroup.com! I nosed around through the archive and here's my guess as to what happened. My message's email header contained this: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable and my guess is that the web archiver can't handle that format. This is just a guess. I can't confirm it because (among other things) the web archiver won't give me all the bytes of the messages that it archives. Even its "Download message RAW" doesn't do that: it omits the header. But I have a strong suspicion. Let's put it this way: I think mine was the only message in the thread that said "charset=utf-8". If my guess is right, the archiver dropped my email on the floor simply because it contained UTF-8. This is not a good sign for putting UTF-8 into email, or for relying on email to transmit byte streams.
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-14 0:57 ` Paul Eggert @ 2005-10-14 5:43 ` Linus Torvalds 0 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2005-10-14 5:43 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Thu, 13 Oct 2005, Paul Eggert wrote: > Linus Torvalds <torvalds@osdl.org> writes: > > > I find that email is very robust - it's basically 8-bit clean. No > > character encoding, no crap. Just a byte stream. It really _is_ the most > > reliable format. > > I found another amusing bit of info that tends to undercut this claim. No, I think you found that email as a _transfer_ is mostly 8-bit clean (finally! Oh - has qmail gotten fixed?). But the end-points aren't. They do strange things with encodings, sometimes. They see an encoding they don't know what to do with, and they just freak out. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu> 2005-10-12 21:02 ` Junio C Hamano 2005-10-12 21:05 ` Linus Torvalds @ 2005-10-12 21:24 ` Linus Torvalds 2005-10-14 0:16 ` Paul Eggert 2005-10-14 6:59 ` Junio C Hamano 3 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2005-10-12 21:24 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Wed, 12 Oct 2005, Paul Eggert wrote: > > Worse, when I used Emacs to copy your text into another file -- the > sort of thing that is likely to be done with an emailed patch -- the > file contained the UTF-8 encoding of the gibberish, rather than the > original bytes of your message. Btw, this is an example of where locale-based character translations just fundamentally suck. cut-and-paste quote naturally tries to translate between the source and destination locales, but it fundamentally cannot work. The only thing that ever works is bit-for-bit copying. Any program that tries to do locale conversion is always going to be a bug waiting to happen. If GNU emacs does locale translations rather than just do a binary transfer of the data, then that's a sign that GNU emacs is being really stupid. If the data was UTF-8 to begin with, then a binary copy is also going to be UTF-8. And if it wasn't UTF-8, then a binary copy is the only thing that is sensible. And this is the thing that makes UTF-8 so wonderful: exactly the fact that it makes bit-for-bit copying an acceptable policy again, and locales become a non-issue. In a truly UTF-8 world, you should _never_ convert anything at all (and that includes mis-formed UTF-8). Any non-binary file saving or transfer approach where characters have "meaning" is always a mistake. It's why DOS/Windows "binary" vs "text" files was wrong. It's why font-encoding locales are wrong (Mixed text with two types? Yet another metadata quoting scheme? No thank you!
It's also why UCS-16 and UCS-32 were total disasters: they had "context" in their encoding). Say "yes" to binary transfer. Because text transfers are broken. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-12 21:24 ` Linus Torvalds @ 2005-10-14 0:16 ` Paul Eggert 2005-10-14 5:20 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Paul Eggert @ 2005-10-14 0:16 UTC (permalink / raw) To: Linus Torvalds Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler Linus Torvalds <torvalds@osdl.org> writes: > So I repeat: > - escape as little as possible > - make the _viewer_ decide how to view it. Under my most recent proposal, the only bytes one must escape are ", \, and LF. Doesn't that satisfy these two main criteria? > If GNU emacs does locale translations rather than just do a binary > transfer of the data, then that's a sign that GNU emacs is being > really stupid. Perhaps so, but it has a lot of company. I have even worse problems with Mozilla Thunderbird. And as we observed, Pine also has problems sending properly-formatted email containing arbitrary binary data. I suspect the vast majority of email clients will screw up in relatively common cases involving unusual characters in file names. Using attachments avoids many of the problems, but lots of patches are emailed inline and I'd rather not force people to use attachments to send diffs. > I find that email is very robust - it's basically 8-bit clean. No > character encoding, no crap. Just a byte stream. It really _is_ the most > reliable format. Hmm. To test that theory, I just now sent plain-text email to myself, containing a carriage-return (CR) byte in the middle of a line. The CR byte was transliterated into a LF. Ooops. This was the very first (and only) test I tried, which isn't a good sign for reliability. If you're curious, I tracked the problem down to Exim, a popular mail transfer agent that is running on my personal Debian GNU/Linux (stable) box. As to why Exim munges email, please see <http://www.exim.org/exim-html-4.40/doc/html/spec_44.html#SECT44.1>. 
(And I didn't know about the Exim glitch before trying my test. I'm normally a Sendmail man myself.) More generally, I suspect inline patches with weird bytes will suffer greatly from encoding and recoding by mail agents. > What matters is not what it looks like, but what it _saves_ as. If > you save the email message, it should come out as the same reliable > 8-bit byte stream Unfortunately this isn't true for Emacs, and I suspect other mailers will have similar problems. For example, with Emacs I can easily save either the exact byte-for-byte message body that my mail transfer agent gave me; or I can have Emacs decode the message into its constituent characters, reencode the result as UTF-8, and put that into a file. In neither case, though, am I saving the original byte stream that you presented to your mail user agent. Even if I save the byte-for-byte message body, it is often in quoted-printable format so I'll have to decode strings like "=EF" to recover the original bytes. This is doable, yes, but it's inconvenient in practice, at least with the mail user agents I'm familiar with. And even if I do it, I don't necessarily have the same byte stream you gave your mail user agent; I merely have the byte stream that your MUA gave to your MTA, and these may not be the same thing (they certainly aren't always the same thing with Emacs). The simplest fix for git may be to say "Don't use inline patches; use attachments if you must email anything with strange characters in it." That's fine. But I prefer a format that also allows GNU diff, if it chooses, to generate output that resists common inline-email botches. ^ permalink raw reply [flat|nested] 33+ messages in thread
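Editorial illustration (not part of the original message): the "=EF" decoding step mentioned above is mechanical. The sketch below uses Python's standard `quopri` module on a made-up message body; the sample text is an assumption for the example. It shows that undoing the quoted-printable transfer encoding recovers raw bytes, and that what those bytes mean is a separate question.

```python
import quopri

# A message body as a mail agent might store it: byte 0xEF is written "=EF".
stored_body = b"na=EFve"

# Undo the transfer encoding only; this yields the sender's raw bytes.
raw = quopri.decodestring(stored_body)
assert raw == b"na\xefve"

# Interpreting those bytes is a display-time decision.
assert raw.decode("latin-1") == "naïve"

# Re-encoding the displayed text as UTF-8 gives a different byte stream,
# which is the kind of silent change described above.
assert "naïve".encode("utf-8") != raw
```

Decoding the transfer encoding and converting the character set are two distinct operations; only the first is needed to get back the bytes that were handed to the sending mail agent.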
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames 2005-10-14 0:16 ` Paul Eggert @ 2005-10-14 5:20 ` Linus Torvalds 2005-10-14 17:18 ` H. Peter Anvin 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2005-10-14 5:20 UTC (permalink / raw) To: Paul Eggert Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler On Thu, 13 Oct 2005, Paul Eggert wrote: > > Perhaps so, but it has a lot of company. I have even worse problems > with Mozilla Thunderbird. And as we observed, Pine also has problems > sending properly-formatted email containing arbitrary binary data. No, pine does it right. Exactly because it sends _arbitrary_ binary data. The fact that I turned the terminal into utf-8 mode in order to generate the bytes (that end up being a garbage string in latin1) is not pine's fault. The point being that because the transport was 8-bit clean, I could do that. I could mix a latin1-encoding with a UTF-8 encoding, and the other side could see the mixed setting. Now, the other side had no way of knowing that I mixed things (unless it was a smart human and could read and understand what I wrote), so any email client would have trouble showing it. But it got _transferred_ right, and you could have saved the email, and turned the terminal into latin1 or utf-8 mode, and done a "cat" both ways, and you'd have seen both versions. > I suspect the vast majority of email clients will screw up in > relatively common cases involving unusual characters in file names. Not if they just save it. Oh, sure, they can't _display_ it, since they don't know what it is, but when they save it, they'd _better_ save it bit-for-bit. Which is the right thing to do. Then you apply it with "patch", and you get the right answer. > Using attachments avoids many of the problems, but lots of patches are > emailed inline and I'd rather not force people to use attachments to > send diffs. inline or attachment should not matter to any sane email client.
If it does, then the email client isn't sane. The point is, when you save it, it _has_ to be saved bit-for-bit. The only difference between a binary attachment and a text thing is that an email client will _try_ to show the text thing to you as text. It has no other meaning. And trying is better than not trying. Attachments are _inferior_ to inline for that reason. > > I find that email is very robust - it's basically 8-bit clean. No > > character encoding, no crap. Just a byte stream. It really _is_ the most > > reliable format. > > Hmm. To test that theory, I just now sent plain-text email to myself, > containing a carriage-return (CR) byte in the middle of a line. > > The CR byte was transliterated into a LF. Ooops. I'm not surprised, since CR/LF is special for a lot of (sad) reasons. Oh, well. I agree that it makes sense to escape \r, and obviously you _have_ to escape \n. In general, escaping pretty much everything in the 0-31 range is likely the right approach, since those are never printable anyway. That, btw, is probably true of the patch contents too, not just the filename. The exception being \t (and in patch contents, \n is obviously part of the stream). > More generally, I suspect inline patches with weird bytes will suffer > greatly from encoding and recoding by mail agents. I've had pretty good luck. We do have 8-bit stuff occasionally, but it almost always makes it through. Spaces and tabs are much worse (yes, they're more common too). That's clearly just crap mailers. > Unfortunately this isn't true for Emacs, and I suspect other mailers > will have similar problems. For example, with Emacs I can easily save > either the exact byte-for-byte message body that my mail transfer > agent gave me; or I can have Emacs decode the message into its > constituent characters, reencode the result as UTF-8, and put that > into a file. Well, as long as there's a choice. 
> In neither case, though, am I saving the original byte > stream that you presented to your mail user agent. Even if I save the > byte-for-byte message body, it is often in quoted-printable format so > I'll have to decode strings like "=EF" to recover the original bytes. You have a broken mail client. Now, I'm not a big fan of QP (I think it was making a stupid excuse for bad transport), but QP is a _mail_ level quoting protocol, and the same way a MUA uses QP to encode, the MUA should have de-coded the QP. It shouldn't leave it to somebody else. I think GNU emacs is a horrible mistake ("do everything - badly"), but you may be able to fix it by letting your mail transport agent do the un-QP for you. A lot of them do, which makes it easier to then use weak MUA's. Anyway, it sounds like GNU emacs made the wrong choices (hey, I'm not surprised). It should have decoded QP, not the character set. There are lots of tools that do charset conversions, that's not very email-specific. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-14 5:20 ` Linus Torvalds
@ 2005-10-14 17:18 ` H. Peter Anvin
  0 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2005-10-14 17:18 UTC
To: Linus Torvalds
Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds wrote:
>
> No, pine does it right. Exactly because it sends _arbitrary_ binary data.
>
> The fact that I turned the terminal into utf-8 mode in order to generate
> the bytes (that end up being a garbage string in latin1) is not pine's
> fault.
>

I would think a full-screen editor would need to know about multibyte
encodings.

	-hpa
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  [not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
  ` (2 preceding siblings ...)
  2005-10-12 21:24 ` Linus Torvalds
@ 2005-10-14 6:59 ` Junio C Hamano
  3 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-14 6:59 UTC
To: Paul Eggert; +Cc: Linus Torvalds, git

Paul Eggert <eggert@CS.UCLA.EDU> writes:

> Here is the proposed format. Each file name is a string of bytes, in
> one of the following two formats:
>
> A. A nonempty sequence of ASCII graphic characters (i.e., bytes in
>    the range '!' == '\041' through '~' == '\176'). The first byte
>    cannot be '!' == '\041' or '"' == '\042'. Leading '"' is used for
>    (B) below, and leading '!' is reserved for future extensions.
>
> B. A nonempty C-language character string literal, with the following
>    restrictions and modifications:
>
>    B1. No multibyte character processing is done. Members of the
>        string literal are treated as bytes, not characters. Null
>        bytes are not allowed, and '"' == '\042', '\\' == '\134' and
>        '\n' == '\012' are allowed only if properly escaped as shown
>        below; but all other bytes are allowed.
>
>    B2. No trigraph processing is done (e.g., ??/ stands for three
>        bytes, not one).
>
>    B3. No line-splicing is done (i.e., backslash-newline is not allowed).
>
>    B4. Only the following escape sequences are allowed.
>
>        \" \\ \a \b \f \n \r \t \v
>        \XYZ (where X, Y, and Z are octal digits, X <= 3, and
>              at least one of the digits is nonzero)

Just to let you know, I am slowly converting apply.c to accept
this format, and also diff.c to produce this. I did not
personally like the missing double quotes around what I did
anyway, although it was easier to code.
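The format quoted above can be sketched as a small encoder. This is only
an illustration in Python of rules A and B as written in the proposal,
not the code that went into diff.c or apply.c. The function name
quote_path is made up, and the choice to emit control bytes and bytes
above '~' as octal escapes (rather than raw, which B1 would also permit)
is an assumption of this sketch.

```python
# Escape sequences permitted by B4, keyed by byte value.
_NAMED = {
    0x22: b'\\"', 0x5C: b"\\\\",
    0x07: b"\\a", 0x08: b"\\b", 0x0C: b"\\f",
    0x0A: b"\\n", 0x0D: b"\\r", 0x09: b"\\t", 0x0B: b"\\v",
}


def quote_path(name: bytes) -> bytes:
    """Hypothetical helper: render a path in format A or format B."""
    if not name:
        raise ValueError("file name must be nonempty")
    if 0 in name:
        raise ValueError("null bytes are not allowed (B1)")

    # Format A: only ASCII graphic bytes, and no leading '!' or '"'.
    all_graphic = all(0x21 <= b <= 0x7E for b in name)
    if all_graphic and name[0] not in (0x21, 0x22):
        return name

    # Format B: a C-style string literal, processed byte by byte.
    out = bytearray(b'"')
    for b in name:
        if b in _NAMED:
            out += _NAMED[b]
        elif b < 0x20 or b >= 0x7F:
            # Assumption of this sketch: always use three octal digits.
            out += ("\\%03o" % b).encode("ascii")
        else:
            out.append(b)
    out += b'"'
    return bytes(out)


print(quote_path(b"foo/bar.c"))
print(quote_path(b"foo\nbar"))
```

Under rule A a name such as pc\h.c is left alone, since backslash is an
ordinary graphic character there; only names that need format B have
their backslashes doubled.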
* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 13:30 ` [PATCH] Try URI quoting for " Robert Fitzsimons
  2005-10-08 18:30 ` Junio C Hamano
@ 2005-10-09 10:42 ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-09 10:42 UTC
To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler

Robert Fitzsimons <robfitz@273k.net> writes:

> Instead of using //{LF}// and //{TAB}// to quote embedded tab and
> linefeed characters in pathnames use URI quoting.

I changed my mind, although I still do not particularly like the
innocuous-looking C-style or URI-style quoting, simply because I
feel these funny characters should stand out loudly, but at the
same time I realize that is just a matter of personal taste.
Also the list does not seem to mind too much losing an extra
unusual character (either backslash or percent) for quoting.

After futzing with this a bit more, I decided that C-style
quoting is the cleanest way in the longer run. I'll be talking
with the current maintainer of GNU patch about making it
understand C-style quoting in its input, when the program
operates under --quoting-style=c flag (or maybe some other
flag).

In addition to git-diff output, git-ls-files output is also
quoted for TAB and LF (and backslash) when not using '-z' in the
version I have in the "pu" branch. I haven't converted
git-ls-tree yet, but that should also be done before these
changes can graduate out of "pu" branch.

Existing Porcelains or people's scripts should not break (any
more than they currently are broken ;-). Any self-respecting
Porcelain should be either parsing '-z' output (in which case
there is no change), or parsing non '-z' output after declaring
that it does not support filenames with embedded TAB or LF (in
which case there is no new breakage, except that they have one
more character that their users cannot have in the filename --
backslash).
Here is an example output from my random repository, that has
files with TAB, LF and backslash in their names (Note that the
file "pc" + one backslash + "h.c" is shown with two
backslashes).

    : siamese; git status
    # Updated but not checked in:
    #   (will commit)
    #
    #   new file: ab\n\tc/mno
    #   modified: abc/mno
    #   renamed: def\nghi/pqr -> dee/pqr
    #   new file: dee/www
    #   modified: j k l
    #
    #
    # Changed but not updated:
    #   (use git-update-index to mark for commit)
    #
    #   deleted: abc/mno
    #
    #
    # Ignored files:
    #   (use "git add" to add to commit)
    #
    #   diff-sample
    #   pc\\h.c
    #   pch.c.orig
    #   quote\targ.c
    #   quotearg.c.orig
    : siamese; git diff HEAD
    diff --git a/abc/mno b/ab\n\tc/mno
    similarity index 72%
    rename from abc/mno
    rename to ab\n\tc/mno
    index 0ac2a8c..3deac99 100644
    --- a/abc/mno
    +++ b/ab\n\tc/mno
    @@ -1 +1,3 @@
     Fri Oct 7 23:18:45 PDT 2005
    +foo
    +foo
    ...
    : siamese; git diff HEAD | git apply --index-info
    100644 0ac2a8c8cad088c3e843689dbd833aeabf6b1870 abc/mno
    100644 9ee055c103e84ffdd9ec15457481c92699d12fc8 def\nghi/pqr
    ...

Anyway, I'll keep this in the "pu" branch a bit longer to let
the discussion simmer.
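A script that reads the non '-z' output shown above has to undo the
quoting to get back the real pathname. The following Python sketch shows
one way that could look. It assumes the escapes are limited to the
C-style set discussed in this thread (named escapes plus three-digit
octal) and that any surrounding double quotes have already been
stripped; the name unquote_path is made up for the example and is not
what apply.c does.

```python
# Named C-style escapes: the byte after the backslash -> decoded byte.
_SIMPLE = {
    ord('"'): 0x22, ord("\\"): 0x5C,
    ord("a"): 0x07, ord("b"): 0x08, ord("f"): 0x0C,
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("v"): 0x0B,
}


def unquote_path(quoted: bytes) -> bytes:
    """Hypothetical helper: decode backslash escapes in a quoted path."""
    out = bytearray()
    i, n = 0, len(quoted)
    while i < n:
        b = quoted[i]
        if b != 0x5C:               # ordinary byte, copy through
            out.append(b)
            i += 1
            continue
        if i + 1 >= n:
            raise ValueError("dangling backslash at end of name")
        nxt = quoted[i + 1]
        if nxt in _SIMPLE:          # \n, \t, \\ and friends
            out.append(_SIMPLE[nxt])
            i += 2
            continue
        digits = quoted[i + 1:i + 4]
        if (len(digits) != 3
                or any(d not in b"01234567" for d in digits)
                or digits[0] > ord("3")):
            raise ValueError("unsupported escape sequence")
        value = int(digits, 8)
        if value == 0:
            raise ValueError("null bytes are not allowed")
        out.append(value)
        i += 4
    return bytes(out)


print(unquote_path(b"ab\\n\\tc/mno"))
print(unquote_path(b"pc\\\\h.c"))
```

Rejecting unknown escapes instead of passing them through is a choice
made here so that a malformed name fails loudly rather than silently
naming the wrong file.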
Thread overview: 33+ messages

2005-10-07 19:35 [RFC] embedded TAB and LF in pathnames Junio C Hamano
2005-10-07 23:29 ` Alex Riesen
2005-10-07 23:44 ` Junio C Hamano
2005-10-08 6:45 ` Alex Riesen
2005-10-08 9:10 ` Junio C Hamano
2005-10-08 13:30 ` [PATCH] Try URI quoting for " Robert Fitzsimons
2005-10-08 18:30 ` Junio C Hamano
2005-10-08 20:19 ` Junio C Hamano
2005-10-11 6:20 ` Paul Eggert
2005-10-11 7:37 ` Junio C Hamano
2005-10-11 15:17 ` Linus Torvalds
2005-10-11 18:03 ` Paul Eggert
2005-10-11 18:37 ` Linus Torvalds
2005-10-11 19:42 ` Paul Eggert
2005-10-11 20:56 ` Linus Torvalds
2005-10-12 6:51 ` Paul Eggert
2005-10-12 14:59 ` Linus Torvalds
2005-10-12 19:07 ` Daniel Barkalow
2005-10-12 19:52 ` Linus Torvalds
2005-10-12 20:21 ` H. Peter Anvin
[not found] ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
2005-10-12 21:02 ` Junio C Hamano
2005-10-12 21:05 ` Linus Torvalds
2005-10-12 21:09 ` H. Peter Anvin
2005-10-12 21:15 ` Johannes Schindelin
2005-10-12 21:33 ` Junio C Hamano
2005-10-14 0:57 ` Paul Eggert
2005-10-14 5:43 ` Linus Torvalds
2005-10-12 21:24 ` Linus Torvalds
2005-10-14 0:16 ` Paul Eggert
2005-10-14 5:20 ` Linus Torvalds
2005-10-14 17:18 ` H. Peter Anvin
2005-10-14 6:59 ` Junio C Hamano
2005-10-09 10:42 ` Junio C Hamano