git.vger.kernel.org archive mirror
* [RFC] embedded TAB and LF in pathnames
@ 2005-10-07 19:35 Junio C Hamano
  2005-10-07 23:29 ` Alex Riesen
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-07 19:35 UTC (permalink / raw)
  To: git; +Cc: Kai Ruemmler

While I was reviewing the git-status fix by Kai Ruemmler, it struck
me that our barebone Porcelain-ish layer got a bit sloppier over
time.  The core layer does not care about any metacharacters in
the pathname, and it has provisions, primarily in the form of
'-z' flag, for carefully written Porcelain layers to handle
pathnames with embedded metacharacters correctly.

One exception, however, is the interaction between the git-diff
family output and git-apply.  We needed to be compatible with
other people's diff, which meant that we should not have to
worry too much about pathnames with embedded TABs and LFs
because GNU diff would not produce usable diff for such things
anyway.  But 'git-diff --names' barfing if a pathname contained
these characters when run without '-z' flag was too much.  This
still breaks 'git-status'.

So I am considering the following changes:

 - 'raw' output format without '-z', upon finding a TAB or LF,
   would not die, but just issue a warning.  However, the paths
   are "munged" in a way described later.

 - '--name-only' and '--name-status' format issue the same
   warning when finding these characters and run without '-z'.
   And the paths are "munged" as well.

 - 'patch' output format also issues a warning.  The paths are
   "munged" but in a slightly different manner from the above.

 - 'git-apply' is taught about the path munging in the diff
   input for git diffs (i.e. 'diff --git') and does sensible
   things.

One possible way for path munging goes like this.  We could take
advantage of the fact that we do not ever output '//' ourselves,
and '//' never appears in valid diffs by other people's tools,
unless done deliberately by hand ("diff -u a//foo.c b//foo.c"
from the command line).  So we could use '//' as if it is a
backslash.  Examples.

  "foo/bar.c" --> "foo/bar.c"	(no funny letters - as before)
  "foo\nbar"  --> "foo//0Abar" (double slash followed by 2 hex)
  "foo\tbar"  --> "foo//09bar" (double slash followed by 2 hex)

So a diff output to rename "foo/bar.c" to "foo\nbar.c" would
become:

  diff --git a/foo/bar.c b/foo//0Abar.c
  similarity index 100%
  rename from foo/bar.c
  rename to foo//0Abar.c

The byte-values subject to this munging are LF for patch output
(because git-apply seems to grok TABs in pathnames just fine),
and TAB and LF for 'raw', '--name-only', '--name-status' without
'-z'.
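
As an illustration only (this is not code from the thread), the
'//' plus two hex digits munging described above could be sketched in
Python as follows.  The name munge_path and its munge_tab parameter are
made up for this sketch; munge_tab=False stands in for the 'patch'
output case where only LF is munged.

```python
# Sketch of the proposed '//' + two-hex-digit munging.
# munge_path and munge_tab are hypothetical names, not part of git.
def munge_path(path, munge_tab=True):
    """Replace LF (and optionally TAB) with '//' followed by two hex digits."""
    out = []
    for ch in path:
        if ch == "\n" or (munge_tab and ch == "\t"):
            out.append("//%02X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


print(munge_path("foo/bar.c"))   # foo/bar.c
print(munge_path("foo\nbar"))    # foo//0Abar
print(munge_path("foo\tbar"))    # foo//09bar
```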

I have not made up my mind on the exact choice of the quoting
convention.  We could say '///' instead of '//', for example, or
even '//{LF}//' instead of '//0A' proposed above.  One thing I
am trying to avoid is "foo\nbar", which I suspect would be
unfriendly to the Cygwin folks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 19:35 [RFC] embedded TAB and LF in pathnames Junio C Hamano
@ 2005-10-07 23:29 ` Alex Riesen
  2005-10-07 23:44   ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Riesen @ 2005-10-07 23:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kai Ruemmler

Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
> I have not made up my mind on the exact choice of the quoting
> convention.  We could say '///' instead of '//', for example, or
> even '//{LF}//' instead of '//0A' proposed above.  One thing I
> am trying to avoid is "foo\nbar", which I suspect would be
> unfriendly to the Cygwin folks.

Being an unhappy one of them, I think I'd better manage (even if by
postprocessing the output).

Please, don't make the common case ugly just because of that platform
(insanely broken anyway).


* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 23:29 ` Alex Riesen
@ 2005-10-07 23:44   ` Junio C Hamano
  2005-10-08  6:45     ` Alex Riesen
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-07 23:44 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Kai Ruemmler

Alex Riesen <raa.lkml@gmail.com> writes:

> Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
>> I have not made up my mind on the exact choice of the quoting
>> convention.  We could say '///' instead of '//', for example, or
>> even '//{LF}//' instead of '//0A' proposed above.  One thing I
>> am trying to avoid is "foo\nbar", which I suspect would be
>> unfriendly to the Cygwin folks.
>
> Being an unhappy one of them, I think I'd better manage (even if by
> postprocessing the output).
>
> Please, don't make the common case ugly just because of that platform
> (insanely broken anyway).

You really have to realize that having LF and TAB in filenames
is *NOT* the common case, no matter which platform you are
talking about.


* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-07 23:44   ` Junio C Hamano
@ 2005-10-08  6:45     ` Alex Riesen
  2005-10-08  9:10       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Riesen @ 2005-10-08  6:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kai Ruemmler

Junio C Hamano, Sat, Oct 08, 2005 01:44:48 +0200:
> > Junio C Hamano, Fri, Oct 07, 2005 21:35:19 +0200:
> >> I have not made up my mind on the exact choice of the quoting
> >> convention.  We could say '///' instead of '//', for example, or
> >> even '//{LF}//' instead of '//0A' proposed above.  One thing I
> >> am trying to avoid is "foo\nbar", which I suspect would be
> >> unfriendly to the Cygwin folks.
> >
> > Being an unhappy one of them, I think I'd better manage (even if by
> > postprocessing the output).
> >
> > Please, don't make the common case ugly just because of that platform
> > (insanely broken anyway).
> 
> You really have to realize that having LF and TAB in filenames
> is *NOT* the common case, no matter which platform you are
> talking about.
> 

Yes, but "//" in a path is quite common. Even "///" is not uncommon.

How about copying ls's approach where possible?

   -b, --escape, --quoting-style=escape
          Quote  nongraphic  characters in file names using alphabetic and
          octal backslash sequences like those used in C. This  option  is
          the  same as -Q except that filenames are not surrounded by
          double-quotes.


* Re: [RFC] embedded TAB and LF in pathnames
  2005-10-08  6:45     ` Alex Riesen
@ 2005-10-08  9:10       ` Junio C Hamano
  2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08  9:10 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git, Kai Ruemmler

Alex Riesen <raa.lkml@gmail.com> writes:

>           Quote  nongraphic  characters in file names using alphabetic and
>           octal backslash sequences like those used in C. This  option  is
>           the  same as -Q except that filenames are not surrounded by
>           double-quotes.

If you have a file whose name is 'foo' + LF + 'bar', and if you
use backslash convention, your diff would start like this:

    diff --git a/foo\nbar b/foo\nbar
    @@ 1,2 3,4 @@
     context
    -deleted
    ...

which looks quite natural.

I would, however, prefer this kind of funny pathnames to *stand*
*out* more than usual, to make it really obvious that there is
something really funky going on.  In that sense, the above is a
bit too innocuous-looking to my taste.

But this "embedded LF and TAB" is a corner case.  I would not be
using such paths that would trigger the quoting myself anyway,
and I do not particularly care as long as the tools do the right
thing -- any quoting rule would do, as long as the generating
side (git-diff) is consistent with accepting side (git-apply),
and as long as there is no new ambiguity introduced.

The backslash proposal is introducing a small ambiguity.  You
cannot tell if the file had an embedded LF between 'foo' and
'bar' (and generated with your git-diff) or had an embedded
backslash between 'foo' and 'nbar' (and generated with existing
git-diff).  Since we never had a version of git-diff that
outputs double-slashes '//' in paths, there is no ambiguity if
we use it as a quoting mechanism.
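
The collision described above can be shown with a small Python sketch.
quote_naive is a hypothetical helper that rewrites LF and TAB but, like
a git-diff that does no quoting at all, passes a literal backslash
through unchanged; it is not taken from any git version.

```python
# Hypothetical quoting that rewrites LF and TAB but leaves backslash alone.
def quote_naive(path):
    return path.replace("\n", "\\n").replace("\t", "\\t")


with_lf = "foo\nbar"           # 'foo' + LF + 'bar'
with_backslash = "foo\\nbar"   # 'foo' + backslash + 'nbar'

# Two different file names end up with the same printed form.
print(quote_naive(with_lf) == quote_naive(with_backslash))   # True
```

Escaping the backslash itself removes the collision within one version
of the tool, but it cannot tell a reader which version produced a given
diff, which is the ambiguity the paragraph above is about.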

Just as a concrete demonstration, here is what the git-status
output and git-diff output would look like for a file 'pqr' in a
directory whose name is 'def' + LF + 'ghi' that uses the version
of git-diff from the proposed updates branch:

        # Changed but not updated:
        #   (use git-update-index to mark for commit)
        #
        #	modified: def//{LF}//ghi/pqr

        diff --git a/def//{LF}//ghi/pqr b/def//{LF}//ghi/pqr
        index 9ee055c..47dbc3f 100644
        --- a/def//{LF}//ghi/pqr
        +++ b/def//{LF}//ghi/pqr
        @@ -1 +1,2 @@
         Fri Oct  7 23:19:04 PDT 2005
        +foo

I am not married to this quoting syntax -- I think it *is* ugly,
but as I said before, I'd prefer to have something ugly here.

I would easily be persuaded otherwise, though.  A working patch
would probably be the most effective way of persuasion, but a
mock output without the code to produce and/or parse it would
also be fine as a starting point for discussion.


* [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08  9:10       ` Junio C Hamano
@ 2005-10-08 13:30         ` Robert Fitzsimons
  2005-10-08 18:30           ` Junio C Hamano
  2005-10-09 10:42           ` Junio C Hamano
  0 siblings, 2 replies; 33+ messages in thread
From: Robert Fitzsimons @ 2005-10-08 13:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Alex Riesen, git, Kai Ruemmler

Instead of using //{LF}// and //{TAB}// to quote embedded tab and
linefeed characters in pathnames use URI quoting.

'\t' becomes %09
'\n' becomes %10
'%' becomes %25

Signed-off-by: Robert Fitzsimons <robfitz@273k.net>

---

> I am not married to this quoting syntax -- I think it *is* ugly,
> but as I said before, I'd prefer to have something ugly here.
> 
> I would easily be persuaded otherwise, though.  A working patch
> would probably be the most effective way of persuasion, but a
> mock output without the code to produce and/or parse it would
> also be fine as a starting point for discussion.

Using URI encoding might be an option; it's not as ugly and more people
should understand what it means.  Here's a possible patch against pu.

Robert


 apply.c       |   19 ++++++++++++-------
 diff.c        |   26 +++++++++++++++++---------
 git-status.sh |   10 ++++++----
 3 files changed, 35 insertions(+), 20 deletions(-)

applies-to: a9332b0c2bd80a182f946d22d4ec7511c32c55f4
8029a957cab1a912562696fdce8beea5fc2c11c4
diff --git a/apply.c b/apply.c
--- a/apply.c
+++ b/apply.c
@@ -75,21 +75,26 @@ static char *unmunge_name(char *name)
 
 	if (!name)
 		return name;
-	cp = strstr(name, "//");
+	cp = strstr(name, "%");
 	if (!cp)
 		return name;
 	ret_name = strdup(name);
 	for (cp = dp = ret_name; (ch = *cp); cp++) {
-		if (ch == '/' && cp[1] == '/' && cp[2] == '{') {
-			/* //{TAB}// or //{LF}// */
-			if (!strncmp(cp + 3, "TAB}//", 6)) {
+		if (ch == '%') {
+			/* %09 or %10 or %25 */
+			if (!strncmp(cp + 1, "09", 2)) {
 				*dp++ = '\t';
-				cp += 8;
+				cp += 2;
 				continue;
 			}
-			else if (!strncmp(cp + 3, "LF}//", 5)) {
+			else if (!strncmp(cp + 1, "10", 2)) {
 				*dp++ = '\n';
-				cp += 7;
+				cp += 2;
+				continue;
+			}
+			else if (!strncmp(cp + 1, "25", 2)) {
+				*dp++ = '%';
+				cp += 2;
 				continue;
 			}
 			error("malformed munged name '%s' (looking at %s)",
diff --git a/diff.c b/diff.c
--- a/diff.c
+++ b/diff.c
@@ -13,7 +13,7 @@ static const char *path_munge(const char
 {
 	const char *cp;
 	char *retpath, *dp;
-	int ch, munge_inter_name = 0, munge_line_term = 0;
+	int ch, munge_inter_name = 0, munge_line_term = 0, munge_quote = 0;
 
 	if (!path)
 		return path;
@@ -23,23 +23,31 @@ static const char *path_munge(const char
 			munge_inter_name++;
 		if (line_term && ch == '\n')
 			munge_line_term++;
+		if (ch == '%')
+			munge_quote++;
 	}
-	if (!(munge_inter_name + munge_line_term))
+	if (!(munge_inter_name + munge_line_term + munge_quote))
 		return path;
 
-	/* need //{TAB}// and //{LF}// */
+	/* need %09 and %10 and %25 */
 	retpath = xmalloc(cp - path +
-			  munge_inter_name * 8 +
-			  munge_line_term * 7 + 1);
+			  munge_inter_name * 3 +
+			  munge_line_term * 3 +
+			  munge_quote * 3 + 1);
 	for (cp = path, dp = retpath; (ch = *cp); cp++, dp++) {
 		if (inter_name && ch == '\t') {
-			memcpy(dp, "//{TAB}//", 9);
-			dp += 8;
+			memcpy(dp, "%09", 3);
+			dp += 2;
 			continue;
 		}
 		if (line_term && ch == '\n') {
-			memcpy(dp, "//{LF}//", 8);
-			dp += 7;
+			memcpy(dp, "%10", 3);
+			dp += 2;
+			continue;
+		}
+		if (ch == '%') {
+			memcpy(dp, "%25", 3);
+			dp += 2;
 			continue;
 		}
 		*dp = ch;
diff --git a/git-status.sh b/git-status.sh
--- a/git-status.sh
+++ b/git-status.sh
@@ -54,8 +54,9 @@ else
 	perl -e '$/ = "\0";
 		while (<>) {
 			chomp;
-			s|\t|//{TAB}//|g;
-			s|\n|//{LF}//|g;
+			s|%([^021][^059])|%25\1|g;
+			s|\t|%09|g;
+			s|\n|%10|g;
 			s/ /\\ /g;
 			s/^/A /;
 			print "$_\n";
@@ -84,8 +85,9 @@ perl -e '$/ = "\0";
 	my $shown = 0;
 	while (<>) {
 		chomp;
-		s|\t|//{TAB}//|g;
-		s|\n|//{LF}//|g;
+		s|%([^01][^09])|%25\1|g;
+		s|\t|%09|g;
+		s|\n|%10|g;
 		s/^/#	/;
 		if (!$shown) {
 			print "#\n# Ignored files:\n";
---
0.99.8.GIT
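
For readers who want to experiment with the idea outside the C code, a
minimal Python sketch of the same three-character percent quoting
follows.  The names uri_quote and uri_unquote are made up here.  Note
that LF is byte 0x0A, so conventional percent-encoding spells it %0A;
the %10 used in the patch above reads like the decimal value and would
decode to byte 0x10 under the usual rules.  The sketch uses %0A.

```python
# Percent quoting for TAB, LF and '%' only; everything else passes through.
# uri_quote / uri_unquote are hypothetical names for this sketch.
QUOTED = {"\t": "%09", "\n": "%0A", "%": "%25"}
UNQUOTED = {code: ch for ch, code in QUOTED.items()}


def uri_quote(path):
    return "".join(QUOTED.get(ch, ch) for ch in path)


def uri_unquote(name):
    out = []
    i = 0
    while i < len(name):
        if name[i] == "%":
            code = name[i:i + 3]
            if code not in UNQUOTED:
                raise ValueError("malformed quoted name %r at offset %d" % (name, i))
            out.append(UNQUOTED[code])
            i += 3
        else:
            out.append(name[i])
            i += 1
    return "".join(out)


print(uri_quote("def\nghi/pqr"))   # def%0Aghi/pqr
print(uri_quote("100%.txt"))       # 100%25.txt
```

Because '%' itself has to be quoted, a name that already contains a
per-cent sign changes its printed form, which older tools reading the
diff would not expect.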


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
@ 2005-10-08 18:30           ` Junio C Hamano
  2005-10-08 20:19             ` Junio C Hamano
  2005-10-09 10:42           ` Junio C Hamano
  1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08 18:30 UTC (permalink / raw)
  To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler

Robert Fitzsimons <robfitz@273k.net> writes:

> Instead of using //{LF}// and //{TAB}// to quote embedded tab and
> linefeed characters in pathnames use URI quoting.
>
> '\t' becomes %09
> '\n' becomes %10
> '%' becomes %25
>
> Signed-off-by: Robert Fitzsimons <robfitz@273k.net>

This would break existing setups where people *have* a per-cent
letter in their pathname -- which I think is worse than the
backslash proposal.


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 18:30           ` Junio C Hamano
@ 2005-10-08 20:19             ` Junio C Hamano
  2005-10-11  6:20               ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2005-10-08 20:19 UTC (permalink / raw)
  To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler, eggert

Junio C Hamano <junkio@cox.net> writes:

> Robert Fitzsimons <robfitz@273k.net> writes:
>
>> '\t' becomes %09
>> '\n' becomes %10
>> '%' becomes %25
>>
>> Signed-off-by: Robert Fitzsimons <robfitz@273k.net>
>
> This would break existing setups where people *have* a per-cent
> letter in their pathname -- which I think is worse than the
> backslash proposal.

Having said that, I think something along the lines of backslash
or URI encoding is the cleanest way to go in the long run, with
one condition: diffs generated with git-diff should be
applicable with 'GNU patch', especially if there are no funnies
like renames and the recipient does not mind losing mode
information.

Although 'GNU patch' has --quoting-style flag, it seems to be
used only on its output side (i.e. reporting which file it is
patching, etc.).  If we can sell changes to teach the filename
encoding convention to its util.c::fetchname() upstream, we
could tell people that 'diff --git' can be applied with newer
'GNU patch' when the patch is about a file whose name contains
'%' character (which is not that unusual, compared to TAB and
LF).  While we are selling those changes to 'GNU patch', we
might even be able to sell the other extended 'diff --git'
metainformation support.

The same filename quoting rules change should probably be sold
to 'GNU diff' as well, so that plain diff can natively quote
funny characters in its output without forcing us to fake it
by using the -L flag.

If all of the above is what we aim for, I would say that is a
good direction to go in the longer term.  The double-slash hack
was just to avoid all these hassles of having to muck with other
people's tools.


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
  2005-10-08 18:30           ` Junio C Hamano
@ 2005-10-09 10:42           ` Junio C Hamano
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-09 10:42 UTC (permalink / raw)
  To: Robert Fitzsimons; +Cc: Alex Riesen, git, Kai Ruemmler

Robert Fitzsimons <robfitz@273k.net> writes:

> Instead of using //{LF}// and //{TAB}// to quote embedded tab and
> linefeed characters in pathnames use URI quoting.

I changed my mind, although I still do not particularly like the
innocuous looking C-style or URI-style quoting, simply because I
feel these funny characters should stand out loudly, but at the
same time I realize that is just a matter of personal taste.
Also the list does not seem to mind losing an extra unusual
character (either backslash or per-cent) for quoting too much.

After futzing with this a bit more, I decided that C-style
quoting is the cleanest way in the longer run.  I'll be talking
with the current maintainer of GNU patch about making it
understand C-style quoting in its input, when the program
operates under --quoting-style=c flag (or maybe some other
flag).

In addition to git-diff output, git-ls-files output is also
quoted for TAB and LF (and backslash) when not using '-z' in the
version I have in the "pu" branch.  I haven't converted
git-ls-tree yet, but that should also be done before these
changes can graduate out of "pu" branch.

Existing Porcelains or people's scripts should not break (any
more than they currently are broken ;-).  Any self-respecting
Porcelain should be either parsing '-z' output (in which case
there is no change), or parsing non '-z' output after declaring
that it does not support filenames with embedded TAB or LF (in
which case there is no new breakage, except that they have one
more character that their users cannot have in the filename --
backslash).

Here is an example output from my random repository, that has
files with TAB, LF and backslash in their names (Note that the
file "pc" + one backslash + "h.c" is shown with two backslashes).

	: siamese; git status
        # Updated but not checked in:
        #   (will commit)
        #
        #	new file: ab\n\tc/mno
        #	modified: abc/mno
        #	renamed: def\nghi/pqr -> dee/pqr
        #	new file: dee/www
        #	modified: j  k l
        #
        #
        # Changed but not updated:
        #   (use git-update-index to mark for commit)
        #
        #	deleted:  abc/mno
        #
        #
        # Ignored files:
        #   (use "git add" to add to commit)
        #
        #	diff-sample
        #	pc\\h.c
        #	pch.c.orig
        #	quote\targ.c
        #	quotearg.c.orig

	: siamese; git diff HEAD
        diff --git a/abc/mno b/ab\n\tc/mno
        similarity index 72%
        rename from abc/mno
        rename to ab\n\tc/mno
        index 0ac2a8c..3deac99 100644
        --- a/abc/mno
        +++ b/ab\n\tc/mno
        @@ -1 +1,3 @@
         Fri Oct  7 23:18:45 PDT 2005
        +foo
        +foo
        ...

        : siamese; git diff HEAD | git apply --index-info
        100644 0ac2a8c8cad088c3e843689dbd833aeabf6b1870	abc/mno
        100644 9ee055c103e84ffdd9ec15457481c92699d12fc8	def\nghi/pqr
	...

Anyway, I'll keep this in the "pu" branch a bit longer to let
the discussion simmer.
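
The quoting shown in the sample output above (TAB, LF and backslash
only) can be sketched in Python as a round trip.  c_quote and c_unquote
are illustrative names and are not the functions in the "pu" branch.

```python
# C-style quoting limited to TAB, LF and backslash.
# c_quote / c_unquote are hypothetical names for this sketch.
def c_quote(path):
    # Backslash is escaped first so it cannot be mistaken for an escape later.
    return (path.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n"))


def c_unquote(name):
    escapes = {"t": "\t", "n": "\n", "\\": "\\"}
    out = []
    chars = iter(name)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)      # character following the backslash
        if nxt not in escapes:
            raise ValueError("bad escape in %r" % name)
        out.append(escapes[nxt])
    return "".join(out)


print(c_quote("ab\n\tc/mno"))   # ab\n\tc/mno
print(c_quote("pc\\h.c"))       # pc\\h.c
```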


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-08 20:19             ` Junio C Hamano
@ 2005-10-11  6:20               ` Paul Eggert
  2005-10-11  7:37                 ` Junio C Hamano
  2005-10-11 15:17                 ` Linus Torvalds
  0 siblings, 2 replies; 33+ messages in thread
From: Paul Eggert @ 2005-10-11  6:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Junio C Hamano <junkio@cox.net> writes:

> Although 'GNU patch' has --quoting-style flag, it seems to be
> used only on its output side

Yes, that's right.

The convention I had been thinking of adding is to have GNU diff
use shell-quoting style, e.g.,

'three
o'\''clock'

to represent a file name with a newline and an apostrophe in it.
This sort of file name can be cut and pasted into the shell.
The quoting could be used with any file name containing a
troublesome character.
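
A minimal Python sketch of this shell-quoting style, assuming the rule
is simply "wrap in single quotes and spell an embedded apostrophe as
'\''".  shell_quote is a made-up name, not a GNU diff function.

```python
# Single-quote shell quoting; an embedded apostrophe becomes '\''
# shell_quote is a hypothetical name for this sketch.
def shell_quote(name):
    return "'" + name.replace("'", "'\\''") + "'"


# The newline is kept as a literal newline inside the quotes,
# so the printed form spans two lines.
print(shell_quote("three\no'clock"))
```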

Perhaps another quoting style would be better.

An issue I hadn't really had time to think about is the character
encoding of file names.  E.g., suppose one file system uses UTF-8
encoding for Japanese file names, and another file system uses EUC-JP.
I suppose it would be nice to handle this problem too.  Perhaps GNU
'diff' could standardize on using UTF-8 in its file names, regardless
of what the underlying file system uses.  Another option is to pass
the bytes of the file name through, no matter what.  This might
require a runtime flag to diff, or to patch, or both.


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11  6:20               ` Paul Eggert
@ 2005-10-11  7:37                 ` Junio C Hamano
  2005-10-11 15:17                 ` Linus Torvalds
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-11  7:37 UTC (permalink / raw)
  To: Paul Eggert; +Cc: git

Paul Eggert <eggert@CS.UCLA.EDU> writes:

> The convention I had been thinking of adding is to have GNU diff
> use shell-quoting style, e.g.,
>
> 'three
> o'\''clock'
>
> to represent a file name with a newline and an apostrophe in it.
> This sort of file name can be cut and pasted into the shell.
> The quoting could be used with any file name containing a
> troublesome character.
>
> Perhaps another quoting style would be better.

A patch header (both "diff --git" line and ---/+++ lines) I've
been considering, and have in the proposed updates branch, looks
something like this:

    diff --git a/def\nghi/pqr b/dee/pqr
    similarity index 72%
    rename from def\nghi/pqr
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- a/def\nghi/pqr
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct  7 23:19:04 PDT 2005
    +foo
    +foo

If we can keep things on one line, that would help keep parsing the
stuff very simple, but more importantly, it is easier to see
what's happening.  The pattern is the same whether you have
funny pathnames or not, and that helps the human consumer.

Adjusting the "git diff" output to the style the GNU diff with
your shell quoting style would produce something like this:

    diff --git 'a/def
    ghi/pqr' b/dee/pqr
    similarity index 72%
    rename from 'def
    ghi/pqr'
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- 'a/def
    ghi/pqr'
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct  7 23:19:04 PDT 2005
    +foo
    +foo

Which, while it is possible to make tools parse them, is very
distracting for humans to read and review.  Yes, LF is quoted,
but it still breaks the line, disrupting the pattern we are used
to seeing.  If you are talking about a funny file, whose name is
"a\ndiff --git a/b/c", your diff would look like this:

    diff --git 'a/
    diff --git a/b/c' 'b/
    diff --git a/b/c'
    index 9ee055c..243fbbc 100644
    --- 'a/
    diff --git a/b/c'
    +++ 'b/
    diff --git a/b/c'
    @@ -1 +1,3 @@
     Fri Oct  7 23:19:04 PDT 2005
    +foo
    +foo

We are used to telling the "less" command to do "/^diff --git .*"
while reviewing patches.  The shell quoting, while I admit I
learned its beauty from you, is a disaster for human consumption.
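
The effect on a "^diff --git" search can be checked with a short Python
sketch.  It builds the header line for the hostile name from the
example in both styles and counts the lines that begin with
"diff --git ".  count_diff_headers and the two sample strings are
constructed for illustration only.

```python
import re


# Mimics searching for /^diff --git / line by line, as one would in a pager.
def count_diff_headers(patch_text):
    return len(re.findall(r"^diff --git ", patch_text, flags=re.MULTILINE))


# Part of the path after the a/ or b/ prefix, as in the example above.
name = "\ndiff --git a/b/c"

# Shell style keeps the raw LF inside single quotes.
shell_style = "diff --git 'a/%s' 'b/%s'\n" % (name, name)

# C style replaces the LF with the two characters backslash and 'n'.
quoted = name.replace("\n", "\\n")
c_style = "diff --git a/%s b/%s\n" % (quoted, quoted)

print(count_diff_headers(shell_style))   # 3
print(count_diff_headers(c_style))       # 1
```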

For diff output quoting purposes, LF is the only thing that
matters, as you mentioned in another message to me.  Our parsing
side ("GNU patch" counterpart) checks two pathnames on "diff
--git" line and makes sure what follows a/ and b/ are consistent
(that is, they should be identical, or each is the same as
"rename from" and "rename to"), so there is no ambiguity.  But
again for human consumption purposes, we cannot easily tell SP
and TAB apart by just reading, and a TAB is such an unusual character
to have in pathname (as opposed to SP which is not that
uncommon), we may be better off making them visible.

Quoting TAB incidentally has an added benefit, which you as GNU
diff/patch person would probably not care too much about.  Our
other tools sometimes need to show two paths in one record, and
TAB is used as the field separator between two paths (LF is the
record separator).  The tools do have '-z' mode to let us use
anything but NUL in the pathname, and carefully written scripts
tend to run them with '-z' flag and use Perl or Python to parse
paths out, but it would be nicer if we did not always have to.
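
What "parse paths out" of '-z' output amounts to can be sketched in
Python as below.  The sample bytes are invented and only shaped like a
NUL-terminated list of paths; split_nul_records is a hypothetical
helper.

```python
# Split NUL-terminated records; TAB and LF inside a path are harmless here.
# split_nul_records is a hypothetical name for this sketch.
def split_nul_records(data):
    records = data.split(b"\0")
    if records and records[-1] == b"":
        records.pop()            # drop the empty piece after the final NUL
    return records


sample = b"abc/mno\0def\nghi/pqr\0quote\targ.c\0"
paths = split_nul_records(sample)
print(len(paths))    # 3

# The same paths written one per line are mis-split by a line-based reader.
lf_terminated = b"\n".join(paths) + b"\n"
naive = lf_terminated.split(b"\n")[:-1]
print(len(naive))    # 4
```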

For example, the 'git commit' command prepares the log editor
with the status information about changes being committed, and
needs to mention paths.  This is purely for human consumption,
and showing something like:

	# Type commit message to this file.  Lines that start
        # with '#' are ignored.
        #
        # Updated but not checked in:
        #   (will commit)
        #
        #	new file: ab\n\tc/mno
        #	modified: abc/mno
        #	renamed: def\nghi/pqr -> dee/pqr
        ...

is perfectly readable for human users, and can be done without
running the tool in '-z' mode, if the tool output is quoted with
'\n' and '\t' convention -- the parsing and formatting side can
just split the field with TAB and show them, without worrying
about an embedded LF making the rest of the pathname spilling
over to the next line.  And once we start teaching the user we
represent funny characters in their paths this way, it becomes
nicer to be consistent in the diff output as well.


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11  6:20               ` Paul Eggert
  2005-10-11  7:37                 ` Junio C Hamano
@ 2005-10-11 15:17                 ` Linus Torvalds
  2005-10-11 18:03                   ` Paul Eggert
  1 sibling, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-11 15:17 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Mon, 10 Oct 2005, Paul Eggert wrote:
> 
> An issue I hadn't really had time to think about is the character
> encoding of file names.

Please don't. Use filenames as if they are just binary blobs of data, 
that's the only thing that has a high chance of success. Yes, it too can 
break in the presence of something _else_ doing character translation
and/or people moving a patch from one encoding to another, but that's
just true of anything.

Eventually everybody will hopefully use UTF-8, and nothing else really 
matters, but the thing is, if you see filenames as just blobs of data, 
that works with UTF-8 too, so it's not "wrong" even in the long run. And 
until everybody has one single encoding, you simply won't be able to tell, 
and the likelihood that you'd screw up is pretty high.

The happy part of the "binary blob" approach is that users _understand_ 
it. People who actively use different encoding formats are (painfully) 
aware of conversions, and they may curse you for not doing the random 
encoding format of the day, but they will be able to handle it.

In contrast, if you start doing conversions, I guarantee you that people 
will _not_ be able to handle it when you do something strange - you've 
changed the data.

Personally, I'd like the normal C quoting the best. Leave space as-is, and 
quote TAB/NL as \t and \n respectively. It's pretty universally understood 
in programming circles even outside of C, and it's not like a very 
uncommon patch format like that really needs to be well-understood outside 
of those circles.

It also has a very obvious and ASCII-safe format for other characters (i.e.
just the normal octal escapes: \377 etc.).

That said, I personally don't think it's necessarily even worth it. If 
somebody wants to use names with tabs and newlines, is he really going to 
work with diffs? Or is it just a driver error?

			Linus


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 15:17                 ` Linus Torvalds
@ 2005-10-11 18:03                   ` Paul Eggert
  2005-10-11 18:37                     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-11 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> Personally, I'd like the normal C quoting the best.

That would be fine with me too.  How about if we use the equivalent of
--quoting-style="c" for file names that contain funny bytes, and no
quoting for other file names?  So, for example, something like this:

    diff --git "space tab\tnewline\nquote\"backslash\\" b/dee/pqr
    similarity index 72%
    rename from "space tab\tnewline\nquote\"backslash\\"
    rename to dee/pqr
    index 9ee055c..243fbbc 100644
    --- "space tab\tnewline\nquote\"backslash\\"
    +++ b/dee/pqr
    @@ -1 +1,3 @@
     Fri Oct  7 23:19:04 PDT 2005
    +foo
    +foo

The surrounding double-quotes are an extra indication to the human
reader that there is something weird about the quoted file name.
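
A rough Python sketch of "quote only when needed", assuming the set of
triggering characters is space, TAB, LF, double quote and backslash as
in the example above.  quote_if_funny is a made-up name and the choice
of which characters trigger quoting is an assumption of this sketch.

```python
# Wrap a name in double quotes, with C-style escapes, only if it
# contains a "funny" character.  quote_if_funny is a hypothetical name.
def quote_if_funny(name):
    escapes = {"\t": "\\t", "\n": "\\n", '"': '\\"', "\\": "\\\\"}
    funny = set(escapes) | {" "}     # a space triggers quoting but is kept as-is
    if not any(ch in funny for ch in name):
        return name
    return '"' + "".join(escapes.get(ch, ch) for ch in name) + '"'


print(quote_if_funny("dee/pqr"))
print(quote_if_funny('space tab\tnewline\nquote"backslash\\'))
```

Ordinary names are printed exactly as before, so existing diffs keep
their shape and only the unusual names change.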

> Use filenames as if they are just binary blobs of data, 
> that's the only thing that has a high chance of success.

Thanks for thinking those things through.  I agree mostly, but there's
still a technical problem, in that we have to decide what a "funny
byte" is if we are using C-style quoting.  For example, the simplest
approach is to say a byte is funny if it is space, backslash, quote,
an ASCII control character, or is non-ASCII.  But this will cause
perfectly-reasonable UTF-8 file names to be presented in git format
using unreadable strings like "a\343\203\257b" or whatever.

Perhaps it would be better to say that a byte is "funny" if it is
space, backslash, quote, an ASCII control character, or a byte that is
not part of a valid UTF-8 encoding.  This will let UTF-8 file names
through unscathed, while still warning the reader when funny business
is going on.  File names with other encodings (e.g., Shift-JIS) will
contain lots of backslashes, but that's OK: we don't mind making
nonstandard encodings hard-to-read, so long as we preserve the bytes
correctly.
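
One way to experiment with this rule is the Python sketch below.  It
covers only the "not part of a valid UTF-8 encoding" test; escaping of
space, backslash, quote and control characters is left out to keep it
short.  quote_non_utf8 is a made-up name, and the use of Python's
surrogateescape error handler is just a convenient way to find the
undecodable bytes.

```python
# Octal-escape bytes that are not part of valid UTF-8; keep the rest.
# quote_non_utf8 is a hypothetical name for this sketch.
def quote_non_utf8(raw):
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append("\\%03o" % (code - 0xDC00))   # recover the original byte
        else:
            out.append(ch)
    return "".join(out)


utf8_name = "a\u30efb".encode("utf-8")   # valid UTF-8, passes through
odd_name = b"a\x83\x65b"                 # 0x83 is not valid UTF-8 here

print(quote_non_utf8(utf8_name) == "a\u30efb")   # True
print(quote_non_utf8(odd_name))                  # a\203eb
```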

We could implement this in other GNU applications by having a new quoting
style that supports this quoting behavior.  I can arrange for that.


> If somebody wants to use names with tabs and newlines, is he really
> going to work with diffs? Or is it just a driver error?

The currently supported scheme with 'diff' and 'patch' should work for
everything but newlines.  I like the idea of getting it to work even
with newlines, and I am willing to sacrifice old patches with file
names starting with '"' (extremely rare, if any) to get newlines to
work.  Among other things I worry about people submitting
purposely-malformed patches in non-git environments.


* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 18:03                   ` Paul Eggert
@ 2005-10-11 18:37                     ` Linus Torvalds
  2005-10-11 19:42                       ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-11 18:37 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2436 bytes --]



On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is 
> space, backslash, quote, an ASCII control character, or is non-ASCII.  
> But this will cause perfectly-reasonable UTF-8 file names to be 
> presented in git format using unreadable strings like "a\293\203\257b" 
> or whatever.

I think the simplest question to ask is "what are we protecting against?"

There's only two characters that are _really_ special to diff itself: \n and 
\t. The former is obvious, the latter just because the regular gnu diff 
format puts a tab between the name and the date (and if you _knew_ the 
date was always there you could just work backwards, but since not all 
diffs even put a date, \t ends up being special in practice).

So what else would you want to protect against? I hope not 8-bit 
cleanness: if some stupid protocol still isn't 8-bit clean, it should be 
fixed.

And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything else than \n and \t (and that 
obviously means you have to quote \ itself). That means that any filename 
always shows "sanely" in its own byte locale, and everything is readable, 
regardless of whether it's UTF-8 or just plain byte-encoded Latin1, or 
anything else.
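
A minimal sketch of that "quote only \n, \t and \" rule, in Python.
The names quote_minimal and unquote_minimal are hypothetical, and the
choice of the two-character escapes \n, \t and \\ is an assumption
for illustration rather than anything the thread has settled on.

```python
def quote_minimal(name):
    """Escape only LF, TAB and backslash; every other byte passes through."""
    out = bytearray()
    for b in name:
        if b == 0x0A:
            out += b"\\n"
        elif b == 0x09:
            out += b"\\t"
        elif b == 0x5C:
            out += b"\\\\"
        else:
            out.append(b)
    return bytes(out)


def unquote_minimal(text):
    """Inverse of quote_minimal; rejects escapes it did not produce."""
    table = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C}
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == 0x5C:
            if i + 1 >= len(text) or text[i + 1] not in table:
                raise ValueError("bad escape at offset %d" % i)
            out.append(table[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return bytes(out)
```

Because nothing at or above 0x80 is touched, a Latin-1 name and a
UTF-8 name both come out byte-for-byte identical to the input.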

So I don't think you should quote invalid UTF-8: it's invalid UTF-8 
whether it is quoted or not.

		Linus

PS. There _is_ something you may want to quote, namely the standard CSI 
terminal escapes. Not because they wouldn't pass through, but because some 
people might just "cat" a patch. This is debatable. Now, they are all 
in the range 0x00-0x1f and 0x80-0x9f, and since UTF-8 encoding is supposed 
to happen before it (but you don't know how many get that right), if you 
want to quote those characters, you need to do so _both_ for the "raw" 
format and for the UTF-8 format.

Now, the UTF-8 format for that high range is actually the same character, 
except preceded by a 0xc2 (I think), so the simplest thing is to do 
quoting _purely_ on a byte-stream level (ignore any UTF-8 stuff), and 
screw the fact that you end up with a non-UTF-8 sequence (character 0x0080 
is UTF-8 sequence 0xC2 0x80, and would be quoted as 0xC2 + "\200", which 
is no longer valid in UTF-8).
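
The effect described above is easy to check.  The following Python
sketch (quote_c1_bytes is a hypothetical name, and three-digit octal
escapes are assumed) quotes purely at the byte level and shows that
the result for U+0080 is no longer valid UTF-8.

```python
def quote_c1_bytes(data):
    """Octal-escape every raw byte in 0x80-0x9f, ignoring UTF-8 structure."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += b"\\%03o" % b
        else:
            out.append(b)
    return bytes(out)


encoded = "\u0080".encode("utf-8")   # the two bytes 0xC2 0x80
quoted = quote_c1_bytes(encoded)     # 0xC2 followed by the text \200

try:
    quoted.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    # 0xC2 is a lead byte that is now followed by a backslash.
    still_valid = False
```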

It gets quite nasty. For any UTF-8 quoting scheme you come up with, I'll 
point out something that it does wrong or looks horrible for a Latin1 
filename ;)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 18:37                     ` Linus Torvalds
@ 2005-10-11 19:42                       ` Paul Eggert
  2005-10-11 20:56                         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-11 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> the simplest question to ask is "what are we protecting against?"

I'd like to protect against:

  1.  File names that cannot be handled correctly with the current
      formats.  Newline is the obvious problem here, along with
      (arguably) tab and space.

  2.  Common transliterations of patches.  Many programs (and mailers,
      alas) expand tabs to spaces, append CR to lines, prepend spaces
      to lines, break lines at spaces, etc.  'patch' already deals
      with this to some extent, but it'd be nice if the format
      resisted these transliterations better.

  3.  Humans misreading patches.  The patch format is intended to be
      human-readable, after all.

  4.  Reencoded patches.  Programs like Emacs can and will convert
      patches from UTF-8 to EUC-JP, for example.

You convinced me that (4) is not worth the hassle, but I'd still like
to address (1)-(3) when it's easy.

> invalid UTF-8 [is] invalid UTF-8

Yes, but (2) and (3) can lose information about invalid UTF-8 if we
don't suitably protect the encoding errors.  I daresay that many
mailers will mishandle invalid UTF-8, for example.

> There _is_ something you may want to quote, namely the standard CSI
> terminal escapes.

If I understand you aright, we could do that by modifying my previous
proposal to escape all bytes in the UTF-8 representation of a control
character.  In Unicode, the characters 0080 through 009F are control
characters, so that should suffice to quote the terminal escapes you
mentioned.  (Perhaps we should also escape unassigned Unicode
characters too, on the theory that they might become control
characters in the future.)

> For any UTF-8 quoting scheme you come up with, I'll point out
> something that it does wrong or looks horrible for a Latin1 filename
> ;)

Yes, quite true.  But we don't have to come up with something that's
perfect in all cases, just something that's good enough to handle
cases that we expect will be common in practice, in a world where
UTF-8 is the preferred encoding for non-ASCII characters.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 19:42                       ` Paul Eggert
@ 2005-10-11 20:56                         ` Linus Torvalds
  2005-10-12  6:51                           ` Paul Eggert
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-11 20:56 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Tue, 11 Oct 2005, Paul Eggert wrote:
> 
> Yes, quite true.  But we don't have to come up with something that's
> perfect in all cases, just something that's good enough to handle
> cases that we expect will be common in practice, in a world where
> UTF-8 is the preferred encoding for non-ASCII characters.

The thing is, I can almost guarantee you that any quoting in the high 
characters is going to be _worse_ than no quoting at all.

Exactly because quoting as UTF-8 is the wrong thing when it isn't actually 
UTF-8, and quoting as non-UTF-8 is the wrong thing when it _is_.

Not quoting at all, on the other hand, is unambiguous. If you have a 
mailer that corrupts your text stream (which-ever type it is), then it's 
clearly the mailer's problem. The _mailer_ at least has a chance in hell to 
know what character set it is getting mailed as.

The other alternative is to quote _everything_ non-ASCII. That's 
definitely reliable, but it's also unquestionably ugly as hell, especially 
in the long run.

Yes, there are some complex quoting approaches you can do, which quote 
things "correctly" (ie at a byte stream level) _and_ keep it valid UTF-8 
at the same time.

For example, you can read it as a UTF-8 stream, but then quote things at a 
byte level (ie if you quote one "character", you quote _all_ bytes in that 
character). And you quote if:

 - the UTF-8 _character_ is in the 0x80-0x9f control range
 - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8)
 - any _raw_byte_ is 0xfe-0xff (illegal UTF-8 character)
 - misformed UTF-8 (non-shortest sequence, or just generally invalid 
   sequences with missing or wrong high bits)

but quite frankly, that's a pretty painful thing to write. The upside is 
that it's easy to decode: you can _unquote_ it just as a byte stream.
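
One way the rules above could be sketched in Python.  This is only an
illustration under stated assumptions: the names quote_by_char and
unquote_bytes are hypothetical, every escape is written as a
three-digit octal \ooo, and (going beyond the list above) backslash
and ASCII control characters are escaped too so that decoding stays
unambiguous.

```python
import re


def _char_width(data, i):
    """Length of the valid UTF-8 character starting at i, or 0 if malformed."""
    for n in (1, 2, 3, 4):
        try:
            data[i:i + n].decode("utf-8")
            return n
        except UnicodeDecodeError:
            pass
    return 0


def quote_by_char(data):
    """Read as UTF-8, but quote at byte level, a whole character at a time."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = _char_width(data, i)
        chunk = data[i:i + n] if n else data[i:i + 1]
        if n == 0:
            # Malformed UTF-8; this also covers raw 0xfe and 0xff bytes.
            must_quote = True
        elif n == 1:
            # Assumption: also escape backslash and ASCII control characters.
            must_quote = chunk[0] < 0x20 or chunk[0] in (0x5C, 0x7F)
        else:
            cp = ord(chunk.decode("utf-8"))
            must_quote = (0x80 <= cp <= 0x9F
                          or any(0x80 <= b <= 0x9F for b in chunk))
        if must_quote:
            for b in chunk:
                out += b"\\%03o" % b
        else:
            out += chunk
        i += len(chunk)
    return bytes(out)


def unquote_bytes(text):
    """Decode purely as a byte stream, with no knowledge of UTF-8."""
    return re.sub(rb"\\([0-3][0-7]{2})",
                  lambda m: bytes([int(m.group(1), 8)]), text)
```

The output is always valid UTF-8, since whatever is left unquoted is
a complete valid character and every escape is plain ASCII.  Note
that the raw-byte rule also quotes ordinary characters whose encoding
happens to contain a byte in 0x80-0x9f, for example U+00C0 (0xC3
0x80).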

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-11 20:56                         ` Linus Torvalds
@ 2005-10-12  6:51                           ` Paul Eggert
  2005-10-12 14:59                             ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-12  6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> you can read it as a UTF-8 stream, but then quote things at a byte
> level (ie if you quote one "character", you quote _all_ bytes in
> that character).

Yes, that's what I had in mind.

> And you quote if:
>
>  - the UTF-8 _character_ is in the 0x80-0x9f control range

Yes.  Or more generally, if it's any UTF-8 control character.

>  - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8)

Why quote the raw bytes?  Is this for terminal escapes on older xterm
(or xterm-like) implementations that don't understand UTF-8?  If so,
I'm not sure I'd bother, as it would introduce a lot of annoying
quoting with perfectly reasonable UTF-8, and (if we assume the world
is moving to UTF-8) it addresses a problem that is going away.

>  - any _raw_byte_ is 0xfe-0xff (illegal UTF-8 character)
>  - misformed UTF-8 (non-shortest sequence, or just generally invalid 
>    sequences with missing or wrong high bits)

Yes, that makes sense.

> quite frankly, that's a pretty painful thing to write.

It's not trivially short, yes.  But it shouldn't be that hard.

Also, I guess we don't have to write it, at least not at first.  As
long as we specify something like the C quoted-string format mentioned
earlier, we can encode into that format using a naive algorithm (e.g.,
quote any non-ASCII byte or ASCII control character), and beautify the
encoding method later.

> The upside is that it's easy to decode: you can _unquote_ it just as
> a byte stream.

Yes, that's the idea.

Also, the interchange format is the most important thing.  We have to
decode anything that is in the format, and we must encode into the
format.  Encoding prettily is nice, but not necessary.
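
To make the "naive encoder now, prettier encoder later" idea
concrete, here is a hedged Python sketch.  The names c_quote_naive
and c_unquote are hypothetical, and the exact escape set (\n, \t, \",
\\ and three-digit octal) is an assumption about what "something like
the C quoted-string format" would include.

```python
# Two-character escapes, keyed by the byte they stand for.
_SIMPLE = {0x0A: b"\\n", 0x09: b"\\t", 0x22: b'\\"', 0x5C: b"\\\\"}
# Reverse map: escape letter -> original byte.
_UNSIMPLE = {esc[1]: raw for raw, esc in _SIMPLE.items()}


def c_quote_naive(name):
    """Quote every non-ASCII byte and every ASCII control character."""
    out = bytearray(b'"')
    for b in name:
        if b in _SIMPLE:
            out += _SIMPLE[b]
        elif b < 0x20 or b >= 0x7F:
            out += b"\\%03o" % b
        else:
            out.append(b)
    out += b'"'
    return bytes(out)


def c_unquote(text):
    """Decode anything in the format, whichever encoder produced it."""
    if len(text) < 2 or text[:1] != b'"' or text[-1:] != b'"':
        raise ValueError("not a quoted name")
    body = text[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != 0x5C:
            out.append(body[i])          # unescaped bytes are kept verbatim
            i += 1
        elif i + 1 < len(body) and body[i + 1] in _UNSIMPLE:
            out.append(_UNSIMPLE[body[i + 1]])
            i += 2
        else:
            digits = body[i + 1:i + 4]
            if len(digits) != 3 or any(d not in b"01234567" for d in digits):
                raise ValueError("bad escape at offset %d" % i)
            value = int(digits, 8)
            if value > 0xFF:
                raise ValueError("octal escape out of range")
            out.append(value)
            i += 4
    return bytes(out)
```

The decoder does not care how pretty the encoder was: a name written
with raw UTF-8 bytes inside the quotes and the same name written with
octal escapes both decode to the same byte string.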

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12  6:51                           ` Paul Eggert
@ 2005-10-12 14:59                             ` Linus Torvalds
  2005-10-12 19:07                               ` Daniel Barkalow
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
  0 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-10-12 14:59 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1949 bytes --]



On Tue, 11 Oct 2005, Paul Eggert wrote:
> 
> >  - any _raw_byte_ is in the 0x80-0x9f range (it might not be UTF-8)
> 
> Why quote the raw bytes?  Is this for terminal escapes on older xterm
> (or xterm-like) implementations that don't understand UTF-8?

It's not about "understanding" UTF-8.

Even a perfectly modern xterm may simply not be in UTF-8 mode: if it 
wasn't in a UTF-8 locale, then it won't do UTF-8 decoding.

> If so, I'm not sure I'd bother, as it would introduce a lot of annoying
> quoting with perfectly reasonable UTF-8, and (if we assume the world
> is moving to UTF-8) it addresses a problem that is going away.

UTF-8 is only _now_ getting really widespread, and I think it's because 
RedHat bit the bullet and made UTF-8 the default locale a few years ago.

These things take _decades_.

I don't know if you realize it, but it's only within the last couple of 
years that the old 7-bit "finnish ASCII" went away. Finnish and Swedish 
have three extra characters: åäö (latin1) and åäö (utf-8). But only
within the last few years has the really _old_ ASCII representation really 
gone away so much that I don't see it at all (the characters '{' '}' and 
'|' were taken over, so that if you had a Finnish ASCII font, programming 
in C was really funky - but it was common enough that I could do it 
without thinking much about it ;)

So lots of people still use the byte-wide encodings. Whether really old 
ASCII only or some special locale-dependent one (of which latin1 and the 
"win-latin1" thing are obviously the most common by far). And in that 
locale, it's not the UTF-8 control characters that matter, it's the _byte_ 
control characters that do.

So if you want to support any other locale than UTF-8, you need to escape 
them. Assuming you want to escape control characters at all, of course (I 
still think it's perfectly fine to just let the raw mess through and 
depend on escaping at higher levels)

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 14:59                             ` Linus Torvalds
@ 2005-10-12 19:07                               ` Daniel Barkalow
  2005-10-12 19:52                                 ` Linus Torvalds
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel Barkalow @ 2005-10-12 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git,
	Kai Ruemmler

On Wed, 12 Oct 2005, Linus Torvalds wrote:

> So if you want to support any other locale than UTF-8, you need to escape 
> them. Assuming you want to escape control characters at all, of course (I 
> still think it's perfectly fine to just let the raw mess through and 
> depend on escaping at higher levels)

I think it's actually sufficient to escape 0x00-0x1f and 0x7f; those 
ranges are both easy and, as far as I can tell, include all of the control 
characters that do annoying things. I think escape, backspace, delete, and 
bell are the only ones we'd rather the terminal not get; beyond that, 
patches with screwy filenames look screwy, but don't screw up anything 
outside of the filename.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 19:07                               ` Daniel Barkalow
@ 2005-10-12 19:52                                 ` Linus Torvalds
  2005-10-12 20:21                                   ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-12 19:52 UTC (permalink / raw)
  To: Daniel Barkalow
  Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git,
	Kai Ruemmler



On Wed, 12 Oct 2005, Daniel Barkalow wrote:
> 
> I think it's actually sufficient to escape 0x00-0x1f and 0x7f; those 
> ranges are both easy

They are indeed easy.

>		 and, as far as I can tell, include all of the control 
> characters that do annoying things.

Nope. The traditional vt100 escape sequence is "ESC" followed by a 
character to indicate the type of sequence (the most common one is '['). 
That's all 7-bit and fine.

HOWEVER, they made the 8-bit extension be such that any of these vt100 
begin sequences where the second character is in the appropriate range can 
be instead shortened by one character, by instead using a single 8-bit 
character of "0x80+(char-0x40)". Ie the traditional "ESC + '['" (\x1b\x5b) 
can also be written as a single '\x9b' character, aka CSI.

In other words, 0x80-0x9f are _all_ just vt100 shorthand for ESC+'@' 
through ESC+'_'.

(I guess it's not strictly "vt100" any more - it's the extended vt220 
format).
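
The "0x80+(char-0x40)" rule can be written out as a small Python
sketch.  The helper names c1_shorthand and expand_c1 are hypothetical
and only illustrate the arithmetic described above.

```python
ESC = 0x1B


def c1_shorthand(final):
    """Map the byte that follows ESC (0x40-0x5f) to its single 8-bit form."""
    if not 0x40 <= final <= 0x5F:
        raise ValueError("no 8-bit shorthand for this sequence")
    return 0x80 + (final - 0x40)


def expand_c1(data):
    """Rewrite each 0x80-0x9f byte as the equivalent two-byte ESC sequence."""
    out = bytearray()
    for b in data:
        if 0x80 <= b <= 0x9F:
            out += bytes([ESC, b - 0x80 + 0x40])
        else:
            out.append(b)
    return bytes(out)
```

For instance c1_shorthand(ord("[")) gives 0x9b (CSI), and expanding
the byte string 0x9b "5B" gives the familiar 7-bit ESC "[5B".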

> I think escape, backspace, delete, and 
> bell are the only ones we'd rather the terminal not get; beyond that, 
> patches with screwy filenames look screwy, but don't screw up anything 
> outside of the filename.

Try this on a (non-UTF-8) xterm:

	echo -en '\x9b5B---\x9b1A---\x9b4A\r'

and it should do:
 - move cursor 5 lines down
 - print "---"
 - move cursor 1 line up
 - print "---"
 - move cursor 4 lines up
 - return carriage to beginning.

In other words, your screen should end up looking something like this:

	[torvalds@g5 ~]$ echo -en '\x9b5B---\x9b1A---\x9b4A\r'
	[torvalds@g5 ~]$
	
	
	
	   ---
	---

where that "staircase" of two "---" things was done with cursor movements.

And that's a _benign_ sequence. You can do all kinds of funky stuff that 
really screws up the user experience. Including have the thing echo keys 
to you that you didn't type:

	echo -en '\x9b5n'

or lock the keyboard (I don't think any of the terminal emulators 
implement the latter, or some of the other stranger sequences - things to 
do double-wide characters etc).

			Linus

PS. You can do all the same in a UTF-8 one, but then you'll have to add a 
\xc2 before the \x9b:

	echo -en '\xc2\x9b5B---\xc2\x9b1A---\xc2\x9b4A\r'

etc..

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 19:52                                 ` Linus Torvalds
@ 2005-10-12 20:21                                   ` H. Peter Anvin
  0 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2005-10-12 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Daniel Barkalow, Paul Eggert, Junio C Hamano, Robert Fitzsimons,
	Alex Riesen, git, Kai Ruemmler

Linus Torvalds wrote:
> 
> Nope. The traditional vt100 escape sequence is "ESC" followed by a 
> character to indicate the type of sequence (the most common one is '['). 
> That's all 7-bit and fine.
> 
> HOWEVER, they made the 8-bit extension be such that any of these vt100 
> begin sequences where the second character is in the appropriate range can 
> be instead shortened by one character, by instead using a single 8-bit 
> character of "0x80+(char-0x40)". Ie the traditional "ESC + '['" (\x1b\x5b) 
> can also be written as a single '\x9b' character, aka CSI.
> 
> In other words, 0x80-0x9f are _all_ just vt100 shorthand for ESC+'@' 
> through ESC+'_'.
> 
> (I guess it's not strictly "vt100" any more - it's the extended vt220 
> format).
> 

Actually, it's even trickier than that.

CSI is character 0x1b of control code set C1; there are two "windows" 
for control codes -- CL (0x00-0x1f) and CR (0x80-0x9f).  Normally CL is 
mapped to C0 and CR is mapped to C1, but ESC will temporarily map C1 
into CL.

VT1xx didn't support this since they didn't support 8-bit anything.

Anyway, a *lot* of character sets -- not just UTF-8 -- use the CR range 
of bytes for printables.

	-hpa

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
@ 2005-10-12 21:02                                 ` Junio C Hamano
  2005-10-12 21:05                                 ` Linus Torvalds
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-12 21:02 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Linus Torvalds, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Paul Eggert <eggert@CS.UCLA.EDU> writes:

> Linus Torvalds <torvalds@osdl.org> writes:
>
>> I don't know if you realize it, but it's only within the last couple of 
>> years that the old 7-bit "finnish ASCII" went away.
>
> Aach!  Those Finns!  Always on the trailing edge of technology!

Nah, Japanese are much worse.  We are so used to seeing Yen signs
at the end of multi-line CPP macro definitions (backslashes are
taken over by it) and I do not foresee it going away anytime
soon.  I think windows people believe Yen signs are path
component separators ;-).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
  2005-10-12 21:02                                 ` Junio C Hamano
@ 2005-10-12 21:05                                 ` Linus Torvalds
  2005-10-12 21:09                                   ` H. Peter Anvin
                                                     ` (3 more replies)
  2005-10-12 21:24                                 ` Linus Torvalds
  2005-10-14  6:59                                 ` Junio C Hamano
  3 siblings, 4 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-10-12 21:05 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Wed, 12 Oct 2005, Paul Eggert wrote:
> 
> Your email message suggests that we need to be cautious here.
> That message contained UTF-8 text but its header said "Content-Type:
> TEXT/PLAIN; charset=ISO-8859-1".

Well, my email message was wrong and evil, because it _mixed_ two 
different encodings in the same text. No sane client could have shown them 
both at the same time - but especially with a stupid client, you could 
have changed your terminal to show either one or the other by switching 
from utf-8 to latin1 encoding and doing a refresh.

In other words, my email really was a nasty case of not one or the other, 
but both.

Now, I believe patches can actually be that way - it's not at all 
impossible to have a diff where the _filename_ is utf-8, but the content 
of the patch itself is some byte-encoding like latin1. Or the other way 
around.

> If we're still having problems like this in 2005 then I guess we need
> to deal with them.  This suggests we should be escaping every
> non-ASCII byte, at least for patches designed to be emailed robustly.

I find that email is very robust - it's basically 8-bit clean. No 
character encoding, no crap. Just a byte stream. It really _is_ the most 
reliable format.

Now, a lot of email clients are really weak in _showing_ it, and as 
mentioned, the email that mixed both is fundamentally not something you 
really even _can_ show sanely. But who cares? What matters is not what it 
looks like, but what it _saves_ as. If you save the email message, it 
should come out as the same reliable 8-bit byte stream, or your client is 
actively corrupting messages rather than just showing them.

This is really what my argument boils down to: character set encoding 
should _not_ EVER affect the _transfer_ of the data. It doesn't matter if 
something is latin1 or utf-8, the only thing that matters is the byte 
sequence. Only when you _display_ it should you try to figure out what the 
byte sequence possibly means.

So I repeat: 
 - escape as little as possible
 - make the _viewer_ decide how to view it.

Yes, if people use "cat" to view patches, it can be dangerous. But that's 
_their_ problem.

		Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 21:05                                 ` Linus Torvalds
@ 2005-10-12 21:09                                   ` H. Peter Anvin
  2005-10-12 21:15                                   ` Johannes Schindelin
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2005-10-12 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git,
	Kai Ruemmler

Linus Torvalds wrote:
> 
> Now, I believe patches can actually be that way - it's not at all 
> impossible to have a diff where the _filename_ is utf-8, but the content 
> of the patch itself is some byte-encoding like latin1. Or the other way 
> around.
> 

Or both.  Trivial example: a patch to change names in comments from ISO 
8859-1 to UTF-8.

	-hpa

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 21:05                                 ` Linus Torvalds
  2005-10-12 21:09                                   ` H. Peter Anvin
@ 2005-10-12 21:15                                   ` Johannes Schindelin
  2005-10-12 21:33                                   ` Junio C Hamano
  2005-10-14  0:57                                   ` Paul Eggert
  3 siblings, 0 replies; 33+ messages in thread
From: Johannes Schindelin @ 2005-10-12 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git,
	Kai Ruemmler

Hi,

On Wed, 12 Oct 2005, Linus Torvalds wrote:

> Yes, if people use "cat" to view patches, it can be dangerous. But that's 
> _their_ problem.

No, that is the cat's problem. Sorry, couldn't resist.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
  2005-10-12 21:02                                 ` Junio C Hamano
  2005-10-12 21:05                                 ` Linus Torvalds
@ 2005-10-12 21:24                                 ` Linus Torvalds
  2005-10-14  0:16                                   ` Paul Eggert
  2005-10-14  6:59                                 ` Junio C Hamano
  3 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-12 21:24 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Wed, 12 Oct 2005, Paul Eggert wrote:
> 
> Worse, when I used Emacs to copy your text into another file -- the
> sort of thing that is likely to be done with an emailed patch -- the
> file contained the UTF-8 encoding of the gibberish, rather than the
> original bytes of your message.

Btw, this is an example of where locale-based character translations just 
fundamentally suck.

Cut-and-paste quite naturally tries to translate between the source 
and destination locales, but it fundamentally cannot work. The only thing 
that ever works is bit-for-bit copying.

Any program that tries to do locale conversion is always going to be a bug 
waiting to happen.

If GNU emacs does locale translations rather than just do a binary 
transfer of the data, then that's a sign that GNU emacs is being really 
stupid. If the data was UTF-8 to begin with, then a binary copy is also 
going to be UTF-8. And if it wasn't UTF-8, then a binary copy is the only 
thing that is sensible.

And this is the thing that makes UTF-8 so wonderful: exactly the fact that 
it makes bit-for-bit copying an acceptable policy again, and locales 
become a non-issue. In a truly UTF-8 world, you should _never_ convert 
anything at all (and that includes mis-formed UTF-8).

Any non-binary file saving or transfer approach where characters have 
"meaning" is always a mistake. It's why DOS/Windows "binary" vs "text" files 
was wrong. It's why font-encoding locales are wrong (Mixed text with two 
types? Yet another metadata quoting scheme? No thank you!). It's also why 
UCS-16 and UCS-32 were total disasters: they had "context" in their 
encoding.

Say "yes" to binary transfer. Because text transfers are broken.

			Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 21:05                                 ` Linus Torvalds
  2005-10-12 21:09                                   ` H. Peter Anvin
  2005-10-12 21:15                                   ` Johannes Schindelin
@ 2005-10-12 21:33                                   ` Junio C Hamano
  2005-10-14  0:57                                   ` Paul Eggert
  3 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-12 21:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> This is really what my argument boils down to: character set encoding 
> should _not_ EVER affect the _transfer_ of the data. It doesn't matter if 
> something is latin1 or utf-8, the only thing that matters is the byte 
> sequence. Only when you _display_ it should you try to figure out what the 
> byte sequence possibly means.
>
> So I repeat: 
>  - escape as little as possible
>  - make the _viewer_ decide how to view it.

I think the same argument can be made about patch application,
although strictly speaking it is not "viewing".  Let the patch
program decide (or the user to tell her decision to the patch
program) what the unescaped byte sequence in the patch that
represents the path being affected is encoded in, and do
something sensible while taking into account that the pathname
encoding on the working tree may be different from what is
recorded in the patch.

For example, one of my partitions is ntfs mounted with
nls=euc-jp, and I expect the tool to help me apply patches to a
Japanese-named file when the patch is from a system with UTF-8
encoded filenames.
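
The kind of help meant here might look like the following Python
sketch.  The function transcode_path is hypothetical, and it assumes
the user (or the tool) already knows both encodings; nothing in the
patch itself says which one was used.

```python
def transcode_path(name, src="utf-8", dst="euc_jp"):
    """Re-encode a path from the patch's encoding to the working tree's.

    Decoding is strict, so a name that is not valid in `src` raises
    UnicodeDecodeError instead of being silently mangled.
    """
    return name.decode(src).encode(dst)
```

A name that cannot be represented in the destination encoding raises
UnicodeEncodeError, which at least lets the tool report the problem
rather than apply the patch to the wrong file.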

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 21:24                                 ` Linus Torvalds
@ 2005-10-14  0:16                                   ` Paul Eggert
  2005-10-14  5:20                                     ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-14  0:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> So I repeat: 
>  - escape as little as possible
>  - make the _viewer_ decide how to view it.

Under my most recent proposal, the only bytes one must escape are ",
\, and LF.  Doesn't that satisfy these two main criteria?


> If GNU emacs does locale translations rather than just do a binary
> transfer of the data, then that's a sign that GNU emacs is being
> really stupid.

Perhaps so, but it has a lot of company.  I have even worse problems
with Mozilla Thunderbird.  And as we observed, Pine also has problems
sending properly-formatted email containing arbitrary binary data.

I suspect the vast majority of email clients will screw up in
relatively common cases involving unusual characters in file names.
Using attachments avoids many of the problems, but lots of patches are
emailed inline and I'd rather not force people to use attachments to
send diffs.


> I find that email is very robust - it's basically 8-bit clean. No 
> character encoding, no crap. Just a byte stream. It really _is_ the most 
> reliable format.

Hmm.  To test that theory, I just now sent plain-text email to myself,
containing a carriage-return (CR) byte in the middle of a line.

The CR byte was transliterated into a LF.  Ooops.

This was the very first (and only) test I tried, which isn't a good
sign for reliability.  If you're curious, I tracked the problem down
to Exim, a popular mail transfer agent that is running on my personal
Debian GNU/Linux (stable) box.  As to why Exim munges email, please see
<http://www.exim.org/exim-html-4.40/doc/html/spec_44.html#SECT44.1>.
(And I didn't know about the Exim glitch before trying my test.
I'm normally a Sendmail man myself.)

More generally, I suspect inline patches with weird bytes will suffer
greatly from encoding and recoding by mail agents.


> What matters is not what it looks like, but what it _saves_ as. If
> you save the email message, it should come out as the same reliable
> 8-bit byte stream

Unfortunately this isn't true for Emacs, and I suspect other mailers
will have similar problems.  For example, with Emacs I can easily save
either the exact byte-for-byte message body that my mail transfer
agent gave me; or I can have Emacs decode the message into its
constituent characters, reencode the result as UTF-8, and put that
into a file.  In neither case, though, am I saving the original byte
stream that you presented to your mail user agent.  Even if I save the
byte-for-byte message body, it is often in quoted-printable format so
I'll have to decode strings like "=EF" to recover the original bytes.
This is doable, yes, but it's inconvenient in practice, at least with
the mail user agents I'm familiar with.  And even if I do it, I don't
necessarily have the same byte stream you gave your mail user agent; I
merely have the byte stream that your MUA gave to your MTA, and these
may not be the same thing (they certainly aren't always the same thing
with Emacs).
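
To make the decoding step concrete, here is a minimal Python sketch of
undoing quoted-printable on a saved message body.  The sample body is
invented for illustration; only the standard-library quopri module is
assumed.

```python
import quopri

# Hypothetical saved body: the MUA wrote the Latin-1 bytes 0xE9 and 0xEF
# as the quoted-printable sequences "=E9" and "=EF".
encoded_body = b"caf=E9 and na=EFve"

# Undo the mail-level quoting.  The result is bytes, not text: no
# character-set conversion happens at this step.
original_bytes = quopri.decodestring(encoded_body)

print(original_bytes)
```

This recovers only what the sending MUA handed to its MTA, so it does
not help if the MUA had already recoded the bytes before quoting them.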


The simplest fix for git may be to say "Don't use inline patches; use
attachments if you must email anything with strange characters in it."
That's fine.  But I prefer a format that also allows GNU diff, if it
chooses, to generate output that resists common inline-email botches.

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-12 21:05                                 ` Linus Torvalds
                                                     ` (2 preceding siblings ...)
  2005-10-12 21:33                                   ` Junio C Hamano
@ 2005-10-14  0:57                                   ` Paul Eggert
  2005-10-14  5:43                                     ` Linus Torvalds
  3 siblings, 1 reply; 33+ messages in thread
From: Paul Eggert @ 2005-10-14  0:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler

Linus Torvalds <torvalds@osdl.org> writes:

> I find that email is very robust - it's basically 8-bit clean. No 
> character encoding, no crap. Just a byte stream. It really _is_ the most 
> reliable format.

I found another amusing bit of info that tends to undercut this claim.

This discussion thread is archived at
<http://marc.theaimsgroup.com/?t=112877773400002&r=1&w=2&n=22>.
But there's an item missing from the archive: my message with
Message-ID <87vf02qy79.fsf@penguin.cs.ucla.edu>.  This is the message
with the joke "Aach!  Those Finns!  Always on the trailing edge of
technology!".

All my other messages are archived.  What was special about this
one?  Surely there's not a joke filter at theaimsgroup.com!

I nosed around through the archive and here's my guess as to what
happened.  My message's email header contained this:

   Content-Type: text/plain; charset=utf-8
   Content-Transfer-Encoding: quoted-printable

and my guess is that the web archiver can't handle that format.

This is just a guess.  I can't confirm it because (among other things)
the web archiver won't give me all the bytes of the messages that it
archives.  Even its "Download message RAW" doesn't do that: it omits
the header.  But I have a strong suspicion.  Let's put it this way: I
think mine was the only message in the thread that said
"charset=utf-8".

If my guess is right, the archiver dropped my email on the floor
simply because it contained UTF-8.  This is not a good sign for
putting UTF-8 into email, or for relying on email to transmit byte
streams.

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-14  0:16                                   ` Paul Eggert
@ 2005-10-14  5:20                                     ` Linus Torvalds
  2005-10-14 17:18                                       ` H. Peter Anvin
  0 siblings, 1 reply; 33+ messages in thread
From: Linus Torvalds @ 2005-10-14  5:20 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Thu, 13 Oct 2005, Paul Eggert wrote:
> 
> Perhaps so, but it has a lot of company.  I have even worse problems
> with Mozilla Thunderbird.  And as we observed, Pine also has problems
> sending properly-formatted email containing arbitrary binary data.

No, pine does it right. Exactly because it sends _arbitrary_ binary data.

The fact that I turned the terminal into utf-8 mode in order to generate 
the bytes (that end up being a garbage string in latin1) is not pine's 
fault. 

The point being that because the transport was 8-bit clean, I could do 
that. I could mix a latin1-encoding with a UTF-8 encoding, and the other 
side could see the mixed setting. Now, the other side had no way of 
knowing that I mixed things (unless it was a smart human and could read 
and understand what I wrote), so any email client would have trouble 
showing it.

But it got _transferred_ right, and you could have saved the email, and 
turned the terminal into latin1 or utf-8 mode, and done a "cat" both ways, 
and you'd have seen both versions.
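
A small Python sketch of the same idea, with an invented two-byte
sample: the saved bytes stay identical, and only the interpretation
chosen at display time differs.

```python
# Two bytes as they might sit in a saved message (sample value chosen
# for illustration): the UTF-8 encoding of U+00E4.
raw = b"\xc3\xa4"

# "Terminal in utf-8 mode": one character, a-umlaut.
as_utf8 = raw.decode("utf-8")

# "Terminal in latin1 mode": two characters, A-tilde and a currency sign.
as_latin1 = raw.decode("latin-1")

# Either view maps back to the same bytes, so nothing was lost in transit.
print(as_utf8.encode("utf-8") == as_latin1.encode("latin-1") == raw)
```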

> I suspect the vast majority of email clients will screw up in
> relatively common cases involving unusual characters in file names.

Not if they just save it.

Oh, sure, they can't _display_ it, since they don't know what it is, but 
when they save it, they'd _better_ save it bit-for-bit.

Which is the right thing to do. Then you apply it with "patch", and you 
get the right answer.

> Using attachments avoids many of the problems, but lots of patches are
> emailed inline and I'd rather not force people to use attachments to
> send diffs.

inline or attachment should not matter to any sane email client. If it 
does, then the email client isn't sane.

The point is, when you save it, it _has_ to be saved bit-for-bit. 

The only difference between a binary attachment and a text thing is that 
an email client will _try_ to show the text thing to you as text. It has 
no other meaning.

And trying is better than not trying. Attachments are _inferior_ to inline 
for that reason.

> > I find that email is very robust - it's basically 8-bit clean. No 
> > character encoding, no crap. Just a byte stream. It really _is_ the most 
> > reliable format.
> 
> Hmm.  To test that theory, I just now sent plain-text email to myself,
> containing a carriage-return (CR) byte in the middle of a line.
> 
> The CR byte was transliterated into a LF.  Ooops.

I'm not surprised, since CR/LF is special for a lot of (sad) reasons. Oh, 
well.

I agree that it makes sense to escape \r, and obviously you _have_ to 
escape \n. In general, escaping pretty much everything in the 0-31 range 
is likely the right approach, since those are never printable anyway.

That, btw, is probably true of the patch contents too, not just the 
filename. The exception being \t (and in patch contents, \n is obviously 
part of the stream).

> More generally, I suspect inline patches with weird bytes will suffer
> greatly from encoding and recoding by mail agents.

I've had pretty good luck. We do have 8-bit stuff occasionally, but it 
almost always makes it through. 

Spaces and tabs are much worse (yes, they're more common too). That's 
clearly just crap mailers.

> Unfortunately this isn't true for Emacs, and I suspect other mailers
> will have similar problems.  For example, with Emacs I can easily save
> either the exact byte-for-byte message body that my mail transfer
> agent gave me; or I can have Emacs decode the message into its
> constituent characters, reencode the result as UTF-8, and put that
> into a file.

Well, as long as there's a choice.

> In neither case, though, am I saving the original byte
> stream that you presented to your mail user agent.  Even if I save the
> byte-for-byte message body, it is often in quoted-printable format so
> I'll have to decode strings like "=EF" to recover the original bytes.

You have a broken mail client. Now, I'm not a big fan of QP (I think it 
was making a stupid excuse for bad transport), but QP is a _mail_ level 
quoting protocol, and the same way a MUA uses QP to encode, the MUA should 
have de-coded the QP. It shouldn't leave it to somebody else.

I think GNU emacs is a horrible mistake ("do everything - badly"), but you 
may be able to fix it by letting your mail transport agent do the un-QP 
for you. A lot of them do, which makes it easier to then use weak MUA's.

Anyway, it sounds like GNU emacs made the wrong choices (hey, I'm not 
surprised). It should have decoded QP, not the character set. There are 
lots of tools that do charset conversions, that's not very email-specific.

			Linus
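
The split described above (undo the QP, leave the character set alone)
can be sketched with Python's standard email package.  The message
below is made up for illustration; get_payload(decode=True) reverses
the Content-Transfer-Encoding and returns raw bytes without applying
the declared charset.

```python
import email

# Hypothetical message using the headers discussed in this thread.
raw_message = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"na=C3=AFve.c\n"
)

msg = email.message_from_bytes(raw_message)

# Mail-level decoding only: QP is undone, the bytes are not recoded.
payload = msg.get_payload(decode=True)

# The charset is still available for a tool that wants to display text.
charset = msg.get_content_charset()

print(payload, charset)
```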

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-14  0:57                                   ` Paul Eggert
@ 2005-10-14  5:43                                     ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2005-10-14  5:43 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Robert Fitzsimons, Alex Riesen, git, Kai Ruemmler



On Thu, 13 Oct 2005, Paul Eggert wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > I find that email is very robust - it's basically 8-bit clean. No 
> > character encoding, no crap. Just a byte stream. It really _is_ the most 
> > reliable format.
> 
> I found another amusing bit of info that tends to undercut this claim.

No, I think you found that email as a _transfer_ is mostly 8-bit clean 
(finally! Oh - has qmail gotten fixed?).

But the end-points aren't. They do strange things with encodings, 
sometimes. They see an encoding they don't know what to do with, and they 
just freak out.

		Linus

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
       [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
                                                   ` (2 preceding siblings ...)
  2005-10-12 21:24                                 ` Linus Torvalds
@ 2005-10-14  6:59                                 ` Junio C Hamano
  3 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2005-10-14  6:59 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Linus Torvalds, git

Paul Eggert <eggert@CS.UCLA.EDU> writes:

> Here is the proposed format.  Each file name is a string of bytes, in
> one of the following two formats:
>
> A.  A nonempty sequence of ASCII graphic characters (i.e., bytes in
>     the range '!' == '\041' through '~' == '\176').  The first byte
>     cannot be '!' == '\041' or '"' == '\042'.  Leading '"' is used for
>     (B) below, and leading '!' is reserved for future extensions.
>
> B.  A nonempty C-language character string literal, with the following
>     restrictions and modifications:
>
>     B1.  No multibyte character processing is done.  Members of the
>          string literal are treated as bytes, not characters.  Null
>          bytes are not allowed, and '"' == '\042', '\\' == '\134' and
>          '\n' == '\012' are allowed only if properly escaped as shown
>          below; but all other bytes are allowed.
>
>     B2.  No trigraph processing is done (e.g., ??/ stands for three
>          bytes, not one).
>
>     B3.  No line-splicing is done (i.e., backslash-newline is not allowed).
>
>     B4.  Only the following escape sequences are allowed.
>
>            \" \\ \a \b \f \n \r \t \v
>            \XYZ  (where X, Y, and Z are octal digits, X <= 3, and
>                   at least one of the digits is nonzero)

Just to let you know, I am slowly converting apply.c to accept
this format, and also diff.c to produce this.  I did not
personally like the missing double quotes around what I did
anyway, although it was easier to code.
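
For readers who want to experiment with the quoted format, here is a
rough Python sketch of a quoter and unquoter.  It is not the apply.c or
diff.c code; the names quote_name and unquote_name are hypothetical.
One choice goes beyond the proposal: inside quotes, the sketch writes
every control byte and every byte >= 0x7F as an octal escape, although
the proposal would allow most of them raw.

```python
# Named escapes permitted by B4, keyed by byte value.
_NAMED = {0x22: b'\\"', 0x5C: b"\\\\", 0x07: b"\\a", 0x08: b"\\b",
          0x0C: b"\\f", 0x0A: b"\\n", 0x0D: b"\\r", 0x09: b"\\t",
          0x0B: b"\\v"}
_ESCAPE_TO_BYTE = {esc[1:]: byte for byte, esc in _NAMED.items()}


def quote_name(name: bytes) -> bytes:
    """Return name in format A if possible, else in format B."""
    if not name or 0 in name:
        raise ValueError("name must be nonempty and contain no NUL byte")
    # Format A: only ASCII graphic bytes, not starting with '!' or '"'.
    if all(0x21 <= b <= 0x7E for b in name) and name[0] not in (0x21, 0x22):
        return name
    out = bytearray(b'"')
    for b in name:
        if b in _NAMED:
            out += _NAMED[b]
        elif b < 0x20 or b >= 0x7F:
            out += b"\\%03o" % b      # assumption: escape, do not pass raw
        else:
            out.append(b)
    out += b'"'
    return bytes(out)


def unquote_name(text: bytes) -> bytes:
    """Inverse of quote_name; names without a leading '"' pass through."""
    if not text.startswith(b'"'):
        return text
    if len(text) < 3 or not text.endswith(b'"'):
        raise ValueError("empty or unterminated quoted name")
    body, out, i = text[1:-1], bytearray(), 0
    while i < len(body):
        b = body[i]
        if b != 0x5C:
            if b in (0x00, 0x0A, 0x22):
                raise ValueError("byte must be escaped inside quotes")
            out.append(b)
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt in _ESCAPE_TO_BYTE:
            out.append(_ESCAPE_TO_BYTE[nxt])
            i += 2
            continue
        digits = body[i + 1:i + 4]
        if (len(digits) != 3 or digits == b"000" or digits[0] > 0x33
                or not all(0x30 <= d <= 0x37 for d in digits)):
            raise ValueError("bad escape sequence")
        out.append(int(digits, 8))
        i += 4
    return bytes(out)


print(quote_name(b"a\tb\nc"))
```

Ordinary names such as foo.c come back unchanged, so existing diffs
would look the same; only names containing awkward bytes gain quotes.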

* Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
  2005-10-14  5:20                                     ` Linus Torvalds
@ 2005-10-14 17:18                                       ` H. Peter Anvin
  0 siblings, 0 replies; 33+ messages in thread
From: H. Peter Anvin @ 2005-10-14 17:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Eggert, Junio C Hamano, Robert Fitzsimons, Alex Riesen, git,
	Kai Ruemmler

Linus Torvalds wrote:
> 
> No, pine does it right. Exactly because it sends _arbitrary_ binary data.
> 
> The fact that I turned the terminal into utf-8 mode in order to generate 
> the bytes (that end up being a garbage string in latin1) is not pine's 
> fault. 
> 

I would think a full-screen editor would need to know about multibyte 
encodings.

	-hpa

Thread overview: 33+ messages
2005-10-07 19:35 [RFC] embedded TAB and LF in pathnames Junio C Hamano
2005-10-07 23:29 ` Alex Riesen
2005-10-07 23:44   ` Junio C Hamano
2005-10-08  6:45     ` Alex Riesen
2005-10-08  9:10       ` Junio C Hamano
2005-10-08 13:30         ` [PATCH] Try URI quoting for " Robert Fitzsimons
2005-10-08 18:30           ` Junio C Hamano
2005-10-08 20:19             ` Junio C Hamano
2005-10-11  6:20               ` Paul Eggert
2005-10-11  7:37                 ` Junio C Hamano
2005-10-11 15:17                 ` Linus Torvalds
2005-10-11 18:03                   ` Paul Eggert
2005-10-11 18:37                     ` Linus Torvalds
2005-10-11 19:42                       ` Paul Eggert
2005-10-11 20:56                         ` Linus Torvalds
2005-10-12  6:51                           ` Paul Eggert
2005-10-12 14:59                             ` Linus Torvalds
2005-10-12 19:07                               ` Daniel Barkalow
2005-10-12 19:52                                 ` Linus Torvalds
2005-10-12 20:21                                   ` H. Peter Anvin
     [not found]                               ` <87vf02qy79.fsf@penguin.cs.ucla.edu>
2005-10-12 21:02                                 ` Junio C Hamano
2005-10-12 21:05                                 ` Linus Torvalds
2005-10-12 21:09                                   ` H. Peter Anvin
2005-10-12 21:15                                   ` Johannes Schindelin
2005-10-12 21:33                                   ` Junio C Hamano
2005-10-14  0:57                                   ` Paul Eggert
2005-10-14  5:43                                     ` Linus Torvalds
2005-10-12 21:24                                 ` Linus Torvalds
2005-10-14  0:16                                   ` Paul Eggert
2005-10-14  5:20                                     ` Linus Torvalds
2005-10-14 17:18                                       ` H. Peter Anvin
2005-10-14  6:59                                 ` Junio C Hamano
2005-10-09 10:42           ` Junio C Hamano
