git.vger.kernel.org archive mirror
From: Linus Torvalds <torvalds@osdl.org>
To: Paul Eggert <eggert@CS.UCLA.EDU>
Cc: Junio C Hamano <junkio@cox.net>,
	Robert Fitzsimons <robfitz@273k.net>,
	Alex Riesen <raa.lkml@gmail.com>,
	git@vger.kernel.org, Kai Ruemmler <kai.ruemmler@gmx.net>
Subject: Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
Date: Tue, 11 Oct 2005 11:37:46 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0510111121030.14597@g5.osdl.org>
In-Reply-To: <87ek6s0w34.fsf@penguin.cs.ucla.edu>

On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is 
> space, backslash, quote, an ASCII control character, or is non-ASCII.  
> But this will cause perfectly-reasonable UTF-8 file names to be 
> presented in git format using unreadable strings like "a\293\203\257b" 
> or whatever.

I think the simplest question to ask is "what are we protecting against?"

There's only two characters that are _really_ special to diff itself: \n and 
\t. The former is obvious, the latter just because the regular gnu diff 
format puts a tab between the name and the date (and if you _knew_ the 
date was always there you could just work backwards, but since not all 
diffs even put a date, \t ends up being special in practice).
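
To make the tab problem concrete, here is a minimal Python sketch of a naive header parser. It is not taken from git or GNU diff; `split_header` is a hypothetical helper, and the only assumption is the "--- name<TAB>date" layout described above, with the date being optional.

```python
from typing import Optional, Tuple


def split_header(line: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Split a '--- name[\\tdate]' or '+++ name[\\tdate]' header line.

    Hypothetical helper for illustration only: it treats the first TAB
    as the boundary between the file name and the optional timestamp.
    """
    body = line.rstrip(b"\n")
    rest = body[4:]  # drop the leading "--- " or "+++ "
    name, sep, stamp = rest.partition(b"\t")
    return name, (stamp if sep else None)


# A header with a date, a header without one, and a name containing a TAB.
with_date = split_header(b"--- a/foo.c\t2005-10-11 11:37:46\n")
without_date = split_header(b"--- a/foo.c\n")
tab_in_name = split_header(b"--- a/my\tfile.c\n")
```

With an unquoted tab in the name, the parser cannot tell "a/my" followed by a date from a file really called "a/my<TAB>file.c", which is why \t needs quoting even though diff does not otherwise care about it.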

So what else would you want to protect against? I hope not 8-bit 
cleanness: if some stupid protocol still isn't 8-bit clean, it should be 
fixed.

And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything else than \n and \t (and that 
obviously means you have to quote \ itself). That means that any filename 
always shows "sanely" in its own byte locale, and everything is readable, 
regardless of whether it's UTF-8 or just plain byte-encoded Latin1, or 
anything else.
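
A minimal sketch of that rule in Python, assuming C-style escapes (\n, \t, \\) as the quoting syntax. The function names `quote_path` and `unquote_path` are made up for illustration and are not git's actual implementation.

```python
def quote_path(name: bytes) -> bytes:
    """Quote only backslash, LF and TAB; every other byte passes through."""
    escapes = {0x5C: b"\\\\", 0x0A: b"\\n", 0x09: b"\\t"}
    return b"".join(escapes.get(b, bytes([b])) for b in name)


def unquote_path(quoted: bytes) -> bytes:
    """Inverse of quote_path; rejects escapes that quote_path never emits."""
    unescapes = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C}
    out = bytearray()
    i = 0
    while i < len(quoted):
        b = quoted[i]
        if b == 0x5C:  # backslash starts a two-byte escape
            if i + 1 >= len(quoted) or quoted[i + 1] not in unescapes:
                raise ValueError("invalid escape sequence")
            out.append(unescapes[quoted[i + 1]])
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)


# The same name in two encodings: neither is touched by the quoting.
latin1_name = "caf\u00e9".encode("latin-1")
utf8_name = "caf\u00e9".encode("utf-8")
```

Because the quoting works on bytes and leaves everything above 0x7f alone, `quote_path(latin1_name)` and `quote_path(utf8_name)` both come back unchanged, and the result displays correctly in whatever locale the name was created in.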

So I don't think you should quote invalid UTF-8: it's invalid UTF-8 
whether it's quoted or not.

		Linus

PS. There _is_ something you may want to quote, namely the standard CSI 
terminal escapes. Not because they wouldn't pass through, but because some 
people might just "cat" a patch. This is debatable. Now, they are all 
in the range 0x00-0x1f and 0x80-0x9f, and since UTF-8 encoding is supposed 
to happen before it (but you don't know how many get that right), if you 
want to quote those characters, you need to do so _both_ for the "raw" 
format and for the UTF-8 format.

Now, the UTF-8 format for that high range is actually the same character, 
except preceded by a 0xc2 (I think), so the simplest thing is to do 
quoting _purely_ on a byte-stream level (ignore any UTF-8 stuff), and 
screw the fact that you end up with a non-UTF-8 sequence (character 0x0080 
is UTF-8 sequence 0xC2 0x80, and would be quoted as 0xC2 + "\200", which 
is no longer valid in UTF-8).
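
The byte-level variant can be sketched in Python as follows. This assumes three-digit octal escapes, as in the "\200" example above; `quote_controls` is a hypothetical name, and the ranges are exactly the 0x00-0x1f and 0x80-0x9f mentioned in this PS.

```python
def quote_controls(name: bytes) -> bytes:
    """Quote backslash plus bytes in 0x00-0x1f and 0x80-0x9f as octal.

    Works purely on bytes and deliberately ignores any UTF-8 structure.
    """
    out = bytearray()
    for b in name:
        if b == 0x5C:
            out += b"\\\\"
        elif b <= 0x1F or 0x80 <= b <= 0x9F:
            out += ("\\%03o" % b).encode("ascii")
        else:
            out.append(b)
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+0080 is the two-byte UTF-8 sequence 0xC2 0x80.
raw = "\u0080".encode("utf-8")
quoted = quote_controls(raw)
```

Here `quoted` is the byte 0xC2 followed by the four ASCII characters \200, and `is_valid_utf8(quoted)` is false. The same thing happens to ordinary letters whose UTF-8 continuation byte falls in 0x80-0x9f: U+00C0 encodes as 0xC3 0x80 and comes out as 0xC3 followed by \200.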

It gets quite nasty. For any UTF-8 quoting scheme you come up with, I'll 
point out something that it does wrong or looks horrible for a Latin1 
filename ;)
