From: Linus Torvalds <torvalds@osdl.org>
To: Paul Eggert <eggert@CS.UCLA.EDU>
Cc: Junio C Hamano <junkio@cox.net>,
Robert Fitzsimons <robfitz@273k.net>,
Alex Riesen <raa.lkml@gmail.com>,
git@vger.kernel.org, Kai Ruemmler <kai.ruemmler@gmx.net>
Subject: Re: [PATCH] Try URI quoting for embedded TAB and LF in pathnames
Date: Tue, 11 Oct 2005 11:37:46 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0510111121030.14597@g5.osdl.org>
In-Reply-To: <87ek6s0w34.fsf@penguin.cs.ucla.edu>
On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is
> space, backslash, quote, an ASCII control character, or is non-ASCII.
> But this will cause perfectly-reasonable UTF-8 file names to be
> presented in git format using unreadable strings like "a\293\203\257b"
> or whatever.
I think the simplest question to ask is "what are we protecting against?"
There's only two characters that are _really_ special to diff itself: \n and
\t. The former is obvious, the latter just because the regular gnu diff
format puts a tab between the name and the date (and if you _knew_ the
date was always there you could just work backwards, but since not all
diffs even put a date, \t ends up being special in practice).
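To illustrate why \t ends up special, here is a minimal sketch of how a reader of a diff header might pull out the file name. The function name `header_name` and the sample header lines are hypothetical, made up for this illustration; this is not how any particular tool parses diffs.

```python
def header_name(line: str) -> str:
    """Return the file name from a '--- ' or '+++ ' diff header line.

    Assumption for this sketch: the line starts with a 4-character
    marker ('--- ' or '+++ '), and an optional date follows the name
    after a single TAB.
    """
    rest = line[4:].rstrip("\n")
    # Everything before the first TAB is taken to be the name. The date
    # may be absent, so we cannot reliably parse backwards from the end.
    return rest.split("\t", 1)[0]


if __name__ == "__main__":
    print(header_name("--- a/foo.c\t2005-10-11 11:37:46 -0700\n"))
    print(header_name("+++ b/foo.c\n"))
    # An unquoted TAB inside the name is cut off at the wrong place:
    print(header_name("--- a/odd\tname.c\t2005-10-11 11:37:46 -0700\n"))
```

The last call returns only the part of the name before the embedded TAB, which is the ambiguity that quoting \t is meant to remove.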
So what else would you want to protect against? I hope not 8-bit
cleanness: if some stupid protocol still isn't 8-bit clean, it should be
fixed.
And \0 is already impossible, at least on sane systems.
So arguably you don't need to quote anything else than \n and \t (and that
obviously means you have to quote \ itself). That means that any filename
always shows "sanely" in its own byte locale, and everything is readable,
regardless of whether it's UTF-8 or just plain byte-encoded Latin1, or
anything else.
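A minimal sketch of that minimal scheme, working purely on bytes. The function names and the C-style spellings `\n`, `\t` and `\\` are assumptions made for this illustration; the text above only says which bytes need quoting, not how to spell the escapes.

```python
def quote_path(name: bytes) -> bytes:
    """Quote only backslash, TAB and LF; every other byte passes through."""
    out = bytearray()
    for b in name:
        if b == 0x5C:        # backslash is the escape character itself
            out += b"\\\\"
        elif b == 0x0A:      # LF
            out += b"\\n"
        elif b == 0x09:      # TAB
            out += b"\\t"
        else:                # anything else, including bytes >= 0x80
            out.append(b)
    return bytes(out)


def unquote_path(quoted: bytes) -> bytes:
    """Inverse of quote_path for well-formed input."""
    escapes = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C}
    out = bytearray()
    i = 0
    while i < len(quoted):
        b = quoted[i]
        if b == 0x5C:
            i += 1           # look at the byte after the backslash
            out.append(escapes[quoted[i]])
        else:
            out.append(b)
        i += 1
    return bytes(out)


if __name__ == "__main__":
    for raw in (b"a\tb\nc\\d", "na\u00efve.txt".encode("utf-8"),
                "na\u00efve.txt".encode("latin-1")):
        quoted = quote_path(raw)
        assert unquote_path(quoted) == raw
        print(quoted)
```

Because only three ASCII bytes are touched, a UTF-8 name and a Latin1 name both come out unchanged unless they contain a TAB, LF or backslash.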
So I don't think you should quote invalid UTF-8: it's invalid UTF-8
whether it's quoted or not.
Linus
PS. There _is_ something you may want to quote, namely the standard CSI
terminal escapes. Not because they wouldn't pass through, but because some
people might just "cat" a patch. This is debatable. Now, they are all in
the range 0x00-0x1f and 0x80-0x9f, and since UTF-8 encoding is supposed
to happen before it (but you don't know how many get that right), if you
want to quote those characters, you need to do so _both_ for the "raw"
format and for the UTF-8 format.
Now, the UTF-8 format for that high range is actually the same character,
except preceded by a 0xc2 (I think), so the simplest thing is to do
quoting _purely_ on a byte-stream level (ignore any UTF-8 stuff), and
screw the fact that you end up with a non-UTF-8 sequence (character 0x0080
is UTF-8 sequence 0xC2 0x80, and would be quoted as 0xC2 + "\200", which
is no longer valid in UTF-8).
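The 0xC2 0x80 case can be checked with a short sketch of byte-level quoting. The function name `quote_controls` and the three-digit octal escape spelling are assumptions for this illustration, chosen to match the "\200" written above.

```python
def quote_controls(name: bytes) -> bytes:
    """Quote backslash and every byte in 0x00-0x1f or 0x80-0x9f as octal.

    This works on the raw byte stream and ignores any UTF-8 structure.
    """
    out = bytearray()
    for b in name:
        if b == 0x5C:
            out += b"\\\\"
        elif b <= 0x1F or 0x80 <= b <= 0x9F:
            out += ("\\%03o" % b).encode("ascii")
        else:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    raw = "\u0080".encode("utf-8")      # the two bytes 0xC2 0x80
    quoted = quote_controls(raw)
    print(quoted)                       # 0xC2 followed by backslash, 2, 0, 0
    try:
        quoted.decode("utf-8")
    except UnicodeDecodeError:
        print("quoted form is not valid UTF-8")
```

Note that 0x80-0x9f are also ordinary UTF-8 continuation bytes, so the same byte-level rule rewrites the second byte of a character such as U+00C0 (0xC3 0x80) as well.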
It gets quite nasty. For any UTF-8 quoting scheme you come up with, I'll
point out something that it does wrong or looks horrible for a Latin1
filename ;)