On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is 
> space, backslash, quote, an ASCII control character, or is non-ASCII.  
> But this will cause perfectly-reasonable UTF-8 file names to be 
> presented in git format using unreadable strings like "a\293\203\257b" 
> or whatever.

I think the simplest question to ask is "what are we protecting against?"

There are only two characters that are _really_ special to diff itself: \n and 
\t. The former is obvious, the latter just because the regular gnu diff 
format puts a tab between the name and the date (and if you _knew_ the 
date was always there you could just work backwards, but since not all 
diffs even put a date, \t ends up being special in practice).
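
To make the \t point concrete, here is a minimal Python sketch of splitting a "--- name<TAB>date" header line. It is an illustration only, not code from diff or git, and `split_header` is a made-up name. It assumes the line starts with a four-byte "--- " or "+++ " prefix.

```python
def split_header(line: bytes):
    """Split a '--- name<TAB>date' header into (name, date or None)."""
    # Drop the trailing newline and the assumed 4-byte "--- " / "+++ " prefix.
    body = line.rstrip(b"\n")[4:]
    # The first tab ends the name; with no tab there is no date field.
    name, sep, date = body.partition(b"\t")
    return name, (date if sep else None)
```

An unquoted tab inside the name gets cut at the wrong place, which is exactly why \t has to be quoted.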

So what else would you want to protect against? I hope not 8-bit 
cleanness: if some stupid protocol still isn't 8-bit clean, it should be 
fixed.

And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything other than \n and \t (and that 
obviously means you have to quote \ itself). That means that any filename 
always shows "sanely" in its own byte locale, and everything is readable, 
regardless of whether it's UTF-8 or just plain byte-encoded Latin1, or 
anything else.
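
A minimal sketch of that rule in Python, working purely on bytes. `quote_minimal` is a hypothetical name, and the C-style escapes \n, \t and \\ are an assumption about how the quoted form would be spelled:

```python
def quote_minimal(name: bytes) -> bytes:
    """Quote only newline, tab and backslash; pass every other byte through."""
    out = bytearray()
    for b in name:
        if b == 0x0A:        # newline
            out += b"\\n"
        elif b == 0x09:      # tab
            out += b"\\t"
        elif b == 0x5C:      # backslash, so the escapes stay unambiguous
            out += b"\\\\"
        else:                # everything else, including bytes >= 0x80
            out.append(b)
    return bytes(out)
```

A UTF-8 name and a Latin1 name both come out byte-for-byte unchanged, so each one displays correctly in its own locale.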

So I don't think you should quote invalid UTF-8: it's invalid UTF-8 
whether it is quoted or not.

		Linus

PS. There _is_ something you may want to quote, namely the standard CSI 
terminal escapes. Not because they wouldn't pass through, but because some 
people might just "cat" a patch. This is debatable. Now, they are all 
in the range 0x00-0x1f and 0x80-0x9f, and since UTF-8 encoding is supposed 
to happen before it (but you don't know how many get that right), if you 
want to quote those characters, you need to do so _both_ for the "raw" 
format and for the UTF-8 format.

Now, the UTF-8 format for that high range is actually the same character, 
except preceded by a 0xc2 (I think), so the simplest thing is to do 
quoting _purely_ on a byte-stream level (ignore any UTF-8 stuff), and 
screw the fact that you end up with a non-UTF-8 sequence (character 0x0080 
is UTF-8 sequence 0xC2 0x80, and would be quoted as 0xC2 + "\200", which 
is no longer valid in UTF-8).
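
The byte-level approach can be sketched like this in Python. `quote_controls` is a hypothetical name, and the three-digit octal escape is an assumption chosen to match the "\200" spelling above:

```python
def quote_controls(name: bytes) -> bytes:
    """Octal-quote bytes in 0x00-0x1f and 0x80-0x9f, ignoring any UTF-8 structure."""
    out = bytearray()
    for b in name:
        if b == 0x5C:                          # backslash must be quoted too
            out += b"\\\\"
        elif b <= 0x1F or 0x80 <= b <= 0x9F:   # the two control ranges
            out += b"\\%03o" % b               # e.g. 0x80 becomes \200
        else:
            out.append(b)
    return bytes(out)
```

With this, the UTF-8 bytes 0xC2 0x80 become 0xC2 followed by the four characters \200, which no longer decodes as UTF-8. Note that the byte-level test also catches continuation bytes of ordinary characters: U+00C0 is 0xC3 0x80 in UTF-8, and its second byte gets quoted as well.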

It gets quite nasty. For any UTF-8 quoting scheme you come up with, I'll 
point out something that it does wrong or looks horrible for a Latin1 
filename ;)