On Tue, 11 Oct 2005, Paul Eggert wrote:
>
> For example, the simplest approach is to say a byte is funny if it is
> space, backslash, quote, an ASCII control character, or is non-ASCII.
> But this will cause perfectly-reasonable UTF-8 file names to be
> presented in git format using unreadable strings like "a\293\203\257b"
> or whatever.

I think the simplest question to ask is "what are we protecting against?"

There's only two characters that are _really_ special to diff itself: \n
and \t. The former is obvious, the latter just because the regular GNU diff
format puts a tab between the name and the date (and if you _knew_ the date
was always there you could just work backwards, but since not all diffs
even put a date, \t ends up being special in practice).

So what else would you want to protect against? I hope not 8-bit
cleanness: if some stupid protocol still isn't 8-bit clean, it should be
fixed. And \0 is already impossible, at least on sane systems.

So arguably you don't need to quote anything else than \n and \t (and that
obviously means you have to quote \ itself).

That means that any filename always shows "sanely" in its own byte locale,
and everything is readable, regardless of whether it's UTF-8 or just plain
byte-encoded Latin1, or anything else.

So I don't think you should quote invalid UTF-8: it's invalid UTF-8
whether it is quoted or not.

		Linus

PS. There _is_ something you may want to quote, namely the standard CSI
terminal escapes. Not because they wouldn't pass through, but because some
people might just "cat" a patch. This is debatable.

Now, they are all in the range 0x00-0x1f and 0x80-0x9f, and since UTF-8
encoding is supposed to happen before it (but you don't know how many get
that right), if you want to quote those characters, you need to do so
_both_ for the "raw" format and for the UTF-8 format.
Now, the UTF-8 format for that high range is actually the same character,
except preceded by a 0xc2 (I think), so the simplest thing is to do quoting
_purely_ on a byte-stream level (ignore any UTF-8 stuff), and screw the
fact that you end up with a non-UTF-8 sequence (character 0x0080 is UTF-8
sequence 0xC2 0x80, and would be quoted as 0xC2 + "\200", which is no
longer valid in UTF-8).

It gets quite nasty. For any UTF-8 quoting scheme you come up with, I'll
point out something that it does wrong or looks horrible for a Latin1
filename ;)
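
To make the byte-level idea concrete, here is a minimal sketch in Python.
It is not git's actual code: the function name quote_name and the
quote_controls flag are made up for illustration. By default it quotes only
\n, \t and \ itself; with the flag set it also quotes the two control
ranges (0x00-0x1f and 0x80-0x9f) as octal escapes, purely byte by byte and
ignoring any UTF-8 structure.

```python
def quote_name(name: bytes, quote_controls: bool = False) -> bytes:
    """Quote a file name byte by byte (hypothetical helper, not git's)."""
    out = bytearray()
    for b in name:
        if b == 0x0A:                 # newline: special to diff itself
            out += b"\\n"
        elif b == 0x09:               # tab: separates name and date
            out += b"\\t"
        elif b == 0x5C:               # backslash: must be quoted too
            out += b"\\\\"
        elif quote_controls and (b <= 0x1F or 0x80 <= b <= 0x9F):
            out += b"\\%03o" % b      # optional: control ranges as octal
        else:
            out.append(b)             # everything else passes through
    return bytes(out)


# Minimal quoting leaves Latin1 and UTF-8 names readable as they are.
latin1_name = "caf\u00e9".encode("latin-1")   # b"caf\xe9"
utf8_name = "caf\u00e9".encode("utf-8")       # b"caf\xc3\xa9"
assert quote_name(latin1_name) == latin1_name
assert quote_name(utf8_name) == utf8_name

# Byte-level control quoting: U+0080 is 0xC2 0x80 in UTF-8, and the
# result 0xC2 + "\200" is no longer valid UTF-8.
assert quote_name(b"\xc2\x80", quote_controls=True) == b"\xc2\\200"
```

With quote_controls set, the same byte test also catches continuation bytes
of ordinary UTF-8 characters (U+00C0 is 0xC3 0x80 and comes out as
0xC3 + "\200"), which is one more way it gets nasty.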