git.vger.kernel.org archive mirror
* [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
@ 2021-05-26 22:47 Yuri
  2021-05-26 23:32 ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Yuri @ 2021-05-26 22:47 UTC
  To: Git Mailing List

I have a file whose name contains the "∞" character.


When this file was modified, 'git status .' showed it as:

 >    modified:   "file-name-\342\210\236.ext"


It replaced the UTF8 character with its byte representation, and put the 
file name in quotes.
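
The three escapes are the octal values of the UTF-8 bytes that encode "∞" (0xE2 0x88 0x9E). To make that concrete, here is a minimal Python sketch of the C-style quoting, assuming the rules given in the git-config documentation for core.quotepath. `quote_path` and its `quote_high_bit` parameter are hypothetical names used for illustration; this is an approximation, not Git's actual code.

```python
# Escapes that have a short C-style name, keyed by byte value.
_NAMED_ESCAPES = {
    0x07: b"\\a", 0x08: b"\\b", 0x09: b"\\t", 0x0A: b"\\n", 0x0B: b"\\v",
    0x0C: b"\\f", 0x0D: b"\\r", 0x22: b'\\"', 0x5C: b"\\\\",
}


def quote_path(raw: bytes, quote_high_bit: bool = True) -> bytes:
    """Approximate C-style path quoting (hypothetical helper, not Git's code)."""
    out = bytearray()
    quoted = False
    for byte in raw:
        if byte in _NAMED_ESCAPES:
            out += _NAMED_ESCAPES[byte]
            quoted = True
        elif byte < 0x20 or byte == 0x7F or (quote_high_bit and byte >= 0x80):
            # Remaining control bytes, and high-bit bytes when requested,
            # become three-digit octal escapes.
            out += b"\\%03o" % byte
            quoted = True
        else:
            out.append(byte)
    # The surrounding double quotes appear only if something was escaped.
    return (b'"' + bytes(out) + b'"') if quoted else bytes(out)


raw_name = "file-name-∞.ext".encode("utf-8")
escaped = quote_path(raw_name)                         # quoted, with \342\210\236
verbatim = quote_path(raw_name, quote_high_bit=False)  # bytes passed through
```

With `quote_high_bit=False` the name comes back unchanged, since it contains no control characters, double quotes, or backslashes.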


git should show such files without escaping when the terminal is able to 
show UTF8 characters because escaping decreases readability.

$ env | grep TERM
COLORTERM=truecolor
TERM=xterm-256color

$ env | grep LANG
LANG=C.UTF-8

$ env | grep CTYPE
LC_CTYPE=en_US.UTF-8


Thanks,

Yuri



* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-26 22:47 [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages Yuri
@ 2021-05-26 23:32 ` Junio C Hamano
  2021-05-26 23:41   ` Yuri
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2021-05-26 23:32 UTC
  To: Yuri; +Cc: Git Mailing List

Yuri <yuri@rawbw.com> writes:

> I have a file whose name contains the "∞" character.

"git config core.quotepath no"?


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-26 23:32 ` Junio C Hamano
@ 2021-05-26 23:41   ` Yuri
  2021-05-27  4:56     ` Torsten Bögershausen
  0 siblings, 1 reply; 14+ messages in thread
From: Yuri @ 2021-05-26 23:41 UTC
  To: Junio C Hamano; +Cc: Git Mailing List

On 5/26/21 4:32 PM, Junio C Hamano wrote:
> "git config core.quotepath no"?


I didn't have the 'core.quotepath' value set. 'git config core.quotepath 
no' changed the behavior to no quoting.

So it looks like the default value of 'core.quotepath' is incorrect: it 
should be based on terminal capabilities.



Yuri




* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-26 23:41   ` Yuri
@ 2021-05-27  4:56     ` Torsten Bögershausen
  2021-05-27 14:02       ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Torsten Bögershausen @ 2021-05-27  4:56 UTC
  To: Yuri; +Cc: Junio C Hamano, Git Mailing List

On Wed, May 26, 2021 at 04:41:38PM -0700, Yuri wrote:
> On 5/26/21 4:32 PM, Junio C Hamano wrote:
> > "git config core.quotepath no"?
>
>
> I didn't have the 'core.quotepath' value set. 'git config core.quotepath no'
> changed the behavior to no quoting.
>
> So it looks like the default value of 'core.quotepath' is incorrect: it
> should be based on terminal capabilities.
>

These are two different things.
If you are in a project where only ASCII names are allowed (for whatever reason),
you may want `git config core.quotepath no`, regardless of what the terminal can do.

(Besides that, are there terminals that don't handle UTF-8 these days?)

Anyway, if you prefer UTF-8 as a default,

git config --global core.quotepath yes

is your friend (like mine).



* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-27  4:56     ` Torsten Bögershausen
@ 2021-05-27 14:02       ` Jeff King
  2021-05-27 20:50         ` Yuri
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2021-05-27 14:02 UTC
  To: Torsten Bögershausen; +Cc: Yuri, Junio C Hamano, Git Mailing List

On Thu, May 27, 2021 at 06:56:28AM +0200, Torsten Bögershausen wrote:

> On Wed, May 26, 2021 at 04:41:38PM -0700, Yuri wrote:
> > On 5/26/21 4:32 PM, Junio C Hamano wrote:
> > > "git config core.quotepath no"?
> >
> >
> > I didn't have the 'core.quotepath' value set. 'git config core.quotepath no'
> > changed the behavior to no quoting.
> >
> > So it looks like the default value of 'core.quotepath' is incorrect: it
> > should be based on terminal capabilities.
> >
> 
> These are two different things.
> If you are in a project where only ASCII names are allowed (for whatever reason),
> you may want `git config core.quotepath no`, regardless of what the terminal can do.
> 
> (Besides that, are there terminals that don't handle UTF-8 these days?)

I don't think core.quotepath is just about UTF-8. It is agnostic to the
encoding of the paths, so it is really a question of whether to just
pass through bytes with the high bit set.

So I think the more accurate question is: do the paths in your
repositories generally contain bytes that your terminal can interpret
sensibly?  I'd guess the answer is usually yes, even if you are using
latin1 or similar (or else "ls" would show you mojibake, too).

But there's a follow-on, too: do all the other things which consume
quoted path output likewise handle it? Setting core.quotepath will
impact all parts of Git, including plumbing. So a script that parses
diff-tree output, for example, will see a difference.

I'd guess that most text-processing tools these days are reasonably
happy with high-bit chars. But if we were to flip the default, we might
see regressions with:

  - very old / obscure systems (I'd guess even old versions of GNU tools
    are good, but who knows what Solaris sed will do)

  - some scripting languages (like perl and ruby) have internal strings
    that are encoding-aware, and so they are picky about reading
    high-bit input from a descriptor, especially if it isn't utf8.
    The fix is usually easy-ish, but may be a surprise for some folks
    (OTOH, I can imagine it fixes bugs in sloppily-written scripts which
    did not anticipate the incoming filenames being quoted ;) ).
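
As an illustration of the second point, Python 3 (used here only as a stand-in for the Perl and Ruby cases mentioned above) also has encoding-aware strings. In this minimal sketch a strict UTF-8 decode of unquoted path bytes fails for a latin1 name, while the `surrogateescape` error handler round-trips arbitrary bytes. The file names and helper names are made-up examples.

```python
from typing import Optional

# Two example path names as raw bytes, as a script might read them from a pipe.
utf8_name = "file-name-∞.ext".encode("utf-8")
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt', not valid UTF-8


def decode_strict(raw: bytes) -> Optional[str]:
    """Decode as UTF-8, returning None when the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: recover the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")
```

A script that uses the strict variant works today only because the quoted output is pure ASCII; with unquoted output it would need the lossless variant or byte-level handling.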

As Git is used more and more internationally, I suspect the value of
defaulting core.quotepath=no increases. And as time goes on and people
tend to standardize on utf8-aware tools and environments, the risk of
doing so decreases. So while core.quotepath=yes was a conservative
choice in 2007, it might be time to look at switching.

> Anyway, if you prefer UTF-8 as a default,
> 
> git config --global core.quotepath yes
> 
> is your friend (like mine)

Just a nit/clarification for other readers, but I think you have yes/no
flipped here and earlier in your message.

-Peff


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-27 14:02       ` Jeff King
@ 2021-05-27 20:50         ` Yuri
  2021-05-28  4:39           ` Bagas Sanjaya
  0 siblings, 1 reply; 14+ messages in thread
From: Yuri @ 2021-05-27 20:50 UTC
  To: Jeff King, Torsten Bögershausen; +Cc: Junio C Hamano, Git Mailing List

It's not clear from the conversation if git reads terminal capabilities 
at all.


But the default behavior, without any options set, should be to read 
terminal capabilities, write non-ASCII characters verbatim when the 
terminal supports them, and escape them when it doesn't.

Current default behavior appears to be to always escape non-ASCII 
characters.


Then options can change this basic behavior according to user's choice.



Yuri




* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-27 20:50         ` Yuri
@ 2021-05-28  4:39           ` Bagas Sanjaya
  2021-05-28  4:45             ` Yuri
  0 siblings, 1 reply; 14+ messages in thread
From: Bagas Sanjaya @ 2021-05-28  4:39 UTC
  To: Yuri, Jeff King, Torsten Bögershausen
  Cc: Junio C Hamano, Git Mailing List

On 28/05/21 03.50, Yuri wrote:
> It's not clear from the conversation if git reads terminal capabilities 
> at all.
> 
> 
> But the default behavior, without any options set, should be to read 
> terminal capabilities, write non-ASCII characters verbatim when the 
> terminal supports them, and escape them when it doesn't.
> 
> Current default behavior appears to be to always escape non-ASCII 
> characters.

So the current default only supports ASCII and escapes other 
characters, right?

-- 
An old man doll... just what I always wanted! - Clara


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-28  4:39           ` Bagas Sanjaya
@ 2021-05-28  4:45             ` Yuri
  2021-05-29  9:27               ` Torsten Bögershausen
  0 siblings, 1 reply; 14+ messages in thread
From: Yuri @ 2021-05-28  4:45 UTC
  To: Bagas Sanjaya, Jeff King, Torsten Bögershausen
  Cc: Junio C Hamano, Git Mailing List

On 5/27/21 9:39 PM, Bagas Sanjaya wrote:
> So the current default only supports ASCII and escapes other 
> characters, right?


It appears this way.


Yuri



* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-28  4:45             ` Yuri
@ 2021-05-29  9:27               ` Torsten Bögershausen
  2021-05-30 21:44                 ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Torsten Bögershausen @ 2021-05-29  9:27 UTC
  To: Yuri; +Cc: Bagas Sanjaya, Jeff King, Junio C Hamano, Git Mailing List

On Thu, May 27, 2021 at 09:45:53PM -0700, Yuri wrote:
> On 5/27/21 9:39 PM, Bagas Sanjaya wrote:
> > So the current default only supports ASCII and escapes other
> > characters, right?
>
>
> It appears this way.
>

Yes, that is how it is.

After reading the wiki here:

https://wiki.gentoo.org/wiki/UTF-8

(There are many other web pages as well)

I am not sure that there is a reliable way for Git to detect whether the
terminal is capable of handling UTF-8.
Such detection would have to work reliably under Linux, Windows, Mac and all
the supported Unix-ish platforms.

Besides that, the output of git commands can be fed into other programs
via a pipe using "|" on the command line, or redirected to a file.

And what is a terminal?
We need to consider that we run programs like `less` or `more`, which
need to be UTF-8 compatible.

Most of them are probably UTF-8 compliant (and LANG is set to xx.UTF-8)
these days.
And most repositories have been fed filenames encoded in UTF-8 as well.

Having said that, the default could be switched some day in the future.
Before that is safe, there may be a transition phase,
where users are warned that the default may change.
Scripts calling git need to use `git -c core.quotepath=yes`, or no,
depending on what input they expect.

Sorry for the longish answer.
Changing one thing for some users may affect hundreds, thousands or millions
of other users later, cause surprises, and require debugging and fixing effort.

Does someone want to come up with a patch that announces a possible change?





* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-29  9:27               ` Torsten Bögershausen
@ 2021-05-30 21:44                 ` Jeff King
  2021-05-30 21:55                   ` Yuri
  2021-05-30 22:23                   ` Junio C Hamano
  0 siblings, 2 replies; 14+ messages in thread
From: Jeff King @ 2021-05-30 21:44 UTC
  To: Torsten Bögershausen
  Cc: Yuri, Bagas Sanjaya, Junio C Hamano, Git Mailing List

On Sat, May 29, 2021 at 11:27:52AM +0200, Torsten Bögershausen wrote:

> I am not sure that there is a reliable way for Git to detect whether the
> terminal is capable of handling UTF-8.
> Such detection would have to work reliably under Linux, Windows, Mac and all
> the supported Unix-ish platforms.

Yeah, I'm not sure how such a check would be done. On most Linux systems
I've seen, $LANG will mention "en_US.UTF-8" or similar. But I've no idea
how portable that convention is, not to mention that people may have
more complex setups anyway (e.g., not setting $LANG but setting some of
LC_*).

But more importantly, this is not even a UTF-8 problem. It is "can your
terminal do something sensible with high-bit characters in filenames of
your repositories". We don't know the encoding of those filenames (and
you may even have a mix).

(And likewise "terminal" here is really "whatever consumes Git's output",
be it the terminal or some program you've piped to.)

> Having said that, the default could be switched some day in the future.
> Before that is safe, there may be a transition phase,
> where users are warned that the default may change.
> Scripts calling git need to use `git -c core.quotepath=yes`, or no,
> depending on what input they expect.

Yes. If we're going to do anything, I think it would be to say "most
terminals and programs deal with high-bit characters OK these days, so
switching the default is more likely to fix things than break them".

I suspect most scripts would be OK either way. They need to handle
maybe-quoted filenames already, so it is really just a question of
whether the consuming program is OK with the high bits. If so, we could
probably get away with just a mention in the release notes, rather than
an annoying transition phase (which is likely to simply confuse most
users, who are unaware of the issue entirely).
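
For reference, handling a maybe-quoted filename can be sketched as below. This is a minimal Python sketch, assuming the C-style quoting that the git-config documentation describes for core.quotepath. `unquote_path` is a hypothetical helper, not part of Git, and it does no error handling for malformed input.

```python
# Single-character escapes used by C-style quoting, mapped to byte values.
_SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "v": 0x0B,
    "f": 0x0C, "r": 0x0D, '"': 0x22, "\\": 0x5C,
}


def unquote_path(field: str) -> bytes:
    """Return the raw bytes of one path field (hypothetical helper)."""
    if len(field) < 2 or field[0] != '"' or field[-1] != '"':
        # Not quoted: the field is the path itself.
        return field.encode("utf-8", "surrogateescape")
    body = field[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out += body[i].encode("utf-8", "surrogateescape")
            i += 1
        elif body[i + 1] in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[body[i + 1]])
            i += 2
        else:
            # Three octal digits, for example \342 is the byte 0xE2.
            out.append(int(body[i + 1:i + 4], 8))
            i += 4
    return bytes(out)


# The name from the original report, as printed with quoting enabled.
path = unquote_path('"file-name-\\342\\210\\236.ext"')
```

A script written this way sees the same bytes whether or not the high-bit characters were escaped. Where a command offers `-z` (for example `git ls-files -z` or `git diff-tree -z`), the paths are printed NUL-terminated and unquoted, which avoids the parsing altogether.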

But I'd feel more confident if whoever proposes such a change does some
research on how piping such names into common tools and scripting
languages works (both for utf8 and non-utf8 names).

-Peff


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-30 21:44                 ` Jeff King
@ 2021-05-30 21:55                   ` Yuri
  2021-05-31  1:14                     ` Thomas Guyot
  2021-05-30 22:23                   ` Junio C Hamano
  1 sibling, 1 reply; 14+ messages in thread
From: Yuri @ 2021-05-30 21:55 UTC
  To: Jeff King, Torsten Bögershausen
  Cc: Bagas Sanjaya, Junio C Hamano, Git Mailing List

On 5/30/21 2:44 PM, Jeff King wrote:
> Yeah, I'm not sure how such a check would be done. On most Linux systems
> I've seen, $LANG will mention "en_US.UTF-8" or similar. But I've no idea
> how portable that convention is, not to mention that people may have
> more complex setups anyway (e.g., not setting $LANG but setting some of
> LC_*).


When 'locale charmap' prints 'UTF-8' the terminal can be assumed to be 
able to accept UTF-8 characters.

'locale charmap', I think, determines this only based on environment 
variables.



Yuri



* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-30 21:44                 ` Jeff King
  2021-05-30 21:55                   ` Yuri
@ 2021-05-30 22:23                   ` Junio C Hamano
  1 sibling, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2021-05-30 22:23 UTC
  To: Jeff King
  Cc: Torsten Bögershausen, Yuri, Bagas Sanjaya, Git Mailing List

Jeff King <peff@peff.net> writes:

> Yes. If we're going to do anything, I think it would be to say "most
> terminals and programs deal with high-bit characters OK these days, so
> switching the default is more likely to fix things than break them".

Amen to that.  The conservative setting was from v1.5.3 days in
2007, and it would be highly disappointing if the situation hasn't
changed in the 14 years.


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-30 21:55                   ` Yuri
@ 2021-05-31  1:14                     ` Thomas Guyot
  2021-05-31  3:35                       ` Bagas Sanjaya
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Guyot @ 2021-05-31  1:14 UTC
  To: Yuri, Jeff King, Torsten Bögershausen
  Cc: Bagas Sanjaya, Junio C Hamano, Git Mailing List

On 2021-05-30 17:55, Yuri wrote:
> On 5/30/21 2:44 PM, Jeff King wrote:
>> Yeah, I'm not sure how such a check would be done. On most Linux systems
>> I've seen, $LANG will mention "en_US.UTF-8" or similar. But I've no idea
>> how portable that convention is, not to mention that people may have
>> more complex setups anyway (e.g., not setting $LANG but setting some of
>> LC_*).
> 
> 
> When 'locale charmap' prints 'UTF-8' the terminal can be assumed to be
> able to accept UTF-8 characters.
> 
> 'locale charmap', I think, determines this only based on environment
> variables.
> 

Hi Yuri,

Even if the terminal supports UTF8, will it print it properly? The font
used could have no or minimal utf8 support. Even when it's supported,
some characters might look alike and this could have undesired
consequences (e.g. accidentally switching from a normal space to a
non-breaking space while renaming a file that has spaces...).

I believe repos with utf8 files are rare enough and it could be left to
the user to select whether to use utf8 or not... An option like "auto"
or "detect" could make it automatic but I'm not convinced it should be
the default.


Oh, and looking at "locale charmap", it doesn't check the terminal
capabilities at all - it just prints the charmap based on LC_ALL or
LC_CTYPE value, or default if they're unset. It doesn't matter what
terminal you're on...
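
The lookup can be sketched as follows. This is a minimal Python sketch of the POSIX precedence for the character-type category (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset). `effective_ctype` and `codeset` are hypothetical helper names, and parsing the codeset out of the locale name is a simplification: the real `locale charmap` consults the installed locale definitions.

```python
from typing import Mapping, Optional


def effective_ctype(env: Mapping[str, str]) -> str:
    """Pick the locale name that governs character handling."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same way
            return value
    return "C"


def codeset(locale_name: str) -> Optional[str]:
    """Extract the codeset from a name like 'en_US.UTF-8@modifier', if present."""
    base = locale_name.split("@", 1)[0]
    if "." in base:
        return base.split(".", 1)[1]
    return None


# The environment from the original report; TERM plays no part in the lookup.
report_env = {
    "TERM": "xterm-256color",
    "LANG": "C.UTF-8",
    "LC_CTYPE": "en_US.UTF-8",
}
chosen = effective_ctype(report_env)
```

Nothing in this lookup asks the terminal anything, which is the point: the result describes what the environment claims, not what the terminal or its font can render.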

Regards,

--
Thomas


* Re: [BUG REPORT] File names that contain UTF8 characters are unnecessarily escaped in 'git status .' messages
  2021-05-31  1:14                     ` Thomas Guyot
@ 2021-05-31  3:35                       ` Bagas Sanjaya
  0 siblings, 0 replies; 14+ messages in thread
From: Bagas Sanjaya @ 2021-05-31  3:35 UTC
  To: Thomas Guyot, Yuri, Jeff King, Torsten Bögershausen
  Cc: Junio C Hamano, Git Mailing List

On 31/05/21 08.14, Thomas Guyot wrote:
> Even if the terminal supports UTF8, will it print it properly? The font
> used could have no or minimal utf8 support. Even when it's supported,
> some characters might look alike and this could have undesired
> consequences (e.g. accidentally switching from a normal space to a
> non-breaking space while renaming a file that has spaces...).

On Linux distributions, Noto and DejaVu fonts are often installed as 
default fonts, because Noto has almost complete Unicode coverage and 
DejaVu Mono has become the go-to monospace font.

And yeah, we steer clear of using non-monospace fonts (either serif or 
sans serif), because many terminal-only programs depend on text 
alignment which often can be achieved only with monospace fonts, and 
reading texts on terminal screen is vertical-oriented as opposed to 
horizontal-oriented texts like books.

-- 
An old man doll... just what I always wanted! - Clara
