* [PATCH] Documentation/i18n.txt: clarify character encoding support
@ 2015-06-13 20:24 Karsten Blees
  2015-06-15  0:12 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Karsten Blees @ 2015-06-13 20:24 UTC
  To: Git List

As a "distributed" VCS, git should better define the encodings of its core
textual data structures, in particular those that are part of the network
protocol.

That git is encoding agnostic is only really true for blob objects. E.g.
the 'non-NUL bytes' requirement of tree and commit objects excludes
UTF-16/32, and the special meaning of '/' in the index file as well as
space and linefeed in commit objects eliminates EBCDIC and other non-ASCII
encodings.
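
The byte-level constraints above can be checked with a short Python sketch.
It is only an illustration using Python's standard codecs (`utf-16-le`, and
`cp037` as one EBCDIC variant); the sample path is made up and none of this
is Git code.

```python
# Illustration only: which encodings preserve the bytes Git relies on?
SAMPLE_PATH = "src/main.c"  # hypothetical path name


def has_nul(data: bytes) -> bool:
    """True if the encoded form contains a NUL byte."""
    return b"\x00" in data


utf8 = SAMPLE_PATH.encode("utf-8")
utf16 = SAMPLE_PATH.encode("utf-16-le")
ebcdic = SAMPLE_PATH.encode("cp037")  # one EBCDIC code page, chosen as an example

print("UTF-8 contains NUL: ", has_nul(utf8))
print("UTF-16 contains NUL:", has_nul(utf16))
print("UTF-8 has 0x2F '/': ", b"/" in utf8)
print("EBCDIC has 0x2F '/':", b"/" in ebcdic)
```

UTF-16 pads every ASCII character with a zero byte, and EBCDIC encodes '/'
as a different byte value, so neither can satisfy code that scans for NUL
terminators or for the ASCII separator.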

Git expects bytes < 0x80 to be pure ASCII, thus CJK encodings that partly
overlap with the ASCII range are problematic as well. E.g. fmt_ident()
removes trailing 0x5C from user names on the assumption that it is ASCII
'\'. However, there are over 200 GBK double byte codes that end in 0x5C.
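
A rough way to see the 0x5C problem is to enumerate two-byte GBK codes with
that trail byte in Python. This is a sketch under assumptions: it relies on
Python's own `gbk` codec table, `strip_trailing_backslash` is a hypothetical
stand-in and not git's fmt_ident(), and it does not try to reproduce the
count quoted above.

```python
def gbk_pairs_ending_in_backslash():
    """Return (bytes, character) for each two-byte GBK code whose trail byte is 0x5C."""
    pairs = []
    for lead in range(0x81, 0xFF):
        raw = bytes([lead, 0x5C])
        try:
            pairs.append((raw, raw.decode("gbk")))
        except UnicodeDecodeError:
            pass  # this lead byte has no 0x5C entry in Python's GBK table
    return pairs


def strip_trailing_backslash(raw: bytes) -> bytes:
    # Hypothetical byte-wise cleanup that assumes 0x5C is always ASCII '\'.
    return raw.rstrip(b"\\")


pairs = gbk_pairs_ending_in_backslash()
raw, char = pairs[0]
name = b"A" + raw                          # made-up user name, GBK-encoded
damaged = strip_trailing_backslash(name)   # drops half of the last character

print(name.hex(), "->", damaged.hex())
print(ascii(name.decode("gbk")), "->", ascii(damaged.decode("gbk", errors="replace")))
```

Any cleanup that works on single bytes cannot tell the second half of such
a character from a real backslash, which is the failure mode described
above.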

UTF-8 as the default encoding on Linux and respective path translations in
the Mac and Windows versions have established UTF-8 NFC as the de facto
standard for path names.

Update the documentation in i18n.txt to reflect the current status quo.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Documentation/i18n.txt | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index e9a1d5d..e5f6233 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,18 +1,28 @@
-At the core level, Git is character encoding agnostic.
-
- - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
-   What readdir(2) returns are what are recorded and compared
-   with the data Git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts.  There is no such
-   thing as pathname encoding translation.
+Git is to some extent character encoding agnostic.
 
  - The contents of the blob objects are uninterpreted sequences
    of bytes.  There is no encoding translation at the core
    level.
 
- - The commit log messages are uninterpreted sequences of non-NUL
-   bytes.
+ - Pathnames are encoded in UTF-8 normalization form C. This
+   applies to tree objects, the index file, ref names and
+   config files (`.git/config` (see linkgit:git-config[1]),
+   linkgit:gitignore[5], linkgit:gitattributes[5] and
+   linkgit:gitmodules[5]).
+   The Mac and Windows versions automatically translate pathnames
+   to and from UTF-8 NFC in their readdir(2), lstat(2), creat(2)
+   etc. APIs. However, there is no such translation on other
+   platforms. If file system APIs don't use UTF-8 (which may be
+   file system specific), it is recommended to stick to pure
+   ASCII file names. While Git technically supports other
+   extended ASCII encodings at the core level, such repositories
+   will not be portable.
+
+ - Commit log messages are typically encoded in UTF-8, but other
+   extended ASCII encodings are also supported. This includes
+   ISO-8859-x, CP125x and many others, but _not_ UTF-16/32,
+   EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5,
+   EUC-x, CP9xx etc.).
 
 Although we encourage that the commit log messages are encoded
 in UTF-8, both the core and Git Porcelain are designed not to
-- 
2.4.1.windows.1


* Re: [PATCH] Documentation/i18n.txt: clarify character encoding support
  2015-06-13 20:24 [PATCH] Documentation/i18n.txt: clarify character encoding support Karsten Blees
@ 2015-06-15  0:12 ` Junio C Hamano
  2015-06-15 10:08   ` Karsten Blees
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2015-06-15  0:12 UTC
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
> index e9a1d5d..e5f6233 100644
> --- a/Documentation/i18n.txt
> +++ b/Documentation/i18n.txt
> @@ -1,18 +1,28 @@
> -At the core level, Git is character encoding agnostic.
> -
> - - The pathnames recorded in the index and in the tree objects
> -   are treated as uninterpreted sequences of non-NUL bytes.
> -   What readdir(2) returns are what are recorded and compared
> -   with the data Git keeps track of, which in turn are expected
> -   to be what lstat(2) and creat(2) accepts.  There is no such
> -   thing as pathname encoding translation.
> +Git is to some extent character encoding agnostic.

I do not think the removal of the text makes much sense here unless
you add the equivalent to the new text below.

>   - The contents of the blob objects are uninterpreted sequences
>     of bytes.  There is no encoding translation at the core
>     level.
>  
> - - The commit log messages are uninterpreted sequences of non-NUL
> -   bytes.
> + - Pathnames are encoded in UTF-8 normalization form C. This

That is true only on some systems like OSX (with HFS+) and Windows,
no?  BSDs in general and Linux do not do any such mangling IIRC.  I
am OK with mangling described as a notable oddball to warn users,
though; i.e. not as a norm as your new text suggests but as an
exception.

> +   platforms. If file system APIs don't use UTF-8 (which may be
> +   file system specific), it is recommended to stick to pure
> +   ASCII file names.

Hmph, who endorsed such a recommendation?  It is recommended to
stick to whatever naming scheme does not cause trouble for
project participants.  If your participants all want to (and can)
use ISO-8859-1, we do not discourage them from doing so.


* Re: [PATCH] Documentation/i18n.txt: clarify character encoding support
  2015-06-15  0:12 ` Junio C Hamano
@ 2015-06-15 10:08   ` Karsten Blees
  2015-06-17 20:45     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Karsten Blees @ 2015-06-15 10:08 UTC
  To: Junio C Hamano; +Cc: Git List

On 15.06.2015 at 02:12, Junio C Hamano wrote:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
>> index e9a1d5d..e5f6233 100644
>> --- a/Documentation/i18n.txt
>> +++ b/Documentation/i18n.txt
>> @@ -1,18 +1,28 @@
>> -At the core level, Git is character encoding agnostic.
>> -
>> - - The pathnames recorded in the index and in the tree objects
>> -   are treated as uninterpreted sequences of non-NUL bytes.
>> -   What readdir(2) returns are what are recorded and compared
>> -   with the data Git keeps track of, which in turn are expected
>> -   to be what lstat(2) and creat(2) accepts.  There is no such
>> -   thing as pathname encoding translation.
>> +Git is to some extent character encoding agnostic.
> 
> I do not think the removal of the text makes much sense here unless
> you add the equivalent to the new text below.
> 
>>   - The contents of the blob objects are uninterpreted sequences
>>     of bytes.  There is no encoding translation at the core
>>     level.
>>  
>> - - The commit log messages are uninterpreted sequences of non-NUL
>> -   bytes.
>> + - Pathnames are encoded in UTF-8 normalization form C. This
> 
> That is true only on some systems like OSX (with HFS+) and Windows,
> no?  BSDs in general and Linux do not do any such mangling IIRC.

Modern Unices don't need any such mangling because UTF-8 NFC should
be the default system encoding. I'm not sure about BSDs, but it has
been the default on all major Linux distros for more than 10 years.
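
To make the "NFC" part concrete, here is a small Python sketch using the
standard `unicodedata` module. The file name is a made-up example; the
point is only that the same visible name has two different UTF-8 byte
sequences depending on the normalization form, so a byte-wise comparison
treats them as different paths.

```python
import unicodedata

name = "\u00e4.txt"  # "ä.txt", hypothetical file name

nfc = unicodedata.normalize("NFC", name)  # precomposed: U+00E4
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'a' + U+0308

print(nfc.encode("utf-8"))  # b'\xc3\xa4.txt'
print(nfd.encode("utf-8"))  # b'a\xcc\x88.txt'
print(nfc == nfd)           # False
```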

> I
> am OK with mangling described as a notable oddball to warn users,
> though; i.e. not as a norm as your new text suggests but as an
> exception.
> 

I would guess that non-UTF-8 Unices (or file systems) are the oddball
case, which is why I described them last. But I could be wrong.

>> +   platforms. If file system APIs don't use UTF-8 (which may be
>> +   file system specific), it is recommended to stick to pure
>> +   ASCII file names.
> 
> Hmph, who endorsed such a recommendation?  It is recommended to
> stick to whatever naming scheme does not cause trouble for
> project participants.  If your participants all want to (and can)
> use ISO-8859-1, we do not discourage them from doing so.
> 

ISO-8859-x file names may be fine if you won't ever need to:
- use gitweb, JGit, gitk, git-gui...
- exchange repos with "normal" (UTF-8) Unices, Mac and Windows systems
- publish your work on a git hosting service (and expect file and
  ref names to show up correctly in the web interface)
- store the repo on Unicode-based file systems (JFS, Joliet, UDF,
  exFAT, NTFS, HFS, CIFS...)
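
A minimal Python sketch of why such exchanges go wrong, assuming a made-up
file name: the ISO-8859-1 bytes are not valid UTF-8, and the UTF-8 bytes
read as ISO-8859-1 produce garbage.

```python
name = "f\u00fcr.txt"  # "für.txt", hypothetical file name

latin1_bytes = name.encode("iso-8859-1")  # b'f\xfcr.txt'
utf8_bytes = name.encode("utf-8")         # b'f\xc3\xbcr.txt'

# A UTF-8 based tool reading the ISO-8859-1 bytes cannot decode them.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# An ISO-8859-1 based tool reading the UTF-8 bytes shows two wrong characters.
mojibake = utf8_bytes.decode("iso-8859-1")

print(decodes_as_utf8)  # False
print(mojibake)         # fÃ¼r.txt
```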

These restrictions are not that obvious when you start a new git
project, and while converting file names after the fact is possible
(e.g. using the recodetree script we shipped with Git for Windows
1.7.10), it will destroy history.

Thus I think we should strongly discourage users from using anything
but UTF-8.


* Re: [PATCH] Documentation/i18n.txt: clarify character encoding support
  2015-06-15 10:08   ` Karsten Blees
@ 2015-06-17 20:45     ` Junio C Hamano
  2015-07-01 19:10       ` [PATCH v2] " Karsten Blees
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2015-06-17 20:45 UTC
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

>> I do not think the removal of the text makes much sense here unless
>> you add the equivalent to the new text below.
>> 
>>>   - The contents of the blob objects are uninterpreted sequences
>>>     of bytes.  There is no encoding translation at the core
>>>     level.
>>>  
>>> - - The commit log messages are uninterpreted sequences of non-NUL
>>> -   bytes.
>>> + - Pathnames are encoded in UTF-8 normalization form C. This
>> 
>> That is true only on some systems like OSX (with HFS+) and Windows,
>> no?  BSDs in general and Linux do not do any such mangling IIRC.
>
> Modern Unices don't need any such mangling because UTF-8 NFC should
> be the default system encoding. I'm not sure about BSDs, but it has
> been the default on all major Linux distros for more than 10 years.

So?  All major distros do not have to worry (and do not even need to
know).  As I said,...

>> I
>> am OK with mangling described as a notable oddball to warn users,
>> though; i.e. not as a norm as your new text suggests but as an
>> exception.

... I am OK to describe "pathnames are mangled into UTF-8 NFC on
certain filesystems" as a warning.  I am OK if we encourage the use
of UTF-8, especially if a project wants to be forward looking
(i.e. it may currently be a monoculture but may become cross
platform in the future).  I just do not want to see us saying "you
*must* encode your path in UTF-8 NFC".

> ISO-8859-x file names may be fine if you won't ever need to:
> - use gitweb, JGit, gitk, git-gui...
> - exchange repos with "normal" (UTF-8) Unices, Mac and Windows systems
> - publish your work on a git hosting service (and expect file and
>   ref names to show up correctly in the web interface)
> - store the repo on Unicode-based file systems (JFS, Joliet, UDF,
>   exFAT, NTFS, HFS, CIFS...)

Yes, that is exactly what I said, isn't it?  "Use whatever works for
your project, we do not dictate."

> These restrictions are not that obvious when you start a new git
> project,...

Or any project for that matter, not limited to "git project", no?
Perhaps that is a moot point by now, as everything in the world
seems to be a "git project" these days.


* [PATCH v2] Documentation/i18n.txt: clarify character encoding support
  2015-06-17 20:45     ` Junio C Hamano
@ 2015-07-01 19:10       ` Karsten Blees
  2015-07-02  5:25         ` Torsten Bögershausen
  0 siblings, 1 reply; 6+ messages in thread
From: Karsten Blees @ 2015-07-01 19:10 UTC
  To: Junio C Hamano; +Cc: Git List

As a "distributed" VCS, git should better define the encodings of its core
textual data structures, in particular those that are part of the network
protocol.

That git is encoding agnostic is only really true for blob objects. E.g.
the 'non-NUL bytes' requirement of tree and commit objects excludes
UTF-16/32, and the special meaning of '/' in the index file as well as
space and linefeed in commit objects eliminates EBCDIC and other non-ASCII
encodings.

Git expects bytes < 0x80 to be pure ASCII, thus CJK encodings that partly
overlap with the ASCII range are problematic as well. E.g. fmt_ident()
removes trailing 0x5C from user names on the assumption that it is ASCII
'\'. However, there are over 200 GBK double byte codes that end in 0x5C.

UTF-8 as the default encoding on Linux and respective path translations in
the Mac and Windows versions have established UTF-8 NFC as the de facto
standard for path names.

Update the documentation in i18n.txt to reflect the current status quo.

Signed-off-by: Karsten Blees <blees@dcon.de>
---

Sorry for the delay, got swamped with other stuff...

On 17.06.2015 at 22:45, Junio C Hamano wrote:
> 
> ... I am OK to describe "pathnames are mangled into UTF-8 NFC on
> certain filesystems" as a warning.  I am OK if we encourage the use
> of UTF-8, especially if a project wants to be forward looking
> (i.e. it may currently be a monoculture but may become cross
> platform in the future).  I just do not want to see us saying "you
> *must* encode your path in UTF-8 NFC".
> 
...
> Yes, that is exactly what I said, isn't it?  "Use whatever works for
> your project, we do not dictate."


IMO we *have* to clearly specify an encoding. This freedom of choice
you're proclaiming just does not work in reality.

E.g. Git for Windows prior to 1.7.10 recorded file names in Windows
system encoding, which was perfectly legitimate according to the
documentation. Yet we had numerous bug reports regarding file name
encoding problems (you couldn't even share repos across different
Windows versions, let alone with Linux / Mac / JGit...).

You cannot simply tell users that this is because of Git's superior,
flexible design and it's their own fault...except of course if you
want them to switch to VCSes that *do* properly define their file
formats and network protocols - such as Subversion or Bazaar.
(sorry for the sarcasm, couldn't resist)

I think it's important to realize that specifying an encoding is
*not* a limitation - on the contrary: it *enables* us to do things
that would be impossible if file names were just "uninterpreted
sequences of non-NUL bytes". This includes features that are so
fundamental that we take them for granted, e.g. displaying file
names using *real* characters rather than just octal escapes.
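
The difference can be sketched in Python. `quote_octal` below is a
hypothetical helper that only approximates how uninterpreted path bytes
are shown as octal escapes (Git's real quoting handles more cases and is
governed by the `core.quotePath` setting); the file name is made up.

```python
def quote_octal(raw: bytes) -> str:
    """Hypothetical helper: show bytes >= 0x80 as octal escapes inside quotes."""
    out = []
    for b in raw:
        if b >= 0x80:
            out.append("\\%03o" % b)  # byte value is unknown text, so escape it
        else:
            out.append(chr(b))        # plain ASCII is shown as-is
    return '"' + "".join(out) + '"'


raw = "\u00e4.txt".encode("utf-8")  # "ä.txt", hypothetical file name

# Without a known encoding, only the raw byte values can be shown.
print(quote_octal(raw))  # "\303\244.txt"

# With an agreed encoding, the bytes can be decoded to real characters.
decoded = raw.decode("utf-8")
```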


I've rewritten the path name paragraph to better describe the
problems to expect with legacy encodings. I hope you like this
version better.

Of course, it would be nice to hear other opinions as well - this
probably shouldn't be a discussion between the two of us :-)

Karsten



 Documentation/i18n.txt | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index e9a1d5d..2dd79db 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,18 +1,31 @@
-At the core level, Git is character encoding agnostic.
-
- - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
-   What readdir(2) returns are what are recorded and compared
-   with the data Git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts.  There is no such
-   thing as pathname encoding translation.
+Git is to some extent character encoding agnostic.
 
  - The contents of the blob objects are uninterpreted sequences
    of bytes.  There is no encoding translation at the core
    level.
 
- - The commit log messages are uninterpreted sequences of non-NUL
-   bytes.
+ - Path names are encoded in UTF-8 normalization form C. This
+   applies to tree objects, the index file, ref names, as well as
+   path names in command line arguments, environment variables
+   and config files (`.git/config` (see linkgit:git-config[1]),
+   linkgit:gitignore[5], linkgit:gitattributes[5] and
+   linkgit:gitmodules[5]).
++
+Note that Git at the core level treats path names simply as
+sequences of non-NUL bytes, there are no path name encoding
+conversions (except on Mac and Windows). Therefore, using
+non-ASCII path names will mostly work even on platforms and file
+systems that use legacy extended ASCII encodings. However,
+repositories created on such systems will not work properly on
+UTF-8-based systems (e.g. Linux, Mac, Windows) and vice versa.
+Additionally, many Git-based tools simply assume path names to
+be UTF-8 and will fail to display other encodings correctly.
+
+ - Commit log messages are typically encoded in UTF-8, but other
+   extended ASCII encodings are also supported. This includes
+   ISO-8859-x, CP125x and many others, but _not_ UTF-16/32,
+   EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5,
+   EUC-x, CP9xx etc.).
 
 Although we encourage that the commit log messages are encoded
 in UTF-8, both the core and Git Porcelain are designed not to
-- 
2.4.3.windows.1.1.g87477f9


* Re: [PATCH v2] Documentation/i18n.txt: clarify character encoding support
  2015-07-01 19:10       ` [PATCH v2] " Karsten Blees
@ 2015-07-02  5:25         ` Torsten Bögershausen
  0 siblings, 0 replies; 6+ messages in thread
From: Torsten Bögershausen @ 2015-07-02  5:25 UTC
  To: Karsten Blees, Junio C Hamano; +Cc: Git List

On 07/01/2015 09:10 PM, Karsten Blees wrote:
>
> Of course, it would be nice to hear other opinions as well - this
> probably shouldn't be a discussion between the two of us :-)
>
> Karsten
>
I like this paragraph from your previous mail; I think it can go
into i18n.txt "as is":

ISO-8859-x file names may be fine if you won't ever need to:
- use gitweb, JGit, gitk, git-gui...
- exchange repos with "normal" (UTF-8) Unices, Mac and Windows systems
- publish your work on a git hosting service (and expect file and
   ref names to show up correctly in the web interface)
- store the repo on Unicode-based file systems (JFS, Joliet, UDF,
   exFAT, NTFS, HFS+, CIFS...)


Thread overview: 6+ messages
2015-06-13 20:24 [PATCH] Documentation/i18n.txt: clarify character encoding support Karsten Blees
2015-06-15  0:12 ` Junio C Hamano
2015-06-15 10:08   ` Karsten Blees
2015-06-17 20:45     ` Junio C Hamano
2015-07-01 19:10       ` [PATCH v2] " Karsten Blees
2015-07-02  5:25         ` Torsten Bögershausen
