* non-US-ASCII file names (e.g. Hiragana) on Windows
@ 2009-11-28 18:15 Thomas Singer
  2009-11-28 20:00 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Thomas Singer @ 2009-11-28 18:15 UTC (permalink / raw)
  To: git

I've created a file with Unicode characters in its name (using Java):

 new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();

The file name is stored correctly on disk, because when I invoke

 dir.list()

the name is listed correctly.

When I open this directory in Windows Explorer (German Windows XP SP3),
it shows 4 boxes, which is most likely caused by the font not supporting
these characters.

When I launch 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from
7zip-bundle), it shows only 4 question marks. I would have expected the
non-displayable characters to be escaped, as the umlauts were on OS X.

Even adding fails:

$ git add .
fatal: unable to stat '????': No such file or directory

What should I do to make Git recognize these characters?

-- 
Thanks in advance,
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-28 18:15 non-US-ASCII file names (e.g. Hiragana) on Windows Thomas Singer
@ 2009-11-28 20:00 ` Johannes Sixt
  2009-12-01  8:57   ` Thomas Singer
  2009-11-28 23:07 ` Maximilien Noal
  2009-11-28 23:37 ` Reece Dunn
  2 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2009-11-28 20:00 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

On Saturday, 28 November 2009, Thomas Singer wrote:
> I've created a file with unicode characters in its name (using Java):
>
>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
>...
> $ git add .
> fatal: unable to stat '????': No such file or directory
>
> What should I do to make Git recognize these characters?

You cannot on a German Windows.

You can switch your Windows to Japanese (not the UI, just the codepage 
aka "locale"; yes, that's possible, I have such a setup), but even then the 
characters of the file name will be recorded in Shift-JIS encoding, not UTF-8 
or Unicode. When you later switch back to German, these bytes will be 
interpreted as cp850 or cp1252 text and displayed accordingly.
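
The reinterpretation described above can be reproduced outside Windows. The
following is only a sketch: it uses Python's codecs as a stand-in for the
Windows codepages, and the helper name reinterpret is made up for this
illustration.

```python
# Hypothetical illustration; Python codecs stand in for the Windows codepages.
NAME = "\u3041\u3042\u3043\u3044"  # the four Hiragana from the original report


def reinterpret(name: str, written_as: str, read_as: str) -> str:
    """Encode `name` in one codepage, then decode the same bytes in another."""
    raw = name.encode(written_as)
    return raw.decode(read_as, errors="replace")


# Bytes that would be recorded while the Japanese codepage is active.
sjis_bytes = NAME.encode("shift_jis")
print(sjis_bytes)

# The same bytes read back under the two German codepages mentioned above.
print(reinterpret(NAME, "shift_jis", "cp850"))
print(reinterpret(NAME, "shift_jis", "cp1252"))
```

Each Hiragana becomes two bytes, and a single-byte codepage turns every one of
those bytes into its own unrelated character.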

-- Hannes


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-28 18:15 non-US-ASCII file names (e.g. Hiragana) on Windows Thomas Singer
  2009-11-28 20:00 ` Johannes Sixt
@ 2009-11-28 23:07 ` Maximilien Noal
  2009-11-29  9:18   ` Thomas Singer
  2009-11-28 23:37 ` Reece Dunn
  2 siblings, 1 reply; 27+ messages in thread
From: Maximilien Noal @ 2009-11-28 23:07 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> I've created a file with unicode characters in its name (using Java):
> 
>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
> 
> The file name is stored correctly on disk, because if invoking a
> 
>  dir.list()
> 
> the name is listed correctly.
> 
> When opening this directory in the Windows Explorer (German Windows XP SP3),
> it shows 4 boxes - which most likely is a problem of the font not supporting
> these characters.
> 
> When launching 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from
> 7zip-bundle) it only shows me 4 question marks. I would have expected to see
> the non-displayable characters escaped like it did with the umlauts on OS X.
> 
> Even adding fails:
> 
> $ git add .
> fatal: unable to stat '????': No such file or directory
> 
> What should I do to make Git recognize these characters?
> 
Hi

About the 'boxes' :

The thing is, Windows' files for Asian languages are _not_ installed by 
default.

They can be installed (even while installing Windows), by checking the 
two checkboxes under the "Supplemental language support" groupbox in the 
"Languages" tab of the "Regional and language options" control panel. 
*re-take some breath ;-) *

It will remove the "boxes" in Explorer and display nice Asian characters.

But that will only fix how Windows displays file names, surely not git 
(unless I'm mistaken).


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-28 18:15 non-US-ASCII file names (e.g. Hiragana) on Windows Thomas Singer
  2009-11-28 20:00 ` Johannes Sixt
  2009-11-28 23:07 ` Maximilien Noal
@ 2009-11-28 23:37 ` Reece Dunn
  2 siblings, 0 replies; 27+ messages in thread
From: Reece Dunn @ 2009-11-28 23:37 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

2009/11/28 Thomas Singer <thomas.singer@syntevo.com>:
>
> When launching 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from
> 7zip-bundle) it only shows me 4 question marks. I would have expected to see
> the non-displayable characters escaped like it did with the umlauts on OS X.
>
> Even adding fails:
>
> $ git add .
> fatal: unable to stat '????': No such file or directory
>
> What should I do to make Git recognize these characters?

This is a bug in git's character encoding/conversion logic. It looks
like git is taking the source string and converting it to ASCII to be
displayed on the console output (e.g. by using the WideCharToMultiByte
conversion API) -- these APIs will use a '?' character for characters
that they cannot map to the target character encoding (like the Hiragana
characters that you are using).
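
A rough sketch of that substitution, assuming Python's errors="replace"
behaves closely enough to the conversion API's default-character handling
for the purpose of illustration (to_codepage is a made-up helper name):

```python
# Hypothetical illustration of lossy conversion to an 8-bit codepage.
NAME = "\u3041\u3042\u3043\u3044"        # the file name from the report
OTHER = "\u3042\u3044\u3046\u3048"       # a different Hiragana name


def to_codepage(name: str, codepage: str) -> bytes:
    """Convert a Unicode name to bytes, substituting '?' for unmappable characters."""
    return name.encode(codepage, errors="replace")


print(to_codepage(NAME, "cp850"))    # b'????'
print(to_codepage(OTHER, "cp850"))   # b'????' as well: the two names collapse
```

If the program then asks the file system for a file literally named '????',
no such file exists, which would be consistent with the "unable to stat"
error in the original report.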

SetConsoleOutputCP can be used to change the console output codepage
[http://msdn.microsoft.com/en-us/library/ms686036%28VS.85%29.aspx] and
SetConsoleCP is the equivalent for input
[http://msdn.microsoft.com/en-us/library/ms686013%28VS.85%29.aspx].
e.g.

    SetConsoleCP(CP_UTF8);
    SetConsoleOutputCP(CP_UTF8);

should make the console process UTF-8 characters, so git shouldn't
need to do any character conversions on Windows when reading/writing
its data.

NOTE: I have not tested this, just noting what I have found via Google.

- Reece


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-28 23:07 ` Maximilien Noal
@ 2009-11-29  9:18   ` Thomas Singer
  2009-12-01  7:49     ` Thomas Singer
  2009-12-01  9:12     ` Erik Faye-Lund
  0 siblings, 2 replies; 27+ messages in thread
From: Thomas Singer @ 2009-11-29  9:18 UTC (permalink / raw)
  To: Maximilien Noal; +Cc: git

Maximilien Noal wrote:
> About the 'boxes' :
> 
> The thing is, Windows' files for Asian languages are _not_ installed by
> default.
> 
> They can be installed (even while installing Windows), by checking the
> two checkboxes under the "Supplemental language support" groupbox in the
> "Languages" tab of the "Regional and language options" control panel.
> *re-take some breath ;-) *
> 
> It will remove the "boxes" in Explorer and display nice Asian characters.

Thanks, now the characters are showing up fine in the Explorer.

Reece Dunn wrote:
> This is a bug in git's character encoding/conversion logic. It looks
> like git is taking the source string and converting it to ascii to be
> displayed on the console output (e.g. by using the WideCharToMultiByte
> conversion API) -- these APIs will use a '?' character for characters
> that it cannot map to the target character encoding (like the Hiragana
> characters that you are using).

I have a screenshot from a SmartGit user where 1) the console can show the
Far East characters and 2) Git *can* show the characters escaped. Are there
two versions of Git available or does Git's behaviour depend somehow on the
system locale?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-29  9:18   ` Thomas Singer
@ 2009-12-01  7:49     ` Thomas Singer
  2009-12-01  8:27       ` Johannes Sixt
  2009-12-01  9:12     ` Erik Faye-Lund
  1 sibling, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01  7:49 UTC (permalink / raw)
  To: git

Thomas Singer wrote:
> Reece Dunn wrote:
>> This is a bug in git's character encoding/conversion logic. It looks
>> like git is taking the source string and converting it to ascii to be
>> displayed on the console output (e.g. by using the WideCharToMultiByte
>> conversion API) -- these APIs will use a '?' character for characters
>> that it cannot map to the target character encoding (like the Hiragana
>> characters that you are using).
> 
> I have a screenshot from a SmartGit user where 1) the console can show the
> far-east-characters and 2) Git *can* show the characters escaped. Are there
> two versions of Git available or does Git's behaviour depend somehow on the
> system locale?

Does no Git expert know what to do to get it working?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  7:49     ` Thomas Singer
@ 2009-12-01  8:27       ` Johannes Sixt
  2009-12-01  8:55         ` Thomas Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2009-12-01  8:27 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> Thomas Singer wrote:
>> Reece Dunn wrote:
>>> This is a bug in git's character encoding/conversion logic. It looks
>>> like git is taking the source string and converting it to ascii to be
>>> displayed on the console output (e.g. by using the WideCharToMultiByte
>>> conversion API) -- these APIs will use a '?' character for characters
>>> that it cannot map to the target character encoding (like the Hiragana
>>> characters that you are using).
>> I have a screenshot from a SmartGit user where 1) the console can show the
>> far-east-characters and 2) Git *can* show the characters escaped. Are there
>> two versions of Git available or does Git's behaviour depend somehow on the
>> system locale?
> 
> Does no Git expert know what to do to get it working?

http://article.gmane.org/gmane.comp.version-control.git/133980 [*]

A possible reason why someone else sees correct glyphs with SmartGit is
that it is a Unicode application, the Windows box has suitable fonts
installed, and the console is configured with a suitable font as well.

-- Hannes

[*] My email infrastructure was botched when I sent this message, and the
copy intended for you went to the waste bin, but I thought I had re-sent
it to you in a private mail.


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  8:27       ` Johannes Sixt
@ 2009-12-01  8:55         ` Thomas Singer
  2009-12-01 10:00           ` Johannes Sixt
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01  8:55 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> Thomas Singer wrote:
>>> Reece Dunn wrote:
>>>> This is a bug in git's character encoding/conversion logic. It looks
>>>> like git is taking the source string and converting it to ascii to be
>>>> displayed on the console output (e.g. by using the WideCharToMultiByte
>>>> conversion API) -- these APIs will use a '?' character for characters
>>>> that it cannot map to the target character encoding (like the Hiragana
>>>> characters that you are using).
>>> I have a screenshot from a SmartGit user where 1) the console can show the
>>> far-east-characters and 2) Git *can* show the characters escaped. Are there
>>> two versions of Git available or does Git's behaviour depend somehow on the
>>> system locale?
>> Does no Git expert know what to do to get it working?
> 
> http://article.gmane.org/gmane.comp.version-control.git/133980 [*]
> 
> The possible reason why some one else is seeing correct glyphs with
> SmartGit is because it is a Unicode application and the Windows box has
> suitable fonts installed and the console is configured with a suitable
> font as well.

I wasn't talking about SmartGit, but msysgit on the Windows console. Sorry
if that wasn't clear.

Is it a limitation of German Windows that Far East characters are not
supported on it (but work fine on a Japanese Windows), are there different
(msys)Git versions available, or is this a configuration issue?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-28 20:00 ` Johannes Sixt
@ 2009-12-01  8:57   ` Thomas Singer
  2009-12-01  9:04     ` Thomas Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01  8:57 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> On Saturday, 28 November 2009, Thomas Singer wrote:
>> I've created a file with unicode characters in its name (using Java):
>>
>>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
>> ...
>> $ git add .
>> fatal: unable to stat '????': No such file or directory
>>
>> What should I do to make Git recognize these characters?
> 
> You cannot on a German Windows.
> 
> You can switch your Windows to Japanese (not the UI, just the codepage 
> aka "locale"; yes, that's possible, I have such a setup), but even then the 
> characters of the file name will be recorded in Shift-JIS encoding, not UTF-8 
> or Unicode. When you later switch back to German, these bytes will be 
> interpreted as cp850 or cp1252 text and displayed accordingly.

Who is interpreting the file names? Windows or Git or Java?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  8:57   ` Thomas Singer
@ 2009-12-01  9:04     ` Thomas Singer
  2009-12-01 10:08       ` Johannes Sixt
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01  9:04 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Thomas Singer wrote:
> Johannes Sixt wrote:
>> You can switch your Windows to Japanese (not the UI, just the codepage 
>> aka "locale"; yes, that's possible, I have such a setup), but even then the 
>> characters of the file name will be recorded in Shift-JIS encoding, not UTF-8 
>> or Unicode. When you later switch back to German, these bytes will be 
>> interpreted as cp850 or cp1252 text and displayed accordingly.
> 
> Who is interpreting the file names? Windows or Git or Java?

To be more precise: Who is interpreting the bytes in the file names as
characters? Windows, Git or Java?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-11-29  9:18   ` Thomas Singer
  2009-12-01  7:49     ` Thomas Singer
@ 2009-12-01  9:12     ` Erik Faye-Lund
  2009-12-01 12:11       ` Thomas Singer
  1 sibling, 1 reply; 27+ messages in thread
From: Erik Faye-Lund @ 2009-12-01  9:12 UTC (permalink / raw)
  To: Thomas Singer; +Cc: Maximilien Noal, git

On Sun, Nov 29, 2009 at 10:18 AM, Thomas Singer
<thomas.singer@syntevo.com> wrote:
> Maximilien Noal wrote:
>> About the 'boxes' :
>>
>> The thing is, Windows' files for Asian languages are _not_ installed by
>> default.
>>
>> They can be installed (even while installing Windows), by checking the
>> two checkboxes under the "Supplemental language support" groupbox in the
>> "Languages" tab of the "Regional and language options" control panel.
>> *re-take some breath ;-) *
>>
>> It will remove the "boxes" in Explorer and display nice Asian characters.
>
> Thanks, now the characters are showing up fine in the Explorer.
>
> Reece Dunn wrote:
>> This is a bug in git's character encoding/conversion logic. It looks
>> like git is taking the source string and converting it to ascii to be
>> displayed on the console output (e.g. by using the WideCharToMultiByte
>> conversion API) -- these APIs will use a '?' character for characters
>> that it cannot map to the target character encoding (like the Hiragana
>> characters that you are using).
>
> I have a screenshot from a SmartGit user where 1) the console can show the
> far-east-characters and 2) Git *can* show the characters escaped. Are there
> two versions of Git available or does Git's behaviour depend somehow on the
> system locale?

Did you try to make sure your console window used a Unicode font on
your German Windows installation? Asian Windows installations might do
this by default, which at least neither English nor Norwegian
Windows installations seem to do...

You can change the console window font through the Properties menu
that appears when you right-click the title bar.

-- 
Erik "kusma" Faye-Lund


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  8:55         ` Thomas Singer
@ 2009-12-01 10:00           ` Johannes Sixt
  2009-12-01 12:08             ` Thomas Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2009-12-01 10:00 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> Is it a German Windows limitation, that far-east characters are not
> supported on it (but work fine on a Japanese Windows), are there different
> (msys)Git versions available or is this a configuration issue?

It is a matter of configuration.

Since 8 bits are not sufficient to support the Japanese alphabet in addition
to the German alphabet, programs that are not Unicode aware -- such as git
-- have to make a decision which alphabet they support. The decision is
made by picking a "codepage".

On German Windows, you are in codepage 850 (in the console). The filenames
(which actually are in Unicode) are converted to bytes according to
codepage 850 *before* git sees them. If your filenames contain Hiragana,
each character is replaced by the "unknown character" marker because there
is no place for it in codepage 850.

However, you can install Japanese language support on German Windows. Then
you can change your console to codepage 932:

  chcp 932

When you run git from *this* console, Hiragana in the filenames are
converted to cp932 before git sees them. The resulting byte sequence is
different from the one in cp850, but git will be able to see that the file
exists and was modified, and you can 'git add' it.

But if you have files with umlauts, they will not be recognized anymore
because umlauts have no place in cp932.
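
The trade-off between the two codepages can be checked with a small sketch.
It assumes Python's cp850 and cp932 codecs match the Windows codepages
closely enough here; Windows itself may apply "best fit" substitutions that
this strict check does not model, and representable is a made-up helper.

```python
# Hypothetical illustration: which sample names fit into which codepage.
HIRAGANA = "\u3041\u3042\u3043\u3044"
UMLAUT = "gr\u00fcn.txt"


def representable(name: str, codepage: str) -> bool:
    """Return True if every character of `name` has a place in `codepage`."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


for codepage in ("cp850", "cp932", "utf-8"):
    print(codepage, representable(UMLAUT, codepage), representable(HIRAGANA, codepage))
```

Only the UTF-8 row accepts both names, which is why a single 8-bit codepage
cannot serve a repository that mixes the two.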

In neither case can you exchange the repository with Linux if you have
your locale set to UTF-8 on Linux, because neither byte sequence (umlauts
from cp850 or Hiragana from cp932) is a valid UTF-8 sequence, let alone
one that results in the expected glyphs.
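
As a quick check of the UTF-8 claim for two sample names (a sketch only;
is_valid_utf8 is a made-up helper, and pure-ASCII names would of course pass
in any of these encodings):

```python
# Hypothetical illustration: codepage bytes tested against a strict UTF-8 decoder.
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


umlaut_cp850 = "\u00fc".encode("cp850")      # a single byte, 0x81
hiragana_cp932 = "\u3041".encode("cp932")    # two bytes, 0x82 0x9f

print(is_valid_utf8(umlaut_cp850))
print(is_valid_utf8(hiragana_cp932))
print(is_valid_utf8("\u00fc\u3041".encode("utf-8")))
```

Both codepage byte sequences start with a byte that UTF-8 only allows as a
continuation byte, so a UTF-8 locale rejects them outright.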

Corollary: Stick to ASCII file names.

There have been suggestions to switch the console to codepage 65001
(UTF-8), but I have never heard any success reports. I'm not saying it does
not work, though.

-- Hannes


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  9:04     ` Thomas Singer
@ 2009-12-01 10:08       ` Johannes Sixt
  2009-12-01 16:26         ` Shawn O. Pearce
  0 siblings, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2009-12-01 10:08 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> To be more precise: Who is interpreting the bytes in the file names as
> characters? Windows, Git or Java?

In the case of git: Windows does it, using the console's codepage to
convert between bytes and Unicode.

I don't know about Java, but I guess that no conversion is necessary
because Java is Unicode-aware.

-- Hannes


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 10:00           ` Johannes Sixt
@ 2009-12-01 12:08             ` Thomas Singer
  2009-12-01 13:17               ` Johannes Sixt
  2009-12-01 17:24               ` Jakub Narebski
  0 siblings, 2 replies; 27+ messages in thread
From: Thomas Singer @ 2009-12-01 12:08 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> Is it a German Windows limitation, that far-east characters are not
>> supported on it (but work fine on a Japanese Windows), are there different
> (msys)Git versions available or is this a configuration issue?
> 
> It is a matter of configuration.
> 
> Since 8 bits are not sufficient to support Japanese alphabet in addition
> to the German alphabet, programs that are not Unicode aware -- such as git
> -- have to make a decision which alphabet they support. The decision is
> made by picking a "codepage".
> 
> On German Windows, you are in codepage 850 (in the console). The filenames
>  (that actually are in Unicode) are converted to bytes according to
> codepage 850 *before* git sees them. If your filenames contain Hiragana,
> they are substituted by the "unknown character" marker because there is no
> place for them in codepage 850.
> 
> However, you can install Japanese language support on German Windows. Then
> you can change your console to codepage 932:
> 
>   chcp 932
> 
> When you run git from *this* console, Hiragana in the filenames are
> converted to cp932 before git sees them. The resulting byte sequence is
> different from the one in cp850, but git will be able to see that the file
> exists and was modified, and you can 'git add' it.
> 
> But if you have files with umlauts, they will not be recognized anymore
> because umlauts have no place in cp932.
> 
> In neither case can you exchange the repository with Linux if you have
> your locale set to UTF-8 on Linux, because neither byte sequence (umlauts
> from cp850 or Hiragana from cp932) are valid UTF-8 sequences, let alone
> result in the expected glyphs.
> 
> Corollary: Stick to ASCII file names.
> 
> There have been suggestions to switch the console to codepage 65001
> (UTF-8), but I have never heard of success reports. I'm not saying it does
> not work, though.

Thanks for the detailed explanation. I know the differences between bytes
and characters and the *encoding* needed to convert from one to the other, but
I did not know how Git handles it. I'm quite surprised that -- as I
understand you -- msys-Git (or Git at all?) is not able to handle all
characters (aka Unicode) at the same time. I expected it would be better
than older tools, e.g. SVN.

BTW, we are invoking the Git executable from Java. Is there automatically a
console "around" Git? Should we invoke a shell-script (which sets the
console's code page) instead of the Git executable directly?

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01  9:12     ` Erik Faye-Lund
@ 2009-12-01 12:11       ` Thomas Singer
  0 siblings, 0 replies; 27+ messages in thread
From: Thomas Singer @ 2009-12-01 12:11 UTC (permalink / raw)
  To: kusmabite; +Cc: Maximilien Noal, git

Erik Faye-Lund wrote:
> Did you try to make sure your console window used a Unicode font on
> your German Windows installation? Asian Windows installations might do
> this by default, something at least neither English nor Norwegian
> Windows installations seems to do...
> 
> You can change the console window font through the properties-menu
> that appears when you right click the title-bar.

I've tried changing the console font (there is just one alternative), but
it made no difference.

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 12:08             ` Thomas Singer
@ 2009-12-01 13:17               ` Johannes Sixt
  2009-12-01 15:41                 ` Thomas Singer
  2009-12-01 17:24               ` Jakub Narebski
  1 sibling, 1 reply; 27+ messages in thread
From: Johannes Sixt @ 2009-12-01 13:17 UTC (permalink / raw)
  To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> I'm quite surprised, that -- as I
> understand you -- msys-Git (or Git at all?) is not able to handle all
> characters (aka unicode) at the same time. I expected it would be better
> than older tools, e.g. SVN.

This has been discussed at length here and in the msysgit mailing list.
Git expects that the file system returns file names with the same byte
sequence that git used to create a file. On Windows, this works only as
long as you do not switch the codepage.
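
A toy model of that expectation, purely as an assumption-laden sketch: the
class FakeUnicodeFs is invented for illustration and only mimics a file
system that stores Unicode names behind an 8-bit, codepage-dependent API.

```python
# Hypothetical toy model; not how Windows or git are actually implemented.
class FakeUnicodeFs:
    """Stores names as Unicode; the 8-bit API converts through a codepage."""

    def __init__(self):
        self.names = set()

    def create(self, raw: bytes, codepage: str) -> None:
        # The 8-bit "create" call decodes the caller's bytes into Unicode.
        self.names.add(raw.decode(codepage))

    def listdir(self, codepage: str):
        # The 8-bit "list" call encodes names back, '?' for unmappable characters.
        return sorted(n.encode(codepage, errors="replace") for n in self.names)


fs = FakeUnicodeFs()
created = "\u3041\u3042".encode("cp932")   # bytes the program used to create the file
fs.create(created, "cp932")

same_cp = fs.listdir("cp932")    # codepage unchanged: the bytes round-trip
other_cp = fs.listdir("cp850")   # codepage switched: the bytes no longer match
print(same_cp == [created], other_cp)
```

A program that compares the listed bytes with the bytes it recorded earlier
finds its file only in the first case.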

> BTW, we are invoking the Git executable from Java. Is there automatically a
> console "around" Git?

I don't think so. In this case, the codepage that Java has set up will
apply. I guess that Java doesn't mess with the codepage at all, and then
on German Windows git would operate in cp1252.
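
To illustrate why it matters whether git runs under cp850 (console) or
cp1252 (as guessed above for a Java-launched process), here is a sketch that
uses Python codecs as stand-ins for the two codepages:

```python
# Hypothetical illustration: one name, two German codepages, different bytes.
NAME = "gr\u00fcn.txt"

ansi = NAME.encode("cp1252")   # bytes under the Windows "ANSI" codepage
oem = NAME.encode("cp850")     # bytes under the console "OEM" codepage

print(ansi, oem)
print(ansi.decode("cp850"))    # the cp1252 bytes as a cp850 console would render them
```

A name recorded while one codepage was active would therefore not be matched
byte-for-byte by a process running under the other.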

> Should we invoke a shell-script (which sets the
> console's code page) instead of the Git executable directly?

I don't think that is necessary.

-- Hannes


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 13:17               ` Johannes Sixt
@ 2009-12-01 15:41                 ` Thomas Singer
  2009-12-01 15:50                   ` Erik Faye-Lund
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01 15:41 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> I'm quite surprised, that -- as I
>> understand you -- msys-Git (or Git at all?) is not able to handle all
>> characters (aka unicode) at the same time. I expected it would be better
>> than older tools, e.g. SVN.
> 
> This has been discussed at length here and in the msysgit mailing list.
> Git expects that the file system returns file names with the same byte
> sequence that git used to create a file. On Windows, this works only as
> long as you do not switch the codepage.

Now you've confused me: is this a problem with Windows, is Git using a less
capable Windows API call, or is there no Unicode-capable API call to list
file names on Windows? I wonder how Java does it internally; after all, it
is (also) built on a C base, I guess.

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 15:41                 ` Thomas Singer
@ 2009-12-01 15:50                   ` Erik Faye-Lund
  2009-12-01 16:33                     ` Thomas Singer
  0 siblings, 1 reply; 27+ messages in thread
From: Erik Faye-Lund @ 2009-12-01 15:50 UTC (permalink / raw)
  To: Thomas Singer; +Cc: Johannes Sixt, git

On Tue, Dec 1, 2009 at 4:41 PM, Thomas Singer <thomas.singer@syntevo.com> wrote:
> Johannes Sixt wrote:
>> Thomas Singer wrote:
>>> I'm quite surprised, that -- as I
>>> understand you -- msys-Git (or Git at all?) is not able to handle all
>>> characters (aka unicode) at the same time. I expected it would be better
>>> than older tools, e.g. SVN.
>>
>> This has been discussed at length here and in the msysgit mailing list.
>> Git expects that the file system returns file names with the same byte
>> sequence that git used to create a file. On Windows, this works only as
>> long as you do not switch the codepage.
>
> Now you confuse me: is this a problem of Windows, Git using a less capable
> Windows-API call or is there no unicode-capable API call to list file names
> on Windows? I ask myself how Java does it in its internals, finally it
> (also) consists of a C-base, I guess.
>

Git uses the 8-bit file APIs, and Windows doesn't support setting
UTF-8 as the locale. Some work has been done in msysGit to use
_wopen() and friends instead, but AFAIK it's not completed. See the
branch called "work/utf-filenames" in
git://repo.or.cz/git/mingw/4msysgit.git if you are interested in
helping to complete it.

-- 
Erik "kusma" Faye-Lund


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 10:08       ` Johannes Sixt
@ 2009-12-01 16:26         ` Shawn O. Pearce
  2009-12-01 22:11           ` Robin Rosenberg
  0 siblings, 1 reply; 27+ messages in thread
From: Shawn O. Pearce @ 2009-12-01 16:26 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Thomas Singer, git

Johannes Sixt <j.sixt@viscovery.net> wrote:
> Thomas Singer wrote:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
> 
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
> 
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and it's something that is proving
to be really painful within JGit.

The Java IO APIs use UTF-16 for file names.  However we are reading
a stream of unknown bytes from the index file and tree objects.
Thus JGit must convert a stream of bytes into UTF-16 just to get
to the OS.

The JVM then turns around and converts from UTF-16 to some other
encoding for the filesystem.

On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
this translation is lossless.

On POSIX, I suspect the JVM uses $LANG or some other related
environment variable to guess the user's preferred encoding, and
then converts from UTF-16 to bytes in that encoding.  And I have
no idea how they handle normalization of composed code points.

All of these layers make for a *very* confusing situation for us
within JGit:

  git tree
  +---------+
  | bytes   | -+
  +---------+   \
                 \             +--------+            +---------+
                  +-- JGit --> | UTF-16 | -- JVM --> | OS call |
  .git/index     /             +--------+            +---------+
  +---------+   /
  | bytes   | -+
  +---------+

It's impossible for us to do what C git does, which is just use the
bytes used by the OS call within the git data structure, which of
course also isn't always portable, e.g. the Mac OS X HFS+ mess.
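
The pain of the bytes -> text -> bytes detour can be sketched as follows.
This is not JGit code: Python's str stands in for the UTF-16 layer, and
through_text_layer is a made-up helper that has to guess an encoding.

```python
# Hypothetical illustration of a lossy round trip through a text layer.
def through_text_layer(raw: bytes, assumed: str) -> bytes:
    """Decode path bytes with a guessed encoding, then encode them again."""
    text = raw.decode(assumed, errors="replace")
    return text.encode(assumed, errors="replace")


utf8_name = "gr\u00fcn.txt".encode("utf-8")       # stored by a UTF-8 client
latin1_name = "gr\u00fcn.txt".encode("latin-1")   # stored by a Latin-1 client

print(through_text_layer(utf8_name, "utf-8") == utf8_name)      # guess was right
print(through_text_layer(latin1_name, "utf-8") == latin1_name)  # guess was wrong
```

When the guess is wrong, the bytes that come out differ from the bytes in the
tree or index, so the name no longer refers to the same entry.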

:-)

-- 
Shawn.


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 15:50                   ` Erik Faye-Lund
@ 2009-12-01 16:33                     ` Thomas Singer
  2010-10-30  4:02                       ` brad12
  0 siblings, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01 16:33 UTC (permalink / raw)
  To: kusmabite; +Cc: Johannes Sixt, git

Erik Faye-Lund wrote:
> Git uses the 8-bit file APIs, and Windows doesn't support setting
> UTF-8 as the locale. Some work has been done in msysGit to use
> _wopen() and friends instead, but AFAIK it's not completed. See the
> branch called "work/utf-filenames" in
> git://repo.or.cz/git/mingw/4msysgit.git if you are interested in
> helping to complete it.

Thanks, now I understand.

-- 
Tom


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 12:08             ` Thomas Singer
  2009-12-01 13:17               ` Johannes Sixt
@ 2009-12-01 17:24               ` Jakub Narebski
  2009-12-01 18:55                 ` Thomas Singer
  2010-10-30  9:52                 ` demerphq
  1 sibling, 2 replies; 27+ messages in thread
From: Jakub Narebski @ 2009-12-01 17:24 UTC (permalink / raw)
  To: Thomas Singer; +Cc: Johannes Sixt, git

Thomas Singer <thomas.singer@syntevo.com> writes:

> Johannes Sixt wrote:
>> Thomas Singer wrote:
>>>
>>> Is it a German Windows limitation, that far-east characters are not
>>> supported on it (but work fine on a Japanese Windows), are there different
>>> (msys)Git versions available or is this a configuration issue?
>> 
>> It is a matter of configuration.
>> 
>> Since 8 bits are not sufficient to support Japanese alphabet in addition
>> to the German alphabet, programs that are not Unicode aware -- such as git
>> -- have to make a decision which alphabet they support. The decision is
>> made by picking a "codepage".
>> 
>> On German Windows, you are in codepage 850 (in the console). The filenames
>>  (that actually are in Unicode) are converted to bytes according to
>> codepage 850 *before* git sees them. If your filenames contain Hiragana,
>> they are substituted by the "unknown character" marker because there is no
>> place for them in codepage 850.
[...]

>> Corollary: Stick to ASCII file names.
>> 
>> There have been suggestions to switch the console to codepage 65001
>> (UTF-8), but I have never heard of success reports. I'm not saying it does
>> not work, though.
> 
> Thanks for the detailed explanation. I know the differences between bytes
> and characters and the needed *encoding* to convert from one to another, but
> I did not know how Git handles it. I'm quite surprised, that -- as I
> understand you -- msys-Git (or Git at all?) is not able to handle all
> characters (aka unicode) at the same time. I expected it would be better
> than older tools, e.g. SVN.

The problem is not with Git, as Git is (currently) agnostic with
respect to filename encoding; for Git filenames are opaque NUL ('\0')
terminated binary data.  There is some infrastructure to convert
between filename encodings and other filename quirks (like
case-insensitivity), though...

The problem is with MS Windows *console*, from which you invoke git
commands, and which does translation from filename encoding used by
the filesystem to encoding / codepage used by console.
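
To make the lossy conversion concrete, here is a minimal Java sketch of
what happens when a Unicode file name is squeezed into a Western codepage.
The class and method names (CodepageDemo, toCodepage, westernCodepage) are
made up for illustration, and the sketch only mimics the effect; it does
not call the Windows console or the 8-bit file APIs. It assumes the JDK
provides the IBM850 charset and falls back to ISO-8859-1 (which also has
no Hiragana) if it does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodepageDemo {
    // Encode a name into a single-byte codepage. String.getBytes(Charset)
    // substitutes the charset's replacement byte for unmappable characters.
    static byte[] toCodepage(String name, Charset codepage) {
        return name.getBytes(codepage);
    }

    // Assumption: IBM850 stands in for the German console codepage.
    // It is not a guaranteed charset, so fall back to ISO-8859-1.
    static Charset westernCodepage() {
        return Charset.isSupported("IBM850")
                ? Charset.forName("IBM850")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        String hiragana = "\u3041\u3042\u3043\u3044";
        byte[] bytes = toCodepage(hiragana, westernCodepage());
        // Every character was unmappable, so only '?' bytes remain.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints ????
    }
}
```

Once the name has been reduced to '?' bytes, the original characters
cannot be recovered, which is consistent with the "unable to stat '????'"
failure reported at the start of the thread.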

> BTW, we are invoking the Git executable from Java. Is there automatically a
> console "around" Git? Should we invoke a shell-script (which sets the
> console's code page) instead of the Git executable directly?

If you use Git from Java, why don't you just use JGit (www.jgit.org),
which is a Git implementation in Java?

-- 
Jakub Narebski
Poland

* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 17:24               ` Jakub Narebski
@ 2009-12-01 18:55                 ` Thomas Singer
  2009-12-02 16:22                   ` Shawn Pearce
  2010-10-30  9:52                 ` demerphq
  1 sibling, 1 reply; 27+ messages in thread
From: Thomas Singer @ 2009-12-01 18:55 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Johannes Sixt, git

Jakub Narebski wrote:
> If you use Git from Java, why don't you just use JGit (www.jgit.org),
> which is a Git implementation in Java?

We are using JGit for the read-only stuff and the Git command line
executable for all writing commands. We very much appreciate Shawn O.
Pearce's (and the other JGit developers') effort, but Git is a fast-moving
target and (much) more complex than CVS or SVN. For those we use Java
libraries that communicate with the corresponding server, which adds another
sanity layer and makes repository corruption less likely than with
direct access.

-- 
Best regards,
Thomas Singer
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com

* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 16:26         ` Shawn O. Pearce
@ 2009-12-01 22:11           ` Robin Rosenberg
  0 siblings, 0 replies; 27+ messages in thread
From: Robin Rosenberg @ 2009-12-01 22:11 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Johannes Sixt, Thomas Singer, git

On Tuesday 01 December 2009 17:26:27, you wrote:
> Johannes Sixt <j.sixt@viscovery.net> wrote:
> > Thomas Singer wrote:
> > > To be more precise: Who is interpreting the bytes in the file names as
> > > characters? Windows, Git or Java?
> >
> > In the case of git: Windows does it, using the console's codepage to
> > convert between bytes and Unicode.
> >
> > I don't know about Java, but I guess that no conversion is necessary
> > because Java is Unicode-aware.
>
> Actually, conversion is necessary, and it's something that is proving
> to be really painful within JGit.
>
> The Java IO APIs use UTF-16 for file names.  However we are reading
> a stream of unknown bytes from the index file and tree objects.
> Thus JGit must convert a stream of bytes into UTF-16 just to get
> to the OS.
>
> The JVM then turns around and converts from UTF-16 to some other
> encoding for the filesystem.
>
> On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
> this translation is lossless.
>
> On POSIX, I suspect the JVM uses $LANG or some other related
> environment variable to guess the user's preferred encoding, and
> then converts from UTF-16 to bytes in that encoding.  And I have
> no idea how they handle normalization of composed code points.
>
> All of these layers make for a *very* confusing situation for us
> within JGit:
>
>   git tree
>   +---------+
>   | bytes   | -+
>   +---------+   \
>                  \             +--------+            +---------+
>                   +-- JGit --> | UTF-16 | -- JVM --> | OS call |
>   .git/index     /             +--------+            +---------+
>   +---------+   /
>   | bytes   | -+
>   +---------+
>
> It's impossible for us to do what C git does, which is just use the
> bytes used by the OS call within the git datastructure.  Which of
> course also isn't always portable, e.g. the Mac OS X HFS+ mess.

We can decode the index any way we like, but not file names coming from
the file system. On Windows, any sane name (it does allow invalid UTF-16 too,
but...) will be readable by JGit, but on a UTF-8 POSIX system that may not be
so if the filename is actually Latin-1 encoded. In that case the Java runtime
will return a decoded filename containing an "invalid" code point, and any
attempt to access the file from Java will fail. I can see some horribly
expensive ways to work around that, but...
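
A small Java sketch of why such a name becomes unreachable: once Latin-1
bytes have been decoded as UTF-8, encoding the resulting string again does
not give back the original bytes. The names RoundTripDemo, roundTripUtf8
and survives are invented for this illustration, and the sketch only
models the decode/encode step with plain charset conversions; what a real
JVM does with file names is platform-specific.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    // Decode raw name bytes as UTF-8, then encode the string back to UTF-8.
    // Malformed input is replaced by U+FFFD during decoding.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True if the bytes come back unchanged, i.e. the name stays addressable.
    static boolean survives(byte[] raw) {
        return Arrays.equals(raw, roundTripUtf8(raw));
    }

    public static void main(String[] args) {
        byte[] latin1 = "b\u00e4r".getBytes(StandardCharsets.ISO_8859_1); // 62 E4 72
        byte[] utf8 = "b\u00e4r".getBytes(StandardCharsets.UTF_8);        // 62 C3 A4 72
        System.out.println(survives(latin1)); // prints false
        System.out.println(survives(utf8));   // prints true
    }
}
```

In the Latin-1 case the byte E4 is not valid UTF-8 on its own, so it turns
into U+FFFD and the re-encoded name (62 EF BF BD 72) no longer matches
what is on disk.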

As for the more sane cases, I have a compare routine that works on mixed
encodings and may help to solve some of the problems. Ideally it would not
only compare filenames with unknown encodings but also handle case
folding and composing characters in one go. I guess one could make it
fall back to an encoding other than Latin-1, though with less certainty, but
it will not (for sure) work with any arbitrary set of encodings. You'll have
to choose, so it's only a legacy workaround, as opposed to a solution.
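
For illustration only, the fallback idea could be sketched in Java roughly
as follows. This is not the actual compare routine mentioned above; the
names NameDecoder, decode and sameName are hypothetical. The sketch tries
strict UTF-8 first and falls back to Latin-1, and it does no case folding
or normalization of composed characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class NameDecoder {
    // Decode as strict UTF-8; if the bytes are not valid UTF-8,
    // assume Latin-1 (the legacy fallback chosen for this sketch).
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Latin-1 maps every byte to a character, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    // Compare two names that may be stored in different encodings.
    static boolean sameName(byte[] a, byte[] b) {
        return decode(a).equals(decode(b));
    }

    public static void main(String[] args) {
        byte[] latin1 = "b\u00e4r".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "b\u00e4r".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameName(latin1, utf8)); // prints true
    }
}
```

The weakness is the one already noted: Latin-1 bytes that happen to form
valid UTF-8 are silently read as UTF-8, so the guess is a heuristic and
the fallback encoding has to be chosen up front.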

-- robin

* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 18:55                 ` Thomas Singer
@ 2009-12-02 16:22                   ` Shawn Pearce
  0 siblings, 0 replies; 27+ messages in thread
From: Shawn Pearce @ 2009-12-02 16:22 UTC (permalink / raw)
  To: Thomas Singer; +Cc: Jakub Narebski, Johannes Sixt, git

On Tue, Dec 1, 2009 at 10:55 AM, Thomas Singer
<thomas.singer@syntevo.com> wrote:
>
> Jakub Narebski wrote:
> > If you use Git from Java, why don't you just use JGit (www.jgit.org),
> > which is a Git implementation in Java?
>
> We are using JGit for the read-only stuff and the Git command line
> executable for all writing commands. We very much appreciate Shawn O.
> Pearce's (and the other JGit developers') effort, but Git is a fast-moving
> target and (much) more complex than CVS or SVN. For those we use Java
> libraries that communicate with the corresponding server, which adds another
> sanity layer and makes repository corruption less likely than with
> direct access.

Uhm.  I'm sorry, but this is just plain FUD.

JGit implements the current on disk formats and network protocols
completely[1].  In the area of disk formats and network protocols, Git
*IS NOT* a fast moving target.  This area of Git hasn't changed much
since pack files were first introduced.  As a community, we have been
very careful to avoid changes which break compatibility with older
implementations.

Git is also a lot less complex than CVS or SVN.  Its data model is
simpler on disk.  Its network protocol is *vastly* more simple than
SVN's WebDAV protocol.  And unlike SVN we haven't had to break the
network protocol on every 1.x release we make.


[1]  Actually, JGit lacks --depth support for shallow clones, but
otherwise is complete.

* non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 16:33                     ` Thomas Singer
@ 2010-10-30  4:02                       ` brad12
  2010-10-30  8:58                         ` Jakub Narebski
  0 siblings, 1 reply; 27+ messages in thread
From: brad12 @ 2010-10-30  4:02 UTC (permalink / raw)
  To: git


Actually, I am also working on a Japanese-language site and have the same
problem: I have some sentences that should be displayed in Japanese, but
they do not appear in the browser. What should I do?


* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2010-10-30  4:02                       ` brad12
@ 2010-10-30  8:58                         ` Jakub Narebski
  0 siblings, 0 replies; 27+ messages in thread
From: Jakub Narebski @ 2010-10-30  8:58 UTC (permalink / raw)
  To: brad12; +Cc: git

brad12 <brad.john75@gmail.com> writes:

> Actually, I am also working on a Japanese-language site and have the same
> problem: I have some sentences that should be displayed in Japanese, but
> they do not appear in the browser. What should I do?

Why are you sending this question to *this* mailing list (this
newsgroup)?

-- 
Jakub Narebski
Poland
ShadeHawk on #git

* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
  2009-12-01 17:24               ` Jakub Narebski
  2009-12-01 18:55                 ` Thomas Singer
@ 2010-10-30  9:52                 ` demerphq
  1 sibling, 0 replies; 27+ messages in thread
From: demerphq @ 2010-10-30  9:52 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Thomas Singer, Johannes Sixt, git

On 1 December 2009 18:24, Jakub Narebski <jnareb@gmail.com> wrote:
> Thomas Singer <thomas.singer@syntevo.com> writes:
>
>> Johannes Sixt wrote:
>>> Thomas Singer wrote:
>>>>
>>>> Is it a German Windows limitation, that far-east characters are not
>>>> supported on it (but work fine on a Japanese Windows), are there different
>>>> (msys)Git versions available or is this a configuration issue?
>>>
>>> It is a matter of configuration.
>>>
>>> Since 8 bits are not sufficient to support Japanese alphabet in addition
>>> to the German alphabet, programs that are not Unicode aware -- such as git
>>> -- have to make a decision which alphabet they support. The decision is
>>> made by picking a "codepage".
>>>
>>> On German Windows, you are in codepage 850 (in the console). The filenames
>>>  (that actually are in Unicode) are converted to bytes according to
>>> codepage 850 *before* git sees them. If your filenames contain Hiragana,
>>> they are substituted by the "unknown character" marker because there is no
>>> place for them in codepage 850.
> [...]
>
>>> Corollary: Stick to ASCII file names.
>>>
>>> There have been suggestions to switch the console to codepage 65001
>>> (UTF-8), but I have never heard of success reports. I'm not saying it does
>>> not work, though.
>>
>> Thanks for the detailed explanation. I know the differences between bytes
>> and characters and the needed *encoding* to convert from one to another, but
>> I did not know how Git handles it. I'm quite surprised, that -- as I
>> understand you -- msys-Git (or Git at all?) is not able to handle all
>> characters (aka unicode) at the same time. I expected it would be better
>> than older tools, e.g. SVN.
>
> The problem is not with Git, as Git is (currently) agnostic with
> respect to filename encoding; for Git filenames are opaque NUL ('\0')
> terminated binary data.  There is some infrastructure to convert
> between filename encodings and other filename quirks (like
> case-insensitivity), though...

"You can use whatever encoding you want. So long as it looks like a
standard UNIX filename."

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

end of thread, other threads:[~2010-10-30  9:53 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
2009-11-28 18:15 non-US-ASCII file names (e.g. Hiragana) on Windows Thomas Singer
2009-11-28 20:00 ` Johannes Sixt
2009-12-01  8:57   ` Thomas Singer
2009-12-01  9:04     ` Thomas Singer
2009-12-01 10:08       ` Johannes Sixt
2009-12-01 16:26         ` Shawn O. Pearce
2009-12-01 22:11           ` Robin Rosenberg
2009-11-28 23:07 ` Maximilien Noal
2009-11-29  9:18   ` Thomas Singer
2009-12-01  7:49     ` Thomas Singer
2009-12-01  8:27       ` Johannes Sixt
2009-12-01  8:55         ` Thomas Singer
2009-12-01 10:00           ` Johannes Sixt
2009-12-01 12:08             ` Thomas Singer
2009-12-01 13:17               ` Johannes Sixt
2009-12-01 15:41                 ` Thomas Singer
2009-12-01 15:50                   ` Erik Faye-Lund
2009-12-01 16:33                     ` Thomas Singer
2010-10-30  4:02                       ` brad12
2010-10-30  8:58                         ` Jakub Narebski
2009-12-01 17:24               ` Jakub Narebski
2009-12-01 18:55                 ` Thomas Singer
2009-12-02 16:22                   ` Shawn Pearce
2010-10-30  9:52                 ` demerphq
2009-12-01  9:12     ` Erik Faye-Lund
2009-12-01 12:11       ` Thomas Singer
2009-11-28 23:37 ` Reece Dunn
