* non-US-ASCII file names (e.g. Hiragana) on Windows
@ 2009-11-28 18:15 Thomas Singer
  2009-11-28 20:00 ` Johannes Sixt
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Thomas Singer @ 2009-11-28 18:15 UTC (permalink / raw)
To: git

I've created a file with unicode characters in its name (using Java):

    new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();

The file name is stored correctly on disk, because if invoking a

    dir.list()

the name is listed correctly.

When opening this directory in the Windows Explorer (German Windows XP SP3), it shows 4 boxes - which most likely is a problem of the font not supporting these characters.

When launching 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from 7zip-bundle) it only shows me 4 question marks. I would have expected to see the non-displayable characters escaped like it did with the umlauts on OS X.

Even adding fails:

    $ git add .
    fatal: unable to stat '????': No such file or directory

What should I do to make Git recognize these characters?

--
Thanks in advance,
Tom

^ permalink raw reply [flat|nested] 27+ messages in thread
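The escaped form Thomas expected can be illustrated with a small sketch. It assumes the file name reaches the program as UTF-8 bytes and quotes every non-printable or non-ASCII byte as a three-digit octal escape. The class and method names are hypothetical, and this is a simplified illustration of C-style path quoting, not Git's actual implementation.

```java
import java.nio.charset.StandardCharsets;

final class QuotePathSketch {
    // Hypothetical helper: C-style quoting of a path given as raw bytes.
    // Printable ASCII is kept; every other byte becomes a 3-digit octal escape.
    static String quote(byte[] path) {
        StringBuilder sb = new StringBuilder("\"");
        for (byte b : path) {
            int v = b & 0xFF;                      // treat the byte as unsigned
            if (v == '"' || v == '\\') {
                sb.append('\\').append((char) v);  // escape quote and backslash
            } else if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);               // printable ASCII passes through
            } else {
                sb.append('\\').append(String.format("%03o", v));
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // The four Hiragana characters from the report, encoded as UTF-8.
        byte[] utf8 = "\u3041\u3042\u3043\u3044".getBytes(StandardCharsets.UTF_8);
        System.out.println(quote(utf8));
    }
}
```

Quoting of this kind only helps when the original bytes reach the program intact. If the name has already been replaced by question marks before the program sees it, there is nothing left to escape.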
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Johannes Sixt @ 2009-11-28 20:00 UTC
To: Thomas Singer; +Cc: git

On Saturday, 28 November 2009, Thomas Singer wrote:
> I've created a file with unicode characters in its name (using Java):
>
>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
>...
> $ git add .
> fatal: unable to stat '????': No such file or directory
>
> What should I do to make Git recognize these characters?

You cannot on a German Windows.

You can switch your Windows to Japanese (not the UI, just the codepage aka "locale"; yes, that's possible, I have such a setup), but even then the characters of the file name will be recorded in Shift-JIS encoding, not UTF-8 or Unicode. When you later switch back to German, these bytes will be interpreted as cp850 or cp1252 text and displayed accordingly.

--
Hannes
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 8:57 UTC
To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> On Saturday, 28 November 2009, Thomas Singer wrote:
>> I've created a file with unicode characters in its name (using Java):
>>
>>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
>> ...
>> $ git add .
>> fatal: unable to stat '????': No such file or directory
>>
>> What should I do to make Git recognize these characters?
>
> You cannot on a German Windows.
>
> You can switch your Windows to Japanese (not the UI, just the codepage
> aka "locale"; yes, that's possible, I have such a setup), but even then the
> characters of the file name will be recorded in Shift-JIS encoding, not UTF-8
> or Unicode. When you later switch back to German, these bytes will be
> interpreted as cp850 or cp1252 text and displayed accordingly.

Who is interpreting the file names? Windows or Git or Java?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 9:04 UTC
To: Johannes Sixt; +Cc: git

Thomas Singer wrote:
> Johannes Sixt wrote:
>> You can switch your Windows to Japanese (not the UI, just the codepage
>> aka "locale"; yes, that's possible, I have such a setup), but even then the
>> characters of the file name will be recorded in Shift-JIS encoding, not UTF-8
>> or Unicode. When you later switch back to German, these bytes will be
>> interpreted as cp850 or cp1252 text and displayed accordingly.
>
> Who is interpreting the file names? Windows or Git or Java?

To be more precise: Who is interpreting the bytes in the file names as characters? Windows, Git or Java?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Johannes Sixt @ 2009-12-01 10:08 UTC
To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> To be more precise: Who is interpreting the bytes in the file names as
> characters? Windows, Git or Java?

In the case of git: Windows does it, using the console's codepage to convert between bytes and Unicode.

I don't know about Java, but I guess that no conversion is necessary because Java is Unicode-aware.

--
Hannes
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Shawn O. Pearce @ 2009-12-01 16:26 UTC
To: Johannes Sixt; +Cc: Thomas Singer, git

Johannes Sixt <j.sixt@viscovery.net> wrote:
> Thomas Singer wrote:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
>
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
>
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and it's something that is proving to be really painful within JGit.

The Java IO APIs use UTF-16 for file names. However we are reading a stream of unknown bytes from the index file and tree objects. Thus JGit must convert a stream of bytes into UTF-16 just to get to the OS.

The JVM then turns around and converts from UTF-16 to some other encoding for the filesystem.

On Win32 I suspect the JVM uses the native UTF-16 file APIs, so this translation is lossless.

On POSIX, I suspect the JVM uses $LANG or some other related environment variable to guess the user's preferred encoding, and then converts from UTF-16 to bytes in that encoding. And I have no idea how they handle normalization of composed code points.

All of these layers make for a *very* confusing situation for us within JGit:

  git tree
  +---------+
  |  bytes  | -+
  +---------+   \
                 \            +--------+            +---------+
                  +-- JGit -->| UTF-16 |-- JVM -->  | OS call |
  .git/index     /            +--------+            +---------+
  +---------+   /
  |  bytes  | -+
  +---------+

It's impossible for us to do what C git does, which is just use the bytes used by the OS call within the git data structure. Which of course also isn't always portable, e.g. the Mac OS X HFS+ mess. :-)

--
Shawn.
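The cost of forcing name bytes through a Java String can be seen in a short sketch. It assumes a name whose bytes are Latin-1 rather than UTF-8, and the class and method names are hypothetical. It only illustrates the bytes to UTF-16 to bytes round trip described above and is not how JGit is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripSketch {
    // Decode raw name bytes to a Java String, then encode them back with the same charset.
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // "foo" followed by 0xE4, which is a-umlaut in Latin-1 but malformed as UTF-8.
        byte[] raw = {'f', 'o', 'o', (byte) 0xE4};

        // Decoding as UTF-8 replaces the stray byte with U+FFFD, so the original bytes are lost.
        System.out.println(Arrays.equals(raw, roundTrip(raw, StandardCharsets.UTF_8)));

        // ISO-8859-1 maps every byte value to a character, so the bytes survive the round trip.
        System.out.println(Arrays.equals(raw, roundTrip(raw, StandardCharsets.ISO_8859_1)));
    }
}
```

A lossless byte-to-character mapping such as ISO-8859-1 preserves the bytes, but the resulting String no longer shows the characters the user intended, which is the trade-off the message describes.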
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Robin Rosenberg @ 2009-12-01 22:11 UTC
To: Shawn O. Pearce; +Cc: Johannes Sixt, Thomas Singer, git

On Tuesday, 1 December 2009 17:26:27, you wrote:
> Johannes Sixt <j.sixt@viscovery.net> wrote:
> > Thomas Singer wrote:
> > > To be more precise: Who is interpreting the bytes in the file names as
> > > characters? Windows, Git or Java?
> >
> > In the case of git: Windows does it, using the console's codepage to
> > convert between bytes and Unicode.
> >
> > I don't know about Java, but I guess that no conversion is necessary
> > because Java is Unicode-aware.
>
> Actually, conversion is necessary, and it's something that is proving
> to be really painful within JGit.
>
> The Java IO APIs use UTF-16 for file names. However we are reading
> a stream of unknown bytes from the index file and tree objects.
> Thus JGit must convert a stream of bytes into UTF-16 just to get
> to the OS.
>
> The JVM then turns around and converts from UTF-16 to some other
> encoding for the filesystem.
>
> On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
> this translation is lossless.
>
> On POSIX, I suspect the JVM uses $LANG or some other related
> environment variable to guess the user's preferred encoding, and
> then converts from UTF-16 to bytes in that encoding. And I have
> no idea how they handle normalization of composed code points.
>
> All of these layers make for a *very* confusing situation for us
> within JGit:
>
>   git tree
>   +---------+
>   |  bytes  | -+
>   +---------+   \
>                  \            +--------+            +---------+
>                   +-- JGit -->| UTF-16 |-- JVM -->  | OS call |
>   .git/index     /            +--------+            +---------+
>   +---------+   /
>   |  bytes  | -+
>   +---------+
>
> It's impossible for us to do what C git does, which is just use the
> bytes used by the OS call within the git data structure. Which of
> course also isn't always portable, e.g. the Mac OS X HFS+ mess.

We can decode the index any way we like, but not file names coming from the file system. On Windows, any sane name (it does allow invalid UTF-16 too, but...) will be readable by JGit, but on a UTF-8 POSIX system that may not be so if the filename is actually Latin-1 encoded. In that case the Java runtime will return a decoded filename containing an "invalid" code point and any attempt to access the file from Java will fail. I can see some horribly expensive ways to work around that but...

As for the more sane cases, I have a compare routine that works on mixed encodings that may help to solve some of the problems. Ideally it would not only be able to compare filenames with unknown encodings but also handle case folding and composing characters in one go. I guess one could make it fall back to an encoding other than Latin-1, though with less certainty, but it will not (for sure) work with any arbitrary set of encodings. You'll have to choose, so it's only a legacy workaround, as opposed to a solution.

--
robin
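The fall-back idea mentioned above might look roughly like the following sketch: try a strict UTF-8 decode first and, if the bytes are not valid UTF-8, assume Latin-1. The class and method names are hypothetical, and this is not Robin's compare routine, only an illustration of decoding names of unknown encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class NameDecoderSketch {
    // Decode name bytes as strict UTF-8; fall back to ISO-8859-1 when they are not valid UTF-8.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption for this sketch: anything that is not UTF-8 is treated as Latin-1.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] {'f', 'o', 'o', (byte) 0xE4}));
        System.out.println(decode("\u3041".getBytes(StandardCharsets.UTF_8)));
    }
}
```

As the message says, this is a legacy workaround: a Latin-1 name whose bytes happen to form valid UTF-8 would be decoded as UTF-8, and any legacy encoding other than the chosen fall-back is misread.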
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Maximilien Noal @ 2009-11-28 23:07 UTC
To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> I've created a file with unicode characters in its name (using Java):
>
>  new File(dir, "\u3041\u3042\u3043\u3044").createNewFile();
>
> The file name is stored correctly on disk, because if invoking a
>
>  dir.list()
>
> the name is listed correctly.
>
> When opening this directory in the Windows Explorer (German Windows XP SP3),
> it shows 4 boxes - which most likely is a problem of the font not supporting
> these characters.
>
> When launching 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from
> 7zip-bundle) it only shows me 4 question marks. I would have expected to see
> the non-displayable characters escaped like it did with the umlauts on OS X.
>
> Even adding fails:
>
> $ git add .
> fatal: unable to stat '????': No such file or directory
>
> What should I do to make Git recognize these characters?
>

Hi,

About the 'boxes':

The thing is, Windows' files for Asian languages are _not_ installed by default.

They can be installed (even while installing Windows), by checking the two checkboxes under the "Supplemental languages support" groupbox in the "Languages" tab of the "Regional and language options" control panel.
*re-take some breath ;-) *

It will remove the "boxes" in Explorer and display nice Asian characters.

But that will only fix the display of file names in Windows, surely not git (unless I'm mistaken).
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-11-29 9:18 UTC
To: Maximilien Noal; +Cc: git

Maximilien Noal wrote:
> About the 'boxes':
>
> The thing is, Windows' files for Asian languages are _not_ installed by
> default.
>
> They can be installed (even while installing Windows), by checking the
> two checkboxes under the "Supplemental languages support" groupbox in the
> "Languages" tab of the "Regional and language options" control panel.
> *re-take some breath ;-) *
>
> It will remove the "boxes" in Explorer and display nice Asian characters.

Thanks, now the characters are showing up fine in the Explorer.

Reece Dunn wrote:
> This is a bug in git's character encoding/conversion logic. It looks
> like git is taking the source string and converting it to ascii to be
> displayed on the console output (e.g. by using the WideCharToMultiByte
> conversion API) -- these APIs will use a '?' character for characters
> that it cannot map to the target character encoding (like the Hiragana
> characters that you are using).

I have a screenshot from a SmartGit user where 1) the console can show the far-east characters and 2) Git *can* show the characters escaped. Are there two versions of Git available or does Git's behaviour depend somehow on the system locale?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 7:49 UTC
To: git

Thomas Singer wrote:
> Reece Dunn wrote:
>> This is a bug in git's character encoding/conversion logic. It looks
>> like git is taking the source string and converting it to ascii to be
>> displayed on the console output (e.g. by using the WideCharToMultiByte
>> conversion API) -- these APIs will use a '?' character for characters
>> that it cannot map to the target character encoding (like the Hiragana
>> characters that you are using).
>
> I have a screenshot from a SmartGit user where 1) the console can show the
> far-east characters and 2) Git *can* show the characters escaped. Are there
> two versions of Git available or does Git's behaviour depend somehow on the
> system locale?

Does no Git expert know what to do to get it working?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Johannes Sixt @ 2009-12-01 8:27 UTC
To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> Thomas Singer wrote:
>> Reece Dunn wrote:
>>> This is a bug in git's character encoding/conversion logic. It looks
>>> like git is taking the source string and converting it to ascii to be
>>> displayed on the console output (e.g. by using the WideCharToMultiByte
>>> conversion API) -- these APIs will use a '?' character for characters
>>> that it cannot map to the target character encoding (like the Hiragana
>>> characters that you are using).
>> I have a screenshot from a SmartGit user where 1) the console can show the
>> far-east characters and 2) Git *can* show the characters escaped. Are there
>> two versions of Git available or does Git's behaviour depend somehow on the
>> system locale?
>
> Does no Git expert know what to do to get it working?

http://article.gmane.org/gmane.comp.version-control.git/133980 [*]

The possible reason why someone else is seeing correct glyphs with SmartGit is that it is a Unicode application, the Windows box has suitable fonts installed, and the console is configured with a suitable font as well.

--
Hannes

[*] I had a botched email infrastructure when I sent this message, and the copy intended for you went to the waste bin, but I thought I had re-sent it to you in a private mail.
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 8:55 UTC
To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> Thomas Singer wrote:
>>> Reece Dunn wrote:
>>>> This is a bug in git's character encoding/conversion logic. It looks
>>>> like git is taking the source string and converting it to ascii to be
>>>> displayed on the console output (e.g. by using the WideCharToMultiByte
>>>> conversion API) -- these APIs will use a '?' character for characters
>>>> that it cannot map to the target character encoding (like the Hiragana
>>>> characters that you are using).
>>> I have a screenshot from a SmartGit user where 1) the console can show the
>>> far-east characters and 2) Git *can* show the characters escaped. Are there
>>> two versions of Git available or does Git's behaviour depend somehow on the
>>> system locale?
>> Does no Git expert know what to do to get it working?
>
> http://article.gmane.org/gmane.comp.version-control.git/133980 [*]
>
> The possible reason why someone else is seeing correct glyphs with
> SmartGit is that it is a Unicode application, the Windows box has
> suitable fonts installed, and the console is configured with a suitable
> font as well.

I wasn't talking about SmartGit, but msysgit on the Windows console. Sorry if that wasn't clear.

Is it a German Windows limitation that far-east characters are not supported on it (but work fine on a Japanese Windows), are there different (msys)Git versions available, or is this a configuration issue?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Johannes Sixt @ 2009-12-01 10:00 UTC
To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> Is it a German Windows limitation that far-east characters are not
> supported on it (but work fine on a Japanese Windows), are there different
> (msys)Git versions available, or is this a configuration issue?

It is a matter of configuration.

Since 8 bits are not sufficient to support the Japanese alphabet in addition to the German alphabet, programs that are not Unicode aware -- such as git -- have to make a decision which alphabet they support. The decision is made by picking a "codepage".

On German Windows, you are in codepage 850 (in the console). The filenames (that actually are in Unicode) are converted to bytes according to codepage 850 *before* git sees them. If your filenames contain Hiragana, they are substituted by the "unknown character" marker because there is no place for them in codepage 850.

However, you can install Japanese language support on German Windows. Then you can change your console to codepage 932:

    chcp 932

When you run git from *this* console, Hiragana in the filenames are converted to cp932 before git sees them. The resulting byte sequence is different from the one in cp850, but git will be able to see that the file exists and was modified, and you can 'git add' it.

But if you have files with umlauts, they will not be recognized anymore because umlauts have no place in cp932.

In neither case can you exchange the repository with Linux if you have your locale set to UTF-8 on Linux, because neither byte sequence (umlauts from cp850 or Hiragana from cp932) is a valid UTF-8 sequence, let alone results in the expected glyphs.

Corollary: Stick to ASCII file names.

There have been suggestions to switch the console to codepage 65001 (UTF-8), but I have never heard of success reports. I'm not saying it does not work, though.

--
Hannes
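The substitution described above can be reproduced in a few lines. This sketch uses ISO-8859-1 as a stand-in for codepage 850 or 1252, because it is guaranteed to be available in every JVM; the real codepages assign different byte values, but they have no place for Hiragana either. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CodepageSketch {
    // Convert a Unicode name to a single-byte charset and back, as an 8-bit program would see it.
    static String narrow(String name) {
        byte[] bytes = name.getBytes(StandardCharsets.ISO_8859_1); // unmappable characters become '?'
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Interpret the single-byte encoding of a name as if it were UTF-8, as a UTF-8 Linux box would.
    static String readAsUtf8(String name) {
        return new String(name.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(narrow("\u3041\u3042\u3043\u3044")); // Hiragana does not survive
        System.out.println(narrow("\u00e4"));                   // the umlaut does
        System.out.println(readAsUtf8("\u00e4"));               // but its byte is not valid UTF-8
    }
}
```

Once the name has been narrowed to four question marks, no file with that name exists, which is consistent with the "unable to stat '????'" error from the original report.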
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 12:08 UTC
To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> Is it a German Windows limitation that far-east characters are not
>> supported on it (but work fine on a Japanese Windows), are there different
>> (msys)Git versions available, or is this a configuration issue?
>
> It is a matter of configuration.
>
> Since 8 bits are not sufficient to support the Japanese alphabet in addition
> to the German alphabet, programs that are not Unicode aware -- such as git
> -- have to make a decision which alphabet they support. The decision is
> made by picking a "codepage".
>
> On German Windows, you are in codepage 850 (in the console). The filenames
> (that actually are in Unicode) are converted to bytes according to
> codepage 850 *before* git sees them. If your filenames contain Hiragana,
> they are substituted by the "unknown character" marker because there is no
> place for them in codepage 850.
>
> However, you can install Japanese language support on German Windows. Then
> you can change your console to codepage 932:
>
>    chcp 932
>
> When you run git from *this* console, Hiragana in the filenames are
> converted to cp932 before git sees them. The resulting byte sequence is
> different from the one in cp850, but git will be able to see that the file
> exists and was modified, and you can 'git add' it.
>
> But if you have files with umlauts, they will not be recognized anymore
> because umlauts have no place in cp932.
>
> In neither case can you exchange the repository with Linux if you have
> your locale set to UTF-8 on Linux, because neither byte sequence (umlauts
> from cp850 or Hiragana from cp932) is a valid UTF-8 sequence, let alone
> results in the expected glyphs.
>
> Corollary: Stick to ASCII file names.
>
> There have been suggestions to switch the console to codepage 65001
> (UTF-8), but I have never heard of success reports. I'm not saying it does
> not work, though.

Thanks for the detailed explanation. I know the differences between bytes and characters and the needed *encoding* to convert from one to another, but I did not know how Git handles it. I'm quite surprised that -- as I understand you -- msys-Git (or Git at all?) is not able to handle all characters (aka unicode) at the same time. I expected it would be better than older tools, e.g. SVN.

BTW, we are invoking the Git executable from Java. Is there automatically a console "around" Git? Should we invoke a shell script (which sets the console's code page) instead of the Git executable directly?

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Johannes Sixt @ 2009-12-01 13:17 UTC
To: Thomas Singer; +Cc: git

Thomas Singer wrote:
> I'm quite surprised that -- as I
> understand you -- msys-Git (or Git at all?) is not able to handle all
> characters (aka unicode) at the same time. I expected it would be better
> than older tools, e.g. SVN.

This has been discussed at length here and in the msysgit mailing list. Git expects that the file system returns file names with the same byte sequence that git used to create a file. On Windows, this works only as long as you do not switch the codepage.

> BTW, we are invoking the Git executable from Java. Is there automatically a
> console "around" Git?

I don't think so. In this case, the codepage that Java has set up will apply. I guess that Java doesn't mess with the codepage at all, and then on German Windows git would operate in cp1252.

> Should we invoke a shell script (which sets the
> console's code page) instead of the Git executable directly?

I don't think that is necessary.

--
Hannes
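When Git is started from Java without a console, the caller has to decide how to turn Git's output bytes back into characters. The sketch below reads the raw bytes and decodes them with an explicitly chosen charset instead of relying on an implicit default. It assumes a Git version that supports `git status --porcelain -z` and Java 9 or later for `readAllBytes`; the class and method names are hypothetical, and which charset is correct depends on the setup discussed above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class GitStatusSketch {
    // Split NUL-terminated records and decode each one with an explicitly chosen charset.
    static List<String> splitRecords(byte[] out, Charset cs) {
        List<String> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < out.length; i++) {
            if (out[i] == 0) {
                records.add(new String(out, start, i - start, cs));
                start = i + 1;
            }
        }
        return records;
    }

    // Run "git status --porcelain -z" in dir and return its raw standard output.
    static byte[] runStatus(File dir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("git", "status", "--porcelain", "-z")
                .directory(dir)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the parsed bytes
                .start();
        byte[] out = p.getInputStream().readAllBytes();
        p.waitFor();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: git's output uses the JVM's default charset; adjust if it does not.
        for (String record : splitRecords(runStatus(new File(".")), Charset.defaultCharset())) {
            System.out.println(record);
        }
    }
}
```

Explicit decoding cannot recover characters that were already replaced by '?' before Git saw the names, so it complements rather than replaces the codepage considerations above.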
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 15:41 UTC
To: Johannes Sixt; +Cc: git

Johannes Sixt wrote:
> Thomas Singer wrote:
>> I'm quite surprised that -- as I
>> understand you -- msys-Git (or Git at all?) is not able to handle all
>> characters (aka unicode) at the same time. I expected it would be better
>> than older tools, e.g. SVN.
>
> This has been discussed at length here and in the msysgit mailing list.
> Git expects that the file system returns file names with the same byte
> sequence that git used to create a file. On Windows, this works only as
> long as you do not switch the codepage.

Now you confuse me: is this a problem of Windows, is Git using a less capable Windows API call, or is there no Unicode-capable API call to list file names on Windows? I ask myself how Java does it internally; after all, it (also) has a C base, I guess.

--
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Erik Faye-Lund @ 2009-12-01 15:50 UTC
To: Thomas Singer; +Cc: Johannes Sixt, git

On Tue, Dec 1, 2009 at 4:41 PM, Thomas Singer <thomas.singer@syntevo.com> wrote:
> Johannes Sixt wrote:
>> Thomas Singer wrote:
>>> I'm quite surprised that -- as I
>>> understand you -- msys-Git (or Git at all?) is not able to handle all
>>> characters (aka unicode) at the same time. I expected it would be better
>>> than older tools, e.g. SVN.
>>
>> This has been discussed at length here and in the msysgit mailing list.
>> Git expects that the file system returns file names with the same byte
>> sequence that git used to create a file. On Windows, this works only as
>> long as you do not switch the codepage.
>
> Now you confuse me: is this a problem of Windows, is Git using a less capable
> Windows API call, or is there no Unicode-capable API call to list file names
> on Windows? I ask myself how Java does it internally; after all, it
> (also) has a C base, I guess.
>

Git uses the 8-bit file APIs, and Windows doesn't support setting UTF-8 as the locale. Some work has been done in msysGit to use _wopen() and friends instead, but AFAIK it's not completed. See the branch called "work/utf-filenames" in git://repo.or.cz/git/mingw/4msysgit.git if you are interested in helping to complete it.

--
Erik "kusma" Faye-Lund
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 16:33 UTC
To: kusmabite; +Cc: Johannes Sixt, git

Erik Faye-Lund wrote:
> Git uses the 8-bit file APIs, and Windows doesn't support setting
> UTF-8 as the locale. Some work has been done in msysGit to use
> _wopen() and friends instead, but AFAIK it's not completed. See the
> branch called "work/utf-filenames" in
> git://repo.or.cz/git/mingw/4msysgit.git if you are interested in
> helping to complete it.

Thanks, now I understand.

--
Tom
* non-US-ASCII file names (e.g. Hiragana) on Windows
From: brad12 @ 2010-10-30 4:02 UTC
To: git

actually I am also working Japanese language site , I have same problem , I have also some japanese sentences which is to be written in japanese , but they did not appear on browser....so what should I do

-----
http://www.learnjapanesefree.com/ Learning japanese language | http://www.learnjapanesefree.com/japanese-hiragana.html Hiragana
--
View this message in context: http://git.661346.n2.nabble.com/non-US-ASCII-file-names-e-g-Hiragana-on-Windows-tp4080246p5688741.html
Sent from the git mailing list archive at Nabble.com.
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Jakub Narebski @ 2010-10-30 8:58 UTC
To: brad12; +Cc: git

brad12 <brad.john75@gmail.com> writes:

> actually I am also working Japanese language site , I have same problem , I
> have also some japanese sentences which is to be written in japanese , but
> they did not appear on browser....so what should I do

Why are you sending this question to *this* mailing list (this newsgroup)?

--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Jakub Narebski @ 2009-12-01 17:24 UTC
To: Thomas Singer; +Cc: Johannes Sixt, git

Thomas Singer <thomas.singer@syntevo.com> writes:
> Johannes Sixt wrote:
>> Thomas Singer wrote:
>>>
>>> Is it a German Windows limitation that far-east characters are not
>>> supported on it (but work fine on a Japanese Windows), are there different
>>> (msys)Git versions available, or is this a configuration issue?
>>
>> It is a matter of configuration.
>>
>> Since 8 bits are not sufficient to support the Japanese alphabet in addition
>> to the German alphabet, programs that are not Unicode aware -- such as git
>> -- have to make a decision which alphabet they support. The decision is
>> made by picking a "codepage".
>>
>> On German Windows, you are in codepage 850 (in the console). The filenames
>> (that actually are in Unicode) are converted to bytes according to
>> codepage 850 *before* git sees them. If your filenames contain Hiragana,
>> they are substituted by the "unknown character" marker because there is no
>> place for them in codepage 850.
[...]
>> Corollary: Stick to ASCII file names.
>>
>> There have been suggestions to switch the console to codepage 65001
>> (UTF-8), but I have never heard of success reports. I'm not saying it does
>> not work, though.
>
> Thanks for the detailed explanation. I know the differences between bytes
> and characters and the needed *encoding* to convert from one to another, but
> I did not know how Git handles it. I'm quite surprised that -- as I
> understand you -- msys-Git (or Git at all?) is not able to handle all
> characters (aka unicode) at the same time. I expected it would be better
> than older tools, e.g. SVN.

The problem is not with Git, as Git is (currently) agnostic with respect to filename encoding; for Git, filenames are opaque NUL ('\0') terminated binary data. There is some infrastructure to convert between filename encodings and to handle other filename quirks (like case-insensitivity), though...

The problem is with the MS Windows *console*, from which you invoke git commands, and which translates from the filename encoding used by the filesystem to the encoding / codepage used by the console.

> BTW, we are invoking the Git executable from Java. Is there automatically a
> console "around" Git? Should we invoke a shell script (which sets the
> console's code page) instead of the Git executable directly?

If you use Git from Java, why don't you just use JGit (www.jgit.org), which is a Git implementation in Java?

--
Jakub Narebski
Poland
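One consequence of treating names as opaque bytes is that two names which look identical can still be different names. The sketch below compares the UTF-8 bytes of the same visible character in composed (NFC) and decomposed (NFD) form, which is the kind of difference behind the HFS+ issue mentioned earlier in the thread. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

final class NormalizationSketch {
    // Encode a name as UTF-8 after normalizing it to the given Unicode normalization form.
    static byte[] utf8(String name, Normalizer.Form form) {
        return Normalizer.normalize(name, form).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00e4"; // a-umlaut as a single code point
        byte[] composed = utf8(name, Normalizer.Form.NFC);    // one code point, two bytes
        byte[] decomposed = utf8(name, Normalizer.Form.NFD);  // 'a' plus combining diaeresis, three bytes
        System.out.println(composed.length + " vs " + decomposed.length + " bytes");
        System.out.println("byte-identical: " + Arrays.equals(composed, decomposed));
    }
}
```

A byte-wise comparison treats the two forms as unrelated names, even though they render the same way on screen.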
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 18:55 UTC
  To: Jakub Narebski; +Cc: Johannes Sixt, git

Jakub Narebski wrote:
> If you use Git from Java, why don't you just use JGit (www.jgit.org),
> which is a Git implementation in Java?

We are using JGit for the read-only stuff and the Git command line
executable for all writing commands. We very much appreciate Shawn O.
Pearce's (and the other JGit developers') effort, but Git is a fast-moving
target and (much) more complex than CVS or SVN, for which we use Java
libraries communicating with the corresponding server, which adds another
sanity layer to the repository, making repository corruption less likely than
with direct access.

-- 
Best regards,
Thomas Singer
=============
syntevo GmbH
http://www.syntevo.com
http://blog.syntevo.com
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Shawn Pearce @ 2009-12-02 16:22 UTC
  To: Thomas Singer; +Cc: Jakub Narebski, Johannes Sixt, git

On Tue, Dec 1, 2009 at 10:55 AM, Thomas Singer <thomas.singer@syntevo.com> wrote:
>
> Jakub Narebski wrote:
> > If you use Git from Java, why don't you just use JGit (www.jgit.org),
> > which is a Git implementation in Java?
>
> We are using JGit for the read-only stuff and the Git command line
> executable for all writing commands. We very much appreciate Shawn O.
> Pearce's (and the other JGit developers') effort, but Git is a fast-moving
> target and (much) more complex than CVS or SVN, for which we use Java
> libraries communicating with the corresponding server, which adds another
> sanity layer to the repository, making repository corruption less likely than
> with direct access.

Uhm. I'm sorry, but this is just plain FUD.

JGit implements the current on-disk formats and network protocols
completely[1]. In the area of disk formats and network protocols, Git
*IS NOT* a fast-moving target. This area of Git hasn't changed much
since pack files were first introduced. As a community, we have been
very careful to avoid changes which break compatibility with older
implementations.

Git is also a lot less complex than CVS or SVN. Its data model is
simpler on disk. Its network protocol is *vastly* simpler than
SVN's WebDAV protocol. And unlike SVN, we haven't had to break the
network protocol on every 1.x release we make.

[1] Actually, JGit lacks --depth support for shallow clones, but
    otherwise is complete.
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: demerphq @ 2010-10-30 9:52 UTC
  To: Jakub Narebski; +Cc: Thomas Singer, Johannes Sixt, git

On 1 December 2009 18:24, Jakub Narebski <jnareb@gmail.com> wrote:
> Thomas Singer <thomas.singer@syntevo.com> writes:
>
>> Johannes Sixt wrote:
>>> Thomas Singer wrote:
>>>>
>>>> Is it a German Windows limitation, that far-east characters are not
>>>> supported on it (but work fine on a Japanese Windows), are there different
>>>> (msys)Git versions available or is this a configuration issue?
>>>
>>> It is a matter of configuration.
>>>
>>> Since 8 bits are not sufficient to support Japanese alphabet in addition
>>> to the German alphabet, programs that are not Unicode aware -- such as git
>>> -- have to make a decision which alphabet they support. The decision is
>>> made by picking a "codepage".
>>>
>>> On German Windows, you are in codepage 850 (in the console). The filenames
>>> (that actually are in Unicode) are converted to bytes according to
>>> codepage 850 *before* git sees them. If your filenames contain Hiragana,
>>> they are substituted by the "unknown character" marker because there is no
>>> place for them in codepage 850.
> [...]
>
>>> Corollary: Stick to ASCII file names.
>>>
>>> There have been suggestions to switch the console to codepage 65001
>>> (UTF-8), but I have never heard of success reports. I'm not saying it does
>>> not work, though.
>>
>> Thanks for the detailed explanation. I know the differences between bytes
>> and characters and the needed *encoding* to convert from one to another, but
>> I did not know how Git handles it. I'm quite surprised, that -- as I
>> understand you -- msys-Git (or Git at all?) is not able to handle all
>> characters (aka unicode) at the same time. I expected it would be better
>> than older tools, e.g. SVN.
>
> The problem is not with Git, as Git is (currently) agnostic with
> respect to filename encoding; for Git filenames are opaque NUL ('\0')
> terminated binary data. There is some infrastructure to convert
> between filename encodings and other filename quirks (like
> case-insensitivity), though...

"You can use whatever encoding you want. So long as it looks like a
standard UNIX filename."

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Erik Faye-Lund @ 2009-12-01 9:12 UTC
  To: Thomas Singer; +Cc: Maximilien Noal, git

On Sun, Nov 29, 2009 at 10:18 AM, Thomas Singer <thomas.singer@syntevo.com> wrote:
> Maximilien Noal wrote:
>> About the 'boxes' :
>>
>> The thing is, Windows' files for Asian languages are _not_ installed by
>> default.
>>
>> They can be installed (even while installing Windows), by checking the
>> two checkboxes under the "Supplemental languages support" groupbox in the
>> "Languages" tab of the "Regional and language options" control panel.
>> *re-take some breath ;-) *
>>
>> It will remove the "boxes" in Explorer and display nice Asian characters.
>
> Thanks, now the characters are showing up fine in the Explorer.
>
> Reece Dunn wrote:
>> This is a bug in git's character encoding/conversion logic. It looks
>> like git is taking the source string and converting it to ascii to be
>> displayed on the console output (e.g. by using the WideCharToMultiByte
>> conversion API) -- these APIs will use a '?' character for characters
>> that it cannot map to the target character encoding (like the Hiragana
>> characters that you are using).
>
> I have a screenshot from a SmartGit user where 1) the console can show the
> far-east-characters and 2) Git *can* show the characters escaped. Are there
> two versions of Git available or does Git's behaviour depend somehow on the
> system locale?

Did you try to make sure your console window used a Unicode font on
your German Windows installation? Asian Windows installations might do
this by default, something at least neither English nor Norwegian
Windows installations seem to do...

You can change the console window font through the properties menu
that appears when you right-click the title bar.

-- 
Erik "kusma" Faye-Lund
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Thomas Singer @ 2009-12-01 12:11 UTC
  To: kusmabite; +Cc: Maximilien Noal, git

Erik Faye-Lund wrote:
> Did you try to make sure your console window used a Unicode font on
> your German Windows installation? Asian Windows installations might do
> this by default, something at least neither English nor Norwegian
> Windows installations seem to do...
>
> You can change the console window font through the properties menu
> that appears when you right-click the title bar.

I've tried changing the console font (there is just one alternative), but
it made no difference.

-- 
Tom
* Re: non-US-ASCII file names (e.g. Hiragana) on Windows
From: Reece Dunn @ 2009-11-28 23:37 UTC
  To: Thomas Singer; +Cc: git

2009/11/28 Thomas Singer <thomas.singer@syntevo.com>:
>
> When launching 'git status' from the git shell (msys 1.6.5.1.1367.gcd48 from
> 7zip-bundle) it only shows me 4 question marks. I would have expected to see
> the non-displayable characters escaped like it did with the umlauts on OS X.
>
> Even adding fails:
>
> $ git add .
> fatal: unable to stat '????': No such file or directory
>
> What should I do to make Git recognize these characters?

This is a bug in git's character encoding/conversion logic. It looks
like git is taking the source string and converting it to ASCII to be
displayed on the console output (e.g. by using the WideCharToMultiByte
conversion API) -- these APIs will use a '?' character for characters
that they cannot map to the target character encoding (like the Hiragana
characters that you are using).

SetConsoleOutputCP can be used to change the console output codepage
[http://msdn.microsoft.com/en-us/library/ms686036%28VS.85%29.aspx] and
SetConsoleCP is the equivalent for input
[http://msdn.microsoft.com/en-us/library/ms686013%28VS.85%29.aspx].

e.g.

    SetConsoleCP(CP_UTF8);
    SetConsoleOutputCP(CP_UTF8);

should make the console process UTF-8 characters, so git shouldn't
need to do any character conversions on Windows when reading/writing
its data.

NOTE: I have not tested this, just noting what I have found via Google.

- Reece
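The two calls named above can be tried without rebuilding Git. This is a minimal sketch using Python's ctypes, assuming a Windows console is attached to the process. The function name `set_console_utf8` is hypothetical, and on other platforms it simply returns False. Since no success with codepage 65001 was reported in this thread, it shows only how the calls are made, not that they cure the problem.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Win32 identifier for the UTF-8 codepage


def set_console_utf8() -> bool:
    """Switch the attached console's input and output codepages to UTF-8.

    Returns True only if both Win32 calls report success. On platforms
    other than Windows there is no such console API, so this returns False.
    """
    if sys.platform != "win32":
        return False
    kernel32 = ctypes.windll.kernel32
    # Both functions return nonzero on success and zero on failure.
    ok_in = kernel32.SetConsoleCP(CP_UTF8)
    ok_out = kernel32.SetConsoleOutputCP(CP_UTF8)
    return bool(ok_in) and bool(ok_out)


if __name__ == "__main__":
    print("console switched to UTF-8:", set_console_utf8())
```

Whether Git then sees the file names as UTF-8 is a separate question that this sketch does not answer; it only changes the codepage of the console itself.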
end of thread, other threads:[~2010-10-30 9:53 UTC | newest]

Thread overview: 27+ messages
2009-11-28 18:15 non-US-ASCII file names (e.g. Hiragana) on Windows Thomas Singer
2009-11-28 20:00 ` Johannes Sixt
2009-12-01  8:57 ` Thomas Singer
2009-12-01  9:04 ` Thomas Singer
2009-12-01 10:08 ` Johannes Sixt
2009-12-01 16:26 ` Shawn O. Pearce
2009-12-01 22:11 ` Robin Rosenberg
2009-11-28 23:07 ` Maximilien Noal
2009-11-29  9:18 ` Thomas Singer
2009-12-01  7:49 ` Thomas Singer
2009-12-01  8:27 ` Johannes Sixt
2009-12-01  8:55 ` Thomas Singer
2009-12-01 10:00 ` Johannes Sixt
2009-12-01 12:08 ` Thomas Singer
2009-12-01 13:17 ` Johannes Sixt
2009-12-01 15:41 ` Thomas Singer
2009-12-01 15:50 ` Erik Faye-Lund
2009-12-01 16:33 ` Thomas Singer
2010-10-30  4:02 ` brad12
2010-10-30  8:58 ` Jakub Narebski
2009-12-01 17:24 ` Jakub Narebski
2009-12-01 18:55 ` Thomas Singer
2009-12-02 16:22 ` Shawn Pearce
2010-10-30  9:52 ` demerphq
2009-12-01  9:12 ` Erik Faye-Lund
2009-12-01 12:11 ` Thomas Singer
2009-11-28 23:37 ` Reece Dunn