* Rss produced by git is not valid xml? @ 2005-11-18 16:33 Ismail Donmez 2005-11-18 17:26 ` Ismail Donmez 0 siblings, 1 reply; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 16:33 UTC (permalink / raw) To: git Hi all, I am trying to parse git's rss feed and now xml parsers seems to choke on it because of an error in the produced feed. Looking at http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=rss line 781 says : On Thu, 17 Nov 2005, David G\363mez wrote:<br/> which is part of the commit : http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=05b8b0fafd4cac75d205ecd5ad40992e2cc5934d This looks like malformed xml to me ( because of \363 part ). Is there any way to fix this so git rss can be parsed? Or is this legal in xml and parsers are buggy? Regards, ismail ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 16:33 Rss produced by git is not valid xml? Ismail Donmez @ 2005-11-18 17:26 ` Ismail Donmez 2005-11-18 19:27 ` Ismail Donmez 0 siblings, 1 reply; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 17:26 UTC (permalink / raw) To: git On Friday 18 November 2005 18:33, you wrote: > Hi all, > > I am trying to parse git's rss feed and now xml parsers seems to choke on > it because of an error in the produced feed. Looking at > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=rss > > line 781 says : > > On Thu, 17 Nov 2005, David G\363mez wrote:<br/> > > which is part of the commit : > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=comm >it;h=05b8b0fafd4cac75d205ecd5ad40992e2cc5934d Ok looks like this text is latin-1 encoded although xml is served as utf-8. /ismail ^ permalink raw reply [flat|nested] 53+ messages in thread
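The mismatch diagnosed above can be reproduced in a few lines. This is only a minimal sketch, assuming the offending byte is 0xF3 (the `\363` octal escape shown in the feed); the variable names are illustrative and not taken from gitweb.

```python
# The byte shown as \363 (octal) in the feed is 0xF3.
raw = b"David G\363mez"

# Read as Latin-1, 0xF3 is the letter "o" with an acute accent.
as_latin1 = raw.decode("latin-1")

# Read as UTF-8, 0xF3 announces a four-byte sequence, but the bytes that
# follow ("mez") are not continuation bytes, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same name, correctly encoded as UTF-8, needs two bytes for that letter.
as_utf8_bytes = as_latin1.encode("utf-8")

print(as_latin1, utf8_ok, as_utf8_bytes)
```

So a feed that declares UTF-8 but carries this byte unchanged is not well-formed for a strict parser.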
* Re: Rss produced by git is not valid xml? 2005-11-18 17:26 ` Ismail Donmez @ 2005-11-18 19:27 ` Ismail Donmez 2005-11-18 20:02 ` Kay Sievers 0 siblings, 1 reply; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 19:27 UTC (permalink / raw) To: git On Friday 18 November 2005 19:26, you wrote: > On Friday 18 November 2005 18:33, you wrote: > > Hi all, > > > > I am trying to parse git's rss feed and now xml parsers seems to choke on > > it because of an error in the produced feed. Looking at > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=rs > >s > > > > line 781 says : > > > > On Thu, 17 Nov 2005, David G\363mez wrote:<br/> > > > > which is part of the commit : > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=co > >mm it;h=05b8b0fafd4cac75d205ecd5ad40992e2cc5934d > > Ok looks like this text is latin-1 encoded although xml is served as utf-8. Any comments on this? Regards, ismail ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 19:27 ` Ismail Donmez @ 2005-11-18 20:02 ` Kay Sievers 2005-11-18 20:08 ` Ismail Donmez ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Kay Sievers @ 2005-11-18 20:02 UTC (permalink / raw) To: Ismail Donmez; +Cc: git On Fri, Nov 18, 2005 at 09:27:06PM +0200, Ismail Donmez wrote: > On Friday 18 November 2005 19:26, you wrote: > > On Friday 18 November 2005 18:33, you wrote: > > > Hi all, > > > > > > I am trying to parse git's rss feed and now xml parsers seems to choke on > > > it because of an error in the produced feed. Looking at > > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=rs > > >s > > > > > > line 781 says : > > > > > > On Thu, 17 Nov 2005, David G\363mez wrote:<br/> > > > > > > which is part of the commit : > > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=co > > >mm it;h=05b8b0fafd4cac75d205ecd5ad40992e2cc5934d > > > > Ok looks like this text is latin-1 encoded although xml is served as utf-8. > > Any comments on this? Yes, convince the git maintainers that it's incredibly stupid not to enforce utf8 in commit messages. It makes absolutely zero sense for an SCM that merges back and forth between people around the world to allow random encodings from the last century. I still can't believe that this is a subject for discussion in software developed in the year 2005. With the next round of gitweb, I will substitute these characters with valid utf8, which will show up as invalid chars. And git guys, please start to think again about your insane options, which cause more harm than good. Thanks, Kay
* Re: Rss produced by git is not valid xml? 2005-11-18 20:02 ` Kay Sievers @ 2005-11-18 20:08 ` Ismail Donmez 2005-11-18 20:22 ` Linus Torvalds 2005-11-19 0:04 ` Johannes Schindelin 2005-11-19 3:28 ` Junio C Hamano 2 siblings, 1 reply; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 20:08 UTC (permalink / raw) To: git On Friday 18 November 2005 22:02, you wrote: > On Fri, Nov 18, 2005 at 09:27:06PM +0200, Ismail Donmez wrote: > > On Friday 18 November 2005 19:26, you wrote: > > > On Friday 18 November 2005 18:33, you wrote: > > > > Hi all, > > > > > > > > I am trying to parse git's rss feed and now xml parsers seems to > > > > choke on it because of an error in the produced feed. Looking at > > > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git; > > > >a=rs s > > > > > > > > line 781 says : > > > > > > > > On Thu, 17 Nov 2005, David G\363mez wrote:<br/> > > > > > > > > which is part of the commit : > > > > http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git; > > > >a=co mm it;h=05b8b0fafd4cac75d205ecd5ad40992e2cc5934d > > > > > > Ok looks like this text is latin-1 encoded although xml is served as > > > utf-8. > > > > Any comments on this? > > Yes, convince the git maintainers, that it's incredibly stupid not to > enforce utf8 in commit messages. It makes absolutely zero sense in a > SCM, which merges forth and back between people around the world to > allow random encodings from the last century. > I totally agree, utf8 should be default else the produced XML is wrong. Its advertised as utf-8 but the content is latin1. > With the next round of gitweb, I will substitute these caracters with > valid utf8, which will show up as invalid chars. When should we expect this? Currently I can't parse commit feed without encoding to utf8 first. > And git guys, please start to think again about your insane options, > that cause more harm than anything good. Can git maintainer(s) comment on this please? 
Regards, ismail
* Re: Rss produced by git is not valid xml? 2005-11-18 20:08 ` Ismail Donmez @ 2005-11-18 20:22 ` Linus Torvalds 2005-11-18 20:28 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 20:22 UTC (permalink / raw) To: Ismail Donmez; +Cc: git On Fri, 18 Nov 2005, Ismail Donmez wrote: > > > And git guys, please start to think again about your insane options, > > that cause more harm than anything good. > > Can git maintainer(s) comment on this please? It's easy to say "just do the right thing", and ignore reality. git commit logs have always been "8-bit data". It's actually gitweb that is buggy if it claims it is UTF-8 without checking or converting it to such. I agree that UTF-8 is a good idea, but that's a totally different argument. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:22 ` Linus Torvalds @ 2005-11-18 20:28 ` H. Peter Anvin 2005-11-18 20:47 ` Linus Torvalds 2005-11-18 20:51 ` Josef Weidendorfer 2005-11-18 20:45 ` Ismail Donmez 2005-11-18 20:55 ` Kay Sievers 2 siblings, 2 replies; 53+ messages in thread From: H. Peter Anvin @ 2005-11-18 20:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ismail Donmez, git Linus Torvalds wrote: > > It's easy to say "just do the right thing", and ignore reality. > > git commit logs have always been "8-bit data". It's actually gitweb that > is buggy if it claims it is UTF-8 without checking or converting it to > such. > > I agree that UTF-8 is a good idea, but that's a totally different > argument. > I think the point is: what do you do with the data? If it *looks* like valid UTF-8, you pretty much have to assume it is; if it's not (it contains invalid UTF-8 sequences), what do you do? There are only a small handful of alternatives, and none are really good: - Reject it (it's kind of too late, should have been done at checkin) - Show them as SUBSTITUTE characters (U+FFFD). - Show them as Latin-1 or Windows-1252 - Provide a complex configuration mechanism I think Kay is going with the second option. Note this problem always exists for the data contents anyway. We can't do anything about that. What's probably more important is that tools that rely on email or other outside data sources (like CVS) do the necessary conversions, so one doesn't end up with an inadvertently incorrect repository. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
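The alternatives listed above can be sketched as one small decoding helper. This is a hypothetical illustration, not gitweb's code: the function name `decode_log` and its `fallback` parameter are assumptions, and the "complex configuration mechanism" is reduced to a single argument.

```python
def decode_log(raw: bytes, fallback: str = "replace") -> str:
    """Decode commit-log bytes for display on a page declared as UTF-8.

    fallback selects what to do when raw is not valid UTF-8:
      "reject"  - re-raise the decoding error
      "replace" - show each bad byte as U+FFFD
      "latin-1" - reinterpret the whole message as Latin-1
    """
    try:
        # If it looks like valid UTF-8, assume that it is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if fallback == "replace":
            return raw.decode("utf-8", errors="replace")
        if fallback == "latin-1":
            # Latin-1 maps every byte to a character, so this cannot fail.
            return raw.decode("latin-1")
        raise


print(decode_log(b"David G\xf3mez"))             # bad byte shown as U+FFFD
print(decode_log(b"David G\xf3mez", "latin-1"))  # reinterpreted as Latin-1
```

Whichever branch is taken, the result is a proper string that can be re-encoded as valid UTF-8 for the feed; the choice only affects what the reader sees in place of the bad bytes.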
* Re: Rss produced by git is not valid xml? 2005-11-18 20:28 ` H. Peter Anvin @ 2005-11-18 20:47 ` Linus Torvalds 2005-11-18 20:55 ` H. Peter Anvin 2005-11-18 20:51 ` Josef Weidendorfer 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 20:47 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Ismail Donmez, git On Fri, 18 Nov 2005, H. Peter Anvin wrote: > > I think the point is: what do you do with the data? If it *looks* like valid > UTF-8, you pretty much have to assume it is; if it's not (it contains invalid > UTF-8 sequences), what do you do? Btw, this is not a new issue. This is true even of data that _claims_ to be UTF-8 but contains sequences that are illegal. A program that just barfs on it is a buggy program. And yes, I know there are buggy programs out there. I seem to recall some perl(?) problems when it got UTF-8 strings that weren't, and did impossible things. > There are only a small handful of alternatives, and none are really good: > > - Reject it (it's kind of too late, should have been done at > checkin) It can't be done at checkin, since it's not _wrong_. It's 8-bit data. It's like saying that /bin/echo is an illegal program and shouldn't be executed, because it's not encoded in utf-8. I can well imagine somebody wanting to put a binary signature at the end of a commit. git shouldn't care, and the important thing to realize is that there _is_ no "encoding" for such things. So the commits don't necessarily have to have a font encoding at all, and any visualization tool should just accept that fact. > - Show them as SUBSTITUTE characters (U+FFFD). > - Show them as Latin-1 or Windows-1252 > - Provide a complex configuration mechanism > > I think Kay is going with the second option. Which is a fine option. Latin-1 is probably the right choice for the kernel, but not necessarily for other projects. Another option is to just pass them through unmodified, and encourage the XML parser to handle it. 
Anything that takes UTF-8 and doesn't have some fallback to handle malformed input is basically buggy. It simply _will_ happen occasionally, quite independently of git. You can either give up, or try to handle it. And giving up is always the wrong choice. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:47 ` Linus Torvalds @ 2005-11-18 20:55 ` H. Peter Anvin 0 siblings, 0 replies; 53+ messages in thread From: H. Peter Anvin @ 2005-11-18 20:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ismail Donmez, git Linus Torvalds wrote: > > Which is a fine option. Latin-1 is probably the right choice for the > kernel, but not necessarily for other projects. > > Another option is to just pass them through unmodified, and encourage the > XML parser to handle it. Anything that takes UTF-8 and doesn't have some > fallback to handle malformed input is basically buggy. It simply _will_ > happen occasionally, quite independently of git. You can either give up, > or try to handle it. And giving up is always the wrong choice. > Not necessarily. If you can't guarantee that you won't do something that's bad for security, giving up is the only valid choice. The problem, of course, comes into place when people write generic XML parsers -- or, for that matter, UTF-8 decoders -- and don't know what will happen to the data downstream. Trying to make invalid data valid has the same problems as DWIM (after all, it *is* DWIM): if done on the wrong side of a security barrier it has unpredictable consequences. Thus, making gitweb -- a producer application -- do the guessing is probably the right thing. Sorry, Mr. Protocol; in this malware-infested world the old adage "be liberal in what you accept, conservative in what you send" unfortunately has had to be modified. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:28 ` H. Peter Anvin 2005-11-18 20:47 ` Linus Torvalds @ 2005-11-18 20:51 ` Josef Weidendorfer 2005-11-18 21:01 ` Kay Sievers 1 sibling, 1 reply; 53+ messages in thread From: Josef Weidendorfer @ 2005-11-18 20:51 UTC (permalink / raw) To: git On Friday 18 November 2005 21:28, H. Peter Anvin wrote: > I think the point is: what do you do with the data? If it *looks* like > valid UTF-8, you pretty much have to assume it is; if it's not (it > contains invalid UTF-8 sequences), what do you do? There are only a > small handful of alternatives, and none are really good: > > - Reject it (it's kind of too late, should have been done at > checkin) > - Show them as SUBSTITUTE characters (U+FFFD). > - Show them as Latin-1 or Windows-1252 > - Provide a complex configuration mechanism > > I think Kay is going with the second option. In the case of the Linux kernel, UTF-8 of course is the way to go. As you can not reject already commited objects, the second option seems the best way. But I think it would be better to have a config option specifying the prefered encoding for commit comments in a project. Something like core.commit-encoding = Latin-1 gitweb should use this. Josef > ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:51 ` Josef Weidendorfer @ 2005-11-18 21:01 ` Kay Sievers 0 siblings, 0 replies; 53+ messages in thread From: Kay Sievers @ 2005-11-18 21:01 UTC (permalink / raw) To: Josef Weidendorfer; +Cc: git On Fri, Nov 18, 2005 at 09:51:56PM +0100, Josef Weidendorfer wrote: > On Friday 18 November 2005 21:28, H. Peter Anvin wrote: > > I think the point is: what do you do with the data? If it *looks* like > > valid UTF-8, you pretty much have to assume it is; if it's not (it > > contains invalid UTF-8 sequences), what do you do? There are only a > > small handful of alternatives, and none are really good: > > > > - Reject it (it's kind of too late, should have been done at > > checkin) > > - Show them as SUBSTITUTE characters (U+FFFD). > > - Show them as Latin-1 or Windows-1252 > > - Provide a complex configuration mechanism > > > > I think Kay is going with the second option. > > In the case of the Linux kernel, UTF-8 of course is the > way to go. As you can not reject already commited objects, the second > option seems the best way. > > But I think it would be better to have a config option specifying the > prefered encoding for commit comments in a project. Something like > > core.commit-encoding = Latin-1 > > gitweb should use this. Sorry, the 90's are over. Patch it, if you need it, I will not make it happen. Kay ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:22 ` Linus Torvalds 2005-11-18 20:28 ` H. Peter Anvin @ 2005-11-18 20:45 ` Ismail Donmez 2005-11-18 21:13 ` Linus Torvalds 2005-11-18 21:25 ` Junio C Hamano 2005-11-18 20:55 ` Kay Sievers 2 siblings, 2 replies; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 20:45 UTC (permalink / raw) To: git On Friday 18 November 2005 22:22, you wrote: > On Fri, 18 Nov 2005, Ismail Donmez wrote: > > > And git guys, please start to think again about your insane options, > > > that cause more harm than anything good. > > > > Can git maintainer(s) comment on this please? > > It's easy to say "just do the right thing", and ignore reality. > > git commit logs have always been "8-bit data". It's actually gitweb that > is buggy if it claims it is UTF-8 without checking or converting it to > such. > > I agree that UTF-8 is a good idea, but that's a totally different > argument. Maybe you could officially require all commit messages to be UTF-8; then the problem would at least be solved for future commits. Until then it should be worked around in gitweb, I guess. Regards, ismail
* Re: Rss produced by git is not valid xml? 2005-11-18 20:45 ` Ismail Donmez @ 2005-11-18 21:13 ` Linus Torvalds 2005-11-18 21:22 ` Ismail Donmez 2005-11-18 21:25 ` Junio C Hamano 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 21:13 UTC (permalink / raw) To: Ismail Donmez; +Cc: git On Fri, 18 Nov 2005, Ismail Donmez wrote: > > Maybe you could officially require all commit messages to be UTF-8 then the > problem would be just solved for future commits at least. Just think about what that would mean for a second. What do people put in commit messages? They put things like filenames, to indicate that they changed file so-and-so because of issue so-and-so, or they needed to include header file so-and-so to fix a problem. So by virtue of forcing all commit messages to be in UTF-8, you've suddenly forced all filesystems to do UTF-8 too. Take that one step further: you've also forced all the file _contents_ you talk about to be in UTF-8, since the commit message might quote part of the file ("'xyzzy' was misspelled, it should be 'abcde'"). Or alternatively, you've forced the commit message to no longer match the reality that it tries to explain. See the problem? And that's ignoring the fact that you've unilaterally forced probably 50% of asian users to use an environment that they don't normally use. Remember: it's actually pretty _easy_ for most of the western world to move to UTF-8, because 99% of what we do doesn't really care one whit, and the remaining 1% isn't usually even a huge problem (ie it's such a small percentage that even if you show the wrong character for it, people understand what it said). There's only one thing that is easier still: to force your way of working on others. This is why I'm so steadfast on it being just a stream of bytes. 
Because let's face it, no english-speaking project will ever _really_ care: we'll get a few peoples names wrong, but it's all going to be pretty irrelevant, and there's not going to be any real confusion. In contrast, _forcing_ people to use UTF-8 results in real problems, and really limits what can be done. A data stream of 8-bit bytes is really powerful. And oh, btw, it just happens to be the UNIX way. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 21:13 ` Linus Torvalds @ 2005-11-18 21:22 ` Ismail Donmez 0 siblings, 0 replies; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 21:22 UTC (permalink / raw) To: git On Friday 18 November 2005 23:13, you wrote: > On Fri, 18 Nov 2005, Ismail Donmez wrote: > > Maybe you could officially require all commit messages to be UTF-8 then > > the problem would be just solved for future commits at least. > > Just think about what that would mean for a second. > > What do people put in commit messages? They put things like filenames, to > indicate that they changed file so-and-so because of issue so-and-so, or > they needed to include header file so-and-so to fix a problem. > > So by virtue of forcing all commit messages to be in UTF-8, you've > suddenly forced all filesystems to do UTF-8 too. > > Take that one step further: you've also forced all the file _contents_ > you talk about to be in UTF-8, since the commit message might quote part > of the file ("'xyzzy' was misspelled, it should be 'abcde'"). > > Or alternatively, you've forced the commit message to no longer match the > reality that it tries to explain. > > See the problem? > > And that's ignoring the fact that you've unilaterally forced probably 50% > of asian users to use an environment that they don't normally use. > > Remember: it's actually pretty _easy_ for most of the western world to > move to UTF-8, because 99% of what we do doesn't really care one whit, and > the remaining 1% isn't usually even a huge problem (ie it's such a small > percentage that even if you show the wrong character for it, people > understand what it said). These days you can just open kwrite, select encoding and voila you don't have to change anything on the filesystem you can still use whatever $LANG you use. We would just force them to use a working editor imho. Nothing else. And thats not much to ask is it? Even joe(1) can edit utf-8 these days that must tell something. 
Regards, ismail
* Re: Rss produced by git is not valid xml? 2005-11-18 20:45 ` Ismail Donmez 2005-11-18 21:13 ` Linus Torvalds @ 2005-11-18 21:25 ` Junio C Hamano 2005-11-18 21:29 ` Ismail Donmez 1 sibling, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2005-11-18 21:25 UTC (permalink / raw) To: Ismail Donmez; +Cc: git Ismail Donmez <ismail@uludag.org.tr> writes: >> I agree that UTF-8 is a good idea, but that's a totally different >> argument. > > Maybe you could officially require all commit messages to be UTF-8 then the > problem would be just solved for future commits at least. Until then it > should be workarounded in gitweb I guess. No, that's something I will *not* do. Linus is right --- he is always right but he is slightly more right than he usually is in this particular case ;-). We allow any 8-bit data in commit log messages. We even make it easier to use utf-8 than other encodings, and we encourage use of utf-8 for obvious reasons. But we do not go further than that. Any patch to change commit-tree.c to reject binary data in a commit log message that utf-8 validator chokes at *will* be rejected. Go back to the list archive. Dig out messages on this topic. Summarize the ones that say why we encourage utf-8 in textual commit log messages, submit a patch to add that to Documentation/howto/ or perhaps Documentation/tutorial.txt, to further encourage people to use utf-8. Just do not forbid non utf-8 text nor binary data in general. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 21:25 ` Junio C Hamano @ 2005-11-18 21:29 ` Ismail Donmez 2005-11-19 8:48 ` Junio C Hamano 0 siblings, 1 reply; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 21:29 UTC (permalink / raw) To: git On Friday 18 November 2005 23:25, you wrote: > Ismail Donmez <ismail@uludag.org.tr> writes: > >> I agree that UTF-8 is a good idea, but that's a totally different > >> argument. > > > > Maybe you could officially require all commit messages to be UTF-8 then > > the problem would be just solved for future commits at least. Until then > > it should be workarounded in gitweb I guess. > > No, that's something I will *not* do. > > Linus is right --- he is always right but he is slightly more > right than he usually is in this particular case ;-). > > We allow any 8-bit data in commit log messages. We even make it > easier to use utf-8 than other encodings, and we encourage use > of utf-8 for obvious reasons. But we do not go further than > that. Any patch to change commit-tree.c to reject binary data > in a commit log message that utf-8 validator chokes at *will* be > rejected. > > Go back to the list archive. Dig out messages on this topic. > Summarize the ones that say why we encourage utf-8 in textual > commit log messages, submit a patch to add that to > Documentation/howto/ or perhaps Documentation/tutorial.txt, to > further encourage people to use utf-8. Just do not forbid non > utf-8 text nor binary data in general. Your produced XML is NOT valid then. You put encoding=utf-8 and then put latin-1 encoded data in it. You SHOULD NOT do that. Either put latin-1 as encoding in the RSS because you say its the way data should be else encode non-utf stuff to be utf-8. Regards, ismail ^ permalink raw reply [flat|nested] 53+ messages in thread
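The complaint above is about the XML declaration disagreeing with the bytes that follow it. A minimal sketch using Python's standard `xml.etree.ElementTree` parser shows the three cases; the `<item>` element and the `parses` helper are made up for illustration and are not the actual feed markup.

```python
import xml.etree.ElementTree as ET

# One element whose text contains the raw Latin-1 byte 0xF3.
body = b"<item>David G\xf3mez</item>"


def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Declared UTF-8, content Latin-1: what the feed was doing.
declared_utf8 = b'<?xml version="1.0" encoding="utf-8"?>' + body

# Fix 1: declare the encoding the data is really in.
declared_latin1 = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

# Fix 2: keep the UTF-8 declaration and convert the data.
recoded = (b'<?xml version="1.0" encoding="utf-8"?>'
           + body.decode("latin-1").encode("utf-8"))

print(parses(declared_utf8), parses(declared_latin1), parses(recoded))
```

Either fix yields the same text once parsed; only the mislabelled document is rejected.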
* Re: Rss produced by git is not valid xml? 2005-11-18 21:29 ` Ismail Donmez @ 2005-11-19 8:48 ` Junio C Hamano 0 siblings, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2005-11-19 8:48 UTC (permalink / raw) To: Ismail Donmez; +Cc: git Ismail Donmez <ismail@uludag.org.tr> writes: > Your produced XML is NOT valid then. You put encoding=utf-8 and then put > latin-1 encoded data in it. You SHOULD NOT do that. Either put latin-1 as > encoding in the RSS because you say its the way data should be else encode > non-utf stuff to be utf-8. Maybe, but that is not me ;-). ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:22 ` Linus Torvalds 2005-11-18 20:28 ` H. Peter Anvin 2005-11-18 20:45 ` Ismail Donmez @ 2005-11-18 20:55 ` Kay Sievers 2005-11-18 21:30 ` Linus Torvalds 2 siblings, 1 reply; 53+ messages in thread From: Kay Sievers @ 2005-11-18 20:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ismail Donmez, git On Fri, Nov 18, 2005 at 12:22:34PM -0800, Linus Torvalds wrote: > > > On Fri, 18 Nov 2005, Ismail Donmez wrote: > > > > > And git guys, please start to think again about your insane options, > > > that cause more harm than anything good. > > > > Can git maintainer(s) comment on this please? > > It's easy to say "just do the right thing", and ignore reality. Well the reality tells that everything that is successful does not give too many options that harm adoption. For me it's a very simple and "real" rule. It's all about a sane default, which git obviously doesn't have. You guys may look at it from the very low level, but that isn't what I call "reality". > git commit logs have always been "8-bit data". It's actually gitweb that > is buggy if it claims it is UTF-8 without checking or converting it to > such. Actually, the real bug is not to try to prevent binary nonsense in textual commit logs, which are distibuted. Remember, that you provide a SCM not a filesystem. > I agree that UTF-8 is a good idea, but that's a totally different > argument. Well, I don't see real arguments against sane a default. Thanks, Kay ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 20:55 ` Kay Sievers @ 2005-11-18 21:30 ` Linus Torvalds 2005-11-18 21:33 ` Ismail Donmez 2005-11-18 21:48 ` Linus Torvalds 0 siblings, 2 replies; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 21:30 UTC (permalink / raw) To: Kay Sievers; +Cc: Ismail Donmez, git On Fri, 18 Nov 2005, Kay Sievers wrote: > > Actually, the real bug is not to try to prevent binary nonsense in textual > commit logs, which are distibuted. Remember, that you provide a SCM not a > filesystem. I never said they were text, and in fact, I never even said I'm doing an SCM. Quite the reverse. I very much said that I'm doing a filesystem that is flexible. The fact that the headers are text-like is not so much about text as it is about flexibility and easy tool access. If you look at the git object format, for example, the header is strictly NUL-terminated ASCII, but the object itself is a pure binary data stream. Which obviously just _happens_ to often be text too, since quite often the object contents is something like a C source file, but there's a real power to _not_ thinking that it means that files are text-files. And I like UTF-8, but the fact is, all my editors and mail tools are still Latin-1. My editor converts the UTF-8 input into latin1 and keeps it in that format on disk (it writes it to the _screen_ as UTF-8 just to make the glyphs come out right, but the file it works with is still latin1). Could I change? Yup, I could change pretty easily. I wrote the code that did the latin1 conversion, and I've got source for my tools, so I could just decide one day that I'll join the 21st century and switch. I just haven't done so yet. The fact that _I_ can't be bothered, even though I'm in just about the best possible situation (I've got a keyboard with åäö on it, but they're not in my name, so I don't use them that much) should tell you something. 
Namely, it should tell you that there's a _lot_ of people who have a much harder time than I do in changing their setups. I think most of Asia _still_ doesn't use utf-8. And I _guarantee_ you that it's a hell of a lot easier for you to complain about it and say "they should" than it is for them to actually do so and convert all the programs they use. On this mailing list, the only person that I've seen pipe up about these things in the past _and_ that I suspect actually has to work with this thing in real life (instead of just from a theoretical "this is how things should be done" standpoint) is Junio. And last I heard (if I remember correctly), Junio explicitly said that a lot of the people he works with still use shift-jis. And I'm not surprised. Look on the web. As far as I know, shift-jis is still much more common than utf-8. AND IT DOESN'T MATTER ONE WHIT WHEN SOME GEEK SAYS "THEY SHOULDN'T DO THAT, THEN"! Software should conform to people, not the other way around. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
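The object layout described in this message (an ASCII header terminated by a NUL byte, followed by an uninterpreted byte stream) can be sketched as follows. The function name `git_object_id` is hypothetical, and the use of SHA-1 over header plus body is how git names objects as far as I know, not something stated in the message itself.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Name an object the way git does: SHA-1 over "<type> <size>\\0" + body."""
    # The header is plain ASCII and ends with a single NUL byte.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    # The body is never decoded or validated; it is just bytes.
    return hashlib.sha1(header + body).hexdigest()


# Text in any encoding, or no encoding at all, is equally acceptable.
print(git_object_id("blob", b"hello\n"))
print(git_object_id("blob", "G\u00f3mez\n".encode("latin-1")))
print(git_object_id("blob", bytes(range(256))))
```

Nothing in this scheme looks inside the body, which is why the same machinery stores UTF-8, Latin-1, and binary data alike.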
* Re: Rss produced by git is not valid xml? 2005-11-18 21:30 ` Linus Torvalds @ 2005-11-18 21:33 ` Ismail Donmez 2005-11-18 21:48 ` Linus Torvalds 1 sibling, 0 replies; 53+ messages in thread From: Ismail Donmez @ 2005-11-18 21:33 UTC (permalink / raw) To: git On Friday 18 November 2005 23:30, you wrote: > On Fri, 18 Nov 2005, Kay Sievers wrote: > > Actually, the real bug is not to try to prevent binary nonsense in > > textual commit logs, which are distibuted. Remember, that you provide a > > SCM not a filesystem. > > I never said they were text, and in fact, I never even said I'm doing an > SCM. Quite the reverse. I very much said that I'm doing a filesystem that > is flexible. > > The fact that the headers are text-like is not so much about text as it is > about flexibility and easy tool access. If you look at the git object > format, for example, the header is strictly NUL-terminated ASCII, but the > object itself is a pure binary data stream. Which obviously just _happens_ > to often be text too, since quite often the object contents is something > like a C source file, but there's a real power to _not_ thinking that it > means that files are text-files. > > And I like UTF-8, but the fact is, all my editors and mail tools are still > Latin-1. My editor converts the UTF-8 input into latin1 and keeps it in > that format on disk (it writes it to the _screen_ as UTF-1 just to make > the glyphs come out right, but the file it works with is still latin1). > > Could I change? Yup, I could change pretty easily. I wrote the code that > did the latin1 conversion, and I've got source for my tools, so I could > just decide one day that I'll join the 21st century and switch. I just > haven't done so yet. > > The fact that _I_ can't be bothered, even though I'm in just about the > best possible situation (I've got a keyboard with åäö on it, but they're > not in my name, so I don't use them that much) should tell you something. 
> Namely, it should tell you that there's a _lot_ of people who have a much > harder time than I do in changing their setups. > > I think most of Asia _still_ doesn't use utf-8. And I _guarantee_ you that > it's a hell of a lot easier for you to complain about it and say "they > should" than it is for them to actually do so and convert all the programs > they use. > > On this mailing list, the only person that I've seen pipe up about these > things in the past _and_ that I suspect actually has to work with this > thing in real life (instead of just from a theoretical "this is how things > should be done" standpoint) is Junio. And last I heard (if I remember > correctly), Junio explicitly said that a lot of the people he works with > still use shift-jis. > > And I'm not surprised. Look on the web. As far as I know, shift-jis is > still much more common than utf-8. > > AND IT DOESN'T MATTER ONE WHIT WHEN SOME GEEK SAYS "THEY SHOULDN'T DO > THAT, THEN"! > > Software should conform to people, not the other way around. Linus, I got your point. But the XML should reflect the data it contains. This _is_ my problem. Will the data be latin-1, OK then the xml should say its latin-1 and not lie as utf-8. Regards, ismail ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 21:30 ` Linus Torvalds 2005-11-18 21:33 ` Ismail Donmez @ 2005-11-18 21:48 ` Linus Torvalds 2005-11-18 22:12 ` H. Peter Anvin 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 21:48 UTC (permalink / raw) To: Kay Sievers; +Cc: Ismail Donmez, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 1820 bytes --] On Fri, 18 Nov 2005, Linus Torvalds wrote: > > And last I heard (if I remember correctly), Junio explicitly said that a > lot of the people he works with still use shift-jis. I should have dug it up. Apparently it's EUC-JP, not SJIS. Anyway. I literally have _no_ idea what the difference between those encodings is. I'm totally clueless when it comes to how the encodings actually work etc. I wouldn't know a Japanese character if it painted itself purple and did a risqué dance number. But I do know just how slowly these conversions happen, and what a huge deal it is for people who have documents and tools and databases that are encoded in some particular encoding. You do have to realize that while you may think that it's stupid that people use a non-utf8 encoding, those very people don't actually see a huge advantage from switching away from what has worked for them for decades, and they _do_ see huge transition pains and costs. So let's say that you have a project where the coding style includes S-JIS or EUC-JP (of which there are multiple variations, I believe, just to make things even more fun). You could argue that if such a project moves into git, it should be converted to UTF-8 at that point. That's all fine and dandy, but usually you don't do flag-days. You have people who start tracking it in git, and maybe even developing it in git, but they still have to work with the outside people. Want to do on-the-fly conversion on CVS import (or worse yet - something like clearcase)? With magic rules or fragile heuristics for binary files? That's crazy, and that's not how these things work. 
No, the way these things work is that they continue to be maintained in EUC-JP or whatever, and a tool that requires conversion is a tool that just doesn't get used. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 21:48 ` Linus Torvalds @ 2005-11-18 22:12 ` H. Peter Anvin 2005-11-18 23:20 ` Linus Torvalds 2005-11-18 23:25 ` Linus Torvalds 0 siblings, 2 replies; 53+ messages in thread From: H. Peter Anvin @ 2005-11-18 22:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kay Sievers, Ismail Donmez, git Linus Torvalds wrote: > > Want to do on-the-fly conversion on CVS import (or worse yet - something > like clearcase)? With magic rules or fragile heuristics for binary files? > That's crazy, and that's not how these things work. No, the way these > things work is that they continue to be maintained in EUC-JP or whatever, > and a tool that requires conversion is a tool that just doesn't get used. > On the fly conversion on CVS import isn't particularly crazy, as long as it's under user control. Although I was primarily thinking about it in the context of commit messages, it could be done on file contents as well, since CVS has the ability to flag files as text or as binary (-kb). We already have a bunch of options relating to how to map CVS onto git, and conversion time is a good time to do it. Similarly, it may not be a bad idea to add an *option* -- now when we have a config file mechanism -- to signal error on invalid UTF-8 import. This would keep a correct UTF-8 repository from getting inadvertently messed up. What *does* need to happen, I'm convinced, is that any tool that handles email needs to be able to take the email and convert its character set encodings (by default to UTF-8). Most MUAs today use all kinds of weird heuristics for which character set to use, and it's frequently not what the user expected. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 22:12 ` H. Peter Anvin @ 2005-11-18 23:20 ` Linus Torvalds 2005-11-18 23:34 ` H. Peter Anvin 2005-11-18 23:25 ` Linus Torvalds 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 23:20 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Kay Sievers, Ismail Donmez, git On Fri, 18 Nov 2005, H. Peter Anvin wrote: > > On the fly conversion on CVS import isn't particularly crazy, as long as it's > under user control. Actually, it is. Why? How are you going to feed your changes back to the original (and initially main) project? Hint: they're not going to pull from your git tree, are they? Ahh. Maybe patches would be a good idea. Ooops. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:20 ` Linus Torvalds @ 2005-11-18 23:34 ` H. Peter Anvin 2005-11-18 23:53 ` Andreas Ericsson 2005-11-18 23:57 ` Linus Torvalds 0 siblings, 2 replies; 53+ messages in thread From: H. Peter Anvin @ 2005-11-18 23:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kay Sievers, Ismail Donmez, git Linus Torvalds wrote: > > On Fri, 18 Nov 2005, H. Peter Anvin wrote: > >>On the fly conversion on CVS import isn't particularly crazy, as long as it's >>under user control. > > Actually, it is. > > Why? > > How are you going to feed your changes back to the original (and initially > main) project? > > Hint: they're not going to pull from your git tree, are they? > > Ahh. Maybe patches would be a good idea. > > Ooops. > You're assuming there *IS* an original (and initially main) project. There is another usage mode: "we're dumping CVS and switching to this new-fangled git thing." I have myself done this with several projects by now. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:34 ` H. Peter Anvin @ 2005-11-18 23:53 ` Andreas Ericsson 2005-11-19 1:22 ` H. Peter Anvin 2005-11-18 23:57 ` Linus Torvalds 1 sibling, 1 reply; 53+ messages in thread From: Andreas Ericsson @ 2005-11-18 23:53 UTC (permalink / raw) To: git H. Peter Anvin wrote: > Linus Torvalds wrote: > >> >> On Fri, 18 Nov 2005, H. Peter Anvin wrote: >> >>> On the fly conversion on CVS import isn't particularly crazy, as long >>> as it's >>> under user control. >> >> >> Actually, it is. >> >> Why? >> >> How are you going to feed your changes back to the original (and >> initially main) project? >> >> Hint: they're not going to pull from your git tree, are they? >> >> Ahh. Maybe patches would be a good idea. >> >> Ooops. >> > > You're assuming there *IS* an original (and initially main) project. > > There is another usage mode: "we're dumping CVS and switching to this > new-fangled git thing." I have myself done this with several projects > by now. > I'm guessing Linus' scenario is more common. I do it myself and I'd like it to keep working. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:53 ` Andreas Ericsson @ 2005-11-19 1:22 ` H. Peter Anvin 2005-11-19 8:49 ` Andreas Ericsson 0 siblings, 1 reply; 53+ messages in thread From: H. Peter Anvin @ 2005-11-19 1:22 UTC (permalink / raw) To: Andreas Ericsson; +Cc: git Andreas Ericsson wrote: >> >> You're assuming there *IS* an original (and initially main) project. >> >> There is another usage mode: "we're dumping CVS and switching to this >> new-fangled git thing." I have myself done this with several projects >> by now. > > I'm guessing Linus' scenario is more common. I do it myself and I'd like > it to keep working. > I'm not arguing that. I'm arguing that the *option* might be useful. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-19 1:22 ` H. Peter Anvin @ 2005-11-19 8:49 ` Andreas Ericsson 2005-11-19 10:58 ` Johannes Schindelin 0 siblings, 1 reply; 53+ messages in thread From: Andreas Ericsson @ 2005-11-19 8:49 UTC (permalink / raw) To: git H. Peter Anvin wrote: > Andreas Ericsson wrote: > >>> >>> You're assuming there *IS* an original (and initially main) project. >>> >>> There is another usage mode: "we're dumping CVS and switching to this >>> new-fangled git thing." I have myself done this with several >>> projects by now. >> >> >> I'm guessing Linus' scenario is more common. I do it myself and I'd >> like it to keep working. >> > > I'm not arguing that. I'm arguing that the *option* might be useful. > Isn't it already? You can install and use any hooks you like after all. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-19 8:49 ` Andreas Ericsson @ 2005-11-19 10:58 ` Johannes Schindelin 0 siblings, 0 replies; 53+ messages in thread From: Johannes Schindelin @ 2005-11-19 10:58 UTC (permalink / raw) To: Andreas Ericsson; +Cc: git Hi, On Sat, 19 Nov 2005, Andreas Ericsson wrote: > Isn't it already? You can install and use any hooks you like after all. Exactly. All you have to do is provide a recipe for Documentation/howto/. Anybody wanting to enforce policy just takes that recipe, adjusts it for her needs, and installs the hook. Hth, Dscho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:34 ` H. Peter Anvin 2005-11-18 23:53 ` Andreas Ericsson @ 2005-11-18 23:57 ` Linus Torvalds 2005-11-18 23:58 ` H. Peter Anvin 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 23:57 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Kay Sievers, Ismail Donmez, git On Fri, 18 Nov 2005, H. Peter Anvin wrote: > > There is another usage mode: "we're dumping CVS and switching to this > new-fangled git thing." I have myself done this with several projects by now. I agree that in that case, the problem space is _much_ simpler, and you're able to do much more. And I suspect it works well for projects with a few developers that can just afford to do that. And it obviously works for a big project with hundreds of developers that is forced to do it. But I suspect it's not the common way of doing things. There's already a few projects that do the "maintain in parallel" thing, like the Wine tree discussed a few days ago. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:57 ` Linus Torvalds @ 2005-11-18 23:58 ` H. Peter Anvin 2005-11-19 0:29 ` Johannes Schindelin 0 siblings, 1 reply; 53+ messages in thread From: H. Peter Anvin @ 2005-11-18 23:58 UTC (permalink / raw) To: Linus Torvalds; +Cc: Kay Sievers, Ismail Donmez, git Linus Torvalds wrote: > > On Fri, 18 Nov 2005, H. Peter Anvin wrote: > >>There is another usage mode: "we're dumping CVS and switching to this >>new-fangled git thing." I have myself done this with several projects by now. > > > I agree that in that case, the problem space is _much_ simpler, and you're > able to do much more. > > And I suspect it works well for projects with a few developers that can > just afford to do that. And it obviously works for a big project with > hundreds of developers that is forced to do it. > > But I suspect it's not the common way of doing things. There's already a > few projects that do the "maintain in parallel" thing, like the Wine tree > discussed a few days ago. > Oh, agreed. However, if you want to convert your master repository you may want to do conversion. And in *either* case you may want to convert commit messages. -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:58 ` H. Peter Anvin @ 2005-11-19 0:29 ` Johannes Schindelin 0 siblings, 0 replies; 53+ messages in thread From: Johannes Schindelin @ 2005-11-19 0:29 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Linus Torvalds, Kay Sievers, Ismail Donmez, git Hi, On Fri, 18 Nov 2005, H. Peter Anvin wrote: > And in *either* case you may want to convert commit messages. You may, and you may not. Remember, this is *free* software. If there is no technical point to it, you should not restrict people. Else you get forked... Hth, Dscho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 22:12 ` H. Peter Anvin 2005-11-18 23:20 ` Linus Torvalds @ 2005-11-18 23:25 ` Linus Torvalds 2005-11-19 0:34 ` Johannes Schindelin 2005-11-19 0:37 ` Junio C Hamano 1 sibling, 2 replies; 53+ messages in thread From: Linus Torvalds @ 2005-11-18 23:25 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Kay Sievers, Ismail Donmez, git On Fri, 18 Nov 2005, H. Peter Anvin wrote: > > Similarly, it may not be a bad idea to add an *option* -- now when we have a > config file mechanism -- to signal error on invalid UTF-8 import. This would > keep a correct UTF-8 repository from getting inadvertently messed up. This I agree with, btw. We could easily have a [core] utf=1 thing, and make git-commit-tree refuse to commit a non-UTF8 message. Of course, you could equally easily (more so?) make it just a commit trigger instead, which might well be the right thing. (And that still leaves the question open what to do about patches and pulls, but if people mainly worry about newly written commit messages itself, then at least that part is unambiguous). Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:25 ` Linus Torvalds @ 2005-11-19 0:34 ` Johannes Schindelin 2005-11-19 0:37 ` Junio C Hamano 1 sibling, 0 replies; 53+ messages in thread From: Johannes Schindelin @ 2005-11-19 0:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: H. Peter Anvin, Kay Sievers, Ismail Donmez, git Hi, On Fri, 18 Nov 2005, Linus Torvalds wrote: > This I agree with, btw. We could easily have a > > [core] > utf=1 > > thing, and make git-commit-tree refuse to commit a non-UTF8 message. > > Of course, you could equally easily (more so?) make it just a commit > trigger instead, which might well be the right thing. Actually, hooks have been introduced for exactly that purpose! Besides, they are a much more powerful tool. For example, you can not only enforce utf-8, but also replace words from a swear words list by "*beep*". So, hooks are the way to go. Introducing another way to accomplish the same thing would be like Microsoft, implementing hundreds of APIs for the same task, none of them correct. I can only underline what Linus said here: Software should work for people, not the other way round. Please, before you send some BS like "utf-8 is the only reasonable thing for everybody, everywhere, ever", read that sentence in Linus' mail again. Software should *not* restrict anybody for non-technical reasons. ever. Period. Ciao, Dscho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-18 23:25 ` Linus Torvalds 2005-11-19 0:34 ` Johannes Schindelin @ 2005-11-19 0:37 ` Junio C Hamano 2005-11-19 1:05 ` Linus Torvalds 1 sibling, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2005-11-19 0:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > Of course, you could equally easily (more so?) make it just a commit > trigger instead, which might well be the right thing. I think it is the right approach. As I repeatedly said (not that repeating things makes them right) on this list, I think that the interpretation of what is in commit log is a policy issue that is local to each project. > (And that still leaves the question open what to do about patches and > pulls, but if people mainly worry about newly written commit messages > itself, then at least that part is unambiguous). Pulls are "too late, sorry you have to live with it"; for patches, mailinfo and am have -u (I do not remember if I added it to applymbox -- I do not use applymbox anymore myself). Maybe we should make -u the default and countermand with -U to encourage the use of utf8 further? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-19 0:37 ` Junio C Hamano @ 2005-11-19 1:05 ` Linus Torvalds 2005-11-19 10:31 ` Junio C Hamano 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-19 1:05 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Fri, 18 Nov 2005, Junio C Hamano wrote: > > Pulls are "too late, sorry you have to live with it"; for > patches, mailinfo and am have -u (I do not remember if I added > it to applymbox -- I do not use applymbox anymore myself). It's in applymbox too, although the default is not to use it (and applymbox only supports the short "-u" form, not the "--utf8" one). > Maybe we should make -u the default and countermand with -U to > encourage the use of utf8 further? Probably. Although right now "-u" doesn't actually _force_ a conversion: if you have an email with 8-bit characters and no character set mentioned, it will silently just do nothing, and the end result won't be valid UTF-8 after all. I think. You're the one who wrote all the conversion stuff ;) If we want utf-8, we should probably force it, and default to the latin1 translation (with some way to specify alternatives). Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-19 1:05 ` Linus Torvalds @ 2005-11-19 10:31 ` Junio C Hamano 2005-11-19 17:52 ` Linus Torvalds 0 siblings, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2005-11-19 10:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > Although right now "-u" doesn't actually _force_ a conversion: if you have > an email with 8-bit characters and no character set mentioned, it will > silently just do nothing, and the end result won't be valid UTF-8 after > all. ... unless it was already utf8, that is. I have received a couple of patches with charset=utf-8; I think cte of them were qp, which was a bit unpleasant. > If we want utf-8, we should probably force it, and default to the latin1 > translation (with some way to specify alternatives). Well, some people on the list seem to think UTF-8 is the one and only right encoding, so for them if the message does not identify what it is in, assuming UTF-8 and not doing any conversion is probably the right thing ;-). This suggests a few flags (config items) to mailinfo: (1) if we pass thru the input intact or not (1 bit); (2) what charset to assume if the mail does not identify itself (default to latin1; specify "barf" to mean abort processing if a message with 8-bit character does not identify itself); (3) what we do when the mail does not transliterate correctly (1 bit -- fail, or remove offending bytes and pretend things are peachy -- defaulting on the stricter side); ^ permalink raw reply [flat|nested] 53+ messages in thread
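The three settings proposed above could be modeled roughly as follows. This is only a minimal sketch in C, not code from mailinfo.c: the struct, the enum, and the function pick_input_charset() are hypothetical names made up for illustration, and using NULL to stand in for "barf" is an assumption.

```c
#include <stddef.h>

/* (3) What to do when a message does not transliterate cleanly. */
enum bad_conversion_action {
	CONV_FAIL,		/* stricter default: abort processing */
	CONV_DROP_BYTES		/* remove offending bytes and carry on */
};

/* Hypothetical container for the three proposed settings. */
struct mailinfo_policy {
	int passthrough;		/* (1) pass the input through intact */
	const char *assumed_charset;	/* (2) NULL stands in for "barf" */
	enum bad_conversion_action on_bad_conversion;	/* (3) */
};

/* Defaults as proposed: convert, assume latin1, fail on bad input. */
const struct mailinfo_policy mailinfo_default_policy = {
	0, "latin1", CONV_FAIL
};

/*
 * Pick the charset to convert from.  `declared` is the charset named
 * in the mail ("" or NULL if none); `has_8bit` is nonzero if the text
 * contains bytes >= 0x80.  Returns NULL when processing should abort.
 */
const char *pick_input_charset(const struct mailinfo_policy *p,
			       const char *declared, int has_8bit)
{
	if (declared && *declared)
		return declared;	/* the mail identified itself */
	if (!has_8bit)
		return "us-ascii";	/* nothing ambiguous to convert */
	return p->assumed_charset;	/* NULL here means "barf" */
}
```

A caller would check the passthrough flag first and only consult pick_input_charset() when conversion is wanted; a NULL result maps to the "abort processing" behaviour described for "barf".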
* Re: Rss produced by git is not valid xml? 2005-11-19 10:31 ` Junio C Hamano @ 2005-11-19 17:52 ` Linus Torvalds 2005-11-20 1:16 ` Johannes Schindelin [not found] ` <20051127025249.GA12286@vrfy.org> 0 siblings, 2 replies; 53+ messages in thread From: Linus Torvalds @ 2005-11-19 17:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sat, 19 Nov 2005, Junio C Hamano wrote: > > Well, some people on the list seem to think UTF-8 is the one and > only right encoding, so for them if the message does not > identify what it is in, assuming UTF-8 and not doing any > conversion is probably the right thing ;-). If you replace "assume" with "verify", then I agree. It's pretty easy to verify whether something is valid utf-8 or not (not trivial - you have to also check the sequences for minimality, which adds a few extra tests, but it's certainly not complicated). And text with 8-bit latin1 is almost never valid utf-8. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
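The "verify" step described above, including the minimality check, might look something like the following. It is a sketch under stated assumptions, not Kay's test code and not anything from git: is_valid_utf8() is a hypothetical name, and rejecting surrogates and code points above U+10FFFF is an extra strictness choice on top of what the message mentions.

```c
#include <stddef.h>

/*
 * Return 1 if the len bytes at buf form well-formed UTF-8, 0 otherwise.
 * Hypothetical helper for illustration only.
 */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i++];
		unsigned long cp, min;
		int extra;

		if (c < 0x80)
			continue;		/* plain ASCII */
		if ((c & 0xe0) == 0xc0) {
			extra = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {
			extra = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {
			extra = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}
		while (extra--) {
			if (i >= len || (buf[i] & 0xc0) != 0x80)
				return 0;	/* truncated or bad continuation */
			cp = (cp << 6) | (buf[i++] & 0x3f);
		}
		if (cp < min)
			return 0;	/* overlong: not the minimal encoding */
		if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;	/* outside Unicode, or a surrogate */
	}
	return 1;
}
```

With this check, the Latin-1 bytes of "Gómez" from the start of the thread (a lone 0xF3 followed by ASCII) are rejected, while the two-byte UTF-8 form 0xC3 0xB3 is accepted.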
* Re: Rss produced by git is not valid xml? 2005-11-19 17:52 ` Linus Torvalds @ 2005-11-20 1:16 ` Johannes Schindelin 2005-11-20 3:10 ` Linus Torvalds [not found] ` <20051127025249.GA12286@vrfy.org> 1 sibling, 1 reply; 53+ messages in thread From: Johannes Schindelin @ 2005-11-20 1:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git Hi, On Sat, 19 Nov 2005, Linus Torvalds wrote: > And text with 8-bit latin1 is almost never valid utf-8. I had the impression utf-8 was designed in a way so you could strike "almost". But I don't have my docs handy... Ciao, Dscho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-20 1:16 ` Johannes Schindelin @ 2005-11-20 3:10 ` Linus Torvalds 2005-11-20 4:13 ` Johannes Schindelin 0 siblings, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-20 3:10 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Junio C Hamano, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 855 bytes --] On Sun, 20 Nov 2005, Johannes Schindelin wrote: > > On Sat, 19 Nov 2005, Linus Torvalds wrote: > > > > And text with 8-bit latin1 is almost never valid utf-8. > > I had the impression utf-8 was designed in a way so you could strike > "almost". But I don't have my docs handy... No, strange latin combinations will be valid utf-8. It needs to be some really strange text to be real latin1 but look like it might be utf-8, though. (In Finnish/Swedish, the letter 'ä' is code \x00E4, which in UTF-8 is the sequence \xC3\xA4. But you can't know if a text that has that sequence is UTF-8, or if it's a strange two-character latin1 sequence of "Ã¤" (character codes \x00C3 and \x00A4).) But I can pretty much guarantee that most any _sane_ latin1 text will obviously not be UTF-8, so in _practice_ you can definitely tell the two apart. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-20 3:10 ` Linus Torvalds @ 2005-11-20 4:13 ` Johannes Schindelin 0 siblings, 0 replies; 53+ messages in thread From: Johannes Schindelin @ 2005-11-20 4:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, git [-- Attachment #1: Type: TEXT/PLAIN, Size: 505 bytes --] Hi, On Sat, 19 Nov 2005, Linus Torvalds wrote: > (In Finnish/Swedish, the letter 'ä' is code \x00E4, which in UTF-8 is the > sequence \xC3\xA4. But you can't know if a text that has that sequence is > UTF-8, or if it's a strange two-character latin1 sequence of "Ã¤" > (character codes \x00C3 and \x00A4).) > > But I can pretty much guarantee that most any _sane_ latin1 text will > obviously not be UTF-8, so in _practice_ you can definitely tell the two > apart. Thank you, Dscho ^ permalink raw reply [flat|nested] 53+ messages in thread
[parent not found: <20051127025249.GA12286@vrfy.org>]
* Re: Rss produced by git is not valid xml? [not found] ` <20051127025249.GA12286@vrfy.org> @ 2005-11-27 3:57 ` Junio C Hamano 2005-11-27 4:13 ` Linus Torvalds 2005-11-27 16:18 ` Rss produced by git is not valid xml? Kay Sievers 0 siblings, 2 replies; 53+ messages in thread From: Junio C Hamano @ 2005-11-27 3:57 UTC (permalink / raw) To: Kay Sievers; +Cc: Linus Torvalds, git Kay Sievers <kay.sievers@vrfy.org> writes: > On Sat, Nov 19, 2005 at 09:52:34AM -0800, Linus Torvalds wrote: >> >> On Sat, 19 Nov 2005, Junio C Hamano wrote: >> > >> > Well, some people on the list seem to think UTF-8 is the one and >> > only right encoding, so for them if the message does not >> > identify what it is in, assuming UTF-8 and not doing any >> > conversion is probably the right thing ;-). >> >> If you replace "assume" with "verify", then I agree. One problem I have with that approach is what to do if it does not verify. Reject and ask them to re-run the program with another option --binary-log-message? > I found some test code I did a while ago for validation of > filesystem labels, cause D-BUS disconnects your session, if you > send an invalid utf-8 string to the bus. :) > > Kay Thanks. I take it that you are licensing this code to use in git when we are doing what Linus suggests? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml? 2005-11-27 3:57 ` Junio C Hamano @ 2005-11-27 4:13 ` Linus Torvalds 2005-11-28 0:39 ` [PATCH 2/3] mailinfo: allow -u to fall back on latin1 to utf8 conversion Junio C Hamano 2005-11-27 16:18 ` Rss produced by git is not valid xml? Kay Sievers 1 sibling, 1 reply; 53+ messages in thread From: Linus Torvalds @ 2005-11-27 4:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: Kay Sievers, git On Sat, 26 Nov 2005, Junio C Hamano wrote: > Kay Sievers <kay.sievers@vrfy.org> writes: > >> > >> If you replace "assume" with "verify", then I agree. > > One problem I have with that approach is what to do if it does not > verify. Reject and ask them to re-run the program with another > option --binary-log-message? We could do that. With perhaps an option to just do the trivial "latin1->utf8" translation, which will be correct in most of the western world (and, perhaps more importantly - the places it won't be correct in will almost universally have an explicit locale setting or similar, since otherwise nothing would work). In other words, in the absence of locale settings, we can pretty much assume any 8-bit data is latin1 if it isn't already utf-8. That's what a lot of tools do already (eg, gitk automatically does the right thing, exactly because it will assume non-proper utf-8 being in latin1). I'd suggest that the current "-u" flag do the latin1->utf8 autoconversion, and _without_ the "-u" flag, you'd just commit it as binary data.. Linus ^ permalink raw reply [flat|nested] 53+ messages in thread
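The "trivial" latin1->utf8 translation mentioned above really is small, because every Latin-1 byte maps directly to the Unicode code point with the same number. A minimal sketch in C, assuming a caller-supplied output buffer; latin1_to_utf8() is a hypothetical helper written for illustration, not a function from git or iconv:

```c
#include <stddef.h>

/*
 * Convert len bytes of Latin-1 at `in` to UTF-8 at `out`, adding a
 * terminating NUL.  Returns the number of bytes written (excluding the
 * NUL), or (size_t)-1 if outsize is too small.  Hypothetical helper,
 * for illustration only.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
		      unsigned char *out, size_t outsize)
{
	size_t i, o = 0;

	for (i = 0; i < len; i++) {
		unsigned char c = in[i];

		if (c < 0x80) {
			/* ASCII is identical in both encodings */
			if (o + 1 >= outsize)
				return (size_t)-1;
			out[o++] = c;
		} else {
			/* U+0080..U+00FF becomes a two-byte sequence */
			if (o + 2 >= outsize)
				return (size_t)-1;
			out[o++] = 0xc0 | (c >> 6);
			out[o++] = 0x80 | (c & 0x3f);
		}
	}
	if (o >= outsize)
		return (size_t)-1;	/* no room left for the NUL */
	out[o] = '\0';
	return o;
}
```

In the worst case the output is twice the input length plus one byte for the NUL, so a caller can size the buffer up front. A fallback of the kind discussed here would only run this on text that has already failed a UTF-8 validity check.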
* [PATCH 2/3] mailinfo: allow -u to fall back on latin1 to utf8 conversion. 2005-11-27 4:13 ` Linus Torvalds @ 2005-11-28 0:39 ` Junio C Hamano 2005-11-28 6:32 ` H. Peter Anvin 0 siblings, 1 reply; 53+ messages in thread From: Junio C Hamano @ 2005-11-28 0:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: git

When the message body does not identify what encoding it is in,
-u assumes it is in latin-1 and converts it to utf8, which is
the recommended encoding for git commit log messages.

With -u=<encoding>, the conversion is made into the specified
one, instead of utf8, to allow project-local policies.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * This says [2/3] but does not use the first one in the series,
   to keep mailinfo less dependent on git. [3/3] integrates it
   to git a bit further by using the configuration file.

 mailinfo.c |   59 +++++++++++++++++++++++++++++++++++------------------------
 1 files changed, 35 insertions(+), 24 deletions(-)

applies-to: dfac5ab58034e7129ba0d8096ca2bb6857df2242
650e4be59b9f385f56e5829d97d09e8440f174b8
diff --git a/mailinfo.c b/mailinfo.c
index cb853df..6d8c933 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -16,7 +16,7 @@ extern char *gitstrcasestr(const char *h
 static FILE *cmitmsg, *patchfile;
 
 static int keep_subject = 0;
-static int metainfo_utf8 = 0;
+static char *metainfo_charset = NULL;
 static char line[1000];
 static char date[1000];
 static char name[1000];
@@ -441,29 +441,38 @@ static int decode_b_segment(char *in, ch
 
 static void convert_to_utf8(char *line, char *charset)
 {
-	if (*charset) {
-		char *in, *out;
-		size_t insize, outsize, nrc;
-		char outbuf[4096]; /* cheat */
-		iconv_t conv = iconv_open("utf-8", charset);
-
-		if (conv == (iconv_t) -1) {
-			fprintf(stderr, "cannot convert from %s to utf-8\n",
-				charset);
+	char *in, *out;
+	size_t insize, outsize, nrc;
+	char outbuf[4096]; /* cheat */
+	static char latin_one[] = "latin-1";
+	char *input_charset = *charset ? charset : latin_one;
+	iconv_t conv = iconv_open(metainfo_charset, input_charset);
+
+	if (conv == (iconv_t) -1) {
+		static int warned_latin1_once = 0;
+		if (input_charset != latin_one) {
+			fprintf(stderr, "cannot convert from %s to %s\n",
+				input_charset, metainfo_charset);
 			*charset = 0;
-			return;
 		}
-		in = line;
-		insize = strlen(in);
-		out = outbuf;
-		outsize = sizeof(outbuf);
-		nrc = iconv(conv, &in, &insize, &out, &outsize);
-		iconv_close(conv);
-		if (nrc == (size_t) -1)
-			return;
-		*out = 0;
-		strcpy(line, outbuf);
+		else if (!warned_latin1_once) {
+			warned_latin1_once = 1;
+			fprintf(stderr, "tried to convert from %s to %s, "
+				"but your iconv does not work with it.\n",
+				input_charset, metainfo_charset);
+		}
+		return;
 	}
+	in = line;
+	insize = strlen(in);
+	out = outbuf;
+	outsize = sizeof(outbuf);
+	nrc = iconv(conv, &in, &insize, &out, &outsize);
+	iconv_close(conv);
+	if (nrc == (size_t) -1)
+		return;
+	*out = 0;
+	strcpy(line, outbuf);
 }
 
 static void decode_header_bq(char *it)
@@ -511,7 +520,7 @@ static void decode_header_bq(char *it)
 		}
 		if (sz < 0)
 			return;
-		if (metainfo_utf8)
+		if (metainfo_charset)
 			convert_to_utf8(piecebuf, charset_q);
 		strcpy(out, piecebuf);
 		out += strlen(out);
@@ -590,7 +599,7 @@ static int handle_commit_msg(void)
 		 * normalize the log message to UTF-8.
 		 */
 		decode_transfer_encoding(line);
-		if (metainfo_utf8)
+		if (metainfo_charset)
 			convert_to_utf8(line, charset);
 		fputs(line, cmitmsg);
 	} while (fgets(line, sizeof(line), stdin) != NULL);
@@ -720,7 +729,9 @@ int main(int argc, char **argv)
 		if (!strcmp(argv[1], "-k"))
 			keep_subject = 1;
 		else if (!strcmp(argv[1], "-u"))
-			metainfo_utf8 = 1;
+			metainfo_charset = "utf-8";
+		else if (!strncmp(argv[1], "-u=", 3))
+			metainfo_charset = argv[1] + 3;
 		else
 			usage();
 		argc--; argv++;
---
0.99.9.GIT

^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 2/3] mailinfo: allow -u to fall back on latin1 to utf8 conversion. 2005-11-28 0:39 ` [PATCH 2/3] mailinfo: allow -u to fall back on latin1 to utf8 conversion Junio C Hamano @ 2005-11-28 6:32 ` H. Peter Anvin 2005-11-28 9:21 ` Junio C Hamano 0 siblings, 1 reply; 53+ messages in thread From: H. Peter Anvin @ 2005-11-28 6:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Junio C Hamano wrote: > When the message body does not identify what encoding it is in, > -u assumes it is in latin-1 and converts it to utf8, which is > the recommended encoding for git commit log messages. > > With -u=<encoding>, the conversion is made into the specified > one, instead of utf8, to allow project-local policies. > > Signed-off-by: Junio C Hamano <junkio@cox.net> > -u= is very odd syntax. Typically you see "-u argument" (sometimes you have "-u" and "-U argument" as a pair); --foo=argument is used for long options, although even there "--foo argument" tends to be used at least when the argument is required. Incidentally, any reason we're not using getopt_long() for command-line parsing? -hpa ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 2/3] mailinfo: allow -u to fall back on latin1 to utf8 conversion. 2005-11-28 6:32 ` H. Peter Anvin @ 2005-11-28 9:21 ` Junio C Hamano 0 siblings, 0 replies; 53+ messages in thread From: Junio C Hamano @ 2005-11-28 9:21 UTC (permalink / raw) To: H. Peter Anvin; +Cc: git "H. Peter Anvin" <hpa@zytor.com> writes: > Junio C Hamano wrote: >> When the message body does not identify what encoding it is in, >> -u assumes it is in latin-1 and converts it to utf8, which is >> the recommended encoding for git commit log messages. >> With -u=<encoding>, the conversion is made into the specified >> one, instead of utf8, to allow project-local policies. >> Signed-off-by: Junio C Hamano <junkio@cox.net> > > -u= is very odd syntax. Fair enough. "--encoding=<encoding>" then. > Incidentally, any reason we're not using getopt_long() for command-line > parsing? There was a talk about using popt a while back but we never got around to it, primarily because we were running too fast for parties interested in command line parsing clean-ups to catch up. I think we are almost done and immediately post 1.0 when things stabilize may be a good time to do it if somebody wants to go wild, but not before please. ^ permalink raw reply [flat|nested] 53+ messages in thread
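For readers wondering what the getopt_long() route raised above would look like with the "--encoding=<encoding>" spelling, here is a rough sketch. It is not how git parses its options; struct mailinfo_opts and parse_mailinfo_opts() are hypothetical names, and the option set (-k, -u, --encoding) is assumed from this thread only.

```c
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option holder; field names are illustrative. */
struct mailinfo_opts {
	int keep_subject;	/* -k */
	const char *charset;	/* -u or --encoding=<encoding>; NULL if unset */
};

/*
 * Parse -k, -u and --encoding=<encoding>.  Returns the index of the
 * first non-option argument, or -1 on an unknown or malformed option.
 */
int parse_mailinfo_opts(int argc, char **argv, struct mailinfo_opts *o)
{
	static const struct option longopts[] = {
		{ "encoding", required_argument, NULL, 'e' },
		{ NULL, 0, NULL, 0 }
	};
	int c;

	o->keep_subject = 0;
	o->charset = NULL;
	while ((c = getopt_long(argc, argv, "ku", longopts, NULL)) != -1) {
		switch (c) {
		case 'k':
			o->keep_subject = 1;
			break;
		case 'u':
			o->charset = "utf-8";	/* short form keeps the old meaning */
			break;
		case 'e':
			o->charset = optarg;	/* both "--encoding=x" and "--encoding x" */
			break;
		default:
			return -1;	/* getopt_long already printed a message */
		}
	}
	return optind;
}
```

Because 'e' is not listed in the short-option string, the encoding can only be given through the long option, which sidesteps the "-u=" spelling objected to above.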
* Re: Rss produced by git is not valid xml?
  2005-11-27 3:57 ` Junio C Hamano
  2005-11-27 4:13 ` Linus Torvalds
@ 2005-11-27 16:18 ` Kay Sievers
  1 sibling, 0 replies; 53+ messages in thread
From: Kay Sievers @ 2005-11-27 16:18 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, git
On Sat, Nov 26, 2005 at 07:57:48PM -0800, Junio C Hamano wrote:
> Kay Sievers <kay.sievers@vrfy.org> writes:
> > On Sat, Nov 19, 2005 at 09:52:34AM -0800, Linus Torvalds wrote:
> >> On Sat, 19 Nov 2005, Junio C Hamano wrote:
> >> >
> >> > Well, some people on the list seem to think UTF-8 is the one and
> >> > only right encoding, so for them if the message does not
> >> > identify what it is in, assuming UTF-8 and not doing any
> >> > conversion is probably the right thing ;-).
> >>
> >> If you replace "assume" with "verify", then I agree.
> > I found some test code I did a while ago for validation of
> > filesystem labels, cause D-BUS disconnects your session, if you
> > send an invalid utf-8 string to the bus. :)
>
> Thanks. I take it that you are licensing this code to use in
> git when we are doing what Linus suggests?

Sure, it's free to use under any version of the GPL git uses itself.

Thanks,
Kay
^ permalink raw reply [flat|nested] 53+ messages in thread
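The "verify instead of assume" idea quoted above can be sketched roughly as follows. This is only an illustration in Python, not Kay's test code (which is not reproduced in this message); the helper name `is_valid_utf8` is made up for the example, and the sample bytes mirror the `G\363mez` name reported at the start of the thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Latin-1 encoded name: byte 0xF3 (octal 363) is not a complete UTF-8 sequence here.
latin1_name = b"David G\363mez"
# The same name encoded as UTF-8.
utf8_name = "David G\u00f3mez".encode("utf-8")

print(is_valid_utf8(latin1_name))  # False
print(is_valid_utf8(utf8_name))    # True
```

A check like this only says whether the bytes are well-formed UTF-8; it cannot tell which legacy encoding an invalid message was actually written in.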
* Re: Rss produced by git is not valid xml?
  2005-11-18 20:02 ` Kay Sievers
  2005-11-18 20:08 ` Ismail Donmez
@ 2005-11-19 0:04 ` Johannes Schindelin
  2005-11-20 18:28 ` H. Peter Anvin
  2005-11-19 3:28 ` Junio C Hamano
  2 siblings, 1 reply; 53+ messages in thread
From: Johannes Schindelin @ 2005-11-19 0:04 UTC (permalink / raw)
To: Kay Sievers; +Cc: Ismail Donmez, git
Hi,

On Fri, 18 Nov 2005, Kay Sievers wrote:
> Yes, convince the git maintainers, that it's incredibly stupid not to
> enforce utf8 in commit messages. It makes absolutely zero sense in a
> SCM, which merges forth and back between people around the world to
> allow random encodings from the last century.

Oh, but it makes sense! Just because you happen to work on a very
international project does not mean everybody does.

Just because you happen to like utf-8 does not mean that you still do in
2046. The encoding-du-jour might well be a 64-bit wide char code by then,
since they'll laugh about our dreaming about terabytes.

BTW, utf-8 was designed on purpose to be easily distinguishable from other
encodings so that you don't have to rely on every document obeying a
certain encoding.

Hth,
Dscho
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml?
  2005-11-19 0:04 ` Johannes Schindelin
@ 2005-11-20 18:28 ` H. Peter Anvin
  2005-11-21 8:38 ` Johannes Schindelin
  0 siblings, 1 reply; 53+ messages in thread
From: H. Peter Anvin @ 2005-11-20 18:28 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Kay Sievers, Ismail Donmez, git
Johannes Schindelin wrote:
> Hi,
>
> On Fri, 18 Nov 2005, Kay Sievers wrote:
>
>
>>Yes, convince the git maintainers, that it's incredibly stupid not to
>>enforce utf8 in commit messages. It makes absolutely zero sense in a
>>SCM, which merges forth and back between people around the world to
>>allow random encodings from the last century.
>
>
> Oh, but it makes sense! Just because you happen to work on a very
> international project does not mean everybody does.
>
> Just because you happen to like utf-8 does not mean that you still do in
> 2046. The encoding-du-jour might well be a 64-bit wide char code by then,
> since they'll laugh about our dreaming about terabytes.
>
> BTW, utf-8 was designed on purpose to be easily distinguishable from other
> encodings so that you don't have to rely on every document obeying a
> certain encoding.
>

No, it wasn't. It was designed on purpose to be ASCII-compatible,
substring-safe, and minimally stateful.

Furthermore, it's extensible. Although the original UTF-8 is limited to
31 bits, and the officially published UTF-8 is further crippled to 21
bits by Mirco$oft cronies who wanted it to be brainfuck-compatible with
UTF-16, it could easily be extended to 64 bits or beyond.

I think it's *definitely* safe to say that whatever encoding we'll use
in 2046, current UTF-8 will be a subset. If you don't believe me,
consider how long we've had ASCII and the first of the design criteria
for UTF-8 that I listed in the first paragraph.

-hpa
^ permalink raw reply [flat|nested] 53+ messages in thread
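The properties listed above (ASCII compatibility, substring safety, little state, and the 21-bit limit of the published UTF-8) can be illustrated with a small sketch. It uses only Python's built-in codecs; the helper names `is_continuation` and `char_starts` are made up for this example and do not come from the thread.

```python
# ASCII compatibility: pure-ASCII text has the same bytes in ASCII and UTF-8.
assert "git".encode("utf-8") == "git".encode("ascii")

# Every byte of a multi-byte sequence has its high bit set, so an ASCII
# byte can never appear inside the encoding of a non-ASCII character.
multibyte = "\u00f3".encode("utf-8")  # b"\xc3\xb3"
assert all(byte >= 0x80 for byte in multibyte)


def is_continuation(byte: int) -> bool:
    """Continuation bytes always match the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_starts(data: bytes) -> list:
    """Offsets where a character begins, found without decoding state."""
    return [i for i, byte in enumerate(data) if not is_continuation(byte)]


encoded = "G\u00f3mez".encode("utf-8")  # b"G\xc3\xb3mez"
print(char_starts(encoded))  # [0, 1, 3, 4, 5]

# The published UTF-8 stops at U+10FFFF, which needs 21 bits and 4 bytes.
print((0x10FFFF).bit_length())               # 21
print(len(chr(0x10FFFF).encode("utf-8")))    # 4
```

Because lead bytes and continuation bytes are distinguishable on their own, a reader dropped at an arbitrary offset can resynchronize by skipping at most a few continuation bytes.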
* Re: Rss produced by git is not valid xml?
  2005-11-20 18:28 ` H. Peter Anvin
@ 2005-11-21 8:38 ` Johannes Schindelin
  2005-11-21 9:28 ` H. Peter Anvin
  0 siblings, 1 reply; 53+ messages in thread
From: Johannes Schindelin @ 2005-11-21 8:38 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Kay Sievers, Ismail Donmez, git
Hi,

On Sun, 20 Nov 2005, H. Peter Anvin wrote:
> Johannes Schindelin wrote:
> >
> > BTW, utf-8 was designed on purpose to be easily distinguishable from
> > other encodings so that you don't have to rely on every document
> > obeying a certain encoding.
> >
>
> No, it wasn't. It was designed on purpose to be ASCII-compatible,
> substring-safe, and minimally stateful.

For the record, my information stems from

http://en.wikipedia.org/wiki/Utf-8#Rationale_behind_UTF-8.27s_mechanics

Hth,
Dscho
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml?
  2005-11-21 8:38 ` Johannes Schindelin
@ 2005-11-21 9:28 ` H. Peter Anvin
  0 siblings, 0 replies; 53+ messages in thread
From: H. Peter Anvin @ 2005-11-21 9:28 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Kay Sievers, Ismail Donmez, git
Johannes Schindelin wrote:
> Hi,
>
> On Sun, 20 Nov 2005, H. Peter Anvin wrote:
>
>
>>Johannes Schindelin wrote:
>>
>>>BTW, utf-8 was designed on purpose to be easily distinguishable from
>>>other encodings so that you don't have to rely on every document
>>>obeying a certain encoding.
>>>
>>
>>No, it wasn't. It was designed on purpose to be ASCII-compatible,
>>substring-safe, and minimally stateful.
>
>
> For the record, my information stems from
>
> http://en.wikipedia.org/wiki/Utf-8#Rationale_behind_UTF-8.27s_mechanics
>

That article is a bit confusing, as it mixes rationale with commentary.

-hpa
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml?
  2005-11-18 20:02 ` Kay Sievers
  2005-11-18 20:08 ` Ismail Donmez
  2005-11-19 0:04 ` Johannes Schindelin
@ 2005-11-19 3:28 ` Junio C Hamano
  2005-11-19 4:35 ` H. Peter Anvin
  2 siblings, 1 reply; 53+ messages in thread
From: Junio C Hamano @ 2005-11-19 3:28 UTC (permalink / raw)
To: git
I just looked at the diff this commit introduces:

e6bd23911efd0a2bd756c77d9e7ba6576eb739a1
Documentation: asciidoc sources are utf-8

with gitk (BTW, I pulled from paulus today, so "master" branch
has the latest gitk) while my locale is set to LC_CTYPE=en_US.utf8.

Surprisingly, the diff to Documentation/git-pack-redundant.txt,
which changes Lukas' name originally incorrectly encoded in
iso-8859-1 to utf-8, was shown and both pre-image and post-image
lines are readable.

I do not know how tcl/tk does it, but it is doing the right
thing.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Rss produced by git is not valid xml?
  2005-11-19 3:28 ` Junio C Hamano
@ 2005-11-19 4:35 ` H. Peter Anvin
  0 siblings, 0 replies; 53+ messages in thread
From: H. Peter Anvin @ 2005-11-19 4:35 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Junio C Hamano wrote:
> I just looked at the diff this commit introduces:
>
> e6bd23911efd0a2bd756c77d9e7ba6576eb739a1
> Documentation: asciidoc sources are utf-8
>
> with gitk (BTW, I pulled from paulus today, so "master" branch
> has the latest gitk) while my locale is set to LC_CTYPE=en_US.utf8.
>
> Surprisingly, the diff to Documentation/git-pack-redundant.txt,
> which changes Lukas' name originally incorrectly encoded in
> iso-8859-1 to utf-8, was shown and both pre-image and post-image
> lines are readable.
>
> I do not know how tcl/tk does it, but it is doing the right
> thing.
>

Tcl/Tk assumes that anything that isn't valid UTF-8 is Latin-1.

-hpa
^ permalink raw reply [flat|nested] 53+ messages in thread
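The fallback described above ("anything that isn't valid UTF-8 is Latin-1") might look like the sketch below. This is an assumption-laden illustration in Python, not Tcl/Tk's actual code; the function name `decode_with_latin1_fallback` is invented here, and the fallback is applied to the whole byte string at once, which may differ from what Tcl/Tk does internally.

```python
def decode_with_latin1_fallback(data: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid, otherwise treat them as Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this cannot fail.
        return data.decode("latin-1")


pre_image = "G\u00f3mez".encode("latin-1")   # b"G\xf3mez", not valid UTF-8
post_image = "G\u00f3mez".encode("utf-8")    # b"G\xc3\xb3mez"

# Both byte strings come out as the same readable text.
print(decode_with_latin1_fallback(pre_image) == decode_with_latin1_fallback(post_image))  # True
```

This would explain why both the pre-image and the post-image lines were readable in a UTF-8 locale: the UTF-8 line decodes directly and the Latin-1 line takes the fallback path.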
* Re: Rss produced by git is not valid xml?
@ 2005-11-19 6:31 Marco Costalba
0 siblings, 0 replies; 53+ messages in thread
From: Marco Costalba @ 2005-11-19 6:31 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Junio C Hamano, git
H. Peter Anvin wrote:
> Junio C Hamano wrote:
>
>> I just looked at the diff this commit introduces:
>>
>> e6bd23911efd0a2bd756c77d9e7ba6576eb739a1
>> Documentation: asciidoc sources are utf-8
>>
>> with gitk (BTW, I pulled from paulus today, so "master" branch
>> has the latest gitk) while my locale is set to LC_CTYPE=en_US.utf8.
>>
>> Surprisingly, the diff to Documentation/git-pack-redundant.txt,
>> which changes Lukas' name originally incorrectly encoded in
>> iso-8859-1 to utf-8, was shown and both pre-image and post-image
>> lines are readable.
>>
>> I do not know how tcl/tk does it, but it is doing the right
>> thing.
>>
>
> Tcl/Tk assumes that anything that isn't valid UTF-8 is Latin-1.
>
> -hpa
My locale is set to LC_CTYPE=it_IT (the local codec is ISO 8859-15).
Gitk shows the pre-image lines correctly, but not the post-image lines.
BTW, it's the same output I get with
git-diff-tree -p e6bd23911efd0a2bd756c77d9e7ba6576eb739a1
run from KDE Konsole.
So I think the local encoding (LC_CTYPE) plays a role in the story.
Marco
^ permalink raw reply [flat|nested] 53+ messages in thread
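The locale effect reported above can be reproduced in a rough sketch: the same bytes are rendered differently depending on the encoding the display assumes. The sample character "ö" and the helper name `render` are assumptions chosen for illustration (the actual file contents are not shown in the thread), and `render` only mimics a terminal by decoding with a fixed codec.

```python
def render(data: bytes, display_encoding: str) -> str:
    """Mimic a terminal that interprets raw bytes in a fixed encoding."""
    return data.decode(display_encoding, errors="replace")


sample = "\u00f6"                         # "ö", hypothetical sample character
pre_image = sample.encode("iso-8859-1")   # b"\xf6", the old Latin-1 line
post_image = sample.encode("utf-8")       # b"\xc3\xb6", the new UTF-8 line

# In an ISO 8859-15 locale the Latin-1 byte displays as intended ...
print(render(pre_image, "iso-8859-15"))   # ö
# ... while the two UTF-8 bytes show up as two unrelated characters.
print(render(post_image, "iso-8859-15"))  # Ã¶
```

Under this reading, a display that decodes strictly by the locale's codec gets the pre-image right and the post-image wrong in an ISO 8859-15 locale, which matches the observation in the message.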