All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH/RFC] gitweb: parse_commit_text encoding fix
@ 2009-08-01  8:28 Zoltán Füzesi
  2009-08-01  9:21 ` Jakub Narebski
  0 siblings, 1 reply; 8+ messages in thread
From: Zoltán Füzesi @ 2009-08-01  8:28 UTC (permalink / raw)
  To: git; +Cc: giuseppe.bilotta, Zoltán Füzesi

Call to_utf8 when parsing author and committer names, otherwise they will appear
with bad encoding if they written by using chop_and_escape_str.

Signed-off-by: Zoltán Füzesi <zfuzesi@eaglet.hu>
---
 gitweb/gitweb.perl |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 7fbd5ff..06bbf60 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -2570,22 +2570,21 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ([0-9a-fA-F]{40})$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = $1;
+			$co{'author'} = to_utf8($1);
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
-				$co{'author_name'}  = $1;
+				$co{'author_name'}  = to_utf8($1);
 				$co{'author_email'} = $2;
 			} else {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = $1;
+			$co{'committer'} = to_utf8($1);
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
-			$co{'committer_name'} = $co{'committer'};
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
-				$co{'committer_name'}  = $1;
+				$co{'committer_name'}  = to_utf8($1);
 				$co{'committer_email'} = $2;
 			} else {
 				$co{'committer_name'} = $co{'committer'};
-- 
1.6.4.13.ge6580

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] gitweb: parse_commit_text encoding fix
  2009-08-01  8:28 [PATCH/RFC] gitweb: parse_commit_text encoding fix Zoltán Füzesi
@ 2009-08-01  9:21 ` Jakub Narebski
  2009-08-01 16:55   ` Füzesi Zoltán
  0 siblings, 1 reply; 8+ messages in thread
From: Jakub Narebski @ 2009-08-01  9:21 UTC (permalink / raw)
  To: Zoltán Füzesi; +Cc: git, Giuseppe Bilotta

Zoltán Füzesi <zfuzesi@eaglet.hu> writes:

> Call to_utf8 when parsing author and committer names, otherwise they
> will appear with bad encoding if they written by using
> chop_and_escape_str.

[re-wrapped]

> 
> Signed-off-by: Zoltán Füzesi <zfuzesi@eaglet.hu>

Thanks.


Still, I do wonder if it would be possible to simply do the following:

  -binmode STDOUT, ':utf8';
  +use open qw(:std :utf8);

...but it unfortunately doesn't work.  It was tried in
  http://thread.gmane.org/gmane.comp.version-control.git/87129/focus=87135

to_utf8() has at least (possible) fallback if it encounters characters
outside of UTF-8 coding.

> -			$co{'author'} = $1;
> +			$co{'author'} = to_utf8($1);

-- 
Jakub Narebski

Git User's Survey 2009: 
http://tinyurl.com/GitSurvey2009

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] gitweb: parse_commit_text encoding fix
  2009-08-01  9:21 ` Jakub Narebski
@ 2009-08-01 16:55   ` Füzesi Zoltán
  2009-08-02  7:42     ` [PATCH] " Zoltán Füzesi
  0 siblings, 1 reply; 8+ messages in thread
From: Füzesi Zoltán @ 2009-08-01 16:55 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

2009/8/1 Jakub Narebski <jnareb@gmail.com>:
>
> Thanks.
>
>
> Still, I do wonder if it would be possible to simply do the following:
>
>  -binmode STDOUT, ':utf8';
>  +use open qw(:std :utf8);
>
> ...but it unfortunately doesn't work.  It was tried in
>  http://thread.gmane.org/gmane.comp.version-control.git/87129/focus=87135
>
> to_utf8() has at least (possible) fallback if it encounters characters
> outside of UTF-8 coding.
>

The following 2 changes in my patch are unnecessary, but please
confirm (I'm not familiar with perl (yet)):

-                               $co{'author_name'}  = $1;
+                               $co{'author_name'}  = to_utf8($1);

-                               $co{'committer_name'}  = $1;
+                               $co{'committer_name'}  = to_utf8($1);

BR,
Zé

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] gitweb: parse_commit_text encoding fix
  2009-08-01 16:55   ` Füzesi Zoltán
@ 2009-08-02  7:42     ` Zoltán Füzesi
  2009-08-04  6:59       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Zoltán Füzesi @ 2009-08-02  7:42 UTC (permalink / raw)
  To: git; +Cc: Zoltán Füzesi

Call to_utf8 when parsing author and committer names, otherwise they will appear
with bad encoding if they written by using chop_and_escape_str.

Signed-off-by: Zoltán Füzesi <zfuzesi@eaglet.hu>
---
 gitweb/gitweb.perl |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 7fbd5ff..4f05194 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -2570,7 +2570,7 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ([0-9a-fA-F]{40})$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = $1;
+			$co{'author'} = to_utf8($1);
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -2580,10 +2580,9 @@ sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = $1;
+			$co{'committer'} = to_utf8($1);
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
-			$co{'committer_name'} = $co{'committer'};
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'committer_name'}  = $1;
 				$co{'committer_email'} = $2;
-- 
1.6.4.13.ge6580

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] gitweb: parse_commit_text encoding fix
  2009-08-02  7:42     ` [PATCH] " Zoltán Füzesi
@ 2009-08-04  6:59       ` Junio C Hamano
  2009-08-06  8:15         ` Zoltán Füzesi
  2009-08-07  0:41         ` Jakub Narebski
  0 siblings, 2 replies; 8+ messages in thread
From: Junio C Hamano @ 2009-08-04  6:59 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Zoltán Füzesi

Zoltán Füzesi <zfuzesi@eaglet.hu> writes:

> Call to_utf8 when parsing author and committer names, otherwise they will appear
> with bad encoding if they written by using chop_and_escape_str.
>
> Signed-off-by: Zoltán Füzesi <zfuzesi@eaglet.hu>
> ---

Thanks, Zoltán.

We should be able to set up a script that scrapes the output to test this
kind of thing.  We may not want to have a test pattern that matches too
strictly for the current structure and appearance of the output
(e.g. counting nested <div>s, presentation styles and such), but if we can
robustly scrape off HTML tags (e.g. "elinks -dump") and check the
remaining payload, it might be enough.

Jakub what do you think?  I suspect that scraping approach may turn out to
be too fragile for tests to be worth doing, but I am just throwing out a
thought.

>  gitweb/gitweb.perl |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 7fbd5ff..4f05194 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -2570,7 +2570,7 @@ sub parse_commit_text {
>  		} elsif ((!defined $withparents) && ($line =~ m/^parent ([0-9a-fA-F]{40})$/)) {
>  			push @parents, $1;
>  		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
> -			$co{'author'} = $1;
> +			$co{'author'} = to_utf8($1);
>  			$co{'author_epoch'} = $2;
>  			$co{'author_tz'} = $3;
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -2580,10 +2580,9 @@ sub parse_commit_text {
>  				$co{'author_name'} = $co{'author'};
>  			}
>  		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
> -			$co{'committer'} = $1;
> +			$co{'committer'} = to_utf8($1);
>  			$co{'committer_epoch'} = $2;
>  			$co{'committer_tz'} = $3;
> -			$co{'committer_name'} = $co{'committer'};
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> -- 
> 1.6.4.13.ge6580

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gitweb: parse_commit_text encoding fix
  2009-08-04  6:59       ` Junio C Hamano
@ 2009-08-06  8:15         ` Zoltán Füzesi
  2009-08-07 20:31           ` Jakub Narebski
  2009-08-07  0:41         ` Jakub Narebski
  1 sibling, 1 reply; 8+ messages in thread
From: Zoltán Füzesi @ 2009-08-06  8:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jakub Narebski, git

2009/8/4 Junio C Hamano <gitster@pobox.com>:
>
> Thanks, Zoltán.
>
> We should be able to set up a script that scrapes the output to test this
> kind of thing.  We may not want to have a test pattern that matches too
> strictly for the current structure and appearance of the output
> (e.g. counting nested <div>s, presentation styles and such), but if we can
> robustly scrape off HTML tags (e.g. "elinks -dump") and check the
> remaining payload, it might be enough.
>
> Jakub what do you think?  I suspect that scraping approach may turn out to
> be too fragile for tests to be worth doing, but I am just throwing out a
> thought.
>

This issue comes out when chop_and_escape_str function is called with
a non-ascii string (like my name :)) without before calling to_utf8 on
it. "author_name" and "committer_name" are two examples, and
"author_name" shows up with bad encoding in HTML.

Example from one of my repos (little piece from shortlog output):
<td class="author"><span title="Füzesi Zoltán">Füzesi Zoltán</span></td>
After applying the patch:
<td class="author">Füzesi Zoltán</td>

This is an "old" (seen in 1.5.6 version too) and (I think) minor issue.
I haven't spent time on thinking how a test script could show this yet.
Waiting for Jakub's reaction.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gitweb: parse_commit_text encoding fix
  2009-08-04  6:59       ` Junio C Hamano
  2009-08-06  8:15         ` Zoltán Füzesi
@ 2009-08-07  0:41         ` Jakub Narebski
  1 sibling, 0 replies; 8+ messages in thread
From: Jakub Narebski @ 2009-08-07  0:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Zoltán Füzesi

On Tue, 4 Aug 2009, Junio C Hamano wrote:
> Zoltán Füzesi <zfuzesi@eaglet.hu> writes:
> 
> > Call to_utf8 when parsing author and committer names, otherwise they will appear
> > with bad encoding if they written by using chop_and_escape_str.
> >
> > Signed-off-by: Zoltán Füzesi <zfuzesi@eaglet.hu>
> > ---
> 
> Thanks, Zoltán.
> 
> We should be able to set up a script that scrapes the output to test this
> kind of thing.  We may not want to have a test pattern that matches too
> strictly for the current structure and appearance of the output
> (e.g. counting nested <div>s, presentation styles and such), but if we can
> robustly scrape off HTML tags (e.g. "elinks -dump") and check the
> remaining payload, it might be enough.
> 
> Jakub what do you think?  I suspect that scraping approach may turn out to
> be too fragile for tests to be worth doing, but I am just throwing out a
> thought.

First, I'd like to have existing t9500-gitweb-standalone-no-errors.sh
be about Perl errors and warning only, as it is now.  Anything outside
this should IMVHO be put in separate test.

Second, for checking whether gitweb handles non US-ASCII input correctly
we don't need HTML scrapping or parsing.  We can simply check if we have
correct string in output... and (after Zoltán Füzesi example) that we
don't have incorrect one.  For example if we have 'xxxóxxx' in input,
then there is 'xxxóxxx' in output, and that all match againts 'xxx.xxx'
matches 'xxxóxxx'.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] gitweb: parse_commit_text encoding fix
  2009-08-06  8:15         ` Zoltán Füzesi
@ 2009-08-07 20:31           ` Jakub Narebski
  0 siblings, 0 replies; 8+ messages in thread
From: Jakub Narebski @ 2009-08-07 20:31 UTC (permalink / raw)
  To: Zoltán Füzesi; +Cc: Junio C Hamano, git

On Thu, 6 Aug 2009, Zoltán Füzesi wrote:
> 2009/8/4 Junio C Hamano <gitster@pobox.com>:
> >
> > Thanks, Zoltán.
> >
> > We should be able to set up a script that scrapes the output to test this
> > kind of thing.  We may not want to have a test pattern that matches too
> > strictly for the current structure and appearance of the output
> > (e.g. counting nested <div>s, presentation styles and such), but if we can
> > robustly scrape off HTML tags (e.g. "elinks -dump") and check the
> > remaining payload, it might be enough.
> >
> > Jakub what do you think?  I suspect that scraping approach may turn out to
> > be too fragile for tests to be worth doing, but I am just throwing out a
> > thought.
> >
> 
> This issue comes out when chop_and_escape_str function is called with
> a non-ascii string (like my name :)) without before calling to_utf8 on
> it. "author_name" and "committer_name" are two examples, and
> "author_name" shows up with bad encoding in HTML.
> 
> Example from one of my repos (little piece from shortlog output):
> <td class="author"><span title="Füzesi Zoltán">Füzesi Zoltán</span></td>
> After applying the patch:
> <td class="author">Füzesi Zoltán</td>
> 
> This is an "old" (seen in 1.5.6 version too) and (I think) minor issue.
> I haven't spent time on thinking how a test script could show this yet.
> Waiting for Jakub's reaction.

Oh, so the problem is not only to just have correct output (for example
"Füzesi Zoltán" somewhere on HTML page produced by gitweb), but also do
not have incorrect output (for example "Füzesi Zoltán").

I think it would be better to leave t9500-gitweb-standalone-no-errors.sh
to be only about no Perl errors and no Perl warnings.  So I'd rather
have test checking if gitweb handles non US-ASCII in output correctly
in a separate test, e.g. t9501-gitweb-standalone-i18n.sh.  That would
mean extracting gitweb_init() and gitweb_run() (and perhaps also
gitweb_check_prereq() or something) into common file t/lib-gitweb.sh

We would check e.g. if "startáąend" is present in output (correct output),
and whether extracting "start[^ ]*end" produces only "startáąend" (no
incorrect output).


As for gitweb, we should make sure that everything is stored in Perl
variables and Perl structures _after_ treating with to_utf8().  This
would require some cleanup of the code, and having such test would
help to check if we didn't introduce any regressions.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-08-07 20:32 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-01  8:28 [PATCH/RFC] gitweb: parse_commit_text encoding fix Zoltán Füzesi
2009-08-01  9:21 ` Jakub Narebski
2009-08-01 16:55   ` Füzesi Zoltán
2009-08-02  7:42     ` [PATCH] " Zoltán Füzesi
2009-08-04  6:59       ` Junio C Hamano
2009-08-06  8:15         ` Zoltán Füzesi
2009-08-07 20:31           ` Jakub Narebski
2009-08-07  0:41         ` Jakub Narebski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.