All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes
@ 2011-12-17  9:22 Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin, Jakub Narebski

Sorry for resend of this series, but I forgot to generate patches in
UTF-8 instead of i18n.logoutputencoding=iso-8859-2


This is post-release resend of Jürgen patches (which were sent
during feature-freeze).

I have slightly extended commit messages, and added my ACK.

Jürgen Kreileder (3):
  gitweb: Call to_utf8() on input string in chop_and_escape_str()
  gitweb: esc_html() site name for title in OPML
  gitweb: Output valid utf8 in git_blame_common('data')

 gitweb/gitweb.perl |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str()
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

a) To fix the comparison with the chopped string,
   otherwise we compare bytes with characters, as
   chop_str() must run to_utf8() for correct operation
b) To give the title attribute correct encoding;
   we need to mark strings as UTF-8 before outpur

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index f80f259..35126cd 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1696,6 +1696,7 @@ sub chop_and_escape_str {
 	my ($str) = @_;
 
 	my $chopped = chop_str(@_);
+	$str = to_utf8($str);
 	if ($chopped eq $str) {
 		return esc_html($chopped);
 	} else {
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] gitweb: esc_html() site name for title in OPML
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

This escapes the site name in OPML (XML uses the same escaping rules
as HTML).  Also fixes encoding issues because esc_html() uses
to_utf8().

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 35126cd..dcf4658 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -7863,11 +7863,12 @@ sub git_opml {
 		-charset => 'utf-8',
 		-content_disposition => 'inline; filename="opml.xml"');
 
+	my $title = esc_html($site_name);
 	print <<XML;
 <?xml version="1.0" encoding="utf-8"?>
 <opml version="1.0">
 <head>
-  <title>$site_name OPML Export</title>
+  <title>$title OPML Export</title>
 </head>
 <body>
 <outline text="git RSS feeds">
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data')
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
  2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

Otherwise when javascript-actions are enabled gitweb shown broken
author names in the tooltips on blame pages ('blame_incremental'
view).

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index dcf4658..d24763b 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -6244,7 +6244,9 @@ sub git_blame_common {
 			-type=>"text/plain", -charset => "utf-8",
 			-status=> "200 OK");
 		local $| = 1; # output autoflush
-		print while <$fd>;
+		while (my $line = <$fd>) {
+			print to_utf8($line);
+		}
 		close $fd
 			or print "ERROR $!\n";
 
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
                   ` (2 preceding siblings ...)
  2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
@ 2011-12-17 19:27 ` Junio C Hamano
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
  4 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2011-12-17 19:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Juergen Kreileder, John Hawley, admin

Jakub Narebski <jnareb@gmail.com> writes:

> Sorry for resend of this series, but I forgot to generate patches in
> UTF-8 instead of i18n.logoutputencoding=iso-8859-2
>
>
> This is post-release resend of Jürgen patches (which were sent
> during feature-freeze).
>
> I have slightly extended commit messages, and added my ACK.
>
> Jürgen Kreileder (3):
>   gitweb: Call to_utf8() on input string in chop_and_escape_str()
>   gitweb: esc_html() site name for title in OPML
>   gitweb: Output valid utf8 in git_blame_common('data')
>
>  gitweb/gitweb.perl |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
                   ` (3 preceding siblings ...)
  2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
@ 2011-12-19  0:54 ` Jakub Narebski
  2011-12-19 12:11   ` Jakub Narebski
  4 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19  0:54 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version);
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself

        old  new
  old    -- -65%
  new  189%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding ('latin1' is Perl
native encoding), which is why it was not noticed for such long time.

utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').  For gitweb the second
option was true, as output from git commands is opened without ':utf8'
layer.

What made it work at all for STRING in 'latin1' encoding is the fact
that utf8:decode(STRING) turns on UTF-8 flag only if source string is
valid UTF-8 and contains multi-byte UTF-8 characters... and that if
string doesn't have UTF-8 flag set it is treated as in native Perl
encoding, i.e.  'latin1' / 'iso-8859-1' (unless native encoding it is
EBCDIC ;-)).  It was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8'.


Let's make use of the fact that utf8:decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Excuse me for overly long commit message...

Resent as part of to_utf8 fixes for better visibility

 gitweb/gitweb.perl |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..75b0970 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,7 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+	if (utf8::valid($str) && utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
@ 2011-12-19 12:11   ` Jakub Narebski
  2011-12-19 16:21     ` [PATCH 4/3 v2 (bugfix)] " Jakub Narebski
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19 12:11 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, Junio Hamano

Jakub Narebski wrote:

> e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
> 2007-12-04) was meant to make gitweb faster by using Perl's internals
> (see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)
> 
> Simple benchmark confirms that (old = 00f429a, new = this version);
> note that it is synthetic benchmark of standalone subroutines, not
> of gitweb itself
> 
>         old  new
>   old    -- -65%
>   new  189%   --

Nb. that was about operations / second (higher is better):

           Rate  old  new
    old  2067/s   -- -65%
    new  5863/s 184%   --

Or in slightly different benchmark (more smaller lines):

          Rate  old  new
    old  277/s   -- -73%
    new 1021/s 268%   --

 old$ time ./t9500-gitweb-standalone-no-errors.sh >/dev/null

   real       1m16.788s
   user       1m0.908s
   sys        0m14.033s
   user+sys   1m14.941s

 new$ time ./t9500-gitweb-standalone-no-errors.sh >/dev/null

    real      1m12.216s
    user      0m57.300s
    sys       0m13.329s
    user+sys  1m10.639s

Though such benchmarks should have been a part of e5d3de5.


P.S. I started to get strange errors

 XML Parsing Error: xml processing instruction not at start of external entity
 Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
 Line Number 37, Column 1:
 <?xml version="1.0" encoding="utf-8"?>
 ^

while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
is the first line.  WTF?!?

P.P.S. Now I am getting errors when running gitweb, but only in some
cases (via mod_cgi not as standalone script, only when using lynx),
namely it looks like it falls back to 'latin1' when doing content
which is valid UTF-8.

Will investigate.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 4/3 v2 (bugfix)] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-19 12:11   ` Jakub Narebski
@ 2011-12-19 16:21     ` Jakub Narebski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19 16:21 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, Junio Hamano

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself (where probably no visible difference in performace
will show)

        Rate  old  new
  old 1582/s   -- -64%
  new 4453/s 181%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding (because 'latin1' is
Perl native encoding), which is probably why it was not noticed for so
long.


utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').

For gitweb in most cases the second option was true, as output from
git commands is opened without ':utf8' layer.  So utf8::valid is not
useful for to_utf8.

What made it look as if to_utf8() fallback mode worked correctly
(though only for $fallback_encoding at its default value 'latin1')
was the fact that utf8::decode(STRING) turns on UTF-8 flag only if
source string^W octets form a valid UTF-8 and it contains multi-byte
UTF-8 characters... this means that if string was not valid UTF-8
it didn't get UTF-8 flag.

When string doesn't have UTF-8 flag set, it is treated as if it was in
native Perl encoding, which is 'latin1' (unless native encoding is
EBCDIC ;-)).  So it was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8', and not to_utf8()
subroutine.  Fallback mode was never triggered.


Let's make use of the fact that utf8::decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Note however that if STRING has UTF-8 flag set already, then
utf8::decode also returns false, which could cause problems if given
string was already converted with to_utf8().  Such double conversion
can happen in gitweb.  Therefore we have to check if STRING has UTF-8
flag set with utf8::is_utf8(); if this subroutine returns true then we
have already decoded (converted) string, and don't have to do it
second time.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Jakub Narebski wrote:
> P.S. I started to get strange errors
> 
>  XML Parsing Error: xml processing instruction not at start of external entity
>  Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
>  Line Number 37, Column 1:
>  <?xml version="1.0" encoding="utf-8"?>
>  ^
> 
> while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
> is the first line.  WTF?!?
> 
> P.P.S. Now I am getting errors when running gitweb, but only in some
> cases (via mod_cgi not as standalone script, only when using lynx),
> namely it looks like it falls back to 'latin1' when doing content
> which is valid UTF-8.
> 
> Will investigate.

Now it is fixed; the error was caused by to_utf8 not dealing with double
encoding for strings outside 7bit ASCII.

 gitweb/gitweb.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..fc41b07 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,8 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+
+	if (utf8::is_utf8($str) || utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str()
  2011-12-17  9:15 [PATCH 0/3] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
@ 2011-12-17  9:15 ` Jakub Narebski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:15 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

a) To fix the comparison with the chopped string,
   otherwise we compare bytes with characters, as
   chop_str() must run to_utf8() for correct operation
b) To give the title attribute correct encoding;
   we need to mark strings as UTF-8 before outpur

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index f80f259..35126cd 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1696,6 +1696,7 @@ sub chop_and_escape_str {
 	my ($str) = @_;
 
 	my $chopped = chop_str(@_);
+	$str = to_utf8($str);
 	if ($chopped eq $str) {
 		return esc_html($chopped);
 	} else {
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-12-19 16:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
2011-12-19 12:11   ` Jakub Narebski
2011-12-19 16:21     ` [PATCH 4/3 v2 (bugfix)] " Jakub Narebski
  -- strict thread matches above, loose matches on Subject: below --
2011-12-17  9:15 [PATCH 0/3] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:15 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.