All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes
@ 2011-12-17  9:22 Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin, Jakub Narebski

Sorry for resend of this series, but I forgot to generate patches in
UTF-8 instead of i18n.logoutputencoding=iso-8859-2


This is post-release resend of Jürgen patches (which were sent
during feature-freeze).

I have slightly extended commit messages, and added my ACK.

Jürgen Kreileder (3):
  gitweb: Call to_utf8() on input string in chop_and_escape_str()
  gitweb: esc_html() site name for title in OPML
  gitweb: Output valid utf8 in git_blame_common('data')

 gitweb/gitweb.perl |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str()
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

a) To fix the comparison with the chopped string,
   otherwise we compare bytes with characters, as
   chop_str() must run to_utf8() for correct operation
b) To give the title attribute correct encoding;
   we need to mark strings as UTF-8 before outpur

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index f80f259..35126cd 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1696,6 +1696,7 @@ sub chop_and_escape_str {
 	my ($str) = @_;
 
 	my $chopped = chop_str(@_);
+	$str = to_utf8($str);
 	if ($chopped eq $str) {
 		return esc_html($chopped);
 	} else {
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] gitweb: esc_html() site name for title in OPML
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

This escapes the site name in OPML (XML uses the same escaping rules
as HTML).  Also fixes encoding issues because esc_html() uses
to_utf8().

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 35126cd..dcf4658 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -7863,11 +7863,12 @@ sub git_opml {
 		-charset => 'utf-8',
 		-content_disposition => 'inline; filename="opml.xml"');
 
+	my $title = esc_html($site_name);
 	print <<XML;
 <?xml version="1.0" encoding="utf-8"?>
 <opml version="1.0">
 <head>
-  <title>$site_name OPML Export</title>
+  <title>$title OPML Export</title>
 </head>
 <body>
 <outline text="git RSS feeds">
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data')
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
  2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
  2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
@ 2011-12-17  9:22 ` Jakub Narebski
  2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
  4 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:22 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

Otherwise when javascript-actions are enabled gitweb shown broken
author names in the tooltips on blame pages ('blame_incremental'
view).

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index dcf4658..d24763b 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -6244,7 +6244,9 @@ sub git_blame_common {
 			-type=>"text/plain", -charset => "utf-8",
 			-status=> "200 OK");
 		local $| = 1; # output autoflush
-		print while <$fd>;
+		while (my $line = <$fd>) {
+			print to_utf8($line);
+		}
 		close $fd
 			or print "ERROR $!\n";
 
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
                   ` (2 preceding siblings ...)
  2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
@ 2011-12-17 19:27 ` Junio C Hamano
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
  4 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2011-12-17 19:27 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Juergen Kreileder, John Hawley, admin

Jakub Narebski <jnareb@gmail.com> writes:

> Sorry for resend of this series, but I forgot to generate patches in
> UTF-8 instead of i18n.logoutputencoding=iso-8859-2
>
>
> This is post-release resend of Jürgen patches (which were sent
> during feature-freeze).
>
> I have slightly extended commit messages, and added my ACK.
>
> Jürgen Kreileder (3):
>   gitweb: Call to_utf8() on input string in chop_and_escape_str()
>   gitweb: esc_html() site name for title in OPML
>   gitweb: Output valid utf8 in git_blame_common('data')
>
>  gitweb/gitweb.perl |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
                   ` (3 preceding siblings ...)
  2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
@ 2011-12-19  0:54 ` Jakub Narebski
  2011-12-19 12:11   ` Jakub Narebski
  4 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19  0:54 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version);
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself

        old  new
  old    -- -65%
  new  189%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding ('latin1' is Perl
native encoding), which is why it was not noticed for such long time.

utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').  For gitweb the second
option was true, as output from git commands is opened without ':utf8'
layer.

What made it work at all for STRING in 'latin1' encoding is the fact
that utf8:decode(STRING) turns on UTF-8 flag only if source string is
valid UTF-8 and contains multi-byte UTF-8 characters... and that if
string doesn't have UTF-8 flag set it is treated as in native Perl
encoding, i.e.  'latin1' / 'iso-8859-1' (unless native encoding it is
EBCDIC ;-)).  It was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8'.


Let's make use of the fact that utf8:decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Excuse me for overly long commit message...

Resent as part of to_utf8 fixes for better visibility

 gitweb/gitweb.perl |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..75b0970 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,7 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+	if (utf8::valid($str) && utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
@ 2011-12-19 12:11   ` Jakub Narebski
  2011-12-19 16:21     ` [PATCH 4/3 v2 (bugfix)] " Jakub Narebski
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19 12:11 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, Junio Hamano

Jakub Narebski wrote:

> e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
> 2007-12-04) was meant to make gitweb faster by using Perl's internals
> (see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)
> 
> Simple benchmark confirms that (old = 00f429a, new = this version);
> note that it is synthetic benchmark of standalone subroutines, not
> of gitweb itself
> 
>         old  new
>   old    -- -65%
>   new  189%   --

Nb. that was about operations / second (higher is better):

           Rate  old  new
    old  2067/s   -- -65%
    new  5863/s 184%   --

Or in slightly different benchmark (more smaller lines):

          Rate  old  new
    old  277/s   -- -73%
    new 1021/s 268%   --

 old$ time ./t9500-gitweb-standalone-no-errors.sh >/dev/null

   real       1m16.788s
   user       1m0.908s
   sys        0m14.033s
   user+sys   1m14.941s

 new$ time ./t9500-gitweb-standalone-no-errors.sh >/dev/null

    real      1m12.216s
    user      0m57.300s
    sys       0m13.329s
    user+sys  1m10.639s

Though such benchmarks should have been a part of e5d3de5.


P.S. I started to get strange errors

 XML Parsing Error: xml processing instruction not at start of external entity
 Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
 Line Number 37, Column 1:
 <?xml version="1.0" encoding="utf-8"?>
 ^

while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
is the first line.  WTF?!?

P.P.S. Now I am getting errors when running gitweb, but only in some
cases (via mod_cgi not as standalone script, only when using lynx),
namely it looks like it falls back to 'latin1' when doing content
which is valid UTF-8.

Will investigate.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 4/3 v2 (bugfix)] gitweb: Fix fallback mode of to_utf8 subroutine
  2011-12-19 12:11   ` Jakub Narebski
@ 2011-12-19 16:21     ` Jakub Narebski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-19 16:21 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, Junio Hamano

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself (where probably no visible difference in performace
will show)

        Rate  old  new
  old 1582/s   -- -64%
  new 4453/s 181%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding (because 'latin1' is
Perl native encoding), which is probably why it was not noticed for so
long.


utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').

For gitweb in most cases the second option was true, as output from
git commands is opened without ':utf8' layer.  So utf8::valid is not
useful for to_utf8.

What made it look as if to_utf8() fallback mode worked correctly
(though only for $fallback_encoding at its default value 'latin1')
was the fact that utf8::decode(STRING) turns on UTF-8 flag only if
source string^W octets form a valid UTF-8 and it contains multi-byte
UTF-8 characters... this means that if string was not valid UTF-8
it didn't get UTF-8 flag.

When string doesn't have UTF-8 flag set, it is treated as if it was in
native Perl encoding, which is 'latin1' (unless native encoding is
EBCDIC ;-)).  So it was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8', and not to_utf8()
subroutine.  Fallback mode was never triggered.


Let's make use of the fact that utf8::decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Note however that if STRING has UTF-8 flag set already, then
utf8::decode also returns false, which could cause problems if given
string was already converted with to_utf8().  Such double conversion
can happen in gitweb.  Therefore we have to check if STRING has UTF-8
flag set with utf8::is_utf8(); if this subroutine returns true then we
have already decoded (converted) string, and don't have to do it
second time.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Jakub Narebski wrote:
> P.S. I started to get strange errors
> 
>  XML Parsing Error: xml processing instruction not at start of external entity
>  Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
>  Line Number 37, Column 1:
>  <?xml version="1.0" encoding="utf-8"?>
>  ^
> 
> while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
> is the first line.  WTF?!?
> 
> P.P.S. Now I am getting errors when running gitweb, but only in some
> cases (via mod_cgi not as standalone script, only when using lynx),
> namely it looks like it falls back to 'latin1' when doing content
> which is valid UTF-8.
> 
> Will investigate.

Now it is fixed; the error was caused by to_utf8 not dealing with double
encoding for strings outside 7bit ASCII.

 gitweb/gitweb.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..fc41b07 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,8 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+
+	if (utf8::is_utf8($str) || utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] gitweb: esc_html() site name for title in OPML
  2011-12-17  9:15 [PATCH 0/3] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
@ 2011-12-17  9:15 ` Jakub Narebski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Narebski @ 2011-12-17  9:15 UTC (permalink / raw)
  To: git; +Cc: Juergen Kreileder, John Hawley, admin

From: Jürgen Kreileder <jk@blackdown.de>

This escapes the site name in OPML (XML uses the same escaping rules
as HTML).  Also fixes encoding issues because esc_html() uses
to_utf8().

Signed-off-by: Jürgen Kreileder <jk@blackdown.de>
Acked-by: Jakub Narębski <jnareb@gmail.com>
---
 gitweb/gitweb.perl |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 35126cd..dcf4658 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -7863,11 +7863,12 @@ sub git_opml {
 		-charset => 'utf-8',
 		-content_disposition => 'inline; filename="opml.xml"');
 
+	my $title = esc_html($site_name);
 	print <<XML;
 <?xml version="1.0" encoding="utf-8"?>
 <opml version="1.0">
 <head>
-  <title>$site_name OPML Export</title>
+  <title>$title OPML Export</title>
 </head>
 <body>
 <outline text="git RSS feeds">
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-12-19 16:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
2011-12-19 12:11   ` Jakub Narebski
2011-12-19 16:21     ` [PATCH 4/3 v2 (bugfix)] " Jakub Narebski
  -- strict thread matches above, loose matches on Subject: below --
2011-12-17  9:15 [PATCH 0/3] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:15 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.