All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: Juergen Kreileder <jk@blackdown.de>, Junio Hamano <gitster@pobox.com>
Subject: [PATCH 4/3 v2 (bugfix)] gitweb: Fix fallback mode of to_utf8 subroutine
Date: Mon, 19 Dec 2011 17:21:17 +0100	[thread overview]
Message-ID: <201112191721.17478.jnareb@gmail.com> (raw)
In-Reply-To: <201112191311.58787.jnareb@gmail.com>

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself (where probably no visible difference in performace
will show)

        Rate  old  new
  old 1582/s   -- -64%
  new 4453/s 181%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding (because 'latin1' is
Perl native encoding), which is probably why it was not noticed for so
long.


utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').

For gitweb in most cases the second option was true, as output from
git commands is opened without ':utf8' layer.  So utf8::valid is not
useful for to_utf8.

What made it look as if to_utf8() fallback mode worked correctly
(though only for $fallback_encoding at its default value 'latin1')
was the fact that utf8::decode(STRING) turns on UTF-8 flag only if
source string^W octets form a valid UTF-8 and it contains multi-byte
UTF-8 characters... this means that if string was not valid UTF-8
it didn't get UTF-8 flag.

When string doesn't have UTF-8 flag set, it is treated as if it was in
native Perl encoding, which is 'latin1' (unless native encoding is
EBCDIC ;-)).  So it was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8', and not to_utf8()
subroutine.  Fallback mode was never triggered.


Let's make use of the fact that utf8::decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Note however that if STRING has UTF-8 flag set already, then
utf8::decode also returns false, which could cause problems if given
string was already converted with to_utf8().  Such double conversion
can happen in gitweb.  Therefore we have to check if STRING has UTF-8
flag set with utf8::is_utf8(); if this subroutine returns true then we
have already decoded (converted) string, and don't have to do it
second time.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Jakub Narebski wrote:
> P.S. I started to get strange errors
> 
>  XML Parsing Error: xml processing instruction not at start of external entity
>  Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
>  Line Number 37, Column 1:
>  <?xml version="1.0" encoding="utf-8"?>
>  ^
> 
> while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
> is the first line.  WTF?!?
> 
> P.P.S. Now I am getting errors when running gitweb, but only in some
> cases (via mod_cgi not as standalone script, only when using lynx),
> namely it looks like it falls back to 'latin1' when doing content
> which is valid UTF-8.
> 
> Will investigate.

Now it is fixed; the error was caused by to_utf8 not dealing with double
encoding for strings outside 7bit ASCII.

 gitweb/gitweb.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..fc41b07 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,8 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+
+	if (utf8::is_utf8($str) || utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

      reply	other threads:[~2011-12-19 16:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
2011-12-19 12:11   ` Jakub Narebski
2011-12-19 16:21     ` Jakub Narebski [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201112191721.17478.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jk@blackdown.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.