* [PATCH 1/2] mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset and do conversion alone
@ 2011-04-16 20:49 ZHANG, Le
2011-04-18 19:36 ` Junio C Hamano
0 siblings, 1 reply; 2+ messages in thread
From: ZHANG, Le @ 2011-04-16 20:49 UTC (permalink / raw)
To: git; +Cc: ZHANG, Le
The convert_to_utf8() function actually converts to whatever charset
"metainfo_charset" variable contains, which is not necessarily UTF-8.
Rename it to convert_to(), and give an extra parameter "to_charset" to
specify what charset to re-encode to. Also rename its "charset"
parameter to "from_charset" to clarify which is which.
Also convert_to_utf8() tries to guess what the src charset is if it is
not specified. And the guess_charset() function does not do exactly what
its name says. So make convert_to() only do the conversion. Make a new
function guess_and_convert_to().
Signed-off-by: ZHANG, Le <r0bertz@gentoo.org>
---
builtin/mailinfo.c | 44 ++++++++++++++++++++------------------------
1 files changed, 20 insertions(+), 24 deletions(-)
diff --git a/builtin/mailinfo.c b/builtin/mailinfo.c
index 71e6262..0f42ff1 100644
--- a/builtin/mailinfo.c
+++ b/builtin/mailinfo.c
@@ -472,6 +472,20 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
return out;
}
+static void convert_to(struct strbuf *line, const char *to_charset, const char *from_charset)
+{
+ char *out;
+
+ if (!strcasecmp(to_charset, from_charset))
+ return;
+
+ out = reencode_string(line->buf, to_charset, from_charset);
+ if (!out)
+ die("cannot convert from %s to %s",
+ from_charset, to_charset);
+ strbuf_attach(line, out, strlen(out), strlen(out));
+}
+
/*
* When there is no known charset, guess.
*
@@ -483,32 +497,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
* Otherwise, we default to assuming it is Latin1 for historical
* reasons.
*/
-static const char *guess_charset(const struct strbuf *line, const char *target_charset)
+static void guess_and_convert_to(struct strbuf *line, const char *to_charset)
{
- if (is_encoding_utf8(target_charset)) {
+ if (is_encoding_utf8(to_charset)) {
if (is_utf8(line->buf))
- return NULL;
- }
- return "ISO8859-1";
-}
-
-static void convert_to_utf8(struct strbuf *line, const char *charset)
-{
- char *out;
-
- if (!charset || !*charset) {
- charset = guess_charset(line, metainfo_charset);
- if (!charset)
return;
}
- if (!strcasecmp(metainfo_charset, charset))
- return;
- out = reencode_string(line->buf, metainfo_charset, charset);
- if (!out)
- die("cannot convert from %s to %s",
- charset, metainfo_charset);
- strbuf_attach(line, out, strlen(out), strlen(out));
+ convert_to(line, to_charset, "ISO8859-1");
}
static int decode_header_bq(struct strbuf *it)
@@ -577,7 +573,7 @@ static int decode_header_bq(struct strbuf *it)
break;
}
if (metainfo_charset)
- convert_to_utf8(dec, charset_q.buf);
+ convert_to(dec, metainfo_charset, charset_q.buf);
strbuf_addbuf(&outbuf, dec);
strbuf_release(dec);
@@ -602,7 +598,7 @@ static void decode_header(struct strbuf *it)
* This can be binary guck but there is no charset specified.
*/
if (metainfo_charset)
- convert_to_utf8(it, "");
+ guess_and_convert_to(it, metainfo_charset);
}
static void decode_transfer_encoding(struct strbuf *line)
@@ -796,7 +792,7 @@ static int handle_commit_msg(struct strbuf *line)
/* normalize the log message to UTF-8. */
if (metainfo_charset)
- convert_to_utf8(line, charset.buf);
+ convert_to(line, metainfo_charset, charset.buf);
if (use_scissors && is_scissors_line(line)) {
int i;
--
1.7.5.rc2.5.gb2ee76.dirty
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH 1/2] mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset and do conversion alone
2011-04-16 20:49 [PATCH 1/2] mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset and do conversion alone ZHANG, Le
@ 2011-04-18 19:36 ` Junio C Hamano
0 siblings, 0 replies; 2+ messages in thread
From: Junio C Hamano @ 2011-04-18 19:36 UTC (permalink / raw)
To: ZHANG, Le; +Cc: git
[-- Attachment #1: Type: text/plain, Size: 3210 bytes --]
"ZHANG, Le" <r0bertz@gentoo.org> writes:
> And the guess_charset() function does not do exactly what
> its name says.
Really? See below.
> @@ -483,32 +497,14 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
> * Otherwise, we default to assuming it is Latin1 for historical
> * reasons.
> */
> -static const char *guess_charset(const struct strbuf *line, const char *target_charset)
> +static void guess_and_convert_to(struct strbuf *line, const char *to_charset)
> {
> + if (is_encoding_utf8(to_charset)) {
> if (is_utf8(line->buf))
> return;
> }
>
> + convert_to(line, to_charset, "ISO8859-1");
Broken indent.
I have to wonder if this helper should be inlined into its single caller,
i.e.
if (metainfo_charset) {
if (is_encoding_utf8(metainfo_charset) && is_utf8(it->buf))
; /* nothing to be done */
else
convert_to(it, metainfo_charset, "ISO8859-1");
}
The decode_header() codepath is the only place that we have to handle a
possible binary guck without an explicit "this piece of text is in that
encoding" available in the message, and we guess by "if the data looks
like utf-8, we treat it as such, otherwise we assume it is 8859-1 which
was historically popular". All the other codepaths we should know what
encoding the incoming data is in.
Having said all that, I do not think your patch is correct.
- Does the code with your patch work correctly when the incoming data is
pre-MIME and does not specify charset, in which case charset.buf may be
an empty string?
- When the incoming data does not say in what encoding it is in, our
intended behaviour is to inspect each line, and if it looks like utf-8
then assume it is in utf-8, otherwise assume it is in 8859-1. And then
we convert it to whatever encoding the repository wants, often utf-8
(coming from metainfo_charset, suitable for recording commit log
messages).
We might want to change this heuristic in the future, but I do not see
a need for doing so right now (Cf. b59d398: Do a better job at guessing
unknown character sets, 2007-07-17).
I do think the guess_and_convert_to() does not implement that intended
logic correctly. When we are _not_ encoding to UTF-8, we do not even
bother to inspect the data to guess if it is UTF-8. Shouldn't it be
more like (modulo "NULL implies utf-8"):
if (is_utf8(it->buf))
from_charset = "utf-8";
else
from_charset = "ISO8859-1";
if (is_encoding_utf8(metainfo_charset) && !strcmp(from_charset, "utf-8"))
; /* nothing to do */
else if (strcasecmp(to_charset, from_charset))
convert_to(it, metainfo_charset, from_charset);
- I don't think the commit message part is handled correctly anymore with
your patch. When you want UTF-8 commit log message (metainfo_charset is
set to utf-8), and when the incoming data does not have its charset
specified, we should be doing the same "guess line-by-line" conversion.
You seem to have lost that with this patch, which would be a grave
regression.
Please apply the attached patch that adds a test at the end of t5100 and
make sure the test passes to prevent that regression from happening.
Thanks.
[-- Attachment #2: addition to t5100 --]
[-- Type: application/octet-stream, Size: 1901 bytes --]
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2011-04-18 19:36 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-04-16 20:49 [PATCH 1/2] mailinfo.c: Allow convert_to_utf8() to specify both src/dst charset and do conversion alone ZHANG, Le
2011-04-18 19:36 ` Junio C Hamano
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.